INOM EXAMENSARBETE TEKNIK, GRUNDNIVÅ, 15 HP , STOCKHOLM SVERIGE 2018
Implementing Bayesian
phylogenetic tree inference with
Sequential Monte Carlo and the
Phylogenetic Likelihood Library
ISAAC ARVESTAD
HENRIK LAGEBRAND
KTH
Implementing Bayesian
phylogenetic tree inference
with Sequential Monte Carlo
and the Phylogenetic
Likelihood Library
ISAAC ARVESTAD & HENRIK LAGEBRAND
Bachelor in Computer Science Date: June 3, 2018
Supervisor: Jens Lagergren Examiner: Örjan Ekeberg
Swedish title: Implementering av fylogenetisk trädinferens med Sekvenentiell Monte Carlo och programspråksbiblioteket Phylogenetic Likelihood Library
iii
Abstract
We investigate the usability of the Phylogenetic Likelihood Library (PLL) in Bayesian phylogenetic tree inference using Sequential Monte Carlo (SMC) algorithms. This is done by implementing two different versions of the same algorithm with two different approaches of the use of PLL. The implementation using the main PLL API encountered performance issues that the lower level implementation did not. We conclude that it is possible to use PLL in SMC methods but it is unclear if the main API is suitable.
iv
Sammanfattning
Vi undersöker om programspråksbiblioteket Phylogenetic Likelihood Lib-rary (PLL) kan användas för bayesiansk inferens av phylogenetiska träd med en sekventiell Monte Carlo-metod (SMC). Genom att imple-mentera algoritmen med två olika delar av PLL:s programmerings-gränssnitt visar vi att det går att använda PLL för att implementera SMC-algoritmen men att det är oklart om det huvudsakliga program-meringsgränssnittet är lämpligt.
Contents
1 Introduction 1 1.1 Problem statement . . . 2 1.2 Scope . . . 2 1.3 Outline . . . 2 2 Background 3 2.1 Phylogenetic trees . . . 42.2 Hidden Markov models . . . 5
2.2.1 Definition . . . 5
2.2.2 Filtering . . . 6
2.3 Sequential Monte Carlo . . . 6
2.3.1 Importance Sampling . . . 7
2.3.2 Sequential Importance Sampling . . . 8
2.3.3 Resampling . . . 9
2.3.4 Algorithm . . . 9
2.3.5 Applications . . . 10
2.4 PosetSMC . . . 10
2.4.1 Proposals and weights . . . 10
2.5 Phylogenetic Likelihood Library . . . 11
2.5.1 The main API . . . 11
2.5.2 Core functions . . . 12 3 Methods 13 3.1 Implementation . . . 13 3.2 Experiments . . . 13 3.2.1 Comparing pll-smc implementations . . . 14 4 Results 15 v
vi CONTENTS
5 Discussion 19
5.1 Implementation comparison . . . 19 5.2 Potential problems and limitations . . . 20 5.3 Future work . . . 21
6 Conclusion 22
Bibliography 23
Chapter 1
Introduction
Darwin’s theory of evolution explains how organisms change over generations. He exemplified this by comparing morphological fea-tures of Galápagos finches. Comparisons like these can be used to construct a phylogeny describing how a set of species have evolved. Species with similar features are more likely to be closely related than those with less similar features. However, comparing morphological features has downsides. For example, species with different ances-tors can evolve similar features independently of each other. Biologist Willi Hennig made the case for analysing phylogenies in terms of her-itable genetic traits, known as phylogenetics.[6] A tree based on such an analysis displaying a phylogeny is a phylogenetic tree.
We will in this paper discuss phylogenetic inference, the topic of re-constructing a phylogenetic tree from a set of related genetic sequences. Constructing a perfect tree is computationally hard for anything more than a few sequences but there are several methods for approximating these trees. One method, PosetSMC, relies on Bayesian inference to approximate a distribution of trees using Sequential Monte Carlo (SMC). One of its main advantages compared to other Bayesian inference meth-ods is that it is easy to parallelize. Speed is important with phyloge-netic inference since the amount of sequenced data available is grow-ing faster than the computational power available.[2]
Phylogenetic inference methods are difficult to implement. The algorithms modelling genetic evolution are non-trivial and can take time to implement properly.[5] One software library which aims to solve these algorithmic tasks is the Phylogenetic Likelihood Library (PLL). PLL provides data structures and algorithms related to constructing
2 CHAPTER 1. INTRODUCTION
and analysing phylogenetic trees quickly using low-level optimisa-tions such as SIMD instrucoptimisa-tions. One of the applicaoptimisa-tions Flouri et al. [5] shows is using PLL with a Markov Chain Monte Carlo (MCMC) algorithm. An application which is interesting to investigate is if PLL could also be used for SMC algorithms.
We will in this thesis investigate how suitable PLL is as a software library for implementing SMC tree inference methods. We approach this problem by implementing two versions of the PosetSMC algo-rithm which use PLL in two different ways. This allows us to inves-tigate if different parts of PLL are more or less suitable for SMC tree inference methods.
1.1 Problem statement
Can PLL be used to implement the SMC tree inference method PosetSMC described by Bouchard-Côté, Sankararaman, and Jordan [2]?
1.2 Scope
We limit scope of the project by only inferring coalescent phylogenetic trees using the PriorPrior proposal used in PosetSMC and defined by Teh, Daume III, and Roy [13].
1.3 Outline
In Section (2) SMC, phylogenetic trees and PLL are explained. This is followed by Section (3) describing how the study was performed. Finally, the results are presented (4) and discussed (5).
Chapter 2
Background
PosetSMC is an implementation of a particle filter which uses SMC to approximate a phylogenetic tree. This background aims to introduce the concepts required to understand and implement a program similar to PosetSMC using PLL.
At the most basic level, Bayesian phylogenetics builds on hidden Markov models (HMM) where the trees are the hidden states and the given sequences are the observations. Filtering builds on HMM and models how states can be inferred based on previous observations. Finally, particle filters provide a technique for approximating a filter. In the case of PosetSMC, particle filtering is done using SMC.
The goal of Bayesian phylogenetic inference is to approximate some distribution of phylogenetic trees ⇡ known as the posterior distribu-tion.[3] The approximation b⇡ is a discrete distribution of phylogenetic trees where each tree has an associated weight. One advantage of com-puting an entire distribution instead of approximating a single tree is that the distribution can be used to answer multiple questions. An example given in Bouchard-Côté [1] is the probability that a certain clade1exists for a set of sequences:
K
X
k=1
wkf (tk) (2.1)
where wkis the normalized weight of tree k and
f (tk) =
(
1, if clade is part of tk
0, otherwise (2.2)
1Maximal subtree covering a set of leaves
4 CHAPTER 2. BACKGROUND
We will begin by briefly describing phylogenetic trees. Next we describe HMM, filtering, SMC and particle filtering based on Doucet and Johansen [3]. Finally we describe PosetSMC and PLL.
2.1 Phylogenetic trees
Phylogenetic trees represent evolutionary relationship. The leaves of the tree represent that which has evolved such as the genetic sequences of a set of species and the branches represent evolutionary distance.[9] Figure (2.1) shows a phylogenetic tree with four leaves {L1, L2, L3, L4} with distance represented as horizontal lines.
Figure 2.1: A phylogenetic tree rendered using FigTree[4]
0.4
L2 L0 L3
L1
Coalescent trees, the focus of this paper, are a particular type of tree where the time between evolutionary events is exponentially dis-tributed. Every event represents two descendants coalescing into an an-cestor. Each generation every member is equally likely to coalesce.[13]
CHAPTER 2. BACKGROUND 5
2.2 Hidden Markov models
2.2.1 Definition
HMM is defined by a “X -valued discrete-time Markov process {Xn}n 1”[3]
which is described by two equations:
X1 ⇠ µ(x1)and Xn|(Xn 1= xn 1)⇠ f(xn|xn 1) (2.3)
Yn|(Xn= xn)⇠ g(yn|xn) (2.4)
where µ(x1)is a density function describing the initial state, f(x|x0)
describes a transition from x0 to x and g(y|x) describes how the
Y-valued variable y is distributed given that x has occured. X represents the hidden data while the statistically independent values Y represent the observations.
Using (2.3) and (2.4) Doucet and Johansen [3] defines two probabil-ity functions. The prior distribution is defined as the probabilprobabil-ity that a certain sequence of states x1:noccurs:
p(x1:n) = µ(x1) n
Y
k=2
f (xk|xk 1) (2.5)
The likelihood function is defined as the probability that a sequence of observations y1:noccurs given a specific sequence of states x1:n.
p(y1:n|x1:n) = n
Y
k=1
g(yk|xk) (2.6)
Using Bayes’ theorem Doucet and Johansen [3] define the probabil-ity of a sequence x1:noccurring given a sequence of observations y1:n,
known as the posterior distribution, as p(x1:n|y1:n) =
p(x1:n)p(y1:n|x1:n)
p(y1:n)
(2.7) where p(y1:n)is known as the marginal likelihood and
p(y1:n) =
Z
6 CHAPTER 2. BACKGROUND
2.2.2 Filtering
Filtering is the “[problem of] characterising the distribution of the state of the hidden Markov model at the present time, given the information provided by all of the observations received up to the present time.”[3] It provides a way to sequentially approximate the posterior distribu-tion (2.7) for each time t 2 {1 . . . n} by defining the posterior distri-bution recursively. This requires computing the product of the prior distribution (2.5) and likelihood function (2.6) as well as approximat-ing the marginal likelihood (2.8):
p(x1:n|y1:n) = p(x1:n 1|y1:n 1) f (xn|xn 1)g(yn|xn) p(yn|y1:n 1) (2.9) where p(yn|y1:n 1) = Z p(xn 1|y1:n 1)f (xn|xn 1)g(yn|xn) dxn 1:n (2.10)
Doucet and Johansen also provide a recursive definition for the marginal likelihood: p(y1:n) = p(y1) n Y k=2 p(yk|y1:k 1) (2.11)
2.3 Sequential Monte Carlo
Sequential Monte Carlo methods are a type of Monte Carlo methods which are used to approximate a “target probability distribution”[3] defined as ⇡n(x1:n) = n (x1:n) Zn (2.12) where Zn = Z n(x1:n) dx1:n (2.13)
Doucet and Johansen [3] describes the basic Monte Carlo method approximating ⇡n(x1:n)as sampling X1:ni ⇠ ⇡n(x1:n)for i 2 {1, . . . , N}.
Each Xi is known as a particle which is updated as n increases. The
CHAPTER 2. BACKGROUND 7 b⇡n(x1:n) = 1 N N X i=1 Xi 1:n(x1:n) (2.14) where Xi 1:n(x1:n) = ( 1, Xi 1:n = x1:n 0, otherwise (2.15)
The obvious problem with this approach is that the distribution ⇡ in many scenarios is not known. For example, in the context of phy-logenetic inference, we wish to approximate the distribution of phylo-genetic trees for a given set of sequences. This distribution of trees is not known and thus we cannot sample from it. To solve this problem Doucet and Johansen [3] describe three methods which incrementally build on each other and together compose SMC: Importance Sampling, Sequential Importance Sampling and Resampling.
2.3.1 Importance Sampling
Importance sampling (IS) introduces a density function qn(x1:n)where
⇡n(x1:n) > 0 =) qn(x1:n) > 0 (2.16)
This requirement formulates the notion that if some sequence of states x1:ncan be sampled from ⇡nit must also be possible for the same
sequence to be sampled from qn. It does not mean they have to be
equally probable, but if a sequence is at all probable in ⇡ it must also be probable in q. This is necessary to guarantee that b⇡ is asymptotically correct.
To adjust for the fact that ⇡n 6= qn a weight wn is introduced such
that ⇡n(x1:n) = wn(x1:n)qn(x1:n) Zn (2.17) where Zn = Z wn(x1:n)qn(x1:n) dx1:n (2.18)
8 CHAPTER 2. BACKGROUND
wn(x1:n) = n
(x1:n)
qn(x1:n)
(2.19) Adding the weights also requires changing the Monte Carlo ap-proximation which is now written as
b⇡n(x1:n) = N X i=1 Wn Xi i 1:n(x1:n) (2.20) where Wni = wn(X i 1:n) PN j=1wn(X j 1:n) (2.21) One problem with this method which Doucet and Johansen [3] bring up is that qn(x1:n), although easy to sample from, is
computa-tionally expensive since for each time step an entirely new sequence x1:n has to be sampled. If we want to approximate ⇡t(x1:t) for each
t 2 {1 . . . n} this would result in a computational complexity of O(n) for each step i and O(n2)overall.
2.3.2 Sequential Importance Sampling
To achieve linear time complexity, Doucet and Johansen [3] present a solution where qn(x1:n)is defined recursively as
qn(x1:n) = qn 1(x1:n 1)qn(xn|x1:n 1) (2.22)
and correspondingly, wnis recursively defined as
wn(x1:n) = n (x1:n) qn(x1:n) (2.23) = n 1(x1:n 1) qn 1(x1:n 1) n(x1:n) n 1(x1:n 1)qn(xn|x1:n 1) (2.24) Defining the distribution recursively allows for reusing previously sampled data. To sample from qn(x1:n)we sample Xni ⇠ qn(xn|X1:n 1i )
where Xi
1:n 1 ⇠ qn 1(x1:n 1)and X1i ⇠ q1(x1).
The issue with SIS (and IS) is that the variance of SIS “increases exponentially with n”[3]. With a large variance a large number of par-ticles is needed to compute a decent approximation of ⇡n. This is a
CHAPTER 2. BACKGROUND 9
Bouchard-Côté, Sankararaman, and Jordan [2] describes how this results in “a large weight imbalance [...] called particle degeneracy”[2] where the majority of particles have low weights. It would be better to spend the limited computing power on higher weighted particles; this is where resampling comes in.
2.3.3 Resampling
Resampling creates a distribution ⇡nby sampling N times from the
ap-proximated posterior b⇡n.[3] Particles with high weights are more likely
to be sampled multiple times and those with low weight might not be sampled at all. ⇡n(x1:n) = 1 N N X i=1 Nn Xi i 1:n(x1:n) (2.25) where Nni ⇠ multinomial distribution (N, Wni), i2 {1, . . . , N} (2.26) Note that there are multiple ways of sampling Ni
n. Doucet and
Jo-hansen [3] also bring up systematic- and residual resampling.
2.3.4 Algorithm
Now that the three concepts IS, SIS and resampling have been defined, Doucet and Johansen [3] combine them into an algorithm.
1. Begin by sampling N times from the density function q X1:ni ⇠ ( q1(x1), n = 1 qn(xn|X i 1:n 1), n > 1 , i2 {1 . . . N} (2.27)
2. Compute a weight w for each sample Xi 1:n win(X1:ni ) = ( 1/N, n = 1 wn 1(X1:n 1i ) n(X1:ni ) n 1(X1:n 1i )qn(Xni|X1:n 1i ), n > 1 (2.28)
3. Compute normalized weights Wi
nusing (2.21)
4. Resample Xi1:n ⇠ b⇡n, where X i
10 CHAPTER 2. BACKGROUND
2.3.5 Applications
By choosing definitions for and Z, SMC can used for multiple pur-poses. For example, n(x1:n) = p(x1:n)p(y1:n|x1:n)and Zn= p(y1:n)gives
⇡n(x1:n) = n (x1:n) Zn = p(x1:n)p(y1:n|x1:n) p(y1:n) (2.29) which is the posterior distribution (2.12). We can thus approximate the posterior distribution by choosing a density function qnwhich gives
decent proposals and iterating the steps defined in section (2.3.4). This particular application is known as particle filtering.
2.4 PosetSMC
PosetSMC is a particle filter which introduces a proposal density q, weight function w and density defined on the space S. Given a set of genetic sequences it will infer a phylogenetic tree distribution. A parti-cle represents a forest containing all given genetic sequences as leaves and the space S spans all the possible forests covering the leaves.
The goal of PosetSMC is to make proposals on it’s particles until it has created a single tree from the forest. Each iteration it resamples using multinomial resampling. PosetSMC discusses two proposals for coalescent trees, PriorPrior and PriorPost, which were originally pre-sented in Teh, Daume III, and Roy [13] and found that “Surprisingly [. . . ], PriorPrior outperforms the more complicated PriorPost by one order of magnitude.”[2]
2.4.1 Proposals and weights
PriorPrior suggests two things when making a proposal:
1. A pair of trees (t1, t2)in the forest s 2 S to merge by connecting
their rootes to a new node, where t1, t2 2 s. Each pair of trees is
equally likely to be selected.[13]
2. The height of the new root as the sum of the previously tallest node and a height increment sampled from an exponential dis-tribution.[13, 2]
CHAPTER 2. BACKGROUND 11
Using this method for defining the proposal distribution q, Bouchard-Côté, Sankararaman, and Jordan [2] define w and for coalescent trees. Inserting this definition in equation (2.28) gives
wni = win 1(X1:n 1i ) n(X i 1:n) n 1(X1:n 1i )qn(Xni|X1:n 1i ) (2.30) = wi n 1(X1:n 1i ) n( (X1:ni ))Lm LlLrqn(Xni|X1:n 1i ) (2.31) / wi n 1(X1:n 1i ) Lm LlLr (2.32) where Lm, Ll, Lr is the likelihood of the merged tree, left tree and
right tree respectively in the particle Xi
1:n on the leaves that each
re-spective tree covers and n( (X1:ni )) =Expd( (X1:ni ), l n+1
2 )is the
ex-ponential density for the height increment of the particle Xi
1:n, where
lis the number of leaves.[2]
2.5 Phylogenetic Likelihood Library
The Phylogenetic Likelihood Library is “highly optimized application programming interface for developing likelihood-based phylogenetic inference and post-analysis software”, developed by Flouri et al. [5]. The PLL has functions for parsing input files of different formats for “sequence data [. . . and] likelihood evaluation”[5] as well as data struc-tures for representing phylogenetic trees.[5] Of particular interest to this paper is the PLL’s functionality for computing the likelihood of a phylogenetic tree since this is required when computing particle weights in (2.32).
We identified two different methods for computing likelihoods us-ing PLL. The first alternative is usus-ing PLL’s recommended and doc-umented methods. [5][14] The second is using PLL’s core functions which are exposed by the library but are less documented.
2.5.1 The main API
PLL defines a data structure called the partition instance. It keeps track of evolutionary model parameters and buffers for storing intermedi-ate computations. It also defines the order that the tree should be
12 CHAPTER 2. BACKGROUND
traversed when computing the likelihood of the tree and is central to PLL’s main API functionality.
Calculating the likelihood of the tree involves a series of steps: 1. Create a partition instance by specifying the number of tree nodes,
edges and other attributes.
2. Specify the evolutionary model parameters, for example the nu-cleotide frequencies.
3. Convert the input tree into a vector of operations specifying the order that the tree should be traversed and update the partition instance.
4. Compute the likelihood of the root.
2.5.2 Core functions
PLL API’s core functions provide an alternative way of computing the likelihood. They allow the author to manually allocate and specify the data that should be used for computing likelihood. The main differ-ence to the main API is that a partition instance is not required.
Chapter 3
Methods
We implemented a phylogenetic tree inference program based on the PosetSMC algorithm (2.4). Two distinct versions were made based on the two methods for computing likelihood described in (2.5).
3.1 Implementation
Our PosetSMC implementation using PLL called pll-smc (https:// github.com/isaacarvestad/pll-smc) was implemented in the programming language C++. It uses the PLL implementation libpll (https://github.com/xflouris/libpll) as the library for per-forming the likelihood calculations required to compute Lm, Lland Lr
in (2.32).
Two versions of pll-smc where implemented. The first implemen-tation used the main API of PLL (2.5.1) where each particle kept track of a PLL partition instance. The second implementation used the core functions of PLL. Each particle has a set of pointers to trees represent-ing the forest. The trees are stored in a recursive data structures and keep track of any data required by PLL inside the nodes and edges of the tree. See Appendix (A) for pseudocode describing pll-smc-main (1) and pll-smc-core (2).
3.2 Experiments
The experiments performed involved comparing the two pll-smc im-plementations (pll-smc-main and pll-smc-core). Comparing these tree
14 CHAPTER 3. METHODS
inference implementations was done by randomly generating a set of phylogenetic trees and inferring them with each respective implemen-tation. Each trial the following data was recorded:
• Runtime of tree inference implementation in seconds
• Robinson-Foulds distance (RF) of the original tree to the inferred tree.[8] This distance is the weighted average RF distance over the tree distribution.
Trials were run sequentially on a machine with 16 GiB of mem-ory using the vector instruction set SSE. Coalescent trees were ran-domly generated using the Python package DendroPy[12]. Sequences for these trees where generated with Seq-Gen[7] using the evolution-ary model GTR with equal frequencies and rates and gamma rate 1. Finally JPrIME[11] was used to compare RF distances between trees.
3.2.1 Comparing pll-smc implementations
There were two goals when comparing pll-smc implementations. 1. Correctness: verify that they did not differ in the quality of the
inferred trees.
2. Efficiency: identify which implementation used less computa-tional resources (memory and processor time).
These goals were met by running two experiments:
1. Generate sets of trees with varying leaf counts and with sequence lengths 102and 103respectively, and infer the trees using the two
implementations. Record and compare the RF distance of each implementation to test if similar trees are inferred. Additionally, record and compare the runtime to test the efficiency of each im-plementation.
Each set of trees includes five generated trees for each leaf count in the range [4, 50]. For each tree pll-smc-core was run using 102,
103and 104particles and pll-smc-main was run using 102and 103
particles.
2. Using the Valgrind tool Massif [10] profile the heap memory us-age for each implementation when inferring a tree with 30 leaves and sequence length 1000.
Chapter 4
Results
The results of the experiments showed similarities and differences be-tween the two pll-smc implementations. The implementations infer tree distributions with similar RF distances, both for sequence lengths 102(4.1) and 103(4.2). It can be observed that with increasing leaf count
the standard deviation of the RF distances increases for both imple-mentations. The RF distances also decreases similarly as the particle count is increased from 102 to 103 particles.
Figure 4.1: Average RF distance to original tree compared to the leaf count with 100 nucleotide long sequences.
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 100 particles Leaves RF ● pll−smc−core pll−smc−main ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 10 20 30 40 50 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 1000 particles Leaves RF ● pll−smc−core pll−smc−main 15
16 CHAPTER 4. RESULTS
Figure 4.2: Average RF distance to original tree compared to the leaf count with 1000 nucleotide long sequences
●● ● ● ●●●●● ● ● ●●●● ● ●● ● ●● ● ●● ●● ● ● 5 10 15 20 25 30 0 5 10 15 20 25 0 5 10 15 20 25 100 particles Leaves RF ● pll−smc−core pll−smc−main ● ● ●●● ● ● ●●●●● ● ●●●●● ●● ● ●● ●●●●● 5 10 15 20 25 30 0 5 10 15 0 5 10 15 1000 particles Leaves RF ● pll−smc−core pll−smc−main
On the other hand, runtimes for inferring the trees were not sim-ilar. As can be observed in (4.3) and (4.4), pll-smc-core scales better as the number of leaves increases. Comparing (4.3) and (4.4) it can also be observed that increasing the number of particles results in a proportionally large increase in runtime.
Figure 4.3: Average runtime in seconds compared to leaf count with 100 nucleotide long sequences
● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 10 20 30 40 50 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 100 particles Leaves Time (s) ● pll−smc−core pll−smc−main ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● 10 20 30 40 50 0 2 4 6 8 0 2 4 6 8 1000 particles Leaves Time (s) ● pll−smc−core pll−smc−main
CHAPTER 4. RESULTS 17
Figure 4.4: Average runtime in seconds compared to leaf count with 1000 nucleotide long sequences
●●●●●●●●●●●●●●●●●●●●●●●●●●●● 5 10 15 20 25 30 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 100 particles Leaves Time (s) ● pll−smc−core pll−smc−main ●●●●●●●●●●●●●●●●●●●●●●●●●●●● 5 10 15 20 25 30 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 1000 particles Leaves Time (s) ● pll−smc−core pll−smc−main
The memory profile performed also favored pll-smc-core. Figure (4.5) shows how memory initially is allocated before stabilizing for the remaining execution time. Although the two implementations have similar shapes the amount of memory used differs.
Figure 4.5: Memory allocations when inferring a tree with 30 leaves and sequence length 1000.
pll−smc−core Time (ms) Memor y usage (MiB) 0 50 100 150 200 250 300
0.0e+00 5.0e+09 1.0e+10 1.5e+10 2.0e+10
pll−smc−main Time (ms) Memor y usage (GiB) 0 3 6 9 12 14
18 CHAPTER 4. RESULTS
Comparing pll-smc-core with particle counts 102, 103 and 104it can
be observed in (4.6) that although there is an apparent benefit to in-creasing the particle count from 102 to 103, there is no significant
dif-ference between 103 and 104 particles.
Figure 4.6: Average RF distance compared to leaf count with 100 and 1000 nucleotide long sequences with increasing number of particles inferred using pll-smc-core
10 20 30 40 50 0 5 10 15 20 25 30 35 Leaves RF ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 20 30 40 50 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35
Median RF with differening particle count, sequence length 100
● pll−smc−core: 1e2 pll−smc−core: 1e3 pll−smc−core: 1e4 10 20 30 40 50 0 10 20 30 Leaves RF ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 20 30 40 50 0 10 20 30 0 10 20 30
Median RF with differening particle count, sequence length 1000
● pll−smc−core: 1e2
pll−smc−core: 1e3 pll−smc−core: 1e4
Chapter 5
Discussion
As the results showed, both pll-smc-core and pll-smc-main inferred tree distributions with similar RF distances suggesting that they pro-duce tree distributions of similar quality. The main difference between the implementations was the time and memory required to complete the tree inference computations; most likely explained by pll-smc-core containing optimizations that are not part of pll-smc-main. We begin discussing these implementation differences.
5.1 Implementation comparison
What sets the two implementations apart is how the PLL API is used, see section (2.5). One of the main differences is the amount of data that can be shared between particles. As explained in Bouchard-Côté [1], phylogenetic SMC algorithms can be optimized to reduce their mem-ory usage. Each time a particle generation is resampled particles are copied to the next particle generation. If one particle is copied multi-ple times, multimulti-ple particles in the new generation will share the same underlying data. Specifically, when a particle p1 is copied to make a
particle p2, p1 and p2 share the same trees in their forests. Since each
particle in pll-smc-core has a set of pointers to trees making up their forest, as recommended by Bouchard-Côté [1], it is trivial to copy a particle by simply copying the set of tree pointers. This is possible since it is guaranteed that no particle will modify the existing trees in the forest; a proposal (2.4.1) can only extend existing trees by adding a parent node and will not modify the inferred subtrees.
In contrast, pll-smc-main keeps track of one PLL partition per
20 CHAPTER 5. DISCUSSION
cle, essentially allocating a forest for every particle. To our knowledge it is not possible to share data between partitions using the main API. This design choice caused two problems, both of which most likely where contributing factors to the performance difference. Firstly, stor-ing multiple instances of the same data will use more memory. This is a clear disadvantage, especially when SMC is primarily memory intensive.[1] Secondly, resampling was not equally trivial since entire PLL partitions had to be copied. In our implementation this was done using a series of memcpys.
The memory allocation graphs (4.5) show the detrimental effect of not optimizing for memory usage and limits the usability of pll-smc-main. On many modern machines it will require swap-memory and for longer sequences or larger particle counts this problem will only get worse. These graphs also provide some evidence that no major memory leaks are present.
5.2 Potential problems and limitations
It is important to acknowledge potential problems and limitations with our approach. One potential problem is the possibility of an incorrect implementation. Another is the possibility that the model parameters were incorrectly setup, either inside the implementation or when gen-erating phylogenetic trees. A hint of a potential problem is the com-parison of pll-smc-core with different number of particles (4.6) where no particular improvement can observed when inferring with 104
in-stead of 103 particles. One would expect that if the implementation
and model parameters were correct that the inferred trees would con-verge towards the correct tree and RF distance concon-verge towards 0 as particle counts increased.
Although this is a problem it does not change the fact that the relative difference in time and memory usage between pll-smc-core and pll-smc-main increased with larger leaf counts. Neither would a change in model parameters change the number of likelihood com-putations performed in the implementations. It should therefore still be possible to make conclusions regarding the effectiveness of each respective implementation and thus the usability of PLL in SMC algo-rithms.
CHAPTER 5. DISCUSSION 21
5.3 Future work
Moving forward there are a couple topics which we think would be interesting to investigate. Firstly it would be interesting to investigate what, if any, API changes are required to improve the usability of PLL with SMC algorithms. Based on our experience with pll-smc the crit-ical feature is being able to store data in a tree-like structure instead of in partitions. This solves two key criteria. Primarily, it enables in-cremental likelihood computation by only computing the likelihood of the nodes once when they are first proposed using the proposal dis-tribution (2.4.1). Additionally, it allows for the memory saving tech-niques used in pll-smc-core and described by Bouchard-Côté [1].
Secondly it would also be interesting to use PLL to implement more advanced SMC algorithms such as Combinatorial Sequential Monte Carlo (CSMC) which can infer a more general class of phylogenetic trees.[15]
Chapter 6
Conclusion
The use of different parts of the PLL API in core and pll-smc-main shows how design and API choices can affect the performance of an implementation. Between the two implementations pll-smc-core performed significantly better as the size of the trees increased, even though it did not infer worse trees with regards to RF distance. As discussed in (5.1) this can most likely be attributed to the restrictions of the PLL partition data type which resulted in high memory usage.
The pll-smc-main implementation is not suitable for phylogenetic tree inference. However, since pll-smc-main is not representative of all possible SMC implementations using the main PLL API, it cannot be concluded that the main PLL API isn’t suitable for SMC methods. Nevertheless, it can be concluded that the PLL’s core functions can be used to implement a SMC algorithm for inferring coalescent phyloge-netic trees.
Bibliography
[1] Bouchard-Côté. “SMC (sequential Monte Carlo) for Bayesian phy-logenetics”. In: Bayesian Phylogenetics: methods, algorithms, and ap-plications. Ed. by Ming-Hui Chen, Lynn Kuo, and Paul O Lewis. CRC Press, 2014, pp. 163–184.
[2] Alexandre Bouchard-Côté, Sriram Sankararaman, and Michael I Jordan. “Phylogenetic inference via sequential Monte Carlo”. In: Systematic biology 61.4 (2012), pp. 579–593.
[3] Arnaud Doucet and Adam M Johansen. “A tutorial on particle filtering and smoothing: Fifteen years later”. In: Handbook of non-linear filtering 12.656-704 (2009), p. 3.
[4] Alexei J Drummond et al. “Bayesian phylogenetics with BEAUti and the BEAST 1.7”. In: Molecular biology and evolution 29.8 (2012), pp. 1969–1973.
[5] T Flouri et al. “The phylogenetic likelihood library”. In: System-atic biology 64.2 (2014), pp. 356–362.
[6] Willi Hennig. “Phylogenetic systematics”. In: Annual review of entomology 10.1 (1965), pp. 97–116.
[7] Andrew Rambaut and Nicholas C Grass. “Seq-Gen: an applica-tion for the Monte Carlo simulaapplica-tion of DNA sequence evoluapplica-tion along phylogenetic trees”. In: Bioinformatics 13.3 (1997), pp. 235– 238.
[8] David F Robinson and Leslie R Foulds. “Comparison of phylo-genetic trees”. In: Mathematical biosciences 53.1-2 (1981), pp. 131– 147.
[9] Naruya Saitou. Introduction to evolutionary genomics. Springer, 2016.
24 BIBLIOGRAPHY
[10] J. Seward, N. Nethercote, and J. Weidendorfer. Valgrind 3.3 - Ad-vanced Debugging and Profiling for GNU/Linux Applications. Net-work Theory Ltd., 2008.ISBN: 0954612051, 9780954612054. [11] Joel Sjöstrand et al. “GenPhyloData: realistic simulation of gene
family evolution”. In: BMC bioinformatics 14.1 (2013), p. 209. [12] Jeet Sukumaran and Mark T Holder. “DendroPy: a Python
li-brary for phylogenetic computing”. In: Bioinformatics 26.12 (2010), pp. 1569–1571.
[13] Yee W Teh, Hal Daume III, and Daniel M Roy. “Bayesian ag-glomerative clustering with coalescents”. In: Advances in Neural Information Processing Systems. 2008, pp. 1473–1480.
[14] Tomáš Flouri, Diego Darriba, Kassian Kobert, Mark T. Holder, Alexey Kozlov, Alexandros Stamatakis. API Reference. https: //github.com/xflouris/libpll/wiki/API-Reference. 2016.
[15] Liangliang Wang, Alexandre Bouchard-Côté, and Arnaud Doucet. “Bayesian phylogenetic inference using a combinatorial sequen-tial Monte Carlo method”. In: Journal of the American Statistical Association 110.512 (2015), pp. 1362–1374.
Appendix A
Implementations
Algorithm 1 General program structure of pll-smc-main, including all
PLL function calls.
1: functionSMC(sequences, particle_count) 2: particles 2 ⇤ number of particles to use 3: for all p 2 particles do
4: p.partition pll_partition_create(size of final tree)
5: pll_set_frequencies(p, . . . )
6: pll_set_category_rates(p, . . . )
7: pll_set_subst_params(p, . . . )
8: for all s 2 sequences do
9: pll_set_tip_states(p, s, . . . )
10: pll_compute_root_loglikelihood(p, . . . )
11: end for 12: end for
26 APPENDIX A. IMPLEMENTATIONS
13: for all i 2 {1, . . . , sequence count} do 14: if i is even then
15: RESAMPLE(upper half of particles, lower half of
parti-cles)
16: PROPOSE(lower half of particles)
17: NORMALIZE(lower half of particles)
18: else
19: RESAMPLE(lower half of particles, upper half of
parti-cles)
20: PROPOSE(upper half of particles)
21: NORMALIZE(upper half of particles)
22: end if 23: end for
24: Return particles along with their associated normalized weight 25: end function
26: functionRESAMPLE(from, to) 27: for all p 2 to do
28: p0 ⇠ from, weighted by normalized weights
29: p.weight p0.weight
30: p.normalized_weight size of from1
31: p.partition p0.partition (using memcpy)
32: end for 33: end function
34: functionPROPOSE(particles) 35: for all p 2 particles do
36: pll_update_prob_matrices(p, . . . )
37: pll_update_partials(p, . . . )
38: pll_compute_root_loglikelihood(p, . . . )
39: end for 40: end function
41: functionNORMALIZE(particles)
42: Normalize weights
APPENDIX A. IMPLEMENTATIONS 27
Algorithm 2 General program structure of pll-smc-core, including all
PLL function calls.
1: functionSMC(sequences, particle_count)
2: reference_partition pll_partition_create(size of final tree) 3: pll_set_frequencies(reference_partition, . . . )
4: pll_set_category_rates(reference_partition, . . . ) 5: pll_set_subst_params(reference_partition, . . . ) 6: for all s 2 sequences do
7: pll_set_tip_states(reference_partition, s, . . . ) 8: end for
9: pll_update_eigen(reference_partition, . . . )
10: particle Particle(reference_partition, sequences, . . .) 11: for all s 2 particle.sequences do
12: pll_core_root_loglikelihood(. . . )
13: end for
14: particles 2 ⇤ number of particles to use 15: for all p 2 particles do
16: p.weight copy(particle.weight)
17: p.normalized_weight copy(particle.normalized_weight)
18: p.particle_pointers copy(particle.particle_pointers)
19: end for
20: for all i 2 {1, . . . , sequence count} do 21: if i is even then
22: RESAMPLE(upper half of particles, lower half of
parti-cles)
23: PROPOSE(lower half of particles)
24: NORMALIZE(lower half of particles)
25: else
26: RESAMPLE(lower half of particles, upper half of
parti-cles)
27: PROPOSE(upper half of particles)
28: NORMALIZE(upper half of particles)
29: end if 30: end for
31: Return particles along with their associated normalized weight 32: end function
28 APPENDIX A. IMPLEMENTATIONS
33: functionRESAMPLE(from, to) 34: for all p 2 to do
35: p0 ⇠ from, weighted by normalized weights
36: p.weight p0.weight 37: p.normalized_weight 1 size of from 38: p.particle_pointers p0.particle_pointers 39: end for 40: end function
41: functionPROPOSE(particles) 42: for all p 2 particles do
43: pll_core_update_pmatrix(left_edge, . . . ) 44: pll_core_update_pmatrix(right_edge, . . . ) 45: pll_core_update_partial_ii(. . . ) 46: pll_core_root_loglikelihood(. . . ) 47: end for 48: end function
49: functionNORMALIZE(particles)
50: Normalize weights