Implementing Bayesian phylogenetic tree inference with Sequential Monte Carlo and the Phylogenetic Likelihood Library

(1)

INOM EXAMENSARBETE TEKNIK, GRUNDNIVÅ, 15 HP , STOCKHOLM SVERIGE 2018

Implementing Bayesian

phylogenetic tree inference with

Sequential Monte Carlo and the

Phylogenetic Likelihood Library

ISAAC ARVESTAD

HENRIK LAGEBRAND

KTH

(2)

Implementing Bayesian

phylogenetic tree inference

with Sequential Monte Carlo

and the Phylogenetic

Likelihood Library

ISAAC ARVESTAD & HENRIK LAGEBRAND

Bachelor in Computer Science Date: June 3, 2018

Supervisor: Jens Lagergren Examiner: Örjan Ekeberg

Swedish title: Implementering av fylogenetisk trädinferens med Sekvenentiell Monte Carlo och programspråksbiblioteket Phylogenetic Likelihood Library

(3)

(4)

iii

Abstract

We investigate the usability of the Phylogenetic Likelihood Library (PLL) in Bayesian phylogenetic tree inference using Sequential Monte Carlo (SMC) algorithms. This is done by implementing two different versions of the same algorithm with two different approaches of the use of PLL. The implementation using the main PLL API encountered performance issues that the lower level implementation did not. We conclude that it is possible to use PLL in SMC methods but it is unclear if the main API is suitable.

(5)

iv

Sammanfattning

Vi undersöker om programspråksbiblioteket Phylogenetic Likelihood Lib-rary (PLL) kan användas för bayesiansk inferens av phylogenetiska träd med en sekventiell Monte Carlo-metod (SMC). Genom att imple-mentera algoritmen med två olika delar av PLL:s programmerings-gränssnitt visar vi att det går att använda PLL för att implementera SMC-algoritmen men att det är oklart om det huvudsakliga program-meringsgränssnittet är lämpligt.

(6)

Chapter 1 Introduction

Darwin’s theory of evolution explains how organisms change over generations. He exemplified this by comparing morphological fea-tures of Galápagos finches. Comparisons like these can be used to construct a phylogeny describing how a set of species have evolved. Species with similar features are more likely to be closely related than those with less similar features. However, comparing morphological features has downsides. For example, species with different ances-tors can evolve similar features independently of each other. Biologist Willi Hennig made the case for analysing phylogenies in terms of her-itable genetic traits, known as phylogenetics.[6] A tree based on such an analysis displaying a phylogeny is a phylogenetic tree.

We will in this paper discuss phylogenetic inference, the topic of re-constructing a phylogenetic tree from a set of related genetic sequences. Constructing a perfect tree is computationally hard for anything more than a few sequences but there are several methods for approximating these trees. One method, PosetSMC, relies on Bayesian inference to approximate a distribution of trees using Sequential Monte Carlo (SMC). One of its main advantages compared to other Bayesian inference meth-ods is that it is easy to parallelize. Speed is important with phyloge-netic inference since the amount of sequenced data available is grow-ing faster than the computational power available.[2]

Phylogenetic inference methods are difficult to implement. The algorithms modelling genetic evolution are non-trivial and can take time to implement properly.[5] One software library which aims to solve these algorithmic tasks is the Phylogenetic Likelihood Library (PLL). PLL provides data structures and algorithms related to constructing

(9)

2 CHAPTER 1. INTRODUCTION

and analysing phylogenetic trees quickly using low-level optimisa-tions such as SIMD instrucoptimisa-tions. One of the applicaoptimisa-tions Flouri et al. [5] shows is using PLL with a Markov Chain Monte Carlo (MCMC) algorithm. An application which is interesting to investigate is if PLL could also be used for SMC algorithms.

We will in this thesis investigate how suitable PLL is as a software library for implementing SMC tree inference methods. We approach this problem by implementing two versions of the PosetSMC algo-rithm which use PLL in two different ways. This allows us to inves-tigate if different parts of PLL are more or less suitable for SMC tree inference methods.

1.1 Problem statement

Can PLL be used to implement the SMC tree inference method PosetSMC described by Bouchard-Côté, Sankararaman, and Jordan [2]?

1.2 Scope

We limit scope of the project by only inferring coalescent phylogenetic trees using the PriorPrior proposal used in PosetSMC and defined by Teh, Daume III, and Roy [13].

1.3 Outline

In Section (2) SMC, phylogenetic trees and PLL are explained. This is followed by Section (3) describing how the study was performed. Finally, the results are presented (4) and discussed (5).

(10)

Chapter 2 Background

PosetSMC is an implementation of a particle filter which uses SMC to approximate a phylogenetic tree. This background aims to introduce the concepts required to understand and implement a program similar to PosetSMC using PLL.

At the most basic level, Bayesian phylogenetics builds on hidden Markov models (HMM) where the trees are the hidden states and the given sequences are the observations. Filtering builds on HMM and models how states can be inferred based on previous observations. Finally, particle filters provide a technique for approximating a filter. In the case of PosetSMC, particle filtering is done using SMC.

The goal of Bayesian phylogenetic inference is to approximate some distribution of phylogenetic trees ⇡ known as the posterior distribu-tion.[3] The approximation b⇡ is a discrete distribution of phylogenetic trees where each tree has an associated weight. One advantage of com-puting an entire distribution instead of approximating a single tree is that the distribution can be used to answer multiple questions. An example given in Bouchard-Côté [1] is the probability that a certain clade1_{exists for a set of sequences:}

K

X

k=1

wkf (tk) (2.1)

where wkis the normalized weight of tree k and

f (tk) =

(

1, if clade is part of tk

0, otherwise (2.2)

1_{Maximal subtree covering a set of leaves}

(11)

4 CHAPTER 2. BACKGROUND

We will begin by briefly describing phylogenetic trees. Next we describe HMM, filtering, SMC and particle filtering based on Doucet and Johansen [3]. Finally we describe PosetSMC and PLL.

2.1 Phylogenetic trees

Phylogenetic trees represent evolutionary relationship. The leaves of the tree represent that which has evolved such as the genetic sequences of a set of species and the branches represent evolutionary distance.[9] Figure (2.1) shows a phylogenetic tree with four leaves {L1, L2, L3, L4} with distance represented as horizontal lines.

Figure 2.1: A phylogenetic tree rendered using FigTree[4]

0.4

L2 L0 L3

L1

Coalescent trees, the focus of this paper, are a particular type of tree where the time between evolutionary events is exponentially dis-tributed. Every event represents two descendants coalescing into an an-cestor. Each generation every member is equally likely to coalesce.[13]

(12)

CHAPTER 2. BACKGROUND 5

2.2 Hidden Markov models

2.2.1 Definition

HMM is defined by a “X -valued discrete-time Markov process {Xn}n 1”[3]

which is described by two equations:

X1 ⇠ µ(x1)and Xn|(Xn 1= xn 1)⇠ f(xn|xn 1) (2.3)

Yn|(Xn= xn)⇠ g(yn|xn) (2.4)

where µ(x1)is a density function describing the initial state, f(x|x0)

describes a transition from x0 _{to x and g(y|x) describes how the}

Y-valued variable y is distributed given that x has occured. X represents the hidden data while the statistically independent values Y represent the observations.

Using (2.3) and (2.4) Doucet and Johansen [3] defines two probabil-ity functions. The prior distribution is defined as the probabilprobabil-ity that a certain sequence of states x1:noccurs:

p(x1:n) = µ(x1) n

Y

k=2

f (xk|xk 1) (2.5)

The likelihood function is defined as the probability that a sequence of observations y1:noccurs given a specific sequence of states x1:n.

p(y1:n|x1:n) = n

Y

k=1

g(yk|xk) (2.6)

Using Bayes’ theorem Doucet and Johansen [3] define the probabil-ity of a sequence x1:noccurring given a sequence of observations y1:n,

known as the posterior distribution, as p(x1:n|y1:n) =

p(x1:n)p(y1:n|x1:n)

p(y1:n)

(2.7) where p(y1:n)is known as the marginal likelihood and

p(y1:n) =

Z

(13)

2.2.2 Filtering

Filtering is the “[problem of] characterising the distribution of the state of the hidden Markov model at the present time, given the information provided by all of the observations received up to the present time.”[3] It provides a way to sequentially approximate the posterior distribu-tion (2.7) for each time t 2 {1 . . . n} by defining the posterior distri-bution recursively. This requires computing the product of the prior distribution (2.5) and likelihood function (2.6) as well as approximat-ing the marginal likelihood (2.8):

Doucet and Johansen also provide a recursive definition for the marginal likelihood: p(y1:n) = p(y1) n Y k=2 p(yk|y1:k 1) (2.11)

2.3 Sequential Monte Carlo

Sequential Monte Carlo methods are a type of Monte Carlo methods which are used to approximate a “target probability distribution”[3] defined as ⇡n(x1:n) = n (x1:n) Zn (2.12) where Zn = Z n(x1:n) dx1:n (2.13)

Doucet and Johansen [3] describes the basic Monte Carlo method approximating ⇡n(x1:n)as sampling X1:ni ⇠ ⇡n(x1:n)for i 2 {1, . . . , N}.

Each Xi _{is known as a particle which is updated as n increases. The}

(14)

CHAPTER 2. BACKGROUND 7 b⇡n(x1:n) = 1 N N X i=1 Xi 1:n(x1:n) (2.14) where Xi 1:n(x1:n) = ( 1, Xi 1:n = x1:n 0, otherwise (2.15)

The obvious problem with this approach is that the distribution ⇡ in many scenarios is not known. For example, in the context of phy-logenetic inference, we wish to approximate the distribution of phylo-genetic trees for a given set of sequences. This distribution of trees is not known and thus we cannot sample from it. To solve this problem Doucet and Johansen [3] describe three methods which incrementally build on each other and together compose SMC: Importance Sampling, Sequential Importance Sampling and Resampling.

2.3.1 Importance Sampling

Importance sampling (IS) introduces a density function qn(x1:n)where

⇡n(x1:n) > 0 =) qn(x1:n) > 0 (2.16)

This requirement formulates the notion that if some sequence of states x1:ncan be sampled from ⇡nit must also be possible for the same

sequence to be sampled from qn. It does not mean they have to be

equally probable, but if a sequence is at all probable in ⇡ it must also be probable in q. This is necessary to guarantee that b⇡ is asymptotically correct.

To adjust for the fact that ⇡n 6= qn a weight wn is introduced such

that ⇡n(x1:n) = wn(x1:n)qn(x1:n) Zn (2.17) where Zn = Z wn(x1:n)qn(x1:n) dx1:n (2.18)

(15)

wn(x1:n) = n

(x1:n)

qn(x1:n)

(2.19) Adding the weights also requires changing the Monte Carlo ap-proximation which is now written as

b⇡n(x1:n) = N X i=1 W_{n X}i i 1:n(x1:n) (2.20) where W_ni = wn(X i 1:n) PN j=1wn(X j 1:n) (2.21) One problem with this method which Doucet and Johansen [3] bring up is that qn(x1:n), although easy to sample from, is

computa-tionally expensive since for each time step an entirely new sequence x1:n has to be sampled. If we want to approximate ⇡t(x1:t) for each

t _{2 {1 . . . n} this would result in a computational complexity of O(n)} for each step i and O(n2₎_overall.

2.3.2 Sequential Importance Sampling

To achieve linear time complexity, Doucet and Johansen [3] present a solution where qn(x1:n)is defined recursively as

qn(x1:n) = qn 1(x1:n 1)qn(xn|x1:n 1) (2.22)

and correspondingly, wnis recursively defined as

wn(x1:n) = n (x1:n) qn(x1:n) (2.23) = n 1(x1:n 1) qn 1(x1:n 1) n(x1:n) n 1(x1:n 1)qn(xn|x1:n 1) (2.24) Defining the distribution recursively allows for reusing previously sampled data. To sample from qn(x1:n)we sample Xni ⇠ qn(xn|X1:n 1i )

where Xi

1:n 1 ⇠ qn 1(x1:n 1)and X1i ⇠ q1(x1).

The issue with SIS (and IS) is that the variance of SIS “increases exponentially with n”[3]. With a large variance a large number of par-ticles is needed to compute a decent approximation of ⇡n. This is a

(16)

Bouchard-Côté, Sankararaman, and Jordan [2] describes how this results in “a large weight imbalance [...] called particle degeneracy”[2] where the majority of particles have low weights. It would be better to spend the limited computing power on higher weighted particles; this is where resampling comes in.

2.3.3 Resampling

Resampling creates a distribution ⇡nby sampling N times from the

ap-proximated posterior b⇡n.[3] Particles with high weights are more likely

to be sampled multiple times and those with low weight might not be sampled at all. ⇡n(x1:n) = 1 N N X i=1 N_{n X}i i 1:n(x1:n) (2.25) where N_ni _{⇠ multinomial distribution (N, W}_ni), i_{2 {1, . . . , N}} (2.26) Note that there are multiple ways of sampling Ni

n. Doucet and

Jo-hansen [3] also bring up systematic- and residual resampling.

2.3.4 Algorithm

Now that the three concepts IS, SIS and resampling have been defined, Doucet and Johansen [3] combine them into an algorithm.

1. Begin by sampling N times from the density function q X_1:ni _⇠ ( q1(x1), n = 1 qn(xn|X i 1:n 1), n > 1 , i_{2 {1 . . . N}} (2.27)

2. Compute a weight w for each sample Xi 1:n win(X1:ni ) = ( 1/N, n = 1 wn 1(X1:n 1i ) n(X_1:ni ) n 1(X1:n 1i )qn(Xni|X1:n 1i ), n > 1 (2.28)

3. Compute normalized weights Wi

nusing (2.21)

4. Resample Xi_1:n ⇠ b⇡n, where X i

(17)

2.3.5 Applications

By choosing definitions for and Z, SMC can used for multiple pur-poses. For example, n(x1:n) = p(x1:n)p(y1:n|x1:n)and Zn= p(y1:n)gives

⇡n(x1:n) = n (x1:n) Zn = p(x1:n)p(y1:n|x1:n) p(y1:n) (2.29) which is the posterior distribution (2.12). We can thus approximate the posterior distribution by choosing a density function qnwhich gives

decent proposals and iterating the steps defined in section (2.3.4). This particular application is known as particle filtering.

2.4 PosetSMC

PosetSMC is a particle filter which introduces a proposal density q, weight function w and density defined on the space S. Given a set of genetic sequences it will infer a phylogenetic tree distribution. A parti-cle represents a forest containing all given genetic sequences as leaves and the space S spans all the possible forests covering the leaves.

The goal of PosetSMC is to make proposals on it’s particles until it has created a single tree from the forest. Each iteration it resamples using multinomial resampling. PosetSMC discusses two proposals for coalescent trees, PriorPrior and PriorPost, which were originally pre-sented in Teh, Daume III, and Roy [13] and found that “Surprisingly [. . . ], PriorPrior outperforms the more complicated PriorPost by one order of magnitude.”[2]

2.4.1 Proposals and weights

PriorPrior suggests two things when making a proposal:

1. A pair of trees (t1, t2)in the forest s 2 S to merge by connecting

their rootes to a new node, where t1, t2 2 s. Each pair of trees is

equally likely to be selected.[13]

2. The height of the new root as the sum of the previously tallest node and a height increment sampled from an exponential dis-tribution.[13, 2]

(18)

Using this method for defining the proposal distribution q, Bouchard-Côté, Sankararaman, and Jordan [2] define w and for coalescent trees. Inserting this definition in equation (2.28) gives

w_ni = wi_{n 1}(X_{1:n 1}i ) n(X i 1:n) n 1(X1:n 1i )qn(Xni|X1:n 1i ) (2.30) = wi n 1(X1:n 1i ) n( (X1:ni ))Lm LlLrqn(Xni|X1:n 1i ) (2.31) / wi n 1(X1:n 1i ) Lm LlLr (2.32) where Lm, Ll, Lr is the likelihood of the merged tree, left tree and

right tree respectively in the particle Xi

1:n on the leaves that each

re-spective tree covers and n( (X1:ni )) =Expd( (X1:ni ), l n+1

2 )is the

ex-ponential density for the height increment of the particle Xi

1:n, where

lis the number of leaves.[2]

2.5 Phylogenetic Likelihood Library

The Phylogenetic Likelihood Library is “highly optimized application programming interface for developing likelihood-based phylogenetic inference and post-analysis software”, developed by Flouri et al. [5]. The PLL has functions for parsing input files of different formats for “sequence data [. . . and] likelihood evaluation”[5] as well as data struc-tures for representing phylogenetic trees.[5] Of particular interest to this paper is the PLL’s functionality for computing the likelihood of a phylogenetic tree since this is required when computing particle weights in (2.32).

We identified two different methods for computing likelihoods us-ing PLL. The first alternative is usus-ing PLL’s recommended and doc-umented methods. [5][14] The second is using PLL’s core functions which are exposed by the library but are less documented.

2.5.1 The main API

PLL defines a data structure called the partition instance. It keeps track of evolutionary model parameters and buffers for storing intermedi-ate computations. It also defines the order that the tree should be

(19)

traversed when computing the likelihood of the tree and is central to PLL’s main API functionality.

Calculating the likelihood of the tree involves a series of steps: 1. Create a partition instance by specifying the number of tree nodes,

edges and other attributes.

2. Specify the evolutionary model parameters, for example the nu-cleotide frequencies.

3. Convert the input tree into a vector of operations specifying the order that the tree should be traversed and update the partition instance.

4. Compute the likelihood of the root.

2.5.2 Core functions

PLL API’s core functions provide an alternative way of computing the likelihood. They allow the author to manually allocate and specify the data that should be used for computing likelihood. The main differ-ence to the main API is that a partition instance is not required.

(20)

Chapter 3 Methods

We implemented a phylogenetic tree inference program based on the PosetSMC algorithm (2.4). Two distinct versions were made based on the two methods for computing likelihood described in (2.5).

3.1 Implementation

Our PosetSMC implementation using PLL called pll-smc (https:// github.com/isaacarvestad/pll-smc) was implemented in the programming language C++. It uses the PLL implementation libpll (https://github.com/xflouris/libpll) as the library for per-forming the likelihood calculations required to compute Lm, Lland Lr

in (2.32).

Two versions of pll-smc where implemented. The first implemen-tation used the main API of PLL (2.5.1) where each particle kept track of a PLL partition instance. The second implementation used the core functions of PLL. Each particle has a set of pointers to trees represent-ing the forest. The trees are stored in a recursive data structures and keep track of any data required by PLL inside the nodes and edges of the tree. See Appendix (A) for pseudocode describing pll-smc-main (1) and pll-smc-core (2).

3.2 Experiments

The experiments performed involved comparing the two pll-smc im-plementations (pll-smc-main and pll-smc-core). Comparing these tree

(21)

14 CHAPTER 3. METHODS

inference implementations was done by randomly generating a set of phylogenetic trees and inferring them with each respective implemen-tation. Each trial the following data was recorded:

• Runtime of tree inference implementation in seconds

• Robinson-Foulds distance (RF) of the original tree to the inferred tree.[8] This distance is the weighted average RF distance over the tree distribution.

Trials were run sequentially on a machine with 16 GiB of mem-ory using the vector instruction set SSE. Coalescent trees were ran-domly generated using the Python package DendroPy[12]. Sequences for these trees where generated with Seq-Gen[7] using the evolution-ary model GTR with equal frequencies and rates and gamma rate 1. Finally JPrIME[11] was used to compare RF distances between trees.

3.2.1 Comparing pll-smc implementations

There were two goals when comparing pll-smc implementations. 1. Correctness: verify that they did not differ in the quality of the

inferred trees.

2. Efficiency: identify which implementation used less computa-tional resources (memory and processor time).

These goals were met by running two experiments:

1. Generate sets of trees with varying leaf counts and with sequence lengths 102_{and 10}3_{respectively, and infer the trees using the two}

implementations. Record and compare the RF distance of each implementation to test if similar trees are inferred. Additionally, record and compare the runtime to test the efficiency of each im-plementation.

Each set of trees includes five generated trees for each leaf count in the range [4, 50]. For each tree pll-smc-core was run using 102_,

103_{and 10}4_{particles and pll-smc-main was run using 10}2_{and 10}3

particles.

2. Using the Valgrind tool Massif [10] profile the heap memory us-age for each implementation when inferring a tree with 30 leaves and sequence length 1000.

(22)

Chapter 4 Results

The results of the experiments showed similarities and differences be-tween the two pll-smc implementations. The implementations infer tree distributions with similar RF distances, both for sequence lengths 102_{(4.1) and 10}3_{(4.2). It can be observed that with increasing leaf count}

the standard deviation of the RF distances increases for both imple-mentations. The RF distances also decreases similarly as the particle count is increased from 102 _{to 10}3 _particles.

Figure 4.1: Average RF distance to original tree compared to the leaf count with 100 nucleotide long sequences.

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 100 particles Leaves RF ● pll−smc−core pll−smc−main ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 10 20 30 40 50 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 1000 particles Leaves RF ● pll−smc−core pll−smc−main 15

(23)

16 CHAPTER 4. RESULTS

Figure 4.2: Average RF distance to original tree compared to the leaf count with 1000 nucleotide long sequences

●● ● ● ●●●●● ● ● ●●●● ● ●● ● ●● ● ●● ●● ● ● 5 10 15 20 25 30 0 5 10 15 20 25 0 5 10 15 20 25 100 particles Leaves RF ● _{pll−smc−core} pll−smc−main ● ● ●●● ● ● ●●●●● ● ●●●●● ●● ● ●● ●●●●● 5 10 15 20 25 30 0 5 10 15 0 5 10 15 1000 particles Leaves RF ● _{pll−smc−core} pll−smc−main

On the other hand, runtimes for inferring the trees were not sim-ilar. As can be observed in (4.3) and (4.4), pll-smc-core scales better as the number of leaves increases. Comparing (4.3) and (4.4) it can also be observed that increasing the number of particles results in a proportionally large increase in runtime.

Figure 4.3: Average runtime in seconds compared to leaf count with 100 nucleotide long sequences

● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 10 20 30 40 50 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 100 particles Leaves Time (s) ● pll−smc−core pll−smc−main ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● 10 20 30 40 50 0 2 4 6 8 0 2 4 6 8 1000 particles Leaves Time (s) ● pll−smc−core pll−smc−main

(24)

CHAPTER 4. RESULTS 17

Figure 4.4: Average runtime in seconds compared to leaf count with 1000 nucleotide long sequences

●●●●●●●●●●●●●●●●●●●●●●●●●●●● 5 10 15 20 25 30 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 100 particles Leaves Time (s) ● _{pll−smc−core} pll−smc−main ●●●●●●●●●●●●●●●●●●●●●●●●●●●● 5 10 15 20 25 30 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 1000 particles Leaves Time (s) ● _{pll−smc−core} pll−smc−main

The memory profile performed also favored pll-smc-core. Figure (4.5) shows how memory initially is allocated before stabilizing for the remaining execution time. Although the two implementations have similar shapes the amount of memory used differs.

Figure 4.5: Memory allocations when inferring a tree with 30 leaves and sequence length 1000.

pll−smc−core Time (ms) Memor y usage (MiB) 0 50 100 150 200 250 300

0.0e+00 5.0e+09 1.0e+10 1.5e+10 2.0e+10

pll−smc−main Time (ms) Memor y usage (GiB) 0 3 6 9 12 14

(25)

18 CHAPTER 4. RESULTS

Comparing pll-smc-core with particle counts 102_{, 10}3 _{and 10}4_{it can}

be observed in (4.6) that although there is an apparent benefit to in-creasing the particle count from 102 _{to 10}3_{, there is no significant}

dif-ference between 103 _{and 10}4 _particles.

Figure 4.6: Average RF distance compared to leaf count with 100 and 1000 nucleotide long sequences with increasing number of particles inferred using pll-smc-core

10 20 30 40 50 0 5 10 15 20 25 30 35 Leaves RF ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 20 30 40 50 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35

Median RF with differening particle count, sequence length 100

● _{pll−smc−core: 1e2} pll−smc−core: 1e3 pll−smc−core: 1e4 10 20 30 40 50 0 10 20 30 Leaves RF ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 20 30 40 50 0 10 20 30 0 10 20 30

Median RF with differening particle count, sequence length 1000

● _{pll−smc−core: 1e2}

pll−smc−core: 1e3 pll−smc−core: 1e4

(26)

Chapter 5 Discussion

As the results showed, both pll-smc-core and pll-smc-main inferred tree distributions with similar RF distances suggesting that they pro-duce tree distributions of similar quality. The main difference between the implementations was the time and memory required to complete the tree inference computations; most likely explained by pll-smc-core containing optimizations that are not part of pll-smc-main. We begin discussing these implementation differences.

5.1 Implementation comparison

What sets the two implementations apart is how the PLL API is used, see section (2.5). One of the main differences is the amount of data that can be shared between particles. As explained in Bouchard-Côté [1], phylogenetic SMC algorithms can be optimized to reduce their mem-ory usage. Each time a particle generation is resampled particles are copied to the next particle generation. If one particle is copied multi-ple times, multimulti-ple particles in the new generation will share the same underlying data. Specifically, when a particle p1 is copied to make a

particle p2, p1 and p2 share the same trees in their forests. Since each

particle in pll-smc-core has a set of pointers to trees making up their forest, as recommended by Bouchard-Côté [1], it is trivial to copy a particle by simply copying the set of tree pointers. This is possible since it is guaranteed that no particle will modify the existing trees in the forest; a proposal (2.4.1) can only extend existing trees by adding a parent node and will not modify the inferred subtrees.

In contrast, pll-smc-main keeps track of one PLL partition per

(27)

20 CHAPTER 5. DISCUSSION

cle, essentially allocating a forest for every particle. To our knowledge it is not possible to share data between partitions using the main API. This design choice caused two problems, both of which most likely where contributing factors to the performance difference. Firstly, stor-ing multiple instances of the same data will use more memory. This is a clear disadvantage, especially when SMC is primarily memory intensive.[1] Secondly, resampling was not equally trivial since entire PLL partitions had to be copied. In our implementation this was done using a series of memcpys.

The memory allocation graphs (4.5) show the detrimental effect of not optimizing for memory usage and limits the usability of pll-smc-main. On many modern machines it will require swap-memory and for longer sequences or larger particle counts this problem will only get worse. These graphs also provide some evidence that no major memory leaks are present.

5.2 Potential problems and limitations

It is important to acknowledge potential problems and limitations with our approach. One potential problem is the possibility of an incorrect implementation. Another is the possibility that the model parameters were incorrectly setup, either inside the implementation or when gen-erating phylogenetic trees. A hint of a potential problem is the com-parison of pll-smc-core with different number of particles (4.6) where no particular improvement can observed when inferring with 104

in-stead of 103 _{particles. One would expect that if the implementation}

and model parameters were correct that the inferred trees would con-verge towards the correct tree and RF distance concon-verge towards 0 as particle counts increased.

Although this is a problem it does not change the fact that the relative difference in time and memory usage between pll-smc-core and pll-smc-main increased with larger leaf counts. Neither would a change in model parameters change the number of likelihood com-putations performed in the implementations. It should therefore still be possible to make conclusions regarding the effectiveness of each respective implementation and thus the usability of PLL in SMC algo-rithms.

(28)

CHAPTER 5. DISCUSSION 21

5.3 Future work

Moving forward there are a couple topics which we think would be interesting to investigate. Firstly it would be interesting to investigate what, if any, API changes are required to improve the usability of PLL with SMC algorithms. Based on our experience with pll-smc the crit-ical feature is being able to store data in a tree-like structure instead of in partitions. This solves two key criteria. Primarily, it enables in-cremental likelihood computation by only computing the likelihood of the nodes once when they are first proposed using the proposal dis-tribution (2.4.1). Additionally, it allows for the memory saving tech-niques used in pll-smc-core and described by Bouchard-Côté [1].

Secondly it would also be interesting to use PLL to implement more advanced SMC algorithms such as Combinatorial Sequential Monte Carlo (CSMC) which can infer a more general class of phylogenetic trees.[15]

(29)

Chapter 6 Conclusion

The use of different parts of the PLL API in core and pll-smc-main shows how design and API choices can affect the performance of an implementation. Between the two implementations pll-smc-core performed significantly better as the size of the trees increased, even though it did not infer worse trees with regards to RF distance. As discussed in (5.1) this can most likely be attributed to the restrictions of the PLL partition data type which resulted in high memory usage.

The pll-smc-main implementation is not suitable for phylogenetic tree inference. However, since pll-smc-main is not representative of all possible SMC implementations using the main PLL API, it cannot be concluded that the main PLL API isn’t suitable for SMC methods. Nevertheless, it can be concluded that the PLL’s core functions can be used to implement a SMC algorithm for inferring coalescent phyloge-netic trees.

(30)

Bibliography

[1] Bouchard-Côté. “SMC (sequential Monte Carlo) for Bayesian phy-logenetics”. In: Bayesian Phylogenetics: methods, algorithms, and ap-plications. Ed. by Ming-Hui Chen, Lynn Kuo, and Paul O Lewis. CRC Press, 2014, pp. 163–184.

[2] Alexandre Bouchard-Côté, Sriram Sankararaman, and Michael I Jordan. “Phylogenetic inference via sequential Monte Carlo”. In: Systematic biology 61.4 (2012), pp. 579–593.

[3] Arnaud Doucet and Adam M Johansen. “A tutorial on particle filtering and smoothing: Fifteen years later”. In: Handbook of non-linear filtering 12.656-704 (2009), p. 3.

[4] Alexei J Drummond et al. “Bayesian phylogenetics with BEAUti and the BEAST 1.7”. In: Molecular biology and evolution 29.8 (2012), pp. 1969–1973.

[5] T Flouri et al. “The phylogenetic likelihood library”. In: System-atic biology 64.2 (2014), pp. 356–362.

[6] Willi Hennig. “Phylogenetic systematics”. In: Annual review of entomology 10.1 (1965), pp. 97–116.

[7] Andrew Rambaut and Nicholas C Grass. “Seq-Gen: an applica-tion for the Monte Carlo simulaapplica-tion of DNA sequence evoluapplica-tion along phylogenetic trees”. In: Bioinformatics 13.3 (1997), pp. 235– 238.

[8] David F Robinson and Leslie R Foulds. “Comparison of phylo-genetic trees”. In: Mathematical biosciences 53.1-2 (1981), pp. 131– 147.

[9] Naruya Saitou. Introduction to evolutionary genomics. Springer, 2016.

(31)

24 BIBLIOGRAPHY

[10] J. Seward, N. Nethercote, and J. Weidendorfer. Valgrind 3.3 - Ad-vanced Debugging and Profiling for GNU/Linux Applications. Net-work Theory Ltd., 2008.ISBN: 0954612051, 9780954612054. [11] Joel Sjöstrand et al. “GenPhyloData: realistic simulation of gene

family evolution”. In: BMC bioinformatics 14.1 (2013), p. 209. [12] Jeet Sukumaran and Mark T Holder. “DendroPy: a Python

li-brary for phylogenetic computing”. In: Bioinformatics 26.12 (2010), pp. 1569–1571.

[13] Yee W Teh, Hal Daume III, and Daniel M Roy. “Bayesian ag-glomerative clustering with coalescents”. In: Advances in Neural Information Processing Systems. 2008, pp. 1473–1480.

[14] Tomáš Flouri, Diego Darriba, Kassian Kobert, Mark T. Holder, Alexey Kozlov, Alexandros Stamatakis. API Reference. https: //github.com/xflouris/libpll/wiki/API-Reference. 2016.

[15] Liangliang Wang, Alexandre Bouchard-Côté, and Arnaud Doucet. “Bayesian phylogenetic inference using a combinatorial sequen-tial Monte Carlo method”. In: Journal of the American Statistical Association 110.512 (2015), pp. 1362–1374.

(32)

Appendix A

Implementations

Algorithm 1 General program structure of pll-smc-main, including all

PLL function calls.

1: functionSMC(sequences, particle_count) 2: _{particles 2 ⇤ number of particles to use} 3: for all p 2 particles do

4: p.partition pll_partition_create(size of final tree)

5: pll_set_frequencies(p, . . . )

6: pll_set_category_rates(p, . . . )

7: pll_set_subst_params(p, . . . )

8: for all s 2 sequences do

9: pll_set_tip_states(p, s, . . . )

10: pll_compute_root_loglikelihood(p, . . . )

11: end for 12: end for

(33)

26 APPENDIX A. IMPLEMENTATIONS

13: for all i 2 {1, . . . , sequence count} do 14: if i is even then

15: RESAMPLE(upper half of particles, lower half of

parti-cles)

16: PROPOSE(lower half of particles)

17: NORMALIZE(lower half of particles)

18: else

19: RESAMPLE(lower half of particles, upper half of

parti-cles)

20: PROPOSE(upper half of particles)

21: NORMALIZE(upper half of particles)

22: end if 23: end for

24: Return particles along with their associated normalized weight 25: end function

26: functionRESAMPLE(from, to) 27: for all p 2 to do

28: p0 _{⇠ from, weighted by normalized weights}

29: p.weight p0_.weight

30: p.normalized_weight _{size of from}1

31: p.partition p0_{.partition (using memcpy)}

32: end for 33: end function

34: functionPROPOSE(particles) 35: for all p 2 particles do

36: pll_update_prob_matrices(p, . . . )

37: pll_update_partials(p, . . . )

38: pll_compute_root_loglikelihood(p, . . . )

39: end for 40: end function

41: functionNORMALIZE(particles)

42: Normalize weights

(34)

APPENDIX A. IMPLEMENTATIONS 27

Algorithm 2 General program structure of pll-smc-core, including all

PLL function calls.

1: functionSMC(sequences, particle_count)

2: reference_partition pll_partition_create(size of final tree) 3: pll_set_frequencies(reference_partition, . . . )

4: pll_set_category_rates(reference_partition, . . . ) 5: pll_set_subst_params(reference_partition, . . . ) 6: _{for all s 2 sequences do}

7: pll_set_tip_states(reference_partition, s, . . . ) 8: end for

9: pll_update_eigen(reference_partition, . . . )

10: particle Particle(reference_partition, sequences, . . .) 11: for all s 2 particle.sequences do

12: pll_core_root_loglikelihood(. . . )

13: end for

14: particles 2 ⇤ number of particles to use 15: for all p 2 particles do

16: p.weight copy(particle.weight)

17: p.normalized_weight copy(particle.normalized_weight)

18: p.particle_pointers copy(particle.particle_pointers)

19: end for

20: for all i 2 {1, . . . , sequence count} do 21: if i is even then

22: RESAMPLE(upper half of particles, lower half of

parti-cles)

23: PROPOSE(lower half of particles)

24: NORMALIZE(lower half of particles)

25: else

26: RESAMPLE(lower half of particles, upper half of

parti-cles)

27: PROPOSE(upper half of particles)

28: NORMALIZE(upper half of particles)

29: end if 30: end for

31: Return particles along with their associated normalized weight 32: end function

(35)

28 APPENDIX A. IMPLEMENTATIONS

33: functionRESAMPLE(from, to) 34: for all p 2 to do

35: p0 _{⇠ from, weighted by normalized weights}

36: p.weight p0.weight 37: p.normalized_weight 1 size of from 38: p.particle_pointers p0_{.particle_pointers} 39: end for 40: end function

41: functionPROPOSE(particles) 42: for all p 2 particles do

43: pll_core_update_pmatrix(left_edge, . . . ) 44: pll_core_update_pmatrix(right_edge, . . . ) 45: pll_core_update_partial_ii(. . . ) 46: pll_core_root_loglikelihood(. . . ) 47: end for 48: end function

49: functionNORMALIZE(particles)

50: Normalize weights

(36)

Implementing Bayesian phylogenetic tree inference with Sequential Monte Carlo and the Phylogenetic Likelihood Library

Implementing Bayesian

phylogenetic tree inference with

Sequential Monte Carlo and the

Phylogenetic Likelihood Library

ISAAC ARVESTAD

HENRIK LAGEBRAND

Implementing Bayesian

phylogenetic tree inference

with Sequential Monte Carlo

and the Phylogenetic

Likelihood Library

ISAAC ARVESTAD & HENRIK LAGEBRAND

Abstract

Sammanfattning

Contents

Chapter 1

Introduction

1.1 Problem statement

1.2 Scope

1.3 Outline

Chapter 2

Background

2.1 Phylogenetic trees

2.2 Hidden Markov models

2.2.1 Definition

2.2.2 Filtering

2.3 Sequential Monte Carlo

2.3.1 Importance Sampling

2.3.2 Sequential Importance Sampling

2.3.3 Resampling

2.3.4 Algorithm

2.3.5 Applications

2.4 PosetSMC

2.4.1 Proposals and weights

2.5 Phylogenetic Likelihood Library

2.5.1 The main API

2.5.2 Core functions

Chapter 3

Methods

3.1 Implementation

3.2 Experiments

3.2.1 Comparing pll-smc implementations

Chapter 4

Results

Chapter 5

Discussion

5.1 Implementation comparison

5.2 Potential problems and limitations

5.3 Future work

Chapter 6

Conclusion

Bibliography

Appendix A

Implementations