• No results found

Asymptotics, weak convergence and duality in population genetics

N/A
N/A
Protected

Academic year: 2022

Share "Asymptotics, weak convergence and duality in population genetics"

Copied!
47
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)
(3)
(4)

A Gianmarco

(5)
(6)

v

Abstract

This thesis consists of four papers on asymptotic results and stochas- tic duality for some processes in mathematical population genetics.

The focus is on Wright-Fisher diffusions and coalescent processes, which model, respectively, the evolution of frequencies of genetic types and genealogies in a population, and play a key role in inference on genetic data sets.

Paper A concerns the derivation of a stochastic dual process for the coupled Wright-Fisher diffusion in a multi-locus population evolving under pairwise selection, which acts on pairs of loci, parent dependent mutations and free recombination. The dual process consists of block counting processes of coupled ancestral selection graphs, one for each locus. In the dual process, coalescence, mutation and single-branching events occur at one locus at a time, whereas double-branching events consist of two branching events occurring at two loci simultaneously.

The remaining three papers provide asymptotic results concerning sequences related to the Kingman coalescent, with parent dependent mutations, as the size of the initial configuration grows to infinity.

In Paper B, it is shown that the sampling probabilities of the King- man coalescent decay polynomially. The degree of the polynomial de- pends of the number of types in the model, and the multiplicative con- stant on the stationary density of the Wright-Fisher diffusion. More- over, the asymptotic behaviour of the backward transition probabilities is analysed.

In Paper C, it is proved that the normalised and time scaled jump chains of the block counting process of the Kingman coalescent weakly converge to a deterministic process. Furthermore, the time scaled chains counting the number of mutations between types weakly con- verge to independent Poisson processes with varying intensities.

Paper D focuses on establishing a theoretical framework for the asymptotic analysis of importance sampling algorithms for the coales- cent. To this aim, the weak convergence result in Paper C is extended to include a cost component. It is illustrated how the weak convergence of the cost sequence can be used to study the asymptotic behaviour of importance sampling weights, and consequently asymptotic properties of the corresponding algorithms.

(7)

vi

Sammanfattning

Avhandlingen består av fyra artiklar som behandlar asymptotiska resultat och dualitet för stokastiska processer inom matematisk popu- lationsgenetik. Huvudsakligen studeras Wright-Fisher-diffusioner och sammansmältningsprocesser, vilka beskriver evolutionen av frekvenser av olika gener respektive genealogin i en polulation och spelar en viktig roll för inferens baserad på genetisk data.

I artikel A härleds en stokastisk dual för den kopplade Wright- Fisher-diffusionen i en population med flera lokus som utvecklas under en parvis selektionsmekansim. Modellen innehåller även mutationer och fri rekombination. Den duala processen består av blockräknande processer av kopplade selektionsgrafer, en för varje lokus. I den dua- la processen inträffar sammansmältningar, mutationer och enkla för- greningar individuellt vid varje lokus. Dessutom kan det förekomma dubbel-förgreningar av två förgreningshändelser vid två lokus samti- digt.

De övriga tre artiklarna utvecklar asymptotiska resultat för en följd relaterad till Kingmans sammansmältningsprocess, med generella mu- tationer, då storleken på initiala konfigurationen går mot oändligheten.

I artikel B visas att konfigurationssannolikheten i Kingmans sam- mansmältningsprocess avtar polynomiellt, där graden beror på antalet typer i modellen. Den multiplikativa konstanten beror på stationära tätheten för Wright-Fisher-diffusionen. Vidare analyseras asymptotis- ka beteendet för övergångssannolikheterna bakåt i tiden.

I artikel C bevisas att den normaliserade och tidsskalade hoppked- jan för Kingmans sammansmältningsprocess konvergerar svagt mot en deterministisk process. Vidare konvergerar den omskalade räknepro- cessen, som räknar antalet mutationer mellan typer, mot en Poisson- process med tillståndsberoende intensitet.

Artikel D utvecklar ett teoretiskt ramverk för att studera asymp- totisk effektivitet av viktade simuleringsagoritmer för sammansmält- ningsprocesser. Resultaten för svag konvergens i artikel C utvidgas till att innehålla en generell kostnadskomponent och kan användas för att beskriva det asymptotiska beteendet av likelihoodkvoten mellan ur- sprungsfördelningen och simuleringsfördelningen.

(8)

Contents

Contents vii

Acknowledgements ix

Part I: Introduction and Summary

1 Introduction and Background 3

1.1 Mathematical population genetics and an overview . . . 3

1.2 Wright-Fisher models . . . 5

1.3 Wright-Fisher diffusions . . . 7

1.4 Coalescent processes . . . 12

1.5 Duality of Markov processes . . . 17

1.6 Inference under the coalescent . . . 18

2 Summary of Results 21

References 29

Part II: Scientific Papers Paper A

A dual process for the coupled Wright-Fisher diffusion (joint with T. Koski and H. Hult)

To appear in the Journal of Mathematical Biology Preliminary version at http://arxiv.org/abs/1906.02668

vii

(9)

viii CONTENTS

Paper B

Asymptotic behaviour of sampling and backward transition

probabilities of the coalescent with parent dependent mutations (joint with H. Hult)

Preprint: http://arxiv.org/abs/2011.04385 Paper C

Weak convergence of the scaled jump chain and number of mutations of the Kingman coalescent

(joint with H. Hult)

Preprint: http://arxiv.org/abs/2011.06908 Paper D

Asymptotic analysis of backward sampling algorithms in Kingman’s coalescent

(joint with H. Hult) In progress

(10)

Acknowledgements

I would like to begin by thanking my supervisor Henrik Hult for his guidance and for teaching me to see maths problems in a more intuitive and insightful way, to be independent, to believe in my ideas and to never give up.

I am grateful to everyone in the Mathematical Statistics division, partic- ularly to Boualem Djehiche, Timo Koski and Jimmy Olsson, for discussions about teaching and research, and to all my fellow PhD students at the Mathematics department for sharing with me these years of growth, and, sometimes, struggle.

To Paul Jenkins at Warwick University, thank you for welcoming me and sharing your knowledge and experience, and to Dario Spanò and Jere Koskela, thanks for the stimulating discussions in a friendly environment.

I would like to express my gratitude to Henrik, Michael, Federico, Giulia and Ilaria for helping with proofreading.

My warm thanks to all those people in Stockholm and Rome, even those temporarily somewhere else, who make me feel like I have two places to call home. Special thanks to the following friends: Amebs, who has been there since forever; Anna, Anna Paola, Giulia and Maurizia, who have been far away, yet so close; Jacopo and Giulio, always ready to lighten things up with a laugh; Federico, Ilaria, Francesca, Simone, Lorenzo and Ann, a

"sunny" piece of Stockholm; Gerard, Isabel, Julian, Lena, Michelle, Nasrin, Parikshit, Philippe, Samuel, and all the colleagues who became friends.

I would like to thank my family. In particular, Flavia and Jacopo, thanks for making life more fun; mamma, papà, nonna and Eleonora, thanks for always cheering me on (and an additional thanks to mum so she is happier).

Finally, to Gianmarco, simply thank you for being you.

Martina Favero Stockholm, November 2020 ix

(11)
(12)

Part I:

Introduction and Summary

1

(13)
(14)

1 Introduction and Background

1.1 Mathematical population genetics and an overview

Population genetics is a branch of biology that studies differences within and between populations, which originated by the combination of Darwin’s and Mendel’s theories of evolution and inheritance. In the mid-19th cen- tury, Darwin developed his well-known evolution theory based on natural selection, explaining how selection slowly changes the features of a popula- tion over time, without knowing how the diversity arises and is transmitted through generations. At the same time, Mendel defined his law of inheri- tance to describe how traits are inherited from parents to children, laying the foundations of genetics. However, at first these two theories seemed to be in contrast with each other. It was not until the early 20th century that the two theories were reconciled, forming the modern evolutionary synthe- sis, largely thanks to mathematical models developed by scientists such as Haldane, Fisher, and Wright. Since then, mathematical population genet- ics has been used to explain how the genetic composition of a population evolves over time under the action of various mechanisms.

The first of these mechanisms is mutation, which acts on regions of DNA, genes, producing different variants, alleles. Mutations occur randomly and are the primary source of diversity in a population. Concurrently, natural selection acts on a diverse population by increasing the frequency of bene- ficial traits and decreasing the frequency of less beneficial, or unfavourable, traits. Besides selection, there is a random force, the so-called genetic drift, which causes variations in frequencies of traits independently of their viabil- ity. This is due to the randomness in reproduction and can lead for example

3

(15)

4 CHAPTER 1. INTRODUCTION AND BACKGROUND

to the extinction of an allele or to the wide spread of an initially rare allele, which could even become the only variant of a gene, i.e. fixated. Other mech- anisms include, but are not limited to, migration, another source of diversity in a population, and recombination, the reshuffling of genetic material be- tween the parents producing an offspring with new genetic traits. The big changes, observed over long periods of time, which constitute the evolution of a population, are due to several small variations in the individuals adding up over time and are influenced by the above mentioned forces.

Mathematical population genetics provides tools for the study of the evolution of genetic traits in a population. Many successful models in math- ematical population genetics, including the ones at the core of this thesis, consist of a pair of stochastic processes: a diffusion and a random tree, or graph. The diffusion process, usually some type of Wright-Fisher diffusion, models the evolution of a vector of allele frequencies in a population forwards in time. While the natural time flow of evolution is forward, that is from the past to the present, it comes natural to reverse time when looking at genealogies, this flow of time is often referred to as backward or reverse-time.

The random tree, or graph, known as coalescent process, or ancestral graph, models the ancestry of a sample from a population backwards in time, that is from the individuals in the sample to their most recent common ancestor.

A sample from a population refers to a sample of genetic material, e.g. a gene or part(s) of it, of some individuals in the population.

The most renowned pair of processes consists of the classical Wright- Fisher diffusion and the Kingman coalescent. These two processes arise when looking at a sequence of certain finite population size models, such as Wright-Fisher models or Moran models, as the population size goes to infin- ity, provided time is properly scaled; the former as a limit of the frequencies in the population, the latter as a limit of the genealogy of a sample. Fur- thermore, they are related through a stochastic duality relationship, which can be used to transfer information from one process to the other. Deriving a duality relationship of this type is the focus of Paper A.

While the original pair only takes into account the genetic drift and, in its more general version, mutations, its numerous generalisations led to the establishment of most modern models in mathematical population ge- netics, which cover more general assumptions involving various genetic and demographic forces.

In Section 1.2 a brief introduction to Wright-Fisher models is given.

Wright-Fisher diffusions, on which Paper A is focused, and which are crucial

(16)

1.2. WRIGHT-FISHER MODELS 5

in Paper B, are presented in Section 1.3. The coalescent is introduced in Section 1.4. This is the most relevant model for this thesis. In fact, it is the underlying model in Papers B-D, and the derivation of the dual process in Paper A leads to a generalisation of it. Section 1.5 is dedicated to duality for Markov processes and its role in population genetics. This is the main topic of Paper A, but also relevant for the other papers.

The aim of this chapter is to introduce the framework that is most rele- vant for Papers A-D, thus the focus is mostly on models with mutation, with some mention of models with selection. For a more complete introduction to models in mathematical population genetics, see for example [8, 7, 38], to which this section and Sections 1.2-1.4 are partly inspired. In these sections, reference to [8, 7, 38] is implied, whenever explicit reference is omitted.

To conclude the overview, inference methods must be mentioned. Coa- lescent models are especially valuable in statistical inference on genetic data sets. For example, when a sample of genetic material is taken from some individuals in a population, the ancestral history that led to that particular sample is often unobserved, not surprisingly, the hidden history is crucial for the study of the population and to perform inference. To estimate the relevant parameters, simulation based methods can be implemented by sim- ulating the hidden ancestral history according to coalescent models. Due to the substantial advances in DNA sequencing technology in recent decades, genetic data sets have become increasingly large posing new challenges in inference. Papers B, C and D contribute to the analysis of asymptotic prop- erties of coalescent processes and importance sampling algorithms, for large sample sizes. Importance sampling algorithms for the coalescent based on backward simulation are presented in Section 1.6, providing the relevant background for Paper D.

1.2 Wright-Fisher models

Independently developed by S. Wright and R. A. Fisher [48, 40], the Wright- Fisher model describes the effect of evolutionary forces on allele frequencies over time in a population with finite fixed size N . The characteristic assump- tions in this type of models are the following: generations are discrete, i.e.

non overlapping, the population is mixing completely and is homogeneous.

Here only genetic drift, which is the fundamental force intrinsic to the model, mutation, and selection, are included; and individuals are assumed

(17)

6 CHAPTER 1. INTRODUCTION AND BACKGROUND

to be haploid, i.e. having a single copy of each chromosome, in order to illustrate the mathematical framework readily. The model can be comple- mented with additional forces such as recombination, migration, etc. leading to various generalizations of the model and the corresponding diffusion and coalescent approximations.

The alleles present in the population are labelled by 1, . . . , d, using the finite-alleles model for mutations. The Wright-Fisher model with mutations is constructed as follows. Each gene in generation k + 1 chooses its parent independently at random, with replacement, from the genes in generation k. The offspring inherits the type of the parent with probability 1 − γ, or a mutation occurs, with probability γ, and the type of the parent, say type i, is changed to type j with probability Pij. The matrix P = (Pij)di,j=1 is referred to as the mutation probability matrix.

Figure 1.1: A representation of the neutral Wright-Fisher model with mu- tations in a population of N=10 individuals with d=2 possible alleles. The initial population is constituted by half blue alleles and half yellow alleles, mutations are marked with a red dot. The blue allele is still present in the last generation thanks to mutation. Without mutation, the yellow allele would fixate in the population. The R package learnPopGen was used to create the structure of the plot.

(18)

1.3. WRIGHT-FISHER DIFFUSIONS 7

Let XN(k) ∈ {x ∈ N1Nd :Pdi=1xi = 1} be the vector of frequencies of allele types in the kth generation, i.e. N XiN(k) is the number of individuals with allele type i in generation k. Then

N XN(k + 1) ∼ Multinomial(r(XN(k))), rj(x) =

d

X

i=1

xiγij, j = 1, . . . , d, where γij = γPij, i 6= j and γii = γPii+ 1 − γ. Selection can be included in the model by considering some selection parameters si > 0, i = 1, . . . , d, and defining rj(x) =Pdi=1Pdsixiγij

k=1skxkγkj

. The larger si, the more viable type i; the higher the viability of type i, compared to that of the other types, the higher the probability that an individual chooses its parent of type i from the previous generation.

Despite their apparent simplicity, when the population size is large Wright- Fisher models become intractable, given that it becomes complicated to keep track of the genealogy of each individual in a large population. Therefore, it is natural to introduce limiting objects, as the population size goes to infinity, to which the next two sections are dedicated.

1.3 Wright-Fisher diffusions

Wright-Fisher diffusions are widely used in population genetics to model the evolution of frequencies of allele types in a population. They are character- ized by a particular structure of the diffusion matrix, shown below, which is equal to zero on the boundaries of the domain. This type of diffusion arises as the weak limit of the frequencies of a sequence of Wright-Fisher models when the population size goes to infinity and time is properly scaled.

Diffusion approximation of the frequencies in the Wright-Fisher model

Looking at Figure 1.2, it is intuitive to guess that a diffusion process might be used to approximate the vector of frequencies in a Wright-Fisher model with a large population size.

Let XN be the vector of frequencies in a neutral Wright-Fisher model with mutations for a population of size N . In order to obtain a non triv- ial limit for the sequence of frequencies, time needs to be scaled by the population size, and N γ = θ2 is fixed. Assuming that XN(0) converges in

(19)

8 CHAPTER 1. INTRODUCTION AND BACKGROUND

Figure 1.2: Genealogies in a population of size N = 50 through 150 gen- erations, using the neutral Wright-Fisher model without mutation. The R package learnPopGen was used for the plot.

distribution to X(0), as N → ∞, the sequence of frequencies {XN(bN tc)}t≥0 converges weakly in the Skorokhod space DS[0, ∞), as N → ∞, to a diffu- sion process {X(t)}t≥0 ⊂ S = {x ∈ [0, 1]d : Pdi=1xi = 1}, see e.g. [7] for more details. X solves the SDE

dX(t) = µ(X(t))dt + σ(X(t))1/2dW(t), (1.1) where W is a d dimensional Wiener process, the diffusion matrix is

σij(x) = xiij− xj), i, j = 1, . . . , d; (1.2) and the drift vector is

µi(x) =

d

X

j=1

(ujixj− uijxi), i = 1, . . . , d. (1.3)

The mutation parameters can be written as uij = θ2Pij, where θ ≥ 0 is the mutation rate and P is the mutation probability matrix. This notation is particularly useful when the Wright-Fisher diffusion is studied in relation to its dual process, the coalescent, for which this notation is more natural.

(20)

1.3. WRIGHT-FISHER DIFFUSIONS 9

Depending on which forces are included in the assumptions, the drift function µ can take different forms, while the structure of the diffusion ma- trix, which is characteristic of Wright-Fisher diffusions, remains essentially unchanged. For example, if selection is considered, in addition to mutation, the drift vector becomes µ + g, where

gi(x) =

d

X

j=1

hjσij(x), (1.4)

with hi ≥ 0, i = 1, . . . , d, indicating the viability of each type.

Infinitesimal generator and Kolmogorov equations

The infinitesimal generator plays an important role in the study of diffusion processes and it is a core tool in this thesis. The infinitesimal generator of the Wright-Fisher diffusion (1.1) is the operator

LX =

d−1

X

i=1

µi(x)

∂xi +1 2

d−1

X

i,j=1

σij(x) 2

∂xi∂xj.

In the expression above, as in the rest of this section, the state space of the diffusion has been restricted to ∆ = {x ∈ [0, 1]d−1 : Pd−1i=1 xi ≤ 1}. Since the last component of the Wright-Fisher diffusion can be written in terms of the other components, when dealing with density functions the restriction is necessary. The notation for the process X on the restricted state space remains unchanged. From now on, if x ∈ ∆, xdstands for 1 −Pd−1i=1 xi.

Linked to the infinitesimal generator are the backward and forward Kol- mogorov equations which characterise the dynamics of a diffusion process.

For example, in terms of the transition density function p(t, x, y) of X, the backward Kolmogorov equation is

∂tp(t, x, y) = LXp(t, ·, y)(x).

More relevant, for the content of this thesis, is the forward Kolmogorov equation, also known as the Fokker-Planck equation, since it can be used to derive the stationary density, which is a key tool for several results in this

(21)

10 CHAPTER 1. INTRODUCTION AND BACKGROUND

thesis,

∂tp(t, x0, x) = LXp(t, x0, ·)(x) : = −

d−1

X

i=1

∂xi

i(x)p(t, x0, x)] +

d−1

X

i,j=1

2

∂xi∂xj

ij(x)p(t, x0, x)].

(1.5) The diffusion matrix of the Wright-Fisher diffusion is degenerate on the boundary of the domain, that is when one of the frequencies becomes zero, and thus the infinitesimal generator of the Wright-Fisher diffusion is not uniformly elliptic. This prevents classical results from the general theory of PDEs to be directly applicable to this case and requires specialised investi- gation.

Although Fokker-Planck equations for transition density functions of dif- fusions in population genetics do not admit an analytic solution in general, a spectral representation of the transition density function can be obtained if the eigenvalues and eigenfunctions of the generator of the diffusion are found. This is the case of equation (1.5) under the assumption of parent in- dependent mutations (PIM), that is when the type of the mutated offspring does not depend on the type of the parent, i.e. uij = uj, i, j = 1, . . . , d.

Indeed, in [13], an explicit expansion for the solution of (1.5) using orthog- onal polynomials is derived. The spectral expansion turns out to be closely related to the coalescent dual process of the diffusion.

Stationarity

Equation (1.5) can also be used to determine an explicit expression for the stationary density of the Wright-Fisher diffusion. If a stationary density p(x) = lim˜ t→∞p(t, x0, x) exists, then it satisfies

LXp(x) = 0.˜ (1.6)

Wright himself proved in [48, 49] that, in the PIM case, the stationary density is

p(x) =˜ 1 B(2u)

d

Y

i=1

x2ui i−1, (1.7)

where B(2u) = B(2u1, . . . , 2ud) is the multivariate Beta function. This means the stationary distribution of the PIM Wright-Fisher diffusion is

(22)

1.3. WRIGHT-FISHER DIFFUSIONS 11

Dirichlet distributed with parameters 2u1, . . . , 2ud. It can be easily veri- fied that ˜p solves equation (1.6).

The existence of an analytic solution in the PIM case is related to the fact that in this case the Fokker-Planck equation (1.5) can be written in divergence form, see e.g. [1, 8, 21],

∂tp(t, x0, x) = −∇ · J (t, x0, x),

where ∇·J (t, x0, x) represents the rate at which the probability flows through the point x, the “probability flux". At stationarity the flux is zero, that is, the stationary density solves ∇ · J = 0.

In the PIM case, when selection is included in the assumptions, and thus the drift function µ is replaced by µ + g, the stationary density becomes

p(x) ∝˜

d

Y

i=1

x2ui i−1e−2hixi.

Unfortunately, in the non-PIM case, an explicit expression for the sta- tionary density is not available. However, it is relevant to know when such a density exists. The results in [43] ensure, under some conditions, the ex- istence of a unique stationary distribution. For the Wright-Fisher diffusion with mutation and selection presented in this introduction, these conditions are satisfied if the mutation probability matrix P is irreducible. Further- more, in this case, by [43], the stationary distribution is mutually absolutely continuous with respect to the Lebesgue measure and its density function is smooth on ˚∆ = {x ∈ ∆ : xi > 0, i = 1, . . . , d − 1}. The existence of a unique stationary density, despite the absence of its analytic expression, is crucial to obtain the results presented in this thesis, particularly in Paper B.

Finally, it is worth mentioning that in absence of mutation, a unique sta- tionary density does not exist, in fact, any distribution that puts all its mass on one type is an invariant distribution. The Wright-Fisher distribution in this case eventually hits one of the axes, which are absorbing, and one allele fixates, as the frequencies of the Wright-Fisher model in Figure 1.2. Indeed, it is the mutation mechanism that maintains the genetic diversity in the population.

(23)

12 CHAPTER 1. INTRODUCTION AND BACKGROUND

1.4 Coalescent processes

J. F. C. Kingman developed the basic coalescent model [26, 28, 29], which is indeed known as the Kingman coalescent, to study the limit of several finite population size models, one of which being the Wright-Fisher model. As explained in [6], while other authors developed similar ideas, e.g. [14, 47, 22], it was Kingman’s rigorous formulation that had the greatest impact on the field of mathematical population genetics. In [27], Kingman describes the ideas that lead to the creation of the coalescent model:

“Three insights, in combination, comprise the essential basis of the coalescent. The first is the idea of tracing the ancestry of a gene backward in time and building up the family tree of the genes (at a particular locus) in a population sample back to the point at which they have a single common ancestor. [...] It be- comes powerful because of the second insight, that for a large class of demographic models, characterized by selective neutral- ity and constrained population size, the stochastic structure of the genealogy does not depend on the detail of the reproductive mechanism. Finally, for such models the effect of mutation is statistically independent of the genealogy.”

In fact, the Kingman coalescent arises not only as an approximation of the Wright-Fisher model (see next subsection), but also of other finite popula- tion size models, e.g. the Moran model [35]. This makes it a suitable general model for the genealogy of a sample taken from a population of large size.

The coalescent process, {G(t)}t≥0, describing the genealogy of a sample of n individuals, is a continuous time Markov chain on the state space Pn of partitions of the set {1, . . . , n}. It starts with G(0) = {{1}, . . . , {n}}, it evolves backwards in time and eventually ends with the trivial partition {{1, . . . , n}}. The partition G(t) defines an equivalence relation among the individuals in the sample: if two individuals are in the same set of the partition G(t), this means their ancestral lineages have merged by time t, i.e. they share a common ancestor in their genealogical tree within time t.

At rate 1, the coalescent jumps from a partition ξ to a partition η 6= ξ, if the latter is obtained from the former by merging exactly two sets. The time between two jumps, while G(t) consists of a partition of k sets, i.e.

while there are k lineages in the genealogy, is exponentially distributed with parameter k(k−1)2 .

(24)

1.4. COALESCENT PROCESSES 13

To include mutation in the above described coalescent, it is possible to simply superimpose a Poisson process, with constant intensity θ2, on the lineages, the points of which correspond to mutation events. By assigning a type to the most recent common ancestor, types can be assigned to the descending lineages, according to the tree structure and the mutation points.

Past

Present

Figure 1.3: On the left hand side, the same realisation of the Wright-Fisher model as in Figure 1.1, the ancestry of a sample of n = 3 individuals is highlighted. On the right hand side, a sketch of the corresponding coalescent process.

A limit of genealogies

Obtaining the coalescent as a limit of genealogies of Wright-Fisher models is natural. Consider a population of size N evolving according to the Wright- Fisher model, choose a sample of n < N individuals from the population at the present time and trace their genealogy, as illustrated in Figure 1.3.

Let {GN(k)}n∈N be the discrete time Markov chain, evolving backwards in time, on the state space of partitions Pn describing the genealogy of the sampled individuals according to the Wright-Fisher model. Two individuals of the sample have the same ancestor k generations back if they belong to the same set according to the partition GN(k). As the population size

(25)

14 CHAPTER 1. INTRODUCTION AND BACKGROUND

N → ∞, keeping the sample size n fixed, and scaling the time as to obtain the Wright-Fisher diffusion, the sequence {GN(bN tc)}t≥0 converges weakly to {G(t)}t≥0. The reason for this particular time scaling is that, fixing two individuals in the population, the number of generations back to their most recent common ancestor is geometrically distributed with success probability

1

N. Thus the expected number is N and it makes sense to consider units of time of size N . Then, in the limit, the geometric distribution scaled by N becomes an exponential distribution of parameter 1.

Transition and sampling probabilities

Often, as in this thesis, instead of working directly with the coalescent pro- cess, the jump chain, {H(k)}k∈N⊂ Nd\{0}, of the block counting process of the Kingman coalescent with mutations is considered. H counts the number of individuals, or more precisely lineages, in each block, i.e. of each type, at each jump of the coalescent. That is, Hi(k) is the number of lineages of type i at step k.

The Markov chain H, which can be seen as a state dependent random walk, can evolve either forwards or backwards in time, as the coalescent. Backward simulation of H is more natural and more useful in applications to simulate the unobserved history of a given sample. However, while forward transition probabilities are explicitly known, the backward transition probabilities are not.

The probability that H, evolving forwards in time, jumps from a state n ∈ Nd\ {0} to a state n + v, in a single transition, is

p (n + v|n) =

nj

knk1

knk1−1

knk1−1+θ if v = ej, j = 1 . . . d;

ni

knk1 θ

knk1−1+θPij if v = ej− ei, i, j = 1 . . . d;

0 otherwise

(1.8)

where ej is the unit vector in Nd. It is customary to assume that the proba- bility that H(0) = ej is ˜πj, where ˜π = (˜πj)dj=1is the invariant distribution of the mutation probability matrix P , which is assumed to be irreducible. By using the transition probabilities above, it is straightforward to compute the probability of a given realisation of H, corresponding to an ancestral tree.

Nevertheless, there is no explicit expression for the probability p(n), known as sampling probability, that H hits state n ∈ Nd\ {0} evolving forwards in time. However, having chosen ˜π as initial distribution implies that p(n)

(26)

1.4. COALESCENT PROCESSES 15

can be interpreted as a sample of knk1 individuals from a population with allele frequencies ˜X distributed according to the stationary distribution of the Wright-Fisher diffusion (1.1), see e.g. [17]. That is

p(n) = knk1 n

! E

" d Y

i=1

X˜ini

#

. (1.9)

In most cases, the stationary distribution of the Wright-Fisher diffusion is unknown and thus the expression above, while still having great theoretical value, is not directly applicable to compute p(n). For example, in Paper B this expression is crucial for the analysis of the asymptotic behaviour of the sampling probabilities for large sample sizes.

In real life applications, often only a sample from the population at the present time is observed, corresponding to a configuration n ∈ Nd\ {0}, while the hidden genealogy, corresponding to the path of H that led to n, is unobserved. In this case, the likelihood function is the unknown p(n), rather than the known probability of the path of H. Of course, p(n) can be written as a sum over all the possible realisations of H that hit state n. However, the number of these possible realisations is large, even for a moderate sample size knk1, yielding an intractable sum in practice.

A recursion formula for p(n) is known [17]

p(n) =

d

X

j=1

nj − 1

knk1− 1 + θ p(n − ej) +

d

X

i,j=1

ni+ 1 − δij knk1

θPij

knk1− 1 + θ p(n − ej+ ei),

with initial conditions p(ej) = ˜πj, j = 1, . . . , d. This formula can either be obtained by conditioning on the different possible steps back in time starting from state n, or by using the generator of the corresponding Wright-Fisher diffusion, see e.g. [17]. While the recursion formula above uniquely defines the sampling probabilities and can in theory be solved, in practice, finding a numerical solution is computationally too expensive, unless the sample size knk1 is rather moderate. The unavailability of a usable formula for the sampling probability leads to the design of numerous simulation methods to estimate p(n), as further discussed in Section 1.6.

Like the sampling probabilities, the probability that H, evolving back- wards in time, jumps from state n to state n − v, in a single transition, is

(27)

16 CHAPTER 1. INTRODUCTION AND BACKGROUND

also unknown explicitly, but can be written (see [45, 5]) as

←−p (n − v|n) =

nj(nj−1) knk1(knk1−1+θ)

1

π[j|n−ej] if v = ej, j = 1 . . . d,

θPijnj

knk1(knk1−1+θ)

π[i|n−ej]

π[j|n−ej] if v = ej− ei, i, j = 1 . . . d,

0 otherwise.

(1.10) Here π[j|n] is the probability, also unknown explicitly, that an individual of type j is sampled from the population after knk1 individuals, with types given by n, are sampled. This expression, although implicit, enables the study of the large-sample-size asymptotic behaviour of the backward tran- sition probabilities in Paper B, passing through the study of the behaviour of the probabilities π[·|n].

There is one case in which all the quantities mentioned above are ex- plicitly known, that is the PIM case. This is due to the availability of an explicit expression for the stationary density of the Wright-Fisher diffusion.

In this case, Pij = Qj, i, j = 1, . . . , d, π[j|n] = nknkj+θQj

1, and the sampling probabilities can be computed by combining (1.7) and (1.9).

Generalisations

The coalescent model, as well as the underlying Wright-Fisher model and the corresponding Wright-Fisher diffusion, can be generalised to include other genetic forces. The focus here is on generalisations of the coalescent.

Adding selection, which corresponds to modifying the drift term in the SDE of the diffusion, deeply transforms the structure of the coalescent. A selection event can be interpreted as two potential parents being preliminar- ily chosen with only the most viable of them becoming the actual parent.

This reflects in the ancestry by the addition of the virtual lineage of the po- tential parent, which, going backwards in time, corresponds to a branching of lineages. Thus the genealogical tree becomes a graph with virtual lineages in which the true genealogy is embedded, known as the ancestral selection graph and introduced in [32, 36].

Another important mechanism that can be included is recombination.

Instead of considering one location, locus, in the DNA of each individual, several locations can be considered in a multi-locus model. An individual inherits its genetic material from two parents, the material at some loci, from one parent, and the material at the other loci from the other one.

(28)

1.5. DUALITY OF MARKOV PROCESSES 17

When looking at ancestries backwards in time, branching of lineages occur, thus also in this case the tree is replaced by a graph, known as the ancestral recombination graph [15, 16].

While the two generalisations of the coalescent mentioned above are di- rectly related to some of the results in this thesis, particularly to Paper A, numerous other generalisations exist, resulting in other complex and well-known models. For example, allowing multiple mergers of lineages, instead of coalescences of only two lineages, leads to the definition of the Λ−coalescent [41, 39]; allowing simultaneous multiple mergers, i.e. mergers of several lineages occurring simultaneously, leads to the establishment of the Ξ−coalescent [34, 42]; and allowing offsprings to be created by a seed coming from few generations back, leads to the seed bank coalescent [25].

Providing a complete list goes beyond the aim of this introduction. It is worth noting that generalisations of the coalescent provide inspiration for future work consisting in the extension of the results in this thesis to more general models.

1.5 Duality of Markov processes

Duality of Markov processes with respect to a duality function has proven to be an effective tool in population genetics and various other fields, such as interacting particle systems and queuing theory. For a survey on duality of Markov processes with respect to a function, see [24]. In mathematical population genetics, by relating the backward-in-time ancestral process and the corresponding forward-in-time diffusion, it is possible to combine infor- mation from the two processes, enabling further analysis of the properties of the population.

Duality has been established for several pairs of ancestral processes and Wright-Fisher diffusions, e.g. [9, 19] and Paper A, including the ones pre- sented in the previous sections. See [19] and the reference therein for a more complete list of examples and an overview of the role of duality in popula- tion genetics. Problems that can be addressed with the support of duality in population genetics include for example the study of fixation probabilities, e.g. [12, 33, 19], exact simulation of Wright-Fisher diffusions, e.g. [23], opti- mal filtering, e.g. [37], and writing an expansion for the transition functions of Wright-Fisher diffusions, e.g. [9, 19, 2].

The coalescent with mutations and the corresponding Wright-Fisher dif-

(29)

18 CHAPTER 1. INTRODUCTION AND BACKGROUND

fusion, introduced in the previous sections, not only arise as the limit of frequencies and genealogies of the same sequence of Wright-Fisher models, but are also further related by the following duality relationship, which is used here to illustrate duality with respect to a function. Let {H(t)}t≥0 be the block counting process of the Kingman coalescent, with mutations, evolving backwards in time, and let X be the diffusion (1.1), then

E [F (X(t), n)|X(0) = x] = E [F (x, H(t))|H(0) = n] , ∀x ∈ S, ∀n ∈ Nd\{0}, where F is the duality function

F (x, n) = 1 k(n)

d

Y

i=1

xnii, with k(n) = E

" d Y

i=1

X˜ini

#

, (1.11)

and ˜X is distributed according to the unique stationary distribution of the Wright-Fisher diffusion, assuming the mutation probability matrix P is ir- reducible. In the PIM case, the function k is explicit, being the ratio of multivariate Beta functions, and thus the duality relationship is explicit as well. Whereas, in the general case, k remains implicit, being linked to the explicitly unknown sampling probability p by (1.9).

Duality of Markov process can be proved by deriving a duality relation- ship, expressed below, for the corresponding infinitesimal generators, under some conditions (see [10, 24]). Let LX and LH be the infinitesimal genera- tors of X and H respectively, then the equality

LXF (·, n)(x) = LHF (x, ·)(n), ∀x ∈ S, ∀n ∈ Nd\ {0}, (1.12) implies that X and H are dual with respect to the duality function F . Prov- ing duality of infinitesimal generators mostly consists in the lengthy task of rewriting the generator of one process, applied to the duality function with fixed second component, in terms of the generator of the second process, applied to the duality function with fixed first component. This generator approach has been used, not only to prove a duality relationship between known ancestral processes and diffusions, but also to identify the unknown ancestral process corresponding to a given diffusion, or vice versa, by iden- tifying its infinitesimal generator, e.g. [9, 19] and Paper A.

1.6 Inference under the coalescent

The main difficulty in inference under the coalescent regards the unavail- ability of an explicit expression for the likelihood function corresponding

(30)

1.6. INFERENCE UNDER THE COALESCENT 19

to a sample n ∈ N \ {0}, i.e. the sampling probability p(n), and for the backward sampling probabilities (1.10), as described in Section 1.4. The sampling probability p(n) can be written as a sum of the probabilities, which are known explicitly, of the ancestral histories that are compatible with the sample n. However, this sum is often intractable, and the high dimension and complex structure of spaces of trees also render standard numerical integration routines unusable. To address this problem, Monte Carlo methods can be used to approximate the likelihood, see [44] for an overview. These methods can be roughly grouped in two classes, that is, MCMC algorithms, moving from genealogy to genealogy, importance sam- pling and SMC algorithms, based on backward simulation of genealogies.

Methods based on forward simulations produce poor approximations, since only few simulations are compatible with the observed sample and actually contribute to the approximation. Importance sampling algorithms based on backward simulation for inference under the coalescent are introduced in [45, 5], based on the idea in [17]. Since then, backward simulation has been used for importance sampling on numerous, more general, coalescent processes, e.g. [46, 3, 18, 20, 4, 30] and for sequential Monte Carlo [31]. This thesis, in particular Paper D, contributes to the development of a theoretical framework for the asymptotic analysis, for large sample sizes, of backward importance sampling algorithms. In the following, the importance sampling estimator for the sampling probability of the Kingman coalescent based on backward simulation is briefly introduced.

Since, in the general non-PIM case, it is not possible to simulate the jump chain H of the block counting process of the Kingman coalescent according to the unknown ←−p in (1.10), substitute backward transition probabilities

←−q are chosen. Let Hq be distributed according to ←−q , starting from con- figuration n and evolving backwards in time until the most recent common ancestor is reached at time τ , then [5]

p(n) = E

"τ −1 Y

k=0

p (Hq(k)|Hq(k + 1)

←−q (Hq(k + 1)|Hq(k)|Hq(0) = n

# ,

where −→p defined in (1.8). Therefore, simulating Hlq, l = 1, . . . , L, indepen- dently, according to ←−q , given Hlq(0) = n, yields the importance sampling estimator

1 L

L

X

l=1 τl−1

Y

k=0

p (Hlq(k)|Hlq(k + 1)

←−q (Hlq(k + 1)|Hlq(k).

(31)

20 CHAPTER 1. INTRODUCTION AND BACKGROUND

The proposal ←−q should be chosen, as in [45, 5], to approximate ←p , which corresponds to the zero variance change of measure, and thus to the optimal proposal. Paper D provides a framework for the asymptotic analysis of importance sampling weights in this type of algorithms for large sample size knk1.

(32)

2 Summary of Results

Paper A: A dual process for the coupled Wright-Fisher diffusion

In this paper an ancestral process, which is dual to the coupled Wright- Fisher diffusion [1] is derived. The coupled Wright-Fisher diffusion arises in [1] as the limit of frequencies of a sequence of multi-locus Wright-Fisher models with parent dependent mutations, pairwise selection, i.e. acting on one or two loci at the time, and free recombination. This model is suitable for populations in which recombination is very high, when loci are linked mostly through selection, rather than recombination, e.g. some populations of bacteria. This results in the equations of the corresponding system of SDE being coupled in the selection drift term.

In order to illustrate the processes of Paper A, a remark on notation is necessary. Let L be the number of loci, Ml the number of possible alleles at locus l, and M =PLl=1Ml. If x is a M -dimensional vector, denote by x(l) the lth vector of length Ml in x, and by x(l)i its ith component.

The coupled Wright-Fisher diffusion, X ⊂ S = {x ∈ [0, 1]M :PMi=1l x(l)i = 1, l = 1, . . . , L}, models the evolution of the vectors of frequencies of alleles at each locus, and solves the SDE

dX(t) = µ(X(t))dt + G(X(t))dt + D1/2(X(t))dW(t),

where the mutation function µ(x) =µ(1)(x(1)), . . . , µ(L)(x(L))T , is com- posed of drift functions, µ(l), l = 1, . . . , L, of the classical form (1.3); and the diffusion matrix D(x) = diagD(1)(x(1)), . . . , D(L)(x(L)), is a block diago- nal matrix with each block, D(l), l = 1, . . . , L, corresponding to the classical

21

(33)

22 CHAPTER 2. SUMMARY OF RESULTS

diffusion matrix (1.2). The coupling is due to the function G, which describes how selection acts on single loci and pairs of loci. Let V (x) = xTh +12xTJ x, where h is a vector of selection parameters indicating the viability of each allele at each single locus, corresponding to the parameters in (1.4), and J is a matrix of pairwise selection parameters indicating the viability of each couple of alleles at two loci, then G(x) = D(x)∇V (x). If J = 0, the diffusion X would consist of L independent Wright-Fisher diffusions, X(1), . . . , X(L), with mutation and selection, as those in Section 1.3, evolving independently of each other.

The goal of Paper A is to derive a dual process for X with respect to the following duality function, which is the multi-locus generalisation of the duality function (1.11),

F (x, n) = 1 k(n)

L

Y

l=1 Ml

Y

i=1

(x(l)i )n(l)i , x ∈ S, n ∈ NM \ {0}

with

k(n) = E

L

Y

l=1 Ml

Y

i=1

( ˜Xi(l))n(l)i

,

where ˜X is distributed according to the stationary distribution of X. In order for the duality function to be well defined, it is assumed that a sta- tionary distribution exists and does not have positive mass on the axes, so that k(n) 6= 0 for all n. The case of no mutations, which does not fulfil this assumption, is treated separately.

For the derivation of the dual process, a generator approach is used, that is, duality is proven by first proving the duality relationship, analogous to (1.12), among the infinitesimal generators. The dual process derived in this paper can be interpreted as the block counting process of L ancestral selection graphs, one for each locus, evolving simultaneously, not indepen- dently (unless the pairwise selection parameters matrix J = 0), but coupled through their jump rates, which depend on the state of the dual process at all loci. More precisely, the dual process is a pure jump Markov process on the state space NM \ {0}, with the following possible jumps, from state n.

The first two types of jumps are classical, as in the Kingman coalescent, and consist of jumps to n − e(l)i , i = 1, . . . , Ml, l = 1, . . . , L, corresponding to a coalescence of type i at locus l, and jumps to n − e(l)i + e(l)j , i, j = 1, . . . Ml, l = 1, . . . , L, corresponding to a mutation from type i to type j at

(34)

23

locus l. The third type of jump is to states n+e(l)j , j = 1, . . . Ml, l = 1, . . . , L, this corresponds to a branching of type j at locus l, which is typical of an- cestral selection graphs. The last, and novel, type of jump is to states n + e(l)j + e(r)h , j = 1 . . . Ml, h = 1 . . . , Mr, l, r = 1, . . . , L, r 6= l, and corre- sponds to simultaneous double branching, of type j at locus l, and of type h at locus r. The rates of the jumps are derived in terms of the function k, which in the PIM case can be computed numerically, as shown in an example.

The dual process derived in Paper A, in addition to providing a model for the ancestry of individuals, could be used for several purposes. For example, to construct inference methods to estimate the matrix J , which corresponds to the important task of detecting couples of loci evolving under strong selective pressure [1, 11]; to derive an expansion for the transition density function of the diffusion; and to study fixation and extinction probabilities in absence of mutation.

Paper B: Asymptotic behaviour of sampling and backward transition probabilities of the coalescent with parent dependent mutations

As mentioned in Section 1.4, the sampling probabilities p(n) and the back- ward transition probabilities ←−p (n − v|n) of the Kingman coalescent are not known explicitly when mutations are parent dependent. Paper B is ded- icated to the study of their asymptotic behaviour for large sample sizes.

Questions about large-sample-size asymptotic behaviour arise in relation to the large size of modern genetic data sets. In Paper B, as well as in the rest of the thesis, a large sample is considered to be of the form ny(n), with large n ∈ N \ {0}, and y(n)n1Nd\ {0}, such that y(n)→ y ∈ Rd>0, as n → ∞.

In order to make an educated guess on what the asymptotic behaviour of p(ny(n)) could be, the PIM case, which admits explicit formulas, is studied.

In this case, the decay of p(ny(n)) turns out to be polynomial of degree d − 1, where d is the number of alleles, with the mutation parameters only affecting the multiplicative constant. This leads to the conjecture that the same polynomial decay should be observed in the non-PIM case.

The first key step to prove the conjecture is to rewrite the sampling

(35)

24 CHAPTER 2. SUMMARY OF RESULTS

probabilities in a convenient form, using (1.9), that is

p(ny(n)) = ny(n)

1

ny(n)

! 1

B(ny(n))E

hp(D˜ (n))i,

where D(n)is Dirichlet distributed with concentration parameters ny(n)+ 1, 1 being the vector with all components equal to 1; B is the Beta function and ˜p is the stationary density of the Wright-Fisher diffusion (1.1). It is assumed that the mutation matrix P is irreducible, so that the unique sta- tionary density exists. Subsequently, central and local limit theorems for the sequence of Dirichlet random vectors are first derived and then applied to prove the following asymptotic result for the sampling probabilities, as- suming y(n)→ y ∈ Rd>0, as n → ∞,

n→∞lim nd−1 p(ny(n)) = kyk1−d1 p˜ y kyk1

! ,

which confirms that the decay is indeed polynomial, as conjectured. This is the main contribution of Paper B.

Having derived the limit above enables the study of the behaviour of the backward transition probabilities ←−p in (1.10). Let H(n), n ∈ N, be indepen- dent copies of the jump chain of the block counting process of the Kingman coalescent, then the probability ←−p (ny(n)− v|ny(n)) can be interpreted as the probability that the scaled jump chain n1H(n), evolving backwards in time, jumps from state y(n)to state y(n)1nv. First, the following limit for the explicitly unknown probabilities π, appearing in the expression for the backward transition probabilities (1.10), is proved

n→∞lim π[i|ny(n)] = yi

kyk1.

Finally, the asymptotic behaviour of the transition probabilities is obtained

n→∞lim

←−p (ny(n)− ej|ny(n)) = yj

kyk1, i = 1, . . . , d;

n→∞lim n←p (ny(n)− ej− ei|ny(n)) = θPijyi

kyk21 , i, j = 1, . . . , d.

(2.1)

References

Related documents

46 Konkreta exempel skulle kunna vara främjandeinsatser för affärsänglar/affärsängelnätverk, skapa arenor där aktörer från utbuds- och efterfrågesidan kan mötas eller

The increasing availability of data and attention to services has increased the understanding of the contribution of services to innovation and productivity in

Av tabellen framgår att det behövs utförlig information om de projekt som genomförs vid instituten. Då Tillväxtanalys ska föreslå en metod som kan visa hur institutens verksamhet

Generella styrmedel kan ha varit mindre verksamma än man har trott De generella styrmedlen, till skillnad från de specifika styrmedlen, har kommit att användas i större

a) Inom den regionala utvecklingen betonas allt oftare betydelsen av de kvalitativa faktorerna och kunnandet. En kvalitativ faktor är samarbetet mellan de olika

Parallellmarknader innebär dock inte en drivkraft för en grön omställning Ökad andel direktförsäljning räddar många lokala producenter och kan tyckas utgöra en drivkraft

Närmare 90 procent av de statliga medlen (intäkter och utgifter) för näringslivets klimatomställning går till generella styrmedel, det vill säga styrmedel som påverkar

The results of the present study also showed that, con- sistent with earlier studies [9, 11], girls scored signifi- cantly higher on the total SCOFF and that they were