Individual Project in Computational Biology

Applying Hidden Markov Models to RNA-seq data

Monica Golumbeanu Student number: 880703 - T300

monicago@kth.se

January 2, 2013

Abstract

Enterococcus faecalis is one of the most controversial commensal bacteria of the human intestinal flora and is also responsible for lethal nosocomial infections. Determining the factors that influence its pathogenicity is at present a great challenge. Cutting-edge approaches analyze the E. faecalis bacterium through next-generation RNA-sequencing technology.

Since next-generation sequencing is recent and yields large amounts of data, there is a continuous need for appropriate statistical methods to interpret its output. We propose an approach based on hidden Markov models to explore RNA-seq data and show an example of how this statistical tool can be applied to detect transcription start sites. We compare this application with a previously developed method based on signal processing.

1 Introduction

Hidden Markov models (HMMs) have been widely used for analyzing sequencing data. Their applications are diverse and cover research topics such as gene identification, alignment, non-coding RNA detection and protein secondary structure prediction [13]. In the present study, we explore the use of two different HMMs for determining the transcripts and their transcription levels, respectively, starting from RNA sequencing (RNA-seq) data of the Enterococcus faecalis bacterium.

Enterococcus faecalis is a commensal bacterium which commonly populates the human gastrointestinal tract and can produce lethal infections. Its robustness has ensured its survival in the prophylactic hospital environment, its defense against the host's immune system and a strong resistance to antimicrobial agents [7]. Understanding how the E. faecalis bacterium operates and which factors influence its pathogenicity is thus a great challenge.

Applying HMMs to detect and analyze the different transcriptionally active elements within the transcriptome of E. faecalis would provide important information about the genes shaping the bacterium's behaviour and contribute to establishing new treatments. It would be important, for example, to obtain a complete image of the most transcriptionally active elements and of their relation to the phenotypic traits of the bacterium (e.g. resistance to antibiotics). Knowing these elements would contribute to better defining the targets of novel drugs and therapies.


Figure 1: Description of the workflow used to obtain the different RNA-seq data sets from a culture of E. faecalis v583: cultivation, RNA extraction, removal of ribosomal RNA (Treated) or no removal (Crude), library preparation with either the 'home made' kit (FRAG) or the barcoding kit (BC), and sequencing of the resulting libraries.

2 Materials and Methods

2.1 Data preparation and extraction

One colony of E. faecalis v583 was grown in a static, anaerobic Brain Heart Infusion environment at 37°C (figure 1). Afterwards, the RNA was extracted from the cells. Half of the sample was cleared of ribosomal RNA (Treated sample) while the other half was left intact (Crude sample). The DNA Sequencing and Genomics Laboratory of the Institute of Biotechnology of the University of Helsinki in Finland performed the sequencing of the two samples on the SOLiD sequencing platform [11]. Two different fragment libraries were used during the sequencing procedure: one library prepared with an Applied Biosystems kit (BC) and the other one with the laboratory's own kit (FRAG).

Even though there are different sequencing platforms (e.g. SOLiD [11], Illumina [2]), the RNA-seq procedure relies on the same principles (figure 2). The method consists of extracting the different transcripts and converting them into double-stranded DNA (dsDNA). The dsDNA is then sheared into small fragments and adapters are ligated to them, such that a sequencing library is formed. The first 50 nucleotides of each fragment are sequenced and, in this way, a large set of reads is obtained.

In our study, we have used the Crude BC dataset, in which the ribosomal RNA was not removed. The reads obtained from the RNA-seq experiment were aligned to the genome using the Bowtie software [5]. For every position on the genome, the number of reads that were mapped there was counted, and the resulting signal was coined the "coverage depth".
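To illustrate this step, the following minimal Python sketch (the function and variable names are illustrative and not part of the original pipeline) computes a coverage depth signal from a list of 0-based read start positions, assuming all reads have the same length:

import numpy as np

def coverage_depth(read_starts, read_len, genome_len):
    # Count, for every genome position, how many aligned reads cover it.
    depth = np.zeros(genome_len, dtype=int)
    for start in read_starts:
        depth[start:start + read_len] += 1
    return depth

# Example: three reads of length 5 mapped to a genome of length 12
print(coverage_depth([0, 2, 2], 5, 12))   # [1 1 3 3 3 2 2 0 0 0 0 0]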

2.2 The general HMM framework

2.2.1 General definition

An HMM is a directed acyclic graph consisting of a discrete-time, discrete-state Markov chain of hidden states, Y = \{y_t\}_{t=1}^{T}, and a set of observed states, X = \{x_t\}_{t=1}^{T}, connected through an observation model p(x_t \mid y_t) [8]. Unlike the hidden states, which are discrete, the observations in an HMM can be either continuous or discrete. In the case of discrete variables, the observation model is represented by an observation matrix, also named the emission matrix.

The following parameters are needed to define an HMM with discrete observed states:

• The number of hidden states: M

• The number of observed states: N


Figure 2: Workflow of a standard RNA-seq experiment: extraction of RNAs, conversion into dsDNA and shearing, adapter ligation and amplification, immobilization and sequencing.

• The initial distribution of the hidden states:

\Pi = \{\pi_i\}_{1 \le i \le M} = \{p(y_1 = i)\}_{1 \le i \le M}    (1)

• The transition probabilities between two hidden states, 1 \le i, j \le M:

a_{ij} = P(y_t = j \mid y_{t-1} = i)    (2)

These probabilities are contained within the transition matrix a = (a_{ij})_{1 \le i, j \le M}, where a_{ij} represents the probability of transitioning from hidden state i to hidden state j (figure 3).

• The emission probabilities

e_{ij} = P(x_t = j \mid y_t = i)    (3)

where 1 \le i \le M and 1 \le j \le N. These probabilities are contained within the emission matrix e = (e_{ij})_{1 \le i \le M, 1 \le j \le N}, where e_{ij} represents the probability of emitting the observed state j from the hidden state i (figure 3).

Figure 3 displays an example of an HMM and is useful for understanding the structure of an HMM as well as its internal dependencies. Since the HMM is a directed acyclic graph, it is straightforward to write the corresponding joint probability distribution:

p(y_1, y_2, \ldots, y_T, x_1, x_2, \ldots, x_T) = p(y_1, \ldots, y_T)\, p(x_1, \ldots, x_T \mid y_1, \ldots, y_T)
= \left[ p(y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}) \right] \left[ \prod_{t=1}^{T} p(x_t \mid y_t) \right]
= \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1} y_t} \prod_{t=1}^{T} e_{y_t x_t}    (4)


Figure 3: Example of a realization of an HMM with hidden states (y_t, t \in \{1, \ldots, T\}) and observed states (x_t, t \in \{1, \ldots, T\}). The arrows represent the transition (a_{y_{t-1} y_t}) and emission (e_{y_t x_t}) probabilities.
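As a concrete illustration of these definitions, the following minimal Python sketch (a toy example with made-up numbers, not taken from the data analyzed here) stores the parameters \Pi, a and e as arrays and evaluates the joint probability of equation 4 for one realization of hidden and observed states:

import numpy as np

# Toy HMM with M = 2 hidden states and N = 3 observation symbols
pi = np.array([0.6, 0.4])               # initial distribution, pi_i = p(y_1 = i)
a = np.array([[0.9, 0.1],               # transition matrix, a_ij = p(y_t = j | y_{t-1} = i)
              [0.2, 0.8]])
e = np.array([[0.7, 0.2, 0.1],          # emission matrix, e_ij = p(x_t = j | y_t = i)
              [0.1, 0.3, 0.6]])

def joint_log_prob(y, x):
    # log p(y_1..y_T, x_1..x_T) following equation 4
    logp = np.log(pi[y[0]]) + np.log(e[y[0], x[0]])
    for t in range(1, len(y)):
        logp += np.log(a[y[t - 1], y[t]]) + np.log(e[y[t], x[t]])
    return logp

print(joint_log_prob(y=[0, 0, 1], x=[0, 1, 2]))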

2.2.2 Inference in HMMs - the Baum-Welch algorithm

HMMs have the advantage of capturing long range dependencies between the observed variables through the latent variables. The general HMM inference problem seeks to estimate the hidden states starting from the observed variables, in other words, to calculate p(y_t \mid x_1, x_2, \ldots, x_t) in an online learning process or p(y_t \mid x_1, x_2, \ldots, x_T) in an offline setting.

An online setting is a process where we observe the states sequentially and want to compute the hidden variables on the spot. In contrast, an offline setting assumes that we observe all the variables and look for the most likely hidden states behind the observations, i.e. p(y_t \mid x_1, x_2, \ldots, x_T). We will focus on the offline learning setting since we will model our problem accordingly. Therefore, the goal will be to compute the probabilities of the hidden states given all the observations. Then, for each observation, we will choose the hidden state with the highest probability. The key to solving this inference problem is the fact that we can separate the Markov chain into two parts - the past and the future:

\alpha_i(t) = p(x_{1:t}, y_t = i)
\beta_i(t) = p(x_{t+1}, \ldots, x_T \mid y_t = i)    (5)

We have used the notation x_{1:t} = \{x_1, x_2, \ldots, x_t\}. The \alpha_i(t) are also called forward coefficients, while the \beta_i(t) are called backward coefficients. We can write these coefficients as functions of the HMM parameters by using d-separation [10] and the conditional independence properties of the graphical model (figure 3) as follows:

\alpha_i(t) = p(x_{1:t}, y_t = i)
= \sum_{j=1}^{M} p(x_{1:t}, y_t = i, y_{t-1} = j) = \sum_{j=1}^{M} p(x_{1:t}, y_t = i \mid y_{t-1} = j)\, p(y_{t-1} = j)
= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\, p(x_{1:t-1}, y_{t-1} = j, y_t = i)
= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\, p(y_t = i \mid x_{1:t-1}, y_{t-1} = j)\, p(y_{t-1} = j, x_{1:t-1})
= \sum_{j=1}^{M} p(x_t \mid y_t = i)\, p(y_t = i \mid y_{t-1} = j)\, \alpha_j(t-1)
= \sum_{j=1}^{M} e_{i x_t}\, a_{ji}\, \alpha_j(t-1)    (6)


\beta_i(t) = p(x_{t+1}, x_{t+2}, \ldots, x_T \mid y_t = i) = \sum_{j=1}^{M} p(x_{t+1:T}, y_{t+1} = j \mid y_t = i)
= \sum_{j=1}^{M} p(x_{t+2:T} \mid y_{t+1} = j, y_t = i, x_{t+1})\, p(x_{t+1} \mid y_{t+1} = j, y_t = i)\, p(y_{t+1} = j \mid y_t = i)
= \sum_{j=1}^{M} \beta_j(t+1)\, e_{j x_{t+1}}\, a_{ij}    (7)

Above we have found the main recurrence relations for the forward and backward coefficients.

The following initial conditions allow the dynamic computation of all the remaining coefficients; the resulting procedure is known as the forward-backward algorithm [14]:

\alpha_i(1) = p(x_1, y_1 = i) = \pi_i\, e_{i x_1}
\beta_i(T) = p(x_{T+1:T} \mid y_T = i) = p(\emptyset \mid y_T = i) = 1    (8)

In order to calculate p(y_t \mid x_1, x_2, \ldots, x_T), the maximum likelihood parameters of the model need to be estimated: \theta = \{\pi, a, e\}. The most common method to find these parameters is the EM (Expectation-Maximization) algorithm for HMMs, also known as the Baum-Welch algorithm [12]. The Baum-Welch algorithm iteratively alternates between two main steps. It starts with random values for the parameters; in the first step (E-step), it computes the expected complete-data log likelihood of the new parameters with respect to the old parameters. In the second step (M-step), it updates the parameters such that the updated values maximize the expected likelihood calculated in the E-step.
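The forward and backward recursions above translate directly into code. A minimal Python sketch is given below; the per-position rescaling is an implementation detail not discussed above, added here only to avoid numerical underflow on genome-length chains, and the function name is illustrative:

import numpy as np

def forward_backward(pi, a, e, x):
    # pi: (M,) initial distribution; a: (M, M) with a[j, i] = p(y_t = i | y_{t-1} = j)
    # e: (M, N) with e[i, j] = p(x_t = j | y_t = i); x: observation sequence of length T
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M)); beta = np.ones((T, M)); scale = np.zeros(T)
    # forward pass: alpha_i(t) = e_{i x_t} * sum_j a_{ji} * alpha_j(t-1)
    alpha[0] = pi * e[:, x[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = e[:, x[t]] * (alpha[t - 1] @ a)
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    # backward pass: beta_i(t) = sum_j a_{ij} * e_{j x_{t+1}} * beta_j(t+1)
    for t in range(T - 2, -1, -1):
        beta[t] = (a @ (e[:, x[t + 1]] * beta[t + 1])) / scale[t + 1]
    return alpha, beta, scale   # the log likelihood of x is np.log(scale).sum()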

In the E-step, the expected complete-data log likelihood of the new parameters \theta is calculated with respect to the old parameters \theta^{old}:

Q(\theta, \theta^{old}) = E_{p(Y \mid X, \theta^{old})}[\log p(X, Y \mid \theta)] = E_{p(Y \mid X, \theta^{old})}\!\left[ \log \left( \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1} y_t} \prod_{t=1}^{T} e_{y_t x_t} \right) \right]
= E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{k=1}^{M} \pi_k^{I(y_1 = k)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{t=2}^{T} \prod_{i=1}^{M} \prod_{j=1}^{M} a_{ji}^{I(y_t = i,\, y_{t-1} = j)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{t=1}^{T} \prod_{i=1}^{M} \prod_{j=1}^{N} e_{ij}^{I(x_t = j,\, y_t = i)} \right]
= \sum_{k=1}^{M} E_{p(Y \mid X, \theta^{old})}[I(y_1 = k)] \log \pi_k + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} E_{p(Y \mid X, \theta^{old})}[I(y_t = i, y_{t-1} = j)] \log a_{ji} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} E_{p(Y \mid X, \theta^{old})}[I(x_t = j, y_t = i)] \log e_{ij}
= \sum_{k=1}^{M} p(y_1 = k \mid x_{1:T}, \theta^{old}) \log \pi_k + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old}) \log a_{ji} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} p(y_t = i \mid x_{1:T}, \theta^{old})\, I(x_t = j) \log e_{ij}    (9)


Using the notations:

\gamma_i(t) = p(y_t = i \mid x_{1:T}, \theta^{old})
\psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old})    (10)

the expression of the expected log likelihood becomes:

Q(\theta, \theta^{old}) = \sum_{k=1}^{M} \gamma_k(1) \log \pi_k + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} \psi_{ji}(t) \log a_{ji} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(t)\, I(x_t = j) \log e_{ij}    (11)

Note that \gamma_i(t) and \psi_{ji}(t) are both dependent only on the old parameters:

\gamma_i(t) = p(y_t = i \mid x_{1:T}, \theta^{old}) = p(y_t = i \mid x_{1:t}, x_{t+1:T}, \theta^{old})
= \frac{p(x_{t+1:T} \mid y_t = i, x_{1:t}, \theta^{old})\, p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, x_{1:t}, \theta^{old})\, p(y_t = j, x_{1:t} \mid \theta^{old})}
= \frac{p(x_{t+1:T} \mid y_t = i, \theta^{old})\, p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, \theta^{old})\, p(y_t = j, x_{1:t} \mid \theta^{old})}
= \frac{\beta_i^{old}(t)\, \alpha_i^{old}(t)}{\sum_{j=1}^{M} \beta_j^{old}(t)\, \alpha_j^{old}(t)}    (12)

\psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old}) = \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{p(x_{1:T} \mid \theta^{old})}
= \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})}
= \frac{p(x_t \mid y_t = i, \theta^{old})\, p(x_{t+1:T} \mid y_t = i, \theta^{old})\, p(y_t = i \mid y_{t-1} = j, \theta^{old})\, p(y_{t-1} = j, x_{1:t-1} \mid \theta^{old})}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})}
= \frac{e_{i x_t}^{old}\, \beta_i^{old}(t)\, a_{ji}^{old}\, \alpha_j^{old}(t-1)}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} e_{i' x_t}^{old}\, \beta_{i'}^{old}(t)\, a_{j'i'}^{old}\, \alpha_{j'}^{old}(t-1)}    (13)

In the M-step, the new parameters \theta are calculated by maximizing the expected log likelihood from the E-step (equation 11) and thus solving:

\frac{\partial Q(\theta, \theta^{old})}{\partial \pi_k} = 0, \qquad \frac{\partial Q(\theta, \theta^{old})}{\partial a_{ji}} = 0, \qquad \frac{\partial Q(\theta, \theta^{old})}{\partial e_{ij}} = 0    (14)

By solving these equations, we will obtain the updated values of the HMM parameters as functions of the old parameters.


To determine \pi_k, we use the constraint \sum_{i=1}^{M} \pi_i = 1; therefore we can write Q(\theta, \theta^{old}) as a function of \pi_k using a Lagrange multiplier:

Q(\theta, \theta^{old}) = f(\pi_k) = \sum_{k=1}^{M} \gamma_k(1) \log \pi_k + \lambda \left( \sum_{i=1}^{M} \pi_i - 1 \right) + \text{const.}    (15)

where const. stands for a constant term. We then set the partial derivatives to zero:

\frac{\partial f(\pi_k)}{\partial \pi_k} = 0 \;\Rightarrow\; \gamma_k(1) \frac{1}{\pi_k} + \lambda = 0 \;\Rightarrow\; \pi_k = -\frac{\gamma_k(1)}{\lambda}
\frac{\partial f(\pi_k)}{\partial \lambda} = 0 \;\Rightarrow\; \sum_{i=1}^{M} \pi_i = 1 \;\Rightarrow\; \sum_{i=1}^{M} -\frac{\gamma_i(1)}{\lambda} = 1 \;\Rightarrow\; \lambda = -\sum_{i=1}^{M} \gamma_i(1)

\Rightarrow\; \pi_k = \frac{\gamma_k(1)}{\sum_{i=1}^{M} \gamma_i(1)}    (16)

Similarly, for a_{ji} the constraint \sum_{i'=1}^{M} a_{ji'} = 1, \forall j applies, and we again use Lagrange multipliers:

Q(\theta, \theta^{old}) = f(a_{ji}) = \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} \psi_{ji}(t) \log a_{ji} + \sum_{j'=1}^{M} \lambda_{j'} \left( \sum_{i'=1}^{M} a_{j'i'} - 1 \right) + \text{const.}    (17)

and therefore we have that:

\frac{\partial f(a_{ji})}{\partial a_{ji}} = 0 \;\Rightarrow\; \sum_{t=2}^{T} \psi_{ji}(t) \frac{1}{a_{ji}} + \lambda_j = 0 \;\Rightarrow\; a_{ji} = -\frac{\sum_{t=2}^{T} \psi_{ji}(t)}{\lambda_j}
\frac{\partial f(a_{ji})}{\partial \lambda_j} = 0 \;\Rightarrow\; \sum_{i'=1}^{M} a_{ji'} - 1 = 0 \;\Rightarrow\; \lambda_j = -\sum_{t=2}^{T} \sum_{i'=1}^{M} \psi_{ji'}(t)

\Rightarrow\; a_{ji} = \frac{\sum_{t=2}^{T} \psi_{ji}(t)}{\sum_{t=2}^{T} \sum_{i'=1}^{M} \psi_{ji'}(t)}    (18)

We can simplify this form by using the definitions of \psi_{ji}(t) and \gamma_i(t) from equation 10 and observing the marginalization in the denominator of equation 18:

a_{ji} = \frac{\sum_{t=2}^{T} p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old})}{\sum_{t=2}^{T} \sum_{i'=1}^{M} p(y_t = i', y_{t-1} = j \mid x_{1:T}, \theta^{old})} = \frac{\sum_{t=2}^{T} p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old})}{\sum_{t=2}^{T} p(y_{t-1} = j \mid x_{1:T}, \theta^{old})}

\Rightarrow\; a_{ji} = \frac{\sum_{t=2}^{T} \psi_{ji}(t)}{\sum_{t=2}^{T} \gamma_j(t-1)}    (19)

Finally, the e_{ij} are subject to a constraint as well, \sum_{j'=1}^{N} e_{ij'} = 1, \forall i, and we can write:

Q(\theta, \theta^{old}) = f(e_{ij}) = \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{t:\, x_t = j} \gamma_i(t) \log e_{ij} + \sum_{i'=1}^{M} \lambda_{i'} \left( \sum_{j'=1}^{N} e_{i'j'} - 1 \right) + \text{const.}    (20)

We have, as before:

\frac{\partial f(e_{ij})}{\partial e_{ij}} = 0 \;\Rightarrow\; \sum_{t:\, x_t = j} \gamma_i(t) \frac{1}{e_{ij}} + \lambda_i = 0 \;\Rightarrow\; e_{ij} = -\frac{\sum_{t:\, x_t = j} \gamma_i(t)}{\lambda_i}
\frac{\partial f(e_{ij})}{\partial \lambda_i} = 0 \;\Rightarrow\; \sum_{j'=1}^{N} e_{ij'} - 1 = 0 \;\Rightarrow\; \lambda_i = -\sum_{j'=1}^{N} \sum_{t:\, x_t = j'} \gamma_i(t)

\Rightarrow\; e_{ij} = \frac{\sum_{t:\, x_t = j} \gamma_i(t)}{\sum_{j'=1}^{N} \sum_{t:\, x_t = j'} \gamma_i(t)}    (21)


Note that \sum_{j'=1}^{N} \sum_{t:\, x_t = j'} \gamma_i(t) = \sum_{t=1}^{T} \gamma_i(t), therefore:

e_{ij} = \frac{\sum_{t:\, x_t = j} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}    (22)
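The update rules 16, 19 and 22 then amount to normalized sums of the E-step quantities. A minimal sketch for the fully discrete case (before introducing the Poisson parameterization of the next section), using the gamma and psi arrays produced by the e_step sketch above; the function name is illustrative:

import numpy as np

def m_step(gamma, psi, x, N):
    # pi_k = gamma_k(1) / sum_i gamma_i(1)
    pi = gamma[0] / gamma[0].sum()
    # a_{ji} = sum_{t>=2} psi_{ji}(t) / sum_{t>=2} gamma_j(t-1)
    a = psi[1:].sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # e_{ij} = sum_{t: x_t = j} gamma_i(t) / sum_t gamma_i(t)
    x = np.asarray(x)
    M = gamma.shape[1]
    e = np.zeros((M, N))
    for j in range(N):
        e[:, j] = gamma[x == j].sum(axis=0)
    e /= e.sum(axis=1, keepdims=True)
    return pi, a, e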

The computation of the emission probabilities can be improved by looking for emission proba- bilities of a certain form (e.g. following the Poisson distribution). This technique can be used on the emission probabilities because we know the observed states. Since we don’t have any information about the hidden states and we want them to cover as much as dependencies as possible, we do not apply this method to compute transition probabilities. The distribution for the emission probabilities is chosen according to the quantity to model. We will show in the following sections how we chose an appropriate distribution to deal with RNA-seq data.

2.3 Application of HMMs on RNA-seq data

The HMM framework was used to analyze the RNA-seq data obtained from E. faecalis. Two models were designed. In both cases, the observed states are represented by the values of the coverage signal and the hidden states model the level of amplification of the signal (duplicate, triplicate, etc.). Therefore, the implemented algorithm takes as an input the coverage signal for the entire genome and returns the corresponding values of the hidden states which constitute the predicted signal. The predicted signal tells which regions are amplified or deleted (figure 4).

Figure 4: Example of a generated coverage depth signal (black) and the signal predicted with the HMM (red). The background coverage is estimated around the value of 10 and there are regions with average amplification levels of 40, 30 and 20, respectively. A deletion can be observed as well, starting at position 6000. The red signal corresponds to the output of the HMM algorithm, which detects all these regions.

2.3.1 The standard HMM model

The straightforward way to apply an HMM to RNA-seq data is to use the standard framework described in the previous section and to pick a distribution to model the emission probabilities of the HMM. The chain has the length of the genome and has the structure shown in figure 3. Based on the literature [1, 6, 3], we have chosen a Poisson distribution to model the probability that a


certain number of reads map at a certain position of the genome, located in a region which is amplified a certain number of times:

e_{ij} = p(j \text{ reads map in a region amplified } i \text{ times}) = p(x_t = j \mid y_t = i) = \mathrm{Poisson}(j;\, c_i) = \frac{(c_i)^j}{j!} e^{-c_i}    (23)

where c_i represents how many times the background coverage depth has been amplified at position t. In other words, c_1 is the level of the background signal, c_2 stands for a duplicate, c_3 for a triplicate, etc.

The transition probabilities a_{ji} and the initial distribution of the hidden states \pi_k are calculated by using the previously found relations (equations 19 and 16), while the emission probabilities are found using the maximum likelihood estimate of the parameters c_i of the Poisson distribution.

To find the maximum likelihood estimate for ci, the following equation is solved:

\frac{\partial Q(\theta, \theta^{old})}{\partial c_i} = 0    (24)

where the expression of the expected log likelihood is given by:

Q(\theta, \theta^{old}) = \underbrace{\sum_{k=1}^{M} \gamma_k(1) \log \pi_k + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} \psi_{ji}(t) \log a_{ji}}_{\text{const.}} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(t)\, I(x_t = j) \log e_{ij}
= f(c_i) = \text{const.} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(t)\, I(x_t = j) \log \left( \frac{(c_i)^j}{j!} e^{-c_i} \right)
= \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{t:\, x_t = j} \left[ \gamma_i(t)\, j \log c_i - \gamma_i(t)\, c_i \right] + \text{const.}    (25)

We have that:

\frac{\partial f(c_i)}{\partial c_i} = 0 \;\Rightarrow\; \sum_{j=1}^{N} \sum_{t:\, x_t = j} \left( \gamma_i(t)\, j \frac{1}{c_i} - \gamma_i(t) \right) = 0 \;\Rightarrow\; c_i = \frac{\sum_{t=1}^{T} \gamma_i(t)\, x_t}{\sum_{t=1}^{T} \gamma_i(t)}    (26)

Once the c_i parameters are computed, the emission probabilities e_{ij} are obtained with the Poisson formula in equation 23. We have now deduced the update formulas for all the parameters of the HMM.
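A minimal sketch of this Poisson-parameterized emission update (equations 26 and 23); scipy's poisson.pmf provides the probability mass function, the observation symbols are assumed to be the coverage values 0, 1, ..., N-1, and the function name is illustrative:

import numpy as np
from scipy.stats import poisson

def update_poisson_emissions(gamma, x, N):
    # c_i = sum_t gamma_i(t) * x_t / sum_t gamma_i(t)     (equation 26)
    x = np.asarray(x)
    c = gamma.T @ x / gamma.sum(axis=0)
    # e_{ij} = Poisson(j; c_i)                            (equation 23)
    e = poisson.pmf(np.arange(N)[None, :], c[:, None])
    return c, e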

The complete Baum-Welch algorithm is as follows: an initial value for the parameters is set and the expected log likelihood of the data is calculated; afterwards, the parameters are updated with respect to the old parameters by using the update rules 16, 19 and 26, and the log likelihood is calculated with the new values. If the likelihood has converged, the iteration stops; otherwise, the old parameters receive the values of the updated parameters and the update rules are used to compute new parameters (algorithm 1).


Figure 5: Example of a realization of an autoregressive HMM where the observed states are organized in a first-order Markov chain. Note that the emission probabilities are a function of the current observed state, the corresponding hidden state, and the previously observed state.

Algorithm Baum-Welch:

1. find an initial setting for the parameters: \theta^0 = \{\pi_k^0, a_{ji}^0, e_{ij}^0\};
2. set \theta^{old} = \theta^0;
3. update \theta with respect to \theta^{old} by using the update rules
   \pi_k = \gamma_k(1) / \sum_{i=1}^{M} \gamma_i(1), \quad a_{ji} = \sum_{t=2}^{T} \psi_{ji}(t) / \sum_{t=2}^{T} \gamma_j(t-1), \quad c_i = \sum_{t=1}^{T} \gamma_i(t)\, x_t / \sum_{t=1}^{T} \gamma_i(t);
4. set \theta^{old} = \theta;
5. go to 3 unless the expected log likelihood has converged.

Algorithm 1: Training of the HMM with the EM framework within the Baum-Welch algorithm.
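For completeness, a minimal Python sketch of this training loop is given below, composed from the forward_backward, e_step and update_poisson_emissions sketches shown earlier; the initialization and the convergence tolerance are arbitrary illustrative choices, and the coverage values in x are assumed to be integers in 0, ..., N-1:

import numpy as np
from scipy.stats import poisson

def baum_welch(x, M, N, n_iter=100, tol=1e-4):
    x = np.asarray(x)
    rng = np.random.default_rng(0)
    pi = np.full(M, 1.0 / M)                                 # step 1: initial parameter setting
    a = rng.dirichlet(np.ones(M), size=M)
    c = np.linspace(1.0, 5.0, M)                             # one Poisson rate per hidden state
    old_ll = -np.inf
    for _ in range(n_iter):
        e = poisson.pmf(np.arange(N)[None, :], c[:, None])   # emission matrix from current c
        alpha, beta, scale = forward_backward(pi, a, e, x)   # E-step quantities
        gamma, psi = e_step(alpha, beta, a, e, x)
        pi = gamma[0] / gamma[0].sum()                       # M-step: update rules 16, 19 and 26
        a = psi[1:].sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        c = gamma.T @ x / gamma.sum(axis=0)
        ll = np.log(scale).sum()                             # step 5: check convergence
        if ll - old_ll < tol:
            break
        old_ll = ll
    return pi, a, c, gamma.argmax(axis=1)                    # predicted hidden state per position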

2.3.2 The auto-regressive HMM model

The standard model assumes that all the inherent dependencies of the observed coverage signal are captured within the hidden variables. However, it has been shown that this assumption may reduce the accuracy of the model [6]. We know that the coverage depth signal has a built-in dependency between consecutive values, since a read spans several positions in the genome. As a result, the current value of the coverage depth signal can be split in the following way:

x_t = (\text{number of reads that start at } t) + (\text{number of reads that continue from } t-1)    (27)

In this way, we can model a dependency between the current and the previous observed state, and the observed states form a first-order Markov chain (see figure 5). We call this particular model an autoregressive HMM [4]. This change in the model affects the way the parameters are computed, since we have to take into account the dependency between the observed states. First, the emission probabilities gain an additional argument, the previous observed state:

e_{ijk} = p(x_t = j \mid y_t = i, x_{t-1} = k)    (28)


The complete-data likelihood changes accordingly:

p(y_1, y_2, \ldots, y_T, x_1, x_2, \ldots, x_T) = \left[ p(y_1)\, p(x_1 \mid y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}) \right] \left[ \prod_{t=2}^{T} p(x_t \mid y_t, x_{t-1}) \right]
= \pi_{y_1}\, \sigma_{y_1 x_1} \left( \prod_{t=2}^{T} a_{y_{t-1} y_t} \right) \left( \prod_{t=2}^{T} e_{y_t x_t x_{t-1}} \right)    (29)

where \sigma_{y_1 x_1} = p(x_1 \mid y_1) is an additional parameter.

Using the properties of the graphical model in figure 5, we can readily show that the forward and backward coefficients keep largely the same expression as before:

\alpha_i(t) = p(x_{1:t}, y_t = i)
= \sum_{j=1}^{M} p(x_{1:t}, y_t = i, y_{t-1} = j) = \sum_{j=1}^{M} p(x_{1:t}, y_t = i \mid y_{t-1} = j)\, p(y_{t-1} = j)
= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\, p(x_{1:t-1}, y_{t-1} = j, y_t = i)
= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\, p(y_t = i \mid x_{1:t-1}, y_{t-1} = j)\, p(y_{t-1} = j, x_{1:t-1})
= \sum_{j=1}^{M} p(x_t \mid y_t = i, x_{t-1})\, p(y_t = i \mid y_{t-1} = j)\, \alpha_j(t-1)
= \sum_{j=1}^{M} e_{i x_t x_{t-1}}\, a_{ji}\, \alpha_j(t-1)    (30)

For the backward coefficients we define:

\beta_{ik}(t) = p(x_{t+1}, x_{t+2}, \ldots, x_T \mid y_t = i, x_t = k)
= \sum_{j=1}^{M} p(x_{t+2:T} \mid y_{t+1} = j, y_t = i, x_{t+1}, x_t = k)\, p(x_{t+1} \mid y_{t+1} = j, y_t = i, x_t = k)\, p(y_{t+1} = j \mid y_t = i, x_t = k)
= \sum_{j=1}^{M} \beta_{j x_{t+1}}(t+1)\, e_{j x_{t+1} k}\, a_{ij}    (31)

The following start conditions apply:

\alpha_i(1) = p(x_1, y_1 = i) = \pi_i\, \sigma_{i x_1}
\beta_{ik}(T) = p(x_{T+1:T} \mid y_T = i, x_T = k) = 1    (32)

We recall that the objective is to determine the hidden states behind each observed state.

We apply as before the Baum-Welch algorithm with its E and M steps modified according to the new model. In the E-step, we compute the expected log likelihood with respect to the old


parameters θold:

Q(\theta, \theta^{old}) = E_{p(Y \mid X, \theta^{old})}[\log p(X, Y \mid \theta)] = E_{p(Y \mid X, \theta^{old})}\!\left[ \log \left( \pi_{y_1}\, \sigma_{y_1 x_1} \prod_{t=2}^{T} a_{y_{t-1} y_t} \prod_{t=2}^{T} e_{y_t x_t x_{t-1}} \right) \right]
= E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{k=1}^{M} \pi_k^{I(y_1 = k)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{i=1}^{M} \prod_{j=1}^{N} \sigma_{ij}^{I(y_1 = i,\, x_1 = j)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{t=2}^{T} \prod_{i=1}^{M} \prod_{j=1}^{M} a_{ji}^{I(y_t = i,\, y_{t-1} = j)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{t=2}^{T} \prod_{i=1}^{M} \prod_{j=1}^{N} \prod_{k=1}^{N} e_{ijk}^{I(x_t = j,\, y_t = i,\, x_{t-1} = k)} \right]
= \sum_{k=1}^{M} E_{p(Y \mid X, \theta^{old})}[I(y_1 = k)] \log \pi_k + \sum_{i=1}^{M} \sum_{j=1}^{N} E_{p(Y \mid X, \theta^{old})}[I(y_1 = i, x_1 = j)] \log \sigma_{ij} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} E_{p(Y \mid X, \theta^{old})}[I(y_t = i, y_{t-1} = j)] \log a_{ji} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{N} E_{p(Y \mid X, \theta^{old})}[I(x_t = j, y_t = i, x_{t-1} = k)] \log e_{ijk}
= \sum_{k=1}^{M} p(y_1 = k \mid x_{1:T}, \theta^{old}) \log \pi_k + \sum_{i=1}^{M} \sum_{j=1}^{N} p(y_1 = i \mid x_{1:T}, \theta^{old})\, I(x_1 = j) \log \sigma_{ij} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old}) \log a_{ji} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{N} p(y_t = i \mid x_{1:T}, \theta^{old})\, I(x_t = j, x_{t-1} = k) \log e_{ijk}    (33)

We proceed in the same way as for the standard model and use the following notations:

\gamma_i(t) = p(y_t = i \mid x_{1:T}, \theta^{old})
\psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old})    (34)

The expression of the expected log likelihood becomes:

Q(\theta, \theta^{old}) = \sum_{k=1}^{M} \gamma_k(1) \log \pi_k + \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(1) \log \sigma_{ij} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} \psi_{ji}(t) \log a_{ji} + \sum_{t=2}^{T} \sum_{i=1}^{M} \gamma_i(t) \log e_{i x_t x_{t-1}}    (35)

As before, we use the conditional independence properties of the graphical model to express


γi(t) and ψji(t) as functions of the old parameters:

\gamma_i(t) = p(y_t = i \mid x_{1:T}, \theta^{old}) = p(y_t = i \mid x_{1:t}, x_{t+1:T}, \theta^{old})
= \frac{p(x_{t+1:T} \mid y_t = i, x_{1:t}, \theta^{old})\, p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, x_{1:t}, \theta^{old})\, p(y_t = j, x_{1:t} \mid \theta^{old})}
= \frac{p(x_{t+1:T} \mid y_t = i, x_t, \theta^{old})\, p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, x_t, \theta^{old})\, p(y_t = j, x_{1:t} \mid \theta^{old})}
= \frac{\beta_{i x_t}^{old}(t)\, \alpha_i^{old}(t)}{\sum_{j=1}^{M} \beta_{j x_t}^{old}(t)\, \alpha_j^{old}(t)}    (36)

\psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old}) = \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{p(x_{1:T} \mid \theta^{old})}
= \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})}
= \frac{p(x_t \mid y_t = i, x_{t-1}, \theta^{old})\, p(x_{t+1:T} \mid y_t = i, x_t, \theta^{old})\, p(y_t = i \mid y_{t-1} = j, \theta^{old})\, p(y_{t-1} = j, x_{1:t-1} \mid \theta^{old})}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})}
= \frac{e_{i x_t x_{t-1}}^{old}\, \beta_{i x_t}^{old}(t)\, a_{ji}^{old}\, \alpha_j^{old}(t-1)}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} e_{i' x_t x_{t-1}}^{old}\, \beta_{i' x_t}^{old}(t)\, a_{j'i'}^{old}\, \alpha_{j'}^{old}(t-1)}    (37)

In the M-step, the new parameters \theta are calculated by maximizing the expected log likelihood found in the E-step (equation 35). Note that \pi_k and a_{ji} have the same coefficients as in the expression of the expected log likelihood in equation 11; therefore, their update rules remain unchanged:

\pi_k = \frac{\gamma_k(1)}{\sum_{i=1}^{M} \gamma_i(1)}    (38)

a_{ji} = \frac{\sum_{t=2}^{T} \psi_{ji}(t)}{\sum_{t=2}^{T} \gamma_j(t-1)}    (39)

The parameters that remain to be calculated are e_{ijk} and \sigma_{ij}. We use the same method as for the other parameters, i.e. we set the corresponding partial derivatives to zero:

\frac{\partial Q(\theta, \theta^{old})}{\partial \sigma_{ij}} = 0, \qquad \frac{\partial Q(\theta, \theta^{old})}{\partial e_{ijk}} = 0    (40)

The parameters \sigma_{ij} obey the constraint \sum_{j'=1}^{N} \sigma_{ij'} = 1; therefore we use Lagrange multipliers:

Q(\theta, \theta^{old}) = f(\sigma_{ij}) = \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(1) \log \sigma_{ij} + \sum_{i'=1}^{M} \lambda_{i'} \left( \sum_{j'=1}^{N} \sigma_{i'j'} - 1 \right) + \text{const.}    (41)


By taking the derivatives with respect to \sigma_{ij} and \lambda_i, we get:

\frac{\partial f(\sigma_{ij})}{\partial \sigma_{ij}} = 0 \;\Rightarrow\; \gamma_i(1) \frac{1}{\sigma_{ij}} + \lambda_i = 0 \;\Rightarrow\; \sigma_{ij} = -\frac{\gamma_i(1)}{\lambda_i}
\frac{\partial f(\sigma_{ij})}{\partial \lambda_i} = 0 \;\Rightarrow\; \sum_{j'=1}^{N} \sigma_{ij'} - 1 = 0 \;\Rightarrow\; \lambda_i = -\sum_{j'=1}^{N} \gamma_i(1) = -N \gamma_i(1)

\Rightarrow\; \sigma_{ij} = \frac{1}{N}    (42)

Finally, we compute the e_{ijk}, which are subject to the constraint \sum_{j'=1}^{N} e_{ij'k} = 1, so we can write:

Q(\theta, \theta^{old}) = f(e_{ijk}) = \sum_{t=2}^{T} \sum_{i=1}^{M} \gamma_i(t) \log e_{i x_t x_{t-1}} + \sum_{i'=1}^{M} \sum_{k'=1}^{N} \lambda_{i'k'} \left( \sum_{j'=1}^{N} e_{i'j'k'} - 1 \right) + \text{const.}    (43)

As before, we take the derivative with respect to eijk and λik:

\frac{\partial f(e_{ijk})}{\partial e_{ijk}} = 0 \;\Rightarrow\; \sum_{t:\, x_t = j,\, x_{t-1} = k} \gamma_i(t) \frac{1}{e_{ijk}} + \lambda_{ik} = 0 \;\Rightarrow\; e_{ijk} = -\frac{\sum_{t:\, x_t = j,\, x_{t-1} = k} \gamma_i(t)}{\lambda_{ik}}
\frac{\partial f(e_{ijk})}{\partial \lambda_{ik}} = 0 \;\Rightarrow\; \sum_{j'=1}^{N} e_{ij'k} = 1 \;\Rightarrow\; \lambda_{ik} = -\sum_{j'=1}^{N} \sum_{t:\, x_t = j',\, x_{t-1} = k} \gamma_i(t)

\Rightarrow\; e_{ijk} = \frac{\sum_{t:\, x_t = j,\, x_{t-1} = k} \gamma_i(t)}{\sum_{j'=1}^{N} \sum_{t:\, x_t = j',\, x_{t-1} = k} \gamma_i(t)}    (44)

We showed in relation 27 that the number of reads that map at a certain position can be split into two quantities: the reads that start at that position and the reads that continue from the previous position. We can model these two quantities with two known distributions: the Poisson distribution models the reads that start at the current position, while the Binomial distribution models the reads that continue from the previous position. This allows us to write the emission probabilities in the following manner:

e_{ijk} = p(x_t = j \mid y_t = i, x_{t-1} = k)
= \sum_{r=1}^{j} \underbrace{p(r \text{ reads start at } t)}_{\mathrm{Poisson}(r;\, c_i)} \times \underbrace{p(j - r \text{ reads carry on from } t-1)}_{\mathrm{Binomial}(j - r;\, k,\, p_c)}
= \sum_{r=1}^{j} \frac{c_i^{\,r}}{r!} e^{-c_i} \binom{k}{j - r} p_c^{\,j - r} (1 - p_c)^{k - (j - r)}    (45)

where p_c is the probability that a read spans two adjacent positions:

p_c = \frac{len - 1}{len}    (46)

where len is the length of a read. In order to find the maximum likelihood estimate of the parameter c_i, a gradient descent library has been used in the implemented algorithm. We have thus obtained the update rules for all the parameters of the HMM, and the Baum-Welch algorithm is completely defined.
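For illustration, equation 45 can be evaluated with a few lines of Python using scipy's Poisson and Binomial probability mass functions; the function and the example values below are illustrative only, and the summation bounds follow the formula above:

from scipy.stats import binom, poisson

def ar_emission(j, c_i, k, read_len):
    # e_{ijk} = sum_{r=1}^{j} Poisson(r; c_i) * Binomial(j - r; k, p_c)   (equation 45)
    p_c = (read_len - 1) / read_len      # probability that a read spans two adjacent positions
    return sum(poisson.pmf(r, c_i) * binom.pmf(j - r, k, p_c) for r in range(1, j + 1))

# Example: coverage 12 at position t, coverage 10 at t-1, Poisson rate 3, 50 bp reads
print(ar_emission(12, 3.0, 10, 50))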


3 Results and discussion

We have trained and tested the standard HMM as well as the auto-regressive HMM on the RNA-seq data of E. faecalis. The performance of the two algorithms proved to be completely different for this kind of data.

3.1 Performance of the two HMMs

The idea of an autoregressive HMM comes from a previous application on DNA-sequencing (DNA-seq) data for detecting copy number variants [9]. When applied to DNA-seq data, the auto-regressive model performs the same as the standard HMM model. The two models detect the same amplified regions as well as the background level of coverage depth (figure 6).

Figure 6: Application of the two models on the same stretch of DNA-seq coverage depth signal: (a) autoregressive HMM, (b) standard HMM, (c) superposition of the autoregressive and standard HMM. The two models perform largely the same and detect the amplified regions of the signal as well as the background level.


Figure 6 displays the result of applying the two HMM models on the same fragment of coverage depth from DNA-seq data. The predictions of the two models are the same, with a few small differences. Note the existence of a few artifacts (very short deletions or amplifications), especially with the standard model. These artifacts may require an additional smoothing operation or a slight modification of the transition probabilities a_{ji}.

Unlike for DNA-seq data, when applied to RNA-seq data of the E. faecalis bacterium, the two models behave completely differently (figure 7). The auto-regressive HMM fails to predict the correct amplification regions and returns a signal that is impossible to interpret. On the contrary, the standard model performs better and manages to detect continuous regions corresponding to the different transcripts, although accompanied by several artifacts.

Figure 7: Application of the two models on the first 10000 positions of the coverage depth signal obtained from RNA-seq data of the E. faecalis bacterium: (a) real coverage depth signal, (b) autoregressive HMM prediction, (c) standard HMM prediction. The two models perform completely differently: the standard model clearly surpasses the auto-regressive model, which fails to predict the amplifications as well as the background signal.

There may be several reasons for the unsatisfactory result with the autoregressive HMM. First of all, the RNA-seq coverage depth signal is different from the DNA-seq coverage signal. By looking at the real coverage signals (in blue) in figures 6 and 7, we can clearly see that the RNA-seq data is far more irregular than the DNA-seq data. Secondly, the distributions used to calculate the emission probabilities of the autoregressive HMM may not be the most appropriate ones. While the combination of Poisson and Binomial may be suitable for DNA-seq data, it may not be appropriate for RNA-seq data, which is subject to different biological phenomena such as transcript degradation. Other distributions, such as the Negative Binomial, could replace the Poisson component and might give better results, or a mixture of several distributions may be more suitable. Finally, the sequencing depth of the RNA-seq data is very low compared to DNA-seq data, which may be a limitation for the autoregressive HMM.

References
