Individual Project in Computational Biology

Applying Hidden Markov Models to RNA-seq data

Monica Golumbeanu Student number: 880703 - T300

monicago@kth.se

January 2, 2013

Abstract

Enterococcus faecalis is one of the most controversial commensal bacteria of the human intestinal flora and is also responsible for lethal nosocomial infections. Determining the factors that influence its pathogenicity is at present a great challenge. Cutting-edge approaches analyze the E. faecalis bacterium through next-generation RNA-sequencing technology.

Since next-generation sequencing is recent and yields large amounts of data, there is a continuous need for appropriate statistical methods to interpret its output. We propose an approach based on hidden Markov models to explore RNA-seq data and show an example of how this statistical tool can be applied to detect transcription start sites. We compare this application with a previously developed method based on signal processing.

1 Introduction

Hidden Markov models (HMMs) have been widely used for analyzing sequencing data. Their applications are diverse and cover research topics such as gene identification, alignment, non-coding RNA detection and protein secondary structure prediction [13]. In the present study, we explore the use of two different HMMs for determining the transcripts and their transcription levels, respectively, starting from RNA sequencing (RNA-seq) data of the Enterococcus faecalis bacterium.

Enterococcus faecalis is a commensal bacterium which commonly populates the human gastrointestinal tract and can produce lethal infections. Its robustness has ensured its survival in the prophylactic hospital environment, its defense against the host's immune system and a strong resistance to antimicrobial agents [7]. Understanding how the E. faecalis bacterium operates and which factors influence its pathogenicity is thus a great challenge.

Applying HMMs to detect and analyze the different transcriptionally active elements within the transcriptome of E. faecalis would provide important information about the genes shaping the bacterium's behaviour and contribute to establishing new treatments. It would be important, for example, to obtain a complete image of the most transcriptionally active elements and of their relation to the phenotypic traits of the bacterium (e.g. resistance to antibiotics). Knowing these elements would contribute to better defining the targets of novel drugs and therapies.


Figure 1: Description of the workflow used to obtain the different RNA-seq data sets from a culture of E. faecalis v583: cultivation, RNA extraction, removal of ribosomal RNA (Treated) or no removal (Crude), library preparation with either the 'home made' kit (FRAG) or the barcoding kit (BC), and sequencing of the resulting libraries.

2 Materials and Methods

2.1 Data preparation and extraction

One colony of E. faecalis v583 was grown in a static, anaerobic Brain Heart Infusion environment at 37°C (figure 1). Afterwards, the RNA was extracted from the cells. Half of the sample was cleared of ribosomal RNA (Treated sample) while the other half was left intact (Crude sample). The DNA Sequencing and Genomics Laboratory of the Institute of Biotechnology of the University of Helsinki in Finland performed the sequencing of the two samples on the SOLiD sequencing platform [11]. Two different fragment libraries were used during the sequencing procedure: one library prepared with an Applied Biosystems kit (BC) and the other one with the laboratory's own kit (FRAG).

Even though there are different sequencing platforms (e.g. SOLiD [11], Illumina [2]), the RNA-seq procedure relies on the same principles (figure 2). The method consists of extracting the different transcripts and converting them into double-stranded DNA (dsDNA). The dsDNA is then sheared into small fragments and adapters are ligated to them, such that a sequencing library is formed. The first 50 nucleotides of each fragment are sequenced and, in this way, a large set of reads is obtained.

In our study, we have used the Crude BC dataset, in which the ribosomal RNA was not removed. The reads obtained from the RNA-seq experiment were aligned to the genome using the Bowtie software [5]. For every position on the genome, the number of reads that were mapped there was counted, and the resulting signal was coined the "coverage depth".
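To illustrate this step, the following minimal Python sketch (the function and variable names are illustrative and not part of the original pipeline) computes a coverage depth signal from a list of 0-based read start positions, assuming all reads have the same length:

import numpy as np

def coverage_depth(read_starts, read_len, genome_len):
    # Count, for every genome position, how many aligned reads cover it.
    depth = np.zeros(genome_len, dtype=int)
    for start in read_starts:
        depth[start:start + read_len] += 1
    return depth

# Example: three reads of length 5 mapped to a genome of length 12
print(coverage_depth([0, 2, 2], 5, 12))   # [1 1 3 3 3 2 2 0 0 0 0 0]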

2.2 The general HMM framework

2.2.1 General definition

An HMM is a directed acyclic graph consisting of a discrete-time, discrete-state Markov chain of hidden states, Y = \{y_t\}_{t=1}^{T}, and a set of observed states, X = \{x_t\}_{t=1}^{T}, connected through an observation model p(x_t \mid y_t) [8]. Unlike the hidden states, which are discrete, the observations in an HMM can be either continuous or discrete. In the case of discrete variables, the observation model is represented by an observation matrix, also named the emission matrix.

The following parameters are needed to define an HMM with discrete observed states:

• The number of hidden states: M

• The number of observed states: N


Figure 2: Workflow of a standard RNA-seq experiment: extraction of RNAs, conversion into dsDNA and shearing, adapter ligation and amplification, immobilization and sequencing.

• The initial distribution of the hidden states:

\Pi = \{\pi_i\}_{1 \le i \le M} = \{p(y_1 = i)\}_{1 \le i \le M}    (1)

• The transition probabilities between two hidden states, 1 \le i, j \le M:

a_{ij} = P(y_t = j \mid y_{t-1} = i)    (2)

These probabilities are contained within the transition matrix a = (a_{ij})_{1 \le i, j \le M}, where a_{ij} represents the probability of transitioning from hidden state i to hidden state j (figure 3).

• The emission probabilities

e_{ij} = P(x_t = j \mid y_t = i)    (3)

where 1 \le i \le M and 1 \le j \le N. These probabilities are contained within the emission matrix e = (e_{ij})_{1 \le i \le M, 1 \le j \le N}, where e_{ij} represents the probability of emitting the observed state j from the hidden state i (figure 3).

Figure 3 displays an example of an HMM and is useful for understanding the structure of an HMM as well as its internal dependencies. Since the HMM is a directed acyclic graph, it is straightforward to write the corresponding joint probability distribution:

p(y_1, y_2, \ldots, y_T, x_1, x_2, \ldots, x_T) = p(y_1, \ldots, y_T)\, p(x_1, \ldots, x_T \mid y_1, \ldots, y_T)
= \left[ p(y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}) \right] \left[ \prod_{t=1}^{T} p(x_t \mid y_t) \right]
= \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1} y_t} \prod_{t=1}^{T} e_{y_t x_t}    (4)


Figure 3: Example of a realization of an HMM with hidden states (y_t, t \in \{1, \ldots, T\}) and observed states (x_t, t \in \{1, \ldots, T\}). The arrows represent the transition (a_{y_{t-1} y_t}) and emission (e_{y_t x_t}) probabilities.
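As a concrete illustration of these definitions, the following minimal Python sketch (a toy example with made-up numbers, not taken from the data analyzed here) stores the parameters \Pi, a and e as arrays and evaluates the joint probability of equation 4 for one realization of hidden and observed states:

import numpy as np

# Toy HMM with M = 2 hidden states and N = 3 observation symbols
pi = np.array([0.6, 0.4])               # initial distribution, pi_i = p(y_1 = i)
a = np.array([[0.9, 0.1],               # transition matrix, a_ij = p(y_t = j | y_{t-1} = i)
              [0.2, 0.8]])
e = np.array([[0.7, 0.2, 0.1],          # emission matrix, e_ij = p(x_t = j | y_t = i)
              [0.1, 0.3, 0.6]])

def joint_log_prob(y, x):
    # log p(y_1..y_T, x_1..x_T) following equation 4
    logp = np.log(pi[y[0]]) + np.log(e[y[0], x[0]])
    for t in range(1, len(y)):
        logp += np.log(a[y[t - 1], y[t]]) + np.log(e[y[t], x[t]])
    return logp

print(joint_log_prob(y=[0, 0, 1], x=[0, 1, 2]))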

2.2.2 Inference in HMMs - the Baum-Welch algorithm

HMMs have the advantage of capturing long range dependencies between the observed variables through the latent variables. The general HMM inference problem seeks to estimate the hidden states starting from the observed variables, in other words, to calculate p(y_t \mid x_1, x_2, \ldots, x_t) in an online learning process or p(y_t \mid x_1, x_2, \ldots, x_T) in an offline setting.

An online setting is a process where we observe the states sequentially and want to compute the hidden variables on the spot. In contrast, an offline setting assumes that we observe all the variables and look for the most likely hidden states behind the observations, i.e. p(y_t \mid x_1, x_2, \ldots, x_T). We will focus on the offline learning setting since we will model our problem accordingly. Therefore, the goal will be to compute the probabilities of the hidden states given all the observations. Then, for each observation, we will choose the hidden state with the highest probability. The key to solving this inference problem is the fact that we can separate the Markov chain into two parts - the past and the future:

\alpha_i(t) = p(x_{1:t}, y_t = i)
\beta_i(t) = p(x_{t+1}, \ldots, x_T \mid y_t = i)    (5)

We have used the notation x_{1:t} = \{x_1, x_2, \ldots, x_t\}. The \alpha_i(t) are also called forward coefficients, while the \beta_i(t) are called backward coefficients. We can write these coefficients as functions of the HMM parameters by using d-separation [10] and the conditional independence properties of the graphical model (figure 3) as follows:

\alpha_i(t) = p(x_{1:t}, y_t = i)
= \sum_{j=1}^{M} p(x_{1:t}, y_t = i, y_{t-1} = j) = \sum_{j=1}^{M} p(x_{1:t}, y_t = i \mid y_{t-1} = j)\, p(y_{t-1} = j)
= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\, p(x_{1:t-1}, y_{t-1} = j, y_t = i)
= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\, p(y_t = i \mid x_{1:t-1}, y_{t-1} = j)\, p(y_{t-1} = j, x_{1:t-1})
= \sum_{j=1}^{M} p(x_t \mid y_t = i)\, p(y_t = i \mid y_{t-1} = j)\, \alpha_j(t-1)
= \sum_{j=1}^{M} e_{i x_t}\, a_{ji}\, \alpha_j(t-1)    (6)


\beta_i(t) = p(x_{t+1}, x_{t+2}, \ldots, x_T \mid y_t = i) = \sum_{j=1}^{M} p(x_{t+1:T}, y_{t+1} = j \mid y_t = i)
= \sum_{j=1}^{M} p(x_{t+2:T} \mid y_{t+1} = j, y_t = i, x_{t+1})\, p(x_{t+1} \mid y_{t+1} = j, y_t = i)\, p(y_{t+1} = j \mid y_t = i)
= \sum_{j=1}^{M} \beta_j(t+1)\, e_{j x_{t+1}}\, a_{ij}    (7)

Above we have found the main recurrence relations for the forward and backward coefficients.

The following initial conditions allow the dynamic computation of all the remaining coefficients; the resulting procedure is known as the forward-backward algorithm [14]:

\alpha_i(1) = p(x_1, y_1 = i) = \pi_i\, e_{i x_1}
\beta_i(T) = p(x_{T+1:T} \mid y_T = i) = p(\emptyset \mid y_T = i) = 1    (8)

In order to calculate p(y_t \mid x_1, x_2, \ldots, x_T), the maximum likelihood parameters of the model need to be estimated: \theta = \{\pi, a, e\}. The most common method to find these parameters is the EM (Expectation-Maximization) algorithm for HMMs, also known as the Baum-Welch algorithm [12]. The Baum-Welch algorithm iteratively alternates between two main steps. It starts with random values for the parameters; in the first step (E-step), it computes the expected complete-data log likelihood of the new parameters with respect to the old parameters. In the second step (M-step), it updates the parameters such that the updated values maximize the expected likelihood calculated in the E-step.
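The forward and backward recursions above translate directly into code. A minimal Python sketch is given below; the per-position rescaling is an implementation detail not discussed above, added here only to avoid numerical underflow on genome-length chains, and the function name is illustrative:

import numpy as np

def forward_backward(pi, a, e, x):
    # pi: (M,) initial distribution; a: (M, M) with a[j, i] = p(y_t = i | y_{t-1} = j)
    # e: (M, N) with e[i, j] = p(x_t = j | y_t = i); x: observation sequence of length T
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M)); beta = np.ones((T, M)); scale = np.zeros(T)
    # forward pass: alpha_i(t) = e_{i x_t} * sum_j a_{ji} * alpha_j(t-1)
    alpha[0] = pi * e[:, x[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = e[:, x[t]] * (alpha[t - 1] @ a)
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    # backward pass: beta_i(t) = sum_j a_{ij} * e_{j x_{t+1}} * beta_j(t+1)
    for t in range(T - 2, -1, -1):
        beta[t] = (a @ (e[:, x[t + 1]] * beta[t + 1])) / scale[t + 1]
    return alpha, beta, scale   # the log likelihood of x is np.log(scale).sum()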

In the E-step, the expected complete-data log likelihood of the new parameters \theta is calculated with respect to the old parameters \theta^{old}:

Q(\theta, \theta^{old}) = E_{p(Y \mid X, \theta^{old})}[\log p(X, Y \mid \theta)] = E_{p(Y \mid X, \theta^{old})}\!\left[ \log \left( \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1} y_t} \prod_{t=1}^{T} e_{y_t x_t} \right) \right]
= E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{k=1}^{M} \pi_k^{I(y_1 = k)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{t=2}^{T} \prod_{i=1}^{M} \prod_{j=1}^{M} a_{ji}^{I(y_t = i,\, y_{t-1} = j)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{t=1}^{T} \prod_{i=1}^{M} \prod_{j=1}^{N} e_{ij}^{I(x_t = j,\, y_t = i)} \right]
= \sum_{k=1}^{M} E_{p(Y \mid X, \theta^{old})}[I(y_1 = k)] \log \pi_k + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} E_{p(Y \mid X, \theta^{old})}[I(y_t = i, y_{t-1} = j)] \log a_{ji} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} E_{p(Y \mid X, \theta^{old})}[I(x_t = j, y_t = i)] \log e_{ij}
= \sum_{k=1}^{M} p(y_1 = k \mid x_{1:T}, \theta^{old}) \log \pi_k + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old}) \log a_{ji} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} p(y_t = i \mid x_{1:T}, \theta^{old})\, I(x_t = j) \log e_{ij}    (9)


Using the notations:

\gamma_i(t) = p(y_t = i \mid x_{1:T}, \theta^{old})
\psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old})    (10)

the expression of the expected log likelihood becomes:

Q(\theta, \theta^{old}) = \sum_{k=1}^{M} \gamma_k(1) \log \pi_k + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} \psi_{ji}(t) \log a_{ji} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(t)\, I(x_t = j) \log e_{ij}    (11)

Note that \gamma_i(t) and \psi_{ji}(t) are both dependent only on the old parameters:

\gamma_i(t) = p(y_t = i \mid x_{1:T}, \theta^{old}) = p(y_t = i \mid x_{1:t}, x_{t+1:T}, \theta^{old})
= \frac{p(x_{t+1:T} \mid y_t = i, x_{1:t}, \theta^{old})\, p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, x_{1:t}, \theta^{old})\, p(y_t = j, x_{1:t} \mid \theta^{old})}
= \frac{p(x_{t+1:T} \mid y_t = i, \theta^{old})\, p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, \theta^{old})\, p(y_t = j, x_{1:t} \mid \theta^{old})}
= \frac{\beta_i^{old}(t)\, \alpha_i^{old}(t)}{\sum_{j=1}^{M} \beta_j^{old}(t)\, \alpha_j^{old}(t)}    (12)

\psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old}) = \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{p(x_{1:T} \mid \theta^{old})}
= \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})}
= \frac{p(x_t \mid y_t = i, \theta^{old})\, p(x_{t+1:T} \mid y_t = i, \theta^{old})\, p(y_t = i \mid y_{t-1} = j, \theta^{old})\, p(y_{t-1} = j, x_{1:t-1} \mid \theta^{old})}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})}
= \frac{e_{i x_t}^{old}\, \beta_i^{old}(t)\, a_{ji}^{old}\, \alpha_j^{old}(t-1)}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} e_{i' x_t}^{old}\, \beta_{i'}^{old}(t)\, a_{j'i'}^{old}\, \alpha_{j'}^{old}(t-1)}    (13)

In the M-step, the new parameters \theta are calculated by maximizing the expected log likelihood from the E-step (equation 11) and thus solving:

\frac{\partial Q(\theta, \theta^{old})}{\partial \pi_k} = 0, \qquad \frac{\partial Q(\theta, \theta^{old})}{\partial a_{ji}} = 0, \qquad \frac{\partial Q(\theta, \theta^{old})}{\partial e_{ij}} = 0    (14)

By solving these equations, we will obtain the updated values of the HMM parameters as functions of the old parameters.


To determine \pi_k, we use the constraint \sum_{i=1}^{M} \pi_i = 1; therefore we can write Q(\theta, \theta^{old}) as a function of \pi_k using a Lagrange multiplier:

Q(\theta, \theta^{old}) = f(\pi_k) = \sum_{k=1}^{M} \gamma_k(1) \log \pi_k + \lambda \left( \sum_{i=1}^{M} \pi_i - 1 \right) + \text{const.}    (15)

where const. stands for a constant term. We then set the partial derivatives to zero:

\frac{\partial f(\pi_k)}{\partial \pi_k} = 0 \;\Rightarrow\; \gamma_k(1) \frac{1}{\pi_k} + \lambda = 0 \;\Rightarrow\; \pi_k = -\frac{\gamma_k(1)}{\lambda}
\frac{\partial f(\pi_k)}{\partial \lambda} = 0 \;\Rightarrow\; \sum_{i=1}^{M} \pi_i = 1 \;\Rightarrow\; \sum_{i=1}^{M} -\frac{\gamma_i(1)}{\lambda} = 1 \;\Rightarrow\; \lambda = -\sum_{i=1}^{M} \gamma_i(1)

\Rightarrow\; \pi_k = \frac{\gamma_k(1)}{\sum_{i=1}^{M} \gamma_i(1)}    (16)

Similarly, for a_{ji} the constraint \sum_{i'=1}^{M} a_{ji'} = 1, \forall j applies, and we again use Lagrange multipliers:

Q(\theta, \theta^{old}) = f(a_{ji}) = \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} \psi_{ji}(t) \log a_{ji} + \sum_{j'=1}^{M} \lambda_{j'} \left( \sum_{i'=1}^{M} a_{j'i'} - 1 \right) + \text{const.}    (17)

and therefore we have that:

\frac{\partial f(a_{ji})}{\partial a_{ji}} = 0 \;\Rightarrow\; \sum_{t=2}^{T} \psi_{ji}(t) \frac{1}{a_{ji}} + \lambda_j = 0 \;\Rightarrow\; a_{ji} = -\frac{\sum_{t=2}^{T} \psi_{ji}(t)}{\lambda_j}
\frac{\partial f(a_{ji})}{\partial \lambda_j} = 0 \;\Rightarrow\; \sum_{i'=1}^{M} a_{ji'} - 1 = 0 \;\Rightarrow\; \lambda_j = -\sum_{t=2}^{T} \sum_{i'=1}^{M} \psi_{ji'}(t)

\Rightarrow\; a_{ji} = \frac{\sum_{t=2}^{T} \psi_{ji}(t)}{\sum_{t=2}^{T} \sum_{i'=1}^{M} \psi_{ji'}(t)}    (18)

We can simplify this form by using the definitions of \psi_{ji}(t) and \gamma_i(t) from equation 10 and observing the marginalization in the denominator of equation 18:

a_{ji} = \frac{\sum_{t=2}^{T} p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old})}{\sum_{t=2}^{T} \sum_{i'=1}^{M} p(y_t = i', y_{t-1} = j \mid x_{1:T}, \theta^{old})} = \frac{\sum_{t=2}^{T} p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old})}{\sum_{t=2}^{T} p(y_{t-1} = j \mid x_{1:T}, \theta^{old})}

\Rightarrow\; a_{ji} = \frac{\sum_{t=2}^{T} \psi_{ji}(t)}{\sum_{t=2}^{T} \gamma_j(t-1)}    (19)

Finally, the e_{ij} are subject to a constraint as well, \sum_{j'=1}^{N} e_{ij'} = 1, \forall i, and we can write:

Q(\theta, \theta^{old}) = f(e_{ij}) = \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{t:\, x_t = j} \gamma_i(t) \log e_{ij} + \sum_{i'=1}^{M} \lambda_{i'} \left( \sum_{j'=1}^{N} e_{i'j'} - 1 \right) + \text{const.}    (20)

We have, as before:

\frac{\partial f(e_{ij})}{\partial e_{ij}} = 0 \;\Rightarrow\; \sum_{t:\, x_t = j} \gamma_i(t) \frac{1}{e_{ij}} + \lambda_i = 0 \;\Rightarrow\; e_{ij} = -\frac{\sum_{t:\, x_t = j} \gamma_i(t)}{\lambda_i}
\frac{\partial f(e_{ij})}{\partial \lambda_i} = 0 \;\Rightarrow\; \sum_{j'=1}^{N} e_{ij'} - 1 = 0 \;\Rightarrow\; \lambda_i = -\sum_{j'=1}^{N} \sum_{t:\, x_t = j'} \gamma_i(t)

\Rightarrow\; e_{ij} = \frac{\sum_{t:\, x_t = j} \gamma_i(t)}{\sum_{j'=1}^{N} \sum_{t:\, x_t = j'} \gamma_i(t)}    (21)


Note that \sum_{j'=1}^{N} \sum_{t:\, x_t = j'} \gamma_i(t) = \sum_{t=1}^{T} \gamma_i(t), therefore:

e_{ij} = \frac{\sum_{t:\, x_t = j} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}    (22)
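The update rules 16, 19 and 22 then amount to normalized sums of the E-step quantities. A minimal sketch for the fully discrete case (before introducing the Poisson parameterization of the next section), using the gamma and psi arrays produced by the e_step sketch above; the function name is illustrative:

import numpy as np

def m_step(gamma, psi, x, N):
    # pi_k = gamma_k(1) / sum_i gamma_i(1)
    pi = gamma[0] / gamma[0].sum()
    # a_{ji} = sum_{t>=2} psi_{ji}(t) / sum_{t>=2} gamma_j(t-1)
    a = psi[1:].sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # e_{ij} = sum_{t: x_t = j} gamma_i(t) / sum_t gamma_i(t)
    x = np.asarray(x)
    M = gamma.shape[1]
    e = np.zeros((M, N))
    for j in range(N):
        e[:, j] = gamma[x == j].sum(axis=0)
    e /= e.sum(axis=1, keepdims=True)
    return pi, a, e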

The computation of the emission probabilities can be improved by looking for emission proba- bilities of a certain form (e.g. following the Poisson distribution). This technique can be used on the emission probabilities because we know the observed states. Since we don’t have any information about the hidden states and we want them to cover as much as dependencies as possible, we do not apply this method to compute transition probabilities. The distribution for the emission probabilities is chosen according to the quantity to model. We will show in the following sections how we chose an appropriate distribution to deal with RNA-seq data.

2.3 Application of HMMs on RNA-seq data

The HMM framework was used to analyze the RNA-seq data obtained from E. faecalis. Two models were designed. In both cases, the observed states are represented by the values of the coverage signal and the hidden states model the level of amplification of the signal (duplicate, triplicate, etc.). Therefore, the implemented algorithm takes as an input the coverage signal for the entire genome and returns the corresponding values of the hidden states which constitute the predicted signal. The predicted signal tells which regions are amplified or deleted (figure 4).

Figure 4: Example of a generated coverage depth signal (black) and the signal predicted with the HMM (red). The background coverage is estimated around the value of 10 and there are regions with average amplification levels of 40, 30 and 20, respectively. A deletion can be observed as well, starting at position 6000. The red signal corresponds to the output of the HMM algorithm, which detects all these regions.

2.3.1 The standard HMM model

The straightforward way to apply an HMM to RNA-seq data is to use the standard framework described in the previous section and to pick a distribution to model the emission probabilities of the HMM. The chain has the length of the genome and has the structure shown in figure 3. Based on the literature [1, 6, 3], we have chosen a Poisson distribution to model the probability that a


certain number of reads map at a certain position of the genome, located in a region which is amplified a certain number of times:

e_{ij} = p(j \text{ reads map in a region amplified } i \text{ times}) = p(x_t = j \mid y_t = i) = \mathrm{Poisson}(j;\, c_i) = \frac{(c_i)^j}{j!} e^{-c_i}    (23)

where c_i represents how many times the background coverage depth has been amplified at position t. In other words, c_1 is the level of the background signal, c_2 stands for a duplicate, c_3 for a triplicate, etc.

The transition probabilities a_{ji} and the initial distribution of the hidden states \pi_k are calculated by using the previously found relations (equations 19 and 16), while the emission probabilities are found using the maximum likelihood estimate of the parameters c_i of the Poisson distribution.

To find the maximum likelihood estimate for ci, the following equation is solved:

\frac{\partial Q(\theta, \theta^{old})}{\partial c_i} = 0    (24)

where the expression of the expected log likelihood is given by:

Q(\theta, \theta^{old}) = \underbrace{\sum_{k=1}^{M} \gamma_k(1) \log \pi_k + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} \psi_{ji}(t) \log a_{ji}}_{\text{const.}} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(t)\, I(x_t = j) \log e_{ij}
= f(c_i) = \text{const.} + \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(t)\, I(x_t = j) \log \left( \frac{(c_i)^j}{j!} e^{-c_i} \right)
= \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{t:\, x_t = j} \left[ \gamma_i(t)\, j \log c_i - \gamma_i(t)\, c_i \right] + \text{const.}    (25)

We have that:

\frac{\partial f(c_i)}{\partial c_i} = 0 \;\Rightarrow\; \sum_{j=1}^{N} \sum_{t:\, x_t = j} \left( \gamma_i(t)\, j \frac{1}{c_i} - \gamma_i(t) \right) = 0 \;\Rightarrow\; c_i = \frac{\sum_{t=1}^{T} \gamma_i(t)\, x_t}{\sum_{t=1}^{T} \gamma_i(t)}    (26)

Once the c_i parameters are computed, the emission probabilities e_{ij} are obtained with the Poisson formula in equation 23. We have now deduced the update formulas for all the parameters of the HMM.
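A minimal sketch of this Poisson-parameterized emission update (equations 26 and 23); scipy's poisson.pmf provides the probability mass function, the observation symbols are assumed to be the coverage values 0, 1, ..., N-1, and the function name is illustrative:

import numpy as np
from scipy.stats import poisson

def update_poisson_emissions(gamma, x, N):
    # c_i = sum_t gamma_i(t) * x_t / sum_t gamma_i(t)     (equation 26)
    x = np.asarray(x)
    c = gamma.T @ x / gamma.sum(axis=0)
    # e_{ij} = Poisson(j; c_i)                            (equation 23)
    e = poisson.pmf(np.arange(N)[None, :], c[:, None])
    return c, e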

The complete Baum-Welch algorithm is as follows: an initial value for the parameters is set and the expected log likelihood of the data is calculated; afterwards, the parameters are updated with respect to the old parameters by using the update rules 16, 19 and 26, and the log likelihood is calculated with the new values. If the likelihood has converged, the iteration stops; otherwise, the old parameters receive the values of the updated parameters and the update rules are used to compute new parameters (algorithm 1).


Figure 5: Example of a realization of an autoregressive HMM where the observed states are organized in a first-order Markov chain. Note that the emission probabilities are a function of the current observed state, the corresponding hidden state, and the previously observed state.

Algorithm Baum-Welch:

1. find an initial setting for the parameters: \theta^0 = \{\pi_k^0, a_{ji}^0, e_{ij}^0\};
2. set \theta^{old} = \theta^0;
3. update \theta with respect to \theta^{old} by using the update rules
   \pi_k = \gamma_k(1) / \sum_{i=1}^{M} \gamma_i(1), \quad a_{ji} = \sum_{t=2}^{T} \psi_{ji}(t) / \sum_{t=2}^{T} \gamma_j(t-1), \quad c_i = \sum_{t=1}^{T} \gamma_i(t)\, x_t / \sum_{t=1}^{T} \gamma_i(t);
4. set \theta^{old} = \theta;
5. go to 3 unless the expected log likelihood has converged.

Algorithm 1: Training of the HMM with the EM framework within the Baum-Welch algorithm.
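For completeness, a minimal Python sketch of this training loop is given below, composed from the forward_backward, e_step and update_poisson_emissions sketches shown earlier; the initialization and the convergence tolerance are arbitrary illustrative choices, and the coverage values in x are assumed to be integers in 0, ..., N-1:

import numpy as np
from scipy.stats import poisson

def baum_welch(x, M, N, n_iter=100, tol=1e-4):
    x = np.asarray(x)
    rng = np.random.default_rng(0)
    pi = np.full(M, 1.0 / M)                                 # step 1: initial parameter setting
    a = rng.dirichlet(np.ones(M), size=M)
    c = np.linspace(1.0, 5.0, M)                             # one Poisson rate per hidden state
    old_ll = -np.inf
    for _ in range(n_iter):
        e = poisson.pmf(np.arange(N)[None, :], c[:, None])   # emission matrix from current c
        alpha, beta, scale = forward_backward(pi, a, e, x)   # E-step quantities
        gamma, psi = e_step(alpha, beta, a, e, x)
        pi = gamma[0] / gamma[0].sum()                       # M-step: update rules 16, 19 and 26
        a = psi[1:].sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        c = gamma.T @ x / gamma.sum(axis=0)
        ll = np.log(scale).sum()                             # step 5: check convergence
        if ll - old_ll < tol:
            break
        old_ll = ll
    return pi, a, c, gamma.argmax(axis=1)                    # predicted hidden state per position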

2.3.2 The auto-regressive HMM model

The standard model assumes that all the inherent dependencies of the observed coverage signal are captured within the hidden variables. However, it has been shown that this assumption may reduce the accuracy of the model [6]. We know that the coverage depth signal has a built-in dependency between consecutive values, since a read spans several positions in the genome. As a result, the current value of the coverage depth signal can be split in the following way:

x_t = (\text{number of reads that start at } t) + (\text{number of reads that continue from } t-1)    (27)

In this way, we can model a dependency between the current and the previous observed state, and the observed states form a first-order Markov chain (see figure 5). We call this particular model an autoregressive HMM [4]. This change in the model affects the way the parameters are computed, since we have to take into account the dependency between the observed states. First, the emission probabilities gain an additional argument, the previous observed state:

e_{ijk} = p(x_t = j \mid y_t = i, x_{t-1} = k)    (28)


The complete-data likelihood changes accordingly:

p(y_1, y_2, \ldots, y_T, x_1, x_2, \ldots, x_T) = \left[ p(y_1)\, p(x_1 \mid y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}) \right] \left[ \prod_{t=2}^{T} p(x_t \mid y_t, x_{t-1}) \right]
= \pi_{y_1}\, \sigma_{y_1 x_1} \left( \prod_{t=2}^{T} a_{y_{t-1} y_t} \right) \left( \prod_{t=2}^{T} e_{y_t x_t x_{t-1}} \right)    (29)

where \sigma_{y_1 x_1} = p(x_1 \mid y_1) is an additional parameter.

Using the properties of the graphical model in figure 5, we can readily show that the forward and backward coefficients keep largely the same expression as before:

\alpha_i(t) = p(x_{1:t}, y_t = i)
= \sum_{j=1}^{M} p(x_{1:t}, y_t = i, y_{t-1} = j) = \sum_{j=1}^{M} p(x_{1:t}, y_t = i \mid y_{t-1} = j)\, p(y_{t-1} = j)
= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\, p(x_{1:t-1}, y_{t-1} = j, y_t = i)
= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\, p(y_t = i \mid x_{1:t-1}, y_{t-1} = j)\, p(y_{t-1} = j, x_{1:t-1})
= \sum_{j=1}^{M} p(x_t \mid y_t = i, x_{t-1})\, p(y_t = i \mid y_{t-1} = j)\, \alpha_j(t-1)
= \sum_{j=1}^{M} e_{i x_t x_{t-1}}\, a_{ji}\, \alpha_j(t-1)    (30)

For the backward coefficients we define:

\beta_{ik}(t) = p(x_{t+1}, x_{t+2}, \ldots, x_T \mid y_t = i, x_t = k)
= \sum_{j=1}^{M} p(x_{t+2:T} \mid y_{t+1} = j, y_t = i, x_{t+1}, x_t = k)\, p(x_{t+1} \mid y_{t+1} = j, y_t = i, x_t = k)\, p(y_{t+1} = j \mid y_t = i, x_t = k)
= \sum_{j=1}^{M} \beta_{j x_{t+1}}(t+1)\, e_{j x_{t+1} k}\, a_{ij}    (31)

The following start conditions apply:

\alpha_i(1) = p(x_1, y_1 = i) = \pi_i\, \sigma_{i x_1}
\beta_{ik}(T) = p(x_{T+1:T} \mid y_T = i, x_T = k) = 1    (32)

We recall that the objective is to determine the hidden states behind each observed state.

We apply as before the Baum-Welch algorithm with its E and M steps modified according to the new model. In the E-step, we compute the expected log likelihood with respect to the old


parameters θold:

Q(\theta, \theta^{old}) = E_{p(Y \mid X, \theta^{old})}[\log p(X, Y \mid \theta)] = E_{p(Y \mid X, \theta^{old})}\!\left[ \log \left( \pi_{y_1}\, \sigma_{y_1 x_1} \prod_{t=2}^{T} a_{y_{t-1} y_t} \prod_{t=2}^{T} e_{y_t x_t x_{t-1}} \right) \right]
= E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{k=1}^{M} \pi_k^{I(y_1 = k)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{i=1}^{M} \prod_{j=1}^{N} \sigma_{ij}^{I(y_1 = i,\, x_1 = j)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{t=2}^{T} \prod_{i=1}^{M} \prod_{j=1}^{M} a_{ji}^{I(y_t = i,\, y_{t-1} = j)} \right] + E_{p(Y \mid X, \theta^{old})}\!\left[ \log \prod_{t=2}^{T} \prod_{i=1}^{M} \prod_{j=1}^{N} \prod_{k=1}^{N} e_{ijk}^{I(x_t = j,\, y_t = i,\, x_{t-1} = k)} \right]
= \sum_{k=1}^{M} E_{p(Y \mid X, \theta^{old})}[I(y_1 = k)] \log \pi_k + \sum_{i=1}^{M} \sum_{j=1}^{N} E_{p(Y \mid X, \theta^{old})}[I(y_1 = i, x_1 = j)] \log \sigma_{ij} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} E_{p(Y \mid X, \theta^{old})}[I(y_t = i, y_{t-1} = j)] \log a_{ji} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{N} E_{p(Y \mid X, \theta^{old})}[I(x_t = j, y_t = i, x_{t-1} = k)] \log e_{ijk}
= \sum_{k=1}^{M} p(y_1 = k \mid x_{1:T}, \theta^{old}) \log \pi_k + \sum_{i=1}^{M} \sum_{j=1}^{N} p(y_1 = i \mid x_{1:T}, \theta^{old})\, I(x_1 = j) \log \sigma_{ij} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old}) \log a_{ji} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{N} p(y_t = i \mid x_{1:T}, \theta^{old})\, I(x_t = j, x_{t-1} = k) \log e_{ijk}    (33)

We proceed in the same way as for the standard model and use the following notations:

\gamma_i(t) = p(y_t = i \mid x_{1:T}, \theta^{old})
\psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old})    (34)

The expression of the expected log likelihood becomes:

Q(\theta, \theta^{old}) = \sum_{k=1}^{M} \gamma_k(1) \log \pi_k + \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(1) \log \sigma_{ij} + \sum_{t=2}^{T} \sum_{i=1}^{M} \sum_{j=1}^{M} \psi_{ji}(t) \log a_{ji} + \sum_{t=2}^{T} \sum_{i=1}^{M} \gamma_i(t) \log e_{i x_t x_{t-1}}    (35)

As before, we use the conditional independence properties of the graphical model to express


γi(t) and ψji(t) as functions of the old parameters:

\gamma_i(t) = p(y_t = i \mid x_{1:T}, \theta^{old}) = p(y_t = i \mid x_{1:t}, x_{t+1:T}, \theta^{old})
= \frac{p(x_{t+1:T} \mid y_t = i, x_{1:t}, \theta^{old})\, p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, x_{1:t}, \theta^{old})\, p(y_t = j, x_{1:t} \mid \theta^{old})}
= \frac{p(x_{t+1:T} \mid y_t = i, x_t, \theta^{old})\, p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, x_t, \theta^{old})\, p(y_t = j, x_{1:t} \mid \theta^{old})}
= \frac{\beta_{i x_t}^{old}(t)\, \alpha_i^{old}(t)}{\sum_{j=1}^{M} \beta_{j x_t}^{old}(t)\, \alpha_j^{old}(t)}    (36)

\psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T}, \theta^{old}) = \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{p(x_{1:T} \mid \theta^{old})}
= \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})}
= \frac{p(x_t \mid y_t = i, x_{t-1}, \theta^{old})\, p(x_{t+1:T} \mid y_t = i, x_t, \theta^{old})\, p(y_t = i \mid y_{t-1} = j, \theta^{old})\, p(y_{t-1} = j, x_{1:t-1} \mid \theta^{old})}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})}
= \frac{e_{i x_t x_{t-1}}^{old}\, \beta_{i x_t}^{old}(t)\, a_{ji}^{old}\, \alpha_j^{old}(t-1)}{\sum_{i'=1}^{M} \sum_{j'=1}^{M} e_{i' x_t x_{t-1}}^{old}\, \beta_{i' x_t}^{old}(t)\, a_{j'i'}^{old}\, \alpha_{j'}^{old}(t-1)}    (37)

In the M-step, the new parameters \theta are calculated by maximizing the expected log likelihood found in the E-step (equation 35). Note that \pi_k and a_{ji} have the same coefficients as in the expression of the expected log likelihood in equation 11; therefore, their update rules remain unchanged:

\pi_k = \frac{\gamma_k(1)}{\sum_{i=1}^{M} \gamma_i(1)}    (38)

a_{ji} = \frac{\sum_{t=2}^{T} \psi_{ji}(t)}{\sum_{t=2}^{T} \gamma_j(t-1)}    (39)

The parameters that remain to be calculated are e_{ijk} and \sigma_{ij}. We use the same method as for the other parameters, i.e. we set the corresponding partial derivatives to zero:

\frac{\partial Q(\theta, \theta^{old})}{\partial \sigma_{ij}} = 0, \qquad \frac{\partial Q(\theta, \theta^{old})}{\partial e_{ijk}} = 0    (40)

The parameters \sigma_{ij} obey the constraint \sum_{j'=1}^{N} \sigma_{ij'} = 1; therefore we use Lagrange multipliers:

Q(\theta, \theta^{old}) = f(\sigma_{ij}) = \sum_{i=1}^{M} \sum_{j=1}^{N} \gamma_i(1) \log \sigma_{ij} + \sum_{i'=1}^{M} \lambda_{i'} \left( \sum_{j'=1}^{N} \sigma_{i'j'} - 1 \right) + \text{const.}    (41)


By taking the derivatives with respect to \sigma_{ij} and \lambda_i, we get:

\frac{\partial f(\sigma_{ij})}{\partial \sigma_{ij}} = 0 \;\Rightarrow\; \gamma_i(1) \frac{1}{\sigma_{ij}} + \lambda_i = 0 \;\Rightarrow\; \sigma_{ij} = -\frac{\gamma_i(1)}{\lambda_i}
\frac{\partial f(\sigma_{ij})}{\partial \lambda_i} = 0 \;\Rightarrow\; \sum_{j'=1}^{N} \sigma_{ij'} - 1 = 0 \;\Rightarrow\; \lambda_i = -\sum_{j'=1}^{N} \gamma_i(1) = -N \gamma_i(1)

\Rightarrow\; \sigma_{ij} = \frac{1}{N}    (42)

Finally, we compute the e_{ijk}, which are subject to the constraint \sum_{j'=1}^{N} e_{ij'k} = 1, so we can write:

Q(\theta, \theta^{old}) = f(e_{ijk}) = \sum_{t=2}^{T} \sum_{i=1}^{M} \gamma_i(t) \log e_{i x_t x_{t-1}} + \sum_{i'=1}^{M} \sum_{k'=1}^{N} \lambda_{i'k'} \left( \sum_{j'=1}^{N} e_{i'j'k'} - 1 \right) + \text{const.}    (43)

As before, we take the derivative with respect to eijk and λik:

\frac{\partial f(e_{ijk})}{\partial e_{ijk}} = 0 \;\Rightarrow\; \sum_{t:\, x_t = j,\, x_{t-1} = k} \gamma_i(t) \frac{1}{e_{ijk}} + \lambda_{ik} = 0 \;\Rightarrow\; e_{ijk} = -\frac{\sum_{t:\, x_t = j,\, x_{t-1} = k} \gamma_i(t)}{\lambda_{ik}}
\frac{\partial f(e_{ijk})}{\partial \lambda_{ik}} = 0 \;\Rightarrow\; \sum_{j'=1}^{N} e_{ij'k} = 1 \;\Rightarrow\; \lambda_{ik} = -\sum_{j'=1}^{N} \sum_{t:\, x_t = j',\, x_{t-1} = k} \gamma_i(t)

\Rightarrow\; e_{ijk} = \frac{\sum_{t:\, x_t = j,\, x_{t-1} = k} \gamma_i(t)}{\sum_{j'=1}^{N} \sum_{t:\, x_t = j',\, x_{t-1} = k} \gamma_i(t)}    (44)

We showed in relation 27 that the number of reads that map at a certain position can be split into two quantities: the reads that start at that position and the reads that continue from the previous position. We can model these two quantities with two known distributions: the Poisson distribution models the reads that start at the current position, while the Binomial distribution models the reads that continue from the previous position. This allows us to write the emission probabilities in the following manner:

e_{ijk} = p(x_t = j \mid y_t = i, x_{t-1} = k)
= \sum_{r=1}^{j} \underbrace{p(r \text{ reads start at } t)}_{\mathrm{Poisson}(r;\, c_i)} \times \underbrace{p(j - r \text{ reads carry on from } t-1)}_{\mathrm{Binomial}(j - r;\, k,\, p_c)}
= \sum_{r=1}^{j} \frac{c_i^{\,r}}{r!} e^{-c_i} \binom{k}{j - r} p_c^{\,j - r} (1 - p_c)^{k - (j - r)}    (45)

where p_c is the probability that a read spans two adjacent positions:

p_c = \frac{len - 1}{len}    (46)

where len is the length of a read. In order to find the maximum likelihood estimate of the parameter c_i, a gradient descent library has been used in the implemented algorithm. We have thus obtained the update rules for all the parameters of the HMM, and the Baum-Welch algorithm is completely defined.
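For illustration, equation 45 can be evaluated with a few lines of Python using scipy's Poisson and Binomial probability mass functions; the function and the example values below are illustrative only, and the summation bounds follow the formula above:

from scipy.stats import binom, poisson

def ar_emission(j, c_i, k, read_len):
    # e_{ijk} = sum_{r=1}^{j} Poisson(r; c_i) * Binomial(j - r; k, p_c)   (equation 45)
    p_c = (read_len - 1) / read_len      # probability that a read spans two adjacent positions
    return sum(poisson.pmf(r, c_i) * binom.pmf(j - r, k, p_c) for r in range(1, j + 1))

# Example: coverage 12 at position t, coverage 10 at t-1, Poisson rate 3, 50 bp reads
print(ar_emission(12, 3.0, 10, 50))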


3 Results and discussion

We have trained and tested the standard HMM as well as the auto-regressive HMM on the RNA-seq data of E. faecalis. The performance of the two algorithms proved to be completely different for this kind of data.

3.1 Performance of the two HMMs

The idea of an autoregressive HMM comes from a previous application on DNA-sequencing (DNA-seq) data for detecting copy number variants [9]. When applied to DNA-seq data, the auto-regressive model performs the same as the standard HMM model. The two models detect the same amplified regions as well as the background level of coverage depth (figure 6).

Figure 6: Application of the two models on the same stretch of DNA-seq coverage depth signal: (a) autoregressive HMM, (b) standard HMM, (c) superposition of the autoregressive and standard HMM. The two models perform largely the same and detect the amplified regions of the signal as well as the background level.


Figure 6 displays the result of applying the two HMM models on the same fragment of coverage depth from DNA-seq data. The predictions of the two models are the same, with a few small differences. Note the existence of a few artifacts (very short deletions or amplifications), especially with the standard model. These artifacts may require an additional smoothing operation or a slight modification of the transition probabilities a_{ji}.

Unlike for DNA-seq data, when applied to RNA-seq data of the E. faecalis bacterium, the two models behave completely differently (figure 7). The auto-regressive HMM fails to predict the correct amplification regions and returns a signal that is impossible to interpret. On the contrary, the standard model performs better and manages to detect continuous regions corresponding to the different transcripts, although accompanied by several artifacts.

Figure 7: Application of the two models on the first 10000 positions of the coverage depth signal obtained from RNA-seq data of the E. faecalis bacterium: (a) real coverage depth signal, (b) autoregressive HMM prediction, (c) standard HMM prediction. The two models perform completely differently: the standard model clearly surpasses the auto-regressive model, which fails to predict the amplifications as well as the background signal.

There may be several reasons for the unsatisfactory result with the autoregressive HMM. First of all, the RNA-seq coverage depth signal is different from the DNA-seq coverage signal. By looking at the real coverage signals (in blue) in figures 6 and 7, we can clearly see that the RNA-seq data is far more irregular than the DNA-seq data. Secondly, the distributions used to calculate the emission probabilities of the autoregressive HMM may not be the most appropriate ones. While the combination of Poisson and Binomial may be suitable for DNA-seq data, it may not be appropriate for RNA-seq data, which is subject to different biological phenomena such as transcript degradation. Other distributions, such as the Negative Binomial, could replace the Poisson component and might give better results, or a mixture of several distributions may be more suitable. Finally, the sequencing depth of the RNA-seq data is very low compared to DNA-seq data, which may be a limitation for the autoregressive HMM.

References
