
The Department of Physics, Chemistry and Biology

Master of Science Thesis in Bioinformatics

Unsupervised hidden Markov model for automatic analysis of expressed sequence tags

Andrei T. Alexsson

LiTH-IFM-A-EX--11/2553--SE

The Department of Physics, Chemistry and Biology, Linköpings universitet


Master of Science Thesis in Bioinformatics LiTH-IFM-A-EX--11/2553--SE

Unsupervised hidden Markov model for automatic analysis of expressed sequence tags

Andrei T. Alexsson

Supervisor: Lars Arvestad, KTH Royal Institute of Technology

Examiner: Bengt Persson, Linköping University


Avdelning, Institution / Division, Department:
The Department of Physics, Chemistry and Biology, Linköpings universitet, SE-581 83 Linköping, Sweden

Datum / Date: 2011-06-17

Språk / Language: Engelska / English

Rapporttyp / Report category: Examensarbete

URL för elektronisk version: http://www.ifm.liu.se/

ISRN: LiTH-IFM-A-EX--11/2553--SE

Titel / Title:
Oövervakad dold Markov modell för automatisk analys av EST-sekvenser
Unsupervised hidden Markov model for automatic analysis of expressed sequence tags

Författare / Author: Andrei T. Alexsson

Nyckelord / Keywords: Machine learning, Markov Model, Hidden Markov Model, Expressed sequence tag

Abstract

This thesis provides an in-depth analysis of expressed sequence tags (ESTs), which represent pieces of eukaryotic mRNA, by using an unsupervised hidden Markov model (HMM). ESTs are short nucleotide sequences that are used primarily for rapid identification of new genes with potential coding regions (CDS). ESTs are made by sequencing double-stranded cDNA, and the synthesized ESTs are stored in digital form, usually in FASTA format. Since sequencing is often randomized and parts of mRNA contain non-coding regions, some ESTs will not represent a CDS. It is desirable to remove these unwanted ESTs if the purpose is to identify genes associated with a CDS. Application of a stochastic HMM allows identification of the region contents of an EST. Software like ESTScan uses an HMM in which the training is done by supervised learning with annotated data. However, because annotated data are not always at hand, this thesis focuses on the ability to train an HMM with unsupervised learning on data containing ESTs both with and without CDS. The data used for training are not annotated, i.e. the regions that an EST consists of are unknown. In this thesis a new HMM is introduced where the parameters of the HMM are in focus, so that they are reasonably consistent with biologically important regions of an mRNA, such as the Kozak sequence, poly(A)-signals and poly(A)-tails, in order to guide the training and decoding of ESTs correctly to the proper states in the HMM. The transition probabilities in the HMM have been adapted so that they represent the mean length and distribution of the different regions in mRNA. Testing of the HMM's specificity and sensitivity has been performed via BLAST, by blasting each EST and comparing the BLAST results with the HMM prediction results. A regression analysis shows that the length of the ESTs used when training the HMM is significantly important: the longer, the better. The final results show that it is possible to train an HMM with unsupervised machine learning, but to be comparable to supervised machine learning such as ESTScan, further expansion of the HMM is necessary, such as frame-shift correction of ESTs by improving the HMM's ability to choose correctly positioned start codons or nucleotides. Usually the false positive results are caused by incorrectly positioned start codons leading to too short CDS lengths. Since no frame-shift correction is implemented, short predicted CDS lengths are not acceptable and are hence not counted as coding regions during prediction. However, when there is a lack of supervised models, an unsupervised HMM is a potential replacement with stable performance, able to be adapted for any eukaryotic organism.


Sammanfattning

In this master's thesis, EST sequences representing pieces of a eukaryotic mRNA have been analyzed using an unsupervised hidden Markov model (HMM). EST sequences are short nucleotide sequences used first and foremost for rapid identification of new genes with potentially coding regions (CDS). The sequencing is done on whole double-stranded cDNA as starting material, and the synthesized EST sequences are stored in digital form, usually in FASTA format. Since the sequencing usually occurs randomly, and parts of the mRNA contain non-coding regions, the consequence is that some EST sequences do not represent the coding regions, i.e. the CDS regions. These unwanted EST sequences should be filtered out if the purpose is to map genes with associated CDS. Application of a stochastic HMM enables identification of exactly these EST sequences. Programs such as ESTScan make use of an HMM where the training of the HMM is done through supervised machine learning with annotated data. However, annotated data are not always at hand, and in this thesis an HMM has instead been trained with unsupervised machine learning on data containing EST sequences, ESTs both with and without CDS. Since the data are not annotated, there is also no knowledge about the positions of the EST sequences in an mRNA, i.e. it is not known which regions an EST sequence consists of. A new HMM has therefore been introduced, with the focus placed on the HMM's parameters, so that they agree reasonably well biologically with important regions of an mRNA such as Kozak sequences, poly(A)-signals and poly(A)-tails, in order to facilitate the training and the decoding of the EST sequences over the correct states in the HMM. The transition probabilities in the HMM have been adapted so that they represent the mean length and the length distribution of the different regions in the mRNA. The testing of the HMM's specificity and sensitivity has been carried out via BLAST, by blasting each EST sequence and thereby comparing the BLAST results with the unsupervised HMM's prediction results. A regression analysis shows that the lengths of the EST sequences are significantly important during the training of the HMM: the longer, the better. The results show that it is possible to train an HMM with unsupervised machine learning, but further extension of the HMM is needed, such as error handling of EST sequences, for example improving the HMM's capacity to choose correctly positioned start codons or nucleotides. The false positive results are usually due to incorrectly positioned start codons that lead to far too short CDS lengths. Since error handling is not implemented, short predicted CDS lengths are not acceptable and are not counted as coding regions during a prediction. However, when there is a lack of precomputed models, an unsupervised HMM is a potential replacement with stable performance and is easy to adapt for all types of eukaryotic organisms.


Acknowledgments

I would like to thank Dr. Lars Arvestad for providing me the opportunity to do this thesis and for his support, and thanks to Prof. Bengt Persson for being my examiner. Thanks to my family for their endless support.


Contents

1 Introduction
  1.1 Background
  1.2 Thesis objectives
  1.3 Methods
  1.4 Delimitations
  1.5 Abbreviations
2 Principles of EST
  2.1 Eukaryotic mRNA structure
  2.2 Nucleotide distributions in mRNA
    2.2.1 Start and stop codon
    2.2.2 Poly(A)-signal and poly(A)-tail
    2.2.3 Kozak sequence
    2.2.4 GC content
  2.3 EST production - basic overview
    2.3.1 cDNA synthesis
    2.3.2 cDNA library
    2.3.3 EST synthesis
3 Principles of Hidden Markov Model
  3.1 Discrete HMM
  3.2 HMM order and periodicity
  3.3 HMM training
    3.3.1 Forward algorithm
    3.3.2 Backward algorithm
    3.3.3 Baum-Welch algorithm
  3.4 HMM decoding - 1-best algorithm
4 Experimental design
  4.1 HMM of mRNA
    4.1.1 5'UTR HMM
    4.1.2 Start codon HMM
    4.1.3 CDS HMM
    4.1.4 Stop codon HMM
    4.1.5 3'UTR HMM
    4.1.6 The complete HMM
    4.1.7 Implement HMM in XML
  4.2 Choosing appropriate EST-data
  4.3 Training and decoding
  4.4 The designed softwares
  4.5 Methods of working and tests
5 Result and discussion
  5.1 Designed model
  5.2 Effects of parameter changes
    5.2.1 Emission order
    5.2.2 Null model
    5.2.3 HMM expanding
  5.3 Comparing the lengths and 1-best probabilities for true and false matches
  5.4 Multivariate analysis
  5.5 Final results
    5.5.1 Lack of supervised models
6 Conclusion and future works
Bibliography
A Details about each cDNA library


List of Figures

2.1 Eukaryotic mRNA structure.
2.2 Average GC-content of different organisms for 5'UTR, 3'UTR and third position in codons. Hum (Human). OM (Other mammal). Rod (Rodent). OV (Other vertebrate). Inv (Invertebrate). Pl (Plant). Fun (Fungi). Figure from [14].
2.3 Basic schematic of cDNA synthesis.
2.4 EST sequencing.
3.1 Hidden Markov model topology.
3.2 3-periodic Hidden Markov model topology.
3.3 Forward algorithm directions and paths when obtaining probabilities. Red circle is the current state, green the past. Blue circles define start and end state. Black circle is not yet reached.
3.4 Forward algorithm directions and paths when obtaining the full probability that sequence β1β2 is generated from the model.
3.5 Backward algorithm directions.
3.6 Illustration of how to achieve A_{ξ1ξ1} from a given sequence β1β2β3, i.e. A_{ξ1ξ1} = (P1 + P2 + P3)/P(β1β2β3).
4.1 HMM of 5'UTR.
4.2 x-axis = length, y-axis = P(i).
4.3 HMM of start codon ATG.
4.4 HMM of CDS.
4.5 Negative binomial distribution. y-axis = P(c), x-axis = c.
4.6 Stop codon HMM.
4.7 3'UTR HMM.
4.8 x-axis = length, y-axis = P(i).
4.9 The complete HMM with all submodels and the two extra states N1 and N2.
4.10 5'read EST represents the sense strand and 3'read EST represents the antisense strand.
4.11 BLAST web interfaces.
4.12 CDS information and localization for a match.
4.13 FASTA format.
5.1 Differences between HMM and ESTScan according to specificity and sensitivity.
5.2 Green mark: start codon. Yellow mark: CDS. Red mark: stop codon. No mark: UTR.
5.3 Negative binomial distributions with different number of states for 5'UTR. Blue line: 2 states, p = 0.995. Red line: 3 states, p = 0.99. Green line: 4 states, p = 0.985. Pink line: 5 states, p = 0.98.
5.4 Histogram of 1-best log-probability for hits.
5.5 Histogram of length for hits.
5.7 Multivariate analysis results. Mean length variable significant at 5% level.
5.8 Mean length effect on sensitivity and specificity.
5.9 Residuals.
5.10 Graph showing the difference in parameters between two models with increasing number of generated sequences.
5.11 Mean values for specificity and sensitivity for the unsupervised HMM and ESTScan.
6.1

List of Tables

4.1 Emission probabilities for each state in 5'UTR HMM.
4.2 Emission probabilities for each state in start codon HMM.
4.3 How the emission probabilities are achieved for a second-order emission state. xy is any combination from the emission set.
4.4 Emission order of each state.
4.5 Emission probabilities for each state in CDS HMM.
4.6 Emission probabilities for each state in stop codon HMM. States 21-23 represent TAA, 24-26 represent TAG and 27-29 represent TGA.
4.7 Emission probabilities for each state in 3'UTR.
4.8 cDNA libraries from UniGene.
4.9 cDNA libraries from UniGene.
4.10 Variables used in regression analysis.
5.1 Results from BLAST. *: first five libraries ignored.
5.2 Comparing mean values for respective number of sequences.
5.3 Results given with different orders.
5.4 Results given with a null model with equalized probabilities and no fixed emissions.
5.5 Results when adding states to each region of the HMM.
5.6 Results via BLAST. *: first five libraries ignored.


Chapter 1

Introduction

1.1 Background

In order to understand how different organisms function from a biological perspective, an insight into their genetic processes is required. The possibility to identify genes with different high-throughput methods (microarrays, mass spectrometry) enables the characterization of gene functions and how they are expressed in various processes [3]. High-throughput technologies have evolved rapidly since the human genome project, and the focus today is mainly on using low-cost high-throughput methods [2]. It is rather expensive and complex to sequence whole genomes, and because of that other techniques have been considered, such as high-throughput EST. ESTs, or expressed sequence tags, are short biological sequences derived from partial or full-length cDNA. cDNA is mainly synthesized from mRNA, and because mRNA is built up by exons only, cDNA is important for gene discovery. Using an EST, which is a part of a cDNA, makes it possible to rapidly explore transcriptomes and genes in a more practical way than using whole full-length cDNA [3].

Consequently, high-throughput methods have generated a huge amount of EST data, which continuously increases. During the master's thesis project, more than 67 million ESTs in public entries at dbEST were available from different organisms. The amount of public entries motivates implementing powerful data mining utilities and machine learning methods that can quickly process and assemble the volume of ESTs that is continuously expanding. Data mining tools that can annotate unknown ESTs, which involves identification of protein-coding regions, are highly appreciated. Such tools have already been designed, like ESTScan, which uses a supervised HMM to decode ESTs and find their region contents [4]. A supervised HMM is an HMM trained with annotated data, where one has knowledge about the sequences used in the training data, for example the regions that can be found in each such sequence. However, in this thesis, the focus is on using an unsupervised HMM with data that is not annotated. The data contain EST sequences derived from different mRNAs, and by training an HMM with the Baum-Welch algorithm on those ESTs, a performance test will analyze how well the trained HMM can distinguish between ESTs with and without coding regions, by predictions using the 1-best decoding algorithm.

1.2 Thesis objectives

The thesis objective is to design an HMM in XML and software in Python that can train the HMM unsupervised from unannotated data, i.e. data containing ESTs derived from different mRNAs without any knowledge about the EST sequence contents. After training, the HMM shall be able to predict which ESTs in the unannotated data contain protein-coding regions.

Once a proper HMM and software have been developed, evaluation will be performed by comparing the model's specificity and sensitivity against the predictions from BLAST and ESTScan.

1.3 Methods

Literature was studied for a better understanding of the various regions of an EST sequence. In particular, the nucleotide distributions at important areas of an EST were examined to find parts of regions with common nucleotide distributions, for example the common start codon ATG at the beginning of the CDS. Literature studies also covered Markov models, HMMs and the various algorithms used in parameter estimation and decoding.

The development of the software was done in Python, and the HMM design was established in XML. The GHMM library from the Algorithmics group at the Max Planck Institute for Molecular Genetics, which is a C library that comes with Python wrappers, was used to incorporate various algorithms, such as 1-best and Baum-Welch, into the Python software, and for its support for HMM design.
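To make the pipeline concrete, here is a minimal sketch of how the GHMM Python bindings can be used for this workflow. The file name mrna_model.xml and the toy sequences are hypothetical, and Viterbi decoding stands in for the 1-best decoding used in the thesis:

```python
# Minimal sketch of the intended pipeline with the GHMM Python bindings.
# The file name "mrna_model.xml" and the toy sequences are hypothetical,
# and Viterbi decoding stands in here for the 1-best decoding used in
# the thesis.
import ghmm

sigma = ghmm.DNA                            # nucleotide alphabet a, c, g, t
model = ghmm.HMMOpen("mrna_model.xml")      # HMM topology designed in XML

# Training set: EST sequences (normally read from a FASTA file).
train = ghmm.SequenceSet(sigma, [list("atggcgctaaagtga"),
                                 list("gccaccatggtttaa")])

model.baumWelch(train)                      # unsupervised training

test = ghmm.EmissionSequence(sigma, list("gccaccatggcc"))
path, logp = model.viterbi(test)            # state path and log probability
print(path, logp)
```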

Evaluation of the sequences predicted by the software was done by comparison with BLAST results, which are used as reference. A comparison with ESTScan is also performed.

1.4 Delimitations

• The following two algorithms will be used, as implemented in the GHMM library:

Baum-Welch

1-best

• No modifications will be made on the Baum-Welch and 1-best algorithms.

• HMM topology will be static, i.e. the topology is not learned during training.


• The model will not provide support for frame-shift errors in EST sequences.

• The HMM shall consider the following features:

5’UTR sites

Start codon

CDS region, start codon and stop codon not included here.

Stop codon

3’UTR sites

Poly(A)-signal.

Poly(A)-tail.

• HMM applied only on eukaryotic organisms.

• cDNA libraries were downloaded from the UniGene database and used as EST-data.

• The developed software's features and requirements:

Linux is required as operating system.

EST-data need to be in FASTA-format.

HMM must be in XML-format and fully supported by GHMM-library functions.

Shall train HMM unsupervised.

After training, a prediction can occur with any test data (FASTA format).

The predicted EST-sequences containing CDS can be saved in a new file.

Able to sample sequences.

Estimate parameter differences between two models.

1.5 Abbreviations

3'UTR: Three prime untranslated region.

5'UTR: Five prime untranslated region.

A: Adenine.

BLAST: Basic Local Alignment Search Tool.

C: Cytosine.

cDNA: Complementary DNA.

CDS: Coding sequence.

db cDNA: Double-stranded complementary deoxyribonucleic acid.

dbEST: Expressed Sequence Tags database.

DNA: Deoxyribonucleic acid.

EST: Expressed sequence tag.

G: Guanine.

GHMM-library: The General Hidden Markov Model library.

HMM: Hidden Markov model.

poly(A): Polyadenosine.

RNA: Ribonucleic acid.

ss cDNA: Single-stranded complementary deoxyribonucleic acid.

T: Thymine.

U: Uracil.

XML: Extensible Markup Language.

Chapter 2

Principles of EST

2.1 Eukaryotic mRNA structure

mRNA is the biological template used for protein synthesis. Therefore, mRNA is a useful resource for exploration of new genes with potential coding regions. When mRNA is mature it consists only of exons, with fundamental regions that have biological functions. The structure of eukaryotic mRNA is basically composed of untranslated regions at both ends, i.e. the 5'UTR and 3'UTR. In between is the actual region that is translated to a protein, i.e. the CDS region (see figure 2.1). Within those regions there are elements, or typical common nucleotide distributions, such as the start codon AUG that can be found at the beginning of a CDS region, contributing to the translation start of a protein.


Figure 2.1. Eukaryotic mRNA structure.

The average length of a CDS is typically 1500 nucleotides, or 500 codons. The 5'UTR average length ranges between 100 and 200 nucleotides and is usually shorter than the 3'UTR average length, which ranges between 200 and 800 nucleotides (800 in humans) [12]. The 5'UTR is involved in regulation of the mRNA translation initiation [13].

From the other end, the 3'end, it basically consists of two important elements, a poly(A)-tail and a poly(A)-signal. A poly(A)-signal on the pre-mRNA (the primary transcript of DNA consisting of exons and introns) interacts with cleaving factors, together with the synthesis of the poly(A)-tail, forming the mature mRNA 3'end with repeated adenosines [6]. One of the poly(A)-tail functions is to stabilize the mRNA, especially against degradation, and to support subcellular localization. Indeed, the 3'UTR mainly regulates the stability and transportation of the mRNA [12]. The description of mRNA elements in each region has so far focused on generally conserved elements that occur in almost all eukaryotic organisms and are therefore relevant for this thesis. Other elements exist, but they vary too much in nucleotide distribution and location to be considered here.

2.2 Nucleotide distributions in mRNA

Each region in mRNA must have some partly general nucleotide patterns in order to interact with different molecules, for example ribosome units to initiate the translation. Moreover, to effectively use unsupervised machine learning methods to identify the different mRNA regions, knowledge of the common base compositions in each region is necessary. The nucleotide distributions described below can almost always be found in any eukaryotic organism and are hence important later on for the unsupervised HMM, where the emission parameters are initiated with the corresponding base compositions (see chapter 4).

2.2.1 Start and stop codon

Each organism requires a start and a stop codon to initiate and terminate the translation, respectively. The start codon is the first codon in the CDS and the stop codon the last codon.

The start codon in almost all eukaryotic mRNAs is composed of AUG. However, the RNA nucleotide U is never written in EST data; instead, an mRNA sequence is written as a regular DNA sequence, so U is typed as T. Thus the first codon in the CDS is determined as ATG [7].

Stop codons are composed in three different ways. These are determined as UAA->TAA, UAG->TAG and UGA->TGA. Only one of these is recognized as the stop codon in the last codon of the CDS [?].

2.2.2 Poly(A)-signal and poly(A)-tail

The nucleotide distribution in the poly(A)-signal is one of the most conserved in eukaryotic organisms. The base sequence is often AAUAAA->AATAAA.

The poly(A)-tail is a distribution of only adenosines. The number of adenosine residues can be several hundred, around 250 nucleotides [18].

The region between the polyadenylation signal and the poly(A) tail contains a variation of nucleotides, mostly adenosines, and ranges between 10 and 30 nucleotides [5].


2.2.3 Kozak sequence

It has been shown that a sequence flanking the start codon is mainly conserved in eukaryotic mRNAs, i.e. the Kozak sequence. The Kozak sequence strongly enhances translation initiation by signaling to the ribosome units that the flanked AUG is a true start codon. How the signaling works is unknown, but it is believed that the Kozak sequence slows the scanning of the ribosome unit, contributing to easier recognition of a start codon [11].

The Kozak sequence is mainly GCCRCCAUGG where R is a purine A or G, mostly A [9][10].
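As an illustration (not part of the thesis software), a start codon in a Kozak context can be located with a simple regular expression once the sequence is in DNA form (U written as T); the function name and the strict consensus pattern are assumptions:

```python
# Sketch: locate ATG start codons in a Kozak context GCCRCCATGG
# (R = A or G). The sequence is assumed to be in DNA form.
import re

KOZAK = re.compile(r"GCC[AG]CCATGG")

def kozak_starts(seq):
    """Return 0-based positions of the A in ATG for each Kozak match."""
    return [m.start() + 6 for m in KOZAK.finditer(seq.upper())]

print(kozak_starts("ttgccaccatggcg"))  # -> [8]
```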

2.2.4 GC content

GC content exhibits significant correlations within regions of mRNA. 5'UTRs are particularly high in GC content; the average GC content for human is more than 60%. While high in the 5'UTR, the 3'UTR obtains AT-richer distributions, i.e. depleted GC content, less than 45% in human (see figure 2.2). Moreover, the third position in codons is also high in GC content, often associated with the GC content in the 5'UTR: a lower GC content in the 5'UTR means a lower GC content at the third position in codons.

Invertebrates, plants and fungi frequently show a low GC content in both UTRs but still a high one in the CDS, all over 50% [14]. Therefore, it is important to include emission parameters in the model that represent the GC statistics, leading to a compromise of nucleotides G or C against A or T in the affected regions of the HMM.
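Since these GC statistics guide the initial emission parameters, a small helper for measuring the GC content of a region may be useful. This is a generic sketch, not code from the thesis:

```python
# Sketch: fraction of G and C in a sequence region, ignoring Ns and
# other ambiguity symbols.
def gc_content(seq):
    seq = seq.upper()
    acgt = [c for c in seq if c in "ACGT"]
    if not acgt:
        return 0.0
    return sum(c in "GC" for c in acgt) / float(len(acgt))

print(gc_content("GCGCATN"))  # 4 GC of 6 counted bases -> ~0.667
```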

2.3 EST production - basic overview

ESTs are short biological sequences derived from partial or full-length cDNA. cDNA is synthesized from mRNA, and because mRNA is composed of exons only, cDNA is important for gene discovery. Using an EST, which is a part of a cDNA, allows more rapid exploration of transcriptomes and genes in a more practical way than using whole full-length cDNA [3].

Here follows a very basic description of the cDNA synthesis procedure through to EST production. For more details, see [16] and [3].

2.3.1 cDNA synthesis

Synthesis of regular cDNA occurs in several steps. The first step involves the synthesis of ss cDNA by using mRNA as template and the enzyme reverse transcriptase. Reverse transcriptase is a kind of RNA-dependent DNA polymerase acting on an RNA strand and transcribing RNA to DNA. Reverse transcriptase can be found in retroviruses, and hence retroviruses are used as the source when isolating reverse transcriptase. Transcription starts directionally from the 3'end of the mRNA template, usually containing polyadenylation sites (if eukaryotic), and progresses towards the 5'end.

Figure 2.2. Average GC-content of different organisms for 5'UTR, 3'UTR and third position in codons. Hum (Human). OM (Other mammal). Rod (Rodent). OV (Other vertebrate). Inv (Invertebrate). Pl (Plant). Fun (Fungi). Figure from [14].

Once the antisense ss cDNA is synthesized, it needs to be converted to db cDNA, because ss cDNA cannot be cloned directly and db cDNA is better stabilized against degradation [3][19]. The ss cDNA is removed from the mRNA template and itself acts as template for another DNA polymerase, producing a double-stranded cDNA (see figure 2.3). The cDNA can now be taken to the next steps of cloning procedures and EST sequencing [16].

2.3.2 cDNA library

Considering that cDNA comprises only exons, cDNA libraries are an important tool used for identification of genetic information. A set of mRNAs from a specific type (organism, tissue) is collected for preparing a set of cDNAs. Afterwards, these are incorporated in a vector molecule within a host, such as a bacterial cell. The host cell replicates itself, generating a larger scale of cDNAs and producing a cDNA library [17].

2.3.3 EST synthesis

Figure 2.3. Basic schematic of cDNA synthesis.

ESTs are sequenced from cDNA. Sequencing of a cDNA can occur randomly from any position and from both directions; however, a full-length sequence is not obtained. Both strands in the cDNA can be used as probes, producing manifolds of 5'read EST sequences and 3'read EST sequences (see figure 2.4). A 5'read EST sequence represents the sense strand, or the mRNA strand, and a 3'read sequence represents the antisense strand [3].

EST-data from databases such as UniGene contain collections of multiple ESTs from different mRNAs. Each sequence is written 5' to 3', which needs to be considered when having EST-data with both 5'read and 3'read ESTs. For an HMM representing the mRNA strand in the 5' to 3' direction, a 3'read EST needs to be inverted, i.e. a 3'read EST must be converted to a 5'read EST before applying the data to the HMM.
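One way to perform this conversion is to reverse-complement each 3'read EST before training or decoding. A minimal sketch (function name assumed) follows:

```python
# Sketch: convert a 3'read EST to 5'read orientation by
# reverse complementation. Ambiguity codes map to N.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}

def to_5prime_read(seq):
    return "".join(COMPLEMENT.get(c, "N") for c in reversed(seq.upper()))

print(to_5prime_read("AATAAA"))  # -> TTTATT
```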

Figure 2.4. EST sequencing.

Chapter 3

Principles of Hidden Markov Model

A description of the mathematical background of HMMs is given here. Most of the formulas described are taken from [15] and [1].

3.1 Discrete HMM

When searching for biological patterns in the form of nucleotide sequences of organisms, the patterns will almost never be exact. Even within the same family of organisms there are variations in the nucleotide sequences of, for example, genes encoding the same protein. It is necessary to consider these fluctuations when trying to align sequences with the same provenance, or to be able to look up which region in the mRNA a sequence belongs to. A way to do this is to use stochastic models. Stochastic models take into account the random variations in a sequence for a common region or element, for example the poly(A)-signal in the 3'UTR part of the mRNA. Therefore, models with stochastic capacity are very important for prediction of sequences. An HMM is a kind of stochastic model. An illustration of an HMM topology can be seen in figure 3.1. It is composed of hidden states that are not observable. These hidden states can be interpreted as the regions in the mRNA. A state may undergo a transition to another state, and each state is able to emit one or more observations called emissions (such as nucleotides). For example, when a stop codon is reached, the next nucleotide is in the 3'UTR region; hence a state transition from CDS to 3'UTR has occurred. Furthermore, each transition and emission occurs within a probability distribution. When working with discrete emissions, as with nucleotides, the HMM is said to be discrete.

Figure 3.1. Hidden Markov model topology.

Let us assume we have S distinct states denoted ξ_1, ξ_2, ..., ξ_S, where each state emits one of O discrete emissions denoted β_1, β_2, ..., β_O. A state transition ξ_k to ξ_l occurs with a state transition probability a_{ξ_k ξ_l}, defined as a conditional probability, where the current state at time t (in a nucleotide sequence, time can be interpreted as the nucleotide position) is denoted x_t:

\[ a_{\xi_k \xi_l} = P[x_{t+1} = \xi_l \mid x_t = \xi_k], \qquad 1 \le k, l \le S, \; t \ge 1 \tag{3.1} \]

When in a current state ξ_k at time t, the probability e_{ξ_k}(β_i) that emission β_i is emitted is defined as:

\[ e_{\xi_k}(\beta_i) = P[\beta_i \text{ at } t \mid x_t = \xi_k], \qquad 1 \le k \le S, \; 1 \le i \le O, \; t \ge 1 \tag{3.2} \]

An HMM needs to have initial probabilities before proceeding with training or sampling of emissions, i.e. initial state distribution π, emission distribution B and transition distribution A [15].
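As a concrete picture of a model with distributions A, B and π, the sketch below builds a toy two-state discrete HMM over the DNA alphabet using the GHMM bindings; all probability values are arbitrary illustrations:

```python
# Sketch: a toy two-state discrete HMM lambda = (A, B, pi) over the DNA
# alphabet, built with the GHMM bindings. All probabilities are
# arbitrary illustrations.
import ghmm

sigma = ghmm.DNA                       # emissions: a, c, g, t
A = [[0.9, 0.1],                       # transition distribution A
     [0.2, 0.8]]
B = [[0.25, 0.25, 0.25, 0.25],         # emission distribution B per state
     [0.10, 0.40, 0.40, 0.10]]         # second state is GC-rich
pi = [0.5, 0.5]                        # initial state distribution pi

model = ghmm.HMMFromMatrices(sigma, ghmm.DiscreteDistribution(sigma),
                             A, B, pi)
print(model)
```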

3.2 HMM order and periodicity

An important property of an HMM is the dependence on previous states when calculating the outcome of the next state, i.e. the HMM's order. Equation 3.1 is a typical first-order HMM: the next state depends only on the current state. A second-order HMM depends on the previous two states, a third-order HMM on the previous three states, and so on. There are also states with higher emission order, i.e. states that contain information about previous emissions. This follows the same principle as higher-order state HMMs: a second-order emission state depends on the previous two emissions, a third-order emission state on the three previous emissions, and so on. Normally, a state is of emission order 0, meaning that the state has no memory of the previous emissions. But in this thesis the focus will be on using an increased emission order for the states modeling the coding region of mRNA (more on this in chapter 4). The outcome probability of the next emission e_{ξ_k}(β_i) at t + 1, where the state has a higher emission order n, is defined as:

\[ e_{\xi_k}(\beta_i \mid \beta_1 \ldots \beta_n) = P[\beta_i \text{ at } t+1 \mid x_{t+1} = \xi_k, \; \beta_1 \text{ at } t, \ldots, \beta_n \text{ at } t-n+1], \qquad 1 \le k \le S, \; 1 \le i \le O \tag{3.3} \]

If a state ξ_k can only be reached after d steps, where d is an integer and d > 1, then the state ξ_k is said to be periodic with period d. Figure 3.2 shows a typical 3-periodic hidden Markov model: the same state can be reached after three steps [15].

Figure 3.2. 3-periodic Hidden Markov model topology.

3.3 HMM training

An HMM's ability to recognize biological sequences derived from any organism depends on the parameters used in the HMM. The parameters must be estimated so that they can represent the sequences one is interested in finding. An HMM can be trained with specific sequences, say a specific mRNA from human brain. By training an HMM with that mRNA, the parameters of the HMM will be estimated in such a way that they represent the nucleotide distributions of that specific mRNA (if the HMM is built that way). When one then has an unknown sequence, one can apply it to the trained HMM, and the HMM will predict whether or not the unknown sequence represents the mRNA from human brain. The sequences that are used for training the HMM are called a training set. New unknown sequences that are used for testing the HMM's ability to predict new sequences are called a test set. There are different ways to train an HMM with sequences, but most of the time an HMM is trained automatically with a training algorithm. The parameters of the HMM (transition probabilities, emission probabilities, initial state probabilities) must be initiated with values before training. During the training, the parameters are then estimated until some kind of threshold is reached, often the difference in model parameters between iterations i and i + 1. If the difference is sufficiently small, the iteration stops. In this thesis the focus is on using a training algorithm called Baum-Welch. Baum-Welch is an iterative training algorithm where emission sequences (such as EST sequences) are used as training data. The Baum-Welch algorithm is complex and itself contains two fundamental algorithms, namely the forward and backward algorithms [1]. Before proceeding further with the explanation of how Baum-Welch works, we need to understand the forward and backward algorithms together with some concepts and mathematical background, which are described below.

3.3.1 Forward algorithm

The full forward algorithm calculates the probability that a sequence X is generated from a model λ over all possible paths in S, i.e. several paths can emit the same sequence X. The model λ consists of the parameters A, B and π. Let us assume the sequence X = x_1 x_2 ... x_L is of length L and contains any of the emissions in O, where an emission in O can occur more than once in X. Each position in X with emission x_i emitted from a specific state ξ_l of S is then calculated recursively until reaching length L, or any desired length of X. The probability that the partial sequence x_1 ... x_i in X ends in state ξ_l with emission x_i is acquired recursively with the equation:

\[ f_{\xi_l}(i) = e_{\xi_l}(x_i) \sum_{k=1}^{S} f_{\xi_k}(i-1)\, a_{\xi_k \xi_l}, \qquad 1 \le l \le S \tag{3.4} \]

Indeed, the recursive equation 3.4 is used when one is interested in the probability of ending in a specific state with a specific emission. The recursion always starts from position one in X and continues forward to a specified length, hence the name forward algorithm. The procedure is initialized with a state called the begin state, i.e. when k = 0. The begin state does not emit any emissions (nor does the end state ξ_E).

The recursive steps of the full algorithm calculate all possible paths emitting the whole sequence X and sum them to obtain the probability that X is generated by model λ:

Initialization (i = 0):
\[ f_{\xi_0}(0) = 1, \qquad f_{\xi_k}(0) = 0 \ \text{for } k > 0 \]

Recursion (i = 1, ..., L):
\[ f_{\xi_l}(i) = e_{\xi_l}(x_i) \sum_{k=1}^{S} f_{\xi_k}(i-1)\, a_{\xi_k \xi_l} \]

Termination:
\[ P(X) = \sum_{k=1}^{S} f_{\xi_k}(L)\, a_{\xi_k \xi_E}, \qquad \xi_E = \text{end state} \]

Figure 3.3. Forward algorithm directions and paths when obtaining probabilities. Red circle is the current state, green the past. Blue circles define start and end state. Black circle is not yet reached.

Figure 3.3 illustrates the paths and directions needed to obtain the probability that emission β_2 is emitted from state ξ_1. The emission sequence is β_1 β_2 and there are two hidden states, ξ_1 and ξ_2. As can be seen, the sum of the paths (f_{ξ_1}(1) a_{ξ_1 ξ_1} + f_{ξ_2}(1) a_{ξ_2 ξ_1}) multiplied with the emission probability e_{ξ_1}(β_2) gives the referred probability (red circle).

Figure 3.4. Forward algorithm directions and paths when obtaining the full probability that sequence β_1 β_2 is generated from the model.

When it is desired to have the probability that sequence β_1 β_2 belongs to the model, all states are reached and the paths are summed (figure 3.4), i.e. P(β_1 β_2) = f_{ξ_1}(2) a_{ξ_1 ξ_E} + f_{ξ_2}(2) a_{ξ_2 ξ_E} [1][15].
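The recursion translates almost line by line into code. Below is a plain NumPy sketch of the forward algorithm with explicit begin and end state handling; the variable names and toy parameters are illustrative, not taken from the thesis software:

```python
# Sketch: forward algorithm (equation 3.4) for a discrete HMM.
# A[k][l]: transition probabilities, B[k][o]: emission probabilities,
# pi[k]: begin-state -> k transitions, a_end[k]: k -> end-state transitions.
import numpy as np

def forward(A, B, pi, a_end, x):
    S, L = len(pi), len(x)
    f = np.zeros((L, S))
    f[0] = pi * B[:, x[0]]                  # leave the begin state, emit x_1
    for i in range(1, L):
        f[i] = B[:, x[i]] * (f[i - 1] @ A)  # sum over all predecessor states
    return f, float(f[L - 1] @ a_end)       # forward table and P(X)

# Tiny worked example: two states, binary alphabet {0, 1}.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
a_end = np.array([0.5, 0.5])
f, px = forward(A, B, pi, a_end, [0, 1])
print(px)  # 0.088
```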

3.3.2 Backward algorithm

While the forward algorithm starts the recursion from the beginning of a sequence, the backward algorithm does the opposite, i.e. the recursion starts from the end of a sequence. However, here we are interested in the probability that the next states emit the remaining sequence x_{i+1} ... x_L when in current state ξ_k with emission x_i, i.e. we want to find b_{ξ_k}(i). The total probability P(X) is the same as the one from the forward algorithm. The recursive equations are the following:

Initialization (i = L):
\[ b_{\xi_k}(L) = a_{\xi_k \xi_E} \ \text{for all } k \]

Recursion (i = L-1, ..., 1):
\[ b_{\xi_k}(i) = \sum_{l=1}^{S} a_{\xi_k \xi_l}\, e_{\xi_l}(x_{i+1})\, b_{\xi_l}(i+1) \]

Termination:
\[ P(X) = \sum_{l=1}^{S} a_{\xi_0 \xi_l}\, e_{\xi_l}(x_1)\, b_{\xi_l}(1) \]

Figure 3.5(a) shows the backward recursion to obtain the probability that the next states emit emission β_2, and the full probability P(β_1 β_2) is shown in figure 3.5(b) [1].
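For symmetry, here is a matching NumPy sketch of the backward recursion, using the same conventions and toy parameters as the forward sketch above; it returns the same P(X):

```python
# Sketch: backward algorithm, companion to the forward sketch above.
# b[i][k] is the probability of emitting x_{i+1}...x_L given state k at i.
import numpy as np

def backward(A, B, pi, a_end, x):
    S, L = len(pi), len(x)
    b = np.zeros((L, S))
    b[L - 1] = a_end                            # initialization at i = L
    for i in range(L - 2, -1, -1):
        b[i] = A @ (B[:, x[i + 1]] * b[i + 1])  # sum over successor states
    return b, float(pi @ (B[:, x[0]] * b[0]))   # backward table and P(X)

A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
a_end = np.array([0.5, 0.5])
b, px = backward(A, B, pi, a_end, [0, 1])
print(px)  # 0.088, identical to the forward result
```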

3.3.3 Baum-Welch algorithm

The Baum-Welch algorithm estimates the emission and transition parameters with the help of a training set containing a number of emission sequences. The transition parameter a_{ξ_k ξ_l} is estimated by first summing the probabilities that the state transition ξ_k to ξ_l occurs in the training data, over each position i in sequence X and over each sequence X in the training data. Then divide by the sum of the probabilities that ξ_k transits to each possible ξ_{l'} in S at each position i in sequence X, for each sequence X in the training data, i.e. formally:

\[ a_{\xi_k \xi_l} = \frac{A_{\xi_k \xi_l}}{\sum_{l'=1}^{S} A_{\xi_k \xi_{l'}}} \tag{3.5} \]

To obtain A_{ξ_k ξ_l} we need to combine the forward with the backward algorithm and then calculate the posterior probabilities. First we are interested in the probability that an observed sequence X is generated from a model where x_i and x_{i+1} are emitted from states ξ_k and ξ_l respectively:

\[ P(X, x_i = \xi_k, x_{i+1} = \xi_l) = f_{\xi_k}(i)\, a_{\xi_k \xi_l}\, e_{\xi_l}(x_{i+1})\, b_{\xi_l}(i+1) \tag{3.6} \]

Figure 3.5. Backward algorithm directions. (a) Red circle corresponds to the probability b_{ξ_1}(1). (b) Full probability P(β_1 β_2).

The probability that the state transition a_{ξ_k ξ_l} occurs from position i in sequence X is then obtained from the posterior probability, where P(X) can be obtained through the forward or backward algorithm:

\[ P(x_i = \xi_k, x_{i+1} = \xi_l \mid X) = \frac{f_{\xi_k}(i)\, a_{\xi_k \xi_l}\, e_{\xi_l}(x_{i+1})\, b_{\xi_l}(i+1)}{P(X)} \tag{3.7} \]

Figure 3.6 illustrates the principles of obtaining A_{ξ_k ξ_l}. Here, I have assumed only one sequence β_1 β_2 β_3 in the training set. Obtaining the probability that this sequence is generated by the model from state ξ_1 to ξ_1, representing sequence β_1 β_2, is easily done by summing all the related paths multiplied by the emission probability of β_2 at ξ_1.

Furthermore, to find directly the probability that β_1 β_2 is emitted from states ξ_1 and ξ_1 respectively, i.e. the probability of a_{ξ_1 ξ_1} from position one in the given observed sequence β_1 β_2 β_3, we divide P(β_1 β_2 β_3, β_1 = ξ_1, β_2 = ξ_1) by P(β_1 β_2 β_3).

We repeat those steps, again calculating the posterior probability that the same state transition ξ_1 to ξ_1 occurs but instead from position two (β_2) (see figures 3.6(b) and 3.6(c)). The steps are repeated until reaching the end of the emission sequence. Each posterior probability obtained is then summed, giving the searched A_{ξ_1 ξ_1} = (P_1 + P_2 + P_3)/P(β_1 β_2 β_3). Of course, if we had more than one sequence, we would as well sum the probabilities over all sequences, i.e. to obtain A_{ξ_k ξ_l} we sum over all positions and over all sequences in the training set. A_{ξ_k ξ_l} can therefore be defined as:

\[ A_{\xi_k \xi_l} = \sum_j \frac{1}{P(X^j)} \sum_i f^j_{\xi_k}(i)\, a_{\xi_k \xi_l}\, e_{\xi_l}(x^j_{i+1})\, b^j_{\xi_l}(i+1) \tag{3.8} \]

where j is the sequence and i the position in j.

The emission parameter e_{ξ_k}(x) is estimated in a similar way as in 3.5: the probability that emission x is emitted from state ξ_k in the training data, divided by the sum of probabilities of each possible emission β_{i'} that takes place in state ξ_k:

\[ e_{\xi_k}(x) = \frac{E_{\xi_k}(x)}{\sum_{i'=1}^{O} E_{\xi_k}(\beta_{i'})} \tag{3.9} \]

E_{ξ_k}(x) is obtained analogously to 3.8. The definition is:

\[ E_{\xi_k}(x) = \sum_j \frac{1}{P(X^j)} \sum_{\{i \,:\, x^j_i = x\}} f^j_{\xi_k}(i)\, b^j_{\xi_k}(i) \tag{3.10} \]

where we are only interested in positions i that emit x, as can be seen in the inner sum.

We are now ready to summarize the Baum-Welch algorithm in the following iteration steps:

1. Initialization:
Initialize the transition and emission parameters with values, i.e. the implementation of the matrices A, B and π.

2. New parameter estimation:
Estimate all a_{ξ_k ξ_l} and e_{ξ_k}(β_i) through 3.5 and 3.9.

3. Log-likelihood probability of the new model:
Calculate the probability that the training data belongs to the new model λ with the new parameters A, B and π. Logarithmic probabilities are used to avoid numerical problems. The log-likelihood probability is achieved with:

\[ \sum_j \log P(X^j \mid \lambda) \tag{3.11} \]

4. Convergence criterion:
A convergence criterion is needed to end the iteration. Termination occurs when the difference in the log-likelihood of the model between two iterations is sufficiently small.

Conclusively, the convergence of Baum-Welch is strongly dependent on the initial parameters, i.e. Baum-Welch reaches different local maxima. It is therefore important to keep in mind, when using an unsupervised HMM, that the initial parameters should represent something of reasonable biological significance. Furthermore, to avoid underflows, the forward and backward variables are scaled, keeping the values in a reasonable interval [1].
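Putting equations 3.5 to 3.10 together, one re-estimation pass can be sketched as below, reusing the forward() and backward() helpers from the previous sketches. Note that this sketch omits the scaling discussed above, so it is only safe for short toy sequences:

```python
# Sketch: one Baum-Welch re-estimation pass (equations 3.5-3.10), reusing
# forward() and backward() from the sketches above. No scaling is done,
# pi and a_end are kept fixed for brevity, and every state is assumed
# to be visited at least once.
import numpy as np

def baum_welch_step(A, B, pi, a_end, seqs):
    S, O = B.shape
    acc_A = np.zeros((S, S))              # A_{kl} accumulators (eq. 3.8)
    acc_E = np.zeros((S, O))              # E_k(x) accumulators (eq. 3.10)
    for x in seqs:
        f, px = forward(A, B, pi, a_end, x)
        b, _ = backward(A, B, pi, a_end, x)
        for i in range(len(x) - 1):       # expected transition counts
            acc_A += np.outer(f[i], B[:, x[i + 1]] * b[i + 1]) * A / px
        for i, o in enumerate(x):         # expected emission counts
            acc_E[:, o] += f[i] * b[i] / px
    A_new = acc_A / acc_A.sum(axis=1, keepdims=True)   # eq. 3.5
    B_new = acc_E / acc_E.sum(axis=1, keepdims=True)   # eq. 3.9
    return A_new, B_new
```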

3.4 HMM decoding - 1-best algorithm

When having data containing ESTs, we are interested in which regions of the mRNA each EST sequence represents. The hidden states in the HMM represent the regions in the mRNA and therefore cannot be directly observed, i.e. we cannot tell the region contents of an EST sequence by viewing it directly. We must instead decode each EST on the HMM and calculate the most likely regions that generate the specific EST sequence. For this purpose we can once again use an iterative algorithm, namely the 1-best algorithm. The 1-best algorithm calculates all hidden state sequences that may give the observed EST sequence and picks the one that gives the highest probability. It calculates probabilities that an emission is emitted by a hidden state.

The 1-best algorithm is described as follows:

1. Start the recursion at i = 0 and continue to i = L:

\[ \gamma_l(\text{label}_j) = \Big( \sum_k a_{\xi_k \xi_l}\, \gamma_k(\text{label}_i) \Big)\, e_{\xi_l}(x_{i+1}) \]

γ_l(label_j) is the probability that the next emission x_{i+1} is emitted from the hidden state l with label j. γ_k(label_i) is the highest probability picked for state k at the current emission i. The label represents the region of mRNA in this case. γ_l(label_j) is calculated for each state in the HMM and with all labels. Unfortunately, the states in the unsupervised HMM have only one label each, so there will be only one estimated γ_l(label_j) for each state l.

2. Summing all probabilities:
Sum the probabilities of each finally estimated γ_l(label_j) over the states and save the one with the highest value.

ESTs can be several hundred nucleotides long; therefore, using the original 1-best algorithm can produce numerical underflows. To avoid this, the algorithm is logarithmized [8].
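The log-space trick can be illustrated on the closely related Viterbi recursion (single best state path), shown below with the same matrix conventions as the earlier sketches; the actual 1-best algorithm, as provided by the GHMM library, additionally sums over state paths that share the same label sequence:

```python
# Sketch: log-space decoding to avoid numerical underflow on long ESTs.
# This is plain Viterbi (single best state path), not the full 1-best
# algorithm, which also sums paths sharing a label sequence.
import numpy as np

def viterbi_log(A, B, pi, x):
    with np.errstate(divide="ignore"):   # log(0) -> -inf is acceptable
        lA, lB, lpi = np.log(A), np.log(B), np.log(pi)
    L, S = len(x), len(pi)
    v = np.full((L, S), -np.inf)
    back = np.zeros((L, S), dtype=int)
    v[0] = lpi + lB[:, x[0]]
    for i in range(1, L):
        scores = v[i - 1][:, None] + lA  # scores[k, l] = v(k) + log a_kl
        back[i] = scores.argmax(axis=0)  # best predecessor for each state
        v[i] = scores.max(axis=0) + lB[:, x[i]]
    path = [int(v[-1].argmax())]         # backtrace from the best end state
    for i in range(L - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1], float(v[-1].max())
```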

Figure 3.6. Illustration of how to achieve A_{ξ_1 ξ_1} from a given sequence β_1 β_2 β_3, i.e. A_{ξ_1 ξ_1} = (P_1 + P_2 + P_3)/P(β_1 β_2 β_3).

Chapter 4

Experimental design

4.1 HMM of mRNA

A detailed description of the design of the HMM representing mRNA is given here. Because mRNA contains several different regions, the HMM is constructed from several sub-models representing each region of the mRNA (see figure 2.1). The various transitions between states and all distinct parameter values implemented will of course be explained. A simple HMM is initiated with as few states as possible; later on, a further expansion of the HMM is performed.

4.1.1 5'UTR HMM

Figure 4.1 illustrates the proposed eight-state HMM for the entire 5'UTR section. States one and two model the 5'UTR over a length distribution. States three to eight represent the Kozak sequence before the start codon (see chapter 2).

The first two states have a transition to themselves so that any length can be modeled. Note that two consecutive states are chosen to avoid the geometrical distribution. A geometric distribution is dependent on the length, i.e. the longer the length, the lower the probability. Thus, when having only one state and trying to model a long 5'UTR sequence, the risk is that the transition occurs too soon from that state to the Kozak sequence states, and subsequently to the CDS region. The consequence is that some part of the 5'UTR sequence is modeled as a CDS region instead of a 5'UTR region. Using two states instead induces the negative binomial distribution, which captures length distributions better.

The transition probabilities in states one and two are tied (equal probabilities in both states) and will be chosen to fit an approximately bell-shaped negative binomial distribution whose maximum represents the mean length of the 5'UTR. The sum of the transition probabilities in each state is always one, as for the emission probabilities. The probability density for a geometrical distribution when having only one state is

\[ P(i) = a_{kk}^{\,i-1}(1 - a_{kk}) \]

where i is the length of a sequence. However, using two or more states gives the probability density function

\[ P(i) = \binom{i-1}{n-1} a_{kk}^{\,i-n} (1 - a_{kk})^n \]

where n is the number of states and a_{kk} is the same for all states.

After testing (by plotting different values of a_{kk} in MATLAB), a_{11} = a_{22} = 0.995 has been shown to produce a reasonable negative binomial distribution, considering that the 5'UTR average length is set to 200 nucleotides (see figure 4.2(a)). However, using only one state with the same transition probability gives the geometrical distribution seen in figure 4.2(b), where the probability shrinks exponentially with length. Reducing the values of a_{11} and a_{22} gives a more geometrical distribution; increasing the values helps capture longer lengths.

Figure 4.1. HMM of 5'UTR.

Figure 4.2. x-axis = length, y-axis = P(i). (a) Negative binomial distribution. (b) Geometrical distribution.
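The MATLAB plotting experiment is easy to reproduce numerically; the sketch below evaluates the two length densities from the formulas above (function name mine) so that the effect of different a_{kk} values can be inspected:

```python
# Sketch: the two length densities from the formulas above, evaluated in
# Python rather than the thesis's MATLAB. n chained states with a tied
# self-transition probability a give a negative binomial length
# distribution; n = 1 reduces to the geometric distribution.
from math import comb

def length_density(i, a, n):
    """P(i): probability that the n-state chain emits exactly i symbols."""
    return comb(i - 1, n - 1) * a ** (i - n) * (1 - a) ** n

a = 0.995
for i in (50, 200, 400, 800):
    print(i, length_density(i, a, 2), length_density(i, a, 1))
```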

I have chosen emission order 0 because the 5'UTR varies greatly in its nucleotide positions and rarely follows a specific pattern, i.e. a nucleotide's position is not dependent on the preceding nucleotides. The Kozak sequence is also of emission order 0, where the emission probabilities are set to be high for those specific nucleotides that occur in the Kozak sequence. By using emission order 0 in the 5'UTR HMM, unnecessary parameters can be avoided.

The initial transition probabilities for all states can be seen in figure 4.1.

The emission parameters of the first two states reflect the GC statistics mentioned in chapter two. This means that the emission probabilities of nucleotides G and C are slightly higher than those of A and T. States three to eight of course reflect the Kozak sequence. The overall initial emission probabilities are shown in table 4.1. A matrix B_{ξ_k} holds the initial probabilities for each state and is implemented in the XML file. The emission probabilities are somewhat arbitrarily chosen but follow the GC content and Kozak sequence statistics.

Table 4.1. Emission probabilities for each state in 5'UTR HMM.

          A       G       C       T       N
State 1   0.2     0.2995  0.2995  0.2     0.001
State 2   0.2     0.2995  0.2995  0.2     0.001
State 3   0.0003  0.999   0.0003  0.0003  0.0001
State 4   0.0003  0.0003  0.999   0.0003  0.0001
State 5   0.0003  0.0003  0.999   0.0003  0.0001
State 6   0.8     0.1     0.045   0.045   0.001
State 7   0.0003  0.0003  0.999   0.0003  0.0001
State 8   0.0003  0.0003  0.999   0.0003  0.0001

The emission N stands for "any nucleotide" or "unknown nucleotide" and is one of the many denotations in FASTA format for nucleotide representation. Some positions in a sequence are, for various reasons, represented with an N, for example because of sequencing errors. Additional ambiguity symbols are "R", "Y", "K", "M", "S", "W", "B", "D", "H", "V" and "X". When these are found in a sequence, they are converted to an "N", simply interpreted as an unknown nucleotide at that specific position. If the emission parameters were extended to represent all symbols, the HMM would become too complex, contributing to heavy calculations when estimating the parameters through Baum-Welch. The probability for N is set to be low in all states because it rarely occurs at the same nucleotide positions.

4.1.2 Start codon HMM

The start codon model is designed as a three-state HMM representing the nucleotide sequence ATG. The emission parameters of this model are fixed, i.e. they are not modified during training with Baum-Welch, because I simply assume that ATG always starts the CDS. State 9 has a fixed high emission probability for A, state 10 for T and state 11 for G. Because the emission probabilities are fixed, the start codon HMM does not need a higher emission order, thus the order equals zero.

Figure 4.3. HMM of start codon ATG.

Table 4.2. Emission probabilities for each state in start codon HMM.

           A       G       C       T       N
State 9    0.999   0.0003  0.0003  0.0003  0.0001
State 10   0.0003  0.0003  0.0003  0.999   0.0001
State 11   0.0003  0.999   0.0003  0.0003  0.0001

As can be seen in table 4.2, the probabilities of the remaining emissions are not exactly zero, in order to avoid problems with overfitting. For example, if a decoding proposes that a nucleotide T is emitted from state 9, the probability that the whole sequence belongs to the model would be zero. Decoding with the 1-best algorithm would give a zero probability because of product multiplications with zero in the recursion equation (see chapter 3). Hence, to avoid overfitting problems, I add extra initial values to the emission probabilities.

4.1.3 CDS HMM

The yellow states in figure 4.4 hold high emission probabilities for nucleotides G and C relative to A and T, because of the GC statistics mentioned in chapter 2. The blue states hold equally distributed emission probabilities for the nucleotides A, C, G and T. State 12 represents the nucleotide that immediately follows the start codon and is the last nucleotide of the Kozak sequence, namely G, which therefore has a high emission probability relative to the other emissions.

States 15-17 and 18-20 represent the codons. A codon is made up of three nucleotides and encodes a specific amino acid during translation. Thus, states 15-17 reflect positions one to three in a codon, and the same applies to states 18-20. The two codon models are periodic with period three through the transitions from states 17 and 20. State 17 has a transition back to state 15, i.e. back into its codon model, or a transition to the next codon model. Similarly, there is a transition from state 20 back to state 18, or a transition to a stop codon (next section). This three-periodic system therefore captures the codon structure of a coding region in a proper way.

Stop codons can be prevented from being emitted in the CDS HMM by changing the values of P(G|TA), P(T|AA) and P(A|TG) to be exactly zero.

On the same principle as in the 5'UTR HMM, I used two codon models to avoid the geometrical distribution. The average length of a CDS is approximately 1500 nucleotides, or 500 codons. A length distribution that covers on average 500 codons must therefore be implemented. Using the formula

\[ P(c) = \binom{c-1}{m-1} a_{kl}^{\,c-m} (1 - a_{kl})^m \]

where c is the number of codons and m is the number of codon models, I have through testing found that a_{17,15} = a_{20,18} = 0.9981 gives a fairly negative binomial distribution where 500 codons has the highest probability (see figure 4.5).

Figure 4.4. HMM of CDS.

Figure 4.5. Negative binomial distribution. y-axis = P(c), x-axis = c.

The states in the codon models are of higher emission order because the purpose is to capture the CDS structure; the emission order is therefore set to 2. A second-order emission means that each state depends on a history of the previous two emissions. The emission set {A, G, C, T, N} may occur in 5² = 25 different ways over the previous two emissions. This means that each emission of states 15 to 20 can be estimated in 25 different ways, which gives a total of 25 · 5 = 125 different emission parameters (see table 4.3). For example, the probability P(A|GG) is the probability that A occurs given the sequence GG; another probability is P(A|GC), contributing to 25 different emission probabilities for emission A. This also applies to the remaining emissions.
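To make the bookkeeping of the 25 contexts concrete, the sketch below lays out second-order emission frequencies by simple counting from a couple of toy sequences; this only illustrates the parameter layout, since in the thesis these parameters are estimated with Baum-Welch:

```python
# Sketch: layout of the 125 second-order emission parameters
# (25 two-symbol contexts x 5 emissions), filled here by simple counting
# from toy sequences. Illustration of the parameter layout only.
from collections import defaultdict
from itertools import product

ALPHABET = "AGCTN"
counts = defaultdict(lambda: defaultdict(float))

for seq in ("ATGGCGCTAAAG", "ATGCCGGGCTGA"):
    for i in range(2, len(seq)):
        counts[seq[i - 2:i]][seq[i]] += 1        # context xy -> emission

for xy in product(ALPHABET, repeat=2):           # the 25 contexts
    ctx = "".join(xy)
    total = sum(counts[ctx].values())
    if total:
        probs = {e: counts[ctx][e] / total for e in ALPHABET}
        print(ctx, probs)
```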

Table 4.3. How the emission probabilities are achieved for a second-order emission state. xy is any combination from the emission set.

A           G           C           T           N
P(A|xy_1)   P(G|xy_1)   P(C|xy_1)   P(T|xy_1)   P(N|xy_1)
P(A|xy_2)   P(G|xy_2)   P(C|xy_2)   P(T|xy_2)   P(N|xy_2)
...         ...         ...         ...         ...
P(A|xy_25)  P(G|xy_25)  P(C|xy_25)  P(T|xy_25)  P(N|xy_25)

Total emission parameters: 125

However, states 12-14 are of zero-order emission. State 12 does not need to depend on the previous emissions because there is already historical knowledge that the previous emissions correspond to the start codon. States 13 and 14 are also of zero-order emission for simplicity and to reduce the number of parameters, but also to have at least one place in the CDS model that can be a direct start point during decoding or training, to reduce frame-shift errors. It is not possible to start directly in states 15 to 20 without an emission history of at least two nucleotides. The order of each state is seen in table 4.4, and the total emission probabilities for each state are written in table 4.5.

The transition probabilities are shown in figure 4.4.

Table 4.4. Emission order of each state.

State   Order
12      0
13      0
14      0
15      2
16      2
17      2
18      2
19      2
20      2


4.1.4 Stop codon HMM

The emission probabilities are fixed in the same way as for the start codon, simply because I assume that these nucleotide sequences are always present as stop codons. The three sub-models in figure 4.6 represent the three stop codons TAA, TAG and TGA. The emissions in the states are of zero order. The emission probabilities for each state are defined in table 4.6.


Figure 4.6. Stop codon HMM.
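To make "fixed" concrete, the Python sketch below writes one branch of figure 4.6 as one-hot, zero-order emission vectors over (A, C, G, T, N). The state numbering 21-23 and the fully deterministic values are assumptions for illustration; the actual values are given in table 4.6.

Python code:

# One-hot emission vectors over (A, C, G, T, N) for the TAA branch;
# "fixed" means Baum-Welch is not allowed to re-estimate them.
TAA_BRANCH = {
    21: [1.0 if s == "T" else 0.0 for s in "ACGTN"],  # emits T
    22: [1.0 if s == "A" else 0.0 for s in "ACGTN"],  # emits A
    23: [1.0 if s == "A" else 0.0 for s in "ACGTN"],  # emits A
}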

4.1.5 3'UTR HMM

The first three states, 30-32 in figure 4.7, model the 3'UTR part right after the stop codon. The purpose of the selected topology is to reflect that the 3'UTR has a longer average length than the 5'UTR; two states would have sufficed, but to avoid confusion between the lengths of the 5'UTR and 3'UTR the given topology has been chosen. By testing the transition probabilities in MATLAB, the value a_{30,30} = a_{31,31} = a_{32,32} = 0.9964 gives a negative binomial distribution peaking around 550 nucleotides (see figure 4.8(a)). The 3'UTR shows tendencies of AT-rich elements, and high emission probabilities for A and T are therefore initialized. States 33-38 cover the polyadenylation signal AATAAA and hence contribute high emission probabilities for the nucleotides A and T in their respective states. However, these emissions are not fixed, simply because the nucleotides may vary. Between the polyadenylation signal and the poly(A) tail there is a spacer of 10 to 30 nucleotides, modeled by state 39. A short length is better modeled by a geometric distribution than by a negative binomial distribution. I set the transition probability a_{39,39} to 0.5, allowing Baum-Welch more control when estimating the value through training. The geometric distribution can be seen in figure 4.8(b). As in states 30-32, the emission probabilities in state 39 represent AT-rich elements.
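For reference, a state with self-transition probability a assigns a stay of length l the geometric distribution

P(l) = a^{l-1}(1 - a), \qquad E[l] = \frac{1}{1 - a},

so the initial value a_{39,39} = 0.5 (expected stay of only two nucleotides) is merely a weak starting point that Baum-Welch is free to re-estimate toward the observed 10-30 nucleotide spacer.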

States 40-41 model the poly(A) tail, which consists of repeated adenosine nucleotides extending to around 250 adenosines, so it is once again important to have a negative binomial distribution, here with transition probabilities a_{40,40} = a_{41,41} = 0.99595 (see figure 4.8(c)). The emissions are not fixed because of variation; a T may appear instead of an A, for example.

State 42 contains fixed emissions with A as the highest emission probability, because of the desire to end with an adenosine. The transition of 1.0 is only there to model the last adenosines of an emission sequence. As in the 5'UTR, all states are of zero-order emissions. Table 4.7 shows all emission probabilities.

Figure 4.7. 3'UTR HMM (states 30-42, with the polyadenylation signal and poly(A) tail sub-parts).

Figure 4.8. x-axis = length, y-axis = P(i). (a) Negative binomial distribution over states 30, 31 and 32. (b) Geometric distribution in state 39. (c) Negative binomial distribution over states 40 and 41.

4.1.6 The complete HMM

The complete HMM is built up of all the submodels for each region of the mRNA described above. Figure 4.9 shows two silent states, a begin state and an end state. A transition can occur from the begin state to all states in the submodels, simply because an EST is randomly sequenced; an EST sequence can therefore start anywhere in the HMM. These transitions from the begin state are defined as the initial state probabilities and are equally distributed over the 42 states that can serve as start points in the HMM, i.e. π_k = 1/42 for all such states, with state 42 and the N2 state excluded. There is no need to start in state 42, because a very short EST sequence consisting only of adenosines is very unlikely. Similarly, the end state illustrates that EST sequences can end anywhere in the HMM. However, no transitions to the end state exist, because the end state, like the begin state, is only abstract.

Starting in a state with higher-order emissions is not allowed, because such a state has emission probabilities that require a history of previous emissions, and the GHMM library does not support starting in such a state without an emission history. The HMM therefore needs two additional states before the CDS (the states of higher emission order), named N1 and N2. These states have fixed, uniform emission probabilities. It is possible to start in N1 but not in N2, since the affected CDS states (15-20) require two nucleotides of history.



Figure 4.9. The complete HMM with all submodels and the two extra states N1 and N2.

4.1.7 Implementing the HMM in XML

The designed HMM is implemented in XML, whose layout is defined by the GHMM library. The XML file is loaded by the software using the GHMM library and stored as a datatype interpreted as an HMM. Once the datatype is created from the XML file, the Baum-Welch and 1-best algorithms can be applied, making it possible to estimate new parameters directly in the datatype or to decode it.

The code below shows a typical HMM implemented in XML. The XML code defines a two-state HMM where state 2 has emission order 1. In XML the HMM is initialized by defining the type of HMM to be used (line 1); in this case a discretized higher-order HMM. "Labeled" means that the emissions are named rather than numbered. Lines 2-8 define the number of emissions and their respective names, and lines 9-15 the number of regions. Lines 16 and 20 define the two states. States are identified by a number that always starts at zero and runs in chronological order. The states must be initialized with initial state probabilities, which is why a begin state is unnecessary. Lines 17 and 21 initialize the emissions of each state, specifying the emission probabilities, whether they are fixed or not, and the emission order of the state. Line 21 specifies that state 2 has emission order 1, which gives 25 emission parameters. The last lines specify the transitions between the states and their transition probabilities.

XML code:

 1 <HMM type="higher-order labeled discrete">
 2   <alphabet id="0">
 3     <symbol code="0">A</symbol>
 4     <symbol code="1">C</symbol>
 5     <symbol code="2">G</symbol>
 6     <symbol code="3">T</symbol>
 7     <symbol code="4">N</symbol>
 8   </alphabet>
 9   <classAlphabet>
10     <symbol code="0">5U</symbol>
11     <symbol code="1">Start</symbol>
12     <symbol code="2">CDS</symbol>
13     <symbol code="3">Stop</symbol>
14     <symbol code="4">3U</symbol>
15   </classAlphabet>
16   <state id="0" initial="0.03333333" desc="State 1">
17     <discrete id="0" fixed="1">0.1998, 0.1998, 0.1998, 0.1998, 0.01</discrete>
18     <class>0</class>
19   </state>
20   <state id="1" initial="0.03333333" desc="State 2">
21     <discrete id="0" order="1">0.1998, 0.1998, 0.1998, 0.1998, 0.01,
22       0.1998, 0.1998, 0.1998, 0.1998, 0.01,
23       0.1998, 0.1998, 0.1998, 0.1998, 0.01,
24       0.1998, 0.1998, 0.1998, 0.1998, 0.01,
25       0.1998, 0.1998, 0.1998, 0.1998, 0.01</discrete>
26     <class>0</class>
27   </state>
28   <transition source="0" target="0">
29     <probability>0.5</probability>
30   </transition>
31   <transition source="0" target="1">
32     <probability>0.5</probability>
33   </transition>
34   <transition source="1" target="1">
35     <probability>1.0</probability>
36   </transition>
37 </HMM>
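Complementing the XML above, here is a minimal Python sketch of the load-train cycle described earlier (assuming the classic GHMM Python bindings; the file name and toy training sequences are illustrative, and the calls reflect the library's documented interface rather than the exact code of this thesis):

Python code:

import ghmm

# Load the XML-defined model into a GHMM datatype.
model = ghmm.HMMOpen("model.xml")

# Wrap toy training sequences in the model's discrete alphabet.
alphabet = ghmm.Alphabet(['A', 'C', 'G', 'T', 'N'])
train = ghmm.SequenceSet(alphabet, [list("ATGGCGCGTTAA"),
                                    list("GCGTACGTATGA")])

# Re-estimate transitions and emissions with Baum-Welch; 1-best
# decoding is then run on the trained model in the same way.
model.baumWelch(train)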

4.2 Choosing appropriate EST-data

EST data was downloaded from the UniGene database. The data consists of EST sequences from cDNA libraries. A cDNA library is derived from a specific tissue and usually consists of different mRNA sequences. The downloaded EST sequences will thus represent parts of different mRNA sequences, so the HMM must be able to model different mRNA sequences at once.

I have chosen 20 different EST data sets from various organs belonging to Homo sapiens; the HMM is therefore adapted to human mRNA through the chosen transition probabilities, which represent the average lengths of the specific regions of human mRNA. These 20 data sets are categorized by number of EST sequences: 5 with approximately 500 sequences, 5 with approximately 2000 sequences, 5 with approximately 5000 sequences, and 5 with approximately 10 000 sequences each. With the selected data, a measure of how well the unsupervised model performs relative to a supervised model can be obtained; ESTScan will hence use a supervised HMM for Homo sapiens. More details about the EST data can be seen in table 4.8 and appendix A. Another task will be to compare the results from the unsupervised HMM with ESTScan when there is no supervised model for some organism. Seven distinct
