Generation and Bioinformatic Analysis of Synthetic Ago HITS-CLIP Data

(1)

IT 13 039

Examensarbete 45 hp

Juni 2013

Generation and Bioinformatic

Analysis of Synthetic Ago HITS-CLIP

Data

Mehmet Ali Arslan

(2)

(3)

Teknisk- naturvetenskaplig fakultet UTH-enheten Besöksadress: Ångströmlaboratoriet Lägerhyddsvägen 1 Hus 4, Plan 0 Postadress: Box 536 751 21 Uppsala Telefon: 018 – 471 30 03 Telefax: 018 – 471 30 00 Hemsida: http://www.teknat.uu.se/student

Abstract

Generation and Bioinformatic Analysis of Synthetic

Ago HITS-CLIP Data

Mehmet Ali Arslan

Micro-RNAs (miRNAs) have been discovered to regulate messenger RNA (mRNA) translation and degradation. Various recent studies have been focused on miRNA target prediction, in order to get a better understanding of the rules and nature of miRNA regulation over mRNAs. In this project we aim to create a software module to identify miRNA target sites on mRNAs. As basis to this project, we refer to a study that identified a platform for miRNA-mRNA interaction in protein-RNA complexes in mouse brain (AGO HITS-CLIP study). We propose a probabilistic model of the data from this study, and generate synthetic sample data according to this model, in order to create a test bed for a discovery module. Our discovery module analyzes the sample data to identify peak regions where the interaction density is high. We present results both on synthetic sample data and data from the AGO HITS-CLIP study to evaluate our module.

Ämnesgranskare: Lars Arvestad Handledare: Jens Lagergren

(4)

(5)

Introduction

miRNAs are discovered to regulate gene expression by binding to mRNAs and causing them to degrade or inhibit their translation. This in turn a↵ects how proteins coded by mRNAs that are targeted by miRNAs are generated. Hence, every life related function they carry out is regulated by miRNAs (see “Background” for details). This is why we are interested in finding miRNA target sites on mRNAs.

As Chi et al. [1] provide a platform for investigating miRNA-mRNA in-teraction, we use the output of their study as input to ours. Aligning the isolated mRNA tags from the Argonaute-miRNA-mRNA ternary complex to the genome, we aim to identify regions that are dense in interaction with miRNAs in this ternary complex. We propose a probabilistic model of the aligned mRNA tags from [1]. This model is used both for generat-ing synthetic sample data, and in the target site detection algorithm. The capability of generating synthetic sample data is important to create a test bed for the detection algorithm.

The rest of this report is organized as follows: Chapter 2 is meant as a brief background to the biology and probability theory behind the project, while chapter 3 focuses on a summary of the related studies. Methodology is presented in chapter 4 and chapter 5 presents the results of our experiments including their discussions. Finally, chapter 6 concludes the report.

(7)

Chapter 2

Background

2.1 mRNA

While the essence or meaning of life is an open debate, functions necessary for life, including catalyzing metabolic reactions and DNA replication, are carried out by protein molecules. These molecules are encoded by genes in DNA. Messenger RNAs (mRNA) carry this encoded genetic information for the amino acid sequence of a protein. Gene expression, namely manufac-turing a protein, happens in two main phases: transcription which is the generation of the mRNA; and the translation of the genetic code residing in the mRNA to a protein [2]. mRNA in prokaryotic and eukaryotic cells have di↵erent properties and they act slightly di↵erently. We focus on eukary-otic mRNAs in this study and further discussion considers only eukaryeukary-otic mRNAs.

2.1.1 Transcription

Transcription is the process where the mRNA is generated by complementing part of a DNA strand named the template strand including the genetic information for the protein to be coded. As it can be seen in Fig. 2.1, the entire process starts with the RNA polymerase enzyme (pol II for mRNA transcription) binding to the promoter region for the gene in the template strand of the DNA, which is necessary for the enzyme to be bound to the DNA. In return, RNA polymerase starts unwinding the DNA and adds the complementing nucleotide at the 3’ end of the newly generated RNA until it reaches the termination site. The result is called a pre–mRNA which is processed further to produce the mature mRNA [2]. This post–processing is called splicing.

(8)

2.1.2 Translation

Translation is the synthesis process for a protein from the information resid-ing in the mRNA that is codresid-ing this protein (see Fig. 2.2). With the help of transfer RNAs (tRNA) and the ribosome, the mRNA is read codon by codon (three base pairs specifying an amino acid) to synthesize the amino acid chain that constitutes the initial form of a protein. For each codon that is read, a tRNA with the corresponding anticodon carries the amino acid coded by the codon to the ribosome and transfers it to the growing amino

(9)

acid chain [2, 4].

Figure 2.2: Summary of the translation process. Figure from Mariana Ruiz Villarreal/Wikimedia Commons. Used with permission.

2.2 miRNA

Before we begin introducing miRNA, note that only metazoan miRNAs and their functions in metazoans are considered in this project. Thus, the reader should consider our elaboration within metazoans only.

The history of micro RNAs (miRNA) goes back to 1993 [5], where it was discovered that the LIN-14 protein’s abundance was regulated by a short RNA product through inhibiting its translation. The discovery was con-sidered peculiar until in the turn of the millennium, studies reported the evolutionarily conserved [6] let-7 miRNA to regulate expression of several genes [7]. Today, a search in PubMed with the keyword ”microRNA” gives more than 15000 citations, which gives an idea about the amount of interest in miRNAs in the new century.

(10)

miRNAs are ⇠22-nucleotide residue RNAs. The precursor of the mature miRNA is the⇠70-nucleotide imperfectly base-paired hairpin segment from the RNA that the miRNA is derived from [9]. Further biogenesis steps occur for the pre-miRNA to be transformed into the mature miRNA (see Fig. 2.3). The mature miRNA, together with Argonaute and several other proteins, is assembled into a complex named RNA-induced silencing complex (RISC), which is also referred to as miRNA-protein complex (miRNP). miRNA di-rects this complex to the binding site on the mRNA in order for the miRNP to perform its functions on the mRNA.

Rules and regulations for miRNA functions are not crystal clear. This fact is one of the reasons why there is an increasing amount of diverse research focused on miRNAs. However, there is some common ground, such as the fact that the human genome encodes several hundred unique miRNAs and that these miRNAs interact with thousands of mRNAs [10], which in turn means that one unique miRNA interacts with more than one mRNA. It

(11)

is also known that miRNA expression levels are often perturbed in disease states [11, 12, 13] which is another important cause for research interest.

2.3 miRNA-mRNA interaction

miRNAs interact with mRNAs through complementing mRNA target / in-teraction sites. Thus, the e↵ect of the inin-teraction depends on the level of complementarity. Perfect complementarity causes endonucleolytic cleavage (i.e. a split of the strand into two by phosphodiester bond cleavage between the nucleotides) of the RISC bound mRNA. This occurs only if the RISC includes an Argonaute protein that is capable of endonucleolytic cleavage. Cleaved mRNAs are degraded as a result. [14, 15]. For mammals Ago2 is the only enzyme capable of endonucleolytic cleavage [16]. This kind of cleav-age can happen with perfect or near-perfect matches, while the mismatch positions are significant for higher success rates in a near-perfect match. [17, 18]. But the more interesting and intriguing interaction between miR-NAs and mRmiR-NAs is when there is partial complementarity involved. Stud-ies show that miRNAs that partially complement their mRNA target sites can cause translation inhibition [19, 17]. This in turn causes decrease in the expression of the protein that the mRNA concerned codes for, while cleavage through (near-)perfect complementarity causes decrease in mRNA abundance through degradation. Further research about the e↵ects of tar-get site sequence and complementarity on the expression/repression of the protein coded by the target mRNA identifies the residues 2 to 8 in the 5’ of the miRNA as a so-called seed section, which should be almost perfectly complementary to the mRNA to cause translation inhibition. [20, 21]

2.4 Poisson Distribution

For data modeling purposes later on (see Section 4.2.2.1), we use the Poisson distribution, which we introduce in the following.

Named after Sim´eon Poisson, the Poisson process is a counting process that can be used to describe many daily-life situations such as customer arrivals to a queue, phone calls made to a call-center, etc. where a counting process {C(t), t 0} is a stochastic process that counts events that have occurred up to time t, where C(t) is non-negative and integer-valued for all t 0 and C(t) is non-decreasing in t.

A Poisson distribution on the other hand, is a discrete probability distribu-tion which describes the probability of a given number of events that occur in a fixed time/space interval. Namely, Poisson distribution describes how

(12)

events occur in a Poisson process. The average rate of occurrences is known and occurrences in di↵erent intervals are independent of each other [22] Characteristics of a Poisson process include [23]:

• The average rate of success/occurrence (expectation) is known. • Probability of a single success in an interval is proportional to the size

of the interval

• Probabilities of successes that are in di↵erent intervals are independent of each other

• Probability that a success will occur in an extremely small region is virtually zero

The key property of Poisson distribution is the average success rate (expec-tation) , which used to describe the variance, mode, etc. of the distribution. Random splitting of a Poisson process results in two independent Poisson processes, while several independent Poisson processes can be compounded into a new Poisson process as well, the random variable of the compound be-comes the sum of the independent processes that constitute the compound. [24].

The probability mass function of the discrete stochastic variable k for a Poisson distribution is given as [25]:

f (k; ) = P r(X = k) =

k_e

k! (2.1)

where is the average success rate, and e is the base of the natural logarithm. See subsection ”4.2.2.1 Poisson Expectations” for the use of the Poisson process and the distribution in this project, together with the motivation for doing so.

(13)

Chapter 3

Literature Review

mRNA expression is regulated by miRNAs via miRNA-containing ribonu-cleoprotein particles (miRNPs). Argonaute protein is part of these particles, where they bind miRNAs and mediate target mRNA recognition [26]. Im-munopurification is a technique involving purification of the proteins and the RNAs in RNP complexes. This is achieved using antibodies of the con-stituent proteins [27, 28].

Microarray analysis is a technique used for identifying the genes that are active in a target sample. The isolated mRNA from the target sample is converted into complementary DNA (cDNA) that is dyed in fluorescent afterwards. The dyed cDNA is injected into a silicon chip or a glass slide full of single DNA strands representing each gene. The DNA segments that are complemented by the cDNA identify the genes that are highly expressed [29].

In [30], the authors use miRNP immunopurification in order to identify functional mRNA targets. Following [1] they use the fact that mRNAs and miRNAs bind to the protein in the miRNP complex (in their case Ago1) at the same time. Their microarray analysis shows a high degree of enrichment for miRNA complementary sites on the 30UTRs of the mRNAs that were bound to the miRNP in regard. Further on they validate the regulation of mRNAs that are associated with miRNP by one particular miRNA, miR-1. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a tech-nique for genome-wide profiling of DNA-binding proteins [31]. ChIP-seq is out of the scope of this work, therefore we are not considering the details of this technique. We mention it here since in “Model-based Analysis of ChIP-seq data (MACS)” [32], the authors focus on peak identification tech-niques on short read sequencer data resulted from ChIP-seq. They try to compensate for various technical characteristics of such sequencers, such as

(14)

the fact that ChIP-Seq tags represent only the ends of the ChIP fragments, instead of precise protein-DNA binding sites; or biases it exhibits along the genome due to sequencing and mapping biases, chromatin structure and genome copy number variations. They model the tag distribution along the genome with Poisson distribution where they traverse the genome in win-dows to find candidate peaks with a significant tag enrichment. In order to evaluate significance, there is a need for a reference to compare to, namely a background expectation ( BG) for having reads in genomic positions. This

describes the noise, or in other words, expected number of reads by chance. Significance is measured by the Poisson distribution p-value based on this

BG. Any window that has a p-value that is less than a pre-defined threshold

is identified as significant. Their default threshold value is 10 5.

They expand this with a dynamic background expectation concept. Instead of using a uniform BGestimated from the whole genome, they use the

max-imum of the expectations estimated from several windows centered from the position in regard ( local), with di↵erent lengths. This captures the influence

of local biases, and is robust against occasional low tag counts at small local regions. Through the use of local, false positives that would otherwise be

called by BGare eliminated. This invention of dynamic background

expec-tation is important for our study since we used a slightly modified version of this technique in peak calling.

Argonaute (Ago) high-throughput sequencing of RNAs isolated by crosslink-ing immunoprecipitation (HITS-CLIP) is introduced in [1]. HITS-CLIP is developed to “directly identify protein–RNA interactions in living tissues in a genome-wide manner”. The idea stems from the X-ray crystal struc-tures of an Ago-miRNA-mRNA ternary complex, which suggest that Ago maintains close enough contacts with a miRNA that is bound to it and the nearby mRNAs, and this in turn lets Ago HITS-CLIP identify the interac-tion sites in vivo. We use parts of their outputs as input to our study. See “Methodology” for further information.

(15)

Chapter 4

Methodology

4.1 Inputs and their handling

It is important to describe the inputs in detail in order to provide the reader the assumptions made in our study. In this section, we will describe the genome sequencing used, the list of genes selected and the details about the input from the Ago HITS-CLIP study [1].

4.1.1 Selected Genome and Genes

The Ago HITS-CLIP study is done on mouse brain, therefore we limit our-selves to mouse genome. We use the mm9 mouse genome taken from the UCSC database for any alignments done in this study [33, 34].

Since the Ago HITS-CLIP study is on the mouse brain, we limit ourselves to mouse genes. We use the genes listed in the refGene table from the RefSeq Gene track in UCSC database [34] which shows known mouse protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq) [35].

4.1.2 Ago HITS-CLIP Data

As we need mRNA tag samples to build our generation model on (to generate similar data), we use the samples from the Ago HITS-CLIP study [1]. Here we describe the data we picked to use, in more detail.

To identify the particles from the HITS-CLIP results, they use a technique called radioisotopic labeling, which is a molecular tracing technique by use of radioisotopes injected into the molecules that are to be traced. The

(16)

radiation emitted when these isotopes decay is used to trace the labeled compound(s) [36]. Radiolabeling the results of HITS-CLIP on mouse brain (P13 neocortex), they identify two complexes with di↵erent modal molecular sizes, 110kD and 130kD respectively. As reported in their study, the 110kD product corresponds to the miRNAs while the 130kD product corresponds to mRNAs that simultaneously bind to the Ago compound [1]. In this study we are interested in miRNA target site prediction which are located on mRNAs, thus we focus on the mRNA product of the HITS-CLIP results, namely the 130kD modal sized product. As they conducted the same experiment with di↵erent mouse brains we needed to pick one of the samples. We use the sample labeled as “Brain A” [1], containing the tags acquired from the 130kD product of the HITS-CLIP in FASTQ file format, which is accessible from the website of the respective study, as part of the supplementary material [37]. For more information on the tags generated by AGO HITS-CLIP please refer to Supplementary Table 1 in the supplement to [1].

More than the tags themselves, we are interested in where they align on the genome. Hence, we align the tags ourselves using a third party short read mapper software, since the alignments are not provided as supplementary material. We use the short read mapper SHRiMP [38] to align the tags to the reference mouse genome mentioned above. To keep things relatively simpler, we consider only the perfect alignments of these to the genome.

4.2 Synthetic Data Generation

In this section we will go into the details of why and how we generate synthetic data, including motivation, parameters and the outlines of the generation procedure.

While generating synthetic data, we generate reads from genes that are in the list mentioned in “Selecting Genes” section above. This means that we skip reads that do not perfectly align to any identified gene from our list.

4.2.1 Motivation

The Ago HITS-CLIP sequencing data is available at [37]. In order to test a miRNA target-site prediction system as ours, one has to validate the func-tionality. This can be done via testing with various datasets with di↵erent parameters which can imply a trustworthy system, or at the least a method worth investigating deeper with costlier methods. To acquire sequencing data that has similar properties as the Ago HITS-CLIP but varying at the same time, we either have to do our own genuine sequencing experiments or generate synthetic data. While technological improvements have lead not

(17)

just to better tools and methods but also to cheaper ones, generating se-quencing data for such datasets is still quite expensive. Thus, while we want to create a system that will be applied to real sequencing data, it is a good start to test the methodology with synthetic data initially.

4.2.2 Parameters

In [1], the authors identify the peaks of clusters of tags by cubic-spline interpolation technique described in the supplementary material for their study. According to their results, Ago binds within 45-62 nucleotide of these cluster peaks 95% of the time. This region is defined as the Ago-mRNA footprint, and their further results indicate that this is a good prediction site for miRNA targets. In our study we rename this footprint as peak window length, and use this as one of the parameters.

It is important to evaluate the behavior of our system given datasets with various numbers of peaks. Thus, number of peaks is another parameter that is used and tweaked in generation.

4.2.2.1 Poisson Expectations

In read generation and peak detection, we assume that the the probability to have a read at a particular position is independent from the same probabil-ities of other positions. We also want to use di↵erent probability values for background reads and peak regions respectively. This means that if we know that we are in a non-peak region (i.e. background region) the probability to have a read there is dependent only on the background expectation; while on the other hand, if we know that we are in a peak region, the probability to have a read there is dependent both on the peak region expectation and background expectation.

As described in the introduction, a Poisson process counts independent stochastic events that occur in unit intervals. Poisson distribution is good to describe systems where there are very small number of events in a very large domain. In our case, an event is having a read start at a genomic position, the unit interval is 1 base pair, and the very large domain is the transcriptome. The ratio of the total number of reads (taken from [1] sup-plementary materials) to the length of the transcriptome (total number of positions to check for a read) is very small. Thus Poisson distribution is a good model for our problem. On the side, having a Poisson expectation for each position means that we assume independence of probabilities for them. The other important property of Poisson processes is that joining two inde-pendent Poisson processes results again in a Poisson process, while joining

(18)

is a very simple calculation, namely addition of the two expectation values for respective processes. This property is important for us to be able to consider the background expectation even in a peak region. We simply add the background expectation to the peak expectation to get the expectation for the position considered.

Background Expectations

In order to decide whether a region is a peak region or not, we need a reference to reason about our observations regarding this region. In other words we need some kind of control data or information that we can compare to our observations about the region at hand (e.g. number of reads that fall into this region). We call this the background expectation, namely the expectation to have a read start at a particular position assuming it is not in a peak region. Given a region, when we count the number of reads that fall into this region as x, we can calculate the likelihood of having at least x reads in a non-peak region according to this background expectation. Then, if this likelihood is below a certain threshold, we can name this region a peak since it is very unlikely to have occurred by chance/noise according to the aforementioned background expectation.

As reported by Chi et al. [1] only 1% of the reproducible Ago-mRNA tags are found in the 5’ untranslated region (50 UTR). This low percentage of reproducible tags led us to believe that number of reads in the 50 UTRs is a good proxy to derive background expectations from.

Inspired by [32], we define three di↵erent backgrounds and respective ex-pectations, namely one background per gene (and expectation: Ebgene), one

per chromosome (and expectation: Ebchrom), and one global background

(and expectation: Ebglobal). The motivation behind this decision is to have

a varying background, and to build a detection algorithm that is resilient to such background. A dynamic background entails di↵erent thresholds for peak calling depending on the current chromosome, and gene the region resides. How to calculate these expectations will be explained in a minute.

Peak Expectation

While generating reads starting from a particular peak position, we need to know how many reads to generate. Peak expectation (Ep) is to be used

at this point. Note that Ep is expected to be remarkably bigger than the

background expectations, especially Ebglobal, so that the read generation

al-gorithm generates enough number of reads in a peak region, which is to have a very low likelihood to be generated by a background expectation.

Calculation of Expectations

First and foremost, it is important to note that all expectations are regarding only one position. Namely, if for a position pos the expectation is Epos, it

(19)

means that the expected number of reads starting from position pos is Epos.

If we want to calculate the expected number of reads in a region with n positions that are in a similar region as x (i.e. have the same expectation), we simply multiply the expectation for one position with the length of the region Epos⇥ n.

As explained above, we use number of tags that are found in 50UTRs of genes as our input for calculating the background expectations. Hence we have to calculate the number of alignments that fall in the 50 UTRs of the genes first. Using the RefSeq genes table, we calculate the 50 UTR for each gene and count the number of alignments that fall into them from the Ago alignment file (see section “Ago HITS-CLIP Data”) and save this information into a table (50utr table)

For a gene g: if: ng = # reads in g’s 50 UTR (4.1) lg = g’s 50 UTR length (4.2) then: Ebgenes[g] = ng lg (4.3) For a chromosome chr: Ebchroms[chr] = P g in chrng P g in chrlg (4.4) Finally, Ebglobal = P g in transcriptomeng P g in transcriptomelg (4.5)

We mentioned above that multiplying the length of the region we are consid-ering with the expectation, will give us the expected number of reads. From the supplementary documents (and from the actual data) of the Ago study, we get the actual total number of reads for the entire transcriptome, which

(20)

we further on use to calculate the peak expectation. We get the background expectation(s) that are needed for peak expectation calculation, using the formulas above. Window length and number of peaks (see “Parameters”) are the other parameters in the following formulas:

if:

N = total number of reads (4.6) l = window length (4.7) p = # of peaks to be generated (4.8) b =_{|transcriptome|} p_{⇥ l} (4.9) then: (p⇥ l ⇥ Ep) + (b⇥ Ebglobal) = N (4.10) and: Ep = N (b⇥ Ebglobal) p⇥ l (4.11)

Noting that b represents the background positions (namely, the positions that will not reside in peaks); Eq. 4.10 summarizes that the total number of reads is the number of reads in peaks plus the number of reads in the background positions. The number of peak positions are calculated as p⇥ l. Thus, peak expectation Ep is calculated as in Eq. 4.11. A careful reader

should notice that we use Ebglobal in the equation. This obviously generates a

slight conflict with the idea of having di↵erent background expectations for di↵erent chromosomes and genes. One can suggest having di↵erent expec-tations for peak positions as well. However, in order to be able to calculate di↵erent peak expectations, we have to know how many peaks are to be generated in a particular gene and chromosome respectively. This can be included in future work, but for now we believe that using Ebglobal here is

(21)

4.2.3 Generation

After calculating the expectations as described above, we are ready to gen-erate synthetic sequencing data.

As the real sequencing data that we imitate comprises of short reads (namely 32 base-pair tags), we copy segments from genes that are 32 bp long. The important question is: Where will we copy from? We need to decide on where exactly to start a read. The expectations we defined in the previous section let us decide whether or not to generate a read starting from a particular position. But as we saw, there are more than one expectations to use. When do we use the peak expectation, when do we use a background expectation?

We distribute the peaks randomly across the transcriptome. For each po-sition of each gene, we “flip a coin“ to decide whether or not a peak starts from this position. “Flipping a coin“ is implemented in such a way so that the number of peaks generated will be around the desired number, input to the system as p. If the position is to be the start of a peak, we generate a peak with window length l by generating reads in this window according to the peak expectation. We want to be able to cross-check our called peaks (see “Peak Calling“) with these synthetically generated peaks, so we record them when we generate them. Since we do not consider overlapping peaks, we continue traversing the gene after l positions. From all the non-peak positions we generate background reads according to the Ebgene of the gene

that the regarding position resides in.

The algorithm for synthetic sequencing data generation is given below: for all g in genes do

for p in [g.start, g.end] do if f lip a coin() then

for i in [p, p + l] do generate read(i, Ep)

end for

record the peak in synth peaks p = p + l

else

generate read(i, Ebgenes[g])

end if end for end for

The function generate read(pos, ) generates reads that start from the po-sition pos according to the Poisson distribution with the given expectation ( ). g.start and g.end denotes the transcript start and end of the gene g,

(22)

respectively.

4.3 Peak (Target Site) Detection

Our goal is to build a miRNA target site prediction system based on the results of the Ago study. The synthetic sequencing data generation part of our project we described in the previous section, only is a stepping stone to reach this goal. With that part explained, we can now assume that we have various datasets which are similar to the sequencing data from the Ago HITS-CLIP study.

After read generation, we align the reads to the genome with the short reader mapper SHRiMP as we did with the Ago reads.

In order to call a peak, we have to know about the background. In the previous section, we derived the background and peak expectations from the number of Ago HITS-CLIP alignments that are in the 50 UTRs of the genes in the mouse transcriptome. This was necessary to know where and how much reads to generate.

For peak calling, we derive the expectations in the same way but this time from the data that was generated, the data that we will try to detect the peaks (namely the target sites) of. Thus, in the next section, when we mention peak expectation or background expectation, we refer to the ex-pectations for the synthetic data; not the Ago HITS-CLIP data.

4.3.1 Peak Calling

The main idea in our peak calling is to go through each possible window (with length l) in the transcriptome, count the number of alignments that start in this window and check whether or not at least this number of align-ments could be background alignalign-ments. If the probability of this happening is lower than a threshold, then we call the window a peak.

It is important to note that one unique read that is generated results in many alignments. Since we control data generation by generation of reads, not alignments, this might cause misleading results. To alleviate this e↵ect, while calculating the number of alignments in a window, we calculate the contribution of each alignment as _n1 where n is the number of alignments that are generated through same read.

A crude outline of our peak calling process would be: for all g in genes do

(23)

k = countAlignmentsInW indow(g.chrom, g.strand, p, l) Eb = max(Ebgenes[g], Ebchroms[g.chrom], Ebglobal)

if cumulative poisson prob((Eb⇥ l), k) < threshold then

call this window a peak p = p + l

end if end for end for

The function countAlignmentsInW indow(g.chrom, g.strand, p, l) counts the number of alignments that start in the window with length l, which in turn starts at the position p. Chromosome and strand information is also impor-tant since p refers to a position on a chromosome, and strands are definitive for mRNAs and alignments. We will explain how we implement this func-tion in more detail soon. But for now, let us continue with the rest of the algorithm outline above

The line Eb = max(Ebgenes[g], Ebchroms[g.chrom], Ebglobal) is where we get

the maximum of the background expectations regarding gene g. We cross-check the observed number of alignments in the current window against this maximum as the background expectation. This is inspired by [32], and aims to protect the system against noise, providing a conservative background estimation. The function cumulative poisson prob( , k) uses the Poisson probability mass function k_k!e (2.1) to calculate the probability of having at least k events with the event expectation . As the reader might recall, Eb, as all the other expectations, is per position. But we need the probability

for having k alignments in a window with length l. That is the reason why we multiply Eb with l while calculating the cumulative poisson prob( , k): to

get the background expectation value for the window. We set the threshold top_/_{|transcriptome|}_{to limit the false positives to the number of peaks p, that}

is input to the synthetic sequencing data generation. This threshold can be tweaked higher or lower to increase true positives or decrease false positives, respectively.

The bottleneck of this algorithm resides in the function

countAlingmentsInW indow(g.chrom, g.strand, p, l), since it is executed for almost each position in the entire transcriptome. Hence, the way we imple-ment this function is critical for the performance of peak calling (in terms of execution time). Briefly, the function is supposed to check all the align-ments for each window; count the alignalign-ments that start in the given window, and return this count. Using brute force approach and going through all the alignments one by one for each window, to check if it is in the window, results in a runtime of several days, just for a small part of the transcriptome. Instead of the brute force approach, we introduced indexing techniques to

(24)

reduce the runtime drastically. We created a massive dictionary (hash table) that keeps the number of alignments at each position. To lower the load factor of the dictionary, we index on three tiers: first on chromosome; second on strand ; and third on the position relative to the chromosome. At the beginning of peak calling, we go through the alignment file once to initialize this dictionary. With the use of this dictionary, checking the number of alignments that start from a given position is reduced to a look up on a dictionary, while in the naive approach each alignment had to be visited for each position. A miss on the look up means that there are no alignments on the position that is looked up. By this indexing, the complexity was reduced from O(|transcriptome| ⇥ N) to O(N), where N is the number of alignments.

Since genes are independent of each other regarding peak calling, it is possi-ble to divide the transcriptome into collections of genes and do peak calling on these collections in parallel. This way it is possible to further reduce the total runtime of peak calling with the expense of runtime memory, assuming a multi-processor system to run the peak calling on.

(25)

Chapter 5

Results

In the following, p refers to the number of peaks parameter for synthetic sequencing data generation while l is the window length.

5.1 Peak calling on synthetic data

In order to analyze the performance of our model with various data sets, we generated sequencing data with p = 1500, p = 2500, p = 3500, p = 4500, p = 11500. We also tweaked the window length from 62 to 150. l = 62 is the “Ago-footprint” taken from [1].

5.1.1 P-values

As mentioned in the peak calling part in methodology, we traverse the tran-scriptome window by window to identify the peak windows. To call a window which has k alignments a peak, the probability of having k or more align-ments in this window given the background expectation, has to be lower than the threshold. We call this probability the p-value for this window. Note that we are considering each alignment’s contribution to the afore-mentioned k as _n1 where n is the number of alignments that are generated through same read.

We plotted these p-values to have a general idea about their distribution. In the plots in Fig. 5.1, x-axis is the normalized ranks of the p-values. This means that the x-axis of a point gives the portion of plotted p-values that are less than the p-value of the point in consideration (which actually lay to the left of this point in the plot). The y-axis on the other hand is simply the p-value which is described above.

(26)

The total number of windows we traverse in the transcriptome (which is almost the same as the length of the transcriptome) is around 1 billion. However, since the probability of having an alignment in a random position is very low, most of the windows end up having 0 alignments, which in turn gives a p-value very close to 1 (given the background expectation). These p-values are not crucially interesting for plotting purposes since they almost correspond to a long horizontal line at the end of the plot, hence we omitted them in order to ease the plotting process as much as analyzing them. The number of windows (thus the number of points in the plots) after omitting the ones which had no alignments is around 14 million which is almost a ₇₀1 fraction of the original number. Observe also that we plotted the dataset with p = 0 (Fig. 5.1a) to see the di↵erence with the other datasets. The number of windows that are plotted is of a di↵erent order of magnitude compared to the number of called peaks for each set. Keeping this in mind, one can see that each figure except Fig. 5.1a, which is for p = 0, starts with a part where p-values are very close to zero, and that moving slightly towards the right on the x-axis we come to a point where there is a relatively sharp increase in the p-values. This is the point where the threshold lies. The points below this threshold belong to the called peaks.

5.1.2 Peaks called

After doing peak calling on the synthetic data, we analyze the peak calling performance. This is feasible since we record the starting positions of the generated peaks during synthetic sequencing data generation. Comparing the positions of the called peaks with the generated peaks gives us an idea about how good the peak calling performs.

We label a called peak as a true positive, if it overlaps with a generated peak at least with one base-pair. As described in methodology, we jump l positions in traversal when we call a window a peak, to avoid overlapping called peaks. As an artifact of this we consider even the least overlap as an indication of being a true positive. Otherwise we would classify called peaks that are actually related to the generated peaks as false positives, which would be misleading.

Another artifact of jumping after calling a peak is that, we call approxi-mately two peaks per synthetically generated peak. The explanation for this fact is that, in peak calling, at the end of the window we are currently checking, we see some alignments from the beginning of a generated peak that are sufficiently many to call this window a peak, and jump to the next non-overlapping window (since we called the previous window a peak), which also has sufficiently many alignments from the end of the same generated peak, so we call this new window a peak too. Thus, we end up having two

(27)

(a) p=0 (b) p=1500

(c) p=2500 (d) p=3500

(e) p=4500 (f) p=11500

Figure 5.1: Cumulative p-value distributions for peak calling on synthetic datasets with various number of peaks

(28)

called peaks per generated peak. In tables 5.1 and 5.2, we call the number of generated peaks that are overlapped by at least one called peak the unique true positives. As a corollary to that, we call all the called peaks that overlap with a generated peak all true positives.

In order to eliminate these artifacts of jumping whenever a peak is called, the jumping should be replaced with a delay in peak calling. That is: whenever the number of alignments detected results in a p-value that is less than the threshold, the system should calculate the p-values for the next l windows and call the one with the lowest p-value as the peak. This way, there will be at most one called peak per generated peak, while the position accuracy of the called peak will also increase.

Tables 5.1 and 5.2 are the summary tables for peak calling done on synthetic sequencing data with various parameters. The window length is di↵erent for each table.

p unique true all true false called positives positives positives peaks

1500 812 1643 2244 3887

2500 1256 2537 3783 6320 3500 1793 3629 5149 8778 4500 2249 4568 6615 11183 11500 5674 11232 15592 26824

Table 5.1: Peak calling analysis for l = 62

p unique true all true false called positives positives positives peaks

1500 777 1588 2299 3887

2500 1297 2653 3792 6445 3500 1811 3691 5190 8881 4500 2307 4680 6550 11230 11500 5787 11377 15841 27218

Table 5.2: Peak calling analysis for l = 150

As seen in both table 5.1 and table 5.2, the number of false positives steadily increases with p. This phenomenon can be explained: Increasing the p increases the total number of reads since for each peak ⇠20-40 reads are generated. This in turn results in a much higher number of alignments which are caused by the same read. For example, the total number of unique alignments are 993544 for p = 1500 while the same metric is 1244820 for p = 4500 (both values are for l = 150). This in turn causes more peaks to be called, and eventually more false positives.

(29)

5.2 Peak calling on Ago data

We also tried peak calling on the Ago data provided in [37]. This data is the result of HITS-CLIP with Ago antibody for three di↵erent mouse brains, labeled as A, B and C. As mentioned before, we used the 130kD model sized samples from their sequencing results. In table 5.3, we summarize the called peaks according to the region they belong in their respective transcripts.

Dataset Peaks in Peaks in Peaks in Total 30 UTR 50 UTR CDS

Brain A 315 28 238 581

Brain B 927 60 1070 2057 Brain C 744 39 630 1413

Table 5.3: Regions for called peaks for Ago data

In the supplementary material to their study, Chi et al. plot the regional distribution of mRNA tag clusters within the transcribed genes (determined by RefSeq annotation) in supplementary figure 8 [1]. The results in table 5.3 are significantly close to this distribution of clusters (called peaks in this study) in terms of percentages of clusters discovered in various regions. The percentages they present are 1.9%, 44.9% and 53.2% respectively for 50 UTR, CDS and 30 UTR. In table 5.3, the distribution is 3% 48% and 49% for the same respective regions. On the other hand, the total number of clusters is 11118 in their study while we discovered only 4051 clusters. This, combined with the numbers in table 5.2, points to the fact that our method needs improvement in terms of clusters left undiscovered.

(30)

Chapter 6

Conclusions

mRNA-miRNA interaction is a key topic in understanding the rules and regulations for gene expression. In this study, we focused on the results presented in [?], which introduce the Ago-miRNA-mRNA ternary complex as a promising platform for investigating miRNA target sites. We proposed a probabilistic model of a part of their result data, and generated several synthetic data sets using this model to test our software for target site detection.

Our results on the synthetic data showed that our relatively simple prob-abilistic model works fine regarding true positives, as the detection soft-ware finds more than half of the generated peaks for all the tests we con-ducted. On the other hand, the results from tests on genuine data from [37] showed that the regional distribution of the interaction sites (peaks) our software identified matches the distribution given in [1]. Despite these positive results, there is vast room for improvements and further study. As discussed in Chapter 5, there are side e↵ects from our method of synthetic data generation, these should be alleviated in the generation process or a pre-processing phase should be introduced before detection. Furthermore, partially caused by the previously mentioned side e↵ects, number of false positives and unidentified peaks are higher than expected. This lowers the quality of the results; as in a real scenario, called peaks are to be cross-checked to find out whether they actually correspond to interesting sites in the respective genome or not.

Even with the problems mentioned, we believe that this study provides a promising introduction for important research in mRNA-miRNA interac-tion.

(31)

Chapter 7

Bibliography

[1] S.W. Chi, J.B. Zang, A. Mele, and R.B. Darnell. Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature, 2009.

[2] S. Clancy and W. Brown. Translation: DNA to mRNA to protein. 2008.

[3] D. Sadava, H.C. Heller, and G.H. Orians. Life: The Science of Biology. W. H. Freeman, 2008.

[4] B.A. Pierce. Genetics: A Conceptual Approach. W. H. Freeman, 2010. [5] R.C. Lee, R.L. Feinbaum, and V. Ambros. The C. elegans heterochronic gene 4 encodes small RNAs with antisense complementarity to lin-14. Cell, 75(5):843–54, 1993.

[6] A.E. Pasquinelli, B.J. Reinhart, F. Slack, M.Q. Martindale, M.I. Kuroda, B. Maller, D.C. Hayward, E.E. Ball, B. Degnan, P. Mller, J. Spring, A. Srinivasan, M. Fishman, J. Finnerty, J. Corbo, M. Levine, P. Leahy, E. Davidson, and G. Ruvkun. Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. Nature, 408(6808):86–9, 2000.

[7] B.J. Reinhart, F.J. Slack, M. Basson, A.E. Pasquinelli, J.C. Bettinger, A.E. Rougvie, H.R. Horvitz, and G. Ruvkun. The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Na-ture, 403(6772):901–6, 2000.

[8] L. He and G.J. Hannon. MicroRNAs: small RNAs with a big role in gene regulation. Nat Rev Genet, 5(7):522–31, 2004.

[9] B.P. Lewis, C.B. Burge, and D.P. Bartel. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell, 120(1):15–20, 2005.

(32)

[10] V. Ambros, B. Bartel, D.P. Bartel, C.B. Burge, J.C. Carrington, X. Chen, G. Dreyfuss, S.R. Eddy, S. Griffiths-Jones, M. Marshall, M. Matzke, G. Ruvkun, and T. Tuschl. A uniform system for mi-croRNA annotation. RNA, 9(3):277–9, 2003.

[11] Jun Lu, Gad Getz, Eric A Miska, Ezequiel Alvarez-Saavedra, Justin Lamb, David Peck, Alejandro Sweet-Cordero, Benjamin L Ebert, Ray-mond H Mak, Adolfo A Ferrando, and et al. MicroRNA expression profiles classify human cancers. Nature, 435(7043):834–838, 2005. [12] J. Lu, G. Getz, E.A. Miska, E. Alvarez-Saavedra, J. Lamb, D. Peck,

A. Sweet-Cordero, B.L. Ebert, R.H. Mak, A.A. Ferrando, J.R. Down-ing, T. Jacks, H.R. Horvitz, and T.R. Golub. A microRNA polycistron as a potential human oncogene. Nature, 435(7043):834–8, 2005.

[13] I. Alvarez-Garcia and E.A. Miska. MicroRNA functions in animal de-velopment and human disease. Dede-velopment, 132(21):4653–62, 2005. [14] Y. Tomari and P.D. Zamore. Perspective: machines for RNAi. Genes

Dev, 19(5):517–29, 2005.

[15] J. Martinez and T. Tuschl. RISC is a 5’ phosphomonoester-producing RNA endonuclease. Genes Dev, 18(9):975–80, 2004.

[16] J. Liu, M.A. Carmell, F.V. Rivas, C.G. Marsden, J.M. Thomson, J. Song, S.M. Hammond, L. Joshua-Tor, and G.J. Hannon. Argonaute2 is the catalytic engine of mammalian RNAi. Science, 305(5689):1437– 41, 2004.

[17] R.J. Jackson and N. Standart. How do microRNAs regulate gene ex-pression? Sci STKE, 2007(367):re1, 2007.

[18] S. Yekta, I. Shih, and D.P. Bartel. MicroRNA-directed cleavage of HOXB8 mRNA. Science, 304(5670):594–6, 2004.

[19] M.R. Fabian, N. Sonenberg, and W. Filipowicz. Regulation of mRNA translation and stability by microRNAs. Annu Rev Biochem, 79, 2010. [20] J. Brennecke, A. Stark, R.B. Russell, and S.M. Cohen. Principles of

microRNA-target recognition. PLoS Biol, 3(3):e85, 2005.

[21] J.G. Doench and P.A. Sharp. Specificity of microRNA target selection in translational repression. Genes Dev, 18(5):504–11, 2004.

[22] Frank A. Haight. Handbook of the Poisson distribution. Wiley, 1967. [23] Stat Trek. Poisson distribution, http://stattrek.com/

probability-distributions/poisson.aspx. Date accessed: 2012-02-01.

(33)

[25] Wikipedia. Poisson distribution, http://en.wikipedia.org/wiki/ Poisson_distribution. Date accessed: 2012-02-01.

[26] M. Landthaler, D. Gaidatzis, A. Rothballer, P.Y. Chen, S.J. Soll, L. Dinic, T. Ojo, M. Hafner, M. Zavolan, and T. Tuschl. Molecu-lar characterization of human Argonaute-containing ribonucleoprotein complexes and their bound target mRNAs. RNA, 2008.

[27] F. Lejeune and L.E. Maquat. Immunopurification and analysis of pro-tein and RNA components of mRNP in mammalian cells. Methods Mol Biol, 257, 2004.

[28] A. Galgano and A.P. Gerber. RNA-binding protein immunopurification-microarray (RIP-Chip) analysis to profile lo-calized RNAs. Methods Mol Biol, 714, 2011.

[29] Scitable by Nature Education. Scientists Can Study an Organism’s Entire Genome with Microarray Anal-ysis, http://www.nature.com/scitable/topicpage/ scientists-can-study-an-organism-s-entire-6526266. Date accessed: 2012-12-01.

[30] G. Easow, A.A. Teleman, and S.M. Cohen. Isolation of microRNA targets by miRNP immunopurification. RNA, 13(8):1198–204, 2007. [31] P.J. Park. ChIP-seq: advantages and challenges of a maturing

technol-ogy. Nat Rev Genet, 2009.

[32] Y. Zhang, T. Liu, C. Meyer, J. Eeckhoute, D. Johnson, B. Bernstein, C. Nussbaum, R. Myers, M. Brown, W. Li, and X. Liu. Model-based Analysis of ChIP-Seq (MACS). Genome Biol, 9(9):R137, 2008.

[33] R.H. Waterston, K. Lindblad-Toh, E. Birney, J. Rogers, and et al. Abril. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–62, 2002.

[34] University of California, Santa Cruz. UCSC Genome Browser. http://genome.ucsc.edu/. Date accessed: 2012-01-01.

[35] K.D. Pruitt, T. Tatusova, W. Klimke, and D.R. Maglott. NCBI Ref-erence Sequences: current status, policy and new initiatives. Nucleic Acids Res, 2008.

[36] A.N.H. Creager. Phosphorus-32 in the Phage Group: radioisotopes as historical tracers of molecular biology. Stud Hist Philos Biol Biomed Sci, 40(1):29–42, 2009.

[37] The Rockefeller University. Ago-miRNA-mRNA ternary map by HITS-CLIP, http://ago.rockefeller.edu. Date accessed: 2011-11-01.

(34)

[38] University of Toronto Computational Biology Lab. SHort Read Map-ping Package, http://compbio.cs.toronto.edu/shrimp/. Date ac-cessed: 2012-03-01.

Generation and Bioinformatic Analysis of Synthetic Ago HITS-CLIP Data

Examensarbete 45 hp

Juni 2013

Generation and Bioinformatic

Analysis of Synthetic Ago HITS-CLIP

Data

Mehmet Ali Arslan

Abstract

Generation and Bioinformatic Analysis of Synthetic

Ago HITS-CLIP Data

Mehmet Ali Arslan

Contents

Chapter 1

Introduction

Chapter 2

Background

2.1

mRNA

2.2

miRNA

2.3

miRNA-mRNA interaction

2.4

Poisson Distribution

Chapter 3

Literature Review

Chapter 4

Methodology

4.1

Inputs and their handling

4.2

Synthetic Data Generation

4.3

Peak (Target Site) Detection

Chapter 5

Results

5.1

Peak calling on synthetic data

5.2

Peak calling on Ago data

Chapter 6

Conclusions

Chapter 7

Bibliography