basicASE: an R-statistics package to detect significant allelic imbalance in RNA-seq data

(1)

UPTEC X 12 013

Examensarbete 30 hp Juni 2012

basicASE: an R-statistics package to detect significant allelic

imbalance in RNA-seq data

Jesper Gådin

(2)

(3)

Bioinformatics Engineering Program

Uppsala University School of Engineering UPTEC X 12 013 Date of issue 2012-06

Author

Jesper Gådin

Title (English)

basicASE: an R-statistics package to detect significant allelic imbalance in RNA-seq data

Title (Swedish)

Abstract

Investigation of allelic imbalance expression QTLs (aeQTLs) in RNA-seq data can increase the accuracy in the hunt for candidate genes susceptible as drug targets. The Bioconductor project has recently developed research tools to handle the massive amounts of data that come with next generation sequencing. The basicASE package has been developed to carry out aeQTL analyses in that environment. Many technical issues concerning a user friendly import, statistical testing and visualization of data are facilitated through the package, which also has a central design to accommodate integration of new functions and development.

Keywords

BasicASE, allelic imbalance, aeQTL, R, Bioconductor, NGS, RNA-seq

Supervisor

Lasse Folkersen

The Center for Molecular Medicine at Karolinska Institute Scientific reviewer

Mikael Thollesson

Department of Organismal Biology at Uppsala University

Project name Sponsors

Language

English ^Security Secret until 2013-10

ISSN 1401-2138 Classification

Supplementary bibliographical information

Pages

49 Biology Education Centre Biomedical Center Husargatan 3 Uppsala

Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 471 4687

(4)

(5)

basicASE: an R-statistics package to detect significant allelic imbalance in RNA-seq data

Jesper Gådin 2012-06

Populärvetenskaplig sammanfattning

Med dagens teknik kan man avkoda allt DNA eller RNA i ett prov från en människa på mindre tid än ett dygn. Detta har medfört att väldigt mycket data har blivit tillgängligt för analys. För att kunna hantera detta har det utvecklats många bioinformatiska program som effektivt kan hitta intressanta spår från eller orsaker till sjukdom.

BasicASE är ett sådant program och det används för att hitta tänkbara sjukdomsgener. Det fungerar så att man mäter om det sker lika mycket uttryck från mammans som pappans gener.

Om så inte är fallet kan man misstänka att den gen som har en obalans av uttryck är orsaken till sjukdom. Vanligen har man sedan tidigare kända DNA-variationer kopplade till specifika

sjukdomar som undersöks, men det går också att titta igenom hela genomet för att hitta obalanser.

BasicASE innehåller verktyg för att statistiskt avgöra om funna obalanser är orsakade av slumpen eller av regulatoriska element som påverkar genen av intresse. Den genen blir i sådana fall föremål för en större undersökning för att hitta vad exakt som är orsaken, och hur detta påverkar resten av kroppen.

Examensarbete 30 hp

Civilingenjörsprogrammet Bioinformatik

Uppsala Universitet, Juni 2012

(6)

(7)

1 Introduction 1

2 Background 1

2.1 High throughput data . . . . 1

2.1.1 General . . . . 1

2.1.2 Expression microarrays . . . . 1

2.1.3 RNA-seq . . . . 2

2.2 GWAS . . . . 2

2.3 eQTLs . . . . 2

2.4 Allelic imbalance . . . . 2

3 Material and Methods 3 3.1 Technical formats . . . . 3

3.1.1 FASTQ . . . . 3

3.1.2 SAM and BAM . . . . 3

3.1.3 VCF . . . . 3

3.1.4 BCF and BGZF . . . . 4

3.2 S oftware . . . . 4

3.2.1 Samtools . . . . 4

3.2.2 BWA . . . . 4

3.2.3 R-statistics environment . . . . 4

3.3 R packages . . . . 5

3.3.1 biomaRt . . . . 5

3.3.2 IRanges . . . . 5

3.3.3 Biostrings . . . . 5

3.3.4 GenomicRanges . . . . 5

3.3.5 Rsamtools . . . . 6

3.4 RNA-seq and SNP Repositories . . . . 6

3.4.1 dbSNP . . . . 6

3.4.2 HapMap . . . . 6

3.4.3 1000 genomes project . . . . 6

3.4.4 Sequence read archive(SRA) . . . . 6

4 Results 6 4.1 The basicASE package . . . . 7

4.2 Documentation and Manual . . . . 7

4.3 Vignette . . . . 11

4.4 basicASE plots . . . . 11

4.5 Verication . . . . 11

4.6 Algorithms . . . . 13

5 Discussion 15 5.1 The RNA-seq Microarray competition . . . . 15

5.2 Approach to subdue a complex disease . . . . 15

5.3 Heterozygosity . . . . 15

5.4 Verication inference . . . . 15

5.5 Algorithm analysis . . . . 16

5.6 basicASE integrative design . . . . 16

5.7 Computer storage . . . . 16

5.8 Why use R as the statistical analysis tool? . . . . 16

6 Conclusion 16

(8)

7 Future 17

8 Acknowledgements 17

9 References 17

Appendix A Vignette

Appendix B Manual 29

Appendix C Perl scripts 41

Appendix D R scripts for Algorithm testing 45

Abbreviations

aeQTL Allelic Imbalance for Expression Quantitative Trait Locus BAM Binary Alignment and Map format

BCF Binary variant Call Format CMM Center for Molecular Medicine

CNV Copy Number VariationSNP Single Nucleotide Polymoprhism eQTL Expression Quantitative Trait Locus

FASTQ a fasta le extended with quality information from sequencerR The R-statistics project FTP File Transfer Protocol

GWAS Genome Wide Association Studies Latex A typesetting system

LOH Loss Of Heterozygosity

MPMB Maternal and Paternal Mapping Bias NCBI National Center for Biotechnology Information NGS Next Generation Sequencing

RLE Run Length Encoding

RNA-seq the RNA sequencing method SAM Sequence Alignment and Map Format SNP Single Nucleotide Polymorphism SRA Sequence Read Archive

Sweave the S language weaved into Latex VCF Variant Call Format

21

(9)

1 Introduction

In the quest to nd drug targets, one approach has been through the genome wide association studies (GWAS).

In this genome-wide study that became very popular around the year of 2007, considerable economic resources have been spent on nding risk SNPs associated with disease. Since 2007 a long list of risk SNPs has been generated, but has also led to an uncertainty how to take the next step in how to nd drug targets. In an eQTL study it has been possible to associate mRNA expression to risk SNPs, and then nd a protein that could be the target of drugs [1]. To improve an eQTL study one could include the use of the expression imbalance between two alleles, which performs better than an ordinary eQTL [2]. To nd that allelic imbalance for the allele-expression one would need access to RNA sequence data, which would give information on the exact number of gene products each allele has produced. A good statistics tool is necessary to nd associations between imbalances and SNPs. R is a programming language that help scientists developing their own statistical software and easily share it to others [3]. To Include R-statistics, RNA-seq and allelic imbalance is a great advice to any GWAS frontline project.

This project took place at the Center for Molecular Medicine(CMM), a part of Karolinska Institutet in Stockholm. Dr. Lasse Folkersen who works as a post-doc at CMM has been supervising the project.

2 Background

This project aims at realizing the possibilities for drug discovery that the combination of aeQTL and RNA- seq gives, and facilitate an analysis through an R package. An explanation of important concepts, such as aeQTL and RNA-seq will be covered in this section.

2.1 High throughput data

Even if this has been a project using RNAseq data, this part gives an explanation of some general features for high-throughput sequencing techniques. An introductory guide to high-throughput bioinformatics has been written by Baggerly et al. [4]. The relevant parts of his guide will be thoroughly used in subsection 2.1.2.

The subsection 2.1.3 will cover RNAseq and is based on Wang et al. [5].

2.1.1 General

High throughput data is a collection of a large amount of related data, coming from a small number of samples. The most common technique for high throughput assays has been microarrays. Other methods used includes SAGE, CAGE, MPSS [5].

2.1.2 Expression microarrays

Expression microarrays measures the relative mRNA levels within a sample. Expression levels for thousands of genes from only one single run of the experiment can be obtained. A common expression microarray platform is the Aymetrix GeneChip. This small chip has a probe set representing the complementary strands from the target genome. Complementary strands are about 25 base pairs long and every probe has its own unique variant, to account for dierent anities on dierent parts of the gene strands. And so one gene may be represented severaltimes on an Aymetrix GeneChip. There are severaldiculties with microarray assessments, which should be remembered. Here is a list of them.

• Overlap between short gene sequences

• What is believed to be a gene today, could be something else in the future.

• Cross-hybridization, binding to the wrong target.

• Need of normalization between experiments [5].

These are the fundamental diculties that can turn up as problems in an analysis of microarray data.

1

(10)

2.1.3 RNA-seq

The techniques for the sequencing of DNA have become fast, reliable and cheap. This opens up the door to do the same thing on RNA-data, and by that out-compete microarrays for gene proling. RNA sequencing (RNA-seq)has compared to microarrays the only major disadvantage that it is more expensive. But looking at the progress of the next generation sequencing (NGS)techniques, it is just a matter of time before it is cheap enough. Some specic benets with RNA-seqs are listed below [6].

• No need to know the genomic sequence of the organism investigated.

• No background noise.

• Knowledge of the exact amount of expression for each RNA sequenced.

• Bar-code labeling, giving all cDNAs a unique identier can lower the sequencing error rate.

Another complex biological question is how to discover and measure the iso-form expressions, which are the products of alternative splicing. RNA-seq is already competing with microarrays, and its potential increases as the read length produced by sequencing machines becomes longer. RNA-seq oers interesting bioinformatic challenges how to handle the enormous amounts of data [5].

There are however also some pitfalls in RNA-seq. Between samples there is still a need for normalization, and duplicated genes can make mapping dicult. Structural variants such as SNPs creates a reference genome bias, where alleles with the same polymorphism as the reference genome acquire a mapping advantage [7].

2.2 GWAS

Since the Wellcome Trust Case Control Consortium (WTCCC)paper 2007 ve years ago [8], there has been an argument if the GWAS are worth the eort or not. The critics has said that the approximately 2000 loci associated to disease in human has been a too expensive exercise, accounting for an international investment of $250 million. Other arguments claiming that the low level of ndings is because there probably never been any association to be found from the start. GWAS assume that genetics are involved as risk in the development of common disease. The optimists instead claim that the price for each detected locus of about $125 000 actually is a good investment. And to use multiple loci in more than one ethnic population could further help discovering disease architecture. The future of GWAS depends much on the technological advancement of aordable sequencing that could help detecting more genes, pathways and other biological insight. Sequencing would also allow for the detection of rare variants that has not been possible to catch using microarrays [9].

2.3 eQTLs

In traditional QTL analyses, linkage mapping leads to the detection of genomic regions which are associated with phenotypic variations within a population [10]. Using gene expression data as quantitative traits is then simply called eQTL [11]. An eQTL is when a genetic variant aects the transcript levels of a gene.

The location of the variant could be both close or far away from the gene it regulates. Even if most gene regulations are aected by variations far away, the closer variant has more impact and is therefore easier to

nd. The problem related to RNA-seq data is that the variants outside the protein-coding regions would not be detected. Linking gene variants to gene expression in this way has the advantage that no prior knowledge of the causal allele is needed. But the association analysis should be done in the cell lineage tissue believed to be associated to the disease [12].

2.4 Allelic imbalance

To take the eQTL analysis one step further, RNA-seq data can reveal any allele specic expression dierence.

If the transcript levels from a heterozygote gene shows a deviation from a 1 to 1 ratio it makes the gene a strong candidate [12].

Allelic imbalance has dierent meanings in dierent contexts, one of them is allelic imbalance for expression

quantitative trait locus (aeQTL). In an aeQTL the mRNA level of two dierent alleles gives information if

(11)

there is an expression-imbalance between the alleles in a heterozygous gene. If an imbalance can be shown, its regulation is probably at least partly due to a cis-acting inuence. Traditionally the eQTL analysis compares expression between samples, which makes it vulnerable to inuences on both alleles such as age, gender or population stratication. An aeQTL analysis eludes that problem [2]. If iso-form information from splice variants are available they could be tested for allelic imbalance in the same way as a gene. A problem when testing for allelic imbalance is that it can only be detected in genes with heterozygote SNPs, as it will make the transcripts dierent. And if one of the alleles is completely silenced it will show up as an homozygous site, and a possible aeQTL would be missed. The reference genome bias or also called the maternal paternal mapping bias (MPMB) can be a problem when working with RNA-seq data. it is more probable that the read most similar to the reference genome will align most, and therefore creating a false allelic imbalance.

This can partly be solved by creating a paternal and maternal reference genome using phase-information from HapMap(3.4.2) together with the structural variants showing up when performing a variant call [7].

Allelic imbalance can also mean the complete loss of an allele or an increase in copy number of one allele relative to the other. These two types of DNA level allelic imbalances are most often referred to as loss of heterozygosity (LOH) and copy number variation (CNV). LOH can arise from physical deletion, chromosome non-disjunction, mitotic non-disjunction, mitotic recombination and gene conversion [13]. CNV happens when the number of copies of an allele is amplied or deleted, and also creates an imbalance in expression depending on number on genes more than regulation. This imbalance can easily be misunderstood as an aeQTL if CNV sites are not ltered out. Both LOH and CNV are highly suspected to play a big role in cancer disease [14]. Which proves that a high understanding of allelic imbalance could be a good tool to get a deeper understanding of pathological pathways. This project only deals with the aeQTL.

3 Material and Methods

This section contains of formats, software and dependencies in association to the basicASE package.

3.1 Technical formats

This section describes important formats that are used by the basicASE package.

3.1.1 FASTQ

The FASTQ format has become a standard for representing sequence data, but unfortunately the FASTQ format has no clear denition. Dierent versions of FASTQ has been developed, and diers is in how the PHRED score is calculated. PHRED is a quality measure for each base, and has an character that represent each base. As an example The Sanger standard has a PHRED score between 0 and 90, and Solexa/early Illumina has a PHRED score between -5 and 60. Conversion between variants of the FASTQ format is not a major problem. Implementations to handle this issue exist in all of the common bioinformatics script languages [15].

3.1.2 SAM and BAM

The Sequence Alignment/Map format (SAM) is a generic alignment format, with support for both single and paired end reads. There are 11 mandatory elds for alignment and character sequence information. And also an option to include a variable number of elds for storage of extra information. The binary alignment/map format (BAM) is the SAM formats binary equivalent and increase performance while containing exact the same information. Having a BAM format le indexed on position makes it possible to streamline processes accessing only specic regions of the genome, without loading the whole le into memory. A BAM le would not only be accessed easy, but also very compact. An example alignment of 112Gbp of Illumina GA data would require only 116GB of disc space [16].

3.1.3 VCF

The variant call format (VCF) is a generic format for storing polymorphism data, and post-processing. It can handle millions of sites variants and thousands of samples in the same le. There are eight mandatory

3

(12)

elds with alignment and alternative variants, and an extra eld for extensible annotation (INFO). The allele count (AC) annotation in the INFO eld will be of extra importance in the context of this project, because it carries the allelic imbalance information [17].

3.1.4 BCF and BGZF

One format for fast streamline access of specic regions in an VCF le is the binary call format (BCF), which is the VCFs binary equivalent. If the VCF or BCF formats are not used by the variant calling software, the BGZF exists as a more general format, which can handle more types of TAB-delimited les. Regions can then be accessed easily with the tabix software [18].

3.2 Software

This section describes important software tools that are used by or together with the basicASE package.

3.2.1 Samtools

Samtools is a universal tool to handle generation sequence data. It can convert between various alignment formats, sort and merge alignments. Samtools also has a pileup function named mpileup, which can perform an SNP call to nd variants in the data. And there is a simple implemented alignment viewer for fast look- ups [16]. The output from Samtools mpileup function is a VCF le, but can be additionaly processed and converted to its binary eqvivalent BCF using Bcftools. Bcftools is a software distributed together with samtools, as a part of the samtools suite [19].

3.2.2 BWA

BWA is a software tool for aligning reads data with ecient string matching using the Burrows-Wheeler transform(BWT). Usually alignment software hashes all reads and slide through the reference genome, to map all reads to a genomic location. This has been a very memory demanding and slow approach. By instead using the BWT and a backward search the performance has dramatically increased. BWA can perform gapped alignment allowing mismatches and gaps for single or paired end mapping on large genomes such as the human genome. It has the SAM format as default output [20].

3.2.3 R-statistics environment

The R environment is open-source and fully dependent of volunteers. Traditionally software was developed by hierarchical steering, beneting by the obligations labours felt while getting paid. Open source projects relies therefore even more to a good organization, to maintain the volunteers interest at the same time as driving the project forward. Many open source projects such as Perl and Linux are developed by hierarchical top steering.

And R has a core team that consists of about 20 individuals maintaining the core functionalities, to keep the

foundation of R running. This makes R a comparatively independent organization, which is facilitated much

by the package system allowing individuals to develop their own package without any acceptance from the

R core team. But a rapid drop out of the key individuals in the R core team would seriously damage the

project as a whole, and could be a risk worth knowing by the user [21]. Sometimes it is said that R has a

steep learning curve compared to statistical packages like Stata, SAS and SPSS. But forgetting that R has

the advantage of being a programmatic language and gives the user the freedom to do any type of analysis of

interest. For example the one-rectangular-dataset restriction in Stata makes it easier to learn, but also limits

the possibilities. SAS could be a good choice for handling data, but is clearly secondary to R when it comes to

graphics. In comparison to R the SPSS package can be experienced as inexible, having rectangular datasets

and providing insucient programming tools. R was developed by statisticians but has capabilities as any

other programmatic language. And can produce publication quality graphics, which makes it a good tool

for researchers wanting to form new ideas and then present them to the scientic community. The number

of academic publications on google scholar, gives a hint of the future for dierent statistical software, and

shows that both Stata and R has a predicted good future, see g 1. R works on Windows, Macintosh, Linux

and Unix, and data can freely be moved between dierent platforms [22].

(13)

Figure 1 Use of data analysis software in academic publications, measured by google scholar. The right graph diers from the left by excluding SPSS and SAS, http://r4stats.com/articles/popularity/.

3.3 R packages

The Bioconductor project is an open source suite of R-packages with cutting edge research tools for genomic data. In basicASE a number of these packages are dependencies and need a short description [23].

3.3.1 biomaRt

biomaRt is an interface to BioMart for fast retrieval of gene annotations, homolog mapping and SNP infor- mation. The search speed comes with the BioMart database star and reverse-star schema to avoid complex joins [24, 25]. Depending on situation the biomaRt approach would better suit these questions than otherwise bulk downloading the information of interest. A bulk download can be tedious to parse every time a new version of the le is released, but can be advantageous when the internet connection is limited or the BioMart information has to be accessed repeatedly [26].

3.3.2 IRanges

A package with a class to represent sequences in a range as numbers with start and stop, and functions to act on these intervals . The IRanges class can use the run length encoding format (RLE) to save memory, which is good when information is redundant, [27].

3.3.3 Biostrings

A package with a class to store string representations of DNA and RNA sequences, and functions to eectively access and manipulate these representations [28].

3.3.4 GenomicRanges

The GenomicRanges class has become the foundation for NGS data within the bioconductor project, and is a container for other classes important for NGS, like IRanges and Biostrings. There are a minimal requirement of chromosome and position information, but unlimited possibilities to extend the GenomicRanges object

5

(14)

and keep manipulation of the object eective. Genomic Ranges is built for interval operations, and has a lot of mapping and sub-setting functions [29].

3.3.5 Rsamtools

Rsamtools main purpose is to import Bam les to R, but could also import other le formats such as the BCF. Rsamtools also provides an interface between R and samtools. Rsamtools is the usual starting point in any NGS analysis handled in R [30]

3.4 RNA-seq and SNP Repositories

Ocial data repositories can be a great resource, and below are short descriptions of the ones that relates to the basicASE package.

3.4.1 dbSNP

dbSNP is a single nucleotide polymorphism database hosted by the National Center for Biotechnology Infor- mation (NCBI), and is a central repository for genetic variation. Because of the integration to NCBI, SNPs in the dbSNP are crosslinked to other information resources in the NCBI framework, such as GenBank, [31].

There has been some critiscism about dbSNP, as of 2010s dbSNP build 129, half of the reported SNPS are candidates and not validated in a population [32].

3.4.2 HapMap

The HapMap project aims to identify genetic variants across the human genome, such as SNPs. All data is unrestricted public and located at the projects web site, http://www.hapmap.org. There are possibilites for bulk downloads, download a specic set of regions or using the genome browser [33]. 2010 there were around 10 million genetic variants identied by the HapMap project togheter with The Human Genome Project and the SNP consortium [34].

3.4.3 1000 genomes project

The main goal of the 1000 genomes project is to explore the genotype information of DNA polymorphism in human populations of dierent origin by using high-throughput sequencing technologies. All data that comes out from the project is publicly available from the project website (http://www. 1000genomes.org).

To minimize the errors that could come from cell lines blood derived DNA was used. It has been shown that the vast majority of the SNPs were already present in dbSNP. And the populations with most newly discovered SNPs had african ancestry [35].

3.4.4 Sequence read archive(SRA)

Initially SRA was known as the short read archive but because of the long reads now produced by high- throughput sequencing techniques, it has changed the name to the sequence read archive. Auxiliary tools to eective search in the SRA archive and handle the data is available as the SRA Toolkit on their webpage [36, 37].

4 Results

The product of this project was a software named basicASE with comprehensive documentation to handle basic allelic imbalance questions from RNA-seq data. R Software repositories such as Bioconductor also recommends developers to contribute a vignette. The vignette is distributed together with the package to introduce new users how to use all functions and in what situations.

The package has been developed for a central role, and so needs great integrability to accommodate for

dierent sources of input. An example of how an allelic imbalance testing workow is shown in g 2. The

same workow can be used in the basicASE package. Basically, all SNPs of interest are collected from

(15)

(16)

1

2 > data( snpAFList )

3 > head ( snpAFList )

4 $` chr1 :109852648 `

5 A T GC

6 ALL_ERR031838 . F q f i l t . bam 0 0 0 2

8

9 $` chr1 :109856306 `

10 A T GC

13

14 $` chr1 :109858119 `

15 A T GC

18

19 $` chr1 :109859968 `

20 A T GC

23

24 $` chr1 :109859985 `

25 A T G C

28

29 $` chr1 :109860144 `

30 A T G C

Figure 3 An example of the intermediate data structure, of the SNPs in the list, each carrying their own

data frame. In this example ALL_ERR031838.Fqlt.bam is the sample name and the gures on the right the

distribution between the alleles A,T,G,C for one specic SNP.

(17)

1 > chiAFTotDfFilt

2 A T G C c h i s q r a t i o count1

3 chr1 .109865308 0 8 25 2 0.00308318609090795 3.125 25

4 chr1 .109867892 1 5 18 6 0.0143058784354296 3 18

5 chr12 .57550285 0 20 2 40 0.00982327450751926 2 40

6 chr12 .57555078 10 0 25 0 0.0112298866529167 2 . 5 25

7 chr12 .57560188 9 26 0 0 0.00405919636555506 2.88888888888889 26

8 chr12 .57560445 6 0 28 0 0.00016131642030862 4.66666666666667 28

9 chr12 .57579346 2 27 51 1 0.00657841361282637 1.88888888888889 51

10 chr12 .57585144 0 134 3 190 0.00186384793502995 1.41791044776119 190

11 chr12 .57587839 1 90 0 127 0.0120143125760759 1.41111111111111 127

12 chr12 .57592557 0 21 0 46 0.0022563442406387 2.19047619047619 46

13 chr12 .57593101 164 0 208 1 0.0225310715948584 1.26829268292683 208

14 chr12 .57602632 131 1 172 3 0.0185033512463685 1.31297709923664 172

15 chr12 .57602815 0 1 168 245 0.000151302222455018 1.45833333333333 245

16 chr2 .216249430 11 30 0 0 0.00300426223939828 2.72727272727273 30

17 chr2 .216299629 19 46 84 1 0.000859703965599114 1.82608695652174 84

18 count2 ch1 ch2

19 chr1 .109865308 8 G T

20 chr1 .109867892 6 G C

21 chr12 .57550285 20 C T

22 chr12 .57555078 10 G A

23 chr12 .57560188 9 T A

24 chr12 .57560445 6 G A

25 chr12 .57579346 27 G T

26 chr12 .57585144 134 C T

27 chr12 .57587839 90 C T

28 chr12 .57592557 21 C T

29 chr12 .57593101 164 G A

30 chr12 .57602632 131 G A

31 chr12 .57602815 168 C G

32 chr2 .216249430 11 T A

33 chr2 .216299629 46 G T

Figure 4 An example of a chi-square test and chi-square ltered extended intermediate data structure. Each row contains the merged information from all individuals for one SNP.

9

(18)

1 \name{BamImpGRList}

2 \a l i a s{BamImpGRList}

3 \t i t l e{ Import Bam S e l e c t i o n }

4 \d e s c r i p t i o n{

5 Imports a s e l e c t i o n o f a bam f i l e s p e c i f i e d by a GenomicRanges o b j e c t as searchArea .

6 }

7 \ usage {BamImpGRList(UserDir , searchArea , ver bose = TRUE) }

8

9 \ arguments {

10 \ item { UserDir }{The r e l a t i v e or f u l l path o f f o l d e r c o n t a i n i n g bam f i l e s . }

11 \ item { searchArea }{A \ code {GenomicRanges o b j e c t } that c o n t a i n s the r e g i o n s o f i n t e r e s t }

12 \ item { verbose }{ S e t t i n g \ code { verbose=TRUE} g i v e s d e t a i l s o f procedure during f u n c t i o n run . }

13 }

14 \ d e t a i l s { Important that bam f i l e s are q u a l i t y f i l t e r e d and has no gaps . In near future , i n c l u d i n g reads with gaps w i l l be p o s s i b l e }

15 \ value {

16 \ code {BamImpGRList} r e t u r n s a GenomicRangesList o b j e c t .

17 }

18 \ r e f e r e n c e s {}

19

20 \ s e c t i o n {Warning}{Make s u r e your data i s q u a l i t y f i l t e r e d so no NA v a l u e s i n ranges , o t h e r w i s e the import w i l l f a i l}

21 \ note {

22 To look at o b j e c t s t r u c t u r e g i v e the command \ code { s t r (BamGRL) } . The next s t e p a f t e r import o f a \ code {BamGRL} o b j e c t would be to import a BcfGRL o b j e c t with \ code { BcfImpGRL} and the use \ code { snpAFListBcf } to s u b s e t i n f o r m a t i o n o f i n t e r e s t . Also make s u r e t h e r e i s a complementary index f i l e \ code {∗. bam . bai } f o r each bam f i l e in

\ code { UserDir } .

23 }

24

25 \ s e e a l s o {\ code {\l i n k{BcfImpGRList }}}

26 \ author { J e s p e r Gadin}

27 \ examples {

28 #Declare searchArea

29 searchArea<−GRanges (

30 seqnames = c(" chr1 "," chr12 "," chr2 ") ,

31 ranges = IRanges (

32 c(109852192 ,57522276 ,216225163) ,

33 c(109940573 ,57607134 ,216300895) ,

34 )

35 )

36

37 #R e l a t i v e or f u l l path

38 pathToFiles <− system.f i l e(" extdata ", package="basicASE")

39

40 #import

41 BamGRL<−BamImpGRList(pathToFiles , searchArea )

42 }

43 \keyword{ Bam }

44 \keyword{ Import }

Figure 5 An example of the Rd le format

(19)

1

2 \ s e c t i o n {An i n t r o d u c t o r y example}

3 Use the \Rpackage{basicASE} bam and b c f system f i l e s .

4

5 <<systemFiles , e v a l=TRUE>>=

6 #load package

7 l i b r a r y("basicASE")

8

9 #Construct SearchArea

10 searchArea<−GRanges (

11 seqnames = c(" chr1 "," chr12 "," chr2 ") ,

12 ranges = IRanges (

13 c(109852192 ,57522276 ,216225163) ,

14 c(109940573 ,57607134 ,216300895) ,

15 )

16 )

17 18

19 #Set Path to bam and b c f system f i l e s

20 #pathToFiles <− system . f i l e (" extdata " , package="basicASE ")

21

22 #import

23 #BamGRL<−BamImpGRList( pathToFiles , searchArea )

24 #BcfGRL <−BcfImpGRList ( pathToFiles , searchArea )

25

26 #p r e p a r a t i o n s t e p s

27 #snpAFList <− snpAFListBcf (BamGRL, BcfGRL)

28 data( snpAFList )

29 snpAFTotDf <− mergeAFList ( snpAFList )

30 chiAFTotDf <− chisqAFTotDf ( snpAFTotDf )

31 chiAFTotDfFilt <−filtChiAFTotDf ( chiAFTotDf , 0 . 0 5 )

32 chiAFTotDfFilt

33 @

Figure 6 Section from the BasicASE.Rnw le demonstrating the wrapping of R code in Latex code

4.3 Vignette

R-packages often comes with a vignette, which is an introduction to the package. The vignette introduces what questions are relevant for the package, and sometimes gives a small tutorial of how to begin the use of the package. The vignette is written in the Rnw language which can then be executed rst by R Sweave followed by Latex to automatically make the vignette PDF. The advantage of using the Rnw language is that the R code can be processed within latex code and produce plots or tables on the y to lower maintenance when updating R functions in the package. This functionality assures that the code presented in the vignette always is executable. An example of how to wrap the R code in latex code from the vignette Rnw can be found in g 6.

4.4 basicASE plots

The package can auto produce two essential allelic imbalance plots from the intermediate data strucure, which can be seen in g 7 and g 8. The g 7 plot is showing the allele frequency for a set of SNPs. Normally there should only be two alleles present, but due to sequencing errors or mapping bias more than two allele can show up. This can be good for visual exploration, to get a feeling for the distribution of alleles in the data.

The g 8 plot shows the ratio and chi-square signicance values for a set of SNPs and gives an immediate result to which SNPs are interesting and should be analyzed further.

4.5 Verication

A fastq read le was built from a gene and SNPs placed in user specied locations. To nd if data got lost during the analysis a test was performed, where the input of the pipeline was compared with what was

11

(20)

● ● ● ● ● ● ● ● ● ●

●

● ● ●

● ●

●

● ● ●

●

● ●

●

● ● ●

●

● ●

●

● ●

●

● ● ● ●

●

● ●

●

● ●

chr1.109865308 chr1.109867892 chr12.57550285 chr12.57555078 chr12.57560188 chr12.57560445 chr12.57579346 chr12.57585144 chr12.57587839 chr12.57592557 chr12.57593101 chr12.57602632 chr12.57602815 chr2.216249430 chr2.216299629

0244974122172220

Allele Count Distribution

Read Count

●

A T G C

Figure 7 Plot of the allele frequency distribution for a set of signicant, chi-square p-value<0.05, ltered SNPs.

chr1.109865308 chr1.109867892 chr12.57550285 chr12.57555078 chr12.57560188 chr12.57560445 chr12.57579346 chr12.57585144 chr12.57587839 chr12.57592557 chr12.57593101 chr12.57602632 chr12.57602815 chr2.216249430 chr2.216299629 Allele Frequency Ratio

Ratio

0 1 2 3 4

0.003 0.01 0.01 0.01 0.004 2e−04 0.007 0.002 0.01 0.002 0.02 0.02 2e−04 0.003 9e−04

With the most frequent allele as numerator

Figure 8 Abar plot showing on the y-axis the ratio imbalance between the two most expressed alleles for

every SNP on the x-axis. On the bars the chi-square p-values are indicated for additional proof of signicance.

(21)

Figure 9 Plot of verication results. The ratio describes the amount of reads that where not found by basicASE.

Table 1 A table pointing out the dierences between the algorithms tested for their functions and corresponding package.

Algorithm Algorithm dierence Package Purpose

Alg1 %in% GenomicRanges function,

R core function Built in functions may be optimized Alg2 subsetByOverlaps() GenomicRanges function Built in functions may be

optimized

Alg3 Vector arithmetics R core function Remove for loop

The Purpose column describes what could lay behind an increase in speed performance by using that algo- rithm

detected by basicASE. Figure 9 shows for dierent amount of randomly created reads, what ratio of the variant reads that got lost during the analysis. The Pie chart shows the impact of the parameter n (n=read depth). As shown, the largest depth gives the largest ratio of lost variant reads.

4.6 Algorithms

Most functions in basicASE are fast and do not need a special implementation to increase speed performance.

But one function is very critical and has got much attention. The function called snpAFListBcf() can take a long time and the time complexity will increase linearly with the amount of SNPs that are being investigated. The most important steps in the algorithm are described as pseudo-code in g, 10. The solving strategy for the performance problem has been circulating around vectorization and through this decreasing the amount of looping. The GenomicRanges package has also functions for sub-setting, which has been investigated. The subsetByOverlaps() function was investigated and showed no time improvement over the other algorithms tried, g 11. Another subsetting function was the %in%, which performed better than the subsetByOverlaps(). See table 1 for an overview over the algorithms dierence and purpose in optimization of the snpAFListBcf() function. In Alg3 some vector arithmetics was derived to remove the last for loop in the algorithm, and was used together with the %in% function. The vectorization eectively improved time compared to using a for loop.

13

(22)

1 #General Purpose

2 Fish out and count the c h a r a c t e r s f o r each snp p o s i t i o n i n the b c f o b j e c t . Test f o r s i g n i f i c a n t a l l e l i c imbalance .

3

4 #Overview o f algorithm

5 C o l l e c t SNPs found i n a l l samples .

6 I n i t i a l i z e an i n t e r m e d i a t e object , where the r e s u l t can be saved .

7 Look up each SNP and c o l l e c t the number o f c h a r a c t e r s .

8 Return The i n t e r m e d i a t e o b j e c t f o r f u r t h e r a n a l y s i s .

9

10 #The s t e p s more d e s c r i p t i v e

11 Algorithm s t a r t;

12 Declare an i n t e r m e d i a t e data s t r u c t u r e o b j e c t ;

13 Declare BCF object , c o n t a i n i n g a l l SNP p o s i t i o n s i n the r e f e r e n c e genome ;

14 Declare BAM object , c o n t a i n i n g a l l reads and t h e i r p o s i t i o n in the r e f e r e n c e genome ;

15

16 Merge a l l i n d i v i d u a l snps i n BCF o b j e c t ;

17

18 For each p o s t i o n in the merged BCF o b j e c t

19 For each i n d i v i d u a l i n the BAM o b j e c t

20 Get a s u b s e t o f reads from the i n d i v i d a l BAM o b j e c t c o n t a i n i n g that p o s i t i o n ;

21

22 I f s u b s e t i s 0 then

23 next i t e r a t i o n ;

24 Else

25 e x t r a c t a l l c h a r a c t e r s on that p o s i t i o n ;

26 count the number o f each c h a r a c t e r;

27 i n s e r t the i n d i v i d u a l sample c h a r a c t e r count i n f o r m a t i o n in the i n t e r m e d i a t e data o b j e c t ;

28

29 Return i n t e r m e d i a t e data o b j e c t ;

30

31 Algorithm end;

Figure 10 Algorithm pseudo-code for the function snpAFListBcf() .

Figure 11 Time performance for algorithm 1, 2 and 3. The x-axis shows the increase in size of the reads

sample. Every step can be seen as one amplication.

(23)

5 Discussion

An allelic imbalance analysis require a wide insight into areas spanning from choosing appropriate subjects to the analysis of data. The massive amount of data produced by NGS require great skills in both biology, statistics and programming to grasp the enormous amount of information provided. Rtogether with Bio- conductor is a great tool with a framework for cutting edge biology research. The basicASE package will use this framework to address the questions in the eld of allelic imbalance using RNA-seq data.

5.1 The RNA-seq Microarray competition

There is still a competition between RNA-seq and microarrays what method scientists choose to use, when conducting analysis of human data. Using microarrays on other organisms than humans is not trivial, because the probes are constructed from known genes in the organism under study. So working with for example sponge animals, the RNA-seq approach is the only available option if the researcher wants to perform an expression analysis. If cheaper experiments and longer reads becomes available for RNA-seq it may be preferred for expression analysis. A drawback to circumvent in using RNA-seq data would be the massive amounts of data generated. In clinical experiments you typically just use a small section of interesting genes.

Having RNA-seq data for a sample, a lot of it may never be used, and therefore taking up unnecessary resources of the sequencers and hard-drive space. But the potential of having access to a rich map of genomic information, concepts never thought of when the sample was taken could later be investigated. RNA-seq has a great potential, but suers as microarray experiments from cross-sample normalization exercises. But the allelic imbalance events gives more power also to the cross-sample normalization, escaping population stratication biases.

5.2 Approach to subdue a complex disease

Complex diseases such as the vascular diseases or cancer needs broader approaches to be explained. That is one of the purposes of GWAS, to search for gene variants in the whole genome that associate to disease.

The more gene variants available the closer we may come to a full explanation. The measurement of allele specic expression can provide a ne grained source of information to explain disease. Even lacking the full pathophysiological picture, researchers have begun to look for drug targets using eQTL.

To get more power in your analysis it is possible to use the allelic imbalance information in an aeQTL analysis, as it would suggest that one gene is regulated at least partly due that SNP or structural variant in the alleles. There is also a reverse more dicult way of using the allelic imbalance information. One could look for causes of the allele specic event, searching for regulatory factors aected by the strucutral variant.

RNA-seq delivers also the gene iso-forms, which also can be a target of allelic imbalance tests to nd more signicant drug targets.

5.3 Heterozygosity

Heterozygous reads are to date a requirement to be able to make aeQTLs, and to nd patients with the right genotype can be a problem. Geno-typing a large group of subjects can be expensive, but is more of an issue when it comes to rare diseases where patients are few. Making aeQTLs of homozygous genes, would need for example a uorescent probe to be attached on one of them. But this would be practically dicult, if not impossible. That manipulation of sequence would create a bias to what allele will be expressed most. In tissue samples the marking of alleles becomes even harder, when subjects would not had time to mark the alleles before biopsy.

5.4 Verication inference

As shown, no reads got lost during analysis when just using a data set of 200 reads picked out at random from a gene. This would indicate that no reads are lost because of script code error in the basicASE package.

Instead using larger datasets increases the number of lost reads, and would probably be due to that no alignment algorithms are perfect. Interestingly the ratio of lost reads increased when using 2 million reads compared to 10 thousand reads. Probably only due to that the data was picked at random and therefore

15

(24)

having less of the reads that are harder to align in the 10 thousand reads dataset. So the 2 million reads dataset is probably closer to the true value and a better measurement of mapping performance of BWA.

Using this verication result as a response to the MPMB, it could be concluded that it had a very low impact in this specic example. Even though the MPMB should not be forgotten, it may not be a hinder when performing an aeQTL.

5.5 Algorithm analysis

The result clearly indicate the need to replace for loops with vectorized functions. But caution, when vectorization can make the delivery to the intermediate data structure more dicult. This may indicate a future need to also improve the intermediate data structure further. The subsetByOverlaps() function was expected to be more eective than the %in%, which is an R core functionality inherited from the GenomicRanges package. But the explanation can be because of the subsetByOverlaps() function initially creates and optimize an interval search tree. With the purpose that it can be stored separately for repetitive queries instead of create an interval search tree for every query. A combination of an external interval tree and vectorized loops, would surely produce the most optimized function for extracting the allele distribution.

An implementation of an outer interval search tree is not trivial while the creation of the tree lies deep within the subsetByOverlaps() function. This is clearly something to be investigated more closely in the future.

5.6 basicASE integrative design

The basicASE package is still in its bud, lacking more sophisticated statistics. But has to its future advantage a well thought-out infrastructure, making it easy to integrate any other statistical packages which could be of interest for the user. Not only other functions are thought of, but also the possibility to import BAM and BCF les from public or private FTP-repositories.

5.7 Computer storage

The SRA is a deposit of sequence data, and from there it is possible to stream data directly from the FTP, using the SRA Toolkit. The 1000 genome projects also provides streaming from their FTPsupport, by letting Rsamtools access the BAM and BCF les located on their website. This service can eliminates the need of data storage on a computer when performing analysis on smaller regions.

5.8 Why use R as the statistical analysis tool?

There are softwares for statistical analysis like SPSS, SAS and Stata, but as data sets becomes larger and more complex with NGS a more exible software is needed. R could be that exible software, beneting from being a programmatic language, it promotes a deeper understanding of the data. And a benchmark on popularity is the fact that even though the code is open source, no one has yet tried to fork the code into a new distribution. This could be a lot explained by the ease of contribution of user packages, which gives the user the perfect amount of freedom to continue development instead of forking the R base code.

Another popular software that can be used for statistics not mentioned in the background is Matlab [39].

Unfortunately there is no ocial report available that has compared R and Matlab. But the general attitude on the Internet seems to be that Matlab was invented by engineers and R by statisticians and therefore only diers in the ease of performing dierent types of calculations. Matlab is not open source, which favors R if the price of the software is an issue. If there are plans of collaboration with countries or institutions with less economic resources, then R would be a good choice.

6 Conclusion

RNA-seq is here to stay, and a brand-new part of the Bioconductor environment has been developed to

facilitate the handling of high-throughput data. BasicASE would be the R-statistics integrated part to

accommodate the allelic imbalance questions from RNA-seq data.

(25)

7 Future

For the next version of basicASE some functions are already on their way such as easy import from dbSNP and HapMap. It would also be good to include a stream-function to direct stream BAM and BCF from FTP-repositories. There are also ideas on how to convert the so called intermediate data structure to an R class, which would give a lot of benets in managing functions. And make it easier for the user, when similar functions have the same name but dierent arguments. There is also still too much looping in the algorithms, which decreased speed performance. Instead in an environment such as R, the vector arithmetics is more highly favoured. Additional verication procedures would not necessarily only be an extension of the package, but could also be of interest to conrm or disprove the hypothesis of the importance of MPMB.

8 Acknowledgements

First of all I want to thank Lasse Folkersen for an interestingproject, and beingan excellent supervisor.

Thanks to Mikael Thollesson as subject reviewer and all the great feedback. Warm thanks to all new colleagues at Karolinska Institutet and CMM for all new input of cardiovascular diseases. And at last I would like to thank Per Eriksson for oeringa Ph.D-student position in his research group.

9 References

[1] C.J. Willer and K.L. Mohlke. Finding genes and variants for lipid levels after genome-wide association analysis, Curr Opin Lipidol. 2012 Apr;23(2):98-103, 2012.

[2] M.S. Cunnington, M Santibanez Koref, B.M. Mayosi, J. Burn, and B. Keavney. Chromosome 9p21 SNPs associated with multiple disease phenotypes correlate with ANRIL expression. PLoS Genetics, 6(4):e1000899, April 2010.

[3] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0.

[4] K.A. Baggerly, K.R. Coombes, and J.S. Morris. An introduction to high-throughput bioinformatics data. Ed. KA Do, P Mueller, M Vannucci. New York: Cambridge University Press, 2006. 1-33., 2006.

[5] Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):5763, 2009.

[6] K. Shiroguchi, T.Z. Jia, P.A. Sims, and X.S. Xie. Digital RNA sequencing minimizes sequence-dependent bias and amplication noise with optimized single-molecule barcodes. Proceedings of the National Academy of Sciences, January 2012.

[7] J. Rozowsky, A. Abyzov, J. Wang, P. Alves, D. Raha, A. Harmanci, J. Leng, R. Bjornson, Y. Kong, N. Kitabayashi, N. Bhardwaj, M. Rubin, M. Snyder, and M. Gerstein. AlleleSeq: analysis of allele- specic expression and bindingin a network framework. Molecular Systems Biology, 7, August 2011.

[8] Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.

Nature, 447(7145):661678, June 2007.

[9] P.M. Visscher, M.A. Brown, M.I. McCarthy, and J. Yang. Five Years of GWAS Discovery. Am J Hum Genet, 90(1):724, January 2012.

[10] B. Holloway, S. Luck, M. Beatty, J.A. Rafalski, and B. Li. Genome-wide expression quantitative trait loci (eQTL) analysis in maize. BMC genomics, 12(1):336, 2011.

[11] L. Folkersen, F. v. Hooft, E. Chernogubova, H.E. Agardh, G.K. Hansson, U. Hedin, J. Liska, A.-C.

Syvanen, G. Paulsson-Berne, A. Franco-Cereceda, A. Hamsten, A. Gabrielsen, P. Eriksson, and on behalf of the BiKE and ASAP study groups. Association of genetic risk variants with expression of proximal genes identies novel susceptibility genes for cardiovascular disease. Circulation: Cardiovascular Genetics, 3(4):365373, June 2010.

17

(26)

[12] A. Monteiro, G. Coetzee, M. Freedman, M. De Biasi, G. Casey, D. Duggan, A. Risch, C. Plass, P. Liu, M. James, H. Vikis, J. Tichelaar, M. You, S. Gayther, and I. Mills. Principles for the post-GWAS functionalcharacterisation of risk loci. Nature Precedings, November 2010.

[13] R. Mei. Genome-wide detection of allelic imbalance using human SNPs and high-density DNA arrays.

Genome Research, 10(8):11261137, August 2000.

[14] Z. Liu, A. Li, V. Schulz, M. Chen, and D. Tuck. MixHMM: inferring copy number variation and allelic imbalance using SNP arrays and tumor samples mixed with stromal cells. PLoS ONE, 5(6):e10909, June 2010.

[15] P.J.A. Cock, C.J. Fields, N. Goto, M.L. Heuer, and P.M. Rice. The sanger FASTQ le format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, 2010.

[16] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):20782079, 2009.

[17] P. Danecek, A. Auton, G. Abecasis, C.A. Albers, E. Banks, M.A. DePristo, R.E. Handsaker, G. Lunter, G.T. Marth, S.T. Sherry, et al. The variant call format and VCFtools. Bioinformatics, 27(15):21562158, 2011.

[18] H. Li. Tabix: fast retrieval of sequence features from generic TAB-delimited les. Bioinformatics, 27(5):718719, 2011.

[19] samtools.1. URL: http://samtools.sourceforge.net/samtools.shtml. Page Retrieved: 10-Jun-2012.

[20] H. Li and R. Durbin. Bioinformatics-2009-Li-1754-60.pdf, May 2009.

[21] J. Fox. Aspects of the socialorganization and trajectory of the r project. The R Journal, 1(2):513, 2009.

[22] P. Burns. R relative to statisticalPackages:Comment 1 on technicalreport number 1 (Version 1.0 statistical consulting group: UCLA academic technology services. Technical Report 1, 2006.

[23] Bioconductor. URL: http://www.bioconductor.org/. Page Retrieved: 10-Jun-2012.

[24] BioMart. URL: http://www.biomart.org/. Page Retrieved: 10-Jun-2012.

[25] S. Durinck, Y. Moreau, A. Kasprzyk, S. Davis, B. De Moor, A. Brazma, and W. Huber. BioMart and bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics, 21(16):34393440, 2005.

[26] S. Durinck, P.T. Spellman, E. Birney, and W. Huber. Mapping identiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols, 4(8):11841191, July 2009.

[27] H. Pages, P. Aboyoun, and M. Lawrence. IRanges: Infrastructure for manipulating intervals on se- quences. R package version 1.15.4.

[28] H. Pages, P. Aboyoun, R. Gentleman, and S. DebRoy. Biostrings: String objects representing biological sequences, and matching algorithms. R package version 2.25.2.

[29] P. Aboyoun, H. Pages, and M. Lawrence. GenomicRanges: Representation and manipulation of genomic intervals. R package version 1.9.7.

[30] M. Morgan and P. Pagès. Rsamtools: Binary alignment (BAM), variant call (BCF), or tabix le import.

R package version 1.7.40.

[31] S.T. Sherry, M.H. Ward, M. Kholodov, J. Baker, L. Phan, E.M. Smigielski, and K. Sirotkin. dbSNP:

the NCBI database of genetic variation. Nucleic Acids Research, 29(1):308311, January 2001.

(27)

[32] L. Musumeci, J.W. Arthur, F.S.G. Cheung, A. Hoque, S. Lippman, andJ.K.V. Reichardt. Single nucleotide dierences (SNDs) in the dbSNP database may lead to errors in genotyping and haplotyping studies. Human mutation, 31(1):6773, 2010.

[33] G.A. Thorisson, A.V. Smith, L. Krishnan, andL.D. Stein. A user's guide to the international HapMap project web site. URL: ftp://jimwatsonsequence.cshl.edu/presentations/users_guide_to_hapmap.pdf, Page Retrieved: 10-Jun-2012, 2005.

[34] D.M. Altshuler, R.A. Gibbs, L. Peltonen, E. Dermitzakis, S.F. Schaner, F. Yu, P.E. Bonnen, P.I.W.

de Bakker, P. Deloukas, S.B. Gabriel, R. Gwilliam, S. Hunt, M. Inouye, X. Jia, A. Palotie, M. Parkin, P. Whittaker, K. Chang, A. Hawes, L.R. Lewis, Y. Ren, D. Wheeler, D. Marie Muzny, C. Barnes, K. Darvishi, M. Hurles, J.M. Korn, K. Kristiansson, C. Lee, S.A. McCarroll, J. Nemesh, A. Keinan, S.B. Montgomery, S. Pollack, A.L. Price, N. Soranzo, C. Gonzaga-Jauregui, Alkes L. Price, V. Anttila, W. Brodeur, M.J. Daly, S. Leslie, G. McVean, L. Moutsianas, H. Nguyen, Q. Zhang, M.J.R. Ghori, R. McGinnis, W. McLaren, F. Takeuchi, S.R. Grossman, I. Shlyakhter, E.B. Hostetter, P.C. Sabeti, C.A. Adebamowo, M.W. Foster, D.R. Gordon, J. Licinio, M. Cristina Manca, P.A. Marshall, I. Matsuda, D. Ngare, V. Ota Wang, D. Reddy, C.N. Rotimi, C.D. Royal, R.R. Sharp, C. Zeng, L.D. Brooks, and J.E. McEwen. Integrating common andrare genetic variation in diverse human populations. Nature, 467(7311):5258, September 2010.

[35] D.M. Altshuler, E.S. Lander, L. Ambrogio, T. Bloom, K. Cibulskis, T.J. Fennell, S.B. Gabriel, D.B. Jae, E. Sheer, C.L. Sougnez, et al. A map of human genome variation from population scale sequencing.

Nature, 467(7319):10611073, 2010.

[36] D.L. Wheeler, T. Barrett, D.A. Benson, S.H. Bryant, K. Canese, V. Chetvernin, D.M. Church, M. DiCuc- cio, R. Edgar, S. Federhen, et al. Database resources of the national center for biotechnology information.

Nucleic acids research, 36(suppl 1):D13D21, 2008.

[37] Sequence readarchive:NCBI/NLM/NIH. URL: http://www.ncbi.nlm.nih.gov/Traces/sra/. Page Re- trieved: 10-Jun-2012.

[38] F. Leisch. Sweave. dynamic generation of statistical reports using literate data analysis. Institut fur Statistik und Wahrscheinlichkeitstheorie, Technische Universitat Wien, Wiedner Hauptstrasse 8- 10/1071, A-1040 Wien, Austria , Report No. 69, 2002.

[39] MATLAB. version 7.10.0 (R2010a). The MathWorks Inc., Natick, Massachusetts, 2010.

19

(28)

(29)

basicASE

Jesper Gadin jesper.gadin@gmail.com

Lasse Folkersen lasse.folkersen@ki.se June 9, 2012

1 Abbrevations

ASE Allele spciﬁc expression.

SNP An allele variant from a single nucleotide polymorphism.

BAM A binary ﬁle format for storing read alignment data.

BCF A binary ﬁle format for storing allele variants as SNPs.

2 basicASE

The basicASE package contains functions for investigating allelic imbalance from RNA-sequencing data in a subset of the genome. The package handles one or more samples(individuals) and provides neccessary plot functions for a basic analysis. All infrastructure used in this package are built with the aim to have as much data as easily accessible as possible. Accessibility is managed from an intermediate data.frame containing the most important information for a speciﬁc analysis, while the large amount of information is stored in the Genom- icRanges objects. So it’s easy for users to add a speciﬁc function on their own, or as a contribution to the package if they believes it also could be used by others. The structure of an intermediate data frame can be found under the

”An introductory example” section.

3 SNP sources

There are several sources where to find interesting SNPs, which is demonstrated by the SNP Source flowchart below, fig 1.There are possibilities for a high- throughput approach or a study of a selection of interesting SNPs or both.

4 Allelic imbalance testing on RNA-seq data

Compared to microarray data, RNA-seq data provides much more information like read counts and the exact sequence for every read. And these features are exactly where the basicASE package gets the advantage from. The example R code below merges all individual data to one pool, and calculates their united imbalance significance by a chi-square test. See also fig 2 for a schematic picture over the workflow.

1 (21)

Appendix A Vignette

(30)

Figure 1: A ﬂowchart over where to obtain interesting SNPs. data

Figure 2: Extension of ﬁg 1, showing the workﬂow of an allelic imbalance testing

2 (22)

(31)

5 Collecting data

There are two essential input data for this package, RNA-sequencing data and snp data. RNA-sequencing data could be from an user provided experiment or from a public project as the 1000 Genomes project http://www.1000genomes.

org/. Snp data, can be imported from several diﬀerent sources. The most prefer- able way would be to use a snp-calling method on the same RNA-sequencing data that will be used. And import both the reads and the snps found in these reads.

6 An introductory example

Use the basicASE bam and bcf system ﬁles.

> #load package

> library("basicASE")

> #Construct SearchArea

> searchArea<-GRanges(

+ seqnames = c("chr1","chr12","chr2"), + ranges = IRanges(

+ c(109852192,57522276,216225163), + c(109940573,57607134,216300895),

+ )

> #Set Path to bam and bcf system files

> #pathToFiles <- system.file("extdata", package="basicASE")

>

> #import

> #BamGRL <-BamImpGRList(pathToFiles,searchArea)

> #BcfGRL <-BcfImpGRList(pathToFiles,searchArea)

>

> #preparation steps

> #snpAFList <- snpAFListBcf(BamGRL,BcfGRL)

> data(snpAFList)

> snpAFTotDf <- mergeAFList(snpAFList)

> chiAFTotDf <- chisqAFTotDf(snpAFTotDf)

> chiAFTotDfFilt <-filtChiAFTotDf(chiAFTotDf,0.05)

> head(chiAFTotDfFilt)

A T G C chisq ratio count1 count2

chr1.109865308 0 8 25 2 0.00308318609090795 3.125 25 8

chr1.109867892 1 5 18 6 0.0143058784354296 3 18 6

chr12.57550285 0 20 2 40 0.00982327450751926 2 40 20

chr12.57555078 10 0 25 0 0.0112298866529167 2.5 25 10

chr12.57560188 9 26 0 0 0.00405919636555506 2.88888888888889 26 9 chr12.57560445 6 0 28 0 0.00016131642030862 4.66666666666667 28 6

ch1 ch2 chr1.109865308 G T chr1.109867892 G C chr12.57550285 C T

3 (23)

(32)

chr12.57555078 G A chr12.57560188 T A chr12.57560445 G A

A simple plot of the allele frequency can then be built.

> plotAFTotDf(chiAFTotDfFilt)

● ● ● ● ● ● ● ● ● ●

●

● ●

●

● ● ●

●

● ●

●

● ● ●

●

● ●

●

● ●

●

● ● ● ●

●

● ●

●

● ●

chr1.109865308 chr1.109867892 chr12.57550285 chr12.57555078 chr12.57560188 chr12.57560445 chr12.57579346 chr12.57585144 chr12.57587839 chr12.57592557 chr12.57593101 chr12.57602632 chr12.57602815 chr2.216249430 chr2.216299629

0244974122172220

Allele Count Distribution

Read Count

●

A T G C

> ratioplotAfDf(chiAFTotDfFilt)

4 (24)

(33)

chr1.109865308 chr1.109867892 chr12.57550285 chr12.57555078 chr12.57560188 chr12.57560445 chr12.57579346 chr12.57585144 chr12.57587839 chr12.57592557 chr12.57593101 chr12.57602632 chr12.57602815 chr2.216249430 chr2.216299629 Allele Frequency Ratio

Ratio

0 1 2 3 4

0.003 0.01 0.01 0.01 0.004 2e−04 0.007 0.002 0.01 0.002 0.02 0.02 2e−04 0.003 9e−04

With the most frequent allele as numerator

7 Get going - a small but extensive tutorial

To get a feeling for RNA-sequencing data it’s important to know all steps in- volved in the process. So here will be provided a small tutorial on how to take your analysis from raw sequencing data fastq.gz files into R, further process the data and produce some plots. A small warning has to be anounced for the first part of this tutorial. Make sure that a computer with enough memory is used, RNA-seq data files can be very big. And preferably also to gain some speed it could be wise too execute the code on a multicore computer.

7.1 PART ONE

Here the BWA and samtools software will be used. BWA is an alignment algo- ritm and samtools an alignment ﬁle convert and processing tool. For more exten- sive information the reader is directed to http://samtools.sourceforge.net/

basicASE: an R-statistics package to detect significant allelic imbalance in RNA-seq data

Examensarbete 30 hp Juni 2012

basicASE: an R-statistics package to detect significant allelic

imbalance in RNA-seq data

Jesper Gådin

Bioinformatics Engineering Program

Uppsala University School of Engineering UPTEC X 12 013 Date of issue 2012-06

Author

Jesper Gådin

Title (English)

basicASE: an R-statistics package to detect significant allelic imbalance in RNA-seq data

Title (Swedish)

Abstract

Keywords

BasicASE, allelic imbalance, aeQTL, R, Bioconductor, NGS, RNA-seq

Supervisor

Lasse Folkersen

The Center for Molecular Medicine at Karolinska Institute Scientific reviewer

Mikael Thollesson

Department of Organismal Biology at Uppsala University

Project name Sponsors

Language

English Security Secret until 2013-10

ISSN 1401-2138 Classification

Supplementary bibliographical information

Pages

49

Biology Education Centre Biomedical Center Husargatan 3 Uppsala

Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 471 4687

basicASE: an R-statistics package to detect significant allelic imbalance in RNA-seq data

Jesper Gådin 2012-06

Populärvetenskaplig sammanfattning

BasicASE är ett sådant program och det används för att hitta tänkbara sjukdomsgener. Det fungerar så att man mäter om det sker lika mycket uttryck från mammans som pappans gener.

Om så inte är fallet kan man misstänka att den gen som har en obalans av uttryck är orsaken till sjukdom. Vanligen har man sedan tidigare kända DNA-variationer kopplade till specifika

sjukdomar som undersöks, men det går också att titta igenom hela genomet för att hitta obalanser.

Examensarbete 30 hp

Civilingenjörsprogrammet Bioinformatik

Uppsala Universitet, Juni 2012

Contents

1 Introduction 1

2 Background 1

2.1 High throughput data . . . . 1

2.1.1 General . . . . 1

2.1.2 Expression microarrays . . . . 1

2.1.3 RNA-seq . . . . 2

2.2 GWAS . . . . 2

2.3 eQTLs . . . . 2

2.4 Allelic imbalance . . . . 2

3 Material and Methods 3 3.1 Technical formats . . . . 3

3.1.1 FASTQ . . . . 3

3.1.2 SAM and BAM . . . . 3

3.1.3 VCF . . . . 3

3.1.4 BCF and BGZF . . . . 4

3.2 S oftware . . . . 4

3.2.1 Samtools . . . . 4

3.2.2 BWA . . . . 4

3.2.3 R-statistics environment . . . . 4

3.3 R packages . . . . 5

3.3.1 biomaRt . . . . 5

3.3.2 IRanges . . . . 5

3.3.3 Biostrings . . . . 5

3.3.4 GenomicRanges . . . . 5

3.3.5 Rsamtools . . . . 6

3.4 RNA-seq and SNP Repositories . . . . 6

3.4.1 dbSNP . . . . 6

3.4.2 HapMap . . . . 6

3.4.3 1000 genomes project . . . . 6

3.4.4 Sequence read archive(SRA) . . . . 6

4 Results 6 4.1 The basicASE package . . . . 7

4.2 Documentation and Manual . . . . 7

4.3 Vignette . . . . 11

4.4 basicASE plots . . . . 11

4.5 Verication . . . . 11

4.6 Algorithms . . . . 13

5 Discussion 15 5.1 The RNA-seq Microarray competition . . . . 15

5.2 Approach to subdue a complex disease . . . . 15

5.3 Heterozygosity . . . . 15

5.4 Verication inference . . . . 15

5.5 Algorithm analysis . . . . 16

5.6 basicASE integrative design . . . . 16

English ^Security Secret until 2013-10

4.5 Verication . . . . 11

5.4 Verication inference . . . . 15

FASTQ a fasta le extended with quality information from sequencerR The R-statistics project FTP File Transfer Protocol

In the quest to nd drug targets, one approach has been through the genome wide association studies (GWAS).

These are the fundamental diculties that can turn up as problems in an analysis of microarray data.

• Bar-code labeling, giving all cDNAs a unique identier can lower the sequencing error rate.

The location of the variant could be both close or far away from the gene it regulates. Even if most gene regulations are aected by variations far away, the closer variant has more impact and is therefore easier to

To take the eQTL analysis one step further, RNA-seq data can reveal any allele specic expression dierence.

Allelic imbalance has dierent meanings in dierent contexts, one of them is allelic imbalance for expression

quantitative trait locus (aeQTL). In an aeQTL the mRNA level of two dierent alleles gives information if