• No results found

Transposable elements in the chimpanzee genome

N/A
N/A
Protected

Academic year: 2022

Share "Transposable elements in the chimpanzee genome "

Copied!
52
0
0

Loading.... (view fulltext now)

Full text

(1)

UPTEC 11 009

Examensarbete 30 hp Februari 2011

Transposable elements in the chimpanzee genome

Johan Stenninger

(2)

 

(3)

Bioinformatics Engineering Program

Uppsala University School of Engineering

UPTEC X 11 009 Date of issue 2011-02

Author

Johan Stenninger

Title (English)

Transposable elements in the chimpanzee genome

Title (Swedish) Abstract

The genomes of humans and chimpanzees are highly identical. Despite these similarities, there is a limited understanding of the structure and the genetic variation present in chimpanzees. In this project, we focus on characterization of Long Interspersed Nuclear Elements (LINEs), a form of transposable elements that cover a large fraction of the genome in humans and other primates. The aim of the current project was to identify LINEs in the chimpanzee genome that are not annotated in the reference sequence. This was done by using high throughput sequencing data for the chimpanzee genome, in combination with well- annotated, human-specific LINEs. Our results show that there are at least 20 full-length LINEs in the chimpanzee genome that have not yet been annotated in the reference sequence.

Our data also adds approximately 45,000 base pairs of sequence, previously annotated as gap regions in the chimpanzee assembly.

Keywords

Transposable elements, LINE, L1, transposons, SOLiD, Bowtie, orphans, mate-pair, NGS Supervisors

Lars Feuk

Department of immunology, genetics and pathology, Uppsala University Scientific reviewer

Ulf Gyllensten

Department of immunology, genetics and pathology, Uppsala University

Project name Sponsors

Language

English Security

ISSN 1401-2138 Classification

Supplementary bibliographical information Pages

46

Biology Education Centre Biomedical Center Husargatan 3 Uppsala Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 471 4687

(4)

 

(5)

Transposabla element i schimpansgenomet

Johan Stenninger

Populärvetenskaplig sammanfattning

Nästan hälften av vår arvsmassa (vårt genom) består av repetitiva DNA sekvenser med förmåga att förflytta sig i genomet. Den största gruppen av dessa kallas Long interspersed nuclear elements (LINEs) och utgör ca 20 % av genomet. LINEs har under evolutionen bidragit till att utöka genomet genom sin förmåga att kopiera sig till nya platser. Eftersom deras funktion inte ännu är kartlagd och studier har visat på att de är inblandade i vissa sjukdomar som t.ex. cancer är det väldigt intressant att undersöka dessa element. Dessutom är elementen en bidragande faktor till variation människor emellan då uppsättningen av LINEs ser olika ut i olika individer.

Målet med det här projektet var dels att utnyttja den stora likheten mellan schimpans- och människogenomet för att undersöka hur uppsättningen av LINEs skiljer sig mellan de båda arterna samt att försöka hitta element som ännu inte finns beskrivna i schimpansgenomet. Detta har gjorts möjligt genom att utnyttja befintlig data från helgenomsekvensering av två schimpanser framtagna på Rudbecklaboratoriet i samarbete mellan Lars Feuks forskargrupp och Uppsala genomcenter. Ett annat mål var att försöka förbättra den genomsekvens som idag finns för schimpansen för att på så sätt underlätta framtida jämförelser mellan de båda arterna.

Det här projektet bidrar till en bättre förståelse för hur vi skiljer oss från schimpansen när det gäller LINEs, något som kan hjälpa oss att förstå vilken funktion dessa element har. På lång sikt kan detta projekt också bidra till att klargöra ytterligare vad det är som skiljer oss människor ifrån schimpanserna, och om LINEs bidrar till människospecifika egenskaper, t ex mottaglighet för vissa infektioner, utveckling av vissa neurodegenerativa sjukdomar eller vår förmåga att kommunicera med tal.

Examensarbete 30hp Civilingenjörsprogrammet Bioinformatik Uppsala Universitet februari 2011

 

(6)

 

(7)

Abstract

The genomes of humans and chimpanzees are highly similar. De- spite these similarities, there is a limited understanding of the struc- ture and the genetic variation present in chimpanzees. In this project, we focus on characterization of Long Interspersed Nuclear Elements (LINEs), a form of transposable elements that cover a large fraction of the genome in humans and other primates. The aim of the cur- rent project was to identify LINEs in the chimpanzee genome that are not annotated in the reference sequence. This was done by us- ing high throughput sequencing data for the chimpanzee genome, in combination with well annotated, human-specific LINEs. Unmapped mate-pair reads were extracted from previous mappings of chimpanzee sequence data to the reference genome. These unmapped reads, called orphans, were mapped to a library of human LINEs to investigate the presence of these in the chimpanzee genome.

In a second step the mate-pairs were anchored to the reference assem- bly and orphan reads were used to fill gaps in the reference sequence.

Our results show that there are at least 20 full-length LINEs in the chimpanzee genome that have not yet been annotated in the refer- ence sequence. Our data also adds approximately 45,000 base pairs of sequence, previously annotated as gap regions in the chimpanzee assembly. We conclude that the non-unique portions of the genome are very poorly annotated in the chimpanzee, and our results will help towards creating a better chimpanzee reference sequence.

(8)
(9)

CONTENTS CONTENTS

Contents

List of Tables 5

List of Figures 5

List of Abbrevations 6

1 Introduction 7

1.1 Transposable elements . . . . 7

1.2 Goals . . . . 7

2 Background 9 2.1 Transposons . . . . 9

2.1.1 Introduction to transposons . . . . 9

2.1.2 LINEs . . . . 10

2.1.3 Transposition . . . . 10

2.1.4 Survival of elements . . . . 11

2.2 The chimpanzee genome . . . . 13

2.2.1 About the chimpanzee genome . . . . 13

2.2.2 The chimpanzee reference sequence . . . . 13

2.3 Sequencing . . . . 14

2.3.1 NGS Techniques . . . . 14

2.3.2 Mate-pairs . . . . 14

2.3.3 Orphan pairs . . . . 15

3 Materials & Methods 17 3.1 Project data . . . . 17

3.1.1 LINE-1 data . . . . 17

3.2 LINE Reference & Strategy . . . . 18

3.2.1 Reference . . . . 18

3.2.2 Project strategy . . . . 19

3.3 Mate-pair mapping . . . . 19

3.3.1 Orphans . . . . 20

3.3.2 Read mapping . . . . 21

3.4 Identification . . . . 22

3.4.1 LINE calculation . . . . 22

3.4.2 Clustering . . . . 24

3.4.3 Annotation of elements . . . . 25

(10)

CONTENTS CONTENTS

3.4.4 Filling gaps in the reference . . . . 26

4 Results 28 4.1 Mapped orphans . . . . 28

4.2 Clustering . . . . 28

4.3 Filling gaps . . . . 31

5 Future work 33 5.1 Verification of elements . . . . 33

5.2 Workflow . . . . 34

6 Discussion 35 6.1 Annotation of LINEs . . . . 35

6.2 Mapping orphans . . . . 36

6.3 Interpreting the results . . . . 36

6.4 Filling gaps . . . . 38

6.5 Verification and future work . . . . 38

7 Acknowledgements 40

References 41

Appendices 44

(11)

LIST OF TABLES LIST OF FIGURES

List of Tables

1 The different classes of retrotransposons . . . . 9

2 LINE Elements From UCSC . . . . 18

3 Human specific LINEs . . . . 28

4 Orphans mapped to the reference library using Bowtie . . . . 28

5 Single clusters . . . . 29

6 Double clusters . . . . 29

7 Orphans used . . . . 30

8 Gap statistics . . . . 31

List of Figures

1 Illustration of L1 integration by RT . . . . 12

2 Mate pair mapping . . . . 15

3 Orphan pairs . . . . 16

4 Library distributions . . . . 20

5 Mapping explained . . . . 21

6 Calculation of LINE positions . . . . 23

7 Clustering of mate-pairs . . . . 25

8 Creation of contigs . . . . 26

9 Distribution of reads/predicted LINE . . . . 30

10 Coverage on predicted LINEs . . . . 32

11 Verification procedure . . . . 33

(12)

List of abbreviations

BED - Browser Extensible Data (format) bp - Base Pairs

chr - Chromosome

DNA - Deoxyribonucleic acid DSB - Double strand break kb - kilo bases (1000 bases)

LINE - Long Interspersed Nuclear Element NGS - Next Generation Sequencing

ORF - Open reading frame PC - Prediction cluster

PCR - Polymerase chain reaction RNA - Ribonucleic acid

RT - Retrotransposon

SOLiD - Sequencing by Oligonucleotide Ligation and Detection TE - Transposable element

TPRT - Target primed reverse transcription UCSC - University of California, Santa Cruz

(13)

1 INTRODUCTION

1 Introduction

1.1 Transposable elements

Transposable elements are DNA-sequences that are capable of retransposition by copying their sequences and inserting them in new places in the genome.

The retranspositioning of the elements (or sequences) occurs in several dif- ferent ways depending on the type of element. These sequences represent a major contributing factor to the genetic variation in humans and primates.

There has recently been a renewed interest in these elements in the field of genetic research since it has been discovered that they make up almost 50%

of the genomic content in humans and chimpanzees [22]. This is significant, in particular when taking into consideration that protein coding regions only make up approximately 1.5% of the human genome. These elements exist in almost all living organisms, although the number and types of transposable elements varies greatly between species.

Transposable elements have existed in the genome of many species for mil- lions of years and have, during that time, contributed to the evolution in a number of different ways by for instance increasing the genome size and by altering gene functions and expression [20]. It has also been suggested that these elements can be associated with different diseases, including cancer.

Despite the fact that the transposable elements make up such a large part of the hominid genomes, the elements that are presently able to actively trans- pose are quite few. Most of the elements are considered to be inactive, as they have lost their ability to retranspose because of truncations and mutations.

1.2 Goals

This thesis project focuses on a special group of transposable elements called Long interspersed nuclear elements (LINEs) and the identification and an- notation of these elements in the chimpanzee genome. One of the main goals is to identify elements or regions that have not yet been located or annotated in the reference sequence, with the help of DNA sequence data from whole- genome sequencing of two chimpanzee individuals (see Section 3.1). The approach was to use well annotated elements in humans and investigate if these can be found in the newly sequenced chimpanzee genome, even though they are not yet annotated in the chimpanzee reference sequence.

(14)

1.2 Goals 1 INTRODUCTION

The method described can also be used to fill the gaps in the chimpanzee reference sequence and supply information on poorly annotated elements i.e.

those that are only partially annotated or incorrectly annotated.

This project will help us to achieve a better understanding of how humans and chimpanzees differ from each other when it comes to LINEs, which ultimately could lead to a deeper understanding of the function of these elements. By improving the chimpanzee reference genome, the project can also contribute to the overall understanding of what it is that makes us human despite the fact that our DNA is very similar to the chimpanzee.

(15)

2 BACKGROUND

2 Background

2.1 Transposons

2.1.1 Introduction to transposons

Transposable elements can be divided into two main groups; DNA trans- posons and retrotransposons (RTs). Both of these can in turn be further classified into several sub-groups based on similarity and evolutionary age [18]. The main difference between these different types of transposons is the way in which they retranspose in the genome. DNA transposons make use of a technique that is referred to as ”cut-and-paste” [22]. This is a complex pro- cess in which the element excises itself from its original position and moves to another location in the genome. Retrotransposons on the other hand use a ”copy-and-paste” mechanism. They make use of existing cellular RNA- polymerase to transcribe their DNA, with the starting copy still remaining intact at the original position.

Table 1: The different classes of retrotransposons Class Fraction of genome Type

LINEs 17 % Autonomous

SINEs 11 % Non-Autonomous

LTR 8 % Non-Autonomous

RTs can also be divided into two different classes: autonomous and non- autonomous retrotransposons (see Table1). The difference between the two is that autonomous RTs are themselves coding for the proteins required for the transposition while the non-autonomous ones are not. This leads to a scenario in which the autonomous RTs can carry out their own functions inde- pendent of the host cell while the non-autonomous elements need to rely upon the polymerases of the host cell and the enzymes from other retrotransposons.

The main groups among the retrotransposons are called LINEs, SINEs (Short interspersed nuclear elements) and LTR-retrotransposons (Long terminal repeats). Of these, the LINEs represent autonomous RT, while SINEs and LTR-retrotransposons are non-autonomous. The LTR retrotransposons are made up of highly repeated DNA sequences (LTRs) and can be up to 5000

(16)

2.1 Transposons 2 BACKGROUND

bases long. The LTRs can also be found flanking other types of elements such as retroviruses. The group of retroviruses that can be found in humans represent almost exclusively an inactive form and make up about 10% of the genome, while active forms are much more common in other mammals.

2.1.2 LINEs

LINEs are the largest group of retrotransposons found in humans and chim- panzee. These elements make up approximately 21% of the whole genome and the main family of these, called LINE-1 (or L1), can be found at over half a million locations throughout the genomic sequence [24]. Currently there are only about 100 of these that are still active or capable of retrans- position [2]. A full-length L1 element is usually about 6kb long and contains two open reading frames (ORFs) in the sequence; ORF1 and ORF2. ORF2 codes for most of the proteins necessary for retranspositioning of the ele- ment, including target-site alteration (via endonuclease activity) and reverse transcriptase [1]. ORF1 also contributes to the retransposition by coding for proteins involved in binding to nucleic acids [15].

The number of L1 elements in humans and chimpanzees is thought to be quite similar and both human-specific and chimpanzee-specific variants can be found. It has however been suggested that the amplification-rate varies quite a lot between the species [16]. There are also a number of shared vari- ants that can be found in both species. The L1 elements vary greatly in length and that length is linked to the age of the element itself. Because of truncations and mutations of the element throughout time, elements that have existed in the genome for a longer time are the shorter than the younger elements. The annotation of these L1 elements is quite extensive for the hu- man genome while it is incomplete in chimpanzee genome because of the poor quality of the chimpanzee reference assembly (see Section 2.2).

2.1.3 Transposition

LINE elements copy themselves to new locations in the genome and are con- sidered the only type of transposable elements that are autonomous. The integration of the element is a complicated procedure called target-primed reverse transcription [14]. The exact mechanism for this procedure is not fully understood but according to the current model, the DNA is cut at the

(17)

2.1 Transposons 2 BACKGROUND

site of insertion, with the help of endonuclease-proteins encoded by ORF2, creating so called ”sticky ends”. Then transcriptase from ORF2 (together with the cells own polymerase), uses these ends to initiate the synthesis of the new DNA and the element is copied to a new location (see Fig.1).

The locus to which a new LINE is relocated does not have to be close to the original position of the elements or even within the same chromosome.

Even though there are targets sites (or motifs) that have a high propensity for insertions, they can be found all over the genome. The current estimate is that the rate of retrotransposition for L1 elements in germ-line cells is somewhere around 1 in every 20-200 births [10].

2.1.4 Survival of elements

It has been shown that the frequency of small repetitive elements is higher in humans compared to that of other non-human hominids. This suggests that there has been a great increase in the number of these elements during the last 5-6 million years since the divergence of humans and other primates [3].

Since the actual function of the transposable elements still remains unclear, it can be very difficult to understand how and why these elements have sur- vived in the genome over time. If the effect that they have had on evolution had been negative, would they not have been selected against and eventually disappeared from our genomes?

Since this is clearly not the case, we must consider that they may have had a somewhat positive impact on evolution in many different species. A possible explanation for how these elements have survived over time has been proposed in an article by Han et al. (2005). They suggest that the elements have been expressed at low levels and have stayed hidden in the genome for a long period of time [7]. If they had been active or highly expressed, there would be the possibility of negative selection acting upon the LINEs instead. These elements seem to have continued duplicating throughout his- tory and somehow, a small fraction of the copies have been rendered active quite recently in humans.

(18)

2.1 Transposons 2 BACKGROUND

LINE Target

b) a)

c)

d)

e)

End repair Second strand

cDNA synthesis

cDNA synthesis at DSB

Intra cellular strand completion TPRT

RNA LINE

Proteins Cellular proteins

Figure 1: Illustration of L1 integration by RT.

a) The line element transcribes its RNA and initiates the target primed reverse transcription (TPRT). b,c) This is carried out at the double strand break (DSB)

with help of the LINE proteins. d) When the DNA is transcribed, there are a few ways in which cDNA for the second strand is formed. This is mostly done by

the LINE machinery itself but in some cases this is taken care of by the cellular repair-mechanisms [5]. e) The final step in the transpositioning procedure is that

the ends are repaired.

(19)

2.2 The chimpanzee genome 2 BACKGROUND

2.2 The chimpanzee genome

2.2.1 About the chimpanzee genome

It is clear that humans and chimpanzees are quite different when it comes to their appearance. This is also the case at the chromosome-level; there are visible differences between the karyotypes. An example of this is that chim- panzees have 23 pairs of autosomal chromosomes while humans only have 22 pairs. This is because the human chromosome 2 is a product of two chim- panzee chromosomes (2a and 2b) that have been merged [17]. This theory is supported by the evidence of several centromere sequences present on chro- mosome 2 in humans, as well as telomere sequences found in chromosomal locations where they normally are not expected to be found.

In addition, if we look closer at the genomes at sequence-level, humans and chimpanzees are approximately 98,8% identical if only single nucleotide dif- ferences are counted. If deletions and duplications are included, the identity drops to around 95% [23]. This makes chimpanzees the closest living species related to humans and thus means that the chimpanzee genome is extremely suitable for all kinds of studies involving genomic comparison with humans.

2.2.2 The chimpanzee reference sequence

It is possible to retrieve the latest version of the chimpanzee reference- assembly from the UCSC database, currently called PanTro2.1 [9]. Ap- proximately 97% of the genome sequence is present in the assembly made from sequence data with approximately 6X coverage. The sequence data it- self is derived from a chimpanzee named Clint and was assembled in 2006 at Washington University School of Medicine Genome Sequencing Center (GSC) [23]. This data has since been updated with more information and chromosome-specific updates [8].

Problems that are encountered when working with the chimpanzee genome are often related to the quality of the data. The status of the reference as- sembly is currently incomplete and it is poorly annotated compared to the human genome. There are parts of the genome that are not finished with hundreds of thousands of gaps that needs to be filled. This is a problem when it comes to comparisons between the human and chimpanzee genome.

Not only is the reference genome less well annotated when it comes to the

(20)

2.3 Sequencing 2 BACKGROUND

LINE-elements but also when it comes to coding sequences and other ele- ments which can be quite important when looking for gene-altering effects from insertions caused by elements such as LINEs.

2.3 Sequencing

There has always been a high demand for fast and powerful methods that can produce more data at a faster rate. In human genomics, this has been realized by new methods that have been developed called Next-Generation Sequencing (NGS) techniques. They are not only able to produce thousands of sequences simultaneously, but are also very cost effective [21, 6].

2.3.1 NGS Techniques

The most common NGS techniques used today are:

• 454 Pyro sequencing (www.454.com)

• Illumina/Solexa sequencing (www.illumina.com)

• SOLiD sequencing (www.appliedbiosystems.com)

These are all high-throughput techniques and they differ from each other both in the length of sequences produced and in the chemistry of the se- quencing reactions. In this project, the sequence data came from the SOLiD (Sequencing by Oligonucleotide Ligation and Detection) sequencing tech- nique. SOLiD, which was introduced by Applied Biosystems in 2007, pro- duces short sequence reads of 50bp (base pairs). The sequencing reaction can be performed using two different strategies called fragment sequencing and mate-pair sequencing. The approach used to create the data for this project was mate-pair sequencing 1.

2.3.2 Mate-pairs

A mate-pair library is a library of paired reads from the randomly fragmented sample-DNA. The fragments are selected based on size and both ends of the fragment are ligated to an internal adapter with a known, predetermined

1http://www.invitrogen.com/site/us/en/home/Products-and-

Services/Applications/Nucleic-Acid-Amplification-and-Expression-Profiling/DNA- Sequencing/next-generation-sequencing/solid-mate-paired-library-construction-kits.html

(21)

2.3 Sequencing 2 BACKGROUND

distance from each other. This distance is known as the insert size and it provides crucial information for mapping (or aligning) this data to a refer- ence genome (see Section 3.3 for further information).

Mated reads are expected to map to the reference genome at a distance matching the known insert size for the sequencing library. Sometimes how- ever they map with an unexpected distance from each other and there are many events that can cause this. These events can either be deletions, inser- tions or inversions (See Fig.2a). Mate pairs also have a sequencing direction, which can be used to determine the relative positioning of the mates in the pair (See Fig.2b).

Reference Same as reference Insertion Deletion Inversion

Insert Size

(a) Mapping the data to a reference genome can show evidence of insertions, deletions and inversions if the mates map

to the reference with an unexpected distance from each other.

3 5

5’

3’

+R3 +F3

-R3 -F3

(b) The different mates get labelled differently depending on the relative order in the pair (R/F) and also by the direction of the strand (-/+) that they map to in

the mapping reference

Figure 2: Mate-pair mapping basics and mapping orientation

2.3.3 Orphan pairs

Pairs of reads where both mates are uniquely mapped to the reference genome are almost exclusively used in most analysis of sequence data. In some cases only one of the mates maps to the reference genome and there are cases where none of the mates map at all. Pairs where only one mate maps to the reference genome while the other one maps to several (non-unique) locations or none at all, are called orphans (Fig.3b).

(22)

2.3 Sequencing 2 BACKGROUND

||||||||||||||||||||||||||| |||||||||||||||||||||||||||

Reference Genome

+R3 +F3

(a) Mate pair where both mates map to the reference genome with

the expected distance between them.

|||||||||||||||||||||||||||

Reference Genome

+R3

(b) Mate pair where one of the mates map as expected while the other one do not uniquely map to a

position in the reference genome - an orphan.

Figure 3: The definition of an orphan pair compared to a normal mate-pair

(23)

3 MATERIALS & METHODS

3 Materials & Methods

Below is a flowchart showing an outline of the main parts of this project.

The individual parts are explained in detail in this chapter together with the data used for the different analyses.

LINE IDENTIFICATION

LIBRARY CREATION

MAPPING ORPHANS

CALCULATION OF LINES

CLUSTERING OF READS

ANALYSING RESULTS

ANALYSING RESULTS EXPERIMENTAL VERIFICATION

GROUPING CLUSTERS CALCULATING CLUSTER PREDICTIONS OF LINE ELEMENTS

GROUPING MAPPED READS CREATING CONTIGS FOR

FILLING GAPS

FLOWCHART OF THE PROJECT

3.1 Project data

Sequenced DNA from two chimpanzee individuals was used as input data for the project. The DNA was sequenced using Next-Generation Sequencing techniques (NSG) on the SOLiD platform. All sequenced reads were mapped to the current build of the chimpanzee reference sequence produced by the Chimpanzee Genome Sequencing Consortium [23]. The data itself consists of 3 different mate-pair mapping libraries with different insert sizes (see Section 3.3) from two different chimpanzee individuals.

3.1.1 LINE-1 data

The UCSC Table Browser (http://genome.ucsc.edu/) was used to retrieve annotations of LINE elements in both humans and chimpanzees [9]. This online database contains vast amounts of information regarding all the dif- ferent types of already annotated LINEs. This information was filtered in

(24)

3.2 LINE Reference & Strategy 3 MATERIALS & METHODS

a pre-processing pipeline to extract the relevant elements for the project.

These include only the full-length L1 elements (See Table 2). The reason for this is that full-length elements are considered to have been inserted more re- cently in the genome and are younger than the shorter ones. This is of great importance for the project since we aim to find recent differences between humans and chimpanzees.

Table 2: LINE Elements From UCSC Type # of entries # of entries of element for chimpanzee for human

LINEs 1435834 1406996

LINE-1 (L1) 951014 927393

Full-Length L1s 5483 8233

3.2 LINE Reference & Strategy

The first step in the pipeline was the creation of a new reference library - a library containing only the human specific LINEs. Since the chimpanzee reference genome is expected to contain more elements than those that are currently accounted for, we need a strategy for finding additional regions.

As the sequence data cannot be mapped to something that does not exist in the reference genome, the idea of creating a new library with human LINE elements was proposed.

3.2.1 Reference

Trying to identify all of the LINE elements, in particular the elements that were human/chimpanzee specific was the first step in this project. This was quite a complicated procedure, which required data manipulation since we wanted to compare chimpanzee LINE with human data.The human data came from the UCSC database and the version of the reference genome is called HG18. Full-length LINEs were extracted from the raw data and their species-specific coordinates were converted from one reference genome to the other with the use of the liftOver tool (http://genome.ucsc.edu/cgi- bin/hgLiftOver). This made it possible for the different LINEs in humans and chimpanzees to be compared to one another and to later define a set

(25)

3.3 Mate-pair mapping 3 MATERIALS & METHODS

of human-specific LINEs. The presence of these LINEs in the chimpanzee genome was then assessed, despite their absence from the reference genome.

The number of L1-elements in the human reference assembly that have no full-length equivalent in the chimpanzee genome is around 2000. Yet there is a need for some caution here since the L1-elements in the chimpanzee genome are poorly annotated, and fragmented in their sequence content. When all the elements that were considered to be human specific were found, their sequence was retrieved (using BEDTools [19] built-in function for sequence retrieval) and finally a library-file in FASTA format2 was created. This file was now ready to be used as a target when trying to map reads from our chimpanzee data.

3.2.2 Project strategy

The idea behind this reference file was to see if sequenced reads that pre- viously did not map to the chimpanzee reference genome could possibly be mapped to the file created from the human-specific LINEs. If a high num- ber of reads mapped to an element from the human genome that was not annotated in the chimpanzee genome then this would support the idea that this element (or something similar to this element) was in fact present in the chimpanzee genome. This could either be an element that is unique to our data (eg. an insertion of a new LINE) or an element that also exists in the reference individual but has not yet been annotated. With the help of mate- pair sequence data it was also possible to obtain an approximate position for the LINE element.

3.3 Mate-pair mapping

Sequence data used came from the SOLiD mate-pair technique where each run generates a library of reads. These libraries contained all of the se- quenced reads, where each read was 50 bases long and with three different distributions of insert sizes shown in Figure 4. The distribution for the first library had a median of 3kb with a lower bound of 2330 and an upper bound of 3770. The distribution for the second library had a median of 4kb with a lower bound of 3720 and an upper bound of 4940. The third library had

2http://www.ncbi.nlm.nih.gov/blast/fasta.shtml

(26)

3.3 Mate-pair mapping 3 MATERIALS & METHODS

a median of 3kb with a lower bound of 1370 and an upper bound of 4870.

All the bounds for the different libraries represented 2 standard deviations.

Since the size selection is gel-based, this creates a distribution of different insert sizes between pairs of mates thus making it impossible to determine the expected distance between a pair of reads.

0   20000   40000   60000   80000   100000   120000   140000   160000   180000  

2130   2227   2324   2421   2518   2615   2712   2809   2906   3003   3100   3197   3294   3391   3488   3585   3682  

Number  of  reads  

Insert  size  

Library  #1  (3kb)  

(a) Library 1 - median3kb

0   20000   40000   60000   80000   100000   120000  

2720   2817   2914   3011   3108   3205   3302   3399   3496   3593   3690   3787   3884   3981   4078   4175   4272   4369   4466   4563   4660   4757   4854  

Number  of  reads  

Insert  size  

Library  #2  (4kb)  

(b) Library 2 - median 4kb

0   20000   40000   60000   80000   100000   120000  

1370   1523   1676   1829   1982   2135   2288   2441   2594   2747   2900   3053   3206   3359   3512   3665   3818   3971   4124   4277   4430   4583   4736  

Number  of  reads  

Inser  size  

Library  #3  (3kb)  

(c) Library 3 - median 3kb

Figure 4: Insert-size distribution-plots for the 3 different libraries

3.3.1 Orphans

One of the aims of this project was to use orphan pairs to find new LINEs.

For these orphans we know where one mate has mapped to the reference

(27)

3.3 Mate-pair mapping 3 MATERIALS & METHODS

genome. There is also information about the relative positioning of that mate in the pair as well as sequence-information for the other, unmapped mate. These are of great importance when it comes to the later stages of the pipeline.

3.3.2 Read mapping

In order to be able to map the orphans to the new reference library we used the aligner Bowtie [13]. Despite the fact that this is a fast and efficient aligner, the large amount of sequence data makes this procedure computa- tionally demanding. To make this part of the pipeline possible, the computa- tions were performed on resources provided by Swedish National Infrastruc- ture for Computing (SNIC) through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX). The orphans were mapped, one chromosome at a time, to the new library and throughout the whole procedure the information about the pair itself were conserved so that the position i.e. where the mate that already mapped to the reference genome, was still known. This made it easy to always keep an overview of what had mapped and especially where it had mapped.

At this stage filtering of the results was conducted since we were only in- terested in sequences that mapped with as few mismatches as possible. A cut-off was chosen where 2 mismatches, out of the 50 bases that comprised the mates, were allowed.

|||||||||||||||||||||||||||||||||||||||||||||

>>> LINE element >>>

OFFSET

>>> +F3 >>>

Figure 5: An orphan mate mapping to a position, a number of bases from the starting position of a LINE (offset distance). Note the direction and relative

positioning of the mate itself.

(28)

3.4 Identification 3 MATERIALS & METHODS

3.4 Identification

After filtering of the results to obtain only the relevant data, the information for a single pair of reads was:

• Position in the reference genome where the first mate had mapped

• Information of which LINE the second mate had mapped to

• Offset or number of bases on the LINE before the second mate had mapped (See Fig.5)

• Sequence data for both the first and the second mate

By using this information, a diagram with the distribution of mapped mate- pairs on each chromosome could be produced. Although many of the orphans were still unmapped, approximately 14,000 orphan pairs where one mate mapped to the reference genome and the other mate mapped to the LINE reference-library were found. To investigate the presence of a novel LINE element, these pairs had to be clustered together.

3.4.1 LINE calculation

A number of Perl scripts were produced to build up a pipeline for the cal- culation of positions and other useful information for putative LINEs. The basic approach for finding putative novel elements was to use the previously known expected mapping distance between the two mates in a pair and their relative positioning. If one mate mapped to the reference genome on a cer- tain position and the other mate mapped to a certain LINE element, the approximate position for a LINE insertion in the genome could be calculated by using the offset and the distance between the two mates (See Fig.6). This would then be considered as the insertion or starting-point for the LINE.

The length of the element itself still remains unclear as the only evidence of the length is the 48-50 bases from the mapped orphan mate. One assumption is that the length would be somewhat similar to the length of the element in the human genome.

When calculating the position of the LINE based on the mate-pair data, one must remember that the distance between the two mates in a library is

(29)

3.4 Identification 3 MATERIALS & METHODS

Reference Genome LINE element

Offset Insert size 3-4 kb

(a) The two main criteria for calculating the position is the insert size from the library and the offset on the LINE which

the mate maps to.

Reference Genome LINE element

Offset Insert size 3-4 kb Insertion position

(b) The insert position for the LINE equals the insert size from the library minus the offset.

Reference Genome LINE element

LINE element

(c) The calculated position for a LINE element together with its original length.

Figure 6: The procedure of calculating the approximate position where a LINE could be expected in the genome

(30)

3.4 Identification 3 MATERIALS & METHODS

quite uncertain, even though there is a known median distance. It is still possible to determine an estimate of where the detected LINE is positioned.

To be able to determine the precise position, one would need to experimen- tally validate this with help of PCR and Sanger sequencing of the desired region.

There are a few things that can be done to further improve the predictions.

Since each mate-pair will contribute with a certain position for a LINE, the average position for a number of pairs is calculated. More mates support- ing an element could, if they mapped at a lot of different positions on the LINE element, supply more evidence for the actual length and position of the LINE. If they map with a distribution of approximately 6 kb then a full-length LINE could be expected.

3.4.2 Clustering

To further support the validity of elements found, several pairs supporting the existence of the same element at approximately the same position in the genome were required. Searching for clusters of pairs mapped both to the reference genome along the chromosome and to a LINE element could solve this. This was a procedure where rules for defining a cluster had to be de- fined (See Fig.7a). Pairs where the mates aligning to the reference sequence mapped less than 15 kb apart and with the same intra-pair direction were considered to be part of the same single cluster. To remove any unwanted artefacts and to be more accurate about the predictions, only clusters with 3 or more orphans were kept for the second clustering step.

In the second part of the clustering procedure the main idea was to find small adjacent clusters with opposite mapping direction, mapping to the same LINE (Fig.7b). If we imagine the presence of a LINE, these so called single clusters with different mapping direction would then map from differ- ent sides to the LINE (upstream or downstream) on the chromosome. This would give extra support for the LINE itself compared to only having a clus- ter from only one side.

The single clusters were grouped into a number of prediction clusters (PCs).

In each PC the position for the LINE was predicted from both sides. To obtain an approximate LINE position from each PC, an average of the two predictions was calculated. All of these averages positions from the different

(31)

3.4 Identification 3 MATERIALS & METHODS

Chromosome on the reference genome

LINE

LINE

+R3 +F3 +R3

(a) Finding sub-clusters that mapped close to each other on the reference

chromosomes and have the same intra-pair direction.

Chromosome on the reference genome

LINE

LINE

+R3 +F3 +R3

(b) Defining a prediction cluster (PC) as a set of two single clusters mapping

to a certain LINE element and belonging to different direction classes

(R3/F3).

Figure 7: Clustering procedure where mate-pairs are clustered by position and directions in a two step process

PCs resulted in a data-set with 151 LINEs distributed over different chro- mosomes.

3.4.3 Annotation of elements

An important part in the creation of this data-set is the annotation of the elements in it. After the clustering steps, the regions where the predicted LINEs mapped, were investigated. To be able to obtain information about the regions and annotate the LINEs, a list of these elements was compiled in a BED format3. This format is based on genomic coordinates and makes it possible for anyone to upload their own regions to the UCSC Genome Browser. In this browser it is possible to view them alongside other informa- tion about the genome available in the UCSC database. This makes it easy to see if the regions are associated with coding sequences, gaps and other information of interest.

All elements in the data-set were annotated and the region surrounding the position for each LINE was thoroughly investigated. Since we were aiming to find new LINEs, regions where no other elements were already annotated had the highest priority. LINEs predicted in regions where the reference contained small fragments of LINEs was considered interesting as well since

3http://genome.ucsc.edu/FAQ/FAQformat.html#format1

(32)

3.4 Identification 3 MATERIALS & METHODS

there were no full-length elements present. By only selecting elements that were supported by a high number of reads and those who were in close prox- imity to exons , the data set could be narrowed down to focus only on the most interesting regions.

3.4.4 Filling gaps in the reference

The current version of the chimpanzee assembly is known to contain hundreds of thousands of gaps and the coordinates for these can be obtained from the UCSC database. The total length of all the gaps in the reference genome is about 400 million bps. A second, complementary part of this project was to investigate if some of these gaps could be filled with the sequence from the data generated.

160 210

1 51 120 170

30 80

150 200 90 140

70 120 60 110 10 60

160 210

1 51 120 170

30 80 150 200

90 140 70 120

60 110 10 60

ASSEMBLED CONTIG

a)

b)

c)

1 ATCTAGTCAGCTACGATCGATCGCTGACTGATCGACGACTGAC 210

Figure 8: Sequences from mates (a) assembled (b) with help of the information about the predicted position for each mate. A longer coherent contig could then

be deduced from the assembled reads (c) .

From each mate that mapped to the human LINEs, the 50 bps of sequence was known but the position for the mate in the reference genome still re- mained to be determined. Since the position where the mate had mapped to on a certain LINE was known and the expected insert position of that spe- cific element in the chimpanzee genome had been calculated, it was possible to determine the approximate position for each mate. The sequences from the mates could then be assembled into large coherent contigs by using their individual positions as shown in Figure 8.

(33)

3.4 Identification 3 MATERIALS & METHODS

Coordinates for these contigs were then compared to the coordinates of the annotated gaps in the chimpanzee reference to investigate which of these that were located in gap regions. In order to perfom this, BEDTools function for intersections between genomic locations intersectBed [19] was used together with Perl scripts created.

To be able to determine both the exact location and sequence for the contigs located in gap regions, the exact position of the LINEs that these mates had mapped to, need to be experimentally verified since the positions for the mates are based on the insert position of the LINE element itself.

(34)

4 RESULTS

4 Results

By comparing annotated full-length LINEs in human and chimpanzee re- spectively, 1,916 human-specific elements were identified.

Table 3: Species specific full-length L1 elements.

Humans Chimpanzees

Full-length L1 8,233 5,483

Specific full-length L1 1,916 490

4.1 Mapped orphans

Less than 1% of all the raw orphans were mapped to the reference library of LINE sequences that was created. That corresponded to approximately 14,000 orphan-pairs in total (See Table 4). The distribution of these orphans can be seen in Appendix B.

Table 4: Orphans mapped to the reference library using Bowtie Library # of raw orphans # of orphans mapped type (size) from start to reference library

3 kb 210,679,452 4,241

4 kb 180,049,048 4,103

3 kb (new) 221,676,694 5,723

Total 612,405,194 14,067

4.2 Clustering

After the first step in the clustering procedure, orphans were grouped to- gether to create small individual clusters. The resulting number of clusters for each chromosome are illustrated in Table 5. Note that there are some

(35)

4.2 Clustering 4 RESULTS

chromosomes where no clusters can be found.

Table 5: Number of clusters on each chromosome after the first step

Chromosome 1 2a 2b 3 4 5 6 7 8 9 10 11 12

# of clusters 13 9 13 22 15 18 20 22 10 9 7 16 13

Chromosome 13 14 15 16 17 18 19 20 21 22 X Y Total

# of clusters 13 7 5 2 1 6 5 9 0 1 261 0 497

The results from the second step in the clustering procedure is shown in Ta- ble 6. The criteria for this clustering step were that the single clusters in a PC could have a maximum distance of 15 kb between them and different mapping directions (either R3 or F3) to each other. Since each PC repre- sented a predicted LINE, a data-set containing 151 LINEs distributed over the different chromosomes was created. There was a vast difference in the number of reads that supported a specific element and the distribution of these can be seen in Figure 9.

Table 6: Number of clusters on each chromosome after the second clustering step. These clusters of single clusters are called prediction-clusters (PCs).

Chromosome 1 2a 2b 3 4 5 6 7 8 9 10 11 12

# of clusters 4 2 2 8 5 3 4 8 4 0 1 3 3

Chromosome 13 14 15 16 17 18 19 20 21 22 X Y Total

# of clusters 2 2 0 0 0 2 2 2 0 0 94 0 151

For each clustering step, a different amount of orphans was used. To support the prediction of the 497 LINEs from the first step in the clustering, over 14,000 orphans were used. This number dropped to approximately 10,000 after the second step of the clustering procedure. In Table 7 the exact num- bers and the average number of orphans per number of predicted LINEs are shown.

To be able to visualize the results in a more comprehensive way and also to

(36)

4.2 Clustering 4 RESULTS

Table 7: Orphans used in each step of the clustering to predict the different LINEs

# Lines predicted Orphans used Average # of orphans / LINE

Step 1 497 14,067 30

Step 2 151 10,168 70

perform the task of annotating the LINEs, the data-set was converted into a BED-file that could be used together with the UCSC Browser. An example of this is shown in appendix A. Examining and annotating all of the elements resulted in a smaller data-set with approximately 20 LINEs that were not yet annotated, or not annotated as full-length elements, in the chimpanzee reference sequence. These were the elements that were chosen for further ver- ification. In Appendix B (and Table 6) the number of LINEs calculated on each chromosome can be found together with the number of orphans found on the same chromosome.

Distribution of reads/novel element

Number of supporting reads

Frequency

0 100 200 300 400 500 600

01234567

Figure 9: A histogram over the distribution of the total number of reads supporting a predicted LINE element.

(37)

4.3 Filling gaps 4 RESULTS

4.3 Filling gaps

All the reads that mapped to LINEs from the reference library were put to- gether into small contigs after their sequence was obtained. The total length of all these contigs was 116,448 bp (which was 16% of the original length from all the orphan reads) and approximately 45,000 bp were located in gap regions (see Table 8). That corresponds to about 0,01 % of the total number of bases considered as gaps in the chimpanzee reference sequence.

Table 8: The different sets with their total number of base-pairs.

Total length of gaps in the chimpanzee reference 440,962,440 Length of read-contigs located in gaps 45,863 0,1%

Total length of LINE mapped orphan reads 703,350

Length of read-contigs 116,448

16,5%

These contigs assembled from the reads mapping to the LINEs in our library have the advantage of containing sequence data that is not yet annotated in the reference sequence. In some cases they are quite short which makes it difficult to fill an entire gap sequence even though large parts can be as- sembled. This is illustrated in Appendix C where the total bases covered by the contigs for a predicted LINE is shown. In Figure 10 the total number of bases supported by the different reads are shown as well as the supported minimum length. This length is based on the fact that different reads map to different locations on the element. If the distance between two reads map- ping to the same element is 6kb, a full-length element could then be expected.

References

Related documents

I started off with an idea that instead of cnc-mill plywood and get a contoured model I wanted to com- pose the stock myself.. Idid some quick Rhino tests and I liked patterns

You suspect that the icosaeder is not fair - not uniform probability for the different outcomes in a roll - and therefore want to investigate the probability p of having 9 come up in

contented group. Among other things, they are increasingly angry at the president’s failure to prosecute anyone for the Maspero massacre in October 2011. The draft consti-

This study investigates how a group of children of incarcerated parents in India make meaning in their lives and how India Vision Foundation affects their meaning-making process..

It has also shown that by using an autoregressive distributed lagged model one can model the fundamental values for real estate prices with both stationary

Sweden is known to be a highly developed and transparent country (Carlberg, 2008). In addition, it is one of the countries that has the lowest limits of the criteria regarding the

The children in both activity parameter groups experienced the interaction with Romo in many different ways but four additional categories were only detected in the co-creation

Denna åtskillnad som Burroughs gör i sitt brev till Ginsberg finns även i hans romaner Junky och Queer.. I Queer är uppdelningen mellan queers och fags