Transposable elements in the chimpanzee genome

(1)

UPTEC 11 009

Examensarbete 30 hp Februari 2011

Transposable elements in the chimpanzee genome

Johan Stenninger

(2)

(3)

Bioinformatics Engineering Program

Uppsala University School of Engineering

UPTEC X 11 009 Date of issue 2011-02

Author

Johan Stenninger

Title (English)

Transposable elements in the chimpanzee genome

Title (Swedish) Abstract

The genomes of humans and chimpanzees are highly identical. Despite these similarities, there is a limited understanding of the structure and the genetic variation present in chimpanzees. In this project, we focus on characterization of Long Interspersed Nuclear Elements (LINEs), a form of transposable elements that cover a large fraction of the genome in humans and other primates. The aim of the current project was to identify LINEs in the chimpanzee genome that are not annotated in the reference sequence. This was done by using high throughput sequencing data for the chimpanzee genome, in combination with well- annotated, human-specific LINEs. Our results show that there are at least 20 full-length LINEs in the chimpanzee genome that have not yet been annotated in the reference sequence.

Our data also adds approximately 45,000 base pairs of sequence, previously annotated as gap regions in the chimpanzee assembly.

Keywords

Transposable elements, LINE, L1, transposons, SOLiD, Bowtie, orphans, mate-pair, NGS Supervisors

Lars Feuk

Department of immunology, genetics and pathology, Uppsala University Scientific reviewer

Ulf Gyllensten

Department of immunology, genetics and pathology, Uppsala University

Project name Sponsors

Language

English ^Security

ISSN 1401-2138 Classification

Supplementary bibliographical information Pages

46

Biology Education Centre Biomedical Center Husargatan 3 Uppsala Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 471 4687

(4)

(5)

Transposabla element i schimpansgenomet

Johan Stenninger

Populärvetenskaplig sammanfattning

Nästan hälften av vår arvsmassa (vårt genom) består av repetitiva DNA sekvenser med förmåga att förflytta sig i genomet. Den största gruppen av dessa kallas Long interspersed nuclear elements (LINEs) och utgör ca 20 % av genomet. LINEs har under evolutionen bidragit till att utöka genomet genom sin förmåga att kopiera sig till nya platser. Eftersom deras funktion inte ännu är kartlagd och studier har visat på att de är inblandade i vissa sjukdomar som t.ex. cancer är det väldigt intressant att undersöka dessa element. Dessutom är elementen en bidragande faktor till variation människor emellan då uppsättningen av LINEs ser olika ut i olika individer.

Målet med det här projektet var dels att utnyttja den stora likheten mellan schimpans- och människogenomet för att undersöka hur uppsättningen av LINEs skiljer sig mellan de båda arterna samt att försöka hitta element som ännu inte finns beskrivna i schimpansgenomet. Detta har gjorts möjligt genom att utnyttja befintlig data från helgenomsekvensering av två schimpanser framtagna på Rudbecklaboratoriet i samarbete mellan Lars Feuks forskargrupp och Uppsala genomcenter. Ett annat mål var att försöka förbättra den genomsekvens som idag finns för schimpansen för att på så sätt underlätta framtida jämförelser mellan de båda arterna.

Det här projektet bidrar till en bättre förståelse för hur vi skiljer oss från schimpansen när det gäller LINEs, något som kan hjälpa oss att förstå vilken funktion dessa element har. På lång sikt kan detta projekt också bidra till att klargöra ytterligare vad det är som skiljer oss människor ifrån schimpanserna, och om LINEs bidrar till människospecifika egenskaper, t ex mottaglighet för vissa infektioner, utveckling av vissa neurodegenerativa sjukdomar eller vår förmåga att kommunicera med tal.

Examensarbete 30hp Civilingenjörsprogrammet Bioinformatik Uppsala Universitet februari 2011

(6)

(7)

Abstract

The genomes of humans and chimpanzees are highly similar. De- spite these similarities, there is a limited understanding of the structure and the genetic variation present in chimpanzees. In this project, we focus on characterization of Long Interspersed Nuclear Elements (LINEs), a form of transposable elements that cover a large fraction of the genome in humans and other primates. The aim of the current project was to identify LINEs in the chimpanzee genome that are not annotated in the reference sequence. This was done by using high throughput sequencing data for the chimpanzee genome, in combination with well annotated, human-specific LINEs. Unmapped mate-pair reads were extracted from previous mappings of chimpanzee sequence data to the reference genome. These unmapped reads, called orphans, were mapped to a library of human LINEs to investigate the presence of these in the chimpanzee genome.

In a second step the mate-pairs were anchored to the reference assembly and orphan reads were used to fill gaps in the reference sequence.

Our results show that there are at least 20 full-length LINEs in the chimpanzee genome that have not yet been annotated in the reference sequence. Our data also adds approximately 45,000 base pairs of sequence, previously annotated as gap regions in the chimpanzee assembly. We conclude that the non-unique portions of the genome are very poorly annotated in the chimpanzee, and our results will help towards creating a better chimpanzee reference sequence.

(8)

(9)

CONTENTS CONTENTS

List of Tables

1 The different classes of retrotransposons . . . . 9

2 LINE Elements From UCSC . . . . 18

3 Human specific LINEs . . . . 28

4 Orphans mapped to the reference library using Bowtie . . . . 28

5 Single clusters . . . . 29

6 Double clusters . . . . 29

7 Orphans used . . . . 30

8 Gap statistics . . . . 31

List of Figures

1 Illustration of L1 integration by RT . . . . 12

2 Mate pair mapping . . . . 15

3 Orphan pairs . . . . 16

4 Library distributions . . . . 20

5 Mapping explained . . . . 21

6 Calculation of LINE positions . . . . 23

7 Clustering of mate-pairs . . . . 25

8 Creation of contigs . . . . 26

9 Distribution of reads/predicted LINE . . . . 30

10 Coverage on predicted LINEs . . . . 32

11 Verification procedure . . . . 33

(12)

List of abbreviations

BED - Browser Extensible Data (format) bp - Base Pairs

chr - Chromosome

DNA - Deoxyribonucleic acid DSB - Double strand break kb - kilo bases (1000 bases)

LINE - Long Interspersed Nuclear Element NGS - Next Generation Sequencing

ORF - Open reading frame PC - Prediction cluster

PCR - Polymerase chain reaction RNA - Ribonucleic acid

RT - Retrotransposon

SOLiD - Sequencing by Oligonucleotide Ligation and Detection TE - Transposable element

TPRT - Target primed reverse transcription UCSC - University of California, Santa Cruz

(13)

1 INTRODUCTION

1 Introduction

1.1 Transposable elements

Transposable elements are DNA-sequences that are capable of retransposition by copying their sequences and inserting them in new places in the genome.

The retranspositioning of the elements (or sequences) occurs in several different ways depending on the type of element. These sequences represent a major contributing factor to the genetic variation in humans and primates.

There has recently been a renewed interest in these elements in the field of genetic research since it has been discovered that they make up almost 50%

of the genomic content in humans and chimpanzees [22]. This is significant, in particular when taking into consideration that protein coding regions only make up approximately 1.5% of the human genome. These elements exist in almost all living organisms, although the number and types of transposable elements varies greatly between species.

Transposable elements have existed in the genome of many species for mil- lions of years and have, during that time, contributed to the evolution in a number of different ways by for instance increasing the genome size and by altering gene functions and expression [20]. It has also been suggested that these elements can be associated with different diseases, including cancer.

Despite the fact that the transposable elements make up such a large part of the hominid genomes, the elements that are presently able to actively trans- pose are quite few. Most of the elements are considered to be inactive, as they have lost their ability to retranspose because of truncations and mutations.

1.2 Goals

This thesis project focuses on a special group of transposable elements called Long interspersed nuclear elements (LINEs) and the identification and annotation of these elements in the chimpanzee genome. One of the main goals is to identify elements or regions that have not yet been located or annotated in the reference sequence, with the help of DNA sequence data from whole- genome sequencing of two chimpanzee individuals (see Section 3.1). The approach was to use well annotated elements in humans and investigate if these can be found in the newly sequenced chimpanzee genome, even though they are not yet annotated in the chimpanzee reference sequence.

(14)

1.2 Goals 1 INTRODUCTION

The method described can also be used to fill the gaps in the chimpanzee reference sequence and supply information on poorly annotated elements i.e.

those that are only partially annotated or incorrectly annotated.

This project will help us to achieve a better understanding of how humans and chimpanzees differ from each other when it comes to LINEs, which ultimately could lead to a deeper understanding of the function of these elements. By improving the chimpanzee reference genome, the project can also contribute to the overall understanding of what it is that makes us human despite the fact that our DNA is very similar to the chimpanzee.

(15)

2 BACKGROUND

2 Background

2.1 Transposons

2.1.1 Introduction to transposons

Transposable elements can be divided into two main groups; DNA transposons and retrotransposons (RTs). Both of these can in turn be further classified into several sub-groups based on similarity and evolutionary age [18]. The main difference between these different types of transposons is the way in which they retranspose in the genome. DNA transposons make use of a technique that is referred to as ”cut-and-paste” [22]. This is a complex process in which the element excises itself from its original position and moves to another location in the genome. Retrotransposons on the other hand use a ”copy-and-paste” mechanism. They make use of existing cellular RNA- polymerase to transcribe their DNA, with the starting copy still remaining intact at the original position.

Table 1: The different classes of retrotransposons Class Fraction of genome Type

LINEs 17 % Autonomous

SINEs 11 % Non-Autonomous

LTR 8 % Non-Autonomous

RTs can also be divided into two different classes: autonomous and non- autonomous retrotransposons (see Table1). The difference between the two is that autonomous RTs are themselves coding for the proteins required for the transposition while the non-autonomous ones are not. This leads to a scenario in which the autonomous RTs can carry out their own functions inde- pendent of the host cell while the non-autonomous elements need to rely upon the polymerases of the host cell and the enzymes from other retrotransposons.

The main groups among the retrotransposons are called LINEs, SINEs (Short interspersed nuclear elements) and LTR-retrotransposons (Long terminal repeats). Of these, the LINEs represent autonomous RT, while SINEs and LTR-retrotransposons are non-autonomous. The LTR retrotransposons are made up of highly repeated DNA sequences (LTRs) and can be up to 5000

(16)

2.1 Transposons 2 BACKGROUND

bases long. The LTRs can also be found flanking other types of elements such as retroviruses. The group of retroviruses that can be found in humans represent almost exclusively an inactive form and make up about 10% of the genome, while active forms are much more common in other mammals.

2.1.2 LINEs

LINEs are the largest group of retrotransposons found in humans and chimpanzee. These elements make up approximately 21% of the whole genome and the main family of these, called LINE-1 (or L1), can be found at over half a million locations throughout the genomic sequence [24]. Currently there are only about 100 of these that are still active or capable of retransposition [2]. A full-length L1 element is usually about 6kb long and contains two open reading frames (ORFs) in the sequence; ORF1 and ORF2. ORF2 codes for most of the proteins necessary for retranspositioning of the element, including target-site alteration (via endonuclease activity) and reverse transcriptase [1]. ORF1 also contributes to the retransposition by coding for proteins involved in binding to nucleic acids [15].

The number of L1 elements in humans and chimpanzees is thought to be quite similar and both human-specific and chimpanzee-specific variants can be found. It has however been suggested that the amplification-rate varies quite a lot between the species [16]. There are also a number of shared variants that can be found in both species. The L1 elements vary greatly in length and that length is linked to the age of the element itself. Because of truncations and mutations of the element throughout time, elements that have existed in the genome for a longer time are the shorter than the younger elements. The annotation of these L1 elements is quite extensive for the human genome while it is incomplete in chimpanzee genome because of the poor quality of the chimpanzee reference assembly (see Section 2.2).

2.1.3 Transposition

LINE elements copy themselves to new locations in the genome and are considered the only type of transposable elements that are autonomous. The integration of the element is a complicated procedure called target-primed reverse transcription [14]. The exact mechanism for this procedure is not fully understood but according to the current model, the DNA is cut at the

(17)

site of insertion, with the help of endonuclease-proteins encoded by ORF2, creating so called ”sticky ends”. Then transcriptase from ORF2 (together with the cells own polymerase), uses these ends to initiate the synthesis of the new DNA and the element is copied to a new location (see Fig.1).

The locus to which a new LINE is relocated does not have to be close to the original position of the elements or even within the same chromosome.

Even though there are targets sites (or motifs) that have a high propensity for insertions, they can be found all over the genome. The current estimate is that the rate of retrotransposition for L1 elements in germ-line cells is somewhere around 1 in every 20-200 births [10].

2.1.4 Survival of elements

It has been shown that the frequency of small repetitive elements is higher in humans compared to that of other non-human hominids. This suggests that there has been a great increase in the number of these elements during the last 5-6 million years since the divergence of humans and other primates [3].

Since the actual function of the transposable elements still remains unclear, it can be very difficult to understand how and why these elements have survived in the genome over time. If the effect that they have had on evolution had been negative, would they not have been selected against and eventually disappeared from our genomes?

Since this is clearly not the case, we must consider that they may have had a somewhat positive impact on evolution in many different species. A possible explanation for how these elements have survived over time has been proposed in an article by Han et al. (2005). They suggest that the elements have been expressed at low levels and have stayed hidden in the genome for a long period of time [7]. If they had been active or highly expressed, there would be the possibility of negative selection acting upon the LINEs instead. These elements seem to have continued duplicating throughout his- tory and somehow, a small fraction of the copies have been rendered active quite recently in humans.

(18)

LINE Target

b) a)

c)

d)

e)

End repair Second strand

cDNA synthesis

cDNA synthesis at DSB

Intra cellular strand completion TPRT

RNA LINE

Proteins Cellular proteins

Figure 1: Illustration of L1 integration by RT.

a) The line element transcribes its RNA and initiates the target primed reverse transcription (TPRT). b,c) This is carried out at the double strand break (DSB)

with help of the LINE proteins. d) When the DNA is transcribed, there are a few ways in which cDNA for the second strand is formed. This is mostly done by

the LINE machinery itself but in some cases this is taken care of by the cellular repair-mechanisms [5]. e) The final step in the transpositioning procedure is that

the ends are repaired.

(19)

2.2 The chimpanzee genome 2 BACKGROUND

2.2 The chimpanzee genome

2.2.1 About the chimpanzee genome

It is clear that humans and chimpanzees are quite different when it comes to their appearance. This is also the case at the chromosome-level; there are visible differences between the karyotypes. An example of this is that chimpanzees have 23 pairs of autosomal chromosomes while humans only have 22 pairs. This is because the human chromosome 2 is a product of two chimpanzee chromosomes (2a and 2b) that have been merged [17]. This theory is supported by the evidence of several centromere sequences present on chromosome 2 in humans, as well as telomere sequences found in chromosomal locations where they normally are not expected to be found.

In addition, if we look closer at the genomes at sequence-level, humans and chimpanzees are approximately 98,8% identical if only single nucleotide differences are counted. If deletions and duplications are included, the identity drops to around 95% [23]. This makes chimpanzees the closest living species related to humans and thus means that the chimpanzee genome is extremely suitable for all kinds of studies involving genomic comparison with humans.

2.2.2 The chimpanzee reference sequence

It is possible to retrieve the latest version of the chimpanzee reference- assembly from the UCSC database, currently called PanTro2.1 [9]. Ap- proximately 97% of the genome sequence is present in the assembly made from sequence data with approximately 6X coverage. The sequence data itself is derived from a chimpanzee named Clint and was assembled in 2006 at Washington University School of Medicine Genome Sequencing Center (GSC) [23]. This data has since been updated with more information and chromosome-specific updates [8].

Problems that are encountered when working with the chimpanzee genome are often related to the quality of the data. The status of the reference assembly is currently incomplete and it is poorly annotated compared to the human genome. There are parts of the genome that are not finished with hundreds of thousands of gaps that needs to be filled. This is a problem when it comes to comparisons between the human and chimpanzee genome.

Not only is the reference genome less well annotated when it comes to the

(20)

2.3 Sequencing 2 BACKGROUND

LINE-elements but also when it comes to coding sequences and other elements which can be quite important when looking for gene-altering effects from insertions caused by elements such as LINEs.

2.3 Sequencing

There has always been a high demand for fast and powerful methods that can produce more data at a faster rate. In human genomics, this has been realized by new methods that have been developed called Next-Generation Sequencing (NGS) techniques. They are not only able to produce thousands of sequences simultaneously, but are also very cost effective [21, 6].

2.3.1 NGS Techniques

The most common NGS techniques used today are:

• 454 Pyro sequencing (www.454.com)

• Illumina/Solexa sequencing (www.illumina.com)

• SOLiD sequencing (www.appliedbiosystems.com)

These are all high-throughput techniques and they differ from each other both in the length of sequences produced and in the chemistry of the sequencing reactions. In this project, the sequence data came from the SOLiD (Sequencing by Oligonucleotide Ligation and Detection) sequencing technique. SOLiD, which was introduced by Applied Biosystems in 2007, pro- duces short sequence reads of 50bp (base pairs). The sequencing reaction can be performed using two different strategies called fragment sequencing and mate-pair sequencing. The approach used to create the data for this project was mate-pair sequencing ¹.

2.3.2 Mate-pairs

A mate-pair library is a library of paired reads from the randomly fragmented sample-DNA. The fragments are selected based on size and both ends of the fragment are ligated to an internal adapter with a known, predetermined

1http://www.invitrogen.com/site/us/en/home/Products-and-

Services/Applications/Nucleic-Acid-Amplification-and-Expression-Profiling/DNA- Sequencing/next-generation-sequencing/solid-mate-paired-library-construction-kits.html

(21)

distance from each other. This distance is known as the insert size and it provides crucial information for mapping (or aligning) this data to a reference genome (see Section 3.3 for further information).

Mated reads are expected to map to the reference genome at a distance matching the known insert size for the sequencing library. Sometimes however they map with an unexpected distance from each other and there are many events that can cause this. These events can either be deletions, insertions or inversions (See Fig.2a). Mate pairs also have a sequencing direction, which can be used to determine the relative positioning of the mates in the pair (See Fig.2b).

Reference Same as reference Insertion Deletion Inversion

Insert Size

(a) Mapping the data to a reference genome can show evidence of insertions, deletions and inversions if the mates map

to the reference with an unexpected distance from each other.

3 5

5’

3’

+R3 +F3

-R3 -F3

(b) The different mates get labelled differently depending on the relative order in the pair (R/F) and also by the direction of the strand (-/+) that they map to in

the mapping reference

Figure 2: Mate-pair mapping basics and mapping orientation

2.3.3 Orphan pairs

Pairs of reads where both mates are uniquely mapped to the reference genome are almost exclusively used in most analysis of sequence data. In some cases only one of the mates maps to the reference genome and there are cases where none of the mates map at all. Pairs where only one mate maps to the reference genome while the other one maps to several (non-unique) locations or none at all, are called orphans (Fig.3b).

(22)

||||||||||||||||||||||||||| |||||||||||||||||||||||||||

Reference Genome

+R3 +F3

(a) Mate pair where both mates map to the reference genome with

the expected distance between them.

|||||||||||||||||||||||||||

Reference Genome

+R3

(b) Mate pair where one of the mates map as expected while the other one do not uniquely map to a

position in the reference genome - an orphan.

Figure 3: The definition of an orphan pair compared to a normal mate-pair

(23)

3 MATERIALS & METHODS

3 Materials & Methods

Below is a flowchart showing an outline of the main parts of this project.

The individual parts are explained in detail in this chapter together with the data used for the different analyses.

LINE IDENTIFICATION

LIBRARY CREATION

MAPPING ORPHANS

CALCULATION OF LINES

CLUSTERING OF READS

ANALYSING RESULTS

ANALYSING RESULTS EXPERIMENTAL VERIFICATION

GROUPING CLUSTERS CALCULATING CLUSTER PREDICTIONS OF LINE ELEMENTS

GROUPING MAPPED READS CREATING CONTIGS FOR

FILLING GAPS

FLOWCHART OF THE PROJECT

3.1 Project data

Sequenced DNA from two chimpanzee individuals was used as input data for the project. The DNA was sequenced using Next-Generation Sequencing techniques (NSG) on the SOLiD platform. All sequenced reads were mapped to the current build of the chimpanzee reference sequence produced by the Chimpanzee Genome Sequencing Consortium [23]. The data itself consists of 3 different mate-pair mapping libraries with different insert sizes (see Section 3.3) from two different chimpanzee individuals.

3.1.1 LINE-1 data

The UCSC Table Browser (http://genome.ucsc.edu/) was used to retrieve annotations of LINE elements in both humans and chimpanzees [9]. This online database contains vast amounts of information regarding all the different types of already annotated LINEs. This information was filtered in

(24)

3.2 LINE Reference & Strategy 3 MATERIALS & METHODS

a pre-processing pipeline to extract the relevant elements for the project.

These include only the full-length L1 elements (See Table 2). The reason for this is that full-length elements are considered to have been inserted more recently in the genome and are younger than the shorter ones. This is of great importance for the project since we aim to find recent differences between humans and chimpanzees.

Table 2: LINE Elements From UCSC Type # of entries # of entries of element for chimpanzee for human

LINEs 1435834 1406996

LINE-1 (L1) 951014 927393

Full-Length L1s 5483 8233

3.2 LINE Reference & Strategy

The first step in the pipeline was the creation of a new reference library - a library containing only the human specific LINEs. Since the chimpanzee reference genome is expected to contain more elements than those that are currently accounted for, we need a strategy for finding additional regions.

As the sequence data cannot be mapped to something that does not exist in the reference genome, the idea of creating a new library with human LINE elements was proposed.

3.2.1 Reference

Trying to identify all of the LINE elements, in particular the elements that were human/chimpanzee specific was the first step in this project. This was quite a complicated procedure, which required data manipulation since we wanted to compare chimpanzee LINE with human data.The human data came from the UCSC database and the version of the reference genome is called HG18. Full-length LINEs were extracted from the raw data and their species-specific coordinates were converted from one reference genome to the other with the use of the liftOver tool (http://genome.ucsc.edu/cgi- bin/hgLiftOver). This made it possible for the different LINEs in humans and chimpanzees to be compared to one another and to later define a set

(25)

3.3 Mate-pair mapping 3 MATERIALS & METHODS

of human-specific LINEs. The presence of these LINEs in the chimpanzee genome was then assessed, despite their absence from the reference genome.

The number of L1-elements in the human reference assembly that have no full-length equivalent in the chimpanzee genome is around 2000. Yet there is a need for some caution here since the L1-elements in the chimpanzee genome are poorly annotated, and fragmented in their sequence content. When all the elements that were considered to be human specific were found, their sequence was retrieved (using BEDTools [19] built-in function for sequence retrieval) and finally a library-file in FASTA format² was created. This file was now ready to be used as a target when trying to map reads from our chimpanzee data.

3.2.2 Project strategy

The idea behind this reference file was to see if sequenced reads that previously did not map to the chimpanzee reference genome could possibly be mapped to the file created from the human-specific LINEs. If a high number of reads mapped to an element from the human genome that was not annotated in the chimpanzee genome then this would support the idea that this element (or something similar to this element) was in fact present in the chimpanzee genome. This could either be an element that is unique to our data (eg. an insertion of a new LINE) or an element that also exists in the reference individual but has not yet been annotated. With the help of mate- pair sequence data it was also possible to obtain an approximate position for the LINE element.

3.3 Mate-pair mapping

Sequence data used came from the SOLiD mate-pair technique where each run generates a library of reads. These libraries contained all of the sequenced reads, where each read was 50 bases long and with three different distributions of insert sizes shown in Figure 4. The distribution for the first library had a median of 3kb with a lower bound of 2330 and an upper bound of 3770. The distribution for the second library had a median of 4kb with a lower bound of 3720 and an upper bound of 4940. The third library had

2http://www.ncbi.nlm.nih.gov/blast/fasta.shtml

(26)

a median of 3kb with a lower bound of 1370 and an upper bound of 4870.

All the bounds for the different libraries represented 2 standard deviations.

Since the size selection is gel-based, this creates a distribution of different insert sizes between pairs of mates thus making it impossible to determine the expected distance between a pair of reads.

0 20000 40000 60000 80000 100000 120000 140000 160000 180000

2130 2227 2324 2421 2518 2615 2712 2809 2906 3003 3100 3197 3294 3391 3488 3585 3682

Number of reads

Insert size

Library #1 (3kb)

(a) Library 1 - median∼3kb

0 20000 40000 60000 80000 100000 120000

2720 2817 2914 3011 3108 3205 3302 3399 3496 3593 3690 3787 3884 3981 4078 4175 4272 4369 4466 4563 4660 4757 4854

Number of reads

Insert size

(b) Library 2 - median^∼ 4kb

0 20000 40000 60000 80000 100000 120000

1370 1523 1676 1829 1982 2135 2288 2441 2594 2747 2900 3053 3206 3359 3512 3665 3818 3971 4124 4277 4430 4583 4736

Number of reads

Inser size

(c) Library 3 - median^∼ 3kb

Figure 4: Insert-size distribution-plots for the 3 different libraries

3.3.1 Orphans

One of the aims of this project was to use orphan pairs to find new LINEs.

For these orphans we know where one mate has mapped to the reference

(27)

genome. There is also information about the relative positioning of that mate in the pair as well as sequence-information for the other, unmapped mate. These are of great importance when it comes to the later stages of the pipeline.

3.3.2 Read mapping

In order to be able to map the orphans to the new reference library we used the aligner Bowtie [13]. Despite the fact that this is a fast and efficient aligner, the large amount of sequence data makes this procedure computa- tionally demanding. To make this part of the pipeline possible, the computa- tions were performed on resources provided by Swedish National Infrastruc- ture for Computing (SNIC) through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX). The orphans were mapped, one chromosome at a time, to the new library and throughout the whole procedure the information about the pair itself were conserved so that the position i.e. where the mate that already mapped to the reference genome, was still known. This made it easy to always keep an overview of what had mapped and especially where it had mapped.

At this stage filtering of the results was conducted since we were only in- terested in sequences that mapped with as few mismatches as possible. A cut-off was chosen where 2 mismatches, out of the 50 bases that comprised the mates, were allowed.

|||||||||||||||||||||||||||||||||||||||||||||

>>> LINE element >>>

OFFSET

>>> +F3 >>>

Figure 5: An orphan mate mapping to a position, a number of bases from the starting position of a LINE (offset distance). Note the direction and relative

positioning of the mate itself.

(28)

3.4 Identification 3 MATERIALS & METHODS

3.4 Identification

After filtering of the results to obtain only the relevant data, the information for a single pair of reads was:

• Position in the reference genome where the first mate had mapped

• Information of which LINE the second mate had mapped to

• Offset or number of bases on the LINE before the second mate had mapped (See Fig.5)

• Sequence data for both the first and the second mate

By using this information, a diagram with the distribution of mapped mate- pairs on each chromosome could be produced. Although many of the orphans were still unmapped, approximately 14,000 orphan pairs where one mate mapped to the reference genome and the other mate mapped to the LINE reference-library were found. To investigate the presence of a novel LINE element, these pairs had to be clustered together.

3.4.1 LINE calculation

A number of Perl scripts were produced to build up a pipeline for the calculation of positions and other useful information for putative LINEs. The basic approach for finding putative novel elements was to use the previously known expected mapping distance between the two mates in a pair and their relative positioning. If one mate mapped to the reference genome on a certain position and the other mate mapped to a certain LINE element, the approximate position for a LINE insertion in the genome could be calculated by using the offset and the distance between the two mates (See Fig.6). This would then be considered as the insertion or starting-point for the LINE.

The length of the element itself still remains unclear as the only evidence of the length is the 48-50 bases from the mapped orphan mate. One assumption is that the length would be somewhat similar to the length of the element in the human genome.

When calculating the position of the LINE based on the mate-pair data, one must remember that the distance between the two mates in a library is

(29)

Reference Genome LINE element

Offset Insert size 3-4 kb

(a) The two main criteria for calculating the position is the insert size from the library and the offset on the LINE which

the mate maps to.

Offset Insert size 3-4 kb Insertion position

(b) The insert position for the LINE equals the insert size from the library minus the offset.

LINE element

(c) The calculated position for a LINE element together with its original length.

Figure 6: The procedure of calculating the approximate position where a LINE could be expected in the genome

(30)

quite uncertain, even though there is a known median distance. It is still possible to determine an estimate of where the detected LINE is positioned.

To be able to determine the precise position, one would need to experimentally validate this with help of PCR and Sanger sequencing of the desired region.

There are a few things that can be done to further improve the predictions.

Since each mate-pair will contribute with a certain position for a LINE, the average position for a number of pairs is calculated. More mates supporting an element could, if they mapped at a lot of different positions on the LINE element, supply more evidence for the actual length and position of the LINE. If they map with a distribution of approximately 6 kb then a full-length LINE could be expected.

3.4.2 Clustering

To further support the validity of elements found, several pairs supporting the existence of the same element at approximately the same position in the genome were required. Searching for clusters of pairs mapped both to the reference genome along the chromosome and to a LINE element could solve this. This was a procedure where rules for defining a cluster had to be de- fined (See Fig.7a). Pairs where the mates aligning to the reference sequence mapped less than 15 kb apart and with the same intra-pair direction were considered to be part of the same single cluster. To remove any unwanted artefacts and to be more accurate about the predictions, only clusters with 3 or more orphans were kept for the second clustering step.

In the second part of the clustering procedure the main idea was to find small adjacent clusters with opposite mapping direction, mapping to the same LINE (Fig.7b). If we imagine the presence of a LINE, these so called single clusters with different mapping direction would then map from different sides to the LINE (upstream or downstream) on the chromosome. This would give extra support for the LINE itself compared to only having a cluster from only one side.

The single clusters were grouped into a number of prediction clusters (PCs).

In each PC the position for the LINE was predicted from both sides. To obtain an approximate LINE position from each PC, an average of the two predictions was calculated. All of these averages positions from the different

(31)

Chromosome on the reference genome

LINE

+R3 +F3 +R3

(a) Finding sub-clusters that mapped close to each other on the reference

chromosomes and have the same intra-pair direction.

Chromosome on the reference genome

LINE

+R3 +F3 +R3

(b) Defining a prediction cluster (PC) as a set of two single clusters mapping

to a certain LINE element and belonging to different direction classes

(R3/F3).

Figure 7: Clustering procedure where mate-pairs are clustered by position and directions in a two step process

PCs resulted in a data-set with 151 LINEs distributed over different chromosomes.

3.4.3 Annotation of elements

An important part in the creation of this data-set is the annotation of the elements in it. After the clustering steps, the regions where the predicted LINEs mapped, were investigated. To be able to obtain information about the regions and annotate the LINEs, a list of these elements was compiled in a BED format³. This format is based on genomic coordinates and makes it possible for anyone to upload their own regions to the UCSC Genome Browser. In this browser it is possible to view them alongside other information about the genome available in the UCSC database. This makes it easy to see if the regions are associated with coding sequences, gaps and other information of interest.

All elements in the data-set were annotated and the region surrounding the position for each LINE was thoroughly investigated. Since we were aiming to find new LINEs, regions where no other elements were already annotated had the highest priority. LINEs predicted in regions where the reference contained small fragments of LINEs was considered interesting as well since

3http://genome.ucsc.edu/FAQ/FAQformat.html#format1

(32)

there were no full-length elements present. By only selecting elements that were supported by a high number of reads and those who were in close prox- imity to exons , the data set could be narrowed down to focus only on the most interesting regions.

3.4.4 Filling gaps in the reference

The current version of the chimpanzee assembly is known to contain hundreds of thousands of gaps and the coordinates for these can be obtained from the UCSC database. The total length of all the gaps in the reference genome is about 400 million bps. A second, complementary part of this project was to investigate if some of these gaps could be filled with the sequence from the data generated.

160 210

1 51 120 170

30 80

150 200 90 140

70 120 60 110 10 60

160 210

1 51 120 170

30 80 150 200

90 140 70 120

60 110 10 60

ASSEMBLED CONTIG

a)

b)

c)

1 ATCTAGTCAGCTACGATCGATCGCTGACTGATCGACGACTGAC 210

Figure 8: Sequences from mates (a) assembled (b) with help of the information about the predicted position for each mate. A longer coherent contig could then

be deduced from the assembled reads (c) .

From each mate that mapped to the human LINEs, the 50 bps of sequence was known but the position for the mate in the reference genome still re- mained to be determined. Since the position where the mate had mapped to on a certain LINE was known and the expected insert position of that specific element in the chimpanzee genome had been calculated, it was possible to determine the approximate position for each mate. The sequences from the mates could then be assembled into large coherent contigs by using their individual positions as shown in Figure 8.

(33)

Coordinates for these contigs were then compared to the coordinates of the annotated gaps in the chimpanzee reference to investigate which of these that were located in gap regions. In order to perfom this, BEDTools function for intersections between genomic locations intersectBed [19] was used together with Perl scripts created.

To be able to determine both the exact location and sequence for the contigs located in gap regions, the exact position of the LINEs that these mates had mapped to, need to be experimentally verified since the positions for the mates are based on the insert position of the LINE element itself.

(34)

4 RESULTS

4 Results

By comparing annotated full-length LINEs in human and chimpanzee re- spectively, 1,916 human-specific elements were identified.

Table 3: Species specific full-length L1 elements.

Humans Chimpanzees

Full-length L1 8,233 5,483

Specific full-length L1 1,916 490

4.1 Mapped orphans

Less than 1% of all the raw orphans were mapped to the reference library of LINE sequences that was created. That corresponded to approximately 14,000 orphan-pairs in total (See Table 4). The distribution of these orphans can be seen in Appendix B.

Table 4: Orphans mapped to the reference library using Bowtie Library # of raw orphans # of orphans mapped type (size) from start to reference library

3 kb 210,679,452 4,241

4 kb 180,049,048 4,103

3 kb (new) 221,676,694 5,723

Total 612,405,194 14,067

4.2 Clustering

After the first step in the clustering procedure, orphans were grouped together to create small individual clusters. The resulting number of clusters for each chromosome are illustrated in Table 5. Note that there are some

(35)

4.2 Clustering 4 RESULTS

chromosomes where no clusters can be found.

Table 5: Number of clusters on each chromosome after the first step

Chromosome 1 2a 2b 3 4 5 6 7 8 9 10 11 12

# of clusters 13 9 13 22 15 18 20 22 10 9 7 16 13

Chromosome 13 14 15 16 17 18 19 20 21 22 X Y Total

# of clusters 13 7 5 2 1 6 5 9 0 1 261 0 497

The results from the second step in the clustering procedure is shown in Ta- ble 6. The criteria for this clustering step were that the single clusters in a PC could have a maximum distance of 15 kb between them and different mapping directions (either R3 or F3) to each other. Since each PC represented a predicted LINE, a data-set containing 151 LINEs distributed over the different chromosomes was created. There was a vast difference in the number of reads that supported a specific element and the distribution of these can be seen in Figure 9.

Table 6: Number of clusters on each chromosome after the second clustering step. These clusters of single clusters are called prediction-clusters (PCs).

Chromosome 1 2a 2b 3 4 5 6 7 8 9 10 11 12

# of clusters 4 2 2 8 5 3 4 8 4 0 1 3 3

Chromosome 13 14 15 16 17 18 19 20 21 22 X Y Total

# of clusters 2 2 0 0 0 2 2 2 0 0 94 0 151

For each clustering step, a different amount of orphans was used. To support the prediction of the 497 LINEs from the first step in the clustering, over 14,000 orphans were used. This number dropped to approximately 10,000 after the second step of the clustering procedure. In Table 7 the exact num- bers and the average number of orphans per number of predicted LINEs are shown.

To be able to visualize the results in a more comprehensive way and also to

(36)

4.2 Clustering 4 RESULTS

Table 7: Orphans used in each step of the clustering to predict the different LINEs

# Lines predicted Orphans used Average # of orphans / LINE

Step 1 497 14,067 ∼30

Step 2 151 10,168 ∼70

perform the task of annotating the LINEs, the data-set was converted into a BED-file that could be used together with the UCSC Browser. An example of this is shown in appendix A. Examining and annotating all of the elements resulted in a smaller data-set with approximately 20 LINEs that were not yet annotated, or not annotated as full-length elements, in the chimpanzee reference sequence. These were the elements that were chosen for further verification. In Appendix B (and Table 6) the number of LINEs calculated on each chromosome can be found together with the number of orphans found on the same chromosome.

Distribution of reads/novel element

Number of supporting reads

Frequency

0 100 200 300 400 500 600

01234567

Figure 9: A histogram over the distribution of the total number of reads supporting a predicted LINE element.

(37)

4.3 Filling gaps 4 RESULTS

4.3 Filling gaps

All the reads that mapped to LINEs from the reference library were put together into small contigs after their sequence was obtained. The total length of all these contigs was 116,448 bp (which was ^∼16% of the original length from all the orphan reads) and approximately 45,000 bp were located in gap regions (see Table 8). That corresponds to about 0,01 % of the total number of bases considered as gaps in the chimpanzee reference sequence.

Table 8: The different sets with their total number of base-pairs.

Total length of gaps in the chimpanzee reference 440,962,440 Length of read-contigs located in gaps 45,863 0,1%

Total length of LINE mapped orphan reads 703,350

Length of read-contigs 116,448

16,5%

These contigs assembled from the reads mapping to the LINEs in our library have the advantage of containing sequence data that is not yet annotated in the reference sequence. In some cases they are quite short which makes it difficult to fill an entire gap sequence even though large parts can be assembled. This is illustrated in Appendix C where the total bases covered by the contigs for a predicted LINE is shown. In Figure 10 the total number of bases supported by the different reads are shown as well as the supported minimum length. This length is based on the fact that different reads map to different locations on the element. If the distance between two reads mapping to the same element is 6kb, a full-length element could then be expected.

Transposable elements in the chimpanzee genome