Characterising copy number polymorphisms using next generation sequencing data Zhiwei Li


Degree project in bioinformatics, 2019

Degree project in bioinformatics, 30 credits, for a Master's degree, 2019


Abstract

We developed a pipeline to identify copy number polymorphisms (CNPs) in the Northern Swedish population using whole genome sequencing (WGS) data. Two different methodologies were applied to discover CNPs in more than 1,000 individuals. We also studied the association between the identified CNPs and the expression levels of 438 plasma proteins measured in the same population.

The identified CNPs were filtered and summarized as a population copy number matrix for 1,021 individuals at 243,987 non-overlapping CNP loci. For the 872 individuals with both WGS and plasma protein biomarker data, we conducted linear regression analyses with age and sex as covariates. From these analyses, we detected 382 CNP loci, clustered in 30 collapsed copy number variable regions (CNVRs), that were significantly associated with the levels of 17 plasma protein biomarkers ($p < 4.68 \times 10^{-10}$).


The not so famous genetic variations

Popular Science Summary Zhiwei Li

The human genome is encoded by 4 different nucleotides. Their order forms a genetic programme that codes for our life and phenotype, such as height, eye colour, sex and race. Each nucleotide in the genome corresponds to one base-pair, which is the smallest unit of modification in the genetic programme that could change our phenotype. There are two copies of the genetic programme in our genome, one inherited from the father and one from the mother.

Each copy contains over 3 billion base-pairs of genetic information. The two copies are nearly identical, which sustains critical life functions such as respiration, digestion and growth. But mutations can change the code of our genetic programme at any position and lead to genetic variations, from a single nucleotide up to millions of nucleotides.

The most studied variations are alterations to a single base-pair, known as single nucleotide polymorphisms (SNPs). Some studies estimated that single nucleotide variations account for up to 99% of all genetic variations. There are many databases documenting the effects of single nucleotide variations, with applications in disease risk estimation and ancestry tracing that have been made commercially available by biotechnology companies like 23andMe. However, the remaining variations can cover bigger blocks of the genetic code and cause more severe phenotypic changes. One of the not so famous genetic variations is the copy number polymorphism (CNP), which we focus on in this project. CNPs are deletions or duplications of blocks of the genetic code covering at least 50 base-pairs. Since there are two copies of the genetic programme in our genome, genetic regions not involved in copy number variations have the neutral copy number 2, while deletions result in copy number 1 or 0, and duplications lead to more than two copies.

In this project we focus on detecting, filtering and summarizing the copy number variations across the entire genome in over one thousand people from Northern Sweden, and we investigate whether any of the variations have phenotypic effects on the levels of plasma proteins. Among the 243,987 CNPs we documented, there are 382 independent significant associations with 17 plasma proteins.


Biology Education Centre and Department of Immunology, Genetics and Pathology, Uppsala University
Supervisors: Åsa Johansson, Nima Rafati


Abbreviations

aCGH    array Comparative Genomic Hybridization
CNP     Copy Number Polymorphism
CNV     Copy Number Variation
CNVR    Copy Number Variable Region
DGV     Database of Genomic Variants
NGS     Next Generation Sequencing
NSPHS   Northern Sweden Population Health Study
PEA     Proximity Extension Assay
RD      Read Depth
RP      Read Pair
SNP     Single Nucleotide Polymorphism
SNV     Single Nucleotide Variation
SV      Structural Variation
WGS     Whole Genome Sequencing


1 Introduction

Recent developments in sequencing technologies have enabled unprecedented insights into the genetic contribution to complex human traits such as body mass, height, intelligence and diseases through molecular epidemiology studies. With improved data quality and reduced cost, whole-genome sequencing (WGS) has become more popular in large-scale genomic studies. In addition, the human reference genome from the Human Genome Project (International Human Genome Sequencing Consortium 2004) helps to identify genetic variations and increases the robustness of human genomic studies.

Array-based methods for detecting genetic variations, such as SNP arrays and array comparative genomic hybridization (aCGH), offer high accuracy at a reasonable price and have become the gold standard for genotype-phenotype association studies. However, their observations are limited to already documented variations, which excludes any novel signals. WGS provides a genome-wide view of genetic variation: the sequences are compared to a human reference genome, and the reported variations are used in downstream genotype-phenotype association studies.

In order to develop a standardized human reference genome, the Human Genome Project was initiated in the 1990s. As the largest international biological endeavour in history, this three billion US dollar project finished in 2003, resolving 99% of the human euchromatic genome with an accuracy of 99.99% (Schmutz et al. 2004, International Human Genome Sequencing Consortium 2004). The human reference genome has been continuously developed by the Genome Reference Consortium (GRC), with the latest release, GRCh38.p13, from 2019 (Schneider et al. 2016). In parallel with the improved data reliability and reduced cost of next generation sequencing (NGS), many NGS-based variant discovery algorithms and statistical models were developed to support genotype-phenotype association studies (Cantor et al. 2010, McKenna et al. 2010, Korte & Farlow 2013, Guan & Sung 2016). In addition, high-performance computing provides sufficient computational power to analyse high-throughput data. Nowadays, large-scale genomic projects can include terabytes of multi-omics data from population-based studies.

To identify the phenotypic effects of genetic variations, a genome-wide association study (GWAS) takes a genome-wide set of genetic variations from a population and tests whether they are associated with different traits. The main purpose of a GWAS is to uncover genotype-phenotype associations. The results are not only beneficial to biological and evolutionary studies, but also provide molecular insight for clinical research such as drug development and precision medicine. In GWA studies, the phenotypes can be binary traits, such as case-control groups for diseases, or quantitative traits, such as body mass index (BMI) and expression levels of exons or proteins. Associations between phenotype and genotype are tested with statistical models such as linear and logistic regression.

Genetic variations can be classified by their size, that is, the number of base-pairs (bp) of the genome they cover. Alterations of a single base at a specific position in the genome are single nucleotide variations (SNVs). Insertions or deletions within 50 bp are classified as micro-INDELs (Gonzalez et al. 2007). Variations larger than 50 bp are documented as structural variations (SVs), which include deletions, duplications, insertions, inversions, translocations and complex variations (Alkan et al. 2011). Deletions and duplications are referred to as copy number loss and copy number gain, together called Copy Number Variations (CNVs), and they are more frequent than other SVs (Sudmant et al. 2015). SNVs and CNVs can be further characterized by their frequencies in the population. If more than 1% of the population carries a variation, it is defined as a single nucleotide polymorphism (SNP) or a copy number polymorphism (CNP) (Sebat et al. 2004, Keats & Sherman 2013).

Over 84 million SNPs have been documented in a previous study including 26 populations (The 1000 Genomes Project Consortium 2015). With high coverage, high accuracy and low cost, SNP arrays have been widely applied for variant detection in genetic studies. Detecting SVs poses extra challenges compared to SNPs. Among the array-based methods that provide copy number information, array comparative genomic hybridization (aCGH) has the highest signal-to-noise ratio for detecting CNVs but a low effective resolution of around 15 to 35 kb, while SNP arrays provide a very noisy copy number signal derived from SNP intensities (Ionita-Laza et al. 2009). For NGS-based SV discovery, there is no widely accepted pipeline that detects SVs with a low false positive rate (Guan & Sung 2016). Although the study published by the 1000 Genomes Project Consortium in 2015 classified over 99.9% of the genomic variations as SNPs or short indels, it also emphasized the importance of SVs, as they can involve larger regions. In this project we focus on characterising copy number polymorphisms, which are more frequent than the other SVs, using WGS data collected in a population-based study.

Previous studies have examined the phenotypic impact of CNPs on, for example, evolutionary fitness and embryonic lethality (Beckmann et al. 2007). In clinical genomics, CNPs have been found to be associated with psychiatric disorders (Malhotra & Sebat 2012), Crohn’s disease, type 1 diabetes, and multiple developmental diseases (The Wellcome Trust Case Control Consortium et al. 2010).

A previous study using the Database of Genomic Variants (DGV) (MacDonald et al. 2014) estimated that CNVs cover 4.8 to 9.5% of the genome and observed complete deletions of about 100 genes without apparent phenotypic effects among the CNPs detected by array-based and NGS-based methods in 55 population studies (Zarrei et al. 2015). Another study in 2015 compared its SV discoveries in 2,504 human genomes to DGV and found that 43% of the CNPs were novel (Sudmant et al. 2015). They also observed multiple breakpoints in


high-quality CNPs in the population with two algorithms, then filter and summarize the polymorphisms as a population CNP matrix. The next step is a genome-wide CNP-phenotype association analysis to identify significant phenotypic effects of CNPs.

Figure 1. Project workflow. We aim to use two different methods for CNP calling and report the high-quality variations detected by both methods as one population CNP matrix. The initial data are stored in Binary Alignment Map (BAM) format and the results are stored in Browser Extensible Data (BED) format.

2 Materials and methods

2.1 Genotype and phenotype data

The NSPHS provides 1,021 samples with Illumina short-read paired-end WGS data, of which 872 samples also have data on 438 plasma protein biomarkers. The WGS data had been mapped to GRCh37, deduplicated and saved as binary alignment map (BAM) files, which form the initial genomic data for this project. The 438 quantitative plasma protein biomarkers were measured by Proximity Extension Assay (PEA) using five protein panels (Cardiovascular II, Cardiovascular III, Inflammation, Neurology and Oncology II) provided by Olink Proteomics (https://www.olink.com/products/). Other information includes sex, age, BMI, height and lifestyle for each sample.

2.2 CNP calling methods

For CNP discovery from WGS data, there are over 60 different algorithms with various statistical features for detecting SVs (Guan & Sung 2016, Kosugi et al. 2019), whereas for identifying SNPs there is already a standardized workflow in the Genome Analysis Toolkit (GATK) pipeline (McKenna et al. 2010). There are three main approaches for detecting CNPs using NGS data: read depth (RD), read pair (RP), and local re-assembly. The first two methods have been commonly used for short-read sequencing data, while improvements in long-read sequencing techniques make the local re-assembly method more applicable by providing reads that can span larger structural variants for better alignment to the reference sequence. We decided to detect the CNPs by applying two algorithms, one RD-based and one RP-based, to the Illumina short-read WGS data collected in the NSPHS.

The RD and RP methods detect CNPs from abnormal alignments of paired-end reads to a reference genome (Figure 2). The RD method estimates the copy number of a variation by comparing the local RD to the global RD. Algorithms implementing the RD method for CNP discovery include CNVnator (Abyzov et al. 2011), ERDS (Zhu et al. 2012) and RDXplorer (Yoon et al. 2009). The RP method tracks the insert size and the orientation of paired-end reads to locate the breakpoints of CNPs at base-pair resolution. Manta (Chen et al. 2016), Lumpy (Layer et al. 2014) and Delly (Rausch et al. 2012b) are some of the software tools that apply the RP method for CNP discovery. There are also algorithms that combine the RD and RP methods, such as Genome STRiP (Handsaker et al. 2011).

Figure 2. Two methods of detecting CNPs using NGS data. CNPs can be further classified as deletions or duplications. Although both methods can detect deletions and duplications, only the RD method can estimate the exact copy number of a genomic region by comparing the local RD to the global RD. The RP method detects the break ends of variations by abnormal read orientation and insert size, coloured in purple. This figure was adapted from Rafati’s thesis (Rafati 2017).

2.3 Selection of CNP calling algorithms

Different SV detection algorithms have different sensitivities and false discovery rates. We aimed to include two algorithms using RD or RP methods for CNP discovery to increase the accuracy,


with GATK indel results which had been calculated previously, we decided to focus on larger variations and selected CNVnator 0.3.3, pre-installed on UPPMAX, as the RD-based algorithm.

Regarding the RP-based algorithms, one study observed Lumpy to be particularly sensitive when analysing low allele frequency variants or low coverage data (Tattini et al. 2015), which might not be beneficial given this project’s aim of detecting polymorphisms, that is, variations that are common in the population. The developers of Manta claimed the algorithm to be superior to Delly in terms of recall, accuracy, memory usage and run time (Chen et al. 2016). Therefore Manta 1.5.0, downloaded from GitHub and installed manually, was selected to represent the RP algorithms for CNP calling.

2.4 Implementation of CNVnator and Manta

The 1,021 samples were split into 11 groups to run the variant calling in parallel on the high-performance computing platform UPPMAX-Bianca.

2.4.1 CNVnator

CNVnator requires 5 steps for each sample:

1. Extract the reads from the BAM file and save them as a ROOT (Antcheva et al. 2009) file. ROOT is a C++ framework originally designed for physics data analysis that uses tree objects for data storage.

2. Generate a converted RD histogram at the given bin-size resolution.

3. Calculate the ratio of the mean of the converted bin-RD signal to its standard deviation.

4. Partition the converted RD signal with the mean-shift approach, which originates from image processing (Comaniciu & Meer 2002). For each bin, the mean-shift algorithm groups the neighbouring bins with the most similar RD signal and detects the segment breakpoints.

5. Call CNVs with a one-sample t-test between the regional and global RD signal, merge fragmented CNVs with a two-sample t-test, then score and report the results.

CNVnator requires the bin size as an input parameter for detecting CNPs. The genome is divided into equal-sized, non-overlapping bins, and the base-pair RD signal is converted to a bin-RD signal. In step 4, the bin-RD signal is processed by the mean-shift approach, which was originally developed for edge detection in image analysis. In CNVnator, the mean-shift approach partitions neighbouring bins with the most similar RD signal into one group.

The developer of CNVnator recommends using an optimized bin size for each sample. We integrated steps 2 and 3 into a function that tests for the optimal bin size. For each sample, the optimal bin size is the lowest value among 70, 85, 100, 150, 200 and 250 bp at which the ratio of the converted RD signal’s mean to its standard deviation is between 4 and 5. Keeping the mean of the converted RD signal 4 to 5 times greater than its standard deviation preserves enough statistical power to detect deletions by t-test between the regional and global RD signal, while using the smallest possible bin size gives the highest breakpoint resolution.
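The bin-size selection rule can be sketched in Python. This is a minimal illustration rather than the code used in the project: the function names `mean_to_sd_ratio` and `select_bin_size` and the example ratios are hypothetical, and the ratio for each candidate bin size is assumed to have been computed beforehand from the converted RD signal.

```python
from statistics import mean, stdev

# Candidate bin sizes (bp) tested for each sample, as listed in the text.
CANDIDATE_BIN_SIZES = (70, 85, 100, 150, 200, 250)


def mean_to_sd_ratio(binned_rd):
    """Ratio of the mean of a binned RD signal to its standard deviation."""
    return mean(binned_rd) / stdev(binned_rd)


def select_bin_size(ratio_by_bin_size, low=4.0, high=5.0):
    """Return the smallest candidate bin size whose ratio lies in [low, high].

    Returns None when no candidate bin size satisfies the criterion.
    """
    for bin_size in sorted(CANDIDATE_BIN_SIZES):
        ratio = ratio_by_bin_size.get(bin_size)
        if ratio is not None and low <= ratio <= high:
            return bin_size
    return None


# Hypothetical ratios for one sample (illustrative, not measured values).
example_ratios = {70: 3.2, 85: 3.8, 100: 4.3, 150: 4.9, 200: 5.6, 250: 6.1}
chosen = select_bin_size(example_ratios)
```

With these illustrative ratios, both 100 bp and 150 bp satisfy the criterion, and the smaller one is selected to keep the breakpoint resolution as fine as possible.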


The allocated computing resources are shown in Table 1 below. By default, CNVnator reports variations with an estimated copy number below 1.5 as deletions and above 2.5 as duplications. The results were converted into browser extensible data (BED) format.

Table 1. Allocated computing resources for CNVnator on UPPMAX-Bianca for one sample

Process (CNVnator)             Cores   Estimated time
Step 1: Extracting reads       2       30 min
Step 2 & 3: Bin size testing   1       15 min
Step 4: Partitioning           1       25 min
Step 5: CNV calling            1       < 1 min
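The default CNVnator reporting thresholds mentioned before Table 1 (copy number below 1.5 for deletions, above 2.5 for duplications) amount to a simple classification rule. The sketch below is an illustration only; the function name `classify_copy_number` and the label "neutral" are assumptions made for this example.

```python
def classify_copy_number(copy_number, deletion_below=1.5, duplication_above=2.5):
    """Classify an estimated copy number with the default thresholds from the text."""
    if copy_number < deletion_below:
        return "deletion"
    if copy_number > duplication_above:
        return "duplication"
    return "neutral"


# Estimated copy numbers of the four CNVnator calls listed in Table 2.
labels = [classify_copy_number(cn) for cn in (1.08, 0.01, 3.01, 4.54)]
```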

2.4.2 Manta

Compared to CNVnator, Manta is more automatic and does not provide any variant calling parameters for SVs detection. It automatically calculates the insert size distribution for abnormal read-pair discovery and flags the high read depth regions caused by misalignment.

There are mainly three steps for variant calling:

1. Rapid estimation of the insert size distribution and read depth from the BAM file, which includes stratified sampling of different regions on each chromosome.

2. Construction of break end graphs from high-quality reads with abnormal insert size, read orientation and read depth, followed by merging the single graphs into each segment’s break end graph and de-noising.

3. Generation and scoring of SV hypotheses.

Analysing one sample required around 95 minutes with 8 cores.

Manta saves the SVs in variant call format (VCF), which documents the different SV types (deletions, insertions, inversions, duplications and translocations) and their genotype estimates (GT:0/1 for heterozygous and GT:1/1 for homozygous variations). We selected the deletions and duplications and converted them to BED files to investigate the CNPs reported by Manta.
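Selecting deletions and duplications from a VCF file and converting them to BED coordinates can be sketched as follows. This is an illustrative stdlib-only parser, not the script used in the project: the function name, the example records and the coordinate convention (VCF POS used as the 0-based BED start, INFO END as the BED end) are assumptions, and the sketch relies on each record carrying `SVTYPE` and `END` keys in its INFO field.

```python
def vcf_record_to_bed(line, keep_types=("DEL", "DUP"), require_pass=True):
    """Convert one VCF data line to (chrom, start, end, svtype, genotype).

    Returns None for header lines, for SV types not in keep_types, and
    (optionally) for records whose FILTER field is not PASS.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    chrom, pos, filt, info_text = fields[0], int(fields[1]), fields[6], fields[7]

    # INFO is a ';'-separated list of KEY=VALUE pairs and bare flags.
    info = {}
    for item in info_text.split(";"):
        key, _, value = item.partition("=")
        info[key] = value

    svtype = info.get("SVTYPE")
    if svtype not in keep_types or "END" not in info:
        return None
    if require_pass and filt != "PASS":
        return None

    # Genotype: locate GT among the FORMAT keys of the first sample column.
    genotype = "."
    if len(fields) > 9:
        keys = fields[8].split(":")
        values = fields[9].split(":")
        if "GT" in keys:
            genotype = values[keys.index("GT")]

    # Assumption: POS is the base before the event, so it serves as the
    # 0-based BED start, and END is the 1-based inclusive end.
    return (chrom, pos, int(info["END"]), svtype, genotype)


# Hypothetical records (coordinates are made up for illustration).
deletion = "chr1\t1000\tsv1\tN\t<DEL>\t500\tPASS\tEND=2000;SVTYPE=DEL\tGT:GQ\t0/1:60"
inversion = "chr1\t5000\tsv2\tN\t<INV>\t500\tPASS\tEND=9000;SVTYPE=INV\tGT:GQ\t1/1:60"
bed_rows = [row for row in map(vcf_record_to_bed, (deletion, inversion)) if row]
```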

2.5 Filter the CNPs

Four cases of CNPs detected by both CNVnator and Manta in the same sample are shown in Table 2. As an RD-based algorithm, CNVnator can report the estimated copy number of a variation based on the regional and global RD signal. Manta detects the breakpoints of variations


Table 2. Example of CNPs in the same individual reported by both CNVnator and Manta. The five columns from the left are results from CNVnator and the five columns from the right are results from Manta.

CNVnator                                                         Manta
Start        End          Copy number  P-value (t-test)  q0      Start        End          Event  FT    GT
110,187,200  110,191,195  1.08         3.99e-11          0.011   110,187,162  110,191,421  DEL    PASS  0/1
72,766,375   72,811,850   0.01         3.50e-12          0.275   72,766,322   72,811,840   DEL    PASS  1/1
32,576,080   32,581,860   3.01         0                 0.002   32,575,369   32,581,899   DUP    PASS  0/1
62,281,370   62,309,505   4.54         0                 0.038   62,281,352   62,309,499   DUP    PASS  0/1

CNVnator provides the P-value from a one-sample t-test between the local and global RD signal. In this project, a variation is considered a high-quality detection if it passes the following criteria, where the P-value threshold is Bonferroni-corrected for multiple testing:

• q0 < 0.5

• P-value (t-test) < 0.05/number of samples
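The two criteria can be written as a small predicate. The sketch below is illustrative: the function name `passes_cnvnator_filter` is hypothetical, and the example values are the first CNVnator call in Table 2 evaluated against the 1,021 samples of the cohort.

```python
def passes_cnvnator_filter(p_value, q0, n_samples, alpha=0.05, max_q0=0.5):
    """Apply the q0 criterion and the Bonferroni-corrected t-test criterion."""
    return q0 < max_q0 and p_value < alpha / n_samples


# First CNVnator call in Table 2: P = 3.99e-11, q0 = 0.011.
high_quality = passes_cnvnator_filter(3.99e-11, 0.011, n_samples=1021)
```

With 1,021 samples the corrected P-value threshold is 0.05 / 1,021, roughly 4.9e-5.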

Manta evaluates the variations by q0, mapping quality score (QUAL), genotype score (GQ), abnormal read depth at the breakpoints and inconsistent variant discovery in the same region. It automatically flags high-quality variations as ‘PASS’ in the filter field of the VCF output if they pass the following criteria:

• q0 < 0.4

• QUAL > 20

• GQ > 15

• RD (both breakends) < 3*median chromosome RD

• No inconsistent SVs discoveries on the same region

To compare CNP discovery between CNVnator and Manta, we applied the standard filters recommended by the authors (Abyzov et al. 2011, Chen et al. 2016), as described above. When building the population CNP matrix, however, we applied a less stringent filter that only removes variations with q0 over 0.5, in order to retain weaker evidence of CNPs.

2.6 Compare and summarize CNPs

In the population, variations around the same genomic region can have different breakpoints due to the resolution limit and possible individual mutation events. All CNPs in the population (*.bed) were first extracted and saved in one BED file ($All_CNP_BED). Overlapping (by at least 1 bp) and adjacent variations were then collapsed into copy number variable regions (CNVRs) with the merge function of BEDtools/2.27.1 (Quinlan & Hall 2010), which also reports the number of CNPs merged into each CNVR. The result of the commands below is a population-wide CNVR coordinate list ($All_mergedBED) with the number of variants merged in each CNVR:

# bedtools merge expects input sorted by chromosome and start position
cat *.bed | sort -k1,1 -k2,2n > $All_CNP_BED

bedtools merge -c 1 -o count -i $All_CNP_BED > $All_mergedBED
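The effect of the merge step can be illustrated with a short Python sketch that collapses overlapping and book-ended intervals and counts how many were merged into each region. It mimics the behaviour described above and is not a replacement for BEDtools; the function name `merge_intervals` and the example coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Collapse overlapping or adjacent (chrom, start, end) intervals.

    Returns a list of (chrom, start, end, count), where count is the number
    of input intervals merged into each region.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end), last[3] + 1)
        else:
            merged.append((chrom, start, end, 1))
    return merged


# Hypothetical CNP calls from several individuals (made-up coordinates).
calls = [
    ("chr19", 150, 300),
    ("chr19", 100, 200),
    ("chr19", 300, 400),  # book-ended with the previous call
    ("chr19", 500, 600),
    ("chr2", 10, 20),
]
cnvrs = merge_intervals(calls)
```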

To compare the population-level observations of the two algorithms for copy number gain and loss in terms of size and position, CNPs were sorted into a duplication or a deletion group before being merged into CNVRs. Deletion CNVRs and duplication CNVRs from the two algorithms were used to study the size distribution as well as novel detections. Additionally, copy number gain and loss data were downloaded from DGV to assess novelty. To measure the number of novel detections between the results of the two algorithms and DGV, we used the BEDtools intersect function with the report-absence option (-v) to return the variations in the first file that do not have at least 50% reciprocal overlap with any variation in the second file, for the CNVnator vs Manta, CNVnator vs DGV and Manta vs DGV ($BED_1 vs $BED_2) comparisons:

bedtools intersect -v -r -f 0.5 -a $BED_1 -b $BED_2
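The 50% reciprocal overlap criterion can be made concrete with a small Python sketch. It is only an illustration of the rule: the function names and example regions are hypothetical, and a region in the first list counts as novel when no region in the second list overlaps it by at least half of both lengths.

```python
def reciprocal_overlap(a, b, min_fraction=0.5):
    """True if a and b overlap by at least min_fraction of each interval."""
    if a[0] != b[0]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return (overlap >= min_fraction * (a[2] - a[1])
            and overlap >= min_fraction * (b[2] - b[1]))


def novel_regions(first, second, min_fraction=0.5):
    """Regions in `first` without a reciprocal match in `second`."""
    return [a for a in first
            if not any(reciprocal_overlap(a, b, min_fraction) for b in second)]


# Hypothetical CNVRs (made-up coordinates).
first = [("chr1", 100, 200), ("chr1", 1000, 2000)]
second = [("chr1", 120, 210), ("chr1", 1000, 10000)]
novel = novel_regions(first, second)
```

The second region in `first` lies entirely inside a much larger region in `second`, but covers only about 11% of it, so it is still reported as novel. This is one way in which very large calls can inflate the apparent novelty.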

To build the population CNP matrix, we applied a less stringent filter than in the previous step by only applying the q0 filter to the CNPs. Genotypes of variations that failed the other filters were marked as NA to label low-quality detections. We used the BEDtools merge function to merge all regions with variations into CNVRs. Depending on the resolution of the CNP detection, the CNVRs were fragmented into equal-sized windows ($WindowSize) to generate a list of CNP locus coordinates ($CNP_lociBED) with the BEDtools makewindows function:

bedtools makewindows -w $WindowSize -b $All_mergedBED > $CNP_lociBED

The CNP genotypes of each individual ($individualCNP_BED) were mapped to the CNP loci list ($CNP_lociBED) with the BEDtools intersect function. The genotype of each locus was given by the genotype of the CNP that covers at least 50% of the locus:

bedtools intersect -wao -f 0.5 -a $CNP_lociBED -b $individualCNP_BED > $CNP_matrixBED
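The windowing and genotyping steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the project code: the function names are hypothetical, a locus with no qualifying CNP is assumed to take the neutral copy number 2, and ties between several CNPs covering the same locus are resolved by the lowest q0.

```python
def make_windows(region, window_size):
    """Split (chrom, start, end) into consecutive windows; the last may be shorter."""
    chrom, start, end = region
    return [(chrom, s, min(s + window_size, end))
            for s in range(start, end, window_size)]


def genotype_locus(locus, cnps, min_fraction=0.5, default_copy_number=2.0):
    """Copy number of the CNP covering at least min_fraction of the locus.

    `cnps` holds (chrom, start, end, copy_number, q0) tuples for one individual.
    When several CNPs qualify, the one with the lowest q0 is used.
    """
    chrom, start, end = locus
    best = None
    for c_chrom, c_start, c_end, copy_number, q0 in cnps:
        if c_chrom != chrom:
            continue
        overlap = min(end, c_end) - max(start, c_start)
        if overlap > 0 and overlap >= min_fraction * (end - start):
            if best is None or q0 < best[1]:
                best = (copy_number, q0)
    return best[0] if best else default_copy_number


# Hypothetical CNVR and CNP calls for one individual (made-up values).
loci = make_windows(("chr19", 1000, 1500), window_size=200)
individual = [("chr19", 1100, 1400, 1.0, 0.01), ("chr19", 1150, 1300, 3.0, 0.2)]
genotypes = [genotype_locus(locus, individual) for locus in loci]
```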

For different CNPs mapped to the same locus, we only kept the variation with the lowest q0 value. A locus was removed if over 10% of the individuals had a low-quality genotype, and the filtered matrix was saved as the population CNP matrix. The workflow for


Figure 3. CNP population matrix generation workflow. There are three main steps for building the population CNP matrix, including generation of the CNP loci list and genotyping each individual at each locus. Functions and variables are described above with the command lines. For the downstream CNP-phenotype association analysis, only loci at which at least 90% of the individuals have high-quality variants were kept in the population CNP matrix.
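The locus-level quality filter can be sketched as below. The data layout is an assumption made for the example: the matrix is a dictionary from locus to a list of per-individual copy numbers, with `None` standing in for the NA genotypes of low-quality detections.

```python
def filter_loci(matrix, max_missing_fraction=0.10):
    """Keep loci where at most max_missing_fraction of genotypes are missing."""
    kept = {}
    for locus, genotypes in matrix.items():
        missing = sum(1 for g in genotypes if g is None)
        if missing <= max_missing_fraction * len(genotypes):
            kept[locus] = genotypes
    return kept


# Hypothetical matrix with 10 individuals and two loci (made-up values).
matrix = {
    ("chr19", 1000, 1200): [2.0, 1.0, 2.0, 2.0, None, 2.0, 2.0, 1.0, 2.0, 2.0],
    ("chr19", 1200, 1400): [2.0, None, 2.0, None, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
}
filtered = filter_loci(matrix)
```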

2.7 Association analysis and visualization

We studied the association between CNPs and the levels of protein biomarkers with a linear regression model, using the glm function in R/3.4.3 with age and sex as covariates. The results were filtered by a significance threshold adjusted by Bonferroni correction for multiple tests. To visualize the results, an R script from GitHub (Rafati 2018) based on the ggplot2::geom_bar function (Wickham 2016), as well as the qqman::manhattan function (Turner 2014), was used to plot the P-values from the association analyses on a logarithmic scale.
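The analysis itself was run in R. The sketch below only illustrates the kind of model involved (protein level regressed on copy number with age and sex as covariates) in plain Python. The function names, the toy data and the normal approximation used for the two-sided P-value are assumptions made for this example; with several hundred samples the approximation is close to the t-distribution, but it is not what R reports.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def cnp_association(protein, copy_number, age, sex):
    """Ordinary least squares for protein ~ 1 + copy_number + age + sex.

    Returns (effect of copy number, test statistic, approximate P-value).
    """
    rows = [[1.0, cn, a, s] for cn, a, s in zip(copy_number, age, sex)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, protein)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)

    residuals = [y - sum(b * x for b, x in zip(beta, r))
                 for r, y in zip(rows, protein)]
    sigma2 = sum(e * e for e in residuals) / (len(rows) - k)

    # Column 1 of (X'X)^-1 gives the variance factor of the copy number effect.
    inv_col = solve_linear_system(xtx, [0.0, 1.0, 0.0, 0.0])
    standard_error = math.sqrt(sigma2 * inv_col[1])
    statistic = beta[1] / standard_error
    # Assumption: normal approximation to the t-distribution.
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    return beta[1], statistic, p_value


# Toy data for 12 hypothetical individuals with a true copy number effect of 2.
copy_number = [2, 2, 1, 3, 2, 0, 2, 1, 3, 2, 4, 2]
age = [30, 45, 52, 38, 61, 27, 49, 55, 33, 70, 41, 66]
sex = [0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
noise = [0.05, -0.03, 0.02, -0.04, 0.01, 0.03, -0.02, 0.04, -0.05, 0.02, -0.01, -0.02]
protein = [1.0 + 2.0 * c + 0.1 * a + 0.5 * s + e
           for c, a, s, e in zip(copy_number, age, sex, noise)]
effect, statistic, p_value = cnp_association(protein, copy_number, age, sex)
```

In a genome-wide setting this test would be repeated for every locus and every protein, which is why the significance threshold has to be corrected for the number of tests.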

3 Results

For CNVnator, in the bin size testing step we selected the optimal bin size among 70, 85, 100, 150, 200 and 250 bp for each sample, and the mean of the selected bin sizes is 92 bp. The sizes of the variations reported by CNVnator range from 140 bp to 20 Mb in this population.

Manta reported variations ranging in size from 36 bp to 200 Mb. Although Manta estimates the insert size by sampling each chromosome, we also estimated the insert size distribution of each sample by taking 100 thousand reads starting from the 100 thousandth read of each BAM file. The estimated insert sizes for all samples are around 300 to 400 bp.


Figure 4 shows an example of the CNPs on chromosome 19. The variations have different start and end positions due to independent mutation events. The overlapping and adjacent variations in the figure were merged into one collapsed CNVR, which was used to compare the size distributions and genomic regions of the population-wide results of the two algorithms.

Figure 4. Example of CNPs reported by CNVnator. The variations cover around 8 kb on Chr19. Each line represents the copy number of one individual. Multiple breakpoints indicate individual mutation events. To study the size distribution, variations were first classified as duplications or deletions, then the overlapping and adjacent variations were merged into collapsed copy number variable regions (CNVRs).

The numbers of deletions and duplications detected in the population by the two algorithms are shown in Table 3 below. Manta reported more deletion events than CNVnator at the individual level, but after merging the variations CNVnator reported more deletion regions than Manta.

Table 3. Numbers of deletions and duplications detected by CNVnator and Manta as mean values per sample and as collapsed copy number variable regions (CNVRs) at population level.

                         Deletion   Duplication
CNVnator (per sample)    774        641
Manta (per sample)       4,600      481
CNVnator (CNVRs)         8,579      2,805
Manta (CNVRs)            7,564      1,310


Figure 5. Collapsed CNVRs size distribution. CNVRs larger than 15 kb were pooled in the same bar of the histogram. The histogram is coloured by the 4 different types of CNVRs.

We compared the genomic regions of collapsed CNVRs between CNVnator and Manta results as well as downloaded data from DGV. The percentage of overlapping variations on the chromosomes are shown in Table 4. There are very high levels of novel detection between the two algorithms and the documented variations from DGV.

Table 4. Percentages of overlapping variations between DGV, CNVnator and Manta.

Collapsed CNVRs    CNVnator   Manta
DGV (Gain)         5%         2%
DGV (Loss)         23%        12%
CNVnator (DUP)     NA         5%
CNVnator (DEL)     NA         6%
Manta (DUP)        2%         NA
Manta (DEL)        5%         NA

Only CNPs reported by CNVnator were used to generate the population CNP matrix (see Discussion). The breakpoint resolution of CNVnator is expected to be coarser than the bin size of each sample. Since the mean bin size is 92 bp and the smallest variation reported by CNVnator is 140 bp, we created non-overlapping 200 bp windows across the CNVnator CNVRs to generate the population CNP matrix. We then merged adjacent windows with identical genotypes. The final CNP matrix contains the genotypes of 1,021 individuals at 243,987 CNP loci.


In the CNP-phenotype association analysis with 438 plasma proteins, we detected 17 biomarkers with at least one locus passing the adjusted significance threshold ($p < \frac{0.05}{438 \times 243{,}987} \approx 4.68 \times 10^{-10}$). Figure 6 shows the results of the association analysis. For each protein, only the result with the highest -log10(P) value is shown.
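As a check on the arithmetic, the Bonferroni-adjusted threshold follows directly from the number of tests, 438 proteins times 243,987 CNP loci. The helper name `bonferroni_threshold` below is hypothetical.

```python
def bonferroni_threshold(alpha, n_phenotypes, n_loci):
    """Significance threshold after Bonferroni correction for all tests."""
    return alpha / (n_phenotypes * n_loci)


threshold = bonferroni_threshold(0.05, 438, 243_987)
print(f"{threshold:.2e}")  # prints 4.68e-10
```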

Figure 6. Results of the association analysis with 438 plasma proteins and 243,987 CNPs. For each protein, only the result with the highest -log10(p) value is shown in the plot. Hits with $p < 4.68 \times 10^{-10}$ are labelled with protein names. Colours represent the five different biomarker panels.

For the 17 protein biomarkers with at least one hit passing the adjusted significance threshold, we extracted their association results for all CNP loci to generate a Manhattan plot (Figure 7).

Since the associations for some of the 17 biomarkers cluster in nearby regions on chromosomes 3, 6, 17 and 19, as shown in Figure 6, there are only 9 peaks of significant associations between CNPs and the 17 protein biomarkers in Figure 7. The 9 peaks consist of 382 independent significant associations. Merging the adjacent CNPs gives 30 CNVRs (Appendix Table A1) that are significantly associated with the 17 protein markers.

We also intersected Manta’s CNVRs with the 30 significant CNVRs (Appendix Table A1). For 4 of the 30 significant CNVRs (on chromosomes 1, 5, 5 and 17, respectively), Manta did not report any variations in the whole population. The significant CNVRs on chromosome 2 (12 kb) and chromosome 6 (< 5 kb) intersect with much larger Manta CNVRs of 15 Mb and 116 kb, respectively, after removing the variants over 20 Mb reported by Manta.


Figure 7. Manhattan plot of the CNP-phenotype association results for the 17 protein biomarkers with at least one hit passing the adjusted significance threshold $p < 4.68 \times 10^{-10}$. There are 382 independent significant associations; collapsing the book-ended significant CNPs gives 30 significant CNVRs. The CNPs with at least one hit are highlighted in green.

4 Discussion

Our method of reporting the CNPs in the population summarized the genotypes of 1,021 individuals at 243,987 CNP loci as a population CNP matrix. For the 872 samples with both genotype and phenotype data, we identified 9 peaks consisting of 382 CNPs, clustered in 30 CNVRs, that are significantly associated with 17 protein biomarkers ($p < 4.68 \times 10^{-10}$).

There have not been many studies focusing on CNP genotyping and CNP-phenotype association analysis. A recent study developed a novel method for CNV discovery at the population level with 1,364 individuals and reported 4 significant large CNVs ($p < 1.79 \times 10^{-6}$, resolution: 15 kb) using data on 275 plasma protein biomarkers (Png et al. 2018).

In our study, we included 438 protein biomarkers from 872 individuals, detected CNPs at a much finer resolution of around 92 bp, and reported 382 significant CNPs. It would be interesting to compare our results with theirs when the data become available.

The population cohort used in this project had previously been studied in a SNP-phenotype association study (Enroth et al. 2018), which reported 17,253 SNPs significantly associated with 109 protein biomarkers ($p < 4.79 \times 10^{-9}$) among 10,442,416 SNPs and INDELs. For this population, both the SNP and the CNP association studies found that only a small fraction of the variations, around 0.16%, is significantly associated with biomarkers, but the CNP study found fewer affected protein biomarkers than the SNP study. Since SNPs are spread more broadly across the genome than CNPs, we expect to find more SNPs than CNPs in coding or regulatory regions of different proteins.

CNVnator and Manta have different sensitivities for detecting CNPs of different sizes. We expected the resolution of CNVnator to be around the bin size for each sample, which is around 92 bp. The smallest variation reported by CNVnator in the population is 140 bp, which agrees with our expectation. Manta has base-pair resolution using paired-end reads and split reads, and its developers also claimed that Manta has a higher recall rate than Pindel (Ye et al. 2009) for deletions and insertions between 50 and 500 bp. In Figure 5 we observe that most of the variations reported by Manta are within 1 kb, indicating either a much higher sensitivity or a much higher false discovery rate than CNVnator for variations within 1 kb.

To reduce false positive discoveries, we planned to report only the variations detected by both CNVnator and Manta. However, we found that the sizes and positions of the CNVs detected by the two algorithms differed dramatically. In Table 4, fewer than 10% of the CNVs are reported by both algorithms. Since most of the variations detected by Manta are smaller than 1 kb and possibly overlap with the GATK indel results, we decided to include only the CNVnator results when generating the population CNP matrix and in the downstream CNP-phenotype association analysis.
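The intersection of two call sets described above can be made concrete with a reciprocal-overlap test. The following is a minimal sketch and not the procedure used in this project: the `Call` type, the function names and the 50% threshold are illustrative assumptions. The example intervals are taken from Appendix Table A1.

```python
from typing import List, NamedTuple


class Call(NamedTuple):
    """A variant call as a half-open interval [start, end) on one chromosome."""
    chrom: str
    start: int
    end: int


def overlap_bp(a: Call, b: Call) -> int:
    """Number of base pairs shared by two calls (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def reciprocal_overlap(a: Call, b: Call, min_frac: float = 0.5) -> bool:
    """True if the shared segment covers at least min_frac of BOTH calls.

    min_frac = 0.5 is an assumed threshold, not a value from this project.
    """
    len_a, len_b = a.end - a.start, b.end - b.start
    if len_a <= 0 or len_b <= 0:
        return False
    shared = overlap_bp(a, b)
    return shared >= min_frac * len_a and shared >= min_frac * len_b


def shared_calls(calls_a: List[Call], calls_b: List[Call],
                 min_frac: float = 0.5) -> List[Call]:
    """Calls from calls_a supported by at least one call in calls_b."""
    return [a for a in calls_a
            if any(reciprocal_overlap(a, b, min_frac) for b in calls_b)]


# Intervals from Appendix Table A1 (CNVnator region vs. overlapping Manta CNVR).
cnvnator = [Call("3", 98410600, 98414800), Call("2", 89613000, 89613200)]
manta = [Call("3", 98409900, 98415046), Call("2", 88703946, 103748350)]

concordant = shared_calls(cnvnator, manta)
print(concordant)
```

Under these assumptions, the chromosome 3 region counts as concordant, whereas the 200 bp chromosome 2 region does not, because it covers only a tiny fraction of the 15 Mb Manta CNVR. A plain any-overlap criterion would accept both.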

We compared the 30 significant CNVRs from the CNP-phenotype association analyses with Manta’s CNVRs and found 4 regions in which Manta detected nothing in the population. Manual inspection of the copy numbers of the loci in these 4 regions in the population CNP matrix (see Appendix Table A1) showed that CNVnator reported many duplications there. The discordance may be due to a more stringent filter in Manta, which removes variants whose breakpoint read depth is more than 3 times the global read depth. The significant CNVRs on chromosomes 2 (12 Kb) and 6 (< 5 Kb) intersected with much larger Manta CNVRs (15 Mb and 116 Kb, respectively), even after Manta variations over 20 Mb had been filtered out. A more elaborate size filter is therefore needed, one that still ensures that true signals are not removed.

4.1 Limitation

We did not include other SVs, such as inversions and translocations, in this project, since short-read paired-end WGS data cannot provide sufficient power for detecting these variations.

Without long-read data, we cannot detect the SVs on each haplotype with confidence, especially for duplication events. The normal copy number for a diploid organism is two. A homozygous deletion is reported as 0 copies by CNVnator and GT:1/1 by Manta, and a heterozygous deletion as one copy by CNVnator and GT:0/1 by Manta. Deducing copy numbers from the genotypes reported for duplications is difficult without haplotype information. For the two duplications in the same individual reported by both CNVnator and Manta in Table 2, CNVnator reported 3.0 and 4.5 copies respectively, whereas Manta estimated both duplications to be heterozygous (GT:0/1). Without haplotype information, it is difficult to interpret such results and to conduct heritability testing for variations reported in parent-child trios.
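The asymmetry between deletions and duplications can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names are hypothetical, genotypes are assumed to be diploid VCF-style strings with a single alternate allele, and neither CNVnator nor Manta is called.

```python
def _alt_allele_count(genotype: str) -> int:
    """Count alternate alleles in a diploid VCF-style genotype such as '0/1'."""
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        raise ValueError(f"unsupported genotype: {genotype!r}")
    return alleles.count("1")


def deletion_copy_number(genotype: str) -> int:
    """Copy number implied by a deletion genotype: each deleted allele removes one copy."""
    return 2 - _alt_allele_count(genotype)


def duplication_min_copy_number(genotype: str) -> int:
    """Lower bound on the copy number for a duplication genotype.

    Each duplicated allele carries at least one extra copy, but the genotype
    does not say how many, so only a minimum can be given.
    """
    return 2 + _alt_allele_count(genotype)


print(deletion_copy_number("1/1"))         # homozygous deletion
print(deletion_copy_number("0/1"))         # heterozygous deletion
print(duplication_min_copy_number("0/1"))  # heterozygous duplication
```

For deletions the genotype determines the copy number exactly. For duplications it gives only a lower bound: a GT:0/1 duplication is compatible with 3 copies, but equally with 4 or more if the duplicated allele carries several extra copies, which is why read-depth estimates of 3.0 and 4.5 can both correspond to the same genotype.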

In this project, we focused on the discovery of tandem CNVs based on the alignment of short-read WGS data to the human reference genome. Although the current human reference genomes (GRCh38 and GRCh37) claim to resolve 99% of the euchromatic human genome, a study that constructed de novo assemblies of two Swedish genomes from long-read sequences reported around 10 Mb of novel sequence missing from GRCh38, mainly located in centromeric or telomeric regions (Ameur et al. 2018). Misalignment of reads originating from regions that are unresolved in the current reference can limit the discovery of true signals and lead to false positive discoveries.

4.2 Future work

Owing to time limitations, we only included the CNVs reported by CNVnator when generating the CNP matrix and performing the CNP-phenotype association analysis. Many different algorithms have been developed to identify SVs, each with a different power and false discovery rate. To extend this project, we can include more algorithms and investigate the discordance between them and CNVnator. The available Manta results can be compared with the INDELs that had been detected by the GATK pipeline in this population.

To maximize the sensitivity and accuracy of structural variation discovery, we can include long-read sequencing methods, which cover a longer genomic region in each read and mitigate the amplification biases and unresolved haplotypes of current short-read WGS data. For the 30 significant CNVRs reported in this project, re-sequencing these regions with long-read methods would help to further examine the discoveries and provide haplotype information for heritability testing. In this project, the WGS data had been aligned to GRCh37. We can align the reads to a Swedish reference genome (Ameur et al. 2018) to reduce false positives and detect novel genetic variations.


5 Acknowledgement

I would like to thank my subject reader Adam Ameur for introducing Åsa Johansson’s group to me for doing the degree project. His expertise in human genome and support to me as a master student have been much appreciated.

During the intensive 6 months of project work, my supervisors Åsa Johansson and Nima Rafati have been very generous with the resources and guidance I needed for completing my project smoothly. The other researchers in the group, Weronica Ek, Torgny Karlsson, Mathias Rask-Andersen and Julia Höglund have been equally inspiring to me for pursuing a career in academia. I was very lucky to get to know the nice and talented people at IGP and around SciLifeLab. These memories will last for life.

Thanks to all the classmates and friends I met during my master’s studies, especially to Lauri Mesilaakso as my student opponent and Amanj Bajalan as the only other second year student in our programme. Thank you, Lisa Klasson, for always being supportive as our programme leader and Lena Henriksson, for coordinating this degree project.

I am also grateful to my family for taking care of me, despite being 8,200 km and 6 hours’ time difference away from each other.


6 References

Abyzov A, Urban AE, Snyder M, Gerstein M. 2011. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research 21: 974–984.

Alkan C, Coe BP, Eichler EE. 2011. Genome structural variation discovery and genotyping. Nature Reviews Genetics 12: 363–376.

Ameur A, Che H, Martin M, Bunikis I, Dahlberg J, Höijer I, Häggqvist S, Vezzi F, Nordlund J, Olason P, Feuk L, Gyllensten U. 2018. De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data. Genes 9: 486.

Antcheva I, Ballintijn M, Bellenot B, Biskup M, Brun R, Buncic N, Canal Ph, Casadei D, Couet O, Fine V, Franco L, Ganis G, Gheata A, Maline DG, Goto M, Iwaszkiewicz J, Kreshuk A, Segura DM, Maunder R, Moneta L, Naumann A, Offermann E, Onuchin V, Panacek S, Rademakers F, Russo P, Tadel M. 2009. ROOT — A C++ framework for petabyte data storage, statistical analysis and visualization. Computer Physics Communications 180: 2499–2512.

Beckmann JS, Estivill X, Antonarakis SE. 2007. Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics 8: 639–646.

Cantor RM, Lange K, Sinsheimer JS. 2010. Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application. The American Journal of Human Genetics 86: 6–22.

Chaisson MJP, Sanders AD, Zhao X, Malhotra A, Porubsky D, Rausch T, Gardner EJ, Rodriguez OL, Guo L, Collins RL, Fan X, Wen J, Handsaker RE, Fairley S, Kronenberg ZN, Kong X, Hormozdiari F, Lee D, Wenger AM, Hastie AR, Antaki D, Anantharaman T, Audano PA, Brand H, Cantsilieris S, Cao H, Cerveira E, Chen C, Chen X, Chin C-S, Chong Z, Chuang NT, Lambert CC, Church DM, Clarke L, Farrell A, Flores J, Galeev T, Gorkin DU, Gujral M, Guryev V, Heaton WH, Korlach J, Kumar S, Kwon JY, Lam ET, Lee JE, Lee J, Lee W-P, Lee SP, Li S, Marks P, Viaud-Martinez K, Meiers S, Munson KM, Navarro FCP, Nelson BJ, Nodzak C, Noor A, Kyriazopoulou-Panagiotopoulou S, Pang AWC, Qiu Y, Rosanio G, Ryan M, Stütz A, Spierings DCJ, Ward A, Welch AE, Xiao M, Xu W, Zhang C, Zhu Q, Zheng-Bradley X, Lowy E, Yakneen S, McCarroll S, Jun G, Ding L, Koh CL, Ren B, Flicek P, Chen K, Gerstein MB, Kwok P-Y, Lansdorp PM, Marth GT, Sebat J, Shi X, Bashir A, Ye K, Devine SE, Talkowski ME, Mills RE, Marschall T, Korbel JO, Eichler EE, Lee C. 2019. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nature Communications 10: 1784.


Chen X, Schulz-Trieglaff O, Shaw R, Barnes B, Schlesinger F, Källberg M, Cox AJ, Kruglyak S, Saunders CT. 2016. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32: 1220–1222.

Comaniciu D, Meer P. 2002. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24: 603–619.

Enroth S, Maturi V, Berggrund M, Enroth SB, Moustakas A, Johansson Å, Gyllensten U. 2018. Systemic and specific effects of antihypertensive and lipid-lowering medication on plasma protein biomarkers for cardiovascular diseases. Scientific Reports 8: 5531.

Gonzalez KD, Hill KA, Li K, Li W, Scaringe WA, Wang J-C, Gu D, Sommer SS. 2007. Somatic microindels: analysis in mouse soma and comparison with the human germline. Human Mutation 28: 69–80.

Guan P, Sung W-K. 2016. Structural variation detection using next-generation sequencing data: A comparative technical review. Methods 102: 36–49.

Handsaker RE, Korn JM, Nemesh J, McCarroll SA. 2011. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genetics 43: 269–276.

Igl W, Johansson A, Gyllensten U. 2010. The Northern Swedish Population Health Study (NSPHS)--a paradigmatic study in a rural population combining community health and basic research. Rural and Remote Health 10: 1363.

International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431: 931–945.

Ionita-Laza I, Rogers AJ, Lange C, Raby BA, Lee C. 2009. Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis. Genomics 93: 22–26.

Keats BJB, Sherman SL. 2013. Chapter 13 - Population Genetics. In: Rimoin D, Pyeritz R, Korf B (eds.). Emery and Rimoin’s Principles and Practice of Medical Genetics, pp. 1–12. Academic Press, Oxford.

Korte A, Farlow A. 2013. The advantages and limitations of trait analysis with GWAS: a review. Plant Methods 9: 29.

Kosugi S, Momozawa Y, Liu X, Terao C, Kubo M, Kamatani Y. 2019. Comprehensive


MacDonald JR, Ziman R, Yuen RKC, Feuk L, Scherer SW. 2014. The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Research 42: D986–D992.

Malhotra D, Sebat J. 2012. CNVs: harbingers of a rare variant revolution in psychiatric genetics. Cell 148: 1223–1241.

McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. 2010. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20: 1297–1303.

Png G, Suveges D, Park Y-C, Walter K, Kundu K, Ntalla I, Tsafantakis E, Karaleftheri M, Dedoussis G, Zeggini E, Gilly A. 2018. Population-wide copy number variation calling using variant call format files from 6,898 individuals. bioRxiv 504209.

Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842.

Rafati N. 2017. Exploring genetic diversity in natural and domestic populations through next generation sequencing.

Rafati N. 2018. In this repository you can find scripts useful for plotting and other R related functions: nimarafati/R_scripts.

Rausch T, Jones DTW, Zapatka M, Stütz AM, Zichner T, Weischenfeldt J, Jäger N, Remke M, Shih D, Northcott PA, Pfaff E, Tica J, Wang Q, Massimi L, Witt H, Bender S, Pleier S, Cin H, Hawkins C, Beck C, von Deimling A, Hans V, Brors B, Eils R, Scheurlen W, Blake J, Benes V, Kulozik AE, Witt O, Martin D, Zhang C, Porat R, Merino DM, Wasserman J, Jabado N, Fontebasso A, Bullinger L, Rücker FG, Döhner K, Döhner H, Koster J, Molenaar JJ, Versteeg R, Kool M, Tabori U, Malkin D, Korshunov A, Taylor MD, Lichter P, Pfister SM, Korbel JO. 2012a. Genome Sequencing of Pediatric Medulloblastoma Links Catastrophic DNA Rearrangements with TP53 Mutations. Cell 148: 59–71.

Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. 2012b. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28: i333–i339.

Schmutz J, Wheeler J, Grimwood J, Dickson M, Yang J, Caoile C, Bajorek E, Black S, Chan YM, Denys M, Escobar J, Flowers D, Fotopulos D, Garcia C, Gomez M, Gonzales E, Haydu L, Lopez F, Ramirez L, Retterer J, Rodriguez A, Rogers S, Salazar A, Tsai M, Myers RM. 2004. Quality assessment of the human genome sequence. Nature 429: 365.

Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen H-C, Kitts PA, Murphy TD, Pruitt KD, Thibaud-Nissen F, Albracht D, Fulton RS, Kremitzki M, Magrini V, Markovic C, McGrath S, Steinberg KM, Auger K, Chow W, Collins J, Harden G, Hubbard T, Pelan S, Simpson JT, Threadgold G, Torrance J, Wood J, Clarke L, Koren S, Boitano M, Li H, Chin C-S, Phillippy AM, Durbin R, Wilson RK, Flicek P, Church DM. 2016. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. bioRxiv 072116.

Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M. 2004. Large-Scale Copy Number Polymorphism in the Human Genome. Science 305: 525–528.

Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Hsi-Yang Fritz M, Konkel MK, Malhotra A, Stütz AM, Shi X, Paolo Casale F, Chen J, Hormozdiari F, Dayama G, Chen K, Malig M, Chaisson MJP, Walter K, Meiers S, Kashin S, Garrison E, Auton A, Lam HYK, Jasmine Mu X, Alkan C, Antaki D, Bae T, Cerveira E, Chines P, Chong Z, Clarke L, Dal E, Ding L, Emery S, Fan X, Gujral M, Kahveci F, Kidd JM, Kong Y, Lameijer E-W, McCarthy S, Flicek P, Gibbs RA, Marth G, Mason CE, Menelaou A, Muzny DM, Nelson BJ, Noor A, Parrish NF, Pendleton M, Quitadamo A, Raeder B, Schadt EE, Romanovitch M, Schlattl A, Sebra R, Shabalin AA, Untergasser A, Walker JA, Wang M, Yu F, Zhang C, Zhang J, Zheng-Bradley X, Zhou W, Zichner T, Sebat J, Batzer MA, McCarroll SA, The 1000 Genomes Project Consortium, Mills RE, Gerstein MB, Bashir A, Stegle O, Devine SE, Lee C, Eichler EE, Korbel JO. 2015. An integrated map of structural variation in 2,504 human genomes. Nature 526: 75–81.

Tattini L, D’Aurizio R, Magi A. 2015. Detection of Genomic Structural Variants from Next-Generation Sequencing Data. Frontiers in Bioengineering and Biotechnology, doi 10.3389/fbioe.2015.00092.

The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74.

The Wellcome Trust Case Control Consortium, Craddock N, Hurles ME, Cardin N, Pearson RD, Plagnol V, Robson S, Vukcevic D, Barnes C, Conrad DF, Giannoulatou E, Holmes C, Marchini JL, Stirrups K, Tobin MD, Wain LV, Yau C, Aerts J, Ahmad T, Daniel Andrews T, Arbury H, Attwood A, Auton A, Ball SG, Balmforth AJ, Barrett JC, Barroso I, Barton A, Bennett AJ, Bhaskar S, Blaszczyk K, Bowes J, Brand OJ, Braund PS, Bredin F, Breen G, Brown MJ, Bruce IN, Bull J, Burren OS, Burton J, Byrnes J, Caesar S, Clee CM, Coffey AJ, Connell JMC, Cooper JD, Dominiczak AF, Downes K, Drummond HE, Dudakia D, Dunham A, Ebbs B, Eccles D, Edkins S, Edwards C, Elliot A, Emery P, Evans DM, Evans G, Eyre S, […] CM, Maisuria-Armer M, Maller J, Mansfield J, Martin P, Massey DCO, McArdle WL, McGuffin P, McLay KE, Mentzer A, Mimmack ML, Morgan AE, Morris AP, Mowat C, Myers S, Newman W, Nimmo ER, O’Donovan MC, Onipinla A, Onyiah I, Ovington NR, Owen MJ, Palin K, Parnell K, Pernet D, Perry JRB, Phillips A, Pinto D, Prescott NJ, Prokopenko I, Quail MA, Rafelt S, Rayner NW, Redon R, Reid DM, Renwick A, Ring SM, Robertson N, Russell E, St Clair D, Sambrook JG, Sanderson JD, Schuilenburg H, Scott CE, Scott R, Seal S, Shaw-Hawkins S, Shields BM, Simmonds MJ, Smyth DJ, Somaskantharajah E, Spanova K, Steer S, Stephens J, Stevens HE, Stone MA, Su Z, Symmons DPM, Thompson JR, Thomson W, Travers ME, Turnbull C, Valsesia A, Walker M, Walker NM, Wallace C, Warren-Perry M, Watkins NA, Webster J, Weedon MN, Wilson AG, Woodburn M, Wordsworth BP, Young AH, Zeggini E, Carter NP, Frayling TM, Lee C, McVean G, Munroe PB, Palotie A, Sawcer SJ, Scherer SW, Strachan DP, Tyler-Smith C, Brown MA, Burton PR, Caulfield MJ, Compston A, Farrall M, Gough SCL, Hall AS, Hattersley AT, Hill AVS, Mathew CG, Pembrey M, Satsangi J, Stratton MR, Worthington J, Deloukas P, Duncanson A, Kwiatkowski DP, McCarthy MI, Ouwehand WH, Parkes M, Rahman N, Todd JA, Samani NJ, Donnelly P. 2010. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464: 713–720.

Trost B, Walker S, Wang Z, Thiruvahindrapuram B, MacDonald JR, Sung WWL, Pereira SL, Whitney J, Chan AJS, Pellecchia G, Reuter MS, Lok S, Yuen RKC, Marshall CR, Merico D, Scherer SW. 2018. A Comprehensive Workflow for Read Depth-Based Identification of Copy-Number Variation from Whole-Genome Sequence Data. American Journal of Human Genetics 102: 142–155.

Turner SD. 2014. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. bioRxiv 005165.

Wickham H. 2016. ggplot2: Elegant Graphics for Data Analysis, 2nd ed. Springer International Publishing.

Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. 2009. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics (Oxford, England) 25: 2865–2871.

Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. 2009. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research 19: 1586–1592.

Zarrei M, MacDonald JR, Merico D, Scherer SW. 2015. A copy number variation map of the human genome. Nature Reviews Genetics 16: 172–183.

Zhu M, Need AC, Han Y, Ge D, Maia JM, Zhu Q, Heinzen EL, Cirulli ET, Pelak K, He M, Ruzzo EK, Gumbs C, Singh A, Feng S, Shianna KV, Goldstein DB. 2012. Using ERDS to Infer Copy-Number Variants in High-Coverage Genomes. American Journal of Human Genetics 91: 408–421.


7 Appendix A

Table A1. List of the significant CNVnator copy number variable regions (CNVRs) with the associated biomarkers and the overlapping CNVRs reported by Manta. A value of -1 in the coordinates of a Manta CNVR means that there is no overlap.

| Chr | Start | End | Size | Copy Number >2:2:1:0 | Biomarker | Manta: Start | Manta: End | Manta: Size | Manta: Overlap | CNVnator: Start | CNVnator: End | CNVnator: Size | CNVnator: Overlap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 161640580 | 161642980 | 2400 | 101:742:30:0 | ONC2_194_FCRLB | -1 | -1 | -1 | 0 | 161481980 | 161646625 | 164645 | 2400 |
| 2 | 89613000 | 89613200 | 200 | 36:805:31:1 | ONC2_149_GPNMB | 88703946 | 103748350 | 15044404 | 200 | 89524600 | 90280600 | 756000 | 200 |
| 3 | 98410600 | 98414800 | 4200 | 2:284:378:209 | CVD3_188_ICAM-2: CVD2_184_PD-L2: NEU_109_Siglec-9: NEU_194_CD200R1 | 98409900 | 98415046 | 5146 | 4200 | 98403200 | 98417505 | 14305 | 4200 |
| 3 | 98899100 | 98902100 | 3000 | 12:56:336:469 | ONC2_190_VEGFR-3 | 98899059 | 98902392 | 3333 | 3000 | 98896300 | 98905950 | 9650 | 3000 |
| 5 | 70303300 | 70312500 | 9200 | 144:669:60:0 | CVD2_133_IL-18 | -1 | -1 | -1 | 0 | 70290700 | 70412725 | 122025 | 9200 |
| 5 | 70391300 | 70395300 | 4000 | 137:732:4:0 | CVD2_133_IL-18 | -1 | -1 | -1 | 0 | 70290700 | 70412725 | 122025 | 4000 |
| 6 | 30994100 | 30995100 | 1000 | 76:744:0:53 | ONC2_173_MIC-AB | 30993762 | 30995409 | 1647 | 1000 | 30992700 | 30999800 | 7100 | 1000 |
| 6 | 31337890 | 31341890 | 4000 | 1:796:72:4 | ONC2_173_MIC-AB | 31337856 | 31341970 | 4114 | 4000 | 31337290 | 31343325 | 6035 | 4000 |
| 6 | 32513000 | 32514800 | 1800 | 1:640:191:41 | INF_145_CCL19 | 32442590 | 32558622 | 116032 | 1800 | 32436000 | 32564630 | 128630 | 1800 |
| 6 | 32516400 | 32518800 | 2400 | 41:605:187:40 | INF_145_CCL19 | 32442590 | 32558622 | 116032 | 2400 | 32436000 | 32564630 | 128630 | 2400 |
| 6 | 32520200 | 32522600 | 2400 | 44:601:188:40 | INF_145_CCL19 | 32442590 | 32558622 | 116032 | 2400 | 32436000 | 32564630 | 128630 | 2400 |
| 6 | 32523200 | 32523400 | 200 | 87:497:219:70 | INF_145_CCL19 | 32442590 | 32558622 | 116032 | 200 | 32436000 | 32564630 | 128630 | 200 |
| 6 | 32524800 | 32525000 | 200 | 82:460:248:83 | INF_145_CCL19 | 32442590 | 32558622 | 116032 | 200 | 32436000 | 32564630 | 128630 | 200 |
| 6 | 32526400 | 32527400 | 1000 | 13:605:225:30 | INF_145_CCL19 | 32442590 | 32558622 | 116032 | 1000 | 32436000 | 32564630 | 128630 | 1000 |
| 6 | 32540600 | 32541000 | 400 | 81:583:188:21 | INF_145_CCL19 | 32442590 | 32558622 | 116032 | 400 | 32436000 | 32564630 | 128630 | 400 |
| 6 | 32643000 | 32647800 | 4800 | 1:599:259:14 | INF_145_CCL19 | 32609321 | 32729412 | 120091 | 4800 | 32637200 | 32648840 | 11640 | 4800 |
| 11 | 63442300 | 63446100 | 3800 | 1:781:86:5 | ONC2_188_FR-gamma | 63442151 | 63445937 | 3786 | 3637 | 63439700 | 63447700 | 8000 | 3800 |
| 11 | 67330155 | 67332555 | 2400 | 8:786:76:3 | ONC2_188_FR-gamma | 67330061 | 67332456 | 2395 | 2301 | 67328755 | 67333600 | 4845 | 2400 |
| 16 | 28609845 | 28625245 | 15400 | 203:664:6:0 | INF_191_ST1A1 | 21861483 | 29509970 | 7648487 | 15400 | 28609045 | 28627800 | 18755 | 15400 |
| 17 | 36392670 | 36394670 | 2000 | 800:73:0:0 | INF_130_CCL4 | -1 | -1 | -1 | 0 | 36249270 | 36406860 | 157590 | 2000 |
| 17 | 39203400 | 39207000 | 3600 | 1:814:57:1 | CVD3_171_CCL15 | 39202803 | 39221766 | 18963 | 3600 | 39203200 | 39222500 | 19300 | 3600 |
| 17 | 39207800 | 39211200 | 3400 | 1:814:57:1 | CVD3_171_CCL15 | 39202803 | 39221766 | 18963 | 3400 | 39203200 | 39222500 | 19300 | 3400 |
| 19 | 41381725 | 41387525 | 5800 | 6:837:28:2 | ONC2_151_MIA | 41344152 | 41387879 | 43727 | 5800 | 41339325 | 41393725 | 54400 | 5800 |
| 19 | 51508940 | 51510740 | 1800 | 3:852:17:1 | ONC2_121_hK11 | 51508588 | 51511299 | 2711 | 1800 | 51506940 | 51511445 | 4505 | 1800 |
| 19 | 54555500 | 54560700 | 5200 | 1:609:243:20 | CVD2_186_hOSCAR | 54555339 | 54560900 | 5561 | 5200 | 54554100 | 54562350 | 8250 | 5200 |
