Processing and analysis of chromatin interaction data and numerical fitting of the statistical helix model
Tobias Everhorn
Master’s Thesis in Engineering Physics
Master Programme in Engineering Physics
KTH Royal Institute of Technology, 2016
Supervisor at SciLifeLab: Pelin Sahlén
Examiner at KTH: Jerker Widengren
Abstract
Studying DNA structure is important in many areas of biological science, such as disease treatment. The large-scale structure of DNA is currently considered to be largely explored, while structures that only appear when looking at smaller pieces of the genome (higher resolutions) remain unexplored. In this thesis, data originating from a chromosome conformation capture technique called HiCap is processed and analyzed. HiCap is a biological method, similar to the older chromosome conformation capture technique HiC, that through several steps provides interaction data over the whole genome, with a focus on promoter-enhancer interactions. The HiCap method can resolve smaller parts of the genome than previous methods, which is required to study small structures such as promoter-enhancer sites. This thesis investigates the chromatin structure at 40 kilobase resolution (i.e., in pieces 40 kilobases in size) at promoter-enhancer interaction sites in THP1 cells, human monocytic cells from an acute monocytic leukemia patient. This was done by processing and analyzing the data and then fitting the statistical helix model to interaction data from HiCap experiments, to see whether the model can detect structures in the data and how well it can predict interactions in this kind of data set. The statistical helix model has shown promise in previous work, and here it is tested how well it applies to promoter-enhancer interaction data. The results suggest that the HiCap method has a better chance of detecting the structures of small promoter-enhancer sites than the HiC method. The statistical helix model did not appear able to detect structures in the data, nor is it viable as a predictor of interactions for this kind of data.
Contents
Abstract ... 2
Contents ... 3
1. Introduction ... 5
1.1 Problem formulation ... 5
1.2 Objective... 5
2. Background ... 6
2.1 The cell nucleus and the human genome ... 6
2.2 Basics of the genomic structure ... 6
2.2.1 Promoters and Enhancers ... 6
2.3 Methods to study chromatin conformation ... 7
2.3.1 3C - Chromosome conformation capture methods ... 7
2.3.2 Sequencing ... 7
2.4 Cell lines ... 7
2.5 Statistical Helix model ... 7
2.6 lsqCurvefit in Matlab ... 8
2.7 Jaccard Index ... 8
3. Method ... 9
3.2 Creating the tag directory ... 9
3.3 Generating the interaction matrices ... 9
3.4 Heat maps ... 10
3.5 Finding topologically associating domains ... 10
3.6 Jaccard Index ... 10
3.7 Investigating interactions ... 11
3.7.1 NLPS and WLPS ... 11
3.7.2 Domain interactions ... 11
3.7.3 Expression values ... 11
3.7.4 Random Interactions ... 11
3.7.5 Statistical helix modelling ... 12
3.7.6 Parameter fitting in matlab ... 12
4. Results ... 13
4.1 Heat Maps ... 13
4.2 Topologically associating domains – TADs ... 16
4.3 Jaccard Index ... 16
4.5 Interactions... 17
4.5.1 Domain Interactions ... 18
4.5.2 Expression values ... 19
4.6 Statistical helix model ... 20
4.6.1 Separation distance ... 22
4.6.2 Domain Interactions ... 23
4.6.3 Expression value ... 26
5. Discussion ... 28
5.1 Topologically associating domains ... 28
5.2 Statistical helix model ... 28
6. Conclusions ... 29
7. References ... 30
1. Introduction
1.1 Problem formulation
The study of our genetic structure with chromosome conformation capture techniques has seen huge progress over the last 10 years[1]. Step by step, we can look more closely at chromatin structures, which can provide important information for all fields connected to human genetics. By understanding how genes organize themselves, both at non-active sites with no genes and at active sites such as promoter-enhancer interaction sites, vital progress can be made in e.g. the treatment of certain diseases. This thesis uses data from experiments based on a new conformation capture technique called HiCap[2], which focuses on the interactions at promoter-enhancer interaction sites. Experimental methods prior to HiCap have had significantly worse resolution, meaning that HiCap can look at smaller pieces of the genome. With the advantage of HiCap's resolution, smaller structures such as the promoter-enhancer site can now be explored. By fitting a certain model, the statistical helix model, to the interaction data, new insights into the structure of the DNA can hopefully be gained. In this work I also studied whether the statistical helix model can be used as a better predictor of interactions than the method mostly used today. This matters when determining how reliable an interaction is, i.e., whether it is a real interaction or merely a matter of proximity creating noise in the data.
1.2 Objective
The overall objective of this thesis is to investigate the structure of the DNA at enhancer-promoter interaction sites, partly by analyzing the interaction data and partly by fitting a certain model, the statistical helix model[3], to the data. Interaction data retrieved from a HiCap[2] experiment will be analyzed; the most significant properties of the interactions are the separation distance between two sites, the number of supporting pairs, which reflects how often the interactions occur, and the expression value, which shows how much the genetic features at the interactions are transcribed. Two major data sets retrieved from HiCap experiments will be used: one where the sample had no stimulation and one where it was stimulated with lipopolysaccharide, LPS, to make it more prone to interact. These will be compared to reveal differences between non-stimulated and stimulated samples.
A large part of the work is preparing the raw experimental data for the analysis and modeling. To investigate how HiCap data compares to data from older experimental methods, heat maps will be studied, giving a visual representation of how well defined the topologically associating domains, TADs, are in HiCap data compared to HiC[4] data. The TADs represent active areas where a lot of interactions occur.
The active regions in the genome, the TADs, will then be located using a method in the Homer software[5]. The validity of these TADs is tested by computing the Jaccard index, which measures the overlap between two data sets: the TAD coordinates from Homer and the CCCTC-binding factor coordinates. The CCCTC-binding factor, more commonly called CTCF, is a protein that is present at the interactions between promoters and enhancers. This Jaccard index test will show whether the TADs are present at the CTCF sites. These TADs will then be used when investigating e.g. how many interactions lie inside these active regions and how many lie outside them.
The major point of interest is the statistical helix model[3]. It will be applied to different subsets of the data to see whether it is applicable to all or any of them. Note that the input value used here, supporting pairs, differs from the input the statistical helix model was originally used with. The input used here is a direct measure of interaction frequency, while the input in the original article is an indirect measure of the same thing. The fitting will be made to different data sets, such as all interactions, all interactions with a high expression value, and interactions within a certain separation distance. The results will show whether the statistical helix model is applicable to this kind of data.
2. Background
It is important when reading this thesis to have a basic understanding of how the human genome is structured and of the differences between the different levels of the structure. This section gives a brief summary. In addition, some information is given about the different techniques used to study the genome structure.
2.1 The cell nucleus and the human genome
Every cell in our body has a nucleus containing the human genome, coded by approximately 3.2 billion base pairs organized into a double-stranded string. Each strand contains a copy of our whole DNA. If you stretched out the DNA in one cell, its length would be almost 2 m, and this entire string has to fit inside a nucleus with a diameter of about 6 microns. The packing of the DNA is known not to be random; instead, the DNA and the different chromosomes are folded and organized inside the nucleus in a highly ordered and regulated manner.
The conformation of the chromosomes changes constantly depending on where in the cell cycle the cell is. This is a problem when studying it, since all cells need not be in the same phase of the cell cycle at the same time. The chromosome conformation during cell division and reproduction has been well studied, so scientists are now trying to look at the whole cell cycle, for example the interphase, which is connected to gene transcription and regulation and is therefore very interesting to study.
2.2 Basics of the genomic structure
Understanding the basics of our genomic structure is vital, so a brief explanation follows. The DNA consists of a double-stranded chain of connected nucleotide bases. The order in which these bases are arranged determines e.g. protein sequences and transcription factors. This is studied with DNA sequencing, and the development of high-throughput DNA sequencing has had a big impact on how much research and how many discoveries are made in medicine and biology today.
The DNA chains are wrapped around proteins called histones; these units are called chromatin, and this is the primary structure of the genome. The chromatin is arranged in multiple different ways, depending mostly on where in the cell cycle it currently is. Generally, the cycle can be divided into two phases. The first is the reproductive phase called mitosis, the phase most people know about: it is when chromatin gathers and creates the high-density chromosome structure that can be observed with normal light microscopy. The second phase is called the interphase, and it is here that gene transcription and regulation happen. This level of genome organization, where the chromatin chain creates a structure, is called the secondary structure.
The tertiary structure is the next level, and the one this thesis is most concerned with. It is when the secondary structure forms complex structures, e.g. loops and domains.
2.2.1 Promoters and Enhancers
Promoters and enhancers are of special interest here, since the main focus is on data from HiCap experiments targeting promoter-enhancer interactions. Promoters regulate gene activity at a certain time and place. For every gene there is at least one promoter, which acts as an on-switch for that gene: when a specific protein binds to the promoter, the gene is activated or deactivated.
Enhancers are regions in the DNA connected to the probability of gene transcription. Proteins can bind to enhancers to increase the transcription of a certain gene by also connecting to a promoter. A protein called CTCF can bind to the promoter and to an enhancer far away in the sequence, creating a loop in the 3D structure, which enhances the transcription of the corresponding gene. These loops occur in very large numbers in the genome's tertiary structure.
The secondary and tertiary structures of the genome are of high interest because they are directly related to what happens inside the cell: they are connected not only to normal activity but probably also to diseases and disorders.
2.3 Methods to study chromatin conformation
Since 2002, when the chromosome conformation capture technique, 3C, was published[6], immense progress has been made in studies of the structure and spatial organization of the chromosomes in the cell during its different cycles. 3C is a technique that investigates chromatin structure through interaction data. From 3C, many techniques were developed that further this goal by improving throughput and narrowing what to study; 4C, 5C, HiC and HiCap are all techniques that originate from the 3C method. In this thesis, the data used was retrieved with a newly invented method called HiCap[2].
2.3.1 3C - Chromosome conformation capture methods
Chromosome conformation capture[6] is a molecular biology technique that investigates chromatin structure. The basis for all the derived techniques is the same: first the genome is cross-linked, which creates bonds that freeze all interactions in place. Then the genome is cut into pieces, and the different pieces are allowed to ligate to each other. Smaller pieces mean better resolution. These interactions are quantified in different ways depending on which method is used. The data retrieved is used to determine how prone a locus, a specific genome segment, is to interact with another locus, giving information about the structure.
From 3C several other techniques have evolved: 4C, 5C, HiC and HiCap. Most interesting here are HiC and HiCap. They use a similar approach to generating the interaction data, and the main difference between the created data sets is which part of the DNA is examined. 3C only looks at interactions between two specific segments of the DNA, while HiC looks at interactions between all segments at the same time. HiCap, a method published in 2015, works like HiC in many ways but focuses on the promoter-enhancer sites. HiCap can also achieve a resolution of up to ~700 bp[2], significantly higher than previous methods, making it possible to investigate smaller structures around e.g. the promoter-enhancer interaction sites.
2.3.2 Sequencing
DNA sequencing refers to all techniques that determine the order of the nucleotides (adenine, guanine, thymine and cytosine) in the DNA. Better and faster methods of sequencing have accelerated progress in all genetics-related research, e.g. medical research. Sequencing is a vital part of all such research and a required step in the methods discussed here.
2.4 Cell lines
Two different kinds of cell lines were used in generating the data used in this thesis: mouse embryonic stem cells, mESC, and THP1 cells. mESC have the ability to transform into any cell type and to propagate, and they are commonly used in genetics to study genetic diseases. THP1 cells are human monocytic cells from an acute monocytic leukemia patient.
In this thesis the mESC data is only used for comparing the resolution of heat maps produced with different methods of retrieving the interaction data, HiC or HiCap.
2.5 Statistical Helix model
The major part of the thesis is the investigation of the structure at the enhancer-promoter regions in the DNA using the statistical helix model[3]. The authors of [3] observed in their data that chromatin often adopted a preferred shape on a larger scale than specific locus-locus interactions. Previously, chromatin in yeast had been modeled using a method[8] that connects the crosslinking frequency and the site separation, see equation 1.
X(s) = k · 0.53 · β^(−3/2) · exp(−2/β²) · (L·S)^(−3)   eq. 1

The crosslinking frequency, X(s), has the unit mol/(liter·nm³). The separation distance or site separation, s, is the distance between two interacting loci. The constant k is the crosslinking efficiency. S is the Kuhn statistical segment length (in kb), which describes the stiffness; it was developed for the freely jointed chain used to model polymers in the 1930s. L is a measure of the density of genomic DNA, namely the length of chromatin (in nm) containing 1000 bases of DNA. β represents the number of Kuhn statistical segments and depends on the shape of the object of study. The beta term was initially constructed for a linear and unconstrained polymer, equation 2. This is what I refer to as the unconstrained model when fitting the interaction data to the statistical helix model.

β = s/S   eq. 2
In the article from 2011 [11], the beta term was rewritten for a polymer following a circular helix, equation 3. P stands for pitch and represents the distance between the loops (in nm). D stands for diameter and represents the width of the loops. The other parameters are the same as in equations 1 and 2.

β = √( D²·sin²( π·L·s / √(π²·D² + P²) ) + (P²·L²·s²) / (π²·D² + P²) ) / (L·S)   eq. 3
Note that the study presented in [11] was based on data retrieved with the 3C method on mESC (mouse embryonic stem cells), while this thesis investigates how the statistical helix model works on THP1 cells using data retrieved with the HiCap method.
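As an illustration, equations 1-3 can be transcribed into code. The sketch below is in Python rather than Matlab purely for readability, and it follows the equations as printed above; in particular, the exact form of the exponential term should be verified against [3] and [8] before any serious use.

```python
import math

def beta_unconstrained(s, S):
    """Eq. 2: number of Kuhn segments for a linear, unconstrained polymer.
    s: site separation (kb), S: Kuhn segment length (kb)."""
    return s / S

def beta_helix(s, S, L, D, P):
    """Eq. 3: effective beta for a polymer following a circular helix.
    L: chromatin length per kb of DNA (nm), D: helix diameter (nm), P: pitch (nm)."""
    c = math.pi**2 * D**2 + P**2
    chord = math.sqrt(D**2 * math.sin(math.pi * L * s / math.sqrt(c))**2
                      + P**2 * L**2 * s**2 / c)
    return chord / (L * S)  # chord distance divided by the Kuhn length in nm

def crosslink_freq(S, k, L, beta):
    """Eq. 1: crosslinking frequency X for a given beta value."""
    return k * 0.53 * beta ** -1.5 * math.exp(-2.0 / beta ** 2) * (L * S) ** -3
```

For example, with the starting-guess parameters used later in the fitting (S = 2 kb, k = 8.5, L = 10 nm), X decreases as β grows, reproducing the expected decay of crosslinking frequency with separation.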
2.6 lsqCurvefit in Matlab
lsqcurvefit[9] is a Matlab function that solves data-fitting problems in a non-linear least-squares sense. It can use either a trust-region-reflective algorithm or a Levenberg-Marquardt algorithm; the default is the trust-region-reflective algorithm, which is used in this thesis. The basic equation is eq. 4 below.
min_x ||F(x, xdata) − ydata||₂² = min_x Σ_i (F(x, xdata_i) − ydata_i)²   eq. 4
xdata and ydata are the two data inputs. In this case xdata is separation distance and ydata is the crosslinking frequency.
2.7 Jaccard Index
The Jaccard index[10] is used for comparing the similarity of data sets: it is simply the ratio between the intersection and the union of two sets.

J(A, B) = |A ∩ B| / |A ∪ B|
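As a minimal sketch, the index can be computed for two sets of binned genomic coordinates as follows (the set elements here are hypothetical bin indices, not real coordinates):

```python
def jaccard_index(a, b):
    """Jaccard index J(A, B) = |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

# Two hypothetical sets of bin indices sharing two elements out of six:
tads = {1, 2, 3, 4}
ctcf = {3, 4, 5, 6}
overlap = jaccard_index(tads, ctcf)  # 2 shared / 6 in the union
```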
3. Method
The basis for this thesis is the interaction matrix, which is needed e.g. for creating heat maps, for localizing TADs in our data, and for general interaction analysis. This was done using a program called Homer[5]. The first step in generating the interaction matrix is to create tag directories from the experimental data, using options suited to our HiC[4] and HiCap[2] data. The tag directory is simply Homer's way of organizing the data before creating the interaction matrix. The second step is to create the interaction matrix itself, which can be a matrix containing the raw frequency data (the number of possible ligation events between different sites in the genome), a normalized matrix, or a Pearson-correlated matrix. These are explained in section 3.3.
Matrices were made for one HiC experiment and two HiCap experiments that studied mESC, and for two recent HiCap experiments studying human THP1 cells. One of the latter had no stimulation and the other was stimulated with LPS, lipopolysaccharide. This stimulation is used to increase the actual interaction frequency, i.e., the frequency of interactions that are not random interactions due to e.g. close proximity.
3.2 Creating the tag directory
The tag directory is a directory containing several files describing the experiment; these files are very similar to sorted BAM files. BAM files are compressed binary versions of sequence alignment map, SAM, files. When making the tag directories for the recent THP1 experiments, the input files are two sorted BAM files, one containing the first read and the other the second read. The BAM files contain the data after mapping, meaning that sequence information from the experiment has been mapped to a reference genome, giving the sequence of e.g. a chromosome.
Below is the command given to Homer[5] to create a tag directory, with the options used in this project, as suggested by my supervisor. The input files are the sorted BAM files. The different options are chosen to suit HiC and HiCap experiments. The tag directories for HiC, HiCap1 and HiCap2 were created from a text summary file with the format explained in the Homer documentation[5].
makeTagDirectory NameofTagDirectory.tagdir inputFile1,inputFile2 -illuminaPE -tbp 1 -removePEbg -restrictionSite GATC -genome hg19.fa
3.3 Generating the interaction matrices
Interaction matrices were created using Homer. Matrices were made at several different resolutions using the different data sets. For the main part of the thesis, essentially everything except the heat maps, the matrix resolution is 40 000 base pairs. Three different kinds of interaction matrices were generated: raw, normalized and Pearson-correlated[7]. The raw interaction matrices contain the number of supporting pairs, essentially the number of interactions, registered between site x and site y in a certain chromosome. Each interaction is a possible ligation event.
The normalized matrix contains the ratio of observed to expected interactions. For this normalization it is assumed that each site has the same probability of interacting with every other site in the genome and that the distance between the sites affects the number of interactions. This is important to account for because many proximity ligation events occur: there will naturally be some interaction between sites that are close to each other, no matter what the overall 3D structure is.
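The distance-based part of this normalization can be sketched as follows: for each genomic distance (matrix diagonal), the expected value is taken as the mean count at that distance, and each observed count is divided by it. This is a simplified illustration; Homer's actual normalization also corrects for per-site coverage, which is omitted here.

```python
import numpy as np

def oe_normalize(raw):
    """Observed/expected normalization of a symmetric raw contact matrix.
    Expected value for a cell = mean count on its diagonal (same distance)."""
    raw = np.asarray(raw, dtype=float)
    n = raw.shape[0]
    norm = np.zeros_like(raw)
    for d in range(n):  # d = distance in bins between the two sites
        diag = np.diagonal(raw, offset=d)
        expected = diag.mean()
        if expected > 0:
            vals = diag / expected
            idx = np.arange(n - d)
            norm[idx, idx + d] = vals
            norm[idx + d, idx] = vals  # keep the matrix symmetric
    return norm

m = np.array([[10, 4, 1],
              [4, 12, 5],
              [1, 5, 8]], dtype=float)
oe = oe_normalize(m)
```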
The Pearson-correlated matrices are used to find active areas, with respect to gene activity, in 2D heat maps; these areas are called TADs. They are calculated from a normalized matrix as described above, and each cell contains a Pearson correlation coefficient, see equation 5.

r = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )   eq. 5

In eq. 5, r is the Pearson correlation coefficient. For the matrix cell at position (x, y), the coefficient is calculated between row x and column y: x_i is the matrix value at the i:th position in row x, y_i is the matrix value at the i:th position in column y, x̄ is the mean of all values in row x, and ȳ is the mean of all values in column y.
The value of r lies between −1 and 1, where 1 means there is a perfect linear relationship between x and y in which y increases when x increases, and −1 means that y decreases as x increases. A value of 0 means there is no linear correlation between the variables.
Pearson-correlated matrices are often used to make TADs more visible, even when the correlation coefficients themselves are very low.
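Computed this way, the Pearson-correlated matrix of a symmetric normalized matrix is simply the row-row correlation matrix, which numpy provides directly. A small sketch with a hypothetical 3×3 normalized matrix:

```python
import numpy as np

def pearson_matrix(norm):
    """Pearson-correlate a (symmetric) normalized interaction matrix:
    cell (x, y) holds the correlation of row x with column y, which for
    a symmetric input equals the row-row correlation matrix."""
    return np.corrcoef(np.asarray(norm, dtype=float))

norm = np.array([[1.0, 1.2, 0.8],
                 [1.2, 1.0, 1.1],
                 [0.8, 1.1, 1.0]])
corr = pearson_matrix(norm)
```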
3.4 Heat maps
To examine the overall quality of the data from the HiCap[2] experiment, one can look at heat maps. Viewing heat maps from the HiCap experiment alongside heat maps from a HiC[4] experiment shows differences in e.g. resolution and differences between short-range and long-range interactions. Pearson-correlated interaction matrices at different resolutions are investigated to determine which resolution to use when finding the TADs within our HiCap data. In this thesis, resolutions of 1 megabase, 500 kilobases and 100 kilobases are investigated. The resolution corresponds to the smallest distinguishable parts of the chromatin chain, which depends on how small the pieces are that the genome is cut into.
3.5 Finding topologically associating domains
The Homer software[5] is also used to find topologically associating domains, TADs. This is done by calculating the directionality index, which describes the tendency of a genomic position to interact either upstream or downstream. It is calculated from the ratio of observed to expected interactions upstream and downstream of the position. Homer's default settings were used: the resolution is 5 kb, and 0-1000000 bp is the min-max distance considered for the directionality index.
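The directionality index has several formulations; the sketch below follows the widely used form of Dixon et al. (2012) and may differ in detail from Homer's implementation. Here `upstream` and `downstream` are the summed interaction counts on either side of a bin within the maximum distance.

```python
def directionality_index(upstream, downstream):
    """Directionality index for one genomic bin (Dixon et al. 2012 form).
    Positive values mean a downstream bias, negative an upstream bias."""
    a, b = float(upstream), float(downstream)
    e = (a + b) / 2.0  # expected counts if there were no bias
    if e == 0 or a == b:
        return 0.0
    sign = 1.0 if b > a else -1.0
    return sign * ((a - e) ** 2 / e + (b - e) ** 2 / e)

# A bin interacting mostly downstream gets a positive index:
di = directionality_index(10, 30)
```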
3.6 Jaccard Index
To determine how good the TADs calculated by Homer are, the Jaccard index was investigated at the boundaries of each domain. A file containing all TAD boundaries and a file containing the CTCF binding sites were used when calculating the Jaccard index. CTCF is a protein that is present at real interactions, as opposed to "interactions" that are due to proximity. The CTCF binding sites were first filtered to include only the high-quality sites.
Since there should be an overlap between areas with CTCF bindings and TAD boundaries, this should be visible in the Jaccard index[10] computed for areas around the boundaries and the CTCF binding sites. First the Jaccard index was calculated at the boundaries; then the boundaries were shifted in 1 kb steps, either upstream or downstream. 200 files were created with the differently shifted boundaries, spanning an area 200 kb wide (−100 kb to +100 kb). The Jaccard index was calculated between each shifted boundary file and the CTCF binding site file. A graph showing the results can be found in the results section.
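The shifting procedure can be sketched as below, using a simplified single-chromosome, base-pair-resolution Jaccard over intervals (the real analysis works on BED-style files over all chromosomes and shifts in 1 kb steps; this toy example uses 50 bp steps):

```python
def interval_jaccard(set_a, set_b):
    """Jaccard index of two lists of (start, end) intervals, measured in
    overlapping base pairs (single-chromosome sketch)."""
    def covered(intervals):
        return set(p for s, e in intervals for p in range(s, e))
    a, b = covered(set_a), covered(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def shift(intervals, offset):
    """Shift every boundary interval by `offset` base pairs."""
    return [(s + offset, e + offset) for s, e in intervals]

boundaries = [(100, 200)]
ctcf_sites = [(150, 250)]
# Scan the Jaccard index across shifted copies of the boundaries:
scores = {off: interval_jaccard(shift(boundaries, off), ctcf_sites)
          for off in (-50, 0, 50)}
```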
3.7 Investigating interactions
Before analyzing the interaction data, the interactions are filtered to remove unreliable interactions that add noise to the data. All interactions with a supporting-pair value (number of interactions) below 4 are removed, as are interactions with a p-value higher than 0.001. After filtering there are 111534 interactions in the LPS-stimulated data set and 31846 interactions in the non-stimulated data set.
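The filter itself is a one-liner; the sketch below uses hypothetical field names (`pairs`, `p`), since the real file format is not shown here:

```python
# Hypothetical interaction records: supporting pairs and p-value per interaction.
interactions = [
    {"pairs": 10, "p": 1e-5},  # kept
    {"pairs": 3,  "p": 1e-5},  # removed: fewer than 4 supporting pairs
    {"pairs": 8,  "p": 0.05},  # removed: p-value above 0.001
]

# Keep interactions with at least 4 supporting pairs and p <= 0.001:
filtered = [i for i in interactions
            if i["pairs"] >= 4 and i["p"] <= 0.001]
```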
3.7.1 NLPS and WLPS
The data from the HiCap[2] experiments provided both a set of interactions with no lipopolysaccharide, LPS, stimulation and a set of interactions that had been LPS-stimulated. The differences between these data sets are explored in the fitting of the statistical helix model and when the different domain interactions are investigated. The LPS stimulation increases the activity in the genome and the chance of an interaction.
3.7.2 Domain interactions
Four different kinds of domain interactions are possible: a Domain interaction, where both interacting loci are inside the same domain (TAD); a No Domain interaction, where both interacting loci are outside all TADs; a Semi Domain interaction, where one of the interacting loci is inside a TAD and the other is outside all TADs; and a Double Domain interaction, where the two interacting loci are inside two different TADs. The number of each kind of domain interaction is calculated, and cumulative distribution function, CDF, plots show the whole distribution.
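Given a list of TAD intervals, the classification of a single interaction can be sketched as follows (TADs are simplified here to non-overlapping (start, end) intervals on one chromosome):

```python
def find_tad(pos, tads):
    """Return the index of the TAD interval containing pos, or None."""
    for i, (start, end) in enumerate(tads):
        if start <= pos < end:
            return i
    return None

def classify_interaction(pos_a, pos_b, tads):
    """Classify an interaction into the four domain categories."""
    a, b = find_tad(pos_a, tads), find_tad(pos_b, tads)
    if a is None and b is None:
        return "no domain"      # both loci outside all TADs
    if a is None or b is None:
        return "semi domain"    # one locus inside a TAD, one outside
    if a == b:
        return "domain"         # both loci in the same TAD
    return "double domain"      # loci in two different TADs

tads = [(0, 1000), (2000, 3000)]
```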
3.7.3 Expression values
The rate at which transcription occurs for the genes identified at the interactions, here called the expression value, is investigated briefly in this thesis. First, the 10th and 70th percentiles of the expression value over all interactions are calculated. Then, the distribution of interactions with expression values in three different intervals is examined: below the 10th percentile, between the 10th and 70th percentiles, and above the 70th percentile. A CDF plot is also made to show the distribution of the expression values in the four kinds of domain interactions.
WLPS      #Interactions   Avg expression value
<10%      6711 (10%)      0.074
10%-70%   40061 (60%)     2.82
>70%      20025 (30%)     84.1

Table 1 – Number of interactions in three intervals of expression value: below the 10th percentile, between the 10th and 70th percentiles, and above the 70th percentile.
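The percentile-based binning can be sketched with numpy (the expression values below are made up for illustration):

```python
import numpy as np

# Hypothetical expression values for ten interactions:
expr = np.array([0.01, 0.05, 0.2, 1.0, 2.5, 3.0, 50.0, 90.0, 120.0, 0.02])

# Cut points at the 10th and 70th percentiles:
p10, p70 = np.percentile(expr, [10, 70])

low    = expr[expr < p10]                  # below the 10th percentile
middle = expr[(expr >= p10) & (expr < p70)]  # between the two cut points
high   = expr[expr >= p70]                 # at or above the 70th percentile
```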
3.7.4 Random Interactions
To check the validity of my findings, I made a set of random interactions. They were generated by taking a random interaction from the real interactions and replacing the gene feature position with the position of a random gene feature existing in the real interactions. The interactor position was then replaced with the new random gene start plus the separation distance, and the expression value was replaced by the expression value of the randomly chosen gene feature.
These random interactions are compared to the real interactions when looking at the different domain interactions. The differences in separation distance distribution and expression value distribution are investigated, and the results are presented in distribution tables and CDF plots in the results section. A t-test is made to see whether the data sets are significantly different at a significance level of 1%. The t-test was performed after the data had been transformed with the natural logarithm to obtain an approximately normal distribution.
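The log-transform-then-t-test step can be sketched as follows. To keep the example dependency-free, Welch's t statistic is computed by hand on synthetic log-normal "distances"; in practice the statistic is compared to the critical value for the 1% significance level (about 2.58 for large samples).

```python
import math
import random as rnd

rnd.seed(1)
# Hypothetical separation distances (bp), roughly log-normal as such data often is:
real = [math.exp(rnd.gauss(10.0, 1.0)) for _ in range(500)]
rand = [math.exp(rnd.gauss(10.5, 1.0)) for _ in range(500)]

def welch_t(a, b):
    """Welch's two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Log-transform first so the samples are approximately normal:
t = welch_t([math.log(x) for x in real], [math.log(x) for x in rand])
```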
3.7.5 Statistical helix modelling
It is important to note that the input data used here is not the same as the input data the statistical helix model was designed for. Originally the input is the relative crosslinking frequency, while I use supporting pairs. The relative crosslinking frequency is an indirect measure of the number of supporting pairs; supporting pairs are determined directly in our HiCap[2] experiment, an option their method, 3C[6], does not provide. In reality these quantities ought to be very similar, so it is possible to disregard this difference, assume that the model should work for our data, and still examine whether or not the model can find any structures in it. I also examine whether the statistical helix model describes the general interaction distribution better than the unconstrained model that is customarily used today to predict interactions, e.g. when calculating the quality of an interaction and how likely it is not to be a random interaction.
The model will be applied to several data sets where the interactions have been filtered by separation distance or expression value. By applying the model to interactions at different separation distances, structural differences may be found, and the model might perform better in certain ranges. The distance intervals are: 0-70 kb, 0-90 kb, 0-110 kb, 0-130 kb, 0-150 kb, and all interactions. Different expression values are also tested: low (expression values between 0 and the 30th percentile), medium (between the 30th and 70th percentiles) and high (above the 70th percentile).
3.7.6 Parameter fitting in matlab
The data from the experiment is used as input to the statistical helix function, and the parameters S, k, D and P are estimated using Matlab's lsqcurvefit. L is fixed at 10 nm for all runs; this value is approximately known from how the nucleosomes are packed. The fitting of the statistical helix function to the experimental data used the trust-region-reflective algorithm implemented in Matlab. The function lsqcurvefit is called with the inputs: function, xdata, ydata and starting guesses. The function was the statistical helix model, xdata the measured separation distances, and ydata the corresponding numbers of supporting pairs. As starting guesses, approximate values from the original article[3] were used: S = 2, k = 8.5, D = 200, P = 250.
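The structure of the fit can be sketched in Python. To avoid restating the model, a stand-in decay curve is used here; the point is the objective of eq. 4 and the fact that lsqcurvefit, like the crude grid search below, minimizes that sum of squared residuals (with a trust-region-reflective algorithm rather than a grid).

```python
import math

def model(s, S, k):
    """Stand-in decay curve f(s) = k * exp(-s/S); the thesis fits the
    statistical helix model instead, but the fitting logic is the same."""
    return k * math.exp(-s / S)

def sse(params, xdata, ydata):
    """Objective of eq. 4: sum over i of (F(params, xdata_i) - ydata_i)^2."""
    S, k = params
    return sum((model(x, S, k) - y) ** 2 for x, y in zip(xdata, ydata))

def grid_fit(xdata, ydata, S_grid, k_grid):
    """Crude grid search minimizing the eq. 4 objective."""
    return min(((S, k) for S in S_grid for k in k_grid),
               key=lambda p: sse(p, xdata, ydata))

# Synthetic data generated from known parameters S=40, k=8.5:
xs = [10, 20, 40, 80, 160]
ys = [model(x, 40, 8.5) for x in xs]
best = grid_fit(xs, ys, S_grid=[20, 40, 60], k_grid=[4.0, 8.5, 12.0])
```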
4. Results
4.1 Heat Maps
Below are the heat maps at 1 Mb, 500 kb and 100 kb. These resolutions mean that each pixel in the heat map covers a region of that size; the pixel in the top left corner represents the interactions in the area between the first base and the 1 000 000th base. Simply put, close interactions lie near the diagonal and distal interactions far away from the diagonal. In this thesis the most interesting areas are the distal regions with high positive correlation (red areas), because those represent interactions between two parts of the genome that are far apart yet appear to interact a lot. The heat maps, fig. 1-6, show that the HiCap experiment provides better heat maps, with more TADs found, or at least more pronounced TADs, than the HiC experiment; compare e.g. the images at 100 kb resolution, where significant differences are found in the distal regions.
Figure 1 - A heat map showing the pearson correlated interaction matrix for chr 1. Data is from the HiC experiment analyzed at 1000 kB resolution.
Figure 2 - A heat map showing the pearson correlated interaction matrix for chr 1. Data is from the HiCap experiment analyzed at 1000 kB resolution.
14
Figure 3 - A heat map showing the pearson correlated interaction matrix for chr 1. Data is from the HiC experiment analyzed at 500 kB resolution.
Figure 4 - A heat map showing the pearson correlated interaction matrix for chr 1. Data is from the HiCap experiment analyzed at 500 kB resolution.
15
Figure 5 - A heat map showing the Pearson-correlated interaction matrix for chr 1. Data is from the HiC experiment analyzed at 100 kB resolution.
Figure 6 - A heat map showing the Pearson-correlated interaction matrix for chr 1. Data is from the HiCap experiment analyzed at 100 kB resolution.
4.2 Topologically associating domains – TADs
Here, information about the TADs generated in Homer is presented: the number of TADs and their lengths, both with LPS stimulation and without. The TADs were retrieved through the method described in the method section. The number of TADs was 4330 for the cells stimulated with LPS and 4225 for the non-stimulated cells. For the stimulated (WLPS) and non-stimulated (NLPS) data the maximum, minimum and average lengths are displayed below in Table 2. The Homer software [5] found slightly more TADs in the WLPS data. The average, minimum and maximum sizes imply that the TADs found are rather similar in the two sets.
WLPS NLPS
Number of TADs 4330 4225
Average size (bases) 507408 503159
Minimum size (bases) 80000 80000
Maximum size (bases) 4668750 4668750
Table 2 – The number of TADs detected and their sizes in the stimulated (WLPS) and non-stimulated (NLPS) samples.
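The statistics in Table 2 can be computed from a list of TAD intervals with a helper along these lines; the (start, end) base-coordinate input format is an assumption about how the Homer output is parsed.

```python
def tad_stats(tads):
    """Given TADs as (start, end) base coordinates, return the count,
    average, minimum and maximum size, as reported in Table 2."""
    sizes = [end - start for start, end in tads]
    return {
        "count": len(sizes),
        "avg": sum(sizes) / len(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }
```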
4.3 Jaccard Index
The graphs, figure 7 and figure 8, shows the calculated Jaccard index for each shifted boundary file and the original boundary file compared with the CTCF binding sites. For NLPS and WLPS
respectively. It shows that there is a large overlap between CTCF binding sites and boundaries at the calculated boundaries, and as you go further away from the boundaries the overlap decreases rapidly at first and then reaches a steady level.
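The quantity plotted here can be sketched as follows: the Jaccard index of two interval sets is their total overlap in bases divided by the length of their union, and the border shift simply translates the boundary intervals. This is a generic interval-based sketch (intervals assumed per-chromosome); the thesis computed the index with existing tooling.

```python
def merged_length(intervals):
    # Total covered length after merging overlapping intervals.
    total, cur_s, cur_e = 0, None, None
    for s, e in sorted(intervals):
        if cur_e is None or s > cur_e:
            if cur_e is not None:
                total += cur_e - cur_s
            cur_s, cur_e = s, e
        else:
            cur_e = max(cur_e, e)
    if cur_e is not None:
        total += cur_e - cur_s
    return total

def intersection_length(a, b):
    # Sweep two sorted, non-overlapping interval lists and sum overlaps.
    a, b = sorted(a), sorted(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        s, e = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if e > s:
            total += e - s
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def jaccard(a, b):
    inter = intersection_length(a, b)
    union = merged_length(a) + merged_length(b) - inter
    return inter / union if union else 0.0

def shift(intervals, kb):
    # Translate boundary intervals by the border shift (in kb).
    return [(s + kb * 1000, e + kb * 1000) for s, e in intervals]
```

The curves in figures 7 and 8 then correspond to jaccard(shift(boundaries, k), ctcf_sites) evaluated for shifts k from -100 to +100 kb.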
Figure 7 – A graph showing the different values for the Jaccard index, blue dots, of a boundary file and the CTCF binding sites for the WLPS data.
[Plot: Jaccard index (y-axis, approx. 0.0115 to 0.0131) versus border shift in kb (x-axis, -100 to +100) for the WLPS data.]
Figure 8 - A graph showing the different values for the Jaccard index, blue dots, of a boundary file and the CTCF binding sites for the NLPS data.
4.5 Interactions
In this thesis the interactions investigated are generated using the HiCap method described in the method section. After filtering out unreliable interactions, the number of WLPS and NLPS interactions is 111534 and 31846, respectively. In this section it is investigated how the interactions relate to our generated TADs and what their expression values are. CDF plots show the differences in distribution between the real interactions and random interactions, see section 3.7.4, in the WLPS data. In figure 9 you can see the overall distribution of the separation distance and expression value.
Figure 9 - CDF plot showing the distribution of the distance and expression value of all interactions. The distances and expression value are shown in log10 format and the interactions are the WLPS interactions.
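The CDF curves in these figures are empirical distribution functions, which can be computed as in the generic sketch below (this is not the exact plotting code used in the thesis).

```python
import numpy as np

def ecdf(values):
    """Empirical CDF: returns sorted values x and, for each x, the fraction
    of observations <= x, which is what the CDF plots display."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y
```

For the distance plots, the input would be the log10 separation distances, e.g. ecdf(np.log10(distances)).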
[Plot: Jaccard index (y-axis, approx. 0.0117 to 0.0133) versus border shift in kb (x-axis, -100 to +100) for the NLPS data.]
4.5.1 Domain Interactions
This part deals with the four different kinds of interactions: interactions where both interacting loci are in the same domain, interactions where neither locus is inside a domain, interactions where one locus is inside a domain and the other outside (here called semi domain), and interactions where one locus is inside one domain and the other is inside another domain (here called double domain). Table 3 below shows the overall distribution of interactions in each category. The interactions are plotted in figures 16-19 in section 4.6.2. In figures 18 and 19 you can see the double domain and semi domain interactions; they show some peaking that the other domain interaction sets do not.
Interaction distribution NLPS WLPS Random
Same domain 44.0% 46.1% 35.0%
Outside domains 14.2% 10.0% 14.7%
Semi domain 18.4% 19.0% 17.7%
Double domain 23.4% 24.9% 32.6%
Table 3 – The distribution, in percentage of all interactions, of the different domain interaction categories, together with the random data set.
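The four categories above can be assigned with a classifier along these lines. This is a sketch assuming non-overlapping domains given as (start, end) pairs and loci given as single base positions; the linear scan over domains is for clarity, not speed.

```python
def classify_interaction(locus_a, locus_b, domains):
    """Assign an interaction to one of the four categories in Table 3.
    domains: non-overlapping (start, end) intervals; loci: base positions."""
    def find_domain(pos):
        for i, (s, e) in enumerate(domains):
            if s <= pos < e:
                return i
        return None

    da, db = find_domain(locus_a), find_domain(locus_b)
    if da is None and db is None:
        return "outside"          # neither locus inside a domain
    if da is None or db is None:
        return "semi"             # one locus inside, one outside
    if da == db:
        return "same"             # both loci in the same domain
    return "double"               # loci in two different domains
```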
The distribution of the WLPS data is shown in CDF plots in fig. 10, comparing the real interactions with the random interactions for each of the four kinds of domain interactions. They show that there are clear differences between the real and random interactions for all categories except the semi domain interactions. A t-test was performed to establish whether there is a significant difference at the 1% level between the real and random interactions. The results are shown in Table 4.
Distance distribution H P-value
Domain interactions 1 2.13E-09
No domain interactions 1 3.26E-144
Double domain interactions 1 1.60E-07
Semi domain interactions 1 1.57E-95
All interactions 1 1.24E-120
Table 4 – The result of the t-test: the hypothesis result at the 1% significance level and the corresponding p-value.
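The test behind Table 4 can be reproduced in outline as below. The data here are synthetic log10 separation distances, purely for illustration, and the use of a standard two-sample t-test on log-distances is an assumption about the exact test variant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical log10 separation distances for real vs. random interactions.
real_log_dist = rng.normal(4.6, 0.5, 2000)
rand_log_dist = rng.normal(4.9, 0.5, 2000)

# Two-sample t-test; H = 1 means the null hypothesis of equal means is
# rejected at the 1% significance level, as reported in Table 4.
t_stat, p_value = stats.ttest_ind(real_log_dist, rand_log_dist)
h = int(p_value < 0.01)
```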
Figure 10 – CDF plot showing the distribution of the distance in each of the four domain interaction categories. Note that the distances are shown in log10 format and the interactions are only the WLPS interactions.
4.5.2 Expression values
The overall distribution of the expression values from the HiCap experiment interactions is presented below in Table 5. It shows how the expression values are distributed between three intervals: below the 10th percentile, between the 10th and 70th percentiles, and above the 70th percentile. The average values are also presented, showing that the interactions from the LPS-stimulated experiment have a higher expression value. The distributions between the 10th and 70th percentiles stay approximately the same.
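The binning behind Table 5 can be sketched as follows; the percentile thresholds use NumPy's default linear interpolation, and the exact handling of values on a bin edge is an assumption.

```python
import numpy as np

def percentile_bins(expr):
    """Split expression values into the three intervals used in Table 5 and
    return (count, mean) for each bin."""
    expr = np.asarray(expr, dtype=float)
    p10, p70 = np.percentile(expr, [10, 70])
    bins = {
        "<10%": expr[expr < p10],
        "10%-70%": expr[(expr >= p10) & (expr < p70)],
        ">70%": expr[expr >= p70],
    }
    return {name: (vals.size, vals.mean()) for name, vals in bins.items()}
```

By construction the three bins hold roughly 10%, 60% and 30% of the interactions, matching the proportions in Table 5.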
NLPS #Interactions Avg expression value
<10% 1977 (10%) 0.0038
10%-70% 11862 (60%) 3.29
>70% 5930 (30%) 67.6
All interactions 19769 22.2
WLPS #Interactions Avg expression value
<10% 6617 (10%) 0.074
10%-70% 40029 (60%) 2.82
>70% 20008 (30%) 84.1
All interactions 66654 26.9
Table 5 – The number of interactions with an expression value in each of the three intervals: below the 10th percentile, between the 10th and 70th percentiles, and above the 70th percentile, together with the average expression value in each interval.
Furthermore, a CDF plot, figure 11, was made of the WLPS data to compare the expression values of the real interactions with those of the random interactions. These show the expression values for the different types of domain interactions: domain interactions, outside domain interactions, double domain interactions and semi domain interactions. A t-test was performed to confirm whether there is a significant difference at the 1% level between the real and random data sets, and the results showed that there is, Table 6.
Figure 11 – A CDF plot showing the distributions of the WLPS data, blue curve, for the different kinds of domain interactions together with the random interaction distribution with the same filtering.
ExpValue distribution H P-value
Domain interactions 1 3.37E-99
No domain interactions 1 5.26E-37
Double domain interactions 1 0
Semi domain interactions 1 9.64E-09
All interactions 1 9.49E-20
Table 6 – The result of the t-test: the hypothesis result at the 1% significance level and the corresponding p-value.
4.6 Statistical helix model
In this section the results from fitting the statistical helix model to our data are presented. The graphs show the interactions together with plots of the statistical helix model using the parameters retrieved from the fitting. In figures 12 and 13 you can see all interactions with a separation distance shorter than 500 kb together with the unconstrained model and the statistical helix model. They show that the unconstrained model estimates interactions poorly at separation distances beyond about 30 kb. The statistical helix model does considerably better; however, the peaks that appear do not seem to correlate to any pattern in the interactions. This goes for both the WLPS and NLPS data.
Figure 12 – The figure shows all WLPS interactions with a separation distance of 0-500 kb and the unconstrained model fit, black curve, and the statistical helix model fit, red curve. Each blue dot is an interaction with a y-value, supporting pairs, that is comparable to the number of interactions, and an x-value representing the distance between the two interacting segments.
Figure 13 - The figure shows all NLPS interactions with a separation distance of 0-500 kb and the unconstrained model fit, black curve, and the statistical helix model fit, red curve.
4.6.1 Separation distance
The interactions were filtered based on separation distance, creating six different subsets for NLPS and WLPS: one set containing all interactions, and sets containing interactions in the intervals 0-70 kb, 0-90 kb, 0-110 kb, 0-130 kb and 0-150 kb. The plots of these are shown in figure 14 for the NLPS data and in figure 15 for the WLPS data. Note that the curves are plotted beyond the interval in which the parameters were calculated. Each of the six curves has peaks that do not seem to correlate to any peak in the interaction data, and the peak positions change depending on which interval you look at.
Figure 14 – The figure is zoomed in to view the shorter interactions, less than 200 kb, of the NLPS data, together with six graphs showing the statistical helix model fit when looking at interactions from the different intervals: 0-70 kb, 0-90 kb, 0-110 kb, 0-130 kb, 0-150 kb and all interactions. Note that each curve is calculated based on the interactions within its interval; in this figure the plots are extended beyond that interval and drawn over all interactions.
Figure 15 – Same as in figure 14, but with data points from WLPS.
4.6.2 Domain Interactions
Looking instead at the fitting of the statistical helix model to the different domain interactions in figures 16-19, it is revealed that for the semi domain and double domain interactions the model starts failing severely. The unconstrained model is still poor at longer separation distances, beyond roughly 30 kb, and the peaks in the statistical helix model remain unsupported by the interaction data.
Figure 16 – A graph showing all normal domain interactions, that is, all interactions where both interacting loci are in the same domain. It is zoomed in to show only interactions with a separation distance shorter than 300 kb, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 17 - A graph showing all interactions outside domains. It is zoomed in to show only interactions with a separation distance shorter than 500 kb, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 18 - A graph showing all double domain interactions, which are all interactions where the interacting loci are in different domains. It is zoomed in to show only interactions with a separation distance shorter than 4000 kb, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 19 - A graph showing all semi domain interactions, which are all interactions where one of the interacting loci is inside a domain and the other is not. It is zoomed in to show only interactions with a separation distance shorter than 4000 kb, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
4.6.3 Expression value
The fits of the statistical helix model to interactions with low, moderate and high expression are shown in figures 20-22. As in the previous fits, the statistical helix model still shows unexplained peaks; however, the fit is better for the moderately and highly expressed interactions than for the interactions with low expression.
Figure 20 – This shows all interactions with an expression value below the 10th percentile, for the WLPS data, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 21 - This shows all interactions with an expression value between the 10th and 70th percentiles, for the WLPS data, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 22 - This shows all interactions with an expression value above the 70th percentile, for the WLPS data, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).