Processing and analysis of chromatin interaction data and numerical fitting of the statistical helix model
Tobias Everhorn
Master’s Thesis in Engineering Physics
Master Programme in Engineering Physics
KTH Royal Institute of Technology, 2016
Supervisor at SciLifeLab: Pelin Sahlén
Examiner at KTH: Jerker Widengren
Abstract
Studying DNA structure is important in many areas of biological science, such as disease treatment. The large-scale structure of DNA is currently considered to be largely explored, while structures that only appear when looking at smaller pieces of the genome (higher resolutions) remain unexplored. In this thesis, data originating from a chromosome conformation capture technique called HiCap is processed and analyzed. HiCap is a biological method, similar to the older chromosome conformation capture technique HiC, that through several steps provides interaction data over the whole genome, with a focus on promoter-enhancer interactions. The HiCap method can resolve smaller parts of the genome than previous methods, which is required to study small structures such as promoter-enhancer sites. This thesis investigates the chromatin structure at 40 kilobase resolution (i.e., in pieces 40 kilobases in size) at promoter-enhancer interaction sites in THP1 cells, human monocytic cells from an acute monocytic leukemia patient. This was done by processing and analyzing the data and then fitting the statistical helix model to interaction data from HiCap experiments, to see whether the model can detect structures in the data and how well it can predict interactions in this kind of data set. The statistical helix model has shown promise in previous work, and here it is tested how well it applies to promoter-enhancer interaction data. The results suggest that the HiCap method has a better chance of detecting the structures of small promoter-enhancer sites than the HiC method. The statistical helix model did not appear able to detect structures in the data, nor is it viable as a predictor of interactions for this kind of data.
Contents
Abstract ... 2
Contents ... 3
1. Introduction ... 5
1.1 Problem formulation ... 5
1.2 Objective... 5
2. Background ... 6
2.1 The cell nucleus and the human genome ... 6
2.2 Basics of the genomic structure ... 6
2.2.1 Promoters and Enhancers ... 6
2.3 Methods to study chromatin conformation ... 7
2.3.1 3C - Chromosome conformation capture methods ... 7
2.3.2 Sequencing ... 7
2.4 Cell lines ... 7
2.5 Statistical Helix model ... 7
2.6 lsqCurvefit in Matlab ... 8
2.7 Jaccard Index ... 8
3. Method ... 9
3.2 Creating the tag directory ... 9
3.3 Generating the interaction matrices ... 9
3.4 Heat maps ... 10
3.5 Finding topologically associating domains ... 10
3.6 Jaccard Index ... 10
3.7 Investigating interactions ... 11
3.7.1 NLPS and WLPS ... 11
3.7.2 Domain interactions ... 11
3.7.3 Expression values ... 11
3.7.4 Random Interactions ... 11
3.7.5 Statistical helix modelling ... 12
3.7.6 Parameter fitting in matlab ... 12
4. Results ... 13
4.1 Heat Maps ... 13
4.2 Topologically associating domains – TADs ... 16
4.3 Jaccard Index ... 16
4.5 Interactions... 17
4.5.1 Domain Interactions ... 18
4.5.2 Expression values ... 19
4.6 Statistical helix model ... 20
4.6.1 Separation distance ... 22
4.6.2 Domain Interactions ... 23
4.6.3 Expression value ... 26
5. Discussion ... 28
5.1 Topologically associating domains ... 28
5.2 Statistical helix model ... 28
6. Conclusions ... 29
7. References ... 30
1. Introduction
1.1 Problem formulation
The study of our genetic structure with chromosome conformation capture techniques has seen huge progress over the last 10 years[1]. Step by step, we can look more closely at chromatin structures, which can provide important information for all fields connected to human genetics. By understanding how genes organize themselves, both at non-active sites with no genes and at active sites such as promoter-enhancer interaction sites, vital progress can be made in e.g. the treatment of certain diseases. This thesis uses data from experiments based on a new conformation capture technique called HiCap[2], which focuses on the interactions at promoter-enhancer interaction sites. Experimental methods prior to HiCap have had significantly worse resolution, meaning that HiCap can look at smaller pieces of the genome. With the advantage of HiCap's resolution, smaller structures such as the promoter-enhancer site can now be explored. By fitting a certain model, the statistical helix model, to the interaction data, new insights into the structure of the DNA can hopefully be gained. In this work I also studied whether the statistical helix model can be used as a better predictor of interactions than the method mostly used today. This matters when determining how reliable an interaction is, i.e., whether it is a real interaction or merely a matter of proximity creating noise in the data.
1.2 Objective
The overall objective of this thesis is to investigate the structure of the DNA at enhancer-promoter interaction sites, partly by analyzing the interaction data and partly by fitting a certain model, the statistical helix model[3], to the data. Interaction data retrieved from a HiCap[2] experiment will be analyzed; the most significant properties of the interactions are the separation distance between two sites, the number of supporting pairs, which reflects how often the interactions occur, and the expression value, which shows how much the genetic features at the interactions are transcribed. Two major data sets retrieved from HiCap experiments will be used: one where the sample had no stimulation and one where it was stimulated with lipopolysaccharide, LPS, to make it more prone to interact. These will be compared to reveal differences between non-stimulated and stimulated samples.
A large part of the work is preparing the raw experimental data for the analysis and modeling. To investigate how HiCap data compares to data from older experimental methods, heat maps will be studied, giving a visual representation of how well defined the topologically associating domains, TADs, are in HiCap data compared to HiC[4] data. The TADs represent active areas where a lot of interactions occur.
The active regions in the genome, the TADs, will then be located using a method in the Homer software[5]. The validity of these TADs is tested by computing the Jaccard index, which measures the overlap between two data sets: the TAD coordinates from Homer and the CCCTC-binding factor coordinates. The CCCTC-binding factor, more commonly called CTCF, is a protein that is present at the interactions between promoters and enhancers. This Jaccard index test will show whether the TADs are present at the CTCF sites. These TADs will then be used when investigating e.g. how many interactions lie inside these active regions and how many lie outside them.
The major point of interest is the statistical helix model[3]. It will be applied to different subsets of the data to see whether it is applicable to all or any of them. Note that the input value used here, supporting pairs, differs from the input the statistical helix model was originally used with. The input used here is a direct measure of interaction frequency, while the input in the original article is an indirect measure of the same thing. The fitting will be made to different data sets, such as all interactions, all interactions with a high expression value, and interactions within a certain separation distance. The results will show whether the statistical helix model is applicable to this kind of data.
2. Background
It is important when reading this thesis to have a basic understanding of how the human genome is structured and of the differences between the different levels of the structure. This section gives a brief summary. In addition, some information is given about the different techniques used to study the genome structure.
2.1 The cell nucleus and the human genome
Every cell in our body has a nucleus containing the human genome, coded by approximately 3.2 billion base pairs organized into a double-stranded string. Each strand contains a copy of our whole DNA. If you stretched out the DNA in one cell, its length would be almost 2 m, and this entire string has to fit inside a nucleus with a diameter of about 6 microns. The packing of the DNA is known not to be random; instead, the DNA and the different chromosomes are folded and organized inside the nucleus in a highly ordered and regulated manner.
The conformation of the chromosomes changes constantly depending on where in the cell cycle the cell is. This is a problem when studying it, since all cells need not be in the same phase of the cell cycle at the same time. The chromosome conformation during cell division and reproduction has been well studied, so scientists are now trying to look at the whole cell cycle, for example the interphase, which is connected to gene transcription and regulation and is therefore very interesting to study.
2.2 Basics of the genomic structure
Understanding the basics of our genomic structure is vital, so a brief explanation follows. The DNA consists of a double-stranded chain of connected nucleotide bases. The order in which these bases are arranged determines e.g. protein sequences and transcription factors. This is studied with DNA sequencing, and the development of high-throughput DNA sequencing has had a big impact on how much research and how many discoveries are made in medicine and biology today.
The DNA chains are wrapped around proteins called histones; these units are called chromatin, and this is the primary structure of the genome. The chromatin is arranged in multiple different ways, depending mostly on where in the cell cycle it currently is. Generally, the cycle can be divided into two phases. The first is the reproductive phase called mitosis, the phase most people know about: it is when chromatin gathers and creates the high-density chromosome structure that can be observed with normal light microscopy. The second phase is called the interphase, and it is here that gene transcription and regulation happen. This level of genome organization, where the chromatin chain creates a structure, is called the secondary structure.
The tertiary structure is the next level, and the one this thesis is most concerned with. It is when the secondary structure forms complex structures, e.g. loops and domains.
2.2.1 Promoters and Enhancers
Promoters and enhancers are of special interest here, since the main focus is on data from HiCap experiments targeting promoter-enhancer interactions. Promoters regulate gene activity at a certain time and place. For every gene there is at least one promoter, which acts as an on-switch for that gene: when a specific protein binds to the promoter, the gene is activated or deactivated.
Enhancers are regions in the DNA connected to the probability of gene transcription. Proteins can bind to enhancers to increase the transcription of a certain gene by also connecting to a promoter. A protein called CTCF can bind to the promoter and to an enhancer far away in the sequence, creating a loop in the 3D structure, which enhances the transcription of the corresponding gene. These loops occur in very large numbers in the genome's tertiary structure.
The secondary and tertiary structures of the genome are of high interest because they are directly related to what happens inside the cell: they are connected not only to normal activity but probably also to diseases and disorders.
2.3 Methods to study chromatin conformation
Since 2002, when the chromosome conformation capture technique, 3C, was published[6], immense progress has been made in studies of the structure and spatial organization of the chromosomes in the cell during its different cycles. 3C is a technique that investigates chromatin structure through interaction data. From 3C, many techniques were developed that further this goal by improving throughput and narrowing what to study; 4C, 5C, HiC and HiCap are all techniques that originate from the 3C method. In this thesis, the data used was retrieved with a newly invented method called HiCap[2].
2.3.1 3C - Chromosome conformation capture methods
Chromosome conformation capture[6] is a molecular biology technique that investigates chromatin structure. The basis for all the derived techniques is the same: first the genome is cross-linked, which creates bonds that freeze all interactions in place. Then the genome is cut into pieces, and the different pieces are allowed to ligate to each other. Smaller pieces mean better resolution. These interactions are quantified in different ways depending on which method is used. The data retrieved is used to determine how prone a locus, a specific genome segment, is to interact with another locus, giving information about the structure.
From 3C several other techniques have evolved: 4C, 5C, HiC and HiCap. Most interesting here are HiC and HiCap. They use a similar approach to generating the interaction data, and the main difference between the created data sets is which part of the DNA is examined. 3C only looks at interactions between two specific segments of the DNA, while HiC looks at interactions between all segments at the same time. HiCap, a method published in 2015, works like HiC in many ways but focuses on the promoter-enhancer sites. HiCap can also achieve a resolution of up to ~700 bp[2], significantly higher than previous methods, making it possible to investigate smaller structures around e.g. the promoter-enhancer interaction sites.
2.3.2 Sequencing
DNA sequencing refers to all techniques that determine the order of the nucleotides (adenine, guanine, thymine and cytosine) in the DNA. Better and faster methods of sequencing have accelerated progress in all genetics-related research, e.g. medical research. Sequencing is a vital part of all such research and a required step in the methods discussed here.
2.4 Cell lines
Two different kinds of cell lines were used in generating the data used in this thesis: mouse embryonic stem cells, mESC, and THP1 cells. mESC have the ability to transform into any cell type and to propagate, and they are commonly used in genetics to study genetic diseases. THP1 cells are human monocytic cells from an acute monocytic leukemia patient.
In this thesis the mESC data is only used for comparing the resolution of heat maps produced with different methods of retrieving the interaction data, HiC or HiCap.
2.5 Statistical Helix model
The major part of the thesis is the investigation of the structure at the enhancer-promoter regions in the DNA using the statistical helix model[3]. The authors of [3] observed in their data that chromatin often adopted a preferred shape on a larger scale than specific locus-locus interactions. Previously, chromatin in yeast had been modeled using a method[8] that connects the crosslinking frequency and the site separation, see equation 1.
X(s) = k · 0.53 · β^(−3/2) · exp(−2/β²) · (L·S)^(−3)   eq. 1

The crosslinking frequency, X(s), has the unit mol/(liter·nm³). The separation distance or site separation, s, is the distance between two interacting loci. The constant k is the crosslinking efficiency. S is the Kuhn statistical segment length (in kb), which describes the stiffness; it was developed for the freely jointed chain used to model polymers in the 1930s. L is a measure of the density of genomic DNA, namely the length of chromatin (in nm) containing 1000 bases of DNA. β represents the number of Kuhn statistical segments and depends on the shape of the object of study. The beta term was initially constructed for a linear and unconstrained polymer, equation 2. This is what I refer to as the unconstrained model when fitting the interaction data to the statistical helix model.

β = s/S   eq. 2
In the article from 2011 [11], the beta term was rewritten for a polymer following a circular helix, equation 3. P stands for pitch and represents the distance between the loops (in nm). D stands for diameter and represents the width of the loops. The other parameters are the same as in equations 1 and 2.

β = √( D²·sin²( π·L·s / √(π²·D² + P²) ) + (P²·L²·s²) / (π²·D² + P²) ) / (L·S)   eq. 3
Note that the study presented in [11] was based on data retrieved with the 3C method on mESC (mouse embryonic stem cells), while this thesis investigates how the statistical helix model works on THP1 cells using data retrieved with the HiCap method.
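As an illustration, equations 1-3 can be transcribed into code. The sketch below is in Python rather than Matlab purely for readability, and it follows the equations as printed above; in particular, the exact form of the exponential term should be verified against [3] and [8] before any serious use.

```python
import math

def beta_unconstrained(s, S):
    """Eq. 2: number of Kuhn segments for a linear, unconstrained polymer.
    s: site separation (kb), S: Kuhn segment length (kb)."""
    return s / S

def beta_helix(s, S, L, D, P):
    """Eq. 3: effective beta for a polymer following a circular helix.
    L: chromatin length per kb of DNA (nm), D: helix diameter (nm), P: pitch (nm)."""
    c = math.pi**2 * D**2 + P**2
    chord = math.sqrt(D**2 * math.sin(math.pi * L * s / math.sqrt(c))**2
                      + P**2 * L**2 * s**2 / c)
    return chord / (L * S)  # chord distance divided by the Kuhn length in nm

def crosslink_freq(S, k, L, beta):
    """Eq. 1: crosslinking frequency X for a given beta value."""
    return k * 0.53 * beta ** -1.5 * math.exp(-2.0 / beta ** 2) * (L * S) ** -3
```

For example, with the starting-guess parameters used later in the fitting (S = 2 kb, k = 8.5, L = 10 nm), X decreases as β grows, reproducing the expected decay of crosslinking frequency with separation.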
2.6 lsqCurvefit in Matlab
lsqcurvefit[9] is a Matlab function that solves data-fitting problems in a non-linear least-squares sense. It can use either a trust-region-reflective algorithm or a Levenberg-Marquardt algorithm; the default is the trust-region-reflective algorithm, which is used in this thesis. The basic equation is eq. 4 below.
min_x ||F(x, xdata) − ydata||₂² = min_x Σ_i (F(x, xdata_i) − ydata_i)²   eq. 4
xdata and ydata are the two data inputs. In this case xdata is separation distance and ydata is the crosslinking frequency.
2.7 Jaccard Index
The Jaccard index[10] is used for comparing the similarity of data sets: it is simply the ratio between the intersection and the union of two sets.

J(A, B) = |A ∩ B| / |A ∪ B|
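As a minimal sketch, the index can be computed for two sets of binned genomic coordinates as follows (the set elements here are hypothetical bin indices, not real coordinates):

```python
def jaccard_index(a, b):
    """Jaccard index J(A, B) = |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

# Two hypothetical sets of bin indices sharing two elements out of six:
tads = {1, 2, 3, 4}
ctcf = {3, 4, 5, 6}
overlap = jaccard_index(tads, ctcf)  # 2 shared / 6 in the union
```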
3. Method
The basis for this thesis is the interaction matrix, which is needed e.g. for creating heat maps, for localizing TADs in our data, and for general interaction analysis. This was done using a program called Homer[5]. The first step in generating the interaction matrix is to create tag directories from the experimental data, using options suited to our HiC[4] and HiCap[2] data. The tag directory is simply Homer's way of organizing the data before creating the interaction matrix. The second step is to create the interaction matrix itself, which can be a matrix containing the raw frequency data (the number of possible ligation events between different sites in the genome), a normalized matrix, or a Pearson-correlated matrix. These are explained in section 3.3.
Matrices were made for one HiC experiment and two HiCap experiments that studied mESC, and for two recent HiCap experiments studying human THP1 cells. One of the latter had no stimulation and the other was stimulated with LPS, lipopolysaccharide. This stimulation is used to increase the actual interaction frequency, i.e., the frequency of interactions that are not random interactions due to e.g. close proximity.
3.2 Creating the tag directory
The tag directory is a directory containing several files describing the experiment; these files are very similar to sorted BAM files. BAM files are compressed binary versions of sequence alignment map, SAM, files. When making the tag directories for the recent THP1 experiments, the input files are two sorted BAM files, one containing the first read and the other the second read. The BAM files contain the data after mapping, meaning that sequence information from the experiment has been mapped to a reference genome, giving the sequence of e.g. a chromosome.
Below is the command given to Homer[5] to create a tag directory, with the options used in this project, as suggested by my supervisor. The input files are the sorted BAM files. The different options are chosen to suit HiC and HiCap experiments. The tag directories for HiC, HiCap1 and HiCap2 were created from a text summary file with the format explained in the Homer documentation[5].
makeTagDirectory NameofTagDirectory.tagdir inputFile1,inputFile2 -illuminaPE -tbp 1 -removePEbg -restrictionSite GATC -genome hg19.fa
3.3 Generating the interaction matrices
Interaction matrices were created using Homer. Matrices were made at several different resolutions using the different data sets. For the main part of the thesis, essentially everything except the heat maps, the matrix resolution is 40 000 base pairs. Three different kinds of interaction matrices were generated: raw, normalized and Pearson-correlated[7]. The raw interaction matrices contain the number of supporting pairs, essentially the number of interactions, registered between site x and site y in a certain chromosome. Each interaction is a possible ligation event.
The normalized matrix contains the ratio of observed to expected interactions. For this normalization it is assumed that each site has the same probability of interacting with every other site in the genome and that the distance between the sites affects the number of interactions. This is important to account for because many proximity ligation events occur: there will naturally be some interaction between sites that are close to each other, no matter what the overall 3D structure is.
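The distance-based part of this normalization can be sketched as follows: for each genomic distance (matrix diagonal), the expected value is taken as the mean count at that distance, and each observed count is divided by it. This is a simplified illustration; Homer's actual normalization also corrects for per-site coverage, which is omitted here.

```python
import numpy as np

def oe_normalize(raw):
    """Observed/expected normalization of a symmetric raw contact matrix.
    Expected value for a cell = mean count on its diagonal (same distance)."""
    raw = np.asarray(raw, dtype=float)
    n = raw.shape[0]
    norm = np.zeros_like(raw)
    for d in range(n):  # d = distance in bins between the two sites
        diag = np.diagonal(raw, offset=d)
        expected = diag.mean()
        if expected > 0:
            vals = diag / expected
            idx = np.arange(n - d)
            norm[idx, idx + d] = vals
            norm[idx + d, idx] = vals  # keep the matrix symmetric
    return norm

m = np.array([[10, 4, 1],
              [4, 12, 5],
              [1, 5, 8]], dtype=float)
oe = oe_normalize(m)
```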
The Pearson-correlated matrices are used to find active areas, with respect to gene activity, in 2D heat maps; these areas are called TADs. They are calculated from a normalized matrix as described above, and each cell contains a Pearson correlation coefficient, see equation 5.

r = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )   eq. 5

In eq. 5, r is the Pearson correlation coefficient. For the matrix cell at position (x, y), the coefficient is calculated between row x and column y: x_i is the matrix value at the i:th position in row x, y_i is the matrix value at the i:th position in column y, x̄ is the mean of all values in row x, and ȳ is the mean of all values in column y.
The value of r lies between −1 and 1, where 1 means there is a perfect linear relationship between x and y in which y increases when x increases, and −1 means that y decreases as x increases. A value of 0 means there is no linear correlation between the variables.
Pearson-correlated matrices are often used to make TADs more visible, even when the correlation coefficients themselves are very low.
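Computed this way, the Pearson-correlated matrix of a symmetric normalized matrix is simply the row-row correlation matrix, which numpy provides directly. A small sketch with a hypothetical 3×3 normalized matrix:

```python
import numpy as np

def pearson_matrix(norm):
    """Pearson-correlate a (symmetric) normalized interaction matrix:
    cell (x, y) holds the correlation of row x with column y, which for
    a symmetric input equals the row-row correlation matrix."""
    return np.corrcoef(np.asarray(norm, dtype=float))

norm = np.array([[1.0, 1.2, 0.8],
                 [1.2, 1.0, 1.1],
                 [0.8, 1.1, 1.0]])
corr = pearson_matrix(norm)
```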
3.4 Heat maps
To examine the overall quality of the data from the HiCap[2] experiment, one can look at heat maps. Viewing heat maps from the HiCap experiment alongside heat maps from a HiC[4] experiment shows differences in e.g. resolution and differences between short-range and long-range interactions. Pearson-correlated interaction matrices at different resolutions are investigated to determine which resolution to use when finding the TADs within our HiCap data. In this thesis, resolutions of 1 megabase, 500 kilobases and 100 kilobases are investigated. The resolution corresponds to the smallest distinguishable parts of the chromatin chain, which depends on how small the pieces are that the genome is cut into.
3.5 Finding topologically associating domains
The Homer software[5] is also used to find topologically associating domains, TADs. This is done by calculating the directionality index, which describes the tendency of a genomic position to interact either upstream or downstream. It is calculated from the ratio of observed to expected interactions upstream and downstream of the position. Homer's default settings were used: the resolution is 5 kb, and 0-1000000 bp is the min-max distance considered for the directionality index.
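The directionality index has several formulations; the sketch below follows the widely used form of Dixon et al. (2012) and may differ in detail from Homer's implementation. Here `upstream` and `downstream` are the summed interaction counts on either side of a bin within the maximum distance.

```python
def directionality_index(upstream, downstream):
    """Directionality index for one genomic bin (Dixon et al. 2012 form).
    Positive values mean a downstream bias, negative an upstream bias."""
    a, b = float(upstream), float(downstream)
    e = (a + b) / 2.0  # expected counts if there were no bias
    if e == 0 or a == b:
        return 0.0
    sign = 1.0 if b > a else -1.0
    return sign * ((a - e) ** 2 / e + (b - e) ** 2 / e)

# A bin interacting mostly downstream gets a positive index:
di = directionality_index(10, 30)
```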
3.6 Jaccard Index
To determine how good the TADs calculated by Homer are, the Jaccard index was investigated at the boundaries of each domain. A file containing all TAD boundaries and a file containing the CTCF binding sites were used when calculating the Jaccard index. CTCF is a protein that is present at real interactions, as opposed to "interactions" that are due to proximity. The CTCF binding sites were first filtered to include only the high-quality sites.
Since there should be an overlap between areas with CTCF bindings and TAD boundaries, this should be visible in the Jaccard index[10] computed for areas around the boundaries and the CTCF binding sites. First the Jaccard index was calculated at the boundaries; then the boundaries were shifted in 1 kb steps, either upstream or downstream. 200 files were created with the differently shifted boundaries, spanning an area 200 kb wide (−100 kb to +100 kb). The Jaccard index was calculated between each shifted boundary file and the CTCF binding site file. A graph showing the results can be found in the results section.
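The shifting procedure can be sketched as below, using a simplified single-chromosome, base-pair-resolution Jaccard over intervals (the real analysis works on BED-style files over all chromosomes and shifts in 1 kb steps; this toy example uses 50 bp steps):

```python
def interval_jaccard(set_a, set_b):
    """Jaccard index of two lists of (start, end) intervals, measured in
    overlapping base pairs (single-chromosome sketch)."""
    def covered(intervals):
        return set(p for s, e in intervals for p in range(s, e))
    a, b = covered(set_a), covered(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def shift(intervals, offset):
    """Shift every boundary interval by `offset` base pairs."""
    return [(s + offset, e + offset) for s, e in intervals]

boundaries = [(100, 200)]
ctcf_sites = [(150, 250)]
# Scan the Jaccard index across shifted copies of the boundaries:
scores = {off: interval_jaccard(shift(boundaries, off), ctcf_sites)
          for off in (-50, 0, 50)}
```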
3.7 Investigating interactions
Before analyzing the interaction data, the interactions are filtered to remove unreliable interactions that add noise to the data. All interactions with a supporting-pair value (number of interactions) below 4 are removed, as are interactions with a p-value higher than 0.001. After filtering there are 111534 interactions in the LPS-stimulated data set and 31846 interactions in the non-stimulated data set.
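The filter itself is a one-liner; the sketch below uses hypothetical field names (`pairs`, `p`), since the real file format is not shown here:

```python
# Hypothetical interaction records: supporting pairs and p-value per interaction.
interactions = [
    {"pairs": 10, "p": 1e-5},  # kept
    {"pairs": 3,  "p": 1e-5},  # removed: fewer than 4 supporting pairs
    {"pairs": 8,  "p": 0.05},  # removed: p-value above 0.001
]

# Keep interactions with at least 4 supporting pairs and p <= 0.001:
filtered = [i for i in interactions
            if i["pairs"] >= 4 and i["p"] <= 0.001]
```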
3.7.1 NLPS and WLPS
The data from the HiCap[2] experiments provided both a set of interactions with no lipopolysaccharide, LPS, stimulation and a set of interactions that had been LPS-stimulated. The differences between these data sets are explored in the fitting of the statistical helix model and when the different domain interactions are investigated. The LPS stimulation increases the activity in the genome and the chance of an interaction.
3.7.2 Domain interactions
Four different kinds of domain interactions are possible: a Domain interaction, where both interacting loci are inside the same domain (TAD); a No Domain interaction, where both interacting loci are outside all TADs; a Semi Domain interaction, where one of the interacting loci is inside a TAD and the other is outside all TADs; and a Double Domain interaction, where the two interacting loci are inside two different TADs. The number of each kind of domain interaction is calculated, and cumulative distribution function, CDF, plots show the whole distribution.
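Given a list of TAD intervals, the classification of a single interaction can be sketched as follows (TADs are simplified here to non-overlapping (start, end) intervals on one chromosome):

```python
def find_tad(pos, tads):
    """Return the index of the TAD interval containing pos, or None."""
    for i, (start, end) in enumerate(tads):
        if start <= pos < end:
            return i
    return None

def classify_interaction(pos_a, pos_b, tads):
    """Classify an interaction into the four domain categories."""
    a, b = find_tad(pos_a, tads), find_tad(pos_b, tads)
    if a is None and b is None:
        return "no domain"      # both loci outside all TADs
    if a is None or b is None:
        return "semi domain"    # one locus inside a TAD, one outside
    if a == b:
        return "domain"         # both loci in the same TAD
    return "double domain"      # loci in two different TADs

tads = [(0, 1000), (2000, 3000)]
```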
3.7.3 Expression values
The rate at which transcription occurs for the genes identified at the interactions, here called the expression value, is investigated briefly in this thesis. First, the 10th and 70th percentiles of the expression value over all interactions are calculated. Then, the distribution of interactions with expression values in three different intervals is examined: below the 10th percentile, between the 10th and 70th percentiles, and above the 70th percentile. A CDF plot is also made to show the distribution of the expression values in the four kinds of domain interactions.
WLPS      #Interactions   Avg expression value
<10%      6711 (10%)      0.074
10%-70%   40061 (60%)     2.82
>70%      20025 (30%)     84.1

Table 1 – Number of interactions in three intervals of expression value: below the 10th percentile, between the 10th and 70th percentiles, and above the 70th percentile.
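The percentile-based binning can be sketched with numpy (the expression values below are made up for illustration):

```python
import numpy as np

# Hypothetical expression values for ten interactions:
expr = np.array([0.01, 0.05, 0.2, 1.0, 2.5, 3.0, 50.0, 90.0, 120.0, 0.02])

# Cut points at the 10th and 70th percentiles:
p10, p70 = np.percentile(expr, [10, 70])

low    = expr[expr < p10]                  # below the 10th percentile
middle = expr[(expr >= p10) & (expr < p70)]  # between the two cut points
high   = expr[expr >= p70]                 # at or above the 70th percentile
```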
3.7.4 Random Interactions
To check the validity of my findings, I made a set of random interactions. They were generated by taking a random interaction from the real interactions and replacing the gene feature position with the position of a random gene feature existing in the real interactions. The interactor position was then replaced with the new random gene start plus the separation distance, and the expression value was replaced by the expression value of the randomly chosen gene feature.
These random interactions are compared to the real interactions when looking at the different domain interactions. The differences in separation distance distribution and expression value distribution are investigated, and the results are presented in distribution tables and CDF plots in the results section. A t-test is made to see whether the data sets are significantly different at a significance level of 1%. The t-test was performed after the data had been transformed with the natural logarithm to obtain an approximately normal distribution.
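The log-transform-then-t-test step can be sketched as follows. To keep the example dependency-free, Welch's t statistic is computed by hand on synthetic log-normal "distances"; in practice the statistic is compared to the critical value for the 1% significance level (about 2.58 for large samples).

```python
import math
import random as rnd

rnd.seed(1)
# Hypothetical separation distances (bp), roughly log-normal as such data often is:
real = [math.exp(rnd.gauss(10.0, 1.0)) for _ in range(500)]
rand = [math.exp(rnd.gauss(10.5, 1.0)) for _ in range(500)]

def welch_t(a, b):
    """Welch's two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Log-transform first so the samples are approximately normal:
t = welch_t([math.log(x) for x in real], [math.log(x) for x in rand])
```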
3.7.5 Statistical helix modelling
It is important to note that the input data used here is not the same as the input data the statistical helix model was designed for. Originally the input is the relative crosslinking frequency, while I use supporting pairs. The relative crosslinking frequency is an indirect measure of the number of supporting pairs; supporting pairs are determined directly in our HiCap[2] experiment, an option their method, 3C[6], does not provide. In reality these quantities ought to be very similar, so it is possible to disregard this difference, assume that the model should work for our data, and still examine whether or not the model can find any structures in it. I also examine whether the statistical helix model describes the general interaction distribution better than the unconstrained model that is customarily used today to predict interactions, e.g. when calculating the quality of an interaction and how likely it is not to be a random interaction.
The model will be applied to several data sets where the interactions have been filtered by separation distance or expression value. By applying the model to interactions at different separation distances, structural differences may be found, and the model might perform better in certain ranges. The distance intervals are: 0-70 kb, 0-90 kb, 0-110 kb, 0-130 kb, 0-150 kb, and all interactions. Different expression values are also tested: low (expression values between 0 and the 30th percentile), medium (between the 30th and 70th percentiles) and high (above the 70th percentile).
3.7.6 Parameter fitting in matlab
The data from the experiment is used as input to the statistical helix function, and the parameters S, k, D and P are estimated using Matlab's lsqcurvefit. L is fixed at 10 nm for all runs; this value is approximately known from how the nucleosomes are packed. The fitting of the statistical helix function to the experimental data used the trust-region-reflective algorithm implemented in Matlab. The function lsqcurvefit is called with the inputs: function, xdata, ydata and starting guesses. The function was the statistical helix model, xdata the measured separation distances, and ydata the corresponding numbers of supporting pairs. As starting guesses, approximate values from the original article[3] were used: S = 2, k = 8.5, D = 200, P = 250.
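The structure of the fit can be sketched in Python. To avoid restating the model, a stand-in decay curve is used here; the point is the objective of eq. 4 and the fact that lsqcurvefit, like the crude grid search below, minimizes that sum of squared residuals (with a trust-region-reflective algorithm rather than a grid).

```python
import math

def model(s, S, k):
    """Stand-in decay curve f(s) = k * exp(-s/S); the thesis fits the
    statistical helix model instead, but the fitting logic is the same."""
    return k * math.exp(-s / S)

def sse(params, xdata, ydata):
    """Objective of eq. 4: sum over i of (F(params, xdata_i) - ydata_i)^2."""
    S, k = params
    return sum((model(x, S, k) - y) ** 2 for x, y in zip(xdata, ydata))

def grid_fit(xdata, ydata, S_grid, k_grid):
    """Crude grid search minimizing the eq. 4 objective."""
    return min(((S, k) for S in S_grid for k in k_grid),
               key=lambda p: sse(p, xdata, ydata))

# Synthetic data generated from known parameters S=40, k=8.5:
xs = [10, 20, 40, 80, 160]
ys = [model(x, 40, 8.5) for x in xs]
best = grid_fit(xs, ys, S_grid=[20, 40, 60], k_grid=[4.0, 8.5, 12.0])
```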
4. Results
4.1 Heat Maps
Below are the heat maps at 1 Mb, 500 kb and 100 kb. These resolutions mean that each pixel in the heat map covers a region of that size; the pixel in the top left corner represents the interactions in the area between the first base and the 1 000 000th base. Simply put, close interactions lie near the diagonal and distal interactions far away from the diagonal. In this thesis the most interesting areas are the distal regions with high positive correlation (red areas), because those represent interactions between two parts of the genome that are far apart yet appear to interact a lot. The heat maps, fig. 1-6, show that the HiCap experiment provides better heat maps, with more TADs found, or at least more pronounced TADs, than the HiC experiment; compare e.g. the images at 100 kb resolution, where significant differences are found in the distal regions.
Figure 1 - A heat map showing the pearson correlated interaction matrix for chr 1. Data is from the HiC experiment analyzed at 1000 kB resolution.
Figure 2 - A heat map showing the pearson correlated interaction matrix for chr 1. Data is from the HiCap experiment analyzed at 1000 kB resolution.
14
Figure 3 - A heat map showing the pearson correlated interaction matrix for chr 1. Data is from the HiC experiment analyzed at 500 kB resolution.
Figure 4 - A heat map showing the pearson correlated interaction matrix for chr 1. Data is from the HiCap experiment analyzed at 500 kB resolution.
15
Figure 5 - A heat map showing the Pearson-correlated interaction matrix for chr 1. Data is from the HiC experiment analyzed at 100 kB resolution.
Figure 6 - A heat map showing the Pearson-correlated interaction matrix for chr 1. Data is from the HiCap experiment analyzed at 100 kB resolution.
4.2 Topologically associating domains – TADs
Here, information about the TADs generated in Homer is presented: the number of TADs and their lengths, both with LPS stimulation and without. The TADs were retrieved through the method described in the method section. The number of TADs was 4330 for the cells stimulated with LPS and 4225 for the non-stimulated cells. For the stimulated (WLPS) and non-stimulated (NLPS) data the maximum, minimum and average lengths are displayed below in Table 2. The Homer software [5] found slightly more TADs in the WLPS data. The average, minimum and maximum sizes imply that the TADs found are rather similar in the two sets.
WLPS NLPS
Number of TADs 4330 4225
Average size (bases) 507408 503159
Minimum size (bases) 80000 80000
Maximum size (bases) 4668750 4668750
Table 2 – The number of TADs detected and their sizes in the stimulated (WLPS) and non-stimulated (NLPS) samples.
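The statistics in Table 2 can be computed from a list of TAD intervals with a helper along these lines; the (start, end) base-coordinate input format is an assumption about how the Homer output is parsed.

```python
def tad_stats(tads):
    """Given TADs as (start, end) base coordinates, return the count,
    average, minimum and maximum size, as reported in Table 2."""
    sizes = [end - start for start, end in tads]
    return {
        "count": len(sizes),
        "avg": sum(sizes) / len(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }
```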
4.3 Jaccard Index
The graphs, figure 7 and figure 8, shows the calculated Jaccard index for each shifted boundary file and the original boundary file compared with the CTCF binding sites. For NLPS and WLPS
respectively. It shows that there is a large overlap between CTCF binding sites and boundaries at the calculated boundaries, and as you go further away from the boundaries the overlap decreases rapidly at first and then reaches a steady level.
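The quantity plotted here can be sketched as follows: the Jaccard index of two interval sets is their total overlap in bases divided by the length of their union, and the border shift simply translates the boundary intervals. This is a generic interval-based sketch (intervals assumed per-chromosome); the thesis computed the index with existing tooling.

```python
def merged_length(intervals):
    # Total covered length after merging overlapping intervals.
    total, cur_s, cur_e = 0, None, None
    for s, e in sorted(intervals):
        if cur_e is None or s > cur_e:
            if cur_e is not None:
                total += cur_e - cur_s
            cur_s, cur_e = s, e
        else:
            cur_e = max(cur_e, e)
    if cur_e is not None:
        total += cur_e - cur_s
    return total

def intersection_length(a, b):
    # Sweep two sorted, non-overlapping interval lists and sum overlaps.
    a, b = sorted(a), sorted(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        s, e = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if e > s:
            total += e - s
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def jaccard(a, b):
    inter = intersection_length(a, b)
    union = merged_length(a) + merged_length(b) - inter
    return inter / union if union else 0.0

def shift(intervals, kb):
    # Translate boundary intervals by the border shift (in kb).
    return [(s + kb * 1000, e + kb * 1000) for s, e in intervals]
```

The curves in figures 7 and 8 then correspond to jaccard(shift(boundaries, k), ctcf_sites) evaluated for shifts k from -100 to +100 kb.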
Figure 7 – A graph showing the different values for the Jaccard index, blue dots, of a boundary file and the CTCF binding sites for the WLPS data.
[Plot: Jaccard index (y-axis, approx. 0.0115 to 0.0131) versus border shift in kb (x-axis, -100 to +100) for the WLPS data.]
Figure 8 - A graph showing the different values for the Jaccard index, blue dots, of a boundary file and the CTCF binding sites for the NLPS data.
4.5 Interactions
In this thesis the interactions investigated are generated using the HiCap method described in the method section. After filtering out unreliable interactions, the number of WLPS and NLPS interactions is 111534 and 31846, respectively. In this section it is investigated how the interactions relate to our generated TADs and what their expression values are. CDF plots show the differences in distribution between the real interactions and random interactions, see section 3.7.4, in the WLPS data. In figure 9 you can see the overall distribution of the separation distance and expression value.
Figure 9 - CDF plot showing the distribution of the distance and expression value of all interactions. The distances and expression value are shown in log10 format and the interactions are the WLPS interactions.
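The CDF curves in these figures are empirical distribution functions, which can be computed as in the generic sketch below (this is not the exact plotting code used in the thesis).

```python
import numpy as np

def ecdf(values):
    """Empirical CDF: returns sorted values x and, for each x, the fraction
    of observations <= x, which is what the CDF plots display."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y
```

For the distance plots, the input would be the log10 separation distances, e.g. ecdf(np.log10(distances)).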
[Plot: Jaccard index (y-axis, approx. 0.0117 to 0.0133) versus border shift in kb (x-axis, -100 to +100) for the NLPS data.]
4.5.1 Domain Interactions
This part deals with the four different kinds of interactions: interactions where both interacting loci are in the same domain, interactions where neither locus is inside a domain, interactions where one locus is inside a domain and the other outside (here called semi domain), and interactions where one locus is inside one domain and the other is inside another domain (here called double domain). Table 3 below shows the overall distribution of interactions in each category. The interactions are plotted in figures 16-19 in section 4.6.2. In figures 18 and 19 you can see the double domain and semi domain interactions; they show some peaking that the other domain interaction sets do not.
Interaction distribution NLPS WLPS Random
Same domain 44.0% 46.1% 35.0%
Outside domains 14.2% 10.0% 14.7%
Semi domain 18.4% 19.0% 17.7%
Double domain 23.4% 24.9% 32.6%
Table 3 – The distribution, in percentage of all interactions, of the different domain interaction categories, together with the random data set.
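The four categories above can be assigned with a classifier along these lines. This is a sketch assuming non-overlapping domains given as (start, end) pairs and loci given as single base positions; the linear scan over domains is for clarity, not speed.

```python
def classify_interaction(locus_a, locus_b, domains):
    """Assign an interaction to one of the four categories in Table 3.
    domains: non-overlapping (start, end) intervals; loci: base positions."""
    def find_domain(pos):
        for i, (s, e) in enumerate(domains):
            if s <= pos < e:
                return i
        return None

    da, db = find_domain(locus_a), find_domain(locus_b)
    if da is None and db is None:
        return "outside"          # neither locus inside a domain
    if da is None or db is None:
        return "semi"             # one locus inside, one outside
    if da == db:
        return "same"             # both loci in the same domain
    return "double"               # loci in two different domains
```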
The distribution of the WLPS data is shown in CDF plots in fig. 10, comparing the real interactions with the random interactions for each of the four kinds of domain interactions. They show that there are clear differences between the real and random interactions for all categories except the semi domain interactions. A t-test was performed to establish whether there is a significant difference at the 1% level between the real and random interactions. The results are shown in Table 4.
Distance distribution H P-value
Domain interactions 1 2.13E-09
No domain interactions 1 3.26E-144
Double domain interactions 1 1.60E-07
Semi domain interactions 1 1.57E-95
All interactions 1 1.24E-120
Table 4 – The result of the t-test: the hypothesis result at the 1% significance level and the corresponding p-value.
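The test behind Table 4 can be reproduced in outline as below. The data here are synthetic log10 separation distances, purely for illustration, and the use of a standard two-sample t-test on log-distances is an assumption about the exact test variant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical log10 separation distances for real vs. random interactions.
real_log_dist = rng.normal(4.6, 0.5, 2000)
rand_log_dist = rng.normal(4.9, 0.5, 2000)

# Two-sample t-test; H = 1 means the null hypothesis of equal means is
# rejected at the 1% significance level, as reported in Table 4.
t_stat, p_value = stats.ttest_ind(real_log_dist, rand_log_dist)
h = int(p_value < 0.01)
```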
Figure 10 – CDF plot showing the distribution of the distance in each of the four domain interaction categories. Note that the distances are shown in log10 format and the interactions are only the WLPS interactions.
4.5.2 Expression values
The overall distribution of the expression values from the HiCap experiment interactions is presented below in Table 5. It shows how the expression values are distributed between three intervals: below the 10th percentile, between the 10th and 70th percentiles, and above the 70th percentile. The average values are also presented, showing that the interactions from the LPS-stimulated experiment have a higher expression value. The distributions between the 10th and 70th percentiles stay approximately the same.
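The binning behind Table 5 can be sketched as follows; the percentile thresholds use NumPy's default linear interpolation, and the exact handling of values on a bin edge is an assumption.

```python
import numpy as np

def percentile_bins(expr):
    """Split expression values into the three intervals used in Table 5 and
    return (count, mean) for each bin."""
    expr = np.asarray(expr, dtype=float)
    p10, p70 = np.percentile(expr, [10, 70])
    bins = {
        "<10%": expr[expr < p10],
        "10%-70%": expr[(expr >= p10) & (expr < p70)],
        ">70%": expr[expr >= p70],
    }
    return {name: (vals.size, vals.mean()) for name, vals in bins.items()}
```

By construction the three bins hold roughly 10%, 60% and 30% of the interactions, matching the proportions in Table 5.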
NLPS #Interactions Avg expression value
<10% 1977 (10%) 0.0038
10%-70% 11862 (60%) 3.29
>70% 5930 (30%) 67.6
All interactions 19769 22.2
WLPS #Interactions Avg expression value
<10% 6617 (10%) 0.074
10%-70% 40029 (60%) 2.82
>70% 20008 (30%) 84.1
All interactions 66654 26.9
Table 5 – The number of interactions with an expression value in each of the three intervals: below the 10th percentile, between the 10th and 70th percentiles, and above the 70th percentile, together with the average expression value in each interval.
Furthermore, a CDF plot, figure 11, was made of the WLPS data to compare the expression values of the real interactions with those of the random interactions. These show the expression values for the different types of domain interactions: domain interactions, outside domain interactions, double domain interactions and semi domain interactions. A t-test was performed to confirm whether there is a significant difference at the 1% level between the real and random data sets, and the results showed that there is, Table 6.
Figure 11 – A CDF plot showing the distributions of the WLPS data, blue curve, for the different kinds of domain interactions together with the random interaction distribution with the same filtering.
ExpValue distribution H P-value
Domain interactions 1 3.37E-99
No domain interactions 1 5.26E-37
Double domain interactions 1 0
Semi domain interactions 1 9.64E-09
All interactions 1 9.49E-20
Table 6 – The result of the t-test: the hypothesis result at the 1% significance level and the corresponding p-value.
4.6 Statistical helix model
In this section the results from fitting the statistical helix model to our data are presented. The graphs show the interactions together with plots of the statistical helix model using the parameters retrieved from the fitting. In figures 12 and 13 you can see all interactions with a separation distance shorter than 500 kb together with the unconstrained model and the statistical helix model. They show that the unconstrained model estimates interactions poorly at separation distances beyond about 30 kb. The statistical helix model does considerably better; however, the peaks that appear do not seem to correlate to any pattern in the interactions. This goes for both the WLPS and NLPS data.
Figure 12 – The figure shows all WLPS interactions with a separation distance of 0-500 kb and the unconstrained model fit, black curve, and the statistical helix model fit, red curve. Each blue dot is an interaction with a y-value, supporting pairs, that is comparable to the number of interactions, and an x-value representing the distance between the two interacting segments.
Figure 13 - The figure shows all NLPS interactions with a separation distance of 0-500 kb and the unconstrained model fit, black curve, and the statistical helix model fit, red curve.
4.6.1 Separation distance
The interactions were filtered based on separation distance, creating six different subsets for NLPS and WLPS: one set containing all interactions, and sets containing interactions in the intervals 0-70 kb, 0-90 kb, 0-110 kb, 0-130 kb and 0-150 kb. The plots of these are shown in figure 14 for the NLPS data and in figure 15 for the WLPS data. Note that the curves are plotted beyond the interval in which the parameters were calculated. Each of the six curves has peaks that do not seem to correlate to any peak in the interaction data, and the peak positions change depending on which interval you look at.
Figure 14 – The figure is zoomed in to view the shorter interactions, less than 200 kb, of the NLPS data, together with six graphs showing the statistical helix model fit when looking at interactions from the different intervals: 0-70 kb, 0-90 kb, 0-110 kb, 0-130 kb, 0-150 kb and all interactions. Note that each curve is calculated based on the interactions within its interval; in this figure the plots are extended beyond that interval and drawn over all interactions.
Figure 15 – Same as in figure 14, but with data points from WLPS.
4.6.2 Domain Interactions
Looking instead at the fitting of the statistical helix model to the different domain interactions in figures 16-19, it is revealed that for the semi domain and double domain interactions the model starts failing severely. The unconstrained model is still poor at longer separation distances, beyond roughly 30 kb, and the peaks in the statistical helix model remain unsupported by the interaction data.
Figure 16 – A graph showing all normal domain interactions, that is, all interactions where both interacting loci are in the same domain. It is zoomed in to show only interactions with a separation distance shorter than 300 kb, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 17 - A graph showing all interactions outside domains. It is zoomed in to show only interactions with a separation distance shorter than 500 kb, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 18 - A graph showing all double domain interactions, which are all interactions where the interacting loci are in different domains. It is zoomed in to show only interactions with a separation distance shorter than 4000 kb, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 19 - A graph showing all semi domain interactions, which are all interactions where one of the interacting loci is inside a domain and the other is not. It is zoomed in to show only interactions with a separation distance shorter than 4000 kb, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
4.6.3 Expression value
The fits of the statistical helix model to interactions with low, moderate and high expression are shown in figures 20-22. As in the previous fits, the statistical helix model still shows unexplained peaks; however, the fit is better for the moderately and highly expressed interactions than for the interactions with low expression.
Figure 20 – This shows all interactions with an expression value below the 10th percentile, for the WLPS data, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 21 - This shows all interactions with an expression value between the 10th and 70th percentiles, for the WLPS data, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).
Figure 22 - This shows all interactions with an expression value above the 70th percentile, for the WLPS data, together with the fitted unconstrained model (black curve) and statistical helix model (red curve).