• No results found

Systematic comparison of gene regulatory datasets using experimentally validated enhancers

N/A
N/A
Protected

Academic year: 2022

Share "Systematic comparison of gene regulatory datasets using experimentally validated enhancers"

Copied!
41
0
0

Loading.... (view fulltext now)

Full text

(1)

IN

DEGREE PROJECT BIOTECHNOLOGY, SECOND CYCLE, 30 CREDITS

STOCKHOLM SWEDEN 2020,

Systematic comparison of gene regulatory datasets using

experimentally validated enhancers

XUE DONG

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH

(2)

Systematic comparison of gene regulatory datasets using experimentally

validated enhancers

XUE DONG

Master’s Programme, Medical Biotechnology, 120 credits Date: August 26, 2020

Supervisor: Pelin Sahlén Examiner: Patrik Ståhl

School of Engineering Sciences in Chemistry, Biotechnology and Health

(3)

Systematic comparison of gene regulatory datasets using experimentally validated enhancers

c

2020 Xue Dong

(4)

| i

Abstract

Promoter-enhancer interactions are essential for gene regulating, Cap- ture Hi-C is a chromosome conformation capture method to map promoter- enhancer interactions at high resolution. We have Capture Hi-C data for GM12878 cells, immortalized primary B lymphocytes, in three replicates.

Although Capture Hi-C maps enhancer elements together with the promoters they regulate, the overlap between enhancer datasets produced by other meth- ods such as ChIP-seq and Capture Hi-C is lower than expected.

In order to understand the reasons for lower overlap, we investigated the enhancer potential of replicated and non-replicated Capture Hi-C interactors, as well as enhancer overlapping and non-overlapping Capture Hi-C interac- tors. We performed a systematic comparison between our interactor and experimental regulatory and transcriptomic datasets to determine the extent of enhancer mapping.

The results show replicated interactors have higher enhancer potential than non-replicated ones. However, there is evidence that interactors not overlap- ping with experimental validated regulatory datasets can also potentially be true enhancers.

Keywords

Capture Hi-C; ChIP-seq; enhancer; promoter-enhancer interactions; eQTLs

(5)

ii |

Sammanfattning

Interaktioner med promotor-förstärkare är viktiga för genreglering, Cap- ture Hi-C är en metod för att fånga kromosomkonformation för att kartlägga promotor-förstärkare-interaktioner med hög upplösning. Vi har fångat Hi- C-data för GM12878-celler, odödliga primära B-lymfocyter, i tre repliker.

Även om Capture Hi-C kartlägger förbättringselement tillsammans med pro- motorerna som de reglerar, är överlappningen mellan förstärkningsdatasätt producerade med andra metoder som ChIP-seq och Capture Hi-C lägre än väntat.

För att förstå orsakerna till lägre överlappning undersökte vi förbättrings- potentialen för replikerade och icke-replikerade Capture Hi-C-interaktorer, så- väl som enhancer-överlappande och icke-överlappande Capture Hi-C-interak- torer. Vi utförde en systematisk jämförelse mellan vår interaktör och ex- perimentella regulatoriska och transkriptomiska datasätt för att bestämma omfattningen av förstärkarkartläggning.

Resultaten visar att replikerade interaktorer har högre förbättringspotential än icke-replikerade. Det finns emellertid bevis på att interaktorer som inte överlappar experimentella validerade regleringsdatasätt också kan vara riktiga förbättrare.

(6)

| iii

Acknowledgments

I would like to thank my supervisor Pelin Sahlén, and co-supervisor Sailen- dra Pradhananga for providing guidance and feedback throughout this project.

Stockholm, September 2020 Xue Dong

(7)

iv | CONTENTS

Contents

1 Introduction 1

2 Materials and Methods 5

2.1 HiCap data . . . . 5 2.2 ChIP-seq data . . . . 6 2.3 Overlapping of HiCap data with ChIP-seq enhancers . . . . . 6 2.4 Investigations of overlapped ChIP-seq

enhancer marks . . . . 6

2.5 Comparisons between replicated and

non-replicated HiCap enhancers . . . . 7

2.6 Comparisons between overlapped and

non-overlapped HiCap enhancers . . . . 8 2.7 Transcription factor binding motif analysis and gene ontology

analysis of HiCap non-overlapped functional regions . . . . . 8

3 Results 9

3.1 Overlapping between HiCap datasets and experimentally val- idated enhancers. . . . 9 3.2 Investigations of overlapped ChIP-seq

enhancer marks . . . 11

3.3 Comparisons between replicated and

non-replicated HiCap enhancers . . . 15 3.4 Comparisons between overlapping and non-overlapping Hi-

Cap enhancers . . . 18 3.5 Investigations of HiCap non-overlapped functional regions . . 18

4 Discussion 21

5 Future Perspectives 24

(8)

Contents | v

References 25

A Supplementary Figures 27

B Supplementary Tables 31

(9)

vi | List of acronyms and abbreviations

List of acronyms and abbreviations

ChIP-seq Chromatin immunoprecipitation with sequencing

DHS DNase I hypersensitive site

eQTL expression quantitative trait loci

GO Gene ontology

HiCap Capture Hi-C

IRF Interferon regulatory factor

TF Transcription factor

TFBM Transcription factor binding motif

(10)

Introduction | 1

Chapter 1 Introduction

The coding genes undergo processes of replication, transcription and trans- lation to perform their functions, these gene expression processes are regulated by a wide range of mechanisms to switch on/off the gene expression state or turn up/down the gene expression products. Enhancers are important elements in gene regulatory process, playing roles in cis-regulating gene transcription.

Enhancers can be located up to hundreds of kilobases from their target gene, and they are acting by physically contacted with their controlled promoters mediated via DNA looping. Enhancer regulation is essential for all organisms because it can activate specific genes at specific times and in specific cells to achieve a predetermined, orderly and irreversible differentiation and develop- ment process, also enable tissues and organs to maintain normal physiological functions under certain environmental conditions.

Enhancers are often isolated and identified by DNase I hypersensitivity methods in the past, DNase I hypersensitive sites (DHSs)) are chromatin regions that are exposed to DNase I enzyme cleavage, different from DNA packed by histone proteins, these open DNA regions usually contain binding sites for proteins such as RNA polymerase and transcription factors which are essential to gene regulation, so they often used as promoter and en- hancer marks [1]. With the development of high-throughput sequencing, DNase-seq becomes a straightforward way to identify enhancers [2]. Beside DHSs, some histone modifications and Transcription factors (TFs) are now often used as enhancer marks. For example, monomethylation of histone H3 lysine 4 (H3K4me1) was found to be linked to distal regions with cis- regulatory activity in HeLa cells [3], tri-methylation of histone H3 lysine 4 (H3K4me3), acetylation of histone H3 lysine 27 (H3K27ac), and histone H2A variant H2A.Z have been shown to be associated with putative enhancer

(11)

2 | Introduction

enrichment [4]. RNA polymerase II, as a general transcription factor, directly lead to gene transcription. Histone acetyltransferase p300 (P300), a general transcriptional co-activator, can regulate gene transcription by remodeling chromatin [5]. The identification and characterization of histone modifications and transcription factors are often performed byChromatin immunoprecipita- tion with sequencing (ChIP-seq)[6]. By ChIP-seq technique, DNA fragments associated with a specific protein are enriched and sequenced, allowing precise genome-wide mapping of histone modifications and DNA-binding proteins (e.g. transcription factors).

Except the identification of enhancers, the promoter-enhancer interactions are also worth investigating. Research shows although in many cases en- hancers are associated with their closet gene, 65% of them function from up to hundreds of kilobases away (i.e. do not regulate the nearest gene) [7], these long range promoter-enhancer interactions are often difficult to interpret simply based on their nearest gene. Furthermore, there are 20,000–25,000 genes in human genome in contrast to around 100,000 enhancers [8], one gene is in average regulated by several enhancers together, so investigating the function of one single enhancer might not be sufficient for investigating its associated gene. Thus, promoter-enhancer interactome researching is crucial for solving these problems.

Chromosome conformation capture (3C) method, which could capture and quantify the contacting sequences in three-dimensional space, is a good way to analyse a selected promoter-enhancer interaction since activating enhancers will spatially contact promoters from linearly long distance away by genome looping. Hi-C is able to be used for studying promoter-enhancer interactions in a genome-wide scale, since the innovative Hi-C aims to capture all pos- sible pairs of fragments simultaneously instead of selecting one or one kind of interactions, also due to the development of next-generation sequencing methods [9]. Capture Hi-C (HiCap) is a promoter targeted chromosome conformation capture method evolved from Hi-C technique. Capture Hi-C is aiming at mapping promoter-enhancer interactions and is able to map them accurately, mean distal region size of only 699 bp—close to single-enhancer resolution [7], compared to 5-–20 kb resolution by Hi-C, which makes Capture Hi-C very useful to study gene regulation.

An interaction still can be important if it is not detected in large number of cells. As discussed before, one single gene could be regulated by several enhancers, some enhancers are redundant, which means different enhancers can play the same role on one gene in case of the loss or mutation of individual enhancers [10]. In many times several weak enhancers work together instead

(12)

Introduction | 3

of one stronger enhancer, enhancers can even be dispensable because of the replacement of other enhancers with the same function [11]. These redundant enhancers are usually weak and not easily detected, but they are still function- ing. The redundancy in the gene regulatory system provides the much-needed flexibility for the fitness of the organism in unstable environments. In addition, some enhancer-promoter interactions are transient, only essential in certain period or certain tissues, they are rarely be detected in other time or place.

While these transient interactions are often directly linked to essential transient biological processes for the development of cells and tissue, also controlling transcription of genes essential to developmental and disease phenotypes [12].

GM12878 cells are selected for processing Capture Hi-C data. GM12878 cell line is a B-lymphocyte cell line from a female human donor transformed by Epstein-Barr virus (EBV), it was also selected as a tier 1 common cell type in the Encyclopedia of DNA Elements (ENCODE) Project [13], therefore, there are a large amount of sequencing data (more than 300 experiments) available on the GM12878 cell line through the ENCODE Project, including ChIP-seq for TFs and histone marks, Hi-C, DNase-seq, RNA-seq and so on.

We have Capture Hi-C data of GM12878 cells in three replicates. There are 16,148 unique promoter-enhancer interactions detected in all three replicates, and 514,997 interactions seen only in one replicate. All of these interactions are supported by enough supporting pairs (greater than five) and low p value (p < 0.05), so we then ask the question if those interactions detected in only one replicate could be the dispensable enhancers. Our aim is to determine the enhancer potential of both replicated and non-replicated interactions in the dataset, We will perform a systematic comparison of Capture Hi-C interactors using experimentally validated enhancer datasets. Selected enhancer mark datasets are ChIP-seq data of transcription factors, histone modifications, and RNA polymerase II, and DNase-seq data. Selected comparison factors involve supporting pairs, gene expression level, signal of enhancer marks, promoter- enhancer network degree, and interaction distance.

Moreover, although Capture Hi-C maps enhancer elements, the overlap between enhancer datasets produced by ChIP-seq and Capture Hi-C is lower than expected. We also investigated enhancer potential of the well-supported interactions that do not overlap with known enhancer marks. Only a propor- tion of enhancers are known experimentally, so there could be new potential functional enhancers in Capture Hi-C datasets. We aim to investigate enhancer potential of non-overlapping Capture Hi-C distal interactions. We will select potentially functional enhancers in non-overlapping regions of Capture Hi-C dataset, then perform transcription factor binding motif (TFBM) analysis and

(13)

4 | Introduction

gene ontology (GO) term enrichment analysis on these regions.

One way to find putative functional enhancers is to detect expression quantitative trait loci (eQTL)in distal regions. An eQTL is a genetic variant locus that correlate genetic variation with gene expression phenotype [14], their regulating phenotype genes are eQTL genes. TFs bind different DNA regions with different affinity, an alternative regulatory functional variant will change TF binding affinity, leading to the attracting or repelling of TFs, resulting in altered gene regulation. So the candidate TFs which are affected are TFs potential functioning, then the corresponding eQTLs are potentially functional enhancers. Transcription factor binding motif (TFBM)analysis can help to find the affected TFs of each eQTL.

GO enrichment analysis will be performed in order to get the functional profile of these eQTL genes. Gene ontology (GO) [15] is a system for gene annotation and classification, the ontology consists of three terms: cel- lular component, molecular function and biological process, representing the function of genes and gene products. GO enrichment analysis is used to perform enrichment analysis on gene sets, given a set of genes, gene set enrichment analysis (GSEA) with GO annotation dataset is able to find the cellular component, molecular function and biological process that are over/under-represented.

In this study we aim to investigate the enhancer potential of replicated and non-replicated, as well as non-overlapping interactors in Capture Hi-C datasets. We will perform a systematic comparison between Capture Hi-C and ChIP-seq datasets, also TFBM and GO enrichment analysis using eQTL data.

(14)

Materials and Methods | 5

Chapter 2

Materials and Methods

2.1 HiCap data

In previous studies, HiCap were performed on GM12878 cells in three replicates, invalid and duplicated pairs were removed by HICUP, valid pairs were tested for significance by HiCapTools. Cis-interactions (interactions within the same chromosome) are selected from valid pairs as proximities, after removing extra long fragments (>10kbp), supported interactions (inter- actions with at least 5 supporting pairs and p-value < 0.05) are selected from proximities as interactors. There are three replicates in HiCap data, named rep1, rep2, rep3, information of each replicate shown in Table B.1. Our analyses are based on the filtered interactor datasets.

In order to make analysis systematically and orderly, we separated our data as 7 groups based on the number of interactor replicates: interactors only present in rep1, interactors only present in rep2, interactor only present in rep3, interactors only present in rep1 and rep2, interactors only present in rep1 and rep3, interactors only present in rep2 and rep3, interactors present in rep1, rep2 and rep3, named rep1only, rep2only, rep3only, rep1.2, rep1.3, rep2.3, rep1.2.3, respectively. In these 7 groups, rep1only, rep2only and rep3only consisted of interactors present in only one replicate, rep1.2, rep1.3 and rep2.3 consisted of interactors present in only two replicates, rep1.2.3 contained interactors present in all of three replicates, named 1rep, 2reps and 3reps, respectively.

The number of fragment in each group shown in TableB.2

(15)

6 | Materials and Methods

2.2 ChIP-seq data

We downloaded publicly available ChIP-seq datasets of selected enhancer marks generated from GM12878 cell line from ChIP-Atlas, the downloaded datasets were in peak-call (BED) format. The selected enhancer marks con- tained histone modifications including H3K27ac, H3K4me1, H3K4me3 and H3K9me3, all of 96 available transcription factors, RNA polymerase II and DNase-seq. There were 209 transcription factor datasets from 209 different experiments in total, as well as 134 selected histone modification datasets, 15 RNA polymerase II datasets and 7 DNase-seq datasets. For each enhancer mark, all of its ChIP-seq datasets were concated and merged into one BED file. We also merged a BED file of all selected antigens.

We also downloaded ChIP-seq data of selected antigens in BigWig format from ChIP-Atlas, including nine datasets of transcription factors with top overlapping fraction (SRX190236, SRX100583, SRX038504, SRX100431, SRX100432, SRX190337, SRX190349, SRX100408, SRX100576), four data- sets of selected histone marks (SRX038506, SRX038512, SRX038516, SRX- 067415), one RNA polymerase (SRX017983), and one DNase-seq dataset (SRX069089).

2.3 Overlapping of HiCap data with ChIP-seq enhancers

We overlapped HiCap enhancers regions to the merged BED file of each ChIP-seq enhancer mark, and all selected antigens. We did this by BED- tools, and select parameters of finding unique entries to get the fragments overlapping with at least one enhancer mark. The overlapped percentages were calculated simply by dividing the number of overlapped fragments by the number of HiCap enhancers.

2.4 Investigations of overlapped ChIP-seq enhancer marks

We applied deepTools to generate the correlation heatmap and PCA plots of top overlapped transcription factors. First we computed the average score for each enhancer in every genomic region of 10kb, then computed correlation

(16)

Materials and Methods | 7

coefficients in pairwise using Spearman method and visualized by heatmap, finally performed the principal component analysis and made PCA plots.

To generate PCA plots of HiCap enhancers based on overlapping results, we generated a matrix in which each eigen is the overlapping states (over- lapping or not) of ChIP-seq enhancer marks in one HiCap enhancer, then performed principal component analyses on these eigen.

To compare HiCap enhancers only overlapped with histone marks and enhancers only overlapped with TFs, we first got HiCap enhancer regions only overlapped with histone marks and regions only overlapped with TFs by BEDTools, then got the supporting pairs and interaction distance associated with each enhancer region, after that we tested the diversity between two groups by ANOVA method.

2.5 Comparisons between replicated and non-replicated HiCap enhancers

We applied deepTools to generate signal heatmaps of selected enhancer marks in the regions of HiCap detected enhancers. The scores of each en- hancer marks per HiCap enhancer regions were calculated and plotted, we extended 2kb up- and downstream for plotting heatmap and used nearest interpolation method. Three HiCap data groups, rep1_2_3, rep1_2, and rep1only, are used for three heatmaps.

For comparing gene expression level between replicated and non-replicated interactors, we downloaded gene expression datasets of GM12878 cell line, and matched them to genes of HiCap target promoters, then simply visualized their distribution and comparison by box plot. Three gene sets of three HiCap groups, 1rep, 2reps and 3reps are compared.

To comparing the number of promoters the enhancers connecting to, we constructed a gene-promoter-enhancer network using our HiCap data of in- teracting pairs with annotations of enhancer groups (1rep, 2reps and 3reps), then we simply calculated the average degree of each enhancer group followed by comparing them with two-tail t-test. Besides, we grouped enhancers by number of degree (1, 2, 3-5, and >5), and calculated the percentage of each group.

To compare interacting distance distribution between replicated and non- replicated interactors, we got the interacting distances of interactors in 1rep, 2reps and 3reps, then performed a pairwise ANOVA test and made a density plot.

(17)

8 | Materials and Methods

2.6 Comparisons between overlapped and non-overlapped HiCap enhancers

First we overlapped HiCap enhancers to all ChIP-seq datasets, and got two groups of HiCap enhancers, overlapped and non-overlapped.

To compare interacting distance distributions between overlapped and non- overlapped enhancers, we used the same steps as above-mentioned.

To compare supporting pair distributions, we first got the supporting pairs of overlapped HiCap enhancers and non-overlapped HiCap enhancers, then made Q-Q plots using supporting pairs of non-overlapped enhancers as the- oretical datasets and supporting pairs of overlapped enhancers as sample datasets. Finally we performed Kruskal–Wallis test of two groups of support- ing pairs. Q-Q plot and Kruskal–Wallis test were performed for each replicate.

2.7 Transcription factor binding motif analy- sis and gene ontology analysis of HiCap non-overlapped functional regions

We got HiCap non-overlapped functional regions by finding the intersec- tions between HiCap non-overlapped enhancers and GM12878 eQTL datasets.

We downloaded publicly available eQTL data of GM12878 cell line from Genotype-Tissue Expression (GTEx) Portal, using datasets of GTEx Analysis V8 (dbGaP Accession phs000424.v8.p2), with threshold of q-value 0.05.

Then we overlapped HiCap non-overlapped enhancers to GM12878 eQTLs, the overlapped eQTLs are in potentially functional enhancers.

We performed transcription factor binding motif analysis on these eQTL regions, 25bp up- and downstream extended, both on reference eQTL allele and alternative eQTL allele. We applied bioinformatic tool of FIMO (Find individual motif occurrences) in The MEME Suite to find known motifs with occurrences in our extended eQTL regions.

We selected top 50 occurred TFs, and further selected TFs that highly and uniquely expressed in GM12878 cells by GTEx Multi Gene Query tool. Then we got HiCap targeted genes potentially regulated by each TF, performed gene ontology term analysis on these genes of using clusterProfiler.

(18)

Results | 9

Chapter 3 Results

3.1 Overlapping between HiCap datasets and experimentally validated enhancers

The percentages of enhancer overlapping Capture Hi-C interactors become lower as the increasing of ChIP-seq data cutoff according to Fig. 3.1, which means ChIP-seq data of Q < 1e−05 contains the most information. However, the threshold of Q < 1e−20 is selected even 756 informative enhancers are lost compared to the threshold of Q < 1e−05, because the latter cutoff contains the most noise. Another analysis proved the noise (data not shown), histone modifications of H3K4me1 and H3K27ac are often regarded as hallmarks of promoters, only 59% of HiCap regions containing H3K4me1 and H3K27ac are GENCODE promoters using threshold of Q < 1e−05, while nearly 90%

of HiCap regions containing these two marks are true GENCODE promoters by increasing the threshold to Q < 1e−20. Thus, ChIP-seq threshold of Q <

1e−20 is mainly the balance of getting more information and eliminating more noises.

There is no significant bias of overlapping percentages between three replicates according to the overlapping results of all ChIP-seq data and each ChIP-seq antigen, based on ANOVA test of overlapping percentages between groups rep1, rep2 and rep3 (p-value = 0.499).

As the increasing of HiCap data resolution, the overlapping percentage is increasing remarkably (Fig. 3.2). Only 6.31% of enhancers in proxim- ity dataset intersect with ChIP-seq enhancer marks, 12.63% overlapping en- hancers are found in interactor set we are working on, while in interactors present in all three replicates, this percentage rises to 31.7%. This increas- ing trend is also seen in overlapping results of every ChIP-seq enhancer

(19)

10 | Results

mark (Fig.A.1–A.4).

Interactors present in more replicates tend to have larger ChIP-seq overlap- ping proportion. HiCap enhancers in all three replicates overlap to ChIP-seq data in 31.7%, while enhancers in one replicate overlap to ChIP-seq validated enhancers in only 10.82% (Fig.3.2). This result is also suited to single antigen overlapping cases, for example, H3K4me3, a hallmark of enhancers, counts around 2% in HiCap enhancers of only one replicates, while counts 12% in HiCap enhancers of all three replicates, six times the former (Fig.A.1).

Figure 3.1: Percentages of enhancers present in all three replicates overlapped with ChIP-seq data in different thresholds.

Figure 3.2: Overlapped percentages of HiCap enhancers with all ChIP-seq data.

(20)

Results | 11

3.2 Investigations of overlapped ChIP-seq enhancer marks

The ChIP-seq enhancer marks show a strong correlation with each other except CTCF. Fig. 3.3(a) and Fig. 3.3(b) show correlations and clusters of the nine selected transcription factors in all genome regions. ATF2, NFIC and RUNX3 are strongly correlated with each other, SP1, EBF1, BATF and SPI1 also show high correlations, while CTCF has low correlation score in the heatmap, and is located separately from the other TFs in PCA plot.

The separating of CTCF and clustering of the other enhancer marks are also revealed by Fig. 3.4(a)–3.4(h), eight PCA plots showing all eigenvalues and eigenvectors of the top two principal components. All of ChIP-seq enhancer marks, including TFs, histone marks, DHSs and RNApolII, are clustered together except CTCF and negative control H3K9me3 (Fig.3.4(e)).

Enhancer marks except CTCF maintain their clustering manner when increasing data resolution from either replicate to all replicates (Fig. 3.4(b), 3.4(d),3.4(f),3.4(h)). CTCF gets closer to other enhancer marks in rep1_2_3 compared to in either_rep (Fig. 3.4(e),3.4(g)), while CTCF does not display the same manner in Fig.3.4(a) compared with Fig.3.4(c), demonstrating the closer relationship between CTCF with histone modifications rather than other TFs.

In comparisons of supporting pairs and promoter-enhancer distance (Fig.

3.5(a), 3.5(b), 3.5(c)), there are no significant differences between HiCap enhancers overlapping with only histone marks and enhancers overlapping with only transcription factors, further showing the similar correlations among enhancer marks.

(21)

12 | Results

(a) (b)

Figure 3.3: (a) Correlation heatmap of transcription factors with top overlapping fractions. (b) Principle component analysis plot of transcription factors with top overlapping fractions, including the plot of top two principal components eigenvalues and the plot of the amount of variability and cumulative variability for all nine principle component.

(22)

Results | 13

(a) (b)

(c) (d)

(e) (f)

(g) (h)

Figure 3.4: (a–h) Principle component analysis plots of HiCap enhancers, based on overlapping or not. a, b, e, f are PCA plots of all HiCap interactors, c, d, g, h are PCA plots of HiCap interactors present in all three replicates. a, b, c, d contain the information of TF overlapping results (b, d exclude CTCF), e, f, g, h contain the information of overlapping results of all selected ChIP-seq enhancer marks (f, h exclude CTCF).

(23)

14 | Results

(a) (b)

(c)

Figure 3.5: Comparisons of HiCap enhancers overlapping with only histone marks and HiCap enhancers overlapping with only transcription factors. (a, b) Supporting pair comparisons between HiCap enhancers overlapping with only histone marks and HiCap enhancers overlapping with only transcription factors, (a) use enhancer set present in only rep1, (b) use enhancer set present in all replicates. (c) Comparison of promoter – enhancer distance between HiCap enhancers overlapping with only histone marks and HiCap enhancers overlapping with only transcription factors, using enhancer set of either replicate.

(24)

Results | 15

3.3 Comparisons between replicated and non-replicated HiCap enhancers

All enhancer marks show a higher enriched signal in the region of en- hancers present in more replicates except negative control H3K9me3. In Fig. 3.6(a) of enhancers in all three replicates, H3K27ac shows a significant signal enrichment in HiCap enhancer regions compared with up- and down- stream regions, H3K4me1, H3K4me3, EP300 and CTCF are also enriched in enhancer regions. These enhancer marks are less enriched in enhancers presented in only rep1 and rep2 (Fig. 3.6(b)), and much less enriched in enhancers only in rep1 (Fig.3.6(c)).

The genes potentially regulated by HiCap enhancers present in all repli- cates have more transcripts than genes potentially regulated by non-replicated or less-replicated enhancers (Fig. 3.7(a)). Replicated enhancers significantly connect with more promoter nodes than non-replicated enhancers (Fig.3.7(b)).

Although more than 80% of HiCap enhancers are connected with only one promoter, the fraction of enhancers with high degree is significantly higher in replicated interactors (Fig.3.7(c)).

The replicated interacting pairs have significant shorter promoter-enhancer interacting distance (Fig.3.7(d)), the average interacting distances of 1rep, 2reps and 3reps groups are about 200kb, 140kb and 73kb, replicated inter- actions only have half of the interacting distance of non-replicated pairs. Most of HiCap enhancers interact with promoters in relative short range, only 5%

of promoter-enhancer interaction distances are longer than 1mbp, while 96%

of these long-range interactions are non-replicating. However, these long- range interactions have the similar ChIP-seq overlapping fraction as short- range ones, removing those long-range interactions does not effect the overlap ratios.

(25)

16 | Results

(a) (b)

(c)

Figure 3.6: (a, b, c) Heatmaps of signals in the region of HiCap enhancers plus 2kb up- and downstream of enhancer marks including H3K27ac, H3K4me1, H3K4me3, H3K9me3, EP300, CTCF. (a) plots the enhancer mark signals in the region of enhancers present in all three replicates, (b) applies the region of enhancers present only in rep1 and rep2, (c) applies the region of enhancers present only in rep1.

(26)

Results | 17

(a) (b)

(c) (d)

Figure 3.7: (a) Gene expression level distribution in interactors present in one replicate, in two replicates, and in three replicates, including all genes regulated by associated promoters. (b, c) Degree of enhancers in promoter-enhancer network. (b) Average degree of non-replicated and replicated enhancers. (c) Enhancer degree distribution of non-replicated and replicated enhancers. (d) Promoter-enhancer interacting distance distribution of interactors in one replicate, two replicates and three replicates.

(27)

18 | Results

3.4 Comparisons between overlapping and non-overlapping HiCap enhancers

Using supporting pairs of non-overlapping enhancer set as theoretically ex- pected distribution, supporting pairs of overlapping enhancer set do not follow the same distribution as expected one, moreover, the upward trend of the Q–Q plot is steeper than the line y = x, indicating HiCap enhancers overlapped with ChIP-seq data have more supporting pairs than non-overlapping enhancers.

This profile seen in all of three replicates (Fig.3.8(a),3.8(b),3.8(c)). However, the distributions of promoter-enhancer interacting distance are similar between overlapping interactions and non-overlapping interactions (Fig.3.8(d)).

3.5 Investigations of HiCap non-overlapped functional regions

In total of 42,604 eQTLs are found in HiCap interactors which are not overlapped with ChIP-seq data, 7.04% of non-overlapped HiCap interactors contain eQTLs. eQTL overlapping percentages also differ between replicated and non-replicated HiCap interactors (Fig.A.5), around 7% of non-replicated interactors contain eQTLs while among interactors present in all three repli- cates, 12.7% have eQTLs.

265 transcription factor binding motifs are found In these eQTL extended regions (25bp up/downstream eQTLs) by transcription factor binding motif analysis. 85% of eQTL extended regions are bound by one or several of these transcription factors. For 50 TFs with most occurrences, we compare the eQTL genes of TF binding eQTLs, and the promoter associated genes of these eQTL located enhancers, and found only around 14% of the time eQTL gene is its promoter gene.

Among the top 50 TFs by occurrences, IFR4 and IFR8 are highly and very specifically expressed in GM12878 cells. IRF4, especially, has median TPM of 385.0 in GM12878 cells, while the highest median TPM in other tissues is only 26.5 (Fig.3.9(a)). Gene ontology analyses of IRF4 show two significant biological processes (p.adjust < 0.05): cytokinesis and positive regulation of cytokinesis (Fig.3.9(b)), GO analyses of IRF8 give the same results.

(28)

Results | 19

(a) (b)

(c) (d)

Figure 3.8: Comparisons of HiCap enhancers overlapped and not overlapped with ChIP-seq datasets. (a, b, c) Q-Q plots of supporting pair distribution comparison between overlapping and non-overlapping enhancers, supporting pairs of non-overlapping interactors as theoretical datasets, supporting pairs of overlapping interactors as sample datasets. (a) use enhancers in rep1, (b) use enhancers in rep2, (c) in rep3. (d) Promoter-enhancer interacting distance distribution of overlapping interactions and non-overlapping interactions.

(29)

20 | Results

(a)

(b)

Figure 3.9: Investigations of predicted TFs (a)Top 50 TFs by occurrences and their encoded gene expression heatmap in different tissues, generated from GTExPortal. (b) Gene ontology analysis of IRF4, term of biology process.

(30)

Discussion | 21

Chapter 4 Discussion

Replicated HiCap interactors seems to have higher enhancer potential than non-replicated ones. The most powerful reason is the fractions of HiCap enhancers overlapping with ChIP-seq datasets and eQTL datasets are much higher in replicated group than non-replicated groups. The ChIP-seq data of enhancer marks are more likely to be true enhancers by experimental validation, eQTLs are also true functional regions in the same way. So HiCap detected enhancers which are overlapping with these ChIP-seq datasets and eQTLs are more likely to be true enhancers. These true enhancers occupy more fraction in replicated HiCap interactors, actually the fraction of these validated enhancers in replicated interactors is three times than the fraction in non-replicated ones, thus replicated interactors have higher enhancer potential.

Moreover, for every selected enhancer mark, replicated interactors have more overlapping fraction (more than five times with no significant bias except CTCF) than non-replicated ones, which means we can get the same conclusion even some ChIP-seq enhancer marks have higher enhancer potential than the others. This conclusion also derived from signal heatmaps of selected enhancer marks, in which five typical enhancer marks, H3K27ac, H3K4me1, H3K4me3, EP300 and CTCF, are much more enriched in replicated interactors than non-replicated interactors.

The selected ChIP-seq enhancer marks are investigated, and these enhancer marks show a strong correlation with each other except CTCF, suggesting transcription factors and histone marks are not working separately, different transcription factors might bind to the same region even they might work together for regulating a single gene. CTCF is different because CTCF is a structural transcription factor for regulating three-dimensional space of chro- matin rather than regulating gene expression directly. Besides, enhancer

(31)

22 | Discussion

marks except CTCF maintain similar clustering manner between PCA plot of replicated HiCap interactors and PCA plot of non-replicated interactors, while CTCF doesn’t display the same manner but get closer to histone marks, which means CTCF is more likely to be detected in non-replicated interactors than other enhancer marks, suggesting there are many structural elements like CTCF staying in non-replicated interactors, while these structural elements are more likely to overlap with histone marks if they are replicated. Enhancer potential of TF overlapped interactors and histone overlapped interactors are compared, and no significant differences are found in supporting pairs and promoter-enhancer interacting distance, suggesting there might be no differ- ence in accuracy between transcription factors and histone modification for discovery of enhancers.

HiCap gene-promoter-enhancer network analyses support the conclusion of replicated interactors having higher enhancer potential than non-replicated interactors. Replicated HiCap enhancers connect more promoters and genes than non-replicated enhancers, which means replicated enhancers are on aver- age stronger than non-replicated enhancers. The genes connected with repli- cated enhancers are significantly more expressed than genes connected with non-replicated groups, so replicated groups have stronger enhancers which are able to up-regulate gene for higher expression. Besides, the promoter- enhancer interacting distance are compared, replicated interactions only have half of the interacting distance of non-replicated pairs, and 96% of long- range interactions with distances more than 1mp are non-replicating. How- ever, interactors with long interacting distances even more than 1mp have similar overlapping percentages as short range interactors, no differences of enhancer potential are found between long-range interactions and short-range interactions. Thus we cannot conclude the shorter interacting distances in replicated interactors are correlated with higher enhancer potential, interactors with shorter interacting distances are more likely to be replicated might simply because short-range interactions are easier to be detected than long-range ones, since most of HiCap enhancers interact with promoters in relative short range, only 5% of promoter-enhancer interaction distances are longer than 1mbp.

According to ChIP-seq overlapping results, only 13.6% of HiCap inter- actors are overlapping, and only 31.7% of interactors in all three replicate, all interactors are really interacting because they are supported with enough supporting pairs and significant p value, so there must be something bringing promoters and interactors together, although some regions are supported be- cause of they are located close to promoters, there might be potential enhancers beyond ChIP-seq overlapped interactors.

(32)

Discussion | 23

Supporting pairs and distance distributions are compared between over- lapping and non-overlapping interactors. Overlapping interactors have more supporting pairs than non-overlapping interactors, even these two have similar distance distributions. These might demonstrate overlapping interactors have higher enhancer potential than non-overlapping interactors. Supporting pairs are effected by two types of factors as stated above, one is true promoter- enhancer interactions, the other is other factors like interacting distance. De- spite of effect of distance since overlapping and non-overlapping interactors have similar distance distributions, the reason why overlapping interactors have more supporting pairs the overlapping interactors are more likely to be true enhancers with higher enhancer potential.

In order to answer the question of are there potential enhancers beyond ChIP-seq overlapped interactors, overlapping HiCap interactors to eQTLs are performed and 7.04% of non-overlapped HiCap interactors are found to con- tain in total of 42604 eQTLs, demonstrating there are a lot of potential func- tional regions detected by HiCap which have not been experimentally found and validated by ChIP-seq before. On these eQTLs regions, 265 transcription factor binding motifs are found by transcription factor binding motif analyses, 85% of eQTL regions are bound by one or several of these transcription factors, showing most of these HiCap interactors containing eQTLs are potential to be true enhancers to bind with TFs. At most of the time, genes of eQTLs are not same as the genes regulated by eQTL located enhancers, suggesting eQTL genes might lead to wrong enhancer regulated genes.

Among the top 50 TFs by occurrences, IFR4 and IFR8 have the highest and most specific expression in GM12878 cells. Interferon regulatory factor 4 (IFR4) and interferon regulatory factor 8 (IFR8) belong to Interferon reg- ulatory factor (IRF) family, these two homologous transcription factors are found to act as critical regulators for B cell function and differentiation [16].

Those HiCap non-overlapping enhancer regions which are bound with IFR4 and IFR8 are found to be significantly associated with cytokinesis and positive regulation of cytokinesis by gene ontology analysis. This gives a slight that IRF4 and IRF8 might help guide cell division in B cell development.

(33)

24 | Future Perspectives

Chapter 5

Future Perspectives

In this study we only investigated the potential of HiCap enhancer roughly, we can not achieve to determine the enhancer potential by ChIP-seq over- lapping, there are up to 1600 proteins in the human genome presumed to act as transcription factors, only 200 of them from ChIP-Atlas database are investigated in this study, therefore we do not know the binding sites of transcription factors that are yet to be investigated. Moreover, the redundant and transient enhancers are still hard to investigate, for example, the higher degree and higher gene expression of replicated enhancers can be explained as, the stronger enhancers are more likely to be replicated, there still might be weak enhancers with essential functions. So it would be better to find a way to determine an enhancer potential in future study.

In this study we concluded that there is evidence that interactors not overlapping with experimental validated regulatory datasets are also potential to be true enhancers. So it is worth to investigate and explore these genomic regions to get hints of novel regulatory elements.

(34)

REFERENCES | 25

References

[1] A. Barski, S. Cuddapah, K. Cui, T.-Y. Roh, D. E. Schones, Z. Wang, G. Wei, I. Chepelev, and K. Zhao, “High-resolution profiling of histone methylations in the human genome,” Cell, vol. 129, no. 4, pp. 823–837, 2007.

[2] L. Song and G. E. Crawford, “Dnase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells,” Cold Spring Harbor Protocols, vol. 2010, no. 2, pp.

pdb–prot5384, 2010.

[3] N. D. Heintzman, R. K. Stuart, G. Hon, Y. Fu, C. W. Ching, R. D. Hawkins, L. O. Barrera, S. Van Calcar, C. Qu, K. A. Ching et al., “Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome,” Nature genetics, vol. 39, no. 3, pp. 311–318, 2007.

[4] Z. Wang, C. Zang, J. A. Rosenfeld, D. E. Schones, A. Barski, S. Cuddapah, K. Cui, T.-Y. Roh, W. Peng, M. Q. Zhang et al.,

“Combinatorial patterns of histone acetylations and methylations in the human genome,” Nature genetics, vol. 40, no. 7, pp. 897–903, 2008.

[5] R. Eckner, M. E. Ewen, D. Newsome, M. Gerdes, J. A. DeCaprio, J. B.

Lawrence, and D. M. Livingston, “Molecular cloning and functional analysis of the adenovirus e1a-associated 300-kd protein (p300) reveals a protein with properties of a transcriptional adaptor.” Genes &

development, vol. 8, no. 8, pp. 869–884, 1994.

[6] A. P. Boyle, S. Davis, H. P. Shulha, P. Meltzer, E. H. Margulies, Z. Weng, T. S. Furey, and G. E. Crawford, “High-resolution mapping and characterization of open chromatin across the genome,” Cell, vol.

132, no. 2, pp. 311–322, 2008.

(35)

26 | REFERENCES

[7] P. Sahlén, I. Abdullayev, D. Ramsköld, L. Matskova, N. Rilakovic, B. Lötstedt, T. J. Albert, J. Lundeberg, and R. Sandberg, “Genome- wide mapping of promoter-anchored interactions with close to single- enhancer resolution,” Genome biology, vol. 16, no. 1, pp. 1–13, 2015.

[8] L. A. Pennacchio, W. Bickmore, A. Dean, M. A. Nobrega, and G. Bejerano, “Enhancers: five essential questions,” Nature Reviews Genetics, vol. 14, no. 4, pp. 288–295, 2013.

[9] A. Denker and W. De Laat, “The second decade of 3c technologies:

detailed insights into nuclear organization,” Genes & development, vol. 30, no. 12, pp. 1357–1382, 2016.

[10] M. Osterwalder, I. Barozzi, V. Tissières, Y. Fukuda-Yuzawa, B. J.

Mannion, S. Y. Afzal, E. A. Lee, Y. Zhu, I. Plajzer-Frick, C. S.

Pickle et al., “Enhancer redundancy provides phenotypic robustness in mammalian development,” Nature, vol. 554, no. 7691, pp. 239–243, 2018.

[11] S. Barolo, “Shadow enhancers: frequently asked questions about distributed cis-regulatory information and enhancer redundancy,” Bioes- says, vol. 34, no. 2, pp. 135–141, 2012.

[12] A. S. Nord, M. J. Blow, C. Attanasio, J. A. Akiyama, A. Holt, R. Hosseini, S. Phouanenavong, I. Plajzer-Frick, M. Shoukry, V. Afzal et al., “Rapid and pervasive changes in genome-wide enhancer usage during mammalian development,” Cell, vol. 155, no. 7, pp. 1521–1531, 2013.

[13] Genome.gov, “Encode project common cell types,”

2020. [Online]. Available: https://www.genome.gov/

encode-project-common-cell-types

[14] R. C. Jansen and J.-P. Nap, “Genetical genomics: the added value from segregation,” TRENDS in Genetics, vol. 17, no. 7, pp. 388–391, 2001.

[15] G. O. Consortium, “The gene ontology project in 2008,” Nucleic acids research, vol. 36, no. suppl_1, pp. D440–D444, 2008.

[16] R. Lu, “Interferon regulatory factor 4 and 8 in b-cell development,”

Trends in immunology, vol. 29, no. 10, pp. 487–492, 2008.

(36)

Appendix A: Supplementary Figures | 27

Appendix A

Supplementary Figures

Figure A.1: Overlapped percentages of HiCap enhancer groups with ChIP-seq data of selected histone modifications.

(37)

28 | Appendix A: Supplementary Figures

Figure A.2: Overlapped percentages of HiCap enhancer groups with ChIP-seq data of transcription factors, showing top 9 TFs.

(38)

Appendix A: Supplementary Figures | 29

Figure A.3: Overlapped percentages of HiCap enhancer groups with ChIP-seq data of RNApol II.

Figure A.4: Overlapped percentages of HiCap enhancer groups with DNase- seq data.

(39)

30 | Appendix A: Supplementary Figures

Figure A.5: Percentage of HiCap enhancers overlapped with eQTLs.

(40)

Appendix B: Supplementary Tables | 31

Appendix B

Supplementary Tables

Number of Pairs Number of Proximities Number of Interactions

rep1 112622565 14543392 435131

rep2 75272694 10855325 265614

rep3 40887305 5492541 112242

Table B.1: Infos of three HiCap replicates, including numbers of pairs, proximities and filtered interactions.

Groups Number of frags

1rep 514977

2reps 124783

3reps 16148

rep1only 298755 rep2only 136141 rep3only 80081 rep1.2 108770

rep1.3 11458

rep2.3 4555

rep1.2.3 16148 interactors 655908 proximities 21465453

Table B.2: Number of interacting pairs in different HiCap groups.

(41)

www.kth.se

References

Related documents

More targets of these TFs could be found by comparative analysis of the promoter regions and the functional annotations of the remaining 19,000 genes i.e., an attempt can be made

These classes extend the Limited classes, so a pro- grammer developing a Scania strong named application do not need to create two objects to get access to all methods in a

In order to better understand tumour cell characteris- tics in primary colon cancers associated with tumour cell dissemination, and disease recurrence, the aim of this study was

Therefore, another type of protein column, SUPELCO SigmaCrom GFC-100 protein Column was used, which made it possible to perform a separation with a low ion-strength

To examine whether the predicted effects of the TPH2 SNP rs4570625 (-703 G/T) genotype and infants’ attentional bias to fearful facial expressions are dependent on early

At the time of autopsy, however, the pathologist would (as mentioned in “Methods”) request for additional information, at least a minimum of pertinent data as to enable a proper

Validation of AP001056.1 and ICOSLG Levels by Reverse Transcription Quantitative PCR (RT-qPCR) RT-qPCR was performed to investigate gene expression and the correlation

Industrial Emissions Directive, supplemented by horizontal legislation (e.g., Framework Directives on Waste and Water, Emissions Trading System, etc) and guidance on operating