UPTEC X 11 026
Examensarbete 30 hp Maj 2011
Computational analysis of transcriptional regulation in proximal versus
distal normal colon
Sarah Engman
Bioinformatics Engineering Program
Uppsala University School of Engineering
UPTEC X 11 026 Date of issue 2011-06
Author
Sarah Engman
Title (English)
Computational analysis of transcriptional regulation in proximal versus distal normal colon
Title (Swedish)
Abstract
Factors involved in transcriptional regulation in the normal colon were investigated computationally. Genes differentially expressed between distal and proximal colon were identified and their promoter structures were analysed for presence of TATA-boxes, CpG islands and transcription factor binding sites. The CpG methylation status of the genes were experimentally verified in colon cancer cell lines and compared between the two groups of differentially expressed genes. A difference in methylation frequency in colon cancer cell lines was detected between genes up-regulated and down-regulated in normal proximal colon.
Keywords
Colon, transcriptional regulation, cancer, CpG methylation, transcription factor binding sites Supervisors
Neil Saunders
CSIRO Mathematics, Informatics and Statistics
Scientific reviewer
Claes Andersson
Department of Medical Sciences, Cancer Pharmacology and Informatics, Uppsala University
Project name Sponsors
Language
English
Security
ISSN 1401-2138 Classification
Supplementary bibliographical information
Pages
48
Biology Education Centre Biomedical Center Husargatan 3 Uppsala
Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 471 4687
Computational analysis of transcriptional regulation in proximal versus distal normal colon
Sarah Engman
Populärvetenskaplig sammanfattning
Kolorektalcancer är en av de vanligaste cancertyperna i länder med hög
levnadsstandard. För att förstå orsakerna bakom sjukdomen och öka chanserna att detektera den i tid bedrivs det idag omfattande forskning inom området. Eftersom tjocktarmen är ett långt organ varierar också egenskaperna mellan tarmens olika delar. En förståelse för vilka naturliga skillnader som finns kan i sin tur öka förståelsen för cancer i de olika segmenten.
I detta arbete har ett antal olika beräkningsmetoder använts för att hitta mönster som kan förklara skillnaden mellan två av tarmens delar. Beräkningarna har utgått från genuttrycksdata från frisk vävnad för två tarmsegment där vardera genens
promotersekvens har extraherats från databaser tillgängliga online. Med hjälp av algoritmer, genuttrycksdatabaser och manuella jämförelser har tänkbara faktorer såsom transkiptionsfaktorsbindingsplatser och metylering av CpG-öar på dessa sekvenser analyserats.
Examensarbete 20p
Civilingenjörsprogrammet Bioinformatik
Contents
1 Introduction 2
1.1 Project goal . . . . 2
2 Background, from biology to computational approaches 2 2.1 The colon . . . . 2
2.2 Gene regulation . . . . 3
2.3 Computational motif detection methods . . . . 4
2.4 Microarray analysis . . . . 5
2.5 Gene Ontology . . . . 5
3 Materials and methods 6 3.1 The dataset . . . . 6
3.2 Differential expression analysis . . . . 6
3.3 Analysis based on Gene Ontology . . . . 6
3.4 Network analysis . . . . 6
3.5 Extracting the upstream sequences . . . . 6
3.6 Novel motif discovery . . . . 7
3.7 Motif scanning . . . . 7
3.8 Classification of genes based on occurences of motifs . . . . 11
3.9 CpG Island detection . . . . 11
3.10 CpG methylation analysis . . . . 12
4 Results and discussion 12 4.1 Differentially expressed genes between distal and proximal colon . . . . 12
4.1.1 Function and chromosomal position . . . . 17
4.1.2 Differences in function based on Gene Ontology . . . . 19
4.2 Identifying interactions between differentially expressed genes . . . . 19
4.3 Promoter sequence analysis . . . . 24
4.3.1 Promoter structure . . . . 24
4.3.2 Ab initio motif discovery . . . . 28
4.3.3 Transcription factor binding site analysis . . . . 29
4.3.4 Classification based on presence of sequence specific motifs . . . . 33
4.3.5 Discussion . . . . 34
4.4 CpG methylation analysis . . . . 36
5 Conclusion 37
6 References 41
1 Introduction
Gene expression patterns have been observed to change along the colon from the proximal (right) towards the distal (left) colon [38]. Understanding this variation can be of great interest in further comprehending colon cancer development and prevention. Since gene expression in normal colon exhibits this change in gene expression pattern depending on location, it is very important that the location of the cancer is taken into account when comparing its gene expression with normal samples.
Gene expression can be regulated on different levels with transcriptional regulation as one of the initial steps. Binding of transcription factors to the genes’ promoters and change of CpG methylation are important factors involved in regulating the amount of transcripts in a cell.
The change in gene expression along the colon could possibly be explained by transcription factor binding site patterns present in genes highly expressed in one segment but missing in genes highly expressed in the other. Changes in CpG methylation can also result in differing gene expression patterns since this methylation works to supress transcription. Therefore low expression in one segment could possibly be explained by methylation of its promoter region.
Due to cost and time consumption of biological experiments scientists today have to rely on computational methods in parts of their research. These methods can be used as an indication of valuable experiments to perform in the future, or as a way of finally analyzing the data. Here computational methods are used to find potential transcription factor binding sites which can be evaluated experimentally. Also output from microarray experiments are validated in a statistical manner to find differentially expressed genes between proximal and distal colon which will provide useful information for future work.
1.1 Project goal
The main goal of this project is the identification of factors generating gene expression pat- terns in the colon. Emphasis will be put on the understanding of what agents are needed to control the differential expression of genes between proximal and distal colon. This in- volves promoter analysis, CpG island investigation and chromatin structure appreciation.
The upstream regions for genes of interest will be gathered and analysed for the presens of sequence specific motifs and CpG island methylation status. The results can then be used as guidelines when in the future designing experiments to unravel the complex machinery of transcriptional regulation in not only normal tissues but also cells undergoing changes related to cancer development.
2 Background, from biology to computational approaches
2.1 The colon
The colon in mammals can be divided into 5 parts, namely the caecum, ascending colon,
transverse colon, descending colon and sigmoid colon. Further, these parts can be grouped
together resulting in two major segments, the proximal colon consisting of the cecum, ascend-
ing and transverse colon, and distal colon consisting of the remaining two. The distal and
proximal colon are from different embryonic origins with most of the distal being derived from
the hindgut while a majority of the proximal instead is derived from the midgut. Differences
between the proximal colon involves size and blood supply. The colon is a long organ with
an average length of 1.5 m and the diameter will decrease when moving proximal to distal [24]. The proximal colon has a multilayered capillary network whereas the distal only has a single-layered network. This could explain the fact that the proximal colon has a greater ability to absorb water and transport electrolytes [3].
2.2 Gene regulation
Even though different cell types can differ in both appearance and function they all contain the same DNA. Therefore the cells rely on various regulatory factors to be able to differentiate into a specific type of cell or tissue. Gene regulation in the cell can be performed on different levels as transcriptional, translational and post-translational. Factors involved in RNA processing i.e. splicing, RNA transport and mRNA degradation are also known to play a role in this complex network [1].
DNA in the human genome forms very precise structures called chromosomes. The DNA helix wrapes around histones which are then packed into chromatin. For transcription to occur this condensed structure needs to be resolved and the DNA double helix opened. Therefore there exists several ways for the cell to control its start, speed and amount of transcription.
Transcriptional initiation involves the binding of PolII to the promoter together with the TATA-binding protein (TBP) and general transcription factors [40]. In all, this complex machinery in eukaryotes involves the binding of up to 100 subunits [1]. TFIID is the general transcription factor responsible for localization of the TATA-box. Its subunit TBP binds the TATA-box resulting in enabling of binding of another general transcription factor, TFIIB.
Not until these factors and a few others have bound the DNA can RNA Polymerase take its place. Now the energy dependent step begins with the opening of the double helix followed by elongation [1].
A promoter is located upstream from the gene transciption start and is usually of length around 100 bp. Many promotors contain a TATA-box which serves as a transcription initiation helper, whereas transcription factor binding sites are important for the speed of transcription.
It has been shown that mutations in a promoters regulatory elements can seriously decrease the speed of transcription, whereas mutations elsewhere in the promoter have no effect at all [49].
Gene regulatory proteins have the ability to recognise and bind specific patterns in the major groove of the DNA helix. There are thousand of these patterns, also called motifs, identified in different organisms. The transcription factor binds the motif by forming hydro- gen bonds, ionic bonds and hydrophobic interactions with the DNA bases. These proteins can be divided into different groups depending on their structure and composition. Such groups include: helix-turn-helix, zinc fingers and zipper-type. These proteins can either act as activators or repressors and thereby control the rate of gene expression and the amount of transcripts synthesised in the cell at a specific time point.
In the light of epigenetics chromatin structure plays a crucial role. Factors such as post-
translational modifications of histones, histone variants and nucleosome occupancy serve to
modify the access to specific DNA fragments. The chromatin structure is composed of DNA
wrapped around histones where the histones consist of three types of proteins, H2A, H2B, H3
and H4. Methylation at H3K4, lysine 4, is believed to be correlated with active promoters
[62] whereas di and tri methylation at H3K9, lysine 9, works as a transcriptional repressor
[69]. It has been shown that H3K4me3 is found in 75% of promoters in human embryonic
stem cells [31].
Many vertebrate promoters are associated with so called CpG islands. Normally, the CpG dinucleotide occurs at a very low frequency but in promoter regions the frequency is a lot higher. An important aspect is the methylation of the CpG dinucleotide which is thought to be used as a way of turning genes on or off. Methylation of DNA residues is reliant on different types of DNA methyltransferases(DNMT). DNMT3A and DNMT3B bind un-methylated DNA [16] whereas DNMT1 prefers partially methylated sequences. An un- methylated dinucleotide is often found in promotors of genes frequently expressed, whereas CpG islands are more uncommon in tissue specific genes [11]. However, recent research has identified several examples of CpG methylation being a factor in transcriptional regulation, see [33] for review. A change in CpG methylation patterns can be the cause of diseases as cancer [74]. Interestingly, CpG methylation is correlated with histone modifications. CpG islands which have not been subjected to methylation correspond to regions with histone modifications as H3K4 [71]. The CXXC finger protein 1, Cfp1, has been shown to bind unmethylated CpGs in the mouse brain and its presence seems to be correlated by H3K4me3 [71].
Regulatory information as transcription factor binding sites and presens of CpG-islands can be found at the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu). It is a human genome browser containing information on assemblys, genes, SNPs, regulatory features and repeats [36]. Half of the 31 annotation tracks available are provided by UCSC while the rest are from various contributors. The site comes with a DAS server, enabling easy access to annotation data. Data repositories such as this are important sources of genome data and annotations, which are essential for studies of this type.
2.3 Computational motif detection methods
There are several different ab initio motif discovery methods available with MEME [6], GLAM2 [25] and AlignACE [59] being a few examples. They differ in their use of align- ment algorithms using either Gibbs sampling or Expectation Maximization (EM). Glam2 uses extenden Gibbs sampling and has previously been shown to identify a motif bound by Lmo2 when given a series of DNA sequences known to be bound by this complex [25].
Two important parameters when running this type of analysis is the number of runs, i.e.
number of times the method is repeated, and iterations in each run. Ideally you would want the motifs for all runs, or a majority, to be identical or at least similar. If that is not the case, pre processing of the data might be crucial. One such technique is to remove too similar sequences before the motif scan, and can be done using the software tool called purge [51].
It looks for sequences in the fasta file that share a sequence similarity greater than a user defined cutoff. It calculates the maximal segment pair score for every pair of sequences and iteratively removes sequences with a score above the cutoff, resulting in a set of sequences with MSP score lower than the user defined cutoff. The MSP score is defined as the highest scoring pair between two sequences, where the segments are of equal lengths and uses cost matrixes as BLOSUM62 for proteins, +5/-1 (5 for match, -1 for mismatch) for DNA or other methods prefered by the user [2]. There are no real guidelines in chosing the value for the cutoff so repeated tries will instead have to be used.
Another approach to finding transcription factor binding sites would be to search the
sequences for occurences of previously known motifs. Databases such as JASPAR [12] and
TRANSFAC [47] contains matrix models for transcription factor binding sites. These models,
often represented as Position Weight Matrices, can then be used to find segments in DNA
sequences with a high affinity to specific transcription factors. The JASPAR database contains motifs for experimentally validated published transcription factor binding sites [12]. Tools like FIMO (http://meme.sdsc.edu/meme/fimo-intro.html), Consite [61] and UCSC [56] have well-developed algorithms to match motif models to sequences of DNA. FIMO calculates log-odds scores for each motif in all positions in the input DNA sequences and then uses a dynamic programming algorithm to convert the scores to p-values. The Benjamini and Hochberg method is then used to convert p-values into q-values and only matches with a q-value smaller than a user defined threshold is returned. Consite uses PWM to score the squences and these scores are then converted to a normalized score. For more info see [61].
For UCSC a sequence is considered having a transcription factor binding site if the specific site is conserved among human, rat and mouse. A score is calculated based on the TRANSFAC Matrix database [47].
2.4 Microarray analysis
Microarrays are a very powerful tool when investigating expression levels of thousands of genes at once. There are different techniques involved depending on the manufacturer, probe synthesis method and sample preparation. Two-colour microarrays contains probes corre- sponding to oligonucleotides, cDNA or mRNA of genes present under a specific condition.
Fluorescently labeled cDNA from samples under comparison are hybridised to the probes on the chip resulting in detectable expression. The two samples are labeled with different colours and therefore each probe on the chip will have either of the four colours red, green, yellow or black depending on the expression of that gene in the two samples. For example, if geneA is expressed in sample number 1 but not in number 2, that probe will be coloured red or green depending on the dye for that sample. If instead geneA is expressed in both samples that probe will be yellow. The image is then analysed to from colour intensities expression levels will arise.
One-colour microarrays are in contrast only subjected to one sample each. The hybrydis- ation of the samples to the probes are measured to give a gene expression level per probeset.
These values are then normalized using an appropriate technique as RMA or MAS5. This technique has the great advantage of being comparable over different experiments as long as the samples have been processed in a similar manner. One such example is the Affymetrix gene chip.
2.5 Gene Ontology
One way of classifying a gene by function is to use the Gene Ontology [4]. It is visualized as a
tree structure where every gene will have a specific GO identifier on a detailed level and will
then belong to the group of GO identifiers further up in the tree. In other words, one gene will
have ancestor GO identifiers grouping them together with gene products performing similar
tasks. At the topmost level there are three different domains, namely cellular component,
biological process and molecular function. Depending on area of interest GO identifiers for a
specific gene can be chosen from these groupings.
3 Materials and methods
3.1 The dataset
The dataset used to find differential expression between proximal and distal colon was a gene expression data set obtained from Genelogic(Gaithersburg, USA), containing 475 sam- ples from varying parts of the intestine and disease stages. The microarray experiment was conducted using two Affymetrix platforms, the HGU133A and the HGU133B. In total the dataset contained 71 samples from normal proximal and 97 from normal distal colon. More information can be found in [38].
3.2 Differential expression analysis
After normalization was performed using RMA [38], the limma package [67] in Bioconductor was used to make a pairwise comparison between the two colon segments. A design matrix was created and using the lmFit method a linear model was fitted for each gene over all the arrays. The limma method eBayes ranks the genes according to probability of being differentially expressed and the topTable method with BH as method adjust the p-values for multiple testing. BH (Benjamini-Hochberg) is the default adjustment method which takes into consideration the false discovery rate making it a powerful method. The probeids were then converted into gene names and entrez id using the Bioconductor packages hgu133a.db and hgu133b.db respectively [15].
3.3 Analysis based on Gene Ontology
Using the Bioconductor GO package [42], hgu133a.db and hgu133b.db packages the most detailed GO id was collected for each gene of interest, in this case all differentially expressed genes between distal and proximal colon.In this case the ontolgy of interest is from the bio- logical process group. The genes were then grouped together into groups of biological process using the Bioconductor goProfiles package [60].
3.4 Network analysis
The network analysis was undertaken using the online tool STRING-Known and Predicted Protein-Protein Interactions [34] with default parameters. STRING contains network in- formation for proteins based on experimental data and textmining. The analysis was first performed using only genes differentially expressed higher in distal colon than in proximal with a log fold change larger than 1, then only genes differentially expressed higher in proximal colon than in distal with a log fold change larger than 1, and finally all differentially expressed genes under the criteria p<0.05. Following this every genes was analyzed individually and manually curated to produce a network possible to explain the difference in gene expression levels between distal and proximal colon.
3.5 Extracting the upstream sequences
Upstream sequences for all differentially expressed genes were extracted using the UCSC table
browser (http://genome.ucsc.edu/). Both a 1000 bp upstream sequence to 100 bp downstream
from transcription start site and 2000bp upstream for every gene of interest was saved in fasta
format for further analysis. The gene table used was refGene for retrieval of refseq genes.
3.6 Novel motif discovery
To discover possible motifs the GLAM2 software tool [25] was used. The 2000 bp up- stream sequences were first masked from repeats using the online tool RepeatMasker (http://www.repeatmasker.org) with cross match as search engine and the other alternatives as default. The database scanned is the Repbase Update [35], containing over 3000 types of repeates for several eukaryotic species. Number of runs were set to 10000 and the rest of the settings to its default values. The data was preprocessed using purge and the cutoff was 80.
The motifs were then used as input using the TOMTOM package [32] to compare the motif to a database of already known motifs.
3.7 Motif scanning
The upstream sequences were scanned for the presence of known motifs using the FIMO pack- age, Consite [61] and the UCSC DAS server. All motifs from JASPAR core was downloaded in flatfile format from the JASPAR download page
1and the ones from human were seper- ated and then converted into meme format using the convert tool jaspar2meme. FIMO was downloaded in a standalone version from http://meme.nbcr.net/downloads/ and run with the command > fimo meme format genes.fasta. Only matches for motifs on the same strand, found in human and with a q-value smaller than 0.5 were included when using FIMO. The representation of strand possession is whether start is greater than or smaller than stop.
Consite was run from the web-interface http://asp.ii.uib.no:8090/cgi- bin/CONSITE/consite?rm=t input single including only transcription factor profiles with a minimum specificity of 10 bits and the output was manually saved in a text file and parsed to only include human motifs and matches on the same strand.
To be able to query the UCSC DAS server each gene’s position had to be presented in bed format(Table 1,Table 2), here for the hg18 assembly. It was then run through the DAS server and the XML output files were parsed using the Perl module XML Simple. The table queried was tfbsConsSites and the table tfbsConsFactors was used to convert name to transfac accession. Since the UCSC table containing conserved transcription factor binding sites is derived from position matrixes from Transfac, the identical motifs with different names between Transfac and JASPAR were translated manually. In Table 3 a list of motifs only present in Transfac can be found and in Table 4 the same motifs with different names are presented.
After scanning the up and down regulated sequences through the JASPAR database, the results were vizalised as both motifs per gene or genes per motif using the Graphvis module in Perl.
1At http://jaspar.genereg.net/html/DOWNLOAD/jaspar CORE/non redundant/by tax group/vertebrates/