Computational analysis of transcriptional

(1)

UPTEC X 11 026

Examensarbete 30 hp Maj 2011

Computational analysis of transcriptional regulation in proximal versus

distal normal colon

Sarah Engman

(2)

(3)

Bioinformatics Engineering Program

Uppsala University School of Engineering

UPTEC X 11 026 Date of issue 2011-06

Author

Sarah Engman

Title (English)

Computational analysis of transcriptional regulation in proximal versus distal normal colon

Title (Swedish)

Abstract

Factors involved in transcriptional regulation in the normal colon were investigated computationally. Genes differentially expressed between distal and proximal colon were identified and their promoter structures were analysed for presence of TATA-boxes, CpG islands and transcription factor binding sites. The CpG methylation status of the genes were experimentally verified in colon cancer cell lines and compared between the two groups of differentially expressed genes. A difference in methylation frequency in colon cancer cell lines was detected between genes up-regulated and down-regulated in normal proximal colon.

Keywords

Colon, transcriptional regulation, cancer, CpG methylation, transcription factor binding sites Supervisors

Neil Saunders

CSIRO Mathematics, Informatics and Statistics

Scientific reviewer

Claes Andersson

Department of Medical Sciences, Cancer Pharmacology and Informatics, Uppsala University

Project name Sponsors

Language

English

Security

ISSN 1401-2138 Classification

Supplementary bibliographical information

Pages

48 Biology Education Centre Biomedical Center Husargatan 3 Uppsala

Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 471 4687

(4)

(5)

Computational analysis of transcriptional regulation in proximal versus distal normal colon

Sarah Engman

Populärvetenskaplig sammanfattning

Kolorektalcancer är en av de vanligaste cancertyperna i länder med hög

levnadsstandard. För att förstå orsakerna bakom sjukdomen och öka chanserna att detektera den i tid bedrivs det idag omfattande forskning inom området. Eftersom tjocktarmen är ett långt organ varierar också egenskaperna mellan tarmens olika delar. En förståelse för vilka naturliga skillnader som finns kan i sin tur öka förståelsen för cancer i de olika segmenten.

I detta arbete har ett antal olika beräkningsmetoder använts för att hitta mönster som kan förklara skillnaden mellan två av tarmens delar. Beräkningarna har utgått från genuttrycksdata från frisk vävnad för två tarmsegment där vardera genens

promotersekvens har extraherats från databaser tillgängliga online. Med hjälp av algoritmer, genuttrycksdatabaser och manuella jämförelser har tänkbara faktorer såsom transkiptionsfaktorsbindingsplatser och metylering av CpG-öar på dessa sekvenser analyserats.

Examensarbete 20p

Civilingenjörsprogrammet Bioinformatik

(6)

(7)

1 Introduction 2

1.1 Project goal . . . . 2

2 Background, from biology to computational approaches 2 2.1 The colon . . . . 2

2.2 Gene regulation . . . . 3

2.3 Computational motif detection methods . . . . 4

2.4 Microarray analysis . . . . 5

2.5 Gene Ontology . . . . 5

3 Materials and methods 6 3.1 The dataset . . . . 6

3.2 Differential expression analysis . . . . 6

3.3 Analysis based on Gene Ontology . . . . 6

3.4 Network analysis . . . . 6

3.5 Extracting the upstream sequences . . . . 6

3.6 Novel motif discovery . . . . 7

3.7 Motif scanning . . . . 7

3.8 Classification of genes based on occurences of motifs . . . . 11

3.9 CpG Island detection . . . . 11

3.10 CpG methylation analysis . . . . 12

4 Results and discussion 12 4.1 Differentially expressed genes between distal and proximal colon . . . . 12

4.1.1 Function and chromosomal position . . . . 17

4.1.2 Differences in function based on Gene Ontology . . . . 19

4.2 Identifying interactions between differentially expressed genes . . . . 19

4.3 Promoter sequence analysis . . . . 24

4.3.1 Promoter structure . . . . 24

4.3.2 Ab initio motif discovery . . . . 28

4.3.3 Transcription factor binding site analysis . . . . 29

4.3.4 Classification based on presence of sequence specific motifs . . . . 33

4.3.5 Discussion . . . . 34

4.4 CpG methylation analysis . . . . 36

5 Conclusion 37

6 References 41

(8)

1 Introduction

Gene expression patterns have been observed to change along the colon from the proximal (right) towards the distal (left) colon [38]. Understanding this variation can be of great interest in further comprehending colon cancer development and prevention. Since gene expression in normal colon exhibits this change in gene expression pattern depending on location, it is very important that the location of the cancer is taken into account when comparing its gene expression with normal samples.

Gene expression can be regulated on different levels with transcriptional regulation as one of the initial steps. Binding of transcription factors to the genes’ promoters and change of CpG methylation are important factors involved in regulating the amount of transcripts in a cell.

The change in gene expression along the colon could possibly be explained by transcription factor binding site patterns present in genes highly expressed in one segment but missing in genes highly expressed in the other. Changes in CpG methylation can also result in differing gene expression patterns since this methylation works to supress transcription. Therefore low expression in one segment could possibly be explained by methylation of its promoter region.

Due to cost and time consumption of biological experiments scientists today have to rely on computational methods in parts of their research. These methods can be used as an indication of valuable experiments to perform in the future, or as a way of finally analyzing the data. Here computational methods are used to find potential transcription factor binding sites which can be evaluated experimentally. Also output from microarray experiments are validated in a statistical manner to find differentially expressed genes between proximal and distal colon which will provide useful information for future work.

1.1 Project goal

The main goal of this project is the identification of factors generating gene expression pat- terns in the colon. Emphasis will be put on the understanding of what agents are needed to control the differential expression of genes between proximal and distal colon. This in- volves promoter analysis, CpG island investigation and chromatin structure appreciation.

The upstream regions for genes of interest will be gathered and analysed for the presens of sequence specific motifs and CpG island methylation status. The results can then be used as guidelines when in the future designing experiments to unravel the complex machinery of transcriptional regulation in not only normal tissues but also cells undergoing changes related to cancer development.

2 Background, from biology to computational approaches

2.1 The colon

The colon in mammals can be divided into 5 parts, namely the caecum, ascending colon,

transverse colon, descending colon and sigmoid colon. Further, these parts can be grouped

together resulting in two major segments, the proximal colon consisting of the cecum, ascend-

ing and transverse colon, and distal colon consisting of the remaining two. The distal and

proximal colon are from different embryonic origins with most of the distal being derived from

the hindgut while a majority of the proximal instead is derived from the midgut. Differences

between the proximal colon involves size and blood supply. The colon is a long organ with

(9)

an average length of 1.5 m and the diameter will decrease when moving proximal to distal [24]. The proximal colon has a multilayered capillary network whereas the distal only has a single-layered network. This could explain the fact that the proximal colon has a greater ability to absorb water and transport electrolytes [3].

2.2 Gene regulation

Even though different cell types can differ in both appearance and function they all contain the same DNA. Therefore the cells rely on various regulatory factors to be able to differentiate into a specific type of cell or tissue. Gene regulation in the cell can be performed on different levels as transcriptional, translational and post-translational. Factors involved in RNA processing i.e. splicing, RNA transport and mRNA degradation are also known to play a role in this complex network [1].

DNA in the human genome forms very precise structures called chromosomes. The DNA helix wrapes around histones which are then packed into chromatin. For transcription to occur this condensed structure needs to be resolved and the DNA double helix opened. Therefore there exists several ways for the cell to control its start, speed and amount of transcription.

Transcriptional initiation involves the binding of PolII to the promoter together with the TATA-binding protein (TBP) and general transcription factors [40]. In all, this complex machinery in eukaryotes involves the binding of up to 100 subunits [1]. TFIID is the general transcription factor responsible for localization of the TATA-box. Its subunit TBP binds the TATA-box resulting in enabling of binding of another general transcription factor, TFIIB.

Not until these factors and a few others have bound the DNA can RNA Polymerase take its place. Now the energy dependent step begins with the opening of the double helix followed by elongation [1].

A promoter is located upstream from the gene transciption start and is usually of length around 100 bp. Many promotors contain a TATA-box which serves as a transcription initiation helper, whereas transcription factor binding sites are important for the speed of transcription.

It has been shown that mutations in a promoters regulatory elements can seriously decrease the speed of transcription, whereas mutations elsewhere in the promoter have no effect at all [49].

Gene regulatory proteins have the ability to recognise and bind specific patterns in the major groove of the DNA helix. There are thousand of these patterns, also called motifs, identified in different organisms. The transcription factor binds the motif by forming hydro- gen bonds, ionic bonds and hydrophobic interactions with the DNA bases. These proteins can be divided into different groups depending on their structure and composition. Such groups include: helix-turn-helix, zinc fingers and zipper-type. These proteins can either act as activators or repressors and thereby control the rate of gene expression and the amount of transcripts synthesised in the cell at a specific time point.

In the light of epigenetics chromatin structure plays a crucial role. Factors such as post-

translational modifications of histones, histone variants and nucleosome occupancy serve to

modify the access to specific DNA fragments. The chromatin structure is composed of DNA

wrapped around histones where the histones consist of three types of proteins, H2A, H2B, H3

and H4. Methylation at H3K4, lysine 4, is believed to be correlated with active promoters

[62] whereas di and tri methylation at H3K9, lysine 9, works as a transcriptional repressor

[69]. It has been shown that H3K4me3 is found in 75% of promoters in human embryonic

stem cells [31].

(10)

Many vertebrate promoters are associated with so called CpG islands. Normally, the CpG dinucleotide occurs at a very low frequency but in promoter regions the frequency is a lot higher. An important aspect is the methylation of the CpG dinucleotide which is thought to be used as a way of turning genes on or off. Methylation of DNA residues is reliant on different types of DNA methyltransferases(DNMT). DNMT3A and DNMT3B bind un-methylated DNA [16] whereas DNMT1 prefers partially methylated sequences. An un- methylated dinucleotide is often found in promotors of genes frequently expressed, whereas CpG islands are more uncommon in tissue specific genes [11]. However, recent research has identified several examples of CpG methylation being a factor in transcriptional regulation, see [33] for review. A change in CpG methylation patterns can be the cause of diseases as cancer [74]. Interestingly, CpG methylation is correlated with histone modifications. CpG islands which have not been subjected to methylation correspond to regions with histone modifications as H3K4 [71]. The CXXC finger protein 1, Cfp1, has been shown to bind unmethylated CpGs in the mouse brain and its presence seems to be correlated by H3K4me3 [71].

Regulatory information as transcription factor binding sites and presens of CpG-islands can be found at the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu). It is a human genome browser containing information on assemblys, genes, SNPs, regulatory features and repeats [36]. Half of the 31 annotation tracks available are provided by UCSC while the rest are from various contributors. The site comes with a DAS server, enabling easy access to annotation data. Data repositories such as this are important sources of genome data and annotations, which are essential for studies of this type.

2.3 Computational motif detection methods

There are several different ab initio motif discovery methods available with MEME [6], GLAM2 [25] and AlignACE [59] being a few examples. They differ in their use of align- ment algorithms using either Gibbs sampling or Expectation Maximization (EM). Glam2 uses extenden Gibbs sampling and has previously been shown to identify a motif bound by Lmo2 when given a series of DNA sequences known to be bound by this complex [25].

Two important parameters when running this type of analysis is the number of runs, i.e.

number of times the method is repeated, and iterations in each run. Ideally you would want the motifs for all runs, or a majority, to be identical or at least similar. If that is not the case, pre processing of the data might be crucial. One such technique is to remove too similar sequences before the motif scan, and can be done using the software tool called purge [51].

It looks for sequences in the fasta file that share a sequence similarity greater than a user defined cutoff. It calculates the maximal segment pair score for every pair of sequences and iteratively removes sequences with a score above the cutoff, resulting in a set of sequences with MSP score lower than the user defined cutoff. The MSP score is defined as the highest scoring pair between two sequences, where the segments are of equal lengths and uses cost matrixes as BLOSUM62 for proteins, +5/-1 (5 for match, -1 for mismatch) for DNA or other methods prefered by the user [2]. There are no real guidelines in chosing the value for the cutoff so repeated tries will instead have to be used.

Another approach to finding transcription factor binding sites would be to search the

sequences for occurences of previously known motifs. Databases such as JASPAR [12] and

TRANSFAC [47] contains matrix models for transcription factor binding sites. These models,

often represented as Position Weight Matrices, can then be used to find segments in DNA

(11)

sequences with a high affinity to specific transcription factors. The JASPAR database contains motifs for experimentally validated published transcription factor binding sites [12]. Tools like FIMO (http://meme.sdsc.edu/meme/fimo-intro.html), Consite [61] and UCSC [56] have well-developed algorithms to match motif models to sequences of DNA. FIMO calculates log-odds scores for each motif in all positions in the input DNA sequences and then uses a dynamic programming algorithm to convert the scores to p-values. The Benjamini and Hochberg method is then used to convert p-values into q-values and only matches with a q-value smaller than a user defined threshold is returned. Consite uses PWM to score the squences and these scores are then converted to a normalized score. For more info see [61].

For UCSC a sequence is considered having a transcription factor binding site if the specific site is conserved among human, rat and mouse. A score is calculated based on the TRANSFAC Matrix database [47].

2.4 Microarray analysis

Microarrays are a very powerful tool when investigating expression levels of thousands of genes at once. There are different techniques involved depending on the manufacturer, probe synthesis method and sample preparation. Two-colour microarrays contains probes corre- sponding to oligonucleotides, cDNA or mRNA of genes present under a specific condition.

Fluorescently labeled cDNA from samples under comparison are hybridised to the probes on the chip resulting in detectable expression. The two samples are labeled with different colours and therefore each probe on the chip will have either of the four colours red, green, yellow or black depending on the expression of that gene in the two samples. For example, if geneA is expressed in sample number 1 but not in number 2, that probe will be coloured red or green depending on the dye for that sample. If instead geneA is expressed in both samples that probe will be yellow. The image is then analysed to from colour intensities expression levels will arise.

One-colour microarrays are in contrast only subjected to one sample each. The hybrydis- ation of the samples to the probes are measured to give a gene expression level per probeset.

These values are then normalized using an appropriate technique as RMA or MAS5. This technique has the great advantage of being comparable over different experiments as long as the samples have been processed in a similar manner. One such example is the Affymetrix gene chip.

2.5 Gene Ontology

One way of classifying a gene by function is to use the Gene Ontology [4]. It is visualized as a

tree structure where every gene will have a specific GO identifier on a detailed level and will

then belong to the group of GO identifiers further up in the tree. In other words, one gene will

have ancestor GO identifiers grouping them together with gene products performing similar

tasks. At the topmost level there are three different domains, namely cellular component,

biological process and molecular function. Depending on area of interest GO identifiers for a

specific gene can be chosen from these groupings.

(12)

3 Materials and methods

3.1 The dataset

The dataset used to find differential expression between proximal and distal colon was a gene expression data set obtained from Genelogic(Gaithersburg, USA), containing 475 sam- ples from varying parts of the intestine and disease stages. The microarray experiment was conducted using two Affymetrix platforms, the HGU133A and the HGU133B. In total the dataset contained 71 samples from normal proximal and 97 from normal distal colon. More information can be found in [38].

3.2 Differential expression analysis

After normalization was performed using RMA [38], the limma package [67] in Bioconductor was used to make a pairwise comparison between the two colon segments. A design matrix was created and using the lmFit method a linear model was fitted for each gene over all the arrays. The limma method eBayes ranks the genes according to probability of being differentially expressed and the topTable method with BH as method adjust the p-values for multiple testing. BH (Benjamini-Hochberg) is the default adjustment method which takes into consideration the false discovery rate making it a powerful method. The probeids were then converted into gene names and entrez id using the Bioconductor packages hgu133a.db and hgu133b.db respectively [15].

3.3 Analysis based on Gene Ontology

Using the Bioconductor GO package [42], hgu133a.db and hgu133b.db packages the most detailed GO id was collected for each gene of interest, in this case all differentially expressed genes between distal and proximal colon.In this case the ontolgy of interest is from the bio- logical process group. The genes were then grouped together into groups of biological process using the Bioconductor goProfiles package [60].

3.4 Network analysis

The network analysis was undertaken using the online tool STRING-Known and Predicted Protein-Protein Interactions [34] with default parameters. STRING contains network in- formation for proteins based on experimental data and textmining. The analysis was first performed using only genes differentially expressed higher in distal colon than in proximal with a log fold change larger than 1, then only genes differentially expressed higher in proximal colon than in distal with a log fold change larger than 1, and finally all differentially expressed genes under the criteria p<0.05. Following this every genes was analyzed individually and manually curated to produce a network possible to explain the difference in gene expression levels between distal and proximal colon.

3.5 Extracting the upstream sequences

Upstream sequences for all differentially expressed genes were extracted using the UCSC table

browser (http://genome.ucsc.edu/). Both a 1000 bp upstream sequence to 100 bp downstream

from transcription start site and 2000bp upstream for every gene of interest was saved in fasta

format for further analysis. The gene table used was refGene for retrieval of refseq genes.

(13)

3.6 Novel motif discovery

To discover possible motifs the GLAM2 software tool [25] was used. The 2000 bp up- stream sequences were first masked from repeats using the online tool RepeatMasker (http://www.repeatmasker.org) with cross match as search engine and the other alternatives as default. The database scanned is the Repbase Update [35], containing over 3000 types of repeates for several eukaryotic species. Number of runs were set to 10000 and the rest of the settings to its default values. The data was preprocessed using purge and the cutoff was 80.

The motifs were then used as input using the TOMTOM package [32] to compare the motif to a database of already known motifs.

3.7 Motif scanning

The upstream sequences were scanned for the presence of known motifs using the FIMO pack- age, Consite [61] and the UCSC DAS server. All motifs from JASPAR core was downloaded in flatfile format from the JASPAR download page

¹

and the ones from human were seper- ated and then converted into meme format using the convert tool jaspar2meme. FIMO was downloaded in a standalone version from http://meme.nbcr.net/downloads/ and run with the command > fimo meme format genes.fasta. Only matches for motifs on the same strand, found in human and with a q-value smaller than 0.5 were included when using FIMO. The representation of strand possession is whether start is greater than or smaller than stop.

Consite was run from the web-interface http://asp.ii.uib.no:8090/cgi- bin/CONSITE/consite?rm=t input single including only transcription factor profiles with a minimum specificity of 10 bits and the output was manually saved in a text file and parsed to only include human motifs and matches on the same strand.

To be able to query the UCSC DAS server each gene’s position had to be presented in bed format(Table 1,Table 2), here for the hg18 assembly. It was then run through the DAS server and the XML output files were parsed using the Perl module XML Simple. The table queried was tfbsConsSites and the table tfbsConsFactors was used to convert name to transfac accession. Since the UCSC table containing conserved transcription factor binding sites is derived from position matrixes from Transfac, the identical motifs with different names between Transfac and JASPAR were translated manually. In Table 3 a list of motifs only present in Transfac can be found and in Table 4 the same motifs with different names are presented.

After scanning the up and down regulated sequences through the JASPAR database, the results were vizalised as both motifs per gene or genes per motif using the Graphvis module in Perl.

1At http://jaspar.genereg.net/html/DOWNLOAD/jaspar CORE/non redundant/by tax group/vertebrates/

(14)

Table 1: BED format containing the three required fields chromosome, start position and stop postion.

chr2 217205372 217206472 NM 000597 upstream + chr2 234265251 234266351 NM 001072 upstream + chr2 79240287 79241387 NM 002580 upstream - chr2 79200092 79201192 NM 002909 upstream + chr2 113118998 113120098 NM 005415 upstream + chr2 234301512 234302612 NM 019093 upstream + chr3 197426780 197427880 NM 152672 upstream + chr5 525980 527080 NR 024158 upstream - chr8 6901569 6902669 NM 021010 upstream - chr12 99390810 99391910 NM 005123 upstream + chr12 22668343 22669443 NM 018638 upstream + chr13 52499973 52501073 NM 006418 upstream + chr15 88158976 88160076 NM 001150 upstream - chr16 25134786 25135886 NM 001169 upstream + chr21 41609531 41610631 NM 058186 upstream +

In this case the regions represented are the 1000 bp upstream from transcription start to

100 downstream for genes IGFBP2, UGT1A6, REG3A, REG1A, SLC20A1, UGT1A3,

OSTalpha, LOC25845, DEFA5, NR1H4, ETNK1, OLFM4, ANPEP, AQP8 and FAM3B

(15)

Table 2: BED format containing the three required fields chromosome, start position and stop postion.

chr1 20179419 20180519 NM 000300 upstream - chr1 67039427 67040527 NM 005478 upstream - chr1 203803735 203804835 NM 181644 upstream + chr2 219990343 219991443 NM 001927 upstream + chr2 216008936 216010036 NM 002026 upstream - chr2 162717060 162718160 NM 002054 upstream -

chr4 6745467 6746567 NM 005980 upstream +

chr5 147422728 147423828 NM 001127698 upstream + chr6 130798955 130800055 NM 052913 upstream + chr7 100398624 100399724 NM 001164462 upstream + chr15 43330720 43331820 NM 004212 upstream + chr15 49420005 49421105 NM 181789 upstream + chr17 39437263 39438363 NM 004160 upstream - chr17 44161010 44162110 NM 006361 upstream - chr17 44154781 44155881 NM 032391 upstream - chr20 43530808 43531908 NM 006103 upstream + chr21 42659613 42660713 NM 003225 upstream - chr21 40160217 40161317 NM 006198 upstream + chr21 30510240 30511340 NM 199328 upstream - chr22 36305870 36306970 NM 006498 upstream -

In this case the regions represented are the 1000 bp upstream from transcription start to 100 downstream for genes PLA2G2A, INSL5, MFSD4, DES, FN1, GCG, S100P, SPINK5, TMEM200A, MUC12, SLC28A2, GLDN, PYY, HOXB13, PRAC, WFDC2, TFF1, PCP4,

CLDN8 and LGALS2

(16)

T ab le 3: Motif s p re sen t in T ran sf ac with out e q uiv alence in JAS P AR Ahr Ap-2gamm a AREB6 A TF A TF6 A TF-2 Bac h2 C/EBP alp ha C/EBPb eta Chx10 CP1A CP1C CUTL1 E2F E2F -1 E2F -2 E2F -3a E2F 4 E2F -5 E4BP 4 Egr -1 Egr -2 Egr -3 F A C1 F O XJ2 F O X O4 F ra-1 GCNF-1 GCNF-2 GR-b eta ISG F- 3 Ju nB Ju nD LUN-1 Max 1 Mazr ME F-2A NF-Y Nkx2-2 Nkx6-1 P ax-2 P ax-5 Pb x1a PO U2F1 PO U3F2 PO U3F2(N-o c t- 5a) PO U3F2(N-o c t- 5b ) PP AR-gam ma1 PP AR-gam ma2 RFX1 PO P alp ha1 RP58 RSRF C4 SRE B P -1a SRE B P -1b SRE B P -1c ST A T1 ST A T5A ST A T5B TFI ID

(17)

Table 4: Motifs with different names in Transfac and JASPAR Transfac name JASPAR name

C-Myc MYC::MAX

CREB CREB1

deltaCREB CREB1

C-rel REL

MZF-1 MZF1 1-4

AP-1 AP1

C-Fos AP1

C-Jun AP1

COUP-TF1 NR2F1

NF-1 NF1C

NF-kappaB1 NFKB1

P53 TP53

GR-alpha NR3C1

HNF-4alpha2 HNF4A

IRF-1 IRF1

LCR-F1 NFE2L1:MafG

NF-YA NFYA

RREB-1 RREB1

3.8 Classification of genes based on occurences of motifs

Three different classification schemas were created. One takes into consideration if a gene is present or not in the promoter region, Table 5. The second includes the number of matches per motif in each promoter, Table 6. The last includes the highest score per motif in each promoter, Table 7. The classification was then performed based on scheme 1, 2 and 3 using the R package rpart [70].

3.9 CpG Island detection

To examine the occurence of CpG islands in the promoters of genes of interest ‘the CpGProD [55] tool was used. This was performed on the sequences after masking of repetitive elements since many of them, for example Alus, contain methylated CpG islands not responsible for transcription initiation [54]. A similar analysis was also conducted using UCSC where a CpG island has the constraint of having a GC content of at least 50%, a length of 200 bp or more

Table 5: Classification scheme 1: for genes using the presence of all known motifs in their upstream regions.

Motif 1 Motif 2 Motif 3 ... Motif n

Gene 1 0 1 0 0

Gene 2 1 1 0 1

...

Gene m 1 0 1 0

(18)

Table 6: Classification scheme 2: for genes using the number of occurences of all known motifs in their upstream regions.

Motif 1 Motif 2 Motif 3 ... Motif n

Gene 1 3 1 0 4

Gene 2 1 3 0 2

...

Gene m 1 0 5 0

Table 7: Classification scheme 3: for genes using the highest score for each known motif in their upstream regions.

Motif 1 Motif 2 Motif 3 ... Motif n

Gene 1 5.39 2.67 - -

Gene 2 - 11.2 - 2.39

...

Gene m 4.67 - - 7.98

and a CpG ratio of 0.6 or greater.

3.10 CpG methylation analysis

To determine the colon segment identity for colon cancer cell lines HCT-116, HT29 and SW- 680, the GEO dataset GSE5720 was used. The arrays were normalized and expression levels for PRAC (the gene that best classifies proximal and distal colon) were plotted for all cell lines together with a prostate cancer cell line as control.

Genome-wide methylation status was collected on the colon cancer cell lines HCT-116, HT- 29 and SW-480 using a novel CpG-methylation assay developed at CSIRO called SuBLIME (manuscript under review). It involves cutting genomic DNA at CpG sites using a restric- tion enzyme, and sequencing the resulting DNA fragments using high-throughput sequencing methods. When the fragments are mapped back to the reference Human genome sequence (GRCh37/hg19), precise locations of methylated CpG dinucleotides are obtained. For this study, methylation counts for the putative promoter regions of genes of interest (5kb upstream and 2kb downstream of the transcription start site) were obtained. The raw mapped sequence data were converted into pseudo counts, using the R library edgeR [58]. These were then av- eraged for each gene, and used in subsequent boxplots and analysis. Genes with expression values higher than 10 in both distal and proximal was used as a high control and genes with expression values lower than 4 in both distal and proximal was used as a low control.

4 Results and discussion

4.1 Differentially expressed genes between distal and proximal colon In total there were 1905 differentially expressed probesets between distal and proximal colon.

Out of these probesets, 1070 were higher expressed in distal than proximal colon and 835

were higher in proximal. From now on the expression in the distal colon will be used as a

reference and genes up-regulated in that segment will be called up-regulated genes and genes

(19)

Figure 1: Number of genes differentially expressed between proximal and distal colon.

up-regulated in proximal colon will be called down-regulated genes. For further analyses the genes with a log2 fold change of bigger than one or smaller than negative one was also saved.

A complete list of differentially expressed genes under this criterion can be found in Table 8

and Table 9. In total there were 24 probesets differentially expressed higly in proximal colon

and 23 in distal under this criterion. See Figure 4.1 for summary. Average expression for these

genes in the two colon segments can be found in Figure 2. The genes in the up-regulated

group has an average expression of around 8 in distal colon and 6.5 in proximal. Those

numbers for down-regulated genes are 6 in distal colon and 7.5 in proximal. A more distinct

difference in gene expression between proximal and distal colon is found when looking at the

genes with the highest fold change, in this case PRAC for up-regulated genes and LOC25845

for down-regulated genes, see Figure 3.

(20)

A B

Figure 2: Average gene expression for (A) up-regulated genes and (B) down-regulated genes in proximal and distal colon.

A B

Figure 3: Boxplot for gene expression for (A) PRAC and (B) LOC25845 over all samples.

(21)

T ab le 8: Complete li st of p rob e sets up -r e gu late d in d istal colon with a log fol d chan ge of b igger than 1 (compared to p ro ximal) and a adj usted p-v alue smaller than 0. 05. pr ob eID geneAc c geneSym b ol en trez ID logF C t P .V al ue adj .P .V al 230784 at BG498699 PRA C 84366 3.464 16.551 2.771e- 37 6.274-33 209844 at U57052 HO XB13 10481 1.382 13.552 8.483e- 29 6.301e- 25 230105 at BF062550 HO XB13 10481 1.259 12.939 4.203e- 27 1.904e- 23 203892 at NM 006103 WFD C 2 10406 1.116 12.467 1.020e- 25 4.547e- 22 230360 at A W006648 GLDN 342035 1.167 11.434 7.942e- 23 1.998e- 19 214598 at AL049977 CLDN8 9073 2.189 11.284 2.279e- 22 6.348e- 19 221091 at NM 005478 INSL5 10022 1.868 10.830 4.289e- 21 8.355e- 18 229254 at BE550027 MF SD4 148808 1.241 10.381 7.258e- 20 1.494e- 16 226654 at AF147790 MUC12 10071 1.594 8.969 5.293e- 16 6.308e- 13 207249 s at NM 004212 SLC28A2 9153 1.091 8.960 5.795e- 16 6.796e- 13 205185 at NM 006846 SP INK5 11005 1.310 8.925 7.173e- 16 7.992e- 13 206422 at NM 002054 GCG 2641 1.758 7.984 2.072e- 13 1.282e- 10 234994 at AA088177 KIAA1913 114801 1.069 7.648 1.442e- 12 1.306e- 09 211719 x at BC 005858 FN1 2335 1.076 7.134 2.732e- 11 9.821e- 09 204351 at NM 005980 S100P 6286 1.358 7.066 3.993e- 11 1.390e- 08 212464 s at X02761 FN1 2335 1.040 6.841 1.381e- 10 4.273e- 08 208450 at NM 006498 LGALS 2 3957 1.003 6.688 3.164e- 10 8.822e- 08 207080 s at NM 004160 PYY 5697 1.465 6.499 8.704e- 10 2.155e- 07 205549 at NM 006198 PCP 4 5121 1.115 5.621 7.674e- 08 1.179e- 05 205009 at NM 003225 TFF1 7031 1.090 5.572 9.772e- 08 1.452e- 05 203649 s at NM 000300 PLA2G 2A 5320 1.035 5.292 3.712e- 07 4.545e- 05 202222 s at NM 001927 DE S 1674 1.055 3.583 4.432e- 4 0.013

(22)

T ab le 9: C omp le te list of genes up -r e gu lated in pr o x imal colon with a log fold change smaller than min u s 1 (c ompar e d to di stal) an d a adj uste d p -v alu e sm al le r th an 0.05. pr ob eID geneAc c geneSym b ol en trez ID logF C t P .V al ue adj .P .V al 222262 s at AL137750 ETNK1 55500 -1.969 -15.404 5.105e- 34 6.166e- 30 219017 at NM 018638 ETNK1 55500 -2.282 -15.392 5.534e- 34 6.166e- 30 225458 at BF528646 LOC25845 25845 -2.745 -14.885 1.253e- 32 1.419e- 28 225457 s at BF528646 LOC25845 25845 -2.288 -14.015 3.635e- 30 2.743e- 26 224453 s at BC 006111 ETNK1 55500 -1.197 -13.277 4.564e- 28 2.584e- 24 229230 at AA702685 OS T alph a 200931 -1.492 -12.549 5.436e- 26 1.759e- 22 226432 at AI276880 ETNK1 55500 -1.421 -12.200 5.348e- 25 1.514e- 21 206340 at NM 005123 NR1H4 9971 -1.176 -11.734 1.225e- 23 4.550e- 20 201920 at NM 005415 SLC20A1 6574 -1.219 -10.823 4.499e- 21 8.355e- 18 225290 at A V692425 ETNK1 55500 -1.350 -10.705 8.977e- 21 2.032-17 227194 at BF106962 F AM 3B 54097 -1.595 -9.617 9.324e- 18 1.408e- 14 222943 at A W235567 GBA3 57733 -1.279 -9.600 1.038e- 17 1.469e- 14 231576 at AA829940 ETNK1 55500 -1.335 -9.442 2.801e- 17 3.731e- 14 202888 s at NM 001150 ANPE P 290 -1.682 -9.286 7.726e- 17 1.148e- 13 202718 at NM 000597 IGFBP2 3485 -1.089 -9.095 2.529e- 16 3.523e- 13 208596 s at NM 019093 UGT1A3 54659 -1.008 -8.332 2.624e- 14 2.008e- 11 207529 at NM 021010 DE F A5 1670 -1.575 -7.307 1.032e- 11 4.107e- 09 215125 s at A V691323 UGT1A6 54578 -1.036 -6.977 6.515e- 11 2.135e- 08 205815 at NM 002580 REG 3A 5068 -1.070 -5.997 1.183e- 08 2.273e- 06 209752 at AF172331 REG 1A 5967 -1.471 -5.499 1.389e- 07 1.997e- 05 230494 at AI671885 SLC20A1 6574 -1.027 -5.262 4.240e- 07 1.143e- 4 212768 s at AL390736 OLFM 4 10562 -1.332 -4.345 2.396e- 05 0.001 206784 at NM 001169 A QP8 343 -1.089 -4.154 5.169e- 05 0.003

(23)

4.1.1 Function and chromosomal position

The chromosomal position of these genes can be interesting when looking at possible coex- pression and regulation. In Table 10 and Table 11 a chromosomal positional overview can be found. It can be seen that closely positioned genes are from the same family and probably are under the same transcriptional control, as UGT1A3 and UGT1A6. REG1A and REG3A are also positioned closely together but on different strands.

The gene with the largest fold change is PRAC which encodes a nuclear protein that is found in prostate, rectum, and distal colon. The reason for its expression pattern is unknown, but these tissues are derived from the same embryonic cell line. PRAC contains five possible phosphorylation sites, making it a possible factor in regulation [43]. Interestingly, PRAC and HOXB13 share a similar chromosomal location as well as expression pattern making it possible for them to be controlled by the same transcriptional factors. The promoter region of HOXB13 is known to have a CpG island methylated in some cancer types but interstingly un-methylated in normal colon and colorectal cancer [27]. WFDC2 is a secreted protein containing two WAP domains and eight disulfides [21]. It has been shown to be overly expressed in different tumour cell lines, for example colon [9]. CLDN8 protein Claudin-8 is an integral membrane protein that is down regulated in colorectal tumour tissue [30].

INSL5 is already known to be expressed in colon [18] and is believed to activate G-protein coupled receptors (GPCR)135 and GPCR142 [41]. PLA2G2A has been shown to be down- regulated in human colon cancer [20] and functions to supress tumorgenesis [45]. GCG codes for the glucagon protein that can be cleaved into 8 different chains, glucagon, glicentin, glicentin-related polypeptide, oxyntomodulin, glucagon-like peptide 1, glucagon-like peptide 1(7-37), glucagon-like peptide 1(7-36) and glucagon-like peptide 2. GLPs can be found in L-cells in the distal intestine and stimulate insulin release during feeding as well as inhibits glucagon secretion [7]. PYY is as GLP expressed in L-cells in the distal colon and they both inhibit apetite [7]. PYY is believed to be up-regulated by gastric acid secretion and inhibited by gastrin [29]. Gastrin is then inhibited by somatostatin via the sst2 receptor [75].

Somatostatin is up-regulated by PDX1, a gene that here is up-regulated in the proximal colon.

FN1 encode fibronectin, a protein that is a major component of the extracellular matrix. It activates THBS1 and is known to bind several proteins as integrins and collagen. THBS1 is differentially expressed under the criteria p-value smaller than 0.05 in the up-regulated group.

Desmin is the gene product of DES and is found in muscle cells connecting myofibrils to each

other. The calcium binding protein S100P is on the other hand upregulated in cancer tissue

and is believed to play an important role in cancer development [26]. SPINK5 encodes the

proteinase inhibitor LEKTI [50]. TMEM200A, the transcript from KIA1913 does not have

a defined function. MUC12 is a membrane bound mucin protein that is downregulated in

Crohns disease [48]. SLC28A2 encodes the nucleoside transporter CNT2 and is regulated by

several factors [53]. PCP4, purkinje cell protein 4 is a calmodulin binding protein expressed in

neurons [66]. TFF1, trefoil factor 1 is normally expressed in the stomach but has expression

has also been identified in inflammatory bowel disease goblet cells [64]. It is regulated by

oestrogen binding to an enhancer in the gene’s upstream region from -428 to -324. This

region contains an imperfect palindromic estrogen responsive element ERE [8]. LGALS2 is a

S-lac lectin that is present in the cytosol in cells in the intestine [17]. The upstream region

contains two TATA boxes, a binding site for transcription factor Sp1 and one for Ap1 [28].The

REG1A protein is a lectin that is up-regulated in colorectal cancer adenoma and tumor tissue

[5]. As with REG1A, REG3A is also a secreted lectin with an adjacent chromosomal position.

(24)

Table 10: Chromosomal position for up regulated genes.

Chromosome start stop strand gene name

chr1 20174515 20179496 - PLA2G2A

chr1 203804636 203838669 + MFSD4

chr1 67036012 67039527 - INSL5

chr2 162707638 162717159 - GCG

chr2 215933408 216009140 - FN1

chr2 219991343 219999705 + DES

chr4 6745697 6749797 + S100P

chr5 147423728 147497118 + SPINK5

chr6 130728572 130805901 + KIAA1913

chr7 100399624 100448940 + MUC12

chr15 43331725 43355425 + SLC28A2

chr15 49421169 49487502 + GLDN

chr17 39385637 39437363 - PYY

chr17 44154083 44154883 - PRAC

chr17 44157124 44161119 - HOXB13

chr20 43531760 43543586 + WFDC2

chr21 30508195 30510262 - CLDN8

chr21 40161113 40223192 + PCP4

chr21 42655460 42659772 - TFF1

chr22 36296201 36308569 - LGALS2

Table 11: Chromosomal position for down regulated genes.

Chromosome start stop strand gene name

chr2 79200996 79204053 + REG1A

chr2 79237640 79240387 - REG3A

chr2 113119905 113137875 + SLC20A1

chr2 217205796 217237404 + IGFBP2

chr2 234302493 234346684 + UGT1A3

chr2 234264992 234346688 + UGT1A6

chr3 197422755 197454446 + OSTalpha

chr4 22303645 22430290 + GBA3

chr5 523626 526177 - LOC25845

chr8 6900241 6901666 - DEFA5

chr12 22669276 22734875 + ETNK1

chr12 99391810 99481772 + NR1H4

chr13 52500831 52524197 + OLFM4

chr15 88129124 88159098 - ANPEP

chr16 25135743 25147762 + AQP8

chr21 41598068 41651524 + FAM3B

(25)

SLC20A1 is as the up-regulated gene SLC28A2 a sodium dependent transporter. It’s promoter region consists of a GC rich region and no TATA box but instead a CCAAT box [52]. IGFBP2, insulin-like growth factor-binding protein 2 is a secreted growth factor binding protein. It’s promoter region has been identified in Jurkat K16 cells and is believed to lack TATA and CCAAT boxes but instead being GC rich [10]. UGT1A3, UDP-glucuronosyltransferase 1- 3 is part of the UDP-glucuronosyltransferase (UGT) family that catalyzes the process of glucuronidation. The gene is believed to be regulated by LXRα together with SRC-1α and NCoR co-factors andit seems to be the -183bp position that is occupied with a LXR/RXR complex [73]. OSTA is part of a Ost-α/Ost-β complex responsible for bile acid transport and activated by the nuclear receptor FXR [39]. Ost-β is also a differentially expressed gene in this context, with a logarithmic fold change of -0.537. Found no information at all for LOC25845. DEFA5 encodes Defensin-5, an alpha-defensin family protein involved in antibacterial activity [23].ETNK1, ethanolamine kinase 1 is an ethanolamine kinas involved in the CDP-ethanolamine pathway. Overexpression of ETNK1 has been shown to accelerate ethanolamine kinase activity resulting in elevated phosphatifylethanolamine synthesis [44]. A synonym for NR1H4 is FXR, a protein as previously described responsible for activation of OSTA. SInce both ar eup-regulated in distal colon, it is likely that NR1H4 is expressed and then causing over-expression of OSTA. OLFM4 is a secreted protein that is highly expressed in colorectal carcinomas [72]. ANPEP is a membrane bound protein that has been shown to be regulated by different promoter regions in different cell types [65]. AQP8 encodes a protein from the aquaporin family responsible for water transportation through membranes.

FAM3B is a secreted cytokine present in the pancreas believed to have part in the apaptosis of in β-cells [14].

4.1.2 Differences in function based on Gene Ontology

Important processes for both up and down regulated genes involve cellular process, biological regulation and regulation of biological process, see Figure 4. In particular there doesn’t seem to be a huge difference in overall function between the two groups of differentially expressed genes.

4.2 Identifying interactions between differentially expressed genes

Network analysis is a useful tool to find interactions between genes or groups of genes. Ideally the gene product for genes in one group will work as activators for genes in the same group and inhibit genes in the other group. In Figure 5 interactions between genes in the same regulatory group can be found. There are three clusters among the up-regulated genes WFDC2, S100P, HOXB13 and PRAC, FN1 and DES and PYY, GCG and NSL5. HOXB13 and PRAC are here connected based on their proximity in chromosomal position [43]. S100P and HOXB13 have both been shown to be differentially expressed in the prostate gland during puberty compared to adult [19] and S100P is further connected to WFDC2 due to their up-regulation in prostate after surgical influence in normal and cancer [63]. PYY and GCG are both involved in regulation of apetite by delaying gastric emptying [13].

Among the down-regulated genes there are two clusters, one with REG3A,

REGL(REG1A), OLFM4 and DEFA5 and the other with UGT1A1, NR1H4 ans OSTA. The

most important interaction is here OSTA that is activated by the bile acid receptor NR1H4

[37].

(26)

A B

C D

Figure 4: Functional profiles based on Gene Ontology for (A) up-regulated genes on HGU133A

(B) up-regulated genes on HGU133B (C) down-regulated genes on HGU133A (D) down-

regulated genes on HGU133B.

(27)

A B

Figure 5: Network analysis for (A) up-regulated genes, (B) down-regulated genes.Yellow line indicates textmining, purple line experimental evidence and blue cooccurence.

So far the most valuable information is the activation of OSTA by NR1H4. This means

that for OSTA to be up-regulated something needs to activate NR1H4. To get a bigger

picture all genes significant under the criterion p<0.05 were analysed and only genes with

interactions were displayed(Figure 6). From this network a smaller network was built by

only including factors which activate or inhibit other genes. In Figure 7 one can see that

PDX1 activates REG3A, INS and SST. This means that high levels of PDX1 will increase

levels of both up and down-regulated genes. However, SST will then act as an inhibitor for

another up-regulated gene namely GCG which is also directly inhibited by PDX1. INS and

GCG exhibits a sort of self-switch behaviour where INS induces GCG and GCG activated

INS. Interestingly, GCG acts as a activator of the down-regulated gene AQP8. For this

to work there must be either something inducing AQP8 more than GCG is activating it in

distal colon. The part of the network where GCG activates INS which then induces TCF1, an

activator for down-regulated genes as UGT1A3, UGT1A6 and NR1H4, is likely to explain the

downregulation of these genes. If for example GAST expression results in increased values

of GCG, this will result in evelated values of INS which will then decrease the amount of

TCF1 resulting in down-regulation of UGT1A3, UGT1A6 and NR1H4. Since this analysis

only takes a small subpart into account, there are some parts of the network not managing

to explain the proper behaviour. GCG activating IGF1R makes sence but ISL1 activating

REG3A without any other factors isn’t supported by the litterature.

(28)

Figure 6: Interactions between all genes significant under the criterion p<0.05

(29)

Figure 7: Factors involved in activation and inhibition for connected genes. Genes

with enhanced levels in distal colon include PYY, SST, GCG, ISL1, FOXA2 and IGF1R

while genes with enhanced levels in proximal colon: PDX1, REG3A, IGF1BP2, AQP8,

UGT1A1(UGT1A3, UGT1A6), NR1H4

(30)

4.3 Promoter sequence analysis 4.3.1 Promoter structure

One possible explanation of the difference in up and down regulated genes is the presence of a TATA-box motif in one of the groups distinguishing its transcription pattern from the other group. However according to these results there are promoters with TATA boxes among both groups, see Figure 8, Figure 9, Figure 10. According to FIMO, 5 out of 20 promoters among up-regulated genes are TATA-box promoters and 4 out of 15 among down-regulated genes. Consite discovered the most TATA-boxes out of the three methods with matches in 17 out of 20 promoters for up-regulated genes 12 out of 15 for down-regulated genes. UCSC only found TATA-boxes in the promoters for 5 out of 20 up-regulated genes and one out of 15 down-regulated genes. However both groups exhibit a similar TATA-box pattern with an enhancement of TATA-binding protein sites near transcription start.

TATA-less promoters are often correlated with CpG islands and therefore the presens of CpG Islands in the promoters of up and down-regulated genes are of interest. This presence of CpG islands in the promoter regions can be evaluated computationally. The result can be found in Table 12 and Table 13, where CpGo/e ratio is defined as observed CpG frequency divided by the expected. The expected CpG frequency is just the number of C times the number of G divided by the number of A, C, G, T. Start p is the probability of that CpG island being responsible for transcription starting, as opposed to CpG islands not being associated by a promoter sequence. The percentage of genes with CpG islands in the promoter sequences are similar, Table 12 and Table 13. Among the up-regulated genes 9 out of 20 sequences had a GC-frequency larger than 0.5 and among the down-regulated genes that number was 6 out of 15. So roughly half of the genes in either group have GC rich promoters.

The promoter for LOC25845 is according to previous analysis here TATA-less and is

instead inhabiting three CpG islands.

(31)

A B Figu re 8: Matc hes for T A T A-bin din g prot e in sites among 1000b p up str e am to 100 d o w n str e am fr om tran scrip tion start site sequence s for (A) u p-regulat e d genes (B) do wn-regulated genes when using FIMO A B Figu re 9: Matc hes for T A T A-bin din g p rote in sites am on g up str e am 1000b p u pstream to 100 do wnstream from tr ansc ripti on start site se qu e n c es for (A) u p-regul ate d genes (B) do wn-regulated genes when using consite A B Figu re 10: M atc h e s for T A T A-bin din g p rotein site s amon g 1000bp u pstream to 100 d o wn str e am from tr ansc ri ption start site up str e am se qu e n c es for (A) u p-regul ate d genes (B) do wn-regulated genes fr om UCSC

(32)

T ab le 12: CpG Is lands foun d b y CpG Pr oD for th e region 5kb up str e am from tran scrip tion start site to 2kb do wnstream among up st re am sequ e n c es for u p regulat e d genes . se q name n u m b e r b e gin end length (bp) G+C fr e qu e n c y CpGo/e ratio A T sk ew GC sk ew start-p strand (s tr and -p ) PLA2G 2A 0/0 INSL5 0/0 MF SD4 1/1 4377 5368 842 0.652019 0.729305 -0.085324 0.063752 0.351292 pl us (0. 861228) DE S 1/1 4708 6030 1270 0.674803 0.706017 -0.007264 0.026838 0.406114 pl us (0. 615231) FN1 1/1 4351 5830 1454 0.580468 0.653355 0.042623 -0.016588 0.287656 min u s (0.660159) GCG 0/0 S100P 0/0 SP INK5 0/0 TME M200A 0/0 MUC12 1/1 560 2269 1598 0.654568 0.592138 -0.061594 -0.059273 0.306289 min u s (0.522440) SLC28A2 0/0 GLDN 1/1 4811 6085 1166 0.659520 0.694386 -0.128463 0.022107 0.363623 pl us (0. 850797) PYY 1/1 1644 5104 3415 0.667057 0.621533 -0.062445 -0.021949 0.611091 pl us (0. 603803) HO XB13 1/3 1 797 655 0.596947 0.673158 0.098485 0.084399 0.228274 pl us (0. 519560) HO XB13 2/3 4489 5562 928 0.615302 0.653005 -0.042017 -0.078809 0.253618 min u s (0.638328) HO XB13 3/3 6379 7000 622 0.586817 0.599118 0.003891 0.183562 0.162978 pl us (0. 920813) PRA C 1/2 150 801 652 0.590491 0.562476 0.018727 0.174026 0.144230 pl us (0. 896768) PRA C 2/2 1556 4384 2829 0.552139 0.673094 0.035517 -0.029449 0.472273 min u s (0.681057) WFD C 2 1/1 4596 6353 1648 0.667476 0.566677 0.000000 0.012727 0.297613 pl us (0. 549743) TFF1 0/0 PCP 4 0/0 CLDN8 0/0 LGALS 2 0/0

(33)

T ab le 13: CpG Islan ds foun d b y CpG ProD for the region 5kb u pstream from transcrip tion start site to 2kb do wnstream for do wn regul ate d genes . se q name n u m b e r b e gin end length (bp) G+C fr e qu e n c y CpGo/e ratio A T sk ew GC sk ew start-p strand (s tr and -p ) IGFBP2 1/1 4247 6064 1707 0.645577 0.764910 0.080992 0.172414 0.511312 pl us (0. 810650) UGT1A6 0/0 REG 3A 0/0 REG 1A 0/0 SLC20A1 1/1 4073 6026 1727 0.654893 0.853827 -0.124161 -0.025641 0.628146 pl us (0. 741075) UGT1A3 0/0 OS T alph a 0/0 LOC25845 1/3 1073 1973 901 0.664817 0.591864 0.231788 0.125209 0.233164 min u s (0.698788) LOC25845 2/3 2562 3277 566 0.678445 0.635085 -0.065934 -0.093750 0.245045 min u s (0.623753) LOC25845 3/3 3596 5656 1791 0.718593 0.773571 -0.019841 -0.069153 0.606376 min u s (0.664315) DE F A5 0/0 NR1H4 0/0 ETNK1 1/1 4447 6114 1613 0.570986 0.678098 -0.049133 -0.155266 0.324892 min u s (0.820185) OLFM 4 0/0 ANPE P 1/1 4343 5889 1397 0.686471 0.734354 0.022831 0.124088 0.469854 pl us (0. 809265) A QP8 0/0 F AM 3B 1/1 4521 5824 1304 0.634202 0.602750 -0.044025 -0.020556 0.264468 pl us (0. 559012)

(34)

A B

Figure 11: Motif logo for top motifs from each of ten runs of glam2 with a purge cut-off at 80 for (A) up regulated genes and (B) down regulated genes.

Figure 12: Motif logo for top scoring motif for up-regulated genes.

4.3.2 Ab initio motif discovery

The ideal motif discovery situation would be to find an overabundance of a particular motif in one of the groups possibly explaining the differences in gene expression patterns. Ideally when running GLAM2 you would want the motifs for all runs, or a majority, to be identical or at least similar. If that is not the case, pre processing of the data might be crucial. One such technique is to remove highly similar sequences before the motif scan. For the up regulated genes a purge cutoff at 80 resulted in 7 sequences out of 20 and a shared motif between the three first runs. To get an idea of the similarity between the other six motifs there are different methods available, STAMP [46] being one example. This application will allow for the making of one motif from the ten motifs from each run. The result can be found in Figure 11.

It can be seen that the motifs from the down-regulated genes aren’t as conserved as the up-regulated. This probably has to do with the fact that for the down-regulated genes no motifs were identical and desired would been a majority. Therefore the down regulated genes were found not to have a shared motif that could be found using glam2 and left until the start of a new approach. The top motif (Table 14, Figure 12) for the up-regulated genes was more probable considering both the familiar motif and the fact that the top three motifs from the ten runs were identical. To verify the significance of this motif, the sequences were randomly shuffled using shuffleseq from the EMBOSS package [57] and then used as input for glam2.

This procedure was repeated 100 times and the top score for each run was saved. In this case the real sequences performed better, i.e. had a higher score, than the shuffled ones in every case and is therefore thought of as being significant.

For this motif only 7 of the original 20 sequences were used to build the motif. An

interesting aspect would be if the motif is still present in the remaining 13 sequences. This

was verified using the GLAM2SCAN package [25] where 17 out of the in total 20 sequences

were found to match the motif when only including the top 50 matches. The three top scoring

matches can be found in Table 15. The score is a log likelihood ratio between the sequence

given the motif model and the sequence given the background model where the background

(35)

Sequences: 7

Greatest sequence length: 2000

Residue counts: a=2308 c=2382 g=2287 t=2533 N=4490 Score: 70.7631 Columns: 30 Sequences: 7

PLA2G2A 1794 gtcagc.agcagggccctaggtctccatcc 1822 + 11.8 SLC28A2 1950 gcccccatgccccacccaagctgcccctcc 1979 + 18.8 GLDN 1954 cgcaggctgc.ctccactggct.cctctcc 1981 + 13.8 WFDC2 1629 gcctccctgc.gggccctgccttatcctcc 1657 + 20.1 TFF1 633 gcccccattc.tggcccaggct.cccctcc 660 + 19.4 PCP4 214 gccagccagcagagcacaggctgattctcc 243 + 27.4 CLDN8 1890 ctcagccagc.cagcactggat.actatct 1917 + 15.3

Table 14: Part of output for top scoring motif for up-regulated genes. The first line shows the score, size of the motif and the number of sequences the motif is based on. Then the actual alignment is visualized with position, strand and a marginal score. The stars indicate key positions in the alignment. Last the actual motif is shown.

PCP4 214 gccagccagcagagcacaggctgattctcc 243 + 33.7 WFDC2 1629 gcctccctgc.gggccctgccttatcctcc 1657 + 29.3 SLC28A2 1950 gcccccatgccccacccaagctgcccctcc 1979 + 29.2

Table 15: Top scoring matches for GLAM2 motif when queried on a fasta file of up regulated genes using GLAM2SCAN.

model is based on random and independent occurence of nucleotides [25]. The three sequences without a match with a score greater than 6.6 was SPINK5, INSL5, KIAA1913. When this motif was used to search through all sequences, both for up and down, it resulted in matches among the 50 first for down-genes including LOC25845, OSTalpha, UGT1A6, IGFBP2, AQP8 and ANPEP. As already noted the motif for the down regulated genes wasn’t very strong and therefor put aside. With the up regulated genes the motif was a bit stronger and could be used for motif comparison. The question raised, is this a real motif? If yes, is it already known? The motif was therefore used as input using the TOMTOM package [32] to compare the motif to a database of already known motifs. The database used was JASPAR CORE [12]

containing curated non-redundant motifs from several species. The q value is here the false discovery rate [68]. Unfortunately no match was found, meaning that if the motif is really a true motif, it is in some sence a newly found motif that needs experimental verification.

4.3.3 Transcription factor binding site analysis

Specific transcription factor binding sites in the promoter regions were analysed as another

possible factor involved in differential expression between distal and proximal colon. In Ta-

ble 16, Table 17 and Table 18 the number of genes in each group each motif is present in, is

displayed for the three different tools FIMO, Consite and UCSC respectively.

(36)

Table 16: Motifs with matches for up-regulated and down- regulated genes when scanning promoter regions for tran- scription factor binding sites using FIMO

Motif accession Motif name Up Down

MA0003.1 TFAP2A 8 7

MA0017.1 NR2F1 5 2

MA0025.1 NFIL3 1 0

MA0028.1 ELK1 1 0

MA0030.1 FOXF2 1 2

MA0031.1 FOXD1 1 0

MA0032.1 FOXC1 1 2

MA0033.1 FOXL1 2 0

MA0042.1 FOXI1 2 2

MA0043.1 HLF 1 0

MA0048.1 NHLH1 7 4

MA0050.1 IRF1 2 0

MA0051.1 IRF2 3 1

MA0052.1 MEF2A 0 1

MA0055.1 Myf 4 6

MA0057.1 MZF1 5-13 10 6

MA0058.1 MAX 2 3

MA0059.1 MYC::MAX 1 1

MA0060.1 NFYA 6 3

MA0061.1 NF-kappaB 3 5

MA0066.1 PPARG 5 1

MA0069.1 Pax6 1 1

MA0070.1 PBX1 0 1

MA0071.1 RORA 1 1 2

MA0072.1 RORA 2 1 1

MA0073.1 RREB1 11 5

MA0074.1 RXRA::VDR 0 1

MA0076.1 ELK4 1 0

MA0077.1 SOX9 1 1

MA0081.1 SPIB 2 3

MA0083.1 SRF 1 0

MA0084.1 SRY 2 0

MA0090.1 TEAD1 4 3

MA0091.1 TAL1::TCF3 1 1

MA0093.1 USF1 0 1

MA0101.1 REL 2 3

MA0105.1 NFKB1 2 4

MA0106.1 TP53 2 1

MA0107.1 RELA 2 0

MA0113.1 NR3C1 2 3

MA0115.1 NR1H2::RXRA 1 1

MA0119.1 TLX1::NFIC 3 4

(37)

MA0124.1 NKX3-1 0 2

MA0139.1 CTCF 4 4

MA0148.1 FOXA1 2 2

MA0149.1 EWSR1-FLI1 5 2

MA0138.2 REST 4 3

MA0137.2 STAT1 6 5

MA0112.2 ESR1 7 5

MA0150.1 NFE2L2 1 3

MA0152.1 NFATC2 3 3

MA0153.1 HNF1B 3 1

MA0155.1 INSM1 9 5

MA0156.1 FEV 1 2

MA0157.1 FOXO3 2 1

MA0159.1 RXR::RAR DR5 3 4

MA0160.1 NR4A2 2 2

MA0163.1 PLAG1 9 9

MA0080.2 SPI1 0 3

MA0018.2 CREB1 1 1

MA0099.2 AP1 2 2

MA0079.2 SP1 12 10

MA0258.1 ESR2 4 2

MA0259.1 HIF1A::ARNT 2 1

MA0102.2 CEBPA 1 2

MA0108.2 TBP 5 4

MA0114.1 HNF4A 2 6

MA0046.1 HNF1A 1 4

(38)

Table 17: Motifs with matches for up-regulated and down-regulated genes when scanning promoter regions for transcription factor binding sites using Consite.

Motif name Up Down

ARNT 18 15

CREB 8 6

E2F 4 3

HLF 5 5

MEF2 11 4

Myf 12 12

NF-kappaB 14 11

Pax6 1 2

SRF 2 0

TBP 17 12

USF 15 15

Table 18: Motifs with matches for up-regulated and down- regulated genes when scanning promoter regions for tran- scription factor binding sites using UCSC

Motif name Present in Up Down

M00237 AhR 1 1

M00517 AP1 3 1

M00413 AREB6 3 1

M00237 Arnt 1 1

M00017 ATF 3 2

M00179 ATF-2 2 1

M00490 Bach2 1 0

M00201 C/EBPalpha 3 0

M00109 C/EBPbeta 3 1

M00615 c-Myc 0 1

M00209 CP1A 4 1

M00185 CP1C 3 1

M00113 CREB1 3 1

M00222 E47 1 0

M00045 E4BP4 1 0

M00243 Egr-1 2 0

M00246 Egr-2 1 0

M00245 Egr-3 2 0

M00456 FAC1 1 0

M00291 FOXC1 1 2

M00290 FOXF2 1 1

M00422 FOXJ2 1 0

M00260 Hlf 1 1

M00132 NF1C 1 3

M00134 HNF4A 1 1

(39)

M00062 IRF1 1 0

M00480 LUN-1 2 1

M00615 Max 0 1

M00491 MAZR 2 2

M00006 MEF-2A 3 1

M00084 MZF1 1-4 2 2

M00209 NF-Y 4 1

M00209 NFYA 4 1

M00272 p53 2 2

M00248 POU2F1 3 1

M00528 PPAR-gamma1 2 1

M00528 PPAR-gamma2 2 1

M00156 RORalpha1 1 0

M00257 RREB1 1 1

M00410 Sox9 1 2

M00008 Sp1 2 4

M00220 SREBP-1a 4 1

M00220 SREBP-1b 4 1

M00220 SREBP-1c 4 1

M00186 SRF 2 0

M00160 SRY 1 0

M00252 TBP 6 1

M00216 TFIID 2 0

SLC20A1 is known to have a CCAAT-box in its promoter but no TATA-box [52]. The CCAAT-box is bound by the transcription factor NFY. FIMO and UCSC both succeed in predicting the CCAAT-box but only Consite has a match for the TATA-box.

IGFBP2 is believed to lack both TATA and CCAAT boxes a result well correlated with both the output from FIMO and UCSC. Consite manages in not detecting a CCAAT-box but instead identifies one TATA-box.

4.3.4 Classification based on presence of sequence specific motifs

One hypothesis is that the presence of specific motifs in the upstream regions for different genes could be used for classifying the genes into the two groups up and down regulated in the two different colon segments.

The first scheme is assuming that the score of the motif, that is the binding affinity for the transcription factor, isn’t an important factor but the mere presence of the motif in the upstream region is enough to classify genes into the two groups. However, it is very likely that this simple model wouldn’t be able to explain such a complex process as gene regulation.

Instead the number of occurences for each motif in a gene’s upstream region would be more

appropriate, since binding sites often can occur more than once. These two models have

both discarded the score for each motif, making them simple to use but also loosing a lot of

information. Since the score for the motif is a calculation of how well it compares to known

regulatory motifs, a motif with a high score would mean a high sequence similarity and therfor

a high binding affinity by its corresponding transcription factor. Taking this into account,

classification can be performed using each motifs score in every upstream region that is the

(40)

Table 19: Sequences sorted out at each node in Figure 13A.

Node Up-regulated genes Down-regulated genes

2: GLDN, LGALS2 OLFM4, Osta,REG1A, REG3A

6: MFSD4, PCP4, PYY DEFA5, IGFBP2, LOC25845, SLC20A1 7: The rest

highest score per motif in each sequence.

Decision tree methods can be used to classify the results. These represent data classifica- tion as a tree structure with every node representing a partition based on a specific decision.

In this case, the model is based on the presence of motifs in the upstream sequences. In total the dataset has 35 observations from 2 classes, i.e. 35 upstream sequences from down or up-regulated genes. The results for FIMO and Consite for the classification scheme zero or one can be found in Figure 13. The labeling of the nodes follow the convention of the root being 1 and then left child is 2n and right child is 2n+1 where n is the parent node. In Figure 13A it can be seen that at the root there are in total 35 sequences, 15 from down and 20 from up. Since there are more sequences from group 1(up), this node will predict class 1. The root is then split based on the motif HNF4A, and sequences with occurences of this motif go right, to node 2, and the other left, to node 3. Let’s start exploring the left side.

This node has in total 8 sequences, 6 from class 0 and 2 from class 1. The majority is from down and this node will therefore predict class 0. No improvement can then be made, and that node will therefore end up with the sequences in Table 19. Then continue to the right side where node 3 contains 27 sequences. A split using the motif CTCF results in one node with 7 sequences and one with 10. The left node in this case, node 6, contains all sequences that doesn’t contain the motif HNF4A but contains CTCF. The result from Consite varies a lot from the tree built from FIMO. For Consite only one split occurs, based on the motif MEF2A. So the question raised is what motifs are really needed for classifying the upstream regions into the two classes of up or down regulated genes? According to the output from FIMO, the two motifs HNF4A and CTCF are important and for Consite the motif MEF2A is of greatest importance.

The results for FIMO and consite for the classification scheme 2 can be found in Figure 14.

It can be seen that the change of classification scheme for FIMO doesn’t change the output.

For Consite on the other hand, the classification is now based on the motif USF1. Classifi- cation scheme 3 changes the output for both methods. RREB1 and SP1 are the important motifs for FIMO and for Consite a combination of the previously chosen motifs MEF2A and USF1 are needed for the classification. Motif matches from UCSC don’t manage to build a classifier for either of the classification schemes.

No scheme or method produces a classifier without wrong predictions but motif occurences based on FIMO and Consite manages to classify a majority of the genes into their correct class belonging. However since no external test was performed there is no assurance that the predictions aren’t based on noise only.

4.3.5 Discussion

We here analyzed the promoters of differentially expressed genes between proximal and

distal colon with emphasis on the structure. Both up and down-regulated genes were found

to have promoters with TATA-boxes especially around the transcription initiation sites. No

(41)

A B

Figure 13: Decision trees for motif scanning using the presence of motifs in the pro- moter(classification scheme 1) for (A) FIMO (B) consite. At each internal node the number of genes from each group is presented, down-regulated genes to the left and up-regulated genes to the right. Each leaf will predict one of the two groups, 0 for down-regulted genes and 1 for up-regulated genes.

A B

Figure 14: Decision trees for motif scanning using the count of each motif in the pro-

moter(classification scheme 2) for (A) FIMO (B) consite

Computational analysis of transcriptional

Examensarbete 30 hp Maj 2011

Computational analysis of transcriptional regulation in proximal versus

distal normal colon

Sarah Engman

Bioinformatics Engineering Program

Uppsala University School of Engineering

UPTEC X 11 026 Date of issue 2011-06

Author

Sarah Engman

Title (English)

Computational analysis of transcriptional regulation in proximal versus distal normal colon

Title (Swedish)

Abstract

Keywords

Colon, transcriptional regulation, cancer, CpG methylation, transcription factor binding sites Supervisors

Neil Saunders

CSIRO Mathematics, Informatics and Statistics

Scientific reviewer

Claes Andersson

Department of Medical Sciences, Cancer Pharmacology and Informatics, Uppsala University

Project name Sponsors

Language

English

Security

ISSN 1401-2138 Classification

Supplementary bibliographical information

Pages

48

Biology Education Centre Biomedical Center Husargatan 3 Uppsala

Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 471 4687

Computational analysis of transcriptional regulation in proximal versus distal normal colon

Sarah Engman

Populärvetenskaplig sammanfattning

Kolorektalcancer är en av de vanligaste cancertyperna i länder med hög

I detta arbete har ett antal olika beräkningsmetoder använts för att hitta mönster som kan förklara skillnaden mellan två av tarmens delar. Beräkningarna har utgått från genuttrycksdata från frisk vävnad för två tarmsegment där vardera genens

promotersekvens har extraherats från databaser tillgängliga online. Med hjälp av algoritmer, genuttrycksdatabaser och manuella jämförelser har tänkbara faktorer såsom transkiptionsfaktorsbindingsplatser och metylering av CpG-öar på dessa sekvenser analyserats.

Examensarbete 20p

Civilingenjörsprogrammet Bioinformatik

Contents

1 Introduction 2

1.1 Project goal . . . . 2

2 Background, from biology to computational approaches 2 2.1 The colon . . . . 2

2.2 Gene regulation . . . . 3

2.3 Computational motif detection methods . . . . 4

2.4 Microarray analysis . . . . 5

2.5 Gene Ontology . . . . 5

3 Materials and methods 6 3.1 The dataset . . . . 6

3.2 Differential expression analysis . . . . 6

3.3 Analysis based on Gene Ontology . . . . 6

3.4 Network analysis . . . . 6

3.5 Extracting the upstream sequences . . . . 6

3.6 Novel motif discovery . . . . 7

3.7 Motif scanning . . . . 7

3.8 Classification of genes based on occurences of motifs . . . . 11

3.9 CpG Island detection . . . . 11

3.10 CpG methylation analysis . . . . 12

4 Results and discussion 12 4.1 Differentially expressed genes between distal and proximal colon . . . . 12

4.1.1 Function and chromosomal position . . . . 17

4.1.2 Differences in function based on Gene Ontology . . . . 19

4.2 Identifying interactions between differentially expressed genes . . . . 19

4.3 Promoter sequence analysis . . . . 24

4.3.1 Promoter structure . . . . 24

4.3.2 Ab initio motif discovery . . . . 28

4.3.3 Transcription factor binding site analysis . . . . 29

4.3.4 Classification based on presence of sequence specific motifs . . . . 33

4.3.5 Discussion . . . . 34

4.4 CpG methylation analysis . . . . 36

5 Conclusion 37

6 References 41

1 Introduction

Gene expression can be regulated on different levels with transcriptional regulation as one of the initial steps. Binding of transcription factors to the genes’ promoters and change of CpG methylation are important factors involved in regulating the amount of transcripts in a cell.

1.1 Project goal

2 Background, from biology to computational approaches

2.1 The colon

The colon in mammals can be divided into 5 parts, namely the caecum, ascending colon,

transverse colon, descending colon and sigmoid colon. Further, these parts can be grouped

together resulting in two major segments, the proximal colon consisting of the cecum, ascend-

ing and transverse colon, and distal colon consisting of the remaining two. The distal and

proximal colon are from different embryonic origins with most of the distal being derived from