Transcription response in the TGF-beta pathway
Francisco Manuel Sánchez de Oria
Degree project in biology (30 hp), Master of Science (2 years), 2008
Biology Education Centre and Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University
Table of Contents
List of abbreviations
Abstract
Introduction
The TGF-β superfamily
Role of TGF-β in tumor pathogenesis
TGF-β signal transduction and the Smad proteins
Studying transcription factors binding: Chromatin ImmunoPrecipitation (ChIP) assays
ChIP-seq: next-generation ChIP assays
Results
In vivo mapping of binding sites for Smad4 transcription factor
Verification of known binding sites
Analysis of Smad4 target genes
Smad4-enriched regions contained an overrepresentation of Smad binding sites
Discussion
Global mapping of Transcription Factor Binding Sites in the post-genomic era
Unraveling the secrets of the complex network of TFs in the TGF-β pathway
Materials and methods
Experimental procedures
Cell cultures
Antibodies
ChIP and DNA template preparation for sequence analysis
PCR confirmation
Data analysis
Acknowledgments
References
List of abbreviations
BMPs Bone Morphogenetic Proteins
CDK Cyclin-Dependent Kinases
ChIP Chromatin Immunoprecipitation
CLB Cell Lysis Buffer
FBS Fetal Bovine Serum
FoxG1 Forkhead Box G1B
I-Smads Inhibitory Smads
MAPK Mitogen-Activated Protein Kinases
PAI-1 Plasminogen Activator Inhibitor-1
R-Smads Receptor-Regulated Smads
RIPA RadioImmunoPrecipitation Assay
SARA Smad Anchor for Receptor Activation
SBE Smad Binding Element
Smurf Smad ubiquitination regulatory factor
TF Transcription Factor
TFBS Transcription Factor Binding Site
TGF-β Transforming Growth Factor beta
TSS Transcription Start Site
Summary
Transforming growth factor beta (TGF-β) is a multifunctional cytokine involved in the regulation of numerous cellular responses, including cell proliferation, differentiation and apoptosis. Escaping from TGF-β-induced apoptosis is one of the hallmarks that characterize cancer cells. The aim of this project was to identify genes bound and regulated by the Smad2 and Smad4 transcription factors, which directly mediate TGF-β signaling. In this project I used ChIP-seq, a state-of-the-art method for analyzing protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation, a powerful method employed to selectively enrich for DNA sequences bound by a particular protein in vivo, with massively parallel sequencing of the ChIP-enriched DNA. The DNA bound by the protein is thereby identified and mapped to the human reference genome, so that the position of every DNA fragment can be located visually. Using HepG2 cells (a human hepatocellular liver carcinoma cell line) as a model system, I identified well-known target genes of the Smad4 protein, such as PAI-1 and JUNB, as well as candidate genes that could potentially be targets for therapeutic intervention, such as FoxG1 and HNF4α.
Introduction
Transforming growth factor beta (TGF-β) is a multifunctional cytokine involved in the regulation of numerous cellular responses, such as cell growth and proliferation, differentiation, extracellular matrix production, migration and apoptosis (Jennings & Pietenpol, 1998; Verrecchia & Mauviel, 2002). TGF-β is a secreted homodimeric protein and a member of the TGF-β superfamily. Deregulation of TGF-β expression or signaling is involved in a variety of diseases, including cancer and fibrosis (Blobe et al., 2000).
The TGF-β superfamily
The TGF-β superfamily includes more than 30 pleiotropic cytokines with similar structure that play key roles in development and tissue homeostasis. The first member of the TGF-β superfamily, TGF-β1, was discovered in the late 1970s. Its name reflects its ability to induce growth and morphological transformation of rat kidney fibroblasts (DeLarco and Todaro, 1976). However, shortly after its discovery it was shown that TGF-β1 also acts as an inhibitor of cell proliferation (Tucker et al., 1984). This duality in cell growth regulation is cell-type dependent and imprinted during embryonic development (Sporn and Roberts, 1990). Other members of the TGF-β superfamily are TGF-β2 and TGF-β3, bone morphogenetic proteins (BMPs), anti-Müllerian hormone (AMH), activins and nodal (Piek et al., 1999).
Role of TGF-β in tumor pathogenesis
TGF-β plays important roles in tumor pathogenesis, contributing to cell growth, invasion and metastasis, and angiogenesis, and also decreasing host tumor-specific immune responses (Jennings & Pietenpol, 1998). Although TGF-β initially acts as a tumor suppressor, inhibiting cell growth in most cell types via the Smad pathway, once the tumor has been established most cells become resistant to TGF-β and TGF-β turns pro-oncogenic (Elliott and Blobe, 2005) (Figure 1). Escaping from TGF-β growth inhibition is a characteristic of many cancer cells (Massagué et al., 2000).
Figure 1. The dual role of TGF-β in tumor pathogenesis.
TGF-β arrests cell cycle progression at early G1 by controlling a number of important cell cycle regulators (Hanahan and Weinberg, 2000). Regulation of cyclin-dependent kinases (CDKs) is essential for TGF-β-mediated cell growth inhibition. This regulation can occur either through direct downregulation of CDK levels (Zhang et al., 2001) or through upregulation of CDK inhibitors (Alexandrow and Moses, 1995).
Alterations of the TGF-β pathway can increase cancer risk. A common example is TGFBR1*6A, a variant of the TGFBR1 gene with a 9 bp in-frame deletion. This variant is present in approximately 14% of the general population and results in decreased TGF-β-mediated growth inhibition. Population studies have shown that this allele increases breast cancer risk by 31% in heterozygotes and by 169% in homozygotes (Zhang et al., 2005). Nonetheless, the most common cause of altered TGF-β signalling is mutational inactivation of TGFBR2, present in about 20-30% of all colon cancers (Biswas et al., 2004).
TGF-β signal transduction and the Smad proteins
The Smad proteins directly mediate the biological effects of TGF-β. They are homologs of both the Drosophila mothers against decapentaplegic (MAD) protein and the Caenorhabditis elegans SMA protein; their name is a combination of the two.
TGF-β binds to and activates type I and type II serine/threonine kinase receptors present on the cell surface. The receptor-regulated Smads (R-Smads), namely Smad1, Smad2, Smad3, Smad5 and Smad8, directly mediate TGF-β signalling upon receptor activation. The Smad anchor for receptor activation (SARA) or endofin mediates Smad activation by delivering the R-Smads to the receptor, which results in phosphorylation of the R-Smads (Tsukazaki et al., 1998; Shi et al., 2007). Once activated, the R-Smads form heterodimeric complexes with Smad4. These complexes are translocated to the nucleus, where they regulate the expression of target genes through interaction with other transcription factors, coactivators and corepressors (Massagué et al., 2005). These target genes mediate the biological effects of TGF-β. Some of the activated target genes stimulate tumorigenesis, while others suppress it. Although Smad4 is not required for translocation into the nucleus, it seems to be needed for the Smad complex to act as a transcription factor (Liu et al., 1997).
Besides the TGF-β superfamily receptors, there are other kinases, such as cyclin-dependent kinases (CDKs) and mitogen-activated protein kinases (MAPKs), that can phosphorylate Smad proteins, thus regulating their capacity to control transcription of their target genes (Matsuura et al., 2004; Kamaraju and Roberts, 2005) (Figure 2).
Figure 2. The TGF-β/Smad pathway. TGF-β binds to and activates type I and type II serine/threonine kinase receptors present on the cell surface. The receptor-regulated Smads (R-Smads) directly mediate TGF-β signalling upon receptor activation. SARA or endofin mediates Smad activation by delivering the R-Smads to the receptor, which results in phosphorylation. Once activated, the R-Smads form heterodimeric complexes with Smad4 and are translocated to the nucleus, where they control target genes. Smad7/Smurf1/2 represents the negative loop of the cycle, ending signaling.
Modified with permission from ten Dijke and Hill, 2004.
Smad6 and Smad7, the I-Smads, constitute a subclass of inhibitory Smads that act in direct opposition to R-Smad signalling, forming a negative feedback loop. Originally this subclass was shown to compete with R-Smads for binding to the activated type I receptor (Moustakas et al., 2001). Later they were found to cause ubiquitination and degradation of the activated type I receptor by recruiting the E3 ubiquitin ligases Smad ubiquitination regulatory factor 1 (Smurf1) and Smurf2, thus ending signalling (Shi and Massagué, 2003). Shortly after this discovery it was demonstrated that I-Smads associate with phosphatases that dephosphorylate and thereby inactivate type I receptors (Shi et al., 2004). A possible role for I-Smads in transcriptional regulation has also been postulated, as Smad6 has been shown to repress BMP-induced transcription by recruiting the corepressor CtBP (C-terminal binding protein) and Smad7 disrupts Smad2/Smad3 complexes in the nucleus (Lin et al., 2003).
Although the TGF-β superfamily has numerous members that produce a vast diversity of cellular responses, only two different Smad pathways are known, raising many questions about how signaling specificity and diversity are produced (Attisano and Wrana, 2002; Miyazawa et al., 2002).
Studying transcription factors binding: Chromatin ImmunoPrecipitation (ChIP) assays
Transcription is controlled by the association of transcription factors (TFs) with their target DNA sequences in gene regulatory regions and the additional recruitment of activators of the transcription machinery; hence it is of great importance to be able to study protein-DNA associations in vivo. These associations are fine-tuned by epigenetic modifications, including methylation of CpG dinucleotides (Antequera, 2003), post-translational modifications of histones (Strahl and Allis, 2000; Jenuwein and Allis, 2001) and incorporation of histone variants (Mito et al., 2007). Such modifications are used by the transcription factors to modulate transcription and constitute the epigenetic code (Cosgrove and Wolberger, 2005).
Chromatin Immunoprecipitation (ChIP) assays are the cutting-edge technique for studying protein-DNA interactions in vivo on a large scale. The ChIP technique involves reversible crosslinking of proteins to DNA, a procedure in which the protein-DNA interaction is fixed covalently using formaldehyde. The purpose of the crosslinking is to ensure that the DNA-protein link is maintained during the ChIP procedure. The chromatin is fragmented into smaller pieces, usually about 200 base pairs in length, using either enzymatic digestion or sonication of the nuclei. The sheared chromatin is then immunoprecipitated with an antibody recognizing the protein of interest. In the last steps the crosslink is reversed, proteins are digested and the enriched ChIP-DNA is purified (Figure 3). For a recent review of the current state and applications of ChIP, see Collas and Dahl, 2008.
Figure 3. ChIP assay experimental outline. Modified with permission from Collas and Dahl, 2008
For several years a strong limitation of the ChIP technology was that analysis of the ChIP-selected DNA material was restricted to a set of predetermined target sequences, using PCR with chosen primers. This method introduces a strong bias towards the sequences of interest. Array technology extended the power of ChIP, enabling the discovery of novel target sites for TFs and the mapping of post-translationally modified histones across the genome. This approach is known as ChIP-on-chip or ChIP-chip, and was first successfully applied to yeast in three papers published in 2000 and 2001 (Ren et al., 2000; Iyer et al., 2001; Lieb et al., 2001). Recent advances in microarray technology have made it possible to study TFs genome-wide in human cells (Rada-Iglesias et al., 2008). Microarray hybridization overcomes the limitations of regular ChIP-PCR analysis and has permitted analysis on a genome-wide scale. Nonetheless, the advent of next-generation sequencing technologies has led ChIP assays to the next frontier.
ChIP-seq: next-generation ChIP assays
The so-called next-generation sequencing machines are capable of producing tens to hundreds of millions of short sequence reads during a single instrument run (Shendure and Ji, 2008). This unprecedented sequencing capacity is being applied in many fields of biology, enabling striking scientific advances at dizzying speed.
ChIP-sequencing, also known as ChIP-seq, uses this novel technology to sequence ChIP-DNA fragments in a massively parallel manner. Some of the advantages of ChIP-seq over ChIP-chip assays are lower cost, lower input DNA and amplification requirements, freedom from the limits of microarray content, and more accurate mapping (Barski et al., 2007; Johnson et al., 2007; Mikkelsen et al., 2007; Robertson et al., 2007). ChIP-seq has recently been used to study epigenetic changes in the DNA and target sites for TFs and other chromosome-associated proteins across the entire genome, making it possible to build a high-resolution genome-wide map of gene expression and genome function (Barski et al., 2007; Johnson et al., 2007; Robertson et al., 2007; Wederell et al., 2008).
The Illumina sequencing technology (Figure 5) relies on a proprietary reversible terminator-based sequencing chemistry. The first step prior to sequencing is library preparation, in which adaptor sequences are ligated to the DNA fragments. The ligated fragments are then amplified and immobilized on a flow cell surface, where they are directly amplified (solid-phase amplification) to create up to 1000 clones of each single molecule in very close proximity. The clusters of clones are then sequenced using fluorescently labeled modified nucleotides (sequencing-by-synthesis). One important property of these nucleotides is reversible termination, which allows all 4 nucleotides (A, C, T, G) to be present simultaneously during sequencing and results in higher accuracy than methods in which only one nucleotide is present at a time. In each sequencing cycle, a laser excites the fluorescently labeled nucleotides and an image is captured, determining the identity of the base for each cluster. The cycle is repeated to obtain the sequence of bases in a given fragment. In the last steps the Illumina Pipeline software maps the sequence reads to a reference genome in order to obtain the genomic coordinates of every ChIP-DNA fragment (aligned reads). The resulting file contains the sequence of every DNA fragment and its location in the genome, and it can be formatted and uploaded to the University of California Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu/) to locate visually the position of every DNA fragment in the genome and compare the different samples. Regions of the genome where several aligned ChIP-DNA reads overlap form peaks. Each step in a peak represents the position of an aligned ChIP-DNA read in the human reference genome.
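The idea of a peak as a stack of overlapping aligned reads can be illustrated with a minimal sketch. It assumes that each read is given as a (chromosome, start, end) half-open interval with start < end; the function names and the toy coordinates are hypothetical, and this is not the algorithm of the Illumina Pipeline or of the UCSC browser.

```python
from collections import defaultdict

def coverage_events(reads):
    """Turn (chrom, start, end) half-open intervals into sorted +1/-1 events per chromosome."""
    events = defaultdict(list)
    for chrom, start, end in reads:
        events[chrom].append((start, 1))   # a read begins: depth goes up
        events[chrom].append((end, -1))    # a read ends: depth goes down
    for chrom in events:
        events[chrom].sort()               # at equal positions, ends sort before starts
    return events

def call_peaks(reads, min_height):
    """Return (chrom, start, end, height) for each contiguous covered region
    whose maximum number of overlapping reads is at least min_height."""
    peaks = []
    for chrom, evs in sorted(coverage_events(reads).items()):
        depth = 0
        max_depth = 0
        region_start = None
        for pos, delta in evs:
            if depth == 0 and delta == 1:
                region_start = pos         # a new covered region opens here
                max_depth = 0
            depth += delta
            max_depth = max(max_depth, depth)
            if depth == 0 and max_depth >= min_height:
                peaks.append((chrom, region_start, pos, max_depth))
    return peaks

# Toy example (made-up coordinates): three overlapping reads form one peak of height 3.
toy_reads = [("chr1", 100, 136), ("chr1", 110, 146), ("chr1", 120, 156),
             ("chr1", 500, 536), ("chr2", 10, 46)]
print(call_peaks(toy_reads, min_height=3))
```

With a threshold of 3, only the stack of three reads on chr1 is reported; the isolated reads are ignored.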
Figure 5. Scheme of the Illumina sequencing technology. Modified with permission from Illumina Inc., www.illumina.com (2008)
Aim
The aim of this project was to identify genes bound and regulated by the Smad2 and Smad4 transcription factors, which directly mediate TGF-β signaling, in HepG2 cells. For that purpose I used chromatin immunoprecipitation and high-throughput parallel sequencing (ChIP-seq), a method employed to determine the in vivo genomic localization of transcription factors and other chromatin-related proteins.
Results
In vivo mapping of binding sites for Smad4 transcription factor
Chromatin immunoprecipitation coupled to high-throughput sequencing technology (ChIP-seq) can be used to profile genome-wide binding sites for a chosen transcription factor (Barski et al., 2007; Johnson et al., 2007; Robertson et al., 2007; Wederell et al., 2008). In this study I used chromatin immunoprecipitation to isolate DNA bound by Smad4 in TGF-β-treated and control HepG2 cells. All ChIP samples were checked for the presence of known binding sites using semi-quantitative or quantitative PCR. This step is required to evaluate the efficiency of the ChIP procedure before further analysis using the Illumina genome analyzer. Samples in which the PCR showed low or no enrichment of known binding sites were discarded. The immunoprecipitated Smad-bound DNA samples were then sequenced using the Illumina 1G genome analyzer and mapped to the human genome using the Illumina Analysis Pipeline, thus identifying target genes of the TGF-β pathway. The output text files were converted to browser extensible data (BED) format in order to visualize the data in the University of California Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu/). Table 1 summarizes the statistical information in the output files obtained:
Table 1. Sequencing statistics obtained for each sequenced ChIP-DNA sample.
Name          Antibody     TGF-β treated   #Aligned reads   Peaks (a)
Smad2         anti-Smad2   Yes             3176304          586
Smad4         anti-Smad4   Yes             2851319          667
Smad4C_Last   anti-Smad4   No              4846500          17330
Smad4T_Last   anti-Smad4   Yes             4707083          3117
(a) Since the Smad2 and Smad4 samples had fewer sequences, peaks above 6 overlapping reads are counted, while in Smad4C_Last and Smad4T_Last only peaks above 8 overlapping reads are counted.
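The conversion of aligned reads to BED format mentioned before Table 1 can be sketched as follows. The layout of the Illumina Analysis Pipeline output is not described here, so the input tuple (chromosome, 1-based leftmost position, read length, strand) is an assumption made for illustration, and the function names are hypothetical. What the sketch does rely on is the BED convention itself: a 0-based start, an exclusive end, and tab-separated columns.

```python
def read_to_bed(chrom, pos_1based, read_length, strand, name="read", score=0):
    """Format one aligned read as a six-column BED line.

    pos_1based is assumed to be the 1-based leftmost aligned coordinate;
    BED uses a 0-based start and an exclusive end.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    start = pos_1based - 1
    end = start + read_length
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])

def reads_to_bed_track(reads, track_name):
    """Join BED lines under a custom-track header line for upload to a genome browser."""
    lines = ['track name="%s"' % track_name]
    lines.extend(read_to_bed(*read) for read in reads)
    return "\n".join(lines) + "\n"

# Toy input (made-up coordinates), not data from this study.
print(reads_to_bed_track([("chr7", 100, 36, "+"), ("chr7", 150, 36, "-")], "Smad4T_Last"))
```

The off-by-one shift in the start coordinate is the usual pitfall in this step, which is why it is made explicit in the sketch.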
Verification of known binding sites
In order to validate my data in silico, I looked for peaks located in well-known and well-characterized promoters of Smad target genes. I extracted 3 known target genes from the literature and analyzed them in the UCSC Genome Browser (Mar. 2006 Assembly): plasminogen activator inhibitor-1 or PAI-1 (SERPINE1), JUNB proto-oncogene (JUNB) and SMAD family member 7 (Smad7).
For the PAI-1 gene, a Smad binding region has been located at −586 to −551 upstream of the gene. This region contains 3 Smad Binding Elements (SBEs) and an E-box, and the 3 bp spacer between the E-box and an SBE has been shown to be essential to mediate TGF-β-induced transcription (Hua et al., 1999). An E-box is a short DNA sequence, typically located upstream of a gene in a promoter region, that contains the palindromic canonical sequence CACGTG. Transcription factors containing the basic helix-loop-helix protein structural motif typically bind to E-boxes or related variant sequences and enhance transcription of the downstream gene. TGF-β activates JUNB through binding of a nuclear factor to a promoter-distal element, a 22 bp sequence located between nucleotides −2813 and −2792 relative to the JunB gene, where an SBE for this gene has been characterized (Jonk et al., 1998). The Smad7 gene has been shown to be regulated by the Smad3-Smad4 complex in the presence of TGF-β treatment. The Smad7 promoter is located at −471 to −275, and it contains a perfect 8 bp SBE (GTCTAGAC) (Nagarajan et al., 1999; Stopa et al., 2000).
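The two motifs quoted above, the E-box CACGTG and the 8 bp SBE GTCTAGAC, are both palindromic in the DNA sense, meaning that each equals its own reverse complement. A minimal sketch of how this can be checked, and how such motifs can be located in a sequence, is given below. The helper names and the toy sequence are hypothetical and were made up for illustration; they are not promoter sequences from this study.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_palindromic(motif):
    """A motif is palindromic if it reads the same on both strands."""
    return motif.upper() == reverse_complement(motif)

def find_motif(sequence, motif):
    """Return sorted (0-based start, strand) hits of motif in sequence.

    Palindromic motifs are reported once with strand '.', since a hit on
    one strand is necessarily a hit on the other at the same position.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    if is_palindromic(motif):
        patterns = [(".", motif)]
    else:
        patterns = [("+", motif), ("-", reverse_complement(motif))]
    hits = []
    for strand, pattern in patterns:
        # The lookahead lets overlapping occurrences be reported as well.
        for match in re.finditer("(?=" + re.escape(pattern) + ")", sequence):
            hits.append((match.start(), strand))
    return sorted(hits)

toy = "AAGTCTAGACTTCACGTGAA"  # made-up sequence containing both motifs
print(is_palindromic("CACGTG"), is_palindromic("GTCTAGAC"))
print(find_motif(toy, "GTCTAGAC"), find_motif(toy, "CACGTG"))
```

Exact string matching is a deliberate simplification; real binding sites are usually described with tolerance for variant sequences.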
The first ChIP carried out included the Smad2 and Smad4 transcription factors in TGF-β-treated HepG2 cells. The samples were sequenced and the sequencing data were analyzed in the UCSC genome browser. The numbers of peaks in those samples were similar, and in many cases the peaks overlapped at the same position. However, I did not find peaks at known binding sites.
The second set of samples were Smad4 control and TGF-β-treated (Smad4C_Last and Smad4T_Last). Those samples confirmed known binding sites for PAI-1 (Figure 6A and 6B) and JUNB (Figure 7). Although my data did not support Smad4 binding at −471 to −275 for the Smad7 gene, there was a peak at around +750 bp (Figure 8). Further studies are necessary to determine whether the −471 to −275 region of the Smad7 gene is in fact negative for Smad4 binding in vivo.
Figure 6. (A) The PAI-1 promoter in the UCSC genome browser showing the genomic localization of the sequences precipitated with the anti-Smad4 antibody. The upper panel (Smad4T_Last) represents the sequence tags (black) from the TGF-β-treated sample, while the lower panel (Smad4C_Last) represents the sequence tags (black) for the control sample. The Y axis represents the peak height, whereas the X axis represents the localization in the genome. Peaks are scaled according to the tallest peak of each panel, so that different scaling is used in each panel. (B) A closer look at the PAI-1 binding site. SBEs are shown in red, the E-box sequence in black and the 3 bp spacer between the E-box and an SBE shaded in green.
Figure 7. The JUNB distal element in the UCSC genome browser showing the genomic localization of the sequences precipitated with the anti-Smad4 antibody. The upper panel (Smad4T_Last) represents the sequence tags (black) from the TGF-β-treated sample, while the lower panel (Smad4C_Last) represents the sequence tags (black) for the control sample. The Y axis represents the peak height, whereas the X axis represents the localization in the genome. Peaks are scaled according to the tallest peak of each panel, so that different scaling is used in each panel.
Figure 8. The Smad7 promoter in the UCSC genome browser showing the genomic localization of the sequences precipitated with the anti-Smad4 antibody. The upper panel (Smad4T_Last) represents the sequence tags (black) from the TGF-β-treated sample, while the lower panel (Smad4C_Last) represents the sequence tags (black) for the control sample. The Y axis represents the peak height, whereas the X axis represents the localization in the genome. Peaks are scaled according to the tallest peak of each panel, so that different scaling is used in each panel.
The overall success at identifying known target genes suggests that my data have good coverage of known Smad binding sites across the genome.
Analysis of Smad4 target genes
In an attempt to extract relevant biological information from the enormous amount of data, the treated sample was filtered according to the following criteria (see methods section): peaks below 8 hits and peaks located at the same position as peaks above 5 hits in the control sample were removed. Regions were extended ±250 bp from the center and only those within 10 kb of a transcription start site (TSS) were kept. In this way, both weak peaks and peaks located at the same position in the control and treated samples were filtered out, leaving only strong peaks in the proximity of a TSS, which could potentially lie in gene promoters. Of the 590 regions analyzed, Table 2 shows the top 20 most enriched regions, with a description of the closest gene and the distance from the peak to the TSS.
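The filtering criteria above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the script used for the analysis: "located at the same position" is interpreted as any overlap between a treated and a control peak, the distance to a TSS is measured from the region center, and the Peak type, function names and toy coordinates are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Peak:
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    height: int  # number of overlapping reads

def overlaps(a, b):
    """True if two peaks share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

def filter_treated_peaks(treated, control, tss_by_chrom, min_height=8,
                         control_height=5, flank=250, max_tss_dist=10_000):
    """Keep strong treated peaks that are absent from the control and close to a TSS."""
    # Control peaks "above 5 hits" are the ones that veto a treated peak.
    strong_control = [c for c in control if c.height > control_height]
    kept = []
    for p in treated:
        if p.height < min_height:
            continue  # weak peak
        if any(overlaps(p, c) for c in strong_control):
            continue  # also present in the control sample
        center = (p.start + p.end) // 2
        if not any(abs(center - t) <= max_tss_dist
                   for t in tss_by_chrom.get(p.chrom, [])):
            continue  # no TSS within the distance limit
        # Extend the region by the flank on each side of the center.
        kept.append(Peak(p.chrom, max(0, center - flank), center + flank, p.height))
    return kept

# Toy data (made-up coordinates): only the first treated peak passes all criteria.
treated = [Peak("chr1", 1000, 1100, 12), Peak("chr1", 5000, 5100, 7),
           Peak("chr1", 9000, 9100, 10), Peak("chr2", 100000, 100100, 9)]
control = [Peak("chr1", 9050, 9150, 6), Peak("chr1", 1050, 1150, 4)]
print(filter_treated_peaks(treated, control, {"chr1": [2000], "chr2": [50000]}))
```

The thresholds are exposed as parameters so that the effect of each criterion on the number of retained regions can be checked separately.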
Table 2. Top 20 enriched regions within 10 kb of a TSS.
Peak height  Gene id    Description                                                                Distance from TSS (bp)
17           NM_002985  CHEMOKINE (C-C MOTIF) LIGAND 5                                             9322
15           NM_032514  MICROTUBULE-ASSOCIATED PROTEIN 1 LIGHT CHAIN 3 ALPHA                       7289
14           BX538238   HYPOTHETICAL PROTEIN DKFZP686B0790                                         30
14           AF346307   CHROMOSOME 19 F379 RETINA SPECIFIC PROTEIN                                 -4555
14           NM_005484  POLY (ADP-RIBOSE) POLYMERASE FAMILY, MEMBER 2                              -356
14           NM_080833  CHROMOSOME 20 OPEN READING FRAME 151                                       -6130
14           BC050331   HYPOTHETICAL PROTEIN DKFZP434K191                                          -50
14           AK123337   HYPOTHETICAL PROTEIN MGC12760                                              -95
14           AK125239   SIMILAR TO RIKEN CDNA 4632412N22 GENE                                      -8043
13           X87871     HEPATOCYTE NUCLEAR FACTOR 4, ALPHA                                         3317
13           AK094414   ACYL-COA SYNTHETASE SHORT-CHAIN FAMILY MEMBER 1                            8863
13           NM_148961  OTOSPIRALIN                                                                7343
13           NM_152837  SORTING NEXIN 16                                                           434
13           BX161415   TETRATRICOPEPTIDE REPEAT DOMAIN 6                                          5781
13           NM_198441  FLJ40296 PROTEIN                                                           -419
13           NM_198951  TRANSGLUTAMINASE 2 (C POLYPEPTIDE, PROTEIN-GLUTAMINE-GAMMA-GLUTAMYLTRA...  759
12           NM_203448  HYPOTHETICAL PROTEIN LOC286286                                             -653
12           NM_000088  COLLAGEN, TYPE I, ALPHA 1                                                  -1895
12           AK131425   CDNA FLJ16545 FIS, CLONE OCBBF3004972                                      4979
12           NM_004455  EXOSTOSES (MULTIPLE)-LIKE 1                                                532
12           NM_001054  SULFOTRANSFERASE FAMILY, CYTOSOLIC, 1A, PHENOL-PREFERRING, MEMBER 2        2228
It is important to mention that the Forkhead Box G1B (FoxG1) gene, with peak height 9, appears amongst the most enriched regions. FoxG1 has been shown to regulate the expression of p21 (Seoane et al., 2004), a gene whose regulation determines TGF-β-mediated growth inhibition.
A histogram of the distance from all 590 peaks to the TSS reveals that most were immediately downstream of the TSS. More than 50 peaks were located in the 0 to +500 bp region of a TSS, suggesting that the peaks were preferentially situated close to TSSs and not randomly distributed (Figure 9).
Figure 9. Histogram of distance of peaks to TSSs.
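The distance histogram can be reproduced in outline with the sketch below. It assumes that distances are signed so that positive values lie downstream of the TSS in the direction of transcription, which requires the strand of each gene, and it uses 500 bp bins to match the 0 to +500 bp window discussed in the text. The bin width, the strand handling and the function names are assumptions for illustration, not a description of how Figure 9 was produced.

```python
from collections import Counter

def signed_distance(peak_center, tss_pos, strand):
    """Distance from a TSS to a peak center; positive means downstream of the TSS."""
    d = peak_center - tss_pos
    return d if strand == "+" else -d

def nearest_tss_distance(peak_center, tss_list):
    """Signed distance to the closest TSS among (position, strand) pairs
    on the same chromosome. Raises ValueError if tss_list is empty."""
    return min((signed_distance(peak_center, pos, strand) for pos, strand in tss_list),
               key=abs)

def histogram(distances, bin_width=500):
    """Count distances per bin; the bin labelled k covers [k, k + bin_width)."""
    return Counter((d // bin_width) * bin_width for d in distances)

# A few distances taken from Table 2, used here only to exercise the binning.
counts = histogram([30, -50, 434, 532, -356])
for bin_start in sorted(counts):
    print(bin_start, counts[bin_start])
```

Floor division keeps the bins contiguous across zero, so a distance of -50 falls in the bin from -500 to 0 rather than in the first downstream bin.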
Smad4-enriched regions contained an overrepresentation of Smad binding sites
To determine whether the regions occupied by the filtered peaks contained Smad binding motif sequences, the data were analyzed using RegionMiner software (Genomatix, www.genomatix.de). The software searched for all known TF binding motifs potentially contained in the submitted data, and the results were sorted by overrepresentation of those motifs compared to a set of background promoters (Table 3). The overrepresentation reflects the fold factor of match numbers in the regions compared to an equally sized sample of the background (i.e. found versus expected). The Smad family of transcription factors was reported among the top-ranked families, with an overrepresentation of 1.95, suggesting that Smad binding motifs occur in my data almost twice as often as expected by chance.
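The "found versus expected" ratio described above can be written out as a short calculation. The sketch assumes that the expected number of matches is obtained by scaling the background match count to the total length of the analyzed regions; how RegionMiner computes the value internally is not documented here, and the match counts in the example are invented purely to show the arithmetic.

```python
def overrepresentation(found_in_regions, region_bp, found_in_background, background_bp):
    """Fold of observed motif matches over the number expected in an
    equally sized sample of the background."""
    if region_bp <= 0 or background_bp <= 0:
        raise ValueError("sequence lengths must be positive")
    # Scale the background count down to the size of the analyzed regions.
    expected = found_in_background * region_bp / background_bp
    if expected == 0:
        raise ValueError("no background matches: the fold is undefined")
    return found_in_regions / expected

# Hypothetical counts: 590 regions of 500 bp each give 295,000 bp in total.
fold = overrepresentation(found_in_regions=39, region_bp=590 * 500,
                          found_in_background=2000, background_bp=29_500_000)
print(round(fold, 2))
```

A fold of 1 would mean that the motif occurs exactly as often as in the background, so values clearly above 1 indicate enrichment.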
Table 3. Overrepresentation of TF motifs contained in the sequenced samples
a