
Thesis Methods: Assessing the Biological Plausibility of Regulatory Hypotheses

Jonas Gamalielsson December 9, 2004

Abstract

Many algorithms that derive gene regulatory networks from microarray gene expression data have been proposed in the literature. The performance of such an algorithm is often measured by how well the derived genetic network can recreate the gene expression data from which it was derived. However, good performance of this kind does not necessarily mean that the regulatory hypotheses in the network are biologically plausible. We have therefore proposed a Gene Ontology based method for assessing the biological plausibility of regulatory hypotheses at the gene product level, using prior biological knowledge in the form of Gene Ontology (GO) annotation of gene products and regulatory pathway databases (Gamalielsson et al. 2005). Templates were designed to encode general knowledge, derived by generalizing from known interactions to typical properties of interacting gene product pairs. By matching regulatory hypotheses to templates, the plausible hypotheses can be separated from the implausible ones. This document elaborates on how the present method can be improved and extended.

1 Introduction

We have proposed a systematic method, based on general knowledge of regulatory pathways, for assessing the biological plausibility of hypotheses derived during regulatory network reconstruction (Gamalielsson et al. 2005). Our results demonstrated that the method is able to filter out a large proportion of potentially implausible hypotheses, thus greatly improving the specificity of the regulatory network reconstruction process. However, improvements and extensions are needed to make the method more useful. This report elaborates on these.

2 Method improvements

2.1 Alternate hypothesis scoring

Instead of creating templates containing GO terms, the semantic similarity between hypothetical relations and known model relations could be calculated. This would mean that a set of binary regulatory relations is derived from the set of known regulatory pathways and compared to a set of hypothetical regulatory relations. Each hypothesis "$GP^h_i \rightarrow GP^h_j$" could be compared to each model relation "$GP_k\,[\mathrm{reltype}]\,GP_l$" by computing a similarity score as:

\[ \mathrm{score} = \frac{s(GP_k, GP^h_i) + s(GP_l, GP^h_j)}{2} \tag{1} \]

where $s(GP_k, GP^h_i) \in [0, 1]$ is the semantic similarity (Lin 1998) between gene products $GP_k$ and $GP^h_i$. However, this approach would not give any information on generalisation; it would only yield a figure of similarity. The abstraction level variation of the GO templates (Gamalielsson et al. 2005) offers generalisation through the GO-score, which measures template specificity.
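The score in equation (1) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the toy ontology, the term probabilities, the gene product names, and in particular the choice to take the best-matching term pair when a gene product has several annotations, which the text does not specify.

```python
import math

# Hypothetical toy ontology (child term -> parent terms); not real GO terms.
PARENTS = {
    "root": [],
    "kinase": ["root"],
    "cyclin_dep_kinase": ["kinase"],
    "regulator": ["root"],
    "kinase_regulator": ["regulator"],
}

# Hypothetical annotation probabilities p(term); a real study would estimate
# them from an organism-specific annotation database.
PROB = {
    "root": 1.0,
    "kinase": 0.2,
    "cyclin_dep_kinase": 0.05,
    "regulator": 0.3,
    "kinase_regulator": 0.1,
}

# Hypothetical gene product annotations.
ANNOTATION = {
    "GP_A": ["kinase_regulator"],
    "GP_B": ["cyclin_dep_kinase"],
    "GP_C": ["regulator"],
    "GP_D": ["kinase"],
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lin_term_similarity(t1, t2):
    """Lin (1998): 2 * log p(c) / (log p(t1) + log p(t2)), where c is the
    most informative (lowest probability) common ancestor of t1 and t2."""
    common = ancestors(t1) & ancestors(t2)
    p_common = min(PROB[t] for t in common)
    denom = math.log(PROB[t1]) + math.log(PROB[t2])
    if denom == 0.0:  # both terms are the root term
        return 1.0
    return 2.0 * math.log(p_common) / denom


def gp_similarity(terms_a, terms_b):
    """Assumption: use the best-matching pair of annotation terms."""
    return max(lin_term_similarity(a, b) for a in terms_a for b in terms_b)


def relation_score(model, hypothesis, annotation=ANNOTATION):
    """Equation (1): mean similarity of the regulator pair and the target pair."""
    (gp_k, gp_l), (gp_hi, gp_hj) = model, hypothesis
    return (gp_similarity(annotation[gp_k], annotation[gp_hi])
            + gp_similarity(annotation[gp_l], annotation[gp_hj])) / 2.0
```

A hypothesis identical to the model relation scores 1, while a hypothesis whose gene products share only the root term with the model relation scores 0.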

It is also possible to explore slightly modified variants of the current GO-score measure, for example by improving the statistical foundation of templates and their GO-scores. p-values are already available for GO terms, based on annotation occurrences in an organism-specific annotation database. Hence, the expected probability that a specific GO term occurs by chance is available, and the GO-score of a template is the average value of (1 − p) for the participating terms.
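As a small worked illustration of the GO-score definition above, the following sketch averages (1 − p) over the terms of a template. The function name and the example p-values are hypothetical.

```python
def go_score(term_p_values):
    """Average of (1 - p) over the GO terms participating in a template."""
    if not term_p_values:
        raise ValueError("a template needs at least one term")
    return sum(1.0 - p for p in term_p_values) / len(term_p_values)


# A template with two terms whose (hypothetical) p-values are 0.1 and 0.3
# gets a GO-score of (0.9 + 0.7) / 2.
example_score = go_score([0.1, 0.3])
```

Rare terms (small p) push the score towards 1, which is why the GO-score can be read as a measure of template specificity.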

In a sense the GO-score is the complement of the p-value of terms. However, it may also be beneficial to introduce p-values for templates rather than terms. A possible approach could be to derive a set of templates T using a set of pathways P containing N gene products. Create all N · (N − 1) possible 2-permutations of gene products as binary relations, and add the top GO-score among the templates matching each relation to a set EXT. Figure 1 shows the probability pt of GO-scores exceeding a certain threshold for the EXT set derived using the S. cerevisiae cell cycle pathway. However, it may be a better idea to generate all possible 2-permutations for all gene products in the genome. A motivation for this would be that all gene products with known function for the specific organism are used when the term probability is derived. As there are 3997 genes with known function according to the GO-annotation database for S. cerevisiae, this would result in approximately 16 · 10^6 relations. The pt values for high scoring relations would most likely be reduced considerably, because many gene products are used that are not related to the pathway(s) used for template derivation. The probability pt is a function of GO-score, and it might actually replace the GO-score. An empirical study is required to investigate this matter more thoroughly.
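The construction of the EXT set and of pt can be sketched as follows. This is only an outline under stated assumptions: `top_template_score` is a hypothetical callable standing in for template matching, and pt is computed as the fraction of EXT scores at or above the threshold (the text says "exceeding", so a strict comparison may be intended).

```python
from itertools import permutations


def ordered_pairs(gene_products):
    """All N * (N - 1) ordered pairs (2-permutations) of gene products."""
    return list(permutations(gene_products, 2))


def build_ext(gene_products, top_template_score):
    """Collect the top GO-score of the templates matching each ordered pair.

    top_template_score is a hypothetical function (regulator, target) -> float.
    """
    return [top_template_score(a, b) for a, b in ordered_pairs(gene_products)]


def empirical_pt(ext_scores, threshold):
    """Fraction of EXT scores at or above the threshold (assumed definition)."""
    if not ext_scores:
        raise ValueError("EXT must not be empty")
    return sum(s >= threshold for s in ext_scores) / len(ext_scores)


# Size of the genome-wide variant mentioned in the text:
# 3997 gene products give 3997 * 3996 ordered pairs, roughly 16 * 10**6.
genome_wide_pairs = 3997 * 3996
```

For the genome-wide variant, materialising all pairs in a list is unnecessary; iterating over `permutations` directly keeps memory use small.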

[Figure 1: plot of pt(GO-score) against GO-score; panel title: EXT (n=4888).]

Figure 1: p-value at different GO-score thresholds for all 2-permutations of relations between the gene products of the S. cerevisiae cell cycle pathway.

2.2 Path templates

Only binary relations have been studied so far. It would also be possible to study paths as hypotheses, containing two or more relations and three or more gene products. Either GO based templates or semantic similarity (see previous subsection) could be used for this. It would be possible to use templates based on both binary relations and paths, where the path templates are more specific with respect to the pathways used for template derivation.

A path based approach could be achieved by creating a graph from the set of binary relation hypotheses and then extracting paths of varying lengths from the graph. Paths of varying lengths could also be extracted from the regulatory pathways of the pathway databases used, in order to derive path templates. Path template specificity increases with path length, and for this reason some kind of normalisation should be performed in order to obtain a template score that is independent of path length.
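A minimal sketch of the path extraction step is given below. It builds a directed graph from binary relation hypotheses and enumerates simple paths containing between two and three relations. The function names, the length bounds, and the use of a plain mean as the length normalisation are assumptions; the text leaves the normalisation open.

```python
from collections import defaultdict


def build_graph(relations):
    """Directed graph from binary hypotheses given as (regulator, target) pairs."""
    graph = defaultdict(list)
    for src, dst in relations:
        graph[src].append(dst)
    return graph


def simple_paths(graph, min_edges=2, max_edges=3):
    """Enumerate simple directed paths with min_edges..max_edges relations."""
    paths = []

    def extend(path):
        n_edges = len(path) - 1
        if n_edges >= min_edges:
            paths.append(tuple(path))
        if n_edges == max_edges:
            return
        for nxt in graph.get(path[-1], []):
            if nxt not in path:  # keep paths simple (no repeated gene product)
                extend(path + [nxt])

    for start in list(graph):
        extend([start])
    return paths


def normalised_path_score(relation_scores):
    """Assumed normalisation: mean of the per-relation scores along a path."""
    return sum(relation_scores) / len(relation_scores)
```

With the hypotheses A → B, B → C and C → D, the extracted paths are A-B-C, A-B-C-D and B-C-D. The number of simple paths can grow quickly in dense hypothesis graphs, so a small `max_edges` is advisable.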

2.3 Handling complex relations

Our current approach decomposes relations between gene product complexes into atomic binary relations between individual gene products. This leads to certain problems, as can be observed e.g. for the CLN3-CDC28 complex in figure 2: CLN3 regulates the CDC28 kinase, but the complex as a unit phosphorylates SWI6. Hence, it is not entirely true that CLN3 and CDC28 individually phosphorylate SWI6, as our method suggests. On the other hand it is partially true, because CLN3 helps to regulate SWI6, but not on its own. It is therefore desirable to extend our method so that complex relations can be handled without being decomposed into binary relations.

[Figure 2: pathway diagram; node labels: CLN3, CDC28, SWI4, SWI6, MBP1, CLN1, CLN2, CLB5, CLB6, SIC1, FUS3, FAR1, CDC20, CDC6; edge labels: +p, +e, +u.]

Figure 2: Part of the S. cerevisiae cell cycle pathway. +p = phosphorylation, +e = expression, +u = ubiquitination, and T-shaped arrows represent inhibition.

Deriving the complex templates is not a problem: the relation "{CLN3, CDC28} [phosphorylation] {SWI6}" would for example result in the basic template "{GO:0016538, GO:0004693} [phosphorylation] {GO:0003713}". However, sensible use of the complex template requires that the data mining algorithm is capable of deriving complex hypotheses. If binary relation hypotheses are generated, only partial matches to a complex template will occur. It would be possible to create a graph from the individual binary relation hypotheses derived by the data mining algorithm. The hypothesis graph could subsequently be searched for matches to the complex templates. However, this would mean that the search algorithm has to make an interpretation of the hypothesis graph in order to find complexes. This interpretation would be independent of the data mining algorithm used.
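The decomposition of a complex relation and the search for full or partial matches among binary hypotheses can be sketched as follows. The interpretation used here is an assumption, not the method itself: a complex relation counts as fully supported when every binary relation obtained by decomposing it appears among the hypotheses. The sketch works at the gene product level; matching against GO-term templates would additionally need the annotation of each gene product.

```python
def decompose(lhs, rel_type, rhs):
    """Decompose a complex relation into atomic binary relations."""
    return {(a, rel_type, b) for a in lhs for b in rhs}


def match_fraction(lhs, rel_type, rhs, binary_hypotheses):
    """Fraction of the decomposed binary relations found among the hypotheses.

    1.0 is a full match, values between 0 and 1 are partial matches
    (assumed interpretation of the hypothesis graph).
    """
    needed = decompose(lhs, rel_type, rhs)
    found = needed & set(binary_hypotheses)
    return len(found) / len(needed)


# The complex relation used as an example in the text.
complex_relation = ({"CLN3", "CDC28"}, "phosphorylation", {"SWI6"})

# Hypothetical output of a data mining algorithm that only produces
# binary hypotheses.
hypotheses = [("CDC28", "phosphorylation", "SWI6")]

partial = match_fraction(*complex_relation, hypotheses)
```

Here only one of the two decomposed relations is present, so the match is partial.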

Another matter is whether separate complex forming relations should be created from the gene products in the left hand side (LHS) and right hand side (RHS) of a complex relation. These relations are, however, not obviously regulatory. In the previously mentioned example this would result in the relations "CLN3 → CDC28" and "CDC28 → CLN3". But it is not necessarily a good idea to derive templates from complex forming relations and mix those templates with the regulatory templates. It may be the case that complex forming templates "dominate" the regulatory templates. A better alternative could be to apply hypotheses to the complex forming templates separately. There are also specific protein complex databases available for various organisms, which could be used to create complex forming relations.

2.4 Evidence quality and pathway knowledge base

Some additional experiments would be interesting to perform, using both the original method (Gamalielsson et al. 2005) and improved versions of the method.

• The effect of evidence quality. All evidence types were used in Gamalielsson et al. (2005). It would therefore be of interest to examine the impact on performance when different evidence types are used.

• Performance using a more diverse range of pathways. The results show that our method in its current state performs best when similar relations and gene products are used for template and hypothesis derivation. However, we expect that the generality of the method will improve if a more diverse set of pathways is used for template derivation. Furthermore, the number of regulatory pathways in KEGG is small, but additional pathway sources such as BioCarta (http://www.biocarta.com) for H. sapiens and M. musculus, MIPS (http://mips.gsf.de) for fungi and plants, and the SBML model repository (http://sbml.org) can extend the knowledge base. One problem with pathways originating from databases other than KEGG or the SBML model repository is that they are often available only in graphic format rather than a structured graph format. This increases the risk of misinterpretation when converting the graphic pathway diagrams into relations between gene products.

3 Method extensions

3.1 Expert user intervention

Our method aims to automatically identify the most biologically plausible hypotheses by a measure based on GO term specificity, but it would also be beneficial to have the top scoring hypotheses assessed by a domain expert in order to reduce the number of false positives. However, the intention is not to perform a rigorous empirical investigation to explore the utility of an expert user. A future tool featuring the method could provide functionality for expert user intervention.

3.2 Other sources of prior knowledge

As was shown in Gamalielsson et al. (2005), it is often the case that basic templates derived from GO terms at the annotation level have lower scores than variant templates at higher abstraction levels. A way to get better accuracy for our method could also be to incorporate other sources of biological knowledge, such as databases containing transcription factor binding site information and protein interactions. Transcription factor binding site information is specific knowledge rather than general, because the information is valid only for particular regulatory proteins. This kind of prior knowledge has been used before for gene regulatory network derivation (Hartemink et al. 2002).

3.3 Data mining algorithm integration

Instead of applying the hypothesis assessment method as a post-processing stage, it is of interest to investigate how the method can be integrated into the data mining algorithm that derives the hypotheses. In this way the hypothesis assessment method could drive the data mining process. This could result in improved data mining quality and efficiency. The details regarding how the integration is done depend on the choice of data mining algorithm. One possible class of algorithms to target is dynamic Bayesian network (DBN) techniques.

So far, this class of algorithms has been applied to microarray gene expression data. A crucial question is what kinds of regulatory relations are likely to be observable from gene expression data. It has been claimed that we are only likely to observe regulation at the transcriptional level, i.e. regulatory proteins that control the transcription rate of mRNA for genes of an organism (Husmeier 2003, Knudsen 2002). This would correspond only to the expression relation type in our model. But protein to protein relations could be derived by also using proteomics related methods and knowledge such as gene interaction maps, rather than only gene chip technology (Knudsen 2002).

3.4 Application to alternative hypothesis types

In the future we may be interested in exploring how the method can be modified to be useful in the analysis of other types of data mining hypotheses: classificatory rules derived using rule induction algorithms (e.g. association rules and decision trees) applied to non-temporal microarray gene expression data, clusters of co-expressed genes derived using clustering algorithms applied to temporal expression data, and gene collections showing significant fold change between different experimental conditions.

3.5 Adapting the method to biochemical pathways

Only regulatory pathways have been studied so far. However, the majority of the models used in pathway simulation software (e.g. PathwayLab) are biochemical reaction networks. An important question is therefore whether it is possible to adapt the template method (Gamalielsson et al. 2005) to biochemical pathways as well. Regulatory pathways are at a higher abstraction level, consisting of a set of regulatory relations between gene products, whereas biochemical pathways are more fine grained and consist of a set of interconnected reactions between chemical compounds (see figure 3). Reactions can be catalyzed by enzymes.

[Figure 3: diagram; panel A) shows two gene products (GP, GP); panel B) shows a substrate S, an enzyme E and a product P.]

Figure 3: Comparison of relations in regulatory and metabolic pathways. A) A relation between two gene products in a regulatory pathway. The relation is usually unidirectional and is of a specific type (e.g. expression or phosphorylation). B) An enzymatic reaction where a substrate S generates a product P and is catalyzed by an enzyme E. A reaction can be either unidirectional (irreversible) or bidirectional (reversible). Substrates and products are different types of chemical compounds.

The transition from regulatory pathways to biochemical pathways is complicated by the fact that the template method relies heavily on the Gene Ontology (GO) annotation of gene products rather than compounds.

Another question concerns applicability. What are templates of biochemical reactions good for? What kind of hypotheses are supposed to be assessed, and how are these hypotheses derived? The purpose of the original template method is to assess the biological plausibility of regulatory relations derived by applying different kinds of data mining algorithms (e.g. DBN approaches) to microarray gene expression data. Even at this level it is generally not possible to derive all kinds of regulatory relations from microarray data alone (see section 3.3). The fine grained biochemical networks are simply not observable from the available biological data.

However, one possibility is to use regulatory hypotheses like those derived by DBN techniques, and compare them to templates created from enzyme to enzyme interactions found in the LIGAND database in KEGG (Goto et al. 1998). If the idea of generalization is abandoned, it is possible to create a graph of feasible chemical reactions using the LIGAND database. This graph can for example be used to derive alternative reaction paths given a start substrate and end product, or to compute possible reaction paths given a set of enzymes (Goto et al. 1997). The reaction graph can also be used to identify missing enzymes in pathways.
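Deriving alternative reaction paths from such a reaction graph can be sketched with a breadth-first search. This is an illustrative outline only: the reactions, compound names and enzyme names are hypothetical placeholders rather than LIGAND entries, and it is not the procedure of Goto et al. (1997).

```python
from collections import defaultdict, deque


def reaction_paths(reactions, start, end, max_steps=4):
    """Enumerate reaction paths from a start substrate to an end product.

    reactions is an iterable of (substrate, enzyme, product) triples.
    Paths are returned shortest first; no compound is visited twice.
    """
    outgoing = defaultdict(list)
    for substrate, enzyme, product in reactions:
        outgoing[substrate].append((enzyme, product))

    found = []
    queue = deque([(start, [], {start})])
    while queue:
        compound, steps, visited = queue.popleft()
        if compound == end and steps:
            found.append(steps)
            continue
        if len(steps) == max_steps:
            continue
        for enzyme, product in outgoing.get(compound, []):
            if product not in visited:
                queue.append((product,
                              steps + [(compound, enzyme, product)],
                              visited | {product}))
    return found


# Hypothetical reaction set: a direct route and a two-step route from S to P.
toy_reactions = [("S", "E1", "X"), ("X", "E2", "P"), ("S", "E3", "P")]
routes = reaction_paths(toy_reactions, "S", "P")
```

Restricting `reactions` to those catalyzed by a given set of enzymes before the search gives the "possible reaction paths given a set of enzymes" variant mentioned above.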

Pathway Miner (Panday et al. 2004) is an example of a tool which uses a database of known pathways (regulatory, cellular and metabolic) in the assessment of results from microarray gene expression analysis. Genes identified in the expression data can be mapped to known pathways, and association networks of gene products can be derived for genes co-occurring in the pathways. However, the tool does not attempt to generalize from the known pathways and does not use directed regulatory hypotheses like the method we propose.

4 Conclusion

A number of different ways to improve and extend the current method for assessing the biological plausibility of regulatory hypotheses have been discussed. It is believed that the method and its extensions will be useful for biologists when analysing results from data mining of gene expression data and other data.

References

Gamalielsson, J., Olsson, B. and Nilsson, P. (2005). A gene ontology based method for assessing the biological plausibility of regulatory hypotheses, Manuscript in preparation.

Goto, S., Bono, H., Ogata, H., Fujibuchi, W., Nishioka, T., Sato, K. and Kanehisa, M. (1997). Organizing and computing metabolic pathway data in terms of binary relations, Pacific symposium on biocomputing, pp. 175–186.

Goto, S., Nishioka, T. and Kanehisa, M. (1998). LIGAND: chemical database for enzyme reactions, Bioinformatics 14: 591–599.

Hartemink, A., Gifford, D., Jaakkola, T. and Young, R. (2002). Combining location and expression data for principled discovery of genetic regulatory networks, Pacific symposium on biocomputing, pp. 437–449.

Husmeier, D. (2003). Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks, Bioinformatics 19: 2271–2282.

Knudsen, S. (2002). A biologist's guide to analysis of DNA microarray data, Wiley, New York.

Lin, D. (1998). An information-theoretic definition of similarity, Proceedings of the 15th international conference on machine learning, Morgan Kaufmann, San Francisco, CA, pp. 296–304.

Panday, R., Guru, R. K. and Mount, D. W. (2004). Pathway Miner: extracting gene association networks from molecular pathways for predicting the biological significance of gene expression microarray data, Bioinformatics (Applications Note) 20: 2156–2158.
