Linköping Studies in Science and Technology
Dissertation No. 1035
Integration of Biological Data
by
Vaida Jakonienė
Department of Computer and Information Science
Linköpings universitet
SE-581 83 Linköping, Sweden
Data integration is an important procedure underlying many research tasks in the life sciences, as often multiple data sources have to be accessed to collect the relevant data. The data sources vary in content, data format, and access methods, which often vastly complicates the data retrieval process. As a result, the task of retrieving data requires a great deal of effort and expertise on the part of the user. To alleviate these difficulties, various information integration systems have been proposed in the area. However, a number of issues remain unsolved and new integration solutions are needed. The work presented in this thesis considers data integration at three different levels. 1) Integration of biological data sources deals with integrating multiple data sources from an information integration system point of view. We study properties of biological data sources and existing integration systems. Based on the study, we formulate requirements for systems integrating biological data sources. Then, we define a query language that supports queries commonly used by biologists. Also, we propose a high-level architecture for an information integration system that meets a selected set of requirements and that supports the specified query language. 2) Integration of ontologies deals with finding overlapping information between ontologies. We develop and evaluate algorithms that use life science literature and take the structure of the ontologies into account. 3) Grouping of biological data entries deals with organizing data entries into groups based on the computation of similarity values between the data entries. We propose a method that covers the main steps and components involved in similarity-based grouping procedures. The applicability of the method is illustrated by a number of test cases. Further, we develop an environment that supports comparison and evaluation of different grouping strategies.
The work is supported by the implementation of: 1) a prototype for a system integrating biological data sources, called BioTRIFU, 2) algorithms for ontology alignment, and 3) an environment for evaluating strategies for similarity-based grouping of biological data, called KitEGA.
Many people have supported my graduate work and made this PhD thesis possible.
I am grateful to my supervisor, Associate Professor Patrick Lambrix, for his support and guidance during this work. His constructive comments and our many conversations brought insight and helped to shape this thesis. His encouragement, patience, and devotion as a teacher helped me to grow as a researcher. I am glad that I had the opportunity to work with him.
I would like to express my appreciation to Professor Nahid Shahmehri for providing valuable comments, pointing out important aspects of the research world and giving support during this work.
The members of IISLAB (Laboratory for Intelligent Information Systems) created a stimulating and supportive working environment. I am thankful for their friendship over all these years. Especially, I want to mention my student colleagues: Shanai Ardi, Ioan Chisalita, Claudiu Duma, Almut Herzog, Dennis Maciuszek, He Tan, Eduard Turcan and Cécile Åberg.
This work would have been much harder without the support of my family, relatives and friends. I would especially like to express my gratitude to my Mum and Dad for caring so much about me, and for welcoming me with such warmth when I returned home to Lithuania. I would also like to thank my sister for being such a generous and joyful person. I am very lucky to have her. The friends I made in Linköping made my stay in Sweden much more enjoyable. In particular, I want to thank my dearest friends Akvile, Aleksandra and Joe. I greatly value their company and our conversations.
This research work was funded by CUGS (the national graduate school in computer science). I also acknowledge the financial support of the EU Network of Excellence REWERSE (Sixth Framework Programme project 506779).
Vaida Jakoniene Linköping, September 2006
This thesis contains revised versions of the following papers.
1. Lambrix P, Jakoniene V. Towards transparent access to multiple biological databanks. Proceedings of the First Asia-Pacific Bioinformatics Conference, pp 53-60, Adelaide, Australia, 2003.
2. Jakoniene V, Lambrix P. Information integration systems for biological data sources: requirements and opportunities. Submitted.
3. Jakoniene V, Lambrix P. Ontology-based integration for bioinformatics. Proceedings of the VLDB Workshop on Ontologies-based techniques for DataBases and Information Systems - ODBIS 2005, pp 55-58, Trondheim, Norway, 2005.
4. Tan H, Jakoniene V, Lambrix P, Aberg J, Shahmehri N. Alignment of Biomedical Ontologies using Life Science Literature. Proceedings of the International Workshop on Knowledge Discovery in Life Science Literature, pp 1-17, Singapore, 2006. LNBI 3886.
5. Jakoniene V, Rundqvist D, Lambrix P. A method for similarity-based grouping of biological data. Proceedings of the 3rd International Workshop on Data Integration in the Life Sciences - DILS06, pp 136-151, Hinxton, UK, 2006. LNBI 4047.
6. Jakoniene V, Lambrix P. A Tool for Evaluating Strategies for Grouping of Biological Data. Submitted.
Related Papers
The following are related research articles not included in the thesis.
1. Doms A, Jakoniene V, Lambrix P, Schroeder M, Wächter T. Ontologies and Text Mining as a Basis for a Semantic Web for the Life Sciences. Reasoning Web, Second International Summer School, Springer-Verlag, pp 164-183, 2006. LNCS 4126.
2. Jakoniene V. A Study in Integrating Multiple Biological Data Sources. Licentiate thesis No 1149, Linköpings universitet, Sweden, 2005.
3. Lambrix P, Tan H, Jakoniene V, Strömbäck L. Biological Ontologies. Chapter in Baker CJO, Cheung KH (eds) Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences, Springer, 2006. To appear.
4. Strömbäck L, Jakoniene V, Tan H, Lambrix P. Representing, storing and accessing molecular interaction data: a review of models and tools. Briefings in Bioinformatics, 2006. Invited contribution. To appear.
Other
1. Backofen R, Badea M, Barahona P, Berndtsson M, Burger A, Dawelbait G, Doms A, Fages F, Hotaran A, Jakoniene V, Krippahl L, Lambrix P, McLeod K, Nutt W, Olsson B, Schroeder M, Schroiff A, Soliman S, Tan H, Tilivea D, Will S. Requirements and specification of use cases. REWERSE Deliverable A2-D3, 2005.
2. Backofen R, Badea M, Barahona P, Burger A, Dawelbait G, Doms A, Fages F, Hotaran A, Jakoniene V, Krippahl L, Lambrix P, McLeod K, Möller S, Nutt W, Olsson B, Schroeder M, Soliman S, Tan H, Tilivea D, Will S. Usage of bioinformatics tools and identification of information sources. REWERSE Deliverable A2-D2, 2005.
3. Jakoniene V, Nilsson R. Abstract Book of the Fourth Swedish Bioinformatics Workshop for PhD students and PostDocs, Linköping, Sweden, 2003.
Introduction
Motivation
Problem Statement
Contributions
Paper Summaries
Related Work
Future Work
References
1 Motivation
Researchers in areas such as medicine, agriculture and environmental sciences intensively use the available biological data to answer different research questions or to solve various tasks [CGG03]. One of the main goals is to understand how various organisms function as biological systems. To achieve this goal, it is important to explore functions and interactions of genome-encoded components. This type of knowledge may be used for different purposes. For instance, it is used to identify genes responsible for a disease, to develop drugs enabling treatment of diseases and to predict organisms' responses to a drug.
The significance of these areas, the worldwide interest and the available tools and techniques caused the generation of an enormous amount of biological data, such as DNA and protein sequences, gene regulatory and protein interaction networks, and secondary and tertiary structures of molecules. This data is spread over a large number of autonomous data sources that are often publicly available on the Web. For instance, 858 data sources are listed in the 2006 Database Issue of the Nucleic Acids Research [NAR] journal. As the data sources are developed and supported independently by different groups and organizations, they are highly heterogeneous in various aspects. For example, the data sources vary in the type of the stored data, the data format, and access methods. Further, there is a terminology discrepancy at the schema and data levels. In addition to data sources, a large number of bio-ontologies describing domain knowledge are publicly available in the area [LTJ06]. For instance, OBO [OBO], an umbrella web address for ontologies covering the genomics and proteomics domains, lists 29 orthogonal ontologies. Some of the ontologies have reached the status of de facto standard and are used extensively to annotate the data sources.
Data integration is an important procedure underlying many research tasks in the life sciences, as often multiple data sources have to be accessed to collect the relevant data. For instance, to find publications describing a given disease that relates to a certain type of sequences may require analysis of data sources for publications, diseases and sequences together with some other data sources combining these types of information [LMN04]. To support health care applications by using results in functional genomics, the integration of clinical data and genomic data is important [MIN04].
steps are performed to acquire the data: data sources that contain relevant data are selected, queries over each data source are formulated and decisions are made on how to combine the results. To find relevant data sources, the user has to be acquainted with the content of different data sources. To formulate a query and decide on how to execute the query, the user has to be familiar with the ways the data sources support data retrieval and how data at different sources relate to each other. To execute the query, the user has to know the location of the data sources that are spread over the Web, the different query languages and data formats. During query execution the user may need to translate the data between different formats and combine the results. A mistake in any of these steps may either result in inefficient query execution or not finding results. The process is also time consuming since a large amount of data is usually processed. Data retrieval may take a long time, e.g. when tools are used to acquire the results. As biological data sources change often and data sources appear and disappear, the user has to be aware of these changes.
To alleviate these difficulties various information integration solutions have been proposed. Specialized integration solutions focus on solving a single task based on a set of relevant data sources. In contrast, general purpose information integration systems aim to support a broad range of tasks and integration of various data sources. Such systems may provide a common interface through which a user accesses multiple data sources. In this case the location and different query languages of the data sources are hidden from the user. Other types of information integration systems even hide the integrated data sources from the user. During query processing, these systems also handle the selection of data sources that are relevant to the query. However, new integration solutions are needed to better support life science researchers in their tasks. A number of open issues remain in the available integration solutions. For instance, it may be difficult to integrate new data sources into the existing systems or to reuse the systems for new tasks. Furthermore, solutions are lacking for managing incomplete and incorrect data, and for handling semantic heterogeneity. For solving some of the problems specialized solutions have to be developed while in other cases developments in other areas could be adapted.
This thesis focuses on data integration at three different levels. This includes integration of biological data sources, integration of ontologies and integration or grouping of biological data entries. Integration of biological
when they want to use biological data sources to find relevant information for their research and analyzes ways of dealing with these problems in combination from an information integration system point of view. Further, two specific tasks in integrating biological data are dealt with. Integration of ontologies deals with finding overlapping information between ontologies. This includes finding relationships, called alignments, between the related terms in the ontologies. Grouping of biological data entries deals with organizing data entries into groups based on the computation of similarity values between the data entries. Grouping of data entries is an abstraction of the problem of finding entries that represent the same entity in different data sources, which is a basic operation for integrating the data entries.
2 Problem Statement
The work presented in this thesis aims to develop approaches and techniques that alleviate the challenges met when using and integrating biological data, and in particular, the heterogeneity present at different levels in the data and data sources. The thesis focuses on the identification and analysis of the available knowledge about data and data sources, and the development of mechanisms that use the available knowledge for integration of biological data. To achieve these goals, we focus on the following tasks in the thesis.
2.1 Integration of biological data sources
In this thesis we deal with a few aspects in the context of integrating biological data sources: requirements and query languages for information integration systems, and the use of ontologies for integrating the data sources. Although a number of information integration solutions have been proposed in the life sciences, little research has been performed on the requirements for such systems. Such a study of requirements is needed as the complexity of the life sciences, the tasks to be solved, the style of the scientific research and the properties of the available data sources pose special requirements for information integration systems in the area. Further, the difference in focus of the existing information integration systems together with different design and development choices has led to systems that often support a unique query language. The variety of the available query languages makes it difficult to select between the query languages and to
important to know a subset of query language operators that should be present in any query language for integrating biological data sources, for instance, to support the development of new integration solutions. In addition, in recent years some solutions were proposed for using ontologies in information integration systems. However, this is still done in a limited way and only a small part of the possible ontology-based knowledge is currently used.
In this thesis we focus on:
• Study of requirements for systems providing integrated access to biological data sources with focus on systems providing virtual integration of data sources, i.e. preserving autonomy of data sources.
• Specification of a query language that allows formulation of different types of queries commonly used by biologists.
• Specification of a high-level architecture for an information integration system that meets a selected set of requirements and that supports the specified query language.
• Design and development of a prototype for the information integration system. The system should conform to the high-level architecture and enable deeper exploration of issues related to query processing over multiple biological data sources.
• Identification of the types of ontological knowledge publicly available in the area of life sciences and study of how this knowledge could be used to enhance current integration approaches.
2.2 Integration of ontologies
The task of aligning ontologies is not well explored and is considered to be one of the major issues in the life sciences [CGG03]. A number of alignment strategies are proposed, but further research and development of new strategies are needed [LT06a, LT06b]. For instance, not much work has been done on ontology alignment using life science literature as a resource for finding alignments. Also, not many strategies use information about the structure of the ontologies.
In this thesis we focus on:
• Study how the structure of ontologies could be used in ontology alignment.
2.3 Grouping of biological data entries
Many tools for analyzing biological data use some form of grouping and are used in, for instance, data integration, data cleaning, prediction of protein functionality, and correlation of genes based on microarray data. A number of aspects influence the quality of the grouping results: the data sources, the grouping attributes and the algorithms implementing the grouping procedure. Many methods exist, but it is often not clear which methods perform best for which grouping tasks. The study of the properties, and the evaluation and the comparison of the different aspects that influence the quality of the grouping results, would give us valuable insight in how the grouping procedures could be used in the best way. It would also lead to recommendations on how to improve the current procedures and develop new procedures. To be able to perform such studies and evaluations we need environments that allow us to compare and evaluate different grouping strategies.
In this thesis we focus on:
• Specification of a method that covers the main steps and components that should be included in environments.
• Design and development of a prototype for an environment supporting the evaluation of similarity-based grouping procedures. The environment should be based on the defined method.
3 Contributions
The main contributions of the thesis are the following:
Integration of biological data sources
• Study of biological data sources. The results are presented in papers 1 and 2. Paper 2 extends the work done in paper 1.
• Identification of requirements for information integration systems for biological data sources. Paper 2 presents and discusses the requirements.
• Study of current information integration systems for biological data sources with respect to the identified requirements. The work is included in paper 2.
• Proposal for a query language and architecture for the BioTRIFU¹ system. The contributions appear in paper 1.
As a feasibility study and to get an overview of issues related to query processing over multiple biological data sources, a subset of the defined query language and the ideas included in the architecture definition were implemented in a prototype. The prototype supports the main steps and components needed to integrate two data sources that can be accessed at different locations. For details we refer to [Jak05].
• Identification of ontological knowledge and its use in information integration systems for biological data sources. Paper 3 discusses the results.
• Proposal of an ontology-based approach for information integration systems for biological data sources. The approach is presented in paper 3.
Integration of ontologies
• Development and evaluation of algorithms for ontology alignment. The algorithms use life science literature and take the structure of the ontologies into account. The contributions are described in paper 4.
The ontology alignment algorithms were implemented and incorporated into the SAMBO system [LT06a].
Grouping of biological data entries
• Proposal of a method for similarity-based grouping of biological data. The method is introduced in paper 5.
As a feasibility study, two grouping tasks were implemented and analyzed through a number of test cases.
• Development and implementation of KitEGA, an environment for evaluating strategies for similarity-based grouping of biological data. The environment is based on the proposed method. The tool and its use are presented in paper 6.
The current implementation of KitEGA supports the specification of test cases through the use of plug-ins and user interfaces, and provides a number of user interfaces supporting analysis of the grouping results.
¹ The Right Information For yoU in Bioinformatics
4 Paper Summaries
In this section we give short summaries of the six papers included in this thesis. Papers 1, 2 and 3 deal with integration of biological data sources, with paper 3 focusing on ontology-based integration. Paper 4 deals with integration of ontologies. Papers 5 and 6 deal with grouping of biological data entries.
Paper 1: Towards transparent access to multiple biological databanks
In paper 1 we discuss common problems met by the users of biological data sources. The discussion includes a study of current biological data sources. Based on the observations, the paper proposes a base query language that contains operators that should be present in any query language for biological data sources. Further, the paper presents an architecture for a system supporting such a language and enabling transparent and integrated access to biological data sources.
Paper 2: Information integration systems for biological data sources: requirements and opportunities
In paper 2 requirements for information integration systems in the area of bioinformatics are identified. This paper extends the study of problems and requirements identified in paper 1. First, we study biological data sources and identify their properties that make querying multiple biological data sources a difficult task. Then, we formulate requirements for information integration systems for biological data sources. We also discuss how well current information integration systems satisfy these requirements and identify opportunities for future research.
Paper 3: Ontology-based integration for bioinformatics
In paper 3 we argue that the current approaches for integrating biological data sources should be enhanced by ontological knowledge. We identify
(ontologies, ontology alignments, annotations, mappings between data values and ontological terms) and propose an approach to use this knowledge to support integrated access to multiple biological data sources. We also show that current ontology-based integration approaches only cover parts of our approach.
Paper 4: Alignment of biomedical ontologies using life science literature
In paper 4 we propose strategies for aligning ontologies based on life science literature. We propose a basic algorithm as well as extensions that take the structure of the ontologies into account. We evaluate the strategies and compare them with strategies implemented in the alignment system SAMBO. We also evaluate the combination of the proposed strategies and the SAMBO strategies.
Paper 5: A method for similarity-based grouping of biological data
In paper 5 a method for similarity-based grouping is proposed. As the main steps the method contains specification of grouping rules, pairwise grouping between entries, actual grouping of similar entries, and evaluation and analysis of the results. Often, different strategies can be used in the different steps. The method enables exploration of the influence of the choices and supports evaluation of the results with respect to given classifications. The grouping method is illustrated by test cases based on different strategies and classifications. The results show the complexity of the similarity-based grouping tasks and give deeper insights in the selected grouping tasks, the analyzed data source, and the influence of different strategies on the results.
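The four steps named in this summary can be illustrated with a minimal Python sketch. It is not the implementation from paper 5. The function names, the token-based Jaccard similarity, the use of connected components for the actual grouping, and purity as the evaluation measure are all assumptions chosen for illustration, as are the toy entries and classes.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for any similarity function."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def pairwise_step(entries, rule, threshold):
    """Pairwise step: return the pairs of entry ids judged similar under the rule."""
    return [(i, j) for (i, x), (j, y) in combinations(entries.items(), 2)
            if rule(x, y) >= threshold]

def grouping_step(entries, similar_pairs):
    """Grouping step: merge similar pairs into groups (connected components)."""
    parent = {i: i for i in entries}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in similar_pairs:
        parent[find(i)] = find(j)
    groups = {}
    for i in entries:
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())

def purity(groups, classification):
    """Evaluation step: share of entries in the majority class of their group."""
    total = sum(len(g) for g in groups)
    hits = 0
    for g in groups:
        labels = [classification[i] for i in g]
        hits += max(labels.count(label) for label in set(labels))
    return hits / total if total else 0.0

# Hypothetical data entries and a hypothetical reference classification.
entries = {
    "e1": "serine protease inhibitor",
    "e2": "protease inhibitor serine type",
    "e3": "zinc finger protein",
    "e4": "zinc finger transcription factor",
}
classification = {"e1": "enzyme", "e2": "enzyme", "e3": "binding", "e4": "binding"}

# Grouping rule: Jaccard similarity on the description with a threshold of 0.5.
pairs = pairwise_step(entries, jaccard, threshold=0.5)
groups = grouping_step(entries, pairs)
```

Changing the threshold or swapping in another similarity function changes the resulting groups, which is the kind of influence of strategy choices that the method is meant to expose.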
Paper 6: A tool for evaluating strategies for grouping of biological data
In paper 6 we present KitEGA, an environment supporting the evaluation of grouping strategies. Based on the method presented in paper 5, we propose a framework for comparative evaluation of strategies for grouping data and present its current implementation. Further, we illustrate the use of KitEGA by comparing grouping strategies for
5 Related Work
5.1 Integration of biological data sources
Requirements for general purpose information integration systems for biological data sources on the Web were discussed in [DOB95], [Kar96], [Won02] and [HK04]. The first two papers were written a decade ago. Since then, the area of life sciences has evolved fast: many more data sources and tools are publicly available and new tasks have to be solved. While some of the earlier defined requirements for information integration systems are still valid in the changed environment, other requirements need to be reconsidered and new requirements need to be specified. The more recent paper [Won02] argues for a general purpose information integration system that supports core functionality needed for information integration in life sciences. Therefore, the defined requirements do not cover some of the issues specific to the area. The authors of [HK04] point out a few high level requirements for information systems emphasizing the need to automate a maximum number of tasks while minimizing the amount of time and interactions for the user. The requirements provided in [HK04] are in line with the requirements specified in paper 2. In paper 2 the requirements are specified at a more detailed level by looking at different information integration aspects and focusing on systems providing virtual integration of data sources.
Within the area of life sciences several integration approaches have been proposed and systems have been implemented. This includes systems based on database technology, i.e. virtual and materialized (data warehouses) integration approaches. Also, systems based on the Semantic Web, web services, grid and agent technologies are developed. In this thesis we focused on issues related to virtual integration. For an overview of such systems see paper 2. For solving specialized tasks, the use of warehouses is a widely adopted integration solution (e.g. [TRM05]). In recent years Semantic Web technologies have been used for resolving scalability, heterogeneity and reusability problems in the life sciences. In these approaches biological data and knowledge is represented using Semantic Web languages, e.g. XML, RDF and OWL [Muk05]. A number of studies are conducted to explore integrated use of data represented in these formats, e.g. [CYS05] and [SLD06]. Also, the use of ontologies is proposed to resolve semantic heterogeneity problems and to support knowledge discovery based on biological data [Gar05]. Further, work is ongoing in applying web services and grid
are example projects based on these technologies. Also, agent technology is shown to be useful for meeting integration challenges in the life sciences. The authors in [KBB04] argue that advanced communication supported by agent technology can complement the Semantic Web and grid technologies.
Some of the available information integration systems use ontology-based technologies to support querying (e.g. BACIIS [MWL03], KIND [LGM03], SEMEDA [KPL03] and TAMBIS [GSN01]). A common feature is that the integrated schemas used in these systems are seen as ontologies. In contrast, in the approach described in paper 3, we expect ontologies to be agreed upon and shared by many users [Lam04]. As in our approach, the integrated schemas include domain knowledge and information on data structures at the data sources. All the systems use the maintained ontology to describe the content of data sources. Though it is not explicitly stated, cross-references between data sources are probably used to join the retrieved data items. KIND uses two ontologies describing static and process knowledge, respectively. The ontologies combine domain knowledge from neuroanatomy and neurophysiology. In SEMEDA controlled vocabularies can be used to specify semantics of data type values. Also, data source content descriptions can be refined with integrated schema terms. Ontological annotations and mappings between ontology terms are not taken into account in any of the systems.
5.2 Integration of ontologies
Different strategies can be used to perform alignment of ontologies. [LT06b] describes a general strategy for aligning two ontologies. One of the main component types is a matcher responsible for computing similarities between the terms from the different source ontologies. The matchers can implement strategies based on linguistic matching, structure-based strategies, constraint-based approaches, instance-based strategies, strategies that use auxiliary information or a combination of these. By using different matchers and combining and filtering the results in different ways we obtain different alignment strategies. Tools for ontology alignment are discussed in [LT06a]. Some ontology alignment and merging systems provide alignment strategies using literature, such as ArtGen [MW02], FCA-Merge [SM01] and OntoMapper [PPF02]. Also, there are systems that implement alignment algorithms
existence of previously aligned concepts. For instance, Anchor-PROMPT [NM01] determines the similarity of concepts by the frequency of their appearance along the paths between previously aligned concepts. The paths may be composed of any kind of relations. Also SAMBO as described in [LT05] provides such a component where the similarity between concepts is augmented based on their location in the is-a hierarchy relative to already aligned concepts. In contrast, the methods proposed in this thesis do not require previously aligned concepts.
OntoMapper implements the most similar approach to the strategies described in paper 4. OntoMapper provides an ontology alignment algorithm using Bayesian learning. A set of documents (abstracts of technical papers taken from ACM's digital library and Citeseer) is assigned to each concept in the ontologies. Two raw similarity score matrices for the ontologies are computed by the Rainbow text classifier. The similarity between the concepts is calculated based on these two matrices using the Bayesian method. When analyzing the structure of the ontologies, OntoMapper does not require previously aligned concepts and takes the documents from the sub-concepts into account when computing the similarity between two concepts. However, as this is hard-coded in the method, it is not clear how the structure of the ontologies influences the result of the computation.
In contrast to most other approaches, [CTL06] uses the structural information not to compute similarity between ontological terms, but as a method for filtering wrong results generated by matchers. The approach gives good results when many initial suggestions are available and the time for filtering is often only a small fraction of the time for the similarity computation.
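The general alignment strategy referred to in this subsection (matchers compute similarities between terms of two ontologies, and the results are then combined and filtered) can be sketched as follows. This is an illustrative sketch only, not the SAMBO implementation. The two toy matchers, the weighted-sum combination, the threshold filter, and the example terms are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple

Term = dict                      # hypothetical term record: {"name": str, "parents": [str]}
Ontology = Dict[str, Term]       # term id -> term record
Matcher = Callable[[Term, Term], float]

def _jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def name_matcher(t1: Term, t2: Term) -> float:
    """Toy linguistic matcher: token overlap between the term names."""
    return _jaccard(set(t1["name"].lower().split()), set(t2["name"].lower().split()))

def parent_matcher(t1: Term, t2: Term) -> float:
    """Toy structure-based matcher: overlap between the names of the parents."""
    return _jaccard({p.lower() for p in t1["parents"]}, {p.lower() for p in t2["parents"]})

def align(o1: Ontology, o2: Ontology,
          weighted_matchers: List[Tuple[Matcher, float]],
          threshold: float) -> List[Tuple[str, str, float]]:
    """Combine matcher scores by a weighted sum and keep pairs above the threshold."""
    total_weight = sum(w for _, w in weighted_matchers)
    suggestions = []
    for id1, t1 in o1.items():
        for id2, t2 in o2.items():
            score = sum(w * m(t1, t2) for m, w in weighted_matchers) / total_weight
            if score >= threshold:  # filtering step
                suggestions.append((id1, id2, score))
    return sorted(suggestions, key=lambda s: -s[2])

# Hypothetical fragments of two anatomy ontologies.
o1 = {"A1": {"name": "nasal cavity", "parents": ["nose"]},
      "A2": {"name": "ear drum", "parents": ["ear"]}}
o2 = {"B1": {"name": "nasal cavity", "parents": ["nose"]},
      "B2": {"name": "tympanic membrane", "parents": ["ear"]}}

matchers = [(name_matcher, 1.0), (parent_matcher, 1.0)]
suggestions = align(o1, o2, matchers, threshold=0.5)
```

In this toy run the pair A2/B2 is suggested only because of the structure-based matcher, since the two names share no tokens. Raising the threshold to 0.6 removes it, which shows how the choice of matchers, weights and filter yields different alignment strategies.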
5.3 Grouping of biological data entries
There are two kinds of related work: evaluations of grouping algorithms and tools for supporting evaluation of grouping algorithms.
A number of evaluations of different kinds of grouping algorithms have been performed. For instance, regarding clustering of gene expression data [YHR01] proposes a measure to estimate the predictive power of a clustering algorithm and compares two partitional and three hierarchical clustering algorithms based on this measure. [DD03] proposes three validation strategies and compares six algorithms. Also [GSS03] proposes a new validation measure and compares four clustering methods. Five biclustering methods for
evaluations is the fact that they focus on cluster validation for the evaluation and comparison of algorithms. They use synthetic and real data sources. Some of the papers also aim to propose new validation measures. Further, in all these evaluations, most of the evaluated algorithms needed to be re-implemented for the purpose of the evaluations.
[CRF03] presents the SecondString Toolkit for name-matching methods which could be used, for instance, in duplicate detection. Several distance functions for strings are implemented. The algorithms are compared on a dataset regarding non-interpolated average precision.
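For readers unfamiliar with the measure, non-interpolated average precision can be sketched as below, using the standard information retrieval definition: candidate pairs are ranked by decreasing similarity, and the precision at each rank that holds a correct pair is averaged over all correct pairs. Whether [CRF03] normalizes in exactly this way is an assumption here, and the function name and example pairs are hypothetical.

```python
def average_precision(ranked_pairs, correct_pairs):
    """Non-interpolated average precision of a ranking of candidate pairs.

    ranked_pairs: candidate pairs ordered from most to least similar.
    correct_pairs: set of pairs that are true matches.
    """
    hits = 0
    precision_sum = 0.0
    for rank, pair in enumerate(ranked_pairs, start=1):
        if pair in correct_pairs:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(correct_pairs) if correct_pairs else 0.0

# Hypothetical ranking produced by some string distance function.
ranking = [("IL2", "IL-2"), ("IL2", "ILK"), ("TP53", "p53")]
truth = {("IL2", "IL-2"), ("TP53", "p53")}
ap = average_precision(ranking, truth)  # (1/1 + 2/3) / 2
```

A ranking that places all correct pairs before any incorrect pair scores 1.0, so the measure rewards distance functions that separate true matches from non-matches without requiring a fixed threshold.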
A system that goes some way into providing an environment for clustering and validation is the Machaon Cluster Validation Environment [BAC05]. This system is intended for clustering of microarray data and evaluating the quality of the obtained clusters. The system focuses on cluster validation for new data sets and therefore uses internal measures based on compactness and isolation. The system implements several clustering algorithms, metrics (distance), and internal measures [BA03]. The user can choose among these to run a cluster task on a data set. The results are shown as a tree. The highest level nodes represent the chosen cluster algorithms with particular parameter selection. The next level represents the results of applying different validity measures to the clusters generated by the algorithm.
The framework and system (KitEGA) that we propose in papers 5 and 6 aim to go one step further. KitEGA is a platform for evaluating and comparing similarity-based grouping strategies. Evaluators can plug in their own algorithms related to the grouping strategies and the evaluation measures, as well as their own data sets. KitEGA then provides the support for running the algorithms, and summarizing and analyzing the results.
6 Future Work
6.1 Integration of biologi al data sour es
Asweobservedinse tion5.1thefo usoftheresear honintegratingdatain the lifes ien es is reorienting from the useof lassi al database approa hes to the useof web and Semanti Web te hnologies. [Muk05℄ mentions hal-lenges to make the best use of the new te hnologies. First, most of the biologi al dataand knowledgeshould be available intheSemanti Web. To a hieve this, tools supporting automati extra tion of biologi al data from
Semantic Web are research prototypes. Further studies are needed on how to extend these prototypes into systems supporting real-world applications for effective retrieval of information and discovery of hidden knowledge on the Semantic Web. For instance, to guarantee scalability, inference engines available for querying the Semantic Web and graph theory based algorithms used to explore associations between objects on the Semantic Web may have to be reconsidered.
Paper 2 enumerates other challenges for information integration systems for the life sciences. To allow users to view and specify different types of information, more powerful modules for supporting interaction between the users and information integration systems are needed. Also, the need for further research on how to resolve semantic heterogeneity is emphasized. For instance, the available approaches, like the ontology-based data integration approach proposed in paper 4, could be tested in the context of the real Semantic Web. Also, paper 2 states the need for tools supporting the development and maintenance of information integration systems. Such tools are essential to cope with the scale and dynamics of the life sciences.
6.2 Integration of ontologies
Alignment and merging of ontologies is an important research topic and new systems and strategies for ontology alignment should be developed. More studies are needed that explore which strategies work well for which types of ontologies, and a system such as KitAMO [LT06c] can provide a good environment to perform these studies. In the future we will see an increase of available alignments between ontologies. This will provide a type of ontological information that can be used in, for instance, data integration as discussed in paper 3. Further, there are efforts to promote interoperability of ontologies, such as the OBO Foundry where it is required that the ontologies use relations which are unambiguously defined following the pattern of definitions defined in the OBO Relation Ontology [SCK05]. The results of such efforts will provide information that should be taken into account during the alignment process.
There are a number of issues related to the algorithms in paper 4 that would be interesting to further investigate. A limitation of our algorithms is that abstracts of research articles are only classified to one concept. We want to extend our strategies by allowing abstracts to be classified to 0, 1 or more
Regarding the structure, the ontologies in the current experiments are reasonably simple taxonomies. We want to investigate whether the structure-based strategies lead to similar results for other types of ontologies. Further, our matchers could be enhanced to use synonyms and domain knowledge.
6.3 Grouping of biological data entries
Similarity-based grouping of data entries is not a trivial task. In order to find the most suitable grouping strategies for given tasks, tools are needed to support the evaluation and comparison of different grouping procedures. An example of such a system is KitEGA (paper 6). We intend to extend the current KitEGA implementation in several ways. We will extend the system to fully comply with our framework. Further, we will provide a number of libraries for components that are common. This could include, for instance, different evaluation measures or grouping methods. We will also use KitEGA for studies in data integration.
References
[BA03] Bolshakova N, Azuaje F. Cluster validation techniques for genome expression data. Signal Processing, 83:825-833, 2003.
[BAC05] Bolshakova N, Azuaje F, Cunningham P. An integrated tool for microarray data clustering and cluster validity assessment. Bioinformatics, 21(4):451-455, 2005.
[CGG03] Collins F, Green E, Guttmacher A, Guyer M. A Vision for the Future of Genomics Research. Nature, 422:835-847, 2003.
[CRF03] Cohen W, Ravikumar P, Fienberg S. A comparison of string metrics for matching names and records. Proceedings of the KDD Workshop on Data Cleaning and Object Consolidation, 2003.
[CTL06] Chen B, Tan H, Lambrix P. Structure-based filtering for ontology alignment. Proceedings of the IEEE WETICE Workshop on Semantic Technologies in Collaborative Applications, 2006.
[CYS05] Cheung KH, Yip KY, Smith A, Deknikker R, Masiar A, Gerstein M. YeastHub: a semantic web use case for integrating data in the life
[DD03] Datta S, Datta S. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics, 19(4):459-466, 2003.
[DOB95] Davidson S, Overton C, Buneman P. Challenges in Integrating Biological Data Sources. Journal of Computational Biology, 2(4):557-572, 1995.
[Gar05] Gardner SP. Ontologies and semantic data integration. Drug Discovery Today, 10(14):1001-1007, 2005.
[GSN01] Goble CA, Stevens R, Ng G, Bechhofer S, Paton N, Baker P, Peim M, Brass A. Transparent access to multiple bioinformatics information sources. IBM Systems Journal, 40(2), 2001.
[GSS03] Gat-Viks I, Sharan R, Shamir R. Scoring clustering solutions by their biological relevance. Bioinformatics, 19(18):2381-2389, 2003.
[HK04] Hernandez T, Kambhampati S. Integration of biological sources: Current systems and challenges. ACM SIGMOD Record, 33(3):51-60, 2004.
[Jak05] Jakoniene V. A Study in Integrating Multiple Biological Data Sources. Licentiate thesis No 1149, Linköpings universitet, Sweden, 2005.
[Kar96] Karp P. A strategy for database interoperation. Journal of Computational Biology, 2(4):573-586, 1996.
[KBB04] Karasavvas KA, Baldock R, Burger A. Bioinformatics integration and agent technology. Journal of Biomedical Informatics, 37(3):205-219, 2004.
[KPL03] Köhler J, Philippi S, Lange M. SEMEDA: ontology based semantic integration of biological databases. Bioinformatics, 19(18):2420-2427, 2003.
[Lam04] Lambrix P. Ontologies in Bioinformatics and Systems Biology. Chapter 8 in Dubitzky W, Azuaje F (eds) Artificial Intelligence Methods and Tools for Systems Biology, Springer, pp 129-146, 2004.
Media-Critchlow T (eds) Bioinformatics: Managing Scientific Data, Morgan Kaufmann Publishers, pp 335-370, 2003.
[LMN04] Lacroix Z, Murthy H, Naumann F, Raschid L. Links and Paths through Life Science Data Sources. Proceedings of the International Workshop on Data Integration in the Life Sciences, pp 203-211, 2004. LNCS 2994.
[LT05] Lambrix P, Tan H. A Framework for Aligning Ontologies. Proceedings of the Workshop on Principles and Practice of Semantic Web Reasoning, pp 17-31, 2005. LNCS 3703.
[LT06a] Lambrix P, Tan H. SAMBO - A System for Aligning and Merging Biomedical Ontologies. Journal of Web Semantics, Special issue on Semantic Web for the Life Sciences, 2006.
[LT06b] Lambrix P, Tan H. Ontology alignment and merging. Chapter in Burger A, Davidson D, Baldock R (eds) Anatomy Ontologies for Bioinformatics: Principles and Practice, Springer, 2006. To appear.
[LT06c] Lambrix P, Tan H. A Tool for Evaluating Ontology Alignment Strategies. Journal on Data Semantics, VIII, 2006. To appear.
[LTJ06] Lambrix P, Tan H, Jakoniene V, Strömbäck L. Biological Ontologies. Chapter in Baker CJO, Cheung KH (eds) Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences, Springer, 2006. To appear.
[MIN04] Martin-Sanchez F, Iakovidis I, Norager S, Maojo V, de Groen P, Van der Lei J, Jones T, Abraham-Fuchs K, Apweiler R, Babic A, Baud R, Breton V, Cinquin P, Doupi P, Dugas M, Eils R, Engelbrecht R, Ghazal P, Jehenson P, Kulikowski C, Lampe K, De Moor G, Orphanoudakis S, Rossing N, Sarachan B, Sousa A, Spekowius G, Thireos G, Zahlmann G, Zvarova J, Hermosilla I, Vicente F. Synergy between medical informatics and bioinformatics: facilitating genomic medicine for future health care. Journal of Biomedical Informatics, 37:30-42, 2004.
[Muk05] Mukherjea S. Information retrieval and knowledge discovery utilising a biomedical Semantic Web. Briefings in Bioinformatics, 6(3):252-62,
[MW02] Mitra P, Wiederhold G. Resolving terminological heterogeneity in ontologies. Proceedings of the ECAI Workshop on Ontologies and Semantic Interoperability, 2002.
[MWL03] Miled ZB, Webster YW, Liu Y, Li N. An Ontology for Semantic Integration of Life Science Web Databases. International Journal of Cooperative Information Systems, 12(2):275-294, 2003.
[NAR] NAR. Nucleic Acids Research. http://nar.oupjournals.org
[NM01] Noy N, Musen M. Anchor-PROMPT: Using Non-Local Context for Semantic Matching. Proceedings of the IJCAI Workshop on Ontologies and Information Sharing, pp 63-70, 2001.
[OBO] OBO. Open Biomedical Ontologies. http://obo.sourceforge.net/
[PBZ06] Prelić A, Bleuler S, Zimmermann Ph, Wille A, Bühlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E. A systematic comparison and evaluation of biclustering methods for gene expression. Bioinformatics, 22(9):1122-1129, 2006.
[PPF02] Prasad S, Peng Y, Finin T. Using Explicit Information To Map Between Two Ontologies. Proceedings of the AAMAS Workshop on Ontologies in Agent Systems, 2002.
[SCK05] Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall CJ, Neuhaus F, Rector A, Rosse C. Relations in Biomedical Ontologies. Genome Biology, 6(5):R46, 2005.
[SLD06] Stephens S, LaVigna D, DiLascio M, Luciano J. Aggregations of Bioinformatics Data Using Semantic Web Technology. Journal of Web Semantics, 4(3), 2006.
[SM01] Stumme G, Mädche A. FCA-Merge: Bottom-up merging of ontologies. Proceedings of the International Joint Conferences on Artificial Intelligence, pp 225-230, 2001.
[SRG03] Stevens RD, Robinson AJ, Goble CA. MyGrid: personalised bioinformatics on the information grid. Bioinformatics, 19(1):i302-i304, 2003.
[TRM05] Trißl S, Rother K, Müller H, Steinke T, Koch I, Preissner R, Frömmel C, Leser U. Columba: An Integrated Database of Proteins, Struc-
[VS05] Vyas H, Summers R. Interoperability of bioinformatics resources. VINE: The journal of information and knowledge management systems, 35(3):132-139, 2005.
[WL02] Wilkinson MD, Links M. BioMOBY: an open source biological web services proposal. Briefings in Bioinformatics, 3(4):331-41, 2002.
[Won02] Wong L. Technologies for integrating Biological Data. Briefings in Bioinformatics, 3(4):389-404, 2002.