
An XML-based Database of Molecular Pathways

by David Hall
LITH-IDA-EX--05/051--SE
2005-06-02


An XML-based Database of Molecular Pathways

by David Hall
LITH-IDA-EX--05/051--SE

Supervisor: Dr. Lena Strömbäck

Dept. of Computer and Information Science at Linköpings universitet

Examiner: Dr. Lena Strömbäck

Dept. of Computer and Information Science at Linköpings universitet


Avdelning, Institution / Division, Department
Datum / Date
Språk / Language: Engelska/English
Rapporttyp / Report category: Examensarbete
URL för elektronisk version
ISBN
ISRN
Serietitel och serienummer / Title of series, numbering
ISSN
Titel / Title
Författare / Author
Sammanfattning / Abstract
Nyckelord / Keywords

Research on protein-protein interactions produces vast quantities of data and there exist a large number of databases with data from this research. Many of these databases offer the data for download on the web in a number of different formats, many of them xml-based.

With the arrival of these xml-based formats, and especially the standardized formats such as psi-mi, sbml and Biopax, there is a need for searching in data represented in xml. We wanted to investigate the capabilities of xml query tools when it comes to searching in this data. Due to the large datasets we concentrated on native xml database systems, which in addition to searching in xml data also offer storage and indexing specially suited for xml documents.

A number of queries were tested on data exported from the databases IntAct and Reactome using the XQuery language. Both simple and advanced queries were performed. The simpler queries consisted of tasks such as listing information on a specified protein or counting the number of reactions.

One central issue with protein-protein interactions is to find pathways, i.e. series of interconnected chemical reactions between proteins. This problem involves graph searches, and since we suspected that the complex queries it required would be slow we also developed a C++ program using a graph toolkit.

The simpler queries were performed relatively fast. Pathway searches in the native xml databases took a long time even for short searches, while the C++ program achieved much faster pathway searches.

ADIT, Dept. of Computer and Information Science, 581 83 LINKÖPING

2005-06-02

LITH-IDA-EX--05/051--SE

http://www.ep.liu.se/exjobb/ida/2005/dd-d/051/

An XML-based Database of Molecular Pathways / En XML-baserad databas för molekylära reaktioner

David Hall

XML, native XML databases, XQuery, protein-protein interactions, pathway search


Research on protein-protein interactions produces vast quantities of data and there exist a large number of databases with data from this research. Many of these databases offer the data for download on the web in a number of different formats, many of them xml-based.

With the arrival of these xml-based formats, and especially the standardized formats such as psi-mi, sbml and Biopax, there is a need for searching in data represented in xml. We wanted to investigate the capabilities of xml query tools when it comes to searching in this data. Due to the large datasets we concentrated on native xml database systems, which in addition to searching in xml data also offer storage and indexing specially suited for xml documents.

A number of queries were tested on data exported from the databases IntAct and Reactome using the XQuery language. Both simple and advanced queries were performed. The simpler queries consisted of tasks such as listing information on a specified protein or counting the number of reactions.

One central issue with protein-protein interactions is to find pathways, i.e. series of interconnected chemical reactions between proteins. This problem involves graph searches, and since we suspected that the complex queries it required would be slow we also developed a C++ program using a graph toolkit.

The simpler queries were performed relatively fast. Pathway searches in the native xml databases took a long time even for short searches, while the C++ program achieved much faster pathway searches.

Keywords: XML, native XML databases, XQuery, protein-protein interactions, pathway search


this thesis work, for welcoming me and giving an insight into the academic world. I would especially like to thank my supervisor and examiner Lena Strömbäck for her guidance and constructive comments throughout the work.

I would also like to thank my co-worker Anders Bovin, who did a related thesis work at the same time as me. Discussing problems and alternative solutions, as well as things not related at all to the subjects of bioinformatics and databases, has been much appreciated.

Finally I would like to thank my opponent Mikael Albertsson for his valuable feedback.


1 Introduction 1
1.1 Background 1
1.2 Problem overview 1
1.3 Purpose 2
1.4 Thesis outline 2
1.5 Document conventions 3

2 XML - Extensible Markup Language 5
2.1 Background 5
2.1.1 Advantages of XML 7
2.1.2 Drawbacks with XML 8
2.1.3 Data vs. document 9
2.2 Validation 10
2.2.1 DTD 10
2.2.2 XML Schema 11
2.2.3 Relax NG 12
2.3 Namespaces 14
2.4 Meta-data 15
2.4.1 RDF 16
2.4.2 OWL 16
2.5 Links 16
2.6 XML APIs 17
2.7 Transforms 19
2.7.1 XSLT 19
2.8 Query 19
2.8.1 XPath 19
2.8.2 History 20
2.8.3 Update capabilities 21
2.8.4 XQuery 21
2.9 Databases in XML 26
2.9.1 Different types of XML databases 27
2.9.2 Native XML databases 28
2.9.3 Indices 30
2.9.4 Normalization 31
2.9.5 Referential integrity 32
2.9.6 Performance 32
2.9.7 Output/API 32
2.9.8 NXD Models 33
2.9.9 Implementations of native XML databases 33
2.10 Summary 35

3 Bioinformatics 37
3.1 Genes 37
3.2 Proteins 38
3.3 Pathways 38
3.4 Experimental methods 39
3.4.1 Two-hybrid systems 39
3.4.2 Phage-display systems 39
3.4.3 Curated data 40
3.5 Databases 40
3.5.1 KEGG 41
3.5.2 DIP 41
3.5.3 MINT 42
3.5.4 BIND 42
3.5.5 Reactome 42
3.5.6 IntAct 42
3.6 Proposed standard formats 43
3.6.1 SBML 43
3.6.2 PSI MI 45
3.7 Proprietary exchange formats 47
3.7.1 KGML 47
3.7.2 XIN 47
3.7.3 BIND 47
3.8 Summary 48

4 Problem analysis 51
4.1 Questions to be answered 51
4.1.1 Query capability 51
4.1.2 Efficiency 52
4.2 Chosen datasets 52
4.2.1 Databases 53
4.2.2 Queries 53
4.3 Chosen technologies 55
4.3.1 Native XML databases and XQuery 55
4.3.2 The Graph Template Library 56

5 Native XML database setup 59
5.1 Native XML databases 60
5.1.1 Exist 60
5.1.2 Sedna 60
5.1.3 X-Hive 61
5.1.4 Qizx/open 61
5.1.5 Java 61
5.1.6 Machine setup 61
5.2 Queries 62
5.2.1 Type of queries and efficiency 62
5.2.2 Description of queries 62
5.2.3 XML serialization 65
5.3 Test framework 65
5.4 Benchmarking 66

6 GTL test setup 69
6.1 The GTL package 69
6.2 Transformation 69
6.3 The program 70
6.3.1 Removal of extraneous edges 71
6.3.2 Control of reachability and leaf deletion 71
6.3.3 Path search 72
6.4 Benchmark methods 73

7 Results 75
7.1 Queries on IntAct data 75
7.2 Queries on Reactome data 77
7.3 Premature technique 79

8 Discussion 81
8.1 Conclusions 81
8.2 Future work 82
8.2.1 More formats 82
8.2.2 Data integration 83
8.2.3 Data integration with OWL 83
8.2.4 Using live remote data 83
8.2.5 XQuery graph support 84
8.2.6 User interface 85

Bibliography 87

Appendix

A System specifications 103
A.1 Software 103
A.2 Hardware 103

B Figures 105

C File listings 109
C.1 Transformation 109
C.2 IntAct XQueries 110
C.3 Reactome XQueries 113


Introduction

This chapter presents a motivation for this thesis work, followed by a description of the problem to be solved and the actual objectives. Finally, a section is dedicated to the structure of this report.

1.1 Background

The research of protein-protein interactions is an area that, like others within molecular biology, produces vast quantities of data. There exist a number of different databases with information on protein interactions, but they often have incompatible data formats. This makes it harder to make use of all publicly available data.

The need for exchange of this data has resulted in at least three different proposals for standards: sbml, psi mi and Biopax; in addition, several exchange formats for specific databases have emerged. Many of these are based on xml (Extensible Markup Language).

1.2 Problem overview

Despite the fact that these formats solve the problem of extracting data from different databases, the problem of different data being available in different formats remains. To make use of the information contained in these files, integration of data from different sources and search functionality are needed.

A traditional method would be to import the data into a database for further processing. These operations can, however, be done directly on the xml files, thus eliminating the need for database import.

1.3 Purpose

The purpose of this thesis work is to investigate how search and discovery can be done directly on xml files containing information on protein-protein interactions. As a part of this work we want to:

- Test different proposed standard formats for protein-protein interactions.

- Evaluate different existing xml tools and their applicability on protein interactions.

- Build a basis for a working demonstration system.

1.4 Thesis outline

Chapters 2 and 3 contain a literature study of the areas of xml and bioinformatics respectively; both these chapters have a summary at the end. In Chapter 4 the problem is analyzed, and Chapter 5 describes the testing system using native xml databases. In Chapter 6 the implementation of a specialized graph search program is discussed, followed by a presentation and evaluation of the results in Chapter 7. Chapter 8 summarizes the report and gives examples of future work.

The reader of this thesis is expected to have good knowledge of databases and the World Wide Web and some knowledge of biology.


1.5 Document conventions

Text in a non-proportional typeface (example) denotes data contents, function names or operators.


XML - Extensible Markup Language

This chapter introduces some background on the xml format as well as information on xml databases and related technologies such as xml query languages.

2.1 Background

xml, Extensible Markup Language [W3C98], is a document format devised by the World Wide Web Consortium in the second half of the 1990s. The format is more general than html [W3C95] (Hypertext Markup Language) and allows storage of all types of documents, not just hypertext documents for the web, as well as other types of data storage. In xml, unlike html, content is separated from presentation. Both xml and html are built on sgml [ISO86] (Standard Generalized Markup Language), an international standard for a metalanguage describing languages for the markup of electronic texts in a device-independent and system-independent way. Both xml and html documents are conforming sgml documents. The reason for designing xml as a subset of sgml may have been to gain acceptability within the community, but some [Kay03] think the main benefit of xml being just a small subset of sgml is the simplicity of the specification.

xml is used as a metalanguage: xml-based formats are specializations of xml, where just a few things are defined. Unlike with html, users are not given a fixed set of tags to use; instead they can define their own tags to describe the data. There exist a number of xml-based standards where an authority within the field has specified what tags to use and in what order.

By design, only a few things are specified for xml itself: enclosing of elements (all tags must be closed), how attributes are defined, how to refer to namespaces, and naming (what characters are allowed in a tag name). These rules are called a grammar, and documents following this grammar are said to be well-formed. An xml document following a specification (such as a dtd or xml Schema) of allowed tag names and values is said to be valid. An xml document contains a number of elements. These elements can be nested, and since a single root element is required, an xml document is essentially a tree. Every element must be closed, which means it is easy to see not just where an element starts but also where it ends. This means that the document can be parsed without special knowledge of the tags.

Listing 2.1: xml example

<?xml version="1.0"?>
<document>
  <toc />
  <chapter numbering="no">
    <section>Introduction to...</section>
  </chapter>
  <chapter>
    <title>My 2nd chapter</title>
    <section>
      Continuing the text...
    </section>
  </chapter>
</document>

The first line of an xml document (as in Listing 2.1) is an xml declaration specifying the version of xml being used. document is the root element, and the level under it consists of one toc element and two chapter elements. Here the element toc has no content and is therefore opened and closed in the same tag by finishing off with a slash before the end bracket, >. Since xml can be used as a document format, the order is important.
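The tree structure of Listing 2.1 can be made concrete by parsing it with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree module; this is our own illustrative choice, not a tool used in the thesis:

```python
import xml.etree.ElementTree as ET

# The document from Listing 2.1 as a string.
doc = """<?xml version="1.0"?>
<document>
  <toc />
  <chapter numbering="no">
    <section>Introduction to...</section>
  </chapter>
  <chapter>
    <title>My 2nd chapter</title>
    <section>Continuing the text...</section>
  </chapter>
</document>"""

root = ET.fromstring(doc)       # parse into an in-memory tree
print(root.tag)                 # the single required root element: document
for child in root:              # children are kept in document order
    print(child.tag, child.attrib)
```

Running this prints the root tag followed by one toc child and two chapter children, the first carrying the numbering="no" attribute, confirming that the document is essentially a tree with a single root.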

2.1.1 Advantages of XML

There are a number of advantages of xml. Most of them have to do with the structure, both logical and visual, which makes reading of the files easier for humans and machines and allows extensions without breaking compatibility. The following is a number of advantages as described in [Hol01, MBK+00, CRZ03]:

Structure The requirement that every element must be closed means that the document can be parsed without special knowledge of the tags. The structure of data stored in xml format can be anywhere from highly structured data to semi-structured data as well as unstructured documents.

Machine readable xml was designed to be easy for computers to process. The strict limitations of the syntax make the file easy to parse without ambiguity. Using xml Schema information about data types and rdf meta-data (see Section 2.4.1), further processing of the information can be done automatically without human intervention.

Human readable Using sensible tag names for elements gives a self-describing markup, called auto-descriptive, making it possible to read and even write directly in xml format. This makes it possible to create documents for which authoring or viewing applications do not yet exist. Introduction of too many attributes or namespaces may however make the xml file hard to read.

Extensible If an xml-based standard is missing tags for expressing some data, the document can be extended by introducing a new namespace. The new tags then use a namespace different from the default namespace used by the rest of the document. Old tools will still be able to read the files and just skip the information they do not understand.


External links Instead of having to include all information in the same document, other xml documents can be referenced by linking. Such a document can be a neighboring file or a document from a totally different organization.

Transforms Tools, especially xslt, for transforming xml data allow easy transformation of data from one format to another. Thus the world does not need to agree on one common schema; organizations can develop their own and use transforms to the schema the other party is using when communicating. Transforms are also an important reason why xml is being adopted in the publishing industry for information that is distributed on a number of mediums, e.g. news being published both on paper and on the web.

Standard W3C is an organization recognized for developing recommendations for the World Wide Web. Many of the information technology industry's larger companies are members of W3C. There is widespread support for xml, mainly because of xml's simplicity. Many standards using xml have been created, as well as tools for creating, viewing and processing xml files, some of them described later in this chapter.

2.1.2 Drawbacks with XML

The most obvious drawbacks with xml arise from its non-binary, verbose contents. While this makes the data easier to read, it leads to problems.

Verbose One main disadvantage of xml is its verbosity [Hol01]. The requirement of start tags and end tags for delimiting data gives a large overhead. This results in high memory demand and slow transmission times. Therefore xml is not suitable for large data sets, such as seismic data [Cok05] or data from computer tomography. Binary extensions for xml are under development and will probably fix some of these issues.

Messy With complex documents it can be hard to manually read and edit in xml-based formats. One indication of this is the alternative compact syntaxes that have emerged, such as Notation 3 [BL01] for rdf data and the Relax ng Compact syntax [rel02].

Slow Mainly due to the verbosity, but also because of the lack of physical pointers [Bou04] within the file, it is slow to parse xml files. By using external indices, xml parsers can speed up continuous parsing, as described in Section 2.9.3.

2.1.3 Data vs. document

The world of xml and databases has sometimes been divided into data-centric and document-centric parts by the xml community [Bou04]. Since xml documents are not strictly data- or document-centric, but somewhere in between, this is a skewed view.

Data-centric documents use xml purely as a data transport; they are designed to be read by a computer. Data-centric documents have a fairly regular structure, fine-grained data and little or no mixed content. The order of sibling elements is almost never significant.

Most often, data from xml files is stored in a traditional database and imported and exported by third-party middleware or by functions built into the database (xml-enabled).

Document-centric documents are often designed to be read by humans as well as computers. They have a less regular structure, the data is coarser grained, and there is usually mixed content. The order of sibling elements is often important. Document-centric documents are often handwritten in some format and converted to xml, or written in xml directly. The data does not originate from a database.

The distinction between data- and document-centric documents is not always clear. Many files, e.g. genetic and other biological data, are data-centric but semi-structured. Semi-structured data is irregular and can have a rapidly changing structure, a case where relational databases are not suitable. xml's capability of storing semi-structured data (as well as highly structured and unstructured data) is one of its strengths.


2.2 Validation

If an xml document conforms to a schema it is said to be valid. A schema specifies what an xml document of the current sort should look like. There are at least four different levels [vdV01] of validation that are implemented to various degrees in schema languages:

- Validation of markup - the structure of the document.

- Validation of content in individual nodes (datatyping).

- Validation of integrity of links between nodes and between documents.

- Other tests ("business rules").

Of the four, support for link validation, especially between documents, is poor in most schema languages.

2.2.1 DTD

The xml version of dtd (Document Type Definition) is a simplified version of the dtd found in the sgml standard, and thus has an xml-like but non-xml syntax. It does not support namespaces, and it has a weak datatype system that only works for attributes. dtd can check internal links within a document.

Listing 2.2: dtd example

<!ELEMENT document (toc,chapter+)>
<!ELEMENT toc EMPTY>
<!ELEMENT chapter (section|title)*>
<!ELEMENT section (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ATTLIST chapter numbering (yes|no) "yes">

This dtd describes the structure of the xml document in Listing 2.1. A valid document according to the dtd can contain the root element document with one toc element followed by one or more chapter elements. The toc element is empty, while a chapter element consists of any number of section or title elements. The chapter elements also have an attribute, numbering, with the default value "yes".

2.2.2 XML Schema

xml Schema [W3C01b] is a successor to dtd, developed by W3C, with support for namespaces, a richer datatyping system and extensible vocabularies. It is, however, complex: it has a large number of features and an xml syntax with a large number of attributes and heavy nesting, which makes it hard to learn and hard to use. xml Schema supports integrity checks of internal links in documents.

Listing 2.3: xml Schema example

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="document">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="toc" />
        <xs:element ref="chapter" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="chapter">
    <xs:complexType>
      <xs:choice>
        <xs:element ref="section" />
        <xs:element ref="title" />
      </xs:choice>
      <xs:attribute name="numbering" type="xs:NMTOKEN" use="optional" default="yes"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="section">
    <xs:complexType mixed="true" />
  </xs:element>
  <xs:element name="title">
    <xs:complexType mixed="true" />
  </xs:element>
  <xs:element name="toc" type="xs:string" />
</xs:schema>

The xml Schema in Listing 2.3 describes the same document structure as the dtd in Listing 2.2. The difference, except for the more verbose syntax, is that the xml Schema lacks a definition of allowed values for the numbering attribute.

2.2.3 Relax NG

Relax ng [Rel01] is a schema language based on Relax, a product of a Japanese iso standard technical report written by Murata Makoto, and trex by James Clark, who was technical lead for the xml 1.0 recommendation, co-author of the xsl and XPath recommendations and has developed a number of sgml and xml parsers. Relax ng is currently being developed into an iso standard.

The language was developed to be "simple, easy to learn, use xml syntax, . . . support namespaces . . . and can partner with a separate datatyping language" [Cla03]. It does not support integrity checks of links except by using features of an external datatype system, such as W3C's xml Schema. Relax ng is easier to grasp than xml Schema, largely because of the separation of structure and datatypes.

There is also an alternative Relax ng syntax, the Compact syntax [rel02, Fit02], which is a non-xml syntax in some ways similar to the dtd syntax. A schema written in compact syntax can be translated to the Relax ng xml syntax.


Listing 2.4: Relax ng example

<?xml version="1.0"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="document">
      <optional>
        <element name="toc">
        </element>
      </optional>
      <oneOrMore>
        <element name="chapter">
          <attribute name="numbering">
            <text />
          </attribute>
          <element name="title">
            <text />
          </element>
          <zeroOrMore>
            <element name="section">
              <text />
            </element>
          </zeroOrMore>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>

Listing 2.5: Relax ng Compact example

element document {
  element toc { empty }?,
  element chapter {
    attribute numbering { text }?,
    element title { text }?,
    element section { text }
  }+
}


Listing 2.4 shows that the Relax ng syntax resembles the original xml document. Another obvious difference from xml Schema is how the number of allowed elements is defined; it is possible to define this in Relax ng using the occurs attribute (like the xml Schema maxOccurs), but using the shorthand zeroOrMore is often easier. The compact syntax in Listing 2.5 somewhat resembles the dtd in Listing 2.2, but here nesting is used to determine which elements and attributes are allowed in other elements. Neither of the two Relax ng examples has default attribute values, because attribute defaults cannot be defined in Relax ng (except with a special extension using namespaces) [Smi01, Cla01].

2.3 Namespaces

A namespace [Sri][W3C99a] is used to avoid naming conflicts when several vocabularies are used simultaneously. Namespaces are used in xml technologies like xslt (to discriminate xslt's own tags from tags to be output in the resulting document). The namespace is a vocabulary of elements and attributes where an iri (Internationalized Resource Identifier, an internationalized version of uri) is used to identify the namespace.

A namespace can be declared anywhere in the document. If declared in the root node it will apply to the entire document; if declared elsewhere it will apply to the element where it is declared and elements nested within that element. The declaration binds a prefix to the namespace, except for the default namespace declaration, which defines what namespace to use when a prefix is missing. The prefix is then used to connect namespaces with elements and attributes.

To define what elements and attributes there are in a namespace an xml Schema Definition (xsd) can be written.

Listing 2.6: Namespace example

<?xml version="1.0"?>
<document xmlns="http://www.ida.liu.se/doc"
          xmlns:pic="http://www.ida.liu.se/pic">
  <toc />
  <chapter>
    <section>Introduction to...
      <pic:figure>
        <pic:type format="postscript" version="3.0" />
        <pic:input file="logo.ps" alt="IDA logo" />
      </pic:figure>
    </section>
  </chapter>
  <chapter xmlns:pic="http://www.dpg.se/pictures">
    <title>My 2nd chapter</title>
    <section>
      Continuing the text...
      <pic:figure file="diagram.svg">An interesting figure</pic:figure>
    </section>
  </chapter>
</document>

In Listing 2.6 the xml file is associated with the default namespace uri "http://www.ida.liu.se/doc". Note that nothing has to exist at that uri; it only has to be unique so that implementations can determine what type of document the xml is. A prefix, pic, is associated with a second namespace "http://www.ida.liu.se/pic". This namespace is then used to insert new elements in the first chapter. In the second chapter the pic prefix is overridden by another namespace "http://www.dpg.se/pictures". This namespace will be used when using the pic prefix in elements beneath this chapter element, since it is inherited.

2.4 Meta-data

Meta-data is data about data, in this context machine-understandable information. The data is used to identify relationships between different resources, e.g. who has written a certain article. This data can in turn be related to other data, e.g. the address of the author or the affiliation of the author.


2.4.1 RDF

rdf [W3C04e, W3C04d] (Resource Description Framework) is a way to express meta-data about web resources that can be used between different applications, and is one of the cornerstones of the semantic web. rdf is developed by W3C and is often serialized as xml. rdf consists of a syntax specification and a schema specification (rdfs). The description itself has its own uri, making it possible to describe rdf descriptions in rdf. The knowledge expressed is written as triples. A triple consists of a resource (subject), a property (predicate) and an object. The resource (e.g. the article) is a uri (Uniform Resource Identifier) reference; thus rdf can be used to express the semantics of anything that has a uri. A property (e.g. author) is also a resource, making it possible for the property to have properties itself. The object can be a value (e.g. the name of the author) or a resource (e.g. the home page of the author).

rdf descriptions can be embedded directly in an xml document or provided separately.
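The triple model itself is simple enough to sketch in a few lines of code. The following is our own illustration, not part of the thesis, and all URIs and property names in it are invented examples:

```python
# Each statement is a (subject, predicate, object) triple. Subjects and
# predicates are URI references; objects are URIs or literal values.
# All names below are hypothetical, for illustration only.
triples = [
    ("http://example.org/article/42", "http://example.org/terms/author",
     "http://example.org/people/hall"),
    ("http://example.org/people/hall", "http://example.org/terms/name",
     "David Hall"),
]

def objects(subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Because the object of one triple can be the subject of another, triples
# chain into a graph: from the article we can reach the author's name.
author = objects("http://example.org/article/42",
                 "http://example.org/terms/author")[0]
print(objects(author, "http://example.org/terms/name"))  # ['David Hall']
```

The key design point, visible even in this toy model, is that the object of one triple can be the subject of another, so a set of triples forms a graph rather than a tree.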

2.4.2 OWL

owl [W3C04c] (Web Ontology Language) builds upon rdf and adds relations between classes, cardinality, equality, symmetry and so on. owl is constructed to build ontologies of different subjects with the web in mind. Like rdf, owl was developed by W3C. There are three types of owl documents: Lite, dl (Description Logic) and Full. All rdf documents are also valid owl Full documents. owl dl is a more restricted subset of Full, and Lite is in turn a more restricted subset of dl, making them easier to grasp and easier to provide tools for. For example, Lite only allows cardinalities of 0 or 1.

2.5 Links

Links are the main functionality of the web; they make it possible to connect one resource to another. Links are useful not just in hypertext but also for connecting other types of information. This can be used to reduce the amount of redundant data by linking to common information resources, or to link to resources that have access to up-to-date data.

XLink [W3C01a] provides a way to link between all types of xml documents. Unlike links in an html document, which only offer links from the current document, XLink allows links both to and from a document.

2.6 XML APIs

Since parsing of xml is an essential part of all applications that handle xml data, the need for xml apis is inevitable. An api (Application Programming Interface) is used to access data in xml files from within a programming language. Elliotte Rusty Harold [Ven03] identifies five different types, or styles, of xml apis:

Push API, or event-based api, was the first type of xml api invented, because it was easy to implement. They are streaming apis where the document is read in document order and an event is triggered when an element starts or ends and when data starts. The programmer has to implement code for these events that makes use of the data. It has low memory constraints and is very fast, but is on the other hand complex to use, especially if parent-child element relations are to be accessed. This is because you cannot access a specific node - the document is read in document order only. If you want to access a previous sibling or parent, the document must be reread from the start. The most famous push api is sax [SAX] (Simple api for xml). It began as an xml api for Java and is now a de facto standard for push-based xml parsing. Another push api is the Xerces Native Interface.

Pull API is the newest type of xml api and is, like the push api, a streaming api. The difference from the push api is that the application asks, pulls, the parser for new information instead of the parser calling functions when new information is found. The pull api is also very fast and very memory efficient, but is simpler to use than the push api. Being the newest type of api, the implementations have not matured much yet.


Tree-based API builds an object model, often a tree with nodes for elements, attributes, text, and so on, of the xml documents. This tree can then be queried, navigated and modified via the api. It is an effective way to process the contents, but since the tree usually must be kept in internal memory it is impossible to use for larger documents. Some applications, however, such as some xml databases, have tree-based apis that do not require all of the tree to be in internal memory. The best known tree-based apis are W3C's recommendation dom [W3C04a] (Document Object Model) and jdom [HM04], a Java-based document object model.

Data-binding API, like a tree-based api, parses the entire document and builds an object model. But here the model represents actual data instead of elements, attributes and so on in the xml document. An object in a data-binding api can be a chapter class (compare with Listing 2.1) corresponding in some way to some element with sub-elements and attributes in the xml document. The mapping between the xml document and the resulting objects is defined in some sort of schema: W3C's xml Schema language, dtd or a special binding schema language. The problem with data-binding apis is that most xml documents do not have a corresponding xml Schema, have a schema in a format other than W3C's Schema language (which most apis assume), or do not follow the specified schema. Another problem is that data-binding apis assume flat data structures and that the order of elements does not matter, rendering them unsuitable for all document-centric xml and a large amount of data-centric xml.

Query API is an api where queries, in languages like XPath, xslt or XQuery, are made directly through the api. Most of the work to get the data is put into writing the queries in the query language itself, not in the native programming language.

The two most common styles of xml apis are event-based (push api) and tree-based.

In general, the existing apis are too complicated or too simple [Ven03]. They do not model xml completely or correctly. The largest problem is with namespaces (see Section 2.3), an area within xml that is difficult to understand, and difficult to handle in the apis.

2.7 Transforms

Transformation is the process of transforming an xml document into another. These transforms are often from a device-independent data format to a specific presentation format, such as xhtml, html or wml. Data can be filtered and grouped in a transformation, to give just the desired subset of information.

2.7.1 XSLT

xslt [W3C99c] (xsl transformations) is the transformation language in the Extensible Stylesheet Language family [W3C04b] (xsl). It is, as most W3C recommendations, xml-based. XPath (as described in Section 2.8.1) is an essential part of xslt; it is used to specify which nodes to transform. xslt supports templates, sorting, grouping and user-defined functions (the latter two were introduced in xslt 2.0). Although other types of transformation are possible, xslt is designed for transformation to presentation formats; its functionality is aimed at presentation transformation and may lack some features needed for other uses.
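As an illustrative sketch (not from any listing in this thesis), a small stylesheet that renders the chapter titles of a document like Listing 2.1 as an html list, sorted by title, could look like:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- matched against the document root; XPath selects the nodes -->
  <xsl:template match="/document">
    <ul>
      <xsl:for-each select="chapter">
        <xsl:sort select="title"/>
        <li><xsl:value-of select="title"/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
```

The templates, XPath selections and sorting shown here are the core of most presentation-oriented transforms.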

2.8 Query

Many xml tools, such as xslt processors and query apis, use some type of query language. XQuery is becoming more common, and all xml query tools will probably support XQuery eventually.

2.8.1 XPath

XPath [W3C99b] is to xml nodes what regular expressions (wildcard search patterns used for searching in text) are to text. It is standardized by W3C and is used to find nodes in xslt and XQuery, among others.


An XPath expression defines the path top-to-bottom through the tree to the desired nodes. Each node level is separated by a slash, much like paths used in file systems. To find all the section elements in Listing 2.1 the following XPath expression can be used:

/document/chapter/section

If you want to find all section tags irrespective of where they are in the tree the expression is:

//section

The double slashes (//) mean arbitrary depth. In this case the result would be the same.

XPath also supports function calls and conditionals.

count(//chapter[@numbering="no"])

would return the number of chapter elements whose numbering attribute is set to “no”. The @ sign indicates that the following name is an attribute. A not-equal test would miss elements that lack the attribute entirely; to match those as well, the expression is enclosed in a function call to not():

count(//chapter[not(@numbering="no")])

XPath can work as a basis for a query language, but quite a few features are missing, such as grouping, sorting, cross-document joins and data types. If used within an xsl transform (xslt), the first three of these features are provided by xslt. xslt, with its xml-based syntax, is however somewhat hard to use and probably nothing you would want to use from a programming environment.
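The expressions above can be tried out in code. A minimal sketch using Python's ElementTree, which supports a subset of XPath (the document content is invented for illustration):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<document>
  <chapter numbering="no"><title>My 1st chapter</title>
    <section>one</section>
  </chapter>
  <chapter><title>My 2nd chapter</title>
    <section>two</section><section>three</section>
  </chapter>
</document>""")

# /document/chapter/section, relative to the document element:
print(len(doc.findall("./chapter/section")))            # 3
# //section: sections at arbitrary depth
print(len(doc.findall(".//section")))                   # 3
# count(//chapter[@numbering="no"]):
print(len(doc.findall(".//chapter[@numbering='no']")))  # 1
```

ElementTree only implements part of XPath (it has no not() function, for example), but it is enough to experiment with path and predicate syntax.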

2.8.2 History

A number of languages for querying xml files have been developed; some of them originated in the research on semi-structured data, such as oql, Quilt, YaTL, xml-ql, xql and Lorel. A W3C working group is developing an xml query language called XQuery, based on Quilt and inspired by some of the mentioned languages.


2.8.3 Update capabilities

With large amounts of data in xml documents it becomes interesting to be able to update the data. There are several ways of updating xml documents: from simply replacing the existing document, to updates on a live dom tree, to a special update language. Most methods are proprietary, but two common languages have emerged:

• XUpdate [LM00], from the xml:db initiative, is based on XPath, which it uses to specify which nodes to delete or update, or before or after which nodes to insert new ones.

• XQuery extensions have been proposed by W3C's XQuery Working Group and by Patrick Lehti [Leh01]. Some variations of these are implemented in a number of xml tools.

When update/delete support is added to XQuery it is likely to be supported by a large number of tools handling xml data.

2.8.4 XQuery

XQuery [W3C04f] is designed to query collections of xml data and has been called the “last great project of xml standardization” [Dum04]. The collections can be data from xml files, xml databases or relational databases, on varying levels of structure. Queries can combine data from several sources, and since element order is important in some xml documents, results can be given in element order. XQuery builds upon XPath, which is heavily used in XQuery. The specifications for XQuery 1.0 and XPath 2.0 are developed by the same working group under W3C, and the final XQuery 1.0 recommendation is believed to be released at the end of 2005 [Dum04]. XPath is nowadays defined as a subset of XQuery. The non-XPath additions are sql-like in function and syntax.

XQuery was first called Quilt and was a successor to the xml query language xql; other influences were xml-ql and sql.

XQuery consists of three different languages: one human-friendly non-xml syntax, one machine-friendly xml syntax (XQueryX, see Section 2.8.4) and a formal algebraic language used in the XQuery processor.


XQuery currently does not support group-by operations and thus nested queries (queries with sub-queries) are important. There are no restrictions on query nesting in XQuery.

The main concepts in XQuery are the flwor (pronounced “flower”) expressions. flwor is an acronym for the expressions for, let, where, order by and return. For is used to iterate over nodes; let is used to bind a variable. Both these expressions specify a sequence of tuples that can be filtered or ordered using where and order by clauses respectively. Once the tuples have been filtered and ordered they are returned by the return clause.

flwor expressions can be used to build simple or complex xml fragments, and several documents can be queried simultaneously. The flwor expressions are separated from the surrounding xml content by curly braces.

XQuery has a type of conditional expression that, unlike those of most languages, requires an else clause, because every expression in XQuery must return a value.

Listing 2.7: XQuery example

<automated-toc title-chs="{count(//chapter/title)}"
               tot-chs="{count(//chapter)}">
  <chapter> {//chapter/title} </chapter>
</automated-toc>

Listing 2.8: Results of XQuery in Listing 2.7

<automated-toc title-chs="1" tot-chs="2">
  <chapter>
    <title>My 2nd chapter</title>
  </chapter>
</automated-toc>

Listing 2.7 shows a simple XPath-style query (with results in Listing 2.8) that counts the chapters in Listing 2.1 that have a title as well as the total number of chapters. The title of the chapter is also printed.


Listing 2.9: Second xml file

<?xml version="1.0" encoding="UTF-8"?>
<annotations>
  <annotation>
    <for-title>My 2nd chapter</for-title>
    <reviewer>
      <name>David</name>
      <email>david@example.net</email>
    </reviewer>
    <date>2004-10-16</date>
    <text>There needs to be a lot more text.</text>
  </annotation>
  <annotation>
    <for-title>My 2nd chapter</for-title>
    <reviewer>
      <name>Cecilia</name>
      <email>cecilia@example.com</email>
    </reviewer>
    <date>2004-09-23</date>
    <text>Change title to "My second chapter"!</text>
  </annotation>
  <annotation>
    <for-title>Another chapter</for-title>
    <reviewer>
      <name>Cecilia</name>
      <email>cecilia@example.com</email>
    </reviewer>
    <date>2004-09-23</date>
    <text>This chapter should perhaps be removed.</text>
  </annotation>
</annotations>


Listing 2.10: XQuery join

<doc-annotations> {
  for $b in doc("exempel.xml")//chapter
  return
    <part>
      <title>{$b/title}</title>
      { for $c in doc("annotation.xml")//annotation
        where $c/for-title = $b/title
        order by $c/date
        return <comment from="{$c/reviewer/name}">{$c/text}</comment> }
    </part>
} </doc-annotations>

Listing 2.11: Result of XQuery join

<doc-annotations>
  <part>
    <title/>
  </part>
  <part>
    <title>
      <title>My 2nd chapter</title>
    </title>
    <comment from="Cecilia">
      <text>Change title to "My second chapter"!</text>
    </comment>
    <comment from="David">
      <text>There needs to be a lot more text.</text>
    </comment>
  </part>
</doc-annotations>


Listing 2.10 shows a more advanced join query using flwor on the documents in Listings 2.1 and 2.9. The chapters in the first xml document are listed together with the corresponding annotations in the second file. The result is listed in Listing 2.11.
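The flwor clauses map naturally onto familiar programming constructs. A rough Python analogue of the join in Listing 2.10 (for corresponds to iteration, where to a filter, order by to sorting), with the documents reduced to plain dictionaries for illustration:

```python
chapters = [{"title": "My 2nd chapter"}]
annotations = [
    {"for-title": "My 2nd chapter", "reviewer": "David", "date": "2004-10-16"},
    {"for-title": "My 2nd chapter", "reviewer": "Cecilia", "date": "2004-09-23"},
]

# for $b in //chapter ... return <part>...</part>
result = [
    {"title": b["title"],
     # for $c in //annotation where ... order by $c/date return ...
     "comments": [a["reviewer"]
                  for a in sorted(annotations, key=lambda a: a["date"])
                  if a["for-title"] == b["title"]]}
    for b in chapters
]
print(result)  # [{'title': 'My 2nd chapter', 'comments': ['Cecilia', 'David']}]
```

The nesting of the inner loop inside the outer one mirrors the nested flwor expressions in the XQuery version.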

XQuery has a large number of functions, including functions for math, string manipulation, time comparison, node manipulation, sequence manipulation, type conversion and Boolean logic. All functions are defined in a namespace to avoid name collisions. You can also define your own functions. There is no support for function overloading. Functions can be recursive.

Since there is no way to import code, it is not possible to call a common function library for frequently recurring functions.

There are a number of arithmetic and comparison operators. Special operators are << and >>, which check whether a node appears before or after another node.

XQueryX [W3C03], formerly abql (Angle Bracket Query Language), is an xml representation of XQuery and is meant to be used by existing xml tools for parsing, creation or modification. It does not seem to be used to any great extent in the existing implementations, and there is speculation about whether it will be included in the XQuery recommendation or not.

Some of the most well-known stand-alone XQuery implementations are:

XQEngine [Kat04] is an open-source Java component for XQuery.

QuiP [AG] from Software ag is a prototype XQuery engine that allows queries on xml files in a file system as well as on xml stored in a Tamino xml Server (see Section 2.9.9). QuiP has not been updated for the latest recommendation drafts.

XQuark [Gro04] is an open-source project that has released two different products: Bridge and Fusion. Bridge is a system for expanding relational database systems with import and export capabilities. Fusion is a system to get an xml view of data sources (including those from Fusion and Bridge) and perform XQueries on it.

IPSI-XQ [Dar04] is a demonstration implementation of XQuery written at Fraunhofer ipsi. It has graphical, command-line and web interfaces and a Java api. It has been updated up to the November 2003 draft of the XQuery recommendation.

Qexo [Bot03] is an open-source, partial implementation of XQuery that uses the Kawa framework to compile queries to Java bytecode for better performance. Many of the standard functions defined in XQuery are not supported.

Galax [gal04] is an open-source implementation of XQuery developed by AT&T, Lucent and Avaya in the O'Caml programming language. Command-line tools, a web interface and apis for O'Caml, C and Java are available.

Qizx/open is an open-source XQuery implementation. It implements all features except schema import and validation. A commercial variant called XQuest, with support for indexing, is under development.

XQJ is an effort to specify an XQuery Java api based on jdbc. See Section 2.9.7.

Saxon is a Java-based xslt processor that now also supports XQuery.

2.9 Databases in XML

xml has some advantages as a database format: it is self-describing (structure and type names are given in the markup; the semantics, however, is missing), it is portable, and data can be described in a tree or graph structure.

The main disadvantages are its verbosity and its need for parsing and text conversion, leading to slow reading of data.

xml and related tools provide many of the features found in databases: storage, schemas, query languages and programming interfaces. However, it lacks efficient storage, indices, security, transactions, data integrity, multi-user access, triggers, queries across multiple documents etc.

These shortcomings can be overcome by storing the xml in an xml database.


In case the data has an irregular structure and/or uses entities, a native xml database (nxd) can be suitable. Physical document structure is preserved in an nxd, document-level transactions are supported and queries can be executed on the xml.

2.9.1 Different types of XML databases

Storing database-like xml documents in ordinary xml files may work for small amounts of data but will fail in most production environments requiring strict data integrity and good performance.

One main issue is indexing: finding an actual element or value inside an xml file requires reading the whole file if an index is missing.

A better solution is to store the xml data in a database. There are two different types of xml databases: enabled and native. A third hybrid type also exists.

XML-enabled databases (xedb) are built on top of relational databases. xml data is mapped to a relational database, giving access to all the features and the performance found in relational database management systems. Due to xml's structure, this mapping often results in a large number of tables or an unnormalized representation, unless the data is well-structured, not nested in too many levels, and follows a strict schema. The mapping also often causes physical and logical structure (processing instructions, comments, element order and so on) of the xml data to be lost to some degree. This means that an xml document exported from the database will probably differ from what was imported. Most of the well-known database management systems offer xml-enabled databases, such as ibm's db2, Oracle's Oracle 10g and Microsoft's sql Server.

Native XML databases (nxd) are designed especially to store xml documents. They support transactions, security, multi-user access, programmatic apis, query languages and so on. The only difference from other databases is the internal model. nxds are most useful for storing document-centric documents, since document order, processing instructions, comments, cdata sections and entities are preserved.


Not just documents but also data, especially data semi-structured in nature, can be stored in native xml databases.

Hybrid XML databases (hxd) are databases that, depending on the features wanted, can be seen as either native xml databases or as xml-enabled databases. An example is Ozone, which allows data access through xml, whose internal data representation maintains full xml structure and whose fundamental unit of storage is an xml document.

The boundaries between traditional xml-enabled databases and nxds have blurred. Traditional databases support native xml operations, and nxds can use external relational databases to store document fragments. Standard xml tools can be used for working on the stored documents; these include dom, sax, XPath, xslt and XQuery. nxds are useful for storing documents where xml is the native format. An nxd requires no initial configuration to store an xml document.

xml databases are a younger research field than relational databases, and much work remains on finding effective algorithms to query tree-structured data.

2.9.2 Native XML databases

The term native xml database was first mentioned in the marketing campaign for Software ag's product Tamino. The term is in common use but lacks a formal technical definition.

Software ag's own definition [CRZ03] requires that a native xml database system is built and designed specifically for the handling of xml.

Another and more open definition, developed by members of the xml:db mailing list [Bou04], says that a native xml database:

• Defines a (logical) model for an xml document – as opposed to the data in that document – and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, pcdata, and document order. Examples of such models are the XPath data model, the xml Infoset, and the models implied by the dom and the events in sax 1.0.

• Has an xml document as its fundamental unit of (logical) storage, just as a relational database has a row in a table as its fundamental unit of (logical) storage.

• Is not required to have any particular underlying physical storage model. For example, it can be built on a relational, hierarchical, or object-oriented database, or use a proprietary storage format such as indexed, compressed files.

An nxd can store more information than is contained in the model. A document is the fundamental unit of storage in all nxds, although it would be possible to use document fragments instead.

nxds provide robust storage and a way to manipulate xml documents. The model of nxds allows arbitrary levels of nesting and complexity. The meaning of stored data is given by the retrieved document, not by what is stored in the underlying layer. There is currently no well-defined way to perform updates on the stored documents. A few proprietary update languages exist, as well as xml:db's XUpdate, but there is no widespread language, and there probably will not be one until an update language is added to XQuery; see Section 2.8.3 for more discussion on this. nxds excel at storing document-oriented data, data with very complex structure and deep nesting, and data that is semi-structured in nature.

The reasons for storing xml in nxds are:

• When you have semi-structured data (a regular structure but with large variations), a large number of null-valued columns or a large number of tables would be required to store the data in a relational database. This type of data could also be stored in an object-oriented or hierarchical database.

• Faster retrieval speed. Some nxds store entire documents together physically on disk or use physical pointers, allowing documents to be retrieved faster than with a relational database where data is spread logically as well as physically. If data is retrieved in non-sequential order there probably will not be any performance boost.


• If you want to use xml-specific capabilities, such as executing xml queries. Support for xml queries is not yet widespread, and xml query languages are being implemented in relational databases, so this reason is not very strong.

Most native xml databases can only return data as xml; if the application needs data in another format, it must parse the xml. In a distributed application (e.g. web services) this is not a problem, as the data has to be serialized in some form for distribution anyway, xml being the most common.

xml does not support data types directly; this information can however be specified via a schema, such as xml Schema. Some nxds require collections to be associated with a schema while others do not.

Some nxds can include remote data in documents. The data is retrieved from a relational database using a mapping.

An nxd operates on a set of documents, called a collection. An entire collection can be accessed as one tree. The collection can be associated with a schema specifying the format of the stored data, but not all xml databases require this. Not requiring a schema gives higher flexibility and makes development easier, but also comes with the risk of low data integrity. Some products support validation with dtd and a few also with xml Schema.

Transactions are supported in many nxds. Locking is often performed at the document root level, because a deletion of a node higher up in the hierarchy could otherwise remove the very node being updated. There are ways to achieve node-level locking, but this is somewhat complex and not yet implemented in any of the nxds investigated.

One of the advantages of nxds is their round-tripping support, i.e. they can emit xml documents exactly as they were once entered into the database.

2.9.3 Indices

Indices are found in all nxds. There are three types:

• Value indices. These include text and attribute values.

• Structural indices. Indices containing information on the location of elements and attributes.


• Full-text indices. The full-text indices are based on text and attribute values.

Most nxds support the first two types; a few support full-text indexing. The types can be combined, e.g. a structural value index can be used to find specific elements containing a specific value. Some nxds make use of xml Schema information about data types when indexing data.
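A toy sketch of a value index (a plain dictionary stands in for the database's real index structure, which is an assumption made for illustration), showing how a single lookup replaces a scan of the whole tree:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

doc = ET.fromstring("<document>"
                    "<chapter><title>Alpha</title></chapter>"
                    "<chapter><title>Beta</title></chapter>"
                    "</document>")

# build once: text value -> elements carrying that value
value_index = defaultdict(list)
for elem in doc.iter():
    if elem.text and elem.text.strip():
        value_index[elem.text.strip()].append(elem)

# query: one dictionary lookup instead of walking the tree
hits = value_index["Beta"]
print([e.tag for e in hits])  # ['title']
```

A structural index would map element names or paths to locations in the same way, and the two can be combined as described above.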

2.9.4 Normalization

Normalization is, just as with relational databases, a concern with xml databases. The risk of duplicated, redundant data is present also for xml [Bou04, Pro02a], resulting in increased file size and a risk of inconsistency. Just as with relational databases, you are not required to normalize your data.

Since the structure of an xml document differs a lot from the flat tables of relational databases, normalization is quite different. xml's support for multi-valued fields (one of several ways in which the xml structure deviates from first normal form) makes it possible to normalize data in a way not possible with a relational database, since multiple children in a one-to-many relationship can be placed directly under a parent using composition [Pro02a], without the need for multiple tables or foreign keys. If the data is spread over a number of documents, however, keys, implemented in e.g. XLink, are needed [Pro02a, Bou04].
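For instance, a one-to-many chapter-sections relationship can be normalized by composition alone, with no keys at all (a sketch in the style of Listing 2.1):

```xml
<chapter>
  <title>My 2nd chapter</title>
  <!-- the "many" side nests directly under its parent:
       no foreign keys or separate tables needed -->
  <section>First section</section>
  <section>Second section</section>
</chapter>
```

The relational equivalent would require a separate sections table with a foreign key back to the chapter.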

Second and third normal forms can be applied nearly unchanged to the xml world, using xml Schema's keyref capability to act much like a foreign key. This should only be used for many-to-one and many-to-many relationships; for one-to-many relationships, composition is a better use of xml's structure [Pro02b]. In a relational database a primary key must be unique within its database instance. In xml Schema a key is unique within the scope of some element instance; thus the scope depends on in which element the keyref is defined.

Since fourth normal form is only interesting if first normal form is followed, it has no meaning in an xml tree. Fifth normal form, however, does apply to xml [Pro02b].


2.9.5 Referential integrity

Referential integrity means ensuring that pointers in xml documents point to valid documents or document fragments. These pointers can be id/idref attributes, key/keyref fields (as in xml Schema), XLink links or some proprietary mechanism. Referential integrity in nxds can concern either internal or external pointers. Most nxds only validate internal pointers at insertion; few or none guarantee referential integrity after updates. External pointers are not checked for integrity: it is only meaningful to enforce integrity for pointers within the same database, and since the database has no way to control external documents, pointers to them cannot be enforced. Referential integrity will probably come to be supported for internal pointers and perhaps also for "external pointers of some sort" [Bou04]. Until then it is up to applications to enforce the integrity of pointers.
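Until databases enforce this, an application-side check can be sketched in a few lines (assuming plain id/idref-style attributes named id and ref; the document content is invented for illustration):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<document>
  <chapter id="ch1"><title>My 2nd chapter</title></chapter>
  <see ref="ch1"/>
  <see ref="ch9"/>
</document>""")

# collect every id value, then flag every ref with no matching id
ids = {e.get("id") for e in doc.iter() if e.get("id") is not None}
dangling = [e.get("ref") for e in doc.iter()
            if e.get("ref") is not None and e.get("ref") not in ids]
print(dangling)  # ['ch9'] -- a pointer with no valid target
```

A real validator would additionally re-run this check after every update, which is exactly what most nxds do not yet do.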

2.9.6 Performance

If data is retrieved in stored order, native xml databases should scale as well as, probably even better than, relational databases. This is confirmed by tests [CRZ03]. If data is retrieved in any other order, retrieval will most certainly be slow and scale poorly, despite the heavy use of indexing in nxds, which helps retrieval time at the cost of slower updates. For un-indexed data, nxds are clearly outperformed by relational databases.

Further work on specialized algorithms for querying and processing xml databases, such as those developed in the Timber project [JAKC+02], will enable the use of large databases.

2.9.7 Output/API

apis are offered by most nxds. These are often odbc-like and return xml strings, dom trees, or a parser over the returned document. Most apis are proprietary, but there exist two well-known vendor-neutral apis:

• xml:db api [Sta01] uses XPath as a query language.

• xqj (XQuery api for Java) [EMA+04] is based on jdbc and is under development.


2.9.8 NXD Models

Native xml databases can be divided [Bou04] into two large groups depending on how they store data: text-based and model-based.

Text-based NXDs can use ordinary files, blobs in relational databases or a proprietary text format as storage. Indices are essential, since they give a tremendous speed advantage when retrieving documents: a single index lookup gives the position of the required document fragment, and the bytes can then be read in the order they are stored. If you want the data in another structure, a text-based nxd is likely to be outperformed by a relational database.

Model-based NXDs build an internal object model from the document and store this model. It can be stored in a relational or object-oriented database as well as in a proprietary storage format optimized for the model. Model-based nxds that use a proprietary storage format will have performance similar to that of text-based nxds when retrieving data in storage order, since physical pointers are used. Model-based nxds are faster than text-based ones at retrieving documents as Document Object Model (dom, see Section 2.6) trees. As with text-based nxds, performance problems can occur when data is retrieved in another order.

Persistent DOM is a specialized type of model-based nxd which uses a dom internally. This makes live dom access possible. Due to the large memory requirements of dom, this is not practical for large documents.

2.9.9 Implementations of native XML databases

There exist a number of nxd implementations. This list contains the most popular databases as well as a number of databases with interesting features.

dbXML [dbx03] is an open-source native xml database. It does not support XQuery.


Exist (eXist) [exi05] is an open-source Java-based native xml database that supports XQuery.

Infonyte DB [inf05] is a native xml database based on persistent dom. For Java. Does not support XQuery yet.

Ipedo XML database [ipe] supports XQuery and transactions. For Web services, Java and .NET.

MarkLogic Content Interaction Server [mar05] supports XQuery, lock-free queries and configurable levels of document fidelity.

Neocore XMS [neo05] is a transactional native xml database from Xpriori using a “patented pattern-recognition technology” [LLC05]. It supports XQuery and has apis for Java, C#, C++, Web services etc. For Linux and Microsoft Windows.

Sedna [sed05] is an nxd developed at the Institute for System Programming of the Russian Academy of Sciences. It supports XQuery and has an update language. Scheme, C and Java apis are available. Sedna is implemented in C/C++ and Scheme. Binaries are available for Microsoft Windows; it is however said to be possible to compile it on other operating systems such as Linux.

Sonic XML Server [son05] from Sonic Software supports XQuery, document linking and update-grams (a way to update xml documents) with triggers. For Java.

Tamino [tam04] is a commercial nxd from Software ag and the first product announced as a native xml database. Whether it really has native xml storage is unknown, since details on the architecture are not public. Available on Linux and Microsoft Windows platforms.

Timber [tim04], developed at the University of Michigan, uses a special type of algebra that takes trees as input and gives trees as output to achieve fast querying. Only Microsoft Windows is supported. Timber does not support user-defined XQuery functions.


Virtuoso [vir04] from OpenLink gives access to a virtual database via xml, odbc, jdbc, .net or ole db. XQuery is supported. Virtuoso can be used with Web services, .net, Mono and j2ee.

Xindice [xin] is a fork of dbXML developed under the Apache umbrella. It does not have XQuery support yet.

X-Hive DB [xhi05] is a commercial nxd that supports XQuery and versioning. It can be used with j2ee.

2.10 Summary

xml is a document format that, due to its flexibility, is used in more and more areas. To ensure that documents conform to a standardized schema, a number of validation techniques are available. Documents can be extended to contain information not defined in the schema.

As with data stored in traditional database systems, issues with normalization and performance occur. Due to the different structure of xml documents compared to relational databases, these areas differ remarkably and are the focus of much research.

XQuery is a query language for querying xml data as well as data from relational database systems. xml data is also returned as a result of the queries. XQuery is currently under development by a W3C working group and is already supported, to different extents, by a number of tools.

Effective and fast apis are necessary to handle large xml documents. When managing many and/or large xml files it becomes cumbersome to keep track of all files and queries on the contents are slow. With the help of xml databases xml documents can be assembled into collections and indexed. Native xml databases (nxds) are a type of database management systems with a special architecture to store and retrieve xml data. nxds are better than other xml databases, such as xml extensions on relational database management systems (xedb), when data is semi-structured or you want to retrieve documents as they were stored.


Bioinformatics

The advances in molecular biology have resulted in the sequencing of the genomes of several species and thus an amount of data impossible to process manually. By introducing information science to biology, a field called bioinformatics, researchers are able to store and access data easily. This chapter describes the fundamentals of the central dogma of biology, protein interactions, existing databases and exchange formats.

3.1 Genes

A gene is a description of what amino acids to assemble to build a protein. Genes are part of dna molecules, large chains of nucleotides. There are four different types of nucleotides: adenine (a), guanine (g), cytosine (c) and thymine (t). A combination of three nucleotides, a codon, codes for one amino acid. Some codons do not code for amino acids but indicate the beginning or ending of a protein being coded.

To create proteins, the part of interest of the dna molecule is copied to an mrna molecule in a process called transcription; the m stands for messenger. To this mrna other molecules, called trna, bond. There are 47 different trna molecules, each bonding to a specific triplet on one side and a specific amino acid on the other.
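The transcription and translation steps can be sketched in code. This is a deliberate simplification (it ignores the template strand and uses only a four-entry fragment of the real 64-codon table, chosen for illustration):

```python
# tiny illustrative fragment of the codon table
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}

def transcribe(coding_strand):
    """Simplified transcription: mRNA mirrors the coding strand, U for T."""
    return coding_strand.replace("T", "U")

def translate(mrna):
    """Read codons (nucleotide triplets) until a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        peptide.append(amino_acid)
    return peptide

print(translate(transcribe("ATGTTTGGCTAA")))  # ['Met', 'Phe', 'Gly']
```

The stop codon UAA ends translation, mirroring the codons that, as noted above, indicate the end of the protein being coded.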


There are about 3.08 billion base pairs [Con04] in the human genome. 10 percent of the base pairs code for genes; what function the remaining 90 percent has is unclear. The human genome contains 20,000-25,000 genes [Con04].

3.2 Proteins

Proteins are long polymers of amino acids. Most proteins consist of somewhere between one hundred and one thousand amino acids. Since there exist twenty different amino acids, the number of possible protein structures is enormous, but in reality about 15,000 different protein structures are currently known [Les02].

Proteins are important for the life of organisms. They have a number of different roles: structural (such as skin on animals), catalysis (enzymes), negative catalysis (inhibitors), transportation, regulation, control of genetic transcription and as participants in immune systems. The main function of a protein is to bond to different substances. What substance a protein will bond to depends on the shape of the protein and the distribution of electrical charges on its surface. What shape a protein will get depends solely on what amino acids it consists of.

3.3 Pathways

A biochemical pathway (or biochemical network) is a network of interconnected chemical processes within the cell. These are processes between different molecules, including proteins (protein interactions). There are different types of pathways. A metabolic pathway is a series of reactions catalyzed by enzymes. A regulatory pathway is a series of reactions and interactions regulating the expression and activity of enzymes and transporters. A signal transduction pathway is a series of reactions and interactions realizing transfer of information between different cellular locations, such as between the extra-cellular medium and the cell nucleus. The difference between regulatory and signaling pathways is not always clear-cut.


Analyzing pathways helps explain how proteins coded by genes actually affect the organism. The analysis is tricky: the amount of information is large, data in databases are heterogeneous, data can often be incomplete and the potential size of pathways can be very large.

3.4

Experimental methods

Protein interaction data are determined mainly by two different methods [Cun01, Twy03]: two-hybrid systems and phage-display systems.

3.4.1

Two-hybrid systems

There are a number of two-hybrid systems, with yeast two-hybrid systems being the most common. The yeast two-hybrid system is a method that uses a yeast protein called gal4, which functions as a transcription factor. This factor is split into two parts, one with the dna-binding domain (dbd) and the other with the activation domain (ad).

The protein sequence whose function is unknown is joined with the transcription factor's dna-binding domain; the resulting hybrid protein is called a bait.

The other genes are joined with the transcription factor’s activation domain to form hybrid proteins known as prey.

The bait is then tested against each prey and possible interactions are detected, since an interaction between the bait and prey results in a functional transcription factor that in turn activates a test gene that can be detected.

3.4.2

Phage-display systems

Another method is phage-display. The steps, as described by [Twy03], are: 1. The function of protein X is unknown. The protein is used to coat the surface of a dish.

(54)

2. All the other genes in the genome are expressed as fusions with the coat protein of a bacteriophage (a virus that infects bacteria), so that they are displayed on the surface of the viral particle.

3. This phage-display library is added to the dish. After a while, the dish is washed.

4. Phage-displaying proteins that interact with protein X remain attached to the dish, while all others are washed away. DNA extracted from interacting phage contains the sequences of interacting proteins.

3.4.3

Curated data

These methods are susceptible to false positives, e.g. gal4 can interact with other yeast cell elements and trigger a transcription within the bait without any prey involved. The data collected in biological databases are therefore not necessarily correct. Some databases require manual verification of the data before submission, in some cases with references to at least two different published sources, while other databases do not require any verification.

IntAct (see Section 3.5.6), for example, allows authors to express their confidence in an interaction [EMB05]; this confidence can be expressed as attributes of an interaction in the psi-mi format (see Section 3.6.2).
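To illustrate, such confidence annotations can be read out of xml with a few lines of Python. The fragment below is a simplified, psi-mi-inspired sketch (element names are simplified, not the exact psi-mi schema):

```python
import xml.etree.ElementTree as ET

# Simplified, psi-mi-inspired fragment (illustrative, not the real schema).
DOC = """
<interactionList>
  <interaction id="1">
    <attribute name="author-confidence">high</attribute>
  </interaction>
  <interaction id="2">
    <attribute name="author-confidence">low</attribute>
  </interaction>
</interactionList>
"""

root = ET.fromstring(DOC)
# Map each interaction id to its author-confidence value.
confidences = {
    i.get("id"): i.find("attribute[@name='author-confidence']").text
    for i in root.findall("interaction")
}
print(confidences)  # -> {'1': 'high', '2': 'low'}
```

The same kind of lookup is what the XQuery expressions evaluated in this thesis perform, but directly inside a native xml database rather than in application code.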

3.5

Databases

Researchers use databases to store experimental data using different models [DGvHW03]. Database models are used simply for storage and are not suitable for analyzing the structure of the stored networks to give a qualitative analysis; therefore special models adapted to these needs are designed, such as graph-based models. These models are often extracted from data stored in a database according to the traditional database model. Computational models are used to explain biological systems by simulating the system and giving a quantitative analysis.

Pathway databases contain data on pathways, the components (enzymes, substrates and products) and the interactions. There are a number
