Corpus construction based on Ontological domain knowledge


Department of Computer and Information Science

Final thesis

Corpus construction based on

Ontological domain knowledge

by

Nirupama Benis & Rajaram Kaliyaperumal

LITH-IDA-EX-2011/044-SE

2011-10-26


Supervisor: Dr. He Tan
Examiner: Dr. He Tan


Abstract

The purpose of this thesis is to contribute a corpus for sentence-level interpretation of biomedical language. The available corpora for the biomedical domain are small in terms of both the amount of text and the number of predicates. Moreover, these corpora have been developed rather intuitively. In this effort, which we call BioOntoFN, we created a corpus from the domain knowledge provided by an ontology. By doing this we believe that we can provide a rough set of rules for creating corpora from ontologies. We also designed an annotation tool specifically for building our corpus. We built a corpus for biological transport events. The ontology we used is the part of the Gene Ontology pertaining to transport: the term transport (GO:0006810) and all of its child concepts, which could be called a sub-ontology. The annotation of the corpus follows the rules of FrameNet, and the output is annotated text in an XML format similar to that of FrameNet. The text for the corpus is taken from abstracts of MEDLINE articles. The annotation tool is a GUI created using Java.


Acknowledgement

We would first like to thank our supervisor Dr. He Tan for all the help she has patiently given us during the project. We especially thank her for giving us the honour of co-authoring two scientific papers with her. She gave us many tips on scientific writing and reviewed our report to make sure we delivered a very good thesis.

We thank Prof. Nahid Shahmehri for giving us permission to do the project at IDA and for helping us in several ways. She made our stay in IDA very pleasant and comfortable.

We thank Prof. Göran Salerud and Assoc. Prof. Håkan Örman for reviewing the thesis proposal and allowing us to do the thesis outside of IMT. Both of them have guided us in selecting courses oriented towards bioinformatics. We would like to give Prof. Göran Salerud special mention because he has helped us from day one of the master's programme.

We are also extremely grateful to our families for giving us moral and financial support all through the programme.


Abbreviations

• Arg - Argument
• BIOSMILE - BIOmedical SeMantIc roLe labEler
• CNI - Constructional Null Instantiation
• DNI - Definite Null Instantiation
• ES - Example Sentence
• ET - Example Term
• FE - Frame Element
• FN - FrameNet
• GF - Grammatical Function
• GO - Gene Ontology
• GRIF - Gene References Into Function
• GUI - Graphical User Interface
• HLKB - Hunter Lab Knowledge Base
• IE - Information Extraction
• INI - Indefinite Null Instantiation
• LU - Lexical Unit
• MEDLINE - Medical Literature Analysis and Retrieval System Online
• MeSH - Medical Subject Headings
• NER - Named Entity Recognition
• NI - Null Instantiation
• OBO - Open Biological and Biomedical Ontologies
• PAS - Predicate Argument Structure
• PMID - PubMed ID
• POS - Part Of Speech
• PropBank - Proposition Bank
• PT - Phrase Type
• SMILE - SeMantIc roLe labEler
• SR - Semantic Role
• SRL - Semantic Role Labeling
• ST - Semantic Type
• TA - Transport Attribute
• TC - Transport Condition
• TDR - Transport Direction
• TDS - Transport Destination
• TE - Transport Entity
• TL - Transport Location
• TM - Text Mining
• TO - Transport Origin
• TP - Transport Path
• TPL - Transport Place
• TT - Transport Transporter


Chapter 1 Introduction

There is a very large amount of literature in the biomedical field, and it grows every day as more and more research is conducted. For example, the MEDLINE database is the most comprehensive list of published articles in the biomedical field. It currently has references to more than 18 million articles, and about 2,000 to 3,000 new articles are added every day. While it is good that so much work is going on in the field, this rapid increase in text brings problems. Information could be neglected because of the sheer amount of information available. Some very interesting information and patterns that may not be apparent to the human reader can be extracted from the text. In order to achieve this with less workload on the researcher, a lot of research is going on in text mining (TM) and natural language processing (NLP). Together these techniques aim at automatically, or semi-automatically, obtaining information from the text that can be useful for research and education.

Semantic role labeling (SRL) is a method that has been used in text mining tasks such as question answering. It has a lot of potential in question answering because SRL can give the relations between different parts of the sentence. For general English there are two extensive corpora of annotated text. They provide the test data needed for creating automatic SRL systems. One such corpus is FrameNet (FN), which follows frame semantics for the annotation of sentences. Building such corpora takes a lot of time and effort, both to find predicates for labeling and to annotate the text. Since biomedical text is quite different from general English, the same corpora cannot be used for performing the same text mining operations in the biomedical domain.

Corpora specific to each domain are essential, but there are very few corpora of annotated biomedical text to guide SRL systems for this domain. The available corpora are quite small and were created intuitively. There are no guidelines for creating corpora in any specific domain. This is because domain knowledge is essential for selecting the predicates of interest and deciding which semantic roles (SR) are of importance. This kind of domain knowledge is already available in the form of ontologies. For example, the OBO (Open Biological and Biomedical Ontologies) Foundry has several ontologies that could aid the creation of corpora.

This master's thesis project aims to use the knowledge in ontologies to construct a corpus that is annotated using FN rules. This reduces the need for domain knowledge, and the method can be used for any domain. First an ontology in the domain of interest is chosen, and within it a particular term and its children are chosen to form a frame according to frame semantics and FN rules. Predicates are chosen directly from the ontology. Then frame elements are decided based on the terms in the ontology, and sentences are annotated to populate the corpus. The sentences for this thesis were obtained using a search engine called GoPubMed, which has access to the abstracts of MEDLINE articles. The method has been used for the creation of the PROTEIN TRANSPORT frame in BioOntoFN, based on the Gene Ontology (GO) term protein transport (GO:0015031), and was published by He Tan et al. [1]. We extended the existing corpus to cover different types of biological transport events. To achieve this, we consider the term transport (GO:0006810), which is a parent of the term protein transport. Along the way we have tried to give generalized rules for constructing a corpus using this method.

Besides this, we created an annotation tool for manually annotating the sentences that populate the corpus. The annotation tool is specifically designed to support building this kind of corpus using this method. It is a graphical user interface (GUI) that helps the user annotate the required sentences. The sentences are loaded, four layers of annotation (target, frame element, phrase type and grammatical function) are identified, and the output is given in XML.


Chapter 2 Background

2.1 Text mining

There is plenty of information available as free text. It is not machine-readable because it is in the form of natural language, which is mostly unstructured and difficult for computers to process. Text mining was developed in order to obtain information from this kind of unstructured text. TM is a method of extracting useful and non-trivial information from a collection of textual documents [2]. It is a field that involves several disciplines, such as information retrieval, information extraction and NLP.

TM is also called text data mining, for example in [3], because of the similarities between data mining and text mining. There are of course differences between the two, which stem from the difference in the data on which the methods work. Data mining algorithms work on data that is very different in nature from text. The data in databases is usually structured, and the preprocessing required is generally minimal. The text that TM works on needs much more preprocessing for algorithms to work efficiently. Another important difference is that using domain knowledge improves results much more in TM than in data mining.

The general methodology used for the text mining process is explained in [2]. The first step is to obtain all the documents that need to be worked on, which may belong to a particular domain. Then these documents are preprocessed to convert the unstructured text to a structured form that the core mining algorithms can work on. Preprocessing keeps the most important parts of the document and discards the other parts. Many NLP techniques are used in TM for this step, and supporting such techniques is the goal of building our corpus. Once the preprocessing is done, the core mining operations are applied to extract non-trivial patterns and perform knowledge discovery.


We will look into the preprocessing step in more detail because it determines the quality of the entire TM operation. There are several ways to give text some structure, and this can be done at different levels. Preprocessing is based either on the structure of the document, on the characters or words occurring in it, or on an NLP level, either semantic or syntactic. Some preprocessing methods take a typographical view of the document. In this case spacing and special characters in the text are a kind of structure. Most other preprocessing methods use features of the document to give it structure. The features of the document are the characters and words that it contains. One method is the bag-of-words approach, which takes into account only the words that occur and their frequency of appearance, not the sequence in which the words occur. This is useful for applications like document clustering, in which documents with similar content are gathered together. It is also very useful for the task of information retrieval, which uses keywords to search for the most relevant document. For some tasks, like summarization, this approach is not enough since the meaning of the text is not taken into account. This is why several preprocessing methods use NLP techniques like POS tagging and SRL along with domain knowledge to give the text structure [2].
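The bag-of-words idea described above can be illustrated with a minimal Python sketch. The tokenization rule (lowercase alphanumeric runs) and the two example sentences are assumptions made for illustration; they are not taken from the thesis or from [2].

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word frequencies for a text, ignoring word order."""
    # Assumed tokenization: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Two sentences with the same words in a different order.
first = bag_of_words("The protein is transported to the nucleus.")
second = bag_of_words("To the nucleus the protein is transported.")

# The representations are identical because sequence is discarded.
print(first == second)
print(first["the"])
```

Because both sentences map to the same counts, this representation cannot distinguish texts that differ only in word order, which is the limitation noted above for tasks that depend on meaning.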

The NLP techniques currently developed usually give text structure at the syntactic or semantic level. The most widely used NLP techniques are POS (part of speech) tagging and parsing. POS tagging takes the grammar of the text into account and assigns tags to words, sometimes based on their context in the sentence. The tags that can be assigned vary from the basic POS like noun, verb and adjective to more specific tags like plural noun or plural proper noun. Parsing is a method which uses a particular grammar theory to assign syntactic roles to the different parts of the sentence in the text. Syntactic roles are parts of the sentence that are designated based on grammatical rules. Parsing can be complete, where each word is parsed separately, or shallow, wherein only some chunks of the sentence are parsed.

Labeling with semantic roles is an NLP technique that gives useful information concerning the semantics of the sentence in question. SRL plays an important part in preprocessing for TM on a sentence level. The corpus we have created is meant to support SRL system development for the biomedical domain. Our corpus has sentences that are annotated with semantic roles according to the theory of frame semantics. This corpus has the advantage of combining an NLP technique with domain knowledge. Besides semantic roles, the annotation also includes syntactic and POS information, which will also be useful. The theory behind semantic roles and frame semantics is discussed in detail in later sections of this chapter.


2.1.1 Biomedical text mining

The amount of biomedical text available keeps increasing every day. This could cause a lot of confusion because data could get lost in the shuffle. Researchers may not be able to access papers in their research areas because there is too much to go through to find the required paper or information. There might be important information patterns within or across documents that a human reader might not be able to find. Text mining could avoid that by automating the process so that the burden on a researcher is reduced [4]. Text mining on general English is difficult enough, but the same techniques cannot be applied blindly to biomedical text and be expected to give the same results. This is because biomedical text is very different from other English text. One of the differences of most importance to us concerns the meaning of words; for example, 'translation' is used with a different meaning in biomedical text than in ordinary English. In some cases the sense of the word can only be determined by the context in which it is used. This is why the semantics of the sentence is useful.

Current work on biomedical text mining concentrates on NER (Named Entity Recognition), relationship extraction and hypothesis generation [4]. There have also been many improvements in information retrieval (IR) and information extraction (IE) for biomedical text.

• Named Entity Recognition: NER is a technique that retrieves the names of a specific entity or concept. It has received a lot of attention, especially for names of genes and proteins. The most important work in NER is to find all the possible synonyms for a biomedical entity in different contexts. The NER system must be able to handle ambiguity because the same name can mean different things in different contexts and the same entity can be referred to by different names.

• Information retrieval: This is a method in which documents are selected from a document collection based on a query. Most search engines use IR techniques to give results. The documents are mostly chosen based on the occurrence of the query terms in the documents. The most popular IR system in biological applications is PubMed [5]. Ambiguity affects the results of biological IR systems, which is why most of them use some kind of controlled vocabulary to further refine the results. PubMed uses the MeSH (Medical Subject Headings) terminology [6].

• Relationship extraction: This task uses the techniques of information extraction (IE), which is a method to search for relevant information in one particular document. The results of NER could help in relationship extraction, which tries to obtain the relationships between entities in the text. Until now, a number of research efforts have been made to extract protein interactions and regulatory relationships. The entities can be specified to get all possible relations between them, or the relationship type can be specified to get entities. Either way, the quality of the NER system will affect the results of the relationship extraction. It can be done like NER itself by giving possible words or word patterns for relationships. Relationships are also obtained by statistical means or by linguistic means in which the text is parsed using NLP methods.

• Hypothesis generation: This is a methodology in which relationships across documents that are not apparent are discovered. Most of the current work in this area is based on the idea of 'complementary structures in disjoint literatures', developed by Swanson [4] into a model called Swanson's ABC model. He used this method to manually find several connections across scientific articles and formulated hypotheses. Some of these hypotheses were subsequently proved. There are attempts to make this process automatic and to choose from the automatically generated hypotheses.
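The information retrieval item above states that documents are mostly chosen based on the occurrence of the query terms. A minimal Python sketch of that idea follows. The scoring rule (sum of query term counts), the tokenization and the three toy documents are assumptions made for illustration; this is not how PubMed or any particular search engine ranks results.

```python
import re
from collections import Counter

def tokenize(text):
    """Assumed tokenization: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query, documents):
    """Rank document ids by how often the query terms occur in them."""
    terms = set(tokenize(query))
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:  # documents without any query term are not returned
            scores[doc_id] = score
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical toy collection.
docs = {
    "doc1": "Protein transport across the nuclear membrane.",
    "doc2": "Gene expression in blood cells.",
    "doc3": "Transport of ions; ion transport requires a transporter protein.",
}
result = rank("protein transport", docs)
print(result)
```

Exact term matching of this kind treats 'transporter' and 'transport' as unrelated tokens and cannot resolve ambiguous names, which is one reason the text mentions controlled vocabularies for refining results.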

Besides the aforementioned techniques there are more techniques that are part of text mining like synonym and abbreviation extraction and text classification. All or many of these methods can be integrated to get a better overall result for text mining. For example a system called MedScan uses syntax and semantics along with lexicons to extract relationships between particular biological entities in text [4].

2.2 Corpus

A corpus is a collection of text representative of some particular topic or domain [7]. Corpora were created to represent a language and to be a standard reference for linguistic studies. They used to be created manually and printed on paper, but now the creation of corpora is done automatically, usually in a machine-readable format. Almost all corpora have some kind of annotation; POS tagging is very common, and other features of interest can also be included in the annotation. For general English, the source for the corpora is usually newswire text, because it is diverse enough to represent several facets of the language. In the biomedical field most of the corpora are formed from MEDLINE abstracts, e.g. the GENIA corpus [8]. The GENIA corpus selected abstracts from MEDLINE by using the MeSH terms human, blood cell and transcription factor. The annotation included the end terms of the GENIA ontology besides linguistic features like POS, sentence boundaries and term boundaries.

There are several annotated corpora available with different annotation schemes. The annotation scheme is based on the linguistic requirements of the corpus. The different levels at which annotation can be done can be broadly classified into the phonetic level, morphological level, syntactic level, semantic level, and the pragmatic or discourse level [7] [9]. Phonetic annotation gives information about how a word in the text is pronounced. One annotation mode, on the morphological level, is lemmatization. It involves reducing the words in the text to their roots, which are also called morphemes. The most usual form of annotation is on the syntactic level. One way of syntactic annotation is to use POS tagging and another is parsing. For most corpora, both POS tagging and parsing are done. Then there is the semantic level, in which semantic roles are assigned to different parts of the sentence. We will look more deeply into this in the next section. The pragmatic or discourse level gives information about the text depending on the context in which the sentence occurs. There are in fact innumerable ways of annotating text in a corpus [9]. There can also be application-based annotation, which annotates only the text relevant to the task.

There are several uses of annotated corpora. Some of the uses are for evaluation of a parser [10] and training of systems for NLP tasks [11].

• It is sometimes necessary to evaluate the efficiency of parsers to gauge the quality of one parser in comparison with others. The annotated corpora can act as a gold standard against which the parser output is measured.

• It is also possible to use annotated corpora as training data for automatic labeling or parsing systems. An automatic system is usually created based on the information in one corpus. This system could be used on a corpus with a different annotation scheme (e.g. POS tags or semantic roles), or on a corpus with the same annotation scheme but with text that belongs to a different domain. In these situations, there is a drop in the performance of the automatic system. In order to improve the accuracy of the system, it can be trained on part of the new corpus before being used on the whole text.
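The first use listed above, measuring system output against a gold standard, can be sketched as follows in Python. The choice of precision, recall and F1 over exact (phrase, label) pairs is an assumption made here because it is a common way to score labeling output; the thesis does not prescribe a metric, and the predicted labels below are invented.

```python
def evaluate(gold, predicted):
    """Score predicted (phrase, label) pairs against gold-standard pairs."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)  # pairs that match the gold standard exactly
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Gold annotation of one sentence and a hypothetical system output
# in which one phrase received the wrong label.
gold = {("Amy", "Agent"), ("the ring", "Theme"), ("John", "Recipient")}
predicted = {("Amy", "Agent"), ("the ring", "Theme"), ("John", "Agent")}

precision, recall, f1 = evaluate(gold, predicted)
print(precision, recall, f1)
```

The same function could be applied before and after retraining on part of a new corpus to quantify the drop and recovery in performance described above.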

There is an increasing need for domain-specific corpora because systems trained on one corpus cannot adapt to another domain easily. The size of a corpus is also important because the amount of text that NLP systems have to work with is extremely large. Thus, by creating an extensive corpus, we hope to aid NLP tasks in this domain.


2.3 Semantic Role Labeling

2.3.1 Introduction

Humans understand language by analysing it in five steps, which usually happen simultaneously. The first is phonetic, followed by morphological, syntactic, semantic and pragmatic analysis [12]. For the computer to understand natural language as well, these steps have to be clearly defined and this knowledge must be given to the computer. Morphological knowledge pertains to how a word is derived from its morpheme. Syntactic knowledge concerns the arrangement of words, which should follow the rules of grammar. When the syntax of a sentence is changed, its meaning might be lost or changed. Semantics is concerned with the meaning of words and how the meaning of a sentence depends on the meaning of the words that constitute that sentence. The semantics of a sentence will clearly demarcate between the different senses of a polysemous word. Thus semantics deals with sentence-level interpretation. Pragmatics deals with the context in which the sentence is used and how useful the user of the language finds it. We will only be dealing with sentence-level interpretation, so syntactic and semantic information will be of most importance.

SRL is used to better understand the meaning of a sentence and to make it more machine-readable. This is achieved by assigning semantic roles to the parts of a sentence that answer questions like 'who' and 'where' for a particular predicate [13]. The predicate is a word that describes some action or event in the given sentence. Examples are HURT, EATING, WALKING and GIVING. Most of the predicates chosen are verbs, but they could be any other POS too. The most general and commonly used semantic roles are Agent, Theme and Recipient [13]. These SRs are described in [14] as follows. An Agent is usually a noun phrase that is the instigator of the action referred to in the sentence. This is generally an animate subject, but not necessarily the subject all the time. A Theme is a noun phrase that undergoes some change due to the action in the sentence. The Recipient or beneficiary is the animate subject for whose benefit the action was performed.

E.g. Amy [Agent] gave the ring [Theme] to John [Recipient].

When a sentence is marked like this, tasks like question answering and information retrieval become easier [13]. With the description of each SR it is possible to generate the appropriate questions. There are several SRs that all researchers agree upon, but there is no definitive list of all the SRs possible in the English language. There are several efforts to define predicates and SRs in terms of PAS (Predicate Argument Structures). In a PAS all the SRs (here called arguments) possible for a predicate are listed. Efforts like FN refute the claim that all SRs can be listed for all predicates.
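To illustrate why a sentence marked with semantic roles is easier to use for question answering, the following Python sketch stores the annotated example sentence and looks up the role that answers a question word. The mapping from question words to roles is an assumption made for illustration and is not taken from [13].

```python
# The example sentence, annotated with the three general semantic roles.
annotation = {
    "predicate": "gave",
    "roles": {"Agent": "Amy", "Theme": "the ring", "Recipient": "John"},
}

# Assumed mapping from question words to the roles that answer them.
QUESTION_TO_ROLE = {"who": "Agent", "what": "Theme", "to whom": "Recipient"}

def answer(annotation, question_word):
    """Return the phrase filling the role asked for, or None if unknown."""
    role = QUESTION_TO_ROLE.get(question_word.lower())
    return annotation["roles"].get(role)

print(answer(annotation, "who"))      # asks for the Agent
print(answer(annotation, "To whom"))  # asks for the Recipient
```

Without the role labels, the same lookup would require syntactic analysis of the raw sentence for every question.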

The syntax of a sentence imposes a structure on the sentence and this leads to some syntactic roles based on grammar. Most of these roles can be identified by knowing things like the POS of the tokens. The most basic POS are noun, verb, article, adjective, preposition, number and proper noun. They give syntactic roles like subject, object etc. Based on the semantics of a sentence, SR can be assigned and also mapped to syntactic roles. The mapping of the SR to syntactic roles does not have hard and fast rules but can be done intuitively [13].

2.3.2 Resources

Here we discuss the resources relevant to SRL in general English and in domain-specific text. FN and PropBank are, among other corpora, frequently used corpora based on newswire text. Under related work we look at BioFN, PASBio and BioProp, which are corpora for the biomedical domain.

2.3.2.1 FrameNet

FrameNet is a corpus in which text is annotated based on frame semantics [15]. The core idea of this method is that for a word to be understood in the right sense, the context in which it appears has to be known. Our corpus also uses this kind of semantics for annotation. This context is called a frame, and each frame has a set of lexical units (LU) that can evoke it. Thus the frame takes into account the semantics of the sentence in which the predicate occurs. The LUs could be nouns, verbs or any other POS. The frame is defined, and within each frame there is a set of SRs (in this case called frame elements) pertinent to that particular frame. This method argues against the theory that all semantic roles possible for all predicates should be listed. These FEs (frame elements) are further classified in order to better define the frame. The most essential and unique FEs are called the core elements, because without them the sentence would not fit into that particular frame. Then there are some FEs that give extra information on the main frame, and these are called non-core or peripheral FEs. Besides this, some FEs do not contribute to the event that the predicate puts focus on. These FEs can give useful information and are called extra-thematic FEs. All FEs are usually given names that describe their roles with regard to the predicate and frame. This is illustrated in the following example.

For example, in the frame Arranging, the core FEs are Agent, Configuration and Theme. The Agent is the person doing the arranging, the Theme is the thing undergoing the arrangement and the Configuration is the final arrangement. These are core FEs because without these three participants there is no logical way to explain the frame Arranging. The non-core or peripheral FEs for this frame are general ones like Circumstances, Degree, Location, Instrument and Manner. They are not integral parts of the frame itself but could give valuable information.
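The Arranging example can be written down as a small data structure. The Python sketch below is only an illustration of the core versus peripheral distinction; the class and method names are hypothetical and do not correspond to FrameNet's data format or to the annotation tool built in this thesis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    """A frame with its core and peripheral frame elements (FEs)."""
    name: str
    core: frozenset
    peripheral: frozenset = frozenset()

    def missing_core(self, realized):
        """Core FEs that are not among the FEs realized in a sentence."""
        return sorted(self.core - set(realized))

# FE names taken from the Arranging example above.
arranging = Frame(
    name="Arranging",
    core=frozenset({"Agent", "Theme", "Configuration"}),
    peripheral=frozenset(
        {"Circumstances", "Degree", "Location", "Instrument", "Manner"}
    ),
)

# A hypothetical sentence that realizes only the Agent and the Theme.
missing = arranging.missing_core({"Agent", "Theme"})
print(missing)
```

Listing the core FEs that a sentence leaves unrealized is useful to an annotator because those are the elements that still need to be accounted for in the annotation.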

FEs can be related to each other, and so can frames. Some of the relations between FEs are Requires and Excludes [15]. The Requires relationship between two FEs indicates that if one of the FEs occurs, the other one has to occur in the same sentence. Excludes has the opposite meaning: if one FE occurs, the other cannot occur. Between frames there are relationships like Inheritance, Subframe and Perspective on. For example, the frame Annoyance inherits from the frame Emotions by Stimulus.
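The Requires and Excludes relations, as defined above, amount to simple co-occurrence constraints on the FEs found in one sentence. The Python sketch below checks them; the FE names FE_A, FE_B and FE_C are placeholders invented for illustration and are not FEs of any actual frame.

```python
def check_fe_relations(realized, requires=(), excludes=()):
    """Return a list of violated FE relations for one annotated sentence.

    requires: pairs (a, b) meaning that if a occurs, b must also occur.
    excludes: pairs (a, b) meaning that a and b must not occur together.
    """
    problems = []
    for a, b in requires:
        if a in realized and b not in realized:
            problems.append(f"{a} requires {b}")
    for a, b in excludes:
        if a in realized and b in realized:
            problems.append(f"{a} excludes {b}")
    return problems

# Placeholder relations between hypothetical FEs.
requires = [("FE_A", "FE_B")]
excludes = [("FE_A", "FE_C")]

ok = check_fe_relations({"FE_A", "FE_B"}, requires, excludes)
bad = check_fe_relations({"FE_A", "FE_C"}, requires, excludes)
print(ok)
print(bad)
```

A check of this kind could flag inconsistent annotations during manual work, although whether and how a tool enforces these relations is a design decision not covered here.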

When any text is annotated according to FN rules, it must have a minimum of four layers. They are the target layer, the semantic role (FE) layer, the phrase type (PT) layer and the grammatical function (GF) layer [15]. Each sentence is annotated with respect to the predicate chosen in it, which is called the target. Based on the frame that the LU belongs to, the FEs are assigned to the dependents of the target LU. Once the FEs have been determined, the PT and GF of these fragments of the sentence are assigned. The PT records the syntactic category of the fragment. Then a GF is specified for the FE fragment. This GF is in relation to the target, so assigning GFs differs between targets of different POS. Both PT and GF are assigned only to FEs, not to the target [15]. Sometimes the core FEs are not realized in the sentence; then they have to be given as null instantiations (NI). This feature is useful in cases where the FE can be understood only from the context, or where the grammar of the sentence allows the FE to be left out of the sentence.

There are more annotation layers which can be marked, one of them is called Other. This layer is useful for marking extra linguistic features of the sentence. One layer holds the NIs that occur in a sentence. There is also a layer that will indicate the POS of the predicate and one more called the sentence layer.
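The four mandatory layers can be pictured as parallel sets of labels over the same sentence. The Python sketch below serializes such an annotation to XML with the standard library. The element and attribute names (sentence, text, layer, label, start, end) are hypothetical and chosen for illustration; they are not the actual FrameNet schema nor the output format of the annotation tool described in this thesis. The role labels reuse the general example from the introduction to SRL, and the PT and GF values are illustrative.

```python
import xml.etree.ElementTree as ET

LAYERS = ("Target", "FE", "PT", "GF")

def span_of(sentence, phrase):
    """Inclusive character offsets of the first occurrence of phrase."""
    start = sentence.index(phrase)
    return {"start": str(start), "end": str(start + len(phrase) - 1)}

def annotate(sentence, target, frame_elements):
    """Build a four-layer annotation as an XML tree (hypothetical schema)."""
    root = ET.Element("sentence")
    ET.SubElement(root, "text").text = sentence
    layers = {n: ET.SubElement(root, "layer", name=n) for n in LAYERS}
    ET.SubElement(layers["Target"], "label", name="Target",
                  **span_of(sentence, target))
    for phrase, fe, pt, gf in frame_elements:
        span = span_of(sentence, phrase)
        # PT and GF are assigned only to FE fragments, not to the target.
        ET.SubElement(layers["FE"], "label", name=fe, **span)
        ET.SubElement(layers["PT"], "label", name=pt, **span)
        ET.SubElement(layers["GF"], "label", name=gf, **span)
    return root

root = annotate(
    "Amy gave the ring to John.",
    "gave",
    [("Amy", "Agent", "NP", "Ext"),
     ("the ring", "Theme", "NP", "Obj"),
     ("to John", "Recipient", "PP", "Dep")],
)
print(ET.tostring(root, encoding="unicode"))
```

Keeping each layer as a separate element with shared character offsets makes it straightforward to add further layers, such as one for null instantiations, without changing the existing ones.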

2.3.2.2 PropBank

Proposition Bank (PropBank) added a layer of PAS to the TreeBank database of parsed sentences [16]. A predicate is a word that plays the role of a verb in a sentence. But a predicate need not be only a verb, any POS could be a predicate. The PropBank has only verbs now but will expand later to other POS too [16].

The method that PropBank has followed is to first decide on the lexemes or predicates, and then to analyze different sentences containing each word. If the chosen word has more than one meaning or sense, all of them are documented. Then the number of arguments that can possibly occur for that predicate is recorded. The arguments are listed as semantic roles in what is called a roleset [16]. It is not necessary that all the arguments in the roleset appear in all the sentences with the predicate. The arguments are numbered arg0 to arg5 wherever applicable [16]. Modifiers are marked separately and prepositions are also noted. The modifier arguments (argM) are given functional tags to denote the information that they provide to the sentence. The role arg0 is usually used to denote an agent, so if a verb cannot have an agent or subject, arg0 is omitted. Each predicate has a set of syntactic frames which depict all the ways in which the predicate-argument structure can occur in English. The roleset and syntactic frames of each predicate are together called a frameset [16].
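A roleset of the kind described above can be sketched as a mapping from numbered arguments to role descriptions. In the Python sketch below, the role descriptions for 'give' and the modifier tag argM-TMP are given for illustration only and should be checked against the PropBank frame files before being relied on; the function name is hypothetical.

```python
# Illustrative roleset for one sense of the predicate 'give'.
roleset = {
    "predicate": "give",
    "roles": {"arg0": "giver", "arg1": "thing given", "arg2": "entity given to"},
}

def describe(roleset, labeled_args):
    """Replace numbered argument labels with their role descriptions.

    Modifier arguments (labels starting with 'argM') are kept as they are,
    and arguments of the roleset may be absent from a given sentence.
    """
    described = {}
    for label, phrase in labeled_args.items():
        if label.startswith("argM"):
            described[label] = phrase
        elif label in roleset["roles"]:
            described[roleset["roles"][label]] = phrase
        else:
            raise ValueError(f"{label} is not in the roleset of {roleset['predicate']}")
    return described

# 'Amy gave the ring yesterday': arg2 is not realized in this sentence.
result = describe(roleset, {"arg0": "Amy", "arg1": "the ring", "argM-TMP": "yesterday"})
print(result)
```

Unlike FE names, the numbered labels carry meaning only relative to the roleset of a particular predicate, which is why the lookup needs the roleset as input.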

2.3.2.3 Related work

2.3.2.3.1 BioFrameNet FN was extended to the biomedical domain, specifically to molecular biology, by Dolbey A. [17]. This effort is called BioFrameNet (BioFN) and deals with GeneRIFs (Gene References into Function) that were annotated by Lawrence Hunter's Bioinformatics Lab at the University of Colorado Health Sciences Center. GeneRIFs (GRIF) are short descriptions of the functions of genes. Even though they have a limit on length, these descriptions can provide a lot of useful information. The Hunter Lab knowledge base (HLKB) consists of particular GRIFs chosen because of the idea they demonstrate. As indicated by the name, BioFN forms frames and FEs that are used to annotate the sentences as done in FN.

BioFN focuses on the phenomenon of intracellular protein transport. The annotated GRIFs belong to the class of protein transport in the HLKB. The annotations followed the FN rule that the dependents of the predicate must be marked with semantic roles. There are two frames that BioFN deals with, the Protein transport and Cause protein transport frames. These frames are related and differ in only one aspect. The Cause protein transport frame has an extra FE to indicate the substance that is transporting the transported entity. This FE is called the Transporting entity. The FEs common to both frames are Transported entity, Transport origin, Transport destination and Transport locations. All the FEs in BioFN are core, and three of them constitute a core set. The Transport origin, Transport destination and Transport locations together satisfy the conditions for forming a core set.

This corpus is comparable to the one we built in our thesis because the theme in both is transport and both are based on the rules of FN annotation.


2.3.2.3.2 PASBio This is an attempt to apply PAS to verbs from the molecular biology domain. The approach was to select verbs based on their frequency of appearance in MEDLINE articles and their importance in the events of molecular biology [11]. These words were compared with the existing PAS frames in PropBank. If the predicates appear in the same sense, they are mapped to the PropBank frames. Sometimes the predicates have different arguments in the domain usage, and this causes the frames to change slightly. For example, PropBank and FN consider some information extra-thematic if it is not within the scope of the predicate, but such information is sometimes vital in the biomedical literature. So far there are 30 predicates with frames in the PASBio database. PASBio is widely used for annotating biomedical text.

2.3.2.3.3 BioProp This is a corpus that was developed in order to train SRL systems to annotate biomedical text. An SRL system, SMILE (SeMantIc roLe labEler), was trained on PropBank I and then used to automatically annotate biomedical text from the GENIA corpus [11]. The verbs were selected based on their importance to biological processes and the frequency with which they appear in the GENIA corpus. If a verb did not appear in PropBank I, the frameset of a similar verb in PropBank I was taken, but most of the 30 verbs used in BioProp were found in PropBank I. Some alterations had to be made to accommodate the different usage of a verb in general English and in biomedical text. The automatically annotated sentences were reviewed by experts and corrected if necessary. Once the corrections were made, the SMILE SRL system was trained on the BioProp corpus; the resulting SRL system is called BIOSMILE (BIOmedical SeMantIc roLe labEler).

2.4 Ontology

The concept of an ontology is derived from philosophy, where it is a means of expressing the existence of things [18]. An ontology represents a domain by defining the important terms in that domain and the relations between them. It is made up of the important terms, which are called concepts, and their definitions. These concepts are usually arranged in a hierarchy or graph which highlights the relationships between them [19]. The most common relationships are is a and part of. There are many more relations that represent the domain knowledge in depth. Instances are the entities that the concepts represent, but not all ontologies specify them. Axioms are sometimes also present in an ontology to provide more information about the domain. It is also possible to set restrictions on the concept relations to make sure that no false or incomplete data is added.

There are different types of ontologies, varying in their complexity and structure as given in [19]. The most basic ontology is a controlled vocabulary, which is just a list of terms in the domain and their definitions. When these concepts are arranged in an is a hierarchy, it is called a taxonomy. If the ontology is in the form of a graph with concepts as nodes and relations as arcs, it is called a thesaurus. The highest level of an ontology is a knowledge base, which is usually based on a formal language. This is called a formal ontology, which is made up of several rules and logic that guide its concept definitions and the relations between the concepts.

Numerous ontologies are being created for the biomedical domain, especially through the efforts of the OBO Foundry [19]. The OBO Foundry allows users to create ontologies in the biomedical field, but there are rules that must be followed during the creation of these ontologies. This reduces ambiguity and makes the ontologies compatible for information exchange.

2.4.1 Ontology Vs Frame Semantics

Frame semantics has several similarities to an ontology in terms of structure and perhaps applications too. The concepts in an ontology are comparable to the frames defined by frame semantics. Both are agreed-upon descriptions of a particular event, situation or theme. Both concepts and frames can have relations. In some cases the relations are directly transferable, e.g., the is a relation between concepts and the inheritance relation between frames express the same idea. The definition of a concept can thus help define a frame. The LUs for the frame can be obtained from the terms in the concept. Since the concept is well defined, the semantic sense of polysemous words will be very clear. This is why we have chosen to create FN-like frames from ontologies.

We believe that forming frames from the concepts will allow for better annotation of biomedical text because the domain knowledge is transferred to the frames. Ontologies are being used for annotation, but mostly in NER applications, which are not linguistically rich. FN allows several layers of annotation which add a lot of semantic and syntactic information to the annotation. The resulting corpus will be useful for several linguistic tools like parsers and semantic role labeling systems.

2.5 Annotation tools

There are several annotation tools available for unstructured text. Some are general purpose and can be tailored, to an extent, to suit one’s needs. Others were developed specifically for one method of building a corpus and cannot be modified for general usage or for any other corpus. Building such a dedicated tool is sometimes advantageous if the available tools do not give satisfactory results, or it is done simply for convenience. Here we discuss two annotation tools: one is a general purpose tool and the other is specific to FN.

2.5.1 Knowtator

This is a general purpose tool that is a plug-in for the Protégé knowledge base system [20]. It can process large amounts of unstructured text to deliver an annotated corpus. Since this tool works in the Protégé environment, it structures the annotation task like a scenario that is to be represented by a knowledge base. Thus the annotation schema is usually given in an ontology format with classes, instances and relationships between classes. Besides that, the annotation schema can also hold information like the color in which each class is to be highlighted. It can also handle complex relationships between classes.

2.5.2 FN tool

In the FN project the annotation was first done with the Alembic Workbench annotation tools [21]. This is a general purpose annotation tool that was created by MITRE, a non-profit organization that provides solutions for systems engineering and information technology, among others. The Alembic Workbench is a system that can work on XML and has some templates for annotation, like most general purpose annotation tools. The user also has the option of using NLP techniques that will help machine learning. But now the FN annotators use a tool that was created by the FN staff themselves [21]. It was custom made for FN annotation and stores the results of the annotation in an SQL database. The tool has been made in such a way that it can handle multiple or overlapping layers of annotation. It also allows for more than one predicate in a sentence, setting different annotation layers for each predicate.

Chapter 3 Corpus Development

The development of a corpus annotated by FN rules has to start with defining the frame and its frame elements, choosing predicates and then finding suitable example sentences to be annotated.

3.1 BioOntoFN

3.1.1 Domain Knowledge

Before going into how the corpus was constructed, we describe the two sources of domain knowledge that we used in the project, GO and two of the UMLS components. Here is a brief introduction to them.

3.1.1.1 GO biological transport

GO version 1.2155 of 2011-08-05 has 34761 terms in total. The three major classes are biological process, cellular component and molecular function. Transport is a child of establishment of localization, which is a subclass of biological process. The term protein transport is a subclass of transport. A frame based on the term protein transport was created and we have extended it to its parent term transport.

The term transport and all its subclasses form a sub-ontology on which we worked to give the frame definition and the frame elements. We have started with the transport event because it is the most important subcellular event. It is the basis of, or sometimes the explanation behind, other activities at the cellular or sub-cellular level. The transport sub-ontology was also chosen because it is not a very general term but is general enough to encompass several important events. Thus when a term is chosen from an ontology it should cover only one kind of process or event. This is useful to keep the frame definition simple and effective.
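A minimal sketch of how such a sub-ontology could be collected, assuming the is a links are available as a mapping from each term to its direct children. The toy hierarchy contains only the four terms named above; the function name and the dictionary are illustrative and do not come from GO or from any GO library.

```python
from collections import deque

# Toy is_a hierarchy (parent -> direct children), limited to terms named in the text.
IS_A_CHILDREN = {
    "biological process": ["establishment of localization"],
    "establishment of localization": ["transport"],
    "transport": ["protein transport"],
    "protein transport": [],
}


def sub_ontology(root, children):
    """Return the root term followed by all of its descendants, breadth first."""
    seen = [root]
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in seen:  # a term may be reachable along several paths
                seen.append(child)
                queue.append(child)
    return seen


print(sub_ontology("transport", IS_A_CHILDREN))
```

Applied to the real ontology, the same traversal started at transport (GO:0006810) would yield the set of concepts that the frame is built from.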

3.1.1.2 UMLS

The Unified Medical Language System (UMLS) was created by the U.S. National Library of Medicine in order to resolve ambiguities in medical terminology and to improve its interoperability between computer systems. The UMLS has three different tools, also called knowledge sources: the metathesaurus, the semantic network, and the SPECIALIST lexicon and lexical tools [22].

• The metathesaurus is a source of biomedical concepts from several terminologies or ontologies. Some of the sources are MeSH, GO and the NCBI (National Center for Biotechnology Information) taxonomy [22]. Each concept in the metathesaurus has several terms that come from the constituent vocabularies. In this way each term in a source vocabulary is mapped to a concept in the metathesaurus. This helps resolve ambiguities when several vocabularies are to be processed.

• The semantic network was developed to give semantic types and semantic relations to the concepts in the metathesaurus [23]. The semantic network can also be considered a top-level ontology because it has very broad concepts, e.g., Organism, Event, Activity. Assigning such labels to the concepts in the metathesaurus makes it easier to map terms from the source vocabularies to concepts. It also has relations that can be applied to the concepts to explain them better. A few of the relations are consists of, alters, secreted by and articulates with.

• SPECIALIST is a set of NLP tools that can be used for further refining or expanding the metathesaurus.

Among the available UMLS knowledge sources we have worked with the metathesaurus and the semantic network. The metathesaurus was used to check if there were any terms from vocabularies other than GO that map to GO concepts. The semantic network was used to assign semantic types to the FEs in our frame. This will be discussed more in the later sections of this chapter.

3.1.2 Frame definition

The frame transport is defined based on the GO definition of the term transport. The definition was not taken as such but was modified to fit in the different FEs. The GO definition of the term transport is,

”The directed movement of substances (such as macromolecules, small molecules, ions) into, out of or within a cell, or between cells, or within a multicellular organism by means of some agent such as a transporter or pore.”

We defined the frame after deciding on the FEs. The definition we have given the frame is,

”This frame deals with the cellular process in which a substance, the Transport Entity, moves from the Transport Origin to a different location, the Transport Destination. Sometimes the Transport Origin and Transport Destination are not specified or are the same location. The Transport Entity could be a macromolecule, small molecule or ions which undergoes directed movement into, out of or within a cell or between cells or within a multicellular organism. This activity could be aided or impeded by other substances, organelles or processes and could influence other cellular processes.”

The frame protein transport was already created by He Tan et al [1]. The frame protein transport inherits from the frame transport. This inheritance relationship is consistent with the is a relationship between the two corresponding terms in GO.

3.1.3 Frame Elements

The FEs are the semantic roles in FN, and we used all the semantic roles that are pertinent to our particular frame. We identified them solely from GO, which provides several synonyms for each term. The synonyms are useful because they present the same concept from a different perspective and hence give more information on how the transport event occurs. There are 1040 GO concepts in the transport sub-ontology. Counting all the synonyms and the main terms, there are 2112 terms. The terms in the transport sub-ontology provide a lot of information about the transport event, such as where something is being transported from or where it is being transported to. Sometimes the condition under which the transport event occurs is also given in the term. Thus, by manually going through the whole sub-ontology, we have defined FEs for all the information that could be useful when describing transport.

3.1.3.1 Identifying FEs in the GO terms

Here are some examples of how we decided on the FEs and their definitions.

ET1. 2-aminoethylphosphonate transport:GO:0033223

ET2. ER to Golgi ceramide transport:GO:0035621—endoplasmic reticulum to Golgi ceramide transport—ER to Golgi ceramide translocation—non-vesicular ceramide trafficking

The easiest FE to recognize was the entity that is being moved. In ET1 (example term), ’2-aminoethylphosphonate’ is the entity being transported. ET2 shows the origin, ’ER’, and the destination, ’Golgi’, of the entity, ’ceramide’. Hence we got the main FEs important for a transport event: the entity being moved, its origin and its destination. But in some terms, as in ET3, the transport event does not have a distinct origin and destination. For such cases we included an FE called location, which can be used if the origin and destination are the same site or are just not specified.

ET3. B cell receptor transport within lipid bilayer:GO:0032595—B cell receptor translocation within membrane—BCR translocation within membrane—BCR transport within lipid bilayer

ET4. Golgi to endosome transport:GO:0006895—Golgi to endosome vesicle-mediated transport—TGN to endosome transport—trans-Golgi to endosome transport

ET5. DNA secretion by the type IV secretion system:GO:0044098—DNA secretion via the type IV secretion system

ET6. G-protein coupled receptor internalization:GO:0002031

Besides these FEs there are several other participants in a transport event. The first synonym in ET4 indicates a transporter for the entity, the ’vesicle’. It carries the entity from the origin to the destination and we called it the transporter FE. In ET5, besides the entity and the destination, there is mention of a cellular structure, ’the type IV secretion system’, through which the transport takes place. Since it helps the transport but is stationary, we have named it the FE path. In ET6, ’G-protein’ helps the transport to happen; it is not clear whether it is the path or the transporter, but it is essential for the transport. There are several cases like this where some molecule, cellular structure or chemical environment helps the transport event. There are also cases where the purpose of the transport is given in the term (ET7). Any event, environment, molecule or cellular structure that triggers or affects the transport event is called a condition. It is the broadest of our FEs and can take a wide range of realizations, from a cellular structure to a chemical environment.

ET7. ATP synthesis coupled proton transport:GO:0015986—chemiosmosis

ET8. antigen transcytosis by M cells in mucosal-associated lymphoid tissue:GO:0002412—antigen transcytosis by M cells in MALT—antigen transport by M cells in MALT—antigen transport by M cells in mucosal-associated lymphoid tissue

ET9. anterograde axon cargo transport:GO:0008089—anterograde axonal transport

ET10. endocytosis:GO:0006897—nonselective vesicle endocytosis—plasma membrane invagination—vesicle endocytosis

ET11. acrosome reaction:GO:0007340

In ET8 ’M cells’ is the FE location and ’MALT’ is the region in which the ’M cells’ are present, so we introduced an FE called place. ET9 is one of the few terms that show the FE direction: ’anterograde’ denotes which part of the cell the ’cargo’ (entity) is moving towards. ET10 has the synonym ”nonselective vesicle endocytosis”. In this term ’nonselective’ does not influence the transport but is a characteristic of the transport event; we call this an attribute of the transport event. In terms like ET11, even though ’acrosome’ is a location, it is not counted because reaction is not taken to be a predicate, which is discussed further in the next section.
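The example terms in this section follow one pattern: the main term, the GO identifier, and then any synonyms, each introduced by a long dash. A minimal sketch of a parser for that notation is given below; it assumes exactly this layout, and the function and field names are illustrative only. Counting the main term together with its synonyms gives the per-concept contribution to the total number of terms.

```python
EM_DASH = "\u2014"  # separator between the main term and each synonym


def parse_term_line(line):
    """Split 'main term:GO:0000000<dash>synonym<dash>...' into its parts."""
    head, *synonyms = line.split(EM_DASH)
    name, go_number = head.rsplit(":GO:", 1)
    return {
        "name": name.strip(),
        "id": "GO:" + go_number.strip(),
        "synonyms": [s.strip() for s in synonyms],
    }


# ET2 from the examples above.
et2 = parse_term_line(
    "ER to Golgi ceramide transport:GO:0035621"
    "\u2014endoplasmic reticulum to Golgi ceramide transport"
    "\u2014ER to Golgi ceramide translocation"
    "\u2014non-vesicular ceramide trafficking"
)

# One main term plus its synonyms: the number of terms for this concept.
term_count = 1 + len(et2["synonyms"])
print(et2["id"], term_count)
```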

The list of all the FEs and their definitions is given below.

1. Transport Entity(TE):

The substance (such as a macromolecule, small molecule, ion) which is undergoing the motion event into, out of or within a cell, or between cells, or within a multicellular organism

2. Transport Origin(TO):

The organelle, cell, tissue, gland or organism from which the Transport Entity is moved to a different location

3. Transport Destination(TDS):

The organelle, cell, tissue, gland or organism to which the Transport Entity is moved from a different location

4. Transport Location(TL):

The organelle, cell, tissue, gland or organism where the motion event takes place when the origin and the destination are the same or when origin or destination is not specified

5. Transport Condition(TC):

The event, substance, organelle or chemical environment which positively or negatively influences, or is influenced by, the motion event. The substance or organelle does not necessarily move with the Transport Entity

6. Transport Transporter(TT):

The substance, organelle or cell crucial to the motion event, that moves along with the Transport Entity, taking it from the Transport Origin to the Transport Destination

7. Transport Path(TP):

The substance or organelle which helps the entity to move from the Transport Origin to the Transport Destination, sometimes by connecting the two locations, without itself undergoing translocation

8. Transport Place(TPL):

The organelle, cell, tissue, gland or organism where the Transport Origin, Transport Destination or Transport Location is situated

9. Transport Direction(TDR):

The direction in which the motion event is taking place with respect to the Transport Place, Transport Origin, Transport Destination or Transport Location.

10. Transport Attribute(TA):

This describes the motion event in more detail by giving information on how (particular movement, speed etc.) the motion event occurs. It could also give information on any characteristic or atypical features of the motion event.

We calculated the frequency of each of the FEs’ appearance in the sub-ontology and the UMLS terms.

FE      Number   Percentage
TE      2070     92.6
TO       272     12.2
TDS      432     19.3
TC       221      9.9
TL       127      5.7
TP       164      7.3
TT        43      1.9
TDR       34      1.5
TA        40      1.8
TPL        8      0.36

Table 3.1: The frequency of FE appearance in GO and UMLS terms [1]
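The percentages in Table 3.1 are consistent with dividing each count by the 2235 terms (GO plus UMLS) reported in section 3.1.4. That denominator is inferred from the numbers and is not stated with the table, so the short check below should be read under that assumption.

```python
TOTAL_TERMS = 2235  # assumed denominator: GO and UMLS terms together

FE_COUNTS = {"TE": 2070, "TO": 272, "TDS": 432, "TC": 221, "TL": 127,
             "TP": 164, "TT": 43, "TDR": 34, "TA": 40, "TPL": 8}


def percentage(count, total=TOTAL_TERMS, digits=1):
    """Share of terms in which an FE appears, as a rounded percentage."""
    return round(100 * count / total, digits)


for fe, count in FE_COUNTS.items():
    print(fe, count, percentage(count))
```

With one decimal place the TPL row rounds to 0.4; the table gives it with two decimals, which `percentage(8, digits=2)` reproduces.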

Before each FE was defined, we looked at every realization of the FE in the GO sub-ontology. For example, the origin, destination and location realizations in the GO terms were only organelles, cells, tissues or glands. All the definitions of the FEs are consistent with the information in the ontology.

3.1.3.2 Classification of FEs

Once the FEs are decided upon, their coreness has to be decided. The core FEs are those FEs that make the frame unique. They are the most essential FEs, which give the frame its meaning. We decided on the core elements based on what is important for the frame and the frequency with which the FEs appear in GO. Thus we decided that the FEs TE, TO, TDS and TL are the core elements of this frame. This is because for a transport event to occur, the entity being transported and the location it comes from or the location it reaches are very important. All core FEs have to be marked on a sentence that is annotated for a frame. All the other FEs give extra information that is useful but not essential for the frame.

3.1.3.3 Relationships between FEs

In FN it is possible to assign relationships between semantic roles. Among the FEs that we have, there are three relationships that are assigned. They are, coreness set, excludes and proto-frame.

1. Coreness set: This relationship is assigned to a group of FEs that have the same valence pattern, that is, they are all of the same phrase type and grammatical function. So just one member of the set appearing in a sentence will satisfy the valence pattern. It is very rare for all the members of a set to be in the same sentence. This relationship is only for core FEs, as the name suggests. The FEs TO, TDS and TL belong to the same core set. This is because not all sentences give the origin, destination and location of the entity, and these FEs have the same PT and GF. Most of the time, if the origin and destination are not given, the location is sufficient. So even though TO, TDS and TL are all core FEs, it is sufficient to mark only one in a sentence.

2. Excludes: Sometimes the appearance of one FE rules out the appearance of another. This is true within the core set: TO and TDS exclude TL. TL occurs only if the sentence does not specify the origin or destination or if the transport event occurs within one location. Thus if the origin or destination is specified there is no need to mark the location, and if the location is marked in a sentence the origin or destination cannot be marked.

3. Proto-frame: This relationship is still not in use according to the FN book [15], but it explains the relationship between a few of our FEs. If one or more FEs are specific cases of another FE, the more general FE is called a proto-frame element. This is true of the FE TC. The condition FE could be a cellular component, molecule or just a chemical environment that encourages or discourages the transport event. The FEs TP and TT are specializations of TC. If a molecule or cellular component is stationary and the transport event occurs through it, then it is called a path. But if the molecule or cellular component moves along with the entity from the origin to the destination, it is called the transporter. If it is not clear what role the molecule or cellular component is playing in the transport, it can simply be labeled TC.
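The core set and excludes relations above can be read as simple checks on the set of FEs marked in one sentence. The following is a sketch of such a check, not part of the annotation tool described in this thesis; the function name and the message strings are illustrative, and an FE given an NI is treated as marked.

```python
CORE_SET = {"TO", "TDS", "TL"}            # one member is enough in a sentence
EXCLUDES = {"TO": {"TL"}, "TDS": {"TL"}}  # TO and TDS each exclude TL


def check_fes(marked):
    """Return a list of problems for the set of FE labels marked in a sentence.

    `marked` holds the FE abbreviations that are annotated or given an NI.
    """
    problems = []
    if "TE" not in marked:
        problems.append("core FE TE missing")
    if not CORE_SET & marked:
        problems.append("no member of the core set TO/TDS/TL")
    for fe, excluded in EXCLUDES.items():
        clash = excluded & marked
        if fe in marked and clash:
            problems.append(f"{fe} excludes {sorted(clash)[0]}")
    return problems


print(check_fes({"TE", "TO", "TDS", "TP"}))  # origin and destination given
print(check_fes({"TE", "TO", "TL"}))         # origin and location together
```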

3.1.4 Identifying predicates

An ontology is a very concise representation of domain knowledge, so each of the terms in the sub-ontology that we have chosen describes some transport event. Since this description is as short as possible, it is very easy to choose the predicates or lexical units (LUs) that can best represent transport. Usually the head of the GO term, a noun, is taken as an LU, and if its verb form seems suitable for describing the transport event, that is also taken as an LU. For example, let us look at the following terms that are subclasses of transport.

ET12. D-aspartate import:GO:0070779—D-aspartate uptake

ET13. CVT pathway:GO:0032258—cytoplasm to vacuole targeting—cytoplasm-to-vacuole targeting

ET14. glycolipid translocation:GO:0034203—flippase—scramblase

ET1 (given in section 3.1.3.1) is a very simple term which gives the entity being transported and an LU, ’transport’. All the LUs are in bold font. ET12 has a main term and a synonym. There is an LU in the main term, but the synonym has another possible LU. Both words have the same meaning, and the second is discovered easily because the ontology presents it to us. In ET13, the main term does not have a useful LU but its synonyms do. They also provide extra information about the transport event. As ET14 shows, not all the synonyms provide useful information. Some synonyms are classified as broad or related and do not contain any potential LUs, as in ET15.

ET15. mucus secretion:GO:0070254—mucus production

ET16. potassium ion transport:GO:0006813—low voltage-dependent potassium channel auxiliary protein activity—low voltage-gated potassium channel auxiliary protein activity—potassium conductance—potassium ion conductance—K+ conductance—potassium transport—sodium/potassium transport

Some terms like ET11 (in section 3.1.3.1) might not give any LUs even though they do denote some kind of transport. The word reaction is too general to refer only to transport, which is why we did not choose it as an LU. In ET16 the noun channel is not a predicate, but its verb form is quite commonly used to describe transport events in the cell. Thus we have taken only the verb form as an LU, which is a little different from the usual method. There are some special terms like ET10 and ET17, in which a single word describes the transport and sometimes the entity is part of that word.

ET17. defecation:GO:0030421

ET18. establishment of protease localization in mast cell secretory granule:GO:0033372—establishment of protease localisation in mast cell secretory granule

ET19. protein import into peroxisome matrix, translocation:GO:0016561

ET20. synaptic vesicle to endosome fusion:GO:0016189—synaptic vesicle fusion

Not all the LUs are made up of only one word; ET18 shows the LU ’establishment of...localization’. The whole LU is needed for describing the transport event. Some terms like ET19 have more than one LU because they are about more than one event happening simultaneously. Though some words, like fusion in ET20, do describe transport, it is only in a very narrow sense. There are very few usages of the word fusion in the transport sense, which caused us to reject it as an LU.

Once we were done with the GO terms, we also referred to the UMLS metathesaurus. For each GO concept we looked at the synonyms that UMLS returned from GO and other ontologies. This was done to see if any other ontology has new predicates that we could use. In this manner we discovered 123 new terms and, from them, 2 new predicates. We also included the terms that were not found in GO. This increased the total count of the terms to 2235.

Finally there were a total of 68 predicates in the 2235 terms; the frequency of each predicate is given below. All these 68 are nouns; by deriving verbs from them, the total number of LUs that we worked with is 127.

Absorption(9), Budding(3), Channel(3), Chemiosmosis(1), Clearance(4), Conductance(4), Congression(1), Defecation(1), Degranulation(15), Delivery(2), Discharge(1), Distribution(4), Diuresis(1), Efflux(6), Egress(2), Endocytosis(16), Entry(3), Establishment of...localization(14), Establishment of...localisation(5), Exchange(5), Excretion(3), Exit(2), Exocytosis(25), Export(122), Hydration(1), Import(168), Influx(1), Internalization(9), Invagination(1), Lactation(1), Loading(1), Localization(6), Micturition(1), Migration(7), Mobilization(2), Movement(4), Natriuresis(1), Phagocytosis(2), Pinocytosis(2), Positioning(4), Progression(1), Reabsorption(3), Recycling(14), Reexport(2), Release(15), Removal(1), Retrieval(1), Reuptake(14), Secretion(338), Shuttle(6), Sorting(3), Targeting(71), Trafficking(3), Transcytosis(10), Transfer(1), Translocation(87), Transpiration(1), Transport(1011), Uptake(53), Urination(1), Voiding(1), Salivation(1), Streaming(1), Sequestration(1), Elimination(3), Flow(1)
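Frequencies of this kind could be obtained by counting, for each predicate, the terms that contain it. The sketch below illustrates the counting on a handful of terms quoted in this chapter and a small subset of the predicates. It is only an approximation of the manual procedure described above: plain word matching does not handle multiword LUs such as ’establishment of...localization’, nor the judgments by which words like fusion were rejected.

```python
import re
from collections import Counter

# Small illustrative subset of the predicate nouns listed above.
PREDICATES = {"transport", "import", "uptake", "translocation", "secretion"}

# Example terms taken from this chapter (main terms and synonyms).
TERMS = [
    "2-aminoethylphosphonate transport",
    "D-aspartate import",
    "D-aspartate uptake",
    "protein import into peroxisome matrix, translocation",
    "mucus secretion",
]


def predicate_counts(terms, predicates):
    """Count, for each predicate, the number of terms that contain it."""
    counts = Counter()
    for term in terms:
        words = set(re.findall(r"[a-z]+", term.lower()))
        counts.update(words & predicates)  # a term may contribute several predicates
    return counts


print(predicate_counts(TERMS, PREDICATES))
```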

3.2 Example sentences

The main aim of the project was to have a corpus of biomedical text annotated according to FN guidelines. In order to achieve this we selected sentences from abstracts of biomedical articles from MEDLINE and annotated them for the frame transport. For each of the 127 LUs we tried to get at least 10 sentences. But that was not possible for all the LUs because not all of them are used extensively in the transport sense.

3.2.1 Finding example sentences

3.2.1.1 Search engine

To find example sentences from abstracts we used a search engine called GoPubMed [24]. This search engine differs from conventional ones because it uses GO and MeSH (Medical Subject Headings) terms to refine the results. When a keyword is used for searching, the search engine compares it to synonyms in the GO and MeSH terms. This reduces the number of results and increases their accuracy. It searches for the keywords among the abstracts available in PubMed. This search engine was very useful for extracting example sentences because it returns just the one sentence from the abstract that contains the keyword or a synonym of the keyword. It is also possible to restrict the search to any particular area by choosing an appropriate GO or MeSH term in a menu. This feature was very useful for some general terms that we wanted to restrict to the transport sense.

3.2.1.2 Method of searching

In order to search for relevant documents we mostly used the LU itself as a keyword. For example, for the LU delivery.n we used ’delivery’ as a keyword. The results of GoPubMed seem to be sorted by both relevance and the date the article was published, so the results could vary from day to day. We took the first 10 relevant sentences. Even if a single document had more than one sentence for a particular LU, we tried to take only one sentence from that document, because we wanted to collect sentences used in different contexts.

With some LUs like exit.n, migration.n, which are very general, we used the LU name followed by ’transport’ as a keyword to get a more relevant search result. This was done just to restrict the results to documents that used the LUs in terms of transport. ’Migration’ could be used in a wide variety of senses as could several words like ’progression’ and ’sorting’. This could be generalized to say that the superclass can be used to restrict the search results if the number of results for the predicate alone is too big and varied.
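The keyword scheme can be sketched as a small helper that takes one or more surface forms of an LU and optionally appends a restricting superclass term. The helper and its parameter names are hypothetical; the keywords were typed into GoPubMed by hand, and no GoPubMed programming interface is assumed here.

```python
def build_queries(forms, restrict_to=None):
    """Return one keyword query per surface form, optionally narrowed by a superclass term."""
    suffix = f" {restrict_to}" if restrict_to else ""
    return [form + suffix for form in forms]


print(build_queries(["delivery"]))                            # specific LU, used as is
print(build_queries(["migration"], restrict_to="transport"))  # general LU, narrowed
```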

If the LU is a verb like exit.v, the method varies a little. Since exit is the stem of the verb, GoPubMed does not return results for the different forms of the verb. So in order to cover the different forms, we changed the search terms accordingly. For exit.v we first used the keywords ’exit transport’, then ’exited transport’, ’exiting transport’ and ’exits transport’. This way all the forms of the verb are represented in the 10 sentences where applicable.

3.2.1.3 Factors influencing the choice

An important factor while looking for example sentences is to keep the definitions of the frame and the FEs in mind. Our frame only deals with transport events involving substances being transported from one part of the cell to another or one cell to another. This is important because the sentences must be appropriate for the frame. There are transport events in which the origin or the destination is a macromolecule. But we avoid such scenarios because the sub-ontology of transport does not have terms describing such transport events. Another thing is to be careful about the entity being transported. Entities like viruses and bacteria and whole cells must be avoided because they do not appear as entities in the GO terms in the transport sub-ontology. There is a whole other sub-ontology for the transport of cells. The other FEs are not so restrictive, so there is no need to worry about them while looking for sentences.

3.2.2 Annotation

The example sentences that we found were annotated by FN rules as specified in [15]. The FEs for this frame were explained in the previous section. All the FEs identified in the sentence with respect to the LU are marked. The core FEs must either be identified or be given an NI, except in the case of core sets. Verb targets can have three kinds of NIs: definite NI (DNI), constructional NI (CNI) or indefinite NI (INI). Nouns can have only one kind of NI, which is DNI, according to the book [15]. There are several LUs that can only have INI for the core FEs. They are usually LUs that come from GO terms with only one word. NIs do not have PTs and GFs assigned to them.

ET17. defecation:GO:0030421

In ET17, the entire transport event is limited to the word itself. The entity is inherent and the location or origin and destination are rarely mentioned in sentences with the word defecation. So they have to be noted as NIs.

Once the FEs have been assigned, the PT has to be determined. The PT mostly depends on the head of the FE. Some FEs tend to be of a particular PT; for example, most transport entities are noun phrases (NP) when the targets are verbs but are mostly prepositional phrases (PP) when the target is a noun. The FEs location, origin, destination and attribute can also be broken down into similar general patterns based on the target. But all the other FEs have a wide range of PTs and cannot be generalized easily. The LUs in our frame are only verbs and nouns for now. The GF assignment differs depending on the POS of the target and the placement of the FE in relation to the target. It is useful to see the syntactic roles that each semantic role plays in the sentence.

3.2.2.1 Some Examples

ES1. The potential for [Pacific hagfish, Eptatretus stoutii]_{Transport Destination, NP, Ext}, to absorb [amino acids]_{Transport Entity, NP, Obj} [from the environment]_{Transport Origin, PP[from], Dep} [across the skin and gill]_{Transport Path, PP[across], Dep} was thus investigated. PMID: 21367787

ES2. We conclude that, if the E1B 55kDa protein binds to RNA in infected cells [Transport Location, DNI] in the same manner as in in vitro assays, this activity is not required for such well established functions as induction of [selective]_{Transport Attribute, AJP, Dep} export [of viral late mRNAs]_{Transport Entity, PP[of], Dep}. PMID: 21605885

ES3. Control micturition studies were first performed using 28 awake Sprague-Dawley rats [Transport Origin, DNI] that were placed in metabolic cages for characterization of the frequency and mean and total volume voided over a 4-hr period [Transport Entity, CNI]. PMID: 9590473

The example sentences (ES) have the predicate in bold font, and the annotation is given in the subscript with the pattern [FE, PT, GF]. ES1 is a typical sentence that has most of the FEs and is not complicated to annotate. ES2 shows a case in which the core set is not explicitly expressed, so it is given an NI. ES3 has a predicate that does not occur with many FEs, and hence all the core FEs have to be marked as NIs. An NI does not have the PT and GF layers.


3.2.3 Valence pattern

The valency of a predicate is the number of slots that it has for its complements [25]. The term was borrowed from chemistry. The idea of valence was first applied only to verbs but has since been extended to all POS. The complements or arguments are usually considered in a semantic rather than a syntactic sense, and they can be optional or mandatory. This is similar to the idea behind frames and FEs in FN: the FEs fill the valence slots and can likewise be optional or mandatory. When all the possible slots of a particular LU are given, the result is called a valence pattern in FN. It includes the FEs together with their PTs and GFs. In this way FN captures the semantic roles through the FEs and the syntactic roles through the PTs and GFs. The PT of an FE gives the syntactic category of the phrase, and the GF is a variant of the syntactic role drawn from a limited set of names. For all the LUs that we annotated, we have given their valence patterns in the sentences that we chose. We discovered some recurring valence patterns for some of the LUs. However, the valence patterns of the biomedical LUs were markedly different from those of the same words in general English. This gives insight into how the usage of words differs between general English and biomedical text.
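Finding recurring valence patterns amounts to counting identical sets of [FE, PT, GF] triples across the annotated sentences of one LU. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the second and third sentences are invented placeholders rather than corpus data, and sorting the triples (so that the order of FEs in the sentence is ignored) is a design choice of this sketch, not a rule of FN.

```python
from collections import Counter


def valence_pattern(triples):
    """Order-independent pattern for one sentence: sorted (FE, PT, GF) triples."""
    return tuple(sorted(triples))


def recurring_patterns(sentences, min_count=2):
    """Return (pattern, count) pairs that occur at least min_count times."""
    counts = Counter(valence_pattern(s) for s in sentences)
    return [(p, n) for p, n in counts.most_common() if n >= min_count]


# First entry follows the annotation of ES1; the other two are placeholders.
absorb_sentences = [
    [("Transport Destination", "NP", "Ext"),
     ("Transport Entity", "NP", "Obj"),
     ("Transport Origin", "PP[from]", "Dep"),
     ("Transport Path", "PP[across]", "Dep")],
    [("Transport Entity", "NP", "Obj"),
     ("Transport Destination", "NP", "Ext")],
    [("Transport Destination", "NP", "Ext"),
     ("Transport Entity", "NP", "Obj")],
]

for pattern, count in recurring_patterns(absorb_sentences):
    print(count, pattern)
```

An NI can be included in the same scheme by placing the NI kind in the PT position and a placeholder string in the GF position, so that all triples remain comparable.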

3.3 Semantic type (ST)

FN sometimes assigns what is called a semantic type to FEs. This indicates what kinds of realizations can occur for that FE. It is also useful for relating FEs across frames, since the same concept sometimes appears under different FE names in different frames. We have been able to identify STs for most of the FEs in the transport frame, but some of them were too general to reduce to one ST. We used the UMLS semantic network as the upper ontology to denote our STs. An ST should be as specific as possible but general enough to hold for all the fillers of that FE.

We found two classes in the UMLS semantic network that could serve as STs, each of which represents more than one FE. TE, TP and TT share the ST Physical Object [T072]. Anatomical Structure [T017], a child of Physical Object, is the ST for the FEs TO, TDS, TL and TPL. This is consistent with the definitions of these FEs: the definitions of the origin, destination, location and place are more restrictive than those of the entity, path and transporter, so the ST of the former group is a child of that of the latter. The FE TC is very general, so there is no use in assigning it an ST. STs for the FEs direction and attribute were not found in the UMLS semantic network.
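The assignment described above is small enough to write down as a lookup table. This is a minimal sketch: the dictionary and function names are hypothetical, and only the one parent link stated in the text is recorded, not the full UMLS hierarchy.

```python
PHYSICAL_OBJECT = "Physical Object [T072]"
ANATOMICAL_STRUCTURE = "Anatomical Structure [T017]"

# Only the parent link mentioned in the text; the real network is larger.
ST_PARENT = {ANATOMICAL_STRUCTURE: PHYSICAL_OBJECT}

# FEs without an entry (TC, direction, attribute) have no ST assigned.
FE_TO_ST = {
    "TE": PHYSICAL_OBJECT,
    "TP": PHYSICAL_OBJECT,
    "TT": PHYSICAL_OBJECT,
    "TO": ANATOMICAL_STRUCTURE,
    "TDS": ANATOMICAL_STRUCTURE,
    "TL": ANATOMICAL_STRUCTURE,
    "TPL": ANATOMICAL_STRUCTURE,
}


def semantic_type(fe):
    """Return the ST of an FE abbreviation, or None if none was assigned."""
    return FE_TO_ST.get(fe)


def is_same_or_narrower(st, other):
    """True if st equals other or is a descendant of it in ST_PARENT."""
    while st is not None:
        if st == other:
            return True
        st = ST_PARENT.get(st)
    return False


# The ST of the origin is narrower than the ST of the entity.
print(is_same_or_narrower(semantic_type("TO"), semantic_type("TE")))
```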


Chapter 4 GUI annotation tool

4.1 General description

Figure 4.1: The annotation tool with the drop down menu with annotation options

We have developed a simple annotation tool suited for annotating biomedical text according to the BioOntoFN frames. It accepts .txt files as input and
