
Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-A--20/073--SE

Ontology-based information extraction from legacy surveillance reports of infectious diseases in animals and humans

Biniam Palaiologos

Examiner: Patrick Lambrix


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

More and more institutes and health agencies choose knowledge graphs over traditional relational databases to store semantic data. Knowledge graphs, using some form of ontology as a framework, can store domain-specific information and derive new knowledge using a reasoner. However, much of the data that must be moved to the graphs is either inside a relational database or inside a semi-structured report. While there has been much progress in developing tools that export data from relational databases to graphs, there is a lack of progress in semantic extraction from domain-specific unstructured texts. In this thesis, a system architecture is proposed for semantic extraction from semi-structured legacy surveillance reports of infectious diseases in animals and humans in Sweden. The results were mostly positive, since the system could identify 17 out of the 20 different types of relations.


Acknowledgments

First of all, I would like to thank the Swedish National Veterinary Institute for making this thesis work possible. I would also like to thank Patrick Lambrix, Fernanda Dórea, Simon Tim Jackman and Vincent Déhaye for their help and patience during my thesis work.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Swedish National Veterinary Institute
1.3 ORION
1.4 Aim
1.5 Research questions
1.6 Delimitations
1.7 Research methodology
1.8 Outline

2 Theory
2.1 Information Extraction
2.2 Ontologies
2.3 Ontology-Based Information Extraction
2.4 Metrics for Ontology-Based Information Extraction

3 Related Work
3.1 KIM
3.2 SOBA
3.3 Hierarchical, Perceptron-like Learning for Ontology-Based Information Extraction
3.4 OntoX
3.5 SPRAT
3.6 SPEED

4 Method
4.1 Architecture
4.2 Dataset
4.3 Preprocessing
4.4 Domain Ontology
4.5 GATE
4.6 Document Reset
4.7 English Tokenizer
4.8 Sentence Splitter
4.9 Part-Of-Speech Tagger
4.10 Morphological Analyser
4.11 OntoRoot Gazetteer
4.12 JAPE Transducers and Patterns
4.13 RDF Converter

5 Results & Discussion
5.1 Training Documents
5.2 Test Documents
5.3 Full Dataset
5.4 Results
5.5 Method
5.6 Future work
5.7 The work in a wider context

6 Conclusion


List of Figures

3.1 General architecture of an OBIE system
3.2 KIM Architecture
3.3 SPEED architecture
4.1 Proposed System Architecture
4.2 A page from the Campylobacteriosis chapter from the 2017 surveillance report
4.3 The target relations/object properties of the Surveillance Activity class
4.4 The target data properties of the Surveillance Activity class
4.5 English Tokenizer annotations
4.6 An annotated instance of the concept 'meat production'
4.7 JAPE pattern for number combination
4.8 JAPE pattern for annotating the number of units tested
4.9 JAPE pattern used for generating the SurveillanceActivity annotations
4.10 JAPE pattern used for generating the SamplingStrategy annotations
4.11 The JAPE pattern used for generating the hasTargetHost relation annotations
4.12 JAPE pattern used for generating the SAcontext annotations


List of Tables

5.1 Test metrics Precision, Recall and F1-Score on the training documents
5.2 Test metrics Precision, Recall and F1-Score on the testing documents
5.3 Test metrics Precision, Recall and F1-Score on the full dataset

1 Introduction

1.1 Motivation

One could argue that with the everlasting evolution of information systems, different kinds of industries and organizations have potential access to an abundance of data, and ways in which to process it. The need for data is even more critical in public health organizations, in order to support areas such as epidemiologic surveillance, health outcome assessment, program evaluation and performance measurement, public health planning, and policy analysis [50]. That is why in recent years many such organizations have created datasets and published them as Linked Open Data [19], in order to facilitate knowledge sharing, decision making and research. However, the amount of data that is available as Linked Open Data and follows well-defined standards such as RDF [28] is not yet at the desired levels. As a result, there is much work left to be done and many problems to overcome.

One of the most significant problems is that information in the healthcare domain is enormously complex, because it covers different types of data such as patient administration, organizational information, clinical data and laboratory/pathology data [46]. In addition, institutes, agencies and care givers in general may be unwilling to share health-related information, and even when they agree to share information, individual entities may have customized or vendor-driven software that is incompatible and thus not interoperable with other systems [24]. Researchers must also overcome the problem of incompatible ontologies. Ontologies, which are further presented in section 2.2, are used to encompass a formal representation of a concept and to define its properties and the relations between its properties and among other concepts. However, depending on the country and on the organization, the terminology used to describe various concepts and relations may differ, and this leads to ontologies that contain multiple representations of the same clinical concept [24]. This problem hinders semantic interoperability among institutions and researchers, which means that less data with unambiguous and shared meaning is available to be exchanged. Finally, legacy data that was gathered and stored in legacy systems prior to any national or multinational standardization agreement offers limited interoperability and needs to be properly extracted, transformed and migrated to knowledge graphs in order to be available for the crucial tasks of decision making and research. Knowledge graphs are intelligent systems organised as graphs that integrate information into ontologies and usually apply a reasoner, a piece of software able to infer logical consequences from a set of asserted facts or axioms, allowing implicit information to be derived from explicitly asserted data, that is, deriving new knowledge. These problems are multiplied by the fact that many countries and institutions have demonstrated a resistance to change. The healthcare industry still relies in many cases on piles of paper reports and records which need to be digitized, a task that can be time consuming and costly. The reasons for this resistance to change, as presented in [25], are the following:

1. Large number of physicians in individual or small group practices with very limited administrative support for IT and related practice changes.

2. The lack of uniformity and interoperability of IT systems from different vendors.

3. Regulatory limitations on hospital funding of IT for physicians.

4. Lack of trust and other legal concerns with respect to joint IT solutions.

5. Privacy and security concerns.

All the aforementioned challenges that institutions and researchers need to overcome in order to use the available information efficiently and promote data interoperability can serve as a valid motivation for every researcher in different fields of academia or industry. However, it is not only us humans that can benefit from the exchange of knowledge and the availability of data among researchers and institutions, but the animals as well. Animal health surveillance is a prime example of a field where data interoperability can help veterinarians and epidemiologists gather and exchange surveillance data from wild animals and livestock in order to prevent infections or infestations, and to detect exotic or emerging diseases among animal populations as early as possible. An example of how this can be achieved is the Animal Health Surveillance Ontology (AHSO) [15], which seeks to facilitate the development of smart systems for data-driven disease surveillance and early disease detection by introducing a framework which describes how knowledge from different sources can be incorporated and how already existing ontologies can be integrated into the proposed ontology. Another initiative in which the Swedish National Veterinary Institute1 took part is the One health suRveillance Initiative on harmOnization of data collection and interpretatioN (ORION) project, which aims to contribute to the critical need for data interoperability in animal and human healthcare. The project is divided into three distinct work packages and this thesis aims to cover parts of the third package, which focuses on forming the basis for successful harmonisation and integration of surveillance data and methods. The findings of this thesis will be used to showcase whether an ontology-based information extraction system could efficiently address the problem of extracting legacy data from surveillance reports of infectious diseases.

1.2 Swedish National Veterinary Institute

The Swedish National Veterinary Institute is an authority under the Swedish Government Offices with the task of providing expert advice and service to public agencies and individuals in the area of veterinary medicine. The Institute promotes animal and human health, Swedish stock farming and the Swedish environment through diagnostic services, research, preparedness and advisory services. The Institute collaborates with both national and international universities, research institutes, organizations and companies. The Swedish National Veterinary Institute also participates in numerous European Union projects and is a member of various European networks and associations.

1 https://www.sva.se/en/

1.3 ORION

The ORION project, launched in 2018, aims to establish and strengthen inter-institutional collaboration and transdisciplinary knowledge transfer in the area of surveillance data integration and interpretation, in accordance with the One Health3 objective of improving health and well-being. The project is funded by the European Union's Horizon 2020 research and innovation programme4 and includes 13 veterinary and public health institutes from 7 European countries. In order to achieve its goals, the project is divided into three parts. The first is responsible for developing a high-level framework for harmonised, cross-sectional description and categorisation of surveillance data covering all surveillance phases and all knowledge types. The second is focused on creating a cross-domain inventory of currently available data sources, methods, algorithms and tools that support One Health surveillance data generation, data analysis, modelling and decision support. Finally, the last part of the project forms the basis for successful harmonization and integration of surveillance data and methods.

1.4 Aim

The aim of the thesis work is to collect legacy surveillance reports publicly available from the Swedish National Veterinary Institute, extract the required semantic data and convert it into RDF format by developing an Ontology-Based Information Extraction (OBIE) pipeline [53].

1.5 Research questions

The research questions to be answered are the following:

1. How can an ontology-based information extraction pipeline be created that extracts all the required information, given the ontology provided by the SVA? This research question assumes that the ontology provided by the SVA is stable and that no modifications are required.

2. How well does the pipeline perform with regard to correctly detecting semantic instances and relationships? To answer this research question, the performance metrics recall, precision and F-measure shall be used.

1.6 Delimitations

This thesis will focus on extracting semantic data from the campylobacteriosis chapter of the annual surveillance reports. The reason is that each report contains many different chapters, and creating a system at such a large scale would be prohibitively time consuming. An additional delimitation, besides the time constraint, is that the ontology at present covers the necessary concepts for only two chapters. Nevertheless, the work done to cover the needs of the first chapter will lay the foundations for extracting concepts related to other chapters as well.

Campylobacteriosis is a zoonosis, a disease which can be transmitted to humans from animals or animal products. While it usually causes mild symptoms, it can be fatal among young children, the elderly or immunosuppressed individuals. Basic food hygiene practices should be enough to prevent infections.

3 https://www.who.int/news-room/q-a-detail/one-health
4 https://ec.europa.eu/programmes/horizon2020/en

1.7 Research methodology

The research methods used in this thesis were: a literature study, implementation, evaluation and analysis. The literature study was used to get an understanding of previous research done in the field as well as to examine different proposed OBIE architectures. It also helped with finding relevant theory which was used to motivate the chosen system architecture and the general methodology that the thesis follows. The keywords used to perform the literature study pertained mainly to: semantic-based information extraction, relation extraction, named entity recognition, ontologies and pattern-based information extraction. These keywords were used to find relevant articles and research papers in online libraries such as Google Scholar, the online library of Linköping University and the online library of the Institute of Electrical and Electronics Engineers. The implementation of the proposed OBIE architecture was done using the open source framework called General Architecture for Text Engineering (GATE), which numerous relevant research papers have used as the backbone of their implementations. The quality of the system was ensured by analyzing the performance of the patterns used to extract textual information. No automated tests were used, since it is not clear how to test patterns based on the Java Annotation Patterns Engine (JAPE) grammar using conventional testing principles. The work in this thesis was done in an iterative fashion. Initially, the data was provided by SVA in PDF format. The relevant chapters from the reports were extracted and were split into training and testing datasets. For every new pattern, the training set was used to observe its performance and make the necessary changes. This was repeated until the results were satisfactory for every pattern. Finally, the testing set was used to test the performance of the system on unseen data.

1.8 Outline

The outline of this thesis is as follows. In chapter 2, theory about Information Extraction, Ontologies, Ontology-Based Information Extraction and metrics for OBIE systems is presented. This introduces the reader to the theory needed for the methods used in this thesis. Chapter 3 contains a summary of related work done in the field of OBIE. Chapter 4 presents the methods used in this thesis as well as the system's architecture and its constituent parts. Chapter 5 contains the results obtained when using the system presented in the previous chapter and a discussion about the results and the methods selected. Finally, chapter 6 concludes the work done in this thesis and answers the research questions.

2 Theory

The theory provided in this chapter aims to familiarize the reader with the Information Extraction (IE) field and the recently emerged subfield of Ontology-Based Information Extraction (OBIE), as well as to present the evaluation metrics that will be used.

2.1 Information Extraction

Plainly put, information extraction is a task that extracts structured data from unstructured or semi-structured text. More specifically, it can be described as a task that aims to process natural language text and to retrieve occurrences of a particular class of objects or events and occurrences of relationships among them [45]. Riloff presented a similar viewpoint in [44], stating that information extraction is a form of natural language processing in which certain types of information must be recognized and extracted from text.

There are many examples of IE in our everyday lives, with some being simple and others more complex. One example of such a system is Google Calendar, as it extracts information from mail and automatically creates entries in the calendar. Another more complex example could be a system that reads news articles and extracts information about specific persons, companies, and the relationships between them.

Since information extraction can be applied to almost any textual source, it can successfully solve challenges regarding knowledge discovery, content management, and decision making. Some areas where such systems are used extensively are the following.

• Pharmaceutical research for discovering new drugs and adverse effects [1][48].
• Business intelligence for gathering information from different sources [5][47].
• Financial investigation for discovering hidden relationships between companies [9].
• Media monitoring for mentions of brands, companies and people [2][23].

It is evident that IE has become over the years a necessary tool for numerous industries, since it offers a diverse set of applications. IE has evolved significantly over the last two decades and its results can only be enhanced when combined with Natural Language Processing (NLP). Nowadays, every IE system employs some kind of NLP functionality and it seems that an affinity exists between the two fields. This is better explained in [10], where the author states that IE is attractive for NLP for the following reasons:

1. The extraction tasks are well defined.

2. IE uses real-world texts.

3. IE offers difficult and interesting NLP problems.

4. IE performance can be compared to human performance on the same task.

However, it appears that IE systems act as the middleman between Information Retrieval (IR) systems and text understanding systems. As stated by Russell and Norvig, information extraction lies mid-way between Information Retrieval systems, which merely find documents that are related to the user's requirements, and text understanding systems (sometimes referred to as text parsers) that attempt to analyze text and extract its semantic contents [45]. Given that the difficulty associated with information extraction systems lies between these two categories, their success has also been somewhere between the levels achieved by information retrieval systems and text understanding systems [53].

2.2 Ontologies

Ontologies are becoming more and more common in organizations, public agencies and industry. They are primarily used in the Semantic Web, Artificial Intelligence, Bioinformatics, Software Engineering and Information Architecture as a form of knowledge representation about the world or some part of it [31]. The term ontology originally comes from a branch of philosophy called metaphysics, which deals with the nature of being. Metaphysics addresses questions such as "what exists?" and "what is the nature of reality?". In order to answer these questions, existence is explored and modeled through properties, entities and relations.

In the field of computer science, an ontology is an artifact which is responsible for the conceptualization of knowledge. Various research papers have defined the term in different ways. Gruber in [17] defines an ontology as an explicit specification of a conceptualization, while in [51] an ontology is defined as a hierarchically structured set of terms for describing a domain that can be used as a skeletal foundation for a knowledge base. Ontologies consist of four main components: concepts, instances, relations and axioms.

Concepts represent sets of objects in a domain and are organized in a taxonomy. Concepts can be organized in hierarchies by using specialization relations (is-a) or partitive relations (is-part-of). For example, immune response is-a defense response and the brain is-part-of the central nervous system. Relations describe the possible interactions between concepts or a concept's properties [49]. Relations can also be organized in a taxonomy where lower levels of the hierarchy represent a more informative relation. For example, a relation hasAddress can be subdivided into hasCityName, hasStreetName and hasPostalCode. Instances, or individuals, are instantiated concepts. For example, Sweden is an instance of the concept Country. However, the norm is to keep the ontology free of instances, for its purpose is to be a schema or a conceptualization of a domain. When an ontology is combined with its associated instances, we get a knowledge base. Axioms are the assertions in a logical form that together comprise the overall theory that the ontology describes in its domain of application [14]. A more informal description of axioms is that they are the facts that are always true. These facts describe the theory of the domain that the ontology tries to conceptualize.
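These components map naturally onto RDF/OWL constructs. As a minimal sketch in Turtle, mirroring the examples above (the ex: namespace and all identifiers below are hypothetical, chosen purely for illustration):

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/demo#> .   # placeholder namespace

    # Concepts organized with a specialization (is-a) relation
    ex:ImmuneResponse rdfs:subClassOf ex:DefenseResponse .

    # A relation taxonomy: a more informative relation below a general one
    ex:hasCityName rdfs:subPropertyOf ex:hasAddress .

    # An instance of a concept
    ex:Sweden rdf:type ex:Country .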

The main purpose of an ontology is to provide a common vocabulary among researchers and software agents, which results in increased interoperability among systems. In addition, ontologies can be reused and shared over different platforms, which facilitates the exchange and reuse of knowledge. Ontologies can also function as a basis for integration of information sources or as a query model for information sources [27]. In conclusion, ontologies lead to a better understanding of a field and to more effective and efficient handling of information in that field [27]. An example of an ontology which has been publicly available since 2005 is the Friend Of A Friend (FOAF)1 ontology, which describes people and social relationships on the web. The ontology itself is relatively small, with 19 classes, 44 object properties and 27 datatype properties; nevertheless, 11 active social networking websites have used FOAF so far [16]. Another notable example of an ontology is the Gene Ontology (GO)2. As stated by the Gene Ontology Consortium, the goal of the GO project is "(i) to develop a set of controlled, structured vocabularies—known as ontologies—to describe key domains of molecular biology, including gene product attributes and biological sequences; (ii) to apply GO terms in the annotation of sequences, genes or gene products in biological databases; and (iii) to provide a centralized public resource allowing universal access to the ontologies, annotation data sets and software tools developed for use with GO data." [8]. Nowadays, the ontology has over 44 thousand terms, over 8 million annotations and is still evolving. An example of an ontology that deals with information pertaining to music is the Music Ontology [42]3, which provides a model for dealing with music-related information on the semantic web. While the ontology cannot describe musical attributes such as tunes, notes and rhythms, it can describe business-related information such as artists, live events and albums. Nowadays there are also ontology lookup services, such as the EMBL-EBI4 or the bio.tools5 ontology lookup service, which function as repositories for ontologies and provide an access point for anyone who is interested in browsing or searching for an ontology. These lookup services occasionally provide web services which offer cross-ontology mappings between terms from different ontologies, assistance in mapping data to ontologies and even tools for building ontologies from spreadsheets or tabular forms of data.

1 http://xmlns.com/foaf/spec/
2 http://www.geneontology.org/
3 http://musicontology.com/
4 https://www.ebi.ac.uk/ols/index
5 https://bio.tools/ols

2.3 Ontology-Based Information Extraction

Ontology-Based Information Extraction is a task performed by a system which processes unstructured or semi-structured natural language text through a mechanism guided by ontologies, to extract certain types of information, the output of which is generally presented through an ontology [53]. Put more simply, an OBIE system guides IE algorithms and methods by using an ontology to extract the desired information. OBIE emerged as a subfield of Information Extraction little more than a decade ago, but there has been research on the topic since 1999, with the work of Hwang on the dynamic construction of ontologies from textual data [22].

OBIE systems can be used for Ontology Learning or for general Information Extraction. Ontology Learning's goal is to generate an ontology from a textual resource. The term was originally used in [30] and then defined by Cimiano [6] as the acquisition of a domain model from data. There has been extensive research on this topic and different algorithms and architectures have been proposed. However, Ontology Learning is not within the scope of this thesis and will therefore not be further discussed.

An OBIE system for general Information Extraction receives as input a document (PDF, text, HTML, XML) or a corpus of documents, which is then processed using NLP algorithms and methods (sentence splitting, POS tagging). Afterwards, the Information Extraction process begins, which can be rule-based, Machine Learning-based, or a hybrid of the two. The ontologies in the Information Extraction process are used for detecting synonyms, co-references, concept relations, concept properties and relations between concepts. The output of such a system can be RDF data, semantic annotations or filled templates.

Different approaches that make use of OBIE systems have been published over the years, and some of them are presented to the reader in chapter 3. The NLP methods used in this thesis are further explained and presented in chapter 4.

2.4 Metrics for Ontology-Based Information Extraction

IE and OBIE systems are traditionally evaluated by using Precision, Recall and F-measure metrics.

Precision

Precision, also called positive predictive value, measures the number of correct positive predictions as a percentage of the total number of positive predictions. In other words, it gives us the percentage of correct positive predictions out of all the positive predictions made by the system.

Precision = TruePositive / (TruePositive + FalsePositive)    (2.1)

Recall

Recall, or Sensitivity, measures the number of correct positive predictions as a percentage of the total number of actual positive items, that is, the correct positive predictions plus the missed positives. In other words, it measures how many of the items that are actual positives are correctly identified by the system.

Recall = TruePositive / (TruePositive + FalseNegative)    (2.2)

F-measure

The F-measure [43] is used in conjunction with Precision and Recall, as a weighted harmonic mean of the two [34].

F-measure = ((β² + 1) · Precision · Recall) / (β² · Precision + Recall)    (2.3)

where β reflects the weighting of Precision versus Recall. If β is set to 1, the two are weighted equally and the metric can be referred to as the F1-Score. With β set to 0.5, Precision is weighted twice as much as Recall, and with β set to 2, Recall is weighted twice as much as Precision. The purpose of this metric is to seek a balance between Recall and Precision. This thesis assumes that β = 1, and therefore this metric will be referred to as the F1-Score in chapter 5.
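As a worked example with hypothetical counts: suppose the system produces 10 annotations, of which 8 are correct (true positives) and 2 are spurious (false positives), while 4 gold-standard annotations are missed (false negatives). Then:

    Precision = 8 / (8 + 2) = 0.80
    Recall    = 8 / (8 + 4) ≈ 0.67
    F1-Score  = (2 · 0.80 · 0.67) / (0.80 + 0.67) ≈ 0.73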

3 Related Work

OBIE systems differ from one another in their implementation details. However, these systems follow a common architecture when observed from a higher level. Figure 3.1 shows how domain experts and users can interact with an OBIE system. The architecture of the system can be seen within the dotted lines and shows how the different components of an OBIE system collaborate in order to extract information from the textual input. Initially, the system accepts a text input which is processed inside the preprocessor module. This module converts the input format to text or to any other format that the information extraction module requires. Examples of such preprocessing would be the removal of HTML tags from the document or text extraction from a PDF document.

The actual information extraction takes place inside the information extraction module, which is guided by an ontology and sometimes a semantic lexicon component. Semantic lexicons are digital dictionaries of words labeled with semantic classes, which are used to draw associations between words that have not previously been encountered. A notable semantic lexicon is WordNet [35]1, which links words into semantic relations including synonyms, hyponyms, and meronyms in more than 200 languages. Some systems may incorporate an ontology editor or even an ontology generator component which initially generates the ontology used for the information extraction process. However, these two components are not strictly necessary, since there are tools such as the Protégé [37]2 development platform that facilitate the creation and maintenance of an ontology.

The output of an OBIE system is information which was extracted from the initial input. The information can be represented in a number of formats, such as XML, text, or RDF, and can potentially be stored in a database or a knowledge graph where the user can access it through a query answering system. The architecture described here does not have to be applied in every system, since some components can be removed depending on the desired outcome and the desired system design. The designer of such a system can also add components that are not described by this architecture. Having said that, some OBIE systems will be presented and analyzed in this chapter.

1 https://wordnet.princeton.edu/
2 https://protege.stanford.edu/


Figure 3.1: General architecture of an OBIE system as presented in [53]

3.1 KIM

The KIM [41] platform provides a Knowledge and Information Management framework, services for semantic annotation, indexing and retrieval of documents, and an infrastructure for information extraction. KIM can accept and analyze a great variety of document types as input. The ontology used for the OBIE process also needs to be provided by the user. The IE methods that the system deploys are based on linguistic rules and a set of lists containing names of entities, known as gazetteer lists. The ontological components that can be extracted by the KIM platform are instances and property values.

Figure 3.2: The KIM Architecture as presented in [40].

3.2 SOBA

The SOBA [4] system is specialized in ontology-based information extraction from soccer web pages for the automatic population of a knowledge base, which is in turn used for domain-specific question answering. SOBA, in contrast to other OBIE systems, can only receive HTML documents as input. The system contains an in-house web crawler, which facilitates the automatic creation of a soccer corpus. The interesting part of this process is that the corpus can contain not only text, but also images found in the original documents. For its information extraction component, it uses linguistic rules, gazetteer lists and also analyzes tags found in the HTML documents. The extracted information is then inserted into the built-in knowledge base, which updates the previously stored facts.

3.3 Hierarchical, Perceptron-like Learning for Ontology-Based Information Extraction

Yaoyong Li and Kalina Bontcheva proposed an OBIE system in [29] which uses a modified version of the original Hieron batch learning algorithm [13]. This approach trains two hierarchical classifiers: one is used for recognising the start token of class instances and one for the end. To deal with the irrelevant tokens in the text, they extend the ontology with a new child node which represents the concept of a non-relevant token. The corpus used in the study consisted of news articles that were annotated manually according to the Proton ontology [52]. The preprocessing of the corpus was done using the open source system called A Nearly-New Information Extraction System (ANNIE), which is part of the GATE [11] framework. The proposed system can only detect concepts, whereas the aforementioned systems can also detect properties.

3.4 OntoX

OntoX [55] is an OBIE system that extracts information from text sources by implementing a rule generation technique. OntoX has an additional functionality which allows the user to detect outdated structures in the ontology. The system's architecture is divided into three modules. The first module, called the Ontology Management Module, receives an ontology and uses it to determine which parts of the input text are relevant to the IE process. The output of the first module is then passed to the second module, called the Rule Generation Module, wherein rules are automatically generated in the form of regular expressions. Finally, the regular expressions obtained are given to the Extraction Module, which applies them to the input text and detects instances and datatype property values.

3.5 SPRAT

SPRAT [33], which stands for Semantic Pattern Recognition and Annotation Tool, is yet another example of an OBIE system which is capable of ontology population. The system's architecture is divided into two parts. The first part is responsible for text preprocessing and is based on the processing resources provided by the GATE framework. The preprocessing is performed using shallow NLP techniques such as tokenization, part-of-speech tagging and morphological analysis. Tokenization is usually the first preprocessing task performed in every system and separates a piece of text into smaller units called tokens. Part-of-speech tagging usually occurs after the tokenization task and marks a token/word in a text with a particular part of speech, based on its definition and context. Finally, morphological analysis is the task responsible for analyzing the structure of words and parts of words, such as stems, root words, prefixes, and suffixes. The second part is where the information is extracted. For this purpose, Hearst patterns [18] as well as lexico-syntactic and contextual patterns are implemented as JAPE [12] rules. The information extracted from those patterns combined is used for populating a given ontology or creating one from scratch.

3.6 SPEED

SPEED [21], which stands for Semantics-Based Pipeline for Economic Event Detection, is an OBIE system proposed in 2013 which extracts economic events from news articles and updates a knowledge base in real time. Its architecture is based on the GATE framework. The system's preprocessing module is similar to the one used in SPRAT, with the only addition being the usage of a Word Sense Disambiguator, which identifies which sense of a word is used in a sentence and which is based on an adapted Structural Semantic Interconnections algorithm [36]. After the sense of the words has been disambiguated, the system detects economic events by using a gazetteer and validates them by using lexico-semantic patterns implemented as JAPE rules. Finally, the results can be used to update the knowledge base through the Ontology Instantiator module.

4 Method

This chapter introduces the implemented system architecture that tackles the research questions. It begins with an overview of the architecture and the ontology. Important information is also provided about the dataset and the processing resources used. The theory and the related work presented in chapters 2 and 3 have introduced the reader to the general field of Information Extraction, Ontology-Based Information Extraction and applications related to these fields. Those applications have successfully extracted semantic information from human language documents; however, there is currently no system that extracts information related to diseases based on the Health Surveillance Ontology. In order to be able to extract specific concepts from the surveillance reports, a new system is proposed that is based on the GATE framework for its shallow NLP preprocessing, with customized lexico-syntactic rules implemented to extract concepts and relations, and a custom module that transforms the exported XML documents into OWL/RDF triples. The system's architecture has similarities with two of the systems presented in chapter 3, namely the SPRAT system in section 3.5 and the SPEED system in section 3.6. It is also evident that both the aforementioned systems and the proposed one follow the general architecture shown in figure 3.1.

4.1 Architecture

Different methodologies exist for extracting relations from textual sources, and each methodology influences the architecture of the system by requiring a different approach to preprocessing the input text and sometimes different types of components to be implemented. There are four general approaches to the topic of relation extraction: hand-written patterns, and supervised, semi-supervised and unsupervised machine learning. This thesis uses hand-written patterns, which are lexico-syntactic rules that extract relations that often hold between named entities. They usually achieve high precision and they can be tailored to specific domains. The reason that this methodology was chosen instead of any other was the need for high precision and accuracy. In addition, the OBIE system had to adhere to specific requirements when it came to named entities, since big chunks of text should be considered as one entity, which is better handled with the use of patterns. Finally, one other issue that influenced the final choice of methodology was the limited amount of training data, which meant that approaches such as unsupervised or (semi-)supervised machine learning would not be as effective. The components that were chosen to build the system, such as gazetteers and tokenizers, were already available with the GATE framework. While each component could potentially be replaced by developing another one from scratch, the time scope of this thesis was not enough to develop each component. The suggested system architecture that produces the RDF/OWL data is the one shown in figure 4.1. Initially, the surveillance reports need to be edited by a human supervisor in order to provide some basic restructuring and spell checking. Afterwards, the edited text and the ontology are given as input to the pipeline that is responsible for annotating the ontological instances. Additional information is then added from the ontology to the annotations' features. The annotations are processed again by applying JAPE patterns, which results in the relationships and the data properties being generated as text annotations. Finally, these annotations are exported to an XML file and are processed by the RDF converter module, which produces the OWL/RDF data. The shallow NLP is done with the assistance of the GATE processing resources, which are stacked sequentially to form a pipeline. The list of these resources consists of a Document Resetter, English Tokeniser, Sentence Splitter, POS Tagger, Morphological Analyser, OntoRoot Gazetteer and a Language Identifier. Each of these resources adds its own annotations over the text, which are finally used by the JAPE rules.

Figure 4.1: Proposed System Architecture
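To make the stacking of processing resources concrete, the sketch below shows how such a pipeline could be assembled with the GATE Embedded Java API. It is a minimal illustration, not the thesis's actual code: it assumes GATE 8.x with directory-based plugins installed under the GATE plugins home (the ANNIE and Tools plugins), and the document path is a placeholder. The OntoRoot Gazetteer and the JAPE transducers would be added to the same controller in the same way, given their own plugins and initialisation parameters (the ontology URL, the grammar files).

    import gate.Corpus;
    import gate.Document;
    import gate.Factory;
    import gate.Gate;
    import gate.ProcessingResource;
    import gate.creole.SerialAnalyserController;
    import java.io.File;
    import java.net.URL;

    public class PipelineSketch {
        public static void main(String[] args) throws Exception {
            Gate.init(); // initialise the GATE library (GATE home must be configured)
            // Register the plugins that provide the processing resources below
            Gate.getCreoleRegister().registerDirectories(
                    new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());
            Gate.getCreoleRegister().registerDirectories(
                    new File(Gate.getPluginsHome(), "Tools").toURI().toURL());

            SerialAnalyserController pipeline = (SerialAnalyserController)
                    Factory.createResource("gate.creole.SerialAnalyserController");

            // Stack the resources in the order described in section 4.1
            String[] prClasses = {
                    "gate.creole.annotdelete.AnnotationDeletePR", // Document Reset
                    "gate.creole.tokeniser.DefaultTokeniser",     // English Tokeniser
                    "gate.creole.splitter.SentenceSplitter",      // Sentence Splitter
                    "gate.creole.POSTagger",                      // POS Tagger
                    "gate.creole.morph.Morph"                     // Morphological Analyser
            };
            for (String prClass : prClasses) {
                pipeline.add((ProcessingResource) Factory.createResource(prClass));
            }

            // Run the pipeline over a corpus containing one report chapter
            Corpus corpus = Factory.newCorpus("reports");
            Document doc = Factory.newDocument(new URL("file:///path/to/chapter.txt"));
            corpus.add(doc);
            pipeline.setCorpus(corpus);
            pipeline.execute(); // the annotation sets are now attached to the document
        }
    }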

4.2 Dataset

The dataset consists of annual surveillance reports of infectious diseases in animals and humans in Sweden and is provided by the Swedish National Veterinary Institute. The reports have been published in their current format since 2006. However, the thesis focuses only on campylobacteriosis, a disease that can infect both animals and humans. All the surveillance reports are public and can be accessed from the SVA webpage.

4.3 Preprocessing

As stated above, each report is composed of chapters, each of which contains surveillance information about a specific disease. An example of a page that contains the target information is shown in figure 4.2. The information that needs to be extracted is contained in the surveillance and results sections of the chapter. A human editor is needed to extract the desired chapter and save it in a text file. Furthermore, these sections contain subsections dedicated to animal, human and food surveillance activities and results. Because of this specific layout, the complexity and difficulty of the information extraction process is increased, and therefore the human editor needs to combine the surveillance and results sections for each subsection. In this way the desired surveillance activity instances can be annotated in a more precise and straightforward way. Finally, the text files are split into training and test files following the 70/30 principle. The layout of the surveillance reports will change in the future to simplify the described process and eliminate the need for a human editor.

Figure 4.2: A page from the Campylobacteriosis chapter from the 2017 surveillance report that contains surveillance activity information.


Figure 4.3: The target relations/object properties of the Surveillance Activity class.

Figure 4.4: The target data properties of the Surveillance Activity class.

4.4 Domain Ontology

The domain ontology used in the proposed system is the Health Surveillance Ontology (HSO). HSO is still in the early stages of development and is part of an ongoing project managed by the Swedish National Veterinary Institute. The ontology's goal is to model the animal and human health information obtained from a single observation made at a specific point in time. The ontology was primarily designed to support epidemiologists with decision making, inference generation on the recorded data and knowledge discovery, and ultimately to promote data interoperability. At present, HSO counts 411 classes, 464 individuals and 108 properties. However, this thesis focuses on extracting the concept of surveillance activity, its relations with other concepts, shown in figure 4.3, and its associated properties, shown in figure 4.4. The HSO ontology is available on GitHub2 and on NCBO BioPortal3.
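For orientation, the kind of modelling this implies can be sketched in Turtle. The class and property names below are taken from the figures referenced above, but the namespace and the exact axioms are assumptions, not the actual HSO definitions:

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix hso:  <http://example.org/hso#> .   # placeholder namespace

    hso:SurveillanceActivity a owl:Class .
    hso:SurveillanceContext  a owl:Class .

    # Object properties linking a surveillance activity to related concepts
    hso:hasTargetHost a owl:ObjectProperty ;
        rdfs:domain hso:SurveillanceActivity .
    hso:SAcontext a owl:ObjectProperty ;
        rdfs:domain hso:SurveillanceActivity ;
        rdfs:range  hso:SurveillanceContext .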

4.5 GATE

General Architecture for Text Engineering (GATE) [11] is a Java-based, open source framework4 for Natural Language Processing, Computational Linguistics and Language Engineering, originally developed at the University of Sheffield. The GATE architecture is based on components which offer well-defined interfaces and may be deployed in a variety of contexts. GATE components are of three types: Language Resources (LR), which represent entities such as lexicons, corpora or ontologies; Visual Resources (VR), which represent visualisation and editing components; and Processing Resources (PR), which represent algorithmic entities such as parsers and generators. A processing resource performs a single task over a corpus that pertains to creating and manipulating annotations on documents. Processing resources can be combined into applications, commonly known as pipelines.

4.6 Document Reset

The Document Reset resource gives the user the ability to decide whether the annotation sets should be removed along with their contents. The proposed system sets a boolean parameter called keepOriginalMarkupsAS to true, which allows the original markup annotations, such as the paragraphs and the headings, to be preserved during runtime. This resource is normally added at the beginning of an application, so that all previous annotations are reset before the Information Extraction process is initiated.
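In GATE Embedded terms, this corresponds to setting a parameter on the PR. A minimal sketch is shown below; the parameter name comes from the text above, while the PR class name is an assumption based on GATE's documentation for Document Reset:

    import gate.Factory;
    import gate.ProcessingResource;

    public class DocumentResetSketch {
        // Assumes Gate.init() has run and the ANNIE plugin is registered,
        // as in the pipeline sketch of section 4.1.
        public static ProcessingResource createDocumentReset() throws Exception {
            ProcessingResource docReset = (ProcessingResource) Factory.createResource(
                    "gate.creole.annotdelete.AnnotationDeletePR"); // Document Reset PR
            // Preserve the original markup annotations (paragraphs, headings)
            docReset.setParameterValue("keepOriginalMarkupsAS", Boolean.TRUE);
            return docReset;
        }
    }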

2 https://github.com/SVA-SE/HSO
3 http://bioportal.bioontology.org/ontologies/HSO
4 https://gate.ac.uk/

4.7 English Tokenizer

In order for the system to detect concepts and relations, an English Tokeniser is required to split the text into different types of tokens. The types of tokens that this processing resource is able to produce are word, number, symbol, punctuation and space tokens. The tokenizer puts all the tokens in an annotation set called Token, except for the space tokens, which are put into their own annotation set called SpaceToken. An example of the produced tokens can be seen in figure 4.5.

Figure 4.5: English Tokenizer annotations.

4.8 Sentence Splitter

The sentence splitter is a processing resource which splits paragraphs into sentences. This module is composed of a cascade of finite-state transducers and is required by the POS tagger. The splitter distinguishes sentence-marking full stops by using a gazetteer list of abbreviations. The sentence splitter produces an annotation set called Sentence.

4.9 Part-Of-Speech Tagger

The Part-Of-Speech (POS) Tagger is the processing resource which receives the tokens generated by the tokenizer and, with the assistance of the sentence splitter, assigns part-of-speech tags to them (noun, verb, adjective, etc.). The POS tagger provided by GATE is based on an implementation of the Brill transformation-based tagger [3], previously known as the Hepple Tagger [20]. It is trained on the Wall Street Journal corpus [38] and uses the Penn Treebank tagset [32]. The outcome of the POS Tagger is not a new annotation set but an annotation feature added to the Token annotations.

4.10 Morphological Analyser

The Morphological Analyser is the module that reduces word forms to their canonical form. Each word can have a variety of forms, and each form may describe the same concept but from a different perspective. For example, the noun 'diseases' is a form of the noun 'disease' which describes the concept of a disease in the plural. The Morphological Analyser in this case receives the token, which has already been enriched with part-of-speech information, and creates two new annotation features, "root=disease" and "suffix=s". This is an important component of the pipeline, since the OntoRoot Gazetteer relies heavily on the root features of the tokens to search through the ontology.
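Downstream patterns can then match on the canonical root rather than the surface form. A minimal illustrative JAPE fragment follows; the phase, rule and output annotation names are hypothetical:

    Phase: RootMatchSketch
    Input: Token
    Options: control = appelt

    Rule: DiseaseRoot
    (
      {Token.root == "disease"}   // matches both "disease" and "diseases"
    ):mention
    -->
    :mention.DiseaseMention = { rule = "DiseaseRoot" }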

4.11 OntoRoot Gazetteer

OntoRoot Gazetteer is a processing resource which dynamically creates an ontology-based gazetteer. This gazetteer is then used by the same resource to generate ontology-based annotations over the given document. To produce annotations that link to specific concepts or relations from the ontology, it requires an ontology and a small pipeline of processing resources consisting of an English Tokeniser, a POS tagger and a Morphological Analyser. The output of the OntoRoot Gazetteer is a new annotation set called Lookup, wherein each token is enhanced with ontological information. An example of such a token can be seen in figure 4.6.

Figure 4.6: An annotated instance of the concept ’meat production’.

4.12 JAPE Transducers and Patterns

JAPE, which stands for Java Annotation Patterns Engine, is a finite-state transducer over annotations, based on regular expressions and provided by the GATE framework. This means that JAPE allows the recognition of regular expressions over annotations in documents. JAPE uses a grammar which contains a set of phases that are executed sequentially during runtime. Each of these phases consists of a set of pattern/action rules. The left-hand side (LHS) of a rule describes the pattern to be detected in the text, while the right-hand side (RHS) consists of statements which describe what kind of actions should be performed on the patterns described on the LHS. There are three ways in which patterns can be specified: by specifying a string of text, by specifying the attributes and values of annotations, and finally, by specifying annotation types from gazetteers. JAPE patterns offer the ability to label parts of the patterns for easier manipulation on the RHS, macros for simplifying repetitive tasks, and the use of priorities, which enables the user to control which pattern should operate first on a given text segment.
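To illustrate the rule anatomy, here is a minimal, hypothetical JAPE rule of the kind described above. The annotation and feature names are illustrative assumptions; the thesis's actual rules are the ones shown in figures 4.7 to 4.12:

    Phase: AnatomySketch
    Input: Lookup
    Options: control = appelt

    // LHS: one Lookup annotation produced by the OntoRoot Gazetteer,
    // bound to the label "host" for use on the RHS
    Rule: HostConceptSketch
    (
      {Lookup.URI =~ "Eukaryota"}
    ):host
    -->
    // RHS (abbreviated form): create a new annotation over the match
    :host.TargetHostCandidate = { rule = "HostConceptSketch" }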

The JAPE patterns in this thesis were manually created and organized into three sets. The first set is responsible for preprocessing annotations, namely for adding features to the Lookup tokens, defining newline tokens, formatting, and creating a number token, as seen in figure 4.7. The second set of patterns targets the concepts and instances of the ontology inside the text. These patterns search for concepts and instances inside the Lookup annotation set and add information that could not be added by the OntoRoot Gazetteer as token attributes. They also create annotation sets named after the concept or instance they represent. Some examples of such patterns can be seen in figures 4.8, 4.9 and 4.10. The last set of patterns aims at detecting the relations between the concepts and creating annotations over the document where the first token of the annotated text is the subject and the last one is the object. A few examples of these patterns can be seen in figures 4.11 and 4.12.

Figure 4.7: A JAPE pattern used in the preprocessing part which combines numbers split either with full stops or commas into one token.

The process of creating these patterns was a tedious one which required a lot of patience and testing. The first big obstacle was to determine what kind of preprocessing should be conducted on the documents and how to approach the problem of annotating instances that could not be annotated automatically by the OntoRoot Gazetteer processing resource. A major difficulty in this step was the fact that every instance of SurveillanceActivity was not a single word or even a sentence, but two paragraphs, each of which belongs to a different section of the document. Therefore a decision was made to manually concatenate the paragraphs that belong to every instance of SurveillanceActivity and are located in different sections, and to generate a pattern that can detect and annotate paragraphs, so that the SurveillanceActivity instances can be annotated more easily by using the paragraph annotation. Another issue that occurred was the need to annotate the language of the text. For this specific problem a processing resource called TextCat, found on the official GATE web page, proved to be helpful.


Figure 4.8: A JAPE pattern used in the system which tries to detect the number of units tested in the text using part of speech information.

Figure 4.9: The JAPE pattern used for generating the SurveillanceActivity annotations.

This resource decides which language is dominant in a paragraph and creates a feature with the result in every paragraph annotation. Since a feature such as the language cannot be a specific annotation generated with a pattern, the instance and the relation reportLang are generated during the RDF transformation.

Another big issue with generating patterns showed up during the relationship annotation stage, because of the lack of verbs or other indicative words that could explicitly show a relationship between two instances. For example, every instance of SurveillanceActivity has a SurveillanceContext, which is depicted by the relation SAcontext; however, this is not stated explicitly anywhere in the text. The author was told to assume from the text that if the sentence "a surveillance program for..." exists within the SurveillanceActivity instance, then a SurveillanceContext instance also exists and therefore the relation needs to be created. The solution to the problem of annotating a relationship between a paragraph instance and a sentence instance that exists within the paragraph instance was to annotate a section of the paragraph that includes both of them. The relation begins at the start of the left-hand side instance (SurveillanceActivity) and ends at the end of the right-hand side annotation (SurveillanceContext). The annotation of such a relation contains the IDs of both instances as features. All of the relations in this thesis had to be created in this way.
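A sketch of this relation-annotation technique is shown below. It is not the thesis's actual rule (that is figure 4.12), but an illustration under stated assumptions: the instance annotations are named SurveillanceActivity and SurveillanceContext, and exactly one context annotation lies inside the activity paragraph:

    Phase: RelationSketch
    Input: SurveillanceActivity SurveillanceContext
    Options: control = appelt

    Rule: SAcontextSketch
    (
      // "contains" matches a SurveillanceActivity that has a
      // SurveillanceContext annotation somewhere inside its span
      {SurveillanceActivity contains SurveillanceContext}
    ):activity
    -->
    {
      // Java RHS: span from the start of the activity (subject)
      // to the end of the contained context (object), carrying both IDs
      gate.Annotation act = bindings.get("activity").iterator().next();
      gate.AnnotationSet contexts = inputAS
          .getContained(act.getStartNode().getOffset(), act.getEndNode().getOffset())
          .get("SurveillanceContext");
      gate.Annotation ctx = contexts.iterator().next(); // simplification: first match
      gate.FeatureMap fm = gate.Factory.newFeatureMap();
      fm.put("subjectID", act.getId());
      fm.put("objectID", ctx.getId());
      try {
        outputAS.add(act.getStartNode().getOffset(),
                     ctx.getEndNode().getOffset(), "SAcontext", fm);
      } catch (gate.util.InvalidOffsetException e) {
        throw new gate.util.GateRuntimeException(e);
      }
    }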


Figure 4.10: The JAPE pattern which detects SamplingStrategy instances and generates the necessary annotations.

Finally, one needs to take into account the difficulty of the development process as a whole. While the GATE framework provides a graphical user interface which helps the development process tremendously, it does not offer tools that can speed up the process of testing and debugging the patterns efficiently. For every change that needs to be made to a pattern, the developer needs to reload the pattern in the graphical user interface, select the correct pipeline tab, choose the desired corpus and run the application again. This inefficient way of developing patterns is extremely time-consuming in the long run, and it only gets worse when the complexity of the patterns increases, when new instances need to be annotated, or when both happen at the same time.

4.13

RDF Converter

The RDF Converter module of the system exports the annotated text as a GATE XML file. This file is then processed by an XML parser which removes all unnecessary tags from the document. Finally, the XML-to-RDF transformation is performed using XSLT transformations [7]. A part of an RDF file can be seen in figure 4.13. It should be noted that the number of triplets generated from each document depends directly on the number of relations annotated by the application in the previous module. Having said that, an average of 25 triplets per document is currently expected from the system. From the training documents it was possible to extract a total of 126 triplets, while from the testing documents 74 triplets were obtained, giving a grand total of 200 triplets.
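To illustrate this step, a minimal XSLT template in the spirit of the converter is sketched below. The element layout follows the general GATE XML format, while the namespace URI, the feature names (domainID, rangeID) and the URI scheme are assumptions rather than the system's actual stylesheet.

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:hso="http://example.org/hso#">

      <!-- Wrap all SAcontext relation annotations in an rdf:RDF root. -->
      <xsl:template match="/GateDocument">
        <rdf:RDF>
          <xsl:apply-templates select="//Annotation[@Type='SAcontext']"/>
        </rdf:RDF>
      </xsl:template>

      <!-- One relation annotation becomes one RDF triplet, read from
           the instance IDs stored as annotation features. -->
      <xsl:template match="Annotation">
        <rdf:Description
            rdf:about="{concat('urn:instance:', Feature[Name='domainID']/Value)}">
          <hso:SAcontext
              rdf:resource="{concat('urn:instance:', Feature[Name='rangeID']/Value)}"/>
        </rdf:Description>
      </xsl:template>
    </xsl:stylesheet>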


Figure 4.11: The JAPE pattern used to annotate the relation hasTargetHost, which consists of the SurveillanceActivity instance on the left-hand side and the Eukaryota instance on the right-hand side.


Figure 4.12: The JAPE pattern used to annotate the relation SAcontext, which consists of the SurveillanceActivity instance on the left-hand side and the SurveillanceContext instance on the right-hand side.


5

Results & Discussion

This chapter presents the results obtained from the system up until the XML/RDF Converter. Note that the presented results focus on the instance and relation annotation part of the system, since the goal is to produce RDF triplets from the obtained relations and instances. Relations include both object and data properties, and instances include both the individuals and the data attributes detected. Once the documents are annotated with the detected relations, they are exported as XML annotated texts and further transformed into proper RDF files using the standardized procedure described in chapter 4. Both the training and the testing datasets were inserted into the system and evaluated on the relations and entities the system was able to extract, using the Recall, Precision and F1 measures presented in section 2.4. The system's design allows it to detect 12 different types of instances, 12 different types of object properties and 5 different types of data properties, which can be seen in the appendix. This chapter also discusses the results with respect to the research questions formulated in section 1.5. Additionally, the chosen methodology is discussed with respect to its advantages and disadvantages. Finally, possible improvements and future expansions to the system are discussed in section 5.6.
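For convenience, the standard definitions of these measures, in terms of true positives (TP), false positives (FP) and false negatives (FN), are:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]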

5.1

Training Documents

The training dataset contained five reports: the surveillance reports of infectious diseases in animals and humans in Sweden from 2010, 2012, 2013, 2016 and 2017. The results regarding the instance and relation extraction are shown in table 5.1.


Training Documents

                    Instances   Relations
  Total Extracted   160         126
  Precision         100%        100%
  Recall            94.1%       92.8%
  F1-Score          96.9%       96.2%

Table 5.1: Test metrics Precision, Recall and F1-Score on the training documents.

5.2

Test Documents

The test dataset contained three reports: the surveillance reports of infectious diseases in animals and humans in Sweden from 2011, 2014 and 2015. The results regarding the instance and relation extraction are shown in table 5.2.

Test Documents

                    Instances   Relations
  Total Extracted   95          74
  Precision         100%        100%
  Recall            92.2%       93.2%
  F1-Score          95.9%       96.4%

Table 5.2: Test metrics Precision, Recall and F1-Score on the testing documents.

5.3

Full Dataset

The full-dataset evaluation used all the surveillance reports of infectious diseases in animals and humans in Sweden from 2010 until 2017. The results regarding the instance and relation extraction are shown in table 5.3.

Full Dataset

                    Instances   Relations
  Total Extracted   255         200
  Precision         100%        100%
  Recall            93.4%       93%
  F1-Score          96.5%       96.3%

Table 5.3: Test metrics Precision, Recall and F1-Score on the full dataset.


5.4

Results

Before discussing the obtained results, it is necessary to mention that extracting information from text while maintaining human-like precision and recall is a challenging task that even the most sophisticated systems cannot fully achieve. The first challenge for such a system is the implementation of an entity recognizer that correctly identifies concepts in the text. The second challenge is the implementation of a module that correctly identifies and extracts relationships; obtaining those relationships is only possible if the entity recognizer has detected the entities correctly.

This challenge is arguably even harder when the Information Extraction system relies on an ontology for detecting and annotating semantic instances and relations. The system can benefit extensively from an ontology designed with the purpose of being used by an IE system. However, designing an ontology can be a tedious and time-consuming task, depending on the complexity of the domain to be conceptualized. Domain expertise and precision during ontology engineering are essential for generating an ontology capable of storing knowledge efficiently.

Having said that, the proposed system achieved high precision, recall and F1-Scores, as seen earlier in chapter 5, which answers the second research question. However, three of the object properties were not included in the extraction process at all: SAsampledArea, SAsampler and SAtestType. The reason for this was the system's inability to extract the entities needed for forming these relations, due to missing lexical representations of the knowledge base individuals, the lack of supporting annotations, and the lack of contextual information in the text. Thereby, to answer the first research question, the system was unable to detect and extract all of the semantic relations, having extracted 85% of them (17 out of 20).

5.5

Method

Data

The initial request from the Swedish Veterinary Agency was the extraction of information from the surveillance reports of infectious diseases in animals and humans in Sweden from 2013 until 2017. However, four surveillance reports were deemed insufficient for both training and testing, and therefore an initial attempt was made to include reports going back to 1999. More specifically, reports of Zoonoses in Sweden from 1999 until 2007, and surveillance reports of zoonotic and other animal disease agents in Sweden from 2006 until 2010, were initially included and divided into training and testing datasets. Nonetheless, this attempt failed for several reasons. The most important was that the documents from 1999 until 2010 covered no active surveillance monitoring, which is the main point of interest. A second reason was that, even though other concepts could potentially have been detected despite the absence of active surveillance monitoring, these documents did not contain as much information as those from 2010 onwards, so the extracted information would not have been as useful. A further reason was that some of the documents also presented information in a structured way, i.e. using tables, which would require an altogether different extraction approach. Another attempt to find additional data for training and testing was made by including annual reports on Zoonoses from the Danish National Food Institute, which is a member of the Orion Project, but these documents also contained structured text with many tables, again requiring a different method as mentioned earlier. As a result, a total of 8 documents were included, all following the same pattern of data presentation.


Ontology

The HSO ontology provided for the implementation of the proposed OBIE system is still under continuous development, evolving to incorporate more concepts and to improve current conceptualizations. As stated in section 5.4, there were missing individuals as well as a lack of supporting annotations, which impacted the information extraction process. More specifically, the OntoRoot Gazetteer was impacted the most by the aforementioned deficiencies and was therefore unable to locate specific concepts in the text. To remedy this problem, additional lexico-syntactic patterns had to be generated in order to annotate those concepts.

OBIE System

The OBIE system proposed in this thesis has achieved high precision and recall in extracting all possible concepts and the relations among them. Additionally, the initial task was to relieve a human editor of having to search through the text to locate and extract all instances and relations manually, which the specialized hand-written patterns have achieved. However, the chosen method has disadvantages. The first is that hand-written patterns are hard and time-consuming to write, since writing a pattern requires the writer to study the document's structure in order to implement lexico-syntactic or contextual patterns. The second is that they are difficult to maintain: for every new concept and relation added to the ontology, a new pattern must be generated in order for it to be extracted, and some patterns may also be needed to add additional ontological information to the annotations before the extraction process is done. Finally, the hand-written patterns implemented for this thesis are domain-dependent, and for some relations even document-dependent. As mentioned in chapter 1, health organizations and agencies may use different terminologies and reporting formats, and in many cases they are interested in different kinds of information. Incorporating semi-structured surveillance reports from countries other than Sweden could therefore reduce the performance of the proposed OBIE system significantly.

5.6

Future work

Future improvements and additions to the proposed system can be made. An important extension would be the implementation of patterns that can detect concepts and relations in other chapters of the surveillance reports. So far, the ontology supports relation extraction from the chapters related to the bacteria E. coli and Salmonella; more concepts need to be added to enable the detection of all the diseases in the surveillance reports.

A major improvement would be a reduction in the number of hand-written patterns. To achieve this, and at the same time improve the efficiency of concept annotation, the GATE OntoGazetteer processing resource could be added to the system's architecture as a supplement to the OntoRoot Gazetteer. The difference between the two resources is that the OntoGazetteer generates annotations by mapping a given list to ontology classes in the form of class instances, while the OntoRoot Gazetteer searches through the ontology in order to create a dynamic gazetteer, as described in section 4.11. Maintaining a gazetteer list is easier than maintaining several hand-written patterns, as sketched below. This addition could potentially increase the performance of the system when surveillance reports from countries other than Sweden are used as input, allowing the system to become document-independent. Improvements in the HSO ontology may also render some patterns obsolete or require them to be updated.
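To make the contrast concrete, a plain GATE gazetteer is driven by a definition file plus one word list per entry type, roughly as shown below. The entries are purely illustrative, and the exact syntax by which the OntoGazetteer maps each list to an ontology class should be taken from the GATE user guide rather than from this sketch.

    lists.def (one line per list, in the form file:majorType:minorType):
        diseases.lst:disease:zoonosis

    diseases.lst (one surface form per line):
        campylobacteriosis
        salmonellosis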

In addition to that, a second pipeline could be added to the system in order to extract data from PDF tables, increasing the system's applicability to surveillance reports that present a large amount of information in tabular form. Different methods for information extraction from PDF documents have been proposed.


A method called pdf2table, described in [54], achieved 83% accuracy and 81% recall when decomposing PDF tables, while another method using Conditional Random Fields [26], described in [39], achieved a 92% F1-Score.

5.7

The work in a wider context

As discussed in chapter 1, data interoperability is a great necessity for knowledge sharing and cooperation among health institutes and agencies across the world. As new projects and agreements are initiated and established, a common "language" for describing and representing the knowledge obtained so far becomes ever more necessary. In many cases, however, this knowledge remains dormant and unused in the form of unprocessed information in various reports, and needs to be extracted. The proposed system attempts to extract information from surveillance activity reports of diseases in animals and humans in Sweden, so that it can be stored in a Knowledge Base and used for decision making and knowledge discovery.


6

Conclusion

As described in chapter 4, the system is able to annotate ontological instances and relations in the text associated with the campylobacteriosis disease, export them as annotated text files, and finally transform them into the desired RDF format.

This thesis has shown that it is possible to create an OBIE system with high performance scores. However, in order to achieve the desired scores without access to large amounts of data, hand-written patterns were used extensively, which makes the system harder to maintain and expand in the future. Another issue shown in this thesis is that the ontology used to assist the Information Extraction process plays an important role, and any deficiencies it contains can hinder the process. Deficiencies in the ontology used in this work made the detection of three types of instances impossible, which in turn made the system unable to extract three types of relations.

References
