Integration of Recommendation and Partial Reference Alignment Algorithms in a Session based Ontology Alignment System


Institutionen för datavetenskap

Department of Computer and Information Science

Final Thesis

Integration of Recommendation and Partial

Reference Alignment Algorithms in a Session

based Ontology Alignment System

By

Shahab Qadeer

LIU-IDA/LITH-EX-A--11/038—SE

2011-10-03

Linköpings universitet SE-581 83 Linköping, Sweden


Supervisor: Qiang Liu

Examiner: Patrick Lambrix

Linköping University Electronic Press


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for noncommercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/.

Abstract

SAMBO is a system that assists users in aligning and merging two ontologies, i.e. in finding inter-ontology relationships. The user performs the alignment process with the help of mapping suggestions. The objective of this thesis work is to extend the existing system with new components: support for multiple sessions, integration of an ontology alignment strategy recommendation system, integration of a system that can use results from previous sessions, and integration of partial reference alignment (PRA) algorithms that can be used to filter mapping suggestions. Most of the theoretical work already existed, but it was important to study and implement how these components can be integrated in the system and how they can work together.

Acknowledgement

I am extremely thankful to my examiner, Professor Patrick Lambrix, who gave me the opportunity to work in this research area. His support and acknowledgement really helped me during the project phase. Furthermore, I am grateful to my supervisor Qiang Liu, who guided and assisted me whenever I needed his support.

I am obliged to my parents, who supported and encouraged me throughout my studies. In the end, I am grateful to mighty God, who gave me strength and patience to accomplish my thesis work.

Shahab Qadeer

Linköping University, Sweden

September 2011

List of Figures

Figure 2-1: Alignment framework
Figure 2-2: Combination and filtering
Figure 2-3: Mapping Suggestion
Figure 2-4: Information about the remaining suggestions
Figure 2-5: Session based framework for alignment of ontologies
Figure 2-6: Computation Session
Figure 2-7: Validation Session
Figure 2-8: Recommendation Method
Figure 4-1: Session based framework for alignment of ontologies
Figure 4-2: Session Management
Figure 4-3: Recommendation Process
Figure 4-4: PRA Process
Figure 5-1: JSP Model Architecture
Figure 5-2: SAMBO database Schema
Figure 5-3: User Information
Figure 5-4: Session Information
Figure 5-5: Session Information – Previous Release
Figure 5-6: Session Information – New Release
Figure 5-7: Suggestion List
Figure 5-8: History List
Figure 5-9: List of Predefined Strategies
Figure 5-10: Results from Predefined Strategies
Figure 5-11: Locked Sessions
Figure 5-12: PRA Alignment Algorithm
Figure 6-1: Multiple sessions for user
Figure 6-2: Precision and Recall
Figure 6-3: Remaining Suggestions (no PRA)
Figure 6-4: Remaining Suggestions (with PRA)
Figure 6-5: Test case WL
Figure 6-6: Test case WL with PRA
Figure 6-7: Test case C1 (NGram, Edit Distance, and WL)
Figure 6-8: Test case C1 (NGram, Edit Distance, and WL) with PRA
Figure 6-9: Test case C2 (NGram, Edit Distance, and WN)

Abbreviations

GO Gene Ontology

HTML HyperText Markup Language

IDE Integrated Development Environment

OWL Web Ontology Language

OAEI Ontology Alignment Evaluation Initiative

OBO Open Biomedical Ontologies

ORM Object Relational Mapping

PRA Partial Reference Alignment

RA Reference Alignment

RDF Resource Description Framework

WWW World Wide Web


List of Contents

1. Introduction
1.1. Problem Statement
1.2. Project Goal
1.3. Methodology
1.4. Thesis Outline
2. Background
2.1. Semantic Web
2.2. Ontology
2.3. Semantic Web and Ontologies
2.4. Ontology Alignment
2.5. SAMBO
2.6. Session-based SAMBO
2.7. Evaluation of Ontology Alignment Strategies by different Systems
2.8. Methods for Recommending Ontology Alignment Strategies
2.9. Align Ontologies using Partial Reference Alignment
3. Requirements
4. Analysis and Design
4.1. Framework Architecture
4.2. Session Management
4.3. Recommendation Process
4.4. PRA Process
5. Implementation
5.1. Model Architecture
5.2. Database Schema
5.3. Implementation of Multiple Sessions
5.3.1. User Information
5.3.2. User Session Information
5.4. Integration of Recommendation Method
5.4.1. Predefined Strategies Information
5.5. Integration of PRA Methods
5.6. Classes and Functions
6. Evaluation and Testing
6.1. Database driven multi-session handling
6.2. Evaluation
6.2.1. Precision and Recall
6.3. PRA
6.4. Recommendation
6.5. Testing
7. Conclusion and Future work
8. Bibliography

Chapter 1

1. Introduction

The semantic web is a major research initiative of the World Wide Web Consortium (W3C) to create a metadata-rich web of resources that can describe themselves not only by how they should be displayed (HTML) or syntactically (XML), but also by the meaning of the metadata.¹

The goal of the semantic web is to make it easier for different parties to share information with well-defined meaning, so that the information can later be searched, retrieved and exchanged among them.

An ontology is a specification of a conceptualization that is designed for reuse across multiple applications and implementations. A specification of a conceptualization is a written, formal description of a set of concepts and relationships in the domain of interest [1].

An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of the basic concepts in the domain and the relations among them. Ontologies are used:²

(i) To share a common understanding of the structure of information among people or software agents, and to reuse domain knowledge.

(ii) To make domain assumptions explicit and to analyze domain knowledge.

(iii) To separate domain knowledge from operational knowledge.

Nowadays the information available on the web is presented for humans to consume, which makes it unsuitable for automatic processing by software mediators. The semantic web can play an important role in handling this situation through the use of ontologies, which can be used to mark up the content available on the web. As a result, information processing becomes easier for software mediators.

Several ontologies contain overlapping information, and it is important to know the relationships between terms in different ontologies. We align ontologies when we define these relationships. The similarity between terms in different ontologies defines the relationship. Ontology alignment thus helps us find inter-ontology relationships, and knowledge of inter-ontology relationships can lead to improvements in search, integration and analysis of data.

1. From W3C semantic web activity, http://www.w3.org/2001/sw/ Accessed 2011-08-24

1.1 Problem Statement

The System for Aligning and Merging Biomedical Ontologies (SAMBO) is an ontology alignment and merging system. The system needs to be extended in four ways: multiple sessions are to be saved; a mechanism is to be integrated that evaluates all available strategies and suggests the best one; algorithms that use a PRA are to be integrated in order to generate fewer wrong suggestions; and the system is to be made database driven.

1.2 Project Goal

The goal of the project was to develop three new components in the SAMBO system. A multi-session component was needed that lets users load a session from a list of saved sessions. The user should be able to finish the selected session, or to update it after processing a few or all of the suggestions. A recommendation component loads the predefined strategies and applies the recommendation algorithm; the evaluation results are shown to the user. A partial reference alignment (PRA) framework is added so that the system generates fewer wrong suggestions.

1.3 Methodology

Firstly, we studied the research articles “A Session based system for alignment of large ontologies” [2], “Method for Recommending Ontology Alignment Strategies” [3] and “Using Partial Reference Alignments to Align Ontologies” [4]. Secondly, we converted the session component to a multi-session component and integrated the methods discussed in the articles into SAMBO. Finally, we evaluated our results.

1.4 Thesis Outline

The thesis is organized as follows.

Chapter 1 Introduction

This chapter introduces the problem statement, project goal, methodology and thesis outline.

Chapter 2 Background

This chapter gives background on the semantic web, ontologies, ontology alignment, SAMBO, the method for recommending ontology alignment strategies, and aligning ontologies using partial reference alignments (PRA).

Chapter 3 Requirements

This chapter describes the main requirements and supporting features.

Chapter 4 Analysis and Design

This chapter describes the proposed framework for aligning ontologies using a session-based approach.

Chapter 5 Implementation

This chapter describes the architecture model, the database schema, the implementation of multiple sessions, the integration of the recommendation method, and the integration of PRA in SAMBO.

Chapter 6 Evaluation and Testing

This chapter discusses the test cases we used and the evaluation results.

Chapter 7 Conclusion and Future Work

This chapter summarizes the work we performed and the conclusions we can draw from it. It also describes future work that could extend the system.

Chapter 2

2. Background

2.1 Semantic Web

The semantic web is an emerging concept in the World Wide Web (WWW). It is evolving the web of today by allowing users to find, share, and combine information more easily.³ Tim Berners-Lee, founder of the World Wide Web Consortium (W3C), together with James Hendler and Ora Lassila defined the semantic web as follows [5]:

“The semantic web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation”.

The vision of the semantic web is extremely ambitious and would require solving many long-standing research problems in knowledge representation, reasoning, databases, computational linguistics, computer vision, and agent systems. The semantic web has an important influence on the development of the web [6], such as the infrastructure needed to support the development of languages and tools for content annotation and for the design and deployment of ontologies.

2.2 Ontology

“An ontology can be seen as defining the basic terms and relations of a domain of interest, as well as the rules for combining these terms and relations” [7]. Ontologies give people and organizations a way to communicate by providing common terminologies over domains.⁴ Although ontologies differ in representation, most ontologies contain the following components [8] [9]:

• Concepts: represent sets or classes of entities within a domain.

• Instances: are entities represented by a concept.

• Relations: define the ways in which concepts and instances are related to one another.

• Axioms: describe facts that are always true in the topic area of an ontology, and can be used to constrain values for concepts or instances.

3. Semantic Web - http://en.wikipedia.org/wiki/Semantic_Web Accessed: 2011-02-06 18:10

2.3 Semantic Web and Ontologies

The information available on the web is presented for human readers and is not suitable for automatic processing by software mediators. To handle this situation, we use ontologies to mark up the content available on the web. Information processing then becomes easier for software mediators, which has led to new research in the area of the semantic web.

2.4 Ontology Alignment

Ontology alignment is defined as determining relations between concepts in different ontologies. The concepts of one ontology are compared with the concepts of the other ontology.

An alignment framework is shown in Figure 2-1. Many ontology alignment systems follow this framework, which has two parts. The first part deals with the computation of mapping suggestions, and the second part interacts with the user to decide upon the final alignment.

Figure 2-1: Alignment framework [4]

The two source ontologies are given as input to an ontology alignment algorithm. These ontologies may be partitioned into mappable groups during preprocessing. The groups are given as input to different matchers, which calculate similarity values for terms in different ontologies. The matching techniques can be linguistic, structure-based, constraint-based, or instance-based. By combining the results of different matchers and filtering them, mapping suggestions are generated. These are accepted or rejected by the user, which may result in further suggestions. The conflict checker is used to avoid conflicts between mapping suggestions. A PRA can be used to filter out term pairs from the suggestions so that fewer wrong suggestions remain. More information about PRA can be found in section 2.9.

2.5 SAMBO

SAMBO is developed according to the framework described in section 2.4. The current implementation supports ontologies in OWL format. The system works in two steps: aligning relations and aligning concepts [10]. The user can perform the alignment manually or based on suggestions. The suggestions are proposed by different alignment strategies as well as by their combinations. After the user completes the alignment process, the system can merge the ontologies. SAMBO also provides a number of reasoning services, such as checking whether the new ontology is consistent, whether there are cycles in the new ontology, and whether there are unsatisfiable concepts. Figure 2-2 shows how different matchers can be chosen and how weights can be assigned to these matchers. Filtering is performed using a threshold value. The pairs of terms with a similarity value above this threshold are shown to the user as mapping suggestions [10].

Figure 2-2: Combination and Filtering.
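The combination and filtering step can be illustrated with a minimal Python sketch. It is not SAMBO's implementation: the function names `combine` and `filter_suggestions`, the matcher names, and the similarity values are hypothetical, and we assume that the combination is a weighted average and that a pair whose similarity equals the threshold is kept.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (term in ontology 1, term in ontology 2)


def combine(matcher_scores: Dict[str, Dict[Pair, float]],
            weights: Dict[str, float]) -> Dict[Pair, float]:
    """Weighted average of matcher similarities (assumed combination rule).

    A pair that a matcher did not score contributes 0 for that matcher.
    """
    total_weight = sum(weights[name] for name in matcher_scores)
    combined: Dict[Pair, float] = {}
    for name, scores in matcher_scores.items():
        for pair, sim in scores.items():
            combined[pair] = combined.get(pair, 0.0) + weights[name] * sim
    return {pair: value / total_weight for pair, value in combined.items()}


def filter_suggestions(combined: Dict[Pair, float],
                       threshold: float) -> List[Pair]:
    """Keep the pairs whose combined similarity reaches the threshold."""
    return sorted(pair for pair, sim in combined.items() if sim >= threshold)


# Hypothetical similarity values from two matchers.
scores = {
    "ngram": {("nose", "nose"): 1.0, ("ear", "eye"): 0.4},
    "edit_distance": {("nose", "nose"): 1.0, ("ear", "eye"): 0.2},
}
weights = {"ngram": 1.0, "edit_distance": 1.0}
suggestions = filter_suggestions(combine(scores, weights), threshold=0.6)
```

With these made-up values only the pair with identical names survives the filter. Raising a matcher's weight shifts the combined value toward that matcher's opinion.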

In Figure 2-3 the system displays information (definition/identifier, synonyms, and relations) about the source ontology terms in a suggestion. For each mapping suggestion, the user decides whether the terms are equivalent, whether there is an is-a relation between the terms, or whether the suggestion should be rejected. If the user decides that the terms are equivalent, a new name for the term can be given as well. After each user action, the suggestion list is updated. If the user rejects a suggestion in which two different terms have the same name, he/she is required to rename at least one of the terms. At any point during the alignment process the user can view the ontologies represented as trees, together with information on which actions have been performed, and can check how many suggestions still need to be processed [10].

Figure 2-3: Mapping Suggestion.

Figure 2-4 shows the remaining suggestions for a particular alignment process. A similar list can be obtained to view the previously accepted alignment suggestions.

Figure 2-4: Information about the remaining suggestions.

2.6 Session-based SAMBO

We start with background on the framework architecture of the previous work, in which the alignment algorithm is integrated with computation (see section 2.6.1) and validation (see section 2.6.2) sessions along with the recommendation process. Figure 2-5 shows the architecture of the session-based framework.

Figure 2-5: Session based framework for alignment of ontologies [2].

The framework is divided into three processes, where two sessions are integrated into each other. The computation session computes the suggestions by applying alignment algorithms, including several matchers, combinations and filters. The results of the computation session are used by the validation session, where the user can accept or reject the suggestions. The conflict checker is applied to the accepted suggestions to remove conflicts. At the end a Partial Reference Alignment (PRA), i.e. the list of current mappings, is generated for further use in the recommendation process. The recommendation process runs independently, using the results of the computation session, and recommends the better matching strategies. During preprocessing the PRA is used to divide the ontologies into mappable parts. It is also used to filter suggestions [2].

2.6.1. Computation Session

Figure 2-6 shows the computation session process in the system. It takes two source files as input and has filters and matchers for alignment. The preprocessor determines whether this is the first computation of suggestions or whether a PRA is available. The matchers are implemented using linguistic matching, structure-based matching, constraint-based approaches, instance-based strategies, strategies that use auxiliary information, or a combination of these. A matcher calculates similarity values between terms from the different source ontologies. The suggestions are generated by combining the matchers and filtering the results. The user can select different matchers and set a threshold value to obtain the results. Different matchers, combinations and filters generate different results and different suggestions. The suggestion list generated by a computation session is used as input to the validation session [2].

2.6.2. Validation Session

Figure 2-7 shows the validation session process in the system. It takes the suggestion list generated by a computation session as input. The suggestions are shown to the user, who decides whether to accept or reject each of them. The acceptance or rejection of a suggestion may affect other suggestions. A conflict checker detects unsatisfiable concepts and can be used to remove redundancy at the user's request [2].

Figure 2-6: Computation Session [2].

Figure 2-7: Validation Session [2].

2.7 Evaluation of Ontology Alignment Strategies by different Systems

There is currently a lack of knowledge on how well different alignment strategies work for different kinds of ontologies. Different groups have performed evaluations by comparing systems such as PROMPT (based on Protégé), FCA-Merge and ODEMerge [11]. That evaluation covers functionality, interoperability and visualization, but no tests are performed on the quality of the alignment. Other systems, such as PROMPT, Chimaera, FOAM, and SAMBO, are evaluated in terms of the quality of the alignment as well as the time it takes to align ontologies with these tools [12][13].

Tools have recently been developed to evaluate and compare the non-interactive part of alignment algorithms. These systems compute the precision, recall and f-measure of an alignment result between two ontologies [3]. KitAMO [14] is an integrated system for the analysis and evaluation of alignment strategies. It generates similarity values, supports the evaluation of precision and recall for combinations of different algorithms, weights and thresholds, and computes the performance of strategies [3].
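As an illustration of these measures, the following Python sketch computes precision, recall and f-measure for a set of suggested mappings against a reference alignment, using the standard definitions. The function name `evaluate` and the example term pairs are hypothetical and are not taken from KitAMO or SAMBO.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]


def evaluate(suggested: Iterable[Pair],
             reference: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, f-measure) of suggested vs. reference."""
    suggested_set, reference_set = set(suggested), set(reference)
    correct = suggested_set & reference_set
    # Precision: share of the suggestions that are correct.
    precision = len(correct) / len(suggested_set) if suggested_set else 0.0
    # Recall: share of the correct mappings that were suggested.
    recall = len(correct) / len(reference_set) if reference_set else 0.0
    # F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical data: 4 suggestions, 3 of them correct, 6 reference mappings.
reference = [("a", "a2"), ("b", "b2"), ("c", "c2"),
             ("d", "d2"), ("e", "e2"), ("f", "f2")]
suggested = [("a", "a2"), ("b", "b2"), ("c", "c2"), ("x", "y2")]
precision, recall, f_measure = evaluate(suggested, reference)
```

For this made-up example, three of the four suggestions are correct and three of the six reference mappings are found.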

2.8. Methods for Recommending Ontology Alignment Strategies

Figure 2-8 shows a recommendation method. In the initial step an ontology pair is given as input to a segment pair algorithm, which generates small ontology pairs called segment pairs. Alignments are then generated for these segment pairs: in the alignment toolbox the available alignment strategies align the segment pairs and report on the alignment results. Based on these reports, the recommendation algorithm recommends strategies for aligning the two given ontologies [3].

Figure 2-8: Recommendation Method [3].

2.9. Align Ontologies using Partial Reference Alignment

2.9.1 Partial Reference Alignment (PRA)

Correct mappings between terms in different ontologies may be available from mapping portals, or they can be obtained through an interactive alignment method. A Partial Reference Alignment is a subset of all correct mappings. A PRA can be used in the preprocessing, matcher and filter steps. In this thesis we cover only three of these methods. We refer to [4] for information about the methods that use a PRA.

2.9.2 PRA Alignment Algorithms

2.9.2.1 fPRA

fPRA is a filtering algorithm that adds the PRA mappings to the final result. Remaining suggestions that contradict the PRA mappings are removed.
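A minimal Python sketch of this kind of filter is given below. It is an illustration under stated assumptions, not the algorithm of [4]: we assume that all PRA mappings are equivalence mappings and that a suggestion contradicts the PRA when it maps a term that the PRA already maps to a different term. The function name `fpra_filter` and the example terms are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (term in ontology 1, term in ontology 2)


def fpra_filter(suggestions: Iterable[Pair], pra: Iterable[Pair]) -> List[Pair]:
    """Add the PRA mappings and drop suggestions that conflict with them."""
    pra_set = set(pra)
    mapped_first = {first for first, _ in pra_set}
    mapped_second = {second for _, second in pra_set}
    kept = [
        pair for pair in suggestions
        if pair not in pra_set            # already covered by the PRA
        and pair[0] not in mapped_first   # assumed conflict: term 1 is mapped
        and pair[1] not in mapped_second  # assumed conflict: term 2 is mapped
    ]
    # PRA mappings come first, followed by the surviving suggestions.
    return sorted(pra_set) + kept


suggestions = [("nose", "nose"), ("nose", "nasal_cavity"), ("ear", "ear")]
pra = [("nose", "nose")]
filtered = fpra_filter(suggestions, pra)
```

Here the suggestion that maps "nose" to a second term is dropped because the PRA already fixes the mapping for "nose".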

2.9.2.2 dtfPRA

dtfPRA is a filtering algorithm. It is a double threshold filtering technique that uses the structure of the ontologies. Double threshold filtering consists of three steps [4]:

(i) Find a consistent suggestion group among the pairs with a similarity value higher than or equal to the upper threshold.

“A set of suggestions is a consistent suggestion group if each concept occurs at most once as first argument in a pair, at most once as second argument in a pair and for each pair of suggestions <A,A’> and <B,B’> where A and B are concepts in the first ontology and A’ and B’ are concepts in the second ontology: A is a subset of B, if and only if A’ is a subset of B’” [11].

(ii) Use a consistent suggestion group to partition original ontologies.

(iii) Filter the pairs with similarity values between the lower and upper thresholds using the partitions. Elements of pairs which belong to corresponding pieces in the partitions are retained as suggestions.
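The definition of a consistent suggestion group and the split by the two thresholds can be sketched in Python as follows. This is a simplified illustration and not the implementation of [4]: we read "A is a subset of B" as "B is an is-a ancestor of A", represent each ontology by a hypothetical dictionary from a concept to the set of its ancestors, and leave out the partitioning of step (ii). The function names are our own.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]
Ancestors = Dict[str, Set[str]]  # concept -> all of its is-a ancestors


def is_consistent_group(pairs: List[Pair],
                        ancestors1: Ancestors,
                        ancestors2: Ancestors) -> bool:
    """Check the consistent suggestion group conditions for a set of pairs."""
    firsts = [first for first, _ in pairs]
    seconds = [second for _, second in pairs]
    # Each concept may occur at most once as first and as second argument.
    if len(set(firsts)) != len(firsts) or len(set(seconds)) != len(seconds):
        return False
    for a, a2 in pairs:
        for b, b2 in pairs:
            if (a, a2) == (b, b2):
                continue
            below_in_1 = b in ancestors1.get(a, set())
            below_in_2 = b2 in ancestors2.get(a2, set())
            if below_in_1 != below_in_2:  # structure must agree on both sides
                return False
    return True


def split_by_thresholds(similarities: Dict[Pair, float],
                        lower: float,
                        upper: float) -> Tuple[List[Pair], List[Pair]]:
    """Return (pairs at or above upper, pairs between lower and upper)."""
    high = sorted(p for p, s in similarities.items() if s >= upper)
    middle = sorted(p for p, s in similarities.items() if lower <= s < upper)
    return high, middle
```

The pairs in the first list are the candidates for the consistent suggestion group of step (i). The pairs in the second list are the ones that step (iii) keeps or drops depending on the partitions.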

2.9.2.3 mgPRA

mgPRA is a preprocessing algorithm that partitions the ontologies into mappable parts. It first adds missing structural relationships so that the whole PRA becomes a consistent group, and then partitions the ontologies into mappable groups. For alignment strategies that use structural information, the quality of the examined ontologies, the completeness of the structure, and the correct use of the structural relations have a significant influence on the quality of the results [4].

Chapter 3

3. Requirements

Design a database for SAMBO and make the system database driven.

• Sessions are to be saved in the database; previously, SAMBO used XML files.

• Predefined strategies, processed strategies, and PRA-filtered suggestions are to be saved in the database.

Extend the session-based ontology alignment system with a multiple sessions feature.

• The user should be able to save and load sessions to and from the database.

• Sessions are saved per selected ontology pair and can be reloaded in their last saved state.

Integrate the recommendation system presented in [3].

• The system should show the user the predefined alignment strategies that are defined in the database.

• The user should have an option to align ontology pairs using predefined strategies.

• The recommendation component has to generate segment pairs.

• The recommendation component will use UMLS as an oracle to generate suggestions.

• The recommendation component will use thresholds and weights with the segment pairs to generate suggestions.

• The recommendation component shows the results for a pair of ontologies to the user, with the parameters f-measure, precision, recall and recommendation score. Please refer to the glossary for more information about these parameters.

• The user can choose the best ranked strategy from the list to start the alignment process.

Integrate the PRA-based methods presented in [4].

• The system should integrate fPRA and dtfPRA as filtering methods and mgPRA as a preprocessing method.

Chapter 4

4. Analysis and Design

4.1 Framework Architecture

Figure 4-1 shows the updated architecture of the session-based framework. The inputs to the framework are two source ontology files, and the output is the alignment.

Figure 4-1: Session based framework for alignment of ontologies.

[Figure 4-1 components: source ontologies; computation session with alignment algorithm (preprocessor, matchers, combinations, filters); validation session with accepted suggestions (conflict checker); locked session; DB; PRA algorithms (1. calculate processed suggestions, 2. filter suggestions using fPRA, dtfPRA, mgPRA); recommendation component (1. segment pair generation, 2. use UMLS as oracle, 3. generate recommendation (KitAMO)); user; domain thesauri; general dictionaries; alignment.]

The framework is divided into three processes, where two sessions are integrated into each other. The computation session computes the suggestions by applying alignment algorithms, including several matchers, combinations and filters. The results of the computation session are used by the validation session, where the user can accept or reject the suggestions. The conflict checker is applied to the accepted suggestions to remove conflicts. The recommendation process runs independently, using the results of the computation session, and recommends the better matching strategies. The PRA helps in filtering the remaining suggestions so that fewer wrong suggestions remain; after applying the validation session to the PRA-based suggestions, we get an improved list of suggestions [4].

4.2 Session Management

Figure 4-2 shows how sessions are managed. We have introduced a multi-session feature in SAMBO. A user can save more than one session, and at login the system shows the list of saved sessions. The sessions are maintained in the database.

Figure 4-2: Session Management.

[Figure 4-2 flowchart nodes: Start; List of Saved Sessions / No Session; Select Ontology Pairs; Start Session; Computation Session; Validation Session; Locked Session; End.]

4.3 Recommendation Process

Figure 4-3 shows the recommendation process. The process starts with loading an ontology pair. The loaded ontology pair is divided into segment pairs using SubG. SubG collects the pairs of terms in the two ontologies that have the same name. The pairs of sub-graphs rooted at these terms with respect to is-a are the candidate segment pairs. Segment pairs are chosen randomly from the candidate segment pairs such that the segments are pairwise disjoint. These segment pairs are small sub-ontology pairs of the main ontology pair.
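The selection of disjoint segment pairs can be sketched in Python as follows. This is only an illustration of the idea described above, not the SubG implementation: each ontology is represented by a hypothetical dictionary from a term to its direct is-a children, a segment is taken to be a root together with all of its descendants, and the names `descendants` and `select_segment_pairs` are our own.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

Children = Dict[str, List[str]]  # term -> direct is-a children


def descendants(children: Children, root: str) -> Set[str]:
    """Return the root together with everything below it via is-a."""
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def select_segment_pairs(children1: Children, children2: Children,
                         count: int, seed: Optional[int] = None
                         ) -> List[Tuple[str, Set[str], Set[str]]]:
    """Pick up to `count` pairwise disjoint segment pairs at random."""
    rng = random.Random(seed)
    terms1 = set(children1) | {c for cs in children1.values() for c in cs}
    terms2 = set(children2) | {c for cs in children2.values() for c in cs}
    candidates = sorted(terms1 & terms2)  # roots with the same name
    rng.shuffle(candidates)
    used1: Set[str] = set()
    used2: Set[str] = set()
    selected = []
    for name in candidates:
        segment1 = descendants(children1, name)
        segment2 = descendants(children2, name)
        if segment1 & used1 or segment2 & used2:
            continue  # overlaps an already chosen segment
        selected.append((name, segment1, segment2))
        used1 |= segment1
        used2 |= segment2
        if len(selected) == count:
            break
    return selected


# Hypothetical toy ontologies.
ontology1 = {"body": ["head", "limb"], "head": ["nose", "ear"], "limb": ["arm"]}
ontology2 = {"anatomy": ["head", "limb"], "head": ["nose"], "limb": ["leg"]}
segment_pairs = select_segment_pairs(ontology1, ontology2, count=2, seed=7)
```

Passing a seed makes the random choice repeatable, which is convenient when the same segment pairs should be reused to compare several strategies.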

We generate correct alignments for every segment pair using UMLS. For every predefined strategy, suggested mappings are generated. Then we calculate precision, recall and f-measure with respect to the correct alignments. Finally, we calculate the final measures for every predefined strategy.

Figure 4-3: Recommendation Process

[Figure 4-3 flowchart steps: Start; load ontology pairs; generate segment pairs (SP); generate correct alignments using UMLS for every segment pair; for every predefined alignment strategy, generate suggested alignments; calculate precision, recall and f-measure w.r.t. the correct alignments; calculate the final measures for every predefined strategy.]

4.4 PRA Process

Figure 4-4 shows how the PRA framework is applied. The PRA framework is applied after the session is saved. Three PRA algorithms are available for aligning the saved suggestions. The saved suggestions consist of the processed suggestions with an equivalence relation and the remaining suggestions. The processed suggestions are given to the PRA algorithms, which use them to filter the remaining suggestions.

Figure 4-4: PRA Process

[Figure 4-4 flowchart nodes: Start PRA; Select Algorithm (fPRA, dtfPRA, mgPRA); PRA Aligned Suggestions; End PRA.]

Chapter 5

5. Implementation

5.1 Model Architecture

The system has been implemented in Java using JSP and Servlets. Figure 5-1 shows the JSP model architecture. JSP and Servlets can be used together for rapid development of applications that perform well, support separate layers for business logic and data, and can be extended into an enterprise application.⁵

Figure 5-1: JSP Model Architecture

[Figure 5-1 components: client/browser (request, response); application server with Servlets, views (JSP) and Java Beans; enterprise server; DB.]

5.2 Database

We have designed a database that stores all information related to sessions as well as the information used by the recommendation component. We use MySQL as the database server because it supports optimized access to large amounts of data on the server and makes data searches efficient. Figure 5-2 shows the database schema designed for the SAMBO application according to the requirements discussed in Chapter 3. The schema is described in more detail in the following sections.

5. Java Server Pages Overview - http://java.sun.com/products/jsp/overview.html Accessed: 2011-02-16 13:10

Figure 5-2: SAMBO database Schema

5.3 Implementation of Multiple Sessions

5.3.1 User Information

User information is retrieved from the ‘users’ table shown in Figure 5-2 above. Previously this information was retrieved from an XML file, shown in Figure 5-3.

5.3.2 User Session Information

User session information is retrieved from the ‘usersessions’ table shown in Figure 5-2 above. The design is flexible enough to accommodate multi-user functionality in the future.

We reuse the session information that the previously developed system stored in XML format. For each saved session, the content of the XML file is saved to a field of the table and is read back when the session is loaded again.

The previous release of the system stored session information in an XML file; Figure 5-4 shows what type of information was saved in that file. In this release we have replaced this feature with a ‘usersessions’ table in our database schema, and all the information from the file is now saved in the ‘usersessions’ table.

Figure 5-4: Session Information (Previous release) [4].

When the user logs in, the system looks for saved sessions. Any previously saved sessions are presented to the user; the displays of the old and the new release are shown below. The previous release could save only a single session; Figure 5-5 shows how the saved session information was displayed.

Figure 5-5: Session Information (Previous release) [4].

In the new release multiple sessions saved in the system will be presented to the user as shown in Figure 5-6.


Figure 5-6: Session Information (New release).

Two other files contain the suggestion list (Figure 5-7) and the history list (Figure 5-8). The suggestion list contains all the suggestions that the user has not processed yet, and the history list contains all the suggestions that the user has processed.

Figure 5-7: Suggestion List.

Figure 5-8: History List.

The new release saves information in ‘Suggestion List’ and ‘History List’ to the ‘usersessions’ table.

5.4 Integration of Recommendation Method

To integrate the recommendation method into SAMBO, we introduced a new class called “Recommendation”. This class takes as input the two ontology source files, the number of segment pairs, and the location where the segments are saved. The generate segment pairs procedure generates segment pairs from the source ontology files according to the given number of segment pairs. We used the SubG algorithm for generating segment pairs. In SubG, a candidate segment pair consists of two sub-graphs according to is-a, one from each ontology, whose roots have the same name. The segment pairs are chosen randomly from the candidate segment pairs such that the chosen segment pairs are disjoint. Once the segment pairs have been generated, the next step is to align them. Our algorithm has a segment pair alignment procedure which computes the segment pairs from the ontologies and applies a matching algorithm to calculate class-based similarity values. It also uses UMLS as an oracle to generate correct alignments for every segment pair.
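The SubG selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAMBO implementation: each ontology is reduced to a hypothetical map from a concept name to its is-a children, only concepts listed as keys in both maps are considered as roots, and the class and method names are invented for the example.

```java
import java.util.*;

// Illustrative sketch only; names and data layout are assumptions.
class SubGSketch {

    // Collects the root and all of its is-a descendants (one segment).
    static Set<String> segment(Map<String, List<String>> children, String root) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String concept = stack.pop();
            if (seen.add(concept)) {
                for (String child : children.getOrDefault(concept, Collections.<String>emptyList())) {
                    stack.push(child);
                }
            }
        }
        return seen;
    }

    // Randomly picks up to n same-named roots whose segments are disjoint in both ontologies.
    static List<String> selectRoots(Map<String, List<String>> onto1,
                                    Map<String, List<String>> onto2,
                                    int n, long seed) {
        List<String> candidates = new ArrayList<>();
        for (String name : onto1.keySet()) {
            if (onto2.containsKey(name)) {
                candidates.add(name); // roots with the same name form a candidate pair
            }
        }
        Collections.sort(candidates); // fixed order so the seed makes the shuffle repeatable
        Collections.shuffle(candidates, new Random(seed));

        Set<String> used1 = new HashSet<>();
        Set<String> used2 = new HashSet<>();
        List<String> chosen = new ArrayList<>();
        for (String root : candidates) {
            if (chosen.size() == n) {
                break;
            }
            Set<String> s1 = segment(onto1, root);
            Set<String> s2 = segment(onto2, root);
            if (Collections.disjoint(used1, s1) && Collections.disjoint(used2, s2)) {
                used1.addAll(s1);
                used2.addAll(s2);
                chosen.add(root);
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Hypothetical toy ontologies.
        Map<String, List<String>> onto1 = new HashMap<>();
        onto1.put("eye", Arrays.asList("retina", "lens"));
        onto1.put("retina", Arrays.asList("macula"));
        onto1.put("ear", Arrays.asList("cochlea"));
        Map<String, List<String>> onto2 = new HashMap<>();
        onto2.put("eye", Arrays.asList("retina"));
        onto2.put("retina", Collections.<String>emptyList());
        onto2.put("ear", Arrays.asList("cochlea"));
        System.out.println(selectRoots(onto1, onto2, 2, 42L));
    }
}
```

In the toy data the segments rooted at "eye" and "retina" overlap, so at most one of them can be selected together with "ear".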

We used KitAMO in the Recommendation class for making alignments. Our KitAMO alignment generator procedure takes the threshold, matchers and weights as input. It loads the segment pair ontologies, counts the concepts in each segment of a loaded segment pair, and calculates the number of term pairs by multiplying the numbers of concepts in the two segments. The execution time is measured while the merging algorithm runs with the given weights and threshold.

The KitAMO alignment generator procedure calculates the precision, recall, f-measure and recommendation score. Precision measures how many of the mapping suggestions in the ontology alignment result are correct; it is defined as the number of correct suggestions divided by the number of suggestions. Recall measures how many of the correct mappings are found in the ontology alignment result; it is defined as the number of correct suggestions divided by the number of correct mappings. The f-measure is twice the product of precision and recall divided by their sum, that is, 2 * precision * recall / (precision + recall). In the current case the f-measure is used as the recommendation score.
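The three measures defined above can be written down directly. The following is a minimal sketch, not the KitAMO code; the class name, method names and the counts in `main` are hypothetical.

```java
// Illustrative sketch only; class and method names are assumptions.
class AlignmentMetrics {

    // Precision: correct suggestions divided by all suggestions.
    static double precision(int correctSuggestions, int suggestions) {
        return suggestions == 0 ? 0.0 : (double) correctSuggestions / suggestions;
    }

    // Recall: correct suggestions divided by all correct mappings.
    static double recall(int correctSuggestions, int correctMappings) {
        return correctMappings == 0 ? 0.0 : (double) correctSuggestions / correctMappings;
    }

    // F-measure: 2 * P * R / (P + R), defined as 0 when both are 0.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 10 suggestions, 8 of them correct, 16 correct mappings overall.
        double p = precision(8, 10);
        double r = recall(8, 16);
        System.out.println("precision=" + p + " recall=" + r + " f-measure=" + fMeasure(p, r));
    }
}
```

Since the f-measure is used as the recommendation score here, ranking strategies amounts to sorting them by the value returned by `fMeasure`.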

5.4.1 Predefined Strategies Information

The ‘predefinedstrategies’ table is used in recommendation process. It holds information for different combinations of matchers, weights and threshold. This table is used to generate expected results of the selected ontology pairs.

Figure 5-9 shows the list of predefined strategies that are loaded from the ‘predefinedstrategies’ table.


The system saves the results of the recommendation computation, which has to be run at least once, to the ‘savedpredefinedstrategies’ table. The next time, the system loads the information from ‘savedpredefinedstrategies’ so that previously run computations are retrieved quickly. For each predefined strategy, the suggestions it produces are saved in the ‘savedpredefinedstrategiessuggestions’ table.
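The compute-once, load-afterwards behaviour described above can be sketched as follows. This is only an illustration: an in-memory map stands in for the ‘savedpredefinedstrategies’ table, and the class name, the string key and the score values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only; the map is a stand-in for the database table.
class StrategyResultCache {
    private final Map<String, Double> saved = new HashMap<>();
    private int computations = 0;

    // Returns the saved score for a strategy, computing and saving it on first use.
    double scoreFor(String strategyKey, Function<String, Double> compute) {
        Double cached = saved.get(strategyKey);
        if (cached != null) {
            return cached; // fast path: result of an earlier run
        }
        computations++;
        double score = compute.apply(strategyKey);
        saved.put(strategyKey, score);
        return score;
    }

    // Number of times the expensive computation actually ran.
    int computations() {
        return computations;
    }

    public static void main(String[] args) {
        StrategyResultCache cache = new StrategyResultCache();
        // Hypothetical key and score; the lambda stands in for the real alignment run.
        cache.scoreFor("WL@0.8", key -> 0.5);
        cache.scoreFor("WL@0.8", key -> 0.5);
        System.out.println("computations: " + cache.computations());
    }
}
```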

Figure 5-10 shows the results generated using the predefined strategies. The important information about each result is its recall, precision, f-measure and recommendation score. The recommendation score, calculated using the recommendation algorithm integrated from “Method for Recommending Ontology Alignment Strategies”, decides which strategy is best for the alignment process.

Figure 5-10: Results from Predefined Strategies.

5.5 Integration of PRA methods

To integrate the PRA methods into SAMBO we developed a PRA class. It takes the loaded ontologies and the processed suggestions with equivalence relations as input. The PRA class offers three algorithms for obtaining PRA suggestions. The fPRA method filters the remaining suggestions using the processed suggestions with equivalence relations: a remaining suggestion is removed when one of its terms matches a term in a processed suggestion, so that a larger share of the suggestions left over is correct. We also used dtfPRA and mgPRA. The dtfPRA method uses the consistent group in the PRA to filter the suggestions between the upper threshold and the lower threshold, and mgPRA finds consistent groups in the PRA and partitions the ontologies into mappable groups before aligning.
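The term-matching filter described for fPRA can be sketched as follows. This is a minimal reading of the description in this section, not the SAMBO PRA class: it assumes that a suggestion is a pair of terms, one per ontology, and that a remaining suggestion is dropped when either term already occurs in a processed equivalence suggestion. All names and the sample terms are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names and the exact filter condition are assumptions.
class FpraFilterSketch {

    // A mapping suggestion between a term of the first and a term of the second ontology.
    static final class Suggestion {
        final String term1;
        final String term2;

        Suggestion(String term1, String term2) {
            this.term1 = term1;
            this.term2 = term2;
        }
    }

    // Keeps only the remaining suggestions that share no term with the PRA.
    static List<Suggestion> filter(List<Suggestion> remaining, List<Suggestion> pra) {
        Set<String> mapped1 = new HashSet<>();
        Set<String> mapped2 = new HashSet<>();
        for (Suggestion s : pra) {
            mapped1.add(s.term1);
            mapped2.add(s.term2);
        }
        List<Suggestion> kept = new ArrayList<>();
        for (Suggestion s : remaining) {
            if (!mapped1.contains(s.term1) && !mapped2.contains(s.term2)) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical data: one accepted equivalence and three open suggestions.
        List<Suggestion> pra = Arrays.asList(new Suggestion("eye", "Eye"));
        List<Suggestion> remaining = Arrays.asList(
                new Suggestion("eye", "Eyelid"),
                new Suggestion("iris", "Eye"),
                new Suggestion("lens", "Lens"));
        for (Suggestion s : filter(remaining, pra)) {
            System.out.println(s.term1 + " = " + s.term2);
        }
    }
}
```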

When the user locks the session, the session information is saved to the ‘usersessions’ table, and a notification box tells the user that the session has been saved successfully. Next to the History link, on the right-hand side, there is another link called Align using PRAs. Figure 5-11 shows a screenshot containing the Align using PRAs link.


Figure 5-11: Locked Session.

A user can apply the different alignment algorithms that have been integrated into the current system, shown in Figure 5-12. The user selects an algorithm and performs the alignment using the PRA.

Figure 5-12: PRA Alignment Algorithm.

5.6 Classes and Functions

We have implemented new classes and added new functions to already defined classes.

The descriptions of all the functions are given in section 9.1, appendix A and the descriptions of all the newly defined classes are given in section 9.2, appendix B.


Chapter 6

6. Evaluation and Testing

6.1 Database driven multi-session handling

To evaluate multiple session handling, we tested the system by saving different sessions to the database we designed. Currently there is only one user, who can manage multiple sessions.

A session saved in the database holds information about the user, ontologies, matchers, weights, thresholds, session type, session id, suggestion list, history list, creation time, and last accessed time. We checked this information to make sure that the correct session information is saved and loaded.

If a user has more than one session when loading, the system loads the session that the user chooses from the list of sessions available in the database. See Figure 5-2 (SAMBO database schema) above for the database design behind the multi-session functionality.

Figure 6-1 shows multi-session information that has been retrieved from the database and is displayed on the page.


6.2. Evaluation

For our evaluation we have used two well-known biomedical ontologies [3].

Medical Subject Headings (MeSH) provides a comprehensive vocabulary for indexing journals, articles and books. It also serves as a thesaurus that facilitates searching. MeSH is created and maintained by the United States National Library of Medicine (NLM) and is used by MEDLINE/PubMed. Adult Mouse Anatomy (MA) is maintained by Mouse Genome Informatics (MGI). MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease (see footnotes 6 and 7).

6.2.1. Precision and Recall

Suppose we have a Strategy A composed of the matchers WL, EditDistance, and NGram, with weight 1.0 for each matcher and a threshold of 0.6. With this combination, the generated values for precision, recall, f-measure and recommendation score are 0.8, 0.9, 0.83 and 0.83 respectively. We then took another Strategy A’ composed of the same set of matchers with the same threshold, but with the weights set to 2.0 for all matchers. The generated values in this case are the same as for Strategy A. This experiment shows that the weights are normalized: scaling all weights by the same factor does not change the result.
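One combination rule that would produce this behaviour is a weighted sum divided by the sum of the weights. The sketch below assumes that rule for illustration; it is not taken from the SAMBO code, and the class name and the similarity values are hypothetical.

```java
// Illustrative sketch only; the normalization rule is an assumption.
class NormalizedWeightedSum {

    // Combines one similarity value per matcher into a single value,
    // dividing the weighted sum by the sum of the weights.
    static double combine(double[] similarities, double[] weights) {
        if (similarities.length != weights.length) {
            throw new IllegalArgumentException("one weight per matcher is required");
        }
        double weighted = 0.0;
        double total = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            weighted += weights[i] * similarities[i];
            total += weights[i];
        }
        return total == 0.0 ? 0.0 : weighted / total;
    }

    public static void main(String[] args) {
        // Hypothetical similarity values from three matchers for one term pair.
        double[] sims = {0.9, 0.5, 0.7};
        double a = combine(sims, new double[] {1.0, 1.0, 1.0});
        double aPrime = combine(sims, new double[] {2.0, 2.0, 2.0});
        System.out.println(a + " " + aPrime);
        System.out.println("same decision at threshold 0.6: " + ((a >= 0.6) == (aPrime >= 0.6)));
    }
}
```

Under this rule only the ratio between the weights matters, so a strategy with weights (1.0, 1.0, 1.0) and one with weights (2.0, 2.0, 2.0) select the same term pairs at any given threshold.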

Let us consider another case, in which Strategy A is composed of the matchers WL, EditDistance, and NGram with weight 2.0 for each matcher and threshold 1.0. The results differ from the case discussed above: the values for precision, recall, f-measure, and recommendation score are 1.0, 0.85, 0.88, and 0.88 respectively. Raising the threshold to 1.0 thus also raises the precision to 1.0. Precision measures how many mapping suggestions are correct in the ontology alignment; since the precision is high in this case, we can conclude that there are fewer wrong mapping suggestions.

We studied another case in which Strategy A is composed of the matchers WL, EditDistance, and NGram with weight 2.0 for each matcher and threshold 0.0. In this case the generated values for precision, recall, f-measure, and recommendation score are 0.1, 1.0, 0.18 and 0.18 respectively. Setting the threshold to 0.0 thus drops the precision to 0.1 and raises the recall to 1.0. Hence, decreasing the threshold increases the recall, but at the cost of more wrong mapping suggestions.

6. http://en.wikipedia.org/wiki/Medical_Subject_Headings#Structure_of_MeSH Accessed 2011-08-24 12.00

7. http://www.informatics.jax.org/searches/AMA_form.shtml Accessed 2011-09-02 00.15

The results of the different test cases for threshold, precision and recall are shown in Figure 6-2 below.

Figure 6-2: Precision & Recall
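The trade-off between threshold, precision and recall discussed in this section can be reproduced on a small scale. The sketch below is an illustration with hypothetical data, not the evaluation code: each term pair has a combined similarity value and a flag saying whether it is a correct mapping, and a pair is suggested when its similarity reaches the threshold.

```java
// Illustrative sketch only; data and names are hypothetical.
class ThresholdSweep {

    // Returns {precision, recall} for the pairs whose similarity is >= threshold.
    static double[] evaluate(double[] similarity, boolean[] correct, double threshold) {
        int suggested = 0;
        int correctSuggested = 0;
        int correctTotal = 0;
        for (int i = 0; i < similarity.length; i++) {
            if (correct[i]) {
                correctTotal++;
            }
            if (similarity[i] >= threshold) {
                suggested++;
                if (correct[i]) {
                    correctSuggested++;
                }
            }
        }
        double precision = suggested == 0 ? 0.0 : (double) correctSuggested / suggested;
        double recall = correctTotal == 0 ? 0.0 : (double) correctSuggested / correctTotal;
        return new double[] {precision, recall};
    }

    public static void main(String[] args) {
        double[] sim = {1.0, 1.0, 0.8, 0.7, 0.5, 0.3, 0.2, 0.1};
        boolean[] ok = {true, true, true, false, true, false, false, false};
        for (double threshold : new double[] {0.0, 0.6, 1.0}) {
            double[] pr = evaluate(sim, ok, threshold);
            System.out.println("threshold=" + threshold + " precision=" + pr[0] + " recall=" + pr[1]);
        }
    }
}
```

On this toy data the lowest threshold suggests every pair, which gives full recall and the lowest precision, while the highest threshold keeps only exact matches, which gives full precision and lower recall.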

6.3. PRA

In our evaluation of the PRA framework, we used different PRA algorithms to filter out suggestions based on the equivalence relation. The algorithms used are fPRA, dtfPRA and mgPRA; see Section 2.7.2 for more information on the PRA algorithms. With PRA a larger share of the remaining suggestions is correct, which improves the performance of the current system. Table 6-2 in Section 6.4 shows the results after applying the PRA algorithm.

The evaluation results show that, with a suitable set of matchers, filters and algorithm, we can obtain the desired level of correct suggestions, which in turn helps us to optimize our results further. In addition, PRA is a framework that can filter out suggestions to make the results more desirable. For this thesis, we took screenshots to evaluate the results and to see how PRA helps in getting correct suggestions. We evaluated the system using validation sessions in two steps with PRA.

First Validation Session:

Figure 6-3 shows that, after using the strategy with matchers (EditDistance, NGram, and WL), weights (1.0, 1.0, and 1.0) and threshold 0.6, we aligned 14 suggestions and have 30 remaining suggestions.

Figure 6-3: Remaining Suggestions.

Figure 6-4 shows that, with the same strategy (matchers EditDistance, NGram, WL; weights 1.0, 1.0, 1.0; threshold 0.6), 7 suggestions remain after applying fPRA. Therefore 23 suggestions were removed by fPRA.


Figure 6-4: Remaining Suggestions with PRA

Second Validation Session:

In the second validation session we aligned 2 more suggestions, so the evaluation test starts with 5 remaining suggestions.

Figure 6-5 shows the 5 remaining suggestions after using the strategy with matchers (EditDistance, NGram, WL), weights (1.0, 1.0, 1.0) and threshold 0.6.


Figure 6-5: Remaining Suggestions.

Figure 6-6 shows that, with the same strategy (matchers EditDistance, NGram, WL; weights 1.0, 1.0, 1.0; threshold 0.6) and after applying PRA, we still have 5 remaining suggestions.


Hence, in the first validation session the number of remaining suggestions was reduced from 30 to 7, whereas in the second validation session there were 5 remaining suggestions before applying PRA and the number stayed the same after applying it. This means that we now have all the correct mappable suggestions and that no term pairs with common terms are left.

Using several validation sessions instead of a single one is very effective. The remaining suggestions are saved when we lock the session, so the next time we want to validate the same session we reload it and continue validating from where we stopped on the last occasion. The results of the alignment are the same; the only difference is that previously we had to go through all the suggestions in one go to get the correct suggestions, whereas now we can lock the session at any suggestion, reload it from its saved state and continue validating.

With the integration of PRA into SAMBO, PRAs can be used across multiple validation sessions, which was not possible in the previous release.

6.4. Recommendation

In our evaluation of the recommendation component, we compared different strategies using several combinations of matchers, thresholds and weights. In this way we generated different values of precision, recall, f-measure and recommendation score. These values helped us understand the behavior of the different combinations and pick out the best ranked strategy for aligning and merging ontology pairs.

Here we discuss different cases to see how the values differ with various combinations of matchers, thresholds and weights.

The table below defines the test cases used to align ontology pairs.

Test Case   Strategy                      Threshold   Weights
1           WL                            0.8         1.0
2           C1 (NGram,EditDistance,WL)    0.7         1.0,1.0,1.0
3           C2 (NGram,EditDistance,WN)    0.7         1.0,1.0,1.0

Table 6-1: Test cases for Evaluation.

The table below gives the results using recommendation score. We recommend wordlist matcher with threshold 0.8 as the best strategy and C1 with threshold 0.7 as the second best, and C2 as the third best.


We applied the test to the eye dataset and found that our best strategy produces up to 23 term pairs as suggestions, whereas the second and third best produce up to 25 term pairs.

Applying PRA to the best strategy after 3 processed suggestions reduced the number of expected mappings to 16. For the second best strategy, after 3 processed suggestions, the number was reduced to 18, and for the third best, after 2 processed suggestions, to 19.

Strategy                      Processed Sugg.   Remaining Sugg.   Remaining Sugg. (PRA)   Rank
WL                            3                 23                16                      1
C1 (NGram,EditDistance,WL)    3                 25                18                      2
C2 (NGram,EditDistance,WN)    2                 25                19                      3

Table 6-2: Results of Evaluation.
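The effect of PRA in Table 6-2 can be expressed as the number and share of suggestions removed per strategy. The short sketch below only redoes that arithmetic on the figures from the table; the class and method names are hypothetical.

```java
// Illustrative sketch only; it recomputes differences from the figures in Table 6-2.
class PraReduction {

    // Number of suggestions removed by PRA.
    static int removed(int remaining, int remainingAfterPra) {
        return remaining - remainingAfterPra;
    }

    // Share of the remaining suggestions that PRA removed.
    static double removedShare(int remaining, int remainingAfterPra) {
        return remaining == 0 ? 0.0 : (double) removed(remaining, remainingAfterPra) / remaining;
    }

    public static void main(String[] args) {
        String[] strategy = {"WL", "C1", "C2"};
        int[] remaining = {23, 25, 25};
        int[] afterPra = {16, 18, 19};
        for (int i = 0; i < strategy.length; i++) {
            System.out.println(strategy[i] + ": removed " + removed(remaining[i], afterPra[i])
                    + " (share " + removedShare(remaining[i], afterPra[i]) + ")");
        }
    }
}
```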


Figure 6-8: Test case WL using PRA


Figure 6-10: Test case C1 (NGram, Edit Distance, and WL) with PRA


Figure 6-12: Test case C2 (NGram, Edit Distance, and WN) with PRA

6.5. Testing

We tested our work using black box and white box testing, which helped us identify critical bugs in the system. Finally, we performed acceptance testing to make sure that the system behaves according to the requirements defined in this thesis.


Chapter 7

7. Conclusion and Future work

7.1. Conclusion

In this thesis, we integrated the recommendation component into the SAMBO system to tackle the problem of deciding which strategy to use for a particular alignment problem. We also used PRA to align ontologies. We noticed that using PRA in preprocessing and filtering reduces the number of suggestions, and also helps to improve the precision and, in some cases, the recall. In addition, we made SAMBO database driven so that it manages session information, expected results from the recommendation component and PRA-filtered suggestions.

7.2. Future Work

SAMBO is designed so that multi-user functionality can be added to the system. The results obtained from the integration of the recommendation process are quite good, but it is important to perform more tests with different kinds of ontologies and alignment strategies. We used one algorithm, SubG, to segment ontology pairs; other algorithms can also be used in the future. The integration of the PRA algorithms should be tested on other ontologies and with different base algorithms. Combinations of methods can also be investigated. In the current system we still use XML files to populate the ‘usersessions’ table; this can be changed by eliminating the XML files and saving and retrieving information solely from the database. Paging can be applied to the list of saved sessions displayed to the user on login.


Bibliography

[1] Peter D. Karp: An ontology for biological function based on molecular interactions, Bioinformatics 2000, 16-269

[2] Z. Khan. Muzammil: MSc Thesis. A Session based system for alignment of large ontologies. Department of Computer and Information Science (IDA), Linköping University. (September 2010)

[3] Tan H, Lambrix P.: A Method for Recommending Ontology Alignment Strategies: ISWC/ASWC 2007, LNCS 4825, pp. 494-507, 2007. Springer-Verlag Berlin Heidelberg (2007).

[4] Lambrix P, Liu Q.: Using Partial Reference Alignments to Align Ontologies, ESWC 2009, LNCS 5554, pp. 188-202, 2009. Springer-Verlag Berlin Heidelberg (2009).

[5] Tim Berners Lee, James Hendler and Ora Lassila. The Semantic Web, Scientific American Magazine, May 17, 2001.

[6] Ian Horrocks. Ontologies and the Semantic Web, Communications of the ACM, December 2008, vol. 51 no. 12, 2010.

[7] Neches R., Fikes R., Finin T., Gruber T., Senator T., and Swartout, W. Enabling technology for knowledge engineering, AI Magazine 12(3):26-56, 1991.

[8] A Gomez-Perez. Ontological Engineering: A State Of The Art. In Expert Update, pages 33-45, 1999.

[9] R Stevens, CA Goble, and S Bechhofer. Ontology-based Knowledge Representation for Bioinformatics. Briefings in Bioinformatics, 1(4):398-414, 2000.

[10] Lambrix, P., Tan, H.: Ontology Alignment and Merging. Chapter 6 in Burger, Davidson, Baldock, (eds), Anatomy Ontologies for Bioinformatics: Principles and Practice, 133-150, Springer, 2008. ISBN: 978-1-84628-884-5.

[11] OntoWeb Consortium, A survey on ontology tools, Deliverables 1.3 (2002).

[12] Lambrix, P., Edberg, A.: Evaluation of ontology merging tools in bioinformatics. Proc. of the Pacific Symposium on Biocomputing 8, 589-600 (2003)

[13] Lambrix P, Tan H: SAMBO - A System for aligning and merging biomedical ontologies. Journal of Web Semantics 4 (3), 196-206 (2006).

[14] Lambrix, P., Tan, H.: A Tool for Evaluating Ontology Alignment Strategies. Journal on Data Semantics VIII, 182-202 (2007).


Appendices

Appendix A: Implemented Functions Description

Class or Page Name: LoadSessionServlet.jsp
Private Functions:
Public Functions: GenerateUserSessionFromDb();
Private Variables:
Public Variables:
Package Name:
Description: This Servlet is used to load the session information from the database.

Class or Page Name: LockSessionServlet.jsp
Private Functions:
Public Functions: SaveUserSessionToDb();
Description: This Servlet will lock the session to the database.

Class or Page Name: MainServlet.jsp
Private Functions:
Public Functions: RemoveUserSessionFromDb();
Description: This Servlet will remove the user session from the database.

Class or Page Name: recommendations.jsp
Private Functions:
Public Functions: getQualityAndExecutionMeasures ();
Description: This page provides an interface for calling a

Class or Page Name: strategies.jsp
Private Functions:
Public Functions: getPredefinedStrategies ();
Description: This page will load predefined strategies from the database.

Class or Page Name: pras alignment.jsp
Private Functions:
Public Functions: getPRAsAlignment();
Description: This page will load PRA alignment


Appendix B: Implemented Classes Description

Class Name: Recommendation
Package Name: se.ida.liu.sambo.Recommendation
Functions: GenerateSegmentPairs, SegmentPairsAlignmentGenerator, KitAMOAlignmentGenerator
Description: Recommendation performs the following:
GenerateSegmentPairs: It will generate segment pairs from ontologies.
SegmentPairsAlignmentGenerator: It uses UMLS as an oracle to align the segment pairs.
KitAMOAlignmentGenerator: It uses segments results from the alignment generator to generate results for f-measure, quality, precision, recall and recommendation score.

Class Name: PRA
Package Name: se.ida.liu.sambo.PRA
Description: This class will be used for Partial Reference Alignment of Ontologies. It has three types of algorithms:
Filter with PRA - fPRA
Double Threshold Filter with PRA - dtfPRA
Mappable Groups and Fixing with PRA - mgPRA

Class Name: SubG
Package Name: se.liu.ida.sambo.segPairSelAlgs
Description: This class is used for generating segment pairs using Sub-Graph (Sub-G).

Class Name(s): UsersDao, StrategiesDao, UserSessionsDao, UsersUserSessionDao, PredefinedStrategiesDao, SavedPredefinedStrategiesDao, SavedPredefinedStrategiesSuggestionsDao
Package Name: se.liu.ida.sambo.dao
Description: These classes will do CRUD operations as they are mapped to tables in the database.

Class Name(s): Users, UsersPk, Strategies, StrategiesPk, UserSessions, UserSessionsPk, UsersUserSession, UsersUserSessionPk, PredefinedStrategies, PredefinedStrategiesPk, SavedPredefinedStrategies, SavedPredefinedStrategiesPk, SavedPredefinedStrategiesSuggestions, SavedPredefinedStrategiesSuggestionsPk
Package Name: se.liu.ida.sambo.dto
Description: These classes have the getters and setters that are mapped to the table fields in the database.

Class Name(s): DaoException, UsersDaoException, StrategiesDaoException, UserSessionsDaoException, UsersUserSessionDaoException, PredefinedStrategiesDaoException, SavedPredefinedStrategiesDaoException, SavedPredefinedStrategiesSuggestionsDaoException
Package Name: se.liu.ida.sambo.exceptions
Description: Show exception messages for each of the respective classes.

Class Name(s): UsersDaoFactory, StrategiesDaoFactory, UserSessionsDaoFactory, UsersUserSessionDaoFactory, PredefinedStrategiesDaoFactory, SavedPredefinedStrategiesDaoFactory, SavedPredefinedStrategiesSuggestionsDaoFactory
Package Name: se.liu.ida.sambo.factory
Description: These classes are used to create a connection to the JDBC adapter to make queries to the database.

Class Name(s): UsersDaoFactory, StrategiesDaoFactory, UserSessionsDaoFactory, UsersUserSessionDaoFactory, PredefinedStrategiesDaoFactory, SavedPredefinedStrategiesDaoFactory, SavedPredefinedStrategiesSuggestionsDaoFactory
Description: This class is used to create a connection.

Class Name(s): AbstractDAO, UsersDaoImpl, ResourceManager, StrategiesDaoImpl, UserSessionsDaoImpl, PredefinedStrategiesDaoImpl, SavedPredefinedStrategiesDaoImpl, SavedPredefinedStrategiesSuggestionsDaoImpl
Package Name: se.liu.ida.sambo.jdbc


Glossary

F-measure - is a weighted harmonic mean of precision and recall.

Ontology Alignment - is a process for finding mappings between entities from different source ontologies. Depending on the specific ontology alignment task, the entities can be concepts, relations or instances, and the relationship of the mappings can be equivalence as well as is-a, part-of or any other kind of relation.

Precision - measures how many of the mapping suggestions are correct in the ontology alignment result, which is defined as the number of correct suggestions divided by the number of suggestions.

Partial Reference Alignment (PRA) - is a set of correct mappings between entities from two ontologies.

Recall - measures how many correct mappings are found in the given ontology alignment results, which is defined as the number of correct suggestions divided by the number of correct mappings.

Reference Alignment (RA) - is a complete set of mappings between entities from two ontologies.

Resource Description Framework (RDF) - is a language that provides a flexible mechanism for describing web resources and the relationships among them.

Web Ontology Language (OWL)

-

is the standard Web Ontology Language proposed by W3C in