Adaptation of a distributed inference algorithm and migrating to P2P in the Semantic Distributed Repository

Edward Tjörnhammar


Abstract

SDR is short for Semantic Distributed Repository. It is a distributed infrastructure intended to alleviate problems in information exchange between different organizations. Resources are considered heterogeneous and can range from project specific files, simulation components and sensor data to service descriptions. The infrastructure is a hybrid between an index service, a filesystem and a middleware for distributed semantic resource lookup. SDR uses techniques common to the domains of P2P and the Semantic Web.

SDR has been running as part of different projects at the Swedish Defence Research Agency (FOI) ranging from 2005 to 2008. It has been conducted within the domain of Simulation & Distributed Systems. The work presented in this thesis was conducted between June and December 2006.


List of Figures

1 Example skip list node overlay where arrows denote routing knowledge

2 BNF grammar for a typical “/etc/hosts” file. Lexical tokens have been replaced by their symbolic representation since their regular expressions are quite long. We can also see that this grammar ignores white spaces except newline NL.

3 A conceptual deconstruction of a description logics tableau into triples

4 A partition of the W3C Language Stack

5 A sample OWL file

6 A Simple WSDL Defined Service

7 Sample SOAP invocation chain

8 Component diagram of the SDR grid service as envisioned and designed in the Cl2 prototype.

9 Overview of the system's components interrelationships.


List of Abbreviations

API Application Program Interface

ACL Access Control List

BNF Backus-Naur Form

BOM Base Object Model

BT BitTorrent

CORBA Common Object Request Broker Architecture

CVS Concurrent Versions System

DHT Distributed Hash Table

DL Description Logic

DNS Domain Name System

DKS Distributed Kernary System

DRONT Distributed Repository Ontology

FOI Swedish Defence Research Agency

FTP File Transfer Protocol

GSI Grid Security Infrastructure

GT Globus Toolkit

HTTP Hypertext Transfer Protocol

HTTPS Hypertext Transfer Protocol Secure

IDL Interface Description Language

IR Information Retrieval

LSA Latent Semantic Analysis

LRC Local Replica Catalog

REST REpresentational State Transfer

RLI Replica Location Index

KB Knowledge Base

NetSim Network based Simulation and modelling

MP3 MPEG Audio Layer III

OWL Web Ontology Language


OGSA Open Grid Service Architecture

OGSI Open Grid Service Infrastructure

P2P Peer to Peer

RAM Random Access Memory

RDF Resource Description Framework

RDFS RDF Schema

RDQL RDF Query Language

RLS Replica Location Service

SOAP Simple Object Access Protocol

SPARQL SPARQL Protocol and RDF Query Language

SHA Secure Hash Algorithm

SDR Semantic Distributed Repository

SOA Service Oriented Architecture

SaaS Software as a Service

UDDI Universal Description Discovery and Integration

UDP User Datagram Protocol

UDT UDP Data Transfer protocol

URI Uniform Resource Identifier

WSDL Web Services Description Language

WSRF Web Service Resource Framework

W3C World Wide Web Consortium

XML eXtensible Markup Language


Contents

1 Introduction
1.1 Semantics
1.2 Information retrieval
1.3 Distribution
1.4 Network based Simulation and Modelling
1.5 Semantic Distributed Repository
1.6 Problem Definition
1.7 Structure of Thesis

2 Background
2.1 Hash Functions
2.2 Hash Tables
2.3 Hash Tree
2.4 Distributed Hash Tables
2.5 Consistent Hashing
2.6 Distributed Kernary System
2.7 Bloom filters
2.8 Ontologies
2.9 Backus-Naur Form
2.10 Inference Engines
2.11 Knowledge Base
2.12 Description Logics
2.12.1 ALC
2.12.2 Description Logic Classes
2.13 W3C Language Stack
2.14 Jena
2.15 Grid Computing
2.15.1 Web Service Description Language
2.15.2 Simple Object Access Protocol
2.15.3 Web Service Resource Framework
2.15.4 Globus Toolkit
2.16 REpresentational State Transfer (REST)
2.17 SDR-Cl2 grid service
2.18 SDR - Fortress
2.19 Distributed Reasoning

3 Requirements
3.1 Decisions & Motivations
3.2 Entailments

4 Design
4.1 Data Model
4.1.1 Usage Consistency
4.1.2 Naming Strategy
4.2 Web Service
4.2.1 Routes
4.2.2 Representations
4.3 Core
4.3.1 Basic Operations
4.3.2 Transactions
4.3.3 Chunking
4.4 Distributed Hash Table
4.4.1 Storage
4.4.2 Security
4.5 Semantic
4.5.1 OWL-DL to OWL-RDF and back
4.5.2 Distributed Inference Algorithm

5 Implementation
5.1 SDRCore
5.2 Hashing
5.3 Bloom Filter
5.4 SDR Utils
5.5 SPARQL Protocol and RDF Query Language (SPARQL) Parser
5.6 Node interface
5.7 Modifications to Distributed Kernary System (DKS)

1 Introduction

We want to produce a system with precise and, hopefully, efficient semantic resource discovery. By “resource” we refer to any piece of data or metadata which we want to be able to rediscover via the system. The main goal of the system, and subsequently of this thesis, is to design and implement an architecture which uses:

• Semantic resource descriptions for indexing and discovery.

• A P2P overlay for resource propagation.

• A distributed inference algorithm.

This work is a continuation of the work in [25, 31, 13], which are earlier iterations of a grid based semantic resource storage system.

1.1 Semantics

In linguistics, semantics refers to the ascribed meaning of words. Many linguists and mathematicians have looked at the construction of a well defined language: a language where each infinitesimal concept can be described with a logical truth formula. Many words convey different meanings depending on the context in which they are used. This context, in which a word's meaning is captured, is called semantics.

Due to the inherent fuzziness in all real world descriptions, and even in the modelling of such descriptions, the nature of a construct language able to capture semantics is not known. There does, however, still exist a need, especially for computer scientists, to reason about concepts in a black box fashion.

The idea that naturally emerges is to disregard any fuzziness and just be content that, while the conceptual description can be broken down further, its present form captures enough possible descriptive contingencies.

One needs to create a semantic system, a set of constructs, to describe knowledge for our domain in a way which is, by means of logic, provable, decidable and complete. Such a construct system enables humans and machines to reason about real world descriptions in a well defined manner.

1.2 Information retrieval

A publishing system needs ways to efficiently announce and/or store information so that requesting agents¹ receive good accuracy in searches but also don't have to wait too long in order to receive an answer. The main variable that helps improve accuracy and speed is structure, or rather classification. Some classification can be done as front work, preprocessing of data, and as such performed without any knowledge of the incoming query. One example can be when a person is searching for MPEG Audio Layer III (MP3)s on a service like a BitTorrent (BT) tracker. The tracker's registry system does some of the work at publishing time, since the user resource files, i.e. torrents, are usually inserted into categories like movies, books, audio and so forth. Such categorisation is commonly referred to as taxonomic classification. Since the user who is performing the search is clearly only interested in music, he or she should choose to search inside the audio category, since this will cut the search space in a considerable manner.

¹ A piece of software which interacts on behalf of a computer process or user with the system at hand.

This means, however, that part of the search algorithm has now been performed by the user. This can be considered a crude technique for resource publishing and discovery, but it is still of importance to refresh these old concepts. One must remember that the registry system we write is to perform equivalent actions, since it needs ways to harvest and categorise resources in a transparent, automated manner.

Search engines today mainly use keyword based search strategies. This means that the registry uses different text metrics to find correlation between the search data and the query. The data is, at publish/harvest time, fed to an analyser which uses a heuristic to look at how often keywords appear and in what context. With enough data the metric function will yield good approximate distances between resources. This makes it possible for the engine to place both data and query inside a semantic space. These kinds of techniques are collected within their own discipline, simply named Information Retrieval (IR).

In contrast to these metric based IR methods we have semantic IR methods. These always yield perfect precision and recall, but require a perfectly formulated query and perfectly formulated meta data. This means that both the query, i.e. the search problem constraints, and the meta data need to share the same structuring. In the non-inference case, i.e. when you have a traditional search problem, the tuples need to be exactly present in the meta dataset. Different semantic inference algorithms differ in how well they are adapted to expanding and translating the user's search constraints onto the meta dataset.

1.3 Distribution

Distribution in this case refers to data location: where should data be collected? If it is done through a single server we get a single point of failure, and we also get scalability issues since the server can only store a limited amount of resources and only handle so many search requests. One could always use a tree structure to store the information, like the Domain Name System (DNS), or a hash table structure, like BT trackers, to alleviate the storage demand. For load balancing the system can do its best via its own structuring, but it will also need the help of the messaging system, since this has to be done before any message has reached the distributed topology.

1.4 Network based Simulation and Modelling

The Swedish Defence Research Agency (FOI) was focusing on a system which would aid users to collaboratively model, compose and execute simulation components. These simulation components are defined as Base Object Model (BOM)[1] files and this programme was named Network based Simulation and modelling (NetSim).


If there were a way to enable such a registry, it would directly translate into an increased reuse of simulation components.

1.5 Semantic Distributed Repository

Semantic Distributed Repository (SDR) was a part of the NetSim initiative and its role was to index all information related to the BOM files, as well as to store the BOM files themselves. Further, the intention of SDR was to provide a way to discover resources using semantic querying. SDR was initially implemented by Baymani & Stridfeldt[13] and then extended by Torres & Pan[31].

Despite these earlier inceptions there was still no way to semantically search for all stored resources in the system. Resource descriptions were only discoverable through the same node where they were first published.

1.6 Problem Definition

One of the problems with semantic based search is that it doesn't scale. Having lots of nodes, and lots of description data, but no way to distribute the processing load seems wasteful. As such we would like a way to utilize as many of the available nodes as possible. It doesn't matter if it is done by imposing a more rigorous structuring onto the nodes or by using new, or clever, algorithms.

SDR is to be extended in such a way that users can search for available resources using a semantic query. The metadata files are currently stored and accessed through a NetSim File Transfer Protocol (FTP) process and replicated. Today nodes in SDR can only perform semantic lookup on local resources, which is not enough; the lookup needs to be extended to work on a global scope. There is also an access rights issue: during semantic lookup no node may divulge information that has not first been authorized.

The search function also needs to be scalable. The number of messages and execution time, with a fixed quality query, should not be affected by the number of facts or number of nodes in the system. SDR is finally also meant to migrate to an ontology, called Distributed Repository Ontology (DRONT)[15].

1.7 Structure of Thesis

The report outline and the intention of the different chapters are listed as follows.

Ch. 1 This introduction is meant to serve as a shallow description of some of the key concepts.

Ch. 2 Background knowledge essential to understand the design and implementation decisions.

Ch. 3 Stakeholder requirements and entailments of requirements.

Ch. 4 Explains the design decisions made for the system and introduces the different components in the system; it relies heavily on the background.


2 Background

This section will focus on describing the theoretical background needed to understand all of the aspects of the repository. I have no intention of providing an exhaustive description of all of the different subjects covered here but will try to highlight the most critical aspects which I rely on when making my design and implementation.

2.1 Hash Functions

A hash function h(x) = y, where x ∈ X and y ∈ Y, maps members from a finite or infinite member set X onto a finite and reduced member set Y. It is central to the hash function to distribute members of X uniformly onto Y while still avoiding collisions, which will nevertheless occur since size(Y) < size(X). The hash function is not bijective, since that would imply that the member size of X equals that of Y; it is the pigeonhole principle that forces collisions to occur.

It is desirable for the hash function to spread close neighbours x₁, x₂ ∈ X in the input domain onto distant neighbours y₁, y₂ ∈ Y in the output domain. This helps to distribute the input domain uniformly onto the output domain. Another way of putting it would be: a small value change in the input results in a large value change in the mapped output. This is called the “avalanche effect”[17] in the cryptographic domain and applies to all block ciphers. Secure Hash Algorithm (SHA) is a block based cipher hash, or digest, function which exhibits this effect. Under the assumption of a hash function h which exhibits the avalanche effect, it should be obvious that in order to create a 'new' hash function one just needs to add a namespace separator k to the input x, i.e. h′(x) = h(x + k) or h′(x) = h(x) + k (mod size(Y)). h′ will then be exactly as insecure as h and still exhibit the avalanche effect.

Restated, one can say that even though there will be collisions, the probability of such collisions is so infinitesimally small that we can disregard this aspect of the hash function. The practical implication, in computer science, of having a hash function with a high collision rate is often that of false identity. If establishing identity is not a concern then the hash is most likely used to speed up search, in which case collisions will result in higher complexities. The simplest case is a hash function which has only one output bucket in which to put all data. This will yield a complexity of O(n) instead of the desirable O(1).
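As a minimal illustration of the namespacing trick above (not code from the thesis; class and method names are made up), a SHA-1 digest over the separator-extended input yields a 'new' hash function:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Illustration only: a "new" hash h'(x) = h(x + k) is obtained by appending
// a namespace separator k to the input x before digesting with SHA-1.
public class NamespacedHash {
    public static byte[] hash(String input, String namespace) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-1");
        sha.update(input.getBytes(StandardCharsets.UTF_8));     // the original input x
        sha.update(namespace.getBytes(StandardCharsets.UTF_8)); // the separator k
        return sha.digest();                                    // h'(x) = h(x + k)
    }

    public static void main(String[] args) throws Exception {
        // The same input hashed under two namespaces gives unrelated digests.
        System.out.println(java.util.Arrays.equals(
                hash("someResourceId", "metadata"),
                hash("someResourceId", "data")));               // prints false
    }
}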

2.2 Hash Tables

We can now describe hash tables by using hash functions. A hash table is a dictionary. By that we mean a container consisting of two groups, a key group and a value group, and a bijective mapping function between entities in the two sets. The dictionary becomes a hash table when the mapping function is replaced by a surjective hash function.

In programming languages hash tables are not as popular as dictionaries since a dictionary has no conflicts and can be trivially implemented by using the data values’ memory addresses as the keys and the dereference operator as the bijective function.

In Java, for example, an Object by default uses its hashCode function as its mapping, which means that the mapping is bijective since the hashCode function returns an obfuscated memory address of the Object.

2.3 Hash Tree

First some trivia: hash trees are also referred to as Merkle trees, because they were invented by Ralph Merkle, and as Tiger trees, because the most widespread tree hash is the Tiger Tree Hash.

Suppose now that we have a large data object which we need to verify in chunks, e.g. because we need to send the data over a network and it is subject to fragmentation, and we wish to verify both each data chunk and the complete data object. This is when we use a tree hash. Let's assume that the data object consists of four blocks x, y, z, u in total; then we get (+ denotes concatenation here):

h₀₀ = h(x) + h(y) (mod size(Y))

And symmetrically for the two last blocks:

h₀₁ = h(z) + h(u) (mod size(Y))

Since both h₀₀ and h₀₁ are hash values they can in turn be concatenated into a hash:

h₀ = h(h₀₀) + h(h₀₁) (mod size(Y))

And we have the complete binary tree hash for the data object. This way we can verify the correctness of z, u if we have h₀₁. This is also a scheme that many Peer to Peer (P2P) systems employ in order to maintain data consistency.
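A minimal sketch of the binary tree hash, assuming the standard Merkle construction where child digests are concatenated and hashed again (the formulas above instead add the digests modulo size(Y)); names are illustrative, not thesis code:

import java.security.MessageDigest;

// Sketch of a binary tree hash over four blocks x, y, z, u: leaf digests are
// combined pairwise into h00 and h01, which are combined into the root h0.
public class TreeHash {
    static byte[] h(byte[]... parts) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-1");
        for (byte[] p : parts) sha.update(p);
        return sha.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] x = "x".getBytes(), y = "y".getBytes(),
               z = "z".getBytes(), u = "u".getBytes();
        byte[] h00 = h(h(x), h(y));   // covers the first two blocks
        byte[] h01 = h(h(z), h(u));   // covers the last two blocks
        byte[] root = h(h00, h01);    // root hash for the whole data object
        // Holding h01 and the root is enough to verify z and u in isolation.
        System.out.println(java.util.Arrays.equals(root, h(h00, h(h(z), h(u)))));
    }
}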

2.4 Distributed Hash Tables

The Distributed Hash Table (DHT)[12] is, as the name implies, a hash table shared between connected nodes. It is a common P2P technique used to enable distributed data lookups and storage. Each node gets assigned a keyspace interval and is responsible for the (key → value) mappings belonging to its interval in the keyspace. The interval can be determined during runtime by participating nodes since they all agree on the same hash function and partitioning scheme.

2.5 Consistent Hashing

One problem with partitioning and delegating responsibilities is that when a participant exits the whole keyspace needs to be remapped. One usual technique to alleviate this is called consistent hashing[26]. This is a system behaviour and can be summarized as:

• When nodes leave or enter, the system remaps the keyspace to fit the new number of nodes.


More specifically, in consistent hashing each node generates an identifier using the same hash function as the (key → value) pairs on the ring. By clockwise examination of the stored keys one can establish ownership, i.e. objects belong to the next node identifier found on the keyspace ring. If a node parts, its delegated objects will be remapped to the next node on the ring, and likewise if a node joins it will be responsible for objects from its identifier up to, but not including, the next.

Consistent hashing also tries to make the distances between node identifiers as equal as possible, which implies that each node will be responsible for as equal an amount of objects as possible. This is done by letting each node generate more than one identifier.
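A minimal consistent hashing sketch (illustrative only, not the DKS implementation), where nodes and keys share one SHA-1-based hash function and each node registers several identifiers on the ring:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Keys are owned by the first node identifier found clockwise on the ring.
public class Ring {
    private final TreeMap<BigInteger, String> nodes = new TreeMap<>();

    static BigInteger hash(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, d);
    }

    // Each node registers several identifiers ("virtual nodes") to even out
    // the interval sizes, as described above.
    void join(String node, int ids) throws Exception {
        for (int i = 0; i < ids; i++) nodes.put(hash(node + "#" + i), node);
    }

    void part(String node, int ids) throws Exception {
        for (int i = 0; i < ids; i++) nodes.remove(hash(node + "#" + i));
    }

    String ownerOf(String key) throws Exception {
        SortedMap<BigInteger, String> tail = nodes.tailMap(hash(key));
        // Wrap around to the first identifier if no larger one exists.
        return tail.isEmpty() ? nodes.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}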

2.6 Distributed Kernary System

The DKS[22] is a skip list based distributed routing network where each node maintains a keyspace interval layered routing table. The system handles parts, joins, crashes, replication, synchronization and recovery, and is as such an overlay system. It is used in conjunction with a DHT but could, for instance, also be used for implementing multicasting. Simplified, the DKS does not consist of just one distribution ring but of many, referred to as groups.

The routing mechanism divides the keyspace into k intervals, and each part is in turn divided into k intervals. This creates a routing table of logₖ(n) levels. Each node which enters will be assigned a random point on the ring and will, at the top layer, see the top node responsible for each distinct interval. In the last layer the node knows of the next k nodes. This creates a keyspace searchable in O(logₖ(n)) hops.

This idea of interval routing originates from Chord, in which case k = 2, but was later generalized in the DKS system. We can summarize these constants as follows (a small numeric illustration follows the list):

N The maximum number of possible nodes in the system. This is the maximum size of the key used in the system. It is imperative, for the function of the ring maintenance, that all participating nodes use the same hash algorithm throughout the system.

k The identifier space will be divided into k parts and each node will, at each level of magnitude, know of one node at distance 1/k

L Is the level of magnitude and as such the number of levels in the routing table. It follows that this should be chosen as L = logₖ(N). This will yield optimal lookup, or “hop”, complexity.

r Signifies the replication degree; with r = 1 each node will know of only its own interval responsibility. If r = 2, each node will add a shifted interval responsibility and store objects which are inserted into another node's primary interval. If the responsible node does not answer, the replica will be returned instead.
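As a small numeric illustration (not from the thesis), the number of routing table levels L = ⌈logₖ(N)⌉ follows directly from N and k:

// Illustrative only: integer computation of the routing-table depth,
// which bounds lookups to O(log_k(n)) hops as stated above.
public class DksLevels {
    static int levels(long maxNodes, int k) {
        int l = 0;
        for (long span = 1; span < maxNodes; span *= k) l++;   // span = k^l
        return l;                                              // smallest l with k^l >= N
    }

    public static void main(String[] args) {
        // e.g. a 2^32 identifier space divided into k = 4 intervals per level
        System.out.println(levels(1L << 32, 4));               // prints 16
    }
}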



Figure 1: Example skip list node overlay where arrows denote routing knowledge

This specific DKS DHT isn't very useful since, according to Section 2.5, this yields a map where we can store one object per node, which will also be replicated to the other nodes in its routing table. Typically N will be in the order of thousands, but that is a formidable image to fit into this report.

2.7 Bloom filters

A bloom filter is a highly memory efficient representation of a set where the individual members are not stored, only a bit vector hash print. This enables an inspector to run membership queries against the filter. Individual hash prints are OR:ed together, which makes the membership queries somewhat unreliable.

So a filter consists of a bit vector v of m bits and k different hash functions. The different hash functions are used to hash each input value into k different indexes on the bit vector. That is, say input x generates the indexes h₁(x) = i₁, h₂(x) = i₂, . . . , hₖ(x) = iₖ; then v(i₁), v(i₂), . . . , v(iₖ) are set to one. When you later inspect the vector, you once more hash your input with the known hash functions, and if all the indexes are marked then the set contains the element.

False positives are an inherent problem with the bloom filter, i.e. the more objects you add to the filter the more bits will have been flipped in the bit vector. If the hash functions distribute values uniformly then the probability for each position to be flipped is equal. The probability of finding a false positive after insertion of n elements, into a filter of m bits using k hash functions, can then be expressed as:

P(n, m, k) = (1 − (1 − 1/m)ᵏⁿ)ᵏ ≈ (1 − e^(−kn/m))ᵏ

For sufficiently large values of m/n and k the false positive hit rate will be tolerable, e.g. P(1024, 64768, 8) ≈ 3.9 · 10⁻⁸.
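A minimal Bloom filter sketch in Java (illustrative, not the SDR implementation described in Chapter 5); deriving the k index functions by salting a single SHA-1 digest is an assumption of the sketch:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.BitSet;

// k index positions are derived per element and OR:ed into an m-bit vector;
// membership tests may yield false positives but never false negatives.
public class Bloom {
    private final BitSet bits;
    private final int m, k;

    Bloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    private int index(String value, int i) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-1")
                .digest((i + ":" + value).getBytes(StandardCharsets.UTF_8));
        // Use four bytes of the digest as a non-negative index modulo m.
        int v = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16) | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        return Math.floorMod(v, m);
    }

    void add(String value) throws Exception {
        for (int i = 0; i < k; i++) bits.set(index(value, i));
    }

    boolean mightContain(String value) throws Exception {
        for (int i = 0; i < k; i++) if (!bits.get(index(value, i))) return false;
        return true;
    }
}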

2.8 Ontologies

Ontologies are a conceptual representation of a knowledge domain. This is a pure abstraction and holds no restriction upon which construct system to use in such a representation. It is the case that we have such construct languages, and not just one, to declare semantics for a knowledge domain in a well defined manner.

But first we remind ourselves that the basic layers of a natural language are:

Glyphs The alphabet characters are defined as the smallest expressible entity in the language.

Syntax What constitutes a valid expression and what the keywords are, i.e. grammar and words are defined.

Semantics What is the ascribed meaning of expressions, i.e. the phrase “Wilson's horse died from Nipah” means that it is likely that Wilson's horse died from a lethal virus and as such also that the horse was infected on Australian soil. This is known since we have a Nipah Virus ontology that tells us so.

An example of such a declared conceptualisation is the well known Wine Ontology[5], which describes both how wine can be ordered in relation to makers and flavors as well as how to describe different kinds of specific wines. The Wine Ontology has been redeclared a couple of times using different construct languages, but in [5] Web Ontology Language (OWL) is used.

There are two regularly used construct systems for knowledge representation, the first being the triple and the second being the Horn clause. Triples are often expressed via Resource Description Framework (RDF) or OWL, and Horn clauses are almost always restricted to Prolog implementations.

The important remark to make is that the same domain can be structured, given the same construct language, in different ways, such that there does not exist a one to one mapping between the concepts in the ontologies. The easiest example of such a nonexistent mapping can be constructed by using resolution: take two ontologies describing the same knowledge domain, O₁ and O₂, and let there be a concept α₀ ∈ O₁ which is replaced by a graph G with concept vertices {α₁, α₂, α₃, . . . , αₖ} ∈ O₂. This means that all the concepts αᵢ₌₁..ₖ in O₂ are mapped to the same concept α₀. Since we have a language, G, which only one agent can understand, we lose the ability of meaningful information exchange. The point to make is that it is vital that a human or machine agent interacting with another agent shares the same ontology, or ontologies, in order for them to communicate without misinterpretation.

2.9 Backus-Naur Form

The grammars that BNF notation formalizes are, in terms of the Chomsky hierarchy, type 2 (context-free) grammars.

A typical BNF grammar for a /etc/hosts file is shown in Figure 2. Production rules, or expressions, have a left hand side and a right hand side. Terms occurring on the left hand side are called “non-terminals” and terms only occurring on the right hand side “terminals”. Terminals are always evaluated by an alphabetic character or lexical token. Productions may include any number of non-terminals and/or lexical tokens. A lexical token is a symbolic representation of a matching string and is used in place of its matching formula since most toolkits require you to build both the scanner and parser in this two way fashion.

<hosts>     = <hostmaps>
<hostmaps>  = <hostmap> <hostmaps> | <hostmap>
<hostmap>   = <addr> <hostnames>
<hostnames> = <hostname> <hostnames> | <hostname> NL
<addr>      = <ipv4addr> | <ipv6addr>
<ipv4addr>  = IPV4ADDR
<ipv6addr>  = IPV6ADDR
<hostname>  = HNAME

Figure 2: BNF grammar for a typical “/etc/hosts” file. Lexical tokens have been replaced by their symbolic representation since their regular expressions are quite long. We can also see that this grammar ignores white spaces except newline NL.

2.10 Inference Engines

An inference engine can be seen as a database evaluator with a semantic understanding of the schema. These engines reason about asserted facts in the system and “infer” new facts within the bounds of the construct system. The model of computation which the inference engine follows must be proven to yield coherent information and terminate within some finite time. All inference algorithms can be reduced to a search problem, i.e. a satisfiability problem.

The inference engines I will focus on usually work by dividing the world into two sets of data, relational data and assertive data. The assertive data might be data that has been fed to the inference engine or data that has been concluded from earlier fed data. This class of inference engines tries, within their respective reasoning bounds, to employ an expanding tableau matching strategy on the asserted facts.


Lastly, the inference engine is also used as an evaluator for any queries towards the system, since such queries need to be translated into the same data model.

2.11 Knowledge Base

All this information needs to be stored somewhere, which is handled by our Knowledge Base (KB). I will restrict myself to talking only about KBs as they relate specifically to tableau based inference engines.

Abox The Abox is used to store facts, or assertions, about individuals, hence the name assertion box. An Abox tableau may look something like:

    ...
    Robot(r2d2)
    ALU(alu21)
    hasALU(r2d2, alu21)
    = 10 hasOperation(alu21, x).Operator
    ...

Tbox The Tbox collects terminological axioms, also called concept definitions, which can be seen as a concept schemata container. A Tbox looks something like:

    ...
    Android = Robot ⊓ ∃isHumanoid.
    Probe = Robot ⊓ ∃hasThrusters.
    SyntheticWorker = Android ⊔ Probe.
    ...

Or, put in another way, we can say that:

• The Tbox is internalized by its ontology.
• The Abox is internalized by the user data.

So within this context a KB is the combination of a Tbox and an Abox, KB = ⟨A, T⟩. Figure 3 is a non-tableau serialization of this section's example KB. We still need to remember that we can reverse this process since all the constructs used in the first case are explicitly stated in a known form.

2.12 Description Logics


...
r2d2 isa Robot
alu21 isa ALU
r2d2 hasALU alu21
alu21 has Operator
alu21 hasNrOperators 10
...
Humanoid isa Concept
Robot isa Concept
Android isa Humanoid
Android isa Robot
Probe isa Robot
Probe has Thruster
Probe hasAtLeastOne Thruster
SyntheticWorker isa Android
SyntheticWorker isa Probe
...

Figure 3: A conceptual deconstruction of a description logics tableau into triples

2.12.1 ALC

ALC is the origin of the DL languages and tableau algorithms. ALC stands for Attributive Language with Complement; in terms of expressiveness the C is short for UE, where U declares that the DL supports a declarative way to define the union of sets and E stands for full existential restrictions.

ALC only constructs concepts using the operations ⊔, ⊓, ¬, ∀, ∃, and roles are restricted to being atomic. Rules are applied to the tableau until a blocking condition occurs.

There are two types of rules which can be applied to the tableau, identifying and generating. Each time the tableau is evaluated with a rule it is said that it has been interpreted. An interpretation of a primitive p is often formalized as pᴵ, where a primitive is a role or a concept.

All concept descriptions are inserted into A, and T sets are translated to NNF[10], negation normal form, which states that negation only occurs in front of concept names.


2.12.2 Description Logic Classes

The letters stand for different kinds of extensions from the first ALC. Each permutation of letters requires its own algorithm to solve blocking conditions and as such satisfiability. Each expressiveness class enables different deduction constructs. Some classes refer to conceptual constructs, others relational, which means that the different knowledge base boxes get internalized differently depending on the kind of extension.

S Support for transitive roles. The intuitive meaning is correct, i.e. hasAncestor(x, z) is defined as a transitive closure of hasParent(xᵢ, xᵢ₊₁).

H Which means that the description logic supports role hierarchies; isFighterJet is defined to be subsumed by isJet.

R Complex role inclusions

O Means support for singleton classes or nominals. This is much like enumeration, a concept which regulates other concepts, like Pet = {Cat, Dog, Parrot, Gorilla}.

I Means that the DL supports inverse roles; if hasFederate is defined then automatically so is hasFederate⁻¹.

F The DL supports functional roles. This means that if there is an occurrence like {. . . , r(x, y), . . . , r(x, z), . . .} in the Abox then y is replaced by z everywhere in it, if there is no statement like y ≠ z.

N Support for unqualified number restrictions. In the case of Processor ⊔ = 8 hasPart, the declaration = 8 hasPart is unqualified since it does not specify which concepts are allowed for the relation.

Q Support for qualified number restrictions. If Processor ⊔ = 8 hasPart is extended to Processor ⊔ = 8 hasPart.SIMD then we have a qualified number restriction, since it specifies that we are looking for a processor with exactly eight SIMD units.

So, for example, the algorithm for SHOIQ[24] is different from that for SHOIN: they need different blocking conditions and they have different complexities.

2.13 W3C Language Stack

The World Wide Web Consortium (W3C) has constructed a set of recommendations to structure and describe knowledge on the web. All of these standards use extended BNF to describe their syntax. These recommendations are dependent upon one another, so we'll start from the bottom:

eXtensible Markup Language (XML) Which is used to create a structured syntax and defines namespaces.


RDF Defines the triple representation.

RDF Schema (RDFS) Defines means to define local constructs and type class hierarchies. RDFS can define properties both for classes and datatypes.

OWL Further expands RDFS with cardinality, role hierarchies, declaration of disjoint classes, distinct individuals, number restrictions, class properties, and more. OWL also sets out to conform to specific reasoning algorithms. Because of the different complexities and termination guarantees of different description logic families there are three main flavours of OWL:

OWL-Lite Corresponds to SHIF, which means it should integrate with such reasoners.

OWL-DL Corresponds to SHOIN, under some restrictions, e.g. resources cannot both be classes and individuals.

OWL(-Full) Corresponds to SROIQ. Can end up in an indefinite cycle of deductions.

Figure 4: A partition of the W3C Language Stack

So within OWL we can make declarations such as “all individuals in class person who are not male and who have one or more children are mothers; individuals in class person cannot be in class refrigerator”.

These can be serialized as in Figure 5.

2.14 Jena

Jena[14] is a Java Application Program Interface (API) to manipulate RDF and OWL data, much like the SAX and EXPAT XML libraries. Jena also provides basic reasoners and is able to infer statements from injected data.
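A small usage sketch of the Jena API (the package names below are those of Jena 2, com.hp.hpl.jena.*; newer Apache Jena releases use org.apache.jena.* instead): read an OWL file into a model, attach one of the bundled reasoners and list both asserted and inferred statements.

import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.StmtIterator;
import com.hp.hpl.jena.reasoner.Reasoner;
import com.hp.hpl.jena.reasoner.ReasonerRegistry;

public class JenaSketch {
    public static void main(String[] args) {
        Model base = ModelFactory.createDefaultModel();
        base.read("file:sample.owl");                       // e.g. the OWL in Figure 5

        Reasoner reasoner = ReasonerRegistry.getOWLReasoner();
        InfModel inf = ModelFactory.createInfModel(reasoner, base);

        for (StmtIterator it = inf.listStatements(); it.hasNext(); ) {
            System.out.println(it.nextStatement());         // asserted + inferred triples
        }
    }
}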


<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns="http://www.owl-ontologies.com/unnamed.owl#"
    xml:base="http://www.owl-ontologies.com/unnamed.owl">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="Model">
    <owl:disjointWith>
      <owl:Class rdf:ID="File"/>
    </owl:disjointWith>
    <rdfs:subClassOf>
      <owl:Class rdf:ID="Resource"/>
    </rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="hasFiles"/>
      <owl:maxCardinality rdfs:datatype="&xsd;Integer">50</owl:maxCardinality>
    </owl:Restriction>
  </owl:Class>
</rdf:RDF>

Figure 5: A sample OWL file

2.15 Grid Computing

Grid computing is a paradigm which has matured alongside Service Oriented Architecture (SOA) and as such shares some of its components. There isn't a clear definition of what grid computing is, but it is based on the principle that heterogeneous hardware and software should be able to interoperate in virtual organizations forming, for each computational task, a virtual grid. It is a way to communicate securely while at the same time maximizing resource usage in an administratively and geographically dispersed environment. The grid should adapt to avoid hotspots by letting unused resources be allocated by authorized entities, which could, for instance, be an overloaded service. Grid computing differs from SOA or Common Object Request Broker Architecture (CORBA) in that it focuses on connecting installations of computational clusters.

One such standard is the Open Grid Service Architecture (OGSA), which defines the virtual grid as a set of accompanied services with specific capabilities. OGSA does not specify implementation specific information but only general guidelines, i.e. it serves as a standard for the grid community to follow.


merge Grid Services with Web Services. OGSA is as such complemented by the use of XML based technologies (Web Services Description Language (WSDL) and Simple Object Access Protocol (SOAP)) for message passing and service declaration. WSDL and SOAP are specified to be used over the Hypertext Transfer Protocol (HTTP) or Hypertext Transfer Protocol Secure (HTTPS) as transport protocols.

2.15.1 Web Service Description Language

WSDL is an XML based language used to define service endpoints. WSDL defines interfaces for public service methods, message types for requests and responses, transport binding and service location.

This means that code to interface the service can be generated from the specification without much effort. This idea is much like CORBA Interface Description Language (IDL) files, except that WSDL also bundles the actual service instance together with the interface definition. Also, CORBA was “object” and not “service” oriented.

In WSDL each service endpoint is defined as a set of endpoint bindings and each binding specifies one or more operations.

Going from bottom to top; WSDL is used to define types, messages, ports, bindings and services. To understand how these relate see the simple WSDL in Code 6.

2.15.2 Simple Object Access Protocol

SOAP is a schema based on XML meant to be used between connector and stub to send messages in an encoded, safe way. This is similar to ASN.1, except that the SOAP abstract syntax is declared using XML. On service arrival the message is used by the service broker (in the case of Globus Toolkit, an Axis engine) to determine to which object and method the arguments should be delivered.

One of the problems with using ordinary SOAP messages for service invocation is that the standard stack throws away the sender addressing information.

2.15.3 Web Service Resource Framework

WSRF is a standard brought forward by OASIS to meet the demands of stateful web services. WSRF extended web services have many capabilities beyond ordinary WSDL/SOAP ones. These components are typically:

WS-Resource The specification of stateful resources, which is the main reason WSRF came into existence. There is a distinction between the resource and the service: the service only provides access to its resources.

WS-ResourceProperties Get and set interfaces for the resources.

WS-Addressing Transport independent endpoint addressing; remember that SOAP forgets to tell the upper layers who the sender was.


Figure 7: Sample SOAP invocation chain

<!-- SOAP Request -->
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<soap:envelope
    soap:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <soap:body>
    <parrot:talk xmlns:parrot="urn:parrotservice">
      <param1 xsi:type="xsd:string">cracker?</param1>
    </parrot:talk>
  </soap:body>
</soap:envelope>

<!-- SOAP Response -->
<?xml version="1.0" encoding="UTF-8"?>
<soap:envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    soap:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <soap:body>
    <parrot:talk xmlns:parrot="urn:parrotservice">
      <return xsi:type="xsd:string">bwaak!</return>
    </parrot:talk>
  </soap:body>
</soap:envelope>

2.15.4 Globus Toolkit

Globus Toolkit (GT) is a middleware which conforms to OGSA[21] and implements WSRF; it is the reference implementation for these “standards”. Earlier versions of GT only allowed stateless services to be deployed, but this was fixed when GT adopted WSRF in favor of their “Grid Services”. One nice thing is that GT also keeps the service description and the service instance separated and puts them back together again to form a WSDL file.

GT also includes software components for resource monitoring, discovery, security and management. SDR-Cl2 has been integrated with the following components:


Replica Location Service (RLS) Its Local Replica Catalog (LRC) has the responsibility to manage the mapping between logical data names and their physical endpoints. The Replica Location Index (RLI) maintains state information on data items in the LRC to disambiguate referenced replicated items.

GridFTP Handles file transfers of data found via RLS. This actually uses a fast transport protocol called UDP Data Transfer protocol (UDT).

Grid Security Infrastructure (GSI) Manages authentication and authorization of users to provide single sign on.

2.16 REST

In REST[20] usage of the HTTP protocol is central, but REST is, like WSRF, resource centric. This means that Uniform Resource Identifier (URI)s represent some resource and the different HTTP verbs, like GET, PUT, POST and DELETE, modify them. There is no interface language or description, like IDL or WSDL. In other words: all we need to know is in the URI, and users must manually build their own programmatic interfaces.

As an example, doing a POST on the URI

/rodent/fuzzywuzzy/runinwheel

would tell the rodent resource with identifier fuzzywuzzy to execute the runinwheel method.
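A hedged illustration of this style (the host, port and rodent service are hypothetical, not part of SDR): plain HTTP machinery is all that is needed to invoke the route.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Issue a POST against the named URI; the response code tells us whether
// the resource accepted the call.
public class RestPostExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/rodent/fuzzywuzzy/runinwheel");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(new byte[0]);                 // empty body
        }
        System.out.println(con.getResponseCode());  // e.g. 200 on success
    }
}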

2.17 SDR-Cl2 grid service

As mentioned before, SDR has been iteratively improved upon since its first inception in Carbonara, and later with Cl2. It has at its core been built on, and around, GT using OWL as its base description language. It utilizes Jena[14] for semantic querying but can only find local node resources and nothing on the other grid nodes. SDR was never designed to allow global grid semantic resource discovery, only that resources would be discoverable using semantic techniques.

It is important to note that in the Cl2 prototype there is a Chord[30] DHT present. One might be tricked into thinking that this would be used to index other nodes' meta data descriptions, or that it was able to store more than just node ID to IP address mappings. Sadly this is not the case. In Figure 8 the components have the following responsibilities.

Cl2 WS Is the main grid service as defined by a WSDL schema and can be accessed by using SOAP. This in turn just delegates all calls to the Cl2 Core. It is just a wrapper.

Cl2 Core Manages data and resource description insertion and removal.

LRC & Cl2 Chord Manages data replication and enables discovery of data and replica locations. The Chord implementation just maps resource names to an LRC server host address.


Jena Performs query evaluation and inference.

These components can be compared to the descriptions given in Section 2.15.4. It should be noted that the Jena module has no knowledge of other data replicas than those inserted through its own SOAP interface. As such all semantic knowledge is local.

Figure 8: Component diagram of the SDR grid service as envisioned and designed in the Cl2 prototype.

2.18 SDR - Fortress

Fortress was work presented by Zeeshan[29] to solve the authorization problems in SDR. Fortress relies on XACML control lists to define resource access rights and uses X.509 certificates to identify users and enable a chain of trust. It is beyond the scope of this thesis and I will omit the details of the system other than to note its planned usage throughout SDR.

2.19 Distributed Reasoning

Work on distributed semantic querying using RDF has been performed by Felix Heine[23, 18]. I will reiterate his algorithm in the design section.


Such optimizations may, with high probability, already have been introduced into systems like Pellet[6] and Racer[7], but they are worth investigating nonetheless.


3 Requirements

This work is intended to provide a distributed inference engine within the SDR system. Migrating to a pure P2P architecture will theoretically make the prototype more scalable, robust and better performing. Interviews with project stakeholders resulted in the following requirements:

• The system shall be able to resolve all resources in the system, from any given node, through querying.

• Resources shall be secured using both an authentication and an authorization scheme, but this is not a requirement for this work.

• Resources and their descriptions are encrypted on disk.

• Resources shall be described using OWL-DL.

• The semantic query shall scale with a growing number of nodes and resources in the system.

• Removal of Globus Toolkit. Mainly because deployment of the system to date has been too cumbersome and static.

• Resources shall be described by the use of DRONT.

3.1 Decisions & Motivations

Following these requirements I decided upon the following:

• Implement the distributed inference algorithm proposed by Felix Heine using Jena and a DHT.

• Move all data stored in the system onto the DHT in order to reduce complexity.

• Publish the web interface using REST, in order to speed up development of the web service.

Also, finding a good DHT subsystem means not needing to bother with implementing replication and dealing with node membership. Here DKS was chosen, mainly because:

• It is written in Java, which is also an implicit requirement for the prototype; this means that it will be easier to integrate.

• All the code is publicly available.

• It supports symmetric replication, i.e. it employs consistent hashing.

• Configuration of the different runtime properties is very easy. More specifically, it is very easy to change the replication degree.


3.2 Entailments

Given these requirements and major design decisions I became convinced to pull out all GT specific service reliance, like file management and authorization, and put everything on a DKS DHT ring. This leads to the following situation:

• Resources are automatically replicated, by the replication guarantees given by DKS; this should be considered a big win.

• Resources have a finite size, since they are passed in memory, i.e. stream transfers are on hold; a minor loss since this is a proof of concept implementation.

• Resources are accessible and discoverable from all participating system nodes, also a big win.

• Scalable querying can be achieved with regards to participating nodes, big win.

• Cl2's DHT must be tossed since it doesn't allow general mapping of resources, i.e. it only allows (URI → InetAddress) lookups, and provides no replication. This is because the Cl2 DHT was basically only required to function within SDR as a drop in DNS replacement. This isn't really a big loss since DKS will be integrated anyway.

As such this prototype will inevitably remove, and not necessarily replace, the functionalities of:

• GSI
• GridFTP
• LRC
• Cl2 Chord

within the SDR-Cl2 system. It will also expand the semantic component of the system.


4 Design

This chapter further elaborates on the design decisions, made in Section 3.1, for the system. It also limits what work should be done and how it ideally should be done. Figure 9 shows the different components of the system and their responsibilities. They will all be covered or mentioned further within this chapter, except Fortress, the planned security subsystem, since it was not a part of this thesis. Information flow within an SDR process is as follows:

Web Service The main entry point of the service, this will be exported using REST. It will just delegate calls to the core implementation and is a wrapper.

Core Translates information lookup and/or publishing into DHT calls either going through the semantic component or directly onto the DHT. It also follows some consistency rules for resource manipulation.

Semantic Validates the OWL-DL descriptions, as well as infers new facts, given their ontology.

Harvester Wraps and exports the distributed algorithm, which in turn uses the DHT.

DKS DHT Handles peer to peer communication and storage, and provides a call out to a security subsystem, since resources need to be protected.

We notice that this design relies heavily on the DKS DHT as a backend service. This means that there will be a lot of hash table calls and that it will be responsible for the actual resource storage. By design this will mean that we will likely need to inspect the underlying storage and messaging mechanisms of the DHT.

But before we dive into the descriptions of the components, let's first begin with something more esoteric.

4.1 Data Model

Since the whole system revolves around resources we will start by looking at these. We deal with four basic entities in the system:

Data Which can be anything and is in itself untyped. The corresponding description should be able to identify what kind of data is stored and how it is interpreted.

Meta Data Which is the OWL description as related to a specific piece of data, or an entity in its own right.

Access Control List (ACL) Entries Describes both authoritative rights for a user, role as well as describing the set of people corresponding to a role. Public Keys Your typical RSA public, private keypair needs to identify users



Figure 9: Overview of the system’s components interrelationships.

Ontology The ontology used to describe, validate and infer new facts from Meta Data entries.

This means that if I know the id of a resource then it should be easy for me to fetch any of the parts used to make up a resource.

4.1.1 Usage Consistency

It should now become obvious that we cannot just let users pull and push descriptions, meta data, etc. around. We need to establish some simple soundness rules for consistency.

• A meta data entry is a complete entity by its own right.

• A meta data entry shall be serialized to OWL-DL and have a valid ontol-ogy.

• A user public key is an entity by its own right.


• An ACL entry with the same name as a user defines the user's role.
• ACLs shall be validated and decrypted against the security module.
• The owner of the ACL entry is the public key used to push the entry.
• Resource data cannot exist without a corresponding meta data entry.
• Resource data cannot exist without a corresponding ACL handle, i.e. the ACL to define access rights.
• A user cannot be removed as long as she still owns resources.
• Ownership can only be passed by the user or root.
• ACLs can only be added if they are relayed through the security module and accepted.

Failure to comply with any of the rules warrants removal of that piece of information. As such it would be possible for the system to initiate a garbage collection sweep should it happen to encounter inconsistencies.

4.1.2 Naming Strategy

Because of this division of resource parts we would like to provide a DHT namespace for Data, ACL entries, Meta data and Public Keys. This can be done in two fashions:

• Either we rely on DKS's ability to manage groups, see Section 2.6 if in doubt.

• Or we take the lazy path and just prepend a string like ”metadata” to any meta data key we enter into the system, see Section 2.1 if in doubt, and just continue to add prefix strings to the id of any new piece of information we would like to keep available.

Both of these approaches are similar in terms of security, but the first is more resilient to collisions, i.e. they will use the same hash function but in the first case fewer items will occupy the same keyspace.

We also have opaque entities which are only used as a means to enforce a naming strategy, i.e. “resources”. Resource handles cover both resource data and resource meta data. In other words, going with the lazy namespace approach above, we get the following lookup keys and corresponding entities (a small helper sketch follows the list):

• Resource data using the lookup ”data:⟨resourceId⟩”.
• A resource description using ”metadata:⟨resourceId⟩”.
• The resource ACL handle using ”aclhandle:⟨resourceId⟩”.
• An ontology using ”ontology:⟨ontologyId⟩”.
• The ACL using ”acldata:⟨aclHandle⟩”.
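A trivial helper sketch (hypothetical, not thesis code) capturing the lazy namespacing strategy above; the prefixed strings are then hashed onto the DHT as in Section 2.1:

// Key builders for the different entity kinds stored on the DHT.
public class Keys {
    static String data(String resourceId)        { return "data:" + resourceId; }
    static String metadata(String resourceId)    { return "metadata:" + resourceId; }
    static String aclHandle(String resourceId)   { return "aclhandle:" + resourceId; }
    static String ontology(String ontologyId)    { return "ontology:" + ontologyId; }
    static String aclData(String aclHandle)      { return "acldata:" + aclHandle; }
    static String consistence(String resourceId) { return "consistence:" + resourceId; }
}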


4.2 Web Service

SDR will use REST[20], i.e. actually rely on HTTP, for interfacing, and different resource representations for serialization of entities. This completely does away with the WS-* standards: no more WSDL and no more SOAP.

Stream transfers are only an implementation problem since HTTP/1.1 specifies the availability of chunked transfer encoding, see Section 3.6 in RFC 2616[19].

4.2.1 Routes

This follows a simple scheme: the routes below are the only ones which are defined, and all others return HTTP status code 404[19]. For all intents and purposes a name always corresponds to an identifier. Because REST specifies that doing a POST on a named URI shall treat the named entity as a collection, and create the sent entry in said collection, POST is vacuous for such routes. For the collection routes, query and resource, POST is used instead.

URI                          GET   PUT   DELETE
/resource/{id}/data          1.1   1.2   1.3
/resource/{id}/description   2.1   2.2   2.3
/resource/{id}/owner         3.1   3.2   3.3
/resource/{id}/acl           4.1   4.2   4.3
/acl/{id}                    5.1   5.2   5.3
/pubkey/{id}                 6.1   6.2   6.3
/ontology/{id}               7.1   7.2   7.3

URI                          GET   POST  DELETE
/query                       8.1   8.2   8.3
/resource                    9.1   9.2   9.3

The URIs and their respective verbs have the expected meaning, i.e. doing a GET on a resource will return its representation, PUT will update or create it and DELETE will remove it.

4.2.2 Representations

Given below is the list of the different request and response representations needed to make an HTTP call. Further, all connections towards the web application are made over HTTPS. The system does not rely on HTTPS client authentication certificates since these checks need to be done at the application level anyway. All verbs need to be accompanied by the user name and an encrypted user secret, so that the system can verify it against the user's public key. There is, of course, one exception: the query route.

1.1 Response: Base64.
1.2 Request: Base64. Response: HTTP code.
1.3 Response: HTTP code.
2.1 Response: OWL-DL.
2.2 Request: OWL-DL. Response: HTTP code.
2.3 Response: HTTP code 403.
3.1 Response: User name, String.
3.2 Request: User name, String. Response: HTTP code.
3.3 Response: HTTP code 403.
4.1 Response: ACL handle, String.
4.2 Request: ACL handle, String. Response: HTTP code.
4.3 Response: HTTP code 403.
5.1 Response: XACML.
5.2 Request: XACML. Response: HTTP code.
5.3 Response: HTTP code.
6.1 Response: X.509 certificate.
6.2 Request: X.509 certificate. Response: HTTP code.
6.3 Response: HTTP code.
7.1 Response: OWL-DL.
7.2 Request: OWL-DL. Response: HTTP code.
7.3 Response: HTTP code.
8.1 Response: HTTP code 404.
8.2 Request: SPARQL. Response: Resource names, String List.
8.3 Response: HTTP code 404.
9.1 Response: HTTP code 404.
9.2 Response: HTTP code 404.
9.3 Response: HTTP code.

4.3 Core

This will glue all the other components together and perform any necessary sanity or consistency checks for the system.

4.3.1 Basic Operations

It will implement all of the invocations given by the REST specification in Section 4.2.1. There is no reason to reiterate them here since they will be directly equivalent to the REST calls. It will:

• Use the DHT ring and the Semantic component as a backend.

• Follow the consistency rules given in Section 4.1.1 when assembling entities.

• Use the naming strategy in Section 4.1.2 when fetching and storing entities on the DHT

4.3.2 Transactions

As an extra effort to avoid inconsistencies, each complete resource will be written with a transaction entry recording its state. This state can be one of:

CREATION The state of a resource that is being created by some node on the ring.

DECAYING The state of a resource that is being deleted by some node on the ring.


INCONSISTENT The state of a resource which has been found breaking the usage patterns in Section 4.1.1.

We will also need a way to deal with sending and storing all of the different types of data described in Section 4.1.2 and Section 4.2.1. All of the data gets wrapped into a consistency entry, which contains the following information (a minimal sketch follows the list):

• A reference to the resource state, i.e. following the naming strategy in Section 4.1.2 this yields a string like "consistence:<resourceId>", which can be used in another lookup in order to determine the resource state.

• A reference to the ACL file governing the entity usage.

• The tree hash for the object. This is made available for chunking; see Section 2.3 and Section 4.3.3 for clarification.

• A base64 blob of the embedded Java object.
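A minimal sketch of such a consistency entry as a Java class; the field names are illustrative, only the four pieces of information above are taken from the design.

import java.io.Serializable;

public class ConsistencyEntry implements Serializable {
    public final String stateKey;    // e.g. "consistence:<resourceId>", resolvable on the DHT
    public final String aclKey;      // reference to the ACL entity governing usage
    public final byte[] treeHash;    // tree hash over the chunked payload (cf. Sections 2.3 and 4.3.3)
    public final String payload;     // base64 blob of the embedded Java object

    public ConsistencyEntry(String stateKey, String aclKey, byte[] treeHash, String payload) {
        this.stateKey = stateKey;
        this.aclKey = aclKey;
        this.treeHash = treeHash;
        this.payload = payload;
    }
}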

4.3.3 Chunking

In order to stream larger files the system will need:

• A fragmentation limit on data sent out on the DHT

• An assembly mechanism, used after the different data chunks have been retrieved from the DHT

• A disassembly mechanism, used during the transfer of the data chunks to the DHT

• A mechanism to send the complete data back and forth to the client.

The first point is just a definition, so we pick whatever size we like, which does not necessarily align with the transport layer fragment size. Another thing to keep in mind is that we will not be able to hold all of the data chunks in the Core runtime because of memory constraints; this is, after all, why we are chunking in the first place.

The assembly mechanism will fetch the root data entry and then proceed to fetch the remaining chunks, provided there are any. When a part of the data stream has been collected and is next in line on the output data stream, it is sent out using base64 over HTTP chunked encoding.
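A minimal sketch of the assembly step, assuming a hypothetical Dht.get(String) lookup and an illustrative chunk key layout "data:<id>:<i>"; in the real system the chunk count and keys would be derived from the root entry and the naming strategy in Section 4.1.2.

import java.io.OutputStream;
import java.util.Base64;

public class ChunkAssembler {
    public interface Dht { byte[] get(String key); }

    // Stream the reassembled payload to the HTTP response one chunk at a time,
    // so the complete object never has to reside in Core memory.
    public static void assemble(Dht dht, String resourceId, int chunkCount, OutputStream httpOut) throws Exception {
        for (int i = 0; i < chunkCount; i++) {
            byte[] chunk = dht.get("data:" + resourceId + ":" + i);
            if (chunk == null) throw new IllegalStateException("missing chunk " + i);
            httpOut.write(Base64.getEncoder().encode(chunk));  // base64 over the HTTP chunked stream
            httpOut.flush();                                   // let the container emit an HTTP chunk
        }
    }
}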

4.4 Distributed Hash Table

4.4.1 Storage

DKS does not provide a method to encrypt its data, nor should it; encryption of stored data should be done at higher layers. We do, however, want to provide an alternative storage backend for untrusted environments.

This backend will instruct the node to generate its own encryption key at startup, which invalidates any data left from earlier runs. This is equivalent to not using the filesystem for persistence, since nothing can be recovered between runtimes. What we gain is the ability to use the filesystem as a swap area.
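A minimal sketch of such a backend using the standard javax.crypto API; the class and method names are illustrative, and the real backend would plug into DKS's storage interface. The AES key lives only in the running JVM, so anything swapped to disk is unreadable after a restart.

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class EphemeralDiskStore {
    private final SecretKey key;
    private final SecureRandom rnd = new SecureRandom();

    public EphemeralDiskStore() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        this.key = kg.generateKey();        // generated at startup, never persisted
    }

    // Encrypt before writing to the filesystem swap area; the IV is prepended to the ciphertext.
    public byte[] seal(byte[] plain) throws Exception {
        byte[] iv = new byte[16];
        rnd.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ct = c.doFinal(plain);
        byte[] out = new byte[16 + ct.length];
        System.arraycopy(iv, 0, out, 0, 16);
        System.arraycopy(ct, 0, out, 16, ct.length);
        return out;
    }

    // Decrypt data read back from the swap area during the same runtime.
    public byte[] open(byte[] sealed) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(sealed, 0, 16));
        return c.doFinal(sealed, 16, sealed.length - 16);
    }
}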

4.4.2 Security

We can see that the "Core" is able to distinguish and enforce security policies easily, since it just asks the fortress module for help. The problem is that we cannot rely on the "Core" alone to filter out privileged information, since it would then be easy to connect a "rogue" DKS instance to the ring and read anything we want. This means that when a DHT node receives a lookup request it must check whether or not the requester is authenticated and authorized to view the data. The implication is that we will need to supplement the DHT message receiver/sender with the following (a sketch is given after the list):

• Messages which are signed with a MAC.

• A check that the MAC corresponds to a user in the system.

• A hook into "fortress" in order to ask for authorization.
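A minimal sketch of the message check, assuming an HMAC over the message payload and a hypothetical fortress interface; the real DKS message types and key agreement are not shown here.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.util.Arrays;

public class SignedMessage {
    public interface Fortress {
        byte[] sharedKeyFor(String userId);                // key agreed with an authenticated user
        boolean authorized(String userId, String dhtKey);  // ACL check for the requested entry
    }

    public static byte[] mac(byte[] key, byte[] payload) throws Exception {
        Mac hmac = Mac.getInstance("HmacSHA1");
        hmac.init(new SecretKeySpec(key, "HmacSHA1"));
        return hmac.doFinal(payload);
    }

    // Called by the receiving DHT node before answering a lookup.
    public static boolean accept(Fortress fortress, String userId, String dhtKey,
                                 byte[] payload, byte[] claimedMac) throws Exception {
        byte[] key = fortress.sharedKeyFor(userId);
        if (key == null) return false;                                     // unknown user
        if (!Arrays.equals(mac(key, payload), claimedMac)) return false;   // forged or tampered message
        return fortress.authorized(userId, dhtKey);                        // ask fortress for authorization
    }
}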

4.5 Semantic

This component is responsible for the semantic query evaluation. It is also responsible for converting its data structures to and from the global DHT representation. Each time a query is encountered the local node starts harvesting all RDF triples needed to satisfy the query. Once the subgraph has been collected the node tries to construct an OWL-DL file from the data. If the construction is complete it is fed into the Jena inference engine, which will try to find matches. If new information is inferred it is stored in the local data structure as well as pushed back onto the DHT.

4.5.1 OWL-DL to OWL-RDF and back

OWL-DL is often serialized as RDF/XML, but there are different approaches to describing an OWL-DL structure. What is meant by OWL-RDF here is the translation from an OWL serialization, as in Figure 5, to a representation in RDF triples where OWL constructs such as owl:Class and owl:maxCardinality are represented as nodes.

Semantic decomposition will be handled via Jena's Model and Graph APIs, which translate OWL statements into bags of RDF triples and vice versa. Note that this approach will only work if the injected triples are acquired from an OWL-DL source. If the system gets injected with pure RDF statements then these are quite likely to break the stored descriptions. As such the Semantic component will only accept resource descriptions that conform to an OWL-DL ontology. When a description is published the component will first try to find the ontology referenced in the description through the DHT and validate the description against it. It will also only accept ontologies that conform to OWL-DL.

To make this clearer, Jena relates models and statements to graphs and triples in the following way:

Model --contains--> Statement
Graph --contains--> Triple

Figure 10: Jena API backing model for “models” and “graphs”

This separation is done in order to provide different levels of abstraction: we can grab the graph from a model or initialize a model from a graph. The system will also implement a SPARQL to RDF triple graph parser, since Jena ARQ is too tightly coupled to the Model API.

And finally, to the point: this allows us to inject an OWL-DL description and its ontology into a Jena model and get all asserted and inferred statements, which yields a consistent data structure. We can then pull the graph from underneath the model and inject these triples onto the DHT ring.
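A minimal sketch of this round trip, assuming the Jena 2 API of the time (com.hp.hpl.jena packages; newer Jena releases live under org.apache.jena). The Graph passed to recompose would be populated from triples harvested off the DHT.

import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.reasoner.Reasoner;
import com.hp.hpl.jena.reasoner.ReasonerRegistry;
import com.hp.hpl.jena.graph.Triple;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class SemanticBridge {

    // Inject a description and its ontology, run the OWL reasoner, and return
    // every asserted and inferred statement as plain triples.
    public static List<Triple> decompose(InputStream ontology, InputStream description) {
        Model tbox = ModelFactory.createDefaultModel().read(ontology, null);
        Model abox = ModelFactory.createDefaultModel().read(description, null);
        Reasoner reasoner = ReasonerRegistry.getOWLReasoner().bindSchema(tbox);
        InfModel inf = ModelFactory.createInfModel(reasoner, abox);

        List<Triple> triples = new ArrayList<Triple>();
        for (StmtIterator it = inf.listStatements(); it.hasNext(); ) {
            triples.add(it.nextStatement().asTriple());   // drop from Model/Statement to Graph/Triple level
        }
        return triples;
    }

    // Going back: a model can be (re)initialized from a graph built out of triples.
    public static Model recompose(com.hp.hpl.jena.graph.Graph graph) {
        return ModelFactory.createModelForGraph(graph);
    }
}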

4.5.2 Distributed Inference Algorithm

This algorithm is unnamed but bears a striking resemblance to a distributed subgraph isomorphism algorithm[32]. It is a reiteration of Felix Heine's algorithm as described in [23], with alterations made in order to fit this thesis.

A formal representation of a graph can be seen as a set of triples:

T_M ⊆ (L ∪ B) × L × (L ∪ B)

Where L denotes an RDF literal node and B a blank node. As a side note, an RDF triple is not valid within a graph without a literal as predicate. We get the corresponding query graph as:

T_Q ⊆ (L ∪ V) × (L ∪ V) × (L ∪ V)

Where V denotes a variable node. The difference between a blank node B and a variable node V is that the variable node denotes a named wildcard for which we are interested in finding an assignment, whereas B is an anonymous wildcard node. Here T_M denotes the triples in the global model and T_Q denotes the query graph.

This gives us the problem definition: to find a mapping from the variable nodes in T_Q to the literal or blank nodes in T_M:

R : V_Q → L_M ∪ B_M

This gives us that for any triple (s, p, o) ∈ T_Q we can deduce, via R, a triple (R(s), R(p), R(o)) = (s′, p′, o′) ∈ T_M, where R is taken to act as the identity on literal elements.

If R were an injective mapping this would be the subgraph isomorphism problem; here R is non-injective, since different query nodes may map onto the same literal or blank node. We can draw this conclusion since L_Q ⊆ L_M and R is only defined for variables.

Further, T_Q is assumed to have a path between any pair of vertices, i.e. it is a connected graph. This is also a characteristic which is easy to enforce at the start of the query evaluation.

Clarification

We seed each injected triple onto the DHT ring three times, using its subject, predicate and object as keys, effectively yielding three copies of the triple. This lets us define the DHT lookup functions getBySubject, getByPredicate and getByObject. To be perfectly clear this means, following the naming strategy in Section 4.1.2, prepending each of these keys with "graph:", which yields the following lookup functions:

• getBySubject(s) : "graph:subj:s" → (s, p, o)*

• getByPredicate(p) : "graph:pred:p" → (s, p, o)*

• getByObject(o) : "graph:obj:o" → (s, p, o)*

This means that one can collect a maximum graph component for any given triple by using the three lookup functions. Typically the query will contain more than one triple, which means that the collected graph will contain triples that are irrelevant with respect to satisfying the complete query.
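A minimal sketch of the seeding and lookup functions, assuming a hypothetical Dht interface standing in for DKS and a simplified triple type; the real implementation would use the Jena Triple and the DKS put/get operations.

import java.util.Set;

public class TripleIndex {
    public interface Dht {
        void add(String key, Object value);
        Set<Object> lookup(String key);
    }
    public static class RdfTriple { public String s, p, o; }

    private final Dht dht;
    public TripleIndex(Dht dht) { this.dht = dht; }

    // Store the triple three times, once per element, under the "graph:" namespace.
    public void seed(RdfTriple t) {
        dht.add("graph:subj:" + t.s, t);   // copy 1, keyed by subject
        dht.add("graph:pred:" + t.p, t);   // copy 2, keyed by predicate
        dht.add("graph:obj:"  + t.o, t);   // copy 3, keyed by object
    }

    public Set<Object> getBySubject(String s)   { return dht.lookup("graph:subj:" + s); }
    public Set<Object> getByPredicate(String p) { return dht.lookup("graph:pred:" + p); }
    public Set<Object> getByObject(String o)    { return dht.lookup("graph:obj:"  + o); }
}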

Candidate Set

The algorithm relies on the use of a look-ahead membership data type, the candidate set. A candidate set is a named set which allows us to ask:

C.exists(c) =  true   if c ∈ C
               false  otherwise

without sending the complete set C over the network. This is achieved by serializing the candidate set as a Bloom filter, see Section 2.7. This enables the network to do set existence queries while only sending a fraction of the data needed for the complete set.
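A minimal Bloom-filter-backed candidate set as a sketch; the filter size, the number of hash functions and the way indices are derived from a SHA-1 digest are illustrative choices, the real implementation would follow Section 2.7.

import java.security.MessageDigest;
import java.util.BitSet;

public class CandidateSet implements java.io.Serializable {
    private final BitSet bits;
    private final int m;   // number of bits in the filter
    private final int k;   // number of hash functions, assumed k <= 10 so one SHA-1 digest suffices

    public CandidateSet(int m, int k) { this.m = m; this.k = k; this.bits = new BitSet(m); }

    // Derive k bit positions from consecutive pairs of digest bytes.
    private int[] positions(String element) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-1").digest(element.getBytes("UTF-8"));
        int[] pos = new int[k];
        for (int i = 0; i < k; i++) {
            int v = ((d[2 * i] & 0xff) << 8) | (d[2 * i + 1] & 0xff);
            pos[i] = v % m;
        }
        return pos;
    }

    public void add(String element) throws Exception {
        for (int p : positions(element)) bits.set(p);
    }

    // May return false positives, never false negatives; only the bit set travels over the network.
    public boolean exists(String element) throws Exception {
        for (int p : positions(element)) if (!bits.get(p)) return false;
        return true;
    }
}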

Similarly, C_T(t) and C_V(v) denote the possible candidates for a given t ∈ T_Q or v ∈ V_Q respectively, where:

∀x ∈ C_T(t) : x ∈ T_M


Just to clarify, this enables us to ask questions like, “Give me all candidates for the third query triple” or “Give me all triple candidates for the second query triple’s object variable”. The algorithm will maintain such sets for the duration of the run.

It is perfectly valid for a variable candidate set to be undefined; we define the size of such an undefined candidate set as infinity. If we are looking at a fixed value, like a literal, we define the set to be a one element set containing only that element.

|C_V(v)| =  ∞           if C_V(v) = ∆
            |{v}| = 1   if v ∈ L_M
            |C_V(v)|    otherwise

Finally we define a candidate suitability metric, the specification grade:

sg(x) =  |C_V(x)|                   if x is a triple element
         min(sg(s), sg(p), sg(o))   if x ≡ (s, p, o)
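To make the metric concrete, a small sketch of how sg could be computed, assuming a hypothetical view of the local candidate bookkeeping; a very large long stands in for infinity.

public class SpecificationGrade {
    public static final long UNDEFINED = Long.MAX_VALUE;   // stands in for infinity

    // sg of a single triple element: 1 for a fixed literal, |C_V(x)| if defined, infinity otherwise.
    public static long sgElement(CandidateSizes sizes, String x) {
        if (sizes.isLiteral(x)) return 1;
        Long n = sizes.candidateCount(x);
        return n == null ? UNDEFINED : n;
    }

    // sg of a triple: the minimum over its subject, predicate and object.
    public static long sgTriple(CandidateSizes sizes, String s, String p, String o) {
        return Math.min(sgElement(sizes, s), Math.min(sgElement(sizes, p), sgElement(sizes, o)));
    }

    // Hypothetical view of the local candidate bookkeeping.
    public interface CandidateSizes {
        boolean isLiteral(String element);
        Long candidateCount(String variable);   // null when C_V is still undefined
    }
}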

Algorithm overview

This metric specifies the number of lookups that need to be performed in order to evaluate a specific variable; in other words, it grades the element. This gives us enough of a framework to summarize the algorithm, which works in the following phases:

• Collect all of the candidates for each of the query triples, yielding a maximum graph component within which we are certain that the target graph is located.

• Reduce the retrieved maximum graph component into a minimal form.

• Assign a triple element to each of the query variables.

Retrieving Candidates

We can summarize the algorithm informally as:

1. Set all candidate sets to undefined.

2. Among the triples whose candidate set is still undefined, find the one whose evaluation requires the smallest number of lookups (Bloom filters are transferred).

3. Fetch all candidates for the node of that triple with the least amount of lookups (real sets are transferred).

4. Refine the variable candidate sets for the currently known triples; if there is an error return an error, otherwise jump to 2.


function candidates(T_Q, T_M) → Result
begin
    foreach t ∈ T_Q do C_T(t) ← ∆ end
    foreach v ∈ V_Q do C_V(v) ← ∆ end
    foreach t ≡ (s, p, o) ∈ T_Q such that C_T(t) ≡ ∆ and sg(t) ≤ sg(t′) for all t′ with C_T(t′) ≡ ∆ do
        if sg(t) = sg(s) then
            C_T(t) ← ⋃_{x ∈ C_V(s)} getBySubject(x)
            C_T(t) ← {(s, p, o) ∈ C_T(t) : p ∈ C_V(p) ∧ o ∈ C_V(o)}
        else if sg(t) = sg(p) then
            ... repeated for predicate
        else if sg(t) = sg(o) then
            ... repeated for object
        end
        if refine(C_T, C_V, T_Q, {t}, ∅) ≡ Error then
            return Error
        end
    end
    return Ok
end

Algorithm 1: Fetching all applicable candidates over DHT

Refining the Candidate Sets

We can summarize the algorithm informally as:

1. Until we have neither changed triples nor changed variables (these are also two sets), we:

2. Check each changed triple:

(a) for each variable in the changed triple we compare that variable's candidate set with the candidate sets of each triple the variable is present in

(b) if a candidate is not present in the triple candidate set, we remove it from the variable candidate set and add the variable to the set of changed variables

3. Check each changed variable:

(a) we find all the query triples it is in

(b) and update each such triple's candidate set to only contain triples which have a node that is also a candidate for the variable

(c) if the triple's candidate set changed we add that triple to the set of changed triples

(d) finally we remove the variable from the set of changed variables

4. if some candidate set is empty we are out of luck

A more formal representation is presented in Algorithm 2.

Variable assignment

We can summarize the algorithm informally as:

1. If there is a query triple left in the set of query triples, we remove it.

2. Check each candidate triple for that query triple:

(a) if the candidate triple doesn't contradict the previous variable assignments

(b) we store the new assignments

(c) and recurse.

3. Otherwise we're done.


function refine(C_T, C_V, T_Q, T, V) → Result
begin
    while T ≠ ∅ ∨ V ≠ ∅ do
        foreach t ≡ (s, p, o) ∈ T do
            if s ∈ V_Q then
                C_V(s) ← C_V(s) ∩ subject(C_T(t))
                if C_V(s) was changed then
                    V ← V ∪ {s}
                end
            end
            if p ∈ V_Q then
                ... repeated for predicate
            end
            if o ∈ V_Q then
                ... repeated for object
            end
            T ← T \ {t}
        end
        foreach v ∈ V do
            foreach t ∈ T_Q do
                if subject(t) ≡ v then
                    C_T(t) ← {t′ ≡ (s′, p′, o′) ∈ C_T(t) : s′ ∈ C_V(v)}
                    if C_T(t) was changed then
                        T ← T ∪ {t}
                    end
                end
                if predicate(t) ≡ v then
                    ... repeated for predicate
                end
                if object(t) ≡ v then
                    ... repeated for object
                end
            end
            V ← V \ {v}
        end
        if ∃t : C_T(t) = ∅ ∨ ∃v : C_V(v) = ∅ then
            return Error
        end
    end
    return Ok
end

Algorithm 2: Refining the candidate sets over the DHT


function evaluate(V, T_Q, C_T) → ()
begin
    if ∃t ∈ T_Q then
        T_Q ← T_Q \ {t}
        foreach u ∈ C_T(t) do
            if u does not contradict V then
                V′ ← V
                store the assignments of u in V′
                evaluate(V′, T_Q, C_T)
            end
        end
    else
        all members of V have been evaluated
    end
end
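A direct, minimal Java rendering of the evaluate backtracking, assuming query variables are written with a leading "?" and simplified triple and binding types; the candidate map would be populated by the preceding phases.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Evaluate {
    public static class Triple { public String s, p, o; }

    // bindings: variable name -> assigned node; query: remaining query triples.
    public static void evaluate(Map<String, String> bindings, List<Triple> query,
                                Map<Triple, List<Triple>> candidates) {
        if (query.isEmpty()) {
            System.out.println("solution: " + bindings);    // all variables have been evaluated
            return;
        }
        Triple t = query.get(0);
        List<Triple> rest = query.subList(1, query.size()); // remove t from the query set
        for (Triple u : candidates.get(t)) {
            Map<String, String> next = tryBind(bindings, t, u);
            if (next != null) {                              // u does not contradict earlier assignments
                evaluate(next, rest, candidates);            // recurse with the extended assignment
            }
        }
    }

    // Returns an extended binding map, or null on contradiction.
    private static Map<String, String> tryBind(Map<String, String> b, Triple q, Triple u) {
        Map<String, String> next = new HashMap<String, String>(b);
        if (!bindOne(next, q.s, u.s) || !bindOne(next, q.p, u.p) || !bindOne(next, q.o, u.o)) return null;
        return next;
    }

    private static boolean bindOne(Map<String, String> b, String qNode, String value) {
        if (!qNode.startsWith("?")) return qNode.equals(value);   // a fixed node must match exactly
        String prev = b.get(qNode);
        if (prev != null) return prev.equals(value);              // keep earlier assignment consistent
        b.put(qNode, value);
        return true;
    }
}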

References
