
Linköping University | Department of Computer and Information Science

Master's thesis, 30 ECTS credits | Computer Engineering

2020 | LIU-IDA/LITH-EX-A–20/056–SE

Querying Federations of Eiffel

Event Data Repositories

An investigation of distributed systems containing continuous integration data in a linked format

Jonatan Pålsson

Supervisor: Patrick Lambrix | Examiner: Olaf Hartig



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.



Abstract

The goal of this thesis was to find out whether Eiffel event data could be represented in RDF format and queried effectively in a federation of SPARQL endpoints. The Eiffel data was created by a generator provided by the Eiffel project, which the author extended in order to create links between datasets. These new datasets were then converted into RDF triples and uploaded to two separate SPARQL endpoints in Azure. Two types of SPARQL federation engines were used to query this data: index-based and index-free. Unfortunately, the index-based systems were not able to produce any measurable results, but the index-free system FedX was. This shows that Eiffel data can be represented in RDF format in a way that allows an index-free engine to query a SPARQL federation containing datasets of this data. It is, however, difficult to provide proof of effectiveness without another system to compare to.

(4)

Acknowledgements

I would like to thank all my friends and family who have been patient and supported me while writing this thesis. I would also like to give a huge thanks to Olaf Hartig, who has given me lots of assistance and helped me with everything that I have needed.


Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Motivation
   1.2 Aim
   1.3 Research questions
   1.4 Delimitations
2 Theory
   2.1 Eiffel
   2.2 Resource Description Framework (RDF)
   2.3 SPARQL
   2.4 SPARQL federation engines
3 Method
   3.1 Similar work
   3.2 Generated data
   3.3 SPARQL federation engines
   3.4 Queries
4 Results
   4.1 Generated data
   4.2 Index-based systems
   4.3 Index-free systems
5 Discussion
   5.1 Generated data
   5.2 Index-based systems
   5.3 Index-free systems
6 Conclusion
   6.1 Future work
Bibliography
A Appendix A


List of Figures

2.1 A simplified event graph used in the paper "Continuous Integration and Delivery Traceability in Industry: Needs and Practices" [15].
2.2 A triple where Bill is the subject, knows is the predicate and Jess is the object.
2.3 The predicate in this triple is a URI referring to the relationship "knows". Both the subject and object are also URIs.
2.4 A graph where the age of Jess is the literal value 25.
2.5 A graph where the node Jess has been replaced with a blank node and may no longer be uniquely identified.
2.6 A sample of a VOID description.
2.7 A sample of a summary used in HiBISCus.
2.8 A sample of a capability in a summary used in CostFed.
3.1 A simplified event flow of the Eiffel event generator.
3.2 A sample of a dataset in RDF format.
4.1 A figure showing the possible dependency setups when choosing four datasets in the generator.


List of Tables

4.1 A table showing average execution times, standard deviations and entries found for various SPARQL queries with FedX over a SPARQL federation with 2 endpoints.

1 Introduction

1.1 Motivation

Eiffel is a framework to represent and maintain historic and live information about continuous integration processes (for instance, in big, decentralized software engineering projects). This framework has been developed and used at Ericsson for several years now. A number of software development tools and continuous integration tools have been extended to generate Eiffel data. Furthermore, recently a visualization tool has been developed at Linköping University to visualize the relationships between events described by this data. The database backend currently used by this visualization tool and by the applications developed by Ericsson is MongoDB (a popular NoSQL system).

1.2 Aim

The current database backend is a centralised system. For this thesis I will be looking into the possibility of decentralizing the database and evaluating the options for doing so. Potential reasons why decentralization can be desirable may be that the size of the data is too big or that different departments in a company want to maintain their Eiffel data in their own databases. Furthermore, given the graph-like nature of the visualized data, it might be more suitable and more efficient to use a graph database system as backend. The particular graph database technologies to be used for this project are based on the Resource Description Framework (RDF); thus, the project also involves the development of an approach to represent Eiffel data in RDF. The basis for this thesis is not only to investigate the performance with Eiffel data in RDF format but also how the performance changes with a federation of endpoints that can be accessed with the query language SPARQL. These endpoints are referred to as SPARQL endpoints and the federation as a SPARQL federation. By first having a query engine connect to each endpoint, queries written in SPARQL can then be executed over the federation. Query engines have different source selection strategies, which determine how they try to optimize their effectiveness and accuracy in identifying the correct endpoints for the various parts of a given query. This thesis will focus on how different state-of-the-art source selection strategies can be used to execute SPARQL queries.

1.3 Research questions

The main research question is:

Is a SPARQL federation of datasets containing Eiffel data in RDF format a viable way of creating a decentralized database system for continuous integration data?

To be able to answer this question I focus on the following two more concrete questions:

1. How can Eiffel data be represented in RDF format in order for state-of-the-art SPARQL federation engines to execute queries?

2. How effective are the state-of-the-art source selection strategies over a SPARQL federation of datasets containing Eiffel data in RDF format?

1.4 Delimitations

This thesis will not make any comparison between a SPARQL federation used as a decentralized database system and an existing centralised system. There will also not be any discussion of ethical or societal aspects, as this has not been seen as required.

2 Theory

Before diving into the technological details of how the study was conducted, a brief introduction to the field of study is necessary in order to grasp the bigger picture. That is the purpose of this chapter, in which the author explains the terminology and the different technologies used throughout the report. First, a presentation of the Eiffel framework will be given, followed by what RDF is and how it is connected to SPARQL and SPARQL federations. Finally, the basics of SPARQL federation engines will be described, together with a more in-depth look at how indexes can be used for source selection and query optimization.

2.1 Eiffel

Eiffel is a framework used for continuous integration data at a large scale. The framework was created by, and is still maintained by, Ericsson, and since 2016 it is available under the Apache License 2.0. The idea behind Eiffel was to develop a tool for continuous integration and delivery traceability in real time [15]. Generating traceability in real time is uncommon for automated tools similar to Eiffel [2]. Whenever some activity is produced in the framework, this is represented by transmitting an event. Each event contains information about a specific delivery of the system in use and also references to other events. By having all events linking to other events, a traversable directed acyclic graph of events is formed [15]. A simplified example of this can be found in Figure 2.1. Even though the graph is directed, Ericsson wanted to be able to traverse the graph both ways, upstream and downstream as they call it [16]. Upstream is used when you want to trace what events have led to the current state of the system. Downstream is the opposite: here you find out what has happened to a specific event after it was produced. For example, if we would like to figure out the cause of a potential bug in Figure 2.1, we go upstream (from left to right in the figure), and if we would like to trace the outcome of a new build, we go downstream (right to left in the figure). Each Eiffel event is a JSON object that in turn contains two objects and an array.

1. Meta

In the meta object you find information about what type of event it is, what version it is, when it was created, and the id for the event in the form of a universally unique identifier (UUID).

2. Data

The data object is the actual payload for the event and is different for each event type. This can for example be the outcome of a test or test suite, the reason why an artifact was created, or information about a Jenkins job that relates to the event.

3. Links

The final part of the Eiffel event is an array called links, which shows how this event is connected to other events. Each entry in the array has a type and a target. The type defines how the two events are connected and the target is the UUID of the linked event. In Listing 2.1 we can see an example of an Eiffel event of the type EiffelConfidenceLevelModifiedEvent.

Listing 2.1: Example of an Eiffel Event

{
  "meta": {
    "type": "EiffelConfidenceLevelModifiedEvent",
    "version": "3.0.0",
    "time": 1234567890,
    "id": "aaaaaaaa-bbbb-5ccc-8ddd-eeeeeeeeeee0"
  },
  "data": {
    "name": "stable",
    "value": "SUCCESS",
    "issuer": {
      "name": "Gary Johnston",
      "email": "gary.johnston@teamamerica.com",
      "id": "garyj",
      "group": "Team America"
    }
  },
  "links": [
    { "type": "CAUSE", "target": "aaaaaaaa-bbbb-5ccc-8ddd-eeeeeeeeeee1" },
    { "type": "CAUSE", "target": "aaaaaaaa-bbbb-5ccc-8ddd-eeeeeeeeeee2" },
    { "type": "SUBJECT", "target": "aaaaaaaa-bbbb-5ccc-8ddd-eeeeeeeeeee3" },
    { "type": "SUBJECT", "target": "aaaaaaaa-bbbb-5ccc-8ddd-eeeeeeeeeee4" },
    { "type": "SUB_CONFIDENCE_LEVEL", "target": "aaaaaaaa-bbbb-5ccc-8ddd-eeeeeeeeeee5" },
    { "type": "SUB_CONFIDENCE_LEVEL", "target": "aaaaaaaa-bbbb-5ccc-8ddd-eeeeeeeeeee6" }
  ]
}

The Eiffel vocabulary contains 23 unique event types, and it is also possible to create your own type of event. For this thesis 16 unique event types are used, and these can be seen in Appendix A.


Figure 2.1: A simplified event graph used in the paper "Continuous Integration and Delivery Traceability in Industry: Needs and Practices" [15].

2.2 Resource Description Framework (RDF)

The Resource Description Framework, or RDF for short, is a W3C standard for modeling data that is used for the semantic web [3]. By using IRIs to name relationships between two entities it becomes a statement on how the entities are linked to each other. These RDF statements are called triples.

2.2.1 Triples

A triple is constructed from a subject, a predicate and an object. The subject and the object are two entities which are connected by the predicate. For example, Figure 2.2 shows that “Bill” “knows” “Jess”, which describes that Bill knows who Jess is. All entities in a triple can be viewed as nodes and predicates as edges, thus forming an RDF graph of data. This graph will then grow as more triples with the same subject or object are added to the dataset.

Figure 2.2: A triple where Bill is the subject, knows is the predicate and Jess is the object.
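To make the notion of a triple concrete, the following is a minimal sketch of how the statement in Figure 2.2 could be created programmatically. It uses Apache Jena, the framework later used by the conversion tool in Chapter 3; the example.org and foaf namespaces follow the prefixes used in the SPARQL examples of Section 2.3, and the class and variable names are illustrative.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class TripleSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // The subject and object are resources (nodes); the predicate is a property (edge).
        Resource bill = model.createResource("http://example.org/persons/Bill");
        Property knows = model.createProperty("http://xmlns.com/foaf/0.1/", "knows");
        Resource jess = model.createResource("http://example.org/persons/Jess");

        // One statement: Bill knows Jess. Adding further statements that reuse
        // these nodes grows the RDF graph.
        bill.addProperty(knows, jess);

        model.write(System.out, "TURTLE");
    }
}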

2.2.2 Internationalized Resource Identifier (IRI)

All entities in a triple can be represented by an IRI. An IRI is a generalization of a URI, with the only difference that IRIs support international characters. Here, IRIs are used as a global identifier for a resource [5]. In Figure 2.3 the predicate “knows” has been replaced with an IRI, as well as both the subject and the object. In this thesis, as long as no international characters are used in an IRI, it will be referred to as a URI.

Figure 2.3: The predicate in this triple is a URI referring to the relationship “knows”. Both the subject and object are also URIs.

2.2.3 Literals

The object in a triple can also take the form of a literal. These values are usually integers or strings. In Figure 2.4 Jess has the age 25, where 25 is a literal value.

Figure 2.4: A graph where the age of Jess is the literal value 25.

2.2.4 Blank nodes

When the size of an RDF graph grows it can be practical to use a blank node. A blank node can be described as a placeholder node for a subject or object in a triple. In Figure 2.5 we show that Bill knows someone whose name is Jess and who has the age 25, where the right node is a blank node. If another person also knows someone whose name is Jess and who has the age 25, then they might not be the same person, since the blank node has no unique identification.

Figure 2.5: A graph where the node Jess has been replaced with a blank node and now may no longer be uniquely identified

2.3 SPARQL

SPARQL is a graph-matching query language that is a W3C recommendation and has been developed in order to query over datasets containing data in RDF format [8]. The following code is a SPARQL SELECT query that, if executed on the dataset that Figure 2.5 represents, will result in a table with the URI or blank node of everyone Bill knows.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX person: <http://example.org/persons/>
SELECT ?name
WHERE {
  person:Bill foaf:knows ?name .
}

The PREFIX is an abbreviation for the URI specifying the vocabulary for foaf and person. The SELECT statement specifies what the output of the query will be; in this case the variable ?name of each result will be given. If the query above has any matches, then the variable ?name will be a bound variable because it is bound to a specific value; if there were no matches, then ?name would not point towards a value and would therefore be an unbound variable. The last part of the query is called the pattern matching part according to J. Pérez, M. Arenas and C. Gutierrez [8] and describes the criteria for the pattern to match. This part is always contained within a WHERE clause. Here we have introduced a triple pattern to construct the pattern matching part. A triple pattern is a triple where either an entity or a predicate has been replaced with a variable. A variable is not connected to a specific datatype and can within one query have different datatypes. For example, the following SPARQL query will give a table of all triples in the dataset it is queried upon.

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
}

The following query contains a blank node and returns the name and age of everyone that Bill knows.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX person: <http://example.org/persons/>
SELECT ?name ?age
WHERE {
  person:Bill foaf:knows _:blank .
  _:blank foaf:name ?name .
  _:blank foaf:age ?age .
}

Another type of SPARQL query is the ASK query. An ASK query only returns true or false depending on whether the query pattern has any match. The query below returns true if Bill knows anyone named “Charles” and false otherwise.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX person: <http://example.org/persons/>
ASK {
  person:Bill foaf:knows ?name .
  ?name foaf:name "Charles" .
}

2.4 SPARQL federation engines

By having a dataset with data in RDF format, you are able to perform SPARQL queries against a SPARQL endpoint. A SPARQL endpoint is a point of presence where RDF data is exposed. The endpoint is capable of receiving and processing requests for the exposed data. When multiple SPARQL endpoints are used together they form a SPARQL federation. This can for example be very useful when connecting information across a distributed system. In order to execute federated queries [10] over a SPARQL federation, a SPARQL federation engine needs to be used.

2.4.1 Source selection

In order to reduce the query execution time, the SPARQL federation engine performs a source selection prior to executing any queries. The source selection is used to probe the SPARQL federation and potentially reduce the number of endpoints that are actually used in the query. Different engines use different methods for this, but the goal is to only use the endpoints that have data relevant to the query. If only the relevant sources are selected and there is no over- or underestimation, then the source selection accuracy is 100%.

2.4.2 Types of engines

The federation engines can be placed into two separate categories, where the first category is index-based systems. These systems use some type of metadata about the endpoints in order to increase the source selection accuracy of the system. The metadata gives the engines some indication of where the desired data might be located, and there are multiple types of metadata that can be useful for this. The second category is index-free systems, which have, as the name suggests, no metadata about the endpoints.

2.4.2.1 Index-based systems

For this report three different index-based systems were analysed at the beginning: SPLENDID, which uses the Vocabulary of Interlinked Datasets (VOID) as its index, and HiBISCus and CostFed, which both use an index called summaries. This type of metadata is created by having a generator run on the SPARQL federation, categorizing the data on all of the endpoints, and creating one summary or VOID file per endpoint. Each endpoint in the federation has a VOID file that specifies three attributes for each unique triple predicate in the dataset: first, how many triples in the existing dataset have this predicate; secondly, the number of unique triple subjects that use the predicate; and finally, how many unique triple objects use the predicate. In addition to this, each VOID file also specifies the total number of triples in the dataset, unique subjects, unique objects and unique properties. A sample from a VOID file can be seen in Figure 2.6. SPLENDID uses the VOID descriptions during its query optimization by identifying the relevant sources. If the predicate of each triple pattern in the query has a bound value, then the set of relevant sources for the query can be reduced by using the VOID description to see if an endpoint has data with that property [4]. For the triple patterns with unbound predicates, all endpoints in the federation are included in the source selection [4]. SPLENDID uses another step in order to further reduce the set of relevant endpoints by sending SPARQL ASK queries to all endpoints for triple patterns that contain values and not variables as the subject or object of the triple pattern [4].

Figure 2.6: A sample of a VOID description

The summaries that HiBISCus uses consist of a set of capabilities for each endpoint in the federation. A capability is defined as a 3-tuple with one of the endpoint's unique predicates as the first element. Figure 2.7 illustrates an RDF description of three capabilities used for an endpoint. The second element of the tuple is the set of all distinct subject authorities, and the third element is the set of all distinct object authorities [11]. An authority is defined here as the combination of the path and the authority part of a URI [11]. For example, the prefix:

person: <http://example.org/persons/>

would have the authority example.org and the path persons. HiBISCus uses all unique subject and object authorities, which means all unique authorities for the subjects or objects in the dataset. One exception to this, however, is the predicate rdf:type. For this predicate the set of all distinct URI classes is listed instead of the set of distinct object authorities [11]. The first capability in Figure 2.7 is about the predicate rdf:type. We can also see that the last two capabilities do not have any object authorities, which is because these predicates only point towards objects that are literals and therefore do not have any authorities.
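As a small illustration of what the authority and path of a URI are, the snippet below splits the person: prefix from the earlier examples using java.net.URI. How HiBISCus itself performs this split is described in [11]; this is only a sketch of the idea.

import java.net.URI;

public class AuthoritySketch {
    public static void main(String[] args) {
        // The person: prefix from the SPARQL examples in Section 2.3.
        URI prefix = URI.create("http://example.org/persons/");

        // java.net.URI separates the authority part from the path part.
        System.out.println("authority: " + prefix.getAuthority()); // example.org
        System.out.println("path:      " + prefix.getPath());      // /persons/
    }
}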


The summaries for CostFed are similar to HiBISCus with some improvements. The difference in their summaries can be seen in Figure 2.8 where each capability has information about the frequency of the subjects and objects for each predicate. This is done in order to calculate the skew distribution. The distribution is used to categorise the resources into three groups called buckets depending on their frequency [12].

Figure 2.8: A sample of a capability in summary used in CostFed

Both CostFed and HiBISCus use their summaries in order to reduce the set of relevant sources for a query, similar to SPLENDID. The difference here is that, since information regarding the subjects and objects is included in the summaries, the query engine can check for bound values for the subjects and objects as well in order to reduce the set of relevant sources for a SPARQL query [11] [12].

2.4.2.2 Index-free systems

While the index-based systems use metadata to identify which of the sources in the federation are relevant, the index-free systems take another approach. Executing a query with FedX, Lusail or Fed-DSATUR starts off with a similar approach for the source selection. By treating each triple pattern in the query as a subquery and executing these as SPARQL ASK queries against all of the endpoints in the federation, they can derive whether an endpoint is relevant for each triple pattern in the executed query [1] [14] [18]. If any of these SPARQL ASK queries returns true, then the engine stores, in a cache, the information that this endpoint should be used for the query to be executed. However, this might lead to an overestimation [14]. For example, the triple pattern ?s rdf:type ?o is likely to match in every dataset in a federation, and therefore the set of relevant endpoints will not be reduced.
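The following is a minimal sketch of this ASK-based probing, written with the RDF4J client API rather than any of the engines' actual implementations; the endpoint URLs and the helper name are illustrative assumptions.

import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;

import java.util.ArrayList;
import java.util.List;

public class AskSourceSelectionSketch {

    // For one triple pattern, keep only the endpoints that have at least one match.
    static List<String> relevantEndpoints(List<String> endpointUrls, String triplePattern) {
        List<String> relevant = new ArrayList<>();
        for (String url : endpointUrls) {
            SPARQLRepository repo = new SPARQLRepository(url);
            repo.init();
            try (RepositoryConnection conn = repo.getConnection()) {
                // An ASK query only answers true or false, so it is a cheap probe.
                boolean hasMatch = conn.prepareBooleanQuery("ASK { " + triplePattern + " }").evaluate();
                if (hasMatch) {
                    relevant.add(url);
                }
            } finally {
                repo.shutDown();
            }
        }
        return relevant;
    }

    public static void main(String[] args) {
        List<String> endpoints = List.of(
                "http://vm1.example.org/blazegraph/sparql",   // placeholder endpoint URLs
                "http://vm2.example.org/blazegraph/sparql");
        // One triple pattern from LQ2 in Chapter 3, probed against every endpoint.
        System.out.println(relevantEndpoints(endpoints,
                "?id <https://w3id.org/eiffel/RDFvocab/links#previous_version> ?id3"));
    }
}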

3 Method

To give an understanding of the work method and to increase the replicability of the thesis, this method chapter is included. The main focus here is to describe how the work is planned in order to reach the results in the next chapter. The work method is an essential part in order to achieve the same results, since similar work is limited. This chapter is divided into three major parts. The first part gives a brief explanation of similar studies in order to further describe what questions this thesis aims to answer. The second part focuses on what methods are used in order to set up a SPARQL federation using Eiffel data; this is done in order to give grounds to answer the first research question. The last part of the chapter gives an explanation of different SPARQL federation engines and how the choice of engines for this thesis was made. This part describes the method for answering the second research question.

3.1 Similar work

There have previously been several research papers and extensive research on SPARQL federation [6] [18] [17] [7]. Similar work can be placed in two different categories: the first category introduces new techniques for SPARQL federation engines, and the other focuses on comparing different systems in experimental setups. This thesis falls under the second category. The unique approach of this thesis is to examine continuous integration data. This might at first seem minor, but the federation is heavily affected by it. For this federation all endpoints contain the same RDF schemas. This gives a new take on SPARQL federations, since other federations such as FedBench have different schemas for their datasets [13]. One of the things this thesis wishes to look further into is how more homogeneous datasets in a federation will affect the source selection techniques.

3.2 Generated data

To answer the research question it is first necessary to create a SPARQL federation of datasets containing Eiffel data in RDF format. This process is done in four steps: first, generating Eiffel data; secondly, dividing the data into different datasets for federation setups; then converting this data into RDF; and lastly, creating endpoints for the different datasets, thus creating a SPARQL federation.

3.2.1 Eiffel generator

The Eiffel project contains a generator for creating synthetic data that is similar to how data would look in production. This was confirmed in an interview with Daniel Ståhl, author of “Achieving traceability in large scale continuous integration and delivery deployment, usage and validation of the Eiffel framework”. The generator creates Eiffel events based on the event flow that is displayed in Appendix A. The generator goes through several iterations of the flow, creating a dataset where events from each iteration point to an event of the same type in a previous iteration. This is not necessarily the most recent iteration, since it is not certain that every event type is generated in every iteration. By having these links it will be possible to increase the traceability of the delivery.

Figure 3.1: A simplified event flow of the Eiffel event generator

The first iteration creates three unique Eiffel events that specify the environment that the rest of the Eiffel events are using. After this setup the following iterations produce around 30-50 events, depending on some randomized results in the generator. For this thesis 4000 iterations are done for the setup, resulting in 120,000-200,000 events. A simplified version of the event flow displayed in Appendix A can be seen in Figure 3.1. Here it is shown that once the environments are defined a new build will be triggered. This build creates up to three artifacts, “ArtCC1”, “ArtCC2” and “ArtCC3”. Once these artifacts are created a subsystem will start to build. This subsystem is built on the latest versions of the artifacts created in the build step. Here there is an 80% chance that an artifact called “ArtC2” is generated. If this artifact is not generated, the current iteration ends and a new iteration begins. However, if the artifact is created, then the subsystem will run three test events, which test that the system integration is successful, and these must pass. They are coded to have a 98% chance of being successful in the generator. This means that a system integration will be done in 75.3% of the iterations (0.8 × 0.98³ ≈ 0.753). Finally, four last test cases will be generated to simulate integration tests. These tests pass in 80.1% of the cases, resulting in a new version of the entire system in 60.6% of the iterations.

3.2.2 Creating linked datasets

For this thesis it is of interest to execute queries over a federation of datasets. In order to have a federation of datasets it is necessary to generate multiple datasets which have links between them. To have an accurate result the data must be linked in a way that represents a realistic scenario. A common scenario is that a piece of software depends on a specific version of another piece of software. This scenario is represented in this thesis by having one dataset link to the latest build of another dataset, which allows for easy traceability. The last group of events generated in an iteration is the integration group, which can be seen in Figure 2.1. The first event to be generated is the “CDEF1” event, which is called EiffelCompositionDefinedEvent. The purpose of this event is to define all artifacts and sources used in the integration group. In order to connect the two datasets, a new link is introduced in the “CDEF1” event which points to the latest successful build in the dataset it is dependent upon. In other words, the “CDEF1” events in dataset 2 are connected to the latest successful “CLM1” event in dataset 1. By doing this, the generated data simulates that the latest version of dataset 1 is used when trying to integrate it into dataset 2, similar to having an integration test between two software systems.

3.2.3 Convert into RDF

After the previous step, we now have one JSON file per dataset containing 120,000-200,000 events. The next step is to convert the data into RDF. For this, a conversion tool is made which allows the generated dataset in JSON format to be converted into RDF. Before the conversion tool can be used, two RDF vocabularies are defined. The Eiffel vocabulary defines what properties an Eiffel event has, such as its version or what iteration it was created in. The other vocabulary, Eiffellink, is used to define how Eiffel events relate to each other. The conversion tool is a Java application that uses the Apache Jena framework to create triples. The identifier for an Eiffel event is a UUID. The conversion tool takes each of the generated Eiffel events and connects all of the data for the event to its corresponding UUID by using the properties defined in the vocabularies. In some cases a blank node could have been created to connect data to an event; for example, in Figure 3.2 the entity <urn:uuid:3e6f3d74-db81-45fc-8a8a-c6d1674e410d> could have been a blank node. However, the SPARQL federation engines do not support blank nodes at this time. Instead a new UUID is generated here to represent the blank node.

Figure 3.2: A sample of a dataset in RDF format
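As an illustration of this conversion step, the sketch below builds a handful of triples for one event with Apache Jena. It is not the thesis' actual conversion tool: the UUIDs are reused from earlier examples in the report, and while the vocabulary namespaces and the properties eiffel:iteration and eiffellink:previous_version are taken from the queries in Section 3.4, the particular values assigned here are made up.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.vocabulary.RDF;

public class EiffelToRdfSketch {

    static final String EIFFEL = "https://w3id.org/eiffel/RDFvocab/main#";
    static final String EIFFELLINK = "https://w3id.org/eiffel/RDFvocab/links#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("eiffel", EIFFEL);
        model.setNsPrefix("eiffellink", EIFFELLINK);

        Property iteration = model.createProperty(EIFFEL, "iteration");
        Property previousVersion = model.createProperty(EIFFELLINK, "previous_version");

        // The event's UUID becomes the subject as a urn:uuid: URI; no blank nodes,
        // since the federation engines used in the thesis do not support them.
        Resource event = model.createResource("urn:uuid:3e6f3d74-db81-45fc-8a8a-c6d1674e410d");
        Resource earlierEvent = model.createResource("urn:uuid:aaaaaaaa-bbbb-5ccc-8ddd-eeeeeeeeeee1");

        event.addProperty(RDF.type, model.createResource(EIFFEL + "EnvironmentDefinedEvent"));
        event.addLiteral(iteration, 1);                    // property from the Eiffel vocabulary
        event.addProperty(previousVersion, earlierEvent);  // link property from the Eiffellink vocabulary

        // Serialize the triples, e.g. for loading into a Blazegraph endpoint.
        RDFDataMgr.write(System.out, model, Lang.NTRIPLES);
    }
}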

3.2.4 Creating a SPARQL federation

The final step for creating the SPARQL federation is to connect each of the datasets with an endpoint that will allow for SPARQL queries to be executed over the dataset. The datasets will be uploaded to separate virtual machines in the cloud solution Azure. Blazegraph 2.1.4 will be used on the VMs to deploy the SPARQL endpoints. In total four different A0 (1 vCPU(s), 0.75 GB RAM) Ubuntu VMs will be used.

3.3 SPARQL federation engines

The following section will focus on the background and reasoning for choosing the SPARQL federation engines, in order to answer how effective state-of-the-art source selection strategies are for the SPARQL federation used in this thesis.

3.3.1 Index-based systems

For this thesis I decided to use SPLENDID, FedX-HiBISCus and a combination of these two, SPLENDID-HiBISCus. FedX-HiBISCus uses HiBISCus to create summaries and then FedX executes ASK queries towards the summaries rather than directly towards the SPARQL endpoints. The reason for not choosing CostFed is that three different index-based systems are sufficient, and SPLENDID and FedX-HiBISCus offer the opportunity to experiment with these in the combination SPLENDID-HiBISCus.

3.3.2 Index-free systems

For the index-free systems there were also three systems analysed at the beginning of writing this thesis: Lusail, FedX and Fed-DSATUR. Unfortunately, the Lusail system was not publicly available at the time of this thesis. When contacting the developers of Fed-DSATUR, they regretted to inform that there had been an accident with the server where the code was located and that the code was lost. In light of this I decided to continue the thesis using FedX 3.1 as the only index-free system.

3.4 Queries

The following six queries will be executed over a SPARQL federation with two endpoints. The queries are divided into three categories. The first category is simple queries. There are two queries in this category and the purpose is to get a reference point in regards to the minimum execution time. SQ1 searches for any entry of type EnvironmentDefinedEvent; there are two of these in each dataset. SQ2 uses a more uncommon predicate but should still be very easy to execute.

PREFIX eiffel: <https://w3id.org/eiffel/RDFvocab/main#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?id
WHERE {
  ?id rdf:type eiffel:EnvironmentDefinedEvent .
}

Listing 3.1: Simple query 1 (SQ1)

PREFIX eiffellink: <https://w3id.org/eiffel/RDFvocab/links#>
SELECT ?s ?o
WHERE {
  ?s eiffellink:previous_version ?o .
}

Listing 3.2: Simple query 2 (SQ2)

The second category is linked queries, which are used to find connections between the datasets. If you were to execute these queries over only one of the datasets, you would not find all results. LQ1 starts off by finding the UUIDs of the two EnvironmentDefinedEvents and then finding all other nodes that link to them. After these nodes are found, the type and iteration are extracted as well. The other linked query, LQ2, finds subgraphs of three events that are connected to each other like a triangle. These subgraphs exist within the individual datasets but also overlap between datasets.

PREFIX eiffel: <https://w3id.org/eiffel/RDFvocab/main#>
PREFIX eiffellink: <https://w3id.org/eiffel/RDFvocab/links#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?id ?type ?iteration
WHERE {
  ?source rdf:type eiffel:EnvironmentDefinedEvent .
  ?id ?link ?source .
  ?id rdf:type ?type .
  ?id eiffel:iteration ?iteration .
}

Listing 3.3: Linked query 1 (LQ1)

PREFIX eiffellink: <https://w3id.org/eiffel/RDFvocab/links#>
SELECT ?id ?id2 ?id3
WHERE {
  ?id eiffellink:previous_version ?id3 .
  ?id eiffellink:change ?id2 .
  ?id2 eiffellink:base ?id3 .
}

Listing 3.4: Linked query 2 (LQ2)

The last category is realistic queries. These are meant to resemble a realistic scenario. The first query, RQ1, finds all test cases that have not passed and displays the item under test and which test case failed. The last query looks at all changes to a certain Git repo and views the original submitter, who made each change, and what was changed.

PREFIX eiffel: <https://w3id.org/eiffel/RDFvocab/main#>
PREFIX eiffellink: <https://w3id.org/eiffel/RDFvocab/links#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?groupId ?artifactId ?gavVersion ?testThatFailed
WHERE {
  ?testCase rdf:type eiffel:TestCaseFinishedEvent .
  ?testCase eiffel:outcomeVerdict ?verdict .
  FILTER (?verdict != 'PASSED') .
  ?testCase ?testCaseExecution ?temp .
  ?temp eiffel:testCase ?blanktemp .
  ?blanktemp eiffel:testCaseId ?testThatFailed .
  ?temp eiffellink:iut ?itemUnderTest .
  ?itemUnderTest eiffel:gav ?temp2 .
  ?temp2 eiffel:groupId ?groupId .
  ?temp2 eiffel:artifactId ?artifactId .
  ?temp2 eiffel:gavVersion ?gavVersion .
}

Listing 3.5: Realistic query 1 (RQ1)

PREFIX eiffel: <https://w3id.org/eiffel/RDFvocab/main#>
PREFIX eiffellink: <https://w3id.org/eiffel/RDFvocab/links#>
SELECT ?id ?idVersion ?target ?targetVersion
WHERE {
  ?id eiffel:name "Composition 1" .
  ?id eiffel:eventVersion ?idVersion .
  ?id eiffellink:element ?target .
  ?target eiffel:name "Composition 1" .
  ?target eiffel:eventVersion ?targetVersion .
}

Listing 3.6: Realistic query 2 (RQ2)

4 Results

The following chapter presents the results from the steps presented in the previous chapter. These results will lay the foundation for the answers to the research questions presented in this thesis. The chapter is divided into two major parts. The first part presents the results of the generated Eiffel data, the linked datasets and the conversion of the data into RDF. These results will be used in order to answer the first research question, "How can Eiffel data be represented in RDF format in order for state-of-the-art SPARQL federation engines to execute queries?" The second part of this chapter goes into more depth on how well the index-based and the index-free systems were able to query over the federation of datasets. That part is more focused on answering the second research question, "How effective are the state-of-the-art source selection strategies over a SPARQL federation of datasets containing Eiffel data in RDF format?"

4.1 Generated data

This section provides the results of generating data that will be used and converting it to RDF.

4.1.1 Creating the datasets

In order to generate several datasets and to create links between them, the Eiffel generator was extended. To create links between the datasets, the generator now needed to be able to generate data for several datasets and to generate the data based on the dependencies between the datasets. The user of the generator is given the option of first deciding to have 1-4 datasets. If the user decides on having only one dataset, the generator generates the same thing that the original generator would, since there are no links to be created between datasets. For two datasets, the second dataset is automatically dependent on the first one. Figure 4.1 shows the choices given when choosing to have four datasets. Here each circle represents a dataset and an arrow shows which other dataset it is dependent upon. For example, in the third option, dataset 2 is dependent on dataset 1, and dataset 4 is dependent on dataset 3.


Figure 4.1: A Figure showing the possible dependency setups when choosing four datasets in the generator.

For convenience's sake, the generator now also writes to file and uses separate files for all of the datasets. This means that when generating data for four datasets, the generator produces each as a separate file.

The generated data used for this thesis is very similar to the example data provided by the Eiffel project. When generating data with the Eiffel example setup, the datasets have an average size of 34 MB when using 1000 iterations. These sizes are the same for the data produced by the extended generator; the only difference is the extra link used to combine two datasets, as described previously in the thesis. The conversion tool takes the different datasets described in the previous section and maps each of the JSON objects to a set of triples. The generated data looks promising and is similar to sets of triples from other datasets in BigRDFBench [9]. The only thing differentiating the generated datasets is that each subject is a UUID, where most other datasets have URIs.

4.2 Index-based systems

In order to see the results for the index-based systems we need to look at all the parts that make the systems work. There will be subsections for the generation of the VOID files and summaries, which analyse these files and compare them to the files used for BigRDFBench in order to give an understanding of the result. The final subsection covers the result of the queries run on SPLENDID, FedX-HiBISCus and SPLENDID-HiBISCus. This section gives grounds for the second research question in regards to the index-based systems.

4.2.1 VOID

The VOID files were generated with an existing tool in SPLENDID and were based on the datasets uploaded to the Azure machines. Figure 4.2 shows some examples of the generated VOID, and compared to Figure 2.6 the data looks similar.

Figure 4.2: A Figure showing examples of the generated VOID.

The sizes of the VOID files are about 36 KB for each dataset, which is neither big nor small compared to the VOID files provided in BigRDFBench.

4.2.2 Summaries

The summaries for the generated data have a UUID as either the subject or the object, and sometimes both, in each triple. There is no way of using the authority part of a UUID as you can with a URI, and therefore the whole UUID is stored in the summary, similar to how rdf:type is handled in a summary. This causes the summary to grow a lot in size. The summaries used for the two endpoints have a size of about 150 MB, in contrast to the summary for BigRDFBench which has a size of 561 KB.

4.2.3 Experiments

This increase in data size for the summaries leads to two problems. First, it takes longer to generate the summaries for the federation: generating the summaries for two endpoints takes over 30 minutes. To put this time into perspective, BigRDFBench takes 92 minutes for HiBISCuS and 192 minutes for SPLENDID [9]. Even though these are long times, BigRDFBench consists of over 1 billion triples over multiple datasets, while the two datasets that lead to the times reported here have approximately 1.7 million triples each. The second problem is that the evaluation step for the index-based systems takes too long and times out after 30 minutes for all queries except SQ1. With huge summaries the source selection spends a long time sorting through the summaries, just to end up not narrowing the scope of relevant sources.

For SPLENDID, which uses VOID as its metadata, an unexpected error meant that it was not able to execute at all. This could have been a problem with how the VOID files were loaded, or with some other part of the configuration.

4.3 Index-free systems

This section provides additional answers to the second research question. Where the previous section focused on index-based systems, this one instead looks at the results for the index-free systems. As stated in the method chapter, most index-free systems were unavailable, which means this section only has results for the FedX system.

4.3.1 FedX

To start off the experimentation, two datasets were used in separate endpoints where one dataset was dependent on the other. Only two endpoints were used in order to create a proof of concept. Table 4.1 shows the execution times for all queries defined in Chapter 3.4. The queries were executed in the series SQ1, SQ2, LQ1, LQ2, RQ1 and RQ2. Each series executed FedX's main function, which initializes all configuration. This was done in order to have each query use FedX in the same way and not have any results cached. This series was executed 100 times and the results for each query were saved after the query was executed. The execution time shown in Table 4.1 is the average time for the 100 executions. Together with the execution time, the standard deviation and the number of entries found are also shown in Table 4.1. A rough sketch of this measurement procedure is given after the table.

Query   Average Execution Time (ms)   Standard Deviation (ms)   Entries found
SQ1       254.55                        148.64                      4
SQ2     36064.32                       5117.55                  58494
LQ1     22041.97                       2111.47                  12429
LQ2     77961.11                       6024.54                   8000
RQ1     62019.42                       3253.95              1609-1610
RQ2     24463.30                       1228.32                   3032

Table 4.1: A table showing average execution times, standard deviations and entries found for various SPARQL queries with FedX over a SPARQL federation with 2 endpoints
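The sketch below illustrates the measurement procedure: a fresh federation over the two endpoints is set up for each run and one query is timed. It uses the FedX distribution that ships with RDF4J; whether the FedX 3.1 release used in the thesis exposes exactly this API is an assumption, and the endpoint URLs are placeholders.

import org.eclipse.rdf4j.federated.FedXFactory;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;

import java.util.Arrays;

public class FedXTimingSketch {

    public static void main(String[] args) {
        String sq2 =
                "PREFIX eiffellink: <https://w3id.org/eiffel/RDFvocab/links#> " +
                "SELECT ?s ?o WHERE { ?s eiffellink:previous_version ?o . }";

        // Re-initialize the federation for every measured run so no results are cached.
        long start = System.currentTimeMillis();
        Repository federation = FedXFactory.createSparqlFederation(Arrays.asList(
                "http://vm1.example.org/blazegraph/sparql",   // placeholder endpoint URLs
                "http://vm2.example.org/blazegraph/sparql"));

        long entriesFound = 0;
        try (RepositoryConnection conn = federation.getConnection();
             TupleQueryResult result = conn.prepareTupleQuery(sq2).evaluate()) {
            while (result.hasNext()) {   // count the entries found, as reported in Table 4.1
                result.next();
                entriesFound++;
            }
        } finally {
            federation.shutDown();
        }

        long elapsedMs = System.currentTimeMillis() - start;
        System.out.println("Execution time: " + elapsedMs + " ms, entries found: " + entriesFound);
    }
}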

When looking at the accuracy of the queries we find some interesting results. All of the queries return a consistent number of results for the SELECT clause except for RQ1. Here, 51 out of 100 tests result in 1610 matches, and the remaining 49 tests give only 1609 matches. The reason for this inconsistency is not known, but since it only happens for one specific query, that gives an indication that the issue might lie in the query itself.

5 Discussion

In order to get a better understanding of the results of the thesis, this chapter analyses them further. Here the author gives perspective on the results and discusses them. The goal of this chapter is to get better insight into the results and also an understanding of what could have been done differently. The chapter is divided into three sections that cover the same topics as the results chapter. The first section takes a look at the generated data and its validity with respect to answering the research questions. Secondly, there is a discussion on why the index-based systems did not perform very well. In the last part there is an analysis of the index-free system FedX.

5.1 Generated data

In order to answer the first research question of this thesis, “How can Eiffel data be represented in RDF format in order for state-of-the-art SPARQL federation engines to execute queries?”, we need to look at both the generated data and the conversion tool. The fact that FedX managed to execute queries successfully over a SPARQL federation of this generated data shows that this is indeed possible. The results show that FedX was able to find links between the two endpoints; these results will be further discussed in the section regarding index-free systems. The main uncertainty regarding the generated data is the very fact that it is generated. The validity of the results is affected since the datasets of these events are not taken from an actual system. If the data had been taken from a system that had produced the Eiffel events, then that would have been a stronger indication that it is possible to represent this data in RDF format. With that being said, the original generator is produced by the same creators as the Eiffel project. There are potential improvements that could be made to the RDF converter that could lead to better results for the index-based systems. It is, however, not the generated data itself that is a problem for the index-based systems; the problem lies in the representation of the data in RDF format. This will be discussed in the upcoming section about index-based systems.

With all this being said, I feel confident in saying that I have proven one way of representing Eiffel data in RDF format in order for a state-of-the-art SPARQL federation engine to execute queries.

5.2 Index-based systems

The index-based systems used in this thesis were unfortunately not very compatible with the generated data. This was shown for both FedX-HiBISCus and SPLENDID. The results for FedX-HiBISCus were heavily affected by the generated summaries, and unfortunately the exact problem with SPLENDID could not be pinpointed at this time. This section will further try to analyse and discuss these results.

The summaries used for FedX-HiBISCus grew a lot in size because of the frequent use of UUIDs. The size is almost as big as the actual data, which is not efficient by any means. If the summaries did not store all of the UUID values and instead stored only that the value is a UUID, the size of the summaries would shrink drastically. This would however mean that the benefits of using an index-based system would be lost, since the set of relevant sources in the SPARQL federation would likely be all of the endpoints. FedX-HiBISCus also does not perform optimally with sources that share the same URI authority [12]. These issues might be solvable by representing each event with another identifier besides a UUID, since a UUID consists only of a unique value and gives no indication of what type of data the identifier points to. If each event was a URI that ended with a unique ID and where the authority part of the URI represented the type of the event, then the index-based systems would be able to prune the federation more efficiently.

Unfortunately, it is not clear why the usage of SPLENDID did not work. When comparing the generated VOID files with existing VOID files from other datasets there is nothing that stands out in the same sense as with the summaries. The VOID files have a similar size and similar properties, which leads me to think that there might be some configuration that is not set up properly. An issue that was encountered when working with the index-based systems was that the triple subjects used in this thesis were not regular URIs. There were parts of the systems where they tried to split on the character '/'. This led to errors that needed to be worked around, since a UUID does not contain such characters. This might have been a potential issue, but it is not possible to say without more research.

Since an index-based system uses metadata to reduce the set of relevant sources for each query, it is reliant on the metadata being up to date. This means updating it before each query is executed, which is suboptimal for continuous integration software such as Eiffel. In order for this to be efficient, the time spent generating the metadata would have to be outweighed by the time that the source selection saves during query evaluation. This is not the case for the summaries generated in this thesis, which gives further indication that future work should focus more on index-free systems. The results from the index-based systems were unable to provide a direct answer on how effective these state-of-the-art source selection strategies are. They do, however, give a strong indication that if further work is done to answer this question about effectiveness, then index-based systems might not be the way forward.
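To illustrate the identifier scheme suggested above (event type in the URI authority, UUID at the end), a minimal sketch is shown below. The host naming pattern is purely an illustrative assumption and is not part of the thesis' implementation.

import java.util.UUID;

public class EventUriSketch {

    // Mint a URI whose authority encodes the event type, so that authority-based
    // summaries such as those of HiBISCus can prune endpoints, while the trailing
    // UUID keeps the identifier unique.
    static String mintEventUri(String eventType, UUID id) {
        return "http://" + eventType.toLowerCase() + ".eiffel.example.org/events/" + id;
    }

    public static void main(String[] args) {
        System.out.println(mintEventUri("EiffelCompositionDefinedEvent", UUID.randomUUID()));
        // e.g. http://eiffelcompositiondefinedevent.eiffel.example.org/events/3e6f3d74-db81-45fc-8a8a-c6d1674e410d
    }
}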

5.3 Index-free systems

The index-free systems provided the best results. However, the fact that it was only possible to get hold of one of the systems was quite discouraging. This immediately makes it very hard to measure the effectiveness of the index-free systems. On the other hand, FedX was the only system that was able to provide any measurable result at all. This section will therefore look a bit further into the results given by FedX. Table 4.1 shows the execution times of the queries presented in the same chapter. The execution times by themselves are hard to analyse without being able to compare them to anything else. What we can see is that the time for a very simple query such as SQ1 is much faster than for the rest of the queries, where FedX needed to find links between the two datasets. What gives a good indication of the effectiveness of FedX is the accuracy. Five out of the six queries had 100% accuracy, which shows that the engine is accurate. The only query that did not have 100% accuracy was RQ1. Here the accuracy is still great if you measure it as:

(51 × 1610 + 49 × 1609) / (100 × 1610) ≈ 0.9997

You can, however, also see it as the results being incomplete 49% of the time. FedX is unfortunately the only index-free system used in this thesis and also the only system that could provide a result. Being the only system that provided a result, it gives an indication, regarding the first research question, that the method described in the thesis is one possible way to represent Eiffel data in RDF format. However, in order to measure how effective this method is, a comparison with another system needs to be made.

Since only FedX was able to produce any measurable results, it was decided to only use a federation of two endpoints. This decision was based on several reasons. First, since FedX had successfully been able to query over two endpoints, it could be established that the generated data was correctly generated for an index-free system. After this it was considered a higher priority to produce measurable results for any of the index-based systems in order to be able to compare results. Testing with a larger federation than two endpoints would also cost more, since more VMs in Azure would be needed.

To connect this section about index-free systems to the second research question, which is focused on the effectiveness of the engines, there are two main takeaways. First, in regards to execution time, FedX is the only system that has produced a measurable result, which is positive, but at the same time it is hard to judge whether these results are good or not. The other takeaway is the accuracy of the engine, which provides a great result. It is therefore my conclusion that further research is necessary to provide a more reliable answer in regards to effectiveness. Finally, giving an answer to the main research question, whether this is a viable way of creating a decentralised database system for continuous integration data, is something that is very hard to do at this point. Given my findings regarding the index-based systems, I do not believe that they are a viable way. However, I do think that there is potential in using an index-free system. To get a definitive answer there would need to be more research: there needs to be some sort of comparison between index-free systems, and more endpoints in the experiments in order to test scalability. If these two things show promise, then it could be a viable way of querying Eiffel event data over a SPARQL federation.

6 Conclusion

The goal of this thesis was to answer whether data from Eiffel events can be represented in RDF format in such a way that state-of-the-art SPARQL federation engines can execute queries over it, and if so, how effective they are. No comparison with any existing system was intended; the federation engines were only to be measured against each other.

The data used to answer this was created by a generator provided by the Eiffel project, which was modified to produce a realistic scenario in which two or more datasets are linked together through some form of dependency. The generated data was then converted into RDF triples by a conversion tool created by the author and hosted on separate SPARQL endpoints provided by Blazegraph inside Azure.

The SPARQL federation engines can be divided into two groups: index-based and index-free. The index-based systems need to generate some form of metadata in order to prune the data sources more effectively; the systems chosen for this thesis were SPLENDID and HiBISCuS. For the index-free systems it was unfortunately only possible to use FedX, which, as the name of the category suggests, does not need to generate any index or metadata before executing queries.

FedX was able to execute queries over a federation of two endpoints containing linked datasets. This confirms that it is possible to represent Eiffel data in RDF format and have a state-of-the-art engine execute queries over a federation of SPARQL endpoints. It was, however, not possible to achieve the same with the index-based systems. For HiBISCuS, the data summaries used as metadata grew too large and the source selection timed out after 30 minutes. The most likely reason for the summaries growing so large is the use of UUIDs as subjects in the RDF triples, as sketched below. Why SPLENDID was not able to produce any result is unclear at this time, but it is likely related to the use of UUIDs as well.

It is my opinion that this thesis has shown that Eiffel data can be represented in RDF format and that an index-free system is able to execute SPARQL queries over a federation of such datasets with great accuracy. At this point it is not possible to give well-grounded evidence of its effectiveness in terms of execution time; to do that, a comparison with another system is needed. If such a comparison were to be made, it is the author's recommendation to do it with another index-free system.
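Returning to the UUID issue mentioned above, the listing below sketches what a converted event could look like when loaded into one of the Blazegraph endpoints through SPARQL Update. The urn:uuid: subject form, the eiffel: prefix and the property names are assumptions made purely for illustration; only the general pattern, a unique UUID-based IRI per event, reflects the generated data.

  PREFIX eiffel: <http://example.org/eiffel#>
  PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

  INSERT DATA {
    # Every event is identified by its own UUID-based IRI, which this
    # thesis points to as the likely cause of the oversized summaries
    # in the index-based systems.
    <urn:uuid:3c1f0a52-7d4e-4b6a-9f2d-1a2b3c4d5e6f>
        a eiffel:ArtifactCreatedEvent ;
        eiffel:time "2020-05-04T10:15:00Z"^^xsd:dateTime ;
        eiffel:linksTo <urn:uuid:9e8d7c6b-5a4f-4e3d-8c2b-1f0e9d8c7b6a> .
  }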

6.1 Future work

In order to better assess the effectiveness of these systems, they need to be compared against each other. It would be interesting to do further research on how well index-free systems compare with one another; if a comparison between Lusail and FedX were possible, it could give a better indication of how effective this approach can be. Another approach would be to compare an index-free system with an existing centralised system. This would give a good picture of what already exists and whether an index-free system can be an improvement.


Bibliography

[1] Ibrahim Abdelaziz, Essam Mansour, Mourad Ouzzani, Ashraf Aboulnaga, and Panos Kalnis. “Lusail: a system for querying linked data at scale”. In: Proceedings of the VLDB Endowment 11.4 (2017), pp. 485–498.

[2] H. U. Asuncion, A. U. Asuncion, and R. N. Taylor. “Software traceability with topic modeling”. In: 2010 ACM/IEEE 32nd International Conference on Software Engineering. Vol. 1. May 2010, pp. 95–104. DOI: 10.1145/1806799.1806817.

[3] F. Manola and E. Miller. RDF Primer, W3C Recommendation. 2004. URL: http://www.w3.org/TR/rdf-primer/.

[4] Olaf Görlitz and Steffen Staab. “SPLENDID: SPARQL endpoint federation exploiting voiD descriptions”. In: CEUR Workshop Proceedings 782 (2011). ISSN: 16130073.

[5] Guus Schreiber and Yves Raimond. RDF 1.1 Primer, W3C Working Group Note. 2014. URL: https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/.

[6] Olaf Hartig and Carlos Buil-Aranda. “Bindings-restricted triple pattern fragments”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10033 LNCS (2016), pp. 762–779. ISSN: 16113349. DOI: 10.1007/978-3-319-48472-3_48. arXiv: 1608.08148.

[7] Damla Oguz, Belgin Ergenc, Shaoyi Yin, Oguz Dikenelli, and Abdelkader Hameurlain. “Federated query processing on linked data: a qualitative survey and open challenges”. 2015. URL: http://www.journals.cambridge.org/abstract_S0269888915000107.

[8] Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. “Semantics and complexity of SPARQL”. In: ACM Transactions on Database Systems (TODS) 34.3 (2009), p. 16.

[9] Muhammad Saleem, Ali Hasnain, and Axel-Cyrille Ngonga Ngomo. “BigRDFBench: A Billion Triples Benchmark for SPARQL Endpoint Federation”. In: ().

[10] Muhammad Saleem, Yasar Khan, Ali Hasnain, Ivan Ermilov, and Axel-Cyrille Ngonga Ngomo. “A fine-grained evaluation of SPARQL endpoint federation systems”. In: Semantic Web 7.5 (2016), pp. 493–518.


[11] Muhammad Saleem and Axel-Cyrille Ngonga Ngomo. “HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation”. In: The Semantic Web: Trends and Challenges. Ed. by Valentina Presutti, Claudia d’Amato, Fabien Gandon, Mathieu d’Aquin, Steffen Staab, and Anna Tordai. Cham: Springer International Publishing, 2014, pp. 176–191. ISBN: 978-3-319-07443-6.

[12] Muhammad Saleem, Alexander Potocki, Tommaso Soru, Olaf Hartig, and Axel-Cyrille Ngonga Ngomo. “CostFed: Cost-based query optimization for SPARQL endpoint federation”. In: Procedia Computer Science 137 (2018), pp. 163–174.

[13] Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. “FedBench: A Benchmark Suite for Federated Semantic Data Query Processing”. In: The Semantic Web – ISWC 2011. Ed. by Lora Aroyo, Chris Welty, Harith Alani, Jamie Taylor, Abraham Bernstein, Lalana Kagal, Natasha Noy, and Eva Blomqvist. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 585–600.

[14] Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. “FedX: Optimization techniques for federated query processing on linked data”. In: International Semantic Web Conference. Springer. 2011, pp. 601–616.

[15] D. Ståhl, K. Hallén, and J. Bosch. “Continuous Integration and Delivery Traceability in Industry: Needs and Practices”. In: 2016 42th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). Aug. 2016, pp. 68–72. DOI: 10.1109/SEAA.2016.12.

[16] Daniel Ståhl, Kristofer Hallén, and Jan Bosch. “Achieving traceability in large scale continuous integration and delivery deployment, usage and validation of the Eiffel framework”. In: Empirical Software Engineering 22.3 (2017), pp. 967–995. ISSN: 15737616. DOI: 10.1007/s10664-016-9457-1.

[17] Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck, and Pieter Colpaert. “Triple Pattern Fragments: A low-cost knowledge graph interface for the Web”. In: Journal of Web Semantics 37-38 (2016), pp. 184–206. ISSN: 15708268. DOI: 10.1016/j.websem.2016.03.003.

[18] Maria Esther Vidal, Simón Castillo, Maribel Acosta, Gabriela Montoya, and Guillermo Palma. “On the selection of SPARQL endpoints to efficiently execute federated SPARQL queries”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9620 (2016), pp. 109–149. ISSN: 16113349. DOI: 10.1007/978-3-662-49534-6_4.
