
Degree project in Architecture, second cycle, 30 credits
Stockholm, Sweden 2017

Making a common graphical language for the validation of linked data.

DANIEL ECHEGARAY

KTH
School of Computer Science and Communication


Making a common graphical language for the validation of linked data.

DANIEL ECHEGARAY

Master in Computer Science
Date: July 7, 2017

Supervisor: Cyrille Artho
Examiner: Tino Weinkauf

Swedish title: Skapandet av ett generiskt grafiskt språk för validering av länkad data.

School of Computer Science and Communication


Abstract

A variety of embedded systems are used within the design and construction of trucks at Scania. Because of their heterogeneity and complexity, such systems require the use of many software tools to support embedded systems development. These tools need to form a well-integrated and effective development environment, in order to ensure that product data is consistent and correct across the developing organisation. A prototype is under development which adopts a linked data approach for data integration; more specifically, this prototype adopts the Open Services for Lifecycle Collaboration (OSLC) specification for data integration. The prototype allows users to design OSLC interfaces between product management tools and OSLC links between their data. The user is further allowed to apply constraints on the data conforming to the OSLC validation language Resource Shapes (ReSh).

The problem lies in the prototype conforming only to the language of Resource Shapes, whose constraints are often too coarse-grained for Scania's needs, and in the fact that there exists no standardised language for the validation of linked data. Thus, for framing this study, two research questions were formulated: (1) How can a common graphical language be created for supporting all validation technologies of RDF-data? and (2) How can this graphical language support the automatic generation of RDF-graphs?

A case study is conducted where the specific case consists of a software tool named SESAMM-tool at Scania. The case study included a constraint language comparison and a prototype extension. Furthermore, a design science research strategy is followed, where an effective artefact was sought for answering the stated research questions. Design science promotes an iterative process including implementation and evaluation. Data has been empirically collected in an iterative development process and evaluated using the methods of informed argument and controlled experiment, respectively, for the constraint language comparison and the extension of the prototype.

Two constraint languages were investigated: Shapes Constraint Language (SHACL) and Shapes Expression (ShEx). The constraint language comparison concluded that SHACL is the constraint language with the larger domain of constraints, having finer-grained constraints and also the possibility of defining new constraints. This was based on SHACL constraints being measured to cover 89.5% of the ShEx constraints, whereas the converse coverage was 67.8%. The SHACL and ShEx coverage of the ReSh property constraints was measured at 75% and 50%, respectively. SHACL was recommended and chosen for extending the prototype. On extending the prototype, abstract super classes were introduced into the underlying data model, and the constraint language classes were stated as subclasses; SHACL was additionally stated as such a subclass. This design offered increased code reuse within the prototype but gave rise to issues relating to the plug-in technologies that the prototype is based upon. The current solution still has the issue that properties of one constraint language may be added to classes of another constraint language.


Sammanfattning

En mängd olika inbyggda system används inom design och konstruktion av lastbilar inom Scania. På grund av deras heterogenitet och komplexitet kräver sådana system användningen av många mjukvaruverktyg för att stödja inbyggd systemutveckling. Dessa verktyg måste bilda en välintegrerad och effektiv utvecklingsmiljö för att säkerställa att produktdata är konsekventa och korrekta över utvecklingsorganisationen. En prototyp håller på att utvecklas som anpassar en länkad datainriktning för dataintegration; mer specifikt anpassar denna prototyp en dataintegrationsspecifikation utvecklad av Open Services for Lifecycle Collaboration (OSLC). Prototypen tillåter användare att utforma OSLC-gränssnitt mellan produkthanteringsverktyg och OSLC-länkar mellan deras data. Användaren får vidare tillämpa begränsningar på de data som överensstämmer med OSLC-valideringsspråket Resource Shapes.

Problemet ligger i att prototypen endast överensstämmer med Resource Shapes, vars begränsningar ofta är för grova för Scanias behov, och att det inte finns något standardiserat språk för validering av länkad data. Således, för att utforma denna studie, formulerades två forskningsfrågor: (1) Hur kan ett gemensamt grafiskt språk skapas för att stödja alla valideringsteknologier av RDF-data? och (2) Hur kan detta grafiska språk stödja automatisk generering av RDF-grafer?

En fallstudie genomförs där det specifika fallet består av ett mjukvaruverktyg som heter SESAMM-tool hos Scania. Fallstudien innehöll en jämförelse av valideringsspråk och vidareutveckling av prototypen. Vidare följs Design Science som forskningsstrategi där en effektiv artefakt sökts för att svara på de angivna forskningsfrågorna. Design Science främjar en iterativ process inklusive genomförande och utvärdering. Data har empiriskt samlats på ett iterativt sätt och utvärderats med hjälp av utvärderingsmetoderna informerat argument och kontrollerat experiment, för valideringsspråkjämförelsen och vidareutvecklingen av prototypen.

Två valideringsspråk undersöktes: Shapes Constraint Language (SHACL) och Shapes Expression (ShEx). Resultatet av valideringsspråksjämförelsen konkluderade SHACL som valideringsspråket med en större domän av begränsningar, mer finkorniga begränsningar och med möjligheten att definiera nya begränsningar. Detta var baserat på att SHACL-begränsningarna uppmättes täcka 89,5 % av ShEx-begränsningarna och 67,8 % för det omvända. SHACL- och ShEx-täckningen för Resource Shapes-egenskapsbegränsningar mättes till 75 % respektive 50 %. SHACL rekommenderades och valdes för att vidareutveckla prototypen. Vid vidareutveckling av prototypen infördes abstrakta superklasser i den underliggande datamodellen. Superklasserna tog i huvudsak rollen som tidigare klasser för valideringsspråk, som istället utgjordes som underklasser. SHACL anges som en sådan underklass. Denna design erbjöd hög kodåteranvändning inom prototypen men gav också upphov till problem som relaterade till plugin-teknologier som prototypen bygger på. Den nuvarande lösningen har fortfarande problemet att egenskaper hos ett valideringsspråk kan läggas till klasser av ett annat valideringsspråk.


Contents

Contents
List of Figures
List of Tables

1 Introduction
1.1 Problem and Research Question
1.2 Purpose
1.3 Ethics and Sustainability
1.4 Scope
1.5 Limitations
1.6 Disposition

2 Background
2.1 Linked data
2.2 Open Services for Lifecycle Collaboration
2.3 Resource Description Framework
2.4 OSLC Tool-chain
2.5 RDF Constraint languages
2.6 Summary

3 Related Work
3.1 Shapes Constraint Language
3.2 Shapes Expression
3.3 OSLC Resource Shape
3.4 SPARQL Inferencing Notation
3.5 Web Ontology Language
3.6 Description Set Profiles
3.7 Summary

4 Lyo toolchain modeling and code generation prototype
4.1 Functionality
4.2 Extensions
4.3 Technologies
4.3.1 Eclipse Modeling Framework Core
4.3.2 Sirius
4.3.3 Acceleo
4.4 Summary

5 Research Method
5.1 Research Phases
5.1.1 Case study
5.2 Design Science
5.2.1 Design as an Artifact
5.2.2 Problem Relevance
5.2.3 Design Evaluation
5.2.4 Research Contribution
5.2.5 Research Rigor
5.2.6 Design as a Search Process
5.2.7 Communication of Research
5.3 Research Strategy Motivation
5.4 Summary

6 Constraint Language Comparison
6.1 Features
6.2 Constraint coverage
6.3 Summary

7 Implementation
7.1 Evaluation
7.1.1 Task
7.1.2 Evaluation Criteria
7.2 Iterative Process
7.2.1 First iteration: Learn by doing
7.2.2 Second iteration: Inheritance for code reuse
7.2.3 Third iteration: Abstract super class for cohesion
7.2.4 Fourth iteration: reference attributes and backwards compatibility
7.2.5 Fifth iteration: Breaking name conventions and code clean up
7.3 Summary

8 Discussion and Conclusion
8.1 Comparison between constraint languages
8.2 Implementation
8.3 Research findings

9 Future Work

Bibliography

A Lyo prototype meta-model
B SHACL on ShEx coverage
C ShEx on SHACL coverage
D SHACL on ReSh coverage
E ShEx on ReSh coverage


List of Figures

2.1 An illustration of lifecycle management tools integrated with a linked data approach and forming an OSLC toolchain.
4.1 A simple high-level model of how three tools are connected through their data. The letter 'P' stands for producing data and 'C' for consuming data.
4.2 A simple conceptual model of how the prototype currently works and how it should be extended.
5.1 An overview of the research phases.
5.2 An overview of how design science research was applied for the implementation in this thesis.
6.1 Top left: SHACL and ReSh. Top right: ShEx and ReSh. Bottom left: SHACL and ShEx. Bottom right: SHACL, ShEx and ReSh.
6.2 To the left, amount of ShEx constraints covered by SHACL. To the right, amount of SHACL constraints covered by ShEx.
6.3 To the left, amount of ReSh constraints covered by SHACL. To the right, amount of ReSh constraints covered by ShEx.
7.1 A modelled figure replicating a subset of the SESAMM-tool database with classes and properties obfuscated.
7.2 The meta-model extension in the first iteration.
7.3 A model designed in the first iteration. All elements on the left side conform to pre-existing ReSh constraints. The elements on the right conform to the extended constraint language, SHACL. The language elements are spread over two domains with two associated namespaces depicted as 'nsp', allowing any element to be reached by following their unique URL.
7.4 The meta-model extension in the second iteration. Inheriting from the original classes.
7.5 The meta-model extension in the third iteration. An abstract resource and property.
7.6 The meta-model extension in the fourth iteration. An abstract Shape and Property are applied as adaptors in an adaptor pattern with ShaclShape and ShaclProperty as adaptee classes; the abstract elements are referenced similarly to Resource and ResourceProperty, excluding the keyword 'resource'.
7.7 The meta-model extension in the fifth iteration. An abstract Shape and Property as adaptors, with the additional adaptee classes Resource and ResourceProperty.
A.1 Meta-model of the prototype


List of Tables

6.1 Table comparing features in the languages of SHACL and ShEx
B.1 Table of how SHACL Core constraint components are covered by constraints of ShEx language.
C.1 Table of how ShEx Node Constraint Semantics are covered by constraints of SHACL language.
D.1 Table of how ReSh Property Constraints are covered by constraints of SHACL language.
E.1 Table of how ReSh Property Constraints are covered by constraints of ShEx language.


Chapter 1

Introduction

“Begin at the beginning,” the King said, gravely, “and go on till you come to an end; then stop”.

Lewis Carroll, Alice in Wonderland

This thesis is part of the ESPRESSO project, a collaboration between Scania and the Royal Institute of Technology (KTH). The overall objective of the project is to develop and adapt model-based techniques that improve quality and reduce costs for embedded systems in trucks, focusing on safety-critical systems [1].

A variety of embedded systems are used within Scania. Because of their heterogeneity and complexity, such systems require the use of many software tools to support embedded systems development. These tools need to form a well-integrated and effective Development Environment (DE), in order to ensure that product data is consistent and correct across the developing organisation.

Currently, there is a modelling tool under development which will be referred to as the prototype. The prototype follows a linked data approach for data integration. This architecture is the opposite of a centralised integration approach where all data is stored in one point. The prototype allows users to create models of how tools share their data by following a set of constraint rules. The tools are integrated at the data level, forming a tool-chain. While a linked data, distributed approach seems promising, it creates a challenge to understand and manage the overall information structure that is now handled across the many tools. In particular, it is necessary to investigate how such a distributed approach to data management can be reconciled with the need to have control over the overall information model in the organisation.

Model-based development reduces errors and misunderstandings between different sections within companies [2]. Thus, a tool like this can be valuable. The question lies in how to extend this graphical language to support several constraint languages, and how the code that the prototype generates could be supported by external validation software modules.

A case study is conducted where the specified case, or point of view, is a software tool called SESAMM-tool within the organisation Scania, in order to find which constraint language should be used and how the prototype can be extended to use it. SESAMM-tool has been developed at Scania in the department RESA. It is a tool for modelling functionalities within Scania's vehicle system.

At the time of writing this thesis, the prototype supports one constraint language and conforms to the constraint language OSLC Resource Shapes. The problem is two-fold. First, the property constraints defined by the OSLC Resource Shapes vocabulary are broad and general; being very coarse-grained, they do not allow the construction of validation rules of a more specific and complex nature, which makes the tool impractical at a large company such as Scania. The second problem is within the prototype and how it can be made compatible with an external automatic validation module. There are future plans at Scania to append modules to the prototype for performing automatic validation. These external validation modules make use of the generated code from the prototype.

In this thesis, consideration has been given to how the generated code can support such a module for automatic validation. The validation module itself is not in the scope of this thesis.

1.1 Problem and Research Question

The purpose of this thesis is to construct a general graphical language for a tool-chain modelling system that conforms to a given validation language. At the time of writing this thesis, there exists a prototype that conforms to one validation language. This thesis is in the context of that prototype.

The question arises how further validation languages can be integrated into the prototype, making it a common graphical language supporting more than one validation language for linked data. This could be done by either a conjunction or a disjunction of the validation languages in the graphical language. Furthermore, while constructing the graphical language, consideration has been taken on how it can support RDF-graph generation. Hence, the two following research questions were constructed.

1. How can a common graphical language be created for supporting all validation technologies of RDF-data?

2. How can this graphical language support the automatic generation of RDF-graphs?

1.2 Purpose

At the time of writing this thesis, the modelling prototype conforms to a constraint language for linked data named OSLC Resource Shapes (ReSh). The prototype automatically generates Java classes that represent validating ReSh resources. As mentioned, the problem with having a modelling prototype restricted to ReSh in an industrial company such as Scania is that the sets of property constraints defined in ReSh are coarse-grained; there is a need for more fine-grained constraints. A further reason for extending the prototype is that there exist software modules for automatic validation that support other large constraint languages such as Shapes Constraint Language (SHACL) and Shapes Expression (ShEx). The extended purpose is to analyse different constraint languages from the perspective of a software tool named SESAMM-tool and to investigate how the existing prototype can be extended to suit the need of applying constraints to SESAMM-tool data and making it support automatic validation.


1.3 Ethics and Sustainability

Ethics has been considered towards the thesis employer, Scania. This has been done by not exposing data that may harm the company and, furthermore, by having the thesis approved for publishing by Scania ambassadors. The prototype that has been extended within this thesis simplifies the process of achieving tool interoperability. It allows software tools to exchange information and, in this project, contributes to the quality and cost of the development of embedded systems, primarily at Scania, but the results may be generalised to other cases.

1.4 Scope

The study conducted for this thesis has primarily been carried out at Scania Tekniskt Centrum from January 2017 until June 2017. The main focus has been to study and investigate constraint languages, and an implementation consisting of further development of an existing modelling tool prototype for software tool-chains.

1.5 Limitations

The thesis has two obvious limitations. The first is that the work in this thesis is case study research, meaning that the conducted research has been done for the specific case of SESAMM-tool in a truck manufacturing company named Scania. Due to this, research should be conducted for other scenarios to further generalise the results. The second limitation has to do with the research question "How can a common graphical language be created for supporting all validation technologies of RDF-data?". Due to the limited period of time, it has not been possible to investigate all linked data validation technologies. Therefore, a subset of constraint languages has been chosen and analysed.

1.6 Disposition

The following chapters of this thesis are structured as follows. Chapter 2 covers the theoretical background, followed by related work presented in Chapter 3. Chapter 4 describes the prototype, its purpose and the intended extension. In Chapter 5, the research method used in this study is described. Chapter 6 presents the results from the constraint language comparison, followed by the prototype implementation results presented in Chapter 7. Chapter 8 holds the discussion and conclusions of the results, followed by future work, which is presented in Chapter 9.


Chapter 2

Background

2.1 Linked data

The structure of linked data is three-fold: it is composed of three components that are linked together. The linked components are collectively called a 'triple' and consist of a subject, a predicate and an object. Each component should be dereferenceable, to retrieve its value and possible further relations. Linked data is built upon web technologies such as the Hypertext Transfer Protocol (HTTP), the Resource Description Framework (RDF) and Uniform Resource Identifiers (URIs). In short, linked data is data linked through web technologies. Instead of serving web pages, these technologies point to data readable by a computer. The following example gives the reader an intuitive feel for linked data:

Alice -> From -> Sweden.

Given this triple of data, a user is able to dereference any part of it. Dereferencing the subject 'Alice' should give further information about Alice and show other triples for which Alice is the subject. Dereferencing the predicate 'From' could return the definition of the predicate, for instance: 'Has as subject a Person; the object should be the country of origin of that Person'. Dereferencing 'Sweden' should return a descriptive text about Sweden and possibly other triples linked to Sweden. This procedure continues for every component and can accumulate into large graphs. The semantic web [3] also builds on these technologies and is described as a shared 'web of linked data', where data is shared across application and enterprise boundaries [4]. In the case of Scania, the desire is to use linked data technology within the company boundaries.

2.2 Open Services for Lifecycle Collaboration

Open Services for Lifecycle Collaboration (OSLC) is an open community creating specifications for the integration of software tools. The specifications are based upon linked data and internet standards such as Representational State Transfer (REST) and Uniform Resource Locators (URL). The integration is done by integrating the data of tools and workflows in support of end-to-end processes. The community is separated into work groups that work in different 'domains', with one topic per domain. To list a few of them: Change Management (CM), Requirement Management (RM), Quality Management (QM), etc. Workgroups investigate several integration scenarios within the domains and specify common vocabularies needed to support the scenarios [5]. OSLC is mainly composed of the OSLC Core specification and the OSLC domain specifications. OSLC Core specifies a general interface between different tools. The idea is to have a minimalistic approach, such that the core vocabulary contains a bare minimum to act as a general specification for integrating tools [6].

The OSLC domain specifications specialise in different lifecycle tools, meaning that the domain for change management has a vocabulary that supports data integration for change management tools. These domain specifications are to be further specialised by users of the OSLC specification, to allow rules that are more specific to their own tools. The OSLC integration protocol consists of the conjunction of the OSLC Core specification and one or more OSLC domain specifications [7].
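As a rough illustration of how a domain vocabulary is used, the sketch below describes a change request with terms from the OSLC Change Management vocabulary. The resource URI is a hypothetical placeholder and only a couple of properties are shown; it is a sketch, not part of the thesis prototype.

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix oslc_cm: <http://open-services.net/ns/cm#> .

<http://cm.example.com/changeRequests/42>
    a oslc_cm:ChangeRequest ;
    dcterms:title "Wheel-speed sensor reports wrong value" ;
    oslc_cm:status "Open" .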

2.3 Resource Description Framework

Resource Description Framework (RDF) is a technique for implementing linked data, defined by the World Wide Web Consortium (W3C). An RDF document consists of three types of element, i.e. resources, properties and statements. A resource is addressed by a URI and may be part of a web page, an entire web page or a real-life object. A property is an attribute describing a resource and defines its permitted values and relations [8]. A resource together with a property and the value of that property forms a statement. Allowed property values may be a literal, a resource or other statements [9, 8]. The reader may consider a statement a triple in this context, and further linked statements may be referred to as an RDF-graph.
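As a minimal sketch of these element types (the ex: namespace and names below are hypothetical placeholders), the two statements describe one resource: in the first the property value is a literal, and in the second it is another resource, which is how statements become linked into an RDF-graph.

@prefix ex: <http://example.org/ns#> .

ex:truck123 ex:modelName "R450" .
ex:truck123 ex:producedBy ex:Scania .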

2.4 OSLC Tool-chain

Perhaps the simplest example of a tool-chain is a compiler and linker, libraries and a debugger, where one tool gives input to the next tool. A variety of methods can be used for system integration when constructing a tool-chain. For the project in this study, it was achieved by linked data integration, meaning that each tool has its own data and exchanges information by having that data linked, mainly through RDF technology. An OSLC tool-chain follows the linked data approach and publishes its data by adopting an OSLC specification. On implementation, an adapter is created for each tool in the tool-chain that handles the communication in the tool-chain [10]. In figure 2.1, a simple OSLC tool-chain is illustrated. The dotted lines represent a data-level integration achieved by linking data from different tools using RDF technology. An example of this is the predicate called testedBy in the figure. Let the subject data be named changeRequest, residing in the change management tool, and the object be named testCase, residing in the test management tool. An arbitrary adapter may then request the changeRequest data from the change management tool and simultaneously get the URIs of its predicate and its object. Subsequently, a request can be made to retrieve the testCase data from the test management tool. Adaptors are the tool-chain's means of communication; they essentially act as a REST interface for each tool and allow the tool-chain to scale, with the possibility of adding additional tools by letting them conform to the REST framework.
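Expressed as data, the testedBy example could look roughly like the sketch below. The tool URIs and the ex:testedBy predicate are hypothetical placeholders; the point is only that the subject lives in the change management tool, while the object is a URI served by the test management tool's adaptor and can be dereferenced there.

@prefix ex: <http://example.org/toolchain#> .

<http://cm-tool.example.com/resources/changeRequest/17>
    ex:testedBy <http://test-tool.example.com/resources/testCase/4711> .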


Figure 2.1: An illustration of lifecycle management tools integrated with a linked data approach and forming an OSLC toolchain.

2.5 RDF Constraint languages

This section defines what is meant by a constraint language in the context of this thesis and briefly describes the existing varieties of constraint languages for RDF data. The focus of some constraint language technologies is not on validating data, but in practice they are still used for it. The constraint languages differ, for example, in expressiveness, in how constraints may be expressed and in whether or not the language supports inference. An example of inference in a constraint language can be stated with the two given RDF data triples 'Eva-likes-Hubert' and 'Hubert-ofType-Dog'; by inference, the triple 'Eva-likes-Dog' may be derived. Depending on the context, inference can be an undesirable feature, as in the case where Eva likes Hubert but hates other dogs. Unlike XML, whose standard constraint language is XML Schema, and SQL, whose standard constraint language is DDL [11], RDF data does not have any standard constraint language; there exist only proposals, and some of the languages were, at the time of writing this thesis, still being worked on. To list a few RDF constraint languages: ShEx, SHACL and ReSh. These languages and a few additional ones are explained in the following chapter.
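Written out as triples (with a hypothetical ex: namespace), the inference example looks as follows; the commented triple is not asserted but could be derived by an inference-capable language, which is exactly the behaviour that may or may not be wanted.

@prefix ex: <http://example.org/ns#> .

ex:Eva ex:likes ex:Hubert .
ex:Hubert a ex:Dog .
# an inferencing language might derive: ex:Eva ex:likes ex:Dog .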


2.6 Summary

Linked data consists of the linked components subject, predicate and object, which ideally should be dereferenceable, giving information and additional links for each component. RDF is a common technique used for implementing linked data and consists of a resource, a property defining constraints, and the value of that property. These three components form a statement that may be considered a triple, and interlinked triples are called RDF-graphs. At the time of writing this thesis, there exists no standard validation language for RDF data, only proposals. OSLC is a community creating specifications for tool interoperability and provides a standard that facilitates tool-chain integration. An OSLC tool-chain is a collection of tools that are integrated by adopting OSLC specification standards and consequently by a linked data approach.


Chapter 3

Related Work

This chapter describes constraint languages used for validating RDF data. The described languages are SHACL, ShEx, OSLC Resource Shapes, SPIN, OWL 2 and DSP. For each language, an example is provided of a schema validating an RDF data node.

3.1 Shapes Constraint Language

Shapes Constraint Language (SHACL) is a constraint language used for validating and describing the shape of RDF data [12]. The language is currently under development by the W3C Data Shapes Working Group [5]. SHACL makes use of a schema construct called shapes. In general, a shape can be described as a collection of predicates with associated constraints that is used to describe the shape of RDF data [13, 5]. A SHACL shape can be considered a collection of scopes and constraints, where the scopes specify which data nodes should be validated and the constraints determine how a node should be validated. SHACL supports validation of graph-based and object-oriented data, unlike XML Schema, which is constrained to tree structures [14]. SHACL is based on RDF and provides a vocabulary for classes, properties and integrity constraints on instances.

:User a sh:Shape ;
    sh:property [
        sh:predicate foaf:name ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:predicate foaf:familyName ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:predicate foaf:mbox ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:nodeKind sh:IRI ;
    ] .

:User sh:scopeNode :Daniel .

Code snippet 3.1: Example of a SHACL shape describing RDF-data of a User.


:Daniel a foaf:User ;
    foaf:name "Daniel" ;
    foaf:familyName "Echegaray" ;
    foaf:mbox <mailto:darr@kth.se> .

Code snippet 3.2: RDF data, representing an instance of the class user.

In code snippet 3.1, a simple example of a User shape is demonstrated. The User consists of three properties, i.e. the predicates name, familyName and mbox. The cardinality of each property is exactly one and is defined by the constraints "sh:minCount" and "sh:maxCount". The bottom line declares a scope for the User shape, allowing users to declare nodes that the shape should target; an example of a node that is validated correctly is demonstrated in snippet 3.2.

A common syntax in RDF data is to use prefix bindings for namespaces. The namespaces define a vocabulary of properties and classes. A user may then follow the URL and get information about the defined node or property. For instance, the URL http://xmlns.com/foaf/spec/#term_givenName would be equivalent to stating foaf:givenName; the URL may be followed to get information about the property, similar to using a dictionary. Public vocabularies such as "foaf" allow users to control their data in a non-proprietary way and, for example, to describe characteristics of people or other specialisations stated in the vocabulary.
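As a small illustration of prefix bindings (the foaf namespace IRI below is the commonly used one, while the ex: namespace and subject are hypothetical placeholders), a prefix is declared once and every term written with it expands to a full URL:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/people#> .

ex:Daniel foaf:givenName "Daniel" .
# foaf:givenName expands to <http://xmlns.com/foaf/0.1/givenName>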

3.2 Shapes Expression

Shapes Expression (ShEx) is a language with notions of regular expressions. A ShEx schema is a collection of labelled shapes and node constraints [15, 5]. The reader may think of a shape as a lesser schema used for validating an RDF triple, and of a node constraint as a description of an RDF node. More precisely, a ShEx schema consists of 'shape expressions', described as:

"A collection of shapes and of node constraints, possibly combined with AND, OR, and NOT expressions."[16]

A shape expression is composed of four kinds of object. The first, a node constraint (1), defines allowed values for a set of nodes. The second, a shape constraint (2), is used for applying constraints on the allowed neighbourhood of a node, where a neighbourhood is defined as the triples that contain the node as a subject or an object. An external shape (3) is used as an extension mechanism for ShEx. The fourth, a shape reference (4), is used for identifying other shapes in a schema. The four objects may be combined with the operators AND, OR and NOT [15].

:User {
    foaf:name xsd:string {1},
    foaf:familyName xsd:string,
    foaf:mbox shex:IRI
}

Code snippet 3.3: The User example, expressed as a ShEx shape


Code snippet 3.3 is a ShEx shape corresponding to the SHACL shape defined in snippet 3.1. The first property in the shape, 'name', has a cardinality of exactly one, expressed by "{1}"; this definition of cardinality is superfluous, as the default cardinality in ShEx is 1, but it is stated in the example for illustrative purposes. The language makes use of the regular expression terms (?, +, *) when expressing cardinalities, i.e. zero-or-one, one-or-more and zero-to-many, and a precise lower and upper bound is expressed by defining {lowerBound,upperBound}.

3.3 OSLC Resource Shape

OSLC Resource Shape (ReSh) is a high-level RDF vocabulary for validating and describing the shape of RDF data [17]. A resource shape may list properties, which in turn have to specify their occurrence, property definition and name, and may specify further attributes [18, 19].

<http://myWebPage.com/User> a oslc:ResourceShape ;
    oslc:describes <http://myWebPage.com/User/description> ;
    oslc:property [ a oslc:Property ;
        oslc:name "name" ;
        oslc:occurs oslc:Exactly-one ;
        oslc:propertyDefinition foaf:name ;
        oslc:valueType xsd:string ;
    ] ;
    oslc:property [ a oslc:Property ;
        oslc:name "familyName" ;
        oslc:occurs oslc:Exactly-one ;
        oslc:propertyDefinition foaf:familyName ;
        oslc:valueType xsd:string ;
    ] ;
    oslc:property [ a oslc:Property ;
        oslc:name "mbox" ;
        oslc:occurs oslc:Exactly-one ;
        oslc:propertyDefinition foaf:mbox ;
        oslc:valueType oslc:Resource ;
    ] .

Code snippet 3.4: User example as an OSLC resource shape

An OSLC resource shape corresponding to the shape in snippet 3.1 is defined in snippet 3.4; it validates the data in code snippet 3.2 as correct. Similar to a SHACL shape, an OSLC shape also lists its contained properties. The property constraints "oslc:name", "oslc:occurs" and "oslc:propertyDefinition" are required and respectively define the name of the property, the cardinality of the property and the URI of the property.

3.4 SPARQL Inferencing Notation

SPARQL Inferencing Notation (SPIN) [20] builds on SPARQL, an RDF query language. SPIN essentially provides an abstract vocabulary that represents SPARQL queries in RDF notation [5]. Thus, a user does not have to know SPARQL, although SPARQL extension possibilities exist for constructing one's own rules.


ss:User
    spin:constraint [
        rdf:type spl:Attribute ;
        spl:maxCount 1 ;
        spl:minCount 1 ;
        spl:predicate foaf:name ;
        spl:valueType xsd:string ;
    ] ;
    spin:constraint [
        rdf:type spl:Attribute ;
        spl:maxCount 1 ;
        spl:minCount 1 ;
        spl:predicate foaf:givenName ;
        spl:valueType xsd:string ;
    ] ;
    spin:constraint [
        rdf:type spl:Attribute ;
        spl:maxCount 1 ;
        spl:minCount 1 ;
        spl:predicate foaf:mbox ;
        spl:valueType spl:optional ;
    ] .

Code snippet 3.5: An SPIN example, providing validation rules for a user.

The SPIN language provides functionality for inference, allowing node values to be calculated from other nodes. A use-case example of this is a property area that may be calculated from a width and a height. The provided example 3.5 does not contain any specific SPIN logic, staying consistent with the other examples in this chapter and providing the reader with a simple, comparable overview of the described languages.
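A rough sketch of the width/height example is shown below, assuming a hypothetical ex: vocabulary and using SPIN's sp:text shorthand for embedding the SPARQL query as text (SPIN can also encode the query fully in RDF). In SPIN, ?this conventionally refers to each instance of the class the rule is attached to.

ex:Rectangle
    spin:rule [
        a sp:Construct ;
        sp:text """
            CONSTRUCT { ?this ex:area ?area }
            WHERE {
                ?this ex:width ?w ;
                      ex:height ?h .
                BIND (?w * ?h AS ?area)
            }
        """ ;
    ] .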

3.5 Web Ontology Language

By definition, the word ontology refers to a branch of philosophy within metaphysics that concerns the nature of entities and their relations.1

The Web Ontology Language (OWL), more specifically OWL 2, consists of three notions: axioms, entities and expressions. An axiom is a statement expressed in an OWL ontology, e.g. "It is raining" or "All ravens are black". Entities are essentially atoms, meaning any real-world objects including relations. Expressions are entities combined with constructors; for example, the entities "Male" and "User" combined with a conjunctive constructor are regarded as an expression [21].

The language is declarative, meaning that states of affairs are described in a logical way, for which correct answers (validation) can be derived from formal semantics (mathematical models of the relations of expressions) [22, 21].

OWL uses a reasoner instead of a validator. Inferencing is used when validating ontologies, and OWL does not make a 'Unique Name Assumption', meaning that there is no requirement for resources to have a unique URI. An informal example of this is that two nodes, Daniel and Yash, of the same class User may be considered to be the same resource [18].

1https://www.merriam-webster.com/dictionary/ontology


OWL makes an 'Open World Assumption', compared to SQL, which makes a 'Closed World Assumption'. This means that if an SQL query asks 'is Daniel a user?' and the SQL database does not include a user named Daniel, the reply would be false, whereas a reasoner in OWL 2 would reply 'possibly true' [21].

ex: a owl:Ontology .
foaf:User a owl:Class .
foaf:name a owl:ObjectProperty ;
    owl:minCardinality 1 ;
    owl:maxCardinality 1 ;
    owl:datatype xsd:string .
foaf:familyName a owl:ObjectProperty ;
    owl:minCardinality 1 ;
    owl:maxCardinality 1 ;
    owl:datatype xsd:string .
foaf:mbox a owl:ObjectProperty ;
    owl:minCardinality 1 ;
    owl:maxCardinality 1 ;
    owl:datatype xsd:anyURI .

Code snippet 3.6: An OWL ontology of the resource User

The ontology listed in snippet 3.6 is in Turtle (ttl) format. The ontology describes a user and validates the RDF node in snippet 3.2 as correct.

3.6 Description Set Profiles

Description Set Profile (DSP) is, as the name implies, a description of constraints for a description set. A description set is a set of one or more descriptions, each describing a resource. A description consists of one or more statements regarding only one resource [23].

The intended usage of a DSP is to evaluate whether a meta-data record conforms to the DSP. DSP uses the notions of templates and constraints. There are two levels of templates: description templates and statement templates. The first applies to a single description and contains the constraints of the resource. The second applies to one statement and contains constraints for properties [24]. The templates may be seen as containers for constraints and identifiers, applied to resources or properties.

<?xml version="1.0"?>
<DescriptionSetTemplate xmlns="http://dublincore.org/xml/dc-dsp/2008/03/31">
    <DescriptionTemplate ID="user" standalone="no">
        <ResourceClass>foaf:User</ResourceClass>
        <StatementTemplate minOccurs="1" maxOccurs="1" type="literal">
            <Property>foaf:name</Property>
        </StatementTemplate>
        <StatementTemplate minOccurs="1" maxOccurs="1" type="literal">
            <Property>foaf:givenName</Property>
        </StatementTemplate>
        <StatementTemplate minOccurs="1" maxOccurs="1" type="nonliteral">
            <Property>foaf:mbox</Property>
        </StatementTemplate>
    </DescriptionTemplate>
</DescriptionSetTemplate>

Code snippet 3.7: An example of how DSP may be used to apply constraints on a User.

Code snippet 3.7 shows how constraints are applied to the user data in snippet 3.2. The DSP has a tree-like structure with three 'StatementTemplates', each containing a property of the user.

3.7 Summary

In general, a shape can be described as a collection of predicates with associated constraints describing the shape of an RDF-graph. The shape schema construct is used in ReSh, SHACL and ShEx. SPIN is based on SPARQL and consists of an abstract vocabulary representing SPARQL queries. OWL 2 is a declarative language that uses a reasoner for validating data and applies an 'open world assumption' to data. DSP is a set of descriptions for statements; the language's main focus is to examine whether a meta-data record conforms to it.


Chapter 4

Lyo toolchain modeling and code generation prototype

The purpose of this chapter is to give the reader an overall understanding of the prototype on which the implementation work has been conducted. The prototype builds on the Lyo project. The Lyo project is an educational project aiding the Eclipse community by providing a software development kit to help it adopt the OSLC specification and subsequently build OSLC-compliant tools. During this thesis, other extensions were also applied to the prototype, in the form of an automatic validation module. The other extensions are not in the context of this thesis, but consideration has been taken towards them. The differences will be explained and clarified in this chapter.

Section 4.1, Functionality, explains the purpose and functionality of the prototype and its intended function within industry. The following section 4.2, Extensions, covers extensions implemented in the prototype during the time period of this thesis, and also distinguishes extensions covered by others. The last section 4.3, Technologies, lists and explains the technologies that the prototype is based on and further describes how these technologies have been applied within the prototype.

4.1 Functionality

The prototype consists of three views: the domain specification, adapter interface and tool-chain views. The three views allow the modelling user, who will be referred to as the 'architect', to model an OSLC tool-chain [2, 25]. The purpose of the three views is to reduce complexity for the architect, which is achieved by having the views simulate different levels of abstraction of an OSLC tool-chain.

Through model-based development, the prototype supports a linked data approach for software tool interoperability conforming to OSLC standards [2]. The prototype allows an architect to design high-level models of how different tools are related and of what data they produce and consume; this is illustrated in figure 4.1. The figure shows how management tools both produce and consume data of one another, depicted by the letters 'P' and 'C'. The prototype further allows an architect to design a lower-level model of how data is related and how constraints conforming to ReSh are applied to the data.

After designing both a high-level and a low-level model, the prototype can generate runnable code. The generated code is the code of an OSLC tool-chain, having adaptors for the specified management tools and ReSh resources for validating data. The adaptors should be integrated with a management tool and its database, and they communicate using the HTTP protocol and the REST framework.

Figure 4.1: A simple high-level model of how three tools are connected through their data. The letter 'P' stands for producing data and 'C' for consuming data.

To give the reader an intuitive understanding of how the prototype and the generated code work, a simple example scenario will be described. Consider a scenario with two management tools, one called Bugzilla and another, a Requirement Management tool, abbreviated RM. Bugzilla produces data called ChangeRequest; RM produces data called Requirement. A company that uses these management tools wishes to link their data by a linked data approach, e.g. 'Requirement -> originateFrom -> ChangeRequest'. A solution is to use the prototype, which allows an architect to model how these management tools should be integrated. The architect may further apply constraints, such as that a ChangeRequest has to have an author, and that the author is of type Employee, et cetera. When the architect is done modelling, the prototype may be used to generate runnable code. The code would allow these two tools to communicate with the help of the OSLC specification protocol and to use the generated ReSh shapes for validating their data.
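Sketched as data and constraints, the scenario could look roughly as below. All URIs and the ex: vocabulary are hypothetical placeholders, and the shape merely follows the ReSh style of code snippet 3.4; it is not output from the prototype.

@prefix oslc: <http://open-services.net/ns/core#> .
@prefix ex:   <http://example.com/vocabulary#> .

<http://rm-tool.example.com/requirements/12>
    a ex:Requirement ;
    ex:originateFrom <http://bugzilla.example.com/changeRequests/42> .

<http://example.com/shapes/ChangeRequest> a oslc:ResourceShape ;
    oslc:property [ a oslc:Property ;
        oslc:name "author" ;
        oslc:occurs oslc:Exactly-one ;
        oslc:propertyDefinition ex:author ;
        oslc:valueType oslc:Resource ;
        oslc:range ex:Employee ;
    ] .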

4.2 Extensions

The modelling prototype conforms to the OSLC specification and generates Java code representing ReSh resources [26]. An objective of this thesis is to investigate how the prototype can be extended to conform to other constraint languages and to generate resources in Java code representing the extended language. A graphical overview of which extensions are included, and which are not included, in the scope of this thesis is illustrated in figure 4.2. The dashed arrows represent how the prototype is to be extended, where it can be seen that the elements Modelling and Java class representing shape already conform to the element OSLC and should be extended to the element Constraint language. An element with the name Automatic validation is also presented in the figure; this extension is not in the scope of this thesis but is applied to the prototype in parallel. Upon extending Java class representing shape, consideration has been taken to support the automatic validation, which is therefore positioned halfway outside of Prototype. Besides automatic validation, a purpose of the module is to use the generated Java resources and convert them to RDF-graphs that can either be populated with data or be used as a validating schema.

Figure 4.2: A simple conceptual model of how the prototype currently works and how it should be extended.

4.3 Technologies

4.3.1 Eclipse Modeling Framework Core

EMF is a set of plug-ins that can be used to model data and generate code. EMF distinguishes between its meta-model and an actual model. The meta-model describes the overall structure of the model, whereas the model itself is a concrete instance of the meta-model. The user defines a domain model that can be used to generate Java code; a domain model consists of the data. The EMF tools allow the data model to be modelled with UML diagrams [27]. EMF is used for constructing the meta-model in the prototype. Illustrated in Appendix A.1 is the meta-model and underlying data model that the prototype is based on.

4.3.2 Sirius

Sirius is a framework for creating graphical modelling workbenches. Sirius is based on EMF-structured data; in other words, Sirius visualises an EMF data model and allows functions to be expressed on it.

Sirius has a model and editors; the model defines the complete structure of the modelling workbench, while the editors consist of diagrams, tables and trees.


Sirius is made up of two parts, the Specification Environment (SE) and the Runtime Environment (RE). The SE is for the specifier/developer to create the functionalities of the modelling tool. The RE is for the architect who uses the modelling tool. The specification is executed in the RE and is viewpoint-based, meaning that different levels of representation are organised in different viewpoints, with the purpose of reducing complexity for the intended user.

An example of this, which corresponds to the prototype, is that there exist three different views, i.e. the domain specification, adapter interface and tool-chain views. These three views exhibit different levels of detail within the tool-chain, also making each view more specialised for its intended user [28].

Besides the purpose of creating a modelling tool, Sirius supports plug-in extensions for code generation, document generation and validation [29].

4.3.3 Acceleo

Acceleo [30] is a code generation module that allows its users to generate code and provides tools for doing so. Acceleo implements the 'Model To Text Transformation' language1 (MTL), which in turn uses EMF data models. The code generator module in the prototype is based on Acceleo. The work in this module has to a large extent been synchronised with another thesis student working on an automatic validation module, meaning that the code generation from the code generator module has been modified to conform with the validation module.

4.4 Summary

The purpose of the prototype is to allow users to create models. The models may be used on a day-to-day basis, for discussion, and to give an overview of data and tool relations.

The prototype allows a user to model relations between data and tools, and to generate code corresponding to the model. The generated code consists of interfaces in the form of OSLC adaptors that may be applied to management tools for constructing an OSLC tool-chain. The prototype further allows users to apply ReSh constraints to data and to generate corresponding Java resources.

An additional extension that is under development is the ability to use the generated Java resources to automatically create RDF triples and store them in a triple store. A triple store can be seen as the RDF equivalent of an SQL database.

The prototype is based on three different technologies. EMF is the technology used to define the underlying data model of the prototype. Sirius is the technology used for constructing the user interface of the prototype and is based upon the EMF data model. Acceleo is the technology that the code generator module is based on; during code generation, it takes the user-designed Sirius model as input.

1http://www.omg.org/spec/MOFM2T/1.0/


Chapter 5

Research Method

This chapter discusses the methods used in the conducted study. Section 5.1, Research Phases, gives an overview of the research phases. The following section 5.2, Design Science, describes a research approach named Design Science and how it was applied in this study. Section 5.3, Research Strategy Motivation, argues for and motivates the methods used.

5.1 Research Phases

The following section has been added to this thesis to give the reader an overview of the research phases of the study. Figure 5.1 illustrates this graphically and shows that there have been three phases. The first phase was a pre-study and was further partitioned into two sub-phases, where 'Prototype study' refers to the practical knowledge acquired to be able to work with and modify the prototype.

Figure 5.1: An overview of the research phases.

5.1.1 Case study

The research has taken the form of a case study, meaning that the conducted research is seen through the lens of a given case. In this study, the case consisted of a software tool called SESAMM-tool, in a truck manufacturing company named Scania. The stated research questions were considered the studied objects.

The case study consisted of two phases, the first of a theoretical nature, where different validation technologies for RDF data were analysed in the context of supporting the case, SESAMM-tool. A proposition was then made to stakeholders, who decided which technology to advance with. The second phase had a practical nature, extending the existing prototype to support the chosen technology.

Constraint Language Comparison Phase

This phase largely consisted of an in-depth study of the two constraint languages SHACL and ShEx. The comparison was initialised in collaboration with another thesis student. The language comparison was structured as a high-level feature comparison followed by a low-level constraint-by-constraint comparison.

The method of document review was used for gathering data about the languages. The languages were further evaluated by the method of informed argument and practically analysed by a point-by-point comparison, comparing language features and the languages' main constraint components. The expressiveness of the languages was measured as how well the languages could cover each other's constraints: a constraint of one language was considered covered if the second language could replicate it using one or more constraints. The key strengths and weaknesses found were then brought forward, and a narrative analysis and discussion were held for the final choice of constraint language to be used for extending the prototype.

Implementation Phase

This phase was the most time-consuming phase and comprised the extension work on the graphical language. The implementation was done in iterations and included the development of design suggestions for the EMF meta-model. The suggestions were discussed with stakeholders of the prototype, and thereafter implemented and tested. Stakeholders in the implementation phase were from Scania, with SESAMM-tool in interest, and from the OSLC tool-chain interoperability community. The main modules of the prototype are the EMF meta-model, the Sirius graphical interface and the code generator module.

Two important concepts within design science are Relevance and Rigour. For this study, rigour was achieved by having an iterative development process, where empirical knowledge gained from preceding iterations was used as a base for further development in consecutive iterations. Relevance was achieved by proposing design suggestions to stakeholders and discussing them, combined with experience gained from previous iterations. Figure 5.2 graphically describes how the two concepts of Design Science, i.e. Relevance and Rigour, were applied in the iterative process. The iterative process contained proposed design suggestions and development of the prototype, followed by an evaluation. A controlled experiment was used as the evaluation method. A controlled experiment is an experiment in a controlled environment where a selected variable is modified, and the experiment yields results affected by that variable.

The evaluation initially took place in the modelling module of the tool-chain prototype, where a subset of data from the database of SESAMM-tool was modelled. The code generator module then used the model for generating code. The generated code was subsequently assessed and evaluated for conformance with the automatic validation module, a module implemented in parallel by another thesis student.


Figure 5.2: An overview of how design science research was applied for the implementation in this thesis.

5.2 Design Science

Design Science has been used as the research approach. Design science closely relates to behavioural science, framing business needs with research activities. The philosophy is that theory and practice go hand in hand: knowledge and understanding of a problem domain and its solution are acquired in the building and application of a designed artefact [31, 32]. Design science is based upon seven guidelines which are meant to aid a better understanding, execution and evaluation of the research and results. It is not demanded that all guidelines be rigorously followed, although it is recommended that all guidelines be addressed in some manner.

In Design Science, two important concepts are Relevance and Rigour [33]. These are ensured, respectively, by implementing suitable methodologies and by feedback from application in the appropriate environment [33]. Relevance in information science research means taking business needs into account when an artefact is built and evaluated. Rigour in Design Science is achieved by reusing knowledge gained from prior research. This may be achieved by using models and instantiations (prototypes) from previous iterations and by redefining or expanding the evaluation methods used, based upon empirical data [34].

The following sections describe each guideline and how they were followed in this study.

