
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master's thesis, 30 ECTS | Computer Science

2020 | LIU-IDA/LITH-EX-A--20/010--SE

Implementing the GraphQL Interface on top of a Graph Database

Linn Mattsson

Supervisor: Patrick Lambrix
Examiner: Olaf Hartig



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Since becoming an open source project in 2015, GraphQL has gained popularity as a query language used from front-end to back-end, ensuring that no over-fetching or under-fetching is performed. While the query language has been openly available for a few years, there has been little academic research in this area. The aim of this thesis is to create an approach for using GraphQL on top of a graph database, as well as to evaluate the optimisation techniques available for this approach.

This was done by developing logical plans and query execution plans, and the suitable optimisation techniques were found to be parallel execution and batching of database calls. The implementation was done in Java using the graph computing framework Apache TinkerPop, which is compatible with a number of graph databases. However, this implementation focuses on the graph database management system Neo4j. To evaluate the implementation, query templates and data from the Linköping GraphQL Benchmark were used.

The logical plans were created by converting a GraphQL query into a tree of logical operators. The query execution plans were based on four different primitives from the Apache TinkerPop framework, and the physical operators were each influenced by one or more logical operators. The performance tests of the implementation showed that the query execution times were largely dependent on the query template as well as on the number of database nodes visited. A pattern between execution times and the number of threads used in the parallel execution was observed: lower execution times (< 100 ms) were improved when 4-6 threads were used, while higher execution times were improved with 12-24 threads. For the very fast query executions (< 5 ms), threading caused more overhead than the time saved by parallel execution, and in these cases it was better not to use any threading.


Acknowledgments

A big thank you to Olaf Hartig for his guidance and patience regarding the completion of this thesis.

Another thank you to the badminton group of IDA for their friendliness and for inviting me to join their games.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Motivation
   1.2 Aim
   1.3 Research Questions
   1.4 Delimitations
2 Background
   2.1 GraphQL
   2.2 Graphs
   2.3 Graph Database Management Systems
   2.4 TinkerPop
   2.5 Query Representations
   2.6 Optimisation Techniques
3 Method
   3.1 Linköping GraphQL Benchmark (LinGBM)
   3.2 Neo4j
   3.3 Implementation
   3.4 Evaluation
4 Results
   4.1 Logical Plans
   4.2 Query Execution Plans
   4.3 Query Execution Algorithm
   4.4 Performance Tests
5 Discussion
   5.1 Results
   5.2 Method
   5.3 The Work in a Wider Context
6 Conclusion


List of Figures

2.1 GraphQL schema over the Lord of the Rings domain.
2.2 Query and response over the Lord of the Rings domain.
2.3 Property graph over the Lord of the Rings domain.
2.4 GraphQL graph over the Lord of the Rings domain.
3.1 A simplified model of the online shopping portal dataset. All edge labels have been added for a purely descriptive purpose and do not represent the labels present in the actual dataset.
3.2 The interface and concrete classes representing the ordered JSON objects.
3.3 Example query with multiple array fields.
4.1 The logical operators.
4.2 Example query.
4.3 Database graph.
4.4 Logical plans for the example query and the structure of its query response object.
4.5 Boxplots for execution times of query template A.
4.6 Boxplots for execution times of query template B.
4.7 Boxplots for execution times of query template C.
4.8 Boxplots for execution times of query template D.
4.9 Boxplots for execution times of query template E.
5.1 Boxplots with outliers marked as diamonds for the first run for each number of threads and type of threading.


List of Tables

3.1 The difference in notation between the original LinGBM query templates (QTs) and the notation used in this study.


1 Introduction

Since the introduction of the REST architecture in 2000, it has grown to become the most common model for creating Application Programming Interfaces (APIs) for the web. This architecture has many benefits, but has limitations in the domain of fetching specific data for the client. Generally speaking, the endpoints of a REST API should be generic enough to be useful for several different views of a website. On the other hand, the API should not send too much unnecessary data (i.e. forcing the client to over-fetch data), nor should the client have to make an excessive number of requests to get all data (i.e. under-fetching data). Although it is possible to make custom endpoints for e.g. different views, any changes in the user interface mean that the custom endpoint also needs to reflect this. Working like this is not ideal, since it is not only more time-consuming but also more prone to errors.

One company affected by these limitations was Facebook as they were rebuilding their already established mobile applications. Since mobile networks generally are much more limited than the average home network, it is more important to avoid over-fetching and under-fetching for these applications. Facebook was lacking an API for fetching data that was both straightforward to learn and expressive enough to fill their needs. Additionally, their available resources did not output data in the format they wanted - their preference being a graph with objects as nodes. To resolve this, Facebook started to develop the query language GraphQL in 2012 and made it an open source project three years later. [2]

The syntax of GraphQL queries is similar to the structure of JavaScript Object Notation (JSON), and as the name GraphQL suggests, these queries can be understood to focus on a graph representation of an underlying dataset. The query response will always have the same structure as the query, which is a great benefit compared to REST APIs, where the data structures can be arbitrary. However, since GraphQL was developed mainly to facilitate front-end development, back-end development becomes a bit more complex.

1.1 Motivation

A big difference in GraphQL APIs compared to REST APIs is that the back-end consists of only one custom endpoint, which takes the GraphQL query as input. The developer will then have to map the different fields of the query (similar to attributes in relational databases) to functions that retrieve that specific data. Having to do the mapping manually suggests that the GraphQL back-end can be customised to work with any type of Database Management System (DBMS), or even with other APIs or additional data sources. However, it also implies that the developer has to be skilled enough to do this mapping correctly, or has to rely on frameworks that handle the mapping for them.

Since GraphQL queries can be seen as representations of a graph, one might think that it would be common to use GraphQL with a graph DBMS, which is a DBMS where the data is stored as a graph rather than in different tables. However, currently Neo4j is the only graph DBMS that offers an extension to make it compatible with GraphQL, by translating the query into its own query language called Cypher [4]. Neo4j is the most popular graph DBMS1, but there is a lack of out-of-the-box solutions for using GraphQL together with any other graph DBMS.

1.2 Aim

The aim of this study is to develop an approach to create GraphQL interfaces on top of graph databases. The querying will be done with the help of a programming interface adapted to such databases.

Once the approach is developed it will also be implemented, where the programming interface in question will be the framework Apache TinkerPop (henceforth shortened to TinkerPop). Lastly, an evaluation will be done based on the query optimisation of the implementation, to measure how efficient such an approach is.

1.3 Research Questions

There are three questions to consider when conducting this study:

1. How can GraphQL queries be mapped to low-level primitives for accessing a graph database?

2. How can a query execution plan for executing GraphQL queries based on these mappings be represented conceptually?

3. Which kind of query optimisation techniques are possible for such an approach and how effective are these techniques?

1.4 Delimitations

To limit the scope of the study, the focus will be on graph databases whose internal structure is represented as property graphs. Additionally, the only programming interface used will be TinkerPop, which means that the first research question will focus solely on the low-level primitives of this framework.


2 Background

In this chapter, all the necessary concepts of the thesis are presented and explained. The examples will be based on a domain from Lord of the Rings that focuses on the characters Gimli and Legolas, and models their relationship, weapon of choice and source of transportation.

2.1 GraphQL

This section will describe different key parts of the GraphQL framework. In order, we will go through GraphQL schemas, GraphQL queries and resolvers.

For the interested reader, a detailed introduction to GraphQL can be found on GraphQL’s official website1.

GraphQL Schemas

The GraphQL framework is set to operate on a certain type of domain, and the purpose of the schema is to correctly capture this domain. One of the main things specified by a schema is the different types, also called object types, which define the different objects present in the domain. An object type has different fields, which symbolise the different features or attributes of the object. Additionally, abstract types, or interfaces, can be modelled by the schema. An interface serves as a template for the object types that implement it, implying that those object types need to have all the fields specified by the interface, but can also have additional type-specific fields.

Another important aspect of GraphQL schemas is that fields can be assigned arguments, which makes them work similarly to functions. For example, the user could have the option of getting a length in either metres or feet. Finally, there are also scalar types, which declare which field values are valid.
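To make the argument mechanism concrete, the length example above could be declared in a schema roughly as follows (a sketch only; these names are illustrative and not part of the Lord of the Rings schema):

enum LengthUnit {
  METER
  FOOT
}

type Wall {
  # The client picks the unit; METER is the assumed default here.
  length(unit: LengthUnit = METER): Float
}

A query could then request length(unit: FOOT) to have the value converted before it is returned.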

In order to formally present the schema, we must first formalise the domain. The specific domain can be modelled as consisting of three disjoint, finite sets: F ⊂ Fields, A ⊂ Arguments and T ⊂ Types. The sets Fields, Arguments and Types are infinite and represent field names, argument names and type names, respectively. Furthermore, there is a finite set Scalars ⊂ Types that represents scalar type names; more specifically, the set T consists of the union of object types (OT), interface types (IT), union types (UT) and Scalars. There also exists a set of scalar values, Vals, as well as a function values : Scalars → 2^Vals to assign a valid set of values to every scalar type. Lastly, LT = {[t] | t ∈ T}, where [t] are list types. The domain itself is denoted (F, A, T). [3]

Figure 2.1: GraphQL schema over the Lord of the Rings domain.

A formalisation of the schema is presented below, cited from Definition 2.1 by Hartig & Pérez [3]:

A GraphQL schema S over (F, A, T) is composed of the following five assignments:

• fields_S : (OT ∪ IT) → 2^F that assigns a set of fields to every object type and every interface type,
• args_S : F → 2^A that assigns a set of arguments to every field,
• type_S : F ∪ A → T ∪ LT that assigns a type or a list type to every field and argument, where arguments are assigned scalar types; i.e., type_S(a) ∈ Scalars for all a ∈ A,
• union_S : UT → 2^OT that assigns a nonempty set of object types to every union type,
• implementation_S : IT → 2^OT that assigns a set of object types to every interface.

The first bullet point confirms that interfaces and object types should always have some fields assigned to them, and the second one that these fields can have arguments. The third one acknowledges that the type of a field value is either a type or a list type, and that the type of an argument value is always a scalar type. The last two bullet points note that union types can only consist of sets of at least one element, and that every interface is implemented by some object type. In addition to this, a GraphQL schema should also always have a separate root type, where root_S ∈ OT [3].

With this information in mind, the schema for the Lord of the Rings domain is presented in Figure 2.1. We see that the interface Character is implemented by the types Elf and Dwarf, and can be used to intuitively capture the fact that friends can be both elves and dwarves. The schema also shows that elves have some sort of transportation (most likely a horse), and that dwarves can get a lift from a Rider. The Rider is not a type on its own, but a union of the types Elf and Nazgul, meaning that a Rider can be either of these types. The union can be used when two or more types can be considered equal in a certain context, but without the strictness of the interfaces, which force fields on the implementing types.

The types Nazgul and Weapon are straightforward, and the enum has the same functionality as in e.g. C++ and Java. However, there is one type that has not been mentioned before - Query. This type is the root type and defines which types of queries are allowed, and what their return type should be. In this domain, two types of queries are allowed. In the query char(id: ID), the user can input a character id and get the corresponding Character object. For weapon(type: WeaponType), the user chooses which type of weapon they want information on, and the response will be a list of Weapon objects of that type.
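Although Figure 2.1 itself is not reproduced here, the description above suggests a schema along the following lines (a sketch; the exact scalar fields and enum values are assumptions, while the types, the union and the two root fields follow the text):

type Query {
  char(id: ID): Character
  weapon(type: WeaponType): [Weapon]
}

interface Character {
  name: String
  friends: [Character]
}

type Elf implements Character {
  name: String
  friends: [Character]
  transport: String
}

type Dwarf implements Character {
  name: String
  friends: [Character]
  getLiftFrom: Rider
}

union Rider = Elf | Nazgul

type Nazgul {
  name: String
}

enum WeaponType {
  BOW
  AXE
}

type Weapon {
  name: String
  type: WeaponType
}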


Figure 2.2: Query and response over the Lord of the Rings domain. (a) GraphQL query. (b) Response object.

Lastly, it should be noted that a GraphQL schema is consistent if all the fields present in an interface are also present in the object types that implement it [3].

GraphQL Queries

Once a schema is defined, it is possible to start writing queries. An example query based on the schema over the Lord of the Rings domain can be seen in Figure 2.2a. The corresponding example data is later presented in Section 2.2.

The structure of the GraphQL query is very similar to that of a JSON object, and every line of the query is a valid field in the schema. The expression on Elf is an inline fragment. This is used to access fields specific to a certain object type when the return type of the parent field is an interface or union type [8]. Since char(id: ID) has return type Character, accessing the field transport without the inline fragment would yield an invalid query - it is not certain that all Character objects have this field. The response object (Figure 2.2b) is a JSON object, and it follows both the structure of the query and the structure of the different return objects set up by the GraphQL schema.
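Reconstructed from this description, the query in Figure 2.2a has roughly the following shape (the scalar fields and values are assumptions; the inline fragment is the part described above):

{
  char(id: "n1") {
    name
    ... on Elf {
      transport
    }
  }
}

The response object in Figure 2.2b then mirrors the query structure:

{
  "char": {
    "name": "Legolas",
    "transport": "horse"
  }
}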

A formal definition of GraphQL queries has been set up by Hartig & Pérez [3], and their Definition 3.1 is shown below:

A GraphQL query, or a simple query, over (F, A, T) is an expression ϕ constructed from the following grammar where [, ], {, }, :, and on are terminal symbols [and] [...] l ∈ Fields [...].

ϕ ::= f[α] | l : f[α] | on t{ϕ} | f[α]{ϕ} | l : f[α]{ϕ} | ϕ⋯ϕ

The GraphQL query is seen as an expression that can be represented in six different ways. One of the simplest expressions is f[α], which denotes the scalar field f with optional argument α. Likewise, l : f[α] is also a scalar field, but renamed to label l. The similar expressions f[α]{ϕ} and l : f[α]{ϕ} represent requests for related objects rather than scalar values, where {ϕ} is the remaining sub-query ϕ for the related objects. The expression on t{ϕ} represents the inline fragments mentioned earlier, and lastly ϕ⋯ϕ allows for building complex queries by indicating that a query can be made up of two or more sub-queries.

The query response object can also be built by using a similar grammar, cited from Definition 3.2 in [3]:

A GraphQL response object is an expression ρ constructed from the following grammar where [, ], {, }, :, and null are terminal symbols, ϵ denotes the empty word, l ∈ Fields and v, v1, ..., vn ∈ Vals:

ρ ::= l : v | l : [v1⋯vn] | l : null | l : {ρ} | l : [{ρ}⋯{ρ}] | ρ⋯ρ | ϵ

The first two expressions concern scalars, and return a single scalar value and a list of scalar values, respectively. The expression l : null is returned when the field in question has no value. l : {ρ} and l : [{ρ}⋯{ρ}] denote that it is also possible to return objects, either as a single object or listed in an array. Moreover, ρ⋯ρ indicates that more complex return objects can be constructed by combining several of these expressions into one. Finally, if the empty word ϵ is encountered, that means that the value is missing.

Resolvers

When implementing the GraphQL framework, every field of every type will be associated with a resolver. This is a function that specifies what action should be performed to produce the response object for the field in the context of a specific data object. For a field that corresponds to a scalar, the function would simply retrieve the corresponding value from the database. Resolvers for fields that are not scalars will call the resolvers for the sub-fields of the query, and this calling of sub-fields continues recursively until a scalar field is found. [8]
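As a minimal sketch of this idea (not tied to any specific GraphQL library; all names below are illustrative), a resolver can be modelled as a function from a parent data object and the field's arguments to a field value:

import java.util.HashMap;
import java.util.Map;

interface Resolver {
    // Produces the value of one field, given the parent data object
    // and the arguments supplied in the query.
    Object resolve(Object parent, Map<String, Object> arguments);
}

class ResolverRegistry {
    static final Map<String, Resolver> RESOLVERS = new HashMap<>();

    static {
        // Scalar field: simply read the value from the parent object.
        RESOLVERS.put("Weapon.name",
            (parent, args) -> ((Map<?, ?>) parent).get("name"));

        // Non-scalar field: fetch the related object; the resolvers of its
        // sub-fields are then called recursively until only scalars remain.
        RESOLVERS.put("Query.char",
            (parent, args) -> fetchCharacter(args.get("id")));
    }

    // Hypothetical stand-in for a database lookup.
    static Map<String, Object> fetchCharacter(Object id) {
        Map<String, Object> character = new HashMap<>();
        character.put("name", "Legolas");
        return character;
    }
}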

Before a query is executed, however, it needs to be validated. This is done by the GraphQL framework, which analyses the query in two steps. First, it parses the query into an Abstract Syntax Tree (AST), and secondly, it matches the AST against the schema. If the AST can be created and the types in the query have the correct fields, the query is executable. If either of the steps fails, an error will be thrown. [8]

At query execution, the framework starts at the root node of the AST, and calls resolvers as it walks through the different parts of the tree. Once all the results have been retrieved, they are assembled into a JSON object and returned as the query result.

2.2 Graphs

In mathematics, a graph consists of two sets of elements - nodes (a.k.a. vertices) and edges. The nodes model the different objects of the domain, and can represent e.g. persons or cities. The edges represent the relationship between the nodes. Examples of these relationships are that person A knows person B, or that two cities are connected by a highway.

Formally, a graph G is defined as G = (V, E), where V is a set of nodes and E is a set of edges [7]. The set of nodes cannot be empty, and the edge between nodes u and v is denoted (u, v), where (u, v) = (v, u) for undirected graphs. Different constraints and extensions can be added to create graphs specialised for different applications.

In a directed graph, the edge (u, v) represents an ordered pair of nodes, where the edge starts at u and ends at v. This means that for directed graphs, (u, v) ≠ (v, u). Furthermore, graphs can have multiple edges between the same pair of nodes; such graphs are called multigraphs. There are also labelled graphs that can add information about a node or edge by designating a label to it. As an example, a label Movie or Actor can be added to a corresponding node to make the context of the graph more clear.

The graph types of interest to this study are property graphs and GraphQL graphs, both directed, labelled multigraphs. They are presented in the following sections.

Property Graphs

The definition of a property graph cited below is taken from Definition 2.3 in [1]:

A property graph G is a tuple (V, E, ρ, λ, σ) where:

1. V is a finite set of vertices (or nodes).
2. E is a finite set of edges such that V and E have no elements in common.
3. ρ : E → (V × V) is a total function. Intuitively, ρ(e) = (v1, v2) indicates that the edge e goes from node v1 to node v2.
4. λ : (V ∪ E) → Lab is a total function with Lab a set of labels. Intuitively, if v ∈ V (respectively, e ∈ E) and λ(v) = l (respectively, λ(e) = l), then l is the label of node v (respectively, edge e) in G.
5. σ : (V ∪ E) × Prop → Val is a partial function with Prop a finite set of properties and Val a set of values. Intuitively, if v ∈ V (respectively, e ∈ E), p ∈ Prop and σ(v, p) = s (respectively, σ(e, p) = s), then s is the value of property p for node v (respectively, edge e), in the property graph G.

Figure 2.3: Property graph over the Lord of the Rings domain.

The first and second points are self-explanatory - there must be a finite number of both nodes and edges, and each element must belong to exactly one of these sets. The third point indicates that the graph is directed. Point number four expresses that the graph is labelled, where each label names its corresponding object. The last point specifies that nodes and edges can have several properties bound to them. In short, this definition once again tells us that a property graph is a directed, labelled (multi)graph.

With the formalities covered, let us look at an example of a property graph. We will base this on the previously mentioned characters from Lord of the Rings, Gimli and Legolas. The property graph for this domain is shown in Figure 2.3.

Looking at this figure, both the friendship of the characters and their preferred weapons have been captured successfully. The node labels are Elf, Dwarf and Weapon, and the edge labels are friends, getLiftFrom and prefers. The nodes also have unique identifiers - n1, ⋯, n4 - that are used to distinguish the different nodes from each other. Here, only the nodes have IDs, but edges can also have unique identifiers assigned to them. Lastly, the white squares show the properties, and in this example there only exist node properties.

A substantial advantage of property graphs is that it is possible to intuitively model different domains, ranging from large ones to this very small example. It is also much easier to get an overview of the data compared to e.g. looking at the tables in a relational database. Property graphs are also one of the most expressive graph types and are currently the graph type most used in graph database systems.
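To make this concrete, the property graph in Figure 2.3 could be built programmatically, for instance with TinkerPop's in-memory TinkerGraph (a sketch; the labels follow the description above, and the property values are assumptions):

import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class LotrPropertyGraph {
    public static void main(String[] args) {
        Graph graph = TinkerGraph.open();

        // Nodes with the labels from Figure 2.3.
        Vertex legolas = graph.addVertex("Elf");
        legolas.property("name", "Legolas");
        Vertex gimli = graph.addVertex("Dwarf");
        gimli.property("name", "Gimli");
        Vertex bow = graph.addVertex("Weapon");
        bow.property("name", "bow");

        // Edges with the labels described in the text.
        legolas.addEdge("friends", gimli);
        gimli.addEdge("getLiftFrom", legolas);
        legolas.addEdge("prefers", bow);
    }
}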

GraphQL Graphs

GraphQL graphs are very similar to property graphs, and they were mainly introduced in order to have a solid foundation for the formalisation of the query language GraphQL [3]. The main difference of the GraphQL graph is that it has a root node, which acts as a starting point when the graph is queried. If the query has an answer, it will be represented as an edge going out from the start node into an answer node. If a query does not have any answers, that simply means that the start node has no outgoing edges with the label of the query.


Figure 2.4: GraphQL graph over the Lord of the Rings domain.

Another difference is that GraphQL graphs model a specific domain, denoted (F, A, T) (same domain as defined for GraphQL schemas in Section 2.1). A formal definition of GraphQL graphs is presented below, cited from Definition 2.3 in [3]:

A GraphQL graph [...] over (F, A, T) is a tuple G = (N, E, τ, λ, r) with the following elements:

• N is a set of nodes,
• E is a set of edges of the form (u, f[α], v) where u, v ∈ N, f ∈ F and α is a partial mapping from A to Vals,
• τ : N → OT is a function that assigns a type to every node,
• λ is a partial function that assigns a scalar value v ∈ Vals or a sequence [v1...vn] of scalar values (vi ∈ Vals) to some pairs of the form (u, f[α]) where u ∈ N, f ∈ F and α is a partial mapping from A to Vals,
• r ∈ N is a distinguished node called the root node.

The first two bullet points declare that the graph consists of nodes and edges, where the edges have a start node and an end node, as well as an edge label (f[α]). In the third bullet point, assigning a type to a node is essentially the same as a property graph assigning a node label. The fourth bullet point is equivalent to how a property graph handles properties. Finally, the last bullet point assigns the previously mentioned root node.

Let us once again look at the example from Lord of the Rings, but this time modelled as a GraphQL graph. The corresponding graph is shown in Figure 2.4. Comparing it to the property graph for the same domain (Figure 2.3), it is very similar but with the addition of one node and four edges. The added node is the root node and is of object type Query, and the rest of the nodes are of object types Elf, Dwarf and Weapon, respectively. The four added edges, all going out from the root node, represent the queries that will give the corresponding end node as an answer. All nodes, except for the root node, have also been given an additional property - id - a unique identifier that can be used to differentiate the distinct objects.

Conformance to a GraphQL Schema

A certain domain can be modelled by both a GraphQL graph and a GraphQL schema, and to be sure that a graph corresponds to a given schema we can check its conformance, as described in [3]. That a graph conforms to a schema means that the schema can be accurately represented by the GraphQL graph.

2.3 Graph Database Management Systems

A graph DBMS differs from a relational DBMS in the sense that data is not stored in tables, but as a graph. This means that it is possible to intuitively model the relationships between the objects. While a relational database also can model this using a many-to-many relationship, this solution is often not as efficient as the graph database counterpart.

The difference in execution time between these two DBMS types was investigated in a study by Vukotic et al. [9]. They measured the time taken to execute a query for finding friends of friends in both MySQL (a relational DBMS) and Neo4j (a graph DBMS). The authors compared the execution time for different depths of the query, where depth two equals retrieving friends of friends, depth three equals retrieving friends of friends of friends, and so on. Both DBMSs performed similarly for queries of depth two, but the graph DBMS was considerably faster from depth three and onwards. It should also be noted that at depth five, MySQL could not complete the query within 60 minutes. Based on this, the results suggest that graph databases are well-suited for path traversal queries over heavily interconnected data, such as can be found in e.g. social media.
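As an illustration of such a path traversal query, here expressed as a TinkerPop/Gremlin traversal rather than the queries used in the cited study (the "knows" edge label and the starting vertex id are assumptions):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.out;

public class FriendsOfFriends {
    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();
        // Depth three: friends of friends of friends of vertex 1,
        // with duplicates removed.
        System.out.println(
            g.V(1L).repeat(out("knows")).times(3).dedup().toList());
    }
}

In a graph DBMS this is simply a fixed number of edge hops per start node, whereas a relational counterpart needs one additional self-join per depth level, which matches the growing gap observed in the study.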

Another benefit of a graph DBMS is that, in many cases, it is more straightforward to model the data as a graph rather than to split it into a number of tables. Moreover, since the result of a GraphQL query basically is a graph, having the data modelled as a graph means that no complex transformation between the query result and the data is needed. Additionally, by transforming the data into a GraphQL graph, it can be verified to be a valid representation of a GraphQL schema by checking the graph’s conformance to the schema. All of these facts indicate that a graph DBMS should work relatively smoothly with GraphQL, in comparison to e.g. a relational DBMS.

2.4 TinkerPop

TinkerPop is an open source framework for graph computing hosted by the Apache Software Foundation. The main purpose of TinkerPop is to serve as a stable foundation for people interested in graph systems - whether they are building a brand new system or expanding their current one with new functionality. Out of the box, TinkerPop comes with, among other things, a server infrastructure and its own query language, Gremlin.

Additionally, TinkerPop provides a Java API for interacting with graphs, whether they are saved in-memory or externally. It is possible to both read and write data through this API, but most importantly it allows for graph traversals.
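For example, using the Lord of the Rings graph from earlier, a traversal through this API could look as follows (a sketch; the labels follow the earlier example):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;
import java.util.List;

public class TraversalExample {
    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();
        // Start at all Elf-labelled vertices, follow outgoing "prefers"
        // edges, and read the "name" property of the weapons reached.
        List<Object> names = g.V().hasLabel("Elf")
                              .out("prefers")
                              .values("name")
                              .toList();
        System.out.println(names);
    }
}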

One big advantage of using TinkerPop as the back-end is that the implementation can be used together with all other programs that are TinkerPop-enabled, i.e. that are compatible with TinkerPop. As the popularity of graph DBMSs rises, more and more of these programs emerge. For example, DataStax, Neo4j and Amazon Neptune are all TinkerPop-enabled graph systems. Not being bound to a specific graph DBMS allows the user more freedom, as they can choose the system that fits them best.

2.5 Query Representations

During the processing of a query in a DBMS, the query usually goes through several different representations. This is done to help identify optimisation opportunities, since the representations have different benefits and drawbacks. In this thesis, the GraphQL queries are represented as both logical plans and query execution plans.


Logical Plans

The logical plan for a query is used to conceptually describe and order each of the steps that need to be completed to get the result of the query. Each step can be represented as a logical operator, which serves as an indication of which physical operation later needs to be performed. It is common to present the query processing visually, and the logical operators can then be represented as nodes in a tree.

The logical operators have an input and an output, and optionally they can also have one or more arguments. As an example from relational algebra, if we take the selection operator σ, both its input and its output are sets of tuples. The argument for this operator is a condition, and only tuples that fulfil this condition will be present in the output. Two logical operators can be connected only if the output type of the first operator is the same as the input type of the connecting operator.

Query Execution Plans

Query execution plans, also known as physical plans, represent the different steps the system has to go through during execution of the query. These plans are based on the logical plans, but the two representations do not necessarily have a 1:1 mapping between their operators. A single logical operator can correspond to multiple different physical operators depending on the circumstances, and it is also possible for several logical operators to be combined into one physical operator.

The physical operators themselves come with a specific method or algorithm. For example, the logical operator join can be implemented as either a hash join or as a nested loop join. Both methods have the same input and output, but on a low-level they operate in different ways. Because of this, it is possible that the efficiency of the different operators depends on the actual input(s) and on the hardware itself, e.g. if the data is stored on a mechanical hard drive or on an SSD.
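To illustrate how two physical operators can implement the same logical operator, the sketch below contrasts a nested loop join with a hash join over two in-memory lists (illustrative code only, not part of the thesis implementation; requires Java 16+ for records):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinOperators {
    record Pair(int key, String value) {}

    // Nested loop join: compare every pair of tuples, O(|left| * |right|).
    static List<String> nestedLoopJoin(List<Pair> left, List<Pair> right) {
        List<String> out = new ArrayList<>();
        for (Pair l : left)
            for (Pair r : right)
                if (l.key() == r.key())
                    out.add(l.value() + "-" + r.value());
        return out;
    }

    // Hash join: build a hash table on one input and probe it with the
    // other, O(|left| + |right|) expected time but extra memory.
    static List<String> hashJoin(List<Pair> left, List<Pair> right) {
        Map<Integer, List<String>> table = new HashMap<>();
        for (Pair l : left)
            table.computeIfAbsent(l.key(), k -> new ArrayList<>()).add(l.value());
        List<String> out = new ArrayList<>();
        for (Pair r : right)
            for (String v : table.getOrDefault(r.key(), List.of()))
                out.add(v + "-" + r.value());
        return out;
    }
}

Both methods produce the same output for the same inputs; which one is cheaper depends on the input sizes and the hardware, which is exactly what the cost functions discussed next try to capture.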

In addition to this, the physical operators are also associated with a cost function. Since each cost function serves as an estimate of how costly one operation is in regards to performance, it is possible to calculate how costly a query execution plan is. By combining the available physical operators in different ways, we can then calculate the cost of each plan and choose the one with the lowest value for the (estimated) best performance.

2.6 Optimisation Techniques

When optimising based on performance, a basic approach is to make sure an application does not make redundant calls or requests. A technique to avoid this redundancy is called batching, in which several similar requests are batched together and sent as one large request instead.

A paper by Lin, Kwee & Tsai [6] makes use of this technique when examining its impact on the INSERT statement for a MySQL DBMS. The authors compared the insertion time for datasets of between 8,000 and 10 million records, where records were inserted either one per DBMS request or through batching with 1,000 records per request. In the end, they reach the conclusion that batched insertion performs faster for all sizes of datasets, and reduces the execution time of regular insertion by almost 95%.

Batching improves performance since fewer incoming requests means less overhead, and the execution is able to run more efficiently. For the batching to work, however, it can only perform one type of action, like insertion or fetching node properties. Naturally, this is only a viable option for programs that perform the same type of request over and over.
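The following sketch shows the batching idea for the insertion case studied in [6], using JDBC's standard batching API against a hypothetical MySQL table (the connection details and table schema are assumptions; a MySQL JDBC driver is needed on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsert {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/testdb", "user", "password");
        String sql = "INSERT INTO records (id, value) VALUES (?, ?)";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (int i = 0; i < 10_000; i++) {
                stmt.setInt(1, i);
                stmt.setString(2, "value-" + i);
                stmt.addBatch(); // buffer the insert instead of sending it
                if (i % 1000 == 999) {
                    stmt.executeBatch(); // send 1000 buffered inserts as one request
                }
            }
            stmt.executeBatch(); // flush any remainder
        }
        conn.close();
    }
}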

Another optimisation technique is threading, which allows parts of the program to be run in parallel on several different threads. This technique is very well-known and has a lot of potential for improving performance, although it has to be used with caution because of the added complexity. One of the simplest approaches to threading is to use it on a piece of code that has to be run many times in succession. Moreover, all input needs to be independent of any result from the parallelised code, since the threads otherwise would have to wait for each other's subresults, practically running synchronised.

Choosing how many threads a program should use is no easy task. On the one hand, the workload of each thread will be overly high if too few threads are used, but using too many threads will result in extra overhead when creating them. Additionally, using more threads than the processor has available could cause excessive context switching, affecting performance negatively. Consequently, this choice depends both on the code being run and on the available hardware.
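A minimal sketch of this kind of threading, using a fixed-size thread pool over independent tasks (the database call is a hypothetical stand-in):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        List<Integer> nodeIds = List.of(1, 2, 3, 4, 5, 6, 7, 8);
        int threadCount = 4; // the right value depends on workload and hardware
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        List<Future<String>> futures = new ArrayList<>();
        for (int id : nodeIds) {
            // Each task is independent of the others' results,
            // so the threads never have to wait on each other.
            futures.add(pool.submit(() -> fetchNodeProperties(id)));
        }
        for (Future<String> f : futures) {
            System.out.println(f.get()); // collect the results
        }
        pool.shutdown();
    }

    // Hypothetical stand-in for a database call.
    static String fetchNodeProperties(int id) {
        return "properties of node " + id;
    }
}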

The thought behind these two techniques is also common for optimising GraphQL queries, as shown in an example by Ingram [5] for an event application. The author starts by showing a very simple approach to retrieving the needed data, resulting in 47 requests being made. They gradually improve the structure by batching requests, and the final version of the data fetching uses only eight requests. The example also discusses the advantages of asynchronous requests, since they often can be executed in parallel to shorten the overall response time. While asynchronous requests are not the same thing as threading, they both offer the opportunity to run things in parallel, and in this case they should have a comparable impact on performance.


3 Method

This chapter covers the different tools and programs that have been used, as well as how the implementation and evaluation were done.

3.1 Linköping GraphQL Benchmark (LinGBM)

LinGBM is a performance benchmark for server implementations of GraphQL, developed by the Database and Web Information Systems Group of Linköping University. Currently, there are no standardised tests for evaluating a GraphQL server, which means that comparing two implementations can be troublesome. The aim is for this benchmark to serve as an accessible test suite that also simplifies comparison between different projects.

LinGBM is based on the Berlin SPARQL Benchmark (BSBM) project. As the name suggests, BSBM focuses on the query language SPARQL and consists of a complete test suite as well as a data generator. This generator has been adapted to GraphQL by the LinGBM team, and the queries have been reworked to identify potential choke points (or bottlenecks) related to the structure of GraphQL.

The remaining subsections about LinGBM cover both the domain of the data and the query templates that were used in this study.

The Dataset

The dataset produced by the generator models an online shopping portal, including which products are sold by which vendors, as well as customer reviews. For an overview of the structure of the data, see Figure 3.1. However, note that for clarity, the actual edge labels have been replaced with more descriptive names in the figure.

The products have distinct offers that are given by different vendors, and each product is also created by a producer. Additionally, the products have a number of reviews written about them, and each review is written by a person. Lastly, each producer and person is located in a country. Not included in the figure are the properties of the different object types. The ones of interest will be mentioned in the benchmark queries in the next section, and the interested reader can find all of the properties at LinGBM's GitHub1.


Figure 3.1: A simplified model of the online shopping portal dataset. All edge labels have been added for a purely descriptive purpose and do not represent the labels present in the actual dataset.

Table 3.1: The difference in notation between the original LinGBM query templates (QTs) and the notation used in this study.

LinGBM Notation    Current Notation
QT 1               QT A
QT 2               QT B
QT 4               QT C
QT 5               QT D
QT 6               QT E

When generating a dataset, the size of the data is determined by a so-called scaling factor. This factor corresponds to the number of unique products in the dataset; thus, the more products, the larger the dataset. For this study, a scaling factor of 1000 was deemed large enough to sufficiently test the performance of the program.

The Query Templates

Out of the 15 query templates that are available in the LinGBM test suite, five were compatible with the program. The rest of the templates either used fields with arguments or fields that in the database were represented as node labels instead of edge labels, neither of which is supported by the program. The query templates used, along with their notation in this study, are shown in Table 3.1.

The templates have been designed to capture possible choke points, i.e. structures that can potentially serve as bottlenecks on performance. Exactly which choke points each query corresponds to will be covered in the following sections, where each query template is described in detail.

Query Template A

The first query template focuses on retrieving information for reviews about a certain product belonging to one unique offer. To select which offer to start from, the user has to input an ID for a node of type Offer - the variable denoted by $offerID. Further, the template has a depth of three - the first level is product, the second one is reviews and the last level consists of multiple scalars.


offer(nr:$offerID) {
  product {
    reviews {
      title text reviewDate publishDate rating1 rating2
    }
  }
}

The choke point handled by this template is related to retrieval of multiple scalar values from the same object. Issues caused by this can be countered by allowing multiple scalars to be retrieved with one database call, so-called batching.

Query Template B

The next template is based on the input of a producerID, and the title for all reviews of all of its products will be retrieved. The data retrieval will handle 1:N relationship types (one producer to many products, and in turn one product to many reviews) without the retrieval of intermediate scalars, and this is the possible choke point the template models. One way to counter this is by retrieving all nodes connected by the same edge in one call, or by allowing retrieval of multiple objects at the same time, e.g. via threading.

producer(nr:$producerID) {
  products {
    reviews {
      title
    }
  }
}

Query Template C

By giving an offerID as input, this template will retrieve a fair amount of data about the corresponding product. The depth of the template is five, and here we see a mix of scalars (e.g. label) and object types (e.g. reviews) on the same level. In order to save space, the scalar fields of each level have been written on the same line. The choke points in focus here are 1:1 and 1:N relationship types with the retrieval of intermediate scalars, which may work differently than template B depending on the implementation. Additionally, the 1:1 relationships may support more efficient techniques than 1:N relationships, which could have an impact on performance.

offer(nr:$offerID) {
  product {
    label comment
    reviews {
      title text rating1 rating2 rating3 rating4
      reviewer {
        country { code }
      }
    }
  }
}

Query Template D

This query template has a depth of five, and handles retrieval of data about reviews for a given product. First, it will visit all reviews, and for each review, all reviews will be revisited again. This indicates that the response object will consist of a tree with at least n^2 leaf nodes, where n is the number of reviews for the given product. Additionally, the choke point here concerns cyclic relationships, which means that unless some sort of caching is taking place, the same data will be retrieved multiple times.


product(nr:$productID) {
  reviews {
    title
    reviewFor {
      reviews {
        title
        reviewFor { label }
      }
    }
  }
}

Query Template E

Finally, the last template concerns offers sold by a given vendor and retrieves the code of the country in which the offered product is produced. The depth of the template is five, and the relevant choke points are 1:1 and 1:N relationship types without any intermediate scalars.

vendor(nr:$vendorID) {
  offers {
    product {
      producer {
        country { code }
      }
    }
  }
}

3.2 Neo4j

In order to examine how the implementation worked with data bigger than a toy example, it needed to be run together with a DBMS. Neo4j has good integration with TinkerPop, and is also the currently most used graph DBMS. Furthermore, there are numerous helpful plug-ins made by the community that handle otherwise tedious tasks. Based on this, Neo4j was deemed a suitable choice for this study. The version of Neo4j as well as the plug-ins used are listed below:

• Neo4j Community Edition 3.5.8
• neosemantics2 by Jesús Barrasa
• neo4j-gremlin-bolt3 by Steel Bridge Labs

The plug-in neosemantics was used to import the generated data to the database, while neo4j-gremlin-bolt made the data accessible in Java.

In the documentation for TinkerPop, they mention another plug-in for accessing Neo4j via Java called neo4j-gremlin (note that this is different from neo4j-gremlin-bolt). This plug-in uses a different part of TinkerPop called Gremlin Server, along with the query language Gremlin, to traverse the graph data. The Gremlin Server works by having the data graph locally on the server, and then allowing TinkerPop to call this graph. The most common approach is to create a graph traversal object out of the graph, and then query that object using the Gremlin language.

However, since Neo4j supports the Bolt protocol, it is possible to access the database itself directly, instead of having to use Gremlin Server as an intermediary. neo4j-gremlin-bolt was therefore the preferred choice because of its simplicity, as well as being equal in functionality to neo4j-gremlin.

2 https://github.com/neo4j-labs/neosemantics/tree/master
3 https://github.com/SteelBridgeLabs/neo4j-gremlin-bolt


Figure 3.2: The interface and concrete classes representing the ordered JSON objects.

3.3 Implementation

The implementation can be split into two groups of helper classes - custom JSON objects and a GraphQL parser - as well as the query engine itself. First the helper classes will be presented, and the section then covers the query engine and its threading.

The coding has been done entirely in Java using the TinkerPop 3.4.3 API.

Custom JSON Objects

One issue that was identified early was that a GraphQL query always returns an ordered JSON object, but the definition of JSON objects states that the objects of the same level have no set internal order4. Because of this, neither of the available JSON classes could guarantee that the response object would always have its fields in the correct order, and custom classes were instead built to handle this. These classes represent all possible types that may appear in a JSON object, and an overview is shown in Figure 3.2.

There are two inherited methods, getMaximumDepth and writeToJson, that all JSON types need to have. getMaximumDepth is used to get the maximum depth of a subquery, in order to calculate the correct weight for that field (used for threading, see section about Query Engine). The method writeToJson is very straightforward, used only to print out the object as a proper JSON string.

The class OrderedJsonObject is what effectively keeps the fields ordered, by saving the field names and their corresponding field values in two different lists. The order of these lists cannot be changed, which ensures that the correct order of the fields is kept. Additionally, it has functions to retrieve the contents of the respective lists.

Similarly, JsonScalarValue allows retrieval of the content of the leaf nodes of JSON objects, but since each leaf node only has one value, there is no internal order to keep track of. Furthermore, the two array classes JsonScalarArray and JsonObjectArray basically work like normal lists, but are filled with either scalars (Objects) or OrderedJsonObjects.
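A minimal sketch of the core of OrderedJsonObject as described above (the real class also implements getMaximumDepth and further accessors; scalar quoting is omitted for brevity):

import java.util.ArrayList;
import java.util.List;

class OrderedJsonObjectSketch {
    // Two parallel lists: names.get(i) is the field name for values.get(i).
    // Fields are only ever appended, so the insertion order is preserved.
    private final List<String> names = new ArrayList<>();
    private final List<Object> values = new ArrayList<>();

    void addField(String name, Object value) {
        names.add(name);
        values.add(value);
    }

    String writeToJson() {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append('"').append(names.get(i)).append("\": ").append(values.get(i));
        }
        return sb.append('}').toString();
    }
}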

GraphQL Parser

The GraphQL parser was created in order to ease the conversion of a GraphQL query in pure text to an OrderedJsonObject. The parser trims away any newlines and tabs present, and then recursively converts the string to a JSON object.


Figure 3.3: Example query with multiple array fields.

Query Engine

The query engine is the part that handles the execution of the query as well as the retrieval of data from the database. In order for the execution to take place, the initialising method requires two arguments given by the user. One argument is the GraphQL query in the form of an OrderedJsonObject, which can be created with the previously mentioned GraphQL parser. The second argument is a boolean determining if threading should be performed on the upper half or the lower half of the query. If the value is true, threading will take place on the upper half, and if the value is false it will be done on the lower half. The halves are split up based on the depth of the query, where the lower half will always be the bigger one if the depth is an odd number. For example, a query of depth five will have its upper half as depths 1-2, and its lower half as depths 3-5. However, the threading will only be performed for fields that evaluate to arrays, meaning that the execution will run fully synchronised if there are no such fields in the allowed depths.

One thing specific to this query engine is that it does not assume a predefined GraphQL schema, but rather constructs a schema as it works its way through the query. This means that it cannot rely on external information for how to handle the fields, but has to check whether the fields are scalars, objects or arrays while executing them. Consequently, it is possible for the implementation to assign an unexpected type to a field - namely to treat a value as an object instead of as an array of one element. The impact this could have on the execution time is negligible, but it is worth keeping in mind if the query response itself is used.

Concerning the possible choke points, the design of the query templates implies that there could be a difference between retrieving a 1:1 relationship and a 1:N relationship. This query engine does not make a distinction between the two types of relationships, but retrieves the nodes in the exact same way. Therefore, it is unlikely that results from these two types of choke points will be distinct from each other. Furthermore, the query engine does not cache its retrieved data, meaning that the choke point concerning the cyclic relationships is not being addressed in this implementation.

Threading

As mentioned in the previous section, the threading will only be performed on fields that evaluate to arrays. Once an array field is found, all array fields on the same level will be run in parallel. As previously mentioned, whether this level is on the upper or lower half of the query is decided by the user. An example query using the shopping portal domain is shown in Figure 3.3. Since the maximum depth of the query is two and the array fields are on level one, this query will only run threaded when optimisation is set to the upper half. Further, note that this query would not work on the actual dataset, since producers have neither reviews nor vendors; it is used solely to demonstrate a query with multiple array fields.

To decide how the threads should be split up between the different fields, each field is assigned a weight. This weight is dependent on both the maximum depth of the subquery for the field and the number of elements the array will have. The weight is calculated at execution time once the parent field of the array field is reached (i.e. the field one level up from the array field), since it is only then that the number of database nodes belonging to the array field is known. The formula for the weight is weight = maximumDepth * arrayLength. A field having a large weight indicates that it will likely be more computationally heavy than a field with a smaller weight. This weight is then converted to a weight ratio, which indicates what percentage of the total load the field represents.

Using the example query in Figure 3.3, let us assume we have reached the producer field (i.e. the parent field of the array fields) and want to calculate the weight for the three array fields. We start with retrieving the number of database nodes belonging to each field, and for this example products has 6 database nodes, reviews has 45 database nodes and vendors has 9 database nodes. Next, the maximum depth for each array field’s subquery has to be found. Starting with field products, we have to move one level down in the query to reach label, which is at the maximum depth. Therefore, the maximum depth for the subquery of products field is one. Similarly, the maximum depths of subqueries for fields reviews and vendors are also one. When applying the weight formula, the weights evaluate to 6, 45 and 9, respectively.

Next, the weight ratios for the array fields have to be calculated. The total weight is the sum of all weights, here evaluating to 60. The weight ratio for products is 6/60 = 0.1, i.e. 10%. Correspondingly, the weight ratio for reviews is 75% and it is 15% for vendors.
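The weight and weight ratio computation for this running example can be reproduced with a few lines of code (a sketch; the numbers are the ones stated above):

public class WeightExample {
    static int weight(int maximumDepth, int arrayLength) {
        return maximumDepth * arrayLength;
    }

    public static void main(String[] args) {
        String[] fields = {"products", "reviews", "vendors"};
        int[] depths = {1, 1, 1};   // maximum depth of each subquery
        int[] nodes = {6, 45, 9};   // database nodes per array field

        int[] weights = new int[fields.length];
        int total = 0;
        for (int i = 0; i < fields.length; i++) {
            weights[i] = weight(depths[i], nodes[i]);
            total += weights[i];
        }
        for (int i = 0; i < fields.length; i++) {
            double ratio = 100.0 * weights[i] / total;
            // Prints: products: weight=6, ratio=10% (and so on)
            System.out.printf("%s: weight=%d, ratio=%.0f%%%n",
                fields[i], weights[i], ratio);
        }
    }
}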

Before starting to examine the threading algorithm, there are a few last notions that have to be defined. The first one is a constant called MIN_NODES_ON_THREAD, which defines the minimum number of database nodes that an array field must have in order to get a separate thread allocated to it. If the number of database nodes for a field is fewer than MIN_NODES_ON_THREAD, that field will be run on an assembled thread together with other fields that did not meet the threshold. Secondly, there is a variable called availableThreads, which is set by the user and dictates the maximum number of threads available for the parallel execution. Further, each thread will be assigned an initial workload, spread out evenly over the threads and measured in percentage. However, note that the actual workload for a thread will depend on the inputs, and that the initial workload rather defines an allowed interval of workload that a thread will handle. Lastly, a field entity represents an array field, and has information about its field label (i.e. the name of the field), its number of database nodes as well as its weight ratio.

Now, we have all the background information needed to study the threading algorithm presented in Listing 3.1.

Analysing lines 2-24, this is where the algorithm decides if a field has a workload of at least initThreadWorkload and should be run on its own thread(s), or if it should share an assembled thread with other fields. Starting from the top, initThreadWorkload represents the initial workload previously mentioned, and depends on the variable availableThreads. For our example, let us set MIN_NODES_ON_THREAD = 1 and availableThreads = 5, which means that initThreadWorkload = 0.2, or 20%. The field entities are still based on the example query in Figure 3.3, with the same number of database nodes and weight ratios as previously stated.

Starting from line 4 with the field entity for products, we see that maxThreads = floor(6/1) = 6. Its weight ratio of 10% is less than 20%, which satisfies the if-condition on line 7, and it gets assigned as a cheap field entity, meaning that this field entity will be run on an assembled thread.

Next is the field entity for reviews, with maxThreads = 45. Since its weight ratio of 75% is greater than 20%, the if-condition on line 7 is not satisfied and we instead enter the else-case for expensive field entities on line 11. The variable estimatedThreads = floor(0.75/0.2) = 3. Lines 15-18 handle the fact that a field entity might not reach the threshold for using estimatedThreads + 1 threads, but is still sufficiently close for another thread to get assigned to it. In order to deduce if it is sufficiently close, we need to look at the remainder after assigning the initial value of estimatedThreads. If the remainder is less than half of initThreadWorkload, the field entity is not considered large enough to be assigned another thread. However, if the remainder is greater than or equal to this threshold (line 15), the field entity will be assigned another thread. For the reviews field entity, the if-condition on line 15 evaluates to (0.75 mod 0.2) >= (0.2/2), i.e. 0.15 >= 0.1, which is true; this means that instead of the initial estimate of three threads, the final estimation is four threads. The last check in the else-case is that the estimated number of threads does not exceed the maximum number of threads set on line 5. Whichever is the lowest of maxThreads and estimatedThreads is the number of threads assigned to that field entity, and the total number of threads used is incremented correspondingly on line 22.

1  MIN_NODES_ON_THREAD = minimum number of database nodes a thread should handle
2  initThreadWorkload = 1 / availableThreads
3
4  for each field entity {
5      maxThreads = floor(number of database nodes belonging to field entity / MIN_NODES_ON_THREAD)
6
7      if weightRatio < initThreadWorkload or maxThreads == 0 {
8          // Field entity is considered a 'cheap field entity'
9          No threads assigned yet, variable threadsUsed is not incremented
10     }
11     else {
12         // Field entity is considered an 'expensive field entity'
13         estimatedThreads = floor(weightRatio / initThreadWorkload)
14
15         if (weightRatio mod initThreadWorkload) >= (initThreadWorkload / 2) {
16             // If weightRatio is closer to using (n+1) threads than (n), round it up
17             estimatedThreads += 1
18         }
19
20         If maxThreads < estimatedThreads, set numberOfThreads = maxThreads
21         Else, set numberOfThreads = estimatedThreads
22         threadsUsed += numberOfThreads
23     }
24 }
25
26 for each cheap field entity {
27     sortedWeights = sort from smallest to largest weightRatio
28     sortedFEs = sort so field entities are on the same index as their weightRatio
29
30     while exists cheap field entities not assigned to any thread {
31         // FEList contains all field entities that should be run on the same thread
32         FEList = assignEntitiesToAssembledThread(sortedWeights, sortedFEs)
33         Add FEList as an element in assembledList
34     }
35     threadsUsed += assembledList.length
36 }
37
38 if threadsUsed > availableThreads {
39     Throw error
40 }
41
42 for each element in assembledList {
43     create new thread and execute it
44 }
45
46 for each expensive field entity {
47     Split up the number of outgoing database nodes with current fieldLabel evenly into as many sublists as the field entity has assigned threads
48
49     for each sublist {
50         create new thread and execute it
51     }
52 }

Listing 3.1: The threading algorithm for parallel execution of array fields.


Finally, the field entity for vendors can have a maximum of nine threads (line 5). Its weight ratio is 15%, which means that the if-condition 15% < 20% is fulfilled and that vendors is a cheap field entity.

To summarise the example so far: the field entity reviews has been assigned as an expensive field entity with four threads allocated to it, while the field entities products and vendors have been assigned as cheap field entities, with their number of threads yet to be decided. The total number of available threads is five.
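As a concrete illustration of lines 4-23 of Listing 3.1, a Java sketch of the classification step could look as follows; it uses the hypothetical FieldEntity class from earlier and is an interpretation of the pseudocode, not the exact thesis code.

import java.util.List;
import java.util.Map;

final class ThreadPlanner {
    static final int MIN_NODES_ON_THREAD = 1;

    // Classify field entities as cheap or expensive and estimate thread counts
    // (lines 4-23 of Listing 3.1). Returns the number of threads used so far.
    static int classifyFieldEntities(List<FieldEntity> entities, int availableThreads,
                                     List<FieldEntity> cheap,
                                     Map<FieldEntity, Integer> expensive) {
        double initThreadWorkload = 1.0 / availableThreads;
        int threadsUsed = 0;
        for (FieldEntity fe : entities) {
            int maxThreads = fe.getNumberOfNodes() / MIN_NODES_ON_THREAD; // integer division = floor
            if (fe.getWeightRatio() < initThreadWorkload || maxThreads == 0) {
                cheap.add(fe); // cheap: will share an assembled thread later
            } else {
                int estimatedThreads = (int) Math.floor(fe.getWeightRatio() / initThreadWorkload);
                if (fe.getWeightRatio() % initThreadWorkload >= initThreadWorkload / 2) {
                    estimatedThreads += 1; // closer to using (n+1) threads than n
                }
                int numberOfThreads = Math.min(maxThreads, estimatedThreads);
                expensive.put(fe, numberOfThreads);
                threadsUsed += numberOfThreads;
            }
        }
        return threadsUsed;
    }
}

Running this sketch on the example values (products 10%/6 nodes, vendors 15%/9 nodes, reviews 75%/45 nodes) classifies products and vendors as cheap and assigns four threads to reviews, matching the walkthrough above.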

Moving on to lines 26-36, this code section assigns the cheap field entities to a number of assembled threads. First, the weight ratios are sorted in ascending order (sortedWeights), and the cheap field entities are sorted correspondingly (sortedFEs), so that the field entity with the smallest weight ratio comes first, the one with the second smallest weight ratio second, and so on. On line 32, the algorithm calls the method assignEntitiesToAssembledThread, which adds cheap field entities to the same assembled thread as long as their total workload is less than a set threshold. The algorithm for allocation of the cheap field entities is presented later in Listing 3.2, and the threshold in this study was set to 2 * initThreadWorkload. For our example, this corresponds to a threshold of 2 * 20% = 40%. The two cheap field entities for products and vendors have weight ratios of 10% and 15%, respectively, summing to a total weight ratio of 25%. Since 25% is less than the threshold of 40%, the field entities products and vendors are both allocated to the same thread, and this is noted as an entry in assembledList. Now that both cheap field entities have been assigned a thread, the while-condition on line 30 no longer holds and the loop is exited. The variable threadsUsed is incremented by the length of assembledList, which in our example is one. Seeing that threadsUsed already has a value of four (from assigning four threads on line 22 to handle the field entity for reviews), its new value is now five.

Lines 38-40 ensure that the threading only takes place if the number of threads to be used is at most the number of available threads. Otherwise, an error is thrown and execution is stopped. This is one part of the algorithm that could easily be improved, e.g. by taking the two thread assignments with the lowest workload and merging them onto a single thread, and continuing this merging until the number of threads to be used equals the number of available threads. However, to limit the scope of this study, none of the possible improvements were part of the implemented query engine.

As mentioned, for our example threadsUsed = 5 and availableThreads = 5, meaning that a valid number of threads has been assigned for use and the execution can continue.

The last part of the algorithm, lines 42-52, is what actually creates the threads and starts their execution. For the assembled threads, the threads have already been assigned their field entities on lines 26-36, and no further information is needed to create and start them. The expensive field entities, however, have to divide their database nodes into as many groups as there are threads assigned to the field entity. Furthermore, the groups should ideally be of the same size, or at least of similar size. This can be done in different ways, but this implementation assigns floor(#DatabaseNodes / #AssignedThreads) database nodes to each of the first through the second-to-last threads. The last thread is assigned at least as many database nodes as the other threads, but also takes the remainder of the quotient, meaning that it has at most numberOfAssignedThreads − 1 more elements than the other threads. Once the list of database nodes has been divided into sublists, a new thread is created and executed for each sublist. For the field entity of reviews, with 45 database nodes and four assigned threads, this means that three of its threads will handle eleven database nodes each, and one thread will handle twelve.
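A minimal Java sketch of this partitioning scheme, in which the last sublist absorbs the remainder; the method name is illustrative.

import java.util.ArrayList;
import java.util.List;

// Split `nodes` into `threads` sublists: the first (threads - 1) sublists get
// floor(n / threads) elements each, and the last one also takes the remainder.
static <T> List<List<T>> partitionForThreads(List<T> nodes, int threads) {
    int base = nodes.size() / threads; // integer division = floor
    List<List<T>> sublists = new ArrayList<>();
    for (int i = 0; i < threads - 1; i++) {
        sublists.add(nodes.subList(i * base, (i + 1) * base));
    }
    // For 45 nodes and 4 threads, this yields sublists of 11, 11, 11 and 12 nodes.
    sublists.add(nodes.subList((threads - 1) * base, nodes.size()));
    return sublists;
}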

When the threading algorithm has been executed, the query engine waits for the results from all threads and gathers them in an object to return.
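In Java, this create-execute-wait-gather pattern is commonly expressed with an ExecutorService; the sketch below shows the idea under the assumption that each thread's work is a Callable returning a partial result map, which is not necessarily how the thesis implementation models it.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Run one task per assigned thread, wait for all of them, and merge the results.
static Map<String, Object> runAndGather(List<Callable<Map<String, Object>>> tasks,
                                        int availableThreads)
        throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(availableThreads);
    try {
        List<Future<Map<String, Object>>> futures = pool.invokeAll(tasks); // blocks until all tasks finish
        Map<String, Object> result = new HashMap<>();
        for (Future<Map<String, Object>> future : futures) {
            result.putAll(future.get()); // gather each thread's partial result
        }
        return result;
    } finally {
        pool.shutdown();
    }
}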

The last part of this section describes the algorithm for assigning field entities to an assembled thread, presented in Listing 3.2. This algorithm calculates which field entities can be run on the same thread based on a set threshold, and it is the algorithm used by the method assignEntitiesToAssembledThread on line 32 in Listing 3.1.


1  sortedWeightList = list of weight ratios sorted from smallest to largest
2  Assign a threshold to maximum thread workload
3
4  // Add largest weights first
5  addLargest = true
6
7  while exists field entities not yet assigned and totalWorkloadForThread < threshold {
8      newTotal = totalWorkloadForThread
9
10     If addLargest = true, set newTotal += last element of sortedWeightList
11     Else, set newTotal += first element of sortedWeightList
12
13     if newTotal > threshold {
14         If addLargest = true, set addLargest to false
15         Else, break
16     }
17     else {
18         Add corresponding field entity to returnList,
19         and remove its weight ratio from sortedWeightList
20         totalWorkloadForThread = newTotal
21     }
22 }
23 return returnList

Listing 3.2: Algorithm for assigning field entities to an assembled thread.

This algorithm requires a list of sorted weight ratios, as well as a threshold for the workload that an assembled thread must not exceed. It also sets a boolean on line 5, addLargest, which indicates that the largest weight ratios should be tried for assignment to the thread before the smaller ones.

The while-loop on lines 7-22 will continue as long as there are field entities to be assigned and the workload of the thread does not exceed the threshold. On line 8, the value of totalWorkloadForThread is copied to the variable newTotal, which is then incremented on lines 10-11 by either the largest weight ratio (if addLargest = true) or the smallest weight ratio. Next, the if-case on line 13 checks whether the updated workload of the thread (newTotal) exceeds the allowed threshold. If it does exceed the threshold, an additional if-else-case is executed. If addLargest = true, the largest weight ratio was added to newTotal, and if we instead add a smaller weight ratio, the threshold might not be exceeded. The value of addLargest is therefore set to false, and the loop starts over, since both variables in the while-condition remain unchanged. If instead addLargest = false, the smallest weight ratio was added to newTotal but the workload of the thread still exceeded the threshold. At this point, there is no smaller value left to try, and the loop is stopped (line 15).

However, if newTotal does not exceed the threshold, the else-case on lines 17-21 will be executed. The field entity that the weight ratio belongs to is then added to returnList, which in the end will contain all field entities that should be run on the same assembled thread. Additionally, the added weight ratio is removed from sortedWeightList, so that it cannot be added twice, and totalWorkloadForThread is updated with its new value. Finally, once the condition for the while-loop no longer holds, returnList contains all the field entities that should be run on the same thread.
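A Java sketch of this assignment algorithm could look as follows; it assumes the hypothetical FieldEntity class from earlier and two parallel sorted lists, and it mirrors Listing 3.2 rather than reproducing the thesis code.

import java.util.ArrayList;
import java.util.List;

// Greedily fill one assembled thread: try the largest remaining weight first,
// and once it no longer fits, fall back to the smallest remaining weights.
static List<FieldEntity> assignEntitiesToAssembledThread(List<Double> sortedWeights,
                                                         List<FieldEntity> sortedFEs,
                                                         double threshold) {
    List<FieldEntity> returnList = new ArrayList<>();
    double totalWorkloadForThread = 0.0;
    boolean addLargest = true; // add largest weights first

    while (!sortedWeights.isEmpty() && totalWorkloadForThread < threshold) {
        int index = addLargest ? sortedWeights.size() - 1 : 0;
        double newTotal = totalWorkloadForThread + sortedWeights.get(index);

        if (newTotal > threshold) {
            if (addLargest) {
                addLargest = false; // the largest weight did not fit; try the smallest
            } else {
                break;              // even the smallest weight does not fit
            }
        } else {
            returnList.add(sortedFEs.remove(index)); // assign entity to this thread
            sortedWeights.remove(index);             // so it cannot be added twice
            totalWorkloadForThread = newTotal;
        }
    }
    return returnList;
}

With the running example's threshold of 40%, this sketch first places vendors (15%) and then products (10%) on the same assembled thread, for a total workload of 25%.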

3.4 Evaluation

In order to evaluate the implementation and, in particular, to study the impact of the threading, we have varied the number of threads available as well as alternated between using threading on the upper or the lower half of the query. Additionally, the execution time for running everything synchronised (i.e. without threading) has also been recorded. Below are the properties of the machine on which the queries were executed.

• CPU: Intel i7-3770
• RAM: 16.0 GB
• OS: Windows 10 Education
• Type of hard drive: HDD
• JVM: OpenJDK version 1.8.0
• Version of TinkerPop: 3.4.3
• Version of Neo4j: 3.5.8
• neo4j-gremlin-bolt, version 0.3.1

Since both Neo4j and the implementation were run on the same machine, the CPU's eight available threads were split between them: two were allocated to Neo4j, and the other six were used by the implementation.

For each of the query templates, the IDs of five possible start nodes were selected randomly. Templates A and C, which both have the input offerID, had their IDs randomised independently of each other. The queries were then run with each ID 10 times, for a total of 50 runs per template. In order to minimise the chance of using the CPU's internal cache, the different templates and IDs were run in as scattered an order as possible. To achieve this, the first ID of each of the templates was executed, then the second ID for each of the templates, and so on, until the fifth ID had been run for all the templates. An alternative way of writing this sequence is A.1, B.1, ..., E.1; A.2, ..., D.5, E.5, where the letter represents the query template, and the number represents which of its corresponding IDs is being used. Because of the structure of the queries, not all of them can be run in parallel for both the upper and lower half of the query. Queries using template A can never use threading on the upper half, and correspondingly templates C and E can never cause threading on the lower half.
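The interleaved run order can be sketched as nested loops; whether the full A.1-E.5 pass is itself repeated ten times is an assumption here, and runQuery is a hypothetical helper.

// Run order A.1, B.1, ..., E.1; A.2, ..., E.5: iterate IDs in the outer loop and
// templates in the inner loop, so consecutive runs never share a template.
String[] templates = {"A", "B", "C", "D", "E"};
int idsPerTemplate = 5;
int repetitions = 10;

for (int rep = 0; rep < repetitions; rep++) {
    for (int id = 1; id <= idsPerTemplate; id++) {
        for (String template : templates) {
            runQuery(template, id); // hypothetical: executes one benchmark query
        }
    }
}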

The number of threads for the different runs was chosen based on the number of threads available on the CPU. Since the implementation used six CPU threads, the thread counts were based on multiples of this number. To see how fewer threads than the available CPU threads affect the execution time, it was decided to run with 2 and 4 threads, respectively. We also wanted to see how the implementation performed with 6 available threads. Lastly, 9, 12, 18 and 24 threads were chosen to see how the implementation was affected by an increasingly larger number of threads than available on the CPU.

The time was recorded using Java's built-in method System.nanoTime, and was measured in whole milliseconds.
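A minimal sketch of such a measurement; truncating the nanosecond difference to whole milliseconds is an assumption about how the rounding was done.

// Measure the wall-clock time of one query execution in whole milliseconds.
long start = System.nanoTime();
executeQuery(); // hypothetical: runs one benchmark query
long elapsedMillis = (System.nanoTime() - start) / 1_000_000; // ns -> ms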


4 Results

In this chapter, the logical and query execution plans are presented, as well as the query execution algorithm and the test results from the performance runs.

4.1 Logical Plans

This section defines a notation for the representation of logical plans, where the logical operators are mainly based on the structure of the GraphQL queries themselves. There is also one operator that handles the assignment of the database nodes that should serve as entry points in the database graph, i.e. the start nodes for the query. For a visual representation of the operators, see Figure 4.1. The ellipse shows the name of the logical operator, and the two squares represent its input and output: the lower square is the input, and the upper one is the output. Two of the logical operators do not have any input, which is represented by the absence of the input square. The rounded square on the ellipse symbolises an argument of the specific operator, and a dashed outline indicates that the argument is optional. Lastly, dashed lines on an input indicate that it is optional, and also that it can represent multiple inputs of the same kind.

The first operator, GetStartNodes in Figure 4.1a, handles the gathering of all the database nodes that the query should have as starting points. The argument of the operator corresponds to the first line of the query, which contains all the information needed to find the start node(s). The first line is of the form queryLabel(args), where queryLabel should correspond to the node label of some database node(s), and args defines the required value of one or more properties in order for a node to be selected as a start node. For example, in query template A, the first line of the query is offer(nr:$offerID), and all database nodes with node label offer become contenders for being a start node. In order to select which of the offer nodes should be start nodes, we study the arguments. For this dataset, nr is the node property that contains the ID of the offer nodes, and $offerID is the ID given by the user. All offer nodes where the value of nr is $offerID will be represented as elements in the output array. Since there can only be one node with a specific ID, for this example there is one start node. However, if the argument does not concern an ID, there can potentially be several nodes with a matching property value, leading to multiple start nodes.
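With TinkerPop's Gremlin traversal API, on which the implementation is built, the GetStartNodes operator for query template A could be realised roughly as follows; the traversal is a sketch, with the property key nr taken from the dataset description above.

import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// Sketch of GetStartNodes for offer(nr: $offerID): all vertices labelled
// 'offer' whose property 'nr' equals the user-supplied ID.
static List<Vertex> getStartNodes(GraphTraversalSource g, long offerID) {
    return g.V().hasLabel("offer").has("nr", offerID).toList();
}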

The following operators describe how the response of the query is built. For each of them, the context which they should use (i.e. the corresponding database nodes) is passed along from
