
Large Scale Privacy-Centric Data

Collection, Processing and Presentation

Josefin Andersson-Sunna

Computer Science and Engineering, master's level 2021

Luleå University of Technology


ABSTRACT

It has become an important part of business development to collect statistical data from online sources. Information about users and how they interact with an online source can help improve the user experience and increase sales of products. Collecting data about users has many benefits for the business owner, but it also raises privacy issues, since more and more information about users is spread over the internet. Tools that collect statistical data from online sources exist, but using such tools gives away control over the data collected. If a business implements its own analytics system, it is easier to make it more privacy centric, and the control over the collected data is kept.

This thesis examines which techniques are most suitable for a system whose purpose is to collect, store, process, and present large-scale privacy-centric data. Research has been made about what technique to use for collecting data and how to keep track of unique users in a privacy-centric way, as well as about what database to use that can handle many write requests and store large-scale data. A prototype was implemented based on the research, where JavaScript tagging is used to collect data from several online sources and cookies are used to keep track of unique users. Cassandra was chosen as the database for the prototype because of its high scalability and speed at write requests. Two versions of the processing of raw data into statistical reports were implemented, to be able to evaluate whether the data should be preprocessed or whether the reports could be created when the user asks for them.

To evaluate the techniques used in the prototype, load tests of the prototype were made, where the results showed that a bottleneck was reached after 45 seconds at a workload of 600 write requests per second. The tests also showed that the prototype managed to keep its performance at a workload of 500 write requests per second for one hour, during which it completed 1 799 953 requests. Latency tests when processing raw data into statistical reports were also made, to evaluate whether the data should be preprocessed or processed when the user asks for the report. The results showed that it took around 30 seconds to process 1 200 000 rows of data from the database, which is too long for a user to wait for the report. When investigating which part of the processing increased the latency the most, it showed that it was the retrieval of data from the database: it took around 25 seconds to retrieve the data and only around 5 seconds to process it into statistical reports. The tests showed that Cassandra is slow when retrieving many rows of data, but fast when writing data, which is more important in this prototype.


SAMMANFATTNING

It has become an important part of business development for companies to collect statistical data from their online sources. Information about users and how they interact with an online source can help improve the user experience and increase sales of products. Collecting data about users has many benefits for the business owner, but it also raises privacy issues, since more and more information about users is spread over the internet. Tools that can collect statistical data from online sources already exist, but when such tools are used, control over the collected information is lost. If a company implements its own analytics system, it is easier to make it more privacy centric, and control over the collected information is kept.

This work examines which techniques are most suitable for a system whose purpose is to collect, store, process, and present large-scale privacy-centric information. Theories have been examined about what technique to use for collecting data and how to keep track of unique users in a privacy-centric way, as well as about what database to use that can handle many write requests and store large-scale data. A prototype was implemented based on the theories, where JavaScript tagging is used as the method for collecting data from several online sources and cookies are used to keep track of unique users. Cassandra was chosen as the database for the prototype because of its high scalability and speed at write requests. Two versions of the processing of raw data into statistical reports were implemented, to be able to evaluate whether the data should be processed in advance or whether the reports could be created when the user asks for them.

To evaluate the techniques used in the prototype, load tests of the prototype were made, where the results showed that a bottleneck was reached after 45 seconds at a workload of 600 write requests per second. The tests also showed that the prototype managed to keep its performance at a workload of 500 write requests per second for one hour, during which it completed 1 799 953 requests. Latency tests when processing raw data into statistical reports were also made, to evaluate whether the data should be preprocessed or processed when the user asks for the report. The results showed that it took around 30 seconds to process 1 200 000 rows of data from the database, which is too long for a user to wait for the report. When investigating which part of the processing increased the latency the most, it showed that it was the retrieval of data from the database that increased the latency: it took around 25 seconds to retrieve the data and only around 5 seconds to process it into statistical reports. The tests showed that Cassandra is slow when retrieving many rows of data, but fast at writing data, which is more important in this prototype.


ACRONYMS AND ABBREVIATIONS

Abbreviation  Description

DOM    Document Object Model
GDPR   General Data Protection Regulation
EDP    ePrivacy Directive
NoSQL  "Not only" SQL
SQL    Structured Query Language
ACID   Atomicity, Consistency, Isolation, Durability
XML    eXtensible Markup Language
JSON   JavaScript Object Notation
BSON   Binary JSON
BASE   Basically Available, Soft state, Eventual Consistency
DBMS   Database Management System
REST   REpresentational State Transfer
UUID   Universally Unique Identifier
ORM    Object Relational Mapper
ODM    Object Data Manager
OGM    Object Grid Mapper


TABLE OF CONTENTS

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Requirements
  1.4 Problem Definition
  1.5 Equality and Ethics
  1.6 Delimitations
  1.7 Thesis Structure
2 Related work
  2.1 Matomo
  2.2 Open Web Analytics
  2.3 Plausible
  2.4 Google Analytics
3 Theory
  3.1 Data Collection
    3.1.1 JavaScript Tagging
    3.1.2 Privacy
  3.2 Data Storage
    3.2.1 Structure
    3.2.2 Size
    3.2.3 Speed
    3.2.4 Scalability
    3.2.5 Database Models
4 Implementation
  4.1 Back End
    4.1.1 JavaScript Tracker
    4.1.2 REST API
    4.1.3 Database
    4.1.4 Data Processing
  4.2 Front End
    4.2.1 Publisher Dashboard
    4.2.2 Brand Dashboard
    4.2.3 Spotin Dashboard
5 Evaluation
  5.1 Back End
    5.1.1 Load Testing of the REST API
    5.1.2 Latency Testing of the On-demand Model
  5.2 Front End
6 Discussion
  6.1 Performance
  6.2 Database
  6.3 Processing Model
  6.4 Ethics
  6.5 General Remarks
7 Conclusions
8 Future Work
9 References


1 Introduction

Nowadays almost every business has online sources such as websites and apps, and collecting user statistics from these has become an important part of business development. User statistics can provide many valuable insights into what is needed to improve the business, such as creating a better user experience, increasing sales of products and services, and much more. Analytics tools that collect, store, and present data from online sources exist, but using such a tool gives away control of the data being collected. The third party can use the data for its own purposes, which raises privacy concerns for the end user. If a business develops its own analytics system, it keeps control of the data being collected.

1.1 Background

Spotin [1] is a distributed sales and digital marketing platform that gives product suppliers (brands) the possibility to sell their products on third-party online media content (publishers). The publishers who use Spotin can offer their visitors richer content, with products that are exposed in digital media such as images linked directly to the brand's own e-commerce system. This gives visitors the ability to purchase products from different brands with a single checkout and directly from the brand. Figure 1 illustrates how the three actors: users, publishers and brands are connected.

Figure 1: Spotin connects publishers, brands, and users. Screenshot is taken from Spotin’s website [1].


The users of Spotin, publishers and brands, request statistics reports on sales, visitors, conversion rates, and more. Spotin is also interested in statistics to be able to improve the user experience and increase the use of the service. For this, Spotin wants to develop an analytics system that collects, stores, processes, and presents anonymized statistical data from publishers using Spotin.

This work aims to examine what techniques are suitable for an analytics system and to implement a prototype based on the research that can collect anonymous statistical data. The prototype can then be evaluated for further development and possible use in Spotin’s platform.

1.2 Motivation

Analytics services that can be integrated into online sources already exist, where the most used [2] is Google Analytics [3]. When using such services, however, the owner of the online source does not have complete control over the data being collected, since the collection is done by a third party. The third party can then use the data for its own intentions. If data about users is collected from several sources, as in the case of Google Analytics, and that data is later combined, the result can become quite personal. This is an example of something that cannot be controlled by the owner of the online source if the data collection is done by a third party.

Google Analytics collects users' IP addresses and, to anonymize them, uses masking [4], where the last bits are set to zero. This method is classified as pseudonymization of the IP address rather than anonymization. Pseudonymized data is described as personal data that is changed in a way where it can no longer be connected to a specific individual without using additional data [5]. Pseudonymized data is not the same as anonymized data, since anonymization must be irreversible and make it impossible to connect the data to an individual even when using additional data. According to the GDPR (General Data Protection Regulation) [5], pseudonymized data is classified as personal, and there are regulations that must be followed when processing personal data.

The responsibility when collecting personal data lies with both Google Analytics and the owner of the online source being tracked, and that must be considered when using the tool. Google Analytics is responsible because they are not only using the information to present statistics, but also for their own purposes. Google makes money by sharing the information with third parties to target marketing to specific people, making the marketing more effective [6]. It is, however, up to the owner of the online source to inform its visitors and receive a valid consent from them before collecting any data with Google Analytics. Even if a valid consent is given that complies with the GDPR, there are still risks of violating it when using Google Analytics, since it can involve data transfer to the US, and there are regulations in the GDPR regarding transfer of personal data to third countries [6].

As described in this section, there are many things that must be taken into consideration before using a service like Google Analytics. Spotin wants to develop its own tool for collecting statistics, where the data is not traceable to specific users and where data is not shared with third parties. With its own analytics system, Spotin has control over the data and can make sure that all the regulations in the GDPR are followed. Spotin can possibly also further develop the system to encrypt the collected information. That would ensure that only the brands and publishers themselves get access to their own data, so that not even Spotin gets direct access to it. That would add even more security for sensitive data.

1.3 Requirements

The user types that request statistics reports are Spotin, brands, and publishers, where statistics over sales and conversion rates are interesting for all three. Publishers and Spotin are also interested in some data regarding the number of visitors and how they interact with Spotin, to help increase user experience and sales. There are different ways a customer can buy products through Spotin, and to understand the requirements on what statistics the prototype should collect and process, they are described in this section.

At present there are three different ways a customer can buy products through a publisher with Spotin:

1. Spots
2. Image captions
3. Shop

Figure 2 shows how customers can buy products through spots in images and through image captions. It will be a requirement that the prototype can present sales statistics for the spots and the image captions. These images can also be part of a publisher's written articles; statistics over sales per article will therefore also be a requirement for the prototype. These statistics can help both publishers and brands understand which images sell the most.


Figure 2: Customers can buy products through spots (left) and image captions (right). Screenshots taken from Spotin’s website [1] and Café’s [7] website.

Figure 3 shows the shop from which a customer also can buy products. Sales statistics of products sold from the shop will also be a requirement for the prototype.

Figure 3: Customers can buy products in the shop. Screenshot is taken from Spotin’s website [1].


There are also requirements of what statistics each user should be able to see:

• Publishers want to see statistics of sales, visitors, and conversion rates from their own online source.

• Brands want to see statistics of sales and conversion rates for their brand, from all publishers they are exposed at.

• Spotin wants to see statistics of total sales, visitors, and conversion rates from all brands and publishers.

1.4 Problem Definition

The main problem to be studied in this thesis is what techniques are most suitable for a system whose purpose is to collect, store, process, and present large-scale anonymized statistical data from several online sources not owned by the data collector. The main problem can be expressed as a set of research questions:

PD1 With user interactions on several online sources generating write requests to the analytics system, where is the bottleneck for how many requests it can handle at a time?

PD2 What database is the most suitable for an analytics system that must be able to scale with the continuously increasing amount of data?

PD3 The data collected needs to be processed into reports before presentation. From a performance perspective, is it desirable to pre-process the data and store the results in separate tables rather than processing the data when it is requested?

1.5 Equality and Ethics

Integrity issues on the internet have become a big problem, especially when it comes to collecting user data. Such data is often used for targeted direct marketing, and if it is combined with data collected about the user from other sources, it can become even more personal.

When browsing the internet, users give away a lot of personal data for free, which has become a huge source of income for large corporations. Users are often not aware of what information, and how much of it, is circulated around the internet. A lot of information can easily be extracted directly from a user's browser without the user noticing it. Such information can be the user's location, what type of device is used, the battery level of the device, and mouse movements. Social media platforms are another example where users give away a lot of personal data for free, by adding information to their profiles and interacting with the application. These platforms also receive information about the user from other sources and use it to target ads to specific users, thereby making money out of user data.

When developing a system whose purpose is to collect and analyze user data, it is important to reflect on what data can be collected without violating user integrity. The prototype in this thesis will not collect any personal data that can be traced back to specific users. The prototype will not use data for targeted direct marketing, and data will not be shared with third parties. The prototype will, however, use cookies to be able to track the number of unique visitors. The use of cookies in the prototype is further discussed in 3.1.2.

1.6 Delimitations

The focus of this thesis is on finding the right techniques for a system that can collect, store, process, and present statistical data, and since every part of the system takes a long time to study and implement, some delimitations must be made.

Security is not going to be taken into consideration when implementing the REST API and the front end, since it is not in the scope of this thesis. The REST API and the front end dashboards of the prototype can be accessed by anyone, and security should be added in the future.

Since the back end is the main focus when finding the right techniques, the front end part of the prototype will only be implemented as a demo of how the statistical reports can be presented.

1.7 Thesis Structure

In section 2, some already existing analytics systems and their solutions are discussed to get inspiration for the implementation of the prototype. In section 3, the theory behind techniques for collecting data as well as the theories behind choosing the right database for the prototype are discussed. In section 4, the implementation of the prototype is described based on the theories from section 3 and in the following section, the result is evaluated. Section 6 consists of a discussion about the solutions and the results. Section 7 consists of conclusions and section 8 is about what can be done in the future.


2 Related work

Analytics services and tools that business owners can use to track user behavior on their online sources already exist; there is, however, not much information about what techniques to use when designing such a system. Studying existing solutions can be helpful when making the design choices for the prototype. These solutions are not customized for a specific online source and have different focus: some prioritize user privacy, while others want to gather as much information as possible about their users. Some of the tools also come with a cost, which must be taken into consideration before using them. Some of these related works, with different characteristics, will be described in this section.

2.1 Matomo

Matomo [8] is an open-source analytics application running on a PHP/MySQL web server and is described as a more privacy-centric alternative to Google Analytics. In terms of appearance and functionality, Matomo is similar to Google Analytics, but with Matomo the user has better control over the data collected. Matomo can run on the user's own servers, in the cloud, or be hosted by Matomo as a cloud service. Even though Matomo is open source, it is not necessarily free to use. It is free when using the basic functionality and when it is hosted by the user, but more advanced functionality comes with a cost that depends on how many visitors the online source has.

2.2 Open Web Analytics

Open Web Analytics (OWA) [9] is an open-source analytics software framework written in PHP/MySQL. OWA was developed because of the increasing need for an open-source framework that easily tracks and analyzes how people use online sources. Even if the user of OWA has control over the data collected, since it is hosted by the user, it has some features that raise privacy concerns. It assigns each user a unique user ID and creates reports about each individual user with detailed information about them, such as their IP address, the pages they viewed, and how long their visit was. It also has additional plugins that can track users' mouse and other DOM movements, to be able to analyze which elements attract the user the most.


2.3 Plausible

Plausible [10] is another open-source analytics tool, which focuses on being lean and fast and therefore only collects a small amount of information. The biggest focus of this tool is privacy, where the project creators state that no information about visitors is stored. The interesting thing with this tool is that it does not use cookies to track unique users, which is otherwise common when using JavaScript tagging to collect the data. By not using cookies, Plausible is compliant with the cookie laws and privacy regulations, and Plausible does not need to obtain consent from the visitors to store and retrieve data from their devices. But this also comes with some limitations, since some statistics, like unique visitors, cannot be calculated without the use of persistent identifiers. Instead of using cookies, Plausible generates an identifier that changes each day, by running the user's IP address and user agent through a hash function with a rotating salt [11]. Old salts are deleted to reduce the chances of revealing the IP address and to prevent linking information about visitors from different days.

Plausible is free if it is self-hosted and comes with a monthly fee based on number of visitors if hosted by Plausible.

2.4 Google Analytics

Google Analytics is the most used analytics tool, and it comes in a free version and a paid premium version for more advanced features. Google Analytics has a lot of features that can provide the user with detailed information about visitors and how they interact with the content, sales of products, and more [12].

Google Analytics uses JavaScript tracking to collect data from several online sources, and as described in 1.2, Google Analytics' collection and use of personal data raises privacy concerns. According to the paper Bigtable: A Distributed Storage System for Structured Data [13], Google Analytics uses Google's own Bigtable to store the huge amount of data. Bigtable is designed to be highly scalable and to have high performance and availability.


3 Theory

This section covers the theory behind the techniques that will be used in the implementation of the prototype. First it covers the technique used for collecting data and the privacy concerns to consider when using it. It also covers the theory behind choosing the right database for the prototype.

3.1 Data Collection

There are some ways to collect statistical data from online sources where log file data capture and JavaScript tagging are the two main approaches [14]. JavaScript tagging is going to be further studied in this section since it is easier to use than log file data capture when tracking third-party online sources, which the prototype in this thesis is going to do. It is also the most used technique by existing analytics tools with the same purpose, making it easier to find information on how it can be implemented.

Privacy is also something that must be taken into consideration when collecting user data and when using cookies to identify unique users and will therefore also be discussed in this section.

3.1.1 JavaScript Tagging

JavaScript tagging, or "page tagging", is a method where the online source being tracked inserts a snippet of JavaScript code, provided by the analytics system, into the beginning of every page that is going to be tracked; the snippet is activated every time a visitor opens a page. The code snippet downloads a script from the analytics system containing functions that can be used in the application to collect data. When the functions are used, data is sent to the analytics system for storage, processing, and presentation. The process of collecting data from a website with this technique is illustrated in Figure 4 and is described by A. Kaushik in the book Web Analytics: An Hour a Day as follows [15]:

1. The user types the website's URL in the browser.

2. The request comes to the webserver of the website.

3. The webserver sends back the page with the appended JavaScript code from the analytics system.

4. As the page loads, it executes the JavaScript code, which captures the page view details and sends them back to the analytics system for storage and processing.


Figure 4: Data collection with JavaScript tagging.

There are some things to keep in mind when using this approach [14]:

• Users that have JavaScript turned off will not be recognized by the analytics system.

• Cookies are used to identify unique users; if a user has cookies disabled or blocked, the analytics system will not be able to recognize the user as unique. It can therefore be good to measure how many page views are done with cookies disabled.

• If a user deletes the cookies, the user will be treated as a new visitor even if it is a returning visitor.

• User cookies are not shared across devices and browsers, which means that one user will be treated as two unique users if she visits the website on two different browsers or on two different devices.

3.1.2 Privacy

Privacy must be considered when collecting user data. The GDPR [16] took effect in May 2018 with the purpose of protecting the personal data of people living in the EU. If a cookie is set by another domain than the one the user is visiting, the cookie is classified as a third-party cookie. If third-party cookies are used, they must comply with the regulations regarding cookies in the GDPR and the ePrivacy Directive (EDP) [17]. To be GDPR compliant when storing such cookies, the following points, taken from gdpr.eu [18], must be followed:

• Receive the user's consent before using any cookies.

• Provide accurate and specific information about the data each cookie tracks and its purpose, in plain language, before consent is received.

• Document and store consent received from users.

• Allow users to access the service even if they refuse to allow the use of certain cookies.

• Make it as easy for users to withdraw their consent as it was for them to give their consent in the first place.

Since cookies will be set by the analytics system to keep track of unique visitors, they will be classified as third-party cookies. If a user does not consent to the use of cookies, the analytics system should not store a cookie, and thereby not track the user as a unique visitor. It is up to the online source being tracked to obtain consent from the user.

Due to the privacy concerns regarding the use of third-party cookies, some browsers, including Mozilla Firefox and Safari, block them by default. Google also announced in early 2020 that it will stop the use of third-party cookies in Chrome by 2022, and instead create a privacy sandbox where people with similar interests are grouped, to hide individual persons but still make it possible to give the group relevant ads and content [19]. This would not stop the tracking of users, and some also claim that Google does this to gain a further grip on the advertising market [20]. Regardless of the purpose of blocking third-party cookies, this must also be taken into consideration when using cookies as a unique identifier in the prototype.

3.2 Data Storage

When data has been collected from an online source, it needs to be stored in the system. There are many different solutions for storing data, each with its advantages and disadvantages. The decision of which solution to use in an application does not have an absolute answer; it depends entirely on what it is going to be used for and in which type of application, but the choice can be crucial for the end product. Many articles have been written on the topic of making the right choice [21] [22] [23] [24], and there are some key factors that will be described in the subsequent sections:

• Structure
• Size
• Speed
• Scalability


3.2.1 Structure

Structure refers to how to store and retrieve data. Data can be of different forms and sizes and there are three categories in which it can be stored: structured, unstructured, or semi-structured [25]. Before making the choice of which database to use, it must be taken into consideration in which structure the data will be stored.

1. Structured data is highly organized and the type of data that we are most commonly working with, for example in spreadsheets where data are organized in rows and columns. The format of the data can be both text and numbers, for example addresses, names, geolocations and so on. Structured data is formatted to fit predefined fields before it is placed in storage [26].

2. The definition of unstructured data is the opposite to structured data. Data is not formatted and organized before it is placed in storage, instead it is stored in its original format and not processed before it is used. Unstructured data can be in a wide variety of formats, for example text, video, images, e-mail, social media activity and much more [26].

3. Semi-structured data can be defined as a mix between the other two categories. It is not structured but it has some organizational properties such as metadata or semantic tags that can separate the data into various hierarchies. Examples of semi-structured data are JSON and XML [27].

When it comes to databases, structured data is usually stored in a relational database while unstructured and semi-structured data is usually stored in a non-relational database [28].

3.2.2 Size

Size refers to the quantity of data being stored and the ability to store and retrieve data without negatively impacting the database. When reflecting on the size needed for a certain application, the future needs are just as important, since the quantity will grow with time. When the quantity grows, queries will run slower, which will affect performance and speed [24].


3.2.3 Speed

Speed refers to the time it takes to handle incoming requests. The optimal database would be fast at all kinds of requests, but that is not the case. Some are designed for heavy writes while others are designed to handle heavy reads, and that is why this needs to be considered when choosing a database for an application. Several research papers evaluate how different database models perform for different queries with varying workloads [29] [30] [31], and the results show that the relational model performs slower than the non-relational model in almost every query when the workload increases. In tests between non-relational models, it was the in-memory model that had the highest performance.

3.2.4 Scalability

Scalability is the ability to add or remove capacity when the workload changes. The amount of data and the number and size of the requests are workloads that can affect the database. Non-relational databases are typically easier to scale than relational ones, and there are two main approaches when it comes to database scalability: vertical scaling and horizontal scaling.

Scaling a database vertically means adding more capacity (CPU, memory, storage) to a single machine. There is a limit to how much capacity can be added to one instance, and if the workload grows beyond that limit, a database that can be scaled horizontally could be a better choice. Scaling a database horizontally means adding more servers to spread the workload between them [32].

3.2.5 Database Models

There are different types of database models, where the relational model is the most used. Database models that are non-relational have been categorized as NoSQL ("not only" SQL) [33]. Database models are designed for different purposes, where every model has its strengths and weaknesses. Some of the most popular models will be further studied in this section, to be able to evaluate which one fits the requirements of the analytics system based on the factors listed in 3.2.

3.2.5.1 Relational Model

Relational databases were created to store highly structured data. In a relational database, data is organized in tables consisting of rows and columns [34]. The tables can have relations between each other, and fixed schemas are required to keep the structure. The well-organized structure makes it easy to manage the data. Another advantage of a relational database is that it must be ACID [35] compliant, which stands for:

• Atomicity: every transaction is treated as a single unit, and if one part of the transaction fails, the entire transaction fails.

• Consistency: a transaction can only bring the database from one valid state to another. Every transaction must follow a set of rules.

• Isolation: every transaction occurs in isolation, meaning that no transaction will interfere with another transaction.

• Durability: once a transaction is committed it will stay committed even if the system fails.

Because of the support for ACID properties, the database guarantees crash recovery and highly reliable transactions. The advantages of structured data, well-defined tables, and fulfilling the ACID properties come with trade-offs. Structured data with relationships between tables does not work well across a distributed architecture, which makes scalability a trade-off in relational databases [36]. Scaling is done vertically, which may not be the best solution when handling large-scale data. Another trade-off that comes with the highly structured data is that if a new set of data needs to be inserted but does not fit the table parameters, it will be hard to include it. Performance on queries can decrease with many relations between tables, and complex queries typically take longer to execute.

3.2.5.2 Non-relational Model

Non-relational databases are categorized as NoSQL databases and usually do not use organized tables as their storage structure [35]. This makes it easier to scale the database horizontally, which increases the capacity and the availability since more servers can handle the workload. They can store and process structured, unstructured, and semi-structured data, which gives more flexibility since data can be changed at any time without affecting the existing data. NoSQL databases are usually not built on ACID properties, but on BASE properties:

• Basically Available: when failure occurs, the system is guaranteed to be available.

• Soft state: the state of the data could change without application interactions, due to eventual consistency.

• Eventual consistency: the system will eventually become consistent once it stops receiving new updates.

The term NoSQL includes several data models, where four common ones are: key-value, document, wide column, and graph. The most popular DBMS (database management system) [37] for each data model is described in Table 1 [38].

Table 1: Description of four common NoSQL models.

Redis (key-value): An in-memory data structure store. It can be used as a database, message broker, and cache. It supports various data structures like strings, lists, sets, hashes, and sorted sets. It can be replicated using a relaxed master-slave architecture [39].

MongoDB (document): Data is stored as documents in a JSON-like format with dynamic schemas, which makes data integration easier and faster. Ad hoc queries, indexing, and real-time aggregation can be used to access and analyze data. High availability, horizontal scaling, and geographic distribution are built in [40].

Cassandra (wide column): Data is stored in rows that are organized into column families (tables). It is linearly scalable and provides availability with high performance at write operations. It can manage large-scale data across a number of servers and still maintain high availability without any single point of failure.

Neo4j (graph): Stores data in a graph structure with nodes and edges, which can have properties associated with them. It is highly scalable and handles relationships between nodes. It is a fully ACID-compliant database.


4 Implementation

This section describes the implementation of the prototype that is part of this thesis work. The implementation of the prototype is based on the theory in section 3 and the related works in section 2. The purpose of the prototype is to test and evaluate the chosen techniques, to be able to answer the questions in 1.4. The prototype will collect statistical data from any online source using JavaScript. The data that is collected will be stored in a Cassandra database and later processed into statistical reports before it is presented on dashboards for three different user types: Spotin, brands, and publishers. Docker containers [41] will be used to build all parts of the system. An overview of the system can be seen in Figure 5.


Figure 5 shows the flow of requests, which can be described as follows:

1. A user visits one of the online sources being tracked by the analytics system.

2. The user performs an action that the website wants to track.

3. A request containing data about the action is sent to the analytics server.

4. The analytics server receives the request and stores the data as raw data in the database.

5. When one of the user types wants to see statistical reports in their dashboard, a request for the report is sent to the analytics server.

6. The analytics server receives the request and performs one of the following:

o gets the raw data from the database and processes it into the requested report, or
o gets an already processed report that is stored in the database.

7. The report is then sent to the analytics front end and presented in the dashboard.

The implementation of each part of the system will be further described in the next sections.

4.1 Back End

The back end will consist of the JavaScript tracker, a REST API [42], the data processing, and the Cassandra database cluster. The choice of Cassandra as the database is further described in this section. Node.js [43] is chosen as the back end server environment, since it is described as faster than many other server-side technologies due to its single-threaded event loop model. While the multi-threaded model instantiates a new thread or process for every request from the client, Node.js handles all requests in a single thread with shared resources, which removes the risks of race conditions and deadlocks. If a task in Node.js uses callbacks, promises, or async/await, it will run asynchronously through the event loop. When one task waits for a response, the next task in the event loop can be fired, which makes the I/O tasks non-blocking and fast.

Data that is going to be collected by the JavaScript tracker will be stored as raw data and must be processed into different metrics and compiled into statistical reports before presentation. The definitions of the metrics used in this thesis are described in Table 2.


Table 2: Definitions of general metrics used in this thesis.

Page view: The total number of pages viewed on the website.
Unique visitors: The total number of unique visitors on the website.
New visitors: The total number of users that visit the website for the first time.
Returning visitors: The total number of users that have visited the website before.
Domain conversion rate: The total number of orders divided by the total number of unique visitors.
Cookies disabled: The total number of visitors with cookies disabled.
Revenue: The total revenue of sold products.
Amount: The total number of sold products.
Conversion rate on spot: The total number of purchases from the spot divided by the total number of unique visitors that clicked on the spot.
Domain conversion rate for shop: The total number of purchases from the shop divided by the total number of unique visitors on the website.
Shopping cart conversion rate: The total number of purchases from the shop divided by the total number of add-to-shopping-cart events from the shop.
Article conversion rate: The total number of purchases from an article divided by the total number of unique visitors on the article.
Article conversion rate for spot: The total number of purchases from a spot in an article divided by the total number of unique visitors on the article.
Article conversion rate for image caption: The total number of purchases from an image caption in an article divided by the total number of unique visitors on the article.
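As a worked example (the numbers are illustrative, not taken from the thesis): a publisher with 2 000 unique visitors and 50 completed orders during a period would have a domain conversion rate of 50 / 2 000 = 2.5 % for that period.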

4.1.1 JavaScript Tracker

The tracker will be implemented based on the theory described in 3.1.1 and the related works in section 2. The tracker will be a JavaScript file located in the root folder of the back end and will contain all the functions that are used for collecting data when users interact with the online source being tracked. All the tracking functions and how to use them are listed in Appendix C.


To track an online source, the analytics script tag seen in Figure 6 must be inserted at the beginning of every page that is going to be tracked. This code snippet creates a script element that asynchronously downloads the analytics.js script from the analytics server. Since the script is downloaded asynchronously, the loading time of the page will not be affected by the added code snippet.

Figure 6: Analytics script tag.
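As an illustration of what such a loader tag can look like, a minimal sketch is shown below (the script URL, tracking id, and exact structure are assumptions rather than the prototype's actual tag; the global function name sa matches the description that follows):

    <script>
      // Define a global command queue so that tracking calls made before
      // analytics.js has finished loading are buffered in sa.q.
      window.sa = window.sa || function () {
        (window.sa.q = window.sa.q || []).push(arguments);
      };
      sa('create', 'PUB-1234');   // tracking id identifying the publisher
      sa('send', 'pageview');     // log the initial page view

      // Download analytics.js asynchronously so page rendering is not blocked.
      var s = document.createElement('script');
      s.async = true;
      s.src = 'https://analytics.example.com/analytics.js';
      document.head.appendChild(s);
    </script>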

JavaScript has a global object called window, which can be reached by both the JavaScript tracker and the application being tracked. The window object will be used to create a global function "sa(arguments)" that, when called, pushes the arguments into a global array "sa.q" on the window object. The first argument is the function to be executed by the tracker, and the rest are arguments to be sent as parameters to that function. At the end of the analytics.js script, the global array will be redeclared as a function queue, seen in Figure 7, and the functions that were pushed into the global array by the code snippet will be executed in the order they were inserted once the script is fully loaded. The global function "sa(arguments)" can then be used in selected parts of the application to execute further functions on the tracker immediately, since it will then call the push function in Figure 7 on the global "sa.q".
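A sketch of how analytics.js could replay the buffered calls and then redeclare the queue is shown below; the tracker object and command handling are simplified assumptions, not the thesis's actual code:

    // Inside analytics.js (sketch).
    (function () {
      var tracker = {
        create: function (trackingId) { this.trackingId = trackingId; },
        send: function (type) { /* POST the action to the REST API */ }
      };

      // Execute one queued call: args[0] is the tracker function name,
      // the remaining arguments are passed on as parameters.
      function execute(args) {
        tracker[args[0]].apply(tracker, [].slice.call(args, 1));
      }

      // Replay calls buffered before the script finished loading, in order.
      var buffered = (window.sa && window.sa.q) || [];
      for (var i = 0; i < buffered.length; i++) execute(buffered[i]);

      // Redeclare the queue so later sa(...) calls are executed immediately.
      window.sa.q = { push: execute };
    })();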


The first function that must be executed is the create function. It will take a tracking id, used to identify a publisher, as an argument and will set it on the tracker object. To be able to keep track of unique visitors, the create function will also set a client id on the tracker object. The client id will be a UUID (Universally Unique Identifier) stored in a cookie named "_sa" in the user's browser, and it is reused if the user is a returning visitor with cookies enabled. If the user is a new visitor, a new client id will be generated and stored in the "_sa" cookie. If the user has cookies disabled or blocked, the analytics system will not be able to count the user's actions as unique. The UUID will be in the form "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", where each x is replaced by a letter or number. The function that will be used for generating it is from an example on tutorialspoint.com [44], where each x is replaced by a letter or number calculated by combining the number of milliseconds since 1 January 1970 00:00:00 with random numbers, making the chance of creating the same UUID for several users minimal.

To log a user action in the database, the send function must be called when the action takes place. It will take one argument specifying the type of action to be logged, for example a page view or an event. The send function will use XMLHttpRequest to send data to an endpoint of the REST API through a POST request.
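A sketch of what the client id handling and the send function could look like is shown below; the cookie handling details, endpoint path, and payload fields are assumptions for illustration:

    // Generate a UUID by combining the current time with random numbers
    // (same idea as the referenced tutorialspoint example).
    function generateUUID() {
      var d = new Date().getTime();
      return 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'.replace(/x/g, function () {
        var r = (d + Math.random() * 16) % 16 | 0;
        d = Math.floor(d / 16);
        return r.toString(16);
      });
    }

    // Read the client id from the "_sa" cookie, or create one for a new visitor.
    function getClientId() {
      var match = document.cookie.match(/(?:^|; )_sa=([^;]+)/);
      if (match) return match[1];                       // returning visitor
      var id = generateUUID();                          // new visitor
      document.cookie = '_sa=' + id + '; path=/; max-age=' + 60 * 60 * 24 * 365;
      return id;
    }

    // Send an action to the REST API with a POST request.
    function send(tracker, type, data) {
      var xhr = new XMLHttpRequest();
      xhr.open('POST', 'https://analytics.example.com/api/collect/' + type, true);
      xhr.setRequestHeader('Content-Type', 'application/json');
      xhr.send(JSON.stringify({
        tid: tracker.trackingId,      // publisher tracking id set by create()
        cid: tracker.clientId,        // client id from the "_sa" cookie
        url: window.location.href,
        data: data                    // action-specific details
      }));
    }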

Figure 8 illustrates when a user pushes a button, and the event is sent to the analytics system.


4.1.2 REST API

The REST API will be built in a layered architecture to separate the responsibilities between the layers. The layers are the controller layer, the service layer, and the data access layer. Figure 9 illustrates how the different layers interact with each other.

Figure 9: REST API layered architecture.

Routes define the endpoints of the API; they will receive the client request and call the correct controller to handle it. The routes are divided into two parts: routes for collecting data through the JavaScript tracker and routes for retrieving statistical reports for the three user types. The routes are listed in Appendix A and a description of the API parameters can be found in Appendix B. The controllers will receive a request from the routes, create a request object, call a service, and return a response. The services will be responsible for all business logic, inserting and retrieving data from the database through the data access layer and returning it to the controllers.
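A minimal sketch of how the three layers could fit together, assuming Express is used as the web framework (the route path, names, and the stubbed data access layer are illustrative assumptions, not the prototype's actual code):

    const express = require('express');
    const app = express();
    app.use(express.json());

    // Data access layer (stubbed here): talks to Cassandra in the prototype.
    const pageViewRepository = {
      async insert(pageView) { console.log('would insert into Cassandra:', pageView); }
    };

    // Service layer: business logic and database access.
    const pageViewService = {
      async save(pageView) { await pageViewRepository.insert(pageView); }
    };

    // Controller layer: builds a request object and calls the service.
    const pageViewController = {
      async create(req, res) {
        const pageView = {
          trackingId: req.body.tid,
          clientId: req.body.cid,
          url: req.body.url,
          date: new Date()
        };
        await pageViewService.save(pageView);
        res.status(201).end();
      }
    };

    // Route layer: defines the endpoint and delegates to the controller.
    app.post('/api/collect/pageview', (req, res) => pageViewController.create(req, res));

    app.listen(3000);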

4.1.3 Database

The choice of database is based on the factors listed in 3.2 and the requirements for the analytics system which are summarized in Table 3.

Table 3: Requirements for the analytics system.

STRUCTURE: Data comes at high speed and will be stored as raw data. The information must be easy to retrieve when it is to be processed before presentation; this means that some type of structure is required without it becoming too complex to query, since that can reduce performance. Relationships between data are not required.

SIZE: Collecting data from several online sources means large amounts of data to be stored, where the amount of data will increase continuously with time.

SPEED: It must be able to handle multiple write requests fast. Every write request is important, since the statistics will be incorrect if not every request is stored. The speed when retrieving the statistical reports is not as important, since that will not happen as often.

SCALABILITY: Horizontal scaling is important because of the large-scale data that will be stored and from a performance perspective. If the workload becomes too heavy for one server to handle, more servers should be easily integrated.

In a research paper [38], the authors compared different features of the database models described in 3.2.5 where a summary can be seen in Table 4.

Table 4: Summary of different database models and their features.

Feature | Cassandra | MongoDB | Redis | MySQL | Neo4j
Database model | Wide column | Document | Key-value | Relational | Graph
DB | Key space | Database | Database | Database | Graphs
Table | Column family | Collection | Hash set, list, set, sorted set and string | Relation | Label
Value | Rows | Documents | Key-value pair | Rows | Nodes and edges
Read operations | Slow | Fast | Fast | Slow (join dependent) | Data dependent
Write operations | Fast | Fast | Fast | Slow | Data dependent
License | Open source | Open source | Open source | Open source | Open source
Scaling | Horizontal | Horizontal | Horizontal | Vertical | Horizontal
Replication | Selectable replication factor | Master-slave | Relaxed master-slave | - | Causal clustering using the Raft protocol (master-slave)
Data scheme | Schema-free | No fixed schema, but documents of the same type usually have similar structures (not mandatory) | Schema-free | Yes | Schema-free
Transaction concepts | Atomicity and isolation are supported for single operations | Atomic operations can be performed within a single document | Optimistic locking, atomic execution of command blocks and scripts | ACID | ACID
Predefined types | Yes: ASCII, int, blob, counter, decimal, double, list, map, set, text, timestamp, varchar | Yes: Boolean, date, object_id, string, integer, double | Partial: strings, bit arrays, HyperLogLogs, hashes, lists, sets, sorted sets, and geospatial indexes | Yes: int, float, double, date, time, bit, char, enum, binary, blob, Boolean | Yes: Boolean, byte, short, int, long, float, double, char, string

Based on Table 1, Table 3, and Table 4, MongoDB and Cassandra seem to have similar characteristics and fit the requirements better than the others. One key difference that makes Cassandra the better choice of the two is that it works with a peer-to-peer architecture, with each node connected to all others. Every node can perform all database operations and serve client requests, making the database available even if nodes disconnect from the cluster. MongoDB works with a master-slave architecture, where the master node handles all requests. When the master node disconnects from the cluster, it takes some time to elect a new leader, making the database unavailable for writes and reads until the new leader is elected [45]. Since the analytics system must handle a huge amount of write operations, it is better to be able to spread the requests over several nodes and to have a database that prioritizes high availability over consistency. Cassandra is therefore chosen as the database for this prototype.

To make the code cleaner and to make it easier to handle the database requests, Express Cassandra [46] will be used, which is a Cassandra ORM/ODM/OGM for Node.js. With Express Cassandra, database models are written as JavaScript modules, and the database tables and user-defined types are created automatically when the application is started. The models also contain methods for making queries to the database, such as save and delete, without having to write raw CQL (Cassandra Query Language) queries.
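As an illustration of what such a model module could look like, a sketch is shown below; the field names and file layout are assumptions, not the prototype's actual model:

    // models/PageviewModel.js - loaded by Express Cassandra at startup,
    // which creates the corresponding table if it does not already exist.
    module.exports = {
      fields: {
        tracking_id: 'text',     // identifies the publisher
        date: 'timestamp',
        client_id: 'text',       // UUID from the "_sa" cookie
        url: 'text'
      },
      // key[0] is the partition key, the remaining entries are clustering keys.
      key: [['tracking_id'], 'date']
    };

    // In the data access layer, an instance can then be saved without raw CQL:
    //   const pageview = new models.instance.Pageview({ ... });
    //   pageview.save(callback);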

Before implementing the database, some things must be taken into consideration and those will be described in the next section.

4.1.3.1 Database Implementation Considerations

Since Cassandra is a distributed database where data from one table can be distributed over several nodes in the cluster, the models must be designed based on the queries used to retrieve data, to make them as efficient as possible. How data from one table is stored is specified by different keys; the keys and their descriptions are listed in Table 5.


Table 5: Description of keys in Cassandra

Primary key: One or more columns used to retrieve data from a table.
Composite/compound key: Any key consisting of multiple columns.
Partition key: The first part of a composite primary key. Responsible for data distribution across nodes.
Clustering key: The second part of a composite primary key. Responsible for data sorting within the partition.

To query on a column from a table (SELECT * FROM table WHERE column = 1), the column must be part of the primary key. The partition key must be part of the query, since it is more efficient to query only one partition than many. The clustering keys can also be part of the query, in the order they are defined. As an example, the table in Figure 10 has these valid queries:

1. column1,
2. column1 and column2,
3. column1 and column2 and column3.

Invalid queries for the same table would be:

1. column2,
2. column3,
3. column1 and column3.

Figure 10: Table model with a composite primary key of three columns.
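Expressed with the Express Cassandra ORM described in 4.1.3, and assuming a model whose key is defined as [['column1'], 'column2', 'column3'] as in Figure 10, the same rules translate roughly to the queries below (the model name, the callback, and the use of the $gt range operator are illustrative assumptions):

    // Valid: the partition key is restricted, and clustering keys are used in order.
    ExampleModel.find({ column1: 'a' }, callback);
    ExampleModel.find({ column1: 'a', column2: 'b' }, callback);
    ExampleModel.find({ column1: 'a', column2: 'b', column3: { $gt: someDate } }, callback);

    // Invalid: the partition key is missing, or a clustering key is skipped.
    // ExampleModel.find({ column2: 'b' }, callback);
    // ExampleModel.find({ column1: 'a', column3: { $gt: someDate } }, callback);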

It is not only the queries that have to be taken into consideration when designing the models. The partition key has an important role, since it is responsible for the data distribution across the nodes in the cluster, which means that a good model has a partition key that spreads the data evenly across the nodes. If a table has a partition key that always has the same value for all rows, all data from that table will be stored on a single node, which is not an even spread. A better model is one where the value of the partition key varies a lot between the rows.


4.1.3.2 Database Implementation

To answer the problem definition PD3, there will be two different data processing implementations. This also requires two different database models, because of the querying restrictions described in 4.1.3.1: the on-demand model and the preprocessing model. The on-demand model will create the statistical reports when the user requests them, while the preprocessing model will be scheduled to do the work once a day and save the reports in a separate table in the database. The design of the two database models is further described in this section.

4.1.3.2.1 On-demand Database Model

This model has the following query criteria:

1. it must be able to query tables on a date range when creating reports for a longer date period than one day,

2. it must be able to query tables on tracking id for publishers, on brand id for brands and to get data from all publishers for Spotin.

When using the range operators (>, >=, < or <=) in a query, only the lowest-level column that is not part of the partition key can be restricted. This means that the date column must be the lowest-level column in the queries and cannot be part of the partition key. To retrieve data for a publisher, the tracking id column, which identifies a publisher, must be on a higher level than the date column. This gives a primary key as in Figure 11.

Figure 11: Table model with primary key for retrieving data for publishers.

The primary key has the tracking id as partition key and date as clustering key. This model will work for retrieving data for publishers and for Spotin. Retrieving data from the same table for brands will not be possible, since the brand id, which identifies a brand, is not part of the primary key. If the brand id is added to the primary key, as in Figure 12, it will be possible to retrieve data for a brand, but it will no longer be possible to retrieve data for publishers and Spotin, since the queries would require the brand id column. Publishers and Spotin want data from all brands, not just from one.


Figure 12: Table model with primary key for retrieving data for brands.

When creating a Cassandra database model, it is recommended to create one table for each query, since it is better to duplicate data into two tables than to query data across several partitions. In the on-demand model, the tables will therefore be duplicated into two tables with different primary keys, as seen in Figure 13, to be able to make the queries for the different user types.

Figure 13: Duplicated table model to make queries for all user types.

The tables in Figure 13 meet the query criteria but may not be a good model when it comes to distributing data over the nodes in the cluster. Since the tracking id is the partition key, all data from one publisher will be stored on the same node, which can create an uneven spread of data if the number of user interactions varies a lot between the publishers. The amount of data on one node will also grow continuously with time, which can become a problem later, since adding another node to the cluster will not help. Figure 14 illustrates the data distribution with the on-demand model.


4.1.3.2.2 Preprocessing Database Model

The preprocessing model will be scheduled to run once a day, after midnight, to retrieve the previous day's data from the database for all user types and create statistical reports that are saved in separate tables.

This model has the following query criteria:

1. it must be able to query tables on one date,

2. it must be able to query tables on tracking id for publishers, on brand id for brands, and to get data from all publishers for Spotin.

Since this model is not restricted by the date range operators there is no need to duplicate the tables as in 4.1.3.2.1. The tables can instead be modeled as in Figure 15.

Figure 15: Table model for retrieving data for all user types in the preprocessing model.
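Expressed as an Express Cassandra model, the key in Figure 15 could look roughly like the sketch below (the field names are assumptions, not the prototype's actual columns):

    // models/RawEventModel.js (illustrative name)
    module.exports = {
      fields: {
        tracking_id: 'text',    // publisher
        date: 'text',           // e.g. '2021-05-01': one partition per publisher and day
        time: 'timestamp',
        brand_id: 'text',
        event: 'text'
      },
      // Composite partition key of tracking_id and date, so one publisher's data
      // is spread over the cluster by day; time sorts rows within a partition.
      key: [['tracking_id', 'date'], 'time']
    };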

The table model in Figure 15 has a composite key of tracking id and date as partition key. This will create a better distribution of data across the cluster than the table models in Figure 13. Since this model also has the date in the partition key, data from one publisher can be spread over several nodes for different dates. It will not be a problem to add another node to the cluster when the system requires it, since data from one publisher is not restricted to only one node. This model also meets the query criteria, since only one date at a time will be queried. Figure 16 illustrates the data distribution with the preprocessing model.


Figure 16: Data distribution with the preprocessing model.

4.1.4 Data Processing

When the JavaScript tracker collects data from a publisher, the data is stored in the database as raw data. When a user wants to see the statistics for a given period of time, data must be retrieved from several tables and then compiled into a statistical report containing the metrics in Table 2. As mentioned in 4.1.3, two different versions of processing the data will be implemented so that the two can be evaluated against each other: preprocessing and on-demand processing. The implementation of the two versions is further described in this section.

4.1.4.1 On-demand Processing

In the on-demand processing version, the reports will be compiled from data for one day, a week, a month, or a year, based on what the user wants to see at any given time. Figure 17 shows an overview of the on-demand processing with the following steps:

1. The user wants to see its statistics report for a time range r. The analytics front end sends a GET request to the REST API with a date d within the time range r and a type t (day, week, month, year) as parameters.

2. The service in the REST API calculates the time range r based on d and t, and requests the data needed from the database.

3. The service receives the data from the database.

4. The service starts a worker that makes all the calculations needed on the data and compiles it into a statistic report.


5. When the report is compiled, the worker returns it to the service.

6. The REST API returns a response with the requested report and it is displayed on the front end dashboard.

Figure 17: Overview of the on-demand processing.

Since Node JS is single threaded, it is not ideal to perform CPU-intensive tasks on the main event loop, since such a task blocks all other events until it is finished; in this case a high number of incoming requests for collecting data would be blocked. That is why Worker Threads [47] will be used for the heavy calculations. The Worker Threads module enables threads that execute JavaScript in parallel without blocking the main event loop, since each worker runs in isolation from the others.
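A minimal sketch of how the report compilation could be moved to a worker thread is shown below. The file layout, the shape of the rows and the metrics in the report are assumptions made for the sketch.

    // report-worker.js: runs the CPU-intensive compilation in a separate thread
    // so the main event loop stays free to accept incoming collect requests.
    const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

    if (isMainThread) {
      // Called by the REST API service with the rows fetched from Cassandra.
      module.exports = function compileReport(rows) {
        return new Promise((resolve, reject) => {
          const worker = new Worker(__filename, { workerData: rows });
          worker.on('message', resolve);               // the finished report
          worker.on('error', reject);
          worker.on('exit', (code) => {
            if (code !== 0) reject(new Error('Worker stopped with exit code ' + code));
          });
        });
      };
    } else {
      // Worker thread: summarize the raw rows into a simple report object.
      const report = { pageViews: 0, sales: 0 };
      for (const row of workerData) {
        report.pageViews += row.page_views || 0;
        report.sales += row.sales || 0;
      }
      parentPort.postMessage(report);
    }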

4.1.4.2 Preprocessing

In the preprocessing version, the reports will also be compiled from data for a day, a week, a month, or a year. The calculations will be scheduled with the npm package node-cron [48] to run every day shortly after midnight, creating reports from yesterday's data for every user and saving them in separate database tables. When the user requests a report on the front end, the report will already have been compiled, which minimizes the steps needed to obtain it, as seen in Figure 18.


Figure 18: Overview of a user requesting a preprocessed report.
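The daily scheduling mentioned above could be set up with node-cron roughly as in the following sketch; the cron expression and the report-building function are assumptions.

    const cron = require('node-cron');

    // Run every day at 00:05 and build yesterday's reports for every user type.
    cron.schedule('5 0 * * *', async () => {
      const yesterday = new Date();
      yesterday.setDate(yesterday.getDate() - 1);
      await buildReportsForDate(yesterday);   // hypothetical function performing the steps in Figure 19
    });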

The scheduled cron job will start by retrieving all of yesterday's log data from several tables in the database. It will then create a day report for each user by compiling the log data and save it in a table for day reports. Next, it will request the user's week report from the database and either create a new week report from the day report, if none exists yet, or add the day report's data to the existing week report. The same procedure will be repeated for the month and year reports of every user. A flowchart of the procedure can be seen in Figure 19.
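The create-or-summarize step for the week report could be sketched as below; the table layout, column names and metrics are assumptions, and the same pattern would be repeated for month and year reports.

    // Either create a new week report from the day report, or add the day
    // report to an already existing week report. Assumes the cassandra-driver
    // client from the earlier sketches.
    async function upsertWeekReport(trackingId, week, dayReport) {
      const existing = await client.execute(
        'SELECT page_views, sales FROM week_reports WHERE tracking_id = ? AND week = ?',
        [trackingId, week],
        { prepare: true }
      );

      const report = existing.rowLength > 0
        ? {
            pageViews: existing.rows[0].page_views + dayReport.pageViews,
            sales: existing.rows[0].sales + dayReport.sales
          }
        : dayReport;                                 // first day of the week

      await client.execute(
        'INSERT INTO week_reports (tracking_id, week, page_views, sales) VALUES (?, ?, ?, ?)',
        [trackingId, week, report.pageViews, report.sales],
        { prepare: true }
      );
    }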


4.2 Front End

A front end for the prototype will be implemented to demonstrate how the REST API can be used to fetch the reports and how the statistical reports can be displayed on dashboards for the different user types. The front end will be built in a separate Docker container and will communicate with the back end through the REST API, as seen in Figure 20. The front end and back end will be loosely coupled, which makes it easier to further develop a separate front end that can become part of Spotin's platform in the future.

Figure 20: Front end and back end communicating through the REST API.

ReactJS [49], a JavaScript library for building user interfaces, will be used as the front end framework. It is chosen because of previous experience with it and because there are many ready-made components and libraries that can be used to create a dynamic single-page application, which makes development faster. The front end will consist of three dashboards: one for publishers, one for brands and one for Spotin. Each dashboard is further described in this section, and all figures displaying dashboards contain mock data for demonstration.

4.2.1 Publisher Dashboard

When entering the publisher dashboard, yesterday's statistical report is presented. As seen in Figure 21, the user can then choose another date in the calendar together with one of the following types: day, week, month or year. This fetches another report from the back end, based on the type and date, to be presented in the dashboard. Figure 21 also shows the totals for the publisher, such as page views, unique visitors and sales.
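For illustration, the fetch from the publisher dashboard could look roughly like the component below. The endpoint path, query parameters and response fields are assumptions, and the calendar component that would call setDate and setType is left out.

    import React, { useEffect, useState } from 'react';

    // Fetches a new report whenever the chosen date or type (day, week, month,
    // year) changes; the default is yesterday's day report.
    function PublisherDashboard({ trackingId }) {
      const [date, setDate] = useState(() => {
        const d = new Date();
        d.setDate(d.getDate() - 1);
        return d.toISOString().slice(0, 10);
      });
      const [type, setType] = useState('day');
      const [report, setReport] = useState(null);

      useEffect(() => {
        fetch(`/api/report/publisher/${trackingId}?date=${date}&type=${type}`)
          .then((res) => res.json())
          .then(setReport);
      }, [trackingId, date, type]);

      if (!report) return <p>Loading...</p>;
      return (
        <div>
          <p>Page views: {report.pageViews}</p>
          <p>Unique visitors: {report.uniqueVisitors}</p>
          <p>Sales: {report.sales}</p>
        </div>
      );
    }

    export default PublisherDashboard;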


Figure 21: Analytics dashboard for publisher showing the totals.

Further down on the dashboard, all sales from the publisher that were made through the Spotin shop are displayed, together with a table where the publisher can see which specific products sold the most from the shop. This can be seen in Figure 22.

Figure 22: Table on publisher dashboard displaying sales from the shop.

Another way of buying products with Spotin is from spots in images. Figure 23 shows the next table on the publisher dashboard, where data about sales from each spot is displayed.


Publishers can have articles containing images with spots, or image captions, from which customers can buy products. Figure 24 shows the last table on the publisher dashboard, displaying sales from articles; if the publisher clicks on the button “spots” or “image captions”, a table with more details about sales from the two options is displayed.

Figure 24: Table on publisher dashboard displaying sales from articles.

If the publisher chooses the type “week” in the calendar, page view and sales data for each day of the chosen week will be displayed in charts as in Figure 25.

Figure 25: Charts on publisher dashboard displaying page views and sales for each day in the selected week.

If the publisher chooses the type “month” in the calendar, page view and sales data for each day of that month will be displayed in charts as in Figure 26.


Figure 26: Charts on publisher dashboard displaying page views and sales for each day in the selected month.

If the publisher chooses the type “year” in the calendar, page view and sales data for each month of that year will be displayed in charts as in Figure 27.

Figure 27: Charts on publisher dashboard displaying page views and sales for each month in the selected year.


4.2.2 Brand Dashboard

The calendar on the brand dashboard has the same functionality as the one on the publisher dashboard described in 4.2.1. The brand dashboard displays the total sales and the total number of the brand's products sold, combined across all publishers. As seen in Figure 28, it also displays a chart with sales of the brand at each publisher, as well as the distribution of the sales in a pie chart.

Figure 28: Brand dashboard displaying totals.

Further down on the dashboard is a table with information about sales of each of the brand’s products. This is seen in Figure 29.

Figure 29: Brand dashboard displaying product sales from the brand.

To give the brand more information about its sales at each publisher, the table in Figure 30 is included in the dashboard. If the brand clicks on the button “spot”, “shop” or “articles”, more sales information about the three options is provided.


Figure 30: Brand dashboard displaying brand sales at each publisher.

4.2.3 Spotin Dashboard

The calendar on the Spotin dashboard has the same functionality as the one on the publisher dashboard described in 4.2.1. Figure 31 shows the top of the Spotin dashboard, which contains the total sales and the total number of products sold across all publishers. It also has a pie chart showing the distribution of unique visitors between publishers, as well as a chart with more details about unique visitors from each publisher.

Figure 31: Spotin dashboard displaying totals.

To give more detailed information about page views from each publisher, the table in Figure 32 is placed underneath the page view charts.


Figure 32: Spotin dashboard displaying page views from each publisher.

Figure 33 shows the sales of each brand, but this table is restricted to showing only the total number of products sold and the total sales of each brand, since Spotin is only interested in the totals and not in how much a specific product sold.

Figure 33: Spotin dashboard displaying sales of each brand.

The last table on Spotin's dashboard can be seen in Figure 34. It shows the total sales from each publisher. This table is also restricted to showing only the totals and not specific details about products.


5 Evaluation

The prototype implemented in section 4 has been tested and evaluated to be able to answer the problem definitions in section 1.4.

5.1 Back End

To evaluate the prototype back end, load testing of the API endpoints that collect data from several online sources was performed to find the bottleneck for how many requests the prototype can handle at a time. Latency tests of the on-demand model were also performed to evaluate whether the statistical reports can be created on demand. Every test, as well as the prototype, was run locally on a computer with the specifications described in Table 6.

Table 6: Specification of the computer running the tests.

Model                  UX430UAR
Type                   x64-based PC
CPU                    Intel® Core™ i5-8250U 1.60 GHz
Cores                  4
Logical processors     8

5.1.1 Load Testing of the REST API

The prototype has five API endpoints that can be used to collect data from several online sources; these are described in Table 7. The endpoints were load tested with npm loadtest [50], since it is a lightweight tool that does not require a lot of resources during tests.

Table 7: Description of the API endpoints that were load tested.

Endpoint                   Description
/api/collect/order         Collects data from a customer's order containing all the bought products in the order.
/api/collect/addItem       Collects data about a product that has been added to the shopping cart.
/api/collect/spotClick     Collects data about a spot that has been clicked.
/api/collect/pageview      Collects data about the page a user has visited.
/api/collect/event         Collects data about an event a user has performed, for example a button click.

To evaluate the performance of each endpoint under different workloads, the same load test was carried out on each endpoint. The parameters that were set on each test are described in Table 8. Every test had the same concurrency level as requests per second which means that the test simulated the requests as if they were sent by different clients. A specification of the parameters that was used in the tests are listed in Table 9.

Table 8: Description of the load test parameters.

Parameter                    Description
Concurrency level            The number of clients the test will simulate.
Requests per second (rps)    The number of requests per second that will be sent to the server by the clients.
Time (s)                     The number of seconds the test will run.

Table 9: Parameters used in performance tests of API endpoints.

Parameter                    Value
Concurrency level            400, 500, 600, 700, 800, 900
Requests per second (rps)    400, 500, 600, 700, 800, 900
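As an illustration, one of the runs in Table 9 could be started programmatically with the loadtest package roughly as below; the host, port, request body and test duration are assumptions made for the sketch, while the endpoint path is taken from Table 7.

    const loadtest = require('loadtest');

    // One run from Table 9: 400 concurrent clients sending a total of
    // 400 requests per second to the pageview endpoint.
    const options = {
      url: 'http://localhost:3000/api/collect/pageview',                        // assumed host and port
      method: 'POST',
      contentType: 'application/json',
      body: JSON.stringify({ trackingId: 'publisher-1', url: '/article/1' }),   // assumed payload
      concurrency: 400,
      requestsPerSecond: 400,
      maxSeconds: 60                                                            // assumed duration
    };

    loadtest.loadTest(options, (error, result) => {
      if (error) return console.error('Load test failed:', error);
      console.log('Total requests:', result.totalRequests);
      console.log('Mean latency (ms):', result.meanLatencyMs);
      console.log('Errors:', result.totalErrors);
    });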
