LiU-ITN-TEK-A--17/013--SE

Providing a scalable architecture to support low-latency ad-hoc funnel analysis on custom defined events for an A/B testing use case

Thesis work carried out in Computer Engineering
at the Institute of Technology, Linköping University

Pär Eriksson

Supervisor: Matthew Cooper
Examiner: Aida Nordman


Upphovsrätt

This document is held available on the Internet - or its possible future replacement - for a considerable time from the date of publication, provided that no extraordinary circumstances arise.

Access to the document implies permission for anyone to read, to download, to print out single copies for personal use and to use it unchanged for non-commercial research and for teaching. Transfer of the copyright at a later date cannot revoke this permission. All other use of the document requires the consent of the author. To guarantee authenticity, security and accessibility, there are solutions of a technical and administrative nature.

The moral rights of the author include the right to be mentioned as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or individuality.

For additional information about Linköping University Electronic Press, see the publisher's home page: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet - or its possible

replacement - for a considerable time from the date of publication barring

exceptional circumstances.

The online availability of the document implies a permanent permission for

anyone to read, to download, to print out single copies for your own use and to

use it unchanged for any non-commercial research and educational purpose.

Subsequent transfers of copyright cannot revoke this permission. All other uses

of the document are conditional on the consent of the copyright owner. The

publisher has taken technical and administrative measures to assure authenticity,

security and accessibility.

According to intellectual property law the author has the right to be

mentioned when his/her work is accessed as described above and to be protected

against infringement.

For additional information about the Linköping University Electronic Press

and its procedures for publication and for assurance of document integrity,

please refer to its WWW home page:

http://www.ep.liu.se/


Abstract

A/B testing combined with funnel analysis is a powerful technique for supporting data-driven decision making. This thesis outlines a scalable architecture that gathers custom defined events and applies funnel analysis to them, in order to gain valuable insights into user behaviour. These insights are discussed from an A/B testing point of view; however, they are just as valuable in scenarios outside A/B testing.

Custom defined events together with A/B testing is an interesting combination, since it provides opportunities to test different versions of an application against each other based on relevant metrics, with the goal of determining which of the tested application versions is best. The power to make smart data-driven decisions lies in good analysis of end-user data. Having pre-defined metrics, such as counts of some sort, is one way to do it; however, it reduces the flexibility for, e.g., application managers to "dig deeper" into what is actually happening. Funnel analysis is a technique to analyze sequences, and can be used to analyze user behaviour in a sequential manner.

There are different tools that provide such analysis; with Google Analytics, for example, users can pre-define the funnel steps that Google Analytics should register when events are logged. This thesis instead strives to require nothing to be pre-defined, while still making it possible, at high scale, to serve dynamic low-latency queries.

A proof of concept architecture is presented in this thesis that supports two problematic ends of a spectrum: storing custom events at scale, while at the same time being able to interactively run dynamic low-latency funnel analysis on those events.


Contents

1 Introduction
1.1 Motivation
1.2 Problem description
1.3 Method
1.4 Assumptions and limitations
1.5 Thesis Structure

2 Background
2.1 AppGrid
2.1.1 Current architecture
2.1.2 Use case and load prediction
2.2 A/B testing
2.2.1 Terminology
2.2.2 Implementation
2.3 Funnel analysis
2.4 Sequence mining
2.5 Database management system (DBMS)
2.5.1 RDBMS
2.5.2 NoSQL DBMS
2.5.3 CAP theorem
2.6 Cassandra
2.6.1 Architecture
2.6.2 Data model
2.7 Extract, transform and load (ETL)
2.8 Apache Spark
2.8.1 Resilient distributed dataset (RDD)
2.8.2 Spark on Cassandra
2.8.3 Related work
2.9 Spark job-server

3 Implementation
3.1 Choosing a suitable database system
3.1.1 Data model
3.2 Architecture
3.2.1 Data storage
3.2.2 Data analysis
3.3 Dynamic funnel analysis

4 A test case
4.1 Event consumer
4.2 Data storage
4.3 Data analysis in Apache Spark
4.3.1 Sequence mining
4.3.2 Ad-hoc querying

5 Result
5.1 Data simulation
5.2 Dynamic funnel analysis

6 Discussion
6.1 Architecture
6.1.1 Cassandra
6.1.2 Spark
6.1.3 Spark job-server
6.2 Dynamic funnel analysis
6.3 Conclusion

7 Future work


List of Figures

2.1 An overview of the general use-case of AppGrid.
2.2 An overview of the existing system (in blue components and solid arrows), and the implementation made by this thesis (in green components and dashed arrows).
2.3 An A/B test is run on the total traffic, which is split into two different variants A and B. These variants are assigned with two different views.
2.4 Token ownership for each node in the distributed Cassandra ring.
2.5 The main components of Cassandra.
2.6 An architecture overview of a running Spark application.
2.7 Placing Spark workers on the same nodes as the Cassandra nodes opens up possibilities for distributed computation.
2.8 Data processing in Hadoop MapReduce.
2.9 Data processing in Apache Spark.
2.10 High level architecture of Spark job-server. Within a query context, different Spark jobs can operate on shared data.
3.1 High level architecture of the responsibility of Spark job-server. The result of the job closest to the database (Cassandra) is cached as a shared RDD (marked as a star), accessible for other jobs (e.g. funnel analysis).
4.1 A visualization of the partitioning and acknowledgment "flow" for the data model defined in Listing 1.
5.1 A generic interface of the simulator to create dynamic groups of events (e.g. user form or movie events). Within each group it is possible to dynamically define custom events to be used in the simulation.
5.2 All custom events that have been logged combined with their total count. This is shown when the user has pressed the button below section "Get logged events as keys".
5.3 When funnel keys are selected, a drag-and-drop funnel sequence is presented to the right that builds up the funnel query.
5.4 Interactive sunburst visualization shows a less common sequence. All sequences presented support the dynamic funnel query but have been completed in different ways.
5.5 Performance for different Spark jobs; once the initial transformed RDD is cached (first row), it is then reused for the funnel analysis job (third row).
6.1 Placing Spark workers on the same nodes as the Cassandra nodes opens up possibilities for distributed computation.
6.2 Trade-offs between truly static and dynamic queries [1].
6.3 The events that have been inserted into the database before the initial transformation is run (yellow area) will be used to serve low-latency queries.


Chapter 1

Introduction

1.1 Motivation

Design and functionality decisions within software development can be made in different ways. Sometimes they are discussed between the developers and potentially the product owner1, who then decide how to proceed. Other times an empirical study might be carried out to gain more insight and to see how a narrow test group performs when using the application. The study is then analyzed and decisions are made based on the outcome. However, the problem with this is that the people making the decision, or the limited group that is tested, probably do not represent the "big mass" that is your users. The decision that was made may even worsen the user experience of the application. Just imagine the extreme case where the mean height of men and women is to be determined, only one man and one woman are measured, and their heights are then taken to represent the mean heights of women and men in a specific region. There is a significant risk that this differs from reality. A similar problem can occur when designing an application: it is not guaranteed that the "big mass" interprets the application in the same way as the test group did. Thus, it is hard to make wise and profitable decisions. Luckily, there are more powerful methods for making better-grounded decisions. A/B testing is a method where at least two different variants of an application are tested against the end users. In the simplest case, one Control Group (A), which is usually the current version of the application, is tested together with one Treatment Group (B). These two groups are randomly distributed to all users according to a defined probability; data is then gathered and the results are observed and compared (see more in section 2.2). With A/B testing, data-driven decisions can be made, and these will be much more connected to the "big mass", since all users have contributed to the observed behaviour, e.g. an increased sales ratio or more frequently watched movies.

1The product owner is responsible for specifying what should be done for a product; this role is often used in agile methodologies like Scrum.

A/B testing is a truly effective method for making smart data-driven decisions. In [2], a successful A/B testing story is described: the test was run for Barack Obama's 2008 campaign. These A/B tests included small visual changes to the campaign page, which improved the sign-up rate by 40%. That later translated into an impressive additional donation of $60 million. Even when an A/B test report has been finalized, it may not be sufficient to answer the question "why does the result differ?". One case might be that a landing page is tested, and the result clearly shows different conversion rates3 of signing up. You might find it valuable to know why that came to be, and to be able to answer the question above: "why does the result differ?". It can be especially hard to point out which component influenced the result the most if the Treatment Group (the new design) includes multiple changes. If only one change has been made, such as a button's color, you know that it must be the only thing that affected the end result.

With sequential analysis like funnel analysis, users' behaviour can be analyzed. These techniques can, for example, measure certain drop-offs within a sequence of events. One example could be an application's registration page that includes multiple changes, where the new design yields 20% fewer registrations. By using funnel analysis, one would be able to see where users start to drop off. Let's say that in this example the sign-up form was changed in the new design, and the password restriction is not shown in a proper way, which confuses the users. Funnel analysis can help to point out valuable information about where users start to leave the registration form, e.g. when trying to find a password that is valid. This clearly tells you that something is misleading about the new way of presenting the password restriction, and that it is not the potential new color of the sign-up button that is to blame.

This thesis will investigate how gathering custom defined events can help to make wiser decisions. Much effort will be spent on finding a scalable system architecture to make this possible, since usage can rapidly increase. The implementation use case will be based on custom events, to show how providing custom events in Accedo's AppGrid would help customers to make wiser, data-driven decisions. The implementation will, however, not be tightly coupled to AppGrid's solution. It will work as a proof of concept (POC) solution to present the value of analyzing custom events.

3A conversion rate is a measured representation of how many users take a specific action.


1.2 Problem description

The central question addressed in this thesis is how to gather custom events and apply funnel analysis to them, in order to improve online services. A system that addresses this problem, and works as a proof of concept (POC), is described. The system should use existing open source tools and be able to handle high-load usage. A/B testing has been a central mindset in this thesis, but it is not strictly required to be implemented, since the funnel analysis of custom events must work outside A/B tests as well.

A Graphical User Interface (GUI) should be developed to start a sequence simulation of custom defined events. It should also be possible to specify different percentages that affect the probability of fulfilling the defined sequence, for two variants (in an A/B test). The system should provide low-latency ad-hoc funnel queries, and a basic visualization to show the funnel analysis applied to the simulated data. The GUI is not really a specified problem for the thesis per se, but it helps to visualize and show the result, which is how to gather custom events and apply funnel analysis to them in order to improve online services.

To accomplish this, the following problems will first need to be solved:

• Problem 1: how to gather, store, and apply funnel analysis to the user data (custom events).

• Problem 2: to define a test case.

• Problem 3: to define a system that supports low-latency ad-hoc queries.

Problem 1 - This includes the challenges of which techniques and tools to choose to gather, store, aggregate and analyze the data at scale. This is a more theoretical problem. Multiple tools will most likely need to be considered and compared to find a suitable solution to the idea of custom events and system scalability.

Problem 2 - Specifying an implemented test case. Important components are the data model and how to analyze the funnels.

Problem 3 - The full architecture to support low-latency dynamic ad-hoc querying. The system should support dynamic queries of funnel sequences (funnel queries), and it should answer how many users completed the specified sequence.

1.3 Method

This section describes which tools and techniques have been used for each of the problems described in section 1.2.


• Problem 1: To meet the system's high required writes/second throughput, Cassandra has been used as the data storage. The applied analysis has been done with Apache Spark.

• Problem 2: Funnel analysis has been applied with the help of regular expressions mapped onto each user's sequence of events (a minimal sketch of this idea is given after the list). This logic has been implemented with Apache Spark. The possibility of using sequence mining as the funnel analysis process has been investigated.

• Problem 3: A web interface has been designed to easily define funnel queries and run low-latency funnel analysis with dynamic input. Spark job-server has been used to handle long-living Spark contexts, which are required for low-latency jobs.
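To illustrate the regular-expression idea mentioned for Problem 2, the sketch below (plain Scala, without Spark) encodes each user's time-ordered events as a delimited string and checks whether a funnel of hypothetical steps view, register, purchase occurs in that order, allowing other events in between. The event names, the delimiter and the encoding are assumptions made for this example only, not the exact implementation used in the thesis.

```scala
import java.util.regex.Pattern

object FunnelRegexSketch {
  // Encode a user's time-ordered events as one delimited string.
  // Assumes event names do not contain the "|" delimiter.
  def encode(events: Seq[String]): String = events.map(e => s"|$e|").mkString("")

  // Build a regex that requires the funnel steps to appear in order,
  // while allowing arbitrary other events between the steps.
  def funnelPattern(steps: Seq[String]): Pattern =
    Pattern.compile(steps.map(s => "\\|" + Pattern.quote(s) + "\\|").mkString(".*", ".*", ".*"))

  def main(args: Array[String]): Unit = {
    val userEvents = Map(
      "user-1" -> Seq("view", "register", "help", "purchase"),
      "user-2" -> Seq("view", "help")
    )
    val funnel = funnelPattern(Seq("view", "register", "purchase"))
    val completed = userEvents.count { case (_, ev) => funnel.matcher(encode(ev)).matches() }
    println(s"$completed of ${userEvents.size} users completed the funnel") // 1 of 2
  }
}
```

In the thesis architecture the same matching is applied per user sequence inside Spark, which is what makes the funnel query dynamic: the steps are supplied at query time instead of being pre-defined.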

1.4 Assumptions and limitations

Since there does not exist any "real-world" data useful for my research within the company, I will need to simulate data for my experiments. The simulation should produce different behaviour between A/B test groups (variants), to more easily show the different outcomes, compared to what purely random data would do.

Regarding A/B testing, this thesis will not investigate two of the main components discussed in section 2.2.2, Randomization and Assignment. Both components are necessary to provide A/B testing as a service, where different profiles (variants) need to be delivered to end-users in a randomized manner. This thesis will instead focus on the third point, the Data path. The solution provides dynamic (non-static) ad-hoc querying on raw data with low latency, instead of static pre-aggregated data. The association between end-users and A/B tests might be important (notice might), since the solution in this thesis should also work outside A/B testing scenarios. Funnel analysis will be used as a use case that hopefully can serve as a metric to determine the winning profile in an A/B test, or to present global user behaviour in an application that is not running A/B tests. This thesis will not include any performance tests between all compared tools; it will instead rely on previous up-to-date comparisons and relate to relevant articles. Clearly, it is not reliable to do measurements or performance tests for distributed systems locally on a single machine. Obviously, the overall performance would need to be thoroughly tested before running the system in production. This aspect will be considered as future work.


1.5 Thesis Structure

Chapter 2 - Background: This chapter will explain each concept and tool included in this thesis.

AppGrid, section 2.1, will explain what AppGrid is, how it has influenced this thesis and how the solution provided by this thesis fits in.

A/B testing, section 2.2, will describe the terminology behind A/B testing, and what is architecturally required to provide A/B testing as a service.

Funnel analysis, section 2.3, will explain the concept of funnel analysis and provide some examples of what can be expected from this kind of analysis.

Sequence mining, section 2.4, will present related sequence mining techniques.

Database management system (DBMS), section 2.5, will explain the responsibilities of a database management system (DBMS), and mention different types of DBMS and why different DBMS options exist.

Cassandra, section 2.6, will briefly describe Cassandra and its core concepts.

Extract, transform and load (ETL), section 2.7, will explain the principles of extract, transform and load (ETL). It will present the challenges of traditional ETL processes and how they have been improved by more recent tools and techniques.

Apache Spark, section 2.8, will briefly describe Apache Spark and one of its most important features: Resilient Distributed Datasets (RDD). It will also explain how Spark is applied to Cassandra, and how Spark relates to Hadoop MapReduce.

Spark job-server, section 2.9, will describe what Spark job-server is and why it is useful, especially for low-latency queries.

Chapter 3 - Implementation: This chapter will go deeper into implementation challenges and how they were solved, and also give a motivation for the tools and techniques used in this work.

Choosing a suitable database system, section 3.1, will describe why Cassandra was chosen based on the presented requirements.


Architecture, section 3.2, will describe the architecture of the POC system in this thesis. It will present how the topics data storage and data analysis have been developed.

Dynamic funnel analysis, section 3.3, will describe how dynamic funnel analysis has been accomplished. It will present the limitations of using pre-defined mining techniques for dynamic use cases.

Chapter 4 - A Test Case: This chapter will describe the proof of concept implementation (the test case) in this thesis. It will clarify topics such as the chosen data model and the regular expression based funnel analysis.

Event consumer, section 4.1, will describe the event consumer's responsibility in this system.

Data storage, section 4.2, will describe the data model used for custom events, together with replication and tunable consistency principles.

Data analysis in Apache Spark, section 4.3, will present the analysis process and the transformations required to run the regular expression based funnel analysis.

Chapter 5 - Result: This chapter will provide figures of the POC system's web interface.

Data simulation, section 5.1, will describe and show how the simulation of custom events has been implemented.

Funnel analysis, section 5.2, will show how the interactive funnel queries are created and run. The performance of the funnel analysis will be presented, to show the benefit of using Spark job-server.

Chapter 6 - Discussion: This chapter will discuss the thesis’s topics and the decisions taken. A thorough discussion is done on Cassandra, Spark, Spark job-server and the dynamic funnel analysis.

Architecture, section 6.1, will describe the architectural decisions and impacts of Cassandra, Spark and Spark job-server.

Dynamic funnel analysis, section 6.2, will express the value of the implemented regular expression based funnel analysis, together with its limitations.


Conclusion, section 6.3, will conclude the overall picture of this thesis. It will point out the problems solved by this implementation and the drawbacks introduced by the proposed solutions.

Chapter 7 - Future work: This chapter will list important tasks that were left out of the scope of this thesis and are therefore subject to future work.


Chapter 2

Background

This chapter will describe concepts such as AppGrid, A/B testing, funnel analysis, sequence mining, database management systems, Cassandra, extract, transform and load (ETL), Spark and Spark job-server. The section about Spark will provide a comparison between Spark and Hadoop MapReduce that describes how Spark differs and why it was needed.

2.1 AppGrid

AppGrid1 is a product that can be seen as a multi-platform Content Management System (CMS). AppGrid is used to control content for multi-platform applications, e.g. web, mobile or SmartTV applications. Their solution can target users and serve their applications with content defined and configured within their web tool. Figure 2.1 shows the general concept of AppGrid. The left-hand side shows the use case for application developers or content providers: they can define metadata and assets that should be delivered to specific platforms and applications. On the right-hand side, end-users create sessions to fetch the metadata that they have been specifically assigned to get.

This solution makes it possible to dynamically change settings that affect e.g. the UI of an application without the need to republish a newer version to each platform's specific application store (e.g. Google Play, or Apple's App Store). Thus, updates made through AppGrid can avoid the re-validation process, if such a process is required by the application store. For instance, in Apple's App Store there is a review phase for each update made by each application. By using AppGrid, these kinds of bottlenecks can be avoided for fast, reactive changes that need to be propagated to all end-users.

1AppGrid - https://www.accedo.tv/appgrid/


Figure 2.1: An overview of the general use-case of AppGrid.

2.1.1 Current architecture

Figure 2.2: An overview of the existing system (in blue components and solid arrows), and the implementation made by this thesis (in green components and dashed arrows).

The current AppGrid architecture for the data analysis pipeline uses RabbitMQ3 to handle all messages between the REST API servers and the consumers of the queues, see Figure 2.2. The benefit of having a message-oriented middleware such as RabbitMQ is that it can be extended with more use cases (queues) without the need to integrate all logic into the same service. For example, the services sending/producing the events do not need to know which database is used.

3RabbitMQ is a messaging service that decouples applications and lets them send messages asynchronously, https://www.rabbitmq.com/

If a new feature, such as a custom event, is added, a new queue and a consumer can be implemented next to the current services. This abstraction of responsibilities gives a much more scalable architecture. The API servers can simply put a message into the queue, which will be processed at a later time (e.g. stored into a database). The API requests can therefore be acknowledged to the client much faster, and the API servers can maintain a higher throughput. By having a dedicated consumer on the consuming end of the queue, it is easier to add new features, such as custom events, into the current system. The alternative would be to add all logic into one big monolithic application, which would limit scalability and maintainability.
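A minimal sketch of such a dedicated consumer is given below, assuming the RabbitMQ Java client, a queue named custom-events and a placeholder where the database write would go; none of these specifics are taken from the thesis implementation.

```scala
import com.rabbitmq.client.{AMQP, ConnectionFactory, DefaultConsumer, Envelope}

object CustomEventConsumerSketch {
  def main(args: Array[String]): Unit = {
    val factory = new ConnectionFactory()
    factory.setHost("localhost")                     // assumed broker location
    val connection = factory.newConnection()
    val channel = connection.createChannel()

    // Durable queue for incoming custom events (queue name is an assumption).
    channel.queueDeclare("custom-events", true, false, false, null)

    channel.basicConsume("custom-events", false, new DefaultConsumer(channel) {
      override def handleDelivery(consumerTag: String, envelope: Envelope,
                                  properties: AMQP.BasicProperties,
                                  body: Array[Byte]): Unit = {
        val event = new String(body, "UTF-8")
        // Here the event would be written to the database (omitted in this sketch).
        println(s"consumed event: $event")
        // Acknowledge only after the event has been handled.
        channel.basicAck(envelope.getDeliveryTag, false)
      }
    })
  }
}
```

Because the consumer only knows about the message exchange, additional API servers or new event types can be added without changing the producers.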

2.1.2 Use case and load prediction

A concrete number of incoming user events does not exist at this moment; there can be an arbitrary number of incoming custom events per second. A prediction of the total number of logged user events cannot be made at this stage. The total number of logged events per second can be expressed as equation 2.1, where all variables are undefined and also dynamic (not fixed). The number of incoming messages may therefore rapidly increase. This makes it hard to predict the minimum requirements for a system, and the ability to scale accordingly becomes really important.

$$\sum_{m=1}^{M} N_m \cdot U_m \qquad (2.1)$$

where m denotes a consumer of the service (not to be confused with a consumer in RabbitMQ), N_m is the mean number of logged events per second per end-user for service consumer m, and U_m is the number of treated end-users in that consumer's application.

2.2 A/B testing

A/B testing is a data-driven experiment to analyze the result of different functionality, design and behaviour in an application. Most often it is used to analyze different graphical changes, but it may also include different back end systems. For example, if different video streams for a video platform are tested, an A/B test should be able to analyze whether they have any impact on the customers' satisfaction, e.g. in the sense of more shows watched.


Figure 2.3: An A/B test is run on the total traffic, which is split into two different variants A and B. These variants are assigned with two different views.

The difference between A/B testing and sequential testing is that with A/B testing multiple tests (treatment groups) can be run simultaneously, see Figure 2.3, whereas with sequential testing only one test can run at a time. With sequential testing, the scenario described in Figure 2.3 would need to be run in two iterations, compared to one iteration with A/B testing.

2.2.1 Terminology

In the simplest case, one A/B test randomly distributes end-users into two variants4 (additional variants can be used): a Control group5 (A) and a Treatment group6 (B). One important aspect is that the distribution must be truly random to provide statistically valid results for the Overall Evaluation Criterion (OEC). In statistics, the OEC can be seen as the dependent variable or response; [3] provides a thorough explanation of the OEC. The result of the A/B test must not have any influence on how users tend to be distributed into a specific variant's group, e.g. based on a specific age range, or hashed account information like their email [4].

4A variant is a common name for a variation (version) in an A/B test, e.g. variant A or variant B.
5A control group is the currently used version of e.g. an app.
6A treatment group is tested against the control group. The treatment group usually contains the new version that is being tested.

2.2.2 Implementation

In [4], the suggested implementation architecture is defined as three block components:

1. Randomization, a function that maps all users to a certain variant group.

2. Assignment, uses the result from the randomization to decide which variation to send to the user (assigns the variant).

3. Data path, collects raw observation data, analyzes it and provides the reports.

Randomization

As mentioned above, the randomization into variants (e.g. variant A and B in Figure 2.3) is an important aspect and cannot be based on any information regarding the user. The randomization algorithm must strictly satisfy the following properties (a sketch of one common approach follows the list):

• Users in the A/B test must have the same probability, e.g. in a 50-50 distribution, of being assigned any of the possible variants.

• The repeated assignment when a user connects must be consistent: users cannot get different variants on subsequent visits.

• If multiple controlled tests are run, there cannot be any correlation between a user's assigned variant in one experiment and its assignment in other experiments.
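One common way to satisfy these properties is to hash the combination of a user identifier and an experiment identifier and map the hash onto the variant buckets; the sketch below illustrates this under assumptions of its own (the identifiers and the 50-50 split are made up, and the source does not state that this exact scheme was used). The same user then always lands in the same variant for a given experiment, while different experiments hash independently.

```scala
import java.security.MessageDigest

object VariantAssignmentSketch {
  // Deterministically map (userId, experimentId) to one of the variants.
  def assign(userId: String, experimentId: String, variants: Seq[String]): String = {
    val digest = MessageDigest.getInstance("MD5")
      .digest(s"$experimentId:$userId".getBytes("UTF-8"))
    // Fold the first four bytes of the hash into a non-negative integer bucket.
    val hash = digest.take(4).foldLeft(0)((acc, b) => (acc << 8) | (b & 0xff))
    variants((hash & 0x7fffffff) % variants.size)
  }

  def main(args: Array[String]): Unit = {
    val variants = Seq("A", "B")   // assumed 50-50 split between two variants
    val first = assign("user-42", "signup-form-test", variants)
    val second = assign("user-42", "signup-form-test", variants)
    // The same user always gets the same variant for a given experiment,
    // while a different experiment id produces an independent assignment.
    println(s"assignment consistent: ${first == second}, variant = $first")
  }
}
```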

Assignment

Variation groups’ (variants) behavior and visual appearance needs to be assigned to each user that connects to e.g. the website. This requires some server backend to provide these truly different components. [4] brings up different methods, such as:

• Traffic splitting
• Page rewriting
• Client-side assignment
• Server-side assignment

Each of the methods listed above has limitations and benefits. The work reported in [4] presents a comparison of these methods.

Data path

To be able to measure the difference in conversions between the variants, the data first needs to be collected. Depending on the use case of the reports, data can be stored in different formats, but one important aspect is to store the data necessary to separate end users participating in different variants from each other. [4] discusses different techniques to gather the data.


2.3 Funnel analysis

Funnel analysis can be expressed as a sequential mining process of time-based events that often lead towards a certain goal. In an application, the outcome of this process can show different drop-off percentages/conversions of users' presence within a sequence of events/interactions.

Funnel analysis can help to answer questions such as the following:

• How many of the users that landed on the homepage registered and purchased something?

• How many of the users that got the gift certificate bought something, and then continued purchasing more items?

• How many of the users that watched an episode of a TV series ended up watching the full season?

Combining A/B testing (section 2.2) with funnel analysis can show very interesting results about the usage in different variant profiles. Funnel sequences are often used to express valuable behaviour, like the examples above. Using funnel analysis within A/B testing helps to point out which variation of, e.g., the design generates the best end-user behaviour.

2.4 Sequence mining

This section describes relevant sequence mining techniques for funnel analysis.

One very common frequent itemset mining method is the Apriori algorithm [5]. Frequent itemsets are found for a defined minimum support, which is analyzed in the pruning process using the Apriori principle: if an itemset is not frequent, then none of its supersets are frequent. The Apriori principle is used to optimize the process of finding sequences.

FP-growth has been shown to be superior for bigger datasets, since Apriori requires multiple scans of the database [6]. In [7] the FP-growth algorithm and how it performs relative to the Apriori algorithm are presented.

In [8] the FP-growth algorithm is further improved so that the FP-growth process can be scaled out onto distributed machines (in parallel). This solves problems when analysing datasets that are too big to fit in memory on a single machine. In addition, it inherits the methodology that MapReduce offers and improves the runtime significantly.

The community behind Apache Spark has implemented FP-growth (since version 1.3.0) in their machine learning library MLlib.
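As an illustration of that API, the sketch below runs the parallel FP-growth implementation in MLlib on a few made-up transactions; the event names, the minimum support value and the local Spark context are assumptions for the example only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fpgrowth-sketch").setMaster("local[*]"))

    // Each transaction is the set of events logged by one user (order is not preserved).
    val transactions = sc.parallelize(Seq(
      Array("view", "register", "purchase"),
      Array("view", "help"),
      Array("view", "register")
    ))

    val model = new FPGrowth()
      .setMinSupport(0.5)      // keep itemsets present in at least half of the transactions
      .setNumPartitions(2)
      .run(transactions)

    model.freqItemsets.collect().foreach { itemset =>
      println(s"${itemset.items.mkString("[", ",", "]")} appears ${itemset.freq} times")
    }
    sc.stop()
  }
}
```

The frequent itemsets returned here carry no ordering information, which connects to the limitation discussed next.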

All FP-growth based mining techniques are limited in the sense that they lose the sequence order of the registered items. Each transaction (container of the items) sorts the elements/items in descending order by the overall support of the items in the dataset [7]. This makes it unusable for representing clients' access patterns, which can be seen as the funnel of events. The result of FP-growth can still be useful to represent, e.g., a group of commonly visited links.

A technique that keeps the order of occurrences was developed by [9]. The Web Access Pattern tree (WAP-tree) contains information about common web access patterns, which are sequential patterns. The algorithm groups associated sequences by a unique access point (often a unique user identifier like a user ID or IP address), much like the process in FP-growth. One major difference between them, however, is that with the WAP-tree the sequence order is maintained [10].

2.5 Database management system (DBMS)

A Database Management System (DBMS) is an interface to handle requests against a database. According to [11], a DBMS comprises three important blocks: the data, the data engine which accesses the data, and the database schema which defines the structure of the database. The DBMS handles the logical functionality of the stored data and the communication with all clients/applications that make requests to the database. Usual requests can be assumed to be Create, Read, Update and Delete (CRUD) commands. One benefit of a DBMS is that incoming requests do not need to know the actual location of the data.

There are different types of DBMS interfaces such as Relational DBMS (RDBMS) and NoSQL DBMS [11].

2.5.1 RDBMS

Relational DBMSs have been used for many years [12]. This approach keeps data consistent and is very powerful for online transaction processing (OLTP) [13]. A typical RDBMS guarantees four properties for each treated transaction: atomicity, consistency, isolation and durability, which together form the expression ACID. The ACID properties [14] can be explained as follows.


• Atomicity - If a set of operations should be performed, either all of them are completed, or none. This can be seen as an "all or nothing" rule.

• Consistency - Guarantees that only data that is validated will be written to the database. If a transaction fails the validation on some field, the whole request will be rolled back.

• Isolation - If multiple transactions occur simultaneously they do not affect each other.

• Durability - Guarantees that transactions which are successfully committed to the database cannot be lost.

2.5.2 NoSQL DBMS

NoSQL, or "Not only SQL" as it stands for, is a newcomer relative to most existing relational databases. NoSQL databases were designed to manage the scalability and performance issues that occur when trying to horizontally scale traditional RDBMSs [15].

[16] brings up the problems of using traditional RDBMSs for increasingly complex data. With full ACID support (section 2.5.1), the difficulty of supporting high availability increased, especially when outgrowing single data servers and requiring the data to be distributed to multiple servers. The need for multiple data centers was a given for web services with many users. Keeping a system highly available requires that the system is fault tolerant and can handle requests even if one data storage node goes down. The system must balance read requests onto multiple servers to manage access peaks that would be too large for a single server. The replication in RDBMSs favours consistency over availability, and to make this possible in an RDBMS, additional tuning and expertise is essential. This pushed companies and organizations to develop their own database systems to fulfill their specific needs [16]. Logically, different needs resulted in different solutions, and it is therefore problematic to select one NoSQL database that is best for a specific use case. One major difference between RDBMS and NoSQL is the data model; NoSQL databases can be categorized into four groups according to their data model [16]:

• Key Value Stores
• Document Stores
• Column Family Stores
• Graph Databases

Even though the data models differ in some aspects, the essence of these DBMSs is that they are easily distributed to multiple nodes to manage high scalability, although with different approaches. This concept is also known as horizontal scaling. With multiple (distributed) nodes, the system can remain available even though one node loses its connection or dies (assuming that replication is used); in contrast, there is no way to make a single-node system highly available. However, there are problems with providing both high availability and guaranteed consistency in a distributed system [12]; this will be further discussed in section 2.5.3.

2.5.3 CAP theorem

Distributed systems aim to provide three desirable properties: consistency (C), availability (A) and partition tolerance (P). The CAP theorem was formally introduced by Eric Brewer in 2000 [17]. The theorem states that only two of the three properties can be provided by any distributed system [17]. Seth Gilbert and Nancy Lynch later proved this conjecture in their article [18].

The properties that make up the CAP theorem can be explained as follows [13, 18]:

• Consistency - requests must be consistent for the entire distributed system, i.e. respond with the same output for the same request across all replicas. A request to the system should act as if it were executed on a single-machine system; it should not be possible to get inconsistent/stale data.

• Availability - every request to the distributed system must return a response.

• Partition tolerance - the system can still operate even though arbitrary messages are lost between the nodes in the network. In distributed systems that are consistent, atomicity should be preserved even though messages are lost between the nodes. For distributed systems that are highly available, the node that is requested by the client must return a response, even though messages are lost between the nodes.

These properties must be considered when designing a system and choosing a suitable DBMS. Usually, partition tolerance is always chosen because it is a fundamental requirement for distributed systems [12]. It then becomes a question of which property to "sacrifice": consistency or availability.

2.6 Cassandra

Cassandra [19], like other modern databases in the NoSQL family, is designed to provide high speed [20][21]. Cassandra's initial focus was to provide blazing fast writes and linear scalability (i.e. doubling the node count doubles the execution speed). [13] categorizes Cassandra as a "distributed, decentralized, fault tolerant, eventually consistent, linearly scalable, and a column-oriented data store". These properties capture the core features that Cassandra brings to the table. No master exists within a cluster; all nodes instead communicate equally with each other, and there is practically no single point of failure (if designed properly). Cassandra inherits its robust distributed technologies from Amazon Dynamo [22] and its data model from Google BigTable [23].

2.6.1 Architecture

The distributed nodes in Cassandra work together in a master-less cluster. The cluster is represented as a ring of nodes, each containing a subset of all data within the cluster. The idea is that each node is responsible for a certain token range8, see Figure 2.4. This token range may later change when nodes are added or removed, or when trying to even out an unbalanced cluster. Data in Cassandra is stored in a columnar format (inherited from Google BigTable), associated with a partition key (PK). The idea of spreading out data to balance the load of the cluster (ring) is based on the PK. Cassandra has a partitioner that calculates a token through a hash function to decide which node should be provided with the data; this is further explained in [13].

8The token range represents a value range of hashed partition keys that a node is responsible for.


Figure 2.4: Token ownership for each node in the distributed Cassandra ring.

Cassandra's functional components provide a simple abstraction to the user. These components handle functionality such as internal communication, requests and replication. The main components can be seen in Figure 2.5. The storage layer's main responsibility is to handle all requests from the clients, e.g. to direct a request to the responsible node that holds the relevant data. In other words, the Cassandra DBMS works like a proxy for the underlying components. [13] presents a complete explanation of Cassandra's underlying components, seen in Figure 2.5.


Figure 2.5: The main components of Cassandra.

2.6.2 Data model

Cassandra inherits its data model from Google's BigTable. The wide-row structure suits the use case of time series data well. Cassandra has a limitation of about two billion columns for each row (limited by Java's maximum integer value) [13], which requires the data model to be well defined. For systems where this limit could be reached, one solution is to partition the values based on some date association, e.g. dates or week numbers, to limit the number of columns on each row.
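A minimal sketch of that bucketing idea follows, assuming a key layout invented for illustration (it is not the data model used later in the thesis): a partition key combines a logical row identifier with a week bucket derived from the event timestamp, so that no single partition grows without bound.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.temporal.IsoFields

object PartitionBucketSketch {
  // Derive a "year-week" bucket from an event timestamp, e.g. "2017-W06".
  def weekBucket(timestamp: Instant): String = {
    val date = timestamp.atZone(ZoneOffset.UTC).toLocalDate
    f"${date.get(IsoFields.WEEK_BASED_YEAR)}%04d-W${date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)}%02d"
  }

  // Combine a logical key (e.g. an A/B test round) with the week bucket,
  // so each wide row only holds one week of events.
  def partitionKey(testRound: String, timestamp: Instant): String =
    s"$testRound:${weekBucket(timestamp)}"

  def main(args: Array[String]): Unit =
    println(partitionKey("signup-test-round-1", Instant.parse("2017-02-06T10:15:30Z")))
}
```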

2.7 Extract, transform and load (ETL)

The traditional ETL methodology has been used to extract business value from raw data sources. The ETL "pipeline" consists of three different steps:

• Extract - extract data from multiple sources to a single processor.

• Transform - transform the extracted data into an aggregated/specified format; the transformation is pre-designed in the ETL process.

• Load - load the transformed dataset into a different database, usually called an Enterprise Data Warehouse (EDW). This most likely imposes a high load on the network.

The traditional ETL process does not perform well with large data sets: it needs to extract data from one system, transform it on hardware with limited resources, and then load it into a data warehouse over the network. At the EDW, the business intelligence queries are run on the pre-aggregated data. Downsides of these traditional ETL processes are:

• The ETL process is a bottleneck that takes time; what is "new" data in the EDW may be too old to be considered up to date, since the ETL process can take days, in some extreme cases weeks, when transforming really big datasets.

• If the analysts using the EDW want other features out of the data (the analytical needs change), an ETL designer will need to add that functionality to the ETL process. The queries to the EDW can feel limited.

• It is complex and expensive to maintain the performance if the data that is being transformed grows.

This has pushed the communities towards other solutions. With distributed file systems such as the Hadoop Distributed File System (HDFS) [24] and Cassandra, massively parallel processing techniques that support distributed ETL processing are possible [25]. With distributed ETL, the ETL process is brought to the data location and can process the data on the same network. This has significantly reduced the latency and increased the performance of the overall process.

In [26], Phil Shelley and James Markarian both mention that traditional ETL is changing since it does not meet the requirements of today. Phil Shelley says:

“The growth of ETL has been alarming, as data volumes escalate year after year. Companies have significant investment in people, skills, software and hardware to do nothing but ETL. Some consider ETL to be a bottleneck in IT operations: ETL takes time as, by definition, data has to be moved. Reading from one system, copying over a network and writing all take time – ever growing blocks of time, causing latency in the data before it can be used.” [26].

[27] brings up the challenges of ETL processing with Big Data. They say: “Because the majority of available tools were born in a world of “single server” processing, they cannot scale to the enormous and unpredictable volumes of incoming data we are experiencing today. We need to adopt frameworks that can natively scale across a large number of machines and elastically scale up and down based on processing requirements.” [27].


2.8 Apache Spark

This section will describe the usage of Apache Spark as a computing engine, and its connection to general big data extract, transform and load (ETL). Apache Spark [28], like other MapReduce-based technologies, e.g. Hadoop MapReduce [29], provides massively parallel data processing on distributed storage systems such as the Hadoop Distributed File System (HDFS) [24] or Cassandra [19]. With distributed workers placed near the data's location, the big overhead of transferring data over networks can be avoided. Spark uses an abstraction called Resilient Distributed Dataset (RDD) to represent immutable (read-only) collections that are distributed over a set of machines (where the Spark workers are located).

2.8.1 Resilient distributed dataset (RDD)

Spark is a general computing engine for large-scale data processing. It abstracts the physical data into Resilient Distributed Datasets (RDDs), which are read-only data collections. With these RDDs, applications can distribute the data processing in an effective, fault tolerant way. Spark's methodology is much like that of recent MapReduce tools [30] (a similar programming model). However, other implementations of MapReduce [31] lack effective data sharing, which has been addressed by [28]. Such sharing exists within the RDD abstraction, which makes it both efficient and reliable. The RDD is a core feature of Spark; this abstraction supports in-memory computation on large clusters. With the ability to do distributed in-memory computation, Spark addresses the problematic areas of previous competitors (e.g. Hadoop MapReduce): iterative and interactive processing. If a node in the distributed cluster fails (one of the workers in Figure 2.6), or something unexpected happens to it, Spark is smart enough to recreate the failing node's RDD partitions instead of rerunning the whole algorithm. Caching an RDD is where Spark's power comes in: instead of reading and transforming raw data from its location, Spark can cache an RDD for future use, which is ideal for iterative and interactive algorithms.
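As a minimal sketch of that caching behaviour (the file path, the line format and the local Spark context are assumptions for illustration), an RDD can be transformed once, cached, and then reused by several subsequent actions without re-reading the source data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-cache-sketch").setMaster("local[*]"))

    // Parse raw event lines of the assumed form "sessionId,event".
    val events = sc.textFile("events.csv")
      .map(_.split(","))
      .collect { case Array(sessionId, event) => (sessionId, event) }
      .cache()                                   // keep the parsed RDD in memory

    // Both actions below reuse the cached RDD instead of re-reading the file.
    println(s"total events: ${events.count()}")
    println(s"distinct sessions: ${events.keys.distinct().count()}")
    sc.stop()
  }
}
```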


Figure 2.6: An architecture overview of a running Spark application.

An overview of a running Spark application can be seen in Figure 2.6. The Driver, also called the Master, runs the application, and the Workers operate on data from a distributed persistent storage. RDD partitions can be kept in memory (if instructed to) for fast future access.

2.8.2 Spark on Cassandra

Spark collaborates well with Cassandra, since Spark can optimize computation based on how data is partitioned in Cassandra [32]. Data locality is important when running distributed computation. Since Cassandra already stores the data by its partition key, it is very effective for scenarios where data is grouped or joined by the same partition key. In addition to this, the Spark Cassandra Connector10 can translate Cassandra's token awareness to let Spark do better query optimization. Thus, Spark can take advantage of knowing on which nodes partitions are located. Spark will then try to run aggregations locally where data is stored, to reduce network overhead [32]. The concept of keeping Spark workers close to where data is located can be seen in Figure 2.7; in this figure the Spark workers are placed on the same machines as the Cassandra nodes.

10A connector to access data and expose Cassandra's data tables to Spark's RDDs.
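A small sketch of how the connector exposes a Cassandra table as an RDD is given below; the keyspace, table and column names, the contact point and the aggregation are assumptions for illustration, not the schema used later in the thesis.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object SparkOnCassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark-on-cassandra-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // assumed Cassandra contact point
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as an RDD; reads are directed towards the nodes
    // owning each token range, which preserves data locality.
    val events = sc.cassandraTable("analytics", "custom_events")

    // Count events per session using the assumed column names.
    val perSession = events
      .map(row => (row.getString("session_id"), 1L))
      .reduceByKey(_ + _)

    perSession.take(10).foreach(println)
    sc.stop()
  }
}
```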


Figure 2.7: Placing Spark workers on the same nodes as the Cassandra nodes opens up possibilities for distributed computation.

2.8.3 Related work

The new way of mapping distributed ETL processes onto distributed data storage (explained in section 2.7) has been successfully implemented by Hadoop MapReduce, and more recently by Apache Spark. Hadoop was superior for big data processing for many years, but the limitation of writing to disk for each applied transformation led to new implementations, such as Spark. Spark, compared to Hadoop, makes it possible to keep transformations in memory instead of writing results back to storage/disk. Caching can be used to improve the performance of iterative and interactive processes [28], since there will be less I/O if it is possible to read from memory instead of disk. The difference between how interactive and iterative processes operate in Hadoop MapReduce and in Spark can be seen in Figure 2.8 and Figure 2.9. As seen in Figure 2.8, Hadoop MapReduce needs to read and write from disk for each transformation or request. This means that if an ad-hoc (interactive) query is requested, Hadoop needs to read from disk. Spark, on the contrary, is able to use memory between iterations and serve interactive queries from cache, see Figure 2.9. Spark is designed to be a better fit for [28]:

• Iterative - Machine learning is often done in iterative steps.

• Interactive - Ad-hoc queries can be provided with very low latency, when and if the data is stored in memory.


Figure 2.8: Data processing in Hadoop Map Reduce.

Figure 2.9: Data processing in Apache Spark.

2.9 Spark job-server

Spark works well for scheduled batch jobs; it was designed to operate in that way [28]. This design choice also means that Spark was not designed to be run as a long-living process that provides low-latency ad-hoc queries as a service. The typical use case for Spark is closer to running heavy aggregations on raw data and then storing the aggregated result back into the data storage. Often an EDW contains batch computation output for a limited set of fast pre-computed queries. The limitation of predefined output tables has driven the community to develop a platform that supports long-running Spark contexts, to provide low-latency Spark jobs via an HTTP server. This has proven to be very efficient for interactive and iterative querying, since Spark job-server manages to keep Spark RDDs cached for repeated usage. The workflow described in Figure 2.10 can be explained as follows:

1. Create a query context.

2. Load data (e.g. run transformations on the data stored in Cassandra).

3. Use the query context to run query jobs; ideally this would use the cached transformed data.

It is possible for the query jobs to operate on already cached, transformed data. This reduces the transformations required on the raw data stored in the database. In Figure 2.10, this means that the result of the load data task can be cached, so the query jobs do not need to read and transform data stored on disk.
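A minimal sketch of that division of work is given below, assuming the spark.jobserver SparkJob and NamedRddSupport interfaces as found in older job-server releases; exact trait and method signatures may differ between versions, and the data source, the RDD name and the placeholder query are assumptions of this example only.

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// Runs once per context: transforms raw events and caches them as a named RDD.
object LoadEventsJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // In the real system this would read from Cassandra; a text file stands in here.
    val sequences: RDD[(String, String)] = sc.textFile("events.csv")
      .map(_.split(","))
      .collect { case Array(sessionId, event) => (sessionId, event) }
    namedRdds.update("event-sequences", sequences.cache())
    sequences.count()
  }
}

// Served with low latency: reuses the cached named RDD instead of reloading data.
object FunnelCountJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val cached = namedRdds.get[(String, String)]("event-sequences").get
    cached.keys.distinct().count()   // placeholder for the real funnel query
  }
}
```

Both jobs are submitted over the job-server HTTP API into the same long-running context, which is what allows the second job to see the RDD cached by the first.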

Figure 2.10: High level architecture of Spark job-server. Within a query context, different Spark jobs can operate on shared data.


Chapter 3

Implementation

Due to the vague requirements for incoming logs/second, mentioned in section 2.1.2, the requirement to scale easily is highly relevant. For example, a customer might want to extend a test case with more events to log, and the incoming log rate for this customer might exceed the currently manageable rate. The system must handle this so it does not become unavailable for other usage. In practice this would be noticed in the RabbitMQ message queue (described in section 2.1.1): the received message rate would be higher than the consumed message rate, and the message queue would therefore fill up.

With this information, some requirements of the system’s behavior are given below:

• The system should easily be able to scale, to manage a rapidly increasing amount of incoming user events.

• The system should be highly available, with no single point of failure that could stop the entire system.

As stated in section 2.5.3, partition tolerance is fundamentally required when working with distributed database systems. The question therefore becomes which of the other two properties to sacrifice: availability or consistency. From the system requirements above, availability is highly important and was therefore prioritized. The write operation rate (writes/second) is assumed to be significantly larger than the read operation rate (reads/second). This pushes the choice towards availability even further, because reading is where dropping consistency might be a problem. This consideration was supported by the assumption that there are only writes of user events, hence there are no updates. Logically, there will not be any incorrect data; the consistency issue that could occur is that data might be missing when a read operation is processed (e.g. the replication has not completed at one node when data is read). How much this affects the result can be discussed, but it is closely connected to the downtime of the node (how much data has failed to be delivered). This limitation is well known for DBMSs that aim for the availability and partition tolerance properties. The problem is addressed by supporting a consistency model called eventual consistency [13], which means that a node that has been down will eventually receive the correct replicated data. This architectural concept is widely known as BASE [33], which stands for Basically Available, Soft state, Eventual consistency.

3.1 Choosing a suitable database system

Before a suitable database system was chosen, some concrete ideas about the structure of the data logs were formed. From a funnel analysis perspective, the data structure would need to describe a session, an association to an A/B test, an event and a timestamp for when the event occurred. The session is required to separate users from each other, and the events should be analyzable in connection with the session in which they occurred. This is a very simple data schema structure, though it may need to be changed for future use cases. For example, metadata may give analytic value for segmentation, e.g. location or user information (gender, age, platform, etc.). A dynamic schema/unstructured data is a key concept of the NoSQL family, and is supported in different ways [34]. Some consideration was given to unstructured data, but the main focus of the data model was the key part of its format: the timestamp. As mentioned in section 2.3, a funnel can be described as a series of events, and this concept influences the choice of database system, because some data models are more suitable for time series data. According to [16], column family storage suits problems where a wide range (a series) of values are stored together on a row connected to a row key. All column family stores are based on Google's BigTable data model [23].

3.1.1 Data model

The columnar data model defined in BigTable [23] can be seen as a multi-dimensional map, indexed by three fields: RowKey, ColumnKey and Timestamp. The RowKey points out the location of the associated columns, and with the ColumnKey the value can be found. The Timestamp makes entries unique for cases where the same ColumnKey exists for the same RowKey; the combination of ColumnKey and Timestamp must be unique to avoid data collisions. To get a specific value, the required fields must be provided: (RowKey, ColumnKey, Timestamp) → Value.

To connect the data model to the use case of funnel analysis, the focus on timestamp-based wide-row storage maps well to the idea of time-series data storage. The RowKey could separate different test rounds (in an A/B test use case) to keep relevant data together on each row, and it could also separate different test groups (section 2.2). The ColumnKey could contain the necessary information regarding the session of an event. The Timestamp would logically define when the event was registered. The stored Value would in this scenario be the event.
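As a rough illustration of this lookup structure, the multi-dimensional map can be pictured as nested dictionaries. The sketch below uses made-up keys that mirror the A/B test use case; it is a simplification of the lookup idea, not of how the storage engine actually persists data.

# Simplified picture of the columnar data model: (RowKey, ColumnKey, Timestamp) -> Value.
# All keys and values below are made up to mirror the funnel analysis use case.
table = {
    "testround-1:groupA": {                             # RowKey: test round + test group
        ("session-42", 1488370496789): "view_product",  # (ColumnKey, Timestamp) -> event
        ("session-42", 1488370512345): "add_to_cart",
        ("session-17", 1488370530001): "view_product",
    },
}

def lookup(row_key, column_key, timestamp):
    # Resolve (RowKey, ColumnKey, Timestamp) -> Value.
    return table[row_key][(column_key, timestamp)]

print(lookup("testround-1:groupA", "session-42", 1488370512345))  # prints "add_to_cart"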

Both [16, 23] argue that use cases similar to this (time series models) are a good fit for column oriented data storage. The open source database systems they mention are HBase, Cassandra and HyperTable. Even though their data models are based on the same idea, they differ with respect to the CAP properties described in section 2.5.3. With the required availability discussed above, based on section 2.1.2, the logical choice is Cassandra, see Table 3.1.

Table 3.1: NoSQL DBMSs' properties regarding the CAP Theorem in Section 2.5.3

                Cassandra                HBase                    HyperTable
CAP Theorem     High Availability,       Consistency,             Consistency,
                Partition tolerance      Partition tolerance      Partition tolerance

There are several performance tests [20, 21, 35, 36] that favor Cassandra over other DBMSs when it comes to scalability and operation throughput. Given the data model, the CAP properties (see Table 3.1) and the performance literature cited above, Cassandra was chosen as the data storage.

3.2 Architecture

This section will describe the architectural problems and the tools that have been chosen to solve them. In chapter 4, a proof of concept (POC) system will be presented, in the form of a test case using the architecture discussed in this chapter.

3.2.1 Data storage

As described in section 3.1, Cassandra was chosen as the database system; an explanation of Cassandra can be found in section 2.6. In this thesis, Cassandra has been running locally on a single machine with Cassandra Cluster Manager (CCM). The benefit of using CCM is that a multi-node Cassandra cluster can be set up on a local machine, to simulate more production-like behaviour (e.g. data that is partitioned to different nodes). This implementation expects data to come from a message source. The input stream of data has been selected to come from RabbitMQ (a messaging service that decouples applications and lets them send messages asynchronously, https://www.rabbitmq.com/). Even though the API sources (which post event logs) grow in number, the posted messages will end up at the same location. This lets the consumer (described below) of the messaging queue hold information about only one data source, namely the message exchange. Figure 2.2 shows how this implementation would be integrated with AppGrid.

On the other end of the message queue, the consumer is located (Figure 2.2). The consumer’s assignment is to handle the incoming messages and store them into Cassandra.

3.2.2 Data analysis

Business Intelligence (BI) applications usually store data to provide reports and dashboards for companies and organizations. This allows businesses to enhance their performance by providing relevant data for more informed decisions [37]. The queryable data, however, usually goes through a traditional ETL process that limits the possible queries to a static pool of pre-aggregated tables. The concept of traditional ETL is described in section 2.7.

Data analysis with Cassandra, as with most NoSQL databases, is very limited. Cassandra Query Language (CQL) does not support queries such as joins or group by, which are fundamental techniques for relational databases. This is mostly to prevent high latency for requested queries. As described in section 2.6, Cassandra partitions datasets distributed over a cluster, and join queries would most likely (if the data is not partitioned to the same node) require queries against multiple nodes before any joining could take place. This would increase the query latency, since the nodes may not be on the same network. As described in the Cassandra Wiki, this limitation of aggregation queries was made intentionally, to force data models to be designed denormalized, which has the benefit that queries only hit single replicas, reducing the overall read latency.

To provide BI with Cassandra, the data needs to be queried/aggregated with tools other than the native CQL. Another problem is to support ad-hoc queries against the raw data. BI applications traditionally only supply queries to be run on pre-aggregated data, which limits the possibilities to analyze the raw data stored in Cassandra. The traditional ETL processes and BI applications are not sufficient to support dynamic queries on data that has not already been transformed into the structured, queryable Data Warehouse (DW).


The decision to use Cassandra as data storage (in the section above) made it more complex to consider Hadoop MapReduce, since Hadoop MapReduce is tightly integrated with HDFS storage [24]. However, Cassandra can run with the Cassandra File System (CFS), which is compatible with HDFS (https://docs.datastax.com/en/datastax_enterprise/4.0/datastax_enterprise/ana/anaCFS.html). This compatibility makes it possible to integrate Hadoop MapReduce [38] with data stored in Cassandra. However, CFS is a commercial feature (not open source) in DataStax Enterprise (DSE, http://www.datastax.com/products/datastax-enterprise). [38] provides different solutions for combining Hadoop MapReduce with Cassandra, but the solutions are either tedious or require two complete systems to be set up, where data is pulled from Cassandra into HDFS and, once analyzed, the result is stored back in Cassandra. This works well for batch/offline oriented analysis (processing that happens in the background, not connected to ad-hoc queries), but not for interactive analysis.

Spark was more interesting to use, since it integrates more smoothly with Cassandra. The combination of Cassandra and Spark is described in section 2.8.2.

Spark job-server, described in section 2.9, was used to keep transformed data available within a long-lived Spark context, so that the funnel analysis can run on already transformed data. The alternative would be to rerun the time-consuming transformation of the raw custom events for each dynamic funnel analysis query. This transformation is called SetupSharedData in figure 3.1. The blue boxes in the figure are Spark jobs run by Spark executors; the logic wrapped around Spark is functionality provided by Spark job-server. The ability to retrieve shared RDDs between different Spark jobs is one example of what Spark job-server makes possible.
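The jobs in this architecture are managed by Spark job-server, but the core idea of SetupSharedData, namely reading the raw events once, transforming them into per-user event sequences and caching the result for later funnel queries, can be sketched roughly as follows. This is a simplified sketch, not the job-server implementation itself; it assumes the DataStax spark-cassandra-connector is on the classpath, uses the newer SparkSession API for brevity, and the keyspace and table names are made up.

# Rough sketch of the SetupSharedData step in Figure 3.1 (not the actual
# job-server implementation). Assumes the spark-cassandra-connector is on
# the classpath; keyspace/table names are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SetupSharedData")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

raw_events = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="analytics", table="custom_events")
              .load())

# Group the raw events into one time-ordered event sequence per user,
# which is the shape the funnel analysis jobs operate on.
sequences = (raw_events.rdd
             .map(lambda row: (row["user_id"], (row["timestamp"], row["event"])))
             .groupByKey()
             .mapValues(lambda pairs: [event for _, event in sorted(pairs)]))

# Cache the transformed data so later funnel queries can reuse it; this is
# the role the shared RDDs in Spark job-server play in the real architecture.
sequences.cache()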


Figure 3.1: High level architecture of the responsibility of Spark job-server. The result of the job closest to the database (Cassandra) is cached as a shared RDD (marked as a star), accessible for other jobs (e.g. funnel analysis).

3.3 Dynamic funnel analysis

In section 2.4 different sequence mining techniques were introduced. Spark's machine learning library (MLlib) has support for FP-growth. The problems with using FP-growth for dynamic funnel analysis are that FP-growth lacks both:

• Dynamic possibilities - FP-growth is a mining process that searches for sequences supported by a minimum threshold; it does not search for dynamically defined sequences on the raw dataset.

• Time-based order - the mined events will not be ordered by time.

WAP-tree [9] mining, described in section 2.4, would be a more interesting basis for funnel analysis. However, neither WAP-tree nor any other time-based sequential mining process was included in Spark's MLlib as of version 1.3. Even if such an implementation existed, it would still have the same problem as FP-growth: it cannot run dynamic funnel analysis.
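To make the first limitation concrete, the sketch below runs MLlib's FP-growth on a few made-up transactions. The algorithm is driven by a minimum support threshold and returns unordered frequent itemsets, rather than matching a user-defined, time-ordered funnel. The Python API is used here for brevity, although it appeared in MLlib somewhat later than the Scala version.

# Sketch of MLlib's FP-growth on made-up data: it mines frequent itemsets above
# a support threshold and ignores event order, which is why it cannot answer an
# ad-hoc, user-defined funnel query directly.
from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="fp-growth-sketch")

# Each "transaction" is the set of distinct events a user triggered (order is lost).
transactions = sc.parallelize([
    ["view_product", "add_to_cart", "checkout"],
    ["view_product", "add_to_cart"],
    ["view_product"],
])

model = FPGrowth.train(transactions, minSupport=0.5, numPartitions=2)
for itemset in model.freqItemsets().collect():
    print(itemset.items, itemset.freq)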

The reasons above led to the decision to base the funnel analysis on regular expressions (REs). Dynamic REs can be created from user-defined funnel queries. By matching REs against the end-users' event sequences, dynamic funnel analysis can be accomplished. The implemented RE-based funnel analysis is further described in section 4.3.1.
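A minimal sketch of the idea is shown below; it is not the exact implementation from section 4.3.1, and the helper functions, funnel steps and event names are made up. Each funnel step becomes part of a regular expression that is matched against a user's time-ordered event sequence.

import re

def build_funnel_re(steps):
    # Join the funnel steps into one RE; ".*" allows other events in between.
    # re.escape is used in case event names contain RE metacharacters.
    return re.compile(".*".join(re.escape(step) for step in steps))

def deepest_step_reached(steps, event_sequence):
    # Return how many funnel steps (in order) the event sequence completed.
    joined = ";".join(event_sequence)
    for depth in range(len(steps), 0, -1):
        if build_funnel_re(steps[:depth]).search(joined):
            return depth
    return 0

# Example: a made-up three-step funnel matched against one user's events.
funnel = ["view_product", "add_to_cart", "checkout"]
events = ["login", "view_product", "view_product", "add_to_cart", "logout"]
print(deepest_step_reached(funnel, events))  # prints 2, checkout was never reached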


Chapter 4

A test case

This chapter will describe an implemented test case using the architecture described in section 3.2.

4.1 Event consumer

The consumer of the event logs has been implemented in Python and uses DataStax's Python driver for Cassandra. The main tasks of the consumer, sketched below, are:

• to subscribe to the message queue and listen to incoming messages.
• to deserialize all incoming messages/events.
• to build up objects matching the format described in Listing 1.
• to write the objects as a batch operation to Cassandra.
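The sketch below shows roughly what such a consumer can look like, using the pika client for RabbitMQ and the DataStax Python driver. It is not the thesis implementation: the queue name, keyspace and message format are made up, error handling is omitted, and the exact pika call signatures differ between library versions.

import json
from datetime import datetime

import pika                                   # RabbitMQ client
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

# Connect to Cassandra; the keyspace name is made up for this sketch.
session = Cluster(["127.0.0.1"]).connect("analytics")
insert = session.prepare(
    "INSERT INTO custom_events "
    "(application, start_time, timestamp, event, user_id, group) "
    "VALUES (?, ?, ?, ?, ?, ?)")

def to_datetime(epoch_ms):
    # The sketch assumes the producer sends timestamps as epoch milliseconds.
    return datetime.utcfromtimestamp(epoch_ms / 1000.0)

def on_message(channel, method, properties, body):
    # Deserialize a batch of events and write them to Cassandra as one batch.
    events = json.loads(body)
    batch = BatchStatement()
    for e in events:
        batch.add(insert, (e["application"], to_datetime(e["start_time"]),
                           to_datetime(e["timestamp"]), e["event"],
                           e["user_id"], e["group"]))
    session.execute(batch)
    channel.basic_ack(delivery_tag=method.delivery_tag)

# Subscribe to the message queue and start consuming.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="custom-events")
channel.basic_consume(queue="custom-events", on_message_callback=on_message)
channel.start_consuming()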

4.2 Data storage

Cassandra was selected as the DBMS. The data model has been modeled based on assumptions about what a logged custom event must include, and about how the data should be partitioned and queried.

Data model

The data model contains all information needed for the logged custom events. Extra metadata is applied to make storage and read requests more efficient, e.g. the events will be ordered by time on insertion (described further down). The data model can be seen in Listing 1; it is constructed with:


• application - a string identifying the application that uses this implementation.
• start_time - a timestamp marking when the A/B test started (to separate potential A/B tests).
• timestamp - the timestamp of the event.
• event - a string representing a custom event.
• user_id - a string that is unique for each native user.
• group - a string identifying which A/B test group the event is associated with.

The PRIMARY KEY notation in Listing 1 determines the uniqueness of each insertion; the behaviour of this key is discussed more thoroughly below. The final clustering order by statement tells Cassandra to order the columns of each row in descending order by timestamp, which is also a powerful technique to efficiently query, for example, the N latest events (e.g. SELECT * FROM 'some-table' LIMIT 100).

 1 CREATE TABLE IF NOT EXISTS custom_events (
 2     application text,
 3     start_time timestamp,
 4     timestamp timestamp,
 5     event text,
 6     user_id text,
 7     group text,
 8     PRIMARY KEY (
 9         (application, start_time, group),
10         timestamp,
11         user_id
12     )
13 ) WITH CLUSTERING ORDER BY (timestamp DESC);

Listing 1: The defined data model in Cassandra Query Language (CQL)

How data is partitioned

The tuple in the PRIMARY KEY on row 9 in Listing 1, (application, start_time, group), is called the PARTITION KEY and defines on which partition the data should be placed. The partitioner (mentioned in section 2.6.1) uses this information to partition the data within the Cassandra cluster (if using multiple nodes). The remaining keys (timestamp, user_id) define the location of the columnar values (the clustering key). The combination of keys in the PRIMARY KEY needs to be unique for all insertions, and the assumptions for the model presented in Listing 1 were that:

• Multiple events can be logged at the same time for different users, since they are on separate devices.

• Multiple events cannot be logged at the same time for one user, assuming that timestamps are measured with microsecond resolution.

Logically, the probability of a timestamp collision varies with the number of users that are posting logs, and it can be decreased by using microsecond resolution for the timestamp. However, the model cannot ensure uniqueness with the timestamp alone. Therefore user_id was combined with the timestamp as the clustering key. This was chosen to prevent writes to Cassandra with the same primary key, since Cassandra will keep only the "freshest" duplicate of insertions with colliding primary keys.

The data model described in Listing 1 will partition the incoming data to different nodes. An overview of how it behaves can be seen in figure 4.1. Depending on how Cassandra has been set up and tuned, inserts will behave differently. The figure visualizes a Replication Factor (RF) covering all nodes and a Write Consistency level (WC) of ALL. The WC determines the number of nodes that must send an acknowledgment for each insertion before the write operation is considered successful; this is the basis of Cassandra's well-known tunable consistency.
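As an illustration of the tunable consistency, the write consistency level can be set per statement with the DataStax Python driver, as sketched below. The keyspace name and the inserted values are made up; the replication factor itself is fixed when the keyspace is created and is not shown here.

# Sketch: setting the write consistency level per statement with the DataStax
# Python driver. The table mirrors Listing 1; keyspace and values are made up.
from datetime import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("analytics")

# WC = ALL: every replica must acknowledge the write for it to succeed,
# matching the behaviour visualized in figure 4.1.
insert = SimpleStatement(
    "INSERT INTO custom_events "
    "(application, start_time, timestamp, event, user_id, group) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.ALL)

now = datetime.utcnow()
session.execute(insert, ("my-app", now, now, "add_to_cart", "user-1", "A"))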


Figure 4.1: A visualization of the partitioning and acknowledgment "flow" for the data model defined in Listing 1.

4.3 Data analysis in Apache Spark

This section will describe the Apache Spark implementation and the procedure for regular expression based funnel analysis.

Apache Spark could not be run against a local cluster of Cassandra nodes with a node count larger than one. This is due to a bug in the Cassandra Cluster Manager (CCM) tool (https://github.com/pcmanus/ccm) that was used to simulate a local cluster on a single machine: queries from Spark against Cassandra resulted in timeouts. This bug has also been experienced by others (see http://www.codedisqus.com/CNVkeWUUkj/problems-with-connecting-spark-to-local-multinode-cluster-with-ccm.html and https://github.com/pcmanus/ccm/issues/275).

With the limitation of not being able to spread the data onto multiple nodes locally, the decision was made to keep all data on a single node. This would however

