DEGREE PROJECT IN INFORMATION AND SOFTWARE SYSTEMS, SECOND LEVEL
COMPUTER SCIENCE AND ENGINEERING
STOCKHOLM, SWEDEN 2014
Creation of an Analytics Platform for an on-site eCommerce Search Engine
THOMAS FATTAL
KTH ROYAL INSTITUTE OF TECHNOLOGY
Master's Thesis at ICT
Supervisor: Meni Morim
Examiner: Vladimir Vlassov
TRITA xxx yyyy-nn
Abstract
The author of this master's thesis is a co-founder of the startup Findify, which provides a search engine service for eCommerce websites. This master's thesis focuses on the creation of an analytics platform that collects users' actions and behaviours in order to provide sharp and actionable analytics for the merchant. From the tracking of customers' actions set up at multiple levels, a fault-tolerant and scalable architecture has been created, capable of saving all the logs.
These logs are then transferred to a centralized system, where they are processed by a big data system. To validate the analytics platform flow, a report is sent weekly to the merchant, containing details on the usage of Findify's service and giving insightful analytics. The analytics platform is currently in production, is able to receive thousands of logs per second and scales automatically.
Acknowledgements
First, I would like to express my gratitude to my KTH examiner, Vladimir Vlassov, for agreeing to examine my master's thesis and for his helpful advice on the acceptance process.
I’m also grateful to Lionel Brunie for his supervision during this internship.
Second, I would like to acknowledge and extend my heartfelt gratitude to the other co-founders of Findify:
• Meni Morim, for supervising and assisting me in my master's thesis. His honest judgement of my work and his advice were very valuable.
• Jaclyne Clarke, for her contribution to my work, particularly on the types of analytics I could generate and on the analytics report sent weekly to the merchant.
• Thibaut Patel, who collaborated with me on the trackers part, helped me with the architecture choices I faced during this master's thesis and confirmed the directions I took.
I hope this adventure continues with such open-minded and smart people, and I wish Findify all the best for the future.
I would also like to thank my parents, brothers, sisters and friends for their constant encouragement during this master's thesis. Many thanks to Ann-Katrin Batzer, who was a great source of support and encouragement for me.
Contents
List of Figures
1 Introduction
  1.1 Background
  1.2 Motivation and Scope
  1.3 Literature Survey
  1.4 Contribution of the thesis
  1.5 Thesis outline

2 Trackers
  2.1 Introduction
  2.2 Customers' events
  2.3 Customers tracking
    2.3.1 Cookies
    2.3.2 The proxy, end-point server
  2.4 Events format
  2.5 Overview

3 Collection of the data
  3.1 Introduction
  3.2 Requirements
  3.3 Syslog server
  3.4 Direct upload to S3
  3.5 SQS and an EC2 instance
  3.6 First solution with DynamoDB
  3.7 Actual solution using Kinesis and S3
    3.7.1 Details about the solution
    3.7.2 From the proxy to Kinesis
    3.7.3 The Kinesis consumer application
    3.7.4 Results
  3.8 Overview

4 Processing and Analytics
  4.1 Requirements
  4.2 Spark-based processing system
    4.2.1 Input / Output
    4.2.2 Merchant database
    4.2.3 Jobs in Spark
  4.3 External Analytics
    4.3.1 Introduction
    4.3.2 General counts
    4.3.3 Searches
    4.3.4 Top Suggestions
    4.3.5 Autocomplete
  4.4 Internal Analytics
    4.4.1 BI Software
    4.4.2 Visualizations in Tableau
  4.5 Overview

5 Visualization and Querying
  5.1 Introduction
  5.2 The querying program
  5.3 Analytics report
  5.4 Overview

6 Further work
  6.1 Trackers
  6.2 Collecting system
  6.3 Processing system
  6.4 Querying system

7 Conclusion

Bibliography
List of Figures
1.1 Autocomplete overlay
1.2 Search results
1.3 Limited architecture of Findify
1.4 Architecture Overview
2.1 The trackers, first piece of the analytics platform
2.2 Findify's search bar
2.3 Click on the search box
2.4 When a letter is typed - in French
2.5 Overview of the trackers client-side/proxy
3.1 The collection of the data, second piece of the analytics platform
3.2 Input, the collecting system and output
3.3 Using a Syslog server
3.4 Direct upload to S3 from the proxy
3.5 Using a queue SQS and a consumer EC2
3.6 Using DynamoDB
3.7 Actual solution using Kinesis
3.8 The log files are stored in daily directories on S3
3.9 For each day, several files are present
4.1 The processing system, third piece of the analytics platform
4.2 A view of the computing system
4.3 Enrichment job - Input, actions, output
4.4 Extraction job to a CSV file - Input, actions, output
4.5 Features Job - Input, actions, output
4.6 Summary over the computing system chosen
4.7 Flow to import data to Tableau
4.8 The window of Tableau
4.9 Amount of logs and IPs for the first week of May
4.10 Distribution of the event category among the logs for the first week of May
4.11 Distribution of the event category per unique IP for the first week of May
5.1 The querying system, fourth and last piece of the analytics platform
5.2 Search box usage of the website touchedeclavier.com
5.3 Search activity over time
5.4 Unique search terms
5.5 Top 10 Search Suggestions Conversions
5.6 Effectiveness of Search Suggestions
5.7 Top 20 Search Queries
5.8 Autocomplete
7.1 Architecture Overview
List of listings
1 Reading of the cookie at startup time by the Javascript client library
2 Sending of an event by the Javascript client library
3 Reception of the event by the proxy server
4 Sending of an event from the proxy to Kinesis
5 Processing the records with the Kinesis consumer application
6 When does the buffer need to be flushed to S3?
7 Emit the buffer of events to S3 (1/2)
8 Emit the buffer of events to S3 (2/2)
9 Spark program - Aggregation of the events by day
10 Spark program - Filtering to obtain the search suggestions
11 Spark program - Chronological history of the events
12 Extraction of the autocomplete aggregations
Chapter 1
Introduction
Findify[1] is a startup providing a search engine for eCommerce stores. This chapter presents the Findify product offering and the need for analytics. It then states the motivation, scope and outline of this master’s thesis.
1.1 Background
One of the main actors in online eCommerce retail, Amazon, currently sells around 230 million products[2], a number that increases linearly every year. Its success comes from an exceptional user experience, a deep understanding of customers' behaviours, a massive usage of recommendation engines[3, Linden, Greg, Brent, 2003], A/B and usability testing, and a fantastic search experience.
Even though eCommerce websites like Amazon or eBay dominate the market, online shopping is not restricted to big retailers. Indeed, Shopify[4], Magento[5], Prestashop[6] and the Swedish Tictail[7] all provide eCommerce platforms where individuals can easily add products and offer a viable shopping store to online customers.
One of the advantages of Shopify is its simple administration interface for managing products. However, Shopify lacks good search quality. Neither Shopify nor the other eCommerce platforms mentioned above give specific attention to the quality of the search. To answer this need, eCommerce platforms often propose standalone modules created by third-party developers that improve parts of a merchant's store; some are dedicated to search.
It is to serve this purpose that Findify has been created. The Findify search-as-a-service platform is a globally distributed, scalable system which helps online merchants increase their conversion rates while maximizing customer retention by providing an optimal shopping experience. Findify offers eCommerce merchants two main products:
• An autocomplete overlay, currently in production and deployed in two stores. Integrated within the search bar, the overlay displays search and product suggestions that customers can click on. Figure 1.1 presents the overlay on a French eCommerce store after a customer has typed "table".
Figure 1.1. Autocomplete overlay
• The search results page, in beta testing, gives a list of products with filters and facets that customers can select to refine their search. Figure 1.2 shows the result of the search "table".
Figure 1.2. Search results
1.2 Motivation and Scope
Because Findify controls the search bar, it is possible to capture the interactions that customers have with the autocomplete overlay. Understanding customers' actions makes it possible to:
• Identify customers' navigation patterns, useful to improve the user experience.
• Improve the relevancy of the Findify search engine by analysing the most clicked suggestions, products and interface components seen by the customers.
• Provide insightful analytics to the merchant concerning their customers’ search usage.
The main goal of this master's thesis is to create an analytics platform that collects data on customers' actions and is therefore able to give insightful analytics to the merchant.
Creating an analytics platform requires designing the architecture services that will support the saving and the computation of the customers' data. It also requires software development at different levels, as explained in section 1.4, which gives more details about the different steps required to create the analytics platform.
We will limit the creation of the analytics platform to the collection and the extraction of analytics. In this context, no improvement will be made to the search relevancy.
1.3 Literature Survey
When speaking about analytics on the Internet, Google Analytics[8] is often referred to. By integrating a code snippet into each page of a website, Google Analytics provides a free platform that collects information about customers' navigation, such as their geographical origin and general visit statistics. The analytics platform built for Findify does not aim to overlap with nor replace Google Analytics.
The collection of the data follows the same process, but the goal of the analytics platform is to provide relevant metrics, specialized in eCommerce search, that Google Analytics does not provide.
Other platforms such as MixPanel[9] and KissMetric[10] provide analytics using an architecture similar to what we built. Findify could have used one of these platforms to provide merchants with analytics. However, the analytics platform built in this master's thesis is also the base for one of the core features of Findify: machine learning algorithms applied to the collected data to improve the relevancy of the search engine. These algorithms cannot be implemented
using external services like MixPanel or KissMetric.
Findify relies completely on Amazon Web Services[11] (AWS). AWS is an IaaS (Infrastructure as a Service) cloud platform that offers several services for building an architecture. For instance, the EC2[12] ("Elastic Compute Cloud") service provides servers. AWS offers a way to manage the compute capacity of these servers at different prices and provides, in addition to a real-time monitoring system, a management console for configuring every detail of these servers, such as security ports or the hard drives attached to a server.
Several other services can be combined with EC2 to form a full cloud platform. Another service, introduced in chapter 4, is EMR[13] ("Elastic MapReduce"). EMR proposes an abstraction over big-data systems in order to run batch-processing jobs.
In order to ramp up its skills on AWS, Findify paid for a support account to ask questions of AWS technical experts. Most of the architecture decisions taken throughout this master's thesis have been discussed with and confirmed by these experts.
Besides, the Findify team participated in the AWS Summit in Stockholm in May to confirm our architecture. I also took advantage of the official resources and presentations provided for the technologies used in order to always choose the optimal setup for Findify.
Findify uses a combination of AWS services extensively to provide a scalable, fault-tolerant and distributed architecture. A basic schema of Findify's core architecture is displayed in figure 1.3. Building an analytics platform also requires setting up an architecture; the following chapters give more details about the AWS services used.
Basic flow: When searching on an eCommerce store, customers type in the search bar. A Javascript library on the front-end detects these interactions and sends requests to the application servers called "proxy". These EC2 instances (servers) retrieve search information from a database and send requests to the search backend.
When the search backend answers, with the results of a search for instance, the proxy takes the answer and modifies it before returning it to the customer's browser.
Figure 1.3. Limited architecture of Findify
The programming languages Javascript, Scala and Java have been used during this master thesis.
1.4 Contribution of the thesis
Four main parts are required to create an analytics platform:
Figure 1.4. Architecture Overview
1. Trackers: The trackers are used to collect customers' actions and information. A tracking code has been developed inside the Javascript client library to detect actions and create events accordingly. The events' format has been designed to include several available attributes. Finally, a method for sending the events from the Javascript client library to the servers has been developed.
2. Collecting System: The collecting system takes all the data collected by the trackers and saves it in permanent storage. In this part, several solutions have been studied in order to choose one that respects all the requirements.
Software has been developed to implement the transfer of the events.
3. Processing System: The processing system retrieves all the data from the permanent storage and runs algorithms to produce the desired analytics. After choosing the big data system, a full application has been implemented to run aggregations on top of the customers' data.
4. Querying System: The querying system aims to extract the results and create visualizations. It has two main parts: a program has been implemented to extract information from the processing system's results; the analytics are then rendered manually.
1.5 Thesis outline
This thesis contains several parts.
Chapter 2 - Trackers
In this chapter, we present the different trackers that have been set up in order to collect customers' actions at different levels in the architecture.
Chapter 3 - Collection of the data
In this chapter, we analyse the collection of the data sent by the trackers.
Several approaches are studied in order to solve the problem.
Chapter 4 - Processing and Analytics
In this chapter, we analyse in detail the processing part. From the data we have collected, which tools are used to produce analytics for the merchant and for Findify?
Chapter 5 - Visualization and Querying
In this chapter, we take a deep look at the end-point of the analytics platform: the visualization of the analytics.
Chapter 6 - Further work
In this chapter, further work related to the analytics platform is presented.
Chapter 7 - Conclusion
In this chapter, conclusions are given.
Chapter 2
Trackers
2.1 Introduction
The trackers are responsible for collecting customers' information and actions. They represent the base of the analytics platform.
Figure 2.1. The trackers, first piece of the analytics platform
As shown in figure 1.3, when customers search on a merchant's store, requests are sent to the "proxy" application. The proxy can then record information linked to the request sent or related to the response of the search backend. However, certain actions made by the customer do not reach the proxy; for instance, clicking on the search box does not lead to a request. We therefore determine two locations where trackers have to be implemented:
1. On the client side: Customers' actions that are not related to the search service will be recorded at the front-end level (a click on the search box, for instance). Merchants use the Findify service via a Javascript client library. A tracking code has been implemented in this library to enable customer tracking.
2. On the server-side: Information about API requests will be recorded at the proxy level.
Before explaining the types of logs recorded, the actions made by the customers are studied.
2.2 Customers’ events
Different events are detected by the trackers when customers perform different actions in an eCommerce store:
Load a page
Findify is integrated into a merchant's store via the Javascript library added to the header of each page. This Javascript library manages the content of the search bar shown in figure 2.2: displaying the autocomplete overlay that appears when a letter is typed in the search bar; managing keyboard navigation in the autocomplete overlay, etc.
Figure 2.2. Findify’s search bar
Each time a customer loads a page, the search bar is displayed and the Javascript library is loaded. When this happens, the tracking code creates an event called "page_view".
This event is a client-side event.
Click on the search box
When a customer clicks on the search box (figure 2.3), an event called "box_click"
is generated. This event is a client-side event.
Figure 2.3. Click on the search box
Type a letter
For each letter typed, a request is sent to the proxy server in order to provide results to fill the autocomplete overlay. The proxy then returns search suggestions and product suggestions. From the search response, a server-side event called "server-autocomplete" is created. In figure 2.4, the beginning of a French word, "chais" for "chaise" ("chair" in English), has been typed.
Figure 2.4. When a letter is typed - in French
Click on a search/product suggestion
If a suggestion turns out to be interesting, the customer might click on either a search suggestion (top of the overlay) or a product suggestion (bottom).
When a customer clicks on a suggestion, the Javascript library does not directly create an event. To prevent blocking the load of the new page while handling the event, the tracking code in the Javascript library temporarily stores the type of the clicked suggestion (search or product suggestion) and the value clicked, in a cookie (see part 2.3.1 - clickthrough). Then, when the new page loads, the Javascript library automatically reads the content of the cookie. If the library detects that a suggestion has been clicked, it will combine the "page_view" event with the details of the suggestion clicked previously.
Click on the search button
Findify does not currently control the search results page. When it does, it will be possible to know that a search action has been made thanks to the request made to the proxy. As a result, clicking on the search button does not generate an event in the current implementation of the Javascript library. However, a method has been implemented in the processing system to detect when a search has been made. This method will be explained in chapter 4.
2.3 Customers tracking
Through the Javascript library, customers indirectly make requests to Findify's backend servers when interacting with the autocomplete overlay. In this library, a module has been implemented to track the customers' actions; this module generates the client-side events seen above. In this section, the different techniques used to do so are explained.
2.3.1 Cookies
Cookies are set to track customers on the client-side. A cookie is a small file stored in the customer’s browser that contains information. Findify can create three cookies:
1. uniq: This cookie is stored in the permanent storage of the web browser and is not removed when customers close their browser. The expiration date of this cookie has been set to 30 years. This cookie is used to track a customer over a long period. Combined with the visit cookie, it makes it possible to see the browsing history across different sessions.
2. visit: This cookie is stored in the session storage of the web browser1. This cookie contains an identifier describing the visit of a customer on the store.
The cookie has an expiration date of 30 minutes: if the customer is inactive for 30 minutes, the session cookie will expire and be removed by the web browser. Otherwise, at each action made by the customer, the expiration date is pushed back by another 30 minutes. If the customer comes back after more than 30 minutes, the Javascript library will not detect any visit cookie and will create a new one with a new identifier.
3. clickthrough: This cookie is not set when the library is loaded. It is only used when a customer clicks on a suggestion in the autocomplete overlay. Listing 1 shows the reading of the cookie when the Javascript library is loaded.
The expiration date of the clickthrough cookie is set to 30 seconds; we consider that a page cannot take more than 30 seconds to load.
Furthermore, some extreme cases are avoided: this short expiration date prevents, for instance, a customer from clicking on a suggestion, leaving, coming back two days later and still having the suggestion click event sent because the expiration date was too long.
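For completeness, the writing side of the clickthrough mechanism can be sketched as follows. This is an illustrative sketch only: the "type#value" layout and the 30-second expiration follow the description above and listing 1, but the cookie name and the handler function are assumptions, not the actual library code.

// Hypothetical sketch, executed when a customer clicks on a suggestion.
// The cookie name below is an assumption; the real library uses its own
// storage module (see listing 1).
var COOKIE_CT = '_findify_ct';

function onSuggestionClick(suggestionType, suggestionValue) {
  // Expire after 30 seconds, as described in section 2.3.1
  var expires = new Date(Date.now() + 30 * 1000).toUTCString();
  // Store "type#value", the layout read back by listing 1
  document.cookie = COOKIE_CT + '=' +
    suggestionType + '#' + suggestionValue +
    '; expires=' + expires + '; path=/';
  // The browser then navigates to the clicked page; on the next page load,
  // listing 1 reads this cookie and sends the combined page_view event.
}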
2.3.2 The proxy, end-point server
All the tracking events for the analytics are sent to the proxy servers. Dedicated instances for handling only the analytics events could have been used. However, the startup couldn’t afford to deploy such instances.
Each event is a list of parameters attached to a single-pixel image request.
1 Web browsers normally remove session cookies when the browser is closed.
// The cookie clickthrough is read and stored in a variable
var ct = storage.readCookie(COOKIE_CT);
// If the cookie exists: someone clicked on a suggestion
if(ct) {
  // First, the cookie is made invalid
  storage.writeCookie(0, COOKIE_CT);
  // Second, an event is sent with the type of suggestion and the value
  var event = ct.split('#');
  sendEvent('page_view', event[0], event[1]);
} else {
  // Otherwise, if nothing has been found in the cookie,
  // a page view event is sent.
  sendEvent('page_view');
}
Listing 1: Reading of the cookie at startup time by the Javascript client library
Google Analytics uses the same process[14, Tracking Code Overview, Google] to send data.
The code implemented in the client-side library is shown in listing 2.
function sendEvent(eventCategory, eventType, eventValue) {
  var p = [
    't=' + (+new Date()),
    'key=' + config.merchantKey,
    'url=' + encodeURIComponent(window.location.href),
    'host=' + encodeURIComponent(window.location.host),
    'visit=' + visit,
    'visit_id=' + visit_id,
    'uniq=' + uniq,
    'uniq_id=' + uniq_id,
    'ev_cat=' + eventCategory
  ];
  if(eventType) {
    p.push('ev_type=' + eventType);
  }
  if(eventValue) {
    p.push('ev_val=' + eventValue);
  }
  new Image().src = config.endpoint + '/' + config.version +
    '/a.gif?' + p.join('&');
}
Listing 2: Sending of an event by the Javascript client library
config.endpoint corresponds to the address of the proxy servers. config.version is the API version used by the Javascript library to communicate with the proxy.
As the code shows, new Image().src creates a GIF image request and assigns to it all the tracking parameters. The image a.gif is the tiniest image that can
exist: 1x1 pixel, GIF, transparent, 37 bytes[15]. The created image is not attached to the DOM of the page, so it is not visible to front-end users. The main advantage of this technique over traditional AJAX requests is the possibility of making cross-domain requests from a web browser.
When the GIF image is requested by the Javascript library, the proxy receives a request. The following code, in Node.JS1, is then executed (listing 3).
var IMAGE_URL = path.resolve(__dirname + '/../public/a.gif');

exports.analytics = function (req, res) {
  res.header("Cache-Control", "no-cache, no-store, must-revalidate");
  res.header("Pragma", "no-cache");
  res.header("Expires", 0);
  // The image is sent to the customer
  res.sendfile(IMAGE_URL);
  // The event parameters are logged by the proxy's logging module.
  logging.log(req.query, req);
};
Listing 3: Reception of the event by the proxy server
Figure 2.5 summarizes the tracking system active on the client side and on the proxy. When the customer loads a page, clicks on the search box, types letters in the search bar or clicks on a suggestion on the front-end, the Javascript library detects the action performed. Creating an event leads to requesting a GIF image with tracking parameters from the proxy server. Tracking parameters from the front-end are then sent to our collection system by the proxy (see chapter 3).
Figure 2.5. Overview of the trackers client-side/proxy
1 Node.JS is the JavaScript runtime used to develop the proxy server.
2.4 Events format
After having seen which events can be saved and how they are sent to the proxy, this part presents the format of each event sent to the proxy. The parameters marked Client are added to the event by the Javascript library, while the parameters marked Server are added by the proxy. Each event is analysed separately.
Page view Event
The page-view event contains the following information:
Each parameter is listed below as name (side, type), followed by its description and an example value.

ev_cat (Client, String): ev_cat means "Event Category". This parameter describes the main category of the event. Example: "page_view".

t_client (Client, Long): Timestamp set on the client side when the event occurs (in ms). Example: 1399380663161.

t_server (Server, Long): Timestamp set on the server side when the event is handled by the proxy server. Example: 1399380678682.

key (Client, String): Public key of the merchant. Unique key linked to the merchant's store from which the analytics event comes. Example: "E5AAB356-0EC2-4916-7FF9-AD73A347955E".

instanceID (Server, String): AWS identifier of the EC2 proxy instance that handled the request. Example: "i-21e6c972".

ref (Server, String): Referer. URL address of the webpage the customer is coming from. Example: "http://www.site.com/64-acer-aspire-5742-series.html".

cookie (Server, String): List of cookies (each cookie is a key/value) sent by the HTTP request when requesting the GIF image. Example: "_findify_visit=jIVh0oaL7XHU&_findify_uniq=m5Y1lsizUvZ".

lang (Server, String): List of languages set in the user's browser. Example: "en-US".

ua (Server, String): User Agent of the browser used by the customer. The user agent contains information about the browser, the operating system and the device used by the customer. Example: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36".

ip (Server, String): IP of the customer that made the request. Example: "203.0.113.1".

url (Client, String): URL address of the current page in the customer's browser. Example: "http://www...".

host (Client, String): Domain name of the web host. When the request comes from the merchant's store, the host is equal to the domain of the store. Example: "www.site.com".

visit (Client, Boolean): Indicates whether a "visit" cookie already existed for the session. The value "true" means that the customer came to the website in the last 30 minutes; the value "false" means that the current session is new. Example: false.

visit_id (Client, String): Unique key (UUID of 16 characters) attached to a customer's visit (see part 2.3.1). Example: "jIVh0oaL7XHU".

uniq (Client, Boolean): Indicates whether a "uniq" cookie was already present. The value "true" means that the customer has already visited a page whose search is provided by Findify; the value "false" means that the customer is new. Example: true.

uniq_id (Client, String): Unique key (UUID of 16 characters) attached to a customer when they first come to a store whose search is provided by Findify. Example: "m5Y1lsizUvZ".
It is important to notice that the meaning of the cookies "visit" and "uniq" can be altered if the customer clears their browser's cookies. If that happens, the value of the "uniq" cookie will be different and the analytics system will lose track of the customer: it will not be possible to recreate the customer's whole browsing history.
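Putting these parameters together, a single page_view log entry might look like the illustrative object below. The field names and example values are taken from the table above; the exact JSON layout used in production is not shown in this thesis and is therefore an assumption.

// Illustrative page_view event assembled from the example values above
// (layout assumed, not the production format).
var pageViewEvent = {
  ev_cat: 'page_view',
  t_client: 1399380663161,
  t_server: 1399380678682,
  key: 'E5AAB356-0EC2-4916-7FF9-AD73A347955E',
  instanceID: 'i-21e6c972',
  ref: 'http://www.site.com/64-acer-aspire-5742-series.html',
  cookie: '_findify_visit=jIVh0oaL7XHU&_findify_uniq=m5Y1lsizUvZ',
  lang: 'en-US',
  ua: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 ...',
  ip: '203.0.113.1',
  url: 'http://www.site.com/',      // illustrative value
  host: 'www.site.com',
  visit: false,
  visit_id: 'jIVh0oaL7XHU',
  uniq: true,
  uniq_id: 'm5Y1lsizUvZ'
};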
Box click Event
The structure of the box click event is identical to that of the page view event. The only difference is the field ev_cat, which takes the value "box_click".
Suggestion click Event
The structure of the suggestion click event is similar to that of the page view event. The field ev_cat takes the value "page_view" (see part 2.3.1 - clickthrough, which explains that the suggestion click event is sent as a page view). Two other fields have been added to the event:
ev_type (Client, String): Type of the suggestion clicked by the customer (search suggestion or product suggestion). Example: "autocomplete-query" or "autocomplete-product".

ev_val (Client, String): Object on which the customer clicked (search term or product id). Example: "acer 67" or "561893".
Autocomplete Query Event
The autocomplete query event is logged on the server side, at the level of the proxy application, when a letter is typed by a customer in the search box. In this case, the Javascript client library does not send any event, because the proxy has all the information required to log it. The structure of the event is a bit different from the client-side events (several of these parameters were already present in the client-side events):
Each parameter is listed below as name (type), followed by its description and an example value.

ev_cat (String): This parameter describes the main category of the event. Example: "server-autocomplete".

q (Query): This parameter is composed of other parameters and contains information about the autocomplete query.

q.q (String): This parameter corresponds to the letters typed in the search bar. Example: "chai" (for "chaise").

t_client (Long): Timestamp of the HTTP request received by the proxy server. Example: 1399380663161.

t_server (Long): Timestamp set on the server side when the event is handled by the proxy server. Example: 1399380678682.

key (String): Public key of the merchant. Unique key linked to the merchant's store from which the analytics event comes. Example: "E5AAB356-0EC2-4916-7FF9-AD73A347955E".

d (Long): Duration taken by the search request to be processed by the search backend. Example: 26.

results (Object): Object that contains two arrays: "product_ids" and "suggest_ids".

product_ids (Array): Array of Strings containing the product ids suggested in the autocomplete overlay after a customer typed a letter. Example: ['1590', '56', '23', '12'].

suggest_ids (Array): Array of Strings containing the terms suggested in the autocomplete overlay after a customer typed a letter. Example: ["asus", "asus eee", "asus eee pc", "asus rogs"].

instanceID (String): Identifier of the EC2 proxy instance that handled the request. Example: "i-21e6c972".

ref (String): Referer. Identifies the address of the webpage the customer comes from. Example: "http://www.touchedeclavier.com/64-acer-aspire-5742-series.html".

cookie (String): List of cookies (each cookie is a key/value) sent with the HTTP request. Example: "_findify_visit=jIVh0oa5lrhL&_findify_uniq=m5Y1lsizQov".

lang (String): List of languages set in the user's browser. Example: "en-US".

ua (String): User Agent of the browser used by the customer. The user agent contains information about the browser, the operating system and the device used by the customer. Example: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36".

ip (String): IP of the customer that made the request. Example: "203.0.113.1".
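As with the page_view event, the nesting of the q and results parameters can be illustrated with a hedged example. The values are taken from the list above; the exact nesting and JSON layout used in production are assumptions.

// Illustrative server-autocomplete event (layout assumed).
var autocompleteEvent = {
  ev_cat: 'server-autocomplete',
  q: {
    q: 'chai'                       // letters typed so far, for "chaise"
  },
  t_client: 1399380663161,
  t_server: 1399380678682,
  key: 'E5AAB356-0EC2-4916-7FF9-AD73A347955E',
  d: 26,                            // backend processing duration
  results: {
    product_ids: ['1590', '56', '23', '12'],
    suggest_ids: ['asus', 'asus eee', 'asus eee pc', 'asus rogs']
  },
  instanceID: 'i-21e6c972',
  ref: 'http://www.touchedeclavier.com/64-acer-aspire-5742-series.html',
  cookie: '_findify_visit=jIVh0oa5lrhL&_findify_uniq=m5Y1lsizQov',
  lang: 'en-US',
  ua: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 ...',
  ip: '203.0.113.1'
};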
2.5 Overview
This part has introduced the first piece of Findify's analytics platform:
the trackers. Trackers have been implemented on the client side and on the server side in order to save customers' actions (page views, box clicks, letters typed, suggestions clicked). The following points have been studied and implemented in this chapter:
• The customers' actions Findify tracks, which led to the development of a tracking code in the Javascript client library that detects these actions and creates events.
• The setup of cookies to track users across their browsing sessions.
• The definition of the events’ format (page_view, box_click, server-autocomplete).
• The development of the part of the proxy application that handles the reception of the events and adds further information related to the server environment.
158 lines of code have been written for the tracking code on the client side, and 42 lines of code for the handling of the events on the server side.
Chapter 3
Collection of the data
3.1 Introduction
Client-side (page view, box click, suggestion click) and server-side (autocomplete query) events are received by the proxy. Because Findify's architecture is distributed in order to handle a high number of requests, several proxy servers can receive requests (search queries or analytics events). All the received analytics events need to be saved in permanent storage to be aggregated by the processing system.
This part deals with the collection of the logs from the proxy to a permanent storage. After having defined the requirements for the collecting system, several possible solutions are studied. The actual solution, currently used in our production environment, is then described.
The collecting system is the second piece of the analytics platform, as figure 3.1 shows.
Figure 3.1. The collection of the data, second piece of the analytics platform
3.2 Requirements
The collecting system has a simple function: taking logs from the proxy machines and storing them all in a permanent storage. As will be seen in chapter 4, Findify wants to use the AWS[11] service Elastic MapReduce[13] (EMR). Because
EMR can only retrieve data from S3[16]1 or DynamoDB[17], the chosen solution will store data in one of these two services. The type of storage is a bit different between S3 and DynamoDB though: S3 is a cloud-based file system and DynamoDB is a database.
Several requirements must be respected by the collecting system:
1. The computations made by the proxy machines should be minimized when sending a log to the collecting system. The proxy machines should not run heavy processes dedicated to the logs.
2. The collecting system must be highly available and fault-tolerant: logs cannot be lost. They represent the behaviours of the customers and need to be collected as a whole.
3. The chosen solution scales automatically depending on the amount of logs sent.
4. The chosen solution should not cost more than $50.
5. If the permanent storage is S3, the logs need to be packaged before being stored. Packaging the logs will decrease the amount of time the processing system takes to retrieve all the files.
Figure 3.2. Input, the collecting system and output
1 Amazon S3, or Simple Storage Service, is an AWS service for storing data.
3.3 Syslog server
The first approach is based on one or several dedicated instances called "syslog"
servers. The syslog server receives events from the proxy machines and stores them in a local buffer. When the buffer reaches its maximal size, the server packages a file and sends it to the permanent storage.
Figure 3.3. Using a Syslog server
1. Minimization of the proxy computation: this requirement is respected because the proxies send their events directly to a syslog server.
2. High availability & Fault-tolerance: The high availability requirement is respected if two instances are available: if one crashes, events will be sent to the second instance. The fault-tolerance requirement is also respected because the syslog servers store their buffers on EBS[18] volumes. If an instance crashes and restarts, the data will be preserved.
3. Automatic scaling: The syslog servers can scale in and out depending on the load, thanks to the service AutoScaling[19] which ensures that if the number of requests increases, the syslog servers will adapt themselves to support the load.
4. Price under $50: This requirement is not respected. A small instance costs
$30. If two instances are used to respect the high availability requirement, the cost will be $60. With only one instance, the high availability requirement is not respected, because AWS cannot prevent underlying hardware issues that lead to an instance crashing. A single instance is therefore a single point of failure, and the proxy servers would no longer be able to send their events.
5. Packages to S3: This requirement is respected because the syslog servers package a specific amount of events into each file.
3.4 Direct upload to S3
One solution encouraged by AWS[20] is to upload the logs directly from the proxy to S3.
Figure 3.4. Direct upload to S3 from the proxy
1. Minimization of the proxy computation: This requirement is not respected because each proxy machine would have two main tasks: handling the requests made by the customers, and saving and packaging the events into a file before pushing it to S3.
2. High availability & Fault-tolerance: The high availability and fault-tolerance requirements are ensured because the operations are done on the proxy server.
3. Automatic scaling: The automatic scaling is handled by the proxy servers.
4. Price under $50: This solution does not cost anything, so this requirement is respected.
5. Packages to S3: This requirement is respected because the packages are made on the proxy machine.
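For reference, a direct upload from the Node.js proxy could have looked roughly like the sketch below, using the AWS SDK for Node.js. The bucket name, key layout and buffering threshold are assumptions; the sketch also makes visible why the first requirement is not met, since the buffering and packaging work runs on the proxy itself.

// Hypothetical sketch of direct upload from the proxy (not the chosen solution).
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

var buffer = [];                 // events buffered on the proxy itself
var MAX_BUFFERED_EVENTS = 5000;  // assumed packaging threshold

function logEvent(eventJson) {
  buffer.push(JSON.stringify(eventJson));
  if (buffer.length >= MAX_BUFFERED_EVENTS) {
    var body = buffer.join('\n');
    buffer = [];
    // Package the buffered events into one S3 object (key layout assumed)
    s3.putObject({
      Bucket: 'findify-analytics-logs',            // assumed bucket name
      Key: new Date().toISOString() + '.log',
      Body: body
    }, function (err) {
      if (err) {
        console.log('Error: upload to S3 failed', err);
      }
    });
  }
}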
3.5 SQS and an EC2 instance
A solution respecting all the requirements was to send the logs from the proxy to Amazon SQS[21]. SQS is a messaging queue. The proxy instances act as producers:
they put messages into the queue (#1) by giving the name of the queue and the message. A consumer instance then retrieves messages (#2): a retrieved message is
"locked" and no other consumer is able to retrieve it. Once the
consumer has dealt with the message, it tells SQS to remove it (#5). Figure 3.5 represents this architecture.
Figure 3.5. Using a queue SQS and a consumer EC2
Visibility timeout
The visibility timeout is a configurable time parameter of the SQS queue. If the EC2 instance does not transfer the messages retrieved from SQS to S3 before the visibility timeout (if the instance crashes, for instance), the messages are put back into the SQS queue. Another EC2 instance or a rebooted machine is then able to retrieve the messages that have not been dealt with previously. If the period for sending packages of logs to S3 is smaller than the visibility timeout, then we ensure that a log taken from the queue will either be pushed to S3 or put back into SQS for future processing.
Retention period
For each SQS queue, a retention period can be defined: from 60 seconds to 14 days.
After the retention period has elapsed, SQS automatically deletes the messages that have been in the queue for longer than that period. No messages should be lost as long as a new instance is brought up within the retention period to deal with them.
Architecture from this knowledge
Setting up AutoScaling can add automation to this setup. Using AutoScaling, a new instance will automatically be provisioned should the first instance fail. Another benefit of AutoScaling is that if processing the messages takes too long (heavy load), it is possible to add more capacity by increasing the number of instances allowed to retrieve messages from SQS, and then scale back down after the peak.
Because of the lack of time to develop the consumer application, this solution has not been implemented.
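Although this solution was not implemented, the consumer loop could have looked roughly like the following sketch, using the AWS SDK for Node.js. The queue URL, bucket name, timeouts and packaging threshold are assumptions; the point is the receive / package / delete cycle, where a message that is not deleted before the visibility timeout reappears in the queue.

// Hypothetical sketch of the SQS consumer (this solution was not implemented).
var AWS = require('aws-sdk');
var sqs = new AWS.SQS();
var s3 = new AWS.S3();

var QUEUE_URL = 'https://sqs.eu-west-1.amazonaws.com/123456789012/analytics-logs'; // assumed
var BUCKET = 'findify-analytics-logs';                                             // assumed
var buffer = [];   // buffered log lines
var handles = [];  // receipt handles of the buffered messages

function flushToS3(done) {
  // Package the buffered logs into one S3 object (key layout assumed)
  s3.putObject({
    Bucket: BUCKET,
    Key: new Date().toISOString() + '.log',
    Body: buffer.join('\n')
  }, done);
}

function pollQueue() {
  sqs.receiveMessage({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20,      // long polling
    VisibilityTimeout: 300    // must be larger than the period between S3 pushes
  }, function (err, data) {
    if (!err && data.Messages) {
      data.Messages.forEach(function (message) {
        buffer.push(message.Body);
        handles.push(message.ReceiptHandle);
      });
      if (buffer.length >= 5000) {  // assumed packaging threshold
        flushToS3(function (uploadErr) {
          if (uploadErr) { return; } // messages reappear after the visibility timeout
          // Deleting only after a successful upload: a crash before this point
          // lets the messages become visible again in the queue.
          handles.forEach(function (handle) {
            sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: handle },
              function () {});
          });
          buffer = [];
          handles = [];
        });
      }
    }
    pollQueue();
  });
}

pollQueue();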
3.6 First solution with DynamoDB
A solution based on DynamoDB has then been implemented and deployed in production, as figure 3.6 shows. DynamoDB[22] is a key-value NoSQL database offered by Amazon. Its advantages are its high speed and its fault-tolerance and high-availability guarantees.
Figure 3.6. Using DynamoDB
1. Minimization of the proxy computation: This requirement is respected because the proxy sends an asynchronous call to DynamoDB with the event to save.
2. High availability & Fault-tolerance: This requirement is respected because DynamoDB is natively highly available and fault tolerant.
3. Automatic scaling: This requirement is respected because the throughput of DynamoDB can be adjusted automatically.
4. Price under $50: This requirement is not respected because retrieving all the logs from DynamoDB into EMR is a very expensive operation. Done several times a month, this solution is not cost-efficient.
An advantage of this solution is the possibility of querying with a partial SQL-like language (it is possible to filter the columns in the key-value model) to retrieve specific information.
3.7 Actual solution using Kinesis and S3
Retrieving all the logs each time the processing system runs is not cost-efficient. It was therefore decided to go with a more flexible solution that respects all the requirements.
The solution currently in production is based on Kinesis[23]. Kinesis directly fits Findify's need by offering a service that can collect hundreds of terabytes of data per hour from multiple sources and deal with them in real time. At the time the SQS solution with the EC2 consumer was designed, Kinesis had not been released.
Figure 3.7. Actual solution using Kinesis
As seen in figure 3.7, the Kinesis solution is very similar to the solution with SQS and the EC2 consumer. The proxy instances (producers) push events in JSON format into a Kinesis "stream" (#1). Then, the EC2 machine (the consumer) retrieves the logs in real time from this stream (#2). AWS provides a
Java SDK to retrieve the messages, and a Java application has been implemented on the EC2 consumer. When receiving a log, the Kinesis application checks the validity of the log and stores it in a local buffer. Eventually, the local buffer is transformed into a package of logs (one line per JSON log) and sent to S3 (#3).
3.7.1 Details about the solution
Kinesis is a stream-based service which retains data for 24 hours. Behind the scenes, each stream has a specific number of "shards". The higher the number of shards, the more data throughput the Kinesis application can handle. By launching one Kinesis consumer application per shard, it becomes possible to scale the stream horizontally, each application being responsible for its shard. When launching more Kinesis applications, no ordering is ensured on the storage in S3; however, it is possible to sort the logs in the processing system.
The Kinesis application, thanks to the SDK, logs its current status, i.e. the last message that has been processed (sent to S3), to a DynamoDB table. This ensures that if the EC2 machine crashes, no data will be lost: when booting again, the application reads the status in DynamoDB and resumes processing from the moment it stopped. This is only true if the EC2 machine restarts within 24 hours (the retention period).
The EC2 machine has been placed in an AutoScaling group, which allows more machines to be added to handle the stream in case the load on Kinesis is too high. The minimum number of machines has been set to 1. Thanks to this configuration, if the AutoScaling group detects that an instance has crashed or has CPU and memory usage higher than a specific threshold, it will launch a new machine automatically.
3.7.2 From the proxy to Kinesis
Figure 2.5 shows that the proxy calls the function "log" of the module
"logging" with the parameters. A part of this function is displayed in listing 4.
3.7.3 The Kinesis consumer application
The Kinesis consumer application reads the logs that are pushed into the Kinesis stream by the proxy. It contains 1378 lines of code.
The function processRecords is called in order to deal with a list of records pulled from the Kinesis stream. This function is presented in listing 5.
Each record is transformed into a generic type T. In this program, the type T is a String, because it contains the event in JSON format. After having formatted the event as a String, the function retrieves the last item saved in the buffer. One of the requirements for the collecting system was for the events to be packaged into a file before being pushed to S3. A file of logs is pushed if one of the following conditions is reached:
// data_json is the log object with the parameters inside.
// It needs to be encoded in base64 to be handled by Kinesis.
var data_base64 = new Buffer(JSON.stringify(data_json)).toString('base64');

// Push the log into Kinesis
Kinesis.putRecord({
  Data: data_base64,                                  // Log to save
  PartitionKey: config.aws['Kinesis-PartitionKey'],   // Groups data by shard
  StreamName: config.aws['Kinesis-StreamName']        // Name of the Kinesis stream
}, function(err, data) {
  if(err) {
    // Findify's monitoring system
    newrelic.noticeError('AWS kinesis logging failed');
    console.log('Error:', 'AWS kinesis logging failed', err, data_json);
  }
});
Listing 4: Sending of an event from the proxy to Kinesis
• The amount of logs stored in the buffer exceeds 5000 logs.
• The size of the buffer exceeds 5 MB.
• The day of the current processed log is different from the day of the last log buffered.
In listing 5, after the last item saved in the buffer has been retrieved, the third condition is checked: the day of the last saved item is compared to that of the item currently being processed. If the day is different, the buffer of logs belonging to day N needs to be pushed to S3, so that the current item, belonging to day N+1, can be pushed to an empty buffer. If the day is the same, the buffer consumes the record through the method consumeRecord. Then, the first two conditions (amount of logs and size of the buffer) are checked to know whether the buffer needs to be pushed to S3 (listing 6).
The function emit, presented in listing 7, pushes the buffer of events to S3.
Line 5 calls the function shown in listing 8, which sends the buffer of events to S3. A retry mechanism has been implemented to handle failures of the call emitter.emit. Besides, if the variable unprocessed is empty, as checked on line 6, the buffer is cleared (line 16) and a checkpoint is sent to DynamoDB in order to acknowledge that the messages have been dealt with and sent to S3: these messages will not be read again by the Kinesis consumer application.
The call emitter.emit in listing 7, at line 5, calls the function in listing 8.
public void processRecords(List<Record> records,
                           IRecordProcessorCheckpointer checkpointer) {
  // Transform each record and add it to the buffer
  for (Record record : records) {
    try {
      T transformedRecord = transformer.toClass(record);
      // If the new record has not been saved on the same day as the last record,
      // we emit the buffer to ensure the separation of the events by day.
      T lastRecord = buffer.getLastRecord();
      if (lastRecord != null && !sameDay(lastRecord, transformedRecord)) {
        List<U> emitItems = transformToOutput(buffer.getRecords());
        emit(checkpointer, emitItems);
      }
      if (filter.keepRecord(transformedRecord)) {
        buffer.consumeRecord(transformedRecord,
            record.getData().array().length,
            record.getSequenceNumber());
      }
    } catch (IOException e) {
      LOG.error(e);
    }
  }
  // Emit when the buffer is full
  if (buffer.shouldFlush()) {
    List<U> emitItems = transformToOutput(buffer.getRecords());
    emit(checkpointer, emitItems);
  }
}
Listing 5: Processing the records with the Kinesis consumer application
public boolean shouldFlush() {
return (buffer.size() > getNumRecordsToBuffer()) ||
(byteCount.get() > getBytesToBuffer());
}
Listing 6: When does the buffer need to be flushed to S3?
The parameter of the function is the buffer of Strings. On lines 4-5, the last record included in the buffer is extracted to determine its day. The name of the file that will be pushed to S3 is then generated on line 11, and the path of the file on the S3 file system is created on line 13. All the records are then written into a file, which is pushed to S3 using the Java SDK provided by AWS. If the file has been successfully transferred to S3, the method emit returns an empty collection that becomes the base for the new buffer.
1 private void emit(IRecordProcessorCheckpointer checkpointer, List<U> emitItems) {
2 List<U> unprocessed = new ArrayList<U>(emitItems);
3 try {
4 for (int numTries = 0; numTries < retryLimit; numTries++) {
5 unprocessed = emitter.emit(new UnmodifiableBuffer<U>(buffer, emitItems));
6 if (unprocessed.isEmpty()) {
7 break;
8 }
9 try {
10 Thread.sleep(backoffInterval);
11 } catch (InterruptedException e) {}
12 }
13 if (!unprocessed.isEmpty()) {
14 emitter.fail(unprocessed);
15 }
16 buffer.clear();
17 // checkpoint once all the records have been consumed
18 checkpointer.checkpoint();
19 } catch (IOException | KinesisClientLibDependencyException | InvalidStateException
20 | ThrottlingException | ShutdownException e) {
21 LOG.error(e);
22 emitter.fail(unprocessed);
23 }
24 }
Listing 7: Emit the buffer of events to S3 (1/2)
3.7.4 Results
Figure 3.8 shows the graphical interface of the Amazon S3 service. For the 5th month of 2014, events are grouped by day, as expected.
1 public List<String> emit(final UnmodifiableBuffer<String> buffer)
2 throws IOException {
3 List<String> records = buffer.getRecords();
4 // Get the last record
5 String lastRecord = records.get(records.size()-1);
6 // Get the calendar timestamp of the last record
7 // to know in which location to put the file
8 CalendarUtils utils = new CalendarUtils();
9 Calendar calendar = utils.getCalendarRecord(lastRecord);
10 // UUID for the file
11 String uuid = getRandomUUID();
12 // Get the S3 filename
13 String s3FileName = getS3FileName(calendar, uuid);
14 // Write all of the records to a compressed output stream
15 PrintWriter writer = new PrintWriter(uuid, "UTF-8");
16 for (String record : records) {
17 writer.println(record);
18 }
19 writer.close();
20
21 File object = new File(uuid);
22 try {
23 s3client.putObject(s3Bucket, s3FileName, object);
24 LOG.info("Successfully emitted " + buffer.getRecords().size() +
25 " records to S3 in s3://" + s3Bucket + "/" + s3FileName);
26 return Collections.emptyList();
27 } catch (AmazonServiceException e) {
28 LOG.error(e);
29 return buffer.getRecords();
30 } finally {
31 object.delete();
32 }
33 }
Listing 8: Emit the buffer of events to S3 (2/2)
Figure 3.9 shows the content of the directory representing the 5th of May 2014. The first four files of events were pushed because the number of logs reached 5000 events. The last file of events (at the bottom) was cut short to respect the third condition: the segmentation of the events by day.
Figure 3.9. For each day, several files are present
This separation provides a flexible way to analyse the logs on a daily basis. For instance, if a very high load happens on a specific day, the computing system is able to run its analysis on the logs generated on that specific day, instead of retrieving all the logs and filtering only the ones that belong to that day (in the future, gigabytes or terabytes of logs will be collected).
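The exact S3 key format produced by the Java helper getS3FileName in listing 8 is not given; the sketch below, kept in the same Node.js style as the earlier snippets, only illustrates the kind of daily-partitioned key (year/month/day prefix plus a random file name) that matches the directories visible in figures 3.8 and 3.9. The prefix layout is an assumption.

// Hypothetical sketch of a daily-partitioned S3 key (exact production layout unknown).
function dailyS3Key(date, uuid) {
  function pad(n) { return n < 10 ? '0' + n : '' + n; }
  return date.getUTCFullYear() + '/' +
         pad(date.getUTCMonth() + 1) + '/' +
         pad(date.getUTCDate()) + '/' + uuid;
}

// Example: a key under which a package of logs from 5 May 2014 could be stored
console.log(dailyS3Key(new Date(Date.UTC(2014, 4, 5)), 'a1b2c3d4'));
// -> "2014/05/05/a1b2c3d4"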
3.8 Overview
Collecting logs from multiple sources and storing them in permanent storage is now possible thanks to a Kinesis-based solution. The proxy, as producer, pushes logs into the Kinesis stream. The Kinesis application consumes the available logs in real time, analyses them and transfers packages of them to S3. Furthermore, the following items have been studied and implemented:
• The requirements to choose a solution: minimization of the computation for sending logs, high availability, fault-tolerance, scalability and price.
• Several options based on AWS services that partially fit our requirements, summarized in the table below.
                    Syslog   Upload to S3   SQS and EC2   DynamoDB   Kinesis
Proxy computation   OK       NOK            OK            OK         OK
HA and FT1          OK       OK             OK            OK         OK
Automatic scale     OK       OK             OK            OK         OK
Price               NOK      OK             OK            NOK        OK
Complexity          OK       OK             NOK           OK         OK

1 High availability and Fault-Tolerance
• The final solution based on Kinesis, including the proxy as the producer, an EC2 instance running a Java application as the consumer, and an end-point, S3, where files are saved daily.
In the next chapter, the processing of the logs through the processing system is studied. Analytics that can be generated from the data collected are also presented.
Chapter 4
Processing and Analytics
In chapter 3, the collection of the logs from the proxy to permanent storage, Amazon S3, was studied. In order to create analytics, the collected logs need to be analysed and aggregated. It is the role of the big data system to do so, processing them so that analytics can be produced.
Figure 4.1. The processing system, third piece of the analytics platform
This chapter is composed of three main parts. The first part introduces the processing system chosen for Findify. The general software design architecture is then presented, before diving into the flow of analytics generated. Most of the algorithms presented in this chapter are confidential and will not be shown. However, all the algorithms have been implemented, and the program written to generate analytics is composed of 1761 lines of code.
There are two main types of analytics:
• External Analytics: The analytics for the merchants, made available via a weekly report.
• Internal Analytics: The analytics for internal purposes: understanding customers' behaviours and improving our product.
4.1 Requirements
The main parts of the processing system are:
1. Retrieve the logs stored in Amazon S3. (External & Internal Analytics)
2. Enrich them: cleaning and filtering operations. (External & Internal Analytics)
3. Make transformations over the data. (External Analytics only)
4. Produce analytics that can be queried. (External Analytics only)
Figure 4.2. A view of the computing system
The computing system is the central piece of the analytics platform. It must be able to extract the logs for generating analytics and, in the future, run machine learning algorithms in order to improve the relevancy of the Findify search engine.
Several computing systems and platforms have been studied.
Elastic MapReduce & Hadoop
Elastic MapReduce (EMR) is an Amazon service that provides a Hadoop cluster.
Amazon EMR is able to process data across a resizable cluster of EC2 instances that automatically scales depending on the need. Monitoring, security and optimal tunability of the Hadoop configuration are provided by EMR, which makes it a very interesting platform to begin with, considering that the configuration of a Hadoop cluster is not simple.
Hadoop is a large-scale processing system able to process datasets across a cluster of machines. Hadoop stores and retrieves data from HDFS (the Hadoop Distributed File System). MapReduce is the programming model used by Hadoop to process data in a distributed way. YARN is another module of the Hadoop system, managing the cluster resources. Other applications can be set up on top of Hadoop, for instance Apache Hive, which lets users query the data using a SQL-like language.
Databases
Using a relational database like PostgreSQL is a viable solution if the amount of logs is smaller than 4 TB[24]. Indeed, the MapReduce paradigm can be reduced to a SELECT/GROUP BY in SQL. However, above 4 TB1, the data will need to be distributed
1The current highest capacity for a hard drive is 4TB.