
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer Science

Master thesis, 30 ECTS | Information Technology

2017 | LIU-IDA/LITH-EX-A--17/012--SE

Identifying early usage patterns that increase retention rates in a mobile web browser

Att identifiera tidiga användarmönster som ökar användares återvändningsfrekvens

Pontus Persson

Supervisor: Zlatan Dragisic
External supervisor: Magnus Gasslander
Examiner: Patrick Lambrix



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Pontus Persson


Abstract

One of the major challenges for modern technology companies is user retention management. This work focuses on identifying early usage patterns that signify increased retention rates in a mobile web browser. This is done using a targeted parallel implementation of the association rule mining algorithm FP-Growth. Different item subset selection techniques, including clustering and other statistical methods, have been used in order to reduce the mining time and allow for lower support thresholds.

Many interesting rules have been mined. The best rule retention-wise implies a retention rate of 99.5%. The majority of the rules analyzed in this work imply a retention rate increase of between 150% and 200%.


Acknowledgments

I would like to express my deepest gratitude to my supervisor at Opera, Magnus Gasslander, and to my examiner at Linköping University, Patrick Lambrix, for their support and feedback throughout this project. I owe particular thanks to Hans-Filip for his advice and thoughts given to me over countless ping-pong matches while at Opera.

Finally, I would like to express my sincere gratitude to my family and close friends for their support and encouragement throughout my thesis work.

Linköping, June 2017 Pontus Persson


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Background
  2.1 Association Rule Mining
    2.1.1 Apriori Algorithm
    2.1.2 FP-growth
    2.1.3 Targeted association rule mining
    2.1.4 Evaluating Association Rules
  2.2 Association Rule Mining with Clustering
    2.2.1 DBSCAN
    2.2.2 K-Means
  2.3 Feature Selection
    2.3.1 Variance Threshold
    2.3.2 Pearson's χ² Test
  2.4 User Retention Studies
3 The Data Set
4 Work Method
  4.1 Requirements
  4.2 Analysis
  4.3 Design
  4.4 Implementation
  4.5 Evaluation
5 Method
  5.1 Event Selection
    5.1.1 Mini-batch K-Means Clustering
    5.1.2 DBSCAN clustering
    5.1.3 Low variance filtering
    5.1.4 Pearson's χ² Test
    5.1.5 Manual Selection
    5.1.6 No Selection
  5.2 Rule Mining
  5.3 Evaluation
6 Results
  6.1 Mini-batch K-Means Clustering
  6.2 DBSCAN
  6.3 Variance Threshold
  6.4 Pearson's χ² Test
  6.5 Manual Feature Selection
  6.6 No selection
7 Discussion
  7.1 Results
    7.1.1 Event Selection
    7.1.2 Rule Mining
  7.2 Method
    7.2.1 Preprocessing
    7.2.2 Event Selection
    7.2.3 Rule Mining
    7.2.4 Evaluation
    7.2.5 Replicability, Reliability and Validity
  7.3 The work in a wider context
8 Conclusion
  8.1 Future Work
Bibliography
A Event Codes
B Manually Selected Events


List of Figures

2.1 FP-tree Example
3.1 Key-value distributions
3.2 Data set analysis
5.1 Experiments overview
6.1 Cluster sizes


List of Tables

3.1 Sample of data set table
5.1 Number of clusters
5.2 DBSCAN Parameters
5.3 Variance Thresholds
5.4 Significance levels
5.5 Retention rates
5.6 Support thresholds
6.1 Number of rules mined from K-Means clusters
6.2 Antecedent for top lift K-Means rules
6.3 Top K-Means rules based on lift ratio
6.4 Second selection of rules mined from K-Means clusters
6.5 Antecedent for second selection of K-Means rules
6.6 DBSCAN Clusters
6.7 Number of rules generated from DBSCAN clusters
6.8 Top DBSCAN cluster based rules sorted by lift
6.9 Antecedent of top lift DBSCAN rules
6.10 Second selection of DBSCAN based rules
6.11 Antecedent for the second selection of DBSCAN rules
6.12 Kept events after variance threshold filtering
6.13 Number of rules generated from low variance items
6.14 Top lift rules mined from low variance events
6.15 Antecedents for top lift low variance rules
6.16 The second selection of low variance based rules
6.17 Antecedents for the second selection of low variance based rules
6.18 Rules found using manually selected items
6.19 Top lift rules based on manually selected events
6.20 Antecedents for top lift rules based on manually selected events
6.21 The second selection of rules mined from manual events
6.22 Antecedents for the second selection of rules mined from manual events
6.23 Number of rules mined from all events
6.24 Top lift rules mined from all events
6.25 Antecedents for top lift rules mined from all events
6.26 The second selection of rules mined from all events
6.27 Antecedents of the second selection of rules mined from all events
7.1 Events in the rules with the highest confidence
7.2 Second selection of DBSCAN rules with events clarification
7.3 Event descriptions for low variance rules


1 Introduction

One of the major challenges for modern technology companies is user retention management. More, and more active, users drive up ad revenue, buy more goods and help spread the word about a company's product or service. It is also generally more expensive to acquire new users than to retain current users [17]. The analytics company Localytics has found that only 42% of users return at least once in the first month [16]. This study was based on the customer analytics recorded by Localytics, which include websites, mobile games and other companies with a large internet presence. Companies constantly try to combat this steep drop-off by creating intuitive user interfaces, on-boarding processes and more.

This project investigates whether the usage patterns of retained users differ from the usage patterns of non-returning users, based on the usage during the first few sessions. Given such differences, can patterns discovered in retained users' behaviour be used to improve a product's user experience and overall retention rate? We approach this problem from a data mining point of view and use association rule mining to discover features used or actions taken by users that signify higher retention.

1.1 Motivation

Having indicators that signify user behaviour related to high user retention rates can be useful in many ways to the company behind a product. Designers, for instance, can use the indicators when designing a better on-boarding experience for new users. If a new user is encouraged to explore certain features, or to engage in the application in other ways, that have previously been shown to correlate with continued usage, the probability of the new user also continuing to use the application may improve.

As an example of this, consider an instant messaging application. Two of the main features are sending and receiving messages. Assume that an indicator of increased user retention rates for this application is that a user has received at least 5 messages and sent at least one message during the first two days since installing the application. Developers could then implement a "Hey! Your friend John Doe just joined, why not send him a message?" alert going out to all John Doe's friends in order to incite them to send him a message. John, having received messages from his friends, might then respond to some of them. Deploying on-boarding techniques like this could work to improve John's experience and the probability of him continuing to use the application.


1.2 Aim

The major aim of this thesis is to extract association rules from a data set of session logs in order to find indicators that correlate with user retention of a specific mobile web browser. To maximize the interestingness and quality of the found rules, this thesis also aims to evaluate different feature selection techniques, including clustering and correlation tests. These techniques are to be used as a preprocessing step to the actual association rule mining.

1.3 Research questions

To fulfill the aim of this work the following research questions have been stated. The first two questions are directly related to the aim of this thesis. The third question is answered prior to the first two in order to use the answer when designing the method for selecting events and mining association rules.

1. Are there differences between the behaviour of returning and non-returning users, and can association rules be mined that reveal user retention indicators? The data set used contains actions taken by users during the first three hours after installing the application. The aim of this question is to find subsets of users that have all taken a specific set of actions or used some specific features, and that also have a higher-than-average return rate. Association rule mining will be used for finding the sets of actions or features.

2. How should events be selected? The data set contains a large number of distinct user actions and events. Mining association rules using all of them might be infeasible due to the run time complexity of available algorithms. This question aims to find good subsets of events to perform the mining process on.

3. What does the data set contain and how can it be interpreted to fit existing algorithms and methods? In order to get a better understanding of the data set at hand, and to be able to make informed decisions about how to handle the data, some initial analysis of the data set is needed.

1.4 Delimitations

It is not the goal of this work to provide instructions on how to improve an on-boarding process in order to increase the retention rate. The study is limited to one specific mobile application, more specifically a mobile web browser for Android. The results and conclusions might or might not be applicable to other applications; this is discussed further in Chapter 7.

Other machine learning approaches for predicting the retention rate exist, but are not the topic of this work.


2 Background

In order to propose a trustworthy method with relevant scientific bases in Chapter 5, this chapter describes underlying theories as well as related work. This includes work on discovering user retention indicators, previously successful association rule mining algorithms, and feature selection techniques.

2.1 Association Rule Mining

Association rule mining is a form of rule-based machine learning used for finding interesting relationships between variables in large data sets. The following problem description comes from the initial paper on association analysis by Agrawal et al. [1].

The problem stated for association analysis can be formalized as follows. Given a set of items $I = \{i_1, i_2, \ldots, i_n\}$ and a set of transactions $T = \{t_1, t_2, \ldots, t_n\}$, each $t_k \in T$ is a binary array. If the $j$-th position in $t_k$ is set ($t_k[j] = 1$), the transaction is said to contain item $i_j$. An item set is a set of items, and an association rule is an implication between two item sets, $X \Rightarrow Y$. Methods for evaluating how interesting a rule is differ between algorithms; two of the most widely used and straightforward metrics are support and confidence, used by the Apriori algorithm [2], Apriori TID [19], FP-growth [12] and Eclat [35].

\[
\mathrm{support}(X) = P(X) = \frac{\lvert \{\, t \mid X \subseteq t,\ t \in T \,\} \rvert}{\lvert T \rvert}
\qquad X \text{ an item set} \tag{2.1}
\]

\[
\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
\qquad X, Y \text{ item sets} \tag{2.2}
\]

Equation 2.1, support, tells us the fraction of transactions that contain a specific item set. Equation 2.2, confidence, tells us how often the rule holds; more clearly put, how often X and Y appear together in the transactions compared to the number of transactions containing X. Some algorithms, such as the Apriori algorithm [2] and targeted association rule mining [22], generate so-called candidate item sets by creating combinations of items. The candidates are then evaluated using some interestingness measure to determine if they can be "promoted" to frequent item sets. A frequent item set is the base from which association rules are generated. This approach essentially generates the power set of the items and evaluates the sets one by one. Both mentioned algorithms employ different pruning methods in order to limit the search space. Other algorithms, such as FP-Growth by Han et al. [12], do not generate candidate item sets, using frequent pattern trees instead. This makes it more efficient than the first two algorithms mentioned.
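As a concrete illustration (my own minimal sketch, not code from the thesis; the toy transactions are made up), the two metrics can be computed directly from a list of transactions:

# Toy transactions: each is the set of items (events) seen for one user.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"d", "e"},
    {"a", "b"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset` (Equation 2.1).
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # How often the rule antecedent => consequent holds (Equation 2.2).
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"a", "b"}, transactions))        # 0.75
print(confidence({"a"}, {"b"}, transactions))   # 1.0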

2.1.1 Apriori Algorithm

First proposed by Agrawal et al. [2], the Apriori algorithm is one of the most well-known algorithms for association analysis according to Yanbin et al. [33]. It uses a breadth-first, levelled approach for generating candidate item sets by iteratively creating larger and larger item sets. It starts off with all items $i_k \in I$ as candidate item sets and for each one checks whether it has sufficient support given some support threshold minSupport. The items with sufficient support are kept as frequent item sets. In the next iteration all subsets of $I$ of size two are used as candidate item sets. Before calculating each candidate item set's support, the algorithm makes sure that the prefix of the candidate item set (the $k-1$ first items in a $k$-size item set) is present in the previous iteration's frequent item sets. For the prefix property to work, the items always have to be ordered the same way. The algorithm also uses the apriori property, which states that no superset of an infrequent item set $I'$ can be frequent. This process is iterated upon with larger and larger item sets until no more frequent item sets can be found.

The Apriori algorithm is only used for generating frequent item sets. However, Agrawal et al. [2] also propose a simple algorithm that can be used for rule generation from frequent item sets. The rule generation algorithm checks the confidence of each rule of the form $I' - i \Rightarrow i$ that can be generated from a frequent item set $I'$. If the rule holds, it is saved and sent to output together with all rules $I' - i' \Rightarrow i'$ where $i' \subset i$, $i' \neq \emptyset$.

In the worst case, $2^{|I|}$ candidate item sets will be generated, which is problematic since it will take a long time for any sizeable number of items and there will be problems fitting all of them in a computer's main memory. In the next section we look at a second algorithm, called FP-growth, which does not generate candidate item sets and thereby mitigates this problem.
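To make the levelled candidate-generation idea concrete, here is a minimal sketch of Apriori-style frequent item set mining (my own illustration, not the thesis implementation; it reuses the transactions list and support helper from the sketch in Section 2.1):

from itertools import combinations

def apriori(transactions, min_support):
    # Level 1: all single items with sufficient support.
    items = sorted({i for t in transactions for i in t})
    frequent = {frozenset([i]) for i in items
                if support({i}, transactions) >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Candidate k-item sets are unions of frequent (k-1)-item sets, kept only
        # if every (k-1)-subset is frequent (the apriori property).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c, transactions) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

print(apriori(transactions, min_support=0.5))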

2.1.2 FP-growth

Han et al. [12] note that the Apriori algorithm and its use of candidate item set generation become inefficient for data sets with a large number of frequent item sets. The number of candidate item sets can grow exponentially in the worst case. This is also true when using a low support threshold.

In response to this they propose a new algorithm that uses Frequent Pattern trees (FP-trees) to limit the number of necessary scans of the items and transactions to two, by avoiding the generation of candidate item sets.

The algorithm works by first scanning the database once to calculate the support of each item. After that, each transaction, with its items sorted in descending support order, is inserted into an FP-tree. The FP-tree contains a dummy root node. The insertion of a transaction into the FP-tree is described by the following Python code (children is treated as a dictionary from item name to child node):

def add_transaction(tree, transaction):
    # Walk from the dummy root, following existing child nodes where possible.
    parent = tree.root
    for item in transaction:  # items sorted in descending support order
        if item in parent.children:
            # Item already on this path: bump the existing node's support count.
            parent = parent.children[item]
            parent.support += 1
        else:
            # New branch: create a node and link it into the tree and header table.
            new_child = Node(name=item, support=1, parent=parent)
            parent.children[item] = new_child
            tree.items[item].append(new_child)  # header table: all nodes per item
            parent = new_child


A visual representation of an FP-tree after inserting the transactions {{a, b, c}, {a, b, d}, {d, e}} into the tree is shown in Figure 2.1a. The number in each node is the support for that node, and the "Items" table, also called the header table, keeps track of all nodes in the tree.

After constructing the FP-tree, each item in the header table is evaluated by checking whether the sum of the support counts of all the nodes of that item in the tree is greater than or equal to the supplied support threshold, minSupport. If the item has sufficient support, it is sent to output as a frequent item set and that item's conditional prefix tree is constructed. The conditional prefix tree of an item is constructed by treating the paths from the root to all instances of the item in the original tree as transactions and adding them to a new FP-tree, omitting the item itself. The conditional prefix tree for item "d" is shown in Figure 2.1b.

Figure 2.1: FP-tree Example. (a) Sample FP-tree; (b) "d" conditional prefix tree; (c) {b, d} conditional prefix tree.

The process is then repeated on the new tree. If "b" has sufficient support, the item set {b, d} is sent to output and the conditional prefix tree for {b, d} is generated, as shown in Figure 2.1c. This process is repeated recursively until no more frequent item sets can be generated.

If a conditional prefix tree contains a single path, as shown in Figures 2.1b and 2.1c, there is no need to continue the recursion deeper. Instead, each combination of the suffix and the nodes in the tree can be evaluated for its respective support and sent to output depending on the result. Given the prefix tree in Figure 2.1b, the directly evaluated item sets are {a, d}, {b, d}, {a, b, d}.

Finally, rules can be generated from the frequent item sets using the same method as described for the Apriori algorithm.

2.1.3 Targeted association rule mining

Rong et al. [22] have studied the behaviour of travelers that share their own and/or read about others' experiences on travel websites. This is done in order to help travel managers design more efficient market strategies that target either sharers (people sharing travel experiences) or browsers of travel websites.

The main feature of the algorithm they used, called targeted association rule mining, that distinguishes it from both the Apriori algorithm and FP-Growth is that the possible consequent items of the generated rules are pre-determined. In other words, the algorithm searches for causes of a set of predefined effects. The algorithm generates candidate item sets of increasing size. Each candidate item set $X_i$ is considered together with different targets $t$ (consequents-to-be). If $\mathrm{support}(X \cup t) \geq \text{minSupport}$, the item set is considered a positive candidate; otherwise it is a negative candidate item set. Next, the absolute value of the leverage, as defined in Equation 2.3, is examined for each positive and negative item set. For negative item sets, $\bar{t}$ (the inverse of the target) is used instead of $t$. Item sets with a leverage greater than or equal to some user-defined minLeverage are kept as interesting positive or negative item sets. The positive item sets are used to generate larger item sets and the process above is repeated until no more candidate item sets can be produced. Last, the Conditional Probability Independence Ratio (CPIR, Equation 2.4) of all kept candidate item sets, together with their respective target, is examined. Rules with a CPIR above a user-defined minCPIR threshold are kept and sent to output.

\[
\mathrm{leverage}(X, t) = \mathrm{support}(X \cup t) - \mathrm{support}(X)\,\mathrm{support}(t) \tag{2.3}
\]

\[
\mathrm{CPIR}(X \Rightarrow t) = \frac{P(t \mid X) - P(t)}{1 - P(t)} = \frac{\mathrm{leverage}(X, t)}{\mathrm{support}(X)\,(1 - \mathrm{support}(t))} \tag{2.4}
\]

The problem solved by Rong et al. is similar to our problem in the sense that we too know the expected variables in the consequent. The frequent item set generation process is similar to the Apriori algorithm, but with added constraints. Because of the similarity, it suffers from the same problem where a large number of candidate item sets need to be evaluated.

One of the most interesting features of the algorithm is that the rules are directly accessible from the mined frequent item sets and no additional computations are needed.
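As a quick illustration of these two measures (my own sketch, not code from Rong et al.; it again reuses the support helper and toy transactions from Section 2.1):

def leverage(X, t, transactions):
    # Equation 2.3: support(X ∪ {t}) − support(X) · support({t}).
    return (support(set(X) | {t}, transactions)
            - support(X, transactions) * support({t}, transactions))

def cpir(X, t, transactions):
    # Equation 2.4: leverage normalised by support(X) · (1 − support({t})).
    return (leverage(X, t, transactions)
            / (support(X, transactions) * (1 - support({t}, transactions))))

# Is the target item "b" positively associated with the antecedent {"a"}?
print(leverage({"a"}, "b", transactions), cpir({"a"}, "b", transactions))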

2.1.4 Evaluating Association Rules

The evaluation of association rules is often done by looking at different interestingness measures. The most commonly used measures are support and confidence. Other metrics include lift [8] (formerly known as interest) and added value (AV) [26].

\[
\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)} = \frac{P(X \wedge Y)}{P(X)\,P(Y)} \tag{2.5}
\]

\[
\mathrm{AV}(X \Rightarrow Y) = \mathrm{confidence}(X \Rightarrow Y) - \mathrm{support}(Y) \tag{2.6}
\]

Lift is the ratio between the observed support of X ⇒ Y and the support expected if X and Y were independent. If the lift of a rule is 1, it means that X and Y are independent and thus that the rule is not interesting. Values greater than 1 give a measure of the dependency between X and Y.

The value of AV shows the difference between the rule and the default case. Consider that we own a convenience store and that we want to increase our sales of orange juice. Today 10% of all customers buy orange juice and we have found an association rule X ⇒ orange juice with a confidence of 80%. The added value of that rule is 0.7, or 70 percentage points, which is the potential increase if we get all customers to also buy X.
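For concreteness, plugging the example's numbers into Equations 2.6 and 2.5 (my own arithmetic based on the figures above) gives:

\[
\mathrm{AV}(X \Rightarrow \text{orange juice}) = 0.80 - 0.10 = 0.70,
\qquad
\mathrm{lift}(X \Rightarrow \text{orange juice}) = \frac{0.80}{0.10} = 8
\]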

Another factor that one might need to consider when evaluating the quality of an association rule is the length of the rule. A rule with an antecedent of 10 items can prove difficult to draw knowledge from, depending on the context.

2.2 Association Rule Mining with Clustering

Consider the possibility that the data set from which association rules are to be mined is sparse and contains a large number of items. One problem with such data sets is that the minimum support threshold has to be very low in order to find any rules, but a low threshold might also generate too many rules, making it difficult or even impossible to select the most relevant ones. A low support threshold also enables larger frequent item sets to be mined and thus increases the run time. Overly large item sets, and by extension large and complex rules, are hard to analyse and to make decisions from. Plasse et al. [21] propose to use clustering methods to group similar items together prior to association rule mining. By only mining association rules containing similar items, the support threshold can be lowered without increasing the run time and without generating a massive number of rules. Another approach has been proposed by Iodice D'Enza et al. [13] where the transactions are clustered as opposed to the items. Using their method they claim to be able to find "hidden" patterns not visible to regular support-based algorithms.


2.2.1 DBSCAN

Density-Based Spatial Clustering of Applications with Noise, or DBSCAN, proposed by Ester et al. [10], is a density-based clustering method. It creates clusters consisting of at least minPoints core points and the points density-reachable from them, based on a distance metric and a minimum distance minε. DBSCAN classifies points as one of three categories:

Core point: a point with at least minPoints neighbours within minε distance.

Reachable point: p is reachable from q if there is a path of directly reachable core points between p and q; p does not have to be a core point.

Outliers: all points not reachable from any other point.

The algorithm starts by picking a point which has not yet been visited. If that point is a core point, it recursively extends the neighbourhood through neighbouring core points. All reachable non-core points and all core points are added to the cluster. This process is repeated until all points have been visited. A point with fewer neighbours than minPoints is marked as an outlier.

One of the main features of DBSCAN is that the user does not have to specify the number of expected clusters; only the parameters minε and minPoints are needed.

2.2.2 K-Means

K-Means is a partitioning clustering algorithm and one of the most fundamental clustering algorithms according to Han et al. [11]. The algorithm, as the name suggests, generates k clusters in which it tries to minimize the within-cluster mean sum of squares. The algorithm first picks k randomly chosen points in the data set as the initial cluster centroids. All other points are placed in the cluster closest to them using a Euclidean distance metric. Then, new cluster centroids are calculated as the mean of the data points in each cluster. All points are then re-assigned to the cluster they are closest to using the updated cluster centroids. This process is repeated until no points are moved between clusters, an iteration limit is reached or a timeout is exceeded. While the optimization problem K-Means attempts to solve is NP-hard [5, 15], multiple variations of the algorithm that address the problem of computational complexity exist, one of which is called Mini-batch K-Means. Mini-batch K-Means, proposed by Sculley et al. [23], offers a computational complexity orders of magnitude lower than K-Means when working with large and sparse data sets. The trade-off is that Mini-batch K-Means solves for locally optimal solutions within randomly sampled mini-batches. However, the results produced are generally only slightly worse than those of the original K-Means [24].

One of the downsides with K-Means when knowledge about the data set is limited is that the number of clusters k has to be specified before clustering. There are methods that automatically select a suitable k, such as the X-Means algorithm proposed by Pelleg et al. [20]. It is a variation of K-Means which repeatedly divides clusters and keeps the best results until a predefined condition is met. Smyth [25] proposes using cross-validation in order to select a suitable k. By using a smaller subset of the data for clustering with different values of k, and the rest of the data for validation, one can choose the best k for the data set. One problem with this method is that it will not work if there is more than one optimal clustering [6].

2.3 Feature Selection

Feature selection, or attribute subset selection, can be used as a preprocessing method in data mining with the purpose of reducing the problem space, training times and over-fitting [7, 14]. Another benefit of feature selection when used together with association rule mining is that it reduces the maximum length of the rules, making them easier to interpret.


2.3.1 Variance Threshold

The expected squared deviation of a random variable from its mean is called the variance of that variable. A random variable that follows the Bernoulli distribution [28] is a binary variable that takes on the value 1 with probability p and the value 0 with probability q, where q = 1 − p. The variance of a Bernoulli-distributed variable X is:

\[
\operatorname{Var}[X] = pq = p(1 - p) \tag{2.7}
\]

One approach to selecting a subset of features is to remove all low-variance features. It is unlikely that an item that is 1 or 0 more than 90% of the time correlates with the retention rate studied in this thesis, where the baseline retention for day 1 is 41%.

2.3.2 Pearson's χ² Test

Pearson's χ² test is a statistical test to evaluate how likely it is that two variables are independent. The test can be used to remove redundant features from a data set [11]. Equation 2.8 shows how to compute the Pearson's χ² statistic of two discrete attributes A and B which can take on k and l concrete values respectively. If A and B are binary variables, k = l = 2.

\[
\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \tag{2.8}
\]

\[
o_{ij} = \frac{\operatorname{count}(A = a_i \wedge B = b_j)}{n} \tag{2.9}
\]

\[
e_{ij} = \frac{\operatorname{count}(A = a_i) \cdot \operatorname{count}(B = b_j)}{n^2} \tag{2.10}
\]

where n is the total number of observations, i ∈ [1, k] and j ∈ [1, l]. The test is based on a level of significance together with (k − 1)(l − 1) degrees of freedom. If the hypothesis of independence can be rejected, the attributes are statistically correlated. If two highly correlated attributes are found, one of them is redundant and can be removed.

2.4 User Retention Studies

Several previous studies use data mining techniques to examine user retention. Eirinaki et al. [9] present an overview of real-world usage of data mining for web personalization in order to improve, amongst other things, user retention. They have found that multiple companies use association rule mining in order to build personalized recommendation systems and to analyze user behaviour. Ahn et al. [4] examine factors that impact customer churn rates in mobile telecommunications using a set of predefined possible "interesting" features. The study does not employ association rule mining, but investigates the impact of the features on customer churn using three different logistic regression models.

Tsai et al. [27] use association rules as a preprocessing step to find potentially interesting features. Those features are later used with a back-propagating neural net and decision trees to find rules that predict customer retention in on-demand multimedia streaming.

Ng et al. [18] have studied customer retention and churn rates of the Port of Singapore Authority. Their main rule generation method is decision trees, but they also use association rules for determining “if shipping company X defects, which other companies will follow suit?”.

In another study Ng et al. [17] found that multi-level association rules can identify customers that are on their way to leaving the company. Multi-level in this setting refers to grouping items in a multi-level hierarchy and performing the association rule mining on groups of items higher up in the hierarchy.


3 The Data Set

The data is collected from the mobile web browser Opera Mini and consists of a collection of user session logs. New users, using the beta version of the web browser, available as a separate application in the Google Play Store, are subject to the data recording used in this work. The recorded data consists of a sequence of events emitted by the browser in response to user interaction during the first three hours following the installation of the application. Because of the fast release schedule of beta versions and the fact that most releases include changes that might impact users' perception of the browser, this thesis is focused on a single beta version. This version was rolled out January 31, 2017 and replaced by the next beta version February 9, 2017. During this time period there were 182,000 new users.

The retention rate is measured using day-based "events". If a user is seen on the same day as they installed the browser they have the retention event r_0, if they are seen the next day r_1, and so on. Note that it is possible for a user to have r_3 but not r_2. The retention information is available in a database separate from the first-session events (the events recorded during the first three hours).

An event emitted by a user’s browser shows the following structure and is associated with a unique user identifier:

<time since last event>:<event code>:<value>

A complete list of all available event codes together with a short description is available in Appendix A. Throughout this paper some event codes and values will be printed. The reader is advised to consult the appendix, as no further explanation of the interpretation of the events will be available in the text. Not all events alone give a good idea of what the user has done. The best example is perhaps the event fired when a view is clicked (vc). The event itself is almost useless without its associated value. The value of an event can be binary, categorical or continuous. Some events can have special values consisting of a key-value pair (e.g. gudu:MAX_TABS=3, which means that the maximum number of simultaneously open tabs of all time is 3). The key-value pairs are mostly counters for events already present, or unique strings such as serial numbers. Some of them, however, have boolean values, which sometimes represent settings or other states in the application. There is unfortunately no way of separating the boolean values from the integer counters since the boolean values are encoded as 1 or 0. All values greater than or equal to 1 are written as 1, and zeroes are kept as 0. This means that the event shown before, gudu:MAX_TABS=3, will be encoded as gudu:MAX_TABS=1. This method discards a lot of information, since we essentially create a "greater than or equal to 1" event instead of keeping one event for each integer value present in the data set. In many cases this might be all the information needed to understand what the user is doing. The upside of using only two event types (=1, =0) is that those events will have much greater support. This will make it more probable for those events to be included in the mined association rules. Figure 3.1a shows the frequency distribution of values of all key-value pair variables that are interpretable as integers. It is clear that most integer values are unique; only 1% of all integer values occur more than 8 times. The boolean values might however still be interesting to look at and, because of the limitation previously mentioned, I have decided to rewrite all integer values to 0 or 1. Figure 3.1b shows the same data, but for string values. Only 0.1% of the values occur 3 or more times. Given that almost all strings are unique, I have chosen to remove all string values and treat all key-value events with the same key as the same event.

Figure 3.1: Key-value distributions. (a) Distribution of integer values; (b) Distribution of string values.

Since the events themselves do not convey much meaning without their associated values, they have been concatenated with a ":" as separator. Values that include an equals sign (=) have been rewritten based on the analysis above. The concatenated event-value pairs will from this point onwards be referred to as simply "events". In the recorded data about 550 unique events have been observed.
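A minimal sketch of this preprocessing, as I read the description above (the raw log lines and the helper name are illustrative, not taken from the thesis code):

def parse_event(raw):
    # Turn '<time since last event>:<event code>:<value>' into an event name.
    _, code, value = raw.split(":", 2)
    if "=" in value:
        key, val = value.split("=", 1)
        if val.isdigit():
            # Integer counter or boolean: binarise so that >= 1 becomes 1.
            value = key + "=" + ("1" if int(val) >= 1 else "0")
        else:
            # String values are dropped; all events sharing the key collapse together.
            value = key
    return code + ":" + value

print(parse_event("12:gudu:MAX_TABS=3"))   # gudu:MAX_TABS=1
print(parse_event("3:vc:some_view"))       # vc:some_view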

Far from all users experience all events; in fact, 80% of all users see at most 10% of the events. This is shown by the CDF (Cumulative Distribution Function) in Figure 3.2a. Figure 3.2b shows that only the top 38 most common events are seen by a majority of users. The retention rate for new users is shown in Figure 3.2c; the figure shows the percentage of users that return on the X-th day after installing the web browser. This suggests that the most common events do not match the user behaviour associated with user retention. Another important remark is that, given the one-day retention rate of 41%, the minimum confidence (Equation 2.2) for a rule to be regarded as interesting is 41%. Rules with high confidence are strong rules, but any rule above 41% suggests an improvement for day 1.


Figure 3.2: Data set analysis. (a) CDF of events seen by users; (b) Events seen by users (ranked); (c) Retention Rate.

In order to impose some limitations on this thesis and to reduce the run time of the analysis described in Chapter 5, I have chosen to completely disregard the order in which events were emitted by a user and the number of times an event has been emitted by a user. The resulting data set can be visualized as a table with one row per user and one column per event. Each cell has the value "True" or "False" depending on whether the user of that row has emitted the event of the column. A subset of that table is shown in Table 3.1. This structure mimics a transactional database, which is the data structure most of the algorithms presented in Chapter 2 work with.

Table 3.1: Sample of data set table

        gdsc:COMPLETED  gdsc:FAILED  gdsc:IN_PROGRESS  gfa:CREATE_NEW_TAB  gfa:DATA_SAVINGS_OVERVIEW
user0   True            False        True              True                False
user1   False           False        False             False               False
user2   False           False        True              False               False
user3   False           False        False             False               False
user4   False           False        False             False               False


4 Work Method

In this work I have employed the waterfall development model. There are many variations of the waterfall model, but the one thing they all have in common is that they are sequential. This means that the whole development process is subdivided into smaller phases which are processed one after another. There can be some overlap between the phases, but once one of them is done it cannot be revisited.

The development process I have used is divided into five steps: requirements, analysis, design, implementation and evaluation.

A section for each step can be found below, describing its purpose and which parts of Chapter 5 it refers to.

Chapter 5 details the method used to obtain the results of this work and does not follow the structure of the waterfall model. Instead, we start off with an architecture overview, which is part of the design phase. Following this overview are the analysis, more design and the implementation. These parts are not separated to show the development process, but rather aim to show the process from which the results are generated, from start to finish. The final section contains the method for evaluating the results.

The reason for this structure is to allow the reader to know what experiments are to be conducted and to follow the work flow in a more fluent manner.

4.1 Requirements

The formal requirements on this work given to me by Opera are high-level requirements. They are to present...

• association rules that indicate increased user retention based on early usage patterns
• a document containing code or other implementation details on how to mine said association rules
• a report detailing the work that I have done (you are reading it now)

The first requirement is tightly coupled with the aim of this work, which was presented in Chapter 1.2. It can be seen as the main requirement, with the other two as auxiliary requirements that are easily met after achieving the primary one.


Since the requirements are stated at a high level I have been given some freedom regarding how to interpret them. From my interpretation I have stated the research questions, previously shown in Chapter 1.3. The questions have been validated by my supervisor at Opera to fit the requirements.

4.2 Analysis

The analysis part of the work flow encompasses evaluating different methods that can be used to meet the requirements.

Part of this analysis takes place in Chapter 2, where similar studies and useful techniques are presented. Chapters 5.1 and 5.2 detail the resulting choices made after this analysis.

4.3 Design

The result of the design phase is the software architecture. This is a high level description of the system or experiment. The design is different from the implementation. One way to think about it is that the design contains high-level blocks, how the blocks are connected and how they should work. Once you start to write code in a programming language you have left the design phase and entered the implementation phase.

Chapter 5 starts off with a high level overview of the proposed system which is illustrated in Figure 5.1. There are also parts relevant to the design phase in Chapter 5.2.

4.4 Implementation

The implementation phase is where all the code is written. The implementation details are available in Chapter 5.1, Chapter 5.2 and Appendix C.

The majority of the code used in this work is made up of open source libraries, to which references can be found in Chapter 5. The components that I have developed have been written using a highly interactive Python environment, where it is trivial to test and run any part of the code at any time with any data. This allows the code to be tested and validated repeatedly throughout the development process without the need for formal testing frameworks or methods. The implementation has been tested in a "bottom-up" manner. First, small code blocks are tested, sometimes only one line, sometimes a whole method. When I know that the smaller parts work as expected, I connect them and test them together. By testing bottom-up, as early as possible and with a mix of real data from the session logs and hand-made examples, I have tried to make it as easy as possible to identify bugs and faulty pieces of code, and to make sure that nothing breaks when putting things together.

4.5 Evaluation

The last step is the evaluation and verification of the method and results. As stated in the previous section, the implementation has continuously been tested throughout the development phase. The evaluation of the results is described in Chapter 5.3.


5 Method

The work of this thesis is laid out in several steps: preprocessing, event selection, rule mining and evaluation. Figure 5.1 shows the flow of processing and which steps are independent and done in parallel.

Figure 5.1: Experiments overview

The first step is the preprocessing step, which we have already covered in Chapter 3. The second part is event selection. Here, six different strategies for selecting a subset of events are applied. The different strategies output sets of events that can be used to select a subset of the table produced by the preprocessing step (Table 3.1). The smaller tables are then fed into the third step, rule mining. The rule mining process uses FP-Growth, and each rule generation based on an event selection is independent of the others. The final and fourth step is evaluation. Each event selection strategy is run multiple times with different parameters, so the first evaluation is on the rules found using the same event selection technique. Later the rules found by all event selection techniques are evaluated and the best rules are presented.

The experiments are run using Jupyter Notebooks running IPython kernels. Jupyter Notebook is an open source web application that allows users to create documents containing executable code, text, figures and plots.


5.1 Event Selection

Because of the large number of available events and the sparsity of the data set, a low minimum support threshold is required in order to find interesting association rules. One effect of a low support threshold is an increased number of frequent item sets. This allows us to find infrequent user patterns that impact the user retention rates. A larger number of frequent item sets also implies larger (more complex) item sets and, by extension, more complex rules. Another unwelcome side-effect of a low support threshold is increased run time. In the worst case a low enough support forces the algorithm to do work proportional to $O(m \cdot 2^n)$ for n events and m users. One way of handling these negative effects is to select a subset of the events and perform the association rule mining on only that subset. This process gives us a lower n for each FP-Growth run, which enables us to run it with lower support thresholds. This section describes how clustering algorithms, statistical feature selection methods and expert knowledge are used in this thesis in order to select subsets of the available events.

5.1.1 Mini-batch K-Means Clustering

As previously mentioned in Section 2.2.2, Mini-batch K-Means is a heuristic variation of the original K-Means algorithm. The implementation used for this thesis is available in the Python library scikit-learn (http://scikit-learn.org/).

The implementation supports tweaking of multiple parameters in order to fine-tune the clustering process. For the purpose of this thesis, all parameters but the number of clusters are left to their default values. Table 5.1 shows the different number of clusters used.

Table 5.1: Number of clusters

k 10 20 40 80

The number of clusters is no less than 10, in the hope that the resulting clusters will be small enough to give the benefits described earlier without generating too large or too many item sets.
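A sketch of how this clustering step might look with scikit-learn (my own illustration; I assume the events are clustered by their user-occurrence vectors, i.e. the columns of the user-by-event table from Chapter 3, and the feature matrix below is a random placeholder):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

users_by_events = np.random.rand(1000, 550) < 0.05   # placeholder for Table 3.1
events_by_users = users_by_events.T                   # one row per event

for k in (10, 20, 40, 80):                            # values from Table 5.1
    model = MiniBatchKMeans(n_clusters=k).fit(events_by_users)
    # model.labels_[i] is the cluster index of event i; each cluster later becomes
    # one event subset that is mined separately with FP-Growth.
    print(k, np.bincount(model.labels_))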

5.1.2 DBSCAN clustering

DBSCAN, as opposed to K-Means clustering, does not need the number of clusters to be specified beforehand. The trade-off is that the minimum number of neighbour points needed to create a core point (minPoints) and the minimum distance for two points to be directly density reachable (minε) are required to be specified. The values for minPoints and minε that are used for this experiment can be found in Table 5.2. The implementation I have used can also be found in the Python library scikit-learn.

Table 5.2: DBSCAN Parameters

minε    minPoints
0.5     5
0.1     5
0.01    5
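A corresponding scikit-learn sketch (again my own illustration, clustering events by their user-occurrence vectors; minε and minPoints map to the eps and min_samples parameters):

from sklearn.cluster import DBSCAN

for eps in (0.5, 0.1, 0.01):                          # minε values from Table 5.2
    model = DBSCAN(eps=eps, min_samples=5).fit(events_by_users)
    # The label -1 marks outlier events; every other label is a cluster of events
    # that is mined separately, just like the K-Means clusters above.
    n_clusters = len(set(model.labels_)) - (1 if -1 in model.labels_ else 0)
    print(eps, n_clusters)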

5.1.3 Low variance filtering

Events with low variance are not likely to be part of rules with high confidence. This experiment removes events that are very commonly or very rarely fired by users. The VarianceThreshold method from scikit-learn is used for this purpose. It assumes the variables to be Bernoulli distributed, and the thresholds can be calculated as threshold = p(1 − p), where p is the probability of the variable being true. The p-values along with the associated thresholds used in this experiment are shown in Table 5.3.

Table 5.3: Variance Thresholds

p 0.9 0.8 0.7

threshold 0.09 0.16 0.21
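A sketch of this filtering step with scikit-learn (my own illustration; users_by_events stands in for the boolean user-by-event table from Chapter 3):

from sklearn.feature_selection import VarianceThreshold

for p in (0.9, 0.8, 0.7):                             # values from Table 5.3
    selector = VarianceThreshold(threshold=p * (1 - p))
    selector.fit(users_by_events)
    # get_support() is a boolean mask over the event columns; the kept events are
    # those whose empirical variance exceeds p(1 - p).
    kept_events = selector.get_support().nonzero()[0]
    print(p, len(kept_events))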

5.1.4 Pearson's χ² Test

Pearson's χ² test can be used to test for correlation between two random variables. After performing pairwise tests of all events, one of the events in a strongly correlating pair can be removed. For N events this translates to creating an N × N matrix where each cell (i, j) contains the test statistic from Pearson's χ² test between event i and event j. Listing 5.1 shows my implementation of the test, given the matrix of all events and users, t, and the two events to test, i and j.

Listing 5.1: Pearson's χ² Test implementation

def chi_sq(t, i, j):
    # t: pandas.DataFrame
    # i, j: columns of t

    n = t.shape[0]  # Number of users
    # Observed values
    o_00 = t[~t[i] & ~t[j]].shape[0] / n
    o_01 = t[~t[i] &  t[j]].shape[0] / n
    o_10 = t[ t[i] & ~t[j]].shape[0] / n
    o_11 = t[ t[i] &  t[j]].shape[0] / n
    # Expected values
    e_00 = t[~t[i]].shape[0] * t[~t[j]].shape[0] / (n * n)
    e_01 = t[~t[i]].shape[0] * t[ t[j]].shape[0] / (n * n)
    e_10 = t[ t[i]].shape[0] * t[~t[j]].shape[0] / (n * n)
    e_11 = t[ t[i]].shape[0] * t[ t[j]].shape[0] / (n * n)

    return \
        (o_00 - e_00) ** 2 / e_00 + \
        (o_01 - e_01) ** 2 / e_01 + \
        (o_10 - e_10) ** 2 / e_10 + \
        (o_11 - e_11) ** 2 / e_11
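For reference (not part of the thesis code), the same pairwise test can be cross-checked with SciPy's contingency-table implementation, which works on raw counts rather than the proportions used in Listing 5.1:

import pandas as pd
from scipy.stats import chi2_contingency

def chi_sq_scipy(t, i, j):
    # 2x2 contingency table of raw counts for the boolean columns i and j.
    table = pd.crosstab(t[i], t[j])
    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    return chi2, p_value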

Three different significance levels will be tested: 1%, 0.5% and 0.1%. Boolean variables give us 1 degree of freedom, and with the help of a χ² distribution table we can see the critical value for the above-mentioned significance levels, also shown in Table 5.4.

Table 5.4: Significance levels

Significance Level   Critical Value
0.01                 6.635
0.005                7.879
0.001                10.828


After calculating the test statistics, strongly correlating events will be pruned using the pseudo code in Listing 5.2.

Listing 5.2: Pruning Correlating Events

def selectSubset(events, criticalValue):
    kept = set(events)
    for ev1 in list(kept):
        if ev1 not in kept:
            continue                      # ev1 was already removed as redundant
        for ev2 in list(kept - {ev1}):
            if chiSquare(ev1, ev2) >= criticalValue:
                # ev1 and ev2 are correlated: ev2 is redundant, keep ev1
                kept.discard(ev2)
    return kept

5.1.5 Manual Selection

Using knowledge about the application from which the data set is collected, together with suggestions from my on-site supervisor, I have manually selected a list of interesting events. The total number of events on the list is 185. A list of the manually selected events is available in Appendix B.

5.1.6 No Selection

Compared to the other event selection strategies, this is by far the simplest: all available events are selected.

5.2 Rule Mining

Rules will be mined for each subset of events generated in the previous sections. For this purpose a parallel implementation of FP-Growth will be used. The algorithm has also been modified to output "targeted" association rules instead of frequent item sets. The code in Listing 5.3 shows the targeted rule generation. Before FP-Growth is run, "retention items" are added to the transactions. The retention items for a user symbolize whether that user was retained on day 1, 7, 14, 21 and 30 after installing the application. The retention items are called r_1, r_7, etc., which is what the item.startswith('r_') check in Listing 5.3 looks for. The main reason for choosing these particular days is to follow the standard at Opera.

Listing 5.3: Targeted rule generation

def check_rule(itemset):
    antecedent = []
    retention_events = []
    for item in itemset:
        # Retention items start with 'r_': the boolean indexes the tuple, so they
        # go to retention_events and all other items go to the antecedent.
        (antecedent, retention_events)[item.startswith('r_')].append(item)
    if retention_events and antecedent:
        for consequent in retention_events:
            if confidence(antecedent, consequent) >= min_confidence(consequent):
                yield (antecedent, consequent)

The function min_confidence() returns the baseline retention rate for the day. This allows all rules that imply increased retention for the day specified in the consequent to be sent to output. The baseline retention rates for each tested day are shown in Table 5.5.


Table 5.5: Retention rates

Event   Rate
r_1     40%
r_7     17%
r_14    12%
r_21    10%
r_30    1%

A user is retained for day i if they are seen that day; the retention rate for day i (r_i) is equal to the percentage of users that have been seen during that day.

Three different support thresholds will be tested on each event subset. Table 5.6 shows the thresholds used in this thesis.

Table 5.6: Support thresholds

Support   10%   1%   0.1%

Since the original FP-Growth algorithm is recursive and Python has a limited recursion depth, my implementation uses a work queue that is shared among a number of processes. Each new conditional FP-tree that is found is placed in the work queue, and the processes query the work queue for more work when they have finished their current task.

Even though we have used different techniques to select subsets of events to run this algorithm on, the work in the worst case is still proportional to calculating the power set of the events. This is, even with parallelism, an astronomical number of computations for a large set of events. In case a run of FP-Growth does not terminate within 24 hours it will be forcibly terminated and the rules mined so far will be presented in the results.

The full implementation of the parallel, targeted FP-Growth algorithm is available in Appendix C.
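The shared work-queue pattern described above might look roughly like the following skeleton (a generic multiprocessing sketch, not the thesis code from Appendix C; mine_conditional_tree is a hypothetical stand-in for the per-tree mining step):

from multiprocessing import Process, Queue

def worker(work_queue, result_queue):
    # Each process repeatedly pulls a conditional FP-tree from the shared queue,
    # mines it, and pushes any newly created conditional trees back as new work.
    while True:
        tree = work_queue.get()
        if tree is None:                  # sentinel: shut this worker down
            break
        rules, new_trees = mine_conditional_tree(tree)   # hypothetical helper
        for rule in rules:
            result_queue.put(rule)
        for new_tree in new_trees:
            work_queue.put(new_tree)

# The parent process would put the initial FP-tree on work_queue, start a number of
# Process(target=worker, ...) workers, collect rules from result_queue, and send one
# None per worker (or terminate them) once the work runs out or the 24-hour limit is hit.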

5.3 Evaluation

In the likely event that the above-mentioned rule mining processes, based on the different event selection techniques, yield a number of rules too large to manually select the most interesting ones from, we need to specify a method for doing so without going through the rules one by one.

To simplify things we will select two rules per event selection technique and retention event. The retention events are the events that will be in the consequent of the mined rules, listed in Table 5.5. To select these rules we will first select the rules with the largest retention rate gain for each retention event and event selection technique. In case two rules are tied for the top position retention gain-wise, antecedent length and support threshold will be used to break the tie. We will also note the parameters used by the event selection technique, if applicable, and the support threshold used for generating the selected rules. This will aid us in answering the second research question: "How should events be selected?". The retention gain will be measured as the ratio between the rule's confidence and the baseline retention of the retention event in the rule's consequent. This confidence-to-baseline-retention ratio is the same as lift, as shown in Equation 2.5. The ratio is used in order to be able to compare mined rules with different consequents. Since the rules are mined with different minConfidence, it is not convenient to compare rules based on confidence alone.

The second selection of rules focuses on selecting rules with sub-optimal lift that are shorter or in other ways can be mined faster. Long rules can, depending on the events they contain, become too complex to be usable in a real-world context, e.g. to improve an on-boarding process. In general, a low support threshold causes FP-Growth to consider more and longer item sets, which also increases the run time of the mining process.

The second selection selects the shortest rule that has been mined with the highest possible support threshold while still maintaining a high lift. For the purpose of this work the lift threshold has been set at 90% of the lift of the rule it aims to replace. The threshold is only put in place to allow us to select a different set of rules and 90% is a reasonable limit to how much worse the lift can be and still be acceptable.
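One way such a selection could be automated over a table of mined rules is sketched below (my own illustration; the column names and the tie-breaking directions — shorter antecedent and higher support threshold preferred — are assumptions, not taken from the thesis code):

import pandas as pd

def select_rules(rules: pd.DataFrame) -> pd.DataFrame:
    # rules: one row per mined rule with columns such as "selection_technique",
    # "consequent", "antecedent_len", "support_threshold", "confidence", "lift".
    # First selection: highest lift per (technique, retention event).
    ordered = rules.sort_values(
        ["lift", "antecedent_len", "support_threshold"],
        ascending=[False, True, False])
    first = ordered.groupby(["selection_technique", "consequent"]).head(1)

    # Second selection: shortest rule mined with the highest support threshold
    # whose lift is still at least 90% of the first selection's lift.
    merged = rules.merge(
        first[["selection_technique", "consequent", "lift"]],
        on=["selection_technique", "consequent"], suffixes=("", "_best"))
    good = merged[merged["lift"] >= 0.9 * merged["lift_best"]]
    second = good.sort_values(
        ["antecedent_len", "support_threshold"], ascending=[True, False]
    ).groupby(["selection_technique", "consequent"]).head(1)
    return pd.concat([first, second])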

The reason for selecting only two rules per event selection technique and retention day is that, since the rules are sorted, this is likely to give us the most interesting rules. It also allows the results to be presented in a condensed way. While only the rules from these selections are presented in this paper, the whole set of rules is saved and made available to Opera.

One thing that neither selection takes into account is one of the most important factors determining the usability and interestingness of a rule: the higher-level meaning of the events when grouped together. For example, a rule with five events might tell a detailed, easily followed use case, whereas a rule with one obscure event might not give any deeper understanding of the users’ behaviour or any insight into how to improve the experience for future new users. Since it is hard to define an automatic filter for this, the rules selected by the two methods are presented as-is in Chapter 6. A discussion of the events contained in the rules and their usability follows in Chapter 7.


6 Results

This chapter presents the results obtained by the experiments described in the previous chapter. There are six sections, one for each event selection technique. Each section presents the result of the event selection along with the number of rules mined for each retention event, as well as the best rules selected by the two rule selection techniques described in Chapter 5.3. In the cases where the rule mining was terminated because the 24 hour time limit was exceeded, the results have been marked with an asterisk (*). The text associated with each rule mining run also states which runs, if any, failed to meet the deadline.

In this chapter there are tables with event codes and values. Since it is not the aim of this chapter to evaluate the rules based on the meaning of the events they contain, no descriptions of the events are offered here. Chapter 7 contains a discussion and evaluation of this subject along with descriptions of all event codes and values shown in this chapter.


6.1 Mini-batch K-Means Clustering

Figure 6.1 shows the size of each cluster created by Mini-batch K-Means clustering. The clusterings in Figures 6.1a, 6.1b and 6.1d produce two clusters that contain the majority of all events, while k=40 produces three larger clusters, as shown in Figure 6.1c.

[Figure: four bar charts of cluster size per cluster index, one panel each for (a) k=10, (b) k=20, (c) k=40 and (d) k=80]

Figure 6.1: Cluster sizes

The number of rules generated from the K-Means clusters is shown in Table 6.1, sorted by retention event, number of clusters and support threshold. The previously mentioned 24 hour deadline caused some of the mining attempts to abort; the aborted runs are marked with an asterisk. The majority of the runs completed without any problems. With k=10 no runs were able to execute to completion. This might have a number of causes, but one might suspect that cluster number 7, with more than 300 events, is one of them. For the tests with 20 and 40 clusters, only the mining processes with the support threshold set to 0.1% failed to meet the deadline.

Table 6.1: Number of rules mined from K-Means clusters

k           10                          20                          40                          80
Retention   10%*     1%*      0.1%*     10%      1%       0.1%*      10%      1%       0.1%*      10%     1%      0.1%
r_1         130271   166769   795579    128879   158042   612315     160971   160928   342009     41041   41502   47506
r_7         140246   283109   144583    137222   133799   152877     168822   154837   187475     21729   41174   43757
r_14        144487   142730   58685     142483   126106   87332      86693    1734     14076      10744   41150   42962
r_21        166558   142303   40917     109961   130690   78842      16525    1566     11545      8246    41137   40045
r_30        0        16469    2042      0        146260   74399      0        83390    132280     0       9923    21110


The rules selected in the first round, i.e. the rules with the highest lift, all have the same antecedent, as shown in Table 6.2.

Table 6.2: Antecedent for top lift K-Means rules

Retention   Antecedent
r_1         [gudu:SAVED_PAGE_COUNT=1, sc:recommendations_language_region=, hfns:None]
r_7         [gudu:SAVED_PAGE_COUNT=1, sc:recommendations_language_region=, hfns:None]
r_14        [gudu:SAVED_PAGE_COUNT=1, sc:recommendations_language_region=, hfns:None]
r_21        [gudu:SAVED_PAGE_COUNT=1, sc:recommendations_language_region=, hfns:None]
r_30        [gudu:SAVED_PAGE_COUNT=1, sc:recommendations_language_region=, hfns:None]

The lift and the parameters used to find the rules are shown in Table 6.3. One thing worth noting is the steep difference in lift between r_30 and the other retention events. The base line retention for r_30 is only 1%, which would probably be higher if averaged over neighboring days. The possibly poor accuracy of r_30 might be the cause of this behaviour.

Table 6.3: Top K-Means rules based on lift ratio

Retention   k    Support Threshold   Length   Confidence   lift     0.9·lift
r_1         80   0.1%                3        99.50%       2.50     2.25
r_7         80   0.1%                3        99.50%       5.81     5.23
r_14        80   0.1%                3        99.00%       8.06     7.25
r_21        80   0.1%                3        99.00%       10.18    9.16
r_30        80   0.1%                3        98.51%       117.81   106.03

The second selection yielded rules similar to those of the first selection. The only difference is that the event sc:recommendations_language_region= has been dropped from all the rules’ antecedents, as shown in Table 6.5. This also causes the lift to drop slightly and the rule length to decrease by one. The metrics for the second selection are available in Table 6.4.

Table 6.4: Second selection of rules mined from K-Means clusters

Retention   k    Support threshold   Length   Confidence   lift
r_1         80   0.1%                2        97.67%       2.45
r_7         80   0.1%                2        96.89%       5.66
r_14        80   0.1%                2        96.50%       7.86
r_21        80   0.1%                2        96.11%       9.88


Table 6.5: Antecedent for second selection of K-Means rules

Retention   Antecedent
r_1         [gudu:SAVED_PAGE_COUNT=1, hfns:None]
r_7         [gudu:SAVED_PAGE_COUNT=1, hfns:None]
r_14        [gudu:SAVED_PAGE_COUNT=1, hfns:None]
r_21        [gudu:SAVED_PAGE_COUNT=1, hfns:None]
r_30        [gudu:SAVED_PAGE_COUNT=1, hfns:None]


6.2 DBSCAN

The resulting number and sizes of the clusters given by the different DBSCAN clustering tests are shown in Table 6.6. It is notable that in all three tests the vast majority of all events are classified as noise. Only the first test, where minε = 0.5, produces more than one cluster.

Table 6.6: DBSCAN Clusters

(minε, minPoints)   Cluster   Size
(0.5, 5)            0         41
                    1         7
                    2         5
                    3         5
                    noise     490
(0.1, 5)            0         23
                    noise     525
(0.01, 5)           0         11
                    noise     536
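For reference, a minimal sketch of how cluster sizes and noise counts of this kind can be obtained with scikit-learn's DBSCAN is given below. The feature matrix event_vectors is a placeholder for the event representation used in this work, and the parameter written minε above is passed as eps.

# Sketch of deriving cluster sizes and noise counts like those in Table 6.6.
# event_vectors (one row per event) is a placeholder for the actual features.
from collections import Counter
from sklearn.cluster import DBSCAN

def dbscan_cluster_sizes(event_vectors, eps, min_points):
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(event_vectors)
    sizes = Counter(labels)
    noise = sizes.pop(-1, 0)          # DBSCAN labels noise points as -1
    return dict(sizes), noise

# Parameter pairs evaluated above:
# for eps, min_points in [(0.5, 5), (0.1, 5), (0.01, 5)]:
#     clusters, noise = dbscan_cluster_sizes(event_vectors, eps, min_points)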

The number of rules generated from the events in the DBSCAN clusters is available in Table 6.7. It is interesting to note that for the last experiment, with minε = 0.01, all possible subsets of the items used are frequent enough to become antecedents in rules. We know this since there are 11 items in the only cluster, and the power set of the items, minus the empty set, has size 2^11 − 1 = 2047. In contrast to the rule mining based on the K-Means clusters, the majority of the FP-Growth runs on the DBSCAN clusters failed to meet the deadline. This causes some odd numbers: for example, the number of rules for r_7 with minε = 0.5 and minSupport = 10% is much greater than the number of rules for the same minε with lower support thresholds. If all tests had been able to run to completion this would not have been possible.

Table 6.7: Number of rules generated from DBSCAN clusters

DBSCAN parameters   (0.5, 5)                    (0.1, 5)                    (0.01, 5)
Retention           10%      1%*      0.1%*     10%*     1%*      0.1%      10%    1%     0.1%
r_1                 9405     9435     9590      141414   141475   141238    2047   2047   2047
r_7                 116969   62       216       144095   207133   160348    2047   2047   2047
r_14                120327   13045    13182     144311   145170   145071    2047   2047   2047
r_21                145202   104155   10155     168185   16421    16421     2047   2047   2047
r_30                0        128630   9627      0        16421    16421     0      2047   2047

The top rules lift-wise mined from the DBSCAN clusters are shown in Table 6.8. In three of the cases the rule with the highest lift has a length of 1. In two of those cases the best rules are also found with the middle support threshold, 1%. We also see that the lift values are lower than those of the K-Means rules.

Table 6.8: Top DBSCAN cluster based rules sorted by lift

Retention   minε   Support Threshold   Length   Confidence   lift   0.9·lift
r_1         0.5    1%                  2        60.87%       1.53   1.38
r_7         0.5    0.1%                1        28.60%       1.67   1.50
r_14        0.5    0.1%                3        22.40%       1.82   1.68
r_21        0.5    0.1%                4        17.15%       1.76   1.58


Table 6.9 shows the antecedents of the rules featured in Table 6.8. Compared to the rules from the K-Means clusters, they are possibly easier to understand even without comprehensive knowledge of the application or the events.

Table 6.9: Antecedent of top lift DBSCAN rules

Retention   Antecedent
r_1         [vc:ok_button, vc:theme_blue]
r_7         [vc:theme_blue]
r_14        [vc:settings_night_mode, sc:night_mode=0, sc:night_mode_sunset=1]
r_21        [gfa:NIGHT_MODE_MENU, sc:night_mode=1, sc:night_mode=0, sc:night_mode_sunset=1]
r_30        [occ:_web_]

Given the lower lift threshold (0.9·lift) in Table 6.8, the rules in Table 6.10 have been chosen during the second selection phase. The antecedents of the newly selected rules are available in Table 6.11, where we can see that all rules have changed except for r_30.

Table 6.10: Second selection of DBSCAN based rules

Retention   minε   Support Threshold   Length   Confidence   lift
r_1         0.5    1%                  1        60.80%       1.53
r_7         0.5    1%                  1        28.60%       1.65
r_14        0.5    0.1%                1        20.70%       1.68
r_21        0.5    0.1%                1        16.96%       1.74
r_30        0.5    1%                  1        5.13%        6.13

Table 6.11: Antecedent for the second selection of DBSCAN rules

Retention   Antecedent
r_1         [stc:BLUE]
r_7         [sc:app_theme=1]
r_14        [sc:night_mode_sunset=1]
r_21        [vc:theme_blue]
r_30        [occ:_web_]


6.3 Variance Threshold

Table 6.12 shows the number of events kept after applying the variance threshold filter.

Table 6.12: Kept events after variance threshold filtering

p     Threshold   Events
0.7   0.21        26
0.8   0.16        38
0.9   0.09        63
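The thresholds in Table 6.12 match the Bernoulli variance p(1 − p) for p = 0.7, 0.8 and 0.9, which is the usual way to parameterize scikit-learn's VarianceThreshold for binary features. A minimal sketch of this kind of filtering is given below, assuming a binary user-by-event matrix; the names are illustrative and this is not the exact thesis code.

# Sketch of the variance threshold filtering, assuming a binary matrix with one
# row per user and one column per event. An event that is present (or absent)
# in more than a fraction p of the users has variance below p * (1 - p) and is
# dropped. Not the exact thesis code; names are illustrative.
from sklearn.feature_selection import VarianceThreshold

def filter_events_by_variance(event_matrix, p):
    selector = VarianceThreshold(threshold=p * (1 - p))
    reduced = selector.fit_transform(event_matrix)
    kept_columns = selector.get_support(indices=True)
    return reduced, kept_columns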

Based on the filtered events, FP-Growth has been run, producing the number of rules shown in Table 6.13 for the different support thresholds. Not surprisingly, a higher variance threshold and a lower support threshold both contribute to more rules. As indicated by the asterisks, slightly more than half of the rule mining tests failed to terminate within the specified deadline. The ones that did terminate were run on fewer events and with higher support thresholds.

Table 6.13: Number of rules generated from low variance items

Variance threshold   0.7                       0.8                         0.9
Retention            10%    1%     0.1%        10%    1%*      0.1%*       10%*     1%*      0.1%*
r_1                  225    225    936548      1627   502411   570098      957897   665130   526993
r_7                  10     10     97973       34     28297    33777       39713    60021    42
r_14                 1      1      33845       5      24187    31930       2788     2859     34
r_21                 0      0      0           0      12321    31011       44       38458    0
r_30                 0      0      0           0      21       16727       0        2641     181

The rules with the highest lift mined from the low variance events are shown in Table 6.14. The lift values lie between those of the K-Means and DBSCAN rules, except for r_30 which is the lowest one yet. We also note that for three out of five retention events, the top lift rules are found using a support threshold of 1%.

Table 6.14: Top lift rules mined from low variance events

Retention   p     Support Threshold   Length   Confidence   lift   0.9·lift
r_1         0.7   0.1%                9        73.94%       1.86   1.67
r_7         0.8   1%                  3        26.05%       1.83   1.65
r_14        0.8   1%                  5        19.97%       1.84   1.66
r_21        0.8   1%                  1        16.31%       1.68   1.51


The associated antecedents for the rules above are shown in Table 6.15.

Table 6.15: Antecedents for top lift low variance rules

Retention   Antecedent
r_1         [hfns:Discover, ofc:true, hfa:p, vc:bottom_navigation_bar_opera_menu_button, gudu:MAX_ACTIVE_OBML_TABS=1, sc:obml_ad_blocking=1, gdsc:IN_PROGRESS, vc:bottom_navigation_bar_tab_count_button, vc:bottom_navigation_bar_back_button]
r_7         [gudu:MAX_ACTIVE_OBML_TABS=1, vc:bottom_navigation_bar_tab_count_button, vc:bottom_navigation_bar_back_button]
r_14        [gfa:OPERA_MENU, gudu:MAX_ACTIVE_OBML_TABS=1, gdsc:IN_PROGRESS, vc:bottom_navigation_bar_tab_count_button, gdsc:COMPLETED]
r_21        [gudu:MAX_ACTIVE_OBML_TABS=1]
r_30        [gudu:MAX_ACTIVE_OBML_TABS=1, vc:bottom_navigation_bar_tab_count_button]

Based on the lift shown in Table 6.14, we have selected the rules in Table 6.16 for the second phase. The largest improvement is seen for r_1, where the length of the rule is reduced from 9 to 5. For the other retention events either the rule has been shortened or the support threshold has been increased. The antecedents of these rules are shown in Table 6.17.

Table 6.16: The second selection of low variance based rules

Retention   p     Support Threshold   Length   Confidence   lift
r_1         0.7   0.1%                5        70.79%       1.78
r_7         0.8   1%                  2        29.22%       1.71
r_14        0.9   1%                  2        22.21%       1.81
r_21        0.8   1%                  1        16.31%       1.68
r_30        0.8   1%                  1        3.87%        4.62

Table 6.17: Antecedents for the second selection of low variance based rules

Retention   Antecedent
r_1         [os, hfa:p, gudu:MAX_ACTIVE_OBML_TABS=1, gdsc:IN_PROGRESS, vc:next_button]
r_7         [gudu:MAX_ACTIVE_OBML_TABS=1, vc:bottom_navigation_bar_back_button]
r_14        [gudu:MAX_ACTIVE_OBML_TABS=1, vc:bottom_navigation_bar_tab_count_button]
r_21        [gudu:MAX_ACTIVE_OBML_TABS=1]
r_30        [sc:dist_install_referrer=]


6.4 Pearson’s χ² Test

Pairwise tests of all available events using Pearson’s χ² test revealed that no two events are correlated at the significance levels 0.1%, 0.5% or 1%. In fact, the largest χ² statistic found was 0.9380, which translates to a p-value of 0.3328 with 1 degree of freedom. This indicates that if we were to remove one of the two variables that gave this maximum statistic, on the grounds that they are correlated, that decision would be statistically wrong roughly 66% of the time.
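As a quick sanity check of the reported numbers (this uses SciPy and is not part of the thesis code), the survival function of the χ² distribution with one degree of freedom reproduces the reported p-value:

# Sanity check: a chi-square statistic of 0.9380 with one degree of freedom
# corresponds to a p-value of about 0.333, matching the reported 0.3328.
from scipy.stats import chi2

p_value = chi2.sf(0.9380, df=1)
print(p_value)   # approximately 0.3328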

Since no events could be removed at the significance levels described in Chapter 5, the resulting set of events is identical to the one we started with, and consequently no rules were mined for this technique.
