
Churn Prediction using Sequential Activity Patterns in an On-Demand Music Streaming Service

FILIP STOJANOVSKI


Churn Prediction using Sequential Activity Patterns in an On-Demand Music Streaming Service

Master’s Thesis

FILIP STOJANOVSKI

Master’s Thesis at KTH Information and Communication Technology
Supervisor: Daniel Gillblad
Examiner: Magnus Boman
Company Supervisor: Manish Nag

TRITA-ICT-EX-2017:140


Abstract

In data-driven companies, churn analysis aims to make use of novel machine learning and data mining techniques for the purpose of better understanding the customers.

The most common approach is to engineer a vast number of features describing users, products, services, and actions, which are then used to infer knowledge by means of machine learning and data mining. However, one aspect is typically neglected since it appears more difficult to model and utilize, and that is time. This work presents the modeling of user activity on a music streaming service in the form of sequential, temporally dependent data, which serves to explore the advantages of detecting churning users by means of basic or long short-term memory recurrent neural networks. The performance and complexity are compared against non-sequential models using the same data. The conclusion reached is that even though recurrent networks bring no improvement to a churn prediction module based on activity data, presenting that data in sequential form does.


... to companies. The most common approach is to derive a number of attributes that characterize users, products, services, and actions, which are then used to generate knowledge with the help of machine learning and data mining. One aspect that is usually neglected, however, is time, since it proves difficult to model and use. This work presents the modeling of user activity on a music streaming service in the form of sequential, temporally dependent data, with the aim of exploring the advantages of detecting customers who choose to terminate their membership by means of long short-term memory recurrent neural networks.

Performance and complexity are compared with non-sequential models on the same data. The conclusion of the work is that even though recurrent neural networks do not result in an improved churn prediction module based on activity data, data presented in sequential form does yield an improvement.


Acknowledgements

This thesis project has been a true life-changing journey;

even though it was filled with obstacles, many people contributed to making it possible in one way or another. First and foremost, I want to thank my friend Tina Ranić, who had trust in me and recommended me for the position of a master’s thesis student at Spotify. I also want to thank my managers, Anders Nyman and Sebastian Widlund, for giving me this opportunity and making me feel like a part of the team.

A big thank you to Manish Nag, who took the responsibility of being my supervisor, with constant assistance from Nathan Stein and Steven Corroy. A special thanks to Thúy Trần, Magnus Petersson, and Ludvig Fischerström who, even though they were not formally my supervisors, definitely felt like ones; they have helped me a great deal and put up with me throughout the whole process of the project. From the Spotify team, I also want to thank, in no particular order, Stefan Avesand, Andreas Mattsson, Marcus Isaksson, Daniel Lazarovski, Sahar Asadi, Clay Gibson, Ian Anderson, Elaine Chung, Patrik Törmänen, Claire Amaouche, Anna Harris, and Mattias Frånberg for attending meetings with me and making their contributions to different aspects of the project. I want to thank my friends and classmates Guilherme Dinis Chaliane Jr and Philipp Eisen for all the helpful discussions, comments, and thesis chapter reviews.

Last but definitely not least, I am extremely grateful for the constant support I got from KTH, from my supervisor Daniel Gillblad and examiner Magnus Boman, regarding both the academic and the nitty-gritty aspects of this master’s thesis project.


Contents

Acronyms

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goal
1.5 Benefits, Ethics, and Sustainability
1.6 Research Methodology
1.7 Delimitations
1.8 Outline

2 Relevant Theory
2.1 Dynamic Churn Prediction
2.2 Motivation for Churn Prediction from User Activity Sequences
2.3 Classification
2.3.1 The Time Series Classification Problem
2.4 Artificial Neural Networks
2.4.1 Recurrent Neural Networks
2.4.2 Long Short-Term Memory
2.5 Performance Metrics

3 Churn Prediction Methods
3.1 Long Short-Term Memory Time Series Classification
3.2 Logistic Regression and Random Forest Benchmarks
3.3 Hypothesis and Evaluation
3.4 Implementation and Setup

4 Activity-Based Premium Churn Prediction
4.1 Premium Churn
4.2 User Activity Data
4.2.1 User Labeling and Selection
4.2.2 Feature Engineering and Activity Sequence Modeling
4.2.3 Feature Group-Wise Scaling Function


References

Appendices

A Long Short-Term Memory in TensorFlow
A.1 Zaremba-Based Long Short-Term Memory Cell Implementation
A.2 Hochreiter-Based Long Short-Term Memory Cell Implementation

B Additional Results


Acronyms

ANN artificial neural network.

AUC area under the curve.

BPTT back-propagation through time.

CDN content delivery network.

CEC constant error carousel.

CLI command-line interface.

CNN convolutional neural network.

CPU central processing unit.

CRM customer relationship management.

DFA detrended fluctuation analysis.

DTW dynamic time warping.

FFNN feedforward neural network.

FFT fast Fourier transformation.

GCP Google Cloud Platform.

GCS Google Cloud Storage.

GPU graphics processing unit.

HMM hidden Markov model.

kNN k-nearest neighbors.

KPI key performance indicator.


RF random forest.

RNN recurrent neural network.

RTRL real-time recurrent learning.

SVM support vector machine.

TB TensorBoard.

TF TensorFlow.

TSC time series classification.

VM virtual machine.


Chapter 1

Introduction

Companies that provide services to people or to other companies have recently, due to the ability to store and process data at scale, realized the importance of tracking their users’ behaviour for improving their value proposition and customer relationships.

User behaviour is defined by the context and shaped by the problem the service provider aims to address. Frequency and amount of service utilization, payment habits, the user journey from one product to another, as well as demographics are a few examples of how a customer can be described. Novel, more tech-oriented companies, such as Google, Facebook, Skype, and Spotify, gather vast amounts of data related to their users and focus on finding ways and techniques to extract knowledge that would aid their business [1]. Data-driven decision making helps the service providers take more informed, strategic, and customer-centric actions and make moves that benefit both them and their customers. Knowing and understanding the users’ pains and problems, as well as what is causing them, gives companies a great competitive advantage. However, predicting the users’ next steps is even more important, since it gives the providers time to react and work on fixing the issues, resulting in greater user satisfaction and loyalty, which have a positive influence on long-term financial performance [2][3].

Even though companies are preoccupied with general key performance indicators (KPIs), like paying subscribers, monthly, weekly, and daily active users, and followers on social media, a way to improve them is by looking more specifically into each user. Predicting a drop in subscriber growth is one thing, but understanding why certain users cancel their subscriptions is another insight which is potentially more beneficial for the service provider. User activity data is one way to depict user behavior: how much time per day/week/month a user spends on the service, which parts of the day are the user’s most and least favorite, whether there is a point in time when a change in the behavior can be noticed, and so on. These behavioral patterns can potentially be traced and linked to previous users’ movement in and out of the service. This results in the ability to answer questions like “How likely is a certain user to churn from the service?” and “Can the problems the churning users are experiencing be detected and fixed?”. Throughout this thesis, the term churn


model (HMM) [4]. A number of papers focus on comparing different machine learning and data mining techniques for churn prediction in telecommunication companies (telcos)1, all of which support the hypothesis, through different experiments, that neural networks perform better than other prediction techniques when it comes to churn [5][6][7][8].

Spotify is a client-server music streaming service using a content delivery network (CDN) (formerly peer-assisted), with worldwide popularity, over 30 million tracks, and millions of users, reaching over 50 million premium subscribers2 as of March 2017 [9]. By analyzing the streaming behaviour of Spotify’s users who churned from premium (a paying product) to free, using different machine learning models which make use of the temporal aspect, certain patterns can be discovered and classified.

This would allow for preventive actions to be taken in order to extend the lifetime on premium of new subscribers. Throughout the thesis, the terms premium user and subscriber are used interchangeably and are equivalent, referring to users of the premium product, while free users refer to users of the free product.

Previous attempts exist within many services, relying on heavy feature engineering and non-temporal classification models; these can serve as a reference for the model performances, but not as a benchmark, since their details are unknown and the idea here is to test whether putting the same data in sequential form brings an improvement in the predictions. The classification model that uses sequential data can be further extended to use other static user-related data and to split the churning class depending on the length of the premium contract. This approach can be compared to survival analysis, which aims to discover when a user will leave a service. This opens the door for creating separate time-sensitive methods for retaining these users. The performance indicator that can be addressed is premium customer retention, which significantly affects and aims to improve the percentage of monthly active users who are paying subscribers, as well as their consumption time.

1.2 Problem

Customer retention, as a main objective of customer relationship management (CRM) [10] which is applied to many fields such as telecommunications, banking

1Churn in telcos is referred to in the background since it is the only area that is mature enough and has a number of publications.

2Users paying for an ad-free version of the product, with additional features only available to them.


and insurance, retail market, and mobile app market, is crucial, since keeping old customers and acquiring new ones at the same time can yield strong strategic advantages in competitive markets. Retaining existing users not only strengthens customer loyalty and creates positive word of mouth from brand advocates, but its cost is in some cases 20 times lower than the cost of customer acquisition [11]. For every provider whose business model is built upon the freemium concept, such as Spotify, premium users are more valuable than free users [12], implying that a better understanding of their behaviour is crucial. Free users also add a significant rights burden to freemium music services, which dilutes the revenue [12][13].

This means that early discovery of churning Spotify users is crucial in order to apply retention techniques and keep them as paying customers. The problem can be formulated as the question “Can certain distinguishable and characteristic temporal patterns be discovered that can label users as churning or retaining?”. In other words, would a classification model using activity data, favorite music genres, user demographics, payment behaviour, etc. gain any performance benefit if the part using activity data for the observation period was in a sequential form? Viewing the classification system as modular, the churn prediction model based on activity data can be isolated and its performance assessed based on the different types of data and classification techniques.

1.3 Purpose

The purpose of this thesis is to compare different classification model performances over temporal, flattened3, and aggregated4 user activity data (see Section 3.2 for further explanation). It follows the complete process from feature selection, getting and processing the data, and implementing the models, to evaluating their performance, including business-related implications5. The resulting written report aims to instruct the reader on building and implementing large-scale machine learning models, as well as on understanding the models’ theoretical basis (with a focus on the mathematics). The latter is usually neglected, resulting in machine learning models being used like a “black box” that “just works”; the author believes that understanding the algorithms in more detail helps in making more informed choices regarding the input data and the model architecture, and in inferring conclusions related to the output and the model performance. However, in order not to make the thesis very mathematically heavy and hard to read, certain parts are moved to the appendix and referenced accordingly.

3The time dimension is ignored, considering the same features at different time steps as different features.

4Aggregation is performed over the time dimension, summing up the features’ values from all the time steps.

5The business aspects presented do not necessarily align with the ones Spotify has, but are the author’s views which were presented in his minor thesis project within entrepreneurship.


1.5 Benefits, Ethics, and Sustainability

Both the company and their premium users will benefit from this data-driven problem-detecting approach. Solving these issues would both improve customer satisfaction, which is crucial for positive branding, and increase revenues. The users benefit from using a service that prioritizes them and cares about their experience, for which they pay.

For the purpose of the project, user activity data is used to train a classification model. However, all data is anonymized and there is no way to trace back the users solely from their activity, respecting their privacy. The data is used specifically for scientific purposes and no retention techniques are experimented with on real users.

Regarding sustainability, the United Nations adopted a set of 17 goals to end poverty, protect the planet, and ensure prosperity for everyone [14]. This research work aims to reduce manual, time-consuming analysis of user behavior by utilizing data centers that run on green energy, thus targeting Goal 12, i.e. to ‘ensure sustainable consumption and production patterns’. Google’s Cloud Platform data centers run on half the energy of a typical data center, and in 2017 Google will purchase 100% renewable energy [15]. Furthermore, by understanding user activity patterns better, energy-saving optimizations can be made on the app side, such as automatic shutdown or sleep, also in line with the same goal. However, a possible implication is that marketing job roles are put at risk, since the process of discovering churning users is partially automated.

1.6 Research Methodology

For the purpose of deciding the type of thesis research, a ‘portal’ with common methods and methodologies is used [16]. A qualitative approach is chosen, since the proposed hypothesis cannot be statistically proven, although the experiments use measurable variables and thus have a strong quantitative element. A realist philosophical assumption is taken, since data is collected from an observed phenomenon, which serves for understanding the underlying users and developing knowledge. Regarding the research method, applied research is used because the work involves solving a practical problem and builds on existing research, using real-world data. The research approach is inductive, i.e. theories and propositions are drawn from observations and patterns (in this case user activity data), while the research strategy is ex post facto, which is carried out after the data is collected and studies behaviours by searching back in time to find plausible causal factors.

As a master’s thesis student within the Data Research team in Spotify’s Stockholm office, the author’s main responsibility is to build a churn prediction model, covering every step from a literature review with an in-depth theoretical and mathematical study, through collecting and cleaning data, to shaping the prediction model accordingly (choice of model type, architecture, metrics, and hyperparameter search). Even though a supervisor and a co-supervisor were assigned, both from the New York office, guidance throughout the project also came from team-mates and colleagues from other teams with valuable insights. The loose hierarchy and the company’s inclusive culture contribute to the feeling of being part of the team, with participation in regular company events like daily stand-ups, meetings, hack days, learning seminars, and presentations.

1.7 Delimitations

Even though Spotify is used all over the world, for the scope of the project the chosen users reside in Sweden and joined premium for the first time in the period of one year (from 2015-08-01 to 2016-07-31). The choice of only taking users from Sweden follows from the fact that market saturation is country-specific and Spotify is already very well established in Sweden; it also guarantees one time zone for most users (except for those travelling or living outside of the country), which simplifies the per-day log extraction and time series modeling. As mentioned above, the goal is not to build a full churn prediction model, but to improve a module which gets user activity data as input and can be plugged into a complete prediction system.

The activity data is limited only to streaming behavior, neglecting other activity on the service that can include browsing for content, artists or users, curating playlists, and so on.

The normalization function utilized at the data preprocessing stage is merely a design decision and is not compared against alternative solutions; however, possible implications are discussed in Chapter 6. Regarding the training of the models, a few things are worth mentioning. A large part of the users had to be discarded because of class balancing (a procedure used to output a dataset with 50% retaining and 50% churning users) and non-active user removal (users with no activity in the complete observation period). Feedforward neural networks (FFNNs) were not used as a benchmark since they take longer to design and train, and random forests (RFs) were already proven to be a strong candidate to compare to. For the hyperparameter tuning, only certain ranges were used for random sampling, based on previous work in this area [17]. Because of the time constraint, an extensive search over the hyperparameter ranges could not be performed, meaning that the “optimal” set is not very likely to be found.
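
A minimal sketch of the balancing and filtering steps described above, assuming a pandas DataFrame with one row per user, a binary label column, and per-step activity columns; all names are illustrative and not the thesis code:

```python
# Hypothetical sketch: drop inactive users and downsample to a 50/50 class balance.
import pandas as pd

def balance_users(df: pd.DataFrame, activity_cols, seed: int = 42) -> pd.DataFrame:
    # Remove users with no activity over the whole observation window.
    active = df[df[activity_cols].sum(axis=1) > 0]

    churners = active[active["label"] == 1]
    retainers = active[active["label"] == 0]

    # Downsample the larger class so both classes have equal size.
    n = min(len(churners), len(retainers))
    balanced = pd.concat([
        churners.sample(n=n, random_state=seed),
        retainers.sample(n=n, random_state=seed),
    ])
    return balanced.sample(frac=1.0, random_state=seed)  # shuffle rows
```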


to this research thesis project.


Chapter 2

Relevant Theory

2.1 Dynamic Churn Prediction

In order for companies not to waste their retention efforts, accurate targeting of potentially churning users is vital. Typical churn prediction models neglect the temporal aspect, or in other words how the churn rate changes over time.

Dynamic churn prediction, according to [18], would allow companies to allocate retention and service improvement resources accordingly across groups of users likely to churn within different periods of time. This distinction allows for tailored retention strategies which consider the timing of churn and diagnose how much of the attrition can be controlled [19]. Since the window of possible churn can be relatively long, this method enables early detection of customers likely to churn, giving the company more time to detect the issues and convince their customers to stay [20].

Ali and Arıtürk [21] introduce a framework for generating training data and an approach with independently trained binary classifiers for dynamic churn prediction, increasing the predictive accuracy by using multiple training observations per customer from different time periods. The performance is better than that of approaches using a single observation per customer, both because the most recent data can be used and because horizon-specific predictors are allowed.

2.2 Motivation for Churn Prediction from User Activity Sequences

Fierce competition and saturation of marketplaces for fast-evolving companies introduce a necessity for the creation and nurturing of long-term relationships with customers. In these circumstances, keeping existing customers is more profitable and valuable than attracting new ones, who have a higher attrition rate and create lower revenues [2][22][23][24]. Churn prediction, as a predictive analysis technique aiming to point out possible churning users, is a way of steering CRM, becoming most companies’ customer-centric marketing strategy.

Rothenbuehler et al. [4] present an expert system for churn prediction and pre-


with most churn correspond to the states with the lowest activity, thus showing that motivation is related to churn and can be captured by using an HMM on activity time series. The HMM performs similarly to LR, artificial neural network (ANN), and SVM models. However, it has advantages in terms of storage and computational requirements, which is crucial for expert systems.

HMMs have been an important part of another sequential problem since the 1980s: speech recognition. They had the central role of modeling the sequence of phonemes in speech recognition systems, and until the late 00s neural networks were mostly used for learning extra features [29]. The first major breakthrough in this area came from Graves, Mohamed, and Hinton [30], who trained a deep long short-term memory (LSTM) recurrent neural network (RNN) and achieved a record low of 17.7% error on TIMIT [31]. Besides achieving the best results, this system was one of the first to completely abandon the usage of HMMs, becoming an end-to-end deep learning speech recognition system. This ANN architecture is explained in more detail in Section 2.4.

Spotify, as well as other big on-demand music streaming services, has focused on content and experience personalization for its users. Little work has been published related to user behavior in these kinds of services, so [32] presents a Spotify-specific empirical study of user sessions (arrival patterns, daily variations, session correlations) and device switching (multiple desktop and mobile clients) for premium users in Sweden, the UK, and Spain between 2010 and 2011. The main relevant findings include:

• Strong daily patterns in session arrivals, playback arrivals, and session length;

• Session arrivals in varying intervals (10 minutes and 1 hour) can be modeled as a non-homogeneous Poisson process;

• Strong inertia to continue successive sessions on same device;

• Most users have their favorite time of the day to use Spotify;

• Session length can be used as indicator for both the successive session length and downtime.

Session arrival is measured as the number of new sessions within 1-hour intervals, showing a strong daily pattern and significant variation of the hourly arrival rates. It is lowest around 2 am and increases sharply until 9-10 am (morning peak), after which there is a slight drop during the lunch break on weekdays. Activity then increases again, reaching the daily peak around 6-7 pm (evening peak). There is a one-hour shift ahead of the morning peak on weekdays for mobile usage, believed to be a consequence of commuting. As a weekend effect, the morning peak and lunch-break dip of mobile sessions disappear, along with the commuting effect.

Hourly playback arrivals, i.e. the number of playbacks in sessions that start each hour, are a proxy for user activeness in music streaming services; their daily pattern differs from that of hourly session arrivals (significantly more so for desktop sessions). Morning sessions are more active in terms of number of playbacks, which can be explained by the long session length during that period. The evening peak of playback arrivals is much less pronounced than the evening peak of session arrivals, which in turn can be explained by the short session length in the evening.

Session length, as another key feature describing important properties such as churn rates in P2P systems, exhibits slightly different daily patterns. The length of desktop sessions peaks in the morning only during weekdays and then decreases almost monotonically, behaving similarly for weekdays and weekends, believed to be because of utilizing Spotify for “background music” at work. Another observation is that mobile sessions are much shorter than desktop sessions, with a less pronounced morning peak, indicating that the usage pattern differs dramatically between user sessions on desktop and mobile.

2.3 Classification

Classification is a predictive data mining technique where the objective is to map observed data to a set of predefined types, which serve as labels on the training data.

The instances can be assigned to only one (single-label classification), or multiple classes (multi-label classification). Most commonly, the data is relational and can represent a set of known features of an object, or a sequence of features over time.

The most widely used classification models are logistic regression (LR) and random forest (RF). LR is a method which gives the probability of belonging to a categorical target value. For binary classification, the target value is usually encoded as 0 or 1, and the binary LR model estimates this value based on a set of independent features. The sigmoid function is the central element of this classification model, defined as:

\sigma(z) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}},    (2.1)

where z is a real value which can be a linear function of the input x (z = β_0 + β_1 x), and the output gives a value from 0 to 1, which can be interpreted as a probability. In this simple case, the idea is to find suitable values for β_0 and β_1 such that for every input x, the function’s output (probability) is always larger than the threshold (most commonly 0.5 for binary classification) for the correct target class.

Fig. 2.1 depicts this function for values of z from -10 to 10.


Figure 2.1. Sigmoid function σ(z).

RF is an ensemble learning method, which generates many decision tree classifiers and aggregates their results. A decision tree classifier uses a tree-like graph, where each node is a condition based on some feature which splits the dataset, aiming to have leaves with observations that belong to a single class. The conditions chosen for each split are determined based on how well they separate the data according to functions like information gain (the entropy of the parent, minus the weighted sum of the entropy of the children) and Gini impurity (the sum of the probability of an item with a certain target label being chosen, times the probability of a mistake in categorizing that item). After these condition nodes are set and the leaves labeled, new observations can be classified by going left or right in the tree depending on the condition fulfillment, starting from the root. Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [33].

The technique of bootstrap aggregating, or bagging, is used over the training set, creating sets of random samples and fitting a decision tree for each. Since certain features can be strong predictors, this might lead to them being selected in most trees, resulting in high correlation. This is fixed by the technique of feature bagging, where for each split only a subset of the features is considered for a condition.

For an ensemble of B trained classifiers, f_1, f_2, ..., f_B, the prediction on an unseen observation x′ is made by taking the majority vote of all the decision trees. The bootstrapping technique leads to better model performance since it decreases the variance without increasing the bias. The free parameters include the number of decision trees, the maximal tree depth, and the minimal number of samples needed for a node to be a leaf.
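
As a hedged illustration (not the thesis code), the two benchmark classifiers could be set up with scikit-learn roughly as follows; the data, feature count, and hyperparameter values are placeholders:

```python
# Sketch of LR and RF benchmarks on synthetic, non-temporal feature data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(1000, 20)                                    # 1000 users, 20 aggregated features
y = (X[:, 0] + 0.3 * rng.randn(1000) > 0.5).astype(int)   # made-up binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

lr = LogisticRegression().fit(X_tr, y_tr)            # sigmoid over a linear model
rf = RandomForestClassifier(n_estimators=100,         # number of decision trees
                            max_depth=8,              # maximal tree depth
                            min_samples_leaf=5,       # minimal samples per leaf
                            random_state=0).fit(X_tr, y_tr)

print("LR accuracy:", lr.score(X_te, y_te))
print("RF accuracy:", rf.score(X_te, y_te))
```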


2.3.1 The Time Series Classification Problem

Mining time series data can reveal important patterns, such as similarities [34], trends [35] or periodicity [36]. The single-label multivariate time series classification (TSC) problem is defined by the following elements [37]:

• A universe U of objects representing dynamic system trajectories or scenarios. Each object o is observed for some finite period of time [0, t_f(o)].

• Objects are described by a certain number of temporal candidate attributes, which are functions of object and time, thus defined on U × [0, +∞). The value of the attribute a at time t for the object o is denoted by a(o, t).

• Each object is furthermore classified into one class, c(o) ∈ {c_1, ..., c_M}.

The goal of a classification model is to find a function f(o) that only depends on attribute values, which is as close as possible to the true classification c(o) for a subset of objects from the universe U. The classification should not depend on the object itself or absolute time values; the model should be able to classify every unseen scenario whatever its duration of observation.

Numerous domains are in need of time series classification: speech recognition, medical signal analysis, gesture recognition, intrusion detection, biometric and cardiological classification, weather forecast, etc. The nature of the predictive problem (number of observations, interval size between two time steps, number of features per time step, number of classes) defines the type of model used to solve it. Two types of TSC approaches are distinguished: distance-based methods and feature-based methods.

Distance-based TSC requires some kind of pair-wise distance in order to infer similarity, which is then used with algorithms such as k-nearest neighbors (kNN) or SVMs with similarity-based kernels. One way is to use the Euclidean distance, whose efficient calculation made it ubiquitous [38][39][40][41][42][43], but it is known to be sensitive to distortion on the time axis [44][45][46][47][48]. Dynamic time warping (DTW), a method that allows non-linear alignments between two time series which might be similar but locally out of phase, addresses this problem and was brought to the database community by Berndt and Clifford [49]. DTW has been used to solve a vast number of problems in various disciplines, such as bioinformatics [50], chemical engineering [51], biometrics [52][53][54], handwriting recognition [55], robotics [56], and music [57]. However, the good performance comes at a great cost, as DTW does not scale well to large datasets because of its quadratic time complexity, making it hundreds or thousands of times slower than the Euclidean distance, depending on the length of the sequences. A depiction of the two approaches is shown in Fig. 2.2.
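
As an illustration of the two distance measures (a generic sketch, not the thesis code), a plain Euclidean distance and a basic quadratic-time DTW with squared point-wise costs can be written as follows:

```python
# Euclidean distance versus a simple O(n*m) dynamic programming DTW.
import numpy as np

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    # Assumes equal-length sequences; i-th point aligned with i-th point.
    return float(np.sqrt(np.sum((a - b) ** 2)))

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    # Standard DTW recurrence; allows non-linear alignment between the series.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return float(np.sqrt(D[n, m]))

a = np.sin(np.linspace(0, 3 * np.pi, 50))
b = np.sin(np.linspace(0, 3 * np.pi, 50) + 0.5)   # same shape, locally out of phase
print(euclidean(a, b), dtw(a, b))                  # DTW is never larger than Euclidean here
```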

Feature-based TSC characterizes each time series with a set of features, such as statistical, first and second order features [59], as well as complex features from


Figure 2.2. Euclidean distance produces a pessimistic dissimilarity measure by supposing that the ith point in one sequence is aligned with the ith point in the other. The non-linear DTW alignment allows a more intuitive distance measure to be calculated [58]. Figure adapted from Ratanamahatana and Keogh [58].

detrended fluctuation analysis (DFA)1 and spectral analysis2 [61]. These feature vectors can then be classified using any feature-based classifier (SVM, LR, ANN).

A different approach performs feature extraction based on signature subsequences of the time series known as shapelets [62]. One way to use the shapelets is by considering every time series as a dictionary and every shapelet as a word, utilizing a bag-of-words model. End-to-end neural network models that advance the state of the art exist as well, with automatic feature extraction followed by classification in a single framework [63], showcasing the potential of ANNs both in the encoding and the classification part.

2.4 Artificial Neural Networks

An ANN can be considered as a highly simplified model of the structure of a biological neural network. It consists of interconnected processing units (named artificial neurons) that calculate a weighted sum of their inputs, known as an activation value, which produces the output signal. The sign of the weight for each input determines whether the input is excitatory (positive weight) or inhibitory (negative weight). The interconnections between the neurons form a topology in order to perform a pattern recognition task, defining if a processing unit receives the inputs from other units’ outputs or from an external source [64].

The simplest version is called a feedforward neural network (FFNN) or a multilayer perceptron (MLP). Its goal is to approximate a function f* based on observed data x and target values y (or categories in the case of classification).

1DFA is a method for determining the statistical self-similarity of a signal [60].

2Spectral analysis is a method for calculating the power spectra, where resampling is performed using interpolation, from which the mean value and the standard deviation are subtracted before applying the fast Fourier transformation (FFT).


An FFNN defines a mapping y = f(x; θ) and learns the values of the parameters (weights) θ that result in the best function approximation. These models are called feedforward because the information flows in a single direction through the function being evaluated from x, through the intermediate computations (layers of artificial neurons) used to define f, and finally to the output y. The number of layers (input, hidden, and output) gives the depth of the model, hence the popular “deep learning” name, and the number of units (neurons) in each hidden layer determines the width of the model. The introduction of hidden layers requires choosing activation functions that will be used to compute the hidden layer values [29]. When FFNNs are extended to include feedback connections (outputs from units are fed back into themselves as inputs), they are called RNNs, explained in more detail in Section 2.4.1.

According to [29], each machine learning algorithm needs to have an optimization procedure, a cost function, and a model family. Neural networks cause loss functions to become nonconvex through their nonlinearity, meaning that iterative gradient-based optimizers only drive the cost function to local optima and have no convergence guarantee. This means that the “perfect” set of weights, also known as the global optimum, will not necessarily be found. This depends on many factors: weight and bias initialization, learning rate, momentum, type of regularization, activation functions, network architecture, etc. These factors are referred to as hyperparameters and need to be decided on before the neural network starts training, most commonly by a procedure called hyperparameter tuning (grid search, random search, or optimized model-specific search techniques).
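
A small, hypothetical sketch of the random-search variant of hyperparameter tuning mentioned above; the search space, ranges, and the train_and_evaluate callback are placeholders rather than the settings used in this thesis:

```python
# Random hyperparameter search: sample candidate settings, train, keep the best.
import math
import random

SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1),   # sampled log-uniformly from this range
    "hidden_units":  [32, 64, 128, 256],
    "num_layers":    [1, 2, 3],
    "dropout":       (0.0, 0.5),
}

def sample_config(rng: random.Random) -> dict:
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
        "hidden_units":  rng.choice(SEARCH_SPACE["hidden_units"]),
        "num_layers":    rng.choice(SEARCH_SPACE["num_layers"]),
        "dropout":       rng.uniform(*SEARCH_SPACE["dropout"]),
    }

def random_search(train_and_evaluate, n_trials: int = 20, seed: int = 0):
    rng = random.Random(seed)
    best_score, best_cfg = float("-inf"), None
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = train_and_evaluate(cfg)   # e.g. PR AUC on the validation set
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```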

For classification problems, the most common choice for the cost function J(θ) is the cross-entropy error function. Cross-entropy between two probability distributions p and q, in information theory terms, measures the average number of bits needed to identify an event drawn from a set of events, if the coding scheme used is optimized for q rather than p. Mathematically it is defined as

H(p_k, q_k) = -\sum_i p_{ki} \log q_{ki},    (2.2)

for every class i an input x_k can belong to. For single-label classification, the distribution p has a value p_{ki} = 1 only for the class i the item belongs to, while the distribution q gives the output of the classification model. If a high probability was given to the correct class, the cross-entropy will be a positive number close to 0 (low loss), while in the case the model gave a very low probability to the correct class, the value of the cross-entropy can approach +∞ (high loss). The complete model loss is simply the average over all classified samples:

J(\theta) = \frac{1}{N} \sum_{k=1}^{N} H(p_k, q_k).    (2.3)
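
A direct numpy transcription of Eqs. 2.2 and 2.3 (a sketch with made-up probabilities, not the thesis code) could look as follows:

```python
# Cross-entropy between one-hot targets p and model outputs q, averaged over samples.
import numpy as np

def cross_entropy_loss(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """p, q: arrays of shape (N, num_classes); rows of p are one-hot targets."""
    per_sample = -np.sum(p * np.log(q + eps), axis=1)   # Eq. 2.2 for each sample k
    return float(np.mean(per_sample))                   # Eq. 2.3: average over samples

# Example: two samples, two classes (retain = 0, churn = 1).
p = np.array([[1.0, 0.0], [0.0, 1.0]])
q = np.array([[0.9, 0.1], [0.2, 0.8]])    # model probabilities
print(cross_entropy_loss(p, q))           # low loss: correct classes got high probability
```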


Figure 2.3. Unfolding of an RNN with no outputs. Figure adapted from Goodfellow, Bengio, and Courville [29], page 376.

The training procedure in FFNNs consists of two parts: forward propagation, where the input x is fed into the neural network, producing an output ŷ and the scalar cost J(θ); and back-propagation (or simply backprop), allowing the information from the cost to flow backward through the network in order to compute the gradient and update the weights θ.

2.4.1 Recurrent Neural Networks

This section describes how an ANN’s architecture is changed when recurrent connections are added and what it can be re-purposed for. The goal of RNNs is to be able to take data in sequential form as input and make use of the temporal or sequential dependencies that might exist in the data. Their structure enables input in the form x(1), ..., x(τ), where each x(t) is a value or a vector that represents a set of values for the predefined feature(s) at a time step t. RNNs can scale to much longer sequences than would be practical for networks without sequence-based specialization. The concept of sharing parameters across different parts of the model is used with the help of feedback connections, making it possible to apply the machine learning model to examples of different forms (different lengths in this case) and to generalize across them.3

An RNN with no outputs is shown in Fig. 2.3 in order to understand the concept of states and how the feedback connection impacts them. A sequence of inputs x is processed and incorporated in a sequence of states h, which also depends on itself.

At a time step t, the state h(t) depends on the input at that step, x(t), and the state from the previous step, h(t−1). The number of hidden states is the same as the sequence length, but the number of outputs can vary. On the left there is a circuit diagram, where the black square indicates a delay of a single time step. On the right, the same network is depicted, only as an unfolded computational graph, where each node now depicts the input/state at a particular time step.

Depending on the problem the network ought to solve, there are different versions of RNNs, among which:

3This section heavily refers to Chapter 10 from [29].


Figure 2.4. An RNN with a single output. Figure adapted from Goodfellow, Bengio, and Courville [29], page 382.

• RNNs that produce an output at each time step and have recurrent connections between hidden units;

• RNNs that produce an output at each time step and have recurrent connections from the output at one time step to the hidden units at the next time step;

• RNNs with recurrent connections between hidden units, that read an entire sequence and then produce a single output.

The last type summarizes the whole sequence and is suitable for a task like classification, which is the main focus of the thesis, so it will be further discussed.

Fig. 2.4 depicts the unfolded computational graph of this version of RNN. The last step τ gives an output o(τ), which represents the unnormalized log probabilities of each possible value (class in this case). As a post-processing step, in order to get the normalized class probabilities ŷ(τ), the softmax operation is applied. Forward propagation starts with initializing the initial state h(0), and then for every step from 1 to τ, the states are acquired by applying the following:

h^{(t)} = g(U x^{(t)} + W h^{(t-1)} + b_h),    (2.4)

After the last state h(τ) is acquired, the output can be calculated by:


x paired with a target output y(τ) can be calculated as the negative log-likelihood:

L(\{x^{(1)}, \ldots, x^{(\tau)}\}, y^{(\tau)}) = -\log p_{\text{model}}(y^{(\tau)} \mid \{x^{(1)}, \ldots, x^{(\tau)}\}),    (2.7)

where p_model(y(τ) | {x(1), ..., x(τ)}) is given by reading the entry for y(τ) from the model’s output vector ŷ(τ). This forward pass is followed by a backward propagation pass moving right to left through the graph (Fig. 2.4), also known as back-propagation through time (BPTT).

2.4.2 Long Short-Term Memory

Hochreiter and Schmidhuber [65] targeted the problem of how error is propagated back in time in RNNs using conventional back-propagation through time (BPTT) [66][67] and real-time recurrent learning (RTRL) [68]. The error signals either blow up, causing oscillating weights, or vanish, drastically extending the time needed to learn long time lags or completely disabling that capability.

The disadvantages of other attempts at that time include practicality for short time lags only (time-delay neural networks [69], Plate’s method [70], Kalman filters [71]), external fine tuning (time constants [72]), addition of units proportional to the time lag size (Ring’s approach [73]), an unacceptable number of states (Bengio et al.’s approaches [74]), per-time-step operation complexity that is quadratic in the number of weights (second-order nets [75]), limitation to solving simple problems (weight guessing [76][77][78]), and intolerance of input noise (adaptive sequence chunkers [79]).

The proposed solution is a recurrent network architecture with a constant error carousel (CEC) as a central feature, combined with an improved gradient-based learning algorithm. The CEC, as a self-connected linear unit, is enhanced with a multiplicative input gate unit that protects the stored information from perturbation by irrelevant inputs, and a multiplicative output gate unit which protects other units from being perturbed by irrelevant memory content stored at the current step.

This complex unit is called a memory cell, denoted c_j. Besides the net input, c_j gets input from the two additional input and output gates, denoted in_j and out_j respectively, whose activations at each time step depend on the corresponding net input. The internal state of the memory cell at the current time step is computed by adding the “squashed” (by a variation of a sigmoid function, g) net input, multiplied by the input gate activation, to the previous time step’s state. Here, the input gate controls what input is considered relevant and has to be stored (“write” operations) according to its learned weights. The unit’s output at a given time is the result of multiplying the current “squashed” (by the same or a different sigmoid function, h) cell state and the output gate activation, controlling which information is accessible to other units (“read” operations).

A variant of RTRL [68] is used for the learning process, which takes into account the controlling mechanisms of the added input and output gates. To enable learning long lag dependencies, non-decaying error backprop through internal cell states is enabled by using a truncated version of back-propagation through time (BPTT) [80]. The errors arriving through all the net inputs (of the cell, the input and the output gate) serve only to change the incoming weights, but do not get propagated back further in time. The error is propagated only through previous internal cell states where it can flow indefinitely without being scaled, until it leaves the memory cell through an opening input gate.

The problem tackled in [81] is the disadvantage of having to segment the input sequence a priori and mark its start and end for the classical LSTM. If these markers do not exist and the input sequence is continual, the cell state can grow indefinitely, causing a network breakdown if not reset. Other possible solutions, such as weight decay [82][83], variants of “focused back-propagation” [84], and “teacher forcing” [85][86], do not work well for solving this problem.

The solution is to introduce another gate that controls the loop connection from the previous cell state to the current one, choosing what information is to be kept and what is to be discarded or forgotten, hence the name forget gate. It learns to gradually reset memory blocks once the information stored is out of date and useless for the general learning process. The forget gate activation is calculated like the activations of the other gates, with the logistic sigmoid as the squashing function.

This activation is the weight of the self-recurrent connection of the internal state, i.e. the previous state value is not added with a factor of 1, but with the forget gate activation, which is between 0 and 1. The change of the weights going to the forget gate is treated the same way as that of the weights to the cell and the input gate in the backward pass, using a truncated version of RTRL.

The cell structure of LSTM is very powerful when it comes to learning long-term dependencies, but it has certain requirements that need to be fulfilled in order for this bridging to work. For example, if the relevant event happened k discrete time steps ago, there has to be a marker input informing the network that its next action is crucial. Thus, the network does not learn to ignore the intervening k−1 steps, just to act when the marker is observed. In real-world scenarios, time sequences do not typically have markers to indicate relevant lags, but the idea of using LSTM is to internally represent these dependencies. The limitation of the traditional LSTM is that each gate receives connections from the input units and the outputs of all cells, with no direct connection from the CEC it is controlling, thus not being able to observe the internal state when the output gate is closed.

Adding weighted peephole connections from the CEC to the cell’s gates tackles this problem, enabling all gates to inspect the current cell state even when the output is gated close to zero. These connections are treated like regular connections


Figure 2.5. An LSTM block as used in the hidden layers of an RNN. Figure adapted from Greff et al. [17].

to gates, through which no error is back-propagated from gates to the CEC during learning. In this case, a two-phase update scheme is necessary for the output gate to be able to see the current value of the cell state via the peephole connections, already affected by input and forget gate. The first phase updates input gate activations, forget gate activations, and then cell input and cell state; in the second phase the output gate activation and the cell output are updated.

The equations for the traditional LSTM only need to be supplemented with corresponding update rules for the partial derivatives and weights associated with the peephole connections. LSTM with forget gates and peephole connections remain local in space and time, with a minor increase of complexity (three weights per cell).

The vector formulas for a vanilla LSTM layer forward pass are given below. Let x(t) be the input vector at time t, the U are rectangular input weight matrices, the W are square recurrent weight matrices, the p are peephole weight vectors, and the b are bias vectors. The functions σ, g, and h are point-wise non-linear activation functions: the logistic sigmoid (σ(x) = 1/(1 + e^{-x}), which transforms x to a value from 0 to 1) is used as the activation function of the gates, and the hyperbolic tangent (tanh(x) = (e^x − e^{-x})/(e^x + e^{-x}), which transforms x to a value from −1 to 1) is usually used as the block input and output activation function. Point-wise multiplication of two vectors is denoted by ⊙. The subscript shows which unit the weight matrix/vector refers to, and the superscript shows the time step of the activation. At a time step t, the block input is given by z(t) (Eq. 2.8), i(t) is the input gate activation (Eq. 2.9), f(t) is the forget gate activation (Eq. 2.10), c(t) is the cell state (Eq. 2.11), o(t) is the output gate activation (Eq. 2.12), and finally y(t) is the block output (Eq. 2.13) [17]:

z^{(t)} = g(U_z x^{(t)} + W_z y^{(t-1)} + b_z),    (2.8)

i^{(t)} = \sigma(U_i x^{(t)} + W_i y^{(t-1)} + p_i \odot c^{(t-1)} + b_i),    (2.9)

f^{(t)} = \sigma(U_f x^{(t)} + W_f y^{(t-1)} + p_f \odot c^{(t-1)} + b_f),    (2.10)

c^{(t)} = i^{(t)} \odot z^{(t)} + f^{(t)} \odot c^{(t-1)},    (2.11)

o^{(t)} = \sigma(U_o x^{(t)} + W_o y^{(t-1)} + p_o \odot c^{(t)} + b_o),    (2.12)

y^{(t)} = o^{(t)} \odot h(c^{(t)}).    (2.13)
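
The following numpy sketch is a direct transcription of Eqs. 2.8–2.13 for a single time step of one LSTM layer with peephole connections; the weight shapes and values are illustrative only:

```python
# One forward step of a vanilla LSTM layer with peepholes (Eqs. 2.8-2.13).
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, y_prev, c_prev, U, W, p, b):
    """U, W, b are dicts keyed by 'z', 'i', 'f', 'o'; p is keyed by 'i', 'f', 'o'."""
    z = np.tanh(U["z"] @ x_t + W["z"] @ y_prev + b["z"])                      # Eq. 2.8
    i = sigmoid(U["i"] @ x_t + W["i"] @ y_prev + p["i"] * c_prev + b["i"])    # Eq. 2.9
    f = sigmoid(U["f"] @ x_t + W["f"] @ y_prev + p["f"] * c_prev + b["f"])    # Eq. 2.10
    c = i * z + f * c_prev                                                    # Eq. 2.11
    o = sigmoid(U["o"] @ x_t + W["o"] @ y_prev + p["o"] * c + b["o"])         # Eq. 2.12
    y = o * np.tanh(c)                                                        # Eq. 2.13
    return y, c

rng = np.random.RandomState(0)
n_in, n_cells = 8, 16
U = {k: rng.randn(n_cells, n_in) * 0.1 for k in "zifo"}
W = {k: rng.randn(n_cells, n_cells) * 0.1 for k in "zifo"}
p = {k: rng.randn(n_cells) * 0.1 for k in "ifo"}
b = {k: np.zeros(n_cells) for k in "zifo"}

y, c = np.zeros(n_cells), np.zeros(n_cells)
for x_t in rng.rand(20, n_in):        # run 20 time steps of an illustrative sequence
    y, c = lstm_step(x_t, y, c, U, W, p, b)
```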

2.5 Performance Metrics

In order to present metrics that evaluate a binary classification model’s performance, some concepts ought to be introduced. Let there be a dataset with P positive instances and N negative instances of some condition. These groups of instances can be regarded as classes, which are not always balanced, meaning that there can be many more positive than negative instances (P >> N) or vice versa (N >> P). Throughout this project, the retaining users are negative instances, while the churning users are positive, with an N >> P imbalance. A classification model will label the instances as positive or negative, but not always correctly. A contingency or confusion matrix can be constructed based on the numbers of correct classifications (positive classified as positive, TP, and negative as negative, TN) and misclassifications (negative classified as positive, FP, and positive classified as negative, FN). From these values certain metrics can be calculated, which capture the performance of the model from different aspects.

Accuracy (Eq. 2.14) is the most common and general performance metric, which simply gives the fraction of correctly classified items, regardless of which class they belong to. Even though this metric gives an idea of how good a classifier the model is, it fails to account for highly unbalanced classes (e.g. if there are 99% positive instances and the model classifies all instances as positive, the accuracy will be 0.99 even though none of the negative instances were classified as such).

A = \frac{TP + TN}{TP + TN + FP + FN}.    (2.14)

Precision and recall are metrics commonly utilized in information retrieval systems, as well as for evaluating model performance. Precision (Eq. 2.15) measures how many of the instances the model classified as positive are correctly classified, and recall (Eq. 2.16) measures the fraction of positive instances that were actually classified as positive by the model.


Figure 2.6. An example plot of a Precision-Recall curve.

P = \frac{TP}{TP + FP},    (2.15)

R = \frac{TP}{TP + FN}.    (2.16)

Since these two metrics describe performance in different ways, a useful performance measure would be some kind of combination of the two. The F-measure or F1 score (Eq. 2.17) is defined as the harmonic mean of precision and recall:

F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2TP}{2TP + FN + FP}.    (2.17)

Another way of using precision and recall is plotting a Precision-Recall (PR) curve (Fig. 2.6), which gives the precision at different levels of recall, obtained by varying the classification threshold. The area under the PR curve (PR AUC), which ranges from 0 to 1, can be used as a single-valued performance indicator of the model.
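
A short sketch of how these metrics could be computed with scikit-learn (made-up labels and scores, with churn = 1 as the positive class; the PR AUC is approximated here by average precision):

```python
# Accuracy, precision, recall, F1, and PR AUC for a toy set of predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, average_precision_score)

y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.2, 0.6, 0.8, 0.3, 0.05, 0.9])   # churn probabilities
y_pred = (y_prob >= 0.5).astype(int)                            # 0.5 decision threshold

print("Accuracy :", accuracy_score(y_true, y_pred))    # Eq. 2.14
print("Precision:", precision_score(y_true, y_pred))   # Eq. 2.15
print("Recall   :", recall_score(y_true, y_pred))      # Eq. 2.16
print("F1 score :", f1_score(y_true, y_pred))          # Eq. 2.17
print("PR AUC   :", average_precision_score(y_true, y_prob))
```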


Chapter 3

Churn Prediction Methods

3.1 Long Short-Term Memory Time Series Classification

As was concluded in Section 2.3.1, end-to-end neural network systems that perform feature-based time series classification push the state of the art [63]. The feature extraction and classification are done by a multi-scale convolutional neural network (CNN) in three stages: transformation, local convolution, and full convolution. The local convolution stage extracts features, which are then concatenated in the full convolution stage, which produces the class probability distribution.

Another approach is to encode the time series using an LSTM RNN, which combines the feature extraction and the classification. What can be considered a feature in this kind of neural network is the “memory” stored in one cell. The positive side of this approach is that LSTMs are designed for learning from sequential data, so these so-called features can represent various sequence-dependent patterns. The random weight initialization of each cell aims to make the cells encode different information from the same sequence, resulting in different cell states and activations.

Since the cell activations are fed back into the same layer, as well as into the next one, the complexity of the patterns “memorized” increases with the network depth, going from general to specific. The complexity of the sequences dictates the depth of the LSTM. The inputs to one cell in the first layer are the weighted features from the current step and the weighted activations from the first layer’s previous step (these weights are specific to each cell and are learned during training). Let us suppose that one cell learns to activate after seeing a pattern of 0.8 for feature 1 at step 1, 0.2 for feature 4 at step 2, and 0.1 for feature 2 at step 2, and another cell learns to fire an activation after seeing 0 for feature 2 at step 1, 0.5 for feature 8 at step 2, and 0 for feature 3 at step 3. The cells at the second layer have these two weighted activations from step 3 (among others) and their own weighted activations from step 2 as input. If some of them learn to activate in case of the previous two activations, this can be interpreted as learning a more complex pattern (which is a merge of the two described above). Hence the conclusion that the deeper the network, the more


classified based on the class distribution after the last step. There is little research done on LSTM classification, so this work focuses on understanding the network structure for such tasks, forward and back-propagation, and comparing the performance to both RNNs and other classification approaches that are not tailored to utilize sequential data. LSTMs have already been proven to outperform RNNs in many tasks, including speech recognition, because of their capability to link long-term lags of thousands of steps. However, this does not mean that LSTMs ought to be expected to be suitable for this time series classification task, simply because of the simplicity of the data (no complex patterns to be learned), the length of the sequences (no long time lags to find dependencies for), and the lack of data (resulting in a smaller model whose weights can be trained properly, but which cannot generalize well). Another thing to keep in mind is that the chosen features might have the same predictive power whether aggregated (in this case summed up) over the complete observation window or granularized into small time steps, which depends on the nature of the problem to be solved.

3.2 Logistic Regression and Random Forest Benchmarks

In order to validate the predictive power of the features chosen for each time step of the sequence, an intuitive solution is to build classification models that are trained on the same features aggregated, or summed up, on a higher level. The transformed, now non-sequential, data contains the same observations and their corresponding labels, with a nullified time dimension. This variant of the data will be further referred to as aggregated data. Building classification models over it and outputting the performance can verify whether sequential data improves the predictive power of the data. For the purpose of benchmarking, binary LR and RF (introduced in Section 2.3) were chosen, which are the typical go-to approaches and the most widely used classification models for non-temporal data.

Another way to justify the usage of LSTMs, which work specifically with time series, is to flatten the features per time step into separate features for the complete observation period. Let each sequence consist of τ time steps, each described by k features. The flattened version of each observation would then have τ · k features, which can be considered independent, neglecting the time dimension. This variant of the data will be further referred to as flattened data. Now the same classification models, LR and RF, can be trained on this data, having the advantage of being given the same amount of data points as the sequential variant. The author acknowledges that this approach is not a standard machine learning procedure, since these kinds of models are designed for and work best on non-correlated data that is static (non-sequential) by nature, so that the model does not learn the dependencies between features instead of the target class. Each ith feature of the flattened data is highly correlated with the (i − k)th and (i + k)th features (the same feature at the previous and next step of the time series), as well as with the (i − S · k)th and (i + S · k)th features (in case a seasonality of S steps is present in the sequential data).
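
The flattening step itself is a simple reshape; the sketch below (with hypothetical array names and placeholder dimensions) shows how the sequential data could be transformed so that the same LR and RF models can be trained on it.

    import numpy as np

    # Placeholder sequential data: n_users sequences of tau steps with k features.
    n_users, tau, k = 1000, 30, 10
    X_seq = np.random.rand(n_users, tau, k)

    # Flattened variant: one vector of tau * k features per observation, where
    # feature i describes the same activity measure as features i - k and i + k,
    # just shifted by one time step.
    X_flat = X_seq.reshape(n_users, tau * k)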

3.3 Hypothesis and Evaluation

The hypothesis that was being tested and was driving this research can be formulated as:

Activity data in sequential form, describing premium users’ behavior and favorable opinion about the music streaming platform, coupled with a suitable sequence-based classification model (LSTM RNN), improves the churn prediction performance.

In order to validate the hypothesis, a series of experiments was performed to compare the following model performances:

• LR and RF models trained on aggregated data;

• LR and RF models trained on flattened data;

• RNN and LSTM models trained on sequential data with different time window lengths and features.

[Figure: flowchart showing the pipeline — data retrieved from logs, classes derived and data labeled, split into training, validation, and testing datasets; the aggregated and flattened data feed Logistic Regression and Random Forest, the sequential data feeds the Recurrent Neural Network and Long Short-Term Memory, and the resulting model performances are compared.]

Figure 3.1. Flow of the complete process of training models and comparing performances.


The predictive performance, model complexity, and execution time will be taken into consideration when choosing the most suitable data format and model for the defined problem.

3.4 Implementation and Setup

A parametrized Jupyter notebook¹ was implemented as a complete pipeline that retrieves Spotify’s data stored in BigQuery and transforms it for the purpose of visualization and training. Additional Jupyter notebooks were implemented for plotting and visualizing the data. The notebooks were running on a Google Cloud Platform (GCP) virtual machine (VM) instance with an Intel Haswell CPU platform and an NVIDIA Tesla K80 GPU, where the transformed activity data was stored.

The notebook kernel is Python 3; the standard SQL queries to BigQuery were implemented using the Google Cloud Datalab Python package datalab, the time series creation was done with the Python-to-R interface RPy2, and most of the visualization was implemented in Python on top of pandas dataframes. The data was then uploaded to a Google Cloud Storage (GCS) bucket using the gsutil CLI command, from where it could be accessed for training.

The classification models were implemented in Python 2.7: LR, RNN, and LSTM using TensorFlow (TF) version 1.0.1, with the corresponding summaries and metadata written to TensorBoard (TB), and RF using scikit-learn version 0.18. More theoretical background about how LSTM is implemented by TF can be found in Appendix A. The models were submitted to CloudML as training jobs (using the gcloud CLI command) with a basic single-GPU scale tier. Every time a job is dispatched, all the requirements are installed and the data is downloaded from GCS. The intermediate training outputs are written both to the logs and to an output file whose name is structured based on the hyperparameters the model was trained with. This output file, along with performance plots and the TB output, is stored back to GCS.
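
As a rough illustration of the TB part of this setup (not the actual training code), scalar summaries can be written per training step with the TF 1.x summary API; the metric tensors, values, and log directory below are placeholders.

    import tensorflow as tf

    # Placeholder metric tensors; in the real graph these would be the loss
    # and evaluation metrics of the model being trained.
    loss = tf.placeholder(tf.float32, name='loss')
    accuracy = tf.placeholder(tf.float32, name='accuracy')

    tf.summary.scalar('loss', loss)
    tf.summary.scalar('accuracy', accuracy)
    merged = tf.summary.merge_all()

    with tf.Session() as sess:
        # The log directory name can encode the hyperparameters of the run,
        # mirroring how the output file names are structured.
        writer = tf.summary.FileWriter('./tb/lr=0.001_window=30', sess.graph)
        summary = sess.run(merged, feed_dict={loss: 0.5, accuracy: 0.8})
        writer.add_summary(summary, global_step=0)
        writer.close()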

¹ A notebook that has parameters which can be given as arguments; in this case, by running the command-line interface (CLI) command jupyter-runner and passing a file with named parameters on every row as an argument.


Chapter 4

Activity-Based Premium Churn Prediction

4.1 Premium Churn

Spotify, as a worldwide on-demand streaming service, needs to tailor its products so that they reach more and larger customer segments. For this purpose, three types of products can be differentiated: free, premium, and bundle. The latter two are paid packages, but the indications of a user churning from a bundle contract cannot be isolated, which is the reason why this group is not taken into consideration.

The premium product has a few versions (with the standard monthly prices at the time of writing given in Swedish crowns in brackets): organic (99 kr), student (49 kr), and family (149 kr). Certain discounts are offered, such as 3 months for 9.99 kr, to showcase the premium alternative and its benefits to users, with the idea that they will start paying the full price afterwards. Users on a discounted offer will not be analyzed, since a large portion join without the intent of staying after the trial. The student and family products have many external factors affecting the user’s journey through different products (one stops being a student, family members move out, ...), so only organic premium, further referenced simply as premium, is selected for the purpose of churn analysis. With all this stated, premium churn, or just churn, is defined as the action of a premium user moving to a free product (not to other premium or bundle products), or away from the service.
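
Purely as an illustration of this definition (the column names and values below are hypothetical, not Spotify’s actual schema), churn labels could be derived from subscription snapshots as follows:

    import pandas as pd

    # Hypothetical subscription snapshots: the product held during the
    # observation window and the product held after it.
    subs = pd.DataFrame({
        'user_id':        [1, 2, 3, 4],
        'product_before': ['premium', 'premium', 'premium', 'premium'],
        'product_after':  ['premium', 'free', 'none', 'bundle'],
    })

    # Premium churn: a premium user moves to the free product or leaves the
    # service entirely; moving to another paid product does not count as churn.
    subs['churned'] = subs['product_after'].isin(['free', 'none']).astype(int)
    print(subs)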

It should be noted at this point that the prediction model takes activity data as input and outputs a probability of churning. However, lack of activity is not necessarily correlated with ceasing to pay, and, conversely, heavy usage of the service does not indicate that one will keep paying for the product. There are many external factors affecting this decision, one of the most important and obvious being the user’s financial situation.
