
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

RNN-based sequence prediction as an alternative or complement to traditional recommender systems

PIERRE GODARD

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


RNN-based sequence prediction as an alternative or complement to traditional recommender systems

PIERRE GODARD

Master in Machine Learning
Date: August 16, 2017
Supervisor: Hedvig Kjellström
Examiner: Danica Kragic

Swedish title: RNN-baserad sekvensförutsägelse som ett alternativ eller komplement till traditionella recommender-system

School of Computer Science and Communication


Abstract

Recurrent neural networks have the ability to grasp temporal patterns within data. This property can be used to help a recommender system better take a user's past history into account. Still, the dimensionality problem that arises within the recommender system field also arises here, as the number of items the system has to be aware of is typically very high.

Recent research has studied the use of such neural networks at the level of a user's session. This thesis instead examines the use of this technique on a user's whole past history, combined with techniques such as embeddings and softmax sampling in order to accommodate the high dimensionality.

The proposed method results in a sequence prediction model that can be used as is for the recommendation task or as a feature within a more complex system.


Sammanfattning

Recurrent Neural Networks har möjlighet att förstå tidsmässiga mönster i data. Det är en egenskap som kan användas för att hjälpa ett rekommendationssystem att bättre ta hänsyn till användarens historia. Problemet med dimensionalitet inom rekommendationssystem uppstår dock även här, eftersom antalet artiklar som systemet måste känna till är mycket stort.

Nyare forskning har studerat användningen av sådana neurala nätverk på en användares sessionsnivå. Denna avhandling undersöker snarare användningen av denna teknik på nivån av en användares hela tidigare historik, i kombination med tekniker som inbäddningar och softmax-sampling för att hantera den höga dimensionaliteten.

Den föreslagna metoden resulterar i en sekvensprediktionsmodell som kan användas som den är för rekommendationsuppgiften eller som en komponent i ett mer komplext system.

Contents

1 Introduction
1.1 Background and Objective
1.2 Research Question and Methods
1.3 Evaluation and News Value
1.4 Sustainability, Ethics and Societal Aspects

2 Related Work
2.1 Collaborative Filtering Techniques
2.1.1 Nearest Neighbours
2.1.2 Matrix Factorization
2.1.3 Deep Learning Approaches
2.2 Session-based recommendations

3 Background
3.1 Neural Networks
3.1.1 Artificial Neurons
3.1.2 Neural Network Layers
3.1.3 Backpropagation
3.2 Elements from Natural Language Processing
3.2.1 Embeddings
3.2.2 Speeding up the softmax computation

4 Method
4.1 Tools
4.1.1 IBM SPSS Modeler
4.1.2 Python
4.1.3 Word2Vec
4.2 Data
4.2.1 Description
4.2.2 Preparation
4.3 Model
4.3.1 Embedding model
4.3.2 Item sequence prediction model

5 Evaluation
5.1 Baseline
5.2 Metrics
5.2.1 Performance metrics
5.2.2 Measuring time
5.3 Results
5.3.1 Experiment E1: Product category prediction
5.3.2 Experiment E2: Product reference prediction using full softmax
5.3.3 Experiment E3: Product reference prediction using sampled softmax and a sample size of 1000
5.3.4 Experiment E4: Product reference prediction using sampled softmax and a sample size of 10000

6 Discussion and Conclusion
6.1 Discussion
6.2 Challenges
6.3 Future Work
6.4 Conclusions

Bibliography

Chapter 1

Introduction

This chapter first presents the background in which the project takes place. Then a few details regarding the implementation of the project are given. Finally, some thoughts about sustainability, ethics and societal aspects of the project are offered.

1.1 Background and Objective

This master's project's objective is to contribute to setting up a recommender system that suggests products of interest to customers. The system's structure has to be designed with a strong will to make use of the sequential patterns that lie beneath each customer's behaviour. Indeed, the presented system will be implemented for Rexel, a B2B (business-to-business) distribution company whose customers are professionals, primarily involved in building and public works. As professionals, those customers conduct their own projects and have specific needs which vary over time, depending on their profile.

The recommender system research field was popularized by the success of the Netflix Prize [1] [2]. Since then, the field has become very prolific, mainly in terms of collaborative filtering on items explicitly rated by the user. This kind of feedback given by the user is also called explicit feedback. Indeed, the Netflix dataset [1] was one of the first big releases of a real-world, at-scale dataset.

But that explicit feedback concept is notably different from the data available for this project, most of which consists of transactions recorded over the last 5 years and logs from the company's webshop. Moreover, additional information is available depending on the context the recommender system will be used in. Another way to meet the expectations and be able to suggest the next best offers is to treat it as a sequence prediction problem, which should be well suited to this B2B context, where customers are likely to purchase products depending on the progression of their own running projects.

Here is a quick summary of the possible applications a recommender system could be used for in the company:

• On the webshop, when a user is on a specific web page, the system should recommend the 5 products she is most likely to add to the basket and purchase next. This could take into account contextual information, e.g. recent navigation or the basket content.

• When a user uses the search engine to find a product, the ranking could give more weight to what she is likely to be looking for.

• The marketing team should be able to visualize what kind of products a customer should have bought considering her behaviour (but actually did not).

These are quite ambitious objectives, and what is called a "recommender system" is actually a system composed of multiple smaller components. Some parts of it have already been implemented and some other parts are not within the scope of this project. Here is a list of the different components that have been or will be addressed:

• Cross-sell: recommends products that are bought together with what the customer already has in her basket or is currently looking at.

• Substitution: recommends products based on how similar they are to each other.

• Next best offer: recommends what the user should buy considering what she has been buying recently.

The cross-sell component has already been implemented to some extent. The substitution component will be addressed by another person within the team. What this project is about is the next best offer component.

1.2 Research Question and Methods

This project consists in developing a method for greatly improving the accuracy of product recommendations based on the customers' past purchases (or, more generally, their past actions). Within the recommender system field, the use of the user's history is one of the most popular ways to make accurate suggestions: a user is very likely to act in the future similarly to how she acted in the past. The contribution of this project is to suggest a way to emphasize this aspect by considering that history sequentially.

As the amount of data is very large, it is very challenging to come up with a fully scalable algorithm. Moreover, as this is the first set-up of such a recommender system in the company, it is impossible to know in advance how the available data will behave and whether there is information to be extracted from it. As a reminder, Rexel is a B2B (business-to-business) company, not a B2C (business-to-consumer) company like, for instance, Amazon. One has to be cautious about what this implies, as the B2B world is not as intuitive as the B2C world can be. Finally, the problem is not limited to just setting up a recommender system. This project and its results will give the whole team a better view of the research field, as the recommender system topic is a very active one with many different approaches being studied.

Concerning the algorithms used in this project, I will often compare the already implemented cross-sell with the sequence prediction approach, using elaborate recurrent neural networks (such as LSTMs, see Section 3.1.2). Nevertheless, one has to keep in mind that recommender systems launched in production often rely on many different models for the overall recommender to perform better, depending on the strengths and weaknesses of each approach. The method suggested in this project has to be understood as a potential component of a greater system.

As mentioned before, the data used here can come from a lot of different sources, but the main input consists of raw transactional data. Therefore, business knowledge and proximity to business people are very important in order to properly understand how the recommender system should work.

The sequence prediction approach is suspected to be quite relevant, and the use case here looks like a word prediction problem from natural language processing, where each product is a word and the purchase sequences are sentences. The goal is to conclude on the validity of such an approach to recommender systems. Even if the suggested method may be far from more classical approaches to recommender systems, this project is an occasion to take a step back and get an overview of the field and of how the suggested approach relates to it.

1.3 Evaluation and News Value

In the first place, the results of the approach proposed here, next best offer, will be compared to those of the currently used cross-sell approach. Therefore, offline evaluation methods will have to be set up. Ultimately, the results should be examined by a test population of commercial agents, as the final objective is to integrate the system into a business interface. Subjectively, the recommender system is considered successful if the users are happy with it. This should be measurable objectively with some online metrics such as the click-through rate on the recommendations.

The people first impacted by this project are the customers themselves. But the commercial agents could find a particular interest in it too, as it can help them better follow the customers' expectations and standard behaviours.

1.4 Sustainability, Ethics and Societal Aspects

As a system built to interact with the customers and potentially alter their behaviour, setting it up raises some ethical questions.

The most obvious one concerns the customers' privacy with regard to the law. This is an important point when it comes to aggregating personal information about users. Many services such as Netflix or Spotify make extensive use of user data in order to suggest content that matches the customer's taste.

What is different in a company such as this one is that most of the users are not mere consumers but professionals. One could then think the privacy policy does not apply to the same extent to a professional as to a private customer. The right to privacy is strongly promoted within the European Union from an individual point of view, and the fact that both a private individual and a professional can create an account forces the company to be as transparent as possible about what is done with their data.

This implies, for example, letting the customer know what data is collected, how long it will be kept and what the data is used for. At some point the customer should be able to accept or refuse the collection of their data. Concerning this matter, some services on the web make the acceptance of such things a requirement for using their services, other actors activate the collection by default, and some others make the choice very explicit to the user.

Whatever the way it is done, once the data is collected it should be securely stored. This serves both the customer's right to privacy and the secrecy of the company's business details.

The nature of the collected data itself also matters. For example, storing a customer's purchase history for at least some years sounds completely normal for a product retailer, as does keeping some essential information regarding the contact details of the professional clients. On the other hand, is it as important for a company to collect the web behaviour of a customer? There is some sort of sweet spot between those two extreme cases, and defining exactly where this spot lies looks like a difficult matter that should be clearly settled by the law.

The second question raised is the exact impact of a recommender system on someone who uses it.

The exact purpose of a recommender system is to suggest content of interest to customers. Suppose the recommender system has just been set up and the customer has just started using it. That system will try to advise the customers based on what their behaviour was in the past, before it was set up. But now, if the recommender system is done correctly, the users start using it and their behaviour changes with it.

After some time, re-fitting the system will train it on that new user behaviour. The behaviour may be only slightly changed, but it is still a quite difficult effect to detect.

One of the drawbacks of such a phenomenon is that the user is then susceptible to being stuck within a small set of suggestions that perhaps perfectly matches her original profile, but will never make her evolve. As a personal example, I often use the music recommendations on a music streaming platform. When I started using the service, I discovered various artists similar to the ones I enjoy. But as time passed and as I nearly exclusively listened to these recommendations, I am now very often suggested artists or even songs that have already been suggested to me many times. In my case, I think that functionality greatly lacks the serendipity that would allow me to keep discovering new songs and artists. But this is probably not the case for everyone, as some people may really like to stick to their favourite songs.

This is an example relatively far from the case of a retail company, but the fact remains that a behaviour affects the recommender system, which affects that behaviour back. This is a kind of loop that slowly converges to a stable position that neither the customer nor the company may wish for.


Chapter 2

Related Work

Recommender systems basically consist in suggesting n items s to a user c such that the user is likely to have some interest in these items.

The recommender systems field early became an important research area, as it has been proven to be useful in many different applications. From e-commerce product recommendation to video recommendation on streaming platforms, recommender systems help users find their way through huge amounts of information by showing them relevant suggestions.

Adomavicius and Tuzhilin [3] summarized the problem by defining a utility function such that for each user c ∈ C, we choose the item s'_c whose utility u_{c,s} to the user c is the highest:

\forall c \in C, \quad s'_c = \arg\max_{s \in S} u_{c,s}    (2.1)

This paper [3] also describes the recommender systems field as historically divided into two rough categories of approaches: content-based methods and collaborative filtering.

Content-based methods consist in using the different properties of the items the user liked in order to recommend new items which are considered similar and are thus likely to be useful to the user. For example, if a user A just read some thriller novels that take place in the early fifties in New York, we will recommend to that user more novels that best match this description. We consider here that by suggesting new items very similar to those the user used to like, she is very likely to like them for the same reasons she liked the previous ones.

Collaborative filtering approaches are great at serving personalised recommendations to users. Here we want to use the information made available by other users to help recommend relevant items to a specific user. Let us keep the example of user A, who is fond of thriller novels. Another user, user B, also liked many books user A read. Then, as user B also liked some other books user A did not read, these books might be of interest to user A.

Neither approach surpasses the other, as they both have their own pros and cons. In order to create a recommender system that behaves best in all cases, one can combine these different approaches to strengthen the overall recommendation power of the system. This is simply called the hybrid approach.


Recommender systems are built using all available information about the items to be suggested. One of the major data sources is the available information about the item's nature, such as its genre if it is a movie. Another important source is how the users interacted with the items in the past. This is the feedback a user gives on items. One can generally distinguish two different kinds of user feedback [4].

Explicit feedback: the user gives an explicit judgment on some items she has been interacting with. For instance, on a video streaming website like YouTube, the user can explicitly give feedback on a video with a 'thumbs up' button if she liked it or a 'thumbs down' button if she disliked it [5].

Implicit feedback: the user, through her behaviour, indicates how much she appreciates an item. Staying with the case of the YouTube platform, a user who watched a video in its entirety is likely to have enjoyed it [5]. Additionally, the comments a user leaves on a video could be processed with sentiment analysis to determine the user's opinion about the video.

This is an important distinction, because most of the focus of the field has been oriented toward explicit feedback. That is why most papers use the term ratings, with the assumption that the items the user is likely to rate highly are also those with the most utility for this user.

In general, one of the major problems recommender systems that make use of feedback have to face is called the cold-start problem. We can differentiate the case where we have to recommend items to a new user and the case where we have a new item in our system [3].

It is difficult to make accurate suggestions to a new user, as we generally do not have enough feedback from that user to identify similar users.

Similarly, there will be a problem with new items. For example, if on an e-commerce webshop we have a brand new product collection we would like to put forward, a collaborative filtering approach will have difficulties suggesting it to users, because no one has given feedback on these items yet. These items are then essentially unknown to the system.

Moreover, the collaborative filtering approach is based on recommending similar items to similar users. What about users that do not behave like any other? As such a user profile is very different from the others, it is very likely that most of the items even the most similar users have interest in will not be of interest for this particular one.

One possibility to palliate these problems is to hybridize our recommender system with other techniques so that they compensate for each other's weaknesses. Therefore, a lot of work and studies have been conducted regarding the best way to combine collaborative filtering and content-based approaches [6].

We will next focus on that collaborative filtering approach and review some popular techniques that emerged within that field.

2.1 Collaborative Filtering Techniques

The collaborative filtering part of the recommender system field has increased in popularity in the last decade. Starting in the nineties, the topic gradually evolved as new techniques appeared.

In general, one of the major problems the collaborative filtering approach has to face is called the cold-start problem [3]. We can differentiate the case where we have to recommend items to a new user and the case where we have a new item in our system.

It will be difficult to make accurate predictions for a new user, as we generally do not have enough feedback from that user to identify similar users. This is a problem not only for the collaborative filtering approach but for content-based methods too.

2.1.1 Nearest Neighbours

Among the most famous early techniques are the user-based and item-based nearest neighbours methods. They are quite popular partly due to the fact that they are rather intuitive compared to other techniques. The main point of these methods is to define a similarity measure between users (or items) in order to perform a nearest neighbour search between the current user (or item) and the rest.

Herlocker and Konstan [7] estimated the utility of an item s for a user c using a user-based similarity approach as follows:

\hat{u}_{c,s} = \bar{u}_c + \frac{\sum_{c' \in N_c} (u_{c',s} - \bar{u}_{c'}) \times sim(c, c')}{\sum_{c' \in N_c} sim(c, c')}    (2.2)

with N_c the K nearest neighbours of the user c, \bar{u}_c the mean utility of the items the user has already interacted with (by giving a rating, for example) and sim(c_1, c_2) a similarity measure between the user c_1 and the user c_2. That similarity measure can be, for example, a Pearson correlation (a correlation metric based on the covariance of the variables) or a cosine similarity, and it is also used to find the so-called K nearest neighbours that are utilized in the utility estimation.

Sarwar et al. [8] detail another approach that makes use of item-item similarity instead; this is the item-based method. Depending on the dataset, this alternative technique can result in speedups and sometimes better estimations.
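To make Equation 2.2 concrete, here is a minimal numpy sketch of the user-based estimation using a cosine similarity; the toy rating matrix, the variable names and the value of K are illustrative assumptions, not values taken from this thesis.

```python
import numpy as np

def user_based_estimate(R, c, s, K=2):
    """Estimate u_{c,s} as in Eq. 2.2 with K user-based nearest neighbours.
    R is a (users x items) utility matrix where 0 marks a missing rating."""
    rated = R > 0
    means = R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)   # mean utility per user
    sims = np.array([np.dot(R[c], R[u]) /
                     (np.linalg.norm(R[c]) * np.linalg.norm(R[u]) + 1e-9)
                     for u in range(R.shape[0])])               # cosine similarity to user c
    sims[c] = -np.inf                                           # exclude the user herself
    neighbours = np.argsort(sims)[::-1][:K]                     # the K most similar users
    num = sum(sims[u] * (R[u, s] - means[u]) for u in neighbours if R[u, s] > 0)
    den = sum(abs(sims[u]) for u in neighbours if R[u, s] > 0)
    return means[c] + num / den if den > 0 else means[c]

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 5, 4]], dtype=float)
print(user_based_estimate(R, c=0, s=2))   # estimated utility of item 2 for user 0
```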

2.1.2 Matrix Factorization

The matrix factorization techniques came up afterwards and gained major visibility partly thanks to the Netflix Prize [1]. The Netflix Prize was a competition for which the well-known movie streaming company disclosed a large dataset of movies and user ratings for the community to compete on. It resulted in a boost for the recommender system research field, partly thanks to such a large dataset being available.

Koren et al. [2] detail the basics of the method. As the input dataset is a user-item matrix that can be quite large, it can be split into several smaller matrices. In general, from the input matrix U two smaller matrices are created: one matrix C for the users and one matrix S for the items. These two matrices represent the users and the items in a latent space whose dimensions are much smaller than the total numbers of users and items. This factorization is very similar to a Singular Value Decomposition (SVD). Then, by multiplying them together, we obtain one big matrix whose dimensions are the same as the input matrix's. That new matrix can be seen as an estimation of the ideally filled utility matrix: because the information has been compressed into a latent space that captures the general trends in the data, the reconstructed matrix fills in the unknown values.

\hat{U} = C^{\top} \times S

The main difficulty here is then to find those two matrices in a reasonable time. Koren et al. [2] mention two methods: stochastic gradient descent, which is a fast and popular method, and alternating least squares, which can be easier to parallelise and is better suited to the use of implicit data. Indeed, while explicit data consists of a small amount of feedback, because the user will very likely rate only a few items, implicit data is much denser, as it can contain data from all the small interactions the users had with the items. This is the approach suggested by Hu et al. [4].
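As an illustration of the stochastic gradient descent variant, here is a minimal numpy sketch that learns the two factor matrices on the observed entries only; the latent dimension, learning rate, regularization and toy matrix are illustrative assumptions.

```python
import numpy as np

def factorize(U, k=2, lr=0.01, reg=0.02, epochs=500):
    """Learn C (k x users) and S (k x items) so that C^T S approximates U,
    using SGD on the observed (non-zero) entries only."""
    n_users, n_items = U.shape
    rng = np.random.default_rng(0)
    C = rng.normal(scale=0.1, size=(k, n_users))    # user factors
    S = rng.normal(scale=0.1, size=(k, n_items))    # item factors
    observed = [(u, i) for u in range(n_users) for i in range(n_items) if U[u, i] > 0]
    for _ in range(epochs):
        for u, i in observed:
            err = U[u, i] - C[:, u] @ S[:, i]       # error on one observed utility
            C[:, u] += lr * (err * S[:, i] - reg * C[:, u])
            S[:, i] += lr * (err * C[:, u] - reg * S[:, i])
    return C, S

U = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 5, 4]], dtype=float)
C, S = factorize(U)
print(np.round(C.T @ S, 1))   # reconstructed estimate of the utility matrix
```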

Koren [9] proposes a method called SVD++ to add temporal dynamics directly within the model instead of just using the classic sliding window techniques. This is based on the intuition that users' interest in items can change over time.

This is one advantage of the matrix factorization method: it can be improved with many different additional data sources and tweaked to add advanced features and contextual data [10].

Rendle et al. [11] and Takács and Tikk [12] adapt those techniques with the idea of personalized ranking. Instead of predicting the utility, they try to directly optimize the item ranking, as the recommendation problem can be seen from a search engine's perspective.

2.1.3 Deep Learning Approaches

The use of neural networks is not something new in the recommender system field [3]. However, such techniques were not very popular compared to other collaborative filtering approaches.

Nevertheless, since 2012, deep neural networks and deep learning have achieved very good results in many different contests [13]. In the following years they have shown very good accuracy in many different tasks. Inevitably, deep learning methods were quickly tested on recommendation tasks.

These approaches are still very close to the classical collaborative filtering approaches we just reviewed. Wang et al. [14] propose a collaborative deep learning method originally based on stacked denoising autoencoder networks.

Covington et al. [5] present another approach that makes extensive use of Rectified Linear Unit neurons (ReLUs), a commonly used type of neuron within the deep learning field, and of sampling.

Cheng et al. [15] emphasize the usefulness of combining wide and deep networks in order to increase the network's capability to memorize, with a wide range of neuron units combining the input data, while still keeping a deep portion in the network for its capability to generalize.

From a historical perspective, these deep learning methods can be viewed as a generalization of the matrix factorization technique, as they result in user and item representation vectors that lie in a latent dimensional space. One major advantage of such neural network approaches is their ability to take many different kinds of data as input.


2.2 Session-based recommendations

Recurrent Neural Networks (RNNs) are a special kind of neural network in which the data the model has previously seen is memorized through time. In practice, Long Short-Term Memory networks (LSTMs) [16], a special kind of RNN, are preferred to simple fully connected RNNs. This is because LSTMs are able to capture long-term patterns in a long-term memory thanks to the use of gate layers, in addition to a short-term memory just like in simple RNNs.

An alternative to LSTMs are Gated Recurrent Units (GRUs), which fundamentally work in a similar way. Chung et al. [17] showed that they were not clearly better than LSTMs.

Hidasi et al. [10] and Tan et al. [18] view the recommendation problem from a session-based perspective and then make use of RNN techniques. This is justified by the fact that many e-commerce websites do not keep track of user behaviour beyond a session level. The idea is then to take advantage of all the available click history in order to generate recommendations.

This is indeed very similar to the objective of this project. We first want to grasp the sequential patterns, not at a session level, but rather at a client level, and then eventually use this historical information in a deep collaborative filtering system in future work.


Chapter 3

Background

This chapter presents the background knowledge required for an in-depth understanding of the implementation of this project.

3.1 Neural Networks

Artificial neural networks [19] are an ensemble of algorithms whose initial motivation was to mimic the interactions between biological neurons in the brain. Such a network is an ensemble of many artificial neurons, small computing units, whose interactions create a complex system.

The one-way interaction between two artificial neurons is called a connection. Each connection between two neurons is more or less strong. The weight of the connection represents how strong the connection is; a large weight means a strong connection.

Generally, neurons are grouped together into layers, and layers are stacked to create complex artificial neural networks. In classical multilayer perceptron neural networks, each layer's neurons are fully connected to every neuron of the previous layer, but more subtle topologies exist.

Once the network architecture is defined, it is possible to train it. This is done first by generating output from the training data; this is called the forward pass. Then we compute the gradients used to correct each connection's weight; this is called the backward pass, or backpropagation, for backward propagation of the error. It is important that each part of the network is differentiable. These two steps constitute a training period. In order to train the network, several periods are often necessary.

We will next describe the general structure of artificial neural networks and then detail the particular cases of recurrent neural networks and long short-term memory networks.

3.1.1 Artificial Neurons

Figure 3.1 shows a typical example of the structure of a single artificial neuron unit. The inputs x_i, i ∈ [1, 5], and a bias component b are combined with a weight vector w through a scalar product. The resulting value is then passed to an activation function σ which produces the output value y. Mathematically, these computations can be expressed as follows:

Figure 3.1: A single artificial neuron unit. The inputs x_1, ..., x_5 and the bias b are combined with the weights W and passed through the activation function σ to produce the output y.

y = \sigma\left(\sum_{i=0}^{5} x_i \times w_i\right)    (3.1)

For convenience, the bias term has been written as x_0. In practice, we do the computations on whole layers; the weight vector w then becomes a weight matrix W.
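As a small illustration of Equation 3.1, here is a numpy sketch of a single neuron and of a whole layer, using the logistic sigmoid as activation; the input values and weights are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# single neuron: y = sigma(sum_i x_i * w_i), with x_0 = 1 carrying the bias
x = np.array([1.0, 0.5, -1.2, 0.3, 0.8, -0.1])   # x_0 = 1, then x_1..x_5
w = np.array([0.1, 0.4, -0.3, 0.2, 0.7, 0.05])   # w_0 plays the role of the bias
y = sigmoid(np.dot(x, w))

# a layer of 3 such neurons: the weight vector becomes a weight matrix W
W = np.random.randn(3, x.size)
layer_output = sigmoid(W @ x)
print(y, layer_output)
```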

Activation Functions

We mentioned earlier the activation function that is applied to the raw output of the neuron. There exist many different activation functions that can have different uses, depending on the task to accomplish. In general, activation functions are mainly used to break the linearity of the model. Indeed, as matrix multiplications and scalar products are linear operations, the overall complexity our system would be able to achieve with a linear activation function would be no better than that of a simple linear model, even with many neurons or layers.

Note that the activation functions have to be differentiable, so that it is possible to learn the weights of the neuron through backpropagation.

Sigmoids Two of the most used activation functions are both sigmoids: the hyperbolic tangent function and the logistic function. The main difference between the two is that the former's output values range from −1 to 1 whereas the latter's range from 0 to 1. This slight difference makes, for example, the logistic function more suitable for multiclass classification, where we need to assign samples a probability, between 0 and 1, for each class.

Hard sigmoid The hard sigmoid function is merely a piecewise-linear approximation of the logistic function that is faster to compute.

Rectifiers The rectifier activation function is, in its simplest form, a linear function that clips its negative values to 0. When it is used as the activation function, the neuron is called a Rectified Linear Unit (ReLU). It is widely and efficiently used in deep neural networks and deep image processing in general. Indeed, it has some advantages compared to the classical sigmoids. For instance, it is much less sensitive to vanishing gradients during training, because the part of the signal that is not set to 0 is kept unchanged, whereas a sigmoid is likely to overly flatten the signal. Adding layers increases the risk of vanishing gradients, which is why ReLUs are generally preferred within deep networks.


3.1.2 Neural Network Layers

One of the key concepts for building more complex neural networks is the layer. In a typical artificial neural network, a layer is simply a group of neurons. Layers are connected to each other so that the input of some layers can be connected to the output of other layers. Figure 3.2 shows an example of a three-layered multilayer perceptron neural network. In this example the neurons within a same layer are not connected together, but they are fully connected to every neuron of the previous layer. These layers are also called feedforward layers because of the one-way nature of their connections.

Figure 3.2: A multilayer perceptron with three layers: an input layer (x_1, x_2, x_3), a hidden layer and an output layer (y_1, y_2). Each dark-grayed node is a single neuron. The light-grayed nodes are just a notation to represent the inputs.

This flexibility in interconnecting many layers allows building deeper neural networks with different kinds of layers, from classic feedforward layers to convolutional layers often used in image processing, or recurrent layers used for time series processing.
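As a sketch of how such stacked feedforward layers look in practice, here is a minimal Keras model mirroring the three-layer perceptron of Figure 3.2 (3 inputs, one hidden layer, 2 outputs); the layer sizes and activations are illustrative assumptions, not the configuration used in this project.

```python
from keras.models import Sequential
from keras.layers import Dense

# three inputs -> one fully connected hidden layer -> two outputs
model = Sequential()
model.add(Dense(4, activation='tanh', input_dim=3))   # hidden layer
model.add(Dense(2, activation='softmax'))             # output layer
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.summary()
```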

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) [20] [21] are so called because they are able to keep an internal state between two iterations, which is updated depending on the last samples the network has seen. This internal state can be considered as a memory the network uses to remember information from the previous inputs. That makes them suitable for temporal data.

We previously saw how easy it is to interconnect layers to build up neural networks. It is also possible to interconnect networks to create even more complex networks. This is why we will often refer to recurrent neural networks as layers themselves: they are often used as a part of a bigger network.

The most basic RNN is the fully connected RNN (FCRNN) [20] [21]. It simply consists in feeding the output of the RNN at time t back in, together with the next input at time t + 1. Figure 3.3 illustrates that concept: the sequential input data is fed chronologically into the network and each timestep's output is kept to be concatenated with the input of the next timestep.

Although this technique lets the neural network remember the past, in practice it performs quite badly at remembering long-term events. Furthermore, it is very sensitive to vanishing gradients. This is because many layers directly stacked together make the gradient correction smaller and smaller as we backpropagate deeper and deeper.

Figure 3.3: A fully connected recurrent neural network with one layer whose activation function is the hyperbolic tangent. At each timestep, the memory from the previous timestep is concatenated with the current timestep's input. Illustration inspired by Christopher Olah's article [21].

Long Short-Term Memory Networks

The Long Short-Term Memory (LSTM) [16] [21] neural networks are gate-based RNNs. Their gate mechanism allows them to have both a short-term and a long-term memory, which is why they nearly always outperform the simple RNN approach, which struggles to remember events from too many timesteps ago.

Indeed, what makes the LSTM robust against vanishing gradients, and thus able to memorize long-term patterns, is that its cell state is never directly transformed by an activation function, thanks to the gate mechanism.

These gates are layers connected together in a very specific way in order to manage the long-term memory properly. A typical LSTM has three gates, as shown in Figure 3.4:

Forget gate layer This is a layer with an activation function that ranges from 0 to 1. Its role is to select which data from the cell state will be forgotten, knowing the current timestep input and the hidden state from the previous step. Concretely, this layer's output will be multiplied with the long-term memory vector, so that a 0 means the memory will completely drop the information and a 1 means it will keep all the information from the previous timestep in the cell state. Here is the mathematical formulation using the notations of Figure 3.4:

f_t = \sigma(W_f \cdot [h_{t-1}, u_t] + b_f)    (3.2)

where f_t is the gate output at timestep t, \sigma the sigmoid activation function, u_t the RNN input at timestep t, and h_{t-1} the hidden state from timestep t − 1. W_f and b_f are respectively the weight matrix and the bias of this gate's network layer.

Figure 3.4: A long short-term memory recurrent neural network cell with an input gate, a forget gate and an output gate. C_{t-1} and h_{t-1} come from the previous timestep and C_t and h_t are sent to the next timestep. Illustration inspired by Christopher Olah's article [21].

Input gate layer The role of this layer is to select which parts of the current timestep input and of the previous step's hidden state will be stored in the cell state. Its activation ranges from 0 to 1 and it works in a similar way to the forget gate layer, except that it selects what will be remembered. Once the forget gate has been applied to the cell state, this layer's output is multiplied with the input passed through another layer with a hyperbolic tangent activation function (so that the possible values range from −1 to 1). This results in a vector containing the substantial information to be kept. This vector is then directly added to the cell state, which is now completely updated for this timestep. Here is the mathematical formulation using the notations of Figure 3.4:

i_t = \sigma(W_i \cdot [h_{t-1}, u_t] + b_i)    (3.3)

where i_t is the gate output at timestep t, \sigma the sigmoid activation function, u_t the RNN input at timestep t, and h_{t-1} the hidden state from timestep t − 1. W_i and b_i are respectively the weight matrix and the bias of this gate's network layer.

And:

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, u_t] + b_C)    (3.4)

where \tilde{C}_t is the temporary cell state for timestep t. W_C and b_C are respectively the weight matrix and the bias of this state-managing network layer.

Finally:

C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t    (3.5)

where C_t is the updated cell state for timestep t.

Output gate layer Finally, this last gate layer is utilized to update the hidden state once the cell state has been updated. It consists in a layer similar to those of the previous gates, with the same input, i.e. a concatenation of the current timestep input and the previous step's hidden state. It is used to select which information from the cell state is used to recreate the hidden state. To do so, the gate's output is multiplied with the cell state passed through a hyperbolic tangent function. The updated hidden state is then sent to the higher layers of the neural network and will be passed on to the next timestep. Here is the mathematical formulation using the notations of Figure 3.4:

o_t = \sigma(W_o \cdot [h_{t-1}, u_t] + b_o)    (3.6)

where o_t is the gate output at timestep t, \sigma the sigmoid activation function, u_t the RNN input at timestep t, and h_{t-1} the hidden state from timestep t − 1. W_o and b_o are respectively the weight matrix and the bias of this gate's network layer.

And:

h_t = o_t \cdot \tanh(C_t)    (3.7)

with h_t the updated hidden state for timestep t and C_t the previously updated cell state at timestep t.
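To tie Equations 3.2 to 3.7 together, here is a numpy sketch of a single LSTM timestep; the dimensions and the random parameters are placeholders, and the concatenation [h_{t-1}, u_t] is handled exactly as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(u_t, h_prev, C_prev, params):
    """One LSTM timestep following Equations 3.2-3.7."""
    Wf, bf, Wi, bi, WC, bC, Wo, bo = params
    z = np.concatenate([h_prev, u_t])          # [h_{t-1}, u_t]
    f_t = sigmoid(Wf @ z + bf)                 # forget gate (3.2)
    i_t = sigmoid(Wi @ z + bi)                 # input gate (3.3)
    C_tilde = np.tanh(WC @ z + bC)             # candidate cell state (3.4)
    C_t = f_t * C_prev + i_t * C_tilde         # updated cell state (3.5)
    o_t = sigmoid(Wo @ z + bo)                 # output gate (3.6)
    h_t = o_t * np.tanh(C_t)                   # updated hidden state (3.7)
    return h_t, C_t

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
params = tuple(rng.normal(size=shape)
               for shape in [(hidden, hidden + inputs), (hidden,)] * 4)
h, C = np.zeros(hidden), np.zeros(hidden)
for u_t in rng.normal(size=(5, inputs)):       # a toy sequence of 5 timesteps
    h, C = lstm_step(u_t, h, C, params)
print(h)
```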

3.1.3 Backpropagation

Objective function

The network is trained by minimizing the output of an objective, or loss, function. This is a function that outputs a score based on the network output and the corresponding training labels. Depending on the nature of the problem, the objective function can be changed to be better adapted to the particular case.

A well-known objective function in machine learning is the Mean Squared Error (MSE). Functions that are relevant when it comes to classification and categorical output are entropy-based functions. In the case of multi-class classification, one can use the categorical cross-entropy function, as it measures the average information needed to distinguish the output distribution from the target distribution:

H(p, \hat{p}) = -\sum_{s \in S} p(s) \log(\hat{p}(s))    (3.8)

with \hat{p} the output distribution estimated by the network over all the classes (or items) S and p the ground-truth target distribution.

Optimization methods

The most common technique used to compute the weight corrections during backpropagation is Stochastic Gradient Descent (SGD). However, there exist many variations and alternatives. A simple example of a variation consists in adding some momentum to the backpropagation by remembering the corrections made in the previous step and combining them with the values from the SGD at the current step (see the sketch below).

Many strategies therefore exist, and some of them are reputed to be more suitable for training RNNs, like Adagrad, Adadelta or Adam [22], which are more sophisticated and are able to adapt their learning rate.
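As a plain-Python sketch of the momentum idea mentioned above (not the actual training code of this project):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One parameter update: the previous correction is remembered (velocity)
    and blended with the current gradient step."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```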


3.2 Elements from Natural Language Processing

The challenging part of this project is to be able to manage a vast number of different products. To do so, we will use some techniques from the natural language processing field, where one has to deal with huge vocabularies. This high-dimensionality problem arises at two different places in our model: the input, which is a sequence of products, and the output, which is a probability distribution over all the products.

Indeed, one of the common problems with words, products or any other items is that they are categorical data, which cannot be passed to a neural network as is. One possible solution would be to assign each item an index. That way, each item could easily be identified as a number the model could understand. But directly giving an item index as a single-dimensional input to the model means that two products with close indices will be considered similar by the model, because the input dimensions of a neural network are continuous.

A valid approach would consist in using a one-hot encoding for the input. As it is not relevant to have one single dimension carrying the information about all the items, the one-hot encoding creates a vector with as many dimensions as there are items, each dimension corresponding to one item. In practice, it is a very sparse vector full of zeros, with just a one in the dimension corresponding to the input item. Though this is a valid approach, it has its limits if the number of items is very large, as the input vector would then be very large.

A similar problem occurs for the output of the model, as it has to assign a probability to each item. In this case, the output vector is not even sparse; it is just a huge vector containing values that sum up to 1.

3.2.1 Embeddings

Now going back to the input problem: categorical data can be seen as sparse data by using one-hot encoding. One solution to that high-dimensionality problem is to transform this sparse data into dense data with fewer dimensions. Even if it means a loss of information, we can try to place each item in that lower-dimensional space, also called latent space, so that similar items are close to each other.

The next question is: what are similar items, then? In the case of natural language processing, the data is a corpus of texts containing sentences composed of sequences of words. So one can make the assumption that words that occur in similar contexts, i.e. with the same kind of surrounding words, are similar. As for e-commerce, one has the clients' purchase history or web session history, which are composed of sequences of products.

There are several embedding algorithms, such as the Global Vectors (GloVe) algorithm [23] or word2vec [24]. These algorithms can learn embeddings with two different training approaches: the Continuous Bag Of Words (CBOW) models try to predict the probability of seeing a target word given its context, whereas the Skip-Gram models try to predict a context given a target word. These algorithms have been conceived with the word representation use case in mind; however, they can be applied to any other use case involving similar data, i.e. bags of items with categorical values. The sketch below illustrates the contrast between a one-hot and an embedded representation.
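The following numpy sketch contrasts the two representations discussed above; the vocabulary size, embedding dimension and item index are arbitrary.

```python
import numpy as np

n_items, emb_dim = 10000, 32          # vocabulary size and latent dimension (arbitrary)
item_index = 4242                     # index of some product

# one-hot encoding: a sparse vector of length n_items with a single 1
one_hot = np.zeros(n_items)
one_hot[item_index] = 1.0

# embedding: a dense (n_items x emb_dim) matrix, one learned row per item
E = np.random.randn(n_items, emb_dim) * 0.01
dense_vector = E[item_index]          # equivalent to one_hot @ E, but much cheaper
print(one_hot.shape, dense_vector.shape)
```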


Skip-Gram

As mentioned earlier, the Skip-Gram approach tries to predict the context (i.e. the surrounding items) of a given item.

The goal of the method defined in [24] is to maximize the average log probability:

\frac{1}{N} \sum_{n=1}^{N} \sum_{-c \le m \le c,\ m \ne 0} \log p(s_{n+m} \mid s_n)    (3.9)

where c is the maximum distance between one particular item and another item for the latter to be considered part of the former's context, s_n is the item's value at the nth position in the corresponding training sample, and N is the number of items in the training sample.

In order to maximize this average log probability, the Skip-Gram method also defines the probability p(s_{n+m} \mid s_n) from above as:

p(s_b \mid s_a) = \frac{\exp({u'_{s_b}}^{\top} u_{s_a})}{\sum_{s \in S} \exp({u'_s}^{\top} u_{s_a})}    (3.10)

where u_s and u'_s are respectively called the "input" and "output" vector representations of the item s, and S is the whole vocabulary set. Once the model is trained, u_s is the so-called embedded representation of the item s.

Even if the model can be trained as is, one usually relies on computational tricks, such as the ones detailed in the following section, in order to speed up the computations.

3.2.2 Speeding up the softmax computation

Now we focus on the output of the model and its last layer. As the model that will be built for this project requires a probability distribution as output, it is relevant to use a softmax layer. Indeed, with it, the output values range from 0 to 1 and sum up to 1.

Again, in a high-dimensional case, a softmax layer causes computational inefficiency because of the needed normalization of the output vector. The following sections detail two different techniques to improve the computational performance: the hierarchical softmax [25] and softmax sampling [26].

Both techniques achieve a speed-up by computing a partial softmax. Although this is suitable for training, it turns out to be irrelevant at inference time, as we then need the model to compute the softmax over all the items in order to pick only the relevant ones.

Hierarchical softmax

This technique does not take the form of one unique softmax layer as in the full softmax case, but is rather a hierarchy of multiple softmax layers. The output items are placed on the leaves of a tree, where each node represents an intermediate softmax layer. The parent layers compute scores for their children, and one has to browse through the hierarchy down to a leaf to get the probability of the corresponding item. Using a Huffman tree to create that hierarchy has been shown to be a quite efficient solution by Mikolov et al. [24], because the most frequent items will then be higher in the hierarchy.

A concrete mathematical formulation of the method is given by Mikolov et al. [24]:

p(y \mid v) = \prod_{j=1}^{L(y)-1} \sigma\left([[n(y, j+1) = ch(n(y, j))]] \cdot {v'_{n(y,j)}}^{\top} v\right)    (3.11)

where L(y) is the number of nodes on the path from the root of the hierarchy to the leaf y, n(y, j) is the jth node on that path, ch(n) is an arbitrary child of the node n, [[x]] is an indicator function that is 1 if x is true and −1 otherwise, and v is the known context vector taken from the previous layers' output.

This way, we can choose to only compute the normalization of the nodes that lead to the positive output item in order to compute the loss. However, this technique does not help during inference, as we want to find the item with the highest probability among all the possible items. During inference, we thus have to go through all the nodes in order to compute all the probabilities.

Softmax sampling

The sampling approach avoids normalizing over all the items by approximating the normalization. To do so, it uses a smaller subset of items, sampled from the entire item vocabulary, to compute the normalization coefficient.

Such an approach is widely used when it comes to learning embeddings, for instance. The algorithms derived for such applications just try to approximate a good loss function and not necessarily the softmax function. However, a softmax approximation using the method defined in [26] was derived and implemented in tensorflow [27].

Starting from the exact output probability:

p(y \mid v) = \frac{\exp(W^{(y)} v + b^{(y)})}{\sum_{k \in S} \exp(W^{(k)} v + b^{(k)})}    (3.12)

where W and b are respectively the weight matrix and the bias vector of the layer. Thus W^{(k)} and b^{(k)} are the weight vector and bias associated with the item k, and v is the known context vector taken from the previous layers' output.

Now working with the gradient of the log-probability:

\nabla \log p(y \mid v) = \nabla(W^{(y)} v + b^{(y)}) - \sum_{k \in S} p(k \mid v) \times \nabla(W^{(k)} v + b^{(k)})    (3.13)

This gives an expression composed of two distinct parts: a positive part that depends on the target item and the given input, which can be viewed as the influence of the target item on the loss, and a negative part that corresponds to the influence of all the other items. It is this negative part that needs to be approximated, as |S|, the size of the item vocabulary, is possibly huge. Luckily, that negative part can be viewed as an expectation:

\nabla \log p(y \mid v) = \nabla(W^{(y)} v + b^{(y)}) - \mathbb{E}_{k \sim p(\cdot \mid v)}\left[\nabla(W^{(k)} v + b^{(k)})\right]    (3.14)

which allows us to approximate that part based on a small subset of the complete vocabulary. One can then define a separate proposal distribution to pick that subset out of the vocabulary, approximate the expectation based on that distribution and directly compute the gradient for the backpropagation.
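Below is a sketch of how such a sampled softmax is typically wired up with tensorflow's built-in implementation [27] (TensorFlow 1.x-style API); the tensor shapes and variable names are illustrative assumptions, and the full softmax is still computed at inference time.

```python
import tensorflow as tf

num_items, context_dim, num_sampled = 100000, 128, 1000   # illustrative sizes

# v: context vectors coming from the previous layers, y: target item ids
v = tf.placeholder(tf.float32, [None, context_dim])
y = tf.placeholder(tf.int64, [None, 1])

# output layer parameters: one weight vector and one bias per item
W = tf.Variable(tf.random_normal([num_items, context_dim], stddev=0.01))
b = tf.Variable(tf.zeros([num_items]))

# training loss: the normalization is approximated over num_sampled sampled items
train_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=W, biases=b, labels=y, inputs=v,
    num_sampled=num_sampled, num_classes=num_items))

# inference: the full softmax over all the items is still required to rank them
logits = tf.matmul(v, W, transpose_b=True) + b
probabilities = tf.nn.softmax(logits)
```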


Chapter 4

Method

Let us start with an overview shortly presenting how the proposed implementation is organized. Two phases are described: the training phase and the inference phase.

The training phase is summarized in Figure 4.1.

First, the data from the quotations database is collected and prepared (Step 1). This results in a set of product sequences, where each sequence represents a quotation. These quotations are then used to train an embedding model (Step 2).

The data from the transactions database is collected and prepared just like the quotations (Step 3), except that each sequence represents a customer's purchase sequence, i.e. a sequence of products. Each product within these sequences can then be embedded into a dense vector (Step 4). The sequences of products are thereby transformed into sequences of dense vectors. They can finally be passed as the training input of the sequence prediction model (Step 5).

Figure 4.1: Training flow chart: (1) prepare the training quotations, (2) train the embedding model, (3) prepare the training transactions, (4) embed the training transactions, (5) train the sequence prediction model. The steps are performed in order and the arrows represent the dependencies between the steps.

The inference phase is summarized in Figure 4.2. It is similar to the training phase, but the inference focuses on predicting the next purchase of one particular customer.

The transactions related to that particular customer are collected and prepared (Step 1). Next, the resulting sequence of products is embedded (Step 2) before being passed as the input to the sequence prediction model (Step 3). The model finally gives the products that are most likely to be purchased next by the customer (Step 4).


Please note that the testing phase is also based on that process.

Figure 4.2: Inference flow chart: (1) prepare the customer transactions, (2) embed the customer transactions, (3) run the sequence prediction model, (4) get the most likely next purchased products. The steps are performed in order and the arrows represent the dependencies between the steps.

The following Sections in this chapter describe the data and models used in these steps more precisely.

4.1 Tools

In this section, the major tools used in this project are described, and the way they help fulfill the project's objectives is explained.

4.1.1 IBM SPSS Modeler

IBM SPSS Modeler is an Extract-Transform-Load (ETL) tool used in the business analytics team that makes it easy to manipulate and prepare the data. It comes with some basic built-in models and allows the integration of custom ones. In this project it is primarily used for exploratory data analysis and data preparation.

4.1.2 Python

Data manipulation

The Python libraries used are those most commonly used for data manipulation in Python. Numpy is used for manipulating matrices and as a consistent format to feed the data into the models. Pandas is a data manipulation library that views data in a data frame format. It is quite efficient for preprocessing raw data and is used to complement IBM SPSS Modeler.

Numerical computation frameworks

Even if most of the algorithmic logic is written in Python, most of the computations are optimized at a lower level by using a framework such as tensorflow or theano. These are frameworks in a high-level language composed of bindings to low-level data manipulation and computation routines. Both tensorflow and theano are very flexible and suitable for setting up deep learning algorithms.

Tensorflow [27] is a Google API that was open sourced a few years ago as a beta and has recently been fully released. It has an active community and benefits from many contributions.

Theano [28] is another open-source library that is mostly maintained by researchers at the Université de Montréal.


For this project, keras [29] is used as a higher-level abstraction over tensorflow and theano that structures the code in a layer-based approach. This helps keep the code well-structured in a way that remains easily configurable in case we want to add custom operations.
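As an example of this layer-based style, here is a minimal sketch of an LSTM-plus-softmax sequence prediction model of the kind this project builds on top of pre-computed product embeddings; all dimensions below are illustrative assumptions, and the actual architecture is described in Section 4.3.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_items, emb_dim, seq_len = 50000, 100, 30     # illustrative sizes

model = Sequential()
# the input is a purchase sequence already embedded into dense vectors
model.add(LSTM(256, input_shape=(seq_len, emb_dim)))
# probability distribution over all the products for the next purchase
model.add(Dense(n_items, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
```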

4.1.3 Word2Vec

Word2Vec [24] is an algorithm that was originally designed for natural language processing applications. Conceptually, a word2vec model is trained on a very large text corpus and is intended to take a word as input and to produce a dense vector representing this particular word as output.

In this project the skip-gram approach is used for the training. It consists in fitting the input word to its context, i.e. the words that surround it. The implementation used is the one provided in Apache Spark's MLlib [30].
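A minimal sketch of training product embeddings with that MLlib implementation is shown below; the file path, vector size and minimum count are illustrative assumptions, and each 'sentence' is a quotation seen as a list of product identifiers.

```python
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName="product-embeddings")

# each line is assumed to hold one quotation: space-separated product identifiers
quotations = sc.textFile("quotations.txt").map(lambda line: line.split(" "))

model = Word2Vec().setVectorSize(100).setMinCount(5).fit(quotations)

vector = model.transform("product_123")         # dense vector for one product id
similar = model.findSynonyms("product_123", 5)  # the 5 closest products in the latent space
```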

4.2 Data

4.2.1 Description

Many different records tracking the interactions with the customers are currently actively collected by the company, and some of them can constitute a good input for this project. Three major data sources are of interest: the transactions, the webshop browsing data and the quotations. They all keep track of the customers' behaviour through time.

Transactions In this project, what is called a transaction corresponds to one product delivery. In the data, this one-product delivery is materialized as a delivery line (see Figure 4.3). Often, multiple products are delivered together. They form a single delivery that is performed via one particular delivery mean, e.g. provisioning at the store or direct delivery at one particular address. Each delivery originates from one particular order. The customer's purchase action can originate from different possible order means, e.g. the website, directly at the store, or automated orders.

Figure 4.3: Simple relational representation of the transactional data (Customer, Order, Delivery, Delivery Line, Product). Each link represents the association between the objects. For example, a customer can order zero or multiple times; on the other hand, if an order exists, it is always associated with exactly one customer.

Web browsing data Web browsing data contains a part of the previously mentioned transactional data, more precisely all the transactions that occurred via the webshop. But it also gives a lot of additional information about other actions the customer performed. For instance, it keeps track of the products the customer simply viewed or added to the cart. Unfortunately, an up-to-date version of these navigational data was not technically available at the time this project was carried out.

Quotations A quotation is a set of coherent products manually suggested to the customers by employees such as sales representatives. At the company, the transactional data is often considered a subset of the quotations: the latter is a set of products suggested to the customer, while the former is the set of products the customer finally decided to buy. The relational organization of the quotations is quite close to the transactions' one (see Figure 4.4). As the quotations are more consistent than the purchases alone, learning the relationships between the products from them should give better results than using the transactions only.

[Figure 4.4 shows a relational diagram analogous to Figure 4.3, with the entities Quotation (Quotation Id), Quotation Line (Quotation Line Id, Line Number), Customer (Customer Id, Type) and Product (Product Id, Category, Provider), linked with their cardinalities.]

Figure 4.4: Simple relational representation of the quotations.

This project therefore focuses on making use of the transactions and the quotations. These data are already quite well structured and cleanly stored in a database, so they only require limited preprocessing and adjustments.

One more notion to keep in mind is the concept of a project from the customer's point of view. This notion is not directly visible in the data presented earlier. What is called a project here is a set of actions done by the customer in order to fulfill one of her own projects. Indeed, the company's customers are professionals with their own customers and their own projects to complete. This is a difficult notion to retrieve from the data. Therefore the assumption is made that training the models over the customers' sequences of actions – without really knowing why the customer is acting this way – is a good enough approximation. The models should be able to detect the patterns in the customers' behaviour out of the huge amount of data samples.

4.2.2 Preparation

In this Section, the different steps that lead from the raw database data to clean, ready-to-use samples are detailed.

Working on a subset

The available computational power is limited. That is why, in order to speed up the testing process and the design of the models, the experiments are run on a subset of the data.

The original data has approximately 20 million records for the last year only. A consistent subset is then created from this data, containing approximately twenty times fewer records. It has been created as a filter over the customers: only customers of one customer type – the electricians – are kept, and among them the first 5000 that buy most of their products from one well-known product provider.
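The filter could, for instance, be expressed with pandas along the following lines; the file name, column names and provider identifier are hypothetical.

```python
# Sketch of the customer filter behind the subset.
# File name, column names and the provider identifier are hypothetical.
import pandas as pd

transactions = pd.read_csv("transactions.csv")

# Keep one customer type only.
electricians = transactions[transactions["customer_type"] == "electrician"]

# Share of each customer's delivery lines coming from the well-known provider.
provider_share = (
    electricians.assign(from_provider=electricians["provider"] == "PROVIDER_X")
    .groupby("customer_id")["from_provider"]
    .mean()
)

# Keep the 5000 customers buying most of their products from that provider.
selected = provider_share.sort_values(ascending=False).head(5000).index
subset = electricians[electricians["customer_id"].isin(selected)]
```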

Cleaning

From the data directly accessible in the database to the data used as the model input, there are a few transformation and data augmentation steps. These steps are summarized in Figure 4.5. The cleaning phase described here concerns the transactions as much as the quotations.

First, the subset described in the previous Section is collected from the database (Step 1) and organized in samples, where one sample corresponds to one customer (Step 2).

Each sample is sorted temporally so that the oldest elements come first (Step 3).

Now, depending on what the model will work with – product reference or product category – a choice has to be made regarding what is used as the sequence items (Step 4a/4b).

Once this is done, it is important to re-index the chosen items by frequency (Step 5).

What is meant here is that the most frequent items get smaller integer indices whereas the least frequent ones get larger integer indices. This new index should be contiguous for convenience. Along with the re-indexing, one should keep the mapping in order to be able to reverse it.
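A minimal sketch of such a frequency-based re-indexing, assuming index 0 is reserved for the padding mask introduced later, could look like this:

```python
# Sketch: re-index items so the most frequent item gets the smallest index.
from collections import Counter

sequences = [["A", "B", "A"], ["C", "A", "B"]]  # toy samples

counts = Counter(item for sequence in sequences for item in sequence)
# Index 0 is left free for the padding mask; indices are contiguous by frequency.
item_to_index = {item: rank + 1
                 for rank, (item, _) in enumerate(counts.most_common())}
index_to_item = {index: item for item, index in item_to_index.items()}  # reverse map

indexed_sequences = [[item_to_index[item] for item in sequence]
                     for sequence in sequences]
```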

Right now a sample represents the whole customer lifeline. In order to help the model, that complete customer lifeline is split whenever no order has been performed for too long a time (Step 6). This time has been estimated at 2 weeks, trying to model a realistic period between two of the customer's projects. This is however a weak assumption, as projects sometimes overlap and, depending on the customer type, projects are more or less sparse.
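A minimal sketch of this splitting step, assuming each sample is a time-sorted list of (timestamp, item) pairs, could look like this:

```python
# Sketch: split one customer's time-sorted sequence whenever two consecutive
# orders are more than two weeks apart (the assumption discussed above).
from datetime import timedelta

def split_on_inactivity(events, gap=timedelta(weeks=2)):
    """events: time-sorted list of (timestamp, item) pairs for one customer."""
    samples, current = [], []
    for timestamp, item in events:
        if current and timestamp - current[-1][0] > gap:
            samples.append(current)
            current = []
        current.append((timestamp, item))
    if current:
        samples.append(current)
    return samples
```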

Finally, the samples that end up consisting of only 1 item are dropped, as there is nothing to learn from them (Step 7).

Feeding the neural network

The data now has to be transformed so that it can be passed as input and target to the model.

[Figure 4.5 summarizes the data cleaning flow as the following steps:
Step 1: Collect the data from the databases.
Step 2: Create one sample per customer.
Step 3: Sort the samples by timestamp.
Step 4a: Use the product reference as sequence item. / Step 4b: Use the product category as sequence item.
Step 5: Re-index the items by frequency.
Step 6: Split the samples with more than 2 weeks of inactivity.
Step 7: Drop the samples with only 1 item left.]

Figure 4.5: Data cleaning flow chart.

The numerical computation frameworks used here mainly manipulate matrices. This is a problem in this case, as the dataset consists of samples of varying lengths, yet the input samples have to fit into a matrix. Let us define the maximum number of timesteps among a set of samples as follows:

T_{\max} = \max_{n \in [1, N]} T_n \qquad (4.1)

where T_n is the length (or number of timesteps) of the n-th sample and N is the total number of samples used.

A matrix with dimensions (N, T_max) is then created. The elements of this matrix directly contain the product identifiers. The problem is that this matrix is much too large for most of the samples, as it has been sized to fit the longest sample. Shorter samples then leave some cells in their respective rows unused. The solution is to define a fake item to fill those cells with. This is called sequence padding, and that fake item is called a mask. All the sequences are padded up to the length of the longest sample.
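Keras provides a helper for this padding step; a minimal sketch, assuming the items have already been re-indexed so that 0 is free to act as the mask:

```python
# Sketch: pad all sequences to the same length with the mask item (index 0).
from keras.preprocessing.sequence import pad_sequences

indexed_sequences = [[3, 1, 7], [2, 5], [4, 1, 1, 6, 2]]
padded = pad_sequences(indexed_sequences, value=0)
# padded has shape (N, T_max); shorter samples are filled with the mask item 0.
```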

Now, as the samples are of very different lengths, it can become computationally inefficient to put everything in the same matrix, because many of the cells will contain the mask item. One solution is to create multiple matrices with samples of about the same size, so that the mask item ratio is kept low.
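A minimal sketch of this bucketing idea, with a hypothetical bucket size, could look like this:

```python
# Sketch: bucket samples of similar length so each padded matrix stays dense.
from keras.preprocessing.sequence import pad_sequences

def bucket_and_pad(sequences, bucket_size=64):
    """Sort samples by length, group them into buckets and pad each bucket
    only up to the length of its own longest sample."""
    ordered = sorted(sequences, key=len)
    buckets = [ordered[i:i + bucket_size]
               for i in range(0, len(ordered), bucket_size)]
    return [pad_sequences(bucket, value=0) for bucket in buckets]
```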

4.3 Model

The proposed models are described in this Section. Two complementary models are detailed here. The first one consists in computing the item embeddings, and the second one tries to predict the next item of a sequence, based on the embeddings from the first model.

In the last Section, we also detail the baseline model that is used to assess the performance of our models.

4.3.1 Embedding model

What this model does is compute the item embeddings. The quotation dataset is used as training data because it is likely to better express the relations between the items (see the Data Section).

Once the model is trained, it gives a matrix that associates each item s with a corresponding dense embedding vector u. For instance, let the matrix represented in Table 4.1 be the
