
Forecasting Daily Supermarkets Sales with Machine Learning

DANIEL FREDÉN
HAMPUS LARSSON

Degree Projects in Optimization and Systems Theory (30 ECTS credits)
Master's Programme in Industrial Engineering and Management
KTH Royal Institute of Technology, 2020

Supervisor at ELVENITE AB: Erik Karlström
Supervisor at KTH: Xiaoming Hu

Examiner at KTH: Xiaoming Hu


KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

Historically, sales forecasts have been made through a combination of statistical measurements and experience. However, with the increased computational power available in modern computers, there has been an interest in applying machine learning to this problem. The aim of this thesis was to utilize two years of sales data, yearly calendar events, and weather data to investigate which machine learning method could forecast sales the best. The investigated methods were XGBoost, ARIMAX, LSTM, and Facebook Prophet. Overall, the XGBoost and LSTM models performed the best and had a lower mean absolute error and symmetric mean absolute percentage error compared to the other models. However, Facebook Prophet performed the best in regards to root mean squared error and mean absolute error during the holiday season, indicating that Facebook Prophet was the best model for the holidays. The LSTM model could, however, quickly adapt during the holiday season, which improved its performance. Furthermore, the inclusion of weather did not improve the models significantly, and in some cases the results were worsened. Thus, the results are inconclusive but indicate that the best model is dependent on the time period and goal of the forecast.


Sammanfattning

Historically, these forecasts have been made through a combination of statistical methods and experience. With the increased computational power of modern computers, the interest in applying machine learning to these problems has grown. The aim of this thesis was therefore to investigate which machine learning method could forecast sales the best. The investigated methods were XGBoost, ARIMAX, LSTM, and Facebook Prophet. In general, the XGBoost and LSTM models performed the best, as they had a lower mean absolute error and symmetric mean absolute percentage error compared to the other models. However, with regard to root mean squared error, Facebook Prophet had better results during holidays, indicating that Facebook Prophet was the best-suited model for predicting sales during holidays. The LSTM model could, however, quickly adapt and improved its estimates. The inclusion of weather data in the models did not result in any notable improvements and in some cases even worsened the results. Overall, the results were ambiguous but indicate that the best model depends on the time period and goal of the forecast.


Acknowledgements

We would also like to thank our supervisor at KTH, Xiaoming Hu, for the guidance and feedback throughout the project.


Table of Contents

Abstract
Sammanfattning
Acknowledgements
Table of Contents
List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Research Objective
  1.3 Problem Setting
  1.4 Programming Language
  1.5 Outline

2 Literature Review
  2.1 Previous Work
  2.2 Algorithms Used in Previous Work
  2.3 Variables Used in Previous Work
  2.4 Evaluation Metrics Used in Previous Work

3 Theory
  3.1 Time Series
  3.2 Auto-Regression
  3.3 Supervised Learning
  3.4 Artificial Neural Networks
  3.5 Selected Models
    3.5.1 Naive Model
    3.5.2 ARIMAX
    3.5.3 Facebook Prophet
    3.5.4 XGBoost
    3.5.5 LSTM
  3.6 Evaluation of Model Performance
    3.6.1 Cross-validation for Time-Series
    3.6.2 Evaluation Metrics
  3.7 Weather as a Predictor

4 Data
  4.1 Included Data
    4.1.1 Coop Data
    4.1.2 SMHI
    4.1.3 Additional Data
  4.2 Data Processing
    4.2.1 Exploratory Data Analysis
    4.2.2 Aggregation
    4.2.3 Missing Values
    4.2.4 Feature Engineering
    4.2.5 Feature Selection
    4.2.6 One Hot Encoding
    4.2.7 Data Split

5 Method
  5.1 Model Implementation
    5.1.1 ARIMAX
    5.1.2 Facebook Prophet
    5.1.3 LSTM
    5.1.4 XGBoost

6 Result
  6.1 Performance of Models
    6.1.1 Naive Model
    6.1.2 ARIMAX
    6.1.3 Facebook Prophet
    6.1.4 LSTM
    6.1.5 XGBoost

7 Discussion
  7.1 Model Comparison
  7.5 Conclusion

8 Further Studies

9 Appendices
  9.1 Appendix 1: Available Coop Data
  9.2 Appendix 2: Holiday Data

Bibliography


List of Figures

1  A neural network with one hidden layer
2  Example of a regression tree
3  Unrolled form RNN
4  Single module RNN
5  Single module of an LSTM network
6  Cross-validation for time series
7  % of mean sales
8  Explanation of holiday variable
9  % of mean aggregated weekly sales
10 One hot encoding example
11 MAE for each week and model
12 SMAPE for each week and model
13 RMSE for each week and model
14 % of mean sales for the naive model and one specific product and store
15 % of mean aggregated sales for the naive model
16 % of mean sales of the ARIMA and one specific product and store
17 % of mean sales aggregated ARIMA
18 % of mean sales for Prophet and one specific product and store
19 % of mean aggregated sales Prophet
20 % of mean sales for LSTM Model and one specific product and store
21 % of mean sales aggregated LSTM
22 % of mean sales for XGBoost Model and one specific product and store
23 % of mean sales aggregated XGBoost


List of Tables

1  RMSE, MAE and SMAPE
2  Available SMHI data
3  Example of lags
4  The evaluated hyperparameters for ARIMAX
5  The evaluated hyperparameters for Prophet
6  The evaluated hyperparameters for LSTM
7  The evaluated hyperparameters for XGBoost
8  Model results (mean)
9  Model results (median)
10 The final hyperparameters for ARIMAX
11 The final hyperparameters for Prophet
12 The final hyperparameters for LSTM
13 The final hyperparameters for XGBoost


1 Introduction

Developments in the field of machine learning and the increase of computational power have led to the implementation of machine learning in various industries [1]. The retail industry is no exception. One of the applications of machine learning in the retail industry is the use of advanced forecasting algorithms to better predict upcoming sales and thus improve the ordering processes and the allocation of products.

Improved forecasting models for the retail industry can provide many benefits. For the end customer, products have higher availability and the stores become a reliable source of goods. For the stores, improved forecasting performance provides the ability to minimize waste due to overstocking, which can have negative economic consequences; to maximize sales, as understocking could decrease sales due to lack of product availability; and to improve the allocation of personnel. Thus, increased forecast precision could be economically beneficial for the stores in multiple ways. Furthermore, in the case of grocery stores in Sweden, 30 000 tons of groceries were wasted during 2016 [2]. Thus, improved forecasts could also be beneficial environmentally. It is therefore evident that improved forecasts are desirable from multiple perspectives and for multiple stakeholders within the whole retail industry, but especially within grocery stores.

1.1 Background

Historically, forecasting has relied on experience-based knowledge among the personnel. However, as grocery stores grow larger and carry a high number of products with different characteristics, knowledge-based forecasting becomes an increasingly difficult task. With the increased ability to gather data, it is possible to utilize data for the forecasts. Statistical models are often used to describe how sales have behaved historically, and these statistics are then used in combination with experience to predict future sales. With the increased computational power now available, it could be possible to apply sophisticated machine learning models and rely on the data to a larger extent when predicting sales.


1.2 Research Objective

The primary objective of this thesis was to investigate which machine learning model yields the best performance when forecasting sales for a given set of products and stores. Utilizing data provided by Coop Värmland, forecasts were implemented for multiple stores and products to predict the sold quantity of each product and store over a seven-day period. The overarching goal of this thesis was thus to lower the amount of waste and increase product availability through improved forecasting models.

1.3 Problem Setting

Coop Värmland is one of the largest grocery store chains in the county of Värmland, Sweden, and consists of over 60 stores of various sizes [3]. Currently, they utilize their data to automate orders for a large set of products. However, for a set of products with a short expiration date, orders are placed manually, and as these orders could potentially be made more accurate, these products were the focus of this thesis. For these products, it was assumed to be optimal if all products are sold on the same day that they are displayed in stores, since products become less desirable to the customer if stored longer. Thus, when forecasting sales for these products, it should be done for each day between consecutive deliveries. In this case, the time between consecutive deliveries was assumed to be seven days.

As the project was performed based on data from Coop Värmland, the data was biased towards this market. Other counties could have other characteristics and thus other variables that would be necessary to include to fully understand why sales increase or decrease in a general retail setting. As the scope of this thesis was limited to Coop Värmland and grocery products with a short expiration date, the results are not guaranteed to be viable for other sorts of products or other sorts of retail stores.


1.4 Programming Language

Python was used for data preparation, data analysis, and implementation of forecast models. During these stages, multiple libraries were used, including Pandas, Numpy, Scikit-learn, Keras, and TensorFlow.

1.5 Outline

The thesis is structured as follows: Chapter 2 presents a literature review of related work, including the algorithms, features, and evaluation metrics used. Chapter 3 describes the theory behind the models that were chosen and used. In chapter 4 the data is described in detail, including the pre-processing of the data. In chapter 5 the methodology of implementing the models is discussed. The results are then presented in chapter 6, followed by a discussion of the results and a conclusion in chapter 7. Chapter 8 proposes possible ideas for future work and how this project could be continued and improved.


2 Literature Review

This chapter contains a brief overview of previous work related to the problem this thesis investigated. The main objective of this section was to understand the current depth of the field, the amount of existing academic research, how that research has been executed, and where there are possible gaps in the literature. A secondary objective was to dig deeper into the existing research to conclude which algorithms, features, and evaluation metrics appear frequently throughout the literature.

2.1 Previous Work

A simplistic approach to understanding the current breadth of the academic literature within this thesis area was to utilize academic literature databases such as Web of Science and Google Scholar. Introducing the keywords "food", "waste" and "machine learning" resulted in only seven hits within Web of Science and 23,000 hits on Google Scholar. Comparing this to 146,365 and 3,100,000 hits when searching only for "Machine Learning", and 20,780 and 255,000 hits for "Food Waste" on Web of Science and Google Scholar respectively, it was evident that much research has been done in related areas. Although many articles include the terms, few approached a problem equivalent to the one in this thesis. Thus, a large amount of research has been done on limiting food waste and on machine learning individually. However, there has been a limited amount of research on how to limit food waste using machine learning with forecasts on a day-to-day basis. As seen from the individual searches, the limited number of hits does not correspond to a lack of knowledge. Instead, it indicates that the knowledge has not been thoroughly applied to this specific use case.

By broadening the search and focusing on the knowledge instead of the application, there existed more academic research on algorithms and methods for the problem at hand [4, 5, 6, 7, 8, 9, 10, 11]. Thus, the academic literature on time series and forecasting is thorough, and new algorithms and approaches are continuously developed to handle new problems. As this field expands continuously with new implementations and new algorithms, there is a need to compare newer algorithms to older ones to conclude whether improvements are occurring.


2.2 Algorithms Used in Previous Work

To forecast sales, several types of algorithms have been proposed with neural networks and auto-regression being the most prominent. This is expected as the problem was, in essence, a time series problem. ARIMA and Long Short Term Memory (LSTM) have yielded much discussion and promising results in the academic literature [4, 5, 6].

However, regression models such as Lasso, support vector regression (SVR), and Random Forest have shown promising results as well, indicating that this approach could yield prominent results [4, 7, 8]. Extreme gradient boosting (XGBoost) was published in 2016 by Tianqi Chen and Carlos Guestrin from the University of Washington [12], and has since proved to be a successful model for forecasting in data science competitions and recent literature [13]. Furthermore, Facebook Prophet, an additive regression model, was published on Github in 2017 [14]. While Prophet lacks academic research, it has been actively used in the online communities with promising results [15].

2.3 Variables Used in Previous Work

Besides analyzing which models have shown the most promise, it is important to analyze variables that could explain customer shopping habits and thus correlate with the number of products sold. The relevance of weather as a predictor for sales has been shown throughout the literature [7, 16, 9, 17]. Different aspects of weather have been utilized, such as temperature, the amount of sunlight, and the amount of rain, and there is no consensus on which aspect is the most relevant. Furthermore, multiple studies have shown that calendar events and public holidays such as Christmas and Easter have a high correlation with sales [9, 10, 11]. A third important variable that has been shown to correlate with sales of a product in the literature is whether there is an ongoing promotion on the product in question or not [4, 11], presumably due to a lowered cost and thus an increased demand. Lastly, previous work also suggests that the specific weekday correlates with the number of products sold, and can thus be used to improve the performance of the models [4, 10, 11].


2.4 Evaluation Metrics Used in Previous Work

To evaluate and compare the different models fairly, the choice of evaluation metrics was important, as each metric has different characteristics. It was also important to include several metrics since different metrics can expose different flaws or strengths in the models. Root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) have been used extensively in the academic literature and could therefore be deemed the most useful [6, 7, 18]. In comparison with these, when analyzing the online communities, data science competitions, and sources outside the academic literature, it was clear that symmetric mean absolute percentage error (SMAPE) can be beneficial when comparing the models [19, 20]. By utilizing multiple performance metrics with different characteristics, as specified above, the chance of locating the best algorithm for a specific outcome is increased.


3 Theory

This chapter introduces the relevant theory and forms the foundation for subsequent chapters. Firstly, the basic theory of time series, auto-regression, supervised learning, and neural networks is introduced. Secondly, the models that were selected for this thesis are presented and discussed. Thirdly, the evaluation methods and metrics are discussed.

3.1 Time Series

When data is collected over time and time is an aspect of the data containing important information, it is a time series. The order of the data is important as succeeding data points can be correlated. Therefore, it is possible that previous values in the time series can be a great predictor of the following ones. There are several examples of time series, for example, sales data and weather data. [21]

3.2 Auto-Regression

In an auto-regressive model the predictions, $\hat{y}_t$, are based on a linear combination of past values $y_t$. Thus, this is a regressive model where previous values of the variable in question are used to predict the subsequent values. The model can be altered to include a pre-defined set of previous values. If the model utilizes $p$ previous values, the model can be written as

$$\hat{y}_t = a_1 y_{t-1} + \cdots + a_p y_{t-p} + e_t, \qquad (3.1)$$

where $a_i$, $i = 1, \ldots, p$, are the coefficients, $y_{t-i}$ are the previous values of the variable, and $e_t$ is Gaussian distributed white noise. The goal is to determine the coefficients $a_i$ such that the errors of the auto-regressive model are minimized. [22]
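A minimal sketch of Equation (3.1) in plain NumPy, assuming a toy sales series and least-squares estimation of the coefficients; the data, the order p, and the helper names are illustrative only and not taken from the thesis.

```python
import numpy as np

def fit_ar(series, p):
    """Estimate AR(p) coefficients a_1..a_p by ordinary least squares (minimal sketch)."""
    y = np.asarray(series, dtype=float)
    # Each row holds the p previous values y_{t-1}, ..., y_{t-p} for one target y_t.
    X = np.column_stack([y[p - k - 1:len(y) - k - 1] for k in range(p)])
    target = y[p:]
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coeffs

def predict_next(series, coeffs):
    """One-step-ahead forecast: a_1*y_{t-1} + ... + a_p*y_{t-p}."""
    p = len(coeffs)
    recent = np.asarray(series, dtype=float)[-p:][::-1]  # y_{t-1} comes first
    return float(np.dot(coeffs, recent))

sales = [12, 15, 14, 18, 20, 17, 16, 19, 22, 21, 18, 20]
a = fit_ar(sales, p=3)
print(predict_next(sales, a))
```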


3.3 Supervised Learning

Supervised learning maps a set of inputs, often referred to as features, $X$, to a set of outputs, often referred to as the target variable, $Y$. In this problem setting, the target variable corresponds to a non-negative real value, so the applied supervised learning is a regression task. The model is constructed utilizing training data, which is a subset of the data containing prior observations. Each prior observation is a pair of an input, $x_i \in X$, and the observed target variable, $y_i \in Y$. The goal is to construct a model that can utilize previously unseen inputs $x_i$ to predict an estimate of the target variable, $\hat{y}$, with minimal error. [23]

With time series modeling, the data used to train the models, the training data, has to precede the test data in time due to the time dependency of the data [24]. However, most machine learning models do not consider the time of the observations when predicting the target variable, as they are not explicitly developed for time series. Observations of earlier dates are a powerful predictor, and by incorporating them as inputs for subsequent data points, the time series forecasting problem can be analyzed as a supervised machine learning problem [25].
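A minimal pandas sketch of this reframing, assuming a single daily sales series; the column names, lag choices, and toy values are assumptions for illustration, not the thesis's actual feature set.

```python
import pandas as pd

# Hypothetical daily sales for one product in one store.
df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=10, freq="D"),
    "quantity": [5, 6, 8, 1, 6, 9, 8, 4, 7, 3],
})

# Lagged copies of the target become the inputs x_i; the original column is the target y_i.
for lag in (1, 2, 7):
    df[f"lag_{lag}"] = df["quantity"].shift(lag)

supervised = df.dropna()          # rows without a full set of lags are dropped
X = supervised[["lag_1", "lag_2", "lag_7"]]
y = supervised["quantity"]
```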

3.4 Artificial Neural Networks

Artificial neural networks, often simply called neural networks, is a supervised machine learning method that was developed to mimic the network of neurons in the brain. A neural network is structured in several layers, where each layer contains a set of neurons. In a neural network, there is an input layer, one or several hidden layers, and an output layer. However, the configuration of how data is transported from layer to layer can differ depending on which neural network model is used. In Figure 1, a simple feedforward neural network is displayed, where the output from each layer is the input to the subsequent layer. [26]


Figure 1: A neural network with one hidden layer

Figure 1 displays a neural network with one hidden layer. In the displayed neural network, the inputs go through the input layer and are given individual weights, $w_{i,j}$. The weighted outputs from the input layer are then combined as inputs to the subsequent layer, the hidden layer. Within the hidden layer, each neuron maps its combined input to a new value using an activation function. The choice of activation function can differ depending on the task at hand but is most commonly sigmoid, ReLU, or tanh. The output from these neurons is then transported to the next layer, which in this example is the output layer. The predicted target value is then calculated based on the weights (and biases) within this output layer and output as $\hat{y}$. If this small example were to be expanded with several hidden layers, the process of weighting the inputs and combining them would be replicated through each added layer. Regardless of the number of hidden layers, the goal is to minimize the chosen error metric by tuning the weights and biases [27].

To optimize the performance of the neural network, backward propagation of errors, often denoted simply as backpropagation, can be used. Given an error function, the gradient of the error function with respect to the weights of the neural network is calculated. The gradients are calculated backward through the network, with the gradients of the first layer being calculated last. Because the error flows backward through the model and intermediate results are reused, rather than the gradient of each layer being calculated independently, backpropagation is computationally efficient. [28]
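A small NumPy sketch of the forward pass through the one-hidden-layer network of Figure 1, assuming a sigmoid activation; the layer sizes and the random weights are placeholders, not trained values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three inputs, four hidden neurons, one output (shapes chosen purely for illustration).
W_hidden, b_hidden = rng.normal(size=(4, 3)), np.zeros(4)
W_out, b_out = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    # The weighted inputs are combined and passed through the activation function ...
    hidden = sigmoid(W_hidden @ x + b_hidden)
    # ... and the output layer combines the hidden activations into the prediction y_hat.
    return W_out @ hidden + b_out

x = np.array([0.2, 1.5, -0.3])
y_hat = forward(x)
```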

Depending on the problem, different types of neural networks might be suitable. When dealing with time-series data, a neural network configuration which can utilize previously seen data points is presumably the best.


3.5 Selected Models

The models that this thesis utilized were ARIMAX, LSTM, XGBoost, and Facebook Prophet, as well as a naive model based on mean values of sales for each day of the week, product, and store. Each model is described in detail in the following sections.

3.5.1 Naive Model

The naive model in this thesis was based on the assumption that, for each combination of store and product, each day of the week has the same quantity of sold products, independent of the week. As all other variables are assumed to have no effect and the future is assumed to have equivalent sales to the past, this model is a naive approach to forecasting. The estimate, $\hat{y}$, for an individual product, store, and weekday was calculated as the mean value of the previously observed values, $y_i$. Thus, for a given product and store, the prediction, $\hat{y}_j$, for each day of the week was given by

$$\hat{y}_j = \frac{1}{k} \sum_{i=1}^{k} y_{j,i}, \quad j = 1, 2, \ldots, 7, \qquad (3.2)$$

where $y_{j,i}$ are the observed target values for day of the week $j$ and $k$ is the number of prior observations of that day of the week in the training data.

Naturally, this model cannot predict when sales increase or decrease over time as the model is not dependent on time. However, the model can serve as a baseline for other models to be evaluated against.
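A minimal pandas sketch of the weekday-mean baseline in Equation (3.2); the column names ('date', 'store', 'product', 'quantity') are assumptions about the aggregated dataset, not the thesis's actual schema.

```python
import pandas as pd

def naive_forecast(train: pd.DataFrame) -> pd.DataFrame:
    """Mean quantity per store, product and day of week over the training data."""
    train = train.assign(weekday=train["date"].dt.dayofweek)
    return (train.groupby(["store", "product", "weekday"])["quantity"]
                 .mean()
                 .rename("forecast")
                 .reset_index())

# Usage: look up the forecast for a given store/product/weekday combination.
# baseline = naive_forecast(train_df)
```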

3.5.2 ARIMAX

ARIMAX is an auto-regressive model based on the auto-regressive integrated moving average (ARIMA) model. ARIMAX is an extension of the ARIMA model, adding exogenous variables as inputs. Furthermore, the ARIMA model adds the ability to model non-stationary time series to the ARMA model [29]. Thus, to understand ARIMAX it is important to understand the underlying ARMA model.


The ARMA model is denoted ARMA($p$, $q$), where $p$ denotes the number of previous time series observations that the estimate, $\hat{y}_t$, depends on and $q + 1$ denotes the number of error terms that the model should include, $e_t, e_{t-1}, \ldots, e_{t-q}$. Here $e_t$ is Gaussian distributed white noise, $a_i$, $i = 1, \ldots, p$, are the auto-regressive (AR) coefficients, and $b_j$, $j = 1, \ldots, q$, are the moving average (MA) coefficients. [30]

The prediction $\hat{y}_t$ using the ARMA model is therefore

$$\hat{y}_t = a_1 y_{t-1} + \cdots + a_p y_{t-p} + e_t + b_1 e_{t-1} + \cdots + b_q e_{t-q}. \qquad (3.3)$$

The ARIMA model extends the ARMA model by adding a component to handle non-stationary time series. The ARIMA model is denoted ARIMA($p$, $d$, $q$), where $d$ denotes the number of times the time series is differenced until made stationary. When the time series has been made stationary, the ARMA($p$, $q$) model is used for predictions. [30]

The ARIMAX model adds additional exogenous variables, $X_t$, for each time step to the ARIMA model:

$$X_t = [x_t^1, x_t^2, \ldots, x_t^m]^T, \qquad (3.4)$$

where $m$ is the number of exogenous variables for each time step. Multiplying the exogenous variables with a row vector $\beta$, containing the coefficients for each exogenous variable, and adding this term to the ARIMA prediction, we get

$$\hat{y}_t = \beta X_t + a_1 y_{t-1} + \cdots + a_p y_{t-p} + e_t + b_1 e_{t-1} + \cdots + b_q e_{t-q}. \qquad (3.5)$$

By incorporating additional explanatory variables it is possible to increase the predictive power of the model, as more complex behaviors of customer shopping habits can be modeled.
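A hedged sketch of an ARIMAX fit using statsmodels' SARIMAX class (which reduces to ARIMAX when no seasonal terms are set); the synthetic series, the single promotion regressor, and the order (7, 1, 1) are stand-ins chosen for illustration, not the thesis's fitted configuration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic stand-ins for the real data: a daily quantity series and one
# exogenous promotion flag (both hypothetical).
rng = np.random.default_rng(1)
dates = pd.date_range("2018-01-01", periods=200, freq="D")
promo = rng.integers(0, 2, size=200)
y = 20 + 5 * promo + rng.normal(0, 2, size=200)

y_train = pd.Series(y[:-7], index=dates[:-7])
promo_train, promo_future = promo[:-7].reshape(-1, 1), promo[-7:].reshape(-1, 1)

# ARIMAX = SARIMAX with no seasonal component; order=(p, d, q) as in Section 3.5.2.
model = SARIMAX(y_train, exog=promo_train, order=(7, 1, 1))
fitted = model.fit(disp=False)

# Forecast the next seven days, supplying the exogenous variables for those days.
forecast = fitted.forecast(steps=7, exog=promo_future)
print(forecast)
```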


3.5.3 Facebook Prophet

Facebook Prophet is an additive and decomposable model with three main components. The trend, denoted $g(t)$, models non-periodic changes in the time series, for example a linear growth over time. The seasonality, denoted $s(t)$, models the periodic changes in the time series, for example weekly, monthly, or yearly changes in sales. The holiday component, denoted $h(t)$, models the effects of irregular events, such as holidays [31]. Combining the components with Gaussian distributed white noise, $e_t$, the following equation is obtained:

$$y(t) = g(t) + s(t) + h(t) + e_t. \qquad (3.6)$$

The trend can be modeled in two different ways in Prophet, either by a piece-wise linear model or by a saturating growth model.

The piece-wise linear model is given by

$$g(t) = (k + a(t)^T \delta)t + (m + a(t)^T \gamma), \qquad (3.7)$$

where the growth rate is denoted by $k$, the rate adjustments are denoted by $\delta$, $\gamma$ is set to make the function continuous, and $m$ is an offset parameter.

The saturating growth model is given by

$$g(t) = \frac{C}{1 + \exp(-k(t - m))}, \qquad (3.8)$$

where $C$ is the carrying capacity, $k$ is the growth rate, and $m$ is an offset parameter.


Seasonality is modeled with Fourier series. Smooth seasonal effects are approximated by

$$s(t) = \sum_{n=1}^{N} \left( a_n \cos\!\left(\frac{2\pi n t}{P}\right) + b_n \sin\!\left(\frac{2\pi n t}{P}\right) \right), \qquad (3.9)$$

where $P$ is a regular period expected in the data.

Fitting the seasonal component requires estimating $a_1, \ldots, a_N$ and $b_1, \ldots, b_N$. Therefore, a matrix consisting of seasonal vectors is constructed for each historic and future time value in the data. For yearly seasonality and $N = 10$, this becomes

$$X(t) = \left[ \cos\!\left(\frac{2\pi (1) t}{365.25}\right), \ldots, \sin\!\left(\frac{2\pi (10) t}{365.25}\right) \right]. \qquad (3.10)$$

An increased $N$ results in the ability to model faster-changing seasonality effects. However, it also increases the risk of overfitting.

The seasonal component is then

$$s(t) = X(t)\beta, \qquad (3.11)$$

where $\beta$ is normally distributed, $N(0, \sigma^2)$, to impose a smoothing prior on the seasonality.

Holidays are modeled by an indicator function. Assume that $L$ is the number of holidays included, then

$$Z(t) = [\mathbf{1}(t \in D_1), \ldots, \mathbf{1}(t \in D_L)]. \qquad (3.12)$$

Holidays are assumed to affect not only the explicit day but also the surrounding days. Therefore, a prior is used, such that

$$h(t) = Z(t)k, \qquad (3.13)$$

where $k$ is normally distributed, $N(0, \sigma^2)$. It is important to note that the holiday function does not need to model explicit holidays, but can also cover other events affecting sales, such as sport events.
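A hedged sketch of a Prophet fit with a custom holiday frame and one extra regressor, following the fbprophet API referenced later in the thesis; the synthetic history, the single Christmas holiday, and the 'promo' regressor are placeholders, not the thesis's actual inputs.

```python
import pandas as pd
from fbprophet import Prophet

# Synthetic daily history with columns 'ds' and 'y' as Prophet expects;
# 'promo' is an assumed extra regressor.
df = pd.DataFrame({"ds": pd.date_range("2018-01-01", periods=400, freq="D")})
df["y"] = 20 + 5 * (df["ds"].dt.dayofweek == 4)          # a small Friday effect
df["promo"] = (df["ds"].dt.day == 25).astype(int)

# A holiday with a window of two days before and after, as in Section 4.1.3.
holidays = pd.DataFrame({
    "holiday": "christmas",
    "ds": pd.to_datetime(["2018-12-25"]),
    "lower_window": -2,
    "upper_window": 2,
})

m = Prophet(holidays=holidays, yearly_seasonality=True, weekly_seasonality=True)
m.add_regressor("promo")
m.fit(df)

future = m.make_future_dataframe(periods=7)
future["promo"] = (future["ds"].dt.day == 25).astype(int)  # regressors must cover future dates too
forecast = m.predict(future)[["ds", "yhat"]].tail(7)
```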

3.5.4 XGBoost

XGBoost is an abbreviation of extreme gradient boosting and is based on the gradient tree boosting methods [12]. Thus it is important to introduce gradient boosting to understand XGBoost. Gradient boosting is an ensemble machine learning technique used to combine weak learners into a strong learner through an iterative approach.

Typically, the weak learners are decision trees or regression trees. For a dataset with $m$ features and $N$ samples we have

$$\mathcal{D} = \{(x_i, y_i)\} \quad (|\mathcal{D}| = N,\; x_i \in \mathbb{R}^m,\; y_i \in \mathbb{R}). \qquad (3.14)$$

A tree ensemble model uses $K$ additive functions to predict the output,

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \qquad (3.15)$$

$$\text{where } \mathcal{F} = \{ f(x) = w_{q(x)} \} \quad (q: \mathbb{R}^m \rightarrow T,\; w \in \mathbb{R}^T). \qquad (3.16)$$

Here $\mathcal{F}$ denotes the space of regression trees and, within $\mathcal{F}$, $q$ represents the structure of each tree. $T$ is the number of leaves, and each $f_k$ corresponds to an independent tree structure $q$ and leaf weights $w$. The weight of each leaf can be understood as a score for that leaf; thus, $w_i$ is the score of the $i$:th leaf.

In Figure 2 an example of a possible regression tree is displayed.


Figure 2: Example of a regression tree

As functions are used as parameters, this model cannot be optimized using traditional methods; instead, it has to be trained additively. The prediction of the $i$:th instance at the $t$:th iteration is denoted $\hat{y}_i^{(t)}$. $f_t$ is added to minimize the objective below and is chosen in a greedy manner such that the improvement of the model is maximized:

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k). \qquad (3.17)$$

$l$ is the loss on the training data and is a differentiable convex function measuring the difference between $\hat{y}_i$ and $y_i$. The loss function is most commonly a squared or logistic loss and depends on the problem. $\Omega$ is the regularization term and measures the model complexity. This regularization term is added to avoid overfitting by smoothing the final weights. When the regularization is set to zero, the model defaults to regular gradient boosting.

XGBoost improves on regular gradient boosting by utilizing second-order derivatives of the loss function to gain information about the gradient descent direction. In contrast, regular gradient boosting uses only the first-order gradient of the loss function of the base model when minimizing the error of the model. As presented, L1 and L2 regularization are implemented to improve model generalization. Furthermore, hardware optimization and parallelization lower the model training time significantly [12]. The increased computational efficiency is what "extreme" gradient boosting refers to; however, given the nature of the model, it has also been referred to as regularized gradient boosting [32].
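A minimal sketch of fitting the regularized tree ensemble of Equations (3.15)-(3.17) with the xgboost library; the synthetic feature matrix and every hyperparameter value here are illustrative assumptions, not the tuned values reported later in the thesis.

```python
import numpy as np
import xgboost as xgb

# Synthetic lag-feature matrix and target (placeholders for the engineered dataset).
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 16))            # e.g. 16 lag features, as in Section 4.2.4
y = 20 + 3 * X[:, 0] + rng.normal(size=500)

model = xgb.XGBRegressor(
    n_estimators=250,        # number of boosted trees
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,           # fraction of observations sampled per tree
    colsample_bytree=0.8,    # fraction of columns sampled per tree
    reg_lambda=1.0,          # L2 regularization, part of the Omega term in Eq. (3.17)
)
model.fit(X[:-50], y[:-50])
pred = model.predict(X[-50:])
```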


3.5.5 LSTM

LSTM is an acronym for long short-term memory and is an artificial neural network based on the recurrent neural network (RNN) architecture. Unlike other common neural network architectures, RNNs are capable of keeping information from previous events. This architecture makes RNNs suitable for problems with sequences of data, such as time series, as they can store information from previous time steps. However, when the dependencies span a long period of time, this information can be lost. Although RNNs are, in theory, capable of learning long time dependencies, in practice it can be difficult due to either vanishing or exploding gradients [33]. The unrolled form of an RNN can be seen in Figure 3, where $x_t$ is the input and $h_t$ is the output for each time step. Each module, A, can be viewed independently as seen in Figure 4.

Figure 3: Unrolled form RNN

Figure 4: Single module RNN


LSTM was developed to better store information for a longer period of time or when the time dependencies are of unknown duration [34]. In each repeating module, there are four interacting neural network layers, instead of one as in a regular RNN. LSTM contains a cell state which can maintain information over time. The cell state consists of a cell state vector and a gating unit which regulates the information held in this memory over longer periods of time. The gates control which information should be kept and which should be removed by utilizing a sigmoid neural net layer and a point-wise multiplication operation. The information is then scaled, based on the relevancy of the information, to a value between zero and one [35]. A descriptive picture of a single module can be seen in Figure 5.

Figure 5: Single module of an LSTM network

The first step in LSTMs is the "forget gate layer". This gate is controlled by a sigmoid layer which decides which information should be kept. For each component in $C_{t-1}$, the sigmoid layer outputs a value between zero and one based on the input $x_t$ and $h_{t-1}$. The activation vector of the forget gate is given by

$$f_t = \sigma(W_f * [h_{t-1}, x_t] + b_f). \qquad (3.18)$$


The following step is to decide which information should be kept in the cell state. The "input gate layer" consists of a sigmoid layer and decides which information should be updated. It is followed by a tanh layer that determines candidate values, $\tilde{C}_t$, which can be added to the cell state. The activation vectors are given by

$$i_t = \sigma(W_i * [h_{t-1}, x_t] + b_i), \qquad (3.19)$$

$$\tilde{C}_t = \tanh(W_C * [h_{t-1}, x_t] + b_C). \qquad (3.20)$$

These steps are followed by an update of the cell state, $C_t$. The old state, $C_{t-1}$, is multiplied with the forget gate's activation vector, and the new candidate values, $\tilde{C}_t$, are multiplied with the input gate's activation vector, $i_t$. Thus, both the old cell state and the new candidate values are scaled by their importance:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t. \qquad (3.21)$$

Lastly, the output is decided. The cell state information is put through an activation function, commonly tanh, and a sigmoid layer filters this information such that it can be outputted.

$$o_t = \sigma(W_o * [h_{t-1}, x_t] + b_o), \qquad (3.22)$$

$$h_t = o_t * \tanh(C_t). \qquad (3.23)$$

This output and the current cell state are then transferred to the next module in the LSTM model and are used for subsequent predictions.
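A hedged Keras sketch of a small LSTM that maps the previous n_steps days of a single (normalized) series to the next day's value; the synthetic series, window length, and layer sizes are illustrative assumptions, not the thesis's tuned model.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

n_steps, n_features = 14, 1          # 14 previous days of one normalized series

# Build sliding windows from a synthetic series (placeholder for the real data).
series = np.sin(np.linspace(0, 20, 400)) + 1.0
X = np.array([series[i:i + n_steps] for i in range(len(series) - n_steps)])
y = series[n_steps:]
X = X.reshape((-1, n_steps, n_features))  # Keras expects (samples, timesteps, features)

model = Sequential([
    LSTM(32, input_shape=(n_steps, n_features)),
    Dropout(0.1),
    Dense(1),
])
model.compile(optimizer="adam", loss="mae")
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
next_day = model.predict(X[-1:])
```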


3.6 Evaluation of Model Performance

In this section, the theory of cross-validation for time-series is presented. This method of evaluation lays the foundation for how the results were obtained. Subsequently, the evaluation metrics used in combination with these methods of evaluation are presented and discussed.

3.6.1 Cross-validation for Time-Series

For a time-series problem, it is important that the unseen data, the testing data, are of later dates than the training data, due to the time-dependency of the data.

Furthermore, as the purpose is to predict one week into the future and predictions over a longer horizon can degrade the performance, the evaluation method has to be adapted.

A method of adapting cross-validation for time series is to divide the test data into several subsets in chronological order, each with a size corresponding to the real-life scenario, in this case seven days. The training data is then used to predict the first subset of seven days. This subset is then added to the training data, the model is updated and trained again, and the consecutive subset of seven days is predicted. See Figure 6 for an overview of this methodology. [36]

Figure 6: Cross-validation for time series

The blue cells denote training data, the red cells denote test data, and the grey cells denote data that is not used for that iteration. Thus, the models are tested on the test data similarly to the real-life scenario, where each week would yield new data used for the subsequent week. Note that in this case each cell denotes a seven-day period, and the sales for each individual day of the test period are predicted.

The overall performance of the models is then calculated as the mean and median values of the chosen performance metrics over all iterations. Assume that the testing data is divided into $n$ subsets and has a performance $p_i$ for each subset $i = 1, 2, \ldots, n$. Then the overall performance of the model is given by

$$P_{\text{mean}} = \frac{1}{n} \sum_{i=1}^{n} p_i, \qquad (3.24)$$

$$P_{\text{median}} = \text{Median}(p_i). \qquad (3.25)$$

It is important to utilize both the median and the mean value of the performance metrics, since the results of one or several weeks of the testing data could skew the mean. However, the median of the performance metrics does not consider the results of all weeks and could therefore paint an overly favorable picture of the results.
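A minimal sketch of this walk-forward scheme, assuming a chronologically sorted daily dataset and a hypothetical fit_and_evaluate helper that retrains the model on the training slice and returns one metric value for the test week; the mean and median of the weekly scores then correspond to Equations (3.24) and (3.25).

```python
import numpy as np

def expanding_window_splits(n_obs, horizon=7, n_splits=8):
    """Yield (train_index, test_index) pairs as in Figure 6: the test period is cut
    into consecutive seven-day blocks and the training window grows each iteration."""
    first_test = n_obs - n_splits * horizon
    for i in range(n_splits):
        start = first_test + i * horizon
        yield np.arange(0, start), np.arange(start, start + horizon)

def evaluate(data, fit_and_evaluate, horizon=7, n_splits=8):
    # fit_and_evaluate(train_idx, test_idx) is a hypothetical helper that retrains
    # the model and returns, for example, the MAE for the seven-day test block.
    scores = [fit_and_evaluate(tr, te)
              for tr, te in expanding_window_splits(len(data), horizon, n_splits)]
    return np.mean(scores), np.median(scores)   # Eqs. (3.24) and (3.25)
```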

3.6.2 Evaluation Metrics

Three evaluation metrics were used, MAE, RMSE, and SMAPE, each with its own characteristics. In this setting, MAE is the most basic metric as it corresponds directly to the actual wasted goods or missed sale opportunities. RMSE enlarges the effect of large absolute errors, and SMAPE is better protected against outliers. Large absolute errors can be seen as detrimental due to their large economic effect for a retail store. However, large absolute errors can also be seen as coincidental events that were not preventable. Furthermore, it is important to utilize several evaluation metrics since the target variable is of varying magnitude. For products with a low average number of products sold, an absolute error will yield a larger percentage error compared to the same absolute error for a product with a large average quantity sold. Thus, MAE, RMSE, and SMAPE cover multiple aspects of evaluating the forecasts. For a complete overview of these metrics, see Table 1, where $e_t$ is the error, $y_t$ is the target value, and $n$ is the number of observations in the test data.


Table 1: RMSE, MAE and SMAPE

Root mean squared error: $\text{RMSE} = \sqrt{\dfrac{1}{n} \sum_{t=1}^{n} e_t^2}$

Mean absolute error: $\text{MAE} = \dfrac{1}{n} \sum_{t=1}^{n} |e_t|$

Symmetric mean absolute percentage error: $\text{SMAPE} = \dfrac{1}{n} \sum_{t=1}^{n} s_t$, where $s_t = \dfrac{|e_t|}{|y_t| + |\hat{y}_t|}$ if $|y_t| + |\hat{y}_t| \neq 0$, and $s_t = 0$ otherwise
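A small NumPy sketch of the three metrics exactly as defined in Table 1, including the zero-denominator case of SMAPE; the example vectors are arbitrary toy values.

```python
import numpy as np

def rmse(y_true, y_pred):
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(e ** 2))

def mae(y_true, y_pred):
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(e))

def smape(y_true, y_pred):
    """SMAPE as in Table 1: |e_t| / (|y_t| + |y_hat_t|), zero when the denominator is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    ratio = np.divide(np.abs(y_true - y_pred), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return np.mean(ratio)

y, y_hat = [10, 0, 7, 12], [8, 0, 9, 12]
print(rmse(y, y_hat), mae(y, y_hat), smape(y, y_hat))
```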

3.7 Weather as a Predictor

Utilizing weather as a predictor in machine learning models can increase the predictive performance and thus improve the results, according to the academic literature [7, 16, 9, 17]. As forecasting involves predicting future behaviors and events, the utilization of weather as a feature introduces uncertainty in the data, since the weather features would in practice be based on forecasts rather than actual weather data. The inclusion of weather as a feature should therefore be done carefully. If the weather forecast is not accurate for a given day, the sales on that day could deviate from the prediction, leading to increased waste or missed sales opportunities. However, weather forecasts are relatively accurate when forecasting a few days into the future: a five-day prediction is correct about 90% of the time, while a seven-day prediction lies around 80% accuracy [37]. This indicates that weather could presumably, with quite high certainty, be utilized to improve the predictive performance of the models.


4 Data

This chapter describes in detail what data was collected and how it was utilized. This includes a description of the data used, the data aggregation procedure, missing values handling, feature engineering, and One Hot Encoding.

4.1 Included Data

As this project, and machine learning projects in general, utilizes large sets of data, the data has to be carefully explained in order to simplify reproducibility and thus improve the validity of the project. The available raw data used within this project can be divided into three separate categories: sales data available from Coop Värmland, weather data gathered from SMHI, and additional data such as calendar events.

4.1.1 Coop Data

The raw data that was made available to this project included Point of Sale (POS) data. This corresponded to detailed receipt information for each sale made in stores belonging to Coop Värmland, according to a specific selection. It included two years of data from four different stores, all of similar size with regards to their total amount of sales and all within proximity of each other. To capture long-term effects within the sales patterns, additional years of data would have been favorable. As there was no initial selection of specific products or product categories, some assumptions had to be made to reduce the number of products to a manageable amount with regards to the total data size. To achieve this, five products from each of the categories bread, charcuterie, dairy, vegetables, and cheese that satisfied two additional criteria were chosen to be included in this project.

• Criteria 1: The product should have a relatively short shelf life, which means that Coop Värmland places and plans those orders manually without any external forecasting models. It is favorable that these particular products are sold within the same day as they are displayed; thus, accurate forecasts on a daily basis are a necessity.


• Criteria 2: The products should have continuous, or close to continuous, sales in every store throughout the two years. One challenge with forecasting product sales is that products are often altered or changed, and thus become another article in the data.

Within the POS data there existed several columns of information: which product was sold, at which store the product was sold, on which date the corresponding sale occurred, how much of the product was sold, and whether there was any type of discount or not. The specific products that were used in this thesis are presented in Appendix 1.

4.1.2 SMHI

The utilized weather data was gathered from the organization SMHI [38]. There existed a large number of different aspects of weather within this database, including amount of rain, mean temperature, minimum temperature, maximum temperature, amount of sun hours, and mean wind speed for all days within the two-year time interval used within this project. As the POS data included sales from four different stores, all of these weather variables had to be collected from their respective closest weather station, which should also be reasonably close to the store. This resulted in rain and temperature information from Arvika, Karlstad and Kristinehamn, and sun hours information from Karlstad. While heavy wind could potentially affect the sales, wind information had to be disregarded as the only weather station that collected this information in proximity to the four stores was closed during a longer time period. In Table 2 the station location for each parameter is displayed. Two stores were in proximity of each other and therefore use the same stations for all the variables.

Table 2: Available SMHI data

Weather variable        Station location
Rain                    Arvika, Karlstad and Kristinehamn
Mean temperature        Arvika, Karlstad and Kristinehamn
Minimum temperature     Arvika, Karlstad and Kristinehamn
Maximum temperature     Arvika, Karlstad and Kristinehamn
Sun hours               Karlstad


4.1.3 Additional Data

As the academic literature showed that including holidays and other special calendar events could be beneficial when forecasting sales, this data needed to be gathered separately as well. Based on an overview of the aggregated sales for all products and stores during the two-year time period in question, sale spikes appeared during Easter, Midsummer, and Christmas. These holidays were therefore included as variables to be able to predict the increased sales during these holidays. See Figure 7 for an overview of these aggregated sales.

Figure 7: % of mean sales


As people do not necessarily make their purchases for Christmas on Christmas Day, sales were affected not only on the day in question but also on the surrounding days. This phenomenon was present for all holidays, as during the holiday itself people do not traditionally shop. For Easter, Midsummer, and Christmas this was solved by including the two days before the day in question, the day in question, and the two days after the day in question as different variables. Thus, there existed several categorical holiday variables denoting whether there was a holiday within two days of the date. For Christmas and Easter this meant that six and eight dates, respectively, were used, as there existed several Swedish public holidays within those events; for Midsummer, which only had one Swedish public holiday, five dates were included [39]. In Figure 8 the red block denotes the public holiday. The orange blocks surrounding the holiday are also modeled as separate variables.

Figure 8: Explanation of holiday variable

As there was not a sufficient amount of data to model the holidays as different variables, all holidays were modeled within the same holiday variable. Thus, a binary variable was introduced that denotes whether each date was a holiday or not. The variables denoting whether a date was one or two days before or after a holiday were modeled equivalently with binary variables. See Appendix 2 for an overview of all holiday dates, including surrounding days, utilized in the holiday feature.
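A minimal pandas sketch of these binary holiday-window variables; the date range, the placeholder holiday list, and the column naming are assumptions for illustration only.

```python
import pandas as pd

dates = pd.DataFrame({"date": pd.date_range("2018-12-20", "2018-12-31", freq="D")})
holidays = pd.to_datetime(["2018-12-24", "2018-12-25", "2018-12-26"])  # placeholder dates

# One binary column per offset: two/one day(s) before, the holiday itself, one/two day(s) after.
for offset in (-2, -1, 0, 1, 2):
    name = "holiday" if offset == 0 else f"holiday_{offset:+d}d"
    # A date gets the flag if it lies `offset` days after one of the holidays.
    dates[name] = dates["date"].isin(holidays + pd.Timedelta(days=offset)).astype(int)
```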

By analyzing the data, it was possible to determine the possible effects and patterns of certain events. For multiple months, one of the days with the largest number of sold products was the payday. The number of products sold during a payday was on average 20 percent larger than the average number of products sold during non-paydays. When comparing the number of products sold during a payday with the number of products sold on dates later than the 25th, the difference was 17 percent. Thus, it was probable that the payday could have a large effect on sales. In Sweden the most common payday is the 25th or, if the 25th falls on a holiday or weekend, the closest working day before the 25th. As the payday can contribute to an increase in sales, it was included as a binary variable. It should, however, be noted that the effects can differ depending on the weekday on which the payday occurs.


4.2 Data Processing

As the raw data could not be directly used by the different models, it had to be pre-processed. Transforming all the previously mentioned raw data into usable datasets involved multiple steps: exploratory data analysis; aggregating the daily sales; handling possible missing values; transforming the features to become more useful; selecting the features that actually help explain the quantity sold; making the features usable within the models through One Hot Encoding; and lastly splitting the dataset into a train, a validation, and a test dataset.

4.2.1 Exploratory Data Analysis

In order to fully understand the available dataset, an initial data analysis had to be completed. The objective of this data analysis was to get a deeper understanding of how the different variables behaved throughout the two-year time period, how the sales of different products compared to each other, what trends and patterns existed, for example weekly and monthly trends, and lastly to detect possible missing values and outliers. By performing an exploratory data analysis it could be possible to find underlying patterns that are not immediately apparent in the data. Thus, this is a critical step to determine whether the data has to be manipulated before being used and whether it is possible to extract more information that could be utilized. [40]

4.2.2 Aggregation

The second step was to aggregate the daily sales. As there could be hundreds of individual purchases of one particular product at one particular store each day, the data needed to be aggregated so that it showed the total sold quantity every day for each product and store combination. As there were four stores with 25 products each, this yielded 100 rows for each date within the two-year time period. Some products could presumably have been sold both in units of weight and as individual items. However, when aggregating, all sales of a product were assumed to be of the same unit. Thus, the aggregated quantity for each product was either a weight or a number of items.

In the aggregated data, there existed rows where the information regarding Type of Discount could have multiple values. This occurred if the product was sold both with and without a promotion during that day. Multiple values could be recorded when a discount applied only to a specific customer group or to special deals dependent on how much the customer bought. However, in this project, the promotion of a product in an individual store was modeled as a binary variable. Therefore, if the product was on promotion in a store, for any customer, it was considered to be on promotion. The POS data and the additional data were then merged based on dates, while the weather data was merged based on both dates and locations.
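A minimal pandas sketch of this aggregation, assuming hypothetical POS column names ('date', 'store', 'product', 'quantity', 'discount_type'); the tiny example frame is purely illustrative.

```python
import pandas as pd

# pos: one row per receipt line, with assumed column names.
pos = pd.DataFrame({
    "date": pd.to_datetime(["2019-03-01"] * 3 + ["2019-03-02"] * 2),
    "store": ["A", "A", "B", "A", "A"],
    "product": ["bread1"] * 5,
    "quantity": [2, 1, 3, 4, 2],
    "discount_type": [None, "weekly deal", None, None, None],
})

daily = (pos.assign(promotion=pos["discount_type"].notna().astype(int))
            .groupby(["date", "store", "product"], as_index=False)
            .agg(quantity=("quantity", "sum"),     # total sold quantity that day
                 promotion=("promotion", "max")))  # binary: any promoted sale counts as promotion
```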

4.2.3 Missing Values

After the aggregation was completed, the next step was to handle missing values in each corresponding feature. There existed a few missing values in two variables: either there was missing weather data due to the weather station not being operational, or there were no recorded quantities sold for an individual product. The missing values within the weather data were imputed using the mean values of that feature. Missing values regarding sold quantity could have two underlying reasons: there was no sale of that product at that store on that particular date, or the information was missing. Furthermore, it could also be due to the supplier not producing a sufficient amount of products, such that no products could be sold although the demand was presumably similar to other dates. To handle this, all periods with one or two days in a row of missing values were imputed as zero. Periods with three or more missing days in a row were imputed using a rolling mean function that calculated the mean value of that particular product at that particular store during the last 30-day window, starting from one week before the missing day in question. [41]
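A hedged sketch of this imputation rule for a single store/product series indexed by date; the exact window boundaries and helper name are assumptions made for illustration, not the thesis's implementation.

```python
import pandas as pd

def impute_quantity(s: pd.Series) -> pd.Series:
    """s: daily quantity for one store/product, with a sorted DatetimeIndex and NaN where missing."""
    s = s.copy()
    gap_id = s.notna().cumsum()                              # label consecutive missing runs
    gap_len = s.isna().groupby(gap_id).transform("sum")
    # Short gaps (one or two days in a row) are treated as zero sales.
    s[s.isna() & (gap_len <= 2)] = 0
    # Longer gaps: mean of a 30-day window ending one week before the missing day.
    for day in s.index[s.isna()]:
        window = s.loc[day - pd.Timedelta(days=36): day - pd.Timedelta(days=7)]
        s.loc[day] = window.mean()
    return s
```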

4.2.4 Feature Engineering

A feature that showed promise within the academic literature was the weekday of the sale. As this feature was not explicitly part of the raw data, it had to be engineered by utilizing the dates and then merged into the dataset. According to previous research, this feature could have high explanatory power as sales often follow weekly trends, and therefore the sales occurring on a Monday could be correlated with the previous Monday and the Monday before that. Furthermore, sales tended not to be uniformly distributed over the weekdays, and thus the weekday could be used as a predictor to, potentially, increase the performance of the models. [4, 10, 11]

In order to make sure that these findings were applicable to this specific data, an aggregation over the weekdays was conducted, and this clearly showed that sales were not uniformly distributed over the week. More exactly, there existed two spikes, occurring on Fridays and Tuesdays, where Friday was the largest. This could correspond to the fact that people often prepare for the weekend by shopping on Fridays. Sunday was the day with the lowest quantity sold, which may correlate with the fact that stores often receive new inventory on non-weekend days. These aggregated weekly sales can be seen in Figure 9.

Figure 9: % of mean aggregated weekly sales

As sales forecasting revolves around time series analysis, another feature that could presumably be beneficial for some machine learning models is previous values of the quantity itself, in the form of lags [4, 42, 43]. These lags represent previous values and were derived by copying the quantity and shifting it the desired number of days forward; see Table 3 for an example. It is also clear from this example that if one chooses a lag corresponding to -3, the first three rows would get missing values and thus have to be removed or imputed. In total, 16 different lags of three types were created. Firstly, ten daily lags based on the same logic as the example in Table 3. Secondly, four weekly rolling sums: the total quantity sold during each of the four preceding weeks. Thirdly, two weekly rolling means: the mean quantity sold during the last 4 and 8 weeks. As 8 weeks is 56 days, the first 56 days were removed from the dataset. This was done to incorporate and capture the seasonality effects. A sketch of these lag features is shown after Table 3.


Table 3: Example of lags

Quantity  Lag -1  Lag -2  Lag -3
5         -       -       -
6         5       -       -
8         6       5       -
1         8       6       5
6         1       8       6
9         6       1       8
8         9       6       1
4         8       9       6
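A pandas sketch of the three lag types (daily lags, weekly rolling sums, rolling means), computed per store/product group; the column names and exact window alignment are assumptions for illustration.

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    """df has one row per date, store and product with a 'quantity' column (assumed names)."""
    df = df.sort_values("date").copy()
    g = df.groupby(["store", "product"])["quantity"]

    for lag in range(1, 11):                         # ten daily lags
        df[f"lag_{lag}"] = g.shift(lag)
    for week in range(1, 5):                         # four weekly rolling sums
        df[f"sum_week_{week}"] = g.transform(
            lambda s, w=week: s.shift(7 * (w - 1) + 1).rolling(7).sum())
    for weeks in (4, 8):                             # rolling means over the last 4 and 8 weeks
        df[f"mean_{weeks}w"] = g.transform(
            lambda s, w=weeks: s.shift(1).rolling(7 * w).mean())
    return df.dropna()                               # drops the first 56 days of each series
```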

4.2.5 Feature Selection

Including a large number of features in a machine learning model can be beneficial or detrimental depending on the model and the computational power available. Including more features could improve the model performance as more information is included in the model. However, it can result in overfitting and an unnecessary increase in computation if the included features do not contribute sufficient predictive power. Overfitting can also occur if there is a large number of features compared to the number of data points, since this would result in a large variance; the increased errors would then be due to the model's sensitivity to the training data. This thesis aimed to build forecasting models that can be beneficial for multiple stores, not only for the four chosen ones; thus the models needed to be tested both with and without the weather data, as not all stores would have access to nearby weather stations. It would also be beneficial to analyze whether the weather data would increase the variance of the models, or whether it can aid the predictive power.

To analyze these effects, two separate feature sets were tested and evaluated within this project, the difference being the inclusion of the weather data gathered from SMHI. These are denoted FSW and FS, respectively, throughout the rest of this report.


4.2.6 One Hot Encoding

Some machine learning models cannot handle categorical data as such. Instead, the models would interpret it as numerical data, resulting in misinterpreted data and worse predictions. Consider, for example, the weekday feature, ranging from 0 to 6, where 0 corresponds to Monday and 6 to Sunday. If the model were to use a single variable ranging from 0 to 6 it would interpret Tuesday as being greater than Monday. Similarly, Wednesday would be considered greater than Tuesday, and so on. This is meaningless as the values cannot be compared directly, and thus such data had to be modeled as categorical variables.

One method to transform a categorical variable that has multiple different categories is one-hot encoding (OHE). This algorithm transforms each categorical feature with r categories into r new features. Each observation belonging to category j then gets a 1 in the corresponding feature column j and a 0 in every other column. So, for two categorical features with three and six categories respectively, OHE creates nine new features. The original, non one-hot encoded variable is then removed from the dataset. See Figure 10 for an example. [44]

Figure 10: One hot encoding example
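A one-line pandas sketch of one-hot encoding; the example columns 'weekday' and 'store' are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"weekday": [0, 1, 4, 6], "store": ["A", "B", "A", "C"]})

# Each categorical column with r categories becomes r binary columns; the originals are dropped.
encoded = pd.get_dummies(df, columns=["weekday", "store"], prefix=["weekday", "store"])
print(encoded.columns.tolist())
```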


4.2.7 Data Split

In order to use the prepared data in the machine learning models and be able to evaluate the performance of the models, the data had to be split into three separate datasets: a training, a validation, and a testing set. The training data covered 16 months, the validation data 2 months, and the test data 4 months. The training data was used to train the models, the validation data was used to tune the hyperparameters of the models, and the testing data was used to evaluate the performance of each model. The performance of the models on the validation and testing data was measured following the logic in section 3.6.1.
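A minimal sketch of such a chronological 16/2/4-month split; the start date and column name are assumptions about the prepared dataset.

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, start="2018-01-01"):
    """Split a date-sorted frame into 16 months of training, 2 of validation, 4 of test."""
    train_end = pd.Timestamp(start) + pd.DateOffset(months=16)
    val_end = train_end + pd.DateOffset(months=2)
    train = df[df["date"] < train_end]
    val = df[(df["date"] >= train_end) & (df["date"] < val_end)]
    test = df[df["date"] >= val_end]
    return train, val, test
```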


5 Method

In this chapter, each model and its corresponding implementation is presented in detail. This includes descriptions of hyperparameters, packages used and a general overview of how each model was utilized.

5.1 Model Implementation

All models were trained on the training dataset and the hyperparameters were tuned using the validation data. Lastly, the models were evaluated on the test dataset. The process of evaluating the models, and the hyperparameters, followed the cross-validation approach presented in chapter 3. As the naive model followed the basic approach described in section 3.5.1 and did not have any hyperparameters, it is not further presented within this chapter.

5.1.1 ARIMAX

Statsmodels was used to create the ARIMAX model. Each product and store combination was trained and evaluated separately. As the ARIMAX model incorporates lags automatically, the manually created lags in the dataset were not strictly necessary [45]. However, the models were tested both with and without the manually created lags to evaluate whether the model could gain increased predictive power. Auto-correlation and partial auto-correlation graphs were used to determine possible values for the automatic lags of this model. Furthermore, multiple models were evaluated based on the presented evaluation metrics to determine the best hyperparameter values. The hyperparameters that were tested are displayed in Table 4.

(49)

Table 4: The evaluated hyperparameters for ARIMAX

Hyperparameter   Possible Values
p                7, 14, 21
q                7, 14, 21
d                0, 1, 2
trend            None, c, t, ct

Here c indicates a constant trend, t indicates a linear trend, and ct indicates both. Furthermore, "None" indicates that no trend variable was used. [45]
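As a sketch of how one product-store series could be fitted, the snippet below uses the SARIMAX class in statsmodels, which supports exogenous regressors and hence ARIMAX. The report does not state the exact class or call signature used, so the function, column layout, and the order (7, 0, 7), corresponding to one of the candidate settings in Table 4, are illustrative assumptions.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_arimax(y: pd.Series, X: pd.DataFrame, order=(7, 0, 7), trend=None):
    """Fit an ARIMAX(p, d, q) model for a single product/store series.
    y is the daily quantity sold and X holds the exogenous regressors
    (calendar events, weather, ...), both indexed by date."""
    model = SARIMAX(endog=y, exog=X, order=order, trend=trend)
    return model.fit(disp=False)

# res = fit_arimax(y_train, X_train)                        # train on 16 months
# forecast = res.forecast(steps=len(X_test), exog=X_test)   # predict the test window
```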

5.1.2 Facebook Prophet

Prophet was implemented using Facebook's own fbprophet library. As this model handles trends, seasonality, and holidays internally, the manually created lags and holiday data could be discarded; nevertheless, the model was tested both with and without the manually created lags. Besides the main steps of creating the model, fitting data to the model, and predicting new values from it, some model-specific steps had to be taken: the holiday-related dates had to be specified, the additional features had to be added as extra regressors, and the hyperparameters had to be tuned. See Table 5 for a complete list of evaluated hyperparameters for the Prophet model.

Table 5: The evaluated hyperparameters for Prophet

Hyperparameter            Possible Values
Yearly Seasonality        True, False
Weekly Seasonality        True, False
Daily Seasonality         True, False
Changepoint Prior Scale   0.05, 0.5, 2, 10
Changepoint Range         0.65, 0.9, 0.05
Seasonality Prior Scale   0.01, 0.1, 1, 10, 100, 1000
Holidays Prior Scale      0.1, 1, 10, 100, 1000
Growth                    Linear, Logistic
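A minimal sketch of these model-specific steps with the fbprophet library is given below. The holiday frame, the "temperature" regressor name, and the particular parameter values are illustrative assumptions; only the general fit/predict workflow follows the description above.

```python
import pandas as pd
from fbprophet import Prophet

# Hypothetical holiday frame; in the project the dates came from the calendar data.
holidays = pd.DataFrame({"holiday": "christmas_eve",
                         "ds": pd.to_datetime(["2018-12-24", "2019-12-24"])})

m = Prophet(growth="linear",
            yearly_seasonality=True,
            weekly_seasonality=True,
            daily_seasonality=False,
            changepoint_prior_scale=0.5,
            holidays=holidays)
m.add_regressor("temperature")   # example extra regressor from the weather features

# Prophet expects a frame with a 'ds' (date) and 'y' (target) column plus the regressors.
# m.fit(train_df)
# forecast = m.predict(test_df[["ds", "temperature"]])
```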


5.1.3 LSTM

The LSTM model was created using the Keras, TensorFlow, and Numpy libraries in Python. As the LSTM model contains a cell state that stores memory of previous days, the manually created lags were not explicitly necessary. Therefore, the model was evaluated both with and without the manually created lags. The algorithm implementation includes creating the model, compiling the model, fitting data to the model, and predicting new values from the model. In addition, the data had to be normalized, as the target variable of some products was of a larger magnitude than that of others. Multiple hyperparameters were tuned in order to reach the best performance, see Table 6.

Table 6: The evaluated hyperparameters for LSTM

Hyperparameter        Possible Values
Batch Size            100, 700, 1400, 2100, 2800
Number of Epochs      100, 200, 300, 400
Dropout Level         0, 0.1, 0.2
Number of Neurons     8, 16, 32, 64
Activation Function   Tanh, ReLU
Number of Layers      1, 2, 3
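A minimal Keras sketch of a two-layer LSTM of this kind is shown below. The window length, feature count, optimizer, and loss are not specified in the report and are assumptions here; the layer sizes, dropout, and activation correspond to candidate values from Table 6.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

# Hypothetical input shape: the window length and feature count depend on the
# prepared feature set (FS or FSW) and are placeholders here.
n_timesteps, n_features = 7, 20

model = Sequential([
    LSTM(32, activation="tanh", return_sequences=True,
         input_shape=(n_timesteps, n_features)),
    Dropout(0.1),
    LSTM(32, activation="tanh"),
    Dropout(0.1),
    Dense(1),   # one-step-ahead forecast of the (normalized) quantity sold
])
model.compile(optimizer="adam", loss="mse")   # optimizer and loss are assumptions

# X_train, y_train are assumed to be normalized arrays of shape
# (n_samples, n_timesteps, n_features) and (n_samples, 1).
# model.fit(X_train, y_train, epochs=300, batch_size=1400,
#           validation_data=(X_val, y_val))
# y_pred = model.predict(X_test)
```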


5.1.4 XGBoost

XGBoost was implemented using the XGBoost library, which is based on the model authored by Tianqi Chen [46]. The hyperparameters that were tested in a grid search approach are displayed in Table 7. The manually created lags were used for this model so that previous data points could serve as predictive variables. Without the manually created lags, the model would not have benefited from the time dependency of the data.

Table 7: The evaluated hyperparameters for XGBoost

Hyperparameter                                                  Possible Values
Subsample ratio of columns when constructing each tree          0.5, 0.6, 0.7, 0.8, 0.9, 1
Learning rate                                                   1, 0.5, 0.25, 0.1, 0.05, 0.01, 0.005
Max depth                                                       6, 7, 8, 9, 10
Minimum child weight                                            4, 5, 6, 7, 8, 9
Fraction of observations to be randomly sampled for each tree   0.5, 0.6, 0.7, 0.8, 0.9, 1
Number of estimators                                            100, 250, 500, 750, 1000
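The sketch below illustrates a manual grid search with the XGBoost scikit-learn wrapper, evaluated on a validation set. The synthetic data, the reduced grid, and the selection on MAE are assumptions made only to keep the example self-contained; the actual search in the project covered the full grid in Table 7 and followed the evaluation procedure in chapter 3.

```python
import numpy as np
from itertools import product
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Tiny synthetic stand-ins for the real train/validation splits, only to keep the
# sketch self-contained; in the project these come from the prepared feature sets.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.poisson(20, size=200)
X_val, y_val = rng.normal(size=(50, 10)), rng.poisson(20, size=50)

# A reduced, illustrative grid; the full grid evaluated in the project is in Table 7.
grid = {"learning_rate": [0.1, 0.01], "max_depth": [6, 9], "n_estimators": [250, 750]}

best_mae, best_params = float("inf"), None
for lr, depth, n_est in product(*grid.values()):
    model = XGBRegressor(learning_rate=lr, max_depth=depth, n_estimators=n_est,
                         subsample=0.9, colsample_bytree=0.9, min_child_weight=6)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    if mae < best_mae:
        best_mae = mae
        best_params = {"learning_rate": lr, "max_depth": depth, "n_estimators": n_est}

print(best_params, round(best_mae, 2))
```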


6 Result

In this chapter, the final selection of hyperparameters for each model and the results for these models are presented. The results are based on the median and mean MAE, RMSE, and SMAPE for each model. Furthermore, as the performance of the models can depend on the time period, graphs are presented displaying the evaluation metrics for each week in the testing data. The naive model follows the basic functionality described in section 3.2.1 and does not have any hyperparameters; thus, no final set of parameters is presented for it.

6.1 Performance of Models

The performance of the final models was evaluated using the mean and median values of RMSE, MAE, and SMAPE over all iterations, and these results are presented in Tables 8 and 9. Furthermore, Figures 11, 12, and 13 display the results for each individual week. Two feature sets were evaluated for all models except the naive model: FSW, which included the weather data, and FS, which did not. As the naive model was based only on the mean values of the quantities sold, it used neither FSW nor FS.
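For reference, a minimal NumPy sketch of these three metrics is given below. These are common textbook definitions; the exact formulas, and in particular the scaling of SMAPE, may differ from those defined in section 3.3.1.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def smape(y_true, y_pred):
    # One common SMAPE variant, bounded between 0 and 1; the scaling used in
    # section 3.3.1 (e.g. a factor of 2 or 100) may differ.
    return np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))

# Example:
# y_true, y_pred = np.array([10, 20, 30]), np.array([12, 18, 33])
# print(mae(y_true, y_pred), rmse(y_true, y_pred), smape(y_true, y_pred))
```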

Table 8: Model results (mean)

Model     Feature Set   MAE    SMAPE   RMSE
ARIMAX    FSW           6.64   0.25    13.95
LSTM      FSW           6.63   0.24    14.68
Prophet   FSW           6.65   0.28    13.44
XGBoost   FSW           6.43   0.24    13.75
ARIMAX    FS            6.58   0.25    13.78
LSTM      FS            6.72   0.25    14.46
Prophet   FS            6.62   0.27    13.38
XGBoost   FS            6.39   0.24    13.75
Naive     -             7.45   0.27    15.08


Table 9: Model results (median)

Model     Feature Set   MAE    SMAPE   RMSE
ARIMAX    FSW           5.40   0.23    9.86
LSTM      FSW           5.23   0.23    9.63
Prophet   FSW           5.66   0.26    9.99
XGBoost   FSW           5.07   0.23    9.20
ARIMAX    FS            5.23   0.24    9.27
LSTM      FS            5.39   0.24    9.69
Prophet   FS            5.65   0.26    9.93
XGBoost   FS            5.09   0.23    9.27
Naive     -             6.20   0.26    10.04

Figure 11: MAE for each week and model


Figure 12: SMAPE for each week and model

Figure 13: RMSE for each week and model


6.1.1 Naive Model

In Figure 14, the sales of one particular product are displayed in relation to their mean value. In Figure 15, the aggregated sales for all products are displayed in relation to their mean value.

Figure 14: % of mean sales for the naive model and one specific product and store

Figure 15: % of mean aggregated sales for the naive model


6.1.2 ARIMAX

The final set of hyperparameters for the ARIMAX model is displayed in Table 10.

Table 10: The final hyperparameters for ARIMAX

Hyperparameter   Final Value
p                7
q                7
d                0
trend            None

Utilizing the manually created lags resulted in no added performance on the validation data; therefore, the final model did not utilize them.

In Figure 16, the sales of an individual product and store are displayed to showcase the model prediction. Figure 17 displays the total amount of sales compared to the total predicted sales. Both figures display the results in relation to the mean value of sales.

Figure 16: % of mean sales of the ARIMA and one specific product and store


Figure 17: % of mean sales aggregated ARIMA

6.1.3 Facebook Prophet

The final set of hyperparameters for the Prophet model is displayed in Table 11.

Table 11: The final hyperparameters for Prophet

Hyperparameter            Final Value
Yearly Seasonality        True
Weekly Seasonality        True
Daily Seasonality         False
Changepoint Prior Scale   0.5
Changepoint Range         0.65
Seasonality Prior Scale   1000
Holidays Prior Scale      1000
Growth                    Linear

Utilizing the manually created lags resulted in no additional performance on the validation data; thus, the final model did not utilize them. In Figure 19, the aggregated sales for all products are displayed in relation to their mean value, and in Figure 18, the sales of one particular product are displayed in relation to its mean value.


Figure 18: % of mean sales for Prophet and one specific product and store

Figure 19: % of mean aggregated sales Prophet


6.1.4 LSTM

The final parameters for the LSTM model are presented in Table 12. Utilizing the manually created lags did not improve performance on the validation data; thus, the final model was evaluated without these lags.

Table 12: The final hyperparameters for LSTM

Hyperparameter        Final Value
Batch Size            1400
Number of Epochs      300
Dropout Level         0.1
Number of Neurons     32
Activation Function   Tanh
Number of Layers      2

Figure 20 displays the predicted amount of sales for an individual product and store in comparison to the mean value of sales. Figure 21 displays the total amount of sales and predicted sales for all products and stores.

Figure 20: % of mean sales for LSTM Model and one specific product and store


Figure 21: % of mean sales aggregated LSTM

6.1.5 XGBoost

Multiple versions of XGBoost with different sets of hyperparameters were evaluated. The final and best model is displayed in Table 13.

Table 13: The final hyperparameters for XGBoost

Hyperparameter                                                  Final Value
Subsample ratio of columns when constructing each tree          0.9
Learning rate                                                   0.01
Max depth                                                       9
Minimum child weight                                            6
Fraction of observations to be randomly sampled for each tree   0.9
Number of estimators                                            750

Figure 22 displays the true and predicted quantities sold for a specific product and store pair. Figure 23 displays the total amount of products sold and predicted to be sold across all stores.


Figure 22: % of mean sales for XGBoost Model and one specific product and store

Figure 23: % of mean sales aggregated XGBoost

References
