
Master’s Thesis, 30 ECTS

Master of Science in Industrial Engineering and Management – Data Science, 300 ECTS

Sales Forecasting by Assembly of Multiple

Machine Learning Methods

A stacking approach to supervised machine learning

Anton Falk, Daniel Holmgren


SALES FORECASTING BY ASSEMBLY OF MULTIPLE MACHINE LEARNING METHODS. A stacking approach to supervised machine learning.

Submitted in fulfilment of the requirements for the degree Master of Science in Industrial Engineering and Management

Department of Mathematics and Mathematical Statistics
Umeå University
SE-901 87 Umeå, Sweden

Supervisors from Umeå University:
Peter Anton
Håkan Lindkvist

Supervisor from Trimma AB:
Axel Sandström Emmelin

Examiner:
Jun Yu, Umeå University


Abstract

Today, digitalization is a key factor for businesses to enhance growth and to gain advantages and insight into their operations. Digitalization plays a central role both in planning operations and in understanding customers, and companies are spending more and more resources in this field to gain critical insights and enhance growth. The fast-food industry is no exception: restaurants need to be highly flexible and agile in their work. There is therefore an immense demand for knowledge and insights to help restaurants plan their daily operations, and a great need for organizations to continuously adopt new technological solutions into their existing processes. Well-implemented machine learning solutions, in combination with feature engineering, are likely to bring value to these existing processes.

Sales forecasting, which is the main field of study in this thesis, has a vital role in planning a fast-food restaurant's operations, both for budgeting and for staffing purposes. The term fast food speaks for itself: with it comes a commitment to provide high-quality food and rapid service to the customers. Understaffing risks compromising either the quality of the food or the service, while overstaffing leads to low overall productivity. Generating highly reliable sales forecasts is thus vital to maximize profits and minimize operational risk.

SARIMA, XGBoost and Random Forest were evaluated on training data consisting of sales numbers, business hours and categorical variables describing date and month. These models worked as base learners whose sales predictions on a specific dataset were used as training data for a Support Vector Regression (SVR) model. This stacking approach shows sufficient results, with a significant gain in prediction accuracy for all investigated restaurants on a 6-week aggregated timeline compared to the existing solution.


Sammanfattning

Digitalization today plays a key role in creating growth and insight for companies; these insights bring advantages both in planning and in understanding their customers. This is an area in which companies invest more and more resources in order to build a greater understanding of their business and thereby increase growth. The fast-food industry is no exception, as restaurants need a high degree of flexibility in their ways of working to meet customer demand. This creates a large demand for knowledge and insights to support the planning of their daily work, and there is a great need for companies to continuously implement new technical solutions in existing processes. With well-implemented machine learning solutions, combined with engineering more informative variables from existing data, actors can add value to already existing processes.

Sales forecasting, which is the main topic of this study, plays an important role in operational planning within the fast-food industry, both for budgeting and for staffing. The name fast food speaks for itself; with it comes a promise to the customer to provide high-quality food and fast service. Understaffing risks breaking one of these promises, either through substandard food quality or through an inability to deliver fast service, while overstaffing instead risks leading to inefficient use of resources. Generating highly reliable forecasts is therefore crucial in order to maximize profit and minimize operational risk.

SARIMA, XGBoost and Random Forest were evaluated on a training set consisting of sales figures, hour of the day and categorical variables describing day and month. These models act as base models whose predictions on a specific test set are used as training data for a Support Vector Regression (SVR) model. Applying stacking of machine learning models to this type of problem showed satisfactory results, with a significant improvement in prediction accuracy over a 6-week aggregated period compared to the already existing model.


Acknowledgements

We would like to express our great appreciation to our supervisor at Trimma, Axel Sandström Emmelin. As our supervisor, you have made us feel like a part of the company, and your guidance has been of great help.

We would also like to extend our thanks to the consultant and development team at Trimma for their great effort in helping us and for the interest shown in our project, especially Adam Holmström for your involvement in this project and Moa Forsberg for helping us initiate contact with the company and making this possible in the first place.

Second, we would like to express our gratitude to our supervisors at Umeå University, Peter Anton and Håkan Lindkvist, for their support and guidance.

Last but not least, we would like to address our greatest appreciation to our families for their support, not only during this thesis project, but throughout our whole studies.


Contents

1 Introduction
1.1 Trimma AB
1.2 Background
1.3 Purpose
1.4 Task description
1.5 Limitations
1.6 Previous work

2 Theory
2.1 Linear regression
2.2 Random Forest
2.2.1 Bagging
2.2.2 Decision trees
2.2.3 The Random Forest algorithm
2.3 XGBoost
2.3.1 Boosting trees
2.3.2 Gradient tree boosting
2.4 Time series modeling
2.4.1 Weak stationarity
2.4.2 KPSS-test
2.4.3 Differencing
2.4.4 AR
2.4.5 Augmented Dickey-Fuller test
2.4.6 MA
2.4.7 ARIMA
2.4.8 SARIMA
2.4.9 The Hyndman & Khandakar algorithm
2.5 Support Vector Machines
2.5.1 The Maximal Margin Classifier
2.5.2 The Support Vector Classifier
2.5.3 Support Vector Machines
2.5.4 Support Vector Regression

3 Method
3.1 Error metrics
3.1.1 Mean Absolute Error
3.1.2 Root Mean Squared Error
3.1.3 Sum of Squared Errors
3.1.4 AIC
3.2 Data description
3.3 Data preprocessing & feature engineering
3.4 Random Forest
3.5 XGBoost
3.6 SARIMA
3.7 Stacking
3.7.1 Linear Regression
3.7.2 Support Vector Regression
3.8 Additional testing

4 Results
4.1 Results on aggregated time horizons
4.2 Randomized weeks
4.2.1 Compared to the baseline model
4.2.2 Comparison with new extended models

5 Discussion

6 Conclusion
6.1 Further studies

7 Appendix


1 Introduction

1.1 Trimma AB

Trimma is an IT company, established in 2003, that focuses on solutions for business management and decision support, mainly through its product INSIKT. Trimma is headquartered in Umeå, with offices in Örnsköldsvik, Sundsvall and Gothenburg. Trimma employs over 70 people and is currently owned by Visma (Trimma n.d.(a)).

1.2 Background

INSIKT is a complete system for business management and decision support (Trimma n.d.(b)). The product supports a continuous process, all the way from setting goals and planning, to analysis and goal management.

INSIKT provides information for planning and budgeting through predictions made on historical data. The goal with INSIKT is to provide as accurate insights as possible to maximize value to its customers. There is no implementation of machine learning in the current product, but there is a desire from the customers that INSIKT in the future will use some sort of machine learning to provide business intelligence. Trimma has had one earlier thesis project on this subject that is referenced in Section 1.6, and is very interested in further investigation, development and tuning of the prediction methods. At this time, the product implements a baseline model that uses sales from a reference date to estimate the upcoming sales.

1.3 Purpose

The purpose of this project is to enhance knowledge about machine learning and its implementations. Further, we want to gain an understanding of which stacked supervised learning methods can be applied to improve the performance of INSIKT's sales forecasting. The goal from the company's point of view is to get a better understanding of how machine learning and statistical methods can be implemented in its already established products. More specifically, can machine learning increase the product's prediction accuracy, bringing more value to Trimma's customers through more opportunities to cut costs?

1.4 Task description

The aim of this project is to investigate how well different machine learning and time series forecasting models perform in predicting company sales based on time series data and, in the end, to present a machine learning solution suitable for sales forecasting. The task thus becomes to investigate and compare different methods of machine and statistical learning, track their performance against the baseline model and investigate whether there is a significant gain in accuracy, and then to combine (stack) the methods and evaluate the performance.

1.5 Limitations

It is possible to build and test many different models using many different methods. Due to time constraints, this thesis focuses on five modelling methods only. Even though there were over 100 restaurants available in the database, it was decided to include only three of them in this project. These three restaurants have been chosen to represent different environments in which this fast food chain operates. A further description of the restaurants can be found in Section 3.2.


1.6 Previous work

Granberg (Granberg 2021) examined the accuracy of supervised machine learning models on part of the same base data as in this study. He tested five different methods on five different feature sets and compared the results to Trimma's current algorithm for predicting sales. The results from Granberg's study were used to decide which features to include and to determine which methods to leave out and which to test further.

In addition, the best hyperparameter settings from Granberg's models were reused on the parts of the data where they fit, in order to reduce computation time.


2 Theory

In this section, all relevant theory used during this project is described.

2.1 Linear regression

A multiple linear regression model without interaction is defined as:

$$ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon, $$

where $X_1, X_2, \dots, X_p$ represent the features and $\beta_1, \beta_2, \dots, \beta_p$ (called the regression coefficients) define the relationship between the corresponding feature and the response variable y; $\varepsilon$ is the random error term. Estimation of the regression coefficients is done by minimizing the RSS (see the definition in Section 3.1.3) (James et al. 2013, p. 71-72).

2.2 Random Forest

Random forest is an ensemble learning method, meaning it combines many smaller models to achieve better predictions. It uses the idea of forming many decision trees, hence the name. It also uses bootstrap aggregation, or "bagging", to reduce the variance of the resulting model without increasing the bias (James et al. 2013, p. 319-321). This thesis only focuses on regression problems, thus excluding theory regarding classification problems.

2.2.1 Bagging

Take a set of n independent observations $Z_1, \dots, Z_n$, each with the same individual variance $\sigma^2$. The mean $\bar{Z}$ of those observations has variance $\sigma^2/n$. Thus, averaging a set of observations reduces variance (James et al. 2013, p. 316). This is the basic idea behind bagging. In bagging, many separate training sets are built from the original training set. If B models $\hat{f}_1(x), \hat{f}_2(x), \dots, \hat{f}_B(x)$ are trained using B different training sets, one can average them into a single low-variance model given by:

$$ \hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x). $$

In reality, multiple training sets are not usually accessible. This can be achieved synthetically by using bootstrap aggregation, i.e., generating B replicates (with replacement) of the training set. The process can be described as picking individual marbles (observations) from a bag (training set), registering them, and then putting them back in the bag before picking a new set, hence the name bagging. This allows the same observation to be used multiple times in each subset. By training the models on the bth bootstrapped training subset to obtain $\hat{f}^{*b}(x)$ and averaging all B predictions, the following bagging definition is obtained:

$$ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x), $$

where $\hat{f}^{*b}(x)$ is the bth model trained on the bth bootstrapped subset (James et al. 2013, p. 316-319).
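To make the bagging recipe above concrete, the following minimal sketch (an illustration only, assuming NumPy and scikit-learn are available; the data are synthetic) draws B bootstrap replicates, fits one regression tree per replicate and averages the B predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))                     # synthetic training features
y = X[:, 0] * np.sin(X[:, 1]) + rng.normal(scale=0.5, size=500)

B, n = 100, len(y)                                        # number of bootstrap replicates
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)                      # sample n observations with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# The bagged prediction f_bag(x) is the average of the B individual tree predictions.
X_new = rng.uniform(0, 10, size=(5, 3))
f_bag = np.mean([tree.predict(X_new) for tree in trees], axis=0)
print(f_bag)
```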


2.2.2 Decision trees

Decision trees can be used for both regression and classification problems, but because we only apply them to regression problems in this thesis, classification theory will be excluded.

As illustrated in Figure 1, each tree consists of a root, a number of internal nodes and leaves. The root and the internal nodes, also called predictor splits, test each observation passed through the tree on different criteria (produced by model training) until the observation is placed in a terminal node (leaf node). The observation is then assigned a response value depending on which terminal node it falls into. There are two basic steps that go into building a regression tree:

[1] Over one or multiple iterations, divide the predictor space (the set of possible values of the predictors) into J non-overlapping regions, $R_1, R_2, \dots, R_J$. These regions correspond to the terminal nodes in Figure 1, called LEAFs.

[2] Assign every observation that falls into region $R_j$ the same response value. In regression trees, that response value is the mean response value of all the training observations in region $R_j$.

The regions mentioned in step [1] are constructed by recursive binary splitting (a top-down, greedy approach), minimizing the RSS at each split. The RSS is defined as:

$$ \mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2, $$

where $y_i$ are the training observations in region $R_j$ and $\hat{y}_{R_j}$ is the mean response value of the training observations within region $R_j$. In other words, each splitting point is determined by considering all possible splits of the predictors and choosing the split that minimizes the RSS (James et al. 2013, p. 306).

Figure 1: Illustration of a basic binary decision tree, where the root and the internal nodes are the predictor splits and the leaves are the terminal nodes.


2.2.3 The Random Forest algorithm

The Random Forest method is built on the combination of decision trees and bagging, with some important tweaks. While using the bagging technique, the trees are constructed using subsets of the training data. Random Forest adds to this by only allowing a subset of predictors to be considered at each predictor split. The default number m of predictors to consider is $m = \sqrt{p}$, where p is the total number of predictors. This leads to decorrelation of the trees and makes the model more robust against highly correlated trees (James et al. 2013, p. 319).
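In scikit-learn, the decorrelation step corresponds to the max_features argument; a minimal sketch (synthetic data, default settings otherwise) is:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 9))                     # p = 9 synthetic predictors
y = X[:, 0] * np.sin(X[:, 1]) + rng.normal(scale=0.5, size=500)

# max_features="sqrt" lets each split consider only m = sqrt(p) randomly chosen
# predictors, which is what distinguishes a random forest from plain bagging.
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.predict(X[:5]))
```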

2.3 XGBoost

XGBoost is a tree-based learning method similar to Random Forest, but with some very important differences in how the individual models are trained; for further information about decision trees, see Section 2.2.2. XGBoost stands for Extreme Gradient Boosting and is a scalable system for machine learning (Chen and Guestrin 2016). XGBoost can be used for both classification and regression purposes; however, this thesis only focuses on regression problems, thus excluding theory concerning classification problems.

2.3.1 Boosting trees

In tree boosting, $R_j$, $j = 1, 2, \dots, J$, represents a region of the predictor space given by the terminal nodes of a tree. A constant $\gamma_j$ is assigned to region $R_j$ such that the prediction rule is $x \in R_j \Rightarrow f(x) = \gamma_j$. From this rule, the formal definition of a tree can be expressed as:

$$ T(x; \Theta) = \sum_{j=1}^{J} \gamma_j I(x \in R_j), $$

where $\Theta = \{R_j, \gamma_j\}_{1}^{J}$ are the parameters and J is treated as a meta parameter. The parameters can be found by minimizing:

$$ \hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j), $$

where L is the squared error loss function $L(y_i, \gamma_j) = (y_i - \gamma_j)^2$ and $x_i$ are the observations in region $R_j$. This is a combinatorial optimization problem and is best solved by dividing the problem into two parts, see Section 2.2.2.

For regression, the solution to the minimization problem (1) is the tree that best fits the residuals of the current model, $y_i - f_{m-1}(x_i)$, where $f_{m-1}$ is the current model; thus $\hat{\gamma}_{jm}$ is the mean of the residuals in each corresponding region $R_j$:

$$ \hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\left(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\right), \quad (1) $$

where $\Theta_m$ is the set of parameters for tree $T_m(x)$ (Hastie, Friedman, and Tibshirani 2017, ch. 10.9).


2.3.2 Gradient tree boosting

The loss function for predicting y is:

$$ L(f) = \sum_{i=1}^{N} L(y_i, f(x_i)), $$

and the objective is to minimize $L(f)$ with respect to f, where $L(y_i, f(x_i)) = (y_i - f(x_i))^2$ is the loss function and f is constrained to be a sum of trees. Ignoring the sum-of-trees constraint, this can be viewed as a numerical optimization problem:

$$ \hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}), \quad (2) $$

where the parameters $\mathbf{f} \in \mathbb{R}^N$ are the values of the function $f(x_i)$ at the N data points $x_i$: $\mathbf{f} = \{ f(x_1), f(x_2), \dots, f(x_N) \}$.

Numerical optimization solves equation (2) as a sum of component vectors:

$$ \mathbf{f}_M = \sum_{m=0}^{M} \mathbf{h}_m, \quad \mathbf{h}_m \in \mathbb{R}^N, $$

with initial guess $\mathbf{f}_0 = \mathbf{h}_0$, where each $\mathbf{f}_m$ is based on $\mathbf{f}_{m-1}$, the sum of the previously conducted updates:

$$ \mathbf{f}_l = \sum_{m=0}^{l} \mathbf{h}_m, \quad l = 0, 1, 2, \dots, M. $$

The increment vector is chosen as $\mathbf{h}_m = -\rho_m \mathbf{g}_m$, where $\rho_m$ is a scalar and $\mathbf{g}_m \in \mathbb{R}^N$ is the gradient of $L(\mathbf{f})$ evaluated at $\mathbf{f} = \mathbf{f}_{m-1}$. The components of the gradient are given by:

$$ g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}, $$

and the step length $\rho_m$ is the solution to:

$$ \rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho \mathbf{g}_m). $$

The solution is reached by iteratively updating $\mathbf{f}_m$:

$$ \mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m $$

(Hastie, Friedman, and Tibshirani 2017, ch. 10.10).
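For the squared-error loss used above, the negative gradient is simply the vector of current residuals, so each boosting round amounts to fitting a small tree to the residuals and adding a damped version of it to the model. A minimal sketch (synthetic data, a fixed shrinkage factor in place of the line search for the step length) is:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = X[:, 0] * np.sin(X[:, 1]) + rng.normal(scale=0.5, size=500)

M, shrinkage = 200, 0.1
f = np.full(len(y), y.mean())            # f_0: constant initial model
trees = []
for m in range(M):
    residuals = y - f                    # negative gradient of the squared-error loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    f += shrinkage * tree.predict(X)     # f_m = f_{m-1} + shrinkage * T(x; Theta_m)

def predict(X_new):
    return y.mean() + shrinkage * sum(tree.predict(X_new) for tree in trees)

print(predict(X[:5]))
print(y[:5])
```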

2.4 Time series modeling

A time series is a set of observations $\{X_t\}$, each one recorded at a specific time t. Because the data used for this thesis form a discrete-time time series, the theory focuses primarily on time series of that kind. A time series is a discrete-time time series if the observations are recorded at a fixed time interval (Brockwell and Davis 2016, p. 1).

The time series model chosen for this thesis project is SARIMA. To be able to explain SARIMA (2.4.8), ARIMA (2.4.7), which SARIMA is built upon, also needs to be explained. Further, to be able to explain ARIMA, its components AR (2.4.4) and MA (2.4.6) and the process of differencing (2.4.3) also need to be explained. A few other key concepts will also be introduced.


2.4.1 Weak stationarity

The time series $\{X_t, t \in \mathbb{Z}\}$ is weakly stationary if the expected value $\mu_X(t)$ is independent of time t and if the covariance function $\gamma_X(t+h, t)$ is independent of t, depending only on the lag h (Brockwell and Davis 2016, p. 13).

2.4.2 KPSS-test

A time series can be expressed as the sum of a deterministic trend, a random walk and a stationary error. A trend can be described as an increasing or decreasing pattern in the data over time, which according to the definition of stationarity in Section 2.4.1 makes a time series non-stationary. The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test is a hypothesis test used for determining whether a time series is stationary around a deterministic trend; the null hypothesis is that the random walk component has zero variance (Kwiatkowski et al. 1992).

2.4.3 Differencing

The method of differencing means that the time series is transformed into a new set where the change between consecutive observations is depicted instead of their original observed values:

$$ X'_t = X_t - X_{t-1}, \qquad X''_t = X'_t - X'_{t-1}, \qquad \dots, \qquad X^{(d)}_t = X^{(d-1)}_t - X^{(d-1)}_{t-1}, $$

where $X_t$ is the observation at time t and $X^{(d)}_t$ is the difference of order d. Differencing is a way to make a non-stationary time series stationary (Hyndman and Athanasopoulos 2018, ch. 8.1).

2.4.4 AR

A process $\{X_t, t \ge 0\}$ is called an autoregressive (AR) process of order p if:

$$ X_t = c + \phi_1 X_{t-1} + \dots + \phi_p X_{t-p} + \varepsilon_t, $$

where c is a constant, $\varepsilon_t$ is a normally distributed error term at time t (also called "white noise") and $\phi_1, \dots, \phi_p$ are the model parameters (Hyndman and Athanasopoulos 2018, ch. 8.3).

2.4.5 Augmented Dickey-Fuller test

The Augmented Dickey-Fuller test, henceforth the ADF test, tests whether there exists a unit root in an autoregressive process. With $X_t$ as the data, this becomes a linear regression problem and the test is performed to see whether the term $\gamma$ is different from 0:

$$ X'_t = c + \beta t + \gamma X_{t-1} + X'_{t-1} + X'_{t-2} + \dots, $$

where $X'_t = X_t - X_{t-1}$. The ADF test allows for higher order autoregressive processes by including $X'_{t-p}$. The null hypothesis of the ADF test is that the data are non-stationary (Holmes, Scheuerell, and Ward 2021, ch. 5.3).
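The thesis ran these checks with the R packages tseries and forecast; purely as an illustration, the same tests and the differencing step could be carried out in Python with statsmodels and pandas (an assumption, shown on a synthetic series):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
sales = pd.Series(np.cumsum(rng.normal(size=300)))       # synthetic non-stationary series

adf_p = adfuller(sales)[1]                               # ADF: H0 = unit root (non-stationary)
kpss_p = kpss(sales, regression="c")[1]                  # KPSS: H0 = stationary
print(f"ADF p-value: {adf_p:.3f}, KPSS p-value: {kpss_p:.3f}")

# First-order differencing X'_t = X_t - X_{t-1}; repeat if the tests still
# indicate non-stationarity.
diffed = sales.diff().dropna()
print(f"ADF p-value after differencing: {adfuller(diffed)[1]:.3f}")
```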


2.4.6 MA

A moving average (MA) model of order q is defined as:

$$ X_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}, $$

where c is a constant, $\varepsilon_t$ is a normally distributed error term at time t, $\theta_1, \dots, \theta_q$ are the model parameters and $\theta_q \ne 0$ (Hyndman and Athanasopoulos 2018, ch. 8.4).

2.4.7 ARIMA

A combination of differencing (2.4.3), AR (2.4.4) and MA (2.4.6) is called an AutoRegressive Integrated Moving Average, or ARIMA(p, d, q), model:

$$ X^{(d)}_t = c + \phi_1 X^{(d)}_{t-1} + \dots + \phi_p X^{(d)}_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t, $$

where p is the order of the autoregressive part, d is the order of differencing, and q is the order of the moving average (Hyndman and Athanasopoulos 2018, ch. 8.5).

2.4.8 SARIMA

The Seasonal ARIMA(p, d, q)(P, D, Q)[m] model (SARIMA) is constructed by adding seasonal parameters to the already defined ARIMA (2.4.7) model. The seasonal terms of the SARIMA model are similar to the non-seasonal ARIMA parameters but involve backshifts/lags of the seasonal period (Hyndman and Athanasopoulos 2018, ch. 8.9). The backshift (B) notation can be written as $BX_t = X_{t-1}$, where B operates on $X_t$ by shifting the data back one period (Hyndman and Athanasopoulos 2018, ch. 8.2). The parameters can be described as:

• P - The seasonal order of the autoregressive part p.

• D - The seasonal order of the difference part d.

• Q - The seasonal order of the moving average part q.

• m - The number of observations contained in each period.

2.4.9 The Hyndman & Khandakar algorithm

The Hyndman & Khandakar algorithm is used for tuning the parameters of ARIMA(p, d, q) and Seasonal ARIMA(p, d, q)(P, D, Q)[m] models (Hyndman and Khandakar 2008). The value of the frequency parameter m is chosen by exploring the plots of the time series data and using the most intuitive cycle. The algorithm is as follows:

Step 1: The number of differences 0 ≤ d ≤ 2, or seasonal differences 0 ≤ D ≤ 2, is determined using repeated KPSS tests.

Step 2: Four possible starting models:

• ARIMA(2, d, 2)(1, D, 1)

• ARIMA(0, d, 0)(0, D, 0)


• ARIMA(1, d, 0)(1, D, 0)

• ARIMA(0, d, 1)(0, D, 1)

The models are fitted with c ≠ 0 if d + D ≤ 1 or with c = 0 if d + D > 1. The model with the lowest AIC is then chosen as the official starting model, the "current" model.

Step 3: Consider a number of variations of the current model:

• Vary one of p, q, P and Q by ±1 from the current model.

• Vary p and q simultaneously by ±1 from the current model.

• Vary P and Q simultaneously by ±1 from the current model.

• If the current model has the constant c ≠ 0, exclude c. If the constant c in the current model equals zero (c = 0), include c.

The ”current” model is updated when a model with a lower AIC (3.1.4) is found. The third step is repeated until no better model is found in terms of AIC (Hyndman and Khandakar 2008, p. 10-11).
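The thesis applied this algorithm through R's forecast package (auto.arima). As a rough Python counterpart (an assumption, not the implementation used in the project), pmdarima.auto_arima performs the same kind of stepwise AIC search:

```python
import numpy as np
import pmdarima as pm

rng = np.random.default_rng(0)
t = np.arange(500)
# Synthetic series with period m = 24, purely for illustration; the thesis data
# have weekly seasonality over 168 hourly observations.
y = 10 + 5 * np.sin(2 * np.pi * t / 24) + rng.normal(scale=1.0, size=500)

model = pm.auto_arima(
    y,
    seasonal=True, m=24,            # seasonal period chosen from plots of the series
    d=None, D=None,                 # let the unit-root tests choose d and D (step 1)
    stepwise=True,                  # stepwise search over neighbouring models (steps 2-3)
    information_criterion="aic",
)
print(model.order, model.seasonal_order)
```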

2.5 Support Vector Machines

2.5.1 The Maximal Margin Classifier

A hyperplane in a p-dimensional space is a subspace of dimension p − 1. The definition of a hyperplane can be written as:

$$ f(x) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0, \quad (3) $$

for parameters $\beta_1, \beta_2, \dots, \beta_p$. If a point $x = (X_1, X_2, \dots, X_p)^T$ in the p-dimensional space satisfies equation (3), then x is located on the hyperplane. Suppose $\beta_1, \beta_2, \dots, \beta_p$ is the set of parameters of a hyperplane in p-dimensional space that separates the space into two subspaces. Then a set of n observations

$$ X = \begin{pmatrix} x_{11} & \dots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{p1} & \dots & x_{pn} \end{pmatrix} $$

can be assigned to either of the two subspaces depending on whether $f(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$ is positive or negative (James et al. 2013, p. 338-339).

The maximal margin hyperplane can then be defined as the solution to the following optimization problem:

$$ \max_{\beta_0, \beta_1, \dots, \beta_p, M} M $$
$$ \text{subject to: } \sum_{j=1}^{p} \beta_j^2 = 1, $$
$$ y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \ge M \quad \forall\, i = 1, \dots, n, \quad (4) $$

where M is the margin of the hyperplane, $y_i \in \{-1, 1\}$ are the associated class labels and the constraints ensure that all observations are located on the correct side of, and at least a distance M from, the hyperplane (James et al. 2013, p. 342-343).


2.5.2 The Support Vector Classifier

The support vector classifier works similarly to the maximal margin classifier, but with a few differences. Sometimes it is not possible to linearly separate two classes, or there exist incentives to allow some misclassifications in order to achieve a more robust model and avoid overfitting. By extending and modifying equation (4), a new optimization problem is obtained, whose solution is the support vector classifier:

$$ \max_{\beta_0, \beta_1, \dots, \beta_p, \upsilon_1, \dots, \upsilon_n, M} M $$
$$ \text{subject to: } \sum_{j=1}^{p} \beta_j^2 = 1, $$
$$ y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \ge M(1 - \upsilon_i), $$
$$ \sum_{i=1}^{n} \upsilon_i \le C, \quad \upsilon_i \ge 0, \quad C \ge 0, \quad (5) $$

where $\upsilon_1, \dots, \upsilon_n$ are slack variables allowing individual observations to be on the wrong side of the hyperplane and C is a tuning parameter controlling to which extent this is allowed overall (James et al. 2013, p. 345-346).

The computation of the support vector classifier involves only the inner products of the observations. The inner product between two p-vectors a and b can be written as $\langle a, b \rangle = \sum_{i=1}^{p} a_i b_i$, which implies that the inner product of two observations $x_i, x_{i'}$ is given by:

$$ \langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j}. \quad (6) $$

Thus the linear support vector classifier can be defined as:

$$ f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle, $$

where x is a new observation and $x_i$, $i = 1, 2, \dots, n$, are the training observations. The $\alpha_i$ are the estimated parameters for all pairs of observations in (6). Note that $\alpha_i$ is nonzero only for the support vectors of the solution (James et al. 2013, p. 351).

The support vector classifier can be extended to accommodate a nonlinear relationship by enlarging the feature space. This is done by using polynomial functions of the predictors. Instead of only using the linear features in equation (3), features of higher polynomial order (for example order 2) can be added, giving the features $X_1, X_1^2, X_2, X_2^2, \dots, X_p, X_p^2$ instead. Inserting these into the linear optimization problem (5) yields a new optimization problem:

$$ \max_{\beta_0, \beta_{11}, \beta_{12}, \dots, \beta_{p1}, \beta_{p2}, \upsilon_1, \dots, \upsilon_n, M} M $$
$$ \text{subject to: } y_i\left(\beta_0 + \sum_{j=1}^{p} \beta_{j1} x_{ij} + \sum_{j=1}^{p} \beta_{j2} x_{ij}^2\right) \ge M(1 - \upsilon_i), $$
$$ \sum_{j=1}^{p} \sum_{k=1}^{2} \beta_{jk}^2 = 1, \quad \sum_{i=1}^{n} \upsilon_i \le C, \quad \upsilon_i \ge 0, \quad C \ge 0, \quad (7) $$

where the solutions generally are nonlinear (James et al. 2013, p. 350).


2.5.3 Support Vector Machines

Enlarging the feature space as in equation (7) cannot be done indefinitely because of computational limitations. Therefore, Support Vector Machines (SVM) are introduced, using different kernel functions K to quantify the similarity of two observations instead of only using the inner product as it is. The standard notation of the kernel function is $K(x_i, x_{i'})$. A linear kernel can be defined as:

$$ K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}, $$

and is in practice the same as the inner product calculation of the support vector classifier.

To accommodate a nonlinear relationship using SVM, a polynomial or a radial kernel can be used. The polynomial kernel $K_p$ is defined as:

$$ K_p(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^{d}, $$

where d is a positive integer that denotes the degree of the polynomial kernel (James et al. 2013, p. 350-352). The radial kernel $K_r$ is defined as:

$$ K_r(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right), $$

where $\gamma$ is a positive constant (James et al. 2013, p. 352-353).

2.5.4 Support Vector Regression

The idea behind support vector regression (SVR) is to take the theory that SVM is built upon and apply it to regression problems, that is, fitting a hyperplane with the help of support vectors. Instead of maximizing the margin as in SVM, SVR seeks to minimize the Euclidean norm $\lVert \omega \rVert_2$, making the hyperplane as flat as possible, while allowing at most $\upsilon$ deviation from the hyperplane (Basak, Pal, and Patranabis 2007, p. 204-205). The SVR is a solution to the following optimization problem:

$$ \text{Minimize: } \lVert \omega \rVert_2^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*}) $$
$$ \text{subject to: } y_i - \langle \omega, x_i \rangle - b \le \upsilon + \xi_i, $$
$$ \langle \omega, x_i \rangle + b - y_i \le \upsilon + \xi_i^{*}, \quad \xi_i, \xi_i^{*} \ge 0, \quad C > 0, \quad (8) $$

where $\xi_i$ and $\xi_i^{*}$ are slack variables introduced to handle infeasible constraints in the optimization problem (8), $\upsilon$ is the maximum allowed deviation of the approximations for all pairs $(x_i, y_i)$ in the training set, and C is the parameter that decides the trade-off between the tolerance of deviations larger than $\upsilon$ and the flatness of the hyperplane (Basak, Pal, and Patranabis 2007, p. 205-208). A standard linear SVR can be defined as:

$$ f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) \langle x_i, x \rangle + b. $$


The definition of SVR with a nonlinear kernel function can thus be written as:

$$ f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) K(x_i, x) + b. $$
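As a small illustration of SVR with a radial kernel (using scikit-learn here rather than the e1071 package applied later in this thesis; synthetic data):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# epsilon plays the role of the maximum tolerated deviation (upsilon above),
# C is the flatness/violation trade-off and gamma is the radial-kernel constant.
svr = SVR(kernel="rbf", C=100.0, epsilon=0.1, gamma=0.5)
svr.fit(X, y)
print(svr.predict(X[:5]))
print(y[:5])
```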


3 Method

In this section, the implementation of the methods used in this project is presented. This section also contains descriptions of the error metrics, the data and the data preprocessing, the modeling procedures, and the regression performance assessments. It focuses mainly on one method of testing where the train/test split of the data is static, which will be referred to as Method A. An additional testing method based on a dynamic train/test split is referred to as Method B and is described thoroughly in Section 3.8.

3.1 Error metrics

The error metrics chosen to evaluate the models are presented here.

• n is the number of observations in the set and i = 1, 2, ..., n is the index of those observations.

• $y_i$ is the observed value.

• $\hat{y}_i$ is the predicted value.

3.1.1 Mean Absolute Error

The mean absolute error (MAE) (Wei 2006, p. 182) is defined as:

$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert. $$

3.1.2 Root Mean Squared Error

The root mean squared error (RMSE) (James et al. 2013, p. 256) is defined as:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. $$

3.1.3 Sum of Squared Errors

The sum of squared errors (SSE) is defined as:

$$ \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. $$

This measure is also called the residual sum of squares (RSS) (James et al. 2013, p. 62).

3.1.4 AIC

The Akaike Information Criterion (AIC) is a function used for model selection. It is defined as:

$$ \mathrm{AIC} = n \log\left(\frac{\mathrm{SSE}}{n}\right) + 2(k + 2), $$

where n is the number of observations used in the model and k + 2 is the number of model parameters (Hyndman and Athanasopoulos 2018, ch. 5.5).
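For reference, the four error metrics translate directly into code; a minimal NumPy sketch (illustrative values only) is:

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def sse(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def aic(y, y_hat, k):
    # AIC = n * log(SSE / n) + 2(k + 2), where k + 2 is the number of model parameters
    n = len(y)
    return n * np.log(sse(y, y_hat) / n) + 2 * (k + 2)

y_obs = np.array([5200.0, 6100.0, 4800.0, 7300.0])
y_pred = np.array([5000.0, 6400.0, 5100.0, 7000.0])
print(mae(y_obs, y_pred), rmse(y_obs, y_pred), sse(y_obs, y_pred), aic(y_obs, y_pred, k=3))
```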


3.2 Data description

This section describes the data in its raw format, the collection of the data, and the processing of the data to obtain the final data frame. The sales data are based on the information sent from the restaurants' checkout systems to INSIKT. The raw data consist of the restaurant name, restaurant ID, business hour, the corresponding sales, and the INSIKT prognosis for the given date and hour. The data are divided into a training set, a test set and a stacking set, with a 70%, 20% and 10% split. The full test set corresponds to approximately 30 weeks of hourly data (out of 148 weeks). A visual example of the raw data can be seen in Table 1 and a comparison of sales between restaurants can be found in Figure 2.

Some modeling methods were performed in R and some in Python. This will be disclosed in each section, respectively.

Table 1: Anonymous representation of the raw data.

Restaurant ID Sales Prognosis Date Hour

1 1568 2697 2017-06-02 21

1 1674 2185 2017-06-02 22

1 5331 2339 2017-06-02 23

... ... ... ... ...

28 12987 7351 2018-03-25 17

28 6744 7209 2018-03-25 18

28 2290 3875 2018-03-25 19


Figure 2: The mean sales for each hour across all weeks (there are 168 hours in one week) contained in the full dataset. The graph starts at Monday 00:00 and ends on Sunday 23:00.

3.3 Data preprocessing & feature engineering

The data in its raw format provided little information suitable for analysis. The raw data only contained timesteps for open business hours. Another issue was that the opening hours were unevenly distributed over the week, with longer business days on weekends and shorter business days on weekdays. The SARIMA model requires continuous data to work properly and would have problems with the raw-formatted data set. To get past this problem, the timeline had to be extended to cover 24 hours for each business day, keeping the actual sales for the open business hours and setting sales to zero for hours when the restaurant is not open. To retain information about the business day, six separate binary class features, IsMonday to IsSaturday, were created, where for example IsMonday returns 1 if the corresponding day is a Monday and 0 otherwise. To capture seasonal effects such as month, eleven binary class features describing whether an observation belongs to a specific month were also added to the data set. A visual presentation of what the processed data can look like is found in Table 2.


Table 2: Anonymous representation of the processed data.

Restaurant ID Sales Prognosis Hour isMonday . . . isSaturday isJanuary . . . isNovember

1 1568 2697 21 0 . . . 0 0 . . . 0

1 1674 2185 22 0 . . . 0 0 . . . 0

1 5331 2339 23 0 . . . 0 0 . . . 0

... ... ... ... ... ... ... ... ... ...

28 12987 7351 17 0 . . . 0 0 . . . 0

28 6744 7209 18 0 . . . 0 0 . . . 0

28 2290 3875 19 0 . . . 0 0 . . . 0

Worth noting is that Prognosis (the benchmark prognosis from INSIKT) was only used to compare the model predictions with the current forecasting algorithm used by INSIKT. In other words, the whole column was excluded from the training and testing procedures.
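A rough pandas sketch of the preprocessing described in Section 3.3 (toy data; the construction is illustrative rather than the exact project code) could look like this:

```python
import pandas as pd

# Toy raw data: only open business hours are present in the source system.
raw = pd.DataFrame({
    "Date": ["2017-06-02", "2017-06-02", "2017-06-03"],
    "Hour": [21, 22, 11],
    "Sales": [1568, 1674, 980],
})
raw["timestamp"] = pd.to_datetime(raw["Date"]) + pd.to_timedelta(raw["Hour"], unit="h")

# Extend to a continuous 24-hour timeline, filling closed hours with zero sales.
full_index = pd.date_range(raw["timestamp"].min().normalize(),
                           raw["timestamp"].max().normalize() + pd.Timedelta(hours=23),
                           freq=pd.Timedelta(hours=1))
df = raw.set_index("timestamp")["Sales"].reindex(full_index, fill_value=0).to_frame()

# Six binary day-of-week features (Sunday as reference) and eleven month features
# (December as reference), mirroring IsMonday..IsSaturday and IsJanuary..IsNovember.
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
for i, name in enumerate(days):
    df[f"Is{name}"] = (df.index.dayofweek == i).astype(int)
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November"]
for i, name in enumerate(months, start=1):
    df[f"Is{name}"] = (df.index.month == i).astype(int)
df["Hour"] = df.index.hour
print(df.head())
```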

Figure 3: Visualization of the train-test-stack split of the data for all restaurants.

The data are split into three categories: training data, test data, and stacking data, as shown in Figure 3. Each model is first trained on the training data. The stacking procedure is then performed in two steps. The first step is to generate predictions given the features of the stacking data. The predictions given by each model then form a new data set consisting of the predictions from the base learners (the models trained on the training data set, as illustrated in Table 3) and the sales from the stacking data as the response variable. The test data for the stacking models are generated in the same way, where the original test data are used to generate predictions from the base learners. A further description of the stacking procedure can be found in Section 3.7.

Due to the large number of restaurants and the amount of data processing needed, this project has focused on three restaurants (A, B and C) operating in different environments. These environments have been chosen to best represent the environments in which this chain operates.

A - City center in a middle sized city. Near shopping malls, offices and night clubs.

B - Main Swedish highway.

C - Railway station and traveling hub.


3.4 Random Forest

Three random forest models were constructed, one for each restaurant, using the scikit-learn (sklearn) package in Python. Each model was tuned using the "random search" method: instead of grid searching the whole spectrum of possible hyperparameter combinations, 600 models with random combinations of parameters were tested. The parameters were sampled from the ranges defined below.

• max depth - The maximum depth of each decision tree. Tested on values between 10 and 110 with increments of 10.

• max features - To save computation time, this parameter was always set to ”auto”. Auto means that max features will always be set to the number of features used in the training and no feature subsets are tested.

• max leaf nodes - The maximum number of leaf nodes allowed within each decision tree. Set to 500 according to earlier studies by Granberg (Granberg 2021) as it didn’t affect accuracy significantly.

• min sample leaf - The minimum number of samples/observations allowed in a leaf/end node. Tested on values between 2 and 10 with increments of 2.

• min samples split - The minimum number of samples/observations in any node when considering a split. Tested on values between 3 and 19 with increments of 2.

• n estimators - The number of decision trees used in each model. Tested on values between 2 and 2000 with increments of 2.

The resulting hyper parameter settings for each restaurant can be found in Table A1.
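A sketch of such a random search with scikit-learn's RandomizedSearchCV (synthetic data; the parameter ranges mirror the list above, and max_features=1.0 stands in for the old "auto" setting, which for regression meant all features) could be:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 18))
y = X[:, 0] * np.sin(X[:, 1]) + rng.normal(scale=0.5, size=400)

param_distributions = {
    "max_depth": list(range(10, 120, 10)),
    "max_features": [1.0],                     # all features, as in the thesis' "auto" setting
    "max_leaf_nodes": [500],
    "min_samples_leaf": list(range(2, 12, 2)),
    "min_samples_split": list(range(3, 20, 2)),
    "n_estimators": list(range(2, 2001, 2)),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=20,                                 # the thesis sampled 600 random combinations
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```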


3.5 XGBoost

For XGBoost, the R package XGBoost was used. An extensive grid search was performed for each restaurant to find optimal hyperparameter settings. The tuning was done trying to minimize RMSE (Section 3.1.2). The following parameters were tuned:

• nrounds - The number of decision trees in the model. First tested on 10, followed by 100, 200, ..., 1000. Then tested on ±50 of the best result in increments of 10, and finally on ±10 of the best result in increments of 1.

• eta - Also called learning rate or step size. Having a smaller eta will generally give a more precise result, but at the cost of computation power. Tested on values between 0 and 1 with increments of 0.1.

• gamma - Decides how much the algorithm should penalize more splits in the decision trees. A high gamma will increase the required loss reduction when considering a split. Tested from 0 to 10 with increments of 1.

• max depth - The maximum depth of each decision tree. Tested on the interval [3, 24] with increments of 1.

• min child weight - Sets a lower limit on how many observations a leaf is allowed to contain while considering a split. Tested on the interval [1, 10] with increments of 1.

• subsample - Determines how large fraction of the training observations to be sampled while training each tree. Tested on values between [0.5, 1] with increments of 0.1.

• colsample bytree - Determines how large fraction of the columns in the training set to be sampled while training each tree. Tested on values between [0.5, 1] with increments of 0.1.

The resulting hyper-parameter settings for each restaurant can be found in Table A2.
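The tuning itself was done with the R xgboost package; purely as an illustration of what the listed parameters control, one candidate configuration in the Python xgboost API (an assumption, not the project code) looks like this:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 18))
y = X[:, 0] * np.sin(X[:, 1]) + rng.normal(scale=0.5, size=400)

# One candidate combination; the thesis grid-searched each parameter over the
# ranges above and kept the combination with the lowest RMSE.
model = XGBRegressor(
    n_estimators=300,          # nrounds
    learning_rate=0.1,         # eta
    gamma=1.0,
    max_depth=6,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="reg:squarederror",
)
model.fit(X, y)
print(model.predict(X[:5]))
```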

3.6 SARIMA

R was once again chosen for constructing and tuning our SARIMA models, more specifically the packages tseries and forecast. For time series modeling with SARIMA, the data need to fulfill certain properties and require exploratory analysis. An important part of the modeling is that the data are stationary according to the ADF test (2.4.5). Only the Sales variable was included while modeling using SARIMA.

After the data had been processed as described in Section 3.3, the Hyndman & Khandakar algorithm (2.4.9) was used for each restaurant to achieve an optimal combination of parameters. After testing the resulting model on the test data, it was concluded that the differencing parameter d (Section 2.4.3) needed manual adjustment for one restaurant and was lowered from 1 to 0, since sales did not increase over time. This resulted in better long-term forecasting and did not significantly affect short-term forecasting.


3.7 Stacking

The sales predictions from the base learners (Random Forest, XGBoost, and SARIMA) were used to form a new training data set for the two stacking models. The new data set contained three columns with the base model predictions, along with a fourth column containing the actual sales. A linear regression model and a support vector regression model were trained on this new data set and then used to predict the sales of the last test set. A visual example of the data can be seen in Table 3.

Table 3: Anonymous representation of the resulting data matrix used to train the stacking models.

Sales SARIMA Random Forest XGBoost

6503 6605 6455 6414

6808 5400 6481 6339

9143 6608 6356 6370

8814 8697 7042 8071

7206 7627 7341 7606

10176 5550 5538 5791

... ... ... ...

The motivation to use linear models as stacking models is that there appears to be a linear relationship between sales and the predicted sales values for the base models XGBoost, SARIMA and random forest as seen in Figure 4.

Figure 4: The relationship between the predicted values (x-axis) and the true sales value (y-axis). The graph represents the true sales from test data and the predictions based on the features of the test data.

3.7.1 Linear Regression

A linear regression model was used for each restaurant to act as a benchmark model for our main stacking model. The built-in function lm in R was used for this task.


3.7.2 Support Vector Regression

The R package e1071 was used to construct and tune our support vector regression (SVR) models. The SVR is used as the main stacking method in this project. As mentioned in Section 3.7, the predictions given from the base learners work as training data for the SVR model. An extensive grid search on the hyper parameters was performed to generate the hyperparameter settings for the final SVR models.

• ε - Regulates the width of the margin. Investigated values were 0.0001 to 1 with an increment of a factor of 10.

• γ - For radial basis kernel, this hyper-parameter describes the measure of influence of any individual support vector. A large gamma implies high bias and low variance. Gamma was tested on the interval from 0.0001 to 1 with an increment of a factor of 10.

• Cost - Describes the cost of a violation of the margin. Investigated values were in the interval [100, 2000] with an increment of 50.

The final hyperparameter settings can be found in Table A5.
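Putting Sections 3.7 to 3.7.2 together, a minimal sketch of the stacking step (scikit-learn instead of lm and e1071, with noisy placeholder columns standing in for the SARIMA, Random Forest and XGBoost predictions of Table 3) might look as follows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Placeholder base-learner predictions on the stacking set; in the project these
# three columns are the SARIMA, Random Forest and XGBoost forecasts (cf. Table 3).
sales = rng.uniform(4000, 10000, size=600)
meta_X = np.column_stack([sales + rng.normal(scale=s, size=600) for s in (900, 700, 650)])

# Benchmark stacking model: ordinary least squares on the base predictions.
lm = LinearRegression().fit(meta_X, sales)

# Main stacking model: SVR tuned by grid search over the ranges listed above
# (the C grid is thinned out here; the thesis stepped from 100 to 2000 by 50).
param_grid = {
    "epsilon": [1e-4, 1e-3, 1e-2, 1e-1, 1],
    "gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1],
    "C": [100, 500, 1000, 1500, 2000],
}
svr = GridSearchCV(SVR(kernel="rbf"), param_grid,
                   scoring="neg_root_mean_squared_error", cv=3)
svr.fit(meta_X, sales)

# At prediction time, base-learner forecasts for the test period are fed to the meta-models.
new_meta = meta_X[:5]
print(lm.predict(new_meta))
print(svr.predict(new_meta))
```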

3.8 Additional testing

To widen the perspective on forecasting sales, an additional method was tested to investigate whether there could be a significant gain in accuracy compared to the original method described above (Method A). This new method (Method B) extends the training data and then predicts sales for a randomized period, thus varying the size of the data set and hopefully capturing recent trends in sales. If predictions were to be made for period t = n, separate XGBoost, Random Forest and stacking models would be trained on the data from period t = 0 to period t = n − 2, where each period represents a time step of 168 observations (1 week).

Due to the extensive time needed to train new SARIMA models, these were excluded from this test. The results can be found in Tables 12 through 15.
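Schematically, Method B can be expressed as an expanding-window retraining loop; the sketch below uses hypothetical fit/predict wrappers and a naive placeholder model, since the actual base and stacking models are the ones described above.

```python
import numpy as np

PERIOD = 168  # one period = one week of hourly observations

def method_b_forecast(sales, n, fit_fn, predict_fn):
    """Train on periods 0 .. n-2 and forecast period n (Method B sketch).

    fit_fn and predict_fn are hypothetical wrappers around any of the base or
    stacking models; SARIMA was excluded from this test due to training time.
    """
    train = sales[: (n - 1) * PERIOD]              # periods 0 .. n-2, inclusive
    model = fit_fn(train)
    return predict_fn(model, PERIOD)               # forecast the next 168 hours

# Tiny usage example with a naive "repeat last week" placeholder model.
rng = np.random.default_rng(0)
sales = rng.uniform(0, 9000, size=30 * PERIOD)
fit_naive = lambda train: train[-PERIOD:]
predict_naive = lambda model, horizon: model[:horizon]
print(method_b_forecast(sales, n=10, fit_fn=fit_naive, predict_fn=predict_naive)[:5])
```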


4 Results

In this section, all results discussed in Section 5 will be presented. All MAE and RMSE values are calculated based on sales in SEK. The remaining results can be found in the Appendix.

4.1 Results on aggregated time horizons

This subsection contains the results for aggregated forecast horizons of 1 week, 6 weeks and the full test set of 30 weeks, for all models and restaurants.

Table 4: Error measures MAE and RMSE for all tested methods and restaurants given aggregated forecast horizons of 1 week, 6 weeks and the full test set of 30 weeks. The colored cells contain the best results for each forecast horizon and error metric. The yellow cells correspond to restaurant A, the green cells to restaurant B and the blue cells to restaurant C.

Method          Horizon         A: RMSE   A: MAE   B: RMSE   B: MAE   C: RMSE   C: MAE
INSIKT          1 week          1316.10   793.48   1008.06   609.38   1139.51   705.74
INSIKT          6 weeks         1764.45   851.63   1202.02   720.84   1282.38   807.44
INSIKT          Full test set   1630.49   901.19   1277.02   764.49   1477.42   920.52
SARIMA          1 week          1565.04   853.91    972.15   586.30   1213.97   783.05
SARIMA          6 weeks         1473.16   738.05   1203.49   705.32   1200.56   753.11
SARIMA          Full test set   2074.37  1018.88   1455.36   844.83   1929.03  1167.51
Random Forest   1 week          1847.46  1035.52   1009.44   599.66   1079.58   677.42
Random Forest   6 weeks         1540.71   827.32   1178.18   689.18   1150.45   723.68
Random Forest   Full test set   1865.38   932.18   1382.00   799.04   1549.71   924.46
XGBoost         1 week          1791.18   949.10   1009.85   601.07   1066.32   639.94
XGBoost         6 weeks         1533.97   810.75   1178.94   691.93   1172.78   734.69
XGBoost         Full test set   1665.16   862.07   1382.26   803.77   1553.70   922.30
Linear Stacking 1 week          1215.88   697.19    977.14   611.04   1069.55   715.34
Linear Stacking 6 weeks         1242.46   681.58   1106.25   679.23   1116.12   742.16
Linear Stacking Full test set   1561.02   971.10   1386.02   823.16   1555.27   960.25
SVR Stacking    1 week          1203.84   632.44    954.75   566.97   1072.02   682.54
SVR Stacking    6 weeks         1241.53   614.52   1106.33   652.65   1127.93   712.66
SVR Stacking    Full test set   1826.97  1141.90   1420.60   821.88   1555.90   923.87

Table 4 includes results from all models and all restaurants based on MAE and RMSE. The results are aggregated over the whole period from observation 1 to n, where n is the specified forecast horizon. All results in the table are calculated on the same period in time and include non-business hours.

The mean sales for these specific restaurants over all business hours of the datasets were between 5000 and 6000 SEK, which places the best MAE results close to a percentage error of 10%.


Table 5: Improvement in MAE on an aggregated 6-week sales forecast compared to INSIKT.

Restaurant A B C

INSIKT 0% 0% 0%

SARIMA 13.28% 2.08% 6.69%

RF 2.82% 4.31% 10.41%

XGBoost 4.82% 4.03% 9.05%

LM 19.98% 5.69% 8.05%

SVR 27.85% 9.58% 11.77%

Table 5 presents the percentage improvement in mean absolute error for the given 6-week aggregated period.

Table 6: Improvement in RMSE on a 6-week aggregated sales forecast compared to INSIKT.

Restaurant A B C

INSIKT 0% 0% 0%

SARIMA 16.50% -0.08% 6.40%

RandomForest 12.70% 2.00% 10.30%

XGBoost 13.10% 2.00% 8.58%

LM 29.59% 7.99% 12.95%

SVR 29.65% 7.99% 12.09%

The improvement in RMSE for each model is shown in Table 6, where the percentage advantage of the given model in RMSE compared to INSIKT is presented. A positive percentage means that the corresponding model has a lower prediction error than the INSIKT model; the same holds for both Tables 5 and 6.


Figure 5: Residuals for each model on a forecast horizon of 6 weeks for restaurant A. Outliers are shown in red. The numbers on each box represent the mean of the residuals for each model.


Figure 6: Residuals for each model on a forecast horizon of 6 weeks for restaurant B. Outliers are shown in red. The numbers on each box represent the mean of the residuals for each model.


Figure 7: Residuals for each model on a forecast horizon of 6 weeks for restaurant C. Outliers are shown in red. The numbers on each box represent the mean of the residuals for each model.

Figures 5, 6 and 7 visualize the residuals, or the prediction errors in SEK (not to be confused with MAE), for each model in the form of boxplots, including INSIKT as the benchmark. These prediction errors are calculated by subtracting the predicted values $\hat{y}_i$ from the corresponding observed values $y_i$. The predictions included are from the first 6 weeks of the test set, but only the predictions for open business hours. All tested models predict below the actual sales on average, and the SARIMA model is the most skewed one. The performance of the SVR model is the most stable of all tested methods. More boxplots describing the results for the whole test set can be found in the appendix.


Figure 8: Residuals for two XGBoost models trained on different versions of the same training set. The first set was the full training set used in the other training procedures described in Section 3.3, and the second set only included the business hours of the restaurant. Outliers are shown in red and the mean of the residuals is shown as a number on each box.

According to the results presented in Figure 8, using only business hours when modeling yields better results, both in terms of consistency and in terms of offset from a mean of zero. There also appear to be fewer large outliers when using business hours only.


Figure 9: XGBoost feature importance for Restaurant A on a model trained on the whole original training set. The feature importance of all features sums up to 1.0. The features are clustered based on importance measurement.

Figure 10: XGBoost feature importance for Restaurant B on a model trained on the whole original training set. The feature importance of all features sums up to 1.0. The features are clustered based on importance measurement.


Figure 11: XGBoost feature importance for Restaurant C on a model trained on the whole original training set. The feature importance of all features sums up to 1.0. The features are clustered based on importance measurement.

Business hour is by far the most influential feature, independent of location, according to the results presented in Figures 9, 10 and 11. Weekends have some influence, followed by the other weekdays, and July is the most influential month.

4.2 Randomized weeks

In this section, all relevant results concerning the testing of random sub-periods of 1 week within the test set are presented.

4.2.1 Compared to the baseline model

The following tables contain the learning model results compared to the baseline model provided by INSIKT. The results are not aggregated and the models are trained on the original training data set (Method A).


Table 7: Error measures for SARIMA and INSIKT on 6 semi-randomized individual weeks. The colored cells highlight the best results for each week and method. The yellow cells correspond to restaurant A, the green cells to restaurant B and the blue cells to restaurant C.

Week: 4 7 11 15 20 28 Mean

Location Method MAE MAE MAE MAE MAE MAE MAE

A INSIKT 790.71 849.54 1002.37 1098.83 978.68 660.67 896.8

SARIMA 834.12 1103.61 786.34 1520.70 1189.45 791.12 1037.56

B INSIKT 809.36 916.53 668.34 751.57 744.12 673.47 760.57

SARIMA 777.16 1165.15 588.00 850.56 874.53 673.10 821.41

C INSIKT 912.77 1110.12 960.60 1161.98 971.82 710.65 971.32

SARIMA 721.43 1330.28 862.51 1555.77 1519.82 797.08 1131.15

Method RMSE RMSE RMSE RMSE RMSE RMSE RMSE

A INSIKT 1751.53 1390.63 1660.85 1966.28 1212.48 1152.02 1522.30

SARIMA 1318.25 1826.73 1427.02 2452.96 2327.01 1347.01 1783.16

B INSIKT 1305.03 1530.37 1054.01 1140.83 1173.54 1230.35 1239.02

SARIMA 1292.22 2024.46 938.58 1324.08 1345.71 1175.44 1350.08

C INSIKT 1398.41 1786.87 1486.09 1734.32 1501.47 1167.80 1512.49

SARIMA 1087.79 2047.85 1385.94 2327.44 2401.94 1334.69 1764.28

Table 8: Error measures for Random Forest and INSIKT on 6 semi-randomized individual weeks. The colored cells highlight the best results for each week and method. The yellow cells correspond to restaurant A, the green cells to restaurant B and the blue cells to restaurant C.

Week: 4 7 11 15 20 28 Mean

Location Method MAE MAE MAE MAE MAE MAE MAE

A INSIKT 790.71 849.54 1002.37 1098.83 978.68 660.67 896.80

RF 765.81 1084.71 689.80 1222.69 665.76 608.74 867.58

B INSIKT 809.36 916.53 668.34 751.57 744.12 673.47 760.57

RF 756.07 1097.16 590.30 765.16 701.64 639.59 735.65

C INSIKT 912.77 1110.12 960.60 1161.98 971.82 710.65 933.38

RF 729.26 1215.88 773.23 1066.55 941.60 722.85 875.26

Method RMSE RMSE RMSE RMSE RMSE RMSE RMSE

A INSIKT 1751.53 1390.63 1660.85 1966.28 1212.48 1152.02 1522.30
RF 1273.27 1764.28 1234.21 2094.88 1125.22 1023.62 1480.42
B INSIKT 1305.03 1530.37 1054.01 1140.83 1173.54 1230.35 1239.02
RF 1266.08 1850.72 945.04 1201.92 1064.53 1161.95 1214.24
C INSIKT 1398.41 1786.87 1486.09 1734.32 1501.47 1167.80 1512.49
RF 1156.83 1867.06 1265.92 1627.91 1522.21 1206.15 1389.38


Table 9: Error measures for XGBoost and INSIKT on 6 semi-randomized individual weeks. The colored cells highlight the best results for each week and method. The yellow cells correspond to restaurant A, the green cells to restaurant B and the blue cells to restaurant C.

Week: 4 7 11 15 20 28 Mean

Location Method MAE MAE MAE MAE MAE MAE MAE

A INSIKT 790.71 849.54 1002.37 1098.83 978.68 660.67 896.80

XGBoost 773.09 1119.39 693.21 1204.67 670.79 608.92 845.01

B INSIKT 809.36 916.53 668.34 751.57 744.12 673.47 760.57

XGBoost 772.89 1084.48 623.74 756.23 719.25 645.11 766.95

C INSIKT 912.77 1110.12 960.60 1161.98 971.82 710.65 971.32

XGBoost 786.47 1242.23 793.14 1055.01 925.70 702.92 917.58

Method RMSE RMSE RMSE RMSE RMSE RMSE RMSE

A INSIKT 1751.53 1390.63 1660.85 1966.28 1212.48 1152.02 1522.30

XGBoost 1302.51 1815.45 1242.62 2089.82 1121.44 1031.42 1433.88

B INSIKT 1305.03 1530.37 1054.01 1140.83 1173.54 1230.35 1239.02

XGBoost 1269.27 1856.05 1004.88 1170.71 1086.20 1150.13 1256.20

C INSIKT 1398.41 1786.87 1486.09 1734.32 1501.47 1167.80 1512.49

XGBoost 1218.62 1892.73 1294.03 1608.93 1521.47 1189.43 1454.20

Table 10: Error measures for linear stacking and INSIKT on 6 semi-randomized individual weeks. The colored cells highlight the best results for each week and method. The yellow cells correspond to restaurant A, the green cells to restaurant B and the blue cells to restaurant C.

Week: 4 7 11 15 20 28 Mean

Location Method MAE MAE MAE MAE MAE MAE MAE

A INSIKT 790.71 849.54 1002.37 1098.83 978.68 660.67 896.80

LM 739.84 984.01 752.81 1129.69 696.12 622.64 803.18

B INSIKT 809.36 916.53 668.34 751.57 744.12 673.47 760.57

LM 739.78 1053.02 645.35 807.29 918.24 697.24 781.71

C INSIKT 912.77 1110.12 960.60 1161.98 971.82 710.65 971.32

LM 709.70 1205.64 810.70 1119.96 1011.11 785.56 908.29

Method RMSE RMSE RMSE RMSE RMSE RMSE RMSE

A INSIKT 1751.53 1390.63 1660.85 1966.28 1212.48 1152.02 1522.30
LM 1224.67 1558.34 1215.80 1872.76 1119.00 960.89 1309.62
B INSIKT 1305.03 1530.37 1054.01 1140.83 1173.54 1230.35 1239.02
LM 1166.27 1833.02 1001.90 1239.62 1388.72 1152.83 1251.35
C INSIKT 1398.41 1786.87 1486.09 1734.32 1501.47 1167.80 1512.49
LM 1076.75 1812.03 1245.02 1695.68 1605.95 1225.57 1390.08


Table 11: Error measures for SVR stacking and INSIKT on 6 semi-randomized individual weeks. The colored cells highlight the best results for each week and method. The yellow cells correspond to restaurant A, the green cells to restaurant B and the blue cells to restaurant C.

Week: 4 7 11 15 20 28 Mean

Location Method MAE MAE MAE MAE MAE MAE MAE

A INSIKT 790.71 849.54 1002.37 1098.83 978.68 660.67 896.80

SVR 692.51 963.06 693.87 1084.67 660.18 573.48 757.17

B INSIKT 809.36 916.53 668.34 751.57 744.12 673.47 760.57

SVR 719.35 1073.90 628.42 819.41 969.15 638.66 773.69

C INSIKT 912.77 1110.12 960.60 1161.98 971.82 710.65 971.32

SVR 682.19 1186.89 791.07 1088.43 961.77 722.74 873.66

Method RMSE RMSE RMSE RMSE RMSE RMSE RMSE

A INSIKT 1751.53 1390.63 1660.85 1966.28 1212.48 1152.02 1522.30
SVR 1226.30 1576.18 1196.41 1869.04 1137.67 965.83 1310.75
B INSIKT 1305.03 1530.37 1054.01 1140.83 1173.54 1230.35 1239.02
SVR 1181.31 1853.14 1010.00 1300.70 1534.64 1120.17 1279.24
C INSIKT 1398.41 1786.87 1486.09 1734.32 1501.47 1167.80 1512.49
SVR 1095.84 1834.27 1258.12 1671.73 1586.45 1205.83 1389.18

As seen in Tables 7 to 11, the performance of the SARIMA models is lacking compared to the tree-based models, while the stacking methods show the best and most stable performance overall.

4.2.2 Comparison with new extended models

This subsection contains the model results from the previous section compared to new models trained on more recent data. The results are obtained by training the models up to period t = n − 2 and generating predictions for period t = n (Method B, as described in Section 3.8).


Table 12: Error measures for Random Forest on 6 semi-randomized individual weeks compared to a new model trained on data up to 2 weeks before the tested week. The colored cells highlight the best results for each week and method. The yellow cells correspond to restaurant A, the green cells to restaurant B and the blue cells to restaurant C.

Week: 4 7 11 15 20 28 Mean

Location Method MAE MAE MAE MAE MAE MAE MAE

A New 744.12 971.86 684.23 1220.65 665.75 603.97 815.10

RF 765.81 1084.71 689.80 1222.69 665.76 608.74 867.58

B New 759.00 1090.82 592.95 759.94 706.31 640.25 758.21

RF 756.07 1097.16 590.30 765.16 701.64 639.59 735.65

C New 735.47 1206.33 769.98 1066.71 942.78 723.39 907.44

RF 729.26 1215.88 773.23 1066.55 941.60 722.85 875.26

Method RMSE RMSE RMSE RMSE RMSE RMSE RMSE

A New 1277.29 1632.88 1233.81 2103.03 1132.31 1026.88 1401.03

RF 1273.27 1764.28 1234.21 2094.88 1125.22 1023.62 1480.42

B New 1261.18 1850.24 955.54 1202.94 1065.65 1157.45 1248.83

RF 1266.08 1850.72 945.04 1201.92 1064.53 1161.95 1214.24

C New 1165.25 1862.61 1267.55 1630.58 1532.67 1208.11 1444.46

RF 1156.83 1867.06 1265.92 1627.91 1522.21 1206.15 1389.38

Table 13: Error measures for XGBoost on 6 semi-randomized individual weeks compared to a new model trained on data up to 2 weeks before the tested week. The colored cells highlight the best results for each week and method. The yellow cells correspond to restaurant A, the green cells to restaurant B and the blue cells to restaurant C.

Week: 4 7 11 15 20 28 Mean

Location Method MAE MAE MAE MAE MAE MAE MAE

A New 737.37 1117.91 693.52 1206.49 670.28 610.58 839.36

XGBoost 773.09 1119.39 693.21 1204.67 670.79 608.92 845.01

B New 754.37 1083.29 625.42 767.46 716.75 644.59 765.31

XGBoost 772.89 1084.48 623.74 756.23 719.25 645.11 766.95

C New 778.65 1242.24 792.93 1049.06 929.19 712.61 917.45

XGBoost 786.47 1242.23 793.14 1055.01 925.70 702.92 917.58

Method RMSE RMSE RMSE RMSE RMSE RMSE RMSE

A New 1280.43 1816.16 1241.40 2090.77 1120.00 1035.21 1430.66

XGBoost 1302.51 1815.45 1242.62 2089.82 1121.44 1031.42 1433.88

B New 1258.67 1847.23 989.46 1192.25 1083.80 1152.93 1254.05

XGBoost 1269.27 1856.05 1004.88 1170.71 1086.20 1150.13 1256.20

C New 1208.17 1892.71 1292.06 1602.72 1517.69 1202.81 1452.69

XGBoost 1218.62 1892.73 1294.03 1608.93 1521.47 1189.43 1454.20
