
DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS
AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Phase-Out Demand Forecasting
Predictive modeling on forecasting product life cycle

SHADMAN AHMED

Master in Machine Learning
Date: October 23, 2020
Supervisor: Amirhossein Akhavanrahnama
Examiner: Danica Kragic
School of Electrical Engineering and Computer Science
Host company: Ericsson
Swedish title: Utfas-Prediktering: Prognostisering av produktlivscykler

KTH ROYAL INSTITUTE OF TECHNOLOGY

Abstract

The phase-out stage in a product life cycle can face unpredictable demand. An accurate forecast of the phase-out demand can help supply chain managers control the amount of obsolete inventory, which in turn saves resources and lowers scrap costs. In this thesis, we investigated whether data-driven forecasting models could improve the accuracy of forecasting the phase-out stage compared with domain experts.

Since the space of available models is vast, a set of 11 models reported as best performing in the literature was investigated. A thorough model selection based on performance suggested that the following three models were best suited to our dataset: Autoregressive Integrated Moving Average (ARIMA), Support Vector Regression (SVR), and Gaussian Process Regression (GPR). The final results showed that none of the models improved the forecast accuracy overall. However, an analysis of variance across 14 unique products showed that SVR performed close to the domain experts' estimates. In addition to the comparative study, this study showed that using less data improved the models' performances: only 60% of the training data seemed optimal for ARIMA and GPR, while SVR performed well with 80%. We present the results along with further research questions to be explored in this domain.


Sammanfattning

The phase-out stage of a product life cycle can be unpredictable. An accurate forecast of this stage can provide valuable insights, such as limiting the amount of obsolete inventory, and knowledge about the product's demand. This can have a positive economic effect and save resources. In this study, we compared data-driven forecasting models with domain experts to see whether the models could improve the estimation of demand during the phase-out stage of a product life cycle.

Because the range of available forecasting models is vast, a number of models that had shown the best results in other studies were examined. After a thorough selection among 11 models based on performance, the following three were used for the later part of the study: Autoregressive Integrated Moving Average (ARIMA), Support Vector Regression (SVR), and Gaussian Process Regression (GPR). The results showed that none of the models could improve the forecasts in general, but SVR showed forecast errors statistically similar to the domain experts' plan estimates across 14 unique products. Furthermore, reducing the amount of data turned out to improve the models' performance: only 60% of the training data appeared optimal for ARIMA and GPR, while SVR did best with 80%. We present the results together with further questions to investigate in this area.


Contents

1 Introduction
  1.1 Product Life Cycle
  1.2 Demand Forecasting
  1.3 Problem Statement
  1.4 Aim and Objective
  1.5 Related Works
    1.5.1 Successful cases of SVR
    1.5.2 Successful cases of GPR
    1.5.3 Successful case of LSTM
    1.5.4 ARIMA
  1.6 Thesis Outline

2 Background
  2.1 Time Series Analysis
    2.1.1 Stationary vs Non-Stationary
  2.2 Time Series with Machine Learning
    2.2.1 Forecasting Strategies
    2.2.2 Cross-Validation
  2.3 Forecasts Evaluation
  2.4 Forecasting Models
    2.4.1 Autoregressive Integrated Moving Average
    2.4.2 Support Vector Regression
    2.4.3 Gaussian Process Regression

3 Method
  3.1 Dataset
    3.1.1 Missing Value
    3.1.2 Data Scaling
  3.2 Experiment
    3.2.1 Implementation
    3.2.2 Model Selection
    3.2.3 Input Horizon
    3.2.4 Parameter Selection
    3.2.5 Forecast Horizon
  3.3 Statistical Test for Evaluating Results

4 Results
  4.1 Comparing Different Input Horizons
  4.2 Effect of Data Size on Predictability
  4.3 General Performances of the Forecasting Models
  4.4 Variance Analysis ANOVA - Demand Plan Comparison
  4.5 The Predictions of the Models

5 Discussion
  5.1 Discussion of the Models' Performances
  5.2 Ethics and Sustainability
  5.3 Limitations and Challenges
  5.4 Future Works

6 Summary and Conclusion

Chapter 1

Introduction

1.1 Product Life Cycle

Product life cycles (PLC) represent the change in demand for products over time. A cycle starts when the product enters the market and ends when it is removed or replaced [1]. A conventional PLC pattern is a bell-shaped curve divided into several stages corresponding to changes in volume; the number of stages can range from three to six. The six-phased cycle includes the introduction stage, which is usually characterized by low and unpredictable production volumes. The product then continues in a growth stage with increasing demand and acceptance in the market. Next, it enters a maturity stage where the demand stabilizes, followed by a decline due to decreasing demand. At the end of a product life cycle, manufacturers announce the end-of-life of the product. A last-time-buy (LTB) date tells customers when the product will no longer be available in the market or will be substituted by another product [2]. Forecasts of the phase-out demand can benefit managers in deciding where to place the LTB date and give demand planners knowledge about the product's demand activity. The bell-shaped curve does not always hold and can look different depending on the product's demand. For business goods, demand usually fluctuates because business customers infrequently purchase more substantial quantities than consumers do. These characteristics can result in a product life cycle that is non-periodic and intermittent [3].

The final phase of a PLC is a crucial step for limiting the amount of obsolete inventory and subsequent write-offs. Obsolete inventory can have a negative economic effect due to high scrap costs, which is common in industries with fast clock-speeds and high production costs, such as high-tech companies [4]. Manufacturers that produce high-tech products such as computers and circuit boards are constantly pressured by competition, innovation, and customer satisfaction. In that dynamic market, it is vital to keep up with demand by introducing new technologies for improvement, which in turn results in shorter product life cycles [5].

Figure 1.1: A bell-shaped representation of a six-phased product life cycle.

1.2 Demand Forecasting

Demand forecasting has been studied for more than 100 years, but in recent decades there have been significant improvements in demand-driven forecasting in the business domain [6]. One of the reasons is the availability of large amounts of data, which has become one of the main drivers in supply chain management. It has helped businesses uncover consumer behavior, synchronize with the supply chain, and optimize operations from manufacturing to distribution [6]. Moreover, machine learning has become the new forefront in demand forecasting. Gartner, a consulting company, estimated that by 2023 at least 50 percent of global companies would be using machine learning-based utilities to automate and optimize their supply chain operations [7].

1.3 Problem Statement

The demand planners at the host company forecast the phase-out manually and with intuition; part of their task involves analyzing historical demand data. Forecasting highly unpredictable business goods is difficult. With the available data, the experts could benefit from predictive modeling and data-driven methods that provide an automatic and accurate solution.


In this thesis, we examined whether the following predictive models could improve phase-out forecasting of product demand: autoregressive integrated moving average (ARIMA), support vector regression (SVR), and Gaussian process regression (GPR). The model selection is described in subsection 3.2.2. Our research hypothesis was that at least one of ARIMA, SVR, or GPR may improve the accuracy of forecasting the last stage of the product life cycle.

1.4 Aim and Objective

The objective was to find the best performing models that would outperform the domain experts when forecasting the phase-out period. The data consisted of monthly demanded quantities for 14 products, with between 41 and 125 historical instances each. A thorough analysis of variance and performance comparison of 11 different models was conducted, in which ARIMA, SVR, and GPR performed best, based on a prestudy described in subsection 3.2.2.

The aim was to provide a solution that would improve the demand planners' current forecast performance, helping to reduce the waste from obsolete products and thereby save money and resources.

1.5 Related Works

There have not been many studies on predicting the phase-out period in particular; most papers describe forecasting demand in general. Hence, we studied papers about forecasting regardless of the stage in a product life cycle. In the following, a few comparative studies are discussed in which ARIMA, SVR, and GPR have been implemented and in which SVR and GPR showed better results than other types of models.

1.5.1 Successful cases of SVR

Wang [8] compared the performance of SVR and a neural network with a radial basis function (RBF) architecture when forecasting the demand of a company's supply chain. His data covered a year of weekly paper sales, where the first 41 weeks were used for training the models and the next 10 weeks for testing. The dataset was normalized before modeling, and a grid search over $\sigma \in [0.001, 1000]$, $C \in [1, 1000]$, and $\epsilon \in [0.001, 0.1]$ was used to obtain SVR's optimal hyperparameters: $\sigma = 1$, penalty factor $C = 1$, and width parameter $\epsilon = 0.001$. SVR's final forecast showed a relative mean square error of 0.51, versus 0.6 for the neural network. The author stated that SVR was superior in generalization performance and forecast accuracy. Besides the promising demand-forecasting results, the key takeaway from the study was that appropriate hyperparameter tuning is vital when modeling time series.

Falat et al. [9] compared SVR and ARIMA when forecasting monthly crude oil prices, described as highly volatile. The data was divided into four time series, with training sizes ranging from 42 to 384 observations and test sets of 6 to 12 observations. In addition to the method comparison, the authors evaluated SVR's performance with different kernels. SVR showed the lowest absolute percentage error on all four datasets compared to ARIMA. SVR with a linear kernel performed best on the two experiments whose time series spanned 48 and 82 months, while the other two time series, with 264 and 394 observations, were better estimated with an RBF kernel. The authors stated that a linear kernel was better suited for short time series and an RBF kernel for longer ones. Moreover, the study indicated that SVR was robust on data with a degree of volatility.

1.5.2 Successful cases of GPR

Ahmed et al. [10] wrote an extensive comparison of the following eight machine learning methods: GPR, SVR, decision tree, k-nearest neighbor, generalized regression neural network, radial basis functions, Bayesian neural network, and multilayer perceptron (MLP). Besides comparing the models' performance, the authors compared three data preprocessing methods on 1045 different time series, each covering 226 months of demand and sales, with the last 18 months of each time series used for testing. The preprocessing methods included lagged validation, which maps previous months of observations (input) to the following month (output); the other two methods were differencing (taking the first backward difference) and moving average. Moving average and lagged validation yielded similar results for all models, while differencing performed much worse. The final model comparison showed that MLP, GPR, and SVR obtained average symmetric mean absolute percentage errors of 0.085, 0.090, and 0.099, while the rest of the models scored above 0.1. A further significance test implied that MLP's and GPR's forecast errors were not significantly different, unlike GPR versus the other methods. In conclusion, GPR outperformed a vast number of machine learning models on a dataset more extensive than ours, and lagged validation without data manipulation (except normalization) seemed to be the best way to prepare the time series, which the previous papers also utilized.

Poyraz et al. [11] also showed that GPR with an RBF kernel outperformed other models, on a dataset consisting of the demand for ten different medical drugs. 196 weeks of data were used to train the models and the next 65 weeks to test them. Besides GPR, the authors implemented SVR, decision tree, random forest, and multilayer perceptron. GPR had the lowest overall mean absolute percentage error at 31.6, while the rest scored above 32.1. The authors also compared predicting one, two, or three weeks ahead, in contrast to Ahmed's lagged validation, which only set the output equal to the following week. The results showed that increasing the distance between the input and the output deteriorated each model's performance, meaning that predicting the next week's demand in each iteration gave the best outcome.

To summarize, the first two papers showed how SVR outperformed ARIMA and an RBF neural network on pricing and demand datasets. The latter two papers described extensive comparative studies in which GPR produced better forecasts than many other methods, except in Ahmed's [10] work, where MLP showed a lower forecast error than GPR, although the difference between the models' forecast errors was not significant.

1.5.3 Successful case of LSTM

Some studies have applied long short-term memory (LSTM) networks to the problem. Shabani et al. [12] proposed a multilayer LSTM and compared it with ARIMA, exponential smoothing, k-nearest neighbor, a fully connected neural network, a vanilla recurrent neural network, a single-layer LSTM, and SVR. The data consisted of 152 months of furniture sales, with 20% of the data used for testing. To find optimal values for the LSTM's seven parameters, the authors systematically generated different combinations of values using a grid search. The final result showed that the proposed model outperformed the other methods, with a symmetric mean absolute percentage error of 0.108, while the second-best model, SVR, scored 0.112. The authors also tested the models on five other time series of similar size: the multilayer LSTM performed best on three of the cases, while SVR and exponential smoothing each did best on one. Hence, it was not evident that LSTM would perform best in all cases.

1.5.4 ARIMA

ARIMA is a traditional way to conduct time series forecasting and is most commonly used as a baseline against other models because of its low complexity and high interpretability [9, 13]. The papers mentioned in this section all trained forecasting models on past data, assuming that the past would give the models the information necessary to predict the future. One big difference between our study and these papers is that this thesis compares the models' performance against human expertise, not just against each other.

1.6 Thesis Outline

The next chapter presents the background of the thesis, including the forecasting models ARIMA, SVR, and GPR, as well as the traditional theory behind time series forecasting and forecasting with machine learning. Chapter 3 describes the implementation of the models, and the results are shown in chapter 4. The last chapters, 5 and 6, consist of discussion and conclusions together with a summary.


Chapter 2

Background

This chapter presents the background theory essential to the project. It explains two ways to model time series: the traditional approach and the machine learning approach. The chapter ends with a comparison of different evaluation metrics and descriptions of the selected forecasting models.

2.1 Time Series Analysis

A time series is a set of ordered observed values, where the mean and variance may change over time. It is mathematically denoted $\{x_t\}_{t=1}^{T}$ as a set of variables $x_t$. Time series analysis focuses on identifying underlying patterns to obtain a future prediction $\hat{x}_{T+1}$ estimated by a model $F$:

$$\hat{x}_{T+1} = F(x_t, \ldots, x_T). \tag{2.1}$$

Time series modeling replicates every element of a series by decomposing and computing the different signals without knowing the underlying cause of each [14]. The signals can be trend and seasonality, which explain the change of the mean and the variance in a time series. Moreover, the signals can be assumed to be additive:

$$x_t = T_t + S_t + E_t$$

where $T_t$ is the trend component, $S_t$ the seasonal component, and $E_t$ the random component. In an additive series, when the trend increases, the size of the seasonal peaks stays roughly the same (see figure 2.1).


Figure 2.1: Trend, seasonal and random signals of an additive time series.

2.1.1 Stationary vs Non-Stationary

Many traditional forecasting models are based on modeling time series with stable properties in terms of mean and variance, meaning no trend and no seasonal behavior; such a series is called stationary. A time series is stationary if the joint distribution of the observations $x_t, x_{t+1}, \ldots, x_{t+T}$ does not depend on the time $t$. Stationarity can be checked by visualizing the behavior of the series or by using statistical hypothesis tests [15]. The two common tests are:

1. Augmented Dickey-Fuller

The Augmented Dickey-Fuller (ADF) test is the most used technique to test stationarity of a process. The null hypothesis posits that a unit root is present in the time series. Take for example a time series model that only weighs the previous observation, also called an autoregressive model of order one,

$$x_t = \phi x_{t-1} + e_t,$$

where $x_t$ is the present value, $x_{t-1}$ is the past value, and $e_t$ is white noise. If $\phi$ equals one, the time series has a unit root, which implies it is not stationary. The ADF test accounts for all different timestamps $h$, and a significance level helps determine whether to reject the null hypothesis. Shumway et al. and Nielsen [14, 15] mention a shortcoming of ADF: it mainly tests whether the level of the process is stationary.

2. Kwiatkowski–Phillips–Schmidt–Shin

Another standard test is the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test, whose null hypothesis posits that no unit root is present in the time series. The test is based on linear regression, where the time series is decomposed into three parts:

$$x_t = r_t + \beta t + e_t$$

where $r_t$ is a random walk, $\beta t$ is a deterministic trend, and $e_t$ is a white noise error. The method measures whether the series has fixed intercepts for the random walk and trend components according to a significance threshold [15]. If so, the series is assumed to be stationary. A code sketch of both tests is given after this list.
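As a concrete illustration, here is a minimal sketch of running both tests with Statsmodels (the library the thesis uses elsewhere [29]). The helper name and the combined decision rule are our own assumptions, not code from the thesis:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

def check_stationarity(series, alpha=0.05):
    """Run the ADF and KPSS tests on a 1-D demand series."""
    adf_p = adfuller(series)[1]                              # null: unit root present
    kpss_p = kpss(series, regression="c", nlags="auto")[1]   # null: level-stationary
    return {
        "adf_p": adf_p,    # p < alpha -> reject the unit root
        "kpss_p": kpss_p,  # p < alpha -> reject stationarity
        "stationary": adf_p < alpha and kpss_p >= alpha,
    }

print(check_stationarity(np.random.randn(100)))  # white noise should pass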

2.2 Time Series with Machine Learning

Machine learning models require a different data preprocessing approach than ARIMA when forecasting time series: the data first needs to be transformed into a supervised learning setting that models the relationship between a set of input variables and a set of output variables [16].

2.2.1 Forecasting Strategies

Future values can be forecasted using a single-step or a multi-step approach. The main difference is the size of the forecast horizon: the single-step method forecasts one value at a time (see equation 2.1), while a multi-step forecast predicts multiple values directly.

According to Bontempi et al. [16], a multi-step forecast can be done using three strategies: direct, recursive, or multiple-output. The direct method uses $H$ independent models $F_h$ to forecast multiple single values at times $t + h$ (see equation 2.2), which are concatenated into a set of $H$ forecasts:

$$\hat{x}_{t+h} = F_h(x_t, \ldots, x_{t-k-1}) \tag{2.2}$$

where $k$ is the input horizon and $h \in \{1, \ldots, H\}$. The recursive strategy trains a one-step model $F$ which then feeds each forecasted value back as input to predict the next one until time $H$:

$$\hat{x}_{t+h} = F(x_t, \ldots, x_{t-k-1}) \tag{2.3}$$

Compared to the direct and recursive methods, the multiple-output method does not output single values iteratively; instead, it maps the input variables directly to multiple output values, predicting a vector.

Taieb et al. [17] described the difficulty of obtaining improvements with the multiple-output method in multi-step forecasting, because it accumulates the model's errors, especially with short time series. The authors favored the direct strategy, but stressed that it is computationally expensive, as one needs to build multiple models. As for the recursive strategy, new predictions depend on the accuracy of the previous estimates, which means that longer-term predictions may deteriorate; the advantage of the method is its simplicity and computational ease. A sketch of the direct strategy is given below.
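To make the direct strategy concrete, the sketch below trains $H$ independent linear-kernel SVR models, one per horizon step. The function and data layout are illustrative assumptions under the notation above, not code from the thesis:

```python
import numpy as np
from sklearn.svm import SVR

def direct_forecast(series, k=3, H=12):
    """Fit one model F_h per horizon step h and forecast x_{T+1}..x_{T+H}."""
    series = np.asarray(series, dtype=float)
    n = len(series) - k - H + 1                       # number of training windows
    X = np.array([series[i:i + k] for i in range(n)])
    preds = []
    for h in range(1, H + 1):
        y = np.array([series[i + k + h - 1] for i in range(n)])
        model = SVR(kernel="linear").fit(X, y)        # independent model F_h
        preds.append(model.predict(series[-k:].reshape(1, -1))[0])
    return np.array(preds)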

2.2.2 Cross-Validation

Validation for model selection is trickier in time series forecasting than in tasks such as image classification. Methods like k-fold cross-validation (CV) are not suitable for time series because of the dependencies between observations: k-fold CV randomly partitions the training data into k folds, training on k − 1 of them and validating on the remaining fold. To keep the temporal dependency between samples, Kaastra and Boyd [18] described a modified CV called walk-forward CV, which is suited for time series (see figure 2.2).


Figure 2.2: K-fold CV on the left and walk-forward CV on the right. The blue points are training points, and the red are validation points. The horizontal axis states the time and the iteration steps are denoted as top to bottom.

The method partitions the series into sequences of training and validation samples. The window moves forward in time, retraining the model on a new input sequence and evaluating it on the new validation set. This forces the model to adapt to new incoming time-dependent data.
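Scikit-learn's TimeSeriesSplit implements an expanding-window variant of this idea. The thesis does not name the exact routine it used, so the following is only one plausible realization, with placeholder data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

X, y = np.random.rand(60, 3), np.random.rand(60)   # lagged samples (placeholder)

errors = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Earlier samples train the model, the next block in time validates it.
    model = SVR(kernel="linear").fit(X[train_idx], y[train_idx])
    errors.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))
print(np.mean(errors))   # walk-forward validation score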

2.3 Forecasts Evaluation

There are several methods for measuring the forecast error of univariate time series. Hyndman et al. [19] reviewed many different measures, some of which are shown in table 2.1. The most common are the root mean square error (RMSE) and the mean absolute error (MAE). RMSE puts weight on large errors, making it more sensitive to outliers than MAE. Both are scale dependent, however, and not appropriate for comparing time series of different scales.

Percentage errors are scale-independent measures and are frequently used to compare forecast performance across different datasets. One example is the mean absolute percentage error (MAPE). It puts more weight on positive errors than on negative errors and is therefore asymmetric [19]. Because of that, a modified version called the symmetric mean absolute percentage error (sMAPE) was created, which is less sensitive to low values. However, neither of them can handle zero values.

To tackle the problem of comparing time series of different sizes that may also contain zero values, Hyndman et al. [19] proposed the mean absolute scaled error (MASE). It is the ratio of the mean forecast error to the mean absolute error of a one-step naive prediction on the training data, where the naive approach sets each prediction equal to the previous observation. A MASE value below one implies that the out-of-sample forecast performs better than the naive one-step forecast did in-sample.

$$\mathrm{MAE} = H^{-1} \sum_{i=T+1}^{T+H} |x_i - \hat{x}_i|$$

$$\mathrm{RMSE} = \sqrt{H^{-1} \sum_{i=T+1}^{T+H} (x_i - \hat{x}_i)^2}$$

$$\mathrm{MAPE} = \frac{100}{H} \sum_{i=T+1}^{T+H} \left| \frac{x_i - \hat{x}_i}{x_i} \right|$$

$$\mathrm{sMAPE} = \frac{200}{H} \sum_{i=T+1}^{T+H} \frac{|x_i - \hat{x}_i|}{|x_i| + |\hat{x}_i|}$$

$$\mathrm{MASE} = \frac{H^{-1} \sum_{i=T+1}^{T+H} |x_i - \hat{x}_i|}{(T-1)^{-1} \sum_{j=2}^{T} |x_j - x_{j-1}|}$$

Table 2.1: Common forecast accuracy metrics over the forecast horizon [T + 1, T + H], where H is the length of the horizon.
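The two scale-free measures used later in the thesis follow directly from the table; a minimal sketch:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, on the 0-200 scale."""
    return 200 / len(y_true) * np.sum(
        np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))

def mase(y_true, y_pred, y_train):
    """Forecast MAE scaled by the in-sample one-step naive MAE."""
    naive_mae = np.mean(np.abs(np.diff(y_train)))  # average |x_j - x_{j-1}|
    return np.mean(np.abs(y_true - y_pred)) / naive_mae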

2.4 Forecasting Models

2.4.1 Autoregressive Integrated Moving Average

Traditional time series models like ARIMA are based on the assumption that a future value is a linear combination of several past observations and random noise [20].

The model has the general form ARIMA(p, d, q):

$$w_t = \sum_{i=1}^{p} \beta_i w_{t-i} + e_t - \sum_{i=1}^{q} \theta_i e_{t-i} + \theta_0 \tag{2.4}$$

where $\beta$ and $\theta$ are the model's parameters, whose numbers are determined by the orders $p$ and $q$. $w_t$ is the $d$th difference of the observations $x_t$, meaning $w_t = \nabla^d x_t$, and $e_t$ are random noise terms. ARIMA(1,1,1), for example, has $d = 1$, stating first-order differencing ($w_t = x_t - x_{t-1}$), and the equation becomes

$$x_t = (1 + \phi_1)x_{t-1} - \phi_1 x_{t-2} + e_t - \theta_1 e_{t-1} + \theta_0,$$

formulating the future value $x_t$ as a linear combination of the two previous observations together with parameters and random noise terms. Before finding $p$ and $q$, one needs to difference the non-stationary time series ($d$ times) until it becomes stationary, which is a necessary condition when building an ARIMA model (see subsection 2.1.1).

The basic idea behind identifying the orders $p$ and $q$ is to examine the autocorrelation properties of the time series. Box and Jenkins [20] proposed the use of the autocorrelation and partial autocorrelation functions (ACF and PACF), from which the values of the parameters $\beta$ and $\theta$ are derived:

$$\mathrm{ACF}_h = \frac{\mathrm{cov}(x_t, x_{t+h})}{\mathrm{cov}(x_t, x_t)} \tag{2.5}$$

$$\mathrm{PACF}_h = \frac{\mathrm{cov}(x_t, x_{t+h} \mid x_{t-1}, \ldots, x_{t-n})}{\sqrt{\mathrm{var}(x_t \mid x_{t-1}, \ldots, x_{t-n})\,\mathrm{var}(x_{t+h} \mid x_{t-1}, \ldots, x_{t-n})}} \tag{2.6}$$

The ACF measures the linear relation between observations as a function of the lag $h$, where a value of zero indicates no correlation between two observations. The statistical rule $\pm 1.96/\sqrt{N}$ serves as a threshold to determine the significance of the ACF and PACF estimates, where $N$ is the number of observations. Compared to the ACF, the PACF measures the correlation between observations conditioned on the intermediate lagged observations; equation 2.6 describes the PACF of $x_t$ with $x_{t+h}$ conditioned on the previous $n$ observations.


Figure 2.3: ACF and PACF plots of ARIMA(1,0,1).

The example in figure 2.3 shows the ACF and PACF plots of an ARIMA(1,0,1) process. The spike at lag one in the ACF plot describes the high correlation with the past observation $x_{t-1}$, which sets $q$ to order one. Similarly, the PACF plot gives the order of $p$, which in this case is also one.

Time series can show multiple significant lags, resulting in different candidate combinations of $p$ and $q$. The tentative models can be compared using Akaike's information criterion (AIC), which serves as the model selection method [20]; the pair with the lowest AIC value implies a better-fitted model. The criterion is formulated as:

$$\mathrm{AIC} = -2\ln(\hat{L}) + 2g, \tag{2.7}$$

where the first term is the log of the maximized likelihood and the second term penalizes additional parameters in the model, with $g = p + q + 1$.
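In practice, this whole procedure (difference until stationary, then search $(p, q)$ by AIC) is automated by pmdarima, the successor of the Pyramid package cited in chapter 3. A sketch with placeholder data:

```python
import numpy as np
import pmdarima as pm   # successor of the Pyramid package [23]

y_train = np.random.rand(80)   # placeholder monthly demand series

# Difference according to the KPSS test, then pick (p, q) by lowest AIC.
model = pm.auto_arima(y_train, test="kpss", information_criterion="aic",
                      seasonal=False, suppress_warnings=True)
print(model.order)                       # selected (p, d, q)
forecast = model.predict(n_periods=12)   # recursive 12-month forecast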

2.4.2 Support Vector Regression

Support vector regression is based on the concept of a maximal margin, fitting a hyperplane to the training data to predict the next continuous value. It has a regularization parameter $C$, which states the degree of allowed point violation. Slack variables $\xi_i^{\pm}$ denote the distance between points outside the margin and its upper or lower bound, and the width of the margin is controlled by a hyperparameter $\epsilon$. The hyperplane equation and the two constraining boundary lines are formulated as:

$$\hat{y}_i = w^T x_i + b \tag{2.8}$$

$$y_i - w^T x_i - b \leq \epsilon + \xi_i^{+} \tag{2.9}$$

$$-y_i + w^T x_i + b \leq \epsilon + \xi_i^{-} \tag{2.10}$$

where $w$ is the weight of the hyperplane, $x_i$ is an instance, and $b$ is a constant. The optimal weight is calculated from a loss function of the predicted and actual values $(\hat{y}_i, y_i)$, together with the slack points and the regularization $C$ that penalizes the weight to obtain a lower cost value [21]:

$$\hat{w} = \sum_i \alpha_i x_i,$$

The $x_i$ for which $\alpha_i > 0$ are called the support vectors, which lie on or outside the margin. The final prediction equation becomes

$$\hat{y}_i = \hat{w}_0 + \sum_i \alpha_i \kappa(x_i, x'_i) \tag{2.11}$$

where $\hat{w}_0$ is a constant and $x'_i$ is an unseen data point. $\kappa$ is a kernel function derived from multiplying $\hat{w}$ with $x_i$. The kernel function transforms the data to a higher dimension such that the model can find a linear solution for the hyperplane (see figure 2.4).

Figure 2.4: The green stars are data points fitted to the hyperplane and its margin. A model with non-linear data is shown on the left; transforming the data with a kernel function ϕ(x) gives a linear hyperplane solution, shown on the right.

The kernel function quantifies the similarity of two instances $x_i$, $x'_i$ containing $p$ features, where the linear kernel is defined as:

$$\kappa(x_i, x'_i) = \sum_{j=1}^{p} x_{i,j} x'_{i,j} \tag{2.12}$$

SVR also has other kernels, such as polynomial, RBF, and exponential. For this thesis, the linear kernel was used according to the recommendation in Falat's paper [9].
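In Scikit-learn, which the thesis uses for the machine learning models, a linear-kernel SVR reduces to a few lines. The data and hyperparameter values below are placeholders (in the thesis, C and ε come from the training set, see section 3.2.4):

```python
import numpy as np
from sklearn.svm import SVR

X_train = np.random.rand(50, 3)   # k = 3 lagged inputs (placeholder data)
y_train = np.random.rand(50)

# Linear kernel per Falat et al. [9]; C and epsilon are illustrative values.
svr = SVR(kernel="linear", C=1.0, epsilon=0.01).fit(X_train, y_train)
y_next = svr.predict(X_train[-1:])   # one-step-ahead prediction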

2.4.3 Gaussian Process Regression

A Gaussian process (GP) is a machine learning model based on probability theory that provides the means to model uncertainty, perform probabilistic inference for optimization, and make predictions in uncertain environments [21].

A GP defines a prior over functions, with the input data $P(f(x_1), \ldots, f(x_N))$ jointly Gaussian with mean $m(x)$ and covariance $\Sigma_{i,j} = \kappa(x_i, x_j)$, where $\kappa$ is a kernel. The idea is that if $x_i$ and $x_j$ are deemed similar by the kernel, then their output values are expected to be similar as well [21].

Figure 2.5: A graphical representation of GPR with two training points $(x_1, y_1)$, $(x_2, y_2)$ and one test point $(x_*, y_*)$. The $f_i$ are the connected hidden nodes, and the strength of an edge represents the covariance term $\kappa(x_i, x_j)$. If the test point $x_*$ is similar to the training points $(x_1, x_2)$, then the predicted output $y_*$ is similar to $(y_1, y_2)$.

For regression with noisy observations, the function is denoted $y = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma_y^2 I)$ and $\sigma_y$ is the standard deviation of the noise. The prediction for a single test point $x_*$ is obtained by first deriving the covariance of the target values $y$ given all the input data $X$:

$$\mathrm{cov}[y \mid X] = \kappa(X, X) + \sigma_y^2 I,$$

where $\sigma_y^2 I$ is a diagonal variance matrix. The mean prediction (see equation 2.13) is estimated from a posterior distribution, derived using Bayes' rule from a GP prior over the latent functions at the test point.

$$\hat{f}_* = \mathbb{E}[P(f_* \mid x_*, X, y)] = \kappa(x_*, X)^T \left[\kappa(X, X) + \sigma_y^2 I\right]^{-1} y \tag{2.13}$$

The radial basis kernel with a training data point $x$ is denoted

$$\kappa(x, x_*) = \sigma_f^2 \exp\!\left(-\frac{1}{2l^2}(x - x_*)^2\right) + \sigma_y^2 \delta_{i*}, \tag{2.14}$$

where $l$ is the horizontal scale over which the function changes and $\sigma_f^2$ denotes the size of the vertical scale. These two variables, together with $\sigma_y$, are the main parameters to tune in a Gaussian process regression model. As with SVR, other kernels could also be used with GPR; we followed the examples of Ahmed and Poyraz et al. [10, 11] and utilized the RBF kernel.
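A sketch of the corresponding Scikit-learn model: the constant, RBF, and white-noise kernel terms play the roles of $\sigma_f^2$, $l$, and $\sigma_y^2$, initialized near the defaults named in section 3.2.4 and then tuned by maximum likelihood inside fit(). The data is a placeholder:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

X_train = np.random.rand(50, 3)   # placeholder lagged inputs
y_train = np.random.rand(50)

# sigma_f^2 * RBF(l) + noise term; hyperparameters optimized during fit().
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_train)
mean, std = gpr.predict(X_train[-1:], return_std=True)  # predictive mean and sd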


Chapter 3

Method

This chapter describes the data and the practical implementation of modeling the time series, based on the theory described in the previous chapter. The last section presents the statistical test used to answer the research question.

3.1 Dataset

The data consisted of 14 univariate time series, each describing the entire product life cycle of a unique product, including its last-time-buy (LTB) date. The products had different lengths, ranging from 41 to 125 months of demanded quantities; six of the products had more than 80 instances, while the rest had fewer than 76, so some use cases came with less data. The phase-out period for these products consisted of the last year before the LTB date, which served as the ground truth. Since testing on new data is fundamental when evaluating machine learning models, each time series was split into a training set containing the demand until 12 months before the LTB date, while the last 12 months formed the test set. The models' forecast errors were then compared with the prediction errors of the domain experts.

The domain experts' estimations, also called the demand plan, constituted a global plan that dimensions the supply chain. It was set per product by demand planners based on market forecasts, sales plans, life cycle plans, and historical data, and supply chain managers and product leaders assessed costs and risks before approving it. The plan covers 21 months, with the main focus on the nearest 12; thus, the last year of the product plan cycle served as our baseline.

3.1.1 Missing Value

Missing data is a common problem and can occur for various reasons, such as faults in the system or, in this case, absent demand. It can create problems when implementing machine learning models, causing biased parameter estimates [21]. There are different ways to tackle the problem: one could impute values by filling in numbers manually, by interpolation, or by inserting the mean value. In our case, three to five values were missing at random in the products' training data. Because they were few, the missing values were replaced with zeros so that the data would reflect the reality of absent demand.

3.1.2 Data Scaling

For certain machine learning models, such as neural networks, scaling the data is common practice: it helps avoid getting stuck in local minima and speeds up learning and convergence. Wang and Ahmed et al. [8, 10] scaled each time series before applying SVR and GPR for faster convergence. We followed this approach and linearly scaled each training set to the range [0, 1]; the models' predictions were then rescaled to the magnitude of the actual data.
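A minimal sketch of this scaling step with Scikit-learn's MinMaxScaler; the important detail is that the scaler is fitted on the training set only and inverted after forecasting. The data here is a placeholder:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.random.rand(80, 1)                    # placeholder demand column

scaler = MinMaxScaler(feature_range=(0, 1)).fit(train)   # fit on training only
train_scaled = scaler.transform(train)           # model on this scaled series
forecast_scaled = np.random.rand(12, 1)          # placeholder model output
forecast = scaler.inverse_transform(forecast_scaled)     # back to real units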

3.2 Experiment

3.2.1 Implementation

In recent years, developers have built Python libraries that make it more efficient, easier, and more robust to implement machine learning models on various datasets. Since we had multiple time series to model, it benefited us to use packages instead of writing the algorithms from scratch. The machine learning models were implemented with the popular Scikit-learn library [22]; for ARIMA, the Pyramid package [23] was utilized, which models time series automatically according to the theory described in subsection 2.4.1.

3.2.2 Model Selection

In recent years, deep learning has shown state-of-the-art performance in various applications, such as computer vision. Hence, we first experimented with the deep learning model LSTM; the outcome was that the model overfitted on our time series and was not suitable for forecasting. Goodfellow et al. [24] mention a rough rule of thumb: deep learning models can achieve acceptable performance with around 5,000 instances, and a dataset containing at least 10 million instances can provide enough information to exceed human performance. Our time series had far less data, which limited us to models suitable for smaller datasets. Ahmed et al. [10] benchmarked a vast number of machine learning models on demand forecasting (see section 1.5): a deep learning model showed the best performance, followed by GPR and SVR, and the latter two methods are also compatible with small datasets [21].

Figure 3.1: Mean MASE on 5 products' time series. The models were: exponential smoothing (EST), ARIMA, GARCH, Bayesian structural time series (BSTS), dynamic linear model (DLM), linear regression (LR), SVR, random forest (RF), GPR, and a fully connected neural network (FNN). The line marks our threshold at 1.25 MASE.

In addition to ARIMA, SVR, and GPR, experiments were run on other types of models besides LSTM, some of which were used in the related works in section 1.5. These included exponential smoothing, GARCH [25], Bayesian structural time series [26], dynamic linear model [27], random forest, a fully connected neural network, and linear regression. Because of the data constraints, some of the models had shortcomings: LSTM was one example discussed earlier, and exponential smoothing and linear regression gave poor results due to the complexity of the data. The models were filtered based on their average performance on five different cases, where ARIMA, SVR, and GPR scored the lowest mean errors (MASE), see figure 3.1. Only these five products were available at the time of the model selection; they were included with the rest for the final experiment in chapter 4.

3.2.3 Input Horizon

As a first step in the experiment, the input sample size $k$ had to be determined in order to predict the following month with SVR and GPR, as mentioned in subsection 2.2.1. To find the optimal size, the walk-forward CV procedure (see subsection 2.2.2) was run over the range $k = [1, 2, 3, 4, 5]$, which Ahmed et al. [10] also used. For example, with $k = 3$, one input-output sample $(X_t, Y_t)$ at time $t$ is formulated as:

$$X_t = [x_{t+1}, x_{t+2}, x_{t+3}], \quad Y_t = x_{t+4},$$

mapping the previous three months of demand to the value of the fourth month, for each iteration. To ensure that the walk-forward CV score depended solely on the different input sizes, the other parameters were left unchanged, and the $k$ with the lowest prediction error was selected. ARIMA models the time series directly, without transforming the data into a supervised setting; it derives a combination of the previous observations that had the highest correlations. A sketch of the lagged transformation is given below.
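The lagged transformation amounts to a few lines of NumPy; the helper name is our own, not from the thesis:

```python
import numpy as np

def make_supervised(series, k=3):
    """Map every window of k past months to the following month's demand."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + k] for i in range(len(series) - k)])
    y = series[k:]
    return X, y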

3.2.4 Parameter Selection

Parameters are essential to machine learning algorithms and have to be appropriately tuned to control the learning process for each time series.

For the SVR with a linear kernel, optimal values for the two main hyperparameters $C$ and $\epsilon$ were needed. Chalimourda et al. [28] wrote a well-cited, comprehensive analysis of these parameters and proposed extracting the optimal values directly from the training set, using $C = Y_{max}$ and $\epsilon = 3\sigma\sqrt{\ln(N)/N}$, where $Y_{max}$ is the maximum training target value, $N$ is the number of training samples, and $\sigma$ is the noise standard deviation of the time series. Setting the regularization term to the maximum value helps handle outliers, and $\epsilon$ is, in theory, proportional to the noise spread [28]. To calculate $\sigma$, Ahmed et al. [10] suggested computing it from the noise signal after decomposing each time series (see figure 2.1) using libraries such as Statsmodels [29]. This parameter selection approach requires less computational power than a grid search [9, 8].
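Putting the two formulas together with the suggested Statsmodels decomposition gives a short routine. The function is an illustrative assumption (the thesis does not show its code); the residual's NaN padding at the series ends is handled with nanstd:

```python
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def svr_hyperparameters(y_train, period=12):
    """C = Y_max and eps = 3*sigma*sqrt(ln(N)/N), per Chalimourda et al. [28]."""
    n = len(y_train)
    decomp = seasonal_decompose(y_train, model="additive", period=period)
    sigma = np.nanstd(decomp.resid)   # noise std from the random component
    return np.max(y_train), 3 * sigma * np.sqrt(np.log(n) / n)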

The second machine learning model, GPR with an RBF kernel, uses maximum likelihood (see [30]) to find the optimal values of its hyperparameters $\sigma_f$, $l$, and $\sigma_y$. Initial values are needed to start the optimization; we were unable to find a suitable article on which values to use, which seems to be an open question, and therefore started from the default initial values in the Scikit-learn library [22]: $\sigma_f = 1$, $l = 1$, and $\sigma_y = 0.1$.

The last forecasting model, ARIMA, has the main parameters $p$, $q$, and $d$, which are computed according to the theory in subsection 2.4.1. The value of $d$ is determined by how many times the time series must be differenced to become stationary according to the KPSS test (see subsection 2.1.1). Pairs of $p$ and $q$ are then collected using the autocorrelation measures, and the pair with the lowest AIC value is selected.

3.2.5 Forecast Horizon

Figure 3.2: Visualisation of a recursive forecast, starting from top to bottom. The vertical line states the beginning of the forecast horizon. The blue dots are training observations and the green dots are the forecasted values.

SVR and GPR predict one output value at a time, unlike neural networks, which can output multiple values. In the related works of section 1.5, Wang, Falat, Ahmed, and Poyraz et al. [8, 9, 10, 11] utilized the test data to obtain future predictions (the single-step method). Forecasting long horizons without using the hold-out sets can instead be done with one of the forecasting strategies described in subsection 2.2.1. In this thesis, the recursive technique was used for its computational ease; the direct method would have required building 12 models (one per month of the 12-month horizon) for each time series. The recursive strategy is also what ARIMA uses to forecast arbitrary numbers of observations. To describe the recursive method, let the input horizon be $k = 3$. In the first iteration, the model is given the three last observations of the in-sample to produce its first prediction (the green dot in figure 3.2). In the second iteration, the predicted value is appended to the two most recent observations to form the next three-sized input sample, yielding a second prediction. The procedure repeats until the entire twelve-step forecast horizon is obtained.
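The recursive loop itself is short. Here is a sketch for any fitted one-step model with a Scikit-learn-style predict method; the names are illustrative:

```python
import numpy as np

def recursive_forecast(model, history, k=3, horizon=12):
    """Feed each prediction back as input until the horizon is filled."""
    window = list(history[-k:])          # the k most recent observations
    preds = []
    for _ in range(horizon):
        y_hat = float(model.predict(np.array(window).reshape(1, -1))[0])
        preds.append(y_hat)
        window = window[1:] + [y_hat]    # slide the window forward
    return np.array(preds)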

3.3 Statistical Test for Evaluating Results

To answer the research question, the repeated-measures analysis of variance (one-way ANOVA) described by Girden [31] was used to determine whether there was a significant difference between the models' mean errors. If a difference existed, a pairwise t-test would measure the significance of the difference between each model's average forecast errors and the average demand plan forecast errors.

One-way ANOVA and the t-test suited this experiment because each unique time series was measured repeatedly. The hypothesis tests rest on different assumptions, among them normality of the measures and equal error variance in each test group. If the data violated these assumptions, Welch's ANOVA would be considered instead, and similarly, the non-parametric Wilcoxon signed-rank test would replace the t-test [32].
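As an illustration, the repeated-measures ANOVA and the paired t-tests can be run with Statsmodels and SciPy. The error vectors below are placeholders standing in for the per-product MASE values of table 4.2:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from scipy.stats import ttest_rel

errors = {m: np.random.rand(14) for m in ["DP", "ARIMA", "SVR", "GPR"]}
df = pd.DataFrame([{"product": i, "method": m, "mase": v[i]}
                   for m, v in errors.items() for i in range(14)])

# Each product is measured once per method -> repeated measures.
print(AnovaRM(df, depvar="mase", subject="product", within=["method"]).fit())

for m in ["ARIMA", "SVR", "GPR"]:      # pairwise tests against the baseline
    print(m, ttest_rel(errors["DP"], errors[m]))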


Chapter 4

Results

This chapter presents the results of the forecasting methods on the product life cycles. We ran each test five times and extracted the average mean and variance of the errors. The first two sections present a comparison between input horizon sizes and a study of the effect of training data size on the predictions. Based on the models' forecast errors, a performance analysis was conducted to measure how well the models performed on the time series compared with the domain experts' (DP) estimates. The last section visualizes the models' predictions on five of the fourteen products.

4.1 Comparing Different Input Horizons

Input Size | SVR          | GPR
1          | 5.70 ± 16.91 | 6.16 ± 16.99
2          | 4.93 ± 15.39 | 4.98 ± 15.49
3          | 4.22 ± 10.31 | 4.64 ± 10.38
4          | 5.84 ± 25.10 | 6.05 ± 25.17
5          | 5.46 ± 19.89 | 5.77 ± 19.91

Table 4.1: Mean MASE and one standard deviation from walk-forward CV over the time series with different input sizes.

Before training SVR and GPR, a suitable input size for the data was needed. The MASE measures from the walk-forward CV (see table 4.1) show that an input size of 3 gave the lowest mean MASE and hence improved the models' generalization ability the most. Both models worked better when mapping three input values to the target value; decreasing or increasing the input size distorted more of the relationship between the input and the output.

4.2 Effect of Data Size on Predictability

Zhu et al. [5] noted that high-tech products usually have shorter life cycles. This remark raised the questions of how long a time series should be to yield good forecasts, and how far back in the data the models need to look to forecast the phase-out stage. To find out, we measured the models' performance on different training sizes, based on the MASE of predicting the 12-month horizons, and trained on a percentage of the most recent observations so as to keep the time dependency between the training and test sets intact.

Train Size % | ARIMA       | SVR         | GPR
20           | 1.54 ± 1.64 | 1.10 ± 1.50 | 1.57 ± 1.63
40           | 1.20 ± 0.62 | 0.99 ± 0.54 | 1.28 ± 0.59
60           | 1.06 ± 0.47 | 0.85 ± 0.47 | 1.25 ± 0.56
80           | 1.22 ± 0.64 | 0.77 ± 0.41 | 1.35 ± 0.56
100          | 1.35 ± 0.86 | 0.78 ± 0.40 | 1.53 ± 0.84

Figure 4.1: Mean MASE and one standard deviation of the forecasting models on different training data sizes.

Figure 4.1 shows that the estimated average MASE for ARIMA and GPR increased when including more training samples, from 60 to 100 percent, with a local minimum at 60 percent, meaning that keeping more or less than 60 percent of the data did not benefit those models' average predictions. For SVR, however, the error decreased until hitting 80 percent, where it flattened out. The final result suggested that, on average, the models' performance did not improve by using the entire dataset of each time series. For the following results, the models were trained with their individually optimal training sizes.

This finding goes against the intuition that more training data leads to better performance. However, time series forecasting differs from other machine learning applications such as image classification: the dependency between the observations in a time series must be taken into account for the models to learn the underlying behavior. Hence, the oldest 40% (ARIMA, GPR) or 20% (SVR) of the observations were, on average, redundant and may have confused the models as the time series changed over time. This study covered only 14 unique time series, which may not be enough to determine how much training data is optimal for these models. Nevertheless, it suggests investigating how far back one should acquire data before it ends up as noise when modeling time series.

4.3 General Performances of the Forecasting Models

Figure 4.2: Mean MASE of the models and the domain experts estimates on the time series. The error bars show one standard deviation of the error.

Figure 4.2 shows the models' mean forecast errors over all the products, including the demand plan's estimated error. It is evident from the figure that the DP estimated better on average, with a mean MASE of 0.72, although SVR's average forecast error of 0.77 was closest to the DP's. GPR performed worst overall, with a mean MASE of 1.25 and the highest error variance, 0.31, meaning it predicted much worse on some products. The models' means and variances, along with the forecast error on each product, can be seen in table 4.2.

Product       | DP   | ARIMA | SVR  | GPR
1             | 0.72 | 1.93  | 0.46 | 2.12
2             | 0.43 | 2.16  | 1.71 | 2.98
3             | 0.36 | 0.40  | 0.22 | 1.74
4             | 0.43 | 0.95  | 0.48 | 0.81
5             | 0.26 | 0.71  | 0.39 | 1.30
6             | 0.98 | 1.12  | 1.20 | 1.25
7             | 0.89 | 1.20  | 0.95 | 1.01
8             | 0.82 | 0.62  | 0.57 | 0.78
9             | 0.86 | 0.93  | 0.53 | 1.13
10            | 1.21 | 1.95  | 0.69 | 2.07
11            | 1.57 | 1.04  | 1.19 | 1.32
12            | 0.94 | 1.32  | 1.11 | 1.23
13            | 0.45 | 1.14  | 1.00 | 1.20
14            | 0.14 | 1.26  | 0.13 | 0.72
MASE mean     | 0.72 | 1.06  | 0.77 | 1.25
MASE variance | 0.15 | 0.26  | 0.17 | 0.31

Table 4.2: MASE values of the models and the domain experts on each product.

The table above shows that the DP achieved the lowest error on half of the products, while SVR scored lowest on 6 of the 14 and ARIMA on 1. None of the models performed better on more products than the DP. One reason could be that in some demand series it was harder for the models to find any patterns: products 2 and 13, for instance, were estimated much worse by all models, with the DP at MASE values of 0.43 and 0.45 while the models' errors were at least 1.71 and 1.00, respectively.

4.4 Variance Analysis ANOVA - Demand Plan Comparison

The previous section showed that some models performed better than the DP on certain products. Hence, a variance analysis was conducted to establish statistically how each model's performance differed from the DP's. Since the variance of each method's forecast error was roughly equal (see table 4.2), the assumptions of a standard one-way ANOVA were not violated.

Source of Variation | SS     | df | MS    | F     | P-value | F-crit
Between Groups      | 5.133  | 3  | 1.711 | 6.110 | 0.001   | 2.782
Within Groups       | 14.561 | 52 | 0.280

Table 4.3: Repeated-measures ANOVA of the models' and DP's errors, with significance level 0.05.

The null hypothesis of the one-way ANOVA in table 4.3, that the mean MASE values of all methods are equal, was rejected with a p-value of 0.001, implying a significant difference between the average errors of at least two methods. To find which method's average error differed significantly, the pairwise t-tests in the table below were computed.

Models   | Mean Difference | t-stat | P two-tail | t-crit two-tail
DP-ARIMA | -0.338          | 3.004  | 0.010      | 2.160
DP-SVR   | -0.050          | -0.457 | 0.654      | 2.160
DP-GPR   | -0.53           | -3.226 | 0.006      | 2.160

Table 4.4: Pairwise t-test comparison of mean MASE between each model and the DP, with significance level 0.05.

A two-tailed t-test determines, in addition to the null hypothesis, whether the first mean is larger or smaller than the second [33]. Table 4.4 shows that the t-test between SVR and DP did not reject the null hypothesis, with a p-value of 0.654, while the mean differences between DP and ARIMA and between DP and GPR were significant, with p-values below 0.01. This means that ARIMA and GPR performed significantly worse than the baseline, whereas SVR's overall forecast errors were statistically indistinguishable from the demand plan's estimated errors.

4.5 The Predictions of the Models

After comparing the models' performance, visualizing the forecasts may provide a deeper understanding of the error measures. Forecasts of the phase-out stage for five different products are presented below. To focus on the forecast horizons, the full training sets are not included in the plots.


Figure 4.3: Forecasts of products 2,3,8,11 and 13. The ground truth is in black. The y-axis denotes quantities and x-axis depicts the time in months.

All three models produced overestimated forecasts when the demand dropped right before the start of the forecast horizon (see products 2, 11, and 13), presumably because sudden changes in quantity are hard for the models to capture. The models' predictions also grew more distant from the actual values further into the future, clearly visible for products 2, 3, and 13. The recursive method may explain this, as it uses predictions as input to produce the next estimate.

Overall, GPR forecasted much worse than the other models, as the earlier performance analysis indicated. GPR utilizes the noise variance of the training data when modeling; the noise of the in-samples of products 2, 3, and 13 may have significantly impacted the model and caused the volatile forecasts. It could also be that the training samples had seasonal patterns, resulting in the periodic estimations, similar to ARIMA on product 2.


ARIMA and SVR made similar predictions on some of the time series (see products 3, 8, and 11). Still, SVR's estimates were closest to the actual values and seemed to capture trends. Nevertheless, the model was unable to predict most of the peaks in the time series, which may be because the support vectors did not capture the spiky observations in the training data.


Chapter 5

Discussion

In this thesis, we examined whether data-driven models could improve forecasting of the last stage of a product life cycle. Three advanced forecasting methods were compared with the domain experts' estimations on a set of selected products by analyzing their performance.

5.1 Discussion of the Models' Performances

In the study by Poyraz et al. [11], GPR outperformed SVR; in this thesis, SVR outperformed the other models. Even if some forecasting methods have shown exceptional performance in other articles, it may be wise to test the supposedly inferior models as well, as they may do better on someone else's dataset. That is why, in subsection 3.2.2, we experimented with several of the other methods used in the related works. Moreover, ARIMA rests on an assumption of linearity, similar to Holt-Winters and linear regression, so complex data can make it hard for ARIMA to find underlying patterns, unlike SVR, which can handle non-linear data. Both ARIMA and GPR rely on correlation when modeling: if there is no apparent correlation between the observations, the models may end up fitting interactions of the noise, which can also explain why GPR produced volatile forecasts in section 4.5.

GPR showed potential in Ahmed's [10] study. However, the authors forecasted one month at a time while using test samples as input, which is not suitable when one wants to know the demand more than one month in advance. The plots in section 4.5 showed that the models' predictions were closest to the actual values at the beginning of the forecast horizons and grew more distant over time. This could be due to the recursive forecasting strategy and how it accumulates errors; otherwise, it could suggest that these models are better suited to predicting the nearest months rather than longer terms.

The examples in section 4.5 showcased the difficulty of estimating most of the peaks in the time series. The forecasting models learned from past observations to predict the future: if fluctuations of similar magnitude did not occur in the training data, they were unlikely to be estimated in the future unless the models found underlying patterns describing the out-of-sample. The large shifts in demand could be the result of various factors influencing the buyers, such as changes in marketing campaigns or product availability. Had such data been utilized, we would probably have gotten a different outcome.

5.2 Ethics and Sustainability

Industries that manufacture high-technology products have significant expenses and consume vast amounts of resources to produce their business appliances. The raw materials for products such as chipboards are mined from nature, impacting the environment through forest depletion and soil contamination. Although this takes a toll on the environment, the materials are necessary for the technology our society requires, making our lives more comfortable and society more sustainable. By making good demand forecasts, companies can minimize excessive consumption of resources by aligning production with demand more accurately. In particular, a good phase-out forecast would help supply chain managers minimize the amount of product obsolescence by knowing the product's demand at the end of its life cycle. It could also improve the company's distribution and logistics by minimizing the warehouse space occupied by old goods.

This type of technology can also affect the traditional way of working in supply chain management, providing a way to conduct demand forecasting automatically rather than through manual human labor. However, it raises the ethical issue of programs replacing human workers. Although AI and machine learning have come a long way, it is still too early to replace this type of job entirely with a program. Machine learning models require guidance and are only suited for specific tasks; they lack the human cognition that is necessary when working in this domain. Our motivation for this study was to provide a valuable tool that the demand planners could use to improve their strategies.

Another ethical aspect to consider is the use of customer data. Misusing such data could have severe consequences, making the customers vulnerable. The data in this thesis consisted of aggregated demand patterns from customers worldwide. No personal data was used besides the requested quantities of specific products. Because all quantities were summed, no individual customer or market area could be distinguished in the time series.

5.3

Limitations and Challenges

We faced a few challenges during the project. The first was to find a method that could model time series of different lengths; not all types of hardware products are active in the market for the same amount of time, so the series do not contain an equal number of observations. Moreover, to narrow down the model selection process, other papers were studied. However, our dataset consisted of a unique set of business products, whose individual demand patterns are less common than those of consumer goods in the retail domain.

One limitation was the number of features used in this thesis, where only the demand variable was considered. The domain experts have more information to draw on when estimating future demand, such as approved deals, marketing campaigns, competitor information, and market shares. This could be one reason why we could not improve the forecast accuracy. Such external attributes could help explain the volatility present in the time series, giving the demand's sharp shifts the necessary input. However, such pieces of information are usually not stored in a structured data format.

5.4

Future Works

Based on the previous section, further investigation of the SVR model would be interesting, as it showed the most promising forecasting results of the three models. Adding explanatory variables could also mitigate the volatility occurring in the time series. Potential features would be, for instance, sales force data containing the number of deals closed for a specific product over time. Such attributes would give the models valuable advance information, especially when significant deals go through that would likely appear as spikes in future demand.


Further exploration could also address how far back one should gather data when forecasting the phase-out. The results presented earlier suggest that there may be an optimal amount of training data for the models.
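One way to probe this would be a simple sweep over training-window sizes, scoring each window on the held-out phase-out horizon. The sketch below assumes a hypothetical `fit_and_forecast` helper that fits a model on a series and returns a forecast of the requested length; the fractions are illustrative.

```python
import numpy as np

def best_training_fraction(y_train, y_test, fit_and_forecast,
                           fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Fit on only the most recent fraction of the training series and
    keep the fraction with the lowest MASE on the held-out horizon."""
    scores = {}
    for f in fractions:
        recent = y_train[int(len(y_train) * (1 - f)):]
        forecast = fit_and_forecast(recent, len(y_test))
        # MASE: forecast error scaled by the in-sample naive one-step error.
        naive_error = np.mean(np.abs(np.diff(recent)))
        scores[f] = np.mean(np.abs(y_test - forecast)) / naive_error
    return min(scores, key=scores.get), scores
```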

Another approach to dealing with volatility in time series would be a cumulative transformation. Accumulating the demand at each time step would smooth out the time series while still allowing future quantities to be estimated, by subtracting the last observed cumulative volume from the predicted one. One could then also examine whether the demand keeps increasing or starts to dampen when forecasting the phase-out stage.
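A minimal sketch of this transformation, assuming a plain NumPy demand series: the model would be fitted on the running total, and per-period demand recovered by differencing the forecast against the last observed cumulative volume. A flattening cumulative forecast would then indicate dampening demand, while a steepening one would indicate growth.

```python
import numpy as np

def to_cumulative(demand):
    # The running total turns a volatile series into a smoother,
    # monotonically increasing curve that is easier to model.
    return np.cumsum(demand)

def from_cumulative(cum_forecast, last_observed_cum):
    # Differencing recovers per-period demand; the first step subtracts
    # the last observed cumulative volume from the first forecast point.
    return np.diff(np.concatenate(([last_observed_cum], cum_forecast)))
```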

After studying phase-out forecasts, it may be of interest to investigate ramp-up forecasting, the beginning of a product life cycle. However, it may require a different strategy than the one used in this thesis in order to account for new products entering the market, for which no data is available for modeling. One solution could be a clustering approach [34]: grouping old product life cycles based on their demand patterns during the ramp-up. Each cluster would carry labels that cover all of its products, e.g. a notation of application usage. The ramp-up estimate for a new product would then be the average volume of the clustered products with the same label.
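A sketch of this clustering idea on synthetic ramp-up curves; the data, cluster count, and normalization are assumptions for illustration, with KMeans standing in for the time-series clustering methods surveyed in [34].

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic ramp-up curves: 30 past products, 12 months of demand each.
ramp_ups = np.abs(rng.normal(1.0, 0.3, (30, 12))).cumsum(axis=1)

# Normalize each curve by its final volume so that shape, not scale,
# drives the grouping.
shapes = ramp_ups / ramp_ups[:, -1:]
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(shapes)

# The ramp-up estimate for a new product would be the average curve of
# the cluster whose label (e.g. application usage) matches the product.
cluster_means = np.array(
    [ramp_ups[km.labels_ == k].mean(axis=0) for k in range(3)]
)
```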


Chapter 6

Summary and Conclusion

In this thesis, we investigated if data-driven models could improve the accuracy of forecasting the demand during the last stage of a product life cycle. The final phase consisted of the 12 months before the last-time-buy date of a product. A comparative study was conducted to compare the performances of ARIMA, SVR, and GPR with forecasts made by domain experts. Each method made predictions on 14 different time series consisting of demanded quantities, which were evaluated using MASE. Besides the primary goal, an examination of how different training data sizes impact the models' performances was included.
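For reference, the standard non-seasonal form of MASE (cf. Hyndman and Koehler [19]) scales the mean absolute error over the forecast horizon $[T+1, T+H]$ by the in-sample one-step naive error:

$$
\mathrm{MASE} = \frac{\frac{1}{H}\sum_{h=1}^{H}\bigl|y_{T+h} - \hat{y}_{T+h}\bigr|}{\frac{1}{T-1}\sum_{t=2}^{T}\bigl|y_{t} - y_{t-1}\bigr|}
$$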

The results of this study showed that SVR performed best on the time series compared to ARIMA and GPR. However, none of the models improved the overall forecast accuracy relative to the domain experts' estimates. A statistical test did support that the forecasting errors of SVR and the baseline were not significantly different, which motivates further investigation of SVR in future work. The models appeared to estimate better on some of the products; one explanation is that some products had a higher tendency toward volatility, which was harder to model. Tackling the volatility might require adding other features or transforming the data. Furthermore, the study showed that using each product's entire training data to fit the models resulted in lower performance. Only 60% of the training data seemed to be optimal for ARIMA and GPR, while SVR performed best with 80%. The assumption that more data leads to better performance does not necessarily hold for univariate time series forecasting. Although this study covered only 14 different time series, it raises the question: how far back do the models need to look in order to predict the future?


Bibliography

[1] David R. Rink and John E. Swan. “Product life cycle research: A literature review”. In: Journal of Business Research 7 (1979), pp. 219–242.

[2] Rajeev Solomon, Peter Sandborn, and Michael Pecht. “Electronic part life cycle concepts and obsolescence forecasting”. In: Components and Packaging Technologies 23 (2001), pp. 707–717.

[3] C.W. Lamb, J.F. Hair, and C. McDaniel. “Marketing”. In: Cengage Learning, (2012).

[4] Gokhan Usanmaz. “End-of-life cycle product management”. In: MIT Press, (2000).

[5] Kaijie Zhu and Ulrich Thonemann. “An adaptive forecasting algorithm and inventory policy for products with short life cycles”. In: Naval Research Logistics 51 (2004), pp. 633–653.

[6] C.W. Chase. “Demand-Driven Forecasting: A Structured Approach to Forecasting”. In: Wiley, (2013).

[7] Kasey Panetta. “Gartner Predicts 2019 for Supply Chain Operations”. In: Gartner, (2018). url: http://precog.iiitd.edu.in/people/anupama.

[8] Wang Guanghui. “Demand Forecasting of Supply Chain Based on Support Vector Regression Method”. In: Procedia Engineering 29 (2012), pp. 280–284.

[9] Lucia Pancíková, Martina Hlinková, and Lukas Falat. “Prediction Model for High-Volatile Time Series Based on SVM Regression Approach”. In: International Conference on Information and Digital Technologies (2015), pp. 77–83.

[10] Nesreen Ahmed et al. “An Empirical Comparison of Machine Learning Models for Time Series Forecasting”. In: Econometric Reviews 29 (2010), pp. 594–621.

[11] Ilker Poyraz and Ahmet Gürhanli. “Drug Demand Forecasting for Pharmacies with Machine Learning Algorithms”. In: International Journal of Engineering Research and Applications 10 (2020), pp. 51–54.

[12] Hossein Abbasimehr, Mostafa Shabani, and Mohsen Yousefi. “An optimized model using LSTM network for demand forecasting”. In: Computers and Industrial Engineering (2020), p. 106435.

[13] Yekta Amirkhalili, Amir Aghsami, and Fariborz Jolai. “Comparison of Time Series ARIMA Model and Support Vector Regression”. In: International Journal of Hybrid Information Technology 13 (2020), p. 12.

[14] Robert Shumway and David Stoffer. “Time Series Analysis and Its Applications: With R Examples”. In: Springer, (2017).

[15] A. Nielsen. “Practical Time Series Analysis: Prediction with Statistics and Machine Learning”. In: O’Reilly, (2019).

[16] Gianluca Bontempi, Souhaib Ben Taieb, and Yann-Aël Le Borgne. “Machine Learning Strategies for Time Series Forecasting”. In: Business Information Processing 138 (2013), pp. 62–77.

[17] Souhaib Ben Taieb, Antti Sorjamaa, and Gianluca Bontempi. “Multiple-output modeling for multi-step-ahead time series forecasting”. In: Neurocomputing 73 (2010), pp. 1950–1957.

[18] Iebeling Kaastra and Milton Boyd. “Designing a neural network for forecasting financial and economic time series”. In: Neurocomputing 10 (1996), pp. 215–236.

[19] Rob Hyndman and Anne Koehler. “Another look at measures of forecast accuracy”. In: International Journal of Forecasting 22 (2006), pp. 679–688.

[20] George Box, Gwilym Jenkins, Gregory Reinsel, and Greta Ljung. “Time Series Analysis: Forecasting and Control”. In: Wiley, (2016).

[21] Kevin Murphy. “Machine Learning: A Probabilistic Perspective”. In: MIT Press, (2012).

[22] Lars Buitinck et al. “API design for machine learning software: Experiences from the scikit-learn project”. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning (2013), pp. 108–.

[23] Taylor G. Smith. “Pyramid: ARIMA estimators for Python”. In: Pyramid (2018). url: http://alkaline-ml.com/pmdarima/0.9.0/index.html.

[24] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. “Deep Learning”. In: MIT Press, (2016).

[25] Matthew G. Karlaftis. “Demand forecasting in regional airports: dynamic Tobit models with GARCH errors”. In: Sitraer 7 (2008), pp. 100–111.

[26] Steven Scott and Hal Varian. “Predicting the Present with Bayesian Structural Time Series”. In: Int. J. of Mathematical Modelling and Numerical Optimisation 5 (2014), pp. 4–23.

[27] Phillip M. Yelland. “A Model of the Product Lifecycle for Sales Forecasting”. In: Sun Microsystems, Inc, (2004).

[28] Athanassia Chalimourda, Bernhard Schölkopf, and Alex Smola. “Experimentally optimal ν in support vector regression for different noise models and parameter settings”. In: Neural Networks: The Official Journal of the International Neural Network Society 18 (2005), pp. 127–141.

[29] Skipper Seabold and Josef Perktold. “Statsmodels: Econometric and Statistical Modeling with Python”. In: Proceedings of the 9th Python in Science Conference (2010), pp. 92–96.

[30] C. Rasmussen and C. Williams. “Gaussian Processes for Machine Learning”. In: MIT Press, (2006).

[31] Eric Ziegel and E. Girden. “ANOVA: Repeated Measures”. In: Technometrics 35 (1993), pp. 464–465.

[32] Wolfgang Wiedermann and Alexander von Eye. “Robustness and Power of the Parametric T Test and the Nonparametric Wilcoxon Test under Non-Independence of Observations”. In: Psychological Test and Assessment Modeling 55 (2013), pp. 39–61.

[33] David Pillemer. “One- Versus Two-Tailed Hypothesis Tests in Contemporary Educational Research”. In: Educational Researcher 20 (1991), pp. 13–17.

[34] Sr. Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Wah. “Time-series clustering – A decade review”. In: Information Systems 53 (2015), pp. 16–38.



