
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Transfer Learning for Sales Volume Forecasting Using Convolutional Neural Networks

MARCUS ALSTERMAN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Computer Science
Date: June 27, 2019
Supervisor: Johan Gustavsson
Examiner: Örjan Ekeberg
School of Electrical Engineering and Computer Science
Swedish title: Transfer learning för prediktion av


Abstract

Improved time series forecasting accuracy can enhance demand planning, and therefore save money and reduce environmental impact. The idea behind this degree project is to explore transfer learning for time series forecasting. This has boiled down to two concrete goals. The first is to examine if transfer learning can improve the forecasting accuracy when using a convolutional neural network (CNN) with dilated causal convolutions. The second goal is to investigate whether transfer learning makes it possible to forecast time series with less historical data.

In this project, time series describing sales volume and price from three different consumer appliances are used. The length of the time series is about three years. Two transfer learning techniques are used: shared-hidden-layer CNN (SHL-CNN) and pre-training. To tackle the first goal, the two transfer learning techniques are benchmarked against a CNN. The second goal is investigated by conducting an experiment where the training set size varies for both a CNN and the two transfer learning techniques.

The results from the first experiment indicate that transfer learning neither increases nor decreases forecasting accuracy. Interestingly, the second experiment shows that only 60 % (40 % for the SHL-CNN) of training samples is optimal for all models. This goes against the intuition that more training data leads to better model performance, and it is most likely a phenomenon specific to time series forecasting. Although the proportion of 60 % is most likely application specific, we also find that at this proportion, pre-training from any of the other products improves the forecasting accuracy. Finally, reducing the training set further (to 20 % of training samples) affects the models differently. One pre-training model performs better than the rest, which perform very similarly. This indicates that there are cases where transfer learning allows for forecasting smaller time series. However, further studies are required to establish how general these observations are.


Sammanfattning

Better time series forecasting can improve supply chain planning, thereby saving money and reducing environmental impact. The idea behind this degree project is to explore transfer learning for forecasting time series. This results in two concrete goals. The first is to examine whether transfer learning can improve the forecasting accuracy when a convolutional neural network (CNN) with dilation and causality is used. The second goal is to investigate whether transfer learning makes it possible to forecast time series with less historical data. The time series used consist of sales volumes and prices from three consumer appliances of the same kind. The length of the time series is about three years. Two transfer learning techniques are used: shared-hidden-layer CNN (SHL-CNN) and pre-training of a CNN.

To address the first goal, the forecasting accuracy of the two transfer learning techniques is compared to that of a CNN. The second goal is investigated through an experiment where the size of the training set is varied for a CNN and the two transfer learning techniques.

Results from the first experiment indicate that transfer learning neither worsens nor improves the forecasting accuracy. The second experiment shows that when the number of training samples is reduced to 60 % (40 % for the SHL-CNN), the predictions improve for all models. This is counter-intuitive and is likely a phenomenon specific to time series forecasting. Furthermore, the proportion of 60 % is specific to this project, and we also find that at this proportion the predictions from pre-training are better than those from the CNN. The final finding is that when the number of training samples shrinks to 20 %, the pre-training model performs better than the others. This indicates that transfer learning can, in some cases, make it possible to forecast time series with less historical data.


Contents

1 Introduction
1.1 Time Series Forecasting
1.2 Transfer Learning
1.3 Problem Statement
1.4 Scope

2 Background
2.1 Time Series Forecasting
2.1.1 Preprocessing
2.1.2 Forecast Accuracy Measures
2.1.3 Methods for Validation
2.1.4 Classical Linear Forecasting Models
2.1.5 Artificial Neural Networks
2.2 Transfer Learning
2.2.1 Pre-training
2.2.2 Autoencoder
2.2.3 Shared-hidden-layer
2.3 Research Design

3 Method
3.1 Dataset
3.2 Models
3.2.1 CNN
3.2.2 Pre-training
3.2.3 SHL-CNN
3.2.4 Parameter Selection
3.3 Experiments
3.3.1 Overall Performance
3.3.2 Varying Training Set Size
3.4 Implementation

4 Results
4.1 Parameter Selection
4.1.1 CNN
4.1.2 SHL-CNN
4.1.3 Pre-training
4.2 Overall Performance
4.3 Varying Training Set Size

5 Discussion
5.1 Discussion of Results
5.2 Related works
5.3 Future Work
5.4 Ethics and Sustainability

6 Conclusions

1 Introduction

1.1 Time Series Forecasting

Time series forecasting can be described as making predictions about future events given historical data. A variety of applications exist; however, some time series are more difficult to forecast than others. Tomorrow's winning lotto numbers should be impossible to forecast, whereas other phenomena are feasible to forecast. A practical example where time series forecasting can be beneficial is when a country must decide whether or not to build a new power plant. In this case, accurate forecasts of future electricity consumption can help planners make an informed decision. Other common applications are prediction of the stock market, electricity usage and sales volumes [1, 2].

Historically, forecasting has been dominated by classical linear models. The autoregressive moving-average (ARMA) model was described in 1951 by Peter Whittle [3]. It was later succeeded by the more general and versatile autoregressive integrated moving average (ARIMA) that Box & Jenkins introduced in 1970 [4]. Other methods are state space models and Holt-Winters' seasonal method [5]. A potential downside of these models is that they are restricted by their linearity and cannot forecast non-linear patterns [6].

A more recent machine learning model is the artificial neural network (ANN). It had not been particularly successful in time series forecasting according to a review article by De Gooijer and Hyndman [7] in 2006. However, research into ANNs has grown recently with the introduction of deep learning, providing state-of-the-art results in tasks such as image recognition, text translation and text generation. Compared to these tasks, ANN research in time series forecasting is not as commonplace. Yet examples do exist: in 2015 Szoplik [2] successfully used a multilayer perceptron (MLP) to forecast gas consumption, and in 2018 Slawek Smyl won the M4 forecasting competition (a big international forecasting competition) using a hybrid neural network approach [8].

Depending on the domain, different types of neural networks are used. In image classification, arguably the biggest field using neural networks, convolutional neural networks (CNN) are present in many state-of-the-art networks [9, 10]. Convolutions can capture small meaningful features while at the same time keeping the number of weights low [11]. In sequence modeling tasks (tasks where the order of the data matters), such as language modeling, text generation and handwriting generation, the long short-term memory (LSTM) architecture has been used successfully. Recently, CNNs have become a strong competitor to LSTMs in sequence modeling tasks according to Bai, Kolter, and Koltun [12]. Since time series data has temporal dependencies, the field is similar to both sequence modeling and image classification (local structural dependencies between pixels).

A recently developed CNN is a network called WaveNet [13]. It was built to generate raw audio waveforms. A key component that makes the network successful is the dilated causal convolution, a special kind of convolution that is well suited for time series data. It makes the receptive field large, allowing the network to use more information, while at the same time keeping the complexity low.

1.2 Transfer Learning

A common practical problem in many machine learning applications is the amount of data available. Less data makes complex models more likely to overfit (generalize poorly). If more training data can be made available, it can improve model performance. However, in many real scenarios it is challenging, if not impossible, to gain access to large amounts of quality data. For example, for newly developed consumer products there is no historical sales data. One approach to circumvent this problem is to utilize a technique known as transfer learning. It explores the problem of storing knowledge gained by solving a different but still similar problem, to better solve the target task. The field has recently gained traction in research, where transfer learning is used to achieve state-of-the-art results [14, 15, 16]. The transfer learning subcategory inductive transfer learning has lately had an impact on the computer vision field: applied models are fine-tuned from models pre-trained on big datasets such as ImageNet [16]. Apart from increased accuracy, transfer learning can also shorten the training time and thereby decrease the use of computational resources.

Another practical problem with neural networks is that they can be computationally heavy to optimize. Some computer vision models require several weeks to train on expensive hardware. When reusing parts of a pre-trained model, the training time shrinks significantly. Successfully utilizing transfer learning can therefore increase model performance while at the same time reducing computing costs.

1.3 Problem Statement

The purpose of this degree project is to evaluate transfer learning methods for time series forecasting of sales for various consumer appliances. The report aims to answer the following research questions. When forecasting time-series data of sales using a CNN with dilated causal convolutions:

• How effective is transfer learning in terms of improving forecasting accuracy?

• To what extent can transfer learning allow for forecasting sales on products with less historical data?

1.4 Scope

To answer the research questions, this degree project uses sales data from three consumer appliances. The data consists of daily recorded sales volumes and prices. For some dates the data is unavailable and the value needs to be imputed. As the data is confidential, normalized and anonymized values are presented. The time series length is about three years.

This study leads to conclusions on whether transfer learning is suitable for time series forecasting using a CNN with dilated causal convolutions. If the technique is successful, the forecasting accuracy of both new and existing products could improve, allowing for better demand planning. This could give both economic and environmental benefits when the supply chain is optimized. Regardless of its success, the insights and learnings from the project might be of interest for the research community and/or demand planners.

2 Background

The chapter describes the theory that is essential to the degree project. The first section explains time series forecasting and neural networks, while the second describes transfer learning. In the last section the method choice is justified.

2.1 Time Series Forecasting

A time series, denoted $X_T$, is a set of observations $x_t$, each recorded at a specific time $t$. The series can be either continuous or discrete. In the discrete and most commonly used case, the time steps make up a discrete set. Often the observations are recorded at fixed time intervals; for daily time series the difference is exactly one day between each observation [17]. We denote a forecasted value at time $t$ with $\hat{x}_t$. Additionally, we define the forecast origin as the latest point in time with a known value. The input horizon is the number of past values used, while the forecast horizon is the number of future values to forecast.

Forecasting future values can be done using single-step or multiple-step forecasting. In the single-step case one future value is forecasted at a time. Given the forecast origin $t$ and input horizon $p$, we forecast $\hat{x}_{t+1}$ using the model $f$,

$$\hat{x}_{t+1} = f(x_t, \dots, x_{t-p+1}) \quad (2.1)$$

There are two commonly used approaches to forecast multiple time steps ahead according to Hamzaçebi, Akay, and Kutay [18]. In the direct method, the model $f$ outputs several values for a given input,

$$\hat{x}_{t+h}, \dots, \hat{x}_{t+1} = f(x_t, \dots, x_{t-p+1}) \quad (2.2)$$

where $h$ is the forecast horizon. According to Taieb et al. [19] it is the favored method for forecasting multiple steps. In the iterative method, a single-step model is used to forecast one step at a time, recursively using the forecasted values as input for the next forecast. It is a flexible approach that can produce forecasts of any length. It may also be the only option, as some models can only produce one value at a time.
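The iterative method is straightforward to express in code. The following is a minimal sketch, not the project's exact implementation; `f` stands in for any one-step model and `p` is the input horizon.

```python
import numpy as np

def iterative_forecast(f, history, horizon, p):
    """Forecast `horizon` steps ahead by applying a one-step model recursively."""
    window = list(history[-p:])
    forecasts = []
    for _ in range(horizon):
        x_next = f(np.array(window))    # one-step forecast from the last p values
        forecasts.append(x_next)
        window = window[1:] + [x_next]  # slide the window, feeding the forecast back in
    return np.array(forecasts)
```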

Time series can be either univariate or multivariate. In the univariate case each observation is a single value. For multivariate series, each observation is a set of values. When forecasting one phenomenon it might be interesting to take another into account. However, adding additional series that are uncorrelated may introduce noise and decrease the model performance. An example of a multivariate time series is recordings of both wind speed and power generated from a wind turbine. The benefit of multivariate series is that more data is available, possibly increasing model performance.

2.1.1 Preprocessing

Preprocessing is an important part of practical machine learning and time series forecasting. Time series data can manifest a variety of shapes and patterns. It is often helpful and sometimes necessary to split the time series into several components. This technique is known as time series decomposition. The components are trend, seasonality and cycles, which represent underlying patterns. Typically one assumes the parts are either additive or multiplicative. The trend-cycle is often estimated using moving averages. A popular technique for decomposing monthly and quarterly data is the X11 method, which is based on classical decomposition with extra steps and features [5]. When the patterns have been identified they can be removed from the series using a technique known as deseasonalization, creating a seasonally adjusted series [5].

ANNs are versatile and can learn complex tasks. Some say seasonality is handled directly and prior deseasonalization is not needed, while others argue the opposite [20]. In their article, Zhang and Qi conclude that combined detrending and deseasonalization is an effective data preprocessing approach [20]. It shall be noted that since the article's release in 2005, more advanced neural networks have been developed.

When working with neural networks, normalization of data is often performed. Values are transformed using the following formula,

$$x_i' = \frac{x_i - \min_i x_i}{\max_i x_i - \min_i x_i}$$

This ensures all values are in the interval [0, 1]. When normalizing test data one uses the minimum and maximum from the train dataset.

Table 2.1: Common forecast accuracy measures and their definitions. $h$ is the forecast horizon.

Measure   Definition
MSE       $\frac{1}{h}\sum_{i=t+1}^{t+h}(x_i - \hat{x}_i)^2$
RMSE      $\sqrt{\frac{1}{h}\sum_{i=t+1}^{t+h}(x_i - \hat{x}_i)^2}$
MAPE      $\frac{100\%}{h}\sum_{i=t+1}^{t+h}\left|\frac{x_i - \hat{x}_i}{x_i}\right|$
sMAPE     $\frac{100\%}{h}\sum_{i=t+1}^{t+h}\frac{|x_i - \hat{x}_i|}{(|x_i| + |\hat{x}_i|)/2}$

2.1.2 Forecast Accuracy Measures

There are multiple forecast accuracy measures. De Gooijer and Hyndman [7] list several in their review article about time series analysis. A selection of commonly used measures are: mean square error (MSE), root mean square error (RMSE), mean absolute percentage error (MAPE) and symmetric mean absolute percentage error (sMAPE). Their definitions can be seen in table 2.1. Both the MSE and RMSE are continuously differentiable, making them appropriate for training neural networks. However, a problem is that they are scale dependent and thus not appropriate for comparison between different time series. Both MAPE and sMAPE are scale independent, making them appropriate for comparing accuracy on different time series. sMAPE was proposed by Makridakis [21] in 1993. It was a response to MAPE putting a heavier penalty on positive errors than on negative errors. However, it is not truly symmetric and not as easily interpreted as MAPE [7]. A last advantage of sMAPE is that it can handle zeros in the observed series, whereas MAPE cannot.
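As a concrete reference, sMAPE as defined in table 2.1 can be computed with a few lines of NumPy; this is a minimal sketch of the measure, not code from the project.

```python
import numpy as np

def smape(x, x_hat):
    """Symmetric mean absolute percentage error in percent."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return 100.0 * np.mean(np.abs(x - x_hat) / ((np.abs(x) + np.abs(x_hat)) / 2.0))
```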

Figure 2.1: Four approaches of Bergmeir and Benítez [22] to validate time series forecasts: (a) fixed-origin evaluation; (b) rolling-origin-recalibration evaluation and rolling-origin-update evaluation; (c) rolling-window evaluation.

2.1.3 Methods for Validation

Compared to computer vision tasks such as image classification, validation is not as straightforward in time series forecasting. Methods such as k-fold cross validation are not applicable because of the temporal dependencies in the data. It is still possible to split the data into a training and a test set. One can also perform walk-forward cross validation. Bergmeir and Benítez [22] suggest the use of a blocked form of cross-validation for time series. The authors define four possibilities for evaluating forecasts on individual time series. They are visualized in figure 2.1 and described below.

Fixed-origin evaluation

In fixed-origin evaluation, seen in figure 2.1a, the forecast origin is fixed in place. The model is trained using data up to and including the forecast origin. Model output is then compared to the true values. If applied to only one time series the evaluation method has shortcomings. As only one forecast is generated, model evaluation is vulnerable to deviations unique to the forecast origin. This validation method is most commonly used for forecasting competitions with multiple series [22, 23].


Rolling-origin-recalibration evaluation

In rolling-origin-recalibration evaluation, visible in figure 2.1b, one iterates over multiple sequential values of the forecast origin. For each value, the model is trained from scratch using data up to and including the forecast origin. Model output is then compared to the true values. Compared to fixed-origin evaluation, multiple forecasts are generated on slightly different data. Overlap exists, hence the different examples are not entirely uncorrelated, as examples are in image classification. The approach is still more statistically solid [22]. A drawback is increased testing time, especially if the model recalibration is computationally expensive.

Rolling-origin-update evaluation

Rolling-origin-update evaluation, see figure 2.1b, is similar to rolling-origin-recalibration except that no retraining of the model is performed. The new values are only used to test the model. Tashman [23] argues that recalibration is the preferred procedure of the two, but rolling-origin-update evaluation is a good approach if retraining the model takes a long time.

Rolling-window evaluation

In figure 2.1c we show rolling-window evaluation. The training set size is now kept constant by throwing away old values when moving the forecast origin forward. One can think of the approach as two windows (train and test) sliding over the time series data. Rolling-window evaluation can be a good approach if old values tend to disturb the model performance.
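A minimal sketch of rolling-origin-recalibration evaluation follows; `train` and `forecast` are hypothetical helpers, and `smape` is the measure defined earlier.

```python
def rolling_origin_recalibration(series, origins, horizon, train, forecast):
    """Retrain from scratch at each forecast origin and score the forecasts."""
    scores = []
    for t in origins:
        model = train(series[: t + 1])    # fit on data up to and including the origin
        x_hat = forecast(model, horizon)  # forecast the next `horizon` values
        scores.append(smape(series[t + 1 : t + 1 + horizon], x_hat))
    return scores
```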

2.1.4 Classical Linear Forecasting Models

The ARMA model was introduced in 1951 by Peter Whittle [3]. Two parts make up the model. The autoregressive part says that a future value $\hat{x}_{t+1}$ is a linear combination of the last $p$ values with added noise $\varepsilon_{t+1}$ [5]. With the linear weights $\phi_i$ and the constant $c$, future values are forecasted as

$$\hat{x}_{t+1} = c + \sum_{i=0}^{p-1} \phi_i x_{t-i} + \varepsilon_{t+1} \quad (2.4)$$

The moving-average part dictates that a future value $\hat{x}_{t+1}$ is a regression on $q$ past forecast errors. With $\theta_i$ as the linear weights, we have

$$\hat{x}_{t+1} = c + \sum_{i=0}^{q-1} \theta_i \varepsilon_{t-i} + \varepsilon_{t+1} \quad (2.5)$$

Adding the autoregressive and moving-average parts together, the ARMA prediction becomes

$$\hat{x}_{t+1} = c + \sum_{i=0}^{p-1} \phi_i x_{t-i} + \sum_{i=0}^{q-1} \theta_i \varepsilon_{t-i} + \varepsilon_{t+1} \quad (2.6)$$

ARMA was later succeeded by the more general and versatile ARIMA that Box & Jenkins introduced in 1970 [4]. Instead of regressing over past values in the AR part, one uses differenced values. This is a statistical transformation that makes the series stationary [5]. A disadvantage of both the ARMA and ARIMA models is that they can require a lot of experience from an analyst in order to perform well. Additionally, they do not support seasonal data. Later on, new variants such as the multivariate ARIMA and seasonal ARIMA have arisen. The first allows use of multivariate time series, whereas the second directly models the seasonal component, allowing the use of seasonal time series [17].
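As a worked illustration, a one-step ARMA forecast following equation (2.6) can be sketched as below; the coefficient vectors would normally be estimated from data.

```python
import numpy as np

def arma_one_step(x, eps, phi, theta, c=0.0):
    """One-step ARMA forecast: c + sum(phi_i * x_{t-i}) + sum(theta_i * eps_{t-i})."""
    ar = sum(phi[i] * x[-1 - i] for i in range(len(phi)))        # autoregressive part
    ma = sum(theta[i] * eps[-1 - i] for i in range(len(theta)))  # moving-average part
    return c + ar + ma
```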

2.1.5 Artificial Neural Networks

ANNs come in many different architectures. At the time of writing they are the focus of much research. The applications are many, including image classification, image segmentation, text to speech, text translation and time series forecasting.

In their review article about time series forecasting, De Gooijer and Hyndman argue that ANNs have not been particularly successful. Instead they favor classical methods. Since the article's release in 2006, the ANN field has progressed. In 2015 Szoplik [2] successfully used a multilayer perceptron to forecast gas consumption. In 2018 the M4 forecasting competition (the fourth Makridakis competition [8]) was won by Slawek Smyl, who used a hybrid neural network approach. The M4 runner-ups used combinations of statistical methods. Earlier Makridakis competitions were dominated by linear and statistical models [8].

Training a neural network is generally computationally expensive, time consuming and requires a lot of training data. Selecting good parameters is often not an easy task. Generally, given a fixed network architecture, the more data available at training, the less likely a neural network is to overfit. This constitutes a problem when using time series data. There may not be multiple parallel time series of the same kind. Hence the only way to add data is to add observations further back in time. This can be problematic for time series, as their structure can change over time.

Three common architectures when dealing with sequential tasks or time series are feed forward neural networks, LSTMs and CNNs. The LSTM was introduced by Hochreiter and Schmidhuber [24] in 1997. The network has recurrent connections, allowing it to remember past inputs. This makes it suitable for sequence tasks. Recent research indicates that CNNs can outperform the LSTM on sequence tasks [25].

Convolutional neural networks

Convolutional neural networks are commonly applied in computer vision tasks. The architecture is inspired by the organization of the animal visual cortex. CNNs replace the matrix multiplication in feed forward networks with a convolution. This decreases the number of parameter weights needed, hence lowering the complexity of the network. Another advantage of the CNN is that it requires minimal pre-processing. Recent GPU-accelerated computing techniques have accelerated the use of CNNs, and as no recurrent connections exist in CNNs, they are often faster to train than the LSTM [13, 26].

Most CNN research has been focused on the computer vision realm. In 2016, Oord et al. [13] released their paper about the WaveNet model. It is a deep neural network (DNN) for generating raw audio waveforms. When they apply the network to text-to-speech tasks, state-of-the-art performance is achieved. A key component is the dilated causal convolution. Causal means that no information can flow from future values to forecasts of old values. This can allow for efficient parallelized training [13]. Figure 2.2a shows stacked causal convolutional layers. The receptive field (inputs connected to the output) is the same as the number of layers. Dilation spreads out the convolutions, which allows for more input using the same number of layers. The dilation rate refers to the distance between points that are fed to a convolution. The receptive field doubles for each layer if the dilation rate is doubled after each layer. This is the case in figure 2.2b, which illustrates the dilated causal convolution. The kernel size is fixed at 2 in both images and represents how many values are fed to the convolution. It is common to have several kernels in parallel, allowing for extraction of multiple features. The exact number is referred to as the number of filters.

Figure 2.2: An illustration of dilation with (a) causal convolution and (b) dilated causal convolution. The kernel size is two. Inspired by Oord et al. [13].
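In Keras, the library used for the implementation in this project, a stack of dilated causal convolutions can be sketched as follows. This is a minimal illustration of the technique, not the exact network used later.

```python
from tensorflow import keras
from tensorflow.keras import layers

def dilated_causal_stack(n_layers=9, filters=32, kernel_size=2):
    inputs = keras.Input(shape=(None, 1))      # (time steps, features)
    x = inputs
    for i in range(n_layers):
        x = layers.Conv1D(filters, kernel_size,
                          padding="causal",     # no information flows from the future
                          dilation_rate=2 ** i, # 1, 2, 4, ..., doubling per layer
                          activation="relu")(x)
    return keras.Model(inputs, x)               # receptive field: 2 ** n_layers
```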

Residual connections

A major recent innovation in the field of DNNs is the residual connection, published by He et al. [27] in 2016. When a DNN has multiple subsequent layers, one adds residual connections so that the layers learn modifications of the identity mapping [25]. The building block presented by He et al. [27] can be seen in figure 2.3. The technique allows for training deeper neural networks, and it is used in state-of-the-art networks such as those of Huang et al. [9], Szegedy et al. [28] and the WaveNet paper [13].

Figure 2.3: Residual learning building block. Inspired by He et al. [27].
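A minimal sketch of such a block, adapted to the one-dimensional convolutions used in this project (assumed Keras API; He et al.'s original block is two-dimensional), could look as follows. The input must already have `filters` channels so that F(x) + x is well defined.

```python
from tensorflow.keras import layers

def residual_block(x, filters, dilation_rate=1):
    y = layers.Conv1D(filters, 2, padding="causal",
                      dilation_rate=dilation_rate, activation="relu")(x)
    y = layers.Conv1D(filters, 2, padding="causal",
                      dilation_rate=dilation_rate)(y)       # F(x)
    return layers.Activation("relu")(layers.Add()([x, y]))  # relu(F(x) + x)
```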

2.2 Transfer Learning

Transfer learning can be described as the process of storing knowledge gained from solving one task to improve solving another but similar task. It is a field within machine learning that has grown recently. In 2010, Pan and Yang [29] surveyed the field of transfer learning and identified three different settings:

inductive transfer learning, transductive transfer learning and unsupervised transfer learning. Pan and Yang [29] define them as follows:

“• Inductive transfer learning: Given a source domain $D_S$ and a learning task $T_S$, a target domain $D_T$ and a learning task $T_T$, inductive transfer learning aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $D_T$ using the knowledge in $D_S$ and $T_S$, where $T_S \neq T_T$.

• Transductive transfer learning: Given a source domain $D_S$ and a learning task $T_S$, a target domain $D_T$ and a learning task $T_T$, transductive transfer learning aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $D_T$ using the knowledge in $D_S$ and $T_S$, where $D_S \neq D_T$ and $T_S = T_T$. In addition, some unlabeled target-domain data must be available at training time.

• Unsupervised transfer learning: Given a source domain $D_S$ and a learning task $T_S$, a target domain $D_T$ and a learning task $T_T$, unsupervised transfer learning aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $D_T$ using the knowledge in $D_S$ and $T_S$, where $T_S \neq T_T$ and the label domains $Y_S$ and $Y_T$ are not observable.”

A closely related and similar field to inductive transfer learning is multi-task learning. However, there several tasks are learned simultaneously instead of just focusing on the target task [29, 30]. Multi-task learning can also be seen as an approach to inductive transfer learning [31].

To the best of our knowledge, most transfer learning research is focused on computer vision and natural language processing, where large datasets exist. Identified transfer learning techniques applicable to time series forecasting are pre-training, autoencoders and SHL.


2.2.1 Pre-training

Pre-training is an approach to inductive transfer learning. State-of-the-art results were achieved by Sharif Razavian et al. [15] in 2014. They took the publicly available convolutional neural network Overfeat [32] and used the activations of a hidden layer as a fixed feature extractor, which they connected to a support-vector machine (SVM). This model was then applied to other recognition tasks: image classification, scene recognition, fine-grained recognition, attribute detection and image retrieval on a diverse set of datasets [15]. With simple augmentation techniques, they achieved state-of-the-art results on most of the tasks.

Another paper examining pre-training is Yosinski et al. [33]. They split the ImageNet dataset into two parts and examined pre-training between two networks trained on the distinct parts. They conclude that when reusing layers, fine-tuning improves generalization. When locking weights (no fine-tuning), performance drops due to representation specificity.

2.2.2 Autoencoder

Another approach to transfer learning is that of Laptev et al. [14]. They use an LSTM autoencoder as a powerful feature extractor. By feeding a time series through the autoencoder, a smaller and feature-rich representation is created. The autoencoder is shared between multiple series, hence being the entity where transfer learning takes place. The resulting representation is then used to train a series-specific LSTM forecasting model. With this approach the authors reported improved accuracy on company data.

2.2.3 Shared-hidden-layer

A DNN can be seen as a series of increasingly complex feature transformations [34]. This motivates the transfer learning technique called shared-hidden-layer. In it, the input layer and hidden layers are shared across similar tasks, while the tasks each have their own output layer.

A natural application is speech recognition. Here Huang et al. [34] use this approach by sharing layers between multiple languages, reporting that they reduced their errors by 3-5 %.

SHL has also been used for time series data. In their work, Hu, Zhang, and Zhou [1] forecast wind speed for the next 8 hours with a DNN that shares hidden layers and input. The network has separate output layers for the different wind farms. They find that this improves the forecast accuracy compared to training separate neural networks for each wind farm [1]. However, although the accuracy improves, it is still very similar to that of another forecasting model.

2.3 Research Design

LSTMs have long been a popular neural network approach for sequential problems, while CNNs have been the state of the art for computer vision tasks. Interestingly, this seems to be changing. Research indicates that convolutions are more suitable for sequence problems [12]. This change seems to be mainly due to the introduction of dilated causal convolutions. Hence, CNNs make a good candidate for time series forecasting and were therefore chosen for use in this degree project.

Transfer learning is well established in computer vision, where it has been used with CNNs to great success. In the forecasting field it is a different story. Here classical methods have dominated for a long time, and only recently have neural networks achieved state-of-the-art results [8]. This may be a reason for the little to no research on transfer learning for time series forecasting. Three transfer learning techniques were identified in section 2.2. Of these, SHL and pre-training were chosen for evaluation. Hence there are three models to be used to forecast time series.

• CNN: No transfer learning using a CNN.

• SHL-CNN: Transfer learning using SHL on a CNN.

• Pre-training: Using a CNN, first train on a different but similar time series and then on the target time series.

With the three models, forecast accuracy can be compared, thereby addressing the first research question. To address the second, the length of a time series can be reduced while forecast accuracy is assessed.

To measure the forecast accuracy over multiple series, the scale-independent sMAPE was chosen. MAPE was ruled out after finding zeros in the selected time series. No deseasonalizing of the data was performed. This approach was chosen because the degree project is limited in time and, as Zhang and Qi [20] mention in their article, it is unclear whether deseasonalization is necessary when using neural networks. Additionally, the forecast horizon was chosen as 120 days.


To the best of our knowledge, there is little research covering transfer learning for time series, CNNs for time series forecasting and especially SHL-CNNs for time series forecasting. We have found no papers that look at all three problems at once.

3 Method

3.1 Dataset

The data used in this degree project consists of daily sales volumes and prices for three consumer appliances of the same kind. We denote them A, B and C. As two variables are present, the time series are multivariate. All three datasets contain daily recorded values over a period of over three years, and the variable to forecast is the daily sales volume. For the three products, the price is set in advance, allowing models to use future price values when forecasting the sales volume. For example, when predicting the sales one day ahead, the model inputs the price of one day ahead.

Some values were missing in the data. Training data and input values were filled using a heuristic. If the price was missing, the previous day's value was used. If a sales volume was missing and the price was the same as on the same weekday the previous week, that day's sales volume was used; otherwise the previous day's value was used. Data points used for testing the model were instead filled with NaNs (not a number) and did not contribute to the calculated accuracies.
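A minimal sketch of this imputation heuristic, using pandas and hypothetical column names, could look as follows.

```python
import pandas as pd

def impute(df):
    """df has hypothetical columns 'price' and 'volume', one row per day."""
    df = df.copy()
    df["price"] = df["price"].ffill()  # missing price -> previous day's price
    vol = df.columns.get_loc("volume")
    for i in range(1, len(df)):
        if pd.isna(df["volume"].iloc[i]):
            j = i - 7                  # same weekday one week earlier
            if j >= 0 and df["price"].iloc[i] == df["price"].iloc[j]:
                df.iloc[i, vol] = df["volume"].iloc[j]
            else:
                df.iloc[i, vol] = df["volume"].iloc[i - 1]
    return df
```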

We normalized each time series independently using the following formula,

$$x_i' = \frac{x_i - \min_i x_i}{\max_i x_i - \min_i x_i} \quad (3.1)$$

where $x_i$ is the observation at time $i$. For price data the minimum and maximum values were computed on the whole series, whereas for sales volumes the minimum and maximum were calculated from observations before the simulated forecast origin. A year of the normalized time series can be seen in figure 3.1. No deseasonalizing of the data was performed, as discussed earlier (see section 2.3).
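A sketch of the normalization in equation (3.1) is shown below; computing the sales-volume statistics only from values before the forecast origin keeps test information out of training. The function and argument names are hypothetical.

```python
import numpy as np

def normalize(series, origin=None):
    """Min-max normalize; if `origin` is given, statistics use only values up to it."""
    ref = series if origin is None else series[: origin + 1]
    lo, hi = ref.min(), ref.max()
    return (series - lo) / (hi - lo)
```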

Figure 3.1: Normalized sales volumes of the three products, (a) A, (b) B and (c) C, over the span of a year.

3.2 Models

The main forecasting model was a neural network using dilated causal convolutions. It served as a basis for the transfer learning approaches in sections 3.2.2 and 3.2.3. We designed our models to forecast 120 days ahead. For all neural networks, the AdaGrad optimizer [35] was used, as it is more forgiving when setting the learning rate [35]. Other advanced optimizers using momentum (a moving average of gradient updates) could not easily be used with our implementation of the SHL-CNN. In table 3.1 we present all models used.

All three models (CNN, SHL-CNN and pre-training) share the same dimensions of the input and output data. When examining the data and developing the models, it was noticed that yearly seasonal patterns existed. This led to having an input horizon greater than a year.

Table 3.1: All models used, and the products (A, B, C) they were trained and evaluated on, respectively.

Model                Trained on   Evaluated on
CNN - A              A            A
CNN - B              B            B
CNN - C              C            C
SHL-CNN - A          A, B, C      A
SHL-CNN - B          A, B, C      B
SHL-CNN - C          A, B, C      C
Pre-training - A2B   A, B         B
Pre-training - A2C   A, C         C
Pre-training - B2A   B, A         A
Pre-training - B2C   B, C         C
Pre-training - C2A   C, A         A
Pre-training - C2B   C, B         B

3.2.1 CNN

To achieve the desired input horizon in both price and sales volumes, we used two heads of nine stacked dilated causal convolutional layers with residual connections, making the receptive field 512. This network architecture is displayed in figure 3.2. The first head used the last 512 observations of sales volumes. For the second head we wanted to input both historical prices and the price of the day to forecast; therefore we input the last 511 values and one future value of the price. The number of filters of the convolutions was later determined by a hyperparameter search. We concatenated the output of the convolutional layers along the last dimension. Two fully connected layers were then applied, which produced the model output. The last value of the output corresponds to the future forecast. ReLU was used as the activation function throughout the network architecture, as it has been shown to enable better training of DNNs [36].
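A minimal sketch of this two-headed architecture follows (assumed Keras API; the residual wiring and the filter count of 32 are simplifications, not the exact implementation).

```python
from tensorflow import keras
from tensorflow.keras import layers

def head(inputs, filters=32):
    # first dilated causal layer, then eight more with residual connections
    x = layers.Conv1D(filters, 2, padding="causal", activation="relu")(inputs)
    for i in range(1, 9):
        y = layers.Conv1D(filters, 2, padding="causal",
                          dilation_rate=2 ** i, activation="relu")(x)
        x = layers.Add()([x, y])  # residual connection
    return x

price_in = keras.Input(shape=(512, 1), name="price")          # 511 past + 1 future price
volume_in = keras.Input(shape=(512, 1), name="sales_volume")  # 512 past sales volumes
merged = layers.Concatenate()([head(price_in), head(volume_in)])
hidden = layers.Dense(128, activation="relu")(merged)
output = layers.Dense(1)(hidden)  # the last time step holds the one-step forecast
model = keras.Model([price_in, volume_in], output)
model.compile(optimizer="adagrad", loss="mse")
```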

When training the network, training samples were made as sequential slices from the training data. Each training sample consisted of 512 sales volumes and 512 price values one day ahead of the corresponding sales volume, making the input horizon 512 and the forecast horizon 1. A single sample can be seen in figure 3.3a.

Figure 3.2: Network architecture for the CNN. All convolutions are causal and use dilation. n refers to the batch size and p to the number of past values; the example shows 32 filters.

Figure 3.3: Difference between (a) a training sample (forecast horizon 1) and (b) a test sample (forecast horizon 120). The forecast origin is at the dashed line. Model output is indicated in red, input in grey.

After the training phase, the model was tested on one test sample. Such a sample instead had a forecast horizon of 120 days, see figure 3.3b. The test sample's target values were the values after the forecast origin, whereas its input values originated from the end of the train set. To make forecasts with the larger forecast horizon we used the iterative method described in section 2.1.

3.2.2 Pre-training

To evaluate pre-training we used the CNN architecture described in section 3.2.1. Given a forecast origin and two products, A and B, we first loaded the network weights that were trained on product A at that forecast origin, using the best parameters found. Then we trained the network on data from product B in the same manner as in section 3.2.1. During this training no layer weights were frozen, as Yosinski et al. [33] reported their best results when fine-tuning all layers.

Table 3.2: Tested parameters during parameter selection of the CNN, SHL-CNN and pre-training (training on target data).

Parameter       Tested values
Filters†        8, 16, 32, 64
Learning rate   0.01, 0.005, 0.001, 0.0005, 0.0001
Epochs          5, 15, 30, 50, 100, 200

† 64 filters used in pre-training

With this setup, the training set size of the second product can be varied while the first is fixed.
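In code, the procedure amounts to reusing the source network's weights and then continuing training; `build_cnn`, the file name and the arrays below are hypothetical stand-ins.

```python
model = build_cnn()                       # same architecture as section 3.2.1
model.load_weights("cnn_product_A.h5")    # weights trained on product A
model.fit([price_B, volume_B], target_B)  # fine-tune all layers on product B
```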

3.2.3 SHL-CNN

The SHL-CNN extends the CNN architecture in figure 3.2 and is presented in figure 3.4. Given multiple products such as A, B and C, we used one output head for each product. When training the model, each epoch consisted of sequentially training one epoch on each product's data. Weights were only updated where it made sense: for example, weights of layers specific to the outputs of B or C were not updated when the input data came from product A.
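One way to realize this in Keras is to build one model per product that shares the hidden layers but exposes only its own output head; fitting on one product's data then updates the shared layers and that product's head only. A minimal sketch with hypothetical names:

```python
for epoch in range(n_epochs):
    for product in ("A", "B", "C"):
        x, y = data[product]  # that product's training data
        models[product].fit(x, y, epochs=1, verbose=0)  # other heads get no gradient
```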

3.2.4 Parameter Selection

All three models rely on their hyperparameters, so it was necessary to find suitable values for the learning rate, the number of filters and the number of epochs. Before the search, data was reserved for the final experiments described in section 3.3. A good starting range of parameters was found through experimentation while implementing the code. The parameters searched over are presented in table 3.2.

Rolling-origin-recalibration evaluation was used with ten sequential values of the forecast origin for all networks. The reason for using several sequential values was to account for variation in both the data and the model training.
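The search itself was a plain grid over table 3.2. A minimal sketch, where `evaluate` is a hypothetical function returning the sMAPE scores of one parameter combination over the ten forecast origins:

```python
import itertools
import numpy as np

grid = {"filters": [8, 16, 32, 64],
        "learning_rate": [0.01, 0.005, 0.001, 0.0005, 0.0001],
        "epochs": [5, 15, 30, 50, 100, 200]}

def best_parameters(evaluate):
    scores = {}
    for combo in itertools.product(*grid.values()):
        scores[combo] = np.mean(evaluate(dict(zip(grid, combo))))  # mean sMAPE
    return dict(zip(grid, min(scores, key=scores.get)))            # lowest is best
```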

3.3 Experiments

After the hyperparameters were determined, we performed two experiments to try to answer the two research questions. In both experiments we compared the forecast accuracy in terms of sMAPE.

Figure 3.4: Network architecture for the SHL-CNN. All convolutions are causal and use dilation. n refers to the batch size and p to the number of past values; the example shows 32 filters.

Table 3.3: Varying training set size. The two rows are equivalent parameterizations.

Parameter          Tested values [%]
Training samples†  100, 80, 60, 40, 20
Training data*     100, 85, 76, 65, 56

† Proportion of all training samples used.
* Proportion of all training data used to make samples.

3.3.1 Overall Performance

Using the best parameters found in the parameter search, we evaluated all models on all products using the previously reserved data. Each model was tested on 100 sequential forecast origins while the forecast horizon was set at 120 days. We finally report forecast accuracy in terms of both the median and mean of the sMAPE from each forecast origin.

3.3.2 Varying Training Set Size

After conducting the first experiment, we wanted to see if the effect of the transfer learning approaches changes when less training data is available. With product A as the target and using previously selected parameters, we restricted the number of training samples of product A to the values present in table 3.3. Restricting the number of training samples by a percentage corresponds to restricting all training data by another percentage, since one training sample extends over multiple days. We then trained the CNN using this data. The pre-trained network was loaded from networks trained on the full length of products B and C. For the SHL-CNN we restricted the size of all data sources simultaneously.

Instead of using 100 sequential forecast origins as in the overall performance experiment, we used the last 25.

3.4 Implementation

Multiple Python libraries were used in this project. The Keras library was used for implementing all neural networks [37], with TensorFlow as the Keras backend [38]. Plots were generated using Matplotlib [39]. Data preprocessing and the validation method were implemented in Python with help from the Pandas [40] and NumPy [41] libraries.

4 Results

4.1 Parameter Selection

This section presents the hyperparameter search for the models in table 3.1. The best parameters that were chosen are presented in table 4.1.

4.1.1 CNN

In figure 4.1, a box plot of the top five parameters in terms of lowest mean sMAPE is presented per product. As can be seen, for product A the best setting is 200 epochs, 64 filters and a learning rate of 0.0001. The difference between the two best settings for product B is small, though leaning towards 32 filters, 200 epochs and a learning rate of 0.0005. For product C the best setting is 50 epochs, 64 filters and a learning rate of 0.0005.

A general observation in the three plots is that the variance seems to be quite high. At the same time, many parameter combinations show similar performance.

4.1.2 SHL-CNN

Figure 4.2 shows box plots of the top five parameters in terms of lowest mean sMAPE, per product. The best setting for product A is 64 filters, 15 epochs and a learning rate of 0.0005. For product B, 64 filters, 50 epochs and 0.001 as the learning rate is a promising selection. Lastly, for product C, the setting of 64 filters, 100 epochs and a learning rate of 0.0005 clearly outperforms the runner-ups.

Again we observe high variance. For products A and B it seems to be on a similar level to the CNN parameter search, while it is higher for product C.

Figure 4.1: sMAPE of the five best parameters in the CNN parameter search for (a) product A, (b) product B and (c) product C. Each parameter combination was tested on 10 sequential forecast origins. The box extends from the lower to upper quartile with an orange line showing the median. The whiskers show the range of values and outliers are displayed as circles. The horizontal axis shows parameter combinations: epochs (top), learning rate (middle) and number of filters (bottom).

Figure 4.2: sMAPE of the five best parameters in the parameter search of the SHL-CNNs for (a) product A, (b) product B and (c) product C. Each parameter combination was tested on 10 sequential forecast origins.

Table 4.1: Best parameters for all models.

Model                Filters   Epochs   Learning rate
CNN - A              64        200      0.0001
CNN - B              32†       200      0.0005
CNN - C              64        50       0.0005
SHL-CNN - A          64        15       0.0005
SHL-CNN - B          64        50       0.001
SHL-CNN - C          64        100      0.0005
Pre-training - A2B   64        50       0.005
Pre-training - A2C   64        200      0.0001
Pre-training - B2A   64        15       0.0001
Pre-training - B2C   64        5        0.0001
Pre-training - C2A   64        5        0.01
Pre-training - C2B   64        50       0.01

† 64 filters used in pre-training

4.1.3 Pre-training

The top five parameters in terms of lowest mean sMAPE per pre-training model are presented in figure 4.3. For the A2B network (pre-trained on A and then trained on B), the setting with 50 epochs, a 0.005 learning rate and 64 filters had a low median and the smallest spread, and was therefore chosen. In the A2C search, 200 epochs, a 0.0001 learning rate and 64 filters yielded the lowest sMAPE.

Looking at the box plots for B2A and B2C we see that for both, 64 filters should be used. For the first of the two, the learning rate and number of epochs were chosen as 0.0001 and 15 respectively, because of a low median and a lower spread than the setting with the lowest median. For the latter, 5 epochs and a learning rate of 0.0001 is a clear choice.

For both the C2A and C2B networks, 64 filters is optimal. For the first of the two, the best setting was a learning rate of 0.01 and 5 epochs, whereas for the second, 50 epochs and a learning rate of 0.01.

Figure 4.3: sMAPE of the five best parameters in the parameter search of the pre-trained networks, (a) A2B, (b) A2C, (c) B2A, (d) B2C, (e) C2A and (f) C2B. Each parameter combination was tested on 10 sequential forecast origins.

4.2 Overall Performance

The final results were computed on a total of 100 sequential forecast origins with a forecast horizon of 120 days. They are presented in table 4.2. For product A, the best performing network in terms of both median and mean sMAPE is the CNN. It is followed closely by the pre-training - C2A network. For product B, the best network in terms of lowest median sMAPE is the SHL-CNN - B, whereas SHL-CNN - A achieves the lowest mean sMAPE. The SHL-CNN - B is second in terms of mean sMAPE and the pre-training - A2B is second in terms of median sMAPE. It is noteworthy that the variance in sMAPE is high for all networks. Lastly there is product C. Here the SHL-CNN - A performs best in terms of both median and mean sMAPE. The second best is the CNN.

To summarize, the transfer learning methods seem to neither improve nor decrease the accuracy.

4.3 Varying Training Set Size

In this experiment product A was the target of the forecasts. When the number of training samples was varied, the last 25 sequential forecast origins of the reserved data were used to test the models. The forecast horizon was, as in the previous experiment, 120 days. Both the pre-training - B2A and pre-training - C2A networks were pre-trained with 100 % of the data, whereas the SHL-CNN varied all three datasets equally.

In figure 4.4 the mean sMAPE is plotted for all models and training set sizes present in table 3.3. The results are also displayed in more detail in table 4.3. For all models the performance in terms of mean sMAPE improves when the training set size decreases from 100 % to 80 %, 60 % and 40 % of training samples. When the size shrinks to 20 % of training samples, the mean sMAPE increases for all models. The pre-training - C2A performs better than the other models, which have a higher and similar sMAPE.

In figure 4.5, we present box plots of the sMAPE when the number of training samples was reduced to 60 %. Here we see that both the pre-training - B2A and pre-training - C2A have lower medians and overall lower distributions than the CNN.

Table 4.2: Final results of all models. Each model was run on 100 sequential forecast origins. The upper and lower bounds refer to one standard deviation.

(a) Product A

Model                median sMAPE [%]   mean sMAPE [%]
CNN                  30.3               31.3 ± 5.0
SHL-CNN - A          31.6               33.1 ± 6.8
SHL-CNN - B          32.7               35.3 ± 10.9
SHL-CNN - C          31.7               35.3 ± 9.5
Pre-training - B2A   31.4               32.3 ± 6.2
Pre-training - C2A   30.8               32.9 ± 9.2

(b) Product B

Model                median sMAPE [%]   mean sMAPE [%]
CNN                  50.4               47.1 ± 15.0
SHL-CNN - A          50.3               44.7 ± 18.2
SHL-CNN - B          48.6               46.7 ± 16.1
SHL-CNN - C          50.4               49.0 ± 18.3
Pre-training - A2B   49.9               47.4 ± 17.6
Pre-training - C2B   50.1               50.1 ± 21.6

(c) Product C

Model                median sMAPE [%]   mean sMAPE [%]
CNN                  27.5               28.2 ± 6.3
SHL-CNN - A          24.4               25.7 ± 4.6
SHL-CNN - B          28.1               30.4 ± 9.7
SHL-CNN - C          29.7               30.7 ± 7.6
Pre-training - A2C   36.4               37.9 ± 9.4
Pre-training - B2C   31.8               35.7 ± 15.0

Figure 4.4: Effect of varying the number of training samples for product A. The SHL-CNN varied data from B and C correspondingly, whereas the pre-training networks did not. Each point is a mean of the sMAPE evaluated on 25 sequential forecast origins.

Figure 4.5: Box plot of sMAPE on 25 forecast origins when the training set size of product A was reduced to 60 % of training samples (76 % of all training data).

Table 4.3: Results when varying the amount of training samples. Product A was the target. Each model was run on 25 sequential forecast origins. The upper and lower bounds refer to one standard deviation.

(a) 100 %

Model                median sMAPE [%]   mean sMAPE [%]
CNN - A              38.2               40.7 ± 6.0
SHL-CNN - A          44.6               43.4 ± 6.7
Pre-training - B2A   37.0               37.5 ± 3.8
Pre-training - C2A   34.8               37.8 ± 9.5

(b) 80 %

Model                median sMAPE [%]   mean sMAPE [%]
CNN - A              37.1               38.7 ± 4.9
SHL-CNN - A          36.3               37.4 ± 4.0
Pre-training - B2A   33.9               35.1 ± 3.4
Pre-training - C2A   35.8               36.8 ± 6.1

(c) 60 %

Model                median sMAPE [%]   mean sMAPE [%]
CNN - A              37.5               38.5 ± 4.9
SHL-CNN - A          36.9               38.3 ± 5.2
Pre-training - B2A   34.3               35.2 ± 4.1
Pre-training - C2A   31.6               34.9 ± 8.1

(d) 40 %

Model                median sMAPE [%]   mean sMAPE [%]
CNN - A              38.5               39.3 ± 4.1
SHL-CNN - A          35.0               36.3 ± 4.1
Pre-training - B2A   35.0               35.0 ± 2.8
Pre-training - C2A   33.0               36.6 ± 7.6

(e) 20 %

Model                median sMAPE [%]   mean sMAPE [%]
CNN - A              41.3               42.6 ± 5.3
SHL-CNN - A          44.4               44.0 ± 2.9
Pre-training - B2A   43.8               43.5 ± 2.5
Pre-training - C2A   39.0               39.5 ± 3.9

5 Discussion

In this degree project we examine whether transfer learning on a CNN can improve time series forecasting and allow for forecasting of time series with less historical data. The motivation is that improved forecasts can have a positive economic and environmental impact. First a CNN (no transfer learning) is compared to both SHL-CNN and pre-training on time series data from three consumer appliances. Then forecasting accuracy is assessed while the training set size is varied. In this chapter, the results and general observations are discussed. Possible future work and ethical aspects are also considered.

5.1 Discussion of Results

Looking at table 4.2 in the results chapter, we observe that for product A the SHL-CNN is not better than the CNN, while it is for product B. Lastly, on product C, the SHL-CNN - C performs worse than the CNN. Additionally, we remark that on product C, SHL-CNN - A performs better than SHL-CNN - C, even though the latter had parameters optimized to perform well on product C. The results for all three products are inconclusive. A possible reason is that the optimal hyperparameters may change over time. As our test set consists of 100 overlapping forecast origins with a forecast horizon of 120 days, the time difference between the last target value in the parameter search and in the testing is over half a year. Another possible reason is that the three products are less correlated than necessary for SHL to work properly.

Considering the effect of pre-training, in table 4.2 we see that for products A and B pre-training performs slightly worse than the CNN, while the accuracy is much poorer for product C. The overall conclusion is that it has a negative effect in the first experiment. However, looking at the results from the second experiment in table 4.3a, we see that pre-training actually improves the accuracy. The results therefore contradict each other. It shall be pointed out that the second experiment tested accuracies on the last 25 forecast origins, whereas the first used the last 100. Time series change over time, and a model may perform better over a certain period of time.

When looking at figure 4.4, it is apparent that the mean sMAPE improves when the training set size is lowered. This goes against the intuition that more training data leads to better model performance. However, time series data differs from data used in other machine learning applications. It may be the case that old training samples confuse the model, as the time series has changed over the years. Further, we also see that the two transfer learning techniques may be better at utilizing a small dataset. This is quite reasonable, since both techniques are based on learning from more data.

Making a general observation, in table 4.2b we see that the sMAPE for product B is much higher than it was during the parameter search, where it was around 18 % as seen in figure 4.1b. This may be explained by the fact that many factors that influence buyers change over time. For example, marketing campaigns could increase brand value, increasing the share of the total market. Another explanation is that there could have been problems in the supply chain, decreasing the product availability. These factors, among others, all affect the sales volumes, but they are not given as input to the three models. Hence the models could not possibly forecast large changes caused by such external factors.

We would also like to comment on the long training times experienced. Producing the final results for a single model on 100 forecast origins took several hours on a high-performance computer, which is a big disadvantage of using this kind of method for large-scale forecasting. With more resources and/or a faster implementation, we could have gained more insights by running the models several times for each forecast origin, allowing us to separate model variance from data variance. We could also have performed a more extensive parameter search, or even searched for parameters for each forecast origin. This could have produced better results, as parameters would have been optimized closer in time to the test samples.

It is also worth discussing the results of the parameter search reported in section 4.1. For many parameters, the boxplots were very similar both in value and in spread, which makes the choice of parameters difficult, as there is no clear optimum. For example, there were cases where one alternative had a low median and high spread while another had a slightly higher median but lower spread.


Lastly, we would like to reconnect to the research questions. Interpreting our results in table 4.2, we cannot say that transfer learning for time series forecasting with our CNN improves the forecasts in terms of sMAPE. However, when varying the training set size we see that pre-training improves the accuracy, which suggests that under certain conditions transfer learning is an effective approach. Furthermore, when the training set size is restricted to 20 % of the training samples, we see that one pre-training model performs better than the rest, which perform very similarly. This indicates that there are cases where transfer learning allows for forecasting smaller time series. However, further studies are required to establish how general these observations are.

5.2 Related Works

We also want to connect to the related works that influenced the models of this degree project. Our choice to use the SHL-CNN originated in the report by Hu, Zhang, and Zhou [1], who used an SHL-DNN to forecast wind speed for wind power parks. They found that the SHL-DNN is highly beneficial when there is less training data, although in some cases other, shallow models prevail.

While our results are not completely aligned with theirs, they do not contradict each other, which suggests that the results are credible. There are several possible reasons for why we did not come to the same conclusion as they did. Firstly, the domains are different: Hu, Zhang, and Zhou use wind speeds collected every 10 minutes, whereas we use daily data of sales volumes. Secondly, our neural networks are different: we use a CNN with several layers, whereas their DNN only has two hidden layers. It could be that we added too many layers to our network. A third difference is the training of the SHL-CNN and SHL-DNN when shrinking the target training set size. We train the whole network at once, shrinking all data sources equally, whereas Hu, Zhang, and Zhou first train their hidden layers, freeze them, and lastly train the output layers.
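As a rough illustration of their scheme, the following Keras sketch freezes hypothetical shared hidden layers after source training and then trains only a new output layer on the target data. The layer sizes, names and data variables are our own illustrative assumptions, not the architecture of either work.

from tensorflow import keras

# Hypothetical shared hidden layers, already trained on the source data.
shared = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(30,)),
    keras.layers.Dense(32, activation="relu"),
])
shared.trainable = False  # freeze the shared hidden layers

# A fresh output layer is then trained on the target series alone.
model = keras.Sequential([shared, keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
# model.fit(x_target, y_target, epochs=10)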

Further, the method of pre-training was inspired by the paper of Yosinski et al. [33]. The authors test pre-training on a CNN with several layers and find that all layers should be reused and fine-tuned on the target data. Again, our results are not aligned: in our first experiment pre-training did not improve forecasting accuracy, while it did in the second. There are, however, many differences between images and time series that could explain this. For example, there may be a stronger correlation between images of cars and trucks than between the sales of different products. Markets and people's buying habits can change quickly, whereas the appearance of vehicles changes slowly. Hence, the difference is significant both in terms of domain and application. Another reason may be the large size of the ImageNet dataset; the data used in this degree project is orders of magnitude smaller.
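For contrast, a minimal sketch of reuse-and-fine-tune pre-training in the spirit of Yosinski et al.: all layers of a model trained on a source product are reused and then fine-tuned on the target data. The file name, learning rate and data variables are hypothetical assumptions, not our exact procedure.

from tensorflow import keras

# Load a network previously trained on the source product
# (the file name is hypothetical).
model = keras.models.load_model("cnn_source_product.h5")

# Reuse and fine-tune all layers on the target product, with a
# lowered learning rate so the pre-trained features are not destroyed.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse")
# model.fit(x_target, y_target, epochs=5)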

Considering transfer learning in general, both of the previously mentioned papers use data where the correlation between datasets is clear. This is expected, since the entire logic behind transfer learning depends on the similarity between source and target domain. In this project, the data comes from the three selected consumer appliances, chosen based on technical specification, sales volumes and price. As sales depend on many different hidden factors, there is no way of knowing the true correlation between the three time series; different marketing strategies for the products are one potential hidden factor. The domain could therefore be a limitation to the success of transfer learning.

5.3 Future Work

In this degree project we did not implement the fastest and most computationally efficient training of the used CNN. If set up correctly, the network could be trained in parallel, taking in the entire training set at the same time. To implement this approach one would have to tackle some obstacles, one of which is that the learning rate would have to be adjusted depending on how much data is fed to the model (this changes when moving the forecast origin). After implementing this approach it would be feasible to extend the hyperparameter search, rerun the models several times, evaluate on the time series of more products and experiment with the network architecture. Because we suspect the variance of the model is high, it would be interesting to test less complex architectures.
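To sketch what such parallel training could look like, the example below builds a small dilated causal CNN in Keras that outputs a prediction at every time step, so one long sequence trains all positions at once. The architecture, shapes and dummy data are illustrative assumptions and not the network used in this project.

import numpy as np
from tensorflow import keras

# Illustrative dilated causal CNN that emits a prediction at every
# time step, so one long sequence trains all positions in parallel.
inputs = keras.Input(shape=(None, 2))  # (time, [sales, price])
x = inputs
for d in (1, 2, 4, 8):
    x = keras.layers.Conv1D(16, kernel_size=2, dilation_rate=d,
                            padding="causal", activation="relu")(x)
outputs = keras.layers.Conv1D(1, kernel_size=1)(x)  # one forecast per step
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

# The target is the input sales shifted one step ahead; the final
# (wrapped-around) step is dropped before fitting.
series = np.random.rand(1, 1000, 2).astype("float32")  # dummy data
target = np.roll(series[..., :1], -1, axis=1)
model.fit(series[:, :-1], target[:, :-1], epochs=1, verbose=0)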

Another improvement could be to experiment with the receptive field by varying the dilation rate, kernel size and number of layers. There may be other combinations that are more suitable for time series forecasting and for transfer learning. A smaller receptive field or some type of zero padding would be needed to forecast sales of newly launched products. We also see a possibility of inputting the price several days ahead instead of a single day; consumers may be keener to buy a product if they know the price is bound to increase.
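The receptive field of such a stack can be computed directly, which makes it easy to check whether a combination of kernel size and dilation rates covers the intended history. A small sketch, assuming one convolution per dilation rate:

def receptive_field(kernel_size, dilations):
    """Receptive field, in time steps, of stacked dilated causal
    convolutions with one convolution per dilation rate."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Example: kernel size 2 with dilations 1, 2, 4, ..., 64 covers 128 days.
print(receptive_field(2, [1, 2, 4, 8, 16, 32, 64]))  # -> 128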

A future work could also analyse the accuracy of different models on this data. We have not performed any comparisons with classical linear methods, LSTMs or naive methods.


The last idea for future work is to investigate how far back in time one should gather data when making forecasts. As our results suggest, there may be an optimal amount of data to account for.

5.4 Ethics and Sustainability

The manufacturing of consumer appliances is expensive and consumes resources. Because of their size, the appliances are also expensive to transport and take up warehouse space. By making good forecasts of future sales, a company can plan its supply chain more effectively, lowering warehouse buffers and avoiding production of excessive amounts of products. The environmental benefits could be large, considering that in global supply chains, parts sometimes need to be ordered months in advance.

Another important environmental aspect of using DNNs is the consumption of computing resources. Training advanced and complex models can take weeks on expensive modern hardware, consuming large amounts of electricity.

An important ethical aspect today is the use and management of personal data. In this degree project, no personal data was used or interacted with. The sales volumes are also aggregated over a large geographical area for a small selection of products, so no individual sales can be distinguished.


Conclusions

The purpose of this degree project was to investigate whether transfer learning on a CNN with dilated causal convolutions improves the forecast accuracy on sales volumes and allows for forecasting time series with less historical data. For that purpose, one network without transfer learning (CNN) was compared to both a CNN with shared hidden layers (SHL-CNN) and pre-training on a CNN, using data from three multivariate time series representing three consumer appliances of the same kind. First we tested the overall accuracy in terms of sMAPE; then we targeted a single product and reported forecast accuracy at different training set sizes.

The results from the first experiment indicate that transfer learning neither increases nor decreases forecasting accuracy. Interestingly, the second experiment shows that only 60 % (40 % for the SHL-CNN) of the training samples is optimal for all models. This goes against the intuition that more training data leads to better model performance, and it is most likely a phenomenon specific to time series forecasting. Although the percentage of 60 % is most likely application specific, we also find that pre-training, from any of the other products, improves the forecasting accuracy when the training set is reduced. Finally, reducing the training set further (to 20 % of the training samples) affects the models differently: one pre-training model performs better than the rest, which perform very similarly. This indicates that there are cases where transfer learning allows for forecasting smaller time series. However, further studies are required to establish how general these observations are.

References

[1] Qinghua Hu, Rujia Zhang, and Yucan Zhou. "Transfer learning for short-term wind speed prediction with deep neural networks". In: Renewable Energy 85 (2016), pp. 83–95.
[2] Jolanta Szoplik. "Forecasting of natural gas consumption with artificial neural networks". In: Energy 85 (2015), pp. 208–220.
[3] Peter Whittle. Hypothesis testing in time series analysis. Uppsala: Almqvist & Wiksell, 1951. isbn: 991-527406-8.
[4] George E. P. Box. Time series analysis forecasting and control. Holden-Day series in time series analysis. San Francisco, 1970. isbn: 99-0068276-9.
[5] Rob J Hyndman and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018. isbn: 9780987507112.
[6] Dong C Park et al. "Electric load forecasting using an artificial neural network". In: IEEE Transactions on Power Systems 6.2 (1991), pp. 442–449.
[7] Jan G De Gooijer and Rob J Hyndman. "25 years of time series forecasting". In: International Journal of Forecasting 22.3 (2006), pp. 443–473.
[8] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M4 Competition: Results, findings, conclusion and way forward". In: International Journal of Forecasting 34.4 (2018), pp. 802–808.
[9] Gao Huang et al. "Densely connected convolutional networks". In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 4700–4708.
[10] Saining Xie et al. "Aggregated residual transformations for deep neural networks". In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1492–1500.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.
[12] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling". In: arXiv:1803.01271 (2018).
[13] Aaron van den Oord et al. "WaveNet: A Generative Model for Raw Audio". In: arXiv preprint arXiv:1609.03499 (2016).
[14] Nikolay Laptev et al. "Time-series extreme event forecasting with neural networks at Uber". In: International Conference on Machine Learning. 34. 2017, pp. 1–5.
[15] Ali Sharif Razavian et al. "CNN features off-the-shelf: an astounding baseline for recognition". In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2014, pp. 806–813.
[16] Jeremy Howard and Sebastian Ruder. "Universal language model fine-tuning for text classification". In: arXiv preprint arXiv:1801.06146 (2018).
[17] Peter J. Brockwell and Richard A. Davis. Introduction to Time Series and Forecasting (Springer Texts in Statistics). Springer, 2016. isbn: 9783319298528.
[18] Coşkun Hamzaçebi, Diyar Akay, and Fevzi Kutay. "Comparison of direct and iterative artificial neural network forecast approaches in multi-periodic time series forecasting". In: Expert Systems with Applications 36.2 (2009), pp. 3839–3844.
[19] Souhaib Ben Taieb et al. "A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition". In: Expert Systems with Applications 39.8 (2012), pp. 7067–7083.
[20] G Peter Zhang and Min Qi. "Neural network forecasting for seasonal and trend time series". In: European Journal of Operational Research 160.2 (2005), pp. 501–514.
[21] Spyros Makridakis. "Accuracy measures: theoretical and practical concerns". In: International Journal of Forecasting 9.4 (1993), pp. 527–529.
