
UPTEC IT 16005

Degree project, 30 credits. April 2016

Mail Volume Forecasting

an Evaluation of Machine Learning Models

Markus Ebbesson

Department of Information Technology


Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH Unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Phone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Mail Volume Forecasting - an Evaluation of Machine Learning Models

Markus Ebbesson

This study applies machine learning models to mail volumes with the goal of making forecasts accurate enough to minimise under- and overstaffing at a mail operating company. The most suitable model for this context is identified by evaluating input features and three different models: Auto Regression (AR), Random Forest (RF) and a Neural Network (NN), specifically a Multilayer Perceptron (MLP).

The results show greatly improved forecasting accuracy compared to the model currently in use. Although the RF and NN models achieve similar accuracy, the RF model is recommended as the most practically applicable model for the company.

This study serves as a first step, since the field lacks previous research on producing mail volume forecasts with machine learning. The outcomes are expected to be applicable to mail operators in Sweden and worldwide.

Examiner: Lars-Åke Nordén. Subject reviewer: Sholeh Yasini. Supervisor: Lisa Stenkvist


Table of Contents

List of Abbreviations
List of Figures
List of Tables

1 Introduction
1.1 Bring Citymail Sweden
1.2 Motivation
1.3 Research Questions

2 Background
2.1 Time Series
2.2 Random Forest and Decision Trees
2.3 Neural Networks and the Multilayer Perceptron
2.4 Related Work

3 Methodology and Implementation
3.1 Data
3.1.1 Data Extraction
3.1.2 Training and Test Data
3.1.3 Features
3.1.4 Data Preparation
3.1.5 Performance Measurement
3.2 Auto Regression
3.3 Random Forest
3.4 Multilayer Perceptron
3.4.1 The Backpropagation Algorithm
3.5 Avoiding Overfitting with Cross-Validation
3.6 Limitations

4 Results
4.1 Prediction Results
4.1.1 Results per Mail Segment
4.2 Feature Importances
4.2.1 Feature Importances per Mail Segment

5 Discussion
5.1 Future Work

6 Conclusion

Bibliography


List of Abbreviations

AR Auto Regression.

ARIMA Auto Regression Integrated Moving Average.

BCMS Bring Citymail Sweden.

MA Moving Average.

MAPE Mean Absolute Percentage Error.

MLP Multilayer Perceptron.

NN Neural Network.

OOB Out-Of-Bag.

RF Random Forest.

RMSE Root Mean Squared Error.


List of Figures

2.1 Example of how a tree with 7 leaves gives an output y based on input parameters x1, x2, x3 and x5. The output of the Random Forest (RF) will be the average of all trees' outputs for the same input.
2.2 A fully connected feed-forward Multilayer Perceptron (MLP) with one hidden layer.
3.1 Illustrates the similarities of the mail volume behaviour between the destinations.
3.2 Total mail volume per day for destination Stockholm during the time period 2014-03-01 to 2015-09-18.
3.3 Mail volume measured in number of postal items per segment for destination Stockholm during the time period 2014-03-01 to 2015-09-18. Together, the six segments make up the total volume for the destination.
3.4 Shows how inputs turn into an output of a unit in a Multilayer Perceptron (MLP) via the propagation rule, the activation rule and the output rule.
3.5 Illustration of the overfitting phenomenon.
4.1 Prediction outputs (number of postal items) for the time period 2015-03-02 - 2015-03-27 from a selection of the models, plotted in green. The actual mail volume outcomes are drawn in black.
4.2 Actual volume and model prediction outputs per segment for the time period 2015-03-02 - 2015-03-27, measured in number of postal items.
4.3 Feature importance for a model trained on the total mail volume. Out-Of-Bag (OOB) error increase for each feature as the values of each feature are permuted across the OOB data. A larger permutation error increase suggests higher relative feature importance.
4.4 Feature importance per segment as indicated by the OOB error increase as the values of each feature are permuted across the OOB data. A larger value suggests higher feature importance.


List of Tables

3.1 List of considered features. The features' respective importance (prediction capabilities) will be evaluated and included/excluded thereafter.
4.1 Model accuracies measured in Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE). Each model's performance is measured on 1-5 days', 22-26 days' and 1-26 days' prediction.
4.2 Model accuracies measured in MAPE for each segment separately. Each model's performance is measured on 1-26 days' prediction, except models NN E and RF E, which predict 1-5 days.
5.1 Positive and negative aspects of each method.


1 Introduction

This report pioneers the area where forecasting with machine learning and the daily work of mail operators meet. The efficient use of resources in a company is assessed by applying forecasting models to its mail volumes.

Despite growing digitalisation and email usage, traditional mail will always be needed and the domain is therefore sustained. As some mail categories decline with digitalisation, mail operators are forced to adapt. This phenomenon puts mail operators in a competitive environment where efficient handling of resources is essential.

The study evaluates and compares three machine learning models to determine whether they are sufficiently powerful for predicting near-future mail volume. It also identifies which model is the most accurate and the most practically applicable in the context of the company.

1.1 Bring Citymail Sweden

Bring Citymail Sweden (BCMS) delivers mail from businesses to businesses (B2B) and from businesses to consumers (B2C). BCMS covers (i.e. delivers to) major areas of Sweden divided into five destinations based on four terminals located in Stockholm, Malmö, Gothenburg and Örebro. The five destinations are Stockholm, Malmö, Gothenburg, Örebro and Visby. Mail that will be delivered in destination Visby passes through terminal Stockholm.

The terminals are where so-called post producers deliver the customers' mail once it has been printed. The terminals also serve as centralised sorting stations, where the mail is sorted down to the exact order in which each Cityman (mailman) will make his or her round of deliveries. The sorted mail is then transported to the appropriate distribution office, at which the Citymen work and proceed to deliver the mail to its recipients.

BCMS handles large volumes of mail of different kinds and of varying size every day. The amount of mail passing through depends on the number of people living in each area, but is also affected by more complex underlying influences.

1.2 Motivation

The current predictions are based solely on information about what the customers have booked, making them very inaccurate. These orders are summarised in a web-based tool that the schedule-setting staff can consult.

The current booking situation, displayed in the tool, can be far from the truth. Orders can change in volume and date, and new orders can be made as late as the day before arrival, making the booking situation, and therefore the predictions, highly preliminary. Making a good prediction based solely on the current prediction tool therefore requires expert knowledge, long experience, guesswork and luck.

Staff who sort and deliver mail (Citymen) are scheduled as far as three weeks in advance. However, this can be altered to some extent a few days in advance using employees on hourly wages. Despite this, there is currently a problem at BCMS with both under- and overstaffing because of the short booking time span.

Today's planning tools are nearly useless to an inexperienced person: the varying number of orders, and orders switching dates, make them hard to interpret. With experience, the current way of predicting the future becomes more manageable. However, a proper prediction model alongside the current booking situation would be more useful, intuitive and accurate for both inexperienced and experienced staff schedulers.

The varying nature of the mail volume, along with inaccurate forecasts and orders arriving late or being changed, makes it difficult to plan and schedule resources. An accurate forecast of this volume could therefore help the company use its resources more efficiently.

1.3 Research Questions

This thesis’ problem formulations include:

1. Can a supervised machine learning model be used to produce sufficiently accurate mail volume forecasts?

2. Comparing Auto Regression, Random Forest and the Multilayer Perceptron, which model is in this context...

(a) the most accurate?

(b) the most practically applicable?

3. Can forecasts be improved by...

(a) ...splitting up the forecast into the six mail segments?

(b) ...incorporating future booked volume information?

4. Which features are most important for predicting mail volume?

For a forecasting model to be reliable and trusted, it needs to be accurate enough. In the company's interest, however, a model is sufficiently accurate as long as it is more accurate than the current predictions.

The 'best' model could be the most accurate model, the most practically applicable, or a combination of the two. A more accurate model is one that produces forecasts closer to the actual outcome. A practically applicable model is one whose properties make it suitable to implement as a finished product for daily operation in the BCMS context. The most accurate model could also be most accurate in forecasting either 1 to 5 days in advance or 3 to 4 weeks in advance, matching the time spans on which staff scheduling is made. Different models are expected to be better in different aspects.


2 Background

BCMS's mail volume varies considerably from day to day. Furthermore, the booked volume for each day can change up until one day before delivery. This problem stems from the agreements with the customers, which for various reasons cannot be changed. A sensible approach is therefore to make predictions based on the time series data that is the mail volume history. The problem will also be approached with more advanced models, namely Random Forests (RFs) and Neural Networks (NNs).

2.1 Time Series

Time series forecasting is a way of predicting the next step of a data set based on previous values. Common methods include Moving Average (MA), exponential smoothing, Auto Regression (AR), Auto Regression Integrated Moving Average (ARIMA), et cetera.

The correlation of the current value to the previous values is defined in different ways in each model. The simplest models usually use an identical or a decaying importance (weight) with each step backward in time, while more advanced time series models find other relationships through regression.

MA is the method of calculating the average of part of the data, such as the few previous time steps. The MA model gives a slight delay and does not produce powerful predictions, but rather makes it easier to find shorter trends.

Exponential smoothing is another method of smoothing time series data.

The AR model attempts to find patterns by estimating how a time step depends on a number of previous time steps. It can therefore have more predictive power, if it finds suitable parameters.

The ARIMA model is a combination of AR, MA and I, which stands for 'Integrated'. A non-stationary time series can be made stationary by differencing and is then called an integrated version of the stationary series. AR and/or MA can then be applied once the time series is stationary.
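As a sketch of the AR idea, an AR(p) model can be fitted by ordinary least squares, regressing each value on an intercept and its p predecessors. This minimal NumPy example is illustrative only; the function names and the choice of plain least squares are assumptions, not the thesis implementation.

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model by ordinary least squares: regress each value
    y[t] on an intercept and the p previous values y[t-1], ..., y[t-p]."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack(
        [np.ones(len(y) - p)]
        + [y[p - k - 1:len(y) - k - 1] for k in range(p)]  # lag k+1 column
    )
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [intercept, lag-1 weight, ..., lag-p weight]

def predict_next(series, coef):
    """One-step-ahead forecast from the last p observed values."""
    p = len(coef) - 1
    lags = np.asarray(series, dtype=float)[-p:][::-1]  # most recent first
    return coef[0] + coef[1:] @ lags
```

Fitting to a differenced series instead of the raw one would give the 'integrated' variant described above.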

2.2 Random Forest and Decision Trees

Introduced by Breiman in 2001 [1], RF is a way of bootstrap aggregating, or bagging, decision trees. It thereby attacks the infamous bias-variance trade-off by minimising variance without affecting the bias.

Although originally introduced for classification trees, RFs can similarly be applied to regression trees, and are then sometimes referred to as Regression Forests. Regression Forests perform nonlinear multiple regression and are therefore applicable to the problem at hand.

The output of a RF is defined as the average of the outputs of the individual trees when they are exposed to the same input. Each tree is trained in a similar fashion, but with two random elements. First, the selection of data points: through sampling with replacement, each tree is trained on a subset of the total training data, the so-called bag or in-bag data. The variance is thereby reduced by training each tree on slightly different data, while keeping the robustness and bias of decision trees. As a result, each tree can also be evaluated on its Out-Of-Bag (OOB) error. Second, the selection of input parameters (features): for each split, a subset of the parameters is selected and potential splits for each of these parameters are evaluated.

Each tree is trained to a maximum depth or a maximum number of leaves by recursively evaluating the information gain of each selected parameter. The split with the most information gain is selected and the node is split into two based on the value of this parameter. Figure 2.1 shows an example of such a tree as a part of a forest.


Figure 2.1: Example of how a tree with 7 leaves gives an output y based on input parameters x1, x2, x3, and x5. The output of the Random Forest (RF) will be the average of all trees’ outputs for the same input.

A random forest therefore consists of a relatively large number of individual trees, usually on the order of hundreds to thousands. It is possible to take advantage of this property and gain information by analysing the individual trained trees. For example, there have been applications where the individual trees have been used to find prediction intervals, i.e. confidence intervals for each prediction [2, 3].
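As an illustrative sketch (not the configuration used in the thesis), a regression forest with out-of-bag scoring can be set up with scikit-learn. The toy data and all parameter values here are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-in for the mail data: a target driven nonlinearly by two features.
X = rng.uniform(0, 1, size=(500, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

# bootstrap=True gives each tree its own in-bag sample (drawn with
# replacement); oob_score=True evaluates the forest on the data each
# tree never saw, i.e. the out-of-bag (OOB) estimate.
forest = RandomForestRegressor(n_estimators=200, bootstrap=True,
                               oob_score=True, random_state=0)
forest.fit(X, y)

oob_r2 = forest.oob_score_                            # OOB R^2 estimate
prediction = forest.predict(np.array([[0.25, 0.5]]))  # average of all trees
```

The OOB score gives a validation-like error estimate without a separate hold-out set, which is the property exploited in section 4.2 for feature importance.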


2.3 Neural Networks and the Multilayer Perceptron

Although using neuron-like units to solve simple mathematics was introduced as early as the 1940s [4], the explosion of interest in NNs only dates back to the 1980s. NNs have steadily grown in popularity since the credit assignment problem (error propagation problem) for multilayer networks was solved in the mid-1980s [5].

NNs use a set of neuron-like units and a pattern of weighted connections between the units, their so-called architecture, and represent and interpret information by manipulating the weighted connections.

The most recognised and implemented NN structure is the feedforward Multilayer Perceptron (MLP). In this approach, the input given to the network is passed through the input layer, to one or more hidden layers and finally to the output layer to create an output, hence 'feedforward'. The connections between the layers can be fully connected, where all units in one layer are connected to all units in the next layer, or partially connected in various ways.

A fully connected feedforward MLP with one hidden layer is depicted in figure 2.2. This network has N input units, a number of hidden units and one output unit. In the figure, each unit applies the propagation rule to its weighted inputs and then the activation function to the result. The network takes N inputs (x1, ..., xN) and provides one output, y. Connection weights connect all units in one layer to all units in the next layer, making it fully connected. The network also uses bias units, which always output −1.


Figure 2.2: A fully connected feed-forward Multilayer Perceptron (MLP) with one hidden layer.

A NN is typically implemented with an activation rule, an output rule, a propagation rule and a learning rule. The activation rule defines the unit’s state, or activation, based on its current input. The output rule determines the output of a unit based on its current activation, usually set to equal the activation. A unit’s input is decided by the propagation rule on the basis of the outputs and weights of units connecting to it. Connection weights are manipulated using the learning rule to correct the network towards a desired behaviour.


The activation function is used to scale the input of one unit to make its output. Activation functions can be continuously differentiable or not. A continuous function, linear or non-linear, is necessary to allow gradient-based learning methods. Non-linear activation functions are therefore the most powerful, and examples of such functions are the logistic (log-sigmoid) function and the hyperbolic tangent (tan-sigmoid) function.

In order to run an input pattern through an MLP, such as the one in figure 2.2, the input units are set to the value of each respective input. The units in the hidden layer then acquire an input according to the propagation rule. The hidden units assume the value given by the activation function for this input and output according to the output rule. The output unit then calculates its output in the same way as the hidden units, based on the input from the hidden layer. The output of the output unit is the output of the perceptron. The perceptron may have more than one output unit and consequently more outputs.
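The forward pass just described can be sketched in a few lines of NumPy. The weight shapes and the tanh activation are illustrative assumptions, chosen to match the layout of figure 2.2 with bias units outputting −1:

```python
import numpy as np

def forward(x, W_hidden, W_out):
    """One forward pass through a fully connected MLP with one hidden
    layer: bias units output -1, tanh is the activation function, and
    the output rule passes the activation straight through."""
    x = np.append(x, -1.0)            # input layer plus its bias unit
    hidden = np.tanh(W_hidden @ x)    # propagation rule, then activation
    hidden = np.append(hidden, -1.0)  # hidden layer plus its bias unit
    return W_out @ hidden             # linear output unit

rng = np.random.default_rng(1)
W_hidden = rng.normal(size=(4, 3 + 1))  # 4 hidden units, 3 inputs + bias
W_out = rng.normal(size=4 + 1)          # 1 output unit, 4 hidden + bias
y = forward(np.array([0.2, 0.5, 0.9]), W_hidden, W_out)
```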

The perceptron is trained through supervised learning via the learning rule.

Supervised learning means that the perceptron is exposed to an input pattern, and the output is compared to an expected value, also known as a teacher. The pattern error describes how wrong the perceptron is and the learning rule alters the weights in the network to minimise this error. The network is run with all input patterns in a training set, the population. Learning once from each training pattern is called an epoch. The network can change the weights after each pattern, called sequential or online learning, or make a summed change of the weights from all pattern errors at the end of the epoch, called batch learning.

Typically, it takes hundreds or thousands of epochs for the perceptron to learn, depending on the complexity of the problem.

Since the network's representation of what it has learnt lies in its architecture and weights, it does not have to store all the input patterns it has learnt from. Furthermore, it can be taught continuously as new data becomes available. However, this so-called online learning introduces the problem of catastrophic forgetting: exposing an NN to new training data can drastically affect what it has previously learnt. Therefore, online learning should be applied with caution.

A sufficiently large MLP can be proven to learn any computable function (Turing equivalence) given non-linear activation functions and bias units (see figure 2.2), which allow shifting of the activation functions. However, multiple layers introduce the credit assignment problem: it is difficult to determine which neuron or connection in a fully connected multilayered network to blame for an error in the output.

The most famous learning rule to approach the credit assignment problem is the generalised delta rule, also known as backpropagation. It is a supervised learning rule which means that it compares the output of a network with a desired output, the error, and corrects the network towards producing a closer value next time. The error is passed backwards through the network to assign credit to weights and alter them accordingly. The continuous activation function lets the backpropagation algorithm find partial derivatives and step-wise correct output errors by altering the weights of the network. These errors are further propagated backwards through the network to assign credit deeper into the network.
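The batch variant of backpropagation for a one-hidden-layer network can be sketched as follows. The toy task, network size and learning rate are all assumptions for illustration, not the thesis setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task standing in for the mail data: learn y = x1 * x2.
X = rng.uniform(-1, 1, size=(200, 2))
t = X[:, 0] * X[:, 1]

W1 = rng.normal(0, 0.5, size=(8, 2))  # input -> hidden weights (8 hidden units)
b1 = np.zeros(8)                      # hidden bias terms
W2 = rng.normal(0, 0.5, size=8)       # hidden -> output weights
b2 = 0.0                              # output bias
lr = 0.05                             # learning rate

def net(X):
    return np.tanh(X @ W1.T + b1) @ W2 + b2

mse_before = float(np.mean((net(X) - t) ** 2))

for epoch in range(2000):             # batch learning: one update per epoch
    H = np.tanh(X @ W1.T + b1)        # hidden activations
    err = H @ W2 + b2 - t             # pattern errors
    # Backpropagation: chain rule from the output error back through tanh.
    dW2 = err @ H / len(X)
    db2 = err.mean()
    dH = np.outer(err, W2) * (1 - H ** 2)  # tanh'(a) = 1 - tanh(a)^2
    dW1 = dH.T @ X / len(X)
    db1 = dH.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

mse_after = float(np.mean((net(X) - t) ** 2))
```

Each pass computes the partial derivatives of the error with respect to every weight and steps downhill, which is exactly the credit assignment the generalised delta rule performs.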


2.4 Related Work

During the research phase of this project, it appeared that no previous studies have been done on mail volume forecasting. This is therefore the first time such a problem is considered.

On the other hand, there are many available papers comparing different forecasting models. The rise of NNs has inspired many researchers to compare them to more traditional methods. NNs have been found promising in areas such as railway passenger demand forecasting [6], airline passenger and car sales forecasting [7], and even the Makridakis Competition data set [8].

Comparing an NN with a time-series forecasting model using values 1 through 13 months back in time as parameters, it was found that NNs can be superior in forecasting [7]. Furthermore, NNs perform as well as, and occasionally better than, statistical models [9]. The same study concluded that NNs performed especially well when non-linear elements are present. A subset of the same authors later found that NNs outperformed traditional statistical and human-judgement methods when forecasting monthly and quarterly time series data [8].

Another positive aspect of NNs is that they can be viewed as more robust than time series models, which are more sensitive to noise [7].

There are also some studies that find other models superior to NNs. For example, the Box-Jenkins algorithm seems to be more accurate than NNs in short-term forecasting [7]. Traditional statistical models were also found to be comparable to NNs on annual data [8]. These contradicting conclusions may imply that the problem is context specific: different data sets are best fit by different models.

On the RF side, some research has produced forecasts with high accuracy, one example being electricity load forecasting [10]. A comparative study measured higher accuracy from its RF technique than from an NN in predicting dementia progression [11].

Ease of use is another aspect to keep in mind when selecting a forecasting model that will be implemented in a finished product, a forecasting tool. There has been some research on which methods companies use and what is important to them when selecting a model. While many companies could benefit from using more advanced models rather than naive implementations for the sake of accuracy, ease of use is of great importance as well [12].


3 Methodology and Implementation

This chapter describes the data, features, models and implementation. Acquiring and preparing the appropriate data results in different input data sets and teachers for the models. The implementations of the models are then described, and their performance is improved by estimating optimal parameters and settings.

3.1 Data

To analyse this data as a time series problem, it is organised on a mail volume per day basis. The mail volume is measured in number of postal items, such as letters, envelopes, papers, magazines, et cetera.

BCMS wishes to make forecasts on the so-called destination level. There are currently five destinations into which BCMS has divided its coverage area: Stockholm, Malmö, Gothenburg, Örebro and Visby. Each of these covers an area of Sweden in the proximity of the city it is named after.

Further division of the total mail volume can be made by variety of mail. BCMS handles six different categories of mail, so-called mail segments. These are:

1. Administrative routines: Invoices, bank account statements, credit records, et cetera.

2. Mailbox sized packages: Trackable postal items.

3. Direct advertising: Addressed direct advertisements, campaigns and of- fers.

4. Office mail: Unsorted mail collected directly from BCMS's customers, i.e. it does not go through a post producer.

5. Unaddressed advertising: Free papers and civic information.

6. Magazines: Subscription papers that are delivered at least four times per year. Requires a Publication License.

3.1.1 Data Extraction

Data is extracted for the time period 1st of March 2014 until the 18th of September 2015. It contains all orders that were delivered in this time period. The information about each order includes delivery dates, split date, volume (number of postal items), destination, mail segment, item format, average weight, et cetera.

Since the sorting stations are centralised, a delay at one of them affects many distribution offices. Therefore, the focus is on making forecasts for the date on which the mail will be sorted, referred to in each order as its split date.


The destinations handle differing volumes because of the population differences of each area. However, the volume varies proportionally similarly over time, as shown in figure 3.1. The figure shows mail variations for each destination for the time period 2014-03-01 to 2015-09-18. A moving average has been applied to reduce complexity, and the result has been transformed to lie between 0 and 1. In reality, destination Stockholm handles much greater volumes than, for example, the Visby destination; however, the transformed volumes behave similarly over time. Therefore, the rest of this report focuses on the Stockholm destination and assumes that the forecasting models are similarly applicable to the other destinations.


Figure 3.1: The similarities in mail volume behaviour between the destinations.
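The transformation behind figure 3.1 (a moving average followed by rescaling to lie between 0 and 1) can be sketched as follows. The window length and the min-max scaling method are assumptions, since the exact details are not stated:

```python
import numpy as np

def smooth_and_rescale(volume, window=5):
    """Apply a moving average, then min-max scale the result to [0, 1].

    Mirrors the transformation described for figure 3.1; the window
    length here is an illustrative choice.
    """
    v = np.asarray(volume, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(v, kernel, mode="valid")  # moving average
    lo, hi = smoothed.min(), smoothed.max()
    return (smoothed - lo) / (hi - lo)               # scale to [0, 1]
```

After this transformation, destinations with very different absolute volumes can be compared on the same axis.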

The true total volume for destination Stockholm within the time period is displayed in figure 3.2. Non-operating days, such as weekends and public holidays, have been removed since they show a volume of 0. The graph shows a lowered volume during the two summers, as well as a slight increase before Christmas. There is also a lowered volume at the very end of the year. Further, there is a slight indication of a monthly pattern.

The mail segment of each postal item is also of interest. In contrast to the total volume (of one destination), as shown in figure 3.2, the same volume is split up into each of the six mail segments and plotted in the six graphs of



Figure 3.2: Total mail volume per day for destination Stockholm during the time period 2014-03-01 to 2015-09-18.

figure 3.3. It becomes apparent that the mail segment a postal item belongs to affects how it behaves in the time series. For example, the mail segment Administrative routines has a monthly pattern that is not present in the Direct advertising segment. It is therefore a good idea to analyse these segments separately.

The data is broken down into mail volume, measured in number of postal items, per split date and mail segment.

3.1.2 Training and Test Data

The one-year time period 1st of March 2014 to 1st of March 2015 is used to train the models. It consists of 248 data points once all non-operating days are removed. Each data point gives the mail volume, measured in number of postal items, for one day.

The test data, used to evaluate the trained models, consists of the following 1-26 days, i.e. the following four weeks. This effectively simulates standing at the 1st of March 2015 and making predictions for the following four weeks based on previous volumes. The forecasts 1 to 5 days into the future (2015-03-02 - 2015-03-06)


[Figure 3.3: six panels, one per segment: (a) Administrative routines, (b) Mailbox sized packages, (c) Direct advertising, (d) Office mail, (e) Unaddressed advertising, (f) Magazines.]

Figure 3.3: Mail volume measured in number of postal items per segment for destination Stockholm during the time period 2014-03-01 to 2015-09-18. Together, the six segments make up the total volume for the destination.


and 22 to 26 days into the future (2015-03-23 - 2015-03-27) will also be evaluated separately, as motivated in section 1.2. These will tell how good each model is at predicting short-term versus long-term mail volumes.
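The evaluation over the three horizons, using the MAPE and RMSE measures listed in chapter 4, can be sketched as follows. The function names are illustrative; MAPE and RMSE follow their standard definitions:

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs((a - f) / a)) * 100)

def rmse(actual, forecast):
    """Root Mean Squared Error."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((a - f) ** 2)))

def horizon_scores(actual, forecast):
    """MAPE over the three evaluation windows of a 26-day forecast.

    Day indices are 1-based as in the text: days 1-5, 22-26 and 1-26.
    """
    return {
        "1-5": mape(actual[0:5], forecast[0:5]),
        "22-26": mape(actual[21:26], forecast[21:26]),
        "1-26": mape(actual, forecast),
    }
```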

3.1.3 Features

For the RF and NN approaches, the key is to find a correlation between the inputs and the output, the mail volume. These inputs are feature parameters, which can be anything that changes with, affects or is otherwise related to the mail volume. For example, the mail volume for the Administrative routines segment seems to be highly related to the time of the month, as seen in figure 3.3a.

Feature Engineering

Finding and creating features requires expert knowledge and, to some extent, creativity. Some features are included because they are expected to be related to the behaviour of the mail volume (expert knowledge), while others are experimental (creativity).

The following features act as inputs for the NN and RF models. They will be evaluated and used in different combinations to improve prediction performance. The considered features are listed in table 3.1.

Variable name                           Value range   Description
Time index                              [0, ∞]        Days since 2014-03-01.
Day of the week                         [1, 5]        Day of the week; weekends removed.
Day of the month                        [1, 31]       Day of the month.
Day of the year                         [1, 365]      Day of the year.
Month                                   [1, 12]       Month index (January - December).
Time difference                         [1, 6]        Difference from the last operating day; mail accumulates when there are many public holidays in a row.
Summer                                  [0, 1]        Dummy variable: 1 between 15th of June and 31st of July, 0 otherwise.
1 week ahead booked volume              [0, ∞]        Booked mail volume for this date as known 1 week ago.
4 weeks ahead booked volume             [0, ∞]        Booked mail volume for this date as known 4 weeks ago.
1 week ahead confirmed booked volume    [0, ∞]        Booked mail volume labelled 'confirmed' for this date as known 1 week ago.
4 weeks ahead confirmed booked volume   [0, ∞]        Booked mail volume labelled 'confirmed' for this date as known 4 weeks ago.

Table 3.1: List of considered features. The features' respective importance (prediction capabilities) will be evaluated and included/excluded thereafter.

The Time index feature is incremented for each day since the first data point. It could possibly help the model find long-term trends through mapping the feature to a long-term increase or decrease in mail volume.

Day of the week, Day of the month and Day of the year describe the input's position within the week, month and year. Day of the week ranges from 1 to 5 since weekends are non-operating days. These features might help the models find weekly, monthly and yearly patterns, if they exist.

The feature Month is also used to detect possible yearly patterns.

Time difference is a feature that explores the effect of non-operating days, specifically several non-operating days in a row. The idea is that mail accumulates while it is not being handled.

The dummy variable Summer is set to 1 for dates between the 15th of June and 31st of July, and 0 otherwise. This feature is included to assist the model in learning the known phenomenon of the decrease in volume during summertime.
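As an illustration of how the calendar features above can be derived from a date (the thesis implementation is in Matlab and C#; the function and key names here are hypothetical), a Python sketch:

```python
from datetime import date

def calendar_features(d, prev_operating_day):
    """Derive the calendar features of table 3.1 for one operating day.
    Names are illustrative; the booked-volume features come from the
    booking system and cannot be computed from the date alone."""
    summer = 1 if date(d.year, 6, 15) <= d <= date(d.year, 7, 31) else 0
    return {
        "day_of_week": d.isoweekday(),                     # 1-5 once weekends are removed
        "day_of_month": d.day,
        "day_of_year": d.timetuple().tm_yday,
        "month": d.month,
        "time_difference": (d - prev_operating_day).days,  # grows over holiday runs
        "summer": summer,
    }

# Example: Monday 2014-06-16, preceded by the operating day Friday 2014-06-13
f = calendar_features(date(2014, 6, 16), date(2014, 6, 13))
```

For the example date, the weekend gives a time difference of 3 and the summer dummy is 1.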

Other known information that is available about the future is the booked volume for each day. Booked orders have different assurance codes, such as 'preliminary' or 'confirmed'. This information is included in the predictions using the last four features: 1 week ahead booked volume, 4 weeks ahead booked volume, 1 week ahead confirmed booked volume and 4 weeks ahead confirmed booked volume. Section 1.2 describes the scheduling routines of the company and that the time spans of interest are a few days into the future and three weeks into the future. Therefore, the features 1 week ahead booked volume and 4 weeks ahead booked volume describe the total booked volume as it was known 1 week ago and 4 weeks ago, respectively. 1 week ahead confirmed booked volume defines how much of this total 1 week ahead booked volume was labelled 'confirmed', and 4 weeks ahead confirmed booked volume does the same for 4 weeks ahead.

Feature Sets

Evaluating and selecting a subset of features is the first approach to ensuring prediction accuracy. Excluding non-contributing features reduces complexity and therefore the risk of overfitting. Furthermore, fewer features improve the model's runtime performance, since fewer calculations are needed.

The RF and the NN were trained on three subsets of features:

1. All features excluding booked volumes, i.e. Time index, Day of the week, Day of the month, Day of the year, Month, Time difference and Summer.

2. Same as 1, plus 1 week ahead booked volume.

3. Same as 1, plus 4 weeks ahead booked volume.

To see if booked volume labelled 'confirmed' further increases accuracy, the RF was also trained on:

4. Same as 2, plus 1 week ahead confirmed booked volume.

5. Same as 3, plus 4 weeks ahead confirmed booked volume.

Feature Importance Estimation

The features listed in table 3.1 will be evaluated using the OOB error in the RF model. Although there are more advanced and reliable feature selection methods (such as Principal Component Analysis and Support Vector Machines), the OOB error gives an indication of how important each feature is for producing an accurate output.

This method of estimating feature importance finds the increase or decrease in output accuracy by evaluating the different trees in the forest. Each tree is trained on a different sample of the inputs, the so-called in-bag data. The data points that are not included are called out-of-bag. The OOB error, therefore, is a tree's error on training data that it has not been trained on.

Each feature is evaluated by calculating the error increase as the values of that feature are permuted across the OOB data. This rearrangement should not affect the error if the feature is of low importance, so the magnitude of the error increase gives an indication of how important each feature is.
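The permutation idea can be sketched in a few lines of Python (the thesis uses Matlab's built-in RF machinery; here `predict` stands in for a single trained tree and the data is a toy example):

```python
import random

def permutation_importance(predict, X, y, feature_idx, seed=0):
    """Estimate one feature's importance as the error increase after
    shuffling that feature's column across the held-out (OOB) rows."""
    def mse(rows):
        return sum((predict(x) - t) ** 2 for x, t in zip(rows, y)) / len(y)
    base = mse(X)
    rng = random.Random(seed)
    col = [x[feature_idx] for x in X]
    rng.shuffle(col)
    X_perm = [x[:feature_idx] + [v] + x[feature_idx + 1:] for x, v in zip(X, col)]
    return mse(X_perm) - base  # large increase => important feature

# Toy check: the model only uses feature 0, so shuffling feature 1 changes nothing.
X = [[1.0, 9.0], [2.0, 8.0], [3.0, 7.0], [4.0, 6.0]]
y = [2.0, 4.0, 6.0, 8.0]
model = lambda x: 2.0 * x[0]
unimportant = permutation_importance(model, X, y, feature_idx=1)
```

An unused feature yields an importance of exactly zero, while a feature the model relies on yields a non-negative error increase.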

3.1.4 Data Preparation

To achieve the best possible results, the data needs pre-processing. This also helps the models learn faster.

The data in this study is of high quality, containing very few erroneous values. Therefore, the only removed data points are weekends and public holidays, where the mail volume is always zero.

The mail volume and each input feature are normalised using min-max scaling, to lie in the range [0, 1]. The minimum and maximum values are taken from the time period 2014-03-01 to 2015-03-01, i.e. the simulated previous one-year period.
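In code the scaling is a one-liner; note that values observed after the training year can fall outside [0, 1] because the minimum and maximum come from the training period only (a sketch, not the thesis implementation):

```python
def minmax_scale(values, lo, hi):
    """Min-max scaling into [0, 1]. lo and hi are taken from the
    training period (2014-03-01 to 2015-03-01), not from the data
    being scaled, so later inputs may map outside [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

train = [200.0, 500.0, 800.0]
scaled = minmax_scale(train, lo=min(train), hi=max(train))
```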

3.1.5 Performance Measurement

The Root Mean Squared Error (RMSE) is a common way of measuring how well a model performs. The squaring in this measure penalises far-off forecasts heavily. Since the outputs of these models will be interpreted by a human and schedules made accordingly, a model that occasionally outputs far-off values will not be trusted; the RMSE is therefore preferred in this case. The RMSE for outputs \hat{y}_i when the actual outcomes are y_i is defined in equation 3.1.

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \quad (3.1)

To complement the RMSE, the Mean Absolute Percentage Error (MAPE) will also be measured for each model. The MAPE measures how close to the actual values the forecasting model generally performs. Equation 3.2 defines the MAPE for predictions \hat{y}_i when the actual values are y_i.

\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\% \quad (3.2)
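Both error measures translate directly into code; a Python sketch:

```python
import math

def rmse(y, y_hat):
    """Root Mean Squared Error (equation 3.1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mape(y, y_hat):
    """Mean Absolute Percentage Error (equation 3.2).
    Undefined if an actual value is zero; zero-volume days
    (weekends, public holidays) are removed from the data."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)

actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
```

For the example, the absolute percentage errors are 10 %, 10 % and 0 %, giving a MAPE of 20/3 ≈ 6.7 %.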

3.2 Auto Regression

The AR model is a simple time series model in which the output at time t, y_t, depends on the p previous values, as described in equation 3.3. Its parameters, \phi_i, are estimated using least squares. The noise term \varepsilon_t is assumed to be white noise.

y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \varepsilon_t \quad (3.3)

The capabilities of the AR model were measured in Matlab, using the Statistical Learning toolbox.
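For intuition, the least-squares fit of an AR(1) model reduces to simple linear regression of y_t on y_{t-1}; a minimal Python sketch (the thesis fits higher-order models with Matlab, so this is purely illustrative):

```python
def fit_ar1(y):
    """Least-squares fit of y_t = c + phi * y_{t-1} + e_t (AR(1))."""
    x, t = y[:-1], y[1:]
    n = len(x)
    mx, mt = sum(x) / n, sum(t) / n
    phi = (sum((a - mx) * (b - mt) for a, b in zip(x, t))
           / sum((a - mx) ** 2 for a in x))
    c = mt - phi * mx
    return c, phi

# A noiseless series generated by y_t = 1 + 0.5 * y_{t-1} is recovered exactly.
series = [0.0]
for _ in range(10):
    series.append(1.0 + 0.5 * series[-1])
c, phi = fit_ar1(series)
```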

3.3 Random Forest

A RF approach was implemented in Matlab using the built-in TreeBagger function. This allowed getting results quickly and left more room for experimenting with the number of trees, the maximum number of leaves and the parameter subset size.

The number of trees should be large, ranging from a hundred to thousands.

The model showed no indication of overfitting when experimenting with the number of trees in the range 100-10,000. The original author confirms that overfitting should not be a problem [1]. However, the model stopped improving in OOB prediction performance once the number of trees surpassed around 500. Therefore, the number of trees was set to 500 to avoid excess calculations.

The OOB error, as mentioned in section 2.2, was measured for different values of the maximum number of leaves. 5 leaves per tree proved to be sufficient.

3.4 Multilayer Perceptron

A fully connected feedforward MLP was implemented in C#. The program represents a NN of three layers, one input, one hidden and one output layer, each of arbitrary size, similar to the NN in figure 2.2.

The propagation rule in equation 3.4 states that the output o_i of a unit i, multiplied by the weight w_{ij} that connects units i and j, constitutes an input \mathrm{input}_{ji} for the unit j in the next layer.

\mathrm{input}_{ji} = o_i w_{ij} \quad (3.4)

The net input \mathrm{net}_j of a unit j, however, depends on all units i in the previous layer of size N and the bias unit \theta_j, as shown in equation 3.5.

\mathrm{net}_j = \sum_{i=1}^{N} \mathrm{input}_{ji} + \theta_j \quad (3.5)

The non-linear activation function used in this implementation is defined in equation 3.6.

\mathrm{activation} = \frac{1}{1 + e^{-\mathrm{input}}} \quad (3.6)

The output rule in equation 3.7 defines the output of a unit to be its activation.


\mathrm{output} = \mathrm{activation} \quad (3.7)

The behaviour of one unit in the implemented MLP is illustrated in figure 3.4, along with the propagation, activation and output rules. In this case, the propagation rule defines the unit's input to be the sum of the products of each previous unit's output and its connection weight (equation 3.4). The activation rule in this example is a logistic activation function (equation 3.6), and the output of a unit is its activation (equation 3.7).


Figure 3.4: Shows how inputs turn into an output of a unit in a Multilayer Perceptron (MLP) via the propagation rule, the activation rule and the output rule.
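A single unit's computation, equations 3.4-3.7 combined, can be sketched in Python (the thesis implementation is in C#; this is an illustrative translation):

```python
import math

def unit_output(prev_outputs, weights, bias):
    """Propagation rule: the net input is the weighted sum of the previous
    layer's outputs plus the bias unit (equations 3.4-3.5). The activation
    is logistic (equation 3.6) and the output equals the activation (3.7)."""
    net = sum(o * w for o, w in zip(prev_outputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net))

# The weighted contributions cancel here, so net = 0 and the output is 0.5.
out = unit_output([1.0, 0.5], [0.4, -0.8], bias=0.0)
```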

The number of input units depends on the number of inputs, or input features, that the network learns from. There is always one input unit per input feature. Only one output unit is required for learning the mail volume data, representing the mail volume associated with the current input.

The number of hidden layers was kept at one to avoid unnecessary computation time and risk of overfitting. More hidden layers will rarely improve results [13].

The number of hidden units was varied from equal to the number of input units down to one or two fewer. Fewer hidden units force the network to squeeze its representations through the hidden layer. This complexity reduction can improve generalisation and reduce overfitting. However, too few hidden units could leave the network without enough representational power.

3.4.1 The Backpropagation Algorithm

The implementation uses the backpropagation algorithm, also known as the Generalised Delta Rule, to teach the network. It is a gradient descent approach that defines how the connecting weights in the NN are altered to reduce the output error relative to the teacher, the desired output.

The weights are updated as defined in equation 3.8, where Δw_{ij} is the weight change (positive or negative) for the weight that connects units i and j, η is the learning constant, δ_j is the error term for unit j and o_i is the output of unit i. The learning constant η is typically set to around 0.1.

\Delta w_{ij} = \eta \delta_j o_i \quad (3.8)

The error term δ_j is calculated as described in equation 3.9. Here, t_j is the target, or desired, output of unit j and o_j is the actual output. f'_j(\mathrm{net}_j) is the derivative of the activation function and \mathrm{net}_j is the net input of unit j, as defined in equation 3.5.

\delta_j = (t_j - o_j) f'_j(\mathrm{net}_j) \quad (3.9)

The backpropagation algorithm can also utilise momentum, defined by the momentum constant, for faster learning. The momentum constant also reduces the risk of the algorithm getting stuck in a local minimum. Momentum means that the weight update from the previous epoch Δw_{ij}(n-1) affects the change in weight strength for the current epoch Δw_{ij}(n), as described in equation 3.10. α is the momentum constant, which lies between 0 and 1, typically 0.9.

\Delta w_{ij}(n) = \eta \delta_j o_i + \alpha \Delta w_{ij}(n-1) \quad (3.10)
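For the logistic activation in equation 3.6, the derivative simplifies to f'(net) = o(1 − o), so the update for an output unit can be sketched as follows (illustrative Python; the thesis code is C#):

```python
def output_delta(target, output):
    """Error term (equation 3.9) for a logistic output unit,
    using f'(net) = output * (1 - output)."""
    return (target - output) * output * (1.0 - output)

def weight_update(eta, delta_j, o_i, alpha, prev_dw):
    """Generalised delta rule with momentum (equation 3.10)."""
    return eta * delta_j * o_i + alpha * prev_dw

d = output_delta(1.0, 0.8)  # (1 - 0.8) * 0.8 * 0.2 = 0.032
dw = weight_update(eta=0.1, delta_j=d, o_i=0.5, alpha=0.9, prev_dw=0.01)
```

With η = 0.1 and α = 0.9, the momentum term 0.9 · 0.01 dominates the fresh gradient term 0.1 · 0.032 · 0.5 in this example.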

3.5 Avoiding Overfitting with Cross-Validation

The overfitting phenomenon is crucial to understand when applying machine learning. Overfitting happens when a model learns a training set too well, i.e. it is trained too much on the same data. This implies that the model has not learned the general correlation between the features and the output, but rather acts like a look-up table of the specific training data. The model therefore loses the ability to generalise to new cases that it has not yet been introduced to. This phenomenon is illustrated in figure 3.5: a model is trained on a training data set, so the training data population error keeps decreasing, while the population error on the test data set, which the model is not trained on, eventually starts increasing. The model is then losing its ability to generalise, i.e. overfitting the training data.

Since the RF has its OOB error estimate, there is no need to perform cross-validation for the RF model.

A special form of cross-validation is needed for the AR model. We cannot shuffle or randomly pick out validation data points, since the model needs the data for each step in chronological order. Instead, the data is split into 10 ordered and numbered segments. The model's parameters are then estimated on different training segments and evaluated on the remaining validation segments, as follows:

1. Training segment 1, validation segments 2-10
2. Training segments 1-2, validation segments 3-10
3. Training segments 1-3, validation segments 4-10
4. Training segments 1-4, validation segments 5-10
5. Training segments 1-5, validation segments 6-10
6. Training segments 1-6, validation segments 7-10



Figure 3.5: Illustration of the overfitting phenomenon.

7. Training segments 1-7, validation segments 8-10
8. Training segments 1-8, validation segments 9-10
9. Training segments 1-9, validation segment 10

The average error over all of the validation data is the estimated error of the model. Different numbers of parameters are evaluated using this method, and the estimated optimal number of parameters is the one that minimises this error.
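The nine chronological train/validation splits can be generated programmatically; a sketch:

```python
def expanding_window_splits(n_segments=10):
    """Return the (training, validation) segment lists used for the AR
    model: train on segments 1..k, validate on k+1..n, for k = 1..n-1.
    Chronological order is preserved; nothing is shuffled."""
    return [
        (list(range(1, k + 1)), list(range(k + 1, n_segments + 1)))
        for k in range(1, n_segments)
    ]

splits = expanding_window_splits()
```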

Estimation of the MLP’s optimal input features, number of hidden units and learning rate is made with k-fold cross-validation. Therefore, the model’s complexity is reduced to protect the model from overfitting the training data.

k-fold cross-validation splits the training data into k parts, or folds. Each of these parts is used as validation data for a model trained on the remaining k - 1 parts. This way, all of the data is used in both training and validation. Once k models have been trained on different data, the average validation error is the estimate of how good the model is. The optimal behaviour is found by applying cross-validation systematically over different input features, architectures and parameters. The model can then be trained on the complete training set once the preferred settings have been found.
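The k-fold index bookkeeping can be sketched as follows (illustrative Python; the thesis's C# implementation is not shown, and the strided fold assignment here is one of several valid choices):

```python
def kfold(indices, k):
    """Yield (training, validation) index lists: each of the k folds is
    used once as validation while the other k - 1 folds form training."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j in range(k) if j != i for x in folds[j]]
        yield training, folds[i]

pairs = list(kfold(list(range(10)), k=5))
```

Every data point appears in exactly one validation fold, so all data contributes to both training and validation across the k runs.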


3.6 Limitations

This is a study of the possibility of producing sufficiently accurate forecasts of mail volume in the BCMS context. Therefore, the implementations are prototypes rather than finished, continuously learning (updating) production models. Further, the models perform predictions 1-26 days into the future from the simulated date 2015-03-01 and are trained on the time period 2014-03-01 to 2015-03-01. This way, all models are trained on the same data and can be compared against the actual outcome and against each other. However, it would be interesting to apply these models and perform predictions on other dates. Another interesting aspect would be to explore how longer training periods affect, and possibly increase, prediction performance, and whether older data points would then need to be weighted by their importance.

The study is also limited to three models: AR, RF and NN. There are other possible methods, such as Multiple Linear Regression, Auto Regressive Integrated Moving Average (ARIMA) and Support Vector Machines (SVM).

The booked volume features are also limited in two ways. First, they give the total booked volume rather than the booked volume per segment. Second, they are limited to the booked volume 4 weeks in advance and 1 week in advance. These could be updated each day as the date comes closer, but this was not implemented for simplicity.


4 Results

The following sections describe each feature's importance and the optimal usage of each model, i.e. its parameter settings, after which the actual prediction results of each model are presented.

The goal is to achieve as good accuracy as possible; however, as long as a model performs better than the current model, it is an improvement. The current forecasting model's predictions have been logged in archives, allowing them to be evaluated on the same time period as the proposed models are tested on. Given the standpoint of the 1st of March 2015, the model that is currently in use predicted the future as drawn in figure 4.1a. This represents a MAPE of 129.08 % for the first five days, 86.44 % for 22-26 days into the future and 85.15 % for 1-26 days overall. These figures are used for comparison against the proposed models in section 4.1.

Each model is trained both on the total volume and on each segment separately. The models are also trained using different parameter settings, and the NN and RF are trained on several different selections of features. These selections and adjustments are described in more detail in section 4.2.

4.1 Prediction Results

The performance of each model is presented in table 4.1. The table contains the MAPE and RMSE for predictions made 1-5 days into the future, 22-26 days into the future, and overall 1-26 days into the future, for each model. The table also contains these error measurements for the forecasting model that the company currently uses to assist in scheduling. All predictions are made from the simulated date 2015-03-01, based on training data from 2014-03-01 to 2015-03-01.

Table 4.1 provides an overview of how the models compare against each other and the current model. The baseline AR model in this project notably outperforms the model that is currently in use, with an overall MAPE of 32.23 % compared to the current model's 85.15 %. Furthermore, the RF and NN approaches improve the forecasts further, reaching around and below 20 %.

The mail segments, such as Administrative routines, Direct advertising, et cetera, have for some approaches had one model trained on each separately. These approaches are in practice six parallel models that produce one output per mail segment; the sum of these outputs makes up the total prediction. The table indicates a prediction improvement when the volume is split up per mail segment for the AR model, reducing the overall MAPE from 32.23 % to 24.28 %. However, in some cases the split can confuse the model and make it perform worse overall, as in the case of RF A (19.64 %) versus RF B (20.83 %). In the NN case, there is a slight improvement from NN A (21.65 %) to NN B (19.06 %). These results are further detailed in section 4.1.1.

Table 4.1 also shows that the booked mail volume can be used to further improve the forecast accuracies. Both the RF and NN predictions improve when
