
School of Education, Culture and Communication

Division of Applied Mathematics

BACHELOR THESIS IN MATHEMATICS / APPLIED MATHEMATICS

Using supervised learning methods to predict the stop duration of heavy vehicles

by

Emiel Oldenkamp

Bachelor thesis in mathematics / applied mathematics

DIVISION OF APPLIED MATHEMATICS

MÄLARDALEN UNIVERSITY


School of Education, Culture and Communication

Division of Applied Mathematics

Bachelor thesis in mathematics / applied mathematics

Date:

2020-09-22

Project name:

Using supervised learning methods to predict the stop duration of heavy vehicles

Author(s):

Emiel Oldenkamp

Supervisor(s):

Christopher Engström & Fredrik Husander

Reviewer: Thomas Westerbäck

Examiner: Sergei Silvestrov

Comprising: 15 ECTS credits


Abstract

In this thesis project, we attempt to predict the stop duration of heavy vehicles using data based on GPS positions collected in a previous project. All of the training and prediction is done in AWS SageMaker, and we explore the possibilities offered by Linear Learner, K-Nearest Neighbors and XGBoost, all of which are explained in this paper. Although we were not able to construct a production-grade model within the time frame of the thesis, we were able to show that the potential for such a model exists given more time, and we propose several paths one can take to improve on the endpoint of this project.

Acknowledgements

First off, I would like to thank Christopher Engström, Senior Lecturer at MDH and my supervisor from MDH, for the advice he offered throughout the entire process of writing this thesis, and for his patience when the specifics of this project were unclear for a long time due to the outbreak of the COVID-19 pandemic.

I would also like to thank Fredrik Husander, Senior Manager Contracted Services & Warranty at Scania and my supervisor from Scania, for being available to answer questions at any time, and for providing me with an interesting project to work on, even though this was anything but an easy task at the time.

This thesis would not have been possible without the constant help from Tingwei Huang and Gustav Rånby, both Data Scientists at Scania. The questions they answered, the help with debugging, and the general guidance regarding Scala, Spark, and AWS have proven invaluable, and I am extremely thankful for their help.

Lastly, I would like to thank Staffan Vildelin, Director of Shared IT at Scania, for helping me find a project at Scania to write this thesis on.


Contents

List of Abbreviations
List of Figures
1 Introduction
2 Literature Review
2.1 FUMA
2.2 Machine Learning
2.3 Hyperparameter tuning
2.3.1 Grid Search
2.3.2 Random Search
2.3.3 Bayesian Search
3 Methods
3.1 Linear Learner
3.2 K-Nearest Neighbors
3.3 XGBoost
4 Application
4.1 Setup
4.1.1 The dataset
4.1.2 The pipeline
4.2 Performance metrics
4.3 Features
4.4 Feature selection
4.5 Removing extreme values
4.6 Model selection
4.7 Hyperparameter tuning
5 Results and Analysis
5.1 Feature selection
5.2 Removing extreme values in the training set
5.3 Model selection
5.4 Training on the full dataset
5.5 Removing outliers
5.6 Analysis
6 Discussion
7 Conclusion
A Results
A.1 Feature Selection
A.2 Removing outliers
A.3 Model selection
A.4 Training on full dataset


List of Abbreviations

ANN Artificial Neural Network

AWS Amazon Web Services

CI/CD Continuous Integration/Continuous Delivery

CSV Comma-Separated Values

DT Decision Tree

FCC Fraunhofer-Chalmers Centre

FUMA Fleet telematics big data analytics for vehicle Usage Modeling and Analysis

GD Gradient Descent

KNN K-Nearest Neighbors

LL Linear Learner

LR Linear Regression

MAE Mean Absolute Error

ML Machine Learning

MSE Mean Squared Error

POI Point Of Interest

RMSE Root Mean Square Error

S3 Simple Storage Service

SGD Stochastic Gradient Descent


List of Figures

2.1 Overfitted model vs. regularized model

2.2 Random Search vs. Grid Search

3.1 Gradient Descent training process

3.2 XGBoost prediction

4.1 Spark job workflow

4.2 Successfully executed step function


Chapter 1

Introduction

Over the years 2016 to 2019, a group at Scania worked on a project called Fleet telematics big data analytics for vehicle Usage Modeling and Analysis (FUMA). FUMA was a Vinnova project in collaboration with the Fraunhofer-Chalmers Centre (FCC), for which a lot of data was collected in the form of GPS coordinates of vehicles at regular intervals. The GPS data was collected by Scania's 200,000+ connected vehicles, which continuously send their GPS information to the company. This data does, however, need to be manipulated into another level of abstraction before conclusions can be drawn from it. The goal was to create information on both the usage mode of the vehicles and the transport network in general based on GPS positions using big data calculations.

The information extracted from the calculations was then used to describe the operation of a vehicle and identify transport networks in a large fleet of vehicles. Lastly, this information had to be visualized in a modern and understandable way.

As the quantity of data available to Scania is constantly increasing, the scalability of the algorithms is of high importance. Therefore, a large emphasis was put on streaming methods. This makes it an ideal candidate for an implementation using Amazon Web Services (AWS).

It was a successful project, and its code is currently in use at Scania. They were able to extract certain Points Of Interest (POIs) where many stops occurred. Using these POIs as the nodes, they were able to create a transport graph which describes the behavior of the vehicles. A clear distinction could be made between the different kinds of vehicles this way. A more general analysis of the graph was also done, leading to the discovery of a close connection between certain POIs.

The goal of this thesis project is to use supervised learning methods in AWS SageMaker on the previously acquired data from the FUMA project to create a model that is able to predict the duration of a stop at a certain POI.


Chapter 2

Literature Review

2.1 FUMA

A prior study [6] connected to the FUMA project was conducted by Fraunhofer-Chalmers Centre (FCC) and Scania. In this study, the researchers analyzed the movements of certain vehicles and showed that the positional information that the vehicle sends out can be used to determine the use of said vehicle. Among other things, this kind of analysis can be used to establish whether an owner is using the right kind of vehicle for their purpose.

The goals of the FUMA project were summarized in three parts. The first one encompasses exploring and understanding the applicability of large scale computation methods. The second goal is to study and understand the movement of vehicles and thereby learn what they are used for. Lastly, they wanted to explore and use modern methods to visualize these large quantities of data. All three of these goals were achieved, and the results of the first two are currently in use at Scania. The results of the last goal have improved understanding and created insights into how to visualize geospatial data. The software that was developed for this purpose has been used in several presentations and workshops to spread the knowledge that was collected. [17]

2.2 Machine Learning

Probably the most well known and basic definition of Machine Learning (ML) originates from a 1959 article by A.L. Samuel [14]: "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed". This article is where the term was coined and it is thus seen as the first article in this field. In it, the concept was explained in the form of a program that could play checkers. In non-trivial problems, programs that present solutions can quickly become a long list of complex rules. Instead of programming a large number of rules explicitly by hand, ML works through algorithms which make a computer learn what rules or decisions will lead to the wanted outcome.

Given how important data, and what we can learn from it, is in today's industry environment, the automation of this process is extremely valuable. It does not only have an impact on the time that it takes to solve these problems, but also generally leads to better results, more readable and maintainable programs, and adaptable algorithms which can learn from new data. [7]

There are two major categories within ML algorithms: supervised learning and unsupervised learning. While supervised learning uses data which is labelled, unsupervised learning does not. Said labels refer to the target variables of the data entries, the characteristics which we wish to predict. The nature of these goal variables divides supervised learning further into classification and regression, the former referring to problems where we attempt to put instances in the correct class, while the latter has to do with predicting a numerical value [7].

Looking at our problem, we are dealing with a supervised learning problem, more specifically, a regression problem.

There are three ML algorithms that we will use in this project: Linear Learner (LL), K-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGBoost). We came to the conclusion that these are the most applicable, as we want an algorithm that can deal with large amounts of data, outputs a scalar response, and, of course, gives a prediction for the goal variable which is as accurate as possible. As some algorithms are built into SageMaker, while others need to be implemented by the user, the degree of availability of algorithms in AWS SageMaker was also a deciding factor due to time constraints. The three algorithms we went with are all built into SageMaker. [1]

Another important concept in ML is the bias-variance tradeoff. A bias error refers to an error in the predictions of a ML model caused by the model making wrong assumptions. When a model makes many mistakes of this kind, it is caused by the model not having picked up on important relations between features and output. Such a model is said to have high bias and is underfitted. On the other hand, a variance error refers to a ML model making mistakes by being molded too closely to the training data, and thereby not generalizing well to new entries in the test data. Such a model is said to have high variance and is overfitted. These two phenomena create a continuum called the bias-variance tradeoff, which refers to having to find a balance between bias and variance, as one has to decrease if one wishes to increase the other, and vice versa. Regularization is a broad term for techniques that increase bias and decrease variance in ML models to counter overfitting. Overfitting and an example of the possible effects of regularization techniques are visualized in figure 2.1. [7]

Figure 2.1: Overfitted model vs. regularized model.

2.3 Hyperparameter tuning

The vast majority of ML algorithms need so-called hyperparameters to be set before commencing with a training job. The term hyperparameters, as it pertains to ML algorithms, refers to all parameters that are not determined by the model itself and thus need to be set by the user prior to training [12]. The chosen values can have a substantial effect on the performance of the model, so determining appropriate values is paramount when optimal performance is desired [5]. Due to the importance of these values, there has been a lot of interest in finding strategies to optimize the hyperparameters as efficiently and as well as possible. Automated strategies now usually outperform the classical approach of handpicking and tweaking by an expert [3].

The strategies to find optimal hyperparameter values, or so-called hyperparameter tuning strategies, that are included in AWS SageMaker and thus are relevant in the scope of this project, are Grid Search, Random Search, and Bayesian Optimization.

2.3.1 Grid Search

Grid Search is nothing more than a slightly faster and more user-friendly approach to trial and error. Instead of running several training jobs one after another, each time configuring the hyperparameters manually, one can pass a set of different values for each hyperparameter one wishes to tune to a Grid Search function. The function then runs a training job for each possible combination of parameters that has been passed to it (possibly in parallel to speed up the process), and then outputs the values which minimized a predetermined cost function compared to the other combinations [7].
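As a concrete illustration, the sketch below implements this exhaustive loop in Python. It is a minimal sketch, not the SageMaker implementation: the train_and_score callable (which would train a model with the given hyperparameters and return a validation cost such as RMSE) and the example parameter names are assumptions.

```python
import itertools

def grid_search(train_and_score, param_grid):
    """Run one training job per hyperparameter combination and return the
    combination that minimizes the validation cost.

    param_grid maps each hyperparameter name to a list of candidate values;
    train_and_score is a hypothetical callable that trains a model with the
    given hyperparameters and returns a validation cost such as RMSE.
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical usage:
# best, score = grid_search(train_and_score,
#                           {"eta": [0.01, 0.1, 0.2], "max_depth": [3, 5, 7]})
```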

2.3.2 Random Search

Very similar to Grid Search, Random Search works by passing a range of values for each hyperparameter one wishes to tune to a Random Search function. The function then runs a set number of training jobs using random combinations of values within the ranges and outputs the values which minimize a predetermined cost function.
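The loop below sketches this in Python under the same assumptions as the Grid Search sketch above: train_and_score is a hypothetical callable returning a validation cost, and each range is a continuous interval sampled uniformly.

```python
import numpy as np

def random_search(train_and_score, param_ranges, n_jobs=30, seed=0):
    """Sample each hyperparameter uniformly from its (low, high) range,
    run n_jobs training jobs and keep the best combination."""
    rng = np.random.default_rng(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_jobs):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in param_ranges.items()}
        score = train_and_score(params)  # hypothetical training call
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```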

Although at first glance Grid Search may seem like a preferable option due to being able to control the spacing of the used values, Grid Search tends to only be preferred when the hyperparameter search space is very small [7]. It usually gets massively outperformed by Random Search in terms of efficiency, while Random Search performs as well or better. This can be attributed to the low effective dimensionality of the function of interest.

Consider a function $f$ with the hyperparameter values $X_a = \{x_1, x_2, \ldots, x_n\}$ as independent variables and the chosen performance metric of the model (for example Root Mean Square Error (RMSE)) as dependent variable. In high-dimensional spaces, this function tends to have a low effective dimensionality, meaning that the function can be closely approximated by another function which only takes a strict subset of the independent variables of $f$ as independent variables. Let this function be $g$, which takes $X_b$ as independent variables with $X_b \subset X_a$, for which $g(X_b) \approx f(X_a)$. Using Grid Search on all hyperparameters in this scenario will most likely produce inefficient coverage of the dimensions which produce the most change. Random Search solves this issue, as is visualized in the figure below [2].

Figure 2.2 below shows a visualization of a simplified scenario in which we are tuning 2 hyperparameters, one of which has a large effect on the desired performance metric (important), the other one barely having an effect at all (unimportant). Because of the extremely low impact of the unimportant hyperparameter, we can approximate our performance metric with the effect of the important hyperparameter on the chosen performance metric.

Figure 2.2: Random Search vs. Grid Search.

2.3.3 Bayesian Search

As we find ourselves in a situation where evaluation of the function we wish to optimize can be very computationally expensive, it only makes sense to employ some computational power in order to make informed decisions about what values to test next. If successful, this could lead to substantially cutting down the computational time and thus costs, while still optimizing results. This is where Bayesian Search comes in.

Bayesian Search treats the tuning of hyperparameters like a regression problem. Similar to Random Search, ranges are provided for the hyperparameters we wish to tune, and the first batch of training jobs uses random guesses. The next iterations use the information from all previously completed training jobs to test values which have a high likelihood of improving the performance of the model. This can range from picking values close to the best training job thus far, to picking values far from anything that was tested before. The decision-making behind these choices is led by an AWS SageMaker implementation of Bayesian Optimization [1].

Bayesian Optimization, as described originally in [10], typically works by assuming that the function of interest is sampled from a Gaussian Process, creating a posterior distribution using the results of the training jobs. Optimization of either the Expected Improvement compared to the best training job thus far or the Gaussian process upper confidence bound is then used to pick values for the following iterations [16].

We chose to use the Bayesian Search method because in the past it has been shown to take less time to reach similar results compared to Random Search [11], [18], thus lowering the costs while maintaining results. Although there is no certain way of determining which strategy will be optimal for a specific problem, trends seem to support Bayesian Search more and more. Most likely, this is caused by it relying less on chance and thereby offering a higher degree of consistency, along with it being a better fit for the increasingly complex ML algorithms and massive datasets.
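To make the loop concrete, the sketch below implements a toy one-dimensional Bayesian search with a Gaussian Process surrogate and the Expected Improvement criterion, using scikit-learn and SciPy. It is only an illustration of the idea, not the SageMaker implementation; the objective callable and all settings are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, best_y):
    """Expected improvement over the best observed value (we minimize)."""
    mu, sigma = gp.predict(candidates, return_std=True)
    improvement = best_y - mu
    z = improvement / np.maximum(sigma, 1e-9)
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

def bayesian_search(objective, low, high, n_init=3, n_iter=10, seed=0):
    """Fit a GP to the results so far and evaluate the candidate with the
    highest expected improvement in each iteration."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n_init, 1))
    y = np.array([objective(x[0]) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        candidates = rng.uniform(low, high, size=(256, 1))
        x_next = candidates[np.argmax(expected_improvement(candidates, gp, y.min()))]
        X = np.vstack([X, [x_next]])
        y = np.append(y, objective(x_next[0]))
    return X[np.argmin(y), 0], y.min()
```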


Chapter 3

Methods

3.1 Linear Learner

The LL regressor algorithm is an AWS SageMaker implementation of a Linear Regression (LR) algorithm. Thus, predictions are made by computing a weighted sum of the input features and a constant bias term. A prediction made by the LL algorithm is illustrated in 3.1

$$\hat{y} = \theta^T \cdot x \tag{3.1}$$

with $\hat{y}$ the predicted value, $\theta$ the feature weight vector starting with the bias term, and $x$ the feature value vector with 1 as the first entry. [7]

In other words, for a dataset with n features and one label column, the model that LL creates describes a hyperplane in an (n + 1)-dimensional space. Each feature is represented by one dimension, in addition to one dimension for the target value.

Once trained, the LL algorithm can give fast predictions on new data as the computational complexity is very low.

To find optimal feature weights and bias term, or "train" the algorithm, we have to minimize a cost function. The most commonly used cost function is shown in equation 3.2. It is the sum of the squared residuals (the differences between the predicted values and the actual values), scaled with an additional factor $\frac{1}{2}$. The reason for the additional factor will become apparent later.

$$l(\theta) = \frac{1}{2}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \tag{3.2}$$

with $\hat{y}_i$ the predicted target value for the $i$-th entry in the data set and $y_i$ the actual target value. This is also the cost function used by the LL implementation in SageMaker.

Typically, LR models are trained using Gradient Descent (GD), originally described by Cauchy in 1847 [4]. When using a cost function like 3.2, it is also possible to use the Least Squares method for an exact solution, but its calculations quickly get out of hand when dealing with large datasets with many features. GD is an algorithm created to approach the minimum of a cost function in a more efficient manner than calculating it analytically. Another benefit of using GD is that it is technically applicable to any cost function, although its performance can greatly vary according to which cost function is chosen. Ideally, the cost function is a convex function with only one local minimum. An example of such a function is 3.2.

GD starts off by calculating the gradient of the cost function, which cancels the aforementioned $\frac{1}{2}$ factor if 3.2 is used. Thus, there will be one derivative for every feature weight parameter in vector $\theta$. Next, it picks starting values for the feature weights; these can either be random or some kind of initial guess. The chosen values are then plugged into the gradient. The gradient of the function at the chosen point is then scaled by a learning rate to determine how large a step the algorithm will take. The learning rate is a hyperparameter set in advance to control the step size of GD. The algorithm then descends along the gradient, arriving at a new point, calculated by subtracting the step sizes from their respective feature weights. These new feature weights are in turn plugged into the gradient and the process repeats itself.

Figure 3.1: Gradient Descent training process.

Note that the step size decreases as GD approaches the minimum due to the slope decreasing. This way, the algorithm is more efficient than if we were to use a very small step size for every iteration, and performs better than if we were to use a large step size every time.

Even though GD was created as a more efficient solution, it still becomes slow when dealing with the massive datasets that we see more and more of today. To solve this problem, a slight variation on GD, based on the methods described in [13], was created, called Stochastic Gradient Descent (SGD). This also happens to be the method that SageMaker uses, albeit in a distributed implementation.

SGD is extremely similar to normal GD, except instead of taking into account the loss for every point in the training set in every iteration, it only uses a random sample of the entries, called a mini batch, of a size set prior to training. This greatly increases the efficiency of the algorithm, without suffering much in performance if the mini batch size is set appropriately.
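The sketch below shows what such a mini-batch SGD update for the linear model of equation 3.1 and the cost function 3.2 could look like in plain NumPy; it is a minimal illustration rather than SageMaker's distributed implementation, and the default values are arbitrary assumptions.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, n_epochs=50, batch_size=32, seed=0):
    """Mini-batch SGD for linear regression with the squared-error cost 3.2.

    X is an (n, d) feature matrix; a column of ones is prepended so that the
    first entry of theta plays the role of the bias term in equation 3.1.
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    theta = np.zeros(Xb.shape[1])
    n = Xb.shape[0]
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            residuals = Xb[batch] @ theta - y[batch]          # y_hat - y
            gradient = Xb[batch].T @ residuals / len(batch)   # gradient of the batch cost
            theta -= learning_rate * gradient                 # descend along the gradient
    return theta
```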

3.2 K-Nearest Neighbors

The KNN model makes predictions by averaging out the labels of the k training entries nearest to the entry we are trying to predict. A prediction from a KNN algorithm is thus given by 3.3.

$$\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_i \tag{3.3}$$

with $\hat{y}$ the predicted value, $k$ the number of neighbors used, and $y_i$ the label of the $i$-th nearest neighbor. [8]

Usually, the Euclidean Distance, shown in formula 3.4, is used as a metric for how near one entry is to another when using a KNN algorithm [8].

$$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} \tag{3.4}$$

in Euclidean $n$-space, with $p$ and $q$ points in this space, and $(p_1, p_2, \ldots, p_n)$ and $(q_1, q_2, \ldots, q_n)$ their respective Cartesian coordinates (or, in our scenario, the features of the respective entries). [9]

Where KNN differs most from the other methods that we use in this project is that it is non-parametric and does not require training in the classical sense. One could say that "the model" simply is the training data set.

An important consideration when using KNN is the value that we choose for k, the number of nearest neighbors which will be used to make predictions. There is no set way of finding an optimal value for k except experimentation, bar some rules of thumb that tend to point people in the right direction. When k is too low, the model tends to overfit, while a value that is too high tends to overgeneralize and lead to low accuracy. [8]

Another important fact to be aware of is its slowness when dealing with large "training" sets. Since, in the classic version of KNN, every prediction requires scanning the entire dataset to find the nearest neighbors, the algorithm gets very slow when basing predictions on a large dataset. To solve this issue, modern implementations of KNN, including the one in AWS SageMaker, often use randomly sampled batches of the dataset to make predictions from, instead of using the entire set. [1]
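As a minimal sketch of equations 3.3 and 3.4, the NumPy function below predicts a label by averaging the labels of the k nearest training entries under Euclidean distance; it is an illustration of the idea, not the sampled, batched SageMaker implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=16):
    """Average the labels of the k training entries closest to x_new."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # equation 3.4
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()                             # equation 3.3
```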

3.3 XGBoost

Originally described by Chen and Guestrin in 2016, XGBoost is a much more complicated algorithm than both LL and KNN. Before tackling its workings, we need to cover a few concepts which are integral to how it makes predictions.

Boosting, as it pertains to ML, is the practice of using several models, called weak or base predictors, chained together to make more accurate predictions. Each new link in the chain is trained with the findings of the previous links in mind, boosting the performance of the new link and, in turn, the entire model. [15]

As the name implies, the Decision Tree (DT) algorithm constructs, or ”grows”, a DT which is then used to make predictions. Starting with a root node of depth 0, it then divides the data entries over two child nodes based on the feature and threshold that minimizes a cost function. Then, this new node can either continue the same process or be a leaf node, a node without child nodes. This decision can be made based on Mean Squared Error (MSE), depth or other parameters. To make predictions, the algorithm simply puts the data entry through the DT, sees in which leaf node it ends up, and outputs the prediction value associated with said leaf node. In a regression DT, this value is often the average target value of all training instances that ended up in this node, but an optimal output formula for the leaves can depend on the cost function that was used to grow the tree. [7]

Note that because DTs make few assumptions, they will mold themselves very closely to the data and will most likely overfit when not given any constraints. Therefore, it is important to always regularize DTs. This can be done by tuning hyperparameters like maximum depth, minimum samples in a leaf node or similar, or by "pruning" the tree afterwards (i.e. removing unnecessary nodes).
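To make the node-splitting step concrete, the sketch below finds the single best split of a regression tree node by exhaustively comparing the weighted MSE of the two resulting child nodes. It is a minimal NumPy illustration under the squared-error criterion, not the implementation used by SageMaker or XGBoost.

```python
import numpy as np

def best_split(X, y):
    """Return (feature index, threshold, weighted MSE) of the split that
    minimizes the weighted MSE of the two child nodes."""
    best_feature, best_threshold, best_mse = None, None, np.inf
    for j in range(X.shape[1]):
        for threshold in np.unique(X[:, j]):
            left, right = y[X[:, j] <= threshold], y[X[:, j] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            mse = (len(left) * left.var() + len(right) * right.var()) / len(y)
            if mse < best_mse:
                best_feature, best_threshold, best_mse = j, threshold, mse
    return best_feature, best_threshold, best_mse
```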

XGBoost makes its predictions by summing the average of the target outputs with the outputs of a number of DTs scaled by a learning rate η. The DTs are grown sequentially, using the predictions of the previous ones to boost performance.

Training starts by calculating the average of the target variables in the training set. Next, this average is used as an initial guess to calculate residuals for all entries in the training set. The residuals resulting from using the average as a prediction are then used to grow the first DT.

Note that the DTs by themselves attempt to predict the residuals belonging to the predictions made by all previous iterations, and not the target value itself. Thus, a prediction from the XGBoost algorithm is made up of the sum of an initial guess (the average) and n DTs scaled by a learning rate η, each of which tries to predict how far the sum of all previous terms is off from the actual value.

Before we can construct the trees, however, we need to know the objective function to minimize, with which we can compare the different splits and calculate the optimal output values of the leaves. The DTs are constructed by minimizing the objective function 3.5.

$$L^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{3.5}$$

with $L^{(t)}$ the objective function in the $t$-th iteration, $n$ the number of training entries, $l$ a differentiable convex loss function, $y_i$ the actual target value in the $i$-th training entry, $\hat{y}_i^{(t-1)}$ the predicted target value for the $i$-th training entry in the $(t-1)$-th iteration, $f_t(x_i)$ the output of the decision tree constructed in the $t$-th iteration when using the features of the $i$-th training entry, and $\Omega(f_t)$ a regularization term that penalizes the model for the complexity of the DT constructed in the $t$-th iteration. $\Omega(f)$ in turn is given by 3.6.

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert O \rVert^2 \tag{3.6}$$

with $\gamma$ a regularization hyperparameter which controls the amount of pruning, $T$ the number of leaves in DT $f$, $\lambda$ a regularization hyperparameter which controls how much effect a single training entry can have on the configuration of the trained model, and $O$ the outputs of the leaves of DT $f$.

A second-order Taylor approximation is used to approximate the objective in order to solve this optimization problem faster. Applying this approximation and using $g_i$ and $h_i$ for the first and second order gradients of the loss function $l$ with respect to $\hat{y}_i^{(t-1)}$ respectively, we get 3.7.

$$L^{(t)} \simeq \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{3.7}$$

Next, we can drop the constant terms $\sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)})$ to simplify the objective, as these do not affect the values that minimize the function. This gives us the new objective function 3.8.

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{3.8}$$

Substituting 3.6 into 3.8 then gives us 3.9.

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} O_j^2 \tag{3.9}$$

which we can restructure into 3.10 using the fact that $\sum_{i=1}^{n} f_t(x_i) = \sum_{j=1}^{T} \sum_{i \in I_j} O_j$, with $I_j$ the set of training entries that end up in leaf $j$.

$$\tilde{L}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) O_j + \frac{1}{2}\left( \sum_{i \in I_j} h_i + \lambda \right) O_j^2 \right] + \gamma T \tag{3.10}$$

Next, we can limit the formula to a single DT, let us call this arbitrary tree $\varphi$, resulting in 3.11.

$$\tilde{L}^{(\varphi)} = \left( \sum_{i \in I_j} g_i \right) O_j + \frac{1}{2}\left( \sum_{i \in I_j} h_i + \lambda \right) O_j^2 + \gamma T \tag{3.11}$$

By differentiating 3.11 and setting the derivative to zero, we can calculate $\tilde{O}_j$, the optimal output value for leaf $j$ of a fixed DT, shown in 3.12.

$$\frac{d}{dO_j} \tilde{L}^{(\varphi)} = \left( \sum_{i \in I_j} g_i \right) + \left( \sum_{i \in I_j} h_i + \lambda \right) O_j$$

$$0 = \left( \sum_{i \in I_j} g_i \right) + \left( \sum_{i \in I_j} h_i + \lambda \right) \tilde{O}_j$$

$$\Updownarrow$$

$$\tilde{O}_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \tag{3.12}$$

Lastly, by plugging 3.12 into 3.10, we get 3.13. In XGBoost, this formula is used as a scoring system to compare DTs with different splits.

$$\tilde{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \tag{3.13}$$

Note that for regression, the typical loss function used in XGBoost is $l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$, causing $g(y_i, \hat{y}_i) = -(y_i - \hat{y}_i)$ (the negative of the residual) and $h(y_i, \hat{y}_i) = 1$. Substituting this into 3.12 leads to a very intuitive version of the formula, 3.14.

$$\text{Optimal output of leaf} = \frac{\text{Sum of the residuals}}{\text{Number of residuals} + \lambda} \tag{3.14}$$

The same holds for plugging it into 3.13, especially after multiplying the right hand side by $-2$, making it into a maximization problem instead of a minimization problem. This creates a kind of performance score that indicates how "good" a DT is, shown in 3.15. These formulas create a clear view of the factors that the algorithm takes into consideration when training.

$$\text{Performance score} = \sum_{\text{all leaves in tree}} \frac{\text{Squared sum of the residuals}}{\text{Number of residuals} + \lambda} - \text{Pruning term} \tag{3.15}$$

In most versions of XGBoost, the Basic Exact Greedy Algorithm is used to find the best tree. It works by going over every possible split on all features, evaluating them using 3.13, keeping the tree which scores the best, and repeating this process for the next step. As long as the residuals keep decreasing, this process is repeated until the set maximum number of trees is reached, or until the error has shrunk below a certain threshold.
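The small NumPy sketch below evaluates a candidate split the way formulas 3.14 and 3.15 describe for squared loss: it computes the optimal leaf output and a per-leaf score from the residuals, and the gain of a split as the children's scores minus the parent's score minus a pruning penalty γ. It is an illustration of the formulas, not XGBoost's actual implementation.

```python
import numpy as np

def leaf_output(residuals, lam):
    """Optimal leaf output for squared loss, formula 3.14."""
    return residuals.sum() / (len(residuals) + lam)

def similarity_score(residuals, lam):
    """Per-leaf term of the performance score in formula 3.15."""
    return residuals.sum() ** 2 / (len(residuals) + lam)

def split_gain(left_residuals, right_residuals, lam, gamma):
    """Score gained by splitting one leaf into two, penalized by gamma."""
    parent = np.concatenate([left_residuals, right_residuals])
    return (similarity_score(left_residuals, lam)
            + similarity_score(right_residuals, lam)
            - similarity_score(parent, lam)
            - gamma)
```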

Below, figure 3.2 visualizes how a prediction from a trained XGBoost algorithm with n trees would look for an entry with feature vector x.

Chapter 4

Application

4.1 Setup

4.1.1 The dataset

The dataset that we are working with was constructed from the data collected by the FUMA project from 2016 to 2019 and contains 31 different values on 492184 unique hub stops of both trucks and buses. These values range from integers to floats, strings, and binary values, and include information like the vehicle ID, the coordinates of the stop, and the POI ID, among others. It is stored as a Hive table in the Scania Datalake.

We define a hub stop as the entire stay within the area of a POI. This means that when the vehicle stands still within the area of a POI, moves, and stands still again, but remains within the perimeter of the same POI, this is still included in one single hub stop. The duration of the hub stop is defined as the difference between the time of arrival at the POI and the time it leaves, measured in seconds. This duration, which we will from now on refer to as the hub_stay_duration, is the label in our dataset. This is the value that we wish to predict with our regression algorithm.

Fetching and transforming the data in order to use it for training jobs is done using a Spark job written in Scala. The full functionality of the Spark job we created for this project is shown in figure 4.1 below. The steps between brackets are optional and have to be adjusted according to what we wish to achieve. After the required dataset is saved to the HDFS in the Datalake by the Spark job, it has to be downloaded manually (using a tool like MobaXterm), and then manually uploaded to an AWS Simple Storage Service (S3) bucket.


Figure 4.1: Spark job workflow.

4.1.2 The pipeline

The pipeline was provided as a pattern project on the Scania GitLab and needs a GitLab runner to function. The first step in setting up the pipeline is forking this pattern project and setting up a GitLab runner. Then, using External ID, one has to set up a Continuous Integration/Continuous Delivery (CI/CD) variable in the project for the role which will be assumed in AWS by the runner. Lastly, you need to provide the pipeline with the name of the S3 bucket where the training/testing data is stored, the stack name, and your AWS account number.

When code gets pushed to a forked version of the project, the runner automatically pushes/updates an AWS CloudFormation template to the specified AWS account. This template defines resources for creating an end-to-end ML pipeline. This pipeline can be executed by triggering a step function which then runs all of its steps in sequence according to the settings specified in the GitLab project. A visualization of this step function can be seen in figure 4.2 below.

Figure 4.2: Successfully executed step function.

One can change the settings for training and inference in the corresponding files in the forked GitLab project. These settings include but are not limited to the used algorithm, the type and size of the instance to compute on, and the objective of the training job.

As the step function is meant for production rather than experimentation purposes, a Jupyter notebook is also included. This notebook is meant to explore the data and experiment with different features, models, and hyperparameters. Being able to use the intuitive Python language with its associated packages like Pandas makes for a more enjoyable and fluid work experience in the explorative and experimental phases of this project. Because of this, we mainly use the notebook, only planning to use the step function to push the model into production if the results of the project warrant this decision.

4.2 Performance metrics

The main performance metric that we use is RMSE, defined in equation 4.1 below.

$$\mathrm{RMSE}(\theta) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{4.1}$$

We chose this metric due to its sensitivity to large errors, as these are highly undesired in our application. Mean Absolute Error (MAE), defined in equation 4.2, is also kept track of as a secondary measure.

$$\mathrm{MAE}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert \tag{4.2}$$

We calculate these measures both for the models we train and for two baseline guesses. The first baseline guess uses the mean hub_stay_duration of the POI where the stop occurs if that POI is present in the training data, and otherwise the overall mean hub_stay_duration. The second baseline guess is almost identical to the first, the only difference being that we take the median instead of the mean. The thought process behind these baseline guesses is that if we can use basic calculations on historical data to create better predictions than a computationally expensive ML algorithm, it makes very little sense to use the ML algorithm.
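The sketch below computes the two metrics and the per-POI mean baseline with NumPy and Pandas. It is a minimal illustration; the column names poi_id and hub_stay_duration are assumptions based on the dataset description in section 4.1.1.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

def poi_mean_baseline(train, test):
    """Predict the mean hub_stay_duration of the stop's POI when that POI is
    present in the training data, otherwise the overall mean (columns assumed)."""
    poi_means = train.groupby("poi_id")["hub_stay_duration"].mean()
    overall_mean = train["hub_stay_duration"].mean()
    return test["poi_id"].map(poi_means).fillna(overall_mean)
```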

4.3 Features

All cyclical features have been encoded to account for their cyclical nature. This is done by normalizing them on a unit circle and dividing them into a sine and a cosine component. These two values identify a point on the unit circle, thus making the feature cyclical. An example can be seen below, as applied to the daytime feature, which describes the time of day at the start of the stop.

$$\text{daytime\_sin} = \sin\left(2\pi \cdot \frac{\text{daytime}}{86400}\right)$$

$$\text{daytime\_cos} = \cos\left(2\pi \cdot \frac{\text{daytime}}{86400}\right)$$

All categorical features have been one-hot encoded. One-hot encoding is a way of dealing with categorical features when using an algorithm that only accepts numerical features. It is done by dividing the feature up into as many new features as it has categories, and assigning a value of 1 to the category it fits in and 0 to all others. Figure 4.3 provides an example.

Figure 4.3: Example of one-hot encoding.
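The sketch below applies both encodings with Pandas and NumPy: the sine/cosine transformation from the formulas above for the daytime feature, and one-hot encoding via pd.get_dummies for a categorical column. The column names are assumptions based on the feature descriptions that follow.

```python
import numpy as np
import pandas as pd

def encode_features(df):
    """Cyclically encode daytime (seconds since midnight) and one-hot encode
    the categorical poi_key column; column names are assumed."""
    out = df.copy()
    out["daytime_sin"] = np.sin(2 * np.pi * out["daytime"] / 86400)
    out["daytime_cos"] = np.cos(2 * np.pi * out["daytime"] / 86400)
    out = out.drop(columns=["daytime"])
    out = pd.get_dummies(out, columns=["poi_key"])
    return out
```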

There are six features which were considered for use in the model.

The daytime feature describes the time of day of the start of the hub stop. It is constructed using the hub_stay_start value from the original Hive table and has been encoded to account for its cyclical nature.

The weekday feature describes the day of the week of the start of the hub stop. It is constructed using the hub_stay_start value from the original Hive table and has been encoded to account for its cyclical nature.

The month feature describes the month of the start of the hub stop. It is constructed using the hub_stay_start value from the original Hive table and has been encoded to account for its cyclical nature.

The poi_key feature describes the category of the POI of the hub stop. It is constructed using the poi_key value from the original Hive table and has been one-hot encoded.

The poi_value feature describes the category of the POI of the hub stop at a lower level. It is constructed using the poi_value value from the original Hive table and has been one-hot encoded.

The productclass feature describes the type of vehicle. It is constructed using the vehicleclass value from the original Hive table and has been one-hot encoded.

4.4 Feature selection

The feature selection process consisted of running several training and testing jobs on a sample of the dataset, iteratively adding features and keeping only those that improve the RMSE. All of the training and testing jobs are done using the XGBoost algorithm, using the same settings and hyperparameter values every time. When a feature increases the RMSE only slightly while decreasing MAE, the feature is removed, but kept under consideration: we run an additional training and testing job including it when all other features have been tested, to decide whether it will be included or not.

4.5 Removing extreme values

To see the effect of removing outliers in the training data set, we train on a dataset excluding long stops (meaning a hub_stay_duration of more than 3 standard deviations above the mean), a dataset excluding short stops (meaning a hub_stay_duration of less than four minutes), and a dataset excluding both. Note that these extreme values are not per se outliers in the classical sense, as there is a substantial amount of short stops (46.1% of the training set of the sample dataset). Training and testing is done identically to the feature selection step.
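A minimal Pandas sketch of this filtering could look as follows; the column name hub_stay_duration follows the dataset description, and the cutoffs mirror the ones stated above.

```python
import pandas as pd

def remove_extreme_stops(train, short_cutoff=240, n_std=3,
                         drop_short=True, drop_long=True):
    """Optionally drop short stops (< short_cutoff seconds) and long stops
    (more than n_std standard deviations above the mean) from the training set."""
    duration = train["hub_stay_duration"]
    keep = pd.Series(True, index=train.index)
    if drop_short:
        keep &= duration >= short_cutoff
    if drop_long:
        keep &= duration <= duration.mean() + n_std * duration.std()
    return train[keep]
```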

4.6 Model selection

The models that are considered for this project are Linear Learner, K-nearest Neighbors, and XGBoost. These are all the algorithms that AWS SageMaker provides that are applicable to our problem. Although SageMaker also gives us the option of bringing our own algorithm, this is a strategy that will not be explored in this project due to time constraints.

We select the most applicable model by running the same hyperparameter tuning job on every algorithm, each training and validating on the same dataset, which is made up of a 10% random sample of the entries in the full Hive table and contains all of the features we decided to move forward with after the feature selection process. The reason for the hyperparameter tuning job is to give every algorithm as fair a chance as possible, lowering the chance of a "bad" hyperparameter configuration making a model look worse than it actually is. The strategy we use for these hyperparameter tuning jobs is Bayesian Search due to its tendency to find hyperparameters which produce similar results to Random Search in less time, while relying less on chance and thus being more consistent.

4.7 Hyperparameter tuning

After determining which model delivers the best results, we run a hyperparameter tuning job with the same settings using the chosen model on the full dataset (as opposed to the 10% sample we have been using thus far). The results of this hyperparameter tuning job will be used as the settings for the model.
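For reference, a tuning job like this could be launched with the SageMaker Python SDK roughly as sketched below. This is a hedged example, not the exact code used in the project: the xgb_estimator variable, the S3 input locations, and the ranges (mirroring Appendix A.4) are assumptions.

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# xgb_estimator is assumed to be a previously configured SageMaker XGBoost estimator.
hyperparameter_ranges = {
    "eta": ContinuousParameter(0.01, 0.2),
    "min_child_weight": ContinuousParameter(0, 20),
    "alpha": ContinuousParameter(0, 50),
    "subsample": ContinuousParameter(0.5, 1),
    "num_round": IntegerParameter(1, 50),
}

tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",
    max_jobs=30,
    max_parallel_jobs=3,
)

# train_s3_uri and validation_s3_uri are assumed S3 locations of the CSV data.
tuner.fit({"train": train_s3_uri, "validation": validation_s3_uri})
```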


Chapter 5

Results and Analysis

The full results are available in Appendix A.

5.1 Feature selection

The features that proved to be beneficial to the model are daytime, weekday, poi_key, and poi_value. The two features where doubt existed were weekday and vehicleclass.

weekday initially caused only a very small decrease in RMSE of 0.14%, while increasing MAE by 0.67%. When retesting this feature after having run through all other features except productclass, leaving it out led to an MAE increase of 2.19% and an RMSE increase of 1.06%. Thus, the decision was made to include this feature.

vehicleclass posed another issue, as it was not recorded for all entries. This meant having to exclude part of the data (about 4.8%) if we wanted to include this feature. Including the feature, and thus excluding part of the data, led to an increase of 17.11% in the MAE and a massive 44.01% in RMSE. We are not sure what could have been the cause of this huge increase. When excluding the feature again, but still excluding the data where it was not recorded, we saw a very small increase in performance compared to including the feature (-0.40% MAE and -0.08% RMSE). We decided not to include this feature.

5.2 Removing extreme values in the training set

Removing extreme values from the training set did not result in increased performance in any of the three scenarios. The case of removing just the short stops is the only one that resulted in any decrease in RMSE, but the decrease in RMSE was so insignificant (0.54%) compared to the increase in MAE (12.72%) that we decided to not move forward with this.

5.3 Model selection

As expected, XGBoost performs the best. Using the second-best performing algorithm (KNN) makes for a 1.72% increase in RMSE compared with XGBoost. Thus, we decide on the aforementioned XGBoost algorithm.

5.4 Training on the full dataset

As all previous tuning/training jobs were conducted on a 10% sample of the full dataset, the next step is to run a full-scale tuning and training job using the entire dataset that we have available. Even though the RMSE is not completely comparable, as we are effectively changing the data we are working with compared to previous steps, the result of the testing was disappointing, bringing a 41.86% increase in RMSE, adding up to 40413.8. We are unsure as to the exact reason for this massive increase.

5.5 Removing outliers

Not being satisfied with the results of the model that came out of the previous steps (an RMSE of 40413.8 on the full dataset), we decided to train, validate and test on the full dataset excluding the upper outliers. The thought process behind this was that very long stops would most likely be planned in advance, and thus not be relevant to how this model would be applied. We defined these upper outliers in two different ways.

The first kind of upper outliers that we exclude are the entries with a hub_stay_duration greater than three standard deviations above the mean. In this case, the limit is set at 124100 seconds, or roughly 34 and a half hours. These entries make up 0.64% of the data.

The second kind of upper outliers are all entries that concern a POI with an average hub_stay_duration greater than 3 standard deviations above the mean. These entries make up 0.014% of the data, or 15 POIs.

Removing outliers of the first kind proved to be a lot more effective than removing those of the second kind. We suspect the reason the second kind did not make much difference is the small number of entries being excluded compared to the massive size of the full dataset. The first kind, however, made for an apparent huge increase in performance, decreasing RMSE by 82.89% (from 40413.8 to 6916.8). It should be noted that these results technically cannot be used directly to compare the quality of the model itself, as we are changing the data. From a more practical point of view, however, this massive leap in performance still proves extremely substantial and possibly valuable.

Because of the way in which this model would be applied, we chose to move forward with this iteration of the model. As mentioned previously, the stops that are excluded from both training and testing in this iteration are most likely planned, and thus not relevant for making predictions on.

5.6 Analysis

The final iteration of the model has an RMSE of 6916.8 and an MAE of 2385.8 when tested on a partition of the dataset that it has never seen before. When we compare this to the RMSE of our baseline guesses, it performs better than both the mean and the median guess, by 4.35% and 12.86% respectively. Looking at the MAE, though, the baseline guesses perform better, by 0.21% and 26.33% for mean and median respectively. It has to be taken into account that the model has been trained and tuned with the goal of minimizing RMSE, and that we chose to evaluate the model using RMSE while keeping track of MAE as a secondary metric. Thus, we can conclude that the model performs better than our baseline guesses.

Whether the error is too large or not for actual application in a production setting will have to be discussed with other teams at Scania.


Chapter 6

Discussion

In this chapter, we will discuss what more could have been done, how a continuation of this project could possibly look, and whether this continuation would be worth it.

Because of the very time-demanding setup of this project, which was only exacerbated by not having previous experience with AWS, GitLab CI/CD, Spark, or Scala, some things that we wanted to do ended up being scratched due to time constraints. Many of the things that could not be done within this timeframe would most likely find their place very well in a continuation of this project, if that comes to be.

It could be valuable to experiment more with the threshold for the long stops which are excluded. This, along with meeting with people from other parts of the company to discover where the threshold between planned and unplanned stops lies in practice, could not only improve the model in a statistical sense, but also make it more usable in practice.

The possibilities for more features are endless, and especially the distance between the stop at hand and the previous stop is a feature that we would like to explore. Intuitively, it seems like it could add significantly to the predictive power of the model. Other features that could be explored are the final destination of the trip, the distance to the final destination of the trip, historical means and/or medians of POIs, and the application of the vehicle.

Originally, there were also plans to train an Artificial Neural Network (ANN) to make predictions. This ended up being scratched because no ANN implementation is built into AWS SageMaker. The possibility to create an implementation is wide open, however, as SageMaker has great support for what it calls "Bring your own algorithm". This could be a valuable path to explore and might push performance beyond that which we achieved with XGBoost.

One of the large points of struggle in this project was the incompatibility of Spark and AWS SageMaker when using the provided workflow. Preparing the data in Spark, to then use a Jupyter notebook or the pipeline to train and test, leads to not being able to use a lot of the built-in functionalities of Spark. This, along with having to provide the data in Comma-Separated Values (CSV) format, necessitates the use of inefficient code and makes the entire process of writing the code for the Spark job a lot slower and more difficult than it needs to be. We think that the project could benefit greatly from choosing between the two options instead of using parts of both. Either the data preparation and training/testing is all done in Spark (for which SageMaker provides a library), or everything is done in a Jupyter notebook, doing the data preparation using Pandas.

At this point in time, the Spark job and the Jupyter notebook are not structured in a way that makes for easy changes and additions for more experimentation. We believe that spending a bit of time to transform these into a more modular and user-friendly format could prove paramount in getting more done in this project in the future. This one-time investment could possibly reduce the time spent on every test, change, or idea applied to these resources in the future, meaning that more work and experimentation could be done.

The model is still in a developmental stage, where it is not deployed, nor is it ready to be. Once it is that far, however, creating an intuitive and easy-to-use interface to make predictions for new stops would be very valuable in making sure that the model actually sees use in practice. Right now, this is still fairly far in the future compared to the other points, but it is definitely something to keep in mind as we gather more information and experience about this project.

We believe that a continuation of this project could lead to a production-grade model. The limiting factor has mostly been time, not a lack of other resources or ideas. As much time was spent on the setup this time around, progress in this continuation will likely be a lot quicker compared to the timeline of this iteration, and results will likely follow.


Chapter 7

Conclusion

While not having reached a production-ready model, we believe that this project has shown the potential for such a model given more time and resources. The possibilities for more features, along with tweaking the threshold for which stops to exclude based on both the theoretical performance of the model and practical information on when and how this model would be used, show a large potential for growth. Even though the model that we have arrived at might not be good enough at this point in time, it does still perform better than our baseline guess, which in itself already is far from random.

The apparent potential to transform this early-stage model into a high-performing predictive algorithm, along with a starting point for the model and a possible direction to move in, is the result of this project. We hope that this project will see further continuation, as we believe that the outcome could provide substantial value to Scania.

Appendix A

Results

A.1 Feature Selection

• Round 1
– 10% sample
– Features: daytime.
– MAE compared to mean guess including poi mean when possible: 5302.914245
– RMSE compared to mean guess including poi mean when possible: 29103.04433
– MAE compared to mean guess including poi median when possible: 3626.498994
– RMSE compared to mean guess including poi median when possible: 30035.11421
– MAE compared to model: 4588.968952
– RMSE compared to model: 30184.38744

• Round 2
– 10% sample
– Features: daytime, weekday.
– MAE compared to mean guess including poi mean when possible: 5302.914245
– RMSE compared to mean guess including poi mean when possible: 29103.04433
– MAE compared to mean guess including poi median when possible: 3626.498994
– RMSE compared to mean guess including poi median when possible: 30035.11421
– MAE compared to model: 4619.69754
– RMSE compared to model: 30141.27801

• Round 3
– 10% sample
– Features: daytime, weekday, month.
– MAE compared to mean guess including poi mean when possible: 5302.914245
– RMSE compared to mean guess including poi mean when possible: 29103.04433
– MAE compared to mean guess including poi median when possible: 3626.498994
– RMSE compared to mean guess including poi median when possible: 30035.11421
– MAE compared to model: 4625.545031
– RMSE compared to model: 30237.6036

• Round 4
– 10% sample
– Features: daytime, weekday, poi_key.
– MAE compared to mean guess including poi mean when possible: 5302.914245
– RMSE compared to mean guess including poi mean when possible: 29103.04433
– MAE compared to mean guess including poi median when possible: 3626.498994
– RMSE compared to mean guess including poi median when possible: 30035.11421
– MAE compared to model: 4550.797644
– RMSE compared to model: 29987.31144

• Round 5
– 10% sample
– Features: daytime, weekday, poi_key, poi_value.
– MAE compared to mean guess including poi mean when possible: 5302.914245
– RMSE compared to mean guess including poi mean when possible: 29103.04433
– MAE compared to mean guess including poi median when possible: 3626.498994
– RMSE compared to mean guess including poi median when possible: 30035.11421
– MAE compared to model: 4129.846676
– RMSE compared to model: 29332.13323

• Round 6
– 10% sample
– Features: daytime, poi_key, poi_value.
– MAE compared to mean guess including poi mean when possible: 5302.914245
– RMSE compared to mean guess including poi mean when possible: 29103.04433
– MAE compared to mean guess including poi median when possible: 3626.498994
– RMSE compared to mean guess including poi median when possible: 30035.11421
– MAE compared to model: 4220.148842
– RMSE compared to model: 29643.22777

• Round 7
– 10% sample
– Features: daytime, weekday, poi_key, poi_value, productclass.
– All entries with NaN values in productclass are removed.
– MAE compared to mean guess including poi mean when possible: 5820.208973
– RMSE compared to mean guess including poi mean when possible: 42154.08275
– MAE compared to mean guess including poi median when possible: 4302.585016
– RMSE compared to mean guess including poi median when possible: 42885.69635
– MAE compared to model: 4836.552504
– RMSE compared to model: 42239.89497

• Round 8
– 10% sample
– Features: daytime, weekday, poi_key, poi_value.
– All entries with NaN values in productclass are removed.
– MAE compared to mean guess including poi mean when possible: 5820.208973
– RMSE compared to mean guess including poi mean when possible: 42154.08275
– MAE compared to mean guess including poi median when possible: 4302.585016
– RMSE compared to mean guess including poi median when possible: 42885.69635
– MAE compared to model: 4817.221731
– RMSE compared to model: 42206.31445

A.2 Removing outliers

• Round 1
– 10% sample
– Removing all entries from the training set with a hub_stay_duration more than 3 standard deviations above the mean.
– Training data size goes from 39291 to 39027
– RMSE compared to mean guess including poi mean when possible: 29480.7858
– MAE compared to mean guess including poi median when possible: 3624.11586
– RMSE compared to mean guess including poi median when possible: 30067.70112
– MAE compared to model: 3950.496364
– RMSE compared to model: 30004.31251

• Round 2
– 10% sample
– Removing all entries from the training set with a hub_stay_duration less than 240s.
– Training data size goes from 39291 to 21189
– MAE compared to mean guess including poi mean when possible: 7340.535261
– RMSE compared to mean guess including poi mean when possible: 29499.3281
– MAE compared to mean guess including poi median when possible: 4007.08946
– RMSE compared to mean guess including poi median when possible: 29975.86568
– MAE compared to model: 4655.360776
– RMSE compared to model: 29174.79252

• Round 3
– 10% sample
– Removing all entries from the training set with a hub_stay_duration less than 240s or more than 3 standard deviations above the mean.
– Training size goes from 39291 to 20925
– MAE compared to mean guess including poi mean when possible: 5100.21024
– RMSE compared to mean guess including poi mean when possible: 29442.81541
– MAE compared to mean guess including poi median when possible: 3993.054762
– RMSE compared to mean guess including poi median when possible: 30000.08867
– MAE compared to model: 4273.032229
– RMSE compared to model: 29915.28696

A.3 Model selection

Any values that are not mentioned are set to their respective SageMaker defaults.

(40)

– 10% sample

– Tuning job configuration: ∗ learning_rate: 0.0001-0.5 ∗ wd: 0.0000001-1

∗ l1: 0.0000001-1

∗ mini_batch_size: 100-5000

∗ ResourceLimits: max 30 training jobs, max 3 in parallel ∗ Strategy: Bayesian

∗ MetricName: validation:objective_loss ∗ Type: Minimize

– Tuning job results:

∗ l1: 5.166299062870825e-07

∗ learning_rate: 0.0021688641964931724 ∗ mini_batch_size: 130

∗ wd: 3.960109184055871e-05 ∗ RMSE: 29284.90451

• Round 2: K-Nearest Neighbors – 10% sample

– Tuning job configuration: ∗ k: 1-1024

∗ sample_size: 256-20000

∗ ResourceLimits: max 30 training jobs, max 3 in parallel ∗ Strategy: Bayesian

∗ MetricName: test:mse ∗ Type: Minimize – Tuning job results:

∗ k: 16

∗ sample_size: 8038 ∗ RMSE: 28980.14106 • Round 3: XGBoost

– 10% sample

– Tuning job configuration: ∗ eta: 0.01-0.2


∗ alpha: 0-50
∗ subsample: 0.5-1
∗ num_round: 1-50

∗ ResourceLimits: max 30 training jobs, max 3 in parallel
∗ Strategy: Bayesian

∗ MetricName: validation:rmse
∗ Type: Minimize

– Tuning job results:

∗ alpha: 7.502089080686181
∗ eta: 0.05281515905191552
∗ max_depth: 5
∗ min_child_weight: 9.722886978684691
∗ num_round: 28
∗ scale_pos_weight: 1
∗ silent: 0
∗ subsample: 0.8610868351094432
∗ RMSE: 28489.19922
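The Round 3 configuration above can be expressed with the SageMaker Python SDK roughly as follows. This is a minimal sketch, assuming SDK v2, an already constructed built-in XGBoost Estimator, and S3 training/validation inputs; the function and variable names are illustrative.

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

def launch_xgboost_tuning(xgb_estimator, train_input, validation_input):
    """Start a Bayesian hyperparameter tuning job matching the Round 3 configuration."""
    ranges = {
        "eta": ContinuousParameter(0.01, 0.2),
        "alpha": ContinuousParameter(0, 50),
        "subsample": ContinuousParameter(0.5, 1),
        "num_round": IntegerParameter(1, 50),
    }
    tuner = HyperparameterTuner(
        estimator=xgb_estimator,
        objective_metric_name="validation:rmse",
        objective_type="Minimize",
        hyperparameter_ranges=ranges,
        strategy="Bayesian",
        max_jobs=30,            # ResourceLimits: max 30 training jobs
        max_parallel_jobs=3,    # max 3 in parallel
    )
    tuner.fit({"train": train_input, "validation": validation_input})
    return tuner
```

The Linear Learner and K-Nearest Neighbors rounds only differ in the estimator, the hyperparameter ranges, and the objective metric name.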

A.4 Training on full dataset

• Round 1

– Full dataset

– Tuning job configuration:
∗ eta: 0.01-0.2

∗ min_child_weight: 0-20
∗ alpha: 0-50

∗ subsample: 0.5-1
∗ num_round: 1-50

∗ ResourceLimits: max 30 training jobs, max 3 in parallel
∗ Strategy: Bayesian

∗ MetricName: validation:rmse
∗ Type: Minimize

– Tuning job results:

∗ alpha: 48.16387571560461
∗ eta: 0.08917529573338208
∗ max_depth: 5


∗ min_child_weight: 18.22798715379956
∗ num_round: 50
∗ scale_pos_weight: 1
∗ silent: 0
∗ subsample: 1
∗ RMSE: 36273.19922
• Round 2
– Full dataset

– Tuning job configuration:
∗ eta: 0.01-0.2

∗ min_child_weight: 0-50
∗ alpha: 0-200

∗ subsample: 0.5-1
∗ num_round: 1-100

∗ ResourceLimits: max 30 training jobs, max 3 in parallel
∗ Strategy: Bayesian

∗ MetricName: validation:rmse
∗ Type: Minimize

– Tuning job results:

∗ alpha: 131.9616321319353
∗ eta: 0.12142029280163846
∗ max_depth: 5
∗ min_child_weight: 50.0
∗ num_round: 93
∗ scale_pos_weight: 1
∗ silent: 0
∗ subsample: 0.9988227616596089
∗ RMSE: 36231.39844
• Round 3

– Removing upper outliers (More than 3 standard deviations above the mean)
– Tuning job configuration:

∗ eta: 0.01-0.2

∗ min_child_weight: 0-20
∗ alpha: 0-50


∗ num_round: 1-50

∗ ResourceLimits: max 30 training jobs, max 3 in parallel
∗ Strategy: Bayesian

∗ MetricName: validation:rmse
∗ Type: Minimize

– Tuning job results:

∗ alpha: 32.765541898349895
∗ eta: 0.19496861492446121
∗ max_depth: 5
∗ min_child_weight: 12.280455217964835
∗ num_round: 49
∗ scale_pos_weight: 1
∗ silent: 0
∗ subsample: 0.6756319380932873
∗ RMSE: 6655.069824


Appendix B

Presentation


Using supervised learning methods to predict the stop duration of heavy vehicles
Bachelor thesis project
Emiel Oldenkamp
Mälardalen University, Västerås
September 22, 2020

Summary
Used data gathered in the FUMA project to construct a model to predict the stop time of heavy vehicles
Machine Learning algorithms in AWS SageMaker


Content
1. Introduction
2. Theoretical Background
3. ML Algorithms
4. Application & Results
5. Discussion
6. Conclusion


Introduction - FUMA
Vinnova project in collaboration with Fraunhofer-Chalmers Centre
GPS coordinates of vehicles at regular intervals
Create information on both the usage mode of the vehicles and the transport network in general
All goals were fulfilled
Extracted Points Of Interest (POIs) or hubs


Introduction - Thesis Project Goal



Theoretical Background - Machine Learning
"Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."
- Arthur Lee Samuel, 1959
Algorithms which make a computer learn what rules or decisions will lead to the wanted outcome
Not only faster, but also better results


Theoretical Background - ML Algorithms


Theoretical Background - Hyperparameter Tuning
Hyperparameters are set before training
Tuning them can have a dramatic effect on the performance of the model


Theoretical Background - Bayesian Search
Applicable when the function to optimize is computationally expensive
Spend computational power to make informed decisions
Cuts down the computational time
Treats tuning as a regression problem
Assumes that the function of interest is sampled from a Gaussian process

Theoretical Background - Bayesian Search
Repeat these steps until a stopping criterion is reached:
1. Run training jobs. The first batch of training jobs uses random values
2. Create a posterior distribution using the results of all completed training jobs
3. Use this information to choose new values that have a high likelihood of improving the model, using either Expected Improvement or the Gaussian process upper confidence bound
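A minimal sketch of this loop for a single hyperparameter follows. It is not the SageMaker implementation; it assumes scikit-learn and SciPy, uses Expected Improvement as the acquisition function, and all names and defaults are illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayesian_search(objective, low, high, n_random=5, n_iter=25, rng=None):
    """Minimize an expensive 1-D objective over [low, high] with Bayesian search."""
    if rng is None:
        rng = np.random.default_rng(0)
    xs = list(rng.uniform(low, high, n_random))   # step 1: first batch uses random values
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(np.array(xs).reshape(-1, 1), np.array(ys))   # step 2: posterior from all completed jobs
        cand = rng.uniform(low, high, 1000).reshape(-1, 1)
        mu, sigma = gp.predict(cand, return_std=True)
        best = min(ys)
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # step 3: Expected Improvement
        x_next = float(cand[np.argmax(ei)])
        xs.append(x_next)
        ys.append(objective(x_next))
    return xs[int(np.argmin(ys))], min(ys)
```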


Theoretical Background - Bayesian Search vs. Other Methods
Takes less time to reach similar results to Random Search
Relying less on chance, offering a higher degree of consistency
Better fit for the increasingly complex ML algorithms and massive datasets


ML Algorithms - Linear Learner
Prediction: $\hat{y} = \theta^T \cdot x$
SageMaker implementation of Linear Regression
Usual cost function: $l(\theta) = \frac{1}{2}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$

ML Algorithms - Linear Learner
Minimize l(θ)
Least Squares
– Exact solution
– Quickly gets out of hand
Gradient Descent
– Approaches analytical solution
– More efficient than analytical approach
– Works for several cost functions


ML Algorithms - Linear Learner (Gradient Descent)

ML Algorithms - Linear Learner (Stochastic Gradient Descent)
Solution for dealing with massive datasets
Only uses a mini batch
Greatly improves efficiency without substantial performance drops
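A minimal sketch of the idea, assuming NumPy (names and default values are illustrative; this is not SageMaker's Linear Learner implementation): each update uses the gradient of the squared-error cost computed on a mini batch only.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.002, batch_size=130, epochs=10):
    """Fit theta for y ~ X @ theta with mini-batch stochastic gradient descent."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, d = X.shape
    theta = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient of the squared-error cost on the mini batch
            theta -= learning_rate * grad
    return theta
```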


ML Algorithms - K-Nearest Neighbors
Prediction: $\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y_i$
Distance: $d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}$
Non-parametric and does not require training
"The model" is the dataset itself

ML Algorithms - K-Nearest Neighbors

K is an important consideration

Predictions get very slow for large datasets

Randomly sampled batches solve this issue
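A minimal sketch of the prediction rule above, assuming NumPy (names illustrative): compute the Euclidean distance to every training point, take the k closest, and average their labels. Restricting X_train to a random subsample corresponds to the sampled-batch idea mentioned on the slide.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=16):
    """Predict a label as the mean of the k nearest training labels."""
    X_train, y_train = np.asarray(X_train, dtype=float), np.asarray(y_train, dtype=float)
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distances to all points
    nearest = np.argsort(dists)[:k]                           # indices of the k closest points
    return float(y_train[nearest].mean())                     # average their labels
```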


ML Algorithms - XGBoost (Prediction)


ML Algorithms - XGBoost (Training)

1. Calculate initial guesses (average of label)

2. Calculate residuals using the guesses


ML Algorithms - XGBoost (Training)

3. Train DT using residuals from previous step

4. Update guesses to include new DT

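A minimal sketch of the four training steps above, assuming scikit-learn regression trees (this is plain gradient boosting on squared error rather than the full XGBoost algorithm; names and defaults are illustrative).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_rounds=28, eta=0.05, max_depth=5):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    base = float(np.mean(y))              # step 1: initial guess = average of the label
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred              # step 2: residuals of the current guesses
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)  # step 3: fit a DT to the residuals
        pred += eta * tree.predict(X)     # step 4: update the guesses with the new DT
        trees.append(tree)
    return base, trees

def predict_boosted(base, trees, X, eta=0.05):
    X = np.asarray(X, dtype=float)
    return base + eta * sum(tree.predict(X) for tree in trees)
```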

ML Algorithms - XGBoost (Training)


ML Algorithms - XGBoost (Training DTs)
Objective function: $L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$
With $\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert O \rVert^2$
Scoring system: $\tilde{L}^{(t)} = \frac{1}{2}\sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$

ML Algorithms - XGBoost (Training DTs)
Plugging in $l(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$ gives
Optimal output of leaf = $\frac{\text{Sum of the residuals}}{\text{Number of residuals} + \lambda}$
Performance score = $\sum_{\text{All leaves in tree}} \frac{\text{Squared sum of the residuals}}{\text{Number of residuals} + \lambda}$


ML Algorithms - XGBoost (Training DTs)
Basic Exact Greedy algorithm
Evaluates every possible split


Application & Results - The Dataset
2016-2019
31 different values on 492184 unique hub stops
hub_stay_duration: entire stay within the area of a POI
Spark job written in Scala to fetch and transform data

Application & Results - Performance Metrics
$\mathrm{RMSE}(\theta) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$
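A minimal sketch, assuming NumPy, of the two metrics reported throughout the results (MAE is defined analogously to the RMSE above):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))
```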


Application & Results - Feature Selection

Iteratively add features

Decide based on RMSE

Consider MAE



Application & Results - Removing Extreme Values
Long stops: more than 3 standard deviations above the mean
Short stops: less than 4 minutes

Application & Results - Removing Extreme Values

Did not result in increased performance

Removing short stops

RMSE -0.54%

MAE +12.72%


Application & Results - Model Selection

Attempt to give each model an equal chance

Hyperparameter tuning job

Bayesian

Identical settings where possible

Same 10% sample


Application & Results - Model Selection

XGBoost performed best

K-Nearest Neighbors

RMSE +1.72% compared to XGBoost

Linear Learner


Application & Results - Hyperparameter Tuning

Train on full dataset with XGBoost

Similar tuning job to previous step


Application & Results - Training on the Full Dataset

Implications of changing the dataset

RMSE: +41.86%


Application & Results - Removing Upper Outliers
Excluding upper outliers on all sets
Entries with hub_stay_duration greater than 3 standard deviations above the mean (≈ 34.5h)
– 0.64% of the data
– RMSE -82.89%
POIs with mean hub_stay_duration greater than 3 standard deviations above the mean (≈ 34.5h)
– 0.014% of the data (15 POIs)
– No substantial change in RMSE

Application & Results - Analysis

RMSE = 6916.8

Performs better than both guesses on RMSE

Mean by 4.35%

Median by 12.86%

MAE = 2385.8

Baseline guesses both perform better on MAE

Mean by 0.21%


Discussion

Experiment with different thresholds for the long stops

Explore more features

Explore possibilities with Artificial Neural Networks

Creating a more user-friendly workflow


Discussion

Rework code

Create user interface once the model is ready for deployment

Limiting factor has mainly been time

