Timing Predictions in Vasaloppet using Supervised Machine Learning


UPTEC F 20004

Degree project, 30 credits, February 2020


Karl Ekström


Teknisk-naturvetenskaplig fakultet, UTH-enheten


Abstract


Timings at future controls in Vasaloppet were predicted using timings at past and current controls. Predictions were made using linear regression, deep neural networks and support vector machine regression. Timings up to the current control and age were used as input data; the predicted timing at a future control was used as output data. This resulted in 28 estimated functions, which were made for each starting row. With eleven starting rows, the final number of estimated transfer functions is 308.

All methods significantly improved prediction, with up to six times lower mean error compared to the currently used method. It was found that deep neural networks had the potential to make the best predictions, but that the training time required was unrealistic given the available resources. Support vector regression performed almost as well as deep neural networks, but trained much faster. Linear regression had the worst performance, albeit not by much, and the fastest training time of the machine learning algorithms. The improvement ranged from six times lower average error down to 1.3 times lower, depending on the transfer function estimate evaluated. The improvements were greatest for predictions from the first control, where the absolute error was by far the largest. Thus the worst predictions of the original model improved the most, resulting in a considerable improvement for the service offered during Vasaloppet.

Examiner: Tomas Nyberg. Subject reader: Andreas Lindholm. Supervisor: Carl Hallén.


Popular Science Summary

Vasaloppet is held annually and is the world's largest ski race. In the main race, up to 15 800 people start in Sälen and make their way via seven controls to Mora, 90 km away. Along the course, at the controls and at the finish, people gather to watch the race and support acquaintances. Skiers can also be followed online through Vasaloppet's results service, run by Mika Timing/EST, which shows when a skier has passed a control and predicts when the upcoming controls will be passed.

The prediction is currently made by calculating a skier's mean velocity based on the latest timing. This method can, however, give rise to large errors when the current mean velocity is not a good representation of the mean velocity over the whole race. This generally happens at the first control, Smågan, where the stretch from the start to Smågan consists of a long uphill slope where queues form. The mean velocity to Smågan is thus considerably lower than over the race as a whole, which leads to an average error of three hours for predictions from Smågan to the finish in Mora. Uncertain predictions make it difficult for spectators to be in the right place at the right time, which increases queuing and makes it harder to cheer during the race.

To improve the prediction, machine learning has been used. Machine learning is a collective name for algorithms that can use existing data to improve at a task connected to that data. In the case of Vasaloppet, machine learning methods can use measured times to predict upcoming times, and then look in a dataset to see how good the guess was. The algorithm then updates its guess based on how close it was to the correct answer.

After implementing three machine learning methods, the time prediction from Smågan to Mora has gone from being off by three hours on average to being off by just over half an hour on average. The three methods used were linear regression, deep neural networks and support vector machines.

Linear regression is the simplest of the three methods; it trained the fastest but gave the worst results. On average, linear regression gave 3.5 times better predictions from Smågan, and about 1.5 times better predictions from the other controls.

Deep neural networks (DNN) gave the best predictions given sufficient training time and a suitable choice of hyperparameters, which are choices made by the engineer before each training instance. Training this method was extremely time-consuming, however, and with the available resources DNN reached about 10% better predictions than linear regression.

Support vector machines (SVR) is an algorithm that by and large accomplishes the same thing as linear regression, but uses a few tricks that make it nonlinear and more flexible. It took longer to train than linear regression, but was considerably faster than DNN. SVR gave worse results than the best DNN implementations, but in a run over all the required predictions, where resources for complex DNNs were not available, it was marginally better.

The results show that machine learning is suitable for improving Vasaloppet's predictions, and it will be implemented for upcoming Vasaloppet races.

Power consumption during prediction has been taken into account, which means that simpler methods will be implemented in cases where their predictions are similar to those of the more complex methods.


1 Introduction

Vasaloppet¹ is a Swedish cross-country skiing competition which goes from Sälen to Mora. It is arranged annually and is the biggest cross-country skiing competition in the world, with up to 15 800 people participating in the main race, and over 60 000 participating during all races of the Vasaloppet week.

The main competition is a Swedish tradition which has been held since 1922. It traces its origins to when Gustav Vasa (later king) skied from Mora to Sälen on his way to Norway to escape from pursuing Danish soldiers. Vasaloppet celebrates his return journey, once Swedish peasant skiers had caught up to him and declared their support for him.

Vasaloppet is a long race, 90 km, and takes from around four to twelve hours to finish depending on the skier. To support the skiers, seven controls (Smågan, Mångsbodarna, Risberg, Evertsberg, Oxberg, Hökberg and Eldris) are present during the race, where light food and drinks are served, and timing measurements are made. Thus, during Vasaloppet, competitor timings are obtained at regular distances throughout the race. These timings are presented in real time by Mika Timing²/EST³ through the results service of the Vasaloppet website⁴. Vasaloppet is broadcast on Swedish national television and online. Many people nevertheless gather to watch it in person: along the race, at the controls or at the finish line. Due to the length of the race, most who cheer for friends or family try to predict when they should be at the control or finish line. For this, the results service is integral, as even experienced skiers can differ in finish time by hours, depending on weather and performance on the day. Novice skiers can be even more unpredictable without access to in-race performance.

1.1 Prediction

Besides presenting timings, the results service also offers a prediction for when the competitor will pass later controls and the finish line. Experienced cross-country skiers can make a good guess based on prior knowledge of Vasaloppet, but for many people this prediction is the best available indicator of when someone will pass future controls.

The timing prediction is currently made by calculating the mean velocity of the competitor to the current control, and then assuming this to be the mean velocity to all future controls: the distance to the target of prediction is divided by the current mean velocity. With no other information than control distances, this is a reasonable method for prediction. However, experienced skiers know that this prediction tends to err, as the average velocity differs greatly depending on the circumstances of the race. For instance, the stretch from the start to the first control begins with a long and steep slope, where queues form. Competitors that start late can spend an hour in this queue with basically no distance moved. Thus, the average velocity to the first control is generally much lower than the true average velocity for the whole race. Experienced skiers know this and can take it into account, but so should a prediction model, since more information than just the distance and the timings is, in fact, available.

1 https://www.vasaloppet.se/

2 https://www.mikatiming.de/en

3 https://est.se/

4 https://results.vasaloppet.se/2019/?_ga=2.262704516.988466433.1578756903-364498295.1555513997

To improve the prediction model, two approaches can be taken. One is to code human knowledge directly into the model, taking into account initial queues, slopes and other information to scale the predicted mean velocity so that it more accurately represents the true mean velocity. The other is to let a computer make prediction guesses based on available information, and train it so that the guesses are accurate. The benefit of using human expertise is that it does not require large amounts of data to train predictions with. However, some timing data is still needed to evaluate the models. The main benefit of using a computer model is that, given enough (labeled) data, it can surpass human expertise, as it considers all available data to make the predictions that minimize error within its model. To further improve a computer implementation, human knowledge can be used to assist machine predictions.

1.2 Machine Learning

A computer algorithm which learns from data to improve performance is called a machine learning algorithm. Machine learning is a wide field with no single all-encompassing definition, but one common description of machine learning is

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”⁵.

Machine learning is a field which has expanded rapidly in recent years. Though many algorithms were theorized before the 21st century, advances in computer hardware and software have allowed for testing, implementation and improvement, as well as given rise to completely new algorithms.

Machine learning consists of three main subgroups: supervised, unsupervised and reinforcement learning.

1.2.1 Supervised Learning

Supervised learning is characterized by the algorithm having access to “labeled” data. Labeled data means that if a computer has access to a set of parameters, for instance timings at controls at Vasaloppet, and it tries to make a prediction for what finish time a given set of variables should result in, it can then look at the actual finish time to evaluate the “loss”, the performance of the prediction.

5 Machine Learning - Tom Mitchell - McGraw Hill, 1997

In other words, in supervised learning, the algorithm is taught to replicate a behavior where the solution is known, usually in order to then perform the task where the solution is unknown, such as the Vasaloppet race of 2020.

1.2.2 Unsupervised Learning

Unsupervised learning is usually used for pattern recognition. In unsupervised learning, labeled data is no longer available for the algorithm, and instead it tries to find patterns in the data. If the computer is given a set of data with parameters, it can minimize the loss by reducing variance through “clustering”.

For instance, if a given dataset consists of birds, humans and dogs, with the parameters “number of wings” and “number of legs”, the algorithm could place all datapoints with two legs and no wings in one cluster, two legs and two wings in a second cluster, and four legs in a third cluster. It won’t know what these clusters mean, and a dog that is missing two legs would in this case be clustered together with humans, but it is a useful tool for finding patterns, especially in larger, more complex datasets.

In relation to timings of Vasaloppet, clustering could in theory reveal information about different types of skiing patterns during the race; if an algorithm is instructed to create three clusters, it could possibly present three different strategies during the race, such as going fast early at the expense of energy later, keeping a steady pace or saving energy for the latter stages of the race. Due to the variations between races, mainly as a result of weather, it could also simply cluster with regards to competition year, which is a known variable.

1.2.3 Reinforcement Learning

Reinforcement learning is the branch of machine learning that is most obviously linked to AI. It generally consists of an “agent”, which acts in an environment with the goal to maximize gain or minimize loss. It functions as a “trial and error” method to optimize a policy with regards to the defined environment.

One example is the algorithm AlphaZero⁶, which learned Go and Chess by playing simulated games where it tried different strategies, until it became proficient enough to defeat the strongest existing programs in both games. In the example of chess, the environment it acts in is the chess board with chess rules. It tries different moves with the goal to obtain checkmate, and remembers what policies lead to more gain, i.e. chances of winning. Different board pieces or board positions can also be introduced as loss or gain, where checkmate is still by far the largest gain, but a queen is worth more than a pawn, and an open rook in the middle of the field is worth more than a rook locked behind pawns. As the performance of a competitor in Vasaloppet is based much more on their physical and technical abilities than their policy during the race, training an agent on Vasaloppet timings is unlikely to provide meaningful results.

6 https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go

1.3 Task

The objective of this thesis is to improve timing predictions, for which supervised learning was used.

2 Theory

In this project, five methods were evaluated: the current mean-velocity-based method (original), the average of the timing measurements per starting row (mean value prediction), linear regression (LR), deep neural networks (DNN) and support vector machine regression (SVR).

2.1 Original Method

The original method, as described in the introduction, calculates the mean velocity by dividing the distance to the current control by the timing at that control. This mean velocity is then used to predict future timings by dividing the distance from the current control to the prediction target by the current mean velocity, and adding the current timing to that value. Thus, this method requires no knowledge of prior timings, and no other data than the distances between the controls.
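The calculation described above can be sketched in a few lines. This is only an illustration of the mean-velocity extrapolation, not the code used by the results service; the function name `predict_original` and the example distance of 11 km to the current control are assumptions made for the example.

```python
def predict_original(dist_current_km, time_current_s, dist_target_km):
    """Mean-velocity prediction of the timing at a later control.

    dist_current_km: distance from the start to the current control
    time_current_s:  measured timing at the current control
    dist_target_km:  distance from the start to the prediction target
    """
    mean_velocity = dist_current_km / time_current_s   # km per second so far
    remaining_km = dist_target_km - dist_current_km    # distance left to the target
    # Current timing plus the extrapolated time for the remaining distance.
    return time_current_s + remaining_km / mean_velocity


# Illustrative numbers only: a control 11 km from the start reached after 1.5 h,
# with the finish line at 90 km.
predicted_finish_s = predict_original(11.0, 1.5 * 3600, 90.0)
print(predicted_finish_s / 3600)  # predicted finish time in hours
```

Because the mean velocity is assumed constant, the prediction is simply the current timing scaled by the ratio of the two distances, which is why a slow first stretch inflates every later prediction.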

2.2 Mean value prediction

The simplest prediction method using data from prior years is taking the average timing per evaluated group. The mean timing value was calculated for every starting row at every control, and was used as the prediction for every control and starting row group. This prediction requires no data from the ongoing race and is therefore computationally very cheap. It can also make predictions before the race begins.
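A minimal sketch of such a per-group average is given below. The record layout (a starting row paired with a list of timings, one per control) and the function name `mean_timings_per_row` are assumptions for illustration, and the timings are made up.

```python
from collections import defaultdict


def mean_timings_per_row(records):
    """records: iterable of (starting_row, [timing at each control, ...]).

    Returns {starting_row: [mean timing at each control, ...]}.
    """
    sums = {}
    counts = defaultdict(int)
    for row, timings in records:
        if row not in sums:
            sums[row] = [0.0] * len(timings)
        for control, timing in enumerate(timings):
            sums[row][control] += timing
        counts[row] += 1
    return {row: [s / counts[row] for s in sums[row]] for row in sums}


# Made-up timings in seconds for two starting rows and two controls.
history = [
    (1, [3000.0, 7000.0]),
    (1, [3200.0, 7400.0]),
    (2, [4000.0, 9000.0]),
]
row_means = mean_timings_per_row(history)
print(row_means[1])  # prediction used for every skier in starting row 1
```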

2.3 Data handling

A machine learning method is trained and evaluated by splitting the dataset into a training set, a test set and a validation set. The training set is the majority of the data, and is the set from which the algorithm learns. The validation set is used to prevent overfitting by evaluating the loss on another data set that has not been used for training, and can be used as a stopping criterion; if the loss for the training set is decreasing but the loss for the validation set starts increasing, overfitting is highly likely. The validation set is continuously evaluated during the training process, and is thus seen during the training, but not trained on.

The test dataset is completely separated from the training process, and only evaluated after the training is done. It is the loss on the test dataset that is used to represent how well the model is performing.

Splitting the dataset can be done by random sampling, or by excluding one year as the test year. By using cross testing, this exclusion can be repeated until the whole dataset has been used as test data, to minimize the risk of optimizing for only a subset of the whole problem. Furthermore, the validation set can be split from the training set by again excluding one year, or by randomly sampling from the training set. For Deep Neural Networks, retraining was used, so that the model was saved and reused on the next excluded year. Consequently, cross testing implies that the test data is not truly unseen during training, as it was present in a previous training session of the same model. To prevent this, one year can be excluded as the test year, after which one of the remaining years at a time is excluded as validation year in a cross validation sequence.
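The year-wise exclusion can be sketched as a small generator. This is a simplified illustration: the function name `leave_one_year_out` is hypothetical, the validation year is fixed to the last remaining year rather than cycled, and the listed years are placeholders rather than the years actually present in the data set.

```python
def leave_one_year_out(years):
    """Yield (train_years, validation_year, test_year) for every choice of test year.

    For simplicity the validation year is the last remaining year; cycling
    through all remaining years would give a full cross validation sequence.
    """
    for test_year in years:
        remaining = [y for y in years if y != test_year]
        validation_year = remaining[-1]
        train_years = remaining[:-1]
        yield train_years, validation_year, test_year


# Placeholder race years, used only to show the shape of the splits.
splits = list(leave_one_year_out([2016, 2017, 2018, 2019]))
for train, val, test in splits:
    print("train:", train, "validation:", val, "test:", test)
```

Each year appears exactly once as test data, and within a split the test and validation years never appear among the training years.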

2.4 Machine Learning Process

The training process consists of making a prediction, evaluating the prediction with regards to the labels, updating the weights used in the prediction with regards to the evaluation, and then making a new prediction. Weights are initialized at the start of the training either randomly or by using a priori information.

2.5 Loss Function

The loss function is the basis of evaluation and optimization for machine learning algorithms. Depending on the machine learning task it can be evaluated differently, though in the case of supervised learning it is generally calculated as a distance between the predicted output and the labels. This distance can be the mean squared error (MSE) or the mean absolute error (MAE) for regression tasks, and cross entropy⁷, negative log likelihood, margin classifier or soft margin classifier⁸ for classification tasks. The choice of loss function depends on the task; for regression, any L_p norm of the error is mathematically viable, but in general MSE, which corresponds to the L2 norm, is chosen. If MAE, which corresponds to the L1 norm, is used, the model penalizes outliers less, while higher-order norms result in the worst predictions dominating the loss value.
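The different sensitivity to outliers can be seen in a small numerical example. The values below are made up: three predictions are off by one unit and one prediction is off by ten.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 10.0, 10.0, 10.0]
y_pred = [11.0, 9.0, 11.0, 20.0]  # the last prediction is an outlier

mae_value = mae(y_true, y_pred)
mse_value = mse(y_true, y_pred)

# Fraction of the total loss that is caused by the single worst prediction.
errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
share_mae = max(errors) / sum(errors)
share_mse = max(errors) ** 2 / sum(e ** 2 for e in errors)
print(mae_value, mse_value, share_mae, share_mse)
```

The outlier accounts for a clearly larger share of the squared loss than of the absolute loss, which is the effect described above.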

2.6 Optimization Algorithms

At the most general level, machine learning learns to estimate a function from inputs to outputs. Usually, the function estimate is made as a set of weights. The optimization of these weights is the main purpose of machine learning algorithms, and many optimization algorithms exist, such as Gradient Descent, Stochastic Gradient Descent (SGD)[1][2], Stochastic Gradient Descent with Momentum[3], AdaGrad[4], RMSProp[5] and Adam[6].

7https://en.wikipedia.org/wiki/Cross_entropy

8https://en.wikipedia.org/wiki/Margin_classifier


2.6.1 Gradient Descent

Gradient Descent optimizes by updating the weights proportionally to the negative gradient of the loss function, thus updating the weights proportionally to how much they influenced the loss. With L as the loss function of the set of weights θ at iteration n, with a learning rate γ and a data set size of N, the equation for gradient descent becomes

$$\theta_{n+1} = \theta_n - \gamma \nabla_\theta L(\theta_n) = \theta_n - \gamma \sum_{i=1}^{N} \frac{\nabla_\theta L_i(\theta_n)}{N}, \tag{1}$$

where $L_i$ is the loss evaluated on data point $i$.

Gradient descent works well with functions that are smooth and easily differentiable. If the loss function is computationally costly to differentiate, or if the training set is larger than there is computational power to process, random sampling from the whole training set to calculate loss and gradient can be used to enable training. Stochastic Gradient Descent works like Gradient Descent, but samples a random batch to use for the gradient update of the weights. The loss function is thus evaluated fewer times,

$$\theta_{n+1} = \theta_n - \gamma \nabla_\theta L(\theta_n) = \theta_n - \gamma \sum_{i=1}^{M} \frac{\nabla_\theta L_i(\theta_n)}{M}, \tag{2}$$

where M < N is the batch size, $L_i$ is the loss on the i-th sampled data point, and the batch elements are chosen stochastically. If, for instance, the Vasaloppet data set consisted of 10^12 entries, choosing to only use 10^6 data points per iteration would still allow for training on the whole data set given enough iterations, but considerably faster, at the expense of not using every data point during each iteration. In the case of Vasaloppet, it is highly likely that many data points convey similar information, and thus little information is lost by not using every data point in every iteration.
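As an illustration of the batch update, the sketch below fits a single slope k in the model y = k·x with minibatch SGD on the mean squared error. The data are synthetic and noise-free (true slope 3), and the function name, learning rate, batch size and iteration count are arbitrary choices for the example rather than values from this project.

```python
import random


def sgd_fit_slope(xs, ys, lr=0.1, batch_size=10, iterations=500, seed=0):
    """Fit y = k * x by minibatch SGD on the mean squared error."""
    rng = random.Random(seed)
    k = 0.0
    indices = range(len(xs))
    for _ in range(iterations):
        # Draw M = batch_size data points at random from the N available ones.
        batch = rng.sample(indices, batch_size)
        # Batch average of the gradient of (y - k x)^2 with respect to k.
        grad = sum(-2.0 * xs[i] * (ys[i] - k * xs[i]) for i in batch) / batch_size
        k -= lr * grad
    return k


# Synthetic data: 100 points generated with slope 3 and no noise.
xs = [(i % 10 + 1) / 10 for i in range(100)]
ys = [3.0 * x for x in xs]
k = sgd_fit_slope(xs, ys)
print(k)
```

Each iteration touches only 10 of the 100 points, yet the estimate still approaches the slope used to generate the data, because the points carry similar information.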

Another issue when using Gradient Descent is if the loss function has local minima, or maxima if gradient ascent is used to maximize the function. In that case, the gradient goes to zero and the algorithm converges to a solution that is not the optimal global solution. To prevent this, "momentum" can be introduced, which is a velocity term that is updated by the gradient, and in turn updates the weights. The algorithm for SGD with momentum is

$$v_{n+1} = \beta v_n - \gamma \nabla_\theta L(\theta_n), \qquad \theta_{n+1} = \theta_n + v_{n+1},$$

where v_n is the velocity at iteration n and the momentum term β is another hyperparameter, a value between 0 and 1 that acts as friction on the system to ensure that the velocity does not grow uncontrollably. The velocity term has two benefits over regular SGD: it increases the step size when the algorithm is continuously moving with the same gradient sign, thus increasing speed when the algorithm is heading in the correct direction on the solution map, and it has the ability to bypass local solutions, where it slows down due to the changing gradient sign, but ideally moves past the local maximum/minimum before it stops. Besides using the value of the gradient to update the velocity, the velocity can also be updated by a fixed value multiplied with the sign of the gradient. Compared to the normal version, this makes the velocity change faster when the gradient is small and slower when the gradient is large. Another version, called Nesterov accelerated gradient, instead uses the estimated next position in the loss topography to calculate the gradient. This allows the algorithm to react ahead of time as it approaches a significant change in the gradient.
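A minimal sketch of the momentum update follows, assuming the common form in which the velocity accumulates the negative scaled gradient and is then added to the weights. For simplicity the full gradient of a small quadratic test function is used instead of a sampled batch; the test function and the hyperparameter values are illustrative assumptions.

```python
def gd_momentum(grad_fn, theta, lr=0.1, beta=0.9, iterations=500):
    """Gradient descent with momentum on a list of weights."""
    v = [0.0] * len(theta)
    for _ in range(iterations):
        g = grad_fn(theta)
        # The velocity keeps a fraction beta of its old value (friction)
        # and is pushed along the negative gradient.
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        # The weights move with the velocity.
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta


def grad(theta):
    # Gradient of the test loss 0.5 * (theta0^2 + 10 * theta1^2), minimum at (0, 0).
    return [theta[0], 10.0 * theta[1]]


theta = gd_momentum(grad, [1.0, 1.0])
print(theta)
```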

2.6.2 Adagrad

Gradient descent with momentum applies the same learning rate to every weight, regardless of how often or how strongly each weight has been updated before. This means that for sparse data, the weights for features that are rarely seen or rarely relevant risk being drowned in noise from data points where they should not be updated, or only updated slightly, since their relevant updates happen so infrequently. To prevent this, an algorithm which adapts the learning rate to the parameter is desirable. Each parameter would then have an individual learning rate, which is updated as the algorithm trains. Adagrad is one such algorithm, which scales the learning rate by the accumulated square of the gradient as

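$$
\begin{aligned}
g_n &= \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta_n), \\
G_n &= \sum_{\tau=1}^{n} g_\tau g_\tau^\top, \\
\theta_{n+1} &= \theta_n - \frac{\gamma}{\sqrt{\operatorname{diag}(G_n) + \epsilon I}} \cdot g_n.
\end{aligned} \tag{3}
$$

G_n is the running sum of squared gradients from the start to iteration n. g_n is the current gradient, averaged over the data set, with L_i the loss on data point i. θ_n is the weight vector at iteration n. γ is the learning rate and εI is a small value ε times the identity matrix, chosen to ensure that the denominator never becomes zero. The square root and the division are applied element-wise.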

The main benefit of Adagrad is that it updates proportionally by accelerating updates with small gradients through division by a small value, and decelerating updates with large gradients. Thus, the path that it navigates in the solution space is usually a short path to the optimal solution. However, as the iterations increase, the summed value of G_n will increase and the steps will become smaller and smaller. Even if Adagrad converges quickly geometrically, the steps it takes might become so short that it never reaches the end point.

For most problems, Adagrad is slow. In cases where a solution might diverge if it overshoots the minimum, a cautious method is prudent, but in a smooth solution space a faster method is usually desirable.
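The per-weight scaling can be illustrated with a short sketch that stores only the diagonal of G_n. The quadratic test function, the learning rate and the value of ε are assumptions chosen for the example.

```python
import math


def adagrad(grad_fn, theta, lr=0.5, eps=1e-8, iterations=100):
    """Adagrad keeping only the diagonal of G_n (one accumulator per weight)."""
    G = [0.0] * len(theta)
    for _ in range(iterations):
        g = grad_fn(theta)
        # Running sum of squared gradients, separately for each weight.
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        # Each weight is scaled by its own accumulated gradient history.
        theta = [ti - lr * gi / math.sqrt(Gi + eps)
                 for ti, gi, Gi in zip(theta, g, G)]
    return theta, G


def grad(theta):
    # Gradient of 0.5 * (theta0^2 + 10 * theta1^2): the second weight
    # always sees a gradient ten times larger than the first.
    return [theta[0], 10.0 * theta[1]]


theta, G = adagrad(grad, [1.0, 1.0])
print(theta, G)
```

Although the two gradients differ by a factor of ten, both weights follow practically the same trajectory, since each update is normalized by that weight's own gradient history.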


2.6.3 RMSProp

To prevent the issue of a vanishing step length, a version of Adagrad called RMSProp can be used. RMSProp introduces a decay term α which reduces the growth of G_n with iteration length.

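$$
\begin{aligned}
g_n &= (1 - \alpha) \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta_n), \\
G_n &= \sum_{\tau=1}^{n} \alpha^{n-\tau} g_\tau g_\tau^\top, \\
\theta_{n+1} &= \theta_n - \frac{\gamma}{\sqrt{\operatorname{diag}(G_n) + \epsilon I}} \cdot \frac{g_n}{1 - \alpha}.
\end{aligned} \tag{4}
$$

The term α is another hyperparameter, usually set to around 0.9. As in Adagrad, εI is a small multiple of the identity matrix.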

With these improvements, the RMSProp version of Adagrad becomes significantly faster. It is still usually slower than Stochastic Gradient Descent with Momentum however, but trades convergence speed for convergence stability.

Ideally though, a method which combines both the speed and the robustness would be desired.
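The effect of the decay term on the step length can be illustrated by feeding both accumulators a constant gradient. Note that the sketch uses the common recursive form G ← αG + (1 − α)g², which is an assumption made here and differs from the formulation in equation 4 by a constant scaling; the function name and parameter values are illustrative.

```python
import math


def effective_steps(iterations, lr=0.01, alpha=0.9, eps=1e-8, grad=1.0):
    """Step lengths of Adagrad and RMSProp when the gradient stays constant."""
    G_ada, G_rms = 0.0, 0.0
    ada_steps, rms_steps = [], []
    for _ in range(iterations):
        G_ada += grad * grad                                    # grows without bound
        G_rms = alpha * G_rms + (1.0 - alpha) * grad * grad     # decaying average
        ada_steps.append(lr * grad / math.sqrt(G_ada + eps))
        rms_steps.append(lr * grad / math.sqrt(G_rms + eps))
    return ada_steps, rms_steps


ada, rms = effective_steps(200)
print(ada[0], ada[-1])  # the Adagrad step keeps shrinking
print(rms[0], rms[-1])  # the RMSProp step settles near the learning rate
```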

2.6.4 Adam

Adam, Adaptive Moment Estimation, is a method that combines the benefits of momentum with the benefits of adaptation. The update function for Adam is

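$$
\begin{aligned}
m_{n+1} &= \beta_1 m_n + (1 - \beta_1) \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta_n), \\
g_n &= (1 - \beta_2) \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta_n), \\
G_n &= \sum_{\tau=1}^{n} \beta_2^{\,n-\tau} g_\tau g_\tau^\top, \\
\theta_{n+1} &= \theta_n - \frac{\gamma}{\sqrt{\operatorname{diag}(G_n) + \epsilon I}} \cdot m_{n+1}.
\end{aligned} \tag{5}
$$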

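With a more qualitative representation, this simplifies to

$$
\begin{aligned}
m_{n+1} &= \beta_1 m_n + (1 - \beta_1) \nabla_\theta L(\theta_n), \\
g_{n+1} &= \beta_2 g_n + (1 - \beta_2) \left(\nabla_\theta L(\theta_n)\right)^2, \\
\theta_{n+1} &= \theta_n - \frac{\gamma}{\sqrt{g_{n+1}} + \epsilon} \, m_{n+1},
\end{aligned} \tag{6}
$$

where the square and the square root are taken element-wise and ε is a small constant that prevents division by zero.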

The m-term is the momentum term from SGD with momentum, and the g-term is the same as for RMSProp. β1 is usually chosen to be around 0.9, and β2 is usually set to around 0.999. With the combination of momentum and adaptation, Adam can navigate the solution space with both speed and caution. There is naturally a trade-off: Adam will generally not be as fast and reckless as SGD with momentum, nor as direct in its path as RMSProp or Adagrad, but it combines most of the benefits of both approaches and is a suitable optimization algorithm for most problems.
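The simplified update in equation 6 can be sketched as follows. The sketch follows the equations as written, so the bias-correction terms found in many library implementations of Adam are deliberately left out; the quadratic test function and the learning rate are assumptions for the example.

```python
import math


def adam(grad_fn, theta, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, iterations=2000):
    """Adam without bias correction, following the simplified form above."""
    m = [0.0] * len(theta)   # momentum term (running mean of gradients)
    g2 = [0.0] * len(theta)  # adaptive term (running mean of squared gradients)
    for _ in range(iterations):
        grad = grad_fn(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]
        g2 = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(g2, grad)]
        theta = [ti - lr * mi / (math.sqrt(vi) + eps)
                 for ti, mi, vi in zip(theta, m, g2)]
    return theta


def grad(theta):
    # Gradient of 0.5 * (theta0^2 + 10 * theta1^2), minimum at (0, 0).
    return [theta[0], 10.0 * theta[1]]


theta = adam(grad, [1.0, 1.0])
print(theta)
```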

2.7 Algorithms

Besides the choice of loss function and optimization algorithm, the way to estimate the function must also be decided. In this project, three algorithms were considered for regression: Linear Regression, Deep Neural Networks and Support Vector Machine Regression.

2.8 Linear Regression

Linear regression is the simplest machine learning implementation. It works by finding a linear solution which minimizes the given loss function. The linear solution is modelled as a set of weights that are multiplied with the inputs and summed to form the output; each input is assigned one weight value, which is a scalar, and one weight is assigned to an input of one, which acts as bias. The sum of every input times its weight then results in the output.

In the case of Vasaloppet, with timings up to the current control, it attempts to find the optimal weights to multiply the timings with to minimize the mean square error (MSE) between the prediction and the true value at the target. This is a linear system of equations, which can be solved analytically. However, for complex problems in many dimensions, a numerical machine learning method is preferred, as the analytical methods suffer from numerical difficulties when inverting large matrices.

With the loss function being mean square error, the equation for the loss for linear regression becomes

$$L = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - (k x_i + m)\right)^2, \tag{7}$$

where L is the loss; i is a skier in the data set with corresponding timings and possibly other data features; N is the total number of competitors in the data set; y_i is the label, the timing at the target of prediction; k are the weights assigned to each input; x_i are the inputs, the timings at the controls leading up to the target, and other available data, such as starting row; and m is the bias.

y_i and x_i are vectors and k is a matrix, with sizes corresponding to the number of inputs and outputs. If a prediction is being made from the third control to all upcoming controls and the finish line, y_i is a 5x1 vector with the true values at the targets, k is a 5x3 matrix with the weights for the first three controls, and x_i is a 3x1 vector with timings for the first three controls.

When training one model per control, the loss is computed for every future control with the same model. Thus the loss becomes a vector with the same size as the output. To minimize the total loss, this vector has to be reduced to a single value. One solution is to calculate the mean error over all outputs after the error for each output is calculated according to equation 7. However, some predictions are more likely to have large errors than others, as predicting a timing four controls away is more difficult than predicting the timing at the next control. To counteract this, a weighted sum could be used, which could equalize the influence of different targets by analysing the error distribution with regards to the prediction target. This method runs the risk of having a larger error for later targets once evaluated, as they will now be scaled down in relation to their absolute error when training. A method to accomplish optimal solutions for both close and far targets is to split the training algorithm for every target. This results in a larger number of weights to be trained and stored, but in theory better prediction results, as specificity increases for every estimated transfer function.

With the loss function specified in equation 7, the partial derivatives with regards to k and m are

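$$\frac{\partial L}{\partial k} = \frac{2}{N} \sum_{i=1}^{N} -x_i \left(y_i - (k x_i + m)\right) \tag{8}$$

$$\frac{\partial L}{\partial m} = \frac{2}{N} \sum_{i=1}^{N} -\left(y_i - (k x_i + m)\right). \tag{9}$$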

Using these equations, k and m can be updated as

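$$k_{t+1} = k_t - \gamma \left.\frac{\partial L}{\partial k}\right|_{t} \tag{10}$$

$$m_{t+1} = m_t - \gamma \left.\frac{\partial L}{\partial m}\right|_{t}, \tag{11}$$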

where t denotes iteration number and γ the learning rate, which is a hyperparameter chosen by the designer. To evaluate the prediction, loss is again calculated. The algorithm stops either after a certain number of iterations or when it satisfies a stopping criterion, such as a low enough error, or when the error for the validation set increases.
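Equations 7 to 11 can be put together in a short sketch for a single input and a single output. The data are synthetic (generated from y = 2x + 1 without noise), and the learning rate and iteration count are arbitrary choices for the example; a fixed iteration count is used as the stopping criterion.

```python
def fit_linear(xs, ys, lr=0.05, iterations=2000):
    """Fit y = k * x + m by gradient descent on the mean squared error."""
    n = len(xs)
    k, m = 0.0, 0.0
    for _ in range(iterations):
        residuals = [y - (k * x + m) for x, y in zip(xs, ys)]
        # Partial derivatives of the loss with respect to k and m.
        dL_dk = 2.0 / n * sum(-x * r for x, r in zip(xs, residuals))
        dL_dm = 2.0 / n * sum(-r for r in residuals)
        # Step against the gradient, scaled by the learning rate.
        k -= lr * dL_dk
        m -= lr * dL_dm
    return k, m


# Synthetic, noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
k, m = fit_linear(xs, ys)
print(k, m)
```

The two fitted values can be read directly as "how much the output changes per unit of input" and "the offset", which is the interpretability advantage of linear regression.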

Linear regression has some advantages compared to more complex machine learning algorithms due to its simplicity; it is easy to implement, fast to train and execute, and the resulting weights can be understood intuitively. For every parameter, for instance the timing at a certain control in Vasaloppet, one weight is assigned, which shows how this parameter contributes to the prediction. Since the number of weights is small (the number of input parameters plus one, for every output), training is fast as few weights need to be tuned. Furthermore, a linear model with a limited number of weights decreases the risk of overfitting.

While simplicity has some benefits, there are also disadvantages associated with linear regression. If the considered problem is nonlinear, a linear solution will perform poorly. Compared to more complex models, linear regression can be assumed to have worse performance, given a proper problem formulation with training data that is a good representation of the test data. To improve performance beyond linear regression and map nonlinear functions, a more complex model is thus needed. For this, Deep Neural Networks can be appropriate.

2.9 Deep Neural Networks

Deep Neural Networks work as an extended Linear Regression algorithm, but with layers of weights and nonlinear activation functions instead of simple summations. The inputs to the model are multiplied with weights and summed up, just as with linear regression. However, the output nodes of the input layer are not the true output, but the first hidden layer. Each sum from the first layer is put through an activation function⁹, and serves as input to the next layer. A node which sums up the outputs times the weights from the previous layer and outputs an activation function of this sum is called a neuron. The process is repeated for every layer until the output layer is reached. The structure of the network is an important hyperparameter which is chosen by the designer.

2.9.1 Network overview

A deep neural network is a complex sum of simple parts; each neuron is a simple function, very similar to a linear regression algorithm and easily understood, while the network forms a much more complex function and its behaviour is difficult to intuitively grasp. The network consists of several layers of simple neurons, where the output of one layer multiplied with the weight for that connection is used as the input to the next. The depth of the network is how many layers it has, and the width is how many neurons there are per layer. Different layers can have different number of neurons. A deeper and wider network will be able to approximate more complex functions, but also has a higher risk of overfitting and takes longer to train. The architecture of the network is an important hyperparameter to tune, as the complexity of the task and the dataset size are important factors to take into consideration when designing a network.

An overview of a network with architecture 5x4 with 3 inputs and 2 outputs is shown in Figure 1.

⁹ Explained in section 2.9.3.

Figure 1: Representation of a deep neural network architecture.

Source: https://towardsdatascience.com/a-laymans-guide-to-deep-neural-networks-ddcea24847fb

The essence of the neural network is the neuron, which is loosely based on the neurons in the brain, and the weight, which is loosely based on the synapse that transmits information between neurons. The neuron has inputs, an activation function and outputs. The outputs, multiplied by the respective weights, are used as inputs to the neurons of the next layer. The activation function is a function, typically nonlinear, that transforms the value at a neuron to an output. It is an important part of the network: a linear activation function, such as the unit function, would make depth irrelevant, as any network would just be a set of linear equations, which can be simplified to one layer of linear functions. The activation function is explained in more detail in section 2.9.3.
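The claim that linear activations make depth irrelevant can be checked numerically. The following is a minimal sketch with made-up weights (all values and helper names are assumptions for illustration): a two-layer network with the unit function as activation gives the same output as a single layer whose weights are the product of the two weight matrices.

```python
def matvec(W, x):
    """Multiply the matrix W (list of rows) with the vector x."""
    return [sum(w * x_k for w, x_k in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply the matrices A and B (both lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# Hypothetical parameters: 2 inputs -> 3 hidden neurons -> 1 output.
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, -1.0, 2.0]]
b2 = [0.5]


def two_layers_unit_activation(x):
    """Two layers where the activation function is the unit function."""
    hidden = [z + b for z, b in zip(matvec(W1, x), b1)]
    return [z + b for z, b in zip(matvec(W2, hidden), b2)]


# The same mapping written as one layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = [z + c for z, c in zip(matvec(W2, b1), b2)]


def one_layer(x):
    return [z + c for z, c in zip(matvec(W, x), b)]
```

For any input, `two_layers_unit_activation(x)` and `one_layer(x)` agree up to floating point rounding, which is why a nonlinear activation function is needed for the hidden layers to add expressive power.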

To evaluate the performance of the network, the loss $L$ between the label $y$ and the output $a^N$, where $N$ is the network depth, is calculated using the loss function $f_L$, which gives the equation

$$L = \sum_{j=0}^{N-1} f_L(y_j, a^N_j). \tag{12}$$

The output $a^l_j$ of a neuron $j$ at layer $l$ is the result of the activation function $f_a$ applied to the value $z^l_j$ at the neuron, which gives equation 13

$$a^l_j = f_a(z^l_j). \tag{13}$$

$z^l_j$ is the sum of all outputs $a^{l-1}_k$ from the previous layer $l-1$ times the weights $w^l_{jk}$ between the layers plus the bias $b^l_j$ at the layer, where $j$ indicates the neuron at layer $l$ and $k$ the neuron at layer $l-1$. The combination of the weights and biases is referred to as the parameters $\theta$. The weight $w^l_{jk}$ is the connection between the neuron $k$ at the previous layer and neuron $j$ at the current layer, and the bias $b^l_j$ is the added value at the current neuron. The relationship between the value at the neuron and the weights, inputs and bias thus becomes

$$z^l_j = \sum_{k=1}^{m} w^l_{jk} a^{l-1}_k + b^l_j. \tag{14}$$

The weights $w$ between the neurons and the biases $b$ at the neurons are the parameters that the network tries to optimize through the optimization algorithms. The gradient $\nabla_\theta L$ is thus the set of partial derivatives of the loss with regards to the parameters $w$ and $b$. To calculate the gradient and apply an optimization algorithm, a method called backpropagation is used.
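Equations 13 and 14 can be written as a short feedforward routine. The sketch below is a minimal pure Python illustration, assuming RELU as the activation function $f_a$ in every layer; the function names, the nested-list layout of the weights and the example values are assumptions made here, not the implementation used in this work.

```python
def relu(z):
    return max(0.0, z)


def forward(x, weights, biases, f_a=relu):
    """Compute z^l (equation 14) and a^l (equation 13) for every layer.

    weights[l][j][k] connects neuron k of the previous layer to neuron j
    of the current layer, and biases[l][j] is the bias at neuron j.
    """
    activations = [list(x)]  # a^0 is the input vector
    zs = []
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # Equation 14: weighted sum of the previous outputs plus the bias.
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
             for row, b_j in zip(W, b)]
        zs.append(z)
        # Equation 13: the activation function applied to each sum.
        activations.append([f_a(z_j) for z_j in z])
    return zs, activations


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
weights = [[[1.0, -1.0], [0.5, 0.5]], [[2.0, 1.0]]]
biases = [[0.0, 1.0], [-1.0]]
zs, activations = forward([1.0, 2.0], weights, biases)
```

The returned lists hold the values at every neuron, which are the quantities needed later when the gradient is computed.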

2.9.2 Backpropagation

In all optimization algorithms mentioned in section 2.6, the gradient of the loss function with regards to the weights is a central component which has to be calculated.

In linear regression, the gradient is easily calculated and the weights updated as the cost function is a function of the output, and the output is a linear function of the input. The effect of each weight on the loss is thus easy to compute and to visualize. For deep neural networks however, every layer has a nonlinear activation function, so directly computing the gradients with respect to each individual weight is extremely costly. Instead, the gradient is computed for each single input/output example with respect to the weights of the whole network[3].

Backpropagation is the algorithm for updating the weights of a network with regards to one input-output sequence. The algorithm consists of three steps, repeated for each input-output sequence in either the whole dataset, or in the minibatch if SGD is used:

1) Feedforward. Compute all neuron values $z$ and activations $a$ for the given input $x_i$.

2) Compute the output error $L_\theta(y_i, \hat{y}_i)$ between the label $y_i$ and the network output $\hat{y}_i$.

3) Propagate the error backwards to compute the error contribution for each weight and bias, and compute the gradient $\frac{\partial L}{\partial \theta^l_{jk}}$ for all $j$, $k$ and $l$.

After all selected inputs have been evaluated, the average gradient over all selected inputs is used in the update step.
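The three steps can be sketched for a single input-output pair as follows. This is a minimal pure Python illustration under stated assumptions: RELU is used as activation in every layer, the loss is the squared error summed over the output neurons, and all names and example values are hypothetical rather than taken from the implementation used in this work.

```python
def relu(z):
    return max(0.0, z)


def relu_prime(z):
    return 1.0 if z > 0.0 else 0.0


def forward(x, weights, biases):
    """Step 1: compute all z^l and a^l, where a^0 is the input."""
    activations, zs = [list(x)], []
    for W, b in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b_j
             for row, b_j in zip(W, b)]
        zs.append(z)
        activations.append([relu(z_j) for z_j in z])
    return zs, activations


def loss(y, y_hat):
    """Step 2: squared error summed over the output neurons."""
    return sum((a - t) ** 2 for a, t in zip(y_hat, y))


def backprop(x, y, weights, biases):
    """Step 3: gradient of the loss for every weight and bias."""
    zs, activations = forward(x, weights, biases)
    # Derivative of the squared error with respect to each output.
    dL_da = [2.0 * (a - t) for a, t in zip(activations[-1], y)]
    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in reversed(range(len(weights))):
        # Chain rule: dL/dz_j = dL/da_j * f_a'(z_j).
        delta = [d * relu_prime(z) for d, z in zip(dL_da, zs[l])]
        grad_b[l] = delta                     # dz_j/db_j = 1
        grad_W[l] = [[d * a_k for a_k in activations[l]]
                     for d in delta]          # dz_j/dw_jk = a_k
        # Pass the error to the previous layer: sum over all neurons j.
        dL_da = [sum(weights[l][j][k] * delta[j] for j in range(len(delta)))
                 for k in range(len(activations[l]))]
    return grad_W, grad_b


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
weights = [[[1.0, -1.0], [0.5, 0.5]], [[2.0, 1.0]]]
biases = [[0.5, 1.0], [-1.0]]
x, y = [1.0, 2.0], [1.0]
grad_W, grad_b = backprop(x, y, weights, biases)
```

In this example the first hidden neuron has a negative sum, so RELU outputs zero and all gradients of the weights feeding that neuron are zero for this particular input.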

The gradient can then be applied in the optimization algorithm to update the weights and biases. Steps 1 and 2 are straightforward to execute, as they follow the equations in section 2.9. Step 3, however, requires more equations to execute, as propagating the error backwards and calculating the gradient is not trivial.

When utilizing backpropagation, the error contribution of weights and biases is first calculated for the last layer, by computing the partial derivative of the loss with regards to the weights in that layer, $\frac{\partial L}{\partial \theta^L_{jk}}$, where the superscript $L$ indicates the last layer. The result from this calculation will inform how much the weight $w^L_{jk}$ between neuron $k$ at the second last layer $L-1$ and neuron $j$ at the last layer, and the bias $b^L_j$ at neuron $j$, affected the error between the output $\hat{y}_j = a^L_j$ and the label $y_j$.

This equation can be solved using the chain rule for derivatives, which gives

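$$\frac{\partial L}{\partial \theta^L_{jk}} = \frac{\partial L}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} \frac{\partial z^L_j}{\partial \theta^L_{jk}}. \tag{15}$$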

To solve equation 15, the three partial derivatives on the right hand side must be solved. $\frac{\partial L}{\partial a^L_j}$ can be calculated using the definition of $L$ from equation 12. For a single weight $\theta^L_{jk}$, the general solution becomes

$$\frac{\partial L}{\partial a^L_j} = f_L'(y_j, a^L_j), \tag{16}$$

where $f_L'$ is the derivative of the loss function with respect to the output. The resulting value symbolizes how much the output value of neuron $j$ of the final layer $L$ affected the loss. With the loss function being mean squared error, the solution becomes

$$\frac{\partial L}{\partial a^L_j} = 2(a^L_j - y_j). \tag{17}$$

$\frac{\partial a^L_j}{\partial z^L_j}$ is solved by using the definition from equation 13. The general solution is the derivative $f_a'(z^L_j)$ of the activation function. For RELU, this becomes 1 for values over 0, and 0 otherwise. The resulting value from this computation symbolizes how much the summed value at the neuron affected the output value.

Since RELU sets values below zero to zero, the derivative is zero whenever $z^L_j < 0$. For such an input vector, slightly altering the weights to output $a^L_j$ does not change anything.

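Finally, $\frac{\partial z^L_j}{\partial \theta^L_{jk}}$ is calculated using equation 14, from which the equations

$$\frac{\partial z^L_j}{\partial w^L_{jk}} = a^{L-1}_k \tag{18}$$

and

$$\frac{\partial z^L_j}{\partial b^L_j} = 1 \tag{19}$$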

can be derived. $a^{L-1}_k$ has been computed in step 1 of the backpropagation algorithm. $\theta$ is used to symbolize both the weight and the bias, which is why two separate equations arise when solving $\frac{\partial z^L_j}{\partial \theta^L_{jk}}$. The solution to the weight derivative means that the effect of the value of the weight $w^L_{jk}$ on the value $z^L_j$ is the output value of the neuron $k$ at layer $L-1$.

The partial derivatives are calculated for all weights and biases of the final layer, which gives a representation of how each connection and bias affected the total error for this particular input-output sequence. The process is then repeated for the previous layer L − 1.
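The last-layer derivatives can be verified numerically for a single output neuron. The sketch below is an illustration under stated assumptions: RELU activation, squared error loss, and made-up values for the weights, bias, previous-layer outputs and label. It compares the chain rule result of equations 15 to 19 with a central finite difference approximation.

```python
def relu(z):
    return max(0.0, z)


def neuron_loss(w, b, a_prev, y):
    """Squared error of one RELU neuron for a single example."""
    z = sum(w_k * a_k for w_k, a_k in zip(w, a_prev)) + b  # equation 14
    return (relu(z) - y) ** 2


def analytic_gradient(w, b, a_prev, y):
    """Chain rule of equation 15 for the weights and the bias."""
    z = sum(w_k * a_k for w_k, a_k in zip(w, a_prev)) + b
    dL_da = 2.0 * (relu(z) - y)          # equation 17
    da_dz = 1.0 if z > 0.0 else 0.0      # derivative of RELU
    dL_dw = [dL_da * da_dz * a_k for a_k in a_prev]  # uses equation 18
    dL_db = dL_da * da_dz                            # uses equation 19
    return dL_dw, dL_db


def numeric_gradient(w, b, a_prev, y, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    dL_dw = []
    for k in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[k] += eps
        w_minus[k] -= eps
        dL_dw.append((neuron_loss(w_plus, b, a_prev, y)
                      - neuron_loss(w_minus, b, a_prev, y)) / (2.0 * eps))
    dL_db = (neuron_loss(w, b + eps, a_prev, y)
             - neuron_loss(w, b - eps, a_prev, y)) / (2.0 * eps)
    return dL_dw, dL_db


# Hypothetical values for one neuron in the last layer.
w, b, a_prev, y = [0.5, -0.3], 0.1, [2.0, 1.0], 0.2
analytic = analytic_gradient(w, b, a_prev, y)
numeric = numeric_gradient(w, b, a_prev, y)
```

With these values the sum at the neuron is 0.8, so the weight gradients are approximately 2.4 and 1.2 and the bias gradient approximately 1.2, and the two methods agree to within the accuracy of the finite difference.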

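For layer $L-1$, $\frac{\partial z^{L-1}_j}{\partial \theta^{L-1}_{jk}}$ and $\frac{\partial a^{L-1}_j}{\partial z^{L-1}_j}$ are solved exactly the same as for layer $L$, with the exception of using values from layer $L-1$ instead of $L$. However, the effect of an output $a^{L-1}_k$ of layer $L-1$ on the loss is more complex, as it is connected to all outputs in the output layer. Thus, the solution depends on all outputs, and becomes

$$\frac{\partial L}{\partial a^{L-1}_k} = \sum_{j=0}^{n_L-1} \frac{\partial z^L_j}{\partial a^{L-1}_k} \frac{\partial a^L_j}{\partial z^L_j} \frac{\partial L}{\partial a^L_j}. \tag{20}$$

Here $n_L$ is the number of neurons in layer $L$, and $\frac{\partial z^L_j}{\partial a^{L-1}_k} = w^L_{jk}$ follows from equation 14. Equation 20 informs how much the output of the current neuron affects the sum of each error in the output layer.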

Following this method, the error is then propagated backwards and gradients calculated until the first layer is reached. The term $\frac{\partial L}{\partial a^0_j}$ for any output in the first layer will thus lead to all calculated partial derivatives $\frac{\partial L}{\partial a^1_j}$ being used, which requires all partial derivatives $\frac{\partial L}{\partial a^2_j}$. This shows the utility of backpropagation, where only the effects of weights on the output nodes are actually being considered. Computing the gradient by starting at the first layer and multiplying forward would instead result in computing how weight and bias changes affect the hidden node values as well, which is considerably more costly.

2.9.3 Activation Function

The activation function is another choice for the designer, and the needs differ depending on the training task. The simplest activation function is the unit function. This can be seen as the function that acts on linear regression, i.e. a multiplication with 1. However, with linear activation functions, no matter the depth of the network, it will decompose into a linear set of equations between the input and the output, in other words, one layer. To achieve depth in the neural network, a nonlinear term must exist, and since the weights are linear due to being scalar, the activation function is chosen as nonlinear. Common activation functions are the sigmoid function, tanh and RELU [7] (Rectified Linear Unit), which is defined as max(0, x). There are also variations of RELU, such as leaky RELU [8], which is defined as max(kx, x), where k is a small number, typically around 0.1. A recent (2017) function called swish [9], defined as x · sigmoid(x) and related to RELU, has also risen in popularity, but is not as common as RELU yet.
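The activation functions named above can be written out directly from their definitions. The following is a minimal sketch using only the Python standard library; the function names are chosen here for illustration, and the default k = 0.1 for leaky RELU simply follows the value mentioned in the text.

```python
import math


def unit(x):
    """Unit function: passes the value through unchanged."""
    return x


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    return math.tanh(x)


def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def leaky_relu(x, k=0.1):
    """Leaky RELU: max(kx, x) for a small positive k."""
    return max(k * x, x)


def swish(x):
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)
```

For example, `relu(-2.0)` returns 0.0 while `leaky_relu(-2.0)` and `swish(-2.0)` return small negative values, which is the difference discussed in the remainder of this section.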

The sigmoid function and tanh are similar and are both useful for classification tasks. Their main benefits are that they are smooth functions, with no sharp change in the value of the gradient, and that the output value is bounded between 0 and 1 for the sigmoid function, and −1 and 1 for tanh. Both functions are thus normalized, and the output can be used to represent probability, which makes it useful for statistical prediction. The classification version of linear regression, called logistic regression, typically uses the sigmoid function instead of the unit function as an activation function of the output nodes, and the output is interpreted as the probability that the input belongs in the output class. Since both the sigmoid function and tanh are bounded, large values of |x|, if the functions are thought of as f(x), result in clear predictions close to the boundary. However, this also results in the gradient being close to zero, which can result in the gradient vanishing. As explained in section 2.9.2, the gradient is iteratively computed from the last layer to the first. A small gradient might then be multiplied by other small gradients, which eventually leads to the product being smaller than the data type can represent, causing the gradient to vanish entirely from the training process.
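The effect of repeatedly multiplying small derivatives can be illustrated with a simplified calculation. The sketch below only multiplies the activation function derivatives along one path through the network and ignores the weights, which is an assumption made to keep the example short; the function names and the layer counts are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_prime(x):
    """Derivative of the sigmoid; its largest value is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_prime(x):
    return 1.0 if x > 0.0 else 0.0


def chained_derivative(derivative, z_values):
    """Product of activation derivatives along one path, weights ignored."""
    product = 1.0
    for z in z_values:
        product *= derivative(z)
    return product


# Even in the best case (z = 0 everywhere) the sigmoid factor shrinks fast.
sigmoid_10_layers = chained_derivative(sigmoid_prime, [0.0] * 10)
sigmoid_600_layers = chained_derivative(sigmoid_prime, [0.0] * 600)
# For positive sums, the RELU derivative is 1 and the product stays at 1.
relu_600_layers = chained_derivative(relu_prime, [1.0] * 600)
```

After ten layers the sigmoid factor is already below one millionth, and for the deliberately extreme case of 600 layers it underflows to exactly zero in double precision, whereas the RELU factor remains 1.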

RELU and its variations do not serve as classifiers, as their output is not normalized, but also do not suffer from vanishing gradients, as the gradient takes one of two constant values depending on whether x is larger or smaller than 0. Thus, RELUs are computationally efficient, and work better in deep networks than sigmoid or tanh. RELU has two issues however: its output and gradient are zero for negative values of x, and its gradient changes sharply at x = 0. Leaky RELU addresses the first issue, but swish addresses both. Swish has a similar shape to RELU, but is smooth at x = 0 and does not vanish completely for x < 0. The trade-off for swish is that it is more computationally demanding than RELU. For deep networks where performance at the task is paramount, swish is likely more appropriate, but for simpler tasks with a shallower architecture or when computational power is limited, RELU is still a good alternative.

2.10 Support Vector Machine Regression

The Support Vector Machine (SVM) is a shallow machine learning algorithm that is more powerful than linear regression. It relies on support vectors and nonlinear kernels to form hyperplanes and accurately model nonlinear relations. SVM is usually used for classification, but can also be used for regression.

Hyperplanes are a common part of classifiers, as they form the boundary between classes, called a decision boundary. If the inputs place the data points on one side of the boundary, that data point belongs to the corresponding class.

For regression, hyperplanes are not as common, but they are used for support vector regression (SVR). In the case of SVR, the hyperplane forms a basis for predicting a value, similar to interpolating a value between two points.
