
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-A--20/040--SE

How Certain Are You of Getting a

Parking Space?

A deep learning approach to parking availability prediction

Maskininlärning för prognos av tillgängliga parkeringsplatser

Sophie von Corswant

Mathias Nilsson

Supervisor: Jose M. Peña
Examiner: Cyrille Berger


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Sophie von Corswant, Mathias Nilsson


Abstract

Traffic congestion is a severe problem in urban areas and it leads to the emission of greenhouse gases and air pollution. In general, drivers lack knowledge of the location and availability of free parking spaces in urban cities. This leads to people driving around searching for parking places, and about one-third of traffic congestion in cities is due to drivers searching for an available parking lot. In recent years, various solutions to provide parking information ahead have been proposed. The vast majority of these solutions have been applied in large cities, such as Beijing and San Francisco. This thesis has been conducted in collaboration with Knowit and Dukaten to predict parking occupancy in car parks one hour ahead in the relatively small city of Linköping. To make the predictions, this study has investigated the possibility of using long short-term memory and gradient boosting regression trees, trained on historical parking data. To enhance decision making, the predictive uncertainty was estimated using the novel approach Monte Carlo dropout for the former, and quantile regression for the latter. This study reveals that both models can predict parking occupancy ahead of time and that they excel in different contexts. The inclusion of exogenous features can improve prediction quality. More specifically, we found that incorporating hour of the day improved the models' performances, while weather features did not contribute much. As for uncertainty, the employed method Monte Carlo dropout was shown to be sensitive to parameter tuning in order to obtain good uncertainty estimates.


Acknowledgments

First of all, we want to thank all employees at Knowit and Dukaten who have made us feel welcome. A special thanks to our supervisor Fredrik Grahn, who has inspired us and made this project possible. Also, thanks to Matts Skeppstedt at Dukaten and Sandra Strand at Sankt Kors for contributing input from the industrial perspective.

Next, we want to direct our most sincere thanks to our supervisor at Linköping University, Jose M. Peña, for helping us with challenges we have encountered during the project. We also want to thank our examiner Cyrille Berger and our opponents Hanna Sterneling and Per Olin for valuable feedback.

Sophie von Corswant, Mathias Nilsson
Linköping, 2020

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Motivation
   1.2 Aim
   1.3 Research Questions
   1.4 Delimitations
   1.5 Ethical Considerations
2 Theory
   2.1 Neural Networks
   2.2 Recurrent Neural Networks
   2.3 Gradient Tree Boosting
   2.4 Uncertainty
3 Related Work
   3.1 Machine Learning Models
   3.2 Temporal and Spatial Correlations
   3.3 Feature Selection
   3.4 Model Uncertainty
4 Method
   4.1 Pre-study
   4.2 Implementation
   4.3 Evaluation Metrics
5 Results
   5.1 Pre-study
   5.2 Experiments and Evaluation Metrics
6 Discussion
   6.1 Results
   6.2 Method
   6.3 The Work in a Wider Context
7 Conclusion
   7.1 Future Work
Bibliography
A Appendix A: Results from GBRT
   A.1 Baggen
   A.2 Akilles
   A.3 Druvan
B Appendix B: Results from LSTM
   B.1 Baggen
   B.2 Akilles
   B.3 Druvan
   B.4 Histograms MC Dropout Predictions with Input Dropout

List of Figures

2.1 LSTM memory cell as described by Gers et al. [27].
2.2 A schematic illustration of a regression tree (left) and an ensemble of trees (right).
2.3 Comparison of naive (left) and variational (right) dropout applied to RNNs. Horizontal arrows are recurrent connections and vertical arrows feed-forward connections. Arrows in the same color represent dropout layers with identical dropout masks.
4.1 Temporal dependencies. In 4.1b it can be seen that the car parks have different occupancy peaks, even though the pattern is similar. In Figure 4.1a the differences in occupancy rate between different types of days can be seen.
4.2 An overview of the complete dataset. The gray lines show the weekly occupancy rate and the black line represents an average week. Akilles is noisier than the other datasets.
4.3 Box plot of the occupancy rate in parking house Baggen in good versus bad weather at different hours of the day.
4.4 A plot of the PACF for parking house Baggen.
4.5 The proposed methodology.
4.6 Data is split using a prequential approach. 10% of the data is used for testing, whereas the rest is used for training and validation.
5.1 Model comparison between GBRT and LSTM: the dashed lines correspond to models trained with GBRT and the solid lines to models trained with LSTM. The y-axis shows the test RMSE and the x-axis the forecasting horizon.
5.2 Quantile regression with a 95% prediction interval for t = 1 hour, obtained from predictions on test data. For comparison, each car park is plotted on the same time sequence for both D1 and D2. The figures on the left-hand side are trained on D1 and thus include exogenous features.
5.3 Feature importance for the trained models on different datasets. All models on the top line (5.3a - 5.3c) are trained on dataset D1, whereas the bottom line (5.3d - 5.3f) are trained on D2.
5.4 MC dropout with a 95% prediction interval for t = 1 hour, obtained from predictions on test data. For comparison, each car park is plotted on the same time sequence for both D1 and D2. The figures on the left-hand side are trained on D1 and thus include exogenous features.
5.5 Histograms of 10,000 predictions from MC dropout on car park Baggen with and without input dropout, both trained on D1.
B.1 MC dropout prediction histograms with input dropout for 10,000 predictions on a single sample. All models on the top line include exogenous features, whereas the bottom line does not.
B.2 MC dropout prediction histograms for 10,000 predictions on a single sample without input dropout. All models on the top line include exogenous features, whereas the bottom line does not.


List of Tables

4.1 Parking occupancy datasets. Dataset 1 consists of historical parking data and exogenous features, including weather and date and time attributes. Conversely, dataset 2 only consists of historical parking data.
5.1 Validation MSE for 1-layer LSTM models, trained for 50 epochs on parking occupancy of Baggen, with different numbers of time lags.
5.2 Hyper-parameter optimization results for LSTM.
5.3 Hyper-parameter optimization results for GBRT.
5.4 Performance comparison of different datasets for parking occupancy forecasting using GBRT, evaluated on test data. The models trained with exogenous features on D1 achieve the best performance for all car parks with both metrics and for nearly all forecasting horizons.
5.5 Performance comparison of different datasets for parking occupancy forecasting using LSTM modelling, evaluated on test data. The models trained on D2 achieve the best performance with both metrics for almost all forecasting horizons.
5.6 Percentage of empirical coverage of several MC dropout prediction intervals, evaluated on the test dataset.
A.1 RMSE for Baggen - quantile regression
A.2 MAE for Baggen - quantile regression
A.3 RMSE for Akilles - quantile regression
A.4 MAE for Akilles - quantile regression
A.5 RMSE for Druvan - quantile regression
A.6 MAE for Druvan - quantile regression
B.1 RMSE for Baggen - LSTM without Input Dropout
B.2 MAE for Baggen - LSTM without Input Dropout
B.3 RMSE for Akilles - LSTM without Input Dropout
B.4 MAE for Akilles - LSTM without Input Dropout
B.5 RMSE for Druvan - LSTM without Input Dropout

1 Introduction

Traffic congestion is a severe problem in urban areas. It leads to the emission of greenhouse gases and air pollution [1]. Air pollution has a negative effect on air quality in urban areas and causes many health issues, such as lung cancer and heart disease. It also increases the risk of children developing asthma and makes the symptoms of asthma more severe among people who already have the disease [2]. Moreover, traffic congestion is expensive. In the USA, it is estimated that traffic congestion annually costs about US$124 billion [3].

In 2018, 55% of the world's population lived in urban areas. According to a report published by the United Nations, this figure is expected to rise to 68% by 2050 [4]. Moreover, as per capita income grows, car ownership increases as well. Therefore, the problem of traffic congestion is expected to persist [5].

In general, drivers lack knowledge of the location and availability of free parking spaces in cities, which leads to people driving around searching for parking places. About one-third of traffic congestion in cities is due to drivers searching for an available parking lot [6].

Thanks to modern technologies that have revolutionized the ways of recording transportation data, systems can use real-time data to estimate parking availability ahead. One key enabler for these types of systems is the development of the Internet of Things (IoT) and the concept of smart cities. The concept of "smart" cities can be thought of as a large organic system connecting many subsystems and components [7]. More specifically, a city is categorized as "smart" when investments in traditional transport and information and communication technology (ICT) infrastructure, as well as human and social capital, support sustainable economic growth and a high quality of life, while preserving natural resources wisely through participatory governance [8]. A novel type of "smart" city infrastructure, applicable to the transportation sector, is the IoT. The basic idea of this concept is that a variety of devices and objects – such as sensors, mobile phones, RFID tags, etc. – are able to interact with each other to reach common goals [9]. By continuously collecting transportation information, the IoT provides increasingly smart and reliable services [10].

An application of smart transportation systems especially interesting for this thesis is the implementation of smart parking systems. A study from 2012 deployed a parking allocation system in a garage at Boston University, where cars were able to request and reserve a parking lot in advance [11]. A project that has been launched to a larger audience is a smart parking system in Nice, where ten thousand sensors have been installed in several parking areas.


A smartphone application was developed so that passengers can receive parking availability information [12].

This thesis will use the concept of smart city within the field of transportation to develop a smart parking service. By utilizing machine learning techniques, this thesis aims to predict parking occupancy in the near future to give drivers more information about the coming traffic situation.

A critical issue with many machine learning systems is handling uncertainty, that is, understanding what the model does not know [13]. In some areas, the lack of this knowledge may lead to severe outcomes, as in the case of self-driving cars [14]. Existing approaches to model uncertainty include particle filtering and conditional random fields. However, many modern applications use deep learning methods as the state of the art, and these are often not able to represent uncertainty [13]. In the case of predicting parking occupancy, an incorrect prediction may mislead drivers to believe a parking place is available when it is actually occupied. Therefore, uncertainty is an interesting aspect to consider in this thesis.

1.1 Motivation

1.1.1 Dukaten

Dukaten aims to promote a good parking and urban environment by offering well-located parking facilities in the city of Linköping. Dukaten's activities include the operation of parking facilities, monitoring, and the handling of control fees. Further, Dukaten works to offer its customers information about available parking places to enhance the parking experience and reduce the time cars spend driving around searching for available parking lots. Dukaten is a public company and is involved in projects together with the municipality of Linköping and public transport, among others. Consequently, Dukaten can create social benefits in a broad sense.

In cooperation with Knowit, Dukaten has developed a mobile application called LinPark, which offers a simpler payment solution than ticketing machines. The application also shows real-time parking space availability. Beyond providing real-time data, Dukaten also wants to offer its customers predictions of parking space availability ahead of time, not only to increase customer value but also to gain knowledge about customer behavior for future projects. This thesis will investigate the possibility of making such predictions using statistical models and machine learning. Historical parking occupancy data from the city of Linköping, provided by Dukaten, will be used to develop the models.

1.1.2 General Application

As already mentioned, the problem of traffic congestion in cities is partly caused by drivers searching for an available parking place; it is not isolated to a few places, but is a severe problem in many parts of the world, with negative effects on people's quality of life and the environment. Even though the models in this paper are developed to predict parking space availability in the city of Linköping, the findings will be informative when developing similar models in other cities. Moreover, most previous studies predicting parking availability have been carried out in major urban areas, such as San Francisco and Beijing [15]–[18]. In this paper, parking availability will be predicted in the relatively small city of Linköping, Sweden, which had a population of about 161,000 in 2018 [19]. It will be interesting to see how a model to predict parking availability performs in a smaller city.

1.2 Aim

This paper aims to construct and evaluate two models for predicting parking space occupancy ahead of time in the city of Linköping. The models that will be evaluated are long short-term memory (LSTM) and gradient boosting regression trees (GBRT).


It also aims to investigate which exogenous features improve the chosen models. Furthermore, this thesis aims to investigate how uncertainty can be estimated for GBRT and LSTM in order to get information about what the models do not know.

1.3 Research Questions

1. How can parking space occupancy be predicted using LSTM and GBRT modeling approaches, based on historical parking data?

2. What performance can be achieved by the chosen models?

3. Which exogenous features improve the performances?

4. How can uncertainty be modeled for GBRT and LSTM?

1.4 Delimitations

This study is limited by the following delimitations:

• The models will only be constructed and evaluated for parking houses.

• No consideration will be taken of special types of parking spaces, such as lots designated for electric cars or people with disabilities.

• Exogenous features will be limited to weather and date and time attributes.

• Transaction cost will not be taken into account.

1.5 Ethical Considerations

Some of the data used for developing and evaluating models in this project consists of real-world historical parking transactions. All the transaction data was anonymized by Dukaten before being handed over to the authors of this thesis, meaning that no personal information has been accessed or used during the project.

2 Theory

In this chapter, the theory needed to conduct the study is presented. As can be seen in Chapter 3, many models have been evaluated in previous studies to predict parking occupancy. Naturally, the parking situation differs between locations, and some models might excel in some locations while doing worse in others. However, it is assumed that the parking situation in different cities has many similarities, and therefore the previous studies within the research field are a good basis for model selection.

The chapter is divided into three main focus areas. The first part focuses on neural networks (NNs) in general, and recurrent neural networks (RNNs) in particular. More specifically, the theory regarding LSTM will be described. The second part focuses on tree-based methods for regression problems. Lastly, theory on how to capture model uncertainty is described.

2.1 Neural Networks

Linear models are limited by only being able to model linear functions. For example, a linear model f(x) is only able to model linear functions of x. By applying a non-linear transformation φ to x, and then applying the linear model to the transformed input f(φ(x)), it is possible to model non-linear functions of x [20]. This is motivated by Cover's theorem, which states that data that is not linearly separable can, with high probability, be linearly separated after being mapped to a higher dimension [21].

NNs are able to model non-linear relations by first applying a linear transformation and then passing the output through a non-linear function. This is described by Equation 2.1, where φ is a non-linear function:

$$f(x; W) = \phi(Wx) \tag{2.1}$$

NNs are often composed of many different functions connected in a chain, hence the word network. An example of this would be three functions $f^{(1)}$, $f^{(2)}$ and $f^{(3)}$ connected in the following way [20]:

$$f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$$


In this example, $f^{(1)}$ is called the first layer, $f^{(2)}$ the second layer, and $f^{(3)}$ the output layer, in which the final predictions are made [20]. For regression, the output layer usually does not have a non-linear activation function.

The weights in an NN are commonly trained using the back-propagation algorithm, which was first introduced by Rumelhart et al. [22] in 1986. During training, the error is calculated using a loss function L(W), which compares the predicted output with the correct output. Gradient descent is then used to update the weights. This is done by calculating the gradient of the loss function, ∇L(W), and then updating the weights based on the gradient. The objective is to minimize the loss function, and since the gradient points in the direction of steepest ascent, the weights are updated in the opposite direction of the gradient, as described by Equation 2.2, where η is the learning rate [23]:

$$W_{\text{updated}} = W - \eta \nabla L(W) \tag{2.2}$$
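As a minimal sketch of this update rule (illustrative only; the toy data, layer size and learning rate are arbitrary choices, not from the thesis), a single gradient-descent step for a one-layer network with a tanh non-linearity can be written in NumPy as:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))               # 32 samples, 3 features
y = np.sin(X.sum(axis=1, keepdims=True))   # toy non-linear target

W = rng.normal(scale=0.1, size=(3, 1))     # weights of f(x; W) = phi(Wx)
eta = 0.01                                 # learning rate

# Forward pass: linear transformation followed by a non-linearity (Eq. 2.1).
h = np.tanh(X @ W)

# Squared-error loss L(W); its gradient w.r.t. W follows from the chain rule.
err = h - y
grad = X.T @ (err * (1 - h ** 2)) / len(X)

# Gradient-descent update (Eq. 2.2): step against the gradient.
W = W - eta * grad
```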

2.2 Recurrent Neural Networks

An RNN is a type of NN. The basic idea behind RNNs is that the prediction at one time step influences predictions at future time steps. This makes RNNs useful for making predictions where the output depends on previous outputs, as in time series. The "memory" of an RNN is modeled using state nodes; the state is usually denoted $h^{(t)}$. The state at time step t depends on the current input and the state from the previous time step $h^{(t-1)}$. The state is calculated using the following equation [23]:

$$h_t = f(W_x x_t + W_h h_{t-1} + b_h)$$

where $f(\cdot)$ is the activation function, usually implemented by a sigmoid or a hyperbolic tangent (tanh) function [24]. $W_x$ is the weight matrix for the current input $x_t$, and $W_h$ is the weight matrix for the previous state $h_{t-1}$. In order for the node to learn an offset, a bias parameter $b_h$ is used. The values of the weight matrices and the bias vector are tuned during training through back-propagation through time (BPTT), which is described in Section 2.2.1. The predicted output for each time step is calculated using the following equation [23]:

$$y_t = g(W_y h_t + b_y) \tag{2.3}$$

where $g(\cdot)$ is a transformation function, usually linear [24]. As can be seen in Equation 2.3, the predicted output $y_t$ depends on the current state $h_t$. Moreover, just as when calculating the state $h_t$, a weight matrix $W_y$ and a bias vector $b_y$ are used to compute the prediction. One way of conceptualizing an RNN is as a deep network where each layer predicts an output for a specific time step, and where the weights are shared among all the layers [23].
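As a minimal sketch (illustrative; the dimensions and random initialization are arbitrary, not from the thesis), one vanilla RNN time step can be implemented in NumPy as follows:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, bh, Wy, by):
    """One RNN time step: returns the new state h_t and prediction y_t."""
    # State update: h_t = f(Wx x_t + Wh h_{t-1} + b_h), with f = tanh.
    h_t = np.tanh(Wx @ x_t + Wh @ h_prev + bh)
    # Output: y_t = g(Wy h_t + b_y), with g linear as in Eq. 2.3.
    y_t = Wy @ h_t + by
    return h_t, y_t

# Toy dimensions: 4 input features, 8 state units, scalar output.
rng = np.random.default_rng(1)
Wx, Wh = rng.normal(size=(8, 4)), rng.normal(size=(8, 8))
Wy = rng.normal(size=(1, 8))
bh, by = np.zeros(8), np.zeros(1)

h = np.zeros(8)
for x_t in rng.normal(size=(5, 4)):   # a sequence of 5 time steps
    h, y = rnn_step(x_t, h, Wx, Wh, bh, Wy, by)
```

Note how the same weight matrices are reused at every time step, which is exactly the parameter sharing described above.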

2.2.1 Backpropagation Through Time

BPTT is a gradient-based technique that propagates the gradient information calculated on a loss function back to the model parameters. First, BPTT unfolds the neural network in time, meaning that the hidden layer structure is replicated for each time interval, to obtain an ordinary feed-forward NN. The main difference between a standard NN and an unfolded RNN is that, for the latter, the parameters are the same in all replicas of the layers [24].

If the true value at time t is $y_t$, an estimate of the model parameters can be obtained by minimizing a loss function, such as least squares or cross-entropy. To backpropagate the RNN over the whole sequence, one must take the derivatives of the activation functions. Using the chain rule to compute the derivatives, the intermediate term $\frac{\partial L}{\partial h_t}$, which computes the gradient of the error with respect to the states, appears [25]:

$$\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T}\frac{\partial h_T}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k}. \tag{2.4}$$


Figure 2.1: LSTM memory cell as described by Gers et al. [27].

For a vanilla RNN, the product of Jacobians $\prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k}$ in Equation 2.4 tends to either vanish or explode when the product spans many time steps. The former is more common, since the activation function's derivative $f'(\cdot)$ tends to be less than one, due to the properties of both the sigmoid function and the hyperbolic tangent. When many such factors are multiplied, the product of Jacobians converges towards zero [25].

2.2.2 Long Short-Term Memory

LSTM was first introduced by Hochreiter and Schmidhuber in 1997, with the purpose of overcoming the problem of exploding and vanishing gradients. The difference between LSTM and a standard RNN is that each ordinary node is replaced by a memory cell, which can add or remove information from the current cell state, regulated by a gate structure [23]. In Hochreiter and Schmidhuber's model, each memory cell contains a self-connected recurrent edge with a weight fixed to one, which ensures that the gradient can backpropagate many steps without vanishing or exploding [26]. However, since the cell state tends to grow linearly over long time series and becomes expensive in terms of memory, the model was extended in 1999 by Gers et al. [27], who introduced the "forget gate". The forget gate can reset memory blocks once their contents are considered useless. Here, "reset" does not only mean an immediate reset to zero, but also a gradual reset as the importance of previous information fades [27].

Forward Pass Propagation in LSTM

Figure 2.1 depicts an LSTM memory cell with all its elements. The LSTM cell features three gates - input, forget, and output - which control the information flow passed through the unit. These gates act on received signals by blocking or passing information based on importance, which is filtered with a set of weights [23]. The heart of the memory cell is the cell state $C_t$, which flows through the network carrying long-term information. Typical networks have a number of memory cells connected to each other, forming a network [28]. The compact forms of the equations for the forward pass of an LSTM cell are [27]:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{2.5}$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{2.6}$$

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2.7}$$


$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{2.8}$$

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{2.9}$$

$$h_t = o_t \odot \tanh(C_t) \tag{2.10}$$

where σ denotes the logistic sigmoid function and ⊙ element-wise multiplication. σ and tanh are used to squash values to the ranges $[0, 1]$ and $[-1, 1]$, respectively. Each gate takes $x_t$ and $h_{t-1}$ as inputs, where $x_t$ denotes the input vector at time t and $h_{t-1}$ is the output vector from the previous time step [23].

First, the forget gate $f_t$ is calculated in Equation 2.5, which determines what information should be kept from the previous cell state $C_{t-1}$. Second, in Equation 2.6, a vector of candidate values $\tilde{C}_t$ that could be added to the cell state $C_t$ is computed through a tanh function. Third, the cell state is updated by element-wise multiplying the old state $C_{t-1}$ with the forget vector $f_t$, and adding the new candidate values scaled by how important each state value is considered to be, $i_t \odot \tilde{C}_t$, as seen in Equation 2.8. Lastly, the cell output $h_t$ is calculated by passing $C_t$ through a tanh function and then element-wise multiplying with $o_t$, which determines what information from $C_t$ should be passed on to $h_t$, as represented by Equations 2.9 and 2.10 [24].
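To make the gate structure concrete, a minimal NumPy sketch of Equations 2.5-2.10 for a single LSTM cell is given below (illustrative only, not the implementation used in this thesis; dimensions and initialization are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM forward pass, following Equations 2.5-2.10.

    W and b hold the weight matrices/biases for the forget (f),
    input (i), candidate (C) and output (o) parts of the cell.
    """
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate (2.5)
    C_tilde = np.tanh(W["C"] @ z + b["C"])     # candidate values (2.6)
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate (2.7)
    C_t = f_t * C_prev + i_t * C_tilde         # new cell state (2.8)
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate (2.9)
    h_t = o_t * np.tanh(C_t)                   # cell output (2.10)
    return h_t, C_t

# Toy dimensions: 3 input features and 5 hidden units.
rng = np.random.default_rng(2)
n_in, n_hid = 3, 5
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "fiCo"}
b = {k: np.zeros(n_hid) for k in "fiCo"}

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):  # a short input sequence
    h, C = lstm_step(x_t, h, C, W, b)
```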

Backpropagation in LSTM

Consider Equation 2.4, used for calculating the gradient of the error function of a regular RNN. With LSTM notation, the corresponding gradient is computed as follows:

$$\frac{\partial L_k}{\partial W} = \frac{\partial L_k}{\partial h_k}\frac{\partial h_k}{\partial C_k}\cdots\frac{\partial C_2}{\partial C_1}\frac{\partial C_1}{\partial W} = \frac{\partial L_k}{\partial h_k}\frac{\partial h_k}{\partial C_k}\left(\prod_{t=2}^{k}\frac{\partial C_t}{\partial C_{t-1}}\right)\frac{\partial C_1}{\partial W} \tag{2.11}$$

The main difference between vanilla RNNs and LSTM lies in computing the factor $\prod_{t=2}^{k}\frac{\partial C_t}{\partial C_{t-1}}$. In LSTM, $C_t$ is defined by Equation 2.8, which consists of the elements $C_{t-1}$, $f_t$, $\tilde{C}_t$ and $i_t$.

Hochreiter and Schmidhuber [26] refer to this equation as the constant error carousel (CEC), because the local error back flow remains constant and thus neither vanishes nor explodes. The CEC is regulated by the input, output and forget gates, and its derivative is computed as follows:

$$\frac{\partial C_t}{\partial C_{t-1}} = \frac{\partial f_t}{\partial C_{t-1}} \cdot C_{t-1} + \frac{\partial C_{t-1}}{\partial C_{t-1}} \cdot f_t + \frac{\partial i_t}{\partial C_{t-1}} \cdot \tilde{C}_t + \frac{\partial \tilde{C}_t}{\partial C_{t-1}} \cdot i_t. \tag{2.12}$$

The presence of the forget gate's activation function allows the LSTM cell to control when the information at a certain time step should be forgotten, and to update the model's parameters accordingly. After the gradients are derived, the model can be learned using gradient-based methods [28].

2.3 Gradient Tree Boosting

Decision trees are intuitive methods for constructing prediction models, obtained by recursively partitioning the data and fitting a prediction model within each partition [29]. A tree is typically depicted as in Figure 2.2a, with a root node at the top, branching out downwards. An observation passes down the tree through a series of nodes (interior nodes in Figure 2.2a), at each of which a decision is made about which branch to proceed with, until a leaf node (response) has been reached [30].

A key property of tree-based models is that they are interpretable by humans, since they correspond to a sequence of binary decisions applied to the input variables. However, it has been found that the learning procedure of a tree structure is very sensitive to the details of the data set [31].

Figure 2.2: A schematic illustration of a regression tree (left) and an ensemble of trees (right).

Extensions of the basic decision tree structure allow learning target functions with numerical outputs, and hence solving regression problems. In regression trees, a split is evaluated by the mean square error (MSE) from the estimated value [32].

Gradient tree boosting is an ensemble method that refers to the idea of adding new models to the ensemble sequentially, as depicted in Figure 2.2b. Ensemble learning typically refers to methods that generate several models that are combined to make predictions, in either regression or classification problems. In the case of regression, the output of an ensemble can be the average of every model, or a weighted average where some models are allowed to contribute more to the final model [31]. Rather than finding a single hypothesis to best explain the data, ensemble learning algorithms construct a set of hypotheses [33].

In tree boosting, each tree generates an output from a copy of the feature vector. Next, each output is multiplied by a weight that is assigned to each tree. The weight can be interpreted as an importance measure: the larger the weight, the greater the impact on the end result [31]. At each iteration, a new weak learner is trained with respect to the pseudo-residuals of the whole ensemble learnt so far [34]. The predictive model is defined by

$$F(x; \{\beta_m, a_m\}_1^M) = \sum_{m=1}^{M} \beta_m h(x; a_m)$$

where h is a weak learner, usually a parameterized function of the input variables x, characterized by parameters a. Let $\{\beta_m, a_m\} = P$, where $P = \{P_1, P_2, \ldots\}$ is a finite set of parameters. Then the optimal function can be obtained by optimizing the parameters [35]:

$$P^* = \arg\min_P \; E_{y,x}\left[L(y, F(x; P))\right]$$

and then

$$F^*(x) = F(x; P^*).$$

Typically, there is no closed-form solution for estimating the parameters. Instead, an iterative numerical procedure needs to be considered, often using gradient descent [35].
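As a minimal sketch (illustrative; the toy data and hyper-parameter values are arbitrary, not those used in this thesis), such an ensemble can be fitted with scikit-learn's GradientBoostingRegressor:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data standing in for lagged occupancy features and a target.
rng = np.random.default_rng(3)
X = rng.uniform(size=(500, 6))
y = X[:, 0] + 0.5 * np.sin(6 * X[:, 1]) + rng.normal(scale=0.1, size=500)

# Each of the 100 trees is a weak learner h(x; a) fitted to the
# pseudo-residuals of the ensemble built so far; the learning rate
# plays the role of the weights beta_m.
model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, loss="squared_error"
)
model.fit(X[:400], y[:400])
print("test RMSE:", np.sqrt(np.mean((model.predict(X[400:]) - y[400:]) ** 2)))
```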

2.4 Uncertainty

The aim of machine learning is essentially to construct models based on data in order to make predictions. As such, it is closely connected with uncertainty. Traditionally, uncertainty has been modeled using a probabilistic approach, where probability theory has given useful tools to the machine learning field. There are two different sources of uncertainty, namely aleatoric and epistemic uncertainty [36]. The former refers to the natural and unpredictable variation in the performance of a system, which is due to inherently random effects.


More knowledge about the system is not expected to reduce the aleatoric uncertainty, which is why it is sometimes referred to as irreducible. On the contrary, epistemic uncertainty is due to a lack of knowledge about the behaviour of the system. In principle, the epistemic uncertainty can be eliminated with sufficient study, which is why it is sometimes referred to as reducible [37].

2.4.1 Monte Carlo Dropout

One attempt to estimate the epistemic uncertainty in deep neural networks is a novel approach called Monte Carlo (MC) dropout [38]. Dropout is a common regularization technique for deep NNs, which circumvents overfitting and improves model accuracy. In dropout, the network's units are multiplied by Bernoulli random variables, forming the dropout layer. By multiplying the dropout layer with the neuron layer, some of the neurons are disconnected, which results in a different network structure in each training step. Typically, the dropout layer is no longer used after training [39].

In 2016, Gal and Ghahramani [38] discovered a connection between dropout networks and Bayesian inference. More specifically, they show that an NN with arbitrary depth and non-linearities, and dropout applied before each layer, is equivalent to an MC integration over a Gaussian process (GP) posterior approximation [38]. In a GP, the parametric model is dispensed with and, instead, prior probability distributions are defined over the functions directly [31]. By modelling the distributions over the function space with a GP, the posterior can be evaluated analytically:

$$F \mid X \sim \mathcal{N}(0, K(X, X))$$

$$Y \mid F \sim \mathcal{N}(F, \tau^{-1} I_N)$$

where $K(X, X)$ is the covariance function that defines the similarities between every pair of inputs. Assume the covariance function

$$K(x, y) = \int \mathcal{N}(w; 0, l^{-2} I_N)\, p(b)\, \sigma(w^T x + b)\, \sigma(w^T y + b)\, \mathrm{d}w\, \mathrm{d}b$$

with some prior length scale l, some prior distributions p(b) and p(w), and σ a non-linear function, such as tanh. Then, this can be approximated using Monte Carlo integration:

$$\hat{K}(x, y) = \frac{1}{K} \sum_{k=1}^{K} \sigma(w_k^T x + b_k)\, \sigma(w_k^T y + b_k) \tag{2.13}$$

with $w_k \sim \mathcal{N}(0, l^{-2} I_N)$ and $b_k \sim p(b)$. The K terms in Equation 2.13 would correspond to K hidden units in an NN.

Assume a two-layered NN with the weight matrices and bias vector $\omega = \{W_1, W_2, b\}$; then the GP predictive distribution can be re-parametrized as:

$$p(y^* \mid x^*, \omega) = \mathcal{N}\!\left(y^*;\ \sqrt{\tfrac{1}{K}}\, \sigma(x^* W_1 + b)\, W_2,\ \tau^{-1} I_N\right)$$

$$p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \omega)\, p(\omega \mid X, Y)\, \mathrm{d}\omega \tag{2.14}$$

where the posterior distribution $p(\omega \mid X, Y)$ is generally intractable. Instead, let $q_\theta(\omega)$ be an approximating variational distribution, parameterized by some parameters θ. Then, minimizing the Kullback-Leibler (KL) divergence $KL(q_\theta(\omega)\,\|\,p(\omega \mid X, Y))$


will give the approximate predictive distribution

$$q_\theta(y^* \mid x^*) = \int p(y^* \mid x^*, \omega)\, q_\theta(\omega)\, \mathrm{d}\omega$$

which at test time can be approximated by

$$q_\theta(y^* \mid x^*) \approx \frac{1}{T} \sum_{t=1}^{T} p(y^* \mid x^*, \hat{\omega}_t)$$

with $\hat{\omega}_t \sim q_\theta(\omega)$.

The variational distribution $q_\theta(\omega)$ that approximates $p(\omega \mid X, Y)$ in Equation 2.14 is defined by $q_\theta(\omega) = q_\theta(W_1)\, q_\theta(W_2)\, q_\theta(b)$. Then

$$q_\theta(W_1) = \prod_{q=1}^{Q} q_\theta(w_q), \qquad q_\theta(w_q) = p_1 \mathcal{N}(m_q, s^2 I_K) + (1 - p_1)\, \mathcal{N}(0, s^2 I_K)$$

where $m_q$ is the variational parameter, $p_1$ the dropout rate, and s the standard deviation. If s is set to a small value, this corresponds to a sample from a Bernoulli random variable [38].

Minimizing the KL divergence between the approximate posterior and the full GP posterior is the same as maximizing the log evidence lower bound with respect to θ:

$$\mathcal{L}_{VI} := \int q_\theta(\omega)\, \log p(Y \mid X, \omega)\, \mathrm{d}\omega - KL(q_\theta(\omega)\,\|\,p(\omega)).$$

Approximating the log evidence lower bound using Monte Carlo integration results in the objective

$$\mathcal{L}_{GP\text{-}MC} \propto \frac{1}{N} \sum_{n=1}^{N} \frac{-\log p(y_n \mid x_n, \hat{\omega}_n)}{\tau} + \sum_{i=1}^{L} \left( \frac{p_i l^2}{2\tau N} \|M_i\|_2^2 + \frac{l^2}{2\tau N} \|m_i\|_2^2 \right). \tag{2.15}$$

Note that Equation 2.15 consists of a loss function and a regularization term, as described in Section 2.1. Further, note that if $E(y_n, \hat{y}_n(x_n, \hat{\omega}_n)) = -\log p(y_n \mid x_n, \hat{\omega}_n)/\tau$, then Equation 2.15 would be the same as using $L_2$ regularization.

The predictive estimation is referred to as MC dropout. In practice, MC dropout is equivalent to performing the forward pass T times at test time, resulting in T samples for each target variable [38]. Since the hidden units are not deterministic, but rather stochastic due to the dropout layers, this is often referred to as a stochastic forward pass. Due to the Gaussian properties, these samples can be used to estimate characteristics of the underlying posterior distribution. The estimated variance of the distribution indicates the uncertainty of the model [40]. Estimates for the mean and predictive variance of the proposed Bayesian model are given by [41]:

$$E[y^*] \approx \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*_t(x^*) \tag{2.16}$$

$$\mathrm{Var}(y^*) \approx \tau^{-1} I_D + \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*_t(x^*)^T\, \hat{y}^*_t(x^*) - E[y^*]^T E[y^*] \tag{2.17}$$

where τ is the model precision, often obtained using grid-search methods [41]. The first part of Equation 2.17 contains the variance of the observation error, $\tau^{-1}$ (aleatoric uncertainty). The second part contains the variance due to parameter uncertainty (epistemic uncertainty), which will converge towards zero as the model uncertainty decreases [13].
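In a library such as Keras, the stochastic forward passes can be obtained by simply keeping dropout active at prediction time. The sketch below is illustrative only (the architecture, dropout rate and precision τ are arbitrary stand-ins, not the thesis configuration) and computes the estimates of Equations 2.16 and 2.17 from T passes:

```python
import numpy as np
import tensorflow as tf

# A small regression network with dropout; layer sizes are arbitrary.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(6,)),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1),
])
# ... compile and fit on training data here ...

def mc_dropout_predict(model, x, T=100, tau=1.0):
    """T stochastic forward passes with dropout kept active.

    Returns the predictive mean (Eq. 2.16) and variance (Eq. 2.17);
    tau is the model precision, assumed given (e.g. from grid search).
    """
    # training=True keeps the dropout masks stochastic at test time.
    samples = np.stack([model(x, training=True).numpy() for _ in range(T)])
    mean = samples.mean(axis=0)
    var = 1.0 / tau + samples.var(axis=0)  # aleatoric + epistemic parts
    return mean, var
```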

The dropout technique has also shown good results when applied to RNNs, such as LSTM. In this variant, each weight matrix row is randomly sampled once, and the same mask is used through all time steps for input, output, and recurrent layers [42]. This differs from the naive dropout approach to RNNs, in which new dropout masks are generated for each input sample, thus losing some of the long-term memory properties. Figure 2.3 illustrates the differences between the two approaches, where each color corresponds to a dropout mask. MC dropout can be seen as an ensemble method, since the output is an average over the outputs from several different networks that use different dropout masks [40].

Figure 2.3: Comparison of naive (left) and variational (right) dropout applied to RNNs. Horizontal arrows are recurrent connections and vertical arrows feed-forward connections. Arrows in the same color represent dropout layers with identical dropout masks.

MC dropout has been criticized for approximating only the aleatoric uncertainty rather than both the aleatoric and the epistemic uncertainty, meaning that dropout sampling gives information about the risk in y rather than the uncertainty of the learned model [43]. Further, even if training time is similar to that of other existing models, the test time is scaled by T; thus, the computational time during inference will increase linearly with T.

2.4.2 Quantile Regression

Quantile regression is a statistical method able to approximate the conditional distribution of a response variable. It was first introduced by Koenker and Bassett in 1978 as an extension of classical linear regression. The technique is similar to $L_1$ regression, where the absolute error is minimized, but instead of fitting the 0.5 quantile - the median - the loss function can be asymmetrical and fit any quantile. The loss function is defined as follows [44]:

$$L_q = \begin{cases} q\,(y - y_{\text{pred}}) & \text{if } y \geq y_{\text{pred}} \\ (1 - q)\,(y_{\text{pred}} - y) & \text{otherwise} \end{cases} \tag{2.18}$$

where q is the qth quantile, $q \in [0, 1]$. When $q > 0.5$, the loss function $L_q$ penalizes under-prediction more than over-prediction, meaning that the line will fit a greater value than if regular $L_1$ regression is used. The opposite is true when $q < 0.5$ [44].

A key property of quantile regression is that it can be used to build prediction intervals. A 95% prediction interval for the value Y is given by:

$$I(x) = [q_{0.025}(x),\ q_{0.975}(x)]$$

This property can be used for outlier detection: if a new observation is extreme with regard to the prediction interval, it is likely to be regarded as an outlier. Further, quantile regression can be applied to other problems than linear regression, including decision trees [45]. The latter are supported by several software libraries, including scikit-learn in Python.
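As a minimal sketch (illustrative, with toy data; the quantile levels follow the interval above), scikit-learn's GradientBoostingRegressor supports the quantile loss directly, and fitting one model per quantile yields the interval I(x):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data standing in for occupancy features and targets.
rng = np.random.default_rng(4)
X = rng.uniform(size=(500, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

# One model per quantile; alpha is the quantile q of the loss in Eq. 2.18.
quantiles = {q: GradientBoostingRegressor(loss="quantile", alpha=q)
             for q in (0.025, 0.5, 0.975)}
for m in quantiles.values():
    m.fit(X[:400], y[:400])

# A 95% prediction interval I(x) = [q_0.025(x), q_0.975(x)] per test point.
lo = quantiles[0.025].predict(X[400:])
hi = quantiles[0.975].predict(X[400:])
coverage = np.mean((y[400:] >= lo) & (y[400:] <= hi))
print(f"empirical coverage: {coverage:.1%}")
```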

2.4.3 Other Methods for Uncertainty Estimation

In recent years, modeling predictive uncertainty has become more and more important in the machine learning field. For example, there have been several contributions to the field of approximate Bayesian inference in deep learning. Blundell et al. [46] introduced Bayes by backprop, a variational approach which regularizes the weights of a neural network by minimizing the variational free energy, leading to model averaging [46]. However, this method comes with a prohibitive computational cost, and an alternative approach was given by Hernández-Lobato and Adams [47]. They proposed a framework called probabilistic backpropagation, which makes use of expectation propagation: probabilities are propagated forward in the network to obtain the marginal likelihood, before the gradient of the marginal likelihood with respect to the model parameters is propagated backwards [47]. Another method, similar to MC dropout, is DropConnect, in which connections are dropped instead of nodes [48].

A more generalized method for uncertainty estimation is given by the bootstrap, sometimes referred to as bagging [43]. The bootstrap samples multiple realizations of a given dataset, each consisting of N training vectors sampled at random with replacement. By fitting an estimator on each sampled dataset, an ensemble of networks is obtained, resulting in a distribution that approximates the uncertainty [49].

Even though there are other methods for estimating uncertainty in deep learning, MC dropout has become widespread due to its simple implementation [50]–[53]. Further, it is supported by popular machine learning libraries such as Keras. As for decision trees, little research on distinguishing epistemic from aleatoric uncertainty has been done, even if a few studies exist [54], [55].

3 Related Work

There exists a substantial number of research papers investigating how to predict parking occupancy using statistics and machine learning. Various types of models with different types of feature sets have been evaluated throughout the years and are summarized in this chapter. Several factors might influence parking occupancy, such as weather, holidays, day of the week, time, special events, etc. Also, cities are constantly changing: new car parks are built and closed, roads change, parking demand varies, and so on. This means that a model for predicting parking occupancy risks becoming outdated quickly [1]. The remainder of the chapter summarizes the research field and how the above-mentioned challenges have been acknowledged in previous studies.

3.1 Machine Learning Models

An intuitive model representation of parking occupancy is regression trees. Alajali et al. [56] showed that GBRT performed well at predicting parking occupancy in Melbourne. In their study, they derived three different feature sets, using different feature combinations from on-street parking data gathered from the city of Melbourne. The feature sets were used to train and evaluate GBRT, support vector regression (SVR) and regression trees for predicting parking occupancy. GBRT performed the best, with the lowest mean square error (MSE) and mean absolute error (MAE) for all the feature sets. Further, it was shown that exogenous features gathered from multiple sources can improve the predictions [56].

Chen [15] developed and evaluated four different models to predict parking occupancy in San Francisco. The models were autoregressive integrated moving average (ARIMA), linear regression (OLS), SVR and NN. The NN outperformed the other models with the lowest mean absolute percentage error (MAPE), 3.57%. All the other models had a MAPE in the range of 7-9%. The reason why the NN outperformed the other models was not discussed.

Shao et al. [1] argue that LSTM is a good model for predicting parking occupancy since it leverages temporal dependency. The data was clustered regionally using k-means and then used as input, with one LSTM model for each region. The LSTM method was compared with a multilayer perceptron (MLP) model, and results showed that LSTM performed better than MLP at predicting parking occupancy 1 minute, 5 minutes and 30 minutes ahead.


Shuguan et al. [57] used a graph convolutional neural network (GCNN) and graph spectral theory to extract spatial information from large-scale road networks. LSTM was then used to capture the temporal features of the data. In the last step of the model, a multi-layer decoder was used, making it possible to easily add different types of data sources and variables. The model was evaluated by comparing it with baseline models, such as 2-layer LSTM, 3-layer LSTM and LASSO. The proposed model (GCNN + LSTM) performed the best. Worth mentioning is that LASSO performed better than the LSTM models. However, the authors argue that this is likely due to the dataset being relatively small in relation to its number of dimensions; more training data would probably have been needed for the LSTM models to perform at their full potential.

Another approach proven to perform well is the use of Markov chains. Tilahun and Di Marzo Serugendo [6] propose a cooperative dynamic model between multiple agents for parking space availability. In their model, an agent in each parking place uses a Markov chain to predict the availability, communicating with other agents to produce a model for the whole region. Similarly, Caliskan et al. apply a continuous-time Markov chain model through an ad hoc network for predicting parking availability, taking into account the time needed to arrive at a certain parking place [58]. However, the authors note that a significant problem with the proposed model is the difficulty of computing the transition matrix. A solution to this problem was suggested by Klappenecker et al., who adopted the model proposed by Caliskan et al. and showed that the use of factorization improved it significantly [59].

3.2 Temporal and Spatial Correlations

Two issues to acknowledge when predicting parking occupancy are temporal and spatial correlations between different parking places. A common approach to this problem is to use clustering algorithms to identify spatial and temporal similarities between parking places. Zheng et al. [60] emphasize the importance of clustering, not only to detect normal temporal behavior, but also to identify anomalous behavior.

To identify spatial correlations, Chen [15] applied the k-means algorithm before training different prediction models in order to capture spatial differences. Likewise, Shao et al. [1] clustered the data set using k-means in the same manner as Chen, since they discovered that the occupancy rate differed between regions.

Rajabioun and Ioannou [18] analyzed parking data from San Francisco and found that parking trends differ during different seasons and different days of the week. They also found spatial correlations between neighboring parking locations and that there are temporal correlations for each parking location.

Zhang et al. [16] developed a model, called SHARE, with the ability to capture both spatial and temporal features. They argue that spatial parking data is important, since parking places close to each other are likely to follow similar patterns. They give the example of a concert which attracts a lot of people: the parking availability close to the concert is likely to be low, with the influence caused by the concert fading the further one goes from the venue. In order to capture the spatial correlations, graph theory and convolution were used. They also argue that distant parking availabilities might be correlated if different parking locations fall into the same category. For example, parking availability is likely to be low in certain locations during office hours, while residential areas are likely to have higher parking availability at the same time. A soft assignment matrix was used for clustering, so that a certain parking location can belong to several clusters but with different probabilities. The temporal features were modeled using gated recurrent units. SHARE was evaluated using two real-world data sets, one from Shenzhen and one from Beijing, and performed better than all models it was compared with [17].


3.3 Feature Selection

In addition to using historical parking data as the basis for the analysis, numerous studies investigate the impact of exogenous variables. Previous research has included various exogenous features in the proposed prediction models, with more or less model improvement. The most common features are the type of day (such as Monday or Saturday) and time of day, which are able to catch fluctuating parking demand within a day or a week [6], [56], [61], [62]. Fabusuyi et al. [61] suggest a model that considers special events, such as theatre performances and sports games, split into morning, day or evening events. Similarly, Alajali et al. [56] incorporate pedestrian volume and car traffic volume to detect all types of special situations in the surroundings of a parking place. This type of information can be categorized into foreseen and unforeseen circumstances, both of which can affect parking demand [6]. Further, the weather has been suggested to have an effect on non-recurrent parking demand [57]. Yang et al. collected hourly weather data and used linear interpolation at 10-minute intervals to approximate the weather between the collected data points. They concluded that incorporating weather information can improve the performance of the prediction model significantly [57].

3.4 Model Uncertainty

None of the above-mentioned studies has estimated the uncertainty of its parking prediction models. In recent years, modeling predictive uncertainty has become more and more important in the machine learning field. For example, there have been several contributions to the field of approximate Bayesian inference in deep learning. Blundell et al. [46] introduced Bayes by backprop, a variational approach which regularizes the weights of a neural network by minimizing the variational free energy, leading to model averaging [46]. However, this method comes with a prohibitive computational cost, and an alternative approach was given by Hernández-Lobato and Adams [47]. They proposed a framework called probabilistic backpropagation, which makes use of expectation propagation: probabilities are propagated forward in the network to obtain the marginal likelihood, before the gradient of the marginal likelihood with respect to the model parameters is propagated backwards [47]. Another method, similar to MC dropout, is DropConnect, in which connections are dropped instead of nodes [48].

A more generalized method for uncertainty estimation is given by the bootstrap, sometimes referred to as bagging [43]. The bootstrap samples multiple realizations of a given dataset, each consisting of N training vectors sampled at random with replacement. By fitting an estimator on each sampled dataset, an ensemble of networks is obtained, resulting in a distribution that approximates the uncertainty [49].

Even though there are other methods for estimating uncertainty in deep learning, MC dropout has become widespread due to its simple implementation [50]–[53]. An example of an LSTM network successfully implemented together with MC dropout is given by Uber [63]. To get uncertainty estimates for rider demand to enhance resource allocation, anomaly detection, and budgeting, they implemented MC dropout for time-series forecasting. Not only did they get good uncertainty estimates, but their model also outperformed both vanilla LSTM and quantile random forest in terms of evaluation metrics. MC dropout is also supported by popular machine learning libraries such as Keras.

The medical sector is a domain where uncertainty estimates are of great importance. In a recent study, an ensemble method with different types of NNs as base learners was proposed. Instead of averaging over the ensemble, each base learner was weighted according to its respective predictive confidence obtained by MC dropout [50].

As for decision trees, little research on distinguishing epistemic from aleatoric uncertainty has been done, even if a few studies exist [54], [55].

4 Method

The following chapter describes the methodology for constructing a prediction model for parking occupancy. First, a pre-study is conducted to get a better understanding of the data, which then forms the basis of the succeeding implementation phase. Lastly, evaluation metrics are presented.

4.1 Pre-study

A pre-study is carried out in order to make an initial analysis of the data. Several experiments are conducted in order to analyze both spatial and temporal correlations between different parking facilities. Previous research has employed clustering algorithms before training the prediction model to be able to catch regional dependencies. Intuitively, this approach seems relevant when building a parking prediction model for a large city. However, the dataset in this study consists of only three parking houses, and the question is whether a clustering step would enhance the performance of the models substantially. The experiments described below form the basis of the final implementations of the models.

The historical parking data used in this study is extracted from Dukaten's transaction records. Each transaction entry comes with a starting time and an ending time as well as the name of the parking place. Since a delimitation of this study is to predict parking occupancy only for parking houses, these are extracted from the main data set: Baggen, Druvan, and Akilles. Rather than having a continuous time attribute, the data is transformed into a time series consisting of discrete 5-minute time slots, similar to [62]. Further, the occupancy rate is calculated based on the occupancy rate at t = 0, increasing the rate counter for each car entering the facility, and decreasing it for every car leaving the facility.
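A minimal pandas sketch of this transformation is given below; it is illustrative only, with a hypothetical transaction schema (columns car_park, start and end) standing in for Dukaten's actual records:

```python
import pandas as pd

# Hypothetical schema: one row per transaction with start/end timestamps.
tx = pd.DataFrame({
    "car_park": ["Baggen", "Baggen"],
    "start": pd.to_datetime(["2019-01-01 08:02", "2019-01-01 08:11"]),
    "end": pd.to_datetime(["2019-01-01 09:47", "2019-01-01 08:55"]),
})

slots = pd.date_range("2019-01-01 08:00", "2019-01-01 10:00", freq="5min")

# +1 when a car enters a 5-minute slot, -1 when it leaves; the cumulative
# sum over slots gives the occupancy relative to the count at t = 0.
events = pd.concat([
    pd.Series(1, index=tx["start"].dt.floor("5min")),
    pd.Series(-1, index=tx["end"].dt.ceil("5min")),
])
occupancy = events.groupby(level=0).sum().reindex(slots, fill_value=0).cumsum()
```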

Weather data is gathered from SMHI's open data. SMHI provides a range of weather attributes collected hourly for a specific weather station. For this study, weather observations for Linköping are obtained and then linearly interpolated to a 5-minute temporal resolution. In the input space, the weather information for a time interval is a vector of floating-point values representing wind speed, air temperature, and precipitation.
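The interpolation step can be sketched in pandas as follows (illustrative; the column names and observation values are hypothetical stand-ins for the SMHI data):

```python
import pandas as pd

# Hypothetical hourly SMHI observations for Linköping.
hourly = pd.DataFrame(
    {"wind_speed": [2.0, 3.5], "air_temp": [1.0, 0.5], "precip": [0.0, 0.4]},
    index=pd.to_datetime(["2019-01-01 08:00", "2019-01-01 09:00"]),
)

# Upsample to the 5-minute grid of the parking data and interpolate
# linearly between the hourly observations.
weather_5min = hourly.resample("5min").asfreq().interpolate(method="linear")
```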

The data described above is analyzed in terms of temporal and spatial dependencies as well as the impact of weather.


4.1.1 Temporal Dependencies

Figure 4.1: Temporal dependencies. (a) Box plot of the occupancy rate for parking house Baggen for weekdays, weekends and holidays for each quarter of the year. (b) Temporal dependencies. In 4.1b it can be seen that the car parks have different occupancy peaks, even though the pattern is similar. In 4.1a the differences in occupancy rate between different types of days can be seen.

Intuitively, parking demand varies between days. For example, the demand is assumed to be lower on a Sunday compared to a Monday, but lower on Christmas Eve compared to an average Sunday. Figure 4.1a shows a box plot of the occupancy rate for parking house Baggen, divided into holidays, weekdays and weekends, split by quarter. As can be seen, there is a dependency between the occupancy rate and the type of day, which implies that both weekends and holidays affect parking demand. Further, there is a difference between the quarters, where Q3 (July, August and September) has the lowest occupancy rate. The missing box plot for holidays in Q3 is because there are no holidays during these months. It is assumed that the relatively low parking occupancy rate during Q3 is a consequence of many people being on vacation during the summertime in Sweden, which reduces the parking demand in the city. Hence, the data also has a sequential pattern over time, where parking occupancy is lower during the summer months.

4.1.2 Spatial Dependencies

Figure 4.1b shows a plot of the average occupancy rate for each day of the week, for each car park. As can be seen, the three car parks follow similar patterns. Usually, the rate goes up during the day and down during the night, which is expected behavior since the car parks are located in the central part of Linköping, where there are many workplaces. However, the occupancy rate peaks at different levels for each parking house, meaning that the parking demand differs. Moreover, as can be seen in Figure 4.2, the periodical pattern for Akilles is not as evident as for the other car parks. Its dataset is noisier compared to Baggen and Druvan, which shows that the parking demand behaves differently between the parking houses. In addition, the parking demand also differs between the car parks on weekends. It is therefore assumed that separate models for each parking facility will result in more accurate predictions, rather than constructing one general model for all parking houses.

4.1.3 Weather

Little research has investigated the impact of weather data when predicting parking occupancy. However, Shuguan et al. [57] argue that weather improved their prediction model and is relevant to include. In this thesis, the importance of different weather attributes, namely precipitation, wind velocity, and temperature, is evaluated to see whether they can improve the prediction model when implemented in a small city.

Figure 4.2: An overview of the complete dataset. The gray lines show the weekly occupancy rate and the black line represents an average week. Akilles is noisier than the other datasets.

Figure 4.3: Box plot of the occupancy rate in parking house Baggen in good weather vs. bad weather at different hours of the day.

Figure 4.3 shows the difference in the average parking occupancy rate between bad and good weather from 8:00 to 17:00. Bad weather is defined as precipitation of more than 5 millimeters per hour and/or a wind speed of 8 meters per second or higher. There is no distinct difference in parking demand between good and bad weather: for some hours of the day, the occupancy rate is higher when the weather is good, and for other hours it is lower. Consequently, weather does not seem to affect parking demand in the parking houses of Linköping and therefore does not seem to improve the quality of the predictions. However, this conclusion depends on the definitions of bad and good weather, which were chosen arbitrarily, and important information might be missing from Figure 4.3. Therefore, weather data is investigated further.


Figure 4.4: A plot of the PACF for parking house Baggen.

4.1.4 Time Lags

Intuitively, the parking occupancy at the time of making a prediction should influence that prediction: if the parking occupancy is high at time step t0, it is also likely to be high 5 minutes later at time step t1. Taking several time lags into account is also likely to be useful, especially since several time lags combined contain information about the current trend. However, it is not obvious how many time lags contribute to the prediction quality of the models, which is investigated in this section.

The number of time lags is analyzed by plotting the partial autocorrelation function (PACF), which describes the partial correlation of the time series with its own lagged values, controlling for the values of the time series at all shorter lags. The PACF in Figure 4.4 goes to zero for time lags > 10, indicating that the first ten lags are the most relevant for predictions. The figure includes only 30 time lags, but the pattern remains for a larger number of lags.
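The PACF can, for example, be computed and plotted with statsmodels (a sketch assuming the 5-minute occupancy series from above):

```python
# Minimal sketch: partial autocorrelation of the occupancy series.
# Assumes `occupancy` is the 5-minute occupancy rate series for Baggen.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(occupancy, lags=30)  # each lag corresponds to 5 minutes
plt.show()
```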

To determine the number of time lags, several LSTM models with fixed hyper-parameters are trained with 2, 4, 6, 8, 10, 12, and 14 time lags as input, respectively. The models are trained on Baggen, using only the parking occupancy data from the last prequential data set, and validated on the corresponding validation data. Further, the models have one hidden layer with 50 hidden nodes and are trained for 50 epochs each before being validated. To avoid adding too much noise to the models, and to decrease training time when tuning other hyper-parameters, the smallest number of time lags that still produces a low validation MSE is preferred.

4.2 Implementation

The implementation method described in this section is based on the following conclusions from the pre-study. First, the models will be trained on each car park separately, since the parking houses exhibit different parking occupancy patterns. Second, temporal features are assumed to affect parking demand, and thus date and time attributes are added to the exogenous feature set. Third, the impact of weather is considered interesting for further investigation and is also added to the exogenous feature set. The proposed method can be seen in Figure 4.5. Note that the number of time lags tuned in the pre-study is used for all methods due to the limited time frame.

4.2.1 Frameworks and Environment

The LSTM model is implemented in Keras version 2.3.1, running on top of TensorFlow version 1.14.0. Keras is a neural network API, written in Python, which supports both convolutional networks and recurrent networks, as well as combinations of the two [64]. The GBRT model is implemented using scikit-learn version 0.22.1, a free software library for Python.


Figure 4.5: The proposed methodology.

All the experiments of this thesis are conducted on machines with the following specifications:

• CPU: Intel® Xeon® CPU E5-1607 0 @ 3.00 GHz × 4
• RAM: 7.7 GiB

4.2.2 Dataset

Data preprocessing is necessary to obtain good performance from the proposed models. The transaction data for the three parking houses in Linköping is processed as described in Section 4.1. Weather data comprising wind speed, temperature, and precipitation is merged as exogenous input with the parking data. To capture temporal dependencies, weekend, holiday, month, and hour are added as features as well. In total, the dataset for car parks Baggen and Akilles consists of around 193,000 samples, spanning 24 months and described by 8 attributes. The car park Druvan was renovated during the second half of 2019, meaning that the transaction data was skewed and did not follow the usual patterns. Therefore, data from this period is left out for Druvan, and about 153,000 samples are used in total.

To be able to evaluate the impact of exogenous features, two separate datasets are created. The first includes all the features described above, whereas the second consists of the historical parking data only. Table 4.1 gives a summary of both datasets.

90% of the data is put into a training set, and the remaining 10% into a test set, maintaining the temporal order of the observations. A method proven to work well for evaluating models on real-world time series is the prequential approach [65]. In this method, the training data is split into blocks that maintain the temporal order of the data. In the first iteration, the first two blocks are used, where the first block is used for training and the second for validation. In the next iteration, these two blocks are used for training and the third for validation, and so forth until all blocks are used, meaning that the last block is only ever used for validation. Here, the training data is split into 6 blocks, 5 iterations are run in total, and the validation error is the mean of the validation errors from all iterations. The process is summarized in Figure 4.6 and sketched in code below. The prequential approach is used to test different parameters, where the combination of hyper-parameters producing the lowest validation error is chosen. The test data is then used to evaluate the model and obtain the generalization error. Note that the generalization error cannot be measured on the validation data, since the validation data is used to choose the hyper-parameters and therefore influences the training of the model.
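The prequential split can be sketched as follows (a minimal illustration under the assumption that `fit` and `validate` are callables supplied by the experiment; the actual block handling may differ):

```python
# Minimal sketch: prequential (growing-window) evaluation over 6 blocks.
# `train` is the chronologically ordered 90% training portion as an array.
import numpy as np

def prequential_error(train, fit, validate, n_blocks=6):
    blocks = np.array_split(train, n_blocks)
    errors = []
    for i in range(1, n_blocks):
        model = fit(np.concatenate(blocks[:i]))    # train on blocks 0..i-1
        errors.append(validate(model, blocks[i]))  # validate on block i
    return np.mean(errors)  # mean validation error over the 5 iterations
```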


Table 4.1: Parking occupancy datasets. Dataset 1 (D1) consists of historical parking data and exogenous features, including weather and date and time attributes. Conversely, dataset 2 (D2) consists of historical parking data only.

ID   Attribute      Description
D1   Car park       The parking occupancy rate for car park = Baggen, Akilles, Druvan at the current time
     weekend        1 if it is Saturday or Sunday, 0 otherwise
     holiday        1 if Swedish holiday, 0 otherwise
     month          Number of the current month
     hour           Number of the current hour
     precipitation  The precipitation for the current hour in millimeters
     wind           The current wind speed in meters/second
     temperature    The current temperature in degrees Celsius
D2   Car park       The parking occupancy rate for car park = Baggen, Akilles, Druvan at the current time

Figure 4.6: Data is split using a prequential approach. 10% of the data is used for testing, whereas the rest is used for training and validation.

4.2.3 Hyper-parameter Tuning

Both LSTM and GBRT have several hyper-parameters that need to be considered. Optimizing the hyper-parameters is computationally expensive, especially when the parameter space is large. There exists a number of algorithms whose objective is to optimize the hyper-parameters in order to find the model that best fits the data. The three most commonly used approaches are grid search, manual search, and random search [66].

Bergstra and Bengio [66] argue that random search is better than grid search for hyper-parameter tuning. The main reason is that different hyper-parameters are usually not equally important: the value of one parameter might impact the performance of the model a lot, while another might not impact the performance at all. This means that grid search usually spends a lot of time investigating parameters that have close to zero impact on the performance of the model, while parameters that actually do impact the model are not investigated enough. In random search, all hyper-parameters get new random values in each iteration, meaning that parameters with a large impact take on more distinct values in total than in grid search.

Two main drawbacks of many hyper-parameter optimization frameworks, including the above, are that users have to construct the parameter search space for each model manually, and that efficient pruning is not featured [67].

A recently developed software framework that aims to solve these issues is Optuna. Optuna's optimization flow consists of two main components: a sampling strategy and a pruning strategy. The former defines how to sample the next set of parameter values, whereas the latter detects unpromising trials, which are pruned; both increase the efficiency of the tuning process. Optuna supports a range of sampling algorithms, including both grid search and a random sampler. Further, Optuna employs a define-by-run principle, allowing the user to dynamically define the search space [67]. It is compatible with most machine learning libraries, such as TensorFlow and scikit-learn, and is thus used to tune the hyper-parameters for both LSTM and GBRT.
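A minimal Optuna study for the GBRT model could look as follows (the parameter names, ranges, and validation scheme are illustrative assumptions, not the tuned search space; `X_train`, `y_train`, `X_val`, and `y_val` are assumed to be prepared as described above):

```python
# Minimal sketch: define-by-run hyper-parameter search with Optuna.
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def objective(trial):
    # Define-by-run: the search space is built while the trial executes.
    model = GradientBoostingRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 500),
        max_depth=trial.suggest_int("max_depth", 2, 10),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    )
    model.fit(X_train, y_train)
    return mean_squared_error(y_val, model.predict(X_val))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```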

4.2.4 Experiments

In this section, the experiment details are described. Both models are trained on each of the two datasets for each car park. Six time lags (30 minutes) of parking occupancy data are used as input features for all models, based on the tuning described in Section 4.1.4; the results of the tuning are presented in Section 5.1.1. All models are trained to predict 12 time steps, meaning the closest prediction is five minutes ahead and the furthest is 60 minutes ahead. Below, the model-specific setups are specified.

GBRT

The hyper-parameters tuned for GBRT include the number of estimators, the number of features to consider at each split, the maximum depth, the minimum number of samples required to split a node, and the learning rate. To make predictions for several time steps, a multi-output regressor is applied to the boosted tree ensemble, fitting one estimator per target.

The quantile loss in Equation 2.18 is used as the loss function to model the aleatoric uncertainty with q = [0.975, 0.5, 0.025], which yields a 95% prediction interval.
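In scikit-learn, this setup could be sketched as follows (a minimal illustration under the assumption that one model is fitted per quantile; hyper-parameter values are left at their defaults, and `X`/`y` shapes follow the experiment description above):

```python
# Minimal sketch: 12-step-ahead GBRT with a 95% prediction interval.
# X holds the lagged occupancy and exogenous features; y has 12 columns
# (5 to 60 minutes ahead). One multi-output model is fitted per quantile.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

models = {
    q: MultiOutputRegressor(
        GradientBoostingRegressor(loss="quantile", alpha=q)
    ).fit(X_train, y_train)
    for q in (0.975, 0.5, 0.025)
}

upper = models[0.975].predict(X_test)   # upper bound of the 95% interval
median = models[0.5].predict(X_test)    # point prediction
lower = models[0.025].predict(X_test)   # lower bound of the 95% interval
```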

LSTM

The first LSTM layer takes six time steps of parking occupancy data as input. All LSTM layers use tanh as the activation function. Some of the exogenous features are not sequential in nature, and are thus not well suited as inputs to the LSTM layers; for example, in the vast majority of cases, the feature weekend has the same value six time steps (30 minutes) back. Weather data is sequential, but it is assumed that the current value is the most important, and to save training time only one value is used. Therefore, the exogenous features are each passed as a single input through a dense layer with 20 nodes and a sigmoid activation function, and the output of this dense layer is concatenated with the output of the last LSTM layer. The output layer, a dense layer, takes the concatenated data as input and uses no activation function, i.e., the activation is linear. MSE is used as the loss function and Adam as the optimizer. Adam is a gradient-based optimizer, but rather than having a fixed learning rate as in stochastic gradient descent, the learning rate is adapted for each parameter [68].
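A minimal sketch of this architecture in the Keras functional API follows (the layer count, node counts, and weight decay are illustrative, not the tuned values; the dropout masks described below are omitted for brevity):

```python
# Minimal sketch: LSTM branch for lagged occupancy + dense branch for
# exogenous features, concatenated into a linear 12-step output.
from keras.layers import Concatenate, Dense, Input, LSTM
from keras.models import Model
from keras.regularizers import l2

n_lags, n_exog, n_horizons = 6, 7, 12
wd = 1e-4  # weight decay (lambda) for the L2 regularizers, illustrative

seq_in = Input(shape=(n_lags, 1))  # six 5-minute lags of occupancy
h = LSTM(50, activation="tanh", kernel_regularizer=l2(wd))(seq_in)

exog_in = Input(shape=(n_exog,))   # weather and date/time features
e = Dense(20, activation="sigmoid", kernel_regularizer=l2(wd))(exog_in)

out = Dense(n_horizons, activation="linear")(Concatenate()([h, e]))

model = Model(inputs=[seq_in, exog_in], outputs=out)
model.compile(optimizer="adam", loss="mse")
```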

Each layer has an L2 regularizer with weight decay λ. The regularizer penalizes the layer parameters to avoid over-fitting, and the weight decay λ controls the strength of the regularization.

Dropout masks are applied in accordance with [42] to enable MC dropout. For the LSTM layers, the dropout masks for the inputs, weights, and recurrent connections are sampled once and then reused for every time step, in accordance with [42]. Optuna is used to optimize the number of LSTM layers, the number of nodes per LSTM layer, and the batch size.
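One way to realize MC dropout at prediction time is to keep the Keras learning phase active so that a fresh dropout mask is drawn on every forward pass (a sketch of the general idea, assuming Keras 2.3 with a TensorFlow 1.x backend and dropout set on the layers; not necessarily the exact mechanism used in the experiments):

```python
# Minimal sketch: MC dropout inference via stochastic forward passes.
import numpy as np
from keras import backend as K

# Build a function that runs the model with the learning phase enabled (1),
# so dropout stays active at prediction time.
mc_forward = K.function(model.inputs + [K.learning_phase()], model.outputs)

def mc_dropout_predict(x_seq, x_exog, n_samples=100):
    draws = np.stack(
        [mc_forward([x_seq, x_exog, 1])[0] for _ in range(n_samples)]
    )
    # Predictive mean and spread across the stochastic forward passes.
    return draws.mean(axis=0), draws.std(axis=0)
```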

The number of epochs refers to how many times all the training data is used when training an NN. Too many epochs might lead to overfitting, where the training error keeps decreasing for each epoch while the validation error stagnates or even increases. Also, the training time increases with the number of epochs. Therefore, it is important not to train the model for too many epochs. To deal with this, early stopping is used during training.
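In Keras, early stopping is available as a callback (a minimal sketch; the monitored quantity, patience, and training arguments are illustrative assumptions):

```python
# Minimal sketch: stop training when the validation loss stops improving.
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
model.fit([x_seq_train, x_exog_train], y_train,
          validation_data=([x_seq_val, x_exog_val], y_val),
          epochs=200, batch_size=64, callbacks=[early_stop])
```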
