
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Machine Learning Approaches on a Travel Time Prediction Problem

SARA DANIELSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES


Machine Learning Approaches on a Travel Time Prediction Problem

SARA DANIELSSON

Degree Projects in Optimization and Systems Theory (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2018

Supervisor at Budbee: Esteban Salva
Supervisor at KTH: Anders Forsgren
Examiner at KTH: Anders Forsgren


TRITA-SCI-GRU 2018:022
MAT-E 2018:08

KTH Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

This thesis concerns the prediction of travel times between two points on a map, based on a combination of link-scale road network data and historical trip-scale data.

The main idea is that the predictions using the road network data can be improved by a correction factor estimated from historical trip data. The correction factor is estimated both using a Machine Learning approach, more specifically Gaussian Process Regression, and a simple baseline method inspired by an approach in the literature. The Gaussian Process Regression is performed using a string kernel and a squared exponential kernel. The Gaussian Process Regression using the string kernel outperforms both the baseline and the squared exponential kernel, and is hence the most promising approach on the considered problem.


Sammanfattning

This report concerns estimates of travel times between two points on a map, based on a combination of road network data (at the scale of road segments) and historical data from completed trips (at a larger scale, that of the trips). The main idea is that the estimates can be improved by introducing a correction factor estimated from the historical trip data. The correction factor is estimated both with machine learning methods, more specifically regression based on Gaussian processes, and with a simple reference method inspired by a method from the literature.

Two different kernel functions are used in the Gaussian process regression: a squared exponential kernel and a string kernel. The method using the string kernel is the most promising method in this study, as it performs better than the other two.


Acknowledgements

I would like to thank the team at Budbee for providing an interesting problem setting and relevant datasets, and specifically David Ödling and Esteban Salva for great supervision during the process. Further, I want to thank my supervisor at KTH, Anders Forsgren, for valuable discussions throughout the project and extra encouragement towards the end. Finally, I want to thank my partner Jesper for great patience and unfailing support.


Contents

1 Introduction
  1.1 Background
    1.1.1 The travel time prediction problem
    1.1.2 Challenges
  1.2 Research objectives
    1.2.1 Problem setting
    1.2.2 Objective of study
  1.3 Thesis disposition

2 Problem Definition
  2.1 The Travel Time Prediction Problem

3 Literature Review
  3.1 Introduction to Different Approaches
  3.2 On Gaussian Process Regression
    3.2.1 Gaussian Processes in Theory
    3.2.2 Regression using Gaussian Processes
    3.2.3 Determination of Hyperparameters
  3.3 Origin–Destination Approaches
    3.3.1 Gaussian Process Regression: SE kernel
    3.3.2 Baseline Regression
  3.4 A Path-Dependent Approach
    3.4.1 Gaussian Process Regression: String Kernels

4 Research Methodology
  4.1 Using data from reality
    4.1.1 Data Processing
    4.1.2 Data Partitioning
  4.2 Implementation of Regression
    4.2.1 A Baseline Regression
    4.2.2 Gaussian Process Regression: SE kernel
    4.2.3 Gaussian Process Regression: STR kernel

5 Results
  5.1 Choice of Parameters
    5.1.1 The hyperparameters
    5.1.2 The neighbor radius for the baseline
    5.1.3 The subsequence length of the string kernel
  5.2 Comparison of Methods
    5.2.1 Prediction Performance
    5.2.2 Error Analysis

6 Discussion
  6.1 Improving the Predictions
  6.2 Obtaining Uncertainty Measures

7 Conclusion

A Mathematical Background
  A.1 Matrix derivatives
  A.2 Conditioning Gaussian Distributions

B Tables
  B.1 Initial values for hyperparameter optimization

Bibliography

List of Symbols

x_j      query trip j
o_j      origin of query trip x_j
d_j      destination of query trip x_j
y_j      predicted travel time of query trip x_j
ŷ_j      road network travel time estimate for query trip x_j
ȳ_j      true (observed) travel time of query trip x_j
α_j      travel time correction factor corresponding to ŷ_j
D        training set of trips x_i
Q        query set of trips x_j
Ñ(x_j)   set of neighbor trips to x_j

Abbreviations

ARD   Automatic Relevance Determination
GP    Gaussian Process
GPR   Gaussian Process Regression
OD    Origin–Destination (approach)
PD    Path-Dependent (approach)
SE    Squared-Exponential (kernel)
STD   Standard Deviation
STR   String (kernel)
SVRP  Stochastic Vehicle Routing Problem
TSP   Traveling Salesman Problem
VRP   Vehicle Routing Problem


Chapter 1

Introduction

There is much efficiency to gain from accurate travel time predictions, in many parts of society. On an individual level, many people rely on their smartphones' navigation apps to estimate travel times. Companies within logistics and delivery services rely on travel time predictions as inputs to their routing algorithms. On the one hand, there is a cost associated with being too optimistic in the predictions and not being able to deliver according to the plan. On the other hand, it is costly to keep high margins in the routing, as in the case of high variance or overestimated travel times. Hence, accurate travel time estimations are desirable.

This study was performed considering the problem setting of the Swedish logistics company Budbee, where the considered methods were tested on their dataset.

1.1 Background

During the last decade, there has been a fundamental shift in traffic modeling, as the availability of data has increased dramatically through tracking of GPS devices [11] [12]. This has fed the development of models of varying complexity, each trying to capture the traffic patterns of an urban area.

1.1.1 The travel time prediction problem

The travel time prediction problem considered in this thesis aims at predicting travel times between two points, i.e. for one trip, at a certain time of the day. The trips are in this setting the jobs of going from one customer stop (delivery) – or terminal stop – to another; hence a delivery route for one driver consists of several subsequent trips.

Besides the predicted travel time, an uncertainty measure of that estimate is also of interest. The information that an estimate is uncertain, due to high variations in historical data for similar trips, is useful as it suggests increasing the margins in the routing process.

There are datasets containing travel times on a road segment level, where a road segment is typically the stretch of a road between junctions or traffic lights. The travel times are sometimes differentiated by the time of the day and the day of the week, and are based on legal speed limits and historical data for each road segment. These road maps constitute a good basis for travel time estimations, although they are not able to capture higher-level patterns, i.e. patterns on a trip level (a trip consisting of subsequent road segments).

In the literature, there are many different approaches to travel time estimation on a trip level, given a dataset of historical trips. What all approaches have in common is that they consist of two fundamental steps: (1) they define and compute a similarity measure between trips, and (2) they propose a way to infer the travel time of a query trip from historical trips, based on the similarity measure defined in the previous step.

One major distinction between the methods is whether they only consider the origins and destinations of the trips for the similarity measure, or if they also consider the paths taken within each trip. Further, the methods differ in the mechanism used for the inference, i.e. the approach to the second step described above.

Both classic regression methods and machine learning methods are presented in the literature.

1.1.2 Challenges

Travel time prediction comes with certain challenges. To begin with, traffic dynamics have a complex nature, which puts high demands on any model trying to capture them. The general approach is to let historical trips speak for themselves as much as possible, by considering big sets of historical data and using methods inspired by machine learning.

As with all data-driven approaches, data sparsity might cause problems. For the inference to yield reasonable results, it is necessary to have enough similar trips [13]. The definition of similarity is hence connected to the sparsity issue, as approaches with different similarity definitions will be affected differently by sparsity. For example, origin–destination approaches will require trips with similar origins and destinations, while path-dependent approaches only require the trips to share sub-paths.

The sparsity issue can be summarized as an aspect of the curse of dimensionality, which describes the challenges of working with data of many dimensions [5].

The second challenge is to deal with complexity and scalability. Many of the sophisticated methods in the literature have computational complexities growing exponentially with the size of the data sets considered. Consequently, extra care is required in terms of approximations in order to reduce complexity and speed up the algorithms [13] [12].

1.2 Research objectives

In general, the objective of this thesis is to investigate if there are methods from the literature that can improve Budbee’s current travel time predictions, and to include an uncertainty measure of the predictions. The following section describes the problem setting, in order for us to be more specific about the objective formulation.

1.2.1 Problem setting

Today, Budbee estimates the travel time of a trip by adding the travel times along the individual road segments that constitute the shortest path of the trip. The travel times for each road segment are taken from a road network graph, provided by an external source. The travel time for each link is not a single value but a profile describing how the travel time varies over the hours of the day and the days of the week. Thus, Budbee can take the starting time of each trip into consideration when routing.

Once the travel time estimates for all trips are given, they are used as inputs to an optimization process, aiming at routing vehicles so as to minimize a cost function under certain constraints, i.e. the vehicle routing problem (VRP). The VRP can be formulated either to take deterministic values of the travel times (costs) as inputs, or to take probability distributions over the travel times as inputs. The latter is commonly referred to as the stochastic vehicle routing problem (SVRP) [7]. Considering the trade-off between the costs associated with not being able to deliver according to the schedule and keeping high margins in the routing to be on the safe side, it is desirable to formulate an SVRP.

While the current travel time prediction approach is very robust – free from both sparsity and complexity issues – it tends to repeatedly either underestimate or overestimate travel times. Consider a fraction of a path consisting of two road segments, separated by a junction. Simply adding the travel times for each road segment in order to get the travel time for the fraction is a simplification, as the junction (possibly involving traffic lights) might take time in itself. To include an expected value of the waiting time at the junction in one of the road segments' times is also problematic, as the waiting time at the junction might differ widely between, for instance, right turns and left turns. This motivates why a higher level of abstraction, able to capture patterns in sequences of links instead of only individual links, can improve the prediction performance.

This thesis is based upon the idea that there is accuracy to gain by combining the current travel time prediction process, which is closely related to the road network data, with the historical data that Budbee collects from all their trips.

It is of great importance to keep the robustness of the current approach, as it covers all road segments in Sweden, and not only the ones that have already been traversed by Budbee drivers. At the same time, it is desirable to learn from Budbee's historical trip data and thereby be able to improve the predictions overall.

Hence, approaches where both datasets contribute to the prediction process are investigated.

1.2.2 Objective of study

The primary objective of the study is to investigate if a selection of methods from the literature can improve Budbee's current travel time prediction method. More specifically, the methods from the literature use Gaussian Process Regression (GPR) based on two different kernels (one path-dependent, one based on origin–destination) [4] and one neighbor-based baseline method [12].

Note that the aim is to see if and how the current predictions can be improved by the inclusion of historical trip data, and not to replace the current prediction process with the considered methods.

The secondary objective is to obtain a measure of uncertainty for each trip, i.e. a probability distribution over the travel times for each trip instead of a deterministic value. As mentioned above, this opens up for the optimization process to take stochasticity into account and will complement any improvement in travel time prediction we may achieve. It is a way to increase the influence of historical trip data and thereby connect the routing algorithm even more strongly to reality.

1.3 Thesis disposition

This thesis starts with a more detailed problem description, where some terms used frequently in the report are also defined. Next, a review of the literature is given, with summaries of the considered methods. Following that is a chapter where the implementation is described, together with the modifications made to the reviewed methods in order for them to fit into this problem setting. Next, some results from a comparison between the current predictions and the investigated methods are presented, followed by a discussion of the results. Finally, the conclusions of the study are summarized and some further research areas are pointed out.


Chapter 2

Problem Definition

In this chapter, the travel time prediction problem is defined and formulated. First, some definitions are made; then a mathematical formulation of the problem setting is presented.

2.1 The Travel Time Prediction Problem

Before formulating the problem, some definitions are made in order to prevent confusion.

Definition 1 (Link). A link is a road segment, that is, an edge between two neighboring intersections.

Definition 2 (Trip). A trip is the passage between two customer stops (origin and destination stops).

A trip will be denoted by $x_i$, independent of its representation. Consequently, $x_i$ can have different forms, depending on the method used. The corresponding travel time will be denoted by $y_i$. The historical dataset $D$ is written as
$$ D = \{x_i, y_i\}_{i=1}^{N}, $$
where $N$ is the number of historical trips.

Definition 3 (Path). The path of a trip is the sequence of links taken for the trip.

Definition 4 (Route). A route consists of one or more subsequent trips, describing the route of one driver during one shift.

Definition 5 (Query trip). A query trip is a trip of which the travel time is to be inferred.


Figure 2.1: An illustration of the road network graph. Note the two different arc types: the paths (red) and the theoretical arcs connecting the customer stops.

Any query trip will be denoted by $x_j$, and its corresponding estimated travel time by $y_j$. In the investigative setting of this thesis, the query trips will be taken from a historical dataset, in order to be able to evaluate the performance of the methods. However, in the real problem setting, query trips correspond to new trips to be used as input in the routing routine, and the outcome in terms of the travel time is hence not known beforehand. The set of query points, together with the travel times to be predicted, will be denoted by
$$ Q = \{x_j, y_j\}_{j=1}^{M}, $$
where $M$ is the number of query trips. In other words, a historical dataset of $N + M$ trips will be partitioned into a training set $D$ of $N$ trips and a test set $Q$ of $M$ trips.

Now, the travel time prediction problem can be formulated as in Problem 1.

Problem 1 (Deterministic Travel Time Prediction Problem). Given a dataset $D$ of historical trips, infer a travel time $y_j$ of a query trip $x_j$.

As mentioned in the introduction, the uncertainties of the estimates are of great interest. Therefore, an alternative problem formulation is also made in Problem 2, where we are as interested in the uncertainty of the travel time as in its expected value.

Problem 2 (Probabilistic Travel Time Prediction Problem). Given a dataset $D$ of historical trips, infer a predictive distribution of the travel time $y_j$ of a query trip $x_j$.


The predictive distribution sought in Problem 2 – the probability distribution of the travel time $y_j$ given $x_j$ and the historical dataset $D$ – is expressed as $p(y_j \mid x_j, D)$. If the distribution is Gaussian, it is fully defined in terms of a mean and a variance, known as the predictive mean and the predictive variance.


Chapter 3

Literature Review

In this chapter, some approaches from the literature in the area are reviewed. The methods presented are all focused on solving either Problem 1 or Problem 2, and they form the basis for the following implementation and analysis.

3.1 Introduction to Different Approaches

Fundamentally, there are two different perspectives taken in traffic modeling: the observer's view and the driver's view. The observer's view is the view of an observer standing next to a road, watching the traffic on that particular link. It is a perspective often taken in traffic control and city planning [10]. The driver's view, on the other hand, takes the perspective of an individual car and observes the conditions along its path [4]. These two views correspond to different modeling approaches, but also relate to different sources of data. In the observer's case, one has stationary sensors on the roads, such as cameras and loop detectors, while the driver's view corresponds to floating car data such as GPS data [8].

In this problem setting, the focus will be on the driver's view. That is, we are interested in estimating travel times for particular trips rather than finding patterns for individual links. Within the driver's perspective, there are different approaches to determining travel times for the trips. One dividing line is whether the travel time estimator is path-dependent or not. It is path-dependent (PD) if it considers the trajectories within a particular trip. If it only considers the start (origin) and end (destination) points, it is called an origin–destination (OD) travel time estimator.

Out of these two kinds, the path-dependent approaches are more common than the OD approaches [12]. In this thesis, approaches of both kinds will be considered and compared.


3.2 On Gaussian Process Regression

What most people think of when they hear regression is in fact parametric regression. Parametric regression is a two-step process: the first step is to choose a parametric form of the function (polynomial, exponential, etc.) and the second step is a process aiming at finding the most representative parameters for the model.

However, there is also another form of regression that is non-parametric, which is more suitable if the underlying behaviour of the observables is unknown or when the number of dimensions increases. This way, the data is allowed to "speak for itself" instead of being forced into a certain model.

3.2.1 Gaussian Processes in Theory

One non-parametric approach is to let all data points be random variables that are assumed to be correlated through a predefined correlation function. (For a definition of a random variable, see the literature on probability theory.) This way, any subset of data points – random variables – forms a joint Gaussian distribution. This is referred to as a Gaussian Process (GP). The following review of Gaussian processes is based on the book by Rasmussen and Williams from 2006 [9].

Definition 6 (Gaussian Process). A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution [9].

If $f(x)$ is a Gaussian Process, we write
$$ f(x) \sim \mathcal{GP}(m(x), k(x, x')). $$

One key point of GPs is that they are completely specified by their mean and covariance functions,
$$ m(x) = \mathbb{E}[f(x)], \qquad (3.1) $$
$$ \operatorname{cov}(f(x), f(x')) = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]. \qquad (3.2) $$
The covariance function is crucial in GP modeling and is just what it seems to be: a measure of the covariance between the outputs [9]. It is defined as a function of the inputs, via a kernel function $k(\cdot, \cdot)$,
$$ \operatorname{cov}(f(x), f(x')) = k(x, x'). \qquad (3.3) $$
There are different choices of kernel functions, which will be discussed in later sections. The main idea necessary for understanding the mechanism of GPs is that the kernel measures similarity in terms of the inputs, where more similar inputs should yield a higher covariance of the outputs. Hence, if $x \approx x'$, then $k(x, x')$ should approach its maximum, while dissimilar $x$ and $x'$ should yield a kernel value close to zero. The generalization properties of the model are determined by the kernel function, making it a very significant part of GPs. Worth mentioning is that kernel functions by definition are symmetric and positive definite [2] [9].

Figure 3.1: An illustration of a GP. The squares represent observed variables and the circles unknown (latent) variables. The thick line represents the GP connecting the different outcomes [9].

The marginalization property of GPs states that if a GP specifies $(y_1, y_2) \sim \mathcal{N}(\mu, \Sigma)$, then $y_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$, where $\mu_1$ is the vector of mean values corresponding to $y_1$ and $\Sigma_{11}$ is the relevant submatrix of $\Sigma$.

Now we will let a GP $f(x)$ model the inputs $x$ and outputs $y$ in a dataset that we denote $A$. Let $A$ consist of the two subsets $D = \{x_i, y_i\}_{i=1}^N$ and $Q = \{x_j, y_j\}_{j=1}^M$ (i.e. the training set and the test set from Chapter 2).

Following Definition 6 on GPs, we know that the data points in $A$ are jointly Gaussian. The joint distribution over the training outputs $y = (y_1, \ldots, y_N)^T$ in $D$ and the query outputs $y_* = (y_1, \ldots, y_M)^T$ in $Q$ is called the prior and is expressed as
$$ \begin{bmatrix} y \\ y_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x) \\ m(x_*) \end{bmatrix}, \begin{bmatrix} K(x, x) & K(x, x_*) \\ K(x_*, x) & K(x_*, x_*) \end{bmatrix} \right), \qquad (3.4) $$
where $K(x, x)$ is the $N \times N$ matrix of covariances evaluated at all pairs of entries of $x$, $K(x, x_*)$ is the corresponding $N \times M$ matrix evaluated at all pairs of entries of $x$ and $x_*$, and similarly for $K(x_*, x)$ and $K(x_*, x_*)$.

This is valid if the observations of the outputs are assumed to contain no noise. However, it is reasonable to assume that the observed variables do contain a certain noise. We introduce independent, identically distributed Gaussian noise $\epsilon$ with mean zero and variance $\sigma^2$, according to
$$ y_i = f(x_i) + \epsilon, \quad i = 1, \ldots, N, \qquad (3.5) $$
$$ \epsilon \sim \mathcal{N}(0, \sigma^2). \qquad (3.6) $$


This corresponds to adjusting the covariance of the prior on the noisy observations to
$$ \operatorname{cov}(y_i, y_{i'}) = k(x_i, x_{i'}) + \sigma^2 \delta_{i,i'}, $$
where $\delta_{i,i'}$ is the Kronecker delta function. Now, we can include the noise assumption (3.5)–(3.6) on $y$ in (3.4), yielding
$$ \begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x) \\ m(x_*) \end{bmatrix}, \begin{bmatrix} K(x, x) + \sigma^2 I & K(x, x_*) \\ K(x_*, x) & K(x_*, x_*) \end{bmatrix} \right). \qquad (3.7) $$

A commonly used graphical representation of a GP can be found in Figure 3.1. Here, dependencies are represented by arrows and the covariance $\operatorname{cov}(f(x), f(x'))$ is represented by the thick line. Note that each observed output $y_i$ is conditionally independent of all other nodes given $f_i$ [9].

3.2.2 Regression using Gaussian Processes

Until now, we have only studied GPs in general, and said little about how they can be used for regression, in so-called Gaussian Process Regression (GPR). Remember the probabilistic travel time prediction problem (Problem 2, Chapter 2), where one is interested in the predictive distribution over the query outputs $f_*$. This can be achieved by conditioning (3.7) on the training set, yielding $p(f_* \mid x_*, D)$.

The conditioning yields a new Gaussian distribution; see (A.4)–(A.7) in the appendix for the conditioning formulas to be applied. We get
$$ p(f_* \mid x_*, D) \sim \mathcal{N}(\bar{f}_*, \Sigma_*), \qquad (3.8) $$
where the predictive mean and variance are given by
$$ \bar{f}_* = m(x_*) + K(x_*, x)\left[K(x, x) + \sigma^2 I\right]^{-1} (y - m(x)), \qquad (3.9) $$
$$ \Sigma_* = K(x_*, x_*) - K(x_*, x)\left[K(x, x) + \sigma^2 I\right]^{-1} K(x, x_*). \qquad (3.10) $$
Note that (3.8)–(3.10) fully define the predictive distribution that was sought.

In order to arrive at a predictive mean and a predictive variance, one has to know or assume $m(x)$ and $m(x_*)$, choose a kernel function used to build the covariance matrices $K$ from the inputs, and determine a noise variance $\sigma^2$.

The inputs $x$ only appear in the kernel function. Consequently, trips can be represented in various ways, as long as the kernel function handles that specific domain representation. In the following applications of GPs, we will consider both OD and PD representations.
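To make (3.8)–(3.10) concrete, the following is a minimal sketch of GP prediction in Python, assuming a generic `kernel(A, B)` function that returns the matrix of pairwise covariances; the function names and the constant-mean choice are illustrative, not taken from the thesis implementation.

```python
import numpy as np

def gp_predict(X, y, X_star, kernel, noise_var, mean=None):
    """Minimal GP prediction following (3.8)-(3.10).

    X: (N, d) training inputs, y: (N,) training outputs,
    X_star: (M, d) query inputs, kernel(A, B): pairwise covariance matrix,
    noise_var: sigma^2. A constant mean m(x) = mean(y) is assumed.
    """
    if mean is None:
        mean = y.mean()                      # m(x) approximated by the output average
    K = kernel(X, X) + noise_var * np.eye(len(X))   # K(x,x) + sigma^2 I
    K_s = kernel(X_star, X)                  # K(x_*, x)
    K_ss = kernel(X_star, X_star)            # K(x_*, x_*)
    # Cholesky factorization is the numerically stable way to apply the inverse
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mean))
    f_bar = mean + K_s @ alpha               # predictive mean (3.9)
    V = np.linalg.solve(L, K_s.T)
    cov = K_ss - V.T @ V                     # predictive covariance (3.10)
    return f_bar, cov
```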


It remains to choose reasonable values for $m(x_*)$ (which is not known beforehand) and for the variance $\sigma^2$ of the assumed noise. For the former, one natural choice is to assume
$$ m(x_*) \approx m(x), $$
where $m(x)$ can be calculated directly as the average of the outputs $y$,
$$ m(x) = \frac{1}{N} \sum_{i=1}^{N} y_i. $$
The variance $\sigma^2$ is a so-called hyperparameter. Next, the general approach to estimate hyperparameters is presented.

3.2.3 Determination of Hyperparameters

A common approach to determining the hyperparameters of a kernel is to study the marginal likelihood $p(y \mid x, \theta)$ (or evidence), where $\theta$ denotes the hyperparameters, in the process of finding their optimal values.

The marginal likelihood is obtained by marginalizing the likelihood times the prior over the function values $f$, that is, by computing the integral
$$ p(y \mid x) = \int p(y \mid f, x)\, p(f \mid x)\, df. \qquad (3.11) $$

One common trick is to instead consider the log marginal likelihood in the optimization process of finding the hyperparameters, that is, to maximize
$$ \log p(y \mid x, \theta) = -\frac{1}{2} y^T C^{-1} y - \frac{1}{2} \log \det(C) - \frac{N}{2} \log 2\pi, \qquad (3.12) $$
where $C$ denotes the covariance matrix of the noisy observations, $C = K(x, x) + \sigma^2 I$. See Rasmussen, page 19, for a derivation of this expression [9]. The derivative with respect to a hyperparameter $\theta_i$ is
$$ \frac{\partial}{\partial \theta_i} \log p(y \mid x, \theta) = \frac{1}{2} y^T C^{-1} \frac{\partial C}{\partial \theta_i} C^{-1} y - \frac{1}{2} \operatorname{Tr}\left( C^{-1} \frac{\partial C}{\partial \theta_i} \right). \qquad (3.13) $$
To arrive at this, a couple of matrix derivative formulas have been applied (see A.1).

The derivatives are commonly used in the optimization process, where gradient methods are to be preferred [9].

Estimation of the hyperparameters $\theta$ by maximizing the marginal likelihood is common in statistical analysis. Intuitively, it adapts the model parameters to a given dataset. In (3.12), the three terms correspond to the data fit, a complexity penalty and a normalization constant (in that order). Thus, it takes overfitting into account as it balances model fit and complexity.


Figure 3.2: (a) The marginal likelihood plotted over the two hyperparameters: noise standard deviation and length-scale. (b) One local optimum (interpretation of the data), with better data fit and higher complexity. (c) Another local optimum (interpretation of the data), with a simpler model. Figure from Rasmussen, 2006 [9].

In terms of computational complexity, the computation of the marginal likelihood is dominated by the inversion of the matrix $C$, as inverting matrices is $O(N^3)$. Once the inverse is known, the derivatives are relatively cheap – $O(N^2)$ – to compute, which is why a gradient-based optimizer is to be preferred.

As for many other functions, it is not possible to exclude that local optima will be present in the marginal likelihood function. Each local maximum corresponds to an interpretation of the data. This is illustrated in Figure 3.2, where two local optima are shown in a case with two hyperparameters: noise standard deviation and length-scale. Based on the information from only the seven data points in the figure, it is hard to tell which interpretation is best. With increasing dataset sizes, practical experience indicates that local optima rarely cause problems, as the local optima differ significantly in size [9]. However, care should always be taken when optimizing non-convex functions.
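As an illustration of (3.12), the following is a minimal sketch of evaluating the log marginal likelihood, assuming the covariance matrix $C$ has already been built from a kernel plus the noise term; this is not the thesis code, just one standard way to compute the expression via a Cholesky factor.

```python
import numpy as np

def log_marginal_likelihood(y, C):
    """Log marginal likelihood (3.12), assuming zero (or subtracted) mean.

    y: (N,) outputs, C: (N, N) covariance matrix K(x,x) + sigma^2 I.
    A Cholesky factor gives both C^{-1} y and log det(C) cheaply and stably.
    """
    N = len(y)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # C^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))              # log det(C)
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2 * np.pi)
```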


3.3 Origin–Destination Approaches

If one were to choose only one variable to represent a trip, the total length of the trajectory would probably be the most sensible choice. However, we do not have to restrict ourselves to a one-dimensional input space. It is reasonable to assume that the second most impactful factor is where the trip takes place – a trip in a suburban area will probably have a different average speed than one in the city center. Therefore, it is natural to also include the origins and destinations (OD) of the trips. One motivation behind only considering OD is that it is simple, and that two trips with similar ODs will probably consist of similar paths as well. Additional aspects to include could be, for example, the time of the day and the day of the week of the trip.

In this section, different OD approaches will be studied. The main idea is that we do not care about the paths taken by the trips: instead we look upon the trips in terms of OD, length, time of day, etc.

3.3.1 Gaussian Process Regression: SE kernel

Gaussian Process Regression (GPR) can be applied to the OD approach. The theory and methods behind GPR have been given in section 3.2, and it remains only to present some useful kernel functions.

First, we remind ourselves that the purpose of a kernel function is to capture similarities between input points $x \in \mathcal{X}$. It has to be a valid covariance function – that is, it has to produce valid covariance matrices.

The probably most widely used kernel is the squared exponential (SE)¹ [9]. With two parameters, the covariance amplitude $\sigma_f^2$ and the length scale $l$, it is defined for one-dimensional inputs $x$ as
$$ k_{SE}(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2 l^2} \right). \qquad (3.14) $$
The corresponding generalization to multidimensional inputs $\bar{x} = (x_1, \ldots, x_D)$ is given in Definition 7.

Definition 7 (Generalized SE kernel).
$$ k_{SE}(\bar{x}, \bar{x}') = \sigma_f^2 \exp\left( -\frac{1}{2} (\bar{x} - \bar{x}')^T M (\bar{x} - \bar{x}') \right) \qquad (3.15) $$

¹Some argue that this is not the correct name for this kernel, and that exponentiated quadratic covariance function should be used instead. However, SE will be used throughout this thesis, as it is still the most widely used term for the kernel.


Above, $M$ is a $D \times D$ symmetric matrix, which can be defined as
$$ M = \operatorname{diag}(l)^{-2}, \qquad (3.16) $$
where $l$ is the vector of length scales $(l_1, \ldots, l_D)$ corresponding to the dimensions of the input point $\bar{x}$. Note that (3.15) corresponds to the product of the one-dimensional kernel functions in (3.14) along each dimension,
$$ k_{SE}(\bar{x}, \bar{x}') = \prod_{d=1}^{D} \tilde{\sigma}_d^2 \exp\left( -\frac{(x_d - x_d')^2}{2 l_d^2} \right) = \sigma_f^2 \exp\left( -\sum_{d=1}^{D} \frac{(x_d - x_d')^2}{2 l_d^2} \right). $$
In (3.15), the length scale $l_d$ controls the "relevance" of dimension $d$ of the inputs $\bar{x}, \bar{x}'$ [2]. As described in section 3.2.3, the approach to estimating the hyperparameters is to maximize the marginal likelihood. This way, one automatically detects each dimension's relevance, which is known as automatic relevance determination (ARD).

The SE kernels in (3.14) and (3.15) are infinitely differentiable, yielding a characteristically smooth behaviour of the GP [9].
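As an illustration, here is a minimal sketch of the generalized SE kernel (3.15)–(3.16) with one length scale per dimension, vectorized over two sets of input points; the function name and array layout are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def se_ard_kernel(A, B, sigma_f, length_scales):
    """Generalized SE kernel (3.15) with one length scale per input dimension (ARD).

    A: (n, D) and B: (m, D) input points, sigma_f: covariance amplitude,
    length_scales: (D,) vector l, so that M = diag(l)^{-2} as in (3.16).
    Returns the (n, m) covariance matrix.
    """
    A_scaled = A / length_scales             # equivalent to applying diag(l)^{-1}
    B_scaled = B / length_scales
    # squared Euclidean distances between all pairs of scaled points
    sq_dists = ((A_scaled[:, None, :] - B_scaled[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * sq_dists)
```

A function like this could be plugged in as the `kernel` argument of the GP prediction sketch in section 3.2.2.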

3.3.2 Baseline Regression

In the spirit of "big data beats algorithm", Wang et al. [12] proposed a baseline method to estimate travel times. It is path-independent, and hence only considers the origin and destination points and the time of the day of a trip. The main idea is that a query trip is inferred directly from a set of similar trips, where similar trips are defined as those with nearby origin and destination points at the same hour of the day. Wang et al. performed their study on public data from the New York City Taxi & Limousine Commission, containing more than 173 million taxi trips [6].

In the problem setting of their study, Wang et al. demonstrated that their relatively simple method outperformed state-of-the-art approaches that require much more expensive calculations. For example, they compared their proposed method with results from Bing Maps and Baidu Maps (with APIs allowing large-scale comparisons) with satisfactory results [12].

This baseline method consists of two fundamental steps:

1. the definition of similar trips, and

2. the aggregation of travel times for similar trips.


For the first step, Wang et al. determine similarity in terms of both spatial and temporal proximity. They define $x_i$ to be a neighbor of $x_j$ if the distances between their corresponding origins ($o_i$ and $o_j$) and destinations ($d_i$ and $d_j$) are "small enough". More specifically, they define the neighbor set $\tilde{N}$ of $x_j$ as
$$ \tilde{N}(x_j) = \{ x_i \in D \mid \operatorname{dist}(o_i, o_j) \le \tau \text{ and } \operatorname{dist}(d_i, d_j) \le \tau \}, \qquad (3.17) $$
where $\operatorname{dist}(p_i, p_j)$ is the Euclidean distance between two points $p_i$ and $p_j$, and $\tau$ defines a radius within which all trips are considered neighbors. It is the only parameter of this method, and the choice of it is crucial when balancing coverage and performance.

To summarize, the travel time of a query trip $x_j$ is estimated as an average over the neighbor trips, according to
$$ y_j = \frac{1}{|\tilde{N}(x_j)|} \sum_{x_i \in \tilde{N}(x_j)} y_i. \qquad (3.18) $$

In their paper, Wang et al. go on to discuss weights controlling how much each neighboring trip $x_i$ should affect the prediction $y_j$. One possible weight coefficient is the inverse of the distance between the origins and destinations, yielding a stronger influence from trips with very similar origins and destinations. However, they found empirically that this did not necessarily lead to better performance, and that the hard threshold could be kept [12].

A more significant weight coefficient would be one that captures the temporal dynamics of the traffic patterns. They introduce such a weight, or "scaling factor", and make a couple of assumptions in order to be able to estimate it from data on how the average speed varies at different time stamps.

Algorithm 3.1: The Baseline Approach
Input: Dataset D, query trip x_j.
Result: Estimated travel time y_j for query trip x_j.
  Ñ_j ← {x_i ∈ D | dist(o_i, o_j) ≤ τ and dist(d_i, d_j) ≤ τ}   (eq. 3.17)
  y_j ← (1 / |Ñ_j|) Σ_{x_i ∈ Ñ_j} y_i   (eq. 3.18)
  return y_j

The greatest advantage of this baseline approach is that it has linear complexity in the inference step. Further, it practically has no training phase, which makes the complexity even more beneficial. However, one major disadvantage of this approach is that it deals poorly with the sparsity issue, as it is not able to return any estimate in the cases where no neighbor trips are found. This is rarely a problem in the NYC taxi case, but happens more often in the application considered in this thesis. Consequently, one has to consider modifications in order to make the method robust even in these cases.

Finally, we conclude that uncertainties are not a natural part of this approach. In other words, this method treats Problem 1 rather than Problem 2. However, one can compute the variance of the travel times in the neighbor set $\tilde{N}$ and use this as the uncertainty measure of the estimate.
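For concreteness, the following is a minimal sketch of Algorithm 3.1 in Python, assuming planar coordinates stored in NumPy arrays; it also returns the standard deviation of the neighbor set as the simple uncertainty measure discussed above. The array layout is an assumption for illustration.

```python
import numpy as np

def baseline_predict(origins, dests, times, query_o, query_d, tau):
    """Sketch of the baseline (Algorithm 3.1): average the travel times of
    all training trips whose origin and destination both lie within radius
    tau of the query's origin and destination.

    origins, dests: (N, 2) planar coordinates, times: (N,) observed travel
    times, query_o, query_d: (2,) query endpoints. Returns (mean, std) or
    (None, None) if the neighbor set is empty.
    """
    mask = (np.linalg.norm(origins - query_o, axis=1) <= tau) & \
           (np.linalg.norm(dests - query_d, axis=1) <= tau)     # eq. (3.17)
    if not mask.any():
        return None, None                    # sparsity: no neighbors found
    neighbors = times[mask]
    return neighbors.mean(), neighbors.std() # eq. (3.18) plus an uncertainty measure
```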

3.4 A Path-Dependent Approach

In contrast to Wang et al.'s baseline approach and the GPR approach using the SE kernel, there are also methods using the actual paths of the trips in order to extract similarities, i.e. path-dependent approaches. This way, one gains information even from trips that do not necessarily share origins and destinations, as long as they share a part of their paths somewhere along the way. Now, a path-dependent GPR method will be presented.

3.4.1 Gaussian Process Regression: String Kernels

A trajectory-based (path-dependent) approach to travel time estimation is presented by Idé and Kato, who propose string kernels (STR) to extract similarities between trips. That is, the kernel takes the path representations of two trips and returns a measure of their similarity [4].

They present three different string kernels, of which only the best performing one will be considered in this thesis: the ID kernel. In machine learning, it is known as a p-spectrum kernel. P-spectrum kernels were first introduced in bioinformatics, where sequences of DNA strings are studied; the sequences are built from the symbols of an alphabet. In this problem setting, the alphabet corresponds to the set of all link IDs (remember, from Definition 1, that a link is a road segment). See Definition 8 and Definition 9.

Definition 8 (Alphabet $\Sigma$). The alphabet $\Sigma$ is a set of symbols representing the links of the road network.

Definition 9 (Alphabet $\Sigma^p$). The set $\Sigma^p$ consists of subsequences of $p$ consecutive symbols in $\Sigma$.

If we denote a subsequence $u$, then $N_u(x)$ is defined as the number of occurrences of $u$ in path $x$. The p-spectrum string kernel can now be defined in Definition 10.

Definition 10 (The p-spectrum string kernel).
$$ k_{ID}(x, x') = \beta \sum_{u \in \Sigma^p} N_u(x) N_u(x'). \qquad (3.19) $$


Here, $\beta$ is the magnitude of the variance. It is a hyperparameter to be determined from the data. Further, the length $p$ of the subsequences will also affect the covariance, and its optimal value is to be studied. Note that for $p = 1$, one gets
$$ \Sigma = \Sigma^p = \{ \mathrm{ID}_1, \mathrm{ID}_2, \ldots, \mathrm{ID}_K \}, $$
where $\mathrm{ID}_k$ represents the $k$:th link ID. In this case, the kernel in (3.19) simply counts the number of links that two trips have in common. For $p \ge 2$, it counts the number of subsequences of length $p$ that they have in common.


Chapter 4

Research Methodology

This chapter focuses on the differences from the methods presented in the previous chapter, as well as the steps taken prior to the querying in terms of data processing. First, some comments are made on the datasets that are considered and what they contain. Second, the steps of the data processing are presented. Following that are the modifications that all three methods have in common, compared to the methods from the literature. Finally, the implementations of all three considered methods are presented in detail.

4.1 Using data from reality

In this study, a dataset of trips was provided by Budbee. As Budbee is a delivery firm, a trip starts at one end customer (or terminal) and ends at another end customer (or terminal), where each end customer stop corresponds to a delivery. Several trips together define a route taken by a car.

Budbee is a software-based company, continuously collecting and storing large amounts of data. All deliveries are logged in a database, with all the relevant features associated with each delivery. The dataset considered in this study consists of 200 000 trips. See Table 4.1 for examples of trips from the considered dataset.

In Table 4.2, some features calculated during the data processing step, for the same example trips, are shown.

Table 4.1 does not contain the trajectories of the trips. There is a separate dataset containing GPS signal samples for all trips. However, mapping sparse GPS sample data to trajectories in a road network is in itself a complex problem, where one has to formulate a model to analyze the noisy GPS data. This would be a separate project in itself. Hence, only the dataset presented above is used in this thesis. When a path-dependent method is used, the trajectories are instead reconstructed using a shortest-path approach.


Table 4.1: Example trips from the dataset (transposed for readability)

field (column name)                             trip 17345              trip 39613
Postal code of origin (orig_postal_code)        181 31                  185 94
Position of origin (orig_lat_lon)               (59.36897, 18.13275)    (59.43145, 18.34756)
Postal code of destination (dest_postal_code)   115 25                  185 94
Position of destination (dest_lat_lon)          (59.32704, 18.14669)    (59.43241, 18.34425)
Starting time of trip (timestamp)               2016-05-22 18:26:22     2016-08-22 16:23:17
Road network estimate (road_network_estimate)   645                     63
Actual travel time (travel_time)                1124                    54.2

As mentioned in the introduction, Budbee already has data to predict travel times: map data defining a road network. The network consists of links, where each link has a speed profile, i.e. the time it takes to travel along that link at different times of the day. In this problem setting, a link corresponds to a road segment (as defined in Definition 1).

The road network estimate is the estimated time of a trip, obtained by adding the travel times for each link along the path of the trip. It is what is used today as input to the TSP optimization process. These estimates will be used as a reference in the following methods, which is why they are also included in Table 4.1.

4.1.1 Data Processing

Since the data came straight from the database, some processing had to be done before the regression. It is to be expected that some trips in the dataset have unreasonable values as a consequence of external and human factors. For example, if something goes wrong with a delivery, the trip is sometimes manually added to the database. Some filtering for outliers is therefore necessary.

First, some outliers were detected by simply considering the "average Euclidean velocity". That is, the Euclidean distance between the origin and destination of a trip was calculated using the Haversine formula for the great-circle distance between two points on the surface of a sphere [14]. Next, it was divided by the travel time of the trip, yielding a measure of average velocity (although not the true average velocity, which will be higher considering that the actual trip is not a straight line). This way, some unreasonable outliers could be removed from the dataset by assigning a cutoff velocity. Table 4.2 shows two of these added features.
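For illustration, a sketch of this filtering step, assuming trips stored as records with coordinate and travel time fields; the field names and the cutoff value are hypothetical, since the thesis does not specify them.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0   # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def velocity_filter(trips, v_max):
    """Keep trips whose 'average Euclidean velocity' (straight-line distance
    divided by observed travel time) does not exceed the cutoff v_max (m/s).
    Field names of the trip records are assumptions for this sketch."""
    kept = []
    for t in trips:
        d = haversine_m(t["orig_lat"], t["orig_lon"], t["dest_lat"], t["dest_lon"])
        if d / t["travel_time"] <= v_max:
            kept.append(t)
    return kept
```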

The dataset contains trips from all over Sweden, although the data becomes very sparse in the countryside. None of the methods considered here perform well on sparse datasets, which is why the trip domain is restricted to the Stockholm area – more specifically, to postal codes within the range [100 00, 175 00]. Similarly, trips with abnormal longitude and latitude coordinates, considering the Stockholm area as the domain for the study, were removed.


Table 4.2: Features computed during data processing for the example trips

Trip ID     Euclidean length of trip    Average velocity      Estimated trajectory
(trip_ID)   (euclidean_length)          (average_velocity)    (trajectory)
17345       l1                          v1                    [573716869, 573716871, 573716868, 573716865, 27994395R, 809424395, 809424396, 27994336, ...]
39613       l2                          v2                    [28805168R, 28805169, 28805157R, 28805158]

Further, particularly short trips were also neglected, as it is hard to draw a line between the travel time and the service time (that is, the actual delivery) for those trips. The cutoff distance used here was 50 meters.

In Table 4.2, the trajectories of the trips are also displayed. These are in fact reconstructed trajectories, where a shortest-path algorithm has been used on the road network. The trajectories are hence not necessarily the ones taken in the corresponding trip, but the optimal paths according to a shortest-path algorithm on the road network graph. This is of course a simplification and something to keep in mind when the method is evaluated.

4.1.2 Data Partitioning

In order to be able to verify the results, the historical dataset is partitioned into a training set $D$ and a test set of queries $Q$. Let $N + M$ be the size of the original dataset, and $N$ and $M$ the sizes of $D$ and $Q$, respectively. The partitioning is preferably random. In other words, $M$ trips are sampled from the original set to constitute the test set, while the remaining $N$ trips constitute the training set.

For a fair evaluation of the methods, the test set should be sufficiently large. On the other hand, the dataset is of limited size and the training set should be at least one order of magnitude larger than the test set. In the spirit of bootstrapping, the sets are therefore partitioned repeatedly, and the inference on the test set is performed once for each partition.

Let $P$ denote the number of partitions that are made. Hence, the test phase is performed $P$ times on $M$ trips each time, yielding a total of $P \times M$ query trips. See Table 4.3 for the values of $M$, $N$, $P$ used in this implementation.


Table 4.3: Parameters for the implementation

description                                notation    value
original dataset size (after filtering)    M + N       98993
query set size                             M           500
number of partitions                       P           10
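A small sketch of this repeated partitioning scheme with the values from Table 4.3; the function name and seeding are illustrative assumptions.

```python
import numpy as np

def repeated_partitions(n_total, m_test, n_partitions, seed=0):
    """Draw P independent train/test partitions: each partition holds out
    M random trips as queries and trains on the remaining N trips.

    Yields (train_idx, test_idx) index arrays over the filtered dataset.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_partitions):
        perm = rng.permutation(n_total)
        yield perm[m_test:], perm[:m_test]

# e.g. the values of Table 4.3: N + M = 98993, M = 500, P = 10
for train_idx, test_idx in repeated_partitions(98993, 500, 10):
    pass  # fit on train_idx, predict the 500 held-out query trips
```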

4.2 Implementation of Regression

For the implementation of the inference step, i.e. the regressions, there is one difference from the problem settings of the reviewed methods that requires extra thought. There already exists decent information on the travel times of trips, estimated from the road network data (referred to as the road network estimate). Remember from the introduction that we want to combine this road network data with Budbee's historical trip data.

Hence, the approach to combining these two sources of information is to search for correction factors $\alpha$, rather than absolute travel times, in the regression. With the travel time of a query trip $x$ denoted $y$ and its road network estimate $\hat{y}$, we can write
$$ y = \alpha \hat{y}. \qquad (4.1) $$
We can in this setting consider $\hat{y}$ as given, although it is in fact the result of a shortest-path search in the road network, followed by a summation of the corresponding link-level travel times along the path.

Note that in (4.1), a correction factor is introduced as $\alpha$. Now, the inference can focus on estimating $\alpha$, since a good estimate of $\alpha$ is equivalent to a good estimate of the travel time $y$. If the road network estimates $\hat{y}$ were always consistent with reality, we would have $\alpha = 1$ for all trips. Note that much in the formulation of estimating $y$ and $\alpha$ is the same: we have only switched to a relative quantity.

To be clear, the problem definitions 1 and 2 are here reformulated as Problems 3 and 4, focusing on predicting $\alpha$.

Problem 3 (Deterministic Travel Time Correction Factor Prediction Problem). Given a dataset $D$ of historical trips and an estimate $\hat{y}_j$, infer a correction factor $\alpha_j$ of a query trip $x_j$, such that $y_j = \alpha_j \hat{y}_j$.

Problem 4 (Probabilistic Travel Time Correction Factor Prediction Problem). Given a dataset $D$ of historical trips and an estimate $\hat{y}_j$, infer a predictive distribution of the correction factor $\alpha_j$ of a query trip $x_j$, such that $y_j = \alpha_j \hat{y}_j$.
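As a concrete example, the first trip in Table 4.1 has a road network estimate $\hat{y} = 645$ and an observed travel time $y = 1124$, so its historical correction factor is
$$ \alpha = \frac{y}{\hat{y}} = \frac{1124}{645} \approx 1.74, $$
i.e. the road network estimate was too optimistic for that trip by roughly 74 percent.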


Following below are the considered approaches to inferring $\alpha$ from the historical dataset $D$. The time dependency of the travel times is assumed to be captured by the road network estimate $\hat{y}$, and time dependencies are hence left out of the $\alpha$ estimation.

4.2.1 A Baseline Regression

The baseline method was implemented based on the procedure presented in section 3.3.2. The implementation is quite straightforward (see Algorithm 3.1), and two things remain: to modify it to estimate the correction factor $\alpha$ instead of $y$, and to choose a value for the neighbor radius threshold $\tau$.

For the former, we need only replace $y_i$ in (3.18) with $\alpha_i$,
$$ \alpha_j = \frac{1}{|\tilde{N}(x_j)|} \sum_{x_i \in \tilde{N}(x_j)} \alpha_i. \qquad (4.2) $$

On the one hand, a low value of $\tau$ will theoretically yield more accurate results, as all trips in the neighbor set $\tilde{N}$ are then very similar in terms of origins and destinations. On the other hand, a low value of $\tau$ will yield empty neighbor sets more often, as there may be no trips in $D$ satisfying (3.17). Thus, one should consider the consequences of empty neighbor sets. If there were no road network data to fall back on, empty neighbor sets would be a problem, as estimates would be lacking. However, we can rely on the road network estimates in these cases. See section 5.1 for an analysis of the choice of $\tau$.
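A minimal sketch of this modified baseline, with the fallback to the plain road network estimate ($\alpha = 1$) on empty neighbor sets; array layout as in the earlier baseline sketch, and again an illustration rather than the thesis code.

```python
import numpy as np

def baseline_alpha(origins, dests, alphas, query_o, query_d, tau):
    """Sketch of the modified baseline (4.2): average correction factors
    alpha_i = y_i / y_hat_i over the neighbor set, falling back to
    alpha = 1 (i.e. trusting the road network estimate) when no
    neighbors exist within radius tau.
    """
    mask = (np.linalg.norm(origins - query_o, axis=1) <= tau) & \
           (np.linalg.norm(dests - query_d, axis=1) <= tau)
    if not mask.any():
        return 1.0              # fall back to the plain road network estimate
    return alphas[mask].mean()  # eq. (4.2)

# the predicted travel time is then y_j = alpha_j * y_hat_j, as in (4.1)
```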

4.2.2 Gaussian Process Regression: SE kernel

This approach is based on the GPR reviewed in section 3.2, with the squared exponential kernel presented in section 3.3.1. To the best of my knowledge, there are no studies in the literature of travel time estimation based on GPR using the SE kernel. It is an interesting method to include in the comparison of this study, as it can be considered a mixture of the baseline approach presented above in section 4.2.1 and the GPR string kernel approach presented in section 4.2.3: it is similar to the baseline method in terms of the inputs that the similarity measure uses, while the regression is GPR.

It remains to specify the representation of the input vector $\bar{x}$ representing the query trip. This approach classifies as an OD method, so the origin and destination are natural to include as features. Further, the length of a trip is without doubt an important factor for its required travel time, so it is also included and denoted by $l$. This corresponds to five features, considering that the origin $o$ and destination $d$ consist of two coordinates each (longitude, latitude). In other words,
$$ \bar{x} = [\, o_{lon},\, o_{lat},\, d_{lon},\, d_{lat},\, l \,]. \qquad (4.3) $$
According to the idea behind kernels, the similarity between two trips is determined by the features in (4.3). The hyperparameter for each feature's relevance to the output can be determined through ARD, as discussed in section 3.3.1.

There are seven hyperparameters corresponding to this kernel: one length scale for each input dimension, the covariance amplitude $\sigma_f^2$ (see Definition 7) and the noise variance $\sigma_n^2$ in (3.6). They can be written as
$$ \theta = [\, l_1, l_2, l_3, l_4, l_5, \sigma_f, \sigma_n \,]. \qquad (4.4) $$

The hyperparameters are determined through the maximum likelihood process presented in section 3.2.3. The conjugate gradient method was used in the maximization of the log marginal likelihood. Initial values of the hyperparameters are presented in B.1.
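As an illustration of this fitting step, here is a sketch that minimizes the negative log marginal likelihood with a conjugate-gradient optimizer via SciPy, reusing the `se_ard_kernel` and `log_marginal_likelihood` sketches from Chapter 3. The log-parameterization is an assumption to keep the hyperparameters positive, and the gradient is here approximated numerically rather than via (3.13).

```python
import numpy as np
from scipy.optimize import minimize

def fit_hyperparameters(X, y, theta0):
    """Sketch of hyperparameter fitting: maximize the log marginal likelihood
    (3.12) with a conjugate-gradient optimizer. theta0 holds positive initial
    values of (l_1..l_5, sigma_f, sigma_n) as in (4.4); internally the search
    runs over log-scaled values to enforce positivity.
    """
    def neg_lml(log_theta):
        ls = np.exp(log_theta[:5])           # length scales l_1..l_5
        sigma_f = np.exp(log_theta[5])       # covariance amplitude
        sigma_n = np.exp(log_theta[6])       # noise standard deviation
        C = se_ard_kernel(X, X, sigma_f, ls) + sigma_n ** 2 * np.eye(len(X))
        return -log_marginal_likelihood(y - y.mean(), C)

    # method="CG": conjugate gradient; gradient approximated numerically here
    res = minimize(neg_lml, np.log(np.asarray(theta0)), method="CG")
    return np.exp(res.x)
```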

The complexity of GPR is dominated by the inversion of the covariance matrix, which is $O(N^3)$. Hence, considering the entire training set $D$ would be costly, even for relatively small datasets. Further, two arbitrary trips from the training set will probably have very little in common (yielding a negligible kernel value).

An extra step is introduced in order to select the most relevant trips from the training set for each query trip. In other words, the training set is decreased to only include the nearest trips. This procedure is very straightforward and is summarized in Algorithm 4.1. The radius $\tau$ here defines two circles around the origin and destination points, and the neighbors are defined as all training trips whose origins and destinations fall within the respective circles. This is the same idea as in the definition of neighbors in the baseline study.

Algorithm 4.1: Restricting the training set size.
Input: Training set D, query trip x_j, max. dataset size Ñ, initial radius τ_0.
Result: Training set D̃ of decreased size.
  D̃ ← D
  τ ← τ_0
  while size(D̃) > Ñ do
    D̃ ← get_neighbors(x_j, τ)
    τ ← 0.9 · τ
  end while
  return D̃


Note that the procedure in Algorithm 4.1 introduces the parameters $\tilde{N}$ and $\tau_0$. See Table 4.4 for the values used in this implementation.

Table 4.4: Parameters for the implementation (GPR)

description                              notation    value
maximal training dataset size for GPR    Ñ           500
initial radius                           τ_0         2000

4.2.3 Gaussian Process Regression: STR kernel

Finally, the GPR method using the string kernels proposed by Idé and Kato, presented in section 3.4.1, was implemented. This is a path-dependent approach. The paths were reconstructed using the same shortest-path algorithm as the road network estimates $\hat{y}$ are based on. See Table 4.2 for examples of the path representation.

For the GPR using string kernels, the same decreased training set $\tilde{D}$ for each query trip was used as for the GPR using the SE kernel, obtained through Algorithm 4.1.

The optimization of the marginal likelihood over hyperparameters was performed using the conjugate gradient method. The initial values of the hyperparameters are given in B.1.

The subsequence length $p$ is not included in this optimization process, as the likelihood lacks derivatives with respect to $p$. Instead, the GPR STR was implemented for $p = 1, 2, 3$, with the resulting performance presented in section 5.1.
