
Predicting taxi passenger demand using artificial neural networks

GUSTAV ZANDER

Master's Thesis at NADA
Supervisor: Jeanette Hellgren Kotaleski


Contents

1 Introduction
  1.1 Purpose and goals
  1.2 Delimitations

2 Background
  2.1 Machine learning
  2.2 Supervised learning
  2.3 Artificial neural networks
  2.4 Implementation of artificial neural networks
    2.4.1 Architecture
    2.4.2 Activation functions
    2.4.3 Learning
    2.4.4 Input parameters
    2.4.5 Error and validation
    2.4.6 Cross-validation
  2.5 Related work

3 Method and experiment
  3.1 Data
    3.1.1 Implementation
  3.2 Optimizing the network
    3.2.1 Metrics for network optimization
    3.2.2 Loss function
    3.2.3 Architecture
  3.3 Input parameter experimentation
    3.3.1 Validation technique
    3.3.2 Baseline
    3.3.3 Metrics for feature testing and final network validation
    3.3.4 Final testing

4 Results
  4.1 Architecture evaluation
    4.1.1 Hidden neurons
    4.1.2 Hidden layers
    4.1.3 Optimization algorithm
    4.1.4 Activation function
    4.1.5 Overfitting
    4.1.6 Final network implementation parameters
  4.2 Feature set evaluation
    4.2.1 Prediction result plots from chronological validation
  4.3 Performance of final network model
    4.3.1 Comparison of results using different thresholds
  4.4 Different encoding experiment

5 Discussion and Conclusions
  5.1 Loss function convergence
  5.2 Input variable encoding
  5.3 Feature sets
  5.4 Network performance
  5.5 Evaluation metrics
  5.6 Best results
  5.7 Conclusions
  5.8 Future work
    5.8.1 Classification model
    5.8.2 Further parameter evaluation

Glossary

• ANN, short for artificial neural network

• Epoch, a full round of training over the whole dataset

• Loss function, the function describing how much loss the network predictions have, that is, how far off from a perfect prediction the network is

Chapter 1

Introduction

Taxi services are a central transportation method in almost all urban areas. Like so many other fields today, the taxi business is undergoing a rapid digital transformation, with new actors like Uber taking market shares with innovative digital products.

The most important objective for any taxi company and driver is to minimize the vacant driving time, and a very important aspect of this is knowing where to find the passengers. Naturally, we can never know exactly where the passengers will be, but experienced drivers can make guesses and predictions based on their knowledge.

Predicting where the passengers will be is a problem ideally suited for a machine learning approach. If ride quantities can be properly estimated, a taxi company gains a larger picture of how to position its vehicles, which is potentially a stronger tool than the predictive capability of an individual driver: since drivers operate independently of each other, all of them might drive to the same area when there are in fact several areas with large demand. There can also be patterns in passenger demand that are too irregular for a human to interpret but that can be found by machine learning, which is a characteristic of problems suitable for artificial neural networks according to Mitchell [12].

In this report a machine learning model to estimate taxi demand is constructed and trained with historical taxi ridership data for the city of Stockholm. The city is divided into geographical zones and the data is converted into input for the neural network model by creating input samples based on two main variables, the zone and the hour of the day. Together with these two values, other parameters that could affect taxi ridership levels are explored and tested. The purpose of the machine learning model is to find a relation between this input information and the number of rides that occur in that zone in that hour. When such a relation is found, the network can be used to predict how many taxi rides will occur in the future.

A number of candidate input parameters are explored in order to find which of them affect the taxi demand in Stockholm.

A few attempts at similar estimations have been performed, but none using the same input data and machine learning model. Grinberg et al. [4] are somewhat successful at predicting taxi demand in New York City using three other machine learning models, a task which might yield better results as the taxi ridership levels in New York are much higher and steadier than those in Stockholm. Mukai and Yoden [14] attempt to predict taxi ridership levels for the city of Tokyo with a similar approach using artificial neural networks, but use the previous hour's ridership as the most significant input parameter, an approach which is not applied in this report as the objective is to estimate the more distant future.

The main difficulty in this work was to find the external parameters that could affect the taxi demand. Since the zones, with their mapping to an observed number of rides, already contain a lot of the information that affects taxi ridership, such as the prevalence of restaurants, clubs, or residents that ride a lot of taxi, the extra input parameters should be those that are not directly tied to an area, such as weather or major events. Weather data could be gathered for Stockholm; however, it was impossible to find good data covering the major events of the city and their locations.

The results are somewhat satisfying: it is clear that adding relevant parameters to the machine learning model makes the predictions considerably more accurate. However, the best model found is only capable of estimating about 43 % of the rides within a 70 % margin, which means that the model is not robust enough to use for car positioning on its own. With the addition of certain heuristics, or with human supervision of the resulting predictions, the model could possibly be used as a good basic indication.

1.1 Purpose and goals

The problem faced is to predict quantities of taxi customers in Stockholm for the principal company. In order to do this, geographical areas need to be defined for which the predicted quantities are made. A time slot is also necessary in order to know over which interval to make the predictions. A machine learning model is trained on historical taxi ridership data, which is then used to make predictions based on the previous ridership levels.

The goal of this report is to identify what information affects taxi ridership, information that can then be used as input for an artificial neural network to estimate the number of taxi rides. The goal is to be able to predict taxi ridership with some level of confidence, which will then be useful in car positioning.

1.2 Delimitations


Chapter 2

Background

2.1 Machine learning

The idea behind machine learning is to teach a computer to teach itself instead of explicitly programming it [15]. The main problem area addressed with machine learning is often based on decision problems with many different inputs and special cases. Consider, for example, the pricing of a house: in a conventional programming environment you would have to add special checks for whether the house has a garden or a garage, and the number of required implementation details can grow rapidly, so it becomes impractical to program a solution that considers all different inputs and special cases.

Machine learning, sometimes called statistical learning, is the principle of letting a computer learn and create a model from statistical data, which can then be used to predict values based on different inputs. Machine learning can be split into two main subfields. In supervised learning, the data used contains both input and output, the 'correct answer' corresponding to that input, with which a model is trained to estimate new outputs based on new input. The other subfield is unsupervised learning, where a large dataset is available and the goal is to create some kind of boundaries, which are initially unknown, that can categorize the inputs in some way in order to understand the dataset [7]. The focus of this paper is on supervised learning: predicting taxi demand using historical taxi ride data.

2.2 Supervised learning

Figure 2.1. Depiction of a human neuron. The dendrites to the left are the incoming receivers from other neurons. The axon is the computational unit that decides what the output strength should be based on the inputs. The axon terminals are the output terminals to other neurons in the network. Interconnected with other neurons they form neural networks [15].

Regression problems target continuous values as opposed to the discrete ones in classification. For example, what is the price of house y given the parameters x? Many common methods in supervised learning consist of curve fitting: either computing a linear or non-linear boundary separating the classes in classification, or fitting a function to previous data that is then used to predict new values.

2.3 Artificial neural networks

Artificial neural networks, ANNs, stem from the field of biology, where they originally were attempts to model the brain's way of learning. A neuron in the brain is a computational unit that takes a number of inputs, namely the outputs from other neurons, and given those inputs calculates and forwards a new signal with some signal strength; a depiction of this can be seen in figure 2.1.

In the brain the scale of this neural network is very large. The number of neurons can be approximated to 10^11, where each neuron is connected to roughly 10^4 other neurons [12], and they operate in complete parallelism. The idea behind ANNs is to mimic the brain's neural network [10], but due to the computational complexity it is impossible with the tools available today to create a neural network as complex as our brains. Also, an artificial neural network operates on a sequential processor, so we can only simulate the parallelism of artificial neural networks [22]. However, even with all these limitations compared to real neural networks, ANNs are still capable of capturing complex relations in data and giving accurate forecasts for many problems [22].


Figure 2.2. Depiction of a feed-forward artificial neural network. This simple example has three input nodes and outputs two values. Similarities can be seen to real neural networks: the incoming connections to a node here can be compared to the dendrites in the real neuron, the output connections can be compared to the axon terminals, and every node is assigned an activation function which computes the output of that node based on the input.

An artificial neural network consists of at least two layers, the input layer and the output layer. Most problems, however, require more complex network architectures for a correct model to be found [15]. The complexity is increased by adding more layers, commonly called hidden layers, between the input and output layers. In figure 2.2 we see an ANN with an input layer with three input nodes, which means it takes three different input variables. It then has a hidden layer that computes and forwards the data to two output units in the output layer.

The nodes can have any interconnectivity between each other, but the most common form of ANN is a feed-forward network where all nodes are fully connected to the next layer in the network. In this project only feed-forward networks will be considered.

Every node in the neural network has an activation function that is used to compute a value based on the incoming signals to forward to the next layer of neurons. The activation function combined with finding the optimal architecture for the net is crucial to produce good results. Deciding all these parameters for the taxi passenger demand problem is one of the main objectives of this paper.

Mitchell [12] lists a number of characteristics of problems suited for artificial neural networks, all of which fit the taxi demand estimation problem well.

• Instances are represented by many attribute-value pairs. The data can be used to construct many interesting input parameters, such as time of day, day of week, day of month, day of year, weather and more, and the training data used in this paper is extensive.

• The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes. The objective of this paper is to output a real-valued number of taxi passengers that will be present in some area given a certain time slot.

• Long training times are acceptable. Training the model may take a long time with many input features and a large set of training data for the taxi passenger estimation.

• Fast evaluation of the learned target function may be required. The evaluation of a certain time slot for an area needs to be fast, as the updates will be needed in real-time when a driver needs to find a good spot.

• The ability of humans to understand the learned target function is not important. Understanding what parameters are important to find customers is of course very relevant for a taxi driver. However, one of the objectives here is to try to find irregular correlations in the data that can be hard for humans to predict.

2.4 Implementation of artificial neural networks

An artificial neural network is composed of a number of artificial neurons, called nodes. The nodes are the computational units in a network; they forward a signal based on their activation function and the input received from the connected input nodes. The nodes in a neural network are interconnected, and in this paper only fully connected feed-forward networks are considered, meaning networks where all the nodes are fully connected to the nodes in the succeeding layer, such as the one depicted in figure 2.2.

The data is typically split into a training set and a test set. The network is then trained using the training data and evaluated with the unseen test data to see how well it performs.

The network training process typically proceeds as follows:

1. Assign random numbers to the weights

2. For every item in the training set, calculate the output variables

3. Compare the calculated outputs with the observed values

4. Adjust the weights of the network using a learning algorithm

5. Repeat steps 2-4 while the distance between the output and the correct answer continues to decrease

6. Evaluate the network performance using the test data
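To make these steps concrete, the following is a minimal sketch of the loop using a single linear model and plain gradient descent. The data, learning rate, and stopping rule are hypothetical stand-ins for illustration, not the implementation used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training and test data: 3 input features, 1 output value.
X_train, y_train = rng.normal(size=(100, 3)), rng.normal(size=100)
X_test, y_test = rng.normal(size=(20, 3)), rng.normal(size=20)

w = rng.normal(size=3)                      # 1. random initial weights
prev_loss = np.inf
for epoch in range(1000):
    y_pred = X_train @ w                    # 2. calculate the outputs
    error = y_pred - y_train                # 3. compare with observed values
    w -= 0.01 * X_train.T @ error / len(y_train)   # 4. adjust the weights
    loss = np.mean(error ** 2)
    if loss >= prev_loss:                   # 5. repeat while the loss decreases
        break
    prev_loss = loss

test_loss = np.mean((X_test @ w - y_test) ** 2)    # 6. evaluate on test data
print(epoch, test_loss)
```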

A successful implementation of a neural network poses a variety of challenges that all have to be considered. The net can not have too high bias (underfitting the data) or too high variance (overfitting the data). The network architecture also has to be complex enough to capture all the features of the data, but not so complex that it creates a computational problem or even starts to capture relations that don't really exist in the data, worsening its generalization ability [22]. The simple training procedure mentioned above is very susceptible to overfitting, and more advanced methods than just splitting the data into training and test sets exist; these will be discussed further.

2.4.1 Architecture

The architecture of the net is crucial in order to properly model a specific problem and to learn the relations in the data [2]. The problem needs to be formulated in a way that fits the neural network approach, and apart from choosing the input parameters and desired output values, the hidden architecture of the network has to be decided upon. This architecture is the part of the network that gives ANNs their black-box characteristic: even if the neural network can properly model a problem, how it does so isn't necessarily comprehensible by humans [12].

A lot of research has been done on the topic of neural network architecture, but no proven theory exists that works well for all problem sets. Hornik et al. (1989) claim that any complex non-linear function can be approximated by a network with one hidden layer and the proper number of hidden units in that layer [6]. However, increasing the number of hidden units too much in a network makes it a lot more susceptible to overfitting and affects training time negatively.


Tamura and Tateishi (1997) [21] prove that a network with a single hidden layer with n - 1 neurons can give any n input-target relations exactly. They continue to empirically show that a network with two hidden layers with n/2 + 3 neurons can capture the same input-target relations with insignificant error, indicating that increasing the number of layers is often a better choice for complex problems than following the infinite-neuron idea put forward by Hornik (1989) [6].

Zhang et al. (1998) claim that only one hidden layer is sufficient for most forecasting purposes and suggest that the 2n + 1 rule is a good starting approximation for any ANN with one hidden layer, where n is the number of input units. Controversially, they also find that the rule 'input units = hidden units' seems to work very well for some problems [22].

Ng (2012) [15] claims that for most problems it is sufficient to use the same number of hidden neurons in every layer for deep layered networks.

In most cases the network architecture for a certain problem has to be derived from iterative trials; however, the general rules previously described can be a good starting point.

Bias

Bias units are single neurons, one in each layer except the output layer, whose output is always one and is not affected by any input. Since the bias is not affected by any input to the network, it can be considered a generalization unit that assists the net in outputting the more common results [2]. The term bias is derived from exactly that principle, as it assists the neurons in making more general predictions, and the use of bias neurons is found in most models.

2.4.2 Activation functions

In early research on artificial neural networks, Frank Rosenblatt introduced the perceptron, a neuron with a binary activation function that outputs 1 if the weighted sum of its inputs exceeds some threshold and 0 otherwise [16]. However, the binary output of the perceptron makes it hard for a network to make complex decisions, since adjusting the weights can cause a drastic shift in the results for different inputs [16].

To combat this shortcoming the sigmoid neuron was introduced. The sigmoid neuron uses a sigmoid activation function that squashes the input into the interval [0, 1], just like the perceptron's step function but onto a continuous value instead of a binary one. This causes small weight updates to produce less drastic changes in the network predictions and allows networks to learn more complex problems [16].

In general, an activation function works well for a network if the value range of the function's output matches the desired output of the network.

Logistic sigmoid

Figure 2.3. The logistic sigmoid function

The term sigmoid function is usually used for the logistic sigmoid function:

σ(x) = 1 / (1 + e^(-x))

The logistic function squashes the input into a continuous value between 0 and 1, as seen in figure 2.3.

Hyperbolic tangent

Figure 2.4. The tanh sigmoid function

The hyperbolic tangent function, or tanh, seen in figure 2.4, is a sigmoid variant that squashes the input to the range [-1, 1]:

tanh(x) = 2 / (1 + e^(-2x)) - 1


Rectified Linear Unit

Figure 2.5. The rectified linear unit function

The rectified linear unit function, or ReLU, seen in figure 2.5, squashes the input to non-negative values with the function max(0, x). The rectified linear unit is sometimes claimed to work better in deep neural networks and often converges faster for problems with large datasets or complex models.

ReLU6

ReLU6 is a common variant of the rectified linear unit, calculated through the formula min(max(x, 0), 6), which may converge faster for some specific types of deep networks.
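Written out directly from the formulas above, the four activation functions are one-liners; a minimal NumPy sketch:

```python
import numpy as np

def logistic(x):
    # Squashes the input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Symmetric sigmoid; squashes the input into (-1, 1).
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def relu(x):
    # Keeps positive values, outputs zero otherwise.
    return np.maximum(0.0, x)

def relu6(x):
    # ReLU capped at 6.
    return np.minimum(np.maximum(x, 0.0), 6.0)
```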

Choice of activation function

Symmetric sigmoids such as the hyperbolic tangent are often found to converge faster than the standard logistic function. LeCun et al. (2012) also claim that symmetric sigmoids work better for problems that have a large portion of their output targets close to zero, as symmetric sigmoids tend to be better at outputting a zero value [11]. LeCun et al. (2012) recommend the hyperbolic tangent as the best universal function [11]. However, just as with the architecture, the most common way of deciding on an activation function is by trial and error.

2.4.3 Learning


Loss functions

The value given by the loss function only has to be a relative one that decreases when the predictions get better. Which loss function to use depends heavily on the problem. For prediction, the most straightforward approach is to use a sum or mean of the differences between the predicted outputs and the real outputs, which gives a number representing how far the predictions are from being correct.

Optimization algorithms

The optimization algorithm is used in the network to minimize the loss function during training. The most widely used optimization algorithm is gradient descent [16], which tries to find a direction to step across the loss function curve in order to minimize it the most. How fast gradient descent moves across the loss function plane is defined by a learning-rate parameter, which is effectively the length of the step. Using a high learning rate might allow the algorithm to converge quickly, but might also cause the descent to jump from side to side over the minimum. On the other hand, a very low learning rate could leave the algorithm stranded in a local minimum, unable to 'climb' out of it.

A problem with gradient descent is that depending on where you start the evaluation it might step down into a local minimum and never find the global minimum or best possible solution [11], in fact no algorithm exists that is guaranteed to find the global minimum in a reasonable time [22].

RMSProp is an unpublished algorithm proposed by Hinton (2012) [5] that attempts to solve some of the issues with gradient descent. Since the magnitudes of the different gradients of the net can be very different and change during learning, it can be hard to find a single global learning rate. RMSProp attempts to solve the problem of choosing a learning rate by adapting the step size depending on previous gradients. Hinton suggests that for problems with large redundant datasets the appropriate approach is to attempt either gradient descent or RMSProp, where RMSProp is claimed to outperform gradient descent in many cases [5].

ADAM is an algorithm proposed by Kingma and Ba [8]. It is an attempt to combine the strengths of another common algorithm, AdaGrad, which works well with sparse gradients, with those of RMSProp. Just like RMSProp, ADAM performs automatic step-size adjustment. The authors claim to have empirically shown that ADAM outperforms other common optimization algorithms for a number of different datasets and network models [8]. They also claim that ADAM is well suited for machine learning problems with large datasets and/or high-dimensional parameter spaces, as well as deep multi-layered neural networks, partly due to good memory efficiency.
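To make the differences between the three algorithms concrete, here is a sketch of their per-parameter update rules in NumPy. The hyperparameter defaults follow common conventions (for Adam, those suggested by Kingma and Ba); the function names are illustrative, not a library API.

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    # Gradient descent: a fixed-length step against the gradient.
    return w - lr * grad

def rmsprop_update(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    # RMSProp: scale the step by a running average of squared gradients,
    # adapting the effective learning rate per weight.
    cache = decay * cache + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def adam_update(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: momentum (m) combined with RMSProp-style scaling (v),
    # with bias correction of both running averages (t counts from 1).
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```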

2.4.4 Input parameters

For many machine learning problems, choosing the input parameters requires little consideration since the input is 'given'. However, in the case of this report the challenge is to find the proper input parameters that maximize the accuracy of the network. If too few are given, then no good predictions are likely to be made; for example, if the prediction for a zone is only based on the hour of the day, then the differences between weekends and weekdays will not be captured.

Normalization is a commonly used technique where all the input points are normalized to some range, often [0, 1]. The purpose is to allow the training process to determine which parameters are most important. If one of the inputs is on the scale of a million while the others are perhaps always 0 or 1, the network might become biased and change the output a lot given a change in the large input variable, even if it is not justified. Using normalization, the optimization algorithm can update the weights more fairly, and you often get a much faster convergence rate [11].

LeCun et al. give some hints on how to normalize [11]:

• The inputs for the variables should be distributed evenly around zero over the training set.

• The input vectors should be scaled so that their covariance is about the same.

• Optionally, different input parameters could be scaled according to their importance; however, the training process will supposedly adjust this automatically, since the updates should cause irrelevant parameters to have low or no effect on the output.

A common normalization formula that is easily interpreted and treats the variables according to the first two items is

input_norm = (input - min(inputs)) / (max(inputs) - min(inputs))
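As a small illustration, the min-max formula applied to one input column (the temperature values are made-up examples):

```python
import numpy as np

def min_max_normalize(values):
    # Maps a column of inputs to [0, 1] using the formula above.
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

temps = np.array([-12.0, -3.5, 0.0, 8.0, 24.0])
print(min_max_normalize(temps))   # [0.    0.236 0.333 0.556 1.   ]
```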

2.4.5 Error and validation

Monitoring the network's results and prediction ability during training is required to see how well it is performing. Typically, the training process consists of monitoring the value of the loss function, which is the global error over all samples. The loss function is always calculated for the data being used for training, and it is also often calculated for some reference set, which might differ depending on which evaluation technique is used. The core idea is to make sure that the loss in both sets goes down during training [15]. When the loss of the sets stops improving or starts to diverge, the best solution for the setup has been found. If the scores of the loss functions start to diverge, that is an indication that the network is no longer learning the general problem described by the data, but instead learning only the data set used for training. When the training is stopped, a solution is found, and that network model is chosen as the one to use for solving the problem.

If a metric does not describe the performance in a fair way, it can give false positives where one is led to believe that the network is very accurate when it actually performs poorly, and vice versa. When the model is chosen, appropriate metrics have to be used in order to get a good understanding of the performance of the network. For estimation tasks, a mean of errors is a common approach. However, the mean error method can be bad at indicating the real performance, especially if the distribution of target outputs is not even. In figure 3.2 we see that the distribution of output targets in the data in this paper is not even at all: few rides are far more common than many rides over the different hour and zone combinations. A mean error here might be very low if the network always outputs zero, since that gives a zero error for most cases. That low error would, however, be useless, since the network would have no capability of producing good estimations.

Overfitting problem

Overfitting, sometimes called variance, is a common problem in machine learning where the method learns the training data too well and fails to generalize to the real data. Overfitting can be caused either by too high adaptivity in the method, for example too many training epochs or more hidden neurons/layers than are necessary to describe the problem [15], or by a too small set of training data that fails to represent the whole scope of the problem [19]. The data used in this paper is extensive and will likely be sufficient to give stable results for Stockholm. Although overfitting is a problem, it is often exploited when training a neural network: first a model that overfits the data is created, which indicates that the model is able to properly learn the problem instance, and then measures are applied to avoid the overfitting [15]. A common method to reduce overfitting is to remember the current best weights of the model and stop training once the performance on the validation set gets worse, called early stopping [19].

Dropout

Dropout is a technique where nodes are dropped randomly during training, meaning that they won't affect the outcome for that specific training epoch. Srivastava et al. (2014) [19] claim that dropout breaks up co-adaptation between the nodes that is created by the training data and thus gives a more robust adaptation to real-world data. They found that dropping 20 % of the input nodes and 50 % of the hidden nodes every epoch was a good approach for most networks. They also claim that dropout increased the performance of networks in a wide variety of domains and is thus a general technique that would work for any neural network [19].
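As a sketch of these dropout levels in a modern tf.keras model (not the implementation used in this thesis; the layer sizes are arbitrary placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dropout(0.2),                  # 20 % dropout on the input features
    tf.keras.layers.Dense(150, activation="tanh"),
    tf.keras.layers.Dropout(0.5),                  # 50 % dropout on hidden nodes
    tf.keras.layers.Dense(150, activation="tanh"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),                      # single regression output
])
# Dropout is only active during training; at prediction time all nodes are used.
```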


2.4.6 Cross-validation

Cross-validation is a family of techniques used to avoid overfitting and to test the robustness of a method. The concept of cross-validation is about testing and training on different subsets of the available data [12]. The most naive way of training a neural network would be to just split the data in two sets, train on the whole training set until the loss stops improving, and then check the performance on the test set. This method, however, is very prone to overfitting, since there is no way of knowing when the optimization process goes from learning the relations in the data to learning exactly the points in the training set.

One of the simplest and most common variants of cross-validation is to split the data into three sets: a training set, a validation set, and a test set. The network is trained on the training set for as long as the loss on the validation set keeps improving. Then, to make the performance test unbiased, the network is tested on the test set, which is essential to have since the network model will be based on both the training and validation sets, as they have been used to determine the network model.

A problem with the three-set split is that it might give good results purely by chance. If the data is not large enough, or if the test set is 'easy' for the network to predict, then one will be led to believe that the model is better than it really is. A common cross-validation technique to avoid this is k-fold cross-validation. The principle is to divide the data set into k subsets and, for k iterations, train the network using every subset except subset k and then validate the results with subset k. The average of the results from all k iterations is then taken as the real network performance [12]. This method, with a proper size of k, will eliminate the problem with 'easy' sets and give a fairer estimate of how well the network will actually perform on unseen data.

Kohavi et al. (1995) [9] evaluate different cross-validation methods and claim that 10 folds is the optimal number to use in model selection.
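A minimal sketch of k-fold cross-validation; `train_and_score` is a hypothetical callback that trains a fresh model on the first pair of arrays and returns its validation score on the second:

```python
import numpy as np

def k_fold_score(X, y, train_and_score, k=10):
    # Split the indices into k folds; train on k-1 folds, validate on the
    # held-out fold, and average the k validation scores.
    folds = np.array_split(np.random.permutation(len(X)), k)
    scores = []
    for fold in folds:
        mask = np.ones(len(X), dtype=bool)
        mask[fold] = False                   # hold this fold out for validation
        scores.append(train_and_score(X[mask], y[mask], X[fold], y[fold]))
    return np.mean(scores)
```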

2.5 Related work

Mukai and Yoden (2012) [14] attempted a similar way of forecasting taxi passenger demand based on geographical areas and time slices. They grouped their data for the 25 major areas of Tokyo into time slices of four hours and predicted the demand for every city area for the coming time slice based on the demand in the previous four hours. This approach is a bit different from the one in this paper, as no fresh data will be available to predict the next set of demand outputs. The approach in this paper will instead be to use fewer input parameters to predict a large output vector, which presents an additional challenge.

Other encodings of the weather parameters might have given better predictions, such as having the rain amount as a continuous parameter or only including it if the rain reaches a certain threshold.

Moreira-Matias et al. (2012) [13] attempt an alternative prediction method to machine learning for estimating passenger demand for the coming 30-minute period, namely time series forecasting techniques on real-time GPS events transmitted between 441 cars in a taxi network. They compared their real-time estimations with the actual passenger demand at 63 taxi stands in the city of Porto. Each of the tested models achieved an accuracy above 74 %.

Grinberg et al. (2014) [4] did taxi demand forecasting similar to that in this paper, using a grid split over New York City and attempting to predict the number of rides in the grid squares for a given hour. They attempted three different machine learning approaches: linear least-squares regression, support vector regression, and decision tree regression. They explore various feature sets containing zone, hour-of-day, day-of-week, and hourly precipitation. Just like Mukai and Yoden, they do not find rainfall to be a statistically significant parameter.

Chapter 3

Method and experiment

The method chapter is divided into two main parts. First, the evaluation of parameters and implementation details for the network is described. Then the evaluation of input features for the taxi estimation problem is described, to see which input features yield the most accurate results.

3.1 Data

The data available consists of 44 million taxi rides in Sweden with a large number of parameters attached to them. In this report only the Stockholm area is analyzed, due to the belief that the data for the rest of Sweden would be too sparse to give stable results. Two datasets are created: one smaller set for testing implementations, and one for testing the different input sets and the final model. The first consists of rides in the district of Södermalm, where rides for the time period 2010-01-01 to 2016-02-29 are distributed over 10 different zones. The rides for Södermalm in that period number 4,934,694. The other set contains the data for the whole city area of Stockholm, where 15,463,312 rides in the same time period are distributed over 42 zones. The distribution of the number of rides per day in this time frame can be seen in figure 3.1.

Each taxi ride in the data contains only two parameters: an integer representing the zone in which it occurred, according to the principal company's layout maps, and the time stamp for when it happened. This data is converted into data for the network by grouping all rides that occurred in the same hour, day, and zone.

Figure 3.1. The distribution of rides for all days of the selected period 2010-01-01 to 2016-02-29.

By spreading out the rides over the hours and zones in which they occurred, the rides of the Södermalm dataset are converted to 540,240 input/output pairs, and the larger dataset of Stockholm is converted into 2,269,008 input/output pairs. Every point in the converted datasets contains one integer representing the zone and one time stamp, with resolution in whole hours, representing the date and hour when the rides occurred. These two variables are mapped to the number of rides that occurred in that hour and zone. From the time stamp we can extract several parameters to use as input for the network. The hour is the basic one, but the date also gives us day of the week, day of the month, month of the year, etc.
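A sketch of this conversion using pandas; the column names and example rows are hypothetical, but the grouping logic matches the description above:

```python
import pandas as pd

rides = pd.DataFrame({
    "zone": [3, 3, 7, 3],
    "timestamp": pd.to_datetime([
        "2014-06-01 22:15", "2014-06-01 22:40",
        "2014-06-01 22:05", "2014-06-02 01:10",
    ]),
})

# Group all rides that occurred in the same zone and hour and count them,
# giving one (zone, hour) -> number-of-rides pair per row. In the thesis
# data, hour/zone combinations with zero rides also form (input, 0) pairs,
# which would additionally need to be filled in.
samples = (rides
           .assign(hour=rides["timestamp"].dt.floor("h"))
           .groupby(["zone", "hour"])
           .size()
           .reset_index(name="rides"))

# Further input parameters can then be derived from the time stamp,
# e.g. samples["hour"].dt.hour and samples["hour"].dt.dayofweek.
print(samples)
```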

Apart from the taxi ride statistics, historical weather data provided by the Swedish Meteorological and Hydrological Institute, SMHI [20], is combined with the datasets. The weather data consists of hourly temperature and precipitation readings from a station in Stockholm.

The implementation testing is done using the common training/validation/testing split, where only a subset of 24 % of the data available for Södermalm is used. The sets are divided into equal thirds of 8 % of the total data each. This is done due to the heavy computational complexity of the training process: with 8 % of the complete data used for training, decent training times could be achieved. Because of the limited experimentation time, running each test for several hours was not feasible. Using 8 % of the data made the tests run in about one hour while still not compromising too much on the stability of the results. The percentage was chosen by empirical trials, where it was lowered until the cost values started to vary between runs.

Figure 3.2. The distribution of the number of input pairs that share the same target ride amount. We can clearly see that there is a heavy tendency towards a low number of rides for most hour and zone pairs, and very few zones and hours where more than 50 rides occurred.

A more common approach is a 60/20/20 % split. The reason for doing a 60/20/20 % split is that the data is often not very large, and it is more important to have a lot of data to train on than to verify with. In this report the data is quite extensive, and a 33/33/33 % split might have been acceptable, but there isn't really any need for larger verification sets, and it is always useful to train on more data: more can be learned, and the larger the training data, the lower the risk of overfitting it.

The set used for input feature testing is the complete Södermalm dataset, which is split in the more common way of 60/20/20 %. The same split is used with the final parameter choices on the complete Stockholm dataset to obtain the final results of the model.

3.1.1 Implementation


3.2 Optimizing the network

In order to find the optimal neural network to use for the taxi demand estimation problem, different parameters for the network implementation are trained and tested with the subset of the Södermalm data mentioned earlier. This dataset should be sufficient to decide which network parameters work best for estimating taxi demand. During this implementation parameter testing, the data is divided into three chronologically ordered datasets: the training set, the validation set, and the test set. The network architecture and implementation choices are tuned using the most basic parameters, which are believed to be the most important ones, namely zone, hour-of-day, and day-of-week.

3.2.1 Metrics for network optimization

To measure the performance of different network architectures, two different values will be observed. Most important is the loss function convergence, which indicates whether the network is actually getting better at estimating the data. A fast convergence rate, in combination with the ability to reach a low final loss value, are the most essential properties when training a neural network.

In order to get a performance variable that is a little easier to comprehend, an 'accuracy' measure will be included as well. The accuracy of the network is defined as the mean error, the average distance between the predictions y_p and the correct outputs y in the test set:

accuracy = mean(|y - y_p|)

Thus, an accuracy score of 2.5 means that, on average, the network's guesses are 2.5 rides off from the target.

3.2.2 Loss function

The loss function for the optimization is chosen as the mean squared error:

loss = mean((y - y_p)^2)

It is important for the loss function not only to minimize the average difference between y and y_p, but also to take into account that the data contains many points with a small number of rides and few points with a large number of rides. Two other loss functions were attempted: the mean error and the mean error to the power of three. The mean error, loss = mean(|y - y_p|), doesn't penalize output targets of large ride quantities in a reasonable way, so it would likely give unfair accuracy values. Using the mean error function with the data available would likely cause the network to output zero very often, as that is the most common output, and no penalty would be incurred for guessing very badly on the rare large targets. The network would thus be able to lower the loss, and consequently the accuracy measure mean(|y - y_p|), by being very correct for the large number of data points with few rides while being far off for the few points with many rides.

In order to combat the loss function not taking large output values into account, the function loss = mean(|y - y_p|^3) was attempted, which heavily penalizes differences for the data points with large ride numbers, but no clear improvement was found over the chosen loss function; thus the mean squared error was kept as the loss function.
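The effect of the three candidate loss functions on the mostly-zero target distribution can be illustrated with a toy example of a network that always guesses zero:

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 50.0])   # mostly-zero targets, one busy zone
y_pred = np.zeros_like(y)                          # always guessing zero

print(np.mean(np.abs(y - y_pred)))       # mean error: 5.0, deceptively good
print(np.mean((y - y_pred) ** 2))        # mean squared error: 250.0
print(np.mean(np.abs(y - y_pred) ** 3))  # cubed variant: 12500.0
```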

3.2.3 Architecture

The architecture of the network has to be tailored to the problem we are trying to learn. For example, a network with only a single node in the layer before the output will not be able to generalize at all, but will, if trained properly, only output the average answer over all training examples. In the same sense, a too complex network with too many layers and neurons will be very computationally expensive and might be more prone to overfitting, learning special cases rather than the general problem. The challenge here is to find the minimal network architecture that properly learns the problem.

Input and Output layers

The input layer will have as many nodes as we choose our input to be: if 5 inputs are used, then the input layer will have 5 nodes. The size of the output layer depends on the problem being solved. If you are doing classification with 10 different classes, then the output layer would consist of 10 nodes. In these experiments the only target is to output a single value; therefore the output layer will have only one node.

Hidden neurons

To find a good number of hidden neurons to use in the hidden layers, networks with different amounts of hidden neurons are tested with the ADAMOptimizer algorithm and the hyperbolic tangent activation function, as these are claimed to be the best general optimizer and activation function according to Kingma et al. (2014) and LeCun (2012) [8] [11]. Initially, a network with only a single hidden layer is attempted, with 20, 80, 150, and 450 neurons, for which the results can be seen in section 4.1.1.

Hidden layers

Networks with one to four hidden layers, with 150 neurons in each, are then trained and compared; the results can be seen in section 4.1.2.

Optimization algorithm

For an optimization algorithm to be good, it should converge relatively quickly and preferably not get stuck in a bad local optimum. Three optimization functions are tested, with the symmetric sigmoid hyperbolic tangent as the activation function in each layer, as it is claimed to be the strongest general activation function.

The three attempted algorithms are stochastic gradient descent, RMSProp, and Adam. The optimization testing is done using a network of three layers with 150 neurons in each, trained for 500 epochs. The results are presented in section 4.1.3.

Activation functions

Four different activation functions are evaluated. The ones chosen to be tested are the two most common sigmoid functions as well as ReLU and ReLU6, which are claimed to perform better in deep networks and networks with output targets above 0. In order to reduce the number of options, the same activation function is used in every layer except the output layer, which uses the identity function in order to get regression behavior. The results can be seen in section 4.1.4.

Overfitting

In an attempt to combat overfitting, a training sequence that forces overfitting is created, where a three-layered network with 150 neurons in each hidden layer is trained for 2000 epochs using the hyperbolic tangent activation function and ADAMOptimizer with all the available input parameters. Dropout is then applied in two variations: first with the levels suggested by Srivastava et al. (2014) of 20 % dropout in the input layer and 50 % dropout in the hidden layers [19], and then with a milder level of dropout where only 10 % of the input nodes and 20 % of the hidden nodes are dropped. The comparison of the different techniques can be observed in section 4.1.5.

3.3 Input parameter experimentation

The two basic parameters that formulate the base of each input case are:

• zone, the zone where the ride originated, where the zones are represented by integers from 0 to the number of zones minus one

• hour, the hour of the ride, represented by integers in the range [0, 23]

According to the principal company, household spending increases after the Swedish payday, the 25th of every month. Additionally, there is a significant increase in nightlife activity after the Swedish payday, which is commonly tied to taxi ridership in Stockholm. To better reflect this, the parameter day-of-month is modified to be 'days after the 25th', so that low values of this parameter indicate larger household spending and a possible increase in ridership.

The principal company also acknowledged that good weather during the summer months (which can also be observed in figure 3.1) causes the taxi demand to drop significantly, which leads one to believe that the month of the year is relevant. As mentioned, weather is a possible factor as well: heavy precipitation could lead to more spontaneous taxi rides, but a steadier prediction might be gained from observing temperature, which is not as locally occurring as rain or snowfall. Since most data points contain zero rainfall, it is reasonable to assume that combining precipitation with temperature would give a clearer estimate of good or bad weather.

The extra input features chosen were the ones from the survey of the principal company for which reliable data could be gathered for all points. The final input features chosen for testing are as follows; all features are floats normalized to the range [0, 1].

• day-of-week, the day of the week when the ride occurred, represented by integers in the range [0, 6]

• day-of-month, the day of the month when the ride occurred, modified so that the first day is the 25th, which is the most common Swedish payday, represented by integers from 0 to the number of days in the month

• month, the month of the year when the ride occurred, represented by integers in the range [0, 11]

• temperature, the temperature in Stockholm when the ride occurred, represented by a float in the range [0, max(temp) + abs(min(temp))]

• precipitation, the accumulated rainfall in Stockholm in the hour that the ride occurred, represented by a float of the rain amount in millimeters

In the same way, it might seem odd to input the zone as a continuous discrete ID, since the zones' geographical properties might not be similar at all. Another idea might be to feed the zones as many different input variables with the values 0 or 1 depending on which zone it is (a one-hot encoding). However, this realization occurred a bit late in the project and will be considered future work. All the variables will be fed as continuous values for the experiments of this project.

The features are grouped into five sets, which are tested and examined. Set 1 is the minimal set, only containing the two parameters that are required for testing; this set is thus the same data used for the baseline predictions. Set 2 is the minimal set with the day of the week included; this parameter is believed to be the strongest one besides hour and zone for predicting taxi demand, since a variance pattern over the week seems to be the most probable of all the tested parameters.

Set 3 adds the monthly parameters to the second set, which will indicate whether day-of-month and month-of-year are relevant for predicting taxi demand when compared to the results of Set 2. Set 4 is the basic set with the weather data included, which will indicate whether weather is a relevant factor when compared to Set 1. Finally, Set 5 contains all the parameters, which should be the best set if all the attempted parameters are in fact relevant. A description of the feature sets is presented in table 3.1.

Table 3.1. The feature sets for testing

Feature set number        Features
1 - basic                 zone, hour
2 - basic + week          zone, hour, day-of-week
3 - basic + week/month    zone, hour, day-of-week, day-of-month, month-of-year
4 - basic + weather       zone, hour, temperature, precipitation
5 - all                   zone, hour, day-of-week, day-of-month, month-of-year, temperature, precipitation

3.3.1 Validation technique

10-fold cross-validation is applied to the input testing in order to obtain robust results, as Kohavi claims it to be a strong cross-validation approach for model selection [9].

K-fold cross-validation seems to be a solid choice for testing the network parameters. However, the problem in this report has a chronological aspect as well, which makes the three-subset split interesting to use. Since the point of the network is to be able to predict future rides, splitting the dataset into three chronological parts will show how predicting future values actually fares. The dataset should be sufficiently large that it is unlikely that the test set becomes an 'easy' set to predict by chance.

3.3.2 Baseline

The performance of the network using the different feature sets will be measured in a number of ways and compared. However, in order to get an understanding of how well the network models are actually performing, they will also be compared against a baseline value. The baseline is an average over the two basic variables, zone and hour of the day, computed over the whole dataset. The baseline is thus a very basic way of guessing taxi demand based on simple averages. The least one could expect from a correctly implemented and trained network model with additional parameters that affect the demand is that it beats this baseline.
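A sketch of how such a baseline can be computed with pandas, assuming the (zone, hour, rides) samples described in section 3.1 (column names and values hypothetical):

```python
import pandas as pd

samples = pd.DataFrame({
    "zone":        [3, 3, 3, 7],
    "hour_of_day": [22, 22, 23, 22],
    "rides":       [4, 6, 1, 2],
})

# Baseline: the historical average number of rides per (zone, hour-of-day) pair.
baseline = samples.groupby(["zone", "hour_of_day"])["rides"].mean()
print(baseline)   # e.g. zone 3 at hour 22 -> 5.0 rides on average
```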

3.3.3 Metrics for feature testing and final network validation

During the testing of the implementation parameters, the metrics accuracy and loss were observed and used. However, in order to get a better idea of how good the network's predictions actually are, and to better distinguish the performance of the attempted feature sets, two additional metrics will be observed.

In order to give a more comprehensible understanding of how well the network is performing, the 'correct' metric is introduced. A guess is called correct if it is sufficiently close to the real answer. This metric is defined as the number of predictions that lie within a threshold of 30 %, or 1 ride, from the correct answer. The boundary of 30 % and 1 ride was chosen arbitrarily together with the principal company, as a measurement that would give an understanding of how often the model makes a reasonably correct guess. Hopefully it gives the reader a number that represents how well the network is guessing.

A prediction y_p is counted as correct if

0.7 * y < y_p < 1.3 * y, if y > 0, or
y - 1 < y_p < y + 1, if y >= 0
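A direct translation of this rule into code (vectorized over NumPy arrays):

```python
import numpy as np

def correct(y, y_pred):
    # Correct if within 30 % of the target (for non-zero targets),
    # or within one ride of it (which covers the frequent zero targets).
    within_30pct = (y > 0) & (y_pred > 0.7 * y) & (y_pred < 1.3 * y)
    within_one = (y_pred > y - 1) & (y_pred < y + 1)
    return within_30pct | within_one

y = np.array([0.0, 2.0, 10.0])
y_pred = np.array([0.4, 3.5, 8.0])
print(correct(y, y_pred))   # [ True False  True]
```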

The second metric builds on the assumption that the number of rides occurring in an hour and zone is a stochastic variable. The definition of the Poisson distribution gives us the probability of an event occurring k times in an interval, where λ is the average occurrence rate.

P(X = k) = e^(-λ) λ^k / k!

If the network is making proper estimations, then it should have learned the average rate of ride occurrences given the input parameters for an hour and a zone. This means that if we, for each data point in the set used for verification, use the prediction as λ in the Poisson formula and evaluate it with k as the real answer, we get the probability of the network being correct in that estimation. To get a value that represents the likelihood that all guesses are correct, the product of these Poisson values is used. Since this final value will likely be very small, and often rounded to 0, the product is represented by the sum of the natural logarithms of the Poisson values, in order to get it on a better scale.
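A sketch of this metric using SciPy's Poisson distribution; the small epsilon is an added guard, since the rate λ must be strictly positive even when the network predicts zero:

```python
import numpy as np
from scipy.stats import poisson

def poisson_log_score(y, y_pred, eps=1e-9):
    # Use each prediction as the Poisson rate lambda and the real answer as
    # the observed count k, then sum log-probabilities instead of taking the
    # product, which would underflow to zero.
    lam = np.maximum(y_pred, eps)
    return np.sum(poisson.logpmf(np.round(y).astype(int), lam))

y = np.array([0, 3, 12])
y_pred = np.array([0.5, 2.8, 10.2])
print(poisson_log_score(y, y_pred))   # higher (less negative) is better
```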

Including these two metrics should give the reader an idea of how well the network is guessing, as well as making sure that the comparisons between the different feature sets are actually valid.

3.3.4 Final testing

For the final test, the network with the chosen implementation parameters is trained with the best performing feature set on the complete Stockholm dataset, using the chronological 60/20/20 % split described above.

Chapter 4

Results

This chapter presents the results of the two main tasks of this report: finding the optimal network architecture for estimating taxi demand, and finding the optimal input features for estimating taxi demand.

4.1 Architecture evaluation

This section presents the results of the network architecture experiments. Even though the baseline is not trained in any way, loss and accuracy values can be calculated for the baseline predictions using the same data as the testing, giving a reference for how the different setups compare.

Table 4.1. Loss and accuracy values for the baseline using the subset of the Södermalm data

Loss     Accuracy
102.41   4.77

4.1.1 Hidden neurons

This section contains the results of the experiment with different numbers of hidden neurons for the network architecture.

Table 4.2. Comparison between different numbers of hidden neurons for a single-layered network using the hyperbolic tangent activation function. The network is trained using ADAMOptimizer until no improvement in loss is made for five epochs.

Neurons Loss Accuracy

20 146.17 6.13

80 115.78 5.76

150 104.04 5.44


The baseline achieves loss and accuracy measures of 102.41 and 4.77 respectively, as seen in table 4.1. If we compare this to the neuron amount trials in table 4.2, we see that it has a loss value in the same region as the networks with the two larger neuron counts, but a better accuracy value. It is likely that the baseline is more prone to outputting very low values, since most data points have a very low target. This in turn means that, since most of the data points are zero or very low, the baseline method gets a good average difference between the guesses and the real values. Notable here is that none of the attempts with a single-layered network is able to beat the baseline in accuracy, indicating that a single-layered network is not able to generalize the problem properly.

The loss function achieves a lower minimum with the increase in hidden neurons, but the improvement between 150 and 450 seems to be diminishing, indicating that it might be wiser to increase the number of layers instead. A layer size of 150 hidden neurons should be good enough to solve the problem. As the 450 neurons increased the computational complexity to an unrealistic level even for a single-layered network, 150 is chosen as the number of hidden neurons for each layer, in accordance with the rule proposed by Ng (2012) [15] that all hidden layers should have the same number of neurons.

4.1.2 Hidden layers

This section contains the results of the experiment with different numbers of hidden layers for the network architecture.

Table 4.3. Comparison between different numbers of hidden layers with 150 neurons each. The hyperbolic tangent is used as activation function, trained using ADAMOptimizer until no improvement in loss is made for five epochs.

Layers Loss Accuracy

1 114.28 6.74

2 54.46 4.00

3 40.96 3.38

4 40.58 3.28

It is clear from table 4.3 that the neural network can learn the problem a lot better when the number of hidden layers is increased above one. However, the benefit seems to diminish when going from three- to four-layered networks. This leads to the conclusion that a network architecture with three hidden layers is sufficient to describe the taxi estimation problem.

If we compare the values in table 4.3 once again against the baseline values in table 4.1, it is clear that the networks are better than the baseline already with two hidden layers, and three and four hidden layers outperform the baseline even further.

To verify the choice of 150 hidden neurons per layer for a deeper network, different neuron counts are also tested in a network with three hidden layers; the results can be seen in table 4.4.

Table 4.4. Comparison between different numbers of neurons in a net with three hidden layers, hyperbolic tangent activation functions, and ADAMOptimizer.

Neurons Loss Accuracy

150 43.43 3.39

250 43.96 3.54

450 43.95 3.37

In table 4.4 we see that the loss values of all the attempted network architectures converge to very similar levels. The convergence rate per epoch was only marginally faster for the networks with 250 and 450 neurons, and due to the heavily increased computational time, 150 neurons is assumed to be a good enough number of hidden neurons.

4.1.3 Optimization algorithm

This section contains the results of the experiment for choosing the optimization algorithm for the network.

Table 4.5. Comparison between optimization algorithms. The algorithms are used to train a network with three hidden layers with 150 neurons each and the hyperbolic tangent as activation function, for 500 training epochs.

Algorithm Loss

SGD 62.42

RMSProp 39.78

Adam 38.27

Figure 4.1. Loss improvement over the training process for Stochastic Gradient Descent.

Figure 4.2. Loss improvement over the training process for RMSProp

Figure 4.3. Loss improvement over the training process for Adam

From table 4.5 it seems that the RMSProp and Adam optimizers outperform stochastic gradient descent and are able to converge to better values. RMSProp and Adam seem to converge to roughly the same value, but Adam has a much smoother and faster convergence rate, as seen in figures 4.1, 4.2, and 4.3, and seems to be the best choice of optimization algorithm.

4.1.4 Activation function

This section contains the results of the experiment with different activation functions for the network architecture.

Table 4.6. Comparison between activation functions. A network of three layers with 150 neurons each is trained using ADAMOptimizer until no loss improvement has been made for five training epochs.

Function  Loss   Accuracy
ReLU      62.73  4.78
ReLU6     49.00  3.86
Sigmoid   62.09  4.10
Tanh      43.56  3.42

Figure 4.4. Loss improvement over the training process with Rectified Linear Unit activation functions

Figure 4.5. Loss improvement over the training process with ReLU6 activation functions

Figure 4.6. Loss improvement over the training process with Logistic Sigmoid activation functions

Figure 4.7. Loss improvement over the training process with Hyperbolic Tangent activation functions

The hyperbolic tangent reaches the lowest loss and the best accuracy in table 4.6, and is therefore chosen as the activation function for the hidden layers.

4.1.5 Overfitting

This section contains the results of the experiment to combat overfitting when training the network.

Table 4.7. Comparison between training attempts with different levels of dropout using all features after 2000 epochs. The training was done on a network with three hidden layers with 150 neurons each, using the hyperbolic tangent activation function and ADAMOptimizer.

Dropout input layer  Dropout hidden layers  Loss training-set  Loss validation-set
None                 None                   13.49              53.47
10 %                 20 %                   85.49              95.26
20 %                 50 %                   134.76             141.26

Figure 4.8. Training process over 2000 epochs using all input features and with no dropout

Figure 4.9. Training process over 2000 epochs with all features with a dropout of 20 % on the input layer and 50 % on the hidden layers

Figure 4.10. Training process over 2000 epochs with all features with a dropout of 10 % on the input layer and 20 % on the hidden layers

We see in figure 4.8 that the losses for the training and validation sets start to diverge after 200 training epochs: the network keeps improving on the training set while getting worse at predicting unseen data, meaning it is overtrained on the training set and loses its ability to generalize. This divergence is a clear indication of overfitting.

In figure 4.9 we can see the effect of the higher level of dropout. It clearly prevents the network from overfitting and the two curves follow each other more closely. However, with these high levels of dropout the network is unable to converge to a very good loss value.
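For reference, the two dropout configurations in table 4.7 correspond to one dropout layer on the inputs and one after each hidden layer. A Keras-style sketch follows; the thesis's actual implementation is not reproduced here.

```python
import tensorflow as tf

def make_dropout_model(p_input: float, p_hidden: float) -> tf.keras.Model:
    layers = [tf.keras.layers.Dropout(p_input)]  # dropout on the input layer
    for _ in range(3):
        layers.append(tf.keras.layers.Dense(150, activation="tanh"))
        layers.append(tf.keras.layers.Dropout(p_hidden))  # dropout per hidden layer
    layers.append(tf.keras.layers.Dense(1))
    return tf.keras.Sequential(layers)

mild_dropout = make_dropout_model(0.10, 0.20)   # the 10 % / 20 % row in table 4.7
heavy_dropout = make_dropout_model(0.20, 0.50)  # the 20 % / 50 % row
```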

Because the dropout implementations fail to converge to anywhere near the loss values reached by the non-dropout version, as seen in table 4.7, dropout will not be used for the final testing and early stopping will be the only measure taken against overfitting. Early stopping is triggered when the validation-set loss has not improved for five training epochs; if no improvement has been made over five complete training rounds, it is assumed that the loss will not converge much further.

4.1.6 Final network implementation parameters

Based on the results above, the following parameters were chosen for the final network architecture used in the input feature testing.

• Hidden neurons, 150 per layer
• Hidden layers, 3

• Loss function, Mean squared error

• Optimization algorithm, ADAM optimizer

• Activation functions, Hyperbolic tangent for hidden layers, linear for output layer

• Overfitting prevention, early stopping after five training epochs without improvement in the loss of the validation set.
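Put together, the chosen configuration corresponds to roughly the following sketch. The Keras API is used for illustration only, and x_train, y_train, x_val and y_val are placeholders for the encoded data.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(150, activation="tanh"),
    tf.keras.layers.Dense(150, activation="tanh"),
    tf.keras.layers.Dense(150, activation="tanh"),
    tf.keras.layers.Dense(1, activation="linear"),   # linear output layer
])
model.compile(optimizer="adam", loss="mse")          # mean squared error + Adam

# Early stopping: abort when the validation loss has not improved
# for five consecutive training epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=2000,
          callbacks=[early_stop])
```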

4.2 Feature set evaluation

In this section the performance of the constructed ANN with the different feature sets will be presented. The results for the baseline measure will also be included to show how the network performs against a very trivial method.

Table 4.8. Comparison of loss results between the feature sets for one full training run per set.

Feature set  Training loss  Validation loss  Test loss
1            116.84         111.14           114.74
2            40.04          39.87            41.06
3            38.24          38.77            39.16
4            117.39         112.23           115.89
5            38.66          39.44            40.08

The loss values reached by sets 2, 3 and 5 are much lower than the ones reached by sets 1 and 4, clearly indicating that these three sets share a variable that is very significant for the results, most likely day-of-week, since sets 2 and 3 seem to perform similarly.

Table 4.9. Comparison of accuracy between the feature sets with chronological validation.

Feature set  Accuracy test-set  Corrects test-set  Loss test-set  Poisson test-set
Baseline     5.08               0.39               112.57         -443 998
1            5.14               0.38               114.74         -454 431
2            3.36               0.47               41.06          -302 395
3            3.30               0.47               39.16          -302 745
4            5.30               0.37               115.89         -455 351
5            3.26               0.49               40.08          -297 079

Table 4.9 shows the results of the different feature sets using chronological validation. The results of the different metrics mirror the results of the training process in table 4.8. Notable is that the baseline has a higher chance of making accurate predictions than the neural network using feature sets 1 and 4, according to the poisson and accuracy metrics. It is also clear that sets 2, 3 and 5 perform very similarly, which suggests that the differences in their parameters do not matter that much; rather, the parameter they share is a very relevant one.

Table 4.10. Comparison of accuracy between the feature sets with 10-fold validation.

Feature set  Accuracy  Correct predictions  Loss    Poisson
Baseline     5.08      0.39                 113.37  -740 860
1            5.22      0.37                 115.35  -757 229
2            3.36      0.48                 41.09   -505 008
3            3.28      0.49                 38.69   -498 028
4            5.21      0.37                 115.56  -761 440
5            3.25      0.50                 38.81   -495 929

In table 4.10 we see the results when using k-fold validation. Note that the scale of the poisson value differs between tables 4.9 and 4.10, because its value depends on how many data points are measured. Looking at the other metrics, the results are similar to those obtained with chronological validation in table 4.9. This indicates that the data set is likely sufficiently large and homogeneous that extreme cases in the data do not affect the prediction results very much.
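The difference between the two validation schemes lies only in how the data is partitioned: chronological validation trains on the earliest data and evaluates on the most recent, while 10-fold validation shuffles the data into ten folds that each serve once as the held-out set. A sketch follows, where the 0.8 split fraction is a placeholder rather than the thesis's actual ratio.

```python
import numpy as np
from sklearn.model_selection import KFold

def chronological_split(x: np.ndarray, y: np.ndarray, train_frac: float = 0.8):
    # Assumes the rows are ordered by time: train on the earliest part,
    # evaluate on the most recent part.
    cut = int(len(x) * train_frac)
    return (x[:cut], y[:cut]), (x[cut:], y[cut:])

def kfold_splits(x: np.ndarray, y: np.ndarray, k: int = 10):
    # Ten shuffled folds; each fold is held out once.
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(x):
        yield (x[train_idx], y[train_idx]), (x[test_idx], y[test_idx])
```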

4.2.1 Prediction result plots from chronological validation

The plots below show the predictions of the models against the target values, starting with the values for the baseline.

Figure 4.11. Predictions for the baseline approach of the average ride value for each hour per day and zone. X-axis is the target number of rides, Y-axis is the predicted number and the red line indicates perfect predictions

Figure 4.12. Predictions for model trained using feature set 1. X-axis is the target number of rides, Y-axis is the predicted number and the red line indicates perfect predictions

Figure 4.13. Predictions for model trained using feature set 2. X-axis is the target number of rides, Y-axis is the predicted number and the red line indicates perfect predictions

Figure 4.14. Predictions for model trained using feature set 3. X-axis is the target number of rides, Y-axis is the predicted number and the red line indicates perfect predictions

Figure 4.15. Predictions for model trained using feature set 4. X-axis is the target number of rides, Y-axis is the predicted number and the red line indicates perfect predictions

Figure 4.16. Predictions for model trained using feature set 5. X-axis is the target number of rides, Y-axis is the predicted number and the red line indicates perfect predictions

In these plots the x-axis represents the target values and the y-axis the predictions made by the models; the red line represents what would be perfect predictions. We can see how adding certain parameters clearly increases the network's ability to predict higher quantities of rides. Set number one, for example, looks very similar to the baseline when plotted, because they are based on the same data: using only these few parameters the network does not have enough information to base its decisions on and will simply output an average of the training data, exactly like the baseline does. It is interesting to note that while sets two and three seem to perform similarly according to tables 4.9 and 4.10, their plots are not all that similar. Since the network using feature set 2 has fewer variables to base the predictions on, we get a lot of predictions of the same value, as there is no information the network can use to distinguish the predictions further.

Important to note is that the scores on the entire test set of the Södermalm dataset are plotted here, which means the number of points in each graph is roughly 108 000. Looking at for example figure 4.14, the points in the top right corner can be distinguished quite clearly, indicating that they are few and probably hard to predict. In contrast, figure 3.2 indicates that the vast majority of the points are located on top of each other in the lower left corner.

4.3 Performance of final network model

In this section the training and prediction performance of the final network will be presented. The network is trained on the complete Stockholm dataset with the parameters described in section 4.1.6 and feature set 5. Set 5 is chosen because it was the best performing feature set according to tables 4.9 and 4.10, despite the fact that weather did not seem to affect the results very much when comparing sets 1 and 4.

Table 4.11. Scores for the final network model with the two validation methods compared to the baseline. The poisson metric is excluded for the cross-validated network, as that model uses a smaller data set for validation and the poisson metric depends on using the same validation set in order to be fair.

                          Accuracy  Corrects  Loss   Poisson
Baseline                  3.74      0.43      65.12  -1 508 804
Chronological validation  2.72      0.46      26.21  -1 132 403
10-fold validation        2.73      0.45      27.04

4.3.1 Comparison of results using different thresholds

The point of this work is to estimate taxi demand, and the most interesting aspect of that is of course to estimate the demand when there actually is demand. For this reason, the results are presented for different thresholds, where all data points in the validation sets with a target ride amount below the threshold have been excluded. The poisson metric is again omitted: since the number of points used for validation shrinks as the threshold is raised, the poisson metric is useless for comparison.
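The thresholding itself amounts to masking out the low-target points before computing the metrics. A sketch follows, where metric_fns stands in for the thesis's loss, accuracy and corrects definitions.

```python
import numpy as np

def metrics_above_threshold(y_true, y_pred, threshold, metric_fns):
    # Keep only the points whose target ride amount reaches the threshold;
    # a threshold of 0 keeps every point (the "None" rows in table 4.12).
    mask = y_true >= threshold
    return {name: fn(y_true[mask], y_pred[mask])
            for name, fn in metric_fns.items()}
```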

Table 4.12. Results of all the metrics for the final network model and baseline with different ride target thresholds.

Metric    Threshold  ANN     Baseline
Loss      None       27.28   65.17
Loss      5          67.86   167.85
Loss      10         130.43  332.80
Loss      20         278.76  763.46
Accuracy  None       2.81    3.7
Accuracy  5          4.88    7.07
Accuracy  10         7.57    11.55
Accuracy  20         12.11   20.26
Corrects  None       0.43    0.42
Corrects  5          0.16    0.12
Corrects  10         0.076   0.046
Corrects  20         0.031   0.015

In table 4.12 we see that all three metrics clearly indicate that it is much harder for the network to make good predictions at the points where many rides occurred, which is not surprising since the vast majority of data points have a very low target ride amount. Still, the network is significantly better than the baseline in all scenarios.

4.4 Different encoding experiment

The neural network training experiments in this project all used a few continuous-value parameters. A late realization was that another input strategy might have been smarter and would have allowed the network to perform better. A final experiment was therefore run where the input encoding strategy was changed. To allow comparison with the results in section 4.3, the same input features and data were used; the difference was that the three most important features according to the previous experiments, hour, zone and day-of-week, were encoded differently.

The zone parameter was one-hot encoded, with one input node per zone and a one at the index corresponding to the id of the zone for that input. The same encoding was used for the day-of-week parameter, with 7 different input nodes.

The hour parameter was still fed as a continuous value, but the break point was shifted. Instead of wrapping from 23 to 0 at a rather busy hour of the night, the shift was moved so that 0 represents 5 in the morning and 23 represents 4, the two least busy hours of the day. The hypothesis was that the jump from 23 to 0 should now disturb the network less: the discontinuity still exists, but it now falls on data points late at night, and since a large portion of these hours have few or no rides the network should be better at forecasting them.
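A sketch of the changed encoding for a single data point, assuming n_zones zones and integer-coded raw features (the variable and function names are illustrative, not taken from the thesis):

```python
import numpy as np

def encode(zone_id: int, weekday: int, hour: int, n_zones: int) -> np.ndarray:
    zone_vec = np.eye(n_zones)[zone_id]   # one-hot: one input node per zone
    weekday_vec = np.eye(7)[weekday]      # one-hot: 7 day-of-week nodes
    # Shift the hour so the 23 -> 0 wrap falls between 04:00 and 05:00,
    # the two least busy hours: 0 now represents 05:00, 23 represents 04:00.
    shifted_hour = (hour - 5) % 24
    return np.concatenate([zone_vec, weekday_vec, [shifted_hour]])
```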

Table 4.13. Scores for the final network model using the different input encoding described in section 4.4.

Accuracy  Corrects  Loss   Poisson
2.52      0.50      25.22  -1 083 812
