DOIT WP3 Report on

Predictive Modeling and Data Insights

Version: 5.0
Written by: Fatemeh Rahimian and Amir H. Payberah
RISE SICS AB
May 2017


Contents

1 Introduction
2 Data Insights (D3.1)
   2.1 Data Sources
      2.1.1 Vehicle Data
      2.1.2 Fleet Data
      2.1.3 Carrier Data
   2.2 Extracting the Assignments
      2.2.1 Challenges and Approaches
      2.2.2 Trips
      2.2.3 Assignments
      2.2.4 Merged Assignments
      2.2.5 Ferry Assignments
   2.3 Statistics and Results
      2.3.1 Vehicles
      2.3.2 Carriers
      2.3.3 Trips and Assignments
3 Predicting Fuel Consumption (D3.2)
   3.1 Preprocessing of the Input Dataset
   3.2 Building the Fuel Prediction Model
      3.2.1 Decision Tree
      3.2.2 Random Forests
      3.2.3 Implementation
   3.3 Evaluating the Model


1 Introduction

The overall objective of this project is to demonstrate how big data analytics and optimization can provide improved decision support and reduce cost in heavy-duty road transportation.[1]

The purpose of this work package is to enable the provision of accurate and relevant information to the optimization procedures developed in WP4. This is achieved through the development and adaptation of existing big data analytics tools for extracting information from the data provided in WP2. This work package corresponds to Objectives 1 and 2, and it mainly addresses Research Challenge 3.

The main use case we defined and selected is designing a tool for efficient and cost-effective planning of heavy-duty assignments. Ideally, the information about assignments would be provided by the carrier companies. However, we do not have access to this information and thus have to extract assignments from raw data. We also define the notion of a trip, which refers to any journey between two stops. We observe fuel consumption at the trip granularity, and then use this information to predict fuel consumption for future trips, and ultimately for assignments. The next and main step is to build a machine learning model, which is described in detail in Chapter 3. Our machine learning model is built on the Spark processing engine using the Spark machine learning library (MLlib). We can therefore build the model with very large training datasets.

[1] Potentially sensitive information has been excluded from this version of the report.

2 Data Insights (D3.1)

This chapter covers the first step in the process of assignment planning, which is discovering trips and assignments. We have three sources of data: vehicle data, carrier data, and fleet data. While vehicle data provides us with information about vehicle configuration and characteristics, carrier data shows which carrier company owns which vehicles. Moreover, the fleet data file contains the status reports regularly sent by these vehicles during the course of their assignments.

Extracting assignments from the status reports is not straightforward. Even after data cleaning, many uncertainties remain in interpreting the data. For example, if a truck is stopped, we cannot simply tell whether it is at the source or destination of an assignment, or whether it has stopped for refueling, for fika (a coffee break), or due to road congestion. We need to take many other data points from before and after the stop into account in order to decide which case applies. In Section 2.2 we elaborate on the challenges we faced and the approaches we took to address them. The work resulted in the extraction of 6012 assignments for 5328 vehicles owned by 632 different carriers.

2.1 Data Sources

We have three sources of data for analysis. First, we have the vehicle data, which contains the configuration of a set of vehicles, for which we have the fleet data during a one month period from 2014-03-01 to 2014-03-31. Next, we have the carrier data, which maps vehicles to their owner companies. Finally, we have the fleet data, a huge data file containing all the status reports sent from the vehicles during March 2014.

2.1.1 Vehicle Data

Vehicles are characterized by a variety of attributes or features. These features are listed in Table 2.1.

The Vehicle Data file contains information about 19,992 vehicles in total, including 19,737 trucks and 255 buses. Since our fuel consumption optimization targets trucks, we exclude buses from this data. Furthermore, from the set of trucks, we only select long haul trucks with Medium or Heavy duty class. In other words, we apply the following filters on the Vehicle Data


Table 2.1: Vehicle Features

Feature                     Type     Description
VehicleId                   String   A unique number for each vehicle
ChassisNumber               String   A unique number forged on the chassis
Product class               Typed    Truck or Bus
Wheel configuration         Typed    4x2, 6x2, 6x2/2, 6x2/4, 6x4, 8x2, 8x4, 8x2/*6, 8x2*6, 8x2/4, 8x4/4, 8x4*4
Type of transport           Typed    Long haul, Construction, Distribution
Duty class                  Typed    Medium (M), Heavy (H), Extra heavy (E)
Technical total weight      Integer  Weight of the vehicle in kg
GTW Technical               Integer  Gross train weight in kg
Engine stroke volume        String   Engine stroke volume
Engine type, vehicle        String   Engine type
Emission level              Typed    Euro 3, Euro 4, Euro 5, Euro 6, EEV
Fuel                        Typed    Diesel or Petrol
Overdrive                   Typed    With or Without
Ecocruise                   Typed    With or Without
Acceleration control        Typed    With or Without
Rear axle gear ratio        String   Rear axle gear ratio
Clutch system               Typed    Manual or Automatic
Suspension system, front    String   Suspension system, front
Suspension system, rear     String   Suspension system, rear
Gearbox management system   String   Gearbox management system

file to get the most relevant data for analysis and optimization:

Product class = 'Truck' AND
Type of transport = 'L' AND
(Duty class = 'H' OR Duty class = 'M')

In Section 2.3.1, we give some detailed statistics about the data included in this file.
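For illustration, the filter can be expressed in Spark as follows. This is only a sketch: the column names and the file path are our assumptions (the report does not give them), and the Chapter 2 preprocessing was in fact performed with Python scripts.

import org.apache.spark.sql.SparkSession

// Load the Vehicle Data file and keep long haul trucks with Heavy or
// Medium duty class; column names follow Table 2.1 and are assumed.
val spark = SparkSession.builder().appName("doit-wp3-vehicles").getOrCreate()
val vehicles = spark.read.option("header", "true").csv("vehicle_data.csv")
val longHaulTrucks = vehicles.filter(
  vehicles("ProductClass") === "Truck" &&
  vehicles("TypeOfTransport") === "L" &&
  (vehicles("DutyClass") === "H" || vehicles("DutyClass") === "M"))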

2.1.2 Fleet Data

The Fleet Data file contains all the reports that were sent by Scania vehicles during March 2014. Each row in this file consists of values for 26 different attributes, listed in Table 2.2. There are 20,284,529 rows in this file, out of which 16,575,259 rows belong to our filtered trucks.

Table 2.2: Fleet Data Features

Feature Unit/Type Description

2.1.3 Carrier Data

Carrier data maps each vehicle to its owner carrier company; assignment planning should be conducted within each carrier company separately. We cannot use a vehicle from carrier A to carry out an assignment of carrier B. Carrier data is composed of three attributes, listed in Table 2.3.

Table 2.3: Carrier Data Features

Feature Unit/Type Description


2.2 Extracting the Assignments

Since we have chosen to do assignment planning, the first thing we require is the assignments themselves. Ideally, the information about assignments would be provided by the carrier companies. However, in the absence of this information, we have to extract the assignments from the fleet data.

2.2.1 Challenges and Approaches

Transforming the fleet status reports to assignments is not straightforward. To begin with, we do not have the source and destination of the assignments, or their associated load and timing constraints.

We start by trying to find the source and destination of assignments. We know the vehicles are stationary at the source and destination, but those are not the only times they stop. A driver may stop a truck for various other reasons, e.g., driving regulations, a traffic jam, or simply a coffee break (Figure 2.1).

Figure 2.1: An assignment with several stops

Figure 2.2: Two consecutive assignments. If we do not consider the weight, the stopping pattern is similar to that of Figure 2.1

We know some assignments are repeated regularly. Therefore, we first investigate whether we can extract these assignments by finding patterns in the status reports, tracking the position of vehicles over time. To give an idea, Figure 2.3 shows the coordinates of truck #162941 over a week. Note that the coordinates reported by the trucks are not necessarily the same, even if a truck moves along the exact same path at around the same time every day. There is also some time drift across days. But the main problem is that there are only a few cases of


Figure 2.3: Travels of truck #162941 (latitude vs. longitude; one trace per day, 2014-03-03 to 2014-03-12)

such patterns (similar to that of Figure 2.3). In most cases, the travel pattern is not repeated frequently, there are several stops during an assignment, and some stop points may even be chosen frequently not because they are sources or destinations, but because they are gas stations or restaurants that drivers repeatedly choose. Furthermore, we cannot differentiate between several assignments that are carried out one after the other (see Figure 2.2).

Another indication for detecting assignments is the load of a truck. We expect that the load does not change if the truck is stopped at a restaurant, for example. But if it does change, this could indicate the end of an assignment and/or the start of a new one (see Figure 2.2).

Estimating the load is another challenge, because in the status reports we only have the odometer and tonometer attributes. It seems straightforward to calculate the weight by dividing the difference in tonometer by the difference in odometer:

$$\mathrm{Weight} = \frac{\Delta\,\mathrm{tonometer}}{\Delta\,\mathrm{odometer}}$$

But the status reports can be noisy; for example, Figure 2.4 shows the tonometer and odometer values for truck #155097. This data is clearly not reliable, because odometer and tonometer values cannot decrease. After removing such noisy cases, we get something like Figure 2.5.
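As a minimal sketch of this estimation (the field names and units are our assumptions; the tonometer is treated as a cumulative ton-kilometer counter):

// Estimated load between two consecutive status reports; rows where the
// odometer or tonometer decreases are treated as noise and dropped.
case class Report(odometerKm: Double, tonometerTonKm: Double)

def weightTons(prev: Report, cur: Report): Option[Double] = {
  val dOdo = cur.odometerKm - prev.odometerKm
  val dTon = cur.tonometerTonKm - prev.tonometerTonKm
  if (dOdo > 0 && dTon >= 0) Some(dTon / dOdo) // cumulative counters never decrease
  else None
}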

We now have a hint on how to detect assignments. Each line segment in the weight plot in Figure 2.5 represents a different load status for the truck. However, there is another subtlety. We observe that sometimes consecutive


Figure 2.4: Odometer and Tonometer reported by truck #155097

status reports do not indicate that the vehicle is stopped, yet the calculated weight has either increased or decreased. In other words, the weight changes, as if the truck were being loaded or unloaded while moving. This, of course, cannot be the case. We therefore attribute the change to measurement error (specifically, tonometer measurement error), if it is marginal. But the question is which value of the tonometer and weight should be taken into account. To answer this, we calculate the average weight over each time interval during which the truck is moving. More specifically, we introduce the notion of trips, the constituent parts of assignments, as any journey between two stops. In the next section, we define trips and assignments in detail.

2.2.2 Trips

We define a trip to be a journey between two stops, with non-null distance and duration. Note that a trip contains the aggregated data of multiple rows from the fleet data (at least two status reports). Around 800,000 trips were detected. To each trip we assign a list of features, shown in Table 2.4. Features marked with a star (*) are calculated from other values, while the remaining features are read directly from the Fleet Data file. Start and end cities are approximated from the reported coordinates. Using the odometer values, we compute the traveled distance as the difference between the odometer values at the start and end points. Similarly, the tonometer values are used to compute the average weight during the trip. Finally, the time traveled is the difference between the start and end times. A minimal sketch of this aggregation follows.
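The sketch below shows how the starred features can be derived from the first and last status report of a trip. The Status fields, and the assumption that the fuel attribute is a cumulative counter, are ours rather than the report's.

// Derive the computed (starred) trip features of Table 2.4 from the first
// and last status report of a trip (a trip aggregates at least two reports).
case class Status(time: Long, odometerKm: Double,
                  tonometerTonKm: Double, fuel: Double)

def tripFeatures(reports: Seq[Status]): (Long, Double, Double, Double) = {
  require(reports.size >= 2, "a trip aggregates at least two status reports")
  val (first, last) = (reports.head, reports.last)
  val timeTraveled  = last.time - first.time
  val distanceKm    = last.odometerKm - first.odometerKm
  val avgWeightTons = (last.tonometerTonKm - first.tonometerTonKm) / distanceKm
  val fuelConsumed  = last.fuel - first.fuel // assuming a cumulative fuel counter
  (timeTraveled, distanceKm, avgWeightTons, fuelConsumed)
}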

2.2.3 Assignments

Having the trips, we can now define an assignment as one or more consecutive trips carrying the same average weight. The features used to describe an assignment are listed in Table 2.5. Most features are similar to those of trips.


Figure 2.5: Odometer and tonometer reported by truck #235658, and the calculated weight over time

Table 2.4: Trip Features

Vehicle ID         Customer ID
Start time         End time          Time traveled *
Start latitude     Start longitude   Start city *
End latitude       End longitude     End city *
Start odometer     End odometer      Distance traveled *
Start tonometer    End tonometer     Average weight *
Start fuel         End fuel          Fuel consumption *

The four additional features of assignments, marked in bold in Table 2.5, are explained below.

Number of trips aggregated is a value greater than or equal to 1 and, as its name suggests, is the number of trips that constitute the assignment. The bird-fly distance is the distance along an approximate straight line between the start and end points. We use this feature for verification: we expect the traveled distance to never be less than the bird-fly distance. Time driving is the time traveled without stops. Finally, the state of the truck can be LOADED or UNLOADED.

Table 2.5: Assignment Features

Vehicle ID          Customer ID       Number of trips aggregated
Start time          End time          Time traveled
Start latitude      Start longitude   Start city
End latitude        End longitude     End city
Start odometer      End odometer      Distance traveled
Start tonometer     End tonometer     Average weight
Start fuel          End fuel          Fuel consumption
Bird-fly distance   Time driving      State (LOADED/UNLOADED)


To select the most meaningful assignments, we apply the following filters to the extracted assignments. Some of these filters address noise in the data or measurement errors; we also only consider trips for which we have a minimum number of reports. A code sketch of these filters follows the list.

• weight 0 < w < 80 tons: vehicles (not trips) with more than 80 tons are not considered.

• speed s < 30 m/s (108 km/h): vehicles (not trips) with a calculated speed of more than 30 m/s are not considered.

• frequency f < 600 s: only include trips with, at most, an average of 600 seconds between messages.

• distance d > 1000 m: only include trips with a minimum distance of 1000 meters.

• gap g < 5000 m: the maximum odometer gap allowed between trips, in meters.

• threshold for changed load l = 3 tons: the load status is considered changed only if the difference in weight exceeds this value. Note that the weight is never stable; it varies almost all the time because of inaccurate sensors, road conditions, and the like. We cannot assume that every change of value means the truck is loading or unloading; a bumpy road alone can affect the tonometer sensors. So we need a threshold: if the weight of a truck varies by less than this threshold, we consider it an accuracy problem and assume the truck carries the same load. If the weight varies by more than this threshold, we consider that the truck has actually changed its weight by loading or unloading goods.

• values odometer = 10: vehicles (not trips) with this value are not considered.

• fuel consumption > 0 (when distance > 1000 m): vehicles (not trips) with negative or negligible fuel consumption are not considered. Such values occur due to sensor accuracy problems.
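The sketch below mirrors the trip-level part of these filters; the TripStats fields are illustrative, and the vehicle-level exclusions ("vehicles, not trips") would be applied analogously over all trips of a vehicle.

// Trip-level filters from the list above; field names are assumptions.
case class TripStats(avgWeightTons: Double, speedMps: Double,
                     avgReportIntervalSec: Double, distanceM: Double,
                     odometerGapM: Double, fuelConsumedL: Double)

def keep(t: TripStats): Boolean =
  t.avgWeightTons > 0 && t.avgWeightTons < 80 &&
  t.speedMps < 30 &&              // 108 km/h
  t.avgReportIntervalSec < 600 && // at least one message per 10 minutes on average
  t.distanceM > 1000 &&
  t.odometerGapM < 5000 &&
  t.fuelConsumedL > 0             // drops negative or negligible sensor readings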

2.2.4 Merged Assignments

While verifying the extracted set of assignments, we encountered a number of unexpected cases. One of these is when the distance traveled between a source and destination is far more than the expected distance. We already know that the distance traveled is expected to exceed the bird-fly distance, because of the choice of routes, and because drivers sometimes have to take a detour to reach a gas station or restaurant. For example, Figure 2.6 shows that the 398 km bird-fly distance between Paris and Strasbourg can be traveled in 490 km, 491 km, or even 557 km.

But if the difference is far greater, for example 761 km as shown in Figure 2.7, then we are most probably merging two different assignments into one. We do not


Figure 2.6: The expected traveled distance is between 490 and 580km, for an assignment from Paris to Strasbourg

Figure 2.7: Travelled distance is 761km, which is far more than the expected value. This seems to be two different assignments, wrongly detected as one.

expect a truck on a single assignment to travel through Brussels on its way from Paris to Strasbourg. We therefore exclude assignments whose traveled distance is more than 1.5 times the bird-fly distance. Alternatively, we could try to split these assignments, but to increase the reliability of our extracted assignments, we decided to exclude them from the output dataset.
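For illustration, the bird-fly distance can be approximated with the haversine formula and the 1.5x rule applied on top. This is a sketch; the report does not specify how the bird-fly distance was actually computed.

// Great-circle ("bird fly") distance between two coordinates, in km.
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val R = 6371.0 // mean Earth radius in km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * R * math.asin(math.sqrt(a))
}

// An assignment is suspected to be two merged assignments if the travelled
// distance exceeds 1.5 times the bird-fly distance.
def looksMerged(travelledKm: Double, birdFlyKm: Double): Boolean =
  travelledKm > 1.5 * birdFlyKm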

2.2.5 Ferry Assignments

Another peculiar case that came up in the verification step is when the distance traveled is far less than the actual distance between the start and end cities. For example, we found that truck #159711 once moved from IT-Ciampino to ES-Esparreguera, with 890 km between them, but the distance traveled indicated by the odometer is only 156 km. We later realized that this truck had been on a ferry for most of the travel time (see Figure 2.8). We discovered 12 such cases in the dataset, all of which are excluded from the output dataset.


Figure 2.8: A ferry assignment, where the travelled distance is way smaller than the bird fly distance.

2.3 Statistics and Results

In this section, we present statistics about the data sources introduced in Section 2.1, as well as the result of our trip and assignment extraction process. This process creates two new data files, one for trips and one for assignments. The former file is then combined with an external data source (obtained through Digital Reality) that includes information about road features (particularly the road slope profile). This augmented trip file is later used in WP3 for building a fuel prediction model, while the assignment file is used for planning and optimization in WP4.

2.3.1 Vehicles

The Vehicle Data file contains information about 19,737 trucks, with the features explained in Table 2.1. We identify three types of applications for trucks, with the distribution depicted in Figure 2.9.

Figure 2.9: Distribution of trucks over their type of transport

Since the project only targets long haul assignment planning, we exclude the other types of trucks and only consider the remaining 89%. The distribution of these trucks over their duty class is shown in Figure 2.10. Since the Extra heavy duty class constitutes a negligible part of the data, we only consider trucks with Heavy or Medium duty class.

Figure 2.10: Distribution of long haul trucks over their duty class

Figure 2.11 depicts the distribution of heavy and medium duty long haul trucks over the wheel configuration attribute. All these trucks, a


total of 17,512 out of the 19,737 initial ones, are considered in our analysis.

Figure 2.11: Distribution of heavy/medium duty trucks over their wheel configuration

2.3.2 Carriers

The trucks are owned by 7,624 carriers, with between 1 and 219 trucks per carrier. The top 10 carriers with respect to the number of trucks they own are:

1. List redacted

2.3.3 Trips and Assignments

There are 20,284,529 rows (26 values each) in the Fleet Data file, 18,380,474 after cleaning null values, out of which 16,575,259 rows belong to our filtered trucks. These reports correspond to 463,157 drivers from 77 different countries (depicted in Figure 2.12).

The result of the trip extraction is stored in a text file. Each line describes a single trip with the attributes listed in Table 2.4. This file has 822,520 rows, or trips, identified using the approach of Section 2.2.

In the assignment file we have 33,660 rows, or assignments, with the attributes listed in Table 2.5. There are some assignments for which the truck is unknown. If we exclude these from the set, we are left with 33,386 assignments and 3708 trucks belonging to 1545 different carriers.

The distribution of assignments over carriers is very skewed: over 1400 carriers have fewer than 50 assignments a month. Too few assignments means there is not much room for planning and optimization. Thus, we focus on the carriers with the most assignments. The top 20 carriers by number of assignments are listed in Table 2.13, which also shows how many assignments, trips, and messages exist per carrier, as well as the total distance traveled and fuel consumed.


3 Predicting Fuel Consumption (D3.2)

In this chapter we explain how we built a model to predict fuel consumption based on the data of existing trips. We assume that we have a history of different trips, where for each trip we know how much fuel was consumed, as well as other information about the trip, such as its distance, duration, route, weight, and vehicle features. Moreover, for each route that corresponds to a trip, we use a slope profile acquired from the Digital Reality service. We then use this data to build a model to predict fuel consumption for unseen future trips. Since we know the fuel consumption of the existing trips, we can treat this problem as a supervised learning problem.

Supervised learning generically refers to problems where we have input variables, called features (e.g., trip information), and an output variable, called the label (e.g., fuel consumption), and we look for a mapping function from the input to the output. To find such a mapping, we first need to prepare the dataset to fit the features-label format, and then use existing supervised machine learning algorithms to build the model. In the rest of this chapter, we go through these two steps and explain each. All the implementations in this chapter are based on the Spark machine learning library, known as MLlib.

3.1 Preprocessing of the Input Dataset

To prepare the input for our model, we use two input datasets, namely the vehicle dataset and the trip dataset. The vehicle dataset has already been introduced in Section 2.1.1. Below we go through the attributes that define a trip (a row in the trip file). These attributes are listed in Table 3.1. Note that we only focus on the features that are used in building the model.

As seen in Table 3.1, we have 11 different features for the slope profile, each showing what fraction of the road has a specific slope; the values of these features naturally sum to one. Similarly, we have four features for the speed of a vehicle during a trip, each showing what fraction of the trip was traveled at a specific speed. Speed is classified into four groups: low, medium, high, and very high.


Table 3.1: Trip Features

Feature                Type    Description
TripId                 Int     A unique number for each trip
VehicleId              Int     A unique number for each vehicle
TotalWeight            Float   The total carried weight in a trip
Duration               Int     The duration of a trip
AvgSpeed               Float   The avg. speed in a trip
FuelPerKm              Float   The avg. fuel consumption per kilometer in a trip
LessThanOnePerc        Float   Fraction of the road with slope s < 1%
OneToTwoPercAsc        Float   Fraction of the road with ascending slope 1% < s < 2%
TwoToThreePercAsc      Float   Fraction of the road with ascending slope 2% < s < 3%
ThreeToFourPercAsc     Float   Fraction of the road with ascending slope 3% < s < 4%
FourToFivePercAsc      Float   Fraction of the road with ascending slope 4% < s < 5%
FiveOrMorePercAsc      Float   Fraction of the road with ascending slope s > 5%
OneToTwoPercDesc       Float   Fraction of the road with descending slope 1% < s < 2%
TwoToThreePercDesc     Float   Fraction of the road with descending slope 2% < s < 3%
ThreeToFourPercDesc    Float   Fraction of the road with descending slope 3% < s < 4%
FourToFivePercDesc     Float   Fraction of the road with descending slope 4% < s < 5%
FiveOrMorePercDesc     Float   Fraction of the road with descending slope s > 5%
LowTrafficSpeed        Float   Fraction of the trip traveled at low speed
MediumTrafficSpeed     Float   Fraction of the trip traveled at medium speed
HighTrafficSpeed       Float   Fraction of the trip traveled at high speed
VeryHighTrafficSpeed   Float   Fraction of the trip traveled at very high speed

To get all the features for the model, we need to join the vehicle and trip datasets on their shared key, VehicleId. The result of this join is a table where each row consists of the features listed in Table 3.1 and the vehicle information listed in Table 2.1. Some of the features in the vehicle table are "typed" features with categorical values. For example, duty class has three distinct values: Medium (M), Heavy (H), and Extra heavy (E). In order to use these categorical features in our model, we must transform them into scalars. One way to convert a categorical input variable into scalars is to use dummy variables, which take only two values, 0 and 1, indicating the presence or absence of a category. A sketch of this encoding is shown below.
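For example, a minimal sketch of this encoding for the duty class feature (the function name is ours):

// One dummy variable per category: exactly one of the three entries is 1.0.
def dutyClassDummies(dutyClass: String): Array[Double] = Array(
  if (dutyClass == "M") 1.0 else 0.0, // Medium
  if (dutyClass == "H") 1.0 else 0.0, // Heavy
  if (dutyClass == "E") 1.0 else 0.0) // Extra heavy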

After replacing all such categorical attributes with their equivalent dummy variables, we are ready to format the input data as pairs of (label, features). To do so, we use the LabeledPoint data structure, a predefined data structure in Spark for supervised machine learning. A LabeledPoint is a pair where the first item is the label and the second item is the feature vector. Since we are predicting the fuel consumption of each trip, we set the label to FuelPerKm and use the remaining features as the feature vector.
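A minimal sketch of this step, assuming the joined row has already been flattened into numeric arrays (the TripRow structure and its field names are illustrative, not from the report):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Illustrative joined row: trip features plus dummy-encoded vehicle features.
case class TripRow(fuelPerKm: Double, totalWeight: Double, duration: Double,
                   avgSpeed: Double, slopeProfile: Array[Double],
                   speedProfile: Array[Double], vehicleDummies: Array[Double])

// Label = FuelPerKm; features = everything else, concatenated into one vector.
def toLabeledPoint(r: TripRow): LabeledPoint =
  LabeledPoint(r.fuelPerKm, Vectors.dense(
    Array(r.totalWeight, r.duration, r.avgSpeed) ++
      r.slopeProfile ++ r.speedProfile ++ r.vehicleDummies))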

3.2 Building the Fuel Prediction Model

The next step is to feed the LabeledPoint data to a machine learning algorithm and build a prediction model. As mentioned at the beginning of this chapter, since we know the actual fuel consumed for each trip, we can treat this as a supervised learning problem. Supervised learning problems are, in general, divided into regression problems, where the label is a continuous value, and classification problems, where the label belongs to a discrete set of values. Our problem is a regression problem, because the fuel consumption values are continuous.

Spark provides a number of algorithms for building regression models. Among those, we have chosen decision trees [1] and random forests [2]. Tree-based models are, in principle, popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multi-class classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble algorithms such as random forests are among the top performers for classification and regression tasks [3, 4]. Before delving into the implementation details, we first give a quick review of how decision trees and random forests work.

3.2.1 Decision Tree

A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. To clarify the algorithm through an example, let us assume we have the training dataset given in Figure 3.1.

Figure 3.1: A sample training dataset, with three features (Distance, Duration, and Weight) and one target label (Fuel). The variance of the target values is 4.69

A decision node is selected from the features and has two or more branches, each representing a subset of values for the tested feature. A leaf node (e.g., Fuel) represents a decision on the numerical target (label). The topmost decision node in a tree is called the root node.

The decision tree algorithm is a top-down approach that works based on a greedy search. The goal is to partition the data into subsets that contain instances with more homogeneous target values (labels). To measure the homogeneity of

[1] Breiman L., Classification and Regression Trees. Chapman & Hall/CRC, 1984.
[2] Breiman L., Random Forests. Machine Learning, 45(1). Springer, 2001.
[3] Dietterich T. G., Ensemble Methods in Machine Learning. International Workshop on Multiple Classifier Systems. Springer, 2000.
[4] Liaw A. and Wiener M., Classification and Regression by randomForest. R News, 2(3), 2002.


the target values, the notion of impurity is introduced. In the case of regression (which we are dealing with), the impurity is the variance, and splits are chosen to maximize the variance reduction. To be more precise, the impurity is measured as $\frac{1}{N}\sum_{i=1}^{N}(y_i - \mu)^2$, where $y_i$ is the target value of an instance, $N$ is the number of instances in the dataset, and $\mu$ is the mean, given by $\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$.

Figure 3.2: Possible trees in the first step

The lower the variance of the target values, the lower the impurity (and the higher the homogeneity). Figure 3.2 shows three different trees, each with a different root node. To compute the variance of the target values for each tree, we first compute the variance of the target values at the end of each branch, and then take the weighted average of those variances. For example, for tree (a) in Figure 3.2, the variance is calculated as follows:

$$v = \frac{\mathrm{var}(9.5,\,12.0) \cdot 2 + \mathrm{var}(10.2,\,5.4,\,9.1,\,8.9) \cdot 4}{6} = 3.92$$

Figure 3.3: A sample decision tree built to predict fuel, given the training dataset in Figure 3.1.

As shown, all the trees reduce the variance compared to that of the initial dataset (4.69 in Figure 3.1), and the split with the largest variance reduction is selected. The algorithm then recursively splits each branch further, at each step choosing the split that produces subsets with the smallest variance in target values. If the impurity cannot be improved any more (that is, when the post-split variance is not reduced), the tree is not further extended. Similarly, if the tree reaches a maximum depth, it does not extend any further. Figure 3.3 shows the final decision tree built for our sample dataset (assuming maxDepth = 2).

Now, given such a tree, we can predict the target value for any new instance. We start from the root and traverse the tree along its branches, based on the values of the given instance. The leaf we end up at represents our prediction for the target value. If multiple target values exist at the end of a branch, the predicted value is the average of all those values. A toy sketch of this traversal follows.
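A toy version of this traversal, with internal nodes testing one feature against a threshold and leaves holding the mean of their target values (a sketch, not MLlib's internal representation):

sealed trait Node
case class Leaf(prediction: Double) extends Node // mean of the leaf's target values
case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

// Walk from the root to a leaf, following the branch matched by the instance.
def predict(node: Node, x: Array[Double]): Double = node match {
  case Leaf(p)           => p
  case Split(f, t, l, r) => if (x(f) <= t) predict(l, x) else predict(r, x)
}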

3.2.2 Random Forests

While decision trees are a very popular method for many machine learning tasks, they are known to be prone to overfitting. Especially deep trees are likely to model all the irregularities in the training dataset, and thus even model noisy data, which reduces prediction accuracy. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. Each tree in a random forest is built over a random subset of the data and considers random subsets of the features. At each potential split point, a subset of the features (typically one third) is randomly sampled from all features, and the optimal splitting feature within this random subset is determined. This is repeated for each splitting node of each tree. Note that if one or a few features are very strong predictors of the target variable (that is, they effectively reduce the variance of the target values), these features would otherwise be selected in many of the decision trees, causing the trees to become correlated; randomly sampling the features at each split counteracts this. Noisy values in the training dataset, on the other hand, are unlikely to be represented in multiple trees, since each tree is built on a random subset of the training instances. To predict the target value for a given instance, the average of the predicted values over all trees is calculated.

3.2.3 Implementation

To build a decision tree in Spark, we need to define an instance of the Strategy class, which holds the configuration parameters the tree is built upon. An instance of the Strategy class can be constructed as follows:

val strategy = new Strategy(algo, impurity, maxDepth, numClasses, maxBins,
  quantileCalculationStrategy, categoricalFeaturesInfo, minInstancesPerNode,
  minInfoGain, maxMemoryInMB, subsamplingRate, useNodeIdCache,
  checkpointInterval)

The following parameters are defined in this class:

• algo: the type of decision tree, either Classification or Regression; Regression in our case.


• impurity: indicates which impurity measure should be used to split a node. We set it to Variance, meaning that we measure the impurity as $\frac{1}{N}\sum_{i=1}^{N}(y_i-\mu)^2$.

• maxDepth: the maximum depth of a tree. The deeper the tree, the more expressive it is, potentially providing higher accuracy, at the cost of more expensive training. We set it to 30 in our implementation.

• numClasses: the number of classes, which is only used for classification.

• maxBins: the number of bins used when discretizing continuous features, which is set to 100 in our model.

• quantileCalculationStrategy: the algorithm used for calculating quantiles. The supported algorithm is QuantileStrategy.Sort.

• categoricalFeaturesInfo: specifies which features are categorical and how many categorical values each of those features can take. This is a map from feature indices to numbers of categories. Since we have already used dummy variables to convert categorical features to numbers, we can leave this parameter empty, indicating that all features are continuous.

• minInstancesPerNode: for a node to be split further, each of its children must receive at least this number of training instances. It is 30 in our implementation.

• minInfoGain: for a node to be split further, the split must improve the information gain by at least this much. We set it to zero in our case.

• maxMemoryInMB: the amount of memory to be used for collecting statistics. We used the default value, which is 256 MB.

• subsamplingRate: the fraction of the training data used for learning the decision tree. We set it to one, meaning the whole training set is used.

• useNodeIdCache: we set this parameter to false, which means that the algorithm does not avoid passing the current model to executors on each iteration. Setting it to true is useful with deep trees to speed up computation on workers.

• checkpointInterval: the frequency for checkpointing the node ID cache RDDs, measured in iterations. Setting this too low causes extra overhead from writing to HDFS; setting it too high can cause problems if executors fail and the RDD needs to be recomputed. We set it to 10.

As mentioned above, maxMemoryInMB is the amount of memory used for collecting statistics. For faster processing, the decision tree algorithm collects statistics about groups of nodes to split (rather than one node at a time). The number of nodes that can be handled in one group is determined by the memory requirements (which vary per feature). The maxMemoryInMB parameter specifies the memory limit, in megabytes, that each worker can use for these statistics. The default value is conservatively chosen to be 256 MB to allow the decision tree algorithm to work in most scenarios. Increasing maxMemoryInMB can lead to faster training by allowing fewer passes over the data, though at the cost of higher memory and communication requirements per iteration.

The useNodeIdCache parameter is what MLlib uses for scaling up the process for deeper trees (i.e., when maxDepth is set to be large). From the implementation perspective of the tree and random forest in MLlib, the algorithms send the current model to executors so that they can match training instances with tree nodes. When useNodeIdCache is set to true, the algorithms cache this information, avoid passing the current model, and consequently speed up computation on workers and reduce communication on each iteration. Although caching this information can reduce cost, it generates a long lineage of RDDs, which can itself cause performance problems. To alleviate this, we can checkpoint the intermediate RDDs by choosing a checkpointDir and setting the checkpointing frequency via checkpointInterval. Too small an interval causes extra overhead from writing to HDFS, and too large a value can cause problems if executors fail and the RDD needs to be recomputed.
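Putting the values above together, the concrete configuration would look roughly as follows (a sketch; the argument names follow the Strategy constructor listed at the beginning of this section):

import org.apache.spark.mllib.tree.configuration.{Algo, QuantileStrategy, Strategy}
import org.apache.spark.mllib.tree.impurity.Variance

val strategy = new Strategy(
  algo = Algo.Regression,
  impurity = Variance,                           // variance-based impurity
  maxDepth = 30,                                 // maximum supported by Spark
  numClasses = 0,                                // unused for regression
  maxBins = 100,
  quantileCalculationStrategy = QuantileStrategy.Sort,
  categoricalFeaturesInfo = Map.empty[Int, Int], // all features are continuous
  minInstancesPerNode = 30,
  minInfoGain = 0.0,
  maxMemoryInMB = 256,
  subsamplingRate = 1.0,
  useNodeIdCache = false,
  checkpointInterval = 10)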

After defining the decision tree configuration, we can use the DecisionTree.train method to build the tree. This method takes two parameters: trainingData, the input dataset in the LabeledPoint format, and strategy, an instance of the Strategy class. The result of calling the train method is the decision tree model, an object of type DecisionTreeModel. We can then call the predict method on the model, giving a feature set as input, and get the predicted result:

val strategy = new Strategy(...)
val model = DecisionTree.train(trainingData, strategy)
val result = model.predict(features)

Similarly, we can use the random forest algorithm to build our model. Random forests are ensembles of decision trees; they combine many decision trees in order to reduce the risk of overfitting. Spark supports random forests for both classification and regression. A random forest trains a set of decision trees separately, so the training can be done in parallel. The trees are trained in the same way as individual decision trees, except that some randomness is injected into the training process so that each decision tree is slightly different. Combining the predictions of the trees reduces the variance of the predictions, improving performance on test data. The injected randomness includes: (i) subsampling the original dataset on each iteration to get a different training set, and (ii) considering different random subsets of features to split on at each tree node. To make a prediction for a new instance, a random forest aggregates the predictions of its decision trees. This aggregation is done differently for classification and regression: by majority vote or by averaging, respectively.

In addition to the Strategy we used to build a decision tree, we need to set three more parameters to use a random forest:

• numTrees: the number of trees in the random forest. If it is set to one, no bootstrapping is used. We used 10, so bootstrapping is done.

• featureSubsetStrategy: the number of features to consider for splits at each node; supported values are auto, all, sqrt, log2, and onethird. If auto is set, this parameter is chosen based on numTrees: if numTrees is one, it is set to all; otherwise it is set to sqrt for classification and to onethird for regression. We explicitly set it to onethird in our implementation.

• seed: the random seed for bootstrapping and choosing feature subsets.

Considering all these parameters, we can then use the random forest algorithm to build a model and use the model to predict values for given input features. These steps are shown below:

val strategy = new Strategy(...)
val model = RandomForest.trainRegressor(trainingData, strategy, numTrees,
  featureSubsetStrategy, seed)
val result = model.predict(features)

3.3 Evaluating the Model

Before delving into the experiments, it is important to mention that Spark breaks a job (for example, trainRegressor) down into multiple tasks, which are scheduled, serialized, and distributed over multiple workers. The task breakdown and distribution is based on a DAG, which is in turn derived from the dependencies of the functional transformations defined on the data. All these steps come with some overhead cost, even in local mode. Therefore, compared to other local analytics platforms, such as R or Matlab, which are designed and optimized for single-node execution, Spark is not necessarily expected to be faster. However, if the amount of data is so big that it cannot be loaded on a single machine, the other platforms become impractical, while Spark can seamlessly handle big data. Hence, we do not show any measurements of computation speed, and instead focus on the accuracy of the constructed models.

We used regression analysis to predict a continuous output variable from a number of independent variables, and used the metrics in Table 3.2 to evaluate the accuracy of the models we built:

Table 3.2: Evaluation metrics. Lower errors indicate more accurate models.

Metric                                      Formula
Mean Square Error (MSE)                     $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Mean Absolute Error (MAE)                   $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
Mean Absolute Percentage Error (MAPE)       $\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$
Median Absolute Percentage Error (MdAPE)    $\mathrm{MdAPE} = \mathrm{median}_i\left(\left|\frac{y_i - \hat{y}_i}{y_i}\right|\right)$
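To make the evaluation concrete, below is a minimal sketch of how these four metrics can be computed for a trained MLlib model over a held-out test set (the function and variable names are ours, not the report's):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

// Compute the Table 3.2 metrics for a trained model on held-out test data.
def evaluate(model: DecisionTreeModel, test: RDD[LabeledPoint]): Unit = {
  val pairs = test.map(p => (p.label, model.predict(p.features))).cache()
  val n = pairs.count().toDouble
  val mse = pairs.map { case (y, yHat) => (y - yHat) * (y - yHat) }.sum() / n
  val mae = pairs.map { case (y, yHat) => math.abs(y - yHat) }.sum() / n
  val apes = pairs.map { case (y, yHat) => math.abs((y - yHat) / y) }
  val mape = apes.sum() / n
  // Median via sort + index lookup (approximate for even n).
  val indexed = apes.sortBy(identity).zipWithIndex().map(_.swap)
  val mdape = indexed.lookup(n.toLong / 2).head
  println(f"MSE=$mse%.4f MAE=$mae%.4f MAPE=${mape * 100}%.1f%% MdAPE=${mdape * 100}%.1f%%")
}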

We used these metrics to configure the parameters of the models and to evaluate their accuracy. In particular, we study the effect of the following parameters of the decision tree model: (i) maxDepth, (ii) maxBins, and (iii) minInstancesPerNode.

First, we fix the maxBins and minInstancesPerNode parameters to 100 and 30, respectively, and evaluate the model for varying values of maxDepth. The results are shown in Table 3.3. Note that maxDepth > 30 is not currently supported by the decision tree implementation in Spark.

Next, we study how the number of bins affects the model. We therefore set the maxDepth and minInstancesPerNode parameters to 30 and 30, respectively, and build the model with different values of maxBins. The results are shown in Table 3.4.

Finally, we examine the impact of the minInstancesPerNode parameter; for this, we set the maxDepth and maxBins parameters to 30 and 100, respectively. The results are shown in Table 3.5.

Table 3.3: Impact of maxDepth

maxDepth   30      20      10      5       2
MSE        0.0013  0.0013  0.0014  0.0017  0.0027
MAE        0.0261  0.0265  0.0274  0.0311  0.0392
MAPE       8.5%    8.7%    8.9%    10.2%   12.8%
MdAPE      6.7%    6.9%    7.15%   8.3%    10.6%

Table 3.4: Impact of maxBins

maxBins    150     100     75      50      25      5
MSE        0.0013  0.0013  0.0012  0.0012  0.0013  0.0013
MAE        0.0265  0.0261  0.0259  0.0258  0.0263  0.0265
MAPE       8.6%    8.5%    8.5%    8.4%    8.5%    8.6%
MdAPE      6.9%    6.7%    6.7%    6.6%    6.7%    6.7%

Table 3.5: Impact of minInstancesPerNode

minInstances  50      40      30      20      10      5
MSE           0.0012  0.0012  0.0013  0.0013  0.0014  0.0016
MAE           0.0261  0.0260  0.0261  0.0265  0.0276  0.0294
MAPE          8.5%    8.5%    8.5%    8.6%    9.0%    9.6%
MdAPE         6.7%    6.7%    6.7%    6.7%    7.1%    7.5%

As the results show, the decision tree model does not change drastically when we change the maximum number of bins. However, we see a decline in model accuracy when the maximum tree depth is set very low. This is reasonable, because trees that are too shallow have limited capacity to guide the prediction in the right direction. On the other hand, extending the tree further only makes sense if we have enough samples down the branches. If we require too few instances to split a node further, we observe an increase in the error (as in Table 3.5), likely because the model overfits.

The next step is to compare the decision tree model with the random forest model. Employing more trees in the random forest is expected to increase accuracy, but there is a trade-off between accuracy and computation cost. Table 3.6 shows the accuracy of random forests with 2, 3, 5, and 10 trees, compared to a single decision tree. As expected, we obtained higher accuracy by using multiple trees in the random forest algorithm. It is, however, interesting to note that using only two trees in the random forest does not necessarily increase accuracy compared to a single decision tree. Recall that a tree in a random forest considers roughly one third of the features at each split, while a pure decision tree employs all features. We observe that the advantage of random forests kicks in as soon as we have 3 or more trees. Especially in the case of 10 trees, we see a great reduction in error, indicating a highly accurate model.

Table 3.6: The accuracy of the models, in terms of their error.

Metric  Decision Tree  RF (2 trees)  RF (3 trees)  RF (5 trees)  RF (10 trees)
MSE     0.0013         0.0014        0.0012        0.0011        0.0004
MAE     0.0261         0.0271        0.0265        0.0242        0.0009
MAPE    8.5%           8.9%          8.4%          7.9%          4.8%

Conclusions

This report demonstrated how big data analytics can provide improved decision support and ultimately reduce cost in heavy-duty road transportation. The selected use case is designing a tool for efficient and cost-effective planning of heavy-duty assignments, targeting fuel consumption reduction. We defined the notion of a trip, which refers to any journey between two stops, and built a fuel prediction model for trips instead of assignments. However, the optimization process in WP4 requires not only a fuel prediction model, but also the actual assignments. Ideally, the information about assignments and their associated fuel consumption would be provided by the carrier companies. In the absence of this data, we had to extract this information from raw data. In Chapter 2, we explained the challenges and approaches involved in transforming the raw data of three input files (Vehicle, Carrier, and Fleet data) into two new files containing trips and assignments. The vehicle data contains the configuration of a set of vehicles for which we have fleet data during a one-month period, from 2014-03-01 to 2014-03-31. The carrier data maps vehicles to their owner companies. The fleet data is a huge file containing all the status reports sent by the vehicles during March 2014. The features of all five files were explained, and the filters applied to the raw data were listed. The processing was performed using Python scripts. To summarize, we started with 19,992 vehicles in the Vehicle Data file, 20,284,529 messages in the Fleet Data file, and 7,624 carriers in the Carrier Data file. The result contains 33,386 assignments (comprising 1,678,733 trips) carried out by 1545 carriers with a total of 3708 trucks.

In Chapter 3 we built a model for predicting fuel consumption. We first explained how we augmented the trip file with additional information for each trip, including road features and a speed profile. We then reviewed the two main algorithms we used for modeling, namely decision trees and random forests, and explained how these models are implemented in the Spark processing framework. We configured the models with different parameters and evaluated their accuracy. Note that the Spark implementation makes it possible to construct the model with very large training datasets, on a cluster of machines if needed. Once the model is built, it can be used to predict fuel consumption for any given trip or assignment almost instantly. Finally, we showed that a model based on


random forests can predict fuel consumption with a mean square error as low as 0.0004.
