
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Scalable Gaussian Process Regression for Time Series Modelling

VIDHYARTHI BOOPATHI

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Scalable Gaussian Process Regression for Time Series Modelling

VIDHYARTHI BOOPATHI

Master in Computer Science
Date: September 10, 2019
Examiner: Henrik Boström
Supervisor: Vladimir Vlassov

Industrial Supervisor: Srikar Muppirisetty (Volvo Cars)
Swedish title: Skalbar Gaussisk processregression för modellering av tidsserier

School of Electrical Engineering and Computer Science


Abstract

Machine learning algorithms have applications in almost all areas of our daily lives, mainly due to their ability to learn complex patterns and insights from massive datasets. As data grows at a high rate, it is becoming necessary for algorithms to be resource-efficient and scalable. Gaussian processes are one of the most effective techniques for non-linear modelling, but they have limited practical applications due to their computational complexity. This thesis studies how parallelism techniques can be applied to optimize the performance of Gaussian process regression, and empirically assesses both parallel learning of a sequential GP and a distributed Gaussian Process Regression algorithm with Random Projection approximation implemented in the Spark framework. These techniques were tested on a dataset provided by Volvo Cars. The experiments show that training the GP model with 45k records, or $2^{19} \approx 10^6$ data points, takes less than 30 minutes on a Spark cluster with 8 nodes. With sufficient computing resources, these algorithms can handle arbitrarily large datasets.

Keywords: Distributed Machine learning, Spark, Gaussian Processes, Regression, Time series


Sammanfattning

Maskininlärningsalgoritmer har tillämpningar inom nästan alla områden i vårt dagliga liv. Detta beror främst på deras förmåga att lära sig komplexa mönster och insikter från massiva datamängder.

Med den snabba ökningen av datamängder blir det nödvändigt att algoritmerna är resurseffektiva och skalbara. Gaussiska processer är en av de effektiva teknikerna för icke-linjär modellering, men har begränsade praktiska tillämpningar på grund av sin beräkningskomplexitet. Den här uppsatsen studerar hur parallellismtekniker kan användas för att optimera prestanda för Gaussisk processregression och utvärderar parallell inlärning av en sekventiell GP samt en distribuerad Gaussian Process Regression-algoritm med Random Projection-approximation implementerad i Spark-ramverket. Dessa tekniker testades på en datamängd från Volvo Cars. Experimenten visar att det krävs mindre än 30 minuter att träna GP-modellen med 45k poster eller $2^{19} \approx 10^6$ datapunkter på ett Spark-kluster med 8 noder. Med tillräckliga datorresurser kan dessa algoritmer hantera godtyckligt stora datamängder.

Nyckelord: Distribuerad maskininlärning, Spark, Gaussiska proces- ser, Regression, Sensormodellering, Tidsserier


Acknowledgements

I would like to express my special thanks to my Volvo supervisor, Srikar Muppirisetty, for giving me the opportunity to write my thesis on this topic and for guiding me throughout the project. Thank you for the time spent on invaluable discussions, ideas and feedback that motivated a clear research direction for the project.

I would also like to thank my supervisor at KTH, Vladimir Vlassov, and my examiner, Henrik Boström, for their helpful insights and guidance during our meetings, which certainly improved my thesis. Thanks to my fellow thesis workers at Volvo Cars for all the good times, for supporting each other and for sharing ideas to understand the concepts.


Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Methodology
1.5 Ethics, Benefits, Sustainability
1.6 Delimitations
1.7 Thesis Outline

2 Extended Background
2.1 Time Series
2.2 Sensor Modelling
2.3 Gaussian Processes
2.3.1 Gaussian Process Regression
2.3.2 Covariance Functions
2.3.3 Scalable Gaussian Process Regression
2.4 Parallel and Distributed methods
2.4.1 Local Training
2.4.2 Distributed Training
2.5 Apache Spark
2.5.1 Spark MLlib
2.5.2 Spark Resource Management
2.6 Related Work

3 Methodology
3.1 Implementation Design Choices
3.2 Parallelizing a single-machine algorithm
3.3 Distributed Algorithm
3.4 Performance Evaluation Metrics
3.4.1 Accuracy measures
3.4.2 Computational Performance Measures
3.5 Experimental Setup
3.5.1 Testing environment
3.5.2 Dataset
3.5.3 Parameters
3.5.4 Accuracy Experiments
3.5.5 Scalability Experiments
3.5.6 Speed-Up Experiments

4 Results
4.1 Spark Configuration settings
4.2 Findings
4.3 Discussion
4.4 Challenges

5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work


Chapter 1

Introduction

Over recent years there have been dramatic advances in machine learning, which have led to increasingly powerful machine learning algorithms and AI applications and boosted the adoption of machine learning across a wide range of industries. In the automotive industry, machine learning is supporting the development of Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD).

Autonomous vehicles use a variety of sensors, such as camera, LiDAR, radar and GPS, to continuously monitor and capture data about the surrounding environment. The raw data from the sensors are processed to detect objects around the vehicle, and parameters describing their status, such as position, speed, convex hull and lifetime, are estimated and represented as multi-dimensional time series data [1].

Safety functions are then designed using these parameters in order to perform adaptive cruise control or implement collision avoidance systems. For reliable functioning of ADAS, accurate sensor performance in real traffic is essential [2]. It is therefore crucial to test and evaluate the quality of the sensors before they are deployed.

This is usually done by taking long test drives, collecting raw data from the sensors in different settings, and subsequently using the collected data for testing. However, this method of data collection is time-consuming and expensive, and not all possible road-traffic situations can be covered by field tests, even extensive ones. Hence, virtual verification methods are used to simulate these situations, so that many scenarios can be tested in parallel. Modelling of the sensors is therefore required for the virtual verification process.


One intuitive way to learn about the performance of a sensor is to compare the signal output from the sensor under study with the corresponding signal from another sensor that has higher accuracy.

The difference between the outputs can be used to construct an error model of the sensor under study [1]. In this study, we are interested in developing a model for an off-the-shelf sensor from Volvo Cars, referred to as the production sensor, taking the LiDAR measurements as the ground truth. In particular, we are interested in modelling the error of the longitudinal position of tracked objects relative to a host car, as seen in Figure 1.1. The longitudinal error is defined as

\[
\mathrm{lgtErr} = \mathrm{lgtPos}_{\mathrm{ground\_truth}} - \mathrm{lgtPos}_{\mathrm{prod\_sensor}}
\]

Figure 1.1: Illustration of the setup and variables

In this project, sensor data provided by Volvo Cars will be used to model the error in the sensors. We intend to build a generative model from the data whose generative distribution is as close as possible to the empirical distribution.

1.1 Background

Time series data is a sequence of data points indexed in time order.

Time series modelling is the study of past observations of a time series in order to learn an appropriate model for the underlying data. It is useful in several ways, from describing the features that make up a time series to forecasting the future [3]. It plays an important role in nearly all fields of science and engineering, such as economics, finance, telecommunications, meteorology and business intelligence [4]. This domain was dominated for a long time by linear statistical methods such as ARIMA. However, starting from the early 1980s, it became increasingly clear that linear models are not suitable for many real-world applications [5]. Several non-linear models, such as the bilinear model [6], the threshold autoregressive model [7, 8, 9] and the autoregressive conditional heteroscedastic (ARCH) model [10], were proposed. However, the study of non-linear time series analysis and forecasting is still in its infancy compared to linear time series [5].

In the last two decades, machine learning models have gained interest, and models like Artificial Neural Networks (ANNs) have outperformed statistical methods [11, 12]. Later, other models such as support vector machines, decision trees and k-means clustering were explored in the context of non-linear time series analysis and forecasting [13, 14]. Much research has focused on applying Bayesian inference to time series analysis, owing to the fact that it allows robust modelling even under high uncertainty. Gaussian process (GP) regression models have become a standard tool in Bayesian signal estimation due to their expressiveness, robustness to overfitting and tractability [15]. Gaussian processes are a powerful tool for probabilistic inference over functions and have become a popular method for non-parametric modelling, especially for time series data; a wide variety of kernel functions have been proposed for different modelling tasks [16, 17, 18], and [19] reports that GPs outperform radial basis function neural networks.


1.2 Problem

In practice, most traditional machine learning algorithms are designed as single-node learning algorithms and assume that the data can be easily accessed, but accessing data from disk is usually slow. If an algorithm is computationally expensive or its data will not fit into main memory, then it will not scale well, or it will not be able to process all of the training samples in a reasonable amount of time. However, increasing the number of training samples often increases the accuracy of the models [20], so in order to handle large datasets one must turn to distributed implementations of machine learning algorithms: algorithms that run on several workstations which communicate over a network interface. Distributing the learning can be a solution for large-scale learning where computational complexity and memory limitations are the major hindrances.

There are two main challenges that arise when scaling out machine learning to tackle large-scale problems. The first challenge relates to algorithm design: in order to learn in a distributed setting one must determine how to partition the training data across the worker nodes, how to assign the computations to each worker and how the workers should communicate with each other to achieve global convergence.

The second challenge relates to implementation and accessibility [21].

Big data machine learning is in its early stages; algorithms and tools for distributed machine learning are being actively researched and developed.

Many platforms have recently emerged specifically for the purpose of distributed machine learning. The architectural design of these platforms determines their scalability, performance and availability [21]. Examples of these frameworks include MapReduce [22], Apache Spark [23, 24], Apache Mahout, H2O, the Distributed Machine Learning Toolkit (DMTK) and MADlib. Apache Spark is the most well-known and widely adopted distributed computing framework for big data solutions.

The expressive power of a Gaussian process (GP) model comes at the cost of poor scalability in the data size [25]. For a dataset of size $N$, training scales as $O(N^3)$, which practically limits GPs to datasets of size $O(10^4)$ [26]. Scaling Gaussian processes to big datasets and deploying them on big data flow engines is an active area of research.

Much previous work has proposed efficient ways to reduce the computational complexity of Gaussian Process Regression [27, 28, 29]. To improve the scalability of GPs, two broad classes of methods have been proposed: (a) low-rank approximate representations of the full GP, and (b) localized regression [30, 31, 32, 33], together with covariance tapering methods [34] and compactly supported covariance functions [35]. Low et al. present a low-rank-cum-Markov approximation (LMA) for GP models to work on big data. Deisenroth et al. and Zhang et al. provide parallel GP algorithms that use product-of-experts and mixture-of-experts models, respectively, to make local approximations.

Banerjee, Dunson, and Tokdar proposed a GP regression algorithm for large datasets using the random projection method and the Nyström approximation. However, they did not explore the performance of parallel computing techniques. This thesis builds upon that work and studies the parallelization of the algorithm with blocking techniques using Apache Spark's data structures.

1.3 Purpose

The aim of the thesis is to implement a distributed Gaussian Process Regression that overcomes the computational and memory constraints on the size of the dataset that can be trained, to investigate the level of improvement in terms of speed, and to compare it with parallelizing a single-machine GP regression algorithm. This leads to our main research question:

How can we use parallelism techniques to develop a scalable Gaussian Process Regression algorithm for a distributed setting?

In order to answer the research question, we break it into the following sub-questions:

1. How does the degree of parallelism affect the performance of learning?

2. How does the increase in dataset size influence time consumption?


3. How does the performance scale with each added worker?

4. What are the critical bottlenecks and limitations for designing and implementing a distributed Gaussian Process Regression algorithm?

The contribution of this thesis is (i) to investigate different parallelization techniques for developing a Gaussian Process Regression algorithm on the Spark framework, and (ii) to validate the algorithms by analyzing their accuracy, scalability and speedup by means of numerical experiments.

1.4 Methodology

The research methodology for this thesis is quantitative, empirical research. A literature review was performed to identify the open challenges in the state of the art and how parallelism techniques can be applied to optimize the performance of Gaussian process regression, followed by the design and implementation of the algorithms. The designs and implementations were done in Apache Spark, a popular open-source distributed computing framework, ensuring the reproducibility of the work and also contributing to the open-source community. Chapter 3 provides a more detailed description of the methodology and solution design.

1.5 Ethics, Benefits, Sustainability

As discussed in Section 1.1, efficient active safety (AS) systems in autonomous cars are essential for transportation safety and therefore for society. This thesis work is part of the development of this technology and raises ethical issues in terms of responsibility. In the decisive moments before or during an accident, AS systems introduce a certain level of control shift from the driver to the system. All active safety systems that are tested and developed using the modelled sensors are classified as driver aids, so the manufacturer has to guarantee that these systems will not cause any accident by interfering with the driver. The driver should be able to regain full control of the systems at all times.

The immediate impact of this thesis work can be seen in R&D, as it aids in safely testing active safety functions in a virtual environment, eliminating the necessity for test expeditions. This work contributes to cheaper and quicker development of active safety systems, which in turn introduces the technology faster to the market, thereby offering improved road safety to the public.

This thesis work was done at Volvo Cars with resources and datasets provided by them. Volvo's data and other confidential information are not shared in this report or elsewhere. This work does not claim superiority over other works, nor does it falsify the results of the experiments. Appropriate care has been taken to avoid plagiarism, and citations are provided wherever required.

1.6 Delimitations

This thesis mainly focuses on the scalability of the implemented model. Accuracy measures are not extensively tested and validated. This thesis does not evaluate distributed frameworks for deep learning.

1.7 Thesis Outline

This thesis is organized as follows. Chapter 2 describes all necessary theory and algorithms that are required to understand the implemented model and summarizes the related work. Chapter 3 outlines the chosen method and the motives behind choosing it. Chapter 4 describes the findings of the study and discusses the results. Chapter 5 summarizes the work, provides a note on the limitations and makes suggestions for future research.


Chapter 2

Extended Background

2.1 Time Series

A time series T is a sequence of observations usually indexed in time order.

\[
T = \{t_1, t_2, \ldots, t_n\}, \quad t_i \in \mathbb{R}
\]

A time series is often the result of observing an underlying process or system over time and collecting data points at regular time intervals. A time series can be univariate, when a sequence of measurements of the same variable is collected over time, or multivariate, when several variables are measured simultaneously within the same time range. A time series is characterized by the fact that observations close together in time will be more closely related than observations that are far apart in time.

A time series can be continuous or discrete. In the continuous case, observations are recorded over some time interval, e.g. (0, 1), whereas in the discrete case measurements are taken at discrete points in time. Consecutive measurements in a discrete time series are usually taken at uniformly spaced time intervals, such as daily, weekly, monthly or yearly. A continuous time series can be transformed into a discrete series by merging data over a specified time interval. For example, time series datasets can be generated by storing stock prices, recording measurements from sensors equipped on a car, or monitoring the blood pressure of a patient. The pattern in time series data can be visualized as a graph by plotting the observations against the corresponding time.

2.2 Sensor Modelling

Sensor error can be defined as the difference between the measurements from the production sensor and the corresponding measurements from the LiDAR (ground truth). If $S_l(t)$ and $S_{ps}(t)$ are respectively the measurements from the LiDAR and the production sensor, then the error between the sensor measurements can be defined as

\[
e(t) = S_l(t) - S_{ps}(t)
\]

A generic formulation of the sensor error model is of the form

\[
e(t) = \alpha^{(0)} e(t-1) + \sum_{j=1}^{N} \alpha^{(j)} x_j(t) + \varepsilon(t)
\]
\[
e(t) = \beta(t) + \varepsilon(t)
\]

where $x_j(t)$ are the feature values at time step $t$ and $\varepsilon(t)$ is the noise.

The error model consists of two components: (1) the deterministic error $\beta(t)$ and (2) the stochastic error $\varepsilon(t)$. The deterministic error $\beta(t) = \alpha^{(0)} e(t-1) + \sum_{j=1}^{N} \alpha^{(j)} x_j(t)$, also called the systematic error, is predictable using the input features $\{x_j(t)\}_{j=1}^{N}$ and the previous error $e(t-1)$. The stochastic error $\varepsilon(t)$, also known as the random error, is random bias or noise. The essential idea of sensor error modelling is to fit a suitable model to the given sensor error data and to estimate the coefficients of the deterministic error, and the variance and covariance of the noise, such that the noise is well characterized in the system.
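As a concrete illustration of this formulation, the following minimal Python sketch simulates an error trajectory from the model above; the coefficient values and feature matrix are arbitrary placeholders, not estimates from the Volvo data.

```python
import numpy as np

def simulate_sensor_error(alpha0, alphas, X, noise_std, seed=0):
    """Simulate e(t) = alpha^(0) e(t-1) + sum_j alpha^(j) x_j(t) + eps(t).

    X         : (T, N) array of feature values x_j(t)
    alpha0    : coefficient on the previous error e(t-1)
    alphas    : length-N coefficients on the features (deterministic part)
    noise_std : standard deviation of the stochastic error eps(t)
    """
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    e = np.zeros(T)
    for t in range(1, T):
        beta_t = alpha0 * e[t - 1] + X[t] @ np.asarray(alphas)  # deterministic
        e[t] = beta_t + rng.normal(0.0, noise_std)              # + stochastic
    return e

# Example with two made-up features and made-up coefficients.
X = np.random.default_rng(1).normal(size=(100, 2))
e = simulate_sensor_error(alpha0=0.8, alphas=[0.5, -0.3], X=X, noise_std=0.1)
```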

2.3 Gaussian Processes

Gaussian processes (GPs) are a class of stochastic processes that have proved very successful for performing inference directly over the space of functions [15, 37, 38]. They were originally introduced in geostatistics under the name "Kriging" [39, 40], and over the past decades GPs have been widely studied and researched in the machine learning community.

In the following, a brief note on Gaussian processes is given; for a detailed description, refer to [15, 41]. A Gaussian process can be defined as a collection of random variables $f = f_1, f_2, f_3, \ldots$, any finite number of which is jointly Gaussian distributed, $f \sim \mathcal{GP}(m, k)$. A Gaussian process is fully determined by its mean $m$ and a kernel/covariance function $k$, such that

\[
p(f_1, f_2, f_3, \ldots) = \mathcal{N}\left(
\begin{bmatrix} m(x_1) \\ m(x_2) \\ m(x_3) \\ \vdots \end{bmatrix},
\begin{bmatrix}
k(x_1, x_1) & k(x_1, x_2) & k(x_1, x_3) & \cdots \\
k(x_2, x_1) & k(x_2, x_2) & k(x_2, x_3) & \cdots \\
k(x_3, x_1) & k(x_3, x_2) & k(x_3, x_3) & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\right)
\]

i.e.,

\[
p(f(x_1), f(x_2), f(x_3), \ldots \mid x_1, x_2, x_3, \ldots) = \mathcal{N}(m, K)
\]

where $K$ is the full covariance matrix of the function values $f(x_1), \ldots, f(x_n)$ and $m$ is the corresponding mean vector. Through Gaussian processes, various kinds of functions can be modelled, not only linear functions but also complex non-linear functions, by modifying the mean and covariance function of this normal distribution. This property highlights the flexibility of GPs.
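Because any finite collection of function values under a GP is jointly Gaussian, sample functions can be drawn by evaluating the mean and covariance on a grid and sampling from the resulting multivariate normal. The following is a minimal sketch, using an RBF covariance (defined in Section 2.3.2) as an arbitrary choice:

```python
import numpy as np

# Inputs at which we evaluate the GP; any finite set of function values
# is jointly Gaussian with mean vector m and covariance matrix K.
x = np.linspace(0.0, 10.0, 100)
m = np.zeros(len(x))                                  # zero mean function

# RBF covariance k(x, x') = sigma^2 exp(-(x - x')^2 / (2 lambda^2)).
sigma2, lengthscale = 1.0, 1.0
K = sigma2 * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * lengthscale ** 2))

# Draw three sample functions from N(m, K); a small jitter on the diagonal
# keeps the covariance numerically positive definite.
samples = np.random.multivariate_normal(m, K + 1e-10 * np.eye(len(x)), size=3)
```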

2.3.1 Gaussian Process Regression

GP regression is formulated as follows: given a training set $S = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is an element of some input space $\mathcal{X}$ and $y_i \in \mathbb{R}$ is the corresponding output of a latent function $f$, we assume that the latent function $f$ comes from a non-zero-mean Gaussian process with a covariance function $k(x_i, x_j)$. Each observation $y_i$ can be thought of as related to the underlying function $f(x)$ through a Gaussian noise model:

\[
y_i = f(x_i) + e_i, \quad i = 1, \ldots, n,
\]

where $e_i \sim \mathcal{N}(0, \sigma^2)$ is white noise independent of $f(x_i)$. Let the value of $f$ at a test location $x_*$ be denoted by $f_* = f(x_*)$. The joint distribution of $(f_*, y)$ is

\[
p(f_*, y) = \mathcal{N}\left(0, \begin{bmatrix} k_{**} & k_x^T \\ k_x & K_{xx} + \sigma^2 I_n \end{bmatrix}\right),
\]

where $k_{**} = k(x_*, x_*)$, $k_x$ is the column vector whose $i$-th element is $k(x_*, x_i)$, $K_{xx}$ is the kernel matrix whose $ij$-th element is $k(x_i, x_j)$, and $I_n$ is the $n \times n$ identity matrix. The subscripts of $k_{**}$, $k_x$ and $K_{xx}$ indicate the two sets of points between which the covariance is computed. The predictive distribution of $f_*$ given $y$ is

\[
p(f_* \mid y) = \mathcal{N}\left(k_x^T (K_{xx} + \sigma^2 I_n)^{-1} y,\; k_{**} - k_x^T (K_{xx} + \sigma^2 I_n)^{-1} k_x\right).
\]

The predictive mean $k_x^T (K_{xx} + \sigma^2 I_n)^{-1} y$ gives the point prediction of $f$ at location $x_*$, whose uncertainty is measured by the predictive variance $k_{**} - k_x^T (K_{xx} + \sigma^2 I_n)^{-1} k_x$.

To apply GP regression to time series data, we reframe time series prediction as follows: assume that example time series $\bar{x}^1, \ldots, \bar{x}^M$ are given, where $\bar{x}^j = (x_1^j, \ldots, x_{T_j}^j)$ with $x_t^j \in \mathcal{X}$. The task is then to infer the successor $x_{t+1}^j$ from its history $(x_1^j, \ldots, x_t^j)$, for all $j$ and $t$. Under the Markov assumption, all but the last history entry become irrelevant. This leads to a regression problem with input-output pairs $\{(x_t^j, x_{t+1}^j)\}_{j=1,\ldots,M}^{t=1,\ldots,T_j-1}$, which can be modelled by GP regression, provided the $x_t^j$ are real vectors.

The main aim of the thesis is to implement an efficient way to calculate the predictive mean and variance, making the regression model scalable to large datasets.

2.3.2 Covariance Functions

A covariance function, or kernel function, measures the nearness or degree of similarity between two points. A covariance function encodes the assumption that if two points $x$ and $x'$ are similar, then $f(x)$ and $f(x')$ should be similar. The kernel function $k(x_i, x_j)$ characterizes the correlations between different points. Four standard covariance functions are discussed in this section.

Radial Basis Function (RBF)

A radial basis function (RBF) kernel, also commonly known as the squared exponential kernel, is given by

\[
k_{\mathrm{RBF}}(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2\lambda^2}\right)
\]

where $\lambda$ is the lengthscale parameter, which controls the horizontal scale over which the function changes, and $\sigma^2$ is the variance, which controls the vertical scale. One prominent characteristic of an RBF kernel is its smoothness assumption: RBF assumes that the underlying function is smooth and infinitely differentiable.

For unit variance, an RBF covariance value falls between 0 and 1, $k_{\mathrm{RBF}}(x, x') \in [0, 1]$. RBF defines similarity as closeness: the closer the point $x$ is to $x'$, the more $k(x, x') \to 1$. In contrast, as the difference $|x - x'|$ grows, the value decays to zero. The rate of decay is controlled by the $\lambda$ parameter.

A small lengthscale produces a more rapidly changing, 'wiggly' looking function: it makes the covariance decay to zero very quickly as the distance between points grows, which explains the rapid variation. Conversely, a large lengthscale causes slow change, which in effect creates a very smooth function. The lengthscale parameter plays an important role in our assumptions about what the function looks like. To model a rapidly changing function, a small-lengthscale RBF kernel should be used; in contrast, to extrapolate a value that is far from the training data, a large lengthscale should be chosen instead.
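The lengthscale effect can be seen numerically: under the RBF kernel above, the same pair of points is strongly correlated for a large $\lambda$ and nearly uncorrelated for a small one. A quick sketch with unit variance:

```python
import numpy as np

def k_rbf(x, x_prime, variance=1.0, lengthscale=1.0):
    # Similarity decays with squared distance, at a rate set by the lengthscale.
    return variance * np.exp(-(x - x_prime) ** 2 / (2.0 * lengthscale ** 2))

print(k_rbf(0.0, 2.0, lengthscale=5.0))   # ~0.92: slowly varying, smooth fit
print(k_rbf(0.0, 2.0, lengthscale=0.5))   # ~3e-4: rapidly varying, 'wiggly' fit
```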

Rational Quadratic (RQ)

A rational quadratic (RQ) kernel is given by

\[
k_{\mathrm{RQ}}(x, x') = \sigma^2 \left(1 + \frac{(x - x')^2}{2\alpha\lambda^2}\right)^{-\alpha}
\]

An RQ kernel can be seen as an infinite sum of RBF kernels with multiple lengthscales [42]. An RQ kernel does not assume that the function f is smooth, unlike the RBF kernel. Because of that, the RQ kernel is more appropriate to model a non-smooth, rough function.

In addition to the $\lambda$ and $\sigma$ parameters, an RQ kernel has another parameter, the power parameter $\alpha$, which defines how quickly the local variation occurs. A larger $\alpha$ value indicates more rapid local variation.

Periodic

A periodic covariance function is defined as follows

\[
k_{\mathrm{Per}}(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2\left(\pi(x - x')/p\right)}{\lambda^2}\right)
\]

A periodic kernel gives a repeating structure over the function, which is controlled by the period parameter $p$. The lengthscale $\lambda$ and variance $\sigma^2$ have the same effect as in an RBF kernel. A larger $p$ causes slower oscillation, while a smaller $p$ causes faster oscillations.

Linear

The linear covariance function is defined as follows

\[
k_{\mathrm{Lin}}(x, x') = \sigma^2 x x'
\]

It is a very simple kernel, whose only parameter is the variance $\sigma^2$, which controls the vertical scale of the function.

Combining Kernels

It is possible to combine several covariance functions together to model a more complex function. The only thing to remember is that the resulting covariance matrix must be a positive semi-definite matrix.


The two operations below, addition and multiplication, are two ways to combine covariance functions together while keeping the positive semi-definite property intact.

\[
k(x, x') = k_1(x, x') + k_2(x, x')
\]
\[
k(x, x') = k_1(x, x')\, k_2(x, x')
\]
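As an aside, scikit-learn's kernel classes (used later for the single-machine baseline) overload + and * to build such combinations directly; in that library, ExpSineSquared and DotProduct correspond to the periodic and linear kernels described above. A small sketch:

```python
from sklearn.gaussian_process.kernels import (
    RBF, RationalQuadratic, ExpSineSquared, DotProduct)

# Sums and products of positive semi-definite kernels remain valid kernels,
# so structured covariances can be composed from the standard ones.
smooth = RBF(length_scale=10.0)                        # long-term smooth trend
seasonal = ExpSineSquared(length_scale=1.0, periodicity=1.0)
linear = DotProduct()

k_additive = smooth + seasonal       # trend plus repeating structure
k_product = linear * RationalQuadratic()   # locally varying linear trend
```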

2.3.3 Scalable Gaussian Process Regression

Let $y \in \mathbb{R}$ be the noisy output of a function $f(x)$. Let $X = \{x_i\}_{i=1}^{n}$ denote a set of $n$ training points, each of which belongs to an input space $\mathcal{D}$, and let $y \in \mathbb{R}^n$ represent the corresponding vector of training outputs.

For a zero-mean GP, with the covariance function,

\[
\mathbb{E}[f(x_i) f(x_j)] = k(x_i, x_j) + \sigma^2 \mathbf{1}[i = j],
\]

where $k$ is the kernel function, $\sigma^2$ is the variance of the observation noise, and $\mathbf{1}[\cdot]$ is the indicator function, the predictive distribution over the output $f_*$ at a test point $x_*$ is normally distributed. The mean and variance of the predictive distribution are given by

\[
\mu_* = k(x_*)^T (K + \sigma^2 I_n)^{-1} y
\]
\[
\sigma_*^2 = k(x_*, x_*) - k(x_*)^T (K + \sigma^2 I_n)^{-1} k(x_*) + \sigma^2
\]

where $K$ is the kernel matrix whose $ij$-th element is $k(x_i, x_j)$, $k(x_*)$ is the column vector whose $i$-th element is $k(x_*, x_i)$, and $I_n$ is the $n \times n$ identity matrix.


Algorithm 1: Algorithm for Scalable Gaussian Process Regression

Input: Training set $S$ represented by the feature space $X = \{x_i\}_{i=1}^{n}$ and the corresponding outputs $y \in \mathbb{R}^n$ of a function $f$
Output: Predictive distribution of $f$ at a test location $x_*$

1. Compute the covariance matrix $K$ of order $n \times n$, whose $ij$-th element is $k(x_i, x_j)$ for a chosen kernel function $k$;
2. Generate a matrix $\Omega$ of order $n \times r$, in which the $(i, j)$-th element is $\frac{1}{\sqrt{r}}\,\omega_{ij}$, where $\omega_{ij} \sim \mathcal{N}(0, 1)$ and $r \ll n$;
3. Calculate the matrix product $K\Omega$;
4. Compute the SVD decomposition of the matrix $K\Omega$ and denote the left factor of its rank-$m$ spectral decomposition by $\Phi^T$;
5. Compute $K_1 = \Phi K \Phi^T$;
6. Perform the Cholesky factorization $K_1 = BB^T$;
7. Compute the Nyström factor $C = K\Phi^T (B^T)^{-1}$;
8. Compute the spectral decomposition $C = UDV^T$;
9. Calculate the approximate spectral decomposition $K \approx \hat{K} = UD^2U^T$;
10. Compute $(\hat{K} + \sigma^2 I)^{-1}$ using the Woodbury matrix identity;
11. The best estimate $y_*$ for the test input $x_*$ is given by
\[
\mu_* = k(x_*)^T \left(\sigma^{-2} I - \sigma^{-4} U (D^{-2} + \sigma^{-2} I)^{-1} U^T\right) y
\]

The computational overhead lies in the $O(n^3)$ cost of computing the inverse of the full covariance matrix $K + \sigma^2 I_n$ and the $O(n^2)$ cost of storage. One method to cut the computational cost is to use the Nyström approximation [43, 44] to compute an approximate spectral decomposition of the kernel matrix, $K \approx \hat{K} = UD^2U^T$, and then apply the Woodbury matrix identity [45] as follows:

\[
\begin{aligned}
(K + \sigma^2 I)^{-1} &\approx (\hat{K} + \sigma^2 I)^{-1} \\
&= (U D^2 U^T + \sigma^2 I)^{-1} \\
&= (\sigma^2 I)^{-1} - (\sigma^2 I)^{-1} U \left(D^{-2} + U^T (\sigma^2 I)^{-1} U\right)^{-1} U^T (\sigma^2 I)^{-1} \\
&= \sigma^{-2} I - \sigma^{-2} U \left(D^{-2} + \sigma^{-2} U^T U\right)^{-1} U^T \sigma^{-2} \\
&= \sigma^{-2} I - \sigma^{-4} U \left(D^{-2} + \sigma^{-2} I\right)^{-1} U^T
\end{aligned}
\]

where the last step uses $U^T U = I$, since the columns of $U$ are orthonormal.

In the equation above, $D^{-2} + \sigma^{-2} I$ is a diagonal matrix, whose inverse can be obtained by simply taking the reciprocals of its diagonal entries.


Thus the direct matrix inversion that causes the major computational bottleneck is avoided, and the computational cost reduces to $O(nm^2)$. The exact steps are shown in Algorithm 1. Steps 2 to 9 are based on the work of Banerjee, Dunson, and Tokdar [28], who proposed a solution using random projection and the Nyström approximation of the matrix $K$. Our implementation further employs the Woodbury matrix identity to simplify the matrix inversion. In the implementation we choose $m = r$. This algorithm is implemented in the Apache Spark framework and takes advantage of Spark's distributed linear algebra and matrix computations.
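For reference, the following is a minimal single-machine NumPy sketch of Algorithm 1 with $m = r$. It is illustrative rather than the actual thesis code: the thesis performs the same steps on Spark's distributed matrices, and the function name and interface here are our own.

```python
import numpy as np

def scalable_gp_predict(K, k_star, k_ss, y, sigma2, r, seed=0):
    """Sketch of Algorithm 1: random projection + Nystrom + Woodbury.

    K      : n x n kernel matrix of the training inputs
    k_star : length-n vector k(x*, x_i) for one test input
    k_ss   : scalar k(x*, x*)
    y      : length-n training outputs
    sigma2 : observation-noise variance
    r      : projection rank, r << n (we take m = r, as in the thesis)
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]

    # Steps 2-3: random projection Omega with entries N(0, 1) / sqrt(r).
    omega = rng.standard_normal((n, r)) / np.sqrt(r)
    # Step 4: left singular factor Phi^T of K @ Omega (rank m = r).
    phi_T, _, _ = np.linalg.svd(K @ omega, full_matrices=False)   # n x r
    # Steps 5-6: K1 = Phi K Phi^T and its Cholesky factor B (K1 = B B^T).
    K1 = phi_T.T @ K @ phi_T
    B = np.linalg.cholesky(K1)
    # Step 7: Nystrom factor C = K Phi^T (B^T)^{-1}.
    C = np.linalg.solve(B, (K @ phi_T).T).T
    # Steps 8-9: C = U D V^T gives K ~= K_hat = U D^2 U^T.
    U, d, _ = np.linalg.svd(C, full_matrices=False)

    # Step 10: Woodbury, (K_hat + s2 I)^{-1}
    #   = s2^{-1} I - s2^{-2} U (D^{-2} + s2^{-1} I)^{-1} U^T,
    # where the inner matrix is diagonal, so inversion is elementwise.
    inner = 1.0 / (1.0 / d**2 + 1.0 / sigma2)
    def apply_inverse(v):
        return v / sigma2 - U @ (inner * (U.T @ v)) / sigma2**2

    # Step 11: predictive mean and variance at the test point x*.
    mean = k_star @ apply_inverse(y)
    var = k_ss - k_star @ apply_inverse(k_star) + sigma2
    return mean, var
```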

2.4 Parallel and Distributed methods

In order to scale up machine learning algorithms, one needs to devise parallel and distributed algorithms that run faster and reduce the training times. There are several methods to parallelize and distribute the computations across multiple machines and multiple cores. Some of these methods are listed below.

2.4.1 Local Training

In local training, the data and the model are stored on a single machine. In multi-core processing, the whole model and the data are assumed to fit on a single machine with multiple cores, which share memory. The multiple cores are used to run multiple mini-batches in parallel. Multi-core processing and GPUs can be used together for computationally intensive subroutines such as matrix multiplication.

2.4.2 Distributed Training

When it is not possible to store the entire dataset or the model on a single machine, it becomes necessary to store them across multiple machines.


Algorithm 2: Algorithm for parallelizing a machine learning algorithm

Input: Dataset $S$ and a single-machine learning algorithm $L$
Output: Model for the given dataset

1. Partition the dataset $S$ into $n$ subsets $S_1, S_2, \ldots, S_n$ based on a sampling procedure;
2. Run the learning algorithm $L$ on each of the subsets, producing models $M_1, \ldots, M_n$;
3. Process the models $M_1, \ldots, M_n$ with a combining procedure, which either selects one of the $n$ models or combines them to produce a final model;

Figure 2.1: Parallel learning using data partitioning

Data parallelism

When the dataset is too large to be stored on a single machine, it is stored across multiple machines for faster training, and the same model is run in parallel on each subset of the data stored on the multiple machines. Each machine independently tries to estimate the same set of parameters. The machines exchange their estimates, and the final estimate is obtained by aggregating or combining the estimates from each machine using a suitable combining procedure. Parallelizing learning using this method is described in Algorithm 2, and Figure 2.1 depicts a general model of this approach.

Model parallelism

When the model is too big to fit on a single machine, one can resort to model parallelism, where the model is split across different machines. In model parallelism, the same data is sent to all machines, and each machine is responsible for estimating a different set of parameters.

For example, in deep learning, a single layer can be stored and executed on a single machine, and forward and backward propagation involve communicating the output from one machine to another in a serial manner.

2.5 Apache Spark

Apache Spark [23] is an open-source cluster computing framework with an in-memory data processing engine, developed at the AMPLab at UC Berkeley. Frequently used data is cached in memory, so Spark avoids the cost of writing intermediate data to disk. For this, Spark makes use of the concept of a Resilient Distributed Dataset (RDD), which represents a read-only, partitioned collection of data spread over one or many machines that can be operated on in parallel. The core of Spark is written in Scala, but it also provides developer APIs for programming languages such as Python, R and Java.

In Spark, a computation is modeled as a directed acyclic graph (DAG), where vertices represent RDDs and edges represent the operations to be performed on RDDs. For example, an edge E from vertex P to Q in a Spark DAG may imply that RDD Q is the result of applying operation E on RDD P. The two types of operations supported by Spark RDDs are transformations and actions. A transformation (map(), filter(), union(), etc.) returns a new, modified RDD by performing an operation on the original RDD and maintains a pointer to the parent RDD. An action (first(), count(), collect(), etc.) returns a value by performing some computation on an RDD. A typical Spark job may have one or more transformations on an RDD, and applying each transformation incrementally builds an RDD lineage with all parent RDDs of the final RDDs.

Figure 2.2: Spark Architecture

Spark uses a master/slave architecture: as shown in Figure 2.2, a Spark cluster consists of a single master (driver) and multiple workers (slaves). When a Spark application is submitted, a driver process is created. The driver is responsible for converting Spark user code into actual Spark jobs and negotiating resources with the cluster manager.

The cluster manager acquires resources and launches executors on worker nodes. Each executor is a JVM instance responsible for executing the tasks sent to it by the driver. The Spark driver comprises two scheduler components: the DAGScheduler and the TaskScheduler. The DAGScheduler splits the operator graph into stages of tasks and submits each stage to the TaskScheduler, which launches tasks via the cluster manager.

Fault tolerance: Spark uses the DAG to remember the lineage of operations executed on RDDs. If a partition of an RDD is lost due to the failure of a worker node, Spark recomputes it from its parent RDD using the lineage of operations. To achieve fault tolerance for all RDDs, the data is replicated among many executors on the worker nodes in the cluster.

2.5.1 Spark MLlib

Spark's distributed machine learning library, MLlib, targets large-scale learning settings that benefit from data parallelism or model parallelism to store and operate on data or models. MLlib provides fast and scalable implementations of many common learning algorithms, including classification, regression, clustering and dimensionality reduction, simplifying large-scale machine learning. It also includes many statistical and optimization algorithms, as well as utilities for linear algebra computations. Spark MLlib integrates seamlessly with other Spark components such as Spark SQL, GraphX and Spark Streaming. MLlib fits into Spark's APIs and is usable from Java, Scala, R and Python. In this project, we utilize the library's implementation of matrix representations and its matrix multiplication interface.

MLlib supports local vectors and matrices, which are stored on a single machine, and also distributed matrices, stored across machines. Here, a brief explanation of distributed matrices is provided, as the project implementation utilizes them. Distributed matrices in Spark can be expressed as four different types, each of which can be easily converted to the others.

RowMatrix and IndexedRowMatrix are row-oriented distributed matrices where each row is a local vector. The difference between the two is that an IndexedRowMatrix has meaningful row indices, whereas a RowMatrix does not. Since each row is represented by a local vector, the number of columns is limited by the integer range.

A CoordinateMatrix is a distributed matrix where each entry is a tuple (Long, Long, Double) consisting of a row index, a column index and the entry value. This representation is preferred when the matrices are huge and very sparse.

A BlockMatrix is a collection of MatrixBlocks, where a MatrixBlock is a tuple ((Int, Int), Matrix), in which (Int, Int) denotes the block index and Matrix is the sub-matrix at the given index. This matrix representation supports methods such as add and multiply with another BlockMatrix.
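The sketch below builds a small IndexedRowMatrix and converts it to a BlockMatrix in order to multiply it with its transpose; the data is a toy placeholder.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

sc = SparkSession.builder.appName("blockmatrix-demo").getOrCreate().sparkContext

# A 4 x 4 distributed matrix from an RDD of indexed rows...
rows = sc.parallelize([IndexedRow(i, [float(i + j) for j in range(4)])
                       for i in range(4)])
# ...repartitioned into 2 x 2 sub-matrix blocks.
A = IndexedRowMatrix(rows).toBlockMatrix(rowsPerBlock=2, colsPerBlock=2)

# add() and multiply() operate block-wise; Spark schedules the per-block
# products as parallel tasks across the cluster.
G = A.multiply(A.transpose())
print(G.numRows(), G.numCols())   # 4 4
```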


Figure 2.3 shows different ways of partitioning a matrix. Each grey block represents a small chunk of the matrix which can be allocated to a partition. As shown in the figure, a local matrix is one big block and therefore cannot be distributed. In a RowMatrix, rows are distributed among the partitions, under the assumption that the number of columns is not huge, since an entire row has to be stored on a single machine. In a BlockMatrix, MatrixBlocks are distributed among partitions.

Figure 2.3: An illustration of how a matrix is stored by Spark's different data structures.

Spark incurs additional overhead to partition and distribute the matrices. The trade-off is to balance the level of parallelism against the costs incurred for partitioning and distributing. Insufficient partitioning can result in nodes being idle, whereas too much partitioning can result in queues on every node. Other factors, such as data transfer speed and the memory limits of the nodes, should also be considered when distributing matrices.


2.5.2 Spark Resource Management

Efficient resource allocation is a key factor for running any Spark application. Different nodes in a Spark cluster may have different amounts of resources, and running the same task on nodes with different amounts of resources can be problematic and complicated. Therefore, Spark uses executors to address this. Executors, as mentioned in the previous section, are virtual instances launched for an application and hosted by worker nodes. All executors, across different nodes and within the same node, have the same amount of resources, such as CPU cores and memory, allocated to them. The amount of resources to be allocated to the executors is specified by the user in spark-submit when executing the Spark application.

Figure 2.4 shows an example of a Spark cluster consisting of two nodes. Both nodes have 4 CPU cores, but one has 16 GB of memory while the other has only 12 GB. While executing an application, if the user specifies 2 CPU cores and 8 GB of memory per executor, then 3 executors will be created by Spark: two using the resources of the first node, and a third from the second node. However, 2 CPU cores and 4 GB of memory on the second node will be left unused. Therefore, providing the right configuration settings for executors is essential for good performance of Spark applications.

Figure 2.4: An illustration of the process of creating homogeneous executors from resources available on a cluster.

There are several factors to consider when deciding on an executor configuration. Consider the most granular approach, creating one executor per core, which results in a large number of small executors. One might think that having more executors to run more concurrent tasks would give better performance, but the weak executors cannot process large tasks that larger executors could handle, and this approach also incurs executor initialization overhead. Now consider the least granular approach, creating one executor per node, which gives the biggest executors; this also has a bad influence on the performance of the Spark application, due to overusing resources without leaving any for the OS and other daemons. Spark recommends leaving at least 1 core and 1 GB of RAM for the OS and other processes. It is therefore important to strike a balance between the two approaches to improve the performance of the Spark application.

An executor splits the memory allocated to it into two sections, namely execution memory and storage memory. Execution memory is used for the actual computation of the application, for example shuffles, joins, sorts and aggregations. Storage memory is used for caching data that will be reused later. The difference between the two is that execution memory can be released as soon as the computation is done, whereas storage memory must keep its data for future computations and cannot be wiped directly; eviction is left to an LRU policy. Spark handles the memory usage between the two through unified memory management: instead of statically reserving memory in advance, Spark deals with memory contention when it arises by forcing data to spill.
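In PySpark, these settings can be passed through SparkConf (or as equivalent spark-submit flags). The values below mirror the hypothetical 2-core / 8 GB executors discussed above; they are not the thesis's tuned configuration.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("gp-regression")
        .set("spark.executor.cores", "2")       # cores per executor
        .set("spark.executor.memory", "8g")     # heap per executor
        .set("spark.executor.instances", "3")   # total executors requested
        .set("spark.memory.fraction", "0.6"))   # unified execution+storage pool
sc = SparkContext(conf=conf)
```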

2.6 Related Work

Traditional machine learning algorithms can perform well when applied to small, low-dimensional datasets, but when applied to large datasets the accuracy and performance of these algorithms decline significantly [46]. Algorithms and tools for distributed environments are being actively developed. Large-scale learning has received considerable attention in recent years, and many successful techniques have been proposed and implemented [47, 48, 49].

The different approaches proposed in the literature can be categorized into three main groups [50]:

• Design a fast algorithm

• Use a relational representation, and

• Partition the data

One of the promising research lines for large-scale machine learning is distributed computing, since allocating learning to several machines is a natural solution for scaling up learning [51]. Approaches have been proposed for distributed large-scale learning for classification, such as ensemble methods, stacking and boosting [52, 53], and other strategies such as clustering and voting [54, 55], but the body of work dealing with distributed implementations of regression algorithms is still evolving.

A broad variety of software is available for working with Gaussian processes, implementing either parallel algorithms using approximation methods or non-parallel algorithms for exact computation, such as [56, 57, 58]. In this thesis, rather than modifying these techniques, distributed implementations of them are explored and investigated. The current version of Spark does not include a Gaussian Process Regression implementation. In this thesis, a distributed version of Gaussian Process Regression using sparse approximation is implemented using Spark. In addition, parallelizing a single-machine GP algorithm is explored and compared with the distributed approach.


Chapter 3

Methodology

3.1 Implementation Design Choices

In this thesis, two types of approaches for scaling up machine learning algorithms using Apache Spark have been studied:

1. Parallelizing single-machine algorithms

2. Designing new algorithms particularly for distributed settings

Before going into detail on these two approaches, the motivation for choosing Gaussian Process Regression and Apache Spark for this study is explained.

Motivation for choosing Gaussian Process

Consider a simple linear regression model for time series data, given by

\[
Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t
\]

One of the assumptions this model makes about the error term $\varepsilon_t$ is that $E(\varepsilon_t \varepsilon_{t-1}) = 0$, that is, there is no autocorrelation in the errors: the errors at one time period are not correlated with the errors in the preceding time period. This assumption is most likely not met when modelling time series data.

One of the most prominent time series models is the ARIMA (autoregressive integrated moving average) model, which is a combination of the AR (autoregressive) model and the MA (moving average) model, with a differencing procedure to perform de-trending.

The autoregressive model represents the current value of a time series as a linear combination of its past values; the number of previous values used, often called the order of the autoregressive model, is denoted by $p$. An autoregressive model of order $p$, or AR($p$), is given by

\[
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t
\]

where $\phi_1, \phi_2, \ldots, \phi_p$ are the parameters of the model, $c$ is a constant, and $\varepsilon_t$ is white noise, $\varepsilon_t \sim \mathcal{N}(0, \sigma_N^2)$.

A moving average (MA) model expresses the current value of a time series as a linear combination of several past errors. The moving average model of order $q$, or MA($q$), is

\[
y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q},
\]

where $\theta_1, \theta_2, \ldots, \theta_q$ are the parameters of the moving average.

The combination of the AR($p$) and MA($q$) models is the ARMA($p$, $q$) model, given by

\[
y_t = c + \varepsilon_t + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
\]

Trend in the time series is removed by the differencing procedure, which computes the difference between consecutive observations:

\[
y'_t = y_t - y_{t-1}
\]
\[
y''_t = y'_t - y'_{t-1} = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2})
\]


An ARIMA model is an ARMA model with a $d$th order of differencing. The 'integrated' term comes from this differencing procedure; it does not mean 'integration', which is the opposite of differencing.

An ARIMA model is often denoted ARIMA($p$, $d$, $q$), where $p$ is the order of the AR part, $q$ is the order of the MA part, and $d$ is the order of differencing. An ARIMA(1, 0, 2) is equal to an ARMA(1, 2). The ARIMA model is popular in the time series domain and is well studied and theoretically mature.

These models estimate $y$ as a linear function of the inputs $x$ with some parameters $\theta$, often much fewer than the number of training samples $N$: $|\theta| \ll N$. These models are parametric in the sense that a finite number of unknown parameters have to be inferred as part of the modelling process. When learning a model from time series, we deal with noisy datasets that lead to uncertainty about which function is most appropriate for modelling the available data.

Therefore, we would like models that allow us to work with the infinite space of all functions that have the characteristics of the underlying data, and that make predictions as probability distributions over likely functions rather than being characterized by an explicit set of parameters to be inferred. Such an approach is called non-parametric modelling and is based on probability theory, so these models are often referred to as Bayesian non-parametric models. Gaussian processes (GPs) are a particular member of the Bayesian non-parametric family and are a method of choice for probabilistic non-linear regression, as their non-parametric nature allows flexible modelling through kernel modification and produces reasonable predictions without manual parameter tuning [39]. A Gaussian process (GP), as defined by Rasmussen and Williams [15], is a generalization of the Gaussian probability distribution: instead of being a distribution over a finite number of variables, a Gaussian process is a distribution over an infinite number of variables, or functions.

Motivation for choosing Apache Spark

The technique of using a cluster of machines for distributed linear algebra has a long history; see, for example, [59]. However, these systems are not efficient, as they are not fault-tolerant to hardware failures and also assume random access to non-local memory. Gaussian Process Regression involves complex matrix computations and requires handling large matrices. In the recent past, MapReduce [22] was perceived as a generic parallel programming model, and there have been attempts to implement matrix computations on the MapReduce framework (e.g. HAMA [60]). But this was not efficient, as MapReduce has a lot of overhead and does not make good use of distributed memory [23]. Several data flow engines that generalize MapReduce have been developed for large-scale data processing, and in particular, Apache Spark [24] has emerged as a widely used open-source engine.

Spark is a fault-tolerant cluster computing framework and provides implementations of distributed matrices. Spark's distributed linear algebra and optimization libraries target large-scale matrices that benefit from row, column, entry, or block sparsity to store and operate on distributed and local matrices. Spark has several features that are particularly attractive for matrix computations. The linalg library comprises fast and scalable implementations of standard matrix calculations for linear algebra, including basic operations such as multiplication as well as advanced operations such as factorization.

It also supports a variety of functions to compute column and block statistics.

3.2 Parallelizing a single-machine algorithm

A serial, single-machine algorithm can be scaled by partitioning the data, learning a model from each partition, and combining all the predictions by a suitable averaging method. This approach avoids the need to run the algorithm on a very large dataset at once.

The procedure for parallelizing a single-machine learning algorithm was given in Algorithm 2 and is applied as follows.

A variety of procedures exist for sampling instances from a large dataset, among them random sampling, duplicate compaction and stratified sampling [20, 61]. The training dataset is first sampled using the random sampling technique to create subsets of the data. Gaussian Process Regression (GPR) from scikit-learn is applied to each of the subsets, producing a model for each of them. The models are merged, or averaged, to produce a final model, as described in Algorithm 2.

3.3 Distributed Algorithm

In distributed learning, a best-fit regression model is learnt from the given training data, where every machine only has access to its own part of the data and to some shared information that is periodically exchanged over the network. In this study, a scalable Gaussian Process Regression algorithm, employing random projection for matrix approximation, the Nyström approximation for the spectral decomposition of the covariance, and the Woodbury matrix identity for simplifying the matrix inversion, as described in Algorithm 1, is implemented using the Spark framework.

3.4 Performance Evaluation Metrics

The performance of the distributed learning algorithm is evaluated based on predictive accuracy and computational speed. This section briefly introduces the accuracy measures most suitable for time series modelling, as well as the scalability measures. In this thesis, we focus mainly on the speed and scalability of the model.

3.4.1 Accuracy measures

Root Mean Squared Error

The root mean squared error between the true and simulated error trajectories using model $\mathcal{M}$, for each episode of length $T$, is computed as:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left(e(t) - e_{\mathcal{M}}(t)\right)^2}
\]


KL Divergence for continuous data

Kullback-Leibler divergence, also known as KL divergence, is a measure of the difference between two distributions. However, in this case we do not have access to the distribution of the true error.

So, we resort to empirical methods to calculate the KL divergence [62].

The empirical KL divergence is described below.

Calculation of Empirical KL divergence

Let there be two distributions $P$ and $Q$, and two sets of i.i.d. samples from these two distributions. Let $\{X_1, X_2, \ldots, X_n\}$ be the set of $n$ samples from distribution $P$ and $\{Y_1, Y_2, \ldots, Y_m\}$ the set of $m$ samples from distribution $Q$. The empirical KL divergence is then calculated as follows. First, sort the samples from distribution $Q$ to obtain $\{Y_{(1)}, Y_{(2)}, \ldots, Y_{(m)}\}$ such that $Y_{(1)} \leq Y_{(2)} \leq \ldots \leq Y_{(m)}$. Partition them into $T_m$ segments $\{I_i^m\}_{i=1,\ldots,T_m}$ such that each segment consists of $l_m$ samples, $T_m = \lfloor m / l_m \rfloor$, $l_m \leq m$. For $i = 1, \ldots, T_m$, let $k_i$ denote the number of samples from $P$ that fall into segment $I_i^m$.

\[
\widehat{\mathrm{KL}}(P \| Q) = \sum_{i=1}^{T_m - 1} \frac{k_i}{n} \log \frac{k_i / n}{l_m / m} + \frac{k_{T_m}}{n} \log \frac{k_{T_m} / n}{l_m / m + \delta_m}
\]

where $\delta_m = (m - l_m T_m)/m$ is the correction term for the last segment.

If $l_m, T_m \to \infty$ as $m \to \infty$, then the divergence estimator satisfies $\widehat{\mathrm{KL}}(P \| Q) \to \mathrm{KL}(P \| Q)$ as $n, m \to \infty$. It should be noted that $k_i$ may be zero for some segments, in which case $\widehat{\mathrm{KL}}(P \| Q)$ is not defined; we therefore set $k_i = \epsilon$ for a small value $\epsilon$ whenever $k_i = 0$.

Furthermore, for evaluation, the symmetric divergence measure $\widehat{\mathrm{KL}}(P \| Q) + \widehat{\mathrm{KL}}(Q \| P)$ is considered.
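A direct NumPy sketch of this estimator follows; the segment construction and the epsilon handling mirror the description above, while the variable names and default epsilon are our own choices.

```python
import numpy as np

def empirical_kl(p_samples, q_samples, l_m, eps=1e-10):
    """Empirical KL(P||Q): Q's sorted samples define T_m segments of l_m
    samples each; k_i counts the P samples per segment; zero counts are
    replaced by eps to keep the estimator defined."""
    p = np.asarray(p_samples, dtype=float)
    q = np.sort(np.asarray(q_samples, dtype=float))
    n, m = len(p), len(q)
    T_m = m // l_m
    delta_m = (m - l_m * T_m) / m               # correction for last segment

    edges = q[l_m * np.arange(1, T_m)]          # T_m - 1 interior boundaries
    k = np.bincount(np.searchsorted(edges, p), minlength=T_m).astype(float)
    k[k == 0] = eps

    q_mass = np.full(T_m, l_m / m)
    q_mass[-1] += delta_m                       # last segment absorbs remainder
    return float(np.sum((k / n) * np.log((k / n) / q_mass)))

# Symmetric measure used for evaluation:
# empirical_kl(P, Q, l) + empirical_kl(Q, P, l)
```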

3.4.2 Computational Performance Measures

Speed

Speed can be described as how much speed-up we can hope to achieve given more nodes or processors. Let $T_{1,n}$ denote the run-time on one node given an input of size $n$. If we have $p$ nodes, then the speed-up of the model is defined as

\[
\mathrm{SpeedUp}(p, n) = \frac{T_{1,n}}{T_{p,n}}
\]

Scalability

Scalability is the capability of the model to handle a growing number of data samples, or its potential to be enlarged to accommodate that growth.

If $\mathrm{SpeedUp}(p, n) = \Theta(p)$, then we say that the model is strongly scalable.

Time Complexity

The time complexity $T(n)$ of an algorithm is a measure of the amount of time required by the algorithm for an input of a given size $n$.

We define the time taken by an algorithm independently of factors such as processor speed, instruction set, disk speed or compiler. Time complexity is measured as the number of times the principal activity of the algorithm is performed until its completion.

We represent the time function $T(n)$ using big-O notation to express the model's run-time complexity.

3.5 Experimental Setup

3.5.1 Testing environment

The Spark cluster in our testing environment consists of 8 identical nodes. There are 20 cores per node and 125 GB RAM per node.

Apache Spark 2.3.1 with Scala 2.11.10 and Python 2.7 is used for testing, along with the external libraries ScalaNLP Breeze 2.11, JFreeChart 1.0.19 and scikit-learn 0.18.1.


3.5.2 Dataset

The dataset for the project was provided by Volvo Cars. It contains sequences of tracked objects and their properties, including their type, speed, position and lifetime, with a maximum sequence length of approximately 900. As part of pre-processing, objects with a very short lifetime (i.e., the duration for which the object is tracked by the sensor) were filtered out of the dataset: all sequences shorter than 25 were removed, as doing so makes the features more reliable. Input variables were chosen by computing pairwise correlations between all variables; the variables that correlated least with the output variable were removed.
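A hedged pandas sketch of this pre-processing is shown below; the file name, the column names (object_id, lgt_err) and the correlation threshold are all hypothetical, since the actual Volvo schema is confidential.

```python
import pandas as pd

df = pd.read_csv("tracked_objects.csv")       # hypothetical input file

# Drop objects tracked for fewer than 25 time steps (short lifetimes).
seq_len = df.groupby("object_id")["object_id"].transform("size")
df = df[seq_len >= 25]

# Drop input variables that correlate least with the output variable.
corr = df.drop(columns=["object_id"]).corr()["lgt_err"].abs()
df = df[["object_id"] + list(corr[corr >= 0.1].index)]  # threshold is made up
```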

3.5.3 Parameters

In our study, the available datasets were divided into two parts: training and test datasets. The training dataset was used to train the model, and the test set was used to measure the accuracy of the results.

The sizes of the training and test datasets were important, as they influenced the scalability, speedup and accuracy measures. Another parameter considered was the level of parallelism, based on the number of cores used and the number of data partitions. Learning time was also considered as a parameter for comparing the methods. In most of the experiments, one parameter was varied while the others were kept fixed. Three kinds of experiments were conducted.

3.5.4 Accuracy Experiments

Accuracy experiments were conducted to evaluate the goodness of fit of the models. As accuracy criteria, the Root Mean Squared Error (RMSE) and KL divergence metrics were used. The results are more reliable because the model was trained on one part of the dataset (the training data) and the results were measured on the other part (the test data).
