GPU Implementation of the Particle Filter


Department of Electrical Engineering

GPU Implementation of the Particle Filter

Master's thesis (Examensarbete) in Electrical Engineering, carried out at the Institute of Technology, Linköping University

by
Joakim Gebart

LiTH-ISY-EX--13/4698--SE

Linköping 2013


Supervisors: Niklas Wahlström, ISY, Linköpings universitet
             Gustaf Hendeby, FOI

Examiner: David Törnqvist, ISY, Linköpings universitet


Division: Automatic Control (Reglerteknik), Department of Electrical Engineering, SE-581 83 Linköping
Date: 2013-06-15
Language: English
Report category: Master's thesis (Examensarbete)
URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-94190
ISRN: LiTH-ISY-EX--13/4698--SE
Title: GPU implementation av partikelfiltret (GPU Implementation of the Particle Filter)
Author: Joakim Gebart

Abstract

This thesis work analyses the obstacles faced when adapting the particle filtering algorithm to run on massively parallel compute architectures. Graphics processing units are one example of massively parallel compute architectures that allow the developer to distribute computational load over hundreds or thousands of processor cores. This thesis studies an implementation written for NVIDIA GeForce GPUs, yielding varying speed-ups, up to 3000% in some cases, compared to the equivalent algorithm performed on a CPU.

The particle filter, also known in the literature as sequential Monte-Carlo methods, is an algorithm used for signal processing when the system generating the signals has highly nonlinear behaviour or non-Gaussian noise distributions, where a Kalman filter and its extended variants are not effective. The particle filter was chosen as a good candidate for parallelisation because of its inherently parallel nature. There are, however, several steps of the classic formulation where computations depend on other computations in the same step, which requires them to be run in sequence instead of in parallel. To avoid these difficulties, alternative ways of computing the results must be used, such as parallel scan operations and scatter/gather methods.

Another area where parallel programming is still not widespread is pseudo-random number generation. Pseudo-random numbers are required by the algorithm to simulate the process noise as well as for avoiding the particle depletion problem using a resampling step. In this thesis a recently published counter-based pseudo-random number generator is used.


Contents

1 Introduction
  1.1 Prior work
  1.2 Outline

2 Theory
  2.1 Linear state-space model
  2.2 General state-space model
  2.3 Bayesian filtering
  2.4 Linear filtering
  2.5 Nonlinear filtering
  2.6 Particle filtering
    2.6.1 Time update
    2.6.2 Measurement update
    2.6.3 Resampling
    2.6.4 Summary

3 Parallel programming in CUDA
  3.1 Brief history of GPGPU
  3.2 CUDA framework
  3.3 CUDA programming model
  3.4 Compute capability
  3.5 Branching and flow control
  3.6 CUDA memory architecture
  3.7 Coalescing memory operations
  3.8 High level libraries
    3.8.1 Thrust template library
    3.8.2 Random123 random number generation library
    3.8.3 Eigen linear algebra library

4 Parallelisation
  4.1 Time update
  4.2 Random number generation
    4.2.1 Recursive PRNGs
    4.2.2 Counter-based PRNGs
    4.2.3 The Box-Muller transform
    4.2.4 The Kaiser and Dickman method
  4.3 Measurement update
    4.3.1 Residual, e^i_k
    4.3.2 Likelihood, p(y_k | x^i_k)
  4.4 Resampling
    4.4.1 Parallel binary prefix operations
    4.4.2 Parallel reduction
    4.4.3 Normalizing weights
    4.4.4 Computing the cumulative probability distribution
    4.4.5 Computing the number of particle copies
    4.4.6 Copying particles
  4.5 Complexity summary

5 Implementation
  5.1 C++ templates
  5.2 Function objects
  5.3 Particle filtering framework
    5.3.1 The time update
    5.3.2 The measurement update
    5.3.3 Resampling

6 Results
  6.1 Hardware
  6.2 Software
  6.3 Applications
    6.3.1 2-dimensional terrain navigation using altitude sensor and range meter
    6.3.2 Tracking targets captured by an image processing system
  6.4 Comparing floating point results
  6.5 Memory usage
  6.6 Correctness of the implementation
  6.7 Execution time
    6.7.1 Impact of choice of random number generator
    6.7.2 2-dimensional terrain navigation
    6.7.3 Visual tracking

7 Conclusions
  7.1 Number of particles in relation to number of processing cores
  7.2 Comparing results between architectures
  7.3 Online applications
  7.4 Future work
    7.4.1 Implement Kalman filter time update function object
    7.4.2 Evaluating more resampling methods
    7.4.4 Parallelising the CPU implementation


1 Introduction

Graphics processing units (GPUs) have evolved tremendously over the last decade. Today they are quite capable in terms of computing abilities and are therefore used in areas other than computer graphics. An advantage of using GPUs for general-purpose computing is their large number of computing cores, usually numbering in the hundreds on a single chip, compared to the 4, 8, or even 12 cores in the CPUs available today. GPUs are a cheap source of computing power, but they should still be regarded as a complement to, not a replacement for, general purpose processors because they have other associated limitations. These limitations come from design decisions needed to achieve the high density of cores, such as less cache and the difficulty of dealing with divergent branches; this is discussed at further length in chapter 3.

Particle filtering is a signal processing method that is suitable for nonlinear systems and non-Gaussian noise distributions where Kalman filtering, extended Kalman filtering and unscented Kalman filtering do not perform well. The particle filter algorithm is computationally demanding because a large number of potential states or 'particles' must be processed in each time step; however, because many of the computations are not dependent on each other, it is suitable for parallelisation, where the workload is distributed over multiple processors to speed up the computations. This thesis studies one implementation where the computations in a particle filtering problem are parallelised and run on an NVIDIA GeForce GPU, yielding speed-ups when compared to the equivalent algorithm performed on a CPU. The accuracy of the implementation is evaluated by comparing the filter output on the GPU to the filter output of a reference implementation running on the CPU.

This thesis report is mainly targeted at engineers and engineering students interested in signal processing, computer science, and parallel algorithm implementation.

1.1 Prior work

Particle filters have been tested on parallel architectures before [Brun et al., 2002], [Teulière and Brun, 2003], including GPUs [Montemayor et al., 2004], [Hendeby et al., 2010]. However, these early examples of GPGPU (general-purpose computing on graphics processing units) particle filtering were written for OpenGL or Direct3D, before computing frameworks such as CUDA and OpenCL were developed. Chao et al. [2010] describe a particle filter CUDA implementation that attempts to reduce the number of particles needed to achieve a good filter output by increasing the ratio of effective particles, additionally using a technique which allows for a localized resampling scheme, which yields a performance benefit when compared to a classical particle filter on a GPU.

Different resampling methods have been studied in parallelisation contexts [Míguez, 2007], including new resampling schemes for utilizing parallel architectures [Bolic et al., 2004] and specifically GPUs [Murray, 2012].

1.2 Outline

The report begins with an introduction to state-space models, estimation and filtering in chapter 2. The particle filter algorithm is also introduced, and a detailed breakdown of the algorithm is provided which is used as the foundation for the rest of the report. An introduction to the CUDA framework and programming model is provided in chapter 3 for readers with no experience in parallel programming. Chapter 4 analyses each part of the particle filter algorithm with regard to the limitations and possibilities outlined in the previous chapter and provides solutions for calculating the results in parallel to allow performance gains on massively parallel architectures, such as GPUs. Chapter 5 introduces some important C++ concepts and provides a short description of the particle filtering framework created during the writing of this thesis. Three example applications of the particle filter are presented in chapter 6. These applications are used to collect results, and the gains in execution time from the parallelisation of the filter are presented in the same chapter, along with an analysis of the correctness of the implementation. The collective results are further discussed in chapter 7 and some conclusions are drawn. At the end, some suggestions for future work within the same domain as this thesis are presented.

2 Theory

Sensor measurements always contain undesired noise when used in the physical world. The purpose of signal processing is to extract useful information from noisy signals. For simple problems, such as removing a disturbing tone from a sound recording, it is often enough to simply remove the undesired frequencies from the signal spectrum using a band stop filter, but for more complex problems better results can be achieved by using more of the knowledge about the system being measured. In many cases it is not even possible to directly measure the quantity that is needed; instead some signal processing is required to obtain an estimate from some other measurement source.

2.1 Linear state-space model

To use knowledge about the system being measured it is often useful to formulate a state-space model of the system. The state-space model describes how the system behaves over time. For example, if the initial position, velocity and direction of a car are known, it is possible to predict the future position of the car by integrating the velocity over time.

A simple 2-dimensional linear state-space model of a car moving in one dimension at a constant velocity is

x_1[k + 1] = x_1[k] + x_2[k] T_s    (2.1)

x_2[k + 1] = x_2[k]    (2.2)

where x_1 is the position and x_2 is the velocity of the vehicle, k is the current time step, and T_s is the time step length.

In most cases not all states are measurable. To represent this fact we add a measurement equation

y[k] = x_1[k]    (2.3)

where y[k] is the measurement. As can be seen in the above equation the only measurement is the position.

It is often too complicated, or not even possible, to model the system completely. To account for the error introduced by simplifications, such as neglecting air resistance and rolling resistance, or by errors in the state equations or parameters, we introduce process noise v[k]. The measurements also contain some noise; this is represented by measurement noise e[k]. The complete model is thus

x_1[k + 1] = x_1[k] + x_2[k] T_s + (T_s^2 / 2) v[k]    (2.4)

x_2[k + 1] = x_2[k] + T_s v[k]    (2.5)

y[k] = x_1[k] + e[k]    (2.6)

The process noise v[k] and measurement noise e[k] are independent, white stochastic processes with known probability density functions. The coefficients T_s^2/2 and T_s for the process noise come from sampling a continuous system and the assumption that the physical noise can be seen as an external unknown force acting on the vehicle; see Gustafsson [2010] for an explanation.

For a more thorough explanation of state-space models and model-based prediction, see e.g. Gustafsson [2010].
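To make the model concrete, the following is a minimal C++ sketch of simulating (2.4)-(2.6). The noise standard deviations, time step and initial state are hypothetical values chosen only for illustration; the thesis does not prescribe them.

#include <cstdio>
#include <random>

int main() {
    const double Ts = 0.1;  // time step length [s] (hypothetical value)
    std::mt19937 gen(0);
    std::normal_distribution<double> pv(0.0, 1.0);  // process noise v[k]
    std::normal_distribution<double> pe(0.0, 0.5);  // measurement noise e[k]

    double x1 = 0.0;   // position
    double x2 = 10.0;  // velocity
    for (int k = 0; k < 100; ++k) {
        double y = x1 + pe(gen);  // (2.6): only the position is measured
        double v = pv(gen);       // the same noise sample drives both states
        double x1next = x1 + x2 * Ts + 0.5 * Ts * Ts * v;  // (2.4)
        double x2next = x2 + Ts * v;                       // (2.5)
        x1 = x1next;
        x2 = x2next;
        std::printf("k=%3d  y=%8.3f\n", k, y);
    }
    return 0;
}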

For the remainder of this report the state will be treated as a vector, and the notation x[k] will be written as x_k for readability.

x_k = (x_1[k], ..., x_{n_x}[k])^T    (2.7)

2.2 General state-space model

There are many real-world systems that are inherently nonlinear, where the process noise term v_k would become too large if the system were approximated by a linear model, rendering the model useless for filtering. In order to cope with such systems we need a more general model that allows for nonlinear relations between the states.

A state-space model for a general nonlinear system is given by

x_{k+1} = f(x_k, u_k, v_k),  v_k ∼ p_{v_k}    (2.8)

y_k = h(x_k) + e_k,  e_k ∼ p_{e_k}    (2.9)

where x_k is the state vector, y_k is the measurement, and u_k is an external control input (e.g. output from a controller). v_k is a disturbance caused by model errors and noise in measuring u_k. e_k is noise in the measurement of y_k. The distributions of v_k and e_k can be arbitrary, but their probability density functions p_{v_k} and p_{e_k} are assumed known at filter design time. Only y_k and u_k are measurable.

The above model is a very general system model and very few assumptions are made about the modelled system which makes it suitable for nonlinear systems and/or non-Gaussian noise distributions.

Measurements and control inputs are assumed to arrive in order, see Orton and Marrs [2005] for an extension to systems with out-of-sequence measurements.

2.3 Bayesian filtering

Bayesian filtering is a statistical approach to filtering, using likelihoods and probability densities to extract useful information from a system. In Bayesian filtering the goal is to reconstruct the filtering distribution p(x_k | y_{1:k}) and prediction distribution p(x_{k+1} | y_{1:k}) of a system using all prior information, including all previous measurements. The filtering density describes the distribution of the state at the current time given all measurements up to the current time step. The prediction distribution describes the state distribution in the next time step given measurements up to the current time step, hence 'prediction'.

Bayesian filtering is a more general approach to the filtering problem than linear filtering since the theory can be applied to any system dynamics and arbitrary noise distributions.

Gustafsson [2010] gives a longer description of nonlinear filtering; below is a short summary.

Consider Bayes' theorem conditioned on the variable C

p(A | B, C) = p(B | A, C) p(A | C) / p(B | C)    (2.10)

Substituting A = x_k, B = y_k, C = y_{1:k−1} and recognizing the Markov property of state-space models (a state distribution based on all previous measurements incorporates all available information from the measurements; conditioning on both the state and the measurements therefore adds no new information compared to conditioning on only the state) yields

p(x_k | y_{1:k}) = p(y_k | x_k) p(x_k | y_{1:k−1}) / p(y_k | y_{1:k−1})    (2.11)

Equation (2.11) is known as a measurement update in Bayesian filtering because it incorporates the information from new measurements into the filtering distribution.

p(y_k | x_k) is called the likelihood of the measurement. The denominator in (2.11) can be expressed by using the law of total probability

p(y_k | y_{1:k−1}) = ∫_{R^{n_x}} p(y_k | x_k) p(x_k | y_{1:k−1}) dx_k    (2.12)

Using Bayes’ rule conditioned on C,

p(A, B|C) = p(A|B, C)p(B|C) (2.13)

with A = x_{k+1}, B = x_k, C = y_{1:k} and again recognizing the Markov property of state-space models yields

p(x_{k+1}, x_k | y_{1:k}) = p(x_{k+1} | x_k) p(x_k | y_{1:k})    (2.14)

Integrating both sides with respect to x_k over the entire state-space (using the law of total probability) yields

p(x_{k+1} | y_{1:k}) = ∫_{R^{n_x}} p(x_{k+1} | x_k) p(x_k | y_{1:k}) dx_k    (2.15)

This is called the time update in Bayesian filtering because it advances the state distributions to the next time step.

By using (2.11) and (2.15) recursively and initiating by

p(x_1 | y_0) = p(x_0)    (2.16)

the complete Bayesian filtering recursion is obtained:

p(x_k | y_{1:k}) = p(y_k | x_k) p(x_k | y_{1:k−1}) / p(y_k | y_{1:k−1})    (2.17a)

p(y_k | y_{1:k−1}) = ∫_{R^{n_x}} p(y_k | x_k) p(x_k | y_{1:k−1}) dx_k    (2.17b)

p(x_{k+1} | y_{1:k}) = ∫_{R^{n_x}} p(x_{k+1} | x_k) p(x_k | y_{1:k}) dx_k    (2.17c)

(2.17) are the equations that need to be solved to arrive at the filtering and prediction densities.

On-line filtering applications typically require a new estimate every time a new measurement is available. To prevent the computational complexity from increasing with the number of measurements (y_{1:k} grows by one measurement vector per new measurement), it is necessary to utilize the recursion in the Bayesian filtering framework and keep only the most current state estimate at all times. The actual computations thus become an iterative solution, with one time update per time step or "tick" and one measurement update per measurement.

2.4 Linear filtering

The linear filtering problem is to estimate the state of a linear dynamic system given observations of some measurable part of the system.

The model in section 2.1 has a linear state update and is thus a linear system. For linear systems with Gaussian noise distributions it can be proven that the Kalman filter is an optimal solution (with regard to minimum variance) to the problem of reconstructing the unknown state x_k from all previous (noisy) measurements y_{1:k} [Kalman, 1960].

2.5 Nonlinear filtering

The extended Kalman filter (EKF) is an extension of the Kalman filter that handles systems with minor nonlinearities in the system dynamics and Gaussian or almost-Gaussian noise distributions. In each time step of the EKF, the nonlinear system model is linearised about the current state estimate using the Jacobian (partial derivatives) of the nonlinear state prediction function and Taylor series expansions. The EKF is thoroughly discussed in Jazwinski [1970].

Another approach, the unscented Kalman filter (UKF) [Julier et al., 1995], is to let a number of points in the state-space propagate through the model and afterwards recover the distribution mean and covariance. In some cases, this method allows for prediction functions with greater nonlinearities than the EKF can handle.

The above solutions only work on models with small nonlinearities; otherwise the process noise term becomes too great and the filter yields very little useful information. The UKF is somewhat better at handling nonlinearities than the EKF, but it still fails for greater nonlinearities. To handle arbitrary distributions and greater nonlinearities it is better to use other statistical methods, such as a point mass filter (PMF) or Monte Carlo approaches such as a particle filter (PF). The particle filter is described in section 2.6.

2.6 Particle filtering

For linear systems with Gaussian noise distributions it can be proven that the Kalman filter is the optimal solution to the problem of estimating the system state given noisy measurements. However, many systems are highly nonlinear or have non-Gaussian noise distributions. For such systems particle filtering (PF), or sequential Monte-Carlo (SMC), is a better method for estimating the system's filtering distribution and prediction distribution. The particle filter was first introduced by Gordon et al. [1993].

The particle filter is a Bayesian approach to the filtering and prediction problems. The probability density function in the general Bayesian filtering framework is approximated by a set of Np particles or state-space vectors, each particle with its own associated weight. Thus, the density distributions can be seen as clouds of probability mass points.

The particle filter consists of three parts:

• Time update, for advancing the state, closely related to (2.17c).

• Measurement update, when new measurements arrive, closely related to (2.17a).

• Resampling, to prevent particle depletion.

The following sections describe each part in greater detail.

2.6.1 Time update

The time update uses the system model to predict the state distribution in the next time step and advances the time counter. This step represents the time update described in (2.17c) in section 2.3 and updates the prediction density in the Bayesian framework. Predicting the future state is performed using a model of the system,

x^i_{k+1} = f(x^i_k, u_k, v^i_k),  v^i_k ∼ p_{v_k}    (2.18)

where x^i_{k+1} is the new system state, x^i_k is the old system state, f(x_k, u_k, v_k) is a (possibly nonlinear) function for predicting the future state given the current state, u_k is a measurable control input, and v^i_k is a sample of the process noise, drawn from the process noise distribution p_{v_k}.

2.6.2 Measurement update

The measurement update modifies the particle weights according to the measurement noise distribution p_{e_k}. This step represents the measurement update step in the general Bayesian recursion in section 2.3 and updates the filtering density. The measurement update in the particle filter is

w^i_{k|k} = (1/c_k) w^i_{k|k−1} p(y_k | x^i_k)    (2.19)

where p(y_k | x^i_k) is the likelihood of the measurement; in an application this will be closely related to the measurement noise distribution. (2.19) can be compared to (2.17a) in the general Bayesian recursion, and the particle weight w^i_{k|k−1} can be compared to the previous prediction density p(x_k | y_{1:k−1}). c_k is a normalization constant

c_k = Σ_{i=1}^{N_p} w^i_{k|k−1} p(y_k | x^i_k)    (2.20)

(2.20) can be compared to (2.17b), substituting the integral over a continuous distribution with a sum over discrete samples.

With the normalization constant c_k defined as in (2.20), the sum of all particle weights after a measurement update is always 1.

2.6.3 Resampling

Because of process noise and measurement noise, particles will tend to diverge from the true state and all weights except one will tend to zero. This situation is called depletion or impoverishment because the pool of useful particles is depleted. To prevent depletion, particles having a very low weight should be replaced by particles of a higher weight.

There are many methods of resampling; some commonly used ones are multinomial resampling and systematic resampling [Douc et al., 2005].

Multinomial resampling can be likened to having all particles spread out around a wheel of fortune or roulette wheel, the width of the fields corresponding to the weights of the particles, and a needle pointing at the edge of the wheel [Blanco, 2009]. The needle points out one particle to put in the set of particles to use in the next time step. This method needs N_p spins of the wheel to sample N_p particles for the next time step. Systematic resampling, on the other hand, can be likened to having the same roulette wheel as in multinomial resampling, but with N_p needles spread out at even distances around the edge of the wheel. The systematic resampling method thus needs only a single spin to decide N_p particles for the next time step. Figure 2.1 shows an illustration of the two resampling schemes. The consequence of using multiple needles is that if a particle has weight greater than 1/N_p it is guaranteed that at least one copy will be made, while when using only one needle it is possible, however unlikely, that a particle of weight greater than 1/N_p will not be copied at all. Systematic resampling is thus still a stochastic process, but a more deterministic one than multinomial resampling.

Figure 2.1: To the left, multinomial resampling as a fortune wheel. To the right, systematic resampling with N_p = 8. Each coloured field represents a single particle, the width of the field corresponding to the weight of the particle, thereby giving a particle of greater weight a greater chance of being pointed out by a needle after a spin. Spinning the right wheel once will point out 8 particles at once, while the left requires one spin per particle.

The spin of the wheel can be simulated by drawing a number from a uniform distribution. Using u_n ∼ U(0, 1) to represent the location (distance from angle 'zero') of the needle along the edge of the spun wheel, it is necessary to find which particle is located at the chosen point. If all particles had the same weight it would be trivial to find the index of the particle pointed at: simply multiply the location by the number of particles and round the result to an integer to get a particle index. However, since the particles are not of equal weight this method will not work. An alternative way of finding the index is to first compute the cumulative sum of particle weights and then seek along this sequence to find the index of the particle whose region begins before the location of the needle and ends after it.


Let u be a vector of length N_p of needle locations, from now on called intersection levels, where N_p is the number of particles. The intersection levels can be chosen as uniformly distributed stochastic variables

u_n ∼ U(0, 1),  n = 0, 1, 2, . . . , N_p − 1    (2.21)

which is a more formal description of multinomial resampling. The name multinomial resampling comes from the fact that the number of copies of each particle is distributed according to the multinomial distribution, using the particle weights as parameters. To reduce the number of random samples needed, u can instead be chosen as a single small random offset with uniform steps above it

u_n = n/N_p + ξ,  n = 0, 1, 2, . . . , N_p − 1,  ξ ∼ U(0, 1/N_p)    (2.22)

This method with a single random offset is known as systematic resampling [Douc et al., 2005], stratified sampling [Carpenter et al., 1997] or stochastic universal sampling [Whitley, 1994]. In this report the term systematic resampling will be used to refer to this resampling scheme.

Given a finite number of particles, each with its own weight, calculate the cumulative weight of all particles up to and including the particle with index i:

cdf_k[i] = Σ_{j=1}^{i} w^j_{k|k}    (2.23)

cdf_k[i] is the cumulative density function of the discrete particle distribution. It is not necessary that the particles are ordered in any way. (2.23) is straightforward to compute sequentially in a computer program, while it takes some extra work to parallelise efficiently; this is further discussed in section 4.4.1.

The index, i, of the particle at location u_n on the roulette wheel can be found by solving

cdf_k[i] = u_n    (2.24)

for i. This is easily performed in a sequential fashion by a search through the sequence cdf_k, also known as performing a lower bound search for u_n in cdf_k. Given a uniformly distributed level u_n, it is more likely to draw a particle with greater weight than one with less weight. Figure 2.2 shows cdf_k[i] as a discrete function and the intersection levels u as horizontal lines. Each intersection between cdf_k and a level line adds a copy of the corresponding particle to the set of particles in the next time step. A particle that crosses more than one level line will be duplicated multiple times in the set of particles for the next time step, once per intersection.

Figure 2.2: Computed cumulative density function of a simulated distribution along with the levels from (2.22) with N_p = 10, ξ = 0.05. The dashed lines (level lines) can be seen as a representation of the needles on the roulette wheel in figure 2.1. Each intersection between the CDF and a level line adds a copy of the corresponding particle to the set of particles in the next time step. In this figure, the particle with index i = 6 will be used once (one level line intersecting the CDF at i = 6), the particle with i = 7 will be used twice (two level intersections), and the particle with i = 8 will be discarded (no intersections).
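The procedure described above translates almost directly into sequential code. Below is a minimal CPU sketch, assuming the weights are already normalized; the function name and the use of std::lower_bound are illustrative choices, not the thesis implementation (a parallel version is developed in section 4.4).

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Systematic resampling: returns, for each needle u_n from (2.22), the index
// of the first particle whose cumulative weight (2.23) reaches u_n, cf. (2.24).
std::vector<std::size_t> systematic_resample(const std::vector<double>& w,
                                             std::mt19937& gen) {
    const std::size_t Np = w.size();
    std::vector<double> cdf(Np);
    std::partial_sum(w.begin(), w.end(), cdf.begin());  // cumulative sum (2.23)

    // A single random offset plus uniform steps, cf. (2.22).
    std::uniform_real_distribution<double> U(0.0, 1.0 / Np);
    const double xi = U(gen);

    std::vector<std::size_t> idx(Np);
    for (std::size_t n = 0; n < Np; ++n) {
        const double u_n = static_cast<double>(n) / Np + xi;
        // Lower bound search for u_n in cdf, cf. (2.24).
        idx[n] = std::lower_bound(cdf.begin(), cdf.end(), u_n) - cdf.begin();
    }
    return idx;  // idx[n] is the particle to copy into slot n
}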

Since the resampling step is only used to prevent depletion, it is not necessary to perform the resampling after each measurement update. One method to reduce the computational load is to count the effective particles, N_eff, the number of particles with weights above a threshold value, and resample whenever that number falls below a defined quota. However, if the filter is designed to run in a real-time online situation it is necessary to be able to do the resampling before the next measurement arrives; thus the resampling should be fast enough to run between each measurement update, and the only reason to skip the resampling step would be to save power. In the implementations created in this thesis work no such optimizations are done, and resampling is performed after each measurement update.

2.6.4 Summary

Algorithm 1 lists the complete particle filter algorithm described in this chapter.

Algorithm 1 Particle filter

1. Distribute N_p particles over the state space

   x^i_0 ∼ p(x_0)    (2.25)

   where i = 1, 2, ..., N_p denotes the particle index.

2. Give each particle an initial weight

   w^i_{1|0} = 1/N_p    (2.26)

3. For each time step k = 1, 2, ...:

   • Time update: Update the estimated state for each particle

     x^i_{k+1} = f(x^i_k, u_k, v^i_k),  v^i_k ∼ p_{v_k}    (2.27)

     where u_k is a vector of all measurable external system inputs.

   • Measurement update: If there is a new measurement, recalculate particle weights

     w^i_{k|k} = (1/c_k) w^i_{k|k−1} p(y_k | x^i_k)    (2.28)

     where

     c_k = Σ_{i=1}^{N_p} w^i_{k|k−1} p(y_k | x^i_k)    (2.29)

     and p(y_k | x^i_k) is the likelihood of the measurement given the particle state.

   • Resampling: Sample the distribution in w_{k|k} N_p times to form a new particle set for the next time step.


3 Parallel programming in CUDA

Parallel programming means designing a program to execute many operations in parallel instead of sequentially one at a time. Some algorithms are parallel in their very nature and thus require almost no modifications to run in parallel, while others are quite sequential in their original formulation but can still be modified to run in parallel through clever changes to the algorithm flow.

3.1 Brief history of GPGPU

Graphics processing units, GPUs, are designed for rendering 3D graphics in games and applications, but GPUs have become more and more programmable since the first shader-capable cards were launched in 2001 with the GeForce 3 series. Shader programs are tiny programs that run in parallel on a GPU and are used in all modern games for calculating lighting and visual effects. By using carefully selected geometry and saving the output from the shader stages it was possible to use the GPU for computations other than displaying graphics on screen, but the close connection to graphics made it hard to use efficiently for unrelated problems, and it was mostly used for computer vision and image processing research.

In 2006, NVIDIA launched their CUDA framework [NVIDIA, 2012c] to ease adoption of GPU technology for general purpose computing. AMD launched their counterpart, Stream SDK, later the same year [AMD, 2006]. In 2008 the Khronos Group launched the OpenCL language in collaboration with Apple, IBM, AMD, NVIDIA and Intel [Khronos, 2008]. OpenCL is a heterogeneous computing framework, designed for executing on both CPU and GPU platforms.


3.2 CUDA framework

CUDA, or Compute Unified Device Architecture, is a parallel programming framework designed by NVIDIA for their GeForce and Tesla product lines. CUDA is a C-based language with some extensions for calling GPU kernels from CPU code. It is possible to interface with CUDA code from C++ programs just like interfacing with regular C functions.

Since its introduction in 2006, CUDA has become the most popular GPGPU frame-work with circa 11000 articles published [Google, a] compared to circa 4100 articles published containing the word OpenCL [Google, b].

Developers have created bindings for CUDA in other programming languages; Fortran, Haskell, Python, Java, MATLAB, Perl, Ruby, .NET are a few examples.

3.3 CUDA programming model

CUDA programs start executing on the CPU, but the CPU-side code can call functions known as kernels on the GPU to perform computations. Kernels are written in CUDA C, a superset of C with support for launching GPU kernels via a special syntax, illustrated in the sketch below; in addition, CUDA C supports some C++ constructs such as classes and templates.
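As a minimal illustration of the kernel concept and the launch syntax (the thread, block and grid terms are defined next), consider the following sketch; the kernel and its parameters are hypothetical examples, not code from the thesis framework.

#include <cuda_runtime.h>

// Kernel: each thread scales one element of the array.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    // ... copy input data to d_x (omitted) ...

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_x, 2.0f, n);  // CPU-side kernel launch
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}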

Some terms need to be explained to understand the next section. The following terms relate to the logical (software-level) threads:

• thread – a single thread of execution.
• block – a group of multiple threads that execute the same kernel.
• grid – a group of blocks.

Figure 3.1 shows how the different levels of logical thread groups relate to each other, along with what kind of memory is available at each level.

Figure 3.2 shows a block diagram of the Fermi architecture SM hardware. The following terms relate to the physical (hardware-level) threads:

• core – a single compute core; one core runs exactly one instruction at a time.
• warp – a group of threads that execute in parallel on the hardware; a warp consists of 32 threads on current generation CUDA hardware.

Figure 3.1: CUDA thread concepts. Source: NVIDIA [2009], figure used with permission from NVIDIA.

Kernels are executed by one or more Streaming Multiprocessors (SMs). A typical mid-to-high-end GeForce card from the Fermi family (GeForce 400 and GeForce 500 series) has 8-16 SMs on a single GPU (see section 6.1 for a more detailed description of the specific hardware used in this thesis). Each SM consists of 32 CUDA Cores (cores) on the hardware used in this thesis [NVIDIA, 2009]. Threads are scheduled for execution by the warp schedulers, seen at the top of figure 3.2. Each SM has two warp scheduler units that work in a lockstep fashion. The smallest unit that a warp scheduler can schedule is called a warp, which consists of 32 threads on all CUDA hardware released at the time of writing. Only one warp may execute at a time on each SM.

Threads in CUDA are much more lightweight than CPU threads: context switches are cheaper, and all threads of a warp execute the same instruction or have to wait while the other threads in the warp execute it. This is called Single Instruction Multiple Thread (SIMT) and is similar to traditional CPU Single Instruction Multiple Data (SIMD) instruction sets such as SSE, AVX, NEON, Altivec etc. This has consequences when using conditional statements, as described in section 3.5.

Figure 3.2: Fermi architecture streaming multiprocessor (SM) block diagram. Source: NVIDIA [2009], figure used with permission from NVIDIA.

To handle problems which demand more than 32 threads, the CUDA threads are arranged into logical groups called blocks and grids, of sizes that are defined by the software developer. A block is a 3-dimensional collection of threads; each thread in the block has its own individual 3-dimensional identification number to allow the developer to distinguish between the threads in the kernel code. Threads within a single block can share data through shared memory, which reduces the load on global memory. Shared memory has a much lower latency than global memory but is a limited resource; the user can choose between (per block) 16 kB shared memory and 48 kB L1 cache, or 48 kB shared memory and 16 kB L1 cache. Several blocks of threads can in turn be grouped into a grid. Grids are 3-dimensional arrays of blocks. The maximum block size is tied to the available hardware resources, while grids can be of (almost) arbitrary size. Blocks within a grid can only share data through global memory, which is the on-GPU memory with the highest latency; more on this in section 3.6.

The identification numbers of threads and blocks are 3-dimensional because many applications have a natural way of dividing up the threads in a 3-dimensional space; for example, in a finite element (FEM) simulation the IDs could be correlated to the x,y,z-position of the element handled by the thread. If a problem has fewer than three dimensions it is still possible to keep the length of the extra dimensions at 1, also known as singleton dimensions.

A Fermi GPU can have 48 warps (1536 threads) active at once per SM, given that the threads use little enough local and shared memory to fit all at the same time. Context switches between threads are fast since registers are allocated to the threads, and hence there is no need to save and restore registers and shared memory between thread switches. The result is that it is actually desirable to over-allocate the hardware, since this hides memory stalls inside the kernels by letting the warp schedulers switch the currently active warp whenever a stall occurs.

3.4 Compute capability

NVIDIA uses the term "compute capability" to refer to different versions of their CUDA hardware which have different capabilities. The Fermi card used in this thesis work is of compute capability 2.0. For a detailed description of what each compute capability version represents, see NVIDIA [2012b]. The major addition in compute capability 2.0 was full IEEE 754-2008 floating point support in both single- and double-precision computations. This is an important feature when comparing results computed on different architectures and is discussed further in section 6.4.

3.5 Branching and flow control

The thread warp is a hardware group of threads that execute on the same SM. The threads of a warp can be thought of as sharing a common program counter, hence all threads must execute the same line of program code. If the code has branching statements such as if ... then ... else, the warp must first execute the threads that enter the first block while the other threads of the warp wait, then the threads that enter the next block execute while the other threads wait, and so on. Because of this behaviour, conditional statements should be avoided in GPU code if possible. When threads of a warp follow different lines of execution it is known as having divergent threads. While conditional blocks should be kept to a minimum inside CUDA kernels, it is sometimes possible to reorder statements so that all threads of the same warp follow only a single path of execution in an if ... then ... else block, mitigating this limitation.
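The following sketch illustrates the difference; both kernels are hypothetical examples. In the first, threads of the same warp may take different paths and the two bodies execute one after the other; in the second, the same computation is expressed so that every thread follows one path (a conditional expression of this kind typically compiles to a predicated select rather than a branch).

// Potentially divergent: the two branch bodies are serialised within a warp.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f)
        x[i] = 2.0f * x[i];
    else
        x[i] = -x[i];
}

// Single execution path: all threads of the warp run the same instructions.
__global__ void uniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    x[i] = (v > 0.0f) ? 2.0f * v : -v;
}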

3.6 CUDA memory architecture

GPU memory is organized in a hierarchical structure with three primary levels: local, shared and global. Figure 3.1 shows how the memory levels relate to the threading concepts introduced in section 3.3. Each thread has a small amount of thread-local memory accessible only by that thread; local memory contents are stored in registers. Operations on local memory complete in one clock cycle. Each thread block has a somewhat larger amount of shared memory. Shared memory is accessible by all threads in the same block. The remaining memory is called global memory. Global memory has a high total throughput (many GB/s) but a much higher latency than accessing registers in the SM.

Current CUDA hardware can only access memory in 32-, 64-, or 128-byte blocks. In addition, there is a high latency for accessing global memory, on the order of hundreds of clock cycles. It is possible to hide this long latency by using more algorithm threads than the available hardware can run in parallel, letting the warp scheduler handle the memory stalls by issuing another warp when the active warp is waiting for a global memory operation to complete. The number of algorithm threads divided by the number of hardware threads is called the occupancy of a CUDA program.

3.7 Coalescing memory operations

All operations using global memory should be written in a way that allows the GPU to use all bytes in a 32-, 64- or 128-byte transaction. To achieve this it is necessary to use separate arrays for different variables belonging to the same particle. The common practice in C++ programs that run on a CPU is to create a struct or class to group together attributes that belong to the same object and then create a single array of such objects; this layout is called Array of Structs (AoS) and is typical of Object Oriented Programming (OOP). OOP focuses on arranging data in a way that makes logical sense to the software architect and makes it easier to design programs. The opposite of OOP is called Data Oriented Programming (DOP); the focus in DOP is to arrange the data in a way that is more efficient for the computer hardware, grouping data primarily by which attribute it belongs to instead of by which object it belongs to. This is realized by using a single struct containing each attribute as an array of some primitive data type (int, float, double etc.); this layout is called Structure of Arrays (SoA). Using an SoA approach allows the hardware to use all the loaded bytes in a 128-byte read instead of discarding the parts of each struct that are not needed by the current operation.
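The two layouts can be sketched as follows for a particle with a four-dimensional state and a weight; the type names are illustrative only.

// Array of Structs (AoS): the natural OOP layout on a CPU. A warp in which
// every thread reads only the weight strides through memory, so most of each
// 128-byte transaction is wasted.
struct ParticleAoS {
    float state[4];
    float weight;
};
// ParticleAoS particles[NP];

// Structure of Arrays (SoA): one array per attribute. Consecutive threads
// read consecutive addresses, so memory transactions are fully used
// (coalesced access).
struct ParticlesSoA {
    float* state0;
    float* state1;
    float* state2;
    float* state3;
    float* weight;
};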

3.8 High level libraries

This section describes the higher level libraries used in the thesis work.

3.8.1 Thrust template library

Thrust [Hoberock and Bell, 2012] is a collection of C++ templates for parallel algorithms written in CUDA. It contains parallel versions of many C++ standard template library algorithms for sorting, reductions and partial sums, and some specialized combined operations to optimize performance for GPU execution and minimize memory accesses and transfers. This library is used for its parallel algorithm primitives throughout the particle filtering framework described in chapter 5. The primitives used are described in chapter 4.

3.8.2 Random123 random number generation library

Random123 [Salmon, 2012b] is a library for counter-based pseudo-random number generation. The library contains both optimized CPU code and an efficient CUDA implementation of the counter-based pseudo-random number generators described in Salmon et al. [2011]. Random123 is being debated for inclusion in the Boost C++ library collection [Salmon, 2012a]. Random number generation is a complex subject and it is very difficult to develop good pseudo-random number generators. Random number generation is a major part of the time update step: generating the process noise v^i_k in (2.27).
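A sketch of how Random123 can be used is given below, assuming the Random123 headers are on the include path. Feeding the particle index and time step into the counter and a per-run seed into the key is one possible convention for obtaining disjunct per-thread sequences; it is not mandated by the library, and the function name is hypothetical.

#include <Random123/philox.h>
#include <Random123/uniform.hpp>

// Generate four U(0,1) samples for particle i at time step k.
__host__ __device__ inline void draw4(unsigned i, unsigned k,
                                      unsigned seed, float out[4]) {
    r123::Philox4x32 rng;
    r123::Philox4x32::ctr_type ctr = {{i, k, 0u, 0u}};  // counter: which draw
    r123::Philox4x32::key_type key = {{seed, 0u}};      // key: per-run seed
    r123::Philox4x32::ctr_type r = rng(ctr, key);       // "encrypt" the counter
    for (int j = 0; j < 4; ++j)
        out[j] = r123::u01<float>(r[j]);  // map each 32-bit word to (0,1)
}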

3.8.3 Eigen linear algebra library

Eigen is not a CUDA library but is used extensively on the CPU side in this thesis work. The Eigen library [Guennebaud et al., 2010] is a collection of C++ templates for common linear algebra operations, such as matrix-vector multiplications, matrix-matrix multiplications, matrix inverses, Cholesky factorization and much more. The implementations are highly optimized and make automatic use of vectorization instructions such as SSE and AVX.


4 Parallelisation

This chapter discusses parallelising the particle filter algorithm.

It would be desirable to run one thread per particle up to at least the number of computing cores available. However, some parts of the algorithm are more difficult to parallelise due to dependencies between the particles. In addition, given the CUDA programming model it can be beneficial to over-allocate threads so that the warp scheduler can keep the computing cores active while waiting for high-latency operations such as loads and stores in global memory.

Preliminary profiling and analysis performed during the beginning of this thesis work indicated that the resampling step is the most complex part of a simple PF test program.

The chapter is split up into sections each covering a single component of the entire algorithm. The notation ‘particle-local’ is used in this chapter to denote a variable that is specific to a given particle and independent of all other particles.

4.1 Time update

The system prediction function, f(x^i_k, u_k, v^i_k) in (2.8), does not have any dependencies between the particles. It needs to read one global variable, u_k, and two particle-local variables, x^i_k and v^i_k, and write one particle-local value, x^i_{k+1}. The process noise, v^i_k, however, must be sampled from its distribution before computing f(x^i_k, u_k, v^i_k); this requires a parallel random number generator and is explained in section 4.2.
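As an illustration, a time-update kernel for the car model from section 2.1 could look as follows, with pre-generated process noise passed in and the state stored in SoA layout (section 3.7). This is a sketch; the framework in chapter 5 uses function objects instead of a hard-coded model, and all names here are illustrative.

// One thread per particle: x^i_{k+1} = f(x^i_k, u_k, v^i_k).
// x1, x2 are separate arrays (SoA); v holds one process noise sample
// per particle, generated as described in section 4.2.
__global__ void time_update(float* x1, float* x2, const float* v,
                            float Ts, int Np) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= Np) return;
    float vi = v[i];
    x1[i] += x2[i] * Ts + 0.5f * Ts * Ts * vi;  // cf. (2.4)
    x2[i] += Ts * vi;                           // cf. (2.5)
}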


4.2 Random number generation

Random or pseudo-random numbers are needed to generate process noise in the time update. A pseudo-random number generator (PRNG) is an algorithm that yields seemingly uncorrelated numbers in sequence, but since computers work in a deterministic fashion a PRNG is not a source of truly random numbers.

4.2.1 Recursive PRNGs

Conventional pseudo-random number generators (PRNGs) are designed in a recursive fashion, where each generated number in the sequence depends on an internal state which in turn depends on the internal state in the previous iteration:

x_{k+1} = f(x_k),  k = 0, 1, . . .    (4.1)

where x_0 is known as the "seed". Since each state depends on the previous state, generating more numbers from the same sequence is an inherently sequential process.

The number of articles concerning parallel pseudo-random number generation has increased recently with the introduction of massively parallel GPUs [Howes and Thomas, 2007], [Bradley et al., 2011], [Manssen et al., 2012]. To generate good quality random numbers in parallel it is necessary that each parallel thread generate a disjunct sequence of random numbers [Brent, 2004]. Each separate thread requires a unique instance of the PRNG with a seed that is distinct from all other threads, or the threads will follow the same sequence of pseudo-random numbers. There are several alternatives for achieving disjunct sequences of pseudo-random numbers with recursive generators [Manssen et al., 2012]. A first idea might be to seed the generators in turn with another PRNG sequence and use a long-period generator to minimize the risk of overlapping sequences. However, to be certain that the seed yields a disjunct sequence it is necessary to seed all generators with the same seed and advance the state for each thread further than the number of pseudo-random numbers needed by the computations (per thread). By analysing the state advancing function it can also be possible to create a skip-ahead method for a PRNG. Bradley et al. [2011] describe CUDA implementations of the conventional PRNGs Mersenne Twister (MT19937), Sobol and MRG32k3a, including skip-ahead methods for each generator.

4.2.2 Counter-based PRNGs

Recently, an article [Salmon et al., 2011] was published that describes a parallel counter-based pseudo-random number generator. This generator uses the principle of encrypting a counter using a modified cryptographic cipher. The algorithms in the ciphers are modified to reduce the number of rounds applied, which decreases the number of operations (and computation time) needed but also reduces the cryptographic strength. These counter-based PRNGs all pass the TestU01 BigCrush [L'Ecuyer and Simard, 2007] test battery, which is currently the most used testing suite for random number generators [Manssen et al., 2012]. The use of a counter in the PRNG lowers the memory requirement for storing the state, and it is trivial to advance the sequence an arbitrary number of steps.

The counter-based PRNG library Random123 [Salmon, 2012b] was used in the implementation in this thesis.

4.2.3 The Box-Muller transform

The Box-Muller transform [Box and Muller, 1958] is used in this thesis to transform a pair of U(0, 1)-distributed numbers obtained from the Random123 PRNGs into a pair of Gaussian N(0, 1)-distributed numbers. The Box-Muller transform is defined as

X_1 = √(−2 ln U_1) cos(2π U_2),  U_1, U_2 ∼ U(0, 1)    (4.2)

X_2 = √(−2 ln U_1) sin(2π U_2)    (4.3)

where X_1 and X_2 are independent and Gaussian N(0, 1) distributed if U_1 and U_2 are independent and uniform U(0, 1) distributed. For a derivation of the transform and justification of the method, see Box and Muller [1958].

To keep the implementation complexity down, the Box-Muller transform is used in both the CPU and the GPU implementations in this thesis, even though there are faster methods available on the CPU side, e.g. Marsaglia and Tsang [2000].
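A device-side sketch of (4.2)-(4.3) follows; the function name is illustrative. Note that u1 must be strictly positive for the logarithm to be defined, which mappings onto (0,1] guarantee.

#include <math.h>

// Box-Muller transform: two independent U(0,1) samples in,
// two independent N(0,1) samples out, cf. (4.2)-(4.3).
__host__ __device__ inline void box_muller(float u1, float u2,
                                           float* x1, float* x2) {
    const float two_pi = 6.2831853f;
    float r = sqrtf(-2.0f * logf(u1));  // u1 must lie in (0,1]
    *x1 = r * cosf(two_pi * u2);
    *x2 = r * sinf(two_pi * u2);
}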

4.2.4 The Kaiser and Dickman method

The multivariate normal distribution is a generalisation of the 1-dimensional Gaussian distribution to higher dimensions. A covariance matrix and a mean vector are used to describe the multivariate normal distribution instead of a scalar variance and scalar mean.

The probability density function, pdf(x), of the multivariate normal distribution is

pdf(x) = 1/√((2π)^n |R|) exp(−(1/2)(x − µ)^T R^{−1} (x − µ))    (4.4)

where x is an n-dimensional vector, µ is the mean and R is the covariance matrix, and |R| is the determinant of the covariance.

To generate new samples from the multivariate normal distribution with covariance R and mean µ, the most straightforward method is to use the Cholesky factorisation of R to perform an affine transform from independent N(0, 1)-distributed numbers to the correct distribution; this method is known as the Kaiser and Dickman method from their paper [Kaiser and Dickman, 1962].


Algorithm 2 Drawing samples from an N-dimensional multivariate normal distribution using the Kaiser and Dickman method.

1. Generate N independent N(0, 1)-distributed numbers, z ∼ N_N(0, I).

2. Compute the Cholesky factorisation of the covariance matrix R, that is, find an L such that R = LL^T.

3. x = Lz + µ is then a multivariate normally distributed sample with mean µ and covariance matrix R.

Note: There are other methods of finding a factorisation of the matrix R for the affine transform, e.g. spectral decomposition, matrix square root etc. Cholesky factorisation is, however, already implemented in most linear algebra programming libraries and thus readily available.

The matrix factorisation in the Kaiser-Dickman method only needs to be performed when the covariance matrix changes. The applications in this thesis work all have constant process noise covariances, and therefore the matrices are factorised on filter initialization and the resulting matrix is stored for use in the time update. The matrix multiplication x = Lz + µ is performed in a sequential manner for each element in the matrix, but many particles are processed in parallel.
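A CPU-side sketch of Algorithm 2 using Eigen's LLT (Cholesky) module is given below; the function name is illustrative. In the thesis setting R is constant, so in practice L would be computed once at filter initialization rather than on every call.

#include <Eigen/Dense>
#include <random>

// Draw one sample x ~ N(mu, R) with the Kaiser and Dickman method.
Eigen::VectorXd mvn_sample(const Eigen::VectorXd& mu,
                           const Eigen::MatrixXd& R,
                           std::mt19937& gen) {
    Eigen::MatrixXd L = R.llt().matrixL();   // step 2: R = L L^T
    std::normal_distribution<double> N01(0.0, 1.0);
    Eigen::VectorXd z(mu.size());
    for (int i = 0; i < z.size(); ++i)
        z(i) = N01(gen);                     // step 1: z ~ N(0, I)
    return L * z + mu;                       // step 3: affine transform
}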

4.3 Measurement update

As described in section 2.6.2, the measurement update first updates the weights, then normalizes the set of weights.

4.3.1 Residual, e^i_k

The residual is computed according to

e^i_k = y_k − h(x^i_k)    (4.5)

where x^i_k is the state vector and y_k is the new measurement. This computation is almost trivial to parallelise as it only needs to read one global value, y_k, and one particle-local value, x^i_k, and writes one particle-local value, e^i_k.

4.3.2 Likelihood, p(y_k | x^i_k)

The likelihood is computed by using the probability density function p_{e_k}:

p(y_k | x^i_k) = p_{e_k}(e^i_k)    (4.6)

For a Gaussian measurement noise distribution the density is

p_{e_k}(e^i_k) = 1/√((2π)^n |R|) exp(−(1/2)(e^i_k − µ)^T R^{−1} (e^i_k − µ))    (4.7)

where µ is the mean and R is the covariance matrix of the noise distribution, and |R| is the determinant of the covariance.

This computation, too, is straightforward to parallelise as it uses one particle-local variable, e^i_k (possibly vector-valued), and two constants, µ and R (possibly vector- and matrix-valued), and writes one scalar-valued particle-local variable, p(y_k | x^i_k). The computation can be optimized by ignoring the constant normalizing factor and only computing the exponential function on each call, since all weights are normalized properly at the end of the measurement update.

If the covariance matrix is a diagonal matrix the vector-matrix-vector product in the exponential function argument can be simplified further in the implementation.
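A sketch of the likelihood evaluation as a function object (the style used by the framework in chapter 5) for the diagonal-covariance, zero-mean case, with the constant factor dropped as described above; the type and member names are hypothetical.

#include <math.h>

// Unnormalized Gaussian likelihood for zero-mean measurement noise with
// diagonal covariance R: p(y_k|x^i_k) is proportional to
// exp(-0.5 * sum_j e_j^2 / R_jj). The constant factor is dropped since the
// weights are normalized at the end of the measurement update anyway.
template <int NY>
struct DiagGaussianLikelihood {
    float inv_var[NY];  // 1 / R_jj, precomputed on the host

    __host__ __device__ float operator()(const float (&e)[NY]) const {
        float q = 0.0f;
        for (int j = 0; j < NY; ++j)
            q += e[j] * e[j] * inv_var[j];
        return expf(-0.5f * q);
    }
};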

4.4 Resampling

The resampling step is the most difficult step for a GPU to process in parallel, because it involves random reads from global memory, which hurts performance due to the hardware architecture; see section 3.7.

4.4.1 Parallel binary prefix operations

Binary prefix operations, also known as scan operations, are the building blocks of many parallel algorithms and have been written about extensively before [Hillis and Steele, 1986], [Harris et al., 2007]. A binary prefix operation takes an input vector

x = (x_1, x_2, x_3, . . . , x_n)    (4.8)

and any associative binary operator (such as addition, multiplication, AND, OR, min, max etc.), called ⊕ in this section, and generates the sequence

S = scan(⊕, x) = (x_1, x_1 ⊕ x_2, x_1 ⊕ x_2 ⊕ x_3, . . . , x_1 ⊕ . . . ⊕ x_n)    (4.9)

Assuming n is a power of two (it is easy to extend the algorithm to other values of n by padding the vector with the chosen operator's identity: 0 for addition, 1 for multiplication etc.), parallelising the above operation efficiently is done in two phases: an up-sweep and a down-sweep. The names stem from the original formulation of the results in a binary tree in Blelloch [1990]. The elements of the vector x are seen as leaves of a balanced binary tree. During the up-sweep each node remembers the value of its left child, computes the value of the operator ⊕ applied to the values of its two child nodes, and passes the result to its parent. Each node on the same level can be computed in parallel, but the levels must be computed in sequence, starting from the bottom of the binary tree (the leaves) and moving up.

During the down-sweep each node forwards to its left child the value from its parent, and to its right child the value of the operator ⊕ applied to the stored value (which originated in its left child node during the up-sweep) and the value passed from its parent. Again, in this phase each node on the same level can be computed in parallel, but the levels must be computed in sequence, starting from the top of the binary tree and moving down to the leaves.

By arranging the computations in a binary tree, the number of operations² needed will be O(N log N): N leaves and log₂ N levels of the tree.

² Number of operations here means the total number of mathematical results that need to be computed.
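In practice the scan need not be hand-written; Thrust exposes the pattern directly and runs the up-/down-sweep internally. A minimal usage sketch with addition as ⊕:

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/scan.h>

int main()
{
    thrust::device_vector<float> x(8, 1.0f);  // x = (1, 1, ..., 1)
    thrust::device_vector<float> s(8);
    // Inclusive scan with + as the binary operator: s = (1, 2, ..., 8).
    thrust::inclusive_scan(x.begin(), x.end(), s.begin(),
                           thrust::plus<float>());
    return 0;
}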

4.4.2 Parallel reduction

A reduction operation is any operation that takes a vector and reduces it to a scalar; one example reduction is the sum-of-all-elements operation. It is trivial to implement a sum of all elements in a sequential fashion: simply iterate over all values and add them one at a time to an accumulator variable. To implement a reduction in parallel one can use the same up-sweep as in the parallelisation of the binary prefix scan but omit the down-sweep. The required value can be found by applying the operator ⊕ to the children of the top node.

Because of the similarities with parallel binary prefix operations, the resulting complexity is of the same order for a parallel reduction as for a parallel binary prefix operation, although the number of operations is roughly halved for a reduction because of the omitted down-sweep.
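Summing all particle weights, as needed in the next step, is one such reduction. A sketch using Thrust, which performs the up-sweep on the device and returns the root value to the host:

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

// Sum of all weights via a parallel reduction.
float sum_weights(const thrust::device_vector<float>& w)
{
    return thrust::reduce(w.begin(), w.end(), 0.0f, thrust::plus<float>());
}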

4.4.3 Normalizing weights

To perform the resampling methods mentioned in section 2.6.3 it is necessary to know the sum of the weights as a normalization constant. The parallel reduction described above can be used to parallelise this computation. This normalization can be combined with the computation of the cumulative probability density, section 4.4.4 below, and does not need to be performed as a separate step.

4.4.4 Computing the cumulative probability distribution

A binary prefix operation with ⊕ = + is used to compute the empirical cumulative distribution function in (2.23).
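A sketch of how the normalization and the CDF computation can be fused (the function name is illustrative): one inclusive prefix sum produces the unnormalized CDF, and dividing through by its last element, the weight sum, normalizes it.

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/scan.h>
#include <thrust/transform.h>

// Empirical CDF of the weights; the separate normalization step is folded
// into the scan result by dividing through with the total weight.
void weight_cdf(const thrust::device_vector<float>& w,
                thrust::device_vector<float>& cdf)
{
    using namespace thrust::placeholders;
    thrust::inclusive_scan(w.begin(), w.end(), cdf.begin());
    const float total = cdf.back();  // sum of all weights (one device read)
    thrust::transform(cdf.begin(), cdf.end(), cdf.begin(), _1 / total);
}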

4.4.5 Computing the number of particle copies

A (scalar) lower bound search searches for a key (a value) in a sorted sequence of values and returns the first position where the key can be inserted in the sequence without violating the sorting order.

A vectorised lower bound search performs the same search as a scalar lower bound search but takes a vector of keys and returns a vector of resulting positions. Finding the particles that intersect each level line as in figure 2.2 can be implemented as a vectorised lower bound search. The result of a lower bound search using the cumulative distribution as the value sequence and the intersection level as the key will yield the index of the first particle that crosses the level line. The implementation in this thesis computes all intersection levels u_n and stores them in a vector u, then performs a vectorised lower bound search on all elements of u in parallel. The results of the search are stored in a vector.

Thrust provides an implementation of a vectorised lower bound where one thread is executed per key but each thread performs the search in a sequential manner; this method requires that the number of keys being searched for is large in order to utilise the GPU hardware and to hide the memory latency.
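A sketch of the vectorised search as used here (names are illustrative). The cumulative distribution is sorted by construction, being a running sum of non-negative weights:

#include <thrust/binary_search.h>
#include <thrust/device_vector.h>

// Vectorised lower bound: one key u_n per thread, searched for in the
// (sorted) cumulative distribution of the weights.
void find_source_indices(const thrust::device_vector<float>& cdf,  // sorted values
                         const thrust::device_vector<float>& u,    // resampling levels
                         thrust::device_vector<int>& idx)          // output indices
{
    thrust::lower_bound(cdf.begin(), cdf.end(),
                        u.begin(), u.end(),
                        idx.begin());
}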

4.4.6 Copying particles

The particles found by the lower bound search in section 4.4.5 must be copied. This is done by first creating a new empty vector and then, for each element of the new vector, copying the particle indicated by the corresponding element in the index vector; this is sometimes called a gather operation. The index vector is the result of the lower bound search in section 4.4.5. Figure 4.1 shows the particle duplication process in three steps. The old_particles vector is discarded after the gather operation and replaced by the next_particles vector.

Because of the random nature of the particle duplication process it is very difficult to implement the gather operation efficiently in parallel. It is however much more efficient to run one thread per element of the next_particles vector and have several threads reading from the same memory location (gather) than to execute one thread per element of the old_particles vector and let each thread write all of its own copies to the next_particles vector in sequence (scatter); the latter method would result in uncoalesced, unaligned random writes to memory and divergent threads.
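A sketch of the gather step: next_particles[i] = old_particles[idx[i]] for each i, with one thread per output element.

#include <thrust/device_vector.h>
#include <thrust/gather.h>

// Gather: several threads may read the same old particle, but every write
// to next_particles is coalesced and no thread is left idle.
void copy_particles(const thrust::device_vector<int>& idx,
                    const thrust::device_vector<float>& old_particles,
                    thrust::device_vector<float>& next_particles)
{
    thrust::gather(idx.begin(), idx.end(),    // map
                   old_particles.begin(),     // source
                   next_particles.begin());   // destination
}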

4.5 Complexity summary

Below is a breakdown of algorithm 1. “Completely parallelisable” means that there are no dependencies on other particles in that step and thus it is possible to run each computation in a separate thread. The estimated complexities of the computations for N particles are:

• Time update:

– Sampling the process noise is completely parallelisable given a parallel random number generator, O(N) operations.


[Figure: three panels showing the old_particles and next_particles vectors — first an empty vector called next_particles is created, then the result of the lower bound search gives the mapping between old and next, and finally the result after copying.]

Figure 4.1: Copying particles as a part of the resampling step.

– Predicting the future state, x_{k+1}, is completely parallelisable, O(N) operations.

• Measurement update:

– Computing h(x_k^i) and solving the measurement equation in (2.19) for e_k is completely parallelisable, O(N) operations.

– Computing the likelihood p(y_k | x_k^i) for each i is completely parallelisable, O(N) operations.

– Normalizing the particle weights is parallelisable by using a parallel reduction, O(N log₂ N) operations.

• Resampling:

– Computing the cumulative distribution function of the particles is done using a parallel prefix sum, O(N log₂ N) operations.

– Computing the number of particle copies is parallelisable using a parallel binary search, O(N log₂ N) operations.

– Copying the particles can be done in parallel, O(N) operations, but requires either uncoalesced memory reads or uncoalesced memory writes, which hurt performance. A gather operation was chosen over a scatter operation as in this particular case it leaves no GPU threads idle, gives a deterministic number of operations (minus waiting for high latency global memory), and gives coalesced writes to global memory.

Amdahl's law is a formula for estimating the maximum theoretical speedup from adding more threads to an algorithm. Amdahl's law for computing the speedup is

S(N) = 1 / ((1 − P) + P/N)    (4.10)

where S(N) is the speedup achieved for N parallel threads and P is the proportion of the algorithm that can be parallelised.

For a completely parallelisable algorithm, P = 1, while an algorithm in which it is impossible to perform anything at all in parallel yields P = 0.

From (4.10) it can be seen that the time complexity of a completely parallelisable algorithm requiring O(T) operations running on M parallel threads can be approximated by

O(T/M) = O(T) for T ≫ M, and O(T/M) = O(1) for T ≤ M    (4.11)

From (4.11) it can be seen that, depending on the number of computations, the computation time will be close to constant when the number of computations is small compared to the number of threads, and gradually turn linear in time for a greater number of computations.
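As a numerical illustration of (4.10): with P = 0.99 and N = 1024 threads, S(1024) = 1/(0.01 + 0.99/1024) ≈ 91, so even a 1% sequential fraction limits the achievable speedup to far below the thread count.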

A comparison of time complexities between the sequential version and the parallel version of the filter is given in table 4.1.

The approximations in table 4.1 suggest that a large number of threads executing in parallel using the chosen parallel algorithm implementation could outperform a single thread running the sequential implementation.


Table 4.1: Time complexity of sequential particle filter compared to parallel filter with M threads.

Step                      Sequential time complexity   Parallel time complexity
Time update
  Sampling process noise  O(N)                         O(N/M)
  Prediction              O(N)                         O(N/M)
Measurement update
  Measurement             O(N)                         O(N/M)
  Likelihood              O(N)                         O(N/M)
  Normalization^a         O(N)                         O(N log₂ N / M)
Resampling
  CDF                     O(N)                         O(N log₂ N / M)
  Number of copies        O(N)                         O(N log₂ N / M)
  Copying particles^b     O(N)                         O(N/M)

^a The normalization in the measurement update can be merged into the computation of the empirical cumulative density function in the resampling step.
^b Performance is highly dependent on the memory architecture of the CUDA device.

5 Implementation

During the writing of this thesis two implementations of the particle filter were developed. One implementation was written in C++ using standard template library (STL) components and the Eigen linear algebra library for optimized linear algebra calculations; this implementation runs solely on the CPU. The second implementation was written in CUDA C++ using the Thrust template library for algorithmic primitives. This CUDA implementation runs the computations on the GPU and the algorithm control on the CPU. Some parts of the initialization of the GPU implementation, such as computing the Cholesky factorization of the process noise covariance needed by the process noise generator, are done on the CPU, but these are only computed once when initializing the filter code.

5.1 C++ templates

C++ has a very powerful template system that allows for meta programming. A function can be written as a template where for example the data type of some variable is determined by a template parameter. Any good C++ book will give a thorough explanation of templates and use cases, e.g. Stroustrup [2000], Lippman et al. [2005].

C++ templates are used extensively throughout the particle filtering framework to allow generic code for the particle filtering steps which are algorithmically the same steps regardless of the number of states or the data types of the involved variables.
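As a minimal sketch of the idea (with illustrative names, not the thesis API): the scalar type, the state dimension and the motion model are all template parameters, so the same source compiles for any model without runtime dispatch overhead.

#include <vector>
#include <Eigen/Dense>

// Illustrative only: a prediction step that is generic over the scalar type,
// the state dimension and the motion model supplied at compile time.
template <typename Scalar, int NumStates, typename Model>
void predict(std::vector<Eigen::Matrix<Scalar, NumStates, 1>>& particles,
             const Model& f)
{
    for (auto& x : particles)
        x = f(x);  // x_{k+1} = f(x_k), inlined by the compiler
}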

References
