
Accelerating the D3Q19 Lattice Boltzmann Model with OpenACC and MPI

Alessandro Gabbana
VT 2015

Master’s Thesis Project, 30 hp
Supervisor: Lars Karlsson

External Supervisors: Sebastiano Fabio Schifano, Raffaele Tripiccione
Examiner: Eddie Wadbro


Abstract

Multi-GPU implementations of the Lattice Boltzmann method are of practical interest as they allow the study of turbulent flows in large-scale simulations at high Reynolds numbers. Although programming GPUs, and power-efficient accelerators in general, typically guarantees high performance, the lack of portability of their low-level programming models implies significant effort for the maintainability and porting of applications. Directive-based models such as OpenACC look promising in tackling these aspects. In this work we evaluate the performance of a multi-GPU implementation of the Lattice Boltzmann method accelerated with OpenACC. The implementation allows for multi-node simulations of fluid flows in complex geometries, and also supports heterogeneous clusters, for which the load balancing problem is investigated.

(4)
(5)

Contents

1 Introduction
2 The Lattice Boltzmann Method
  2.1 The Lattice Boltzmann Equation
  2.2 Algorithmic aspects
  2.3 The D3Q19 Model
  2.4 Boundary Conditions
  2.5 Turbulence
3 Parallel Computing on Many Core Architectures
  3.1 The OpenACC programming model
    3.1.1 Directives for Compute Regions
    3.1.2 Directives for Data Regions
    3.1.3 Directives for asynchronous operations
4 Implementation
  4.1 General Implementation
    4.1.1 OpenACC kernels
    4.1.2 Data Regions
    4.1.3 Data Dependencies and Pointer Aliasing
    4.1.4 Memory Layout
    4.1.5 Device Occupancy
    4.1.6 Code Validation: Poiseuille flow
  4.2 Flows past obstacles and complex geometries
    4.2.1 Propagate Kernel
    4.2.2 Collide Kernel
    4.2.3 Code Validation: Turbulent flow
  4.3 Multi-GPU
    4.3.1 Domain Decomposition
    4.3.2 Overlapping Computation and Communication
    4.3.3 Pipelining host-device memory transfers
    4.3.4 Heterogeneous GPU/CPU implementation
5 Load Balancing
  5.1 Static Load Balancing
  5.2 Dynamic Load Balancing
6 Results
  6.1 Performance Metrics
  6.2 Single GPU Performances
    6.2.1 General Implementation
    6.2.2 Simulations with obstacles and complex geometries
    6.2.3 Performance Model
  6.3 Multi-GPU Performances
    6.3.1 Scalability
    6.3.2 Performance Model
7 Conclusions and Future Work
Appendix A Visualization
  A.1 Non-turbulent flow at Re ≈ 40
  A.2 Turbulent flow at Re ≈ 2000
Appendix B Architecture specification
  B.1 nVidia Tesla K80
    B.1.1 Diagram
    B.1.2 Specifications
  B.2 AMD Hawaii FirePro W9100
    B.2.1 Diagram
    B.2.2 Specifications


1 Introduction

The Lattice Boltzmann Method (LBM) is a computational fluid simulation method for modeling the Navier-Stokes equations and simulating fluid flows. The increasing popularity of the LBM comes from its flexibility, allowing complex geometries and different kinds of boundary conditions, and from its being particularly suitable for parallel computing due to the large degree of parallelism intrinsic to the algorithm.

Over the years LBM codes have been written and optimized targeting different kinds of High Performance Computing (HPC) architectures [1, 2, 3]. Recently the attention has focused on accelerators such as GPUs [4] and Xeon-Phi processors [5].

In the past, porting codes to these architectures required knowledge of low-level programming models such as CUDA and OpenCL. An emerging alternative approach is represented by directive-based programming models such as OpenACC and OpenMP 4.0, where a set of directives is used to guide compilers in the parallelization of applications, hiding most of the complex details specific to the underlying architectures.

The aim of this thesis is the multi-GPU implementation and optimization of an LBM code accelerated with OpenACC. We will initially discuss the performance and portability across different architectures of the single-GPU implementation, which is designed to support the simulation of fluid flows in complex geometries. We will then assess the multi-node scaling of the LBM solver, trying where possible to minimize the cost of communication. Finally, we will evaluate different strategies for distributing the computational load to multiple GPUs in heterogeneous clusters and in the presence of unbalanced workloads.

The report is structured as follows: In Chapter 2 we introduce the Lattice Boltzmann Method, describing its algorithmic aspects and presenting a few examples of boundary conditions later implemented in the solver. In Chapter 3 we introduce the OpenACC programming model, giving a detailed description of the constructs and directives used to optimize the code performance. In Chapter 4 we present the implementation of the multi-GPU LBM solver, starting from the optimization and validation of the single-GPU code. In Chapter 5 we evaluate two load-balancing strategies for the multi-GPU implementation. In Chapter 6 we present and discuss the performance measurements of the implementations presented in the previous chapters. In Chapter 7 we conclude with a summary of the results obtained and suggest areas of future work.


2 The Lattice Boltzmann Method

The Lattice Boltzmann Method (LBM) is a computational fluid simulation method for solving incompressible flows. It was developed as an evolution of the Lattice Gas Cellular Automata (LGCA) method, in response to its drawbacks.

In particular, the LBM reduces the statistical noise affecting the LGCA method by simulating each lattice point with a collection of particles whose behavior is described by distribution functions [6]. The LBM has since evolved into a self-standing research subject and can be discussed independently, with no need to refer to the LGCA [7].

2.1 The Lattice Boltzmann Equation

There are several ways to derive the Lattice Boltzmann Equation (LBE), either from discrete velocity models or from the Boltzmann transport equation. There are also several ways to derive the macroscopic Navier-Stokes equations from the LBE (see [7, 8, 9] for details).

For a system of particles without external forces, the Boltzmann equation can be written as

\frac{\partial f}{\partial t} + \mathbf{e} \cdot \nabla_x f = \Omega \qquad (2.1)

where f(x, e, t) is a particle distribution function (PDF) describing the number of particles at time t with positions in the interval [x, x + dx] and velocities in [e, e + de], and Ω denotes the collision operator. Note that ∇x f and e are vectors.

The macroscopic quantities describing the fluid obey the following:

\rho = \int f \, d\mathbf{e} \qquad (2.2)

\rho\,\mathbf{u} = \int f\,\mathbf{e} \, d\mathbf{e} \qquad (2.3)

where ρ is the fluid density and u the fluid velocity. Since the collision operator Ω is a function of f, calculating an analytic solution of Equation 2.1 is complicated; therefore Ω is commonly approximated with a simplified model introduced in 1954 by Bhatnagar, Gross and Krook (BGK) [10]:

\Omega = \frac{1}{\tau}\,(f^{eq} - f) = \omega\,(f^{eq} - f) \qquad (2.4)

where f^{eq} is the local equilibrium distribution function, a Maxwell-Boltzmann distribution function. The coefficient ω is called the collision frequency and τ is called the relaxation factor.

Substituting the BGK approximation into Equation 2.1 gives:

\frac{\partial f}{\partial t} + \mathbf{e} \cdot \nabla_x f = \frac{1}{\tau}\,(f^{eq} - f) \qquad (2.5)

Considering a finite set of velocity vectors e_i, and the corresponding discrete particle distribution and equilibrium functions f_i, f_i^{eq}, with i = 0, 1, . . . , m − 1, it is possible to write the discrete Boltzmann transport equation (in the absence of external forces) as:

\frac{\partial f_i}{\partial t} + \mathbf{e}_i \cdot \nabla_x f_i = \frac{1}{\tau}\,(f_i^{eq} - f_i), \quad i = 0, \dots, m-1 \qquad (2.6)

Equation 2.6 can be discretized as:

\frac{f_i(\mathbf{x}, t + \Delta t) - f_i(\mathbf{x}, t)}{\Delta t} + e_i\,\frac{f_i(\mathbf{x} + \Delta\mathbf{x}, t + \Delta t) - f_i(\mathbf{x}, t + \Delta t)}{\Delta x} = \frac{1}{\tau}\,\big(f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t)\big), \quad i = 0, \dots, m-1

Taking \Delta x_i / \Delta t = e_i:

\frac{f_i(\mathbf{x}, t + \Delta t) - f_i(\mathbf{x}, t)}{\Delta t} + \frac{f_i(\mathbf{x} + \mathbf{e}_i \Delta t, t + \Delta t) - f_i(\mathbf{x}, t + \Delta t)}{\Delta t} = \frac{1}{\tau}\,\big(f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t)\big), \quad i = 0, \dots, m-1

\frac{f_i(\mathbf{x} + \mathbf{e}_i \Delta t, t + \Delta t) - f_i(\mathbf{x}, t)}{\Delta t} = \frac{1}{\tau}\,\big(f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t)\big), \quad i = 0, \dots, m-1

Finally we get:

f_i(\mathbf{x} + \mathbf{e}_i \Delta t, t + \Delta t) = f_i(\mathbf{x}, t) + \frac{\Delta t}{\tau}\,\big(f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t)\big), \quad i = 0, \dots, m-1 \qquad (2.7)

Furthermore, the discrete equivalents of Equation 2.2 and Equation 2.3 hold:

\rho = \sum_i f_i \qquad (2.8)

\rho\,\mathbf{u} = \sum_i f_i\,\mathbf{e}_i \qquad (2.9)

The local equilibrium function f^{eq}, together with the relaxation factor τ, determines the type of problem to be solved.

2.2 Algorithmic aspects

The common terminology used in the LBM refers to the dimension of the problem and the number of speeds as DnQm, where n is the dimension of the problem (1 for 1-D, 2 for 2-D and 3 for 3-D) and m is the number of discrete velocities in the speed model. The velocity space is described by a finite set of velocities e_i, i = 0, 1, . . . , m − 1, chosen in accordance with the time step and the mesh size such that, given a lattice site x, the point x + ∆t · e_i also lies on the lattice for all i.


We initially consider for simplicity the D2Q9 model with the following velocity set:

e_i =

       e0  e1  e2  e3  e4  e5  e6  e7  e8
e_x:    0   1   1   0  −1  −1  −1   0   1
e_y:    0   0   1   1   1   0  −1  −1  −1

The corresponding stencil is represented in Figure 1.


Figure 1: D2Q9 Stencil (velocity sets).

Choosing a proper local equilibrium function f^{eq}, Equation 2.7 can be advanced in time by iterating the following two steps:

\tilde{f}_i(\mathbf{x}, t + \Delta t) = f_i(\mathbf{x}, t) + \frac{\Delta t}{\tau}\,\big(f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t)\big), \quad i = 0, \dots, 8 \qquad (2.10)

called the collision step, in which the effect of the collision operator is applied, and

f_i(\mathbf{x} + \mathbf{e}_i \Delta t, t + \Delta t) = \tilde{f}_i(\mathbf{x}, t + \Delta t), \quad i = 0, \dots, 8 \qquad (2.11)

called the streaming step, in which the populations are advected to the neighboring nodes in accordance with the velocity set. This two-phase procedure, known as the ”push algorithm”, is visualized in Figure 2.

Figure 2: Push Algorithm: Collision-Propagate scheme for the D2Q9 model.

Black color indicates pre-collision values for populations, blue is used to represent the post-collision status.

It has been shown (e.g. [11]) that re-ordering the collision and propagate steps, following the so-called pull scheme, improves performances on certain architectures for which non-local write operations are more expensive than non-local read operations. In the pull algorithm the propagate step is written as:

f_i(\mathbf{x}, t) = f_i(\mathbf{x} - \mathbf{e}_i \Delta t, t - \Delta t), \quad i = 0, \dots, 8 \qquad (2.12)

Figure 3: Pull Algorithm: Propagate-Collision scheme for the D2Q9 model. Black color indicates pre-collision values for populations, blue is used to represent the post-collision status.

Algorithm 1 LBM solver - Pull Algorithm

 1: for all lattice sites x in f_i(x, t) do
 2:   for all i do
 3:     f̃_i = f_i(x − e_i ∆t, t − ∆t)                              { Streaming }
 4:   end for
 5:   < set boundary conditions >
 6:   ρ ← Σ_i f_i
 7:   u ← (1/ρ) Σ_i f_i e_i
 8:   f_i^eq ← g(ρ, u)                                             { Compute equilibrium function }
 9:   for all i do
10:     f_i(x, t + ∆t) = f̃_i(x, t) + (∆t/τ)(f_i^eq(x, t) − f_i(x, t))   { Collision }
11:   end for
12: end for

2.3 The D3Q19 Model

The model considered for our implementation is the D3Q19 model. The velocity set, whose stencil is presented in Figure 4, is given by:

e_i =

       e0  e1  e2  e3  e4  e5  e6  e7  e8  e9 e10 e11 e12 e13 e14 e15 e16 e17 e18
e_x:    0   1  −1   0   0   0   0   1   1  −1  −1   1  −1   1  −1   0   0   0   0
e_y:    0   0   0   1  −1   0   0   1  −1   1  −1   0   0   0   0   1   1  −1  −1
e_z:    0   0   0   0   0   1  −1   0   0   0   0   1   1  −1  −1   1  −1   1  −1


Figure 4: D3Q19 Stencil (velocity sets).

We follow a classic choice for the equilibrium function f^{eq}, obtained by a Taylor series expansion of the Maxwell-Boltzmann distribution:

f_i^{eq} = \rho\,\omega_i \left[ 1 + \frac{\mathbf{e}_i \cdot \mathbf{u}}{c_s^2} + \frac{(\mathbf{e}_i \cdot \mathbf{u})^2}{2 c_s^4} - \frac{u^2}{2 c_s^2} \right], \quad i = 0, \dots, m-1 \qquad (2.13)

where the speed of sound c_s is a lattice constant, c_s = 1/\sqrt{3}, the fluid density ρ and the fluid velocity u come from Equation 2.8 and Equation 2.9, and the weighting factors ω_i for the D3Q19 model [8] are:

\omega_0 = \frac{1}{3}, \qquad \omega_i = \frac{1}{18}, \; i = 1, \dots, 6, \qquad \omega_j = \frac{1}{36}, \; j = 7, \dots, 18

Using these assignments, velocity components at each site can be computed as:

\rho u_x = f_1 - f_2 + f_7 + f_8 - f_9 - f_{10} + f_{11} - f_{12} + f_{13} - f_{14}
\rho u_y = f_3 - f_4 + f_7 - f_8 + f_9 - f_{10} + f_{15} + f_{16} - f_{17} - f_{18}
\rho u_z = f_5 - f_6 + f_{11} + f_{12} - f_{13} - f_{14} + f_{15} - f_{16} + f_{17} - f_{18}
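As a concrete illustration of Equations 2.8, 2.9 and 2.13, the following sketch computes the macroscopic moments and the D3Q19 equilibrium populations for a single lattice site. It is an illustrative example rather than the solver's actual code: the function name and array layout are assumptions, while the velocity components and weights are transcribed from the table and the ω_i values given above.

/* Sketch: macroscopic moments (2.8)-(2.9) and D3Q19 equilibrium (2.13)
 * for one lattice site. */
static const int ex[19] = { 0, 1,-1, 0, 0, 0, 0, 1, 1,-1,-1, 1,-1, 1,-1, 0, 0, 0, 0 };
static const int ey[19] = { 0, 0, 0, 1,-1, 0, 0, 1,-1, 1,-1, 0, 0, 0, 0, 1, 1,-1,-1 };
static const int ez[19] = { 0, 0, 0, 0, 0, 1,-1, 0, 0, 0, 0, 1, 1,-1,-1, 1,-1, 1,-1 };
static const double w[19] = {
    1.0/3.0,
    1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0,
    1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0,
    1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0 };

void d3q19_equilibrium(const double f[19], double feq[19])
{
    const double cs2 = 1.0 / 3.0;                 /* c_s^2 = 1/3 */
    double rho = 0.0, ux = 0.0, uy = 0.0, uz = 0.0;

    for (int i = 0; i < 19; i++) {                /* Equations (2.8) and (2.9) */
        rho += f[i];
        ux  += f[i] * ex[i];
        uy  += f[i] * ey[i];
        uz  += f[i] * ez[i];
    }
    ux /= rho;  uy /= rho;  uz /= rho;

    const double u2 = ux * ux + uy * uy + uz * uz;
    for (int i = 0; i < 19; i++) {                /* Equation (2.13) */
        const double eu = ex[i] * ux + ey[i] * uy + ez[i] * uz;
        feq[i] = rho * w[i] * (1.0 + eu / cs2
                                   + (eu * eu) / (2.0 * cs2 * cs2)
                                   - u2 / (2.0 * cs2));
    }
}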


2.4 Boundary Conditions

(a) Complex Boundary Conditions. (b) Elementary Boundary Conditions.

Figure 5: Complex boundary conditions with an example of how elementary boundary conditions can be used to create a staircase approximation of a smooth surface.

A crucial aspect of LBM simulations is the accurate modeling of boundary conditions. In the literature, different approaches have been suggested and tested to handle the interaction between fluid and solid sites. Depending on whether or not the boundary points lie on a surface aligned with the mesh, we can distinguish between elementary and complex boundary conditions (Figure 5) [7]. The latter require the introduction of techniques such as staircasing and extrapolation [13]. In the following we will stick to the case of elementary boundaries, which are simpler to implement, as suggested by the name, but still allow for accurate approximations of complex surfaces, whose precision can be increased by introducing finer grids.

Periodic Boundary Conditions. Periodic boundary conditions are usually employed to describe bulk phenomena and represent the simplest instance of boundary conditions: components f_i that would leave the domain during the streaming phase simply re-enter on the opposite side, in practice mapping the domain onto a torus.

Figure 6: Periodic boundary conditions with external halo layers surrounding the physical lattice.

The implementation of the periodic boundary conditions can be obtained either by placing conditional statements to check whether a site is on the border of the lattice or, as shown in Figure 6, by adding some extra ”halo” layers used to copy the correct values for the velocities entering the physical region.
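The halo-based variant can be sketched as follows for a single population array and a single direction. This is a simplified illustration with assumed names and layout, not the thesis code, using one halo layer per side as in Figure 6.

/* Sketch: periodic boundary conditions in x for one population array f,
 * stored on an (lx+2) x (ly+2) x (lz+2) box with one halo layer per side.
 * The halo plane x = 0 receives the last physical plane x = lx, and the
 * halo plane x = lx+1 receives the first physical plane x = 1. */
#include <stddef.h>

static size_t site(int x, int y, int z, int ly, int lz)
{
    return ((size_t)x * (ly + 2) + y) * (lz + 2) + z;
}

void periodic_x(double *f, int lx, int ly, int lz)
{
    for (int y = 0; y <= ly + 1; y++)
        for (int z = 0; z <= lz + 1; z++) {
            f[site(0,      y, z, ly, lz)] = f[site(lx, y, z, ly, lz)]; /* left halo  */
            f[site(lx + 1, y, z, ly, lz)] = f[site(1,  y, z, ly, lz)]; /* right halo */
        }
}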

No Slip Boundary Conditions. No slip, or bounce back, boundary conditions require that all the components of the velocity are identically zero at the boundary.

From a physical perspective, this corresponds to the case where the solid wall presents a roughness sufficient to prevent fluid motion along the wall.


Figure 7: No-slip boundary conditions. In the example populations f6, f7, f8 of a boundary site of coordinates (x, y) are copied respectively on sites with coordinates (x − 1, y − 1), (x, y − 1) and (x + 1, y − 1).

The implementation is rather simple, as it just requires reversing the populations sitting on a boundary node. This can be obtained, for example, by using the nodes representing the surface of a wall to store the appropriate values to be read during the streaming phase. In the example in Figure 7, a boundary point with coordinates (x, y) will read during the streaming phase:

f_2(x, y) = f_6(x, y)
f_3(x, y) = f_7(x, y)
f_4(x, y) = f_8(x, y)

With this choice the velocity components become:

u_x = f_1 + f_2 + f_8 - f_4 - f_5 - f_6 = f_1 - f_5 = 0
u_y = f_2 + f_3 + f_4 - f_6 - f_7 - f_8 = 0

Therefore u_y, normal to the wall, is automatically zero thanks to the previously imposed condition, whereas the tangential component, u_x, is zero if the boundary sites are initialized with f_1 = f_5.

Free Slip Boundary Conditions. Free slip boundary conditions can be used to simulate walls without friction, requiring only the velocity component perpendicular to the boundary to be zero, while the tangential velocity remains unchanged.


Figure 8: Free-slip boundary conditions. In the example, populations f_6, f_7, f_8 of a boundary site of coordinates (x, y) are copied respectively onto sites with coordinates (x + 1, y − 1), (x, y − 1) and (x − 1, y − 1).

The implementation is similar to the no-slip case, the difference being the interaction with the wall sites (Figure 8), which in practice act as a mirror reflecting back incoming populations. This is obtained with the following choice for the components of a boundary site:

f_2(x, y) = f_8(x, y)
f_3(x, y) = f_7(x, y)
f_4(x, y) = f_6(x, y)

Other examples of Boundary Conditions. Even if they will not be discussed in the following, it is worth mentioning two more cases of commonly employed elementary boundary conditions:

• Sliding walls: where the wall sites move, typically along a direction tangential or normal with respect to the fluid flow.

• Open boundaries: where the fluid enters one side of the lattice (e.g. at constant velocity) and then leaves from another side.

2.5 Turbulence

We conclude this introductory chapter on the LBM by taking into consideration one of its possible applications, the study of turbulent flows. Turbulence is a flow regime characterized by chaotic property changes, making the long-term behavior of a system hard to predict. The degree of turbulence is qualitatively described by a dimensionless parameter known as the Reynolds number:

Re = \frac{U L}{\nu} \qquad (2.14)

where U is the typical macroscopic flow speed, L is the characteristic linear dimension, and ν is the kinematic viscosity of the fluid which, in the discrete model, is related to the relaxation factor τ by:

\nu = \frac{1}{3}\left(\tau - \frac{1}{2}\right) \qquad (2.15)
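For reference, Equations 2.14 and 2.15 translate directly into code. The helpers below are only an illustration, with names and example values chosen by us rather than taken from the thesis, of how the relaxation factor fixes the lattice viscosity and, together with a chosen velocity and domain size, the Reynolds number of a simulation.

/* Illustrative helpers: Equation (2.15) and Equation (2.14). */
double lb_viscosity(double tau) { return (tau - 0.5) / 3.0; }

double reynolds(double U, double L, double nu) { return U * L / nu; }

/* Example (hypothetical values): tau = 0.8 gives nu = 0.1; a flow with
 * U = 0.1 over a domain of L = 256 lattice units then has Re = 256. */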


Most interesting fluid flows occur in the large-Re regime. Considering the kinematic viscosity of water, ν ≈ 10^−6 m²/s, and of air (at atmospheric pressure and a temperature of the order of 10 degrees), ν ≈ 1.5 · 10^−5 m²/s, we can observe that, for example, for the motion of a swimmer (speed of the order of 1 m/s and L ≈ 1 m) one has Re ≈ 10^6, while for a commercial airplane (speed of the order of 250 m/s and L ≈ 50 m) Re ≈ 8 · 10^8. These numbers substantiate the need to study flows at large Re. This is numerically very challenging, and even today first-principle simulations are not able to come close to the figures discussed above [7]. Reliable simulations at large Re require large velocities and large system sizes. It is interesting to derive a simplified scaling law for the value of Re as a function of the size of the simulation domain. From the Kolmogorov K41 theory, the length scale of a turbulent flow is given by:

\eta = \left( \frac{\nu^3}{\varepsilon} \right)^{1/4} \qquad (2.16)

where ε is the average rate of dissipation of the turbulence kinetic energy. Using Equation 2.14 together with Equation 2.16 we have

\frac{\eta}{L} \propto Re^{-3/4} \qquad (2.17)

From Equation 2.17 we can directly see that in three dimensions Re cannot grow faster than the power 4/9 of the size of the simulation domain. This explains why even very large scale simulations are not able to study in full detail systems with realistic values of Re. However, these limitations have been overcome with the development of several approximation schemes for engineering applications: these methods (such as, for instance, the Large Eddy Simulation (LES) method [14]) neglect the details of the fluid motion at small scales, but are still able to correctly reproduce the macro-scale quantities, making the study of fluid systems at intermediate Re interesting per se.

A detailed analysis of turbulent flows goes beyond the scope of the present work. In the following we will make use of some well-known simulation setups, at different Reynolds numbers, for the purpose of validating our implementation by comparing with results taken from the literature.


3 Parallel Computing on Many Core Architectures

Exascale computing represents the next frontier in High Performance Computing (HPC), but power and efficiency constraints make it a challenge from both a hardware and a software point of view.

With the growth in CPU clock frequency stalled, the demand for performance growth has pushed the widespread adoption of architectures featuring many low-powered cores per node. In particular, heterogeneous architectures, combining traditional multi-core CPUs with energy-efficient accelerators such as GPUs or Intel's MIC, are now a common solution for improving the performance of compute nodes, with the ninety systems using accelerator/co-processor technology accounting for 35% of the computing capabilities of all the systems in the latest (June 2015) TOP500 list [15].

The complexities introduced by heterogeneous architectures motivate new programming models that facilitate the expression of the degree of concurrency of an application, in order to exploit a large fraction of the peak performance of a system.

Programming frameworks for accelerator-based systems range from low-level models, such as the proprietary CUDA from NVIDIA and the open standard OpenCL, to higher-level directive-based models such as OpenMP and OpenACC. Directive-based models represent an appealing solution, promising to reduce the cost of development and to ensure portability of the code. Although the specifications of the latest OpenMP 4.0 standard support GPU programming, at the time of writing no compiler has support for these architectures. In this work we therefore adopt the OpenACC programming standard for the development of a Lattice Boltzmann solver. In the following we provide an introduction to the OpenACC programming model [16], describing the directives and clauses employed in the implementation presented in Chapter 4.


Figure 9: Fraction of total TOP500 performance produced by systems adopting accelerators, from [17].

3.1 The OpenACC programming model

The OpenACC Application Program Interface (API) is an open standard describing a collection of compiler directives used to expose parallelism in the code, guiding compilers in building applications capable of executing on a variety of parallel accelerators.

OpenACC defines an abstract model for accelerated computing, designed to support offloading of both computation and data from a host to an accelerator device.

In Figure 10 we show a diagram of the model, where the two components, the host and the accelerator, are assumed to have two different memory spaces. The model is host-centric in the sense that the host controls the offloading of code regions and data to the accelerator. The accelerator consists of a set of processing elements (PEs) organized in a SIMD unit, which allows for (asynchronous) processing of tasks from one or more execution queues. The compiler implicitly maps each component of the OpenACC abstract model onto the structure of the target architecture.

3.1.1 Directives for Compute Regions

In OpenACC computationally intensive sections of the code can be annotated to be executed on the accelerator using the parallel or the kernels directives, both generating a compute region. The parallel directive is used for explicit parallelism.

Similarly to the OpenMP parallel construct, it creates a number of parallel threads that execute the parallel region redundantly until a work-sharing loop is reached, where every thread will execute a subset of the loop iterations. The kernels directive is used for implicit parallelism. It relies on the compiler's automatic parallelization techniques to identify which loops are safe to parallelize. This analysis can be augmented by the user by means of directives and clauses.
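As a minimal illustration of the two constructs (our own example with assumed variable names, not code from the thesis), the same loop can be offloaded either by asserting its parallelism explicitly or by leaving the analysis to the compiler:

void scale_explicit(int n, double alpha, double * restrict a)
{
    /* Explicit parallelism: the programmer asserts that the loop is parallel. */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        a[i] = alpha * a[i];
}

void scale_implicit(int n, double alpha, double * restrict a)
{
    /* Implicit parallelism: the compiler decides whether the loop inside the
     * kernels region is safe to parallelize, possibly helped by loop clauses. */
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        a[i] = alpha * a[i];
}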

Figure 10: OpenACC Abstract Machine Block Diagram.

Loop structures represent the main source of parallelism in parallel programming. The loop directive is used to provide the compiler with additional information about a loop by means of the following clauses:

• independent, used to specify that the loop iterations are data-independent and can be executed in parallel.

• collapse, used to specify how many tightly nested loops are associated with the loop construct.

• private, used to avoid race conditions, it implies that each loop iteration requires its own copy of the listed variables.

• reduction, which performs a reduction operation on a chosen variable by creating a private copy of the variable for each loop iteration and returning the combined result of the reduction at the end of the loop.

The loop directive also supports a set of clauses to specify how to distribute a loop over three different levels of parallelism, referred to in the OpenACC terminology as gang, worker and vector. Vector parallelism presents the finest granularity, processing multiple data elements with the same instruction in a SIMD fashion. Each vector is processed by a worker. A group of workers constitutes a gang. Gang parallelism is coarse-grained parallelism, where gangs work independently, do not share memory, and do not support synchronization. With each gang the OpenACC model associates a cache memory, which can be used by all workers and vectors within the gang.

The gang, worker and vector clauses can be used, together with knowledge of the underlying architecture, to specify how the compiler should map the loop iterations onto parallelism in the hardware. For example, considering the CUDA execution model on NVIDIA GPUs, one can map gang parallelism onto CUDA blocks, vectors onto threads, and assume the number of workers to be fixed and equal to the warp size. It is clear that providing the compiler with more details about how to map the parallelism can boost performance on a specific accelerator but, at the same time, limit performance portability to other architectures.
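The fragment below, again our own illustration with assumed names and an arbitrary vector length rather than thesis code, combines the collapse and reduction clauses with an explicit gang/vector mapping of a 3D loop nest:

/* Sum of the density over an LX x LY x LZ lattice: the three loops are
 * collapsed and distributed over gangs and vector lanes of length 256,
 * while reduction(+:rho_tot) combines the per-iteration contributions. */
double total_density(int LX, int LY, int LZ, const double * restrict rho)
{
    double rho_tot = 0.0;
    #pragma acc parallel loop collapse(3) gang vector vector_length(256) reduction(+:rho_tot)
    for (int i = 0; i < LX; i++)
        for (int j = 0; j < LY; j++)
            for (int k = 0; k < LZ; k++)
                rho_tot += rho[(i * LY + j) * LZ + k];
    return rho_tot;
}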


Figure 11: OpenACC execution model.

3.1.2 Directives for Data Regions

For each compute region the compiler also creates an implicit data region, where all the data required for the computation is copied from the host to the device when entering the region, and back from the device to the host at the end of the region.

The data directive provides control over the locality of the data used by the parallel regions. It is used both to minimize and to synchronize data movements between the host and the device memory. It allows data to be shared between multiple parallel regions within different functions. The data flow is controlled with the following clauses:

• copy, which creates space for the specified variables on the device, moves data to the device at the beginning of the region and back to the host at the end of the region, also releasing the allocated memory on the device.

• copyin, which creates space for the specified variables on the device, moves data to the device at the beginning of the region, releases the allocated memory on the device at the end of the region without copying the data back to the host.

• copyout, which creates space for the specified variables on the device at the beginning of the region and moves them from the device to the host on region exit, releasing the allocated memory on the device.

• create, which creates space for the specified variables on the device when entering a region to then release it on region exit.

• present, which states that the listed variables are already present on the device and should be used without any further redundant data movement.

• deviceptr, which states that the listed variables are pointers to device addresses, allowing interaction with other APIs for handling the accelerator memory.

(23)

In order to synchronize data between the host and device memories it is possible to employ the update directive. The host clause tells the compiler to generate code for moving the specified data from the device to the host, and vice versa for the device clause.
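A typical pattern combining a data region with update keeps the lattice resident on the device for the whole run and copies it back only when a snapshot is needed. The sketch below is our own; lbm_step, write_snapshot and the sizes are hypothetical names, not functions from the thesis.

/* Sketch: lattice resident on the device for niter steps; data is moved
 * back to the host only every out_every iterations for output. */
void lbm_step(double *f1, double *f2);           /* hypothetical solver step  */
void write_snapshot(const double *f, int iter);  /* hypothetical output stage */

void run(double *f1, double *f2, int nsites, int niter, int out_every)
{
    #pragma acc data copy(f1[0:nsites]) copyin(f2[0:nsites])
    {
        for (int iter = 0; iter < niter; iter++) {
            lbm_step(f1, f2);                    /* kernels marked present(f1, f2) */
            if (iter % out_every == 0) {
                #pragma acc update host(f1[0:nsites])
                write_snapshot(f1, iter);
            }
        }
    }
}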

Finally, the cache directive can be used to cache data in the shared memory within a gang. Unfortunately, the lack of synchronization at the worker level limits the use of this directive to read-only data. As a consequence, memory access optimizations exploiting the shared memory, such as temporal blocking, cannot be explicitly implemented and are thus left to the cleverness of the compiler.

3.1.3 Directives for asynchronous operations

For accelerators connected to a host CPU over a PCIe bus, the cost of moving data often represents a bottleneck for application performance. It is sometimes possible to reduce this performance penalty by overlapping data movements with computation on the host, on the device, or both. This can be achieved in OpenACC with the async clause, which can be specified together with the parallel, kernels, and update directives to specify that, once the associated operation has been sent to the accelerator, the host can asynchronously process the next instruction rather than waiting for its completion. The async clause optionally allows specifying the queue in which to place a certain operation. Intuitively, operations placed in the same queue are executed in order, following a FIFO policy, whereas operations in different queues may execute in any order, even in parallel if enough resources are available. To control the flow of asynchronous operations, the wait directive is used to stop execution until all operations in a specific queue (or in all queues) are completed.
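A sketch of the async/wait mechanism is given below. It is our own illustration with assumed array names and sizes, placing a device kernel and a host-to-device transfer in two different queues and synchronizing with a final wait.

/* Sketch: the bulk update in queue 1 overlaps with the transfer of boundary
 * data in queue 2; the host continues right after launching both, and the
 * final wait blocks until every queue has drained. Both arrays are assumed
 * to be already present in device memory (e.g. inside a data region). */
void overlap_example(int nbulk, double * restrict bulk,
                     int nhalo, double * restrict halo)
{
    #pragma acc parallel loop async(1) present(bulk[0:nbulk])
    for (int i = 0; i < nbulk; i++)
        bulk[i] = 2.0 * bulk[i];

    #pragma acc update device(halo[0:nhalo]) async(2)

    /* ... independent host work could be performed here ... */

    #pragma acc wait
}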


4 Implementation

In this chapter we will describe the implementation of a multi-GPU LBM solver.

Since the first GPU porting by Tölke and Krafczyk [4], many different strategies have been presented [18, 19] for optimizing performances. Due to the large number of sparse memory accesses required by the simulation, these strategies have mostly been based on employing the shared memory to avoid costly misaligned accesses to the GPU global memory. However, Obrecht et al. [20] have shown that the use of shared memory is not as crucial. This, together with the significant hardware improvements introduced in NVIDIA devices supporting compute capability 2.x, increases the possibility of obtaining good performances even when employing higher-level programming models.

4.1 General Implementation

In this section we will describe the steps followed to parallelize a serial implementation of Algorithm 1 using OpenACC in an incremental approach. This first instance of the LBM solver allows the simulation of a fluid flow in an L_x × L_y × L_z discrete mesh (lattice), so far ignoring any solid objects in the inner lattice, with the only boundary sites lying on the surface of the six outer faces of the box. For the implementation of the boundary conditions described in Section 2.4, the physical lattice is enveloped in a halo surface, thus increasing the dimension of the lattice to (L_x + 2) × (L_y + 2) × (L_z + 2). Although this choice increases memory requirements, it is motivated by the intent of avoiding, where possible, branching operations in the code, in order to maximize performances. Moreover, for large simulations, this additional cost in terms of memory resources is negligible.
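The listings in this chapter index the lattice through an IDX macro and the constants HX, HY, HZ, SIZEX, SIZEY, SIZEZ, whose definitions are not shown. One plausible set of definitions for the halo-extended (L_x + 2) × (L_y + 2) × (L_z + 2) lattice, given here purely as an assumption to make the snippets self-contained, is the following.

/* Assumed definitions (not shown in the original listings): SIZEX x SIZEY
 * x SIZEZ is the physical lattice, surrounded by one halo layer per side;
 * HX = HY = HZ = 1 is the offset of the first physical site. The sizes
 * below are placeholder values. */
#define SIZEX 256
#define SIZEY 256
#define SIZEZ 256
#define HX 1
#define HY 1
#define HZ 1

#define NX (SIZEX + 2)                 /* allocated box, halo included */
#define NY (SIZEY + 2)
#define NZ (SIZEZ + 2)

/* Linear index of site (x, y, z) in the halo-extended box. */
#define IDX(_x, _y, _z) ((((_x) * NY) + (_y)) * NZ + (_z))

/* A possible AoS site type, consistent with the NXT/PRV macros below. */
typedef struct { double p[19]; } pop_aos_t;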

4.1.1 OpenACC kernels

The structure of the LBM implementation consists of a loop over time steps in which three kernels are applied at each iteration: bc, propagate and collide. The use of two lattices (F1, F2) prevents any data dependency, allowing each site to be processed in parallel in each of the three phases:

• bc(F1), where boundary conditions are implemented by writing on the sites belonging to the halo-layer a proper value for those populations accessed in the following propagate step.

• propagate(F1, F2), where populations of lattice F1 are moved across lattice F2 according to the stencil described in Figure 4, thus locally storing at each site the populations that will interact in the collision phase.

• collide(F1, F2), where the collision operations local to each site are performed. For every population of each site, the new value at time step t + 1, obtained as a function of its previous value at time t (and read from lattice F2), is written to lattice F1.

Annotating the code with proper OpenACC directives, each of these three kernels can be executed on the accelerator. In Listing 4.1 we present the implementation of the propagate kernel. With #pragma acc kernels we suggest that the compiler generate a parallel accelerator kernel for the nested loops following the directive.

Furthermore, we specify with #pragma acc loop independent that the iterations of these loops are data independent with respect to each other. The same approach is followed to parallelize the bc and the collide kernels.

void propagate(pop_aos_t *nxt, pop_aos_t *prv) {
#define NXT(_x, _y, _z, _fi) nxt[IDX(_x, _y, _z)].p[_fi]
#define PRV(_x, _y, _z, _fi) prv[IDX(_x, _y, _z)].p[_fi]

  int i, j, k;

#pragma acc kernels
#pragma acc loop independent
  for (i = HX; i < (HX + SIZEX); i++) {
#pragma acc loop independent
    for (j = HY; j < (HY + SIZEY); j++) {
#pragma acc loop independent
      for (k = HZ; k < (HZ + SIZEZ); k++) {

        NXT(i, j, k,  0) = PRV(i  , j  , k  ,  0); // POP 0
        NXT(i, j, k,  1) = PRV(i-1, j  , k  ,  1); // POP 1
        NXT(i, j, k,  2) = PRV(i+1, j  , k  ,  2); // POP 2
        NXT(i, j, k,  3) = PRV(i  , j-1, k  ,  3); // POP 3
        NXT(i, j, k,  4) = PRV(i  , j+1, k  ,  4); // POP 4
        NXT(i, j, k,  5) = PRV(i  , j  , k-1,  5); // POP 5
        NXT(i, j, k,  6) = PRV(i  , j  , k+1,  6); // POP 6
        NXT(i, j, k,  7) = PRV(i-1, j-1, k  ,  7); // POP 7
        NXT(i, j, k,  8) = PRV(i-1, j+1, k  ,  8); // POP 8
        NXT(i, j, k,  9) = PRV(i+1, j-1, k  ,  9); // POP 9
        NXT(i, j, k, 10) = PRV(i+1, j+1, k  , 10); // POP 10
        NXT(i, j, k, 11) = PRV(i-1, j  , k-1, 11); // POP 11
        NXT(i, j, k, 12) = PRV(i+1, j  , k-1, 12); // POP 12
        NXT(i, j, k, 13) = PRV(i-1, j  , k+1, 13); // POP 13
        NXT(i, j, k, 14) = PRV(i+1, j  , k+1, 14); // POP 14
        NXT(i, j, k, 15) = PRV(i  , j-1, k-1, 15); // POP 15
        NXT(i, j, k, 16) = PRV(i  , j-1, k+1, 16); // POP 16
        NXT(i, j, k, 17) = PRV(i  , j+1, k-1, 17); // POP 17
        NXT(i, j, k, 18) = PRV(i  , j+1, k+1, 18); // POP 18
      }
    }
  }

#undef NXT
#undef PRV
}

Listing 4.1: Propagate kernel: Naive implementation.

4.1.2 Data Regions

For an effective use of the GPU, data transfers between the host and the accelerator must be minimized. The LB solver can be made fully resident on the device by making use of OpenACC data regions. This is obtained with the #pragma acc data directive and its copy and copyin clauses (Listing 4.2), used to control how the specified variables are moved from the host to the device upon entry to the data region and, vice versa, back from the device memory to the host memory when exiting the region. The kernels inside the data region in the example of Listing 4.2 no longer need to exchange data with the host, as the data is already resident on the device. In Listing 4.3 the present clause is used to indicate that all input data is already available on the device.

(27)

#pragma acc data copy(f1[0:NX*NY*NZ]) copyin(f2[0:NX*NY*NZ], param[0:1])
for (iter = 0; iter < NITER; iter++) {

  // BC
  //------------------------------------
  bc(f1);

  // Propagate
  //------------------------------------
  propagate(f2, f1);          // f2 <-- f1

  // Collision
  //------------------------------------
  collide(f1, f2, param);     // f1 <-- f2

}

Listing 4.2: LBM solver structure making use of an OpenACC data region.

#pragma acc kernels present(nxt, prv)
#pragma acc loop independent
for (i = HX; i < (HX + SIZEX); i++) {
#pragma acc loop independent
  for (j = HY; j < (HY + SIZEY); j++) {
#pragma acc loop independent
    for (k = HZ; k < (HZ + SIZEZ); k++) {
      ...
    }
  }
}

Listing 4.3: Propagate Kernel with data resident on the device.

4.1.3 Data Dependencies and Pointer Aliasing

The implementation using two lattices, as described in Section 4.1.1, prevents the occurrence of data dependencies in the propagate and the collide kernels. In Listing 4.1 we made use of the independent clause to advise the compiler that the iterations of the nested loops are data independent with respect to one another. However, in the presence of possible pointer aliasing, i.e. when the same memory location may be addressed using different names, the compiler might refuse to parallelize the loop in order to preserve the correctness of the program. It is then necessary to rearrange the code to make clear to the compiler that no data dependencies are present. This is done by adopting the restrict qualifier, as in the example in Listing 4.4, for the pointers used to access the lattice in the parallel loops.

void propagate(pop_aos_t * restrict nxt, pop_aos_t * restrict prv) {

Listing 4.4: Solving pointer aliasing in the declaration of the propagate kernel.

4.1.4 Memory Layout

Two data organization schemes are possible for LBM solver implementations.

The first taken into consideration is the Array of Structures (AoS) format, which corresponds to a single array with one element per lattice site, each containing a data structure used to store the 19 populations of the D3Q19 model. A different approach is represented by the Structure of Arrays (SoA) format, in our case consisting of a structure with 19 arrays, each one corresponding to all of the spatial locations of a single population.

While it has been shown [21] that AoS represents the best choice for serial CPU implementations, the same is not true for GPU architectures, where this layout leads to poor memory access patterns. On the other hand, the SoA organization guarantees better coalescence of global memory accesses and is therefore the mandatory choice for GPU implementations.
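The two layouts can be summarized by the following type sketches; they are our own illustration of the idea, and the thesis types pop_aos_t and pop_soa_t are only assumed to be similar in spirit.

/* Array of Structures: one record per lattice site holding its 19
 * populations. Threads processing consecutive sites read with a stride
 * of 19 doubles, which maps poorly onto GPU memory coalescing. */
typedef struct { double p[19]; } site_aos_t;
/* access: lattice_aos[IDX(x, y, z)].p[i] */

/* Structure of Arrays: one contiguous array per population. Threads
 * processing consecutive sites read contiguous addresses, so global
 * memory accesses coalesce. */
typedef struct { double *p[19]; } lattice_soa_t;
/* access: lattice_soa.p[i][IDX(x, y, z)] */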


4.1.5 Device Occupancy

So far we have delegated to the compiler the control over the mapping of the parallel loops onto the underlying hardware architecture. With the continuous improvement of compiler heuristics this choice allows for performance portability, although a significant improvement can in general be obtained by tuning the device occupancy. In OpenACC this is achieved by mapping loops onto the components of the programming model (gangs, workers, vectors). A detailed analysis of this aspect will be given in Section 6.2.1.

void propagate(pop_soa_t *nxt, pop_soa_t *prv) {
#define NXT(_x, _y, _z, _fi) nxt[IDX(_x, _y, _z)].p[_fi]
#define PRV(_x, _y, _z, _fi) prv[IDX(_x, _y, _z)].p[_fi]

  int i, j, k;

#pragma acc kernels present(prv, nxt)
#pragma acc loop independent gang(NUMBLOCKS_PROPAGATE) vector(BLOCKSZ_PROPAGATE) collapse(3)
  for (i = HX; i < (HX + SIZEX); i++) {
    for (j = HY; j < (HY + SIZEY); j++) {
      for (k = HZ; k < (HZ + SIZEZ); k++) {

        NXT(i, j, k, 0) = PRV(i, j, k, 0);   // POP 0
        ...

Listing 4.5: Propagate Kernel with parameters allowing for larger or smaller device occupancy in order to optimize performances.

4.1.6 Code Validation: Poiseuille flow

In order to evaluate the correctness of the LBM solver described in Section 4.1, we compare the numerical results obtained from a simulation against known analytical results. We consider the Poiseuille flow of a fluid in a rectangular channel between two parallel plates placed at a distance 2d from each other, with d small compared to the dimension of the plates. The flow is driven by an external force F acting on the fluid such that only one velocity component is nonzero: in the example in Figure 12 the force F = (0, 0, F_z) gives a nonzero fluid velocity in the z component u_z. The plates are implemented using no-slip boundary conditions, meaning that u_z = 0 at x = ±d. The velocity profile of the Poiseuille flow is given by the parabola in Equation 4.1:

u(x) = u_z(x) = \frac{F_z}{2\nu}\left(d^2 - x^2\right) \qquad (4.1)


Figure 12: Poiseuille flow in a rectangular channel between two parallel plates.

We consider a lattice of dimension 48 × 256 × 256, with τ = 0.8 and an external force F = (0, 0, 10^−5), and we validate the code running on the host, on an NVIDIA GPU [B.1] and on an AMD GPU [B.2]. In Figure 13 we show the evolution of the velocity profile of the Poiseuille flow during the first 10000 time steps. We reach a steady state at time step 30000, from which we extract the data for the validation of the code. As can be seen in Figure 14, the results obtained fit the analytical solution.


Figure 13: Time development of a LBM simulated Poiseuille flow in the first 10000 iterations.


Figure 14: Numerical results for the velocity profile of an LBM-simulated Poiseuille flow, obtained running on a CPU and two GPUs, and validated by comparison against the analytical solution.
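For completeness, the analytical profile of Equation 4.1 used for the comparison in Figure 14 can be tabulated with a few lines of code. The sketch below is our own; the half-width d = 24 lattice units and the prefactor F_z/(2ν) are assumptions consistent with the parameters quoted in the text (τ = 0.8, hence ν = 0.1, and F_z = 10^−5).

/* Sketch: analytical Poiseuille profile of Equation (4.1) for the setup
 * described in the text. Peak value: 1e-5 * 24^2 / (2 * 0.1) = 0.0288. */
#include <stdio.h>

int main(void)
{
    const double tau = 0.8;
    const double nu  = (tau - 0.5) / 3.0;    /* Equation (2.15): nu = 0.1 */
    const double Fz  = 1e-5;
    const double d   = 24.0;                 /* assumed channel half-width */

    for (int x = -24; x <= 24; x++)
        printf("%d %g\n", x, Fz * (d * d - x * x) / (2.0 * nu));
    return 0;
}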


4.2 Flows past obstacles and complex geometries

Fluid flow over obstacles and, in general, the study of transport phenomena in disordered media are topics of large interest, with applications in many different fields, from engineering to geophysics, medicine and more. The LBM has proven to be a powerful and efficient tool for investigating these scenarios thanks to its capability of treating complex boundary conditions and its intrinsic parallelism.

The symmetric behavior shown by the simulation presented in Section 4.1.6 breaks down for Reynolds numbers Re > 2000. Instabilities at lower Reynolds numbers can be triggered by introducing a solid object in the lattice. If the geometry involved is as simple as a cuboid or a sphere, an efficient solution is to hard-code the object in the lattice, making use of the solid points on its surface to implement proper boundary conditions, as seen in Section 2.4. A much more flexible approach consists in introducing a data structure flagging each lattice point to distinguish between fluid and solid sites.

In cases where the actual number of fluid particles is small compared to the dimension of the bounding lattice, it is desirable to reduce memory consumption by storing only the particles required by the simulation. A sparse lattice representation for the study of fluid flows in complex geometries has been adopted, for example, by Bernaschi et al. [23]. An alternative is represented by the so-called patch-based approach, where the lattice is divided into sub-blocks (patches) and only those containing fluid cells are stored in memory [22]. In this section we describe different data addressing schemes and organizations for handling complex geometries in an LBM solver. We treat the propagate and the collide kernels separately in order to investigate the impact on performances of the different approaches. A performance evaluation of the various implementations presented here is the subject of Section 6.2.2.

4.2.1 Propagate Kernel

The propagate kernel is purely memory-intensive, as it consists of a series of load and store operations collecting the populations from neighboring sites. When solid sites are introduced in the inner lattice, it is necessary to handle irregular memory access patterns. We take into consideration two strategies based on a full-matrix representation and one based on a sparse lattice representation.

• Indirect memory addressing. We introduce a data structure associating with each site a set of pointers to the locations of the correct populations (i.e. already taking boundary conditions into account) to be read in the propagate phase. During the initialization phase this data structure is filled treating each site differently according to its type. We distinguish between three types: fluid sites, solid sites and interface sites, the latter being those fluid points having at least one of their nearest neighbors marked as solid. Depending on the site type, each element of the corresponding data structure used to refer to its nearest neighbors is initialized as follows (a sketch of this initialization is given after the list):

– For fluid sites we simply follow the stencil of the model (Figure 4).

– Interface sites must carefully handle boundary conditions (Section 2.4).
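A minimal sketch of how such a neighbor-index structure could be initialized is the following; the flag values, the names and the restriction to the plain fluid-site case are our own assumptions (with ex, ey, ez and IDX as in the earlier sketches), not the thesis implementation.

/* Assumed sketch: neigh[s][i] stores the linear index of the site from
 * which population i is pulled during propagate. Only fluid sites are
 * handled; interface sites would instead encode the boundary rule
 * (e.g. bounce-back), and solid sites are skipped entirely. */
enum site_type { FLUID, SOLID, INTERFACE };

void init_neighbors(const int *type, int (*neigh)[19],
                    int lx, int ly, int lz)
{
    for (int x = 1; x <= lx; x++)
        for (int y = 1; y <= ly; y++)
            for (int z = 1; z <= lz; z++) {
                const int s = IDX(x, y, z);
                if (type[s] != FLUID) continue;
                for (int i = 0; i < 19; i++)   /* pull from x - e_i (Figure 4) */
                    neigh[s][i] = IDX(x - ex[i], y - ey[i], z - ez[i]);
            }
}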
