A smoothed particle hydrodynamic simulation utilizing the parallel processing capabilites of the GPUs

Full text

(1)LiU-ITN-TEK-A--09/052--SE. A smoothed particle hydrodynamic simulation utilizing the parallel processing capabilities of the GPUs Viktor Lundqvist 2009-09-30. Department of Science and Technology Linköping University SE-601 74 Norrköping, Sweden. Institutionen för teknik och naturvetenskap Linköpings Universitet 601 74 Norrköping.

(2) LiU-ITN-TEK-A--09/052--SE. A smoothed particle hydrodynamic simulation utilizing the parallel processing capabilities of the GPUs Examensarbete utfört i vetenskaplig visualisering vid Tekniska Högskolan vid Linköpings universitet. Viktor Lundqvist Handledare Magnus Wrenninge Examinator Jonas Unger Norrköping 2009-09-30.

(3) Upphovsrätt Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under en längre tid från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns det lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/ Copyright The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/. © Viktor Lundqvist.

(4) A BSTRACT Simulating fluid behavior has proven to be a demanding challenge which requires complex computational models and highly efficient data structures. Smoothed Particle Hydrodynamics (SPH) is a particle based computational model used to simulate fluid behavior that has been found capable of producing convincing results. However, the SPH algorithm is computational heavy which makes it cumbersome to work with. This master thesis describes how the SPH algorithm can be accelerated by utilizing the GPU’s computational resources. It describes a model for how to distribute the work load on the GPU and presents a suitable data structure. In addition, it proposes a method to represent and handle moving objects in the fluids surroundings. Finally, the performance gain due to the GPU is evaluated by comparing processing times with an identical implementation running solely on the CPU. . .

(5) . .

(6) . A CKNOWLEDGMENTS I would like to thank everyone at Sony Pictures Imageworks, especially the application group, for giving me this opportunity and making me feel part of their team. A special thank to my supervisor Magnus Wrenninge who has been an invaluable discussion partner and a great support all through the work on this thesis. I am also deeply grateful to Anders Ynnerman and Aida Vitoria at Linköping’s University who helped me to get into the IPAX program. Thank you! . .

(7) .

(8) C ONTENTS INTRODUCTION ............................................................................................................ 2 1. 1 MOTIVATION ..................................................................................................................... 2 1.2 WORK ENVIRONMENT ......................................................................................................... 3 1.3 OUTLINE OF REPORT ............................................................................................................ 3 BACKGROUND & RELATED WORK ................................................................................. 4 2.1 RELATED WORK .................................................................................................................. 4 2.2 SMOOTHED PARTICLE HYDRODYNAMICS (SPH) ........................................................................ 5 2.2.1 Modeling Fluid Dynamics ........................................................................................... 6 2.2.2 Smoothing Kernels ...................................................................................................... 9 2.3 COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA) ................................................................ 10 2.3.1 NVIDIA’s Hardware Architecture ............................................................................... 10 2.3.1.1 Execution Model..................................................................................................... 10 2.3.2 Programming Interface ............................................................................................. 12 IMPLEMENTATION ..................................................................................................... 14 3.1 SPH ALGORITHM .......................................................................................................... 14 3.1.1 Data Structure ........................................................................................................... 15 3.1.2 SPH and Parallel Processing ...................................................................................... 16 3.1.2 Normalize Density ..................................................................................................... 19 3.1.3 Smoothing Kernels .................................................................................................... 19 3.2 INTEGRATION METHOD ...................................................................................................... 20 3.2.1 Step Size .................................................................................................................... 20 3.3 PARTICLE SOURCES ...................................................................................................... 21 3.4 COLLISION DETECTION ....................................................................................................... 21 3.4.1 Collision Objects ........................................................................................................ 21 3.4.2 Collision Algorithm .................................................................................................... 22 RESULTS ..................................................................................................................... 24 4.1 PARTICLE BEHAVIOR .......................................................................................................... 24 4.2 PERFORMANCE.............................................................................................................. 26 4.3 DISCUSSION ..................................................................................................................... 27 4.4 FURTHER IMPROVEMENTS .................................................................................................. 28 4.4.1 Performance Improvements ..................................................................................... 28 4.4.2 Functionality ............................................................................................................. 29 REFERENCES ............................................................................................................... 31 .

(9) C HAPTER 1 . INTRODUCTION   This chapter aims to introduce the reader to the purpose and the underlying structure of this thesis.   . 1. 1 M OTIVATION The motion of fluids has proven to be one of the most challenging physical phenomena to simulate. The complex behavior and physical relations involved in e.g. a raising pillar of smoke, the breaking of an ocean wave or water being poured into a glass is probably what is making it fascinating to watch. However, mimicking these behaviors in computer graphics requires complex computational models and highly efficient data structures. Smoothed Particle Hydrodynamics (SPH) is a particle based computational model used to simulate fluid dynamics. The SPH model has been found capable of producing convincing and physically based computer animations of fluid dynamics. However, the computational burden of fluid simulation is heavy which generally results in low frame rates. Therefore, fluid phenomena are typically simulated off‐line and then rendered in a second step. The high simulation time together with the difficulties of predicting fluids behavior makes it somewhat frustrating to work with, since it is often necessary to re‐ run a simulations several times in order to achieve the desired result. During the last years the performance growth of single core contemporary general‐ purpose processors (CPUs) has stagnated. This has led to an increased interest in multicore chip organizations, a field that the vendors of graphics processors (GPUs) have put a lot of research into. Today’s GPUs provides a vast number of simple but deeply multithreaded cores with high internal memory bandwidth. Although, GPUs traditionally have been dedicated to strictly processing data strongly coupled to the rendering process, last year’s development have led to increased programmability capabilities, making them attractive for non‐graphical purposes. The SPH simulation algorithm is in many ways suitable for parallel processing.   The purpose of this thesis is to examine how the computational resources on a modern graphic card can be utilized in order to speed up the process of an SPH simulation. . . . 2 .

(10) 1.2 W ORK E NVIRONMENT The main work of this thesis was carried out during an internship at Sony Pictures Imageworks. The SPH simulation was implemented to suite the production pipeline at Sony Picture Imageworks with the aim to be useful in future productions.   The resulting simulation tool is written as a plug‐in to the 3D visual effects program Houdini, coded in C++ and Compute Unified Device Architecture (CUDA).  CUDA is a general purpose GPU programming language developed by NVIDIA in order to provide users with extended control over the processing capabilities of the GPU. See section 2.3 for a detailed description of CUDA. Sony also provided several in‐house C++ libraries facilitating certain processes, such as handling field data. . 1.3 O UTLINE OF R EPORT Chapter 2 consists of the background information and previous work which this thesis is based upon. The first part describes the basic theories behind simulating liquids using SPH. Thereafter, the CUDA hardware and programming approaches are described.  This chapter aims to give the reader the basic understanding of the SPH algorithm and CUDA principles necessary to grasp the following chapters and can therefore be disregarded by readers already familiar with these fields of knowledge. Chapter 3 describes how the theories behind the SPH algorithm were implemented in a way that maximizes the utilization of the GPU’s capabilities. It also describes the implementation of other parts, i.e. collision handling and integration technique, necessary to make the simulation practically useful. In chapter 4 the results are being presented and discussed, both in terms of computational performance as well as visual aspects. Finally, some suggestions on further improvements are proposed. . 3 .

(11) C HAPTER 2 . BACKGROUND   & RELATED WORK . 2.1 R ELATED W ORK Computational Fluid Dynamics (CFD) is a well established research area with a long history. In 1845 Claude Navier and George Stokes managed to describe the dynamics of fluids in the Navier‐Stokes equations. In 1977 Monaghan et al. introduced SPH in order to simulate nonaxisymmetric phenomena in astrophysics (Gingold & Monaghan, 1977). In contrast to grid‐based (Euler) simulation techniques it models the dynamics of fluids by applying forces to a particle system (Lagrange), the applied forces ensures the Navier‐ Stokes equations. The core of the method is the use of a smoothing kernel, a function with certain properties which is used to sum up each particles contribution to various field values in a fluid. SPH was found to be rugged, easily extendable and intuitive to work with. Since then, SPH has been shown applicable to a wide variety of fields such as the study of gravity currents near black holes (Evans & Kochanek, 1989), viscous flows (Takeda, Miyama, & Sekiya, 1994), wave propagation (Monaghan & Kos, Solitary waves on a Cretan Beach, 1999) and incompressible flows (Monaghan & Humble, 1991).   The recent growth of the computational power of GPUs has resulted in an increased interest in accelerate the simulation process by performing all (Kolb & Cuntz, 2005), or part (Amad, Imura, Yasumoto, Yamabe, & Chihara, 2004) of the computations on the GPU. Harada et al. proposed a method for SPH simulations running on the GPU using a flat 3D texture to store the complex data structure in the graphic cards video memory (Harada, Kawaguchi, & Koichiro, 2007). Even though they experience some limitations due to the design and accessibility of the texture memory, they manage to simulate a particle system containing 60, 000 particles in real‐time and experienced an extensive speed up for complex off‐line simulations.   The above GPU implementations were achieved by using existing 3D‐rendering APIs, DirectX (Microsoft) and OpenGL (Kessenish & Baldwin).  Since the original purposes of these APIs where to handle rendering the implementations need to be posed in the context of polygon rasterization. This leads to difficulties when e.g. implementing complex data structures which have made this approach cumbersome. The need of general‐purpose computing on the GPU (GPGPU) has lately been recognized by leading graphic cards vendors. Computer Unified Device Architecture (CUDA) is a software architecture for general‐purpose programming on the GPU, released by NVIDA in 2007. CUDA allows the user to access the highly parallel 4 .

(12) performance capabilities of the GPU without the need to go through the whole graphic pipeline. CUDA is supported by the NVIDIA GeForce 8 series and newer NVIDIA cards, as well as some Quadro GPUs and Tesla cards. CUDA has shown excellent potential in parallel processing for a broad spectrum of applications (Che, Boyer, Meng, Tarjan, Sheaffer, & Skadron, 68:1370‐1380).   . 2.2 S MOOTHED P ARTICLE H YDRODYNAMICS (SPH) A fluid is represented by a set of non‐ordered particles which hold fluid properties at discrete locations in the fluid. The SPH simulation technique uses interpolation theory to evaluate field properties at certain points in the fluid. A scalar value (A) can be interpolated at location r by summing up all particles contributions,   A r. ∑ m. A. W r. r , h    . . . . . (1) . where j iterates over all particles, m is the mass of the particle with index j, A the approximated scalar quantity at r and ρ the density at r . The core of the function is the smoothing kernel (W) which determines each particles contribution to the field value. The smoothing kernel has cut‐off radius (h) for which W. 0 when r. r. . The smoothing kernel has to be even (equation 2) and . normalized (equation 3) in order to assure a second order of accuracy in the interpolation.     W r, h. W. W r dr. r, h . . . . . . (2) . . . . . . . (3) . 1    . There are several factors to be considered when designing a suitable smoothing kernel. This will be further discussed in section 2.2.5.   The density field of a fluid can be evaluated using the following equation,   ρ r. ∑ mW r. r , h . . . . . . . (4) . Most fluid equations involve the derivatives of various field quantities. One main advantage with the SPH interpolation technique is that such derivatives only affect the smoothing kernel, since the rest of the variables are constants. The gradient of is simply,   ∑. ∑. ,. ,. . . . (5) . . . (6). Likewise, will the Laplacian evaluate to,   ∑. ,. . . . . 5 . .

(13) 2.2.1 M ODELING F LUID D YNAMICS The governing equations for incompressible fluid dynamics are the mass conservation equation (equation 7) and the Navier‐Stokes equation (equation 8) which formulates the conservation of momentum,   0 . . . . . . . (7) .    . . . . . . (8) . where is the gravitational acceleration, the fluids viscosity coefficient and the velocity.  It is important to note that equation 8 represents a simplified version of the Navier‐Stokes equation, used for viscous incompressible fluids.   The conservation of mass is a trivial task in a SPH simulation. Since the number of particles is constant and each particle has a constant mass throughout the simulation, the conservation of mass will be assured as long as the smoothing kernel is normalized (equation 3).   The right hand side of the Navier‐Stokes equation above consists of three different ) models the pressure, the second . components. The first component external forces and the third one . represents . models the viscosity of fluids. The contribution . of each component will be further discussed in section 2.2.1, 2.2.2 and 2.2.4. The particles in a SPH system should be considered as point masses and the forces acting on them has to be described in the form of point forces. However, fluid equations in literature is often described in terms of force fields. The point force acting on a particle in a force field is described in equation 9.     . . . . . . . . (9) . The point force can be approximated using SPH interpolation over one particle (equation 10).   ,. ,. . . (10) . Rewriting this relation with Newton's second law in mind gives the connection between applied force and acceleration (equation 11).   . . . . . . 6 . . . . (11) .

(14) 2 .2. 1.1 P R E S S U R E The force acting on a particle due to pressure can be described using Newton's second law (equation 12).   .   . . . . . (12) . Application of the SPH rule (equation 1) to the pressure term results in the following relation (equation 13).   ∑. ,. (13). The above equation does not produce symmetrical forces and is thereby an example of one of the issues with SPH. The resulting relations of a SPH interpolation does not guarantee to satisfy any physical principles. The symmetry between forces (every reaction leads to a counter reaction) is vital to get valid simulation results. Different ways to achieve symmetrization of equation 13 have been proposed. Müller et al. suggests using the mean of the interacting particles pressure values (Müller, Keiser, Nealen, Pauly, Gross, & Alexa, 2003) resulting in the following equation (equation 14).     .. ∑. .. ,. . . . (14) . The pressure can then be computed using a modified version of the ideal gas state equation (Desbrun & Cani, 1996),     . . . . . . . (15) . where is a gas constant determined by the speed of sound in the specific liquid and is the liquid's rest density.      2 .2. 1.2 V I S C O S I T Y Viscosity models the way particles with different speeds within the same liquid interacts with each other. Applying the SPH interpolation technique on the viscosity term yields the following equation.   ∑. ,.     . . . . (16) . This equation suffers from the same symmetrical problems as equation 13, and therefore needs to be modified before being used in the simulation. Müller et al. suggests using the relative speed between to particles to balance the forces (Müller, Keiser, Nealen, Pauly, Gross, & Alexa, 2003). Viscosity forces do only depend on the difference in velocities, which makes this a natural approach. This results in equation 17.  . 7 .

(15) ∑. . ,.   . . . . (17) . Clavet et al. presents a different approach, applying radial pairwise impulses, to model viscosity (Clavet, Beaudoin, & Poulin, 2005). The size of the impulses depends on the two particles speed towards each other (equation 18 and 19). · ̂ . . . . . . . . . (18) . . . . . . (19) . and are coefficients used to control the viscosity’s linear and quadratic dependencies on velocity. Impulses are only applied if is greater than 0, the proposed algorithm will therefore only cause forces when particles are moving towards each other.    2 .2. 1.3 S U R F A C E T E N S I O N The Navier‐Stokes model does not include forces due to surface tension. This is therefore often added as a separate part. Surface tension can be seen as a force striving to minimize curvature by applying forces towards the core of the liquid in the direction of the surface normal. The more curved a surface is, the greater surface tension force will be generated.   Morris proposes the use of a color field to determine forces acting upon a set of particles due to surface tension (Morris, 2000). The color field is defined as, ∑. ,. .   . . . . . (20) . This is simply a measure of the particle distribution. The surface normal field of a set of particles is defined as the gradient of the color field (equation 21). . . . . . . . . (21) . . . . (22) . The curvature of a surface can be calculated as, | |. . . . . . . The force due to surface tension can then be calculated using the following equation. | |. . . . . . (23) . Where is a scalar coefficient used to scale the amount of surface tension to be applied. The magnitude of the normal can be used to restrict the surface tension to only be applied to parts of the liquid close to a surface. 2 .2. 1.4 E X T E R N A L F O R C E S External forces such as gravity and collision forces can be applied directly to the particles, without the use of any interpolation technique. See chapter 3.6 for a detailed description of the implemented collision handling algorithm. . 8 .

(16) 2.2.2 S MOOTHING K ERNELS The outcome of a simulation in terms of accuracy, speed and stability is greatly affected by the choice of smoothing kernels. As discussed earlier (see section 2.1) it is necessary to choose kernels that are even (equation 2) and normalized (equation 3), to be able to limit the interpolation error. In order to obtain stability it is crucial to pick smoothing kernels that are zero with vanishing derivates at its boundaries (Müller, Keiser, Nealen, Pauly, Gross, & Alexa, 2003). Apart from these basic constraints it is important to consider computational time when designing a kernel. The kernel has to be evaluated several times for each particle in the simulation at every iteration. Even a small change in the complexity of the kernel can have devastating effects on the simulations performance.   There are several different smoothing kernels suggested in literature, designed to achieve different results in terms of speed and particle behavior. The most common approach is to use different kernel types depending on the scalar component to be interpolated. Müller et al. suggests the following set of kernels.    0 0                . ,. . . . . (24) . This kernel has a big computational advantage. It is relatively simple in its form and the fact that the distance variable r only appears squared, has the advantage that the kernel can be evaluated without computing any square roots. Which is necessecary when calculating the distance between two particles. Müller et al. uses this kernel for density computations. Since the above kernel (equation 24) has vanishing gradients close to its center it is not usefull to calculate pressure forces. The magnitudes of the forces due to pressure are stictly dependent on the smoothing kernels gradient (equation 13) hence would the above kernel lead to vanishing forces between particles close to each other. This is an unwanted behaviour that tend to cause clustering of particles. The following kernel (equation 25) was designed to address this problem and create strong repelling forces between particles close to each other. 0   0               . ,. . . . . (25) . As mentioned earlier, Müller et al. uses the Laplacian to calculate forces due to viscosity (equation 13). A negative Laplacian would cause the viscosity to increase the difference in velocities rather than smoothen out the velocity field. Since both of the above kernels have negative Laplacian values at some points within theire cut‐off radus, a third kernel with an exlusivly positive Laplacian is presented (equation  26). ,. 1   0   0                                       9 . . . . . (26) .

(17) 2.3 C OMPUTE U NIFIED D EVICE A RCHITECTURE (CUDA) CUDA, or Computer Unified Device Architecture, is NVIDIAS software and hardware architecture for general‐purpose programming on the GPU. CUDA was released in 2007 with the purpose of making the parallel processing capabilities of the GPU accessible through an API not related to graphics. A CUDA compatible GPU can be considered a as a device providing a set of parallel co‐processing units to the main CPU, the host. From this point on the GPU will be referred to as the device and the CPU as the host.   . 2.3.1 NVIDIA’ S H A R D W A R E A R C H I T E C T U R E It is necessary to have a basic understanding of NVIDIA’s hardware architecture in order to fully utilize a CUDA compatible GPU's processing power.   CUDA enabled GPU's consists of a number of unified general‐purposed processing units referred to as streaming multiprocessors. Each streaming multiprocessors consists of a number of streaming processors. The multiprocessors utilizes a single instruction multiple data (SIMD) architecture, meaning that each streaming processor, at a given point in time, perform the exact same instructions but on different sets of data.  . . 2.3.1.1 E XE C U T I O N M O D E L A function executed on the device is called a kernel. The same kernel is executed on the device's all streaming multiprocessors. Each kernel executes over a grid of blocks, where each block consists of a number of threads. It is up to the user to define the dimensions of the grid and the blocks.  . F I G U R E 1 . Illustrates the CUDA execution model. Each kernel is executed over a two‐dimensional grid of blocks. A block consists of threads organized in up to three dimensions. . Threads within the same block have access to the same shared memory (see section 2.2.1.2) and can be synchronized through a barrier‐like construct. It is however not possible to synchronize threads executed in different blocks. The only way to achieve synchronization over all threads is to split the workload into two separate kernels for sequential execution. As mentioned earlier, the threads are executed in a SIMD manner. The grid and block organization is typically used to determine which data to operate on. . 10 .

(18) All these facts makes the grid and block configuration a crucial part of most CUDA implementations.   When a kernel is executed, the blocks within the kernel are distributed over the available streaming multiprocessors. The threads within a block are then divided into groups of 32, called warps.  Threads contained by a warp are then distributed over the streaming processors and executed in parallel. At any point in time the hardware executes instructions from one selected warp. 2 .3. 1.2 M E M O R Y H I E R A R C H Y There are several different types of memory available on the device, each with different properties such as accessibility and functionality.   Device memory (VRAM) is the counterpart of the Random Access Memory (RAM) available on the CPU and is divided into, global memory, constant memory, texture memory and local memory.   The global memory is the only memory type that gives full access (read/write) to both the device and the host. It is therefore the only way to handle data transactions from the device to the host. However, the global memory is not cached, which makes the access times relatively high. It is therefore desirable to minimize transactions to and from this memory type.   Both the texture and the constant memory are cached, which substantially reduces the access times in situations where the same memory is accessed several times. In addition, the texture memory provides hardware support for certain functions such as interpolation etc.   The local memory has the same performance capabilities as the global memory with the difference that it is only accessible from the specific thread that wrote to it. Since it is slow it should be avoided and can often be substituted with the multiprocessors local memory (on‐chip memory).    There are two types of on‐chip memory, shared memory and registers. They both have extremely fast access times. Shared memory is shared between blocks on the same streaming multiprocessors. It is often used as manually controlled cache, where data from the global memory can be copied when accessed frequently within the same block (see section 2.3.3.1). Registers are used for local storage of data which is only accessible from a specific thread. The downside however is that each streaming multiprocessors only have a very limited amount of on‐chip memory. If it is used without care and consideration it might greatly affect the computational performance of an implementation and possibly even prevent an application from executing.  . 11 .

(19) F I G U R E 2 . Different memory types vary in terms of access speed and accessibility. . Parallel processes generally consume a lot of data, it is therefore extremely important to make the data as easily accessible as possible to prevent memory access time from becoming a bottleneck. This is further discussed in section 2.3.2.1.   . 2.3.2 P R OG R A M M I N G I N T E R F A C E There are currently two different interfaces supported to write CUDA programs: CUDA driver API and C for CUDA. The first one is a lower‐level API which offers a very high level of control but also requires more code. C for CUDA on the other hand is basically an extension to the C language, exposing the CUDA programming model and is the interface I used for my implementation. It provides the programmer with libraries necessary to transfer data back and forth to the graphic card, execute kernels and synchronize threads etc. NVIDIA's CUDA compiler (NVCC) automatically separates the CUDA source files into host code and device code and then compile the device code into binary code and leaves the host code to be compiled by the systems standard C compiler. . . . 2 .3. 2.1 O P T I M I Z A T I O N S T R A T E G I E S Functional CUDA code is far from a guarantee of good performance. The real challenge is to program the CUDA kernels in a way that fully utilizes the GPU’s processing capability. There are three main strategies to achieve this, maximizing parallel execution, optimizing memory usage and optimizing instruction usage (NVIDA Corporation). One step towards maximizing parallel execution is to provide the streaming multiprocessors with as many warps as possible. This increases the streaming multiprocessors’ possibilities to hide memory latencies by switching from warps currently reading from memory, to warps that are ready for execution. However, the on‐ chip resources limit the amount of warps, threads and thread blocks a multiprocessor . 12 .

(20) can handle simultaneously. The on‐chip resources available on the GeForce‐8 series is summarized below. •. Max 24 warps per multiprocessor  . •. Max 768 threads per multiprocessor  . •. Max 28 thread blocks per multiprocessor . •. Max 8192 32‐bit registers per multiprocessor . •. Max 16384 bytes shared memory per multiprocessor . These resources do vary slightly depending on the graphic card available in the machine. However, all card specific figures presented in this report are reflecting the Geforce‐8 series resource specifications, since that was the card type available at Sony Pictures Imageworks during my internship. The ratio of active warps to the maximum number of warps available on the GPU is called the occupancy.  Higher occupancy will not necessarily lead to better performance but it will prevent bottlenecks due to long memory access times. Memory access times can be reduced by avoiding data transfers to memories with low bandwidth. Transfers between host and device units should be avoided and transfers between global and local device memory should be minimized. One way of minimizing threads access to the global memory is to let threads within a block load parts of the global memory into the shared memory. This does however require that threads within the same block needs access to data at the same place in the global memory. Another strategy to reduce global memory access is simply to recalculate data rather than fetch the result from previous a calculation that has been stored in the memory. Moreover, it is possible to reduce access time by optimizing memory access patterns. It is for example faster to fetch data that is organized in a sequential order than data that is scattered all over the memory. Since data required for a specific warp is fetched from the memory at the same time, a lot can be gained from making sure that the threads within that warp will require memory that is organized sequentially in memory. Another notable fact about the device is that it is optimized to fetch data that is organized in groups of four (traditionally red/green/blue/alpha). Depending a little on what the access patterns look like it is often more efficient to organize and fetch data in groups of four even if one of the data spots is not used. As for optimizing instruction usage, some computational heavy arithmetic functions can be replaced with less expensive versions especially modified to perform well on the device. This does however mean some loss of precision and should only be done in cases where it will not affect the quality of the end result.   All these optimization techniques are discussed in more depth in NVIDIA’s CUDA programming guide (NVIDA Corporation). . . 13 .

(21) C HAPTER 3 . IMPLEMENTATION   Knowing the advantages and weaknesses of the GPU, it is possible to implement the algorithm in a way that will result in maximum possible utilization of the available processing resources. Even though the SPH algorithm is the driving element behind the whole implementation there are several important parts necessary to set up and control the behavior of the simulation. Figure 3 gives an overview of the different stages involved in executing one iteration of a simulation and where the data is processed. Each stage is described in more details later on in this chapter. . F I G U R E 3 . Illustrates the different stages involved in the execution of one iteration. The left side lists operations being processed on the host while the right side lists operations being executed on the device. . Note that a CPU version of the entire algorithm has been implemented in addition to the GPU version. There are two main reasons for this. Firstly, implementing an identical version running on the CPU makes it possible to investigate the computational gain of utilizing the GPU’s processing power. Secondly, only a small part of the machines currently available at Sony have CUDA enabled graphics cards installed.   . 3.1 SPH A LGORITHM When considering the SPH algorithm from a practical perspective, it soon becomes obvious that interpolating particle values is going to be the most resource demanding part. It requires stepping through each particle to sum up its neighbors contribution to the current value. Note that a particle is only considered a neighbor to another particle if it lies within the specified kernel radius of the smoothing kernel. Bearing in mind that a full scale simulation can consist of hundreds of thousands of particles and that it is not 14 .

(22) unusual that particles have around 20‐50 neighbors, a lot can be gained by speeding up this process. Firstly, it will be necessary to find a data structure capable of quickly find a particle’s neighbors. Secondly, the workload coupled to the interpolation has to be processed in a way that fully utilizes the GPU’s capacity.   . 3.1.1 D A T A S T R U C T UR E As discussed earlier, the nature of the SPH algorithm makes it crucial to efficiently find a particle’s neighbors within the kernel radius. To exhaustively search through and compare every particle in the set is a waste of resources and practically unthinkable for anything else than extremely small particle sets. It is therefore necessary to sort the particles into some kind of data structure that can be used to limit or guide the search algorithm through the particle set. There are numerous different approaches to do this, where Kd‐trees and different hash tables are examples of possible solutions. However, I choose to sort the particles, based on position, into a uniform and wrapped grid structure and then limit the search for particles within the kernel radius to the corresponding and neighboring grids. By choosing a cell size equal to the kernel radius, no points are neglected. The size and dimension of the grid is static throughout the simulation but each point is hashed into the grid using a simple hashing function: %. . . . . . . . . % %. . (27) . (28) . . . . . . . . (29) . . . . . . . (30) . The above described structure makes it possible to sort an infinitely large geometric span into the grid without changing its dimensions. Even though the hashing leads to distance comparisons between particles that are potentially far away from each other this has turned out to be negligible considering the computational time.   When a hashing key have been calculated for each particle, the particles are being sorted and the position array will be reordered based on grid id. This means that particles hashed into the same cell will be placed next to each other in the position array.   This grids structure is represented by a set of arrays. One array storing the position of each particle, sorted in order of their hash keys and two arrays used to store the beginning and end of each cell. . 15 .

(23) F I G U R E 4 . The left side illustrates six particles’ spatial distribution in relation to a grid system; the dashed lines symbolize cells that will be wrapped; the gray circles denote the particles’ interaction radius. The top array on the right side demonstrate how the x and y positions of the particles is stored sequentially in a flat array structure. Each particle is given a hash number (second array) and the particle position array is sorted based on this number (third array). Note how the particle in grid (1, 3) is given the same hash as the particle in grid (1, 1).  The two last arrays are used to keep track of where each cell starts and end. Only the last three arrays are necessary to represent the particles sorted into the grid structure. . As mentioned earlier, data is accessed faster if organized into groups of four. For that reason are all arrays that are storing three‐dimensional data such as; position, velocity, acceleration and force, stored with an extra “dummy” data value.  This will obviously require more memory but it does decreases memory access time which has been prioritized in this case. . 3.1.2 SPH A N D P A R A L L E L P R O C E S S I N G Computation of density, pressure and the different forces acting on the particles are all evaluated on the device. Note that computations bound to the collision handling are processed on the host. The reasons for not processing the collision handling on the device is both because of problems with storing and handling collision object data in device memory and lack of implementation time. Since the result for each particle is independent of the result of the other particles, it is possible to distribute the workload so that each particle gets a thread each. Little would be gained by splitting the computations necessary for one particle over multiple threads, since it would not change the amount of registers used by each thread. The other option, to let one thread handle multiple particles would lead to a way too high register count per thread and thereby limit the numbers of blocks simultaneously loaded on to a streaming multiprocessor. Therefore does one thread per particle seem to be both the most intuitive and effective solution.   . . 16 .

(24) 3 .1. 2.1 G R I D A N D B L O C K D I M E N S I O N When it comes to find the most appropriate grid and block configuration it is important to determine if there are any dependencies between threads that can be taken advantage of. For example, if the threads can be organized into blocks where each thread will require the same data from global memory, a lot can be gained by loading that specific part of the global memory into the shared memory. Such a relation can be found between particles sorted into the same cell, since they got the same potential neighbors.   However, the amount of particles in one cell can vary greatly, some cells may be empty or only containing a few particles where as other cells may contain hundreds or even thousands of particles. Since the amount of threads in a block has to be static throughout a kernel this uneven distribution makes it hard to take advantage of this similarity by using the shared memory.   Nonetheless, a certain level of similar or even identical data dependencies can be achieved by sorting the particle array on cell index and then divide the array into equally big chunks and distribute them over the blocks. Some block will contain particles from several different cells with rather different data dependencies while as other blocks will contain particles from only on cell with very similar data dependencies.  Even if this would not make it possible to utilize the shared memory it will at least lead to better coherence in the memory reads.  In addition, this brings up the possibility to make use of the texture memory. Since the above described setup will result in a lot of identical memory requests close to each other (in time) it has great potential of taking advantage of the texture memory’s ability to cache data. Accessing data in the cached memory is nearly as fast as to read from on‐chip memory. By binding the global memory arrays to textures and use texture lookups to fetch data the memory access time is improved drastically. The conclusion is that since we cannot find a way to predict the spatial limits of particles sorted into a specific block, the best solution would be a flat grid and block configuration.   3 .1. 2.2 O C C U P A N C Y To fully take advantage of the GPUs parallel processing capability it is necessary to maximize the amount of active threads on each multiprocessor. As discussed in section 2.3.3.2 the on‐chip resources limits the amount of threads a multiprocessor can handle simultaneously. The most limiting factor in this implementation turned out to be the number of registers possible to load on to a multiprocessor. The original kernel used to interpolate the forces acting on the particles each thread consumed 28 registers. With that amount of registers per thread and 64 threads per block the streaming multiprocessor would only be able to handle 8 warps simultaneously out of the maximum of 24 active warps. This leads to an occupancy value of 33%. Such a low occupancy value will leave the streaming processors with few warps to choose between hence it poses a major risk to cause memory access times to limit the CPU’s performance. By moving the least accessed data from the register to the private global 17 .

No results found