Real-Time Rendering of Pre-Simulated Particles (Realtidsrendering av försimulerade partiklar)


Department of Science and Technology
Linköping University
SE-601 74 Norrköping, Sweden

LiU-ITN-TEK-A-14/002-SE


Nathalie Ek

2014-03-14

Master's thesis in Media Technology
at the Institute of Technology, Linköping University

Supervisor: Joel Kronander

Examiner: Jonas Unger


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/


Abstract

This master's thesis presents a method for real-time streaming of pre-simulated particle systems. The particle systems are simulated offline in any software and then saved as an Alembic file. This Alembic file is then imported into Frostbite and can be loaded at run-time. The result of this thesis work is the implementation of a streaming and rendering framework for pre-simulated particles. The implementation contains only basic shading and lighting due to the time constraints of the work, but the streaming part features an advanced solution that gives predictable and manageable CPU and memory overhead. The implementation performs well and works satisfactorily.


Contents

4.1.4 The Emitter
4.1.5 Simulation
4.1.6 Rendering
4.1.7 Performance
5 Method
5.1 Introduction
5.2 Alembic importer
5.2.1 Particle information
5.3 Pipeline
5.4 Runtime entity
5.5 Streaming
5.5.1 Streaming Cache
5.5.2 Streaming State Machine
5.5.3 Reading from the Cache
5.6 Rendering
5.7 Sorting
5.7.1 Sorting using a key-index stream
5.7.2 Sorting the Particle Stream directly
5.7.3 CPU sorting
6 Result
6.1 Pipeline results
6.3 Workflow Results
6.4 Screenshots
6.5 Performance
7 Discussion
7.1 Future Improvements
7.1.1 Compression
7.1.2 Lighting


List of Figures

3.1 Particle System in Genesis
3.2 Spacewar (1962)
3.3 SIMD
3.4 Movie Texture
3.5 Mergesort
3.6 Quicksort
5.1 Data flow
5.2 State machine
5.3 Chunk states
5.4 Ringbuffer example
5.5 Information split over chunks
6.1 Particle Stream simulation
6.2 Particle Stream simulation
6.3 Combined CinemaStream and ParticleStream simulation
6.4 Combined CinemaStream and particle simulation


List of Tables

3.1 LSD - Sorting by least significant digit


Chapter 1 Introduction

This chapter gives a short motivation for the work. It also goes through the third-party tools involved in this master's thesis work.

1.1 Motivation

Particles are of great value when it comes to giving the viewer or user a believable experience, whether in movies or in games. Hence, particle systems are an important tool for visual effects in both. The challenge with particles in the movie industry is not efficiency: it does not matter if one frame takes two days to render as long as the end result is rewarding. In games, the situation is the opposite: the frame time has to stay below roughly 33 ms (for 30 fps) to be usable. On the latest generation of consoles (Xbox One and PlayStation 4), Frostbite runs at 60 fps, which cuts the frame budget in half to roughly 16 ms.

The quality of particle simulations and the quantity of particles that can be simulated have increased enormously since the introduction of GPU-accelerated particle systems.

In modern games, particle systems tend to use a small number of large particles combined with shading, rather than a large number of small particles. The particles are usually rendered as point sprites, and an artist provides a painted texture, sometimes along with a normal map so that the particles integrate better with the light in the scene.

Since the number of particles increases with the evolution of hardware, shading becomes more and more important. Defects that could previously be concealed with textures become more obvious as the number of particles increases. One of the important aspects of shading dense particle systems is self-shadowing, i.e. that particles can cast shadows on other particles. Self-shadowing gives important cues to the density and shape of a particle cloud.

1.2 The Company

1.2.1 EA Digital Illusions CE - DICE

EA Digital Illusions CE (DICE) was founded by Ulf Mandorff, Olof Gustafsson, Fredrik Liliegren, Andreas Axelsson and Markus Nyström in Alvesta in the early 1990s. The first games released were Pinball Dreams (1992) and Pinball Illusions (1995). The release of Battlefield 1942 was a major breakthrough in the game industry. Today the Battlefield series is among the most popular first-person shooter series on all platforms.

In 2004 the company entered a collaboration with Electronic Arts (EA), resulting in DICE becoming an affiliated company of EA.

1.2.2 Frostbite

Frostbite (currently Frostbite 3) is a game engine developed by DICE. As of today, the engine is designed for use on Microsoft Windows, PlayStation 4, Xbox One, PlayStation 3, Xbox 360 and mobile platforms. It is also adapted to a wide range of video game genres.

“Frostbite empowers game creators around the world to shape the future of gaming. Together with our game teams, we provide a smooth, beautiful, and collaborative development experience.” [1]

1.3 Third Party Tools

1.3.1 RealFlow

RealFlow is a fluid and dynamics simulator for the 3D industry and is used within the movie industry for creating realistic fluid simulations. RealFlow fluids have appeared in, among others, Resident Evil Retribution, Ice Age 4, The Avengers and The Girl with the Dragon Tattoo. RealFlow has also been used in games like Crysis 2 and Mass Effect 3.

RealFlow Intuitive Fluids is an industry-standard, out-of-the-box fluid simulation software. It is fast and easy to use, and it is compatible with all major 3D platforms (Maya, 3DS Max, Lightwave, Softimage, Houdini, Cinema4D).

In January 2008 RealFlow won the Academy Technical Achievement Award 2007 granted by The Academy of Motion Picture Arts and Sciences.

1.3.2 Alembic

Alembic is an open computer graphics interchange framework which contains tools for collaboration management and a generic, extensible data representation scheme. It includes a C++ library, a file format, client plugins and applications. It was initially developed in 2010 by teams from Sony Pictures Imageworks and Industrial Light & Magic.

Alembic was made to create an open standard for scene data sharing and to support a baked-data workflow. It enables easy hand-off between disciplines and fast workflows, leading to greater productivity. Alembic

Chapter 2 Related Work

This chapter presents a summary of related work in the fields of particle simulation and sorting algorithms.

2.1 Related Work

In 2004, A. Kolb and P. Kipfer [2], [3] introduced GPU-based particle systems for real-time animation and rendering in OpenGL, since the CPU was too slow. The GPU was subsequently used for simulating fluid motion with smoothed-particle hydrodynamics (SPH) [4].

Different application purposes require different parallelization strategies. The particle system used by A. Kolb and P. Kipfer in 2004 did not require any knowledge about neighboring particles and was easily parallelized (i.e. one thread per particle). The SPH implementation used in [4], as well as that of P. Kipfer [3], required information about the local neighbors due to collision detection and advection of the particles. Both are Forward-Euler solutions where the particle velocity could be adjusted by using a small uniform time step to allow the system to converge.

The implementation used in Dynamic Particle System for Mesh Extraction on the GPU [5] does not have a uniform time step. Instead, each particle determines its step size based on its energy and the local curvature. This allows for faster convergence for the purpose of the mesh extraction. One recent advance in extracting high-quality meshes for isosurface computation is based on a dynamic particle system. Particle placement techniques require a significant amount of time to produce a satisfactory mesh, and to address this problem Kim et al. [5] have studied the parallelism of particle placement and the use of CUDA, a parallel programming platform for the GPU. The approach significantly improves the performance of the particle placement. Kim et al. [5] present their curvature-dependent sampling method and its implementation using CUDA on the GPU to extract high-quality meshes. They also devise an efficient implementation of a particle system on the GPU to reduce the runtime of the particle system.

Drone [6] presents different methods for creating an advanced interaction particle system. The computations and data reside entirely on the GPU, and Drone [6] uses non-parametric particle systems on the GPU to display the complex behavior of a particle.

Non-parametric particle systems must react to their environment as well as to each other, which gives them a true advantage over parametric systems. To handle the particle interaction in the system, Drone [6] outlines a method of dealing with N-body problems on the GPU: Force Splatting for N² Particle Interactions. The goal is to project the force from one particle onto all other particles in the system during a single operation. The algorithm exploits the alpha blending capabilities and the fast rasterization of modern graphics hardware without the constant need to recreate complex space partitioning structures on the GPU.

Another important aspect when it comes to particle systems is how to sort the particles.

In the past, many sorting algorithms have been proposed (Knuth [7], Martin [8]), and Quicksort is one of the fastest algorithms used in practice. Since it is so widely used, there are also many optimized implementations of Quicksort available. A Quicksort implementation using SIMD instructions would be ideal, but according to Inoue [9] there is no known technique for implementing the Quicksort algorithm using existing SIMD instructions.

The performance of sorting is often dominated by pipeline stalls caused by branch mispredictions, according to Sanders and Winkel [10]. The algorithm presented by Inoue [9] makes it possible to take advantage of the data parallelism of SIMD instructions while avoiding pipeline stalls caused by mispredictions.

Chapter 3 Background

This chapter will go through the history of particle systems and the hardware evolution. It will also describe the third-party software used in this report and provide a comparison of sorting algorithms suitable for particle systems.

3.1 History

3.1.1 A particle system is born

In the early 1980s, William Reeves, who worked on Star Trek II: The Wrath of Khan, started to research methods for creating realistic natural phenomena in real time, in particular realistic fire for the Genesis Demo sequence (see figure 3.1). Reeves realized that conventional modeling would not do the trick, even though it was the best approach for creating objects with smooth, well-defined surfaces [11].

He used the term “fuzzy” when referring to the objects and thought it would be better if they were modeled as a system of particles that behaved within a set of dynamic rules.

Reeves was neither the first who wanted to use particles nor the first to use them. Particles had been used before to create natural effects such as smoke and galaxies of stars. However, particles turned out to be hard to control. Even though they were not referred to as such, particle systems were used in some of the very first video games, for example Spacewar (see figure 3.2), which used particles to display explosions as early as 1962.

Figure 3.1: An early particle system used in the Genesis sequence.

Reeves came to the conclusion that by applying a system of rules to the particles, a chaotic effect could be achieved while maintaining some creative control.

During the past thirty years there have been many particle systems that made people gasp for air while watching buildings blow up, giant cities get flooded or other visually stunning effects. Notable uses in the movie industry include Lord of the Rings, Transformers, The Avengers and Ice Age.

Figure 3.2: Two spaceships, one goal - blast each other out of the sky. Built by a small team at MIT led by Steve Russell. The relatively action-packed

3.2 Hardware evolution

Both CPU and GPU hardware have evolved extremely fast over the last couple of decades.

3.2.1 CPU evolution

In the late 1970s, Intel released the first x86 CPU (Central Processing Unit), the Intel 8086. Its instruction set gradually became the industry standard, and Intel processors were the most popular processors going into the 1990s. That decade saw a race in which processor clock speeds increased at a rapid pace. However, in the early 2000s, processor manufacturers hit what would become known as the power wall: it was not possible to create processors with higher clock frequencies without using large amounts of power. This led to the introduction of multicore processors. The shift was not only a shift in hardware, since a new type of programming was needed to fully utilize the power of the new processors. This is where the future of processors lies: multiple cores and threads running concurrent software.

3.2.2 GPU evolution

GPUs (Graphics Processing Units) started emerging in the early 1990s. Early examples of mass-market dedicated graphics hardware were the PlayStation and the Nintendo 64. The first GPUs were separate discrete boards dedicated to accelerating 3D functionality, the most notable example being 3dfx with their Voodoo cards. The 1990s also saw the introduction of OpenGL, and quite soon the API influenced hardware development. During the late 1990s, Microsoft introduced an API similar to OpenGL, called DirectX. This API eventually gained popularity and became the industry standard among Windows game developers.

(General Purpose computing on GPU). The term was coined by Mark Harris in 2002 when he noticed a trend of using the GPU for non-graphics work [12]. The GPGPU evolution is still ongoing and it is still not as common as might have been expected.

Even more complex systems can be simulated by creating a new particle system when each particle dies. The technique of using more than one particle system was used as early as in the Genesis sequence in Star Trek II. Up to 400 particle systems consisting of 750 000 particles were used.

3.3 SIMD - Single Instruction Multiple Data

SIMD (Single Instruction Multiple Data), see figure 3.3, is a method which lets one instruction operate on multiple data items at the same time. A single instruction performs the same action (retrieve, calculate or store) simultaneously on two or more pieces of data. Typically, a SIMD machine consists of many simple processors, each with a local memory in which it keeps the data it works on.

The advantage of SIMD is that for the cost of issuing a single instruction, N instructions' worth of work is performed. This results in large speedups for data-parallelizable algorithms.

Each processor simultaneously performs the same instruction on its local data, progressing through the instructions in lock-step, with the instructions issued by the controller processor. SIMD concepts are also applicable to GPUs, since they are massively parallel units capable of vector operations on wide registers.

Particle systems are in general data-parallel, i.e. the same operation is performed on a large amount of data. Therefore, SIMD programming is beneficial for particle systems.
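As a minimal sketch of what "data-parallel" means for particles, the loop below applies the same update to every element of a structure-of-arrays layout. The names `ParticlesSoA` and `integrate` are hypothetical and not taken from Frostbite; the sketch relies on the compiler to auto-vectorize the loop rather than using explicit SIMD intrinsics.

```cpp
#include <cstddef>
#include <vector>

// Structure-of-arrays layout (an assumption for this sketch): each attribute
// is contiguous in memory, which lets a compiler map the loop onto SIMD lanes.
struct ParticlesSoA {
    std::vector<float> x, y, z;     // positions
    std::vector<float> vx, vy, vz;  // velocities
};

// Applies the same operation (position += velocity * dt) to every particle.
void integrate(ParticlesSoA& p, float dt) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}
```

Because no iteration depends on another, N lanes can process N particles per instruction, which is the property the paragraph above refers to.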

Figure 3.3: SISD - Single Instruction Single Data. SIMD - Single Instruction Multiple Data.

DirectCompute

DirectCompute is an API from Microsoft that supports general-purpose computing on graphics processing units. It is part of the Microsoft DirectX collection of APIs and was initially released with the DirectX 11 API. DirectCompute allows for vendor-independent development of GPGPU algorithms, which can be used for particle system computations.

Movie Textures

To simulate very complicated particle effects in real time, reality has to be faked. One way to simulate such an effect, which could be a big explosion, is to use movie textures.

Movie textures are animated textures that are created from a video file. It is a technique where a movie is played back on a simple polygon. They can be used for cut scene movie sequences or to render movies into the scene itself.

Figure 3.4: Example of a movie texture used in Battlefield 4.

3.4 Sorting Particles

Since particles are often transparent to some degree, they have to be sorted. This guarantees that the order in which the particles are rendered corresponds to their depth. The fundamental operation of many sorting algorithms is to compare two values and swap them if they are out of order. Since sorting is such a fundamental part of particle systems, it is important to conduct experiments to get performance numbers for the different algorithms before making a choice.
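The depth ordering can be illustrated with a short sketch that sorts particle positions back to front relative to a camera position. `Vec3`, `distanceSquared` and `sortBackToFront` are hypothetical names, and `std::sort` is used here only for brevity; it is not the sorting method chosen in this work.

```cpp
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };

// Squared distance is enough for ordering and avoids a square root.
float distanceSquared(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Orders positions back to front so that alpha blending composites the
// farthest particles first.
void sortBackToFront(std::vector<Vec3>& positions, const Vec3& camera) {
    std::sort(positions.begin(), positions.end(),
              [&camera](const Vec3& a, const Vec3& b) {
                  return distanceSquared(a, camera) > distanceSquared(b, camera);
              });
}
```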

3.4.1 Bubblesort

Bubblesort compares each element to the next element and swaps them if they are out of order. The gap between the two elements being compared is always one. The complexity of Bubblesort is Θ(n²).
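A minimal sketch of Bubblesort as described above, sorting floating-point keys in ascending order. The early exit when a pass performs no swap is a common refinement and an addition of this sketch.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

void bubbleSort(std::vector<float>& v) {
    bool swapped = true;
    // After each pass the largest remaining element has reached its final slot.
    for (std::size_t pass = 0; swapped && pass + 1 < v.size(); ++pass) {
        swapped = false;
        for (std::size_t i = 0; i + 1 < v.size() - pass; ++i) {
            if (v[i] > v[i + 1]) {  // compare neighbours (gap is always one)
                std::swap(v[i], v[i + 1]);
                swapped = true;
            }
        }
    }
}
```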

3.4.2 Combsort

Combsort is an extension of Bubblesort. It compares and, if needed, swaps two non-adjacent elements. Performance is drastically improved by comparing values that are far apart, because each value moves toward its final position more quickly. Unlike in Bubblesort, the gap between the two compared elements can be more than one, which requires a modification of the inner loop that handles the actual sort. The gap is reduced by a shrink factor for each iteration of the outer loop:

[inputSize / shrinkFactor, inputSize / shrinkFactor^2, inputSize / shrinkFactor^3, ...]

The length of the list being sorted divided by the shrink factor decides the value of the gap and the list is then sorted with that gap. During the sorting the gap is divided by the shrink factor again and the process repeats until the gap is 1. If the list is not fully sorted by this point, the sort continues using a gap of 1 until sorting is completed. The final step is thus equivalent to an efficient Bubblesort since most problems have been dealt with. The computational complexity of combsort approximates Θ(n · log(n)) on average.
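The gap sequence can be sketched as follows. The default shrink factor of 1.3 is an assumption of this sketch (it is a commonly used value; the text above does not specify one), and `combSort` is a hypothetical name.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// shrinkFactor must be greater than 1 for the gap to decrease.
void combSort(std::vector<float>& v, double shrinkFactor = 1.3) {
    std::size_t gap = v.size();
    bool sorted = false;
    while (!sorted) {
        gap = static_cast<std::size_t>(gap / shrinkFactor);
        if (gap <= 1) {
            gap = 1;
            sorted = true;  // last pass unless a swap proves otherwise
        }
        for (std::size_t i = 0; i + gap < v.size(); ++i) {
            if (v[i] > v[i + gap]) {
                std::swap(v[i], v[i + gap]);
                sorted = false;
            }
        }
    }
}
```

Once the gap has reached 1, the loop keeps running Bubblesort-style passes until a pass completes without a swap.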

3.4.3 Insertion sort

Insertion sort is an algorithm that is relatively efficient for small lists and mostly sorted lists (Shellsort is a variant of Insertion sort that is more efficient for larger lists). Being a simple sorting algorithm, it is often used as part of a more sophisticated algorithm. It builds the final sorted array (or list) one item at a time.

Insertion sort is not as efficient on large lists as more advanced algorithms (Quicksort, Heapsort or Mergesort), but it provides several advantages: it is simple to implement and efficient for small data sets. Insertion sort is also stable, can sort a list as it receives it, and only requires a constant amount, Θ(1), of additional memory space.

With each iteration the output list, which is sorted, grows. In each iteration, one element is removed from the input data, the location it belongs to in the output list is found, and the element is inserted at that position.

The sorting is typically done in-place by growing the sorted array behind the position that is being iterated. The value of each array element is compared against the largest value in the sorted array. If the value is larger, the element is left in place and the algorithm moves on to the next element. If the value is smaller, the algorithm finds the correct position in the sorted part, shifts all the larger values to make room, and then inserts the element at the correct position.
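The in-place variant described above can be sketched as follows (ascending order; `insertionSort` is a hypothetical name):

```cpp
#include <cstddef>
#include <vector>

void insertionSort(std::vector<float>& v) {
    // v[0..i) is the sorted part; v[i] is the next element to insert.
    for (std::size_t i = 1; i < v.size(); ++i) {
        const float value = v[i];
        std::size_t j = i;
        // Shift larger values one step to the right to make room.
        while (j > 0 && v[j - 1] > value) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = value;
    }
}
```

Elements with equal keys are never moved past each other, which is what makes the algorithm stable.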

3.4.4 Mergesort

Mergesort, see figure 3.5, is a divide-and-conquer, comparison-based algorithm invented in 1945 by John von Neumann.

Mergesort first divides the unsorted list into the smallest units possible (single elements) and then compares each element with the adjacent list to sort, i.e. it compares every two elements (the first with the second, the third with the fourth, etc.) and makes a swap if the first element should come after the second. Each of the resulting lists of two is merged into lists of four, which are then merged into lists of eight, and so on, until at last two lists are merged into the final sorted list.

Mergesort scales well to very large lists and its worst-case running time is Θ(n · log(n)). In the worst case, Mergesort does ca. 39 % fewer comparisons than Quicksort does in the average case.

Stable sorting in-place is possible but causes the algorithm to be a bit slower, even though it still manages Θ(n · log(n)) time. Stable sorting can be achieved by merging the blocks recursively.
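The pairwise merging of runs of length 1, 2, 4, ... can be sketched as a bottom-up Mergesort. This sketch is not in-place: it uses a scratch buffer of the same size as the input, and it relies on `std::merge` from the standard library for the stable merge step.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

void mergeSort(std::vector<float>& v) {
    const std::size_t n = v.size();
    std::vector<float> scratch(n);
    // Merge sorted runs of length `width` into runs of length 2 * width.
    for (std::size_t width = 1; width < n; width *= 2) {
        for (std::size_t lo = 0; lo < n; lo += 2 * width) {
            const std::size_t mid = std::min(lo + width, n);
            const std::size_t hi = std::min(lo + 2 * width, n);
            std::merge(v.begin() + lo, v.begin() + mid,
                       v.begin() + mid, v.begin() + hi,
                       scratch.begin() + lo);
        }
        v.swap(scratch);  // the merged runs become the input of the next pass
    }
}
```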

Figure 3.5: Mergesort

3.4.5 Quicksort

Quicksort, see figure 3.6, is a divide-and-conquer algorithm that relies on a partition operation. It is also a comparison sort, since the elements are compared to each other. Quicksort is a fast sorting algorithm which has, on average, Θ(n · log(n)) complexity, which makes it suitable for sorting large data sets.

A pivot, which is one of the array elements, is selected and then used in the partition step. The pivot splits the array into two sublists that are reordered: smaller values are moved to the left of the pivot and larger values are moved to the right. This can be done efficiently in linear time and in-place.

The two sublists are then recursively sorted, each sublist getting its new pivot point and sublists.
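A minimal recursive sketch of the partition-and-recurse scheme. Choosing the last element as pivot (Lomuto partitioning) is an assumption of this sketch; the text above does not prescribe a pivot strategy, and the function names are hypothetical.

```cpp
#include <utility>
#include <vector>

// Partitions v[lo..hi] (inclusive) around the pivot v[hi] and returns the
// pivot's final index. Runs in linear time and in-place.
int partitionAroundPivot(std::vector<float>& v, int lo, int hi) {
    const float pivot = v[hi];
    int store = lo;
    for (int i = lo; i < hi; ++i) {
        if (v[i] < pivot) {
            std::swap(v[i], v[store]);
            ++store;
        }
    }
    std::swap(v[store], v[hi]);
    return store;
}

// Sorts v[lo..hi] (inclusive) in ascending order.
void quickSort(std::vector<float>& v, int lo, int hi) {
    if (lo >= hi) return;
    const int p = partitionAroundPivot(v, lo, hi);
    quickSort(v, lo, p - 1);
    quickSort(v, p + 1, hi);
}
```

A full sort of a non-empty vector `v` is started with `quickSort(v, 0, static_cast<int>(v.size()) - 1)`.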

Figure 3.6: Quicksort

3.4.6 Radix Sort

Radix sort is a sorting algorithm that sorts numbers by processing their individual digits, which makes it a distribution sort rather than a comparison sort. It is both easy to understand and easy to use.

LSD (Least Significant Digit) radix sort sorts the numbers by the least significant digit, then by the next least significant digit, and so on (see table 3.1). If two numbers have the same digit in the current pass, the one that comes first must stay before the other. Each pass must therefore use a stable sort, which preserves the relative order established by the previous passes.

Table 3.1: LSD - Sorting by least significant digit.

Unsorted   Sorted by 1s   Sorted by 10s   Sorted by 100s
123        123            123             123
583        583            625             154
154        154            154             456
567        625            456             567
689        456            567             583
625        567            583             625
456        689            689             689

MSD (Most Significant Digit) radix sort instead starts from the most significant digit. It does not require the use of a stable sort, and the in-place MSD radix sort is not stable.
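The LSD variant can be sketched for 32-bit unsigned keys as follows. Using one byte (radix 256) per pass instead of one decimal digit is an assumption of this sketch, and it is a scalar illustration rather than the SIMD implementation discussed in this report. Each pass is a stable counting sort.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

void radixSortLSD(std::vector<std::uint32_t>& keys) {
    std::vector<std::uint32_t> scratch(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {  // least significant byte first
        std::array<std::size_t, 257> count{};      // zero-initialized histogram
        for (std::uint32_t k : keys) ++count[((k >> shift) & 0xFFu) + 1];
        // Prefix sum: count[d] becomes the first output index for digit d.
        for (std::size_t d = 0; d < 256; ++d) count[d + 1] += count[d];
        // Stable scatter: keys with equal digits keep their relative order.
        for (std::uint32_t k : keys) scratch[count[(k >> shift) & 0xFFu]++] = k;
        keys.swap(scratch);
    }
}
```

Floating-point depth values would first have to be mapped to unsigned integers that preserve their order; that step is outside the scope of this sketch.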

3.4.7 Summary

The choice of sorting algorithm was based both on measurements and the suitability for a SIMD implementation. Since quicksort is recursive, it was discarded quite early even though it has good performance characteristics in the single threaded case. The choice was made to use radix sort due to the possibility for a SIMD implementation.

3.5 Third Party Software

3.5.1 RealFlow

RealFlow is used to simulate fluid, water surfaces, fluid-solid interactions, soft bodies, rigid bodies and meshes. It uses particle based simulations which can be influenced in a multitude of ways by point-based nodes. These nodes can do anything from simulating gravity to recreating the vortex-like motion of a tornado. [13]

Large Scale

A cutting-edge hybrid grid/particle solver, Hybrido, provides endless possibilities for large-scale simulations, such as floods or oceans with breaking waves.

RealFlow automatically creates particles by calculating the conditions for splashes, foam and mist information. It is possible to generate millions of particles using this advanced feature.


Small-scale simulations

The SPH (Smoothed-Particle Hydrodynamics) solver in RealFlow is ideally suited for highly-detailed fluid simulations with tiny splashes and turbulent surfaces.

Particles are generated by emitters and their total amount represents the fluid. Each particle can be handled as a point in 3D space with certain properties (velocity, position or mass). The emitters can interact with solid or soft bodies and RealWave objects. The emitters can also be completely customized, and it is even possible to write your own fluid engine if desired.

Oceans

When an ocean surface has to be simulated quickly and effectively, RealWave is ideal since it is a powerful simulation toolset for small to medium ocean surfaces. To achieve certain wave forms and structures, it displaces the vertices of a mesh.

3.5.2 Alembic

Alembic distills complex, animated scenes into non-procedural, application-independent, baked geometric results. Alembic efficiently stores the animated vertex positions and animated transforms that result from an arbitrarily complex animation and simulation process. It does not attempt to store any representation of the network of computations which were required to produce the final animated vertex positions and animated transforms.

Chapter 4 Particle Systems

This chapter will present the concept of a particle system. The chapter will furthermore describe different types of interaction, simulation and rendering for particle systems.

4.1 Particle Systems

“In physical sciences, a particle is a small localized object which can be described by physical properties such as volume or mass. The word is rather general in meaning, and is refined as needed by various scientific fields.” [14]

4.1.1 A particle system in general

A particle system is a collection of 3D points in space where each point represents a single particle. Unlike standard geometry objects, which are static, particle systems are dynamic. Each particle goes through a complete life cycle: it is born, changes over time and then dies off. Particle systems tend to be chaotic since a given particle does not have a pre-determined path. Also, each particle can have a random element, called a stochastic process, which modifies its behavior and makes the effect look organic and natural.

There are many use cases for particle systems in games and other computer graphics applications: water, foam, smoke, dust, fire, blood splashes, sparks, even hair and cloth simulation. It is worth mentioning that these kinds of systems also have scientific applications: cosmological simulations use particle systems with tens of millions of particles for studying the creation of the universe. Particle simulations are also used in research for large and costly fusion reactors.

4.1.2 The Particle

When building a particle system, the particles can have a number of properties. The minimal set of properties is typically:

• The current position as well as the previous position.

• The direction in which the particle is currently traveling, stored as a direction vector.

• The speed of the particle, which can simply be combined with the direction vector by multiplication.

Since a particle goes through a lifetime, in addition to the above, the life count of the particle needs to be stored. This is the number of frames that the particle has existed which is compared to a set limit on the lifetime of particles.

4.1.3 Different types of Particle Systems

There are many different types of particle systems. The different kinds of systems can be categorized by the level of interaction between particles.

No Interaction

No interaction refers to the fact that the particles are independent of each other. The computational complexity of such a system is Θ(n) per step, which makes non-interacting particle systems the least computationally complex type.

for i = 1 to numParticles do
    move particle i
end for

Limited Interaction

This kind of system has particles that interact with neighboring particles, that is, other particles within a short range. A use case for this type of interaction would be collisions between hard spheres that bounce off each other.

There are two ways to compute this type of particle system: brute force or using spatial data structures. The computational complexity of the brute force approach is Θ(n²). However, by using spatial data structures and a neighborhood size a, the computational complexity can be reduced.

for i = 1 to numParticles do
    move particle i
    for all j in neighbours(particle[i]) do
        check collision between particle i and j
    end for
end for
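One possible spatial data structure is a uniform grid whose cell size equals the interaction radius, so that only the 27 surrounding cells have to be searched for each particle. The sketch below is an illustration under that assumption; `cellCoord`, `cellKey` and `findCollisions` are hypothetical names, and the hash key packs 21 bits per axis, which is only adequate for moderately sized worlds.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Vec3 { float x, y, z; };

// Integer cell coordinate of one position component.
int cellCoord(float v, float cellSize) {
    return static_cast<int>(std::floor(v / cellSize));
}

// Packs three cell coordinates into one key (21 bits per axis; very distant
// cells can alias, which is acceptable for a sketch).
std::uint64_t cellKey(int cx, int cy, int cz) {
    const std::uint64_t mask = 0x1FFFFF;
    return ((static_cast<std::uint64_t>(cx) & mask) << 42) |
           ((static_cast<std::uint64_t>(cy) & mask) << 21) |
           (static_cast<std::uint64_t>(cz) & mask);
}

using IndexPair = std::pair<std::size_t, std::size_t>;

// Returns all index pairs (i, j) with i < j that are closer than `radius`.
std::vector<IndexPair> findCollisions(const std::vector<Vec3>& pos, float radius) {
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    for (std::size_t i = 0; i < pos.size(); ++i) {
        grid[cellKey(cellCoord(pos[i].x, radius), cellCoord(pos[i].y, radius),
                     cellCoord(pos[i].z, radius))].push_back(i);
    }
    std::vector<IndexPair> hits;
    for (std::size_t i = 0; i < pos.size(); ++i) {
        const int cx = cellCoord(pos[i].x, radius);
        const int cy = cellCoord(pos[i].y, radius);
        const int cz = cellCoord(pos[i].z, radius);
        for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dz = -1; dz <= 1; ++dz) {
            const auto it = grid.find(cellKey(cx + dx, cy + dy, cz + dz));
            if (it == grid.end()) continue;
            for (std::size_t j : it->second) {
                if (j <= i) continue;  // report each pair only once
                const float ex = pos[i].x - pos[j].x;
                const float ey = pos[i].y - pos[j].y;
                const float ez = pos[i].z - pos[j].z;
                if (ex * ex + ey * ey + ez * ez < radius * radius)
                    hits.emplace_back(i, j);
            }
        }
    }
    return hits;
}
```

With roughly evenly distributed particles, each particle is tested against only the few particles in its neighborhood instead of all n - 1 others.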


Full Interaction

In this kind of system all the particles affect each other, as occurs with a gravitational or electrostatic force. The brute force approach is required, and this type of particle system has a computational complexity of Θ(n²).

for i = 1 to numParticles do
    for j = 1 to numParticles where j != i do
        compute interaction between j and i
    end for
    move particle i
end for
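As an illustration of the Θ(n²) loop, the sketch below computes pairwise gravitational attraction and then moves the particles. The `Body` type, the unit gravitational constant and the softening term that avoids division by zero are assumptions of this sketch. Unlike the pseudocode above, all accelerations are accumulated before any particle is moved, so that every interaction sees the same positions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Body { Vec3 position; Vec3 velocity; float mass; };

void stepFullInteraction(std::vector<Body>& bodies, float dt,
                         float G = 1.0f, float softening = 1e-3f) {
    std::vector<Vec3> accel(bodies.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            const float dx = bodies[j].position.x - bodies[i].position.x;
            const float dy = bodies[j].position.y - bodies[i].position.y;
            const float dz = bodies[j].position.z - bodies[i].position.z;
            const float r2 = dx * dx + dy * dy + dz * dz + softening * softening;
            // Acceleration of i due to j: G * m_j * d / |d|^3.
            const float s = G * bodies[j].mass / (r2 * std::sqrt(r2));
            accel[i].x += s * dx;
            accel[i].y += s * dy;
            accel[i].z += s * dz;
        }
    }
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        bodies[i].velocity.x += accel[i].x * dt;
        bodies[i].velocity.y += accel[i].y * dt;
        bodies[i].velocity.z += accel[i].z * dt;
        bodies[i].position.x += bodies[i].velocity.x * dt;
        bodies[i].position.y += bodies[i].velocity.y * dt;
        bodies[i].position.z += bodies[i].velocity.z * dt;
    }
}
```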

Particle Data Structure

All the properties are stored in a structure of some kind, and if a more complex particle is wanted, more properties can easily be added. Complementary additions could be a size to animate the size of the particles, the mass of the particle, or transparency in the form of an alpha component in the color.

A restriction on the particle structure is that its size should be kept as small as possible, so that huge amounts of particles can be handled with a reasonable amount of memory.

    struct Particle
    {
        Vector position;
        Vector velocity;
        float mass;
        // ...
    };


4.1.4 The Emitter

Once a particle system is created, it has to be placed in the world to make any sense. The particle emitter is an entity responsible for creating the particle system and it is this object that is placed in a 3D world.

The number of particles and the general direction in which they should be emitted as well as all the global settings are controlled by the emitter. This could for example be the above mentioned lifetime setting for the particles.

4.1.5 Simulation

For the particles to be visually interesting, they have to move. There are different ways to move the particles, either through real-time simulation or off-line simulations that are baked in different formats and played back later.

Physics

The physics model in a particle system could handle attributes such as the mass of the particle which could be randomized, causing gravity to affect each particle individually. Friction could be added to force some particles to slow down while animating. Other local spatial effects such as wind gusts, magnetic fields and rotational vortexes would make the particles stand out from each other even more. Collision handling can also be added to the particles to have them interacting with the surrounding world.

External Influences

It is important to consider all of the possible parameters that might be wanted when creating a particle system and build that flexibility into the system.


With wind as a parameter, the wind direction vector may need to change. For example, when a car drives on a snowy road, the snowflakes are affected by a new wind direction generated by the car and the snow responds to the wind as the car passes.

Updating the Particles

For each cycle of the simulation, each particle needs to be updated. To make sure not to waste valuable time, the status of the particle is inspected to see if the lifetime of the particle has expired. If the particle is marked as dead it is removed from the emitter and returned to the global particle pool.
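A per-frame update of this kind could look as follows. This is a minimal sketch with assumed names (`Vec3`, `Particle`, `updateParticles`) and an assumed explicit Euler integration under gravity; instead of returning a dead particle to a global pool, it only clears the `alive` flag so the slot can be reused.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

struct Particle {
    Vec3 position;
    Vec3 velocity;
    int  lifeCount;  // number of frames the particle has existed
    bool alive;
};

// Advances all live particles by one step and retires the expired ones.
// Returns the number of particles that are still alive after the update.
std::size_t updateParticles(std::vector<Particle>& particles,
                            Vec3 gravity, float dt, int maxLifeFrames)
{
    std::size_t aliveCount = 0;
    for (Particle& p : particles) {
        if (!p.alive) continue;  // a dead slot costs only this check

        if (++p.lifeCount > maxLifeFrames) {
            p.alive = false;     // lifetime expired: mark the slot as reusable
            continue;
        }

        p.velocity.x += gravity.x * dt;
        p.velocity.y += gravity.y * dt;
        p.velocity.z += gravity.z * dt;
        p.position.x += p.velocity.x * dt;
        p.position.y += p.velocity.y * dt;
        p.position.z += p.velocity.z * dt;
        ++aliveCount;
    }
    return aliveCount;
}
```

Checking the lifetime before integrating means that no simulation time is spent on particles that are about to be retired.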

Point Cloud Vertex Animation

To enable complex animations, for example facial animations, point clouds can be used. Point clouds are pre-rendered particle data that is exported and played back later. This data is also attached to underlying vertex data in a mesh, resulting in vertices being ”skinned” to the point cloud data. This way, it is possible to create detailed animations with the help of particles. In this case, the particles are only used indirectly and not rendered on screen.

4.1.6 Rendering

There are many ways to render a particle system depending on what the system represents.

In a violent game with a lot of blood splatter, there might be a need for multiple blood systems - blood pool, blood splat, blood squirt and camera lens blood splat. Each blood system contains suitable particles, all requiring their own rendering technique, creating a chain of called effects resulting in a final effect. The blood squirt would render blood squirts flying through the air and when the squirts collide with an object (a wall, the ground, etc.) the blood splat function would be called. This would create messy blood splats on the object.

Since a particle system is basically a collection of 3D points in space, it can be rendered as just that - a set of colored 3D points. There is always the option to calculate a polygon around the 3D point which always faces the camera like a billboard. Perspective can be created by scaling the polygon with the distance from the camera. Another option is to draw a 3D object of any type at the position of the particle - the possibilities are endless. As said above, the particles can be represented by a polygon when rendering. This polygon is most often a quad, i.e. four vertices.

    Vector vertices[] = {
        Vector( 1.f,  1.f, 0.f),
        Vector(-1.f,  1.f, 0.f),
        Vector(-1.f, -1.f, 0.f),
        Vector( 1.f, -1.f, 0.f)
    };

4.1.7 Performance

There has always been a great difference between spectacular effects in games and in movies, and until recently this has mostly been due to hardware limitations. There is simply no way to fill a Playstation 3 with millions of rendered fluid particles, blow skyscrapers up into millions of pieces or completely flood a city.

To get an accurate and realistic simulation, most use cases require a large number of particles. The more particles, the higher the computational complexity, which has a fundamental impact on performance and hence limits the size of the particle system.

Games need to have smooth animation and responsive interaction, therefore fast execution times are required. As the need for larger and more realistic particle systems increases, so does the computational complexity.

Before the introduction of GPGPU computations and the game physics engine PhysX, fluid simulations in games did not use particle systems. Today substantial real-time fluid simulations can be performed.

Memory

As mentioned in section 4.1.3, it is important that the memory requirements for a single particle are kept at a minimum to be able to have huge amounts of particles. To further manage the memory requirements for a particle system, it is common to use a pool for the particle memory. Since the pool will consist of blocks of the same size, managing fragmentation also becomes easier.

Memory operations, that is allocations and releases, should be as few as possible due to the performance overhead inherent in the required context switches. If a particle gets old and dies, it should not be released from memory. It should instead be flagged as dead and re-initialized. It is not until all the particles in the particle system are marked as dead that the allocated memory for the entire system is released. This is done by using the pool design mentioned above, so that all memory in the fixed-size pool is pre-allocated and just flagged as free when the particle is released.
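The pool design can be illustrated with a small fixed-size pool that allocates all particle memory once and recycles slots through a free list. This is a sketch under assumptions, not the pool used in Frostbite: `PoolParticle`, `ParticlePool`, `kNoSlot` and the free-list representation are hypothetical names and choices.

```cpp
#include <cstddef>
#include <vector>

struct PoolParticle {
    float position[3];
    float velocity[3];
    float mass;
};

// Returned by acquire() when every slot is in use.
constexpr std::size_t kNoSlot = static_cast<std::size_t>(-1);

class ParticlePool {
public:
    // All particle memory is allocated once, up front.
    explicit ParticlePool(std::size_t capacity) : m_particles(capacity)
    {
        m_freeList.reserve(capacity);
        for (std::size_t i = capacity; i > 0; --i)
            m_freeList.push_back(i - 1);  // slot 0 is handed out first
    }

    // Hands out a free slot without touching the system allocator.
    std::size_t acquire()
    {
        if (m_freeList.empty()) return kNoSlot;
        const std::size_t index = m_freeList.back();
        m_freeList.pop_back();
        m_particles[index] = PoolParticle{};  // re-initialize the recycled slot
        return index;
    }

    // A dead particle is only flagged as free; its memory stays allocated.
    void release(std::size_t index) { m_freeList.push_back(index); }

    PoolParticle& operator[](std::size_t index) { return m_particles[index]; }
    std::size_t freeCount() const { return m_freeList.size(); }

private:
    std::vector<PoolParticle> m_particles;
    std::vector<std::size_t>  m_freeList;
};
```

Because every slot has the same size, releasing and reacquiring particles cannot fragment the pool, and the memory footprint stays constant for the lifetime of the system.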

Rendering

When a particle, a 3D point in space, is supposed to correspond to, for example, a snowflake, the image of the snowflake has to be drawn on a polygon. A particle most likely needs four vertices, which creates two polygons. Thus, with 3000 visible snowflake particles, 6000 visible polygons are added for the snow alone. Since most particles in a particle system move, the vertex buffer cannot be pre-calculated and needs to be changed every frame. Most particle rendering methods also use hardware instancing to lower the CPU


Chapter 5 Method

This chapter will describe in detail how the implementation of a pre-simulated particle system was made in Frostbite. The chapter will describe implementation of pipelines, streaming and rendering for the particle system.

5.1 Introduction

An implementation of a pre-simulated streaming particle system was made in the Frostbite game engine. It reads particles pre-simulated in RealFlow, exported as Alembic files. The system then uses a custom streaming solution to minimize memory usage. A basic sorting algorithm and rendering system for the streamed particles were also implemented. The data flow for the system is illustrated in figure 5.1.

5.2 Alembic importer

The first step of creating a ParticleStream is authoring the simulation. This is done inside RealFlow (or any other software capable of exporting Alembic files) and the result is exported to an Alembic file. Alembic is an open source exchange format for digitally created assets, developed by Sony Pictures Imageworks and Industrial Light & Magic. The format supports high level constructs such as meshes and points (particles). It is also possible to access the low-level parts of an archive to attach and store any data such as particle scales, colors and vertex colors for meshes.

Figure 5.1: Figure illustrating the data flow from artist creation to Frostbite runtime.

To get the particle data into the Frostbite asset pipeline, a custom importer was written that reads particle data from an Alembic file and stores it inside a custom binary file in the Frostbite asset pipeline, called a sandbox file.

5.2.1 Particle information

The particle information contained in the Alembic file format is organized in a Point structure inside Alembic. This structure contains information about the position and has a numerical identifier for each particle. As stated above, it is also possible to access lower-level constructs of the Alembic archive to attach arbitrary data to each particle.

5.3 Pipeline

After the particle data has been read from the Alembic file, the pipeline processes the custom binary file to create streamable chunks of particle data. These chunks are 2MB each, which is the standard chunk size for free streaming chunks in Frostbite. The pipeline goes through all frames in the imported particle simulation and writes them into the chunks in the following format:

[position.x position.y position.z scale]
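The packing step can be sketched as appending every frame to one byte stream and cutting that stream into fixed-size chunks. The sketch is illustrative only: `PackedParticle`, `Chunk` and `packFrames` are hypothetical names, the four values are assumed to be 32-bit floats, the last chunk is assumed to be zero-padded, and any per-frame metadata the real pipeline may write is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// One particle in the [position.x position.y position.z scale] layout.
struct PackedParticle { float x, y, z, scale; };
static_assert(sizeof(PackedParticle) == 16, "expects four tightly packed 32-bit floats");

using Chunk = std::vector<std::uint8_t>;

// Appends all frames to one byte stream and cuts it into chunks of
// chunkSizeBytes (which must be greater than zero). A particle may
// straddle the border between two chunks.
std::vector<Chunk> packFrames(const std::vector<std::vector<PackedParticle>>& frames,
                              std::size_t chunkSizeBytes)
{
    std::vector<std::uint8_t> stream;
    for (const auto& frame : frames) {
        const std::size_t bytes  = frame.size() * sizeof(PackedParticle);
        const std::size_t offset = stream.size();
        stream.resize(offset + bytes);
        if (bytes > 0)
            std::memcpy(stream.data() + offset, frame.data(), bytes);
    }

    std::vector<Chunk> chunks;
    for (std::size_t pos = 0; pos < stream.size(); pos += chunkSizeBytes) {
        const std::size_t n = std::min(chunkSizeBytes, stream.size() - pos);
        Chunk chunk(chunkSizeBytes, 0);  // the last chunk is zero-padded
        std::memcpy(chunk.data(), stream.data() + pos, n);
        chunks.push_back(std::move(chunk));
    }
    return chunks;
}
```

With the 2MB chunk size mentioned above, `chunkSizeBytes` would be `2 * 1024 * 1024`, assuming binary megabytes.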

5.4 Runtime entity

To be useful in the game, there must be a way to place the particle simulation in the game world. This is made possible by the implementation of a ParticleStreamEntity which has a world space transform. This world space transform is used to transform all the particles in the stream from object space to world space.

5.5 Streaming

The amount of memory required to keep all the particles for the whole simulation in main memory is too large and some sort of streaming is needed. The above mentioned chunks are used for this streaming.

5.5.1 Streaming Cache

The particle simulation is then created at runtime and meta-data for it is read. This meta-data stores how many particles there are per frame on average in the imported file and how many streaming chunks there are. The meta-data is then used to determine how many chunks need to be kept in memory. For a simulation with a higher frame rate, more chunks naturally need to be kept in memory. The number of chunks needed also depends on the average number of particles in a frame. The chunk cache is allocated as a large array for maximum data locality.


The size of the cache is calculated as

$$ s = \frac{t \cdot fps \cdot T}{p} \qquad (5.1) $$

where s is the size of the cache expressed in streaming chunks, t is the target time to have in cache in seconds, fps is the frame rate of the simulation, T is the maximum number of particles in any frame in the simulation and p is the number of particles per streaming chunk.
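As a worked example of equation 5.1, the helper below computes the cache size and rounds up to whole chunks. The function name `cacheSizeInChunks` and the rounding are assumptions made for this sketch; the thesis does not state how fractional results are handled.

```cpp
#include <cmath>
#include <cstdint>

// Evaluates s = t * fps * T / p and rounds up to a whole number of chunks.
std::uint32_t cacheSizeInChunks(double targetSeconds,
                                double simulationFps,
                                std::uint32_t maxParticlesPerFrame,
                                std::uint32_t particlesPerChunk)
{
    const double chunks =
        targetSeconds * simulationFps * maxParticlesPerFrame / particlesPerChunk;
    return static_cast<std::uint32_t>(std::ceil(chunks));
}
```

Assuming 16 bytes per particle (four 32-bit floats) and 2MB chunks of 2 · 1024 · 1024 bytes, p = 131072. With t = 0.5 s, fps = 30 and T = 100000, the formula gives 1500000 / 131072 ≈ 11.4, so `cacheSizeInChunks(0.5, 30.0, 100000, 131072)` yields 12 chunks.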

Each chunk in the chunk cache has a state which can be one of loading, empty and ready. The streaming process is then controlled by a state machine described in section 5.5.2.

5.5.2 Streaming State Machine

A state machine (more precisely, a finite state machine) is a system that can, at any given time, be in exactly one of a pre-defined set of states as shown in figure 5.2.

The particle stream is updated once per frame. In this update, the streaming state machine is updated. The update goes through all chunks in the chunk cache and takes the appropriate action depending on their state.

If a chunk is currently loading, no action is taken. Chunks that have finished loading from disk are set to the ready state. To avoid putting too much load on the IO system, only one load is active at any given time. This means that whenever a finished load is detected, a new one can be started if needed. This request will always be created for the next chunk unless the previously loaded chunk is the last chunk in the animation. Also, when a chunk has been rendered completely, its state is set to empty, meaning it is ready to store a new chunk. In this way, the chunk cache acts as a ring buffer, which is illustrated in figures 5.3 and 5.5.
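The update described above can be sketched as a small state machine over the cache slots. The sketch is an assumption-laden illustration and not the Frostbite code: `ChunkState`, `StreamingCache`, `updateStreaming` and `markConsumed` are hypothetical names, and the disk request is replaced by a `loadFinished` flag supplied by the caller.

```cpp
#include <cstddef>
#include <vector>

enum class ChunkState { Empty, Loading, Ready };

struct StreamingCache {
    std::vector<ChunkState> slots;    // ring of cache slots (must not be empty)
    std::size_t nextChunkToLoad = 0;  // index of the next chunk in the animation
    std::size_t totalChunks = 0;      // number of chunks in the whole animation
    bool loadInFlight = false;        // at most one load is active at a time
};

// One update of the streaming state machine. loadFinished tells whether the
// pending disk read has completed since the previous update.
void updateStreaming(StreamingCache& cache, bool loadFinished)
{
    for (ChunkState& state : cache.slots) {
        if (state == ChunkState::Loading && loadFinished) {
            state = ChunkState::Ready;
            cache.loadInFlight = false;
        }
    }

    // Nothing more to do while a load is active or the animation is fully loaded.
    if (cache.loadInFlight || cache.nextChunkToLoad >= cache.totalChunks)
        return;

    // The next chunk goes into its slot in the ring, provided the slot is free.
    ChunkState& target = cache.slots[cache.nextChunkToLoad % cache.slots.size()];
    if (target == ChunkState::Empty) {
        target = ChunkState::Loading;  // a real system would issue the read here
        cache.loadInFlight = true;
        ++cache.nextChunkToLoad;
    }
}

// Called by the reader once a chunk has been consumed completely.
void markConsumed(StreamingCache& cache, std::size_t slot)
{
    cache.slots[slot] = ChunkState::Empty;
}
```

Mapping chunk k to slot k modulo the slot count is what makes the cache behave as a ring: a new load can only start once the reader has released the oldest slot.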


Figure 5.2: Figure illustrating a (finite) state machine with states and state transitions.

Figure 5.3: Chunks have three states - empty (E), loading (L) and ready (R). F_0, F_1 and F_x show frames 0, 1 and x.


5.5.3 Reading from the Cache

The cache is essentially a ring buffer. This means that there are methods to be able to transparently start reading from the start of the buffer when the end is reached as illustrated in figure 5.4.

Figure 5.4: A ring buffer is a standard array in memory but is conceptually treated as a ring.

The contents of the buffer are treated as raw bytes, just as the chunks. This means that particle information can be split over chunks, as illustrated in figure 5.5, and it can also be the case that one particle starts at the end of the ring buffer and the rest of the information is in a chunk placed at the beginning. To allow for this, a given frame is read from the ring buffer into an intermediate buffer where particle information is aligned to not be split in memory. The reading from the ring buffer into this intermediate buffer is handled by support routines. Example code for these support routines is given below.

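    void incrementAndWrap(u32 bytesToIncrement, u8* const chunkCache)
    {
        currReadPos += bytesToIncrement;

        // u8 is an 8-bit unsigned integer.
        // Assumption: the right-hand side is cut off in the source. The end
        // of the cache is taken to be chunkCount * CHUNK_SIZE_BYTE bytes past
        // its start, which is the cache size that safeRead uses.
        u8* const chunkCacheEnd = chunkCache + chunkCount * CHUNK_SIZE_BYTE;

        // Cross chunk border?
        if (currReadPos - chunkCache >=
            (activeChunkIndex + 1) * CHUNK_SIZE_BYTE)
        {
            // The chunk that was just left has been read completely.
            chunkStates[activeChunkIndex] = LoadingState_Empty;
            activeChunkIndex = (activeChunkIndex + 1) % chunkCount;
        }

        // Wrap around
        if (currReadPos >= chunkCacheEnd)
            currReadPos = chunkCache + (currReadPos - chunkCacheEnd);
    }

Figure 5.5: Particle information can be split over chunks. F_x and F_y show frames x and y.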

Below is the algorithm for handling reads from the cache into the intermediate buffer described above.

    void safeRead(void* dest, size_t size)
    {
        u8* const chunkCache = static_cast<u8* const>(m_chunkCache);
        u8* destPtr = static_cast<u8*>(dest);
        u32 bytesToRead = static_cast<u32>(size);

        while (bytesToRead > 0)
        {
            // Read a safe amount of bytes
            // s64 is a 64-bit signed integer
            s64 rem = max<s64>(chunkCount * CHUNK_SIZE_BYTE -
                               (currReadPos - chunkCache), 0);

            // u32 is a 32-bit unsigned integer
            u32 remainingBytesInCache = static_cast<u32>(rem);
            u32 bytesRead = min(remainingBytesInCache, bytesToRead);

            memoryCopy(destPtr, currReadPos, bytesRead);
            destPtr += bytesRead;
            incrementAndWrap(bytesRead, chunkCache);

            // Do we have bytes left to read?
            bytesToRead -= bytesRead;
        }
    }

5.6 Rendering

As with most particle systems, the particles are rendered as screen-aligned quads. Since the buffer at this stage contains information about particles in the form [pos.x pos.y pos.z scale] it contains everything needed to place the quads at the correct world space location.

To handle rendering of the particles, a separate ParticleStreamRenderer was created. This renderer is responsible for creating the needed GPU resources and copying particle buffers into their GPU counterparts. A buffer with fixed size is created for all particles and in the beginning of each frame the renderer calls the ParticleStream with the GPU buffer as an argument. The ParticleStream then copies the internal CPU buffer over to the GPU buffer sent in as an argument.

The particles are rendered with hardware instancing on the platforms where it is applicable. The instancing uses the per-instance data for each particle from the buffer mentioned above. The fixed quad is transformed according to the world space position and also scaled according to the embedded particle scale.

5.7 Sorting

Particle systems that use additive or multiplicative blending can be rendered in any order. There are, however, particle systems where an ordering needs to be imposed, and these require sorting. One reason to sort particles is visual correctness. When a non-commutative blending mode such as alpha blending is used, sorting is needed to ensure the correct order of operations: the particles must be blended in back-to-front order.

Rendering alpha-blended particles in the wrong order is extremely noticeable in motion, as the particle system loses all sense of shape. To be able to use any sorting algorithm, it has to be applied to the particle data.

5.7.1 Sorting using a key-index stream

One way to sort the particles is to produce a key-index pair for each particle. The key contains the value on which the sorting acts (the distance to the viewer) and the index simply points out the position of the particle in the particle stream.

The approach results in less bandwidth usage and better cache coherency since there is no need to fetch a lot of data.
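A key-index stream of this kind could be built and sorted as in the sketch below. The names `PackedParticle`, `KeyIndex` and `buildBackToFrontOrder` are assumptions, as are the use of the squared distance as key (it orders the particles the same way as the distance) and `std::sort` as the sorting algorithm.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct PackedParticle { float x, y, z, scale; };

struct KeyIndex {
    float         key;    // squared distance from the particle to the viewer
    std::uint32_t index;  // position of the particle in the particle stream
};

// Builds one key-index pair per particle and orders the pairs back to front.
// The particle data itself is neither moved nor modified.
std::vector<KeyIndex> buildBackToFrontOrder(const std::vector<PackedParticle>& particles,
                                            float camX, float camY, float camZ)
{
    std::vector<KeyIndex> entries(particles.size());
    for (std::uint32_t i = 0; i < particles.size(); ++i) {
        const float dx = particles[i].x - camX;
        const float dy = particles[i].y - camY;
        const float dz = particles[i].z - camZ;
        entries[i] = KeyIndex{ dx * dx + dy * dy + dz * dz, i };
    }

    // Farthest first, so that alpha blending composites back to front.
    std::sort(entries.begin(), entries.end(),
              [](const KeyIndex& a, const KeyIndex& b) { return a.key > b.key; });
    return entries;
}
```

Only the 8-byte pairs are moved during sorting; the renderer then visits the particles through `entries[k].index`.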


5.7.2 Sorting the Particle Stream directly

Another approach is to sort the particle stream directly without using a key-indexed stream.

The downside of this approach is that the sorting itself has to read and write the amount of data twice resulting in worse cache performance. The sorting metric (the distance) needs to be computed for every sorting pass rather than just doing it once per frame.

5.7.3 CPU sorting

The first implementation was a simple insertion-sort as a proof-of-concept. This sorting algorithm has bad time complexity characteristics so something faster is needed. A radix sort algorithm is what is currently used due to the suitability for a SIMD implementation as described in section 3.4.6.

GPU sorting could also be used but was out of the time scope for this implementation.
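For reference, a radix sort over key-index pairs can be sketched as below. This is a plain scalar least-significant-digit radix sort written for illustration, not the SIMD variant referred to above; `RadixEntry`, `distanceToKey` and `radixSort` are hypothetical names. It relies on the fact that non-negative IEEE-754 floats order the same way as their bit patterns read as unsigned integers.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct RadixEntry {
    std::uint32_t key;    // bit pattern of a non-negative distance
    std::uint32_t index;  // position of the particle in the particle stream
};

// Reinterprets a non-negative float as an integer key with the same ordering.
std::uint32_t distanceToKey(float distance)
{
    std::uint32_t bits;
    std::memcpy(&bits, &distance, sizeof bits);
    return bits;
}

// Stable LSD radix sort, ascending by key, one byte per pass (four passes).
void radixSort(std::vector<RadixEntry>& entries)
{
    std::vector<RadixEntry> scratch(entries.size());
    for (int shift = 0; shift < 32; shift += 8) {
        // Histogram of the current byte, stored one position to the right.
        std::array<std::size_t, 257> offsets{};
        for (const RadixEntry& e : entries)
            ++offsets[((e.key >> shift) & 0xFFu) + 1];

        // Prefix sum: offsets[b] becomes the first output position of bucket b.
        for (std::size_t i = 1; i < offsets.size(); ++i)
            offsets[i] += offsets[i - 1];

        // Scatter in input order, which keeps the sort stable.
        for (const RadixEntry& e : entries)
            scratch[offsets[(e.key >> shift) & 0xFFu]++] = e;

        entries.swap(scratch);
    }
}
```

The result is sorted front to back; iterating it in reverse gives the back-to-front order needed for alpha blending. The running time is linear in the number of particles, compared with the quadratic worst case of insertion sort.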


Chapter 6 Result

The result of the implementation is a pipeline that can import Alembic files containing pre-simulated particle data into Frostbite. Furthermore, a real-time free-streaming solution for runtime was implemented. A simple renderer was also implemented for the particle streams.

6.1 Pipeline results

The pipeline implementation reads the Alembic file frame by frame. Since there are no real-time requirements on the pipeline, it does not have to be very efficient, and the simpler solution was often chosen during this thesis. The pipeline is still very fast and performance only becomes a problem with very large datasets.

6.2 Runtime Results

The runtime implementation consisted of handling streaming of the particle stream chunks. This procedure is described in chapter 5. The solution to the streaming problem is the most advanced part of this work and works very well with good performance. Sorting and rendering the particles are also part of the runtime implementation. Sorting is currently implemented with a simple quicksort algorithm.

The rendering is implemented with instancing on supported platforms which gives good performance even with large amounts of objects. The rendering implementation is currently missing proper lighting and shading support due to time running out during the implementation phase. The rendering part of the implementation is also where the bottleneck of the implementation lies. This is since it is expensive to render alpha-blended geometry due to overdraw costs.

6.3 Workflow Results

The workflow for placing a particle stream in the world is straightforward. First, the particle stream data is exported in Alembic format from any authoring tool. This Alembic file is then imported into FrostEd (the Frostbite editor). When this import is done, the particle stream pipeline runs and stores the particle stream data in an internal format (described in chapter 5) inside the game database. The artists then place a special particle stream entity somewhere in the world and attach a particle stream asset to it. The particle stream entity can then be controlled by the visual scripting language in Frostbite to start, stop and in other ways control the playback of the particle stream. The particle stream is then streamed and played back at runtime.

This workflow is streamlined apart from the fact that the authoring of content happens outside Frostbite and the iteration times for making small changes can be large.

6.4 Screenshots

This section presents four screenshots (figures 6.1, 6.2, 6.3 and 6.4) from the result of the streaming implementation. These screenshots do not represent any finished lighting or shading.

The screenshots also show the debug rendering used to debug the streaming process. This debug rendering illustrates the size of the buffer and each chunk is shown as a block. The color of a block illustrates the status of the block and there is also a marker to show the current position that is read from in the buffer. Furthermore, some statistics on the current particle stream are shown.

Figure 6.1: Screen capture showing a particle stream simulation and the related debugging tools.

6.5 Performance

The particle stream pipeline performs well and the performance is largely dependent on the size of the data set. The particle stream pipeline currently does no compression which will be a problem for more practical use-cases where the amount of particles can be very large.

The runtime implementation also runs at good performance. The rendering is implemented with instancing on supported platforms and can handle a lot of particles since it never handles single particles. The streaming is also fast and the major cost for the runtime implementation is still sorting and rendering the particles in the stream.

Figure 6.2: Screen capture showing a particle stream simulation where time has passed since figure 6.1.

Figure 6.3: Screen capture showing a combined CinemaStream and Particle Stream simulation.

Figure 6.4: Screen capture showing a combined CinemaStream and Particle Stream simulation.

Experimentation with a GPU-based sorting algorithm was made, but due to shortage of time it was cancelled.

When it comes to memory usage, the runtime implementation has predictable memory overhead. This is since the size of the cache is static and the memory usage is thus also static. This characteristic is desirable for streaming buffers in general. However, even though the size of the streaming buffer is constant during streaming, it is not constant for all particle streams. There is logic for deciding the size of the particle streaming buffer (described in chapter 5).


Chapter 7 Discussion

To summarize, I think the implementation turned out well and uses good schemes for handling runtime streaming and rendering.

I am very satisfied with the simplicity and efficiency of the implementation. The ring buffer used for caching streamed chunks makes the performance overhead very predictable. Furthermore, since it contains heuristics for determining the size, the intent is that it should work well without manual tuning of buffer sizes.

The pipeline implementation is also simple and turned out to work well. It basically consists of reading an Alembic archive and then packing the data into runtime chunks for streaming. The simplicity of the implementation makes it easy to plug in compression at a later point.

The workflow is also a part that turned out to work well. I think this is mostly since the concepts are simple. A ParticleStream simulation is simply placed somewhere in the world as a game entity and everything in the simulation is then transformed in relation to the transform of the ParticleStream entity.

A pre-simulated particle system has a lot of advantages in game applications and the only real drawback is that there is no gameplay control of the particle system. This is quite a big drawback though, and it can be hard to fit a pre-simulated particle system into a massively dynamic world. The biggest advantage is performance, since a CPU-based particle system with gameplay elements is orders of magnitude heavier than playing back pre-simulated particles.

7.1 Future Improvements

To make ParticleStream production-ready, a few key improvements have to be made. Most notably, compression and support for more advanced lighting have to be implemented.

7.1.1 Compression

In the current implementation, the particle data is stored uncompressed in the Frostbite data storage. This is not sustainable for real-world use cases that use gigabytes of data. The idea is that the ParticleStream system would be integrated into the more general CinemaStream compression infrastructure in the future.

7.1.2 Lighting

What is obviously lacking in the implementation is proper lighting which had to be skipped due to time constraints. Proper lighting of the particles was not really the goal for this work. The work was instead focused around the problem of getting source data into the Frostbite runtime and to be able to utilize streaming efficiently. The goal was also to create an efficient implementation that could handle large amounts of data.

The particles are simply rendered as alpha-tested quads with regular shaders and support for artist-created custom shaders. One area in particular where the lighting has to be improved is the shadowing. This is since volume shading is a key aspect of high quality particle rendering.


Shadowing

The most obvious way to achieve shadowing for a particle system is to use volume rendering techniques. To be able to do this, it is required that the particles are converted into a discrete volume representation by being rasterized into a 3D volume (voxelization). Once the particle is represented as a volume, there are several ways to render it with shadowing.

Self-Shadowing

There has been a lot of research in this area. Some of the topics are Fourier Shadow Mapping, Half-Angle Slice Rendering, Opacity Shadow Maps, but none of them are scalable enough to use in large-scale in-game scenarios. They are, however, perfect for cut-scenes and contained environments.

Half-Angle Slice Rendering

The key idea of half-angle slice rendering is to calculate a vector which is half way between the light direction and the view direction. The volume is then rendered as a series of slices perpendicular to this half-angle vector.

The half-angle vector enables the possibility of rendering the same slices from both the light’s and camera’s point of view, since they will be facing towards both. As a result of this the shadowing can be accumulated from the light at the same time as the slices are being blended.

The main advantage of this technique is that it only requires a single 2D shadow buffer.
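The half-angle vector itself is inexpensive to compute, as the sketch below shows. The function names and the direction conventions (both vectors are unit length and point away from the volume) are assumptions, and the flip of the view direction when the light is behind the volume follows the common formulation of the technique rather than anything stated in this thesis.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static Vec3 normalize(Vec3 v)
{
    const float len = std::sqrt(dot(v, v));
    return Vec3{ v.x / len, v.y / len, v.z / len };
}

// viewDir points from the volume towards the camera and lightDir from the
// volume towards the light. Both are expected to be unit length.
Vec3 halfAngleVector(Vec3 viewDir, Vec3 lightDir)
{
    // When the light is on the far side of the volume, the inverted view
    // direction is used so that the slices still face both light and camera.
    if (dot(viewDir, lightDir) < 0.f)
        viewDir = Vec3{ -viewDir.x, -viewDir.y, -viewDir.z };

    return normalize(Vec3{ viewDir.x + lightDir.x,
                           viewDir.y + lightDir.y,
                           viewDir.z + lightDir.z });
}
```

The slices are then generated perpendicular to the returned vector. In the flipped case they are usually composited back to front instead of front to back.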

Opacity Shadow Maps

Opacity Shadow Maps sample visibility at regular intervals and there are numerous variants optimized to handle special cases such as hair. The method is also suitable for generation of self-shadows in discontinuous volumes with explicit geometry, such as fur and foliage, but continuous volumes such as smoke and clouds may also benefit from the approach.

With a set of planar opacity maps the light transmittance inside a complex volume is approximated.

A volume made of standard primitives (points, lines, and polygons) is sliced. The volume is then rendered with graphics hardware to each opacity map that stores alpha values instead of traditionally used depth values.

Each primitive point is enclosed by the alpha values sampled in the maps and then interpolated for shadow computation.

The algorithm is memory efficient and extensively exploits existing graphics hardware.


Bibliography

[1] Frostbite. http://www.frostbite.com, 2014. Accessed: 2014-01-18.

[2] A. Kolb, L. Latta, and C. Rezk-Salama. Hardware-based simulation and collision detection for large particle systems. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS ’04, pages 123–131, New York, NY, USA, 2004. ACM. ISBN 3-905673-15-0. doi: 10.1145/1058129.1058147. URL http://doi.acm.org/10.1145/1058129.1058147.

[3] Peter Kipfer, Mark Segal, and Rüdiger Westermann. Uberflow: A gpu-based particle engine. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS ’04, pages 115–122, New York, NY, USA, 2004. ACM. ISBN 3-905673-15-0. doi: 10.1145/1058129.1058146. URL http://doi.acm.org/10.1145/1058129.1058146.

[4] Andreas Kolb and Nicolas Cuntz. Dynamic particle coupling for gpu-based fluid simulation. In Proc. of the 18th Symposium on Simulation Technique, pages 722–727, 2005.

[5] Mark Kim, Guoning Chen, and Charles Hansen. Dynamic particle system for mesh extraction on the gpu. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, pages 38–46, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1233-2. doi: 10.1145/2159430.2159435. URL http://doi.acm.org/10.1145/2159430.2159435.

[6] Shannon Drone. Real-time particle systems on the gpu in dynamic environments. In ACM SIGGRAPH 2007 Courses, SIGGRAPH ’07, pages 80–96, New York, NY, USA, 2007. ACM. ISBN 978-1-4503-1823-5. doi: 10.1145/1281500.1281670. URL http://doi.acm.org/10.1145/1281500.1281670.

[7] Donald E. Knuth. The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998. ISBN 0-201-89685-0.

[8] W. A. Martin. Sorting. ACM Comput. Surv., 3(4):147–174, 1971. ISSN 0360-0300. doi: 10.1145/356593.356594. URL http://portal.acm.org/citation.cfm?id=356593.356594&coll=Portal&dl=GUIDE&CFID=89172762&CFTOKEN=95662085.

[9] Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu, and Toshio Nakatani. Aa-sort: A new parallel sorting algorithm for multi-core simd processors. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, PACT ’07, pages 189–198, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2944-5. doi: 10.1109/PACT.2007.12. URL http://dx.doi.org/10.1109/PACT.2007.12.

[10] Peter Sanders and Sebastian Winkel. Super scalar sample sort. In Susanne Albers and Tomasz Radzik, editors, ESA, volume 3221 of Lecture Notes in Computer Science, pages 784–796. Springer, 2004. ISBN 3-540-23025-4. URL http://dblp.uni-trier.de/db/conf/esa/esa2004.html#SandersW04.

[11] W. T. Reeves. Particle systems—a technique for modeling a class of fuzzy objects. ACM Trans. Graph., 2(2):91–108, April 1983. ISSN 0730-0301. doi: 10.1145/357318.357320. URL http://doi.acm.org/ 10.1145/357318.357320.

[12] Gpgpu.org - general purpose computation on graphics hardware. http://gpgpu.org, 2014. Accessed: 2014-01-18.

[14] Wikipedia - particle. http://en.wikipedia.org/wiki/Particle,