Volume rendering with Marching cubes and async compute


Bachelor of Science in Computer Science May 2019


Max Tlatlik


This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Bachelor of Science in Computer Science. The thesis is equivalent to 10 weeks of full time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information: Author(s):

Max Tlatlik

E-mail: matl15@student.bth.se

University advisor(s):

Lecturer Stefan Petersson, Associate Professor Hans Tap Department of DIDA

Faculty of Computing Internet : www.bth.se


Abstract

With the addition of the compute shader stage for GPGPU hardware it has become possible to run CPU-like programs on modern GPU hardware. The greatest benefit can be seen for algorithms that are highly parallel in nature, and in the case of volume rendering the Marching cubes algorithm makes for a great candidate due to its simplicity and parallel nature. For this thesis the Marching cubes algorithm was implemented on a compute shader and used in a DirectX 12 framework to determine if GPU frametime performance can be improved by executing the compute command queue in parallel with the graphics command queue. Results from performance benchmarks show that a gain is present for each benchmarked configuration and that the largest gains, up to 52%, are seen for smaller workloads. This information could therefore prove useful for game developers who want to improve framerates or decrease development time, but also in other fields such as volume rendering for medical images.

Keywords: Volume rendering, async compute, multi-engine.


Chapter 1 Introduction

This first chapter introduces the reader to the topic of volumetric rendering, the history and background of the topic and its relevance for game development. A problem description is also presented, with suggestions on how to solve problems related to the topic, and the reader is presented with the question this thesis tries to answer. Finally the research methodology used to answer the question is presented.

The introduction is followed by a chapter covering relevant theory about volumetric rendering, the algorithm used for volume rendering with the framework, important notes about compute shaders and why they were used for the implementation in this thesis, and finally the concept of multi-engines in DirectX 12. In chapter 3 details about the implementation are presented that show how the theory is applied in practice and how the results for performance benchmarking are collected. In chapter 4 the results are presented and in chapter 5 they are discussed. Furthermore, in chapter 5 the implementation's problems and limitations, and terrain generation, are discussed and a summary of the discoveries is presented. The report is then concluded in chapter 6 with reflections and possible future work for this thesis.

1.1 Background

During game development one of the most time consuming development processes is the creation of assets such as 3D models, world sculpting, texturing, etc. In some cases smaller game development studios do not even have the necessary funds for proper asset creation. To speed up content creation processes developers can make use of procedural mesh generation. Such content creation tools can quickly create meshes that may only need some polishing, thus greatly increasing the speed of the development process. While this is great for offline content creation, such tools can also be adapted to be used online. It is then important that the procedural mesh generation runs in real-time at an acceptable framerate. Many algorithms for procedural terrain generation can create meshes on the CPU based on height maps with some noise function such as Perlin noise, Value noise or Worley noise [13]. However, a high complexity terrain is then usually not created during playable runtime but rather during world loading. Terrain generation at high complexity, which can include terrain features such as caves and overhangs, is best suited for the GPU since it is a highly parallel task and the GPU is better suited for parallel processing than the CPU [11].


Utilizing the GPU for procedural terrain generation can accelerate development speed and generate more complex worlds than CPU based heightmap solutions. Minecraft has procedural world generation based on voxels; while it has simplistic graphics, it creates unique experiences for each new world. To generate the game's world Minecraft makes use of a volume rendering technique. There are many different algorithms available for volume rendering and one particular algorithm, Marching cubes (MC), presented by Lorensen and Cline (1987) [10], has become very popular for volumetric mesh generation because it is fast to implement on the CPU and generates high-resolution meshes given large enough volumes. The common usage of the MC algorithm is 3D visualization of volume data from magnetic resonance imaging and computed tomography scans [10, 16], but it has nowadays also been adapted for use in real-time interactive applications, such as video games, due to the advancements made in GPU hardware and the possibility to run CPU-like programs on a general purpose GPU (GPGPU) with compute shaders. Marching cubes is by its nature well suited for parallel execution since it works on independent individual voxels, and it will therefore benefit greatly from an implementation that runs on the GPU. This is explained in greater detail in chapter 2.

1.2 Problem description

One major problem with volume rendering is the fact that the space complexity is $O(n^3)$, which heavily impacts performance and memory consumption as the volume increases. To be able to render the volume at an acceptable interactive framerate, one must make use of algorithms, such as Marching cubes [10], that can discard polygons of no interest during generation. There are a few other algorithms that can be utilized instead, such as Cubical Marching Squares [7], surface nets [6] and HistoPyramids [9], and while these advanced algorithms can perform better than MC they are also more difficult to implement. During game development the time for developing algorithms or creating assets is often limited, and therefore a simple, yet good enough, solution might be more desirable if high performance is not a crucial criterion. This makes Marching cubes a good candidate for problems such as procedural terrain generation. One aim of this thesis is to present the reader with viable modern options for procedural terrain generation techniques that can be used in interactive applications, such as games, on general purpose GPUs. Hopefully this will help developers gain insight into which volume rendering technique to use for their problem. The second aim of this thesis is to answer this question:

What are the differences in GPU frametime performance with marching cubes mesh generation and mesh rendering for sequential and parallel compute execution?


1.3 Scope

Volume rendering has a long history and a lot of work has been done on the topic, especially in the medical and computer graphics areas. The catalogue of relevant information is therefore expansive, which implies that a major part of this thesis could be spent on information gathering and literature analysis. For this reason the focus is narrowed down to one algorithm for volume rendering, the Marching cubes algorithm, which is implemented for the project.

The practical part of this thesis is focused on examining the difference in GPU frametime performance when the Marching cubes algorithm is executed sequentially and in parallel to mesh rendering. The goal is to push the algorithm to its extremes, and overhead is therefore kept minimal by implementing the algorithm in a very simple and light rendering framework. The framework is custom built with C++ and uses the DirectX 12 API for this purpose. The test environment is also limited to one particular machine with modern hardware.

1.4 Research methodology

To be able to show the reader which volume rendering techniques are of particular interest, a literature analysis of relevant articles is done to gain knowledge about procedural mesh generation and to get a good understanding of its application in real-world scenarios.


Chapter 2 Theory

This chapter will present the reader with theory about volume rendering, the applied isosurface-extraction algorithm in the framework, general purpose GPUs and compute shaders and lastly the concept of asynchronous compute execution with DirectX 12.

2.1 Volume rendering

Volume rendering is a 3D rendering technique which takes a sampled 3D data set, usually in the form of 3D scalar fields, as input and outputs a 2D projection to display on the screen. The field of volumetric visualization can be categorized into three domains [3]: slicing, direct volume rendering (DVR) and indirect volume rendering (IVR). Slicing visualizes 2D cuts from the 3D volume by mapping the data to colors or by creating contour lines from the data. The 2D cuts are usually parallel to one of the coordinate planes. Rendering slices is therefore simply displaying the colored planes. With DVR the data points are based on laws of physics and treated as semi-transparent light emitting sources with emission, absorption and scatter properties. There are two common projection methods for DVR, forward and backward. The backward projection method uses image space algorithms where ray casting is performed per pixel. Only primary rays are cast and each ray accumulates color and opacity data along its way through the volume. The forward projection method uses object space order algorithms where cells are projected onto a 2D surface for display; some commonly used techniques are incremental slicing [12], splatting [17] and shear-warping [4]. Direct volume rendering with ray casting produces accurate high quality images, although at a high computational cost, and it is therefore not well suited for applications with an interactive frame rate. Indirect volume rendering is better suited for such problems but at the loss of accuracy, since most algorithms for IVR make approximations and in the worst cases can produce false positives. Indirect volume rendering converts volume data to an intermediate representation, usually surfaces, that can be used in traditional rendering techniques such as rasterization. See figure 2.1 for an example of a rendered image with the application of slicing, DVR and IVR techniques.


Figure 2.1: Applied volume rendering techniques [15]

There are several algorithms available for isosurface extraction of volume data and the most commonly used one is Marching cubes [10] due to its relatively simple implementation and acceptable results. For this thesis Marching cubes is implemented on compute shaders to generate terrain and for stress testing; the algorithm is explained in detail in the following section. Other isosurface-extraction algorithms that can be used for terrain generation are Contour tracing [8], which finds isosurfaces from 2D contours, and Marching tetrahedron [14], which is similar to Marching cubes but also works on unstructured grids by tessellating space into tetrahedrons.

The 3D data set for volume rendering can be acquired through various means depending on the use case. In medical imaging the 3D data set can be built from a collection of 2D slice images captured by computed tomography or magnetic resonance imaging scanners. These 3D data sets have certain regularities, such as the 2D image width and height in pixels and the depth distance between captured slices. 3D data sets with such regularities can be defined as regular grid volumes where each element is often referred to as a voxel. The individual scalar values can then be retrieved by sampling at the voxel's corner coordinates. For procedural terrain generation the 3D data set can be built by filling a regular grid volume or 3D texture with density scalar values generated by some density function. The applied density functions ultimately determine how the terrain will look later.


An isosurface is a surface that represents points of a constant value within a volume of space. It can therefore be defined as a function of 3D space where C is a constant, usually zero, and the prefix iso indicates that the function takes the same value over the whole surface:

$$F(x, y, z) = C$$

2.2 Marching cubes

This isosurface extraction algorithm, originally presented by Lorensen and Cline [10], works on regular grid volumes where the sampled data from voxel corners describe density values. Depending on the sampled data the algorithm can generate up to 5 triangles per voxel. The sampled density value indicates if the corner is outside or inside the solid volume. Negative values mean that the corner is outside and positive values mean that it is inside the solid volume; if the value is equal to the isovalue the corner is on the surface. In the case where all corners are positive the entire voxel is inside the solid volume, and if all corners are negative it is entirely outside; no triangles are generated in either case. In the case where two interconnected corners have one negative value and one positive value there is a surface point along the edge where the density is equal to the isovalue. This then becomes a binary case for each of the 8 corners, resulting in a total of $2^8 = 256$ cases. Two look-up tables are created, one that contains information about how many triangles to create for a given case and one that contains information about the interconnected corners for each case. Triangulation for all 256 cases is a possibility, but it can be reduced to 14 patterns by applying two different symmetries of the cube; see figure 2.2 for all patterns.


Figure 2.2: Triangulation patterns for the cube [10]

To access the correct case in the look-up tables an index value can be stored in a byte-sized variable, since the range of a byte is 0-255. To determine the case number the bits of the byte variable are set one per corner: if the corner is outside the respective bit is set to 0 and if it is inside it is set to 1. An example of this is shown in figure 2.3.
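The bit-setting step can be sketched in C++ as follows. This is a minimal CPU-side illustration, not the HLSL shader of the thesis; the function name `caseNumber`, the mapping of corner i to bit i, and the test `density > isovalue` for "inside" are assumptions made for the example.

```cpp
#include <array>
#include <cstdint>

// Builds the 8-bit Marching cubes case number from the eight corner densities.
// Assumptions: corner i maps to bit i, and a corner counts as inside when its
// density is greater than the isovalue.
std::uint8_t caseNumber(const std::array<float, 8>& density, float isovalue = 0.0f) {
    std::uint8_t index = 0;
    for (int corner = 0; corner < 8; ++corner) {
        if (density[corner] > isovalue) {
            // Inside the solid volume: set the bit that belongs to this corner.
            index = static_cast<std::uint8_t>(index | (1u << corner));
        }
    }
    return index;  // 0 = all corners outside, 255 = all corners inside
}
```

With corners 0, 4 and 6 inside and all other corners outside, this bit order gives 1 + 16 + 64 = 81.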


Figure 2.3: Voxel with vertices; white corners are outside and red corners are inside.

... be generated. A second look-up table is created that stores lists for how the triangles are connected between interconnected corners of the cube. The case number also acts as the index for the second table. Following the example from figure 2.3 the case number is 81 and accessing the second look-up table will then return the following lists:

Figure 2.4 illustrates where points are created along the given edges when the case number is 81 and how they are triangulated. The position of each point is determined by interpolation along the edge. The interpolation value is determined by the two density values of the corners that make up the edge and the isovalue, see equation 2.1.

$$\text{interpolation} = \frac{\text{isovalue} - D_1}{D_2 - D_1} \tag{2.1}$$

Figure 2.4: Triangulation of points with case number 81. Blue dots are interpolated positions along edges and the gray areas are the created triangles.
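Equation 2.1 can be applied to the two corner positions of an edge to obtain the vertex position. The sketch below is a hedged CPU-side illustration; the `Vec3` type and the function name `interpolateEdge` are hypothetical and not taken from the thesis code.

```cpp
struct Vec3 {
    float x, y, z;
};

// Returns the point on the edge p1-p2 where the density equals the isovalue.
// d1 and d2 are the densities sampled at p1 and p2. They must differ, which
// always holds for an edge that crosses the surface (one corner inside, one
// outside).
Vec3 interpolateEdge(const Vec3& p1, const Vec3& p2, float d1, float d2,
                     float isovalue = 0.0f) {
    const float t = (isovalue - d1) / (d2 - d1);  // equation 2.1
    return {p1.x + t * (p2.x - p1.x),
            p1.y + t * (p2.y - p1.y),
            p1.z + t * (p2.z - p1.z)};
}
```

For example, an edge from (0, 0, 0) to (1, 0, 0) with densities -1 and 3 and an isovalue of 0 gives an interpolation value of 0.25, so the vertex is placed at (0.25, 0, 0).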

... to the surface rather than the triangle. To determine the gradient vector of a corner, sample around the corner in each direction and subtract the density value in the negative direction from the value in the positive direction. This yields the gradient vector that is orthogonal to the surface at that point, see equation 2.2.

$$
\begin{aligned}
G_x(i, j, k) &= D(i + 1, j, k) - D(i - 1, j, k) \\
G_y(i, j, k) &= D(i, j + 1, k) - D(i, j - 1, k) \\
G_z(i, j, k) &= D(i, j, k + 1) - D(i, j, k - 1)
\end{aligned}
\tag{2.2}
$$
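As an illustration of equation 2.2, the following sketch computes the gradient on a small CPU-side grid. The `Volume` and `Gradient` types and the memory layout are assumptions for the example; in the thesis the densities live in a 3D texture on the GPU.

```cpp
#include <cstddef>
#include <vector>

struct Gradient {
    float x, y, z;
};

// Dense scalar volume with x varying fastest (hypothetical helper type).
struct Volume {
    int nx, ny, nz;
    std::vector<float> data;

    float at(int i, int j, int k) const {
        return data[static_cast<std::size_t>((k * ny + j) * nx + i)];
    }
};

// Gradient at an interior sample as in equation 2.2: value in the positive
// direction minus value in the negative direction, per axis.
// Precondition: 1 <= i < nx - 1, and likewise for j and k.
Gradient gradientAt(const Volume& v, int i, int j, int k) {
    return {v.at(i + 1, j, k) - v.at(i - 1, j, k),
            v.at(i, j + 1, k) - v.at(i, j - 1, k),
            v.at(i, j, k + 1) - v.at(i, j, k - 1)};
}
```

The result is not normalized; a shader would typically normalize it before using it as a shading normal.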

2.3 GPGPU and compute shaders

General-purpose GPUs allow execution of programs that were traditionally designed not for stream processors but for CPUs. This is possible with the programmable compute shader stage, which was added with the release of Shader Model 5. This stage enables computation on data which is unrelated to graphics by reading from and writing to buffers, usually in the form of textures, in parallel across the utilized stream processors. The greatest benefit can be seen in algorithms that are parallel by nature because the GPU has many more processing units (cores) than a CPU. However, these cores run at a lower frequency than CPU cores and it is therefore better to use the CPU for tasks that are mainly sequential.

A shader is a small program that can be executed on the GPU and it is written in a high level shader language such as OpenGL Shading Language (GLSL) or High Level Shading Language (HLSL). For this thesis shaders are written in HLSL since the implementation uses Microsoft's graphics API DirectX 12. Programs that are written for compute shaders are defined as kernels and invoked on threads. For good parallelization kernels should be kept small in size and behave independently. Threads are collected in thread groups and each thread group gets a core assigned to execute its threads on; multiple thread groups can be executed on the same core. Inside the compute shader program the number of threads per thread group is assigned with the attribute numthreads(X, Y, Z), and on the CPU the number of thread groups is given as in-parameter to the dispatch call, Dispatch(X, Y, Z). Both the number of threads per thread group and the number of thread groups are specified as three-dimensional values. The total number of threads is the product of the number of threads per thread group and the number of thread groups of the dispatch call, e.g. Dispatch(2, 2, 2) with numthreads(8, 8, 8) will result in 2*2*2 * 8*8*8 = 4096 threads. Each thread is assigned its own id, which is stored in a 3D vector. This makes it particularly useful in cases where the id can be used as an index to a 3D texture since no conversion has to be made, for example when reading density values. The range of the id is determined per axis by the product of the number of threads per thread group and the given number of thread groups, e.g. Dispatch(3, 5, 6) and numthreads(2, 2, 2) will result in a range of (0, 0, 0) - (5, 9, 11).

Depending on the GPU hardware and the computational cost and nature of the kernel it is important to find a good balance for thread distribution to maximize the parallel workload. In Shader Model 5.0 the maximum number of threads in a thread group is 1024, and in this thesis, for ease of implementation, the number of threads per thread group is chosen to be numthreads(8, 8, 8), which amounts to 512 threads per thread group. Having the thread count in each direction based on the same factor, 8, makes for easy uniform scaling with the Dispatch calls, ranging from 1 to 12 in each direction, for stress testing purposes.
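The thread arithmetic described above can be checked with a small C++ sketch. The type `UInt3` and the function names are hypothetical helpers written for this example; they are not part of HLSL or the DirectX 12 API.

```cpp
#include <cstdint>

struct UInt3 {
    std::uint32_t x, y, z;
};

// Total number of threads launched by Dispatch(groups) for a shader declared
// with numthreads(perGroup).
std::uint64_t totalThreads(UInt3 groups, UInt3 perGroup) {
    return std::uint64_t{groups.x} * groups.y * groups.z *
           perGroup.x * perGroup.y * perGroup.z;
}

// Largest thread id per axis; ids start at (0, 0, 0).
// Precondition: every component of groups and perGroup is at least 1.
UInt3 maxThreadId(UInt3 groups, UInt3 perGroup) {
    return {groups.x * perGroup.x - 1,
            groups.y * perGroup.y - 1,
            groups.z * perGroup.z - 1};
}
```

Calling `totalThreads({2, 2, 2}, {8, 8, 8})` reproduces the 4096 threads from the text, and `maxThreadId({3, 5, 6}, {2, 2, 2})` gives (5, 9, 11).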

2.4 Multi-engine with DirectX 12

With the release of DirectX 12 Microsoft has changed the way instructions, or commands, can be fed to the GPU compared to the submission model in DirectX 11. In DirectX 12 three different command queues can be deployed, one each for graphics, compute and copy commands. The graphics command queue accepts all kinds of commands, while the compute command queue accepts only compute and copy commands and the copy queue accepts only copy commands. This means that commands can be recorded on different command lists, preferably each on its own thread, and executed on different command queues. Command queues are then deployed on command processors, or engines [1]. Command processors are not actual hardware but rather an API construct that allows them to have their own queue(s), with the API synchronizing work by signaling to fences. This means that graphics, compute and copy dispatches can be run in parallel, each with its own command processor, or engine. This is illustrated by figure 2.5.


Figure 2.5: Multi-engine design [1]


Chapter 3 Implementation

This chapter will present the reader with an overview of the framework used for volume rendering and benchmarking, the graphical outcome of some density functions and the applied density function for benchmarking, how the vertex generation was implemented on compute shader, how the generated mesh was rendered and finally how the performance benchmarks were done.

3.1 Framework overview

In order to generate and render meshes created by the Marching cubes algorithm with an async compute approach, Microsoft's DirectX 12 API has been utilized to build a simple custom renderer that creates a graphics and a compute command queue to support multi-engine execution. Window and event management was implemented with the third party library SDL 2.0 and a simple user interface was implemented with the third party library ImGui. ImGui was only used for development purposes and was later excluded from benchmarking. Everything was implemented in C++ with Visual Studio 2017 and shaders were written in HLSL with Shader Model 5.

During initialization of the system a volume is created and defined as a regular grid volume in the form of a 3D texture. The dimensions of the volume are hard-coded and not configurable during runtime. The volume is filled once during the first frame with a compute shader that has some density function implemented. Since the goal of this thesis is to examine the difference in performance for sequential and parallel execution with the MC algorithm, a simple dual-buffer design was implemented. A buffer stores generated vertices that are later used for rendering the mesh. During sequential execution the same buffer that was used for mesh generation is used for rendering. During parallel execution the buffer that is used for mesh generation in the current frame is used for rendering in the next frame, while the buffer that was generated in the previous frame is rendered in the current frame. This is illustrated by figure 3.1.


Figure 3.1: Dual-buffer design
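The buffer roles of the dual-buffer design can be sketched as a small selection function. This is an illustrative sketch under assumptions: the two buffers are referred to by the indices 0 and 1, sequential execution always uses buffer 0, and the names `ExecutionMode`, `BufferRoles` and `bufferRoles` are hypothetical.

```cpp
enum class ExecutionMode { Sequential, Parallel };

struct BufferRoles {
    int generate;  // buffer written by the compute queue this frame
    int render;    // buffer read by the graphics queue this frame
};

// Selects which of the two vertex buffers is generated into and which is
// rendered from for a given frame number.
BufferRoles bufferRoles(ExecutionMode mode, unsigned frame) {
    if (mode == ExecutionMode::Sequential) {
        // Generation finishes before rendering starts, so one buffer suffices.
        return {0, 0};
    }
    // Parallel: render the buffer generated in the previous frame while the
    // other buffer is being filled, then swap roles every frame.
    const int generate = static_cast<int>(frame % 2);
    return {generate, 1 - generate};
}
```

In the parallel case the buffer generated in frame n is the one rendered in frame n + 1, so the two queues never touch the same buffer within a frame.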

Mesh generation is done every frame for benchmarking purposes. It is important to know beforehand how much memory needs to be allocated for each buffer to ensure that the whole volume is covered. This is solved by creating a single chunk that covers the entire volume. The number of voxels for that chunk depends on the dimensions of the volume. If the volume dimensions are X x Y x Z then the number of voxels for the chunk is (X-1) x (Y-1) x (Z-1). The last things to consider are that the maximum possible number of triangles for a given voxel is 5, and the size of the triangle structure in bytes. The buffer size is then given by equation 3.1.

$$\text{buffer size in bytes} = (X - 1) \cdot (Y - 1) \cdot (Z - 1) \cdot 5 \cdot \text{sizeof}(\text{Triangle}) \tag{3.1}$$

This means that the chunk dimensions are 1:1 with the volume dimensions during performance benchmarks.
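Equation 3.1 can be written as a short C++ helper. The thesis text does not list the layout of the Triangle structure, so the `Vertex` and `Triangle` types below (three vertices with a position and a normal each) are an assumption made purely for illustration, as is the function name.

```cpp
#include <cstddef>

// Hypothetical layout: the real Triangle structure is not shown in the text.
struct Vertex {
    float position[3];
    float normal[3];
};

struct Triangle {
    Vertex v[3];
};

// Worst-case vertex buffer size in bytes for a volume of X x Y x Z density
// samples (equation 3.1): at most 5 triangles for each of the
// (X-1)(Y-1)(Z-1) voxels. Precondition: X, Y and Z are at least 1.
std::size_t vertexBufferSizeBytes(std::size_t X, std::size_t Y, std::size_t Z,
                                  std::size_t triangleSize = sizeof(Triangle)) {
    return (X - 1) * (Y - 1) * (Z - 1) * 5 * triangleSize;
}
```

Under the assumed 72-byte triangle, a 96 x 96 x 96 volume would need 857375 * 5 * 72 = 308655000 bytes per buffer in the worst case; the actual figure depends on the real structure size.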

3.2 Density functions

... by a simplex noise function and conditions based on the thread id's Y value. The simplex noise function takes the thread id, which is equivalent to a position in world space, as input and returns a floating-point value.

Figure 3.2: Generated terrain with simplex noise. Volume dimensions (64,8,64). Flat shading.

The generated terrain is very simple and lacks complex features such as overhangs or caves. In figure 3.3 a closed cave has been created by a density function that sets boundaries to create walls, a ceiling and ground. Density values inside the boundaries are generated by the simplex noise function with some modifications to the generated value to create more open areas.


Figure 3.3: Generated cave with simplex noise. Volume dimensions (64,16,64). Flat shading.


3.3 Vertex generation

The Marching cubes algorithm described in section 2.2 is implemented on a compute shader with Shader Model 5. The first step is to sample the eight corner density values that make up a voxel from the density 3D texture. As previously stated the dimensions of the chunk are 1:1 with the volume dimensions, and it is therefore important not to create any new voxels from the outermost density values. Unfortunately, this means that unnecessary threads are created. It is intentionally implemented this way since it makes it very easy to load data directly from the 3D texture: the same thread id that was used to write to the 3D texture can be used in this shader by calling the Load(int4 index) function on the 3D texture. No sampler object needs to be created. Once the corner density values are loaded the case number can be constructed with bit shifting, see figure 3.5 for the code implementation.

Figure 3.5: HLSL code implementation of the first step of the MC algorithm.

The case number is used to index the look-up table case_to_numpolys, which returns the number of polygons to be created for this case; the value ranges from 0 to 5. All look-up tables are stored in the file MCTables.hlsli, which is included in the shader with a preprocessor directive. If the number of polygons is greater than 0 the case number is used again in a second look-up table, edge_connect_list, which returns the edge number. The edge number is used to determine which corners are to be used for creating the new vertex point. Figure 3.6 shows the code for how this logic is implemented in HLSL on the compute shader.


Figure 3.6: HLSL code implementation of the second step of the MC algorithm, vertex generation.

... the vertex position. It is important to note that vertices are not pushed directly to the vertex buffer but are first stored in the Triangle structure. This ensures that the ordering of the vertices is correct so that the triangles can later be rendered correctly with triangle list as the primitive topology.

To push triangles to the vertex buffer the IncrementCounter() function can be used, which increments the hidden counter of the buffer resource and returns the value of the counter before it was incremented. This value can be used as an index for writing to the buffer. The function also internally manages the atomic add operations which are needed for thread synchronization of the shared resource. However, this implies that as the number of threads increases more thread congestion will arise, since all threads share the same buffer.
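The counter-based append can be mimicked on the CPU with an atomic fetch-and-add. The sketch below is only an analogue of the GPU behavior described above, not the HLSL code of the thesis; the class `AppendBuffer`, its bounds check and the placeholder `Triangle` layout are assumptions for the example.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Triangle {
    float v[9];  // three positions, placeholder layout for this example
};

// CPU analogue of a structured buffer with a hidden counter.
class AppendBuffer {
public:
    explicit AppendBuffer(std::size_t capacity) : storage_(capacity), counter_(0) {}

    // Like IncrementCounter(): atomically adds one and returns the old value.
    std::uint32_t incrementCounter() {
        return counter_.fetch_add(1, std::memory_order_relaxed);
    }

    // Writes the triangle to the slot reserved by the counter.
    // Returns false if the buffer is already full.
    bool append(const Triangle& t) {
        const std::uint32_t slot = incrementCounter();
        if (slot >= storage_.size()) return false;
        storage_[slot] = t;
        return true;
    }

    // Number of triangles stored, clamped to the capacity.
    std::size_t count() const {
        return std::min<std::size_t>(counter_.load(), storage_.size());
    }

private:
    std::vector<Triangle> storage_;
    std::atomic<std::uint32_t> counter_;
};
```

Because every append goes through the same atomic counter, each caller receives a unique slot, but all callers also contend for that one counter, which mirrors the congestion noted above.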

3.4 Rendering

It is also important to ensure that the counter resource can be read back to the CPU to retrieve the counter value. Once the counter is retrieved a Vertex Buffer View can be created with the counter and the buffer resource and be used for rendering. The shaders for rendering are basic vertex and pixel shaders. The vertex shader transforms the vertices with a view and projection matrix and then sends them on to the pixel shader stage. The pixel shader's illumination model is simple Lambertian diffuse shading with a set directional light and added ambient light. Depending on the normal that has been created during vertex generation, flat shading or soft shading can be achieved, see figure 3.7 for a comparison.

Figure 3.7: Generated sphere with flat and soft shading
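The illumination model can be sketched as a scalar intensity computation. The pixel shader itself is not shown in the text, so this C++ version is an assumption-laden illustration: the `Float3` type, the convention that `toLight` points from the surface towards the light, and the clamping of the result to 1 are all choices made for the example.

```cpp
#include <algorithm>
#include <cmath>

struct Float3 {
    float x, y, z;
};

float dot(const Float3& a, const Float3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Precondition: v is not the zero vector.
Float3 normalize(const Float3& v) {
    const float len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

// Lambertian diffuse term for a directional light plus a constant ambient
// term. Surfaces facing away from the light receive only the ambient term.
float lambert(const Float3& normal, const Float3& toLight, float ambient) {
    const float nDotL = dot(normalize(normal), normalize(toLight));
    return std::min(1.0f, ambient + std::max(0.0f, nDotL));
}
```

Passing one normal per triangle gives flat shading, while passing a per-vertex normal such as the density gradient gives soft shading.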

3.5 Performance benchmarking

To investigate the differences in performance for sequential and parallel execution of the multi-engine framework for mesh generation with Marching cubes, a set of volume configurations is measured. Only the GPU frame is measured, nothing on the CPU. The set is shown in table 3.1.

Configuration       1     2     3     4     5     6     7     8     9     10    11    12
Volume dimensions   8^3   16^3  24^3  32^3  40^3  48^3  56^3  64^3  72^3  80^3  88^3  96^3
Chunk dimensions    7^3   15^3  23^3  31^3  39^3  47^3  55^3  63^3  71^3  79^3  87^3  95^3

Table 3.1: Set of configurations

The table shows that the dimensions increase uniformly and linearly for each increment in configuration. A total of 12 configurations are benchmarked and the maximum number of voxels for a configuration is 95*95*95 = 857375. This range is assumed to be large enough to see a pattern in how the performance changes as the volume increases.
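The sizes in table 3.1 follow directly from the configuration number. The helper functions below are a small sketch for reproducing them; the function names are hypothetical and not taken from the framework.

```cpp
#include <cstdint>

// Density samples per axis for configuration n (1 to 12 in table 3.1).
inline std::uint64_t volumeDim(std::uint64_t n) {
    return 8 * n;
}

// Total number of density samples in the volume.
inline std::uint64_t sampleCount(std::uint64_t n) {
    const std::uint64_t d = volumeDim(n);
    return d * d * d;
}

// Total number of voxels in the chunk, one less than the volume per axis.
inline std::uint64_t voxelCount(std::uint64_t n) {
    const std::uint64_t d = volumeDim(n) - 1;
    return d * d * d;
}
```

For configuration 12 this gives 96^3 = 884736 density samples and 95^3 = 857375 voxels, and for configuration 8 it gives 63^3 = 250047 voxels.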

... the average frametime for each configuration. To ensure that the results from the profiling tool are valid, internal timestamp queries have also been taken. Timestamp queries are taken for the relevant dispatch and draw calls. During development of the framework it was shown that some latency was added to the measurements by the external profiling tool; however, the added latency was insignificant, in the range of hundredths of a millisecond. The external profiling tool is used due to its very detailed profiling capabilities. The same could be achieved by implementing a competent GPU profiler in the framework, but that would require a considerable amount of development time, which is out of scope for this project.
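Timestamp queries return raw GPU tick counts, which have to be converted with the queue's timestamp frequency (in Direct3D 12 this is reported in ticks per second by ID3D12CommandQueue::GetTimestampFrequency). The conversion itself is plain arithmetic and can be sketched as follows; the function name and parameters are hypothetical, and how the framework reads back its queries is not shown in the text.

```cpp
#include <cstdint>

// Converts a pair of GPU timestamps to elapsed milliseconds.
// ticksPerSecond is the timestamp frequency of the queue the queries were
// recorded on. Precondition: endTicks >= beginTicks and ticksPerSecond > 0.
double ticksToMilliseconds(std::uint64_t beginTicks, std::uint64_t endTicks,
                           std::uint64_t ticksPerSecond) {
    return static_cast<double>(endTicks - beginTicks) * 1000.0 /
           static_cast<double>(ticksPerSecond);
}
```

For example, 2000 elapsed ticks at a frequency of 1000000 ticks per second correspond to 2 ms.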


Chapter 4 Results

In this chapter results are presented for the measurements of the density volume generation time with compute shaders, the mesh generation time with the Marching cubes implementation on compute shaders, the draw time of the generated meshes and finally the performance differences for synchronous and asynchronous execution of the GPU workload. All the measurements were taken by the external profiling tool.

4.1 Density volume creation time

The density volume is filled once during the first frame and therefore has no further impact on the performance of future frames, but it is still of interest to see how long it actually takes to fill the volume with density values generated by the simplex noise function and how the execution time scales as the volume increases in size. Figure 4.1 shows that filling the volume is very fast even at the largest configuration, at just below 70 microseconds. It also shows that the creation time grows cubically with the configuration number, which is to be expected since the number of density values in the volume grows cubically, see equation 4.1 where N is the configuration number.

$$\text{number of floats} = (8 \cdot N)^3 \tag{4.1}$$

4.2 Mesh generation time

The mesh is generated each frame and measurements for both sequential and parallel compute execution are taken. Figure 4.2 shows that the mesh generation time grows steeply and non-linearly as the chunk size increases, which is consistent with the behavior shown in figure 4.1. The mesh generation times for sequential and parallel execution are about equal, which is to be expected since in both cases the same shader is run for the same configurations, generating the same meshes. In figure 4.2 the yellow dotted line indicates the threshold for what can be considered an acceptable frametime for real-time interactive applications, which is at around 33 ms. At the voxel count of 250047 the number of generated triangles was over 83 thousand, which resulted in an execution time slightly under the threshold; for the increasingly larger chunk dimensions the threshold was exceeded.


Figure 4.1: The density volume generation time for all 12 configurations


4.3 Mesh draw time

It is of interest to see the draw time for the generated meshes since this is also part of the GPU frame. Figure 4.3 shows that the draw time is about equal for both cases and that the draw time also grows non-linearly with the chunk size, the same behavior as seen in figures 4.1 and 4.2. Note that even at the largest chunk size, which generated more than 2.7 million triangles, the draw time is still very low at 1.6 milliseconds.


4.4 Performance benchmark results

Measurements of the GPU frame time show that the difference between sequential and parallel compute execution is largest for the smallest chunk size and that the difference gets significantly smaller as the chunk size increases. It is however important to note that for every configuration the approach with parallel compute execution yielded better frame times. See figure 4.4 for the frame time measurements. The yellow dotted line in figure 4.4 indicates the same real-time application frametime threshold as previously seen in figure 4.2.

Figure 4.4: GPU frame time for synchronous and asynchronous execution


Figure 4.5: The GPU frame idle percentage of synchronous and asynchronous execution for all 12 configurations.


Chapter 5 Discussion

In this chapter the benefits of asynchronous compute execution are discussed with regard to the results from the previous chapter. A short discussion of the implementation's complexity and limitations follows, and after that a discussion of the implications this thesis has for procedural terrain generation at an interactive framerate. Lastly the chapter is concluded by a short summary.

5.1 The benefits of asynchronous compute execution

In chapter 4 the performance gain shown in figure 4.6 clearly indicates that the greatest benefit is found in cases where the computational workload is rather small. As the workload increases the gain decreases, because the workload becomes a much greater part of the frame than the synchronization time. It is important to note that while the execution of the command queues is asynchronous on the GPU, the application is not. It still needs to synchronize with the compute queue so that the buffer is ready for rendering in the next frame. It is possible to simply continue rendering the current buffer and swap once the compute queue has finished execution, which would remove the data dependency and thus implement a truly asynchronous framework. This approach would however make the measurements with the profiling tool unreliable, since the compute queue execution could be spread over multiple presented frames. Another benefit is shown by figure 4.5, which indicates that GPU frame utilization is higher and that asynchronous execution therefore increases frame efficiency.
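The alternative of continuing to render the current buffer and swapping once the compute work has finished can be sketched as a small piece of bookkeeping. The C++ code below is a CPU-side model only and is not the thesis implementation. `FakeFence` is a hypothetical stand-in for a GPU fence whose completed value would in Direct3D 12 be read with `ID3D12Fence::GetCompletedValue`, and `MeshSwapper` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a GPU fence: a counter that the compute
// queue advances when it has finished writing a mesh buffer.
struct FakeFence {
    std::uint64_t completed = 0;
};

// Tracks which of two mesh buffers is drawn and which one is written.
struct MeshSwapper {
    int drawIndex = 0;               // buffer read by the graphics queue
    std::uint64_t pendingValue = 0;  // fence value that marks the write as done
    bool pending = false;            // true while compute work is in flight

    // Index of the buffer the compute queue may write to.
    int writeIndex() const { return 1 - drawIndex; }

    // Record that compute work targeting writeIndex() was submitted and
    // will signal `fenceValue` on completion.
    void submit(std::uint64_t fenceValue) {
        assert(!pending && "previous compute work must be consumed first");
        pendingValue = fenceValue;
        pending = true;
    }

    // Called once per frame before drawing. Swaps only if the compute
    // work has finished, otherwise the current buffer keeps being drawn.
    bool trySwap(const FakeFence& fence) {
        if (!pending || fence.completed < pendingValue) {
            return false;
        }
        drawIndex = writeIndex();
        pending = false;
        return true;
    }
};
```

Because `trySwap` never blocks, the mesh generation may take several presented frames before the swap happens, which is the reason given above for why per-frame profiling would become unreliable with this design.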

5.2 Implementation complexity and limitations

Since the framework for this project was built from scratch with a custom renderer, with two external libraries included for easier window and user interface management, the application is small in size. Keeping the application small has helped to focus on the implementation of the problem and on setting up the testing configurations. It was also of interest to keep the implementation low level and focused to guarantee that no performance was lost unnecessarily. This is why ImGui was excluded later on during performance benchmarking, since rendering the user interface would add draw calls to the frame that are unrelated to the problem. This limits the application to performance benchmarking only. While you can traverse the rendered world with mouse and keyboard input, there is nothing that can be interacted with. The application is also limited to just one chunk object per frame, since only a single chunk is needed to cover the entire volume. The framework also does not support texturing for the generated meshes.

5.3 Terrain generation at an interactive framerate

In order to generate terrain at an interactive framerate while traversing the world, the framerate should not drop below a certain threshold. Chapter 4 shows the mesh generation time and draw time for various chunk sizes, and this information could be used for frame budgeting. Note that the application of this project generates a mesh each frame for benchmarking purposes, while a more real-world application would not need to generate the same mesh each frame. In the case of a very large volume it is also possible to increase performance by partitioning the volume into many chunks. Only relevant chunks are then used to generate the terrain, and as the user traverses the world, chunks that may be needed in the future can be generated in the background with asynchronous compute. Performance can be further increased by various optimization techniques such as octree data structures, frustum culling and level-of-detail systems that dynamically change the chunk dimensions. The Marching cubes implementation can also be optimized by adding multiple passes to decrease the memory consumption. An improved implementation is presented by Ludwig Pethrus Engström, which removes duplicated vertices and creates triangles with the addition of index buffers [5].
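The idea of partitioning the volume and only using relevant chunks can be illustrated with a simple selection around the camera. The C++ sketch below is one possible approach under stated assumptions: chunks are cubic with a fixed world-space size, and relevance is decided by a fixed radius measured in chunks. The type `ChunkCoord` and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Integer coordinate of a chunk in a regular grid of cubic chunks.
struct ChunkCoord {
    int x;
    int y;
    int z;
};

// Chunk index along one axis for a world-space position.
// std::floor keeps negative positions in the correct chunk.
inline int chunkIndex(float position, float chunkSize) {
    assert(chunkSize > 0.0f);
    return static_cast<int>(std::floor(position / chunkSize));
}

// All chunk coordinates within `radius` chunks of the chunk that
// contains the camera, measured per axis.
inline std::vector<ChunkCoord> chunksAround(float camX, float camY, float camZ,
                                            float chunkSize, int radius) {
    assert(radius >= 0);
    const int cx = chunkIndex(camX, chunkSize);
    const int cy = chunkIndex(camY, chunkSize);
    const int cz = chunkIndex(camZ, chunkSize);

    std::vector<ChunkCoord> result;
    for (int dz = -radius; dz <= radius; ++dz) {
        for (int dy = -radius; dy <= radius; ++dy) {
            for (int dx = -radius; dx <= radius; ++dx) {
                result.push_back(ChunkCoord{cx + dx, cy + dy, cz + dz});
            }
        }
    }
    return result;
}
```

A radius of 1 selects 27 chunks. Chunks that enter the selection when the camera moves would be candidates for background generation on the compute queue, and frustum culling or level-of-detail could reduce the set further.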

5.4 Summary


Chapter 6 Conclusion

To answer the research question, "What are the differences in GPU frametime performance with marching cubes mesh generation and mesh rendering for sequential and parallel compute execution?", the results from the project have shown that by using the multi-engine capabilities of DirectX 12 a GPU frametime performance increase was present when the Marching cubes mesh generation was executed in parallel with mesh rendering, in comparison to the sequential approach.

This information is useful not only for game developers to increase framerate or decrease development time but also for speeding up visualization of medical data. Such data sets can be very large and by partitioning the volume and generating chunks asynchronously on powerful GPU hardware it could be possible to traverse the human body at an interactive framerate.

6.1 Reflections

The implementation of the Marching cubes algorithm on the compute shader has been rather straightforward. Conceptually the algorithm is simple and it can generate interesting terrain, but the results can be pretty blocky depending on the volume resolution. As someone who had no experience with volume rendering techniques prior to this thesis, I can strongly recommend Marching cubes to others who are new to volume rendering.

The implemented dual-buffer design to support the profiling of the asynchronous compute execution was simple and could be quickly implemented, but at the cost of doubling the memory consumption. In any other application where profiling with an external tool is not crucial, a simple single-buffer design is still enough to make use of async compute.
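The memory cost of a second buffer can be estimated with a worst-case calculation. The standard Marching cubes triangle table emits at most five triangles, that is 15 vertices, per cell. The C++ sketch below uses this bound. Whether the thesis framework allocates worst-case sized buffers is not stated in this section, and the function name and the vertex size used in the example are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on mesh buffer memory in bytes for non-indexed triangles.
// Marching cubes emits at most 5 triangles (15 vertices) per cell.
// Hypothetical helper, not part of the thesis framework.
inline std::uint64_t worstCaseMeshBytes(std::uint64_t cellCount,
                                        std::uint64_t bytesPerVertex,
                                        std::uint64_t bufferCount) {
    assert(bufferCount >= 1);
    const std::uint64_t maxVerticesPerCell = 15;
    return cellCount * maxVerticesPerCell * bytesPerVertex * bufferCount;
}
```

With an assumed vertex of 24 bytes (position and normal as three floats each), 1000 cells need at most 360000 bytes for a single buffer and 720000 bytes for the dual-buffer design.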

As the results have shown that the biggest performance gain is seen for small workloads on the GTX 1080 GPU, similar results should be seen for other types of workloads that are equally demanding. I believe that workloads such as particle effects could therefore also benefit greatly from asynchronous compute execution.

6.2 Future work

A future implementation can be made that supports partitioning of the volume with optimization techniques. This could potentially allow for much larger volumes as


References

[1] https://docs.microsoft.com/en-us/windows/desktop/direct3d12/user-mode-heap-synchronization. [Online; accessed 15-May-2019].

[2] https://gpuopen.com/gaming-product/nbody-directx-12-async-compute-edition. [Online; accessed 15-May-2019].

[3] Ken Brodlie and Jason Wood. Recent advances in volume visualization. Computer Graphics Forum, 20(2):125–148, 2001.

[4] Rashmi Dubey, Sarika Jain, and R. S. Jadon. Volume rendering: A compelling approach to enhance the rendering methodology. pages 712–717. IEEE, 2016.

[5] Ludwig Pethrus Engström. Volumetric terrain genereation on the gpu, 2015.

[6] Sarah F. F. Gibson. Constrained elastic surface nets: Generating smooth surfaces from binary segmented data. volume 1496, pages 888–898, 1998.

[7] Chien-Chang Ho, Fu-Che Wu, Bing-Yu Chen, Yung-Yu Chuang, and Ming Ouhyoung. Cubical marching squares: Adaptive feature preserving surface extraction from volume data. Computer Graphics Forum, 24(3):537–545, 2005.

[8] Tatsuya Ishige. Contour tracing for geographical digital data. Cogent Geoscience, 3(1), 2017.

[9] Kristoffer Lindström. Performance of marching cubes using directx compute shaders compared to using histopyramids, 2011.

[10] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, 1987.

[11] Hubert Nguyen. GPU Gems 3, chapter 1, pages 67–111. Addison-Wesley Professional, 1 edition, 2007.

[12] David Reed, Roni Yagel, Asish Law, Po-Wen Shin, and Naeem Shareef. Hardware assisted volume rendering of unstructured grids by incremental slicing. pages 55. IEEE Press, 1996.

[13] Thomas J. Rose and Anastasios G. Bakaoukas. Algorithms and approaches for procedural terrain generation - a brief review of current techniques. pages 1–2. IEEE, 2016.

[14] G. M. Treece, R. W. Prager, and A. H. Gee. Regularised marching tetrahedra: improved iso-surface extraction. Computers and Graphics, 23(4):583–598, 1999.

[15] Prof. Dr. Tino Weinkauf. Direct volume rendering [powerpoint presentation]. https://www.kth.se/social/files/565e35dff27654457fb84363/08_VolumeRendering.pdf, 2015. [Online; accessed 15-June-2019].

[16] Yongjie J. Zhang. Geometric Modeling and Mesh Generation from Scanned Images, volume 6, chapter 4, pages 145–149. Chapman and Hall/CRC, Boca Raton, 1 edition, 2018;2016;.
