Evaluation of Performance on Variable Rate Shading


Bachelor of Science in Digital Game Development June 2020


Jonathan Carrera Iseland, Leonard Grolleman

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden


This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Bachelor of Science in Digital Game Development.

The thesis is equivalent to 10 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:

Authors:

Jonathan Carrera Iseland
E-mail: joia15@student.bth.se

Leonard Grolleman
E-mail: legr15@student.bth.se

University advisor:

Stefan Petersson
Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57


Abstract

Background. Modern games are becoming more demanding on hardware, and new techniques are being developed to ease these demands. One such optimization technique is Variable Rate Shading (VRS), included in the DirectX 12 API. It allows developers to vary the shading quality across parts of the frame to improve performance. How efficient VRS is seems to vary, as different benchmark tests report different results. This is most likely because of the different scene environments used in the tests.

Objectives. To further expand the environments used in VRS benchmark tests, this study will focus on measuring and evaluating the performance of VRS in a lightweight environment that differs from the others.

Methods. The method consists of developing a lightweight Direct3D 12 application, implementing the VRS technique, and measuring performance. For a clear evaluation, several tests are conducted that measure frame time, frame rate, and draw call speed for the different VRS settings at various resolutions, each over 1000 iterations.

Results. By measuring the frame time, frame rate, and draw call speed with VRS, it was possible to collect the performance data presented in this study. The study presents the average performance using 1x1, 2x2, and 4x4 shading rates at 480p, 1080p, and 2160p resolution. The averages were compared between shading rates and resolutions to examine correlation and deviation. As anticipated, the results generally showed performance improvements when using VRS. However, some settings showed inconsistent deviations between shading rates, and others showed impaired performance.

Conclusions. The conclusion drawn from this study is that VRS improves performance even in lightweight applications, within reasonable boundaries. However, the performance gain was smaller than in other benchmark tests. This suggests that VRS is more useful in more demanding environments.

Keywords: DirectX, Variable Rate Shading, Performance, Render, Benchmark.


Acknowledgments

Special thanks to our supervisor Stefan Petersson, for giving us the idea to work with this subject, for giving us support and advice on different approaches, and for helping us with the development of the 3D-renderer and testing. We would also like to thank Diego Navarro for the continuous academic feedback on our work, as well as Yan Hu for her guidance on academic writing and for giving us direction on where to begin our study. Lastly, we thank family and friends for their feedback and support.


Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Aim and Objectives
    1.1.1 Aim
    1.1.2 Objectives
  1.2 Research Questions
2 Background
  2.1 Variable Rate Shading
  2.2 Benchmarks
  2.3 Compute Shader
3 Methodology
  3.1 Implementation
    3.1.1 D3D12 Application
    3.1.2 Variable Rate Shading
    3.1.3 Timer class
  3.2 Evaluation
  3.3 Limitations
  3.4 Hardware
4 Results
5 Discussion and Analysis
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work


List of Figures

3.1 Stages of the rendering pipeline [35]
3.2 A class diagram of the application
4.1 Bar graph showing the average frame time differences for no SRI initiation, 1x1, 2x2, and 4x4 shading rate at 480p, 1080p, and 2160p resolutions.
4.2 Bar graph showing the average frame rate differences for no SRI initiation, 1x1, 2x2, and 4x4 shading rate at 480p, 1080p, and 2160p resolutions.
4.3 Graph showing how shading rates 1x1, 2x2, and 4x4 at 480p, 1080p, and 2160p resolutions deviate in frame time percentage from no SRI.
4.4 Multiple graphs showing frame rate consistency over 1000 iterations for no SRI initiation, 1x1, 2x2, 4x4 at 480p, 1080p, 2160p resolution.
4.5 Bar graph showing the average draw call time differences for no SRI initiation, 1x1, 2x2, and 4x4 shading rate at 480p, 1080p, and 2160p resolutions.
4.6 Graph showing how shading rates 1x1, 2x2, and 4x4 at 480p, 1080p, and 2160p resolutions deviate in draw call time percentage from no SRI.


Chapter 1 Introduction

Optimal performance is an essential part of video games [5]. Game developers are continuously pushing for new heights in computer graphics [15]. They are striving to improve the performance and visual quality of their content [32]. As the modern GPU sees rapid growth in computational capacity, maintaining optimal frame rate seems less strenuous.

However, the growing desire for improved quality in real-time rendering increases the computational and shading cost of modern video game content. Today, the majority of users have 1080p [10] as their standard monitor resolution. With significant growth estimated in the global market for 2160p displays in the coming years [17], the pixel count will increase further. The per-pixel shading cost also becomes more demanding as more video game studios push for more realistic and detailed rendering. Additionally, the increasing per-pixel shading cost in mobile games affects the power consumption of mobile devices, resulting in shorter battery life. Due to the performance constraints imposed by pixel count and per-pixel shading cost, renderers cannot always afford to deliver the same quality level to every part of the output image. This is especially true for virtual reality devices, where scenes are rendered twice, once for each eye [24].

To counter these problems, GPUs today support various mechanisms to lower the shading costs. Some examples of this that game studios utilize today include multisample anti-aliasing, mixed resolution shading, dynamic resolution rendering, checkerboard rendering, and coarse pixel shading.

Variable rate shading (VRS), or coarse pixel shading, is a mechanism that has recently been getting more attention as a promising optimization [9]. It enables developers to allocate shading capacity for each 16x16 pixel region of the screen, called a "tile", at rates that vary across the rendered image. This makes it possible to perform shading at a coarser frequency than a pixel, coloring a group of pixels from a single sample. Developers determine the shading rate within each tile with a shading rate image (SRI), at rates of 1 pixel (1x1), 2 pixels (1x2, 2x1), 4 pixels (2x2), 8 pixels (2x4, 4x2), and 16 pixels (4x4) per sample. The preferred use case of VRS is to apply coarser shading rates to selected parts of the image where they barely impact the visual quality. Used in this way, VRS can be considered a performance gain with little visible drawback, because the rendering is coarser and less detailed only where the loss is hard to notice. Chapter 2 describes this further.

How efficient an application becomes through the use of VRS varies. Different benchmark tests have been conducted to evaluate the feature's performance. However, the environments used in these benchmark tests differ in polygon count, light computation, and overall complexity, resulting in varying performance results. One benchmark test by UL Benchmarks tested the VRS feature in a simple scene and measured approximately 50 % improved performance [1], whereas another test, made by a developer from Microsoft, measured a 14 % to 20 % performance improvement in the complex game Civilization 6 [26]. Chapter 2 covers these in more detail.

The focus of this study is to further explore the performance gained from the VRS feature in a less complex scene, widening the variety of testing environments for VRS. Widening the variety of environments may benefit other studies of the performance of VRS. Testing VRS in a simpler environment should also show the raw optimization capacity of VRS, as little to no other computation affects the performance of the application. For a sufficient estimation of the performance, it is appropriate to test several of the feature's settings.

1.1 Aim and Objectives

1.1.1 Aim

This study aims to analyze the performance of the D3D12 feature Variable Rate Shading when used in a simple render application. The reason is to test VRS in a simpler environment than the benchmark tests done by UL and DirectX developers, both to widen the variety of environments used for benchmark tests and to isolate the impact of VRS. No extended use of VRS, such as content-aware algorithms, was included. The study only covers the performance of the native VRS support provided by DirectX 12.

The application kept shader and system computations to a minimum and was optimized with the use of multithreading. Performance was measured in frame rate, frame time, and draw call speed for the evaluation of the VRS feature.

To acquire sufficient performance data, several tests were conducted, measuring the three most substantial shading rates of 1x1, 2x2, and 4x4 as well as the application without an SRI for comparison. Each shading rate was set uniformly across the whole SRI, covering every screen pixel, for a series of resolutions. The resolutions 640x480, 1920x1080, and 3840x2160 were used, as they are, to date, the lowest available, the most common, and a significantly growing resolution, respectively.

Results should show a clear correlation between shading rate and pixel-count. It will give an overall performance estimation using the set environment.

1.1.2 Objectives

The objectives of the study are the following:

• O1: Develop a D3D12 render application and render a geometric plane.

• O2: Implement VRS feature Tier 2 with the inclusion of an SRI.

• O3: Conduct the tests in the minimal environment.


• O4: Measure the frame time, frame rate, and draw call time of the render pass for each shading rate at each resolution.

• O5: Evaluate the average performance.

• O6: Compare the results.

1.2 Research Questions

The research questions for this study are as follows:

• RQ1: What is the average performance for drawing a basic geometric plane mesh to a render target using Variable Rate Shading on a DirectX 12 platform?

• RQ2: What is the performance ratio when increasing the shading rates using an SRI?

• RQ3: What is the performance cost for using an SRI in the pipeline?


Chapter 2 Background

2.1 Variable Rate Shading

Variable Rate Shading [9] is a new and promising rendering technique featured in DirectX 12. It is supported by NVIDIA's Turing architecture, used in the RTX 20-series and GTX 16-series graphics cards [3], as well as by Intel's Gen11 graphics and AMD's upcoming RDNA 2 architecture [7]. As part of the DirectX 12 API, the flexible technique allows developers to optimize performance as well as quality by dynamically varying the shading rate for different regions across the frame.

Before VRS, the only means of controlling the shading rate was multisample anti-aliasing (MSAA) [23, 9, 8] combined with super-sampling. Multisample anti-aliasing, as the name suggests, takes multiple samples within a single pixel to remove aliasing along the edges of polygons, whereas VRS reduces the number of samples in various locations. The idea of VRS derives from this technique together with Multi-Resolution Shading and Lens-Matched Shading [9]. The display resolution affected the performance of virtual reality devices because of the high bandwidth needed when rendering to two high-resolution screens simultaneously [22]. This was also an indication that display resolution grows faster than pixel throughput [16], which is problematic since performance is a crucial part of a good game experience. The high latency and high rendering requirements of 3D rendering originate in the game and movie industries. As these industries drive the development toward higher graphical quality, performance optimization techniques and hardware follow [32, 13, 14, 12, 6].

The principle of VRS is to reduce the number of samples taken across a frame by grouping pixels into tiles during the pixel-shading stage. The final shading rate is determined in two passes, each controlled by a combiner. In the first pass, the user chooses how to combine two shading rate values: one passed from the pipeline and one passed from the vertex or geometry shader. How the values are combined is determined by flags that select the value from only one source, the highest or lowest of the two values, or their sum. The second pass works in the same way, but here the user chooses between the shading rate from the first pass and a shading rate from the SRI. These two passes result in the final shading rate used when drawing to the screen. This allows the user to apply a coarse shading rate where detail is hard to perceive, which typically occurs in shadows or in the distance, where details are very dense [2]. This means that games or applications with VRS implemented may see an overall performance improvement with minor visual impact [23, 19, 29].
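The two-pass combination can be illustrated with a small model. This is a minimal sketch and not the Direct3D 12 API: the `Rate` struct, the `Combiner` enum, and both functions are hypothetical names chosen for illustration, rates are stored as coarse pixel sizes per axis, and the sum mode is left out because its exact semantics are not given in the text.

```cpp
#include <algorithm>

// Hypothetical model of a shading rate: coarse pixel size per axis (1, 2 or 4).
struct Rate {
    int x;
    int y;
};

// Hypothetical combiner modes, loosely following the options named in the text.
enum class Combiner { Passthrough, Override, Min, Max };

// Combine the rate from the earlier source (a) with the rate from the later source (b).
Rate combine(Combiner mode, Rate a, Rate b) {
    switch (mode) {
        case Combiner::Passthrough: return a;  // keep the earlier rate
        case Combiner::Override:    return b;  // take the later rate
        case Combiner::Min:         return {std::min(a.x, b.x), std::min(a.y, b.y)};
        case Combiner::Max:         return {std::max(a.x, b.x), std::max(a.y, b.y)};
    }
    return a;  // not reached for valid modes
}

// Pass 1: pipeline rate against per-primitive rate.
// Pass 2: result of pass 1 against the rate read from the SRI.
Rate finalRate(Rate pipeline, Rate perPrimitive, Rate sri,
               Combiner first, Combiner second) {
    Rate afterFirst = combine(first, pipeline, perPrimitive);
    return combine(second, afterFirst, sri);
}
```

With both combiners set to `Override`, the SRI value always wins, regardless of the pipeline and per-primitive rates.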

However, reducing the pixel shading count is not a new method of optimizing performance. Coarse pixel shading is a technique that has been developed since 2014 for DirectX 11 [31], along with the idea of merging pixels [27]. The idea is to take only one sample across a set of multiple pixels and use this sample to color the entire group. The saving can be expressed as Screen Resolution / Sampling Rate = Samples Per Resolution, so a resolution of 3840x2160 with a shading rate of 2x2 results in the same number of samples as a resolution of 1920x1080 with a shading rate of 1x1. Because of this, the hypothesis was that a higher (coarser) shading rate results in faster computation.
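The sample-count relation can be checked with a few lines of arithmetic. The sketch below is an idealized model, assuming one sample per coarse pixel block and rounding up at the edges; the function name `shadingSamples` is illustrative and real hardware may shade additional samples along triangle edges.

```cpp
#include <cstdint>

// Number of shading samples needed to cover a render target when each sample
// colors a block of rateX x rateY pixels. Partial blocks at the right and
// bottom edges still need a sample, hence the rounding up.
std::uint64_t shadingSamples(std::uint32_t width, std::uint32_t height,
                             std::uint32_t rateX, std::uint32_t rateY) {
    std::uint64_t blocksX = (width + rateX - 1) / rateX;
    std::uint64_t blocksY = (height + rateY - 1) / rateY;
    return blocksX * blocksY;
}
```

For example, `shadingSamples(3840, 2160, 2, 2)` and `shadingSamples(1920, 1080, 1, 1)` both give 2,073,600 samples, which is the equivalence stated in the text.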

VRS support was released in two Tiers. Tier 1, intended for older versions, only supports static shading rates on a per-draw basis. Tier 2 adds support for further features, such as defining different shading rates across the image or render target and reusing shading rate sets across several viewports. In addition, it adds SV_ShadingRate as an input to the pixel shader, opening up possibilities for many other implementations, such as making the shading rates content-aware.

2.2 Benchmarks

3DMark is a leading benchmark tool developed by UL for computers and mobile devices. The tool determines the hardware's 3D rendering performance and CPU workload capacity through a series of intensive benchmark tests. It has benchmark tests tailored to specific hardware capabilities, ranging from high-end systems to low-performance systems. The benchmark tests focus on rendering and updating complex game environments in real time. Each benchmark test gives a score based on different performance parameters, which users can compare with the scores of similar systems. To date, 3DMark is the world's most popular and widely used benchmark tool, used by millions of users, hundreds of hardware review sites, and many of the world's leading manufacturers.

In August 2019, UL Benchmarks added a VRS performance test to their benchmark application, 3DMark 11 [2]. Similar to this study, 3DMark measures the performance of the VRS technique and presents the estimated FPS. The test allows users to try VRS at different settings in a 3D environment and see the performance differences. According to UL Benchmarks, their VRS feature test, which renders a 3D scene with a moving camera, showed performance improved by approximately 50 % with VRS, with minimal loss in visible quality [1].

DirectX developer Jacques reported a 14 % to 20 % increase in performance when utilizing VRS in an experiment conducted in partnership with Firaxis Games [26]. In the first experiment, Firaxis tested Tier 1 support with a dynamic shading rate, shading terrain and water at a 2x2 shading rate and smaller assets at a 1x1 shading rate. They measured approximately 20 % improved performance with little loss in visual quality. In the other experiment, they used Tier 2 with edge detection to preserve detail. They measured an approximately 14 % performance gain, with nearly no visible loss in quality.


2.3 Compute Shader

While the graphics processing unit (GPU) primarily handles graphics, general-purpose computation traditionally handled by the central processing unit (CPU) can also be performed on the GPU using a compute shader. The compute shader is a programmable shader stage that allows the large number of parallel processors on the GPU to perform general computations [20]. It can potentially speed up an application immensely, as far more threads are available for computation than on the CPU alone.


Chapter 3 Methodology

This thesis is an implementation built upon the analysis of a DirectX 12 feature for 3D rendering. The method involves developing a D3D12 application with VRS and timers incorporated, running timed tests, and evaluating the resulting data.

Developing the application from the ground up allowed full control of the computations on the CPU and GPU, which is desirable when testing VRS. Other computations, such as lighting and geometry processing, which are common in games and would otherwise interfere with the measurements, could be excluded. The timers are therefore more accurate and allow isolation of smaller sections of the pipeline, such as the draw call for geometry.

The application kept shader and system computations to a minimum and was optimized with the use of multithreading. Performance was measured in frame rate, frame time, and draw call speed for the evaluation of the VRS feature.

The application also called SetStablePowerState to prevent it from exceeding the thermal limits of the processors and drawing excessive current. This enables profiling of GPU usage without measurement artifacts. The data gathering involved measuring the application's frame time, FPS, and draw call speed using the timer and a timestamp query heap. To validate the data, Microsoft recommends their software PIX, which helps ensure the accuracy of the extracted data and supports debugging [11]. This software makes it possible to look into the GPU and check whether the resources hold the correct values.

3.1 Implementation

3.1.1 D3D12 Application

The D3D12 application was developed in the Microsoft Visual Studio 2019 (VS) integrated development environment, using the C++ programming language. The application followed the standard graphics pipeline for 3D engines in DirectX 11 and 12 [35].

Figure 3.1: Stages of the rendering pipeline [35]

The application's fundamental structure utilizes the native Windows API. This was done mostly for convenience when creating a window handle that can later be used by the DirectX 12 API when setting up viewports. It also provided a clean message handling system and a simple render loop. The Windows API layer therefore also held the responsibility for creating and pre-initializing all the DirectX core interfaces and resources. The window and viewports were initialized with a resolution of 640x480, 1920x1080, or 3840x2160. Including more resolutions was not necessary, because the performance of the different shading rates should scale in the same way between resolutions, as described in the background.

For the architecture to give a fair representation of a realistic minimal game environment and of the raw performance of VRS, everything was divided into separate systems:

• The core, which would take care of the engine’s critical DirectX resources.

• The render engine, which held the responsibility for rendering the scene.

• The timer engine, which was kept separate so that it was not affected by other processes and could provide accurate results.

Frames were prepared in separate threads, driven by a ring buffer that stored them in a queue before presenting them to the frame buffers (see Figure 3.2). Note that the driver can only queue up to 3 frames by default, which affects the latency between frames. The Windows Display Driver Model prevents the operating system from queuing more unless the limit is changed manually with IDXGIDevice1::SetMaximumFrameLatency(), up to a maximum of 16 frames [21]. This feature is supported in DXGI 1.1 or higher. For this study, the number of queued frames was left at its default value.

Figure 3.2: A class diagram of the application

For the core, a device was created as an ID3D12Device6 using feature level 12.1 to ensure support for the latest version of VRS Tier 2. With this device, a feature support check was performed, extracting the options data from the device using OPTIONS6. This data made it possible to fetch the supported Tier and ensure that Tier 2 was available. The swap chain was created as an IDXGISwapChain4 with two frame buffers to match the traditional pipeline. These buffers were created with the flip discard setting to achieve the best performance, as described in Microsoft's documentation [28].

The render targets were initialized without blending, so that neither the clear color nor the color values of previous frames affected the color output of the pixel shader. They were configured to write all color and alpha channels without performing any logical operations or blending, so that the combined red, green, blue, and alpha (RGBA) output of the pixel shader was written to the render target as is. The core also held the responsibility for the ID3D12CommandQueue, ensuring that the other systems synchronized and performed all GPU commands in the right order.

The render engine held the responsibility for all the frames, textures, and other resources necessary for rendering. Each frame contained its own ID3D12CommandAllocator and ID3D12GraphicsCommandList5, created as a direct type for immediate execution. This stage could be optimized through the use of bundles if the instructions were the same for every draw call, as the instructions would then be preprocessed by the driver. However, in this study, it was deliberately left as a direct list so that it could be reused for the setup of the system and VRS. All the draw instructions were executed on their own thread to allow more frames to be prepared in parallel.

The ring buffer included in the interface was built on a deque whose entries contain the fence value of a finished frame and its offset in the queue. A deque was a good choice in this case, since it allows faster insertion at both the front and the back of the queue than a vector, making it suitable as a circular buffer.

The buffer's only purpose was thus to keep track of the frames committed to the command queue, represented by the head, and the finished frames, represented by the tail. The head, the tail, the current number of committed frames, and the maximum number of frames allowed in the buffer are all stored as unsigned short integers. This kept the memory footprint small and the calculations fast.
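The bookkeeping described here can be sketched as follows. This is a minimal sketch under stated assumptions and not the thesis's actual class: `FrameSlot`, `FrameRing`, and all member names are hypothetical, and returning `false` when the buffer is full is one possible way to signal that completed frames must be released first.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical record for one frame in flight.
struct FrameSlot {
    std::uint64_t fenceValue;  // fence value signalled when the GPU finishes the frame
    std::uint16_t offset;      // slot index of the frame's resources
};

class FrameRing {
public:
    explicit FrameRing(std::uint16_t capacity) : capacity_(capacity) {}

    // Reserve the slot at the head for a new frame. Returns false when full.
    bool commit(std::uint64_t fenceValue, std::uint16_t& slotOut) {
        if (count_ == capacity_) return false;
        slotOut = head_;
        inFlight_.push_back({fenceValue, head_});
        head_ = static_cast<std::uint16_t>((head_ + 1) % capacity_);
        ++count_;
        return true;
    }

    // Release every frame whose fence value the GPU has already reached.
    void releaseCompleted(std::uint64_t completedFenceValue) {
        while (!inFlight_.empty() &&
               inFlight_.front().fenceValue <= completedFenceValue) {
            tail_ = static_cast<std::uint16_t>(
                (inFlight_.front().offset + 1) % capacity_);
            inFlight_.pop_front();
            --count_;
        }
    }

    std::uint16_t inFlight() const { return count_; }
    std::uint16_t tail() const { return tail_; }

private:
    std::deque<FrameSlot> inFlight_;
    std::uint16_t head_ = 0;   // next slot to hand out
    std::uint16_t tail_ = 0;   // oldest slot still in use
    std::uint16_t count_ = 0;  // frames currently committed
    std::uint16_t capacity_;   // maximum frames allowed in flight
};
```

Keeping the four counters as 16-bit values mirrors the unsigned short integers mentioned in the text; the deque only holds frames that are still in flight, so its size never exceeds the capacity.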

The maximum number of frames allowed in the queue was equal to the number of available threads in the system, obtained with the std::thread::hardware_concurrency() function. When the buffer was full, the allocation returned an error message, which triggered a release of all the completed frames before the next one was drawn.

As the study aims to utilize the feature for the whole screen, shading each tile was necessary. The vertex shader's signature therefore passed a single vertex of 3 float values and added an alpha value to be passed along with the shading rate at the given position. The vertex shader's sole task was to pass the data through so that the rasterizer could register whether any geometry was present. To ensure that the GPU shades each tile, the entire frustum needs to be rasterized. A fully rasterized frustum generates fragments over the render target for each sample location, containing the data for the shading process. To achieve full rasterization, a simple geometric plane covering the frustum is enough.


3.1.2 Variable Rate Shading

The question is which Tier of VRS to use for answering the research questions, as both Tiers are suitable when drawing a scene. However, Tier 1 only allows a single shading rate to be applied uniformly across a render target, which could cause the scene to become blurry or to show artifacts [29]. Since Tier 2 allows a more dynamic use, SRIs are of higher interest, as they make it possible to place coarse shading precisely where details are less necessary. Tier 2 is therefore more likely to be used in a real engine, even though the SRI in this study only contained a uniform shading rate.

The VRS resource was prepared before the render loop, avoiding calculations for each draw call that would affect performance. Many features and use cases of VRS, such as content awareness, are based on the ability to update the SRI values at runtime. Many studies focus on the performance and visual quality of these techniques [29, 19, 34, 33], but as the focus here lies on the performance of drawing with VRS, it was only of interest to read from the SRI. When assembling the VRS resources, the image tile size supported by the graphics adapter was fetched in order to set up the palette. This was done through OPTIONS6, fetched from the device as previously mentioned. To obtain the size of the palette, the dimensions of the application window were divided by the retrieved tile size. It was important to carry out this step correctly, since the tile size can only take one of three values: 8, 16, or 32. With this, the SRI was created as a committed ID3D12Resource1 with unordered access, using an unordered access view (UAV), so that it could be written directly from the GPU.
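The division of the window by the tile size can be sketched as below. This is a minimal sketch: `SriSize` and `sriDimensions` are illustrative names, and rounding up is an assumption made here so that partially covered tiles at the right and bottom edges still get an entry (the text only states that the window was divided by the tile size).

```cpp
#include <cstdint>

// Dimensions of the shading rate image, in tiles.
struct SriSize {
    std::uint32_t width;
    std::uint32_t height;
};

// One SRI texel per screen tile. The tile size is reported by the adapter
// and is one of 8, 16 or 32 according to the text.
SriSize sriDimensions(std::uint32_t screenWidth, std::uint32_t screenHeight,
                      std::uint32_t tileSize) {
    return {(screenWidth + tileSize - 1) / tileSize,
            (screenHeight + tileSize - 1) / tileSize};
}
```

With a tile size of 16, a 1920x1080 window gives an SRI of 120x68 texels. The height is rounded up from 67.5, which shows why a plain integer division could leave the bottom row of pixels without a shading rate.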

Although the vertex and pixel shaders are the only two stages necessary to draw geometry, the application also implemented a compute shader to shift the workload of populating the SRI over to the GPU.

The compute shader's signature contained a single table consisting of a constant buffer view (CBV) and an unordered access view bound to the SRI using the same dimensions. These views shared the same heap, where the CBV was placed first because it was accessed more frequently when carrying out the population algorithms.

Before each draw call, the application instructed the combiners to override the uniform shading rate value, allowing the rate to be edited in the vertex shader stage and the SRI values to be read.

3.1.3 Timer class

The timer class in the application was kept as a separate system to avoid interference with scene rendering. The class divides the timers into two areas of responsibility: one measures frame time and FPS, and the other measures draw call time. Each timer ran on its own dedicated thread to ensure that the results were accurate and not affected by the issued processing instructions.

The first timer measured frame time and FPS using the Chrono library. It placed a timestamp at the beginning of a render call and subtracted the timestamp taken for the previous frame from it, yielding the frame time, which was then stored in a float vector. The current timestamp was then saved as the previous timestamp for the next frame. In this manner, the first frame is always invalid, since the previous timestamp does not yet contain a value, and the first value was therefore not included. The values were later saved to a text file to be used for analysis.

Algorithm 1: Simple FPS algorithm for finding the time taken between the previous frame and the current frame.

function fps_count()
    δ = currentTimestamp − previousTimestamp;
    previousTimestamp = currentTimestamp;
    FPS_Vector.push_back(δ);

The second timer, which measured draw call speed, was a GPU timestamp system built upon a class concept from the work of Mikael Olofsson [25]. Here, a timestamp query heap was created along with a default committed resource that functions as storage for the timestamps. This system queries timestamps from the GPU queue and calculates the elapsed time. The difference is that this system does not depend on the CPU waiting for the GPU to finish a frame. Instead, the frames keep their own separate resources for parallel querying.

Algorithm 2: Timestamp query

Create_Timer()
    Create Query Heap;
    Create Committed Resource.

Update()
    Set a timestamp into the current position of the command list and bind the position to the heap.
    Add draw command into the current position in the command list.
    Set a second timestamp into the current position of the command list and bind the position to the heap.
    Fetch the timestamps included in the heap and bind them to the committed resource.
    Calculate the time by reading two timestamp positions from the committed resource.

3.2 Evaluation

At this stage, the application could run the tests and output the necessary timing data. The tests first measured the application's frame time, FPS, and draw call time at resolutions of 480p, 1080p, and 2160p without utilizing an SRI. The same set of tests was then performed with an SRI for each shading rate of 1x1, 2x2, and 4x4, resulting in 12 different tests. Each test was performed over 1000 consecutive iterations.

The measured data was saved into a spreadsheet for further calculations. The spreadsheet calculated the average frame time, FPS, and draw call speed, as well as their consistency. These average values were then used to find how they deviate from one another, followed by a comparison between shading rates and resolutions. With these values, an overall estimation of the VRS performance was presented and discussed.

3.3 Limitations

For this experiment, 1000 iterations were considered an acceptable amount, as further iterations would make minimal difference to the results. The samples taken were therefore limited to 1000 iterations. This amount also seems to be common among other benchmark works [25, 19].

The difficulty with performance tests of features is that the architecture behind every application differs [4], which results in different outcomes for different game engines, benchmark environments, or even hardware [1, 26, 19, 29]. As there are multiple approaches to developing and optimizing an application, there is most likely room for improvement in the pipeline and computation of the application in this study. The application could also be affected by internal background processes of the operating system, as they might block the access of threads. This would cause the application to stall and wait for access when constructing new threads [30, 19].

3.4 Hardware

The experiment was performed on a LENOVO LNVNB161216 motherboard running the Windows 10 Home x64 operating system. The machine was equipped with an Intel HM370 chipset, an Intel Core i7-9750H processor at 2.6 GHz, an NVIDIA RTX 2080 Max-Q with 8 GB of memory, and 32 GB of SO-DIMM DDR4 RAM at 2666 MHz. These parts were chosen to mimic the current generation of hardware around the RTX 20-Series of graphics cards, for which VRS Tier 2 was released. A Philips The One 65" 4K UHD LED Smart TV (65PUS7354/12) was used as an external display to attain support for 4K resolution.


Chapter 4 Results

This chapter covers all the results of the conducted experiments. The data showcase the performance of VRS in the set environment under the given circumstances. All the tests performed were divided by shading rate and screen resolution. The tests measured the frame rate and draw calls without the use of an SRI and with the use of an SRI at the shading rates of 1x1, 2x2, and 4x4 for the given resolutions of 480p, 1080p, and 2160p.

The data was processed to calculate the deviation between the results and their consistency.

Figure 4.1 presents the average frame time measured for each shading rate at the different resolutions. The data show clear differences in performance between shading rates as well as between resolutions.

Average frame time

Resolution   No SRI    VRS 1x1   VRS 2x2   VRS 4x4
640x480      3.52 ms   3.42 ms   3.29 ms   3.16 ms
1920x1080    3.48 ms   3.13 ms   3.35 ms   3.23 ms
3840x2160    3.30 ms   3.37 ms   3.39 ms   3.53 ms

Figure 4.1: Bar graph showing the average frame time differences for no SRI initiation, 1x1, 2x2, and 4x4 shading rate at 480p, 1080p, and 2160p resolutions.

Each shading rate measured at 480p shows a decrease in frame time in consecutive order. However, the data for 1080p show an irregular pattern across shading rates. The result for shading rate 1x1 at 1080p shows the frame time making a considerable drop from when the SRI was uninitiated. The deviation between SRI uninitiated and 1x1 shading rate is 11 %. See figure 4.3. As for shading rate 2x2, the data show a ∼7 % higher frame time compared to 1x1. The 4x4 shading rate lowers the frame time by ∼4 % from 2x2. Looking at the pattern, each setting shows regularity except for 1x1, which shows an arbitrary drop in frame time compared to the others.
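The percentages for 1080p can be roughly reproduced from the rounded averages in figure 4.1. The sketch below is illustrative only: the helper name is hypothetical, the inputs are the rounded figure values rather than the raw samples, and the choice of reference value for each percentage is an assumption, so the output differs slightly from the numbers quoted in the text.

```python
# Rounded average frame times in ms, copied from figure 4.1.
FRAME_TIME_MS = {
    "640x480": {"No SRI": 3.52, "1x1": 3.42, "2x2": 3.29, "4x4": 3.16},
    "1920x1080": {"No SRI": 3.48, "1x1": 3.13, "2x2": 3.35, "4x4": 3.23},
    "3840x2160": {"No SRI": 3.30, "1x1": 3.37, "2x2": 3.39, "4x4": 3.53},
}


def percent_change(reference_ms: float, value_ms: float) -> float:
    """Signed change of value_ms relative to reference_ms, in percent."""
    return (value_ms - reference_ms) / reference_ms * 100.0


rates = FRAME_TIME_MS["1920x1080"]
step_1x1_to_2x2 = percent_change(rates["1x1"], rates["2x2"])
step_2x2_to_4x4 = percent_change(rates["2x2"], rates["4x4"])
# Assumption: the deviation from no SRI is taken relative to the 1x1 value.
no_sri_vs_1x1 = percent_change(rates["1x1"], rates["No SRI"])

print(f"1x1 -> 2x2: {step_1x1_to_2x2:+.2f} %")
print(f"2x2 -> 4x4: {step_2x2_to_4x4:+.2f} %")
print(f"no SRI vs 1x1: {no_sri_vs_1x1:+.2f} %")
```

Which value serves as the reference matters: taking no SRI as the reference instead gives a gap of about 10 % rather than 11 % for the same two measurements.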

The test for 2160p shows a more regular pattern in frame time between shading rates, although the frame time increases for each rate. Each increment in frame time over the shading rates ranges from 0.58 % to 3.68 %. Performance therefore decreases for every consecutive rate when using a resolution of 8,294,400 pixels in the set environment. The results for the no SRI setting show a lower frame time as the pixel count increases. However, the results for shading rates 2x2 and 4x4 show a higher frame time as the pixel count increases. As for frames per second, similar results can be seen regarding patterns and overall performance. See figure 4.2.

Average frame rate

Resolution   No SRI    VRS 1x1   VRS 2x2   VRS 4x4
640x480      284 fps   292 fps   304 fps   316 fps
1920x1080    287 fps   319 fps   299 fps   310 fps
3840x2160    303 fps   296 fps   295 fps   283 fps

Figure 4.2: Bar graph showing the average frame rate differences for no SRI initiation, 1x1, 2x2, and 4x4 shading rate at 480p, 1080p, and 2160p resolutions.

Overall, the frame time improves by at most ∼11 %, observed for 1x1 at 1080p and 4x4 at 480p. However, it worsens by up to ∼6 % at 2160p.

The preferred case is when the frame time is as consistent as possible throughout the rendering process, leading to a smooth frame rate. For this study, the captured data showed the frame rate to be ∼300 fps, which corresponds to a frame time of 1/300 s ≈ 0.0033 s, meaning that ∼3.3 ms was the target frame time in this case.
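The conversion between frame rate and frame time used here is a simple reciprocal. A minimal sketch, with hypothetical helper names:

```python
def fps_to_frame_time_ms(fps: float) -> float:
    """Frame time in milliseconds for a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def frame_time_ms_to_fps(frame_time_ms: float) -> float:
    """Frame rate for a given frame time in milliseconds."""
    if frame_time_ms <= 0:
        raise ValueError("frame_time_ms must be positive")
    return 1000.0 / frame_time_ms


print(f"{fps_to_frame_time_ms(300):.2f} ms")
print(f"{frame_time_ms_to_fps(3.52):.0f} fps")
```

The second call uses the 3.52 ms average reported for no SRI at 480p, which lands on the 284 fps shown for the same setting.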

The frame time generally stayed within 3.00-4.00 ms, with a few exceptions in the form of high peaks. See figure 4.4. Each test did receive a high peak somewhere along the iterations, with an average difference of ∼5.2 ms between the highest and the lowest peak of frame time.

Frame time deviation in percentage

Resolution   VRS 1x1   VRS 2x2   VRS 4x4
640x480       2.93%     7.25%    11.53%
1920x1080    11.11%     3.98%     7.84%
3840x2160    -2.14%    -2.72%    -6.40%

Figure 4.3: Graph showing how shading rates 1x1, 2x2, and 4x4 at 480p, 1080p, and 2160p resolutions deviate in frame time percentage from no SRI.

The test with the most consistent frame time was shading rate 2x2 at 480p, which showed a difference of ∼3.9 ms between its high and low peak. The least consistent tests were shading rates 2x2 and 4x4 at 2160p, which both showed a difference of ∼7.16 ms between high and low peak.

Generally, the tests show that the difference between the high and low peak of frame time increases with resolution: 480p, 1080p, and 2160p get an average difference of ∼4.4 ms, ∼5.2 ms, and ∼5.9 ms respectively.

Compared to the frame time, the draw call speed showed little to no change between the different shading rates at each resolution. The speed did, however, differ significantly between resolutions.

The draw call speed varies across the shading rates by 0.51 ms, 0.61 ms, and 0.10 ms for 480p, 1080p, and 2160p respectively. See figure 4.5. The variation could be considered insignificant, as the change was less than 1.6 % between the shading rates. See figure 4.6.


[Figure 4.4 panels: frame time in ms on the vertical axis (0.00 ms up to 10.00 ms) over iterations 0-1000 on the horizontal axis. Recoverable panel titles: Frame rate consistency NO SRI 640x480p, 1x1 640x480p, NO SRI 1920x1080, 1x1 1920x1080, NO SRI 3840x2160.]

Figure 4.4: Multiple graphs showing frame rate consistency over 1000 iterations for no SRI initiation, 1x1, 2x2, 4x4 at 480p, 1080p, 2160p resolution.

Average draw call time

Resolution   No SRI      VRS 1x1     VRS 2x2     VRS 4x4
640x480       32.53 ms    32.07 ms    32.02 ms    32.16 ms
1920x1080     65.62 ms    66.24 ms    66.13 ms    66.00 ms
3840x2160    180.90 ms   181.00 ms   181.00 ms   180.91 ms

Figure 4.5: Bar graph showing the average draw call time differences for no SRI initiation, 1x1, 2x2, and 4x4 shading rate at 480p, 1080p, and 2160p resolutions.


Draw call time deviation in percentage

Resolution   VRS 1x1   VRS 2x2   VRS 4x4
640x480       1.45%     1.59%     1.14%
1920x1080    -0.93%    -0.78%    -0.57%
3840x2160    -0.05%    -0.05%     0.00%

Figure 4.6: Graph showing how shading rates 1x1, 2x2, and 4x4 at 480p, 1080p, and 2160p resolutions deviate in draw call time percentage from no SRI.


Chapter 5 Discussion and Analysis

Evaluating the results, it was somewhat clear that performance increased when using coarser shading rates in some scenarios. However, in other scenarios, the performance decreased. At best, performance measured by average frame time and frame rate improved by up to 11.5 %. The average draw calls maintained a similar time across each rate, with a difference of up to 1.6 %. At worst, performance measured by average frame time and frame rate decreased by up to 6.4 %, and by 0.93 % for average draw call speed.

Regardless of whether these results are unexpected, they do answer the first research question of this study:

"What is the average performance for drawing a basic geometric plane mesh to a render target using Variable Rate Shading from a DirectX 12 platform?"

The lower resolution, 480p, did perform faster for each shading rate, where the performance optimization ratio for each rate was 3 % - 4 %. This supports the hypothesis that fewer samples increase performance. VRS at 1080p also proved to be faster, with an average optimization ratio of 2.6 % between rates, although it showed a sporadic pattern where shading rate 1x1, not 4x4, was the most efficient, which opposes the hypothesis. Likewise, the results for 2160p, where each shading rate increased the frame time, giving an average ratio of -2.1 %, also seem to oppose the hypothesis. These ratios answer the second research question in the set environment of this study:

"What is the performance ratio when increasing the shading rates using a SRI?"

The decrease in performance somewhat contradicts the other benchmark tests, which show performance boosts when using VRS.

One reason behind the performance decrease may be that the benchmark application used in this study does not use Direct3D 12 to its full extent. Dividing computational work into multiple threads, in this case, may not be sufficient. It could be that multithreading does not accelerate the workload but instead slows it down.

Another possible explanation could be that the performance cost of using VRS in a low-demand scene outweighs the actual optimization done. As a larger resolution requires a larger SRI, the cost of combining shading rates and setting up the shading rate image in the application adversely affects performance, in this case at 2160p. Each shading rate further increases the demand by overriding the previous rate. This would suggest that VRS is most effective when the GPU resources are stretched to a considerable extent.
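How the SRI grows with resolution can be illustrated with a short calculation. In Direct3D 12 the image holds one texel per screen-space tile, and the tile size is reported by the driver (the `ShadingRateImageTileSize` field of `D3D12_FEATURE_DATA_D3D12_OPTIONS6`). The tile size of 16 used below is an assumption made purely for illustration, and the function name is hypothetical.

```python
def sri_dimensions(width: int, height: int, tile_size: int) -> tuple:
    """Texel dimensions of a shading rate image covering the render target."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    # Round up so partially covered tiles at the edges still get a texel.
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    return tiles_x, tiles_y


ASSUMED_TILE_SIZE = 16  # Assumption; query the device for the real value.

for width, height in [(640, 480), (1920, 1080), (3840, 2160)]:
    tiles_x, tiles_y = sri_dimensions(width, height, ASSUMED_TILE_SIZE)
    print(f"{width}x{height}: {tiles_x}x{tiles_y} = {tiles_x * tiles_y} texels")
```

Under this assumption the SRI at 2160p holds 27 times as many texels as at 480p, which is in line with the argument that any per-texel setup cost grows with resolution.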

To determine the cost of specific processes, it is highly suggested to use a method of distinguishing cost from performed optimization. Unfortunately, very few models can precisely differentiate such cases. In the case of the SRI, it could be possible to get a rough estimation by comparing the performance difference between shading rate 1x1 and no SRI. As both of these settings take one sample per pixel, the difference in performance is not a result of different sampling amounts. This means that factors other than the intended sampling reduction of VRS are affecting the performance in this case.

The cost can be further isolated by solely observing draw call performance, as it only measures the workload on the GPU, whereas frame time also includes the work from the CPU. The known factors that vary between no SRI and 1x1 are the cost of the SRI usage and the combining of shading rates. The draw call speed showed -0.45 ms (1.45 %), +0.62 ms (0.93 %), and +0.10 ms (0.05 %) for 480p, 1080p, and 2160p respectively using a shading rate of 1x1. This gives a rough estimation of the cost of using an SRI, which roughly answers the third research question:
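The estimate can be roughly reproduced from the rounded draw call averages in figure 4.5. The sketch below is illustrative: the helper name is hypothetical, the percentage is assumed to be taken relative to the no SRI value, and because the inputs are rounded the output differs slightly from the figures quoted in the text.

```python
# Rounded average draw call times in ms from figure 4.5: (no SRI, VRS 1x1).
DRAW_CALL_MS = {
    "480p": (32.53, 32.07),
    "1080p": (65.62, 66.24),
    "2160p": (180.90, 181.00),
}


def sri_cost(no_sri_ms: float, vrs_1x1_ms: float) -> tuple:
    """Absolute (ms) and relative (%) change when enabling a 1x1 SRI."""
    delta_ms = vrs_1x1_ms - no_sri_ms
    # Assumption: the percentage is expressed relative to the no SRI time.
    return delta_ms, delta_ms / no_sri_ms * 100.0


for resolution, (no_sri_ms, vrs_1x1_ms) in DRAW_CALL_MS.items():
    delta_ms, delta_pct = sri_cost(no_sri_ms, vrs_1x1_ms)
    print(f"{resolution}: {delta_ms:+.2f} ms ({delta_pct:+.2f} %)")
```

A negative value means the draw call was faster with the SRI bound, which is the case only at 480p in this data.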

"What is the performance cost for using a SRI in the pipeline?"

The reason for the increase in performance could point to the use of an SRI at a low resolution as being somehow beneficial for performance. Further testing is suggested to specifically explain the reasons for this.

To end the discussion, a rough comparison of the already existing benchmark tests with the benchmark of this study was made. The following shows the benchmark tests from Chapter 2 as well as the results of this study:

• VRS benchmark done in the complex game showed 20 %

• VRS benchmark done in the medium environment showed 50 %

• VRS benchmark done in the simple environment showed 11 %

The simple testing environment from this study did not show a greater performance gain than the medium environment. The hypothesis suggesting a correlation between environmental complexity and optimization efficiency using VRS is therefore not supported in this case.


Chapter 6 Conclusions and Future Work

6.1 Conclusions

A conclusion can be drawn that when using VRS and rendering a basic geometry in a lightweight application, the performance both increased and decreased depending on shading rate and pixel count. The increase was not of a similar scale as in the other benchmark tests and did not show any indication that would suggest a correlation between scene complexity and optimization effectiveness. One speculation was that VRS becomes more efficient as the GPU workload grows, explaining why the other benchmark tests with a more demanding scene got higher performance results. This would also partly explain the decrease in performance, as the computational cost of using VRS outweighs the performance gain.

6.2 Future Work

Knowing that the benchmark application of this study did show contradicting performance results compared to other benchmark tests, further testing is suggested.

One addition to this study would be to update the application used for this testing with geometry and light sources at a controlled amount, and to perform similar tests for different amounts of geometry and lights, as done in [18]. This would further show the correlation between GPU workload and VRS efficiency and would show if VRS is most effective when the GPU resources are stretched to a considerable extent. It would also possibly shed light on why the performance for 2160p appeared to decrease.

A broader understanding of the performance of VRS for different resolutions could be useful. The tests could also be further expanded with every available shading rate measured against more resolutions than just 640x480, 1920x1080, and 3840x2160. The results should show a finer correlation between shading rate and pixel count. This could be interesting as it would further show the performance effects of running VRS at higher resolutions and could be useful when planning the development of D3D12 applications.


References

[1] UL Benchmarks. 3DMark VRS feature test - compare Variable-Rate Shading performance and image quality. Aug 2019 (accessed November 20, 2020). url: https://www.youtube.com/watch?v=d1zoGmhVB1U&ab_channel=ULBenchmarks.

[2] UL Benchmarks. Test Variable-Rate Shading with 3DMark. Aug 2019 (accessed May 22, 2020). url: https://benchmarks.ul.com/news/test-variablerate-shading-with-3dmark.

[3] Swaroop Bhode. Turing Variable Rate Shading. Sep 2018 (accessed May 22, 2020). url: https://devblogs.nvidia.com/turing-variable-rate-shading-vrworks/.

[4] Damien Charrieras and Nevena Ivanova. “Emergence in video game production: Video game engines as technical individuals”. English. In: Social Science Information 55.3 (2016), pp. 337–356.

[5] Mark Claypool, Kajal Claypool, and Feissal Damaa. “The Eﬀects of Frame Rate and Resolution on Users Playing First Person Shooter Games”. In: Proceedings of SPIE - The International Society for Optical Engineering 6071 (Jan. 2006).

[6] Robert L Cook, Loren Carpenter, and Edwin Catmull. “The Reyes Image Rendering Architecture.” English. In: ACM SIGGRAPH Computer Graphics 21.4 (1987), pp. 95–102.

[7] Intel corporation. Intel Processor Graphics Gen11 Architecture. Sep 2018 (accessed May 22, 2020). url: https://software.intel.com/sites/default/files/managed/db/88/The-Architecture-of-Intel-Processor-Graphics-Gen11_R1new.pdf.

[8] NVIDIA Corporation. NVIDIA Turing GPU architecture. English. 2019;2018; url: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.

[9] NVIDIA corporation. VRWorks - Variable Rate Shading(VRS). Jul 2018 (accessed May 22, 2020). url: https://developer.nvidia.com/vrworks/graphics/variablerateshading.

[10] Valve Corporation. Steam Hardware & Software Survey: December 2020. Mar 2020 (accessed December 29, 2020). url: https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam.

[11] davidcongruili. DirectX 12 Support in Visual Studio. September 2020 (accessed Sep 9, 2020). url: https://docs.microsoft.com/en-us/visualstudio/debugger/graphics/visual-studio-graphics-diagnostics-directx-12?view=vs-2019.

[12] Yangdong Deng et al. “Toward Real-Time Ray Tracing: A Survey on Hardware Acceleration and Microarchitecture Techniques”. English. In: ACM Computing Surveys (CSUR) 50.4 (2017), pp. 1–41.

[13] Digital Foundry. Star Citizen’s Next-Gen Tech In-Depth: World Generation, Galactic Scaling + More! Feb 2020 (accessed Feb 23, 2020). url: https://www.youtube.com/watch?v=hqXZhnrkBdo.

[14] Digital Foundry. Star Citizen’s Next-Gen Tech: Micro-Level Detail - From Battle Damage To Particle Effects + More. Feb 2020 (accessed Feb 23, 2020). url: https://www.youtube.com/watch?v=TUFcerTa6Ho.

[15] David Goodhue. “Velocity-Based Compression of 3D Rotation, Translation, and Scale Animations for AAA Video Games”. In: ACM SIGGRAPH 2020 Talks. SIGGRAPH ’20. Virtual Event, USA: Association for Computing Machinery, 2020. url: https://doi-org.miman.bib.bth.se/10.1145/3388767.3407392.

[16] Intel. Use Variable Rate Shading (VRS) to Improve the User Experience in Real-Time Game Engines | SIGGRAPH 2019 Technical Sessions. Slideshare, Aug 2019 (accessed March 9, 2020). url: https://www.slideshare.net/IntelSoftware/use-variable-rate-shading-vrs-to-improve-the-user-experience-in-%20real-time-game-engines.

[17] Mordor Intelligence. 4K Display Resolution Market. Mar 2020 (accessed December 29, 2020). url: https://www.mordorintelligence.com/industry-reports/global-4k-display-resolution-market-industry.

[18] Yousra J’lali. “DirectX 12: Performance Comparison Between Single- and Multithreaded Rendering when Culling Multiple Lights”. English. In: (2020).

[19] Filip Lundbeck. Analysering av Variable Rate Shading’s bildbaserad skuggning i uppskjuten sammansättning av ljussättning : En jämförelse mellan bildbaserad skuggning och enhetlig skuggning för spel. Swedish. 2020.

[20] Microsoft. Compute Shader Overview. url: https://docs.microsoft.com/en-us/windows/win32/direct3d11/direct3d-11-advanced-stages-compute-shader.

[21] Microsoft. IDXGIDevice1::SetMaximumFrameLatency method (dxgi.h). May 2018 (accessed 18 November, 2020). url: https://docs.microsoft.com/en-us/windows/win32/api/dxgi/nf-dxgi-idxgidevice1-setmaximumframelatency.

[22] Joerg Mueller et al. “Shading atlas streaming”. English. In: ACM Transactions on Graphics (TOG) 37.6 (2019;2018;), pp. 1–16.

[23] NvidiaGameWorks. Siggraph 2018 - Variable Rate Shading (VRS). Youtube, Sep 2018 (accessed Feb 12, 2020). url: https://www.youtube.com/watch?v=Hgl9eTJio8Q.

[24] Oculus. Oculus for developers: Guidelines for VR Performance Optimization. url: https://developer.oculus.com/documentation/native/pc/dg-performance-guidelines/.

[25] Mikael Olofsson. Direct3D 11 vs 12 : A Performance Comparison Using Basic Geometry. English. 2016.

[26] Jacques van Rhyn. Variable Rate Shading: a scalpel in a world of sledgehammers. Mar 2019 (accessed November 20, 2020). url: https://devblogs.microsoft.com/directx/variable-rate-shading-a-scalpel-in-a-world-of-sledgehammers/.

[27] Rahul Sathe and Tomas Akenine-Möller. “Pixel Merge Unit.” In: Eurographics (Short Papers). 2015, pp. 53–56.

[28] Jacobs M. Satran M. For best performance, use DXGI flip model. May 2018 (accessed May 30, 2020). url: https://docs.microsoft.com/en-us/windows/win32/direct3ddxgi/for-best-performance--use-dxgi-flip-model.

[29] Stefan Stappen. “Improving Real-Time Rendering Quality and Efficiency using Variable Rate Shading on Modern Hardware”. MA thesis. Favoritenstrasse 9-11/E193-02, A-1040 Vienna, Austria: Research Unit of Computer Graphics, Institute of Visual Computing and Human-Centered Technology, Faculty of Informatics, TU Wien, Dec. 2019. url: https://www.cg.tuwien.ac.at/research/publications/2019/stappen-2019-vrs/.

[30] Andrew S. Tanenbaum and Herbert Bos. Modern operating systems. English. 4th global ed. Boston: Pearson, 2014, pp. 81–97.

[31] K. Vaidyanathan, M. Salvi, R. Toth, et al. Coarse Pixel Shading. English. 2014. url: https://fileadmin.cs.lth.se/graphics/research/papers/2014/cps/cps.pdf.

[32] Carsten Wenzel. “Real-time atmospheric eﬀects in games”. English. In: ACM, 2006, pp. 113–128.

[33] Lei Yang. What is NVIDIA Adaptive Shading? Demystifying The Turing Feature That Boosts FPS Up To 15%. Aug 2019 (accessed March 7, 2020). url: https://www.nvidia.com/en-us/geforce/news/nvidia-adaptive-shading-a-deep-dive/.

[34] Lei Yang et al. “Visually Lossless Content and Motion Adaptive Shading in Games”. In: Proc. ACM Comput. Graph. Interact. Tech. 2.1 (June 2019). url: https://doi.org/10.1145/3320287.

[35] Jason Zink, Matt Pettineo, and Jack Hoxley. Practical Rendering and Computation with Direct3D 11. English. 1st ed. Natick: A K Peters/CRC Press.
