DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019
Real-time Raytracing and Screen-space Ambient Occlusion
Pooria Ghavamian
Supervisor at KTH: Björn Thuresson
Examiner: Tino Weinkauf
Supervisor at Fatshark: Axel Kinner
Abstract
This paper investigates advances in real-time ambient occlusion (AO). The topics discussed are state-of-the-art screen-space techniques and raytraced ambient occlusion. The methods compared are our screen-space ambient occlusion (SSAO) variant, horizon-based ambient occlusion (HBAO), Unity’s scalable AO (AlchemyAO), multi-scale volumetric AO (MSVO), and raytraced AO (RTAO).
The methods were compared based on the errors produced in dynamic scenes, performance, and similarity to reference scenes rendered by an offline raytracer. Important dynamic-scene errors were highlighted, visual results were objectively evaluated using the Structural Similarity Index (SSIM), and the Unity engine was used as a common platform for all the methods in order to obtain performance metrics. RTAO managed to achieve a strikingly high SSIM score, while MSVO traded some accuracy to be the fastest of all the methods. Further analysis of the different implementations and their strengths and weaknesses is provided.
Sammanfattning
This study explores advances in real-time ambient occlusion (AO). The topics discussed are the latest screen-space techniques and raytraced ambient occlusion. The methods compared are our own screen-space ambient occlusion (SSAO) variant, horizon-based ambient occlusion (HBAO), Unity’s scalable AO (AlchemyAO), multi-scale volumetric AO (MSVO), and raytraced AO (RTAO).
The methods were compared based on performance, similarity to reference scenes, and the errors produced in dynamic scenes. Important dynamic-scene errors were highlighted, and the visual results were objectively evaluated using the Structural Similarity Index (SSIM). The Unity engine was used as a common platform for all the methods in order to obtain performance metrics.
RTAO achieved a high SSIM score, while MSVO was the fastest of all the methods at the cost of lower precision. Further analysis of the different implementations and their strengths and weaknesses is included in the report.
Table of Contents
Abstract
Background
Introduction
Motivation
DirectX Raytracing
Problem Statement
Research Question
Goal
Related Work
Rendering Equation
Ambient Occlusion
Monte-Carlo Method
Screen-space Ambient Occlusion Methods
Crytek SSAO
Normal-oriented SSAO
Horizon-based Ambient Occlusion
Volumetric Obscurance
Alchemy and Scalable Ambient Occlusion
Multi-scale Ambient Occlusion
Denoising and Blur
Gaussian Blur
Separable Blur
Bilateral Blur and variations
Spatiotemporal approach
Implementations
SSAO Variant
HBAO
Unity’s SAO (AlchemyAO)
MSVO
RTAO
Results
Dynamic Scene Applicability
Banding and Noise
Temporal Lag
Over-occlusion
Edge-of-Screen artefacts
Flickering
Performance
Scalability
Resolution
Sample Count
Radius
Raytraced Ambient Occlusion
Resolution
Radius
Sample Count
Accuracy
Discussion
Criticism and Future Work
Conclusion
Acknowledgment
Bibliography
Appendix A
Appendix B
Background
This chapter lays down the theoretical foundation for the thesis. Included are the problem statement, theory, motivation, related studies, and finally the research question that governs the main direction for the rest of the paper. A brief discussion of the methods used for evaluation is provided at the end.
Introduction
The project is carried out at Fatshark, a video game company in Stockholm. Their products, like most of the modern video games, rely heavily on nuanced visual cues to leave a convincing perceptual impression on the users and provide a strong sense of immersion. An effective way of accomplishing this goal is by paying extra close attention to the interplay of lighting and scene elements.
The way light interacts with the environment is, in reality, quite complex and carries information that ranges from the redness of an apple to the age of a distant galaxy. Light is composed of photons, and the visible light that is the focus of this study is only a small part of the electromagnetic spectrum. Light has many strange properties, such as being both a particle and a wave at the same time. This duality is behind familiar behavior like caustics, afterglow, reflection, and refraction, as well as less common optical phenomena like the 22° halo and diffraction.
It should come as no surprise that accurately capturing and displaying all the intricacies of light can prove to be an insurmountable task. Therefore, in real-time computer graphics, we have to approximate light behavior and work with a simplified model of it. One way of calculating light at each point is to take into account only the light sources and the object being lit. This simplified model is called local illumination. A widely used illumination model of the local variety is the Phong reflection model, which breaks down a reflection off an object into ambient, diffuse, and specular terms. The diffuse term, which uses Lambertian reflection, represents the direct light hitting a surface; it depends on the incident angle and is independent of the view. The specular term, on the other hand, attempts to capture the specular highlights seen in polished, shiny objects and is view-dependent. It is the ambient term, however, that can prove too simplistic and look incorrect, as it is only a constant value spread equally across the scene. Consequently, shadows and the inter-reflections between objects are ignored in local illumination models. Models that do take into account the reflected light from other objects within a scene (indirect light) are called global illumination models.
Global illumination (GI) is a crucial component of a rendered scene. As mentioned before, surfaces receive light from all directions; if that were not the case, objects not under direct light (e.g. the corners of walls) would appear pitch black. Instead of considering only the direct effect of the light source on objects (direct lighting), GI follows light as it bounces between objects in the environment. This approach drastically improves the scene’s quality by replicating lighting behavior that is closer to reality. However, an accurate implementation is computationally expensive, and even a simple scene can take hours to render. Therefore, a mix of local illumination models and algorithms that handle specific GI effects is used for real-time applications.
Motivation
Ambient occlusion is an effect that helps us approximate global illumination. Simply put, ambient occlusion is a measure of the exposure of each point in the scene to ambient lighting and is a scalar value between zero and one. This effect is powerful in bringing out details in the scene and, unlike other forms of lighting, does not depend on light direction. Some games precompute ambient occlusion in their indirect lighting calculations through Monte Carlo methods [1]. In this method, rays are cast in a random and uniform fashion over the hemisphere around a normal, and the visibility function is calculated based on the intersections. Baking (a non-real-time solution) the ambient occlusion can increase quality and performance for static scenes; however, when it comes to dynamic scenes, a real-time solution is needed.
Baking is a powerful tool in computer graphics when used appropriately, and is essentially a transfer process. It is a precomputation that can be done for many different attributes. Instead of calculating textures and other attributes in real-time, one can bake them before running a scene and simply reload them when needed. However, this can only be applied to static objects; otherwise, if an object moves within a scene, its baked shadow will remain in the same place.
The methods in the real-time realm can be divided into object-space and screen-space approaches.
Object-space and screen-space refer to the coordinate systems used in computer graphics and based on what calculation is taking place, one might switch from one space to another. As the name suggests, screen-space is defined by the screen with coordinates in 2D. Whereas, object-space is the coordinate system from the object’s point of view.
Object-space solutions have a cost proportional to scene complexity, and usually, sampling is done to find ray intersections with non-empty areas around each pixel. Screen-space solutions, in contrast, make use of depth, normal, and position buffers (screen-space data). Scene complexity does not affect screen-space solutions; they have a constant cost and are only affected by the rendering resolution [1].
A well-known screen-space solution is SSAO [14], which uses only the z-buffer (which stores depth information) as its input. The ambient occlusion factor is therefore calculated as a function of samples acquired in the spherical depth test. Although the result is not physically accurate, it still manages to increase the quality of the scene in a robust way. Since SSAO shares many properties with other screen-space solutions, it will be analyzed in detail. There are, of course, certain problems associated with this approach which we will discuss in a future section.
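To make the spherical depth test concrete, the toy Python sketch below (our own illustration; the sampling scheme, depth-offset scale, and function name are assumptions, not Crytek's actual shader) estimates an occlusion factor for a single pixel by sampling points in a sphere around it and counting how many fall behind the surface recorded in the z-buffer:

```python
import numpy as np

def ssao_factor(depth_buffer, px, py, radius=4, n_samples=16, rng=None):
    """Toy Crytek-style SSAO for one pixel: sample points in a sphere
    around the pixel and count how many fall behind the geometry
    recorded in the z-buffer."""
    rng = rng or np.random.default_rng(0)
    h, w = depth_buffer.shape
    center_z = depth_buffer[py, px]
    occluded = 0
    for _ in range(n_samples):
        # random offset inside the unit sphere, scaled by the radius
        v = rng.normal(size=3)
        v *= radius * rng.random() ** (1 / 3) / np.linalg.norm(v)
        sx = int(np.clip(px + v[0], 0, w - 1))
        sy = int(np.clip(py + v[1], 0, h - 1))
        sample_z = center_z + v[2] * 0.01  # crude depth scale (assumption)
        # the sample is occluded if the z-buffer stores a closer surface
        if depth_buffer[sy, sx] < sample_z:
            occluded += 1
    return occluded / n_samples
```

On a flat depth buffer roughly half of the sphere samples land below the surface, which illustrates the self-occlusion bias of the original sphere-sampling method.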
DirectX Raytracing
DirectX Raytracing (DXR) introduces new elements to the DirectX 12 API and enables GPU-accelerated raytracing. It provides developers with new shaders, acceleration structures, and other useful building blocks. It is quite low-level: the developer is given a lot of freedom to optimize their implementation but has to deal with all the minute details of the pipeline.
Raytracing is a powerful tool that at its core is quite intuitive. The way raytracing is done is quite reminiscent of how ancient Greek philosophers thought human vision worked: in what is called the extramission theory, visual perception is made possible through beams that leave our eyes and hit objects around us. Unlike rasterization, where primitive traversal takes place, in raytracing the 3D scene is not projected to the 2D screen before coloring, and effects like reflection, refraction, and shadows emerge naturally from the act of casting rays, without special-case techniques.
A basic Whitted raytracer, as demonstrated in “An Improved Illumination Model for Shaded Display” [49], uses a recursive structure to achieve photorealism. There are two types of rays: primary rays and secondary rays. Primary rays are used to compute visibility, in what is also known as raycasting. In a naive implementation, a primary ray is generated, cast through each pixel, and checked against the triangles for the closest intersection. The introduction of secondary rays, on the other hand, managed to solve three important challenges that the rendering community faced at the time: reflection, refraction, and shadows.
Secondary rays are spawned at the point where primary rays intersect with an object. For example, in the case of a diffuse object, a secondary ray is sent towards the light source (also known as a shadow ray) to determine whether the point at which it was spawned is under the shadow of the object that the ray has intersected with. The same idea applies to reflection and refraction as well.
Figure 1. Recursive raytracing showcasing primary and secondary rays. Note how shade() is invoking trace() for further assessment of the radiance. Source from [1].
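The shadow-ray idea can be sketched in a few lines. The following Python snippet (a simplified illustration using sphere occluders; the function names are our own) casts a secondary ray from a shading point towards the light and reports whether any occluder blocks it:

```python
import numpy as np

def hit_sphere(origin, direction, center, radius):
    """Return the nearest positive ray parameter t of a ray-sphere
    intersection, or None. Solves |o + t*d - c|^2 = r^2, assuming
    `direction` is normalized."""
    oc = origin - center
    b = np.dot(oc, direction)
    c = np.dot(oc, oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - np.sqrt(disc)
    return t if t > 1e-6 else None

def in_shadow(point, light_pos, occluders):
    """Secondary (shadow) ray: is any occluder between point and light?"""
    d = light_pos - point
    dist = np.linalg.norm(d)
    d /= dist
    for center, radius in occluders:
        t = hit_sphere(point, d, center, radius)
        if t is not None and t < dist:  # hit before reaching the light
            return True
    return False
```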
The main operations of a naïve raytracer can, therefore, be separated into two parts:
● Intersection calculation
● Shading
For a given scene with N primitives, a basic raytracer will carry out intersection tests for each primitive in O(N) time. The shading that takes place upon intersection is similarly done in O(N) time. Since optimization in practice focuses on execution profiling (detecting the hotspots in a program), a raytracer can be optimized in two main ways. The total time spent in a program can be broken down as:

$$T = \sum_{i} n_i \cdot t_i$$

Where $n_i$ is the number of times task $i$ is executed and $t_i$ is the average cost of one execution of task $i$.
As shown above, one can either carry out low-level optimization of a given task $i$ (reducing $t_i$) or reduce the number of times $n_i$ that the program has to run that task. In this context, the intersection test is defined as task $i$. DirectX Raytracing (DXR) provides us with the means to optimize this hotspot through its new additions to the DirectX 12 API.
Using acceleration structures, one can significantly reduce the number of unnecessary intersection tests. Acceleration structures can be divided into two main groups, object hierarchies and spatial subdivision, each with its own advantages and disadvantages. DXR opts for the former class by deploying the commonly used bounding volume hierarchy (BVH) in its implementation (although it does not mandate its use) [35]. A BVH encloses the primitives in axis-aligned bounding boxes (AABBs) and guarantees bounded memory usage [35]. DXR uses a two-level hierarchy, divided into the top-level acceleration structure (TLAS) and the bottom-level acceleration structure (BLAS). This model helps with optimizing ray traversal and also opens up possibilities for dynamic objects [10].
Figure 2. A scene with 5 objects and their BVH. Source from [1].
Figure 2 depicts a simple BVH with spheres chosen as the bounding volume. The scene is deconstructed into a hierarchical tree. The topmost node is called the root node and contains the entire scene. The internal nodes are analogous to the TLAS in DXR and have pointers to their children, which can be either other internal nodes or leaf nodes. Leaf nodes are similar to the BLAS and hold the geometry that will be used for rendering. In a naïve raytracer, the intersection test has to be done against every single piece of geometry in the scene and is thus extremely inefficient. With a BVH, however, a shadow ray (also used in ambient occlusion) that returns on the first hit found only has to carry out intersection tests against the sphere bounding volumes. If the ray misses a bounding volume, it can safely disregard all the content under the missed node. Otherwise, it recurses until it reaches a leaf node, where the intersection test is carried out against the geometry. Using a BVH can therefore significantly improve performance by helping the ray prune large sections of the scene. It should come as no surprise that DXR gives great importance to acceleration structures in its raytracing setup.
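The pruning behavior described above can be illustrated with a miniature two-level hierarchy in Python (sphere bounding volumes as in figure 2; the class and function names are our own, not DXR API calls):

```python
import numpy as np

class Node:
    """Sphere-bounded BVH node: internal nodes hold children (like a
    TLAS), leaves hold geometry (like a BLAS)."""
    def __init__(self, center, radius, children=None, geometry=None):
        self.center = np.asarray(center, float)
        self.radius = radius
        self.children = children or []
        self.geometry = geometry  # leaf payload, e.g. a triangle list

def ray_hits_sphere(o, d, c, r):
    """True if the ray's line intersects the sphere (sketch: ignores
    whether the hit lies behind the origin)."""
    oc = o - c
    b = np.dot(oc, d)
    return b * b - (np.dot(oc, oc) - r * r) >= 0

def any_hit(node, origin, direction, tests):
    """Shadow-ray style traversal: return True on the first leaf whose
    bounding sphere the ray enters; prune whole subtrees on a miss."""
    tests[0] += 1
    if not ray_hits_sphere(origin, direction, node.center, node.radius):
        return False  # prune everything below this node
    if node.geometry is not None:
        return True   # a full tracer would now test the actual geometry
    return any(any_hit(ch, origin, direction, tests) for ch in node.children)
```

When the ray misses an internal node, its entire subtree is skipped; a test counter makes the savings visible.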
DXR also introduces new additions to the High-Level Shading Language (HLSL) in the form of ray-generation, closest-hit, any-hit, and miss shaders [10].
Figure 3. Overview of the three main components needed for DXR. Source from [35].
As can be seen in figure 3, DXR breaks down its architecture into three main components. Acceleration structures form a two-level hierarchy that encompasses the geometry and facilitates a faster search for ray intersections. There are trade-offs in the way the acceleration structures are built, which should be taken into consideration depending on what qualities are sought. For example, for ray intersection performance, larger bottom-level structures are desired, whereas for flexibility, more and smaller top-level structures should be deployed. At a high level, the raytracing pipeline object holds the function declarations of shaders, compiled programs, and the payload that shaders need for communication. The shader binding table (SBT) is one of the most important pieces of the setup, since it binds the programs together with the TLAS, determining which shader should be executed for a given geometry and what resources an invoked shader needs.
Problem Statement
The main focus of this thesis is ambient occlusion (AO) in a dynamic setting, where fluid frame transitions are a necessity. Given the importance of AO and how it enriches a scene, DirectX 12 and advances in real-time rendering techniques such as DirectX Raytracing (DXR) could enhance the quality and efficiency of this technique. Our eyes are usually not sensitive to low-frequency variations in light, which allows us to simplify the light transport in terms of ambient occlusion. This leaves the onus on ambient occlusion to convincingly create the illusion of global illumination and help us visually discern the details in objects.
The desired result of this paper is identifying an ambient occlusion effect that meets the following criteria:
● Applicable to dynamic scenes: the solution has to require no precomputation and operate in real-time. This means that no artefacts should be produced due to camera movement, changes in geometry, or locomotion.
● Performant: latency is highly perceptible to the user. Therefore, the frame rate needs to be maintained within 60-90 frames per second for a desirable interactive experience [1]. With rapid advances in technology, one must take into account the increasing demand for higher resolutions and sampling rates. Moreover, scene complexity and geometry are important factors that need consideration, as an average scene in a modern video game will have millions of triangles at any given moment.
● Accurate: this criterion falls more under the qualitative aspect of the technique and is the one most obvious to the user. Our solution has to have an overall convincing and accurate appearance. Ambient occlusion disappearing at the screen edges, haloing, and over- and under-occlusion are examples of violations of this condition. Naturally, a realistic ambient occlusion effect also has to take off-screen geometry into account for an accurate result.
The criteria above will establish the core upon which the evaluation of the effect will be carried out.
As can be seen, the requirements mostly harken back to the goals of real-time graphics effects.
Research Question
“To what extent can real-time raytracing in DXR produce a viable alternative to screen-space ambient occlusion solutions?”
Goal
The aim of this thesis is to investigate state-of-the-art approaches to ambient occlusion, and specifically to study how DXR affects the experience. The hypothesis is that with DXR we can attain higher quality without hindering performance in a significant way. The main means of evaluation will be measuring render time for performance and computing the Structural Similarity Index (SSIM) [47] between the proposed methods and a reference standard for accuracy.
SSIM
The Structural Similarity Index is a relatively new method that has gained traction in the field of visualization. It measures the similarity and quality between two images, under the condition that one of them has ideal quality. SSIM works on the principle that human eyes pay attention to structural differences in images and therefore, unlike other methods, does not compare individual pixels. Factors such as luminance, contrast, and structural information are computed and averaged using an 8x8 sliding window. The window moves one pixel at a time, and the score is computed as the mean SSIM. The SSIM used in this paper follows the form:

$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \, [c(x, y)]^{\beta} \, [s(x, y)]^{\gamma}$$

Where $l(x, y)$, $c(x, y)$ and $s(x, y)$ are the luminance, contrast, and structural correlation components of the image, respectively. The Matlab implementation used is directly based on the SSIM paper [47]. More traditional methods like mean squared error (MSE) and peak signal-to-noise ratio (PSNR) carry out pixel-by-pixel analysis and do not reflect the Human Visual System (HVS) effectively. Ambient occlusion maps are grayscale and can contain sharp changes that may cause MSE and PSNR to overestimate or underestimate the quality of an image by relying only on absolute error. SSIM, on the other hand, like the HVS, does not depend only on individual pixels and takes luminance and correlation into account to find structural similarities. Although more complicated, SSIM is a much more apt metric for our use case.
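As a rough illustration of the metric (a simplified sketch with a uniform 8x8 window; the reference implementation instead uses an 11x11 Gaussian window and further refinements), the mean SSIM can be computed as follows:

```python
import numpy as np

def ssim(img1, img2, window=8, L=1.0, k1=0.01, k2=0.03):
    """Mean SSIM over a sliding window with uniform weighting.
    L is the dynamic range of the pixel values (1.0 for [0,1] images)."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    h, w = img1.shape
    scores = []
    for y in range(h - window + 1):
        for x in range(w - window + 1):
            a = img1[y:y + window, x:x + window]
            b = img2[y:y + window, x:x + window]
            mu_a, mu_b = a.mean(), b.mean()
            va, vb = a.var(), b.var()
            cov = ((a - mu_a) * (b - mu_b)).mean()
            # combined luminance/contrast/structure form of the SSIM index
            scores.append(((2 * mu_a * mu_b + c1) * (2 * cov + c2)) /
                          ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2)))
    return float(np.mean(scores))
```

An image compared against itself scores exactly 1, while any distortion pulls the score below 1.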
Related Work
This section is dedicated to previous work in evaluating ambient occlusion. First, the theoretical background required to understand these related works is provided. Then different methods in object-space and screen-space are presented, along with how they make use of common techniques.
Rendering Equation
Before exploring ambient occlusion, it is beneficial to have an overview of the rendering equation, as it is the mathematical basis for all the interactions between light and surfaces within a scene.
Outgoing radiance is the term output at the end of a rendering system and is a measure of the amount of light reflected from a point p in the direction of the viewer.
Figure 4. Light and surface interaction. Source from [37].
The surface properties are formalized by the bidirectional reflectance distribution function (BRDF). Simply put, the BRDF is the ratio of incident light from direction $\omega_i$ to the outgoing radiance in direction $\omega_o$ at a point p on the surface, and is denoted by $f_r(p, \omega_i, \omega_o)$. The outgoing radiance can therefore be written as [7]:

$$L_o(p, \omega_o) = \int_{\Omega} f_r(p, \omega_i, \omega_o)\, L_i(p, \omega_i)\, (n \cdot \omega_i)\, d\omega_i \quad (1)$$
Since ambient occlusion deals primarily with ambient light, certain useful assumptions can be made.
It is important to outline these assumptions, as they will be the basis of how ambient occlusion is formalized for the rest of the paper. First, the rendering model in equation 1 has no term for the effect of an emissive surface, so a separate system is needed for the emissive case. Secondly, the surface is assumed to be isotropic/Lambertian and to scatter ambient light equally in all directions [37]; the BRDF can therefore be taken out of the integrand and treated as a constant. The final assumption, mentioned earlier, concerns the indirect light.
The ambient light is isotropic and equally incident from all directions, making the incoming radiance $L_i$ a constant ambient term $L_A$. A binary visibility term $V(p, \omega)$ is then multiplied with it to allow contributions only from unoccluded regions. The following section will present the equation derived after these assumptions are made and discuss how it is used in different methods.
Ambient Occlusion
So far, the rationale behind the importance given to ambient occlusion has been outlined. It is beneficial now to delve deeper into the analytical details of the effect. Ambient occlusion itself is a special case of ambient obscurance that operates in object-space and is evaluated in a preprocess [50]. In this paper, any form of indirect light (i.e. light modulated through refraction or reflection) falls under the term ambient light [7]. An object illuminated only by the ambient term of the Phong model will have a “flat” appearance, due to the constant nature of this factor.
Figure 5. (left) Rendering with no ambient occlusion. (right) Rendering with ambient occlusion. Source from [33].
As can be seen in the figure above, none of the details of the spaceship model are discernible without ambient occlusion and the object on the left has no depth complexity, whereas the same object rendered with ambient occlusion reveals much more detail to the observer.
The concept of obscurance as defined by Zhukov et al. [50] tackles this issue by taking into account the amount of ambient light that is accessible to a given point p on a surface with normal $n$. Ambient obscurance is formulated as:

$$W(p, n) = \frac{1}{\pi} \int_{\Omega} \rho(d(p, \omega))\, (n \cdot \omega)\, d\omega \quad (2)$$

Where $\Omega$ represents all directions in the unit hemisphere over point p, $d(p, \omega)$ is the distance between point p and the closest occluding object in direction $\omega$, and $\rho$ is a monotonic, smooth kernel applied to the visibility that attenuates to 0 at a maximum distance. The rationale behind this is to restrict the contribution of occluders the farther away they are from the originating point. The ambient light in this approach is modeled as a non-absorbing transparent gas with constant emittance $\tau$ per unit volume [50]. Formally, this translates to an integration over the unit hemisphere centered on the normal of point p.
The ambient occlusion term used in modern literature is the ambient obscurance reversed and redefined to measure the amount of light that is occluded by the objects surrounding point p.
The attenuation function is replaced by a binary visibility term $V(p, \omega)$, which acts as the visibility function from the rendering equation for a given direction $\omega$. The equation averages the light occluded at the point of interest. The value is cosine-weighted by the angle between the normal of point p and the occluder direction. This means that occluders in directions aligned with the normal $n$ have more blocking power, whereas occluders on the horizon of p have almost no effect, which is a more intuitive approach to how light is naturally blocked. Therefore, ambient occlusion, the cosine-weighted, normalized percentage of the occluded hemisphere, can be written as equation 3. It was also shown earlier in the report how this equation can be obtained from the rendering equation.

$$AO(p, n) = \frac{1}{\pi} \int_{\Omega} V(p, \omega)\, (n \cdot \omega)\, d\omega \quad (3)$$
Although they are usually used interchangeably, the difference between obscurance and occlusion is that in the latter, the visibility function takes discrete values of 0 and 1. Obscurance, on the other hand, includes a continuous kernel that takes the distance to the occluder as its input. Moreover, in ambient occlusion, a value of 1 means completely occluded, whereas it means the exact opposite for obscurance.
For the AO function, when p is completely occluded, i.e. when $V(p, \omega) = 1$ for all $\omega \in \Omega$, the integral reaches its highest value of $\pi$. Since for computational purposes we want to clamp the value of AO between 0 and 1, we multiply it by a normalization factor of $1/\pi$. The maximum value of $\pi$ for this integral can be justified through differential geometry. Since the relevant literature does not delve into the rationale behind this maximum value, a brief derivation is provided here.
As we are operating on the unit hemisphere $\Omega$, the term $(n \cdot \omega)$ is equivalent to $\cos\theta$ and the two are used interchangeably throughout the report. If we switch to spherical coordinates [43], the integral over the surface of the sphere requires the differential area element, expressed as $dA = r^2 \sin\theta \, d\theta \, d\varphi$ [16], where $r$ is the radius and has a value of 1 in this case; the surface area of the unit hemisphere is $2\pi$. Thus, the integral in spherical coordinates is:

$$\int_{0}^{2\pi} \int_{0}^{\pi/2} \cos\theta \, \sin\theta \, d\theta \, d\varphi$$

A simple integration of the inner integral ($\int_{0}^{\pi/2} \cos\theta \sin\theta \, d\theta = \tfrac{1}{2}$) results in the following:

$$\int_{0}^{2\pi} \frac{1}{2} \, d\varphi$$

And this integral is simply $\pi$.
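The result can also be checked numerically; the short Python script below (our own sanity check, not part of the thesis) evaluates the double integral with a midpoint rule and converges to π:

```python
import math

# Midpoint-rule quadrature of the cosine-weighted hemisphere integral
#   ∫_0^{2π} ∫_0^{π/2} cosθ sinθ dθ dφ
# which the derivation above evaluates to π (hence the 1/π factor).
n = 200
dphi = 2 * math.pi / n
dtheta = (math.pi / 2) / n
total = 0.0
for _ in range(n):          # the integrand is independent of φ
    for j in range(n):
        theta = (j + 0.5) * dtheta
        total += math.cos(theta) * math.sin(theta) * dtheta * dphi
```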
Monte-Carlo Method
Now that we have formally established ambient occlusion, we need to focus on how to evaluate the integral. In the field of rendering, complicated integrals are especially abundant, and most lack a closed-form solution. Therefore, numerical methods and approximations must be employed. One such approximation, which makes use of the law of large numbers, is Monte-Carlo integration. In this method, stochastic sampling is used to estimate the value of integrals. The random nature of this method gives it the power to estimate even integrands with discontinuities. The general form of the Monte-Carlo estimator can be expressed as [37]:

$$\langle F_N \rangle = \frac{1}{N} \sum_{i=1}^{N} \frac{f(X_i)}{p(X_i)}$$

In the equation above, based on the law of large numbers, as N (the number of samples) approaches infinity, the estimator converges to the original integral. Since we cannot afford an infinite number of samples, some noise and variance will be present. In this equation, $X_i$ is a random sample from the integration domain, and $p(X_i)$ is known as the probability density function (PDF). The PDF gives the relative probability of $X_i$ being chosen from the integration domain.
The earliest mention of ambient occlusion proposes a Monte-Carlo sampling method done through raycasting [23]. For every point on the surface, rays are cast in the unit hemisphere around the normal and intersection tests are carried out with the surrounding occluders. Finally, the number of intersections is divided by the number of rays cast, as shown in the figure below:
Figure 6. Ambient Occlusion found through raytracing.
The Monte-Carlo approach is quite simple and can solve complex integrands in an unbiased and stable manner. However, it is not without disadvantages. One main problem with this method, as opposed to deterministic ways of solving the integrand (e.g. Riemann sums), is its slow rate of convergence, $O(1/\sqrt{N})$, where N is the number of samples [37]. When used for rendering, the variance manifests itself as noise, and this relationship effectively means that in order to halve the noise, one has to multiply the number of samples by a factor of 4. Increasing the sample count quadratically to obtain a linear decrease in noise is simply not feasible in a real-time context. Therefore, the choice of PDF becomes a crucial step in variance reduction. The ambient occlusion integral in equation 3 can therefore be written as:

$$AO(p, n) \approx \frac{1}{\pi} \cdot \frac{1}{N} \sum_{i=1}^{N} \frac{V(p, \omega_i)\, (n \cdot \omega_i)}{p(\omega_i)} \quad (8)$$
Since increasing the sample count is not feasible, one has to pick a PDF that behaves similarly to the original integrand. One safe type of PDF, in the absence of better knowledge of the function, is the uniform distribution, in which all sample points have an equal chance of being chosen. Since we established that the surface area of the unit hemisphere is $2\pi$, the uniform PDF will be:

$$p(\omega) = \frac{1}{2\pi}$$

And therefore, equation 8 will take the form:

$$AO(p, n) \approx \frac{2}{N} \sum_{i=1}^{N} V(p, \omega_i)\, (n \cdot \omega_i)$$
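A minimal Python sketch of this uniform-PDF estimator (our own illustration; here visibility() returns 1 when a direction is blocked, following the convention that AO = 1 means fully occluded):

```python
import numpy as np

def sample_uniform_hemisphere(rng):
    """Uniform direction on the unit hemisphere around n = (0, 0, 1):
    cosθ is uniform in [0, 1] for uniform solid-angle sampling."""
    u1, u2 = rng.random(), rng.random()
    z = u1
    r = np.sqrt(1.0 - z * z)
    phi = 2.0 * np.pi * u2
    return np.array([r * np.cos(phi), r * np.sin(phi), z])

def ao_uniform(visibility, n_samples, rng=None):
    """AO ≈ (2/N) Σ V(ω_i) cosθ_i, i.e. f/p with p(ω) = 1/(2π)
    and the 1/π normalization folded in."""
    rng = rng or np.random.default_rng(1)
    total = 0.0
    for _ in range(n_samples):
        w = sample_uniform_hemisphere(rng)
        total += visibility(w) * w[2]  # w[2] = cosθ
    return 2.0 * total / n_samples
```

With the hemisphere fully blocked the estimate converges to 1, but individual samples vary, which is exactly the noise discussed next.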
However, this uniform PDF can produce a large amount of noise, as the variance from the expected value is bound to be high for complex rendering functions. The samples we choose should instead be distributed more densely around the important parts of the original function. By not choosing a proper PDF, it is possible to completely miss an important part of the function, for example one representing a light source. Since the variance of a Monte-Carlo integration of a constant function is 0, the closer the chosen PDF is to the actual function, the less variance and noise is produced. This principle is the rationale behind importance sampling [35].
Figure 7. The effect of the probability density function on variance. (left) The PDF is completely different from f(x) and produces a significant amount of variance. (middle) The PDF is a uniform distribution that misses the important section of the function. (right) The PDF is similar to f(x), providing significant variance reduction. Source from [21].
Therefore, by choosing the PDF to be the cosine-weighted hemisphere distribution, we give it attributes similar to the AO integrand:

$$p(\omega) = \frac{(n \cdot \omega)}{\pi} = \frac{\cos\theta}{\pi}$$

This choice results in the simple estimator:

$$AO(p, n) \approx \frac{1}{N} \sum_{i=1}^{N} V(p, \omega_i)$$
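The cancellation can be seen directly in code. The sketch below (our own illustration) draws cosine-weighted directions with Malley's method, sampling a unit disk uniformly and projecting up onto the hemisphere, so the estimator reduces to the average visibility:

```python
import numpy as np

def sample_cosine_hemisphere(rng):
    """Cosine-weighted direction via Malley's method: uniform point on
    the unit disk, projected onto the hemisphere around (0, 0, 1)."""
    u1, u2 = rng.random(), rng.random()
    r, phi = np.sqrt(u1), 2.0 * np.pi * u2
    x, y = r * np.cos(phi), r * np.sin(phi)
    return np.array([x, y, np.sqrt(max(0.0, 1.0 - x * x - y * y))])

def ao_importance(visibility, n_samples, rng=None):
    """With p(ω) = cosθ/π the cosine terms cancel and the estimator
    is simply the average visibility: AO ≈ (1/N) Σ V(ω_i)."""
    rng = rng or np.random.default_rng(2)
    hits = sum(visibility(sample_cosine_hemisphere(rng))
               for _ in range(n_samples))
    return hits / n_samples
```

For a constant visibility the estimator has zero variance, mirroring the observation above that a PDF proportional to the integrand eliminates noise.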
Monte-Carlo integration is a powerful method and is quite straightforward for a raytracer. However, both uniform sampling and importance sampling suffer from sample clumping. The problem arises when two or more sample points fall in close proximity and return the same information. Since the sample budget is restricted, clumping greatly hinders the accuracy and efficiency of the estimation. An improved approach is the Quasi Monte-Carlo method, which strives to find a middle ground between the aliasing caused by deterministic techniques (e.g. Riemann sums) and the noise caused by stochastic methods (e.g. Monte-Carlo integration). This is done through stratified sampling [8].
A further optimization to the raytraced solution presented above utilizes spatial subdivision structures and the fact that ambient occlusion is a function of local geometry [48]. In that solution, an octree is used and occluders are traversed for more efficient intersection tests.
Furthermore, one can average the rays that do not hit any occluders to derive the “bent normal” and store it in the ambient occlusion maps [23]. Bent normals will be discussed later, as they can be used to look up the environment map and are quite useful in representing accurate reflections and ambient light, thereby increasing the overall quality of a render.
Until quite recently, raytraced ambient occlusion was precomputed for static scenes and was used mostly in high-end films and effects. Being computationally expensive, the precomputed values, although accurate, could not handle dynamic scenes. Furthermore, one must note that ambient occlusion is itself an approximation, and an effect whose inaccuracies the human eye does not readily notice. Therefore, it was only a matter of time before a trade-off between dynamic-scene capability and accuracy was made.
An early attempt to leave the computationally expensive object-space can be credited to the disk-based approximation [5]. This method provides another way of calculating ambient occlusion that is suitable for dynamic scenes, without using raytracing. In this approach, meshes are turned into surfels (surface disks) and, instead of using the visibility function, the occlusion of one disk on another is calculated to approximate the ambient occlusion. Without optimizations, this method runs in the order of O(n²), which is quite expensive. The introduction of a two-pass method and a hierarchical structure of disks reduces this to O(n log n) [1]. Since visibility estimation is done on a per-vertex basis, some linear interpolation artefacts are produced. Moreover, highly tessellated objects are required for correct high-frequency shadowing, for instance, in the case of contact shadows. A per-fragment approach is suggested to tackle these problems [18]. This disk-based approach does not quite fall under the screen-space category and is more of a surface discretization technique; however, it paves the way to screen-space methods.
Screen-space Ambient Occlusion Methods
On top of being expensive, object-space methods will naturally have cost positively correlated with scene complexity. The introduction of deferred rendering and experimentations with screen-space data (e.g. unsharp masking) for enhancing contrast and depth perception opened the doors to new approaches in evaluating the AO integrand [40, 29]. It was not long after that a method using the ND-buffer (normal and depth buffer) was proposed that computed ambient occlusion as a
full-screen pass. This approach splits the scene into near and far ambient occlusions and computes high-frequency AO using the image-space for near objects, and far objects are processed through spherical occluders as the low-frequency part [41].
Crytek SSAO
Screen-Space Ambient Occlusion (SSAO) was the first AO technique that was made available for use in a fully dynamic real-time application by Crytek [14]. SSAO spawned many variations that strived to improve on its limitations. The main difference between them lies in the way the screen-space data is handled, and how the ambient occlusion term is interpreted.
The SSAO as proposed initially for CryEngine 2 makes use only of the z-buffer. The spatial
coherency of the z-buffer allows it to be used as a representation of the scene, and in order to check for occlusion, neighboring pixels are sampled and depth comparison is carried out. After generating a depth map during the render, a full-screen quad is drawn to invoke the pixel shader for each point p. Samples are constructed as uniformly random vectors that are added to the view space position in a sphere surrounding the surface point p, as shown below:
Figure 8. Crytek’s SSAO. Random samples are shown as circles and rectangles as depth buffer fetches. The occlusion value in this figure is ⅓, as 2 samples fall below the surface. Source from [31].
For every virtual sample point's depth d, we need to compare it with the z-value in the depth map. This is a basic containment test, carried out as a depth comparison to check whether the sample falls inside the geometry or outside of it. One readily noticeable problem with this implementation is that, due to the spherical sampling, almost half the sample points fall behind the geometry even for flat surfaces. This gives scenes using this variation of SSAO a distinctive grey tint caused by self-occlusion.
Normal-oriented SSAO
An improved SSAO variant, used in StarCraft II, changes the sphere into a hemisphere, takes into account the normal of the point, and flips the sample points that fall below the surface [14]:
Figure 9. SSAO used in StarCraft II with normal-oriented hemisphere. Source from [31].
The attenuation factor is also brought back, as the step function shown in figure 10. The
occlusion factor coupled with sample flipping effectively halves the samples needed and also handles self-occlusion. Furthermore, the sampling position is captured in world coordinates and projected to screen coordinates, and downsampling is done to save bandwidth. Finally, similar to the original SSAO, a geometry-aware filter is applied to remove the high-frequency noise caused by sampling.
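A minimal sketch of the sample-flipping step, assuming vectors are plain 3-tuples (names are illustrative):

```python
def orient_sample(sample, normal):
    """Flip a sample direction into the normal-oriented hemisphere,
    as in the StarCraft II variant: samples below the surface
    (negative dot product with the normal) are mirrored to the
    upper side, so no sample is wasted inside the geometry."""
    dot = sum(s * n for s, n in zip(sample, normal))
    if dot < 0.0:
        return tuple(-s for s in sample)
    return tuple(sample)
```

This is why the variant halves the effective sample budget compared to spherical sampling: every generated direction ends up on the visible side of the surface.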
Due to the low-frequency nature of ambient occlusion, this error-prone AO approximation is still widely used in the industry.
Figure 10. Occlusion function as depicted in the source from [39].
Horizon-based Ambient Occlusion
Another approach, quite different from SSAO's sampling for assessing visibility, is
Horizon-based Ambient Occlusion (HBAO) [4]. This approach makes use of horizon mapping [28]
and raymarches the depth buffer under the assumption that it is a continuous heightfield. Therefore, they rewrite the equation 2 as [4]:
Where W(θ) is a linear attenuation function, α is the elevation angle and θ is the azimuthal angle; note the elevation interval of the inner integral. In this approach, any point that falls below the horizon is counted as an occluder. This assumption is possible because the depth buffer is treated as a heightfield. However, depth discontinuities below the horizon can lead to erroneous shadowing in many instances.
Figure 11. Different components of HBAO. The inner integral marches the heightfield defined by a 2D slice, and the outer integral swipes the slice within the hemisphere. Source from [29].
In this implementation, the hemisphere is split into 2D slices, where the horizon angle h(θ) is found through raymarching the heightfield for each azimuthal angle θ, and the tangent angle t(θ) is the angular offset of the tangent surface defined by the normal n. Monte-Carlo integration is once again introduced, as equation 13 can be converted to:
And therefore, sampling the azimuthal angles uniformly at random, the estimator can be written as:
Where the PDF is 1/2π, which cancels out with the 1/2π multiplier in front of the integral, resulting in a simpler term.
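The resulting estimator reduces to a plain average over randomly chosen azimuthal slices, which can be sketched as follows (the per-slice raymarch is abstracted into a callback; names are hypothetical):

```python
import math
import random

def hbao_estimate(slice_ao, num_dirs, seed=0):
    """Monte-Carlo estimator over azimuthal slices.
    slice_ao(theta): the occlusion found by raymarching the heightfield
    slice at azimuth theta. Sampling theta uniformly on [0, 2*pi) gives
    PDF = 1/(2*pi), which cancels the integral's 1/(2*pi) factor,
    leaving a simple average over the slices."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_dirs):
        theta = rng.random() * 2.0 * math.pi
        total += slice_ao(theta)
    return total / num_dirs
```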
As noise is preferable to banding artefacts, they jitter the step size per pixel [39]. Overall, this
approach requires much more computation than SSAO. However, it can be seen from the equations that HBAO is closer to the original definition of AO, is more physically-based due to raymarching, and thus should produce higher quality output.
Volumetric Obscurance
Point sampling is a common and convenient way of sampling for screen-space solutions. However, due to the discrete nature of this approach, certain artefacts are created during movement as samples become occluded and disoccluded. As can be seen in figure 12 below, showing the Crytek SSAO sampling scheme, a differential change in geometry makes the occlusion value pop from 4/8 to 3/8.
Furthermore, the geometry on the left will have the same AO as a flat surface, which in itself is wrong as well.
Figure 12. A radical change in AO as a result of differential change in geometry. Source from [17].
A smoother result can be obtained by using line sampling. The idea is to approximate the
unoccluded volume around point p through the ratio of visible to occluded length on a given line segment. Using line sampling is one of the main ideas behind another screen-space approach that goes by the name of Volumetric Obscurance (VO) [24]. Based on the ambient obscurance in equation 1, a volumetric obscurance (VO) quantity is defined as:
Where X is the 3D volume around p and O is the occupancy function which takes the value of 0 if there are no occluders at x and 1 otherwise.
Figure 13. Volumetric Obscurance as defined in [24]. Where f/h is the ratio of the unoccluded segment to the occluded segment of a line sample that enters the hemisphere at U and exits it at V. Source from [31].
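The smoothness advantage of line sampling over point sampling can be shown with a small sketch (a single vertical line sample against a heightfield; the convention that larger z is deeper is an assumption of this toy):

```python
def line_sample_obscurance(entry_z, exit_z, surface_z):
    """Line-sampling sketch of Volumetric Obscurance.
    A vertical line enters the hemisphere at depth entry_z and exits at
    exit_z (entry_z < exit_z). The heightfield surface at that screen
    position sits at surface_z; everything deeper than the surface
    counts as occupied. Returns the occluded fraction of the line."""
    total = exit_z - entry_z
    occluded = max(0.0, exit_z - max(entry_z, surface_z))
    return occluded / total
```

Unlike the binary containment test, a differential change in surface_z produces a differential change in the returned obscurance, which is exactly the popping artefact that line sampling removes.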
Alchemy and Scalable Ambient Occlusion
Alchemy ambient occlusion (Alchemy AO) builds on top of the methods explained above while at the same time harkening back to the original derivation [31]. The main improvement is intelligently picking a falloff function that simplifies the equation.
Where t is the sample distance between point p and sample s, and u is the user-defined falloff constant. As stated in the paper, this falloff function represents a shifted hyperbola that had earlier produced desirable results in StarCraft II [14]. Substituting the falloff function into the ambient occlusion equation will yield:
The equation is rewritten by defining v, the vector from point p to the occluder in direction ω. The numerical estimator for the integral is thus derived as:
A = max(0, 1 - (2σ/s) · Σ_{i=1..s} max(0, v_i · n + z_p·β) / (v_i · v_i + ε))^k

Where ε is used to prevent division by zero, v_i is the vector from point p to the ith sample, z_p is the camera-space depth, and the bias distance β is an aesthetic parameter that can be adjusted to deal with self-shadowing and light leaks. σ and k are used to alter the intensity of the occlusion and modify the contrast, respectively.
Figure 14. Overview of Alchemy AO. The sampling scheme is similar to both HBAO and VAO [31].
The sampling is done similarly to the VAO approach [24], where a disk of a certain radius is defined around point p in screen-space, a sample point s is chosen uniformly at random on the disk, and is projected to camera space, similar to HBAO [4], to a point q on the surface. The v_i component in equation 18 is, therefore, calculated based on the vector from q to p [31]. Scalable Ambient Occlusion (SAO) is an improvement on Alchemy AO [30]. This approach uses the same estimator, yet focuses on making Alchemy AO scale better at high resolutions and on cutting latency. One
significant improvement is creating a mipmap from the depth buffer, which helps handle larger radii and increases cache efficiency. Samples close to p will fetch the high-resolution level of the depth buffer, while the samples further away will have the low-precision buffer.
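The depth-mipmap idea can be sketched on the CPU as follows. The 2x2 min-reduction and the level-selection constant are sketch choices, not necessarily SAO's exact ones:

```python
import math

def build_depth_mips(depth, levels):
    """Build a depth mip chain by 2x2 min-reduction over a 2D list
    (the reduction operator is a sketch choice)."""
    mips = [depth]
    for _ in range(levels - 1):
        prev = mips[-1]
        h, w = len(prev), len(prev[0])
        mips.append([[min(prev[2 * y][2 * x], prev[2 * y][2 * x + 1],
                          prev[2 * y + 1][2 * x], prev[2 * y + 1][2 * x + 1])
                      for x in range(w // 2)] for y in range(h // 2)])
    return mips

def mip_level_for_offset(offset_px, max_level, first_mip_offset=3):
    """Distant samples read coarser mips: the level grows with log2 of
    the screen-space offset (the offset constant is hypothetical)."""
    if offset_px < 1:
        return 0
    return max(0, min(max_level, int(math.log2(offset_px)) - first_mip_offset))
```

Samples within a few pixels of p stay at level 0 (full precision), while a 64-pixel offset would read a coarse level, keeping cache lines hot for large radii.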
Multi-scale Ambient Occlusion
Figure 15. Multi-resolution approach as used in MSSAO. Source from [17].
Almost all the screen-space solutions presented here need to apply a form of blur at the end to remove high-frequency noise produced during sampling. The blur pass itself is quite expensive and produces further deviation from the reference. Furthermore, the mentioned methods struggle with capturing both local high-frequency AO and the global larger-scale AO. A multi-scale ambient occlusion (MSSAO) approach is proposed [17] which computes AO at multiple resolutions and combines them at the end to obtain a map that holds both the effect of far occluders and also the high-frequency AO from local nearby occluders.
Figure 16. Interleaved sampling pattern for use with a 3x3 low-pass filter. Source from [17].
At all resolutions except the highest, a fairly small sampling kernel is used, with samples taken at every other pixel. For example, the 11x11 sampling kernel used in figure 16 will have 36 texel fetches per fragment instead of 121. This method of sampling is similar to interleaved sampling, however without the randomness. The multi-resolution solution greatly reduces noise and the need for excessive blurring.
Denoising and Blur
This section will outline certain design choices common between the methods and how they affect the final AO output. Further individual breakdown of certain methods will be carried out as well.
Almost all the methods presented above require or encourage using a separate blur pass. Depending on whether sampling is done stochastically or non-randomly, the final image will either have noise or banding artefacts. In the industry, banding and aliasing are commonly traded for noise and
techniques such as dithering are widely used, as the result is easier on the human eye.
Denoising and filtering are equally crucial to both the screen-space ambient occlusion methods and the raytraced method. Due to the way sampling is done for the numerical estimators, variance is an ever-present quantity that manifests itself as noise in the render. The presented methods produce varying amounts of high-frequency noise based on their sampling schemes and require some form of denoising.
Gaussian Blur
The Gaussian blur is one of the most widely used effects in rendering [1] and is the cornerstone of many filtering techniques that will be presented in this paper. The blur in its simplest form is quite expensive, however, with certain adjustments, it can be made computationally feasible. The
optimizations themselves can spawn new blurring techniques used in the methods presented in this paper, such as bilateral blur and separable blur.
The Gaussian blur uses the Gaussian kernel, an N×N-tap convolutional filter that weights the pixels falling inside its square support according to the 2D Gaussian curve, formalized as:

G(x, y) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))
Where σ is the standard deviation of the Gaussian distribution. The kernel falls under rotation-invariant filter kernels, and only uses the distance from the central pixel in its computation [1]. σ can be thought of as the window size of the curve in figure 17. A larger standard deviation provides more blur; however, it also increases memory access, which is undesirable under the tight budget of real-time applications.
The Gaussian blur in the form presented in equation 21, however, is not directly usable: a 2560x1440 image blurred using a fragment shader that deploys a 33x33-tap Gaussian filter would require roughly 4 billion texture fetches. This problem is remedied by noting that equation 21 can be reconstructed by multiplying two 1-D Gaussian kernels:
The Gaussian filter, therefore, is said to be separable. This effectively means that instead of weighting the contribution of each pixel and adding them in a single pass, we can separate the operation into two passes. For example, for the support in figure 18, a horizontal pass will take the two texels to its right and left and weight their contribution. A second vertical pass will repeat the same process for the two texels above and below the central pixel, yielding the same result as the expensive single-pass blur. The number of texture fetches per pixel falls from n² to 2n, where n is the support. For the previous example, applying two 33-tap Gaussian filters to the image will now require roughly 243 million fetches, which is a massive optimization.
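The separability claim can be verified directly: the 2-D kernel is the outer product of two 1-D kernels. A small Python sketch (names illustrative):

```python
import math

def gauss1d(radius, sigma):
    """Normalized 1-D Gaussian kernel of support 2*radius + 1."""
    w = [math.exp(-(x * x) / (2.0 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]

def gauss2d_from_1d(radius, sigma):
    """The 2-D kernel as the outer product of two 1-D kernels; this
    identity is exactly why the blur separates into two passes."""
    k = gauss1d(radius, sigma)
    return [[a * b for b in k] for a in k]
```

Because e^(−(x² + y²)/(2σ²)) = e^(−x²/(2σ²)) · e^(−y²/(2σ²)), the outer product reproduces the normalized 2-D kernel term by term.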
Figure 17. The Gaussian function. Note how the weight of neighboring pixels is reduced over the radially symmetric curve. Source from [12].
Separable Blur
Figure 18. A 5x5-tap Gaussian filter. (a) An expensive single pass Gaussian blur with 25 texel fetches can be separated into two one-dimensional horizontal and vertical passes. (b) Horizontal blur filter is applied which is fed to a vertical blur filter (c). The filter is separable as multiplying the weights of (b) and (c) results in the
original two-dimensional filter. Source from [1].
The separable blur that has been derived is fast and efficient; however, it lacks geometry-awareness, which is a highly desirable quality in filtering ambient occlusion. As shown in figure 18, the filter in its current state does not differentiate geometries and blurs the entire scene; this causes shadow bleeding in the ambient occlusion, and edges are not preserved, which is a source of error. To combat this, an extension of the Gaussian blur called the bilateral filter tries to reduce or eliminate the contribution of texels that are deemed unrelated to the central pixel [36].
Bilateral Blur and variations
A bilateral filter in its simplest form takes into account the difference between the intensities, and builds upon the Gaussian filter in the following way:
Where q is the neighboring texel, p is the central pixel, and W_p is the normalization factor; the filter now has two weighting components. The first is the standard kernel that averages intensity in the spatial domain, and the second takes into account the intensity difference. Therefore, pixels with a large intensity difference will have no spatial interaction with one another.
Although the bilateral filter is quite effective at edge preservation, the complexity of a brute-force implementation is O(N²), where N is the pixel count. One optimization is to restrict the range to a window of radius r, which changes the complexity to O(N·r²). Further optimization can be done by employing the separation technique mentioned earlier for the separable blur. Bilateral filters are not truly separable; however, the artefacts caused by the separation are deemed a reasonable trade-off for the increase in speed. The complexity of a separable bilateral blur is therefore O(N·r).
Figure 19. (left) Original input (middle) brute force Bilateral filter (right) separable kernel. Notice the streaks
that appear as an artefact at the bottom of the right image. Source from [36].
The bilateral filter is robust and is not bound to intensity checks only. Cross or joint bilateral filters take into account other data such as depth, velocity, and normals in deciding where to apply the blur. In ambient occlusion the filter used has to be geometry-aware, and preserving the edges in a blur is integral to the quality of the result. A relatively safe assumption in AO is that surfaces that belong to different geometries will have different depths. Therefore, the intensity term can be changed to depth terms:
Where z_p and z_q are the respective depths of the central pixel p and its neighbor q. It can be seen from equation 24 how, as the depth difference increases, the intensity contribution of neighbor q decreases. The Gaussian filter and its variants are quite effective in denoising; however, their myopic spatial implementation leaves a lot of room for improvement for dynamic scenes. Further optimization can be done by taking the temporal coherency of scenes into consideration.
Spatiotemporal approach
Pixel shaders are increasingly using more of the computational power of systems for real-time applications. This computational strain can be significantly alleviated by noting that in a standard scene, surface regions, camera movement, and lighting are temporally stable and usually do not undergo rapid changes from one frame to the next. Reprojection methods take advantage of said temporal coherency and use samples from previous frames for various applications and fall under two categories of reverse and forward [34].
Figure 20. Images with their coherence. As can be seen, both camera movement (left) and animation (middle and right) demonstrate significant temporal coherency, with green denoting the portions visible from previous frames, and red the newly visible points. Source from [34].
For a triangle rendered in both the current frame t and the previous frame t−1, the vertex position is calculated, and if the two positions are close enough, the shaded value from the history buffer can be used instead of doing new shading computation [34]. This approach was proposed independently by two different papers; the buffer is referred to as the history buffer or real-time reprojection cache, and the practice of saving shaded values of previous frames for reuse is known as reverse reprojection caching [27].
It is common in post-processing effects to make use of two off-screen buffers that hold intermediary and final results in a feedback loop. Due to the back-and-forth nature of how these buffers accumulate the AO, reverse reprojection caching can be done by sampling AO values from the previous frame.
Temporal filtering can also be done by filtering and blending the AO terms of previous frames with the current frame, which also helps in eliminating noise pops caused by movement.
There are certain issues that one must take into account when using temporal refinement and filtering. Depending on which variety of temporal methods is used, the method might incur overhead on the system. For example, temporal filtering through blending will not be a logical approach if the scene is mostly static. Moreover, the feedback can produce new artefacts, most notably temporal lag. Another decision that is important in the design of such filters is how to handle invalid pixels. An invalid AO value can happen if a disocclusion has taken place (newly visible surfaces) or the sample neighborhood of the pixel has changed [27]. There are many approaches to handling these issues that will be out of the scope of this paper.
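The blending-plus-invalidation scheme described above can be sketched in a few lines (the blend factor and the validity flag are hypothetical design parameters, not from any specific cited paper):

```python
def temporal_blend(current_ao, history_ao, valid, alpha=0.1):
    """Exponential blend of the current AO with the reprojected history.
    Invalid pixels (disocclusion, or a changed sample neighborhood)
    fall back to the current frame only, at the cost of locally
    re-introducing noise; valid pixels converge over several frames,
    suppressing noise pops caused by movement."""
    if not valid:
        return current_ao
    return alpha * current_ao + (1.0 - alpha) * history_ao
```

A small alpha gives strong noise suppression but more temporal lag; a large alpha reacts faster to change but filters less, which is exactly the trade-off discussed above.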
As has been demonstrated so far, the evolution of SSAO methods highlights certain qualities and trade-offs. All SSAOs trade accuracy for speed and are based on many assumptions that help eliminate constraints, yet bring a myriad of erroneous results with them. Different sampling patterns that trade banding for noise are proposed, each striving to increase sampling quality and efficiency. Multi-scale approaches focus on faster and more efficient texture fetches, and different estimator setups focus on better convergence. Monte-Carlo integrators play a significant role in both object-space and screen-space approaches, which consequently makes denoising a critical step in making the samples usable for human eyes in a real-time context. The following section will have a closer look at some of these design decisions and how they affect the final result of different solutions.
Implementations
This section will cover the inner mechanisms of the above-mentioned methods. Also included is a variation on the normal-oriented hemispherical SSAO, implemented predominantly to examine how it scales next to other standard industry-level implementations. The latest version of Unity was chosen as the platform to run the different implementations, as it is widely used in both the indie and AAA scenes and thus reflects the current demands of the industry. Furthermore, Unity can be extended with custom image effects and can be equipped with a post-processing stack that provides two state-of-the-art screen-space ambient occlusion solutions: Multi-scale Volumetric Occlusion (MSVO) and Unity's Scalable Ambient Obscurance (SAO) [46]. At the time of writing, Unity has also provided an experimental build that can handle DXR and plans to integrate DXR into its high-definition render pipeline (HDRP) in the future. The built-in profiler was also used for more detailed measurements.
The vertex shader passes a full-screen quad in all implementations and only snippets deemed important from the fragment shaders of MSVO, SAO, HBAO, RTAO, and our variation on SSAO will be presented here; their common traits will be mentioned and further comparison will be done in the results section.
SSAO Variant
Our method is based on StarCraft II's normal-oriented hemispherical ambient occlusion [14]; the main difference lies in the sampling pattern and the inclusion of a normal-based falloff function, which brings the implementation closer to the original AO expression. It can also use a geometry-aware bilateral blur, similar to the one used by Unity; however, it was found that the samples converge nicely on their own. The main purpose of this variant is to compare it with industry-standard solutions and also to introduce techniques common among screen-space solutions.
Firstly, we produce samples on the C# side within a unit hemisphere and scale it with a simple interpolation function to make the samples fall closer to the origin instead of them being completely randomly distributed. Then, we push these samples to the HLSL (high-level shader language) side using an array. In order to prevent banding, the samples are further reflected along a random vector on a per-fragment basis. The random vector is extracted from a normal map that is scaled and tiled as below:
normalize(tex2D(_NoiseTex, (_ScreenSize) * uv / _NoiseTex_TexelSize ).xy * 2.0f - 1.0f);
Then for maximum randomness, we produce two sets of random directions that will be used as offsets. One sample set is reflected against the normal map texture, while the other will be rotated at a random degree along the z direction. The sampling radius is scaled to account for projection. Then the sample vectors are added as offsets to the fragment position, and the depth test is carried out.
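The CPU-side kernel generation described above can be sketched as follows (the scale range and the quadratic interpolation are illustrative constants, not the exact values of our C# implementation):

```python
import random

def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t

def make_kernel(n, seed=0):
    """Generate n sample vectors in the z-positive unit hemisphere and
    scale them so that samples cluster near the origin, giving nearby
    occluders more coverage than distant ones."""
    rng = random.Random(seed)
    kernel = []
    for i in range(n):
        # Random direction in the upper hemisphere, then normalize
        v = (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0, 1))
        length = (v[0] ** 2 + v[1] ** 2 + v[2] ** 2) ** 0.5
        v = tuple(c / length for c in v)
        # Accelerate samples toward the origin with an interpolated scale
        scale = lerp(0.1, 1.0, (i / n) ** 2)
        kernel.append(tuple(c * scale for c in v))
    return kernel
```

The quadratic term in the scale pushes most of the samples close to the fragment, which matches the falloff-weighted occlusion test carried out in the shader.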
The occlusion values are accumulated and divided by the sample count; however, two weighting factors based on distance and normal are added:
float occ = max(0.0, dot(normal, normalize(sample_Dir)) - _Bias) * (1.0 / (1.0+length(sample_Dir)) * _Intensity);
The dot product dot(normal, normalize(sample_Dir)) between the normal and the sample direction allows us to give the most weight to occluders that are right in front of the fragment, while 1.0 / (1.0 + length(sample_Dir)) is the falloff function that reduces the contribution of distant occluders. The implementation has 3 aesthetic parameters: _Intensity, _Bias, and _Radius. The _Intensity parameter adjusts the strength of the AO, while _Bias controls the size of the cone defined by the normal of the fragment and the sample direction. This implementation is fast and satisfactory; however, it suffers from some familiar SSAO errors that will be discussed in the results section.
HBAO
Horizon-based ambient occlusion (HBAO) takes a very different and much more physically-based approach than the previous implementation. The HBAO used for testing is [51], which follows the Nvidia implementation [4]. The implementation has many auxiliary features, like color bleeding, which will not be used for the purposes of this paper. The sampling parameters are divided into directions and steps: directions are randomly chosen azimuth angles that define a slice, and steps are the raymarching steps taken within a slice. Similar to the double integral in the solution, the implementation has two for-loops in its fragment shader, as shown in the pseudo-code below:
for (int d = 0; d < DIRECTIONS; ++d) {
    float angle = theta * float(d);
    // Randomly rotate the direction and compute its normalized 2D direction
    // Jitter the starting sample within the first step
    // Calculate the tangent angle of the ray
    for (int s = 0; s < STEPS; ++s) {
        // Calculate the texture increment for raymarching, snapping to the
        // center of the texel to fetch the sample's view-space position
        // Advance one step
        // Find the horizon vector and its length for calculating occlusion
        float3 horizon_Vector = sample_ViewPos - p_ViewPos;
        float horizon_LengthSqr = dot(horizon_Vector, horizon_Vector);
        float occlusion = dot(normal, horizon_Vector) * rsqrt(horizon_LengthSqr);
        // Check if the horizon angle is the maximum so far and scale by the
        // attenuation factor, then add the contribution
        // Accumulate AO
    }
}
The algorithm is much more complex than the previous SSAO, and there are many variations that try to optimize the angle calculations. As shown in the simplified pseudo-code above, one noticeable assumption is that there are no discontinuities in the heightmap, even though the region below the maximum horizon angle may in fact be unoccluded.
Unity’s SAO (AlchemyAO)
Scalable ambient occlusion (SAO), as specified by McGuire [30], is an extension of the Alchemy ambient occlusion. The core fragment shader, however, is the same as AlchemyAO's. The pseudo-code below provides an overall look at the algorithm:
half4 frag_ao(v2f i) : SV_Target {
    // Fetch the view-space normal and depth of the fragment
    // Offset the depth value to avoid precision errors
    // Reconstruct the view-space position "vpos_o" from depth
    for (int s = 0; s < _SampleCount; s++)
    {
        // Sample a point v_s1 in a hemisphere, flip it according to the
        // normal, and add vpos_o
        float3 v_s1 = PickSamplePoint(uv, s);
        v_s1 = faceforward(v_s1, -norm_o, v_s1);
        float3 vpos_s1 = vpos_o + v_s1;
        // Reproject the sample point
        // Find the depth of the reprojected sample point