Out-of-Core Multi-Resolution Volume Rendering of Large Data Sets

LiU-ITN-TEK-A--11/038--SE

Out-of-Core Multi-Resolution Volume Rendering of Large Data Sets

Master's thesis in media technology carried out at the Institute of Technology, Linköping University

Fredrik Lundell

Examiner: Karljohan Lundin Palmerius

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden
Institutionen för teknik och naturvetenskap, Linköpings universitet, 601 74 Norrköping

Norrköping, 2011-06-10

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Fredrik Lundell

Abstract

A modality device can today capture high-resolution volumetric data sets, and as data resolutions increase, so do the challenges of processing volumetric data through a visualization pipeline. Standard volume rendering pipelines often use a graphics processing unit (GPU) to accelerate rendering by taking advantage of the parallel architecture of such devices. Unfortunately, graphics cards have limited amounts of video memory (VRAM), causing a bottleneck in a standard pipeline. Multi-resolution techniques can be used to modify the rendering pipeline so that each sub-domain within the volume can be represented at a different resolution. The active resolution distribution is temporarily stored in VRAM for rendering, while the inactive parts are stored on secondary memory layers such as system RAM or disk. The active resolution set can be optimized to produce high-quality renderings while minimizing the amount of storage required. This is done using a dynamic compression scheme which optimizes the visual quality by evaluating user-input data. The optimized resolution of each sub-domain is then streamed on demand to VRAM from the secondary memory layers. Rendering a multi-resolution data set requires extra care at the boundaries between sub-domains. To avoid artifacts, an intrablock interpolation (II) sampling scheme capable of creating smooth transitions between sub-domains at arbitrary resolutions can be used. The result is a highly optimized rendering pipeline, complemented by a preprocessing pipeline, together capable of rendering large volumetric data sets in real time.

Acknowledgements

I especially want to thank my supervisors Daniel Jönsson and Erik Sundén for their support during this project. Special thanks to my examiner Karljohan E. Lundin Palmerius for valuable feedback and to the Voreen community for answering my questions. I would also like to thank my family, friends and girlfriend for supporting me through all these years.

Contents

1 Introduction
  1.1 Motivation
  1.2 Aim

2 Concepts of Direct Volume Rendering
  2.1 Volumetric Data
    2.1.1 Volumetric Data Acquisition
    2.1.2 Volumetric Data Representation
  2.2 Transfer Functions
  2.3 Direct Volume Rendering
    2.3.1 Volume Rendering Integral
    2.3.2 Volume Ray Casting
  2.4 Large Data Sets
    2.4.1 Static Data Reduction
    2.4.2 Dynamic Data Reduction
    2.4.3 Distortion Metrics
  2.5 Graphics Processing Unit
  2.6 GPU Based Ray Casting
  2.7 OpenCL

3 Out-of-Core Streaming and Rendering of Multi-Resolution Volumetric Data
  3.1 Data Preprocessing and Analysis
    3.1.1 Multi-Resolution Blocking
    3.1.2 Approximating Density Distribution Histograms
    3.1.3 Error Estimation
  3.2 Level-of-Detail Management
    3.2.1 View-Dependent Scheme
    3.2.2 Transfer Function Based Scheme
  3.3 Out-of-Core Data Management
    3.3.1 Multi-Threaded Data Stream
    3.3.2 Data Stream Optimization
  3.4 Mixed-Resolution Texture Packing
    3.4.1 Dynamic Updates
  3.5 Pipeline Overview
  3.6 Multi-Resolution Volume Rendering
    3.6.1 Multi-Resolution Raycasting

4 Implementation Details
  4.1 Voreen Framework Integration
    4.1.1 Preprocessing
    4.1.2 Rendering
  4.2 QT Threading
  4.3 OpenCL Ray-Casting

5 Result
  5.1 Test Data
  5.2 Test System
  5.3 Preprocessing
  5.4 Out-of-Core Block Reading Performance
  5.5 Intrablock Volume Sampling
  5.6 Adaptive Sampling
  5.7 TF Based Data Reduction
  5.8 View Based Data Compression
  5.9 Rendering Performance

6 Discussion
  6.1 Conclusion
  6.2 Future Work

References

Abbreviations

LOD          Level Of Detail
FPS          Frames Per Second
GPU          Graphic Processing Unit
GPGPU        General-Purpose computation on Graphics Processing Units
Streaming    Transfer of data between different layers of memory
TF           Transfer Function
DVR          Direct Volume Rendering
VRAM         Video Random-Access Memory
NN Sampling  Nearest Block Sampling
II Sampling  Intrablock Interpolation Sampling

Chapter 1
Introduction

The following chapter gives a brief introduction to some of the challenges currently facing the field of volumetric visualization and states the main motivation and aim of this thesis.

1.1 Motivation

Scientific visualization aims to reveal correlations within spatial and temporal structures of data. Volume rendering is a specific branch of visualization used to obtain images from three-dimensional data sets. Volumetric data holds information about the internal structure of an object, and special visualization techniques are needed to extract different abstractions within the data. Volume visualization has a wide range of applications, in particular within the field of medical visualization. The data sets obtained from acquisition devices have in recent years rapidly increased in size, and hardware limitations in terms of memory capacity and transfer rates force the use of data reduction schemes and out-of-core storage to overcome these limitations.

Direct volume rendering (DVR) is a volumetric visualization technique used to extract images directly from a volumetric data set. The original DVR pipeline gives the user the ability to interact with the data in order to reveal interior structures within the volume. To provide a user with a full understanding of the underlying data, it is important that the system responds properly to user interaction. Real-time performance is hard to achieve when rendering high-quality images. The DVR pipeline often uses hardware acceleration units such as a graphics processing unit (GPU) to accelerate rendering. Data must then be stored in the graphics card's VRAM, which is a very limited resource.

1.2 Aim

The aim of this thesis is to extend the basic DVR pipeline to render large volumetric data sets in real time. The proposed pipeline shall benefit from out-of-core storage and a multi-resolution volume representation where sub-parts of arbitrary resolution can be loaded in-core on request during runtime. The system will exploit parallel data processing on the GPU as well as multi-threading on the CPU to maximize performance. The system shall respond properly to user interaction and use such information to minimize the amount of data processed through the pipeline while maximizing the rendering quality. The system shall implement multi-resolution rendering techniques used to increase the rendering quality for data represented at arbitrary resolution, and use acceleration techniques such as adaptive sampling to increase rendering performance.

Chapter 2
Concepts of Direct Volume Rendering

This chapter covers the basic concepts of DVR. DVR is the process of extracting two-dimensional images directly from a three-dimensional scalar field of data. In contrast, an indirect approach extracts a polygonal mesh of an iso-surface from the volume. The DVR pipeline consists of several important steps, all of which are covered in this chapter. The first part explains the essence of volumetric data and how it can be represented in computer memory. The next part discusses the vital aspects of volumetric classification and how it can be used to reveal embedded structures within the volume. Furthermore, volume rendering using ray casting is explained, as well as its derivation from the light transport equation. This chapter also emphasizes the limitations that come from using a basic DVR pipeline with large data sets and how the pipeline can be improved using data reduction methods. Finally, some details are given about parallel programming, OpenCL and how to render volumes using the GPU.

2.1 Volumetric Data

A volumetric data set [1] is a discrete representation of a continuous function defined in three-dimensional space. Mathematically it is formulated as a scalar field defined by a function which maps each point in space to a scalar value:

f : \mathbb{R}^3 \rightarrow \mathbb{R}    (2.1)

Volumetric data can be generated from either computer-aided simulations or captured from real-world measurements. The latter approach is often done for a scientific purpose such as medical visualization.

2.1.1 Volumetric Data Acquisition

A modality device can be used to acquire three-dimensional data from a real-world object. In medical visualization and radiology it is common to use methods like computer tomography (CT) and magnetic resonance imaging (MRI) to capture the spatial interior of a body.

Computer Tomography (CT)

A CT or CAT scanner obtains cross-section images by exposing the body to X-ray radiation. Several images are often stacked together to provide a three-dimensional data set. Different substances inside the body absorb different amounts of radiation, which can be used to classify different tissues. CT scans are especially efficient for classifying bone and other hard substances. High-resolution images can only be obtained using high amounts of radiation, which can be dangerous to a living subject.

Magnetic Resonance Imaging (MRI)

An MRI scan collects a density distribution of the body by exposing it to a large magnetic field. The magnetic field sets the interior hydrogen atoms in motion. The motion of the atoms appears as slight variations inside the magnetic field, with different variations for different tissues. MRI scans are extremely good at classifying soft tissues and can be considered safe as long as the subject does not have any metal objects inside the body.

2.1.2 Volumetric Data Representation

Volumetric data is often arranged on a uniform grid holding entities such as scalar attributes obtained from data acquisition. A scalar attribute in three dimensions is referred to as a voxel, the smallest finite element holding data. Figure 2.1 illustrates a uniform grid of voxels representing a volume. The voxel representation is regular in both topology and geometry and can be represented implicitly by the voxel dimensions, origin and spacing. It is common to use linear storage and fold the three-dimensional domain into a one-dimensional array. A specific voxel can then be accessed by the use of a memory offset

m_{offset} = x + y \cdot N_x + z \cdot N_x \cdot N_y    (2.2)

where x, y and z are the three-dimensional position of a voxel inside a volume of dimensions N_x, N_y and N_z.

Figure 2.1: Discrete voxel volume domain.

The distance between two adjacent voxels is the sampling step used for discretizing the continuous signal. Signal theory states that if the Nyquist theorem¹ is satisfied, a discrete signal can be ideally reconstructed by convolving it with a sinc kernel. However, in practice the sinc kernel is usually replaced with a box or tent function due to the high calculation cost. Performing convolution with these two filters is referred to as nearest-neighbor and trilinear interpolation respectively, which can be used to obtain scalar data at an arbitrary location within the volume.

¹ The Nyquist sampling rate states the minimum sampling step that can be used to avoid aliasing. This rate is equal to twice the highest frequency contained within the signal.

2.2 Transfer Functions

A transfer function (TF) [2] is used to map scalar values to optical properties in order to reveal certain parts of a volume. The TF provides an additional layer of interactivity by allowing the user to specify what should and what should not be visible. A TF mapping is defined as

c_{rgba} = \tau(v), \quad \tau : \mathbb{R} \rightarrow \mathbb{R}^4    (2.3)

where v is a voxel with a specific iso-value and c_{rgba} are the optical properties in terms of color and opacity. Isolating specific parts of a volume can be a challenging task, since different embedded structures can be represented by similar iso-values, making them indistinguishable from each other. Levoy [3] adds a dimension of gradient data to the TF representation, which enables more robust classification of volumetric surfaces. Lundström et al [4] use automatic tissue detection to extend and simplify the design of a TF for a specific classification task.

Figure 2.2: Three rendered images from the same data set using three different TFs. Left: skin; center: transparent skin and hair and opaque bone structure; right: bone structure.

2.3 Direct Volume Rendering

Volume rendering [2] is the process of rendering a two-dimensional image from a three-dimensional data set. DVR refers to evaluating an optical model of how light interacts with the volume rather than extracting polygonal meshes from iso-surfaces within the volume.

2.3.1 Volume Rendering Integral

The physical properties of how light interacts with matter can be modeled using the light transport equation presented by Krueger in [5]. A full evaluation of the transport equation can be used to produce highly realistic lighting effects including light emission, absorption, reflections, scattering and refractions. The whole transport equation is too complex to process in real time, and simplifications are made to only consider light emission and

absorption. This equation is commonly referred to as the volume rendering integral, which is a simplified optical model suitable for real-time processing. The volume rendering integral is given in equation 2.4 and calculates the amount of light reaching a specific point as it is attenuated through the volume.

I(a) = \int_a^b E(t)\, e^{-\int_a^t \tau(u)\, du}\, dt    (2.4)

The integral is evaluated along a ray of light traveling from point a to b. I(a) is the amount of light arriving at point a and E(t) is the self-emission at point t. The exponential function is the absorption factor based on the volume density \tau(u).

2.3.2 Volume Ray Casting

Figure 2.3: Ray casting the volumetric domain and extracting an embedded torus.

Ray casting is an image-order direct volume rendering method used to approximate the volume rendering integral. The method proceeds by evaluating how the volume contributes to the pixels of the final rendered image. Rays are traced from the screen back to the light source. As the ray

progresses through the volume, samples are taken and mapped to optical properties using the current TF. The ray casting method is illustrated in figure 2.3, and unlike ray tracing no secondary interactions are allowed. The discrete version of the volume rendering integral is formulated as

I(a) = \sum_{k=1}^{n} c_k \alpha_k \prod_{i=0}^{k-1} (1 - \alpha_i)    (2.5)

where c_k is the color and \alpha_k, \alpha_i are the opacities of the sample points at positions k and i respectively. The discrete samples collected along each ray can, with the use of equation 2.5, be composited into a final pixel color. In terms of optimization it is far more efficient to trace rays backwards from the screen, since not all light will contribute to the final image. The discrete version of the volume integral can also be represented in a recursive form, back-to-front, using

c_{tot} = c_{tot} + (1 - \alpha_{tot}) \cdot \alpha_i \cdot c_i    (2.6)
\alpha_{tot} = \alpha_{tot} + (1 - \alpha_{tot}) \cdot \alpha_i    (2.7)

where i is the sample index going from i = n, n-1, ..., 1.

2.4 Large Data Sets

The technical improvement of modalities has during recent years resulted in a medical data explosion. The rapid increase in data sizes is due to the fact that modern modalities can reproduce data at very high spatial and temporal resolutions. Temporally varying data is the future and allows surgeons to examine the functionality of the body in real time when preparing for a surgical procedure. Computer performance has unfortunately failed to keep up with the rapid increase in data sizes and suffers from limitations in memory capacity and bandwidth between memory units. Furthermore, modifications must be made to the classical DVR pipeline to handle data sets which exceed both the available in-core system memory and the texture memory of the graphics card. Most of the research done in this area uses compression techniques to reduce the size of data before storage. Dynamic strategies can also be used for adapting the quality of data to the importance of a specific region of the volume. Lundström [6] divides the available methods into static and dynamic data reduction, which are performed at different stages of the pipeline, as illustrated in figure 2.4.
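To make the recursive compositing in equations 2.6 and 2.7 concrete, the following minimal sketch accumulates the TF-mapped samples collected along one ray. It is an illustration only, not code from the thesis pipeline; the sample struct and names are invented for the example, and samples are assumed to be stored in the traversal order described above.

```cpp
#include <vector>

struct Sample { float r, g, b, a; };  // TF-mapped color and opacity of one ray sample

// Composites the samples of a single ray according to equations 2.6 and 2.7.
// Samples are processed in the order they are stored in the vector.
Sample compositeRay(const std::vector<Sample>& samples)
{
    Sample total{0.0f, 0.0f, 0.0f, 0.0f};
    for (const Sample& s : samples) {
        const float weight = (1.0f - total.a) * s.a;  // (1 - alpha_tot) * alpha_i
        total.r += weight * s.r;                      // c_tot += (1 - alpha_tot) * alpha_i * c_i
        total.g += weight * s.g;
        total.b += weight * s.b;
        total.a += weight;                            // alpha_tot += (1 - alpha_tot) * alpha_i
        if (total.a > 0.99f)                          // optional early termination; only valid
            break;                                    // when samples are ordered front to back
    }
    return total;
}
```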

Figure 2.4: Image illustrating where in the visualization pipeline static and dynamic data reduction are performed.

2.4.1 Static Data Reduction

A static data reduction scheme uses compression techniques in order to minimize the data storage and the amount of data passed through the pipeline. Mensmann et al [7] divide the volume into blocks and use the lossless compression scheme Lempel-Ziv-Oberhumer (LZO) for data reduction. A compression scheme should not only be selected for a large compression ratio but also for its decompression speed. Compression can be done as preprocessing, unlike decompression which must be done during runtime. To handle the bottleneck of slow bandwidth during memory transfers it is best to delay decompression until the data has reached the VRAM on the graphics card. Ljung et al [8] use a wavelet transform together with entropy coding to achieve lossless block-wise compression. However, a significantly larger data reduction ratio can be achieved by using lossy compression schemes such as quantization techniques, unfortunately at the cost of distortion.

2.4.2 Dynamic Data Reduction

A dynamic data reduction scheme takes advantage of user interactions to classify regions of low importance. Data reduction is performed by allowing invisible or uniform data of low importance to be represented at lower resolution. This results in a highly dynamic data reduction scheme which can continuously adapt the resolution of data within a subset of the volume. Crassin et al [9] use a view- and occlusion-dependent approach where data can be streamed directly based on information extracted during rendering. With respect to the currently available graphics card memory, they can visualize data sets of billions of voxels in real time. Ljung et al [8] use

a TF based approach for determining the importance of individual blocks within a volume. Blocks of varying resolution are then densely packed into an efficient representation stored in the memory of the graphics card. Most methods for visualizing large amounts of data require subdivision schemes and a multi-resolution representation in order to update specific regions within the data. Such methods will be further discussed in the next chapter.

2.4.3 Distortion Metrics

It is important to evaluate the amount of visual error introduced when performing lossy data compression using static or dynamic data reduction. For medical use it is very important to be able to separate rendering artifacts from possible tumours, and the main reason why physicians have not fully embraced the technology of volume rendering is that they do not trust the accuracy of the rendered image. Most objective distortion metrics are based on the root-mean-square error (RMSE), such as the signal-to-noise ratio and the peak signal-to-noise ratio. However, such methods are insufficient for measuring visual quality as it is perceived by humans. CIE L*a*b* (CIELAB) [10] is a color space designed to mimic the human visual system in order to compare colors independently of any device. The CIELAB color space is defined on a uniform color scale, which means that differences between plotted points correspond to actual visual differences between the colors. The space is organized as a three-dimensional cube spanned by three axes:

• L* is the lightness of the color; L* = 0 for black and L* = 100 for diffuse white.
• a* has no specific numerical limit; +a* represents different levels of red and -a* different levels of green.
• b* has no numerical limit; +b* represents different levels of yellow and -b* different levels of blue.

\Delta E is a single value which represents the total color difference. \Delta E is calculated using equation 2.8.

\Delta E = \sqrt{(\Delta L^*)^2 + (\Delta a^*)^2 + (\Delta b^*)^2}    (2.8)

The total visual error is calculated in the CIELAB color space as a normalized sum of the color differences \Delta E over all pixels between an image rendered from the original data and one rendered from the compressed data.
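As a sketch of how the distortion measure above can be evaluated, assume both renderings have already been converted to CIELAB (the RGB-to-Lab conversion is omitted); the per-pixel difference of equation 2.8 is then accumulated into a normalized sum. All names are illustrative assumptions, not the thesis implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct LabColor { float L, a, b; };  // CIE L*a*b* components of one pixel

// Equation 2.8: Euclidean distance in CIELAB space.
float deltaE(const LabColor& x, const LabColor& y)
{
    const float dL = x.L - y.L;
    const float da = x.a - y.a;
    const float db = x.b - y.b;
    return std::sqrt(dL * dL + da * da + db * db);
}

// Total visual error: sum of per-pixel Delta E between the image rendered from the
// original data and the one rendered from the reduced data, normalized by pixel count.
float imageError(const std::vector<LabColor>& original, const std::vector<LabColor>& reduced)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < original.size() && i < reduced.size(); ++i)
        sum += deltaE(original[i], reduced[i]);
    return original.empty() ? 0.0f : sum / static_cast<float>(original.size());
}
```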

2.5 Graphics Processing Unit

The game industry is mainly responsible for pushing the development of graphics cards capable of processing large amounts of data in real time. The hardware is optimized for parallel computations which are invoked through the graphics pipeline. The pipeline was initially non-programmable and mapped to the graphics hardware chip to run geometry, rasterization and fragment operations. In the early 21st century the GPU (graphics processing unit) was introduced with a programmable graphics pipeline, allowing more customized rendering effects. GPU instructions were organized into small programs called shaders, executed to take advantage of the SIMD (Single Instruction, Multiple Data) architecture of the GPU. With the introduction of GPGPU (general-purpose computation on graphics processing units), the GPU was unhooked from strictly graphics-related operations to perform high-performance scientific calculations in parallel.

2.6 GPU Based Ray Casting

Volume ray casting can be optimized to take advantage of the current SIMD architecture. Several implementations use shader programs to perform per-pixel ray casting on the GPU. A ray, defined in volume coordinates, is launched from each pixel on the screen. The volume is entirely stored inside the VRAM of the graphics card and can be accessed at an arbitrary sampling position along the ray using hardware-supported interpolation. The starting point and direction of the ray can be calculated using rasterization techniques as in [1].

Figure 2.5: Bounding box culled and rendered in two passes, back and front, used for obtaining the parametric equation of a ray for each pixel. Left: ray start position; middle: ray end position; right: ray direction.

A box is defined as the domain of the volume. The bounding box is assigned texture coordinates and rendered in two passes using front and back face culling. The result is stored as two textures, shown in figure 2.5. The two textures can be used to obtain the ray's start position and direction for a specific pixel.

Figure 2.6: The figure shows a bounding box penetrated by rays, where f is the entry point, b the exit point and d the direction.

2.7 OpenCL

OpenCL (Open Computing Language) [11] is an open framework suitable for GPGPU programming. OpenCL is standardized by the Khronos Group, also known for the development of OpenGL. OpenCL includes a C-based language for writing kernels that execute on an OpenCL device. A GPU, DSP or CPU is typically used as a device, all of which have multiple compute units. The host in the OpenCL context is the environment which initializes and controls a device. The OpenCL programming model executes a kernel function simultaneously on several processing units by placing it in a command queue. Several kernel calls can be placed in the command queue, all of which can be executed using the SIMD architecture. OpenCL uses an index system to make sure that each processing unit processes different data. OpenCL provides a work-group ID for all kernel instances that will be processed on the same compute unit and a unique work-item ID for each kernel instance within a compute unit. The ID can have up to three dimensions to correspond with the data to process.

Figure 2.7: The OpenCL memory model, illustrating all layers of memory available to a work-item.

OpenCL allows access to four types of memory. The global memory is the device's main memory and can be accessed by all work-items. Constant memory has the same accessibility as global memory but can be used more efficiently if the hardware supports a constant memory cache. Local memory can be used by all work-items within a work-group. Private memory can only be used within a single work-item. The host is capable of reading and writing the global and constant memory. OpenCL also supports two-dimensional and three-dimensional image buffers with support for hardware-based interpolation. The entire OpenCL memory model is illustrated in figure 2.7.
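The kernel below is a generic OpenCL C sketch, not taken from the thesis implementation, showing how the index system described above is typically used: each work-item reads its global ID and processes exactly one voxel of a buffer in global memory.

```c
// Minimal OpenCL C kernel: one work-item scales one voxel value.
// get_global_id(0..2) gives the work-item's position in the (up to)
// three-dimensional index space set up by the host.
__kernel void scale_volume(__global const float* src,
                           __global float* dst,
                           const uint4 dim,     // volume dimensions (x, y, z, unused)
                           const float factor)
{
    const size_t x = get_global_id(0);
    const size_t y = get_global_id(1);
    const size_t z = get_global_id(2);
    if (x >= dim.x || y >= dim.y || z >= dim.z)
        return;                                  // guard against padded global sizes

    // Linear offset into the one-dimensional storage of the volume (cf. equation 2.2).
    const size_t idx = x + y * dim.x + z * dim.x * dim.y;
    dst[idx] = factor * src[idx];
}
```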

Chapter 3
Out-of-Core Streaming and Rendering of Multi-Resolution Volumetric Data

This chapter gives the theoretical concepts of multi-resolution DVR. Most of the multi-resolution techniques are based on work by Ljung et al [12], and an out-of-core DVR pipeline capable of rendering large data sets in real time is presented. The first section emphasizes the use of preprocessing to organize data into an efficient representation. Fundamentals of multi-resolution blocking are presented and the difference between flat and hierarchical blocking is discussed. This section also presents data analysis methods, such as precalculated histograms and error estimation, which can be used as acceleration structures during runtime. Section two describes the LOD management system, which is used to measure the importance of data in order to reduce memory overhead while minimizing the visual error. Section three presents the available memory layers as well as out-of-core storage and data streaming. A part of the section is dedicated to CPU threading and how it can be used to efficiently transfer data between memory layers without decreasing rendering performance. Section four is about GPU storage and data updates. The section introduces a data structure for storing blocks of mixed resolutions efficiently packed in memory and a method for updating individual blocks within it. The next section gives an overview of the pipeline and how all sub-parts collaborate to process and render large volumetric data sets. Finally, rendering techniques are discussed, such as block sampling, intrablock interpolation and adaptive ray casting.

3.1 Data Preprocessing and Analysis

The preprocessing step involves structuring the data into an efficient representation. A volume subdivision scheme is needed to pass data at various resolutions along the pipeline. Volume subdivision is referred to as blocking, which restructures the volume into a hierarchy of blocks represented at different resolutions. Data analysis is performed to extract information used to decide what to process through the pipeline. Each block has a density distribution which gives knowledge about the type of content. Another important aspect is the error introduced by downsampling a region of the volume. This analysis data is stored and made available in the dynamic data reduction phase of the pipeline.

3.1.1 Multi-Resolution Blocking

Blocking is the concept of subdividing the volume into a set of blocks available at different resolutions. The literature primarily refers to two blocking schemes used in this context, hierarchical and flat blocking. All methods proposed in this thesis are based on the flat blocking scheme proposed by Ljung et al [12].

Figure 3.1: The flat blocking scheme; the spatial extent is fixed for all blocks and the number of samples representing a block is reduced, through averaging, into three resolution levels. © 2006 Courtesy of Patric Ljung [12].

Flat blocking is conducted by subdividing the volume into a uniform set of fixed blocks. The spatial extent of a block is constant and does not grow with a reduced resolution level. A multi-resolution representation is obtained by downsampling each block into a pyramidal representation

holding a range of resolutions. Downsampling can be performed recursively by averaging the samples within one block. The number of resolution scales obtained from each block is limited by the block size and is calculated in equation 3.1. A block at the coarsest resolution is represented by one single value.

L_{scales} = \frac{\log(\mathrm{blocksize})}{\log(2.0)} + 1    (3.1)

The flat blocking scheme is executed during preprocessing and made available through out-of-core storage. The multi-resolution representation increases the memory needed for disk storage by 14.3 percent but on the other hand provides an efficient representation with improved memory locality.

LaMar et al [13] were the first to introduce the concept of blocking in volume rendering and used an octree-based hierarchical scheme for subdividing the spatial domain into blocks of data. In contrast to flat blocking, blocks have a fixed number of samples. The structure is downsampled by reducing the number of blocks until one block covers the entire volume. Most multi-resolution techniques are based on this approach, although it suffers from some important drawbacks. Flat blocking supports arbitrary resolutions between neighboring blocks and provides higher memory efficiency due to higher culling. The most important drawback of flat blocking is that the number of blocks is constant. A hierarchical representation can adapt the number of blocks to the available memory, resulting in a more dynamic scene adaption. Figure 3.2 shows a comparison between hierarchical and flat blocking.

3.1.2 Approximating Density Distribution Histograms

A local histogram is calculated for each block and stored as meta-data. Histograms are used for measuring block significance for data mapped through a TF. The information can be used to reduce or increase the resolution of blocks. Storing a complete set of full-resolution histograms is memory expensive. Ljung et al [12] use a piecewise constant approximation of 1012 segments, while Gyulassy et al [14] propose a method for determining the histogram size from the current data set. This thesis uses a uniform histogram approximation with a fixed number of bins.

Figure 3.2: Hierarchical vs flat blocking scheme. Level 0 is the lowest and level 3 the highest block resolution. If a block intersects the object's border (blue), it is selected at the highest resolution, and the interior of the object is selected at level 1. The hierarchical scheme allows partial usage of blocks, where high-resolution parts can be added to the necessary area of a low-resolution block. Still, the data reduction is much higher using a flat blocking scheme, 1.8:1 to 2.8:1. © 2006 Courtesy of Patric Ljung [12].

3.1.3 Error Estimation

Error estimation can be used as an additional tool for measuring the spatial variance of a specific volumetric region. Blocks holding large regions of spatially varying data are more sensitive to a low-resolution representation and hence produce more visual error in the output. Blocks containing slowly varying or non-varying data can be reduced in resolution without loss of visual quality.

\mathrm{error}_{rms} = \left| \frac{1}{n_h} \int_0^d \mathrm{Density}_h(x)\, dx - \frac{1}{n_l} \int_0^d \mathrm{Density}_l(x)\, dx \right|    (3.2)

The amount of error caused by using a low block resolution is measured using the root-mean-square error as in equation 3.2. The resulting error is precalculated and stored as meta-data.
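A minimal sketch of the per-block preprocessing described in sections 3.1.1 and 3.1.3, assuming a cubic block stored as a linear array of floats: the number of resolution levels follows equation 3.1, each coarser level is produced by 2x2x2 averaging, and a simple error value compares the mean densities of two levels in the spirit of equation 3.2. All names are placeholders rather than the thesis code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Equation 3.1: number of resolution levels for a cubic block of side 'blockSize'.
int levelCount(int blockSize)
{
    return static_cast<int>(std::round(std::log(static_cast<double>(blockSize)) / std::log(2.0))) + 1;
}

// One downsampling step: average each 2x2x2 group of samples (side n -> n/2).
std::vector<float> downsample(const std::vector<float>& block, int n)
{
    const int m = n / 2;
    std::vector<float> out(static_cast<std::size_t>(m) * m * m, 0.0f);
    for (int z = 0; z < m; ++z)
        for (int y = 0; y < m; ++y)
            for (int x = 0; x < m; ++x) {
                float sum = 0.0f;
                for (int dz = 0; dz < 2; ++dz)
                    for (int dy = 0; dy < 2; ++dy)
                        for (int dx = 0; dx < 2; ++dx)
                            sum += block[(2 * x + dx) + (2 * y + dy) * n
                                         + static_cast<std::size_t>(2 * z + dz) * n * n];
                out[x + y * m + static_cast<std::size_t>(z) * m * m] = sum / 8.0f;
            }
    return out;
}

// Error between a high- and a low-resolution representation of the same block,
// reduced here to the absolute difference of their mean densities (cf. equation 3.2).
float levelError(const std::vector<float>& high, const std::vector<float>& low)
{
    auto mean = [](const std::vector<float>& v) {
        float s = 0.0f;
        for (float x : v) s += x;
        return v.empty() ? 0.0f : s / static_cast<float>(v.size());
    };
    return std::fabs(mean(high) - mean(low));
}
```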

3.2 Level-of-Detail Management

The level-of-detail (LOD) management scheme provides the ability to select an optimized resolution level for each block. The LOD scheme measures block importance and priority based on user interactions. The input data is collected from the user and can, in combination with static information such as data error and local histograms, serve as a decision foundation for LOD selection. LOD selection methods can be categorized into view-dependent, data-error and transfer function based techniques. This thesis implements two hybrid approaches based on view-dependent and transfer function related data, used in combination with measurements of static data error.

3.2.1 View-Dependent Scheme

A view-dependent approach is closely tied to how the user navigates through the volume. The importance depends on the visibility of the block from the current view as well as on the screen-space error caused by using a low-resolution representation. The projected screen-space error is a measure of how much area a voxel occupies when projected on the screen. If the projected screen space exceeds the size of one pixel, visual artifacts will arise. Gyulassy et al [14] state that the projected screen space is far too expensive to calculate and suggest a less expensive approximation. The approximation is calculated in equation 3.4, where d is the distance to the camera and r is the radius of a virtual bounding sphere covering a voxel.

\mathrm{error}_{screen} = \phi    (3.3)
\mathrm{error}_{screen} \approx \frac{r^2}{d^2}    (3.4)

If the screen-space error exceeds a threshold, the resolution is increased to prevent the growth of visual artifacts. When observing the volume from a short distance, blocks are likely to be projected outside the screen space and are thus invisible to the user. Invisible blocks can be ignored by defining a region of interest which only contains the blocks visible to the user. A region of interest is illustrated in figure 3.3 for a specific view. Each block is projected onto the screen using the camera projection matrix, defined as the transformation between three-dimensional camera space and two-dimensional screen space. The position of a block is transformed to camera space and then to screen space as

Figure 3.3: The region of interest defined for a specific viewport. Green blocks are within the region of interest and red blocks are outside.

X_c = R(X_l - T)    (3.5)
X_s = P X_c    (3.6)

where X_l is the local position of a block within the volume, R is the camera rotation, T is the camera translation and P is the projection matrix between camera and screen space.

3.2.2 Transfer Function Based Scheme

TF based LOD importance can be measured to select a block resolution optimized for memory efficiency and minimal visual error. Data samples are mapped to the TF domain before rendering. The TF mapping provides information about the visual importance of a block. The TF mapping \tau : \mathbb{R} \rightarrow \mathbb{R}^4 is defined as a function \tau_{rgba}(v), v \in B, which maps a given intensity value to a specific color and alpha value. Invisible and homogeneous blocks give a low contribution to the final rendering and

can thus be represented at a very coarse scale. Ljung et al [15] use the following categories to classify block significance:

• No TF content, \tau(v) = 0, v \in B: the block is transparent.
• Non-varying TF content, \tau(v) = C, v \in B, where C is a vector constant: the block is completely homogeneous.
• Varying TF content, such that \tau(v) \neq \tau(u), u, v \in B: the derivative can be used to determine block importance.

Figure 3.4: Histogram of a data distribution together with a TF. The histogram is approximated by the simplified histogram. © 2006 Courtesy of Patric Ljung [12].

Calculating local block histograms is expensive, and a full representation does not always improve the result. A histogram approximation is shown in figure 3.4. The importance \zeta of a block is measured using

\zeta = \left( \frac{1}{N} \sum_{i}^{N} (\tau_a(h_i) - \tau_a(\bar{v}))^2 \cdot h_i \right)^{1/2}    (3.7)

where N is the number of bins of the histogram, h_i is the histogram value for bin i, \tau_a(h_i) is the TF mapping for the current bin and \tau_a(\bar{v}) is the TF mapping of the average intensity of the block. A LOD is selected based on the importance \zeta of the block. If the RMSE for the current block resolution is above a fixed threshold, the resolution is further refined.
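The importance measure of equation 3.7 can be sketched as below, under the assumption that the block meta-data provides a fixed-bin histogram, the TF opacity evaluated per bin, and the TF opacity of the block's average intensity; the names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// histogram[i] holds the (approximated) bin count h_i of the block.
// tfAlphaPerBin[i] is assumed to be the TF opacity for the intensity of bin i,
// and avgAlpha the TF opacity of the block's average intensity.
float blockImportance(const std::vector<float>& histogram,
                      const std::vector<float>& tfAlphaPerBin,
                      float avgAlpha)
{
    const std::size_t N = histogram.size();
    float sum = 0.0f;
    for (std::size_t i = 0; i < N && i < tfAlphaPerBin.size(); ++i) {
        const float d = tfAlphaPerBin[i] - avgAlpha;  // tau_a(h_i) - tau_a(v_bar)
        sum += d * d * histogram[i];                  // squared difference weighted by bin count
    }
    // zeta = ( (1/N) * sum )^(1/2), equation 3.7
    return N > 0 ? std::sqrt(sum / static_cast<float>(N)) : 0.0f;
}
```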

3.3 Out-of-Core Data Management

Computer systems benefit from the use of multiple memory layers for storing and accessing data. The cache memory is one of the most important layers and provides very fast data access. Out-of-core data management implies the use of an additional layer, such as disk or network storage, which can hold a significantly larger amount of data. In general there is a trade-off between storage capacity and access performance, which motivates transferring data between different layers of memory. During preprocessing a multi-resolution data set is created and stored on disk. The system can move specific blocks between memory layers on demand, referred to as data streaming. The VRAM on the graphics card is used as a high-performance cache, storing the blocks active for rendering. Figure 3.5 illustrates the different memory layers used in a computer and how they are connected.

Figure 3.5: The image illustrates how different layers of memory are connected. The disk block represents the layer used for out-of-core storage.

Streaming data between different layers of memory is in general a bottleneck, especially from out-of-core storage to main memory, which is about 20 times slower than data transfers from main memory to VRAM.

3.3.1 Multi-Threaded Data Stream

The algorithm uses three synchronized threads: the rendering thread, the I/O data thread and the LOD management thread. Threading is used to prevent data streaming and LOD calculations from interfering with rendering performance. The threading system is illustrated in figure 3.6. The LOD management thread is given information about changes in camera or TF. All blocks are classified into an appropriate LOD to act as block representation. Each block is prioritized and pushed to a priority queue. As soon as the LOD management thread is done it notifies the I/O data thread, which immediately aborts all current operations. The I/O data thread begins to pop items from the queue, which are gradually fetched and streamed to the VRAM on the graphics card for rendering. The streaming operation uses memory-mapping techniques to copy data to the device: device memory is mapped into the host memory address space, after which data can be transferred to the device efficiently. The host-to-device data transfer is synchronized with the rendering thread to avoid conflicts due to reading and writing at the same memory location. Figure 3.6 shows how the threads operate during a time frame.

Figure 3.6: Data is transferred by the use of different CPU threads in order to stream data from out-of-core storage, via in-core memory, to VRAM without interfering with rendering performance.

3.3.2 Data Stream Optimization

The slow transfer rates between external and internal memory layers motivate the need for an in-core data cache of blocks. Accessing data located in internal memory is a lot more efficient than reading from disk. Once a block has been retrieved from disk it is stored in internal memory. If the amount of data exceeds the capacity of the available memory, a least recently used (LRU) strategy is used. LRU is commonly used in operating systems as a demand-based page replacement algorithm. The strategy removes blocks based on their temporal order. Another way of optimizing the streaming is to reduce the number of blocks passed through the pipeline. An update request is denied entry to the priority queue if one or more of the following criteria are met:

• The block resolution requested is already available on the VRAM.
• The block is invisible due to the current TF.
• The block is not within the current region of interest.

3.4 Mixed-Resolution Texture Packing

Texture packing can be used to efficiently store multi-resolution blocks on the VRAM. The method is similar to the adaptive texture map technique introduced in [16]. The buffer is structured to store a limited set of high-resolution blocks allocated as subspaces inside the texture. Each subspace can be used to store high-resolution data or to act as a container for multiple low-resolution blocks. The concept of containers is used throughout the hierarchy of resolutions, which enables each block to act as a container for multiple low-resolution blocks. The average downsampling scheme enables such a representation since a block of resolution level R uses an equal amount of storage space as eight blocks of resolution R - 1. This concept can be thought of as a tree structure where containers are parents to data blocks, which are leaves of the tree. The packed structure gives a buffer capable of storing blocks of arbitrary resolution, efficiently packed in memory. However, since the spatial coherence between volume blocks is no longer preserved, a lookup structure is used to provide memory access to blocks within the buffer. Figure 3.7 illustrates the structure of blocks densely packed in memory.
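The three rejection criteria of section 3.3.2 can be expressed as a small predicate evaluated before a request enters the priority queue. The request type and flags below are hypothetical stand-ins for whatever the pipeline actually passes around.

```cpp
// Hypothetical per-block update request; resolutions are LOD indices.
struct BlockRequest {
    int blockId;
    int requestedLod;
};

// Returns true if the request should be dropped before entering the priority queue.
// residentLod: LOD currently stored in VRAM for this block (-1 if absent).
bool rejectRequest(const BlockRequest& req,
                   int residentLod,
                   bool visibleForCurrentTf,
                   bool insideRegionOfInterest)
{
    if (residentLod == req.requestedLod) return true;  // already available on the VRAM
    if (!visibleForCurrentTf)            return true;  // invisible due to the current TF
    if (!insideRegionOfInterest)         return true;  // outside the current region of interest
    return false;
}
```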

Figure 3.7: Left: lookup texture performing address translation to specific blocks within the block buffer. Right: blocks of mixed resolution densely packed into texture memory. Below: tree structure of a container holding blocks at mixed resolutions.

3.4.1 Dynamic Updates

The block cache provides memory-efficient block storage residing in VRAM. The block representation is subject to change, and an algorithm is needed to manage this while maintaining the structure of the densely packed texture. Consider the case of using a TF related LOD scheme. No block representation is changed until the user decides to change the TF. In this case it would be acceptable to remove and repack all blocks without any major loss of interactivity. Now consider the case of a view-dependent LOD scheme. The user expects real-time updates of volumetric regions, which signifies the need for an algorithm that can update individual blocks or regions of blocks during runtime. A densely packed texture has the downside of introducing some problems when updating individual blocks. Decreasing a block's resolution gives rise to holes of unused memory, and increasing a block's resolution overwrites storage space used by other blocks. Using an additional step of block restructuring, I propose an algorithm that guarantees that the texture remains densely packed after a block is updated. Figure 3.8 illustrates the algorithm using

an example. N_update is the container holding the original block about to be updated. The new block resolution occupies the same amount of storage as the parent of the original block, hence all children located in this container are temporarily placed in a block pool. N_last is the only container allowed to be sparse and is used as a cache for obtaining blocks to fill holes or, in this case, to provide storage for leftover blocks from the pool. When the blocks have been restructured into a densely packed representation, all changes are stored in the lookup structure.

Figure 3.8: Left: block structure before updating the resolution of a block at containers N_update and N_last. A block at resolution level 2 is selected to be replaced with a block of resolution level 1. Right: result of restructuring the packed blocks at containers N_update and N_last after the update. Leftover blocks are placed inside a block pool which is emptied into the container N_last.

Figure 3.9 shows a complete flowchart of the block updating algorithm. A block is either increased or decreased in resolution at container N_update. If the resolution is increased, leftover blocks are pushed into N_last; if decreased, N_update is filled with blocks from N_last.
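The restructuring step can be outlined roughly as below. This is only a schematic reading of the algorithm in figures 3.8 and 3.9, with containers modeled as fixed-capacity lists of block ids; the types and the handling of an overfull N_last are simplifications invented for the sketch.

```cpp
#include <vector>

// Schematic container in the packed texture: holds up to 'capacity' child blocks.
struct Container {
    std::vector<int> blocks;    // ids of the low-resolution blocks currently stored
    std::size_t capacity = 8;   // one level-R slot fits eight level-(R-1) blocks
};

// Increase the resolution of a block stored in 'update': the new block needs the whole
// container, so its current children are moved to a pool and then emptied into 'last',
// the only container allowed to be sparse.
void increaseResolution(Container& update, Container& last, int newBlockId)
{
    std::vector<int> pool = std::move(update.blocks);  // displaced low-resolution blocks
    update.blocks = {newBlockId};                      // container now holds one high-resolution block

    for (int id : pool) {
        if (last.blocks.size() < last.capacity)
            last.blocks.push_back(id);                 // refill the sparse container
        // else: a full implementation would allocate a new sparse container here
    }
    // After restructuring, the lookup texture must be updated with the new addresses.
}
```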

Figure 3.9: Flowchart of the updating algorithm. A block is either increased or decreased in resolution at container N_update. If increased, leftover blocks are pushed into N_last; if decreased, N_update is filled with blocks from N_last.

3.5 Pipeline Overview

The improved DVR system consists of two parts, preprocessing and rendering. The preprocessing pipeline is used to organize data into an efficient structure, and the rendering pipeline reacts to user interaction such as camera movement or changes in the TF in order to find an optimized distribution of blocks used for rendering high-quality images from the volume. Multi-resolution blocking techniques provide an efficient representation for limiting the amount of data passed through the pipeline. Large data sets require multiple layers of storage and a pipeline capable of moving data between memory layers on demand. The LOD management scheme is

used for limiting the memory overhead while minimizing the visual error caused by using low-resolution data. Real-time rendering implies the use of multi-threading, which enables LOD calculation and data transfers to be performed without affecting rendering performance. Memory limitations on the graphics card can be addressed by allowing regions of the volume to be represented at lower resolution and efficiently packed in memory.

The user information is sent to the LOD Select block to calculate optimized update requests for blocks. All requests are prioritized into a queue and further processed by an I/O thread which loads the data requested for each item. A requested block can be loaded either from a block cache held in internal memory or from disk. As soon as the block exists in internal memory it gets uploaded and rendered by the graphics card. The preprocessing pipeline provides an efficient representation of the volume by subdividing it into multi-resolution blocks of data. Preprocessing is also used to perform basic data analysis, valuable in the LOD Selector scheme. A full overview of the pipeline is illustrated in figures 3.10 and 3.11.

Figure 3.10: The preprocessing pipeline. Data is imported from the source, blocked, and stored on disk together with meta-data.

Figure 3.11: Complete out-of-core streaming pipeline. The LOD scheme uses either view-dependent or TF data to distribute updates into a priority queue, used to fetch blocks from either in-core or out-of-core storage and pass them to the VRAM of the GPU at the correct memory position.

3.6 Multi-Resolution Volume Rendering

A multi-resolution volume representation adds an extra layer of complexity to rendering. LaMar et al [13] used the concept of texture slicing and rendered each block separately using an adaptive sampling scheme dependent on the current block resolution. Adaptive volume sampling can be used to increase rendering performance but requires an opacity modification to keep the rendering result invariant when sampling the volume with varying sampling steps. The opacity correction is calculated as

\alpha_{adj} = 1 - (1 - \alpha_{org})^{\Delta_{org}/\Delta_{adj}}    (3.8)

where \Delta_{org} and \Delta_{adj} are the original and adjusted sampling densities respectively. Ljung et al [12] state that GPU-based ray casting has no performance overhead compared to texture slicing but significantly improves the rendering quality and adds support for more complex rendering techniques.

3.6.1 Multi-Resolution Raycasting

An additional lookup texture is needed to access blocks of different resolutions densely packed in memory during rendering. The arbitrary spatial placement of blocks gives interpolation errors when sampling near block boundaries. A first approach is to use a nearest-block sampling scheme which restricts the sampling position to a valid domain within the block. Nearest-block sampling does not perform any interpolation over block boundaries, which can evidently produce sharp edges and artifacts for some specific TFs. Artifacts can be avoided using sample replication and padding between blocks. Unfortunately this reduces the effective data reduction and leaves discontinuities between blocks of arbitrary resolutions. Ljung et al [17] propose an intrablock interpolation scheme which removes the need for sample replication at the cost of runtime filtering.

Nearest Block Sampling

Nearest block (NB) sampling avoids hardware-accelerated interpolation errors near block boundaries by restricting the sampling to a valid domain. The restricted sample position p' is defined as

p' = C_{\delta}^{1-\delta}(p)    (3.9)

where C clamps the value to the interval [\delta, 1 - \delta]. The valid domain is defined as the smallest square spanning all samples within a block, indicated by the red dashed squares to the left in figure 5.4. Given the resolution level l, \delta can be calculated as

\delta = \frac{1}{2^{l+1}}    (3.10)

Intrablock Interpolation Sampling

Despite the use of NB sampling, block artifacts still emerge. Artifacts are especially visible adjacent to low-resolution blocks and are magnified by the use of thin, iso-surface-like TF settings. Beyer et al [18] use sample replication to overcome these artifacts, at the cost of a reduced data reduction.

Figure 3.12: Left: 2D neighborhood of blocks with mixed resolutions. Right: 3D intrablock domain for an eight-block neighborhood with edge notations. Image © 2006 Courtesy of Patric Ljung [12].

Intrablock interpolation (II) [17] removes the need for sample replication and gives a robust interpolation scheme for adjacent blocks of arbitrary resolution. II collects samples from an eight-block neighborhood and weighs them together. Samples are taken from the immediately closest boundary of a neighboring block using NB sampling. The origin of the intrablock domain, r_0, s_0, t_0, is determined by a translation of the global block coordinates r_g, s_g, t_g according to equation 3.11. The original sample position will end up somewhere within this domain and its position is

calculated using equation 3.12. The position is defined in the local coordinate space r, s, t on the range [-0.5, 0.5] and is used to determine the contribution of each block in the local neighborhood.

r_0 = \lfloor C_0^{N_r - 1}(r_g - 0.5) \rfloor    (3.11)
r = r_g - r_0 - 1.0    (3.12)

From the position r, s, t an offset is made into all neighborhood blocks. The offset is clamped to the closest block boundary using NB sampling, which results in a non-uniform cube for a block neighborhood of different block resolutions. There are 12 edges spanning the intrablock domain, organized into three sets corresponding to the orientations r, s, t. Using the labeling in figure 3.12, these are:

E_r = (1,2)(3,4)(5,6)(7,8)    (3.13)
E_s = (1,3)(2,4)(5,7)(6,8)    (3.14)
E_t = (1,5)(2,6)(3,7)(4,8)    (3.15)

Edge weights e_{i,j} \in [0,1] determine the block weights \omega_b, shown in equations 3.16-3.23.

\omega_1 = (1 - e_{1,2}) \cdot (1 - e_{1,3}) \cdot (1 - e_{1,5})    (3.16)
\omega_2 = e_{1,2} \cdot (1 - e_{2,4}) \cdot (1 - e_{2,6})    (3.17)
\omega_3 = (1 - e_{3,4}) \cdot e_{1,3} \cdot (1 - e_{3,7})    (3.18)
\omega_4 = e_{3,4} \cdot e_{2,4} \cdot (1 - e_{4,8})    (3.19)
\omega_5 = (1 - e_{5,6}) \cdot (1 - e_{5,7}) \cdot e_{1,5}    (3.20)
\omega_6 = e_{5,6} \cdot (1 - e_{6,8}) \cdot e_{2,6}    (3.21)
\omega_7 = (1 - e_{7,8}) \cdot e_{5,7} \cdot e_{3,7}    (3.22)
\omega_8 = e_{7,8} \cdot e_{6,8} \cdot e_{4,8}    (3.23)

When all block weights have been calculated, the sample value \psi is computed as a normalized sum.

\psi = \frac{\sum_{b=1}^{8} \psi_b \cdot \omega_b}{\sum_{b=1}^{8} \omega_b}    (3.24)

Ljung et al [17] present three different methods for calculating the edge weights e_{i,j}. The methods are referred to as Minimum Distance Interpolation, Boundary Split Interpolation and Maximum Distance Interpolation, which all have different properties. This thesis implements the Maximum Distance Interpolation method since it provides C0 continuity between block borders. The edge weights are calculated as

e_{i,j}(\rho) = C_0^1\left( \frac{\rho + \delta_i}{\delta_i + \delta_j} \right)    (3.25)

where \rho is either r, s or t depending on which set e_{i,j} belongs to.

Adaptive Sampling of Multi-Resolution Volumes

A multi-resolution volume enables the use of acceleration techniques such as adaptive volume sampling. Adaptive sampling is performed by adapting the sampling density to the current block resolution. Blocks of lower resolution can be sampled sparsely, while structures containing a lot of air can be skipped entirely. The sampling step can be calculated as

\eta = \mathrm{step} \cdot \frac{\sigma_{max}}{\sigma}    (3.26)

where \sigma_{max} is the maximum block size and \sigma the size of the current block. Artifacts may arise when the sampling passes from a low- to a high-resolution block. The step taken will most likely end up far into the high-resolution block, causing vital information to be skipped. Ljung et al [19] solve this by calculating the remaining ray distance l within a block. This distance is calculated using ray-box intersection [20], and if \eta is larger than l the step length is limited as

\eta = \min\left( \mathrm{step} \cdot \frac{\sigma_{max}}{\sigma},\; l + \mathrm{step} \right)    (3.27)

The adaptive sampling scheme is illustrated in figure 3.13.

Figure 3.13: Adaptive sampling of a multi-resolution volume. Blocks are sampled with different sampling densities depending on the current block resolution. The ray-block intersection makes sure that sampling always starts at the beginning of a block.
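Equations 3.26 and 3.27 can be summarized in a small helper for the step length used while marching a ray; the ray-box intersection that yields the remaining distance inside the block is assumed to be computed elsewhere, and the parameter names are illustrative.

```cpp
#include <algorithm>

// Adaptive step length for the current block (equations 3.26 and 3.27).
//   baseStep      : sampling step used for blocks at full resolution
//   sigmaMax      : maximum block size (full resolution)
//   sigma         : size of the current block's resolution level
//   remainingDist : distance l left inside the block along the ray (from ray-box intersection)
float adaptiveStep(float baseStep, float sigmaMax, float sigma, float remainingDist)
{
    const float eta = baseStep * (sigmaMax / sigma);  // sparser sampling in coarse blocks
    return std::min(eta, remainingDist + baseStep);   // clamp so the next sample starts near
                                                       // the beginning of the next block
}
```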

Chapter 4
Implementation Details

This chapter provides details about the implementation done in this thesis. The system is written in C++ and OpenCL and is integrated into the volume rendering framework Voreen (Volume Rendering Engine, http://www.voreen.org/), which uses the QT UI framework (http://qt.nokia.com/) for the GUI and CPU thread handling.

4.1 Voreen Framework Integration

Voreen is an open source C++ framework for interactive visualization of volumetric data using OpenGL and GLSL. Voreen provides a QT based graphical interface which allows the user to combine predefined modules into dynamic processing pipelines for volumetric or image based data. Each module is graphically represented by a processor block which performs operations on data. Processor blocks can be connected to pass data through the pipeline, or linked to share data during runtime. Voreen can be extended with custom functionality by adding new processors to the framework. Recent versions of Voreen also support OpenCL for performing GPGPU based computations. The following sections present all implemented processors and how they interact to form the proposed pipeline. The preprocessing pipeline is integrated in Voreen using two processor blocks, as shown in figure 4.1. The integration of the rendering pipeline is shown in figure 4.2.

Figure 4.1: Two processors connected to load blocks of data, preprocess them and store them on disk.

4.1.1 Preprocessing

LoadBrickedVolume: A processor which holds a file dialog property allowing the user to select a volume to preprocess.

CreateBrickedVolume: Loads the volume into blocks of data. Each block is downsampled into several LODs which are organized and written to a file on disk. Information such as the number of blocks, dimensions, number of LODs and the RMS error between LODs is stored as meta-data.

4.1.2 Rendering

useBrickedVolume: Used to initialize the rendering pipeline. Stores an initial set of blocks in the internal memory block buffer. The buffer is represented using a one-dimensional array containing a set of blocks of different LODs. The processor calculates histograms and average block densities for all blocks. A memory pointer to the block buffer is passed to the next processor, as well as a bounding volume used for calculating the entry and exit points used during rendering.

StreamingHandler: The processor is linked with both the camera and the TF to receive real-time updates of user related changes. The processor is responsible for initializing the LOD management and the block update scheme. An optimized LOD is calculated for all blocks. All update requests are prioritized and placed inside an update queue.

Figure 4.2: The whole out-of-core rendering pipeline implemented in Voreen.

The queue is implemented using a max-heap, which provides O(log n) time complexity for standard operations such as insertion and removal of prioritized items.

openCLRaycaster: The processor is used for rendering and provides the connection between the CPU host and the GPU device. A kernel function is called to perform the actual rendering, and the resulting image is read from device memory and passed on to the canvas for viewing.

CubeMeshProxyGeometry and MeshEntryExitPoints are two preexisting processors used for calculating entry and exit points for the bounding volume. The entry and exit points are passed as textures to the kernel.

The kernel uses the textures to define the parametric ray equation for every ray launched through the volume.

The processors StreamingHandler and openCLRaycaster are linked through a property. This connection allows blocks of data to be passed between the processors whenever an update has been made. The data is passed together with information about where the block should be placed inside the densely packed block buffer located on the device. The openCLRaycaster has access to the current GPU context and can continuously copy data between the host and global memory on the device. The texture buffer is created using the OpenCL command clCreateImage3D during initialization. Blocks of data can be mapped into a specific region of the image using the command clEnqueueMapImage. The function maps a three-dimensional image region into the host address space and returns a pointer to the mapped region. Matthew Scarpino [21] states that memory mapping is more efficient than standard read and write operations. The data within a mapped region is undefined on the device until the region has been unmapped with clEnqueueUnmapMemObject.

4.2 QT Threading

The StreamingHandler uses the QT framework, which provides thread support in the form of platform-independent threading classes to take advantage of multiprocessor machines. Both the LOD management and the I/O scheme use this support to perform all operations on separate CPU threads. QT uses a signal/slot mechanism for connecting different threads to each other. When an optimized LOD has been calculated for all blocks, a signal is emitted informing the I/O thread to stop its current work and replace all previous update requests with new items from the queue. The I/O thread starts processing the queue, and whenever a block of data has been loaded, from either out-of-core or in-core memory, a new signal is emitted placing the data at the correct position in VRAM. The Qt::BlockingQueuedConnection flag is used to prevent several signals from being emitted at the same time. In multi-threaded processing it is common that operations try to read and write the same memory location simultaneously, which can cause data races or deadlocks. A mutual exclusion lock (mutex) provided by QT is also used to prevent the thread from popping the same queue item twice.
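As an illustration of the block upload described above, a minimal C++ sketch using clEnqueueMapImage is given below; the Block struct, variable names and error handling are assumptions made for illustration, not the exact thesis code.

// Minimal sketch: copying one block of 16-bit voxels into a sub-region of the
// packed 3D image by mapping the region into host memory.
#include <CL/cl.h>
#include <cstdint>
#include <cstring>
#include <vector>

struct Block {
    size_t x, y, z;                 // destination origin inside the packed image (voxels)
    size_t dim;                     // block side length in voxels
    std::vector<uint16_t> voxels;   // dim*dim*dim samples loaded from disk or RAM
};

void uploadBlock(cl_command_queue queue, cl_mem packedImage, const Block& b)
{
    const size_t origin[3] = { b.x, b.y, b.z };
    const size_t region[3] = { b.dim, b.dim, b.dim };
    size_t rowPitch = 0, slicePitch = 0;
    cl_int err = CL_SUCCESS;

    // Map the destination region into host address space for writing.
    void* mapped = clEnqueueMapImage(queue, packedImage, CL_TRUE, CL_MAP_WRITE,
                                     origin, region, &rowPitch, &slicePitch,
                                     0, nullptr, nullptr, &err);
    if (err != CL_SUCCESS || mapped == nullptr)
        return;

    // Copy the block row by row, respecting the row and slice pitches.
    const uint16_t* src = b.voxels.data();
    for (size_t z = 0; z < b.dim; ++z) {
        for (size_t y = 0; y < b.dim; ++y) {
            char* dst = static_cast<char*>(mapped) + z * slicePitch + y * rowPitch;
            std::memcpy(dst, src, b.dim * sizeof(uint16_t));
            src += b.dim;
        }
    }

    // The copy only becomes visible to the device after unmapping the region.
    clEnqueueUnmapMemObject(queue, packedImage, mapped, 0, nullptr, nullptr);
}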

4.3 OpenCL Ray-Casting

The rendering is done using ray casting performed on the OpenCL device. The fact that all image pixels are independent makes ray casting an ideal task to process in parallel on the GPU. clEnqueueNDRangeKernel queues data-parallel tasks, which enables a kernel to be launched simultaneously on several processing units. Each ray is assigned to a work-item and the parallel ray casting algorithm is expressed in algorithm 4.1.

Algorithm 4.1 Ray casting
Require: work-item index x, y
  get entry and exit points for x, y
  calculate parametric ray equation ray(t)
  while length(ray) < length(dir) do
    look up sample position in index texture
    get sample value
    classify sample value based on current TF
    use composition to blend sample contributions
    check early ray termination
    increase ray step
  end while
  set final pixel color

Sampling inside a densely packed image buffer requires some additional steps of address translation, as illustrated by the sketch after this list.

1. The ray is described by ray(t) = entry + t · dir and gives the current sampling position at ray(t).
2. The lookup texture is defined in a three-dimensional block space where the current block index is obtained by taking the integer part of ray(t) for x, y, z.
3. The lookup texture at position int(ray(t)) gives the block position and block dimensions in the densely packed block texture.
4. The intrablock coordinate is defined by frac(ray(t)) for x, y, z, scaled by the block width.
5. The value is obtained using NB or II sampling.
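The address translation above can be sketched on the CPU side as follows; the real implementation is an OpenCL kernel, and all types, stubs and names here are hypothetical.

// Minimal CPU-side sketch of the address translation steps 1-5 above.
#include <algorithm>
#include <cmath>

struct Float3 { float x, y, z; };

struct BlockInfo {          // one entry of the lookup texture, in block space
    Float3 packedOrigin;    // block position inside the packed block texture
    float  blockWidth;      // block side length in the packed texture
    float  delta;           // half-voxel margin for NB clamping, equation 3.10
};

// Stubs standing in for the real lookup texture and packed-texture fetch.
static BlockInfo lookup(int, int, int) { return { {0.0f, 0.0f, 0.0f}, 16.0f, 1.0f / 32.0f }; }
static float samplePackedTexture(Float3) { return 0.0f; }

float sampleAt(Float3 p)    // p = ray(t) in block space
{
    // Steps 1-2: block index from the integer part of the sample position.
    int bx = static_cast<int>(std::floor(p.x));
    int by = static_cast<int>(std::floor(p.y));
    int bz = static_cast<int>(std::floor(p.z));
    BlockInfo info = lookup(bx, by, bz);

    // Step 4: intrablock coordinate from the fractional part, NB-clamped to
    // the valid domain [delta, 1 - delta] to avoid interpolation across blocks.
    auto clampNB = [&](float f) {
        return std::min(1.0f - info.delta, std::max(info.delta, f));
    };
    Float3 local = { clampNB(p.x - bx), clampNB(p.y - by), clampNB(p.z - bz) };

    // Steps 3 and 5: offset into the packed block texture and fetch the value.
    Float3 packed = { info.packedOrigin.x + local.x * info.blockWidth,
                      info.packedOrigin.y + local.y * info.blockWidth,
                      info.packedOrigin.z + local.z * info.blockWidth };
    return samplePackedTexture(packed);
}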

Chapter 5
Results

The following chapter presents the results obtained from implementing the methods described in chapters 3 and 4. The results consist of preprocessing performance, visual quality at different compression ratios and rendering performance.

5.1 Test Data

The data sets used in this thesis are provided by CMIV (Center for Medical Image Science and Visualization, http://www.cmiv.liu.se/) and the division for MIT (Media and Information Technology, http://www.itn.liu.se/mit) at Linköping University. Details about the data sets are listed in table 5.1.

Data Set        Dimensions      Bits   Size (MB)   Source
Golden Lady     512x512x625     16     320         MIT
Moose           512x512x3069    16     1571        CMIV

Table 5.1: The data sets used for testing the methods implemented in this thesis.

5.2 Test System

All tests are performed on a system with the following specifications.

CPU: Intel Core i7 @ 3.06 GHz

GPU: NVIDIA GTX560i, 1024 MB VRAM
HDD: Western Digital Caviar Green, 1 TB, 64 MB cache, 5400 RPM
RAM: Corsair Vengeance DDR3, 1600 MHz, 6 GB

5.3 Preprocessing

The preprocessing phase involves subdividing the volume into blocks using the flat blocking scheme. Each block is downsampled and stored on disk together with the amount of error introduced by representing the block at a lower resolution. Meta-data is also stored, containing detailed information about the data set as well as where to locate each block within the stored file. The preprocessing time depends on the number of blocks the volume is subdivided into as well as the total size of the data set. If the volume does not fit in in-core memory it has to be read brick by brick, which is an extremely time consuming task. However, the preprocessing step is a one-time operation, and once a data set has been preprocessed it can be accessed by the rendering pipeline multiple times. Table 5.2 shows measurements of the preprocessing times using two different block sizes.

Data Set        Block Size   Block Dimensions   Preprocessing (s)   Size on disk
Golden Lady     32³          16x16x19           39                  347 MB
Moose           32³          16x16x95           254                 1.69 GB
Golden Lady     16³          32x32x39           63                  357 MB
Moose           16³          32x32x191          434                 1.7 GB

Table 5.2: The time performance measured in seconds for preprocessing the data sets at block sizes 16³ and 32³ respectively.

5.4 Out-of-Core Block Reading Performance

Reading out-of-core data is very time consuming due to the low bandwidth between external and internal memory layers. When performing a read operation it takes a roughly constant amount of time for the disk head to move to the location of the requested data. By organizing blocks into chunks of neighborhood data, the number of seek operations can be limited to one per block.

Figure 5.1: A graph plotting the read performance obtained from reading blocks of different sizes from disk.

Figure 5.1 shows the average time spent on reading blocks of various sizes. As the figure implies, it takes almost the same amount of time to read a block containing 2 bytes (1³ voxels) of data as a block containing 8 KB (16³ voxels) of data. Reading low resolution blocks separately is thus very inefficient. Further, due to their low storage cost, most of the low resolution blocks can be stored in main memory to optimize the read performance. When comparing the reading performance between blocks of size 32³ and 16³, the latter appears more efficient according to the graph. Nevertheless, one should keep in mind that subdividing the volume into smaller regions significantly increases the number of blocks that have to be passed through the pipeline.

5.5 Intrablock Volume Sampling

NB sampling is the simplest scheme for sampling data within the interior of a block. The method avoids hardware interpolation errors by clamping the sample position to the valid domain of the current block. Figure 5.2 shows the results of rendering a volume using two different transfer functions as well as a very low resolution representation of the volume. The result illustrates that NB sampling can produce satisfying rendering quality when using a very opaque TF, but works poorly for thin, iso-surface-like TFs and low resolution data.

Figure 5.2: Comparison of NB sampling of a volume rendered using an opaque transfer function, a thin transfer function and very low resolution blocks.

II sampling provides an intrablock interpolation scheme which weights together samples obtained at the boundaries of the closest neighborhood of blocks at arbitrary resolutions. This eliminates the artifacts caused by NB sampling and provides a smooth transition between blocks, which significantly increases the rendering quality.

Figure 5.3: Illustration of how the Maximum Distance scheme interpolates in a neighborhood of mixed resolution blocks.

Figure 5.3 shows an example of how a neighborhood of blocks at different resolutions is smoothed using the Maximum Distance scheme. The scheme provides a continuous transition between blocks at arbitrary resolutions, giving high visual quality. The amount of smoothing depends on the difference in resolution between adjacent blocks. This is illustrated by the figure to the right, where a high resolution orange block is smoothed into a low resolution empty block. Figure 5.4 shows a comparison of the visual quality when rendering using the NB and the II sampling scheme respectively. Both images use an opaque TF setting for skin and

a thin, iso-surface-like setting for extracting the skull. The artifacts due to NB sampling are clearly visible in the left image and are removed by II sampling in the image to the right.

Figure 5.4: Comparison between NB and II sampling, showing how II sampling effectively removes artifacts caused by NB sampling.

5.6 Adaptive Sampling

Adaptive sampling benefits from the multi-resolution representation and increases performance by lowering the sampling density for low resolution blocks. Figure 5.5 shows the result of rendering using full sampling, native sampling and adaptive sampling. Native sampling does not perform ray-block intersection and can thus skip large parts of the volume, which produces the artifacts seen in the center image. Adaptive sampling produces a far better result, much more similar to the full sampling approach, although some aliasing artifacts are visible. The performance of the corresponding sampling schemes is presented in the rendering performance section.

Figure 5.5: Rendering quality using Left: full sampling, Center: native sampling, Right: adaptive sampling. Native sampling skips important parts of the volume and thus produces an unacceptable amount of artifacts.

5.7 TF Based Data Reduction

Storing a full resolution histogram for every block is expensive in terms of memory, and the time needed to classify each block grows with the histogram size. The histograms are therefore reduced by averaging over the bins used to represent the content of a block before the optimal LOD estimation is performed. The goal is to find an optimal histogram size that gives good LOD estimations while keeping the storage and calculation cost low. Table 5.3 shows the average time spent on classifying the LOD for a given number of histogram bins.

Histogram Bins   LOD/Block
256              0.0078 ms
128              0.0043 ms
64               0.0024 ms
32               0.0012 ms
16               0.0004 ms

Table 5.3: Performance time for calculating the optimal TF based LOD selection.

The optimal histogram size was found by comparing the deviation of the LOD distribution calculated for each histogram approximation. Representing the histograms with 32 bins was found to be the optimal choice in terms of storage, calculation cost and quality. The resulting LOD distribution deviated only 0.3 percent from that obtained with a 256-bin histogram.
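As a small illustration of the bin averaging described above, a minimal C++ sketch could look as follows; the function name is hypothetical and this is not the actual processor code.

// Minimal sketch: reducing a full block histogram (e.g. 256 bins) to a smaller
// number of bins by averaging groups of adjacent bins. Assumes full.size() is
// a multiple of bins.
#include <vector>

std::vector<float> reduceHistogram(const std::vector<float>& full, size_t bins)
{
    std::vector<float> reduced(bins, 0.0f);
    const size_t group = full.size() / bins;     // e.g. 256 / 32 = 8
    for (size_t b = 0; b < bins; ++b) {
        float sum = 0.0f;
        for (size_t i = 0; i < group; ++i)
            sum += full[b * group + i];
        reduced[b] = sum / static_cast<float>(group);   // average of the group
    }
    return reduced;
}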

Data Reduction   L0      L1     L2     L3     L4      Figure
2.57503          15503   0      0      0      24433   5.7 a
3.57171          10840   2577   661    1425   24433   5.7 b
4.62511          8329    2137   1659   3378   24433   5.7 c
5.4714           6950    2442   1894   4217   24433   5.7 d
6.63861          5611    2858   2031   5003   24433   5.7 e

Table 5.4: Golden Lady, subdivided into blocks of 16³ voxels. The LOD distribution is calculated using the TF based dynamic data reduction scheme. The columns L0, L1, L2, L3 and L4 give the number of blocks at each resolution level, where L0 is the highest and L4 the lowest. Data reduction states the level of dynamic data compression.

The goal of the TF based LOD selection scheme is to maximize the rendering quality for a given data reduction ratio. The error measurements are calculated for images rendered with both NB and II sampling at different data reduction ratios. The total error and the maximum error are plotted against the data reduction ratio in figure 5.6, the full list of LOD distributions is shown in table 5.4, and the rendered images together with the calculated color differences ∆E are shown in figure 5.7.

Figure 5.6: Left: total error vs. data reduction ratio from performing dynamic TF based data reduction on the Golden Lady data set using both II and NB sampling. Right: maximum error vs. data reduction ratio using both II and NB sampling.

The TF based LOD selection scheme gives near lossless compression for data reduction ratios of 2-3 times. This is due to the large amount of blocks

containing nothing but air, which thus can be represented at the lowest resolution. The II sampling scheme has the ability to provide smooth transitions between blocks at arbitrary resolutions. Although it produces smooth results, it is also responsible for spreading errors: II sampling tends to blur data contributions into low resolution blocks, which immediately registers as a large error near the thin tube, surrounded by many low resolution blocks of air. Further, as the data reduction ratio increases, so does the visual error. The fact that II sampling produces a smaller visual error than NB sampling at high data reduction ratios emphasizes its visual quality.

Figure 5.7: Top: rendering of Golden Lady using II sampling at different levels of data reduction. Bottom: CIELAB ∆E for each image caused by the data reduction. Corresponding LOD distributions can be viewed in table 5.4.

The data reduction scheme gives the ability to dynamically compress volumes that would normally not fit in VRAM. Table 5.5 shows an example of the data reduction rate obtained by calculating an optimal LOD distribution for the Moose data set. Since the available VRAM does not allow a full resolution representation of this data set, no objective visual error measurements have been made. Rendered images at each data reduction ratio are, however, shown in figure 5.8 for a subjective comparison. This data set allows more lossless data compression since it contains more blocks of air. For larger data reduction ratios, artifacts on the surface of the skin become more visible.

References
