Fusing Stereo Measurements into a Global 3D Representation

LiU-ITN-TEK-A--21/035-SE

Fusing Stereo Measurements into a Global 3D Representation

Per Blåwiik
2021-06-11

Department of Science and Technology
Linköping University
SE-601 74 Norrköping, Sweden

Thesis work carried out in Medieteknik at Tekniska högskolan at Linköpings universitet.

Per Blåwiik
Norrköping, 2021-06-11

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Per Blåwiik

Linköping University | Department of Science and Technology
Master's thesis, 30 ECTS | Media technology
2021 | LIU-ITN/LITH-EX-A--2021/035--SE

Fusing Stereo Measurements into a Global 3D Representation
(Sammanslagning av stereomätningar till en global 3D-representation)

Per Blåwiik

Supervisor: Jochen Jankowai
Examiner: Mark E Dieckmann
External supervisor: Fredrik Lundell

Linköpings universitet
SE-581 83 Linköping
+46 13 28 10 00, www.liu.se

Abstract

This report describes the thesis work with the aim to find a complete method for fusing an arbitrary sequence of stereo measurements into a global mesh-based representation, as a subsystem of a real-time stereo reconstruction system. The proposed system takes stereo measurements in the form of disparity maps and colour images, together with estimated camera poses, as input, and produces a global 3D reconstruction of the scene. The input stereo observations are fused into a global signed distance function (SDF) using a cumulative weighted average of the signed distance for each voxel, and a triangle mesh is incrementally extracted using the well-known marching cubes algorithm. The data structure of the proposed method involves a global implicit surface represented by an octree-based SDF, where the leaf nodes of the octree store uniform blocks of voxels with three mipmapped levels of resolution. To evaluate the proposed method, a prototype system was implemented and integrated with a real-time SLAM-based stereo reconstruction system. Several reconstructions produced by the prototype were compared to another approach based on overlapping keyframe meshes. The results showed that the proposed method outperformed the other method in terms of visual quality of both geometry and textures. Furthermore, performance benchmark tests showed that the prototype has great potential for real-time application; however, further optimizations are required for reconstructions with higher voxel resolutions.

Acknowledgments

First, I would like to thank Hans Holmgren for giving me the chance to do this thesis work together with Saab Dynamics AB. It was a great experience which gave me the opportunity to do research in a field that was partially new, and partially familiar, to me. I also want to thank my supervisor, Jochen Jankowai, and my examiner, Mark E. Dieckmann, for their guidance regarding the presentation, the report and other scientific matters. Finally, I would like to thank my external supervisor at Saab, Fredrik Lundell, who was very helpful in deciding on the approach for the work, and who provided invaluable practical knowledge that was instrumental for the implementation of the final prototype.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research Questions
  1.4 Delimitations

2 Theory
  2.1 Related Work
  2.2 Stereo Reconstruction
  2.3 Signed Distance Function
  2.4 Octree Data Structure
  2.5 Marching Cubes

3 Method
  3.1 The Work Process
  3.2 System Outline
  3.3 The Data Structure
  3.4 Octree Lookup
  3.5 Global Surface Fusion
  3.6 Mesh Extraction

4 Results
  4.1 Performance Benchmark
  4.2 Qualitative Results

5 Discussion
  5.1 Results
  5.2 Method
  5.3 Future Work

6 Conclusion

Bibliography

List of Figures

1.1 The point cloud of a global 3D reconstruction represented as an octree-based signed distance field. The scene was reconstructed from a sequence of 2500 frames using a publicly available dataset of RGB-D frames and camera pose tracking.
2.1 An illustration of the stereo geometry of a calibrated camera pair. The baseline separation and the different viewing angles of the cameras result in the notion of disparity between two corresponding 2D projections of the same point in 3D space.
2.2 A geometric illustration of how the depth, z, of a 3D point, P, can be derived from the disparity, d = u_R - u_L, of two rectified stereo images. C_L and C_R represent the left and right camera pinhole locations, and f is the focal length.
2.3 An illustration of an implicit circle sampled into a two-dimensional signed distance field. Positive values represent space outside of the circle, and negative values space inside the surface.
2.4 A two-dimensional demonstration of how a quadtree can save space by only allocating geometric information in a narrow region around the boundary of a circle. The upper left quadrant shows the hierarchical tree structure in comparison to the other quadrants of uniform grids.
2.5 Each node in an octree, starting from the root node, is represented by an axis-aligned bounding box that can be decomposed into eight octants with half its side lengths. Figure (a) illustrates how an octree node can be decomposed into eight octants with eight different indices, and (b) shows the tree graph representation of the same decomposed node.
2.6 An illustration of how the spatial coordinates of a discrete scalar field can be used to form a grid of cubes. The different colours represent different scalar values, and the surface intersections are determined by linear interpolation between vertices that share edges. The purple-coloured triangle mesh is an isosurface extracted from the scalar data field.
2.7 An illustration of the 15 unique cases in the marching cubes algorithm published in the original article by Lorensen et al. [19]. Highlighted corners represent state one, and the others represent state two. Note that the states are reversely interchangeable, since each case is symmetrically identical to its complementary case.
2.8 An illustration of the interpolation between two points, v_1 and v_2, of a cube edge, based on the weight of the isolevel, α, between the scalar values, s_1 and s_2.
3.1 The main pipeline of the proposed method visualized as a block diagram. The pipeline can be separated into three contiguous processes for each incoming stereo observation.
3.2 An illustration of the hierarchical tree structure in which the implicit surface is represented. Each branch can be subdivided into at most eight octant children, where only leaf nodes store the voxel blocks containing the geometric data (SDF).
3.3 A demonstration of how the voxel blocks are allocated in narrow regions around the surface of a sphere using the proposed octree structure of SDF voxel blocks. Note that the coloured dots represent individual voxels, and are not affected by the actual signed distance values. Figure (a) shows the surface of a sphere sampled into blocks of 8³ contiguous voxels, and (b) is a sliced view of the same sphere visualizing 8² voxels per block.
3.4 An illustration of the three mipmapped resolutions of the same voxel block containing 8³, 4³, and 2³ voxels respectively. Notice that the side lengths of the block stay the same; only the grid resolution is affected. The parameter s represents the resolution scale, where a lower value equals finer resolution.
3.5 A histogram showing the frequency of surface points per voxel block in a subregion of the input image. A total of 314 surface points are located across only 11 voxel blocks.
3.6 The result of fusing ten noisy observations into one curve by applying the cumulative weighted sum strategy.
3.7 Groups of eight neighbouring voxels are processed at a time to apply the marching cubes algorithm. The center locations of the voxels represent the corners of a cube.
3.8 A neighbourhood of eight voxel block nodes needed for generating triangles in between adjacent submeshes, using the marching cubes algorithm. The black box is the current submesh, and the grey boxes are its neighbours. Blue cubes represent exterior face blocks shared between two nodes each, the green blocks are edge blocks shared between four nodes each, and the purple cubes are a corner block shared between all eight neighbours.
3.9 An illustration of how a submesh can be textured by a small patch of the camera image. The blue triangles represent the submesh, and the pink bounding area represents its corresponding texture sample.
4.1 The running average of the computation times for the three main processes in the pipeline. The graphs represent the computation times for each frame of the house data set, using the scale of 2 cm per voxel at the finest resolution (first row in Table 4.1).
4.2 An overview of a fully textured 3D reconstruction of the house scene. In (a), the house was reconstructed with the old method, and in (b), with the proposed method. The green path represents the tracking of the camera.
4.3 A side view of a fully textured 3D reconstruction of the house scene. In (a), the house was reconstructed with the old method, and in (b), with the proposed method. The green path represents the tracking of the camera.
4.4 A close-up view of a full 3D reconstruction of the house scene. The left column was generated by the old method, and the right column by the proposed method. Sub-figures (a) and (b) show the fully textured reconstructions, (c) and (d) show the wireframe representations of (a) and (b), and finally, (e) and (f) show zoomed-in views of (c) and (d).
4.5 A wireframe side view of the wall of the house, demonstrating the depth-based multi-scale resolution system. In (a), the crossing of two different voxel resolutions was captured during the reconstruction process, and (b) is the final reconstruction of (a).
4.6 Different views of a 3D reconstructed rock in the underwater scene, where the left column was generated by the old method, and the right column by the proposed method. Sub-figures (a), (b), (c) and (d) show two different views of the same rock. In (e) and (f), the wireframe representations of (c) and (d) are shown.
4.7 A close-up view of a 3D reconstructed object in the underwater scene that shows an example of how the proposed method failed to fuse measurement errors. The left column was generated by the old method, and the right column by the proposed method. In (a) and (b), the reconstructions of the object were still incomplete. Sub-figures (c) and (d) show the fully textured models of the complete reconstructions of (a) and (b) (all available measurements added), while (e) and (f) show their wireframe representations.
4.8 A partial top view of the seabed in the underwater scene produced by the proposed method, using two different voxel scale resolutions. In (a), a resolution of 1 cm per voxel was used, and in (b), 2 cm per voxel. The finer resolution, (a), contains more visible patches of separated surface layers than the coarser resolution, (b).
4.9 A partial overview of a 3D reconstruction of the garage scene produced by the proposed method. Sub-figure (a) shows the fully textured reconstruction, and (b) shows its wireframe representation. The green path represents the tracking of the camera.
4.10 A close-up view of a full 3D reconstruction of the garage scene, where the left column was generated by the old method, and the right column by the proposed method. In this example, the proposed method managed to capture fine details in both textures and geometry. Sub-figures (a) and (b) show the fully textured reconstruction of the scene, (c) and (d) are the wireframe representations of (a) and (b), and finally, (e) and (f) are a zoomed-in view of (c) and (d).
4.11 A close-up view of a fully textured reconstruction of a stroller in the garage scene; an example of how the proposed method successfully managed to fuse new observations with previous ones. Sub-figure (a) was generated by the old method, and (b), by the proposed method.
4.12 A close-up view of a fully textured reconstruction of the garage scene produced by the proposed method, showing an example of artifacts caused by the submesh patch texturing method. In (a), artifacts have been created by occlusions close to the borders of the camera image, and in (b), the artifacts have been removed by cropping the borders of the image.
5.1 A drawback of the chosen texture method is that it requires cropping of the disparity map to make sure submeshes close to the borders of the camera image are completely covered. In (a), the disparity map was not cropped, and in (b), the borders of the disparity map were uniformly cropped by a width of 24 pixels.

List of Tables

4.1 The average computation times for the three main processes of the pipeline. Frames is the total number of stereo observations in the data set, and voxel scale is the size of each voxel at the finest resolution scale. Note that there are two separate benchmarks for the lookup process: the naïve implementation and the acceleration strategy proposed in Section 3.4.

1 Introduction

The purpose of this report is to present the thesis work with the goal of finding a method for fusing stereo measurements into a global 3D representation in a real-time application. In this chapter, the problems and the general topics of the report are introduced, followed by the aim of the work and the research questions that were formed to help achieve it.

1.1 Motivation

The challenge of real-time 3D reconstruction from live camera feeds is an active research area in the fields of computer vision and robotics. The problem known as simultaneous localization and mapping (SLAM) has emerged as a popular method for solving navigation problems of robots in unknown environments (e.g. obstacle detection and path planning), and has been successfully integrated with real-time 3D reconstruction frameworks [1, 2]. SLAM keeps track of an agent's location by estimating sensor (e.g. camera) poses while simultaneously reconstructing a map of the area. Furthermore, the reconstruction of polygonal meshes can be useful for real-time 3D-scanning purposes (e.g. volume estimation of objects) and for mapping large spaces such as the interior/exterior of a building, the seabed, or hazardous zones.

Image-based 3D reconstruction is a passive method, meaning that the sensors do not interfere with the reconstructed environment, and can be achieved by using a sequence of images captured by either a single camera, a specialized depth camera (RGB-D), or a calibrated stereo camera pair [3]. This work is mainly focused on 3D reconstruction using binocular stereo vision, which obtains 3D geometric information from the disparity between two calibrated cameras (similar to human eyes) mounted on a shared rig.

Saab Dynamics AB has developed a SLAM-based real-time stereo reconstruction system, which can reconstruct multiple polygonal meshes into a fully textured 3D environment. Its current global visual representation consists of multiple reconstructed polygonal meshes, which overlap each other as the scene grows over time. The meshes are selected keyframes, relative to the binocular camera motion, derived from preprocessed disparity maps. The preprocessing involves an incremental merging of multiple stereo measurements, where only consistently detected geometric features are saved, which results in a weighted average with reduced noise. Although the individual meshes are fairly accurate, the method of overlapping them causes visual artifacts and consumes unnecessary space in memory.

This thesis work was a collaboration with Saab Dynamics, and aimed to improve their system's current global representation by fusing the overlapping keyframe meshes into a joint model, while keeping the most accurate measurements. Consequently, the 3D reconstruction should be refined over time as more reliable features are measured. Newcombe et al. [1] demonstrated that using a weighted signed distance function (SDF) for fusing geometry into a global representation is an effective approach for real-time 3D reconstruction. Furthermore, by storing the SDF in a hierarchical data structure, such as an octree, time and memory complexity can be significantly reduced compared to a uniformly divided grid of voxels [4]. Inspired by these papers, this work set out to utilize an SDF integrated in a hierarchical data structure to fuse an arbitrary sequence of stereo measurements into a global volumetric representation. New observations should dynamically expand the global volume, and existing geometric features should be updated when more accurate ones are observed. The final rendering would be done by extracting a mesh from the implicit surface expressed as an SDF. Figure 1.1 shows an example of a scene, visualized as a point cloud, that has been reconstructed and fused into a global 3D representation using the method proposed in this report.

Figure 1.1: The point cloud of a global 3D reconstruction represented as an octree-based signed distance field. The scene was reconstructed from a sequence of 2500 frames using a publicly available dataset of RGB-D frames and camera pose tracking.

1.2 Aim

The aim of the work was to find and implement a complete method for fusing a sequence of stereo measurements into a global mesh-based representation. The geometry, with affiliated textures, should be continuously refined as new and more accurate features are collected. It was important that the most reliable measurements be kept in the global model, so that it would converge to a representation with optimal accuracy. The chosen method should be usable in a real-time application, meaning that it should be amenable to parallelism and future performance improvements.

The evaluation of the method was mainly done by comparing the visual quality of the proposed method with that of the current method used by Saab Dynamics. The qualitative evaluation was done by subjectively analyzing the reconstruction in terms of visual artifacts and other details of both mesh and textures. Furthermore, benchmarking tests of the pipeline were performed to evaluate the performance with regard to real-time usability.

1.3 Research Questions

To achieve the aim of the proposed work, the following three research questions were formed.

1. How can a signed distance function (SDF) be utilized for fusing a sequence of stereo measurements into a global representation? What data structure is appropriate for the SDF, in terms of time and memory complexity, if used in a real-time application with potentially large-scale scenes?

2. The global reconstruction should be refined as more accurate measurements are accumulated over time. How can the same data structure as in (1) be adapted to achieve this?

3. How can a fully textured triangle mesh be incrementally extracted (based on updated regions) from a global volumetric representation (SDF) in a real-time application?

1.4 Delimitations

Due to numerical errors in the camera pose estimations, the incremental 3D reconstruction produced by the current system of Saab Dynamics tends to accumulate drift over time. This issue becomes more apparent in larger scenes where the camera is swept in a closed cycle, e.g. moving around a building, making previously estimated geometric features misalign with new estimations of the same location. A delimitation for this work is that no attempt was made to correct this accumulated drift.

Another delimitation is that the actual rendering process was not taken into account in the design of the pipeline; only the generation of a fully textured polygonal mesh that can be used for the rendering is covered. Although the camera calibration, image preprocessing, and disparity map construction processes of a stereo reconstruction system are crucial to the quality of the final 3D model, the focus of this thesis work was on the 3D reconstruction, and on the data structure used for representing and fusing point clouds into a global volumetric model. The pipeline of the proposed method requires camera pose estimations together with stereo observations, produced by an external system, as input data.

2 Theory

In this chapter, the theoretical foundation of the thesis work is covered. First, a brief summary of related research that inspired the project is presented, followed by the relevant theory behind the proposed method.

2.1 Related Work

Real-time image-based 3D reconstruction can be formulated as the problem of tracking a handheld camera while simultaneously creating a 3D surface of the observed geometry from a sequence of images. In live applications, this problem is mostly solved using simultaneous localisation and mapping (SLAM), while offline methods are usually based on structure from motion (SFM). The following papers are all examples of working real-time 3D reconstruction systems that use SLAM methods for camera pose estimation.

With the release of a cheap depth sensor known as the Kinect, Newcombe et al. [1] proposed a method called KinectFusion, a real-time 3D reconstruction system that performs dense SLAM from a live stream of captured depth images. Their method utilizes a signed distance function (SDF) to represent the geometry of the 3D scene, which is updated incrementally using a simple weighted running average scheme, originally proposed by Curless et al. [5]. Although the approach by Newcombe et al. [1] produced a dense, consistent 3D scene in real time, the SDF data structure was based on a uniform voxel grid, rendering the method unsuitable for larger scenes (considering the cubically increasing memory consumption). Additionally, the real-time reconstruction and rendering were only possible utilizing complex GPU parallelization techniques, which makes their method limited to devices equipped with modern graphics hardware.

To solve the time and memory complexity problem of using a uniform voxel grid, hierarchical data structures can be used. Zeng et al. [4] integrated an octree into the KinectFusion system, and consequently managed to greatly enhance the performance as well as reduce memory consumption by up to 90% compared to the original method. The surface prediction and rendering were done by GPU-accelerated ray casting, similar to the original KinectFusion method. Additionally, by utilizing their octree structure, they introduced a novel approach for removing tree nodes in dynamic scenes with moving objects to avoid memory overflow.

Instead of using an octree representation, Kähler et al. [6, 7] used a hash table data structure called voxel block hashing, first proposed by Nießner et al. [8].

Similar to the octree-based methods, their approach involves an SDF to represent the 3D surface. The voxel block hashing structure consists of 8³ voxel grids in sparsely allocated regions around the observed surfaces of the scene. The hash table structure guarantees O(1) time complexity for looking up the voxel blocks, and the result can be visualized by direct volume rendering using ray casting techniques. In their extended work [7], they present an adaptive resolution approach based on the local curvature of a surface, motivated by the fact that flat surfaces require fewer triangles than more complex shapes.

Finally, Steinbrücker et al. [9] showed that an octree-based SDF system is possible in real time using only the CPU. They proposed a multi-scale resolution framework where each level of the octree nodes stores blocks of 8³ voxel grids with corresponding scaling. The result is a hierarchy of stacked bricks, where a parent node contains coarser voxels with twice the size of those of its octant child. The hierarchical scales are computed using the distance between the camera and the reconstructed surface, motivated by the fact that the measurement error grows quadratically with increasing depth [10]. To enable real-time performance using only the CPU, Steinbrücker et al. [9] utilized the well-known marching cubes algorithm to incrementally extract a triangle mesh from the SDF. Their proposed mesh extraction method can run asynchronously on a separate CPU core at 1 Hz at their highest resolution of 5 mm per voxel.

2.2 Stereo Reconstruction

Stereo reconstruction, more commonly referred to as triangulation in the field of computer vision, is the process of calculating a point in 3D space from 2D images using two or more cameras. In this project the focus was strictly on binocular stereo reconstruction, which involves two calibrated cameras statically mounted on a shared rig. Ideally, the camera pair should be separated by a narrow distance on parallel optical axes to achieve a distinct disparity effect. The idea is to utilize the perceptions of two different viewing angles to estimate an object's location in 3D space, similar to how human eyes function (stereopsis) [11]. An illustration of the elementary geometry used in stereo triangulation is shown in Figure 2.1.

Figure 2.1: An illustration of the stereo geometry of a calibrated camera pair. The baseline separation and the different viewing angles of the cameras result in the notion of disparity between two corresponding 2D projections of the same point in 3D space.

Typically, the full process of determining a 3D point using stereo reconstruction involves a chain of steps:

1. Calibration of the camera pair to acquire intrinsic and extrinsic parameters (a preparation step done before the actual reconstruction loop).
2. Capturing and preprocessing of the 2D stereo images.
3. Construction of a disparity map from rectified stereo images.
4. Reconstruction of a 3D point using the disparity map and the camera calibration parameters.

Repeating step 4 for all points in the disparity map results in a dense point cloud of the current observation, the most basic geometric representation of a full 3D reconstruction. The camera calibration can be done by letting the camera pair observe a complex planar pattern from multiple different viewing angles to detect feature points, as proposed by Zhang [12]. This process involves the estimation of five intrinsic parameters, which include the camera focal length and the principal point (image center), as well as determining the extrinsic parameters used to describe the 3D coordinate system transformations from world to local camera space. The parameters are finally refined by maximum-likelihood inference. Since the outline of this thesis work only covers a subsystem of a full stereo reconstruction system, the in-depth details of steps 1-3 are not relevant to the scope of this report and will not be further explained.

Depth From Disparity

A disparity map is a joint representation of two rectified stereo images (left and right camera), where each value represents the displacement of two corresponding pixel locations, p_L and p_R, in each respective image. Note that both stereo images have their own disparity map, but only one of them (typically the left camera's) is needed for the triangulation process. Intuitively, a large displacement (disparity) means that a surface point is close to the camera pair, and a small displacement means that the point is located far away. It can be shown that the disparity is inversely proportional to the depth value of a surface point in 3D space. Figure 2.2 shows the bird's-eye view of two rectified camera images, separated by the baseline distance, b = 2h, and how a point, P, in 3D space can be determined from the corresponding 2D image projection points, p_L and p_R, using elementary stereo geometry [13].

Figure 2.2: A geometric illustration of how the depth, z, of a 3D point, P, can be derived from the disparity, d = u_R - u_L, of two rectified stereo images. C_L and C_R represent the left and right camera pinhole locations, and f is the focal length.

In the 3D space coordinate system displayed in Figure 2.2, x is the horizontal coordinate (where x = 0 is exactly in between the camera pair pinhole locations, C_L and C_R), y is the vertical coordinate (pointing into the figure), and z represents the depth value (distance from the cameras). The image planes have their own local 2D coordinate systems measured from the center of the images, and for two corresponding projected points, p_L and p_R, the height coordinate, v, is always the same, i.e. v_L = v_R. Given the focal length f and the half baseline distance h = b/2 (both parameters derived from the camera calibration), and the horizontal image coordinates, u_L and u_R, the following relations with the 3D point can be expressed:

    u_L / f = -(h + x) / z,    u_R / f = (h - x) / z.    (2.1)

By eliminating the x-coordinate in Equations 2.1, they can be written as

    z (u_R - u_L) = 2hf,    (2.2)

and finally, by solving for z, the depth is calculated by

    z = 2hf / (u_R - u_L) = bf / d,    (2.3)

where d is the disparity value. Equation 2.3 shows that the disparity is inversely proportional to the depth, e.g. if u_R - u_L → 0, then z → ∞. Since a surface point cannot actually be at an infinite distance, this effectively means that the relative error of the depth, z, grows with the measured depth (smaller disparity). The final two coordinates, x and y, of the point, P, can be determined using Equations 2.4:

    x = -b (u_L + u_R) / (2d),    y = b v_L / d.    (2.4)
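As a concrete illustration of Equations 2.1-2.4, a single pixel of the disparity map could be triangulated as sketched below. This is a minimal sketch, not code from the prototype; the function name, the Point3 type, and the convention that image coordinates are measured from the image center are assumptions made for the example.

// A minimal sketch of Equations 2.3 and 2.4: triangulating one pixel of the
// left image, with coordinates (uL, vL) measured from the image center, into
// a 3D point. b is the baseline, f the focal length, d the disparity value.
#include <optional>

struct Point3 { double x, y, z; };

std::optional<Point3> triangulate(double uL, double vL, double d,
                                  double b, double f)
{
    if (d <= 0.0)
        return std::nullopt;          // zero disparity corresponds to infinite depth
    const double uR = uL + d;         // from d = uR - uL
    Point3 p;
    p.z = b * f / d;                  // Equation 2.3
    p.x = -b * (uL + uR) / (2.0 * d); // Equation 2.4
    p.y = b * vL / d;                 // Equation 2.4
    return p;
}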

2.3 Signed Distance Function

The signed distance function S_Ω(x) is a simple concept where the value at location x describes the closest distance to an implicit surface (or boundary) of a subset Ω in a metric space [14]. For all points x located in Ω, the function is negative, and for all points outside of Ω, the function is positive. Note that there is no strict consensus on whether the space outside the surface should be negative or positive; the important part is that they differ (hence the use of a signed distance). More formally, the signed distance function can be expressed as in Equation 2.5:

    S_Ω(x) = -d(x, ∂Ω)  if x ∈ Ω,
    S_Ω(x) =  d(x, ∂Ω)  if x ∈ Ω^C,    (2.5)

where d(x, ∂Ω) is the distance between the point x and the boundary ∂Ω, and Ω^C is the complement of the set Ω.

Discrete Sampling

Analytically, the boundary is located where the signed distance function is exactly zero. However, in a numerically integrated setting, where a finite volume is sampled into a grid, the location of the boundary is instead defined by the zero crossings, see Figure 2.3. The figure shows an example of sampling an implicit circle into a discrete signed distance field with a grid resolution of 20² cells, using the following signed distance function:

    S(x) = ||x - c||_2 - r,    (2.6)

where x is the center location of the grid cell, c is the center location of the circle, r is the radius of the circle, and ||·||_2 is the Euclidean norm. Note that the figure does not visualize the distance to the boundary, only the sign that separates the circle from empty space.

Figure 2.3: An illustration of an implicit circle sampled into a two-dimensional signed distance field. Positive values represent space outside of the circle, and negative values space inside the surface.

In a discrete three-dimensional space, a signed distance field is typically represented by a uniform grid of voxels, where each voxel with center location x stores the signed distance function, S_Ω(x), of an implicit surface, Ω. By increasing the resolution, i.e. more and finer voxels in the same space, the sampled surface should converge to its implicit definition.
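As a small worked example of Equation 2.6, the circle in Figure 2.3 could be sampled as follows; the grid layout and function name are illustrative assumptions, not taken from the prototype.

// A minimal sketch of Equation 2.6: sampling the signed distance of an
// implicit circle into an n x n grid of cells (row-major). Negative values
// lie inside the circle, positive values outside, as in Figure 2.3.
#include <cmath>
#include <vector>

std::vector<double> sampleCircleSdf(double cx, double cy, double r,
                                    double cellSize, int n)
{
    std::vector<double> field(n * n);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            const double x = (i + 0.5) * cellSize;  // cell center
            const double y = (j + 0.5) * cellSize;
            // S(x) = ||x - c||_2 - r
            field[j * n + i] = std::hypot(x - cx, y - cy) - r;
        }
    }
    return field;
}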

2.4 Octree Data Structure

Although the foundation of the octree structure is well established, it can be represented in various ways. The octree conventions covered in this section reflect the implementation for the thesis work presented in this report.

At its core, an octree is a hierarchical tree graph data structure, the three-dimensional equivalent of a two-dimensional quadtree and a one-dimensional binary search tree. From the root of the tree, each node in a binary search tree can be split into at most 2¹ = 2 children, a quadtree node can be subdivided into at most 2² = 4 children, and an octree node can be subdivided into at most 2³ = 8 children. Each node of an octree is represented by a symmetric axis-aligned bounding box which can be uniformly subdivided into eight octants with half its side length [15]. This hierarchical decomposition of three-dimensional space makes octrees popular in the fields of computer graphics, whether for isosurface extraction [16], direct volume rendering [17], accelerated ray tracing applications [18], or 3D reconstruction frameworks such as the one proposed in this report. For the purpose of the work presented in this report, the octree was mainly used for sparse storage of the geometric data of an implicit surface defined by a signed distance function (SDF), see Section 2.3. In the case of a signed distance field, the majority of the cells represent empty space, which effectively is wasted memory.

An example of how the hierarchical structure can reduce memory consumption compared to a uniform cell grid is illustrated in Figure 2.4. The figure demonstrates how the areas that surround the geometry can be uniformly and recursively subdivided into smaller cells of any desired scale, for a more sparse memory allocation. In the two-dimensional example with a sample grid of 64² = 4096 cells, shown in Figure 2.4, the number of cells could be reduced to 724 by using a quadtree structure, saving up to 82% in memory consumption.

Figure 2.4: A two-dimensional demonstration of how a quadtree can save space by only allocating geometric information in a narrow region around the boundary of a circle. The upper left quadrant shows the hierarchical tree structure in comparison to the other quadrants of uniform grids.

Terminology

The terminology of the octree is similar to that of most tree graph structures. The root node of the tree represents the top layer of the hierarchy, and is the only node that has no parent. In an octree with depth n, the root is said to be located at level 0, and the deepest tree node is located at level n - 1. Each level represents a branching of the previous level, where parent nodes are subdivided into at most eight children. A node with no children is referred to as a leaf node, and otherwise, a branch node. An octree where all nodes at each tree level, except the leaf nodes, have eight children is called a complete, or full, octree.

Search and Insertion Operations

Another advantage of the octree, besides saving memory, is that it can be used for fast searching by utilizing binary decomposition of the search space along each of the three axes [15]. When searching for or inserting a node in an octree, the search space is reduced by 7/8 for each tree level, effectively resulting in a time complexity of O(log N). This is extremely efficient considering that large 3D environments can potentially consist of millions of voxels. The search always begins from the root node, takes a point, P, as input, and returns the corresponding node in which the point is located.

The first step is to check whether the point, P, is within the bounding box of the root node. If it is not, the search can be dismissed; otherwise, the search will recursively continue until the desired tree depth has been reached.

Figure 2.5: Each node in an octree, starting from the root node, is represented by an axis-aligned bounding box that can be decomposed into eight octants with half its side lengths. Figure (a) illustrates how an octree node can be decomposed into eight octants with eight different indices, and (b) shows the tree graph representation of the same decomposed node.

Figure 2.5 shows how an octree node can be uniformly subdivided into eight octants with eight different index IDs. Using these indices, the process of finding out which octant a point, P = (x_P, y_P, z_P), is located within is straightforward. The octant index, I_oct, can be calculated using three step functions (one for each coordinate axis), as shown in Equations 2.7 to 2.10:

    I_oct = 4 · x_sign + 2 · y_sign + z_sign,    (2.7)

    x_sign = 1 if x_P - x_c > 0, otherwise 0,    (2.8)
    y_sign = 1 if y_P - y_c > 0, otherwise 0,    (2.9)
    z_sign = 1 if z_P - z_c > 0, otherwise 0,    (2.10)

where the coordinates (x_c, y_c, z_c) represent the center location of the node. Once the octant index has been determined, the corresponding octant can be decomposed in the exact same manner as explained in the previous step, to continue the search. Note that the difference between searching for and inserting a node in an octree is that a search will stop if a node cannot be found, while the insertion operation will expand the tree to make room for the non-existing node.
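Equations 2.7-2.10 translate directly into a few comparisons; a minimal sketch (the names are illustrative, not the prototype's):

// A minimal sketch of Equations 2.7-2.10: computing the octant index of a
// point P relative to an octree node centered at (xc, yc, zc).
int octantIndex(double xP, double yP, double zP,
                double xc, double yc, double zc)
{
    const int xSign = (xP - xc > 0.0) ? 1 : 0;
    const int ySign = (yP - yc > 0.0) ? 1 : 0;
    const int zSign = (zP - zc > 0.0) ? 1 : 0;
    return 4 * xSign + 2 * ySign + zSign;  // Ioct in the range [0, 7]
}

Descending one level per call gives the O(log N) search described above; insertion follows the same path but allocates the missing octants instead of stopping.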

2.5 Marching Cubes

The marching cubes algorithm, proposed by Lorensen et al. [19], is a method for extracting the triangle mesh of an isosurface in a discrete volumetric scalar field. The idea is to take a discrete three-dimensional scalar field, let the spatial coordinates of the scalar values represent the vertices in a grid of logical cubes, and then determine how an isosurface intersects the edges of said cubes, see Figure 2.6.

Figure 2.6: An illustration of how the spatial coordinates of a discrete scalar field can be used to form a grid of cubes. The different colours represent different scalar values, and the surface intersections are determined by linear interpolation between vertices that share edges. The purple-coloured triangle mesh is an isosurface extracted from the scalar data field.

The algorithm processes one logical cube at a time by marching through neighbouring scalar data. Each cube contains eight corner vertices with associated scalar values that can be binary classified: either larger or lower than a user-specified threshold value (i.e. the isolevel to extract). The rule is simple: if two vertices that share an edge of the cube have different states, the surface must intersect that edge. Given the binary classifications (two states) of the eight cube vertices, there are a maximum of 2⁸ = 256 different ways an isosurface can intersect a cube [19]. By assigning indices to all 256 cases, a lookup table for the corresponding surface-edge intersections of each case can be created.

Although it is possible to generate triangle patterns for all 256 cases, this can be an error-prone and tedious task. Fortunately, by utilizing complementary and rotational symmetry, the total number of cases can be reduced to only 15 unique cases, shown in Figure 2.7. Case 0 is the base case where all corner values are higher, or lower, than the desired isolevel, meaning that the entire cube is either completely inside or outside of the isosurface. In case 1, only one of the corners is located inside, or outside, of the isosurface, resulting in three edge intersections and one triangle, and so forth. Permutations, regarding complementary and rotational symmetry, of the 15 cases shown in Figure 2.7 can produce all 256 possible cases.

Figure 2.7: An illustration of the 15 unique cases in the marching cubes algorithm published in the original article by Lorensen et al. [19]. Highlighted corners represent state one, and the others represent state two. Note that the states are reversely interchangeable, since each case is symmetrically identical to its complementary case.
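The 256-case lookup is typically driven by an 8-bit index built from the corner classifications. A minimal sketch of that classification step follows; the corner ordering is an assumption here and must match whichever lookup table is used:

// A minimal sketch: build the 8-bit marching cubes case index from the eight
// corner scalars of one cube. Bit i is set when corner i lies below the
// isolevel; indices 0 and 255 are the trivial cases with no intersection.
int cubeCaseIndex(const double corners[8], double isolevel)
{
    int caseIndex = 0;
    for (int i = 0; i < 8; ++i)
        if (corners[i] < isolevel)
            caseIndex |= (1 << i);
    return caseIndex;  // used to index the 256-entry edge/triangle tables
}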

Once the surface-edge intersections have been determined from the lookup table, the intersection points, P_k = {p_0, p_1, ..., p_n}, can be calculated using linear interpolation, as described by Equations 2.11 and 2.12:

    p_i = v_1 + k (v_2 - v_1),    (2.11)

    k = (α - s_1) / (s_2 - s_1),  s_1 ≠ s_2,    (2.12)

where s_1 and s_2 are the scalar values of the edge vertices v_1 and v_2 respectively, and α is the isolevel. The coefficient k represents the linear weight of how far the point p_i is located from the vertex point v_1 along the vector given by v_2 - v_1, as illustrated in Figure 2.8. For example, k = 0 ⟹ p_i = v_1, and k = 1 ⟹ p_i = v_2. Note that if a surface-edge intersection has been found, the domains of the scalar values, s_1 and s_2, must be separated by the iso-value α, such that s_1 ≤ α < s_2, or s_2 ≤ α < s_1. The intersection points in P_k correspond to a subset of the triangle vertices used in the final polygonal mesh.

Figure 2.8: An illustration of the interpolation between two points, v_1 and v_2, of a cube edge, based on the weight of the isolevel, α, between the scalar values, s_1 and s_2.

To enable shaded rendering, e.g. with the Phong lighting model, the final step is to calculate the surface normals for all triangle vertices in P_k. Lorensen et al. [19] calculate the gradient as the derivative of the density function of the surface, which can be estimated using central differences along each of the three axes. Another approach is to compute the face normals of each triangle, based on the fact that the cross product of two non-parallel (linearly independent) vectors in the same plane produces a vector that is orthogonal to both, and thus normal to the plane; the vertex normals are then determined by simply averaging the normals of the neighbouring faces of each vertex. Let v_0, v_1, and v_2 be the three vertices of a triangle; then p = v_1 - v_0 and q = v_2 - v_0 are two non-parallel vectors in the same plane. The normal vector, n, of the triangle can then be calculated using the cross product of p and q, see Equation 2.13:

    n = p × q = (p_y q_z - p_z q_y,  p_z q_x - p_x q_z,  p_x q_y - p_y q_x).    (2.13)

Finally, the union of all sets of vertices, P_k, for all cubes k = 0, 1, ..., N in the discrete scalar field can be used to represent a complete polygonal surface mesh.
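A compact sketch of Equations 2.11-2.13 (the Vec3 type and the function names are assumptions made for the example):

// A minimal sketch of Equations 2.11-2.13: interpolating the surface crossing
// along one cube edge, and computing a triangle face normal.
struct Vec3 { double x, y, z; };

// Equations 2.11 and 2.12: point on the edge (v1, v2) where the scalar field
// crosses the isolevel alpha; requires s1 != s2.
Vec3 interpolateEdge(const Vec3& v1, const Vec3& v2,
                     double s1, double s2, double alpha)
{
    const double k = (alpha - s1) / (s2 - s1);
    return { v1.x + k * (v2.x - v1.x),
             v1.y + k * (v2.y - v1.y),
             v1.z + k * (v2.z - v1.z) };
}

// Equation 2.13: face normal as the cross product of two triangle edges.
Vec3 faceNormal(const Vec3& v0, const Vec3& v1, const Vec3& v2)
{
    const Vec3 p{ v1.x - v0.x, v1.y - v0.y, v1.z - v0.z };
    const Vec3 q{ v2.x - v0.x, v2.y - v0.y, v2.z - v0.z };
    return { p.y * q.z - p.z * q.y,
             p.z * q.x - p.x * q.z,
             p.x * q.y - p.y * q.x };
}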

3 Method

Since the main goal of this project was to solve the practical problem of fusing stereo measurements into a global 3D representation, the majority of the work was allocated to the implementation of a prototype. The two main problems to solve were:

1. to choose and implement a data structure for storing and fusing 3D geometric data in a real-time application;
2. to choose and implement a method for visualizing the global 3D representation.

In this chapter, first an overview of the work and thought process regarding the implementation is presented, and then the specific details behind the proposed method are explained. Some details of the method are explained with mathematical expressions, others with algorithms, and some with figures. The purpose of the algorithms is to give the reader the general idea of a process; implementation-specific details are left out.

3.1 The Work Process

The preparatory work process consisted of researching related state-of-the-art methods for real-time 3D reconstruction, and deciding on a theoretical approach relevant to the aim of the thesis project. Since the ambition was to improve the system developed by Saab Dynamics, studying the source code and other corresponding dependencies was also necessary.

In summary, the goal of this project was to find a way to fuse 3D geometry into a global representation and then render it, meaning that it is based on the assumption that all necessary measurements are ready and available from previous stages in the pipeline. The expected input information, per frame, consists of a disparity map, camera-to-world transforms (e.g. camera pose estimations derived from a SLAM system), RGB values for texturing, as well as intrinsic and extrinsic attributes derived from the camera calibration. These parameters can then be used to calculate local 3D points relative to the camera image plane, and then transform them into global world coordinates. Converting the measurements into 3D points results in a point cloud of the 3D geometry, the most basic 3D reconstruction representation.

The first task was to implement the main data structure, an octree-based signed distance function (SDF), used for fusing the incoming stereo measurements into a joint volumetric representation.

Normally, an SDF consists of a uniformly divided grid of voxels, where each voxel stores the closest signed distance to an implicit surface. Negative values denote space inside the surface, positive values space outside, and the zero crossings represent the isosurface itself [20]. The motivation for storing the SDF in a hierarchical structure, such as an octree, is based on the fact that the vast majority of the voxels in a grid are marked as empty space (inside or outside the implicit surface), which effectively is wasted memory. The hierarchical tree structure of an octree (the three-dimensional equivalent of the one-dimensional binary tree) can be used to sparsely allocate space in a narrow region around the surface points, decreasing the memory consumption by up to 90% compared to a uniform grid [4]. Similar to a binary tree, searching/inserting nodes in an octree is performed with O(log N) time complexity, which is crucial in a real-time application.

Once the data structure was complete, the focus was on implementing a mesh extraction method for rendering purposes. Although it is common practice to render the final reconstruction using ray casting visualization techniques [1, 4, 6, 7, 2], the ambition for this project was to use a triangle mesh-based approach. Mesh structures are more appropriate for resource-limited devices without powerful graphics hardware (e.g. quadcopters or smartphones), and enable the possibility to view the reconstruction on a base station using any desired virtual camera pose. For this project, the marching cubes algorithm was chosen, as proposed by [5, 9, 21]. Marching cubes is simple to implement, and by utilizing a lookup table for resolving ambiguous cases, the algorithm can also perform extremely fast.

The prototype of the proposed method was developed from scratch using C/C++, since the original system by Saab Dynamics was primarily developed in C. Using a low-level language also enabled good control over the parallelization of CPU tasks, as well as memory allocation management, which is highly beneficial when developing a real-time system.

3.2 System Outline

The pipeline of the proposed method is similar to the work presented by Steinbrücker et al. [9], and can be separated into three main processes: looking up nodes in the octree, surface fusion update, and mesh extraction. The proposed system requires an estimated camera pose (rotation and location), together with a stereo observation (disparity map and an RGB image), as input data for each iteration. The three processes are illustrated in Figure 3.1, and can be summarized as follows:

1. the camera measurements are first transformed into 3D surface points, then these points are used to find the corresponding voxel blocks of the sample space in the octree, which are finally added to an update queue;
2. for each voxel in the blocks contained in the update queue, project the 3D coordinates to the camera image plane, calculate the signed distance value and perform a cumulative update of that value to fuse it with previous measurements (see the sketch below);
3. for all updated voxel blocks, apply the marching cubes algorithm to extract a triangle mesh from their implicit surface representations (the SDF).

Figure 3.1: The main pipeline of the proposed method visualized as a block diagram. The pipeline can be separated into three contiguous processes for each incoming stereo observation.
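The cumulative update in step (2) follows the weighted running average scheme of Curless et al. [5] referred to in Section 2.1; per voxel it could look like the sketch below. The Voxel layout, field names and weighting are illustrative assumptions, not the prototype's actual implementation.

// A minimal sketch of the cumulative weighted average used to fuse a new
// signed distance observation into a voxel (running average scheme [5]).
struct Voxel {
    float sdf    = 0.0f;  // fused signed distance
    float weight = 0.0f;  // accumulated confidence weight
};

void fuseObservation(Voxel& v, float observedSdf, float observationWeight)
{
    // New value is the weighted mean of the old estimate and the observation.
    v.sdf = (v.weight * v.sdf + observationWeight * observedSdf)
          / (v.weight + observationWeight);
    v.weight += observationWeight;
}

With a constant observation weight, the fused value converges to the arithmetic mean of all observations, which is what reduces the noise of the individual stereo measurements over time.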

Note that each stereo observation represents one time frame of a contiguous sequence of incoming data. The input sequence can be captured either from a prerecorded video clip or from a live camera feed. The purpose of the proposed system was to fuse multiple individual stereo measurements into a global 3D representation, meaning that it can only be used as a subsystem of a full reconstruction system that handles the camera pose estimation and stereo measurements on the side, e.g. with visual SLAM. Alternatively, a dataset that provides all required input data would also be sufficient. The rendering can be done with any 3D engine that can draw triangle meshes using vertex color data or texture sampling.

3.3 The Data Structure

To represent the reconstructed surface as a volumetric SDF, the surface measurements need to be integrated in some form of voxel grid. The traditional way of doing this is to define a uniform grid of voxels [1], where each voxel represents the closest distance to an implicit surface. However, with this approach, the time and memory complexity grow cubically with higher resolution, or volumetric space required. Furthermore, when using a uniform grid, modifying the size dynamically as the scene grows also becomes problematic, since the entire space of the scene is defined by the SDF grid. To solve these issues, the SDF was integrated into an octree data structure, described in Section 2.4.

The most straightforward way of structuring the voxel samples in an octree is to let individual voxels be stored as leaf nodes at the deepest level of the tree, as proposed by Zeng et al. [4]. This approach results in a collection of sparsely allocated voxel blocks that requires a large number of tree traversals when accessing neighbouring voxels (which is required during mesh extraction). Instead, the method proposed in this work follows the idea of representing the leaf nodes as building blocks of 8³ contiguous voxels, as presented in [8, 7, 2]. By using this structure, the voxel blocks can be treated as thread-safe batches of multiple signed distance fields, which are trivially parallelizable. A logical illustration of the tree hierarchy is shown in Figure 3.2.

Figure 3.2: An illustration of the hierarchical tree structure in which the implicit surface is represented. Each branch can be subdivided into at most eight octant children, where only leaf nodes store the voxel blocks containing the geometric data (SDF).
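A minimal sketch of such a node layout, where branch nodes hold octant children and only leaf nodes own a block of 8³ voxels; this is an illustrative layout, not the prototype's actual classes:

// A minimal sketch of an octree node whose leaves own blocks of 8*8*8
// contiguous voxels, as described above.
#include <array>
#include <memory>

struct Voxel {
    float sdf    = 0.0f;  // fused signed distance
    float weight = 0.0f;  // accumulated confidence weight
};

struct OctreeNode {
    // Axis-aligned bounding box: center and half side length.
    double cx = 0.0, cy = 0.0, cz = 0.0, halfSize = 0.0;
    // Octant children; all empty for a leaf node.
    std::array<std::unique_ptr<OctreeNode>, 8> children;
    // Voxel block with the geometric data; only allocated on leaf nodes.
    std::unique_ptr<std::array<Voxel, 8 * 8 * 8>> block;
};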

To further illustrate how the SDF voxel blocks work as sparse building blocks in a 3D environment, a sphere was sampled using the proposed data structure, see Figure 3.3. Note that the actual signed distance values are not visualized in the figure; it should only serve as a means of demonstrating how the voxel blocks are allocated around the actual surface. For comparison, the same visualization method applied to a uniform grid would simply show the entire space filled with coloured dots, with no resemblance of a sphere.

Figure 3.3: A demonstration of how the voxel blocks are allocated in narrow regions around the surface of a sphere using the proposed octree structure of SDF voxel blocks. Note that the coloured dots represent individual voxels, and are not affected by the actual signed distance values. Figure (a) shows the surface of a sphere sampled into blocks of 8³ contiguous voxels, and (b) is a sliced view of the same sphere visualizing 8² voxels per block.

Multi-scale Resolution

As briefly mentioned in Section 2.2, in a stereo-based reconstruction system it can be shown that the depth error grows quadratically with the measured depth z (relative camera-to-surface distance), according to Equation 3.1 [10]:

    e_z = (z² / (b f)) · e_d,    (3.1)

where z = bf/d is the depth of a triangulated point with disparity value d, b is the baseline between the camera pair, f is the focal length of the camera, and e_d is the disparity error.

Similar to the framework proposed by Steinbrücker et al. [9], a multi-scale resolution system based on the measured camera-to-surface depth was implemented. However, instead of assigning voxel blocks at every level of the octree, a simpler approach involving mipmapped representations of the 8³ voxel blocks was used. This effectively results in up to three resolution scales: 8³ voxels at the finest scale, 4³ voxels at the intermediate scale, and 2³ at the coarsest scale, see Figure 3.4. Using this approach, all voxel block data still reside in the leaf nodes of the octree, making surface fusion and mesh extraction significantly more straightforward compared to the stacked brick system by Steinbrücker et al. [9]. Logically, the three different resolutions lie on top of each other, which results in a full octree of three levels, without the need to traverse the tree between parent and child nodes. One voxel from the coarsest scale 2³ can be subdivided into eight finer voxels with half the size, and so forth.

Figure 3.4: An illustration of the three mipmapped resolutions of the same voxel block, containing 8³, 4³, and 2³ voxels respectively. Notice that the side lengths of the block stay the same; only the grid resolution is affected. The parameter s represents the resolution scale, where a lower value equals finer resolution.

The resolution scale is determined by the distance between the camera and the measured surface, using a staircase function defined by Equation 3.2:

    s(z_t) = \begin{cases} 1, & \text{if } z_t \le 1 \\ 2^{\lfloor \log_2 z_t \rfloor}, & \text{if } z_t > 1 \end{cases}    (3.2)

where z_t is the depth between the camera and the surface point. Since there are only three different resolutions in practice, the scale function is truncated and can only result in three valid scales: s_1 = 2^0, s_2 = 2^1, and s_3 = 2^2. Using these scales, the grid base number B can be conveniently derived by a simple division, as shown in Equation 3.3:

    B^3 = \left( \frac{8}{s(z_t)} \right)^3    (3.3)

In memory, the voxels of a block are represented as a static array of size 8³ + 4³ + 2³ = 584, where the individual flat voxel index I_B of a grid with base B can be determined from its local relative grid coordinates (i, j, k) using Equations 3.4 and 3.5:

    I_B(i, j, k) = i + j \cdot B + k \cdot B^2 + O    (3.4)

    O = \begin{cases} 0, & \text{if } B = 8 \\ 8^3, & \text{if } B = 4 \\ 8^3 + 4^3, & \text{if } B = 2 \end{cases}    (3.5)

The local 3D indexing convention of a voxel block is defined such that the voxel located at the left-top-front cell is given by (i, j, k) = (0, 0, 0).

3.4 Octree Lookup

The octree lookup is the first process in the pipeline presented in Section 3.2. It takes as input a stereo observation, e.g. a disparity map derived from stereo images, together with an estimated camera pose, and produces an update queue of voxel blocks on which to perform the surface fusion and mesh extraction. From the disparity map and the camera transform, corresponding 2D image coordinates are converted into 3D world coordinates using triangulation, as explained in Section 2.2. The resulting surface points are then used to search the octree structure (according to the theory in Section 2.4) to find or allocate matching voxel blocks, which are finally added to an update queue to be sent to the next process. As explained in Section 3.3, each voxel block covers a narrow region around the surface point, sampled as an implicit signed distance field.
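Both the lookup and the later fusion stages address voxels through the block layout of Section 3.3. As a minimal illustration of the truncated scale function and the mipmapped flat indexing (Equations 3.2-3.5), a sketch could look as follows; the function names are assumptions.

```cpp
#include <cmath>

// Equation 3.2, truncated to the three scales used in practice (1, 2, 4).
int resolutionScale(double zt) {
    if (zt <= 1.0) return 1;
    int s = 1 << static_cast<int>(std::floor(std::log2(zt)));
    return s > 4 ? 4 : s;  // clamp to the coarsest scale s3 = 2^2
}

// Equations 3.4 and 3.5: flat index of voxel (i, j, k) in a grid of base
// B = 8 / s(z_t), offset past the finer mipmap levels stored before it.
int flatIndex(int i, int j, int k, int B) {
    int offset = (B == 8) ? 0
               : (B == 4) ? 8 * 8 * 8
                          : 8 * 8 * 8 + 4 * 4 * 4;
    return i + j * B + k * B * B + offset;
}

// Storage per block: 8^3 + 4^3 + 2^3 = 584 voxels, as stated above.
static_assert(8 * 8 * 8 + 4 * 4 * 4 + 2 * 2 * 2 == 584, "mipmap layout");
```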

A simple approach to the full process is presented in Algorithm 1, where W and H are the width and height of the input camera image, d is the disparity map, P_xyz is a reconstructed surface point, and *B is a pointer to the corresponding voxel block of the SDF octree.

Algorithm 1: Octree Lookup
    for u = 0, 1, ..., W do
        for v = 0, 1, ..., H do
            if d(u, v) is valid then
                Calculate P_xyz from d(u, v);
                Use P_xyz to find or allocate *B in the octree;
                Add *B to update queue;
            end
        end
    end

Accelerated Lookup Process

Considering that each reconstructed surface point is sampled into a block of 8³ voxels, there is a high likelihood that adjacent pixels in the camera image end up in the same voxel block, rendering duplicate octree lookups. Intuitively, this probability depends on the distance between the camera and the surface, since (due to perspective projection) surface measurements far away are more sparsely spread and closer measurements more densely packed. Although the time complexity of searching the octree is O(log N), testing showed that performance could be significantly increased by using a simple heuristic: comparing the distances of the surface points in subregions of the input image before the lookup process. In Figure 3.5, 314 surface points could be derived from a subregion of 20² pixels, located across 11 voxel blocks. This means that voxel block 1 can be looked up only once in the octree instead of 115 times, or in summary: 11 total tree searches instead of 314.

Figure 3.5: A histogram showing the frequency of surface points per voxel block in a subregion of the input image. A total of 314 surface points are located across only 11 voxel blocks.
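Expressed in C++, Algorithm 1 might look like the sketch below. The Octree and DisparityMap interfaces are stand-ins assumed for illustration, and the duplicate guard via a hash set is an addition not present in the pseudocode.

```cpp
#include <unordered_set>
#include <vector>

struct Vec3 { float x, y, z; };
struct VoxelBlock;  // as sketched in Section 3.3

// Hypothetical interfaces standing in for the real octree and disparity map.
struct Octree {
    VoxelBlock* findOrAllocate(const Vec3& worldPoint);  // O(log N) search
};
struct DisparityMap {
    int W, H;
    bool valid(int u, int v) const;        // disparity present at this pixel?
    Vec3 triangulate(int u, int v) const;  // pixel -> 3D world point
};

// Direct rendering of Algorithm 1: one octree search per valid pixel.
// The set only prevents queueing a block twice; every pixel still pays for
// a tree search, which is the cost the acceleration strategy below avoids.
std::vector<VoxelBlock*> octreeLookup(Octree& tree, const DisparityMap& d) {
    std::unordered_set<VoxelBlock*> seen;
    std::vector<VoxelBlock*> updateQueue;
    for (int u = 0; u < d.W; ++u) {
        for (int v = 0; v < d.H; ++v) {
            if (!d.valid(u, v)) continue;
            VoxelBlock* b = tree.findOrAllocate(d.triangulate(u, v));
            if (seen.insert(b).second) updateQueue.push_back(b);
        }
    }
    return updateQueue;
}
```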

As a benefit of subdividing the input image into multiple regions, the calculation process can be trivially parallelized, increasing performance even further. Note that the optimal region size depends on the image dimensions as well as the chosen voxel resolution. Too small regions will yield more duplicate lookups, while too large regions will lose the time saved on octree lookups to the process of comparing distances between 3D points.

The process of checking whether neighbouring surface points are located in the same voxel block can be done in various ways. One approach is to first determine all surface points in a subregion from the disparity map. Then, in a nested loop, one surface point is used for looking up a voxel block, and all remaining points are checked to see if they are located inside its bounding box. All points assigned to a voxel block are marked in a list, which prevents them from being used for duplicate lookups in later iterations. The accelerated octree lookup is shown in Algorithm 2, where N_R is the number of subregions of the input image, and N_P is the number of reconstructed surface points in the corresponding subregion.

Algorithm 2: Octree Lookup, Acceleration Strategy
    for r = 0, 1, ..., N_R - 1 do
        Calculate all surface points in region r and add them to P<list>;
        Define a checklist C<list> for the surface points in P<list>;
        for i = 0, 1, ..., N_P - 1 do
            if C[i] is not Checked then
                Mark C[i] as Checked;
                Use P[i] to find or allocate voxel block *B in the octree;
                for j = i + 1, i + 2, ..., N_P - 1 do
                    if C[j] is not Checked then
                        if P[j] is inside the bounding box of *B then
                            Mark C[j] as Checked;
                        end
                    end
                end
                Add *B to update queue;
            end
        end
    end

Histogram Filtering

As a consequence of using the acceleration strategy presented in Section 3.4, a histogram containing the number of surface points per voxel block can be generated for each processed region of the input image. The histogram can be utilized for noise filtering by discarding voxel blocks that contain fewer points than a predefined threshold from the update queue. For example, if the threshold were set to 20 in the case of Figure 3.5, only voxel blocks 1-5 would be added to the update queue. This is done under the assumption that voxel blocks with a low number of surface points do not contribute positively to the quality of the final reconstructed surface. Note that a high threshold results in the removal of surfaces far from the camera, since those measurements are more sparsely spread within a subregion.
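A minimal sketch of the histogram filtering step is shown below, assuming the per-region lookup already counts how many surface points fell into each block; the container choices are illustrative, and the threshold of 20 mirrors the Figure 3.5 example rather than a tuned value.

```cpp
#include <unordered_map>
#include <vector>

struct VoxelBlock;  // as sketched earlier

// Sketch of the histogram filter: given the per-region count of surface
// points per block, queue only the blocks at or above a noise threshold.
std::vector<VoxelBlock*> filterByHistogram(
        const std::unordered_map<VoxelBlock*, int>& pointsPerBlock,
        int threshold = 20) {
    std::vector<VoxelBlock*> updateQueue;
    for (const auto& [block, count] : pointsPerBlock) {
        if (count >= threshold) updateQueue.push_back(block);
    }
    return updateQueue;
}
```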

3.5 Global Surface Fusion

In the surface fusion process, the signed distance values from the new stereo measurements are calculated and fused together with the previous values. This is achieved by using a weighted running average [5], which gradually fuses a sequence of incoming measurements into a joint surface. Averaging the surface into a global weighted signed distance field can be viewed as incrementally de-noising multiple noisy measurements [1]. Although out of scope for this report, it can be proven mathematically that the weighted signed distance function is equivalent to a least-squares minimization of squared distances, making the weighted isosurface optimal [5].

The signed distance is calculated per voxel sample for each block in the update queue acquired from the octree lookup process presented in Section 3.4. Each voxel at position x holds a cumulative signed distance function, D(x), and a cumulative weight function, W(x). The combination rules for measurements at time frames t = 0, 1, 2, ..., n are defined in Equations 3.6 and 3.7:

    D(x) = \frac{\sum_t w_t(x) d_t(x)}{\sum_t w_t(x)}    (3.6)

    W(x) = \sum_t w_t(x)    (3.7)

where d_t(x) and w_t(x) are the signed distance and weight functions of the stereo observation input at time t, for the voxel at position x. In practice, the cumulative functions are updated incrementally according to Equations 3.8 and 3.9:

    D_{t+1}(x) = \frac{W_t(x) D_t(x) + w_{t+1}(x) d_{t+1}(x)}{W_t(x) + w_{t+1}(x)}    (3.8)

    W_{t+1}(x) = W_t(x) + w_{t+1}(x)    (3.9)

Figure 3.6 demonstrates the result of applying the cumulative functions, Equations 3.8 and 3.9, to multiple noisy observations of a one-dimensional curve. The red curve is the result of fusing ten observations using a constant update weight w_t = 1 for each fusion step.

Figure 3.6: The result of fusing ten noisy observations into one curve by applying the cumulative weighted sum strategy.
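The incremental update of Equations 3.8 and 3.9 amounts to a few arithmetic operations per voxel. A minimal sketch, reusing the hypothetical Voxel struct from Section 3.3:

```cpp
// Sketch of Equations 3.8 and 3.9: fuse one new measurement (d, w) into a
// voxel's cumulative signed distance and weight. Names are illustrative.
struct Voxel {
    float sdf    = 0.0f;  // cumulative D(x)
    float weight = 0.0f;  // cumulative W(x)
};

void fuse(Voxel& v, float d, float w) {
    if (w <= 0.0f) return;  // zero-weight measurements contribute nothing
    v.sdf    = (v.weight * v.sdf + w * d) / (v.weight + w);  // Eq. 3.8
    v.weight = v.weight + w;                                 // Eq. 3.9
}
```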

Although using a constant weight function proves sufficient in most cases, the finer details are often lost. In the proposed method, a more sophisticated weight function presented by Steinbrücker et al. [22] was therefore also implemented. Under the sign convention used here (Equation 3.15), the weight is set to one for all signed distance values in front of the surface, and decreases linearly for values behind it, see Equation 3.10:

    w_t(x) = \begin{cases} 1, & \text{if } d_t(x) < \delta \\ \frac{\Phi - d_t(x)}{\Phi - \delta}, & \text{if } \delta \le d_t(x) \le \Phi \\ 0, & \text{if } d_t(x) > \Phi \end{cases}    (3.10)

where \Phi is a truncation value proportional to the side length of a voxel, and \delta is a small value close to zero.

To calculate the signed distance function d_t(x) of a voxel, the first step is to transform its world location x into local camera space coordinates x_c, by inverting the camera transformation matrix M_t = R_t T_t, as shown in Equation 3.11:

    x_c = M_t^{-1} \cdot x = (R_t T_t)^{-1} \cdot x    (3.11)

where R_t and T_t are the camera rotation and translation matrices at time frame t. The next step is to project the local camera position x_c onto the camera image plane to determine its corresponding pixel location in the disparity map. By rearranging Equations 2.1-2.4, the local image plane coordinates (u_L, v_L) of the left camera can be calculated as described in Equation 3.12:

    u_L = -f \cdot \frac{\frac{b}{2} + x_{x_c}}{z_{x_c}}, \qquad v_L = f \cdot \frac{y_{x_c}}{z_{x_c}}    (3.12)

where f is the focal length, b is the baseline separation between the stereo cameras, and (x_{x_c}, y_{x_c}, z_{x_c}) are the 3D coordinates of the position x_c. As mentioned in Section 2.2, the local image plane coordinates are measured from the center of the image. The normalized coordinates (in the range [0, 1]) of an image with width w and height h are given by

    u'_L = \frac{u_L + \frac{w}{2}}{w}, \qquad v'_L = \frac{v_L + \frac{h}{2}}{h}    (3.13)

and given the pixel resolution n × m, the actual pixel coordinates can be determined by

    u_{pixel} = \lfloor u'_L \cdot n \rfloor, \qquad v_{pixel} = \lfloor v'_L \cdot m \rfloor.    (3.14)

Note that if z_{x_c} is negative, or if either of the normalized coordinates u'_L and v'_L falls outside the interval [0, 1], the corresponding 3D point is not visible in the camera image, and thus the signed distance cannot be calculated. Once the pixel location is known, its corresponding disparity value from the disparity map can be used to calculate the surface depth z relative to the camera, according to the triangulation process explained in Section 2.2. Finally, the signed distance function d_t(x) can be computed using Equation 3.15:

    d_t(x) = \|x_c\|_2 \left( 1 - \frac{z}{z_{x_c}} \right)    (3.15)

where \|x_c\|_2 is the Euclidean norm and z_{x_c} is the depth value of the voxel's camera space location x_c. Using this formula, the signed distance value is negative for space in front of the surface, and positive behind it.
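The per-voxel steps of Equations 3.10-3.15 can be condensed into a few small routines. The following is a sketch under the stated pinhole model; it assumes the voxel has already been transformed into camera space (Equation 3.11) and that the caller looks up the triangulated surface depth z at the returned pixel. All names are illustrative.

```cpp
#include <cmath>
#include <optional>
#include <utility>

struct Vec3 { double x, y, z; };

// Eqs. 3.12-3.14: project a camera-space point xc into left-image pixel
// coordinates. f is focal length, b baseline, w x h the image plane size,
// n x m the pixel resolution. Returns nullopt if the point is not visible.
std::optional<std::pair<int, int>> projectToLeftImage(
        const Vec3& xc, double f, double b,
        double w, double h, int n, int m) {
    if (xc.z <= 0.0) return std::nullopt;        // behind the camera
    double uL = -f * (b / 2.0 + xc.x) / xc.z;    // Eq. 3.12
    double vL =  f * xc.y / xc.z;
    double un = (uL + w / 2.0) / w;              // Eq. 3.13
    double vn = (vL + h / 2.0) / h;
    if (un < 0.0 || un > 1.0 || vn < 0.0 || vn > 1.0)
        return std::nullopt;                     // outside the image
    return std::pair{static_cast<int>(un * n),   // Eq. 3.14
                     static_cast<int>(vn * m)};
}

// Eq. 3.15: signed distance of the voxel given the measured surface depth z
// at the projected pixel; negative in front of the surface, positive behind.
double signedDistance(const Vec3& xc, double z) {
    double norm = std::sqrt(xc.x * xc.x + xc.y * xc.y + xc.z * xc.z);
    return norm * (1.0 - z / xc.z);
}

// Eq. 3.10: weight with linear falloff, truncated at phi; delta is near zero.
double fusionWeight(double d, double delta, double phi) {
    if (d < delta) return 1.0;
    if (d <= phi)  return (phi - d) / (phi - delta);
    return 0.0;
}
```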

3.6 Mesh Extraction

The idea of the mesh extraction process is to incrementally update a global triangle mesh by replacing existing, or creating new, submeshes, strictly for the voxel blocks placed in the update queue (see the system outline given in Section 3.2). In other words, the entire global mesh does not need to be regenerated on every update; only the parts of the scene that are visible to the stereo camera pair. The method used for creating polygonal meshes from the implicit representation is the marching cubes algorithm, covered in Section 2.5. Every voxel block node in the octree represents a submesh, and for each voxel block in the update queue, the algorithm is applied using eight voxels at a time to form logical cubes, starting from the left-top-front voxel, as illustrated in Figure 3.7. As explained in Section 3.3, each voxel stores a signed distance value, meaning that the marching cubes algorithm is configured to extract the isosurface at the zero crossing.

Figure 3.7: Groups of eight neighbouring voxels are processed at a time when applying the marching cubes algorithm. The center locations of the voxels represent the corners of a cube.

The interior voxels of a single block can be traversed in groups of eight in a trivial manner by using a triple nested loop (one step at a time along each of the three coordinate axes). However, since the cubes are formed from the center locations of the voxels, gaps will be left between neighbouring submeshes. To completely cover these gaps, each block node needs access to its seven neighbouring blocks during the mesh extraction process. The neighbouring voxel blocks are marked as grey cubes in Figure 3.8.

Figure 3.8: A neighbourhood of eight voxel block nodes needed for generating triangles between adjacent submeshes using the marching cubes algorithm. The black box is the current submesh, and the grey boxes are its neighbours. The blue cubes represent exterior face blocks shared between two nodes each, the green cubes are edge blocks shared between four nodes each, and the purple cube is a corner block shared between all eight neighbours.
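As a sketch of the interior traversal described above, the loop below slides a logical cube across an 8³ block so that its corners coincide with the centres of eight adjacent voxels. The helper polygonizeCube stands in for the marching cubes table lookup of Section 2.5 and is assumed rather than implemented here.

```cpp
#include <array>
#include <vector>

struct Vec3 { float x, y, z; };

// Assumed stand-in for the marching cubes case-table lookup (Section 2.5):
// emits triangles for one cube given the SDF values at its eight corners.
void polygonizeCube(const std::array<float, 8>& cornerSdf,
                    int i, int j, int k, std::vector<Vec3>& vertices);

// Interior pass over one 8^3 block; isosurface extracted at the zero crossing.
void marchingCubesInterior(const std::array<float, 8 * 8 * 8>& sdf,
                           std::vector<Vec3>& vertices) {
    constexpr int B = 8;
    auto at = [&](int i, int j, int k) { return sdf[i + j * B + k * B * B]; };
    // Only B - 1 steps per axis: the last cube along each axis needs voxels
    // from a neighbouring block, which is handled separately (Figure 3.8).
    for (int k = 0; k < B - 1; ++k)
        for (int j = 0; j < B - 1; ++j)
            for (int i = 0; i < B - 1; ++i) {
                std::array<float, 8> c = {
                    at(i, j, k),             at(i + 1, j, k),
                    at(i + 1, j + 1, k),     at(i, j + 1, k),
                    at(i, j, k + 1),         at(i + 1, j, k + 1),
                    at(i + 1, j + 1, k + 1), at(i, j + 1, k + 1) };
                polygonizeCube(c, i, j, k, vertices);
            }
}
```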

As illustrated in Figure 3.8, there are three different cases, which can be placed into the following exterior voxel categories: face blocks (blue), edge blocks (green), and corner blocks (purple). There are a total of three face blocks, three edge blocks, and one corner block for every complete submesh. Each of the three face blocks is shared with one neighbouring node (in the backward, right, and downward directions respectively), and thus a face block requires two voxel blocks in total. The edge blocks require four voxel blocks each, and the corner block requires all eight (one cube vertex per voxel block). Algorithm 3 shows a piece of the sequential process of generating one submesh.

Algorithm 3: Mesh Extraction
    Input: Pointer to current voxel block node n
    Output: Vertex array V
    V ← MarchingCubesInterior(n);
    back ← FindNeighbourNode(n, BACK);
    if back exists then
        V ← MarchingCubesFaceBlock(n, back);
    end
    right ← FindNeighbourNode(n, RIGHT);
    if right exists then
        V ← MarchingCubesFaceBlock(n, right);
    end
    down ← FindNeighbourNode(n, DOWN);
    if down exists then
        V ← MarchingCubesFaceBlock(n, down);
    end
    if right and down exist then
        right_down ← FindNeighbourNode(right, DOWN);
        if right_down exists then
            V ← MarchingCubesEdgeBlock(n, right, down, right_down);
        end
    end
    ...
    if all neighbours exist then
        V ← MarchingCubesCornerBlock(n, right, down, right_down, ...);
    end
    return V;

Accessing neighbouring nodes can be solved either by explicitly storing pointers to the neighbour nodes in memory, or by using a recursive tree traversal search. In the proposed method, the tree traversal approach was used, as in Algorithm 3, and since all voxel blocks are leaf nodes located at the maximum depth of the tree, the process is rather straightforward. For example, given the index convention illustrated in Figure 2.5, the right-side neighbour of octant three always has octant index seven (if it exists), and vice versa. Due to the hierarchical decomposition of an octree, the worst case (regarding time complexity) for a tree of depth d requires d − 1 tree level traversals to reach the nearest common parent (the root node); however, this is a relatively rare scenario, and in practice the tree depth seldom exceeds 10.
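The sibling case of this neighbour search can be captured with a simple bit trick. The sketch below assumes an octant convention in which individual bits encode the x, y, and z halves, chosen so that octant 3's right-side neighbour is octant 7, matching the example above; the actual Figure 2.5 convention may assign the bits differently.

```cpp
#include <cstdint>

// Assumed bit assignment: bit 4 = x half, bit 2 = y half, bit 1 = z half,
// consistent with the example that the right neighbour of octant 3 is 7.
constexpr uint8_t X_BIT = 4, Y_BIT = 2, Z_BIT = 1;

// Returns the sibling octant index along the given axis, or -1 when the
// neighbour lies outside the current parent. In the latter case the search
// recurses one level up, in the worst case d - 1 levels for a tree of depth d.
int siblingNeighbour(uint8_t octant, uint8_t axisBit, bool positiveDir) {
    bool inUpperHalf = (octant & axisBit) != 0;
    if (inUpperHalf == positiveDir)
        return -1;              // crosses the parent boundary: go one level up
    return octant ^ axisBit;    // e.g. right of octant 3 is 3 ^ 4 = 7
}
```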

Mesh Textures

Two different approaches for generating colour textures for the global mesh were implemented. The first approach, proposed by Steinbrücker et al. [9], stores in each voxel a cumulative weighted colour (one scalar per colour channel), updated in the same way as the weighted signed distance function used in the surface fusion method (Equation 3.8). The colour values are derived during every update cycle from one of the stereo images, using the same image coordinates as for the disparity map. During the mesh extraction process, the colour values of the voxels are linearly interpolated just like the triangle vertices in the marching cubes algorithm (Equation 2.11). As a result, each triangle vertex is associated with a colour value, and no actual texture samples are used for the rendering.

The other approach involves saving small patches from one of the stereo images that completely cover each submesh generated during the mesh extraction. This can be achieved by first projecting all of the vertices of a submesh into image plane coordinates (according to Equations 3.12-3.14), and then mapping them to a corresponding bounding area of the camera image using normalized coordinates relative to that area. Figure 3.9 illustrates how a submesh (the blue triangles) is mapped to a patch (the pink bounding area) of the camera image that completely covers the mesh. The rendering can then be done by associating every submesh with its own patch of texture samples.

Figure 3.9: An illustration of how a submesh can be textured by a small patch of the camera image. The blue triangles represent the submesh, and the pink bounding area represents its corresponding texture sample.

Although the weighted colour method is robust, its sharpness depends entirely on the voxel resolution (and consequently the vertex density), since the vertex colours are interpolated across the triangles. The texture patching approach yields sharper results regardless of the vertex density, and ended up being used in the final prototype.
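The patch-texturing idea reduces to computing a bounding area of the projected vertices and re-normalizing each vertex into it. A minimal sketch, assuming the vertices have already been projected to pixel coordinates via Equations 3.12-3.14; the TexturePatch struct and names are illustrative.

```cpp
#include <algorithm>
#include <vector>

struct Vec2 { float u, v; };  // pixel coordinates in the camera image

// Bounding patch of a submesh in the camera image, plus per-vertex UVs
// normalized to [0, 1] relative to that patch.
struct TexturePatch {
    Vec2 min, max;          // area to crop from the camera image
    std::vector<Vec2> uvs;  // per-vertex UVs relative to the patch
};

TexturePatch buildPatch(const std::vector<Vec2>& projected) {
    TexturePatch p{{1e9f, 1e9f}, {-1e9f, -1e9f}, {}};
    for (const Vec2& q : projected) {  // bounding box of the projected mesh
        p.min.u = std::min(p.min.u, q.u); p.min.v = std::min(p.min.v, q.v);
        p.max.u = std::max(p.max.u, q.u); p.max.v = std::max(p.max.v, q.v);
    }
    // Guard against degenerate (zero-area) patches before normalizing.
    float du = std::max(p.max.u - p.min.u, 1e-6f);
    float dv = std::max(p.max.v - p.min.v, 1e-6f);
    for (const Vec2& q : projected)
        p.uvs.push_back({(q.u - p.min.u) / du, (q.v - p.min.v) / dv});
    return p;
}
```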
