
Master Thesis
Electrical Engineering with emphasis on Signal Processing
September 2016

3D Object Reconstruction Using XBOX Kinect v2.0

Srikanth Varanasi

Vinay Kanth Devu

Department of Applied Signal Processing
Blekinge Institute of Technology


This thesis is submitted to the Department of Applied Signal Processing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Electrical Engineering with Emphasis on Signal Processing.

Contact Information:
Authors:
Srikanth Varanasi
E-mail: srva15@student.bth.se
Vinay Kanth Devu
E-mail: vide15@student.bth.se

Supervisor: Irina Gertsovich

University Examiner: Dr. Sven Johansson

Department of Applied Signal Processing
Blekinge Institute of Technology
Internet: www.bth.se
Phone: +46 455 38 50 00


Abstract

Three dimensional image processing and analysis, in particular the imaging and reconstruction of an environment in three dimensions, has received significant attention in recent years. Against this background, this research study intends to provide an efficient way to reconstruct an object with an irregular surface, for example the sole of a shoe, with good precision and at a low cost, using the XBOX Kinect v2.0 sensor. Three dimensional reconstruction can be achieved using either active or passive methods. Active methods make use of a light source, such as lasers or infra-red emitters, for scanning a given environment and measuring the depth, to create a depth map. In contrast, passive methods use colour images of the environment in different perspectives to create a three dimensional model of it. In this study, an active method using a set of depth maps of the object of interest is implemented, where the object of interest is represented by the sole of a shoe. Firstly, a set of depth maps of the object of interest is acquired in different perspectives. The acquired depth maps are first pre-processed to remove any outliers in the acquired data and are then enhanced. The enhanced depth maps are converted into 3D point clouds using the intrinsic parameters of the Kinect sensor. The obtained point clouds are subsequently registered to a single position using the Iterative Closest Point (ICP) algorithm. The aligned point clouds of the object of interest are then merged to form a single dense point cloud. Analysis of the generated dense point cloud has shown that an accurate 3D reconstruction of the real object of interest has been achieved.


Acknowledgements

Firstly, we would like to thank our thesis advisor Irina Gertsovich for her valuable guidance, feedback and support throughout our thesis. We are indebted to our parents, professors, colleagues and friends for their immense support and the help they have offered during various phases of our thesis.

Srikanth Varanasi Vinay Kanth Devu


Contents

Abstract
Acknowledgements
1 Introduction
 1.1 Aims & Objectives
 1.2 Thesis Organisation
2 Literature Review
3 Theoretical Background
 3.1 Intrinsic Parameters of the Kinect Sensor
 3.2 Depth Image Enhancement
 3.3 Point Cloud Generation
 3.4 Point Cloud Alignment & Merging
4 Implementation
 4.1 Experimental Setup
 4.2 Experiment
  4.2.1 Data Acquisition & Depth Image Enhancement
  4.2.2 Point Cloud Generation, Alignment and Merging
5 Results
 5.1 Intrinsic Parameters
 5.2 Data Acquisition & Depth Image Enhancement
 5.3 Point Cloud Generation
 5.4 Point Cloud Alignment & Merging
6 Discussion
 6.1 Validation of Results
 6.2 Sources of Error in 3D Model Creation
 6.3 Advantages and Limitations
7 Conclusions and Future Scope
 7.1 Further Scope
References


List of Figures

1.1 XBOX 360 Kinect showing its internal sensors. Image courtesy: Microsoft Corporation.
1.2 Active 3D imaging system using time of flight technology. Image courtesy: [1]
1.3 XBOX One Kinect showing internal sensors. Image courtesy: iFixit.com
3.1 Kinect for Windows V2.0 sensor showing the orientation of the coordinate system. Image courtesy: Microsoft Corporation.
4.1 Experimental setup without the objects of interest being present.
4.2 Experimental setup with the objects of interest.
5.1 A set of depth maps of the scene without the objects of interest.
5.2 A set of depth maps of the scene with the objects of interest in the initial position (1st).
5.3 A set of depth maps of the scene with the objects of interest in the final position (21st).
5.4 Averaged depth map of the scene without the objects of interest.
5.5 Enhanced depth map of the scene without the objects of interest.
5.6 Averaged depth map of the scene with the objects of interest in the initial position (1st).
5.7 Enhanced depth map of the scene with the objects of interest in the initial position (1st).
5.8 Subtracted depth map of the scene with the objects of interest in the 1st position.
5.9 Depth map of the objects of interest in the 1st position.
5.10 Depth map of the objects of interest in the 21st position.
5.11 Point cloud of the scene with the objects of interest in the 1st position.
5.12 Point cloud of the objects of interest in the 1st position.
5.13 Point cloud of the objects of interest in the 21st position.
5.14 Point clouds of the objects of interest in the first and second positions.
5.15 Point cloud of the objects of interest in the second position with the registered point cloud of the first position with respect to the second position.
5.16 Point clouds of the objects of interest in the 21st and the 20th positions.
5.17 Point cloud of the objects of interest in the 20th position with the registered point cloud of the 21st position with respect to the 20th position.
5.18 Merged point cloud of the objects of interest in the second position with the registered point cloud of the first position with respect to the second position.
5.19 Merged point cloud of the objects of interest in the 20th position with the registered point cloud of the 21st position with respect to the 20th position.
5.20 Merged point cloud of the objects of interest in the 11th position obtained from the positions 1 to 10.
5.21 Merged point cloud of the objects of interest in the 11th position obtained from the positions 12 to 21.
5.22 Final point cloud of the objects of interest in the 11th position.
5.23 Final point cloud of the objects of interest in the 11th position.
5.24 Final point cloud of the objects of interest in the 11th position.
5.25 Final point cloud of the objects of interest in the 11th position.
5.26 Depth map of the objects of interest generated based on the final point cloud.
6.1 A top view of the object on a graph paper.
6.2 A top view of the point cloud of the object.
6.3 A scene containing the object on the graph paper.
6.5 Front view of the object on a graph paper.
6.6 Front view of the point cloud of the object.
6.7 Colour image of the scene showing the placement of the object of interest and the Kinect sensor on different levels.
6.8 Depth map of the object of interest showing the IR reflection in front of the object.
6.9 Two depth maps of the same scene showing the depth variations over a period of time due to the drift in the temperature.
6.10 Two depth maps of the same scene showing how the holes at the edges of the objects vary over time.
6.11 Colour image and depth map of a scene depicting the intensity related issues in depth maps.


List of Tables

1.1 Comparison between XBOX 360 Kinect & XBOX One Kinect
4.1 The positions of the object of interest and their corresponding angle of rotation with respect to the reference pointer.
5.1 Intrinsic parameters of the Kinect for Windows sensor.
5.2 The ICP registrations and their corresponding RMSE values.


Chapter 1

Introduction

Computer vision is an interdisciplinary field that helps us gain a high-level understanding of digital images or videos. In the last few decades computer vision has grown into a very broad and diverse field. In particular, the concepts of three dimensional (3D) imaging and reconstruction have received much attention in the past few years. 3D data acquisition can be achieved mainly in two ways [2, 3]. They are:

• Passive 3D imaging
• Active 3D imaging

A passive 3D imaging system is a system that does not project its own source of light, or any other form of electromagnetic radiation, for capturing the 3D information of a scene. A passive imaging system relies on ambient-lit images, i.e. colour images of the scene in different perspectives, to generate a 3D model [2]. Passive 3D imaging can be achieved in two ways, namely multiple view and single view approaches.

Multiple view approaches make use of more than one viewpoint, either by using multiple cameras at the same time or by using a single moving camera at different times. The systems which use more than one camera at a time for capturing multiple images of the scene are called stereo vision systems, and collectively all the cameras are referred to as a stereo camera [2]. These methods use triangulation, locating the same point of the environment in more than one of the images, to determine the point's 3D position. In contrast to stereo imaging, the use of a single camera moving over a period of time and capturing the scene in different perspectives, for creating a 3D model, is called structure from motion (SfM) [2, 4].


Single view approaches, on the other hand, depend upon only one viewpoint for inferring the 3D shape of an object in an environment. They mainly depend upon information cues such as shading, texture and focus for 3D modelling of the object [2].

In contrast to these, active 3D imaging systems make use of controlled artificial illumination or other forms of electromagnetic radiation for acquiring dense range maps of the environment with minimum ambiguity [2, 3]. The use of artificial illumination makes it easy to acquire dense and accurate depth maps of texture-less objects. Active 3D imaging systems make use of a large variety of methods for creating an accurate depth map of an environment. Based on the technique being used, the range of operation and the accuracy of the system vary. Structured Light (SL) and Time of Flight (ToF) are some of the techniques being used in active 3D imaging systems.

In systems using structured light for measuring the depth, a sequence of known patterns of electromagnetic radiation is projected on to the environment. The patterns in the electromagnetic radiation get deformed due to the geometry of the objects in the environment. The distorted patterns are observed using a camera, and are analysed based on the disparity from the originally projected pattern and the intrinsic parameters of the camera to generate the depth maps of the environment. This system is similar to the binocular stereo vision system in passive 3D imaging, where one of the cameras is replaced by a projector [2, 3, 5]. Hence this technique is also called active stereo vision. The XBOX 360 Kinect is an active 3D imaging system that works based on this principle [6]. Figure 1.1 shows an XBOX 360 Kinect with the different sensors present in it and their internal locations.


ToF technology is mainly based on measuring the time taken by the light emitted from a source to travel to an object and back to a sensor [1, 2, 5]. The illumination is in most cases a continuous wave, since this helps in the delay estimation. The source of illumination and the sensor are assumed to be at the same location. Since the distance between the object and the sensor is constant and the speed of light c is finite, the time shift of the emitted signal is equivalent to a phase shift in the received signal. Based on the phase shift (Δϕ) between the received and the generated signal, the ToF is calculated, which in turn is used for generating the depth maps. The XBOX One Kinect is an active 3D imaging system that works based on this principle [6]. Figure 1.2 shows the operating principle of a ToF sensor, where Δϕ is the phase difference between the transmitted and received signal. Figure 1.3 shows a tear-down of the XBOX One Kinect sensor with the locations of the different sensors present in it.
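For reference, the standard continuous-wave relation that recovers the distance d from the measured phase shift can be written as below; this is a textbook result added here for clarity (f_mod denotes the modulation frequency of the emitted signal) and is not reproduced from the thesis itself:

$$ d = \frac{c}{2} \cdot \frac{\Delta\phi}{2\pi f_{mod}} $$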



Figure 1.3: XBOX One Kinect showing internal sensors. Image courtesy: iFixit.com

Features                                    | XBOX 360 Kinect                     | XBOX One Kinect
Range of operation                          | 0.4 to 3 meters or 0.8 to 4 meters  | 0.5 to 4.5 meters
Colour camera resolution (pixels)           | 640 x 480 @ 30 Hz                   | 1920 x 1080 @ 30 Hz
Near IR camera resolution (pixels)          | 640 x 480 @ 30 Hz                   | 512 x 424 @ 30 Hz
Field of view (FoV) (horizontal x vertical) | 57° x 43°                           | 70° x 60°
Depth sensing technology                    | Structured light                    | Time of Flight (ToF)

Table 1.1: Comparison between XBOX 360 Kinect & XBOX One Kinect.



1.1 Aims & Objectives

The main aim of this research is to create a reliable three dimensional (3D) model of an object with irregular surfaces present in an indoor environment using an XBOX One Kinect sensor. The XBOX One Kinect was primarily developed as a gaming device, but it subsequently generated significant interest in the academic and research world due to its relatively low-cost IR RGB-D sensor which can work with personal computers, its high data fidelity, and its depth resolution and accuracy. In this research we make use of MATLAB and Microsoft Visual Studio for creating a reliable 3D point cloud of the object of interest. The main objectives of this research are to:

• Acquire sets of depth maps of the environment with the objects of interest.
• Enhance the acquired depth maps and separate the depths of the object of interest.
• Create an accurate 3D point cloud of the object of interest from the enhanced depth maps.

The research questions for this thesis are:

• Is the 3D reconstruction of an irregular surfaced object using the XBOX Kinect V2.0 possible?
• How is the proposed algorithm better in terms of speed of reconstruction and accuracy when compared to the existing ones?
• Is this a realistic approach for performing online 3D reconstruction rather than an offline one?

1.2 Thesis Organisation


Chapter 2

Literature Review

The Kinect for Windows sensor is one of the most popular sensors being used for low-cost 3D reconstruction of objects in an environment. It can also be used for the 3D reconstruction of a scene in real time, creating virtual reality, motion sensing, making a display touch-enabled, etc. Most of these applications make use of the Microsoft Kinect SDK. The main issue with the use of the Kinect SDK for real-time reconstruction of a scene is that the hardware requirements to process such large amounts of data are quite high. Due to these hardware requirements, one must depend upon other programming tools such as MATLAB.

In [1] the authors give a thorough account of Time of Flight (ToF) cameras: their operating principle, the calibration and alignment of ToF cameras, and different ToF and structured light cameras such as the Kinect for Windows V1.0 and V2.0 sensors.

3D imaging techniques, the creation of 3D objects and their representations, the registration of 3D point clouds in different positions, and applications of 3D imaging and analysis in different areas of science are discussed by the authors in [2].

In [3, 4], the authors give an account of different computer vision techniques such as image segmentation, feature detection, matching and alignment, structure from motion (SfM), 3D reconstruction, etc.

In [5], the authors give an objective comparison between the structured light and ToF technologies for range sensing using a Kinect for Windows sensor. They also discuss the different error sources faced while using a Kinect sensor and give a constructive framework for evaluating the performance of Kinect sensors.

In [6], the authors give a brief account of the Kinect for Windows sensors, 3D reconstruction, segmentation, matching and recognition, and the different algorithms used for these purposes.


In [7], the author suggests the real-time reconstruction of a 3D scene using Kinect Fusion, which requires a very powerful GPU and a large amount of memory. The main issues with the use of the Kinect for Windows V1.0 sensor for 3D reconstruction are that the depth data acquired is not as reliable as the data acquired using the Kinect for Windows V2.0 sensor, due to the change in the sensing technology, and that the fidelity of the data is very low. The usage of Kinect Fusion for 3D reconstruction is very tedious and has a number of hardware requirements.

A comprehensive evaluation of the Kinect for Windows V2.0 sensor for the purpose of 3D reconstruction is explained in detail in [8]. In [9], the authors introduce the Kin2 toolbox for MATLAB versions prior to MATLAB R2016a. This toolbox makes use of C++ and MATLAB MEX functions for providing access to the colour, depth and infra-red streams of the Kinect sensor using the Microsoft Kinect Fusion SDK V2.0.


Chapter 3

Theoretical Background

3.1 Intrinsic Parameters of the Kinect Sensor

The intrinsic parameters of a Kinect sensor's Near IR (NIR) camera are the focal length (Fx, Fy), the location of the focal point (Cx, Cy) and the radial distortion parameters (K1, K2, K3). Each Kinect sensor has its own depth intrinsic parameters, which are sensor and lens dependent. Each Kinect sensor is calibrated in the factory and the intrinsic parameters are stored in the internal memory of the sensor. The depth intrinsic parameters of the Kinect sensor can be acquired and stored with the help of the function "getDepthIntrinsics" of the Kin2 toolbox developed for MATLAB [9].
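As a minimal sketch, the acquisition of the intrinsic parameters could look like the following MATLAB fragment; the constructor argument and the handling of the returned structure follow the Kin2 toolbox's documented interface [9] and are assumptions rather than code from the thesis:

```matlab
% Minimal sketch: read and store the factory depth intrinsics via Kin2.
k2 = Kin2('depth');                        % open the sensor's depth stream
intrinsics = k2.getDepthIntrinsics;        % focal length, focal centre and
                                           % radial distortion parameters
save('depthIntrinsics.mat', 'intrinsics'); % keep them for later processing
k2.delete;                                 % release the sensor
```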

3.2 Depth Image Enhancement

Consider a scenario in which N depth frames are acquired, each frame having a resolution of J × K pixels. For each pixel location (j, k), there exist N data samples, which may also contain outliers due to the noise inherent to the sensor. From these N samples the outliers need to be removed. The outliers in the acquired data are removed based on the median absolute deviation (MAD) of the data [10, 11].

The MAD of a normal distribution is the median of the absolute deviation from the median, i.e.

MAD = b \cdot M(|x_i - M(X)|)    (3.1)

where M is the median of a given distribution, X is the set containing the N samples of data, and x_i is an individual sample in the data set X. Assuming the depth intensities to be normally distributed, b = 1.4826 is used [11], ignoring the abnormalities induced by the outliers in the data.


The criterion for detecting outliers is based on a threshold set from the value of the MAD. The inequality

|x_i - M(X)| \leq 3 \cdot MAD    (3.2)

is the criterion for detecting outliers: if a given sample x_i of the data set X satisfies this condition, then the sample belongs to the data set. If more than 50% of the data has the same value, then the MAD becomes zero; in such scenarios this detection technique does not work.

After the outliers in the depth pixel intensities are detected and discarded for each set X, the remaining samples of depth intensities are averaged to acquire a value for pixel location (j, k) in the averaged depth map.

A pixel in a depth map is said to be invalid if it does not hold any depth information, i.e. if the intensity value of that pixel is undefined or zero. The invalid pixels in a depth map are called holes in this work. These holes need to be filled with valid depth values in order to avoid holes in the point clouds. The holes in the depth data are filled using the eight nearest neighbour principle, which calculates the depth value of a hole as the mean of the intensities of its 8 nearest neighbours. Holes in the averaged depth maps are filled and the results are stored.
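A minimal MATLAB sketch of this enhancement step is given below, assuming the N acquired frames are stacked in a J-by-K-by-N array called frames and a MATLAB version with implicit expansion (R2016b or later); variable names are illustrative only:

```matlab
% Minimal sketch: MAD-based outlier rejection (eqs. 3.1 and 3.2),
% per-pixel averaging, and hole filling with imfill.
b = 1.4826;                                 % factor for a normal distribution
med = median(frames, 3);                    % per-pixel median M(X)
madMap = b * median(abs(frames - med), 3);  % MAD per pixel (eq. 3.1)
inlier = abs(frames - med) <= 3 * madMap;   % outlier criterion (eq. 3.2)
frames(~inlier) = NaN;                      % discard detected outliers
avgMap = mean(frames, 3, 'omitnan');        % average the remaining samples
avgMap(isnan(avgMap)) = 0;                  % pixels with no valid sample = holes
filled = imfill(avgMap);                    % fill holes in the averaged map
```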

3.3 Point Cloud Generation

Figure 3.1: Kinect for Windows V2.0 sensor showing the orientation of the coordinate system.

Image courtesy: Microsoft Corporation.


Each pixel p(j, k) in these depth maps is converted into a physical location P(X, Y, Z) in a 3D point cloud with respect to the location of the depth camera in the Kinect, i.e. the origin of the generated point cloud is located at the position of the depth camera of the Kinect sensor. The orientation of the X, Y and Z axes in this 3D system is given in figure 3.1. The X and Y locations of the point P(X, Y, Z) corresponding to each pixel p(j, k) in a depth map are calculated using the equations

X = \frac{j - C_x}{F_x} \cdot Z    (3.3)

Y = \frac{k - C_y}{F_y} \cdot Z    (3.4)

Here (j, k) is the location of the pixel p in the depth map, and (C_x, C_y) and (F_x, F_y) are the intrinsic parameters of the depth camera, namely the location of the focal centre and the focal length respectively. Z is the depth value of the pixel p(j, k) in the depth map.
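As a minimal sketch, with the intrinsic parameters Fx, Fy, Cx, Cy already loaded and depthMap holding an enhanced depth map in metres (the variable names are illustrative assumptions), the conversion could be written as:

```matlab
% Minimal sketch: convert a depth map into a point cloud (eqs. 3.3, 3.4).
[J, K] = meshgrid(1:size(depthMap, 2), 1:size(depthMap, 1));
Z = depthMap;                        % depth value of each pixel p(j, k)
X = (J - Cx) ./ Fx .* Z;             % eq. 3.3
Y = (K - Cy) ./ Fy .* Z;             % eq. 3.4
valid = Z > 0;                       % keep only pixels with valid depth
ptCloud = pointCloud([X(valid), Y(valid), Z(valid)]);
```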

3.4 Point Cloud Alignment & Merging

The point clouds thus generated are then aligned to a particular position using the iterative closest point (ICP) algorithm [2, 12]. The registration of one point cloud with respect to another involves the calculation of the transformation between the two point clouds, followed by the transformation of the input point cloud into the reference point cloud's orientation.

Registration of two or more 3D objects is a process that involves the approximation of the transformation between the different 3D objects. The ICP algorithm is the standard process for the registration of 3D objects. Consider two point clouds D and M such that D ⊂ M:

D = \{d_1, d_2, ..., d_{N_d}\}
M = \{m_1, m_2, ..., m_{N_m}\}    (3.5)

Here N_d and N_m are the numbers of points in the point clouds D and M respectively.


The error function between the two point clouds is

E_{ICP}(a, D, M) = E(T(a, D), M) = \sum_{u=1}^{N_d} \|(R d_u + t) - m_v\|^2    (3.6)

where a is the transformation function that best aligns the point cloud D to M, a = (R, t), with R the rotation matrix and t the translation vector, and (d_u, m_v) are the corresponding points. Fixing d_u ∈ D, the corresponding point m_v ∈ M is computed such that

v = \arg\min_{w \in \{1, 2, ..., N_m\}} \|(R d_u + t) - m_w\|^2.    (3.7)

Based on the point correspondences, the transformation function a = (R, t) can be computed so as to minimize E_{ICP}. This can be accomplished in several ways; one of those approaches is based on singular value decomposition (SVD).

Firstly, the cross covariance matrix C is formed based on the N_d correspondences (d_u, m_v), as

C = \frac{1}{N_d} \sum_{u=1}^{N_d} [d_u - \bar{d}][m_v - \bar{m}]^T    (3.8)

where \bar{d} and \bar{m} are the means formed over the N_d correspondences. Performing the SVD of C, we get

U S V^T = C    (3.9)

where U and V are two orthogonal matrices and S is a diagonal matrix of singular values. The rotation matrix R can be estimated from the pair of orthogonal matrices using the equation [13]

R = V S U^T,    (3.10)

where

S = I                       if det(U) det(V) = 1
S = diag(1, 1, ..., 1, -1)  if det(U) det(V) = -1.    (3.11)

Here diag() denotes a diagonal matrix, I denotes the identity matrix and det() denotes the determinant of a given matrix. The translation vector t can then be estimated as

t = \bar{m} - R \bar{d}.    (3.12)
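A minimal MATLAB sketch of this SVD-based estimation step, assuming the N_d correspondences are stored row-wise in matrices D and M (this helper is illustrative and not part of the thesis code):

```matlab
% Minimal sketch: estimate the rigid transform a = (R, t) from
% corresponding points via SVD (eqs. 3.8 - 3.12).
function [R, t] = estimateRigidTransform(D, M)
    dBar = mean(D, 1);                            % centroid of D
    mBar = mean(M, 1);                            % centroid of M
    C = (D - dBar)' * (M - mBar) / size(D, 1);    % cross covariance (eq. 3.8)
    [U, ~, V] = svd(C);                           % U*S*V' = C (eq. 3.9)
    S = eye(3);
    if det(U) * det(V) < 0                        % reflection case (eq. 3.11)
        S(3, 3) = -1;
    end
    R = V * S * U';                               % rotation (eq. 3.10)
    t = mBar' - R * dBar';                        % translation (eq. 3.12)
end
```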


The ICP algorithm is performed iteratively so as to improve the correspondences, hence minimizing E_{ICP}, until the true correspondences are known. The root mean square error of the difference between the reference and the transformed point cloud is then calculated.

A point cloud can be converted back into a depth map if required, using equations 3.13 and 3.14 [8]. If P(X, Y, Z) is a point in the 3D coordinate system, then the pixel p(j, k) in the depth map is located using the equations

j = \left[ \frac{X \cdot F_x}{Z} + C_x \right]    (3.13)

k = \left[ \frac{Y \cdot F_y}{Z} + C_y \right]    (3.14)

where [·] denotes rounding to integer pixel coordinates, and the depth value of the pixel p(j, k) is given by the Z coordinate of the point P (3.15).
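A minimal sketch of this back-projection, assuming a 512 × 424 depth resolution and points expressed in the depth camera's coordinate system (the function name and rounding choice are assumptions):

```matlab
% Minimal sketch: project 3D points back into a depth map (eqs. 3.13 - 3.15).
function depthMap = pointsToDepthMap(P, Fx, Fy, Cx, Cy)
    depthMap = zeros(424, 512);                    % assumed depth resolution
    j = round(P(:, 1) .* Fx ./ P(:, 3) + Cx);      % eq. 3.13
    k = round(P(:, 2) .* Fy ./ P(:, 3) + Cy);      % eq. 3.14
    valid = j >= 1 & j <= 512 & k >= 1 & k <= 424; % keep in-image pixels
    idx = sub2ind(size(depthMap), k(valid), j(valid));
    depthMap(idx) = P(valid, 3);                   % depth value is Z (eq. 3.15)
end
```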


Chapter 4

Implementation

The hardware requirements of this research study are a Microsoft Kinect for Windows V2.0 sensor, a PC that satisfies the requirements for using the Microsoft Kinect for Windows V2.0 sensor, and a turntable. The software requirements for acquiring and processing data from a Kinect sensor are the Windows 8.1 operating system, MATLAB 2016a with the Image Processing Toolbox, the Kinect for Windows hardware support package for MATLAB, the Kin2 toolbox for MATLAB [9], the Microsoft Kinect SDK V2.0 and Microsoft Visual Studio 2013.

4.1 Experimental Setup

The experimental setup for data acquisition in this research is shown in figure 4.1.


A scenario is considered in which a long table is present against a clear white background (i.e. a wall), with a turntable at one end and a Kinect sensor facing it at the other end. A 360° protractor is placed at the centre of the turntable to facilitate accurate rotation, as shown in figure 4.1. A paper pointer is set along the 0° mark of the protractor to facilitate the measurement of angles with respect to this reference.

The object of interest (OOI) is placed at the centre of the turntable. Two non-identical objects are placed on either side of the OOI, as shown in figure 4.2 by the two yellow lines in the place of the imaginary planes. The minimum distance between the OOI and the plane of the Kinect sensor is considered to be longer than 0.5 meters, due to the range limitations of the Kinect sensor. It is considered that the OOI and the two non-identical objects are present between two fixed imaginary planes, parallel to the face of the Kinect sensor, as shown in fig. 4.2. These imaginary planes are chosen in such a way that the object's surface does not cross them in any of the perspectives, i.e. even when the turntable is rotated by an arbitrary angle. The horizontal distance D1 between imaginary plane-1 in front of the OOI and the Kinect sensor is measured. The horizontal distance D2 between imaginary plane-2 behind the surface of the OOI and the wall is also measured. If the setup is considered to be in the middle of the room, another imaginary plane needs to be considered as a wall to allow the isolation of the OOI from the environment. The Kinect sensor is connected to the PC via a USB 3.0 connection.



4.2 Experiment

In this research study, depth maps of the scene with and without the OOI are acquired to create a good quality 3D model of the OOI. This includes data acquisition followed by various pre- and post-processing techniques for obtaining the final point cloud of the OOI to an appreciable accuracy.

4.2.1 Data Acquisition & Depth Image Enhancement

In the data acquisition stage, the Kinect object in MATLAB is initialized. The intrinsic parameters of the Kinect sensor are acquired with the help of the "getDepthIntrinsics" function of the Kin2 toolbox and stored. Initially, a set of depth maps and colour images of the scene without the OOI, e.g. 10 of each, is acquired and stored. The OOI is then placed at the centre of the turntable with two non-identical objects, e.g. pens, on either side of it. The horizontal distance D1 between the face of the Kinect sensor and the OOI, and the distance D2 between the OOI and the wall, are measured. Sets of depth maps and colour images of the scene, e.g. 10 in each position, are acquired and stored as the turntable is rotated. The total number of positions is considered to be odd, e.g. 21, so that the final position of the point cloud of the OOI can be considered to be the middle position, i.e. "0°" with respect to the reference. The turntable is rotated by a maximum of two degrees at a time, e.g. from −10° to +10° with a rotation of one degree per position, so as to maintain the accuracy and minimize the root mean square error.
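A minimal sketch of the per-position acquisition loop is shown below; the Kin2 method names (updateData, getDepth) follow the toolbox's documented interface [9] and are assumptions rather than code taken from the thesis:

```matlab
% Minimal sketch: acquire 10 depth frames for one turntable position.
k2 = Kin2('depth');
nFrames = 10;
frames = zeros(424, 512, nFrames);      % assumed depth resolution
for n = 1:nFrames
    while ~k2.updateData; end           % wait until a new frame is ready
    frames(:, :, n) = double(k2.getDepth);
end
% rotate the turntable by one degree and repeat for each of the 21 positions
```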

The acquired depth maps of the environment are then enhanced: outliers in the data are removed based on the MAD, and holes are filled based on the eight nearest neighbour principle. For hole-filling purposes, the MATLAB function "imfill" is used, as it works based on the same principle [14]. After the outliers in the depth maps of the environment are discarded and the holes filled, a single averaged and hole-filled depth map is constructed for each position of the OOI on the turntable.

For isolating the depths of the OOI from the rest of the environment, the depth map of the scene with the OOI in each of the 21 positions is subtracted from that of the environment without the OOI. The depths of the OOI in these subtracted outputs would be those measured from the wall behind the OOI, as shown in figure 4.2. Based on these depth intensities and the distance D2, the pixels with depth intensities greater than D2 are identified, and their corresponding depths from the enhanced depth maps of the scene with the OOI are copied into an empty depth map of the same dimensions.
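A minimal sketch of this isolation step, with background and enhanced holding the enhanced depth maps without and with the OOI respectively (the variable names are illustrative assumptions):

```matlab
% Minimal sketch: isolate the OOI using background subtraction and the
% distance D2 between imaginary plane-2 and the wall.
diffMap = background - enhanced;        % subtracted depth map
mask = diffMap > D2;                    % pixels belonging to the OOI
ooiDepth = zeros(size(enhanced));       % empty depth map of same dimensions
ooiDepth(mask) = enhanced(mask);        % copy the OOI depths into it
```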



4.2.2 Point Cloud Generation, Alignment and Merging

The depth maps of the environment with the OOI and those of the isolated OOI are converted into 3D point clouds based on the intrinsic parameters of the Kinect's depth sensor and the depth data, using equations 3.3 and 3.4. The origin of each of these point clouds is then translated along the Z-axis on to imaginary plane-1, which is in front of the OOI, using the distance D1.

The generated point clouds of the OOI are then aligned to a particular position using the ICP algorithm. The final position of the point cloud is considered to be the middle position in the acquired data, i.e. 0° with respect to the reference paper pointer. The MATLAB function "pcregrigid", with an increased maximum number of iterations and a very low tolerance of error in terms of rotation and translation, is used to register a given point cloud with respect to another [15]. Point clouds of the same orientation can be merged together to form a single point cloud using the MATLAB function "pcmerge". This function merges two point clouds by using a box grid filter [16].
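A minimal sketch of one registration-and-merge step using the two functions named above; the parameter values shown are illustrative assumptions, not the exact settings used in the thesis:

```matlab
% Minimal sketch: register a moving point cloud to a fixed one and merge.
tform = pcregrigid(ptCloudMoving, ptCloudFixed, ...
    'MaxIterations', 200, 'Tolerance', [0.0001, 0.0009]);
aligned = pctransform(ptCloudMoving, tform);     % apply the rigid transform
merged = pcmerge(ptCloudFixed, aligned, 0.001);  % 1 mm box grid filter
```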

We consider a total of 21 rotated positions of the objects of interest, i.e. from −10° to +10° with a difference of 1° per position. Table 4.1 shows the positions of the OOI and their corresponding angle of rotation with respect to the reference pointer.



Position Number | Angle of Rotation w.r.t. Reference (degrees)
1 (initial)     | -10
2               | -9
3               | -8
4               | -7
5               | -6
6               | -5
7               | -4
8               | -3
9               | -2
10              | -1
11 (middle)     | 0
12              | 1
13              | 2
14              | 3
15              | 4
16              | 5
17              | 6
18              | 7
19              | 8
20              | 9
21 (final)      | 10

Table 4.1: The positions of the object of interest and their corresponding angle of rotation with respect to the reference pointer.


Chapter 5

Results

5.1 Intrinsic Parameters

The intrinsic parameters of the Kinect sensor used for this work were acquired using the Kin2 toolbox for MATLAB [9]. The acquired intrinsic parameters are shown in table 5.1.

Intrinsic Parameters         | Values (pixels), Kinect SDK
Focal Length Fx              | 365.2946
Focal Length Fy              | 365.2946
Focal Centre Cx              | 259.7606
Focal Centre Cy              | 205.8992
Radial Distortion, 2nd Order | 0.0923
Radial Distortion, 4th Order | -0.2701
Radial Distortion, 6th Order | 0.0927

Table 5.1: Intrinsic parameters of the Kinect for Windows sensor.



5.2 Data Acquisition & Depth Image Enhancement

Initially, a set of depth maps and colour maps of the scene without the OOI is acquired. These acquired images are horizontally flipped versions of the original scene that we perceive. Figure 5.1 shows depth maps highlighting the changes in the location of the holes over a period of time.

Figure 5.1: A set of depth maps of the scene without the objects of interest.

The OOI along with the two non-identical objects is then placed on the turntable, and the distance between the Kinect sensor and imaginary plane-1 is noted down as D1. Similarly, the distance between the wall and imaginary plane-2 is noted down as D2. A set of depth maps and colour maps of the scene with the OOI is then acquired in each position of the turntable.


Figures 5.2 and 5.3 show the depth maps and colour maps of the scene in the 1st and 21st positions respectively.

Figure 5.2: A set of depth maps of the scene with the objects of interest in the initial position (1st).

Figure 5.3: A set of depth maps of the scene with the objects of interest in the final position (21st).


The outliers from the acquired depth maps of the scene with and without the OOI are detected based on the MAD, using equations 3.1 and 3.2, and are then removed. The sets of depth maps of the scene in each position are then averaged, and the holes in the averaged depth maps are filled using the MATLAB function "imfill". Figures 5.4 and 5.5 show the averaged and enhanced depth maps of the scene without the OOI respectively. Compared to the originally captured depth maps, it can be observed that the averaged depth maps have considerably fewer holes, due to the averaging of the set of depth maps over a period of time. From figure 5.5, we can observe that the holes present on the surface of the turntable in figure 5.4 are filled. The black areas inside the circle in figure 5.4 show the holes.




Figures 5.6 and 5.7 show the averaged and enhanced depth maps of the scene with the OOI in the 1st position respectively.

Figure 5.6: Averaged depth map of the scene with the objects of interest in the initial position (1st).


The enhanced depth maps of the scene with the OOI in the different positions considered are subtracted from that of the scene without the OOI. Based on the distance from the wall, D2, the pixels with depth values less than D2 are discarded.

For the remaining valid pixels, the corresponding depth values from the enhanced depth maps of the scene with the OOI are copied into another empty depth map of the same dimensions. Figure 5.8 shows the subtracted depth map of the scene with the objects of interest in the initial position. Figures 5.9 and 5.10 show the retrieved depth maps of the objects of interest in the 1st and 21st positions.



Figure 5.9: Depth map of the objects of interest in the 1st position.



5.3 Point Cloud Generation

The depth maps of the OOI and those of the scene with the OOI are subsequently converted into point clouds based on the depth values and the intrinsic parameters of the Kinect depth sensor (Table 5.1), using equations 3.3 and 3.4. Figure 5.11 shows the point cloud of the whole scene with the OOI, generated by converting the depth map in figure 5.7 into a point cloud. Figures 5.12 and 5.13 show the point clouds of the OOI in the 1st and 21st positions respectively, generated from the depth maps in figures 5.9 and 5.10 respectively. The origin of the 3D coordinate system of the point clouds in figures 5.11, 5.12 and 5.13 has been translated on to imaginary plane-1, based on D1, and all the points with negative depths have been discarded. From figure 5.11, it can be observed that the whole scene is present together with the objects of interest, whereas in figures 5.12 and 5.13 only the OOI are present. From figures 5.12 and 5.13, it can also be observed that the orientation of the OOI changes significantly.



Figure 5.12: Point cloud of the objects of interest in the 1st position.



5.4 Point Cloud Alignment & Merging

The point clouds thus generated for the different positions of the objects of interest are then aligned into the middle position (11th), where the surface of the object of interest is considered to be parallel to the face of the Kinect sensor, iteratively using the ICP algorithm. Figures 5.14 and 5.15 show the point clouds in the first and second positions before and after the ICP algorithm is implemented for aligning the point cloud in the first position into the second.



Figure 5.15: Point cloud of the objects of interest in the second position with the registered point cloud of the first position with respect to the second position.


Similarly, figures 5.16 and 5.17 show the point clouds in the 21st and 20th positions before and after the ICP algorithm is implemented for aligning the point cloud in the 21st position into the 20th position.

Figure 5.16: Point clouds of the objects of interest in the 21st and the 20th positions.

Figure 5.17: Point cloud of the objects of interest in the 20th position with the registered point cloud of the 21st position with respect to the 20th position.


Figure 5.18: Merged point cloud of the objects of interest in the second position with the registered point cloud of the first position with respect to the second position.

Figure 5.19: Merged point cloud of the objects of interest in the 20th position with the registered point cloud of the 21st position with respect to the 20th position.


This process is continued till all the point clouds before and after the middle position are aligned and merged into it. Figures 5.20 and 5.21 show the point clouds generated by aligning and merging the point clouds before and after the middle position into it.
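A minimal sketch of this cascade, assuming the 21 isolated point clouds are stored in a cell array clouds (illustrative, not the thesis code); the clouds from positions 21 down to 12 would be processed analogously in descending order:

```matlab
% Minimal sketch: cascade the clouds from position 1 up to the middle (11).
merged = clouds{1};
for p = 2:11
    tform = pcregrigid(merged, clouds{p}, 'MaxIterations', 200);
    merged = pcmerge(clouds{p}, pctransform(merged, tform), 0.001);
end
```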

Figure 5.20: Merged point cloud of the objects of interest in the 11th position, obtained from the positions 1 to 10.

Figure 5.21: Merged point cloud of the objects of interest in the 11th position, obtained from the positions 12 to 21.


Figure 5.22: Final point cloud of the objects of interest in the 11th position.


Figure 5.23: Final point cloud of the objects of interest in the 11th position.


ICP registration, from position no. to position no. | Root Mean Square Error (RMSE)
1 to 2   | 0.0012
2 to 3   | 0.0011
3 to 4   | 0.0016
4 to 5   | 0.0017
5 to 6   | 0.0019
6 to 7   | 0.0024
7 to 8   | 0.0027
8 to 9   | 0.0027
9 to 10  | 0.0035
10 to 11 | 0.0030
21 to 20 | 0.0048
20 to 19 | 0.0046
19 to 18 | 0.0042
18 to 17 | 0.0043
17 to 16 | 0.0037
16 to 15 | 0.0038
15 to 14 | 0.0034
14 to 13 | 0.0032
13 to 12 | 0.0038
12 to 11 | 9.58 × 10⁻⁴

Table 5.2: The ICP registrations and their corresponding RMSE values.


Chapter 6

Discussion

6.1 Validation of Results

For validation of the results, a scenario is considered where the shoe is placed on a graph paper as shown in figure 6.1. The X-axis of the 3D coordinate system is considered to be along the length of the shoe, the Y-axis along the height of the shoe, perpendicular to the surface of the graph paper, and the Z-axis along the width of the shoe. Initially, we locate the origin of the coordinate system on the graph paper in the XZ plane, as shown in figure 6.1. From figures 6.1 and 6.2, it can be observed that the length of the shoe and the height of the sole are approximately the same in both cases. Two points P1(4, −10, 12) mm and P2(23, −10, 13) mm are considered on the surface of the real object, as shown in figure 6.3, and the distance d between them is calculated as

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}    (6.1)

d = \sqrt{(23 - 4)^2 + (-10 - (-10))^2 + (13 - 12)^2} = 19.02 mm.    (6.2)

On the surface of the point cloud, we try to locate the nearest matches to the two points considered earlier on the surface of the object. The points cannot be located exactly, as there still exists some loss of resolution. The two points from the point cloud are P3(4.163, −9.907, 11.85) mm and P4(22.98, −9.955, 13.35) mm, as shown in figure 6.4, and the distance d′ between them is calculated as

d′ = \sqrt{(22.98 - 4.163)^2 + (-9.955 - (-9.907))^2 + (13.35 - 11.85)^2} = 18.876 mm.    (6.3)


From d and d′, it can be observed that the variation in the measured distance is negligible. Based on this, it can be said that there is not much distortion in the generated point cloud. From figures 6.2, 6.4 and 6.6, it can be observed that the depth intensities of the points in the point cloud present around the edges of the OOI are quite erroneous, due to the presence of flying pixels in the data.



Figure 6.6: Front view of the point cloud of the object.

Similarly, two points P5(55, −14, 15) mm and P6(65, −9, 22) mm on the real object are considered with a variation in the depth, as shown in figure 6.5, and the distance d_1 between them is calculated:

d_1 = \sqrt{(65 - 55)^2 + (-9 + 14)^2 + (22 - 15)^2} = 13.19 mm.    (6.4)

Nearest matches to these two 3D points are then located on the point cloud of the object, as P7(55.22, −14.21, 17) mm and P8(65.27, −9.02, 22.18) mm, and the distance d′_1 between them is calculated:

d′_1 = \sqrt{(65.27 - 55.22)^2 + (-9.02 + 14.21)^2 + (22.18 - 17)^2} = 12.44 mm.    (6.5)

From d_1 and d′_1, it can be observed that the difference between them is negligible.



6.2 Sources of Error in 3D Model Creation

Initially, the objects of interest and the sensor were at different elevations, as shown in figure 6.7. This led to the creation of an IR reflection of the object in front of it, as seen in figure 6.8. The reflection can be avoided in a scenario where both the OOI and the Kinect sensor are placed at the same level.

Figure 6.7: Colour image of the scene showing the placement of the object of interest and the Kinect sensor on different levels.


The minimum distance between the Kinect and the OOI for the OOI to be recognized by the Kinect depth sensor is 0.5 meters; hence we consider an optimal distance of approximately 0.85 meters between the Kinect and the OOI. If the OOI is placed nearer or farther than the optimal distance range (0.8 to 1.2 meters), the variance of the depth intensities over a period of time increases, in turn leading to an increase in the RMSE.

Two non-identical objects are considered along with the main OOI to simplify the detection of the points used for the alignment of point clouds in two different positions using the ICP algorithm.

In order to remove the noise created during the conversion of a depth map into a point cloud, we consider two imaginary planes IP1 and IP2 parallel to the face of the Kinect sensor, such that plane IP1 is in front of the object's surface and plane IP2 is behind the object's surface, as shown in figure 4.2. The imaginary planes IP1 and IP2 also facilitate the isolation of the depths of the OOI from the background.

The Kinect for Windows V2.0 depth sensor is vulnerable to various kinds of errors [5, 8]. The main sources of error in the acquired depth data are:

• Temperature drift
• Depth inhomogeneity
• Intensity related issues



Figure 6.9: Two depth maps of the same scene showing the depth variations over a period of time due to the drift in the temperature.

Figure 6.10: Two depth maps of the same scene showing how the holes at the edges of the objects vary over time.



Figure 6.11: Colour image and depth map of a scene depicting the intensity related issues in depth maps.



6.3 Advantages and Limitations

Due to the change in the technology used for sensing the depths in the Kinect for Windows V2.0 sensor, the measurements have become more accurate and have three times higher fidelity compared to the earlier Kinect sensor. The point clouds generated through this sensor have a higher resolution, as the depth is calculated individually for each pixel in the depth maps, which leads to its capability to capture even small artefacts in an environment. Its robustness to daylight also makes it possible to use the Kinect for Windows sensor in an open environment.

Due to the high accuracy of the depth maps, i.e. an error of less than 3 mm at a distance of around one meter [5], we can say that the depth maps and point clouds generated using a Kinect for Windows V2.0 sensor are reliable. In this study, we exploit this ability of the Kinect sensor to generate accurate and high resolution point clouds at a low cost. By decreasing the rotational and translational error tolerance of the ICP algorithm, it can be observed that the generated point cloud of the OOI is well registered and that the final average RMSE of the system is 2.9 mm. This study emphasises the use of low-cost 3D imaging sensors such as the Kinect for Windows sensor for high resolution 3D reconstruction of an object in an environment.


Chapter 7

Conclusions and Future Scope

Based on the results, it can be concluded that the 3D point cloud of an irregular surfaced object can be reconstructed using the Kinect for Windows V2.0 sensor in an indoor environment. The average RMSE of the reconstructed sole of the shoe is 0.002885 m. This system could be effectively used for 3D reconstruction in forensics, for preserving historic artefacts, etc.

7.1 Further Scope

The accuracy of the algorithm can be further improved by using image filters such as joint bilateral filters for filling the holes in a depth map. Implementation of a colour, depth and texture based ICP algorithm for the registration of the point clouds in different positions may also lead to an increase in the accuracy, but it has to be noted that the computational complexity, hardware and time requirements of the process also increase significantly with the increase in accuracy.

The accuracy of the rotation system can be increased by automating the turntable so that it can be rotated in steps of one degree, which in turn leads to an increase in the accuracy of the system. By using efficient surface reconstruction techniques, this system can be used to generate 3D printable point clouds. Parallel computing could be used for effectively decreasing the time taken for the registration of the different point clouds into the same alignment.

A more complete 3D reconstruction of the sole of the shoe could be achieved by acquiring more complete depth data of it, i.e. by acquiring the depth data while rotating the shoe along both its horizontal and vertical axes, and subsequently converting the depth maps into point clouds, aligning and merging them to form a denser point cloud.


References

[1] M. Hansard, S. Lee, O. Choi, and R. Horaud, Time-of-Flight Cameras: Principles, Methods and Applications. SpringerBriefs in Computer Science, Springer London, 2012.

[2] N. Pears, Y. Liu, and P. Bunting, 3D Imaging, Analysis and Applications. Springer London, 2014.

[3] D. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Pearson Education Limited, 2015.

[4] R. Szeliski, Computer Vision: Algorithms and Applications. Texts in Com-puter Science, Springer London, 2010.

[5] H. Sarbolandi, D. Lefloch, and A. Kolb, Kinect range sensing: Structured-light versus time-of-flight Kinect, CoRR, vol. abs/1505.05459, 2015.

[6] L. Shao, J. Han, P. Kohli, and Z. Zhang, Computer Vision and Machine Learning with RGB-D Sensors. Advances in Computer Vision and Pattern Recognition, Springer International Publishing, 2014.

[7] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon, KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera, ACM Symposium on User Interface Software and Technology.

[8] E. Lachat, H. Macher, M.-A. Mittet, T. Landes, and P. Grussenmeyer, First Experiences with Kinect v2 Sensor for Close Range 3D Modelling, ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pp. 93-100, Feb. 2015.

[9] J. R. Terven and D. M. Córdova-Esparza, Kin2: A Kinect 2 toolbox for MATLAB.

[10] Oracle Crystal Ball reference and examples guide - outlier detection methods.


[11] C. Leys, C. Ley, O. Klein, P. Bernard, and L. Licata, Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median, Journal of Experimental Social Psychology, vol. 49, no. 4, pp. 764-766, 2013.

[12] S. Rusinkiewicz and M. Levoy, Efficient variants of the ICP algorithm, in Third International Conference on 3D Digital Imaging and Modeling (3DIM), June 2001.

[13] S. Umeyama, Least-squares estimation of transformation parameters between two point patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 4, pp. 376-380, 1991.

[14] Fill image regions and holes - MATLAB imfill - MathWorks Nordic.

[15] Register two point clouds using ICP algorithm - MATLAB pcregrigid - MathWorks India.

[16] Merge two 3D point clouds - MATLAB pcmerge - MathWorks.
