International Master’s Thesis

Model-based object tracking with an infrared stereo camera

Juan Manuel Rivas Diaz

Technology

Studies from the Department of Technology at Örebro University


Studies from the Department of Technology at Örebro University

Juan Manuel Rivas Diaz

Model-based object tracking with an infrared stereo camera

Supervisor: Todor Stoyanov
Examiners: Henrik Andreasson


© Juan Manuel Rivas Diaz, 2015

Title: Model-based object tracking with an infrared stereo camera


Abstract

Object tracking has become increasingly important in the field of robotics in recent years. Frequently, the goal is to obtain the trajectory of the tracked target over time and space by acquiring and processing information from sensors. In this thesis we are interested in tracking objects at very short range. The primary application of our approach is object tracking during grasp execution with a hand-in-eye sensor setup. To this end, a promising approach investigated in this work is based on the Leap Motion sensor, which is designed for tracking human hands. However, we are interested in tracking grasped objects, so its functionality needs to be extended. The main goal of the thesis is to track the 3D position and orientation of an object from a set of simple primitives (cubes, cylinders, triangles) over a video sequence. For this reason, we have designed and developed two different approaches for tracking objects using the Leap Motion device as a stereo vision system.

Keywords: Stereo Vision, Tracking, Leap Motion, Robotics, Particle filter.


Acknowledgements

This thesis marks the end of an amazing year in Sweden, where I had the opportunity to grow both as a person and in my academic career. First and foremost, this would not have been possible without the endless love, encouragement and support of my family.

Second, to my friends Javi, Zurita, Nelson, Belen, Carra... even at a distance I know that you were there. Third, to all the friends from different nationalities that I made here, and especially to those who shared the good and bad moments with me in lab T002 while making this thesis.

Last but not least, to all the great Master's professors and especially to my supervisor Todor, who always helped me when I needed it.

To all of them, THANKS.


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem statement
  1.3 Contributions
  1.4 Thesis Outline

2 Background
  2.1 Previous Work
  2.2 Computer Vision
    2.2.1 Pinhole Camera
    2.2.2 Distortions
    2.2.3 Stereo cameras and epipolar geometry
    2.2.4 Canonical Stereo Vision
    2.2.5 Stereo Camera Parameters
  2.3 Feature Detection
    2.3.1 Canny edge detector
    2.3.2 Distance Fields
    2.3.3 FAST detector
    2.3.4 SURF descriptor
    2.3.5 Matching Keypoints
  2.4 Filters
    2.4.1 Bayesian Filter
    2.4.2 Particle Filters
    2.4.3 Low variance sampling

3 Methodology
  3.1 Image acquisition
  3.2 Feature Based Particle Filter
    3.2.1 Image processing
    3.2.2 Implementation Problems: Lack of visual features
  3.3 Contour Based Particle Filter
    3.3.1 Image Processing
    3.3.2 Particle Filter
    3.3.3 3D Model Generation and Projection

4 Experimental Setup
  4.1 Hardware
  4.2 Software
  4.3 Targets
  4.4 Experimental Scenario

5 Experimental Results
  5.1 Feature Based Test
  5.2 Contour Based Test

6 Conclusions and Future work
  6.1 Conclusions
  6.2 Future work


List of Figures

2.1 Pinhole camera example
2.2 Pinhole camera geometry
2.3 Radial distortion
2.4 Before and after radial distortion correction
2.5 Epipolar geometry
2.6 Occlusion representation
2.7 Canonical camera configuration
2.8 Stereo geometry in canonical configuration
2.9 Example of the Canny edge detector
2.10 Distance transform for a simple rectangular shape
2.11 FAST feature detection in an image patch
2.12 Haar filters used in the SURF descriptor
2.13 Haar responses in the subregions around a keypoint
2.14 Matching representation of the descriptor
3.1 Leap Motion architecture
3.2 Raw image from the Leap Motion
3.3 Example of the feature extraction steps
3.4 Flowchart of the image processing of the first approach
3.5 Image before and after applying the Canny edge detector
3.6 Distance Image
3.7 Flowchart of the Particle Filter
3.8 Flowchart of the measurement model
3.9 Representation of the different frames of the system
4.1 Leap Motion schematic
4.2 Leap Motion Controller interaction area
4.3 Cubes and rectangle objects to track used in the experiments
4.4 Triangle and cylinder to track used in the experiments
4.5 Experimental setup overview
5.1 Test 1: Plot of features detected over time at 5 cm
5.2 Test 1: Plot of good features over time at 5 cm
5.3 Test 2: Plot of features detected over time at 10 cm
5.4 Test 2: Plot of good features over time at 10 cm
5.5 Test 3: Plot of features detected over time at 15 cm
5.6 Test 3: Plot of good features over time at 15 cm
5.7 Evolution of particles in the particle filter
5.8 Evolution of the projection in the particle filter
5.9 Histogram of the weights of the particles in the initial position
5.10 Histogram of the weights of the particles in the final position


List of Tables

4.1 Summary of the experiments
5.1 Means of the features detected with SURF and FAST algorithms
5.2 Results of the particle filter


List of Algorithms

1 Bayes Filter
2 MCL Algorithm
3 Low variance sampler


Chapter 1

Introduction

1.1 Motivation

Object tracking has become increasingly important in the field of robotics, with multiple applications such as object manipulation, human-machine interaction, human detection, road traffic control, and security and surveillance systems [44]. Frequently, the goal is to obtain the trajectory of the tracked target over time and space by acquiring and processing information from the sensors. Object tracking in real time is a difficult task, since it requires online processing of a large amount of data, which can be computationally expensive. Depending on the sensor type, different approaches can be proposed for the tracking problem. For example, Bray and Koller-Meier [5] used a particle filter for 3D hand tracking, and Muñoz-Salinas [30] used a Kalman filter to track people based on colour features.

1.2 Problem statement

In this thesis we are interested in tracking objects at very short range. The primary application of our approach is object tracking during grasp execution with a hand-in-eye sensor setup. To this end, a promising approach investigated in this work is based on the Leap Motion [8] sensor. It is a USB device with two IR cameras, packaged with a set of processing algorithms that can track hands in a range of 0.1 to 0.6 m above the controller. The Leap Motion is designed for tracking human hands; however, we are interested in tracking grasped objects, so we need to extend its functionality.

Due to the IP protection of the company, the firmware is closed source and it is therefore difficult to replicate its performance. The main goal of the thesis is to track the 3D position and orientation of an object from a set of simple primitives (cubes, cylinders, triangles) over a video sequence.

1.3 Contributions

The Leap Motion device is able to track hands but not objects, so adding object tracking can extend the range of robotic applications in which the device can be used. For this reason, we have designed and developed two different approaches for tracking objects with the Leap Motion device as a stereo vision system. In the first approach, different computer vision algorithms for extracting features from a scene were investigated; however, the implementation was not feasible due to the lack of features. For the second approach, a contour based particle filter is implemented, with a simple edge detector providing the input to the algorithm. Although the behaviour is not yet as expected, with future modifications and improvements it could be used as a tracking system for a robotic gripper.

Additionally, different experiments were carried out with the Leap Motion device, testing the algorithms from the first approach and the efficiency of the contour based particle filter. The results are analysed quantitatively and qualitatively, and the reasons for the observed behaviour are discussed.

1.4 Thesis Outline

This section describes the organization of this thesis. The document is structured in six chapters. Chapter 1 presents the reasons that motivated this project and states the goals to be achieved. Chapter 2 summarizes the theory needed for the thesis, including computer vision, geometry and filtering algorithms. Two different ways to address the problem are described in Chapter 3; flowcharts are included to make them easier to follow. Chapter 4 describes the experimental setup. In Chapter 5 the results are presented and analysed. Finally, Chapter 6 presents the conclusions drawn from the results and the future work.


Chapter 2

Background

2.1 Previous Work

In the last decade, tracking objects with stereo cameras has attracted attention due to its applications in fields like robotics, road traffic control and security, some of which were described in Chapter 1. Tracking objects is a difficult problem because no sensor gives perfect measurements. Laser range finders, sonars, IR sensors and cameras are all affected by uncertainty and noise. Bayesian filters can handle this kind of uncertainty, which is why most research on tracking objects with cameras extends Bayesian filtering techniques.

Most of the proposed solutions were based on the Kalman filter [19], a variant of the Bayesian filter. These filters are optimal under the assumption that the uncertainty can be modelled with a Gaussian and that the observation and dynamic models are linear. Dellaert and Thorpe [11] used a Kalman filter for robust car tracking with simple image processing techniques. Gennery [13] uses a modified Kalman filter for tracking known three-dimensional objects with six degrees of freedom.

As stated before, the Kalman filter assumes that the system is linear; however, most real-world systems are not. For these situations the Extended Kalman Filter (EKF) [18] is often used. In this special case of the Kalman filter, the idea is to linearise the estimation of the state with a first-order Taylor series. Medeiros [24] used an EKF to model and track objects in video sequences.

Due to the difficulties in implementing and tuning the EKF, the Unscented Kalman Filter (UKF) was introduced [40]. The main difference is that the UKF represents the state distribution using a minimal set of carefully chosen sample points. Dambreville [9] uses Unscented Kalman filters for tracking deformable objects by active contours.

A different family of Bayesian filters is the Particle Filter [10], which can be easily implemented and does not have the drawbacks of the Kalman filter methods. Particle filters became popular with Isard and Blake [17], where contour based observations were used to track objects in clutter. Contour based tracking systems were also described by MacCormick [23] and Sullivan [38]. Colour based particle filter methods were also presented by Barrera [2], where the estimated errors were under 5 cm.

2.2 Computer Vision

Computer vision includes the methods to convert images from the real world into numerical information in order to make them tractable for a computer. It also covers the algorithms to analyse and manipulate images [27]. Different types of image data exist: single view images, stereo vision, stereo video with more than two cameras, video sequences, etc. There are also different sub-disciplines such as object recognition, object tracking, scene reconstruction and pattern recognition.

This chapter starts with a brief explanation of the simplest geometric camera model and continues with stereo vision theory. Afterwards, computer vision algorithms for detecting, extracting and matching features are presented. Finally, the basics of general particle filters are explained.

2.2.1 Pinhole Camera

Different models exist to define the geometry of the projection of an object onto the camera plane. The simplest one is the pinhole camera model. As can be seen in Figure 2.1, a pinhole camera is a box that is completely dark in its interior and allows light rays to pass through a single small aperture, a pinhole. The light that passes through the pinhole projects an inverted image of the scene.

Some considerations must be taken into account for this model (Figure 2.2):

• Every light ray that traverses the optical centre C does not suffer any deviation. The optical centre is the origin of the coordinate system.

• The optical axis is the imaginary line that starts at the optical centre and crosses the image plane perpendicularly.

• The focal distance b is the distance from the optical centre to the focal plane.

• The focal plane or image plane is situated at Z = b. It is the virtual plane where the image is projected without any inversion.

• The principal point p is the intersection of the optical axis with the image plane.

Figure 2.1: Pinhole Camera example [12].

This camera model provides an easy way to understand the relation between the position of a point in the real world P(X, Y, Z) and the optical system of the camera (u, v). From Figure 2.2, using direct geometric relations, the following equations can be extracted:

\hat{u} = -\frac{bx}{z} \quad (2.1)

\hat{v} = -\frac{by}{z} \quad (2.2)

For this model, every point M from the real world can be converted into a point m of an image with the following relation:

m \simeq PM \quad (2.3)

where P is the projection matrix, which will be described in detail in the following sections.


Figure 2.2: Pinhole camera geometry [42].

2.2.2 Distortions

In order to maintain the relation from Equation 2.3, the radial distortions generated during image formation need to be considered.

The use of a lens makes it easier for light to enter, providing a good focus and versatility. However, it also introduces deformations in the images taken by the sensor. One of these effects is called radial distortion, and it can be appreciated in Figure 2.3.

Figure 2.3: Radial distortion [20].

When a light ray enters the camera, the lens bends it so that it hits the sensor, which records it as a greyscale brightness value at a specific pixel location. Of course, no lens is perfect, so a ray of light does not land on the sensor at the optically perfect spot. This distortion is more pronounced near the image borders, as can be seen in Figure 2.4. It also grows when the focal distance decreases or with low quality lenses [28].

The radial distortion can be modelled with a Taylor series around r = 0 [16]. Therefore, the image coordinates with distortion are transformed as:

\hat{x} = x_c + L(r)(x - x_c) \quad (2.4)

\hat{y} = y_c + L(r)(y - y_c) \quad (2.5)

where

• (\hat{x}, \hat{y}) are the corrected coordinates.

• (x, y) are the measured coordinates.

• (x_c, y_c) is the centre of radial distortion.

• r is the radial distance \sqrt{\hat{x}^2 + \hat{y}^2} from the centre of radial distortion.

• L(r) = 1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + ... is a distortion factor, which is only a function of the radius r.

• k_1, k_2 and k_3 are the distortion coefficients. Normally k_1 and k_2 are enough; however, for lenses with high distortion (e.g. cheap or fisheye lenses) the use of k_3 is necessary for a proper correction.
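
As a concrete illustration, the following minimal C++ sketch applies the distortion factor L(r) of Equations 2.4 and 2.5 to a measured pixel coordinate. It only restates the model as written above; the centre and coefficient values are placeholders, not the calibrated Leap Motion parameters.

```cpp
// Minimal sketch of the radial distortion model of Equations 2.4 and 2.5.
struct RadialDistortionModel {
    double xc, yc;      // centre of radial distortion (placeholder values)
    double k1, k2, k3;  // distortion coefficients (placeholder values)

    void apply(double x, double y, double &x_hat, double &y_hat) const {
        const double dx = x - xc, dy = y - yc;
        const double r2 = dx * dx + dy * dy;                              // squared radius
        const double L  = 1.0 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2;
        x_hat = xc + L * dx;                                              // Equation 2.4
        y_hat = yc + L * dy;                                              // Equation 2.5
    }
};
```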


2.2.3 Stereo cameras and epipolar geometry

With only one camera it is possible to determine the transformation of a 3D point in the real world to a 2D point in the image; that is, it is feasible to establish the relation between the coordinates of a real world point and its projection in pixels. Nevertheless, the inverse operation cannot be computed from a single image due to the ambiguity of the relation between real points and pixels.

Figure 2.5: Epipolar geometry [31].

Hence, it is necessary to have two or more cameras (or views) simultaneously acquiring images of the same scene, as in Figure 2.6. Then the disparity map can be computed and distances can be measured. This is possible thanks to epipolar geometry, which establishes the relations between the cameras and the real world.

Based on Figure 2.5:

• P is a point in space that is projected into each camera, generating the points p_r and p_l.

• O_l and O_r are the optical centres of the cameras.

• The lines P_l and P_r are called epipolar lines.

• The indicated points are called epipoles; they are the projections of the optical centre of one camera onto the projection plane of the other camera. Every epipolar line of an image passes through the epipole of that image.


It can be seen that any point falling on the line O_l p_l (or on O_r p_r) always has the same projection p_l (or p_r). That is why it is not possible, with just one view, to relate a projection to a real world point. This ambiguity is called occlusion.

Looking at Figure 2.6, it can be appreciated that, with only one image, the three points from the real world appear as the same point in the projected image. The depth information is recovered by the second camera, where the three points are represented separately.

However, in order to use epipolar geometry for triangulation, the following facts need to be considered:

• Every point X in the cameras' working area lies in an epipolar plane, thus it will have two projections (x_l, x_r).

• A point x_i in the projective plane of one of the cameras has an associated projection in the other image, situated along the corresponding epipolar line of that camera. This is called the epipolar constraint; once the epipolar geometry is known, the search for projection pairs can be restricted to a one-dimensional search along the epipolar lines. This saves computation time in the correspondence search and also helps to reject false positives.

• Every epipolar plane associated with a point in space always intersects the line O_1 O_2.

All these facts help with the calculation of the physical positions of the points in 3D space.


2.2.4 Canonical Stereo Vision

The reconstruction of a three-dimensional scene from two images acquired at different positions and viewing directions is defined as stereo vision [42]. The properties of these cameras are determined by their epipolar geometry, which describes the relationship between the real-world points in their fields of view and the images on their respective sensing planes. The Leap Motion sensor takes advantage of the canonical configuration for its stereo geometry. In this configuration, the baseline of the two pinhole cameras is aligned with the horizontal coordinate axis, the optical axes of the cameras are parallel, the epipoles move to infinity and the epipolar lines in the image planes are parallel, as shown in Figure 2.7.


Figure 2.8: Stereo geometry [32].

If a point in the real world P = (x, y, z) is examined in Figure 2.8, the following relations can be extracted from the geometry:

\frac{P_l}{f} = -\frac{h + x}{z} \quad (2.6)

\frac{P_r}{f} = \frac{h - x}{z} \quad (2.7)

where P_l and P_r are the projections of P onto the left and right images and h is half of the baseline distance.

Combining these equations and defining the disparity as d = P_r - P_l, the depth can be obtained as:

Z = \frac{2hf}{d} = \frac{bf}{d} \quad (2.8)

As can be seen, the depth Z is inversely proportional to the disparity; zero disparity therefore means that the point is infinitely far away.
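
The following minimal C++ helper evaluates Equation 2.8; it is a sketch only, and the example baseline (the 4 cm used later in the thesis) and focal length value are assumptions rather than the actual calibration.

```cpp
#include <limits>

// Depth from disparity for a canonical stereo pair (Equation 2.8): Z = b*f/d.
// b is the baseline in metres, f the focal length in pixels, d the disparity in pixels.
double depthFromDisparity(double disparity_px, double baseline_m, double focal_px) {
    if (disparity_px <= 0.0)
        return std::numeric_limits<double>::infinity();  // zero disparity: point at infinity
    return baseline_m * focal_px / disparity_px;
}

// Example with assumed values: 4 cm baseline and a placeholder focal length of 250 px.
// double z = depthFromDisparity(20.0, 0.04, 250.0);      // = 0.5 m
```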


2.2.5 Stereo Camera Parameters

The projection matrix P from Equation 2.3 can be used to transform the coordinates of a 3D point in the real world into the pixel coordinates of an image. For stereo pair configurations, it is constructed from the matrix K and a vector with information about the position of the optical centre:

P = [K \mid T] \quad (2.9)

where K is the intrinsic calibration matrix, constructed from the intrinsic parameters of the camera as follows:

K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}

c_x and c_y represent the distance from the origin of the image coordinate plane to the principal point. f_x and f_y are the focal lengths in pixels. They are proportional to the focal length, as stated in Equations 2.10 and 2.11:

f_x = f S_x \quad (2.10)

f_y = f S_y \quad (2.11)

where f is the physical focal length of the lens in metric units and S_x, S_y are the number of pixels per metric unit of the sensor along the X and Y axes. If the sensor has the same number of pixels per metric unit in both dimensions, f_x and f_y will have the same value.

The vector T contains information about the position of the optical centre of the second camera in the first camera's frame. In the canonical configuration, the first camera always has T_x = T_y = 0. For the right (second) camera of a horizontal stereo pair, T_y = 0 and T_x = -f'_x \cdot B, where B is the baseline between the cameras. T_z = 0 since both cameras are in the same stereo image plane [7]:

T = \begin{pmatrix} T_x \\ T_y \\ 0 \end{pmatrix}
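
As an illustration of Equation 2.9, the sketch below builds K and the left/right projection matrices of a horizontal stereo pair with Eigen and projects one homogeneous 3D point. The intrinsic values are placeholders, not the real Leap Motion calibration; only the 4 cm baseline is taken from the setup described later in the thesis.

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
    // Placeholder intrinsics (not the actual Leap Motion calibration).
    const double fx = 250.0, fy = 250.0, cx = 140.0, cy = 110.0;
    const double B  = 0.04;  // baseline in metres

    Eigen::Matrix3d K;
    K << fx, 0.0, cx,
         0.0, fy, cy,
         0.0, 0.0, 1.0;

    // P = [K | T]; the left camera has T = 0, the right camera has Tx = -fx * B.
    Eigen::Matrix<double, 3, 4> Pl = Eigen::Matrix<double, 3, 4>::Zero();
    Eigen::Matrix<double, 3, 4> Pr = Eigen::Matrix<double, 3, 4>::Zero();
    Pl.leftCols<3>() = K;
    Pr.leftCols<3>() = K;
    Pr(0, 3) = -fx * B;

    // Project a 3D point given in the left camera frame (homogeneous coordinates).
    Eigen::Vector4d X(0.01, 0.02, 0.10, 1.0);
    Eigen::Vector3d ml = Pl * X, mr = Pr * X;
    std::cout << "left:  " << ml.hnormalized().transpose() << "\n"
              << "right: " << mr.hnormalized().transpose() << "\n";
    return 0;
}
```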

2.3 Feature Detection

Once the camera parameters are determined and the obtained images are rectified and transformed to correct the distortions and errors induced by the camera, the next step is the search for correspondences; in this thesis, finding good features that can be used as input to the tracking algorithm.

The first step of this process is to find the features in the images. These points have a perfectly defined position and their neighbouring pixels carry a large amount of local information. One of their most important characteristics is stability under local and global perturbations. These perturbations can be translations, rotations, scaling, or changes in luminosity and perspective. To achieve this first step, there are feature detectors based on different operators that can be used. Some operators reveal information about corners, others about edges, etc.

Once the features are detected, the next step is to describe the area around each feature. The features detected in the first step give information about their localization; however, they are not sufficient for a subsequent image correspondence search. The description of a feature is extracted from a defined area or kernel around the point of interest. Bigger kernels imply more computational cost but a larger description region. Nevertheless, very large descriptors are not of interest, since they lose the locality property.

There are different ways to describe the neighbourhood of a feature point. One way is to evaluate the grey level: if the image is normalized to one, darker pixels have a value closer to zero and brighter ones closer to one. Another way is the colour level, where the three colour matrices R, G and B are evaluated; the colour of a region can be described by its three colour histograms or by the mean of the region. It is also possible to describe the features by the orientation of the gradients around them. All these descriptors are independent of the detection method used in the first step.

The feature description allows the algorithms to relate the feature points of a trained image to a test image. That is why it is important for these feature descriptors to be scale and position invariant: an item in the trained image can appear rotated, translated or scaled in the test image, so the algorithm must be invariant under these changes.

There are different methods for computing these point features, but the descriptors should have the following characteristics:

• Simplicity: The descriptor should capture the characteristics of the image in a simple and clear way in order to be easily interpreted.

• Repeatability: The descriptor generated from an image must be independent of the moment at which it was generated.


• Uniqueness: The descriptor should have a high degree of differentiability with respect to other images and, at the same time, it should contain enough information to establish relations between similar images.

• Invariance: The descriptor needs a high degree of robustness to the different transformations between images.

• Efficiency: The generation of the descriptors needs to be in accordance with the time/computation constraints of the applications.

There are two types of image descriptors that have to be considered when choosing the correspondence method:

• Global descriptors: They summarize a lot of information about the image with little data. Despite their simplicity, they are commonly used due to their low computational cost and good results. One example of these descriptors is the colour histogram.

• Local descriptors: These descriptors are applied when the local information of the image is more relevant. They act on predefined regions of interest, identifying key information about a point and its neighbours. These regions of interest or points are called keypoints, and the descriptor is formed by the set of feature keypoint vectors. Some examples of local descriptors are SURF, SIFT and ORB.

After testing different methods, the choice for this thesis is the following: feature detection is computed with FAST and description with SURF, due to their robustness and high performance, which makes them usable in real time applications. The matching problem is solved with a brute force matcher; even though it is a naive implementation, it systematically checks all possible feature pairs.

In the following sections the Canny edge detector and distance images are also explained, because they are applied in this thesis to obtain the information that is used as input to the particle filter.

2.3.1 Canny edge detector

The Canny edge detector is an operator developed by John F. Canny in 1986 [6]. It uses a multi-step algorithm for detecting edges in images that aims to satisfy the following criteria:

• Low error rate: Just detect good edges.

• Good localization: The distance between detected edge pixels and real edge pixels has to be minimized.


• Minimal response: There can be only one detector response per edge.

The steps of the algorithm are the following [33]:

1. A Gaussian filter removes noise. An example of a Gaussian kernel of size 5 that might be used is:

K = \frac{1}{159} \begin{pmatrix} 2 & 4 & 5 & 4 & 2 \\ 4 & 9 & 12 & 9 & 4 \\ 5 & 12 & 15 & 12 & 5 \\ 4 & 9 & 12 & 9 & 4 \\ 2 & 4 & 5 & 4 & 2 \end{pmatrix}

2. Obtain the intensity gradient of the image:

(a) Apply a pair of convolution masks in the x and y directions:

G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \qquad G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}

(b) Find the strength and direction of the gradient:

G = \sqrt{G_x^2 + G_y^2}

\theta = \arctan\left(\frac{G_y}{G_x}\right)

The direction is rounded to one of four possible angles: 0, 45, 90 or 135 degrees.

3. Non-maximum suppression is applied for removing the pixels that are not considered to be part of an edge.

4. In the last step, Canny uses two thresholds to perform hysteresis:

(a) If the pixel gradient is higher than the upper threshold, the pixel is accepted as an edge.

(b) If the pixel gradient is below the lower threshold, it is rejected.

(c) If the pixel gradient is between the two thresholds, it is accepted only if it is connected to a pixel that is above the upper threshold.


An example of the algorithm can be seen in Figure 2.9.

Figure 2.9: Example of edge detection with the Canny algorithm [33].
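
In practice the detector does not have to be implemented from scratch; this thesis relies on OpenCV. A minimal usage sketch, assuming an OpenCV 3 style API, with a placeholder input file name and placeholder thresholds (Section 3.3.1 tunes them per working distance):

```cpp
#include <opencv2/opencv.hpp>

int main() {
    // Load one of the rectified Leap Motion images (file name is an assumption).
    cv::Mat gray = cv::imread("left_rect.png", cv::IMREAD_GRAYSCALE);
    cv::Mat edges;
    // Canny(src, dst, lowThreshold, highThreshold, apertureSize)
    cv::Canny(gray, edges, 100, 200, 3);
    cv::imwrite("edges.png", edges);
    return 0;
}
```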

2.3.2 Distance Fields

A distance field or distance image converts a binary digital image into an image where every pixel has a value corresponding to the distance to the nearest black pixel [4]. The result of the transformation is a grey level image that looks similar to the input image, except that the grey scale intensities of points inside foreground regions are changed to show the distance to the closest black pixel. Figure 2.10 shows the distance transform of a simple rectangular shape.

Figure 2.10: Distance transform for a simple rectangular shape [35].
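
OpenCV provides this transform directly. A hedged sketch, assuming an OpenCV 3 style API: cv::distanceTransform measures the distance to the nearest zero pixel, so the Canny output (white edges on a black background) is inverted first.

```cpp
#include <opencv2/opencv.hpp>

// Build a distance image from a binary edge map: every pixel receives the
// distance to the closest edge pixel.
cv::Mat distanceField(const cv::Mat &edges) {
    cv::Mat inverted, dist;
    cv::bitwise_not(edges, inverted);                      // edges -> 0, background -> 255
    cv::distanceTransform(inverted, dist, cv::DIST_L2, 3); // distance to nearest zero pixel
    return dist;                                           // CV_32F distances in pixels
}
```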

2.3.3 FAST detector

Features from Accelerated Segment Test (FAST) is an algorithm to detect local features in images. It is used in this thesis due to its high efficiency, which allows on-line operation of a tracking system [37].

An examination of a circle of 16 pixels (radius 3) centred at a candidate pixel p is performed. A feature is detected at p if the intensities of at least 12 contiguous pixels on the circle are all above or below the intensity of p by some threshold t. This test can be optimized by, for example, first examining pixels 1, 9, 5 and 13 to reject candidate pixels faster, as can be appreciated in Figure 2.11. This can be done because a feature can only exist if three of these tested points are all above or below the intensity of p by the threshold [36].

Figure 2.11: FAST feature detection in an image patch: C is the pixel position of the feature and the numbered pixels are the ones to be evaluated [37].
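
A minimal sketch of detecting FAST keypoints with OpenCV (assuming an OpenCV 3 style API; the intensity threshold value is a placeholder):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Detect FAST keypoints in a grayscale image.
std::vector<cv::KeyPoint> detectFast(const cv::Mat &gray) {
    std::vector<cv::KeyPoint> keypoints;
    cv::Ptr<cv::FastFeatureDetector> fast = cv::FastFeatureDetector::create(20);  // threshold t
    fast->detect(gray, keypoints);
    return keypoints;
}
```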

2.3.4 SURF descriptor

Speeded Up Robust Features (SURF) [3] is a local feature detector and descriptor inspired by SIFT [21]. In this thesis it is used as a descriptor. The first step is to determine the orientation of each of the keypoints detected in the previous step. For this, the Haar response is computed in both the x and y directions with the functions in Figure 2.12. The region of interest for this calculation is a circular area centred at the keypoint with radius 6s, where s is the scale at which the keypoint was detected.

The sampling step also depends on the scale, taking the value s. The Haar wavelet functions also depend on s, with a size of 4s; higher values of s imply a higher dimensionality of the Haar wavelet functions.


Figure 2.12: Haar functions for the calculation of the responses in the x direction (left) and y direction (right). Black has a value of -1 and white +1 [3].

After this calculation, the computed responses are weighted by a Gaussian distribution centred at the keypoint with σ = 2.5s. Then the responses are represented as vectors in the space defined by the horizontal and vertical responses on the x and y axes. Finally, the predominant orientation for each sector is obtained by summing all the responses within a sliding window that covers an angle of π/3, as the author states [1].

In the second step of the process, a square region of size 20s is generated. It is created around the keypoint, oriented according to the orientation computed in the previous step. This region is divided into 4 x 4 subregions in which the Haar responses are calculated. The Haar responses in the horizontal and vertical directions relative to the keypoint orientation are defined as d_x and d_y. In Figure 2.13 the Haar responses in each subregion and the d_x, d_y of each vector are represented.

For higher robustness against geometric transformations and position errors, d_x and d_y are weighted by a Gaussian with σ = 3.3s centred at the keypoint.

In order to have a representative value for each of these subregions, the responses d_x and d_y are summed. At the same time, the sums of the absolute values of the responses, |d_x| and |d_y|, are computed, so that polarity information about the intensity changes is captured.

Summarizing, each of the subregions is represented by a component vector v:

v = (\Sigma d_x, \Sigma d_y, \Sigma|d_x|, \Sigma|d_y|) \quad (2.12)

Taking all the 4 x 4 subregions together, the SURF descriptor is formed by 64 values for each detected keypoint.


Figure 2.13: Haar responses in the subregions around a keypoint [43].

2.3.5 Matching Keypoints

The main goal of matching keypoints is to obtain a value representing the similarity between two images. This value (represented as a distance) is computed by applying a distance formula between the images [25]. However, before this calculation, it is necessary to establish the keypoint correspondences.

The keypoint correspondences are obtained by computing the Euclidean distance between their feature vectors. This distance is then used in a brute-force matcher to obtain the keypoint matches. An example of the matching can be appreciated in Figure 2.14.

Figure 2.14: Matching representation of the descriptor: the red dots are the keypoints, the green lines are the matches between the two images [26].
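
A hedged sketch of the description and matching stage, assuming an OpenCV build with the xfeatures2d contrib module (which provides SURF); it is an illustration of the pipeline described above, not the exact code used in the thesis.

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

// Describe previously detected keypoints with SURF and match them between the
// left and right images with a brute-force matcher (Euclidean distance).
std::vector<cv::DMatch> describeAndMatch(const cv::Mat &left, const cv::Mat &right,
                                         std::vector<cv::KeyPoint> &kpLeft,
                                         std::vector<cv::KeyPoint> &kpRight) {
    cv::Ptr<cv::xfeatures2d::SURF> surf = cv::xfeatures2d::SURF::create();
    cv::Mat descLeft, descRight;
    surf->compute(left, kpLeft, descLeft);
    surf->compute(right, kpRight, descRight);

    cv::BFMatcher matcher(cv::NORM_L2);      // SURF descriptors are floating point
    std::vector<cv::DMatch> matches;
    matcher.match(descLeft, descRight, matches);
    return matches;
}
```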


2.4 Filters

Many physical and scientific problems require the estimation of the state of a system that changes over time. This can be viewed as an estimation process, since it has to deal with noisy measurements. Normally, the noise is modelled statistically, so the estimation is stochastic.

2.4.1 Bayesian Filter

The most general algorithm for state-space estimation is the Bayesian filter. It is able to estimate the state of a system from noisy measurements. In order to make estimations, it is necessary to make inferences about the system. These inferences are made with the system model (the motion model in tracking) and the measurement model.

On the one hand, the system model describes the evolution of the state of the system over time. In the tracking problem, it infers where the target is based on the motion data.

On the other hand, the measurement model describes the formation process by which sensor measurements are generated in the physical world.

The estimate, or belief, of the state is obtained from the posterior probability density function, which is constructed with all the available data. The first step is to describe the state-space model. In the particular case of tracking objects in images, the state of the system at time t, x_t, is a multi-dimensional vector with the 3D position and orientation of the target. The set of measurements at time t is labelled z_t. Thus the set of all measurements from t_1 to t_2 is:

z_{t_1:t_2} = z_{t_1}, z_{t_1+1}, z_{t_1+2}, \ldots, z_{t_2}

Then, the goal is to compute the posterior probability conditioned on all observations z_t using Bayes' theorem:

p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t, z_{1:t-1}) \cdot p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})} \quad (2.13)

The motion model is defined with the probability distribution p(x_t \mid x_{t-1}); as a result, Equation 2.13 becomes:

p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t, z_{1:t-1}) \cdot \int p(x_t \mid x_{t-1}, z_{1:t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}}{p(z_t \mid z_{1:t-1})} \quad (2.14)


Computing this density becomes intractable, since the measurements accumulate over time. Hence, it is necessary to apply the Markov assumptions:

• The states x_t are complete, in the sense that they are the best predictors of the future. Completeness entails that knowledge of past states and measurements carries no additional information that would help the filter predict the future more accurately.

• Past and future observations are independent if the current state x_t is known.

With the observation independence assumption, Equation 2.13 becomes:

p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t, z_{1:t-1}) \cdot p(x_t \mid z_{1:t-1})}{p(z_t)} \quad (2.15)

With the Markov assumption, it can be rewritten as:

p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t) \cdot p(x_t \mid z_{1:t-1})}{p(z_t)} \quad (2.16)

and finally Equation 2.14 becomes:

p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t) \cdot \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}}{p(z_t)} \quad (2.17)

The initialization of the filter, p(x_0), depends on the available information:

• If the exact position is known, particles will be placed there.

• If there is knowledge about the approximate position, the particles can be initialized with a uniform distribution around that position.

• If there is no knowledge, they can be initialized with a uniform distribution all over the state space.

In the recursive Bayesian filter, Equation 2.17 is evaluated in two steps. The first step is the prediction step, in which the prediction is computed from the previous state posterior, before incorporating the measurements at time t. The second step is the measurement update step, which corrects the prediction from the previous step.

The particle filter, in particular the Monte Carlo approximation, is one solution to the tracking problem based on the Bayesian filter. It is discussed in the next section.


Algorithm 1: Bayes Filter

1: Algorithm BayesFilter(bel(x_{t-1}), z_t):
2:   for all x_t do
3:     \overline{bel}(x_t) = p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, bel(x_{t-1})\, dx_{t-1}
4:     bel(x_t) = p(x_t \mid z_{1:t}) = \eta\, p(z_t \mid x_t)\, \overline{bel}(x_t)
5:   end for
6:   return bel(x_t)

2.4.2 Particle Filters

Particle filters are Sequential Monte Carlo (SMC) methods for the computation of posterior distributions based on the Bayes filter of Equation 2.17.

The Monte Carlo particle filter was originally designed for robot localization. However, that problem has many similarities with tracking objects in images. Hence, it is the approach used in this thesis to solve the problem. It has four steps.

Algorithm 2: MCL Algorithm

1: Algorithm MCL(X_{t-1}, z_t)
2:   \bar{X}_t = X_t = \emptyset
3:   for m = 1 to M do
4:     x_t^{[m]} = motion_model(x_{t-1}^{[m]})
5:     w_t^{[m]} = measurement_model(z_t, x_t^{[m]})
6:     \bar{X}_t = \bar{X}_t + \langle x_t^{[m]}, w_t^{[m]} \rangle
7:   end for
8:   for m = 1 to M do
9:     draw i with probability \propto w_t^{[i]}
10:    add x_t^{[i]} to X_t
11:  end for
12:  return X_t

In the first step the filter is initialized in one of the ways stated in the previous section. The choice of initialization depends on the information available at time t = 0.

In the second step the motion model is applied. Whenever the target moves, the filter moves the particles to predict where the target is after the movement. There are different ways to model this, such as the odometry model or the velocity model. In this thesis a different model is used, which is described in Section 3.3.2.

In the third step the measurement model assigns a weight to each of the particles based on the sensed information.

Lastly, the particles are resampled based on their importance weight. The importance of this step and a deeper explanation are given in the next section.

2.4.3 Low variance sampling

The resampling procedure keeps the particles with the best probabilities and discards the rest. The algorithm used is called low variance sampling, which selects the particles through a sequential stochastic process.

Algorithm 3: Low variance sampler

1:  \bar{X}_t = \emptyset
2:  r = rand(0; M^{-1})
3:  c = w_t^{[1]}
4:  i = 1
5:  for m = 1 to M do
6:    U = r + (m - 1) \cdot M^{-1}
7:    while U > c do
8:      i = i + 1
9:      c = c + w_t^{[i]}
10:   end while
11:   add x_t^{[i]} to \bar{X}_t
12: end for
13: return \bar{X}_t

Instead of choosing M particles independently at random, this algorithm draws a single random number and selects samples according to it, still with probability proportional to the particle weights. This is done by drawing a random number r in the interval [0; M^{-1}], where M is the number of samples to be drawn at time t. Afterwards, the particles are selected by repeatedly adding the value M^{-1} to r and selecting the particle that corresponds to the resulting number.

As the authors state in [39], the low-variance sampler has three advantages. First, it covers the space of samples in a more systematic fashion than an independent random sampler. Second, if all samples have the same importance factor, the generated sample set is equivalent to the one from the previous iteration. Lastly, the algorithm has a complexity of O(M), which is an important factor for the performance of a particle filter.
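
A minimal C++ sketch of the sampler in Algorithm 3, assuming the particle weights have already been normalized to sum to one:

```cpp
#include <random>
#include <vector>

// Low variance resampling: draws M particles with probability proportional to
// their (normalized) weights using a single random number r in [0, 1/M).
template <typename Particle>
std::vector<Particle> lowVarianceSample(const std::vector<Particle> &particles,
                                        const std::vector<double> &weights) {
    const std::size_t M = particles.size();
    std::vector<Particle> resampled;
    resampled.reserve(M);

    static std::mt19937 gen(std::random_device{}());
    std::uniform_real_distribution<double> dist(0.0, 1.0 / M);
    double r = dist(gen);

    double c = weights[0];
    std::size_t i = 0;
    for (std::size_t m = 0; m < M; ++m) {
        const double U = r + m * (1.0 / M);
        while (U > c) {
            ++i;
            c += weights[i];
        }
        resampled.push_back(particles[i]);
    }
    return resampled;
}
```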


Chapter 3

Methodology

Two different approaches were implemented in order to achieve the goals. Both have similarities in program structure, but they differ in the methods and algorithms used. The programs were developed under the Robot Operating System (ROS) using C++ and the OpenCV and Eigen libraries.

3.1 Image acquisition

As mentioned before, the Leap Motion runs over a USB port on the Linux platform. A service receives image data from the device; a DLL connects to the service and provides the data through a variety of languages (C++, Python, JavaScript, Objective-C, C#, Java).

The architecture shown in Figure 3.1 consists of: 1) the Leap Service, which receives data from the device; 2) the Leap Control Panel, which configures the device tracking settings, calibration and troubleshooting (it runs independently from the service and controls the device directly); 3) the Leap-ROS driver node, which accesses the distorted images (Figure 3.2) from the service; 4) the stereo_img_proc node, which rectifies and publishes the undistorted images and the camera parameters; and 5) the tracking node, which subscribes to the camera parameters and images and is where the image processing and the particle filter are performed.


Figure 3.1: Leap Motion architecture.

The stereo_img_proc nodelet encapsulates the left and right images as ROS Image messages with the following fields:

• Header: containing the timestamp and frame identification.

• Width and height of the image: 280 x 220.

• Encoding type: in this case mono8; each pixel is represented by one 8-bit unsigned value with a single channel, since the images are acquired as grayscale.

• A matrix containing the image data.

It also publishes another message containing important information about each of the cameras. This message contains the information from the calibration files, which is:

• Header: containing the timestamp and frame identification.

• Width and height of the image: 280 x 220.

• Distortion parameters: k1, k2, t1, t2 and k3.

• The 3x3 intrinsic camera matrix of the distorted images, defined as K.

• The 3x3 rectification matrix used to obtain parallel epipolar lines, named R.

• The 3x4 projection matrix for the projection of 3D points into the cameras.
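
A minimal sketch of how the tracking node could subscribe to one of the rectified image topics and convert it to an OpenCV matrix with cv_bridge; the topic and node names are assumptions for illustration, not necessarily those used in the thesis.

```cpp
#include <ros/ros.h>
#include <sensor_msgs/Image.h>
#include <cv_bridge/cv_bridge.h>
#include <opencv2/opencv.hpp>

// Callback converting an incoming mono8 image into a cv::Mat for processing.
void imageCallback(const sensor_msgs::ImageConstPtr &msg) {
    cv::Mat gray = cv_bridge::toCvShare(msg, "mono8")->image;
    // ... edge detection / feature extraction would go here ...
}

int main(int argc, char **argv) {
    ros::init(argc, argv, "tracking_node");
    ros::NodeHandle nh;
    // "/left/image_rect" is a placeholder topic name.
    ros::Subscriber sub = nh.subscribe("/left/image_rect", 1, imageCallback);
    ros::spin();
    return 0;
}
```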


Figure 3.2: Raw image from one of the cameras of the Leap Motion. A grid highlighting the significant, complex distortion is superimposed on the image [28].

3.2 Feature Based Particle Filter

The first proposal was a feature based particle filter. The general steps are: first, acquire the images from the Leap Motion controller; then process them with OpenCV functions; and finally estimate the state with a feature based particle filter.

3.2.1 Image processing

The 3D points of the object to track need to be extracted from the acquired images. Hence, the first step is to obtain these points: a FAST detector detects the points of the object, a SURF descriptor describes their neighbouring pixels, and a brute-force matcher extracts the points that look similar in both images and discards the falsely detected features. Afterwards, the matches are filtered in order to keep only high-quality feature matches, using the ratio test proposed in [22]. This test rejects poor matches by computing the ratio between the best and second-best match; if the ratio is above some threshold, the match is discarded as low-quality. An example of the feature extraction procedure can be seen in Figure 3.3, and a sketch of the filtering step is given after the figure.


Figure 3.3: Example of the feature extraction steps. From top to bottom: the features detected with FAST in the left and right images, the features matched using the SURF descriptors, and finally the good matches obtained after filtering.
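
A hedged sketch of the ratio-test filtering described above, using OpenCV's k-nearest-neighbour matching; the 0.8 threshold is a placeholder value, not necessarily the one used in the thesis.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Keep only distinctive matches: the best match must be clearly better than
// the second-best one (Lowe-style ratio test).
std::vector<cv::DMatch> ratioTestFilter(const cv::Mat &descLeft, const cv::Mat &descRight,
                                        double ratio = 0.8) {
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descLeft, descRight, knn, 2);   // two best candidates per descriptor

    std::vector<cv::DMatch> good;
    for (const auto &pair : knn) {
        if (pair.size() == 2 && pair[0].distance < ratio * pair[1].distance)
            good.push_back(pair[0]);                  // distinctive enough, keep it
    }
    return good;
}
```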


The process is summarized in the flowchart in Figure 3.4.

Figure 3.4: Flowchart of the image processing in the first approach: Feature Detection → Feature Description → Feature Matching → 3D Point Generation → Particle Filter.

The estimated 3D point locations can then be used to compute a fitness score based on the average point-to-model distance between a 3D object model and the points. This can in turn be used to weight the particles in the filter. As discussed in the next section, issues in the feature extraction component of the system prevented us from fully testing this approach.

3.2.2 Implementation Problems: Lack of visual features

The problem encountered was the lack of detected 3D points, which would have been the input data of the particle filter. They could not be extracted because the firmware of the device automatically adjusts the amount of IR light that the cameras capture; as a result, the image was sometimes overexposed, which led to a loss of information.

Depending on the distance, the mean number of features extracted was around 5. Afterwards, texture was added to the object in order to improve the feature extraction, and the number of features extracted per frame increased by 50%.


However, it was still not enough to run a particle filter. These problems are shown and discussed in Chapter 5.

3.3 Contour Based Particle Filter

The feature-based approach from the previous section depends on the presence of sufficient texture on the objects. In this section we discuss an alternative approach based on image contours, which should be less dependent on texture.

3.3.1 Image Processing

In this step, the information that the particle filter will use as input for the measurement model is extracted.

Since this is a contour based approach, it is necessary to extract the edges of the object from the scene, for which the Canny edge detector is used. The parameters of the filter change depending on the distance; for every distance they have to be tuned by trial and error so that the edges of the object are detected and other edges are discriminated. An example of the filter output with the following parameters can be appreciated in Figure 3.5:

• Minimum threshold for the hysteresis procedure: 100.

• Ratio = HighThreshold / LowThreshold = 2.

• Kernel size specifying the aperture of the Sobel() operator: 3.


In this application, for the distances of 5, 10 and 15 cm the thresholds used were 50, 100 and 130 respectively.

After that, the distance image is computed (Figure 3.6). The distance image contains, for each pixel, the distance between that pixel and the closest black pixel. In this case the black pixels are the edges of the object, so a low distance means that the pixel is close to the object.

Figure 3.6: Distance Image (left) of the edge image (right).

3.3.2 Particle Filter

Once the input data to the particle filter has been processed, the tracking estimation can be done. The particle filter has five steps (Figure 3.7).

As stated in Section 2.4.1, there are different ways to initialize the filter. In this particular case the initial position of the object is unknown, so M particles are uniformly distributed over a fixed region.

Then, the motion model makes a first prediction of where the object is after a motion. There are different motion models, such as the velocity model or the odometry model, but since there is no information about the object's velocity or how much it moved, they cannot be used. The implemented model assumes that in every frame the particle can undergo a certain motion in each of its 6 degrees of freedom. In order to model this behaviour, the motion model takes each of the M_i particles and, with a Gaussian centred at the M_i position and with a fixed covariance, generates j new particles M_{i,j}.

The third step is the measurement model, which weights each M_{i,j} particle.

Figure 3.7: Flowchart of the particle filter: Initialize M particles → Motion Model → Measurement Model → Resample N particles → Random M-N particles.

The measurement model generates a 3D model of the object to track at the M_{i,j} position, and this model is then projected onto both camera planes. The measurement model weights the particle based on the distance D from each of the projected points of the particle's 3D model to the edges of the object seen by the cameras. The probability density function is modelled as an inverse exponential function, since a lower distance means a higher probability of being at the same position as the object. The probability is computed as:

p(z_t \mid x_t)_{i,j,k} = \begin{cases} e^{-\lambda D_t} & \text{if } D_{min} \leq D_t \leq D_{max} \\ 0 & \text{otherwise} \end{cases}

The maximum possible value of D_{max} is 279 (the image limit); however, it was set to D_{max} = 5 with D_{min} = 0 in order to disregard particles that are too far away from the edges. The λ factor makes the slope of the function more or less pronounced.

For each particle there is a left and a right camera probability, from which the final probability of the particle is obtained.

Figure 3.8: Flowchart of the measurement model: Generate 3D Model of the particle → Project 3D Model (Projection Matrix, Homography) → Weight particle (Distance Image, PDF).

Here p_{Left:i,j}(z_t \mid x_t) and p_{Right:i,j}(z_t \mid x_t) are the means over all the 3D points of the projected particle model. For example, for the left camera:

p_{i,j:l/r}(z_t \mid x_t) = \frac{1}{n} \sum_{k=1}^{n} p_{i,j,k}(z_t \mid x_t) \quad (3.2)
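
A hedged C++ sketch of this weighting step for one camera, assuming the projected model points are already available in pixel coordinates and the distance image of Section 3.3.1 has been computed:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>
#include <cmath>

// Weight of one particle in one camera: average of exp(-lambda * D) over all
// projected model points, where D is read from the distance image. Points whose
// distance exceeds dMax contribute zero, as in the thresholded PDF above.
double particleWeight(const std::vector<cv::Point2f> &projectedModel,
                      const cv::Mat &distanceImage,   // CV_32F, from distanceTransform
                      double lambda, double dMax) {
    if (projectedModel.empty()) return 0.0;
    double sum = 0.0;
    for (const cv::Point2f &p : projectedModel) {
        const int u = cvRound(p.x), v = cvRound(p.y);
        if (u < 0 || v < 0 || u >= distanceImage.cols || v >= distanceImage.rows)
            continue;                                   // projected outside the image
        const double D = distanceImage.at<float>(v, u);
        if (D <= dMax)
            sum += std::exp(-lambda * D);
    }
    return sum / projectedModel.size();
}
```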

The fourth step keeps the M particles with the highest probabilities; the low variance sampling algorithm explained in Section 2.4.3 performs this.

3.3.3 3D Model Generation and Projection

In order to evaluate a particle, the measurement model needs the projection in the camera planes of a 3D model of the object with the particle position and orientation.

First and foremost, it is necessary to analyse the different frames that need to be considered. The world coordinate frame OXYZ is a right-handed coordinate system with X forward, Y left and Z up, fixed by ROS. The left O_L X_L Y_L Z_L and right O_R X_R Y_R Z_R camera frames are left-handed coordinate systems with X forward, Y down and Z right, with the origin of the left frame at the world origin. The right frame is horizontally displaced by a distance of 4 cm (the baseline between the cameras). The camera image plane frames are right-handed systems aligned with the X and Y world axes and fixed at the top left corner of the image. All these frames can be seen in Figure 3.9.


Figure 3.9: Representation of the different frames of the system: OXYZ is the world frame, O_L X_L Y_L Z_L and O_R X_R Y_R Z_R are the camera frames and c_{xy} is the frame of the image plane.

Once the positions of the frames are known, the first step is to construct the model. Since the targets are simple shapes (cube, triangle, rectangle and cylinder), they are represented by a point cloud of the edges of the primitive objects. The model frame is fixed in the middle of the bottom side, coinciding with the world frame OXYZ. The world and camera frames are not the same, so the points have to be transformed from world coordinates to the left and right camera frames. For this, all the points of the model are transformed to the new position with the rotation and translation matrices as follows:

X' = X \cdot [R_x \cdot R_z \cdot R_y] + T \quad (3.3)

In more detail:

\begin{pmatrix} x_R \\ y_R \\ z_R \end{pmatrix} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix} \begin{pmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}


where α = -90°, β = 0° and γ = 0°. Note that the camera frames are left-handed, so positive X positions become negative and vice versa.

The second step is to move the 3D model to the particle position and rotation [x, y, z, α, β, γ]. Again, the points are transformed to the new position with the rotation and translation matrices.

Now that the model is placed at the particle position, the last step is to project it onto the left and right camera image planes with the projection matrix (Equation 2.9) of each camera, using Equation 2.3:

\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_x & T_x \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}

The coordinate positions in the image plane (u, v) are obtained by dividing by the homogeneous coordinate:

u = \frac{u}{w} \qquad v = \frac{v}{w}
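
A minimal Eigen sketch of these last two steps, transforming the model points to a particle pose and projecting them with a 3x4 projection matrix; the rotation is composed in the R_x · R_z · R_y order of Equation 3.3, and all pose values are assumed placeholders.

```cpp
#include <Eigen/Dense>

// Transform the edge-model points of one particle to its pose and project them
// with a 3x4 projection matrix P = [K | T].
Eigen::Matrix2Xd projectModel(const Eigen::Matrix3Xd &modelPoints,
                              const Eigen::Matrix<double, 3, 4> &P,
                              const Eigen::Vector3d &t,
                              double alpha, double beta, double gamma) {
    const Eigen::Matrix3d R =
        (Eigen::AngleAxisd(alpha, Eigen::Vector3d::UnitX()) *
         Eigen::AngleAxisd(gamma, Eigen::Vector3d::UnitZ()) *
         Eigen::AngleAxisd(beta,  Eigen::Vector3d::UnitY())).toRotationMatrix();

    Eigen::Matrix2Xd projected(2, modelPoints.cols());
    for (int i = 0; i < modelPoints.cols(); ++i) {
        const Eigen::Vector3d Xp = R * modelPoints.col(i) + t;  // point at particle pose
        const Eigen::Vector3d m  = P * Xp.homogeneous();        // homogeneous image point (u, v, w)
        projected.col(i) = m.hnormalized();                     // divide by w
    }
    return projected;
}
```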


Chapter 4

Experimental Setup

This chapter describes the setup of the experiments. All of them were carried out in an indoor environment with natural illumination. The sections explain in detail the hardware and software used and the assumptions made about them.

4.1 Hardware

The Leap Motion controller is a small USB device designed for human-machine interaction. With it, a user can perform tasks like navigating websites, augmented reality in video games, high-precision drawing or 3D object manipulation. It acquires 2D images with its two cameras, and different filters and mathematical algorithms are then used to build the models and interpret the interaction with the computer. However, this mathematical machinery is hidden by the company, so the tracking procedures cannot be analysed. Guna [14] states that it cannot be used for a professional tracking system due to its limited sensory space and inconsistent sampling frequency. Nevertheless, the device can be used as a stereo vision system, since it is able to acquire images from its two CCD cameras. It also has three infrared LEDs that emit IR light with a wavelength of 850 nanometres. In Figure 4.1 a schematic view of the device can be appreciated [41].

The company states that the device has an interaction space of eight cubic feet that takes the shape of an inverted pyramid, as can be seen in Figure 4.2 [29].


Figure 4.1: Leap Motion visualization [41], (a) real and (b) schematic

Figure 4.2: Leap Motion interaction area [29]: 2 feet above the controller, 2 feet wide on each side (150° angle) and 2 feet deep on each side (120° angle).

The device does not have a fixed frame rate; it is unstable and changes from one measurement to another. This is analysed in [14], where the minimum logging period between two samples was 14 ms (71.43 Hz) and the mean frequency was 39 Hz with a standard deviation of 12.8 Hz. The cameras also have a feature that allows them to auto-adjust the quantity of IR light they capture. The acquired images are grayscale due to the properties of the infrared light.

The data from the Leap Motion is transferred to the host PC through a Universal Serial Bus (USB) cable, which transfers data at a maximum of 480 Mb/s. A Dell Inspiron 14 laptop with a 4x2 GHz Intel i7-4510U and 16 GB of RAM is used as the host PC.


4.2 Software

All the experiments were carried out with the Robot Operating System (ROS) on a system running Ubuntu 14.04. The OpenCV libraries are used for image processing and the Eigen library for matrix and transformation operations.

4.3 Targets

The targets to be tracked are basic geometric shapes, in particular the following ones:

• Cube with side 3 cm (Figure 4.3a).

• Cube with texture and side 3 cm (Figure 4.3a).

• Cylinder with texture, radius 1.5 cm and height 6.5 cm (Figure 4.4b).

• Equilateral triangle with side 4 cm (Figure 4.4a).

• Rectangle with width 5 cm and height 11 cm (Figure 4.3b).

Figure 4.3: Cubes and rectangle objects to track used in the experiments: (a) textured and untextured cubes, (b) rectangle.


Figure 4.4: Triangle and cylinder to track used in the experiments: (a) triangle, (b) textured cylinder.

4.4 Experimental Scenario

The experiment recordings were carried out in lab T002 of Örebro University. For a realistic experiment, a robotic arm holding the object with a gripper should have been used; however, this was not possible, so a less accurate test was performed.

Two rulers of 40 cm with millimetre precision were placed firmly parallel on a static platform, so that the depth position could be measured. Perpendicular to them, another mobile ruler was placed to measure the horizontal displacement. The camera was fixed in the middle of the two rulers and the object was fixed to the mobile ruler. An overview can be seen in Figure 4.5.


The mobile ruler was moved by hand to obtain motions along the X and Z axes, which were recorded as rosbag files. The recorded experiments are summarized in Table 4.1:

Model    x (cm)    y (cm)    z (cm)    △X (cm)    △Y (cm)    △Z (cm)
         0         5         0         ±5          0           0
         0         10        0         ±5          0           0
         0         15        0         ±5          0           0
         0         5         0         0           0           ±5

Table 4.1: Summary of the experiments performed. x, y, z are the coordinates of the initial position of the target; △X, △Y and △Z are the distances moved from the initial position.


Chapter 5

Experimental Results

In this section we describe and discuss the experimental results obtained, as well as the findings and qualitative observations gathered during the evaluation.

5.1 Feature Based Test

Before the particle filter implementation, different tests were done in order to evaluate the number of features and their quality. The objects selected for the experiments were the cube and the textured cube. Each object performs a motion along the X axis at three different depth distances. The evaluated parameters are the number of features detected by the SURF and FAST detectors and the number of good features that survive the matching and filtering step. These good features would have been the input of the particle filter.
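To make this pipeline concrete, the following is a minimal sketch of the detection, description and matching step, assuming the OpenCV 2.4 API (where SURF lives in the nonfree module). The FAST threshold and the Lowe-style ratio used to keep only good matches are illustrative values, not necessarily those of the actual implementation.

```cpp
// Sketch of the feature test: detect keypoints with FAST, describe them with
// SURF, match the left/right descriptors and keep only the "good" matches
// that pass a ratio filter. Thresholds are illustrative.
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>   // SURF (OpenCV 2.4 nonfree module)
#include <vector>

std::vector<cv::DMatch> goodMatches(const cv::Mat& left, const cv::Mat& right)
{
    // Keypoint detection (a SURF detector could be used analogously).
    std::vector<cv::KeyPoint> kpLeft, kpRight;
    cv::FastFeatureDetector fast(20);        // intensity threshold
    fast.detect(left, kpLeft);
    fast.detect(right, kpRight);

    // SURF descriptors for the detected keypoints.
    cv::SurfDescriptorExtractor surf;
    cv::Mat descLeft, descRight;
    surf.compute(left, kpLeft, descLeft);
    surf.compute(right, kpRight, descRight);

    // Brute-force matching with the two nearest neighbours per descriptor.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch> > knn;
    matcher.knnMatch(descLeft, descRight, knn, 2);

    // A match is kept only if it is clearly better than the second best.
    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < knn.size(); ++i)
        if (knn[i].size() == 2 && knn[i][0].distance < 0.7f * knn[i][1].distance)
            good.push_back(knn[i][0]);
    return good;
}
```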

In Table 5.1 the means over the tests are summarized:

Test                        FAST Detections    SURF Detections    FAST GM    SURF GM
Cube 5 cm                   17.75              13.38              6.22       5.42
Cube 10 cm                  8.91               7.74               4.08       5.26
Cube 15 cm                  7.33               3.97               5.64       3.15
Cube with texture 5 cm      73.44              29.78              14.94      10.67
Cube with texture 10 cm     33.47              7.59               12.41      5.49
Cube with texture 15 cm     18.10              6.05               8.56       5.22

Table 5.1: Means of the features detected and of the good features at different distances and for different objects with the FAST and SURF detectors. The GM columns denote the good matches remaining after matching and filtering.


The results can be visualized in the following plots:

Figure 5.1: Test 1: Number of features detected over time with the FAST and SURF detectors at a depth distance of 5 cm.

Figure 5.2: Test 1: Number of good features over time with the FAST and SURF detectors at a depth distance of 5 cm.


Figure 5.3: Test 2: Number of features detected over time with the FAST and SURF detectors at a depth distance of 10 cm.

Figure 5.4: Test 2: Number of good features over time with the FAST and SURF detectors at a depth distance of 10 cm.


Figure 5.5: Test 3: Number of features detected over time with the FAST and SURF detectors at a depth distance of 15 cm.

Figure 5.6: Test 3: Number of good features over time with the FAST and SURF detectors at a depth distance of 15 cm.


In the beginning the shapes did not have any texture, and the mean number of good features extracted over time was around 5 for both algorithms. After that, paint was added to create texture on the objects. The number of extracted features increased by around 50%, but it was still not sufficient to perform an estimation with a particle filter.

The FAST algorithm detects roughly three times as many features as SURF, but after matching and filtering the number of good features is quite similar; thus the SURF detections are more robust.

5.2 Contour Based Test

After the full implementation of the approach, a tracking test was performed with the textured cube centred between the two cameras at a depth distance of 5 cm. The shape performs a slow horizontal motion of ±5 cm.

The graphical results obtained were good: all the particles were able to track the object along its movement, as can be seen in Figure 5.7. However, the numerical results were completely wrong, as can be seen in Table 5.2.

         X (m)     Y (m)    Z (m)    α        β        γ
S0       0         0.05     0        0        0        0
SF1′     0.05      0.05     0        0        0        0
SF1      0.083     1.46     1.14     4.093    -0.56    -0.44
SF2′     -0.05     0.05     0        0        0        0
SF2      -0.123    1.67     0.89     7.764    4.912    -0.71

Table 5.2: Results of the particle filter. S0 is the initial position, SF1′ and SF2′ are the expected positions, and SF1 and SF2 are the estimations.

This poor estimation is due to an inadequate weight evaluation function and to projection characteristics that confuse the particle filter. Figure 5.8 shows how the projection of the best particle initially matches the model, and how in the following frames the projection tends towards infinity, making the filter diverge.


Figure 5.7: Evolution of the particles in the particle filter. From top to bottom, four frames are shown: the first shows the initial projection of the particles around the cube, and the following three, taken at different execution times, show how the particles track the cube.


Figure 5.8: Evolution of the projection in the particle filter. From top to bottom, three frames are shown: the first shows the initial projection of the best particle around the cube, and the second and third show how the projection goes to infinity, making the filter diverge.

The following plots show the histograms of a set of 100,000 particles in the initial frame and in the final frame.


Figure 5.9: Histogram of the weights of the particles in the initial position.

Figure 5.10: Histogram of the weights of the particles in the final position.

In a sense, the particle filter works, since the high weights in the final frame are more concentrated than in the initial frame, where they are spread out over the space. The problem arises when a particle is too far away, so that its projection condenses the model points into a few pixels. If these condensed points fall on an edge, the probability of such far particles becomes higher than that of particles close to the model, which actually have a better estimation of the state. As a result, in the following steps of the filter all the particles end up far away and the particle filter gets confused, leading to a wrong state estimation.
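To make this failure mode concrete, the following is a minimal sketch of a distance-field based weight evaluation of the kind the contour approach relies on. It is an assumed, illustrative form rather than the exact function used in the implementation, and the Canny thresholds are placeholders.

```cpp
// Illustrative contour-based particle weighting (assumed form, see text):
// edges come from a Canny detector, a distance transform gives the distance
// to the nearest edge for every pixel, and a particle is scored by the mean
// distance at the pixels where its projected model contour lands.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>
#include <cmath>

// Built once per frame.
cv::Mat edgeDistanceField(const cv::Mat& grey)
{
    cv::Mat edges, dist;
    cv::Canny(grey, edges, 50, 150);                      // placeholder thresholds
    cv::distanceTransform(255 - edges, dist, CV_DIST_L2, 3);
    return dist;                                          // CV_32F, pixels to nearest edge
}

// Evaluated once per particle, given its projected model contour points.
double particleWeight(const cv::Mat& distField,
                      const std::vector<cv::Point2f>& projected)
{
    double sum = 0.0;
    int inside = 0;
    for (size_t i = 0; i < projected.size(); ++i) {
        cv::Point p(cvRound(projected[i].x), cvRound(projected[i].y));
        if (p.x < 0 || p.y < 0 || p.x >= distField.cols || p.y >= distField.rows)
            continue;                                     // ignore points outside the image
        sum += distField.at<float>(p);
        ++inside;
    }
    if (inside == 0)
        return 0.0;                                       // nothing projects into the image
    // A small mean distance-to-edge gives a high weight. This is exactly where a
    // far-away particle cheats: its projection collapses onto a few pixels, and
    // if those pixels lie on an edge the weight becomes spuriously high.
    return std::exp(-sum / inside);
}
```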


Chapter 6

Conclusions and Future work

6.1 Conclusions

In the current work we have presented the design of two implementations for tracking basic shapes with the Leap Motion device used as a stereo vision system. First, we introduced the basic concepts of computer vision and particle filters for tracking objects with stereo vision systems. Then a first, feature-based approach was proposed and implemented to solve the problem. After testing the number and quality of the features that can be extracted from the scene, the conclusion was that there was not enough information for the state estimation algorithm to track the object. As a consequence, a contour-based approach was implemented. At first it seemed really promising, since it does not depend on extracting features from the object. However, the test results were not as expected: due to the projection particularities and an inaccurate particle weight evaluation function, the particle filter got confused.

The Leap Motion's technology is very promising in the sense that it has a huge range of applications in hand tracking and gesture recognition for human-machine interaction. However, the manufacturer's tracking algorithms take advantage of hardware capabilities such as controlling the acquisition frequency, the intensity of the IR LEDs and the amount of IR light that the cameras capture. Since these parameters are not accessible to developers, it becomes hard to replicate this behaviour in an efficient way.


6.2 Future work

Based on the results obtained, future work includes several improvements to the current contour-based approach. First of all, a weight evaluation function is needed that disregards occluded points and penalizes far projections, which concentrate the projected points in a small image region and lead to wrongly high probabilities.
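One possible direction, sketched below under the same assumptions as the weight sketch in Section 5.2, is to scale each particle's weight by a penalty that shrinks when the projected contour collapses into a small image region. The bounding-box criterion and the minimum-area parameter are illustrative choices, not results from this thesis.

```cpp
// Hypothetical spread penalty for the contour weight sketched in Section 5.2:
// particles whose projected model collapses into a tiny image area are
// down-weighted, so distant particles can no longer win just by landing on an
// edge. The minimum expected area is a tuning parameter, not a thesis value.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <algorithm>
#include <vector>

double spreadPenalty(const std::vector<cv::Point2f>& projected,
                     double minExpectedArea = 400.0)      // pixels^2, illustrative
{
    if (projected.size() < 2)
        return 0.0;
    cv::Rect box = cv::boundingRect(projected);           // extent of the projection
    double area = static_cast<double>(box.width) * box.height;
    // Full weight once the projection covers a reasonable area,
    // proportionally less below that.
    return std::min(1.0, area / minExpectedArea);
}

// Usage: weight = particleWeight(distField, projected) * spreadPenalty(projected);
```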

An optimization of the particle filter and of the way the projections are computed could also be done, in order to obtain a real-time application able to track the objects held by a gripper mounted on a robot.


References

[1] A. Ahmadi, M. R. Daliri, Ali Nodehi, and Amin Qorbani. Objects recognition using the histogram based on descriptors of SIFT and SURF. Journal of Basic and Applied Scientific Research, 2(9):8612–8616, 2012. (Cited on page 18.)

[2] Pablo Barrera, José M. Cañas, Vicente Matellán, and Francisco Martín. Multicamera 3D tracking using particle filter. In Int. Conf. on Multimedia, Image Processing and Computer Vision, 30 March – 1 April 2005. (Cited on page 4.)

[3] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Computer Vision – ECCV 2006, pages 404–417. Springer, 2006. (Cited on pages 17 and 18.)

[4] Gunilla Borgefors. Distance transformations in digital images. Computer Vision, Graphics, and Image Processing, 34(3):344–371, 1986. (Cited on page 16.)

[5] Matthieu Bray, Esther Koller-Meier, and Luc Van Gool. Smart particle filtering for 3D hand tracking. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pages 675–680. IEEE, 2004. (Cited on page 1.)

[6] John Canny. A computational approach to edge detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):679–698, 1986. (Cited on page 14.)

[7] ROS Community. CameraInfo documentation. http://docs.ros.org/indigo/api/sensor_msgs/html/msg/CameraInfo.html. (Cited on page 12.)

[8] Leap Motion Company. Leap Motion device. https://www.leapmotion.com/. (Cited on page 1.)
