Autonomous Morphometrics using Depth Cameras for Object Classification and Identification


Department of Electrical Engineering

Master's Thesis

Autonomous Morphometrics using Depth Cameras for Object Classification and Identification

Master's thesis carried out in Computer Vision at the Institute of Technology at Linköping University

by Felix Björkeson

LiTH-ISY-EX--13/4680--SE

Linköping 2013

Department of Electrical Engineering
Linköpings tekniska högskola
Linköpings universitet

Autonomous Morphometrics using Depth Cameras for Object Classification and Identification

Master's thesis carried out in Computer Vision at the Institute of Technology at Linköping University

by

Felix Björkeson

LiTH-ISY-EX--13/4680--SE

Supervisors: M.Sc. Kristoffer Öfjäll, ISY, Linköpings universitet
             Dr. Daniel Ljunggren, Optronic

Examiner: Dr. Lars-Inge Alfredsson, ISY, Linköpings universitet

Computer Vision Laboratory, Department of Electrical Engineering, SE-581 83 Linköping, 2013-06-10

Language: English
Report category: Examensarbete (Master's thesis)
URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-95240
ISBN: —
ISRN: LiTH-ISY-EX--13/4680--SE
Title of series, numbering: —
ISSN: —

Title: Autonom Morphometri med Djupkameror för Objektklassificering och Identifiering
(Autonomous Morphometrics using Depth Cameras for Object Classification and Identification)
Author: Felix Björkeson

Keywords: Depth Cameras, Classification, Morphometrics, Homography, B-Spline, Eigenfaces, Fisherfaces, Local Binary Pattern Histograms, Neural Network

Abstract

Identification of individuals has been solved with many different solutions around the world, either using biometric data or external means of verification such as ID cards or RFID tags. The advantage of using biometric measurements is that they are directly tied to the individual and are usually unalterable. Acquiring dependable measurements is however challenging when the individuals are uncooperative. A dependable system should be able to deal with this and produce reliable identifications.

The system proposed in this thesis can autonomously classify uncooperative specimens from depth data. The data is acquired from a depth camera mounted in an uncontrolled environment, where it was allowed to continuously record for two weeks. This requires stable data extraction and normalization algorithms to produce good representations of the specimens. Robust descriptors can therefore be extracted from each sample of a specimen, and together with different classification algorithms the system can be trained or validated. Even with as many as 138 different classes the system achieves high recognition rates. Inspired by the research field of face recognition, the best classification algorithm, the method of fisherfaces, was able to accurately recognize 99.6% of the validation samples, followed by two variations of the method of eigenfaces, achieving recognition rates of 98.8% and 97.9%. These results affirm that the capabilities of the system are adequate for a commercial implementation.


This master's thesis was the final work towards achieving a Master of Science degree. I was given the opportunity to complete my thesis at the company Optronic. Throughout the first five months of 2013, I set out to complete this goal. Due to confidentiality, certain phrases are not mentioned in this thesis report, to avert recognition of sensitive words by search engines. You as a reader will, however, easily discern what the report is actually about. I therefore ask that you are not annoyed by the use of vague words, even though the included figures clearly reveal the true nature of the specimens mentioned throughout the report.

Stockholm, June 2013 Felix Björkeson


Abbreviations

Abbreviation  Meaning
1-D           One dimensional
2-D           Two dimensional
3-D           Three dimensional
PCL           Point Cloud Library
SVD           Singular value decomposition
PCA           Principal component analysis
LDA           Linear discriminant analysis
TOF           Time of flight

Clarifications

Word        Meaning
Specimen    A particular individual in a set of individuals
Sample      Set of data points representing an instance of an object
Descriptor  Smaller set of data extracted from a sample, describing the specimen
Class       Defines the set of objects that belong to the same specimen
Object      An instance of a class, i.e. a sample of an individual


Contents

Notation

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Structure of Thesis

2 Theory
  2.1 3-D Imaging
    2.1.1 Time of Flight Imaging
    2.1.2 Structured Light Imaging
    2.1.3 Time Of Flight Imaging Versus Structured Light Imaging
    2.1.4 Pinhole Camera Model
    2.1.5 Occlusion
  2.2 Object Classification
  2.3 Morphometrics
  2.4 Morphological Landmarks
  2.5 Face Recognition

3 Algorithms for Image Processing
  3.1 Flood Fill
  3.2 Principal Component Analysis
  3.3 Homography
  3.4 Hartley Normalization
  3.5 3-D Plane Fitting
  3.6 Face Recognition Methods
    3.6.1 Eigenfaces
    3.6.2 Fisherfaces
    3.6.3 Local Binary Pattern Histograms
  3.7 Neural Network
  3.8 B-splines

4 Implementation
  4.1 Overview
  4.2 Data Acquisition
    4.2.1 Point Cloud
    4.2.2 A Good Camera Angle
    4.2.3 Choosing a Frame of Interest
  4.3 Data Extraction
    4.3.1 Manipulating the Viewing Angle
    4.3.2 Axis Threshold Filter
    4.3.3 Foreground Extraction with Flood Fill
  4.4 Reducing Dimensionality
    4.4.1 Create a Height Map
    4.4.2 Rectification Using Homography
  4.5 Data Processing
    4.5.1 Ridge Detection
    4.5.2 Ridge Suppression
    4.5.3 Protrusion Detection
    4.5.4 Surface Descriptor Extraction
  4.6 Descriptor Analysis
    4.6.1 Eigenfaces
    4.6.2 Fisherfaces
    4.6.3 Local Binary Pattern Histograms
    4.6.4 Neural Network

5 Results
  5.1 Evaluation
  5.2 Descriptor Extraction
  5.3 Training Samples Per Class
  5.4 Number of Different Classes
  5.5 Performance Decline Over Time

6 Discussion
  6.1 Speed and Performance
  6.2 Outliers
  6.3 Prediction Uncertainty
  6.4 Neural Network Parameters
  6.5 Descriptor Size
  6.6 Improvements
    6.6.1 Different Descriptor
    6.6.2 Harder Outlier Control
    6.6.3 Color Data
    6.6.4 Reiteration
    6.6.5 Frequency Analysis

7 Conclusions

1 Introduction

1.1 Background

The need for autonomously working systems within surveillance, robotics, machine control, machine vision, etc. is increasing. Here, state-of-the-art solutions in optical sensor technology and image processing can help to overcome problems associated with identification and classification of objects. Image sensors have traditionally been outputting two-dimensional (2-D), or even one-dimensional (1-D), data, but now sensors capable of outputting three-dimensional (3-D) depth data are increasingly available. Future applications are believed to benefit strongly from utilizing this additional dimension, in similarity with the superior stereoscopic method that human vision and the brain use for navigation and identification. The human abilities are recurrently used as the benchmark reference, a limit which we initially hope to equal and ultimately surpass.

Optronic has a long history in optical metrology and in the development and manufacturing of depth cameras. Through its sister company Fotonic, it is interested in further exploring applications which would benefit from using 3-D technology. One application is identification (and classification, or recognition) of general objects using features available in image data such as surface structure, shape, pose, color and reflectivity. As one specific example, we are interested in the possibilities of identifying living objects, such as people and animals, from their shape and pattern. Such objects are usually structurally similar to other objects of the same group (species or part of body), and subtle textural or curvature features are often the only thing differentiating them. Morphometrics, a scientific area defined as the study of the shape and form of organisms, encapsulates the core of this thesis.

1.2 Problem

The goal was to be able to distinguish a set of specimens from each other based on data acquired from a depth camera.

To complete this task, range data captured in an uncontrolled environment, where specimens pass freely beneath the camera at their convenience, is available. The camera is positioned at a narrow gate which all specimens go through several times a day. Since the specimens choose themselves when to advance through this location, combined with the general layout around this gate, disorderly behavior is common. This increases the risk of unfavorable data being acquired. Another problematic characteristic of the data is that cross-specimen variations are small, meaning that the shape and form of one individual is very similar to that of another. Nor is the internal variation necessarily low, due to the fact that the specimens are living beings that move and deform. What further complicates the task is that the time span of the captured data is wide, ranging over two weeks, making it possible for the specimens to have noticeably changed during this time. The solution used today to identify the different subjects is RFID tags attached to each specimen. As previously mentioned, the specimens tend to behave quite uncivilized, which consequently results in the possibility of losing an RFID tag or two. Specimens walking around with unknown identity will cause a lot of problems. It is therefore interesting to explore the possibility of having a completely unintrusive system that doesn't rely on mechanical systems that might break, such as RFID tags.

1.3 Structure of Thesis

Chapter 2 begins by explaining some fundamental theory, starting with 3-D imaging. The following sections are closely related to each other and set out to explain the principles behind classification. The first section touches the subject on a broad scale, while the rest more or less explain special cases of classification. Chapter 3 continues with explaining theory, but is more particularly aimed at explaining well-known algorithms. These algorithms are some of the cornerstones of the system, and how they work together to form the final solution is explained in chapter 4. The system is then evaluated in chapter 5, followed by chapter 6, where some notable aspects are discussed along with some proposed improvements. At the end of the report, conclusions are presented in chapter 7.

2 Theory

In this chapter some fundamental theory is mentioned or explained, beginning with the principles of three-dimensional depth imaging and then continuing with different aspects of classification.

2.1 3-D Imaging

The basic goal in the field of three-dimensional depth imaging is to produce accurate depth estimations of the surroundings. There are a number of techniques to achieve this, and below two of the most common will be briefly described. These two technologies are based on temporal and spatial information, respectively. The first technology is called Time of Flight (TOF) and estimates the time of flight of light rays. Some common commercially available models are the MESA Imaging SR4000™ and the Fotonic E-SERIES. Next, cameras utilizing structured light as a technology to create 3-D data will be explained. In this thesis, data from the latter type of camera is what has been used during development. The most referred-to 3-D camera that uses this kind of technology is the Microsoft Kinect™ [1], developed by PrimeSense. In this thesis however, a Fotonic P70 [2] was used. The book Time-of-Flight Cameras and Microsoft Kinect™ by C. Dal Mutto et al. [3] reviews the technology and applications of both of these camera types very well. Below, the fundamentals are briefly summarized.

2.1.1 Time of Flight Imaging

The idea and purpose of a 3-D camera is to achieve accurate depth values for each pixel in a matrix sensor, resulting in an image containing the distance to the objects in front of the camera. The basic principle of a TOF camera is to acquire temporal information measuring the time t it takes for light to move back and forth between the camera and the environment. Since the speed of light c is constant in the same medium, the distance d from the camera to the object is given by

d = \frac{c \, t}{2}. \qquad (2.1)

There are several techniques to measure this time but the one used in most TOF cameras is to measure the phase difference ∆ϕ of the radiated light wave compared to the reflected wave, see figure 2.1. This works by modulating the outgoing light at around 40 MHz. Using simple wave theory the time can be calculated as

t = \frac{\Delta\varphi}{2 \pi f_{mod}}, \qquad (2.2)

where f_mod is the frequency of the light modulation. The radiated light is usually generated from a constantly modulated light source emitting light near the infrared spectrum, making it invisible to the human eye. Current cameras use common LEDs to emit this light at a wavelength of about 850 nm.

Figure 2.1: Phase difference of a signal (in blue) and its reflection (in red).

The phase difference is computed using

\Delta\varphi = \arctan\frac{Q_3 - Q_4}{Q_1 - Q_2}, \qquad (2.3)

where Q_i represents the amount of electrical charge received from different time intervals at π/2 radian phase delays from each other. This electrical charge is the result of a matrix of CCD/CMOS lock-in pixels [4] that convert the light energy into electricity. These lock-in pixels, structured in a matrix, form the actual sensor chip, where every pixel can sample the light independently of the others. The final distance to the camera can then be calculated by combining equations 2.1 and 2.2, resulting in the distance

d = \frac{c}{2} \cdot \frac{\Delta\varphi}{2 \pi f_{mod}}. \qquad (2.4)

Since light is periodic, there is an upper limit to the time that can be measured before the next period arrives. This causes errors, limiting the measurable range through discontinuities in the range data, and is called phase wrapping. The consequence of this is an interval within which the depth value can be estimated correctly. There are several techniques to deal with this problem, such as phase unwrapping, but none of these will be dealt with in detail here. Readers are encouraged to read the report by M. Hansard et al. [5].
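To make equations 2.3 and 2.4 concrete, the following is a minimal illustrative sketch (not part of the thesis): the distance of a single pixel computed from four charge samples. The 40 MHz modulation frequency comes from the text above, while the charge values and function names are made up.

```cpp
#include <cmath>
#include <cstdio>

// Illustrative sketch of equations 2.3 and 2.4: distance from the four charge
// samples of a TOF lock-in pixel. atan2 is used instead of a plain arctan to
// resolve the quadrant of the phase.
double tofDistance(double q1, double q2, double q3, double q4, double fMod) {
    const double c = 299792458.0;                 // speed of light [m/s]
    const double pi = 3.14159265358979323846;
    double phase = std::atan2(q3 - q4, q1 - q2);  // equation 2.3
    if (phase < 0.0) phase += 2.0 * pi;           // keep the phase in [0, 2*pi)
    return (c / 2.0) * phase / (2.0 * pi * fMod); // equation 2.4
}

int main() {
    const double fMod = 40e6;  // ~40 MHz modulation, as mentioned in the text
    // The charge values below are made up purely for illustration.
    std::printf("distance: %.3f m\n", tofDistance(0.2, 0.8, 0.9, 0.3, fMod));
    // The unambiguous range before phase wrapping occurs:
    std::printf("max range: %.3f m\n", 299792458.0 / (2.0 * fMod));
    return 0;
}
```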

2.1.2 Structured Light Imaging

The second type of 3-D camera uses structured light to create 3-D data from spatial information by triangulation. The basic principle behind this kind of camera is that a projector projects a pattern upon the environment that is detectable by a camera. Due to variation in the environment, the pattern will be distorted, and this distortion is the key to proper depth estimation. The camera can locate points in the pattern, and using a known baseline between the projector and the camera, a three-dimensional point may be triangulated. The pattern that is projected must have certain properties enabling the camera to unambiguously locate positions in the pattern, so that correct correspondences can be appointed. Figure 2.2 shows the basic setup of a projector and a camera. The Fotonic P70 uses a point-like pattern modulated three times in both axes, see figure 2.3 for a visualization of the projection pattern.

To triangulate a 3-D point with a calibrated and rectified stereo setup, consider the corresponding points p_R = (u_R, v_R) and p_L = (u_L, v_L), where p_R is the coordinates of a 3-D point P = (x, y, z) reflected and projected into the camera, and p_L is the equivalent reference coordinates of a light ray emitted from the projector. Since the setup should be a rectified stereo setup, v_R = v_L and u_R = u_L − d should hold true, where d is the difference in horizontal coordinates, called the disparity. The disparity is then inversely proportional to the depth value z through

z = \frac{b f}{d}, \qquad (2.5)

where b is the baseline between the camera and projector and f the focal length. The x and y coordinates can then be found with the aid of the camera's intrinsic camera matrix, see section 2.1.4 for more information.
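As a small illustration of equation 2.5 (not part of the thesis), the sketch below converts a few disparities to depth. The baseline and focal length values are assumed, with the 7.5 cm baseline borrowed from the Kinect example mentioned later in the text.

```cpp
#include <cstdio>

// Illustrative sketch of equation 2.5: depth from disparity for a rectified
// projector/camera pair.
double depthFromDisparity(double disparityPx, double baselineM, double focalPx) {
    return baselineM * focalPx / disparityPx;  // z = b*f / d
}

int main() {
    const double b = 0.075;  // baseline in meters (assumed, Kinect-like)
    const double f = 580.0;  // focal length in pixels (assumed)
    for (double d = 10.0; d <= 40.0; d += 10.0)
        std::printf("disparity %4.1f px -> depth %.2f m\n", d, depthFromDisparity(d, b, f));
    return 0;
}
```

The inverse relation between disparity and depth is also what causes the distance-dependent accuracy discussed in the next section.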

2.1.3 Time Of Flight Imaging Versus Structured Light Imaging

The question is then what technology is best? There is no straight answer since both technologies have advantages and disadvantages. They are appropriate for different situations.

Figure 2.2: The setup of a structured light camera, where A is the projector and B the camera. The 3-D point coordinates x, y and z are projected into the image coordinates u and v. Since the pattern has a special coding, the projector knows the projection coordinates of the corresponding image coordinates.

Figure 2.3: The structured light emitted from the Microsoft Kinect™ projector.

Resolution and Speed

Since structured light imaging uses ordinary CMOS sensor chips found in regular cameras, the maximal possible resolution correlates with the resolution of the chip. On modern chips this resolution is very high, which makes very high resolution depth images possible. The high resolution does however bring a side effect. Since the amount of calculations needed to compute the depth values is proportional to the resolution of the frame, structured light technology tends to be slow. The bottleneck is therefore the processing power of the camera, and resolution will have to be balanced against frame rate in a way that is not needed with a TOF camera. While TOF cameras employ relatively immature technology, it is possible to capture frames at a very high rate due to the structure of the chip, which enables independent parallel measurements of every pixel. The resolution is however significantly lower due to this structure. Since structured light imaging sensors cannot read every pixel simultaneously, the frame rate is slowed down further. Pixels are read sequentially row-wise, called a rolling shutter, which also produces another set of problems such as motion distortion.

A TOF camera can be constructed very compactly. Ideally every pixel should have its own light emitter positioned as close to it as possible. This is however not feasible, since it would demand a very large and sparse sensor and emitter chip. One way to simulate the effect of having a single emitter in the center of the sensor chip is to position several emitters evenly distributed around the chip. Because of this geometry the TOF camera can be made much more compact than a structured light based camera, which needs a baseline between the projector and sensor. The range of the structured light camera is then directly dependent on the size of the baseline. This means that with a fixed baseline, only a specific range can be accurately measured. For example, the Microsoft Kinect™ has a baseline of approximately 7.5 cm and the optimal operating range is 0.8 m to 6 m.

Distance Dependent

Another advantage of the TOF camera is that, in a sense, it is not distance dependent. It is able to produce an accurate depth measurement within the whole current range. Structured light, on the other hand, measures the disparity of pixels, a measurement that depends on the distance to the camera. A fixed displacement of the structured light in 3-D space will appear as a larger disparity close to the camera compared to further away from it. The result is a much lower accuracy far away from the camera than close to it.

Reflectivity Ambiguity

One of the major problems with a TOF camera is caused by the inability to know the reflectivity of a surface. A black surface reflects significantly fewer photons than a white surface, resulting in a weaker signal and hampering the accuracy of the measurement, because the signal is then weak compared to the background light. The consequence of this is usually different depth estimations for bright surfaces compared to dark surfaces, even if both have the same depth. Bouncing light waves also cause artifacts. If an emitted light wave bounces on more than one surface before reaching back to the camera, its time of flight is longer, resulting in overestimated depth values. This effect is called the multi-path phenomenon, and there is currently no known compensation to remedy it.

Edge Artifacts

Structured light has a tendency to create artifact points at edges. If there is a discontinuity in depth, as there usually is at the edge of an object, points have a tendency to be estimated between the object's edge and the background, creating a drape-like effect. TOF cameras are also plagued by this, but not to the extent that structured light cameras are.

2.1.4 Pinhole Camera Model

The pinhole camera model is an approximate, simple mathematical model of a camera that also applies to depth cameras. The principle of the pinhole camera is to allow light from the environment to enter only through a small aperture, a pinhole. The light is then projected upon a flat surface, creating an image. This can be described with a series of transformations, converting a point in world coordinates into pixel coordinates. Figure 2.4 shows the basic setup of a pinhole camera where a point is projected into the image plane. The camera center is denoted O, with the principal axis crossing the principal plane at the point R. The 3-D point P has the coordinates (X, Y, Z) and the projected point has the image coordinates (u, v). The focal length is denoted f. Note that the image plane is somewhat rotated to avoid some negative signs later on. Before projection is possible, the point has to be described in the camera coordinate system. This can be done with a transformation matrix T describing the camera's position in 3-D space, which contains the camera's extrinsic parameters

T = \begin{pmatrix} R & t \end{pmatrix}, \qquad (2.6)

where R, a 3 × 3 matrix, represents the camera rotation and t, a 3 × 1 vector, the camera translation.

With the 3-D point transformed into camera coordinates it can be projected into the image plane. Looking at figure 2.4 it is clear with some trigonometry that the image coordinates u and v can be calculated with the following equations:

u = f \frac{X}{Z} \qquad (2.7)

v = f \frac{Y}{Z} \qquad (2.8)

The last step is to transform the image coordinates into pixel coordinates. This can be as simple as translating the coordinates so that the origin is in the top left corner. The projection and image coordinate transformation can be represented with a single transformation matrix, the intrinsic camera matrix K, which can be constructed as

K = \begin{pmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad (2.9)

where u_0 and v_0 are the pixel coordinates of the principal point, ideally half of the image resolution.

Figure 2.4: The geometrical setup of a pinhole camera.

A 3-D point with coordinates (X, Y, Z) can then ultimately be transformed into pixel coordinates (u, v) through

c \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K T \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}, \qquad (2.10)

with c as a homogeneous scaling factor. Note that some additional parameters are usually included in the model such as radial distortion and skewing, see [6] for more details. It was however found that the simple model was adequate for this application.
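The following is a minimal illustrative sketch (not the thesis implementation) of equation 2.10 using OpenCV matrices, the library stated in chapter 4 to have been used. All numerical values for K, T and the world point are invented; only the structure follows the equations above.

```cpp
#include <opencv2/core.hpp>
#include <cstdio>

// Illustrative sketch of equation 2.10: projecting a world point into pixel
// coordinates with extrinsics T = [R | t] and intrinsics K.
int main() {
    cv::Mat K = (cv::Mat_<double>(3, 3) <<
        580.0,   0.0, 160.0,
          0.0, 580.0, 120.0,
          0.0,   0.0,   1.0);   // equation 2.9, principal point at the image center

    cv::Mat T = (cv::Mat_<double>(3, 4) <<
        1.0, 0.0, 0.0, 0.1,
        0.0, 1.0, 0.0, 0.0,
        0.0, 0.0, 1.0, 0.5);    // identity rotation and a small translation

    cv::Mat P = (cv::Mat_<double>(4, 1) << 0.2, -0.1, 2.0, 1.0);  // homogeneous world point

    cv::Mat uvw = K * T * P;    // equation 2.10, up to the scale factor c
    double u = uvw.at<double>(0) / uvw.at<double>(2);
    double v = uvw.at<double>(1) / uvw.at<double>(2);
    std::printf("pixel coordinates: (%.1f, %.1f)\n", u, v);
    return 0;
}
```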

2.1.5 Occlusion

It is important to consider the effect of occlusion. The 3-D data acquired from a 3-D camera will not be complete and viewable from every angle. Surfaces not seen from the camera will not be present in the 3-D data. It is therefore crucial to find a good angle when recording the data, ensuring that relevant surfaces will be included. When regarding the 3-D camera as a single unit, there are three types of occlusion one has to consider. The first type will be called object occlusion. This occurs when an object is blocking the view of another object, see figure 2.5a. This is usually not so impeding, because object occlusion normally only affects the background. The second type of occlusion will be called far side occlusion, and this type of occlusion affects surfaces on the opposite side of the object from the camera's point of view, see figure 2.5b. The last type of occlusion is self occlusion, and it is the type of occlusion that has to be considered the most. This is because it usually occurs on the surface that is studied. It is also very dependent on the viewing angle, which is possible to modify, see figure 2.5c. Note that these three kinds of occlusion to some extent represent the same problem, but they are considered separately. Also note that these types of occlusion are not the types of occlusion you normally hear about when dealing with 3-D sensors. Usually two types of occlusion called camera and light occlusion are mentioned. Camera occlusion would, in the case of structured light, be regarded as the parts where surfaces are blocking the camera's view of the structured light. On the contrary, light occlusion would occur on the surfaces the camera is able to see but the light cannot reach. These terms are however more technical and should mainly be regarded when studying or developing the actual sensor, not when operating it, since as an operator you cannot control the effects of these occlusions without modifying the sensor. The time-of-flight technology doesn't even have equivalent occlusion effects.

Figure 2.5: Three different types of occlusion. The points on the floor are occluded behind the box in (a). The backside of the sofa in (b) is not visible from the camera's point of view. Local variations in the basket in (c) obstruct proper 3-D point generation. Note that all three types are often present in the same point cloud.

2.2 Object Classification

Classification is the act of differentiating classes from each other. There are essentially four steps [7] that should be taken into consideration when attempting classification:

1. Produce training and validation sets - The real reason for creating a classification system is to be able to automatically classify unknown samples. It is therefore reasonable to have two separate sets of data, one used for training and one used for validation. The validation set should then simulate a set of unknown data that is to be classified. The training set should be able to represent all the possible outcomes with a wide variety. The system might not be able to recognize a set of validation data if similar data did not occur in the training data. By providing good validation data, overfitting can be avoided, i.e. preventing the system from trying to describe only the training data instead of the generalized structure.

2. Extract features - It is common that the data used at classification contains a lot of information. There is usually too much information, where the majority might be irrelevant for differentiating the different classes. An example could be that you are about to classify a set of cars depending on their brand. Here a lot of information will not aid you in figuring out the brand, e.g. all cars have four wheels, one engine, headlights and taillights, etc. Carefully selected features should then be examined, features which are more or less unique for each brand. In the world of cars the easiest way to find out the brand of the car is to look at the logo, a small feature that with utmost accuracy will be able to distinguish the car. Therefore the data should be reduced to a smaller set of independent features, capable of separating the classes. A set of features can then be called a descriptor, as it describes the class, or more specifically, the sample.

3. Create and train a classifier - The descriptors are usually paired with a corresponding class label, a key to identify the class. A classifier can then take a set of training descriptors with their corresponding labels and train itself to recognize and appoint labels to new descriptors. How this is done depends on the used algorithm. Note that different algorithms work well with different kinds of descriptors. There is seldom a perfect descriptor and a perfect classifier. Most classifier algorithms can be changed and tweaked with a number of parameters, changing the behavior of them. The algorithms mentioned in section 2.5 are all different classifiers.

4. Evaluate the classifier - The final step is to evaluate the system. This is done by inputting the training data and comparing the class label the system outputted with the true class label. If the system is not able to assess the correct class labels for a majority of the training data, it is not very good. The remedy could be to tweak the parameters of the classifier, change the classifier algorithm, or find a better descriptor.

2.3 Morphometrics

Morphology is, in the field of biology, defined as the study of shape and dimensions. Morphometrics is a sub-field of morphology, defined as the quantification and comparison of shape [8], [9]. Traditional morphometrics may examine metric measurements, angles, masses, etc. These measurements are however usually correlated, and the ratio of the height and width of an object usually stays the same as it grows. The result is a small number of independent variables despite a large number of measurements. By making the variables independent, the scale information is often removed. However, if the aim of the analysis is to find absolute differences between subjects, the scale information can still be useful. For example, Yakubu, A., et al. [10] use traditional morphometrics to distinguish between two different species of fish, namely Oreochromis niloticus and Lates niloticus, by analyzing seven morphometric measurements (body weight, standard length, total length, head length, body depth, dorsal fin length and caudal fin length). Traditional morphometrics does however only deal with outline measurements and not the internal shape variations occurring within a specimen. Morphological landmarks are then an extension enabling data to be gathered over the whole specimen, expanding the spatial information available to describe the specimen, i.e. the features used to form a descriptor as mentioned in the previous section.

2.4 Morphological Landmarks

Morphological landmarks are distinguishable features on an object that can unambiguously be detected on all specimens [11], i.e. they are said to be homologous. According to I. L. Dryden and Kanti V. Mardia [12] there are essentially three different classes of landmarks: anatomical landmarks, mathematical landmarks and pseudo-landmarks. An anatomical landmark is a landmark with anatomical correspondence between subjects. It is usually assigned by an expert and has to hold a meaning in the current context. A mathematical landmark holds certain mathematical properties, for example a local maximum or minimum. The pseudo-landmarks do not have any distinguishable attributes in themselves; they exist between anatomical or mathematical landmarks to enrich the amount of measurements. These samples are then usually bundled together to form an actual descriptor.

2.5 Face Recognition

The research field of face recognition is a well established and quickly evolving field with a wide variety of algorithms available. Recognition based purely on biometric data extracted from the face would be a step closer to the human's superior ability to recognize faces. The main incentive is from a security and surveillance viewpoint. Systems today can quite easily be bypassed by forging or acquiring the data necessary for access. A standard system is the use of an ID card together with a pass code. None of these are impossible to come by, and one can therefore gain illegitimate access to areas or information. Biometric data is however much harder to forge. A common biometric measurement is a fingerprint. Acquiring a fingerprint does however require that the subject's finger is placed upon a device for scanning, a task which is time consuming and might not always be possible. Scanning a face from a distance is then a sound alternative. It is achievable on many subjects at the same time and it is completely non-intrusive. Note that the face is often used because it contains a lot of information, but any area which might provide enough information is applicable.

This suggests that the field of face recognition has a lot to offer for our problem, and well established methods should be examined and exploited if possible. One of the first algorithms to really break through was the method of eigenfaces, developed by L. Sirovich and M. Kirby [13] and used by Matthew Turk and Alex Pentland [14] for face classification. It tries to maximize the variance between the faces by creating a set of basis vectors from an eigen decomposition, as shown in section 3.6.1, hence its name eigenfaces. Note that it is called eigenfaces due to its primary application of using images of faces, but any image can be used. Even though this implementation has since been outperformed many times, it still forms the basis for many modern algorithms. It is also frequently used as a baseline method when comparing the performance of other systems. Eigenfaces is essentially a way to represent faces and may not be such a good way to classify them. Belhumeur, Hespanha and Kriegman recognized this and developed a method called fisherfaces [15] that greatly outperformed eigenfaces in classifying faces. Fisherfaces performs a linear discriminant analysis (LDA), invented by Sir R. A. Fisher, who successfully used it to classify flowers [16]. LDA tries to cluster the same classes together by maximizing the ratio of external and internal class differences. Both eigenfaces and fisherfaces suffer from the necessity of requiring a lot of training samples acquired under different conditions to be able to accurately recognize faces in somewhat uncontrolled conditions. They are also holistic methods that use every pixel when processing the data. This requires a near perfect alignment of each face, which usually is only possible in controlled conditions. Many different variations have been developed to try to deal with these drawbacks, and some have succeeded better than others. Hu Han et al. [17] review several different techniques of illumination preprocessing. The aim of these techniques is to suppress the variations caused by different lighting conditions in each frame. However, with the use of 3-D data the illumination problem would be completely eliminated. This once again implies that having accurate 3-D data is a great advantage, as evident in a survey by Andrea F. Abate et al. [18] comparing face recognition methods based on 2-D and 3-D imaging. A 2-D method that tries to deal with the illumination variation is Local Binary Pattern Histograms [19]. The basic principle of this algorithm is to evaluate the relative local structure around pixels, enabling a robust descriptor of a face. The three algorithms mentioned in this section will be further explained in section 3.6.

3 Algorithms for Image Processing

There exist many algorithms in the world of image processing and in this chapter the ones used will be explained. Not all of the following algorithms are explicitly image processing algorithms, but all of them can be used for image processing, and are therefore included in this chapter.

3.1 Flood Fill

Flood fill is an image processing tool used for finding connected pixels. It is very good at segmenting a region from the rest of an image, if the region has proper edges. A seed pixel is first selected inside the region; the algorithm then grows from this pixel, evaluating whether neighboring pixels are connected. A pixel is connected if

Z(x') - \Delta_- \le Z(x) \le Z(x') + \Delta_+, \qquad (3.1)

where Z(x) is the current image value for a pixel with image coordinates x, which might be connected to the seed pixel. This is done for all neighboring seed pixels, which have the coordinates x'. The two thresholds \Delta_- and \Delta_+ are selected manually depending on the scale and variance of the data. Figure 3.1 shows a simple example of the flood fill algorithm. The seed pixels can only expand to pixels connected through 4-connectivity in this illustration, which means that only horizontally and vertically neighboring pixels are considered as neighborhood pixels.

Figure 3.1: The flood fill algorithm searches from the current pixel (green) for possible connected pixels in its neighborhood. With ∆− = ∆+ = 1 the pixel with value 4 is connected; the pixels with value 3 and 7 are not, due to the difference being too large.
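The sketch below (not the thesis implementation) shows one way such a flood fill could look, using 4-connectivity and the thresholds of equation 3.1; each neighbor is compared to the pixel it is expanded from, and the small grid of values is only illustrative.

```cpp
#include <queue>
#include <utility>
#include <vector>
#include <cstdio>

// Sketch of the flood fill of section 3.1 with 4-connectivity and the
// thresholds of equation 3.1.
std::vector<std::vector<bool>> floodFill(const std::vector<std::vector<int>>& img,
                                         int seedRow, int seedCol,
                                         int deltaMinus, int deltaPlus) {
    const int rows = static_cast<int>(img.size());
    const int cols = static_cast<int>(img[0].size());
    std::vector<std::vector<bool>> mask(rows, std::vector<bool>(cols, false));
    std::queue<std::pair<int, int>> frontier;
    mask[seedRow][seedCol] = true;
    frontier.push({seedRow, seedCol});
    const int dr[4] = {-1, 1, 0, 0};
    const int dc[4] = {0, 0, -1, 1};
    while (!frontier.empty()) {
        auto [r, c] = frontier.front();
        frontier.pop();
        for (int k = 0; k < 4; ++k) {
            int nr = r + dr[k], nc = c + dc[k];
            if (nr < 0 || nr >= rows || nc < 0 || nc >= cols || mask[nr][nc]) continue;
            // Equation 3.1: connected if the value lies within
            // [Z(x') - delta_minus, Z(x') + delta_plus] of the current pixel.
            if (img[nr][nc] >= img[r][c] - deltaMinus &&
                img[nr][nc] <= img[r][c] + deltaPlus) {
                mask[nr][nc] = true;
                frontier.push({nr, nc});
            }
        }
    }
    return mask;
}

int main() {
    std::vector<std::vector<int>> img = {{8, 7, 8}, {5, 6, 9}, {3, 5, 7}};  // illustrative values
    auto mask = floodFill(img, 1, 1, 1, 1);   // seed at the center, delta_- = delta_+ = 1
    for (const auto& row : mask) {
        for (bool v : row) std::printf("%d ", v ? 1 : 0);
        std::printf("\n");
    }
    return 0;
}
```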

3.2 Principal Component Analysis

Principal component analysis is one of the most commonly used tools within data analysis. Its purpose is to transform a set of data, making it linearly uncorrelated. The transformation is defined by a number of components, called principal components. The first principal component indicates the maximal variance of the data, and the following components indicate the maximal possible variance while fulfilling orthogonality. Mathematically the principal components are calculated based on the covariance matrix. The covariance matrix C contains the variance of all the components of a vector. It can be estimated from realizations of a random vector X, and its mean µ, as

C = E\{(X - \mu)(X - \mu)^T\}. \qquad (3.2)

The eigenvectors of the covariance matrix then correspond to the principal components, and the corresponding eigenvalues are proportional to the variance along these vectors.

Singular Value Decomposition

With all realizations of X put column-wise into a matrix

Y = \begin{pmatrix} X_1 - \mu, & X_2 - \mu, & \ldots, & X_N - \mu \end{pmatrix}, \qquad (3.3)

the covariance matrix is C = Y Y^T. The principal components can however be calculated by doing a singular value decomposition of the matrix Y directly. The matrix Y is then decomposed using a singular value decomposition as

Y = U S V^T, \qquad (3.4)

where the columns of U are the left hand singular vectors of Y and correspond to the eigenvectors of Y Y^T, i.e. the principal components. The columns of V are the right hand singular vectors and correspond to the eigenvectors of Y^T Y. The matrix S contains the corresponding singular values on its diagonal, representing the square roots of the non-zero eigenvalues. If Y is an M × N matrix, then U is a unitary M × M matrix, V is N × N and S is an M × N rectangular diagonal matrix.
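As an illustration of computing principal components through an SVD of the mean-subtracted matrix Y (equations 3.3 and 3.4), here is a small sketch using OpenCV's cv::SVD; the toy 2-D data points are made up.

```cpp
#include <opencv2/core.hpp>
#include <cstdio>

// Sketch of section 3.2: principal components of a small data set through an
// SVD of the mean-subtracted matrix Y.
int main() {
    // Each column is one realization of the random vector X.
    cv::Mat Y = (cv::Mat_<double>(2, 5) <<
        2.0, 3.1, 4.2, 5.0, 5.9,
        1.1, 1.9, 3.2, 4.1, 4.8);

    // Subtract the mean from every column (equation 3.3).
    cv::Mat mu;
    cv::reduce(Y, mu, 1, cv::REDUCE_AVG);   // 2 x 1 column of row means
    for (int j = 0; j < Y.cols; ++j) {
        cv::Mat col = Y.col(j);             // header sharing data with Y
        col -= mu;
    }

    // Y = U * S * V^T; the columns of U are the principal components and the
    // singular values are the square roots of the non-zero eigenvalues of Y*Y^T.
    cv::SVD svd(Y);
    for (int i = 0; i < svd.u.cols; ++i)
        std::printf("component %d: (%6.3f, %6.3f), singular value %.3f\n",
                    i, svd.u.at<double>(0, i), svd.u.at<double>(1, i), svd.w.at<double>(i));
    return 0;
}
```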

3.3 Homography

A homography is a projective transformation that has eight degrees of freedom, which enables it to perform all affine transformations, i.e. translation, rotation, scaling and skewing. Together those require six degrees of freedom, and the last two enable a homography to change the perspective of a set of points on a plane, i.e. points on an image. See figure 3.2 for an example of a homography transforming four points into a square, which can be seen as changing the perspective of a rectangular plane from a slight angle to a view perfectly from above.

Figure 3.2: A homography is capable of transforming the set of points X into the set of points X'.

A homography matrix can be calculated to transform the vector of 2-D homogeneous point coordinates X into another set of point coordinates x, fulfilling

x = H X. \qquad (3.5)

The homography matrix H can be estimated using the homogeneous estimation method. By reshaping

H = \begin{pmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{pmatrix} \qquad (3.6)

into a vector h according to

h = \begin{pmatrix} H_{11} & H_{12} & H_{13} & H_{21} & H_{22} & H_{23} & H_{31} & H_{32} & H_{33} \end{pmatrix}^T, \qquad (3.7)

the homogeneous estimation method finds the solution to

A h = 0 \qquad (3.8)

for h, where the matrix A is constructed as

A = \begin{pmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 X_1 & -y_1 X_1 & -X_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 Y_1 & -y_1 Y_1 & -Y_1 \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2 X_2 & -y_2 X_2 & -X_2 \\
0 & 0 & 0 & x_2 & y_2 & 1 & -x_2 Y_2 & -y_2 Y_2 & -Y_2 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_n & y_n & 1 & 0 & 0 & 0 & -x_n X_n & -y_n X_n & -X_n \\
0 & 0 & 0 & x_n & y_n & 1 & -x_n Y_n & -y_n Y_n & -Y_n
\end{pmatrix}. \qquad (3.9)

The vector h is then the vector that minimizes ||Ah||, subject to ||h|| = 1, and is given as the eigenvector with the smallest eigenvalue of A^T A. This is equal to the right hand singular vector corresponding to the smallest singular value obtained from a singular value decomposition of A, see section 3.2. This singular value decomposition always exists for any matrix, as opposed to an eigenvalue decomposition.
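Below is a sketch (not the thesis implementation) of the homogeneous estimation method: the matrix A of equation 3.9 is assembled and the right hand singular vector with the smallest singular value is taken as h. With A laid out as in equation 3.9, the recovered H maps the first point set onto the second; the four point pairs are invented for illustration.

```cpp
#include <opencv2/core.hpp>
#include <vector>
#include <cstdio>

// Sketch of the homogeneous estimation method of section 3.3.
cv::Mat estimateHomography(const std::vector<cv::Point2d>& x,
                           const std::vector<cv::Point2d>& X) {
    cv::Mat A(2 * static_cast<int>(x.size()), 9, CV_64F);
    for (size_t i = 0; i < x.size(); ++i) {
        double xi = x[i].x, yi = x[i].y, Xi = X[i].x, Yi = X[i].y;
        double r1[9] = {xi, yi, 1, 0, 0, 0, -xi * Xi, -yi * Xi, -Xi};
        double r2[9] = {0, 0, 0, xi, yi, 1, -xi * Yi, -yi * Yi, -Yi};
        int row = 2 * static_cast<int>(i);
        for (int j = 0; j < 9; ++j) {
            A.at<double>(row, j) = r1[j];        // rows of equation 3.9
            A.at<double>(row + 1, j) = r2[j];
        }
    }
    cv::SVD svd(A, cv::SVD::FULL_UV);
    cv::Mat h = svd.vt.row(8);        // singular vector of the smallest singular value
    return h.reshape(1, 3).clone();   // reshape the 9-vector into the 3 x 3 matrix H
}

int main() {
    // Four made-up correspondences: a tilted quadrilateral mapped to a unit square.
    std::vector<cv::Point2d> quad = {{0.1, 0.2}, {0.9, 0.1}, {1.0, 0.8}, {0.0, 0.9}};
    std::vector<cv::Point2d> square = {{0.0, 0.0}, {1.0, 0.0}, {1.0, 1.0}, {0.0, 1.0}};
    cv::Mat H = estimateHomography(quad, square);
    for (int r = 0; r < 3; ++r)
        std::printf("%8.4f %8.4f %8.4f\n",
                    H.at<double>(r, 0), H.at<double>(r, 1), H.at<double>(r, 2));
    return 0;
}
```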

3.4 Hartley Normalization

There might be a problem when doing a singular value decomposition of a matrix to determine a homography. If the matrix is ill-conditioned, the resulting singular values might not have appropriate values, namely one zero value and the rest non-zero. The problem with this is that the solution might not be accurate enough to be useful. The cause of this problem is that the homogeneous coordinates have a bad distribution. The remedy is to transform the points into a coordinate system where they have an optimal distribution; this transformation is called a Hartley normalization [20]. Their mean distance to the origin should be equal to \sqrt{2} and their mean position should be at the origin. The mean position m can trivially be calculated as

m = \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (3.10)

The mean distance to the origin, s, can be calculated with

s = \frac{1}{n} \sum_{i=1}^{n} \| x_i - m \|. \qquad (3.11)

The normalized points x' are then obtained from x as

x' = T^{-1} x, \qquad (3.12)

where the Hartley normalization matrix T is constructed with the values from equations 3.10 and 3.11 as

T = \begin{pmatrix} \frac{s}{\sqrt{2}} & 0 & m_x \\ 0 & \frac{s}{\sqrt{2}} & m_y \\ 0 & 0 & 1 \end{pmatrix}. \qquad (3.13)

With the two sets of corresponding points x and X, used to estimate a homography, two normalization matrices are constructed, T_x and T_X, respectively. With equation 3.12, these sets of points are normalized, resulting in two sets of normalized points x' and X'. A homography matrix H' can be calculated from these points according to section 3.3. The homography matrix H, describing the homography between the points X and x, is then calculated as

H = T_X H' T_x^{-1}. \qquad (3.14)

This solution is more stable compared to calculating H directly from x and X.
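A small sketch of the normalization follows, assuming (as in the reconstruction of equation 3.12 above) that points are normalized with T⁻¹ so that their mean lands at the origin and their mean distance becomes √2; the point coordinates are made up.

```cpp
#include <opencv2/core.hpp>
#include <vector>
#include <cmath>
#include <cstdio>

// Sketch of section 3.4: build the Hartley normalization matrix T of
// equation 3.13 from the mean position (3.10) and mean distance (3.11).
cv::Mat hartleyMatrix(const std::vector<cv::Point2d>& pts) {
    cv::Point2d m(0.0, 0.0);
    for (const auto& p : pts) m += p;
    m.x /= pts.size();
    m.y /= pts.size();                                  // equation 3.10

    double s = 0.0;
    for (const auto& p : pts)
        s += std::sqrt((p.x - m.x) * (p.x - m.x) + (p.y - m.y) * (p.y - m.y));
    s /= pts.size();                                    // equation 3.11

    double scale = s / std::sqrt(2.0);
    return (cv::Mat_<double>(3, 3) <<
        scale, 0.0,   m.x,
        0.0,   scale, m.y,
        0.0,   0.0,   1.0);                             // equation 3.13
}

int main() {
    std::vector<cv::Point2d> pts = {{100, 120}, {310, 95}, {290, 230}, {110, 240}};
    cv::Mat T = hartleyMatrix(pts);
    cv::Mat Tinv = T.inv();
    cv::Mat p = (cv::Mat_<double>(3, 1) << pts[0].x, pts[0].y, 1.0);
    cv::Mat n = Tinv * p;                               // one normalized point
    std::printf("normalized first point: (%.3f, %.3f)\n",
                n.at<double>(0) / n.at<double>(2), n.at<double>(1) / n.at<double>(2));
    return 0;
}
```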

3.5 3-D Plane Fitting

Fitting a plane to data can be done to see if the data is linearly dependent. Used with 3-D data, planar surfaces can be localized, which is useful for example when searching for floors or walls. When fitting a plane to a set of points you try to minimize some kind of error, and different methods minimize different measurements. The linear regression method minimizes the distance of all the points to the plane along the z-axis. Modifying the algorithm so that the shortest distance, i.e. the perpendicular distance, to the plane is minimized is called orthogonal regression. The implicit equation for a 3-D plane is

a x + b y + c z + d = 0, \qquad (3.15)

where (x, y, z) express the coordinates of a point on the plane. With p = (a, b, c, d)^T defining the plane coefficients and x = (x, y, z, 1)^T, a 3-D point expressed in homogeneous coordinates, the perpendicular distance can be defined by rewriting equation 3.15 as

D = x \cdot p. \qquad (3.16)

Minimizing this equation for several points is equal to finding the right hand singular vector corresponding to the smallest singular value of the matrix A, i.e. finding the p minimizing ||Ap||, where

A = \begin{pmatrix}
x_1 & y_1 & z_1 & 1 \\
x_2 & y_2 & z_2 & 1 \\
\vdots & \vdots & \vdots & \vdots \\
x_n & y_n & z_n & 1
\end{pmatrix}. \qquad (3.17)

The right hand singular vectors of a matrix can be found with a singular value decomposition. Note that the mean of the points should be removed from the points, so that the condition ||p|| = 1 can be properly added to prevent the trivial solution p = 0 from being chosen. Removing the mean of the points results in d = 0 for equation 3.15, i.e. the plane passes through the origin. To have a plane at the correct distance from the origin, d can be set to the orthogonal distance from the plane to the mean point, i.e. d = m · n, where n = (a, b, c)^T.
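The sketch below performs such an orthogonal-regression plane fit with an SVD. Since the mean is removed first, the constant column of equation 3.17 is dropped and only the normal (a, b, c) is estimated, with d recovered from the mean point as described in the text; the sample points are made up.

```cpp
#include <opencv2/core.hpp>
#include <vector>
#include <cstdio>

// Sketch of the orthogonal-regression plane fit of section 3.5 on points
// roughly on the plane z = 0.5.
int main() {
    std::vector<cv::Point3d> pts = {
        {0.0, 0.0, 0.50}, {1.0, 0.0, 0.51}, {0.0, 1.0, 0.49},
        {1.0, 1.0, 0.50}, {0.5, 0.5, 0.52}, {0.2, 0.8, 0.48}};

    cv::Point3d m(0.0, 0.0, 0.0);
    for (const auto& p : pts) m += p;
    m.x /= pts.size(); m.y /= pts.size(); m.z /= pts.size();

    cv::Mat A(static_cast<int>(pts.size()), 3, CV_64F);
    for (int i = 0; i < A.rows; ++i) {
        A.at<double>(i, 0) = pts[i].x - m.x;   // mean-centered coordinates
        A.at<double>(i, 1) = pts[i].y - m.y;
        A.at<double>(i, 2) = pts[i].z - m.z;
    }

    cv::SVD svd(A, cv::SVD::FULL_UV);
    cv::Mat n = svd.vt.row(2);                   // normal = smallest right singular vector
    double a = n.at<double>(0), b = n.at<double>(1), c = n.at<double>(2);
    double d = a * m.x + b * m.y + c * m.z;      // d = m . n, as in the text
    std::printf("plane: %.3fx + %.3fy + %.3fz, d = %.3f\n", a, b, c, d);
    return 0;
}
```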

3.6 Face Recognition Methods

Three different face recognition methods were mentioned in section 2.5 and here the mathematical theory will be explained.

3.6.1 Eigenfaces

As mentioned in section 2.5, Eigenfaces is considered to be the first functional face recognition method. It employs principal component analysis to capture the variation in a set of images. The result is a basis of Eigenfaces representing the set of images in an effective way. Due to the often large sizes of images a trick is exploited to be able to calculate this basis.

To calculate the Eigenfaces you start by arranging the N images in vectors, from the top left corner and then row-wise, resulting in a set of vectors

\{I_1, I_2, \ldots, I_N\}. \qquad (3.18)

Calculate the mean vector \Pi, which when reshaped can be seen as the mean image. Then concatenate all the vectors minus the mean into a large matrix A such that

A = \{I_1 - \Pi, I_2 - \Pi, \ldots, I_N - \Pi\}. \qquad (3.19)

If an image is 256 × 256 pixels the vectors would be 65536 components long, and with N images the matrix would be 65536 × N. Since PCA of a matrix A employs the eigenvalues of A A^T, which now is 65536 × 65536, the cost of calculating eigenvectors of this matrix is unreasonable. To circumvent this, the N × N inner product matrix A^T A is used instead. The eigenvectors \upsilon_i and eigenvalues \lambda_i of A A^T are what was originally needed, defined as

A A^T \upsilon_i = \lambda_i \upsilon_i. \qquad (3.20)

But now with A^T A we get the following eigen decomposition

A^T A \omega_i = \mu_i \omega_i. \qquad (3.21)

However, by multiplying with A from the left we get the following

A A^T (A \omega_i) = \mu_i (A \omega_i). \qquad (3.22)

This implies that \upsilon_i = \frac{A \omega_i}{\|A \omega_i\|}, since the norm of the eigenvectors is equal to one. It can actually be shown that \upsilon_i = \lambda_i^{-0.5} A \omega_i, but since the vectors are normalized it isn't relevant. The drawback of using A^T A is that a maximum of N Eigenfaces can be calculated, but that is usually enough. With a large amount of images only the eigenvectors corresponding to the largest eigenvalues are kept; the rest do not contribute enough variance information and are consequently discarded.

The classification works by finding the closest training samples in the newly created Eigenface subspace. With a constructed basis, where the eigenvectors are the basis vectors, an image can be projected into this subspace, resulting in a set of coordinates. Beginning by removing the mean \Pi previously calculated, the coordinates can then be calculated as

c = V (I - \Pi), \qquad (3.23)

where V is a matrix containing the used eigenvectors \upsilon_i as rows. Creating and storing a set of coordinates \{c_1, c_2, \ldots, c_N\} for all training images enables a fast comparison to new images. By finding the shortest distance from the current coordinates c to one of the training coordinates c_i, a corresponding class can be found.
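To make the A^T A trick concrete, here is a small sketch (not the thesis implementation) that computes the Eigenfaces from the inner product matrix and projects a sample into the subspace as in equation 3.23; the "images" are tiny made-up 4-pixel vectors instead of real face or surface images.

```cpp
#include <opencv2/core.hpp>
#include <cstdio>

// Sketch of the A^T A trick of section 3.6.1 (equations 3.19 to 3.23).
int main() {
    // Each column is one image; rows are pixels.
    cv::Mat images = (cv::Mat_<double>(4, 3) <<
        1.0, 2.0, 3.0,
        2.0, 2.0, 4.0,
        0.0, 1.0, 1.0,
        1.0, 3.0, 2.0);

    cv::Mat mean;
    cv::reduce(images, mean, 1, cv::REDUCE_AVG);     // mean image (4 x 1)
    cv::Mat A = images.clone();
    for (int j = 0; j < A.cols; ++j) {
        cv::Mat col = A.col(j);
        col -= mean;                                 // equation 3.19
    }

    cv::Mat ata = A.t() * A;                         // small N x N matrix
    cv::Mat evals, evecs;                            // eigenvectors omega_i as rows
    cv::eigen(ata, evals, evecs);                    // equation 3.21

    cv::Mat V(evecs.rows, A.rows, CV_64F);           // Eigenfaces stored as rows
    for (int i = 0; i < evecs.rows; ++i) {
        cv::Mat v = A * evecs.row(i).t();            // equation 3.22: A * omega_i
        v = v / cv::norm(v);                         // normalize to unit length
        cv::Mat(v.t()).copyTo(V.row(i));
    }

    // Project a new sample into the subspace (equation 3.23).
    cv::Mat sample = (cv::Mat_<double>(4, 1) << 2.0, 3.0, 1.0, 2.0);
    cv::Mat coords = V * (sample - mean);
    std::printf("subspace coordinates: ");
    for (int i = 0; i < coords.rows; ++i) std::printf("%.3f ", coords.at<double>(i));
    std::printf("\n");
    return 0;
}
```

Classification would then amount to comparing these coordinates with the stored coordinates of the training samples, as described above.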

3.6.2 Fisherfaces

The principle behind Fisherfaces is linear discriminant analysis. The result from the LDA is a subspace spanned by a set of vectors called Fisherfaces. The goal of Fisherfaces is to find a set of basis vectors where the internal class distances are minimized while the external class differences are maximized. The internal class distances are represented by a scatter matrix S_I, calculated as

S_I = \sum_{c \in C} \sum_{I_k \in c} (I_k - \Pi_c)(I_k - \Pi_c)^T, \qquad (3.24)

where C is the set of classes and \Pi_c is the mean of the images in class c. The external class differences are represented by the scatter matrix S_E, calculated as

S_E = \sum_{c \in C} N_c (\Pi_c - \Pi)(\Pi_c - \Pi)^T, \qquad (3.25)

with N_c as the number of samples in class c, and \Pi as the total mean of all samples. To minimize S_I and maximize S_E, the optimal basis vectors contained in the matrix V should fulfill

V_{opt} = \arg\max_V \frac{V^T S_E V}{V^T S_I V}. \qquad (3.26)

These resulting basis vectors are the set of generalized eigenvectors of S_E and S_I corresponding to the largest eigenvalues, i.e. the basis is given by

S_E V_{opt} = S_I V_{opt} \Lambda, \qquad (3.27)

where \Lambda is the diagonal matrix containing the corresponding eigenvalues. Note that there are some technicalities that need to be solved with S_I being singular; read [15] for details. The resulting basis vectors can be reshaped into images, called Fisherfaces. The actual class prediction is similar to how Eigenfaces classifies, where a sample is projected into the created subspace and the closest class is then chosen.

3.6.3 Local Binary Pattern Histograms

The principle behind the local binary pattern histograms algorithm is to encode each pixel with a binary number. This number tries to explain the local structure around the pixel by evaluating whether the pixel is larger or smaller than its neighborhood. Figure 3.3 shows a small example of how to encode a pixel using a basic 3 × 3 local binary pattern. There are some additional extensions applied to extract the patterns for the whole image, which you can read about in [19].

Figure 3.3: Example of a pixel (red circle) being encoded as a local binary pattern. The thresholded values are put in series starting from the top left and going around clockwise, resulting in the binary string 01100111, called the local binary pattern.

The final step to actually create a full descriptor is to divide the image of local binary patterns into pieces and evaluate a histogram for each part. These histograms are then concatenated to create the local binary pattern histogram. Prediction of a class works by comparing the histogram of the current sample with the histograms of all the training samples, choosing the one with the smallest Chi-Square distance

d = \sum_{I} \frac{(H_1(I) - H_2(I))^2}{H_1(I)}, \qquad (3.28)

where H_1 is the local binary pattern histogram of the sample that is to be predicted, and H_2 is the local binary pattern histogram of one of the training samples.
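A minimal sketch of the two building blocks follows: the 3 × 3 local binary pattern code for one pixel and the Chi-Square distance of equation 3.28. The bit ordering and the use of ≥ versus > vary between implementations, so this is only one convention and its output need not match the exact string given in figure 3.3; the 256-bin histograms over image sub-regions are omitted for brevity.

```cpp
#include <vector>
#include <cstdio>

// Sketch of a basic 3x3 LBP code and the Chi-Square distance of equation 3.28.
unsigned char lbpCode(const std::vector<std::vector<int>>& img, int r, int c) {
    // Neighbor offsets, starting at the top-left and going clockwise.
    const int dr[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
    const int dc[8] = {-1, 0, 1, 1, 1, 0, -1, -1};
    unsigned char code = 0;
    for (int k = 0; k < 8; ++k) {
        int bit = img[r + dr[k]][c + dc[k]] >= img[r][c] ? 1 : 0;  // threshold against the center
        code = static_cast<unsigned char>((code << 1) | bit);
    }
    return code;
}

double chiSquare(const std::vector<double>& h1, const std::vector<double>& h2) {
    double d = 0.0;
    for (size_t i = 0; i < h1.size(); ++i)
        if (h1[i] > 0.0)
            d += (h1[i] - h2[i]) * (h1[i] - h2[i]) / h1[i];        // equation 3.28
    return d;
}

int main() {
    // Pixel values similar to the example in figure 3.3.
    std::vector<std::vector<int>> img = {{8, 7, 4}, {6, 6, 9}, {3, 5, 7}};
    std::printf("LBP code of the center pixel: %d\n", static_cast<int>(lbpCode(img, 1, 1)));

    std::vector<double> h1 = {0.2, 0.5, 0.3}, h2 = {0.25, 0.45, 0.3};  // toy histograms
    std::printf("chi-square distance: %.4f\n", chiSquare(h1, h2));
    return 0;
}
```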

3.7 Neural Network

An artificial neural network has the possibility of solving complex nonlinear problems. The network employs a system of nodes, called neurons. Each neuron employs simple mathematical functions, and together they are able to find complex relationships between input and output. The network is structured as several layers in series. At least one input layer and one output layer are needed to form a proper network; in addition to those, an arbitrary number of so-called hidden layers can be added in between. Each layer contains a number of nodes that communicate with the nodes in the next layer through weighted connections. By changing these weights the network may be trained on a set of training data, see figure 3.4 for a simple setup with one hidden layer. As mentioned, the number of hidden layers and nodes can be chosen arbitrarily. Each node contains an activation function which takes the sum of the weighted outputs of the previous nodes as input. The activation function is usually a sigmoid function, i.e.

f(x) = \beta \, \frac{1 - e^{\alpha x}}{1 + e^{\alpha x}}, \qquad (3.29)

where \alpha and \beta are usually specified according to the range of the input. There are different algorithms to train a network; the classical algorithm is called random sequential back-propagation. It works by sending input data with a known output through the system, then letting the error propagate back through the network while adjusting the weights accordingly through a simple gradient descent [21].

Figure 3.4: A basic neural network with two input nodes in the input layer (a), with the inputs x_1 and x_2. The output layer (c) consists of two nodes with the outputs y_1 and y_2. The network has one hidden layer (b) which contains four hidden nodes. It is the weights w and w' between the layers that are optimized, granting the network its "learning" abilities.
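As a sketch of the forward pass through such a network, the code below pushes two inputs through one hidden layer and one output layer using the sigmoid of equation 3.29 (with the sign convention as printed, so a negative α gives an increasing activation); all weights are invented and the back-propagation training step is not shown.

```cpp
#include <vector>
#include <cmath>
#include <cstdio>

// Sketch of a single forward pass through a tiny network like the one in figure 3.4.
double sigmoid(double x, double alpha, double beta) {
    return beta * (1.0 - std::exp(alpha * x)) / (1.0 + std::exp(alpha * x));  // equation 3.29
}

std::vector<double> layer(const std::vector<double>& in,
                          const std::vector<std::vector<double>>& w,
                          double alpha, double beta) {
    std::vector<double> out(w.size(), 0.0);
    for (size_t j = 0; j < w.size(); ++j) {
        double sum = 0.0;
        for (size_t i = 0; i < in.size(); ++i) sum += w[j][i] * in[i];  // weighted inputs
        out[j] = sigmoid(sum, alpha, beta);                             // node activation
    }
    return out;
}

int main() {
    std::vector<double> x = {0.3, -0.7};                              // two inputs
    std::vector<std::vector<double>> w1 = {{0.5, -0.2}, {0.1, 0.8},
                                           {-0.6, 0.4}, {0.9, 0.3}};  // input -> 4 hidden nodes
    std::vector<std::vector<double>> w2 = {{0.2, -0.5, 0.7, 0.1},
                                           {-0.3, 0.6, 0.4, -0.8}};   // hidden -> 2 outputs
    std::vector<double> h = layer(x, w1, -1.0, 1.0);
    std::vector<double> y = layer(h, w2, -1.0, 1.0);
    std::printf("outputs: %.3f %.3f\n", y[0], y[1]);
    return 0;
}
```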

3.8 B-splines

Splines can, through a set of control points, interpolate continuous data. With only a few control points, splines can represent complex shapes or data structures. B-splines are splines that don't go through their control points, but are merely supported by them. A B-spline is both C1 and C2 continuous, meaning that its first and second derivatives are continuous. A number of control points, or coefficients, P_k define the B-spline and construct the spline as

X(t) = \sum_{k=0}^{N} P_k B_k(t), \qquad (3.30)

where the blending functions B(t) are piecewise defined for different t. The first blending function B_0(t) is defined over the interval −1 ≤ t < 3 as

B_0(t) =
\begin{cases}
(t + 1)^3, & -1 \le t < 0 \\
-3t^3 + 3t^2 + 3t + 1, & 0 \le t < 1 \\
3t^3 - 15t^2 + 21t - 5, & 1 \le t < 2 \\
(3 - t)^3, & 2 \le t < 3
\end{cases}. \qquad (3.31)

The next blending functions are then defined the same as equation 3.31 but with a shifted t interval. This means that each segment of the B-spline is the sum of four different blending functions, see figure 3.5 for an illustration. A segment of the B-spline can then be defined on the interval [0, 1) with equation 3.32, expressing the blending functions in matrix form:

X(t) = \frac{1}{6}
\begin{pmatrix} P_{k-1} & P_k & P_{k+1} & P_{k+2} \end{pmatrix}
\begin{pmatrix}
-1 & 3 & -3 & 1 \\
3 & -6 & 3 & 0 \\
-3 & 0 & 3 & 0 \\
1 & 4 & 1 & 0
\end{pmatrix}
\begin{pmatrix} t^3 \\ t^2 \\ t \\ 1 \end{pmatrix}. \qquad (3.32)

Figure 3.5: A B-spline function (red) created from six blending functions (blue) with coefficients P_k = [2, 5, 1, 3, 2, 5]. Note that this function is only defined for 0 ≤ t < 7 and the additional dashed functions are helping functions to fulfill the boundary conditions X(0) = P_0 = 2 and X(7) = P_6 = 5.
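The following sketch evaluates one segment of a cubic B-spline with the matrix form of equation 3.32; the coefficients are taken from the caption of figure 3.5, but which segment is evaluated and the boundary handling are only illustrative.

```cpp
#include <vector>
#include <cstdio>

// Sketch of evaluating one B-spline segment with the matrix form of equation
// 3.32, from four consecutive coefficients P_{k-1}..P_{k+2}.
double bsplineSegment(double p0, double p1, double p2, double p3, double t) {
    // Basis matrix of equation 3.32, applied to (t^3, t^2, t, 1) and scaled by 1/6.
    const double M[4][4] = {{-1.0,  3.0, -3.0, 1.0},
                            { 3.0, -6.0,  3.0, 0.0},
                            {-3.0,  0.0,  3.0, 0.0},
                            { 1.0,  4.0,  1.0, 0.0}};
    const double P[4] = {p0, p1, p2, p3};
    const double T[4] = {t * t * t, t * t, t, 1.0};
    double x = 0.0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            x += P[i] * M[i][j] * T[j] / 6.0;
    return x;
}

int main() {
    std::vector<double> P = {2, 5, 1, 3, 2, 5};   // coefficients from figure 3.5
    // Evaluate the segment supported by P_0..P_3 at a few positions in [0, 1).
    for (double t = 0.0; t < 1.0; t += 0.25)
        std::printf("t = %.2f -> X(t) = %.3f\n", t, bsplineSegment(P[0], P[1], P[2], P[3], t));
    return 0;
}
```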

4 Implementation

The whole program was written in C++. Some of the algorithms and principles were realized using existing functions from the open source libraries OpenCV and the Point Cloud Library. The sections below are in more or less chronological order.

4.1 Overview

The system uses data from a 3-D camera as input, then processes it using different methods to extract a robust descriptor. This descriptor is then used to train or predict corresponding classes from each input. More precisely, the amount of data from the raw sensor output is first reduced by choosing the appropriate data to process further. Then that single frame is filtered to extract the relevant data and a rectification process is used to normalize the data in the sense of making it more or less invariant to all affine transformations, i.e. rotation, translation and scale. Distinct points are marked as reference points for the descriptor extraction. Here is a short overview of the steps performed:

1. Acquire the relevant data from the sensor
2. Filter the data by extracting the foreground
3. Get and rectify a height map from the foreground data
4. Find and remove ridge portions of the height map
5. Localize specific protrusions
6. Create a descriptor
7. Train and/or validate classifiers

4.2 Data Acquisition

The camera is capable of delivering data at 30 frames per second with a resolution of 320 × 240 points. The depth precision with the current setup is about 5 mm at the relevant distance from the camera. The camera outputs four arrays which contain the three coordinates and RGB color, see figure 4.1 for an example of a typical output. The coordinates are stored as a signed number represented with 16 bits giving them a possible range of 65536 values. Since the depth range is limited, combined with the finite precision, only a portion of the data range is used.

(a) The depth (z-axis) output. (b) The color output.

(c) The length (y-axis) output. (d) The width (x-axis) output.

Figure 4.1: A typical output of a Fotonic P70, where the x, y and z axes are displayed in the JET color space. Note that although the color output appears to be black and white, it is not; a weak RGB camera combined with a color-deprived scene results in an almost monochrome image.


4.2.1 Point Cloud

The camera delivers data in three dimensions; by combining the values from each dimension, a set of points in 3-D space can be assembled. This set is called a point cloud. The Point Cloud Library (PCL) can be used to store and process these point clouds, which enables the use of PCL's internal functions, including a powerful visualizer. The output arrays from the camera can easily be converted into a point cloud by traversing the arrays, extracting the respective coordinates and turning them into points. By keeping track of the position in the arrays, an important attribute can be given to the point cloud, namely a constant height and width. If this attribute is preserved the point cloud is said to be organized, which is crucial if one were to attempt to convert the point cloud back into the respective 2-D arrays. An organized point cloud differs from an ordinary point cloud in that it knows which points are its neighbors directly from their memory positions, whereas an ordinary point cloud has to search for possible neighbors within a local neighborhood defined by the Euclidean distance. If the points have been properly saved, the point cloud has a resolution of 320 × 240 points, resulting in potentially 76800 valid points. As explained in sections 2.1.1 and 2.1.2, a 3-D camera only has a limited range within which the depth of a point can be accurately estimated, so depending on the environment not all 76800 points may have valid coordinates; invalid points are then represented as Not-A-Number (NaN). Figure 4.2 shows the point cloud converted from the arrays in figure 4.1, rendered with the PCL Visualizer.
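A minimal sketch of such a conversion is shown below, assuming the four camera arrays are available as std::vector buffers. The buffer names, the millimetre-to-metre scaling and the use of z = 0 as an invalid-point marker are assumptions made for illustration, not details taken from the actual implementation.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// Convert the raw camera arrays into an organized PCL point cloud.
// The 320x240 layout of the arrays is preserved so that the cloud
// can later be mapped back to 2-D images.
pcl::PointCloud<pcl::PointXYZRGB>::Ptr toOrganizedCloud(
    const std::vector<int16_t>& x, const std::vector<int16_t>& y,
    const std::vector<int16_t>& z, const std::vector<uint8_t>& rgb,
    int width = 320, int height = 240)
{
    auto cloud = pcl::PointCloud<pcl::PointXYZRGB>::Ptr(
        new pcl::PointCloud<pcl::PointXYZRGB>(width, height));
    cloud->is_dense = false;  // the cloud may contain invalid (NaN) points

    const float nan = std::numeric_limits<float>::quiet_NaN();
    const float scale = 0.001f;  // assumed raw unit: millimetres -> metres

    for (int i = 0; i < width * height; ++i) {
        pcl::PointXYZRGB& p = cloud->points[i];
        if (z[i] == 0) {                   // assumed marker for "no measurement"
            p.x = p.y = p.z = nan;
        } else {
            p.x = x[i] * scale;
            p.y = y[i] * scale;
            p.z = z[i] * scale;
        }
        p.r = rgb[3 * i + 0];
        p.g = rgb[3 * i + 1];
        p.b = rgb[3 * i + 2];
    }
    return cloud;
}
```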


4.2.2 A Good Camera Angle

As mentioned in section 2.1.5, occlusion plays a vital part in the quality of the data. A good camera position relative to the specimens therefore needs to be set up prior to the start of the data recording. The main question is what parts of the specimens should be included in each frame, which in turn is a question of where enough information can be extracted from each specimen with good repeatability. The parts that move a lot are difficult to use, since their variation is extremely large from frame to frame. As can be seen in figure 4.1, the final perspective was chosen to be from above, looking down on the specimens at a slight angle of about 20 degrees relative to the ground. Increasing this angle would enable a view of the backside of the specimen, but would lead to self-occlusion on the top. Since the top is the most stable region, getting the best possible data of it was prioritized.

4.2.3 Choosing a Frame of Interest

The 3-D camera provides a continuous stream of 3-D data, and processing every frame would not only be extremely resource demanding but also unnecessary. Depending on the environmental setup, where the camera is positioned and how the objects move in front of the camera, only a fraction of the frames might be viable for processing. The primary concern is whether the points of interest are fully within the field of view of the camera. More or less complex solutions can be used to solve this problem. Since the data rate is about 65 GB/hour (each frame is 604 kB on disk) and both storage capacity and processing power are limited, a simple and effective process needs to be employed. Two columns in the data were designated, one to the right and one to the left. The specimen then moves through one of the columns as it enters the frame, and through the other as it exits the frame. A simple boolean expression can then be assembled to determine whether a specimen is within the frame, as in the sketch below.
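The following C++ sketch shows one possible interpretation of that column test: a specimen is considered fully inside the frame when the centre of the image is occupied but neither guard column is. The column indices, depth threshold and pixel counts are placeholder assumptions, not values from the actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Count how many pixels in a given column of the depth map lie closer to the
// camera than an assumed floor depth (row-major depth buffer, metres).
static int pixelsAboveFloor(const std::vector<float>& depth, int width,
                            int height, int column, float floorDepth)
{
    int count = 0;
    for (int row = 0; row < height; ++row)
        if (depth[static_cast<std::size_t>(row) * width + column] < floorDepth)
            ++count;
    return count;
}

// A specimen is considered fully inside the frame when something is visible
// near the image centre but neither the left nor the right guard column is occupied.
bool specimenInFrame(const std::vector<float>& depth, int width = 320,
                     int height = 240, float floorDepth = 2.0f)
{
    const int leftCol = 10, rightCol = 309;  // assumed guard columns
    const int minPixels = 20;                // assumed noise threshold
    const bool leftBusy   = pixelsAboveFloor(depth, width, height, leftCol,   floorDepth) > minPixels;
    const bool rightBusy  = pixelsAboveFloor(depth, width, height, rightCol,  floorDepth) > minPixels;
    const bool centreBusy = pixelsAboveFloor(depth, width, height, width / 2, floorDepth) > minPixels;
    return centreBusy && !leftBusy && !rightBusy;
}
```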

4.3 Data Extraction

After the appropriate frame has been selected, some pre-processing has to be performed to remove unwanted data. The set of points belonging to the background is typically data that should be removed. Points that might belong to the object but do not contribute any useful information only disturb the later processing and should consequently also be removed. The primary method for extracting the wanted data, i.e. the foreground, is the flood fill method applied to the height map. How to obtain the height map, and the relevant implications, is described in section 4.4.1. Depending on the quality of the captured data, some additional pre-processing might be needed. Quality in this sense means a good recording angle and fair positions of the specimens, so that they are more or less centered in the view. If the recording angle is too tilted, the whole data set can be rotated to change the viewing angle, which can improve the results of the flood fill method.


4.3.1 Manipulating the Viewing Angle

One of the greatest benefits of 3-D data over 2-D data is the possibility to observe the data from another angle. This is essentially a change of basis, and changing the viewing angle is then a simple linear transformation. In the case of 3-D points represented with homogeneous coordinates, a rotation is a simple multiplication from the left with a rotation matrix. The rotation matrix can be written as in equation 4.1, where α, β and γ are rotation angles around each axis of the Cartesian coordinate system.

R =
\begin{bmatrix}
\cos\alpha\cos\beta & \cos\alpha\sin\beta\sin\gamma - \sin\alpha\cos\gamma & \cos\alpha\sin\beta\cos\gamma + \sin\alpha\sin\gamma & 0 \\
\sin\alpha\cos\beta & \sin\alpha\sin\beta\sin\gamma + \cos\alpha\cos\gamma & \sin\alpha\sin\beta\cos\gamma - \cos\alpha\sin\gamma & 0 \\
-\sin\beta & \cos\beta\sin\gamma & \cos\beta\cos\gamma & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\qquad (4.1)

A good reference is usually the floor; if the floor is properly aligned to the camera, the rest of the data is usually fine as well. The floor should therefore be parallel to the viewing plane, i.e. have a more or less constant depth value. There are several ways to find the rotation matrix that aligns the floor with the viewing plane. Since the camera has a fixed position in all of the data, the best option is to find one constant rotation matrix and apply it to all data, instead of using a dynamic algorithm that estimates a matrix for each instance. An easy and sufficient method is to use PCL's viewer to manually rotate the point cloud until the floor looks flat, and then save the matrix which describes how the camera has moved. The inverse of this matrix can then be used to properly rotate the data.
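A sketch of how such a constant, manually determined transform could be applied with PCL is shown below. The hard-coded Euler angles are placeholders, not the values used in the actual setup; in practice the inverted matrix saved from the visualizer would be reused directly.

```cpp
#include <Eigen/Geometry>
#include <pcl/common/transforms.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// Rotate every incoming cloud with one constant transform so that the floor
// becomes parallel to the viewing plane.
pcl::PointCloud<pcl::PointXYZRGB>::Ptr alignToFloor(
    const pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr& input)
{
    // Placeholder rotation angles (radians) around the z, y and x axes.
    const float alpha = 0.0f, beta = 0.0f, gamma = -0.35f;

    Eigen::Affine3f transform = Eigen::Affine3f::Identity();
    transform.rotate(Eigen::AngleAxisf(alpha, Eigen::Vector3f::UnitZ()) *
                     Eigen::AngleAxisf(beta,  Eigen::Vector3f::UnitY()) *
                     Eigen::AngleAxisf(gamma, Eigen::Vector3f::UnitX()));

    auto output = pcl::PointCloud<pcl::PointXYZRGB>::Ptr(
        new pcl::PointCloud<pcl::PointXYZRGB>());
    pcl::transformPointCloud(*input, *output, transform);
    return output;
}
```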

4.3.2 Axis Threshold Filter

With the viewing angle and position aligned to the background, a simple and effective filter is preferably applied to remove as many points as possible with a low number of calculations. The simplest possible filter, with only one inequality check per point, is to compare one of the three coordinates to a constant threshold value. This is done a number of times with different axes and thresholds depending on the situation. The points to remove are then those satisfying

P \notin \{P : l_x < P_x < u_x,\; l_y < P_y < u_y,\; l_z < P_z < u_z\}, \qquad (4.2)

where l and u are the lower and upper limits for the different axes. For example, removing the floor is then as simple as removing all points with a depth value larger than a certain threshold, provided the floor is properly aligned.
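PCL's PassThrough filter implements exactly this kind of per-axis threshold. The sketch below shows one way it could be used to cut away the floor; the numeric limits are placeholder values and would depend on the actual camera setup.

```cpp
#include <pcl/filters/passthrough.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// Keep only points whose z coordinate lies between the two limits,
// i.e. discard the floor and anything beyond it.
pcl::PointCloud<pcl::PointXYZRGB>::Ptr removeFloor(
    const pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr& input)
{
    auto output = pcl::PointCloud<pcl::PointXYZRGB>::Ptr(
        new pcl::PointCloud<pcl::PointXYZRGB>());

    pcl::PassThrough<pcl::PointXYZRGB> pass;
    pass.setInputCloud(input);
    pass.setFilterFieldName("z");      // filter along the depth axis
    pass.setFilterLimits(0.5f, 2.5f);  // assumed lower/upper limits in metres
    pass.setKeepOrganized(true);       // replace removed points with NaN
    pass.filter(*output);
    return output;
}
```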

4.3.3 Foreground Extraction with Flood Fill

Section 3.1 describes how the flood fill method works. Since the flood fill method is designed to work on 2-D images, not all of the 3-D data can be used; only the height data is used. The goal is to extract the foreground from the background. To do this a proper seed pixel has to be picked, and this pixel needs to belong to the foreground. If the data acquisition is good enough at choosing frames where the specimen is in the center of the image, an easy option is to pick the center pixel as the seed pixel. Depending on the range of the data, the upper and lower thresholds for connectivity have to be set appropriately. After converting the height values to a metric coordinate system, both thresholds were set to approximately 5 cm. This means that neighboring pixels with less than a 5 cm depth difference are considered to belong to the same surface. Figure 4.3 shows the result of the flood fill algorithm applied to a typical frame. Note that some lower parts of the specimen were deemed as background. This is fine because those parts were not wanted as foreground anyway; they would have created an unwanted bias due to the higher number of points on that side.

(a) Original depth map, slightly rotated to adjust for the recording angle.

(b) The extracted foreground marked in red.

Figure 4.3: Flood fill algorithm used to extract the foreground. The seed pixel was picked as the center pixel of the image, which almost always belongs to the foreground.
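A sketch of how this step could look with OpenCV's flood fill is given below. The 5 cm thresholds and the center seed pixel follow the text, while the mask handling and fill value are illustrative choices rather than details of the actual implementation.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Extract the foreground from a metric height map (CV_32FC1, values in metres)
// by flood filling from the centre pixel with a 5 cm connectivity threshold.
// Returns a binary mask where foreground pixels are set to 255.
cv::Mat extractForeground(const cv::Mat& heightMap)
{
    CV_Assert(heightMap.type() == CV_32FC1);

    // The mask passed to cv::floodFill must be 2 pixels larger in each dimension.
    cv::Mat mask = cv::Mat::zeros(heightMap.rows + 2, heightMap.cols + 2, CV_8UC1);
    const cv::Point seed(heightMap.cols / 2, heightMap.rows / 2);

    // Neighbouring pixels within +-5 cm of each other are treated as the same
    // surface. FLOODFILL_MASK_ONLY leaves the height map untouched and writes
    // the fill value (255, encoded in bits 8-15 of the flags) into the mask.
    cv::Mat work = heightMap.clone();  // floodFill takes a non-const image
    cv::floodFill(work, mask, seed, cv::Scalar(), nullptr,
                  cv::Scalar(0.05), cv::Scalar(0.05),
                  4 | cv::FLOODFILL_MASK_ONLY | (255 << 8));

    // Remove the 1-pixel border so the mask matches the height map size.
    return mask(cv::Rect(1, 1, heightMap.cols, heightMap.rows)).clone();
}
```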

4.4 Reducing Dimensionality

As mentioned earlier, there are benefits to working with 2-D data instead of 3-D data. Many algorithms only work on 2-D data, and the question when starting from 3-D data is what data to use and how.

4.4.1 Create a Height Map

The height map, or depth map as some would call it, is what defines a depth camera. A height map is essentially the same as the depth map shown in figure 4.1, usually with the only difference being opposite signs for the distance values, and is basically what a depth camera provides through its 2-D sensor. What is really interesting is that this map can be manipulated to enhance certain features while preserving the array structure. This is because the 3-D point cloud keeps its organized property after filters or transformations have been applied. Formally, the reason is that transformations and filters only manipulate the values of the data, not the location of the data itself, i.e. the memory location. It is therefore possible to apply all kinds of operations to the 3-D data and then trivially convert the result back into a 2-D height map.
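A minimal sketch of going from an organized cloud back to a 2-D height map is shown below. Storing the negated z value as "height" is an illustrative convention and not necessarily the sign convention used in the thesis.

```cpp
#include <opencv2/core.hpp>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// Convert an organized point cloud back into a 2-D height map. Because the
// cloud keeps its 320x240 grid layout, each point maps directly to one pixel.
// Invalid (NaN) points simply propagate as NaN values in the map.
cv::Mat toHeightMap(const pcl::PointCloud<pcl::PointXYZRGB>& cloud)
{
    cv::Mat heightMap(static_cast<int>(cloud.height),
                      static_cast<int>(cloud.width), CV_32FC1);
    for (int row = 0; row < heightMap.rows; ++row)
        for (int col = 0; col < heightMap.cols; ++col)
            // Negate the depth so that larger values mean "higher" (closer).
            heightMap.at<float>(row, col) = -cloud.at(col, row).z;
    return heightMap;
}
```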
