
Object Recognition in 3D Laser Radar Data

using Plane Triplets

Björn Johansson and Anders Moe

October 7, 2005

Technical report LiTH-ISY-R-2708, ISSN 1400-3902

Computer Vision Laboratory, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden

bjorn@isy.liu.se, moe@isy.liu.se

Abstract

This report describes a method to detect and recognize objects in 3D laser radar data. The method is based on local descriptors computed from triplets of planes that are estimated from the data set. Each descriptor computed on query data is compared with descriptors computed on object model data to obtain a hypothesis of object class and pose. A hypothesis is then either verified or rejected using a similarity measure between the model data set and the query data set.


Contents

1 Introduction
2 Problem description
  2.1 Definitions
  2.2 Experiment data
3 Solution overview
4 Preprocessing
  4.1 Simulation of model data points
  4.2 Preprocessing of target data
    4.2.1 Estimation of the ground plane
    4.2.2 Removal of tree-trunks
    4.2.3 Removal of small objects
5 Object recognition
  5.1 Local plane estimation
  5.2 Finding dominant planes
  5.3 Plane-triplet matching
  5.4 Hypothesis verification
6 Results
7 Comments
  7.1 Improvements
  7.2 Another idea: Local interest points
    7.2.1 Harris points in 3D
    7.2.2 Local points of intersection

1 Introduction

This report describes a method to detect and recognize objects in 3D laser radar data. The method was developed within a graduate course project supervised by Christina Grönwall and Jörgen Ahlberg at FOI in Linköping. The main class of objects was in this case military vehicles. The problem is described further in section 2, and a solution overview is given in section 3. The remaining sections describe the solution and results in more detail.

2 Problem description

2.1 Definitions

• Let x = (x, y, z) denote the 3D coordinate system, with x being the depth and z the height (do not confuse the vector x with the element x).

• Each data set of laser radar measurements consists of a set of 3D points {x_k}, k = 1, ..., K, which will be referred to as a target. See figure 2 for an example.

• Each vehicle is described by a thread model (see figure 1), simply referred to as a model.

• A plane will refer to the plane model

0 = ax + by + cz + d = (x^T, 1) (a, b, c, d)^T = (x^T, 1) p ,   (1)

which will be used in several parts of the solution.

It may be possible to use the slightly simpler, not fully general, plane model x = ay + bz + c for the targets, considering the properties of the data (but not for the models). This is not explored further here.

2.2 Experiment data

The main class of objects was in this case military vehicles. A thread model was available for a number of vehicles, shown in figure 1. There are six models, but only two of them (MTLB and T72) are present in the target data.

Nine sets of target data were available, generated from a laser radar. It is somewhat difficult to visualize a set of 3D points well in a single still image, without being able to rotate the data set. Figure 2 shows the 3D points in target set trg1pos2 with color-coded height. Figures 3, 4, and 5 show the nine sets of 3D points in the (y, z)-plane, with a color proportional to the depth value x (dark being further away). The figures also state the correct model for each target (in parenthesis). Note that target 1 contains a box on the

Figure 1: Thread models of the six vehicles: BTR80, hum-tow, Leopard, MTLB, T72, and T80.


Figure 2: Target set trg1pos2.

Figure 3: Target sets trg1pos2 (MTLB), trg2pos4 (T72), and trg3pos6 (MTLB).

Figure 4: Target data at forest, and corresponding models: trg1pos1 (MTLB), trg2pos3 (T72), and trg3pos5 (MTLB).

Figure 5: Target sets trg1pos (MTLB), trg2pos (T72), and trg3pos (MTLB).

that the model does not accurately describe the vehicle. It would probably be better to use model data that is generated in the same way as the target data, i.e. data from a laser radar.

3 Solution overview

1. Preprocess the target data to remove uninteresting points: estimate and remove the ground plane, remove tree-trunks, and remove small objects.

2. Estimate local planes.

3. Find dominant planes using a clustering procedure.

4. Compute local descriptors, in this case combinations of three planes (denoted plane triplets).

5. Find matching descriptors between target data and model data. Each match gives a hypothesis of object view and position.

6. Verify the hypotheses using the remaining data.

4 Preprocessing

4.1 Simulation of model data points

The obtained description of the appearance of the vehicles was a set of thread models, see figure 1. It would be possible to match the features used in section 5.1 directly with the thread models. However, to make the two data sets as similar as possible, range points were simulated for the models. This was done by calculating a number of “range images” from the left, right, front, back, and above the vehicle. Orthographic projection was used. The first intersection points between the projection lines from all the different views and the planes in the thread model were calculated, see figure 6.

4.2 Preprocessing of target data

The following preprocessing is proposed to reduce the number of data points:

• Estimate the ground plane and remove points close to that plane.

• Locate and remove points on tree-trunks by projecting the points onto the ground.

• Remove small objects by projecting the points in the depth direction.

4.2.1 Estimation of the ground plane

The ground plane estimation is done iteratively by placing a horizontal plane below the samples and then pushing it upwards with a force F. Points below the plane apply a downward force on the plane. The plane normal is updated in each step by assuming that the plane has a moment of inertia and calculating the moment applied by the points. The iteration is terminated when the plane has stabilized.

Points closer than 0.2m to the estimated plane are then removed. The vehicles are assumed to be lower than 2.5m, so points further away from the ground are also removed. An example of the samples before and after the removal is shown in figure 7.

This estimation and removal is done globally. However, for samples over a larger region, or for faster-varying ground, it should probably be done on local parts of the samples, or with a more complex ground model.
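As a very reduced sketch of the force-balancing idea above (not the report's implementation: only the plane height is iterated, the normal update and moment-of-inertia step are omitted, and each point below the plane is assumed to push back with one unit of force):

```python
import numpy as np

def estimate_ground_height(points, F=500.0, step=0.01):
    """Raise a horizontal plane from below the samples until the
    counter-force from the points below it balances the pushing
    force F (one unit of force per point below the plane)."""
    z = np.sort(points[:, 2])
    h = z[0] - step                      # start below all samples
    while np.count_nonzero(z < h) < F:   # net force still upward
        h += step
    return h

# Synthetic flat ground with Gaussian height noise:
rng = np.random.default_rng(0)
ground = rng.normal(0.0, 0.05, size=(1000, 3))
h = estimate_ground_height(ground, F=500)   # stops near the median height
```

With the full normal update, the plane would additionally tilt toward the locally lowest samples instead of staying horizontal.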

4.2.2 Removal of tree-trunks

The positions of the tree-trunks are estimated by first projecting the samples in the height direction. Histograms with overlapping bins are then computed from the projected samples, and these are filtered with a difference of Gaussians (DoG) filter. Local maxima within a 3×3 region with a value larger than a threshold are then assumed to be indications of tree-trunks. Samples closer than 0.25m in the horizontal direction are removed. An example is shown in figure 8.
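A minimal sketch of this projection-and-histogram step (the bin size, DoG widths and threshold below are illustrative stand-ins, and plain bins are used instead of the report's overlapping ones):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def trunk_positions(points, cell=0.25, thresh=5.0):
    """Project the samples onto the ground, histogram the (x, y)
    positions, filter with a difference of Gaussians, and keep
    thresholded 3x3 local maxima as trunk candidates."""
    x, y = points[:, 0], points[:, 1]
    bx = np.arange(x.min() - cell, x.max() + 2 * cell, cell)
    by = np.arange(y.min() - cell, y.max() + 2 * cell, cell)
    hist, _, _ = np.histogram2d(x, y, bins=(bx, by))
    dog = gaussian_filter(hist, 1.0) - gaussian_filter(hist, 2.0)
    peak = (dog == maximum_filter(dog, size=3)) & (dog > thresh)
    ij = np.argwhere(peak)
    # return the centres of the winning bins
    return np.column_stack((bx[ij[:, 0]] + cell / 2, by[ij[:, 1]] + cell / 2))

rng = np.random.default_rng(1)
trunk = rng.normal([0.0, 0.0, 1.0], [0.03, 0.03, 0.6], size=(300, 3))
clutter = rng.uniform([-5, -5, 0], [5, 5, 2], size=(200, 3))
centres = trunk_positions(np.vstack([trunk, clutter]))
```

A dense vertical column of samples produces a sharp histogram peak and a large DoG response, while spread-out clutter does not.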

Figure 6: Simulated range points for the models BTR80, hum-tow, Leopard, MTLB, T72, and T80.

Figure 7: The sample points before (52079 points) and after (20238 points) removing samples close to the ground plane and samples higher than 2.5m from the ground.

Figure 8: The sample points before (20238 points) and after (14170 points) removing samples close to the detected tree-trunks.

Figure 9: The sample points before (14170 points) and after (9262 points) removing small object samples.

4.2.3 Removal of small objects

To remove small objects, like leaves and branches, the samples are projected in the depth direction. The positions of the projected samples are quantized, and in each discrete position the sample with the largest depth d is found. This gives an “image” d_max. For each position in d_max, the minimum in a 3×3 region is found, giving d_min. Samples in each discrete position (k, l) with d < d_min(k, l)(1 − 1/Q) are removed, where Q is the quantization step. This removes samples which have more distant samples in all neighboring regions, and thus small objects, see figure 9.
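The d_max/d_min test can be sketched as follows. Note that this is a simplified variant rather than the report's exact rule: the relative (1 − 1/Q) factor is replaced by an absolute depth margin, empty neighbour cells are treated as infinitely distant, and the 3×3 minimum is taken over the eight neighbours only:

```python
import numpy as np
from scipy.ndimage import minimum_filter

def remove_small_objects(points, Q=0.2, margin=0.5):
    """Drop samples whose depth is clearly in front of the farthest
    sample in every neighbouring (y, z) cell."""
    x, y, z = points.T
    iy = ((y - y.min()) / Q).astype(int)
    iz = ((z - z.min()) / Q).astype(int)
    dmax = np.full((iy.max() + 1, iz.max() + 1), -np.inf)
    np.maximum.at(dmax, (iy, iz), x)         # farthest depth per cell
    dmax[np.isinf(dmax)] = np.inf            # empty cells: infinitely far
    ring = np.ones((3, 3), bool)
    ring[1, 1] = False                       # eight neighbours only
    dmin = minimum_filter(dmax, footprint=ring, mode="constant", cval=np.inf)
    small = dmin[iy, iz] > x + margin        # all neighbours are farther away
    return points[~small]

# A wall at depth 10 and one isolated "leaf" floating in front of it:
yy, zz = np.meshgrid(np.arange(0, 2.1, 0.1), np.arange(0, 2.1, 0.1))
wall = np.column_stack([np.full(yy.size, 10.0), yy.ravel(), zz.ravel()])
leaf = np.array([[5.0, 1.0, 1.0]])
kept = remove_small_objects(np.vstack([wall, leaf]))
```

The isolated point is removed because every neighbouring cell sees only the more distant wall, while the wall samples survive.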

5 Object recognition

5.1 Local plane estimation

The 3D space is divided into local overlapping regions, and a local plane is estimated in each region. We will here describe some details.

Assume that we have i = 1, . . . , I local regions, and let c_i denote the center of local region i. Furthermore, let w(x − c_i) denote a local weight function for region i (for example a Gaussian or a cos² function). The plane parameters for region i are then found as

p_i = arg min_p Σ_k w(x_k − c_i) ( (x_k^T, 1) p )² .   (2)

The expression on the right hand side can be rewritten as

p^T T_i p ,  where  T_i = Σ_k w(x_k − c_i) (x_k^T, 1)^T (x_k^T, 1) .   (3)

The 4×4 matrix T_i is computed for every local region i, and the plane parameters p_i are found as the eigenvector of T_i corresponding to the smallest eigenvalue. Even though it may be obvious, we note that each plane will be



Figure 10: Example of local plane estimation. (a) 3D points. (b) Local planes estimated from the points in (a). Each local plane is visualized by a circular disc, and the pegs point out the region centers. The figure contains about 600 local planes.

described in a common global coordinate system. We can therefore cluster the parameter vectors to find larger planes, as described in section 5.2.

On the implementational side, the matrices T_i can be computed fairly efficiently in Matlab, assuming that the weight w is Cartesian separable and that the local region centers c_i are regularly placed in a Cartesian grid. For example, we have that

(T_i)_{12} = Σ_k w(x_k − c_i) x_k y_k   (4)
           = Σ_k w(x_k − c_{i1}) x_k · w(y_k − c_{i2}) y_k · w(z_k − c_{i3}) .   (5)

We thus need to compute w(x_k − c_{i1}) x_k, w(y_k − c_{i2}) y_k, and w(z_k − c_{i3}) for every point k and every region i. This can be implemented by Kronecker products and summation of columns of three (sparse) matrices containing the values w(x_k − c_{i1}) x_k, w(y_k − c_{i2}) y_k, and w(z_k − c_{i3}) respectively. The other elements in T_i can be computed in a similar manner.

Figure 10 shows an example of local plane estimation. We compute a certainty measure for each plane using a combination of the smallest eigenvalue of T_i and a function of the number of points used in the estimation. Local planes with a small certainty have been removed.
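The fit in (2)-(3) can be sketched for a single region as follows (a Gaussian weight is assumed, and the function name is hypothetical):

```python
import numpy as np

def fit_local_plane(points, center, sigma=0.5):
    """Weighted plane fit: build the 4x4 moment matrix T_i over
    homogeneous points (x, y, z, 1) and return the eigenvector of
    the smallest eigenvalue as p = (a, b, c, d), with unit normal."""
    w = np.exp(-np.sum((points - center) ** 2, axis=1) / (2 * sigma ** 2))
    X = np.hstack([points, np.ones((len(points), 1))])
    T = (X * w[:, None]).T @ X
    _, vecs = np.linalg.eigh(T)          # symmetric: eigh, ascending order
    p = vecs[:, 0]                       # smallest eigenvalue
    return p / np.linalg.norm(p[:3])     # normalize the normal (a, b, c)

# Points on the plane z = 1:
g = np.linspace(-1, 1, 5)
xx, yy = np.meshgrid(g, g)
pts = np.column_stack([xx.ravel(), yy.ravel(), np.ones(xx.size)])
p = fit_local_plane(pts, center=np.array([0.0, 0.0, 1.0]))
# p satisfies a*x + b*y + c*z + d = 0 with (a, b) ~ 0 and d = -c
```

For noise-free points the smallest eigenvalue is zero and the fit is exact; for noisy data the eigenvalue itself can serve as the certainty measure mentioned above.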

5.2 Finding dominant planes

Before the estimation of the dominant planes, the local planes which are close (within 0.4m) and nearly parallel to the estimated ground plane are removed, to reduce the number of planes belonging to the ground. Each local plane is then represented by its normal vector, scaled to point from the origin to the plane, and these vectors are clustered to find the dominant planes.


Figure 11: Dominant planes, illustrated by green circular discs, that are found from the local planes in figure 10(b).

Some other representation of the planes should probably be selected, since the one used favors planes close to the origin. However, no better low-dimensional representation has yet been found.

Figure 11 shows an example of the method using the local planes in figure 10.

5.3 Plane-triplet matching

We use combinations of three dominant planes, denoted plane-triplets or simply triplets, as local (or at least semi-local) descriptors of the object. A plane-triplet from the target data is compared to a plane-triplet from the model data to give a hypothesis of object view and position, by computing the rotation and translation between the two triplets. Each hypothesis is verified using the remaining dominant planes, as described in section 5.4.

We wish to use the triplets to find a transformation relating a point x_t in the target data to the corresponding point x_m in the model data, as

x_t = R x_m + t ,   (6)

where R is a rotation matrix and t is a translation vector. The corresponding relation between the plane vectors becomes

p_m ∥ [ R^T  0 ; t^T  1 ] p_t ,   (7)

where ∥ means ‘parallel to’. Let {p_t1, p_t2, p_t3} denote three dominant planes from the target data, and {p_m1, p_m2, p_m3} denote three dominant planes from the model data. Assume that the correspondence is known, i.e. that target plane p_ti corresponds to model plane p_mi.

First, the rotation R is found by comparing the plane normal vectors. The normal vector, n, of a plane corresponds to the first three elements of p. Let N̂_t and N̂_m denote matrices containing the normalized plane normal vectors of the target triplet and the model triplet respectively. We wish to find a rotation matrix such that N̂_t = R N̂_m (we have assumed that the sign of the normal vectors is known). This can only be solved approximately in practice. Let U S V^T = N̂_t (N̂_m)^{-1} denote the SVD (singular value decomposition) of the matrix N̂_t (N̂_m)^{-1}. The rotation matrix is then computed as

R = U V^T .   (9)

Second, we find the translation t. A point of intersection is computed for the target triplet and the model triplet respectively. The translation is then found as the difference between these two points, after compensation for the rotation. Let

N = (n_1 n_2 n_3)^T  and  b = −( (p_1)_4, (p_2)_4, (p_3)_4 )^T .   (10)

A point of intersection, denoted v, is computed as

v = N^{-1} b .   (11)

The translation vector is finally computed as

t = v_t − R v_m .   (12)
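Assuming known correspondences and normal signs, the pose recovery in (9)-(12) can be sketched as follows (plane vectors p = (a, b, c, d) with unit normals; the function names are hypothetical):

```python
import numpy as np

def triplet_pose(planes_t, planes_m):
    """Rotation from the SVD of N_t N_m^{-1} (eq. 9) and translation
    from the triplet intersection points (eqs. 10-12).  planes_* are
    3x4 arrays whose rows are plane vectors with unit normals."""
    Nt, Nm = planes_t[:, :3].T, planes_m[:, :3].T    # columns = normals
    U, _, Vt = np.linalg.svd(Nt @ np.linalg.inv(Nm))
    R = U @ Vt                                        # eq. (9)

    def intersection(planes):                         # eqs. (10)-(11)
        return np.linalg.solve(planes[:, :3], -planes[:, 3])

    t = intersection(planes_t) - R @ intersection(planes_m)   # eq. (12)
    return R, t

# Model triplet: the planes x = 1, y = 2, z = 3.
pm = np.array([[1., 0, 0, -1], [0, 1, 0, -2], [0, 0, 1, -3]])
c, s = np.cos(0.5), np.sin(0.5)
R_true = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1.]])   # rotation about z
t_true = np.array([1.0, 2.0, 0.1])
# Transform the planes into the target frame: n_t = R n_m, d_t = d_m - n_t . t
nt = pm[:, :3] @ R_true.T
pt = np.column_stack([nt, pm[:, 3] - nt @ t_true])
R_est, t_est = triplet_pose(pt, pm)
```

For exactly corresponding triplets the SVD step returns the true rotation, and the intersection points differ by exactly R v_m + t.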

Unfortunately, we do not have the plane correspondences that we assumed in the beginning. For example, it may very well be that target plane p_t1 corresponds to model plane p_m2 instead of model plane p_m1, etc. The signs of the normal vectors are not known either, so that for example n̂_t = −R n̂_m instead of n̂_t = R n̂_m. We have to compute a hypothesis for each possible combination of correspondences and signs, which gives in total ...

However, many of the combinations can be removed for various reasons:

• The “shape” of the target plane triplet and the model plane triplet should be similar. The correspondence error between a target plane triplet and a model plane triplet is computed using some ad hoc measure based on (7).

• The rotation should mainly be around the z-axis, because the objects are placed on fairly horizontal ground.

The two items above are combined to give a certainty measure for each hypothesis, and the hypotheses with the lowest certainty are removed. The remaining hypotheses are verified by the procedure in section 5.4.
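The raw enumeration of correspondence and sign combinations can be sketched as follows (for a triplet this gives 3! · 2³ = 48 candidates before the pruning described above):

```python
from itertools import permutations, product

def candidate_assignments(n=3):
    """Enumerate every combination of plane correspondences (a
    permutation of the triplet) and normal signs (one sign flip per
    plane): n! * 2^n candidates for a triplet of n planes."""
    return [(perm, signs)
            for perm in permutations(range(n))
            for signs in product((1, -1), repeat=n)]

cands = candidate_assignments()
# each entry, e.g. ((0, 1, 2), (1, 1, -1)), selects one hypothesis to test
```

Each candidate would then be fed to the rotation/translation computation, and most are discarded by the shape and z-axis checks.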

5.4 Hypothesis verification

From the matching of the dominant planes in section 5.3 we obtained a number of hypotheses of transformations between the target and model data sets. For each hypothesis, we transform the local planes belonging to the dominant planes in the target set and match these with the local planes belonging to the dominant planes in the model set. The matching is done by, for each local plane from the target, finding the “closest” local plane from the model. The similarity measure used was

S = Σ_{k=1}^{N} exp( −d(k)² / (2σ²) ) ,   (13)

where N is the number of target planes and d(k) is the minimum distance

d(k) = min cl∈D p do(cl, pk)2+ da(pl, pk)2, (14) where D = d(cl, ck) < maxd, (15) and where do(cl, pk) is the distance from the center point of the model plane

l to the target plane k, da(pl, pk) is the angular distance between the model plane l and the target plane k and d(cl, ck) is the distance between the center points. σ and maxd are both set to 0.5.
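The score (13)-(15) can be sketched as follows. Each plane is represented here by its centre point and unit normal, with d_o taken as a point-to-plane distance and d_a as the angle between the normals; these are assumptions about details the report leaves open:

```python
import numpy as np

def similarity(target, model, sigma=0.5, maxd=0.5):
    """Sum of exp(-d(k)^2 / (2 sigma^2)) over the target planes,
    where d(k) is the smallest combined offset/angle distance to a
    model plane whose centre lies within maxd (eqs. 13-15)."""
    S = 0.0
    for ck, nk in target:
        best = np.inf
        for cl, nl in model:
            if np.linalg.norm(cl - ck) >= maxd:      # eq. (15)
                continue
            do = abs(np.dot(nk, cl - ck))            # centre-to-plane offset
            da = np.arccos(np.clip(abs(np.dot(nk, nl)), 0.0, 1.0))
            best = min(best, np.hypot(do, da))       # eq. (14)
        if np.isfinite(best):
            S += np.exp(-best ** 2 / (2 * sigma ** 2))   # eq. (13)
    return S

planes = [(np.array([0., 0, 0]), np.array([0., 0, 1])),
          (np.array([2., 0, 0]), np.array([1., 0, 0]))]
score = similarity(planes, planes)   # identical sets: one unit per plane
```

A perfectly aligned hypothesis thus contributes one unit per target plane, and the score degrades smoothly with misalignment.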

6 Results

The similarity measure S in (13) is computed for each hypothesis, and the maximum is stored as the similarity measure between the target and the model. This measure is computed for each model, giving an indication of how similar the target is to each of the models. Figure 12 shows this measure for the different targets and models. The correct models for the different targets are 4, 4, 4, 5, 5, 5, 4, 4, 4. The models and targets are listed below:

Model no.  Model description
1          btr80
2          hum tow
3          leopard
4          mtlb
5          t72
6          t80

Target no.  Target description
1           trg1 pos1
2           trg1 pos2
3           trg1 pos
4           trg2 pos3
5           trg2 pos4
6           trg2 pos
7           trg3 pos5
8           trg3 pos6
9           trg3 pos

The mapped model points belonging to the model with the highest similarity measure, together with the preprocessed target points, are plotted in figure 13. An obvious conclusion is that the classification performance of the method is not very good. However, it seems to be possible to detect vehicles, and also to estimate the orientation of the vehicles, except for targets 8 and 9, and apart from confusions between front and rear. It should however be pointed out that very little time has been


Figure 12: Similarity measure for the different targets and models. A large circle indicates a high similarity measure.

spent on parameter tuning and other improvements, so it might be possible to improve the results just by changing some parameters. For example, the estimated dominant planes seem to be somewhat unreliable, which affects the subsequent matching and verification. The reason for the poor estimation is not yet known.

Another thing that would probably improve the results would be to use some other model data, since some of the obtained thread models differed quite a lot from the target data.

7 Comments

7.1 Improvements

Due to the limited amount of time, many of the steps involved are somewhat ad hoc and most of the method parameters have not been tuned. We feel that it should be possible to improve the performance, and mention some ideas below:

• As mentioned before, it is probably better to use model (training) data that better resembles the target data acquisition.

• The ground has been detected using the plane model (1). The ground detection would probably improve if the plane model was applied more locally, or if a more complex model (e.g. a second order polynomial) was chosen.


Figure 13: The mapped model points (red) together with the preprocessed target points (blue).


• The tree-trunks were detected based on local density maxima. Detection based also on shape would probably improve the result, since local density maxima may also occur on the vehicle (for example if the vehicle is only fragmentarily visible due to occlusion by several trees).

• The estimation of dominant planes seems to be too strongly influenced by outliers. This could for example mean that the clustering procedure has to be more local. A better choice of clustering parameters, or a subsequent refinement, may improve the estimation.

• The choices of triplet matching, verification, and certainty measures may not be fully thought out. Better choices may improve the performance.

7.2 Another idea: Local interest points

In the end we chose another solution than was originally intended. We initially intended to use somewhat more local descriptors, computed from local interest points. However, we felt that these local features would be less accurate, especially for some of the more complicated data sets. More local descriptors also require a more detailed model than coarse descriptors do. This puts a higher demand on the accuracy of the model data, in this case the thread models. It would probably be better to use model (training) data collected from the same device as is used in practice, i.e. true laser range data taken from different views of the object.

This decision to change approach was based on intuition rather than facts, and may have been wrong. We therefore give some details from the original approach here, without presenting any results. Also, the method of using more general local descriptors rather than assuming the existence of large planes should be more suitable for generic object recognition.

7.2.1 Harris points in 3D¹

The Harris operator [2] is a well-known interest point operator for 2D images that gives a high response near corners and other local 2D structures. The Harris value is computed from a 2D orientation tensor, for example the gradient (structure) tensor

T = Σ_x w(x) ∇I(x) ∇I^T(x) ,   (16)

where ∇I is the image gradient and w is a weight function. In 2D, the Harris value is then computed as

H_2D = det(T) − α trace²(T) ,   (17)

¹We wish to thank Klas Nordberg for discussions regarding the Harris operator and its generalizations.


Figure 14: Isocurves of two rank measures. a) det(T) = λ₁λ₂. b) Harris value λ₁λ₂ − 0.04(λ₁ + λ₂)². Note that the asymptotes in a) go towards the axes, as opposed to the asymptotes in b).

where α is often heuristically chosen as 0.04. The Harris value can be expressed in terms of the eigenvalues of T as

H = λ₁λ₂ − α(λ₁ + λ₂)² .   (18)

The Harris value is basically a rank measure of the orientation tensor and will give a high value for intrinsically two-dimensional signals such as corners. Another, simpler, measure of rank is det(T) = λ₁λ₂. Figure 14 shows isocurves of the two rank measures. We see that the determinant of T can be high when λ₁ is very high and λ₂ small, which corresponds to almost intrinsically one-dimensional signals with high contrast. This is undesired, and is avoided by the Harris rank measure.

We use a slightly different expression of the Harris operator in 3D:

H_3D = det(T) − α trace³(T) .   (19)

It is not clear whether this expression is the best generalization, or if there exist other generalizations in the literature.

The tensor T is in practice computed using the local planes in section 5.1. The output from the local plane estimation can be interpreted as a regularly sampled 3D volume with a plane parameter vector in each voxel. We can therefore use ordinary (separable) convolution operations to implement the weighted summation. We compute a gradient from each plane vector and use (16) and (19) to compute a tensor and a Harris value in each region. The Harris peaks are found by 3D non-max suppression, and low peaks are removed. Figure 15 shows a simple example with three orthogonal planes.
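A sketch of (19) on such a voxel grid (box averaging stands in for the weight function w, and the per-voxel gradient vectors are assumed given):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def harris_3d(grad, alpha=0.04):
    """grad: (X, Y, Z, 3) array of gradient vectors.  Average the
    outer products g g^T over a 3x3x3 box and return the Harris
    value det(T) - alpha * trace(T)^3 per voxel (eq. 19)."""
    T = np.einsum('...i,...j->...ij', grad, grad)       # tensor field
    T = uniform_filter(T, size=(3, 3, 3, 1, 1))          # local averaging
    detT = np.linalg.det(T)
    trT = np.trace(T, axis1=-2, axis2=-1)
    return detT - alpha * trT ** 3

# Flat volume (all normals along z) with two slabs of other directions,
# so one voxel's neighbourhood mixes all three directions:
g = np.zeros((5, 5, 5, 3))
g[..., 2] = 1.0
g[1] = [1.0, 0.0, 0.0]
g[2] = [0.0, 1.0, 0.0]
H = harris_3d(g)
# The mixed-direction voxel scores higher than the purely planar ones.
```

A voxel whose neighbourhood mixes all three gradient directions gets a full-rank tensor and thus a larger H than a voxel inside a single plane.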


Figure 15: Simple example (illustration) of Harris point detection and finding point of intersection. The region contains points from three orthogonal planes. Red dot: Harris peak. Red line: Line going from the Harris peak to a close local point of intersection.

7.2.2 Local points of intersection

As indicated by figure 15, the Harris peaks are usually located somewhere inside the corner rather than at the actual corner point. If so desired, it is fairly easy to find a local point of intersection between the planes near the Harris peak. But we emphasize that the point of intersection is not necessarily more stable than the Harris peak, and it is not clear which point is best suited for computation of local object descriptors.

The estimated point of intersection in region i, denoted v_i, is computed as

v_i = arg min_v Σ_k w(c_k − c_i) ( p_k · (v^T, 1)^T )² .   (20)

As before, c_k denotes the region centers, and w denotes a local weight function (not necessarily the same as in section 5.1). If no point of intersection exists for the planes in the region, v_i will approximately be the point which, in a sense, is closest to all planes in the region. It is fairly easy to see that the solution can be found from the eigenvector corresponding to the smallest eigenvalue of the matrix Σ_k w(c_k − c_i) p_k p_k^T.

Figure 15 shows a simple example, where both the Harris peak and the point of intersection are depicted. Figure 16 shows another, more realistic, example for one of the vehicle models. The number of interest points depends on the object and on the size of the weight function w.
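A sketch of (20): the dehomogenization of the smallest eigenvector is exact when the planes do intersect in a point, and the names below are hypothetical:

```python
import numpy as np

def local_intersection(planes, centers, ci, sigma=1.0):
    """Eigenvector of the smallest eigenvalue of sum_k w_k p_k p_k^T,
    dehomogenized so that the last component is 1 (eq. 20).
    planes: (K, 4) plane vectors, centers: (K, 3) region centres."""
    w = np.exp(-np.sum((centers - ci) ** 2, axis=1) / (2 * sigma ** 2))
    M = (planes * w[:, None]).T @ planes
    _, vecs = np.linalg.eigh(M)
    v = vecs[:, 0]                 # smallest eigenvalue
    return v[:3] / v[3]

# Three orthogonal planes x = 1, y = 2, z = 3 meeting in (1, 2, 3):
P = np.array([[1., 0, 0, -1], [0, 1, 0, -2], [0, 0, 1, -3]])
C = np.tile([1.0, 2.0, 3.0], (3, 1))
v = local_intersection(P, C, ci=np.array([1.0, 2.0, 3.0]))
```

For planes that only approximately intersect, this returns the weighted least-squares compromise point described in the text.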


Figure 16: Example of Harris point detection and local points of intersection. The same 3D plot is shown from two different views. Red dots: Harris peaks. Red lines: Lines going from each Harris peak to a close local point of intersection.

References

[1] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, August 1995.

[2] C. G. Harris and M. Stephens. A combined corner and edge detector. In Proc. 4th Alvey Vision Conference, pages 147–151, 1988.
