
Tree Species Classification Using Terrestrial Photogrammetry

Jakob Boman

June 8, 2013

Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Niclas Börlin

Examiner: Fredrik Georgsson

Umeå University

Department of Computing Science
SE-901 87 UMEÅ

SWEDEN


Abstract

This thesis investigates how texture classification can be used to automatically classify tree species from images of bark texture. The texture analysis methods evaluated in the thesis are the grey level co-occurrence matrix (GLCM), two different wavelet texture analysis methods, and the scale-invariant feature transform (SIFT). To evaluate the methods, two classifiers, a linear support vector machine (SVM) and a kernel based import vector machine (IVM), were used. The tree species that were classified were Scotch Pine and Norway Spruce, together with the auxiliary class ground.

Three experiments were conducted to test the methods. The experiments used sub-images of bark extracted from terrestrial photogrammetry images. For each sub-image, the X, Y and Z coordinates were available. The first experiment compared the methods by classifying each sub-image individually based on image data alone. In the second experiment the spatial data was added. Additionally, feature selection was performed in both experiments to determine the most discriminating features. In the final experiment individual trees were classified by clustering all data from each tree.

For sub-image classification, the addition of spatial data increased the overall accuracy of the best method from 75.7% to 94.9%. The best method was IVM on GLCM textural features.

The most discriminating textural feature was homogeneity in the horizontal direction. The best method for classifying individual trees was SVM on GLCM features, with an overall accuracy of 88%.

In summary, the methods were found to be promising for tree bark classification. However, the individual tree results were based on a low number of trees. To establish the methods' true usefulness, testing on a larger number of trees is necessary.


Contents

1 Introduction 1

1.1 Background . . . 1

1.2 Aim . . . 1

1.2.1 Tools . . . 2

1.3 Related Work . . . 2

1.3.1 Tree species classification . . . 2

1.3.2 Texture analysis methods . . . 2

1.4 Outline of thesis . . . 3

2 Theory 5

2.1 Texture Classification . . . 5

2.2 Texture Analysis . . . 5

2.2.1 Grey Level Co-occurrence matrix . . . 5

2.2.2 Wavelet . . . 7

2.2.3 Scale-Invariant Feature Transform . . . 10

2.3 Classification . . . 12

2.3.1 Support vector machines . . . 13

2.3.2 Import vector machines . . . 17

3 Method 21

3.1 Dataset . . . 21

3.2 Compared feature analysis methods . . . 22

3.3 Performance Evaluation . . . 23

3.4 Parameter Selection . . . 23

3.5 Code . . . 23

4 Experiment and Results 25

4.1 Experiment 1 . . . 25

4.1.1 Reduced feature vectors . . . 26

4.2 Experiment 2 . . . 28

4.2.1 Reduced feature vectors . . . 28


4.3 Experiment 3 . . . 29

5 Conclusion 35

5.1 Discussion . . . 35

5.2 Limitations . . . 36

5.3 Future work . . . 36

6 Acknowledgements 37

References 39


Chapter 1

Introduction

1.1 Background

To be able to perform management planning on a forest, the industry needs data such as species, stem diameter, height and volume for the trees. The data is currently gathered manually on plots of 10 metre diameter in the forest. From these measurements, the growth of the rest of the forest can be estimated. The measurements are done by hand and are labour intensive. This thesis looks into using terrestrial photogrammetry to automatically classify the tree species on the plots.

1.2 Aim

The aim of the thesis is to evaluate methods for classifying tree species from terrestrial images of a living forest. The thesis will primarily focus on classification based on bark texture features. The tree species of interest are pine, spruce and deciduous trees¹. For better classification results it could be advantageous to add a number of auxiliary classes, e.g., sky, bush and leaves.

Two classifiers, support vector machine and import vector machine, will be tested and used for classification. The classification will be performed on image data that was previously gathered at the Remningstorp forest test area in southern Sweden. In addition to the image data, the object-camera distance is also available and may be used for better results.

The project is divided into the following tasks.

– A study of previous related work on image classification. The study should look for relevant algorithms that could be useful for the thesis.

– Select and implement the most interesting texture analysis methods.

– Evaluate which features are the most effective at discriminating between the classes.

– Evaluate the performance of the methods.

¹Trees that lose their leaves for part of the year.


1.2.1 Tools

Matlab¹ will be used for the classification and texture analysis. Matlab is a programming language with an interactive GUI for evaluating algorithms and displaying data.

1.3 Related Work

1.3.1 Tree species classification

The methods used may be divided into aerial (Holmgren et al., 2008; Korpela et al., 2010) and terrestrial (Huang, 2006; Puttonen et al., 2011), each area studying different aspects of the tree. Aerial methods are mainly based on airborne laser scanning and aerial photogrammetry, with data collected on tree height and canopy (Holmgren et al., 2008). In contrast, when working with terrestrial data the interest is in features such as the texture, colour, and curvature of the leaf and/or bark (Wan et al., 2004; Kadir et al., 2011; Fiel and Sablatnig, 2011). Terrestrial laser scanning is mainly used as a tool to find the trees (McDaniel et al., 2012). To the author's knowledge, no method uses only terrestrial laser scanning data for classification.

As the aim of the thesis is to classify trees from bark texture, a more detailed overview of the texture analysis methods reported in the literature follows.

1.3.2 Texture analysis methods

The classification of tree species from bark properties is mainly done with texture analysis methods. Texture analysis was initially based on first- and second-order statistics of textures (Haralick et al., 1973). These methods are still used (Sharma and Singh, 2001; Wan et al., 2004).

During the last decade the research focus for tree species classification has followed two main approaches: statistical methods (Wan et al., 2004; Huang, 2006) and signal-based methods (Chi et al., 2003; Fiel and Sablatnig, 2011).

Chi et al. (2003) were among the earliest to classify tree species from bark texture. Previous work in texture analysis often used bark texture; however, the aim was not to classify different tree species but more general classes, e.g., straw and metal. Chi et al. (2003) proposed to use Gabor filter banks for the classification because of their success in earlier work. In the paper they modelled textures as multiresolution narrowband signals that are characterised by their central frequency and normalised ratio of amplitudes. They compared the multiple narrowband signal model to the standard narrowband signal model and achieved "satisfactory" results.

Wan et al. (2004) made a comparative study on the use of four statistical texture analysis methods: the grey level run-length method (RLM), the co-occurrence matrices method (COOM), the histogram method (HM) and the auto-correlation method (ACM). The results showed that using only grey images for the classification did not give good results. When colour information was added the classification rate improved significantly. The best result when using colour information was obtained with COOM, with an average recognition rate of 89%, followed by RLM with 84% and lastly HM with 80%. The classifier that yielded the best result was a 1-nearest-neighbour classifier.

Huang (2006) showed good results when combining textural and colour information to classify plants using bark images. Huang used multiresolution wavelets to extract the

¹http://www.mathworks.se/products/matlab/


colour and textural information. For the classification he used a radial basis probabilistic neural network and a support vector machine. He tested the method against "traditional" statistical texture methods like the auto-correlation method, co-occurrence matrices and the histogram method. The final experimental results were best when using the wavelet filter method with the neural net classifier.

Jiang et al. (2008) presented a multiresolution approach for classification based on statistical moments. The moments were extracted in the x and y directions at different image resolutions. The features showed better rotation invariance and noise robustness than methods like Gabor filters and wavelets.

Fiel and Sablatnig (2011) presented a method that normalised grey-scale photos. Their method involved using the scale-invariant feature transform (Lowe, 2004). For the classification they used a support vector machine. The experiments showed a classification rate of 70%. The results were compared with two experts, who achieved 57% and 78%, respectively.

1.4 Outline of thesis

– Chapter 2 contains the theory behind the texture analysis methods and classifiers used in the thesis.

– Chapter 3 contains a summary of the methods and code used for the experiments. This chapter also contains a description of the dataset used for the experiments.

– Chapter 4 is a summary of the experiments conducted and the results.

– Chapter 5 contains the discussion of the results and also a discussion of future work.


Chapter 2

Theory

2.1 Texture Classification

Texture classification can be divided into two main stages, texture analysis and classification.

During texture analysis the texture content of the images is captured with a texture analysis method, which yields a set of textural features for each image. These features are then used with a classification algorithm to classify new texture samples and match them with a known class.

2.2 Texture Analysis

2.2.1 Grey Level Co-occurrence matrix

Grey level co-occurrence matrix (GLCM) is used by several statistical texture analysis methods. Unless otherwise stated, the following description follows that of Gonzalez and Woods (2008). A GLCM is a representation of the relative positions of pixels that have the same intensity level in a grey scale image. An image is defined as a 2D function I(x, y), where the coordinates (x, y) are x = 0, 1, ..., W − 1 and y = 0, 1, ..., H − 1. The value at any pair (x, y) is the intensity or grey level of the image at that point. The number of columns and rows in the image are W and H, respectively. The grey levels in the image I(x, y) = z are z = 0, 1, ..., L − 1, where L − 1 is the maximal intensity level a pixel can have.

Given an image I and an offset O = [∆x, ∆y], where I(x, y) = z_n and I(x + ∆x, y + ∆y) = z_m, there exists a GLCM G. The element g_nm of G is the number of times a pair of pixels with intensities z_n and z_m, where 1 ≤ n, m ≤ L − 1, occurs in I with the relative position defined by O. The offset O defines the position of two pixels relative to each other. See Figure 2.1.

The offset dictates the relationship that determines the GLCM. For example, the offset [0, 1] will detect potential vertical features, whereas the offset [1, 0] will encode horizontal features. Similarly, the offset [0, 5] will detect relationships at a 5 pixel distance. Therefore, the offset is an important parameter defining the GLCM. A set of offsets is commonly used when using GLCM in image classification.

From the GLCM a number of statistical measurements can be made that can be used as features for image classification. When calculating the features it is useful to normalise the GLCM by dividing each element by the total number of pixel pairs, i.e., the sum of all elements in the GLCM. The term p_nm is then the probability that a pixel pair has the intensities z_n and z_m. Table 2.1 contains the commonly used features calculated from a GLCM (Haralick et al., 1973).

Table 2.1: Useful statistical features calculated from a GLCM. Note that a_r and a_c are the means computed along the rows and the columns of the normalised GLCM, respectively. Similarly, σ_r and σ_c are the standard deviations of the rows and the columns.

Energy        Measures how smooth the texture is. The value is 1 for a constant texture.
              Σ_{n=1}^{K} Σ_{m=1}^{K} p_nm²
Contrast      Measures the intensity difference between a pixel and its neighbour over the entire texture.
              Σ_{n=1}^{K} Σ_{m=1}^{K} (n − m)² p_nm
Homogeneity   Measures the spatial closeness of the distribution of elements in the GLCM to the diagonal.
              Σ_{n=1}^{K} Σ_{m=1}^{K} p_nm / (1 + |n − m|)
Entropy       Measures the randomness of the GLCM.
              −Σ_{n=1}^{K} Σ_{m=1}^{K} p_nm log₂ p_nm
Correlation   Measures how correlated a pixel is to its neighbour over the image.
              Σ_{n=1}^{K} Σ_{m=1}^{K} (n − a_r)(m − a_c) p_nm / (σ_r σ_c)
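As an illustration of how such features could be computed in practice, the following minimal Matlab sketch extracts a small GLCM feature vector from one sub-image. It assumes the Image Processing Toolbox functions graycomatrix and graycoprops; the random stand-in image, the offsets and the chosen feature set are illustrative only and do not reproduce the exact setup used later in the thesis.

    % Sketch: GLCM features for one grey-scale sub-image (assumes the Image
    % Processing Toolbox functions graycomatrix and graycoprops).
    I = rand(128, 128);                  % stands in for a 128x128 bark sub-image
    offsets = [0 1; -1 1; -1 0; -1 -1];  % illustrative offsets, given as [row col]
    G = graycomatrix(I, 'Offset', offsets);
    % Built-in statistics: one value per offset for each property.
    stats = graycoprops(G, {'Contrast', 'Correlation', 'Energy', 'Homogeneity'});
    % Entropy is not provided by graycoprops; compute it from the normalised GLCM.
    entropyFeat = zeros(1, size(G, 3));
    for k = 1:size(G, 3)
        P = G(:, :, k) / sum(sum(G(:, :, k)));      % normalise to probabilities p_nm
        nz = P > 0;
        entropyFeat(k) = -sum(P(nz) .* log2(P(nz)));
    end
    featureVector = [stats.Contrast, stats.Correlation, ...
                     stats.Energy, stats.Homogeneity, entropyFeat];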


Figure 2.1: The GLCM G for offset [1, 0], i.e., the pixel immediately to the right. The matrix on the left is the image I and the right matrix is the corresponding co-occurrence matrix G. Element (1,1) of G has value 1, because there is one occurrence of a pixel with intensity 1 having a pixel with intensity 1 immediately to its right. Similarly, element (3,1) of G has value 2, because there are two occurrences of a pixel with intensity 3 having a pixel with intensity 1 immediately to its right.

2.2.2 Wavelet

The simplest wavelet is the Haar wavelet, defined by the Haar transform. The transform was first introduced by Alfred Haar in 1910 (Haar, 1910). The term wavelet was introduced to signal processing in the 1970s and 1980s (Mallat, 1989). The usefulness of wavelets stems from their ability to represent a number of different functions. An important property is that wavelets process signals at different resolutions. This becomes important because features that are undetectable at a higher resolution may be easily found at a lower resolution. Note that for different areas of research, e.g., compression and image classification, different aspects of the wavelet representation are of interest. Wavelets can further be categorised by which wavelet family they come from, where different wavelet families have different properties and lend themselves to different types of applications. This thesis will explain the wavelet concept using the Haar wavelet, which is the simplest wavelet but suitable for texture analysis. An overview of the mathematical background and theory of wavelets can be found in Gonzalez and Woods (2008).

The Haar Wavelet

Wavelets are based on signal processing, and a short definition of a signal follows. A signal is defined as a real (or complex) valued function of one or more variables. If the signal depends on one variable it is one dimensional, e.g., a speech signal or the daily maximum temperature for a local area. Similarly, if the signal depends on two variables it is two dimensional. An image can be defined as a two dimensional signal, with the vertical and horizontal coordinates representing the two dimensions.

The simplest Haar wavelet can be explained using a simple linear transformation of a signal. For example, take two neighbouring samples from a signal, a and b. The representation can be changed by using a linear transform which replaces a and b by their average s and difference d:

s = (a + b)/2    (2.1)
d = b − a    (2.2)

The new representation is a decomposed version of the signal, meaning that the wavelet representation has two parts: low frequency information s and high frequency information d.

If a and b are highly correlated, the absolute difference will be small, and in the most trivial case a = b the difference d is simply zero. In compression a zero is easily represented and will give good compression rates. In comparison, when using wavelets for edge detection a large d means that there is an edge between a and b. The beauty is that no information is lost, because given s and d it is easy to recover a and b:

a = s − d/2, (2.3)

b = s + d/2. (2.4)

In the general case, consider a signal s_n of 2^n sample values s_{n,l}, where 0 ≤ l < 2^n. The Haar transform is applied to each pair a = s_{n,2l} and b = s_{n,2l+1}. After the transformation there are 2^{n−1} pairs of averages s_{n−1,l} and differences d_{n−1,l}:

s_{n−1,l} = (s_{n,2l} + s_{n,2l+1}) / 2    (2.5)
d_{n−1,l} = s_{n,2l+1} − s_{n,2l}    (2.6)

The signal s_n is split into two signals, the averages s_{n−1} and the differences d_{n−1}, both with 2^{n−1} samples. Given the two signals, the original signal s_n can be recovered.

The new "average" signal s_{n−1} is a lower resolution representation of the original signal, and the "difference" signal d_{n−1} is the detail information needed to go from the lower resolution back to the original signal. The average is often referred to as a low pass filter and the difference as a high pass filter. This can be done recursively: the same transform can be applied to s_{n−1} itself, creating two new signals s_{n−2} and d_{n−2}, see Figure 2.2. As before, the new signal is a lower resolution representation of the signal s_{n−1} and, in turn, of s_n. This can be done n times before the samples run out. In the end the lowest resolution signal, s_0, contains only one sample, s_{0,0}. This sample is the average of all samples of the original signal. For each resolution the difference signal is saved, as they are the key to recovering the signal. There will be n difference signals, where the first has length 2^{n−1} and the last has length one. The beauty is that after the transformation the total number of samples, or data points, is still 2^n. Of these, the differences make up 2^n − 1. If the original signal is a smoothly varying function the differences will be small and easily represented, e.g., the most trivial case where a = b. After the final transformation the new representation contains the single sample s_{0,0} and 2^n − 1 difference samples, making it a more efficient representation of the original signal. This is the foundation of the discrete wavelet transform using the Haar wavelet.


Figure 2.2: A 2-level decomposition of a signal, where H is the high pass filter (differences) and L is the low pass filter (averages).

Figure 2.3: The left figure shows a 2-level Haar-wavelet decomposition of a 2D image. The right image shows the same decomposition of a real image. The boxes correspond between the images, e.g., HL1 is the top right image in the second figure. (Image from the public domain)

Wavelet on 2D images

The 1D wavelet decomposition can easily be extended to 2D. The image I(x, y) is first filtered with a low pass filter L and a high pass filter H along each row, producing two matrices I_L(x, y) and I_H(x, y). The low pass and high pass filters are the same "average" and "difference" filters as in the previous section about the Haar wavelet. The same procedure is then applied to the columns of I_L(x, y) and I_H(x, y), producing four sub-bands: I_LL(x, y), I_LH(x, y), I_HL(x, y) and I_HH(x, y). This is a complete one-level decomposition of the original image I(x, y). I_LL(x, y) is the smooth sub-image corresponding to a coarser representation of the original image. As in the 1D case, I_LL(x, y) can be further decomposed by applying the method recursively. Figure 2.3 shows a 2-level decomposition of an image. Note that HL shows horizontal details, whilst vertical details are visible in the LH sub-band. The HH sub-band shows diagonal details.
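As a sketch, a two-level 2D Haar decomposition can be computed in Matlab with the Wavelet Toolbox function dwt2; the random stand-in image and the variable names are illustrative only:

    % Sketch: two-level 2D Haar decomposition (assumes the Wavelet Toolbox).
    I = rand(128, 128);                    % stands in for a grey-scale sub-image
    [A1, H1, V1, D1] = dwt2(I, 'haar');    % approximation + horizontal/vertical/diagonal details
    [A2, H2, V2, D2] = dwt2(A1, 'haar');   % decompose the approximation again (level 2)
    % Each sub-band can now be treated as a texture in its own right, e.g. by
    % computing a GLCM on it or the energy of its coefficients.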

Wavelet in texture analysis

The key feature of wavelets is the ability to decompose an image into different scales. Methods for extracting features from wavelets are similar to methods on regular images, including first-order statistics (energy) and second-order statistics (GLCM), where each decomposed sub-band is used as the texture analysed by the method.


H1 = 0; H2 = 0
O = [∆n, ∆m]
for i = 1 → I do
    for j = 1 → J do
        n_{i,j} = X_{i+∆n, j+∆m}
        m_{i,j} = Y_{i+∆n, j+∆m}
        α = max(min(x_{i,j}, m_{i,j}), min(y_{i,j}, n_{i,j}))
        if α == min(x_{i,j}, m_{i,j}) then
            H1(x_{i,j}) = H1(x_{i,j}) + 1
        else
            H2(x_{i,j}) = H2(x_{i,j}) + 1
        end if
    end for
end for

Figure 2.4: Constructing co-occurrence histograms over two sub-bands X and Y for a given offset O, where n_{i,j} is the neighbour pixel of X_{i,j} defined by O and, similarly, m_{i,j} is the neighbour pixel of Y_{i,j}.

The wavelet co-occurrence histogram method (WNCH) extracts features calculated from the relationship between different sub-bands at the same scale. The method was proposed by Hiremath et al. (2006). WNCH constructs co-occurrence histograms H1 and H2 over a pair of sub-bands, e.g., I_LL(x, y) and I_LH(x, y). The algorithm for generating the histograms can be seen in Figure 2.4. If the correlation is greater between the first sub-band (X) and the neighbouring pixel in the second sub-band (Y), the histogram H1 is incremented. Otherwise, the correlation is greater between the second sub-band and the neighbouring pixel in the first sub-band, and the histogram H2 is incremented. For each histogram a normalised cumulative histogram (NCH) is constructed. From each NCH the features slope, mean and standard deviation are extracted and used for classification.
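A Matlab sketch of the histogram construction in Figure 2.4, for one sub-band pair and one offset, could look as follows. As an additional simplification not stated in the thesis, the sub-band values are assumed to be quantised to integer levels 1..K so that they can index the histogram bins directly; the stand-in data and variable names are illustrative only.

    % Sketch of the co-occurrence histogram construction in Figure 2.4.
    K = 8;
    X = randi(K, 64, 64);              % stands in for one quantised sub-band, e.g. I_LL
    Y = randi(K, 64, 64);              % stands in for another sub-band at the same scale
    dn = 1; dm = 0;                    % offset O = [dn, dm]
    H1 = zeros(1, K);  H2 = zeros(1, K);
    [rows, cols] = size(X);
    for i = max(1, 1 - dn):min(rows, rows - dn)
        for j = max(1, 1 - dm):min(cols, cols - dm)
            n = X(i + dn, j + dm);     % neighbour of X(i,j) under the offset
            m = Y(i + dn, j + dm);     % neighbour of Y(i,j) under the offset
            alpha = max(min(X(i, j), m), min(Y(i, j), n));
            if alpha == min(X(i, j), m)
                H1(X(i, j)) = H1(X(i, j)) + 1;
            else
                H2(X(i, j)) = H2(X(i, j)) + 1;
            end
        end
    end
    % From H1 and H2, normalised cumulative histograms are formed and the
    % features slope, mean and standard deviation are extracted.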

2.2.3 Scale-Invariant Feature Transform

Scale-Invariant Feature Transform (SIFT) is a method for detecting and describing local features in images, first proposed by Lowe (1999). SIFT features have properties that lend themselves to matching different images of an object or scene. This is because the features are invariant to image scaling and rotation, and partially invariant to changes in illumination and viewpoint.

Lowe (2004) divided the computation of the SIFT features into four stages: scale-space extrema detection, keypoint localisation, orientation assignment and keypoint descriptor. Unless otherwise specified the theory will follow that of Lowe (2004).

Scale-space Extrema Detection

The first stage is to detect interesting points in the image that are invariant to scale and can be found under differing views of the same object. Points that are invariant to scale change can only be found by searching in scale space. Lowe (2004) defines the scale space of an image as a function L(x, y, σ) = G(x, y, σ) ∗ I(x, y), where I is the image convolved with a Gaussian function G and σ is the scale parameter. The term octave corresponds to a doubling of the value of σ, meaning that for each octave the image size is halved. As σ increases the image is progressively blurred.

Figure 2.5: Example of finding the maxima or minima of the DoG; the marked circle is compared to its neighbours in a 3×3 region in the current and adjacent scales.

Stable keypoints are detected by finding scale-space extrema in the difference of Gaussian (DoG) function, which is computed from the difference of two scales separated by a constant multiplicative factor k, D(x, y, σ) = L(x, y, σ) − L(x, y, kσ). The factor k is selected so that there is a fixed number of Gaussian blurred images per octave. The local maxima and minima of D(x, y, σ) are then computed. Each point is compared to its eight neighbours in the current DoG and also to its nine neighbours in the scale above and below, see Figure 2.5. A point is selected as a keypoint candidate if it is larger (a maximum) or smaller (a minimum) than all of its neighbours.

Keypoint Localisation

The second stage filters the keypoints found in the first stage. The filter removes points that have low contrast or lie along an edge. Keypoints that have low contrast will be sensitive to noise and edge points will be difficult to localise unless they are a corner point.

Interpolation of nearby data is used to accurately determine the location of the keypoint.

The interpolation is calculated using a Taylor expansion of the DoG, shifted so the origin is centred at the keypoint. If the interpolated extremum lies closer to another point, the point of interest is changed and the expansion is performed on the new point. The function value at the point is used for rejecting points with low contrast. Keypoints that lie along an edge will have a strong response in the DoG function. The best keypoints are corners, as they are easily localised, so the algorithm needs to remove non-corner edge points. A peak in the DoG that corresponds to an edge point will have a large principal curvature across the edge and a small curvature in the perpendicular direction. If both principal curvatures are high, the edge point is also a corner point and therefore of interest. The curvatures are calculated from a Hessian matrix and the point is removed if the ratio between the curvatures is greater than a fixed value.

Orientation Assignment

The first two stages have identified and filtered relevant keypoints. The remaining keypoints can be considered to be stable and are localised in scale space. The next step is to assign an orientation to each point. The calculations are performed at the keypoint's scale and therefore provide scale invariance. For an image sample L(x, y, σ), the gradient magnitude m(x, y) and orientation θ(x, y) are computed using pixel differences:

m(x, y) = √((L(x + 1, y) − L(x − 1, y))² + (L(x, y + 1) − L(x, y − 1))²),    (2.7)
θ(x, y) = tan⁻¹((L(x, y + 1) − L(x, y − 1)) / (L(x + 1, y) − L(x − 1, y))).    (2.8)

(L(x + 1, y) − L(x − 1, y))2− (L(x, y + 1) − L(x, y − 1))2, (2.7) θ(x, y) = tan−1((L(x, y + 1) − L(x, y − 1))/(L(x + 1, y) − L(x − 1, y))). (2.8) An orientation histogram is formed from the gradient orientations of sample points within a region around the keypoint. The orientation histogram has n (Lowe uses n = 36) bins covering the 360 degree range of orientations. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ that is 1.5 times of the scale of the keypoint. The highest peak in the histogram corresponds to the dominant direction and is set as the keypoints orientation. If there is another peak within 80% of the highest peak another keypoint is created with that orientation. The additional peaks are added for additional stability.

Keypoint Descriptor

The last stage creates the descriptor that acts as a fingerprint of the keypoint. The descriptor should be as invariant as possible to the remaining variations, illumination and viewpoint. The descriptor is created using a 16×16 sample window divided into 4×4 subregions. In each subregion the magnitudes and orientations are computed at the keypoint's scale and with respect to the keypoint's orientation. A weight is applied to each subregion's magnitudes that gives less emphasis to gradients that are far from the centre of the keypoint. Similar to the previous stage, each subregion is collected in an orientation histogram with 8 bins. The descriptor vector is formed from the histogram entries, so 8 entries per subregion and 4 × 4 subregions per keypoint window create a 4 × 4 × 8 = 128 element feature vector. The vector is normalised, and to achieve illumination invariance all elements larger than 0.2 are thresholded to 0.2, followed by a renormalisation of the vector. The thresholding reduces the influence of illumination variations due to changes in viewpoint.

SIFT Texture Feature

The descriptors cannot be used directly for texture classification, because SIFT finds a varying number of keypoints to describe each image. A common approach is to use a clustering algorithm, e.g., k-means. The clustering creates a codebook from a training set of descriptors. The SIFT descriptors are then run against the codebook and each is given the label of the closest cluster. The classification vector V is then computed as:

V = [t_0, t_1, ..., t_{k−1}],    (2.9)

where t_i is the number of occurrences of a descriptor with label i. The vector is then normalised and used for classification.
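A minimal sketch of this bag-of-words step in Matlab could look as follows; kmeans and knnsearch are Statistics Toolbox functions, and the stand-in descriptor matrices and the codebook size are arbitrary choices, not values from the thesis:

    % Sketch: bag-of-words representation from SIFT descriptors (Eq. 2.9).
    trainDescr = rand(1000, 128);              % stands in for training-set descriptors
    imgDescr   = rand(40, 128);                % stands in for the descriptors of one image
    k = 100;                                   % codebook size (arbitrary choice)
    [~, codebook] = kmeans(trainDescr, k);     % cluster centres act as "visual words"
    labels = knnsearch(codebook, imgDescr);    % label of the closest cluster centre
    V = histcounts(labels, 0.5:1:(k + 0.5));   % t_i = number of descriptors with label i
    V = V / sum(V);                            % normalised occurrence vector used for classification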

2.3 Classification

In machine learning with supervised learning, classification is the problem of identifying which category a new observation belongs to, given training data of earlier observations with known class. The training data is (x_n, y_n), n = 1, ..., N, where x_n is the observation with class label y_n. The observation x_n = {t_1, ..., t_M} is called a feature vector and can be constructed from features from, e.g., a GLCM. The feature vectors span an abstract space called the input space¹. The classification algorithm's role is to find a decision function f(x_n) = y_n, such that the classifier or decision function can later be used to classify a new observation x_i, i.e., f(x_i) = y_i.

¹Note that in the literature it is often also referred to as feature space.

Figure 2.6: Example of possible hyperplanes separating the two classes; the solid line is the optimal hyperplane, providing the highest margin to both classes.

2.3.1 Support vector machines

Support vector machine (SVM) is a supervised learning model suitable for classification and pattern analysis. The theory presented here largely follows that of Burges (1998). The training data for the SVM is (x_n, y_n), where y_n ∈ C = {1, −1}. The SVM is built on the concept that there exists a hyperplane that separates the observations with label y_n = 1 from those with label y_n = −1 in the input space. All points above the plane are in one class and all points below the plane are in the other class. There exist many hyperplanes that separate the two classes. However, to minimise classification errors the hyperplane should provide optimal separation between the classes. Examples of separating hyperplanes can be seen in Figure 2.6.

The common approach is to maximise the distance from the hyperplane to the nearest observation on each side. This is often referred to as maximising the margin, and the classifier as a maximum margin classifier.

The separating hyperplane is defined by w^T x + b = 0, where x is the set of points on the hyperplane and w is the normal vector to the hyperplane. The problem can be formulated as follows: suppose that the training data satisfy the constraints

w^T x_i + b ≥ 1 for y_i = +1,    (2.10)
w^T x_i + b ≤ −1 for y_i = −1.    (2.11)

Equations (2.10) and (2.11) can be combined into one set of inequalities:

y_i(w^T x_i + b) − 1 ≥ 0, ∀i.    (2.12)

Figure 2.7: The figure shows an example of a linear problem in 2D space. The dashed lines are the two hyperplanes constructed from the filled support vectors. The solid line is the separating hyperplane between the classes.

All points that satisfy the equality w · x_i + b = 1 form a hyperplane H1 with normal w. Similarly, all points that satisfy w · x_i + b = −1 form a hyperplane H2 with normal w. Points on the hyperplanes H1 and H2 are called support vectors. The separating hyperplane H is positioned between H1 and H2 at distance d from both hyperplanes. In order to "orientate" H so it is as far away from H1 and H2 as possible, the distance d needs to be maximised. The distance d can be shown to be 1/||w||; the margin between H1 and H2 is then 2/||w||. Hence the problem boils down to minimising ||w||, or equivalently (1/2)||w||²:

min_w (1/2)||w||²    (2.13)

To solve the quadratic optimisation problem (2.13), the constraints (2.12) are reformulated as a Lagrangian:

L_D = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j),    (2.14)

where α_i ≥ 0, i = 1, ..., l, are the Lagrangian multipliers, one for each constraint. The solution to the original problem (2.13) is given by:

w = Σ_i α_i y_i x_i.    (2.15)

The problem is solved by maximising L_D with respect to α_i, using the solution (2.15). If α_i > 0, the corresponding x_i lies on one of the hyperplanes H1 or H2, i.e., it is a support vector. Similarly, if α_i = 0, the corresponding x_i is not on either plane and has no impact on the separating hyperplane. Thus all training points with α_i = 0 can be discarded after training, as they will not affect the classification.

When the optimisation problem is solved, the SVM is trained and can be used for classification. The decision function for the SVM can be formulated as:

f(x) = Σ_{i=1}^{N} α_i y_i (x · x_i) + b,    (2.16)

where N is the number of support vectors. The sign of the function decides the class, y = sign(f(x)).

Soft margin

Real-world problems are rarely linearly separable. In this case, problem (2.13) does not have a solution. However, a positive slack variable ξ may be introduced to allow some degree of error to occur. The slack depends on the distance from the erroneous point to the hyperplane. The upper bound on the number of errors allowed is Σ_i ξ_i. The error term is then added to (2.13), giving a new minimisation problem:

min_w (1/2)||w||² + C Σ_i ξ_i,    (2.17)

where C is the trade-off between the slack variables and the size of the margin. A high value of C gives a higher penalty for errors. Equation (2.15) can then be rewritten as:

w = Σ_{i=1}^{N} α_i y_i x_i,    (2.18)

where N is the number of support vectors and α_i is bounded by 0 ≤ α_i ≤ C.

Nonlinear classification

A basic SVM is a linear classifier and is by definition unable to solve a truly nonlinear problem. The soft margin works well if the erroneous points are few, but in a truly non-linear problem, e.g., XOR, the soft margin would fail 50% of the classifications.

However, there is a way to solve non-linear problems with a linear classifier, by using the so called kernel trick (Aizerman et al., 1964; Boser et al., 1992). The kernel trick is to map the standard problem space into a higher dimensional feature space where the problem is linearly separable. The beauty is that the trick avoids computing the mapping directly, by requiring that the learning algorithm uses dot products, such that these higher dimensional dot products can be computed within the problem space by means of a kernel function, K(x_i, x_j) = Φ(x_i) · Φ(x_j).

The separating hyperplane is placed in the higher dimensional space and can then correctly separate the classes even though they are nonlinear in the problem space. An example of this can be seen in Figure 2.9.

The most basic kernel function is the linear kernel, K(x_i, x_j) = x_i · x_j. Using the linear kernel is equivalent to solving the problem in the standard problem space, as it will not be mapped to a higher dimension, cf. Equation (2.16). A more useful kernel function is K(x_i, x_j) = e^{−||x_i − x_j||²/(2σ²)}, which provides the mapping to a higher dimension. In Equation (2.14) the problem was reformulated as a Lagrangian and a dot product x_i · x_j was introduced. The dot product is replaced with the kernel function x_i · x_j = K(x_i, x_j). After this change the SVM is doing the computation in the higher dimension, and the computational cost during training is roughly the same as training on the un-mapped data.

Figure 2.8: The figure shows an example of a linear problem solved with a separating hyperplane and a slack variable ξ.

Figure 2.9: Example of a nonlinear problem in ℝ¹ solved by using the mapping Φ(x) ↦ (x, x²). The kernel maps the problem to ℝ², where a separating plane can be found.

Multi class implementations

One limitation of the SVM is that it can only separate two classes. The approach when dealing with multiple classes is to reduce the problem into several binary classification problems solvable by a set of SVMs. One popular approach is the one-vs-rest strategy, which uses N one-versus-rest SVMs for N classes; the SVM with the highest output is chosen, where the output often corresponds to the largest positive distance from the hyperplane. The other approach is the one-vs-one strategy, which uses one SVM for each pair of classes. After classification a count is made, and the class that was chosen most often is used as the classification result, as sketched below.
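A minimal Matlab sketch of the one-vs-one voting step, given the predictions of the pairwise classifiers for a single sample (the input values are illustrative only):

    % Sketch: one-vs-one voting for a single sample with three classes.
    % pairwisePred(k) is the class predicted by the k-th pairwise SVM
    % (here an illustrative result from the three pairwise classifiers).
    pairwisePred = [1 3 3];
    numClasses = 3;
    votes = histc(pairwisePred, 1:numClasses);   % count the votes per class
    [~, predictedClass] = max(votes);            % the class chosen most often wins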


Feature selection with SVM

A key problem in classification is that the input space is often very large and the feature vectors contain redundant or irrelevant components. If a redundant or irrelevant feature element is removed, the training time is reduced and the risk of overfitting is minimised.

For a linear SVM the normal to the hyperplane, w, can be used to find the most discriminating features. Note that Equation (2.15) is used to calculate w from the Lagrangian multipliers. The elements of w = {w_1, ..., w_n} can be used to determine the contribution of each feature to the final solution. Elements that are 0, or approximately 0, can confidently be removed as they contribute little to nothing to separating the classes. Similarly, if an element's value is high it indicates that the corresponding feature is important for separating the two classes.

Guyon et al. (2002) proposed SVM-Recursive Feature Elimination (SVM-RFE) for removing the least discriminating features one at a time with backward elimination. In each step a linear SVM is trained and the normal w is calculated. The normal is then used to calculate a ranking score c_i = (w_i)² for each feature. The feature with the smallest ranking score, c_j, is removed. The algorithm is terminated either when a feature subset of a specific size is found or when there are no more features to remove. The process can be sped up by removing more than one feature at a time. The algorithm can easily be extended to multiclass problems; the ranking score is then c_i = Σ_k (w_{k,i})², where k indexes the hyperplanes and w_k is the normal to the k-th hyperplane.
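A compact Matlab sketch of this backward-elimination loop (one feature removed per iteration) is given below. The data is a random stand-in, and trainLinearSVM is a hypothetical helper that trains a linear SVM on the currently selected features and returns the hyperplane normal w; it is not part of the thesis code.

    % Sketch: SVM-RFE style backward elimination (one feature per iteration).
    F = randn(200, 24);                               % stand-in feature matrix
    y = 2 * (rand(200, 1) > 0.5) - 1;                 % stand-in binary labels in {-1, +1}
    selected = 1:size(F, 2);                          % indices of the remaining features
    ranking = zeros(1, 0);                            % elimination order, least important first
    while numel(selected) > 1
        w = trainLinearSVM(F(:, selected), y);        % hypothetical helper returning the normal w
        scores = w(:).^2;                             % ranking score c_i = w_i^2
        [~, worst] = min(scores);
        ranking(end + 1) = selected(worst);           % record and eliminate the weakest feature
        selected(worst) = [];
    end
    ranking(end + 1) = selected;                      % the last remaining feature ranks highest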

2.3.2 Import vector machines

Import vector machine (IVM) is a sparse kernel logistic regression model. IVM was first introduced by Zhu and Hastie (2005) as an attempt to combine the concept of support vectors from SVM with the probability output of logistic regression. The section follows Zhu and Hastie (2005).

Logistic Regression

Logistic regression is a traditional statistical tool that builds a model from observations and gives a classification result based on a probability. Logistic regression gives the posterior probability p_n given an observation x_n. When dealing with a two class problem the model is:

p(y_n | x_n; w) = (1 + exp(−w^T x_n))^{−1},    (2.19)

where w is an extended parameter w = [w_0, ω] containing the bias w_0 and the weight vector ω, which are constructed from the training data. The weight vector expresses the relative importance of the different elements of the observation vectors. Training the model amounts to determining the parameter w using a likelihood function. Common choices are the negative log likelihood and maximum likelihood estimation.
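As a tiny Matlab illustration of Equation 2.19, with made-up parameter values and an illustrative observation:

    % Sketch: two-class logistic regression posterior (Eq. 2.19) for one
    % observation x, given an already estimated extended parameter w = [w0; omega].
    w = [0.5; 1.0; -2.0];                  % illustrative parameters [w0; omega]
    x = [0.3; 0.7];                        % illustrative observation
    xExt = [1; x];                         % prepend 1 so the bias w0 is included
    p = 1 / (1 + exp(-w' * xExt));         % posterior probability for the class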

Kernel logistic regression (KLR) is a method to solve nonlinear problems with logistic regression. It is based on the kernel trick described in Section 2.3.1.

Support Vector Machine and Kernel Logistic Regression

As noted previously, one of the SVM's limitations is that it does not provide a probability, only the estimate sign[p(x) − 1/2]. The SVM has a strong similarity to regularised function estimation in a reproducing kernel Hilbert space (RKHS), as noted by Wahba (1999). A regularised function estimation problem contains two parts, a loss function and a regularisation term. In the case of the SVM the loss function is a hinge loss, defined as (1 − yf(x))_+. The regularised function estimation problem for an SVM is:

min_{f ∈ H_K} (1/n) Σ_{i=1}^{n} [1 − y_i f(x_i)]_+ + (λ/2) ||f||²_{H_K},    (2.20)

where H_K is the RKHS generated by the kernel K(·, ·). Note that f(x) is the decision function (2.16). If the SVM's hinge loss function in (2.20) is replaced with the negative log-likelihood (NLL) of the binomial distribution, the problem becomes:

min_{f ∈ H_K} (1/n) Σ_{i=1}^{n} ln(1 + e^{−y_i f(x_i)}) + (λ/2) ||f||²_{H_K}.    (2.21)

Figure 2.10: SVM's hinge loss function plotted together with the negative log-likelihood, as functions of yf(x).

The NLL is used as it is similar in shape to the hinge loss function, see Figure 2.10. The reason for the replacement is twofold: first, to be able to get a natural probability instead of the estimate sign[p(x) − 1/2]; secondly, a KLR can naturally be generalised to multi class problems by using multinomial logistic regression. The main disadvantage is that the support vector property of the SVM is lost, i.e., all points in the training data are needed for classification. This is a drawback of KLR and makes the computational cost O(n³). Therefore, Zhu and Hastie (2005) proposed the import vector machine (IVM), which uses a subset S of {x_1, x_2, ..., x_n} to approximate the KLR model (2.21).

Import Vector Machine - Algorithm

To find the subset S, Zhu and Hastie (2005) use the following greedy forward strategy: initially S = ∅, and the subset is iteratively built by adding one sample x_l ∈ L at a time. The sample added from {x_1, x_2, ..., x_n} is chosen so that it decreases the regularised NLL the most. The first import vector yields the best one-class classification, which can be interpreted as the point lying in the area with the highest class density. The second vector will yield the best possible linear separability to one of the other classes.

As noted earlier, to solve a KLR, Equation (2.21) needs to be minimised. Equation (2.21) can be written in a finite dimensional form:

H = (1/n) ln(1 + e^{−y ∗ (K_1 a)}) + (λ/2) a^T K_2 a,    (2.22)

where K_1 = k(x_i, x_s) with x_i ∈ L is the regressor matrix and K_2 = k(x_s, x_s') with x_s, x_s' ∈ S is the regularisation matrix. The Newton-Raphson method can then be used iteratively to find a^(k):

a^(k) = ((1/n) K_1^T W K_1 + λK_2)^{−1} K_1^T W z,    (2.23)

where W is the diagonal matrix W = diag(p_1(1 − p_1), ..., p_n(1 − p_n)) with p_i = (1 + exp(−y_i x_i))^{−1}, i = 1, ..., n. The parameter z is defined as:

z = (1/n)(K_1 a^(k−1) + W^{−1}(y ∗ p)).    (2.24)

To speed up the computation, a one-step iteration approximation is used instead of the full Newton-Raphson computation.

S = ∅;  L = {x_1, x_2, ..., x_n};  let k = 1, a^(0) = 0
repeat
    for all x_l ∈ L do
        Use a one-step iteration to find a such that it minimises:
            H(x_l) = (1/n) ln(1 + e^{−y ∗ K_1^l a}) + (λ/2) a^T K_2^l a
    end for
    x_l* = arg min_{x_l ∈ L} H(x_l)
    S = S ∪ {x_l*}
    L = L \ {x_l*}
    k = k + 1
until H_k has converged

Figure 2.11: Algorithm for selecting import vectors. In each step the vector x_l that minimises the likelihood function the most is selected, i.e., the sample that, if added, will give the best classification result.

The algorithm in Figure 2.11 halts when H_k has converged. The convergence criterion is that the ratio ε = |H_k − H_{k−∆k}|/|H_k| is small, with ∆k = 1.


Chapter 3

Method

3.1 Dataset

The image data used for the experiments were collected during a field campaign in the Remningstorp test area outside Skövde in southern Sweden (Forsman et al., 2012). The forest at Remningstorp is semi-boreal with mostly Scotch Pine¹, Norway Spruce² and some broad-leaved trees. The images were collected with a calibrated camera rig, and a point cloud was generated from SIFT descriptors. 25 field plots with 10–20 m radius were photographed using the camera rig. Furthermore, the species of each tree was available.

From each 3D point in the point cloud a sub-image of size 128 × 128 was extracted. Two extracted sub-images can be seen in Figure 3.1. In addition to the image data, the spatial data and SIFT data were available. The data used for the experiments were collected from one plot. Figure 3.2 shows an image with SIFT points from the plot. The dataset contains 8535 sub-images of pine and 17426 of spruce. In addition, there are 62551 sub-images of ground.

(a) Spruce bark (b) Pine Bark

Figure 3.1: Image of spruce and pine bark texture.

¹Swedish: Tall
²Swedish: Gran


Figure 3.2: An image with classified SIFT regions on six trees, one pine (yellow) and five spruces (blue), and on the ground (green). The trees are at a distance of 3–11 m.

3.2 Compared feature analysis methods

The chosen texture analysis methods are the regular GLCM (GLCM), wavelet with GLCM (WGLCM), the wavelet co-occurrence histogram method (WNCH) and SIFT (SIFT). The methods were chosen as they have been previously used in related work and showed promising results. Each method extracts the features used for classification from the sub-images in the dataset. For both GLCM and WGLCM, a co-occurrence matrix is calculated. From the GLCM the following measurements are calculated: energy, entropy, homogeneity, contrast, max probability and correlation, generating a 6 element feature vector per offset. Four GLCMs are generated per sub-image, with offsets (1, 0), (1, 1), (0, 1), (−1, 1). For the standard GLCM the feature vector is 24 elements long. The WGLCM method is based on the same setup as the regular GLCM. The sub-image is decomposed 3 times and a GLCM is calculated on each sub-band; the final feature vector is 288 elements long. The WNCH is constructed for eight offsets, (1, 0), (1, 1), (0, 1), (−1, 1), (−1, 0), (−1, −1), (0, −1), (1, −1), for three decomposition levels; the final vector is 432 elements long. The SIFT method uses the SIFT data included in the dataset. The SIFT feature vector is built from the SIFT descriptor in the dataset, which is 128 elements long. In addition to the SIFT descriptor, the frame size (in pixels), the major gradient direction (in degrees) and the standard deviation of the SIFT descriptor were added as additional features.

Each feature is further scaled to [0, 1] with min-max normalisation. Scaling is necessary to avoid possible numerical problems, in particular when one feature is numerically several magnitudes larger than the others. Note that the scaling is performed per feature and not on the feature vectors.
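As an illustration, per-feature min-max scaling of a feature matrix could be sketched in Matlab as follows (the stand-in matrix and variable names are illustrative; bsxfun is used so the sketch runs on older Matlab versions as well):

    % Sketch: per-feature min-max normalisation of a feature matrix
    % (rows = sub-images, columns = features), scaling each feature to [0, 1].
    F = randn(100, 24);                          % stands in for a feature matrix
    Fmin = min(F, [], 1);                        % per-column (per-feature) minimum
    Fmax = max(F, [], 1);                        % per-column (per-feature) maximum
    Fscaled = bsxfun(@rdivide, bsxfun(@minus, F, Fmin), Fmax - Fmin);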


3.3 Performance Evaluation

To evaluate the classification performance of the methods, two classifiers, a linear SVM and a kernel based IVM, were trained and tested with k-fold cross validation. In k-fold cross validation the dataset is partitioned into k equally sized subsets. One subset is used as the validation set and the remaining k − 1 subsets are used as training data for the classifier. This is repeated for each subset, in total k times. The mean over the cross-validation folds is used as the result. For the experiments k = 5 was chosen as a good trade-off between computation time and accurate performance estimation. The performance is measured as overall accuracy and class accuracy. Overall accuracy is defined as the proportion of the total number of classifications that were correct. Class accuracy is defined as the proportion of correct classifications per class.
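A minimal sketch of such a k-fold evaluation in Matlab is given below. It assumes the Statistics Toolbox function cvpartition, uses random stand-in data, and the helpers trainClassifier/predictClassifier are hypothetical placeholders for the SVM or IVM training and prediction code.

    % Sketch: 5-fold cross validation of a classifier on features F with labels y.
    F = randn(300, 24);                          % stand-in feature matrix
    y = randi(3, 300, 1);                        % stand-in labels (1 = pine, 2 = spruce, 3 = ground)
    k = 5;
    c = cvpartition(y, 'KFold', k);              % stratified k-fold partition
    acc = zeros(k, 1);
    for fold = 1:k
        trainIdx = training(c, fold);
        testIdx  = test(c, fold);
        model = trainClassifier(F(trainIdx, :), y(trainIdx));   % hypothetical helper
        pred  = predictClassifier(model, F(testIdx, :));        % hypothetical helper
        acc(fold) = mean(pred == y(testIdx));                   % overall accuracy for this fold
    end
    overallAccuracy = mean(acc);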

3.4 Parameter Selection

Parameter selection for the classifiers is done by grid search with k-fold cross validation on a smaller tuning set. The grid search uses a predefined set of values for each parameter and chooses the one with the highest cross validation accuracy on the tuning set. The parameter that needs to be selected for a linear SVM is the error parameter C. For the IVM, the regularisation parameter λ and the kernel parameter σ were selected. The tuning set consists of 500 sub-images each of spruce, pine and ground, summing up to 1500 images.
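As an illustration, a grid search over the SVM error parameter C could be sketched in Matlab as follows; the grid, the stand-in tuning data and the helper crossValAccuracy (which would return the k-fold cross-validation accuracy for a given C) are hypothetical and not taken from the thesis:

    % Sketch: grid search over the SVM error parameter C on a tuning set.
    Ftune = randn(1500, 24);               % stand-in tuning features
    ytune = randi(3, 1500, 1);             % stand-in tuning labels
    Cgrid = 2.^(-5:2:15);                  % candidate values (an arbitrary grid)
    accs = zeros(size(Cgrid));
    for i = 1:numel(Cgrid)
        accs(i) = crossValAccuracy(Ftune, ytune, Cgrid(i));   % hypothetical helper
    end
    [~, best] = max(accs);
    Cbest = Cgrid(best);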

3.5 Code

The code for the texture analysis methods and classification is implemented in Matlab. Matlab is a high level language with extensive libraries for numerical computation and lends itself to scientific research. All libraries used have simple interfaces for Matlab. The implementation of GLCM is based on Matlab's graycomatrix function from the Image Processing Toolbox. The graycomatrix function creates a GLCM for a given offset and can be used with graycoprops to generate a number of statistical features from the GLCM. The Haar wavelet is used as the wavelet for the wavelet based methods. The wavelet implementation is based on Matlab's dwt2 function from the Wavelet Toolbox.

The SIFT implementation is based on VLFeat, an open source library of computer vision algorithms (Vedaldi and Fulkerson, 2008). VLFeat's implementation is based on Lowe (2004). The support vector machine implementation is provided by LIBSVM (Chang and Lin, 2011), which is a widely used open-source support vector machine library. The import vector machine implementation is based on Zhu and Hastie (2005) and provided by Roscher (2011).
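As an illustration of how the pieces could fit together, the following hedged Matlab sketch trains a linear SVM with LIBSVM's Matlab interface (svmtrain/svmpredict) on synthetic stand-in data and predicts on a held-out set; the data, the option string and the value of C are illustrative and do not reproduce the exact settings used in the thesis.

    % Sketch: linear SVM with LIBSVM's Matlab interface (assumes LIBSVM is on the path).
    Ftrain = [rand(50, 24); rand(50, 24) + 1];        % two clusters of stand-in feature vectors
    ytrain = [ones(50, 1); 2 * ones(50, 1)];          % class labels 1 and 2
    Ftest  = [rand(10, 24); rand(10, 24) + 1];
    ytest  = [ones(10, 1); 2 * ones(10, 1)];
    C = 1;                                            % error parameter, e.g. from the grid search
    model = svmtrain(ytrain, Ftrain, sprintf('-t 0 -c %g', C));   % -t 0: linear kernel
    [pred, acc, ~] = svmpredict(ytest, Ftest, model);             % acc(1) = overall accuracy in %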


Chapter 4

Experiment and Results

The experiments evaluate a set of feature extraction methods on sub-images extracted from the Remningstorp dataset. The training set consists of 3000 sub-images each of spruce, pine and ground, summing up to a total of 9000 sub-images. The sub-images are arbitrarily selected and no consideration is made of whether two sub-images overlap. The methods were first tested with only the standard image data. The second experiment tested whether the methods performed better with the addition of spatial data from the dataset's point cloud. Following each experiment an SVM-RFE was performed and the methods were trained and tested with reduced feature vectors. The classification performance and the most discriminating features were noted. In the final experiment the sub-images were clustered so that the methods could be evaluated on individual trees.

4.1 Experiment 1

The first experiment evaluated how well each method performs on the sub-image data alone. For the purpose of comparison, two classifiers were used to evaluate the performance of each method. The best overall result was obtained with SVM on WGLCM features, with an overall accuracy of 84% (Table 4.1).

Table 4.1: Classification results for IVM and SVM on features from four texture analysis methods. The overall classification accuracy and the class-wise classification accuracy per method are presented. Bold numbers indicate the best classification result.

Classifier Method Accuracy Pine Spruce Ground

SVM GLCM 77.9 88.3 66.5 78.8

SIFT 41.5 54.8 44.2 25.7

WGLCM 83.6 89.7 75.9 85.3

WNCH 77.6 86.0 66.8 80.0

IVM GLCM 73.7 90.3 55.5 74.3

SIFT 47.9 63.3 10.0 70.0

WGLCM 80.0 91.3 63.2 85.6

WNCH 67.7 82.9 45.0 75.1


Table 4.2: Classification results for IVM and SVM on features from the four texture analysis methods, with the least important features removed with SVM-RFE. The overall classification accuracy and the class-wise classification accuracy per method are presented. Bold numbers indicate the best classification result.

Classifier Method Accuracy Pine Spruce Ground

SVM GLCM 77.7 88.1 66.0 78.8

SIFT 40.5 36.7 56.7 28.1

WGLCM 84.0 90.3 76.2 85.4

WNCH 78.1 86.3 67.4 80.1

IVM GLCM 75.7 89.4 58.4 79.2

SIFT 47.8 63.1 10.8 69.5

WGLCM 79.1 92.0 59.7 85.7

WNCH 71.0 86.0 47.5 79.3

Table 4.3: Train and test time for each method with full and reduced feature vectors.

            Full Feature Vector        Reduced Feature Vector
            Test (s)    Train (s)      Test (s)    Train (s)

SVM GLCM 0.7 40 0.7 37

SIFT 7.0 509 5.0 460

WGLCM 2.6 81 2.0 57

WNCH 4.8 135 3.0 80

IVM GLCM 0.002 209 0.002 180

SIFT 0.004 63 0.006 61

WGLCM 0.002 566 0.0006 461

WNCH 0.006 782 0.002 694

4.1.1 Reduced feature vectors

In this experiment the performance of each of the methods was tested with the feature data first undergoing SVM-RFE. After the feature elimination the reduced feature vectors were used for training and testing with the two classifiers, with the same setup as for the first experiment.

For all methods a reduced feature vector was found to have similar classification results when compared to the complete feature vectors (Table 4.2). A reduction also gave a slight performance boost regarding training time (Table 4.3).

The five most discriminating features and the corresponding hyperplane weights for the best performing method, WGLCM, can be seen in Table 4.4. All listed values are significant for the classification result.


Table 4.4: The 5 most discriminating features for the best method in Experiment 1, WGLCM. W is the corresponding normalised hyperplane weight. Level is the wavelet decomposition level, where 3 corresponds to the coarsest level and 1 is the finest. Sub. is the wavelet sub-band, where LL is the average and LH, HL, and HH are the vertical, horizontal, and diagonal detail sub-bands, respectively. Off. = GLCM offset, where [1 0] is the vertical direction and [0 1] is the horizontal direction. The cutoff value for the hyperplane normal vector is 0.0718. Values of W below the cutoff are not significant to the classification result.

Classes W Level Sub. GLCM Feat. Off.

Spruce / Pine 1 0.218 1 LH Homogeneity [1 0]

2 0.191 2 LL Homogeneity [1 0]

3 0.188 2 HH Homogeneity [1 0]

4 0.177 2 LL Homogeneity [1 0]

5 0.176 3 LH Correlation [1 0]

Spruce / Ground 1 0.340 2 LL Homogeneity [1 0]

2 0.214 2 HH Homogeneity [1 0]

3 0.207 1 LH Entropy [1 0]

4 0.148 2 LL Correlation [0 1]

5 0.137 2 HH Correlation [0 1]

Pine / Ground 1 0.171 1 LH Correlation [0 1]

2 0.165 1 LH Entropy [1 0]

3 0.161 2 LL Homogeneity [1 0]

4 0.159 2 HH Homogeneity [1 0]

5 0.154 1 LH Homogeneity [1 0]


Table 4.5: Classification results for IVM and SVM on features from the four texture analysis methods with added spatial data X, Y, Z and distance to the camera. The overall classification accuracy and the class-wise classification accuracy per method are shown. Bold numbers indicate the best classification result.

Classifier Method Accuracy Pine Spruce Ground

SVM GLCM 90.1 89.5 83.6 98.2

SIFT 77.4 79.0 62.0 90.7

WGLCM 91.9 92.0 86.0 96.8

WNCH 89.5 89.1 83.1 96.2

IVM GLCM 92.7 93.4 86.3 98.4

SIFT 65.8 69.4 57.0 71.0

WGLCM 84.1 90.1 71.3 90.2

WNCH 78.6 87.6 61.3 87.1

4.2 Experiment 2

In the second experiment each method's feature vector was extended with point data. The data added were the X, Y and Z coordinates relative to the camera and the distance from the camera to the point. Table 4.5 shows the result of the experiment, with the best accuracy achieved by IVM on GLCM features.

4.2.1 Reduced feature vectors

In this experiment the performance of each of the methods from the previous experiment was tested with the training data first undergoing SVM-RFE. After the SVM-RFE the reduced feature vectors were used for training and testing. The best method was GLCM with the IVM classifier, with an overall accuracy of 95% (Table 4.6). For all methods a reduced feature vector was found to have similar classification results when compared to the complete feature vectors (Table 4.6). The reduction also gave a performance boost regarding training time (Table 4.7).

Table 4.6: Classification results for IVM and SVM on reduced feature vectors from the four texture analysis methods with added spatial data X, Y, Z and distance to the camera. The overall classification accuracy and the class-wise classification accuracy per method are shown. Bold numbers indicate the best classification result.

Classifier Method Accuracy Pine Spruce Ground

SVM GLCM 89.8 88.8 82.1 98.3

SIFT 71.5 46.6 71.7 96.3

WGLCM 92.1 91.5 87.2 97.6

WNCH 90.7 90.7 84.1 97.3

IVM GLCM 94.9 99.2 87.4 98.0

SIFT 67.8 69.4 65.0 71.2

WGLCM 85.2 91.2 72.0 92.4

WNCH 85.1 90.4 71.5 93.5


Table 4.7: Train and test time for each method with full and reduced feature vectors.

            Full Feature Vector        Reduced Feature Vector
            Test (s)    Train (s)      Test (s)    Train (s)

SVM GLCM 0.2 5.0 0.2 3.0

SIFT 3.0 255 0.5 440

WGLCM 3.0 97 1.2 150

WNCH 4.0 421 2.0 187

IVM GLCM 0.0001 251 0.0002 300

SIFT 0.002 63 0.0001 75

WGLCM 0.004 581 0.006 587

WNCH 0.002 993 0.01 800

The five most discriminating features for the best method in Experiment 2, GLCM, can be seen in Table 4.8. The most important feature was the Z coordinate when separating trees from ground.

Table 4.8: The 5 most discriminating features for GLCM, the best method in Experiment 2. W is the corresponding normalised hyperplane weight. The spatial coordinate Z is the height over ground. The cutoff value for the hyperplane normal vector is 0.218. Values of W below the cutoff are not significant to the final classification result and are greyed out.

Classes W GLCM Feat. Off.

Spruce / Pine 1 0.415 Entropy [0 1]

2 0.363 Homogeneity [0 1]

3 0.308 Homogeneity [1 0]

4 0.291 Entropy [1 1]

5 0.280 Correlation [-1 1]

Ground / Spruce 1 0.876 Z -

2 0.289 Homogeneity [0 1]

3 0.219 Homogeneity [1 0]

4 0.138 Homogeneity [1 1]

5 0.131 X -

Ground / Pine 1 0.543 Z -

2 0.398 Homogeneity [0 1]

3 0.293 Homogeneity [1 0]

4 0.230 Contrast [0 1]

5 0.176 Correlation [0 1]

4.3 Experiment 3

In the third experiment the methods were tested on individual trees. In each step one tree was used as the test dataset and the rest of the trees were used as the training dataset. The training dataset was constructed by randomly selecting 210 points from each remaining tree. As in Experiment 2, the feature vectors were extended with spatial data. Two trees from each


Table 4.9: Classification results of 17 trees from plot 165 classified with IVM and SVM on GLCM features. Dist. is the distance from the centre of the plot to the tree. N is the total number of classified points.

SVM IVM

Species Tree ID N Dist. (m) Pine Spruce Ground Pine Spruce Ground

Pines 11784 931 10.6 100% 0% 0% 91% 0% 9%

11824 3428 7.2 40% 60% 0% 9% 91% 0%

11838 352 9.2 76% 24% 0% 82% 17% 1%

11852 650 7.2 98% 1% 1% 95% 3% 2%

11853 792 8.6 87% 13% 0% 84% 13% 3%

Spruces 11792 316 9.5 22% 71% 6% 9% 55% 36%

11805 404 11 0% 88% 12% 0% 64% 36%

11810 210 9.5 2% 98% 0% 10% 83% 7%

11812 4125 3.2 7% 77% 16% 14% 25% 61%

11813 1846 5.3 4% 92% 5% 3% 79% 18%

11825 1225 3 1% 98% 1% 0% 62% 37%

11826 728 5.5 1% 90% 9% 0% 63% 37%

11839 3785 3.1 1% 94% 6% 1% 97% 2%

11841 246 11.6 86% 2% 12% 93% 0% 7%

11861 454 9.2 23% 71% 6% 45% 26% 29%

11865 233 7.1 17% 82% 0% 30% 40% 29%

11866 777 2.7 3% 94% 3% 5% 59% 36%

species were used as data for parameter selection. That resulted in a total of 5 pines and 12 spruces being used for the test and training phases.

GLCM and WNCH with SVM correctly classified 4 pines and 11 spruces, with an overall accuracy of 88% (Tables 4.9 and 4.11). Figure 4.1 shows four trees classified by SVM on GLCM features, where two are correctly classified and two are misclassified. SIFT with IVM performed worst; it was unable to correctly classify the spruces and reached only 46% overall accuracy (Table 4.12).


(a) Spruce: 11825 (b) Spruce: 11841 (c) Pine: 11838 (d) Pine: 11824

Figure 4.1: Clustered sub-images from two pines and two spruces classified with SVM on GLCM features. Yellow is classified as spruce, blue as pine and green as ground. Trees 11825 and 11838 are correctly classified as spruce respectively pine, whereas trees 11841 and 11824 were misclassified as pine respectively spruce. Note the green areas classified as ground on the two spruces.


Table 4.10: Classification results of 17 trees from plot 165 classified with IVM and SVM on WGLCM features. Dist. is the distance from the centre of the plot to the tree. N is the total number of classified points.

SVM IVM

Species Tree ID N Dist. (m) Pine Spruce Ground Pine Spruce Ground

Pine 11784 931 10.6 100% 0% 0% 99% 1% 0%

11824 3428 7.2 38% 62% 0% 79% 21% 0%

11838 352 9.2 99% 1% 0% 97% 3% 0%

11852 650 7.2 99% 1% 0% 100% 0% 0%

11853 792 8.6 94% 6% 0% 94% 6% 0%

Spruce 11792 316 9.5 15% 84% 2% 74% 20% 6%

11805 404 11 2% 80% 18% 5% 50% 45%

11810 210 9.5 19% 80% 1% 20% 78% 1%

11812 4125 3.2 4% 77% 19% 31% 6% 63%

11813 1846 5.3 5% 93% 2% 33% 56% 11%

11825 1225 3.0 0% 100% 0% 0% 67% 33%

11826 728 5.5 3% 87% 10% 13% 62% 25%

11839 3785 3.1 0% 96% 4% 2% 31% 67%

11841 246 11.6 93% 1% 6% 77% 17% 6%

11861 454 9.2 78% 17% 6% 67% 21% 12%

11865 233 7.1 7% 92% 1% 38% 55% 7%

11866 777 2.7 1% 96% 3% 2% 74% 24%

Table 4.11: Classification results of 17 trees from plot 165 classified with IVM and SVM on WNCH features. Dist. is the distance from the centre of the plot to the tree. N is the total number of classified points.

SVM IVM

Species Tree ID N Dist. (m) Pine Spruce Ground Pine Spruce Ground

Pine 11784 931 10.6 98% 2% 0% 73% 4% 23%

11824 3428 7.2 48% 52% 0% 44% 46% 10%

11838 352 9.2 74% 26% 0% 44% 53% 3%

11852 650 7.2 94% 6% 0% 86% 6% 7%

11853 792 8.6 85% 15% 0% 78% 19% 4%

Spruce 11792 316 9.5 3% 84% 12% 2% 47% 52%

11805 404 11 4% 80% 16% 1% 58% 41%

11810 210 9.5 3% 97% 0% 3% 80% 17%

11812 4125 3.2 8% 71% 21% 11% 4% 85%

11813 1846 5.3 9% 86% 4% 22% 28% 51%

11825 1225 3 2% 95% 3% 1% 57% 41%

11826 728 5.5 2% 92% 6% 3% 41% 56%

11839 3785 3.1 1% 98% 2% 2% 42% 55%

11841 246 11.6 78% 12% 10% 30% 58% 11%

11861 454 9.2 38% 53% 9% 25% 34% 41%

11865 233 7.1 17% 77% 6% 14% 43% 42%

11866 777 2.7 3% 96% 1% 4% 23% 73%


Table 4.12: Classification results of 17 trees from plot 165 classified with IVM and SVM on SIFT features. Dist. is the distance from the centre of the plot to the tree. N is the total number of classified points.

SVM IVM

Species Tree ID N Dist. (m) Pine Spruce Ground Pine Spruce Ground

Pine 11784 931 10.6 87% 0% 13% 54% 6% 40%

11824 3428 7.2 43% 56% 1% 60% 29% 11%

11838 352 9.2 85% 0% 15% 75% 12% 14%

11852 650 7.2 92% 3% 5% 81% 13% 6%

11853 792 8.6 94% 0% 6% 72% 14% 14%

Spruce 11792 316 9.5 68% 1% 31% 23% 11% 65%

11805 404 11 22% 57% 21% 29% 10% 61%

11810 210 9.5 98% 2% 0% 48% 26% 26%

11812 4125 3.2 0% 81% 19% 43% 28% 29%

11813 1846 5.3 8% 85% 7% 54% 27% 20%

11825 1225 3 0% 100% 0% 21% 30% 50%

11826 728 5.5 28% 53% 20% 43% 37% 20%

11839 3785 3.1 0% 100% 0% 44% 27% 29%

11841 246 11.6 91% 0% 9% 59% 23% 18%

11861 454 9.2 86% 0% 14% 38% 20% 42%

11865 233 7.1 86% 0% 14% 39% 34% 26%

11866 777 2.7 0% 100% 0% 28% 41% 31%


References
