Evaluation of Random Forests for Detection and Localization of Cattle Eyes


Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Evaluation of Random Forests for detection and localization of cattle eyes

Master's thesis in Computer Vision (Examensarbete utfört i Datorseende) at Linköping University Institute of Technology

by Daniel Sandsveden
LiTH-ISY-EX--15/4885--SE

Linköping 2015


Evaluation of Random Forests for detection and localization of cattle eyes

Master's thesis in Computer Vision (Examensarbete utfört i Datorseende) at Linköping University Institute of Technology

by Daniel Sandsveden
LiTH-ISY-EX--15/4885--SE

Handledare (Supervisors): Amanda Berg, isy, Linköpings universitet; Annika Rantzer, Agricam
Examinator (Examiner): Klas Nordberg, isy, Linköpings universitet


Avdelning, Institution (Division, Department): Computer Vision, Department of Electrical Engineering, SE-581 83 Linköping
Datum (Date): 2015-09-23
Språk (Language): English
Rapporttyp (Report category): Examensarbete
URL för elektronisk version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-121540
ISRN: LiTH-ISY-EX--15/4885--SE
Titel (Title): Evaluation of Random Forests for detection and localization of cattle eyes
Författare (Author): Daniel Sandsveden

Abstract

In a time when cattle herds grow continually larger, the need for automatic methods to detect diseases is ever increasing. One possible method to discover diseases is to use thermal images and automatic head and eye detectors. In this thesis an eye detector and a head detector are implemented using the Random Forests classifier. During the implementation the classifier is evaluated using three different descriptors: Histogram of Oriented Gradients, Local Binary Patterns, and a descriptor based on pixel differences. An alternative classifier, the Support Vector Machine, is also evaluated for comparison against Random Forests.

The thesis results show that Histogram of Oriented Gradients performs well as a description of cattle heads, while Local Binary Patterns performs well as a description of cattle eyes. The provided descriptor performs almost equally well in both cases. The results also show that Random Forests performs approximately as well as the Support Vector Machine, when the Support Vector Machine is paired with Local Binary Patterns for both heads and eyes.

Finally, the thesis results indicate that it is easier to detect and locate cattle heads than cattle eyes. For eyes, combining a head detector and an eye detector is shown to give a better result than only using an eye detector. In this combination heads are first detected in images, followed by using the eye detector in areas classified as heads.


Acknowledgments

First of all I would like to thank Agricam for giving me the opportunity to write this thesis. I would like to thank my supervisor at Agricam, Annika Rantzer, for her support and feedback during the thesis work, as well as my supervisor Amanda Berg and my examiner Klas Nordberg at ISY for their useful comments and help.

I would also like to thank Ellinor Eineren at Agricam and Jörgen Ahlberg at Termisk Systemteknik for their involvement in this thesis, and Julius Jeuthe for his final comments on my report.

Finally, I wish to thank my friends and my family for their support throughout my studies, with a special thank you to Li Sandsveden.

Linköping, September 2015 Daniel Sandsveden


Contents

Notation ix

1 Introduction 1
  1.1 Outline 1
  1.2 Background 1
  1.3 Motivation 2
  1.4 Purpose 2
  1.5 Problem description 3
  1.6 Limitations 4

2 Theory 5
  2.1 Descriptors 5
    2.1.1 Histogram of Oriented Gradients 5
    2.1.2 Local Binary Patterns 8
  2.2 Classifiers 10
    2.2.1 Random Forests 11
    2.2.2 Support Vector Machines 13
  2.3 Region Detector 14
    2.3.1 Sliding Window 14
  2.4 Evaluation 15
    2.4.1 Concept of Accuracy 15
    2.4.2 K-Fold Cross-Validation 16

3 Method 17
  3.1 Method overview 17
  3.2 Data 19
    3.2.1 Datasets 19
    3.2.2 Ground truth 21
    3.2.3 Training sets 22
    3.2.4 Evaluation sets 23
    3.2.5 Summary of data and ground truths 23
  3.3 Implementation 24
    3.3.1 Descriptors 24
    3.3.2 Classifiers 26
    3.3.3 Region detectors 27
  3.4 Evaluation 28
    3.4.1 Levels of evaluation 28
    3.4.2 Evaluation of descriptors and classifiers 29
    3.4.3 Evaluation of region detector 29

4 Results 31
  4.1 Implementation 31
    4.1.1 Descriptors 31
    4.1.2 Classifiers 37
    4.1.3 Support Vector Machine 46
    4.1.4 Region detectors 47
  4.2 Evaluation 47
    4.2.1 Classifiers and descriptors 47
    4.2.2 Region detectors 55
    4.2.3 Summary of Evaluation 56

5 Discussion 57
  5.1 Results 57
    5.1.1 Implementation 57
    5.1.2 Evaluation 59
  5.2 Method 60
    5.2.1 Data 60
    5.2.2 Methods and implementation 61
    5.2.3 Evaluation 61
    5.2.4 Source criticism 62
  5.3 Ethical and societal aspects 62

6 Conclusions 65
  6.1 Answers to problem descriptions 65
    6.1.1 How well does the provided classifier and descriptor work on cow heads? 65
    6.1.2 How well does the provided classifier and descriptor work on cow eyes? 66
    6.1.3 What type of detection window should be used? 66
    6.1.4 What type of feature descriptor should be used? 66
    6.1.5 How does random forests compare to svm for detection of heads and eyes on dairy cows? 66
  6.2 Future work 66


Notation

Several abbreviations are used regularly throughout this thesis. The abbreviations and their meaning are listed below in alphabetical order.

Abbreviations

Abbreviation Meaning

FP False Positive

FN False Negative

hog Histogram of Oriented Gradients

k-fold K-Fold Cross-Validation

lbp Local Binary Patterns

svm Support Vector Machines

TP True Positive

TN True Negative


1 Introduction

This master's thesis evaluates the classifier random forests [Breiman, 2001] for the purpose of cattle head and eye detection, as part of the engineering program in Computer Science and Engineering at the Department of Electrical Engineering (isy), Linköping University. The thesis work is performed at Agricam, a company that uses thermal cameras and image analysis to detect mastitis in dairy cows at an early stage.

1.1 Outline

This report is composed of six different chapters. Chapter 1 introduces the thesis problem, its background, and important limitations. Theoretical background for relevant methods and algorithms is presented in chapter 2.

The method used for this thesis work is presented in chapter 3. The results are presented in chapter 4, followed by a discussion in chapter 5. Final conclusions are presented in chapter 6.

1.2 Background

Mastitis is a common disease in dairy cows; every year it costs farmers around the world billions of dollars. In Sweden alone farmers lose around 190 million Swedish krona yearly due to the disease [Nielsen, 2009].

Being able to detect and prevent this disease is, therefore, an important priority for effective milk production. Preventing and stopping the disease before its outbreak would also lower the amount of antibiotics used, something that becomes increasingly important with each passing day [WHO et al., 2014].

By using thermal cameras and machine learning Agricam can detect mastitis at an early stage. They do this by using an udder detector to measure the temperature of the udder at regular intervals in order to detect inflammation.

Now Agricam wants to explore possibilities to detect other diseases as well, not only in dairy cows, but in the wider category of cattle. A good indicator of severe diseases in cattle is fever. Severe mastitis is only one of the diseases that show this symptom. The idea behind this thesis work is therefore to explore the possibility of fever detection by using eye temperature measurements. The first step would then be to see how difficult it is to find measurement points of eyes in thermal images.

As a starting point for this thesis work an experimental implementation of the random forests [Breiman, 2001] classifier is provided. Further provided are an experimental descriptor ("default") based on pixel differences, and software that utilizes this classifier and descriptor to detect regions of interest in images.

1.3 Motivation

There are several diseases that can afflict cattle. One common denominator for many of these diseases is fever, a rise in body temperature. Measuring the body temperature during regular intervals in order to detect fever in cattle is, therefore, of interest as a method to detect and hopefully prevent diseases.

One method to get this measurement is to use the temperature of the eye as a fever indicator [Johnson et al., 2011]. Finding the position of the eye in thermal images using machine learning could be an effective way to easily achieve this in a time when cattle herds grow continuously larger.

1.4 Purpose

The goal of the thesis work is to develop and evaluate an eye detector for thermal images of dairy cows. At the same time a head detector will be implemented using the same principles. This is to evaluate whether first detecting a head and then searching for an eye within the detected head gives a better result than only searching for an eye in the whole image. The purpose of these detectors is twofold:

1. Measure the temperature of a dairy cow eye to detect fever.

There is a desire for an easy method to detect whether cattle have a fever or not. Regular temperature measurements of the cattle's eyes are a possible solution to this problem. In order to measure eye temperature a detector that can locate the eye is needed.

2. Act as decision support for the udder detector.

The head detector can support the udder detector by telling it where in the image the head is located. This could possibly increase the success rate for udder detection.

1.5 Problem description

The following problem statements were formulated at the beginning of the thesis work.

How well does the provided classifier and descriptor work on cow heads?
The purpose of this problem statement is to evaluate how well the provided implementation of random forests performs if it is trained to classify heads using the provided descriptor.

How well does the provided classifier and descriptor work on cow eyes?
The purpose of this problem statement is to evaluate how well the provided implementation of random forests performs if it is trained to classify eyes using the provided descriptor.

What type of detection window should be used?

The detection window and its size determine what we look for and where we look for it in an image. Should we search the whole image for a head and then search the head for an eye, or is it more efficient to search for the eyes directly? Two different detection windows are seen in figure 1.1.

Figure 1.1: Two different detection windows. One focusing on the eye (a) and one focusing on the whole head (b).

What type of feature descriptor should be used?
There are many different feature descriptors, and they all have their own advantages and disadvantages. Two popular descriptors for object detection are Histogram of Oriented Gradients ("hog") [Dalal and Triggs, 2005] and Local Binary Patterns ("lbp") [Ojala et al., 1996]. The purpose of this problem statement is to evaluate these two descriptors and compare them against the provided descriptor.

How does random forests compare to svm for detection of heads and eyes on dairy cows?

There are, however, many alternatives, such as Support Vector Machines ("svm") [Cortes and Vapnik, 1995], one of the most popular classifiers. The purpose of this problem statement is to compare the results of random forests with the results of svm in order to say something about the suitability of random forests as a classifier for eyes and heads in images of cattle.

1.6 Limitations

This thesis work is limited by several factors. These factors are listed below.

Dataset

Four different datasets are provided by Agricam. They are composed of thermal images of dairy cows from two different farms. Depending on the farm, images differ in level of difficulty for classifying an object in the images. The level of difficulty depends on how close the cows walk, what angle the images are taken from, and how the background in the images looks.

Time

The thesis work is limited to 30 credits which is approximately 800 hours of work. These hours also include elements such as opposition, thesis presentation, and many other side activities.

Computation time

Because of the time limitation, too much time cannot be spent on computations. This also limits the amount of data that can be used, due to the computation time for training and evaluation of the algorithms during the thesis work. Computations are performed on hardware provided by Agricam.

One purpose of the thesis work is to develop and evaluate an eye detector. This detector may be used by Agricam in some form after the thesis work is complete, and it is therefore important that it can function in near real time. This demand also limits the computation time, mainly for the complete detector. A computation time of a few seconds for the complete detector is reasonable.

Focus

The goal of the thesis work is a detector for the head and eye of a cow. The thesis work does not deal with actual detection of fever or any other type of measurements.

Methods

Agricam provides an implementation of random forests, which is the main classifier to evaluate. For an indication of how well the classifier performs it is compared with an svm classifier. Two different feature descriptors, hog and lbp, are implemented and evaluated with the random forests classifier. A provided descriptor is also evaluated. Only one of the descriptors is implemented and evaluated with the svm classifier. No other methods are implemented or evaluated.

2 Theory

This chapter provides the reader with a theoretical framework for the relevant algorithms and methods.

2.1 Descriptors

Images are composed of thousands of pixel values. As an example a 128x128 sized gray-scale image has 16 384 different pixels. This large amount of data, where a single pixel on its own does not say anything about the image, is not always a suitable representation for image processing.

An often better representation for image processing is the descriptor. A descriptor is a more compact representation of the image or parts of the image. Desirable properties of a descriptor are invariance against parameters such as illumination, scaling of objects, or small shifts in position. Another desirable property is high discriminative power, making it possible to tell different objects apart [Forssén]. The descriptor often comes as a feature vector: a vector of features describing properties of an image region. These properties can be anything from the texture of an object to details about the object's shape. In this thesis work, the features will be represented as high-dimensional vectors.

Many different descriptors exist. This thesis focuses on the evaluation of two well known descriptors: the hog descriptor and the lbp descriptor.

2.1.1 Histogram of Oriented Gradients

hog was first introduced by Dalal and Triggs in 2005 [Dalal and Triggs, 2005] as a descriptor for human detection. It has since become a very popular descriptor used in many forms of object detection.

Like its name suggests, the hog descriptor describes the orientations of the image gradients. The image gradients are changes in color or intensity, meaning that the descriptor captures strong edges, like a contour, very well.

The general algorithm is composed of three important steps:

1. Gradient Computation
2. Orientation Binning
3. Block Normalization

Gradient Computation

The first step in computing the hog descriptor for an image is to find the gradient images Ix and Iy. This can be done by convolving the input image with a filter. A simple 1-D derivative filter often works best [Dalal and Triggs, 2005]. Filter examples for the x-dimension and the y-dimension are seen in (2.1).

Dx = [−1, 0, 1],  Dy = [−1, 0, 1]^T    (2.1)
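The gradient computation step can be sketched in Python with NumPy. This is an illustrative sketch, not the thesis implementation; leaving border pixels at zero is an assumption made here for simplicity.

```python
import numpy as np

def gradient_images(image):
    """Compute the gradient images Ix and Iy with the simple 1-D
    derivative filters Dx = [-1, 0, 1] and Dy = Dx^T from (2.1),
    applied as central differences. Border pixels are left at zero."""
    image = image.astype(float)
    ix = np.zeros_like(image)
    iy = np.zeros_like(image)
    # Central difference along x (columns) and y (rows).
    ix[:, 1:-1] = image[:, 2:] - image[:, :-2]
    iy[1:-1, :] = image[2:, :] - image[:-2, :]
    return ix, iy
```

For a horizontal intensity ramp, Ix is constant in the interior and Iy is zero everywhere, as expected.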

Orientation Binning

The next step is Orientation Binning. The image is divided into a number of cells, regions in the image. A histogram of the gradient orientation is computed for each cell in the image. A simple histogram with five bins is presented in figure 2.1.

Figure 2.1: An example histogram with five bins. According to the histogram the quantity of bin two is seven.

Each pixel in the cell gets a weighted vote for the gradient orientation, where the weight is a function of the gradient magnitude in the pixel. Often the magnitude itself gives the best result [Dalal and Triggs, 2005].


The vote is interpolated between neighbouring bins and added to the cell's histogram. Depending on whether the gradient is unsigned (0°–180°) or signed (−180°–180°), and the number of bins, the binning is somewhere between fine and coarse. For human detection nine bins with an unsigned gradient gives the best result, but for other objects another number of bins and a signed gradient may perform better [Dalal and Triggs, 2005].

Gradient magnitude |∇I| and gradient orientation θI can be calculated according to (2.2) and (2.3) respectively.

|∇I(x, y)| = √(Ix(x, y)² + Iy(x, y)²)    (2.2)

θI(x, y) = arctan(Iy(x, y) / Ix(x, y))    (2.3)

Block Normalization
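Equations (2.2) and (2.3) together with the magnitude-weighted, interpolated voting can be sketched for a single cell as follows. This is a simplified sketch, not the thesis implementation; the unsigned 0–180° range, nine bins, and bin-centre placement are assumptions.

```python
import numpy as np

def cell_histogram(ix, iy, n_bins=9):
    """Weighted orientation histogram for one cell (unsigned gradient,
    0-180 degrees). Each pixel votes with its gradient magnitude,
    linearly interpolated between the two nearest bins."""
    mag = np.hypot(ix, iy)                        # (2.2)
    ang = np.degrees(np.arctan2(iy, ix)) % 180.0  # (2.3), folded to [0, 180)
    bin_width = 180.0 / n_bins
    hist = np.zeros(n_bins)
    pos = ang / bin_width - 0.5        # continuous position between bin centres
    lo = np.floor(pos).astype(int) % n_bins
    hi = (lo + 1) % n_bins
    frac = pos - np.floor(pos)
    # Split each vote between the two neighbouring bins.
    np.add.at(hist, lo.ravel(), (mag * (1 - frac)).ravel())
    np.add.at(hist, hi.ravel(), (mag * frac).ravel())
    return hist
```

Note that the total histogram mass equals the summed gradient magnitude of the cell, since each vote is only split, never lost.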

Once histograms for each cell have been computed, the next step is local normalization to deal with variations in local illumination and contrast. This is performed over larger regions called blocks.

In [Dalal and Triggs, 2005] several different normalization schemes are suggested. Some of them are seen in (2.4)–(2.6), where v is the non-normalized descriptor vector, ||v||₁ is the Manhattan norm, and ||v||₂ is the Euclidean norm. ε is a small value added to ensure safety against division by zero, and should have only minor effect on the result.

L2-norm: v / √(||v||₂² + ε²)    (2.4)

L1-norm: v / (||v||₁ + ε)    (2.5)

L1-sqrt: √(v / (||v||₁ + ε))    (2.6)

Finally, the normalized blocks are concatenated into a single feature descriptor for the image. An example of the organization of the descriptor is presented in figure 2.2.
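The three normalization schemes (2.4)–(2.6) can be sketched directly. The value of ε is an assumption here; any sufficiently small constant behaves similarly.

```python
import numpy as np

def l2_norm(v, eps=1e-5):
    # (2.4): v / sqrt(||v||_2^2 + eps^2)
    return v / np.sqrt(np.sum(v * v) + eps ** 2)

def l1_norm(v, eps=1e-5):
    # (2.5): v / (||v||_1 + eps)
    return v / (np.sum(np.abs(v)) + eps)

def l1_sqrt(v, eps=1e-5):
    # (2.6): element-wise square root of the L1-normalized vector
    return np.sqrt(v / (np.sum(np.abs(v)) + eps))
```

After (2.4) the block vector has (approximately) unit Euclidean length; after (2.5) it sums to (approximately) one.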


Figure 2.2: Organization of the hog feature vector. In (a) the feature vector composed of blocks 1 to N is seen. In (b) a closer look at block 1 shows that it is composed of J cells. In (c) a closer look at cell 1 shows that it is composed of the cell's histogram using K bins.

2.1.2 Local Binary Patterns

lbp was first introduced in 1996 [Ojala et al., 1996] as a way to describe textures. It has since been used, for example, in facial recognition. Its fast computation time and strong pattern recognition make it a popular descriptor.

The lbp for a pixel p is calculated using its eight neighbouring pixels. The intensity of each neighbouring pixel pi, i = {1, ..., 8}, is compared to the intensity of p, with the output gi being either 0 or 1 according to (2.7).

gi = 1 if pi ≥ p,  gi = 0 if pi < p.    (2.7)

Computing gi for all neighbouring pixels and reading the result counter-clockwise from g1 to g8 outputs an 8-bit binary word that can be converted to a decimal value. This decimal value is used as the lbp code for pixel p. A graphical description of this process is presented in figure 2.3.


Figure 2.3: Calculation of an lbp code. Neighbouring pixel values are compared to the central pixel value. An 8-bit binary word is then constructed by going counter-clockwise. Finally the binary word is used in its decimal form.


The lbp descriptor is computed by dividing the image into a number of cells. For each cell a histogram of lbp codes is created by computing the lbp code for all pixels in the cell and counting the number of occurrences of each code. For the basic lbp algorithm there exist 2⁸ = 256 different codes, which means that each histogram will have 256 bins. The organization of the lbp feature vector is seen in figure 2.4.

Figure 2.4: Organization of the lbp feature vector. In (a) the feature vector consisting of cells 1 to N is seen. In (b) a closer look at cell 1, which is composed of the cell histogram using 256 bins, is seen.

Uniform patterns

Having 256 different lbp codes is not always efficient. In [Ojala et al., 2002] Ojala et al. discovered that certain patterns appear more often than others. They call these patterns uniform patterns and define them as patterns that have at most two 0/1 circular transitions. By this definition 01111000 and 10000011 are uniform while patterns such as 00110011 are non-uniform. By using uniform patterns and sorting all non-uniform patterns into a single bin the size of the lbp feature vector is greatly reduced.
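The uniformity test can be sketched by counting circular bit transitions:

```python
def is_uniform(code):
    """True if the 8-bit LBP code has at most two 0/1 transitions
    when its bits are read circularly."""
    bits = [(code >> i) & 1 for i in range(8)]
    transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    return transitions <= 2
```

Of the 256 possible 8-bit codes, 58 are uniform, which is why the uniform variant shrinks the histogram from 256 bins to 59 (58 uniform bins plus one bin for everything else).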

Rotation invariant LBP

Also in [Ojala et al., 2002], Ojala et al. propose a new version of lbp that utilizes the uniform patterns and is rotation invariant, denoted LBP_{P,R}^{riu2}. This operator does not use the eight neighbouring pixel values but instead uses P values equally distributed on a circle with radius R around the investigated pixel. See figure 2.5.

With this representation a rotation of the image can be seen as a rotation of the sample points. As an example 00111100 is the same as 00001111 but rotated. Using a circular bit-wise right shift on each uniform lbp code and selecting the lowest decimal number achieved removes the effect of rotation, making the lbp rotation invariant.

Using a rotation invariant version of lbp can also reduce the size of the feature vector. Using P = 8 will result in nine bins for uniform patterns and one bin for all non-uniform patterns, making a total of only ten bins.
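The rotation-removal step described above, taking the minimum over all circular bit-wise shifts of a code, can be sketched as:

```python
def rotation_invariant_code(code, p=8):
    """Smallest value over all circular bit-wise right shifts of a
    P-bit LBP code, which removes the effect of image rotation."""
    mask = (1 << p) - 1
    return min(((code >> i) | (code << (p - i))) & mask for i in range(p))
```

Rotated versions of the same pattern, such as 00111100 and 00001111, map to the same canonical code.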


Figure 2.5: Sample points for LBP_{P,R}^{riu2} with R = 1 and P = 8. Sample points not exactly in the center of a pixel are interpolated using bilinear interpolation.

2.2 Classifiers

A classifier is an implementation of an algorithm used for classification. It is commonly fed an image or part of an image represented as a feature vector. Using information learned during a training phase it labels the feature vector as belonging to a certain class.

The simplest form of classifier is the binary classifier, labeling objects as one of two different classes, one called positive and one called negative. The positive class is usually some desired object while the negative class is everything else. More advanced classifiers can label objects as one of several classes.

In this thesis work classifiers trained using labeled images, images that a user has already classified, are used. This is called supervised learning. Other types of learning are unsupervised learning and reinforcement learning.

Depending on the type of classifier different methods are used to differentiate a feature vector belonging to one class from a feature vector of another class. A simple example of such a method could be a single threshold. This threshold is saved and then used when classifying new images.

This thesis focuses on two different classifiers, random forests and svm, with main focus on the random forests classifier.


2.2.1 Random Forests

random forests was first introduced in 2001 by Breiman [Breiman, 2001]. It is an ensemble classifier that utilizes several weak classifiers to form a strong classifier. Each weak classifier performs a classification with a certain success, and the ensemble classifier uses all the weak classifications to make a strong classification with a good success rate.

The algorithm gets its name from two of its most important properties: its weak classifiers are decision trees, and there are several elements of randomness introduced during training. The random forests classifier is fast and runs efficiently on large data sets. It has no problems with overfitting, which is when the classifier is fitted too well to the training data, causing low performance on new data.

Decision Tree

Decision trees can be implemented in different ways depending on their purpose, but generally they are composed of a root, internal nodes and leaves. random forests for binary classification are composed of binary decision trees, where each node divides into two new nodes until the leaves are reached.

A decision tree is grown by traversing labeled data from the root downwards. At each node a threshold λ that divides the labeled data as well as possible is found and saved. This traversing continues until a stopping condition is fulfilled, which often is that a certain depth is reached or the amount of data sent down to a node is considered too small.

When this stopping condition is fulfilled an end-node called a leaf is created. In random forests the probabilities that the data that has reached the leaf belongs to the positive class and to the negative class are computed.

When performing classification on a new sample the data is traversed through the tree, at each node comparing a value x against the threshold λ to decide which way to go. Once the data reaches a leaf the probabilities are used to classify the sample as one class or the other. An example of a binary tree is presented in figure 2.6.

Figure 2.6: An example of a binary decision tree, with thresholds λ1, λ2, λ3 at the root and internal nodes, and leaves at the bottom.


random forests utilizes several trees to form a forest. When classifying new samples it lets all the decision trees in the forest classify the sample and then holds a vote, classifying the sample as the majority classification. A measure of certainty, how many of all trees voted positive, can also be returned together with the classification.

Randomness

random forests introduces randomness at two different levels: when selecting data to grow a tree, and once at each node when training that same tree.

During training of the forest, each tree is grown using a different training set. If the total number of labeled training samples is N, then for each tree a subset of length N is created by randomly selecting N samples with replacement from all labeled training samples. This is called bagging. Selecting with replacement means that it is possible to select the same sample several times. This leads to the different training sets being independent of each other. For a crude example see figure 2.7.

Figure 2.7: Sampling with replacement. Three different trees are trained

using three different training sets sampled from the original group of labeled training samples.

In [Breiman, 2001] Breiman found that splitting a node in a tree over just one feature gives a good result. Which feature to split over is the second layer of randomness introduced. If the total number of features is M then m different feature indexes, where m << M, are picked at random at each node and the split evaluated for each of them. The index of the feature that gives the best split is saved together with the threshold to split over.
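The two layers of randomness, bagging and the per-node feature subset, can be sketched as below. This is an illustrative sketch of the sampling only, not the thesis implementation, and the helper names are hypothetical.

```python
import random

def bootstrap_sample(samples, rng=random):
    """Bagging: draw N samples with replacement from N labeled
    samples, giving one tree its own training set."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]

def random_feature_subset(n_features, m, rng=random):
    """At each node, pick m << M candidate feature indexes
    (without replacement) over which to evaluate the split."""
    return rng.sample(range(n_features), m)
```

Each tree then trains on its own bootstrap sample, and each node only considers its small random subset of features when searching for the best threshold.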


2.2.2 Support Vector Machines

An early version of svm as a binary classifier was introduced in 1992 [Boser et al., 1992]. It performed well when training data could be separated without any errors. An extension that made it possible to perform well on non-separable data was made a few years later in 1995 [Cortes and Vapnik, 1995]. These are the linear svm and the non-linear svm, respectively.

The general idea behind svm is to find a decision boundary that separates two classes of data with as large a margin as possible. This decision boundary is a line in 2D, a plane in 3D, and a hyperplane in the general case. The margin is the distance between the decision boundary and the support vectors, the feature vectors that are positioned closest to the decision boundary.

In figure 2.8 the basic elements of svm are seen. This includes the support vectors, the decision boundary, and the margins. The example is from a two-dimensional linearly separable case but the general idea applies for all other cases as well.

Figure 2.8: Elements of svm. The decision boundary separates the classes. Maximum margins are found. Support vectors are circled.

Soft Margins

In the original implementation the margins were hard; they did not allow any feature vectors to be classified erroneously. This works as long as the data is linearly separable, separable by a hyperplane, but even then it sometimes leads to very small margins. Implementing soft margins, where some errors are allowed, was found to give a better overall result [Cortes and Vapnik, 1995], and makes it possible to classify some data that is not completely linearly separable.


Kernels

Not all data is linearly separable. One way to solve this is to use a map function to map all feature vectors into a high-dimensional feature space where the data is linearly separable. However, this mapping does not scale well with the number of input features [Ben-Hur and Weston, 2010].

To solve this svm uses kernel functions. A kernel function performs computations in the high-dimensional feature space without explicitly computing the mapping, allowing non-linearly separable data to become separable and a decision boundary to be found. Two kernel functions, the Polynomial kernel and the Gaussian kernel, are shown in (2.8) and (2.9) respectively.

K(x, x′) = (xᵀx′ + c)^d    (2.8)

K(x, x′) = exp(γ||x − x′||²),  γ = −1/(2σ²)    (2.9)

In both equations x and x′ are samples represented as feature vectors. In (2.8) c is a trade-off constant, and d is the degree of the polynomial. In (2.9) γ is a parameter that controls the width of the Gaussian.
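The two kernels (2.8) and (2.9) can be sketched directly. Parameterizing γ through a width σ is an assumption of this sketch; implementations often expose γ itself.

```python
import numpy as np

def polynomial_kernel(x, x2, c=1.0, d=2):
    # (2.8): (x^T x' + c)^d
    return (np.dot(x, x2) + c) ** d

def gaussian_kernel(x, x2, sigma=1.0):
    # (2.9): exp(gamma * ||x - x'||^2) with gamma = -1 / (2 sigma^2)
    gamma = -1.0 / (2.0 * sigma ** 2)
    return np.exp(gamma * np.sum((x - x2) ** 2))
```

Note that the Gaussian kernel of a vector with itself is always 1, and it decays towards 0 as the two vectors move apart, with σ controlling how quickly.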

2.3 Region Detector

When processing an image it is often not necessary to process the whole image. More commonly, certain regions of interest exist. A region of interest can for example be a face in an image of a person or a car in an image of a parking lot. Finding these regions of interest for processing can be done with a region detector.

There are several different region detectors. One such method is to look for in-terest points and then create a region around each point. In this thesis work a

method calledSliding Window is used.

2.3.1 Sliding Window

Sliding Window is one of the simplest detection methods. Let a window slide across the image, and at each position create a small sub-image called a patch. This patch can be used to compute a feature vector which can then be sent to a classifier.
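The procedure can be sketched as follows (a minimal Python illustration; the window size, stride, and image below are hypothetical, not the settings used in the thesis work):

```python
def sliding_window(image, win_h, win_w, stride=1):
    """Slide a win_h x win_w window across a 2-D image (list of rows),
    yielding (row, col, patch) for every position."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - win_h + 1, stride):
        for c in range(0, cols - win_w + 1, stride):
            patch = [row[c:c + win_w] for row in image[r:r + win_h]]
            yield r, c, patch

# A 4x4 image scanned with a 2x2 window and stride 2 yields 4 patches,
# each of which could be described and sent to a classifier.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
patches = list(sliding_window(image, 2, 2, stride=2))
print(len(patches))  # 4
```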

Figure 2.9: Sliding Window. A detection window slides across the image, and generates a patch at each position.

Using a sliding window can be computationally heavy since many regions are found and processed. In situations where this is not a problem Sliding Window can produce good results.

2.4 Evaluation

In order to compare different algorithms, be it feature descriptors or classifiers, some type of measurement of how well a classification has performed is needed. Two methods often used together for this purpose are Accuracy (”accuracy”) and K-Fold Cross-Validation (”k-fold”).

2.4.1 Concept of Accuracy

To measure the accuracy of a binary classifier a common method is the concept of true positive, true negative, false positive, and false negative as defined in [Swets, 1988]. In this definition a classification belongs to one of the categories below:

True Positive (”tp”)
A positive image is classified as positive.

True Negative (”tn”)
A negative image is classified as negative.

False Positive (”fp”)
A negative image is classified as positive.

False Negative (”fn”)
A positive image is classified as negative.

With these definitions the accuracy of a classifier applied to a set of images can be calculated according to (2.10), where tp, tn, fp, and fn are the total number of images in each category.

Accuracy = (tp + tn) / (tp + tn + fp + fn)    (2.10)

An accuracy of 1 is equivalent to 100% accuracy, meaning that all classified samples are correctly classified, i.e. fp = fn = 0. The opposite of accuracy is inaccuracy, defined in (2.11). An inaccuracy of 1 is equivalent to 100% inaccuracy, meaning that all classified samples are misclassified, i.e. tp = tn = 0.

Inaccuracy = 1 − Accuracy    (2.11)
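The two measures translate directly into code; a minimal sketch with hypothetical counts:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy (2.10): the fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def inaccuracy(tp, tn, fp, fn):
    """Inaccuracy (2.11): 1 - Accuracy."""
    return 1.0 - accuracy(tp, tn, fp, fn)

# Hypothetical example: 90 of 100 classifications are correct.
print(accuracy(tp=50, tn=40, fp=6, fn=4))    # 0.9
print(inaccuracy(tp=50, tn=40, fp=6, fn=4))  # ~0.1
```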

2.4.2 K-Fold Cross-Validation

k-fold is a method to measure how well a classification algorithm generalizes, i.e. a measurement of how well the classification algorithm is expected to perform on new data [Refaeilzadeh et al., 2009].

In k-fold the classification data is divided into k different parts, all approximately equal in size and mixed such that the probability of finding a certain class is equal across all parts.

Once the data has been divided the classifier is trained k times, each time using k − 1 parts of the data and testing against the remaining part. This yields k different accuracies that can be averaged to obtain an estimate of the accuracy expected when training on all k parts of the data and labeling new data. An example with k = 3 is presented in figure 2.10.
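The procedure can be sketched as follows (a minimal Python illustration; the interleaved split and the toy majority-vote classifier are hypothetical simplifications):

```python
def k_fold_accuracies(data, labels, k, train_and_eval):
    """Split the data into k parts, train on k-1 parts and evaluate on
    the held-out part, k times. Returns the k accuracies."""
    n = len(data)
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved split
    accuracies = []
    for i in range(k):
        held_out = set(folds[i])
        train = [(data[j], labels[j]) for j in range(n) if j not in held_out]
        test = [(data[j], labels[j]) for j in folds[i]]
        accuracies.append(train_and_eval(train, test))
    return accuracies

# Toy classifier: always predict the majority label of the training set.
def majority_eval(train, test):
    train_labels = [y for _, y in train]
    prediction = max(set(train_labels), key=train_labels.count)
    return sum(1 for _, y in test if y == prediction) / len(test)

data = list(range(12))
labels = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
accuracies = k_fold_accuracies(data, labels, 3, majority_eval)
print(sum(accuracies) / len(accuracies))  # average accuracy: 0.75
```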

Figure 2.10: 3-Fold Cross-Validation. Data is divided into three parts. Training and evaluating three times on the different parts yields an averaged total result.

3 Method

This chapter presents the methods used for the thesis work. First an overview of the methods is given, then the different datasets are presented, followed by deeper descriptions of the methods themselves.

3.1 Method overview

For a clearer understanding the methods can be divided into three steps. The first step is implementation and evaluation of descriptors and the random forests classifier, the second step is implementation and evaluation of a descriptor and the svm classifier, and the third step is evaluation of a region detector working together with descriptor and classifier. With this division, the thesis work is performed sequentially, i.e. first a descriptor is implemented, then it is evaluated, then it is used with a region detector, and so on. A deeper description of the method for the thesis work is provided below.

In the first step, the descriptors hog and lbp are implemented in C#, and their optimal parameters are found by evaluation with positive and negative patches together with a random forests classifier. Similarly, the parameters for the random forests classifier are optimized. Using these optimized parameters all descriptors, including the default descriptor, are evaluated with the random forests classifier. This process is presented in figure 3.1.

Figure 3.1: First step of methods.

In the second step, an svm classifier is used as a comparison for the random forests classifier, but only with one descriptor. This short process is presented in figure 3.2.

Figure 3.2: Second step of methods.

In the third step, the final detector utilizing the random forests classifier and a chosen descriptor is evaluated. Three different ways to find regions of interest are used. One that searches for heads, one that searches for eyes, and one that searches first for heads and then for eyes. This process is presented in figure 3.3.

Figure 3.3: Third step of methods.

3.2 Data

This section presents the data used during the thesis work. Some data is provided by Agricam, and some is produced during the thesis work. It is used in different forms to train and evaluate algorithms.

3.2.1 Datasets

Four different datasets composed of image sequences of cows on their way to milking have been collected by Agricam from two different farms. Each sequence is composed of ten to twelve images. From each farm two cameras have been used; one placed on the right side of the cows and one placed on the left side. The data has been collected from all regular milking hours of the day. The focus when producing the images has been the udders, meaning that there are sequences without any heads or eyes in them.

The four different datasets are denoted f1l, f1r, f2l and f2r. f1l and f1r are the left and right camera from one farm while f2l and f2r are the left and right camera from the other farm.

F1L

Dataset f1l is composed of images from the left camera of the farm that is considered simple. The backgrounds in the images are empty and the cows move calmly, causing the images to mostly contain only one cow. Images from the left camera have no objects obstructing the cows. On average every third sequence contains a head and an eye. Typical images taken from dataset f1l are shown in figure 3.4.

(a) (b)

Figure 3.4:Examples of images from dataset f1l.

F1R

Dataset f1r is composed of images from the right camera of the farm that is considered simple. Images from the right camera have two metal bars that obstruct the cows. On average three out of four sequences contain a head and an eye. Typical images taken from dataset f1r are shown in figure 3.5.


(a) (b)

Figure 3.5:Examples of images from dataset f1r.

F2L

Dataset f2l is composed of images from the left camera of the farm that is considered difficult. The backgrounds in the images are cluttered and the cows move faster, which means that in many images more than one cow can be seen. Images from the left camera have a metal bar that obstructs the cows. On average every sequence contains a head and an eye. Typical images taken from dataset f2l are shown in figure 3.6.

(a) (b)

Figure 3.6:Examples of images from dataset f2l.

F2R

Dataset f2r is composed of images from the right camera of the farm that is considered difficult. Images from the right camera have a metal bar that obstructs the cows. On average two-thirds of the sequences contain a head and an eye. Typical images taken from dataset f2r are shown in figure 3.7.


(a) (b)

Figure 3.7:Examples of images from dataset f2r.

3.2.2 Ground truth

Two different types of ground truth are produced manually during the thesis work, one for eyes and one for heads. In both cases the ground truth is produced from the provided datasets, one ground truth for each dataset and each body part.

To create each ground truth two types of annotations are used. The negative annotation no point is given to images with no point of interest. For images containing points of interest, such as heads or eyes, the coordinate of the interest point is saved as a positive annotation. All annotations are made manually. In figure 3.8 a positive ground truth annotation is seen for both the head and the eye.

Figure 3.8: An example of a positive ground truth for both head and eye. The blue circle is the positive annotation for an eye while the pink circle is the positive annotation for a head.


3.2.3 Training sets

Random forests and svm are trained using patches labeled as positive or negative. These labeled patches are produced using the provided datasets together with the ground truth.

For positive patches, 25 regions are cut out around a positive annotation as seen in figure 3.9. This procedure can be seen as using an overlapping five by five grid, where each grid element represents a region that is cut out and saved as a patch. The element in the middle of the grid is a rectangle around the positive annotation with the approximate size of an average head/eye for the dataset the image comes from. The surrounding grid elements are this rectangle but with a small shift in the x and y direction. For this thesis work, the grid elements are shifted by two pixels as compared to their neighbours.

For negative patches, regions are cut out across the whole image, using a shift of ten percent of the average head/eye size for the dataset the image comes from.

Figure 3.9: Positive training set generated by cutting out regions around a positive annotation.
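The grid procedure can be sketched as follows (a minimal Python illustration; the image and patch sizes are hypothetical, while the two-pixel shift matches the value used in the thesis work):

```python
def positive_patches(image, cx, cy, half_w, half_h, shift=2):
    """Cut 25 patches around the positive annotation (cx, cy): a 5x5
    overlapping grid of crops, each shifted `shift` pixels relative to
    its neighbours. `image` is a 2-D list indexed as image[y][x]."""
    patches = []
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            x0 = cx - half_w + dx * shift
            y0 = cy - half_h + dy * shift
            patches.append([row[x0:x0 + 2 * half_w]
                            for row in image[y0:y0 + 2 * half_h]])
    return patches

# A hypothetical 40x40 image annotated at (20, 20), eye size 10x10:
image = [[0] * 40 for _ in range(40)]
patches = positive_patches(image, cx=20, cy=20, half_w=5, half_h=5)
print(len(patches))  # 25 patches, each 10x10 pixels
```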

For training purposes 1,000 positive annotations are made for each dataset and body part. This produces 25,000 positive patches per dataset and body part, a number that is reduced to 20,000 positive patches by removal of every fifth patch. For the same datasets and body parts 400 negative annotations are made, producing around 125,000–200,000 negative patches per dataset and body part depending on the average size of a head/eye in each dataset. These are also reduced to 20,000.

The reason for this reduction is mainly due to memory issues during training. The classifiers are trained with the help of provided experimental software, and in this software all data is read and kept in memory during training. The other reason for this reduction is computation time, using more data means that each training and evaluation will take more time.


Some examples of positive and negative patches for both heads and eyes are seen in figure 3.10.

(a) (b) (c) (d)

Figure 3.10:Examples from training set based on f1l. Positive and negative patches from the head are seen in (a) and (b) while corresponding images for the eyes are seen in (c) and (d).

3.2.4 Evaluation sets

Because patches generated from the same positive annotation look only slightly different, evaluating with these patches may not give a true indication of how well the descriptors and classifiers will perform on new data.

To get a more truthful indication, a smaller set of patches is generated to use as an evaluation set. These patches are generated with the same method as the training set, but using different sequences of images to ensure that the results reflect real world conditions as closely as possible.

For this purpose 210 different sequences with approximately twelve images per sequence from each dataset are given positive and negative annotations. From these annotations approximately 10,000 positive patches and 100,000 - 200,000 negative patches are generated. These are then reduced to 5,000 patches of each type for each dataset.

3.2.5 Summary of data and ground truths

A summary of the quantities of data used is presented in table 3.1. These quantities are for one dataset and body part, meaning for each dataset there are 20,000 positive training patches for eyes and so forth.

Table 3.1: Quantities of different types of data used.

Type                          Quantity   Purpose
Positive training patches     20,000     Training of classifiers
Negative training patches     20,000     Training of classifiers
Evaluation sequences          210        Evaluation of region detectors, generation of evaluation patches
Positive evaluation patches   5,000      Evaluation, descriptors & classifiers
Negative evaluation patches   5,000      Evaluation, descriptors & classifiers

3.3 Implementation

This section describes the algorithms and methods implemented for the thesis work. The main categories are Descriptors, Classifiers, and Region detectors.

3.3.1 Descriptors

One of the main problem statements of this thesis work is how to represent patches, i.e. what type of descriptor to use. Two different descriptors, the lbp descriptor and the hog descriptor, are selected and implemented. Agricam also provides a fast descriptor, the default descriptor, based on pixel differences, to compare against.

During implementation the descriptors are optimized for dataset f2r, since this is considered the most difficult dataset. These parameters are then used for all datasets.

Histogram of Oriented Gradients

The hog descriptor, see 2.1.1, is mainly a description of contours and other lines in an image. Since head-patches have strong contours (the shape of the head), hog is chosen as a possibly good descriptor.

The descriptor is implemented in C# based on the general algorithm described in section 2.1.1. Selection of parameters is made by combining the descriptor with a provided implementation of the random forests classifier, and performing 3-fold cross-validation for various parameters. The results of the parameter selection are presented in chapter 4.

The patch is divided into a fixed number of cells, for example 2x2 cells or 3x3 cells. These cells are then divided into blocks that may or may not overlap. The blocks, cells, and their histograms decide how large the descriptor will be, see figure 2.2. The blocks and their layout decide which regions of the patch will be normalised together, see 2.1.1.
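To make the construction concrete, a single hog cell histogram can be sketched as follows (a minimal Python illustration of one cell; the thesis implementation is in C# and also performs block grouping and normalisation, which are omitted here):

```python
import math

def cell_histogram(cell, n_bins=9, signed=False):
    """Orientation histogram for one cell: central-difference gradients,
    with each pixel's gradient magnitude accumulated into the bin of
    its gradient orientation."""
    height, width = len(cell), len(cell[0])
    span = 360.0 if signed else 180.0
    hist = [0.0] * n_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % span
            hist[int(angle / span * n_bins) % n_bins] += magnitude
    return hist

# A vertical edge: all gradient energy falls in the 0-degree bin.
cell = [[0, 0, 10, 10]] * 4
print(cell_histogram(cell, n_bins=4))  # [40.0, 0.0, 0.0, 0.0]
```

The final descriptor is the concatenation of all cell histograms, normalised per block.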

Five different block configurations are tested for a patch divided into 3x3 cells. These configurations are seen in figure 3.11. Three different block configurations are tested for a patch divided into 2x2 cells. These configurations are seen in figure 3.12. As an example, the block configuration in figure 3.11a consists of one block containing all nine cells. This means that the whole patch will be normalised at the same time. On the contrary, the block configuration in figure 3.11b consists of nine blocks, where each block contains only one cell. In this configuration each cell is normalised on its own.

(a) One block with nine cells. (b) Nine blocks with one cell. (c) Four blocks with four cells. (d) Two horizontal blocks with six cells. (e) Two vertical blocks with six cells.

Figure 3.11: Different configurations of blocks for patches divided into 3x3 cells.

(a) One block with four cells. (b) Four blocks with one cell. (c) Four blocks with two cells.

Figure 3.12: Different configurations of blocks for patches divided into 2x2 cells.


Local Binary Patterns

The lbp descriptor, see 2.1.2, is mainly a description of local structure in an image. Since patches of eyes are nothing but structure, lbp is chosen as a possibly good descriptor.

The descriptor is implemented in C# based on the standard formulation described in section 2.1.2. Uniform patterns are also implemented. However, rotation invariant lbp was not implemented because of time limitations.
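The core of the descriptor can be sketched as follows (a minimal Python illustration, not the thesis C# implementation; the neighbour ordering is one arbitrary choice):

```python
def lbp_code(image, y, x):
    """8-neighbour LBP code: each neighbour >= centre contributes one bit."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise ring
    centre = image[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if image[y + dy][x + dx] >= centre:
            code |= 1 << bit
    return code

def is_uniform(code):
    """A pattern is uniform if its circular bit string has at most
    two 0/1 transitions (e.g. 00111100, but not 01010000)."""
    bits = [(code >> i) & 1 for i in range(8)]
    transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    return transitions <= 2

print(is_uniform(0b00111100))  # True
print(is_uniform(0b01010000))  # False
```

In uniform lbp, all non-uniform codes are merged into a single histogram bin, which shortens the descriptor considerably.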

As with hog, parameters are selected by performing 3-fold cross-validation together with a random forests classifier. The results of the parameter selection are seen in chapter 4.

The Default descriptor

The default descriptor is a descriptor provided by Agricam based on pixel differences. It is very fast to compute and very high-dimensional, making it a possibly good fit with the random forests classifier. It has no parameters to adjust and is provided fully implemented.

3.3.2 Classifiers

Two classifiers are used in this thesis work, the random forests classifier and the svm classifier.

Random Forests

An implementation of the random forests classifier is provided by Agricam. It utilizes bagging and works as described in section 2.2.1.

Three parameters can be set by the user: the number of trees in the forest, the maximum depth of each tree, and the number of features to evaluate at each node in each tree. As with the hog and lbp descriptors, optimal parameters are found by combining random forests with a descriptor and performing 3-fold cross-validation using various parameter values. The results of the parameter optimization are seen in chapter 4.

Support Vector Machine

The svm classifier is chosen for the thesis work because of its popularity and based on advice from the thesis supervisors. The main purpose of the svm classifier during the thesis work is to be compared against the provided random forests classifier. Because of this an already implemented version is used: the Matlab functions svmtrain [Mathworks.com, 2015b] and svmclassify [Mathworks.com, 2015a].

Due to time limitations the parameters for the svm classifier are not optimized. Instead the default parameters suggested by Matlab are used.

3.3.3 Region detectors

The provided software utilizes a region detector, a descriptor, and a classifier to function as a complete detector. It implements a Sliding Window algorithm to find regions for classification. Descriptors are computed for these regions and sent to the random forests classifier for classification.

The classifier returns the classification paired with a certainty between zero and one, see 2.2.1. A certainty of one means that all trees voted positive and a certainty of zero means that no trees voted positive. The provided software has a detection threshold that can be set by the user. The software will only accept positive classifications with a certainty above this threshold.

Margins can be set for the Sliding Window. No margins mean that the Sliding Window will move across the whole image, while larger margins mean that it will move across a smaller area of the image. The margins are given in pixels, meaning for example that a left margin of 17 makes the Sliding Window disregard the first 17 pixels from the left of the image. The detection threshold and the margins are optimized by evaluating with the evaluation set several times using different parameters. The measurement optimized for is accuracy.
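The effect of the margins can be sketched as follows (a minimal Python illustration with hypothetical image and window sizes):

```python
def window_positions(img_w, img_h, win_w, win_h, margins=(0, 0, 0, 0)):
    """Top-left positions visited by a Sliding Window, restricted by
    margins = (left, top, right, bottom) given in pixels."""
    left, top, right, bottom = margins
    return [(x, y)
            for y in range(top, img_h - bottom - win_h + 1)
            for x in range(left, img_w - right - win_w + 1)]

# With a left margin of 17 the window never enters the first 17 columns.
positions = window_positions(100, 100, 20, 20, margins=(17, 0, 0, 0))
print(min(x for x, _ in positions))  # 17
```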

Three different types of searches are of interest for the thesis work. Each uses a Sliding Window as a region detector, but in different ways and with different settings. Therefore, each type of search is henceforth referred to as a region detector. They are presented below.

Head-patches

The reason for finding a head is mainly supportive. The size of the Sliding Window is set to match the approximate size of a head in each dataset. The Sliding Window is moved across the image within the set margins.

Eye-patches

Finding eyes in images is the main purpose of this thesis. The size of the Sliding Window is set to match the approximate size of an eye in each dataset. The Sliding Window is moved across the image within the set margins.

Combination

A combination of a head detector and an eye detector is implemented during the thesis work. In this combination a head-sized Sliding Window is moved across the whole image searching for heads. Regions classified as heads are then searched again with an eye-sized Sliding Window, this time searching for eyes. Eye-patches are significantly smaller than head-patches meaning that a Sliding Window will find many more regions to be classified when searching through a whole image for eyes than if it is searching for heads. Using a combination and only searching in already classified heads may therefore improve speed.

Using a combination may also improve performance. Considering the images seen in figure 3.10, it is easier for a human to differentiate between what is a head and what is not a head, than between what is an eye and what is not an eye. It is therefore reasonable to believe that classification algorithms also have an easier time differentiating heads than eyes. Using a combination could therefore decrease the number of false eye detections by limiting the areas searched.
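The combination can be sketched as follows (a minimal Python illustration; the detector callbacks are hypothetical stand-ins for the Sliding Window searches described above):

```python
def combined_search(image, find_heads, find_eyes):
    """Two-stage search: scan the image for head regions, then search
    only inside each detected head region for eyes. `find_heads` and
    `find_eyes` are hypothetical detector callbacks."""
    eyes = []
    for head_region in find_heads(image):
        eyes.extend(find_eyes(head_region))
    return eyes

# Toy stand-ins: "heads" are rows containing a 2, "eyes" are the 2s.
find_heads = lambda img: [row for row in img if 2 in row]
find_eyes = lambda row: [v for v in row if v == 2]
image = [[0, 0, 0], [0, 2, 0], [1, 1, 1]]
print(combined_search(image, find_heads, find_eyes))  # [2]
```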

3.4 Evaluation

This section describes the methods used for evaluation of the thesis work. The main categories are Levels of evaluation, Evaluation of descriptors and classifiers, and Evaluation of region detectors.

3.4.1 Levels of evaluation

There exist three different levels in this thesis work: patch, image, and sequence. Evaluation is performed on two of them: patch and sequence.

Patch

A patch is a sub-image, a small part of an image. In the binary classifier case the patch is either positive or negative. A positive patch contains something interesting such as a head or an eye. A negative patch can depict anything else. Classification of a patch is considered a true positive if a patch labeled as positive is classified as positive. In the same way it is considered a true negative, a false positive, or a false negative according to the definitions in section 2.4.1.

Image

An image can contain several detections. This is because the Sliding Window detection window will have moved across the whole image, at each position producing a patch sent to the classifier for classification. All patches classified as positive will count as a detection in the image.

All detections have a confidence, a measure of how many trees voted positive. An image is considered a true positive if the detection with the highest confidence of all detections contains a positive annotation.

Sequence

A sequence is composed of several images, each image possibly containing several detections. A sequence is considered a true positive if the detection with the highest confidence of all detections in all images contains a positive annotation. The sequence accuracy is the most important evaluation measurement, since the purpose of this thesis work is a detector for measuring the temperature of a dairy cow. Each sequence represents one cow, so as long as the sequence is a true positive it does not matter if images in the sequence are falsely classified; a correct measurement of the cow's temperature can still be produced.
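The sequence-level decision can be sketched as follows (a minimal Python illustration; the detection records are hypothetical):

```python
def sequence_is_true_positive(sequence, contains_annotation):
    """A sequence is a true positive if the single highest-confidence
    detection across all its images contains a positive annotation.
    `sequence` is a list of images, each a list of detections."""
    detections = [d for image in sequence for d in image]
    if not detections:
        return False
    best = max(detections, key=lambda d: d["confidence"])
    return contains_annotation(best)

# Hypothetical data: the best detection (0.9) hits the annotation.
seq = [
    [{"confidence": 0.4, "hit": False}],
    [{"confidence": 0.9, "hit": True}, {"confidence": 0.6, "hit": False}],
]
print(sequence_is_true_positive(seq, lambda d: d["hit"]))  # True
```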

3.4.2 Evaluation of descriptors and classifiers

Evaluation of descriptors and classifiers is performed on patch level using evaluation sets. Parameters for both descriptors and classifiers are found by performing 3-fold cross-validation as described in section 2.4.2 and optimizing for accuracy on one dataset, the f2r dataset. The optimized parameters are then used with all datasets. All results are calculated as accuracy according to the definition from section 2.4.1, and then converted to inaccuracy for clearer visualization of the error.

All descriptors are evaluated with the random forests classifier using the 40,000 training patches and the 10,000 test patches generated for each dataset. Results are calculated for each dataset. These results are a measure of the performance for both the descriptors and the classifier.

The svm classifier is only evaluated with the lbp descriptor. The descriptor is implemented in Matlab using the exact same algorithm as the C# implementation, and the exact same training and evaluation patches are used as in the random forests evaluation.

3.4.3 Evaluation of region detector

Evaluation of different types of detection windows as a region detector is performed on sequence level with the random forests classifier and the default descriptor using optimal parameters. Motivation for this is given in chapter 4. The evaluation is performed using provided software.

4 Results

This chapter presents the results of the thesis work. The results from the implementation are presented first, followed by results from the evaluation.

4.1 Implementation

This section presents results from the implementation part of the thesis work. All results come from selection of parameters, and are presented as inaccuracy. To select parameters, partial optimization is performed. During this process, all parameters are first set to a default value. Optimization is then performed sequentially, with one parameter being optimized at a time. The best value found is chosen and used for subsequent parameter optimizations. This method of optimization may not find the global optimum, but it is deemed fast to implement, which is desired because of the limited time. Results are presented in order of implementation.
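The partial optimization can be sketched as follows (a minimal Python illustration; the score function and candidate values are hypothetical, with inaccuracy as the quantity to minimize):

```python
def partial_optimize(params, candidates, score):
    """Sequential (coordinate-wise) parameter selection: starting from
    default values, optimize one parameter at a time, keeping the best
    value found before moving on to the next parameter."""
    best = dict(params)
    for name, values in candidates.items():
        scored = {v: score({**best, name: v}) for v in values}
        best[name] = min(scored, key=scored.get)  # minimize inaccuracy
    return best

# Toy objective: inaccuracy is minimized at depth=10, trees=50.
score = lambda p: abs(p["depth"] - 10) / 10 + abs(p["trees"] - 50) / 100
defaults = {"depth": 5, "trees": 25}
grid = {"depth": [5, 8, 10, 16], "trees": [25, 45, 50, 70]}
print(partial_optimize(defaults, grid, score))  # {'depth': 10, 'trees': 50}
```

Because each parameter is fixed before the next one is searched, interactions between parameters can be missed, which is the trade-off mentioned above.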

4.1.1 Descriptors

Selection of parameters for the different descriptors is presented and motivated in this subsection. All results are from 3-fold cross-validation using the provided random forests implementation with default parameters: 50 trees, a maximum depth of 10, testing 10 features at each node. The results are shown as inaccuracy, defined as the difference between 100% accuracy and actual accuracy, see (2.11). All optimized parameters are found using the f2r dataset, because of time limitations and the fact that it is considered the most difficult dataset.

Histogram of Oriented Gradients

The hog descriptor has several parameters that can be adjusted, as described in section 2.1.1. At the start of optimization the following parameters are used as default: unsigned gradient, nine bins, dividing the patch into 3x3 cells and using blocks of 2x2 cells.

Unsigned vs Signed

The hog descriptor can be either signed (−180° to 180°) or unsigned (0° to 180°). Results for signed and unsigned hog for eye and head patches are seen in figure 4.1.

Figure 4.1: Signed vs unsigned hog for eyes and head.

These results indicate that a signed hog descriptor works best both for eye and head patches.

Number of bins

The orientation range of the hog descriptor can be divided into different numbers of bins. Results for different numbers of bins are seen in figure 4.2.

These results indicate that fewer bins work better for eye patches, with this pattern being approximately true for head patches as well. Four bins are chosen for eye patches and nine bins are chosen for head patches.

Figure 4.2: hog descriptor with various numbers of bins.

Number of cells

When computing the hog descriptor for a patch, the image is divided into a number of cells. Results for different numbers of cells are seen in figure 4.3.

Figure 4.3: hog descriptor with various numbers of cells.

These results show larger differences for eye patches. A division into 2x2 cells is chosen for eye patches and 3x3 cells for head patches.

Block configuration

The cells of the hog descriptor can be divided into several different block configurations, see figure 3.11 and figure 3.12. Results from different configurations are seen in figure 4.4 for eye patches and in figure 4.5 for head patches.

Figure 4.4: hog descriptor for patch divided into 2x2 cells with different configurations of blocks.

Figure 4.5: hog descriptor for patch divided into 3x3 cells with different configurations of blocks.

Configuration 3 is chosen for eye patches while configuration 5 is chosen for head patches.

Block normalization

Three different types of normalization were tested with the hog descriptor: the L2 norm, the L1 norm, and the L1-sqrt, see section 2.1.1. Results for the different normalizations are seen in figure 4.6.

Figure 4.6: hog descriptor with different block normalizations.
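The three block normalizations can be sketched as follows (a minimal Python illustration; v is the concatenated histogram vector of a block, and eps a small constant guarding against division by zero):

```python
import math

def l2_norm(v, eps=1e-6):
    """L2 block normalization: v / sqrt(||v||^2 + eps^2)."""
    n = math.sqrt(sum(x * x for x in v) + eps ** 2)
    return [x / n for x in v]

def l1_norm(v, eps=1e-6):
    """L1 block normalization: v / (|v|_1 + eps)."""
    n = sum(abs(x) for x in v) + eps
    return [x / n for x in v]

def l1_sqrt(v, eps=1e-6):
    """L1-sqrt: L1-normalize, then take the square root elementwise.
    Assumes non-negative entries, which holds for histograms."""
    return [math.sqrt(x) for x in l1_norm(v, eps)]

v = [3.0, 4.0]
print(l2_norm(v))   # approximately [0.6, 0.8]
print(l1_sqrt(v))   # approximately [0.655, 0.756]
```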

The L1-sqrt performs best for both eye and head patches and is therefore chosen.

Local Binary Patterns

The lbp descriptor has two parameters that can be adjusted, as described in section 2.1.2. At the start of optimization the following parameters are used as default: uniform lbp and dividing the patch into 3x3 cells.

Uniform LBP vs LBP

Uniform lbp was implemented during the thesis work. Results from ordinary lbp and uniform lbp are seen in figure 4.7.

The results indicate that uniform lbp performs better than ordinary lbp. Because of this uniform lbp is chosen.

Figure 4.7: Uniform lbp versus ordinary lbp for eyes and head.

Number of cells

When computing the lbp descriptor for a patch the image is divided into a number of cells. Results for different numbers of cells are seen in figure 4.8.

Figure 4.8: lbp descriptor with various numbers of cells.

The results indicate that the best inaccuracy is reached by dividing eye-patches into 2x2 cells and head-patches into 3x3 cells.

The Default descriptor

The default descriptor has no parameters to set, hence there are no results from the implementation phase.

4.1.2 Classifiers

This section presents selection of parameters for the classifiers random forests and svm. All results come from 3-fold cross-validation using the different descriptors with optimal parameters. The results are shown as inaccuracy, defined as the difference between 100% accuracy and actual accuracy, see (2.11).

Random Forests

The experimental random forests implementation has three parameters that can be set by the user: the maximum depth of a tree, the number of features to test at each node, and the number of trees in the forest. Default parameters used at the start of optimization are 25 trees with a maximum depth of 5, testing 5 features at each node.

Histogram of Oriented Gradients

The first parameter to optimize is the maximum depth of a tree. How this parameter affects the inaccuracy is seen in figure 4.9.

The results indicate that both for eye patches and head patches the improvement in inaccuracy levels off at a depth of approximately eight to ten. Since the computation time for each added layer increases exponentially, a low maximum depth is desirable. For eye patches a depth of ten is chosen, and for head patches a depth of eight is chosen.


[Two panels: inaccuracy (3-fold) versus maximum depth (2-16), (a) eyes, (b) head.]

Figure 4.9: Inaccuracy vs depth for random forests using the hog descriptor.

The next parameter to optimize is the number of features to test at each node in each tree. How this parameter affects the inaccuracy is seen in figure 4.10.


[Two panels: inaccuracy (3-fold) versus number of features (2-16), (a) eyes, (b) head.]

Figure 4.10: Inaccuracy vs number of features for random forests using the hog descriptor.

Computation time increases linearly with the number of features tested. This means that even though a low number of tested features is desirable, it is not crucial. Because of this, ten features are chosen for both eye and head patches.

The final parameter is the number of trees. How this parameter affects the inaccuracy is seen in figure 4.11.


[Two panels: inaccuracy (3-fold) versus number of trees (10-70), (a) eyes, (b) head.]

Figure 4.11: Inaccuracy vs number of trees for random forests using the hog descriptor.

As with the number of features tested, the computation time increases linearly with the number of trees. 55 trees are chosen for eye patches and 45 trees are chosen for head patches.

Local Binary Patterns

The same set of tests was performed with the lbp descriptor. The chosen parameters for eye patches are a maximum depth of nine, testing eleven features, and using 40 trees. For head patches the chosen parameters are a maximum depth of nine, testing ten features, and using 60 trees.

Results for maximum depth, number of features to test, and number of trees are seen in figure 4.12, figure 4.13, and figure 4.14.

[Two panels: inaccuracy (3-fold) versus maximum depth (2-16), (a) eyes, (b) head.]

Figure 4.12: Inaccuracy vs depth for random forests using the lbp descriptor.


[Two panels: inaccuracy (3-fold) versus number of features (2-16), (a) eyes, (b) head.]

Figure 4.13: Inaccuracy vs number of features for random forests using the lbp descriptor.
