
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2017

Automatic Eartag Recognition on Dairy Cows in Real Barn Environment


Maja Ilestrand
LiTH-ISY-EX--17/5072--SE

Supervisors: Gustav Häger, ISY, Linköpings universitet
             Felix Björkeson, Farmic AB
             Ola Grankvist, Farmic AB
Examiner:    Klas Nordberg, ISY, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2017 Maja Ilestrand


Abstract

All dairy cows in Europe wear unique identification tags in their ears. These eartags are standardized and contain the cow's identification number, which today is only used for visual identification by the farmer. The cow also needs to be identified by an automatic identification system connected to milking machines and other robotics used at the farm. Currently this is solved with a non-standardized radio transmitter, which can be placed at different places on the cow, and different receivers need to be used on different farms. Other drawbacks of the currently used identification system are that it is expensive and unreliable. This thesis explores the possibility of replacing this non-standardized radio frequency based identification system with a standardized computer vision based system. The method proposed in this thesis uses a color threshold approach for detection, a flood fill approach followed by Hough transform and a projection method for segmentation, and evaluates template matching, k-nearest neighbour and support vector machines as optical character recognition methods. The results of the thesis show that the quality of the data used as input to the system is vital. With good data, k-nearest neighbour, which showed the best results of the three OCR approaches, handles 98 % of the digits.


Acknowledgments

First, I would like to thank my supervisors at Farmic, Felix Björkeson and Ola Grankvist, for all their time, their help and their well-thought-out input. I would also like to thank all employees at Farmic and Fotonic for their interest in my work and for making me feel welcome at their office.

Secondly, I would like to thank the examiner, Klas Nordberg, for good input and quick responses to my questions. I also want to thank my supervisor at the university, Gustav Häger.

Also, I would like to thank my brother, Tom Ilestrand, for helping me with my grammar when writing this report.

Last, but not least, I would like to thank my boyfriend, Björn Werner, for all the encouragement, discussions and moral support.

Stockholm, June 2017 Maja Ilestrand


Contents

1 Introduction
  1.1 Motivation
  1.2 Purpose
  1.3 Problem description
  1.4 Limitations
  1.5 Thesis structure

2 Theory
  2.1 Related work
    2.1.1 Object detection
    2.1.2 Object segmentation
    2.1.3 Optical character recognition
  2.2 Color space HSV
  2.3 Flood fill algorithm with fixed range
  2.4 Line detection using probabilistic Hough transform
    2.4.1 Standard Hough transform
    2.4.2 Progressive probabilistic Hough transform
  2.5 Template matching
  2.6 Histogram equalization
  2.7 Eigenfaces
  2.8 Support vector machines
  2.9 k-nearest neighbour

3 Method
  3.1 Tag detection
  3.2 Tag segmentation
    3.2.1 Skew correction
    3.2.2 Segmentation
  3.3 Optical character recognition (OCR)
    3.3.1 Template data
    3.3.2 Training data for SVM and kNN
    3.3.3 Template matching
    3.3.4 Input to support vector machine and k-nearest neighbour algorithms
    3.3.5 Support vector machines
    3.3.6 k-nearest neighbour

4 Evaluation
  4.1 Data
  4.2 Results
    4.2.1 Outline
    4.2.2 A-data
    4.2.3 B-data
    4.2.4 C-data
    4.2.5 Comparison between A-, B- and C-data

5 Discussion
  5.1 Results
  5.2 Method
    5.2.1 Tag detection
    5.2.2 Tag segmentation
    5.2.3 Optical character recognition (OCR)
  5.3 Future work

6 Conclusions

1 Introduction

This thesis project is carried out at Farmic AB with supervision from the Department of Electrical Engineering (ISY) at Linköping University. Its purpose is to investigate the possibility of using computer vision to identify dairy cows by the identification tags in their ears, instead of the non-standardized radio frequency approach that is used today.

This subject is not covered in the literature, but it has similarities to the more investigated field of detecting and recognizing license plates on cars. Different approaches have been used for this, and relevant approaches are described in section 2.1.

1.1 Motivation

The use of machines and robotics in the agricultural industry is increasing, and dairy farmers are not excluded from this evolution [15]. Today, milking is done autonomously, either with a milking robot that is completely autonomous and does not need human interaction, or with a milking machine that only needs a human to put the machine on the teats. It is important to know which cow has been milked and which has not, which means that it is important to have an automatic identification system connected to the robot or machine. Today the identification system is based on radio frequency (RF), which is expensive and sometimes misjudges the identification. The RF based system is not standardized, so every new machine or robot that is introduced to the farm must be adapted to that system. All cows already wear a plastic identification tag in their ears. If it were possible to read the eartag with a simple RGB camera, the system would be standardized and less expensive, and the cow would not have to wear an RF transmitter.

1.2 Purpose

The current radio frequency based identification system is expensive, not standardized and requires the cow to wear an RF transmitter. The aim of this thesis is to investigate the possibility of finding a less expensive system that will work on every farm without alterations. Figure 1.1 shows an example image that the system will be implemented for.

Figure 1.1: Example image that the system will be implemented for.

1.3 Problem description

In this thesis, a computer vision based identification system for dairy cows will be implemented and evaluated. This is a rather broad field; therefore, three questions are formulated to describe the aim of the thesis.

This thesis will answer the following questions:

• Is it possible to use methods for licence plate recognition on eartags? What are the main differences between licence plate recognition and eartag recognition?

• Is it possible to obtain equally good results using template matching compared to a machine learning approach such as support vector machines (SVM) or k-nearest neighbour (kNN)?

• Is this computer vision system something that could be used commercially and replace the current identification system based on radio frequency?

1.4 Limitations

The implementation is made for still images.

This is mostly because of the lack of good cameras; the cameras used to capture the images in this thesis could not capture good videos.

The system does not have to decide whether there is no eartag in the scene, nor whether there is more than one eartag in the scene.

Only images with a visible eartag are chosen as input and only one eartag per image is taken into consideration. This is to narrow down the scope of the thesis. Deciding whether a detected object is an eartag or something else would make the implementation more complicated. The choice to only take one eartag per image into consideration is made for the same reason.

1.5 Thesis structure

The structure of the thesis is as follows.

• Chapter 2 presents the theory that is necessary for this thesis.

• Chapter 3 describes the method used.

• Chapter 4 presents the results of the evaluations.

• Chapter 5 presents a discussion of the results and of the method.

• Chapter 6 gives a conclusion of the evaluations and answers to the questions in section 1.3.

2 Theory

2.1 Related work

No studies on detecting and reading eartags on cows have been found, but this subject can easily be compared to the application of detecting and recognizing licence plates. Therefore, the related work that will be presented is focused on license plate recognition.

2.1.1 Object detection

The approach of using color extraction to detect the interesting part of the image is used in [4] and [17]. This seems to be a good approach if the interesting part has a certain color that stands out from the rest of the image. As shown in figure 1.1, the eartag has a distinct yellow color; therefore this may be a suitable method for detecting the eartag. Another approach that has shown good results is a Hough transform approach used in [7], where the authors extract lines and search for parallel lines that correspond to the plate. This approach might not be as good on eartags as it is on license plates, since the eartag only has three straight sides; the upper side contains a hanger, which is not straight.

2.1.2 Object segmentation

There are a number of different approaches for rectification of the digits mentioned in the literature. The rectification aims to rotate all digits to a horizontal position. J. Xing et al. [27] propose a Radon transform [21] method for aligning the license plate. A principal component analysis (PCA) method is suggested in [3], where the edges of the plate are detected by row and column wise masking. This means that pixels in a row or column that have the same neighbours are masked; from row wise masking the horizontal boundaries are extracted and from column wise masking the vertical boundaries are extracted. This results in the two middle images in figure 2.1. After that, the logical AND operator is used on the two images, resulting in the right image in figure 2.1, where pixels that do not end up in both of the masked images are removed; the remaining pixels will correspond to corners. PCA is performed on the remaining pixels, and the resulting eigenvectors indicate the orientation of the plate.

Figure 2.1: Schematic figure. From left to right: binarized image of a licence plate, row wise masking, column wise masking, and the result when the pixels that do not end up in both the row and column wise masking images are removed.

This approach for rectification was tested in this thesis, but a notable difference between a licence plate on a car and an eartag on a cow is the shape. The eartag is much more square shaped than the licence plate, making the eigenvalues corresponding to the eigenvectors of the eartag approximately the same size. It is the eigenvectors that indicate the orientation of the plate, or tag, and with eigenvalues of approximately the same size there will not be a definite direction, as shown to the right in figure 2.2. This is because the space of the eigenvectors corresponding to a specific eigenvalue will, in the eartag case, become two dimensional, unlike the space of the eigenvectors in the plate case, which will be one dimensional. To the left in the same figure, one eigenvalue is much bigger than the other; this direction describes the orientation of the plate.

Figure 2.2: To the left, a schematic figure of a rectangular license plate; the orientation is easy to find. To the right, a schematic figure of a quadratic eartag; the orientation is difficult to find. The arrows in the image represent the eigenvectors. On the right side there are three eigenvectors, which represent the difficulty of finding the orientation; there is no clear orientation of the quadratic eartag.


I. Shafiq Ahmad et al. [2] use a projection method for segmentation, a method where a projection in both the horizontal and vertical direction is performed. This method showed promising results when the plates were not perfect and the characters were either connected or broken.

2.1.3 Optical character recognition

The most common methods for character recognition of license plates are template matching based recognition methods, machine learning based recognition methods and feature based recognition methods. In [9] the authors compare feature based vector crossing, zoning and template matching. The results conclude that template matching is the method with the highest accuracy. In [20] several different scale invariant features are described and used for character recognition: distance to walls, cross-time feature, active region ratio and height to width ratio. These features are then used in a support vector machine (SVM) [8] for classification of the characters. This approach gave an accuracy close to 98 %, which is a good result for plate recognition.

2.2 Color space HSV

A regular digital color image uses the RGB color space, where every pixel in the image is represented by R (red), G (green) and B (blue). This is a Cartesian representation of color, often illustrated as a cube. The RGB color space can be transformed to the HSV color space, where H stands for hue, S for saturation and V for value. This is a cylindrical-coordinate representation; the RGB cube and the HSV cylinder are illustrated in figure 2.3.

Figure 2.3: The geometric representation of the RGB color space to the left and the HSV color space to the right. The RGB cube is a Cartesian representation and the HSV cylinder is a cylindrical-coordinate representation. Note that the two representations are in two different color spaces.


This transformation separates the image intensity from the color information, making it easier to detect a certain color. It was first introduced in [25] and developed for use in computer graphics to be able to specify color in a way that is more closely related to how humans perceive color. The transformation from RGB to HSV is:

M = max(R, G, B), m = min(R, G, B), C = M − m,

where max and min refer to the largest and smallest values among R, G and B in a color.

H = 60° × { undefined,           if C = 0
          { ((G − B)/C) mod 6,   if M = R
          { (B − R)/C + 2,       if M = G
          { (R − G)/C + 4,       if M = B        (2.1)

V = max(R, G, B)        (2.2)

S = { 0,     if V = 0
    { C/V,   otherwise        (2.3)

Note that the saturation is normalized, which has an effect on color values in dark regions. These values can have maximum saturation in the HSV color space, which means that saturation in HSV has a different meaning than the conventional one, where saturation is the colorfulness of an area judged in proportion to its brightness.
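In OpenCV, which is the library used for the implementation in chapter 3, this transformation is available directly. A minimal sketch is shown below; note that the image file name is only an illustrative placeholder, and that OpenCV stores 8-bit hue as H/2 so that it fits in the range 0-179.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // OpenCV loads color images as BGR by default.
    cv::Mat bgr = cv::imread("cow.jpg");
    if (bgr.empty()) return 1;

    // Convert to HSV; equations (2.1)-(2.3) are applied per pixel,
    // with H scaled to [0,179] and S, V scaled to [0,255] for 8-bit images.
    cv::Mat hsv;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);

    // Split the channels to inspect hue, saturation and value separately.
    std::vector<cv::Mat> channels;
    cv::split(hsv, channels);   // channels[0]=H, channels[1]=S, channels[2]=V
    return 0;
}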

2.3 Flood fill algorithm with fixed range

The flood fill algorithm (FF), also called seed fill, determines if connected pixels belong to an area that is specified by a seed point (start point) in an image. The input to the algorithm is an image, a seed point in the image, a target range and a replacement color. The values of the seed point's neighbours are then compared to the seed point's value. If a neighbour is within the specified target range compared to the seed point, the neighbour is set to belong to the seed point's area and is colored with the replacement color. The pixel at (x, y) is considered to belong to the seed point area if:

seedPoint − lowerDiff ≤ (x, y)        (2.4)


seedPoint + upperDiff ≥ (x, y)        (2.5)

where lowerDiff and upperDiff are the lower and upper differences, which are specified by the user. This is done for H, S and V respectively and continues until no neighbouring pixel is set to belong to the area [13]. Figure 2.4 (c) shows a successful flood fill on figure 2.4 (b). The lowerDiff used here is (60, 60, 110) and the upperDiff is (70, 150, 100). These values were found by testing.

(a) Original image in the RGB color space

(b) HSV image before flood filling

(c) Result of the flood filling

Figure 2.4: Example of a successful flood filling; the white color denotes the area belonging to the eartag, which is specified by the seed point that lies inside the yellow eartag. The areas inside the tag that are not completely filled are the parts that are darkest in (a).
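The fixed-range variant described above corresponds to OpenCV's cv::floodFill with the FLOODFILL_FIXED_RANGE flag, where every candidate pixel is compared to the seed value rather than to its already-filled neighbour. The sketch below uses the tolerance values quoted above; the file name and seed point are only illustrative assumptions.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat bgr = cv::imread("eartag.jpg");
    if (bgr.empty()) return 1;
    cv::Mat hsv;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);

    // Hypothetical seed point assumed to lie inside the yellow part of the tag.
    cv::Point seed(hsv.cols / 2, hsv.rows / 2);

    // Lower and upper differences per HSV channel, as in section 2.3.
    cv::Scalar lowerDiff(60, 60, 110);
    cv::Scalar upperDiff(70, 150, 100);

    // Fill the connected area with white; FIXED_RANGE compares to the seed value.
    cv::floodFill(hsv, seed, cv::Scalar(255, 255, 255), nullptr,
                  lowerDiff, upperDiff, 4 | cv::FLOODFILL_FIXED_RANGE);
    return 0;
}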

2.4 Line detection using probabilistic Hough transform

2.4.1 Standard Hough transform

The Hough transform (HT) is a well known and popular method for extraction of geometric primitives. The standard HT is used for detection of straight lines in an image; the straight line is often described as y = mx + b. It can also be described by the Hesse normal form:

r = x cos θ + y sin θ,        (2.6)

where r is the distance from the origin to the closest point on the straight line (blue line in figure 2.5) and θ is the orientation of r with respect to the x axis.


Figure 2.5: Parametric description of a straight line. The red line is the straight line; the blue line represents the distance from the origin to the closest point on the red line.

The Hough space is the (r, θ)-plane, where every (x, y)-point in the image is represented by a curve according to equation (2.6). All the points in the image are accumulated into the Hough space and the result will be several sinusoidal curves that cross where lines occur [23]. Often, pixels are missing on the lines or noise is present that may be mistaken for a line; to handle this, the HT performs an explicit voting procedure. For every point on the same line, the corresponding accumulator cell is incremented. The resulting peaks in the accumulator array tell where the lines are in the corresponding image.

2.4.2 Progressive probabilistic Hough transform

To minimize the computational cost of the Hough transform, J. Kittler et al. [14] proposed an algorithm in 2000 that they call the progressive probabilistic Hough transform (PPHT). The PPHT and the standard HT share the same voting pattern and the same representation of the accumulator array.

The PPHT proceeds as follows. A random point is selected for voting and is compared to a threshold; this is to determine if the point exists due to noise. When a line is detected, the remaining supporting points are removed and thus retract their votes. This repeats until all the points either have voted or have been assigned to a feature. This means that not every point has to vote, as in the standard HT, and the PPHT will therefore save computational cost compared to the standard HT. The outline of the PPHT algorithm is described in algorithm 1.

(19)

2.5 Template matching 11

Algorithm 1: Progressive Probabilistic Hough Transform
Data: Image, threshold l, threshold m
Result: Lines in image

if the input image is empty then
    return
else
    update the accumulator with a single pixel randomly selected from the input image;
    remove the pixel from the input image;
    if the highest peak in the accumulator that was modified by the new pixel is higher than the threshold l then
        look along the line specified by the peak in the accumulator and find the longest segment of pixels that is either continuous or has gaps no longer than the threshold m;
        remove the pixels in the segment from the input image;
        remove all the pixels from the line in the accumulator that previously voted;
        if the line segment is longer than the minimum length then
            add it to the output list
        else
            go back to the beginning;
        end
    else
        go back to the beginning;
    end
end
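OpenCV provides a probabilistic Hough transform of this kind through cv::HoughLinesP, which is one way the step in section 3.2.1 could be realized. The sketch below is illustrative only; the file name and the threshold, minimum-length and gap parameters are assumptions, not the thesis's values.

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    // Binary image, e.g. the outer contour of the flood-filled tag.
    cv::Mat edges = cv::imread("tag_contour.png", cv::IMREAD_GRAYSCALE);
    if (edges.empty()) return 1;

    std::vector<cv::Vec4i> lines;
    // rho = 1 pixel, theta = 1 degree; the accumulator threshold plays the
    // role of l, and maxLineGap plays the role of m in algorithm 1.
    cv::HoughLinesP(edges, lines, 1, CV_PI / 180.0,
                    /*threshold=*/50, /*minLineLength=*/40, /*maxLineGap=*/5);

    for (const cv::Vec4i& l : lines) {
        // Each detected segment is given by its two end points.
        std::cout << "(" << l[0] << "," << l[1] << ") - ("
                  << l[2] << "," << l[3] << ")\n";
    }
    return 0;
}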

2.5 Template matching

Template matching is used to detect the occurrence of a specific pattern or object in an image. This is done by defining a small image of the pattern or object, a template, and searching for a similar occurrence in a given image. The algorithm slides the template over the image and at each pixel a similarity measure is computed. There are several different approaches for the similarity measure; this thesis uses the normalized correlation coefficient [19] approach described in equation (2.7), where I denotes the image, T the template, R the result, w and h are the width and height of the overlapped patch of the image, (x', y') are the coordinates of a pixel in the template and (x, y) are the coordinates of the pixel under consideration in the image. This approach was chosen since it showed the best results on the data used in this thesis.


R(x, y) = Σ_{x',y'} ( T'(x', y') · I'(x + x', y + y') ) / sqrt( Σ_{x',y'} T'(x', y')² · Σ_{x',y'} I'(x + x', y + y')² ),        (2.7)

where

T'(x', y') = T(x', y') − (1 / (w·h)) · Σ_{x'',y''} T(x'', y''),
I'(x + x', y + y') = I(x + x', y + y') − (1 / (w·h)) · Σ_{x'',y''} I(x + x'', y + y'').
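The normalized correlation coefficient of equation (2.7) corresponds to OpenCV's TM_CCOEFF_NORMED matching mode. A minimal sketch is shown below; the file names are illustrative placeholders for a segmented tag and one of the digit templates from section 3.3.1.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat image = cv::imread("segmented_tag.png", cv::IMREAD_GRAYSCALE);
    cv::Mat templ = cv::imread("digit_template.png", cv::IMREAD_GRAYSCALE);
    if (image.empty() || templ.empty()) return 1;

    // R(x, y) holds the score of equation (2.7) with the template's top-left
    // corner placed at (x, y).
    cv::Mat R;
    cv::matchTemplate(image, templ, R, cv::TM_CCOEFF_NORMED);

    // The best match is the global maximum of R.
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(R, &minVal, &maxVal, &minLoc, &maxLoc);
    cv::Rect bestMatch(maxLoc, templ.size());
    return 0;
}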

2.6 Histogram equalization

Equalization of a histogram from an image aims at improving the contrast in the image by stretching out the intensity range. This makes the dark areas black and the bright areas white in all images, which is useful in the SVM and kNN to avoid the classification depending on the intensity of the background or digit. This is done by mapping the pixel values from the original image into a wider range, making the intensity values of the new histogram spread over the whole range.

Consider an image of integer pixel intensities ranging from 0 to L-1. The original pixel intensities, k, are transformed to new values, T(k), by the function:

T(k) = floor( (L − 1) Σ_{n=0}^{k} p_n ),        (2.8)

where floor() rounds down to the nearest integer [22]. p_n is the normalized histogram of a given image, with a bin for each possible intensity, and is calculated as:

p_n = (number of pixels with intensity n) / (total number of pixels),    n = 0, 1, ..., L − 1,        (2.9)

where L is the number of possible intensity values, often 256.

(a) No histogram equalization (b) After histogram equalization

Figure 2.6: To the left the original image is shown. To the right the result from histogram equalization on image (a) is shown.

(a) Original histogram (b) Equalized histogram

Figure 2.7: To the left, the histogram of the left image in figure 2.6 is shown; the histogram contains two distinct peaks. To the right, the histogram of the right image in figure 2.6 is shown; the histogram has been stretched out.
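For 8-bit grayscale images, the mapping of equation (2.8) with L = 256 is implemented in OpenCV by cv::equalizeHist. The sketch below also shows an adaptive threshold of the kind used in section 3.3.2; the file name, block size and constant are illustrative assumptions, not values taken from the thesis.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat gray = cv::imread("digit_patch.png", cv::IMREAD_GRAYSCALE);
    if (gray.empty()) return 1;

    // Histogram equalization: stretches the intensities over the full 0-255 range.
    cv::Mat equalized;
    cv::equalizeHist(gray, equalized);

    // An adaptive threshold can then binarize the equalized patch.
    cv::Mat binary;
    cv::adaptiveThreshold(equalized, binary, 255,
                          cv::ADAPTIVE_THRESH_MEAN_C, cv::THRESH_BINARY,
                          /*blockSize=*/11, /*C=*/2);
    return 0;
}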

2.7 Eigenfaces

The recognition method called eigenfaces (also known as eigenimages) was developed by L. Sirovich and M. Kirby [16] for classification of face images. Even though the method can be used on any image, the name eigenfaces remains from its primary application. This method is not often used as a classification method today, but the way of representing an image using eigenfaces is commonly used as a basis in many modern algorithms.


To capture variation in a set of images, eigenfaces employs principal component analysis (PCA). PCA is a method that can reduce data dimensionality by finding a basis of eigenvectors. It was introduced by Pearson [18] and its purpose is to transform a set of data to make it linearly uncorrelated. The transformation is defined by a number of principal components, which are calculated based on the sample covariance matrix, Q = [q_jk]. The data used to calculate the sample covariance matrix are N images, all containing M pixels. The element q_jk is an estimate of the covariance between the j:th and k:th pixel over all N images. q_jk is calculated as:

q_jk = (1 / (N − 1)) Σ_{i=1}^{N} (x_ij − x̄_j)(x_ik − x̄_k),        (2.10)

where x̄_j is the sample mean of the j:th pixel over the N different images and x_ij is pixel j from image i. The eigenvectors of the covariance matrix Q are the principal components and the corresponding eigenvalues are proportional to the variance along these vectors. This means the eigenvalues from a PCA give a ranking of which components have large variance in the signal along their dimensions.

The result from eigenfaces is a basis of these eigenvectors (also called eigenfaces) that can be used to represent the images with lower dimensionality than before. The eigenvectors are calculated as follows:

1. Arrange the N number of images in column vectors. This will result in a set of vectors:

{I_1, I_2, ..., I_N}        (2.11)

2. Calculate the mean vector, Π, of all the vectors I. This mean vector can be seen as a mean image when reshaped, as in figure 2.8.

Figure 2.8: Reshaped mean vector, Π, of images of the digits 0 to 9; it has strong values where the shapes of the different digits coincide.


3. Create a concatenated matrix, A, of all vectors, I, minus the mean vector, Π.

A = {I_1 − Π, I_2 − Π, ..., I_N − Π}.        (2.12)

4. Calculate the eigenvectors, vi, defined by:

A Aᵀ v_i = λ_i v_i.        (2.13)

This calculation may become very computationally expensive, since the matrix A Aᵀ may get very big. For an image of size 192×75, which is used in this thesis, the corresponding pixel vector, I, will be 14400 pixels long. This means A Aᵀ will have the size 14400×14400. To avoid this, Aᵀ A is used instead, which will have a size of N×N. The eigendecomposition used is now:

Aᵀ A ω_i = λ_i ω_i,        (2.14)

where ω_i is an eigenvector of Aᵀ A. By multiplying with A from the left, we get that the eigenvectors, v_i, can be expressed as:

v_i = A ω_i / ‖A ω_i‖.        (2.15)

There is a drawback to using Aᵀ A instead of A Aᵀ: a maximum of N eigenvectors can be calculated. With an N that is big enough this should not be a problem, since not every eigenvector is needed; only the eigenvectors with a large corresponding eigenvalue contribute enough information.

The mean subtracted images are projected into a low dimensional subspace where every image is represented as a point. The classification part of the eigenfaces method finds the closest training samples in the created eigenface subspace. An image can be projected into this space using a constructed basis where the eigenvectors are the basis vectors. This will result in a set of coordinates calculated as:

c = V (I − Π), (2.16)

where V is a matrix containing the chosen eigenvectors, v_i, as rows. The coordinates, c, give the weights of the eigenvectors that are needed to represent a specific class. By finding the shortest distance from the current coordinates c to one of the training coordinates c_i, a corresponding class can be found. The classification part of eigenfaces is not used in this thesis, but the coordinates c, which give a dimensionality reduction, are used as input to the SVM as well as the kNN.

2.8 Support vector machines

Support vector machines (SVM) is a machine learning classifier that, by supervised learning, outputs an optimal hyperplane which categorizes new samples [1]. This approach gets the input training data in a high dimensional feature space; the feature space used in this thesis is the one described in section 2.7. In this feature space, the distance between the closest data points in each class and a plane is maximized by a unique separating hyperplane, as shown in figure 2.9. This distance is known as the margin, M; a larger margin corresponds to a more accurate classifier. This is because the hyperplane helps classify the new samples: the longer the distance between the hyperplane and the training samples, the easier the classification becomes. Figure 2.9 shows a schematic drawing of a separating hyperplane between two classes in a two dimensional space.

Figure 2.9: The line that separates the blue and red dots (two different classes) is the hyperplane; it is placed to maximize the margin, M.

The points that lie closest to the margin are known as the support vectors and it is the support vectors that are included in the decision function. Due to that, the complexity of an SVM model is not dependent on the size of the training set but on the number of support vectors, which is generally smaller than the training set.

Let us introduce w and b, which are, respectively, the vector normal to the hyperplane and its displacement relative to the origin. The hyperplane can be expressed as:

w · x − b = 0,        (2.17)

where x is a vector containing the data points. If a point satisfies equation (2.17), the point lies on the hyperplane. The points that do not lie on the hyperplane will give w · x − b > 0 and w · x − b < 0 respectively, on each side of the hyperplane. If the data points are linearly separable there will be two hyperplanes, the dotted lines in figure 2.9, with no data point in between, that separate the two classes. These hyperplanes can be described as w · x − b = 1 and w · x − b = −1, where


w · x − b ≥ 1        (2.18)

for one class and

w · x − b ≤ −1        (2.19)

for the other class. If we introduce y, a vector holding the class labels of the two classes (1 and −1 in this case), we can describe equations (2.18) and (2.19) as:

y_i(w · x_i − b) ≥ 1,    i = 1, ..., N,

where N is the number of training cases. By dividing both sides of this inequality by ‖w‖ we get:

y_i(w · x_i − b) / ‖w‖ ≥ 1 / ‖w‖ = M,    i = 1, ..., N.        (2.20)

It is desired to choose w and b to maximize M. To do this, we need to minimize ‖w‖. This is difficult to solve directly, but by substituting ‖w‖ with (1/2)‖w‖² we get an easier problem without altering the solution. The problem to be optimized in the training can be expressed as:

minimize over w, b:    (1/2)‖w‖²        (2.21)
subject to:    y_i(w · x_i − b) − 1 ≥ 0,    i = 1, ..., N.

If the data is not linearly separable the SVM model described above will diverge and grow arbitrarily. A well known way to solve this is by a soft margin which allows, but penalizes, misclassifications. The problem to be optimized in the training can now be expressed as:

minimize over w, b:    (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i        (2.22)
subject to:    y_i(w · x_i − b) − 1 + ξ_i ≥ 0,    ξ_i ≥ 0,    i = 1, ..., N,

where C is a capacity constraint and ξi is a slack variable that can be expressed

as:

ξ_i = max(0, 1 − y_i(w · x_i − b)).        (2.23)

This minimization problem can be solved using Lagrange multipliers [1]. This approach not only maximizes the margin by minimizing ‖w‖ but also penalizes ξ through C. The larger C is, the more ξ will be penalized; this is important to keep in mind, since it may cause the SVM model to overfit the data.

The training data points are mapped to a high dimensional space which can make the data linearly separable. For this a kernel function is used to simplify the calculations that come with mapping the data points to the higher dimensional space. Commonly used kernel functions are a radial basis function (RBF) and a linear kernel [1], where the latter is used in this thesis.

In the classification part, where new samples are classified, the same mapping procedure is done and the result is compared to the SVM boundary. The position of the new sample relative to the SVM boundary will imply its label.
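A linear-kernel SVM of this kind is available in OpenCV's machine learning module, which is how the classifier in section 3.3.5 could be set up. The sketch below is an assumption-laden illustration: the randomly filled feature matrix and cyclic labels merely stand in for the real eigenface coordinates and digit classes.

#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>

int main() {
    // Placeholder training set: one 40-dimensional coordinate vector per patch.
    cv::Mat trainData(1300, 40, CV_32F);
    cv::Mat labels(1300, 1, CV_32S);
    cv::randu(trainData, 0.0, 1.0);
    for (int i = 0; i < labels.rows; ++i) labels.at<int>(i, 0) = i % 10;

    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::LINEAR);   // linear kernel, as in the thesis
    svm->setTermCriteria(cv::TermCriteria(cv::TermCriteria::MAX_ITER, 1000, 1e-6));

    cv::Ptr<cv::ml::TrainData> td =
        cv::ml::TrainData::create(trainData, cv::ml::ROW_SAMPLE, labels);
    // trainAuto searches for C using cross-validation.
    svm->trainAuto(td);

    // Classify one new coordinate vector.
    cv::Mat sample = cv::Mat::zeros(1, 40, CV_32F);
    float predictedDigit = svm->predict(sample);
    (void)predictedDigit;
    return 0;
}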

2.9 k-nearest-neighbour

The k-nearest-neighbour (kNN) algorithm is a memory-based classifier that does not require any model to be trained [12]. kNN uses the k nearest training samples to vote for the membership of a new sample. The input data, x, is assigned to the most frequent class appearing in the vector V_x, which consists of the k nearest neighbours to x; this is illustrated in figure 2.10. The most commonly used distance measure to decide which class is nearest is the Euclidean distance in feature space:

d_i = ‖x_i − x'‖        (2.24)

Other distance measures that are used are Manhattan distance, Minkowski distance and Hamming distance.

Despite the simplicity of kNN, it has been successful in many classification problems where each class has many possible prototypes and the decision boundary is irregular [12], for example handwritten digits, which all look different from each other. One drawback with kNN is that it is memory-based, which means that the training data must be stored in memory; this can be a problem when using a large amount of training data.

Figure 2.10: By measuring the distance to the k nearest training samples, the class of the new sample is decided. This example uses k = 3, and the neighbours are two green squares and one yellow triangle. The green square is the most frequent one, so the new sample will be classified as a green square.
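In OpenCV this classifier is provided by cv::ml::KNearest, which is one way the kNN of section 3.3.6 could be realized with k = 3. As in the SVM sketch, the randomly filled features and cyclic labels below are placeholders for the real eigenface coordinates and digit classes.

#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>

int main() {
    // Placeholder training set in the same eigenface-coordinate format as for the SVM.
    cv::Mat trainData(1300, 40, CV_32F);
    cv::Mat labels(1300, 1, CV_32S);
    cv::randu(trainData, 0.0, 1.0);
    for (int i = 0; i < labels.rows; ++i) labels.at<int>(i, 0) = i % 10;

    cv::Ptr<cv::ml::KNearest> knn = cv::ml::KNearest::create();
    knn->train(trainData, cv::ml::ROW_SAMPLE, labels);

    // Classify one new sample by a vote among its k = 3 nearest neighbours.
    cv::Mat sample(1, 40, CV_32F);
    cv::randu(sample, 0.0, 1.0);
    cv::Mat results;
    knn->findNearest(sample, /*k=*/3, results);
    return 0;
}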

3 Method

This chapter describes the system implemented in this thesis. The implementation is done in C++ using the open source library OpenCV in Microsoft Visual Studio 2015. The data used in this thesis are 204 images collected at a farm in northern Sweden. The images were captured by three different persons using four different cameras. The images are taken under different light settings, from different angles and on different cows, to make sure the system is trained and evaluated with a representative data set.

The system is divided into three main modules:

• Tag detection

• Tag segmentation

• Optical character recognition (OCR)

Figure 3.1 shows a schematic view of the system.

3.1 Tag detection

This implementation assumes that there is always an eartag in the image, and the system only detects one eartag, even if more eartags are present. The first step in the system is to detect the eartag in the image. Since all the eartags have the same standardized yellow color, this thesis uses color extraction to detect the eartag in the image, as mentioned in section 2.1. This is done by transforming the input RGB image to the HSV color space. The HSV image is then thresholded for yellow, which provides a binary image on which some morphological operations are done to remove noise. The yellow color used on eartags lies in the range: Hue: 22-40, Saturation: 95-255, Value: 50-255.


Figure 3.1: System overview

As an example, all values in image 3.2 (b) that lie within the specified range are set to 1 (white) and the values outside the range are set to 0 (black), resulting in the binary image 3.2 (c).

(a) Input image (b) HSV image (c) Thresholded image

Figure 3.2: Object detection steps. (a) The input image in RGB color space, (b) the input transformed to HSV color space, (c) the binary image after color thresholding, in which the eartags can be detected.

In figure 3.2 there are three eartags, but only one will be taken into consideration. The biggest blob in the binary image is detected as an eartag and is cropped out from the input image. The cropping is done with some margin to make sure the entire eartag is included.
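A minimal sketch of this detection step is given below, using the HSV range quoted above. It is not the thesis's actual code: the file name, the morphological kernel size and the 10-pixel margin are illustrative assumptions.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

int main() {
    cv::Mat bgr = cv::imread("barn_image.jpg");
    if (bgr.empty()) return 1;

    cv::Mat hsv, mask;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);
    // Yellow eartag range from section 3.1: H 22-40, S 95-255, V 50-255.
    cv::inRange(hsv, cv::Scalar(22, 95, 50), cv::Scalar(40, 255, 255), mask);

    // Morphological opening removes small noise blobs.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN, kernel);

    // Keep the largest blob and crop it from the input with a small margin.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    double bestArea = 0.0;
    cv::Rect tagBox;
    for (const auto& c : contours) {
        double area = cv::contourArea(c);
        if (area > bestArea) { bestArea = area; tagBox = cv::boundingRect(c); }
    }
    if (bestArea > 0.0) {
        const int margin = 10;   // illustrative margin
        tagBox.x = std::max(0, tagBox.x - margin);
        tagBox.y = std::max(0, tagBox.y - margin);
        tagBox.width  = std::min(bgr.cols - tagBox.x, tagBox.width + 2 * margin);
        tagBox.height = std::min(bgr.rows - tagBox.y, tagBox.height + 2 * margin);
        cv::Mat tag = bgr(tagBox).clone();
    }
    return 0;
}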

3.2 Tag segmentation

The tag segmentation step includes both correction for skewed eartags and segmentation of the eartags.


3.2.1 Skew correction

To find the skew and orientation of the detected eartag, the sides of the eartag must be detected. Due to different light settings, dirty eartags, motion blur and varying quality of the images, the color threshold method mentioned in section 3.1 is not good enough for finding edges, and some extra steps need to be done. Performing the Hough transform directly on a gradient image does not give the desired results either, for the same reasons.

The first step is to perform a flood fill, as described in section 2.3, on the eartag image. Three different representations of the eartag image were tested for this step: the RGB image, the grayscale image and the HSV image. The HSV image showed the best results and was therefore used. To get a suitable seed point as input to the flood fill algorithm, a small rectangle was cropped out from a grayscale eartag image using the results from the color threshold method described in section 3.1. The seed point used was the pixel with the maximum value, to ensure that the seed point lies in a yellow area and not on a digit, which is black. This thesis uses a flood fill with fixed range to prevent the flood fill from going outside the eartag. A successful flood fill is shown in figure 2.4.

The outer contours of the result from the flood fill are then obtained and sent to a progressive probabilistic Hough transform (PPHT) algorithm; the PPHT algorithm is described in section 2.4.2. The PPHT algorithm detects the lines and sends them to a merge line algorithm. This algorithm searches for lines with at most 5 degrees difference relative to the x-axis and at matching positions. If two or more lines that should be merged are found, the algorithm takes the end points of the lines that lie as far apart from each other as possible, draws a new line between those points and removes the other lines. Figure 3.3 illustrates the result from the merge line function.

(a) Lines detected by the probabilistic HT, 6 lines

(b) Lines after the merge line algorithm, 4 lines

Figure 3.3: Illustration of detected lines and the outcome from the merge line algorithm.


The lines that represent the angles needed for the skew correction are obtained by choosing the line that corresponds to the underside of the tag and the line or lines corresponding to the sides of the tag. The upper side of the tag is not taken into consideration and is removed. This is because it is hidden by the cow's ear in many images, and the upper side is not useful for line detection since it does not contain a straight line. The underside and side lines are chosen as follows:

1. The lines with a small angle with respect to the x-axis are detected as candidates for the underside, and the lines with a small angle with respect to the y-axis are detected as candidates for the side.

2. The longest line in each group is the most likely to be accurate, and the angles corresponding to these lines are therefore chosen to represent the skewness of the eartag.

These two angles are then used to calculate the affine transform of the tag, which is used to warp the image, making the eartag horizontal. The 2×3 matrix of the affine transform is calculated from the two lines described by the angles above to two lines described by a square with the angles 0 and 90 degrees relative to the x-axis. This is illustrated in figure 3.4.

Figure 3.4: How the affine transform is calculated. To the left, a schematic figure of the detected eartag; to the right, the result of the affine transform.
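One way to realize this mapping in OpenCV is to pick three point correspondences from the chosen lines and use cv::getAffineTransform followed by cv::warpAffine. The coordinates below are made-up examples of such correspondences, not values from the thesis.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat tag = cv::imread("cropped_tag.png");
    if (tag.empty()) return 1;

    // Hypothetical points on the detected lines: the lower-left corner of the
    // tag, a point along the skewed underside and a point along the skewed side.
    cv::Point2f src[3] = { {20.0f, 180.0f}, {160.0f, 195.0f}, {30.0f, 40.0f} };

    // The same points mapped so that the underside becomes horizontal (0 degrees)
    // and the side vertical (90 degrees).
    cv::Point2f dst[3] = { {20.0f, 180.0f}, {160.0f, 180.0f}, {20.0f, 40.0f} };

    // 2x3 affine matrix and the warp that straightens the eartag.
    cv::Mat M = cv::getAffineTransform(src, dst);
    cv::Mat rectified;
    cv::warpAffine(tag, rectified, M, tag.size());
    return 0;
}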

3.2.2 Segmentation

The projection method mentioned in [2] has influenced the implementation in the segmentation step.

The goal in this step is to remove all background around the tag, leaving just the eartag. To be able to do this, the sides of the eartag must be found once again. It is not possible to use the lines detected by the PPHT algorithm, since that algorithm, together with the flood fill algorithm, fails to detect all sides too often. To find the underside of the tag, the image gradients are extracted by using a Sobel filter in the y-direction on a grayscale image. This gives an image, as shown to the left in figure 3.5, that has strong negative values on the underside of the tag. To find the underside, a horizontal projection that accumulates all the pixel values in every row is performed. As figure 3.5 shows, the projection vector will have a large negative peak corresponding to the underside of the tag. This negative peak is detected, and thereby the underside of the tag is found.

Figure 3.5: A rotated image with Sobel filtering in the y-direction on a grayscale image, and its corresponding projection vector. The negative edge of the underside of the eartag gives a large negative peak in the projection vector.

Finding the sides of the eartag is more challenging than finding the underside. This is due to the digits on the eartag. The digits have large gradients in the x-direction, sometimes even larger than the side of the eartag; an example of this is shown in figure 3.6. To find the sides, a Sobel filter in the x-direction is applied. In the gradient image, the left side of the eartag will have a positive gradient. A projection like the one illustrated in figure 3.5 is then performed, but in the vertical direction. It is not enough to detect the highest peak in the projection vector, since it may correspond to a digit with a high gradient. The maximum number of peaks that is likely to occur is nine: two for every digit and one for the side of the eartag. Thus the nine highest peaks from the projection vector are extracted and the leftmost is designated as the left side of the tag. To sort out false peaks that may correspond to something in the background, a criterion is set. The criterion is that the extracted peaks must have a value of at least one third of the highest peak.

Figure 3.6: Image with Sobel filtering in the x-direction; the digits have larger gradients than the sides of the tag. This is seen when comparing the right side of the tag, where the negative gradient is close to zero, to the digits, where the gradient is more negative.

The right side of the eartag is extracted in the same way as the left side, but with an additional check. After the peak is found there is a second check, which is important if the side of the eartag is not distinct or if it is missing; this may happen if the cow has a similar color to the tag or if the side of the tag is hidden under the cow's hair. The reason that only the right side of the tag has an extra check is that this check needs another side to compare with. The implementation could be the other way around, finding the right side first and having an extra check on the left side. The second check examines the height-to-width ratio of the tag; if this ratio is bigger than 1.8, which means that the wrong edge was found, another algorithm will find the right side of the eartag. This algorithm can be described as follows:

A vertical projection, as described above, is done on the grayscale image, and a threshold is set to either the mean value of the projection vector or the maximum value minus 35. The threshold has been set by testing. All the pixels in the projection vector that are larger than the threshold are said to belong to the eartag. Figure 3.7 shows an image with a missing right side of the tag and its corresponding projection vector.

The upper side of the eartag does not contain a straight line like the other three sides, and in some images, as in figure 3.7, there is not even a visible upper side. To get the position where to crop, the projection approach on a grayscale image described above is used.
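The Sobel-plus-projection idea above can be sketched with cv::Sobel and cv::reduce as below. This is an illustration under assumed file names and kernel size; it only locates the underside peak and computes the column projection, without the peak-filtering and fallback logic described in the text.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat gray = cv::imread("rectified_tag.png", cv::IMREAD_GRAYSCALE);
    if (gray.empty()) return 1;

    // Vertical gradients: the underside of the tag gives strong negative values.
    cv::Mat gy;
    cv::Sobel(gray, gy, CV_32F, 0, 1, 3);

    // Horizontal projection: accumulate every row into a single column.
    cv::Mat rowProjection;
    cv::reduce(gy, rowProjection, 1, cv::REDUCE_SUM, CV_32F);

    // The row with the most negative sum corresponds to the underside.
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(rowProjection, &minVal, &maxVal, &minLoc, &maxLoc);
    int undersideRow = minLoc.y;
    (void)undersideRow;

    // The sides are found analogously with a Sobel filter in the x-direction
    // and a projection over the columns (dimension 0) instead of the rows.
    cv::Mat gx, colProjection;
    cv::Sobel(gray, gx, CV_32F, 1, 0, 3);
    cv::reduce(gx, colProjection, 0, cv::REDUCE_SUM, CV_32F);
    return 0;
}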

3.3 Optical character recognition (OCR)

The last step in the system is to recognize the digits on the segmented eartag. This thesis evaluates template matching (TM), described in section 2.5, support vector machines (SVM), described in section 2.8, and k-nearest neighbour (kNN), described in section 2.9, as OCR approaches.


Figure 3.7: Grayscale image with no distinct right side and its corresponding projection vector.

3.3.1 Template data

The data used as OCR templates are 29 different patches of digits that are cropped out from the training data. The article [9] mentioned in section 2.1 uses 10 synthetic templates; this is not done in this thesis due to problems finding a font that perfectly matches the one under consideration. Since no perfect template was found, more than 10 templates are used. Table 3.1 shows the distribution of templates per digit. The reason that the distribution is not equal is that it was difficult to find a template of decent quality for some digits.

Table 3.1: Distribution of template data

Digit               0  1  2  3  4  5  6  7  8  9  Total
Number of patches   2  1  3  4  2  2  2  3  6  4  29

3.3.2 Training data for SVM and kNN

For the training of the SVM model and the kNN classifier, 1300 patches of digits are used, 130 of each digit. 10 patches of each number are cropped out from the training data set using the template matching algorithm. From each of these 10 patches, 13 new ones are created by introducing variation, as shown in figure 3.9. To improve the training data, a histogram equalization is performed; this removes most of the differences in color. After that, an adaptive threshold that results in a binary image is applied. The result from the adaptive threshold will not always be perfect; some noise will be present. This noise is deleted by hand by the author to make the training data set as good as possible. In figure 3.8 an example of the final, binarized training data is shown.

Figure 3.8: Example of training data

To obtain 130 patches of each number, these binarized data are rotated and translated by different angles and in different directions. Figure 3.9 shows the different training patches generated from one original.

Figure 3.9: The original image patch to the left, and the original image patch rotated −1°, −3°, −5°, 1°, 3° and 5°, translated 1 and 3 pixels to the left and 1 and 3 pixels to the right, and translated 3 pixels up and 3 pixels down, respectively.

The 1300 training data patches are then used as input to the eigenfaces method, mentioned in section 2.7, where the resulting coordinates corresponding to the largest eigenvalues are passed as training data to SVM and kNN.

3.3.3 Template matching

The input to the template matching algorithm is the whole segmented eartag. The templates are compared to the image one by one, and the four templates that have the best match and do not overlap are the four that are chosen as the right ones. The templates used are described in section 3.3.1.


3.3.4 Input to support vector machine and k-nearest neighbour algorithms

The SVM model and the kNN classifier need the digits to be segmented one by one to be able to recognize them. The digit positions from the template matching are used here. There is also a function that tries to correct the digit positions. This function is used if the digit positions are not in a straight row or if there are fewer than four of them. The function is described in algorithm 2 and an example of when it is used is shown in figure 3.10.

Algorithm 2: Correction function for digit positions
Data: Inaccurate digit positions (not in a straight row or fewer than four)
Result: Corrected digit positions

make image binary;
look for digits;
if new, good digit positions are found then
    return new digit positions
else
    // try to move the old digit positions
    if three positions lie in a straight row and one does not then
        move the fourth to fit in line with the others;
        return new digit positions
    else
        if only three digits are detected in a straight row then
            add a new one beside the others if there is space, trying the left side first, then the right side;
            return new digit positions
        else
            return old digit positions
        end
    end
end

The input to the pretrained SVM model and kNN is processed as described in section 3.3.2, i.e., the histogram is equalized to remove the dependence on the background and intensity, and an adaptive threshold is applied to obtain a binary image. On this binary image the eigenfaces method is performed and the prediction is made on the coordinates from the eigenfaces method. In figure 3.11 the resulting 1300 eigenvalues from 1300 images are shown. The graph shows that after approximately x = 40 the eigenvalues taper off, which means only the first 40 eigenvectors are interesting.


(a) Digits detected by template matching.

(b) The leftmost box is moved to fit in line with the other boxes.

Figure 3.10: Example of when template matching fails to detect all four digits correctly and the correction function moves one box to fit in line with the other boxes.


Figure 3.11: The eigenvalues level out after approximately x = 40; the most valuable information lies in the first 40 corresponding eigenvectors.

Figure 3.13 shows how well the N eigenvectors can reconstruct a certain image when N goes from 1 to 40. This is done by:

Image = Σ_{i=1}^{N} c_i · v_i        (3.1)

where Image is the 40 different images shown in figure 3.13, c_i is the i:th coordinate and v_i is the i:th eigenvector. Figure 3.12 shows the original image that is reconstructed by equation (3.1). As can be seen in figure 3.13, the original image shown in figure 3.12 is almost completely reconstructed after 30 eigenvectors have been used. During the implementation, different numbers of eigenvectors were tested and the best performance was found using around 40 eigenvectors.


Figure 3.12: Original image

Figure 3.13: How well the eigenvectors reproduce the number seven; for every subimage one more eigenvector is added. Already when the 10th eigenvector has been added, a human can be certain which number it is.

3.3.5 Support vector machines

Since a linear kernel is used, there are only two parameters to be set. One is C, the capacity constraint, and the other is the maximum number of iterations in the training. C is set by the auto function in OpenCV; this function finds the best C for the data set under consideration using cross-validation, and when the cross-validation estimate of the test set error is minimal the parameters are considered optimal. The value of C was set to 0.1 by this auto function. The maximum number of iterations is set to 1000, which lets the algorithm iterate without being stopped by the maximum iteration constant.

3.3.6 k-Nearest Neighbour

The parameter of the kNN classifier is k, the number of neighbours that are taken into consideration. This is set to k = 3 in this thesis. Different values have been tested without a large difference in performance, as long as k is not too large; see figure 3.14. This indicates that the classes are well separated and the classifier is not sensitive to the value of k. Using k = 3 seems more reasonable than using k = 1, for example; relying only on the single nearest neighbour seems a bit haphazard.


Figure 3.14: Results from using different values of k in the kNN classifier. The results are from the same data set, with a total of 196 digits.

4 Evaluation

4.1 Data

The data used in the evaluation are 102 images, half of the data set described in the introduction of chapter 3, with varying quality. To get information about how the quality affects the result the images are partitioned into three different categories:

• A-data: good lighting, no blur, completely visible eartag

• B-data: images that do not fall directly into A- or C-data

• C-data: poor lighting, blurriness, partly hidden or dirty eartags

This categorization is subjective and different persons would probably categorize differently. To avoid a biased categorization by the author, the data were divided by an outsider with no prior knowledge of image processing or the system. The distribution of the images is shown in table 4.1 and example images from the three different categories are shown in figure 4.1.

The data are processed by the system described in chapter 3 and the three OCR approaches described in section 3.3 are evaluated. Template matching uses the templates described in section 3.3.1, and SVM and kNN are trained as described in section 3.3.2.

Table 4.1: Number of images in the different categories

                    A-data  B-data  C-data  Total
Number of images      34      49      19     102


(a) Example of an eartag in the A-data set

(b) Example of an eartag in the B-data set

(c) Example of an eartag in the C-data set

Figure 4.1: (a) example of A-data, the lighting is good, there is no blur and the eartag is not dirty. (b) example of B-data, the eartag is completely visible, the lighting is sufficient but the focus is a bit off. (c) example of C-data, there is a lot of motion blur in the image and the digits are difficult to separate. Note that these three images are zoomed in on the eartags to give a better description of the differences.

4.2 Results

This section presents the results of the evaluation and is divided into A-data, B-data, C-data and a comparison between the three categories.


4.2.1 Outline

The evaluation of each data set is divided into the three system modules described in chapter 3: tag detection, tag segmentation and optical character recognition. There is also an evaluation of the complete system at the end of each data set evaluation. The different evaluation modules are described in this section.

Tag detection

The section Tag detection presents the result from the tag detection method described in section 3.1. This result is binary: was the tag found in the image or not? The result from this evaluation gives the ratio between how many eartags were found and how many images were used.

Tag segmentation

In the section called Tag segmentation the result from the method described in section 3.2 is presented. This section includes results both from skew correction evaluation and segmentation evaluation.

Deciding whether the skew correction and segmentation steps succeeded by measuring the results from the two steps directly is difficult. This is because a minor error in both of them, as figure 4.2 shows, can still give a correctly recognized eartag. Therefore, the evaluation of the skew correction and segmentation is done by looking at the output from the complete system, which is the recognized digits from the OCR module. If the output is not correct, the cause of the failure is determined and assigned to the correct step. This is an ocular inspection done by the author, and the failure cause is only assigned to one step. The reason for this is that the segmentation step depends on an aligned eartag, which means that if the skew correction fails, the segmentation step will most probably also fail. Figure 4.3 shows an example of a skew correction that fails and causes the segmentation to fail. To isolate the skew correction evaluation from the segmentation evaluation, the segmentation evaluation is only done on images with a successful skew correction.

Figure 4.2: Example of a successful skew correction and segmentation. As can be seen, there is still a little bit of background left in the corners and the skew correction did not make the right side of the tag completely aligned with the y-axis.


Figure 4.3: Example of a failed skew correction, which leads to a failed segmentation.

Optical character recognition (OCR)

The three different OCR approaches described in section 3.3 are evaluated separately and without influence from failed skew correction and segmentation, which means that the OCR approaches are evaluated on the images where the tag is detected and the skew correction and the segmentation have succeeded. The bar plots in this section show both Digits and Tags, where Digits represents the percentage of correctly recognized digits in the data set and Tags represents the percentage of completely correctly recognized eartags, i.e., tags with four correctly recognized digits.

Complete system

In the section called Complete system, the whole system presented in chapter 3 is evaluated. This involves tag detection, tag segmentation and finally OCR; no images are excluded. This is presented by the performance of the three different OCR approaches, which is the output of the system. As in the previous section, Optical character recognition (OCR), the bar plots in this section show both Digits and Tags, where Digits represents the percentage of correctly recognized digits in the data set and Tags represents the percentage of completely correctly recognized eartags, i.e., tags with four correctly recognized digits.

4.2.2 A-data

Tag detection

The color threshold approach described in section 3.1 succeeds on every image in the A-data set.

Tag segmentation

Table 4.2 describes how well the preprocessing steps, skew correction and tag segmentation, perform on the A-data set.


Table 4.2: Performance of skew correction and segmentation on the A-data set

                  Passed   Percent [%]
Skew correction   32/34    94
Segmentation      31/32    97
Total             31/34    91

As table 4.2 shows, 3/34 (9 %) of the images fail to pass due to the preprocessing steps.

Optical character recognition (OCR)

Table 4.3 and figure 4.4 present the results from OCR with the images with failed skew correction or segmentation removed.

Table 4.3: Performance of OCR on the A-data set. The images that failed due to skew correction or segmentation are removed in this table

       Total digits: 124       Total tags: 31
       Passed  Percent [%]     Passed  Percent [%]
kNN    122     98              29      94
SVM    122     98              29      94
TM     117     94              27      87

Figure 4.4: Performance of OCR with the images with failed skew correction or segmentation removed. The green bars describe the percentage of all digits and the yellow bars the percentage of tags that are completely correctly recognized.


Complete system

Table 4.4 and figure 4.5 present how well the three different approaches perform on the whole system.

Table 4.4: Performance of the whole system on the A-data set.

           Total digits: 136        Total tags: 34
           Passed   Percent [%]     Passed   Percent [%]
    kNN    124      91              29       85
    SVM    124      91              29       85
    TM     119      88              27       79

Figure 4.5: Performance of the whole system on the A-data set. The green bars show the percentage of correctly recognized digits and the yellow bars the percentage of correctly recognized eartags.

4.2.3 B-data

Tag detection

The color threshold approach described in section 3.1 succeeds on every image in the B-data set.

Tag segmentation

Table 4.5 describes how well the preprocessing steps, skew correction and tag segmentation, perform on the B-data set.


Table 4.5: Performance of skew correction and segmentation on the B-data set

                     Passed   Percent [%]
    Skew correction  41/49    84
    Segmentation     41/41    100
    Total            41/49    84

As table 4.5 shows, 8/49 (16 %) of the images fail due to the preprocessing steps.

Optical character recognition (OCR)

Table 4.6 and figure 4.6 present the results from OCR with the images with failed skew correction or segmentation removed.

Table 4.6: Performance of OCR on the B-data set. The images that failed due to skew correction or segmentation are removed in this table.

           Total digits: 164        Total tags: 41
           Passed   Percent [%]     Passed   Percent [%]
    kNN    158      96              36       88
    SVM    152      93              32       78
    TM     133      81              22       54

Figure 4.6: Performance of OCR on the B-data set with the images with failed skew correction or segmentation removed. The green bars show the percentage of correctly recognized digits and the yellow bars the percentage of completely correctly recognized eartags.


Complete system

Table 4.7 and figure 4.7 show how well the three different approaches perform on the whole system.

Table 4.7: Performance of the whole system on the B-data set.

           Total digits: 196        Total tags: 49
           Passed   Percent [%]     Passed   Percent [%]
    kNN    164      84              35       71
    SVM    158      81              32       65
    TM     138      70              22       45

Figure 4.7: Performance of the whole system on the B-data set. The green bars show the percentage of correctly recognized digits and the yellow bars the percentage of correctly recognized eartags.

4.2.4 C-data

Tag detection

The color threshold approach described in section 3.1 succeeds on every image in the C-data set.

Tag segmentation

Table 4.8 describes how well the preprocessing steps, skew correction and tag segmentation, perform on the C-data set.


Table 4.8: Performance of skew correction and segmentation on the C-data set

                     Passed   Percent [%]
    Skew correction  16/19    84
    Segmentation     16/16    100
    Total            16/19    84

As table 4.8 shows, 3/19 (16 %) of the images fail due to the preprocessing steps.

Optical character recognition (OCR)

Table 4.9 and figure 4.8 describe the results from OCR with the images with failed skew correction or segmentation removed.

Table 4.9: Performance of OCR on the C-data set. The images that failed due to skew correction or segmentation are removed in this table.

           Total digits: 64         Total tags: 16
           Passed   Percent [%]     Passed   Percent [%]
    kNN    51       80              5        31
    SVM    52       81              6        38
    TM     43       67              5        31

Figure 4.8: Performance of OCR on the C-data set with the images with failed skew correction or segmentation removed. The green bars show the percentage of correctly recognized digits and the yellow bars the percentage of completely correctly recognized eartags.


Complete system

Table 4.10 and figure 4.9 show how well the three different approaches perform on the whole system.

Table 4.10: Performance of the whole system on the C-data set.

           Total digits: 76         Total tags: 19
           Passed   Percent [%]     Passed   Percent [%]
    kNN    53       70              5        26
    SVM    53       70              6        32
    TM     43       57              5        26

Figure 4.9: Performance of the whole system on the C-data set. The green bars show the percentage of correctly recognized digits and the yellow bars the percentage of correctly recognized eartags.


4.2.5 Comparison between A-, B- and C-data

Tag segmentation

Figure 4.10 illustrates the difference in performance of the skew correction and segmentation steps on the three data sets. The data used in this plot can be found in tables 4.2, 4.5 and 4.8. The reason why no separate skew correction bar is visible for the B- and C-data sets is that on those sets the segmentation succeeds on all images that pass the skew correction, so the two bars coincide.


Figure 4.10: Performance of segmentation and skew correction on the three different data sets.


Optical character recognition (OCR)

Figure 4.11 presents the difference in results of the OCR, both between the data sets and between the three OCR methods.

Figure 4.11 (a): Percentage of correctly recognized digits with the different OCR approaches, with the images with failed skew correction or segmentation removed.

Figure 4.11 (b): Percentage of correctly recognized eartags with the different OCR approaches, with the images with failed skew correction or segmentation removed.


Complete system

Figure 4.12 (a) shows the difference when the whole system is evaluated on the three data sets.

Figure 4.12 (a): Comparison of the percentage of correctly recognized digits in the three different data sets.

Figure 4.12 (b): Comparison of the percentage of correctly recognized eartags in the three different data sets.


5 Discussion

5.1 Results

The results presented in the previous chapter demonstrate that the quality of the data is vital for the performance of the system. Figure 4.11 shows that the OCR approaches are good at recognizing the digits in the A- and B-data but considerably worse on the C-data. The kNN classifier in particular shows this behavior; it correctly recognizes 98 %, 96 % and 80 % of the digits in the respective data sets. The SVM classifier and the template matching approach behave similarly, but for these two approaches the performance on the B-data set is also clearly inferior to the performance on the A-data.

Figure 4.10 shows that the skew correction and segmentation perform equally well on the B- and C-data and better on the A-data. Notable is the performance of the segmentation step, which is 100 % on the B- and C-data and only fails on a single image in the A-data. That the skew correction is better on the A-data is reasonable, since those images have sufficient lighting and the eartags are not covered in dirt. That the B- and C-data yield the same result is more surprising. The reason is probably that the blurriness in the C-data affects the digits more than the sides of the tag, so it does not affect the result of the skew correction. The main reason why the skew correction fails is that the flood fill fails. This can happen if the background outside the tag has a color similar to the tag; the flood fill then floods out and no distinct edge is found. Another reason for the flood fill to fail is a very dirty eartag, where the algorithm does not fill enough of the tag, which leads to incorrectly detected edges. An example where both these issues occur is shown in figure 5.1.


Figure 5.1: Failed flood fill. In areas where the tag is very dirty and dark, the flood fill algorithm does not fill. The area outside the tag, which is bright and almost orange, is filled.
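To illustrate why the flood fill is sensitive to these two failure modes, the sketch below runs OpenCV's fixed-range flood fill from a seed assumed to lie inside the detected tag. The file name, seed point and tolerances are illustrative assumptions, not the values used in the thesis.

```python
import cv2
import numpy as np

# Hypothetical cropped image of a detected eartag (file name is an assumption).
tag = cv2.imread("tag_crop.png")
if tag is None:
    raise FileNotFoundError("tag_crop.png")

# Seed point assumed to lie on the tag, here simply its centre.
seed = (tag.shape[1] // 2, tag.shape[0] // 2)

# floodFill requires a mask 2 pixels larger than the image in each dimension.
mask = np.zeros((tag.shape[0] + 2, tag.shape[1] + 2), dtype=np.uint8)

# With FLOODFILL_FIXED_RANGE every candidate pixel is compared to the seed
# pixel within the lo_diff/up_diff tolerance. If the background colour is
# close to the seed colour the fill leaks out of the tag, and if dirt makes
# parts of the tag much darker than the seed those parts are never filled.
lo_diff = (20, 20, 20)
up_diff = (20, 20, 20)
flags = 4 | cv2.FLOODFILL_FIXED_RANGE | cv2.FLOODFILL_MASK_ONLY | (255 << 8)

cv2.floodFill(tag, mask, seed, (255, 255, 255), lo_diff, up_diff, flags)

# The filled region (without the 1-pixel border) approximates the tag area.
cv2.imwrite("tag_fill_mask.png", mask[1:-1, 1:-1])
```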

In figure 4.11 we can see that kNN and SVM are better at recognizing the digits than the simpler template matching approach. kNN and SVM yield the same results on the A-data, but on the B- and C-data there are some differences. On the C-data only one digit differs, which is not enough to say which method is better. The OCR results on the B-data, on the other hand, tell a little more about the difference in performance between kNN and SVM. Table 4.7 and figure 4.7 show that the kNN classifier handles six more digits than the SVM. From this, the conclusion can be drawn that kNN is the best classifier tested in this thesis for this application.
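For reference, a minimal sketch of such a classifier comparison is given below, assuming the segmented digits are available as small binarized images flattened into feature vectors; scikit-learn's kNN and SVM classifiers stand in for the thesis implementation, and the randomly generated data is a placeholder for real digit crops.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder data: each row is a flattened 20x30 binarized digit image and
# `labels` holds the corresponding digit (0-9). Real digit crops would be
# loaded here instead.
rng = np.random.default_rng(0)
digits = rng.integers(0, 2, size=(500, 20 * 30)).astype(np.float32)
labels = rng.integers(0, 10, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    digits, labels, test_size=0.25, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
svm = SVC(kernel="rbf").fit(X_train, y_train)

# Digit-level accuracy for the two classifiers on the held-out digits.
print(f"kNN accuracy: {knn.score(X_test, y_test):.2f}")
print(f"SVM accuracy: {svm.score(X_test, y_test):.2f}")
```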

Comparing the results of this thesis to results from the literature is complicated because the data used matters greatly. The literature rarely shows the data sets used, and therefore it is difficult to decide whether the performance of the system presented in this thesis is adequate. That kNN is superior to SVM for OCR is shown in [11], but the opposite is shown in [24]. This shows that different implementations and different data sets can give different results. To be usable commercially on farms, the performance of the system needs to be at least 90 %. The currently used radio frequency system has a performance somewhere between 90 and 99 % [10]. There is an important difference between a computer vision based system and a radio frequency based one: the computer vision based system can detect when a cow walks past a camera even if the cow is missing its identification tag, while an RF based system cannot detect a cow without its RF transmitter. This matters when cows enter a milking station with a capacity of several cows. When the cows are at the milking station, it is important for the farmer to know which cow is in which stall to keep track of how much milk each cow produces. When the milking station is full but one cow did not get identified, the RF based identification system will not know which cow went to which stall. If there is a large amount of cows at the milking station
