Evaluation of system design strategies and supervised classification methods for fruit recognition in harvesting robots


This master's thesis project was carried out by a student from KTH Royal Institute of Technology in collaboration with Cybercom Group. The goal was to evaluate and compare design strategies for fruit recognition in a harvesting robot, and the performance of machine learning classification algorithms when applied to this specific problem.

The work covers the fundamentals of these systems, for which parameters, constraints, requirements, and design decisions were investigated. This framework was then used as the basis for the implementation of the sensor system and the processing and classification algorithms.

A plastic tomato plant with fruit of varying ripeness was used as the basis for training and validating the system, and a Kinect for Windows v2, equipped with sensors for high-resolution color, depth, and infrared data, was used to acquire images.

The data was processed in MATLAB using the software development kit for Kinect provided by Microsoft, in order to extract features from objects in the images.

Multiple views were obtained by rotating the tomato plant on a platform driven by a stepper motor and an Arduino Uno. The binary classification algorithms tested were Support Vector Machine, Decision Tree, and k-Nearest Neighbor. The models were trained and validated using five-fold cross-validation in MATLAB's Classification Learner application. Performance metrics such as precision, recall, and the F1-score were calculated for the different models. Among other things, the results showed that statistical models such as k-NN and SVM perform better on the given problem, and that the latter is the most promising for future applications.

Keywords: Machine Learning, Classification, Image Processing, Computer Vision, Support Vector Machine, k-Nearest Neighbor, Decision Tree, Harvesting Robot, Recognition System, Kinect v2


This master's thesis project was carried out by one student at the Royal Institute of Technology in collaboration with Cybercom Group. The aim was to evaluate and compare system design strategies for fruit recognition in harvesting robots, and the performance of supervised machine learning classification methods when applied to this specific task. The thesis covers the basics of these systems, for which parameters, constraints, requirements, and design decisions have been investigated. This framework is used as a foundation for the implementation of both the sensing system and the processing and classification algorithms. A plastic tomato plant with fruit of varying maturity was used as a basis for training and testing, and a Kinect for Windows v2, including sensors for high-resolution color, depth, and IR data, was used for image acquisition. The obtained data were processed, and features of objects of interest extracted, using MATLAB and an SDK for Kinect provided by Microsoft.

Multiple views of the plant were acquired by having the plant rotate on a platform controlled by a stepper motor and an Arduino Uno. The algorithms tested were binary classifiers, including Support Vector Machine, Decision Tree, and k-Nearest Neighbor. The models were trained and validated using five-fold cross-validation in MATLAB's Classification Learner application. Performance metrics such as precision, recall, and the F1-score, used for accuracy comparison, were calculated. The statistical models k-NN and SVM achieved the best scores. The method considered most promising for fruit recognition purposes was the SVM.

Keywords: Supervised Machine Learning, Classification, Image Processing, Computer Vision, Support Vector Machine, k-Nearest Neighbor, Decision Tree, Harvesting Robot, Recognition System, Kinect v2


First of all, I would like to express my gratitude to my industrial supervisor Fredrik Edlund at Cybercom Group for the much-appreciated guidance, support, and help during this project.

I would also like to thank my supervisor De-Jiu Chen (KTH), for providing helpful insights and comments during the project.

Finally, I would like to thank my friends and family for their support throughout my entire education.

Gabriella Björk, Stockholm, June 2017


Contents

1 Introduction
  1.1 Purpose
  1.2 Scope
  1.3 Methodology
    1.3.1 Ethical Considerations
    1.3.2 Report Outline
2 Background
  2.1 Agricultural Robots
    2.1.1 Motivations
    2.1.2 System Overview
  2.2 Constraints and Requirements
    2.2.1 Constraints
    2.2.2 Requirements of the System
  2.3 Parameters
    2.3.1 Color
    2.3.2 Geometry
    2.3.3 Texture
    2.3.4 Temperature
    2.3.5 Depth
  2.4 Performance Metrics
  2.5 Delimitations
3 State of the Art
  3.1 Recognition Systems
  3.2 Image Acquisition
    3.2.1 Two Dimensional Imaging
    3.2.2 Three Dimensional Imaging
  3.3 Image Processing and Feature Extraction
    3.3.1 Digital Image Representation
    3.3.2 Image Fusion
    3.3.3 Techniques for Feature Extraction
  3.4 Decision Making
    3.4.1 Supervised Machine Learning Methods
    3.4.2 Algorithm Comparison and Validation
4 Implementation
  4.1 Hardware Implementation
    4.1.1 Microsoft Kinect V2
    4.1.2 Experimental Setup
  4.2 Software Implementation
    4.2.1 Data Acquisition
    4.2.2 Image Processing and Feature Extraction
    4.2.3 Decision Making
5 Results
  5.1 Training and Validation of Classifiers
    5.1.1 Feature Selection
    5.1.2 Decision Tree
    5.1.3 SVM
    5.1.4 k-NN
  5.2 Performance Evaluation
6 Discussion and Conclusion
  6.1 Discussion
    6.1.1 System Evaluation
    6.1.2 Classifier Performance Evaluation
  6.2 Conclusion
7 Recommendations for Future Work

List of Figures

2.1 Illustration of the three subsystems of a harvesting robot: recognition, harvesting, and motion systems.
2.2 Illustration of the terms precision and recall.
3.1 An illustration of the representation of different color spaces.
3.2 An illustration of the CHT, in which edge points (x, y) generate votes on a conic surface in the three dimensional (a, b, r) parameter space.
3.3 The basics of supervised machine learning.
3.4 An illustration of a k-NN for point p with k = 3. The three nearest neighbors are found within the orange circle, determining the class of p.
3.5 An illustration of a binary classification tree, with two splits, in which the instances are classified based on two distinctive features.
3.6 An illustration of an SVM with separating hyperplane, support vectors, margin, and instances of each class.
3.7 Illustration of neurons in an ANN with input, hidden, and output layers.
3.8 An illustration of divided subsets using cross-validation with k = 5.
4.1 The Kinect for Windows v2 sensor, including color sensor, depth/IR sensor, and multi-array microphone. The camera (real-world) coordinate system originates in the depth/IR sensor.
4.2 The experimental hardware setup, consisting of: 1) plastic tomato plant, 2) Kinect sensor, 3) PC, 4) Arduino Uno and Arduino Motor Shield Rev. 3, 5) stepper motor, and 6) power supply.
4.3 The steps in the process of creating a classifier for testing based on training of instances.
4.4 An illustration of the steps involved in computing the mask, masking out only the actual instance to avoid inherited feature values.
4.5 The morphological operations applied to the binary mask.
4.6 The original image, and the original image combined with the red mask.
5.1 The standardized parallel coordinate plots of features for different sample sizes.
5.2 Scatter plots of the original data to examine the predictive relationships of features, using 594 samples.

List of Tables

3.1 Characteristics of each considered method with regard to: speed of training and of making new predictions; effects of large feature spaces relative to training sample size; irrelevant features; noise; and bias. Low performance or high sensitivity is indicated with (+), and the opposite with (+++).
4.1 Technical specifications of the Kinect sensor.
4.2 An example of typical threshold values applied to each color band used for computation of a binary red mask.
5.1 The output data structure, including an example of sample data to be used for training of the classifiers.
5.2 Performance evaluation of each model and sample size, including training speed, prediction speed, TP, TN, FP, FN, precision, recall, and F1-score.

Abbreviations

ANN    Artificial Neural Network
CCD    Charge-Coupled Device
CHT    Circle Hough Transform
CMOS   Complementary Metal-Oxide Semiconductor
CW     Continuous Wave
DT     Decision Tree
FN     False Negative
FP     False Positive
ICP    Iterative Closest Point
IR     Infrared
k-NN   k-Nearest Neighbor
LiDAR  Light Detection and Ranging
NB     Naive Bayes
NIR    Near Infrared
RGB    Red Green Blue
SDK    Software Development Kit
SVM    Support Vector Machine
TN     True Negative
ToF    Time-of-Flight
TP     True Positive


1 Introduction

This chapter presents the purpose and scope of the project as well as the methodology, ethical considerations and report outline.

1.1 Purpose

In recent decades, rapid progress in sensor and real-time computer vision technologies has made the automation of the harvesting process a topic of considerable research interest [1]. Earlier restrictions in the field stemmed from the key requirement that agricultural robots, if they are to compete with manual labor, must work effectively in terms of quality, speed, and cost [2].

Even though extensive research has been conducted in this field, the commercialization of robots for harvesting high-value fruit has yet to be realized. The limitations are mainly due to constraints of the recognition system [3], in which detection and localization of fruit are the main tasks. Supervised machine learning algorithms for pattern classification have commonly been employed in previous attempts, as well as in a wide variety of other applications.

The purpose of this thesis project is to investigate recognition systems for high-value fruit, as well as to evaluate the performance of supervised machine learning methods when applied to this specific problem. The project has been carried out by one student at the Royal Institute of Technology in cooperation with Cybercom Group. The scope and delimitations have been decided upon in cooperation with the institution as well as the company. The research questions this project aims to answer are the following:


1. What typical constraints, requirements, parameters, and design decisions need to be considered when designing a fruit recognition system for a harvesting robot?

2. What are the available sensory and software solutions when designing a fruit recognition system?

3. What are the differences in performance between supervised machine learning algorithms when used for the task of recognizing fruit?

1.2 Scope

There are a great many aspects of fruit harvesting systems in need of investigation, not all of which can be addressed within this project. This stresses the need to limit the scope. Following the research questions stated above, the project aims at:

1. Carrying out a literature study in order to investigate the important constraints, requirements, parameters, and design decisions to be accounted for when designing a fruit recognition system for a harvesting robot,

2. Designing and implementing a sensing system for recognition and localization of fruit,

3. Evaluating the performance of supervised classification algorithms, using the implemented sensory system.

1.3 Methodology

At the start of the project, a planning phase was carried out, during which the purpose and research questions were developed. This was accomplished in collaboration with all involved parties, i.e. the institution and the company. During this phase, a time plan, a risk analysis, and the methodology the project should follow were also established. A planning report and seminar were completed and approved by the institution. The second phase of the project was to review relevant literature in order to determine the framework, i.e. to answer the questions regarding design decisions and which performance metrics to use for validation, and to investigate the available choices for the subsequent implementation phase of the project.

The information gained in this phase is limited by the sources of literature available: open-access literature or literature acquired via the KTH library. Sources were chosen with regard to their publishing organization, as well as for being well-cited and peer-reviewed. The implementation phase was then carried out, including the implementation of the hardware and of the processing and classification algorithms to be evaluated. The validation part of the project needed to be evaluated objectively, i.e. through the use of quantitative research methods. This included data gathering through structured experiments and statistical analyses, using the performance metrics and validation strategies found in the previous phases. The fourth and last phase of the project consisted of analyzing and drawing conclusions from the results of the previous phases, as well as reflecting on recommendations for future work.

1.3.1 Ethical Considerations

As for any other robot operating in an environment in which living beings may be present, there are ethical considerations regarding safety, i.e. not causing harm to those beings. Within the scope of this project, the robot itself is not accounted for, and the recognition system on its own cannot inflict any such harm; the project therefore does not raise any ethical issues.

1.3.2 Report Outline

In Chapter 2, the background study covering the investigated constraints, requirements, parameters, and design decisions is presented, followed by the state of the art with regard to sensory design, processing, and classification algorithms in Chapter 3. The implementation of the system is presented in Chapter 4. Chapter 5 presents the results; Chapter 6, the discussion and conclusions; and Chapter 7, recommendations for future work.


2 Background

This chapter presents the purpose and an overview of agricultural robots and their recognition system, including typical constraints of these systems, identified parameters, functional and non-functional requirements, and performance metrics for evaluation. Delimitations of the project are presented at the end of the chapter.

2.1 Agricultural Robots

There are many possible applications for which an agricultural robot can be designed, e.g. pruning, cultivating, spraying, trimming, disease monitoring, and harvesting. The idea of automating the harvesting process was first introduced in the 1960s [4]. This section presents the motives for automating the harvesting process, as well as an overview of the overall system and the recognition system in terms of hardware and software.

2.1.1 Motivations

The commercialization of harvesting robots for fruit picking in the agricultural industry has many important motives, one of the most important being to reduce the need for human labor. For many crops, harvesting is highly intensive work and heavily straining on the body, leading to many injuries within the workforce [5]. Another strong motive for commercialization is to minimize damage induced by manual harvesting. Immense quantities of fruit go to waste due to improper handling, since fruits are delicate and vulnerable to any damage; even the smallest scratch can shorten the viable lifetime, thus reducing the value. From a sustainability point of view this is a major issue. According to current estimates by the Food and Agriculture Organization of the United Nations, one-third of all edible produce globally is lost or wasted, spread across all stages of the food supply chain [6]. Damage-free harvesting is therefore of utmost importance. Despite the foreseeable benefits and the extensive efforts in the field, there are still no commercially available solutions for automated fruit harvesting.

2.1.2 System Overview

Most harvesting robots use vision-based control and consist of manipulators, end effectors, and vision and motion systems [7]. The main tasks involved in robotic harvesting are detecting, localizing, gripping, and detaching fruit. In general, the robots can be divided into three subsystems: one for detecting and localizing fruit and obstacles, one for gripping and detaching the fruit, and one for carrying and moving the other two systems around [8]. The system and subsystems are illustrated in Figure 2.1. As seen in the figure, the recognition system can be placed either on the motion system or directly on the manipulator. The commercialization of harvesting robots has essentially been restricted by the recognition system, i.e. the localization and classification of fruit. Extensive research on different aspects of detecting and localizing fruit, with a variety of crops, sensors, and computer vision techniques, has previously been conducted with varying degrees of success. A recognition system is usually based on the steps of acquiring data from visual sensors, processing the data, and making a decision as to whether any objects of interest can be identified from this data. Using data from other sensors, including chemical, tactile, and proximity sensors, has also been investigated [2]. The output of the recognition system, i.e. the identified fruits and their positions, is then used by the harvesting subsystem, which grasps and detaches the fruits.

In the overall system, [3] found that five major tasks need to be implemented and executed: fruit localization, ripeness determination, obstacle localization, task planning, and motion planning. The recognition system involves the three former tasks, i.e. fruit and obstacle localization and ripeness determination. A few methods are commonly used to cover these tasks in fruit recognition applications: single-feature analysis, multiple-feature analysis, and machine learning classification methods [9]. As could be expected, and as described by [10], using only one feature rarely represents the object in a satisfactory manner except in extreme cases. However, since each feature represents a different aspect of the object, features can compensate for each other's limitations. Hence, using multiple-feature analysis or machine learning classification methods, in which the features are fused, increases the performance on these tasks.

Figure 2.1: Illustration of the three subsystems of a harvesting robot: recognition, harvesting, and motion systems.


Regarding the sensory system, the same logic applies. As described by [11], the sensing process can be interpreted as a mapping of the state of the world into a set of much lower dimensionality. This means, as with the features, that a single sensor rarely provides an accurate perception of the scene. Combining multiple sensors has been extensively researched in recent years, due to the synergistic effect of representing diverse sources of sensory data in a single format. This also applies to harvesting robots: as investigated by [2], multi-sensory systems, especially those including three dimensional sensors, have a positive impact on the accuracy of detecting and localizing fruit.

2.2 Constraints and Requirements

In order to develop the requirements, it is important to first identify the constraints typically associated with the task of recognizing fruit in different operating environments. This section covers the identified constraints, and the requirements, both functional and non-functional, derived from them.

2.2.1 Constraints

The main challenges to confront in designing a fruit recognition system stem from the uncontrolled operating environment, as well as from the varying morphology of the objects to be detected; they are natural objects with high variation in size, shape, color, texture, and hardness [12, 3]. Other challenges reported in previous research concern errors in fruit detection, occlusions, clustering, and variable lighting conditions [9, 2]. As mentioned, using only one feature for detection rarely represents the object in a satisfactory manner, i.e. it exacerbates the problems associated with these challenges, which has been encountered in previous attempts [10]. Another constraint affecting the overall system is inaccuracy in determining the distance and position of identified objects. This can cause failed or unsatisfactory picking attempts, in which damage to both fruit and plantation might occur [9].

Most crops are cultivated in greenhouses, indoors, in orchards, or in open fields. There are some distinctions in the major constraints depending on which of these environments the robot is designed for. In greenhouse or indoor environments, the main problem is that the crops are usually more occluded, although with the benefit that parameters such as lighting, humidity, and other climate conditions are controlled and monitored [3]. In orchards and open fields, weather conditions such as varying lighting, rain, and wind obviously need to be considered. Another variable affecting the performance of the system is the testing environment; in previous research, higher accuracy has been achieved when testing in lab environments [3].

Other factors restraining commercialization, as [8] points out, are the importance of the system being efficient, low-cost, and easy to use.

2.2.2 Requirements of the System

The requirements of the system are built upon the tasks to be performed and the constraints identified. To summarize, the constraints typically associated with these applications are due to the environmental variations under which the system needs to operate. The unstructured environments, as well as environmental conditions such as humidity, temperature, and lighting, are all important aspects that influence the design decisions of the hardware implementation, i.e. the choice of sensors. The sensory techniques and processing algorithms must also be able to cope with the variation in size, shape, color, and position of the fruits. In order for the system to become a commercially viable solution, it needs to be efficient in terms of quality, speed, and economy compared to manual labor. The energy consumption of harvesting is another aspect that needs to be considered [8]. The requirements are divided into functional and non-functional requirements.

The functional requirements are described in terms of what the system should do, and are as follows:

• Identify fruit. The system shall be able to identify fruit objects.

• Determine ripeness of fruit. The system shall be able to determine the ripeness of the identified fruits.

• Identify obstacles. The system shall be able to identify obstacles to avoid collisions when detaching the fruit.

• Three dimensional operation. The system must be able to accurately determine the real-world coordinates, i.e. the location, of the identified fruits to be harvested.

• Handle variations in parameters such as size, shape, color, and texture. Since the objects to be recognized are natural objects, the system must be able to handle these variations properly in order to achieve high detection accuracy.

The non-functional requirements identified are in terms of:

• Accuracy. The system must have satisfactory accuracy in detecting ripe fruit, as well as a low rate of falsely identified fruits.

• Cost. The system must be economically beneficial compared to manual labor, which requires the system to be low cost.

• Speed. The system must be a real time system, fast enough to meet the speed requirements of its application environment.

• Power consumption. Since the system will work in the field, low power consumption is desirable; it is also both economically and environmentally beneficial.

• Environmental conditions. The hardware must be able to operate in an environment in which humidity, lighting, and temperature levels may vary widely.

• Usability. The system should be easy to use.


2.3 Parameters

The parameters identified as needing consideration in a fruit recognition system are the following: color, shape, size, texture, temperature, and depth [10, 13, 12]. Their purpose, and their contribution to dealing with the identified constraints, are further explained in the following sections.

2.3.1 Color

Color is one of the most important features of images [14, 9], and the most common for distinguishing ripe fruit from other background elements [10], since many fruits have distinctive colors, like red apples, tomatoes, citrus, and mangoes [9]. It is important that the system can handle variance in color, considering that the required ripeness might vary with season [3].

Using only color features limits the ability to detect unripe or green fruit, and introduces errors due to noise from uncertain background features. The variation of color under different lighting is another cause of operational deficiencies [9].

2.3.2 Geometry

Shape is an important feature for human vision when recognizing objects [14]. This feature can be difficult, as well as rather time-consuming, to extract. In previous research, shape analysis has only been employed when the fruits have spherical shapes [10]. The size of the objects of interest is also a good indicator, since fruit have genetic size constraints [9].

Using geometric features such as shape and size when analyzing images compensates for the problem of variations in lighting. The geometric features also help prevent clusters of fruit from being categorized as single fruit, since clusters have shapes and sizes outside the threshold values.

2.3.3 Texture

Texture is another feature useful for classifying images. It can be employed for fruit recognition purposes since fruit generally has a smoother texture than surrounding objects [9]. Color does not affect differences in texture, so texture analysis is appropriate when the fruits have colors similar to leaves or other background elements [9].

2.3.4 Temperature

Temperature is a useful feature for detecting fruit, considering the physical structure of plants. Leaves accumulate less heat than fruit, allowing fruits to be distinguished by their temperature. Using thermal features for detection of fruit is useful when fruits have no color distinct from their background [12].

2.3.5 Depth

The distance to an object is of great use when calculating the size of an object of interest. The position of the fruits in real-world coordinates, which can be calculated from the depth map, is mandatory if they are to be picked by the manipulator, since it needs to be input to the motion planning task for detachment. High accuracy decreases the risk of damaging the fruit in the picking attempts.

Depth features can also be used to reconstruct the entire obtained scene, or to keep track of the identified fruits and surroundings by mapping points from one view of the scene to another. This is of great importance so that the vision system does not try to identify, or the manipulator does not try to detach, fruits already accounted for. This problem arises when fruit or robot movement yields a transformed view of the scene.

2.4 Performance Metrics

To be able to properly evaluate the performance of the implemented classification algorithms with regard to the final research question, some indicators must be decided upon. In previous research, different metrics are used for performance evaluation, and sometimes none are reported at all. This stresses the need to create a consistent basis for comparison [3, 10]. Furthermore, as described by [10], no attempts have been made to test algorithms under similar field conditions.

Popular metrics applicable for evaluation of binary statistical classification algorithms are precision and recall [15]. Precision is the fraction of retrieved instances that are relevant, and recall is the fraction of relevant instances that are retrieved.

For classification tasks, the terms true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) can be used to compare the results of the classifiers. The terms positive and negative refer to the expectation of the classifier, i.e. the prediction. True and false refer to whether the prediction corresponds to reality. An illustration of the terms is shown in Figure 2.2. Applied to the problem of fruit recognition, the terms most influencing the sustainability and overall efficiency of the system are TP and FP. The TP rate, i.e. how many of the total number of ripe fruits are identified, should be high, so as not to waste any edible produce. Minimizing the FP rate is of great importance, since attempting to pick non-fruit objects can damage fruit or plantation, as well as increase the cycle time [3].

Precision and recall can be calculated, in terms of those quantities, as follows:

Precision = TP / (TP + FP)    (2.1)

Recall = TP / (TP + FN)    (2.2)

Figure 2.2: Illustration of the terms precision and recall

The F1-score is another metric used for accuracy determination. The score is based on the calculations of precision and recall and is defined as:

F1 = 2 · (precision · recall) / (precision + recall)    (2.3)
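As a minimal numerical sketch of Equations 2.1 to 2.3 in MATLAB (the thesis's implementation language), with hypothetical confusion counts rather than the thesis's results:

    % Hypothetical confusion counts for a binary fruit / non-fruit classifier
    TP = 90;  FP = 10;  FN = 15;

    precision = TP / (TP + FP);    % Eq. 2.1: fraction of detections that are fruit
    recall    = TP / (TP + FN);    % Eq. 2.2: fraction of fruit that are detected
    F1 = 2 * (precision * recall) / (precision + recall);   % Eq. 2.3

    fprintf('Precision %.2f, recall %.2f, F1 %.2f\n', precision, recall, F1);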

These metrics can be combined with validation strategies for supervised classification methods. These are further explained in Section 3.4.2.


2.5 Delimitations

Considering the endless number of possibilities in a project like this, as well as the restricted time frame and budget, delimitations needed to be decided upon. In terms of the tasks to be implemented, a delimitation is set to cover only the first two mentioned in Section 2.1, i.e. fruit localization and ripeness determination. For convenience and consistency in the validation of the implemented methods, a plastic tomato plant with fruit of varying maturity is used in a laboratory experimental setup. This further restricts the use of sensors other than visual ones, such as chemical or tactile sensors for ripeness determination. The restricted time frame also affects the size of the training set to be generated for training and validation, which is a limiting factor on the performance of the classifiers.


3 State of the Art

This chapter describes the basics of recognition systems and computer vision, and explores the state of the art of these when applied to the task of harvesting fruit. The chapter covers the sensors used; definitions, techniques, and algorithms for image processing; and the theory and characteristics of supervised classification methods.

3.1 Recognition Systems

A recognition system usually conducts the following computation steps [13]:

1. Image Acquisition. Images are acquired through a visual sensor in digital form; their digital representation is further explained in Section 3.3.1.

2. Image Processing. The acquired images are processed, by e.g. segmenting the background, filtering noise, improving image quality etc.

3. Feature Extraction. Features are extracted and analyzed by different techniques in order to achieve classification.

4. Decision Making. Based on the analyzed features, a decision is made as to whether anything in the frame can be recognized.

The first step, image acquisition, is performed by the hardware implementation (i.e. sensors), and is presented in Section 3.2. Definitions and techniques regarding the following two stages, image processing and feature extraction, are explained in Section 3.3. The decision making stage, using classification machine learning methods, is presented in Section 3.4.


3.2 Image Acquisition

To achieve the first step, data acquisition, different visual sensors operating in two or three dimensions are used. There are many options available for both purposes, which are further explained in this section. Section 3.2.1 presents devices used for two dimensional imaging, and Section 3.2.2 devices and techniques for three dimensional imaging. A comparison of the available devices and techniques and their synergistic effects is presented at the end in Section 3.2.3.

3.2.1 Two Dimensional Imaging

The most common two dimensional sensors found in fruit recognition applications are color cameras, including Charge-Coupled Device (CCD) and Complementary Metal-Oxide Semiconductor (CMOS) cameras. Other sensors commonly employed are spectral and thermal cameras [2, 9, 12]. Two dimensional sensors can be used to obtain information about the color, temperature, or physical properties of objects in an image.

3.2.1.1 Color Camera

Both CCD and CMOS sensors operate on the same principle: they convert light intensity into electrons, using different technologies. After conversion, the accumulated charge of each cell in the image is read. CCD sensors are the most flexible and commonly used devices in computer vision applications. In a CCD sensor the charge is transported and read at one corner of the array, converting the analog pixel values into digital values. The values are stored in the image plane and can be read, row by row, by a computer [16]. In CMOS sensors, transistors are used at each pixel, amplifying and moving the charge, making it possible to read each pixel individually.

3.2.1.2 Spectral Camera

Multi-spectral imaging can be used to capture data at various wavelengths across the electromagnetic spectrum. This allows additional information to be extracted from non-visible frequencies, e.g. recognizing physical properties of materials, such as texture. The wavelengths used are usually a small set of specific wavelengths in the infrared (IR) and near-infrared (NIR) ranges [8].

Spectral imaging can be achieved in two ways [16]: using a refracting film for each of the different refracted wavelength components, or using a color wheel that is synchronously rotated in the optical pathway and read at each color during the rotation. There are pros and cons to both methods. The former gains spectral information at the cost of spatial resolution, while the latter trades off sensing speed against color sensitivity.

3.2.1.3 Thermal Camera

Thermal imaging is a technique for capturing IR images. Thermal cameras detect radiation in the long-infrared range. IR radiation is emitted by all objects and increases with temperature; thermal cameras thus allow variations in the temperature of objects to be visualized.

3.2.2 Three Dimensional Imaging

Three dimensional sensors are used to sense the depth or range of a three dimensional surface element, rather than the intensity or radiation as in two dimensional sensors; they receive depth information about objects of interest, which can also be used to calculate their size. The three dimensional sensors used in recognition systems in previous research are based on stereo vision, structured light imaging, Time-of-Flight (ToF) cameras, and Light Detection and Ranging (LiDAR) [2, 9].

3.2.2.1 Stereo Vision

The principle of stereo vision is to transform data from two two dimensional imaging devices into three dimensional data, by letting them capture the scene from two aligned points a set distance apart. The spatial distortion between the two images is used to calculate the distance to objects of interest, using the known distance between the cameras and their focal lengths. The spatial distortion refers to the difference in image location of the same three dimensional point when projected under the perspective of two different cameras [16].
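Although the thesis does not state it explicitly, the standard pinhole-camera form of this relation is worth noting: for two parallel cameras with focal length f and baseline b, a point observed with disparity d (the spatial distortion above) lies at depth

z = f · b / d

so small disparity errors translate into large depth errors at long range.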

3.2.2.2 Structured Light

The principle of structured light imaging is based on active illumination of the scene with two dimensional, spatially varying intensity patterns. An imaging sensor is used to acquire two dimensional data of the scene under illumination, and the geometric shape distortion arising in this process can be used to extract the shape of the illuminated surface [16, 17].


3.2.2.3 Time of Flight

The ToF principle is based on the known and constant speed of light in air, from which the absolute time a light pulse takes to travel from its source to a target and back to a receiver can be measured. The time needed by the emitted light to travel from emitter to target is proportional to the depth. The source and receiver are located close together to avoid shadowing effects. Most conventional ToF cameras use a point detector to scan a modulated laser beam over the scene. Another approach uses Continuous Wave (CW) modulation, i.e. continuous light waves instead of single short light pulses [18].

For the pulsed-light version, in which the illumination source is switched on and off rapidly, the distance d to a point is given by [19]:

d = c·t / 2    (3.1)

where c is the speed of light and t is the time for the light to travel back and forth, which can be calculated as:

t = φ / ω    (3.2)

where φ is the phase delay of the traveling light and ω is the angular frequency of the optical wave.

Using CW modulation, the phase shift is measured instead. The phase shift is retrieved by demodulating the received signal. This can be done in various ways, and using the calculated phase shift φ, the distance d can be calculated according to [20]:

d = c / (4π·f) · φ    (3.3)

where f is the modulation frequency.
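A small numerical sketch of Equation 3.3 in MATLAB, with illustrative values not taken from any specific sensor:

    c     = 3e8;      % speed of light [m/s]
    f_mod = 16e6;     % CW modulation frequency [Hz] (hypothetical)
    phi   = 0.8;      % measured phase shift [rad] (hypothetical)

    d = c / (4*pi*f_mod) * phi;   % Eq. 3.3, gives roughly 1.19 m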

3.2.2.4 Light Detection and Ranging

LiDAR works on a principle similar to ToF, comparing the change of phase (i.e. the delay) between a transmitted amplitude-modulated laser beam and the received reflected signal. The pulsed laser beam is transmitted to the three dimensional surface to measure the distance, expressed in terms of the period of the modulated beam. The reflectivity of the surface, for the wavelength of light used, can be estimated by comparing the received and transmitted intensities [16].

3.2.3 Comparison of Devices

The sensors described have particular characteristics and areas of application. In previous studies, a variety of sensors, and combinations of them, have been used to optimize the cost, accuracy, speed, and robustness of the system [2]. The possibilities of combining sensors for synergistic effects are innumerable, creating a need to compare and evaluate the available devices. As previously discussed, a single sensor is not sufficient to provide an accurate perception of the scene, and multi-sensory systems including three dimensional sensors are needed in order to achieve accurate detection.

Regarding the two dimensional sensors, the use of spectral and thermal sensors is important if features regarding texture and temperature are to be extracted. The use of spectral cameras, or their IR capabilities, also allows for visibility in uncertain lighting conditions, such as in the dark. A color sensor is useful in most applications, since color is usually the most important feature of images. It is also sufficient for determining ripeness in fruits with distinctive colors, e.g. red tomato fruits. There are some differences in the operational profiles of CCD and CMOS sensors. The CCD device creates high-quality, low-noise images. The CMOS is more susceptible to noise and has varying light sensitivity. CMOS cameras, on the other hand, use significantly less power, are less expensive, and provide faster responses. As [21] concludes, a CMOS camera is the best alternative for time-resolved data acquisition in order to achieve a sufficient frame rate.

Regarding the three dimensional sensors, the main limiting factor for their use is the availability of low-cost options [2]. The use of stereo vision to acquire three dimensional data is restricted by the complex correlation algorithms, resulting in long computational times. The method is also restricted by its low accuracy, especially in outdoor environments [9]. Cameras based on structured light work at high speed, but their accuracy is highly affected by the lighting conditions, in which the intensity patterns might be disrupted by external illumination. The ToF sensor, especially using CW modulation, is the most promising solution for implementation in real time applications. This is due to the simplicity of the algorithm, calculating highly accurate distance measurements at high speed using a small amount of processing power. LiDAR cameras have good accuracy and operating ranges outdoors, although they are very expensive and slower than ToF cameras [16].

3.3 Image Processing and Feature Extraction

This section covers the definitions of digital representation of images, presented in Section 3.3.1, and image fusion in Section 3.3.2. The section also covers techniques for processing and extraction of features in Section 3.3.3.

3.3.1 Digital Image Representation

The basic structure of a digital image is a two dimensional array I[r,c] containing discrete intensity values, usually in an 8-bit format. In computer vision, a digital image can be represented in several manners [16]:

• Grayscale Image. A grayscale image is a monochrome image with one inten- sity value per pixel.

• Multi-spectral Image. A multi-spectral image contains a two dimensional vector of values at each pixel.

• Binary Image. A binary image is an image in which each pixel is either represented by 0 or 1 (i.e. black or white).

• Labeled Image. A labeled image contains symbol values at each pixel. The symbol values denote the outcome of decisions made on that pixel.

A coordinate system is used to obtain desired pixel values in the structures.
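A brief MATLAB sketch of moving between these representations (the file name is hypothetical; requires the Image Processing Toolbox):

    I_rgb  = imread('tomato.png');   % color image, H x W x 3, uint8
    I_gray = rgb2gray(I_rgb);        % grayscale: one intensity value per pixel
    I_bin  = imbinarize(I_gray);     % binary: each pixel 0 or 1
    [L, n] = bwlabel(I_bin);         % labeled image: a region label per pixel

    v = I_gray(120, 45);             % pixel access follows the I[r,c] convention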

3.3.2 Image Fusion

Image fusion can be employed at different levels of representation: signal, image, feature, or symbol. Signal-level fusion provides a fused signal with the same format as the combined source signals. Image-level fusion (or pixel-level fusion) generates a fused image in which each pixel is determined from a set of pixels in each source image. In feature-level fusion, features are first extracted from the sources and then joined, e.g. fusion of edge maps. Symbol-level fusion combines source data at the highest level of abstraction, by first processing the sources individually to acquire certain information, and then combining that information [11].


3.3.3 Techniques for Feature Extraction

This section provides commonly employed techniques for extraction of color, shape, and texture, as well as reconstruction of scenes and object tracking using depth features.

3.3.3.1 Color

Color features can be extracted from an image once a color space is defined. A color space is, as described by [22], an organization of colors. This organization can be represented in a three dimensional space, which means that a digital color image is represented by a two dimensional array containing three values at each pixel. Any three wavelengths of light can create many different colors, some more than others, when mixed in various proportions. The most common way to represent color is the red, green, and blue (RGB) color space, i.e. using particular wavelengths of these primary colors, the colors able to produce the greatest variety of other colors.

Figure 3.1: An illustration of the representation of different color spaces: (a) RGB; (b) HSI, HSV, and HSL.

Another important characteristic when viewing images is hue, which is of a circular quality, meaning it is represented in cylindrical coordinates of points in the RGB color space. There are several color spaces built upon this characteristic, in combination with other, often correlated, characteristics described by minimum and maximum values. These include lightness, brightness, brilliance, saturation, etc. The color spaces commonly used combine hue and saturation with lightness (HSL), value (HSV), or intensity (HSI). The RGB and the hue-based, cylindrical, color spaces can be seen in Figure 3.1. These are commonly used for feature extraction purposes [14].

Color features can be extracted by segmenting objects of the desired color in an image. This can be achieved by computing thresholds for each of the color bands defined in the color space. This can be done manually, or by an algorithm computing the optimal thresholds automatically. A commonly used method for this computation is Otsu's method, a clustering-based image thresholding technique introduced by [23]. The method selects the optimal threshold from the gray-level histogram, as the value maximizing the gray-level separability of the resulting classes. When the thresholds are applied to their respective color bands and the results are combined, segmentation is possible.
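A minimal MATLAB sketch of per-band thresholding into a binary red mask, in the spirit of the thesis's Table 4.2 (graythresh implements Otsu's method; the fixed green and blue thresholds here are hypothetical, not the values used in the thesis):

    I = im2double(imread('plant.png'));          % hypothetical input image
    R = I(:,:,1);  G = I(:,:,2);  B = I(:,:,3);

    tR   = graythresh(R);                        % Otsu threshold on the red band
    mask = (R > tR) & (G < 0.4) & (B < 0.4);     % combine bands into a red mask
    mask = bwareaopen(mask, 50);                 % remove small noise regions

    imshowpair(I, mask, 'montage');              % original and mask side by side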

3.3.3.2 Shape

Shape feature extraction techniques are usually either contour based, i.e. calculated from the boundary of a shape, or region based [14]. Tomatoes are somewhat circular, and in previous research in the field, the Circle Hough Transform (CHT) has been widely applied for identifying spherical objects [12]. The CHT works, as described in [24], by accumulating votes for all parameters that satisfy the constraints of each feature point. The votes are collected in an accumulator array, which is a discrete representation of the continuous multidimensional parameter space. When voting has been carried out, the array elements containing large numbers of votes, i.e. local maxima, indicate the presence of a circular shape. The position of a local maximum corresponds to the center of the circle.

The theory behind the method is based on a circle in a two dimensional space being described by:

(x − a)^2 + (y − b)^2 = r^2    (3.4)

where r is the radius and (a, b) is the center of the circle. For a fixed point (x, y), Equation 3.4 gives the compatible parameters: all parameters satisfying (x, y) are located on the surface of an inverted right-angled cone with its apex at (x, y, 0). Points where three or more cones intersect define the parameters of circles; an illustration can be seen in Figure 3.2. To extend this process, a two-stage approach can be applied, which first fixes the radius and finds the optimal circle centers in the two dimensional space, and then finds the optimal radius in a one dimensional parameter space.

Figure 3.2: An illustration of the CHT, in which edge points (x, y) generate votes on a conic surface in the three dimensional (a, b, r) parameter space.
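In MATLAB the two-stage CHT is available through imfindcircles (Image Processing Toolbox); a short sketch, where the radius bounds and sensitivity are assumed values for illustration:

    I = imread('plant.png');                          % hypothetical input image
    [centers, radii] = imfindcircles(I, [20 60], ...  % radius range in pixels
        'ObjectPolarity', 'bright', 'Sensitivity', 0.9);

    imshow(I);
    viscircles(centers, radii, 'Color', 'r');         % mark the detected circles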

3.3.3.3 Texture

There are several methods used for extraction of texture features, including spatial and spectral methods [14]. Spatial methods are based on computing pixel statistics or structures in the image. Spectral methods, on the other hand, are based on transforming the image into the frequency domain and calculating the features from the transformed image.

The most commonly used technique for texture feature extraction is Gabor filters [25]. Gabor filters are linear filters used for edge detection, and are an example of spectral feature extraction, in which the image is transformed into the frequency domain. The transformed image is then sampled and filtered with a number of Gabor wavelets (i.e. filters) of different spatial frequencies and orientations. Each Gabor wavelet returns the captured localized frequencies as a feature vector. The Gabor wavelet transform filters an image I(x, y) with a set of Gabor filters, whose function g(x, y), with σ_x and σ_y being the scaling parameters, W the center frequency, and θ the orientation, is defined as:

g(x, y) = 1 / (2π σ_x σ_y) · exp( −(1/2)(x^2/σ_x^2 + y^2/σ_y^2) + 2πjWx )    (3.5)
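MATLAB offers this directly through gabor and imgaborfilt (Image Processing Toolbox); a sketch of building a small filter bank and reducing the responses to a texture feature vector (the wavelengths and orientations are assumed, and this reduction is one simple choice, not the one used in the thesis):

    I = rgb2gray(imread('plant.png'));        % hypothetical input image
    bank = gabor([4 8 16], [0 45 90 135]);    % 3 wavelengths x 4 orientations
    mag  = imgaborfilt(I, bank);              % magnitude response per filter

    % One simple texture descriptor: mean magnitude of each filter response
    features = squeeze(mean(mean(mag, 1), 2))';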

3.3.3.4 Three Dimensional Reconstruction

Related points in depth images can be mapped by finding appropriate transformations, allowing the computer to keep track of objects of interest or reconstruct entire scenes. This process is referred to as point cloud registration. The theory behind this process is based on a point in the depth image being represented as:

p_i = (x_i, y_i, z_i)    (3.6)

from which a set of N related points can form a point cloud:

P = {p_i | i = 1, ..., N}    (3.7)

A point cloud is generated for each acquired image, and the clouds are then combined in order to reconstruct the entire scene. Each object will have a slightly transformed position in the following image, and by finding the relative transformation, points can be mapped from one image to another. The process consists of finding a spatial transformation aligning the transformed points. The translation between point clouds is expressed as a vector of movement along each of the three dimensional axes. When an appropriate transformation has been found, the point clouds can be aligned and the scene reconstructed. The movement of objects of interest can also be identified.

There are several methods used for point cloud registration, one of the most commonly used being Iterative Closest Point (ICP), introduced by [26]. As the name implies, this is an iterative algorithm, which for two sets of points P and M consists of the following three steps:

1. Pairing each point p_i in P with the corresponding closest point in M,

2. Finding the transformation minimizing the mean square error between the paired points,

3. Applying the transformation to the points p_i, updating the mean square error, evaluating whether it falls within a threshold, and repeating the process if not.


Figure 3.3: The basics of supervised machine learning.

This process can be performed independently on two point clouds, or on a whole set of point clouds by iteratively finding the translation between each point cloud and the next. These transformations are stored in a matrix and can then be applied to align all point clouds.
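A compact MATLAB sketch of the three ICP steps for two N x 3 point clouds P and M (knnsearch requires the Statistics and Machine Learning Toolbox; a real system would rather use a library routine such as pcregistericp, and this version omits the reflection check on the rotation for brevity):

    function P = icpSketch(P, M, maxIter, tol)
        prevErr = inf;
        for it = 1:maxIter
            idx = knnsearch(M, P);               % 1) pair each point with its
            Q   = M(idx, :);                     %    closest point in M

            muP = mean(P, 1);  muQ = mean(Q, 1); % 2) best rigid transform (SVD)
            [U, ~, V] = svd((P - muP)' * (Q - muQ));
            R = V * U';                          % rotation
            t = muQ - muP * R';                  % translation

            P   = P * R' + t;                    % 3) apply transform, update the
            err = mean(sum((P - Q).^2, 2));      %    mean square error, and stop
            if abs(prevErr - err) < tol          %    once within the threshold
                break;
            end
            prevErr = err;
        end
    end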

3.4 Decision Making

This section presents the basics of supervised classification methods, investigates the methods most commonly used in harvesting applications, and discusses their characteristics and the validation strategies employed for evaluation.

Classification is the problem of identifying to which of a set of categories a new observation belongs, based on data containing instances whose class membership is known. Classification can be either binary or multi-class. In a binary classifier, an instance can belong to one of two classes, whereas in a multi-class classifier the instance can belong to one of several classes [27]. The performance of a classifier depends on the data to be classified. Both supervised and unsupervised methods are reported in previous research, although supervised methods predominate.

Supervised machine learning refers to the ability of a model to predict responses to new data after being trained with examples of known data and known responses during a training phase. An illustration of the process, including a training and a testing phase, is shown in Figure 3.3.


3.4.1 Supervised Machine Learning Methods

Supervised methods reported in previous research include the Naive Bayes classifier (NB), k-Nearest Neighbor clustering (k-NN), Decision Trees (DT), Artificial Neural Networks (ANN), and Support Vector Machines (SVM). The theory behind these is explained in the following sections, and a comparison of their characteristics, as well as validation strategies, is presented at the end, in Section 3.4.2.

3.4.1.1 Naive Bayes Classifier

Bayesian classifiers are statistical classifiers, based on Bayes' theorem, that can predict class membership probabilities [28]. Bayes' theorem uses two types of probabilities: the posterior probability P(H|X) and the prior probability P(H), where X is the data containing the features and H is some hypothesis; in the case of classification, H is a class depending on the features in X. According to Bayes' theorem:

P(H|X) = P(X|H) · P(H) / P(X)    (3.8)

The instance is assigned to the class with the maximum posterior probability.
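In MATLAB this corresponds to fitcnb (Statistics and Machine Learning Toolbox); X, y, and Xnew below are hypothetical placeholders for a feature matrix, label vector, and new observations:

    mdl = fitcnb(X, y);                        % Gaussian naive Bayes by default
    [label, posterior] = predict(mdl, Xnew);   % class with max posterior wins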

3.4.1.2 k-NN Clustering

The information in this section is retrieved from [29, 28]. k-NN is one of the simplest and most intuitive machine learning algorithms; it classifies instances based on the classes of their k nearest neighbors. It is a non-parametric classification and regression algorithm. An illustration of a binary classifier in a two dimensional feature space with k = 3 can be seen in Figure 3.4.

A k-NN classifier is given a training dataset D made up of training samples x_i, i = 1, ..., |D|. The samples are described by a set of features F, in which numerical features are normalized to the range [0, 1]. Each training sample is labeled with a class label y_j ∈ Y. For each x_i, the distance to an unclassified point q can be calculated as:

d(q, x_i) = Σ_{f ∈ F} w_f · δ(q_f, x_if)    (3.9)

The k nearest neighbors are selected based on this distance metric, which can be described in several ways. A simple description handling discrete and continuous features is:


Figure 3.4: An illustration of a k-NN for point p with k = 3. The three nearest neighbors are found within the orange circle, determining the class of p.

δ(q_f, x_if) = 0             if f is discrete and q_f = x_if
δ(q_f, x_if) = 1             if f is discrete and q_f ≠ x_if
δ(q_f, x_if) = |q_f − x_if|  if f is continuous    (3.10)

The output of the classifier is a class membership determined by a majority vote of the k neighbors, i.e. the instance is assigned the class most commonly found among its neighbors. Weighting the contributions to this vote by distance, so that closer neighbors contribute more to the output, is a useful technique to improve accuracy. The contribution is weighted by, e.g., the inverse or the squared inverse of the distance to the point.
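A minimal MATLAB counterpart using fitcknn (Statistics and Machine Learning Toolbox); X, y, and Xnew are hypothetical:

    mdl = fitcknn(X, y, ...
        'NumNeighbors',   3, ...          % k = 3, as in Figure 3.4
        'Standardize',    true, ...       % feature scaling, cf. the [0,1] range
        'DistanceWeight', 'inverse');     % closer neighbors vote more strongly
    label = predict(mdl, Xnew);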

3.4.1.3 Decision Trees

The information in this section is retrieved from [30, 31]. DTs classify instances by sorting them on their feature values. A simple binary DT is illustrated in Figure 3.5, in which each terminating node represents a class label, each internal node a feature on which the data can be divided, and each branch a value by which the node's feature is divided. An instance is classified from the root node down, based on the values of its features.


Figure 3.5: An illustration of a binary classification tree, with two splits, in which the instances are classified based on two distinctive features.

The tree is built recursively using the entire training data, starting at the root node. The nodes are split according to a split criterion. The feature that best divides the training data is taken as the root node of the tree. Numerous methods can be used to find this feature, and no single method is best. This procedure of finding the best division is repeated until the training data is divided into subsets of the same class.

Underfitting, in which the tree is too small with regard to the training sample, can occur and induces errors. Overfitting, induced when the tree is too large, is an even more common problem in DTs. To avoid this, two approaches can be used:

1. Stopping the training algorithm before it reaches a perfect splitting of the training data,

2. Pruning the decision tree based on the optimal proportion between the complexity of the tree and the misclassification error.

Pre-pruning means not allowing the tree to grow beyond a specified depth. Post-pruning (i.e. removing nodes or assigning classes to them) is usually employed when evaluating performance.

If the prediction procedure would otherwise get stuck at a node due to missing values in the data, the tree uses surrogate splits: in addition to the best primary split, every node may also be split on one or more other features.
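As an illustration of how the pruning and surrogate-split concepts above map onto MATLAB (the environment used in this thesis), the sketch below grows a tree with a cap on its size (pre-pruning) and surrogate splits enabled, then post-prunes it at the level minimizing cross-validated error. X and y are placeholder training data; the sketch assumes the Statistics and Machine Learning Toolbox.

% Grow a classification tree; cap its size (pre-pruning) and enable
% surrogate splits for handling missing feature values at prediction.
tree = fitctree(X, y, ...
    'MaxNumSplits', 10, ...
    'Surrogate', 'on');

% Post-pruning: find the pruning level with the smallest
% cross-validated misclassification error, then prune to it.
[~, ~, ~, bestLevel] = cvloss(tree, 'Subtrees', 'all', 'TreeSize', 'min');
prunedTree = prune(tree, 'Level', bestLevel);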


3.4.1.4 Support Vector Machine

SVM is a relatively new supervised machine learning technique described by [32].

SVM produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space. A SVM classifies data by finding the best hyperplane separating the data points of one class from those of the other.

A SVM is given a training dataset of n points on the form $(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$, where $y_i \in \{-1, 1\}$ indicates which class $\vec{x}_i$ belongs to. Any hyperplane can be written as the set of points $\vec{x}$ satisfying:

$\vec{w} \cdot \vec{x} - b = 0$   (3.11)

where $\vec{w}$ is the normal vector to the hyperplane. If a hyperplane is able to linearly separate the training examples by their label, two parallel hyperplanes creating a margin can be used to ensure that no examples fall in between; the training examples lying on these margin hyperplanes are the support vectors. The hyperplanes are described as:

$\vec{w} \cdot \vec{x} - b = 1$   (3.12)

$\vec{w} \cdot \vec{x} - b = -1$   (3.13)

The goal is to achieve the maximum margin between the hyperplanes by minimizing $\|\vec{w}\|$, i.e. maximizing the distance between the margin hyperplanes and the instances on either side, in order to reduce an upper bound on the expected generalization error. The SVM is illustrated in Figure 3.6, including the separating hyperplane, the margin, support vectors, and instances belonging to either of the two classes.

The original SVM is a linear classifier, but using the kernel trick it is possible to create a non-linear SVM classifier. In the non-linear classifier every dot product is replaced by a non-linear kernel function. Most multi-class SVMs are implemented by combining several two-class SVMs.
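A minimal MATLAB sketch of the above, assuming placeholder training data X, binary labels y, and new observations Xnew: a two-class SVM where a Gaussian (RBF) kernel replaces the plain dot product to obtain a non-linear boundary.

% Train a binary SVM; 'Standardize' scales the features, and the
% Gaussian kernel yields a non-linear decision boundary.
svmModel = fitcsvm(X, y, ...
    'KernelFunction', 'gaussian', ...
    'Standardize', true);

% Classify new observations
labels = predict(svmModel, Xnew);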

3.4.1.5 Artificial Neural Networks

ANNs are inspired by biological neural networks, and are described thoroughly in [28]. An ANN consists of neurons, each fed some input, e.g. a set of features, on which calculations are performed to compute an output. The output can be fed into further neurons, creating one or several hidden layers within the network, as seen in Figure 3.7.

Figure 3.6: An illustration of a SVM with separating hyperplane, support vectors, margin, and instances of each class.

There are several different types of ANNs derived from this basic logic, which are further explained in [28]. The networks either allow bi-directional data flow, i.e. feedback loops, or not. The former are referred to as recurrent neural networks, and the latter as feed-forward neural networks.
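As an illustrative sketch only (it assumes MATLAB's Deep Learning Toolbox and placeholder data X, y, and Xnew, with y holding integer class indices), a small feed-forward pattern recognition network of the kind described above can be trained as follows.

% Feed-forward network with one hidden layer of 10 neurons.
% patternnet expects features-by-samples inputs and one-hot targets.
net = patternnet(10);
net = train(net, X', full(ind2vec(y')));

% Class scores for new samples; vec2ind recovers the class index
scores = net(Xnew');
predicted = vec2ind(scores);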

3.4.2 Algorithm Comparison and Validation

The information in the following sections is retrieved from [15, 30, 28, 33]. The choice of algorithm always depends on the classification problem, and no single method outperforms all others. This motivates comparing methods across problems with different sets of characteristics. This section discusses the characteristics of the presented methods in relation to the specific problem to be solved, as well as validation techniques for comparing them.

3.4.2.1 Comparison of Methods

The methods previously described all have different operational profiles in terms of how they handle large feature spaces relative to training sample size, irrelevant features, and noise; the speed and storage requirements of training and of new predictions; and bias. A comparison of the methods and their characteristics is presented in Table 3.1.


Figure 3.7: Illustration of neurons in an ANN with input, hidden and output layers.

ANNs are perceptron-based techniques, DTs logic-based, and NB, k-NN, and SVM statistical-based techniques.

With respect to the input of the classifiers, i.e. the features, the methods are more or less affected by the relationship between feature space dimensionality and training sample size. SVM and ANN are appropriate for handling large feature spaces, since they are complex models; this in turn requires a large sample size to achieve maximum accuracy. k-NN, DT, and especially NB show better performance in spaces of lower dimensionality. SVM and ANN also perform well when nonlinear relationships exist between the instances, while DT performs poorly under these conditions.

Some of the algorithms are more sensitive than others to the presence of irrelevant features and noise. k-NN, NB, and ANN are sensitive to irrelevant features and become especially inefficient in their presence. DTs are less sensitive, and the best performance is achieved by the SVM. Regarding noise sensitivity, the algorithms have fairly similar characteristics. An exception is k-NN: if an instance is assigned a neighbor whose feature values contain errors (i.e. noise), misclassification may occur. However, pruning strategies can make a DT rather resistant to noise.

In regards to the speed of and memory required by the methods, k-NN is both rather slow and demands a great deal of storage for both training and prediction of new data. The method with the best performance in these terms is NB, considering that it only requires a single pass over the data and only stores the prior and conditional probabilities.


Table 3.1: Characteristics of each considered method with regard to speed of training and making new predictions; effects of large feature spaces in relation to training sample size; irrelevant features; noise; and bias. Low performance or high sensitivity is indicated with (+), and the opposite with (+++).

Performance                                   NB     k-NN   DT     SVM    ANN
General Accuracy                              +      ++     ++     +++    +++
Speed (Training)                              +++    +++    ++     +      +
Speed (New Predictions)                       +++    +      +++    +++    +++
Large Feature Spaces / Training Sample Size   +      +      ++     ++     ++
Irrelevant Features                           +      ++     ++     +++    +
Noise                                         +++    +      ++     ++     ++
Bias                                          +++    +++    ++     ++     +

DTs are also trained quite quickly, while ANN (especially if the number of hidden nodes is large) and SVM are significantly slower during this phase. DT, ANN, and SVM are all quite fast and require less storage space when predicting new data, since they use a condensed summary of the data.

Bias is a term reflecting the contribution to error when the methods are trained with different input data. A high bias therefore usually corresponds to a constrained, simple classifier, which is insensitive to data fluctuations; NB and k-NN are examples of these. Algorithms with a low bias are more complex models, including ANN, SVM, and DT, for which overfitting may occur. However, SVMs have overfitting protection, increasing the bias, and DTs, as previously explained, have pruning strategies for dealing with overfitting.

3.4.2.2 Validation

A validation method is used to examine the predictive accuracy of the models. Validation of supervised machine learning methods can be done in numerous ways, the simplest and most widely used being cross-validation. This technique can in turn be employed in different ways. k-fold cross-validation divides the training set into k roughly equally sized subsets, setting aside some of the data to fit the model and some to test it. For every kth part, the model is fitted on the other k-1 parts, and the prediction error is calculated on the data in part k. An illustration of this process, using k=5, which is a typical value for this method (along with k=10), is shown in Figure 3.8.


Figure 3.8: An illustration of the divided subsets using cross-validation with k=5.

This process is repeated for all subsets, and the error rate estimate of the classifier is calculated as the average of the error rates over the subsets.

For the function $\kappa : \{1, \ldots, N\} \mapsto \{1, \ldots, K\}$, indicating the partition to which observation $i$ is allocated, and with $\hat{f}^{-\kappa}(\vec{x})$ denoting the function fitted with the $k$th part of the data removed, the cross-validation prediction error is calculated as:

$CV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L\left(y_i, \hat{f}^{-\kappa(i)}(\vec{x}_i)\right)$   (3.14)

The case of k = N is referred to as leave-one-out validation. All subsets then contain a single instance, which naturally leads to a more accurate estimation of the error, but also a more expensive computation requiring N applications of the method.
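A minimal MATLAB sketch of the procedure, assuming placeholder data X and y and a classifier trained as in the earlier sketches; leave-one-out is included for comparison.

% Five-fold cross-validation (k=5, as in Figure 3.8); kfoldLoss
% averages the misclassification error over the folds.
mdl = fitcsvm(X, y, 'KernelFunction', 'gaussian', 'Standardize', true);
cvmdl = crossval(mdl, 'KFold', 5);
err5 = kfoldLoss(cvmdl);

% Leave-one-out: as many folds as observations (k = N)
loo = crossval(mdl, 'Leaveout', 'on');
errLoo = kfoldLoss(loo);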

For comparison of supervised machine learning algorithms, statistical comparisons of their accuracies on a specific dataset are usually employed, using performance metrics such as the precision and recall described in Section 2.4. For a specified number of training sets run on a number of methods, an estimate of their differences can be obtained.
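As a sketch of those metrics, precision, recall, and the F1-score can be computed from a confusion matrix as below; ytrue and ypred are placeholders, and the positive class is assumed to sort second in confusionmat's class ordering.

% Confusion matrix: rows are true classes, columns predicted classes
C = confusionmat(ytrue, ypred);
tp = C(2,2); fp = C(1,2); fn = C(2,1);

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
f1        = 2 * precision * recall / (precision + recall);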


Implementation

This chapter presents and discusses the chosen solution for the hardware and software implementation, in regards to the requirements of the system, the scope of the project, and the devices and techniques investigated in the previous chapter.

4.1 Hardware Implementation

The hardware implementation consists of the sensing system as well as the laboratory environment, i.e. the experimental setup, with which the training and testing are carried out. The sensor solution chosen is a Microsoft Kinect for Windows v2.

The major reasoning behind this choice is the mandatory requirement of a system that can operate in three dimensions, while not exceeding the limited project budget and meeting the low-cost requirement. Other factors influencing this decision are that the sensor provides high-resolution color, depth, and IR data at high speed. This means no additional sensors are needed, since a combination of these can capture the most important features for fruit classification tasks. The depth is calculated based on the CW-modulation ToF principle, identified as the most promising solution for real-time applications.

The disadvantages of using this sensor are its power consumption, its weight, and the required connections to a power outlet and a PC, making it a rather stationary solution. The focus of this project is, however, in line with the scope and delimitations, to evaluate classification algorithms in a laboratory environment. This implies that while these requirements must be met in a commercial product, they are of less relevance to the results of this thesis, and are therefore not prioritized.
