
Traffic Sign Recognition Using Machine Learning

SHARIF SHARIF

JOANNA LILJA


Degree Programme in Computer Engineering
Date: June 10, 2020
Supervisor: Anders Sjögren
Examiner: Fadil Galjic
School of Electrical Engineering and Computer Science
Host company: VINA

Swedish title: Igenkänning av parkeringsskyltar med hjälp av maskininlärning

Traffic Sign Recognition Using Machine Learning / Igenkänning av parkeringsskyltar med hjälp av maskininlärning

© 2020 Sharif Sharif and Joanna Lilja


Abstract

Computer vision is an area of computer science that attempts to give computers the ability to see and recognise objects using varying sources of input, such as video or pictures. This problem is usually solved with artificial intelligence (AI) techniques, the most common being deep learning.

This thesis investigates the possibility of using these techniques to recognise traffic signs in real time, which would make it possible in the future to build a user application that does this. Information about available AI techniques is gathered, and three object detection deep learning models are selected: YOLOv3, SSD, and Faster R-CNN. The chosen models are compared in a case study to find out which one is best suited to the task of identifying parking signs in real time.

Faster R-CNN performed the best in terms of recall and precision combined. YOLOv3 lagged behind in recall, but this could be because of how we chose to label the training data. Finally, SSD performed the worst in terms of recall, but was also relatively fast.

Evaluation of the case study shows that it is possible to detect parking signs in real time. However, the hardware necessary is more powerful than that offered by currently available mobile platforms. Therefore, it is concluded that a cloud solution would be optimal if the techniques tested were to be implemented in a parking-sign-reading mobile app.

Keywords

Traffic sign detection; Machine Learning; Image Recognition; Object Identification


Sammanfattning

Computer vision is a field within computer science that focuses on giving machines the ability to see and recognise objects from different types of input, such as images or video. This is a problem that is often solved using artificial intelligence (AI), more specifically deep learning.

This project investigates the possibility of using deep learning to recognise traffic signs in real time, so that an application that recognises parking signs in real time could be built in the future. Information about available AI techniques is gathered, and three deep learning models are selected: YOLOv3, SSD, and Faster R-CNN. These models are used in a case study to find out which of them is best suited to the task of recognising parking signs in real time.

Faster R-CNN performed best in terms of object detection and precision combined. YOLOv3 detected fewer objects, but this was likely due to how we chose to label the training data. Finally, SSD detected the fewest objects, but was also relatively fast.

The evaluation of the case study shows that it is possible to recognise parking signs in real time. However, the necessary hardware is more powerful than what current mobile phones offer. It is therefore concluded that a cloud solution would be optimal if the tested techniques were to be used to implement an app for recognising parking signs.

Nyckelord

Traffic sign recognition; Machine learning; Image recognition; Object identification


Acknowledgments

We would like to thank our supervisor Anders Sjögren and examiner Fadil Galjic for their invaluable guidance and constructive criticism that helped shape this thesis. We would also like to thank our project owner, Luc Bui, for giving us this opportunity.

Stockholm, June 2020

Sharif Sharif and Joanna Lilja


Contents

1 Introduction
   1.1 Machine Learning
   1.2 Problem
   1.3 Purpose
      1.3.1 Ethics and Sustainability
   1.4 Project Owner
   1.5 Delimitations
   1.6 Outline

2 Theoretical Background
   2.1 Machine learning methods
      2.1.1 Deep learning
      2.1.2 Neural Networks
      2.1.3 Supervised vs. Unsupervised Learning
      2.1.4 Underfitting vs. Overfitting
      2.1.5 Transfer Learning
      2.1.6 Object Detection
      2.1.7 Sliding Window
      2.1.8 Convolution
      2.1.9 Pooling
      2.1.10 Region Proposals
   2.2 Machine Learning Models
      2.2.1 You Only Look Once
      2.2.2 Faster R-CNN
      2.2.3 Single Shot Multibox Detector
      2.2.4 Optical Character Recognition
   2.3 Related Works
      2.3.1 Object Classification and Localisation
      2.3.2 Object Detection with Deep Learning: A Review

3 Method
   3.1 Research Process
      3.1.1 Theoretical Background
      3.1.2 Overview
      3.1.3 Identifying Submodule Solutions
   3.2 Literature Study
   3.3 Technological methods
      3.3.1 Building Training Set
      3.3.2 Training the Networks
      3.3.3 Testing the Networks
      3.3.4 Evaluation
      3.3.5 Drawing Conclusions
   3.4 Development Method
      3.4.1 Development Process
      3.4.2 Software Tools and APIs Used

4 Results of the Experiments
   4.1 Training Set
   4.2 Testing Results
      4.2.1 Speed
      4.2.2 Precision and Recall
      4.2.3 Certainty
   4.3 Analysis of the Results
      4.3.1 Reliability
      4.3.2 Validity

5 Discussion
   5.1 Interpreting the result
      5.1.1 Technical Interpretation
      5.1.2 Technology Usability
   5.2 Possible Improvements
      5.2.1 Problems With Labeling Training Data

6 Conclusions and Future work
   6.1 Conclusions
   6.2 Limitations
      6.2.1 The Absence of Text Interpretation
      6.2.2 Training Set Size
      6.2.3 Lack of Frameworks for Mobile Phones
   6.3 Future work
      6.3.1 Text Interpretation
      6.3.2 Developing the Mobile App

References


Chapter 1

Introduction

Artificial Intelligence (AI) refers to digital systems that are able to intelligently extract and interpret data and/or act intelligently in their environment. The use of artificial intelligence has expanded greatly in recent years, as computers have become more powerful and new techniques have been developed. Some examples of its applications include image recognition, speech generation, diagnosing illness [1], and even strategy games [2].

1.1 Machine Learning

Machine learning is a subfield of artificial intelligence that concerns systems learning and generating logic on their own by training on some dataset [3]. There are ample possibilities to use these techniques to solve complex problems, such as object detection in images, as will be described in detail in this report.

1.2 Problem

The parking of motor vehicles is dictated by rules that constrain the location, time, and types of vehicles that are allowed to park. Without constant exposure to these rules, it can be hard to remember and understand them all.

Parking rules are given in the form of traffic signs. Some signs may have text, while others have symbols. Automating the process of interpreting parking signs can assist new driving students in figuring out rules, as well as help seasoned drivers understand a parking rule they may not have come across recently.


Furthermore, the problem of identifying and interpreting traffic signs, more specifically parking signs, could also be applied to smart cars. In a world where self-driving cars are increasingly prevalent, we are beginning to expect vehicles that can understand parking signs and park for us.

The question we want to ask to address the problem is the following:

How can parking signs be identified and interpreted using machine learning?

1.3 Purpose

The purpose of this thesis is to investigate how to interpret parking signs using machine learning, preferably such that it can be done on a mobile phone.

Interpreting parking signs digitally could help the average driver to know where it is legal to park and when. With an accurate and effective parking sign detector, unlawful parking could be reduced. In addition to this, it might be possible to use this technology in autonomous vehicles.

1.3.1 Ethics and Sustainability

The possible beneficiaries of the project are people who want to develop real-time applications that can identify and interpret traffic signs, such as autonomous car companies. Other beneficiaries, such as driving schools and tutors, could use the developed application as an aid when teaching the parking rules that apply.

The project also has some sustainability aspects, since the goal is to be able to read parking signs. Reading parking signs is a computer vision problem whose solutions can be applied to autonomous vehicles, and autonomous vehicles can help achieve social sustainability goals.

1.4 Project Owner

The project owner at VINA [4] has asked us to explore whether an app can be made that takes pictures of parking signs and outputs their meaning in a clear, easy-to-understand way, and if so, how this can be done. Showing that it is possible is part of what is expected, but the solution should also be feasible to run on mobile phones available on the current market.


1.5 Delimitations

This paper is not about creating a final app for VINA, as the workload of such an app may be out of the scope of this project. It does, however, investigate how the backend could best be implemented by comparing three different machine learning methods.

Data collection and labeling is a time-consuming task. Therefore, the dataset used as training data will be relatively small compared to the datasets normally used, which affects the accuracy with which parking signs are detected. It also means that only a subset of parking signs will be used for the project: the parking sign, the no parking sign, the no stopping or parking sign, and signs showing the times when parking is allowed.

Furthermore, this report does not handle interpretation of text, only detection of it.

The app will only be compatible with Swedish parking signs, since the datasets used in training consist of Swedish parking signs, and parking signs are not used in exactly the same way in other countries.

1.6 Outline

The structure of the rest of this document is as follows. Chapter 2 details the literature study, containing the theoretical background that explains relevant information about the image recognition methods and models handled in this report.

In Chapter 3 the methodology is outlined. Both the research method and the project method are described, and the choice of machine learning models as well as the methods for documentation and modeling are presented.

Chapter 4 presents the results yielded by the methods laid out in the previous chapter. It also analyses these results, and the method that produced them, in terms of validity and reliability.

Subsequently, Chapter 5 holds the discussion, in which we describe our interpretation of the results. This chapter also presents what we would have liked to have done differently.

Finally, Chapter 6 states our conclusions, explains the limiting factors, and reflects on the next steps for future work following this project.


Chapter 2

Theoretical Background

The purpose of this chapter is to provide the theoretical background necessary to understand the rest of the report. It attempts to give a brief yet comprehensive explanation of the topics. It also touches on how some of the more complex aspects work, so that the results of the research can be discussed effectively in the final chapters with a reasonable understanding of what goes on behind the scenes when we say things like "convolution" and "overfitting".

2.1 Machine learning methods

This section describes the general areas and methods of machine learning that are used by the models tested in this report. To understand how the tested models work, a general understanding of these topics is required.

2.1.1 Deep learning

Deep learning is a subfield of machine learning. The key difference that makes deep learning excel is feature extraction: in traditional machine learning, feature extraction is performed manually by a human, whereas in deep learning it is done by the model itself [5]. Loosely defined, feature extraction is the process of extracting the properties of the input data that predictions are to be based upon. In this way, using deep learning and training data, image recognition can be achieved without explicit programming.

A key property in images is that pixels nearby each other are more correlated than pixels further apart. Convolutional networks exploit this by extracting local features that depend on small sections of the full image.


2.1.2 Neural Networks

A neural network in computer science is a graph (a structure of nodes and edges) that loosely models a network of biological neurons in order to solve complex problems. There are many types of neural networks of widely varying complexity, but the basic principles are the same [6]. Let us use a simple multilayer perceptron (MLP) as an example. An MLP is a network made up of three or more layers of nodes: an input layer, one or more hidden layers, and an output layer, as shown in Figure 2.1 [7].

Figure 2.1 – Possible setup of a multilayer perceptron

These nodes are more specifically known as perceptrons. The purpose of a perceptron is to calculate its so-called activation level, based on a number of inputs, weights, and biases. The weights and biases are what are modified during the training process to allow the network to learn, while the inputs are the activations calculated by the previous layer. The flow from activation to input is shown by the directed edges in Figure 2.1.

What about the input layer? Its purpose is to feed the rest of the network the information in which some underlying structure is hoped to be detected, for example an image. To feed an image into the input layer, it can be converted into a vector, whose values are then fed into the nodes of the input layer to serve as their activation levels. Once the input layer is filled, the activations cascade through the hidden layers until they eventually reach the output layer.

Each output node may represent a class for input classification. The activation generated by that node represents how sure the network is that the input is of that specific class. However, as mentioned previously, MLPs are quite primitive. The networks used in this project are more sophisticated.
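To make the description above concrete, the following Python sketch shows how activations could cascade through a small MLP. The layer sizes, the random weights, and the sigmoid activation are illustrative assumptions, not the configuration of any network used in this project.

    import numpy as np

    def sigmoid(z):
        # Squashes the weighted sum into an activation level between 0 and 1.
        return 1.0 / (1.0 + np.exp(-z))

    def mlp_forward(x, layers):
        """Propagate an input vector through a list of (weights, biases) layers."""
        activation = x
        for weights, biases in layers:
            # Each node's activation is a weighted sum of the previous layer plus a bias.
            activation = sigmoid(weights @ activation + biases)
        return activation

    # Toy network: 4 inputs -> 3 hidden nodes -> 2 output classes.
    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(3, 4)), np.zeros(3)),
              (rng.normal(size=(2, 3)), np.zeros(2))]
    print(mlp_forward(np.array([0.2, 0.5, 0.1, 0.9]), layers))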


2.1.3 Supervised vs. Unsupervised Learning

The strategy that should be used when applying machine learning depends on the goal, but also on what type and amount of training data is available. Machine learning models can be grouped by the degree of supervision they require, often called supervised learning, semi-supervised learning, and unsupervised learning. In unsupervised learning, no labeled training data is provided, which means that the machine has to find the underlying patterns on its own. It does this by grouping data based on similarities and differences, for example sorting pictures of cats and dogs. Classification is a type of supervised learning. Classification is normally used if the data can be categorized or tagged in discrete ways. For example, machine learning programs that can decipher handwriting are well suited for classification, as letters and numbers form a finite set [8].

2.1.4 Underfitting vs. Overfitting

A machine learning model is said to be underfitted when it has not been shown enough training data to capture the underlying trend that we are trying to teach it. When training a neural network, it is crucial to be cautious of not only the effects of underfitting the model, but of overfitting it as well.

Supervised machine learning works by using models to make predictions based on previous training on data that has already been labeled with the correct answer. To test that the program works, it is shown a set of data that is different from the training data. This is done to detect problems like overfitting. Overfitting occurs when the network has been trained with so much of the same data that it starts memorizing irrelevant details, which can lead to the network not categorizing new input correctly [9].

2.1.5 Transfer Learning

Transfer learning is the practice of using an already trained model as the starting point when training a new model, as opposed to starting with randomized parameters. It is a popular approach in computer vision, since it can reduce the time required for training thanks to the head start gained from related problems. It is similar to how you would have an easier time learning to play the guitar if you already know how to play the piano [10].
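As an illustration, transfer learning might look roughly like the Keras sketch below: a network pre-trained on ImageNet is reused as a fixed feature extractor and only a new classification head is trained. MobileNetV2 and the number of output classes are assumptions for the example and are not the detectors evaluated in this thesis.

    import tensorflow as tf

    # Start from a network pre-trained on ImageNet instead of random parameters.
    base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                             include_top=False,
                                             weights="imagenet")
    base.trainable = False  # keep the learned feature extractor fixed

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(13, activation="softmax"),  # one output per sign class (number chosen for illustration)
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")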


2.1.6 Object Detection

Object classification is, as the name suggests, the process of finding out the class of an object. Object localisation, on the other hand, is about finding out where in the image the object exists. This is usually represented by a bounding box containing the object. Finding a class of the object and its bounding box is what is known as object detection. [11]

2.1.7 Sliding Window

Object detection can be achieved with the sliding window method, where many sub-images of fixed size are read from an input image in a manner that systematically "slides across" the entire input image, and each sub-image is fed through an image classifier. When the sub-images are classified, both the class of the object and its location in the input image can be deduced, effectively accomplishing object detection.
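A minimal sketch of the sliding window idea is shown below; the window size, the stride, and the classifier function are hypothetical placeholders.

    def sliding_windows(image, window_size, stride):
        """Yield (x, y, sub_image) crops that systematically cover the input image."""
        height, width = image.shape[:2]
        for y in range(0, height - window_size + 1, stride):
            for x in range(0, width - window_size + 1, stride):
                yield x, y, image[y:y + window_size, x:x + window_size]

    # Hypothetical usage: keep the window positions the classifier labels as a parking sign.
    # detections = [(x, y) for x, y, crop in sliding_windows(img, 64, 16)
    #               if classifier(crop) == "parking_sign"]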

2.1.8 Convolution

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks most commonly applied to analyzing visual imagery. What sets CNNs apart is that they are designed to capture the context in an image using filters. While primitive versions of this method may use hand-crafted filters, CNNs learn these filters on their own through training. The reason the image is not simply treated as a vector of pixels and fed through a multilayer perceptron is that, while that may work well for simpler problems, it performs poorly on complex images with many pixel dependencies [6]. Convolutional neural networks are usually made up of convolution layers, activation function layers, and pooling layers, which together produce feature maps [12].
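The following sketch illustrates the basic convolution operation on a single-channel image (in the deep-learning sense, i.e. without flipping the kernel). In a real CNN the filter values are learned during training; the hand-crafted edge filter here is only an example.

    import numpy as np

    def convolve2d(image, kernel):
        """Naive 2D convolution with valid padding: slide the filter over the image."""
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # A hand-crafted vertical edge filter; a CNN would learn such filters by itself.
    edge_filter = np.array([[-1.0, 0.0, 1.0]] * 3)
    # response = convolve2d(gray_image, edge_filter)  # gray_image is a hypothetical 2D array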

2.1.9 Pooling

Adding pooling layers is a way of summarizing the features present in different areas of the feature map. Commonly applied variants are average pooling and max pooling: average pooling summarizes the average presence of a feature, while max pooling summarizes the most activated presence of a feature [13].
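A small sketch of both pooling variants, assuming a single-channel feature map and non-overlapping pooling regions:

    import numpy as np

    def pool2d(feature_map, size=2, mode="max"):
        """Summarise non-overlapping size x size regions of a 2D feature map."""
        h, w = feature_map.shape
        cropped = feature_map[:h - h % size, :w - w % size]
        blocks = cropped.reshape(h // size, size, w // size, size)
        if mode == "max":
            return blocks.max(axis=(1, 3))    # most activated presence of a feature
        return blocks.mean(axis=(1, 3))       # average presence of a feature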


2.1.10 Region Proposals

The purpose of region proposals is to find possible locations of identifiable objects in an image [12]. Many detection networks depend on region proposal models to find plausible locations of objects [14]. One way to generate these proposals is to slide a small network over a convolutional feature map.

A common strategy for generating region proposals is to use a Region Proposal Network (RPN). An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position.

2.2 Machine Learning Models

This section lists the models used in this research project and provides the necessary knowledge to understand what they do.

2.2.1 You Only Look Once

The name You Only Look Once (YOLO) [15] refers to the fact that the image is only processed through the model in one pass. The image as a whole is divided into a grid and passed into the model, and each cell predicts bounding boxes and classifications; during training, the predictions are evaluated by how much the detected objects overlap the ground truth. Darknet is a framework for training neural networks written by the author of the YOLO model and serves as a basis for it.

The modified and improved version, called YOLOv3 [16], is used in this project. In contrast to two-stage detectors, which first split the image into interesting regions and then run these regions through a convolutional neural network to classify the objects, YOLO performs detection and classification in this single pass.
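The overlap between a predicted box and the ground truth is commonly quantified as intersection over union (IoU); the report does not state the exact matching criterion used, so the following is only a generic sketch.

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter)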

2.2.2 Faster R-CNN

R-CNN stands for Region-Based Convolutional Neural Network [17]. Later came Fast R-CNN [18] and Faster R-CNN [12], which propose improvements in performance over the preceding models.

Faster R-CNN improves region proposal computation by introducing a Region Proposal Network (RPN). The RPN generates region proposals, which are used by Fast R-CNN for identification. By sharing convolutional features with the detection network, Faster R-CNN is able to significantly reduce the time cost of region proposals. In other words, it combines the RPN and Fast R-CNN into a single network by sharing their convolutional features and letting the RPN component tell the unified network where to look.

2.2.3 Single Shot Multibox Detector

A Single Shot Multibox Detector (SSD) [19] is a way of performing object detection using a single deep neural network. "Single shot" refers to the fact that object localization and classification are completed in a single pass through the network. "Multibox" [20] is a kind of bounding box regression developed by Szegedy et al., and "detector" simply refers to the object detection purpose of the network [21].

The SSD network discretizes the output space of bounding boxes into a set of default boxes. Scores are then generated to evaluate the presence of each object category in each default box, and adjustments are made to the boxes so that they better correspond to the objects.

2.2.4 Optical Character Recognition

Optical character recognition (OCR) is the process of identifying text in images. There are numerous methods and technologies that achieve this. One of them, developed by Google, is called Tesseract [22]. It uses neural networks to detect text lines, which are then used for character recognition.

2.3 Related Works

This section highlights similar works that have previously been conducted by others.

2.3.1 Object Classification and Localisation

The master’s thesis "Object Classification and Localization Using Machine Learning Techniques" by Asplund at Chalmers University [23] has some overlap with this report. As the title suggests, like this thesis, it deals with object identification in images. It also includes a literature study on machine learning on behalf of Volvo Advanced Technology and Research.

Due to this, there is some amount of shared ground with this thesis. For example, Asplund also covers Faster R-CNN. The major difference is that this report also handles YOLOv3 and SSD, and puts the three models up against each other to compare their accuracy and performance on the task of identifying parking signs.

2.3.2 Object Detection with Deep Learning: A Review

"Object Detection with Deep Learning: A Review"[24], also has some overlap with this report. It is however much more extensive. Zhong-Qiu Zhao et.

al. begin by introducing the history of deep learning and convolutional neural networks. They then dive into generic object detection architectures.

Several specific tasks are also discussed, such as salient object detection, face detection, and pedestrian detection. They even give experimental analyses to compare various methods. In the final part, they propose a number of promising directions for future work that could be done in the field.

The paper covers both YOLO, R-CNN, and SSD, but also a multitude of other methods. The main difference between the papers is that this thesis has a smaller scope, deciding to focus on just three models, and applying them to a single problem.


Chapter 3

Method

This chapter describes the methods used in this project. It goes over the methodology, including the different phases of the project and the steps necessary to recreate the experiment, and each phase is explained in detail. In order to examine the possibility of developing a parking sign app, we first conduct a literature study. Once we have selected the models we want to test, we develop a testing environment and evaluate the models against each other based on how well they perform.

3.1 Research Process

This section describes the theoretical background about research methodology and how it was used in this thesis.

3.1.1 Theoretical Background

The methodology used for this project is adapted from the methodology for technological research by Andersson and Ekholm (2002) [25]. This technological research methodology is a generalised version based on Mario Bunge's work (1982) [26]. Our methodology consists of performing the following steps:

1. How can the problem be solved?

2. How can a technique or product be developed to solve the problem in an efficient way?

3. What research or data already exists for this type of technique?


4. Develop the new technique with step 3 as a standpoint. If this works, jump to step 6.

5. Try with a new technique.

6. Create a model or simulation of the suggested technique or product.

7. Which consequences does the new technique or product bring?

8. Test the technique or product. If the test is satisfactory, jump to step 10.

9. Identify and fix the shortcomings of the technique or product.

10. Evaluate the result in comparison with existing knowledge and identify new problems that occur for future research.


3.1.2 Overview

The following figure displays an overview of the phases and processes that will be used in this thesis.

Figure 3.1 – Overview of the research methodology

Literature study, Steps 1-3

Steps 1-3 of our research process are accomplished in the literature study phase. First, in order to find out how to solve the problem, the problem is broken up into submodules. Then literature is studied to gain background knowledge on the different areas needed to solve the submodules.

Building Training Set & Training, Steps 4-7

The purpose of steps 4-7 is to develop a technique or product based on the previous steps, and to look into the consequences of each technique. In this thesis, this is implemented by building the training set and training the machine learning models that are going to be tested.

Test ML Models & Conclude, Steps 8-10

Finally, the purpose of steps 8-10 of our research process is to evaluate the results, identify shortcomings of the techniques used, and identify possibilities for future research. This is implemented by analysing the measurements of the results and comparing them to the desired goals.

3.1.3 Identifying Submodule Solutions

Interpreting traffic signs programmatically is a large task that can be broken down into smaller submodules. Three parts emerge immediately. First, parking signs must be located and identified. Second, the text on these signs must be recognised and interpreted. The third and final submodule is to implement this on an Android phone. The solutions to the submodules are identified by researching the following questions:

• What technologies exist for object detection in images using AI techniques?

• What technologies exist that enable reading text from images?

• Are these technologies implementable on mobile phones?

3.2 Literature Study

The literature study was divided into three parts, where each part finds solutions to one of the above submodules. Knowledge of the different machine learning models that exist for object detection and localisation was gained through articles on towardsdatascience.com and medium.com. Research papers, found through arxiv.org, were used to gain a deeper understanding of these models. Three models were chosen based on this research: Faster R-CNN [12], SSD [19], and YOLOv3 [16].

These articles also gave knowledge on how to build training sets and train the models with machine learning frameworks such as Tensorflow and Darknet. Tensorflow Lite, a framework for running these models on mobile phones, was found on tensorflow.org, although it only supports SSD models. No other frameworks could be found that support the other machine learning models on mobile phones.


The articles also gave knowledge about what technologies exist for translating text in images into actual text. One such technology is Tesseract, which takes in an image containing text and outputs the text it could find. Tesseract also provides tools for training it on your own text, which other open source OCR technologies do not provide. Tesseract was also the only open source library found that is still being maintained and has decent language support.

3.3 Technological methods

This section describes the process undertaken while working on the practical parts of the thesis.

3.3.1 Building Training Set

In this phase, images of the parking signs to be detected were collected. Transportstyrelsen.se and Trafikverket.se were contacted to see if they had images of parking signs, which unfortunately they did not. Another source for the data collection was Google Images, where images were found by searching for the keyword 'parkeringsskyltar'. Images were also collected manually by going out and photographing parking signs in the city close to where each member lived.

The different classes of signs that were selected were: the P in parking signs, text on the signs, no parking, no stopping, date parking, 'parkeringsskiva' (a sign that mandates displaying when the car was parked), disabled-only signs, and bus, car, truck, and motorbike signs.

3.3.2 Training the Networks

In this phase, Faster R-CNN and SSD were trained on the training set described above using the Tensorflow framework and the Tensorflow Object Detection API. The YOLOv3 model was trained on the same training set, but using the Darknet framework. Both frameworks provide output showing how the loss function is developing, and all networks were trained until the loss was no longer changing significantly. All training was done on the same local computer, belonging to one of the members.


3.3.3 Testing the Networks

The testing phase is where the performance of the implemented image recognition techniques is tested. A test set was built by holding out 20% of the collected data; these images were not included in the training, as that would lead to skewed results. All of the networks were tested on the same computer.

The tests were realised by writing two programs, one using the Tensorflow framework and one using OpenCV. Both programs were written in Python due to the abundance of documentation. Each program reads a directory containing the test set and processes each image individually by running it through the trained networks. The output of the programs is an image with boxes drawn around the detected parking sign objects and the probabilities associated with them, as well as the time it took to pass the image through the network.
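The two test programs themselves are not reproduced in the report. As an illustration, the Tensorflow-based program might look roughly like the sketch below, assuming a frozen graph exported with the Object Detection API and its standard tensor names; the file and directory names are placeholders.

    import os
    import time

    import cv2
    import numpy as np
    import tensorflow as tf

    # Load a frozen inference graph exported by the Tensorflow Object Detection API.
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())

    graph = tf.Graph()
    with graph.as_default():
        tf.compat.v1.import_graph_def(graph_def, name="")

    with tf.compat.v1.Session(graph=graph) as sess:
        image_tensor = graph.get_tensor_by_name("image_tensor:0")
        outputs = [graph.get_tensor_by_name(name) for name in
                   ("detection_boxes:0", "detection_scores:0",
                    "detection_classes:0", "num_detections:0")]

        for file_name in sorted(os.listdir("test_set")):
            image = cv2.imread(os.path.join("test_set", file_name))
            rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

            start = time.time()
            boxes, scores, classes, num = sess.run(
                outputs, feed_dict={image_tensor: rgb[np.newaxis, ...]})
            elapsed_ms = (time.time() - start) * 1000

            # Boxes with a high enough score would be drawn onto the image here.
            print(f"{file_name}: {elapsed_ms:.0f} ms, {int(num[0])} detections")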

3.3.4 Evaluation

In order to compare the different models in a meaningful way, each model is evaluated by looking at speed, precision, and recall. Speed is measured in milliseconds and is the time it takes for the network to process an image.

Precision and recall are based on three parameters. These are:

• True Positive (TP) - the number of correctly detected objects

• False Positive (FP) - the number of incorrectly detected objects

• False Negative (FN) - the number of undetected objects

Precision, which measures how accurate the model was when detecting objects, can be calculated by

Precision = TP / (TP + FP)

Recall, which measures how good the model was at detecting the correct objects, can be calculated by

Recall = TP / (TP + FN)
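The two measures can be computed directly from the counts, as in the small Python sketch below; the example values are the Faster R-CNN counts reported later in Table 4.3.

    def precision_recall(tp, fp, fn):
        # Precision = TP / (TP + FP), Recall = TP / (TP + FN)
        return tp / (tp + fp), tp / (tp + fn)

    # Faster R-CNN counts from Table 4.3: TP = 244, FP = 9, FN = 8
    precision, recall = precision_recall(244, 9, 8)
    print(f"precision = {precision:.3f}, recall = {recall:.3f}")  # precision = 0.964, recall = 0.968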

3.3.5 Drawing Conclusions

In the conclusion phase, the result that gave the best performance was analysed further. The analysis was conducted with regard to the usability of this technology in different applications.

3.4 Development Method

This section provides information about how time and money constraints affected the scope of the project, and about the process used to stay within those constraints.

3.4.1 Development Process

The project was developed following the agile software development method [27], which is an iterative approach. For each model decided on, the platform was developed in a single iteration using a small subset of parking signs. After each successful iteration, the subset of parking signs that can be processed is increased. This process continues until either the time budget or the available parking sign images are exhausted.

The project budget is constrained in terms of time and money. Building the training set and training the different machine learning models take up a majority of the available time. The project consists of smaller parts that need to be solved, and in order to stay within the available time frame the submodules were developed iteratively instead of in parallel. Training machine learning models also takes a lot of time. One option could have been to pay for cloud computing to speed up the training, but that expense was not affordable, so time had to be sacrificed instead. This meant that the other submodules could not be developed in the time that remained.

3.4.2 Software Tools and APIs Used

This section lists and explains the software tools, frameworks, and APIs used in this project.

LabelImg

To prepare the training set, bounding boxes needed to be drawn around the objects of interest, and these boxes needed to be labeled. LabelImg [28] is a tool for labeling images used as training data for object detection models. It provides a graphical user interface that makes it easy to draw bounding boxes and label them with classes. Figure 3.2 shows an example of a labeled image open in LabelImg.


Figure 3.2 – A labeled image open in LabelImg
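LabelImg can save its annotations either as Pascal VOC XML files or as YOLO-style text files. As a sketch, an XML annotation like the one produced for the image above could be read back with a few lines of Python; the file name and the example output are assumptions.

    import xml.etree.ElementTree as ET

    def read_voc_annotation(xml_path):
        """Return (class name, xmin, ymin, xmax, ymax) for every labeled box in the file."""
        root = ET.parse(xml_path).getroot()
        boxes = []
        for obj in root.findall("object"):
            bndbox = obj.find("bndbox")
            boxes.append((obj.find("name").text,
                          int(bndbox.find("xmin").text), int(bndbox.find("ymin").text),
                          int(bndbox.find("xmax").text), int(bndbox.find("ymax").text)))
        return boxes

    print(read_voc_annotation("parking_sign_001.xml"))  # e.g. [('Blue sign', 34, 12, 210, 340)]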

Tensorflow

In order to build the machine learning models and perform the training, Tensorflow and Tensorflow's Object Detection API were used. Tensorflow is an open source library for machine learning and deep learning. The Object Detection API provides the most common and popular object detection models already built. These models are trained on publicly available image datasets and were used as a starting point when training on our own set.

Darknet

Similarly, Darknet is an object detection training framework specifically for the YOLOv3 method. Darknet was used to build and train the YOLOv3 model.


OpenCV

OpenCV is another open source computer vision library, which includes a deep neural network module. It was used to test YOLOv3, as it has support for that model. OpenCV was used instead of Darknet because Darknet would have had to be recompiled in order to run only on the CPU, which is how the other models were tested.
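A sketch of how YOLOv3 can be loaded and timed on the CPU with OpenCV's DNN module; the configuration, weight, and image file names are placeholders.

    import time

    import cv2

    # Load the Darknet-trained YOLOv3 network into OpenCV's DNN module and force CPU execution.
    net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3_final.weights")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

    image = cv2.imread("parking_sign.jpg")
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)

    start = time.time()
    outputs = net.forward(net.getUnconnectedOutLayersNames())  # raw detections from the YOLO output layers
    print(f"Inference took {(time.time() - start) * 1000:.0f} ms")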


Chapter 4

Results of the Experiments

This chapter presents the raw data produced by the experiments. For each model, speed, precision, and recall are presented. Finally, the results are analysed in terms of validity and reliability.

4.1 Training Set

The training set consists of a set of images and their corresponding text files, which contain the class and location of the objects that should be identified. Figure 4.1 shows an example of a labeled image. Table 4.1 shows the different classes that are being trained on and the number of instances that exist in the training set.

Figure 4.1 – A labeled image


Table 4.1 – Number of instances per class

Class            Instances
Parking          263
Text             516
Parking sheet    10
No parking       115
No stopping      48
Disabled         50
Date parking     7
Bus sign         1
Car sign         9
Truck sign       2
Motorbike sign   14
Yellow sign      206
Blue sign        632

4.2 Testing Results

This section presents the results produced by the models. From the generated bounding boxes and their associated classes, precision and recall are extracted. How fast each test was is also recorded, as well as how certain the models were. An example of a resulting image after it was shown to Faster R-CNN is displayed in Figure 4.2. The percentages shown next to the class names in the figure represent certainty, i.e. how confident the model is that its answer is correct.


Figure 4.2 – Sample of the result


4.2.1 Speed

To test the speed of the models, the test set was shown to each model 5 times. The results, as well as the average speed over these 5 runs, are shown in Table 4.2. As can be seen, YOLOv3 was the fastest, averaging 556 ms per image. SSD was the second fastest, taking 926 ms on average. Finally, Faster R-CNN was the slowest, taking 3528 ms on average.

Table 4.2 – Speed over 5 executions (ms/image)

Model          Run 1   Run 2   Run 3   Run 4   Run 5   Avg.
Faster R-CNN   3407    3485    3644    3542    3564    3528
YOLOv3         571     535     537     556     585     556
SSD            969     906     918     914     923     926

4.2.2 Precision and Recall

This section presents precision and recall, which refer to how correct the models were and how good they were at detecting objects, respectively. They are calculated from true positives (TP), false positives (FP), and false negatives (FN), as explained in Section 3.3.4. These values are all displayed in Table 4.3.

Precision

As shown in Table 4.3, SSD had the highest precision of 98%. Faster R-CNN had the second highest precision of 96%, and YOLOv3 had the lowest precision of 95%. Precision is calculated as Precision = TP / (TP + FP).

Recall

Table 4.3 shows that Faster R-CNN had the highest recall of 96%. YOLOv3 had the second highest recall of 52%, whereas SSD had the lowest recall of 22%. Recall is calculated as Recall = TP / (TP + FN).


Table 4.3 – Precision and recall

Model          TP    FP   FN    Precision   Recall
Faster R-CNN   244   9    8     0.96        0.96
YOLOv3         129   6    115   0.95        0.52
SSD            57    1    195   0.98        0.22

4.2.3 Certainty

The certainty denotes how certain the model is that its guess for a given detected object is correct. It can be seen as the percentage next to the class name in Figure 4.2. Tables 4.4 to 4.6 show the average certainty for each model, as well as the average certainty per class. The tables show that Faster R-CNN had the highest average certainty of 95%, YOLOv3 had the second highest average certainty of 91%, and SSD had the lowest average certainty of 68%. SSD has missing certainty data for some classes (Parking sheet, No parking, No stopping, Disabled, and Motorbike sign) since it could not detect them at all.

Table 4.4 – Average certainty for Faster R-CNN

Class            Certainty
Blue sign        0.98
No parking       0.95
Yellow sign      0.96
No stopping      0.96
Parking sign     0.94
Disabled         0.79
Text             0.97
Motorbike sign   0.62
Parking sheet    0.48
Average          0.95

Table 4.5 – Average certainty for YOLOv3

Class            Certainty
Blue sign        0.96
No parking       0.91
Yellow sign      0.95
No stopping      0.87
Parking sign     0.89
Disabled         0.69
Text             0.95
Motorbike sign   0.66
Parking sheet    0.48
Average          0.91


Table 4.6 – Average certainty for SSD

Class            Certainty
Blue sign        0.73
No parking       0.32
Yellow sign      0.59
No stopping      -
Parking sign     0.74
Disabled         -
Text             0.50
Motorbike sign   -
Parking sheet    -
Average          0.68

4.3 Analysis of the Results

This section analyses the results, including how useful they are, in terms of reliability and validity: that is, how trustworthy the results are and to what degree they measure what was intended to be measured, respectively.

4.3.1 Reliability

This section makes an analysis of how reliable the result is.

Test Set

The size of the test set is of course important for how reliable the results are. The smallness of our test set, consisting of 38 images, negatively affects this. Furthermore, it does not cover all possible combinations.

The Value of Certainty

Something to note is that the models varied significantly in the certainty they produced. For example, Faster R-CNN was on average 95% certain, whereas SSD was on average 68% certain.

In the end, what we care about is that signs are correctly identified. If certainty is usually low but consistently above the threshold for the correct class, the model is technically performing adequately. However, in order to investigate whether that is the case, a large test set is necessary; this would help make sure that one was not just lucky with the first few tests. If such a test set is not available, as in our case, a high certainty level is reassuring.


4.3.2 Validity

This section analyses the result in terms of validity. In other words, the degree to which the experiment measures what it is supposed to measure. It takes a deeper look at the test data and how the labeling of the training data was performed.

Missing Test Classes

Due to some signs being more common in the test data, the average certainty is brought up significantly, even though more rare signs tended to have lower certainty. Furthermore, some signs were missing altogether from the test data due to lack of time for data collection.

Labeling Bias

We have discovered that the way in which the training images are labeled can have a significant effect on the performance of a model. It has become clear that choosing to separate the borders of signs from their contents caused some problems, as it puts some models at a disadvantage.

The detection rate of YOLOv3 came out quite low. However, this may be partly because of the way we chose to label the training data, as mentioned above. As illustrated by the example in Figure 4.3, YOLOv3 seems to have problems detecting objects that are very close together or that cover each other: it tends to detect either the border of a sign or the contents in it, and only rarely both. This interpretation is strengthened by our background research into YOLOv3.

In conclusion, the way the training data was labeled may have undermined the detection rate of some models. It is not hard to imagine that the detection rate of YOLOv3, for example, would have come out much higher if the training data had been labeled with that specific model in mind.


Figure 4.3 – YOLOv3 and Faster R-CNN side by side


Chapter 5

Discussion

In this chapter, our methods and results are discussed. Ways in which the project could have been conducted better are also brought up and explained.

5.1 Interpreting the result

This section interprets the results of the experiments from a technical perspective and discusses how usable the technology is.

5.1.1 Technical Interpretation

A high precision means that most of the detections the model made were correct, while a high recall means that most of the objects present in the images were actually detected. Faster R-CNN had the highest recall combined with a high precision, which implies that most of the objects available to be detected were detected, and detected correctly.

YOLOv3 had high precision but low recall, implying that it was able to detect objects in the correct locations but missed quite a few objects. It could probably have performed better if the labeling had been done with how the model works in mind; the results for YOLOv3 are therefore somewhat inconclusive.

SSD had high precision but a very low recall. This has similar implications as for YOLOv3, the difference being that SSD failed to detect even more objects. SSD arguably performed the worst in our experiment. However, it is reasonable to state that Faster R-CNN works well for this task.


5.1.2 Technology Usability

In order for the technology to be usable, very high precision and recall are required. The only model that performed that well was Faster R-CNN. With a bigger training set, even higher accuracy could likely be reached. However, speed is also an important factor: it takes Faster R-CNN about 3.5 seconds to process an image, which might be too slow for autonomous vehicles. For an Android application, however, 3.5 seconds is not that long.

5.2 Possible Improvements

This section discusses what we may change if we were to repeat the experiment.

5.2.1 Problems With Labeling Training Data

This section describes improvements that could be made in regards to the labeling of training data.

Potentially Avoid Sub-Dividing Objects

As described in Section 4.3.2, some models struggle with objects that are too close to each other or that overlap. Our findings suggest that this should be taken into account when choosing which model to use and how to label the data for that model.

Treating Text

We discovered that treating all text as the same class, and labeling text by block instead of always row by row, may have been a mistake, especially since having all text in one block sometimes meant also including a no parking sign. This led us to sometimes divide text into sub-blocks, which seems to have confused the models somewhat, as shown in Figure 5.1.

Furthermore, dividing different types of text into more classes would possibly make the next step, text interpretation, easier. Examples would be recurring letter combinations such as keywords like "avgift" or times like "(11-17)".


Figure 5.1 – Example of confused text detection


Chapter 6

Conclusions and Future work

This chapter states our conclusions, clarifies the limiting factors, and reflects on possible future work following this project.

6.1 Conclusions

In this thesis, we investigated the possibility of using image recognition based on AI techniques to develop a mobile phone application that can interpret parking sign rules. From our results we see that Faster R-CNN was the model that gave the best results at recognising parking signs and their elements, and we conclude that it is possible to develop an application that can give the rules that apply at a parking place. However, it might not be possible to implement all of it on a mobile phone; the neural network side would need to be implemented on a cloud solution or an online server, and the application itself could then only be used as a web application.

6.2 Limitations

This section describes what problems we had that limited the results.

6.2.1 The Absence of Text Interpretation

Since we did not test text interpretation, the question of how best to interpret parking signs is not entirely answered by this thesis. On the other hand, we were able to perform text detection, that is, recognising where in the image text exists; reading the text is another matter. Precision could also have been increased by including more instances of the classes, since some of the classes had a very low instance count.

6.2.2 Training Set Size

Normally, when training a neural network, the training set consists of thousands of images per class that the network is to detect. However, due to time limitations this was not possible here, since collecting and labeling thousands of parking sign images is an extremely time-consuming task. It may well be that the other models needed more images to train on in order to produce better results.

6.2.3 Lack of Frameworks for Mobile Phones

During our literature study phase, no frameworks could be found that supported all of the models on mobile phones.

6.3 Future work

Since the end goal for the project owner is a mobile app that interprets parking sign semantics, there are a few obvious next steps to build on the work done in this thesis.

6.3.1 Text Interpretation

To complete the functionality of the parking sign interpreter, the next step is interpreting the text. This could be done by cropping out the detected text regions of the image and feeding them through a tool like Tesseract, which is a neural-network-based engine for text recognition.
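A minimal sketch of this step using the pytesseract bindings; the bounding box coordinates are assumed to come from the text detector, and the Swedish language data is assumed to be installed.

    import cv2
    import pytesseract

    image = cv2.imread("parking_sign.jpg")

    # Bounding box of a detected text region (placeholder values from the detector).
    x1, y1, x2, y2 = 120, 310, 480, 370
    crop = image[y1:y2, x1:x2]

    # Tesseract reads the cropped region; lang="swe" selects the Swedish model.
    text = pytesseract.image_to_string(crop, lang="swe")
    print(text)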

6.3.2 Developing the Mobile App

If the models cannot be implemented on the mobile phones on today's market, then the application would likely be a client-server solution. Processing an image on a computer takes about 3.5 seconds, and it can be assumed to be even slower on mobile phones. A cloud-based solution might be faster, depending on the internet speed needed for uploading the images.
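One way such a client-server solution could be structured is sketched below: the phone uploads a photo and the server, which hosts the trained network, returns the detections. The endpoint name and the run_faster_rcnn function are hypothetical placeholders.

    import cv2
    import numpy as np
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def run_faster_rcnn(image):
        # Placeholder for running the trained detection network on the server.
        return [{"class": "Blue sign", "score": 0.98, "box": [34, 12, 210, 340]}]

    @app.route("/detect", methods=["POST"])
    def detect():
        # The mobile client sends the photo as a multipart upload named "image".
        raw = np.frombuffer(request.files["image"].read(), dtype=np.uint8)
        image = cv2.imdecode(raw, cv2.IMREAD_COLOR)
        return jsonify(run_faster_rcnn(image))

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)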


References

[1] X. Liu et al., "A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis," The Lancet, 2019. doi: https://doi.org/10.1016/S2589-7500(19)30123-2. [Online]. Available: https://www.thelancet.com/journals/landig/article/PIIS2589-7500(19)30123-2/fulltext

[2] M. Reynolds, "DeepMind's AI beats world's best Go player in latest face-off," https://www.newscientist.com/article/2132086-deepminds-ai-beats-worlds-best-go-player-in-latest-face-off/, 2017.

[3] A. Geitgey, "Machine learning is fun! The world's easiest introduction to machine learning," https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471, 2014.

[4] "Vina," http://www.vina.se/.

[5] D. Karunakaran, "Deep learning series 1: Intro to deep learning," https://medium.com/intro-to-artificial-intelligence/deep-learning-series-1-intro-to-deep-learning-abb1780ee20, 2018.

[6] A. Mehta, "A comprehensive guide to types of neural networks," https://www.digitalvidya.com/blog/types-of-neural-networks/, 2019.

[7] A. Honkela, "Multilayer perceptrons," https://users.ics.aalto.fi/ahonkela/dippa/node41.html, 2001.

[8] J. Brownlee, "Supervised and unsupervised machine learning algorithms," Mar. 2016. [Online]. Available: https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

[9] dewangNautiyal, "Underfitting and overfitting in machine learning," https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/, 2017.

[10] J. Brownlee, "A gentle introduction to transfer learning for deep learning," Dec. 2017. [Online]. Available: https://machinelearningmastery.com/transfer-learning-for-deep-learning/

[11] A. Sachan, "Zero to hero: Guide to object detection using deep learning: Faster R-CNN, YOLO, SSD," https://cv-tricks.com/object-detection/faster-r-cnn-yolo-ssd/.

[12] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv, 2016. [Online]. Available: https://arxiv.org/abs/1506.01497

[13] J. Brownlee, "A gentle introduction to pooling layers for convolutional neural networks," https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/, 2019.

[14] T. Karmarkar, "Region proposal network (RPN) — backbone of Faster R-CNN," https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/, 2018.

[15] M. Pugliese, "A very shallow overview of YOLO and Darknet," https://martinapugliese.github.io/recognise-objects-yolo/, 2018.

[16] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv, 2018. [Online]. Available: https://arxiv.org/abs/1804.02767

[17] E. Duraklı, "Training YOLOv3 object detection API with your own dataset," https://medium.com/@duraklefkan/training-yolov3-object-detection-api-with-your-own-dataset-4dcfc7c1c34c, 2019.

[18] R. Girshick, "Fast R-CNN," arXiv, 2015. [Online]. Available: https://arxiv.org/abs/1504.08083

[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," arXiv, 2016. [Online]. Available: https://arxiv.org/abs/1512.02325

[20] C. Szegedy, S. Reed, D. Erhan, D. Anguelov, and S. Ioffe, "Scalable, high-quality object detection," arXiv, 2014. [Online]. Available: https://arxiv.org/abs/1412.1441

[21] E. Forson, "Understanding SSD multibox — real-time object detection in deep learning," https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab, 2017.

[22] Google, "Tesseract," https://github.com/tesseract-ocr/tesseract, 2017.

[23] C. Asplund, "Object classification and localization using machine learning techniques," 2016. [Online]. Available: http://publications.lib.chalmers.se/records/fulltext/239156/239156.pdf

[24] Z.-Q. Zhao et al., "Object detection with deep learning: A review," arXiv, 2019. [Online]. Available: https://arxiv.org/pdf/1807.05511.pdf

[25] N. Andersson and A. Ekholm, "Vetenskaplighet - utvärdering av tre implementeringsprojekt inom IT Bygg & Fastighet 2002," Institutionen för Byggande och Arkitektur, Lunds Universitet, Tech. Rep., 2002. [Online]. Available: http://www.lth.se/fileadmin/projekteringsmetodik/research/Other_projects/utvarderingver1.pdf

[26] M. Bunge, Epistemology & Methodology I: Exploring the World, ser. Treatise on Basic Philosophy. Springer Netherlands, 1983. ISBN 978-90-277-1511-1. [Online]. Available: https://www.springer.com/gp/book/9789027715111

[27] "What is Agile Software Development?" [Online]. Available: https://www.visual-paradigm.com/scrum/what-is-agile-software-development/

[28] T. Lin, "LabelImg," https://pypi.org/project/labelImg/, 2019.

TRITA-EECS-EX-2020:233
