

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2019

A comparison of training algorithms when training a Convolutional Neural Network for classifying road signs

RASMUS BERGENDAL


A comparison of training algorithms when training a Convolutional Neural Network for classifying road signs

Rasmus Bergendal and Andreas Rohlén

Supervisor: Jana Tumová
Examiner: Örjan Ekeberg


Abstract

This thesis is a comparison between three different training algorithms when training a Convolutional Neural Network for classifying road signs. The algorithms that were compared were Gradient Descent, Adadelta, and Adam. For this study the German Traffic Sign Recognition Benchmark (GTSRB) was used, which is a scientifically relevant dataset containing around 50000 annotated images. A combination of supervised and offline learning was used and the top accuracy of each algorithm was registered. Adam achieved the highest accuracy, followed by Adadelta and then Gradient Descent. Improvements to the neural network were implemented in the form of more convolutional layers and more feature-recognizing filters. This improved the accuracy of the CNN trained with Adam by 0.76 percentage points.


Sammanfattning

A comparison of training algorithms when training a Convolutional Neural Network for classifying road signs

This degree project is a comparison of three different training algorithms when training a Convolutional Neural Network for classifying road signs. The algorithms that were compared were Gradient Descent, Adadelta and Adam. In this study the German Traffic Sign Recognition Benchmark (GTSRB) was used, a dataset used in scientific work containing around 50000 annotated images. A combination of supervised and offline learning was used, and the top result of each algorithm was saved. Adam achieved the highest result, followed by Adadelta and lastly Gradient Descent. The neural network was improved with the help of more convolutional layers and more feature-recognizing filters. This improved the accuracy of the network trained with Adam by 0.76 percentage points.


Contents

1 Introduction
  1.1 Problem statement
  1.2 Thesis overview

2 Background
  2.1 Artificial Neural Network (ANN)
  2.2 Convolutional Neural Networks (CNN)
  2.3 Training
    2.3.1 Gradient Descent
    2.3.2 Adadelta Optimizer
    2.3.3 Adam Optimizer
  2.4 Related Work

3 Methods
  3.1 Dataset GTSRB
  3.2 CNN
    3.2.1 Framework
    3.2.2 Neural network design
    3.2.3 Learning rates
    3.2.4 Testing method
  3.3 Performance definition
  3.4 Improvement to CNN

4 Result
  4.1 Gradient Descent Optimizer
  4.2 Adadelta Optimizer
  4.3 Adam Optimizer
  4.4 Improved CNN with Adam Optimizer

5 Discussion
  5.1 Result analysis
  5.2 Improvements

6 Conclusion


1 Introduction

Autonomous vehicles are starting to become a part of the modern transportation system. Everything from autonomous trucks for long-distance deliveries to personal vehicles on busy city streets are driving on our roads today. Autonomous vehicles are dependent on safe and reliable ways to gather information about the traffic situation and the traffic rules. Recognizing and classifying road signs correctly is crucial for traffic safety. In a transition period where both human-driven vehicles and autonomous vehicles will be occupying the streets, the road signs will have to be comprehensible for both. Today, road signs are designed to fit the human visual system, which efficiently recognizes objects within cluttered scenes. Hierarchical neural networks roughly mimic the human visual system, and are among the most promising architectures for tasks like this [4].

The most common method for recognizing road signs today is machine learning: a neural network is trained to correctly recognize and classify images that it gathers from its surroundings. There are a number of different types of neural networks, but the one that is most fitting for image classification is the convolutional neural network (CNN) [9]. There are not only many types of neural networks, but also many different training algorithms to choose from when training a neural network.

1.1 Problem statement

The aim of this thesis is to compare different training algorithms for training a CNN to classify images of road signs, to see which one results in the best performance. This thesis also aims to investigate whether it is possible to make adjustments to the structure of the neural network in order to improve performance in the specific domain of classifying road signs.

1.2 Thesis overview

In Section 2, concepts such as neural networks, the training of a neural network, and different training algorithms are explained. In Section 3, the methods that were used in this thesis to achieve the results are described. Section 4 consists of the results that were achieved during the testing. In Section 5, the results and why they turned out as they did are discussed. Section 6 contains the final conclusion to the problem statement.


2 Background

This section introduces the basic structure of neural networks and how they work. Different types of neural network architectures are presented as well as different training processes.

2.1 Artificial Neural Network (ANN)

ANNs are heavily inspired by the complex functionality of human brains. The human brain processes information in parallel by firing signals between billions of interconnected neurons, and ANNs work in a similar way.

An ANN consists of several layers of neurons. There is an input layer, which receives the information that is given to the neural net, one or several hidden layers, and finally an output layer. The hidden layers consist of several neurons per layer, and the output layer can consist of one or more neurons depending on what problem one aims to solve. For a simple problem where the output is either "True" or "False", the output layer consists of only one neuron. For a more complex problem like image classification, more neurons are needed. All of these neurons are connected to all the neurons in the adjacent layers. Each connection between two neurons is associated with a numeric value called a weight. The weight acts as a variable in the output function of the neuron. The output $h_i$ of neuron $i$ in the hidden layer is

$$h_i = \sigma\!\left(\sum_{j=1}^{N} V_{ij} x_j + T_i^{\mathrm{hid}}\right),$$

where $\sigma(\cdot)$ is called the activation function, $N$ is the number of input neurons, $V_{ij}$ are the weights, $x_j$ are the inputs to the input neurons, and $T_i^{\mathrm{hid}}$ are the threshold terms of the hidden neurons [20]. To avoid divergent neurons, the activation function is used to bound the value of the neuron. The sigmoid function, defined as $\sigma(u) = \frac{1}{1 + e^{-u}}$, is a common example of an activation function.
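
As an illustration, this computation can be sketched in a few lines of Python; the array shapes and the NumPy-based helper below are illustrative choices and not part of the thesis implementation.

```python
import numpy as np

def sigmoid(u):
    # Bounded activation function: sigma(u) = 1 / (1 + e^(-u))
    return 1.0 / (1.0 + np.exp(-u))

def hidden_layer_output(x, V, T_hid):
    """Compute h_i = sigma(sum_j V_ij * x_j + T_i^hid) for all hidden neurons.

    x     : inputs to the input neurons, shape (N,)
    V     : weights, shape (num_hidden, N)
    T_hid : threshold terms of the hidden neurons, shape (num_hidden,)
    """
    return sigmoid(V @ x + T_hid)

# Small illustrative example: 4 input neurons, 3 hidden neurons
rng = np.random.default_rng(0)
x = rng.normal(size=4)
V = rng.normal(size=(3, 4))
T_hid = rng.normal(size=3)
print(hidden_layer_output(x, V, T_hid))  # three values bounded to (0, 1)
```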


Below is a visual representation of the general structure of a neural network.

Figure 1: Structure of a simple neural network [4]

There are many different types of neural networks, and they are all designed to solve different types of problems. Some are simpler than others, while some are incredibly complicated.

2.2 Convolutional Neural Networks (CNN)

Image recognition is the task of taking an input image and outputting a class that best describes the image. Today, convolutional neural networks are the most common approach to problems related to visual perception, and they seem to outperform all other techniques. A convolutional neural network is characterized by a special architecture composed of alternating convolutional and pooling layers, optionally followed by fully connected layers. These have proven successful at extracting and combining local features from an input image [12]. In an architecture where all hidden layers are fully connected to the input layers, the number of parameters to learn quickly rises to a level where training is very computationally expensive. Convolutional neural networks avoid this issue by restricting the hidden units to only be connected to a subset of the input units. Each hidden unit is connected to a contiguous region of pixels in the input, much like how different neurons in the visual cortex are assigned to specific receptive fields in the retina [1].

The input for a convolutional layer is a 3D matrix (m · m · r), where m is the height and width of the image and r is the number of channels; in an RGB image the number of channels would be three (one for each color).

Figure 2: A convolutional network architecture with 2 feature stages [2]

The convolutional layer has a number of trainable filters of dimensions n · n · r, where n < m. These filters are specialized at recognizing specific features of the input. Since these features may exist at several regions in the input, all filters are convolved with the entirety of the input so that all occurrences of the specific feature are registered [1][11].

The feature maps then go through subsampling in the pooling layers. The subsampling is done by aggregating the activation value of a particular feature in different regions of the input, thus transforming the feature map into a lower-resolution version. This can be done by taking the maximum (max pooling) or the mean (mean pooling) feature activation value in a limited region. This improves computational speed and can also help avoid over-fitting [1]. Over-fitting occurs when the network corresponds too closely to the training data, causing it to fail when classifying data it has not been trained on.
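
As a concrete sketch of the subsampling step, the snippet below applies 2x2 max pooling to a small feature map using NumPy; the input values and the helper name are made up for this example.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by taking the maximum over non-overlapping 2x2 regions."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 1]])
print(max_pool_2x2(fm))
# [[4 2]
#  [2 7]]
```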

2.3 Training

One of the most prominent features of a neural network is its ability to learn from the presentation of patterns. The training process consists of taking a set of steps in order to tune the weights and thresholds of the network's neurons. After having been trained, the network has learned the relationship between inputs and outputs, and can therefore produce output close to the desired output for any given input [5].

Like humans, neural networks learn by example. A neural network is trained using a set of input data called a training set. The goal of the training is to minimize a so-called error function. The error function outputs a value based on the difference between the output of the neural network and the desired output. A common error function is the sum of the squared differences between the neural network's output and the desired output [20].
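
Written out for concreteness (the notation here is chosen for this summary and is not taken from the thesis), such a sum-of-squared-differences error for one training sample can be expressed as

$$E = \frac{1}{2} \sum_{k=1}^{K} (o_k - d_k)^2,$$

where $o_k$ is the network's output at output neuron $k$, $d_k$ is the corresponding desired output, and $K$ is the number of output neurons; the factor $\frac{1}{2}$ is a common convention that simplifies the gradient.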


To verify that the neural network works as intended, a validation set can be used. The validation set is usually 10-40% of the size of the training set. Like the training set, the validation set contains input data. The difference is that when using the validation set, the weights associated with the connections between the neurons are not changed, as the set is only used for observing the accuracy. In this study, supervised learning in conjunction with offline learning is used. The supervised learning method works by having a desired output for each given set of input signals. Training a neural network with supervised learning therefore requires a table with inputs and expected outputs, also called an attribute/value table [5]. The network is considered trained when the difference between the produced outputs and the desired outputs is within an acceptable range.

In offline learning, the weights and thresholds are adjusted after the network has been trained on a subset of the training set. The process takes into account the mean difference of the produced output and the desired output for all the samples in the training set and takes adjustment steps accordingly. Therefore, the whole set of training samples must be available during the whole training process [5].

The steps taken to update the weights and thresholds are decided by the training algorithm.

2.3.1 Gradient Descent

Gradient descent is an iterative optimization algorithm for finding a minimum of a function. In the context of training a neural network, the function to be minimized is the error function. Gradient descent iteratively moves from an initial set of parameters to a set of parameters that minimizes the function. A local minimum is found by taking steps proportional to the negative of the gradient at the current point. The algorithm is generic and easy to implement, but may end up in a local minimum instead of the global minimum [13].

Gradient descent requires a learning rate hyper-parameter to be chosen. The learning rate decides the rate at which the parameters are adjusted. Setting this rate too high may result in the system diverging from the objective, while a low learning rate results in slow learning [21].
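
A minimal sketch of a single gradient descent update, assuming the gradient of the error function has already been computed; the names and the default learning rate of 0.01 (the value later chosen in Section 3.2.3) are only illustrative.

```python
def gradient_descent_step(w, grad, learning_rate=0.01):
    """Take one step proportional to the negative gradient of the error function.

    w             : current parameters, e.g. a NumPy array of weights
    grad          : gradient of the error function evaluated at w
    learning_rate : step-size hyper-parameter
    """
    return w - learning_rate * grad
```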

2.3.2 Adadelta Optimizer

Adadelta is derived from a method called AdaGrad. In contrast to Gradient Descent, AdaGrad updates the learning rate continuously with regard to all previous gradients on a per-dimension basis. The structure of AdaGrad makes it sensitive to the initial parameters, and since the accumulated gradients constitute the denominator in the update function it also has the deficiency of a constantly decreasing learning rate, which eventually results in zero progress [21]. Instead of accumulating gradients throughout the whole learning process, Adadelta accumulates over a finite window of recent gradients. By only using recent gradients, the denominator does not accumulate to infinity but instead becomes a local estimate. This ensures that the network always makes progress, even after many iterations of updates.
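
A minimal per-parameter sketch of the Adadelta update, following the method described by Zeiler [21]; the decay rate rho and the small constant eps below are typical default values and are not taken from the thesis.

```python
import numpy as np

def adadelta_step(w, grad, acc_grad, acc_update, rho=0.95, eps=1e-6):
    """One Adadelta update: squared gradients and squared updates are accumulated
    as exponentially decaying averages over a window of recent steps, so the
    denominator becomes a local estimate instead of growing without bound."""
    acc_grad = rho * acc_grad + (1 - rho) * grad ** 2            # E[g^2]
    update = -np.sqrt(acc_update + eps) / np.sqrt(acc_grad + eps) * grad
    acc_update = rho * acc_update + (1 - rho) * update ** 2      # E[dx^2]
    return w + update, acc_grad, acc_update
```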

2.3.3 Adam Optimizer

The Adam Optimizer is a combination of two other popular training algorithms, namely AdaGrad and RMSProp. RMSProp is another AdaGrad-inspired adaptive learning rate method, developed to resolve the decreasing learning rate issue [15]. Some advantages of using Adam are that it works well with sparse gradients that would be inadequate for other algorithms, and that it naturally tunes its parameters, thus performing a form of step-size annealing [8].

The algorithm works, as do all algorithms derived from gradient descent, by taking small steps updating the weights of the neural network in an effort to minimize the cost function, using the gradient as an indicator of how the weights should be updated. What makes Adam different from gradient descent is that it takes more parameters into account when calculating these steps. In gradient descent there is a constant learning rate, but in Adam the learning rate is changed with respect to both the first (the mean) and second (the uncentered variance) moments of the gradients. There are also the variables β₁ and β₂, which control the exponential decay of the first and second moments respectively [8].
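
A minimal sketch of the Adam update as described by Kingma and Ba [8]; the default values β₁ = 0.9, β₂ = 0.999 and eps = 1e-8 come from that paper, while the learning rate of 0.001 matches the value later chosen in Section 3.2.3.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: the step is scaled by running estimates of the first
    moment (mean) and second moment (uncentered variance) of the gradients."""
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t = step number, starting at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```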

2.4 Related Work

Deep learning is the most prominent method in road sign recognition, and these methods have been proven to be very accurate in road sign classification. Stallkamp et al. [17][18] have presented studies where CNN-based methods achieved higher classification rates than the average human. An accuracy of 98.98% was achieved on the German Traffic Sign Recognition Benchmark (GTSRB), and superior implementations of CNNs have been achieved since. Jin et al. [7] suggest a hinge-loss stochastic gradient descent method for training the CNN classifier and achieved 99.65% on the GTSRB. Haloi [6] further improved the accuracy on the GTSRB to 99.81% by using spatial transformer layers and a modified version of the inception module. However, it should be stated that even though many of these approaches achieve an above-human result in experiments like these, humans still tend to outperform machine learning algorithms in real-time traffic situations.


3 Methods

A convolutional neural network was created and trained on the GTSRB data set. The optimizers used for training were the following: Gradient Descent, Adadelta, and Adam. GTSRB consists of a training part, which was used to train the network, and a verification part, which was used to verify the accuracy of the network. The three optimizers were compared with regard to the percentage of correctly classified images. The best-performing optimizer was then studied more closely, in the hope of achieving even better results.

3.1 Dataset GTSRB

In the last ten years, a number of public and challenging data sets for traffic sign classification have been introduced. In particular, GTSRB has attracted scholars to find new methods for road sign recognition [16].

The GTSRB dataset contains 51839 annotated images, all acquired during daytime in Germany. The training set contains 39209 images and the remaining 12630 images constitute the test set. The dataset is limited in the sense that it does not contain images taken during challenging weather conditions or at times when it is dark outside [17]. Some claim that this makes GTSRB insufficient for testing the reliability of a recognition algorithm [19]. However, as this study is a comparison between different training algorithms, and not the development of an actual application, the dataset is deemed suitable.

3.2 CNN

3.2.1 Framework

To create and train the CNN, TensorFlow and some of its many built-in functions were used. TensorFlow is a widely used machine learning framework designed to work at large scale and in different environments [14].

The architecture of TensorFlow enables a flexible environment for development and gives the developer the opportunity to experiment with new optimizers and training algorithms. TensorFlow uses dataflow graphs to represent computation, and when run on a single machine it maps the nodes of a dataflow graph across many computational devices, including multicore CPUs and general-purpose GPUs [3].

3.2.2 Neural network design

The network that was used in our training and testing is a CNN that consists of three convolution layers, three max pooling layers, and one dense layer. The first convolution layer consists of 32 3x3 filters, the second one of 64 3x3 filters, and the third one of 128 3x3 filters. Each max pooling layer is of size 2x2. The dense layer has 128 nodes. The output layer consists of 43 nodes, the same number as the number of classes in the dataset. These choices are inspired by a CNN created by Aditya Sharma that was designed for classifying clothing in the Fashion-MNIST dataset. Inspiration was drawn from that specific CNN because of the similar resolution of the images in the two datasets.
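
A minimal sketch of a network with this layout, written with the tf.keras API; the 32x32x3 input shape follows the image size mentioned in Section 5.2, while the ReLU activations, padding, and softmax output are standard choices assumed for the example rather than details stated in the thesis.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_baseline_cnn(num_classes=43):
    """Three convolution layers (32, 64 and 128 3x3 filters), each followed by a
    2x2 max pooling layer, then a 128-node dense layer and a 43-node output layer."""
    return models.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```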

3.2.3 Learning rates

Different learning rates are needed for each training algorithm, as they do not work in the same way. The learning rates that were chosen are based on testing: for each training algorithm, different learning rates were tested with smaller and smaller intervals until a satisfying result was reached. The learning rate for Gradient Descent was chosen to be 0.01, the learning rate for Adam was chosen to be 0.001, and the learning rate for Adadelta was chosen to be 10.
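
For reference, the three optimizers with the learning rates above could be instantiated roughly as follows with tf.keras; the exact TensorFlow version and optimizer API used in the thesis are not stated, so this is only a sketch.

```python
import tensorflow as tf

# Learning rates chosen through testing (see above)
optimizers = {
    "gradient_descent": tf.keras.optimizers.SGD(learning_rate=0.01),
    "adam": tf.keras.optimizers.Adam(learning_rate=0.001),
    "adadelta": tf.keras.optimizers.Adadelta(learning_rate=10.0),
}
```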

3.2.4 Testing method

The network was trained on the roughly 40000 pictures in the training set in batches of 128. The training optimizer was run after each batch. After training the network on the whole training set, the network was tested on the 10000 testing pictures in the test set. The number of training iterations, also known as epochs, for each training algorithm was 200, meaning that for each training algorithm, the network was trained on the complete training set 200 times. This gave the network enough iterations to reach the top accuracy without making the training time unreasonably long. The results were then saved and plotted onto a graph.
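
Put together, the procedure could look roughly like the sketch below, reusing the hypothetical build_baseline_cnn and optimizers from the earlier sketches; the compile/fit calls, the loss function, and the x_train/y_train and x_test/y_test placeholders for the GTSRB training and test data are assumptions based on standard tf.keras usage.

```python
# Hypothetical run for one optimizer; repeated for Gradient Descent, Adadelta and Adam
model = build_baseline_cnn()
model.compile(optimizer=optimizers["adam"],
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 200 epochs over the full training set, weights updated after each batch of 128 images
history = model.fit(x_train, y_train, batch_size=128, epochs=200,
                    validation_data=(x_test, y_test))

# The top testing accuracy over all epochs is the value that was registered
top_accuracy = max(history.history["val_accuracy"])
```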

3.3 Performance definition

The performance of the CNN was defined as the number of correct classifications divided by the total number of classifications. A higher value means better performance.

3.4 Improvement to CNN

Road signs have prominent features that can be utilized to make the CNN specialized for this specific task. The features that were focused on are the following:

• Distinct colors (white, black and red)

• Simple shapes (circles, squares and triangles)

• Sharp edges

In order to capture these features, certain changes to the structure of the CNN were made. These changes were made gradually, testing whether each change yielded any improvement. If the result was indeed improved, the changes were iterated upon until there no longer were any improvements to the result.

The main changes that were made to the CNN concerned the number of convolutional layers and the number of filters in each layer. The number of convolutional layers was increased from 3 to 7, and the number of filters in each layer was doubled. These changes were chosen because the convolutional layers are the layers that detect features in pictures. The hypothesis was that increasing the number of these layers would therefore make the network better at detecting the distinct features of road signs. Because the filters in each convolutional layer are responsible for picking out the features in the pictures, another hypothesis was that increasing the number of filters in each layer would help in classifying the pictures.
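
One plausible layout of the improved network, based on the description here and in Section 5.2 (an extra convolutional layer in front of each max pooling layer, a seventh layer at the end, and all filter counts doubled); the exact position of the seventh layer and the activation functions are interpretations rather than details stated in the thesis.

```python
from tensorflow.keras import layers, models

def build_improved_cnn(num_classes=43):
    """Seven convolutional layers with doubled filter counts; one interpretation
    of the improved structure described in this section."""
    return models.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```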


4 Result

4.1 Gradient Descent Optimizer

The Gradient Descent Optimizer achieved a top testing accuracy of 0.894375. As shown in Figure 3, the neural network improved drastically during the first 25 iterations. After that, it stabilized and did not improve further. The training accuracy followed this pattern as well, and reached an accuracy of 1.0 after the first 25 iterations.

Figure 3: Results for Gradient Descent over 200 iterations. Top testing result: 89.44%


4.2 Adadelta Optimizer

The Adadelta Optimizer achieved a top testing accuracy of 0.964375. As shown in Figure 4, a drastic improvement was made during the first iterations. The testing accuracy later stagnated and only smaller changes occurred. The final testing accuracy was higher than for the network trained with the Gradient Descent Optimizer. The training accuracy quickly reached a value of 1.0, but was more unstable than when trained with Gradient Descent.


4.3 Adam Optimizer

The Adam Optimizer achieved a top testing accuracy of 0.969875. In contrast to the Gradient Descent Optimizer, the Adam Optimizer did not reach the same level of stabilization after the first couple of iterations. After about 20 iterations, it reached accuracy levels between 0.92 and 0.96. The same is true for the training accuracy, which did not stabilize as much as with Gradient Descent but reached the same maximum value of 1.0. Even though the training accuracy was worse than with both Gradient Descent and Adadelta, Adam achieved the highest testing accuracy of them all.


4.4 Improved CNN with Adam Optimizer

The improved CNN achieved a top testing accuracy of 0.97750. The CNN reached its highest accuracy fairly quickly, after just about 40 iterations. It also reached a quite high value of 0.95 after just 5 iterations. After that it fluctuated between the values 0.93 and 0.97, only reaching accuracy levels higher than 0.97 a few times. The accuracy never stabilized during the entire training session but still achieved a significantly higher result than the non-improved CNN with Adam.

Figure 6: Results for the improved CNN with Adam over 200 iterations. Top testing result: 97.75%


5 Discussion

5.1 Result analysis

The results show a pronounced difference in performance among the three algorithms. As expected, the Adam Optimizer had the highest percentage of accurate classifications while Gradient Descent performed the worst. Both Adadelta and Adam are based on gradient descent but with further improvements implemented, so their greater performance comes as no surprise.

The learning curves of Adadelta (Fig. 4) and Adam (Fig. 5) closely resemble each other, while the curve of Gradient Descent (Fig. 3) has different properties. The testing accuracy of Gradient Descent remains stable after the first few training iterations, while the same section in the other graphs shows some fluctuations. This can be a result of overfitting with the more advanced training algorithms. Gradient Descent uses a constant step size throughout the whole training process, which results in a smoother and more stable learning curve. The step sizes in Adadelta and Adam are calculated, and will therefore vary from their initial values. It is possible that the adaptiveness of Adam and Adadelta makes them worse at generalizing compared to the simpler Gradient Descent. The step-size tuning can result in adaptation to noise or irrelevant features, which in turn results in overfitting on the training data.

The differences between the algorithms can be seen even more clearly when looking at the training accuracy. The training accuracy of Gradient Descent stays at 100% after about 20 iterations, while the training accuracy of both Adadelta and Adam fluctuates. The weights of the neural network are adjusted after each batch, and since these are adaptive methods the network can get overfitted on a particular batch, resulting in a lower accuracy for the next batch. This has a negative impact on the mean accuracy of a single training iteration, hence the recurring dips in the graph. In future studies, this could be partly compensated for by using larger batch sizes, thus reducing the noise.

However, the final testing accuracy was greater for the two more advanced training algorithms. Since Gradient Descent has a static learning rate, it has difficulties navigating areas that are much steeper in one dimension than in another, which is common around optima. This can be visualized as a ball navigating a ravine down to a local minimum: if the step size is constant, the ball will overshoot and oscillate across the steep slopes of the ravine, while only making hesitant progress towards the minimum. The adaptive methods, Adadelta and Adam, take the earlier gradients into account and can therefore build up momentum. The momentum accelerates the descent in the relevant dimension, whilst dampening the oscillation. This enables faster convergence and also helps the methods avoid converging towards sub-optimal local minima.


5.2 Improvements

At first, the intention was to adjust the training algorithm itself in order to make it better suited to the task of classifying road signs. However, this proved to be very difficult to implement. A more feasible and possibly more effective approach was to make adjustments to the neural network itself.

The initial idea was to change the structure of the CNN completely by changing the filter size, the number of filters, the number of convolutional layers, and the number of dense layers all at once to see if the accuracy could be improved. This proved to be ineffective because it was impossible to know which changes were responsible for the changes in accuracy. Since the CNN already performed fairly well, a decision was made to make small changes instead and iterate on those changes.

The first idea was that the pictures that the network was trained on, being only 32x32 in size and having very distinct features, were very simple and did not need many layers to be distinguished from one another. This proved to be false, as decreasing the number of filters resulted in a significant decrease in accuracy.

After trying the simpler approach, the next idea was to increase the depth of the network by adding more layers. The first change that was made was to add one convolutional layer after the third convolutional layer, with the same number of filters as the third layer. This became the first structure to achieve a top testing accuracy of over 97%. In order to iterate on this approach, the next step was to add another convolutional layer, this time after the second layer, also with the same number of filters as the second layer. This structure saw a very small improvement in accuracy, but it was an improvement nonetheless. Following the successful steps that were previously taken, another convolutional layer was added after the first layer, with the same number of filters as the first layer. This is when bigger improvements to the accuracy were made, reaching a top testing accuracy of 97.24%.

To try something new, more filters were added to each convolutional layer; more specifically, the number of filters in each layer was doubled. The idea behind this was that if more convolutional layers, which in turn results in a higher total number of filters, gives a higher accuracy, then simply adding more filters to each layer might have the same effect. This turned out to be the case, and a top accuracy of 97.4% was achieved.

Because the first idea, adding more layers, worked well, a network without the doubled number of filters but with another convolutional layer after the sixth layer was tested. It seemed that more layers had a greater effect than adding more filters, as this layer pushed the accuracy to 97.47%. To combine the two best-performing networks, the number of filters in this network was then doubled. This resulted in a top testing accuracy of 97.75%. Networks with even more convolutional layers and more filters were tested, but they did not result in improved accuracy.

As the accuracy increased, the inconsistency in the accuracy of the networks increased as well. As can be seen in the graph of the improved CNN, the accuracy fluctuates even more than in the previous test with the basic CNN structure with the Adam Optimizer. As was discussed in previous sections, this is most likely a result of overfitting: as the number of parameters in the network increases, the likelihood that the network gets overfitted increases [10]. The top accuracy was achieved fairly early in the training phase, at around iteration 40. After that the accuracy varied wildly, going as low as 90%. This overfitting is a problem that would need to be dealt with if the network were to be implemented in a real-world application, but as that was not the aim of this thesis, only the top accuracy was considered.


6 Conclusion

The results show that there is a significant difference in performance between Gradient Descent, Adadelta, and Adam when used for classifying road signs. As expected, the algorithm that gave the most accurate CNN was Adam (96.99%), followed by Adadelta (96.44%) and Gradient Descent (89.44%).

Changes in the form of more convolutional layers in front of each max pooling layer and more feature-recognizing filters were implemented in the hope of increasing the accuracy of the CNN trained with the Adam Optimizer. This proved to be successful, and the performance was increased to 97.75%.

In future studies, it would be interesting to experiment with different batch sizes and also to re-create the subsets between each training iteration to see if this makes the CNN more generalized and less prone to overfitting.


References

[1] Convolutional neural network. http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork. Accessed: 2019-04-02.

[2] Convolutional neural networks (lenet). http://deeplearning.net/tutorial/lenet.html. Accessed: 2019-05-14.

[3] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, Savannah, GA, 2016. USENIX Association.

[4] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

[5] I. da Silva, D. Spatti, R. Flauzino, L. Liboni, and S. dos Reis Alves. Artificial Neural Networks: A Practical Course. Springer International Publishing, 2016.

[6] M. Haloi. Traffic sign classification using deep inception based convolutional networks. arXiv preprint arXiv:1511.02992, 2015.

[7] J. Jin, K. Fu, and C. Zhang. Traffic sign recognition with hinge loss trained convolutional neural networks. IEEE Transactions on Intelligent Transportation Systems, 15(5):1991–2000, 2014.

[8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[10] S. Lawrence, C. L. Giles, and A. C. Tsoi. Lessons in neural network training: Overfitting may be harder than expected. In AAAI/IAAI, pages 540–545. Citeseer, 1997.

[11] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 253–256. IEEE, 2010.

[12] Y. Lu. Food image recognition by using convolutional neural networks (cnns). arXiv preprint arXiv:1612.00983, 2016.


[13] D. Paper. Data Science Fundamentals for Python and MongoDB. Apress, 2018.

[14] L. Rampasek and A. Goldenberg. Tensorflow: Biology’s gateway to deep learning? Cell systems, 2(1):12–14, 2016.

[15] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

[16] F. Shao, X. Wang, F. Meng, T. Rui, D. Wang, and J. Tang. Real-time traffic sign detection and recognition method based on simplified gabor wavelets and cnns. Sensors, 18(10):3192, 2018.

[17] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The german traffic sign recognition benchmark: A multi-class classification competition. In IJCNN, volume 6, page 7, 2011.

[18] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012.

[19] D. Temel, G. Kwon, M. Prabhushankar, and G. AlRegib. Cure-tsr: Challenging unreal and real environments for traffic sign recognition. arXiv preprint arXiv:1712.02463, 2017.

[20] S. Wang. Interdisciplinary Computing in Java Programming. The Springer International Series in Engineering and Computer Science. Springer US, 2003.

[21] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

