DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Comparing performance of convolutional neural network models on a novel car classification task

Swedish title: Jämförelse av djupa neurala nätverksmodeller med faltning på en ny bilklassificeringsuppgift

AMUND HANSEN VEDAL

KTH
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Comparing performance of convolutional neural network models on a novel car classification task

Amund Hansen Vedal

Department of Media Technology and Interaction Design
KTH Royal Institute of Technology
SE-100 44 Stockholm, Sweden
amund@kth.se

Abstract

Recent advances in neural networks have led to models that can be used for a variety of image classification tasks, useful for many of today’s media technology applications.

In this paper, I train hallmark neural network architectures on a newly collected vehicle image dataset to do both coarse- and fine-grained classification of vehicle type. The results show that the neural networks can learn to distinguish both between many very different classes and between a few very similar ones, reaching 50.8% accuracy on 28 classes and 61.5% on the 5 most challenging, despite noisy images and labels in the dataset.

Figure 1: Sample images from 24 of the classes of the PlatesMania dataset used for most experiments. The four classes "Other", "Kei car", "Caddy" and "Pickup with box" are not included, since these were dropped after initial experiments.

1 Introduction

Automating image classification and clustering tasks is a long-standing problem of computer vision. A classical problem where such systems can be applied is traffic surveillance, where the user may want to detect a vehicle in an image (object detection) or recognize the make or model of a detected vehicle (image classification). An early vehicle detection system of this kind using neural networks was proposed more than two decades ago [4].

In recent years, there have been major improvements in using neural networks for image classification tasks. Using convolutional layers, Krizhevsky et al. [18] presented a network architecture that caused a jump in accuracy compared to earlier models, inspiring several ground-breaking new model types [22, 25, 12].

From a media technology perspective, we can already see the impact of image classification techniques on our daily lives. The face recognition features of Facebook [27], geolocation of images [28], and transferring style from one image to another [8] have all become part of our daily lives. These techniques are also used in less playful contexts, such as diagnosing brain tumors from MRI images [11].

Training deep neural networks for image classification requires a dataset of labeled images to train on, using supervised learning. In this paper, I first present a method of collecting, labeling and processing images, which we used to create our new PlatesMania car image dataset, containing more than 200,000 labeled images of cars in different environments (Section 3.1). I then describe image classification experiments conducted on this dataset using three recent neural network architectures, VGG by Simonyan and Zisserman [22], GoogLeNet by Szegedy et al. [25], and ResNet by He et al. [12] (Section 4.1), and use the results to discuss differences between them (Section 8).

My results show that pre-designed neural networks can be trained to classify images with high confidence, even when the dataset surely contains label errors, and without clearing away noise from the images with object detection methods. The best-performing network reaches a 50.8% Top-1 accuracy on all 28 classes of our new PlatesMania dataset, and a 61.5% Top-1 score on a more fine-grained task of separating only the five most similar classes. I also show that the choice of neural network architecture can affect the classification differently, such as giving very different Top-5 scores even when the Top-1 accuracies of the networks are similar. To strengthen this argument, I analyze the performance of the networks using common methods such as confusion matrices and testing classification accuracy on previously unseen test data.

Contributions

This work contributes an exploration of the whole process of collecting a dataset of pictures, training neural networks for vehicle classification, and finally attempting to optimize their performance on the dataset:

• I explain how we collected useful data efficiently and prepared a dataset from it (Section 3.1), get baseline results from our neural network models (Experiment 1), and introduce the programming library and hardware used to construct and train them.

• I compare results after removing classes (Experiment 4), and from varying the speed of learning rate decay (Experiment 2).

• I compare results for datasets of variable size, and attempt to substitute data size with data augmentation.

• I attempt fine-grained classification on the 5 most similar classes of our dataset.

• I make deeper models with the same structure compete side by side on the same material (Experiment 6).

• I discuss the meaning, or lack thereof, of comparing results between these different experiments.

Note: I write "we" when referring to collaborative work with Sina Ghassemi that I used for my experiments. I use "I" for choices based on my own hypotheses.

2 Related work

For vehicle classification in particular, there are (to my knowledge) only a few labeled datasets of high quality [29, 16]. These datasets are quite clean (free from noise), as irrelevant image content was usually cropped out manually under human supervision. They also contain bounding box coordinates, should a cropped version be desired.

Vehicle classification using neural networks has already been attempted by, for example, Yang et al. [29], with very promising results.

There have also been attempts by Krause et al. [16] to convert images into 3D representations that can be classified more easily, and some student work attempting transfer learning on the same dataset [19].


3 Dataset

3.1 Collecting and labeling images

In order to focus on training our networks, we searched for sources of labeled car images to build a large dataset quickly. We wanted to build our own dataset rather than using a prepared one such as CompCars [29], but we imagined using Google Image searches would be extremely tedious and time-consuming.

As such, our goal was to create a database of at least 1,000 labeled pictures per class – the same as the basis of the ImageNet dataset described in Krizhevsky et al. [18]. To accomplish this, we first downloaded one picture for each of 7,217 car models from PlatesMania, a free database of vehicle pictures labeled with car model [1]. We then hand-sorted them into 28 distinct classes (see Figure 1): ConcreteMixer, Truck, Pickup, Van, Minibus, Campervan, Bus, Doubledecker, Veteran, Jeep, Limousine, SUV, Hatchback, Sedan, Sports, Station Wagon, Compact, American, Classic, Military, Firetruck, Ambulance, Motorcycle, Crane, Other, Kei Car, Caddy and Pickup with box. Some classes were chosen based on similar studies [29, 16], others were inspired by internet sources [2], and the remaining were extracted ad hoc to minimize the "Other" class (such as ConcreteMixer, Military and Crane). Throughout the course of our experiments, we continued downloading 296,000+ images of the vehicle models we had already classified, automatically mapping them to our classes. The result is the largest vehicle dataset we know of, containing over 207,000 unique images. The distribution of images over the classes is shown in Figure 2.
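Since every image downloaded from PlatesMania arrives tagged with its car model, the hand-sorting only had to happen once: later downloads of a known model inherit its class automatically. A minimal sketch of that mapping step in Python; the names (MODEL_TO_CLASS, label_image) and the example entries are hypothetical, not taken from the actual dataset:

```python
# Hypothetical sketch: hand-sorting one image per car model yields a lookup
# table, and every later download of a known model inherits its class.
MODEL_TO_CLASS = {
    "Volvo V70": "Station Wagon",
    "Ford Transit": "Van",
    # ... one entry for each of the 7,217 hand-sorted car models
}

def label_image(car_model: str) -> str:
    """Map a PlatesMania model tag to one of the 28 dataset classes."""
    return MODEL_TO_CLASS.get(car_model, "Other")
```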

[Figure 2 chart omitted: legend lists Sedan, Hatchback, SUV, Truck, Bus, Stationwagon, Sports, and a group of classes with < 15,000 images]

Figure 2: Distribution of images in each class of our PlatesMania dataset

3.2 About the images

The images are taken in all sorts of different environments, not always encompassing the whole vehicle, and usually from the perspective of a regular bystander. Most have a reasonable resolution (around 1200x900 pixels), and contain various objects from their surroundings, such as other vehicles, people, buildings, trees or traffic signs (see Figure 3). Sometimes watermarks are also present (Figure 4). Together, these elements form noise characteristic of our dataset, later referred to as the "PlatesMania dataset", which is part of what the network has to learn to recognize as irrelevant.

Figure 3: Challenging samples from our PlatesMania dataset: (a) Ambulance, (b) Compact, (c) SUV, (d) Truck

Figure 4: Examples of watermarked images: (a) Ambulance, (b) Compact

4 Method

4.1 Neural network models

For my experiments, I chose three well-known network models: VGG, GoogLeNet and ResNet. All three are convolutional networks, improvements on the breakthrough AlexNet [18], and recent winners of the ImageNet competition [21]. The networks have also proven successful on similar tasks of vehicle classification and detection [19, 29].

VGG VGG [22] from Oxford University is the least complicated network of the three, and as such a good starting point for my training. It has 16 weight layers and no parallel paths or residual connections like the other two. The network is the least optimized (it has the most weights), and as such takes longer to load and train than the others.

GoogLeNet GoogLeNet [25] is one of the 2014 winners of the ImageNet competition [21]. It introduced the so-called Inception module, which processes the signal with different kernels (convolutional layers) in parallel to produce a richer stack of feature maps. This is believed to help the network learn multiscale features [25]. After each Inception layer, the outputs of the parallel "paths" are concatenated and passed on deeper into the network. The specific name for the model I used is Inception-3.
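The parallel-kernel idea can be sketched as follows, in PyTorch for readability rather than the Torch7/Lua actually used in this work; the branch widths are illustrative, not GoogLeNet's exact configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions plus pooling, concatenated."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, 96, kernel_size=1),           # 1x1 bottleneck
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # All paths see the same input and keep the spatial size, so their
        # feature maps can be stacked along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)
```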

ResNet Microsoft Research won all three classes of ImageNet in 2015 with ResNets [12]. The name comes from how the layers are organized into so-called "residual blocks", where the original input signal of a block is added directly to its output. This trick helps avoid the vanishing gradient problem when networks become extremely deep [20, 12]. I used the 34-layer ResNet (except in Experiment 6).
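A minimal sketch of a basic residual block, again in PyTorch and with illustrative channel counts:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two convolutions whose output is added to the unchanged input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Adding the input directly gives gradients a short path backwards,
        # which mitigates vanishing gradients in very deep stacks.
        return F.relu(out + x)
```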

4.2 Tools

Torch7 Torch7 [6] is one of several programming libraries used to create neural networks. It is accessed through the Lua scripting language and has CUDA implementations, which help get the most out of the NVIDIA GPUs I used in my experiments (see Hardware). It is also open source, has good documentation and tutorials, and a large and active community, including Facebook AI, whose ResNet implementations [12] I used for my experiments. Well-known alternative frameworks include Caffe [15] and TensorFlow [3], which also have multi-GPU support. Torch was chosen because we believed it had the faster multi-GPU implementation at the time. I based my experiments on Torch implementations made by one of the core maintainers of Torch, Soumith Chintala [5].

Hardware I used NVIDIA Tesla K80 GPU cards for training my networks. This graphics card consists of two GPUs with 12 GB of GPU RAM (VRAM) each, enabling computational speeds of up to 8.74 TFLOPS (trillions of floating-point operations per second) in single-precision mode [23]. Using double precision would slow down the computations and have little to no effect on classification accuracy [10]. Training a ResNet on a single GPU required about 24 hours per 8 epochs of training.

5 Preprocessing

Standard preprocessing Initially, to have an equal number of input pixels for the first layer of the network, I first scale the images so the shortest side is 256 pixels, and then center-crop to 224x224 pixels [22, 13]. This method is based on the assumption that the vehicle is located in the middle of the picture. For each color channel (RGB), I then perform standard data normalization, subtracting the mean and dividing by the standard deviation. This, along with Batch Normalization layers, helps the networks converge faster and avoids problems with vanishing or exploding gradients [20, 14].
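A sketch of this pipeline using torchvision transforms (the thesis used Torch7; the mean/std values shown are the common ImageNet statistics, standing in for the dataset's own per-channel mean and standard deviation):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),        # scale so the shortest side is 256 px
    transforms.CenterCrop(224),    # assumes the vehicle is centered
    transforms.ToTensor(),         # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # illustrative values;
                         std=[0.229, 0.224, 0.225]),  # use the training set's own
])
```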

Data augmentation I also perform some less common preprocessing steps, such as so-called multicrop, since feeding the network finer samplings of the image is believed to help it learn [22, 25]. Differently from Szegedy et al. [25], however, I chose to use 9 larger crops, to avoid large increases in computation time. Using the original resized image, where the shortest side is 256 pixels, I crop the image at 9 different positions using a 224x224 cropping mask. This results in 9 heavily overlapping parts distributed evenly over the image as a grid (Left–Center–Right, Top–Mid–Bottom).
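A sketch of the nine-crop grid, assuming the image has already been resized so that its shortest side is 256 pixels:

```python
from PIL import Image

def nine_crops(img: Image.Image, size: int = 224):
    """Nine heavily overlapping crops on a 3x3 grid (L-C-R x T-M-B)."""
    w, h = img.size
    xs = [0, (w - size) // 2, w - size]   # left, center, right offsets
    ys = [0, (h - size) // 2, h - size]   # top, middle, bottom offsets
    return [img.crop((x, y, x + size, y + size)) for y in ys for x in xs]
```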

Color jittering To make the features of the pictures easier to distinguish, I altered the colors of the images in two ways, both suggested in the packaged Torch implementation of ResNet from Facebook [7]. The first, well described in [18], is a PCA-based technique that changes color intensity by adding random multiples of the RGB eigenvectors and eigenvalues channel-wise before passing the picture through the network. The second technique uses a simpler approach of jittering the brightness, saturation and contrast of the image by uniformly random amounts (applied in random order), which supposedly helps the network learn invariance to these properties [13].
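A sketch of the PCA-based technique [18]; the eigenvalues and eigenvectors of the training set's RGB covariance are assumed to be supplied by the caller (the columns of eigvecs are the eigenvectors):

```python
import numpy as np

def pca_lighting(img, eigvals, eigvecs, alpha_std=0.1):
    """Shift all pixels along the RGB principal components by random
    multiples of the eigenvalues, as in Krizhevsky et al. [18].

    img: float array of shape (H, W, 3); eigvals: (3,); eigvecs: (3, 3).
    """
    alpha = np.random.normal(0.0, alpha_std, size=3)   # one draw per image
    shift = eigvecs @ (alpha * eigvals)                # single RGB offset
    return img + shift                                 # broadcast over pixels
```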

6 Experiments

Common to all experiments As the first experiments were conducted at an early stage of labeling the data, and I wanted to avoid the classical problem of imbalanced class representation, the maximum number of pictures I could choose per class was that of the least populated class: 330. 60% of the images were used for training and 40% for testing. I also chose a dropout probability of 0.5 for every weight, which has an effect similar to training several smaller networks at once and then averaging over them [24]. Unless stated otherwise, I used all the aforementioned preprocessing steps, as well as the following initial values, which are within the range seen in recent works [20, 24, 12] (a configuration sketch follows the list):

• Initial learning rate = 0.01

• Momentum = 0.9

• Learning rate decay: halved every 10 epochs (to speed up training [20])

• Weights initialized according to Xavier initialization [9]

• Batch size = 32 original images, to avoid overloading the GPU memory
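A configuration sketch with these values, using PyTorch and a torchvision ResNet-34 as a stand-in for the Torch7 models (the 0.5 dropout mentioned above sits in the VGG-style classifier and is omitted here):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet34(num_classes=28)   # stand-in for the Torch7 model

# Xavier ("Glorot") initialization [9] for convolutional and linear weights
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Halve the learning rate every 10 epochs (every 4th in Experiment 2);
# call scheduler.step() once per epoch.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```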

Experiment 1 – Baselines The purpose of this experiment was to make network-specific baselines for our PlatesMania dataset. I trained the three networks VGG, GoogLeNet and ResNet from scratch, using 330 images for each of the 28 classes (9,240 original images in total for training and testing).

Experiment 2 – Faster learning rate decay To explore how variations in learning rate affect learning on our dataset, I trained VGG again with faster learning rate decay. According to Nielsen [20], the learning rate should be decreased when the learning slows down, which seems to happen at epoch 4 for VGG (see Figure 5a). As such, I halved the learning rate parameter every 4th epoch rather than every 10th.

Experiment 4 – Fewer classes This experiment is almost identical to the first, except with fewer classes. Here, I removed the very heterogeneous "Others" class, as well as the last three classes Kei Car, Caddy and Pickup with box, and trained the three networks on the remaining 24 (thus 132 images per class in the test set). I also monitored the output of the networks with confusion matrices, to better understand where they made the most mistakes (see Figure 7).
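The confusion matrix itself is straightforward to accumulate from the test-set predictions; a minimal sketch:

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """cm[i, j] counts test images of true class i predicted as class j;
    off-diagonal mass shows which classes the network confuses."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm
```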

Experiment 5 – More data, less augmentation As we had gathered a significantly larger amount of data, I chose to substitute the data augmentation methods by increasing the number of pictures per class to 13,500, to better understand the importance of data size in training. I also excluded all but classes 12–16 in this experiment, to see whether the networks would better distinguish between these very similar classes, which they had previously "confused" (Figure 7), given more examples and fewer alternatives. Note: after several failed attempts to make the networks converge, I chose to reduce the learning rate by only 20% every 10 epochs to make them converge faster [20].

Experiment 6 – Very deep networks As great success was reported for very deep ResNets [12], I also wanted to try increasing the number of layers. Starting from the ResNet-34 model used in earlier experiments, I first increased the number of convolutional layers to 50, and then to an extreme 152-layer implementation in the end. This was a challenge, as it put heavy demands on the GPUs.
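Switching depth changes only the number of residual blocks (and, from ResNet-50 upward, the block type); as a sketch, using torchvision models as stand-ins for the Torch implementations used here:

```python
from torchvision import models

# Same residual structure, increasing depth; 5 output classes as in
# Experiments 5 and 6.
resnet34 = models.resnet34(num_classes=5)
resnet50 = models.resnet50(num_classes=5)
resnet152 = models.resnet152(num_classes=5)
```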

0"

20"

40"

60"

80"

100"

0" 3" 5" 8" 10" 13" 15" 18" 20" 22" 25" 27"

Accuracy'(%)'

Epoch'

Top.1,Trainset"

Top.5,"Testset"

Top.1,"Testset"

(a) VGG

0"

20"

40"

60"

80"

100"

0" 3" 5" 8" 10" 13" 15" 18" 20" 22" 25" 27" 30" 32" 35" 37" 39"

Accuracy'(%)'

Epoch'

Top/1,Trainset"

Top/5,"Testset"

Top/1,"Testset"

(b) ResNet-34

Figure 5: Experiment 1, training curves for VGG and ResNet-34 (the GoogLeNet plot is omitted because it is very similar to VGG’s). Note how VGG stops learning after reaching 100% accuracy on the training set, while ResNet-34 keeps improving its performance on the test set.

7 Results

In this section the results are presented chronologically, either as the average accuracy over all classes (plots or a table) or per class (confusion matrices). Plots are based on records taken after each epoch, and results are discussed and explained in Section 8 and in the captions.
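The Top-1 and Top-5 scores reported below follow the usual definition: a prediction counts as correct if the true class is among the k highest-scoring outputs. A minimal PyTorch sketch:

```python
import torch

def topk_accuracy(logits, targets, k=5):
    """Fraction of samples whose true class is among the k highest scores;
    Top-1 is ordinary classification accuracy."""
    _, pred = logits.topk(k, dim=1)          # (N, k) predicted class indices
    hits = pred.eq(targets.view(-1, 1))      # compare against true labels
    return hits.any(dim=1).float().mean().item()
```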


0"

20"

40"

60"

80"

100"

0" 3" 5" 8" 10" 13" 15"

Accuracy'(%)'

Epoch'

Top-1,"Trainset"

Top-5,"Testset"

Top-1,"Testset"

Figure 6: Experiment 2: Learning rate decay every 4 epochs rather than every 10 on VGG

Figure 7: Confusion matrices for Experiment 4: (a) GoogLeNet, (b) ResNet. Each class has 132 images in total.

0"

20"

40"

60"

80"

100"

0" 5" 10" 15" 20" 25" 30" 35" 40" 45" 50" 55" 60" 65" 70" 75"

Accuracy'(%)'

Epoch'

Train,"Top31"

Test,"Top31"

Figure 8: Experiment 6: The performance of ResNet-152 on the five most difficult classes.

We can see how the network converges after a relatively large amount of iterations.

Figure 9: Results after Experiments 5 (a: GoogLeNet) and 6 (b: ResNet-50) suggest that the networks still struggle to separate the most similar classes Sedan, Hatchback and Station Wagon.


Classes   Model        Top-1   Top-5
28        VGG          50.8    82.0
28        GoogLeNet    49.5    71.1
28        ResNet-34    41.8    77.0
24        VGG          52.1    85.7
24        GoogLeNet    51.5    68.7
24        ResNet-34    46.1    82.0
5         ResNet-152   61.5    –
5         ResNet-50    61.0    –
5         GoogLeNet    60.4    –

Table 1: Accuracy (%) of the trained networks in Experiments 1, 4, 5 and 6. Top-5 is omitted for the 5-class experiments.

8 Discussion

Experiment 1 The average Top-1 score of 47.4% for the three networks was a very positive surprise for this first experiment, and far outweighs a random guesser’s 1/28 probability of choosing the right car type. Since I was using our new dataset rather than a cleaner competitor such as [29, 16], I had expected worse. This result also reveals what seems to be a difference between ResNet and the other two: it can keep learning after results on the training set have reached almost 100% accuracy. This can be seen in Figure 5.

Experiment 2 The results from this experiment are difficult to interpret. The "before" and "after" graphs follow each other very closely (in fact, both Top-1 and Top-5 converged faster before the change, see Figure 6), and there is little to no difference in the final scores. Also, since both the initial weights and the (stochastic gradient descent) learning algorithm are non-deterministic, it’s hard to know from just one experiment whether or not the fluctuations really depend on the learning rate decay, or on other factors such as weight initialization.

Experiment 4 In this experiment, removing 14% of the classes and their images only led to an increase to 49.9% correctly classified images. It is hard to analyze this result by direct comparison with Experiment 1, since both the number of training images and the number of classes were smaller this time. It could be fairer to compare the ratio between my two results with the ratio of the random probabilities of success (1/28 and 1/24): 49.9/47.4 ≈ 1.05, which is less than (1/24)/(1/28) = 28/24 ≈ 1.17. This comparison could suggest the network was actually less effective with fewer classes and less data.

The most useful insight from this experiment probably came from introducing the confusion matrices (Figure 7), as they were the first clear feedback on where the networks were making mistakes. In several cases, such as deciding between Station Wagons, Hatchbacks and Sedans, the results clearly reflect the difficulties we had classifying each vehicle by hand. Because of their similarity, the confusion between Trucks and ConcreteMixers was no surprise either. However, I was surprised by how clearly the networks separated DoubleDeckers from normal buses, and how often they mistook other vehicles for Pickups and Jeeps (see columns 4 and 10 of Figure 7).

In hindsight, I probably would have chosen to remove or merge other classes to directly improve the results. Couldn’t, for example, removing the "Others" class be detrimental to the score, seeing as it differs significantly from the remaining classes and as such is easier to classify?

Comparing the results of the 24-class experiments in Table 1 and Figure 7 also reveals what seems to be another particularity of ResNets: the large gap between Top-1 and Top-5 scores. Compared to GoogLeNet, ResNet-34 achieves a higher Top-5 score but a lower Top-1, which I interpret as sacrificing higher peaks in the probability density function (higher confidence in one class) for a more even spread over groups of classes.

Experiment 5 The last two experiments were the most fine-grained, as they only contained very similar car types. Compared to the previous experiment, where the average accuracy over classes 12–16 was about 32% for GoogLeNet (see Figure 7), seeing the network pass the 60% mark was a great positive surprise. Of course, such a comparison isn’t fair at all; this time I used about 41 times as many original training images per class and far fewer classes. Picking randomly among the five would have yielded 1/5 = 20% on average, which makes the relative difference between random guessing and my neural network smaller than in Experiment 4.

Experiment 6 After ResNet-34 achieved about 37% average accuracy on classes 12–16 in Experiment 4, and GoogLeNet increased by about 9% in Experiment 5, this last experiment displays the highest accuracies of all, around 61.5%. I unfortunately didn’t have time to run an extra experiment with 5 classes on ResNet-34, so it is hard to know how much of the increase was due to the class reduction and the amount of data, versus the increased network depth. It seems, however, that the increase from 50 to 152 convolutional layers didn’t help much, compared to the 3x increase in complexity, which I speculate could be related to the vanishing gradient problem [20, 12]. Looking at the results in Figure 9, it is clear that the largest difficulty was distinguishing between Sedans, Hatchbacks and Station Wagons. This was no surprise, as many cars are designed to be in between these three popular classes, and as such could belong to one or the other.

Reflections on my experimental method The experiments were planned as ad hoc explorations rather than methodical studies, to see how good a result I could achieve in the short timeframe of a bachelor thesis. As a consequence, the exact implication of each parameter change cannot be considered empirically proven by my studies. For example, it is still unclear how increasing the number of images per class influenced learning in Experiments 5 and 6, as I also chose fewer classes, turned off data augmentation, and used deeper versions of ResNet. This direction was, however, based on conscious choices. I believe the importance of optimizing each parameter was already well documented, and therefore I wanted to try out popular methods on a new, high-noise dataset, both to broaden my understanding of deep neural networks and, hopefully, to produce a guide to help others.

9 Future work

9.1 Further optimizing the algorithm

Conducting methodical experiments to see which parameters lead to improvements would presumably show more clearly how to further improve our results. One example could be trying to extend the depth of GoogLeNet or adding residual connections, to see if some of the positive effects from ResNets in Experiment 6 could also be seen in deeper GoogLeNets, as in Szegedy et al. [26]. It would also be very interesting to explore the trained networks by visualizing the weights of each layer with the techniques described in Zeiler and Fergus [30]. This could help explain which features are learned during training.

9.2 Cleaning up the dataset

I have pondered several solutions for improving the quality of our dataset, such as further merging or removing classes, or training a network on a less noisy dataset like CompCars [29] and using it to check the labels of PlatesMania [1]. One could also consider using object detection to find the car in each image [31], or even excluding rare vehicles, such as particular Russian minibuses or military missile carriers, which are less common but currently represented. Creating a good, easily analyzable dataset is a common challenge in research. Even though we cannot completely trust all labels from PlatesMania, we can still rely on our subjective classifications most of the time. The question is how important a perfect dataset is in this context. A recent study by Krause et al. [17] strongly questions the value of "cleaning up" a reasonably clean dataset of what it calls "cross-category noise" (mislabeled images) when the dataset is large. Deciding whether to clean up or not also depends on what the purpose of the trained classifier will be. The accuracies in my experiments are, for example, quite a bit worse than those reported on CompCars [29], because of the higher quality of their dataset, but it is unknown how accurately their classifiers would perform on our noisy dataset. In general, however, we recognize that creating a high-quality dataset is a very common problem in all research, and that creating experiments that show clear results is an art.

9.3 Determining a goal to go further

After working with these algorithms for some time, and reading several articles and results from current research, it’s easy to imagine a myriad of ways of continuing our work. However, inventing without re-inventing in such an active field seems a task as daunting as it is exciting. In the end, it depends on the problem at hand and on what dataset is available; maybe the most interesting problem lies in discovering a new field of application rather than a new optimization technique! For this reason, I believe the most important first step in creating a well-performing algorithm is identifying the purpose of the task – a goal. With a clear goal and its implicit constraints, the direction through the "forest" of variables in a project becomes clearer.

9.4 Acknowledgements

I’d like to thank Elena, my friends (especially Michele, Matteo and Simone in Torino) and family for their support while I wrote my bachelor thesis. I would also like to give a particular mention to some of the researchers and Ph.D. candidates of the Telecom Italia Joint Open Lab. First, I am very grateful to Sina Ghassemi, for discussions, tips, friendship and help using Torch for my experiments. Without him, I would still be lost trying to import images, or navigating the jungle of papers, programming library documentation and mathematical expertise that constitutes the exciting field of neural networks.

It’s been a pleasure working with him.

I would also like to thank Skjalg Lepsøy, Tomas Björklund and Pedro Gusmão, for helping me understand the concepts of neural networks and Linux, and for always doing so with a smile.

Lastly, I’d like to thank Gianluca Francini, Enrico Magli and the JOL for giving me the opportunity to write my thesis in their offices. They included me in meetings and briefings, and gave me access to their absolute high-end hardware, as if I were a regular member of their research team. It has been a wonderfully inspiring environment, making this thesis project an experience beyond all my expectations.

Grazie mille.

About Joint Open Lab The Telecom Italia Joint Open Lab is a collaboration between Politecnico di Torino and Telecom Italia. It consists of the four groups VISIBLE, CRAB, SWARM and MOBILE, which work with convolutional neural networks, robotics, IoT applications, and mobile social applications, respectively. http://jol.telecomitalia.com/jolvisible

References

[1] PlatesMania. www.platesmania.com/. [Last retrieved: 12-06-2016].

[2] Wikicars – Car body style. http://wikicars.org/en/Car_body_style. [Last retrieved: 13-06-2016].

[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems. 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

[4] D. Bullock, J. Garrett, and C. Hendrickson. A neural network for image-based vehicle detection. Transportation Research Part C: Emerging Technologies, 1(3):235–247, 1993.

[5] S. Chintala. GitHub repository. https://github.com/soumith/imagenet-multiGPU.torch, 2013. [Last retrieved: 01-06-2016].

[6] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. 2011.

[7] Facebook. GitHub repository. https://github.com/gcr/fb.resnet.torch-lesion-study/blob/master/datasets/transforms.lua.

[8] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015. URL http://arxiv.org/abs/1508.06576.

[9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 2010. PMLR. URL http://proceedings.mlr.press/v9/glorot10a.html.

[10] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015. URL http://arxiv.org/abs/1502.02551.

[11] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. C. Courville, Y. Bengio, C. Pal, P. Jodoin, and H. Larochelle. Brain tumor segmentation with deep neural networks. CoRR, abs/1505.03540, 2015. URL http://arxiv.org/abs/1505.03540.

[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

[13] A. G. Howard. Some improvements on deep convolutional neural network based image classification. CoRR, abs/1312.5402, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1312.html#Howard13.

[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.

[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[16] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. 2013.

[17] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and F. Li. The unreasonable effectiveness of noisy data for fine-grained recognition. CoRR, abs/1511.06789, 2015. URL http://arxiv.org/abs/1511.06789.

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Pages 1097–1105, 2012.

[19] D. Liu and Y. Wang. Image classification of vehicle make and model using convolutional neural networks and transfer learning. 2014. URL http://cs231.stanford.edu/reports/lediurfinal.pdf. [Unpublished paper from Stanford University course CS231n. Last retrieved: 01-06-2016].

[20] M. A. Nielsen. Neural networks and deep learning. Determination Press, 2015.

[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014. URL http://arxiv.org/abs/1409.0575.

[22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.

[23] R. Smith. NVIDIA launches Tesla K80, GK210 GPU. www.anandtech.com/show/8729/nvidia-launches-tesla-k80-gk210-gpu, 2014. [Web article. Last retrieved: 15-05-2016].

[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014. URL http://dl.acm.org/citation.cfm?id=2627435.2670313.

[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.

[26] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016. URL http://arxiv.org/abs/1602.07261.

[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. Pages 1701–1708, 2014. doi: 10.1109/CVPR.2014.220. URL http://dx.doi.org/10.1109/CVPR.2014.220.

[28] T. Weyand, I. Kostrikov, and J. Philbin. PlaNet – photo geolocation with convolutional neural networks. CoRR, abs/1602.05314, 2016. URL http://arxiv.org/abs/1602.05314.

[29] L. Yang, P. Luo, C. C. Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. CoRR, abs/1506.08959, 2015. URL http://arxiv.org/abs/1506.08959. [Last retrieved: 01-06-2016].

[30] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. URL http://arxiv.org/abs/1311.2901.

[31] Y. Zhou and N. Cheung. Vehicle classification using transferable deep neural network features. CoRR, abs/1601.01145, 2016. URL http://arxiv.org/abs/1601.01145.
