Mälardalen University, Västerås, Sweden

Thesis for the Degree of Bachelor of Science in Computer Science, 15.0 credits

DEEP LEARNING TO DETECT SNOW AND WATER IN CONSTRUCTION PLANNING USING REMOTE SENSING IMAGES

Mats Fridberg
mfg17006@student.mdh.se

Adam Hoflin
ahn17016@student.mdh.se

Examiner: Mobyen Uddin Ahmed
Mälardalen University, Västerås, Sweden

Supervisors: Shahina Begum
Mälardalen University, Västerås, Sweden

Hamidur Rahman
Mälardalen University, Västerås, Sweden


Abstract

Construction and property development is an industry subjected to and affected by weather conditions in its daily operation. A construction site will often be covered in snow and rain, which must be taken into consideration for the safety of the personnel working there. With drone image technology, construction companies can monitor their construction sites from an aerial view. This thesis aims to evaluate whether a Convolutional Neural Network (CNN) can confidently classify snow and rain conditions in drone imagery, and how an architecture outperforming previously tested CNN models can be designed. The option of using a CNN as a feature extractor and performing classification on those features using Machine learning (ML) algorithms is also explored in this thesis.

The thesis began with a literature study and survey of existing CNN models and methods. The importance of image augmentation for the training process was discovered and the architectures AlexNet and VGG16 were chosen for the thesis work. Both models were implemented in addition to a proposed model designed specifically for the task at hand. The three models were then used, in an un-trained state, to extract feature maps to be used in the ML algorithms. The ML algorithms were trained and used as classifiers to validate if they would be an effective alternative method. A small CNN architecture with three convolutional layers was found to be sufficient at extracting relevant features for the classification task at hand compared to the larger state of the art models. The proposed model reached a top accuracy of 97.3% in binary classification and 96.5% while also taking rain conditions into consideration. The ML algorithms were found to be valid options when choosing a classifier for the extracted features. Smaller and condensed feature maps led to a better accuracy of the ML algorithms, reaching as high as 90% using features extracted from the proposed model.


Acknowledgements

We would like to thank our supervisors Hamidur Rahman and Shahina Begum for guiding us through this thesis. We also want to thank NCC for the opportunity and resources they provided to us under the project Machine Learning for the prevention of occupational accidents in the construction industry, funded by Vinnova.

Table of Contents

1. Introduction
2. Background
   2.1. Neural Networks
        2.1.1 Activation function
        2.1.2 Stochastic gradient descent
   2.2. Deep learning
        2.2.1 Convolutional Neural Network
        2.2.2 Filters
        2.2.3 Pooling layer
        2.2.4 Flattening
        2.2.5 Fully Connected Layer
        2.2.6 AlexNet
        2.2.7 VGG16
   2.3. Cross-validation
   2.4. Machine learning algorithms
        2.4.1 Support Vector Machine
        2.4.2 Logistic regression
        2.4.3 k-nearest neighbors
   2.5. Metrics
        2.5.1 Confusion Matrix
        2.5.2 Accuracy
        2.5.3 Precision
        2.5.4 Recall
        2.5.5 F1 Score
   2.6. Multi-class metrics
   2.7. Related works
3. Problem Formulation
4. Materials and Methods
   4.1. Dataset & System
        4.1.1 Images
        4.1.2 Videos
        4.1.3 Software
        4.1.4 Hardware
        4.1.5 Raspberry Pi
   4.2. Method
        4.2.1 Binary classification
        4.2.2 Multiclass classification
5. Ethical and Societal Considerations
6. Implementation and experiment
   6.1. Creating the dataset
        6.1.1 Data augmentation
        6.1.2 Extracting images from video
        6.1.3 Implementing k-fold cross-validation
   6.2. VGG16 implementation
   6.3. AlexNet implementation
   6.4. Creating the proposed architecture
   6.5. Machine learning algorithms on CNN extracted feature maps
   6.6. Testing on a Raspberry Pi
7. Results
   7.1. Binary Classification
        7.1.1 Convolutional Neural Networks
        7.1.2 Confusion matrices
        7.1.3 Machine learning algorithms
   7.2. Multi-Class Classification
        7.2.1 Convolutional Neural Networks
        7.2.2 Confusion matrices
        7.2.3 Machine learning algorithms
8. Discussion
9. Limitations
10. Conclusions
11. Future work
References

List of Figures

1  Illustration of an artificial neuron.
2  Illustration of max-pool.
3  Illustration of a flattening layer.
4  Illustration of fully connected layers.
5  AlexNet architecture.
6  VGG16 architecture.
7  Illustration of a confusion matrix.
8  Illustration of image extraction.
9  Architecture of the proposed model.
10 Grouped bar chart displaying result of binary classification using CNN models.
11 Confusion matrix for proposed model in binary classification.
12 Confusion matrix for AlexNet in binary classification.
13 Confusion matrix for VGG16 in binary classification.
14 Grouped bar chart displaying result of binary classification using ML algorithms.
15 Grouped bar chart displaying result of multi-class classification using CNN models.
16 Confusion matrix for proposed model in multi-class classification.
17 Confusion matrix for AlexNet in multi-class classification.
18 Confusion matrix for VGG16 in multi-class classification.
19 Grouped bar chart displaying result of multi-class classification using ML algorithms.
20 Non-disclosure agreement.

List of Tables

1 F1-score in binary classification.
2 F1-score in multi-class classification.

Acronyms

CNN  Convolutional Neural Network
KNN  k-nearest neighbors
LR   Logistic Regression
ML   Machine learning

1. Introduction

Construction and property development is an industry subjected to and affected by weather conditions in the daily operation. A construction site will often be covered in snow and rain which must be taken into consideration for the safety of the personnel working there. With drone image technology, project managers can monitor their construction sites from an aerial view.

A Convolutional Neural Network (CNN) is a widely used deep learning solution in image classi-fication tasks. It can both extract features from an image and classify them in the same process which sets it apart from other image classification options. A CNN can also be used for the sole purpose of feature extraction. The features can then be classified using Machine learning (ML) algorithms which is a tried and tested approach to classification.

The purpose of this thesis is to detect snow and rain conditions using deep learning, and to evaluate whether the algorithms used can be a supportive tool for the project manager in daily operation. CNNs are particularly efficient at such classification problems. They have in recent years become a well-established method in image recognition tasks and are widely used in the area. This thesis aims to assess whether a deep learning solution can reach the desired performance, which would be a high accuracy in identifying whether it has snowed or rained. It also aims to assess whether an alternative approach, using a CNN for feature extraction and ML algorithms for classification, is viable.

A proposed CNN model, specifically designed for the task at hand, was constructed and compared to state-of-the-art CNN models classifying the same dataset. The knowledge of how to construct the CNN model was gathered through a literature study and an assessment of different CNN architectures. The work was done through experiments, implementing the different solutions and models. The results were then evaluated and compared to each other. The CNNs were compared not only by their classification results but also by their feature extraction ability. The extracted features were classified using ML algorithms, and the performance of each combination of feature extractor and ML algorithm was then evaluated.

Evaluating the capabilities of image recognition technologies in the construction industry could lead to interesting progress in how safety protocol is enforced in the daily operation. This thesis serves as a proof of concept and aims to show what is possible with the deep learning resources available today.

2. Background

2.1. Neural Networks

A neural network is a self-learning system inspired by the structure of the human brain [1]. It consists of nodes, also known as neurons. A neuron receives an input, or a number of inputs, from a link then performs an activation function on that input before sending the resulting output through another link. The neuron has weights associated with it that affect the output. A common goal for supervised learning is changing these weights in order for the input to result in the desired output. Changing the weights is commonly called ”learning”.

Figure 1: A diagram of an artificial neuron [2]. It shows an input matrix being multiplied by the associated weights and sent through a transfer function to an activation function. The end result is an output shown as oj.

2.1.1 Activation function

An activation function is a mathematical calculation deciding the output from a node given a set of inputs. The activation function is used to increase the non-linearity of the output from a convolutional layer. Every node in the network has an activation function deciding if the node will be activated or not. If the node is activated, the output is sent to the next layer in the network. It is of great importance that the activation function is not computationally heavy, since that would slow down the network. The three activation functions used in this thesis are the rectified linear unit (ReLU), sigmoid and softmax.

ReLU: $f(x) = \max(0, x)$

Softmax: $\tilde{y}_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$
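As an illustration of the functions above, a minimal NumPy sketch (not taken from the thesis code; the sigmoid, used later for binary classification, is included for completeness):

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negative inputs become 0
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes any real value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(y):
    # Subtracting the maximum first keeps the exponentials numerically stable
    e = np.exp(y - np.max(y))
    return e / e.sum()

print(relu(np.array([-2.0, 3.0])))         # [0. 3.]
print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities summing to 1
```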

2.1.2 Stochastic gradient descent

Neural networks are trained using gradient descent. The objective of gradient descent is to reach the lowest point of a graph. In a two-dimensional Cartesian coordinate system, this means that the algorithm wants to find the value x where the value y is at its minimum. In order to calculate the gradient descent a loss function is needed. In this thesis binary crossentropy is used for binary classification and categorical crossentropy for multi-class classification.

Gradient descent depends on the loss function, which depends on every single element in the training data. Stochastic gradient descent differs in that the algorithm introduces randomness into the equation: instead of using all elements in the training data, it randomly selects which elements will be used in the algorithm [3]. By adding momentum to this algorithm the likelihood of getting stuck in a local minimum is decreased.
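In Keras, stochastic gradient descent with momentum can be configured as below (a sketch; the momentum value 0.9 is a common choice, not a setting reported in this thesis):

```python
from tensorflow.keras.optimizers import SGD

# SGD with a momentum term, which decreases the likelihood
# of getting stuck in a local minimum
optimizer = SGD(learning_rate=0.01, momentum=0.9)
```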

2.2. Deep learning

Deep learning is a subset of ML algorithms [4]. The difference between deep learning and more traditional ML is that instead of being taught which features to look for in an image, the computer makes its own decision about which features might be of importance [5]. Deep learning is commonly used in areas like voice, audio and image recognition. In image recognition the deep learning neural networks are divided into hidden layers that extract relevant features from the images. In this thesis a Convolutional Neural Network (CNN) is used for classifying snow and rain conditions in drone images.

2.2.1 Convolutional Neural Network

A Convolutional Neural Network (CNN) is a deep neural network commonly used in image analysis, mostly in image classification applications [6]. During training the network has filters (also known as kernels) that 'convolve' over the input data to find the optimal weights. The filter is applied to only a part of the previous layer at a time, where it performs a weighted sum and an activation function, and then 'convolves' to the next part of the previous layer with the same set of weights. This is what is called the convolution. Each convolutional layer creates a new feature map for every filter in the layer, i.e. a convolutional layer consisting of 16 filters will create 16 feature maps. These feature maps are then passed to the next layer in the network.

The introduction of Convolutional Neural Networks (CNN) has broadened the possibilities of pattern recognition in images. CNNs take into account spatial dependence in the input data [6]. This means that the network can find features anywhere in the input image. Because of this, CNNs are commonly used in image recognition tasks.

2.2.2 Filters

A filter (or kernel) in the context of a CNN is a set of weights that is applied to the input data. The filter is represented as a matrix of random values, commonly between 0 and 1 or between -1 and 1. The filters are multiplied with the input data, generating feature maps.

2.2.3 Pooling layer

Pooling is used to make the extracted feature maps more abstract. By doing this, the new pooled feature maps will be less sensitive to the location of features in the input. Two common pooling algorithms are max-pooling and average-pooling. In max-pooling the highest value is taken from each part where the filter is applied [7]; Figure 2 describes the process. Average-pooling is similar, with the only difference being that the average of all the data within the filter is taken. If a filter with dimensions 2X2 and stride 2 is used, the feature map will be down-sampled to a fourth of the original dimension. This makes the feature map more abstract and smaller, which is beneficial in the next convolutional layer. By making the feature maps more abstract, the chance of overfitting the network is reduced.
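A minimal NumPy sketch of 2X2 max-pooling with stride 2, which down-samples a feature map to a fourth of its original size (illustrative only):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # Takes the maximum of each non-overlapping 2x2 window (stride 2)
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]])
print(max_pool_2x2(fm))  # [[6 8]
                         #  [3 4]]
```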

2.2.4 Flattening

The flattening layer takes the feature map from the pooling layer as input. The flattening algorithm converts the multi-dimensional data from the pooling layer into a vector. The new vector can be represented as a vector of nodes that can be used as input for the next layer in the network.

Figure 3: Flattening of 2X2 matrix

2.2.5 Fully Connected Layer

A fully connected layer, also known as a dense layer, is a linear operation where every input is connected to every output by a weight. The operation is followed by an activation function, such as ReLU, sigmoid or softmax. In this study the fully connected layer takes the output from the flattening layer as input.

2.2.6 AlexNet

AlexNet is a CNN designed by Alex Krizhevsky [8]. AlexNet showed great performance in the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge). The network classified 1.2 million images in 1000 different classes at top-1 and top-5 error rates of 37.5% and 17.0% respectively [8]. The top-5 error rate refers to the fraction of test images where the correct label is not within the top 5 labels considered by the model.

The AlexNet architecture consists of five convolutional layers, three pooling layers and two fully connected layers. The first two convolutional layers in the network have filter dimensions of 11X11 and 5X5, with the rest having the dimension 3X3.

Figure 5: Architecture of AlexNet [9]

2.2.7 VGG16

VGG16 is a CNN designed by K. Simonyan and A. Zisserman [10]. The architecture showed great performance in the 2014 ILSVRC and achieved a top-5 error rate of 7.3%. The network consists of 13 convolutional layers, 5 pooling layers and two fully connected layers followed by an output layer, making it significantly bigger than the AlexNet architecture. Unlike AlexNet, the VGG16 model uses a fixed 3X3 filter size in every convolutional layer throughout the network. The model is divided into blocks of convolutional layers with associated max-pool layers.


Figure 6: Architecture of VGG16 [9]

2.3. Cross-validation

Cross-validation is a widely used method for estimating the error rate/accuracy of ML or CNN applications [11]. k-fold validation is a cross-validation method where the data is split into k folds, where k is an integer greater than 1. The model is then trained on k-1 folds and validated on the remaining fold. This is repeated for each fold. k-fold validation is a good method when there is not an abundance of data available. It also makes certain that every part of the data is used for validation during the training process.
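A sketch of the k-fold idea using scikit-learn (placeholder data; the thesis implementation instead worked on image folders, as described in Section 6.1.3):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # placeholder samples
y = np.array([0, 1] * 5)           # placeholder labels

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    # Train on k-1 folds, validate on the held-out fold
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")
```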

2.4. Machine learning algorithms

2.4.1 Support Vector Machine

Support Vector Machine (SVM) is a ML algorithm for supervised learning. The algorithm can be used for both classification and regression purposes; in this thesis the focus is on classification. SVM is a kernel-based algorithm, meaning that any linear model can be turned into a non-linear model by performing the kernel trick [12]. This means that the algorithm changes the dimension of the mapped data until it finds a hyperplane that can separate the classes. A kernel is a set of mathematical functions which transforms the data into the required form. There are different types of kernel functions that can be used by the SVM, the linear kernel and the Radial Basis Function (RBF) being two common ones. The linear kernel can be used effectively when the data is assumed to be linearly separable. RBF, on the other hand, is commonly used when the data is hypothesized to be separable by a curve. In this thesis the linear kernel function has been used.

2.4.2 Logistic regression

Logistic Regression (LR) is a classification model. LR is similar to linear regression, but more complex [13]. One big difference between linear and logistic regression is the logit link function. By taking a linear combination of the covariate values, the logit link function converts the values to the scale of probability, i.e. a value between 0 and 1. LR is commonly used in binary classification, but can be used in multi-class classification using the one-versus-all method. The one-versus-all method involves training one classifier per class, with the training samples of that class being positive and the others being negative.

2.4.3 k-nearest neighbors

k-nearest neighbors (KNN) is a classification and regression algorithm. In classification the output is the most common class among the k nearest neighbors, where k is a positive integer [14]. The nearest neighbors are often defined by the Euclidean distance between the test and training samples.
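The three algorithms above can be instantiated in scikit-learn as follows; the settings shown (linear kernel, k = 3) are the ones reported in Section 6.5, the rest are library defaults:

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "SVM": SVC(kernel="linear"),                 # linear kernel, as in this thesis
    "LR": LogisticRegression(max_iter=1000),     # logit link; handles multi-class
    "KNN": KNeighborsClassifier(n_neighbors=3),  # majority vote of 3 neighbors
}

# Each classifier exposes the same fit/predict interface:
# classifiers["SVM"].fit(X_train, y_train)
# classifiers["SVM"].predict(X_test)
```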

2.5. Metrics

2.5.1 Confusion Matrix

The confusion matrix is used in classification problems to summarize the prediction results. The confusion matrix is a way of showing how the classification model is confused when predictions are made. It gives insight into what types of errors are made by the model.

Figure 7: Confusion Matrix

True positive: Observation is positive and is predicted to be positive.
False negative: Observation is positive, but is predicted to be negative.
True negative: Observation is negative and is predicted to be negative.
False positive: Observation is negative, but is predicted to be positive.

2.5.2 Accuracy

The term accuracy refers to how often the model is correct in its predictions. The accuracy can tell if a model is trained correctly and indicate the general performance of the model. However, accuracy will not give any detailed information about how well the model performs.

$\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{total examples}}$

2.5.3 Precision

The term precision refers to how often the model is correct when it makes a positive prediction. To calculate the precision, the number of correct positive predictions is divided by the total number of predicted positive examples. When the cost of false positives is high, knowing the precision of the model is valuable.

$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$

2.5.4 Recall

The term recall refers to the percentage of total relevant results correctly classified by the model. When the cost of false negatives is high, knowing the recall of the model is valuable. Recall can be seen as a measure of quantity and precision as a measure of quality.

$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$

2.5.5 F1 Score

The F1 score is a measure of a model's accuracy that combines precision and recall. The F1 score is a value between 0 and 1, where a higher score is better. A higher score means that the model has fewer false positives and false negatives, while a lower score means that the model has a high number of false positives and false negatives.

$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
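All four metrics follow directly from the confusion-matrix counts; a small illustrative helper (not thesis code):

```python
def binary_metrics(tp, fp, tn, fn):
    # Computes the four metrics defined above from raw
    # confusion-matrix counts of a binary classifier
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example with made-up counts
print(binary_metrics(tp=90, fp=5, tn=85, fn=10))
```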

2.6. Multi-class metrics

The difference between the metrics for binary classification and multi-class classification is that precision and recall have to be calculated for each class individually in the multi-class case. The results from those calculations are then averaged to achieve general precision and recall scores for the model.
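In scikit-learn this per-class calculation and averaging corresponds to the 'macro' average (a sketch with made-up labels):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]  # made-up ground truth
y_pred = [0, 1, 2, 1, 1, 0, 2, 0]  # made-up predictions

# 'macro' computes the metric per class and then averages the results
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
```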

2.7. Related works

Previous work on classifying snow has been done by Y. Zhan et al. [15], who used a CNN to distinguish snow from clouds in satellite imagery. They found great success distinguishing snow from clouds with regards to accuracy. They took into consideration not only low-level information (like per-pixel color), but also high-level semantic information (like textures surrounding the features). This was possible since deep neural networks are efficient at modeling such information; they can learn abstract patterns with the help of multiple layers. N. Kussul et al. [16] used a deep learning model to classify different types of crops in remote satellite images. Their architecture is built upon an ensemble of CNNs. They found issues with disturbance of satellite imagery, since clouds or shadows could lead to contamination of the pixels in the images. M. R. Ibrahim [17] used a pipeline of CNNs to extract weather conditions and classify multiple weather labels. D. Varshney [18] used a CNN to discriminate between snow and clouds in remote sensing images. C. Liang, J. Ge et al. [19] used a deep learning model to classify detailed winter road surface conditions. They showed the potential of achieving higher testing accuracy by using a larger training dataset. G. Pan, L. Fu et al. [20] also did road surface condition recognition, but using a pre-trained CNN for classification.

N. Koyama et al. [21] used a CNN to classify snow in parking lots. They compared Network in Network and the AlexNet model [8]. They used a Raspberry Pi with a camera module to take images of the parking lots. The images were split into smaller images labeled "with snow cover" and "without snow cover" for training purposes.

A. Howard [22] evaluated different techniques of improving CNNs by manipulating the input data. By manipulating the images, a larger dataset could be created to train the CNN, which led to higher accuracy and less overfitting. The data-set could be extended by rotating, shifting, flipping and adding noise to the images.

F. Huang and Y. LeCun [23] used an SVM in combination with feature extraction using a CNN. Their motivation for the work was that a CNN is good at learning features while an SVM is a better classifier in itself but worse at handling complicated features. They extracted the output of the CNN's fifth convolutional layer to retrieve a feature vector which they then used as training input for the SVM. Using only an SVM they found an error rate of 43.3%, using only a CNN they found an error rate of 7.2%, and combining the two they found an error rate of 5.9%. Support vector machines were also used in combination with CNNs by Källén et al. [24]. They used a pre-trained CNN to extract features to be used in the SVM for grading Gleason score on malignant prostatic adenocarcinoma specimens. They achieved an accuracy of 89.2% when classifying entire images. This showed that their network could perform on the same level as previous work, but without hand-crafted features.

Z. Liang et al. [25] used a 17-layer CNN model to diagnose malaria in microscopic images of blood smears. They trained their CNN using cross-validation, where they split their data into 10 folds. The CNN method was compared to a "transfer learning model" which uses ML algorithms on features extracted from the CNN. They found the CNN model to perform better in most regards, the exception being that the CNN model requires more resources and computational power than the transfer learning approach.

3. Problem Formulation

Getting an overview of potential dangers on a construction site and the current status of rain and snow conditions could lead to valuable updates in the construction process. Gaining such information from a distance using a drone and deep learning would increase both time efficiency and safety. In order to showcase these possibilities, it must be ensured that the right features are selected from the drone images and that the correct CNN model is used.

This thesis aims to establish how a CNN can be implemented to classify snow and rain in drone images. The goal is to outperform already existing models with a proposed CNN model design. The existing architectures to be outperformed are AlexNet and VGG16. Furthermore, alternative methods will be explored, such as using ML algorithms on data extracted from images using a CNN. The research questions for this thesis are the following:

Q.1 How can a CNN be constructed to classify snow and rain in remote drone imagery?

Q.2 How can a CNN model be designed that outperforms previously tested models?

Q.3 How can Machine learning algorithms effectively be used as classifiers on feature maps extracted by a CNN?

4. Materials and Methods

This chapter describes the methodology used in this thesis. Section 4.1 describes the dataset and the materials used to train the CNN models. Section 4.2 describes the steps taken to gather knowledge on the topic and how the experiments were implemented and performed.

4.1. Dataset & System

4.1.1 Images

The first part of the dataset consisted of images taken with a camera mounted on a crane to simulate drone footage. The images were divided into two classes: one containing snow and one not containing snow. Of the images received, 69 contained snow and 10 did not. The resolution of the images was 1920X1080X3.

4.1.2 Videos

The second part of the dataset consisted of two five-minute videos recorded with a drone flying over a construction site. The videos could be divided into three classes, since one of the videos contained both snow and rain conditions; the other video contained no snow. These videos were used to further expand the dataset of images through data augmentation. The videos had a frame rate of 30 frames per second at a resolution of 1920X1080X3.

4.1.3 Software

The open-source library Keras [26] and the TensorFlow platform [27] were used when implementing the CNN models. The code was written in the programming language Python, which is supported by Keras and could run on the hardware in the computers used for the experiments. The Python library Scikit-learn was used for the implementation of the ML algorithms [28].

4.1.4 Hardware

Using TensorFlow and CUDA cores, the networks could be trained on NVIDIA GTX 1070 GPUs with 8 GB of memory. It is possible to train the CNN models directly on the CPU, but this is not recommended since the CPU has a lower memory bandwidth than the GPU. GPUs are also very efficient at performing matrix multiplications and convolutions.

4.1.5 Raspberry Pi

A trained CNN model was used for testing on a Raspberry Pi 4 with 4 GB of memory. The testing was possible by installing TensorFlow and Keras on the Raspberry Pi and loading the pre-trained model from the computer where it was trained onto the Pi.

4.2. Method

In this thesis three different CNNs have been implemented. A proposed architecture has been designed, implemented and compared to two state-of-the-art CNN architectures. Features have been extracted from each of the architectures and the resulting feature matrices have been used in ML algorithms.

Since more knowledge of the subject area was needed, the thesis started with a literature study. This was important, since the appropriate tools and knowledge were needed to implement and experiment with CNNs. The information gathered from the literature study also directly influenced the report, since it helped construct arguments for why certain decisions were made when creating the proposed architecture. The literature study was based on reading articles on related topics. The immediate result of the study can be found in the background section.

Numerous experiments on different CNN architectures were performed. The experiments consisted of implementing the CNNs and training them with the appropriate dataset. VGG16, AlexNet and a proposed architecture were implemented. The proposed architecture was designed with the help of the knowledge gathered from the literature study. The different architectures were trained with the same parameters, such as the number of epochs, and datasets generated using the same images. The goal of these experiments was to evaluate the performance of the models regarding accuracy, precision, recall and F1 score when classifying the data.

The training of the CNN models was done using the k-fold cross-validation method to ensure fair and consistent results. A 5-fold cross-validation was used. However, the data was divided into 6 parts, since a test dataset that the model had not been trained on was needed.

The results of the experiments were compared using the metrics accuracy, precision, recall and F1 score. Other qualitative aspects, such as the structure of the layers within the architecture, were also considered. A smaller network with fewer weights is less computationally heavy to train and use, which might be of importance if the CNN were to be used in real-time applications. In image classification problems such as this one, more data leads to a better result, since there are more examples to train the network on [29]. Data augmentation techniques such as cropping, rotating and flipping were performed in order to artificially expand the dataset. This was done using a Python script.

Feature maps were extracted from the CNNs to be used for classification using Support Vector Machine (SVM), Logistic regression and k-nearest neighbors. This was done to validate whether feature maps extracted from CNNs could be used in ML algorithms with satisfactory performance.

4.2.1 Binary classification

The binary classification classified snow and not snow. In a CNN it is achieved by having one node in the output layer. By doing this, the network only specifically tries to identify one class. If the desired class is snow, the network will identify the other class as not snow. The output layer uses the activation function sigmoid to get the probability of the data belonging to the class snow. If the output from the activation function is closer to 1 than 0, the data will be classified as snow, otherwise not snow. The loss function used in the binary classification was binary crossentropy.

4.2.2 Multiclass classification

The multi-class classification classified snow, not snow and rain conditions. The multi-class classification is done using the same method as the binary classification, with the only difference being that there are more than two classes. Instead of there being one node in the output layer there are three, one for each class. The loss function used in the multi-class classification was categorical crossentropy.
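The difference between the two setups lies only in the output layer and the loss function; a hedged Keras sketch (the optimizer is a placeholder, since the thesis models use either Adam or SGD depending on the architecture):

```python
from tensorflow.keras import layers

def add_output_and_compile(model, num_classes):
    # Binary: one sigmoid node with binary crossentropy.
    # Multi-class: one softmax node per class with categorical crossentropy.
    if num_classes == 2:
        model.add(layers.Dense(1, activation="sigmoid"))
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
    else:
        model.add(layers.Dense(num_classes, activation="softmax"))
        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])
    return model
```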

5. Ethical and Societal Considerations

The data used in this study is aerial footage of construction sites and rarely contains any humans. Since the images are taken from an aerial point of view, it is difficult to distinguish any person in the picture. The dataset has been manually checked for appearances of humans and such images have been removed from the dataset.

Since footage from a construction site can be considered as sensitive information a Non-disclosure agreement (NDA) has been signed to get access to the data. The NDA points out that the data cannot be shared with others. The data has been stored locally and in the cloud in a locked private folder. The images are to be deleted at the end of the thesis work. All images used in the thesis report are placeholder images not part of the provided data.

6. Implementation and experiment

6.1. Creating the dataset

6.1.1 Data augmentation

Initially, the dataset contained 10 images without snow and 69 images with snow at a resolution of 1920X1080X3. However, this was not sufficient to train any network. AlexNet uses images at a resolution of 227X227X3 as input, whilst VGG16 uses images at a resolution of 224X224X3, which meant that two different datasets had to be created. To perform data augmentation a script was written in Python which selected a random point in the image and extracted a 227X227X3 or 224X224X3 portion from the original image. This was done 5 times for each image. Similar to [22], every extracted image was manipulated to further increase the amount of data in the dataset. The extracted images were rotated 90, 180 and 270 degrees and flipped horizontally. This resulted in eight images for each extracted image. In total, 40 images were thus generated from each original 1920X1080X3 image. A sketch of this augmentation step follows the figure below.

Figure 8: An example of how smaller images are extracted from a larger image.
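A sketch of the augmentation script described above, using Pillow (the original script was not published, so file handling and function names are illustrative):

```python
import random
from PIL import Image

def augment(path, crop_size=227, crops_per_image=5):
    # Extracts random crops, then rotates and flips each crop:
    # 5 crops x 4 rotations x 2 (flipped or not) = 40 images
    original = Image.open(path)
    width, height = original.size
    results = []
    for _ in range(crops_per_image):
        x = random.randint(0, width - crop_size)
        y = random.randint(0, height - crop_size)
        crop = original.crop((x, y, x + crop_size, y + crop_size))
        for angle in (0, 90, 180, 270):
            rotated = crop.rotate(angle)
            results.append(rotated)
            results.append(rotated.transpose(Image.FLIP_LEFT_RIGHT))
    return results
```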

6.1.2 Extracting images from video

The dataset contained two five-minute videos of aerial drone footage over a construction site, one containing snow/rain conditions and the other containing no snow. Another Python script was used for extracting images from the videos. To counteract images being too similar, the script saved one image for every 30 frames in the video. The result was 294 images containing snow/rain conditions and 297 not containing snow, due to the different lengths of the videos. Every extracted image was run through the Python script described in Section 6.1.1 to resize and extend the dataset. Images containing snow/rain conditions had to be manually labeled to filter out images that did not fit the classification. The manual labeling had to be done for both of the datasets. When combining the data generated from the videos with the data generated from the images, the final datasets contained 5763 images in the 227X227X3 dataset and 5568 images in the 224X224X3 dataset.
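A sketch of the frame-extraction step using OpenCV (function and path names are illustrative):

```python
import cv2

def extract_frames(video_path, out_dir, every_n=30):
    # Saves every 30th frame (one per second at 30 fps)
    # to counteract consecutive images being too similar
    capture = cv2.VideoCapture(video_path)
    count = saved = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if count % every_n == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.png", frame)
            saved += 1
        count += 1
    capture.release()
    return saved
```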

6.1.3 Implementing k-fold cross-validation

The commonly used methods for k-fold cross-validation require loading the entire dataset into memory, which was not possible on the GPUs. Instead an object from Keras called ImageDataGenerator was used, which allows images to be loaded into memory from a source directory in batches during training. In order to accommodate the file structure for the k-fold method, the data had to be divided into 6 different folders. A Python script was used to separate the dataset, resulting in 1 folder for testing and 5 folders for the 5 folds used during training. This made it possible to use one instance of ImageDataGenerator per folder, which could load that data into memory for training when needed.
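A sketch of loading one fold folder with Keras' ImageDataGenerator (the folder layout is hypothetical):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

# One generator per fold folder; images are loaded from disk
# in batches during training instead of all at once
fold_generator = datagen.flow_from_directory(
    "data/fold_1",           # hypothetical folder structure
    target_size=(224, 224),
    batch_size=32,
    class_mode="binary")
```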

6.2. VGG16 implementation

The VGG16 model was implemented using the Python library Keras and TensorFlow. The network was trained on the 224X224X3 dataset. During training, k-fold cross-validation was used. The dataset was split into 6 folds, where one fold was set aside for testing after the training of the model. The model was trained for 50 epochs (10 epochs per validation fold) for both binary and multi-class classification. Due to the size of the network and the large number of trainable weights, training was very slow. The number of trainable weights was 134,268,738 for binary classification and 134,272,835 for multi-class classification. Running on an NVIDIA GTX 1070 with 8 GB of memory, one epoch took approximately 59 seconds for binary classification and 122 seconds for three classes.

The activation function used in VGG16 is ReLU for all layers except the output layer, where a softmax or sigmoid function is used. The loss function used is binary crossentropy for binary classification and categorical crossentropy for multi-class classification. The model uses an initial learning rate of 0.01 with exponential decay, which means that the learning rate decreases during training.
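In Keras an exponentially decaying learning rate can be expressed with a schedule such as the following (a sketch; the exact decay settings used in the thesis are not stated, so the step and rate values are assumptions):

```python
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers.schedules import ExponentialDecay

# Initial learning rate of 0.01 that decays exponentially during training
schedule = ExponentialDecay(initial_learning_rate=0.01,
                            decay_steps=1000,  # assumed value
                            decay_rate=0.96)   # assumed value
optimizer = SGD(learning_rate=schedule)
```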

6.3. AlexNet implementation

The implementation of AlexNet was done with the Python library Keras and TensorFlow. The k-fold cross-validation method was used during training for AlexNet as well, splitting the dataset into 6 folds. The main difference from the VGG16 implementation was that AlexNet used the 227X227X3 dataset. This was done for both binary and multi-class classification. The size and number of weights of the network are significantly smaller than for the VGG16 model, making training faster. The number of trainable weights was 28,846,282 for binary classification and 28,875,283 for multi-class classification. One epoch of training took approximately 14 seconds for binary classification and 23 seconds for three classes.

The activation function used in the model was ReLU for every layer, except for the output layer where a softmax or sigmoid is used instead. The activation used in binary classification was sigmoid. Like VGG16, AlexNet uses categorical crossentropy for multi-class and binary crossentropy for binary classification, and it uses an optimization algorithm called Adam. The optimizer is an extension of stochastic gradient descent and maintains a learning rate for each weight in the network [30].

6.4. Creating the proposed architecture

One goal of the thesis is to evaluate if the tested models could be further improved upon by creating a new proposed CNN model outperforming them in regards to accuracy, precision, recall and F1 score. The AlexNet and VGG16 architectures are great models for classifying multiple classes. AlexNet classified 1.2 million images in 1000 different classes at a top-5 error rate of 17.0% in the 2012-ILSVRC [8]. VGG16 achieved a top-5 error rate of 7.3% in the same competition two years later [10]. These models were chosen for comparison since they are proven to be efficient. Creating an architecture outperforming them would be possible by specializing a CNN architecture to fit the problem.

The proposed model uses the same data set used in the implementation of the VGG16 network, meaning a dimension of 224X224X3. The ideal input size for the proposed model would be 221X221X3 since there is a slight data loss from the first convolutional layer with the 224X224X3 dataset. However, it can be argued that it instead could provide results harder to interpret, since different datasets would be used for all of the models. The aim was to create a smaller network, less computationally heavy, that could outperform both networks. The AlexNet and VGG16 models are great at detecting small features in the images, which is why a slightly different approach was taken that down sampled the data faster, creating more general feature maps than the other models. The result was a network consisting of three convolutional layers with associated max-pool layers. The output from the third max-pool layer was passed to a flattening layer that is connected with a fully connected layer. This layer was connected to the output layer.

Figure 9: The proposed architecture [9]

The first convolutional layer consists of 96 filters of size 9X9 with a stride of 4. A 9X9 filter can be considered a large filter, which is good at extracting larger features of an image. This is suitable for the proposed architecture since the desired features cover most parts of the images. The large filter size has shown great results in the AlexNet model, which is why the same approach was taken. The output from the first convolutional layer is a 54X54X96 feature vector. The feature vector is passed to the first max-pool layer, which has a size of 3X3 with stride 2. This is known as overlapping max-pooling and has shown great results in the AlexNet architecture [8]. The output from the first max-pool layer is a 26X26X96 feature vector. Since the feature maps are now more abstract and at a reduced dimension, a smaller filter size was chosen for the second convolutional layer. For this layer a filter size of 3X3 with stride 2 was used, which is the same filter size and stride used in VGG16. The layer consists of 256 filters. The output from the second convolutional layer is a 12X12X256 feature vector. The max-pool layer associated with the second convolutional layer has a pool size of 2X2 with a stride of 2. The output from the second max-pool layer is a 6X6X256 feature vector.

The output from the second max-pool layer is fed to the third and final convolutional layer in the network. The layer has the same properties as the second convolutional layer, meaning a filter amount of 256, filter size of 3X3 and stride of 2. The output from the convolutional layer is a 2X2X256 feature vector which is fed to the third and final max-pool layer of the network. The max-pool has a filter size of 2X2. The output from the max-pool layer is a 1X1X256 feature vector.

The output from the third max-pool layer is fed to a flattening layer, where the feature maps are converted to a one-dimensional vector of size 256. The flattening layer is connected to a dense layer with 256 nodes that is connected to the output layer. Before the data is passed to the output layer, dropout is used to counteract overfitting of the network [31]. This method is also used in the AlexNet and VGG16 models. The output layer can have different shapes depending on how many classes are classified; in this case one output node is used for binary classification and three for multi-class classification. The proposed model used binary crossentropy for binary classification and categorical crossentropy for multi-class classification as loss functions. The model used the Adam optimizer, like the AlexNet architecture.

The result is a three-layer network, not counting max-pool layers, that has 901,507 trainable weights when classifying three classes and 900,993 trainable weights for binary classification.
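The description above translates into the following Keras sketch. The layer dimensions and the resulting parameter count (900,993 trainable weights with one output node) match the numbers stated in this section, while details such as the hidden-layer activation and the dropout rate are assumptions:

```python
from tensorflow.keras import layers, models

def build_proposed_model(num_outputs=1):
    # Three convolutional layers with associated max-pool layers,
    # followed by flattening, a 256-node dense layer, dropout and output
    model = models.Sequential([
        layers.Conv2D(96, 9, strides=4, activation="relu",
                      input_shape=(224, 224, 3)),            # -> 54x54x96
        layers.MaxPooling2D(pool_size=3, strides=2),         # -> 26x26x96
        layers.Conv2D(256, 3, strides=2, activation="relu"), # -> 12x12x256
        layers.MaxPooling2D(pool_size=2, strides=2),         # -> 6x6x256
        layers.Conv2D(256, 3, strides=2, activation="relu"), # -> 2x2x256
        layers.MaxPooling2D(pool_size=2),                    # -> 1x1x256
        layers.Flatten(),                                    # -> 256
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),  # dropout rate is an assumption
        layers.Dense(num_outputs,
                     activation="sigmoid" if num_outputs == 1 else "softmax"),
    ])
    loss = ("binary_crossentropy" if num_outputs == 1
            else "categorical_crossentropy")
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model
```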

6.5. Machine learning algorithms on CNN extracted feature maps

ML algorithms are very good classifiers, but are sensitive to complex data. Therefore, the CNN models were used to extract features that could be used for classification in the ML algorithms. The features were extracted from the flattening layer after the last pooling layer in each model. From the proposed model a 256X1 feature vector was extracted, from AlexNet a 1024X1 feature vector and from VGG16 a 25088X1 feature vector. These vectors were extracted for each image in the dataset.

The extracted feature maps were then used in classification using SVM, LR and KNN. Binary classification and multi-class classification were performed with all the ML algorithms. The algorithms were validated using 5-fold cross-validation and the results from each validation fold were averaged together.

The ML algorithms were used with the default parameters from Scikit-learn [28], with the exception that SVM used the linear kernel. The KNN algorithm used the 3 nearest neighbors for classification.
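A sketch of the extraction-plus-classification pipeline: the CNN is cut at its flattening layer and the resulting 256-dimensional vectors are classified with scikit-learn (placeholder data; `build_proposed_model` refers to the sketch in Section 6.4):

```python
import numpy as np
from tensorflow.keras import models
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

cnn = build_proposed_model()  # untrained, as in this thesis
# Cut the network at the flattening layer (layer index 6 in the sketch)
extractor = models.Model(inputs=cnn.input,
                         outputs=cnn.layers[6].output)

images = np.random.rand(40, 224, 224, 3).astype("float32")  # placeholder data
labels = np.random.randint(0, 2, size=40)                   # placeholder labels

features = extractor.predict(images)  # shape (40, 256)

# 5-fold cross-validation of a linear SVM on the extracted features
scores = cross_val_score(SVC(kernel="linear"), features, labels, cv=5)
print(scores.mean())
```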

6.6. Testing on a Raspberry Pi

A trained version of the proposed model was used on a Raspberry Pi 4 with 4 GB of memory. The model was first trained and saved on a personal computer. The saved model was then transferred to the Raspberry Pi, where a test was done using 8 images for each class. This was done to test whether inference could be run on a less powerful machine.
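A sketch of the save/load round trip (the file name is hypothetical; `build_proposed_model` refers to the sketch in Section 6.4 and stands in for the actual trained model):

```python
import numpy as np
from tensorflow.keras import models

# On the training machine: train (omitted here) and save the model
model = build_proposed_model()
model.save("proposed_model.h5")

# On the Raspberry Pi: load the saved model and classify test images
loaded = models.load_model("proposed_model.h5")
test_images = np.random.rand(8, 224, 224, 3).astype("float32")  # placeholder
print(loaded.predict(test_images))
```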

7. Results

7.1. Binary Classification

The binary classification classified images containing snow and not snow. Binary classification was done using CNN models and ML algorithms on CNN extracted feature maps.

7.1.1 Convolutional Neural Networks

The three models were all trained for 50 epochs to get an even comparison. All of the metrics are from the performance on the test data following a k-fold cross-validation training process. 5 folds were used and the networks trained for 10 epochs per fold. All models had a batch size of 32. The proposed architecture had an accuracy of 97.3%, AlexNet an accuracy of 90.2% and VGG16 an accuracy of 87.2%. The proposed model also had the highest precision at 97.3%, compared to AlexNet at 94.7% and VGG16 at 82.3%. The recall showed a similar pattern, with the proposed model at 95.0%, AlexNet at 90.8% and VGG16 at 84.8%.


With precision and recall, the F1 score of the models can be calculated. The proposed model got an F1 score of 0.96, AlexNet a score of 0.93 and VGG16 the lowest at 0.84. These results show that the deep learning models could be further improved to get better performance at the task at hand.

Model     Proposed  AlexNet  VGG16
F1 Score  0.96      0.93     0.84

Table 1: F1-score in binary classification.

7.1.2 Confusion matrices

The following are confusion matrices for the different models on the test data. The amounts have been averaged and the percentages are displayed. The percentage represents the accuracy of each class.


7.1.3 Machine learning algorithms

The feature maps from the proposed model achieved the best accuracy with SVM at 90%, followed by AlexNet at 87.67% and VGG16 at 72.73%. Using the same feature maps, the proposed model got the best result using LR at 87.66%, followed by AlexNet at 85.78% and VGG16 at 72.73%. AlexNet got the best result using KNN at 86.35%, followed by the proposed model at 86.0% and VGG16 at 73.75%.

Figure 14: Results of binary classification using ML algorithms

7.2. Multi-Class Classification

The multi-class classification classified images containing snow, no snow and rain. Multi-class classification was done by using CNN models and ML algorithms on CNN extracted feature maps.

7.2.1 Convolutional Neural Networks

The three models were all trained for 50 epochs to get an even comparison. All of the metrics are from the performance on the test data following a k-fold cross-validation training process. 5 folds were used and the networks trained for 10 epochs per fold. All models had a batch size of 32. In the multi-class classification the proposed model got an accuracy of 96.45%, VGG16 achieved 94.70% and AlexNet 93.10%. The proposed model also had the best precision at 96.41%, with VGG16 achieving a precision of 95.13% followed by AlexNet at 93.21%. The recall had the same order, with the proposed model achieving 96.36%, VGG16 94.66% and AlexNet 93.05%.


Figure 15: Results of multi-class classification using CNN models

The proposed model achieved the highest F1 score of 0.96 followed by VGG16 at 0.95 and AlexNet at 0.93.

Model     Proposed  AlexNet  VGG16
F1 Score  0.96      0.93     0.95

Table 2: F1-score in multi-class classification.

7.2.2 Confusion matrices

The following are confusion matrices for the different models on the test data. The amounts have been averaged and the percentages are displayed. The percentage represents the accuracy of each class.


7.2.3 Machine learning algorithms

The feature maps from the proposed model achieved the best accuracy with SVM at 88.01%, followed by AlexNet at 83.15% and VGG16 at 72.73%. Using the same feature maps, the proposed model got the best result using LR at 85.58%, followed by AlexNet at 81.25% and VGG16 at 69.54%. Unlike in binary classification, the proposed model also outperformed AlexNet using KNN at 86.44%, where AlexNet got 84.37%, followed by VGG16 at 74.78%.

8. Discussion

G. Pan, L. Fu et al. [20] classified road surface conditions and achieved a top accuracy of 90.7% on binary classification. The classes they used were snow and not snow. On binary classification, a top accuracy of 97.3% was achieved in this experiment, which is better than their result, as shown in Figure 10. In their multi-class classification they classified bare surface, partly snow covered and fully snow covered; their results, in the same order, were 94.3%, 83.8% and 56.3%. The proposed model achieved 98.55% on snow and 95.52% on not snow, as shown in Figure 16. The results of their model had a few similarities with the results from the AlexNet implementation: multi-class AlexNet was worse at classifying snow compared to the other models tested in this project.

Better and more accurate results could be achieved with a more diverse dataset. The dataset originated from either one or two different sources per class. This led to little diversity in the images, which could have serious consequences when new data is introduced at classification time. Due to the fact that parts of the snow and rain sets originated from the same source, there were some anomalies in the manual classification: some images were most likely misclassified. Even if the deep learning model were good at classifying such an image, the prediction would be considered incorrect, since the image has the wrong label. The consequence of this is minor changes in the performance of the models. Since the proposed model and VGG16 use the same dataset, the impact should be approximately the same.

It is worth noting that the entirety of the rain dataset is constructed from the same source video. This means that contributing factors to its good accuracy score could be other factors, such as the time of day the video was recorded or other weather conditions such as a cloudy sky. As previously mentioned, a diverse dataset is paramount in classification problems.

The confusion matrices of the 3-class classification show that AlexNet never classifies a snow image as something other than snow. This is probably due to a bias in the training: the weights have been over-tuned and AlexNet is more likely to classify an image as containing snow than any other class. Only 73.91% of the images it classifies as snow are actually images containing snow, and 20% of them are actually images containing no snow. In comparison, the proposed architecture has a more balanced distribution of misclassifications. It should be noted that the dataset could be a contributing factor, since AlexNet uses a different dataset. However, VGG16 uses the same dataset as the proposed model and achieves a higher accuracy classifying snow, but falls short considering not snow and rain. This could be due to the fact that the proposed architecture favors generalizing the data while VGG16 takes into account more detailed features. There are some images in the dataset that are partially obscured by an object or contain some other details which VGG16 possibly still classifies as snow. The cost of this could be that the weights dictating snow are sensitive to other features (perhaps light) also existing in the other classes, which leads to a worse classification of not snow. The same applies for the proposed architecture, where the features are generalized and get a better result overall. This is strengthened by the confusion matrix for binary classification, seen in Figure 11, where the proposed architecture outperforms VGG16 in regards to snow accuracy as well.

The benefits of using ML algorithms as classifiers on CNN extracted feature maps are very dependent on the size of the features. The features extracted from the VGG16 model had the dimension of 25088X1, which is almost 100 times bigger than the features from the proposed model at a dimension of 256X1. For SVM, the consequence of this is a very time consuming training process without any benefits in the results. The training of the SVM using feature maps from the VGG16 model took 5888 seconds, while the same algorithm using feature maps from the proposed model took 59 seconds. Feature maps from the proposed model achieved an accuracy of 90.0% whilst VGG16 only managed to achieve 72.6%. This is a direct consequence of the size of the feature maps, since SVM is in general worse at handling complex features than deep learning models. KNN and LR are also affected by this, although KNN is still faster than SVM. The training time of SVM and LR is affected as well, but when running on the smaller feature maps generated from either the proposed model or AlexNet they are still significantly faster.

The ML algorithms could probably perform better if they used the same method as Källén et al. [24] and Z. Liang et al. [25], who used a pre-trained model when generating the feature maps used in the SVM. SVMs are in general worse at handling complex features than a CNN [23]. By using a pre-trained model, the network would possibly make the features less complex than its untrained counterpart. However, if a pre-trained model had been trained on the created dataset, this would mean that the SVM would train on the same data as the CNNs. This could lead to biased results.

9. Limitations

The data originating from few sources is a limitation. Retrieving data from more sources would most likely result in a better trained CNN model producing less biased results. Rain conditions and snow can appear very different depending on whether the sky is clear or cloudy. It would therefore be beneficial to process data from sources where these conditions are apparent.

There was no efficient way of implementing the k-fold cross-validation in Keras. The data had to be manually loaded into different folders for each fold. There was no good way of changing folds between iterations, which meant that the model had to be saved and loaded in between iterations.

10. Conclusions

The study evaluated whether deep learning could be a supportive tool for the project manager in the daily operation of a construction site. Using the correct models, a deep learning solution was found to be a powerful tool in terms of classification. Even though misclassification occurs, CNNs are great and efficient tools for classifying the images.

Q.1 How can a CNN be constructed to classify snow and rain in remote drone imagery?

The initially implemented architectures AlexNet and VGG16 both performed well on the supplied data. In binary classification AlexNet achieved an accuracy of 90.2% and VGG16 an accuracy of 87.2%. In multi-class classification VGG16 achieved an accuracy of 94.70% and AlexNet 93.10%. These results are sufficient to declare that a CNN can be used in the daily operation, considering it will not be used in a safety-critical system.

Q.2 How can a CNN model be designed that outperforms previously tested models?

The proposed architecture was designed and implemented with the task in mind. The target was to outperform the previously tested models in regards to accuracy, precision, recall and F1 score. The model achieved these goals by creating more generalized feature maps of the entire image instead of finding smaller details. This proved to be more effective on the dataset used in this thesis. Many different model versions were experimented with, but the model that performed the best was the proposed model in this report. The model consists of three convolutional layers with associated pooling layers. The model has just over 900,000 trainable weights, which is significantly smaller than AlexNet's 29,000,000 weights and VGG16's 134,000,000 weights. This made the proposed model a lot faster to train and test, which can be very beneficial if deployed on less powerful machines.

In regards to accuracy, the proposed model outperformed both the AlexNet and VGG16 models. In binary classification the model achieved an accuracy of 97.3% on the test data, compared to AlexNet at 90.2% and VGG16 at 87.2%. The results were closer in the multi-class classification, where the proposed model achieved an accuracy of 96.45%, followed by VGG16 at 94.70% and AlexNet at 93.10%.

Q.3 Can Machine learning algorithms effectively be used as classifiers on CNN extracted feature maps?

ML algorithms can be used as classifiers if provided with small feature maps. The ML algorithms performed significantly better when using smaller feature maps, regarding both accuracy and time. The smaller size of the feature maps indicates more summarized features that are more condensed and easier for the ML algorithms to classify. The feature maps extracted from the proposed model and the AlexNet architecture achieved over 85% with all the ML algorithms, with SVM using feature maps from the proposed model achieving the highest accuracy of 90%. This can be considered an effective classifier, given that the AlexNet CNN model reached a similar accuracy of 90.2% with 50 epochs of training. The large feature maps extracted from VGG16 could not perform as well, with both training time and accuracy being affected. The ML algorithms achieved an accuracy of less than 75% on all the tests using large feature maps from VGG16.


11. Future work

The work in this thesis took a rather high-level approach to the CNN models and ML algorithms, so the study could be improved by investigating these topics in greater depth and examining how their combination could be refined. For example, only the linear kernel was tested for the SVM; other kernels, such as the Radial Basis Function (RBF), might yield better results than the linear kernel used (see the sketch after this paragraph). Converting the input data to gray-scale instead of the RGB values used in this thesis could also improve results when the data is intended for the ML algorithms. It would further be beneficial to create a tool that extracts the features from the CNN and feeds them directly to the ML algorithms, instead of exporting them to a file first.

It would also be interesting to measure the performance difference when using a pre-trained CNN model for feature extraction; in this study, the un-trained models were used to extract the feature maps. Such a CNN model should be trained on similar data corresponding to the same classes that are to be classified, meaning that not just any pre-trained model can be used. Finally, a model classifying more than three classes could be created, for example by adding classes for different amounts of snow, such as "partially snow covered" and "fully snow covered". This would give each class a more precise definition than in this study.
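As a pointer for that kernel experiment, switching kernels is a one-argument change in scikit-learn; gamma is the additional RBF hyperparameter that would need tuning ('scale' is the library default).

```python
from sklearn.svm import SVC

linear_svm = SVC(kernel='linear')           # kernel used in this thesis
rbf_svm = SVC(kernel='rbf', gamma='scale')  # alternative suggested above
```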


References

[1] S. Russell and P. Norvig, “Artificial intelligence: a modern approach,” 2002.

[2] Wikimedia Commons, "Diagram of an artificial neuron," 2005, [Online; accessed 15-April-2020]. [Online]. Available: https://commons.wikimedia.org/wiki/File:ArtificialNeuronModel_english.png

[3] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186.

[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[5] J. Schmidhuber, "Deep learning in neural networks: an overview," Neural Networks, vol. 61, pp. 85–117, 2015.

[6] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, "A survey of the recent architectures of deep convolutional neural networks," arXiv preprint arXiv:1901.06032, 2019.

[7] J. Nagi, F. Ducatelle, G. A. Di Caro, D. Cire¸san, U. Meier, A. Giusti, F. Nagi, J. Schmidhu-ber, and L. M. Gambardella, “Max-pooling convolutional neural networks for vision-based hand gesture recognition,” in 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), 2011, pp. 342–347.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," 2012.

[9] Y. Uchida, "convnet-drawer," https://github.com/yu4u/convnet-drawer, 2019.

[10] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[11] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.

[12] M. Hofmann, “Support vector machines-kernels and the kernel trick,” Notes, vol. 26, no. 3, 2006.

[13] K. L. Sainani, "Logistic regression," PM&R, vol. 6, no. 12, pp. 1157–1162, 2014.

[14] L. E. Peterson, "K-nearest neighbor," Scholarpedia, vol. 4, no. 2, p. 1883, 2009.

[15] Y. Zhan, J. Wang, J. Shi, G. Cheng, L. Yao, and W. Sun, "Distinguishing cloud and snow in satellite images via deep convolutional network," IEEE Geoscience and Remote Sensing Letters, vol. PP, pp. 1–5, 2017.

[16] N. Kussul, M. Lavreniuk, S. Skakun, and A. Shelestov, “Deep learning classification of land cover and crop types using remote sensing data,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5, pp. 778–782, 2017.

[17] M. R. Ibrahim, J. Haworth, and T. Cheng, “Weathernet: Recognising weather and vi-sual conditions from street-level images using deep residual learning,” ISPRS International Journal of Geo-Information, vol. 8, no. 12, p. 549, 2019.


[18] D. Varshney, P. Gupta, C. Persello, and B. Nikam, "Convolutional neural networks to detect clouds and snow in optical images," Ph.D. dissertation, 2019.

[19] C. Liang, J. Ge, W. Zhang, K. Gui, F. A. Cheikh, and L. Ye, “Winter road surface status recognition using deep semantic segmentation network.”

[20] G. Pan, L. Fu, R. Yu, and M. I. Muresan, “Winter road surface condition recognition using a pre-trained deep convolutional neural network,” Tech. Rep., 2018.

[21] N. Koyama, S. Yokoyama, T. Yamashita, H. Kawamura, K. Takeda, and M. Yokogawa, “Recognition of snow condition using a convolutional neural network and control of road-heating systems,” pp. 122–126, 2017.

[22] A. Howard, "Some improvements on deep convolutional neural network based image classification," arXiv preprint arXiv:1312.5402, 2013.

[23] F. J. Huang and Y. LeCun, "Large-scale learning with svm and convolutional for generic object categorization," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 1, 2006, pp. 284–291.

[24] H. Källén, J. Molin, A. Heyden, C. Lundström, and K. Åström, "Towards grading gleason score using generically trained deep convolutional neural networks," in 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), 2016, pp. 1163–1167.

[25] Z. Liang, A. Powell, I. Ersoy, M. Poostchi, K. Silamut, K. Palaniappan, P. Guo, M. A. Hossain, A. Sameer, R. J. Maude et al., “Cnn-based image analysis for malaria diagnosis,” in 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2016, pp. 493–496.

[26] F. Chollet et al., "Keras," https://github.com/fchollet/keras, 2015.

[27] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[29] A. G. Howard, “Some improvements on deep convolutional neural network based image classification,” arXiv preprint arXiv:1312.5402, 2013.

[30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[31] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.


Appendix A

Non-disclosure Agreement

