
Risk Assessment based Data Augmentation for Robust Image Classification using Convolutional Neural Network

Harisubramanyabalaji Subramani Palanisamy
November 4, 2018

Student, Spring 2018
Master Thesis, 30 ECTS
Master of Science in Applied Physics and Electronics (Specialization in Robotics and Control), 120 ECTS


Risk Assessment based Data Augmentation for Robust Image Classification

Harisubramanyabalaji Subramani Palanisamy
November 4, 2018

Master's Thesis in Robotics and Control

Academic Supervisor - Umeå: Shafiq ur Réhman
Industrial Supervisor - Scania CV AB: Mattias Nyberg & Joakim Gustavsson
Examiner: John Berge

Umeå University
Department of Applied Physics and Electronics
SE-901 87 UMEÅ
SWEDEN


Abstract

Autonomous driving is increasingly popular among the public and the automotive industry, establishing its presence in both passenger and goods transportation. Safe autonomous navigation becomes very challenging if the sensing system fails. Among several sensing systems, image classification plays a major role in understanding road signs and regulating vehicle control according to urban road rules. Hence, a classifier algorithm that is robust to camera position, view angle, environmental condition, and the size and type of the autonomous platform (car, bus, truck, etc.) is of prime importance.

In this study, a Convolutional Neural Network (CNN) based classifier algorithm has been implemented to ensure improved robustness in recognizing traffic signs. As training data play a crucial role in supervised learning algorithms, there is a requirement for an effective dataset that can handle dynamic environmental conditions and other variations caused by vehicle motion (referred to as challenges). Since the collected training data might not contain all the dynamic variations, model weaknesses can be identified by exposing the model, as an initial testing sequence, to the variations (blur, darkness, shadow, etc.) faced by vehicles in real time. To overcome the weaknesses caused by the training data itself, an effective augmentation technique is proposed that enriches the training data in order to increase the model's capacity to withstand the variations prevalent in urban environments.

As a major contribution, a framework has been developed to identify model weaknesses and successively introduce a targeted augmentation methodology for classification improvement.

Targeted augmentation is based on the estimated weakness for each challenge type and difficulty level; only the challenges necessary for better classification are augmented further. Predictive Augmentation (PA) and Predictive Multiple Augmentation (PMA) are the two proposed methods to adapt the model to the targeted challenges while delivering high confidence values. We validated our framework on two different training datasets (the German Traffic Sign Recognition Benchmark (GTSRB) and heavy vehicle data collected from a bus) and on 5 generated test groups containing varying levels of challenge (simple to extreme). The results show an impressive improvement of ≈ 5-20% in overall classification accuracy while maintaining high confidence.

Keywords: Safety-Risk assessment, predictive augmentation, convolutional neural net- work, traffic sign classification, autonomous driving, heavy vehicles, orientation challenges, classification degradation.


Acknowledgement

I am extremely privileged to thank everyone who helped me stay motivated and supported me in successfully completing my master's studies.

First and foremost, I would like to express my sincere gratitude to my academic advisor, Associate Prof. Shafiq ur Réhman, and my industrial advisors, Prof. Mattias Nyberg & Joakim Gustavsson, for their continuous support of my thesis research and for their patience. I am thankful for their inspiration and encouragement in the writing of this thesis.

It is my pleasure to thank my manager Marcus of the EPXS department at Scania, and my colleagues Xinyi, Nazre, Norén, Mikael, Agneev, Sandipan, Yash, Dave, Maz, and Christian for their support.

Besides my advisors, I would like to thank Sikandar Lal Khan, who helped me strengthen my skills in novel computer vision algorithms and provided me with the moral support to follow my dream.

I am truly indebted and thankful to my fellow friends Sribalaji and Abhishek for the many technical discussions that brought novel solutions to existing problems, and for supporting me through the hard times of my master's studies.

Many thanks to my family: my parents (Palanisamy and Padmavathi) and my sister (Prabha Karthikeyan), for supporting me, understanding my decisions, and being there all along on this wonderful journey. Without them I would not have reached this position.

Last but not least, I would like to thank the friends who made my stay in Umeå wonderful: Shreyas, Aravind, Alieu, Ashokan, Rahul, Vishnupradeep, Mahmoud Sukkar, Li Songyu, Xiaowei, Adam, Ariane, Midhu, Mi, Vishnu, Sushmitha, Senthil, and Kavya.

Thank you, everyone.

Harisubramanyabalaji Subramani Palanisamy
Umeå University, Sweden, 2018


Preface

Parts of the work presented in this thesis have been published in peer-reviewed conference proceedings.

1. Harisubramanyabalaji Subramani Palanisamy, Shafiq ur Réhman, Mattias Nyberg, and Joakim Gustavsson, "Improving Image Classification Robustness Using Predictive Data Augmentation", 1st SAFECOMP Workshop on Artificial Intelligence Safety Engineering, WAISE 2018, September 18-21, Västerås, Sweden.


Contents

Abstract
Acknowledgement
Preface

1 Introduction
  1.1 Motivation
  1.2 Contribution and Scope
  1.3 Related Work
  1.4 Thesis Overview

2 Background & Theory
  2.1 Artificial Neural Networks (ANN)
    2.1.1 Layers
  2.2 Convolutional Neural Network (CNN)
    2.2.1 Convolution Layer
    2.2.2 Pooling Layer
    2.2.3 Fully Connected Layer
    2.2.4 Loss Function
    2.2.5 Optimization
  2.3 Image Manipulation

3 Assessment and Augmentation Method
  3.1 Safety-Risk Assessment and Augmentation Framework
    3.1.1 Safe-Risk Assessment
    3.1.2 Predictive Augmentation

4 Experiments
  4.1 Dataset Selection
  4.2 CNN Architecture Selection
  4.3 Challenge Generation
  4.4 Experiment 1: Robustness Verification
    4.4.1 Results
  4.5 Experiment 2: Model Capacity Identification
    4.5.1 Results
  4.6 Experiment 3: Predictive Augmentation
    4.6.1 Results

5 Discussion

6 Conclusions

References

A Experiment 1 Results


List of Figures

1.1 Experimental camera locations and respective fields of view to cover the scene around the vehicle. Based on vehicle size and type, camera locations are altered and calibrated.
1.2 Images captured from different locations on a heavy vehicle lead to orientation changes in captured objects, taken from [26]. Left: Front View, Right: Top View
1.3 Prototype autonomous heavy vehicle with six camera locations and their respective images showing variation in the orientation of traffic signs. Column 1: targeted left-side environment (Top: forward left, Bottom: rear left), Column 2: targeted middle environment (Top: forward front, Bottom: rear front), Column 3: targeted right-side environment (Top: forward right, Bottom: rear right)
1.4 Natural environmental variations captured by the front-facing camera placed on a heavy vehicle; the classifier clearly misses the traffic sign of interest
1.5 Natural variations existing in the collected GTSRB dataset, taken from [27]
1.6 State-of-the-art CNN architectures for image classification
2.1 Simple perceptron representation with multiple inputs and a single output (solid boxes represent constants). X1...Xn are inputs and W1...Wn are learning weights (hollow boxes represent variables)
2.2 Single-hidden-layer Artificial Neural Network with a 1-D feature vector as input sample and a four-class output
2.3 Multiple-hidden-layer Deep Neural Network (DNN) [58]
2.4 Activation functions for neural networks [61]
2.5 Schema of weight flow in the gradient descent algorithm, targeted to bring the initialized weight to the global minimum [64]
2.6 Rolling variation of an image can be generated in both clockwise and anti-clockwise directions with defined angles
2.7 Pitch variation of an image can be generated in order to compensate the top and bottom views of an object
2.8 Yaw variation of an image can be generated in order to compensate the views from the left and right sides of an object
2.9 Occlusion variation of an image can be generated in order to compensate changes occurring in any part of an object
2.10 Distance variation of an image can be generated from short to long ranges
2.11 Shadow variation can be generated in both the horizontal and vertical directions of an object
2.12 Blending of two input images in order to create overlapping challenges on one image
3.1 Schematic flow chart representing the Safety-Risk assessment and augmentation framework for traffic sign classification, with training (TR), validation (VA), and testing (TE) as the input data. Identified weaknesses in terms of challenge type and difficulty level (risk ranges) are used by the predictive augmentation platform to improve the training and validation samples
3.2 Systematic selection of safe and risk ranges based on a defined threshold recognition accuracy for each challenge type
4.1 Sample images with natural variations from the datasets considered in this study. Left: German Traffic Sign Recognition Benchmark, Right: collected heavy vehicle data with four class types
4.2 CNN architecture: Input → CV1 → PL1 → CV2 → PL2 → FC1 → FC2 → FC3 (Output). Convolution layer (CV), pooling layer (PL), fully connected layer (FC); the feature reduction corresponding to each layer in terms of image dimension is embedded inside the blocks
4.3 Systematic selection of hyper-parameters, with Lr, Fd, D, B, Fk being learning rate, filter depth, nodes in hidden layers, batch size, and filter kernel size respectively
4.4 Filter kernel used to generate "Snow" flakes on images, with few to many flakes per image
4.5 Filter kernel used to generate "Rain" strokes on images, with horizontal and angled presence per image
4.6 Filter kernel used to generate "Fog" on images, with low- to high-density cloud per image
4.7 Filter kernel used to generate "Dust" on images, with low to high density of dust particles per image
4.8 Filter kernel used to generate "Smoke" on images, with low- to high-density fumes per image
4.9 Filter kernel used to generate "Water Flow" on images, with low to high variation per image
4.10 Filter kernel used to generate "Bubble" on images, with small to large bubble sizes per image
4.11 Samples generated for all challenges at corner challenging levels. Row 1 represents the low level, Row 2 the medium level, and Row 3 the high level. Each column represents a noise type and can be identified from Table 4.2
4.12 Generalized robustness verification framework for traffic sign classification
4.13 Training and validation accuracy for the German Traffic Sign Recognition Benchmark with only the Fk hyper-parameter group
4.14 Testing accuracy based on the two training groups along with the 5 best parameter sets for each challenge type. The solid lines (With Manipulation) clearly outperform the dotted lines (Without Manipulation) at extreme difficulty levels
4.15 Threshold-accuracy-based safe-risk range distribution. 20 challenge types were grouped into camera orientation, distance, occlusion, blur, illumination, and weather variations. The selected model trainings with hyper-parameters are Without Manipulation (WM) and the Filter Kernel (Fk) group with value 7 3
4.16 Comparative trained model performance on the Occlusion challenge type with different data augmentations: Without Augmentation (WoA), Random Augmentation (RA), Predictive Augmentation (PA), and Predictive Multiple Augmentation (PMA)
4.17 Comparative overall accuracy based on augmentation in the testing groups: Without Augmentation (WoA), Random Augmentation (RA), Predictive Augmentation (PA), and Predictive Multiple Augmentation (PMA). PA and PMA show improvement in all five testing groups
4.18 Comparative confidence based on augmentation in the testing groups: Without Augmentation (WoA), Random Augmentation (RA), Predictive Augmentation (PA), and Predictive Multiple Augmentation (PMA). PA and PMA show improvement in all five testing groups
5.1 Fall in recognition accuracy for the Darkness challenge type under repeated augmentation without prior knowledge of model capacity. Dotted lines represent the model trained without augmentation; solid lines represent the model trained with augmentation


List of Tables

4.1 Number of image samples used for training, testing, and validation in each dataset
4.2 Generated real-time challenges covering extreme difficulty levels, grouped into vehicle and environment perspectives
4.3 Training (TR), Validation (VA), and Testing (TE) accuracy for Without Manipulation (WM) and With Manipulation High (WiMH) based on hyper-parameter selection
4.4 The results for the identified safe-risk ranges
4.5 The results of top performance based on hyper-parameter selection, with Lr, Fd, D, B, Fk being learning rate, filter depth, nodes in hidden layers, batch size, and filter kernel size respectively
4.6 Number of image samples for model training and inferencing: Without Augmentation (WoA), Random Augmentation (RA), Predictive Augmentation (PA), and Predictive Multiple Augmentation (PMA)
4.7 The results for average safe range accuracy based on augmentation: Without Augmentation (WoA), Random Augmentation (RA), Predictive Augmentation (PA), Predictive Multiple Augmentation (PMA), and German Traffic Sign Recognition Benchmark (GTSRB)
4.8 The results for average risk range accuracy based on augmentation: Without Augmentation (WoA), Random Augmentation (RA), Predictive Augmentation (PA), Predictive Multiple Augmentation (PMA), and German Traffic Sign Recognition Benchmark (GTSRB)


List of Abbreviations

AI Artificial Intelligence
AV Autonomous Vehicles
ADAS Advanced Driver Assistance Systems
SAE Society of Automotive Engineers
TSR Traffic Sign Recognition
ML Machine Learning
DL Deep Learning
GPU Graphical Processing Unit
CNN Convolutional Neural Network
GTSRB German Traffic Sign Recognition Benchmark
BTSD Belgium Traffic Sign Detection
CURE-TSD Challenging Unreal and Real Environments for Traffic Sign Detection
FOV Field of View
PA Predictive Augmentation
PMA Predictive Multiple Augmentation
RA Random Augmentation
WoA Without Augmentation
HOG Histogram of Gradients
VGG Visual Geometry Group
MCDNN Multi Column Deep Neural Network
ANN Artificial Neural Network
FFNN Feed Forward Neural Network
TanH Hyperbolic Tangent
ReLU Rectified Linear Unit
ELU Exponential Linear Unit
MLP Multi Layered Perceptron
SVM Support Vector Machines
SGD Stochastic Gradient Descent
RMSprop Root Mean Square Propagation
2D & 3D 2 Dimension and 3 Dimension
TR Training
TE Testing
VA Validation
CIELAB International Commission on Illumination
CV Convolution Layer
PL Pooling Layer
FC Fully Connected Layer
DA Data
MH Manipulation High
WM Without Manipulation
WiMH With Manipulation High


Chapter 1

Introduction

Automobiles have seen tremendous growth since the 1700s [1], helping humans move from one place to another through increasingly sophisticated technologies. Several eras of advancement have passed, with major concentrations on fuel combustion engines, exterior design, utility preference, safety, and comfort. Furthermore, due to advancements in electronics [2], sensors, control systems, and Artificial Intelligence (AI) during the early 21st century [3], the automotive sector has pushed its boundaries in providing Advanced Driver Assistance Systems (ADAS) [4], autonomous driving functionality [5,6,7,8], lane monitoring [9], etc. Extensive research and development is being carried out to achieve level five autonomy in automobiles according to the Society of Automotive Engineers (SAE) standards [10]. At the same time, safety being the topmost priority, such intelligent vehicles are rigorously tested so that they adapt to dynamic conditions and cooperate with human drivers sharing the road [11,12,13,14,15]. Among the several sensors used in Autonomous Vehicles (AV), cameras play a major role in understanding the complex scene around the vehicle and deriving the information necessary for safe navigation [16]. Since populated urban road environments are very challenging for an AV, it must respect the traffic flow, road signs, and rules when regulating the vehicle control. Traffic Sign Recognition (TSR), a computer vision based algorithm, is attracting researchers and automakers aiming for robust classification irrespective of challenging conditions [17,18]. However, among the many traditional computer vision algorithms, AI based learning algorithms (i.e., Machine Learning (ML) and Deep Learning (DL)) are preferred as they outperform the rest for TSR [19,20]. In addition, learning algorithms are becoming popular nowadays even in complex applications, due to the availability of high computing power through Graphical Processing Units (GPU), even in the case of huge data requirements [21,22]. Among the various ML and DL models, Convolutional Neural Networks (CNN) are widely used for image classification tasks, as they derive unique features for classifying objects of interest of any kind, including traffic signs, pedestrians, etc. [19,23,24,25].

1.1 Motivation

A CNN based classifier is a supervised learning model used for recognizing the type of object in a given image [24]. This work concentrates on traffic signs as the objects of interest. Training and performance evaluation of such a model are usually done with labelled datasets collected either from real roads or from simulation engines. However, generally available labelled datasets, such as the German Traffic Sign Recognition Benchmark (GTSRB)


Fig. 1.1. Experimental camera locations and respective fields of view to cover the scene around the vehicle. Based on vehicle size and type, camera locations are altered and calibrated.

Fig. 1.2. Images captured from different locations on a heavy vehicle lead to orientation changes in captured objects, taken from [26]. Left: Front View, Right: Top View


[27], Belgium Traffic Sign Detection (BTSD) [18], and Challenging Unreal and Real Environments for Traffic Sign Detection (CURE-TSD) [28], lack several real-time challenges faced by vehicles, as they are usually collected from front-facing cameras placed on a passenger car dashboard, mostly during daytime and in good weather conditions. AVs are not limited to passenger cars; tremendous development is also under way on heavy vehicles such as trucks and buses [6,29]. Generally, the number of cameras and their placement, chosen for the Field of View (FOV) required to cover a road sign around the vehicle, vary with the size of the vehicle [30]. For example, AV prototypes developed by different automotive companies each have their own preference in camera placement, including but not limited to: on the dashboard, on top, left view, right view, in front of the vehicle, rear view, etc., as shown in Fig. 1.1.

Similarly, an object captured from a different camera position shows orientation variations and differs from images captured by front-facing cameras. For example, the feeds of cameras placed on the top and the front of a heavy truck are compared in Fig. 1.2. Likewise, six camera placements and their respective images of an environment are presented in Fig. 1.3, which clearly illustrates the orientation challenges that exist for a classifier algorithm if the model training is not proper. In order to address these orientation problems caused by multiple camera locations, the classification algorithm must be robust enough to understand all such variations before it is allowed to run on real roads.

Fig. 1.3. Prototype autonomous heavy vehicle with six camera locations and their respective images showing variation in the orientation of traffic signs. Column 1: targeted left-side environment (Top: forward left, Bottom: rear left), Column 2: targeted middle environment (Top: forward front, Bottom: rear front), Column 3: targeted right-side environment (Top: forward right, Bottom: rear right)

In addition to camera placement, an AV undergoes weather variations, which lead to performance degradation and eventually result in misclassification [28]. Fig. 1.4 shows a variation in environmental conditions that affects the CNN model performance (the "Yield" sign is not classified); this could be improved if the model were trained with enough knowledge about such unpredictable situations. Hence, to account for all the mentioned dynamic challenges that are missing in the generally available datasets, these datasets are commonly augmented randomly with challenges including rotation, cropping, brightness, flipping, etc. [31]. In addition, other studies have considered weather challenges such as rain, fog, and snow to further improve dataset quality [28]. However, previous literature and the generally available datasets have only considered model training with the passenger car application in mind [16,17,18,19,23,25,27,28,31]. Therefore, whether a trained model developed for the car industry satisfies all the challenges faced by heavy vehicles, including but not limited to the placement of cameras, is still a question (Research Question 1). In addition, variations in the placement of sign posts on different sides of the road, according to the driving norms of a country, must be compensated by the algorithm.

Fig. 1.4. Natural environmental variations captured by the front-facing camera placed on a heavy vehicle; the classifier clearly misses the traffic sign of interest

Normally, a dataset collected from a moving vehicle contains some variations, as shown in Fig. 1.5 with samples taken from the GTSRB dataset. Hence, several challenging conditions such as blur, brightness, darkness, etc., already present in the data, will be used for training the model. There is thus a need to understand the model performance based on the existing training data and to identify the exact limitations of the model on the considered dataset. This allows us to understand which challenge type affects the sensitivity of a model the most, and where to direct our focus in preparing data to achieve better performance. This can drastically reduce the excessive overloading of unwanted data samples with the same challenging conditions, particularly during augmentation. It also raises the question (Research Question 2) of whether overloading the same challenge type multiple times during the training process affects the classifier performance. Furthermore, there are other unanswered questions (Research Questions 3 and 4) in traffic sign classification: a) Does the classifier sensitivity remain constant, for a particular type of challenge, with any training data, through memorization [32]? b) Does random augmentation increase the classifier robustness to real-time challenges? Motivated by these limitations and questions, there is an increased need for an image classification algorithm that is robust irrespective of variations in environmental conditions, camera location based on vehicle size, and motion-related challenges. Many studies [23,25,27,31,33,34] have focused on improving classifier performance without analysing the system performance under model degradation in challenging conditions, which may lead to classification failure during vehicle motion.

Fig. 1.5. Natural variations existing in the collected GTSRB dataset, taken from [27]

1.2 Contribution and Scope

The major contribution of this thesis is the development of a safety assessment and augmentation framework for identifying the model capacity when negotiating real-time challenges at varying levels. The framework introduces two new techniques, Predictive Augmentation (PA) and Predictive Multiple Augmentation (PMA), which adapt the dataset for improved robustness and high confidence in classifying traffic signs. A model trained with the proposed augmentation can withstand variations in camera placement and weather, irrespective of vehicle type and size.

The scope of this work includes:

1. Identification of LeNet-5 classifier weakness with respect to real-time challenges along with their corner challenging levels.

2. Generation of 20 different challenge types along with collective 264 levels of difficulty.

3. Consideration of experimental datasets including GTSRB and data collected from a heavy vehicle (Scania autonomous bus).

4. CNN model training is done based on systematic tuning procedure containing five important hyper parameters.

5. Performance evaluation based on five testing groups containing real-time challenges.

6. Detailed performance comparison of accuracy and their per-class classification confi- dence for the proposed augmentation methods.


1.3 Related Work

Over the last decade, Traffic Sign Recognition (TSR) research has grown rapidly, and classification accuracy keeps improving [19,20,35]. Several algorithms have been developed, from traditional computer vision techniques to learning models (ML and DL) capable of performing the classification task in real time. Traditional classification algorithms such as template matching [36], the sliding window method [23], etc., have the serious disadvantages of high computational cost and slow real-time performance. Fundamental ML techniques, on the other hand, require hand-crafted features for each class type during the training process in order to understand the different objects well. Since each traffic sign is unique in its colour and shape information, hand-crafted features are made possible by feature extraction based on shape, colour [37], edges [38], Haar features [23,39], Histogram of Gradients (HOG) features [23,20], etc. [40].

In addition to feature extraction, popular ML classifier algorithms such as Nearest Neighbour [20,19], Random Forest [41], Neural Networks [37], and Support Vector Machines [42] are major components in differentiating the class types. In the last few years, however, DL models, biologically inspired multilayer architectures based on CNNs, have become increasingly popular due to their high capacity to identify their own features for each class type, in contrast to ML models. The first neural network with a CNN architecture was developed for handwritten digit classification, with a simple two-convolution-layer architecture known as LeNet [43]. A CNN can vary in network size, depending on the complexity of the problem, with alternating convolution and pooling layers for feature extraction.

Following the LeNet architecture, a variant of the CNN, several modified architectures have been introduced with very deep feature extraction modules, particularly for cases where there are millions of images in the training set. State-of-the-art architectures for image classification such as VGG [44,45], ResNet [46], GoogLeNet [47], etc. [48] have provided near-perfect accuracy scores along with effective generalization across different datasets.

Similarly, a few CNN architectures have been developed specifically for TSR, such as LeNet-5 [31], the Multi-Column Deep Neural Network (MCDNN) [33,34], hinge-loss models [49], extreme learning [50], etc. Fig. 1.6 shows very deep models developed for the general object classification task (Google Inception Network) and for the traffic sign classification task (MCDNN).

Achieving better performance in CNN-based TSR does not depend only on the network architecture; there is also an increased requirement for effective training data itself [51,52]. Over the past decade, several datasets have been collected with varying sizes, numbers of class types, challenging environmental conditions, etc. [53]. However, some datasets are very small and meant only for the authors' own research, while other large labelled datasets have become benchmarks and are made universally available. Benchmark datasets are highly suitable for the major computer vision research directions such as object segmentation, detection, classification, etc. [27,18,28].

Even though several datasets are available, each lacks some features, including but not limited to the number of traffic sign classes, challenging conditions, and camera variations, which are essential for making an algorithm learn effectively and generalize robustly.

Hence, successive labour-intensive data collection and labelling have been carried out in order to eliminate the disadvantages of previously available datasets. Furthermore, game engines play a major role in generating realistic synthetic datasets according to user requirements [28], which can help in compensating for missing traffic sign class types. Several image processing techniques are used to augment the images, thereby extending the dataset with the desired environmental conditions.

(a) Inception Module (GoogleNet Architecture) for object classification taken from [47]

(b) Multi-Column Deep Neural Network for traffic sign classification taken from [33]

Fig. 1.6. State-of-the-art CNN architectures for image classification


1.4 Thesis Overview

This thesis is structured to build an understanding of neural networks and their applications from the basics, and progressively answers the research questions with results and the reasons behind the observed performance.

– Chapter 2 gives a detailed background on Artificial Neural Networks, the building blocks of ANN and CNN architectures, and the methods followed to manipulate images according to a desired noise type.

– Chapter 3 defines the proposed assessment and augmentation framework along with its embedded sub-functions.

– Chapter 4 deals with setting up the several experiments derived from the proposed framework in order to answer, in a systematic way, all the research questions considered in this study.

– Chapter 5 provides a general discussion based on the achieved results, ways to improve the results, and the challenges faced during the study.

– Chapter 6 finally summarizes this work, with the major achievements set against the initial motivations.


Chapter 2

Background & Theory

2.1 Artificial Neural Networks (ANN)

Artificial Neural Networks were developed based on the neural activity of biological brains, to perform complex tasks effectively through a repeated learning approach using relevant examples [54,55,56]. These examples help the neurons automatically infer rules and features to make decisions. The single neural processing block is called a "Perceptron"; it takes some input and performs a relevant operation to deliver the desired result. For instance, the Perceptron shown in Fig. 2.1 takes multiple inputs and delivers a single output by performing a sum-of-products operation. Multiple Perceptrons are connected together to form a chained network of artificial neurons [56]. Such chain-connected ANNs have proved to be extremely powerful in a wide range of applications, particularly in supervised learning tasks [57]. One example is the handwritten digit recognition task, which uses several training inputs and their corresponding known labels [43]. The ANN performs a looped training procedure on each sample until the network output matches the ground-truth label. The performance is then validated with unknown samples of the same category as those used in the training process. The performance of ANNs has improved tremendously in recent years with deeper network architectures. The basic building blocks of an ANN and their operations are detailed below.

Fig. 2.1. Simple perceptron representation with multiple inputs and a single output (solid boxes represent constants). X1...Xn are inputs and W1...Wn are learning weights (hollow boxes represent variables)


2.1.1 Layers

ANNs consist of hierarchical layers with systematic connections, where information passes through each Perceptron from one end to the other in making the final decision [54,55,56]. Fig. 2.2 shows a simple ANN with its connections, where any number of input samples are given as a 1-D feature vector, and the outputs are defined by the number of classes (in the case of classification) required by the problem definition. The connections represent the learning weights, which vary during the computation process. The adapted weights for each task represent the learned parameter values and can be detached from the network to perform a similar task without retraining. Three layers collectively define the ANN architecture: the Input Layer, the Hidden Layer, and the final Output Layer.

Fig. 2.2. Single hidden layer based Artificial Neural Network with 1-D feature vector as input sample and four class output

Fig. 2.3. Multiple hidden layer based Deep Neural Network (DNN) [58]


Input Layer : This layer contains only the training samples as a single dimension; they are connected, as shown in Fig. 2.2, to the successive "Hidden Layer", which makes simple decisions by summing the products of the inputs with their respective evidence/weights.

Hidden Layer : In the hierarchy, a hidden layer follows the input layer, with any number of Perceptron shells where the actual processing of information takes place. Each Perceptron shell collects the inputs and their respective weight allocations, calculates the weighted sum $\sum_i w_i x_i$, and makes a decision based on a defined threshold value, where $x_i$ and $w_i$ are the input and the weight corresponding to the $i$-th sample. For example, with binary data, if the weighted sum is greater than the threshold, the output is 1, else 0. The threshold value need not be a constant; it can be non-linear, defined by different "Activation functions", which are discussed in the next section. Both the weights and the threshold values used in an ANN are real numbers and are very important parameters of a neuron. For a simple Feed Forward Neural Network (FFNN), only one hidden layer is used, as shown in Fig. 2.2. However, any number of hidden layers, each with any number of Perceptron shells, can be connected to build a deep network architecture for complex problems, as shown in Fig. 2.3. Such deeper architectures are referred to as "Deep Learning".

Output Layer : The output layer can contain any number of output classes and generates only real numbers according to the problem considered; for example, either 0 or 1 in the case of binary data, or a range of probabilities for non-binary data. The probability of each class can be derived using loss functions, which are discussed later in this chapter.

Activation Function : The activation function is very important for an Artificial Neural Network when making decisions [59,48,49,58,24]. Without an activation function in the Perceptron shell, the output will always be linear in nature, as in Fig. 2.1. To make sense of something genuinely complicated, a non-linear functional mapping between the inputs and the respective output is needed. Simple yet effective Sigmoid functions were initially used to deal with the problem of non-linearity in ANNs. Sigmoid functions achieve bounded activations within the range [0-1]. However, they suffer from the vanishing gradient problem, which makes them less effective for deeper networks. Over the past decades, many other activation functions have been introduced that, depending on the complexity of the task, are considered better at bringing down the computation, thereby increasing speed. Some widely used activation functions other than Sigmoid, shown in Fig. 2.4, are the Hyperbolic Tangent (TanH), Rectified Linear Unit (ReLU), Leaky ReLU, Exponential Linear Unit (ELU), Maxout, etc. [59,60]
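To make the Perceptron description concrete, the following is a minimal NumPy sketch (not code from the thesis) of a single Perceptron computing its weighted sum and passing it through two of the activation functions discussed above; the input and weight values are illustrative:

```python
import numpy as np

def sigmoid(z):
    # Bounded activation in [0, 1]; saturates for large |z| (vanishing gradient)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Rectified Linear Unit: passes positive values, zeroes out the rest
    return np.maximum(0.0, z)

def perceptron(x, w, b, activation=sigmoid):
    # Weighted sum of inputs plus bias, passed through a non-linear activation
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x1..xn
w = np.array([0.4, 0.1, -0.6])   # learning weights w1..wn
print(perceptron(x, w, 0.2))         # sigmoid output
print(perceptron(x, w, 0.2, relu))   # ReLU output
```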

Forward and Back Propagation : Generally in an ANN, a process called Forward Propagation derives the network output (the class label) for a given input with the help of the assigned weights [59,58]. The network output will either perfectly match the ground-truth label or differ by a small error value, which has to be minimized by altering the initial weights; this is done with the Back Propagation technique [62]. Back propagation finds the partial derivative of the error function with respect to the weights, thereby minimizing the error by subtracting these derivatives from the original weight values. Clearly, by repeated forward and back propagation, the ANN learns the input and gathers useful information to activate the desired set of neurons for the final result. To calculate the error, many loss functions have been defined, among which the mean-squared error loss function is widely used. The gradient is the derivative of the loss function with respect to a single weight in each Perceptron; in other words, it is simply the slope of the function. The powerful gradient descent algorithm [63] is used to update the weights by descending the gradients during backward propagation. Many developments have since optimized and sped up the gradient descent procedure (see Section 2.2.5).

Fig. 2.4. Activation functions for neural networks [61]

The general schema of the gradient descent algorithm is shown in Fig. 2.5.

Fig. 2.5. Schema of weight flow in gradient descent algorithm, targeted to bring initialized weight to global minimum [64]
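As an illustration of the forward/backward loop described above, here is a small, self-contained gradient descent sketch for a linear model with a mean-squared error loss; the model, synthetic data, and learning rate are illustrative assumptions, not the thesis setup:

```python
import numpy as np

def mse_loss(w, X, y):
    # Mean-squared error of a linear model y_hat = X @ w
    return np.mean((X @ w - y) ** 2)

def mse_grad(w, X, y):
    # Analytic gradient of the loss with respect to the weights
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = rng.normal(size=3)             # random initial weights
lr = 0.1                           # learning rate
for step in range(200):            # repeated forward + backward passes
    w -= lr * mse_grad(w, X, y)    # move against the gradient (downhill)
print("final loss:", mse_loss(w, X, y))
```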

2.2 Convolutional Neural Network (CNN)

Convolutional Neural Networks are very popular in the field of computer vision, particularly for image understanding tasks [24,59]. A CNN is derived from the traditional ANN, effectively modified for vision problems, and is employed in several fields including autonomous driving, medical image analysis, etc. [19,48,59]. CNNs perform better and are more robust in image classification than traditional computer vision algorithms due to their unique capacity to learn features from huge sets of training examples. The general CNN architecture consists of two major functional modules, i.e., a feature learning & extraction module and a final classification module [50,65,49,59]. A CNN is structured with three main layers, namely the convolution layer, the pooling layer, and the fully connected layer. The feature learning module is composed of alternating convolution and pooling layers along with the desired activation functions. According to the complexity of the task, the number of layers in feature learning & extraction can be increased. The final classification module is the same as in an ANN: a Multi-Layered Perceptron (MLP) containing the desired number of hidden layers [66]. As in an ANN, 1) forward propagation and 2) back propagation take place during feature extraction. The detailed operation of the CNN layers is discussed below.

2.2.1 Convolution Layer

The convolution layer is the first layer in the CNN architecture and performs the 2-D/3-D convolution operation on the input images [24,59]. The convolution operation provides a pixel-wise weight estimation of an image for the respective filter kernel. The operation is performed simultaneously on all channels of the input image, and the change of weights in the 3 channels is collectively recorded in a single-channel feature map output. The convolution operation has three main advantages: 1) the weight sharing in the feature maps effectively reduces the number of parameters to learn; 2) it helps in learning the correlation between neighbouring pixels; 3) it identifies features invariant to their location in the image. The output feature map size generated by the convolution layer can be controlled through the filter kernel size, the stride length, and zero padding on the input images. The convolution operation with a single filter kernel $C_f$ provides a 2D feature map output image $O_{(m \times n)}$ and is given in Equation 2.1:

$$O_{(m \times n)} = I_{(m \times n \times c)} * C_{f\,(m \times n \times c)} + b_{(1 \times 1 \times 1)} \qquad (2.1)$$

where $m$, $n$, and $c$ are the width, height, and channels of the input image $I$, the output image $O$, and the filter kernel $C_f$; $b$ is the bias value.

The dimensions of the filter kernel weight matrix $C_f$ can be varied, and more layers can be stacked, for better feature learning. The size of the output map generated by the convolution layer is given by Equation 2.2:

$$O_m = (I_m - F + 2P)/S + 1 \qquad (2.2)$$

where $F$ is the filter kernel dimension, $S$ is the stride length of the sliding-window operation, $P$ is the zero-padding value, which for stride $S = 1$ can keep the output feature map the same size as the input, and $I_m$ is the width of the input feature map. When necessary, the output map size can be fixed to the input map size by setting the padding value to $P = (F - 1)/2$.
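A small helper (illustrative, not from the thesis) that evaluates Equation 2.2 and the "same size" padding rule $P = (F - 1)/2$:

```python
def conv_output_size(i_m, f, p, s):
    # Equation 2.2: O_m = (I_m - F + 2P) / S + 1
    assert (i_m - f + 2 * p) % s == 0, "filter does not tile the input evenly"
    return (i_m - f + 2 * p) // s + 1

def same_padding(f):
    # Padding that keeps the output size equal to the input size for S = 1
    return (f - 1) // 2

print(conv_output_size(32, 5, 0, 1))                 # 28: feature map shrinks
print(conv_output_size(32, 5, same_padding(5), 1))   # 32: size preserved
```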

2.2.2 Pooling Layer

The pooling layer follows the convolution layer in the CNN architecture. Like the convolution layer, the pooling layer produces a translation-invariant result, as it considers the neighbouring pixels in its calculation. The pooling layer reduces the spatial size of the input feature map using square kernels $M_k$ (height × width). Max or average pooling works in a sliding-window fashion: max pooling keeps the maximum pixel value and ignores the smaller ones, whereas average pooling averages the values in the kernel window (e.g., 9 pixels for a 3 × 3 kernel) to generate the new feature value. Normally, a stride length of $S = 2$ is used in this layer, without zero padding. The same Equation 2.2 is used to find the size of the output feature map for the pooling layer [50,65,49,59].
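The sliding-window behaviour described above can be sketched as follows; this is a plain NumPy illustration with an assumed 2 × 2 kernel and stride 2, not the thesis implementation:

```python
import numpy as np

def pool2d(fmap, k=2, stride=2, mode="max"):
    # Slide a k x k window over the feature map; keep the max (or mean) per window
    h, w = fmap.shape
    oh, ow = (h - k) // stride + 1, (w - k) // stride + 1  # Equation 2.2 with P = 0
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = fmap[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, mode="max"))   # 2x2 output, largest value per window
```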


2.2.3 Fully Connected Layer

As mentioned earlier, the final classification module holds MLP layers, similar to an ANN, for classifying objects according to the learned features. The fully connected layers, whether simple or complex, take their inputs from the final pooling layer output, i.e., the 2-D feature map is concatenated into a 1-D vector, as shown in Figs. 2.2 and 2.3. In addition, the non-linear activation functions shown in Fig. 2.4 are used in both the convolution and fully connected layers to fire the neurons.

2.2.4 Loss Function

Generally, a loss function aims to assign a high score to the correct class compared to the incorrect classes. The popular classifiers used in CNNs are the Multiclass Support Vector Machine (SVM) and the Softmax classifier, each with its own loss function representation [58,59]. The SVM loss function scores the classes with a fixed margin $\delta$, as presented in Equation 2.3 [58]. By summing all such class loss scores, the SVM loss function assigns a final score corresponding to a class label defined in the task.

$$L_i = \sum_{c_j \neq y_i} \max\left(0,\; F_{s_{c_j}} - F_{s_{y_i}} + \delta\right) \qquad (2.3)$$

where $L_i$ in Equation 2.3 is the SVM loss value for the $i$-th image, obtained by summing over the class scores; $y_i$ is the known class label of the $i$-th image; $F_{s_{c_j}}$ and $F_{s_{y_i}}$ are the scores of an incorrect class and of the true class obtained from the linear classifier, respectively; and $\delta$ is the fixed margin, which keeps the score of the correct class larger than the others by at least its own value. The loss over the full set of training samples is presented in Equation 2.4:

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i \qquad (2.4)$$

where $N$ is the total number of images in the dataset. The Softmax classifier uses the same whole-training-set loss of Equation 2.4, but with a different per-image loss function [58]. The Softmax classifier offers more intuitive score values for each class in terms of normalized probabilities. The loss function used by the Softmax classifier is given by Equation 2.5:

$$L_i = -\log\left(\frac{e^{F_{s_{y_i}}}}{\sum_{c_j} e^{F_{s_{c_j}}}}\right) \qquad (2.5)$$
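The two loss functions of Equations 2.3 and 2.5 can be sketched for a single image as follows; this is a NumPy illustration with made-up scores, and the max-shift in the softmax is a standard numerical-stability trick, not part of the thesis formulas:

```python
import numpy as np

def svm_loss(scores, y_true, delta=1.0):
    # Equation 2.3: sum of margins of incorrect classes over the true class score
    margins = np.maximum(0.0, scores - scores[y_true] + delta)
    margins[y_true] = 0.0          # the true class contributes no loss
    return margins.sum()

def softmax_loss(scores, y_true):
    # Equation 2.5: negative log of the normalized probability of the true class
    shifted = scores - scores.max()            # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[y_true])

scores = np.array([3.2, 5.1, -1.7])   # linear classifier scores for one image
print(svm_loss(scores, y_true=0))     # hinge-style multiclass loss
print(softmax_loss(scores, y_true=0))
```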

2.2.5 Optimization

The gradient descent algorithm discussed earlier in Section 2.1.1 is used for updating the weights in a direction approaching the global minimum when a CNN learns from training examples. Several refinements of gradient descent improve its performance and stability, including Batch Gradient Descent, Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent [58,59]. Vanilla or Batch Gradient Descent computes the gradient of the cost function with respect to the parameters over the entire training set. Stochastic Gradient Descent, on the other hand, performs a parameter update based on only one training example. Furthermore, Mini-Batch Gradient Descent performs an update for every mini-batch of n training examples. Still, these algorithms have to be optimized further to achieve effective results with less computational power.

The learning rate is an important parameter for bringing down the computation of the optimizer algorithm. However, it is quite challenging to define an exact value for the learning rate, as a small value leads to a long convergence time and vice versa. Hence, several optimizers use additional techniques such as momentum, RMSprop, Adam, etc. [58] for better convergence. The Adam optimizer [67] is widely used nowadays for image recognition, starting from a random initialization and running for N iterations. Among the other techniques, Adam builds on stochastic gradient estimates to optimize the direction of the weight updates.
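The three gradient descent variants differ only in how many examples feed each update, which a single sketch can capture; this is illustrative, with a linear model and synthetic data assumed, and is not the thesis training code:

```python
import numpy as np

def mse_grad(w, X, y):
    # Gradient of mean-squared error for a linear model y_hat = X @ w
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train(w, X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    # Mini-batch gradient descent: one update per batch of `batch_size` examples.
    # batch_size = len(X) recovers batch ("vanilla") GD; batch_size = 1 recovers SGD.
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(len(X))          # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            w = w - lr * mse_grad(w, X[idx], y[idx])
    return w

rng = np.random.default_rng(1)
X, y = rng.normal(size=(256, 4)), rng.normal(size=256)
w = train(np.zeros(4), X, y)
```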

2.3 Image Manipulation

Generally, 2D camera data gathered from a moving vehicle on real roads for the purpose of training machine learning models does not hold all the challenges, due to limitations in driving area, weather conditions, scenes with multiple sign types, road conditions, etc. Accommodating most of the challenges during the data collection itself would increase the labour cost and create an unpredictable timeline of waiting for the exact conditions to occur in real time, for example, waiting for the monsoon or a change of season. These challenges are therefore commonly synthesized using filter kernels from predefined image processing libraries, or filters specially defined for each type [68,69,70]. This generation enables an effective background analysis of how far the classifier can withstand the challenges. The detailed set of off-line challenge manipulation processes using image processing techniques is presented below.

Roll : The decisions made by autonomous vehicles vary with perspective changes in the images, caused either by the camera angle or by tilted/damaged signs in the environment during vehicle motion, which can lead to considerable rotation of the 2D camera images. As the object of interest is in 3D and is projected onto 2D images, we term the rotation of images "Roll" rather than the usual literature-based name "Rotation", which greatly helps in correlating it with the other viewpoint challenge types considered in this study.

Fig. 2.6 helps us understand how the Roll challenge is applied to an image and its extreme ranges in both directions. Clockwise and anti-clockwise rolling of images is represented by negative and positive angles respectively.

Mathematically, the rolling transformation matrix $T_R$ presented in Equation 2.6 takes $\theta$ as input and is applied to the input image to generate the required Roll challenge without scaling [71]. The rolling-scaling transformation matrix $T_{RS}$, which includes the desired scaling, is presented in Equation 2.7, where $\alpha = Scale \cdot \cos\theta$, $\beta = Scale \cdot \sin\theta$, $Scale$ is a scaling factor in the range [0 - 1], and $centerX$ and $centerY$ define the X and Y coordinates of the image point about which it is rotated.
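Assuming an OpenCV-based pipeline (the thesis does not name its library), a roll manipulation of this kind can be sketched as below; `cv2.getRotationMatrix2D` builds a rolling-scaling matrix of exactly the form of Equation 2.7, and `sign.png` is a hypothetical input sample:

```python
import cv2

def roll(image, theta, scale=1.0):
    # Build the rolling(-scaling) transformation of Eqs. 2.6/2.7 about the
    # image centre; positive theta rolls anti-clockwise, negative clockwise.
    h, w = image.shape[:2]
    t_rs = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), theta, scale)
    return cv2.warpAffine(image, t_rs, (w, h))

sign = cv2.imread("sign.png")      # hypothetical input sample
rolled = roll(sign, theta=-15)     # 15 degree clockwise roll
```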

$$T_R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \qquad (2.6)$$

$$T_{RS} = \begin{bmatrix} \alpha & -\beta & (1-\alpha) \cdot centerX - \beta \cdot centerY \\ \beta & \alpha & \beta \cdot centerX + (1-\alpha) \cdot centerY \end{bmatrix} \qquad (2.7)$$

Fig. 2.6. Rolling variation of an image can be generated in both clockwise and anti-clockwise directions with defined angles

Pitch : A 3D object of interest in the environment can be viewed from a top or a bottom angle in heavy vehicles, which leads to a non-parallel projection of the image with respect to the camera plane. In addition, trucks in particular have heavy cabin suspension, which can also contribute to a "Pitch" effect with varying angles, as shown in Fig. 2.7. Generally, the pitching effect in an image is achieved with projective transformations, also known as a perspective transform or homography [71]. This operation works on homogeneous coordinates, where two sets of 4 XY coordinate points are needed from the image: the first set of points is mapped/projected onto the second set of reference points. These 8 coordinate pairs determine the unknown homographic transformation matrix through 8 equations, as presented in Equation 2.9. The mathematical representation of the projective transformation is given in Equation 2.8, where $H$ refers to the 3×3 homogeneous transformation matrix, $I_{x,y,1}$ refers to the input image, and $I'_{x,y,1}$ to the projected output image. For this work, the projective transformation's XY coordinates are mapped in terms of the angle variation $\theta$, which is discussed in a later section.

$$I'_{x,y,1} = H \cdot I_{x,y,1} \qquad (2.8)$$

$$H = \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & 1 \end{bmatrix} \qquad (2.9)$$

$$I'_{x,y} = A \cdot I_{x,y} \qquad (2.10)$$

$$A = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \end{bmatrix} \qquad (2.11)$$

Fig. 2.7. Pitch variation of an image can be generated in order to compensate the top and bottom views of an object

Fig. 2.8. Yaw variation of an image can be generated in order to compensate the views from the left and right sides of an object

Yaw : Similar to the Pitch variation, the object of interest might appear in an orthographic projection with respect to the vehicle's front view, leading to a "Yaw" effect in images with varying angles, as shown in Fig. 2.8. Generally, orthographic or multiview projections are generated with the help of Affine transformations [71], which require two sets of 3 XY coordinate points, where the first set contains the points to be projected onto the second set of reference points. These 6 coordinate pairs determine the variables of the Affine transformation matrix presented in Equation 2.11, which is then applied to the input image pixels to achieve the desired transformation. The mathematical representation of the Affine transformation is given in Equation 2.10, where $A$ refers to the 2×3 transformation matrix, $I_{x,y}$ refers to the input image, and $I'_{x,y}$ to the transformed output image. For this work, the Affine transformation's XY coordinates are mapped in terms of the angle variation $\theta$, which is discussed in a later section.
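Again assuming OpenCV, a yaw-like manipulation via an affine map can be sketched as follows; the chosen point pairs and the `shift` parameter are illustrative:

```python
import cv2
import numpy as np

def yaw(image, shift=0.15):
    # Three point pairs fix the 2x3 affine matrix A of Eqs. 2.10/2.11; here the
    # right edge is sheared vertically to mimic a sideways viewing angle.
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w, 0], [0, h]])
    dst = np.float32([[0, 0], [w, h * shift], [0, h]])
    A = cv2.getAffineTransform(src, dst)        # solves the 6 unknowns of A
    return cv2.warpAffine(image, A, (w, h))
```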

Occlusion : Occlusion is one of the important challenges that occur naturally when information is perceived from the environment: part of the desired object is overlapped by an undesired object. In addition, an object detector may sometimes crop away a part of the object that is necessary for classification, resulting in wrong identification. Fig. 2.9 helps us understand the simple occlusion challenge that can occur in an image. Occlusion can be generated using a linear translation of the image data from fixed coordinates, so that only part of the useful information about the object of interest remains [71].

The image translation is presented in Equation 2.12, where $L$ is the transformation matrix for translation as presented in Equation 2.13, $I_{x,y}$ refers to the input image, and $I'_{x,y}$ to the translated output image, with $transX$ and $transY$ being the translational values in the X and Y coordinates respectively [69,70].

$$I'_{x,y} = L \cdot I_{x,y} \qquad (2.12)$$

$$L = \begin{bmatrix} 1 & 0 & transX \\ 0 & 1 & transY \end{bmatrix} \qquad (2.13)$$

Fig. 2.9. Occlusion variation of an image can be generated in order to compensate changes occurring in any part of an object
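A minimal OpenCV sketch of the translation-based occlusion described above (tooling and parameter values are assumptions):

```python
import cv2
import numpy as np

def occlude(image, trans_x=10, trans_y=0):
    # Translation matrix L of Eq. 2.13: pixels shifted out of the frame are
    # lost, so only part of the object of interest remains visible.
    h, w = image.shape[:2]
    L = np.float32([[1, 0, trans_x], [0, 1, trans_y]])
    return cv2.warpAffine(image, L, (w, h))
```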

Distance : It is very important for a vision algorithm to take quick action once it perceives information from the real world. The system can either be developed for high-speed recognition at close range, or it must make decisions from long range. Long-range identification is encouraged as effective and necessary for safe navigation, even if the system is capable of high-speed close-range performance. Fig. 2.10 shows the variation in image dimension from short to long distance ranges. Since the CNN considered in this work does not allow a varying input image size, the image data is surrounded/padded with black pixels to match the defined dimension [71]. It is worth mentioning that the padded black pixels are discarded by the activation function during the feature extraction stage in CNN models. The generation of small to large images with padded black pixels is presented in Equation 2.14, where $S$ is the transformation matrix for scaling as presented in Equation 2.15, $I_{x,y}$ refers to the input image, and $I'_{x,y}$ to the scaled output image, with $S_x$ and $S_y$ being the scaling factors in the X and Y coordinates respectively, ranging between [0 - 1]. The scaled image is then merged with a black image of the desired dimension.

$$I'_{x,y} = S \cdot I_{x,y} \qquad (2.14)$$

$$S = \begin{bmatrix} S_x & 0 \\ 0 & S_y \end{bmatrix} \qquad (2.15)$$
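A sketch of the scale-then-pad operation under the same OpenCV assumption; centring the scaled sign on the black canvas is an illustrative choice:

```python
import cv2
import numpy as np

def distance_variation(image, sx=0.5, sy=0.5):
    # Scale the image down (Eqs. 2.14/2.15) and paste it onto a black canvas of
    # the original size, matching the fixed input dimension of the CNN.
    h, w = image.shape[:2]
    small = cv2.resize(image, None, fx=sx, fy=sy)
    canvas = np.zeros_like(image)               # black padding
    sh, sw = small.shape[:2]
    y0, x0 = (h - sh) // 2, (w - sw) // 2       # centre the scaled sign
    canvas[y0:y0 + sh, x0:x0 + sw] = small
    return canvas
```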



Fig. 2.10. Distance variation of an image can be generated from short to long ranges

Motion Blur : The motion of a speeding vehicle on a real road introduces different blurring effects, notably frames captured with shakes caused by the rigid mounting of the cameras. Uneven road surfaces (ups and downs) introduce motion blur in the vertical direction, fast vehicle motion leads to blur in the horizontal direction, and lens zooming/focus leads to random Gaussian lens blur. The filter kernel used to generate the horizontal blur $H_B$ is given in Equation 2.16 and is convolved with the original image [71]. It works like a directional low-pass filter, which averages the pixel values in the desired direction [69,70]. Similarly, the filter kernel used to generate the vertical blur $V_B$ is given in Equation 2.17. The Gaussian blur $G_B$ can be considered a low-pass filter that attenuates the high frequencies in the image and causes a blurring effect controlled by the defined standard deviation in both the x and y directions, as given in Equation 2.18.

$$H_B = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix} \qquad (2.16)$$

$$V_B = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \end{bmatrix} \qquad (2.17)$$

$$G_B(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-(x^2 + y^2)/2\sigma^2} \qquad (2.18)$$

Illuminations : The images perceived from the environment undergo different illumination conditions, which may lead to excessive brightness, darkness, and contrast variations. To account for these challenges in the input training samples, a contrast variation is generated with the help of an $\alpha$ factor, as presented in Equation 2.19, where $C(x, y)$ is the contrast-adjusted output image, $\alpha > 0$ is the factor choosing the strength of the contrast level, $I(x, y)$ is the input image to be manipulated, and $\beta$ is the brightness parameter, kept constant at 0 for this contrast operation. For brightening and darkening the input images, the gamma correction presented in Equation 2.20 is used, where $B(x, y)$ and $D(x, y)$ are the brightened and darkened output images respectively; with the exponent convention of Equation 2.20, $\gamma < 1$ brightens and $\gamma > 1$ darkens the image. There can also be a case where a shadow falls on the object of interest in the vertical or horizontal direction, hence a band gamma correction with the darkness parameter can be used, as shown in Fig. 2.11 [69,70].

$$C(x, y) = \alpha \cdot I(x, y) + \beta \qquad (2.19)$$

$$B(x, y),\, D(x, y) = \left( I(x, y)/255 \right)^{\gamma} \cdot 255 \qquad (2.20)$$
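A NumPy sketch of the contrast and gamma operations of Equations 2.19 and 2.20; the clipping to [0, 255] is an implementation detail assumed here, not part of the thesis formulas:

```python
import numpy as np

def adjust_contrast(image, alpha=1.5, beta=0.0):
    # Eq. 2.19: C(x, y) = alpha * I(x, y) + beta, clipped to the valid pixel range
    return np.clip(alpha * image.astype(np.float32) + beta, 0, 255).astype(np.uint8)

def gamma_correct(image, gamma=2.0):
    # Eq. 2.20: with this exponent convention, gamma < 1 brightens, gamma > 1 darkens
    return ((image.astype(np.float32) / 255.0) ** gamma * 255.0).astype(np.uint8)
```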


Fig. 2.11. Shadow variation can be generated in both the horizontal and vertical directions of an object

Blending Challenge : In addition to the above challenges, caused by vehicle motion, view point, hardware and poor lighting conditions, object recognition also faces variations caused by natural weather changes. For example, snowflakes fall on the object and the resulting perceived image contains the object of interest overlapped with snowflakes, a blending effect. Generally, the image blending operation is carried out with the help of a reference filter image placed on top of the original colour image (R; G; B; A). In this operation, the final alpha channel 'A' defines the opacity of the image data and plays a major role in achieving the blending effect, as expressed in Equation 2.21, where α is the blending factor of the two images and ranges between [0, 1], $I_1(x, y)$ and $I_2(x, y)$ are the input and filter images to be blended respectively, and $\mathrm{Blend}_{out}$ is the desired output image. Fig. 2.12 presents a visualization example of the blending operation.

\[
\mathrm{Blend}_{out} = I_1(x, y) \cdot (1.0 - \alpha) + I_2(x, y) \cdot \alpha \tag{2.21}
\]
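A one-line alpha blend in NumPy, shown as a hedged sketch (OpenCV's cv2.addWeighted performs the same weighted sum):

```python
import numpy as np

def blend(image, overlay, alpha):
    """Alpha-blend a filter image (e.g. snowflakes) over the input (Eq. 2.21)."""
    out = image.astype(np.float32) * (1.0 - alpha) + overlay.astype(np.float32) * alpha
    return out.astype(np.uint8)
```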

Fig. 2.12. Blending of two input images in order to create overlapping challenges on one image


Chapter 3

Assessment and Augmentation Method

Motivated by the classifier degradation caused by real-time challenges faced by moving vehicles, this thesis proposes a Safety-Risk assessment and augmentation framework. In addition to the framework, two specific augmentation techniques are introduced, namely Predictive Augmentation (PA) and Predictive Multiple Augmentation (PMA), for effective CNN model training. The proposed framework and the two augmentation techniques deliver a concrete solution for adapting the CNN model to withstand the degradation caused by challenging conditions. A detailed description of the proposed framework is presented below.

3.1 Safety-Risk Assessment and Augmentation Framework

Fig. 3.1 shows the detailed flow chart of the proposed safety-risk assessment and augmentation framework for traffic signs, together with the performance on one challenging sample before and after framework processing. As this study focuses on understanding the classifier's capacity on challenges before it is deployed into real systems, an initial safety-risk verification is of prime importance; model training can then be adapted to the requirements needed by the classifier to improve its performance. Hence, the framework is divided into two sub-platforms: 1) Safe-Risk Assessment, which helps in identifying the classifier weakness on each considered challenge type, and 2) Predictive Augmentation, which helps in increasing the training samples needed by the CNN model to understand the challenge types that fall under its weakness. The framework is developed to handle any given classifier model and to characterize its capacity based on our requirements; the selected CNN architecture along with its dataset (training, testing and validation batches) are the only inputs required. Similar to the robustness framework, the best performing CNN model has been selected and its capacity verified by tuning hyper-parameters for the given data (see Chapter 4). The two major platforms are discussed in detail as follows.



Fig. 3.1. Schematic flow chart representing the Safety-Risk assessment and augmentation framework for traffic sign classification, with training (TR), validation (VA), and testing (TE) as the input data.

The identified weaknesses, in terms of challenge type and difficulty level (risk ranges), are used by the predictive augmentation platform to improve the training and validation samples.



3.1.1 Safe-Risk Assessment

The Safe-Risk Assessment platform presented in the framework (Fig. 3.1) helps in quantifying the model capacity for each challenge type. As discussed earlier, the best performing traffic sign classifier algorithm and its dataset used for training (TR), validation (VA) and testing (TE) are the only input requirements for the assessment platform. Even though the input CNN model is claimed to be best performing on the testing samples, the major aim of this proposed framework is to validate its performance under challenging conditions and retrain the model according to the systematic hyper-parameter tuning procedure (see Section 4.2) to validate its stability.

Hyper-parameter tuning and retraining take place inside the CNN model tuning & training block of the framework. In order to assess the performance and robustness on real-time challenges faced by vehicles, the real-time challenge generation block is equipped with an array of challenges (filter kernels) which are applied on top of the testing samples derived from the input dataset (see Section 4.3).

Fig. 3.1 clearly shows that a test sample with a pitching challenge type is misclassified with high confidence even by the considered best performing classifier. Hence, the Safe & Risk Range Identification block helps in identifying the challenge types and difficulty levels that are mostly misclassified by the classifier and require respective training samples for better classification. The procedure for selecting the safe and risk ranges for each challenge type is formulated based on a threshold recognition accuracy, as shown in Fig. 3.2. To achieve a generalized identification, all considered challenges are grouped into symmetric and linear types as shown in Fig. 3.2, with low being qualitatively less challenging and high being more challenging. If the classification accuracy for a considered challenge type and difficulty level reaches above the defined threshold level, that level falls under the safe range, and vice versa. The threshold accuracy can be set according to how well the classifier is required to perform in the wild. This platform gives a detailed intuition for targeting the data batches with challenges for future model training; the identified limitations, in terms of risk ranges for each challenge type, are fed as input to the successive predictive augmentation platform. A minimal sketch of this range identification is given below.
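For concreteness, a hypothetical sketch of the safe/risk range split, assuming the per-level accuracies have already been measured (all names and the example numbers are illustrative, not from the thesis):

```python
def split_safe_risk(level_accuracy, threshold=0.9):
    """Partition challenge difficulty levels into safe and risk ranges.

    level_accuracy: dict mapping difficulty level -> classification accuracy.
    Levels at or above the threshold are safe; the rest form the risk range
    passed on to the predictive augmentation platform.
    """
    safe = [lvl for lvl, acc in sorted(level_accuracy.items()) if acc >= threshold]
    risk = [lvl for lvl, acc in sorted(level_accuracy.items()) if acc < threshold]
    return safe, risk

# Example: accuracies for one linear challenge type (e.g. blur levels 1-5).
safe, risk = split_safe_risk({1: 0.97, 2: 0.93, 3: 0.85, 4: 0.61, 5: 0.40})
print(safe, risk)  # [1, 2] [3, 4, 5]
```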

3.1.2 Predictive Augmentation

The Predictive Augmentation (PA) platform, being the second part of the framework, is introduced to enrich the previously available training samples with real-time challenges.

Augmentation is done based on the predicted challenges, along with their risk ranges, gathered from the preceding safe-risk assessment platform. Since the augmentation is carried out based on the identified model capacity, an improvement in performance and robustness is achieved, as shown in Fig. 3.1. The identified limitations are augmented onto the training samples in two different ways: 1) PA, which uses only one identified challenge to augment each image, for example a speed limit 30 km/h sign with blur at difficulty level 10, so that the next sample contains a different challenge type and level; 2) PMA, which uses two variations, one being a challenge type from an identified range and the other a conversion to the CIELAB colour space, chosen for its high feature-learning capacity [65], so as to bring a simultaneously overlapped effect of both onto the same image. This helps the classifier model to understand the combined variations occurring in the environment. A sketch of both strategies follows.
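A hedged sketch of how PA and PMA might compose augmentations, reusing the motion_blur and gamma_correct helpers sketched in Chapter 2; the registry, level mapping, and sampling strategy are illustrative assumptions, not the thesis implementation:

```python
import random
import cv2

# Hypothetical registry: challenge name -> (apply_fn, identified risk levels).
RISK_RANGES = {
    "motion_blur": (lambda img, lvl: motion_blur(img, "horizontal", ksize=2 * lvl + 1), [3, 4, 5]),
    "darkness":    (lambda img, lvl: gamma_correct(img, 1.0 + 0.5 * lvl), [2, 3]),
}

def predictive_augmentation(image, multiple=False):
    """PA: apply one challenge drawn from its identified risk range.
    PMA: additionally convert to CIELAB for an overlapped effect."""
    name = random.choice(list(RISK_RANGES))
    apply_fn, levels = RISK_RANGES[name]
    augmented = apply_fn(image, random.choice(levels))
    if multiple:  # PMA: overlap the challenge with a CIELAB conversion
        augmented = cv2.cvtColor(augmented, cv2.COLOR_BGR2LAB)
    return augmented
```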


(a) Linear challenge type

(b) Symmetric challenge type

Fig. 3.2. Systematic selection of safe and risk ranges based on defined threshold recognition accuracy for each challenge type


Chapter 4

Experiments

This chapter presents a detailed insight into the dataset selection, the CNN architecture selection, and the generation of challenges for augmentation and validation shared between the experiments; three stages of experimentation are then framed to cover the sub-processes proposed in the framework. Finally, the achieved results and their performance are evaluated.

4.1 Dataset Selection

To validate the proposed framework, two datasets were considered: 1) the German Traffic Sign Recognition Benchmark (GTSRB) data, and 2) our own 2D camera data collected from an experimental autonomous bus navigating urban Swedish roads, as shown in Fig. 4.1.

GTSRB : GTSRB data were mainly considered for this study because most traffic sign detection, classification and tracking algorithms have been benchmarked on them, and because they were collected from physical real-world German roads. The data samples were collected as video sequences of approximately 10 hours, recorded with a camera placed on the dashboard of a passenger car while driving in urban road environments (Germany) during daytime only.

The GTSRB dataset was compiled with a final size of 50,000 images across 43 different classes. It contains more than 1,700 traffic sign instances, with image sizes varying between 15 × 15 and 222 × 193 pixels. To make the input to the classifier uniform, all images are resized to 32 × 32 pixels [27]. Table 4.1 gives more information on the training, validation and testing image samples used in this study; the left side of Fig. 4.1 shows a batch of sample images from GTSRB.

Heavy vehicle data : Following the motivation presented in the introduction chapter, a dataset was collected from an autonomous bus to show the variation in images when comparing heavy vehicle data with passenger car data. The data were collected from a monocular camera placed centred inside the cabin, at a position high above the ground compared to the usual camera location in passenger cars. This greatly helped in understanding the necessity of training data and the performance with respect to a high-angle camera placement. In addition, some environmental conditions like darkness, brightness, etc., were captured from scenes on Swedish roads. The collected heavy vehicle data are comparatively very small, with only 4 sign types: “Pedestrian crossing”, “Yield”, “Traffic light red”, and “Keep right”.



Fig. 4.1. Sample images from the datasets considered for this study, with natural variations. Left: German Traffic Sign Recognition Benchmark. Right: Collected heavy vehicle data with four class types.

We used only these 4 sign types as they had the maximum number of captured samples. In addition, the small data size helps us bring an effective measure to support our claim that predictive augmentation improves performance irrespective of data size (see Section 4.6.1).

Table 4.1. Number of image samples used for training, validation and testing in each dataset

            GTSRB    Heavy vehicle
    Train   34799      161
    Valid    4410       41
    Test    12630       51

4.2 CNN Architecture Selection

As the major contribution of this study is to identify model degradation in order to provide a better augmentation-based solution for improving robustness, a simple CNN architecture was chosen rather than a deep model. Fig. 4.2 shows the detailed architecture of the CNN model considered for the following experiments. The architecture is inspired by the “LeNet” binary handwritten character recognition model [43]; it has been adapted to work on colour images with 3-channel input, and some of the hyper-parameters were modified in a successive systematic tuning procedure. Generally, the selection of such hyper-parameters has proven very challenging in previous studies [31].

Hence, rather than utilizing a random set of hyper-parameter values, this work introduces a procedural way to select the best set of values, as shown in Fig. 4.3. Five parameter groups were selected, each with 5 different values, collectively giving 25 parameter sets according to our procedure. Tuning begins with an initially assigned set of parameters, and the best value from each group successively replaces the corresponding one in the main architecture, thereby arriving at the final set of best parameter values. This systematic selection procedure was used for all experiments. For the robustness verification, the best set of parameters in each group was considered for performance comparison, while for the predictive augmentation experiment only the final set of parameters was considered. A sketch of the adapted architecture is given below.
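As an illustration only, a minimal Keras sketch of a LeNet-style network adapted to 32 × 32 × 3 colour input with 43 output classes; the actual layer sizes and hyper-parameters in Fig. 4.2 may differ, so this is an assumption-laden approximation rather than the thesis model:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lenet(num_classes=43, input_shape=(32, 32, 3)):
    """LeNet-style CNN adapted to 3-channel colour traffic sign images."""
    model = models.Sequential([
        layers.Conv2D(6, 5, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(2),
        layers.Conv2D(16, 5, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(120, activation="relu"),
        layers.Dense(84, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```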
