Operational data extraction using visual perception

NAGARAJAN SHUNMUGAM

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master's Programme, ICT Innovation, 120 credits
Date: January 20, 2020
Supervisors: Veeresh Elango (industrial) and Sahba Zojaji (academic)
Examiner: Dr. Haibo Li
School of Electrical Engineering and Computer Science
Host company: Scania CV AB

© 2021 Nagarajan Shunmugam

Abstract

The information age has led truck manufacturers and logistics solution providers to lean towards software as a service (SAAS) based solutions. With advancements in software technologies like artificial intelligence and deep learning, the domain of computer vision has achieved performance boosts significant enough to compete with hardware based solutions. Firstly, data is traditionally collected from a large number of sensors, which can increase production costs and the environmental carbon footprint. Secondly, certain useful physical quantities/variables are impossible to measure, or measuring them turns out to be a very expensive solution. So in this dissertation we investigate the feasibility of providing a similar solution using a single sensor (a dashboard camera) to measure multiple variables. This provides a sustainable solution even when scaled up to huge fleets. The video frames collected from the visual perception of the truck (i.e. its on-board camera) are processed with deep learning techniques, and operational data is extracted. Techniques like image classification and semantic segmentation were experimented with, and their outputs show potential to replace costly hardware counterparts like lidar or radar based solutions.

Keywords

Visual perception, camera, convolutional neural networks, classification, object detection, semantic segmentation, depth estimation, gradient descent with restarts, cosine annealing.

Sammanfattning

The information age has led truck manufacturers and logistics solution providers to lean towards software as a service (SAAS) based solutions. With advances in software technologies such as artificial intelligence and deep learning, the domain of computer vision has achieved significant performance gains, enough to compete with hardware based solutions. Firstly, data is collected from a large number of sensors, which can increase production costs and the environmental carbon footprint. Secondly, certain useful physical quantities/variables are impossible to measure or turn out to be very expensive to obtain. In this thesis we therefore investigate the possibility of providing a similar solution using a single sensor (a dashboard camera) to measure several variables. This provides a sustainable solution even when scaled up to large fleets. Video frames collected from the truck's visual perception (i.e. the truck's on-board camera) are processed with deep learning techniques, and operational data can be extracted. Techniques such as image classification and semantic segmentation were experimented with and show potential to replace expensive hardware counterparts such as lidar or radar based solutions.

Keywords

Visual perception, camera, neural networks, classification, object detection, semantic segmentation, depth estimation, gradient descent with restarts, cosine annealing.

Acknowledgements

I would like to thank my industrial supervisor, Veeresh Elango, and Frida Nellros for their support and advice during the project. I also want to thank them for extending my thesis during these extraordinary times caused by the pandemic. I would like to thank my academic supervisor, Sahba Zojaji, for helping throughout my thesis with his subject knowledge and expertise. Many thanks to my examiner, Haibo Li, for accepting to examine my thesis, and to my friends Srinath, Jelin and Lalith for reading previous drafts of this report and providing many valuable comments. Last but not least, I owe a few words to my family for supporting me in pursuing my dreams far away from home. I thank them from the bottom of my heart.

Stockholm, March 2021
Nagarajan Shunmugam


Contents

1 Introduction
    1.1 Background: Rise of AI in computer vision techniques
    1.2 Technical Challenges and Delimitation
    1.3 Goals
    1.4 Research Methodology

2 Problem and Proposed solution
    2.1 Problem
    2.2 Research question
    2.3 Proposed solution

3 Literature study and Background
    3.1 What are neural networks?
        3.1.1 Convolutional neural networks (CNN)
    3.2 Classification
        3.2.1 Handling over-fitting
    3.3 Object detection
    3.4 Semantic segmentation
    3.5 Depth detection

4 Design and Implementations
    4.1 Hardware used
    4.2 Framework and platform
    4.3 Overview
    4.4 Variables obtained (Classification task)
        4.4.1 Variables
    4.5 Classification task
        4.5.1 Data collection
        4.5.2 Choosing the perfect neural network
        4.5.3 Evaluation criteria
        4.5.4 Analysis and Training for the four variables
    4.6 Object detection task
        4.6.1 Data collection
        4.6.2 Choosing the network
    4.7 Semantic segmentation task
        4.7.1 Data collection
        4.7.2 Choosing the network
    4.8 Depth detection task
        4.8.1 Data collection
        4.8.2 Choosing the network

5 Results
    5.1 Results of Classification task
    5.2 Object detection
    5.3 Semantic segmentation
    5.4 Depth estimation
    5.5 Summary

6 Discussion and Analysis
    6.1 Ethics and sustainability
    6.2 Future work

7 Conclusions

References

Chapter 1

Introduction

With the digitalization of the logistics industry and supply chain management, the use of multiple hardware sensors needs to be avoided. Doing so prevents huge sensor costs, reducing large-scale production expenses as well as the environmental carbon footprint.

With developments in machine learning and deep learning, advancements in the computer vision domain have made its performance comparable to hardware sensors; for certain features that were impossible or costly to measure without advanced software solutions (e.g. weather), it even surpasses the dedicated hardware. The hierarchy of artificial intelligence is shown in Figure 1.1. These solutions also help us to understand the market, customer needs, evolving client interests, cash flow, and existing problems in the system [1]. Based on these facts [2], companies can use data science, internet of things, and artificial intelligence techniques to extract valuable indicators and understand user patterns, in order to provide services tailored to a specific task and individual.

This dissertation evaluates whether computer vision and machine learning based techniques can be used to extract valuable parameters using a single sensor, without the need for multiple hardware sensors, thus avoiding the huge costs associated with them. The thesis provides a comparative study using deep learning techniques like object detection, semantic segmentation and monocular depth estimation.


1.1 Background: Rise of AI in computer vision techniques

The parallel development of computer hardware and the rich sources of data available in the information era give enormous power to AI techniques, which are capable of handling such information and are now expected to replace hardware sensors. In certain domains of computer vision, the algorithms perform at levels superior to humans and are starting to replace hardware sensors like lidar or radar. Companies like Tesla rely primarily on computer vision based techniques and achieve performance close to that of hardware sensors, while companies like Waymo (from Google) develop their own hardware sensors to achieve autonomous driving capability. Deep learning is fed raw data and learns to understand tiny abstract features on its own terms. The true challenge of AI is to solve problems that are easy for humans, like differentiating objects based on visual perception, but would otherwise require special hardware that adds to the cost. Such models solve real-world problems and generalize well in dynamic environments, unlike traditional computer vision techniques like template matching or combinations of several previous-generation hardware sensors. Ironically, even though computers have always outperformed humans in repetitive and formal tasks, they have become good at vision and speech only very recently.

The traditional hardware sensors used to perceive the dynamic and noisy environments of the real world are now slowly being replaced by algorithms.

Problems like image classification reached peak performance once hardware power started to rise. Object detection can be carried out in real time without compromising prediction quality, using modern architectures like YOLO [3] instead of traditional computer vision techniques like template matching. Tasks like semantic segmentation now run at impressive speeds, around 5 fps, thanks to modern hardware like Nvidia GPUs. These techniques show performance comparable to hardware sensors, and in the future they might replace them.

1.2 Technical Challenges and Delimitation

The idea is to use only one type of hardware sensor (in this case a camera), avoiding the use of special sensors like lidars, radars and ultrasonic sensors.

The main challenge will be to achieve a good precision and performance


Figure 1.1: Deep neural networks' position in the hierarchy [2]

that is comparable to that of the purpose-built hardware sensor. Due to limited memory, the chosen network should have few parameters, resulting in small weight files, without compromising performance. As multiple networks are used for different variables, the choice of network will be a crucial part.

When multiple output labels of the same type are present in large numbers, say cars in heavy traffic, techniques like object detection fail to detect each instance, whereas traditional hardware sensors work better. So more advanced techniques like semantic segmentation will be tested to match the hardware-specific performance.

A further challenge is preparing datasets from scratch for certain output labels that are not directly available in the public domain.


1.3 Goals

With vision as the primary component and the camera as the primary sensor, the goal of the thesis is to extract operational variables with good quality. By using the camera alone, the cost of other sensors can be eliminated.

Extracting operational variables that help understand the behaviour of the vehicles and the driving conditions has yet to be experimented with or used in real-time applications.

Neural networks must be picked carefully to save memory when storing weights and biases. Static and dynamic obstacles will be captured using techniques like object detection and semantic segmentation, and the depth of these obstacles will be detected using monocular perception as well.

1.4 Research Methodology

Deep learning techniques were preferred over classical machine learning techniques due to their superior performance in the computer vision domain.

Neural network architectures were selected (with the memory constraints in mind) and their performance was benchmarked to choose an optimal network size without compromising performance. The performance and accuracy of the proposed algorithms are evaluated using indicators like accuracy, mean average precision, IoU, confidence thresholds, confusion matrices, etc. [4]. To make the models robust and reliable, multiple neural networks were tested under various weather and traffic conditions. Images were up-scaled and down-scaled to test the networks' performance under different pixel resolutions. This also helps in understanding the behaviour of the vehicles and the driving conditions.
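To make these indicators concrete, here is a minimal sketch, in plain NumPy with hypothetical label arrays, of how accuracy and a confusion matrix can be computed from a network's predictions:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels for a 4-class task (e.g. road condition)
y_true = np.array([0, 1, 2, 2, 3, 1])
y_pred = np.array([0, 1, 2, 1, 3, 1])

cm = confusion_matrix(y_true, y_pred, n_classes=4)
accuracy = np.trace(cm) / cm.sum()  # fraction of correctly classified samples
print(cm)
print(f"accuracy = {accuracy:.2f}")
```

Each row of the matrix corresponds to a ground-truth class, so the off-diagonal entries expose exactly which pairs of classes the network confuses.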


Chapter 2

Problem and Proposed solution

2.1 Problem

The term operational data means any data that represents the vehicle's state or its surrounding environmental conditions. Traditional hardware sensors used in vehicles add cost and time to the measurement of operational data.

The network of hardware sensors needs to be synchronized to a common frequency, which makes the build and deploy process complicated. Hardware sensors also increase the carbon footprint when they cannot be recycled, and they make the system harder to scale in the future. So there is a need for a cheaper and simpler solution.

2.2 Research question

Is it possible to extract useful operational parameters from a single camera sensor, without the need for multiple hardware sensors, and still achieve comparable performance?

For example, is it possible to extract depth information of the surrounding objects from a static frame/image instead of using a dedicated sensor like lidar?

2.3 Proposed solution

To save cost and time, this thesis proposes to measure various operational parameters using a single sensor: a camera mounted on the dashboard of the truck, which can help understand the environment being driven in and in turn assist in diagnostics for maintenance. The models will run on memory-constrained hardware and under real-time conditions [5], without compromising performance [6] [7] [8].


Chapter 3

Literature study and Background

Machine learning started as a research topic around the 1950s, with Donald Hebb's book The Organization of Behavior (1949). Since the hardware at the time was not good enough, interest faded out pretty fast. In the last decade, with the development of powerful hardware, machine learning and its potential have surprised the world. Deep learning, a sub-domain of machine learning, has become a popular choice in every domain of data science, especially in the field of computer vision [9], where it performs at superhuman levels. Neural networks became a field of their own and grew to greater depths in research, and new architectures turned out to be very powerful and robust, leading to their use in critical applications like autonomous cars [10] and healthcare [11]. In this thesis, the ultimate goal is to use them extensively to find new variables that can add value to the automotive industry.

3.1 What are neural networks?

Neural networks are mathematical functions designed to closely emulate the brain's cognitive functions, like understanding patterns and problem solving [13], as can be seen in Fig. 3.1. Most importantly, they can be taught any unique task. Today, neural networks are used in almost every domain, from making financial decisions [14] to assisting doctors during diagnosis and surgery [11]. Neural networks are formed from individual units called neurons [15] [16] [17]. These neurons are connected to each other through weighted connections: randomly initialized real numbers that change as the network trains. The weighted input is added to a bias and passed through a non-linear function called the activation function. The value generated forms the input to the next neuron, and the whole process continues until the last layer


Figure 3.1: Basic neuron [12]: a single neuron takes in input variables and outputs a non-linear value.

Figure 3.2: Mathematical representation [12], where the weights and biases are tied together.

of the network. This process is called forward propagation, and can be seen in Figs. 3.3 and 3.4.

Back propagation is crucial to a neural network's performance [18].

Through this chain-rule method, the weights and biases are updated after every forward propagation step. In more detail: at the end of forward propagation, the loss is calculated using the loss function; then the differential of the loss with respect to the weights and biases is computed, and this process is back-tracked to the first neuron; the new weights and biases are then calculated by the gradient descent algorithm. Deep neural networks work in this fashion.
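As a minimal sketch of this loop, forward propagation, loss computation, back propagation via the chain rule, and a gradient descent update, written with PyTorch autograd (the layer sizes and random data are illustrative only):

```python
import torch

# A tiny two-layer network; weights are randomly initialized real numbers
x = torch.randn(8, 4)   # batch of 8 inputs with 4 features each
y = torch.randn(8, 1)   # targets
w1 = torch.randn(4, 16, requires_grad=True)
b1 = torch.zeros(16, requires_grad=True)
w2 = torch.randn(16, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

lr = 1e-2
for step in range(100):
    h = torch.relu(x @ w1 + b1)       # forward propagation, layer 1 (activation)
    y_hat = h @ w2 + b2               # forward propagation, layer 2
    loss = ((y_hat - y) ** 2).mean()  # loss function (mean squared error)
    loss.backward()                   # back propagation: chain rule over the graph
    with torch.no_grad():             # gradient descent update of weights and biases
        for p in (w1, b1, w2, b2):
            p -= lr * p.grad
            p.grad = None
```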

Moving to computer vision applications, a different type of neural network, called a convolutional neural network (CNN), is needed.


Figure 3.3: Neural connections in layers, where every neuron is connected to every neuron in the immediate next layer [12]

Figure 3.4: Combining all the neurons in a matrix fashion to be computationally efficient. [12]

3.1.1 Convolutional neural networks (CNN)

Convolutional neural networks (Fig. 3.5) are very similar to traditional neural networks: they still receive inputs, process them and send outputs. The major difference is that they are used for images and computer vision based applications [19]. CNNs are used in computer vision because images come in the form of tensors, i.e. 3-dimensional matrices [20]. Tensors can still be passed to fully connected neurons [21] [22], but as the image gets bigger, the number of parameters grows, taking longer to train and increasing the training cost. The working of a CNN can be explained as follows:

Figure 3.5: Convolutional neural network, where the image is passed through every convolutional block and then the fully connected layers. [23]

The input image is scaled to grey scale and then passed through filters known as kernels; they can detect edges, colours, and many such features. The kernels come in different sizes as well.

Thus a convolutional layer [24], a compact and densely stored set of feature maps, is obtained. The final feature map size can be tweaked by changing the stride and padding. The feature map is then passed through a non-linear activation function. Activation functions, by definition, take linear inputs and output a linear or non-linear value which is fed to the next layer of the network. Typically the ReLU function [25] is used, which outputs the maximum of zero and its input. The final layer is quite different, though: softmax or sigmoid [26] activation is used depending on the type of application. This is followed by the pooling layer, which down-samples the network; usually max pooling is preferred. Down-sampling here means reducing the dimensions of the input 2D matrix. Max pooling helps by taking a small patch and reducing it to its maximum value. These layers are then wrapped up with fully connected layers. As said earlier, in a classification problem the final layer is passed through the softmax activation function.
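A minimal sketch of such a pipeline in PyTorch, showing the convolution, ReLU, max-pooling, fully connected and softmax stages described above (the layer sizes and class count are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Illustrative CNN: kernels -> ReLU -> max pooling -> fully connected -> softmax."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # kernels -> feature maps
            nn.ReLU(),                                             # outputs max(0, x)
            nn.MaxPool2d(2),                                       # down-sampling by patch maxima
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, n_classes)       # fully connected head

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return torch.softmax(self.classifier(x), dim=1)            # class probabilities

probs = TinyCNN()(torch.randn(1, 3, 224, 224))  # one 224x224 RGB frame
```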

3.2 Classification

For classification, networks that performed well in the ImageNet competition were studied. The goal of the ImageNet competition is to attain the highest accuracy in multi-class classification. The biggest trade-off of this approach is that it takes into consideration neither the model size nor the inference time, whereas our challenge is to work within memory constraints and, where needed, with real-time inference.

Figure 3.6: Benchmarks: accuracy vs. model size [27]

So, searching for a model with good accuracy but without huge memory requirements, the first network analyzed was AlexNet [21], which, trained on a GTX 580, was 10 times faster than the CPUs of that time. This network's ImageNet competition win changed the history of computer vision. Various other networks like VGG 16/19 [28], Inception [29], Resnet [30], Squeezenet [31] and a few other architectures were also tested.

3.2.1 Handling over-fitting

If a neural network performs badly, it is either over- or under-fitting. If the sample size of the dataset is very small, the trained neural network has an over-fitting problem; if the training data has many images but is trained on too small a network, it leads to under-fitting. The over-fitting problem is handled in the following ways. Firstly, it was ensured that more data is used, so that the network has a fair chance to understand the data. Next, data augmentation was used; the fastai library offers 16 different types of data augmentation. The network was also baked in with batch normalization, a technique that stabilizes the neural network training process and reduces training time by re-scaling and re-centering the data in batches, and with dropouts (turning random neurons on and off), which are enabled if needed. The data was also tested with different network architectures. The neural networks used are deep enough that under-fitting is not an issue in this work.

3.3 Object detection

Object detection (Fig. 3.7) is a powerful application of computer vision. The ability to identify and localize the target class is used in multiple applications.

Some modern applications of object detection are industrial quality and defect identification [32], facial recognition [33] in phones, PCs and offices, security and surveillance, and, most importantly, detecting static and dynamic obstacles on the road in real time for self driving cars. Here the idea is to extract the static and dynamic obstacles, which can be used by transport and logistics industries in various domains. Various models were studied, like SSD [34], RCNN [35], Fast RCNN [36] and YOLO v3 [3]. The trade-off between accuracy and model size was closely considered. For this application, detection of 10 specific classes was performed; more details are given in later sections. The problem with the object detection approach is that it is hard to capture multiple instances of the same class when their number is large, and this cannot be solved by adding more anchor boxes when there are too many objects of the same class. So a new technique called semantic segmentation, where every pixel is classified, is needed.

3.4 Semantic segmentation

Semantic segmentation [38] [39] [40] is otherwise known as pixel-wise classification. It was once considered impossible to process such a huge amount of computation; thanks to modern hardware, such data volumes and intensive computations are now supported. Here the self driving vehicle scenario is considered, with multiple different classes:


Figure 3.7: Object detection, where multiple anchor boxes are superimposed with different confidence scores. [37]

pedestrians, road, sidewalk, vegetation, heavy vehicles, cars, cyclists, road signs, etc. With pixel-wise classification, the static and dynamic obstacles are not missed, and objects of the same class are identified with greater resolution. The neural network used here has a unique architecture compared to the previous tasks: it has to down-sample the input video frame and then up-sample the compressed feature map to overlay the generated masks on the original image. This model is called the U-net [11], which was first used in biomedical applications to detect and segment tumours. It performed well in the self driving car community [40], and nowadays a modified version of the U-net architecture is used to get better performance and handle memory constraints.

This approach captures every single pixel-level detail, giving a higher resolution perception, but 3-dimensional data like depth information is still not captured.

3.5 Depth detection

This is the final piece of the puzzle: all the 2D information has been obtained, and the only missing part is 3-dimensional data, in our case the depth. Humans perceive depth thanks to stereo vision, the ability to see the same object with both eyes; the brain computes the disparity to find the depth precisely.


Figure 3.8: Semantic segmentation, where every single pixel of the image is classified.

Figure 3.9: Depth detection, displayed in the form of a heat map based color mask. [41]

For a long time, different sensors were used to calculate depth, or two cameras had to be used for stereo vision. Lidars are used in modern self driving cars in order to achieve very high accuracy [42]. These sensors were


however expensive solutions, so in this work depth is instead estimated from a single camera; in addition, a second neural network called posenet [44] is added to improve the depth perception.

So far, the image data has been semantically segmented and the depth information obtained. Thus, as much 3D information as possible is extracted from a single camera.


Chapter 4

Design and Implementations

In this chapter the details of the implementation are discussed: the hardware used, the framework used, the datasets, the various networks that were tested, and the final chosen networks.

4.1 Hardware used

The hardware used for the implementation was an HP ZBook laptop with 16 GB of RAM and 12 GB of Nvidia graphics memory (Quadro 1000); as a Quadro GPU it actually has 4 GB of dedicated memory, with the remaining 8 GB acting as high-speed RAM and a buffer. This was used during the first half of the project, until June; a Dell Inspiron with 16 GB of RAM and a 2 GB Nvidia GTX 960M graphics card was used for the rest of the project. So most of the training was carried out on the CPU. The results were consistent, even though the models took a long time to train.

Figure 4.1: The GPU used

4.2 Framework and platform

PyTorch (https://pytorch.org/) was chosen as the framework after comparing the inference speed (Fig. 4.3), GPU usage (Fig. 4.4) and memory utilization. PyTorch is said to be faster than TensorFlow 2; it feels like working in plain Python and offers far more hackability than TensorFlow. One more advanced library is used on top of it: fastai (https://www.fast.ai/), founded by Jeremy Howard and Rachel Thomas. It packs the latest techniques and research papers, implemented in a refined hierarchy (Fig. 4.5), along with advanced features like the learning rate finder, stochastic gradient descent with restarts, and cosine annealing.

Figure 4.2: Platform and library used

Figure 4.3: Inference speed [45], showing that resnet has faster inference speeds.



Figure 4.4: GPU utilization [45]: with PyTorch, the performance is achieved with less GPU compute power.

4.3 Overview

In the following sections, the various tasks attempted are described in detail. First comes the classification task, where multiple variables are extracted, trained on and benchmarked. This is followed by the object detection task: choosing the datasets, choosing the best architecture and training it. The semantic segmentation task involves pixel-wise processing of the input visuals, classifying every pixel into a specific class without missing any information. The depth estimation task brings in the 3-dimensional information: the disparity of the static and dynamic objects in the visuals.

4.4 Variables obtained (Classification task)

The variables that could add value to the services provided by trucking and logistics firms were studied, taking into consideration how they can fulfil individual user requirements. Operational variables like rolling resistance and weather are partly used to understand the driving conditions and to recommend customer-specific maintenance plans, maintenance contracts and other services that are newly added every year. Apart from providing these services to the end user, the operational variables can also be used to understand the cause of repairs and of defects due to wear and tear, by measuring the rolling resistance and road condition. But certain quantities


Figure 4.5: Fastai package structure

were hard to measure precisely with physical sensors: rain, daylight, traffic density or road condition. So this thesis proposes using the visual perception of the truck, namely the dash-cam feed, to extract such quantities that are hard to capture with traditional sensors.


4.4.1 Variables

Road condition

The road condition is measured to find the vertical and horizontal forces.

These help to understand the dynamic forces that affect the lifetime of the axles, chassis installations and the suspensions. The classes picked are smooth, uneven, very uneven and off-road.

Rolling resistance

Rolling resistance measures the horizontal forces experienced by the tyres. It involves high torque transfer to the drive-line, which in turn affects fuel economy and mechanical wear. The classes chosen here are very hard, hard, soft and off-road.

Weather

Weather is a new variable included in this research. Weather plays a crucial role in identifying the problem in case of wear and tear or sudden breakdown during operation. The classes are sunny, rainy, cloudy and snow.

Topography

Topography is a relative measure of the height difference between sea level and the road; the important factors are the frequency of steep inclines and their height. This variable is significant because it directly affects the torque, which directly impacts fuel economy, driving performance and long-term durability. The classes are flat, hilly and very hilly.

4.5 Classification task

Classification was one of the first breakthroughs of deep learning, and the research went so well that in certain computer vision applications neural networks perform at a superhuman level. For the classification task, we start with data collection and then dive into the selection of the architecture and the various ways to evaluate the network's performance. The entire pipeline of the classification task is pictured in Fig. 4.10.

4.5.1 Data collection

The datasets were collected from public sources like Kaggle (https://www.kaggle.com/datasets) and from educational institutions; the specific dataset used is KITTI (http://www.cvlibs.net/datasets/kitti/). For certain classes, like the off-road class, it was hard to get data, so images had to be scraped from Google to fill up those classes. The scraping is done as follows: search for the keyword in Google (in our case "off-road"), scroll through 300 to 500 images, and then type a specific JavaScript command into the console, opened by pressing Ctrl + Shift + J. The command is:

Figure 4.6: JavaScript command to capture images

This command downloads the image links as a text file; the fastai download_images method is then used to download the images. One important disclaimer: disable ad blockers, if any.
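A minimal sketch of this download step, assuming a fastai v1-style download_images API and hypothetical file and folder names:

```python
from fastai.vision import download_images, verify_images  # fastai v1-style API

# 'offroad_urls.txt' is the hypothetical text file saved from the browser console
download_images('offroad_urls.txt', 'data/road_condition/offroad', max_pics=500)

# Drop files that failed to download or cannot be opened as images
verify_images('data/road_condition/offroad', delete=True)
```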

4.5.2 Choosing the perfect neural network

The search for a suitable neural network started with the ImageNet competition: the AlexNet [30] of 2012 was tested first. Following the chronology of newer architectures, the Inception network [29] from Google (2014) was tested, followed by the Resnet [28] by Facebook AI (2015), and then some modern architectures of 2017 like Squeezenet [31] and Shufflenet [35]. The networks were compared in terms of accuracy and model size in Tables 5.1 and 5.2. This was also corroborated by Stanford's DAWN benchmark in Fig. 4.7 [46]. It clearly pointed out that the Resnet [28] was far superior to the other networks in terms of accuracy.

Figure 4.7 also shows the top-ranked models from large corporations; resnet proved to be the winner. The size of the network is also quite moderate compared to bigger networks like VGG 19 [29]. The resnet architecture (Fig. 4.8) is



Figure 4.7: DAWN benchmark [46]: resnet shows strong inference results and appears in the top-5 leaderboards from different corporations across multiple applications.

superior because of its skip connections, which help retain information along the depth of the network all the way to the final neurons. The different types of resnet are shown in Fig. 4.9; resnet 34 is used in our tests. Figures 4.11 and 4.12 show the mid-phase training for the first two variables, along with the training time.

4.5.3 Evaluation criteria

The evaluation is performed under situations that might be tricky for the neural network to predict: the network is tested under different weather conditions, with blurred images that are not clear, and under different light settings (Figs. 5.4, 5.5, 5.7, 5.8). Technically, the neural network was tested with poor quality images in order to be prepared for different camera gear. To check whether it detects small or far-away objects, it is tested with images at different scales.


Figure 4.8: Resnet architecture [28] showing skip connections, which help to retain the information from the previous feature map.

Figure 4.9: Resnet layers [28]

4.5.4 Analysis and Training for the four variables

Training starts from a pre-trained network, because this saves training time: most of the primitive, low-level features like colors, textures and lines have been learned beforehand. The model is initially trained for 10 epochs on the prepared dataset, which is split 80 percent for training and 20 percent for testing and validation. The later, fully connected layers learn the high-level features, i.e. the variables themselves; the network is then unfrozen and trained for another 5 epochs, which connects the low-level features of the frozen layers to the high-level features learned by the end neurons, improving accuracy, reducing loss and adapting the network to the new problem.

An intermediate function used is the learning rate finder (Fig. 4.15), an algorithm that helps in choosing the optimal learning rate for the specific


Figure 4.10: The pipeline: multiple networks are placed in the ECU and the results are aggregated into the cloud and visual plots.

Figure 4.11: Training phase, road condition: the network is trained for 10 epochs and the confusion matrix shows its performance.

dataset. A learning rate slicer is then used, which ensures unique learning


Figure 4.12: Training phase, rolling resistance: the network is trained for 10 epochs and the confusion matrix shows its performance.

rates along the depth of the network, with the learning rate annealed over the course of training; this is called cosine annealing / LR scheduling (Fig. 4.13). Training also uses a technique called stochastic gradient descent with restarts (Fig. 4.14), which ensures that the network does not get caught in a local minimum, since most real-world data is noisy and non-linear in nature. Together, these techniques help the optimizer settle in the global minimum.
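Put together, the training procedure described above can be sketched with a fastai v1-style API as follows; the paths and the learning rate bounds are illustrative:

```python
from fastai.vision import *  # fastai v1-style API

data = ImageDataBunch.from_folder('data/road_condition', valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224)
learn = cnn_learner(data, models.resnet34, metrics=accuracy)  # ImageNet pre-trained

learn.fit_one_cycle(10)   # train the head; the one-cycle policy anneals the LR
learn.unfreeze()          # make the earlier, frozen layers trainable too
learn.lr_find()           # learning rate finder: sweep LRs against the loss
learn.recorder.plot()     # pick a value on the steepest downward slope
learn.fit_one_cycle(5, max_lr=slice(1e-5, 1e-3))  # sliced LRs along the network depth
```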

4.6 Object detection task

This technique detects the semantic objects that we need as classes from the input visual data, using anchor boxes and confidence scores. It works at real-time speeds; an important example of such an architecture is described below.


Figure 4.13: Cosine annealing / LR scheduling [47] helps to reach the global minimum faster.

Figure 4.14: Stochastic gradient descent with restarts [47] helps the network reach the global minimum while avoiding local minima.

4.6.1 Data collection

The dataset used was the COCO dataset of 2014 (https://cocodataset.org/home). It has around 80 classes, with around 80,000 images for training and around 40,000 images for validation. It was sponsored by Microsoft, Facebook AI and a few other academic institutions; the size of the dataset is around 19 GB. The main idea is to detect objects relevant for self driving vehicles, or any vehicle for that matter. The classes considered are person, bicycle, car, motorcycle, bus, truck, traffic light, stop sign, etc.


Figure 4.15: The learning rate finder [48] graphically shows the optimal learning rate.

Figure 4.16: YOLO v3 architecture [36], which consists of skip connections and scaling of the input images to different sizes, helping prediction performance.


4.6.2 Choosing the network

The object detection task was performed with YOLO v3 (Fig. 4.16), a modern architecture with very fast inference times and really good detection accuracy. The special features of YOLO are: detection at different scales, which also makes it really good at detecting objects very far away; more bounding boxes and more anchor boxes, enabling detection of more objects close to each other; and a cross-entropy based loss function able to handle multi-label classification problems. The network was pre-trained on the ImageNet dataset (Fig. 4.17). The final trained network was 240 MB in size.

Figure 4.17: Pre-training on large-scale datasets [27] like ImageNet is preferred, as it saves training time and cost.

The problem with this technique is that it fails to separate instances that are close together. In Figure 4.19, some of the cars on the left side of the image are labelled as a single car due to this detection inefficiency, and the class person is misclassified where there is no person.
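Part of this crowded-scene failure comes from how overlapping anchor-box predictions are merged: non-maximum suppression keeps the highest-confidence box and discards neighbours whose IoU with it exceeds a threshold, which can also discard genuinely distinct but tightly packed objects. A minimal sketch of IoU and NMS (not YOLO's exact implementation):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, threshold=0.5):
    """Keep the highest-scoring box, drop neighbours that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < threshold]
    return keep
```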

4.7 Semantic segmentation task

Pixel-wise classification is generally preferred to overcome the missed classifications of the previous techniques. This method is able to obtain the maximal amount of visual detail from the input image.



Figure 4.18: Training phase of the YOLO network, showing per-class performance.

Figure 4.19: Tested under traffic conditions; the misclassifications of the class person can be seen.


4.7.1 Data collection

Semantic segmentation is a pixel-wise classification technique, and the dataset used for this task is the Cambridge-Driving Labeled Video Database (CamVid, http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/), which has more than 700 images with labelled masks. The dataset is sufficient to train a neural network on a PC, but there are also tools available online to create custom datasets with masks for specific tasks; these tools are open source and available for academic research purposes.

4.7.2 Choosing the network

The fastai library is used once again for this task. The architecture is a modified version of the U-net: a residual neural network is used in the U-net fashion on the down-sampling side and expanded (up-sampled) on the other end, and transforms are applied to the input images in order to augment the dataset.

Figure 4.20: Output predictions compared to the ground truth on the validation dataset.

The network is trained for 10 epochs; then the network is unfrozen and retrained for another 13 epochs, reaching an accuracy of 92 percent (Fig. 4.20).
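A sketch of this segmentation training, assuming a fastai v1-style unet_learner and a CamVid-style folder layout; the paths, class codes and batch size are illustrative:

```python
from fastai.vision import *  # fastai v1-style API

codes = ['road', 'sidewalk', 'car', 'pedestrian', 'vegetation', 'sky']   # CamVid-style classes
get_mask = lambda x: x.parent.parent/'labels'/f'{x.stem}_P{x.suffix}'    # hypothetical mask naming

src = (SegmentationItemList.from_folder('data/camvid/images')
       .split_by_rand_pct(0.2)
       .label_from_func(get_mask, classes=codes))
data = (src.transform(get_transforms(), size=360, tfm_y=True)  # transforms applied to masks too
           .databunch(bs=4)
           .normalize(imagenet_stats))

learn = unet_learner(data, models.resnet34)  # resnet encoder inside a U-shaped network
learn.fit_one_cycle(10)                      # initial training
learn.unfreeze()
learn.fit_one_cycle(13, max_lr=slice(1e-5, 1e-3))
```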


Figure 4.21: Training phase for 13 epochs, until the model starts to overfit.

4.8 Depth detection task

The 3-dimensional depth is calculated from a monocular image using the same U-net architecture, together with a separate neural network that measures the pose of the objects in the image.

4.8.1 Data collection

The dataset used here is the KITTI dataset, with a size of 175 GB. It is recommended to convert the data from png to jpeg in order to save loading time. For every input image, the corresponding depth map image was used as training data. The depth is estimated with a loss function that computes the relative depth from the depth labels:

L = μ · Lp + λ · Ls

where Lp is the per-pixel photometric loss, μ is a per-pixel mask, and Ls is the smoothness term with weight λ = 0.001.
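A sketch of how this combined loss could be computed in PyTorch, following the edge-aware smoothness term common in monocular depth estimation papers; the tensor shapes (batch, channel, height, width) and the exact photometric term are assumptions, not the thesis implementation:

```python
import torch

def smoothness_loss(disp, img):
    """Ls: penalize disparity gradients, except where the image itself has edges."""
    disp = disp / (disp.mean(dim=[2, 3], keepdim=True) + 1e-7)  # mean-normalize disparity
    d_dx = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    d_dy = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    i_dx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()

def total_loss(photometric, mask, disp, img, lam=0.001):
    """L = mu * Lp + lambda * Ls, with per-pixel mask mu and photometric loss Lp."""
    return (mask * photometric).mean() + lam * smoothness_loss(disp, img)
```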

4.8.2 Choosing the network

Figure 4.22: The U-net architecture and posenet [41] form the primary networks, combined in parallel.

The architecture used is the same as in the semantic segmentation task: the U-net (Fig. 4.23). Apart from the U-net, there is a posenet (Fig. 4.22) that calculates the pose of the images; these are passed through a projection where occlusions are removed by a custom loss function, and finally through the up-sampling network to match the size of the input image. The model was trained for almost a day and weighed around 100 MB.


Figure 4.23: U-net architecture [49] with the skip connections that help retain information from the down-sampling path to the up-sampling path, improving predictions.


Chapter 5

Results

In this chapter, the various results and progress obtained so far will be discussed.

5.1 Results of Classification task

Road condition:

The first variable is road condition, with the classes off-road, smooth, uneven and very uneven. The networks experimented with are Alexnet, Inception, Darknet, Resnet, Mobilenet, Shufflenet and Squeezenet. According to Table 5.1, the Resnet (loss 0.20) proved better than the other networks overall, even though its size is bigger than Squeezenet's (21 MB); the Inception loss (0.19) was the lowest.

Table 5.1: The various models trained

Neural network    Loss value    Model size [MB]
Inception         0.19          152
Resnet            0.20          134
Shufflenet        0.30           31
Squeezenet        0.25           21
Mobilenet         0.29           34
Darknet           0.36           80
Alexnet           0.43          225

The dataset had 891 images for training and 222 images for validation. The chosen neural network is trained for 15 epochs continuously, and then the weights are saved.


Figure 5.1: The LR finder function helps to choose the optimal learning rate for reaching the global minimum within 4 epochs.

Figure 5.2: The plot used to pick the optimal learning rate; in our case the best value is 1e-03.

A research paper [48] on finding a dataset-specific learning rate is then used, and the candidate learning rates are plotted as a range. The best learning rate was chosen by observing the graph in Fig. 5.2, and this learning rate is then sliced across the entire neural network, i.e. different learning rates are chosen at different depths of the network using the LR slice function. The network is then trained once again with these unique learning rates, and


the loss goes down by a further 0.02 to 0.03 points. Finally, a loss value of 0.17 was obtained after 20 epochs, and the confusion matrix in Fig. 5.3 portrays the results.

Figure 5.3: The confusion matrix; the numbers in each square represent the number of images assigned to each class, compared to the ground truth.

Fig. 5.3 clearly shows that the model learned the smooth and off-road classes much better, and finds it harder to distinguish between the uneven and very-uneven classes. The difference between those two classes is not that significant, and it was even harder to label and prepare datasets that distinctly separate them. The test result is shown in Fig. 5.4.

Rolling resistance:

The rolling resistance variable has the classes hard, off-road, soft and very-hard. The dataset had 850 images for training and 212 images for validation. The architectures tested are the same as for the previous variable. Table 5.2 shows the results obtained: the Squeezenet loss of 0.25 is very close to the Resnet loss of 0.22, and the Resnet was chosen because it had the lowest loss, even though its model size is larger than Squeezenet's. Squeezenet and Inception were closest to the Resnet results. A sample result is shown in Fig. 5.5. The empirical observations from Tables 5.1 and 5.2 confirm that Resnet achieves higher performance when compared to


Figure 5.4: Road condition variable tested with the resnet 18 architecture.

Table 5.2: The various models trained

Neural network    Loss value    Model size [MB]
Inception         0.28          152
Resnet            0.22          134
Shufflenet        0.33           31
Squeezenet        0.25           21
Mobilenet         0.27           34
Darknet           0.33           80
Alexnet           0.39          225

the other neural network architectures (as indicated in Fig. 4.7).

Table 5.3: The results

Neural network        Loss value    Model size [MB]
Topography (Resnet)   0.27          134
Weather (Resnet)      0.17          134
Daylight (Resnet)     0.001         134


Figure 5.5: Rolling resistance variable tested with the resnet 18 architecture.

Weather, Topography and Daylight variables:

The remaining variables were also solved with resnets, since they had the lowest loss values and moderate model sizes compared to the other architectures.

The datasets come from several different public sources, because some classes like off-road, rainy, snow and a few others were not available in a single complete dataset; the datasets used are detailed in the previous chapter. The weather dataset had 1611 training images and 402 validation images with the classes cloudy, rainy, foggy, sunny and snowy. The daylight dataset had 25254 training images and 6313 validation images with the classes day and night. The final variable, topography, had the classes flat, hilly and very hilly, with 979 training and 244 validation images. The performance results are shown in Fig. 5.6, and sample test images in Figs. 5.7 and 5.8.

5.2 Object detection

Classification fails to capture some variables, like static and dynamic obstacles and depth, so the object detection technique is used here. The COCO 2014 dataset was used to train the YOLO v3 network, with classes mainly chosen for the self driving car environment; it was chosen because it contains almost every scenario required for the self driving task. The classes and their


Figure 5.6: Confusion matrix results for the weather, topography and daylight variables with the resnet architecture.

Figure 5.7: Weather tests: blurred images tested with the pre-trained resnet 18 architecture.

trained results are given below (Fig. 5.9).

The network was trained using the Google Cloud Platform free credit program. The model was trained until the mAP value was close to 1, reaching around 0.9, as shown in Fig. 5.10. It took around 6 to 8 hours of training time on the 16 GB Dell laptop. The precision score was around 0.70 on the validation dataset, and the recall around 0.95, as shown in Fig. 5.10.

The trained models were then tested with images, and the results are shown in Figs. 5.11 and 5.12. Figure 5.12 shows the misclassification of the class person: observing the ground truth, the class is absent in the center of the image; and of the three vehicles on the left side, only one


Figure 5.8: Daylight variable tested with resnet 18 architecture.

Figure 5.9: The test scores the network was able to achieve for the various classes.

is identified. This issue occurs when many objects of the same class appear in a crowded manner; these inefficiencies can be resolved by the semantic segmentation technique.


Figure 5.10: The performance of the network measured with different metrics like mAP and F1 score. The network also shows good recall scores, over 0.8. The orange curves show the training performance and the blue curves the validation performance.


Figure 5.11: Trucks detected by the YOLO network in bright lighting conditions, which are considered easy test conditions.

Figure 5.12: Tested in bright light and traffic conditions; the third image shows the shortcomings of this technique, where it misclassifies cars as a person.

5.3 Semantic segmentation

The input images are kept separate from the labels.

Figure 5.13: The training phase, until the model starts to over-fit.

The training was performed for 10 epochs as an initial start; it took almost 5 minutes on the 16 GB Dell laptop, finishing quickly since the network starts from ImageNet pre-trained weights. The loss was about 0.32 at this stage.

The network is then unfrozen and trained again while monitoring the losses to prevent over-fitting, stopping at 15 epochs with a loss of about 0.22.

One more up-scaling step is then performed, and the network is trained for another 10 epochs. This reduces the loss to about 0.18.

The results in Fig. 5.14 show that performance on the validation set is similar to the ground truth, and Fig. 5.15 shows that the flaws of the previous technique, misclassification and missed instances of the same class in a crowded environment, are solved. But the model needs a more diverse


Figure 5.14: Test results 1, showing the masks superimposed over the original image to check the predictions. In some cases the network fails to detect the cyclist and occasionally classifies them as road.

dataset so that it can generalize to any given data, though with reduced accuracy scores.


Figure 5.15: Tested in bright light and traffic conditions, where every single pixel is classified by the U-net network.


5.4 Depth estimation

The training process starts with two networks trained in parallel: the modified U-net and the pose network. Occlusions are then removed by per-pixel minimum re-projection, where occlusions are matched before the loss is calculated, which gives sharper images; finally the predicted depth is up-sampled to match the input image size.

Figure 5.16: Test results 1, showing the ability of the network to detect the contours of the environment.

Figure 5.17: Test results 2, showing that the algorithm fails to detect the environment that is far away.

The inputs were the raw KITTI dataset, which contains monocular and stereo images. Training on both the monocular and stereo images took almost 24 hours on the 16 GB Dell laptop. The final loss was around 0.187.

The final test images are shown in Figs. 5.16, 5.17, 5.18 and 5.19; the disparities are stored as a separate NumPy file along with the output image.
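Such a stored disparity file can be inspected with a few lines of NumPy/matplotlib; the file name here is hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt

disp = np.load('frame_0001_disp.npy')     # hypothetical output file name
plt.imshow(disp.squeeze(), cmap='magma')  # heat-map style mask, as in the test figures
plt.colorbar(label='relative disparity (unitless)')
plt.axis('off')
plt.show()
```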

5.5 Summary

The multi-class classification reached a loss value of 0.001 for the daylight variable, thanks to proven neural network architectures like the residual


Figure 5.18: Test results 3: the depth of nearby trucks/objects is obtained in higher detail.

Figure 5.19: Test results 4: the depth of nearby trucks/objects is obtained in higher detail, but the depth of the surrounding environment fails to be detected.

neural network and the squeeze network. If the datasets were increased to several thousand images, and the depth of the neural networks experimented with by adding more layers to the existing 18 or 34 layers for each variable, there would be a chance of improved performance under versatile conditions. Object detection has rich dataset sources like ImageNet and COCO, which include hundreds of classes with diverse images in different settings, and the models could be trained further for improved performance. The chosen architecture, YOLO v3, is well proven and sufficient to handle such loads of data. Semantic segmentation has one of the best results, with a loss of 0.18; the modified version of U-net with residual neural networks helped retain information thanks to the skip connections. Depth detection, which used a modified version of the semantic segmentation architecture, seemed to work flawlessly; however, its precision is not yet close to that of hardware like lidar.


Chapter 6

Discussion and Analysis

The idea of extracting useful variables like the weather, road condition, depth, and static and dynamic obstacles through the visual perception of the vehicle was carried out. The multi-class classification was able to measure the conditions in which the vehicle is driven, but it was too abstract and could not capture finer details like detecting and classifying obstacles, so the object detection technique was used to track the static and dynamic obstacles; it was able to detect even small classes in a cluttered environment. The literature study showed the drawbacks of single shot detectors and Faster RCNN, proven by their failure to detect classes that are very small or numerous, so YOLO was used: it is trained to detect at different image scales and did not miss classes in the above-mentioned situations. But this technique still had drawbacks, like misclassifying objects and missing classes on random occasions. So a superior technique called semantic segmentation was used, where the image is classified pixel-wise; this way, classes are not missed at all. This technique solved the pre-existing issues and reached a good accuracy of 92 percent. The variables measured from vision so far are mostly 2D information; an attempt to provide 3D information, in this case the depth, was performed for every frame.

Semantic segmentation outperformed object detection in terms of not missing classes, but it takes a longer time to train since it classifies every single pixel of the image, so modern architectures like transformer based object detection or YOLO v5, an advanced object detection technique, can be tested in the future for better performance. Depth estimation still remains a separate algorithm from the other detection techniques; hence, if object detection and depth estimation could be combined into a single network, further efficiency could be gained.

6.1 Ethics and sustainability

The entire project required only one type of hardware sensor (the camera, at multiple resolutions: 480p, 720p and 1080p final image quality), so it can replace the use of multiple sensors across different applications and use cases.

Moreover, a camera is less expensive, and multiple variables are measured in one go and stored digitally, so this solution can easily be scaled up and used as a sustainable and effective solution in the logistics and transportation industry. Choosing pre-trained models cuts down training time, saving energy and costs [50]. The entire project was conducted and tested on a laptop with an Intel i7 octa-core chipset, 16 GB of RAM and a 2 GB GTX 960M graphics card, which avoided a lot of hardware usage in prototyping and thereby reduced the carbon footprint considerably.

6.2 Future work

The research and experimentation performed here can be extended with new techniques like YOLO v5 and transformer based neural networks; once again, software based solutions will improve toward performance comparable to hardware sensors. The latest research on object detection uses techniques like transformers, an attention based approach that gives superior performance. The semantic segmentation and depth perception approaches could have performed better if datasets from the KITTI database and other public datasets were incorporated (different datasets help with generalization). To eliminate the use of labelled data in supervised learning, reinforcement learning could be applied to take high-level decisions instead of measuring individual variables; further academic research into reinforcement learning techniques might change the way current algorithms learn and perform. Testing the model on embedded hardware like a Raspberry Pi or any system-on-chip based device would validate the real-time performance. Some of the final parts of training were done on a low-compute laptop; with better hardware, the performance of the networks would improve. It is my hope that readers who come across this thesis will be able to understand the various vision problems that can be solved with deep learning.


Chapter 7

Conclusions

This thesis concludes by showing that deep learning techniques combined with a single camera can extract useful operational parameters related to the operation and maintenance of a vehicle. By using the visual perception data from the dashboard camera of a truck, this thesis was able to derive parameters of the driving environment like depth perception, road condition, rolling resistance, weather and topography. The extracted variables are comparable to dedicated sensors in performance, with the capability to outperform the exclusive hardware. This performance can be attributed to improved deep learning techniques like resnets, CNNs and semantic segmentation, and to the quality of the visual perception data (from higher resolution, better quality cameras).

The resnet architecture proved to be a better choice for extracting road condition, rolling resistance, weather, topography and daylight when compared to Alexnet, Squeezenet and other neural network classification architectures. Similarly, for obstacle extraction, semantic segmentation provided better results than YOLOv3 object detection. U-Net combined with posenet was able to estimate depth in close to real-time conditions.

With the techniques used in this thesis, it is possible to save expenses on hardware and reduce the overall carbon footprint (from driving) without compromising real-time predictions. This can already be seen in existing cars like Tesla, where each software update pushes the automotive sector closer to complete electrification and a reduction in high initial hardware costs.


References

[1] G. de Prato and J. P. Simon, "Is data really the new 'oil' of the 21st century or just another snake oil?" in Conference of the International Telecommunications Society, San Lorenzo de El Escorial, Spain, Jun. 2015. [Online]. Available: https://www.researchgate.net/publication/292382059_Is_data_really_the_new_oil_of_the_21st_century_or_just_another_snake_oil_Looking_at_uses_and_users_privatepublic_Giuditta_de_Prato_Jean_Paul_Simon

[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: https://medium.com/deeplearningbook

[3] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv preprint arXiv:1804.02767, Apr. 2018. [Online]. Available: https://arxiv.org/abs/1804.02767

[4] "Metrics to evaluate your machine learning algorithm (loss, mean, F1 score)," Towards Data Science, Feb. 2016. [Online]. Available: https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234

[5] J. Zhang, S. H. Yeung, Y. Shu, B. He, and W. Wang, "Efficient Memory Management for GPU-based Deep Learning Systems," arXiv preprint arXiv:1903.06631, Feb. 2019. [Online]. Available: https://arxiv.org/pdf/1903.06631.pdf

[6] NVIDIA, "NVIDIA Developer Blog," Feb. 2019. [Online]. Available: https://devblogs.nvidia.com/parallelforall/inference-next-step-gpu-accelerated-deep-learning/

[7] N. P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," arXiv preprint, Apr. 2017. [Online]. Available: https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

[8] S. Han, H. Mao, and W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," arXiv preprint arXiv:1510.00149, Feb. 2016. [Online]. Available: https://arxiv.org/abs/1510.00149

[9] R. Geirhos, D. H. J. Janssen, H. H. Schütt, J. Rauber, M. Bethge, and F. A. Wichmann, "Comparing deep neural networks against humans: object recognition when the signal gets weaker," arXiv preprint arXiv:1706.06969, 2017. [Online]. Available: https://arxiv.org/abs/1706.06969

[10] L. Fridman et al., "MIT Advanced Vehicle Technology Study: Large-Scale Naturalistic Driving Study of Driver Behavior and Interaction with Automation," arXiv preprint arXiv:1711.06976, Aug. 2019. [Online]. Available: https://arxiv.org/pdf/1711.06976.pdf

[11] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," arXiv preprint arXiv:1505.04597, May 2015. [Online]. Available: https://arxiv.org/abs/1505.04597

[12] "The differences between artificial and biological neural networks," Towards Data Science, Dec. 2019. [Online]. Available: https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks

[13] B. Mehlig, "Artificial Neural Networks," arXiv preprint arXiv:1901.05639, Jan. 2019. [Online]. Available: https://arxiv.org/abs/1901.05639

[14] E. Ponomarev, I. Oseledets, and A. Cichocki, "Using Reinforcement Learning in the Algorithmic Trading Problem," arXiv preprint arXiv:2002.11523, Feb. 2020. [Online]. Available: https://arxiv.org/abs/2002.11523

[15] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.

[16] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. New Jersey: Prentice Hall, 1999.

[18] B. Wei, X. Sun, X. Ren, and J. Xu, "Minimal Effort Back Propagation for Convolutional Neural Networks," arXiv preprint arXiv:1709.05804, 2017. [Online]. Available: https://arxiv.org/pdf/1709.05804.pdf

[19] K. O'Shea and R. Nash, "An Introduction to Convolutional Neural Networks," Dec. 2015. [Online]. Available: https://www.researchgate.net/publication/285164623_An_Introduction_to_Convolutional_Neural_Networks/link/5670a4c908aececfd55331e7/download

[20] F. Battaglia, "Tensors: A guide for undergraduate students," 2013. [Online]. Available: https://www.researchgate.net/publication/258757260_Tensors_A_guide_for_undergraduate_students

[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems (NIPS), Dec. 2012. [Online]. Available: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

[22] A. Berg, J. Deng, and L. Fei-Fei, "Large scale visual recognition challenge," Dec. 2010. [Online]. Available: www.imagenet.org/challenges

[23] "Understanding neural network neurons," Fintech Explained, Medium, Dec. 2018. [Online]. Available: https://medium.com/fintechexplained/understanding-neural-network-neurons-55e0ddfa87c6

[24] K. O'Shea and R. Nash, "An Introduction to Convolutional Neural Networks," Nov. 2015. [Online]. Available: https://www.researchgate.net/publication/285164623_An_Introduction_to_Convolutional_Neural_Networks

[25] V. Nair and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines," in Proceedings of ICML, 2010. [Online]. Available: https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf

[26] C. E. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, "Activation Functions: Comparison of Trends in Practice and Research for Deep Learning," arXiv preprint arXiv:1811.03378, Nov. 2018. [Online]. Available: https://arxiv.org/pdf/1811.03378.pdf

[27] "A beginner intro to convolutional neural networks," Medium, Dec. 2019. [Online]. Available: https://medium.com/@purnasaigudikandula/a-beginner-intro-to-convolutional-neural-networks-684c5620c2ce

[28] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, Apr. 2015. [Online]. Available: https://arxiv.org/pdf/1409.1556.pdf

[29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," arXiv preprint arXiv:1409.4842, Sep. 2014. [Online]. Available: https://arxiv.org/abs/1409.4842

[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," arXiv preprint arXiv:1512.03385, Dec. 2015. [Online]. Available: https://arxiv.org/abs/1512.03385

[31] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv preprint arXiv:1602.07360, Feb. 2016. [Online]. Available: https://arxiv.org/abs/1602.07360

[32] R. Iqbal, T. Maniak, F. Doctor, and C. Karyotis, "Fault detection and isolation in industrial processes using deep learning approaches," IEEE Transactions on Industrial Informatics, vol. 15, no. 5, pp. 3077–3084, 2019. doi: 10.1109/TII.2019.2902274

[33] M. Wang and W. Deng, "Deep Face Recognition: A Survey," arXiv preprint arXiv:1804.06655, Aug. 2020. [Online]. Available: https://arxiv.org/pdf/1804.06655.pdf

[34] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," arXiv preprint arXiv:1512.02325, Dec. 2015. [Online]. Available: https://arxiv.org/abs/1512.02325

[36] R. Girshick, "Fast R-CNN," arXiv preprint arXiv:1504.08083, Sep. 2015. [Online]. Available: https://arxiv.org/abs/1504.08083

[37] "YOLO object detection with OpenCV," PyImageSearch, Nov. 2018. [Online]. Available: https://www.pyimagesearch.com/2018/11/12/yolo-object-detection-with-opencv/

[38] X. Liu, Z. Deng, and Y. Yang, "Recent progress in semantic image segmentation," arXiv preprint arXiv:1809.10198, 2018. [Online]. Available: https://arxiv.org/ftp/arxiv/papers/1809/1809.10198.pdf

[39] I. Ulku and E. Akagündüz, "A Survey on Deep Learning-based Architectures for Semantic Segmentation on 2D Images," arXiv preprint arXiv:1912.10230, Dec. 2019. [Online]. Available: https://arxiv.org/abs/1912.10230

[40] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in Proceedings of CVPR, 2015. [Online]. Available: https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Long_Fully_Convolutional_Networks_2015_CVPR_paper.pdf

[41] C. Godard, O. Mac Aodha, M. Firman, and G. Brostow, "Digging Into Self-Supervised Monocular Depth Estimation," arXiv preprint arXiv:1806.01260, 2018. [Online]. Available: https://arxiv.org/abs/1806.01260

[42] Z. Li and N. Snavely, "MegaDepth: Learning Single-View Depth Prediction from Internet Photos," arXiv preprint arXiv:1804.00607, Apr. 2018. [Online]. Available: https://arxiv.org/abs/1804.00607

[43] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, "Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving," arXiv preprint arXiv:1812.07179, Dec. 2018. [Online]. Available: https://arxiv.org/abs/1812.07179

[44] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization," arXiv preprint arXiv:1505.07427, May 2015. [Online]. Available: https://arxiv.org/abs/1505.07427

[45] Synced Review, "TensorFlow, PyTorch or MXNet? A comprehensive evaluation on NLP and CV tasks with Titan RTX," Apr. 2019. [Online]. Available: https://medium.com/syncedreview/tensorflow-pytorch-or-mxnet-a-comprehensive-evaluation-on-nlp-cv

[46] "Stanford DAWN benchmark," Apr. 2020. [Online]. Available: https://dawn.cs.stanford.edu/benchmark/

[47] I. Loshchilov and F. Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts," arXiv preprint arXiv:1608.03983, Aug. 2016. [Online]. Available: https://arxiv.org/abs/1608.03983

[48] L. N. Smith, "Cyclical Learning Rates for Training Neural Networks," arXiv preprint arXiv:1506.01186, Apr. 2017. [Online]. Available: https://arxiv.org/abs/1506.01186

[49] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," arXiv preprint arXiv:1505.04597, May 2015. [Online]. Available: https://arxiv.org/abs/1505.04597

[50] "Approach pre-trained deep learning models with caution," Comet ML, Medium, Feb. 2016. [Online]. Available: https://medium.com/comet-ml/approach-pre-trained-deep-learning-models-with-caution-9f0ff739010c
