
Vehicle Pose Estimation Using Machine Learning

Daniel Fennhagen and Thomas Johansson


Abstract

This work describes how a car pose estimator was created with the capability of estimating a car's position, orientation, and dimensions from an image. Our solution utilizes the Faster R-CNN architecture to detect car positions and the MultiBin architecture to estimate the orientation and dimensions of a car. The models are trained on generated images which mimic a database of images from several insurance companies. The system is evaluated on the challenging KITTI object detection benchmark using the official metric for 3D orientation estimation.

This work proposes a novel method of extracting car features using an estimated pose and high-quality 3D models of cars. While this method does not achieve state-of-the-art results, it shows that extracting individual features of an object is possible using the object's estimated pose.


Sammanfattning

This work describes how a machine learning model was created with the capability of estimating a car's position, orientation, and dimensions from an image. Our solution uses the Faster R-CNN architecture to detect car positions and the MultiBin architecture to estimate the car's orientation and dimensions. The models are trained on generated images which mimic a database of images from several insurance companies. We evaluate our system on the challenging KITTI object detection dataset using the official metric for 3D orientation estimation.

This work proposes a new method of extracting car parts using an estimated pose together with high-quality 3D models of cars. This method does not achieve state-of-the-art results, but it shows that extracting sub-objects is possible with the help of an object's pose.


Preface

The authors would like to extend their greatest thanks to their supervisor Erik Schaffernicht at Örebro University for the help and answers he has provided them during this project. The authors would also like to thank Henrik Andreasson for his expertise and assistance with transformations in multiple reference systems, and Dana Nilsson for continuous help with report writing.

A special thanks goes out to CAB and the authors' supervisor Christian Bjernesjö, both for making this project possible and for his continuous support throughout it.


Contents

Abstract
Sammanfattning
Preface
List of abbreviations
1 Introduction
1.1 Background
1.2 Project
1.3 Tasks
1.4 Work distribution
2 Theoretical background
2.1 Fundamentals of machine learning
2.2 State-of-the-art approaches
2.3 Synthetic data generation
3 Implementation
3.1 Design
3.2 Implementation
4 Results
4.1 Image generation
4.2 Car pose estimator
4.3 Feature extraction
5 Discussion
5.1 Image and label generation
5.2 Faster R-CNN object detector
5.3 MultiBin regressor
5.4 Feature extraction
5.5 Fulfillment of project tasks
5.6 Carrying through the project in a desirable way
5.7 Social and economic implications
5.8 The development potential of the project
6 Reflection on own learning
6.2 Skills and Abilities
6.3 Perception and Approach
7 Tools
7.1 Scrum
7.2 Software
7.3 Other resources


List of abbreviations

CAB – CAB Group AB
ANN – Artificial Neural Network
CNN – Convolutional Neural Network
R-CNN – Region based Convolutional Neural Network
RoI – Region of Interest
RPN – Region Proposal Network
RGB – Red Green Blue
MSE – Mean Squared Error
IoU – Intersection over Union
mAP – mean Average Precision
mAP@.50IoU – mean Average Precision at 50% Intersection over Union


1 Introduction

This chapter serves as an introduction to the project's background and purpose and explains why the project was created. It also gives a brief introduction to the company at which the project was conducted.

1.1 Background

The purpose of this project was to develop a machine learning model capable of detecting multiple car features, such as “front left door” and “rear bumper”. The project would result in a machine learning system that could be inserted into an already existing machine learning pipeline used to classify bodywork damage on vehicles. The additions to the existing pipeline are shown in Figure 1.

Figure 1: Where this project fits within CAB's pipeline; green arrows show the updated pipeline and the box marked in red is modified.

This report and the work that went into it have been conducted at the company CAB Group AB (CAB). CAB works with systems and services that help the automotive and real estate industries make repair cost calculations. As of today, the task of estimating the prices for vehicle bodywork damage is done manually by repair shop mechanics; however, since late 2016 CAB has worked to automate the detection of vehicle bodywork damage using computer vision. This work was intended to refactor several older machine learning systems that have been used to detect individual car features into a single system able to detect multiple car features from a single image.

As stated, CAB had previously developed machine learning models able to classify parts of a car, for example the front doors, back doors, and front bumper. However, a number of models were missing, and instead of creating new models for the individual parts missing from the pipeline, this project was created to replace those models with a single model able to identify any part of a vehicle using the vehicle's position, orientation and dimensions, which will be referred to as the vehicle's pose. This pose will be used together with a 3D model of a vehicle to extract individual features of the vehicle.

Out of the systems at CAB's disposal, two machine learning systems will be used in this work. The first is a car filter created by Victor Lindespång in [1]; this filter is used to remove images of dashboards, license plates and other documents that are sometimes photographed and thus exist in the database used in this system. The second is an object detector currently used by CAB to detect individual car features in images; while this system will not be implemented in this work, it will be modified and used to detect the position of a car in an image.

Finally, CAB has created an image generator which is used to generate training data for the existing machine learning models. This image generator will be modified so that it can be used as part of this work.

1.2 Project

As CAB had already experimented with segmentation networks to detect car features, the direction this project took was to investigate the possibility of detecting car features using the pose of a car. This direction was grounded in multiple considerations: CAB wanted to evaluate their results against a completely different solution, CAB was interested in retrieving the pose of a car for use in later stages of their pipeline, and the authors found this to be an interesting problem to solve.

There already existed an object detector at CAB's disposal, so this project will focus on estimating the orientation and dimensions of a car in the images from CAB's database, while the position of the car is calculated from the results of the object detector. This project's approach will be to use a machine learning model in which the dimensions and orientation are estimated. This is a common approach when estimating an object's state from an image [2, 3, 4, 5].

The images that will be processed by the system are of varying quality and show the car in different, difficult-to-detect poses. Figure 2 shows a sample of the images that CAB has collected and is trying to detect bodywork damage on. Some of these images are hard to estimate the pose on; noteworthy are the images in the rightmost column.


Figure 2: A sample of the images which will be fed to the system. License plates have been blurred out.

By examining these images, the conclusion can be drawn that roll and pitch can be excluded, as the images are captured horizontally from similar heights. This is exploited to simplify the problem, reducing it to estimating the dimensions and the yaw of the car. This was decided by the authors to ease the solution; however, it can be observed in the bottom image in the middle column that some images are angled downwards, which adds roll and pitch to the car relative to the camera. To get a better solution these factors must be considered. To train a machine learning system, training data is needed, and labeling real images with even one axis of rotation is a difficult task which has been explored in [6, 7]. Given this, the image generator at CAB's disposal will be modified so that images and their pose can be retrieved. The additions and removals from the image generator can be seen in Figure 3.

Figure 3: The modified image generator, white boxes represent the parts that already existed, red are removed, yellow are modified and green are new parts.

This work is introducing a novel method of retrieving a vehicle’s features using a 3D model of a vehicle and the estimated pose.


The work was split into two major parts. This was done to optimize the time spent on each part of the project and was based on the authors' previous knowledge of the subject; the task at hand is rather large for the time given, which was a major contributor to splitting the work into parts where each author could do as much as possible. The work was split into the following parts:

• Generating labeled images and preparing the data for training

• Building and training the machine learning system.

1.3 Tasks

Prior to this work, several tasks were set that were to be finished by the end of the project; for the project to be considered a success, these tasks would need to be completed.

1.3.1 Image and label generation

• Modify the existing image generator to generate images of cars in multiple poses with an extended label.

1.3.2 Machine learning

• Develop a machine learning model with the capabilities of estimating a vehicle's pose.

• Train the model on generated and real data.

• Analyze the different results and discuss what they mean.

1.3.3 Feature extraction

• Extract individual features of a vehicle using the predicted pose and a 3D model.

1.4 Work distribution

During this project the authors spent the first weeks primarily on information gathering, where both authors went as wide as possible because of their low knowledge of the field at the start of the project, and all information was shared between the authors.

After the information gathering stage each author specialized in their respective part of this project where Thomas worked in Unity with the image generation and labeling while Daniel created the machine learning model.

During the writing of the report each author spent most time on their specialization in the project, however each topic has been discussed and evaluated by both authors.


2 Theoretical background

This is the most information-heavy chapter of this report and includes the theoretical background for the project. The chapter can be referred back to when reading subsequent chapters to get a deeper understanding of the methods this project has used. It has been split into an introduction to machine learning and the state-of-the-art approaches to this problem.

2.1 Fundamentals of machine learning

This chapter exists as an introduction for the reader who is unfamiliar with machine learning. As the programme through which this work is being conducted does not currently teach machine learning courses, an introductory chapter on machine learning is required.

2.1.1 Machine learning

Machine learning is the science of programming computers to learn from data. Machine learning algorithms use labeled data to build a mathematical model which is used to make predictions on new data. This method has its strengths where traditional approaches are at their weakest, for example problems whose solution requires a large set of rules or a lot of fine-tuning. An additional example is a changing environment, to which machine learning can adapt by learning from new data [8, 9]. Figure 4 is a high-level view of what a machine learning system can look like.

Figure 4: A high-level view of a machine learning system, where W and b represent weights and biases respectively; these are the parts of a model which are updated to increase the accuracy of the model.

A model learns to generalize as it minimizes a loss function; a model that generalizes well will perform well on previously unseen data [8, 9].

Machine learning systems are normally categorized by how much supervision, or what kind of data, they get during training. These categories are supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning [9].

2.1.2 Supervised learning

In supervised learning, the training data that is passed to the model contains both the input and the desired output. This can be explained as learning from examples, where the goal is for the model to learn a set of rules that will make correct predictions on previously unseen data [10]. Two typical tasks for supervised learning are classification and regression.

2.1.3 Classification

The data passed to a classification model will consist of input features with the output as a class [9]. A classic example of classification is determining what kind of Iris flower one is looking at based on several datapoints like petal length. The model will then learn the different attributes of each class and be able to classify a new Iris flower [8].

The most common classification algorithms are the k-nearest neighbor algorithm and the decision tree classifier; both are easy to interpret but are sensitive to the structure of the input data [11].

2.1.4 Regression

Regression models will, like their classification counterpart, learn from data containing input and output features, although in this model the output features are continuous values [9].

2.1.5 Unsupervised learning

As the name suggests, the training data in unsupervised learning comes without output data. There are numerous algorithms used in unsupervised learning, normally categorized as clustering, anomaly detection, visualization and dimensionality reduction, and association rule learning algorithms [9]; details about those algorithms are outside this report's scope. Unsupervised learning is not commonly used on its own but rather as a prerequisite for other machine learning algorithms [9]. This can be seen in [12, 13], where a clustering algorithm is first used to fine-tune a number of reference shapes that are later used as the filter sizes for a convolutional layer. [14] does things a bit differently: instead of extracting the bounding box, k-means clustering is used to retrieve the silhouettes of the object. This is an interesting solution; however, in [14] the approach is to predict the pose in continuous video where the pose is heavily influenced by the previous frame's pose.

2.1.6 Artificial Neural Networks

Artificial neural networks (ANNs) are inspired by the brain and consist of neurons which are based on the biological cell. The first such neuron was proposed by McCulloch and Pitts [15], where neurons are described as binary threshold units and computed mathematically as follows:

y = \theta\left(\sum_{j=1}^{n} w_j x_j - u\right) \qquad (1) [15]

The neuron computes the weighted sum of its n input signals x_j, j = 1, 2, ..., n, and generates an output of 1 if the sum is above a certain threshold u, otherwise 0. Here θ is the unit step function and w_j is the weight of the jth input [16]. This neuron is visualized in Figure 5.

Figure 5: The neuron proposed by McCulloch and Pitts [15].

Since this neuron was proposed in 1943 it has been improved in many ways; most notably, the activation function can now vary between the piecewise linear, sigmoid, Gaussian or the now common ReLU activation functions [9, 17].
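As a concrete illustration of equation (1), the following Python sketch (not part of the original report) computes the output of a single McCulloch-Pitts neuron for some hypothetical inputs, weights and threshold:

import numpy as np

def mcculloch_pitts(x, w, u):
    # Output 1 if the weighted sum of the inputs is above the threshold u, else 0.
    return 1 if np.dot(w, x) > u else 0

# Hypothetical example: three inputs, two of them active.
x = np.array([1, 0, 1])
w = np.array([0.5, 0.5, 0.5])
print(mcculloch_pitts(x, w, u=0.8))  # prints 1, since 0.5 + 0.5 = 1.0 > 0.8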

ANNs are made up of layers of the above-mentioned neurons, which are connected, depending on the network type, to other neurons which together make up the network. ANNs can therefore be seen as weighted directed graphs in which neurons are nodes and the connections between the neurons are directed edges with weights, as can be seen in Figure 6. Depending on how a network is connected, ANNs can be grouped into feed-forward networks and recurrent networks [9, 16]. As their names suggest, feed-forward networks are connected from the first layer to the last, without loops, while recurrent networks can have loops.


Figure 6: Artificial Neural Network in a feed-forward configuration with 3 inputs, 1 hidden layer consisting of four neurons, and two output neurons.

ANNs are relevant to this problem because they can be used as regression models with higher complexity than what would be possible with simple machine learning algorithms, as explained in [18].
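For readers who want to see what such a network looks like in code, the following Keras sketch builds a feed-forward network with the same shape as the one in Figure 6 (3 inputs, one hidden layer of four neurons, two outputs); the activation functions and optimizer are illustrative assumptions, not taken from the report:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),            # 3 input features
    tf.keras.layers.Dense(4, activation="relu"),  # hidden layer with four neurons
    tf.keras.layers.Dense(2)                      # two output neurons (regression)
])
model.compile(optimizer="sgd", loss="mse")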

2.1.7 Deep learning

The fundamental definition of learning is difficult to formulate; a learning process in an ANN can be described as modifying a network's weights so that the network performs well on a specific task [9, 16]. The modification of the weights can be explained in a few steps.

• A batch of data is passed through a network. At each layer, the result of every neuron is calculated and then passed to the next layer.

• The result in the final layer is compared to the ground truth and a loss is calculated.

• Starting from the last layer, the contribution of each neuron of the previous layer is calculated using the chain rule (from calculus). This effectively is a gradient across all neurons within a layer.

• Previous step is repeated until the input layer is reached.

• Finally, the weights are tweaked using the gradient generated from previous steps.

These steps are part of the backpropagation training algorithm proposed in the groundbreaking paper [19], which is still used today. It is noteworthy that this algorithm does not work on neurons with the threshold activation function, as the gradient cannot be calculated on a flat surface [9]. Also worth noting is that the backpropagation training algorithm only works for feed-forward networks; a similar algorithm is used for recurrent networks [9].

To sum up the learning chapter: with the help of labeled data, an ANN will over many iterations converge at a local or global minimum where it has hopefully learned the underlying patterns so that it can make correct predictions on new data.
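A minimal numerical sketch of the weight-update loop described above, assuming a single linear neuron and a squared-error loss (an assumption made for illustration only, not the networks used later in this report):

import numpy as np

# One training example and a single linear neuron y = w * x + b.
x, y_true = 2.0, 3.0
w, b, lr = 0.5, 0.0, 0.1

for _ in range(100):
    y_pred = w * x + b                  # forward pass
    loss = (y_pred - y_true) ** 2       # squared-error loss
    grad_w = 2 * (y_pred - y_true) * x  # chain rule: dloss/dw
    grad_b = 2 * (y_pred - y_true)      # chain rule: dloss/db
    w -= lr * grad_w                    # tweak the weights using the gradient
    b -= lr * grad_b

print(w, b)  # w * x + b is now close to y_true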

2.1.8 Convolutional neural network

Convolutional neural networks (CNNs) are neural networks used primarily to classify images, as they have proven to yield better results than other machine learning methods [20, 21]. They work by sliding a kernel over the image, observing multiple pixels at once, where the result is the combined result of the input multiplied by the kernel weights [20], as can be seen in Figure 7. A convolutional neural network effectively finds features in an image. For example, a single convolutional layer can find features like sharp edges, while subsequent layers can identify complex shapes consisting of combinations of the previous layer's shapes.

Figure 7: A neural network containing one convolutional layer used to identify features, connected to a pooling layer which downscales the data while keeping the features. This is connected to a fully connected layer which adds semantics to the identified features [22].
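To make the layer types in Figure 7 concrete, the following Keras sketch stacks a convolutional layer, a pooling layer and a fully connected layer; the input size, filter count and number of classes are illustrative assumptions:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),                # RGB image
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),   # finds local features such as edges
    tf.keras.layers.MaxPooling2D((2, 2)),                    # downscales while keeping the features
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax")          # adds semantics / classifies
])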

2.1.9 Insufficient data

Data dependency is one of the most severe problems in machine learning, and it is even more critical in deep learning because of the large amount of data needed to understand the underlying patterns of the data. An interesting finding is that the scale of a model and the size of the training data have an almost linear relationship, which in itself speaks for the amount of data needed for a deep learning model. When a model does not have enough training data it will overfit [23].

2.1.10 Nonrepresentative data

For a model to be able to generalize well, it is important to have data that represents well what the model should learn. This is often harder than it sounds: having too little data might result in sampling noise, which is when the data is nonrepresentative by chance. On the flip side, even with enough data, if the sample of data used during training is flawed the data will be nonrepresentative. This is called sampling bias [9].

2.1.11 Overfitting

Overfitting happens when a model is too complex in relation to the amount of training data available or when the model has irrelevant components. For example, a neural network can model complex relations and is more flexible than a linear regression model; however, if the dataset conforms to a linear model, the neural network adds unnecessary complexity that might worsen the result, thus overfitting the data [24].

There are some possible solutions to overfitting, which are to:

• Simplify the model.

• Gather more training data if the data should fit a more complex model.

• Reduce the noise in the data.

Constraining a model to avoid overfitting is called regularization, and the amount of regularization to apply during learning is controlled by a hyperparameter [9].

A hyperparameter is a parameter used by machine learning algorithms that is not affected by the training of a model. These are set prior to training the model and tuning them is key to achieving good results [9].

2.1.12 Underfitting

The opposite of overfitting is underfitting, which occurs when there is still room for the model to learn. In other words, the model is not complex enough or it is over-regularized. When underfitting occurs, the model is not able to learn the underlying patterns in the data and will perform poorly on new data [25].

The possible solutions to this are to:

• Select a model with higher complexity.

• Pass better features to the model.

• Reduce the regularization hyperparameter.

2.1.13 Data augmentation

If there is not enough training data, a model will overfit during training. To counteract this problem, data augmentation is introduced: by performing small modifications on the images, carefully so as to not remove semantics from an image, a relatively small dataset can be expanded into a much larger one [26]. Some methods for image augmentation are:

• Cropping, where the images of different widths and heights gets a cropped central square patch, or the cropping can be done randomly.

• Rotation, where the images are rotated slightly.

• Noise injection, where a matrix containing random values is injected into the image.

• Color space, where the R, G and/or B values are changed, for example in intensity, which results in what resembles lighting alterations.

2.1.14 Data preprocessing

The input data can have extremely varied values, and some machine learning algorithms will not work at all without normalization. The above-mentioned backpropagation training algorithm will converge faster if the input data is normalized [27]. Commonly, scaling is done using min-max normalization, which is expressed mathematically as:

x' = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (2)

where x is the original value and x' is the normalized value. Other common normalization techniques are Statistical or Z-Score Normalization, Median Normalization and Sigmoid Normalization [28].
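A short NumPy sketch of the min-max normalization in equation (2), applied feature-wise to a hypothetical data matrix:

import numpy as np

def min_max_normalize(data):
    # Scale each column (feature) into the range [0, 1].
    mins = data.min(axis=0)
    maxs = data.max(axis=0)
    return (data - mins) / (maxs - mins)

data = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
print(min_max_normalize(data))  # every value now lies between 0 and 1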

2.2 State-of-the-art approaches

Contrary to the previous chapter, some prior knowledge of machine learning is required to understand the methods explained here.

2.2.1 Fast R-CNN

Fast R-CNN is an object detector which was the first of its kind to be considered end-to-end trainable, whereas other object detectors of the time consisted of multi-stage pipelines. Notable was the predecessor R-CNN, which would first train a CNN on regions of interest, also called proposals, then fit support vector machines to the CNN features to act as object detectors, and finally learn the bounding box regression [29].

Fast R-CNN solves this by introducing an architecture that can be trained end-to-end by taking as input the whole image and a set of object proposals. The network starts off by processing the image with several convolutional and max pooling layers to produce a convolutional feature map. Then, for each object proposal, a region of interest (RoI) pooling layer extracts a feature vector from the feature map. The feature vectors are then fed into fully connected layers leading to the two output layers, where one produces a probability estimate over all classes and the other outputs the four bounding box values [29]. This architecture can be seen in Figure 8.

Figure 8: The Fast R-CNN architecture [29].

2.2.2 ResNet

ResNet solves a very specific problem with deep convolutional networks: the degradation problem. Simply explained, this problem occurs when networks become too deep, which leads to saturation of the model's accuracy, and with further increased depth comes degrading accuracy [30]. This degradation of accuracy is not caused by overfitting, which may seem like the logical answer, as discussed in [31]. It is instead caused by the fact that deeper networks are simply harder to optimize, at least in a realistic time [30].

This problem was solved by the introduction of the residual learning building block, which can be seen in Figure 9,

Figure 9: The residual learning building block [30].

where a groundbreaking “skip connection” was introduced. This skip connection enables the network to become much deeper than a normal convolutional network while still converging to a solution within a reasonable time.

2.2.3 Faster R-CNN with ResNet101

Faster R-CNN is an object detection system proposed in [12] which combines a Region Proposal Network (RPN), a deep fully convolutional network that proposes regions on an image, with the Fast R-CNN detector that uses the proposed regions to detect objects [12]. An RPN takes an image of any size as input and outputs a set of rectangular object proposals, each accompanied by an objectness score. RPNs introduce anchors, which are pre-calculated sizes based on the sizes of the bounding boxes in the training data. The RPN then uses these anchors as the size of the kernel in a convolutional layer fed with the feature map from previous convolutional layers; this process is visualized in Figure 10.


Figure 10: A Region Proposal Network. k is the number of pre-calculated anchors, as can be seen there is one output prediction for each anchor [12]. The final layers are the classification layer (cls) and the regression layer (reg).

Faster R-CNN is, as mentioned, a combination of Fast R-CNN and the RPN just described. However, the RPN takes a feature map as input, and this is where ResNet101 comes in, transforming the raw image into a convolutional feature map. The 101 in the name refers to the depth of the network, and the performance difference between different depths can be read about in [31].

2.2.4 Constraining a 3D bounding box

The Faster R-CNN network will be used to create a 2D bounding box over the car in our project. Given a tightly fit 2D bounding box of an object, a 3D bounding box can be constrained to the bounds of the 2D bounding box, as it should fit tightly within the 2D bounding box frame [2]. To explain this, a definition of a 3D bounding box first has to be made. This can be done with its orientation R(θ, φ, α), parametrized by azimuth, altitude and zenith, and its center T = [t_x, t_y, t_z]^T. With the pose of the object in the camera coordinate frame (R, T) ∈ SE(3) and the camera intrinsic matrix K, the projection of a 3D point X_o = [X, Y, Z, 1]^T in the coordinate frame of the object into the image point x = [x, y, 1]^T is:

x = K\,[R\ T]\,X_o \qquad (3) [2]

With the assumption that the origin of the object is at the center of the 3D bounding box and the object's dimensions D = [d_x, d_y, d_z] are known, the coordinates of the 3D bounding box vertices X_n, where n is the index of the vertex, can be calculated as X_1 = [d_x/2, d_y/2, d_z/2]^T, X_2 = [-d_x/2, d_y/2, d_z/2]^T, ..., X_8 = [-d_x/2, -d_y/2, -d_z/2]^T. The tight-fit constraint requires that each side of the 2D bounding box is touched by the projection of at least one of the 3D bounding box corners. For example, the projection of the 3D corner X_o = [d_x/2, -d_y/2, d_z/2]^T touching the left side of the 2D bounding box, with coordinate x_min, creates the constraint that results in the equation:

x_{min} = \left(K\,[R\ T]\,[d_x/2,\ -d_y/2,\ d_z/2,\ 1]^T\right)_x \qquad (4) [2]

where (·)_x refers to the x-coordinate of the perspective projection. Similar formulas can be written for the other sides of the 2D box. This is not enough to fully constrain the pose, but it narrows down what needs to be solved using regression [2].
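The following NumPy sketch illustrates the geometry described above: it builds the eight 3D bounding box corners from the dimensions D, transforms them with a yaw-only rotation R and translation T, and projects them with the intrinsic matrix K. All numeric values are made-up placeholders; the real K, pose and dimensions come from the dataset.

import numpy as np

def box_corners(dx, dy, dz):
    # The eight vertices X1..X8 of an object-centered 3D box.
    return np.array([[sx * dx / 2, sy * dy / 2, sz * dz / 2]
                     for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)])

def project(K, R, T, points):
    # x = K [R T] X_o, followed by division by depth to get pixel coordinates.
    P = K @ np.hstack([R, T.reshape(3, 1)])
    homog = np.hstack([points, np.ones((len(points), 1))])
    img = (P @ homog.T).T
    return img[:, :2] / img[:, 2:3]

# Placeholder intrinsics, yaw-only rotation and translation.
K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
yaw = np.pi / 6
R = np.array([[np.cos(yaw), 0, np.sin(yaw)], [0, 1, 0], [-np.sin(yaw), 0, np.cos(yaw)]])
T = np.array([1.0, 1.5, 10.0])
corners_2d = project(K, R, T, box_corners(1.6, 1.5, 3.9))
print(corners_2d.min(axis=0), corners_2d.max(axis=0))  # tight 2D box around the projected corners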

2.2.5 MobileNetV2

This architecture builds on the ResNet idea and its “skip connections” and was mainly developed for machines with less computational power. The main contribution of this architecture is the inverted residual with linear bottleneck module. This module takes a low-dimensional feature map as input, first expands it to a high dimension where features are filtered, and then projects it back into a low dimension using a linear convolution. The module is much more memory efficient than other convolutional architectures without sacrificing accuracy [32].

2.2.6 MultiBin architecture with MobileNetV2

As mentioned in the Faster R-CNN chapter above, the 3D bounding box will fit tightly inside a 2D bounding box, and while that is a good constraint to narrow down the result, it is not enough to estimate the pose of the car. The MultiBin architecture was developed to solve this, and the paper also solves the problem of local versus global orientation, which will be explained shortly. The MultiBin architecture splits the orientation into bins parametrized only by azimuth (rotation around the y-axis) [2]. This is a problem-specific solution developed for autonomous driving, and it suits our problem very well because the majority of the images that will pass through our pipeline are taken at around waist height.


Figure 11: Left are cropped images of a car passing by. Right are images of the whole scene. As shown, the car in the cropped images appears to rotate while the car's direction is constant across all rows [2].

While the global orientation is the same for all images, the local orientation changes. [2] solves this by combining the local orientation with a ray from the camera to the object so that the combination of the two results in the global orientation of the object. This is visualized in Figure 12,

Figure 12: Visualization of the global orientation θ calculated from the local orientation θl [2].

where the global orientation θ is calculated as the sum θ_ray + θ_l. The network is trained to estimate the local orientation θ_l.

The architecture of the MultiBin regressor consists of some shared convolutional layers which then branch into three separate branches, as can be seen in Figure 13.

Figure 13: The left branch is the dimension estimation for the object, the middle branch is the cosine and sine of the azimuth for each bin, and the rightmost branch is the confidence of each bin [2].

On the KITTI object detection dataset [33] this architecture has achieved state-of-the-art results [2].
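As a sketch of how the MultiBin outputs can be decoded into a single yaw angle, the snippet below selects the most confident bin and adds its sin/cos residual to that bin's center angle. The bin centers and the example network outputs are assumptions for illustration; the report itself uses 2 bins.

import numpy as np

def decode_multibin(bin_conf, bin_sin_cos, bin_centers):
    # Pick the most confident bin and apply its residual rotation to the bin center.
    i = int(np.argmax(bin_conf))
    residual = np.arctan2(bin_sin_cos[i, 0], bin_sin_cos[i, 1])  # atan2(sin, cos)
    angle = bin_centers[i] + residual
    return (angle + np.pi) % (2 * np.pi) - np.pi  # wrap into [-pi, pi]

# Two bins centered at -pi/2 and pi/2 (an assumption), with example network outputs.
bin_centers = np.array([-np.pi / 2, np.pi / 2])
bin_conf = np.array([0.2, 0.8])
bin_sin_cos = np.array([[0.0, 1.0], [np.sin(0.3), np.cos(0.3)]])
print(decode_multibin(bin_conf, bin_sin_cos, bin_centers))  # about pi/2 + 0.3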

2.3 Synthetic data generation

A part of this project is to generate synthetic images as datasets and this chapter will investigate some methods of doing that. Machine learning models are best trained on large datasets but those are hard to come by. This section therefore explores how synthetic data can be generated for use as a complement to real data.

2.3.1 Synthetic 3D models

Having synthetic data as a basis for making machine learning models can be beneficial. With synthetic data come perfect labeling measurements, something that can be hard to get with real images [34]. In 2016, [35] was one of the first to use synthetically generated 3D object data for object detection CNNs whose results could be evaluated against real-world data. A combination of synthetic 3D object data and real data was used in [36], where the validation accuracy with only real or only synthetic images was below 50% in both cases, but when they were used together the accuracy came close to 95%. This approach used the ADORESet dataset with 2500 real images and 750 synthetic images and was tested on CNNs such as ResNet [30] and VGGNet [37], which shows that synthetic images can be used as a complement to real images.

Synthetic scenes with 3D furniture models were generated in [38], where object parameters such as texture, material and reflectivity were varied along with ambient parameters like lighting and distance to objects. The approach was to have the camera rotate around a set object in 24 azimuth positions accompanied by 13 elevation angle positions. This was then compared to the corresponding labeled objects from the COCO [39] dataset, which is a large dataset with over 328 000 images and 91 different object types. The results showed that the standard deviation for the label chair was 0.03 in width and 0.06 in height, and for the label monitor 0.09 in width and 0.09 in height. These results show a small deviation margin, which indicates that the way in which it was implemented could be beneficial for this project [38].

Work has also been done within existing game engines: in [40] the driving simulator VDrift [41] was used as a basis for generating images. Since VDrift is open source, different game engine settings could be accessed, such as lighting and shadow generation, as well as the whole scenes themselves. For each image that was generated, a depth map, optical flow map, texture map and pixel-wise annotations were captured, and since the engine itself was at their disposal the collected data were precise. When training models on the generated data, some interesting results were found. When texture and depth features are used as training data the accuracy is 91.2%, but when using texture, depth, and flow together the accuracy drops by 0.2%. In this case this is because the depth and flow feature maps are very similar in what data they produce. This is something to look out for when generating synthetic data [40].

2.3.2 GAN

Generative Adversarial Networks, or GANs for short, were designed by [42] and are a method specifically developed with the goal of increasing the amount of available data using an existing dataset. It is mostly used with two neural networks where one evaluates the other. A variant of this method is used in [43]. Here the method consists of a Generator (G) network which generates images and a Discriminator (D) network whose task is to evaluate the generated images. The procedure takes real images and inserts random noise into samples to increase the variety by changing generated as well as original images. Training the two networks amounts to finding the Nash equilibrium seen in equation (5) [43]:

\min_G \max_D\ \mathbb{E}_{x \sim q_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \qquad (5) [43]

Here z ∈ R^{d_z} is sampled from the distribution p(z), such as N(0, I) or U[−1, 1]. When this is applied, G and D are, as stated above, convolutional neural networks. GANs are very useful when it comes to scaling up datasets; as seen in Figure 14, varying the truncation value makes the GAN generate morphed new synthetic images.

Figure 14: The effects of truncation, the first set has a truncated value of 2, second set has a truncated value of 1, the third set has a truncated value of 0.4 and the fourth and last set has a truncated value of 0.04 [43].


The truncation in Figure 14 is done by truncating the latent vector whenever its sampled values lie above a user-defined magnitude threshold. The result of this is greater sample quality, but the drawback is reduced overall sample variety [43].
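A minimal sketch of the adversarial objective in equation (5) expressed with binary cross-entropy, as it is commonly implemented in TensorFlow; G and D are placeholders for any generator and discriminator models, and this is not code from [43]:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # D should output "real" (1) for real images and "fake" (0) for generated ones.
    return bce(tf.ones_like(real_logits), real_logits) + \
           bce(tf.zeros_like(fake_logits), fake_logits)

def generator_loss(fake_logits):
    # G tries to make D classify generated images as real.
    return bce(tf.ones_like(fake_logits), fake_logits)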


3 Implementation

This chapter first gives a design overview of the system this project is creating, followed by the implementation details. The design overview is kept at a high level, since the specific details can be found in the implementation subchapter.

3.1 Design

3.1.1 Preparation of images

To train a model there needs to be training data, and as discussed in a previous chapter, this project will use a Unity program to generate images from 3D models. Since this project's goal is to find out whether this is a feasible solution for CAB, and not to deliver a final solution, images will only be generated from two 3D models to prove the concept.

Our program's main loop consisted of code blocks created to give variety based on settings parameters, which are set by the user before a batch of images is generated. The values set in the following code blocks are chosen to give an even distribution, so that no specific color or specific car appears more often than others.

• A random position for the car within the bounds of the camera is set. This is done so that the car does not appear in the same position in every scene setup and to increase the variety of images.

• A new rotation of the car is set by incrementing the azimuth by some degrees, and increasing the altitude by some degrees every full rotation of the azimuth. This approach was taken to make sure that the whole car is captured from as many angles as possible.

• The azimuth, altitude, and zenith are randomized by a small amount. This is done for increased variety in the distribution, to make sure that one angle of the car does not have the same altitude and zenith for a given azimuth.

• A new random scene is set, with random lights, skybox and ground plane. This project uses 15 different skyboxes, one of which is chosen at random for each scene setup, and four different lights placed around the scene to give good variety. The ground plane has five different textures to choose from for each setup.

• Setting up the look of the car is done in three steps:

o The color of the car is set by randomizing the material's RGB values.

o The glossiness of the material is set.

o The metallic value of the material is set.

• A square is added at a random position, to represent the squares present in the real image database that CAB has. The squares mark damage on real cars, and models are trained using that dataset.


Figure 15: Output data from image generator.

The data that is passed on to the Faster R-CNN object detector and the MultiBin regressor will be in the form of images, labels and the camera intrinsic matrix, visualized in Figure 15.

3.1.2 Model architecture

Since the objective is to retrieve the pose of a car directly from an image, two of the previously described architectures have to be used: Faster R-CNN to first generate the 2D bounding boxes, and then the MultiBin architecture to estimate the 3D bounding box from those 2D bounding boxes. The reason for using the Faster R-CNN architecture in favor of newer architectures was that CAB had already implemented this architecture in models that detected individual car features. A side effect of generating our own data is that we can ensure that it is evenly distributed, and thanks to this there is no need to augment the data to avoid the problems of insufficient or nonrepresentative data. However, the authors decided to train the MultiBin architecture using both generated data and the KITTI object detection dataset [33] to examine the difference in results, and that data is augmented by flipping the images horizontally and by adding jitter to the images.

The dataflow in the machine learning pipeline during training is visualized in Figure 16, where images with their labels are passed through both machine learning models in parallel, since the training data is structured in such a way that both models can make use of it.

Figure 16: Flowchart describing the training stage of both models, using the same labels. The top branch results are the 2D bounding box metadata, and the bottom branch results are the 3D bounding box.

Important to note is that this is the setup when training the models; when using the models to predict on new images, a different setup is used. Instead of running the models in parallel, they run in series, where the output of the Faster R-CNN is the input of the MultiBin model, as can be seen in Figure 17.

Figure 17: The machine learning pipeline when predicting on new data. The metadata from Faster R-CNN is the 2D bounding box data, where the output of the MultiBin is the pose.

3.2 Implementation

This chapter explains how the pipeline described in the chapter above was implemented. For a reader trying to replicate the results achieved in this report, this chapter is meant to give all the details needed to do so.

3.2.1 Unity

CAB's existing Unity program was initially evaluated, and the authors noted the changes that had to be made. Functionality such as stage randomization, material randomization, image capturing, database-specific modifications and 2D labeling was left intact. Car orientation was remade, as the images CAB were generating had few observation angles of the car; to minimize the orientation bias the authors decided to include a full rotation of azimuth with an even distribution. Unity also has its own measurement system to determine how big or small an object is, made to reflect real-world meters: in short, one Unity unit is one meter in the real world. This provides an easy way to determine the dimensions of a car. The datasets will from now on be named as follows:

• KITTI object detection dataset → KITTI

• KITTI object detection evaluation dataset → KITTI-EVAL

• Unity generated dataset with images resembling CAB’s database → UNITY

• Unity generated evaluation dataset with images resembling CAB’s database → UNITY-EVAL-CLOSE

• Unity generated evaluation dataset with images resembling the KITTI object detection dataset → UNITY-EVAL-FAR

The complete label for a car in all the UNITY datasets includes the same information as in the KITTI datasets. This is intentional, as two models were trained: one on the KITTI dataset and one on the UNITY dataset. This data is explained in Figure 18.

Figure 18: The contents of a single label file. When training the Faster R-CNN model only bbox will be needed, as it is the only output of that model. For the MultiBin model all values are needed for training, and only bbox is needed as an input when predicting, along with an image.

The way each attribute in the label file was extracted from the Unity program will be explained in the following subchapters.

3.2.1.1 Truncated

The truncated attribute ranges from 0 to 1 and is the percentage of the 2D bounding box that is outside of the image frame, 0 being completely non-truncated and 1 being fully truncated. It is calculated using the 2D bounding box and the screen bounds, as seen in formula (6), where AreaInFrame and TotalArea are the 2D bounding box area within the frame and its total area, respectively.

\mathrm{truncated} = 1 - \frac{AreaInFrame}{TotalArea} \qquad (6)
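A Python sketch of formula (6), computing the truncation of an axis-aligned 2D bounding box against the image frame; the function and variable names mirror the description above and are not code from the Unity project:

def truncation(bbox, img_w, img_h):
    # bbox = (xmin, ymin, xmax, ymax) in pixels, possibly extending outside the image.
    xmin, ymin, xmax, ymax = bbox
    total_area = (xmax - xmin) * (ymax - ymin)
    in_w = max(0.0, min(xmax, img_w) - max(xmin, 0.0))
    in_h = max(0.0, min(ymax, img_h) - max(ymin, 0.0))
    area_in_frame = in_w * in_h
    return 1.0 - area_in_frame / total_area

print(truncation((-50, 0, 150, 100), img_w=1242, img_h=375))  # 0.25, a quarter of the box is outside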

3.2.1.2 Occluded

Occluded is a whole number O ∈ {0, 1, 2, 3} indicating the occlusion state, where 0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown. For this project the occlusion was always 0, since there were no random objects in the scene to obscure the cars that were randomly placed in the scene.

3.2.1.3 Alpha

The alpha attribute represents the global rotation. It is used by the model during training and is calculated from the local azimuth (rotation_y) by adding the angle in the ground plane between the camera and the car position; alpha ranges within [-PI, PI]. This can be seen in Code listing 1.

alpha = rotation_y + atan2(CarPos.z, CarPos.x) + 3*PI/2   # viewing-ray angle plus the 270° offset described below
if alpha < -PI:
    alpha += 2*PI
else if alpha > PI:
    alpha -= 2*PI

Code listing 1: The rotation converted to the camera's coordinate system. CarPos.z and CarPos.x are the car's positions along the z and x axes relative to the camera.

The reason for using alpha (the global orientation) is to negate the effect of viewing the car from the camera's perspective: a car might look more rotated when close to the camera than when far ahead of it. Figures 11 and 12 give further insight into why this is needed. Finally, the 270° (or 90° in the opposite direction) is added to offset the 0° observation angle in the dataset; in our dataset a car pointing straight to the right is observed to have 0° of rotation.

3.2.1.4 Bounding Box

Bbox contains the four pixel values of the boundaries of the 2D bounding box. The bounding box was gathered from points placed on the car models so that they cover the entire car. The method goes through all the placed points to find the top-left and bottom-right corners of the bounding box.

3.2.1.5 Dimension

The dimensions of the cars were obtained by measuring the collision box placed on each car. The measurement is done on the collision box since it is set to the same size as the car; the dimensions are then taken along each of the x, y and z axes.

3.2.1.6 Location and Rotation_y

The location (x, y, z) is the relative 3D location in meters, and rotation_y is the rotation around the Y-axis in the range [-PI, PI]; these values are measured in camera coordinates. To get the location of the car models in relation to the camera, the world-coordinate positions of the car models and the camera object are taken. The distance between the camera and the car model is then measured to calculate the car model's local position in relation to the camera.

3.2.2 MultiBin regressor

The implementation of the MultiBin regressor can be split into several parts: preprocessing, constructing the network, and defining the loss. This chapter is split into these subchapters.

3.2.2.1 Preprocessing

When generating training data there is always a bit of a hassle to get the data into a structure that the model will accept; because of the structure generated in the Unity project, there was quite an easy conversion from label files and images to the structures used by the models during training. In this process the images are cropped to only include the 2D bounding box plus a random number of pixels in each cardinal direction. The images are flipped at random, and when an image is flipped the label data is also flipped.

After the previous steps, the image is resized to the standard size, which in this case is 224 × 224 pixels in 3 channels (RGB). Finally, the mean of each channel is subtracted from the image to center the data.

The images are loaded in batches of 8 and are now ready to be used for training or validation. However, one step remains, which is to retrieve the average dimensions of the 2D bounding boxes to generate anchors for the MultiBin regressor. This is simply done by iterating over all labels and calculating the average height and width of the 2D bounding boxes.
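The preprocessing described above could look roughly like the following sketch (a simplified reconstruction, not the project's actual code): it crops around the 2D box with a random margin, resizes to 224 × 224, optionally flips, and subtracts the channel means (here per image, and with the corresponding label flip omitted):

import numpy as np
import tensorflow as tf

def preprocess(image, bbox, flip, max_jitter=8):
    # Crop the 2D bounding box plus a random number of pixels in each direction.
    h, w = image.shape[:2]
    xmin, ymin, xmax, ymax = bbox  # integer pixel coordinates
    j = np.random.randint(0, max_jitter + 1, size=4)
    crop = image[max(0, ymin - j[0]):min(h, ymax + j[1]),
                 max(0, xmin - j[2]):min(w, xmax + j[3])]
    crop = tf.image.resize(crop, (224, 224)).numpy()
    if flip:
        crop = crop[:, ::-1]              # horizontal flip (the label must be flipped too)
    return crop - crop.mean(axis=(0, 1))  # remove the per-channel mean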

3.2.2.2 Constructing the network

The architecture has already been discussed in the model chapter and thus this subchapter will contain the exact network this project uses. All machine learning code was implemented using TensorFlow and Keras.

First, the MobileNetV2 architecture was built, with the exact layers specified in the MobileNetV2 architecture documentation [32], which can be seen in Figure 19.

Figure 19: MobileNetV2 architecture, where t is the expansion factor of a block, c the number of output channels, n the number of repetitions of each block, and s the stride of the first layer in the block; all other layers use a stride of 1. All convolutions use a 3×3 kernel [32].

After this a dropout layer is added, which should improve the generalization from the average pooling layer to the later layers, which branch into three separate branches visualized in Figure 20.


Figure 20: The MultiBin architecture converted to a fully convolutional model. The filters have sizes of 3×3, 4×4 and 2×2 respectively (because 2 bins are used), while all kernels are of size (1, 1).
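A condensed Keras sketch of the setup in Figures 19 and 20: a MobileNetV2 backbone followed by dropout and three 1×1-convolution branches for dimensions, per-bin sin/cos orientation and per-bin confidence. The branch output sizes follow the description for 2 bins; the dropout rate and exact layer arrangement are assumptions rather than the project's actual code.

import tensorflow as tf
from tensorflow.keras import layers

BINS = 2
inputs = tf.keras.Input(shape=(224, 224, 3))
backbone = tf.keras.applications.MobileNetV2(include_top=False, weights=None, input_tensor=inputs)

x = layers.GlobalAveragePooling2D(keepdims=True)(backbone.output)  # 1x1 spatial feature map
x = layers.Dropout(0.5)(x)

dimensions = layers.Conv2D(3, (1, 1), name="dimensions")(x)            # dx, dy, dz
orientation = layers.Conv2D(2 * BINS, (1, 1), name="orientation")(x)   # sin/cos per bin
confidence = layers.Conv2D(BINS, (1, 1), activation="softmax", name="confidence")(x)

model = tf.keras.Model(inputs, [dimensions, orientation, confidence])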

3.2.2.3 Defining the loss

This model has three outputs, as can be seen in the previous subchapter: dimensions, orientation and confidence. These are quite different to calculate a loss for and have far from equal weights. For example, the dimensions of the car will almost always be about the same, while the orientation will vary in the range [-PI, PI]. With this in mind, we use the mean squared error (MSE) for the dimensions and binary cross-entropy for the confidence. Finally, the orientation loss is calculated as:

loss = cos(gt) * cos(pred) + sin(gt) * sin(pred)
loss = mean(loss)
loss = sum(loss)
loss = 2 - 2 * loss

Code listing 2: Orientation loss definition where gt is the ground truth and pred is the predicted value.

The orientation angle has been discretized and divided into overlapping bins. For each bin, the network estimates a confidence probability c_i that the output lies in the ith bin, together with the residual rotation correction that needs to be applied to the orientation of the center ray of that bin in order to obtain the output angle. This results in 3 outputs per bin: (c_i, cos(∆θ_i), sin(∆θ_i)). This loss is used in favor of an L2 loss because an L2 loss encourages the network to minimize the loss over all nodes, which results in a prediction that is less than optimal for any single node [2]. The weights used for the different loss functions were 1.0 for dimension, 10.0 for orientation and 5.0 for confidence. Noteworthy is that the optimizer for this model is the stochastic gradient descent optimizer with a learning rate of 0.0001 and a momentum of 0.9.
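Tying this together, a hedged sketch of how such a model could be compiled in Keras with the stated loss weights and optimizer; the orientation loss follows the idea of Code listing 2, and the output names and the packing of sin/cos pairs are assumptions matching the sketch in the previous subchapter, not the project's exact implementation:

import tensorflow as tf

def orientation_loss(y_true, y_pred):
    # y_* contain cos/sin pairs per bin; the packing and reduction axes are assumptions.
    dot = y_true[..., 0::2] * y_pred[..., 0::2] + y_true[..., 1::2] * y_pred[..., 1::2]
    return 2.0 - 2.0 * tf.reduce_mean(dot)

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.0001, momentum=0.9),
    loss={"dimensions": "mse",
          "orientation": orientation_loss,
          "confidence": "binary_crossentropy"},
    loss_weights={"dimensions": 1.0, "orientation": 10.0, "confidence": 5.0})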

3.2.3 Feature extraction

The final part of this project is to extract individual car features using the predicted pose. As this is a proof of concept rather than a final product, it was decided that this could be done manually. The steps to reproduce this are as follows:

• Retrieve the pose from a predicted result; the values of importance are the location, dimensions, and rotation. This means that the bounding box predicted by the Faster R-CNN model is not of importance for this evaluation.

• In the Unity program, set the background image to be the original image.

• Place a car in the scene, with the same pose as the predicted values.

• Reduce the opacity of the background image so that the car can be seen, since the car gets placed behind the background.


4 Results

In this chapter the results of the three tasks of this project are presented. The chapter starts off by presenting the results from the image generation and then proceeds to go through the results of the two machine learning models, with a final part on the feature extraction, which is the cumulative result of this work. Each part of this chapter is evaluated differently, and these evaluations are introduced in the respective part.

4.1 Image generation

Figure 21 shows some randomly selected images from the generated dataset to demonstrate what kind of images the generator has created. The squares that are visible in some pictures are the added random squares, which represent squares drawn in by hand to mark car damage on real images from car repair shops; these are added randomly so as to remove their semantics when training.

Figure 21: Sample pictures to show the image generation, with different lighting, paint work, dirt textures, grounds, skyboxes, positions, rotations and cars.

The results of the image generation will be statistically compared against the KITTI dataset; this evaluation captures the numerical representation of the images. However, this method of evaluating the results does not measure how well these images mimic real images, which is expanded upon in the discussion chapter.

4.1.1 Alpha, rotation_y and truncated

Figure 22 is a visualization of the data distribution from the labels, including alpha, rotation_y and truncated. Notable is the correlating distribution of alpha and rotation_y, which is not a surprising relation as the global rotation should be about the same as the local rotation. Also notable is that the KITTI dataset contains very few truncated cars, whereas in the UNITY dataset the distribution is spread among all levels of truncation.

Figure 22: The y axis is the number of cars and the x axis is the value range. For alpha and rot_y the x axis shows values between -PI and PI, and for truncated the values represent the amount of truncation.

4.1.2 Dimensions

Looking at the distribution of dimensions between the different datasets in Figure 23, one thing becomes very clear: the UNITY, UNITY-EVAL-CLOSE and UNITY-EVAL-FAR datasets have just two different car models, and those cars are the same length, which is obvious as the values range over only one or two numbers. The KITTI datasets, on the other hand, have many different cars, which shows in the wider data range.

Figure 23: Y axis is number of cars and the x axis is the value range in meters.

4.1.3 Location

In Figure 24 the location distributions are visualized; the KITTI and KITTI-EVAL datasets have a wider spread across all location axes. The UNITY, UNITY-EVAL-CLOSE and UNITY-EVAL-FAR datasets have an even distribution but in a smaller range than KITTI and KITTI-EVAL. For location_z and location_x the difference between real-life images and generated images becomes apparent: the generated UNITY dataset is lacking in range even if the distribution within that range is even.

4.1.4 Bounding box

In Figure 25, the graphs show the min and max bounding box values for the (x, y) coordinates. A notable difference is within the UNITY and UNITY-EVAL-CLOSE datasets, where the values have a larger range than in the other datasets. The KITTI dataset has a good spread of values with some spikes, except for bbox_miny, where the values seem to have centered around 180 or close to it.

Figure 25: Y axis is number of cars and the x axis is the pixel value range.

4.2 Car pose estimator

This subchapter first contains the Faster R-CNN results, after which the MultiBin model results are presented, highlighting the difference between the configurations of the created models.

4.2.1 Faster R-CNN

The 2D object detector was trained on the UNITY dataset; it trained for just under 45k iterations before being manually terminated. The training was kept short as the focus of this project was the 3D pose estimator. During this training two key metrics were evaluated: the mean average precision (mAP) and the mAP at 50% intersection over union (mAP@.50IoU). These results can be viewed in Figure 26, and as a general guideline, higher values are better.


Figure 26: The metrics used for the object detector graphed during training. Left is the mAP, right is the mAP on images with more than 50% IoU.

Mean average precision is a way of measuring a model's accuracy and is commonly used for detection tasks. mAP@.50IoU is calculated in the same way as ordinary mAP but only takes into account the predictions that have more than 50% IoU, where IoU is a measure of how well the predicted bounding box compares to the ground truth bounding box. As mentioned previously, this object detector was trained on the UNITY dataset; however, a modification was made to the dataset so that it would not include bounding boxes that extended beyond the image dimensions, and the bounding boxes that were outside of the image were clamped to the image dimensions. This led to the model detecting quite well on images where the full car was visible within the image, while degraded performance was seen on images where the car extends outside of the boundaries; this can be seen in Figure 27.

Figure 27: A sample of predictions on real images. Each image is split into two parts, the prediction on the left side and ground truth on right side.
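For reference, the IoU between a predicted and a ground-truth 2D box can be computed as in the following sketch (the standard definition, not code from the project):

def iou(box_a, box_b):
    # Boxes are (xmin, ymin, xmax, ymax) in pixels.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # 1/3, below the 50% threshold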

The difference in performance compared to the state of the art is quite large: SA-SSD [44] achieves 95.16% average precision on the KITTI dataset on images where the prediction has more than 70% IoU, while with a similar measure, where predictions with over 75% IoU are evaluated, our model achieves 77.6% average precision when trained on the UNITY dataset. This is slightly lower than the accuracy achieved when trained on the KITTI dataset, where the Faster R-CNN architecture achieves an 83.16% average precision on 70% IoU predictions. These results derive from two sources: the Faster R-CNN architecture is older than the SA-SSD architecture, which explains the lower results on the KITTI dataset, and CAB's dataset is simply harder to detect on than the KITTI dataset. As previously mentioned, the KITTI dataset does not contain images of cars close to the camera, which is a problem for the architecture.

4.2.2 MultiBin regressor

All configurations of the MultiBin regressor were set to train for 500 epochs; however, the KITTI model was stopped by early stopping at epoch 60, and the UNITY model had to be manually stopped at epoch 12, as time was limited during this project. Early stopping was configured with a patience of 10 epochs. The summarized loss can be seen in Figure 28.
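
As an illustration of the early-stopping setup described above, the sketch below assumes a Keras-style training loop; the callback arguments are standard Keras, while the model and datasets are placeholders rather than the project’s code.

```python
# Hedged sketch: early stopping with a patience of 10 epochs, assuming a
# Keras-style training loop. `model`, `train_data` and `val_data` are
# placeholders and not part of the project's actual implementation.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",        # watch the summed validation loss
    patience=10,               # stop after 10 epochs without improvement
    restore_best_weights=True  # keep the weights from the best epoch
)

# model.fit(train_data, validation_data=val_data,
#           epochs=500, callbacks=[early_stop])
```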

Figure 28: The training graph for the combined loss when training both models; orange and red represent the KITTI and UNITY models respectively.

The loss graphed in Figure 28 is the sum of the three different loss functions defined in the implementation chapter, where lower values mean better performance. To get a deeper understanding of this result, the individual losses are examined.


In Figure 29 the different losses are graphed. An important thing to keep in mind when reading these losses is that their weights are not accounted for, which means that the orientation loss weighs 10 times more than the dimension loss and twice as much as the confidence loss. Without knowing about the weights it may seem like the dimension loss would keep the model improving for some more time; however, because the orientation is no longer able to improve, the model is considered to have reached a minimum.
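
For reference, a minimal sketch of how the summed loss in Figure 28 relates to the individual losses is given below; the weight values are illustrative, chosen only to be consistent with the stated ratios, and may differ from the ones actually used in the project.

```python
# Illustrative weights consistent with the ratios stated above
# (orientation = 10 x dimension, orientation = 2 x confidence).
W_ORIENT, W_DIM, W_CONF = 10.0, 1.0, 5.0

def total_loss(orientation_loss, dimension_loss, confidence_loss):
    """Weighted sum of the three losses, as graphed in Figure 28."""
    return (W_ORIENT * orientation_loss
            + W_DIM * dimension_loss
            + W_CONF * confidence_loss)
```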

These values can be hard to interpret, and because of this some additional metrics are calculated: the average orientation error measured in degrees and the average dimension error measured in meters. Note that the confidence loss is excluded from this evaluation.

Figure 30: Evaluated metrics for the two models on all three datasets

Using these metrics gives a clearer view of the results, which can be seen in Figure 30. The dimensions are on average off by 0.27 meters for the UNITY model evaluated on the KITTI evaluation dataset; note that these are the total dimensions, i.e. height, width, and length, measured in meters. The orientation has been converted to the difference in angle between the ground truth and the predicted angle, measured in degrees.
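
A minimal sketch of how such averages could be computed is shown below; it assumes angles in radians and dimensions as (height, width, length) in meters, and summing the dimension error over the three axes is our assumption about how "total dimensions" is aggregated.

```python
import numpy as np

def average_errors(pred_angles, gt_angles, pred_dims, gt_dims):
    """Average orientation error (degrees) and total dimension error (meters)."""
    delta = np.asarray(pred_angles) - np.asarray(gt_angles)
    # Smallest angular difference, wrapped into [0, pi]
    diff = np.abs(np.arctan2(np.sin(delta), np.cos(delta)))
    angle_err = np.degrees(diff).mean()
    # Dimension error summed over height, width and length, then averaged
    dim_err = np.abs(np.asarray(pred_dims) - np.asarray(gt_dims)).sum(axis=1).mean()
    return angle_err, dim_err
```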

The current de facto precision measurement method was introduced by Geiger et al. [33], where the average orientation similarity (AOS) is measured. Our solution reaches 97.5% accuracy with an orientation similarity of 30° and 89.5% with a similarity of 15° on the generated dataset; note that n=100. A graph of the accuracy measured from 0° similarity to 180° can be viewed in Figure 31. These numbers are nearing the state of the art, with systems such as MMLab PV-RCNN [45] reaching 98.15% AOS on the KITTI set.
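
The curve in Figure 31 can be read as the fraction of predictions whose angular error falls within a given threshold; the sketch below is our simplified reading of such a measurement, not necessarily the exact evaluation code used.

```python
import numpy as np

def orientation_accuracy(pred_angles, gt_angles, threshold_deg):
    """Fraction of predictions with angular error within threshold_deg."""
    delta = np.asarray(pred_angles) - np.asarray(gt_angles)
    diff = np.abs(np.arctan2(np.sin(delta), np.cos(delta)))  # wrapped to [0, pi]
    return float(np.mean(np.degrees(diff) <= threshold_deg))

# Geiger et al. [33] instead score each detection with the cosine-based
# similarity (1 + cos(delta)) / 2 and average it over the recall curve (AOS).
```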


Figure 31: The orientation similarity estimation results. The Y-axis is the recall, ranging from 0 to 1, and the X-axis is the orientation similarity in radians.

Figure 32 shows the 3D bounding box drawn on top of the images. These results are from predictions using the KITTI model, where the left column consists of images from the KITTI-EVAL dataset, the middle column is from the UNITY-EVAL-FAR dataset, and the rightmost column contains images generated from the UNITY-EVAL-CLOSE dataset.

Figure 32: Drawn 3D bounding boxes from predicted results. Note that the squares visible in the generated datasets are part of the image generation and not part of the prediction.

The images make it easier to interpret the numbers from Figure 30. The KITTI model performed excellently on the KITTI evaluation dataset, where both the orientation and dimension errors are quite low. The results for the UNITY-EVAL-FAR dataset show a significantly degraded orientation accuracy while retaining good dimension estimates. Finally, the estimates on the UNITY close evaluation dataset show extremely degraded results, with a dimension error of almost a meter compared to the ground truth.

In Figure 33 the errors in orientation and dimension are visualized in a bar graph to further demonstrate the differences in the results. Most notable is the varied performance on the UNITY-EVAL-FAR dataset.


Figure 33: The resulting error in orientation and dimension for the KITTI and UNITY models on all datasets.

4.3 Feature extraction

Returning to the original question this report set out to answer, can car features be extracted from an image using machine learning and high-quality 3D models of cars? In this chapter the results of the evaluation are presented, where a 3D model of a car is placed in the predicted pose and individual features are extracted.

4.3.1 Visual representation

The UNITY model predicted on the UNITY-EVAL-CLOSE dataset, and from that set several images were evaluated. The predictions were then manually entered into the Unity program and the scene was recreated using the predicted values. These scenes can be seen in Figure 34, where the highlighted parts are individual features of the 3D models and the transparent car is the image that was predicted upon.

Figure 34: Car model features placed at the predicted pose. The background images are the original images and the blue/red car parts are the placed parts.

As can be seen in these images, the results are of varying accuracy. Notable is the bottom right image, where the car has a quite accurate orientation but is rotated 180 degrees. When examining these images one thing became clear: the position of the car is a major contributor to properly detecting individual features.

It seems like the individual feature extraction is much more sensitive to small differences in the pose than the 3D bounding box is. This can be seen in all the images in Figure 32: the 3D bounding box can be considered fairly accurate when examining the images manually, while the feature extraction quickly reveals the error in the prediction.

4.3.2 Intersection over Union

While it seems like these predictions are quite accurate, this result can be measured using IoU. In Figure 35 the steps to measure this are visualized. To retrieve these values, two 3D models are created in the scene, one with the ground truth pose and the second with the predicted pose. Then different car parts are enabled within the editor and the 2D areas are retrieved.

Figure 35: A visualization of the IoU when using a 3D model. The final row shows a cutout of the predicted area.

In Figure 35, two predicted images were picked at random. In the top row these images are displayed together with a feature highlighted from the 3D models, and the middle row highlights the ground truth, predicted, and overlapping areas. The IoU is calculated to 0.771 for the left image and 0.356 for the right image.
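
Assuming the 2D areas retrieved from the editor are exported as binary pixel masks, the IoU values above can be computed with a sketch like the following (illustrative, not the project’s actual code).

```python
import numpy as np

def mask_iou(gt_mask, pred_mask):
    """IoU of two binary part masks of the same image size.

    True marks the pixels covered by the highlighted car part in the
    ground-truth and predicted renderings, respectively.
    """
    intersection = np.logical_and(gt_mask, pred_mask).sum()
    union = np.logical_or(gt_mask, pred_mask).sum()
    return float(intersection) / union if union > 0 else 0.0
```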


5 Discussion

Discussing the results is an important part of this report, and this chapter goes in depth into the authors’ thoughts on them, followed by how the project can be further developed.

5.1 Image and label generation

The even distribution of alpha values in the UNITY dataset is not surprising, since the azimuth is increased incrementally for each generated image. When it comes to the KITTI dataset, those pictures are taken in traffic, meaning that vehicles will most often be facing towards or away from the camera, which shows in the data distribution. Looking at the distributions of the truncated values the same is true: in the generated UNITY dataset the vehicles are purposefully offset to appear close to and over the edge of the image, while in the KITTI dataset that only happens if a vehicle is very close to the camera or at an intersection. The argument could be made that an even data distribution should help the model generalize well on a larger set of images.
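
As a minimal illustration of the azimuth stepping mentioned above, evenly spaced viewpoints are what produce the flat rotation distribution seen for the UNITY dataset; the step count below is purely illustrative and not the value used in the project.

```python
import numpy as np

# One viewpoint per generated image; 72 steps of 5 degrees is an assumption
# made only for this example.
num_views = 72
azimuths_deg = np.arange(num_views) * (360.0 / num_views)  # 0, 5, 10, ..., 355
```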

When looking at the dimension values there is one thing that stands out: in the UNITY dataset the distribution covers only two values, which represent the dimensions of the different 3D models. Since the KITTI dataset is captured on actual roads, there will naturally be a larger range of size measurements. The measurements in the KITTI dataset are done with LIDAR, and the specifics of how it measures dimension size are not known to the authors of this report, which explains why there are values smaller than 1.2 meters in width, for example.

Looking at the notable data distribution around location_x, both the UNITY and the KITTI datasets have the largest concentration around 0, although in the KITTI dataset the distribution is lower at exactly 0 than in the UNITY one. This again shows the difference between generated and real images. The lack of values at exactly 0 in the KITTI dataset can probably be attributed to the fact that the images are taken while driving around: oncoming traffic will rarely be in the exact center of the image, and same-side traffic will only be directly in the middle of the image occasionally.

What stands out in the bounding box data distribution is the UNITY datasets, where the range is larger than in the KITTI dataset. This is due to an effect of perspective: when an object is viewed up close, less of it is observed within the camera’s field of view, but the bounding box retains its original size. This leads to bounding box points landing outside of the field of view, which is why the values have such a large range in the dataset where the focus was on pictures taken up close. The reason for the spikes, mainly within the KITTI and UNITY datasets, is that all values outside the maximum or minimum range are batched and represented together.

What it boils down to, when it comes to the difference in data distribution between the KITTI and UNITY datasets, is the way generated images and real-life images taken with advanced measurement equipment are obtained. A noticeable difference between the two ways of gathering data is that the real images from the KITTI dataset get their data distribution from how vehicles are
