
Västerås, Sweden

Thesis for the Degree of Master of Science in Engineering - Robotics

30.0 credits

BIRD’S-EYE VIEW VISION-SYSTEM

FOR HEAVY VEHICLES WITH

INTEGRATED HUMAN-DETECTION

Emma Frisk

efk15002@student.mdh.se

Julia Harms Looström

jlm16005@student.mdh.se

Examiner: Mikael Ekström

Mälardalen University, Västerås, Sweden

Supervisor: Carl Ahlberg

Mälardalen University, Västerås, Sweden

Company Supervisors: Kim Päivärinne and Anders Svedberg

CrossControl, Västerås and Uppsala, Sweden


Abstract

To safely manoeuvre heavy vehicles, such as harvesters or mining machines, the operator must be fully aware of the vehicle's surroundings. This is a demanding task due to the number of blind spots, which increases with the size of the vehicle. Today, there are vision systems available on the market that can provide the operator with extended vision, but the problem is that the operator has to switch between the views manually or install several screens in the cockpit. Such systems cause the operator to shift focus from the surroundings to the screen, which decreases the awareness of the surrounding environment.

Recently, efforts have been made, and resources have been spent, to reduce the number of accidents related to the handling of heavy vehicles, especially in agriculture. A system that can provide the operator with extended awareness of the surroundings, without the system being controlled manually, would be desirable. This thesis proposes such a system. The goal of this thesis was to develop a system that can reduce the number of blind spots and increase the environmental awareness of the operator in real-time. Four wide-angled cameras have been combined into a top-view vision system that can detect humans within the nearest surrounding environment of the vehicle, in close to real-time. The system was implemented on a screen computer custom made for vehicle applications, which introduced challenges in the implementation due to the restricted availability of computational power. The system successfully reduces the blind spots and detects humans in time for the operator to be able to stop the vehicle at speeds close to 20 km/h.

Contents

1 Introduction
2 Problem Formulation
  2.1 Hypothesis
  2.2 Research questions
  2.3 Limitations
3 Background
  3.1 Fish-eye
  3.2 Image distortion
  3.3 Image warping
  3.4 Camera calibration
  3.5 Homography
  3.6 Feature extraction
  3.7 Image stitching
  3.8 Human detection
  3.9 OpenCV and OpenCL
  3.10 Mat and UMat
  3.11 Qt Creator
  3.12 CCpilot V700
  3.13 CCpilot X900
  3.14 Internet Protocol camera
4 Related Work
  4.1 Bird's-eye view
  4.2 Human detection
5 Method
6 Ethical and Societal Considerations
7 Implementation
  7.1 Camera calibration
  7.2 Human detection
  7.3 Undistortion
  7.4 Resizing
  7.5 Change of perspective
  7.6 Match and merge images
  7.7 Stitching
  7.8 Software structure
    7.8.1 Calibration
    7.8.2 main (initialisation program)
    7.8.3 mainThread
    7.8.4 cameraThread(s)
    7.8.5 mergeThread
8 Experimental setup
9 Results
  9.1 Visual appearance
  9.2 V700
  9.3 X900
  9.4 X900 with UMat
10 Discussion
  10.1 Visual appearance
  10.2 V700
  10.3 Human detection
  10.4 X900
11 Future Work
  11.1 Stitching
  11.2 Human detection
  11.3 Environment
12 Conclusion
References

List of Figures

1 Types of image distortions.
2 The workflow of the thesis.
3 The steps of the implementation.
4 The figure shows three images used for camera calibration.
5 Human detection.
6 Before and after undistortion.
7 The position of the chessboard relative to the camera.
8 Points for calculating homography to change perspective.
9 Before and after change of perspective.
10 The four images merged.
11 Finding the overlapping area between two adjacent images.
12 Placement of the seam.
13 Masks with the overlap trimmed out.
14 The structure of the software system.
15 Experimental test setup.
16 The final vision system with no humans in the Field Of View (FOV).
17 The final vision system where a human is detected in the front camera (Case 3).
18 The final vision system where a human is detected in both the front camera and right camera (Case 2).

List of Tables

1 V700 threads execution times (s).
2 V700 camera thread operations execution times (s).
3 V700 merge thread operations execution times (s).
4 X900 threads execution times (s).
5 X900 camera thread operations execution times (s).
6 X900 merge thread operations execution times (s).
7 X900 with UMat threads execution times (s).

Acronyms

BRISK Binary Robust Invariant Scalable Keypoints
CPU Central Processing Unit
DIW Dynamic Image Warping
FNR False Negative Rate
FOV Field Of View
FPR False Positive Rate
FPS Frames Per Second
GPU Graphics Processing Unit
HOG Histogram of Oriented Gradient
IDE Integrated Development Environment
IP Internet Protocol
L-CNN Light weighted Convolutional Neural Network
OCamCalib Omnidirectional Camera Calibration
OpenCV Open Computer Vision
OpenCL Open Computing Language
QML Qt Modeling Language
RANSAC RANdom SAmple Consensus
RBF Radial Basis Function
RAM Random-Access Memory
SIFT Scale Invariant Feature Transform
SSD Single Shot multi-object Detector
SURF Speeded Up Robust Feature
SVM Support Vector Machine
UMat Unified Mat

1 Introduction

Working with heavy vehicles, such as harvesters and mining machines, implies that the operator of the vehicle must be well aware of the surroundings. Otherwise, the vehicle may pose a safety risk. The vehicles have blind spots that are not covered by the side view mirrors and in many of the heavy vehicles, there is no way for the operator to see the rear of the vehicle to detect potential obstacles behind it [1]. The statistics of fatal accidents at work in Sweden from 2011 to 2020 show that vehicles or machines cause, on average, 45.66% of the fatal accidents at work [2]. Further analysis conducted by the Swedish work environment authority concluded that a large cause for fatal accidents around heavy vehicles is the lack of situational awareness for the operator manoeuvring the vehicle [3]. Thus, vision systems can provide helpful information.

The vision systems used on heavy vehicles today commonly consist of multiple separate cameras. The operator either has multiple screens, one screen with multiple images side by side, or one screen where the operator actively has to choose which image to look at [4] [5].

By combining images from multiple cameras, a top-view image of a vehicle and its surroundings can be provided; this is also called a bird's-eye view vision system [6]. The name bird's-eye view refers to the view from which birds observe the world. The most common setup for such a system is four fish-eye lens cameras placed around the vehicle with a horizontal view. The captured images are therefore heavily distorted, and several image processing techniques need to be applied to achieve the final bird's-eye view [7].

Many car manufacturers offer bird’s-eye view vision systems and most of the systems consist of four fish-eye cameras installed around the car [8] [9]. There are also bird’s-eye view vision systems developed for trucks, such as the Backeye®360 from Brigade [10].

The solution proposed in this master thesis is an extension of the bird’s-eye view vision system where human detection has been integrated. With this solution, the blind spots will be reduced and the operator will be able to have a more extensive awareness of the surroundings from a single screen. The system will make the vehicle smart enough to assist the operator in detecting hazardous situations.

The master thesis was conducted together with the company CrossControl. CrossControl works with monitoring solutions for the heavy vehicle industry, such as harvesters and mining machines. They supply their customers with both hardware and software solutions that keep their vehicles smart, safe, and productive.

The structure of the paper is as follows: the problem formulation and the proposed hypothesis, along with the derived research questions, are stated in Section 2. In Section 3, some background information for the thesis is described, and a state-of-the-art review is given in Section 4. The selected method is stated in Section 5. Ethical and societal considerations are addressed in Section 6. The implementation is described in Section 7, and the experimental setup is shown in Section 8. The results are presented in Section 9 and discussed in Section 10, and future work is presented in Section 11. Finally, conclusions regarding the work are addressed in Section 12.

2 Problem Formulation

Manoeuvring a heavy vehicle in a safe manner is a demanding task for the operator. Resources have been put into solving this problem and decreasing the number of accidents related to the heavy vehicle industry.

This thesis will develop a bird's-eye view vision system for heavy vehicles with the goal of decreasing the number of blind spots and increasing the environmental awareness of the operator. The system should also be able to detect humans and alert the driver that a human is approaching the vehicle.

The system will be developed on the pre-defined platform CCpilot V700 provided by CrossControl. Thus, the available computational power will be limited, and the data from the four cameras will be computationally intensive to process. Since the system should operate in real-time, solutions related to computational workload will be investigated.

2.1 Hypothesis

By implementing a bird's-eye view vision system with integrated human detection, visibility limitations such as blind spots will be reduced and the operator's awareness of the vehicle's surroundings will be increased.

2.2 Research questions

RQ1: Which algorithms, resources and limitations are necessary to implement a bird's-eye view vision system, including human detection, on the pre-defined embedded display computer?

RQ2: How can the computational workload be reduced without jeopardising the functionality of the system?

RQ3: How does the system affect the situational awareness regarding the reduction of visual blind spots?

RQ4: How can the system be developed to operate and detect humans in real-time?

2.3 Limitations

The hardware, such as computers and cameras, will be provided by CrossControl. The authors will also be supplied with the software used in the company's current systems. Image processing along with human detection increases the computational workload, especially in this system, which will be operating in real-time.

The thesis will not include development of new algorithms.

Since this thesis is based on vision, environmental circumstances will affect the system. Therefore, the field of use for the system is limited. Bad weather, such as heavy rainstorms, along with poor light conditions will limit the usage of the system. As daylight fades, sight will be limited and the system will no longer be reliable.

The authors will focus on the integration of the bird's-eye view vision system and human detection, along with an academic reflection on how this system affects the situational awareness of the vehicle and operator.

3 Background

Basic knowledge and necessary background information needed to understand the subject and the different parts of the thesis is presented in this section.

3.1 Fish-eye

A fish-eye lens is an ultra-wide-angle lens with a significantly broader angle of view than a normal lens. The name 'fish-eye' refers to any lens capable of capturing the whole hemisphere and projecting it onto a plane. Such a lens produces heavy visual distortion, both regarding the distance to and the appearance of objects in the view. The distance distortion makes objects appear smaller and located further away than they really are. The appearance distortion causes objects to change proportions: objects near the centre of the view appear bigger, and objects at the edges appear smaller than they really are. Cameras with an ideal normal lens create images without distortion, which means that they have straight perspective lines. Fish-eye lenses, on the other hand, create images with a convex appearance called barrel distortion [11] [12].

3.2 Image distortion

The distortion of an image depends on the physical parameters of the camera and its lens. Image distortion can be defined as the deviation of a pixel's predicted position on the camera's focal plane. The degree of distortion depends on several factors, primarily the angle of view and focal length of the lens. As mentioned in Section 3.1, wide-angle lenses cause changes in the appearance of objects in the image: the closer an object is to the centre of the view, the larger it appears. Telephoto lenses create distortions that cause objects close to the lens to appear larger. The background, or objects far from the lens, may appear closer and larger than they are [11].
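Radial distortion of this kind is commonly described with a polynomial model. The following standard (Brown-style) formulation, given here for reference and not taken from the thesis, maps an undistorted normalised image point (x, y) to its distorted position (x_d, y_d):

\[
x_d = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad
y_d = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad
r^2 = x^2 + y^2
\]

The coefficients k_1, k_2 and k_3 are estimated during camera calibration, and their signs and magnitudes determine whether the resulting distortion is barrel- or pincushion-shaped.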

Three common distortions are represented by a grid system in Figure 1, where (a) represents the grid of an image taken with a normal lens without distortion, (b) represents the barrel-distorted grid of an image taken with a wide-angle lens, and (c) depicts the pincushion-distorted grid commonly seen in images taken with a telephoto lens.

(a) No distortion. (b) Barrel distortion. (c) Pincushion distortion.

Figure 1: Types of image distortions.

3.3 Image warping

Image warping is the transformation of an image so that the locations of the pixels in the original image are changed. The transformation does not change the colour or the intensity [13]. The basic geometric transformations of warping are translations, rotations and scalings, but polynomials or spline functions are necessary for more complex warping [14]. Warping can be used in image processing to correct the optical distortion introduced by the camera and to change the viewing perspective [15].

3.4 Camera calibration

The goal of camera calibration is to retrieve the intrinsic and extrinsic parameters of the camera to enable transformations of the image [16, p.370-381]. The intrinsic parameters refer to the camera and lens properties, i.e. the distortion of the lens and the focal length. The extrinsic parameters consist of two parts, rotation and translation, which enable transformations between the camera coordinate system and the world coordinate system [17].

3.5 Homography

Homography can be defined as the projective mapping between two different planes. In computer vision, homography can be used for multiple purposes, such as camera pose estimation, correction or change of perspective, and image stitching [16, p.384-385]. Let the viewed point be expressed as $\tilde{Q}$ and the image point to which $\tilde{Q}$ is mapped as $\tilde{q}$, Equation 1; the homography mapping can then be expressed as a matrix multiplication, as seen in Equation 2.

\[
\tilde{Q} = \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \qquad
\tilde{q} = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{1}
\]

\[
\tilde{q} = sH\tilde{Q} \tag{2}
\]

In Equation 2, $s$ is an arbitrary scale factor that has been factored out of the homography matrix $H$ to show that the homography is only defined up to that factor.

The homography matrix $H$ can be divided into two parts: the physical transformation and the projection.

The physical transformation, $W$ in Equation 3, is a combination of a rotation $R$ and a translation $t$ which relates the viewing plane to the image plane. The projection introduces the camera matrix, $M$ in Equation 4, where $f_x$ and $f_y$ represent the focal length and $c_x$ and $c_y$ represent the principal point offset.

\[
W = \begin{bmatrix} R & t \end{bmatrix} \tag{3}
\]

\[
M = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \tag{4}
\]

This yields that Equation 2 can be rewritten as Equation 5.

\[
\tilde{q} = sMW\tilde{Q} \tag{5}
\]

To achieve the desired coordinate $\tilde{Q}'$, which is only defined for the viewing plane rather than for all of space, the object plane has to be defined. By defining the plane as $Z = 0$, Equation 5 can be rewritten as Equation 6. The homography matrix can then be described by $H = sM\begin{bmatrix} r_1 & r_2 & t \end{bmatrix}$.

\[
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= sM \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix}
\begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix}
= sM \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}
\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \tag{6}
\]

The homography between two images can be derived using four corresponding points in the images; by using more points, an overdetermined system is obtained [18].
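As a concrete illustration of Equations 2 and 6, the sketch below estimates a homography from four point correspondences with OpenCV and maps an additional point through it. The point values are placeholders for illustration; this is not code from the thesis.

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    // Four corresponding points on the same plane, seen in two images
    // (placeholder coordinates).
    std::vector<cv::Point2f> src = {{100, 100}, {400, 120}, {420, 380}, {90, 360}};
    std::vector<cv::Point2f> dst = {{120, 110}, {410, 100}, {400, 400}, {110, 390}};

    // Four pairs fully determine H; with more pairs an overdetermined
    // system is solved, optionally made robust with RANSAC.
    cv::Mat H = cv::findHomography(src, dst, cv::RANSAC);

    // Map another point through H. The implicit division by the third
    // homogeneous coordinate reflects that H is only defined up to scale.
    std::vector<cv::Point2f> in = {{250.0f, 250.0f}}, out;
    cv::perspectiveTransform(in, out, H);
    std::cout << "Mapped point: " << out[0] << std::endl;
    return 0;
}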


3.6 Feature extraction

Instead of analysing whole images, methods to optimise image processing have been developed. Extracting points of interest, also called feature points, is a method widely used to reduce the complexity of image processing. By extracting feature points, both local and global properties are obtained. If a sufficient number of such points can be extracted from an image of interest, this approach performs well [19]. Scale Invariant Feature Transform (SIFT), Speeded Up Robust Feature (SURF) and Binary Robust Invariant Scalable Keypoints (BRISK) are some examples of feature extraction methods.

The extracted feature points in each image need to be matched with the points in the other image to find matching features in different images. Therefore, feature matching methods are applied. These methods search for correlation between points. The result is a reduced amount of points but with increased reliability [20].
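A minimal sketch of such an extract-and-match pipeline in OpenCV, here with BRISK keypoints and brute-force matching (the image file names are placeholders; the thesis does not prescribe this exact combination):

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img1 = cv::imread("left.png", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("right.png", cv::IMREAD_GRAYSCALE);
    if (img1.empty() || img2.empty()) return 1;

    // Detect keypoints and compute binary descriptors.
    cv::Ptr<cv::BRISK> brisk = cv::BRISK::create();
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat des1, des2;
    brisk->detectAndCompute(img1, cv::noArray(), kp1, des1);
    brisk->detectAndCompute(img2, cv::noArray(), kp2, des2);
    if (des1.empty() || des2.empty()) return 1;

    // Match descriptors (Hamming distance for binary descriptors);
    // cross-checking keeps fewer but more reliable matches.
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(des1, des2, matches);

    cv::Mat vis;
    cv::drawMatches(img1, kp1, img2, kp2, matches, vis);
    cv::imwrite("matches.png", vis);
    return 0;
}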

3.7 Image stitching

The procedure of combining several partly overlapping images is called image stitching. The goal of this procedure is to create one large seamless image [21]. The procedure consists of image registration, seam cutting, and image blending. Image registration is the first step and aligns the images, which creates overlapping areas. Global homography and spatially varying warps are examples of methods used for this step. The second step, seam cutting, aims to find an optimal seam in the overlapping area so that misaligned artefacts are hidden and the seam becomes invisible. Even though the goal of the seam cutting is a seamless image, the seam might still be slightly visible due to brightness differences between the images and other factors, which makes image blending necessary to reduce the visibility of the seam [22].

3.8 Human detection

Human detection is the task of locating human presence in images and is commonly used in video surveillance. By searching every part of the image and comparing each part with known patterns and libraries of human images, it is possible to detect and locate humans present in an image [23].

3.9 OpenCV and OpenCL

Open Computer Vision (OpenCV) is an open-source library for computer vision and machine learning software. The library does not only contain what is referred to as the classical computer vision and machine learning algorithms, but it is also continuously updated to contain new modern algorithms. It is widely used by amateurs, professionals, as well as governmental bodies. OpenCV is natively written in C++ but also includes interfaces for Python, Java and MATLAB [24].

Open Computing Language (OpenCL) is a framework that provides Graphics Processing Unit (GPU) access for applications with non-graphical computations. This is a great asset since many computer vision algorithms, such as object detection and image processing algorithms, run more efficiently on a GPU than on a Central Processing Unit (CPU) [25].

3.10 Mat and UMat

Mat is a class in OpenCV used to store data such as vectors, matrices and images. The Unified Mat (UMat) class is similar to the Mat class; however, the advantage of UMat is its connection to OpenCL. If the code runs on a platform with OpenCL installed, a UMat passed to an OpenCV function will include OpenCL instructions, which results in higher performance for computer vision applications [26].
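In code the difference is mainly the container type. A minimal sketch of the so-called transparent API, where functions given UMat arguments may be dispatched to OpenCL (the file name is a placeholder):

#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>

int main() {
    std::cout << "OpenCL available: " << cv::ocl::haveOpenCL() << std::endl;

    cv::Mat cpu = cv::imread("frame.png", cv::IMREAD_COLOR);
    if (cpu.empty()) return 1;

    cv::UMat img;
    cpu.copyTo(img);  // upload the frame into a UMat

    // The same OpenCV calls as with Mat; with OpenCL available they can
    // run on the GPU instead of the CPU.
    cv::UMat gray, blurred;
    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);
    cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 1.5);

    cv::Mat result;
    blurred.copyTo(result);  // download the result back to CPU memory
    return 0;
}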

3.11 Qt Creator

Qt Creator is a cross-platform Integrated Development Environment (IDE) used for application development across desktop, mobile, and embedded platforms. It is compatible with Linux, Windows and macOS and allows the development of applications in numerous languages such as C++, QML, JavaScript, and Python. Its framework is written in C++ [27].

3.12 CCpilot V700

The CCpilot V700 is a screen computer based on an i.MX8 DualXPlus 2 x 1.0 GHz processor. It is designed to address the challenges related to the increased content of software in modern mobile machines. The V700 supports emerging technologies such as multiple digital camera streams and object detection. It is equipped with a 7" 800 x 480 pixel multi-touch screen, a Vivante GC7000lite GPU with support for OpenCL, and 1 GB of Random-Access Memory (RAM). The operating system is Linux-based and supports the development environment QtCreator, which supports languages such as C++, C, Qt Modeling Language (QML) and JavaScript [28].

3.13 CCpilot X900

The CCpilot X900 is a touch screen computer powered by an Intel® Atom quad-core processor (4 x 1.6 GHz). It is designed for video analytics and is equipped with a 9" 1280 x 768 pixel multi-touch screen, an Intel® HD Graphics 500 with support for OpenCL, and 4 GB of RAM. The CCpilot X900 has the same operating system as the CCpilot V700 and also has support for QtCreator [29].

3.14 Internet Protocol camera

An Internet Protocol (IP) camera allows the images to be generated directly from digital data. Since the IP camera communicates over Ethernet, which is a worldwide standard, it allows for simple integration of multiple cameras into an existing network. The IP communication enables progressive scan, which results in higher quality video and still images than interlaced scanning [30]. The IP cameras used in this thesis are Orlaco EMOS Ethernet 180° cameras [31], which are intended for computer systems in heavy vehicles. They have a resolution of 1280 x 960 pixels and come equipped with a fish-eye lens with a 180° horizontal and 130° vertical view. Some image processing is built into the cameras, such as auto white-balance, edge enhancement, and exposure control.

4 Related Work

In this section, the state-of-the-art related to the thesis is covered. The different parts of the thesis, bird’s-eye view and human detection, are covered by previous work separately, but the combination is not as widely covered.

4.1 Bird's-eye view

Wang et al. [32] present a real-time automatic parking method that includes a bird's-eye view vision system. The vision system consists of four fish-eye cameras. The cameras were first calibrated using Zhang's method [33] to retrieve the intrinsic and extrinsic parameters. Zhang's method is based on detecting chessboard patterns and calculating the distortion of the lens from how the pattern has been distorted in the images. Thereafter, the images from the cameras were undistorted using a polynomial distortion model-based method, and the perspective was changed to give a top-view perspective. Finally, joint calibration was used to merge the images, where the Levenberg-Marquardt algorithm was applied to optimise the result by minimising the error of feature points. To further increase the quality of the image, the bilinear interpolation algorithm and a white balance procedure were applied. The results from the vision system show a surrounding view of the vehicle from a bird's-eye view with the four cameras combined. A method based on the Radon transform was successfully implemented on the bird's-eye view image to detect free parking space. The experimental results of the free parking space detection showed higher accuracy and robustness than methods based on the Hough transform. The authors did not discuss the results from the bird's-eye view vision system. However, from visual inspection, the conclusion was that the seams between the images are sharp, and the colours were not corrected to make it look like one image rather than four images side by side.

Liu et al. [7] present a driving assistance system based on a vision system that provides a bird's-eye view of the vehicle's surroundings. Six fish-eye cameras were mounted on the vehicle to capture its surroundings from all directions. The cameras were calibrated using Zhang's method. The obtained intrinsic and extrinsic parameters were used to undistort the images. The Levenberg-Marquardt method was applied to optimise those parameters and minimise the residual error. The parameters were then used to correct the distortion and transform the images from fish-eye to perspective.

The perspective images can be transferred to the same coordinate system using planar homography, i.e. the ground plane, and stitched together. Objects co-planar to the ground plane can be correctly registered and presented in the image. Liu et al. propose an "optimal seam approach" where they implement a quadratic function to model the distribution of the residual error of the image. Using this approach, they can stitch the images together by choosing to represent each overlapping point with the corresponding one from the image with the lowest error at that point. Dynamic Image Warping (DIW) was implemented to align and correctly stitch 3D objects which might be misaligned in the mentioned seams due to their non-planar properties. 3D objects obtained by a fish-eye lens will be heavily distorted when converted to perspective. These problems were addressed by warping the stitched images into one fish-eye perspective using the Radial Basis Function (RBF) Wendland.

Exposure compensation along with weighted blending was implemented to achieve a more visually correct representation of the surroundings. Liu et al. successfully achieved a seamless Bird’s-eye view vision system that supports drivers when manoeuvring vehicles.

In the paper presented by Lai et al. [34], an image fusion algorithm for around-view monitor systems was proposed. The algorithm smooths the brightness between the lenses using image Y-channel histogram matching and an image fusion algorithm.

The around-view imaging system was based on four fish-eye cameras. The images from the cameras were first corrected to remove the fish-eye distortion. Thereafter, the perspective was corrected to top view by applying the transformation matrix obtained from the homomorphic transformation. The images were then aligned through geometric alignment, creating one synthesised image. To smooth out the seams, brightness correction through photometric alignment based on histogram matching was used, with the front and back images as benchmarks. The brightness correction was chosen to be done in the YUV colour space since then only the luminance component Y had to be corrected. As the final step, the multi-band blending algorithm was applied, which removes the last visible remains of the seams. The result was a seamless bird's-eye view image, and the proposed method showed great efficiency.

In the work [8] conducted by Miró, several commercial bird’s-eye view vision systems are described. A major flaw that was found in the systems was the lack of alignment and stitching between the images. The paper presents the development of a bird’s-eye view vision system for buses using four fish-eye cameras. The main focus was to implement a warping algorithm that increases the possibility to align the images properly.

Two different techniques for camera calibration were evaluated. Firstly, cones were placed in a specific pattern around the bus to locate the overlapping areas. The drawback with this technique was that the appearance of the cones changed depending on their angle to the cameras. Therefore, planar chess patterns were used as the second technique, similar to Zhang’s method.

Two different methods were implemented for undistorting the images from the cameras: OpenCV and Omnidirectional Camera Calibration (OCamCalib). The paper states that the registration of the images, i.e. the extraction of feature points, was conducted manually. The results of the undistortion were not discussed, and neither were the results of the feature extraction. Several warping algorithms were tested and evaluated, where homography merging combined with RANdom SAmple Consensus (RANSAC) showed the best results. Blending algorithms were discussed but not implemented; their purpose was to adjust the difference in exposure and enhance the finished view.

Chen et al. propose a method for dynamically switching between adjacent cameras instead of having fixed joints in their paper [35]. Four wide-angle cameras were mounted on a vehicle. Zhang’s method and an FOV distortion model was used to calibrate the cameras. With the obtained parameters, the images from the cameras were transformed into perspective through undistortion. The vehicle was parked in a parking lot, so the borderlines could be used to obtain the homography matrix for each camera. With the matrices from the homography, the images were transformed and combined into the final view.

The objects far away from the camera appeared heavily distorted, and the image quality became poor. Therefore, the combined image was warped into one fish-eye perspective by applying the FOV distortion model. Moving objects were detected and their corresponding area calculated in each image respectively to achieve the proposed switching between cameras method. The system chose the camera that captures the largest area of moving objects in the overlapping area, and the whole image from that camera was viewed, resulting in the adjacent images being hidden.

Chen et al. successfully developed a vehicle surrounding monitoring system which reduces the number of blind spots. The dynamic switching between cameras increases the visual detection of moving objects.

Jo et al. proposed a top-view vision system for excavators in their paper [36]. Four fish-eye cameras were mounted on an excavator: front, rear, left, and right. A method based on the FOV model was implemented to correct the radial distortion of the images. Homography was implemented to project these corrected images onto a plane to obtain the proposed 360° surrounding view. An image stitching method was then applied to stitch the four corrected images into one. The developed system provided the operator of the excavator with a 360° surrounding view of the vehicle with fewer blind spots.

4.2 Human detection

The paper by Chiang and Wang [37] shows that the method of using Histogram of Oriented Gradient (HOG) and Support Vector Machine (SVM) for human detection can be successfully applied to fish-eye lens images. However, instead of sliding the window horizontally, windows of various sizes were placed along the reference line, and the HOG features were extracted and classified by a pre-trained SVM classifier. It was not possible to use an ordinary human data set to train the classifier because the appearance of humans in fish-eye images is too far from humans captured with conventional cameras. Furthermore, a window-merging algorithm was developed for merging overlapping windows that contain the same human.

The proposed method was shown to detect humans in low-resolution and low-contrast images containing several humans with different poses and sizes with high accuracy (98.3%). However, from the results it was also concluded that the method could not correctly detect humans that were too close to the centre of the lens.

Multiple human detection algorithms for real-time video on edge computing devices were evaluated and compared in a work conducted by Nikouei et al. [38]. The evaluated algorithms were Haar Cascaded, HOG with SVM, Single Shot multi-object Detector (SSD) GoogleNet, SSD MobileNet and Light weighted Convolutional Neural Network (L-CNN), and the evaluated parameters were Frames Per Second (FPS), CPU usage, memory usage, False Positive Rate (FPR) and False Negative Rate (FNR).

In the results SSD GoogleNet and SSD MobileNet showed to have the best accuracy of the algorithms. However, their memory usage was the highest and SSD GoogleNet showed the second-worst result in both FPS and CPU usage. HOG with SVM had the second-worst result regarding FPS and CPU usage and the next worst result regarding accuracy, which gave it the overall worst result. The Haar Cascaded performed the worst with respect to accuracy. However, its CPU and memory usage were one of the lowest and it had the highest frames per second. Finally, L-CNN was proven to be the best of the algorithms for this application. It had the second-best result in every category except FPR where it was the third-best, which gave it the overall best result.

In a paper by Gajjar et al. [39], a new method for human detection in real-time video was proposed. The method is based on the common HOG + SVM approach, but visual saliency was applied to determine the region of interest, which increased the precision and lowered the required computational power and execution time. Furthermore, to determine the paths of the detected humans, the k-means algorithm was implemented, which clustered the HOG vectors of the positively detected windows. The results showed a detection precision of 83.11% and a recall of 41.27%.

5 Method

The hypothetico-deductive method [40] was applied in order to either verify or falsify the hypothesis proposed in Section 2.1 and answer the research questions stated in Section 2.2. The workflow is presented in Figure 2 below.

Figure 2: The workflow of the thesis.

A literature study of the topic was conducted, where the aim was to obtain more extensive knowledge about the most suitable approaches and algorithms to use for the implementation. The research questions RQ1 and RQ2 were answered during the implementation, and RQ3 and RQ4 during the tests. Results generated by the system, and by the different parts separately, have been analysed, evaluated and discussed by comparing execution times and the final image, and conclusions have been drawn.

6 Ethical and Societal Considerations

The work of this thesis encounters both ethical and societal considerations that will be addressed in this section. The thesis does not handle any confidential data. Therefore, the handling of such data will not be addressed.

The camera system can be seen as a surveillance system and will therefore be considered as such. No data such as images or videos will be stored by the system. The security of the system will be addressed by the company.

The system should not be considered a safety system. An operator will interpret the information from the system. It should be used as a tool and not a stand-alone solution for manoeuvring a vehicle in a safe manner.

Implementation of this system could result in fewer accidents around heavy vehicles since it could increase the operator’s situational awareness.

7 Implementation

Based on the literature study, the methods and algorithms described in the following sections have been implemented to obtain the desired outputs, results and knowledge to answer the stated research questions. The main part of the system has been implemented in QtCreator using OpenCV functions in C++, while some parts have been developed in VSCode using Python.

To achieve the final bird's-eye view, the implementation was divided into multiple steps, as presented in Figure 3.

Figure 3: The steps of the implementation.

7.1 Camera calibration

Based on the literature study, Zhang's method [33] for camera calibration was selected and implemented. A chessboard pattern printed on an A4 paper sheet was used to calibrate the cameras.

A number of images with different translations and orientations were captured with each camera, as seen in Figure 4. In each capture, the coordinates of all corners within the chessboard were detected and stored. Thereafter, the built-in OpenCV calibration function uses the coordinate points to calculate the intrinsic camera parameters.

Figure 4: The figure shows three images used for camera calibration.
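A minimal sketch of this calibration step with OpenCV's standard pinhole calibration functions is shown below. The board dimensions, square size and file names are assumptions rather than values from the thesis, and for lenses this wide OpenCV's fisheye module may be a better fit; the sketch only illustrates the chessboard-based procedure.

#include <opencv2/opencv.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const cv::Size boardSize(8, 5);       // inner-corner count (assumed)
    const float squareSize = 0.03f;       // square side length in metres (assumed)

    // Ideal board corners in board coordinates; the board defines the plane Z = 0.
    std::vector<cv::Point3f> board;
    for (int y = 0; y < boardSize.height; ++y)
        for (int x = 0; x < boardSize.width; ++x)
            board.emplace_back(x * squareSize, y * squareSize, 0.0f);

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    cv::Size imageSize;

    for (int i = 0; i < 9; ++i) {         // nine calibration images per camera (Section 7.8.1)
        cv::Mat img = cv::imread("calib_" + std::to_string(i) + ".png", cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        imageSize = img.size();

        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, boardSize, corners)) {
            cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.001));
            imagePoints.push_back(corners);
            objectPoints.push_back(board);
        }
    }
    if (imagePoints.empty()) return 1;

    // Estimate the intrinsic matrix and distortion coefficients (Zhang's method).
    cv::Mat K, dist;
    std::vector<cv::Mat> rvecs, tvecs;
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize, K, dist, rvecs, tvecs);
    std::cout << "RMS reprojection error: " << rms << "\nK =\n" << K << std::endl;
    return 0;
}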

7.2 Human detection

For human detection, a built-in pre-trained HOG + linear SVM model from OpenCV was used. The combination of HOG as feature descriptor and SVM as classifier is common in related work [37]–[39] and has proven to work especially well for human classification. HOG is used as a feature extractor, where objects in the images are described using the distribution of edge directions. It divides the image into smaller sections, and a histogram of gradient directions is compiled for each section. The descriptor is then completed by assembling all the histograms. Finally, an SVM is applied to the features in the descriptor [41, p.150-177]. It would have been advantageous to apply human detection on the final stitched image to reduce the computational workload. However, the distortion introduced by the change of perspective makes it impossible to detect humans with predefined human detection libraries. Therefore, human detection was implemented on each camera before changing perspective. The advantage of this solution is that it has a higher range. When a human is detected by one of the cameras, an alert is given by tinting the stitched image red. This helps the operator in determining from which direction the human is approaching.

(a) Human detected and marked in green. (b) Human detected with the final red tint.

Figure 5: Human detection.
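A minimal per-frame sketch of the built-in HOG + linear SVM people detector, including the red tint used as an alert (the input file and the blending weights are illustrative, not taken from the thesis code):

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::HOGDescriptor hog;
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    cv::Mat frame = cv::imread("camera_front.png");  // placeholder input frame
    if (frame.empty()) return 1;

    // Sliding-window detection over an image pyramid.
    std::vector<cv::Rect> people;
    std::vector<double> weights;
    hog.detectMultiScale(frame, people, weights, 0.0, cv::Size(8, 8),
                         cv::Size(), 1.05, 2.0);

    for (const cv::Rect& r : people)
        cv::rectangle(frame, r, cv::Scalar(0, 255, 0), 2);  // mark detections in green

    if (!people.empty()) {
        // Tint the image red to alert the operator.
        cv::Mat red(frame.size(), frame.type(), cv::Scalar(0, 0, 255));
        cv::addWeighted(frame, 0.7, red, 0.3, 0.0, frame);
    }
    cv::imwrite("detections.png", frame);
    return 0;
}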

7.3 Undistortion

The images captured by the cameras are heavily distorted. The fish-eye lens causes extreme barrel distortion. Using the extracted intrinsic parameters, the images can be undistorted, as seen in Figure 6.

(a) Original image captured by the fish-eye lens. (b) Undistorted image.

Figure 6: Before and after undistortion.
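A minimal sketch of the undistortion step, consistent with the map1/map2 remapping tables mentioned in Section 7.8.1 (the intrinsic and distortion values shown are placeholders; in the real system they come from the calibration):

#include <opencv2/opencv.hpp>

int main() {
    // Intrinsic matrix K and distortion coefficients from the calibration step
    // (placeholder values for illustration).
    cv::Mat K = (cv::Mat_<double>(3, 3) << 400, 0, 640, 0, 400, 480, 0, 0, 1);
    cv::Mat dist = (cv::Mat_<double>(1, 5) << -0.3, 0.09, 0, 0, 0);

    cv::Mat frame = cv::imread("fisheye_frame.png");
    if (frame.empty()) return 1;

    // The remapping tables are computed once and reused for every frame,
    // which is cheaper than calling cv::undistort per image.
    cv::Mat map1, map2;
    cv::initUndistortRectifyMap(K, dist, cv::Mat(), K, frame.size(), CV_16SC2, map1, map2);

    cv::Mat undistorted;
    cv::remap(frame, undistorted, map1, map2, cv::INTER_LINEAR);
    cv::imwrite("undistorted.png", undistorted);
    return 0;
}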

7.4 Resizing

The wide field of view of the cameras also affects the undistortion of the images. The undistorted images stretch to infinity along the x-axis due to the wide-angle in the x-direction. Since the cameras have a narrower field of view angle in the y-direction, the images will be finite along the y-axis when undistorted. Therefore, the images are cropped in order to achieve an image with non-infinite edges. All images are cropped in the same way to have the same FOV for all four cameras.

7.5 Change of perspective

A surface in the image must be assigned as the ground plane to create the top-down view of the images. By placing the chessboard on the ground, as shown in Figure 7 below, a top view can be created by remapping the pixels so that the chessboard is rectified.

Figure 7: The position of the chessboard relative to the camera.

The remapping was accomplished by first finding the four points in the image representing the outer corners of the chessboard, P_O1–P_O4 in Figure 8 and Equation 7. Thereafter, by using prior knowledge of the ratio between the corners, i.e. the size of the board, the four corresponding destination points P_D1–P_D4 could be calculated, as can be seen in Figure 8 and Equation 7. One of the original points is set to have the same destination point, P_O1 in the case stated below, to maintain the same position of the chessboard in the transformed image.

Figure 8: Points for calculating homography to change perspective.

\[
\begin{aligned}
&\text{Chessboard size} = 5 \times 6 \text{ squares} \\
&P_{O1} = \text{corner}_1, \qquad P_{D1} = P_{O1} \\
&P_{O2} = \text{corner}_2, \qquad P_{D2} = (P_{D1}.X,\ P_{D1}.Y + 5) \\
&P_{O3} = \text{corner}_3, \qquad P_{D3} = (P_{D1}.X + 6,\ P_{D1}.Y + 5) \\
&P_{O4} = \text{corner}_4, \qquad P_{D4} = (P_{D1}.X + 6,\ P_{D1}.Y)
\end{aligned} \tag{7}
\]

After the original and destination points are determined, they are used to find the top-view homography matrix, representing the transformation between the original and top-view image. Figure 9 shows the result of transforming the image using the top-view homography matrix.

(a) The undistorted original perspective. (b) Top-view perspective.

Figure 9: Before and after change of perspective.
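A minimal sketch of this perspective change: detect the ground chessboard, choose rectified destination points from the known square ratio, and warp. The pattern size, pixel scale and file name are assumptions, not values from the thesis.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat undist = cv::imread("undistorted_front.png");  // placeholder input
    if (undist.empty()) return 1;

    // Inner corners of the chessboard lying on the ground plane (assumed pattern size).
    const cv::Size pattern(5, 4);
    std::vector<cv::Point2f> corners;
    if (!cv::findChessboardCorners(undist, pattern, corners)) return 1;

    // Outer corners of the detected grid in the image ...
    std::vector<cv::Point2f> src = {
        corners.front(),                          // P_O1
        corners[pattern.width - 1],               // P_O2
        corners.back(),                           // P_O3
        corners[corners.size() - pattern.width]   // P_O4
    };
    // ... and their destinations, placed so that the board becomes rectified
    // (compare Equation 7; the scale in pixels per square is arbitrary).
    const float s = 40.0f;
    cv::Point2f o = src[0];
    std::vector<cv::Point2f> dst = {
        o,
        {o.x + (pattern.width - 1) * s, o.y},
        {o.x + (pattern.width - 1) * s, o.y + (pattern.height - 1) * s},
        {o.x, o.y + (pattern.height - 1) * s}
    };

    cv::Mat topViewH = cv::getPerspectiveTransform(src, dst);
    cv::Mat topView;
    cv::warpPerspective(undist, topView, topViewH, undist.size());
    cv::imwrite("top_view.png", topView);
    return 0;
}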

7.6 Match and merge images

To find the relationship between the cameras, the initial plan was to find common features in the overlap of two cameras through SIFT and RANSAC. However, this resulted in several faulty matches. Therefore, a chessboard was placed in the overlapping area, and matches were made by finding the corners within the chessboard. This solution increased the accuracy and the number of correct matches in the overlapping area. The method used to find chessboard corners returns the corners row by row in a fixed order, so the points appear in the same order in both images and can be matched directly.

From the matching points, the homography between the two images was determined. Using the homography the two overlapping images were put in the same view and merged.
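A minimal sketch of determining and applying the homography between two adjacent top-view images from a chessboard placed in their overlap. File names and pattern size are placeholders, and the simple overlay at the end stands in for the padding, masking and seam trimming described in Sections 7.6 and 7.7.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat front = cv::imread("top_front.png");
    cv::Mat right = cv::imread("top_right.png");
    if (front.empty() || right.empty()) return 1;

    // Chessboard corners in the overlapping area of both images.
    const cv::Size pattern(5, 4);
    std::vector<cv::Point2f> cFront, cRight;
    if (!cv::findChessboardCorners(front, pattern, cFront) ||
        !cv::findChessboardCorners(right, pattern, cRight)) return 1;

    // The corner finder returns the points in the same order in both images,
    // so they can be used directly as correspondences.
    cv::Mat stitchH = cv::findHomography(cRight, cFront, cv::RANSAC);

    // Warp the second image into the first image's frame and overlay it.
    cv::Mat merged;
    cv::warpPerspective(right, merged, stitchH, front.size());
    cv::Mat frontGray;
    cv::cvtColor(front, frontGray, cv::COLOR_BGR2GRAY);
    front.copyTo(merged, frontGray > 0);   // keep front pixels where it has content
    cv::imwrite("merged.png", merged);
    return 0;
}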

If all four images were to have the same ground plane, the homography between them would only consist of a rotation and a translation. However, since the ground plane determined in each image was not identical, the determined homography between the images did not consist of exclusively a rotation and translation. Thereby a distortion was introduced when warping the second image to fit the first image. When thereafter adding the next image to the other two the distortion is escalated.

To achieve a better estimation of the ground plane, a larger chessboard was introduced, which covered more of the floor and therefore gave a better representation of the ground plane. The new ground plane showed better results for the merge. In order to display the images in one frame, all images were padded to the size required to display the merged image. Thereafter, masks were created for each image to ignore the black padding when placing the images on top of each other. The result of merging all four images can be seen in Figure 10.

Figure 10: The four images merged.

7.7 Stitching

To remove the overlap between the images and create a seam, the mask of each image was trimmed. The first step was to find the overlapping area in the masks, which was accomplished by reducing the colour intensity of the masks by fifty percent and then adding the two masks. As shown in Figure 11a, the background remains black and the formerly white parts of the masks become grey, except for the overlapping area, which is white. Thereafter, the overlapping white area was extracted as a mask of its own, see Figure 11b.

(a) The masks of two adjacent images added. (b) Mask for only the overlapping area.

Figure 11: Finding the overlapping area between two adjacent images.
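A minimal sketch of this overlap extraction on two binary masks (file names are placeholders):

#include <opencv2/opencv.hpp>

int main() {
    // Binary masks (255 = image content, 0 = padding) of two adjacent images.
    cv::Mat maskA = cv::imread("mask_front.png", cv::IMREAD_GRAYSCALE);
    cv::Mat maskB = cv::imread("mask_right.png", cv::IMREAD_GRAYSCALE);
    if (maskA.empty() || maskB.empty() || maskA.size() != maskB.size()) return 1;

    // Halve the intensities and add: the background stays black, non-overlapping
    // regions become grey (~127) and the overlap becomes white (~255).
    cv::Mat sum;
    cv::addWeighted(maskA, 0.5, maskB, 0.5, 0.0, sum);

    // Extract the white overlap region as a mask of its own.
    cv::Mat overlap;
    cv::threshold(sum, overlap, 200, 255, cv::THRESH_BINARY);
    cv::imwrite("overlap_mask.png", overlap);
    return 0;
}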

The next step was defining the placement of the seam by finding the middle point of the overlap mask and the inner corner point of the mask. A line was fitted to the two points, defining the placement of the seam in between the overlapping images, as can be seen in Figure 12. The unwanted overlap was then removed by trimming the masks using the fitted line, as shown in Figure 13.

Figure 12: Placement of the seam.

Figure 13: Masks with the overlap trimmed out.

When the overlap had been trimmed out, the four images were added to achieve the final stitched image, as can be seen in Figure 16. A still image of a tractor was added to represent a potential vehicle.

7.8 Software structure

The software system is implemented in QtCreator using C++. The application runs on multiple threads to reduce the execution time. The software system is divided into five main parts: calibration, main (initialisation program), main thread, camera threads and merge thread. The software structure is shown in Figure 14 and the different parts are described below.

Figure 14: The structure of the software system.

7.8.1 Calibration

The calibration is divided into three parts. The first part, camera calibration, takes nine camera calibration images per camera as input and calculates the intrinsic parameters of each camera, respectively. The intrinsic parameters are used in the next step of the calibration, which is the warping calibration. The warping calibration takes the intrinsic parameters and one image from each camera, with the chessboard pattern placed on the ground, as inputs. The output from the warping calibration is two map matrices used for undistortion, one top-view homography matrix and one top-view image per camera. The top-view images are used in the next step in the calibration, which is the merge calibration. The merge calibration merges the four top-view images, and masks for each image are created to stitch the overlapping regions between the images. The output from the merge calibration is one stitch homography matrix and one mask for each image. The three calibration steps are only executed once, when the cameras are installed on the vehicle. Re-calibration is necessary if the cameras were to be moved or the angle of a camera changed.

7.8.2 main (initialisation program)

As described in Algorithm 1, the initialisation program starts with reading the parameters generated in the calibration. Each camera has, as mentioned, two map matrices, map1 and map2, and one top-view homography matrix topViewH. The parameters from all four cameras are stored in vectors, one for each type of parameter. The parameters from the merge calibration, stitchH and mask, are read from a file and stored in vectors by the same principle. The next step in the initialisation program is to open pipelines and start the streaming from the four cameras. A handle to each camera stream is created and stored in a vector, handles. The connection to each camera is checked to ensure that it works as expected. When the connections are established, the main thread is started, and the vector with handles is sent as input.

Algorithm 1: main (initialisation program)
1: procedure main()
2:   read calibration parameters from file
3:   for i = 0; i <= 3; i++ do
4:     handles[i] = start camera[i]
5:     if camera[i] OK then
6:       continue
7:     else
8:       try to connect to camera[i] again
9:   start mainThread(handles)

7.8.3 mainThread

The main thread is an infinite loop, as can be seen in Algorithm 2. One camera thread is started for each handle in handles, and a variable called future, which is connected to each thread, is stored in a vector. Those variables are then used to wait for and check that each camera thread is finished. The output image from each camera thread is stored in a vector. When all the camera threads are finished, a merge thread is started. The vector containing all the output images is sent as the input. A future variable is connected to the merge thread, and when it is finished, the process starts all over again.

Algorithm 2: mainThread
1: procedure mainThread(&handles)
2:   while 1 do                                  ▷ infinite loop
3:     for i = 0; i <= 3; i++ do
4:       futures[i] = start cameraThread(i, handles[i])
5:     for i = 0; i <= 3; i++ do
6:       futures[i].waitForFinished
7:       topViewImgs[i] = result from cameraThread[i]
8:     future = start mergeThread(topViewImgs)
9:     future.waitForFinished

7.8.4 cameraThread(s)

A camera thread, described in Algorithm 3, fetches one frame, img, using the handle to the corresponding camera. The img is then undistorted using the parameters from the calibration. By applying the top-view homography matrix from the calibration, the perspective of img is transformed to the top-view perspective. The frames are rotated and moved to fit together using the parameters from the merge calibration. The masks are applied to remove data outside the region of interest. The top-view frames are sent as outputs and retrieved into a vector in the main thread. The four camera threads run concurrently. (They are not truly parallel on the current hardware, since it has only two cores.) The camera threads cannot restart and fetch a new frame before all four previous camera threads have ended, which ensures that the frames from the four cameras, while not fetched at exactly the same time, are fetched within a short time frame.

Algorithm 3: cameraThread
1: procedure cameraThread(i, &handle)
2:   handle → img                                ▷ fetch frame from camera stream
3:   undistort img using map1[i] and map2[i]
4:   change perspective to top-view using topViewH[i]
5:   rotate and move into merge position using stitchH[i]
6:   add mask[i]

7.8.5 mergeThread

When the frames from all four cameras reach the merge thread, they have already been undistorted, changed to the top-view perspective, rotated and moved, and masked. In the merge thread the frames are combined, as can be seen in Algorithm 4. The combined image is cropped and rotated to match the size of the screen and presented.

Algorithm 4: mergeThread
1: procedure mergeThread(&topViewImgs)
2:   add all topViewImgs
3:   crop and rotate to fit screen (800 x 480)
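A condensed sketch of how the orchestration in Algorithms 2–4 can be expressed with QtConcurrent and QFuture. The function bodies are stand-ins and all names other than the Qt and OpenCV ones are illustrative; this is not the thesis code.

#include <QtConcurrent/QtConcurrent>
#include <QFuture>
#include <opencv2/opencv.hpp>
#include <array>
#include <functional>

// Stand-in for Algorithm 3: fetch a frame, undistort, warp to top view,
// move into merge position and apply the mask (processing omitted here).
static cv::Mat cameraThread(int i, cv::VideoCapture& handle) {
    cv::Mat img;
    handle >> img;                 // fetch frame from the camera stream
    (void)i;                       // index selects map1[i], map2[i], topViewH[i], ...
    return img;
}

// Stand-in for Algorithm 4: add the top-view images, crop and rotate to
// the 800 x 480 screen and present the result (omitted here).
static void mergeThread(const std::array<cv::Mat, 4>& topViewImgs) {
    (void)topViewImgs;
}

void mainThreadLoop(std::array<cv::VideoCapture, 4>& handles) {
    while (true) {                                   // infinite loop, as in Algorithm 2
        std::array<QFuture<cv::Mat>, 4> futures;
        for (int i = 0; i < 4; ++i)
            futures[i] = QtConcurrent::run(cameraThread, i, std::ref(handles[i]));

        std::array<cv::Mat, 4> topViewImgs;
        for (int i = 0; i < 4; ++i) {
            futures[i].waitForFinished();            // wait for each camera thread
            topViewImgs[i] = futures[i].result();
        }

        QFuture<void> merge = QtConcurrent::run(mergeThread, std::cref(topViewImgs));
        merge.waitForFinished();                     // then start the next iteration
    }
}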

8 Experimental setup

Figure 15: Experimental test setup.

Before implementing the system on a vehicle, a test setup was created, as seen in Figure 15. The setup measures 950 x 250 x 490 mm. The four cameras are rotated 90 degrees from each other and slightly tilted towards the floor. The cameras were tilted to limit the field of view, since the information above the horizon is not relevant for a top-view image. However, an excessive tilt might cause unwanted cropping of, for example, humans, which might affect the reliability of the human detection.

This thesis followed an iterative process where several test cases were conducted and evaluated. All tests were conducted in a normally lit indoor environment. Since the execution time between each frame had no significant fluctuation, ten frames were considered enough to represent the system's functionality. To determine which parts were the most time-consuming, the execution time of each part in the system was measured. After that, possible improvements to the system were explored to achieve a more efficient system. The test cases were performed on the V700 computer, and the specifications for each test case are stated below.

Case 1: Large images (1480 x 1640 px) and human detection on all four cameras.
Case 2: Small images (980 x 640 px) and human detection on all four cameras.
Case 3: Small images (980 x 640 px) and human detection on one of the cameras.
Case 4: Small images (980 x 640 px) without human detection.

Due to a system flaw in the V700 the GPU was not accessible, and the software could only run on the CPU. Therefore, the X900 was introduced to compare running on CPU and GPU. Case 2 and Case 3 were performed on the X900 computer, with and without the use of UMat in the merge thread.

The range of the human detection was tested by measuring the distance from the cameras to a detected human. The distance was increased by 0.1 m per measurement until the detection failed.
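The per-operation execution times reported in Section 9 can be collected with a simple wall-clock timer around each step; a minimal sketch using cv::TickMeter (the timed operation and input are placeholders):

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::Mat frame = cv::imread("fisheye_frame.png");   // placeholder input
    if (frame.empty()) return 1;
    cv::Mat out;

    cv::TickMeter tm;
    double total = 0.0;
    const int N = 10;   // ten frames were considered representative in the tests

    for (int i = 0; i < N; ++i) {
        tm.reset();
        tm.start();
        cv::resize(frame, out, cv::Size(980, 640));     // operation under test (placeholder)
        tm.stop();
        total += tm.getTimeSec();
        std::cout << "run " << i << ": " << tm.getTimeSec() << " s\n";
    }
    std::cout << "average: " << total / N << " s\n";
    return 0;
}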

9 Results

In this section, the experimental results are presented. The section is divided into four parts, where the first part presents the resulting visual appearance of the system and the following three parts present the execution times of the system's three configurations separately.

9.1 Visual appearance

The resulting visual appearance of the system is presented below. Figure 16 presents the visual result when no humans are detected.

Figure 16: The final vision system with no humans in the FOV.

Figures 17 and 18 below show the results when a human is approaching from two different angles. The system can reliably detect humans up to 5 m away from the cameras.

Figure 17: The final vision system where a human is detected in the front camera (Case 3).

Figure 18: The final vision system where a human is detected in both the front camera and right camera (Case 2).

9.2 V700

Table 1 presents the results obtained with the V700 computer. The execution time for the main thread includes all four camera threads and the merge thread. As can be seen, the camera threads are the most time-consuming, while the merge thread is quite fast. As presented, the consumed time drastically decreases when human detection is restricted to one camera (Case 2 and 3).

         mainThread                 cameraThread               mergeThread
         min      avg      max      min      avg      max      min      avg      max
Case 1   5.0588   5.1620   5.5484   1.2518   2.0895   3.6672   0.0344   0.0373   0.0445
Case 2   4.3837   4.5334   5.2243   1.0890   1.7595   3.3713   0.0171   0.0180   0.0208
Case 3   1.4323   1.9760   2.3241   0.1093   0.5032   2.2982   0.0171   0.0180   0.0208
Case 4   0.4367   0.4665   0.4923   0.0960   0.1602   0.3091   0.0169   0.0174   0.0185

Table 1: V700 threads execution times (s).

Table 2 shows the execution time for all the parts within the camera thread, where the most time-consuming part proves to be human detection.

cameraThread

         humanDetection             remap                      warpPerspective
         min      avg      max      min      avg      max      min      avg      max
Case 1   0.9945   1.8270   2.9470   0.0225   0.0245   0.0385   0.0129   0.0197   0.0492
Case 2   0.9953   1.6242   3.1412   0.0225   0.0252   0.0386   0.0130   0.0158   0.0452
Case 3   0.9984   1.5111   2.1010   0.0226   0.0384   0.0792   0.0129   0.0239   0.0541
Case 4   -        -        -        0.0224   0.0348   0.0827   0.0128   0.0164   0.0441

         copyMakeBorder             warpPerspective            bitwise_add
         min      avg      max      min      avg      max      min      avg      max
Case 1   0.0024   0.0039   0.0140   0.1843   0.2433   0.4724   0.0220   0.0237   0.0342
Case 2   0.0008   0.0009   0.0010   0.0510   0.0745   0.1701   0.0059   0.0068   0.0160
Case 3   0.0009   0.0009   0.0010   0.0512   0.0772   0.1136   0.0059   0.0070   0.0143
Case 4   0.0009   0.0012   0.0033   0.0548   0.0902   0.1673   0.0059   0.0066   0.0121

Table 2: V700 camera thread operations execution times (s).

The execution time of the functions in the merge thread is presented in Table 3. As can be seen, the execution time drastically decreases when the size of the image is reduced (see Case 1 and 2).

mergeThread

         add                        crop & rotate              Painter
         min      avg      max      min      avg      max      min      avg      max
Case 1   0.0215   0.0235   0.0300   0.0045   0.0066   0.0080   0.0011   0.0015   0.0017
Case 2   0.0057   0.0058   0.0060   0.0044   0.0063   0.0078   0.0010   0.0011   0.0012
Case 3   0.0056   0.0062   0.0080   0.0044   0.0064   0.0081   0.0010   0.0011   0.0012
Case 4   0.0056   0.0060   0.0080   0.0043   0.0061   0.0080   0.0010   0.0011   0.0013

Table 3: V700 merge thread operations execution times (s).

Average execution time for the system (Table 1, Case 3) = 1.976 s
Distance to human = 5 m
Visual reaction time for humans = 0.180 s
Execution time + reaction time = 1.976 + 0.180 = 2.156 s

\[
\frac{\text{Distance to human}}{\text{Total time}} = \frac{5\ \text{m}}{2.156\ \text{s}} = 2.319\ \text{m/s} = 8.349\ \text{km/h} \tag{8}
\]

For the system on the V700 computer to be able to detect a human, the maximum velocity of the vehicle is approximately 8 km/h, as seen in Equation 8.

9.3 X900

Table 4 presents the results obtained with the X900 computer. Case 1 is not tested since reducing the size of the images made the system significantly more efficient. Case 4 is also not tested since one of the system's main functionalities, human detection, is neglected.

         mainThread                 cameraThread               mergeThread
         min      avg      max      min      avg      max      min      avg      max
Case 2   0.9617   1.0108   1.1574   0.3324   0.7173   1.0457   0.0154   0.0155   0.0161
Case 3   0.2816   0.4903   0.7956   0.0417   0.2444   0.7818   0.0155   0.0160   0.0171

Table 4: X900 threads execution times (s).

In Table 5 the results of the operations in the camera threads are presented. As seen, human detection is still, by far, the most time-consuming operation.

cameraThread

         humanDetection             remap                      warpPerspective
         min      avg      max      min      avg      max      min      avg      max
Case 2   0.2672   0.6287   1.0088   0.0047   0.0131   0.0337   0.0039   0.0096   0.0191
Case 3   0.2395   0.3443   0.7415   0.0046   0.0084   0.0157   0.0038   0.0050   0.0096

         copyMakeBorder             warpPerspective            bitwise_and
         min      avg      max      min      avg      max      min      avg      max
Case 2   0.0009   0.0012   0.0031   0.0116   0.0310   0.0590   0.0021   0.0023   0.0053
Case 3   0.0008   0.0009   0.0009   0.0110   0.0213   0.0414   0.0019   0.0021   0.0023

Table 5: X900 camera thread operations execution times (s).

The execution time of the functions in the merge thread is presented in Table 6.

mergeThread

         add                        crop & rotate              Painter
         min      avg      max      min      avg      max      min      avg      max
Case 2   0.0114   0.0116   0.0125   0.0037   0.0040   0.0047   0.0004   0.0004   0.0005
Case 3   0.0113   0.0115   0.0124   0.0036   0.0039   0.0044   0.0003   0.0004   0.0004

Table 6: X900 merge thread operations execution times (s).

Average execution time for the system (Table 4, Case 3) = 0.490 s
Distance to human = 5 m
Visual reaction time for humans = 0.180 s
Execution time + reaction time = 0.490 + 0.180 = 0.670 s

\[
\frac{\text{Distance to human}}{\text{Total time}} = \frac{5\ \text{m}}{0.670\ \text{s}} = 7.459\ \text{m/s} = 20.720\ \text{km/h} \tag{9}
\]

For the system on the X900 computer to be able to detect a human, the maximum velocity of the vehicle is approximately 20 km/h, as seen in Equation 9.

9.4 X900 with UMat

Table 7 presents the results obtained with the X900 computer with UMat implemented in the merge thread.

         mainThread                 cameraThread               mergeThread
         min      avg      max      min      avg      max      min      avg      max
Case 2   0.9521   1.0232   1.1772   0.2836   0.7386   1.0992   0.0043   0.0049   0.0082
Case 3   0.2735   0.4802   0.7893   0.0439   0.1850   0.7834   0.0042   0.0048   0.0076

Table 7: X900 with UMat threads execution times (s).

The execution time of the operations in the merge thread with UMat implemented can be seen in Table 8.

mergeThread

         add                        crop & rotate              Painter
         min      avg      max      min      avg      max      min      avg      max
Case 2   0.0006   0.0008   0.0009   0.0002   0.0002   0.0002   0.0004   0.0004   0.0005
Case 3   0.0005   0.0007   0.0009   0.0001   0.0001   0.0002   0.0003   0.0004   0.0004

Table 8: X900 with UMat merge thread operations execution times (s).

Average execution time for the system (Table 7, Case 3) = 0.480 s, distance to human = 5 m, and visual reaction time for humans = 0.180 s:

\[
v_{\max} = \frac{\text{distance to human}}{\text{execution time} + \text{reaction time}} = \frac{5\ \text{m}}{0.480\ \text{s} + 0.180\ \text{s}} = \frac{5\ \text{m}}{0.660\ \text{s}} = 7.574\ \text{m/s} = 21.038\ \text{km/h} \tag{10}
\]

For the system on the X900 computer with UMat to be able to detect a human in time, the maximum velocity of the vehicle is approximately 21 km/h, as seen in Equation 10.


Figure 19 presents a comparison of the execution time of the merge thread in the three different configurations of the system.

[Figure 19: Merge-thread execution time per case (bar chart, milliseconds). V700: approximately 37, 18, 18 and 17 ms for Cases 1-4; X900: approximately 16 ms for Cases 2-3; X900 with UMat: approximately 2 ms for Cases 2-3.]


10 Discussion

In this section, the experimental results are discussed. The section covers the visual appearance, the execution times on the V700 and the X900, and the human detection.

10.1 Visual appearance

From the final image and the grid on the floor, see Figure 16, it can be seen that the rear image does not fit the grid perfectly. This is because the common ground plane for all the images was not correctly determined, which introduced a distortion when merging the images. In this set-up, the image from the rear camera is more distorted than the others since it is the last image to be merged. When the left and right images are merged with the front image, a small distortion is introduced to them, and when the rear image is subsequently merged, it must be distorted even more to fit the slightly distorted left and right images. Therefore, it might be preferable to begin with the rear camera instead of the front, since the operator naturally has better sight forward than backward.

Another visual flaw in the final image, Figure 16, is the difference in saturation between the stitched images. This is due to the built-in white-balance and exposure correction in the cameras. Even though the seams are visible, there are minimal to no blind spots, as can be seen in Figure 18, where a human standing on the seam between two cameras is detected by both the front and the right cameras.

10.2 V700

The progression from Case 1 to Case 4 shows how the computational workload was reduced; the results of each case are discussed below.

From the results in Table 1, it can be seen that the main thread has the longest execution time. This is because it contains both the four camera threads and the merge thread. The camera thread has a longer execution time than the merge thread because more operations are performed in the camera thread. Another remarkable detail in the execution times for the V700 is the large deviation between the runs. This is because the V700 only has two cores and therefore cannot run all the threads truly in parallel, so the execution scheme may differ between runs.

In Case 1, the four images were padded more than necessary, resulting in a black border which had to be cropped. Therefore, Case 2 was derived, where the four images were padded just enough to fit in the same image without creating a black border. The results of Case 2 showed a great improvement over Case 1: the execution time for all operations from copyMakeBorder onwards in Case 2 was approximately one-fourth of the execution time for the same operations in Case 1, as can be seen in Table 2.

By comparing the operations inside the camera thread, Table 2, it can be concluded that human detection has the longest execution time of all operations. Hence, Case 3 was introduced, which applies human detection to only one camera instead of all four. The result of Case 3 shows a decrease of over 2.5 s in the total time of the main thread, as can be seen in Table 1. Furthermore, Case 4 was introduced, where the human detection was removed completely. Case 4 gives a frame rate of approximately 2 FPS; however, it removes a major function of the system and is therefore not an option. Hence, the best result for the V700 is Case 3, which gives a frame rate of approximately 0.5 FPS.
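For reference, the camera-thread operations timed in Tables 2 and 5 correspond roughly to the OpenCV sequence sketched below. This is only a sketch assuming precomputed undistortion maps, homographies and a per-camera mask; the HOG-based detector and all names are illustrative stand-ins, not the exact implementation.

```cpp
// Rough sketch of the per-camera pipeline (order follows the implementation chapter).
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat processCameraFrame(const cv::Mat& frame,
                           const cv::Mat& map1, const cv::Mat& map2, // from initUndistortRectifyMap
                           const cv::Mat& topViewH, const cv::Mat& mergeH,
                           const cv::Mat& mask,                      // same size/type as the aligned image
                           cv::HOGDescriptor& hog)
{
    // humanDetection: detect pedestrians in the camera frame (stand-in detector).
    std::vector<cv::Rect> humans;
    hog.detectMultiScale(frame, humans);

    // remap: remove the fish-eye distortion with the precomputed maps.
    cv::Mat undistorted;
    cv::remap(frame, undistorted, map1, map2, cv::INTER_LINEAR);

    // warpPerspective: project the image onto the common ground plane (top view).
    cv::Mat topView;
    cv::warpPerspective(undistorted, topView, topViewH, undistorted.size());

    // copyMakeBorder: pad so the image fits on the shared bird's-eye canvas.
    cv::Mat padded;
    cv::copyMakeBorder(topView, padded, 0, 400, 0, 400, cv::BORDER_CONSTANT, cv::Scalar());

    // warpPerspective: align this camera's view with the neighbouring views.
    cv::Mat aligned;
    cv::warpPerspective(padded, aligned, mergeH, padded.size());

    // bitwise_and: keep only this camera's region so the views can be added later.
    cv::Mat masked;
    cv::bitwise_and(aligned, mask, masked);
    return masked;
}
```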

10.3 Human detection

The functionality of the human detection was good: the system was able to detect humans in multiple cameras at once. However, the detection range of 5 m leaves more to be desired, especially if a human approaches quickly.

The result for the best case, Case 3, shows that for the system to detect a human and alert the operator in time, the maximum velocity of the vehicle must be approximately 8 km/h. In general, this is an unsatisfactory result; however, in some circumstances it is sufficient, for example when the vehicle reverses.


10.4 X900

The X900 was introduced mainly to compare executing part of the system on a GPU, using UMat, instead of on a CPU. Since the camera thread proved to be the most time-consuming part of the system, it would have been advantageous to test it with UMat. However, that would have required major restructuring of the camera thread implementation. Therefore, UMat was tested only on the merge thread. Case 1 was neglected when testing the X900 since the more heavily padded images do not add anything to the system, and Case 4 was neglected since it removes a major function of the system.

When comparing the overall results of the X900, with and without UMat, with the results of the V700, the X900 proved to be much better, which is expected since it is a higher-performance computer. Human detection still takes more time than any other process; however, it is three times faster than when running on the V700.

The difference in execution time between running human detection on four cameras versus one, Case 2 and Case 3, on the X900 is approximately 50%, which is a slightly smaller improvement than the corresponding difference between Case 2 and Case 3 on the V700. The reason is that the X900 has two more cores than the V700, so the difference is not as significant for the X900 as for the V700.

Running the merge thread with UMat resulted in more than a 50% decrease in its execution time, which is a great result.

It is highly feasible to run the system on the X900 even without UMat. The total execution times, as can be seen in Table 4 and Table 7, give approximately 2 FPS and a maximum vehicle speed of 20 km/h, or 21 km/h with UMat. If the GPU were accessible on the V700 and UMat were used instead of Mat throughout the system, the V700 would also give a much more feasible outcome.
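As an illustration of how small the code change is, the sketch below shows the merge-thread add operation using OpenCV's Transparent API: the same cv::add call dispatches to OpenCL kernels when a GPU is available, simply because the arguments are UMat instead of Mat. The variable names and the explicit host-device copies are assumptions, not the exact thesis implementation.

```cpp
// Minimal sketch: offloading the merge-thread "add" to the GPU via OpenCV's T-API.
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>

cv::Mat addOnGpu(const std::vector<cv::Mat>& warped)
{
    cv::ocl::setUseOpenCL(true);            // enable the OpenCL backend if present

    cv::UMat canvas;
    warped[0].copyTo(canvas);               // upload the first image to device memory
    for (size_t i = 1; i < warped.size(); ++i)
    {
        cv::UMat next;
        warped[i].copyTo(next);
        cv::add(canvas, next, canvas);      // runs as an OpenCL kernel when available
    }

    cv::Mat result;
    canvas.copyTo(result);                  // download the merged canvas back to the host
    return result;
}
```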


11 Future Work

This section addresses possible improvements and describes relevant aspects for further development of the system.

11.1 Stitching

The estimation of the common ground plane should be further researched, as it is the apparent reason for the misalignment of the images. With an improved estimation of the ground plane, the misalignment of the images would decrease, enhancing the visual impression of the system.

Detection and adaptive stitching of three-dimensional objects can be implemented to improve the visual aspects of the system further. The computational workload will increase, which will make the system slower, if not compensated for with additional computational power.

The cameras have automatic colour and saturation correction, which generates differences in brightness between the images. Therefore, it would be suitable to introduce brightness or exposure compensation, combined with weighted blending, to achieve a visually cohesive representation of the surroundings.
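One possible realisation of this suggestion is sketched below, assuming two already-aligned top-view images that share a known overlap rectangle. The single-gain exposure model, the per-column alpha ramp and all names are simplifications introduced here, not part of the current system.

```cpp
// Sketch of exposure compensation followed by weighted (linear) blending across a seam.
#include <opencv2/opencv.hpp>
#include <algorithm>

void blendSeam(cv::Mat& left, cv::Mat& right, const cv::Rect& overlap)
{
    // Exposure/gain compensation: scale the right image so both images have
    // roughly the same mean intensity inside the overlap region.
    const cv::Scalar mL = cv::mean(left(overlap));
    const cv::Scalar mR = cv::mean(right(overlap));
    const double meanL = mL[0] + mL[1] + mL[2];
    const double meanR = mR[0] + mR[1] + mR[2];
    right.convertTo(right, -1, meanL / std::max(meanR, 1.0), 0);

    // Weighted blending: fade linearly from the left image to the right image
    // across the overlap, column by column (simple but not optimised).
    for (int x = 0; x < overlap.width; ++x)
    {
        const double alpha = 1.0 - static_cast<double>(x) / overlap.width;
        const cv::Rect col(overlap.x + x, overlap.y, 1, overlap.height);
        cv::Mat leftCol  = left(col);   // ROI headers share data with 'left'/'right'
        cv::Mat rightCol = right(col);
        cv::addWeighted(leftCol, alpha, rightCol, 1.0 - alpha, 0.0, leftCol);
    }
}
```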

11.2 Human detection

As seen in the results, the performance of the system drastically decreases when human detection is applied on all four cameras simultaneously. Hence, it could be suitable to implement an adaptive human detection where the detection is enabled on the cameras connected to the vehicle’s direction of travel.
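A sketch of such gating logic is shown below. The enums, the gear and steering signals, and the 15-degree threshold are hypothetical and would have to be mapped to the actual vehicle interface.

```cpp
// Sketch of adaptive detection: only camera threads facing the direction of
// travel run the expensive human detection.
enum class Camera { Front, Rear, Left, Right };
enum class Gear   { Forward, Reverse, Neutral };

bool detectionEnabled(Camera cam, Gear gear, double steeringAngleDeg)
{
    switch (gear)
    {
    case Gear::Forward:
        // Always watch ahead; add a side camera when turning sharply.
        return cam == Camera::Front ||
               (cam == Camera::Left  && steeringAngleDeg < -15.0) ||
               (cam == Camera::Right && steeringAngleDeg >  15.0);
    case Gear::Reverse:
        return cam == Camera::Rear;
    default:
        return false;   // standing still: detection can be paused entirely
    }
}
```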

As mentioned, human detection could not be applied to the final image due to heavy distortions. Implementing a detector trained on images of humans transformed to a top-view perspective would resolve this issue and decrease the computational workload.

To further improve the system, a more in-depth investigation of which human detection algorithms are the most suitable should be conducted.

There is a considerable difference between the auditory and visual reaction times of humans. The reaction time for auditory signals is approximately 40 ms shorter than for visual signals [42]. This means that at a speed of 40 km/h, the stopping distance would decrease by approximately 0.4 m if an auditory warning system were applied instead of a visual one. Therefore, it would be advantageous to implement an auditory warning system in addition to the visual one. The warning system could be further enhanced by using multiple audio sources to indicate from which direction the human approaches.
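As a quick check of the 0.4 m figure, the distance saved is the vehicle speed multiplied by the reaction-time difference:

\[
\Delta d = v \cdot \Delta t = \frac{40}{3.6}\ \text{m/s} \cdot 0.040\ \text{s} \approx 11.1\ \text{m/s} \cdot 0.040\ \text{s} \approx 0.44\ \text{m}
\]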

11.3 Environment

Since the system is based on digital Ethernet cameras, the view will be degraded when the surrounding light is limited. Hence, it would be suitable to equip the system with additional thermal imaging for vehicles permanently working in environments with little to no surrounding light, such as in the mining industry.


12 Conclusion

In this master thesis, a bird’s-eye view vision system for heavy vehicles with integrated human detection has successfully been developed.

The result is a bird’s-eye view of the vehicle, giving an overview of the entire vehicle without blind spots. The system can detect humans within 5 meters from the vehicle and visually alert the operator that a human is approaching.

The system was implemented and tested on two different screen computers. Due to internal system flaws in the V700, the system could not execute on its GPU. Therefore, the system was also tested on another computer, the X900. This computer allowed execution on the GPU, and improvements such as the hardware-accelerated OpenCV data type UMat could be implemented. As the results show, the X900 computer outperforms the V700, and the usage of the data type UMat decreases the execution time significantly.

RQ1: Which algorithms, resources and limitations are necessary to implement a bird’s-eye view system, including human detection, on the pre-defined embedded display computer?

Due to the restricted computational power available, the system’s workload needs to be taken into consideration. Therefore, the implemented algorithms were carefully selected, and limitations such as lowered resolution and frame rate of the cameras were applied.

RQ2: How can the computational workload be reduced without jeopardising the functionality of the system?

By processing frames with lower resolution, restricting the human detection to one camera, and executing a more significant part of the code on the GPU.

RQ3: How does the system affect the situational awareness regarding the reduction of visual blind spots?

The developed system positively affects the situational awareness since the system generates one complete overview of the vehicle, leaving no blind spots.

RQ4: How can the system be developed to operate and detect humans in real-time?

For the system to operate and detect humans in real-time on such a restricted platform, it is crucial to execute a significant part of the system on the GPU. To improve the functionality, a more in-depth investigation of which human detection algorithms are the most suitable for such an application should be conducted.


References

[1] P. J. Petrany, D. J. Husted, R. L. Sanchez and A. D. McNealy, System for improving operator visibility of machine surroundings, US Patent App. 14/833,226, Mar. 2017.

[2] Dödsolyckor efter orsak, 2011-2020, Accessed: 06-05-2021. [Online]. Available: https://www.av.se/globalassets/filer/statistik/dodsolyckor/arbetsmiljostatistik-dodsolyckor-i-arbetet-orsak-2011-2020.pdf.

[3] Analys av dödsolyckor 2018 och första halvåret 2019, Accessed: 06-05-2021. [Online]. Available: https://www.av.se/globalassets/filer/publikationer/rapporter/analys-av-dodsolyckor-2018-och-forsta-halvan-av-2019-2019-037768.pdf.

[4] L. A. Smith, Third eye tractor trailer blind side driving system, US Patent 10,596,965, Mar. 2020.

[5] R. Sanchez and P. Petrany, Vision system and method of monitoring surroundings of machine, US Patent 9,667,875, May 2017.

[6] G. Rathi, H. Faraji, N. Gupta, C. Traub, M. Schaffner and G. Pflug, Multi-camera dynamic top view vision system, US Patent 10,179,543, Jan. 2019.

[7] Y.-C. Liu, K.-Y. Lin and Y.-S. Chen, ‘Bird’s-eye view vision system for vehicle surrounding monitoring,’ in Robot Vision, G. Sommer and R. Klette, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 207–218, isbn: 978-3-540-78157-8.

[8] A. P. Miró, ‘Real-time image stitching for automotive 360º vision systems,’ 2014.

[9] L. Deng, M. Yang, H. Li, T. Li, B. Hu and C. Wang, ‘Restricted deformable convolution-based road scene semantic segmentation using surround view cameras,’ IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 10, pp. 4350–4362, 2019.

[10] N. Stojanovic, I. Grujic, J. Glisovic, O. I. Abdullah and S. Vasiljevic, ‘Application of new technologies to improve the visual field of heavy duty vehicles’ drivers,’ in New Technologies, Development and Application II, I. Karabegović, Ed., Cham: Springer International Publishing, 2020, pp. 411–421, isbn: 978-3-030-18072-0.

[11] H. Horenstein, Black and white photography: a basic manual. Little, Brown, 1983.

[12] R. Kingslake, A history of the photographic lens. Academic Press, 2005, pp. 145–148.

[13] ‘Image warping,’ in Encyclopedia of Biometrics, S. Z. Li and A. Jain, Eds. Boston, MA: Springer US, 2009, pp. 730–730, isbn: 978-0-387-73003-5. doi: 10.1007/978-0-387-73003-5_878. [Online]. Available: https://doi.org/10.1007/978-0-387-73003-5_878.

[14] J.-C. Pinoli, Mathematical Foundations of Image Processing and Analysis, Volume 2. John Wiley & Sons, 2014, vol. 2, p. 16.

[15] C. A. Glasbey and K. V. Mardia, ‘A review of image-warping methods,’ Journal of Applied Statistics, vol. 25, no. 2, pp. 155–171, 1998. doi: 10.1080/02664769823151. [Online]. Available: https://doi.org/10.1080/02664769823151.

[16] G. R. Bradski, Learning OpenCV: computer vision with the OpenCV Library, eng, First edition. Beijing; O’Reilly Media, isbn: 0-596-15602-2.

[17] Z. Zhang, ‘Camera parameters (intrinsic, extrinsic),’ in Computer Vision: A Reference Guide, K. Ikeuchi, Ed. Boston, MA: Springer US, 2014, pp. 81–85, isbn: 978-0-387-31439-6. doi: 10.1007/978-0-387-31439-6_152. [Online]. Available: https://doi.org/10.1007/978-0-387-31439-6_152.

[18] Handbook of image and video processing, eng, 2nd ed., ser. Communications, Networking and Multimedia. Amsterdam; Elsevier Academic Press, 2005, isbn: 1-281-11183-X.

[19] D. M. Escrivá, OpenCV 4 computer vision application programming cookbook: build complex computer vision applications with OpenCV and C++, eng, Fourth edition. Birmingham; Packt Publishing Ltd, isbn: 1-78934-528-6.

[20] H. Zhang, J. Han, H. Jia and Y. Zhang, ‘Features extraction and matching of binocular image based on sift algorithm,’ in 2018 International Conference on Intelligent Transportation, Big Data Smart City (ICITBS), 2018, pp. 665–668. doi: 10.1109/ICITBS.2018.00173.

