
DEGREE PROJECT IN TECHNOLOGY, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

A tracking framework

for a dynamic

non-stationary

environment

KTH Thesis Report

Sebastian Ståhl

KTH ROYAL INSTITUTE OF TECHNOLOGY


Ett spårningsramverk

för en dynamisk

icke-stationär miljö

Author

Sebastian Ståhl
sstah@kth.se
School of Electrical Engineering and Computer Science
KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden
Airpelago AB

Examiner

Pawel Herman
Division of Computational Science and Technology
KTH Royal Institute of Technology

Supervisors

Arvind Kumar
Division of Computational Science and Technology
KTH Royal Institute of Technology

Tobias Fridén
Co-Founder
Airpelago AB


Abstract

As the use of unmanned aerial vehicles (UAVs) increases in popularity across the globe, their fields of application are constantly growing. This thesis researches the possibility of using a UAV to detect, track, and geolocate a target in a dynamic non-stationary environment such as the sea. In this setting, the varying projection and apparent size of the target in the captured images can lead to ambiguous assignments of coordinates. In this thesis, a framework based on a UAV, a monocular camera, a GPS receiver, and the UAV's inertial measurement unit (IMU) is developed to perform the task of detecting, tracking and geolocating targets. An object detection model called Yolov3 was retrained to be able to detect boats in UAV footage. This model was selected due to its ability to detect targets of small apparent size and its performance in terms of speed. A model called the kernelized correlation filter (KCF) is adopted as the visual tracking algorithm.

This tracker is selected because of its performance in terms of speed and accuracy. A reinitialization of the tracker in combination with a periodic update of the tracked bounding box is implemented, which results in improved performance of the tracker. A geolocation method is developed to continuously estimate the GPS coordinates of the target. These estimates will be used by the flight control method already developed by the stakeholder Airpelago to control the UAV. The experimental results are promising for all models. Due to inaccurate data, the true accuracy of the geolocation method cannot be determined. The average error calculated with the inaccurate data is 19.5 meters. However, an in-depth analysis of the results indicates that the method is more accurate than this figure suggests. Hence, it is assumed that the model can estimate the GPS coordinates of a target with an error significantly lower than 19.5 meters. Thus, it is concluded that it is possible to detect, track and geolocate a target in a dynamic non-stationary environment such as the sea.

Keywords

Object detection, tracking, geolocation, UAV, Yolov3, KCF.


Sammanfattning

Användandet av drönare ökar i popularitet över hela världen vilket bidrar till att deras tillämpningsområden växer. I denna avhandling undersöks möjligheten att använda en drönare för att detektera, spåra och lokalisera ett mål i en dynamisk icke-stationär miljö som havet. Målets varierande position och storlek i bilderna kan leda till tvetydiga tilldelningar av koordinater. I denna avhandling utvecklas ett ramverk baserat på en drönare, en monokulär kamera, en GPS-mottagare och drönarens IMU-sensor för att utföra detektering, spårning samt lokalisering av målet. En objektdetekteringsmodell vid namn Yolov3 tränades för att kunna detektera båtar i bilder tagna från en drönare. Denna modell valdes på grund av dess förmåga att upptäcka små mål och dess prestanda vad gäller hastighet. En modell vars förkortning är KCF används som den visuella spårningsalgoritmen. Denna algoritm valdes på grund av dess prestanda när det gäller hastighet och precision. En återinitialisering av spårningsalgoritmen i kombination med en periodisk uppdatering av den spårade avgränsningsrutan implementeras, vilket förbättrar spårarens prestanda. En lokaliseringsmetod utvecklas för att kontinuerligt uppskatta målets GPS-koordinater.

Dessa uppskattningar kommer att användas av en flygkontrollmetod som redan utvecklats av Airpelago för att styra drönaren. De experimentella resultaten är lovande för alla modeller. På grund av opålitlig data kan lokaliseringsmetodens precision inte fastställas med säkerhet. En djupgående analys av resultaten indikerar emellertid att metoden är mer exakt än det genomsnittliga felet beräknat med opålitliga data, som är 19.5 meter. Därför antas det att modellen kan uppskatta GPS-koordinaterna för ett mål med ett fel som är betydligt lägre än 19.5 meter. Således dras slutsatsen att det är möjligt att upptäcka, spåra och lokalisera ett mål i en dynamisk icke-stationär miljö som havet.

Nyckelord

Objektdetektering, spårning, lokalisering, drönare, Yolov3, KCF.


Acknowledgements

I would like to thank Pawel Herman, Arvind Kumar and Tobias Fridén for their guidance throughout this thesis.


Acronyms

UAV Unmanned Aerial Vehicle
IMU Inertial Measurement Unit
GPS Global Positioning System
FPS Frames per second
RTK Real time kinematic
NED Reference Frame North East Down Reference Frame
ECEF Reference Frame Earth-Centered Earth-Fixed Reference Frame
KCF Kernelized Correlation Filter
IoU Intersection over Union
CNN Convolutional Neural Network
VM Virtual machine
GCP Google Cloud Platform


Contents

Abstract
Sammanfattning
Acknowledgements
Acronyms

1 Introduction
1.1 Background
1.2 Problem statement
1.3 Scope and aims
1.4 Stakeholders
1.5 Outline

2 Background
2.1 UAV background
2.2 Tracking of moving boats
2.2.1 Object detection
2.2.2 Tracking
2.2.3 Camera calibration
2.2.4 Geolocating target
2.2.5 Dynamic non-stationary environment

3 Method
3.1 Tracking system
3.2 Camera calibration
3.2.1 Homography between the model plane and its image
3.2.2 Estimate Homography
3.2.3 Constraints on the intrinsic parameters
3.2.4 Closed-form solution
3.2.5 Computation of intrinsic parameters
3.2.6 Maximum likelihood estimation
3.2.7 Radial distortion
3.2.8 Implementation
3.3 Object detection
3.3.1 Yolov3 architecture
3.3.2 Feature extractor
3.3.3 Bounding Box Prediction
3.3.4 Class Prediction
3.3.5 Prediction across scales
3.3.6 Implementation
3.4 Visual tracking algorithm
3.4.1 Linear regression
3.4.2 Cyclic shifts
3.4.3 Circulant matrices
3.4.4 Relationship to correlation filters
3.4.5 Non-Linear Regression
3.4.6 Kernel trick
3.4.7 Fast kernel regression
3.4.8 Fast detection
3.4.9 Fast kernel correlation
3.4.10 Radial basis function and Gaussian kernel
3.4.11 Multiple channels
3.4.12 Implementation
3.4.13 Evaluation
3.5 GPS coordinates estimation
3.5.1 Target's position in NED reference frame
3.5.2 Target's position in camera frame
3.5.3 Target distance estimation
3.5.4 The UAV's position expressed in the NED frame
3.5.5 Estimate GPS coordinates from bearing and distance
3.5.6 Haversine formula
3.5.7 Implementation
3.6 Kalman filter
3.6.1 Time update
3.6.2 Measurement update

4 Result
4.1 Camera calibration
4.2 Object detection
4.3 Tracking
4.4 GPS Estimation
4.4.1 Comparison of geolocation method's performance
4.5 Evaluation of GPS estimation results
4.6 Time perspective

5 Discussion
5.1 Comparison with state-of-the-art
5.2 Impact
5.3 Limitations
5.4 Weaknesses
5.5 Strengths
5.6 Benefits, Ethics and Sustainability

6 Conclusions
6.1 Future Work


Chapter 1

Introduction

1.1 Background

UAVs (Unmanned Aerial Vehicles) have been used in a wide range of applications in recent years. Agriculture, industrial purposes, military use, monitoring, rescue missions, and medical help are some areas where UAVs have been used [8] [14] [5] [40] [7]. UAVs have the advantage of high mobility, can operate safely, and are efficient at capturing scenes remotely. Today's UAVs have shortcomings that restrict their use, but with the rapid development of the market and the technology, many of these problems could soon be solved. The greatest disadvantage of UAVs is their battery life, which leaves them only a short time to complete their missions. Most UAVs are equipped with IMU sensors of low quality, which yields limited estimates from the IMU. Due to the limited load capability of a UAV, the processing power that can be carried on board is also limited.

Although most UAVs are equipped with high-resolution cameras, the flying altitude of the UAVs means that the targets often have low resolution in the images. Noise in the form of sunlight and blurring can also degrade the image quality. Although UAVs are limited by these factors, they are extensively used for certain missions, and their field of use is constantly expanding.

Detection, tracking, and geolocation of targets are research areas where the use of UAVs is increasing. In most UAV applications, the detection, tracking and geolocation problem has been solved for a stationary environment which does not interact with the target explicitly. These tasks can be performed in different ways using different tools. Redding et al. [33] proposed a method for localizing a stationary target in world coordinates from UAV images. The authors use a small fixed-wing UAV, the target's pixel location in an image, the UAV position, the UAV attitude and the camera pose angles in their presented method. The experimental results show that they could localize a target with an estimation distance error of 11 meters. Conte et al. [6] presented a method for geolocating a target by using a Micro Aerial Vehicle (MAV) in a stationary environment. The authors calculated the geolocation of the target by comparing imagery from the MAV with geo-referenced satellite imagery.

The authors could estimate the location of the target with an accuracy of 2.3 meters.

Hosseinpoor et al. [18] presented a precise process for tracking and target geolocation using RTK GPS and thermal video acquired by a UAV in a stationary environment. The authors could estimate the geolocation of a target with an error smaller than a meter for the altitudes 60 meters and 120 meters with the use of RTK GPS. With the use of SPP GPS, however, the location error is 19.4 meters and 22.3 meters for the 60 meter and 120 meter altitudes respectively. Zhao et al. [48] developed a complete framework for detection, tracking, and geolocation of a moving vehicle in an urban environment from a UAV using a monocular camera. The authors could estimate the geolocation of the target with an error of less than 5 meters.

However, for a dynamic non-stationary environment like the sea, this is not the case. Tracking in a dynamic non-stationary environment is challenging for several reasons. Firstly, the dynamic non-stationary environment can itself add to the movement of the object that is being tracked. Secondly, the sea has features like waves and water reflections which can interfere with the target and lead to errors in tracking and detection. Finally, the solution needs to be power efficient as the purpose is to use the solution on real-time data.

To be able to track a target the GPS estimations need to be accurate. How accurate the GPS estimations need to be is assumed to vary between different systems. The altitude of the UAV is one parameter that heavily affects this. The higher the UAV flies, the bigger the field of view. Thus, a higher altitude allows for a bigger margin of error. However, as the altitude grows, the task of detecting the target also becomes more difficult. Airpelago, a stakeholder that provided data and technical support, has a fully developed flight control function which allows users to maneuver the drone by choosing which coordinates the drone will fly to. Integration with this feature is not included in this thesis. How precise the GPS estimates must be for the UAV to successfully follow the boat without losing it out of sight is thus not something that can be researched here. The focus is therefore on developing models for detection, tracking and geolocation, and on comparing the accuracy of the GPS estimations with similar systems in a non-dynamic, stationary environment.
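To make the relationship between altitude and field of view concrete, a rough footprint calculation is sketched below. This is only an illustration, not part of the thesis framework, and the field-of-view angle is an assumed example value.

import math

def ground_footprint_width(altitude_m, horizontal_fov_deg):
    # For a camera pointing straight down, the width of the imaged ground strip
    # grows linearly with altitude: width = 2 * h * tan(FOV / 2).
    return 2.0 * altitude_m * math.tan(math.radians(horizontal_fov_deg) / 2.0)

# Assumed example: a 70-degree horizontal field of view.
for h in (45, 60, 120):
    print(f"altitude {h:>3} m -> footprint {ground_footprint_width(h, 70.0):6.1f} m")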


1.2 Problem statement

This thesis project aims to develop computationally efficient strategies to identify, track, and geolocate moving objects at sea. By combining these tasks, a boat tracking system can be developed for UAVs. If the system successfully manages to track boats at sea, it should also be possible to track other objects in dynamic non-stationary environments. Thus, it can be adapted to different applications in a dynamic non-stationary environment.

Therefore, the research question in this thesis is: What are the major sources of error in tracking of targets in a dynamic non-stationary environment?

1.3 Scope and aims

The objective of the thesis is to develop models for the tasks of identifying, tracking and geolocating targets in a dynamic non-stationary environment and to highlight the difficulties that arose in the development of these models. The thesis aims to research whether the difficulties associated with the dynamic environment can be overcome and thereby create effective models for detection, tracking and geolocation. Combined, these models constitute a framework for tracking targets. Thus, it can be deployed in real-world settings and solve practical problems like tracking and locating boats in open seas.

This thesis delivers a camera calibration method to extract the camera's intrinsic matrix, an object detection model that can identify boats in UAV footage, a tracking algorithm that performs well in fluctuating mediums such as the sea, and a mathematical model that can estimate the GPS coordinates of moving targets. These deliverables together create a system that allows a UAV to track a boat. The deliverables need to be computationally efficient to be able to perform the identification, tracking, and geolocation in real time.

The data collection is delimited to Swedish seas. This, however, does not limit the use of the system in other seas. The focus of the thesis is to develop efficient algorithms, not to implement them in the stakeholder's system.

1.4 Stakeholders

This project is provided by Airpelago AB. Airpelago is a start-up company based in Gothenburg and Linköping. They develop a cloud-based platform to monitor and control connected drones. Airpelago, together with Ericsson, has funding from Vinnova to develop such a service. The service is developed partly based on Ericsson's need for a demonstration platform for their drone-related services, and partly based on Sjöräddningssällskapet's vision of using drones for rescue missions. The main goal of Airpelago's project is to create a modular and adaptable system for tracking moving objects using drones while meeting the diverse needs of different industries. They are currently building the core and basic functionality of the system. Eventually, they imagine that this functionality could be enhanced with the help of add-ons developed both internally and by third-party developers.

1.5 Outline

In the second chapter, a detailed background including previous studies in the fields of camera calibration, object detection, object tracking, and geolocation is presented. The third chapter presents the methodology used and how it was implemented in the thesis. The results are presented in the fourth chapter. The fifth chapter presents the discussion. The final chapter is the conclusion.


Chapter 2

Background

This chapter gives the background to how UAVs work, their coordinate systems and how to transform between them. Further, the chapter gives a background to object detection, tracking, camera calibration and geolocation of a target.

2.1 UAV background

UAV is a collective name for motorized aircraft without a pilot on board that can fly autonomously or be remotely controlled. UAVs are available in all sizes, from aircraft of hundreds of grams that are launched by being thrown by hand to vessels of thousands of kilos that take off and land like regular aircraft. Furthermore, the shape of UAVs varies. The fixed-wing is a type of UAV whose shape resembles that of an airplane. The quadcopter is another type of UAV, which resembles a helicopter in shape; as the name suggests, the quadcopter has four rotors. These two types are the most common types of UAVs. Commonly, UAVs are equipped with a camera under their body. The camera is attached to a three-axis gimbal whose function is to stabilize the footage captured by the camera. The orientation of the aircraft and gimbal is known as its attitude, which is defined by the rotation around the pitch, roll and yaw axes in the body coordinate system.

The body coordinate system is relative to the aircraft itself. Three perpendicular axes are defined such that the origin is the center of mass, the X-axis is directed through the front of the aircraft and the Y-axis through the right of the aircraft. Using the right-hand rule, the Z-axis is then directed through the bottom of the aircraft. To solve the task of geolocating a target, several reference frames are used. The camera reference frame is defined with the same axes as the body coordinate system. The camera reference frame is illustrated in figure 2.1.1.

Figure 2.1.1: Illustration of the camera reference frame. Image reference¹.

The North-East-Down (NED) reference frame is selected as the world coordinate system. NED is defined with the x-axis aligned with north, the y-axis aligned with east and the z-axis perpendicular to the x and y axes, directed straight down. The NED reference frame is illustrated in figure 2.1.2.

Figure 2.1.2: Illustration of the NED reference frame. Image reference¹.

The geodetic reference frame is defined by the angles ϕ and λ and the height h. ϕ corresponds to the latitude and λ corresponds to the longitude. In this thesis, the latitude and longitude of the targets, which are geodetic coordinates, are estimated.

¹ Body Coordinate System. DJI, Dec. 2016. URL: https://developer.dji.com/mobile-sdk/documentation/introduction/flightController_concepts.html

Figure 2.1.3: Illustration of the geodetic reference frame. Image reference [38]

Rotation matrices are used to transform coordinates from one reference frame into another.

The rotation from the camera frame to the NED frame consists of three rotations. The first rotation is $R_x$, the roll around the x-axis. The second rotation is $R_y$, the pitch around the y-axis. Finally, $R_z$ rotates the yaw, which is around the z-axis. Together these three rotation matrices create the rotation matrix $R^{NED}_{C}$, which transforms a point's coordinates expressed in the camera frame to the same point expressed in the NED reference frame. $\phi$, $\gamma$ and $\theta$ denote the yaw, pitch and roll respectively.

$$R^{NED}_{C} = R_z R_y R_x =
\begin{bmatrix} \cos\phi & -\sin\phi & 0 \\ \sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos\gamma & 0 & \sin\gamma \\ 0 & 1 & 0 \\ -\sin\gamma & 0 & \cos\gamma \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} \qquad (2.1)$$

The rotation from the NED frame to the camera frame is given by:

$$R^{C}_{NED} = \left(R^{NED}_{C}\right)^{-1} = \left(R^{NED}_{C}\right)^{T} \qquad (2.2)$$
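A minimal Python sketch of equations 2.1 and 2.2, using NumPy; the angles and the camera-frame point below are made-up example values.

import numpy as np

def rotation_camera_to_ned(yaw, pitch, roll):
    # R^NED_C = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians (eq. 2.1).
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

# Example: express a camera-frame point in the NED frame and back again.
R_ned_c = rotation_camera_to_ned(np.radians(30), np.radians(-10), np.radians(5))
p_camera = np.array([1.0, 0.0, 10.0])
p_ned = R_ned_c @ p_camera
p_back = R_ned_c.T @ p_ned      # eq. 2.2: the inverse rotation is the transpose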


Figure 2.1.4: Illustration of the three rotations. Image reference².

To infer the position of the target, the geolocation method uses data collected by the UAV. A UAV collects a large amount of data during its flights. The time, latitude, longitude, altitude, pitch, roll, yaw (compass heading), gimbal pitch, gimbal heading and gimbal yaw are the parameters that are most useful to the geolocation estimation. Airpelago has developed a fixed-wing UAV that will be used for rescue missions. Further, they have developed a flight control algorithm which flies the UAV to specified coordinates. The idea is to estimate the boat's GPS coordinates, which should be the input to the flight control algorithm. In this way, the UAV will fly along the trajectory of the boat and thereby track it.

The developed fixed-wing UAV is equipped with a camera on a gimbal, a thermal camera, a GPS receiver, and an IMU. Through these instruments, the UAV's GPS coordinates and attitude can be obtained. However, in this thesis UAVs from DJI are used. There are two main differences between the fixed-wing UAV and the DJI UAVs. Firstly, the DJI UAVs do not possess a thermal camera. Secondly, the DJI UAVs are quadcopters and not fixed-wing. Although the used UAVs are quadcopters and not fixed-wing, the same algorithms can be used. The focus of this thesis is therefore to detect, track, and geolocate moving targets at sea using a monocular camera, a GPS receiver, and an IMU. The use of these instruments raises several challenges. Since the UAV is supposed to fly at an altitude of 45 meters above sea level, the targets become small in the data recorded by the monocular camera. The dynamic non-stationary environment creates a complicated background that makes it harder to detect and track the moving targets. Furthermore, the accuracy of the sensors on most UAVs is low. This contributes to a larger margin of error in the geolocation algorithms. More accurate equipment such as a binocular camera or a laser rangefinder is not an option for this thesis because these instruments are available neither on the DJI UAVs nor on the UAV developed by Airpelago. Finally, the question of where the algorithms should run has not been answered yet. One option is to transmit all of the data to the cloud, where it is processed by the algorithms. Another option is to load hardware onto the UAVs so that all of the algorithms run on board. Whichever option is chosen, the performance of the algorithms needs to be high in terms of speed.

² Body Coordinate System. DJI, Dec. 2016. URL: https://developer.dji.com/mobile-sdk/documentation/introduction/flightController_concepts.html

2.2 Tracking of moving boats

The process of tracking moving boats in the open sea consists of three parts. The first part is to identify the target in the aerial video, which is called object detection. Secondly, the target is tracked visually in the aerial video, which is called tracking. The last part is to geolocate the target.

2.2.1 Object detection

Traditional motion detection methods such as optical flow and temporal difference have been widely used to detect small objects from a UAV. Optical flow allows for feature estimation, and the background subtraction method is used for the segmentation of moving objects [27]. These methods can handle dynamic backgrounds, small apparent object sizes, and moving cameras. The methods based on motion can, however, struggle with the detection of objects of certain types. The purpose of researching these traditional methods was to get a wider understanding of how the problem could be solved; I did not want to lock myself into an approach without properly researching what methods were available.

In recent years, deep learning methods have become increasingly popular as detection methods. The deep-learning methods can be divided into two branches: one-stage and two-stage methods. The popular two-stage methods are R-CNN [9] and fast R-CNN [10]. In these methods, the detection is divided into region proposal generation and classification. These methods were researched due to previous knowledge about their advantage of high detection accuracy. However, they sacrifice detection speed. Due to the lack of performance in terms of speed, these algorithms were not considered. The one-stage methods only perform the classification phase and are therefore faster. A single network is used to obtain the probabilities and position coordinates. The Single Shot MultiBox Detector (SSD) [26] is a one-stage method known for its speed, which is why it was considered. However, the SSD network performs worse than the two-stage methods in the detection of small objects.

The purpose of this thesis is to develop computationally efficient algorithms that can run in real time. However, the performance of the algorithms in terms of accuracy also needs to be high. Thus a fast and accurate detection algorithm is needed. You only look once (Yolo) [34] is a popular one-stage method. The Yolo method has been further developed over the years, resulting in Yolov2 [35] and Yolov3 [36]. Yolov3 has the advantage of higher performance in terms of accuracy and speed. Perhaps most importantly, Yolov3 is good at detecting small objects. The performance of Yolov3 in terms of precision is almost as good as the two-stage methods [20]. Redmon and Farhadi [36] compare the performance of Yolov3 with several methods on the COCO data set [25]. The results of this comparison can be seen in figure 2.2.1. Further, Zhao, Pu, Wang, Chen and Xu [48] compare the performance of several state-of-the-art object detection methods. The authors find that Yolov3 is the best method for the recognition of small objects in terms of speed and accuracy.

Figure 2.2.1: Illustration of different object detection methods. The methods are run on either an M40 or Titan X. Image reference [36].

Therefore, Yolov3 is chosen as the detection method in this thesis.
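As a rough illustration of how such a retrained Yolov3 network can be run, the sketch below uses OpenCV's DNN module. The file names, the input size and the thresholds are assumptions for the example, not the thesis's actual configuration.

import cv2
import numpy as np

# Assumed file names for a Yolov3 network retrained on boats.
net = cv2.dnn.readNetFromDarknet("yolov3-boats.cfg", "yolov3-boats.weights")
output_layers = net.getUnconnectedOutLayersNames()

def detect_boats(frame, conf_thresh=0.5, nms_thresh=0.4):
    # Returns [x, y, w, h] boxes for detections above the confidence threshold.
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, confidences = [], []
    for output in net.forward(output_layers):
        for det in output:              # det = [cx, cy, bw, bh, objectness, class scores...]
            confidence = float(det[4] * det[5:].max())
            if confidence > conf_thresh:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(confidence)
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thresh, nms_thresh)
    return [boxes[i] for i in np.array(keep).flatten()]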


2.2.2 Tracking

Visual tracking algorithms can be divided into two major categories, namely generative models and discriminative models. The particle filter [1] and mean-shift tracking are generative tracking algorithms. In these tracking algorithms, the region of the target is modeled for the current frame. The algorithm predicts the most similar region to the model as the target's pixel position. Hosseinpoor et al. [18] presented a precise process for tracking and target geolocation using real-time kinematic (RTK) GPS and thermal video acquired by a UAV. They iteratively determined the local minima of the distance measurement functions using the mean shift algorithm to track the target. Such generative methods are not suitable for UAVs, as argued by Zhao et al. [48]: "Generative methods are not appropriate for the video captured from a UAV using a visible light camera because they are susceptible to the complex background; they are only suitable for the cases where the pixel size of the object is relatively large and the object moves at a low speed". However, the use of a thermal camera allowed Hosseinpoor et al. [18] to successfully visually track their targets. Since a thermal camera was not available as a resource in this thesis, generative methods were not considered. Although a thermal camera was not available in this thesis, researching different approaches gave a better understanding of the tasks.

Discriminative model methods take the target and background regions as positive and negative samples, respectively. Machine learning methods are used to train the classifier. The classifier is used on the next frame to predict the optimal region. KCF [16] is a state-of-the-art tracker. This method can change the size of the predicted bounding box and thus adapt to the size of the target. The KCF tracker can track targets of small apparent sizes with complex backgrounds quickly and robustly. Rani et al. [32] presented a novel method where the tracking was based on KCF and enhanced by integrating KCF with a Kalman filter. The results between KCF and KCF with a Kalman filter did not differ much. Rani et al. [32] argue that their novel method KKCF outperforms the traditional KCF for outliers or failure cases, which are corrected through the Kalman filter. Yue et al. [45] proposed an improved KCF algorithm to precisely track maneuvering objects. The improvement is based on an adaptive threshold approach, the KCF method, and a Kalman filter. If the distance between the target in two consecutive frames is larger than a distance threshold, the Kalman filter algorithm is used to predict the location of the target. Yue et al. [45] found that their approach was effective in terms of tracking accuracy and real-time performance when tracking maneuvering objects. The KCF tracking algorithm is extensively used in research articles on tracking. The reason for its popularity is its high performance. The purpose of investigating several articles about KCF was that KCF was often combined with other methods. However, according to the results found by Rani et al. [32] and Yue et al. [45], this only affected outliers or cases where the target made a big visual jump between two frames. Since the drone was flying at a relatively high altitude, it was assumed that the target could not make large movements visually between frames. Thus, KCF was considered as a method for tracking.

However, trackers were researched further. Deep learning is becoming popular for tracking, especially within Multiple Object Tracking (MOT). Kapania et al. [19] implemented a framework for real-time MOT and used a tracking-by-detection approach where the tracking was performed with the Deep SORT algorithm and the detection was done with a combination of Yolov3 and RetinaNet. The detection module was executed on each frame. The framework was tested on the VisDrone 2018 data set and showed competitive performance compared to existing trackers. Yang et al. [44] proposed an MOT algorithm based on dense trajectory voting in aerial footage. The authors built a new data set containing a training set and a diverse test set. The authors used the data set to train a neural network by using deep-learning methods. The neural network could detect vehicles in aerial footage. Yang et al. [44] calculated the dense optical flow in adjacent frames. They also generated effective dense optical flow trajectories in each detection bounding box at the current time.

Although this thesis focuses on single object tracking, this research provided a broader knowledge of the task of tracking. Deep SORT was not widely researched, which is why it was not considered. In this thesis, the tracker should be fast enough to perform tracking in real time, hence KCF was adopted as the visual tracking algorithm. Since the performance between KCF and KCF with a Kalman filter does not differ much, only the KCF tracker was adopted, to boost the performance in terms of speed.

2.2.3 Camera calibration

To be able to geolocate a target, the intrinsic parameters of the camera first need to be obtained. Zhang [47] proposed a novel method for camera calibration where the intrinsic camera parameters are extracted. The proposed method only requires the camera and images of a printed pattern on a planar surface. The method models radial lens distortion and consists of a closed-form solution followed by a nonlinear refinement based on the maximum likelihood criterion. The proposed method showed very good results on both simulated and real data [47]. This method was chosen because it is widely used and appears to have become the standard method for calibrating a camera.

Cameras are based on projection models. In order to understand how a camera can be calibrated, projection models are explained. The camera model describes a projection from 3D to 2D. The most fundamental camera model is the pinhole camera model, which is described in the following subsection.

Full pinhole camera

In the pinhole camera model, a 3D world point M is projected onto a 2D image point m in the image plane through a camera center C. The points M, m and C are collinear. The mapping from 3D to 2D coordinates described by a pinhole camera is called perspective projection. This is a projection of 3D points in space to 2D points, where both points are expressed in homogeneous form. To express a point in homogeneous form, a 1 is added as the last element. A point expressed in homogeneous form is indicated by a tilde above it. The projection model can be divided into the following three transformations.

1. Transformation between the world reference frame and the camera reference frame.

2. Transformation between the camera reference frame and the sensor reference frame (retinal plane).

3. Transformation between the sensor reference frame and the image reference frame.

Figure 2.2.2: Illustration of the full pinhole camera. In this image the camera reference frame, denoted with C, is different from the camera reference frame used in this thesis. Image reference³.

³ Camera Calibration and 3D Reconstruction. URL: https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#void%20projectPoints(InputArray%20objectPoints,%20InputArray%20rvec,%20InputArray%20tvec,%20InputArray%20cameraMatrix,%20InputArray%20distCoeffs,%20OutputArray%20imagePoints,%20OutputArray%20jacobian,%20double%20aspectRatio)

Transformation between the world reference frame and the camera reference frame.

This is a transformation between an arbitrarily chosen world reference frame, R_w, and the camera reference frame, R_C. The origin of the camera reference frame is located in the camera center C. This transformation is done through a rotation R and a translation t. The rotation R and translation t are called the extrinsic parameters. R is a 3x3 rotation matrix and t is a 3x1 translation vector.

$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \quad \text{and} \quad t = \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix} \qquad (2.3)$$

The transformation is defined as:

$$\begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} = \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = T \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2.4)$$

Together, the rotation matrix and the translation vector constitute a 4 × 4 matrix T, called the extrinsic matrix. The extrinsic matrix has 6 degrees of freedom: 3 for rotation and 3 for translation.

Transformation between the camera reference frame and the sensor reference frame (retinal plane).

This transformation binds the camera reference frame R_C to the sensor reference frame R_R, the retinal plane. This perspective projection is defined as:

$$s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} = P \begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix} \qquad (2.5)$$

where f is the focal length of the camera and s is an arbitrary scale factor.

Transformation between the sensor reference frame and the image reference frame.

This is a transformation between image coordinates [x, y]^T expressed in metric units and discrete image coordinates [u, v]^T expressed in pixels. The transformation is defined as:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & \gamma & u_0 \\ 0 & \beta & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = A \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (2.6)$$

α and β define the number of pixels per unit of length in the x and y directions of the sensor. If α = β, the pixels are square. u_0 and v_0 are the coordinates of the principal point in the image, the intersection between the principal axis and the image plane. γ is the skew parameter. The skew parameter is often considered negligible and is thus replaced with 0.

The full pinhole model

The three previous steps constitute the pinhole camera which is defined as:

$$\tilde{m} = \overbrace{A P}^{K} T \tilde{M} \qquad (2.7)$$

where

$$K = \begin{bmatrix} \alpha & 0 & u_0 \\ 0 & \beta & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 & 0 \\ 0 & f_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \qquad (2.8)$$

where f_x and f_y are the focal lengths of the camera in terms of pixel dimensions in the x and y directions. If the pixels are square, α equals β. u_0 and v_0 are the coordinates of the principal point on the image plane. K is called the intrinsic matrix of the camera and has five degrees of freedom. The intrinsic matrix is upper triangular. K is used to project the point onto the image plane using the intrinsic parameters. By projecting the point onto the image plane we can extract the location of the point in the image.

$$u = f_x \frac{r_{11} X + r_{12} Y + r_{13} Z + t_x}{r_{31} X + r_{32} Y + r_{33} Z + t_z} + c_x \qquad (2.9)$$

$$v = f_y \frac{r_{21} X + r_{22} Y + r_{23} Z + t_y}{r_{31} X + r_{32} Y + r_{33} Z + t_z} + c_y \qquad (2.10)$$
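A small Python sketch of the projection in equations 2.7-2.10; the focal lengths, principal point and world point are assumed example values.

import numpy as np

def project_point(M_world, K, R, t):
    # Pinhole projection of a 3D world point to pixel coordinates (eq. 2.7).
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t                 # extrinsic matrix (eq. 2.4)
    m = K @ T @ np.append(M_world, 1.0)        # homogeneous image point, scale s = m[2]
    return m[:2] / m[2]

fx, fy, u0, v0 = 1400.0, 1400.0, 960.0, 540.0  # assumed intrinsic parameters (pixels)
K = np.array([[fx, 0, u0, 0],
              [0, fy, v0, 0],
              [0,  0,  1, 0]], dtype=float)
R, t = np.eye(3), np.zeros(3)                  # camera aligned with the world frame
print(project_point(np.array([2.0, 1.0, 50.0]), K, R, t))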

2.2.4 Geolocating target

The process of geolocating a target from a UAV has been highly researched, resulting in many approaches. Different approaches were researched to create an understanding of what is needed, what these methods have in common and why different methods are used. Redding et al. [33] proposed a method for localizing a target in world coordinates from UAV images. They used the target's pixel position in the image together with the UAV attitude and IMU sensors. The localization method can localize a target with an accuracy of 11 m from the true position. Conte et al. [6] presented a method for geolocating a target by using a Micro Aerial Vehicle (MAV). The authors calculated the geolocation of the target by comparing imagery from the MAV with geo-referenced satellite imagery. This method does not require accurate sensors onboard the MAV. The results showed that the authors were able to locate a target with an accuracy of 2.3 meters. The MAV was flying at an altitude of 70 meters during the experiment. Nuske et al. [30] presented a geolocation method that can handle the low accuracy that originates from the low-cost sensors on UAVs. The presented solution is a geolocation filter with a discretized solution space that handles sampled nonlinear observations. The developed geolocation filter performs better compared to linearized methods. The geolocation filter was tested on stationary objects.

Hamidi et al. [15] used geo-referenced data to precisely 3D geolocate UAV images. The authors base their methodology on a database matching technique for refining the coarse initial attitude and position parameters of the camera derived from the navigation data.

Instead of just geolocating a target in the image like the previously presented studies, the authors try to geolocate the whole image. The authors use a rigorous collinearity model in a backward scheme to geolocate the entire frame. The authors also propose a forward geolocating procedure. The procedure is based on a ray-DSM intersection method and is used for cases where the ground location of specific image targets is required. The authors present a root mean square error of 14 meters in horizontal and 3D positions as their experimental result. Hosseinpoor et al. [18] presented a precise process for tracking and target geolocation using RTK GPS and thermal video acquired by a UAV. The authors iteratively found the local minima of the distance measurement functions using the mean shift algorithm to track the target. Traditional photogrammetric bundle adjustment equations were used for geolocating the target. The geolocation data was filtered using an extended Kalman filter. The Kalman filter provided smoothed geolocation and velocity estimates. Hosseinpoor et al. [18] used the accurate exterior parameters given by the UAV's IMU and the RTK GPS sensors, and the interior orientation parameters of the thermal camera from the preflight laboratory calibration process, in their geolocation method. In the results, the authors show how much more accurate the use of RTK GPS with an extended Kalman filter is than SPP GPS and RTK GPS without the Kalman filter. The authors can estimate the geolocation of a target with an error smaller than a meter for the altitudes 60 meters and 120 meters with the use of RTK GPS. With the use of SPP GPS, however, the location error is 19.4 meters and 22.3 meters for the 60 meter and 120 meter altitudes respectively. Wang et al. [42] proposed a UAV electro-optical stabilized imaging system for real-time multi-object localization. The authors used an object location model and calculated the geodetic coordinates for each target by using the homogeneous coordinate transformation. Wang et al. [42] proposed two methods for improving the accuracy of the multi-target localization. Method one is "the real-time zoom lens distortion correction method". Method two is "a recursive least square (RLS) filtering method based on UAV dead reckoning". Babinec et al. [2] researched the accuracy of object location estimation in low flight aerial imagery. The authors' location estimation method was based on homography mapping between two 2D planes. Babinec et al. [2] used images of objects with known coordinates to derive the two 2D planes. Božić-Štulić et al. [4] proposed a novel method based on convolutional neural networks to perform object detection and localization in aerial imagery. The first step in their method was to plan an optimal flight route for the UAV. From the imagery, they create a mosaic of the area of interest to obtain a larger field-of-view panoramic image.

Further, the authors create a geo-referenced map using the image mosaic. The image mosaic is also used for object detection, which uses convolutional neural networks. Zhang et al. [46] developed a novel method for target geolocation. The geolocation accuracy is enhanced by improving the estimation of the heading angle bias. This method includes trajectory planning, which helps to improve the heading angle bias. The authors use a particle swarm optimization algorithm to derive an expression for the optimal trajectory. The authors ignored the pitch and roll measurement errors and focused on the yaw angle bias.

The tracking was performed using a feature-tracking algorithm. Zhang et al. [46] employed the batch least-squares technique to estimate the yaw bias by taking visual measurements of a stationary ground object. Zhang et al. [46] found that the ground object of interest was more accurately geolocated with the use of trajectory planning. Zhao et al. [48] developed a complete framework for detection, tracking, and geolocation of a moving vehicle from a UAV using a monocular camera. The authors implemented YOLOv3 to detect small vehicles in the airborne video. Yolov3 was used to calculate the initial pixel positions of the car of interest. The authors compared the following models: SSD, Faster R-CNN, and YOLOv3; they found that YOLOv3 was the fastest and had the highest mAP. The KCF filter was used as the tracking method. This method allowed for fast tracking; each frame was processed in 10 milliseconds. The KCF filter calculated the pixel positions of the tracked object for each frame. The pixel position was used by their geolocation method to localize the target. The geolocation method was based on the projection model of the camera on the UAV. Zhao et al. [48] used the predicted locations to display the trajectory of the UAV. Finally, a flight control method based on the results from the KCF filter tracking was developed to make the UAV follow the target. The flight control method allowed the UAV to keep the target in the field of view, which made the KCF filter tracking and geolocation more accurate. The authors could estimate the geolocation of the target with an error of less than 5 m.

Some of these approaches used instruments that are not available for this thesis, e.g. a thermal camera and geo-referenced data. Several studies presented methods based on the UAV's low-cost sensors with accurate results [33] [18] [48]. Thus, in this thesis, a geolocation method was developed using the target's pixel position, a projection model, the UAV's attitude, and the IMU. Further, a Kalman filter was implemented to improve the estimation of the GPS coordinates.

2.2.5 Dynamic non-stationary environment

Although geolocation of a target has been done with good results, none of the presented studies focused on the sea environment. In fact, there has not been much research focusing on dynamic non-stationary environments like the sea. At sea, boats blend into the water more easily than, for example, colorful cars do against road surfaces. The task of identifying targets can therefore be more difficult. Furthermore, there are no clear paths defined at sea, which can make the task of estimating the position of the target more complex. Finally, the water is in motion most of the time, in the form of waves that create different patterns. Thus, the environment around a boat changes constantly in a different way than it does around, for example, a car.

However, Kwon et al. [21] proposed a method that maximizes the accuracy of target localization by minimizing the influence of sunlight reflections through planning an optimal path for the UAV. The path of the UAV cannot be determined in advance for the tracking task, since the UAV follows its target, so this approach is not considered. Leira et al. [24] used a thermal camera for automatic detection, classification, and tracking of objects on the ocean surface with a UAV. The thermal image was smoothed by convolving the image with a Gaussian kernel. Edges were detected by using the gradient of the image. A threshold value for the gradient magnitude was used to remove noise in the convolved image [21]. A connected component algorithm that groups and labels components together in blobs was used to filter out unwanted blobs in the image. The resulting image was called a binary image. Then the bounding boxes for the remaining objects were found. The center positions of the bounding boxes were calculated in the image frame and the world frame. The center positions were used in the tracking module. The object tracking was done with Kalman filters. The authors used Kalman filters to estimate and predict the position and velocity of the object of interest. They initialized one Kalman filter for each object of interest. A thermal camera is not available in this thesis, which is why this approach is not considered.

Chapter 3

Method

This chapter consists of six sections. The first section describes how the entire tracking framework is built up. The second section describes the method used for camera calibration. The third section describes how the Yolov3 network, which is used for object detection, works. The fourth section describes the KCF tracker. In the fifth section, the GPS estimation method is explained. Finally, the Kalman filter is explained.

3.1 Tracking system

To create a tracking system, the detection, tracking and geolocation modules were put together. The system starts by performing object detection on the first frame of the video stream. The bounding box of the boat of interest is sent as an initial bounding box to the tracking algorithm. The tracker visually tracks the boat in each frame. For the timestamps where the GPS logger has logged the true boat coordinates, the bounding box from the tracker is used as input by the GPS coordinate estimation algorithm. The algorithm then estimates the GPS coordinates of the boat by using the UAV attitude data together with the bounding box received from the tracker. The estimated GPS coordinates are supposed to be continuously transmitted to the flight control function, which will allow the UAV to automatically follow the boat.

System module composition

Figure 3.1.1: Illustration of the entire system.
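A minimal sketch of how the modules in figure 3.1.1 can be glued together with OpenCV (the KCF tracker requires the opencv-contrib package). The callables detect_boats, estimate_gps and read_uav_state stand in for the detection, geolocation and logging modules described in this chapter, the re-detection interval is an assumption, and the periodic reinitialization mentioned in the abstract is shown only in a simplified form.

import cv2

def run_tracking(video_path, detect_boats, estimate_gps, read_uav_state, redetect_every=30):
    # Detect a boat in the first frame, track it with KCF, and geolocate it each frame.
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise IOError("could not read video")

    box = tuple(detect_boats(frame)[0])          # initial bounding box from the detector
    tracker = cv2.TrackerKCF_create()
    tracker.init(frame, box)

    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_idx += 1

        ok, box = tracker.update(frame)
        # Periodic update / reinitialization of the tracked bounding box (simplified).
        if not ok or frame_idx % redetect_every == 0:
            detections = detect_boats(frame)
            if detections:
                box = tuple(detections[0])
                tracker = cv2.TrackerKCF_create()
                tracker.init(frame, box)

        lat, lon = estimate_gps(box, read_uav_state())   # estimate sent to flight control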

The hardware that is supposed to be used on the UAVs had not been decided yet. Hence, the time evaluation of the algorithms becomes unreliable. The algorithms were tested on a local machine with the following specifications:

Local machine hardware

Machine: MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.5 GHz Quad-Core Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: AMD Radeon R9 M370X 2 GB, Intel Iris Pro 1536 MB

3.2 Camera calibration

Camera calibration is the process of estimating the camera's intrinsic and extrinsic parameters. In this thesis, the camera calibration method presented by Zhang [47] is used.
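Zhang's method is also what OpenCV's calibrateCamera function implements, so in practice the intrinsic matrix can be obtained with a few calls, as sketched below. The checkerboard dimensions, square size and image location are assumptions for the example.

import glob
import cv2
import numpy as np

pattern_size = (9, 6)        # assumed number of inner chessboard corners
square_size = 0.025          # assumed square size in meters

# Model-plane points (Z = 0), identical for every calibration image.
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

object_points, image_points = [], []
for path in glob.glob("calibration/*.jpg"):      # assumed image location
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        object_points.append(objp)
        image_points.append(corners)

# Returns the intrinsic matrix and the distortion coefficients, among others.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, gray.shape[::-1], None, None)
print("reprojection error:", rms)
print("intrinsic matrix:\n", K)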

3.2.1 Homography between the model plane and its image

To calibrate the camera, we use a set of points lying on the same plane. This means that they share the same Z-value. If we choose points where Z = 0, the projective transformation becomes the following:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix} \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix} = A \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \qquad (3.1)$$

where s is an arbitrary scale. This is done without loss of generality. By transforming points on the same plane, the projective transformation becomes a 2D to 2D transformation. This transformation is invertible. This transformation can be written as

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = H \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \qquad (3.2)$$

where H is called the homography and is a 3 × 3 matrix defined up to a scale.

$$H = A \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \qquad (3.4)$$

3.2.2 Estimate Homography

To estimate the homography, a technique based on the maximum likelihood criterion is used.

Let m̃ be an image point and M̃ the corresponding point in world coordinates. In theory, they should satisfy formula 3.1, but in reality they usually fail to do so because of noise in the images. The maximum likelihood estimate of H is obtained by minimizing

$$\sum_i (m_i - \hat{m}_i)^T \Lambda_{m_i}^{-1} (m_i - \hat{m}_i) \qquad (3.5)$$

where

$$\hat{m}_i = \frac{1}{h_3^T M_i} \begin{bmatrix} h_1^T M_i \\ h_2^T M_i \end{bmatrix}, \quad \text{with } h_i \text{ the } i\text{-th row of } H. \qquad (3.6)$$

Λ_{m_i} is the covariance matrix, assuming that m_i is corrupted by Gaussian noise with mean 0. It can be assumed that Λ_{m_i} = σ²I for all i; this is reasonable if all of the points are extracted independently and with the same procedure [47]. With this assumption, the maximum likelihood estimation becomes a non-linear least-squares problem:

$$\min_H \sum_i \| m_i - \hat{m}_i \|^2 \qquad (3.7)$$

The Levenberg-Marquardt algorithm [28] is used to solve the non-linear minimization. The algorithm requires an initial guess. To obtain the initial guess, the following is performed.

Let x = [h_1^T, h_2^T, h_3^T]^T. The perspective projection formula can be rewritten using the algebraic distance, with

$$\| x \| = 1 \qquad (3.8)$$

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (3.9)$$

$$u = \frac{h_{11} x + h_{12} y + h_{13}}{h_{31} x + h_{32} y + h_{33}} \qquad (3.10)$$

$$v = \frac{h_{21} x + h_{22} y + h_{23}}{h_{31} x + h_{32} y + h_{33}} \qquad (3.11)$$

Multiplying by the denominator:

$$(h_{31} x + h_{32} y + h_{33})\,u = h_{11} x + h_{12} y + h_{13} \qquad (3.12)$$
$$(h_{31} x + h_{32} y + h_{33})\,v = h_{21} x + h_{22} y + h_{23} \qquad (3.13)$$

Rearranging:

$$h_{11} x + h_{12} y + h_{13} - h_{31} x u - h_{32} y u - h_{33} u = 0 \qquad (3.14)$$
$$h_{21} x + h_{22} y + h_{23} - h_{31} x v - h_{32} y v - h_{33} v = 0 \qquad (3.15)$$

Which can be written as:

$$\begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -ux & -uy & -u \\ 0 & 0 & 0 & x & y & 1 & -vx & -vy & -v \end{bmatrix} \begin{bmatrix} h_{11} \\ \vdots \\ h_{33} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \qquad (3.17)$$

Which can be written as:

$$\begin{bmatrix} \tilde{M}^T & 0^T & -u \tilde{M}^T \\ 0^T & \tilde{M}^T & -v \tilde{M}^T \end{bmatrix} x = 0 \qquad (3.18)$$

The equation above is for a single point. To solve for the homography, at least 4 points in the plane are needed, since the homography has 8 degrees of freedom. Given n points, the equation is given by:

$$\begin{bmatrix}
X_1 & Y_1 & 1 & 0 & 0 & 0 & -u_1 X_1 & -u_1 Y_1 & -u_1 \\
0 & 0 & 0 & X_1 & Y_1 & 1 & -v_1 X_1 & -v_1 Y_1 & -v_1 \\
\vdots & & & & & & & & \vdots \\
X_n & Y_n & 1 & 0 & 0 & 0 & -u_n X_n & -u_n Y_n & -u_n \\
0 & 0 & 0 & X_n & Y_n & 1 & -v_n X_n & -v_n Y_n & -v_n
\end{bmatrix}
\begin{bmatrix} h_{11} \\ \vdots \\ h_{33} \end{bmatrix} =
\begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} \qquad (3.19)$$

This equation can be rewritten as Lx = 0. L is a 2n × 9 matrix, x is a 9 × 1 vector and the 0 on the right-hand side is a 2n × 1 vector. x is defined up to a scale factor.

$$Lx = 0 \implies L^T L x = L^T 0 \implies L^T L x = 0 \qquad (3.20)$$

The solution x is given by the eigenvector of L^T L associated with the smallest eigenvalue. Now a homography for a given image with n points has been estimated.
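The derivation above translates almost directly into code. The sketch below builds the L matrix of equation 3.19 and extracts the eigenvector of L^T L with the smallest eigenvalue from the SVD of L; the point correspondences are assumed example values.

import numpy as np

def estimate_homography(plane_pts, image_pts):
    # Direct linear transform estimate of H from n >= 4 correspondences (eq. 3.19-3.20).
    rows = []
    for (X, Y), (u, v) in zip(plane_pts, image_pts):
        rows.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y, -u])
        rows.append([0, 0, 0, X, Y, 1, -v * X, -v * Y, -v])
    L = np.asarray(rows, dtype=float)
    # The eigenvector of L^T L with the smallest eigenvalue is the right singular
    # vector of L with the smallest singular value.
    _, _, Vt = np.linalg.svd(L)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                # fix the arbitrary scale

plane = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)                   # assumed plane points
image = np.array([[100, 120], [400, 130], [390, 420], [110, 410]], dtype=float)   # assumed image points
print(estimate_homography(plane, image))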


3.2.3 Constraints on the intrinsic parameters

Given an image of a plane, we can estimate the homography using a set of points on the plane.

$$\begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix} = \lambda A \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \qquad (3.21)$$

where λ is an arbitrary scalar. Two constraints can be applied to every homography. These can be applied since r_1 and r_2 are orthonormal. As mentioned previously, the homography has 8 degrees of freedom as it is estimated up to a scale. Since there are 3 extrinsic parameters for rotation and 3 for translation, only 2 constraints on the intrinsic parameters can be extracted from each homography.

$$h_1^T A^{-T} A^{-1} h_2 = 0 \qquad (3.22)$$
$$h_1^T A^{-T} A^{-1} h_1 = h_2^T A^{-T} A^{-1} h_2 \qquad (3.23)$$

How these constraints can be derived can be found in the section ”2.4 Geometric interpretation” in [47].

3.2.4 Closed-form solution

Let

$$A^{-T} A^{-1} = B = \begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{bmatrix} =
\begin{bmatrix}
\frac{1}{f_x^2} & -\frac{\gamma}{f_x^2 f_y} & \frac{v_0 \gamma - u_0 f_y}{f_x^2 f_y} \\
-\frac{\gamma}{f_x^2 f_y} & \frac{\gamma^2}{f_x^2 f_y^2} + \frac{1}{f_y^2} & -\frac{\gamma (v_0 \gamma - u_0 f_y)}{f_x^2 f_y^2} - \frac{v_0}{f_y^2} \\
\frac{v_0 \gamma - u_0 f_y}{f_x^2 f_y} & -\frac{\gamma (v_0 \gamma - u_0 f_y)}{f_x^2 f_y^2} - \frac{v_0}{f_y^2} & \frac{(v_0 \gamma - u_0 f_y)^2}{f_x^2 f_y^2} + \frac{v_0^2}{f_y^2} + 1
\end{bmatrix} \qquad (3.24)$$

B is a symmetric matrix, defined by the vector:

$$b = \begin{bmatrix} B_{11}, & B_{12}, & B_{22}, & B_{13}, & B_{23}, & B_{33} \end{bmatrix}^T \qquad (3.25)$$

Let h_i = [h_{i1}, h_{i2}, h_{i3}]^T be the i-th column of H. From the constraints,

$$h_i^T A^{-T} A^{-1} h_j = h_i^T B h_j = v_{ij}^T b \qquad (3.26)$$

where

$$v_{ij} = \begin{bmatrix} h_{i1} h_{j1}, \; h_{i1} h_{j2} + h_{i2} h_{j1}, \; h_{i2} h_{j2}, \; h_{i3} h_{j1} + h_{i1} h_{j3}, \; h_{i3} h_{j2} + h_{i2} h_{j3}, \; h_{i3} h_{j3} \end{bmatrix}^T \qquad (3.27)$$

From a given homography the two constraints can be rewritten as two homogeneous equations expressed in b.

$$\begin{bmatrix} v_{12}^T \\ (v_{11} - v_{22})^T \end{bmatrix} b = 0 \qquad (3.28)$$

Given n images of the plane, the n equations above are stacked. V is a 2n × 6 matrix, b is a 6 × 1 vector and 0 is a 2n × 1 vector:

V b = 0 (3.29)

In general, given n > 2, a unique solution b, defined up to a scale, can be extracted. This is because A has five intrinsic parameters. The solution is given by the eigenvector of V^T V associated with the smallest eigenvalue.

3.2.5 Computation of intrinsic parameters

The camera’s intrinsic parameters can be calculated from b. The B matrix is defined by b and is estimated up to a scale factor. λ is the arbitrary scale factor.

$$\lambda A^{-T} A^{-1} = B \qquad (3.30)$$
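A sketch of how equations 3.26-3.30 can be turned into code is given below. It recovers the intrinsic matrix A from B with a triangular (Cholesky-type) factorization, which in exact arithmetic is equivalent to the closed-form expressions that the thesis follows; the homographies passed in are assumed to come from the estimation step in section 3.2.2.

import numpy as np

def v_ij(H, i, j):
    # The 6-vector built from columns i and j of H (eq. 3.27), zero-based indices.
    hi, hj = H[:, i], H[:, j]
    return np.array([hi[0] * hj[0],
                     hi[0] * hj[1] + hi[1] * hj[0],
                     hi[1] * hj[1],
                     hi[2] * hj[0] + hi[0] * hj[2],
                     hi[2] * hj[1] + hi[1] * hj[2],
                     hi[2] * hj[2]])

def intrinsics_from_homographies(homographies):
    # Stack the two constraints per homography (eq. 3.28), solve V b = 0 (eq. 3.29),
    # and recover the intrinsic matrix A from B (eq. 3.30).
    V = []
    for H in homographies:
        V.append(v_ij(H, 0, 1))                     # eq. 3.22
        V.append(v_ij(H, 0, 0) - v_ij(H, 1, 1))     # eq. 3.23
    V = np.asarray(V)

    _, _, Vt = np.linalg.svd(V)                     # eigenvector of V^T V, smallest eigenvalue
    B11, B12, B22, B13, B23, B33 = Vt[-1]
    B = np.array([[B11, B12, B13],
                  [B12, B22, B23],
                  [B13, B23, B33]])
    if B[0, 0] < 0:                                 # fix the sign ambiguity of the eigenvector
        B = -B

    # B is proportional to A^{-T} A^{-1}, so an upper-triangular factor U with
    # B = U^T U is proportional to A^{-1}.
    U = np.linalg.cholesky(B).T
    A = np.linalg.inv(U)
    return A / A[2, 2]                              # normalize so that A[2, 2] = 1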
