
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Machine Learning for Inferring Sidescan Images from Bathymetry and AUV Pose

ZITAO ZHANG

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Machine Learning

Date: June 25, 2019

Supervisor: John Folkesson

Examiner: Patric Jensfelt

School of Electrical Engineering and Computer Science


Abstract

Underwater navigation has long been a major challenge for autonomous underwater vehicles (AUVs). It depends heavily on acoustic methods known as SONAR. Two kinds of sonar sensors are commonly used: multibeam sonar and sidescan sonar. Each has its own advantages and limitations. Substantial improvements could be made if a machine interpretation method were developed to translate between the data from these two sonars.

The objective of this thesis project is to find an effective way to translate seabed bathymetry (underwater depth) data from multibeam sonar into sidescan sonar images. In the project, we explored the feasibility of machine learning based translation methods and tried several generative models based on the idea of generative adversarial nets. This project is an experimental trial, and the method needs further improvement before it can be used in production. Nevertheless, the results show strong potential for machine learning based methods to handle this kind of translation task.


Summary

Navigation has long been a major challenge for autonomous underwater vehicles (AUVs). Typically, acoustic methods known as SONAR are used. There are two types of sonar sensors, multibeam sonar and sidescan sonar. Both have strengths and weaknesses. Substantial improvements can be achieved by translating between the data from these two sensors.

The purpose of this thesis project is to find an effective way to translate bathymetry data (underwater depth, from multibeam sonar) into sidescan sonar images. In the project, we investigated the feasibility of translation methods based on machine learning. Several generative models based on generative adversarial nets (GANs) were examined. This project can be seen as a pilot study. Further improvements are still required, but the results show strong potential for machine learning methods to handle this type of translation task.


Acknowledgement

This work was supported by Stiftelsen för Strategisk Forskning (SSF) through the Swedish Maritime Robotics Centre (SMaRC) (IRC-0046). We thank MMT Sweden AB for providing the data.


Chapter 1 Introduction

In scientific research and ocean exploration, autonomous underwater vehicles (AUVs) have the potential to revolutionize our access to places that human-operated vehicles can hardly reach [1], such as the deep ocean and the water beneath glaciers. Navigation is one of the primary challenges in AUV research today. The Global Positioning System (GPS) can provide accurate positioning information while the vehicle is on the surface, but GPS signals are not available underwater [2]. Electromagnetic energy cannot propagate appreciable distances in water except at very low frequencies, so acoustic methods, with their long propagation range, are among the most feasible means of detection and communication. These acoustic methods are known as SONAR, or "Sound Navigation And Ranging". In this project, we are interested in two different sonar sensors: multibeam sonar and sidescan sonar.

1.1 Autonomous Underwater Vehicles

Autonomous underwater vehicles (AUVs) are computer-controlled underwater surveying platforms. They are called "autonomous" because there is no physical connection (wired or wireless) between the vehicle and the human operator, who might be on shore or on a ship following the vehicle [3]. Early AUVs relied entirely on their onboard hardware (sensors) and software (algorithms) for underwater surveying [3]. Thanks to advances in acoustic technology, it is now possible to exchange small amounts of data with an AUV acoustically, for example to read its battery status and depth or to send simple commands such as "start" and "stop", but complex commands are still effectively out of reach [3]. This is what distinguishes AUVs from remotely controlled vehicles [3].


Figure 1.1: Example of AUV surveying with the help of sonar

1.2 Multibeam Sonar

Multibeam echo-sounder sonars were developed and became accessible in the late 1980s [4]. These systems work by transmitting multiple narrow sound pulses at a specific frequency through a transmitter and then receiving the acoustic backscatter through a receiver [4] [5] [6]. Each sound pulse returns the distance between the AUV and the point where the pulse hit the seabed. Multibeam sonars have a high standard of calibration and accuracy [4]. By applying acoustic mathematical models, complete 3D geometric bathymetry (depth) information of the seabed can be generated. Multibeam sonars are powerful tools for seabed mapping and navigation, but they are expensive and require a powerful sensor platform [4]. Moreover, because each sound pulse returns only a single distance value, the resolution is limited by the number of sound pulses that the sonar can send at the same time.

1.3 Sidescan Sonar

Sidescan sonars are a common choice for high-resolution seabed mapping. These systems can cover a large portion of the seabed away from the surveying vessel, from a few tens of meters to 60 km or more [4]. The instrument has one beam on each side of the vessel, broad in the vertical plane perpendicular to the tracking direction. When the sonar signal hits the seabed, it is reflected as echoes of different intensity. The sensor records the intensity of each reflection and the time at which the sonar "ping" returns to the sensor. From these measurements it generates a 2D image of echo intensity, with one axis along the track and the other along the return time. Sidescan sonars are more cost-effective than multibeam sonars and provide high-resolution sonar images. However, interpreting these sidescan sonar images is much more difficult than interpreting multibeam backscatter, because resolution is lacking in one of the three spatial dimensions. The interpretation of the sonar images is discussed in detail in Chapter 3 (Methods).

1.4 Problem Description

Substantial improvements could be made if sidescan sonar readings were used to aid navigation in real time rather than simply being logged, as is done now. This requires better machine interpretation of sonar, which is the main focus of this project. Traditional 3D reconstruction and interpretation methods are mostly based on acoustic ray tracing simulations and mathematical models [7][8]. These methods require accurate mathematical models and tend to generate idealized sonar images with little noise, because it is hard to simulate complex backscatter from non-uniform media and seabed materials. The complex mathematical models can also lead to a high computational cost, which might make them too slow for real-time interpretation. A faster and more accurate alternative is therefore needed. In this project, we focus on translating bathymetry data into sidescan images, which can be applied in many different scenarios such as AUV simulators.

1.5 Research Question

Can machine learning based methods such as regression and generative adversarial nets (GANs) provide an effective translation between bathymetry data and sidescan sonar images? Can these methods generate realistic sidescan sonar images, or can they generate sidescan sonar images that accurately correspond to the bathymetry data? What are the advantages and disadvantages of these methods? These questions are discussed in this project.

1.6 Ethics, Sustainability and Society

The oceans hold massive resources, including not only fossil fuels, minerals, and renewable energy such as tides and ocean currents, but also ecological resources such as deepwater ecosystems. However, deepwater areas are difficult for humans to visit, and human divers face great hazards in underwater operations. AUVs could be one solution, but so far these vehicles are difficult to operate and not capable of complicated tasks.

If a better AUV positioning solution can be found, these vehicles would have a much wider range of applications, such as surveying underwater areas, maintaining underwater devices, and monitoring underwater plants. This would allow AUVs to replace humans in high-risk tasks. If AUVs were more commonly used, underwater surveying would also become much more efficient, which could increase profits and reduce costs in many fields such as fishery, mining, and search and rescue.

1.7 Thesis Structure

Chapter 2 introduces the basic concepts of GANs together with several network models. These structures are used in the later chapters.

Chapter 3 describes the implementation details, including the dataset, the network implementation, and the experiment setup.

Chapter 4 presents the results of the experiments and their evaluation. It also covers the optimizations made and the observations drawn during the experiments.

Chapter 5 briefly summarizes our results and contributions and outlines expected future work.


Chapter 2 Related Work

2.1 Generative and Discriminative Model

Generative and discriminative models are important components of machine learning and computer vision. Generative models are trained to represent the distribution underlying the sample data, while discriminative models are trained to represent the conditional probability of the target given the data directly. Mathematically, given the input x (observation) and the label y (target), a generative model tries to learn the joint probability p(x, y), whereas a discriminative model tries to learn the posterior p(y|x).

In practice, generative models approach their asymptotic error with fewer training samples, whereas discriminative models usually reach a lower asymptotic error as the number of samples grows [9]. Discriminative models are therefore more commonly used for machine learning tasks such as classification and regression.

Images can be regarded as samples from a distribution in a high-dimensional space, so generative models can be used to generate images by sampling from the learned distribution. They have been applied in many different fields, such as semi-supervised learning [10], text-to-image synthesis [11], style transfer and texture synthesis [12], image-to-image translation [13][14], and video generation [15].

2.2 Generative Adversarial Nets (GANs)

Before the concept of adversarial nets was proposed in 2014, deep learning had already shown great potential for discriminative models, while deep generative models had made less progress because they were difficult to train [16]. The adversarial nets framework simulates a competition between two models: the generator and the discriminator [16].

The main idea of a GAN is the following. A real data point (or image) x can be considered a sample from a probability distribution in a high-dimensional space. The task of the generator, which is a neural network, is to approximate this distribution so that it can generate a sample x′ that is likely to be real. With the parameters of the generator fixed, the discriminator is trained to determine whether a sample comes from the generator distribution or from the real data distribution. The discriminator then provides a gradient of the objective function, which is used to update the generator so that the generator distribution moves closer to the real distribution [16].

In the original GAN, the generator is fed a low-dimensional random noise vector and trained to output a high-dimensional vector (such as an image). The discriminator is fed a high-dimensional vector and tries to determine whether this input comes from the model distribution or the data distribution, in other words, whether it is a generated "fake" image or a real image [16].

The advantage of a GAN over other generative models such as the variational autoencoder (VAE) is that the generator is never fed the real data directly. It is only trained to mimic the real distribution well enough to "fool" the discriminator. It therefore has a good capacity to generate data that are not among the given samples, whereas a VAE tends to reproduce blends of similar images from the training set, which may lead to blurry output. Another advantage concerns the loss: for a VAE, generated images are usually compared with real images by mean squared error (MSE). Consider two images of a handwritten character that have the same MSE with respect to the original, one with a slightly wider stroke and the other with noise points scattered across the image. From a human point of view the first is obviously better, but the network cannot tell the difference because both incur the same loss, so it has no incentive to generate more images like the first.
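The handwriting argument can be checked on a toy example. The sketch below is illustrative only: the 8 × 8 "character" and the chosen noise positions are made up, and it simply shows that a wider stroke and scattered noise pixels can yield exactly the same MSE.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of equal shape."""
    return float(np.mean((a - b) ** 2))

# Toy 8x8 "character": a vertical stroke of width 2 on a black background.
original = np.zeros((8, 8))
original[:, 3:5] = 1.0

# Variant A: the stroke is one pixel wider (8 changed pixels next to the stroke).
wider = original.copy()
wider[:, 5] = 1.0

# Variant B: 8 isolated noise pixels scattered over the background.
noisy = original.copy()
noise_positions = [(0, 0), (1, 7), (2, 1), (3, 6), (4, 0), (5, 7), (6, 1), (7, 6)]
for row, col in noise_positions:
    noisy[row, col] = 1.0

mse_wider = mse(original, wider)
mse_noisy = mse(original, noisy)
print(mse_wider, mse_noisy)
```

Both variants change eight pixels by the same amount, so a pixel-wise loss scores them identically even though they look very different.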

In practice, however, a GAN is difficult to optimize. For a traditional neural network, we can monitor the loss function to track and improve performance. In a GAN, the two networks must be kept well matched in their competition, ideally reaching a Nash equilibrium [17]. When the discriminator fails, this does not guarantee that the generator produces good images; the discriminator may simply be too weak to tell the difference. Conversely, if the discriminator is too strong, the gradient vanishes and can no longer help to train the generator. Several techniques have been proposed in recent years to improve performance and reduce these training difficulties [17]. Another issue with GANs is mode collapse. It is usually caused by low model capacity, which leads the generator to map different input vectors to the same output [18]. As a result, the generated images have low diversity.

Mathematical Theories and Algorithms

Given a dataset X, we want to learn the generator's distribution $p_g$ over data x. An input prior $p_z(z)$ is defined on a noise distribution, and the generator is expressed as a differentiable multilayer perceptron function $G(z; \theta_g)$ that outputs a sample x, where $\theta_g$ denotes the parameters of the function. The discriminator is defined by another function $D(x; \theta_d)$, which takes an input sample x and outputs a scalar: the probability that x comes from the real data distribution $p_{data}$ rather than from the generator's distribution $p_g$. We train D to maximize the accuracy of classifying each sample x into the correct distribution. At the same time, G is trained to minimize $\log(1 - D(G(z)))$.

The value function is defined as

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$

This minimax game has its global optimum at $p_g = p_{data}$ [16]. To reach the optimum, the generator G and the discriminator D are updated in turn. That is, for a fixed G, we apply stochastic gradient ascent to D with

$$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right] \tag{2.1}$$

for m given samples (the batch size). Then, for a fixed D, we apply stochastic gradient descent to G with

$$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \tag{2.2}$$

These gradient-based updates are repeated iteratively until the optimum is reached. The discriminator and the generator can be updated at different rates, for example five D updates followed by one G update [16].
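To make the value function concrete, the following minimal sketch estimates V(D, G) from batches of discriminator outputs. The function name `gan_value` and the example probabilities are illustrative assumptions, not part of the original implementation. Plugging D = 1/2 into the value function gives log(1/2) + log(1/2) = −log 4, the value at the optimum where the discriminator can no longer separate the two distributions.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for a batch of real samples.
    d_fake: D(G(z)) for a batch of generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident and correct discriminator pushes V towards 0, its upper bound.
confident = gan_value([0.99, 0.98], [0.01, 0.02])

# If the discriminator outputs 0.5 everywhere, V = -log(4).
at_optimum = gan_value([0.5, 0.5], [0.5, 0.5])
print(confident, at_optimum)
```

The discriminator update (Equation 2.1) tries to raise this value, while the generator update (Equation 2.2) tries to lower its second term.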


2.3 Deep Convolutional GAN (DCGAN)

Deep Convolutional GAN (DCGAN) is a kind of GAN with a deep convolutional structure [10]. It follows these architectural guidelines [10]:

• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions/deconvolution (generator).

• Use batch normalization in both the generator and the discriminator.

• Remove fully connected hidden layers for deeper architectures.

• Use ReLU activation in generator for all layers except for the output, which uses Tanh.

• Use LeakyReLU activation in the discriminator for all layers.

DCGAN is best seen as an empirical optimization; the authors did not give a detailed mathematical derivation. According to their results, a DCGAN trains more stably and learns good image representations for supervised learning and generative modeling [10].
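The first guideline, replacing pooling with strided and fractional-strided convolutions, can be illustrated by tracking the spatial size of the feature maps. The sketch below uses the standard output-size formulas for explicitly padded convolutions; the 4 × 4 kernel, stride 2, padding 1, and the 64-pixel input are assumed values chosen for illustration, and frameworks with other padding conventions may give slightly different sizes.

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial output size of a strided convolution (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out_size(n, kernel, stride, padding):
    """Spatial output size of a fractional-strided (transposed) convolution."""
    return (n - 1) * stride - 2 * padding + kernel

# Discriminator-style downsampling: each layer halves the size.
size = 64
down_sizes = []
for _ in range(4):
    size = conv_out_size(size, kernel=4, stride=2, padding=1)
    down_sizes.append(size)

# Generator-style upsampling: each layer doubles the size again.
up_sizes = []
for _ in range(4):
    size = deconv_out_size(size, kernel=4, stride=2, padding=1)
    up_sizes.append(size)

print(down_sizes, up_sizes)
```

With these settings the strided layers do the downsampling and upsampling themselves, so no pooling layer is needed.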

2.4 Wasserstein GAN (WGAN)

Wasserstein GAN (WGAN) was proposed to improve the training of GANs [19]. In the original GAN, the difference between the generator distribution and the real data distribution is measured by the Jensen-Shannon divergence (JS divergence) [16]. The goal is to bring the two distributions as close together as possible, that is, $JS(P_r, P_g) \to 0$ [19]. The issue is that, in practice, both distributions are low-dimensional manifolds in a high-dimensional space. In most cases the two distributions barely overlap when training starts, which means that the JS divergence stays at a constant value (log 2) as training goes on. A constant divergence provides no usable gradient, which makes it difficult to update the network.

The Earth-Mover (EM) distance, or Wasserstein-1 distance, was introduced to address this. The EM distance can be explained intuitively as the minimum cost, in terms of mass times distance, of moving "earth" from one "mound" (distribution) to another. It provides a more continuous and usable gradient than the JS divergence [19], which makes it a better choice for the loss function.
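The difference between the two measures can be seen on a small discrete example. The sketch below is a toy setup, assuming two point masses on a unit-spaced 1D grid: the JS divergence stays at log 2 however far apart the two masses are, while the EM distance grows with the shift and therefore still says in which direction to move. The 1D EM distance is computed from the difference of the cumulative distributions.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """EM (Wasserstein-1) distance on a unit-spaced 1D grid."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total

def point_mass(position, size=10):
    """Distribution with all probability mass on one grid cell."""
    return [1.0 if i == position else 0.0 for i in range(size)]

real = point_mass(0)
results = {}
for shift in (1, 4, 8):
    fake = point_mass(shift)
    results[shift] = (js_divergence(real, fake), wasserstein_1d(real, fake))
print(results)
```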

The EM distance requires the discriminator function to be Lipschitz continuous to ensure continuity and differentiability (see Equation 2.3). In [19], weight clipping is introduced as a simple way to enforce the Lipschitz constraint: by clamping the weights of the discriminator to a fixed range, all the functions it can represent are K-Lipschitz for some K [19]. However, according to [19], weight clipping is a terrible way to enforce the Lipschitz constraint. If the clipping parameter is too large, the discriminator takes a very long time to reach its optimum and the gradients may explode, while if the parameter is too small, the gradients can easily vanish in a deep network [19][20]. Choosing a suitable clipping parameter is therefore difficult.

Another approach, called gradient penalty (WGAN-GP), is introduced in [20]. This technique adds a penalty term to the loss function that keeps the gradient norm of the discriminator function close to 1. Samples $\hat{x}$ are first drawn by interpolating between points from $P_r$ and $P_g$; their distribution is denoted $P_{\hat{x}}$. The penalty term shown in Equation 2.6 is then computed on these samples. With this method, GAN training becomes more stable.

Mathematical Theories and Algorithms

The EM distance between two distributions $P_r$ and $P_g$ is defined as [19]

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right],$$

where $\gamma(x, y)$ indicates how much "mass" must be moved from x to y in order to transport the distribution $P_r$ to $P_g$, and $\Pi(P_r, P_g)$ is the set of all possible strategies for transporting $P_r$ to $P_g$.

This distance is difficult to compute, so in [19] it is converted to the form

$$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)] \tag{2.3}$$

by the Kantorovich-Rubinstein duality [21], where $\|f\|_L \le 1$ denotes that f is a 1-Lipschitz function.

It is quite simple to modify the loss of the original GAN into that of a WGAN. The discriminator update in Equation 2.1 becomes

$$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ f_{\theta_d}\left(x^{(i)}\right) - f_{\theta_d}\left(G\left(z^{(i)}\right)\right) \right] \tag{2.4}$$

where $f_{\theta_d}$ is the discriminator function under the 1-Lipschitz constraint. The generator update in Equation 2.2 becomes

$$-\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} f_{\theta_d}\left(G\left(z^{(i)}\right)\right). \tag{2.5}$$


The gradient penalty term is expressed as

$$\mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\left\|\nabla_{\hat{x}} D(\hat{x})\right\|_2 - 1\right)^2\right]. \tag{2.6}$$

This penalty term is added to the discriminator loss (Equation 2.4) with a penalty weight λ.
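As a rough illustration of Equation 2.6, the sketch below estimates the penalty for a toy critic. Everything here is an assumption made for the example: the critic is a hypothetical linear function D(x) = w · x, the gradient is obtained by finite differences rather than by automatic differentiation as a deep learning framework would do, and the sample points are arbitrary.

```python
import math
import random

def numerical_gradient(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def gradient_penalty(critic, real_batch, fake_batch, seed=0):
    """Estimate Equation 2.6 on points interpolated between real and fake samples."""
    rng = random.Random(seed)
    total = 0.0
    for x_real, x_fake in zip(real_batch, fake_batch):
        eps = rng.random()
        x_hat = [eps * xr + (1.0 - eps) * xf for xr, xf in zip(x_real, x_fake)]
        grad = numerical_gradient(critic, x_hat)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        total += (grad_norm - 1.0) ** 2
    return total / len(real_batch)

def make_linear_critic(w):
    """Hypothetical critic D(x) = w . x, whose gradient is w everywhere."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

real = [[1.0, 2.0], [0.5, -1.0]]
fake = [[0.0, 0.0], [2.0, 1.0]]
steep = gradient_penalty(make_linear_critic([3.0, 4.0]), real, fake)  # gradient norm 5
unit = gradient_penalty(make_linear_critic([0.6, 0.8]), real, fake)   # gradient norm 1
print(steep, unit)
```

The critic with gradient norm 5 is penalized by (5 − 1)² = 16, while the critic with unit gradient norm incurs essentially no penalty.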

2.5 Conditional GAN (cGAN)

Conditional GAN (cGAN) is a method for supervised learning. The generator is fed a random noise vector z and a condition c and generates an output image G(z, c). The discriminator is trained not only to tell real images from generated ones, but also to judge whether the image-condition pairs match. There are two possible approaches. One is to design two discriminators: $D_1(x)$ judges whether an image is real or fake, and $D_2(x, c)$ judges whether the pair matches, p(x|c). The other, more commonly used approach is to design a single discriminator D(x, c) for both tasks [14, 22].

The condition c of a cGAN can be any vector, including an image. A cGAN can therefore also be used for image-to-image translation tasks such as style transfer and image segmentation [14]. Suppose we have image pairs from two different domains (styles), X and Y. The GAN learns a mapping from an input image x and a random noise vector z to an image y, G : {x, z} → y. In [14], a single discriminator is designed to judge the image-condition pairs, taking images from both domains X and Y as input.

2.6 Encoder-decoder

The encoder-decoder network structure is an important generative model. It is used in different areas, such as machine translation [23] and image segmentation [24]. It contains two models. A recognition model, known as the "encoder", generates the latent vector z from input data x in a high-dimensional space X. A generative model, known as the "decoder", generates the target y in another high-dimensional space Y from z [25]. The main idea is that, for the high-dimensional input x, the encoder produces a highly compressed code z, called the latent vector, which is expected to contain the most important information in x. The decoder then maps these features to another high-dimensional data point y. The assumption is that x and y share the same features in the code z.


The encoder-decoder structure can be combined with adversarial training, as the GAN provides a powerful learning strategy for generative models on complex distributions [26]. Several networks use such an encoder-decoder structure, for example BiGAN [26] and SegNet [24]. However, encoder-decoders and encoder-decoder GAN architectures have some limitations. According to [27], the generators are expected to learn meaningful features from the encoder, but in practice the learning objective alone cannot guarantee this. Another problem is mode collapse [27]: the generators may fail to produce varied results.

2.7 U-net and Skip Connection

The U-net architecture was first proposed in 2015 for image segmentation [28]. Traditional deep convolutional networks are usually designed to have a single output for classification tasks. These networks can hardly handle image segmentation: predicting the labels pixel by pixel is time-consuming, and there is a trade-off between localization accuracy and the use of context, as a large patch window may lead to low local accuracy while a small patch cannot make full use of the context [28].

The structure of the U-net is shown in Figure 2.1. It is an encoder-decoder structure, but with long skip connections. The U-net consists of a "contracting" path and an "expansive" path [28], which correspond to the encoder and the decoder of an encoder-decoder structure. The contracting path uses a max pooling layer between stacks for downsampling, and the expansive path uses deconvolution (up-convolution) layers for upsampling [28]. The output of each block in the contracting path is concatenated to the input of the corresponding block in the expansive path (grey arrows in Figure 2.1). This concatenation is called a "skip connection". Skip connections help the network converge much faster and reach a better local optimum [29]. According to [29], the problem with the plain encoder-decoder structure is that the decoder cannot be guaranteed to recover small details with high-frequency information from a low-dimensional latent space. This becomes a bottleneck, especially when the network is deep [29]. Skip connections pass detailed image information directly from the contracting path to the expansive path, which helps to avoid losing these details.
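The concatenation itself is a simple operation on feature maps. The following sketch shows it with NumPy arrays of shape (height, width, channels); the random features, the array sizes, and the nearest-neighbour upsampling that stands in for a learned deconvolution are all assumptions made for illustration.

```python
import numpy as np

def downsample(x):
    """2x2 max pooling on an array of shape (H, W, C); H and W must be even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour 2x upsampling, a stand-in for a learned deconvolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def skip_connect(encoder_features, decoder_features):
    """Concatenate along the channel axis, as in the U-net skip connection."""
    return np.concatenate([encoder_features, decoder_features], axis=-1)

rng = np.random.default_rng(0)
encoder_out = rng.random((8, 8, 4))             # output of a contracting-path block
bottleneck = downsample(encoder_out)            # shape (4, 4, 4)
decoder_in = upsample(bottleneck)               # shape (8, 8, 4), fine detail lost
merged = skip_connect(encoder_out, decoder_in)  # shape (8, 8, 8)
print(bottleneck.shape, decoder_in.shape, merged.shape)
```

The first four channels of `merged` still hold the full-resolution encoder features, which is how the expansive path regains the detail that pooling removed.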


Figure 2.1: U-net architecture. Left part is the "contracting" path and right part is the "expansive" path. Image from [28]

Figure 2.2: Convolutional encoder-decoder networks with symmetric skip connections. Image from [29]

2.8 Residual Networks (ResNet)

Theoretically, deeper networks can fit more complex distributions, but they are also harder to train. In practice, as the depth of a network increases, the accuracy saturates and then drops rapidly. This is called "degradation" in [30]. To solve this problem, a deep residual learning framework is proposed in [30]. Instead of learning the desired mapping H(x) directly, another mapping F(x) := H(x) − x is introduced [30]. The residual mapping F(x) is easier to fit than H(x) [30].

To implement this framework, a shortcut connection is added across layers to build a residual block, shown in Figure 2.3. In experiments, a network containing multiple residual blocks performs much better than a plain network with the same number of layers but without shortcut connections [30].
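A residual block can be sketched in a few lines. The version below is a simplification under stated assumptions: it uses two fully connected layers instead of convolutions, omits batch normalization and the activation after the addition, and the weights are hypothetical. It shows why the identity mapping is easy to represent: when the residual branch outputs zero, the block returns its input unchanged.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute H(x) = F(x) + x with a two-layer residual branch F."""
    residual = w2 @ relu(w1 @ x)  # F(x)
    return residual + x           # the shortcut connection adds the input back

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))
identity_output = residual_block(x, zeros, zeros)

w1 = np.eye(3)
w2 = 0.1 * np.eye(3)
perturbed_output = residual_block(x, w1, w2)
print(identity_output, perturbed_output)
```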

Figure 2.3: A residual block building. Image from [30]

The residual framework can also be applied to GANs by replacing some of the layers with residual blocks. In [31], the ResNet structure is used in the generator, as shown in Figure 2.4. All pooling layers are removed and replaced with strided and fractional-strided convolution layers. The input images first pass through some convolution layers for downsampling, then through several residual blocks, and finally through the upsampling deconvolution layers.

Figure 2.4: The ResNet structure used in [31]. The input images are first passed to encoder blocks for downsampling, then to a series of residual blocks, and finally to decoder blocks for upsampling.


Chapter 3 Methods

3.1 Dataset

The first step in training a network is to obtain a good dataset. In this project, the data is provided by the Swedish Maritime Robotics Centre (SMaRC). SMaRC has collected sidescan and multibeam sonar data in pairs. The data were collected by AUVs surveying the same region of seabed with both sonars simultaneously, along tracks with different orientations. On each track, the AUV travels along a nearly straight line, sending sonar signals to both the left and right sides of the vehicle. The vehicle obtains one row of data from each sonar "ping" (a pulse signal) as it travels over the seabed (Figure 3.1). These rows can be assembled into a so-called "waterfall" image (Figure 3.3). With the help of auvlib [32], we can easily obtain these waterfall images, and from the multibeam sonar data we can calculate the 3D geometric information of the seabed bathymetry with the built-in APIs of auvlib.
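The idea of a waterfall image, one row per ping stacked in time order, can be shown with plain NumPy. This is only an illustration with made-up intensity values; it does not use or reproduce the auvlib API, and `assemble_waterfall` is a hypothetical helper.

```python
import numpy as np

def assemble_waterfall(ping_rows):
    """Stack per-ping intensity rows into a 2D waterfall image.

    ping_rows: iterable of equal-length 1D sequences, ordered by ping time.
    Returns an array of shape (number of pings, samples per ping).
    """
    return np.vstack([np.asarray(row, dtype=float) for row in ping_rows])

# Three hypothetical pings with four intensity samples each.
pings = [
    [0.0, 0.2, 1.5, 0.3],
    [0.0, 0.1, 1.7, 0.4],
    [0.0, 0.3, 1.2, 0.2],
]
waterfall = assemble_waterfall(pings)
print(waterfall.shape)
```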

In the sidescan sonar (SSS) image, the track of the AUV runs vertically through the middle of the image. The vertical axis represents the distance travelled along the track. The horizontal axis represents the time the sonar signal needs to travel through the water, hit an obstacle (the seabed), reflect back through the water, and reach the sonar receiver; it therefore indicates the distance from the AUV to the seabed. Each pixel represents the intensity of the received signal, which in the dataset varies from 0 to roughly 3. A bright pixel means that the signal hits the surface close to normal incidence, that the material has high reflectivity, or both. Similarly, a dark pixel means that the signal hits the surface at a grazing angle or that the material has low reflectivity. The dark area in the middle is due to the dead zone of the sonar and to the time the sonar signal needs to reach the nearest obstacle, during which the receiver cannot get any signal.

Figure 3.1: AUV gets one row of data with the help of sonar.

Figure 3.2: Mechanism of sidescan sonar. This is an example of the right-hand side of the AUV. The Intensity vs Time graph represents one row in the waterfall image. The intensity represents the brightness of each pixel. The blue arrow is a highlight, because the sonar ping hits the surface close to normal incidence. The red arrow is a shadow, because the sonar signal cannot reach the back of the rock, so there is no reflection between $d_3$ and $d_4$.
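Since the horizontal axis is a two-way travel time, it can be converted to a distance once a sound speed is chosen. The sketch below assumes a constant sound speed of 1500 m/s, a typical round figure for seawater; the real value varies with temperature, salinity, and depth, and the dataset's own sampling parameters are not used here.

```python
def slant_range(two_way_time_s, sound_speed=1500.0):
    """Distance from the sonar to the reflecting point.

    The signal travels to the seabed and back, so the one-way distance
    is half of the total path length.
    """
    return sound_speed * two_way_time_s / 2.0

# A return that arrives 40 ms after the ping was emitted.
example_range = slant_range(0.04)
print(example_range)
```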

The bathymetry image uses the same coordinates as the SSS image. Each pixel corresponds to the same pixel in the SSS image and represents the depth of the seabed, which in the dataset is roughly -20. A black pixel (with a value of 0) means that there is no data at that position.

In total, we have 57 waterfall image pairs.

(a) sidescan sonar image (b) seabed bathymetry image

Figure 3.3: Example of a waterfall image pair. The colors in the bathymetry image are for visualization only. In the dataset used for network training, the images are grey-scale.

Crop images into patches

To feed the paired data into the network, we need to split the waterfall images into small patches. The main idea is to crop each waterfall image with a square window that slides along each side. We crop the patches with overlapping areas to obtain more images. For adjacent patches, we flip the image upside down so that they are not too similar. The right side of the waterfall is cropped in the same way, and all right-side patches are then flipped left to right, so that every patch looks as if it came from the left side of a waterfall image. Flipping is valid because the sonar system collects data row by row, separately for the left and the right side of the AUV. If the AUV surveys the same area forward and then backward, we get two different waterfall images that mirror each other both vertically and horizontally. The data are therefore not sensitive to the flipping operation (Figure 3.4). We use the same cropping method on both the SSS and the bathymetry waterfall images. The patches are then resized to 256 × 256 pixels and grouped in pairs (Figure 3.5).
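The cropping procedure can be sketched as follows. Several details are assumptions made for the example: the window width equals half the waterfall width, every second patch is flipped upside down, the stride is arbitrary, and the final resize to 256 × 256 is left out. The function names are hypothetical.

```python
import numpy as np

def crop_side(side, stride):
    """Slide a square window down one side of a waterfall image.

    side: 2D array (rows are pings, columns are samples on one side).
    stride: vertical step in rows; a stride smaller than the window
            height gives overlapping patches.
    Every second patch is flipped upside down to reduce the similarity
    between neighbouring, overlapping patches.
    """
    window = side.shape[1]
    patches = []
    for index, top in enumerate(range(0, side.shape[0] - window + 1, stride)):
        patch = side[top:top + window]
        if index % 2 == 1:
            patch = np.flipud(patch)
        patches.append(patch)
    return patches

def crop_waterfall(waterfall, stride):
    """Crop both sides; right-side patches are mirrored to look like left-side ones."""
    half = waterfall.shape[1] // 2
    left = crop_side(waterfall[:, :half], stride)
    right = [np.fliplr(p) for p in crop_side(waterfall[:, half:], stride)]
    return left + right

waterfall = np.arange(12 * 8, dtype=float).reshape(12, 8)
patches = crop_waterfall(waterfall, stride=2)
print(len(patches), patches[0].shape)
```

Applying the same function with the same stride to the SSS and the bathymetry waterfall keeps the two patch lists aligned, so patches with equal index form a pair.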

Separate train set and test/validation set

After obtaining the image pairs, we split them into two sets: the train set and the test/validation set. To ensure that the train and test sets do not contain too many similar images, we select some of the waterfall images and assign all patches cropped from them to the test set. Since each waterfall image comes from an individual survey track of the AUV, there is little dependency between the two sets.
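A split of this kind, where whole waterfall images rather than individual patches are assigned to the test set, might look like the sketch below. The test fraction, the random seed, and the function name are illustrative assumptions; the source does not state how many waterfall images were held out.

```python
import random

def split_by_waterfall(patch_sources, test_fraction=0.2, seed=0):
    """Split patches so that all patches from one waterfall land in the same set.

    patch_sources: list with the waterfall id of each patch.
    Returns (train_indices, test_indices) into that list.
    """
    waterfall_ids = sorted(set(patch_sources))
    rng = random.Random(seed)
    rng.shuffle(waterfall_ids)
    n_test = max(1, round(test_fraction * len(waterfall_ids)))
    test_ids = set(waterfall_ids[:n_test])
    train = [i for i, w in enumerate(patch_sources) if w not in test_ids]
    test = [i for i, w in enumerate(patch_sources) if w in test_ids]
    return train, test

# Ten hypothetical waterfalls with three patches each.
sources = [w for w in range(10) for _ in range(3)]
train_idx, test_idx = split_by_waterfall(sources)
print(len(train_idx), len(test_idx))
```

Splitting at the waterfall level prevents overlapping patches of the same track from appearing in both sets.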


Figure 3.4: Crop small patches from waterfall image: Red boxes are sliding windows of cropping. Patches within green frame are what we extracted for dataset.

Improve the dataset

During training, we encountered some issues with the dataset. First, the bathymetry data is quite biased: most of the variation is within ± 1 around an average depth of roughly -20, which makes the data difficult to normalize. Second, the train set is too small: the network sometimes starts to overfit the train set before it converges to an acceptable result.

Figure 3.5: Example of paired patches

To solve the first problem, the incident angle of the sonar wavefront is used in place of the bathymetry data. Once the incident angle at each pixel is known, the data is easily normalized to the range 0 ∼ 1 by taking the cosine of the angle (Figure 3.6). The incident angle computation is also provided by auvlib [32].

This method works because the brightness of a pixel in the SSS image is positively correlated with the cosine of the incident angle. According to the Lambertian reflection model, the reflection intensity I at point p can be expressed as

$$ I(p) = K \, \Phi(p) \, R(p) \, |\cos(\theta(p))|, \tag{3.1} $$

where θ is the incident angle of the wavefront, Φ is the intensity of the illuminating sound wave, R is the reflectivity of the seabed, and K is a normalization constant [7]. Near-normal incidence (a large cosine value) leads to a high reflection intensity of the sonar signal and hence to a bright pixel in the SSS image. The new paired patches therefore consist of an SSS image and an incident angle image.
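Equation 3.1 can be illustrated with a few lines of NumPy. This is only a sketch of the Lambertian model; it is not how auvlib computes incident angles, and the function names and default values are hypothetical.

```python
import numpy as np

def cos_incidence(normals, rays):
    """|cos(theta)| between surface normals and incoming ray directions.

    Both inputs have shape (..., 3) and need not be unit length.
    """
    normals = np.asarray(normals, dtype=float)
    rays = np.asarray(rays, dtype=float)
    dot = np.sum(normals * rays, axis=-1)
    norm = np.linalg.norm(normals, axis=-1) * np.linalg.norm(rays, axis=-1)
    return np.abs(dot) / norm

def lambertian_intensity(cos_theta, phi=1.0, reflectivity=1.0, k=1.0):
    """Equation 3.1: I = K * Phi * R * |cos(theta)|."""
    return k * phi * reflectivity * np.abs(cos_theta)
```

Since the absolute cosine always lies between 0 and 1, it needs no further normalization before being fed to the network.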

To solve the second problem, the patch height is halved (from 256 × 256 to 128 × 256) and the patches are then resized back to 256 × 256, which doubles the size of the dataset. Each patch then also contains fewer features, which should make the features easier for the GAN to learn (Figure 3.7).
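A minimal sketch of this augmentation is shown below. Row repetition stands in for a proper image resize, which is an assumption made to keep the example dependency-free; the function name is hypothetical.

```python
import numpy as np

def half_height_patches(patch):
    """Split a patch into top and bottom halves and stretch each back to full height."""
    h = patch.shape[0]
    assert h % 2 == 0
    halves = (patch[: h // 2], patch[h // 2:])
    # Nearest-neighbour upsampling: repeat every row twice (stand-in for an image resize).
    return [np.repeat(half, 2, axis=0) for half in halves]
```

Applied to both images of a pair, each original pair yields two new pairs of the original size.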


(a) sidescan sonar image (b) cosine of incident angle image

Figure 3.6: New example of waterfall image pair

3.2 Network implementation

3.2.1 Environment Setup

This project is implemented in Python 3. The main libraries are TensorFlow (version r1.10) for building the networks, OpenCV for image visualization, and auvlib [32] for data extraction. TensorFlow is an open source platform for machine learning that provides easy-to-use APIs for building, training, and optimizing networks. The networks are trained on the GPU cluster provided by the CSC department of KTH.


Figure 3.7: New example of paired patches

3.2.2 Network architectures

Conditional GANs are used in this project to learn the image translation from one domain X (incident angle) to another domain Y (SSS image). The generator G is fed with images from domain X and noise z, and generates the corresponding images in domain Y: G : {x; z} → y. The discriminator D is fed with image pairs (x, y) from both domains. The images x from domain X are the generator inputs, and the images y from domain Y are either real SSS images (ground truth) or generated images. The discriminator returns a judgment of:

(1) Whether the image y is from the real distribution $p_{data}$ rather than the generator distribution $p_g$;

(2) Whether the image y corresponds to the input image x rather than being a "randomly generated" realistic image.

Generator

U-net: The first trial is based on the U-net introduced in [14]. It is an 8-layer encoder-decoder with symmetric skip connections (we refer to this model as U-net8), shown in Figure 3.8. The input image (256 × 256 × 1) first passes through Encoder 1, a K4S2F32 convolution layer (kernelSize=4, stride=2, numberOfFilter=32), which downsamples it to a 128 × 128 × 32 matrix. Each of the remaining 7 encoders contains a leaky ReLU activation layer, a convolution layer, and a batch normalization layer. Each encoder halves the spatial size of the matrix and doubles the number of channels until it reaches 256, after which the channel count stays at 256. The result is a 1 × 1 × 256 latent vector (code). This vector is then passed to a series of decoders. Each decoder contains a ReLU activation layer, a deconvolution layer, and a batch normalization layer. The output of each decoder is concatenated with the corresponding encoder output along the channel dimension, which doubles the number of channels. This is the skip connection. For the last decoder, the 128 × 128 × 64 input is passed to a ReLU activation layer, then a K4S2F1 deconvolution layer, and finally a Tanh activation layer to produce the final output.
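The tensor shapes described above can be checked with a short shape trace. This is a bookkeeping sketch only, with no actual layers; the function names are hypothetical, and the assumption that each decoder mirrors the channel count of its matching encoder is inferred from the 128 × 128 × 64 input of the last decoder.

```python
def unet_encoder_shapes(input_size=256, first_filters=32, max_filters=256, n_layers=8):
    """Output shape (height, width, channels) of each stride-2 encoder."""
    shapes = []
    size, filters = input_size, first_filters
    for _ in range(n_layers):
        size //= 2                                # each encoder halves the spatial size
        shapes.append((size, size, filters))
        filters = min(filters * 2, max_filters)   # channels double, capped at max_filters
    return shapes

def last_decoder_input_shape(encoder_shapes):
    """Shape after concatenating the second-to-last decoder output with encoder 1.

    Assumption: that decoder output has the same shape as the encoder 1 output,
    so the skip connection doubles the channel count.
    """
    h, w, c = encoder_shapes[0]
    return (h, w, 2 * c)
```

With the default arguments the trace starts at 128 × 128 × 32 and ends at the 1 × 1 × 256 latent vector.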

ResNet: To improve the performance, we tried a more complex network structure. We applied a ResNet generator to the GAN as described in [31], shown in Figure 2.4. It contains 2 encoder blocks, 9 residual blocks, and 2 decoder blocks. The input image is first downsampled to 128 × 128 × 64 by the 2 encoders. Then 9 residual blocks follow, all with the same input and output size. Finally, the matrix is passed to the 2 decoders and upsampled to 256 × 256 × 1.

Discriminator

PatchGAN: We use the basic idea of the PatchGAN discriminator structure from [14]. It is a deep convolutional network without any pooling layers. The input domain and target domain images are first concatenated along the channel axis into a 256 × 256 × 2 matrix, which is then passed through a series of convolution layers and leaky ReLU activation layers. To apply WGAN-GP, we removed the batch normalization layers, because the gradient penalty term is applied to each input individually rather than to the whole batch [20]. The discriminator outputs a 30 × 30 matrix, in which each value is the prediction for a 70 × 70 patch of the original image [14].
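The 30 × 30 output and the 70 × 70 patch size can be reproduced with standard convolution arithmetic. The layer list below (five 4 × 4 convolutions with strides 2, 2, 2, 1, 1 and padding 1) is an assumption based on the 70 × 70 PatchGAN layout of [14]; the thesis does not list the exact kernel sizes and strides used.

```python
# (kernel, stride) per convolution layer; assumed layout, see the text above.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of one convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1

def patchgan_output_size(size=256, layers=PATCHGAN_LAYERS, padding=1):
    """Spatial size of the discriminator output map."""
    for kernel, stride in layers:
        size = conv_output_size(size, kernel, stride, padding)
    return size

def receptive_field(layers=PATCHGAN_LAYERS):
    """Side length of the input patch seen by one output value."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf
```

Under these assumptions a 256 × 256 input shrinks to 128, 64, 32, 31, and finally 30 pixels per side.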

3.2.3 Training Method

We follow the training method introduced in [14]. In each iteration, we update the discriminator D once and then the generator G once. The training batch size is 20. We use the Adam optimizer with a learning rate of 0.0002 and moment parameters $\beta_1 = 0.5$, $\beta_2 = 0.9$. In total, 150,000 update steps are taken to ensure that the network converges.


Figure 3.8: Generator network structure: U-net8

3.2.4 Objective function

The objective function of a conditional GAN is:

$$ \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))], \tag{3.2} $$

where the discriminator D tries to maximize the objective function while the generator G tries to minimize it against D, which can be written as:

$$ G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D). \tag{3.3} $$

Figure 3.9: Generator network structure: ResNet9

Figure 3.10: Discriminator network structure: PatchGAN

Apart from the adversarial loss defined in Equation 3.2, other approaches have shown that combining the GAN objective with traditional reconstruction losses can be helpful [33]. We therefore add weighted L1 and L2 losses to the objective function to help the GAN training:

$$ \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\| y - G(x, z) \|_1], \tag{3.4} $$

$$ \mathcal{L}_{L2}(G) = \mathbb{E}_{x,y,z}[\| y - G(x, z) \|_2], \tag{3.5} $$

$$ G^* = \arg\min_G \max_D \left( \mathcal{L}_{cGAN}(G, D) + \lambda_1 \mathcal{L}_{L1}(G) + \lambda_2 \mathcal{L}_{L2}(G) \right), \tag{3.6} $$

and we can also add a weight parameter $\lambda_{GAN}$ to the GAN loss so that we can run more experiments on these parameters.

To train the generator G and the discriminator D separately, Equation 3.2 can be rewritten as two losses:

$$ \mathcal{L}_D = -\left( \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \right) \tag{3.7} $$

$$ \mathcal{L}_G = -\mathbb{E}_{x,z}[\log D(x, G(x, z))] \tag{3.8} $$
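As a rough numerical illustration of Equations 3.4 to 3.8, the losses can be evaluated on batches of discriminator outputs and images with NumPy. The function names, the small constant `EPS`, and the way the three weights are combined into one generator loss are assumptions for this sketch; the default weights are the ones given in Section 3.3.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); not part of the equations

def discriminator_loss(d_real, d_fake):
    """Equation 3.7, with d_real = D(x, y) and d_fake = D(x, G(x, z)) in (0, 1)."""
    return -(np.mean(np.log(d_real + EPS)) + np.mean(np.log(1.0 - d_fake + EPS)))

def generator_adv_loss(d_fake):
    """Equation 3.8."""
    return -np.mean(np.log(d_fake + EPS))

def reconstruction_losses(y, y_gen):
    """Equations 3.4 and 3.5: per-sample 1-norm and 2-norm, averaged over the batch."""
    diff = (y - y_gen).reshape(len(y), -1)
    l1 = np.mean(np.sum(np.abs(diff), axis=1))
    l2 = np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
    return l1, l2

def generator_total_loss(d_fake, y, y_gen, lam_gan=1.0, lam_1=10.0, lam_2=0.0):
    """Assumed weighted sum of the adversarial and reconstruction terms."""
    l1, l2 = reconstruction_losses(y, y_gen)
    return lam_gan * generator_adv_loss(d_fake) + lam_1 * l1 + lam_2 * l2
```

In practice, frameworks often use per-pixel means instead of per-sample norms for the reconstruction terms, which only rescales the weights.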

To make the training procedure more stable, the Wasserstein GAN loss with a gradient penalty term is introduced into the loss function. Equations 3.7 and 3.8 are updated to:

$$ \mathcal{L}_D = \mathbb{E}_{x,z}[D(x, G(x, z))] - \mathbb{E}_{x,y}[D(x, y)], \tag{3.9} $$

$$ \mathcal{L}_G = -\mathbb{E}_{x,z}[D(x, G(x, z))]. \tag{3.10} $$

To add the gradient penalty term to the discriminator loss $\mathcal{L}_D$, we need to complete the following steps:

1. Sample interpolates between the generated and real images,

2. Compute the gradient of the discriminator with respect to the interpolates,

3. Get the penalty term from the gradient.

The gradient penalty term can be written as:

$$ GP = \mathbb{E}_{x,\hat{y}}\left[ \left( \| \nabla_{\hat{y}} D(x, \hat{y}) \|_2 - 1 \right)^2 \right], \tag{3.11} $$

where $\hat{y}$ denotes the interpolates. The gradient penalty is then added to the discriminator loss in Equation 3.9:

$$ \mathcal{L}_D = \mathbb{E}_{x,z}[D(x, G(x, z))] - \mathbb{E}_{x,y}[D(x, y)] + GP. \tag{3.12} $$
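The three steps of the gradient penalty can be sketched in NumPy as follows. The sketch assumes that a function `critic_grad` returning the gradient of D with respect to the interpolates is available; in a real training setup that gradient would come from the automatic differentiation of the deep learning framework. All names are hypothetical.

```python
import numpy as np

def gradient_penalty(y_real, y_fake, critic_grad, rng):
    """Equation 3.11 evaluated on one batch.

    critic_grad(y_hat) must return dD/dy_hat with the same shape as y_hat.
    """
    batch = y_real.shape[0]
    # Step 1: sample one interpolation coefficient per image pair.
    eps = rng.uniform(size=(batch,) + (1,) * (y_real.ndim - 1))
    y_hat = eps * y_real + (1.0 - eps) * y_fake
    # Step 2: gradient of the discriminator with respect to the interpolates.
    grads = critic_grad(y_hat)
    # Step 3: penalize the deviation of each gradient norm from 1.
    norms = np.sqrt(np.sum(np.reshape(grads, (batch, -1)) ** 2, axis=1))
    return np.mean((norms - 1.0) ** 2)
```

For a linear toy critic D(y) = sum(w * y) the gradient is w everywhere, so the penalty is zero exactly when w has unit norm.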

To inject the noise z into the generator G, we follow [14] and apply dropout to the first several layers, because noise fed directly to the generator is likely to be ignored [34].


3.3 Experiment

First, different weights for the GAN loss ($\lambda_{GAN}$), the L1 loss ($\lambda_1$), and the L2 loss ($\lambda_2$) are tested. The default values are $\lambda_{GAN} = 1$, $\lambda_1 = 10$, $\lambda_2 = 0$, following the setup in [14]. If $\lambda_{GAN}$ is set to a very small value compared to the two reconstruction loss weights, training behaves more like an ordinary regression than a GAN, so the results of these 2 types of machine learning approaches can be compared.

Then different numbers of filters are tested. In the default setup, the first encoder has a 32-filter convolution layer, followed by 64, 128, etc. We tried other setups such as 20-40-80 and 16-32-64. In the report below, we identify each setup only by the number of filters in its first convolution layer.

We also change the number of filters in the discriminator to check whether a simpler discriminator is still an effective critic.

Some further changes can be made to the network structures. For the U-net, we vary the depth of the encoder-decoder, reducing it from the original 8 layers to 7 layers. A shallower U-net can focus more on small details, while a deeper U-net might extract larger features because of its larger receptive field [35]. Another modification is the stride of the convolution/deconvolution layers. A larger stride leads to more rapid downsampling/upsampling, which means stronger spatial pooling but also faster loss of information. For the ResNet, we change the number of residual blocks to simplify the network structure, and we add more encoder/decoder blocks.

For convenience, we use abbreviations to distinguish the experiment setups. For example, "unet8-32-32" denotes an 8-layer U-net generator with 32 filters in the generator and 32 filters in the discriminator; "resnet7_v2-32-20_10_0_1" denotes a version 2 ResNet structure with 7 residual blocks, 32 filters in the generator, and 20 filters in the discriminator, with loss weights $\lambda_1 = 10$, $\lambda_2 = 0$, $\lambda_{GAN} = 1$.
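The naming scheme can be made explicit with a small parser. This helper is hypothetical and not part of the thesis code; it assumes that a missing version suffix means version 1 and that missing loss weights fall back to the default values stated above.

```python
import re

_SETUP = re.compile(
    r"^(?P<arch>unet|resnet)(?P<depth>\d+)(?:_v(?P<version>\d+))?"
    r"-(?P<g_filters>\d+)-(?P<d_filters>\d+)"
    r"(?:_(?P<lam_1>\d+)_(?P<lam_2>\d+)_(?P<lam_gan>\d+))?$"
)

def parse_setup(name):
    """Decode an experiment abbreviation such as 'resnet7_v2-32-20_10_0_1'."""
    m = _SETUP.match(name)
    if m is None:
        raise ValueError("unrecognized setup name: " + name)
    g = m.groupdict()
    return {
        "arch": g["arch"],
        "depth": int(g["depth"]),                # U-net layers or ResNet residual blocks
        "version": int(g["version"] or 1),       # assumption: no suffix means version 1
        "g_filters": int(g["g_filters"]),
        "d_filters": int(g["d_filters"]),
        "lam_1": float(g["lam_1"] or 10),        # defaults from the text above
        "lam_2": float(g["lam_2"] or 0),
        "lam_gan": float(g["lam_gan"] or 1),
    }
```

For example, `parse_setup("unet8-32-32_100_0_1")` yields an 8-layer U-net with an L1 weight of 100.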

3.3.1 Evaluation Setup

Evaluating the quality of generated images is quite difficult [17]. Traditional reconstruction losses such as the pixel-wise L1 or L2 error cannot handle high-frequency information, while sidescan sonar images contain a lot of it, such as highlights and shadows in rock areas. There are other quantitative evaluation methods, such as the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR), but in our tests these scores did not reflect the quality of the generated images accurately. Some models produce significantly better output yet show no obvious difference in these scores, or even score lower. This might be because sidescan sonar images contain large amounts of noise, which makes such evaluation difficult.
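For reference, PSNR is a purely pixel-wise score derived from the mean squared error, which helps explain why it reacts strongly to noise. The sketch below uses the standard definition; the function name and the assumption that images are scaled to the range 0 to 1 are choices made for this example.

```python
import numpy as np

def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=float)
    generated = np.asarray(generated, dtype=float)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")       # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Two images with different but equally plausible noise patterns can therefore score poorly against each other even when both look realistic.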

We therefore decided to evaluate the image quality with human judges. We designed a survey for volunteers containing a large number of questions with paired images.

For each question, the respondent is shown one bathymetry image and one SSS image. A built-in OpenCV colormap is applied to the bathymetry image to make details more visible. The SSS image is either a ground truth SSS image or an image generated by the GAN. The respondent should judge whether the SSS image is "real" or "fake". The images are presented in random order to avoid bias. We ask respondents to decide by first impression, within several seconds.

To get a clearer view of the performance, we divide all the test images into 3 categories: flat, rock, and hill (Figure 3.11). "Flat" means that the area is nearly flat seabed with little information, "rock" means that the area contains many small rock-like objects, and "hill" means that the area contains many large-scale structures such as hills. These categories make the advantages and disadvantages of the different GAN models easier to see.

(a) Flat area (b) Rock area (c) Hill area

Figure 3.11: Example of image pairs shown in the survey. In each image, the upper half is the bathymetry image and the lower half is the SSS image. The depth of the seabed increases as the colormap varies from blue to red.


Chapter 4 Results and Discussion

4.1 U-net

Figure 4.1: U-net8 results 1.

The first experiment is the U-net8 with 32, 20, and 16 filters. Example output images are shown in Figure 4.1. Comparing these results, we can see that unet8-32-32 has the best performance. Both unet8-20-20 and unet8-16-16 produce a strange grid-like texture on flat areas (Figure 4.2). We can also observe that all these network structures are limited in handling rocks and hills: the networks tend to generate highlights and shadows randomly in these areas, which makes the results less realistic.

Figure 4.2: Grid-like texture on flat areas.

Figure 4.3: Loss/score on the test/validation set.

From the loss graphs (Figure 4.3), we find no convincing evidence that any of the losses/scores accurately represents the performance of the generator. This is consistent with what we noted in subsection 3.3.1 Evaluation Setup.

Then, based on unet8-32-32, we tested whether different weight parameters lead to any improvement. We kept the weight of the GAN loss at $\lambda_{GAN} = 1$ and the weight of the L2 loss at $\lambda_2 = 0$, and changed the weight of the L1 loss. The test setups are unet8-32-32_10_0_1, unet8-32-32_100_0_1, and unet8-32-32_1_0_1.


Figure 4.4: U-net8 results 2.

From result 2 (Figure 4.4), we find that unet8-32-32_100_0_1 has a more blurred output. This might be because a larger L1 loss weight makes the generator prefer pixels with a "mean" color, so that high-frequency signals such as small rocks and noise are ignored. This result is similar to a regression result, since the GAN loss matters little compared to the reconstruction error. At the same time, unet8-32-32_1_0_1 shows a lot of grid-like noise. This might be because the L1 weight is too small, so that the generator does not get enough "hints" from the reconstruction error to learn the image characteristics.

We also investigated the influence of the L2 loss weight. We fixed the L1 and GAN loss weights and tested L2 weights from 0 to 50. Result 3 is shown in Figure 4.5. There seems to be no improvement in the generated image quality. In "unet8_10_50_1", the ripple-like noise on flat areas is clearly weaker than in the other results or in the ground truth. The reason might be that the L2 loss penalizes extreme values more heavily, so the generator tends to produce fewer bright and shadow areas. This could also lead to blurry output.

Figure 4.5: U-net8 results 3.

Then we tested some shallower networks based on the U-net structure. We designed 3 different 7-layer U-net structures (Figure 4.6). U-net7 version 1 is the same as U-net8 with the last encoder block (encoder 8) removed, so the latent vector has size 2 × 2 × 256 rather than 1 × 1 × 256. U-net7 version 2 has a larger stride in the convolution layer of the first encoder (encoder 1): the stride is changed from 2 to 4, so the output of encoder 1 is 64 × 64 × 32 rather than 128 × 128 × 32. U-net7 version 3 has a larger stride in the convolution layer of the last encoder (encoder 7): the stride is changed from 2 to 4, so the latent vector again has size 1 × 1 × 256. A larger stride causes more rapid downsampling/upsampling.
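The effect of the stride schedules on the spatial size can be traced with a few lines of Python. The dictionary keys are hypothetical labels, and the sketch assumes that every encoder not mentioned in the text keeps a stride of 2.

```python
def spatial_sizes(strides, input_size=256):
    """Spatial size after each encoder for a given sequence of strides."""
    sizes = []
    size = input_size
    for stride in strides:
        size //= stride
        sizes.append(size)
    return sizes

# Encoder strides per variant (labels are illustrative only).
UNET_STRIDES = {
    "unet8":    (2, 2, 2, 2, 2, 2, 2, 2),
    "unet7_v1": (2, 2, 2, 2, 2, 2, 2),     # encoder 8 removed
    "unet7_v2": (4, 2, 2, 2, 2, 2, 2),     # stride 4 in encoder 1
    "unet7_v3": (2, 2, 2, 2, 2, 2, 4),     # stride 4 in encoder 7
}
```

Versions 2 and 3 spend the extra stride at opposite ends of the encoder: version 2 discards resolution before any features are extracted, while version 3 does so only at the bottleneck.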

Figure 4.6: Example of 3 different U-net7 structures.

From result 4 (Figure 4.7), we observe that there is not much difference between the 7-layer and the 8-layer U-net. Among the 3 versions of U-net7, version 1 and version 3 produce more similar outputs, while version 2 seems to produce fewer details such as small rocks. The reason might be that version 2 has a larger stride in the first encoder block, which means more rapid downsampling at the very start, so more information might be lost during the reconstruction.


Figure 4.7: U-net7 results 4.


4.2 ResNet

ResNet is a more complex network structure and is expected to give better results. We first test the ResNet9 directly with 32 and 20 filters. Figure 4.8 shows that the generated images have uneven dark and bright blocks, which makes them less realistic. The reason might be that the original ResNet9 structure has only 2 encoder and 2 decoder blocks (Figure 3.9), so the receptive field of each node is much smaller than in the U-net we tried before, which means that there is less spatial relationship between patches in different areas.

Figure 4.8: ResNet9 results 5.

To solve this problem, we designed 2 different ResNet structures with more encoder/decoder blocks. In both structures, we added 1 more encoder block and 1 more decoder block. The difference is that in ResNet9 version 2 the first encoder has a convolution layer with stride 2 (as does the last decoder), while in version 3 the first encoder has a convolution layer with stride 1.

Figure 4.9: Example of two ResNet9 structures.

We kept the number of filters at 32 and tested the new ResNet9 structures. Result 6 is shown in Figure 4.10. The uneven blocks seen in result 5 do not appear, which suggests that our idea is in the right direction. Compared to the U-net structure, the ResNet generates much more realistic SSS images in rock areas.


Figure 4.10: ResNet9 results 6.

Then we tried to simplify the ResNet9. We reduced the network to 8, 7, and 6 residual blocks to see whether it still provides good results, testing both ResNet version 2 and version 3 to see which one performs better.

From the result shown in Figure 4.11, we notice that the problem of uneven dark and bright blocks appears again with the version 3 network, while version 2 seems to handle this problem well. This might be caused by the smaller receptive field of the version 3 structure compared to version 2.

As for image quality, there is not much difference between ResNet9, ResNet8, and ResNet7, but ResNet6 produces some unexpected bright noise.


Figure 4.11: ResNet results 7.

4.3 Performance Evaluation

We chose 2 networks, "unet8-32-32" and "resnet7_v2-32-32", to evaluate by survey. These 2 networks appear to perform best among the U-net and ResNet structures, respectively. The weight parameters are set to $\lambda_{GAN} = 1$, $\lambda_1 = 10$, $\lambda_2 = 0$. We mixed the generated images with ground truth images in random order, for a total of 116 images. We collected responses from 9 participants. The result is shown in Table 4.1.

From the result, we can see that respondents mark most of the ground truth images as "real", which is in line with the facts.

        Ground Truth    U-net    ResNet
Flat    65.0%           42.9%    44.4%
Rock    81.7%           38.4%    62.2%
Hill    74.4%           30.2%    42.9%

Table 4.1: The retrieval rate in different categories

We can also see that, for the ground truth, the flat and hill areas have a relatively lower retrieval rate than the rock area. This shows that it is harder for respondents to judge whether the flat areas are real or not, possibly because flat areas offer little information on which to base the decision. For our GAN models, both networks have a similar retrieval rate in flat areas, which exceeds half of the retrieval rate of the ground truth. This shows that in flat areas, both the U-net and the ResNet can sometimes fool a human critic. For the U-net, however, the retrieval rate in the rock area is much lower, and the performance in the hill area is worse still. This means that this U-net architecture is limited in handling complex seabed structures. For the ResNet, the retrieval rate in the rock area is quite high, which means that the ResNet can generate realistic SSS images in rock areas. Its retrieval rate in the hill area, however, is still not high enough compared to the retrieval rate of the ground truth, so the ResNet still cannot handle complex larger structures on the seabed. This result also fits our observations in the experiments.
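The comparison with the ground truth can be made concrete by dividing each model's retrieval rate by the ground truth rate of the same category. The numbers below are copied from Table 4.1; the helper name and dictionary layout are illustrative only.

```python
# Retrieval rates in percent, copied from Table 4.1.
RETRIEVAL = {
    "flat": {"ground_truth": 65.0, "unet": 42.9, "resnet": 44.4},
    "rock": {"ground_truth": 81.7, "unet": 38.4, "resnet": 62.2},
    "hill": {"ground_truth": 74.4, "unet": 30.2, "resnet": 42.9},
}

def relative_rate(category, model):
    """Retrieval rate of a model as a fraction of the ground truth rate."""
    row = RETRIEVAL[category]
    return row[model] / row["ground_truth"]

if __name__ == "__main__":
    for category in RETRIEVAL:
        for model in ("unet", "resnet"):
            print(category, model, round(relative_rate(category, model), 2))
```

Read this way, both models exceed half of the ground truth rate in flat areas, while only the ResNet does so in rock and hill areas.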

4.4 Time Consumption

Time consumption is not our current focus. Both training and generation run on a single-GPU platform with a computational level equivalent to an Nvidia GTX 1080 Ti. In our experiments, "unet8-32-32" takes about 12 hours for the 150,000 update steps, and generating one image (256 × 256) takes about 0.015 seconds. U-net structures with fewer filters or fewer layers take less time; for example, "unet7-20-20" trains in about 9.7 hours and generates an image in about 0.012 seconds. The ResNet structures take much longer: "resnet9-32-32" trains in about 24 hours and generates an image in about 0.029 seconds. ResNet version 2 and version 3 add more encoder/decoder blocks, so there are fewer nodes in the residual blocks: "resnet9_v3-32-32" trains in about 19 hours and generates an image in about 0.023 seconds. A ResNet with fewer residual blocks is also faster; for example, "resnet6_v2-20-20" trains in about 15 hours and generates an image in about 0.018 seconds.


Chapter 5 Conclusions

The objective is to find out whether machine learning based methods can provide an effective translation from bathymetry data to sidescan sonar images. In this thesis, generative adversarial nets were applied to this task. We tested 2 different generative model structures, the U-net and the ResNet. According to our experimental results, we can generate quite realistic sidescan sonar images in areas that are not too complicated. We found that the ResNet can generate high-quality sidescan images in flat and rock areas, but it is still limited in generating high-quality images of seabed areas with large structures such as big rocks or hills.

The innovation of this thesis is a new approach to machine interpretation of sonar images, as an alternative to traditional mathematical model simulation. Our result is that GANs have a strong potential to do a satisfactory job in this translation.

5.1 Future Work

Our network structure can still be improved. For the U-net structure, more convolution layers with identical input and output size could be added to each encoder block to improve the feature extraction ability. For the ResNet structure, adding convolution layers to the encoder blocks might also improve the low-level feature extraction ability. Another option is to add skip connections to the ResNet, to help the network pass more detailed information directly from the contracting path to the expansive path.

The ultimate objective is a two-way translation between bathymetry data and sidescan sonar. The translation from bathymetry to sidescan sonar can provide an effective AUV simulation method, while the translation from sidescan sonar to bathymetry can be used in onboard AUV positioning algorithms.

We hope our results will be helpful for future studies in this field.


Bibliography

[1] John J Leonard et al. “Autonomous underwater vehicle navigation”. In: MIT Marine Robotics Laboratory Technical Memorandum. Citeseer. 1998.

[2] X Yun et al. “Testing and evaluation of an integrated GPS/INS system for small AUV navigation”. In: IEEE Journal of Oceanic Engineering 24.3 (1999), pp. 396–404.

[3] Denise Crimmins and Justin Manley. What Are AUVs, and Why Do We Use Them? url: https://oceanexplorer.noaa.gov/explorations/08auvfest/background/auvs/auvs.html (visited on 05/15/2019).

[4] Philippe Blondel. The handbook of sidescan sonar. Springer Science & Business Media, 2010.

[5] LCS Instruments. “Multibeam sonar theory of operation”. In: Report. L3 Communications SeaBeam Instruments (2000).

[6] Luciano Fonseca and Larry Mayer. “Remote estimation of surficial seafloor properties through the application Angular Range Analysis to multibeam sonar data”. In: Marine Geophysical Researches 28.2 (2007), pp. 119–126.

[7] Enrique Coiras, Yvan Petillot, and David M Lane. “Multiresolution 3-D reconstruction from side-scan sonar images”. In: IEEE Transactions on Image Processing 16.2 (2007), pp. 382–390.

[8] Yan Pailhas et al. “Real-time sidescan simulator and applications”. In: OCEANS 2009-EUROPE. IEEE. 2009, pp. 1–6.

[9] Andrew Y Ng and Michael I Jordan. “On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes”. In: Advances in neural information processing systems. 2002, pp. 841–848.


[10] Alec Radford, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks”. In: arXiv preprint arXiv:1511.06434 (2015).

[11] Scott Reed et al. “Generative adversarial text to image synthesis”. In: arXiv preprint arXiv:1605.05396 (2016).

[12] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. “Perceptual losses for real-time style transfer and super-resolution”. In: European conference on computer vision. Springer. 2016, pp. 694–711.

[13] Jun-Yan Zhu et al. “Unpaired image-to-image translation using cycle-consistent adversarial networks”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2223–2232.

[14] Phillip Isola et al. “Image-to-image translation with conditional adversarial networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1125–1134.

[15] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. “Generating videos with scene dynamics”. In: Advances In Neural Information Processing Systems. 2016, pp. 613–621.

[16] Ian Goodfellow et al. “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani et al. Curran Associates, Inc., 2014, pp. 2672–2680.

[17] Tim Salimans et al. “Improved techniques for training gans”. In: Advances in neural information processing systems. 2016, pp. 2234–2242.

[18] Sanjeev Arora and Yi Zhang. “Do gans actually learn the distribution? an empirical study”. In: arXiv preprint arXiv:1706.08224 (2017).

[19] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein gan”. In: arXiv preprint arXiv:1701.07875 (2017).

[20] Ishaan Gulrajani et al. “Improved training of wasserstein gans”. In: Advances in Neural Information Processing Systems. 2017, pp. 5767–5777.

[21] Cédric Villani. Optimal transport: old and new. Vol. 338. Springer Science & Business Media, 2008.

[22] He Zhang, Vishwanath Sindagi, and Vishal M Patel. “Image de-raining using a conditional generative adversarial network”. In: arXiv preprint arXiv:1701.05957 (2017).



Sammanfattning

Syftet med detta avhandlingsprojekt är att hitta ett effektivt sätt för att över-

sätta data från batymetri (undervattensdjup, från multibeam sonar) till sidescan

sonarbilder. I projektet undersökte vi genomförbarheten för översättningsme-

toder baserad på maskininlärning. Olika generativa modeller baserade på ge-

nerative adversarial nets (GANs) hade undersöktes. Detta projekt kan ses som

en förstudie. Ytterligare förbättringar krävs fortfarande, men resultatet visar

en stark potential för maskininlärningsmetoder att hantera denna typ av över-

sättningsuppgifter.


Acknowledgement

This work was supported by Stiftelsen for Strategisk Forskning (SSF) through the Swedish Maritime Robotics Centre (SMaRC) (IRC-0046). We thank MMT Sweden AB for providing the data.

Contents

1 Introduction
1.1 Autonomous Underwater Vehicles
1.2 Multibeam Sonar
1.3 Sidescan Sonar
1.4 Problem Description
1.5 Research Question
1.6 Ethics, Sustainability and Society
1.7 Thesis Structure

2 Related Work
2.1 Generative and Discriminative Model
2.2 Generative Adversarial Nets (GANs)
2.3 Deep Convolutional GAN (DCGAN)
2.4 Wasserstein GAN (WGAN)
2.5 Conditional GAN (cGAN)
2.6 Encoder-decoder
2.7 U-net and Skip Connection
2.8 Residual Networks (ResNet)

3 Methods
3.1 Dataset
3.2 Network implementation
3.2.1 Environment Setup
3.2.2 Network architectures
3.2.3 Training Method
3.2.4 Objective function
3.3 Experiment
3.3.1 Evaluation Setup

4 Results and Discussion
4.1 U-net
4.2 ResNet
4.3 Performance Evaluation
4.4 Time Consumption

5 Conclusions
5.1 Future Work

Bibliography

Chapter 1 Introduction

The Global Positioning System (GPS) can provide accurate positioning information while on the surface, but GPS signals are not available underwater [2].

1.1 Autonomous Underwater Vehicles

Autonomous underwater vehicles (AUVs) are computer-controlled underwater surveying platforms. They are called "autonomous" because there is no physical connection (wired or wireless) between the vehicles and the human operator, who might be on shore or on a ship following the vehicles [3].
