DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019
Machine Learning for Inferring
Sidescan Images from Bathymetry and AUV Pose
ZITAO ZHANG
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Machine Learning Date: June 25, 2019
Supervisor: John Folkesson Examiner: Patric Jensfelt
Abstract
Underwater navigation has long been a major challenge for autonomous underwater vehicles (AUVs). It depends heavily on acoustic methods known as SONAR. Two kinds of sonar sensors are commonly used: multibeam sonar and sidescan sonar. Each has its own advantages and limitations. Substantial improvements can be made if a machine interpretation method can be developed to translate between the data from these two sonars.
The objective of this thesis project is to find an effective way to translate seabed bathymetry (underwater depth) data from multibeam sonar into sidescan sonar images. In the project, we explored the feasibility of machine learning based translation methods. Several generative models based on the idea of generative adversarial nets were tried. This project is an experimental trial, and the method still needs improvement before it can be used in production. However, the results show strong potential for machine learning based methods to handle this kind of translation task.
Sammanfattning (Summary)
Navigation has long been a major challenge for autonomous underwater vehicles (AUVs). Typically, acoustic methods known as SONAR are used. There are two types of sonar sensors, multibeam sonar and sidescan sonar. Both have strengths and weaknesses. By translating between the data from these two sensors, substantial improvements can be achieved.
The aim of this thesis project is to find an effective way to translate bathymetry data (underwater depth, from multibeam sonar) into sidescan sonar images. In the project, we investigated the feasibility of translation methods based on machine learning. Different generative models based on generative adversarial nets (GANs) were investigated. This project can be seen as a pilot study. Further improvements are still required, but the results show strong potential for machine learning methods to handle this type of translation task.
Acknowledgement
This work was supported by Stiftelsen för Strategisk Forskning (SSF) through the Swedish Maritime Robotics Centre (SMaRC) (IRC-0046). We thank MMT Sweden AB for providing the data.
Contents
1 Introduction
1.1 Autonomous Underwater Vehicles
1.2 Multibeam Sonar
1.3 Sidescan Sonar
1.4 Problem Description
1.5 Research Question
1.6 Ethics, Sustainability and Society
1.7 Thesis Structure
2 Related Work
2.1 Generative and Discriminative Model
2.2 Generative Adversarial Nets (GANs)
2.3 Deep Convolutional GAN (DCGAN)
2.4 Wasserstein GAN (WGAN)
2.5 Conditional GAN (cGAN)
2.6 Encoder-decoder
2.7 U-net and Skip Connection
2.8 Residual Networks (ResNet)
3 Methods
3.1 Dataset
3.2 Network implementation
3.2.1 Environment Setup
3.2.2 Network architectures
3.2.3 Training Method
3.2.4 Objective function
3.3 Experiment
3.3.1 Evaluation Setup
4 Results and Discussion
4.1 U-net
4.2 ResNet
4.3 Performance Evaluation
4.4 Time Consumption
5 Conclusions
5.1 Future Work
Bibliography
Chapter 1 Introduction
In scientific research and ocean exploration, autonomous underwater vehicles (AUVs) have the potential to revolutionize our access to places that human-operated vehicles can hardly reach [1], such as the deep ocean and the water beneath glaciers. Navigation is one of the primary challenges in AUV research today.
The Global Positioning System (GPS) can provide accurate positioning information while the vehicle is on the surface, but GPS signals are not available underwater [2].
Because electromagnetic energy cannot propagate over appreciable distances in water except at very low frequencies, acoustic methods, with their long propagation range, are among the most feasible means of detection and communication. These acoustic methods are known as SONAR, or "Sound Navigation And Ranging". In this project, we are interested in two different sonar sensors: multibeam and sidescan sonar.
1.1 Autonomous Underwater Vehicles
Autonomous underwater vehicles (AUVs) are computer-controlled underwater surveying platforms. They are called "autonomous" because there is no physical connection (wired or wireless) between the vehicle and the human operator, who might be on shore or on a ship following behind the vehicle [3].
Early AUVs relied completely on their onboard hardware (sensors) and software (algorithms) for underwater surveying [3]. Nowadays, thanks to advances in acoustic technology, it is possible to exchange small amounts of data with AUVs acoustically, for example to read the battery status and depth or to send simple commands such as "start" and "stop", but complex commands are still effectively out of reach [3]. This is what distinguishes AUVs from remotely controlled vehicles [3].
Figure 1.1: Example of AUV surveying with the help of sonar
1.2 Multibeam Sonar
Multibeam echo-sounder sonars were developed and became accessible in the late 1980s [4]. These systems work by transmitting multiple narrow sound pulses from a transmitter at a specific frequency and then receiving the acoustic backscatter at a receiver [4] [5] [6]. Each sound pulse returns the distance between the AUV and the point where the pulse hit the seabed.
Multibeam sonars have a high standard of calibration and accuracy [4]. By applying acoustic mathematical models, complete 3D geometric bathymetry (depth) information of the seabed can be generated. Multibeam sonars are powerful tools for seabed mapping and navigation, but they are expensive and require a powerful sensor platform [4]. In addition, because each sound pulse returns only a single distance value, the resolution is limited by the number of sound pulses the sonar can send at the same time.
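To illustrate how one distance per beam turns into bathymetry, the following minimal sketch converts per-beam ranges into across-track position and depth. It assumes straight acoustic rays and beam angles measured from the vertical, which is a simplification; the function name and the example values are hypothetical and are not taken from the sonar system used in this project.

```python
import math

def beams_to_bathymetry(ranges_m, beam_angles_deg, sonar_depth_m=0.0):
    """Convert per-beam slant ranges to (across_track, depth) points.

    Assumption: straight rays, beam angles measured from the vertical
    (nadir), positive to starboard. Real systems also correct for the
    sound speed profile and the vehicle attitude.
    """
    points = []
    for r, angle in zip(ranges_m, beam_angles_deg):
        a = math.radians(angle)
        across = r * math.sin(a)                  # horizontal offset from the vehicle
        depth = sonar_depth_m + r * math.cos(a)   # depth below the surface
        points.append((across, depth))
    return points

# Three beams of equal range: one to port, one straight down, one to starboard.
pts = beams_to_bathymetry([20.0, 20.0, 20.0], [-60.0, 0.0, 60.0], sonar_depth_m=5.0)
for across, depth in pts:
    print(f"across-track {across:7.2f} m, depth {depth:6.2f} m")
```

Each beam contributes exactly one point, so the across-track resolution of the resulting depth map is bounded by the number of beams.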
1.3 Sidescan Sonar
Sidescan sonars are a common choice for high-resolution seabed mapping. These systems can cover a large portion of the seabed away from the surveying vessel, from a few tens of meters to 60 km or more [4]. The instrument has one beam on each side of the vessel, broad in the vertical plane perpendicular to the direction of travel. When the sonar signal hits the seabed, it is reflected as echoes of different intensity. The sensor records the reflected intensity and the time at which each sonar "ping" returns to the sensor. From this it can generate a 2D image of echo intensity, with one axis given by travel time and the other by position along the track. Sidescan sonars are more cost-effective than multibeam sonars and can provide high-resolution sonar images.
However, these sidescan sonar images are much more difficult to interpret than multibeam backscatter because they lack resolution in one of the three spatial dimensions. The interpretation of the sonar images is discussed in detail in Chapter 3 (Methods).
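The missing dimension can be made concrete with a small sketch. A sidescan sample only carries a two-way travel time, which gives a slant range; turning it into a horizontal position requires an extra assumption such as a flat seabed and a known altitude. The nominal sound speed of 1500 m/s and the function names below are assumptions made for illustration.

```python
import math

SOUND_SPEED = 1500.0  # m/s, assumed nominal value for sea water

def slant_range(two_way_time_s, c=SOUND_SPEED):
    """Distance to the reflector from the two-way travel time."""
    return c * two_way_time_s / 2.0

def ground_range(two_way_time_s, altitude_m, c=SOUND_SPEED):
    """Horizontal distance under a flat-seabed assumption.

    Returns None for echoes that arrive before the first bottom return,
    i.e. samples that fall in the water column below the vehicle.
    """
    r = slant_range(two_way_time_s, c)
    if r < altitude_m:
        return None
    return math.sqrt(r * r - altitude_m ** 2)

t = 2 * 50.0 / SOUND_SPEED           # echo from a reflector 50 m away
print(slant_range(t))                # about 50 m
print(ground_range(t, 30.0))         # about 40 m if the vehicle is 30 m above the seabed
print(ground_range(0.01, 30.0))      # None: earlier than the first bottom return
```

Without the flat-seabed assumption, every point on a circle of radius equal to the slant range is a candidate reflector, which is why the bathymetry is needed to interpret the image.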
1.4 Problem Description
Substantial improvements can be made if the sidescan sonar readings could be used to aid navigation in real time rather than simply being logged, as is done now. This requires better machine interpretation of sonar, which is the main focus of this project. Traditional 3D reconstruction or interpretation methods are mostly based on acoustic ray tracing simulations and mathematical models [7][8]. These methods require accurate mathematical models and tend to generate idealized sonar images with little noise, because it is hard to simulate complex backscatter from non-uniform media and seabed materials.
The complex mathematical models can also lead to high computational complexity, which might make them too slow for real-time interpretation. A faster and more accurate alternative interpretation method is therefore needed. In this project, we mainly focus on translating bathymetry data into sidescan images, which can be applied in many different scenarios such as AUV simulators.
1.5 Research Question
Can machine learning based methods such as regression and generative adversarial nets (GANs) provide an effective translation between bathymetry data and sidescan sonar images? Can these methods generate realistic sidescan sonar images, or can they generate accurate sidescan sonar images that correspond to the bathymetry data? What are the advantages and disadvantages of these methods? These questions are discussed in this project.
1.6 Ethics, Sustainability and Society
Oceans hold massive resources, including not only fossil fuels, minerals, and renewable energy such as tides and ocean currents, but also ecological resources such as deepwater ecosystems. However, it is difficult for humans to visit deepwater areas, and human divers face great hazards during underwater operations. AUVs could be one of the solutions, but so far these vehicles are difficult to operate and not suited to complicated tasks.
If a better AUV positioning solution can be proposed, these vehicles would have a much wider range of applications, such as surveying underwater areas, maintaining underwater devices, and monitoring underwater plants. This can help replace humans in high-risk tasks. If AUVs become more commonly used, underwater surveying will also become much more efficient. This could increase profits and reduce costs in many different fields such as fishery, mining, and search and rescue.
1.7 Thesis Structure
In Chapter 2, some basic concepts of GANs are introduced, together with several network models. These structures are used in the later chapters.
In Chapter 3, implementation details are introduced, including the dataset, the network implementation, and the experiment setup.
In Chapter 4, the results of the experiments and the evaluation are shown. Some optimizations and inferences drawn during our experiments are also included in this part.
In Chapter 5, we briefly summarize our results and contributions and outline expected future work.
Chapter 2
Related Work
2.1 Generative and Discriminative Model
Generative and discriminative models are important components in machine learning and computer vision. Generative models are trained to represent the distribution that the sample data come from, while discriminative models are trained to represent the conditional probability of the target given the data directly. Mathematically, given the input x (observation) and the label y (target), a generative model tries to learn the joint probability p(x, y), while a discriminative model tries to learn the posterior p(y|x).
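The difference between the two quantities can be seen on a toy discrete example. The joint table below is hypothetical and chosen only for illustration: a generative model that knows p(x, y) can both sample new (x, y) pairs and derive the posterior, while a discriminative model only represents p(y|x).

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows are x in {0, 1, 2}, columns are y in {0, 1}.
p_xy = np.array([[0.30, 0.05],
                 [0.10, 0.15],
                 [0.05, 0.35]])

# What a discriminative model represents: the posterior p(y | x).
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y_given_x = p_xy / p_x                # each row sums to 1
predictions = p_y_given_x.argmax(axis=1)

# What only the generative model can do: draw a new (x, y) pair from p(x, y).
rng = np.random.default_rng(0)
flat_index = int(rng.choice(p_xy.size, p=p_xy.ravel()))
sample_x, sample_y = divmod(flat_index, p_xy.shape[1])

print(p_y_given_x)
print("most likely label for each x:", predictions)
print("sampled pair:", (sample_x, sample_y))
```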
In practice, the comparison in [9] indicates that generative models approach their asymptotic error with far fewer training samples, while discriminative models usually reach a lower asymptotic error [9] and are more commonly used for machine learning tasks such as classification and regression.
Images can be regarded as data samples from some distribution in a high-dimensional space, so generative models can be used to generate images by sampling from the learned distribution. They can be applied in many different fields, such as semi-supervised learning [10], text-to-image synthesis [11], style transfer and texture synthesis [12], image-to-image translation [13][14], and video generation [15].
2.2 Generative Adversarial Nets (GANs)
Before the concept of adversarial nets was proposed in 2014, deep learning had shown great potential for discriminative models, while deep generative models had made less progress because they were difficult to train [16]. The adversarial nets framework simulates an adversarial competition between two models: the generator and the discriminator [16].
The main idea of a GAN is the following. A real data point (or image) x can be considered as a sample from a probability distribution in a high-dimensional space. The task of the generator, which is a neural network, is to approximate this distribution so that it can generate a sample x′ that is likely to be real. We first fix the parameters of the generator and train the discriminator to determine whether a sample comes from the generator distribution or from the real data distribution. The discriminator then provides a gradient of the objective function. After that, we use this gradient to update the generator so that the generator distribution moves closer to the real distribution [16].
In the original GAN, the generator is fed a low-dimensional random noise vector and trained to output a high-dimensional vector (such as an image).
The discriminator is fed a high-dimensional vector and tries to determine whether this input comes from the model distribution or the data distribution, or in other words, whether it is a generated "fake" image or a real image [16].
The advantage of a GAN, compared to some other generative models such as the variational autoencoder (VAE), is that the generator never receives the real data as input. It is only trained to mimic the real distribution in order to "fool" the discriminator. It therefore has a good capacity to generate data that are not among the given data samples, whereas a VAE tends to reproduce combinations of the images it was trained on, which may lead to blurry output. Another advantage is that for a VAE, the generated images are usually compared with real images by mean squared error (MSE). Suppose two generated images of, for example, a handwritten character have the same MSE with respect to the original image, but one has a wider stroke while the other has noise points scattered over the image. From a human point of view, the first one is obviously better, but the network cannot tell, as both have the same loss, so it will not learn to generate more images like the first one.
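The MSE argument can be checked numerically on a toy example. The 8 × 8 images below are hypothetical: one variant widens the stroke by a column, the other scatters the same number of changed pixels as isolated noise, and both end up with exactly the same MSE.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images of equal shape."""
    return float(np.mean((a - b) ** 2))

# Reference image: a vertical stroke two pixels wide.
clean = np.zeros((8, 8))
clean[:, 3:5] = 1.0

# Variant A: the stroke is one column wider (8 pixels change by 1).
wider = clean.copy()
wider[:, 5] = 1.0

# Variant B: 8 isolated noise pixels away from the stroke (8 pixels change by 1).
noisy = clean.copy()
for row, col in [(0, 0), (1, 7), (2, 1), (3, 6), (4, 0), (5, 7), (6, 1), (7, 6)]:
    noisy[row, col] = 1.0

print(mse(clean, wider), mse(clean, noisy))  # identical values
```

A pixel-wise loss gives the network no reason to prefer the first variant, whereas a learned discriminator can in principle pick up on the difference.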
However, in practice, a GAN is difficult to optimize. In a traditional neural network, we can monitor the loss function to improve the performance. In a GAN, we instead need to keep the two competing networks well matched so that training approaches a Nash equilibrium [17]. When the discriminator fails, this does not guarantee that the generator has produced good images; the discriminator may simply be too weak to tell the difference. If the discriminator is too strong, the gradient vanishes and can no longer help to train the generator. To improve the performance and reduce the training difficulties, several techniques have been proposed in recent years [17]. Another issue of GANs is mode collapse. It is usually caused by low model capacity, so that the generator learns to map different input vectors to the same output [18]. This issue leads to low diversity in the generated images.
Mathematical Theories and Algorithms
Given a dataset X, we want to learn the generator's distribution $p_g$ over data $x$. An input prior $p_z(z)$ is defined on a noise distribution, and the generator can be expressed as a differentiable multilayer perceptron function $G(z; \theta_g)$ that outputs a sample $x$, where $\theta_g$ denotes the parameters of the function. The discriminator is defined by another function $D(x; \theta_d)$, which takes an input sample $x$ and outputs a scalar: the probability that $x$ comes from the real data distribution $p_{data}$ rather than the generator's distribution $p_g$. We train $D$ to maximize the accuracy of assigning data $x$ to the correct distribution. At the same time, $G$ is trained to minimize $\log(1 - D(G(z)))$.
The value function is defined as
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$
This minimax game has the global optimum $p_g = p_{data}$ [16]. In order to reach the optimal solution, we need to update the generator $G$ and the discriminator $D$ separately. That is, for a fixed $G$, we apply stochastic gradient ascent to $D$ with
$$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right] \tag{2.1}$$
for a given set of $m$ samples (the batch size). Then, for a fixed $D$, we apply stochastic gradient descent to $G$ with
$$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right). \tag{2.2}$$
These gradient-based updates are repeated iteratively until the optimum has been reached. The discriminator and the generator can be updated at different rates, for example, five D updates followed by one G update [16].
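As a small numerical illustration of the two minibatch objectives in Equations 2.1 and 2.2, the sketch below evaluates them directly on discriminator outputs. The output values are hypothetical, and no networks or gradients are involved; the point is only to show which direction each player pushes the objective. The value of -log 4 at the optimum is the known result for the original GAN value function when D outputs 0.5 everywhere.

```python
import numpy as np

def discriminator_objective(d_real, d_fake):
    """Minibatch estimate inside Eq. 2.1; D is updated to increase it."""
    return float(np.mean(np.log(d_real) + np.log(1.0 - d_fake)))

def generator_objective(d_fake):
    """Minibatch estimate inside Eq. 2.2; G is updated to decrease it."""
    return float(np.mean(np.log(1.0 - d_fake)))

# Hypothetical outputs of a confident discriminator on real and generated samples.
d_real = np.array([0.90, 0.80, 0.95])
d_fake = np.array([0.10, 0.20, 0.05])

# At the optimum p_g = p_data the discriminator cannot do better than 0.5.
half = np.full(3, 0.5)

print(discriminator_objective(d_real, d_fake))  # close to 0: D separates the samples well
print(discriminator_objective(half, half))      # -log(4), the value at the optimum
print(generator_objective(d_fake))              # G wants to push this value down
```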
2.3 Deep Convolutional GAN (DCGAN)
Deep Convolutional GAN (DCGAN) is a kind of GAN with a deep convolutional structure [10]. It follows these architecture guidelines [10]:
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions/deconvolutions (generator).
• Use batch normalization in both the generator and the discriminator.
• Remove fully connected hidden layers for deeper architectures.
• Use ReLU activation in the generator for all layers except the output, which uses Tanh.
• Use LeakyReLU activation in the discriminator for all layers.
DCGAN is largely an empirical improvement; the authors do not give a detailed mathematical derivation. According to their results, a DCGAN can be trained more stably and can learn good image representations for supervised learning and generative modeling [10].
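The first guideline can be illustrated with the standard output-size formulas for convolutions and transposed convolutions. The sketch below is plain arithmetic, and the kernel size 4, stride 2, and padding 1 are an assumed example configuration rather than the setting of any particular network: with these values a strided convolution halves the spatial size and a fractional-strided convolution doubles it, which is what the pooling and upsampling layers would otherwise do.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution without output padding."""
    return (size - 1) * stride - 2 * padding + kernel

# Discriminator side: each strided convolution halves the resolution.
sizes_down = [64]
for _ in range(4):
    sizes_down.append(conv_out(sizes_down[-1], kernel=4, stride=2, padding=1))

# Generator side: each fractional-strided convolution doubles it again.
sizes_up = [4]
for _ in range(4):
    sizes_up.append(deconv_out(sizes_up[-1], kernel=4, stride=2, padding=1))

print(sizes_down)
print(sizes_up)
```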
2.4 Wasserstein GAN (WGAN)
Wasserstein GAN, or WGAN, was proposed to improve the training of GANs [19].
For the original GAN, the difference between the generator distribution and the real data distribution is measured by the Jensen-Shannon divergence (JS divergence) [16]. The goal is to bring the two distributions as close together as possible, that is, $JS(P_r, P_g) \to 0$ [19]. The issue is that in practice, both distributions are low-dimensional manifolds in a high-dimensional space. In most cases, there is little overlap between the two distributions when training starts, which means the JS divergence stays at the constant value log 2 as training goes on. A constant provides no usable gradient, which makes it difficult to update the network.
The Earth-Mover (EM) distance, or Wasserstein-1 distance, is introduced to address this. The EM distance can be explained informally as the minimum cost of moving "earth" from one "mound" (distribution) to another, where the cost is the amount of mass moved times the distance it is moved. It provides a more continuous and usable gradient than the JS divergence [19].
This makes the EM distance a better choice for the loss function.
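The difference between the two measures can be reproduced on a toy one-dimensional example. In the sketch below, both distributions are point masses on a unit-spaced grid (a hypothetical setup chosen so that their supports never overlap): the JS divergence stays at log 2 however far apart they are, while the EM distance grows with the separation and therefore tells the generator in which direction to move.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a = 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1(p, q):
    """EM distance between histograms on a unit-spaced 1D grid (sum of |CDF differences|)."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

def point_mass(position, size=20):
    h = np.zeros(size)
    h[position] = 1.0
    return h

p = point_mass(2)
results = []
for shift in (3, 8, 15):
    q = point_mass(2 + shift)
    results.append((shift, js_divergence(p, q), wasserstein_1(p, q)))
    print(f"shift {shift:2d}: JS = {results[-1][1]:.4f}, EM = {results[-1][2]:.1f}")
```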
In its dual form (Equation 2.3), the EM distance requires the discriminator function to be Lipschitz continuous. In [19], weight clipping is introduced as one simple approach to enforcing the Lipschitz constraint. By simply clamping the weights of the discriminator within a fixed range, every function it can represent is K-Lipschitz for some K [19]. However, according to [19], this is a terrible way to satisfy the Lipschitz constraint. If the clipping parameter is too large, the discriminator takes a very long time to reach the optimum, and gradients may explode, while if the parameter is too small, gradients can easily vanish in a deep network [19][20]. Choosing a suitable clipping parameter is therefore a problem in itself.
Another approach, called gradient penalty (WGAN-GP), is introduced in [20]. This technique adds a penalty term to the loss function that pushes the gradient norm of the discriminator function toward 1. Samples $\hat{x}$ are first drawn by interpolating between samples from $P_r$ and $P_g$; their distribution is called $P_{\hat{x}}$. The penalty term shown in Equation 2.6 is then computed on these samples. With this method, the training of the GAN becomes more stable.
Mathematical Theories and Algorithms
The EM distance between two distributions $P_r$ and $P_g$ is defined as [19]
$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right],$$
where $\gamma(x, y)$ is the amount of "mass" that must be moved from $x$ to $y$ in order to transport distribution $P_r$ to $P_g$, and $\Pi(P_r, P_g)$ is the set of all possible strategies for transporting distribution $P_r$ to $P_g$.
This distance is difficult to compute, so in [19] it is converted to the form
$$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)], \tag{2.3}$$
by the Kantorovich-Rubinstein duality [21], where $\|f\|_L \le 1$ denotes that $f$ is a 1-Lipschitz function.
It is quite simple to turn the loss of the original GAN into that of a WGAN. The discriminator loss in Equation 2.1 is modified to
$$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ f_{\theta_d}\left(x^{(i)}\right) - f_{\theta_d}\left(G\left(z^{(i)}\right)\right) \right], \tag{2.4}$$
where $f_{\theta_d}$ is the discriminator function with the 1-Lipschitz constraint. The generator loss in Equation 2.2 is modified to
$$-\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} f_{\theta_d}\left(G\left(z^{(i)}\right)\right). \tag{2.5}$$
The gradient penalty term is expressed as
$$\mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]. \tag{2.6}$$
This penalty term is added to the discriminator loss (Equation 2.4) with a penalty weight λ.
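To make the penalty term concrete, the sketch below evaluates the WGAN-GP discriminator objective for a hypothetical linear critic f(x) = x · w, chosen because its gradient with respect to x is simply w and can be written by hand. The sign convention (the negative of the bracket in Equation 2.4, so that the whole expression is minimized) and the weight λ = 10 are assumptions for this illustration; a real implementation would obtain the gradient by automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x, w):
    """Hypothetical linear critic f(x) = x . w."""
    return x @ w

def critic_grad(x, w):
    """Gradient of the linear critic w.r.t. its input: w for every sample."""
    return np.broadcast_to(w, x.shape)

def wgan_gp_critic_loss(real, fake, w, lam=10.0):
    # Negative of the bracket in Eq. 2.4, so that smaller is better for the critic.
    wasserstein_term = np.mean(critic(fake, w)) - np.mean(critic(real, w))
    # Sample x_hat on straight lines between real and generated samples.
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    # Penalty of Eq. 2.6: squared deviation of the gradient norm from 1.
    grad_norm = np.linalg.norm(critic_grad(x_hat, w), axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)
    return float(wasserstein_term + lam * penalty), float(penalty)

real = rng.normal(loc=1.0, size=(4, 2))
fake = rng.normal(loc=-1.0, size=(4, 2))

loss_unit, penalty_unit = wgan_gp_critic_loss(real, fake, np.array([0.6, 0.8]))
loss_steep, penalty_steep = wgan_gp_critic_loss(real, fake, np.array([1.2, 1.6]))
print(penalty_unit)   # about 0: gradient norm is 1
print(penalty_steep)  # about 1: gradient norm is 2
```

The penalty is two-sided: a critic whose gradient norm is smaller than 1 is penalized as well, which matches the squared term in Equation 2.6.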
2.5 Conditional GAN (cGAN)
Conditional GAN (cGAN) is a method for supervised learning. The generator is fed a random noise vector z and a condition c and generates an output image G(z, c). The discriminator is trained not only to tell the difference between real and generated images but also to judge whether the image-condition pairs match. There are two possible approaches. One is to design two discriminators: $D_1(x)$ for judging whether the images are real or fake, and $D_2(x, c)$ for judging whether the pairs match, i.e. p(x|c). The other approach, which is more commonly used, is to design a single discriminator D(x, c) for both tasks [14, 22].
The condition c of a cGAN can be any vector, including an image. A cGAN can therefore also be used for image-to-image translation tasks such as style transfer and image segmentation [14]. Suppose we have image pairs from two different domains (styles), called X and Y. The GAN learns a mapping from an input image x and a random noise vector z to an image y, as G : {x, z} → y. In [14], a single discriminator is designed to judge the image-condition pairs, taking input images from both domains X and Y.
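One common way to present an image-condition pair to a single discriminator is to stack the two images along the channel axis. The sketch below shows only this input construction; the array shapes, the channel-first layout, and the function name are assumptions made for illustration, and the arrays are placeholders rather than real data.

```python
import numpy as np

def discriminator_input(condition, image):
    """Stack a condition image and a target image along the channel axis (C, H, W)."""
    if condition.shape[1:] != image.shape[1:]:
        raise ValueError("condition and image must have the same height and width")
    return np.concatenate([condition, image], axis=0)

x = np.zeros((1, 256, 256))            # condition image from domain X
y_real = np.ones((1, 256, 256))        # matching real image from domain Y
y_fake = np.full((1, 256, 256), 0.5)   # stand-in for a generated image G(x, z)

real_pair = discriminator_input(x, y_real)   # should be judged as a real, matching pair
fake_pair = discriminator_input(x, y_fake)   # should be judged as fake
print(real_pair.shape, fake_pair.shape)
```

Because the condition is part of the input, the discriminator can reject a realistic image that does not match its condition, not only an unrealistic one.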
2.6 Encoder-decoder
The encoder-decoder network structure is an important generative model. It can be used in different areas, such as machine translation [23] and image segmentation [24]. It contains two models: a recognition model, known as the "encoder", which generates the latent vector z from input data x in a high-dimensional space X, and a generative model, known as the "decoder", which generates the target y in another high-dimensional space Y from z [25]. The main idea is that for a high-dimensional input x, the encoder tries to produce a highly compressed code z, called the latent vector. The code is expected to contain the most important information in x. The decoder then tries to map these features back to another high-dimensional output y. The assumption is that x and y share the same features in the code z.
The encoder-decoder structure can be combined with adversarial training, as the GAN provides a powerful learning strategy for generative models of complex distributions [26]. Several networks use such an encoder-decoder structure, for example BiGAN [26] and SegNet [24].
However, encoder-decoders and encoder-decoder GAN architectures have some limitations. According to [27], the generators are expected to learn meaningful features from the encoder, but in practice the learning objective alone cannot guarantee this. Another problem is mode collapse [27]: the generators may fail to provide diverse generated results.
2.7 U-net and Skip Connection
The U-net architecture was first proposed in 2015 for image segmentation [28]. Traditional deep convolutional networks are usually designed to have a single output for classification tasks. These networks can hardly handle image segmentation, as it is time-consuming to predict the labels pixel by pixel, and there is a trade-off between localization accuracy and the use of context: a large patch window may lead to low local accuracy, while a small patch cannot make full use of the context [28].
The structure of the U-net is shown in Figure 2.1. It is an encoder-decoder structure, but with long skip connections. The U-net consists of a "contracting" path and an "expansive" path [28], which correspond to the encoder and the decoder in an encoder-decoder structure. The contracting path uses a max pooling layer between stacks for downsampling, and the expansive path uses deconvolution layers (up-convolutions) for upsampling [28]. The output of each block in the contracting path is concatenated to the input of the corresponding block in the expansive path (grey arrows in Figure 2.1). This concatenation is also called a "skip connection". The skip connections help the network converge much faster and reach a better local optimum [29]. According to [29], the problem with the plain encoder-decoder structure is that the decoder cannot be guaranteed to recover small details with high-frequency information from a low-dimensional latent space. This becomes a bottleneck of such structures, especially when the network is deep [29]. The skip connections pass detailed image information directly from the contracting path to the expansive path, which helps to solve the problem of losing details.
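The effect of a skip connection can be sketched with plain array operations. In the example below, a 2 × 2 max pooling stands in for the contracting path and nearest-neighbour upsampling stands in for the learned up-convolution (both are simplifications, and the feature map is random placeholder data): the upsampled features have lost detail, while the concatenated tensor still carries the original features unchanged.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on a (C, H, W) array with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample_2x(x):
    """Nearest-neighbour upsampling, a stand-in for a learned up-convolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 16, 16))   # output of a block in the contracting path

bottleneck = max_pool_2x2(features)       # (4, 8, 8): fine detail is discarded here
decoded = upsample_2x(bottleneck)         # (4, 16, 16): same size, but blocky

# Skip connection: concatenate the contracting-path features along the channel axis.
merged = np.concatenate([features, decoded], axis=0)   # (8, 16, 16)
print(bottleneck.shape, decoded.shape, merged.shape)
```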
Figure 2.1: U-net architecture. Left part is the "contracting" path and right part is the "expansive" path. Image from [28]
Figure 2.2: Convolutional encoder-decoder networks with symmetric skip connections. Image from [29]
2.8 Residual Networks (ResNet)
Theoretically, deeper networks can fit more complex distributions, but they are also harder to train. In practice, as the depth of a network increases, the accuracy begins to saturate and then drops rapidly. This is called "degradation" in [30]. To solve this problem, a deep residual learning framework is proposed. Instead of learning the desired mapping $H(x)$ directly, another mapping $F(x) := H(x) - x$ is introduced [30]. It is easier to fit the new residual mapping $F(x)$ than $H(x)$ [30].
To implement this framework, a shortcut connection is added between different layers to build the residual block shown in Figure 2.3. In experiments, a network that contains multiple residual blocks performs much better than a plain network with the same number of layers but without shortcut connections [30].
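A minimal sketch of a residual block, assuming a residual branch F made of two small dense layers instead of the convolution layers used in practice, and omitting the activation that is usually applied after the addition. It shows why the formulation helps: when the weights of F are zero, the block reduces exactly to the identity, so adding blocks does not have to make the mapping worse.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """H(x) = F(x) + x, where F is a two-layer branch and "+ x" is the shortcut."""
    f = relu(x @ w1) @ w2
    return f + x

x = np.array([[1.0, -2.0, 3.0]])

# With all-zero weights the residual branch vanishes and the block is the identity.
zeros = np.zeros((3, 3))
identity_out = residual_block(x, zeros, zeros)

# With small random weights the output is a small correction on top of x.
rng = np.random.default_rng(0)
w1 = 0.01 * rng.normal(size=(3, 3))
w2 = 0.01 * rng.normal(size=(3, 3))
perturbed_out = residual_block(x, w1, w2)

print(identity_out)
print(perturbed_out - x)   # the learned residual F(x)
```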
Figure 2.3: A residual building block. Image from [30]
The residual framework can also be applied to GANs by replacing some of the layers with residual blocks. In [31], the ResNet structure is used in the generator, as shown in Figure 2.4. All the pooling layers are removed and replaced with strided and fractional-strided convolution layers. The input images first pass through some convolution layers for downsampling, then through several residual blocks, and finally through the upsampling deconvolution layers.
Figure 2.4: The ResNet structure used in [31]. The input images are first passed to encoder blocks for downsampling, then to a series of residual blocks, and finally to decoder blocks for upsampling.
Chapter 3 Methods
3.1 Dataset
The first step in training a network is to get a good dataset. In this project, the data is provided by the Swedish Maritime Robotics Centre (SMaRC). SMaRC has collected both sidescan and multibeam sonar data in pairs. These data were collected by AUVs simultaneously surveying the same region of seabed along tracks with different orientations. For each single track, the AUV surveys along a nearly straight line, sending sonar signals to both the left and right sides of the vehicle. The vehicle gets one row of data from each sonar "ping" (a pulse signal) as it travels along the seabed (Figure 3.1). These rows can be assembled into a so-called "waterfall" image (Figure 3.3).
With the help of auvlib [32], we can easily get these waterfall images, and from the multibeam sonar data, we can calculate the 3D geometric information of the seabed bathymetry with the built-in APIs in auvlib.
For the sidescan sonar (SSS) image, the tracking path of the AUV runs vertically through the middle of the image. The vertical axis represents the surveying distance along the tracking path, and the horizontal axis represents the time the sonar signal needs to travel through the water, hit an obstacle (the seabed), reflect back through the water, and finally reach the sonar receiver. This corresponds to the distance from the seabed to the AUV. Each pixel represents the intensity of the received signal, which in the dataset varies from 0 to roughly 3. A bright pixel means that the signal hits the surface close to normal incidence, that the material has high reflectivity, or both. Similarly, a dark pixel means that the signal hits the surface at a shallow angle or that the material has low reflectivity. The dark area in the middle is due to the dead zone of the sonar: it takes some time for the sonar signal to reach the nearest obstacle, so the receiver cannot get any signal within this time.
Figure 3.1: AUV gets one row of data with the help of sonar.
Figure 3.2: Mechanism of sidescan sonar. This is an example of the right-hand side of the AUV. The Intensity vs Time graph represents one row in the waterfall image. The intensity represents the brightness of each pixel. The blue arrow marks the highlight, where the sonar ping hits the surface close to normal incidence. The red arrow marks the shadow: the sonar signal cannot reach the back of the rock, so there is no reflection between $d_3$ and $d_4$.
For the bathymetry image, the coordinates are the same as in the SSS image. Each pixel is matched with the same pixel in the SSS image and represents the depth of the seabed, which in the dataset has a value of roughly -20. A black pixel (with a value of 0) means that there is no data at that location.
We have a total of 57 waterfall image pairs.
(a) sidescan sonar image (b) seabed bathymetry image
Figure 3.3: Example of a waterfall image pair. The colors in the bathymetry image are only for visualization. In the dataset for network training, they are grey-scale images.
Crop images into patches
In order to feed these paired data into the network, we need to split the waterfall images into small patches. The main idea is to crop the waterfall images with a square window sliding along each side. We crop the patches with overlapping areas so that we get more data images. To keep adjacent patches from being too similar, we flip them up and down. For the right side of the waterfall, we crop the patches in the same way and then flip all the right-side patches left to right, so that all the patches look as if they came from the left side of a waterfall image. This flipping works because the sonar system collects the rows of data from the left and the right of the AUV separately.
If the AUV surveys the same area forward and backward, we get two different waterfall images that are mirror images of each other in both the vertical and horizontal directions. These data are therefore not sensitive to the flipping operation (Figure 3.4).
We use the same cropping method on both the SSS and the bathymetry waterfall images. The patches are then resized to 256 × 256 pixels and grouped in pairs (Figure 3.5).
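The cropping procedure can be sketched as follows. This is an illustrative reading of the description above, not the code used in the project: the window is assumed to span half the image width, every second patch is flipped up and down, the stride is a free parameter, and a nearest-neighbour resize stands in for whatever interpolation was actually used.

```python
import numpy as np

def crop_side(side, window, stride):
    """Slide a square window down one side of a waterfall image."""
    patches = []
    for k, top in enumerate(range(0, side.shape[0] - window + 1, stride)):
        patch = side[top:top + window]
        if k % 2 == 1:
            patch = np.flipud(patch)   # flip every second patch up-down (assumed reading)
        patches.append(patch)
    return patches

def crop_waterfall(waterfall, stride):
    """Crop both sides; right-side patches are mirrored to look like left-side ones."""
    half = waterfall.shape[1] // 2
    left = waterfall[:, :half]
    right = np.fliplr(waterfall[:, half:2 * half])
    return crop_side(left, half, stride) + crop_side(right, half, stride)

def resize_nearest(patch, size=256):
    """Nearest-neighbour resize to size x size (stand-in for a proper resampler)."""
    rows = np.arange(size) * patch.shape[0] // size
    cols = np.arange(size) * patch.shape[1] // size
    return patch[np.ix_(rows, cols)]

# Tiny placeholder "waterfall": 12 pings, 8 samples per ping.
waterfall = np.arange(96).reshape(12, 8)
patches = crop_waterfall(waterfall, stride=2)   # stride < window gives overlapping patches
resized = [resize_nearest(p, size=8) for p in patches]
print(len(patches), patches[0].shape, resized[0].shape)
```

Applying the same function with the same stride to the SSS waterfall and to the bathymetry waterfall keeps the two patch lists aligned, so they can be paired by index.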
Separate train set and test/validation set
After getting the image pairs, we need to split them into 2 sets: the train
set and the test/validation set. To ensure that there are not too many similar images
in the train and test sets, we select some of the waterfall images and mark
all patches cropped from these waterfall images as the test set. Since each
waterfall image comes from an individual survey track of the AUV, there is
little dependency between the two sets.
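A split at the level of waterfall images, rather than individual patches, might look like the sketch below. The function name, the random selection, and the 20% test fraction are assumptions for illustration; the thesis only states that some waterfall images were selected for the test set.

```python
import random

def split_by_waterfall(patch_sources, test_fraction=0.2, seed=0):
    """Split patch indices so that no waterfall image feeds both sets.

    patch_sources: one waterfall id per patch pair.
    Returns (train_indices, test_indices).
    """
    ids = sorted(set(patch_sources))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])   # hypothetical selection rule: random draw
    train = [i for i, s in enumerate(patch_sources) if s not in test_ids]
    test = [i for i, s in enumerate(patch_sources) if s in test_ids]
    return train, test
```

Because whole waterfall images are held out, overlapping patches from the same track never end up on both sides of the split.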
Figure 3.4: Cropping small patches from a waterfall image. Red boxes are the sliding cropping windows. Patches within the green frame are those extracted for the dataset.
Improve the dataset
During training, we encountered two issues with the dataset. First, the bathymetry
data is strongly biased: most values vary within ± 1 around an average depth
of roughly -20, which makes the data difficult to normalize. Second, the train
set is too small. The network sometimes starts to overfit the train set
Figure 3.5: Example of paired patches
before it converges to an acceptable result.
To solve the first problem, the incident angle of the sonar wavefront is used in place of the bathymetry data. Once the incident angle at each pixel is known, the data is easily normalized to the range 0 ∼ 1 by computing the cosine of the angle (Figure 3.6). This computation is also provided by auvlib [32].
This method works because the brightness of a pixel in the SSS image is positively correlated with the cosine of the incident angle. According to the Lambertian reflection model, the reflection intensity I at point p can be expressed as
$$I(p) = K\,\Phi(p)\,R(p)\,|\cos(\theta(p))|, \tag{3.1}$$
where θ is the incident angle of the wavefront, Φ is the intensity of the illuminating sound wave, R is the reflectivity of the seabed, and K is a normalization constant [7]. Near-normal incidence (a large cosine value) leads to a high reflection intensity of the sonar signal and therefore to a bright pixel in the SSS image. Each new pair consists of an SSS patch and the corresponding incident angle patch.
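To make Equation 3.1 concrete, the sketch below evaluates |cos θ| along a single across-track seabed profile and plugs it into the Lambertian model. It is a simplified 2D illustration under stated assumptions, not the auvlib computation: the sonar is a point at (`sonar_x`, `sonar_z`), the surface normal is estimated from finite differences of the profile, and Φ, R and K default to 1. All function and parameter names are hypothetical.

```python
import numpy as np

def cos_incidence_profile(x, z, sonar_x=0.0, sonar_z=0.0):
    """|cos(theta)| between the sonar ray and the seabed normal along one profile."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    dzdx = np.gradient(z, x)                        # local seabed slope
    normals = np.stack([-dzdx, np.ones_like(dzdx)], axis=1)
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    rays = np.stack([x - sonar_x, z - sonar_z], axis=1)
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    return np.abs(np.sum(normals * rays, axis=1))   # values lie in [0, 1]

def lambertian_intensity(cos_theta, phi=1.0, reflectivity=1.0, k=1.0):
    """Equation 3.1 with constant illumination and reflectivity (assumption)."""
    return k * phi * reflectivity * np.abs(cos_theta)
```

On a flat seabed 20 units below the sonar, the point directly underneath gives a cosine of 1, and the value decreases as the ray becomes more grazing, which matches the darkening toward far range.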
To solve the second problem, the height of the patches is reduced to one half
(256 × 256 to 128 × 256), and these patches are then resized back to 256 × 256,
which doubles the size of the dataset. It also means that fewer features
appear in each image patch, so the GAN may be able
to learn the features more easily (Figure 3.7).
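The halving step could be sketched as below. The function name is hypothetical, and the nearest-neighbour stretch (repeating each row twice) is an assumption, since the interpolation method used for resizing is not stated.

```python
import numpy as np

def split_and_stretch(patch):
    """Split a square patch into top and bottom halves, each stretched back to full height."""
    h = patch.shape[0] // 2
    halves = (patch[:h], patch[h:2 * h])
    # Assumption: nearest-neighbour stretch by repeating every row twice.
    return [np.repeat(half, 2, axis=0) for half in halves]
```

Each 256 × 256 input yields two 256 × 256 outputs, which is where the doubling of the dataset comes from.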
(a) sidescan sonar image (b) cosine of incident angle image
Figure 3.6: New example of waterfall image pair
3.2 Network implementation
3.2.1 Environment Setup
This project is implemented in Python 3. The main libraries are TensorFlow
(version r1.10) for building the network, OpenCV for image visualization, and
auvlib [32] for data extraction. TensorFlow is an open source platform for
machine learning. It provides easy-to-use APIs and functions for building,
training, and optimizing networks. The GPU resources for training the
network are accessed on the GPU cluster server provided by the CSC
department of KTH.
Figure 3.7: New example of paired patches
3.2.2 Network architectures
Conditional GANs are used in this project to learn the image translation from one domain X (incident angle) to another domain Y (SSS image). The generator G is fed with images from domain X and noise z, and generates the corresponding images in domain Y: G : {x, z} → y. The discriminator D is fed with image pairs (x, y) from both domains. The image x is the generator input, and the image y is either a real SSS image (ground truth) or a generated image. The discriminator returns a judgment of:
(1) whether the image y comes from the real distribution $p_{data}$ rather than the generator distribution $p_g$;
(2) whether the image y corresponds to the input image x rather than being a "randomly generated" realistic image.
Generator
U-net: The first trial is based on the U-net introduced in [14]. It is an
8-layer encoder-decoder with symmetric skip connections (we call this model
U-net8), shown in Figure 3.8. The input image (256 × 256 × 1) first passes
through Encoder 1, a K4S2F32 convolution layer (kernelSize=4, stride=2,
numberOfFilter=32), which downsamples it to a 128 × 128 × 32 matrix. Each of the
remaining 7 encoders contains a leaky ReLU activation layer, a convolution
layer, and a batch normalization layer. The size of the matrix becomes
half after passing each encoder, and the number of channels is doubled until it reaches 256, where it stays. After the last encoder, we get a 1 × 1 × 256 latent vector (code). This vector is then passed to a series of decoders. Each decoder contains a ReLU activation layer, a deconvolution layer, and a batch normalization layer. The output of each decoder is concatenated with the corresponding encoder output along the channel dimension, so the number of channels is doubled. This is the skip connection. In the last decoder, the 128 × 128 × 64 input is first passed to a ReLU activation layer, then to a K4S2F1 deconvolution layer, and finally to a Tanh activation layer to get the final output.
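The tensor shapes described above can be traced with a few lines of Python. This is only a bookkeeping sketch: the function name is hypothetical, and it assumes that each decoder outputs as many channels as its mirrored encoder, which is consistent with the 128 × 128 × 64 input of the last decoder but is not spelled out for every layer.

```python
def unet_shapes(size=256, first_filters=32, max_filters=256, depth=8):
    """Return (encoder output shapes, decoder input shapes, final output shape)."""
    encoders = []
    s, f = size, first_filters
    for _ in range(depth):
        s //= 2                           # stride-2 convolution halves the spatial size
        encoders.append((s, s, f))
        f = min(2 * f, max_filters)       # channels double until they reach the cap
    decoder_inputs = [encoders[-1]]       # the first decoder receives the latent code
    for i in range(depth - 2, -1, -1):
        h, w, c = encoders[i]
        # Assumed: decoder output has c channels, concatenated with the c-channel skip.
        decoder_inputs.append((h, w, 2 * c))
    return encoders, decoder_inputs, (size, size, 1)
```

With the defaults, the last encoder output is (1, 1, 256) and the last decoder input is (128, 128, 64), matching the description. Calling `unet_shapes(depth=7)` shows that a 7-layer variant stops at a 2 × 2 × 256 code.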
ResNet: To improve the performance, we also tried a more complex network structure. We applied the ResNet to the GAN as described in [31], shown in Figure 2.4. It contains 2 encoder blocks, 9 residual blocks, and 2 decoder blocks. The input image is first downsampled to a size of 128 × 128 × 64 by the 2 encoders. It then passes through the 9 residual blocks, all of which have the same input and output size. Finally, the matrix is passed to the 2 decoders and upsampled to 256 × 256 × 1.
Discriminator
PatchGAN: We use the basic idea of the PatchGAN discriminator structure from [14]. It is a deep convolutional network without any pooling layers. The input-domain and target-domain images are first concatenated along the channel axis to a size of 256 × 256 × 2. The result is then passed through a series of convolution layers and leaky ReLU activation layers. To apply the WGAN-GP technique, we removed the batch normalization layers, because the gradient penalty term is applied to each input individually rather than to the whole batch [20]. The discriminator outputs a 30 × 30 matrix, in which each value represents the prediction for a 70 × 70 patch of the original image [14].
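The 30 × 30 output and the 70 × 70 patch size can be checked with simple arithmetic. The layer list below (three stride-2 and two stride-1 convolutions, all with kernel size 4 and padding 1) is the commonly used 70 × 70 PatchGAN layout and is an assumption here, since the exact layer configuration is not listed in the text.

```python
# (kernel, stride) per convolution; assumed layout of a 70x70 PatchGAN
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

def receptive_field(layers):
    """Input patch size seen by one output value, computed from the last layer backwards."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

def output_size(size, layers, pad=1):
    """Spatial size after the convolutions, assuming `pad` pixels of padding per side."""
    for kernel, stride in layers:
        size = (size + 2 * pad - kernel) // stride + 1
    return size
```

Under these assumptions, `receptive_field(PATCHGAN_LAYERS)` gives 70 and `output_size(256, PATCHGAN_LAYERS)` gives 30.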
3.2.3 Training Method
We follow the training method introduced in [14]. In each iteration, we update
the discriminator D once and then the generator G once. The training batch
size is set to 20. We use the Adam optimizer with a learning rate of 0.0002 and
moment parameters $\beta_1 = 0.5$, $\beta_2 = 0.9$. In total, 150,000 update
steps are taken to ensure that the network converges.
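For reference, a single Adam update with these hyperparameters looks as follows in plain NumPy. This is the textbook update rule written out for illustration, not the project code, which would use the optimizer built into TensorFlow; the function name and the `eps` value are assumptions.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=2e-4, beta1=0.5, beta2=0.9, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns the new (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

In the alternating scheme described above, one such update would be applied to the discriminator parameters and then one to the generator parameters in every iteration, each with its own `m` and `v` state.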
Figure 3.8: Generator network structure: U-net8
3.2.4 Objective function
The objective function of a conditional GAN is:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))], \tag{3.2}$$
where the discriminator D tries to maximize the objective function while the
generator G tries to minimize the objective against D, which can be written
Figure 3.9: Generator network structure: ResNet9
Figure 3.10: Discriminator network structure: PatchGAN
as:
$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D). \tag{3.3}$$
Apart from the adversarial loss defined in Equation 3.2, some other
approaches have also shown that combining the GAN objective with some other
traditional reconstruction losses can be helpful [33]. So weighted L1 and L2 losses are added to the objective function to help the GAN training:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\|y - G(x, z)\|_1], \tag{3.4}$$
$$\mathcal{L}_{L2}(G) = \mathbb{E}_{x,y,z}[\|y - G(x, z)\|_2], \tag{3.5}$$
$$G^* = \arg\min_G \max_D \left(\mathcal{L}_{cGAN}(G, D) + \lambda_1 \mathcal{L}_{L1}(G) + \lambda_2 \mathcal{L}_{L2}(G)\right), \tag{3.6}$$
and we can also add a weight parameter $\lambda_{GAN}$ to the GAN loss so that we can conduct more experiments on the parameters.
To train the generator G and the discriminator D separately, Equation 3.2 can be written as:
$$\mathcal{L}_D = -\left(\mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]\right) \tag{3.7}$$
$$\mathcal{L}_G = -\mathbb{E}_{x,z}[\log D(x, G(x, z))] \tag{3.8}$$
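As a rough numerical illustration of Equations 3.6 to 3.8, the losses can be evaluated on batches of discriminator outputs and images as below. The function names are hypothetical, expectations are replaced by batch means, the L1 term is averaged per pixel, and the L2 term is implemented as a mean squared error. These are assumptions about details that the equations leave open.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Equation 3.7: d_real = D(x, y), d_fake = D(x, G(x, z)), both in (0, 1)."""
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, y, g_out, lam_gan=1.0, lam_l1=10.0, lam_l2=0.0, eps=1e-12):
    """Equation 3.8 weighted by lam_gan, plus the weighted L1 and L2 terms."""
    adversarial = -np.mean(np.log(d_fake + eps))
    l1 = np.mean(np.abs(y - g_out))
    l2 = np.mean((y - g_out) ** 2)   # assumption: mean squared error
    return lam_gan * adversarial + lam_l1 * l1 + lam_l2 * l2
```

With a very small `lam_gan` the generator loss is dominated by the reconstruction terms, so training approaches plain regression.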
To make the training procedure more stable, the Wasserstein GAN loss with a gradient penalty term is introduced into the loss function. Equations 3.7 and 3.8 are updated to:
$$\mathcal{L}_D = \mathbb{E}_{x,z}[D(x, G(x, z))] - \mathbb{E}_{x,y}[D(x, y)] \tag{3.9}$$
$$\mathcal{L}_G = -\mathbb{E}_{x,z}[D(x, G(x, z))] \tag{3.10}$$
To add the gradient penalty term to the discriminator loss $\mathcal{L}_D$, we need to complete the following steps:
1. Sample interpolates within the region between generated and real images,
2. Compute the gradient of the discriminator at the interpolates,
3. Get the penalty term from the gradient.
The gradient penalty term can be written as:
$$GP = \mathbb{E}_{x,\hat{y}}\left[\left(\|\nabla_{\hat{y}} D(x, \hat{y})\|_2 - 1\right)^2\right], \tag{3.11}$$
where $\hat{y}$ denotes the interpolates. The gradient penalty can then be added to the discriminator loss in Equation 3.9 as:
$$\mathcal{L}_D = \mathbb{E}_{x,z}[D(x, G(x, z))] - \mathbb{E}_{x,y}[D(x, y)] + GP. \tag{3.12}$$
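The three steps of the gradient penalty can be sketched in NumPy as follows. This is an illustration only: in a real implementation the gradient would come from automatic differentiation of the discriminator, whereas here a hypothetical callable `grad_fn(x, y_hat)` supplies it, and the toy linear critic exists purely to make the example self-contained.

```python
import numpy as np

def gradient_penalty(grad_fn, x, y_real, y_fake, rng):
    """Equation 3.11 estimated on one batch of real and generated images."""
    # Step 1: sample interpolates between real and generated images.
    shape = (y_real.shape[0],) + (1,) * (y_real.ndim - 1)
    eps = rng.uniform(size=shape)
    y_hat = eps * y_real + (1.0 - eps) * y_fake
    # Step 2: gradient of the discriminator with respect to the interpolates.
    grads = np.asarray(grad_fn(x, y_hat))
    # Step 3: penalise the deviation of each sample's gradient norm from 1.
    norms = np.sqrt(np.sum(grads.reshape(grads.shape[0], -1) ** 2, axis=1))
    return np.mean((norms - 1.0) ** 2)

def linear_critic_grad(w):
    """Toy critic D(x, y) = sum(w * y): its gradient with respect to y is w everywhere."""
    def grad_fn(x, y_hat):
        return np.broadcast_to(w, y_hat.shape)
    return grad_fn
```

For the toy critic, a weight map with unit L2 norm gives a penalty of zero, while doubling the weights gives a penalty of 1.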
To introduce the noise z into the generator G, we follow [14] and apply
dropout to the first several layers, because noise that is fed directly
into the generator is likely to be ignored [34].
3.3 Experiment
First, different weight parameters for the GAN loss ($\lambda_{GAN}$), the L1 loss ($\lambda_1$), and the L2 loss ($\lambda_2$) are tested. The default values are $\lambda_{GAN} = 1$, $\lambda_1 = 10$, $\lambda_2 = 0$, just like the setup in [14]. If $\lambda_{GAN}$ is set to a very small value compared to the two reconstruction loss weights, training behaves more like ordinary regression than like a GAN. The results of these 2 types of machine learning approaches can then be compared.
Second, different filter numbers are tested. In the default setup, the first encoder has a 32-filter convolution layer, followed by 64, 128, and so on. We tried other setups such as 20-40-80 and 16-32-64. In the report below, we use only the number of filters in the first convolution layer to distinguish the setups.
We also change the filter number of the discriminator to check whether a simpler discriminator is still an effective critic.
Some other modifications can be made to the network structures. For the U-net, we try different depths of the encoder-decoder, reducing the original 8 layers to 7 layers. A shallow U-net can focus more on small details, while a deeper U-net might extract larger features because of its larger receptive field [35]. Another modification is the stride of the convolution/deconvolution layers. A larger stride leads to more rapid downsampling/upsampling, which amounts to stronger spatial pooling but also to a faster loss of information. For the ResNet, we change the number of residual blocks to simplify the network structure and add more encoder/decoder blocks.
For convenience, we use abbreviations to distinguish the different experiment setups. For example, "unet8-32-32" denotes an 8-layer U-net generator with 32 filters in the generator and 32 filters in the discriminator; "resnet7_v2-32-20_10_0_1" denotes a version 2 ResNet generator with 7 residual blocks, 32 filters in the generator, and 20 filters in the discriminator, where the loss weight parameters are $\lambda_1 = 10$, $\lambda_2 = 0$, $\lambda_{GAN} = 1$.
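The naming scheme can be decoded mechanically, as in the hypothetical helper below. It is inferred from the two examples only: it assumes exactly two hyphens in a name, that the loss weights are appended to the discriminator filter count with underscores in the order $\lambda_1$, $\lambda_2$, $\lambda_{GAN}$, and that a name without weights uses the default values 10, 0, 1.

```python
def parse_setup(name, default_weights=(10.0, 0.0, 1.0)):
    """Decode an experiment abbreviation such as 'resnet7_v2-32-20_10_0_1'."""
    generator, gen_filters, rest = name.split("-")
    fields = rest.split("_")
    weights = [float(f) for f in fields[1:]] or list(default_weights)
    lam1, lam2, lam_gan = weights
    return {
        "generator": generator,
        "generator_filters": int(gen_filters),
        "discriminator_filters": int(fields[0]),
        "lambda_1": lam1,
        "lambda_2": lam2,
        "lambda_gan": lam_gan,
    }
```

For instance, `parse_setup("unet8-32-32")` reports 32 filters for both networks together with the default loss weights.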
3.3.1 Evaluation Setup
Evaluating the quality of generated images is quite difficult [17]. Traditional
reconstruction losses such as the pixel-wise L1 or L2 error cannot handle
high-frequency information, while sidescan sonar images contain a lot of
high-frequency information such as highlights and shadows in rock areas. There are
also other quantitative evaluation methods, such as the structural similarity
index (SSIM) and the peak signal-to-noise ratio (PSNR). But during our tests, we
found that these scores do not reflect the quality of the generated images accurately. Some models produce significantly better output, yet their scores show no obvious difference or are even lower. This might be because sidescan sonar images contain a large amount of noise, which makes evaluation difficult.
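For reference, PSNR is a purely pixel-wise score, as the short sketch below shows. The function name and the 8-bit value range (`max_value=255`) are assumptions.

```python
import numpy as np

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = reference.astype(np.float64) - generated.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Because the score depends only on the mean squared pixel difference, speckle-like noise in the ground truth lowers it for every model, regardless of how plausible the generated structure looks.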
We therefore decided to have humans evaluate the image quality. We designed a survey for volunteers that contains a large number of questions, each with a pair of images.
For each question, the respondent is shown one bathymetry image and one SSS image. The bathymetry image is rendered with a built-in OpenCV colormap to make details more visible. The SSS image is either the ground truth SSS image or an image generated by the GAN. The respondent judges whether the SSS image is "real" or "fake". The images are presented in random order to avoid bias. We ask the respondents to decide based on their first impression, within several seconds.
To get a clearer view of the performance, we divide all the test images into 3 categories: flat, rock, and hill (Figure 3.11). "Flat" means that the area is a nearly flat seabed with little information, "rock" means that the area contains many small rock-like objects, and "hill" means that the area contains many large-scale structures such as hills. With these categories, the advantages and disadvantages of the different GAN models can be compared more clearly.
(a) Flat area (b) Rock area (c) Hill area