
Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-A--2019/073--SE

Semantic segmentation of seabed sonar imagery using deep learning

Semantisk segmentering av sonarbilder från havsbotten med deep learning

Petter Granli

Supervisor: Suejb Memeti
Examiner: Lena Buffoni



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

For investigating the large parts of the ocean which have yet to be mapped, there is a need for autonomous underwater vehicles. Current state-of-the-art underwater positioning often relies on external data from other vessels or beacons. Processing seabed image data could potentially improve autonomy for underwater vehicles.

In this thesis, image data from a synthetic aperture sonar (SAS) was manually segmented into two classes: sand and gravel. Two different convolutional neural networks (CNN) were trained using different loss functions, and the results were examined. The best performing network, U-Net trained with the IoU loss function, achieved dice coefficient and IoU scores of 0.645 and 0.476, respectively. It was concluded that CNNs are a viable approach for segmenting SAS image data, but there is much room for improvement.


Acknowledgments

I want to thank my supervisor Louise and Saab for giving me the opportunity and support needed for this thesis. I also want to thank my supervisor Suejb and my examiner Lena for their valuable input. Lastly, I want to thank my opponent Tobias Martinsson for his feedback.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Theory
  2.1 Synthetic aperture sonar
  2.2 Semantic segmentation and scene understanding
  2.3 Convolutional neural networks
  2.4 Overfitting
  2.5 Metrics for performance evaluation
3 Related work
  3.1 Recognition and classification
  3.2 Segmentation using CNN
  3.3 Phase information
4 Method
  4.1 Manual labeling and segmentation
  4.2 Implementation
  4.3 Training
  4.4 Prediction
  4.5 Result measurement
  4.6 Hardware and software specifications
5 Results
  5.1 Manual segmentation
  5.2 Prediction results
6 Discussion
  6.1 Results
  6.2 Method
  6.3 The work in a wider context
7 Conclusion
  7.1 Future work

Bibliography


List of Figures

2.1 Example of SAS image
2.2 Example of SAR image
2.3 Scene understanding
2.4 U-Net architecture
2.5 Overfitting
2.6 Early stopping
4.1 Overview of the method
4.2 Rotation and cropping of image
4.3 Overlapping slicing
5.1 Results of the manual segmentation
5.2 Plot of accuracy during training for FCN-8
5.3 Plot of loss during training for FCN-8
5.4 Plot of accuracy during training for U-Net with cross-entropy loss
5.5 Plot of cross-entropy loss during training for U-Net
5.6 Plot of accuracy during training for U-Net with dice coefficient loss
5.7 Plot of dice coefficient loss during training for U-Net
5.8 Plot of accuracy during training for U-Net with IoU loss
5.9 Plot of IoU loss during training for U-Net
5.10 Example of discontinuity when using no overlap
5.11 Example of discontinuity when using 24px overlap
5.12 Ground truth
5.13 No overlap when stitching
5.14 24 pixels overlap when stitching
5.15 256 pixels overlap when stitching


List of Tables

5.1 Table of properties for images in dataset
5.2 Results for FCN-8
5.3 Results for U-Net with the different loss functions


1 Introduction

Many of the devices and services used daily are made possible thanks to machine learning. Whether we are using a search engine, talking to a voice assistant, booking a cab ride, or browsing e-mail, machine learning improves the experience. Historically, machine learning has required domain expertise to transform raw data into something useful for an algorithm. Deep learning introduces more layers of processing to model complex raw data and has significantly improved state-of-the-art performance in an abundance of areas such as image analysis [25]. A popular deep learning method is the convolutional neural network (CNN), which has had breakthroughs in many computer vision areas such as semantic segmentation of image data for autonomous cars and segmentation of biomedical data [34]. Recently, there have been many breakthroughs using CNNs for image analysis thanks to large open image databases and huge performance growth in graphics processing units (GPUs) [36].

When segmenting an image, different parts of the image are grouped to provide insight into the image data. Semantic segmentation also classifies these groups and assigns meaningful labels to them; in practice, this means that each pixel in an image is assigned a label. A use-case example is when autonomous cars use semantic segmentation to separate drivable road from obstacles in images from car-mounted cameras.

Sonar is an acronym for Sound Navigation and Ranging and is a technique which uses sound for detection and location of objects. Synthetic aperture sonar (SAS) is a technique for combining multiple sonar readings to achieve higher resolution imagery compared to more conventional sonar approaches. In the underwater domain, sonar is used rather than radar because of the absorption properties of water for radio signals.

1.1 Motivation

Large parts of the ocean have yet to be mapped. This, in conjunction with the limited means of communication under water, increases the demand for autonomous underwater vehicles. Classification of seabed types could potentially improve autonomy for these vehicles. Manual classification and segmentation of large amounts of data is very time-consuming and, of course, not feasible for autonomous applications, which puts demands on automatic approaches. Current state-of-the-art techniques for underwater navigation achieve sufficient positioning precision but are limited by the need to receive external information from other vessels or beacons. Terrain-based navigation shows great promise in terms of navigation performance and full autonomy. [31]

For underwater applications, sonar is used to enable computer vision. This study uses data from a synthetic aperture sonar (SAS) which yields highly detailed images of the seabed.

1.2 Aim

The aim of this thesis project is to analyze whether deep learning can be used for segmentation of synthetic aperture sonar image data, and what performance can be expected from such an approach. The research is conducted in the following steps:

1. Preprocessing and manually segmenting the image data to create a training and testing dataset.

2. Applying two existing, acknowledged CNN architectures to segment the data and evaluating their performance.

3. Modifying non-architecture parameters such as loss function during training and tiling strategies to evaluate their impact on segmentation quality.

Existing architectures are applied instead of designing new ones from scratch because designing an architecture is a complicated task; determining general rules for designing a CNN, even for specific problems, is essentially impossible.

1.3 Research questions

Following are the research questions for this thesis:

• What performance can be expected when applying an existing CNN for segmenting acoustic seabed image data?

• How do image tiling strategies affect performance for segmenting acoustic seabed image data?

1.4 Delimitations

This thesis is, of course, limited by time. Further, the available data is sparse, which limits the possibilities to examine certain approaches. The goal of this thesis is not to find the optimal solution for segmenting this kind of data, but rather to establish a benchmark for further studies, which hopefully can give some pointers in the right direction when designing an architecture.


2 Theory

In this chapter, theory relevant to this thesis is presented. First, synthetic aperture sonar (SAS) and semantic segmentation are described in more detail to give context to the problem of this thesis. Convolutional neural networks (CNN), together with state-of-the-art architectures, are also described, and the chapter is concluded with the issue of overfitting and means of measuring performance. The data used in this thesis is gathered using an SAS, and semantic segmentation is used to analyze this data. The segmentation is generated using a state-of-the-art CNN, which is a method prone to overfitting.

2.1 Synthetic aperture sonar

As stated in the introduction, sonar is an acronym for Sound Navigation and Ranging and is a technique which uses sound for detection and location of objects. Sonar is similar to radar, Radio Detection and Ranging, which instead uses radio waves. For the underwater domain, sonar is used rather than radar because of the absorption properties of water for radio signals. Synthetic aperture sonar (SAS) combines multiple sonar readings and applies advanced post-processing to generate very detailed high-resolution sonar images. Conventional sonar can also achieve high resolution but with severe range limitation. [12] Examples of SAS and synthetic aperture radar (SAR) images can be seen in Figures 2.1 and 2.2.

For a long time, SAS was considered a branch of SAR. SAS shares many characteristics with SAR, and many techniques and algorithms have been reused directly for SAS. However, because of the properties of water, SAS has developed into a truly unique research field. [13]


Figure 2.1: Example of SAS image. By Australian Transport Safety Bureau (cropped) [3]

Figure 2.2: Example of SAR image. By Antti Lipponen (cropped) [27]

2.2 Semantic segmentation and scene understanding

Semantic segmentation is a category of scene understanding [26] which, as the name suggests, aims to explain the scene in an image. As stated in Chapter 1, semantic segmentation is when objects in an image are classified, and the location and boundary of each object are detected. Other tasks for scene understanding include:

• Classification. An image is processed and assigned one or several classes.

• Object localization. Objects in an image are detected, and bounding boxes for these objects are output.

• Individual instance segmentation. The image is semantically segmented, and individual instances of the classes are distinguished.

An illustration of these different types of scene understanding can be seen in Figure 2.3. Of particular interest for this thesis is the semantic segmentation example, where the image has been divided into three classes and a background. This is an example of semantic segmentation in practice: all pixels have been assigned a meaningful label.


Figure 2.3: Different types of scene understanding: (a) image classification, (b) object localization, (c) semantic segmentation, (d) instance segmentation. Image by Sara Kirby [21], adapted from [26]

2.3 Convolutional neural networks

Convolutional neural networks (CNN) are a popular approach for solving not only segmentation problems but visual imagery problems in general. CNNs have long been used for achieving state-of-the-art performance for image classification tasks [24]. Thanks to the increase in computational power and the crowdsourcing of large datasets, deeper networks continue to break new ground in terms of classification performance [23]. Research suggests that the depth of the network benefits not only accuracy, but also generalization to different tasks and datasets [36].

Recent networks use huge datasets with millions of images, trained on multiple GPUs for several weeks [36]. For many applications, this is not a viable approach since the manual annotation of images is too time-consuming and costly. Popular datasets for segmentation challenges such as Cityscapes [7], CamVid [2], Pascal VOC12 [10], or SUN RGB-D [37] contain a huge amount of training data compared to the amount of data available for this thesis. As a result, popular architectures, the winners of these challenges, are often optimized for a large amount of training data. Also, these popular datasets contain regular images; in other words, these winning networks are trained on data consisting of visible light, which is fundamentally different from the echoes of sound in water that this thesis processes.


Fully convolutional neural networks

Fully convolutional networks (FCNs) were proposed by J. Long et al. in 2015 [29]. FCNs can produce a pixel segmentation with the same dimensions as an arbitrary input image. When presented, the proposed FCN achieved state-of-the-art performance on several datasets. Three different networks were presented: FCN-32, FCN-16, and FCN-8, where FCN-8 produces the most detailed segmentation.

U-Net

For this thesis, an architecture which performs well with limited training data is, of course, interesting. U-Net was proposed by Ronneberger et al. in 2015 and is a modification and extension of the fully convolutional network which aims to produce precise segmentation maps even with very little training data [34]. U-Net is the most well-known CNN architecture in the field of medical image analysis and, together with similar networks, ranks at the top of image segmentation challenges [28].

The U-Net architecture, which can be seen in Figure 2.4, consists of one contracting and one expansive path. In the contracting path, a block consisting of two 3x3 convolutions followed by ReLU activation and a downsampling 2x2 max pooling with stride 2, is repeated four times. The expansive path also consists of four repeating blocks. Each block in the expansive path includes an upsampling of the feature map, a concatenation of the feature map from the contracting path, and two 3x3 convolutions with ReLU activations. In the final layer, the feature vector is mapped to the desired number of classes using a 1x1 convolution. [34]
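To make the block structure concrete, the following is a minimal sketch of a U-Net-style network in tf.keras. It is reduced to two downsampling steps for brevity, and the filter counts are illustrative assumptions, not the configuration used in this thesis.

```python
# Minimal U-Net-style sketch in tf.keras (reduced depth; illustrative filter counts).
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions with ReLU activation, as in each U-Net block
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(512, 512, 1), num_classes=1):
    inputs = layers.Input(input_shape)

    # Contracting path: conv block followed by 2x2 max pooling with stride 2
    c1 = conv_block(inputs, 32)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck
    b = conv_block(p2, 128)

    # Expansive path: upsample, concatenate the skip connection, conv block
    u2 = layers.concatenate([layers.UpSampling2D(2)(b), c2])
    c3 = conv_block(u2, 64)
    u1 = layers.concatenate([layers.UpSampling2D(2)(c3), c1])
    c4 = conv_block(u1, 32)

    # Final 1x1 convolution maps the features to the desired number of classes
    outputs = layers.Conv2D(num_classes, 1, activation="sigmoid")(c4)
    return Model(inputs, outputs)
```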

2.4 Overfitting

Overfitting [9] is problematic for most machine learning approaches, but it is particularly problematic when working with limited datasets such as the dataset of this thesis. Overfitting occurs when a trained model performs well on the training data but not on previously unseen data: as training continues, the model eventually also learns the noise in the training data. In Figure 2.5, the green line represents the overfitted model, while the black line represents the general model.

Figure 2.5: Descriptive figure of the overfitting concept by Chabacano [4]

Early stopping is a crude technique for avoiding overfitting. Separate training and validation datasets are used during training. While training, the error for the training dataset is minimized, and the error for the validation dataset is monitored. As described in Figure 2.6, if the error for the validation dataset starts increasing, the training is stopped to avoid overfitting. The downside to this approach is, of course, that the training is halted. [33]

Figure 2.6: Early stopping
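As an illustration, early stopping can be expressed in Keras as a callback that monitors the validation loss; the patience value below is an arbitrary example, not a setting used in this thesis.

```python
# Early stopping as a Keras callback: training halts when the validation loss
# has not improved for `patience` epochs (values here are illustrative).
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```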

Dropout is a more refined method, proposed by G. E. Hinton et al. in 2012, that prevents the overfitting of a network that can occur when using small datasets. Dropout works by randomly removing hidden units of the network with a set probability for each training case. Dropout has been shown to mitigate overfitting effectively, and it allows large networks to be trained without using early stopping. [15]
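A minimal sketch of how dropout is inserted between layers in Keras is shown below; the layer sizes are arbitrary and only illustrate that units are dropped with probability 0.5 during training.

```python
# Dropout between layers: each hidden unit is dropped with probability 0.5
# for every training case (layer sizes are arbitrary, for illustration only).
from tensorflow.keras import layers, Model

inputs = layers.Input((64,))
h = layers.Dense(128, activation="relu")(inputs)
h = layers.Dropout(0.5)(h)  # active only during training
outputs = layers.Dense(1, activation="sigmoid")(h)
model = Model(inputs, outputs)
```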


Batch normalization

Batch normalization has gained great traction since it was presented in 2015 and is used by many state-of-the-art image segmentation architectures [1, 17]. Interestingly, it was not included in the U-Net network proposed by Ronneberger et al. [34]. S. Ioffe and C. Szegedy presented batch normalization for accelerating the training of deep networks by incorporating normalization of activations into the network's architecture. With batch normalization, it is possible to use higher learning rates and less carefully selected initialization. Applying batch normalization results in a significant speedup of training, which enables costly modifications to be made to the architecture, resulting in improved performance. The algorithm is described in Equation 2.1, where the input is x over a mini-batch B = {x_1, ..., x_m}, and the parameters to be learned are β and γ. [19]

\[
\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i, \quad
\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2, \quad
\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad
y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)
\tag{2.1}
\]
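The transform in Equation 2.1 can be sketched directly in NumPy for a mini-batch laid out along the first axis; this is only an illustration of the forward pass, without the learned updates of γ and β.

```python
# NumPy sketch of the batch normalization transform in Equation 2.1, applied
# over a mini-batch along axis 0. gamma and beta are the learned parameters.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift: BN(x_i)
```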

Despite the undeniably good performance achieved by implementing batch normalization, the reasons for its effectiveness are disputed. S. Santurkar et al. investigate the widespread belief that batch normalization works by reducing internal covariate shift and find little support for this. The authors believe that it instead works by smoothing the optimization landscape, which stabilizes the behavior and predictiveness of the gradients and accelerates training. [35]

2.5 Metrics for performance evaluation

Evaluating the performance of a segmentation is no trivial task, and common metrics are needed to make results comparable. Historically, the evaluation of segmentation quality has been performed by human inspection. Today, this is done by comparing the output segmentation with manually segmented test data and measuring the difference with some metric. [42]

Apart from segmentation quality, there are also other valuable characteristics of an architecture, such as training and execution time or memory footprint. For many applications, such as in robotics or other embedded systems, execution time and memory footprint are very relevant metrics, but few architectures focus on these characteristics. However, to aid replicability, it is important to include execution time and hardware specifications for the results acquired. [11]

Intersection-over-union

The intersection-over-union (IoU) metric compares the similarity and diversity of two sets and is calculated by dividing the intersection of the two sets by their union, as seen in Equation 2.2. IoU has become the standard for measuring segmentation performance, but it is a limited measurement when the segmentation boundary is what should be evaluated rather than the number of correctly labeled pixels [8].

\[
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
\tag{2.2}
\]

Dice coefficient

The Dice coefficient metric is very similar to the IoU metric. The difference between these metrics is that the overlap is counted twice in both the numerator and denominator, as can be seen in Equation 2.3.

\[
\mathrm{DSC}(A, B) = \frac{2|A \cap B|}{|A| + |B|}
\tag{2.3}
\]

Pixel accuracy

Pixel accuracy, which can be seen in Equation 2.4, is a simple metric of the ratio between correctly labeled pixels (true positive+true negative) and the total number of pixels.

\[
\text{Accuracy} = \frac{\sum TP + \sum TN}{\sum TP + \sum TN + \sum FP + \sum FN}
\tag{2.4}
\]
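Assuming binary prediction and ground-truth masks (arrays of 0s and 1s), the three metrics above can be sketched in NumPy as follows; edge cases such as empty masks are ignored for brevity.

```python
# NumPy sketches of the metrics in Equations 2.2-2.4 for binary masks
# (arrays of 0s and 1s); empty-mask edge cases are not handled.
import numpy as np

def iou(pred, target):
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union

def dice(pred, target):
    intersection = np.logical_and(pred, target).sum()
    return 2 * intersection / (pred.sum() + target.sum())

def pixel_accuracy(pred, target):
    return (pred == target).mean()
```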

3 Related work

There is limited research on using CNNs for segmentation of SAS data in particular, but deep learning has successfully been applied to other tasks using such data. Also, segmentation of similar kinds of data using CNNs has been explored, together with more traditional approaches to segmenting SAS data.

3.1 Recognition and classification

One field where deep learning has been successfully applied for both SAS and SAR data is target recognition, detection, and classification. D. Williams has explored the classification of underwater SAS data. An unsupervised approach was used to classify targets in different seabed environments using a huge dataset with good accuracy [39]. D. Williams later expanded this work using a CNN classifier. The CNN approach resulted in a significant increase in performance, and the author concludes that CNNs show a lot of promise for the SAS image analysis field [40]. H. Zhu et al. explore the classification of SAS imagery using CNNs when there is a limited amount of training data [43]. The proposed method uses advanced preprocessing algorithms and transfer learning to mitigate the problems of a limited dataset. The results show a performance gain for most, but not all, tasks. Deep learning has also successfully been implemented for target recognition on the similar SAR data with high performance [5]. Furthermore, methods for mitigating the problems of limited labeled training data have been applied with promising results: Z. Huang et al. [18] conclude that transfer learning is a viable method for limited SAR data.

3.2 Segmentation using CNN

As stated before, there is little or no previous research on using CNNs for segmentation of SAS data. However, CNNs have successfully been used for segmentation in the similar SAR field. C. Henry et al. find that a shallow CNN can be utilized for extracting and segmenting roads in SAR imagery [14]. D. Malmgren-Hansen and M. Nobel-Jorgensen have utilized a CNN to segment targets and their shadows [30].

3.3 Phase information

CNNs are mostly used for segmenting data from optical sensors. A unique opportunity when working with sonar, or other complex-valued data, is the additional phase information. Phase data has been used successfully for both SAS [38] and SAR [41]. For complex data, the building blocks of the architecture need to be modified to handle complex numbers [6], and the additional data increases computational complexity, but the field shows great promise.

4 Method

In this chapter, the method of the work is described. The method includes the creation of the training and test datasets, the implementation of the segmentation architectures, the training of the architectures, prediction for the test dataset, and the performance evaluation. The SAS image data was preprocessed and manually segmented to be used for training and for evaluation of the predictions. Then, the architectures were trained with the training data, and the performance was evaluated using the test data. An overview of how the concepts are tied together can be seen in Figure 4.1.

4.1 Manual labeling and segmentation

As a first step, the raw SAS data was used to create the training and test datasets. The data consisted of 9 high-resolution GeoTIFF images, taken at different dates and locations in Lake Vättern. The images were first rotated and cropped to be rectangular, as seen in Figure 4.2.

Figure 4.2: Rotation and cropping of image

A manual segmentation mask was created using the pen tool in Photoshop. Gravel parts of the images were painted white, while sand was painted black. Both the segmentation masks and the rectangular images were saved in PNG format. Eight of the nine images were used for training after slicing them into smaller images with an overlap, as described in Figure 4.3. The overlap when slicing was 50 % (256 pixels) in both height and width. The last image and its corresponding manual segmentation, which were not used for training, were used for evaluating the performance of the segmentation. This is often referred to as leave-one-out and could easily be extended to cross-validation if more time were available [22].
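A sketch of this overlapping slicing is shown below, using a 512x512 tile size and a 256-pixel (50 %) step; border regions that do not fit a full tile are simply skipped in this illustration, which may differ from how the thesis handled them.

```python
# Overlapping slicing sketch: 512x512 tiles with a 256-pixel (50 %) step in
# both height and width. Border regions that do not fit a full tile are skipped.
import numpy as np

def slice_image(image, tile=512, stride=256):
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            tiles.append(image[y:y + tile, x:x + tile])
    return np.stack(tiles)
```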

4.2 Implementation

The U-Net and FCN-8 architectures described in Section 2.3 were implemented with the Keras framework (https://keras.io) in Python with TensorFlow as backend. Computations were made using an Nvidia GPU with the CUDA API, as specified in Section 4.6. Batch normalization (Section 2.3) and dropout (Section 2.4) with probability 0.5 were added to the architectures. During training, the data augmentation API of Keras was used to rotate and mirror the training slices.
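As an illustration, rotation and mirroring with the Keras data augmentation API could look as follows; the exact augmentation parameters used in the thesis are not specified, so the values below are assumptions.

```python
# Illustrative Keras data augmentation for rotating and mirroring the training
# slices; parameter values are assumptions, and the same seed keeps image and
# mask generators aligned.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=90,
                             horizontal_flip=True,
                             vertical_flip=True)
# image_gen = datagen.flow(x_train, batch_size=2, seed=1)
# mask_gen = datagen.flow(y_train, batch_size=2, seed=1)
```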

4.3 Training

The architectures described in Section 4.2 were trained using 505 overlapping slices of the eight training images. The networks were trained for 50 epochs with a batch size of 2 while saving the model with the best loss score. Both U-Net and FCN-8 were trained using the Adam [20] optimizer with a learning rate of 10^-4 and the binary cross-entropy loss function. The U-Net was also trained using the dice coefficient and IoU loss functions to evaluate their impact.
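The training setup can be sketched as follows. The smoothed dice and IoU losses are one common formulation, not necessarily the exact ones used in the thesis, and build_unet, x_train, and y_train are assumed to come from the earlier sketch in Section 2.3 and the prepared dataset.

```python
# Sketch of the training setup: Adam with a 1e-4 learning rate, batch size 2,
# 50 epochs, and either binary cross-entropy or a metric-based loss. The
# smoothed dice/IoU losses are a common formulation, not necessarily the exact
# one used here; build_unet is the hypothetical constructor from Section 2.3.
import tensorflow.keras.backend as K
from tensorflow.keras.optimizers import Adam

def dice_loss(y_true, y_pred, smooth=1.0):
    # 1 - dice coefficient (Equation 2.3), smoothed to avoid division by zero
    intersection = K.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (K.sum(y_true) + K.sum(y_pred) + smooth)

def iou_loss(y_true, y_pred, smooth=1.0):
    # 1 - IoU (Equation 2.2), smoothed to avoid division by zero
    intersection = K.sum(y_true * y_pred)
    union = K.sum(y_true) + K.sum(y_pred) - intersection
    return 1.0 - (intersection + smooth) / (union + smooth)

model = build_unet()
model.compile(optimizer=Adam(learning_rate=1e-4), loss=iou_loss, metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=2, epochs=50, validation_split=0.1)
```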

4.4 Prediction

As with the training, smaller patches of the image were used for predictions. Predictions were made on both overlapping and non-overlapping patches to compare the results. When predicting for non-overlapping patches, the patches were stitched together without any additional computation. When predicting the masks for overlapping patches, the images were stitched together using the pixels from the patch whose center is closest to the pixel position. In other words, the overlapping images were cropped to remove half of the overlapping pixels from each edge before being stitched together. For example, when the overlap tested was 50 % in both height and width, the 512x512 images were cropped to 256x256, removing 128 pixels from each edge (half of the 256-pixel overlap).
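A sketch of this center-crop stitching for the 50 % overlap case is given below; grid_h and grid_w are the assumed number of patch rows and columns, and border handling is simplified compared to a full implementation.

```python
# Center-crop stitching sketch for 512x512 patches with a 256-pixel (50 %)
# overlap: 128 pixels are removed from every edge and the 256x256 centers are
# placed side by side. patches is assumed to have shape (N, 512, 512).
import numpy as np

def stitch_center_crops(patches, grid_h, grid_w, tile=512, overlap=256):
    crop = overlap // 2              # 128 pixels removed from each edge
    core = tile - 2 * crop           # 256x256 central region that is kept
    out = np.zeros((grid_h * core, grid_w * core), dtype=patches.dtype)
    for i in range(grid_h):
        for j in range(grid_w):
            patch = patches[i * grid_w + j]
            out[i * core:(i + 1) * core,
                j * core:(j + 1) * core] = patch[crop:crop + core, crop:crop + core]
    return out
```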

4.5 Result measurement

When predicting segmentation masks for the test dataset, the network outputs probabilities between 0 and 1 for each pixel, resulting in a grayscale image. To evaluate the segmentation, a threshold of 0.5 was applied to the output segmentation images to obtain a binary image (only two distinct pixel values), since only two classes were used. The binary image was then compared to the ground truth.
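As a small illustration, the thresholding step amounts to a single comparison; `probabilities` below is a hypothetical stand-in for the network's per-pixel output.

```python
# Thresholding the per-pixel probabilities at 0.5 to obtain a binary mask;
# `probabilities` stands in for the network's grayscale output.
import numpy as np

probabilities = np.random.rand(512, 512)               # stand-in for a prediction
binary_mask = (probabilities > 0.5).astype(np.uint8)   # two distinct pixel values
```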

4.6 Hardware and software specifications

The following hardware and software were used for computations and image processing in this thesis. If implemented correctly, other specifications should yield the same results in terms of all performance metrics other than execution time.

Hardware

• CPU: Intel i7 8700k
• GPU: Nvidia RTX 2080ti
• RAM: 16GB

Software

• Manjaro 18.0.3, Linux 4.19.23-1
• GPU driver: 418.43
• CUDA 10.1
• cuDNN 7.4.2.24-1
• Tensorflow 1.13.1
• Keras 2.2.4
• Photoshop CC 2019

5 Results

This chapter presents the results of the conducted study. The manually segmented training and testing datasets are presented together with the results and predictions of the two implemented architectures.

5.1 Manual segmentation

The result of the manual segmentation can be seen in Figure 5.1, where nine corresponding ground truth images have been created. Properties of the images can be seen in Table 5.1.

Table 5.1: Table of properties for images in dataset

Image #    Height (px)  Width (px)  Gravel (%)  Sand (%)
1          4856         1781        23          77
2          3888         1712        15          85
3          3976         1688        19          81
4          4011         1716        16          84
5          4005         1689        18          82
6          3945         1728        18          82
7          3969         1662        21          79
8          4759         1965        10          90
9 (Test)   4016         1740        13          87

Figure 5.1: Results of the manual segmentation

5.2 Prediction results

In this section, the results of using the trained architectures are presented. First, the results for FCN-8 are presented, followed by the results for U-Net trained with the different loss functions.

FCN-8

The FCN-8 architecture was only trained using the binary cross-entropy loss function. The accuracy and loss plots can be seen in Figures 5.2 and 5.3, respectively. The accuracy ranges from 0 to 1, where 1 means 100 % accuracy, and the binary cross-entropy is what is being minimized during training. There is a drop in accuracy and loss around epoch 32, which cannot be explained with absolute certainty. However, the dip does not reoccur, suggesting it is caused by randomness rather than some peculiarity in the data.

The results in terms of metrics can be seen in Table 5.2. All values, of course, range from 0 to 1, and a higher value is desired. The accuracy is close to the class imbalance of the test data, and both the dice and IoU metrics are quite low.

Figure 5.2: Plot of accuracy during training for FCN-8

Figure 5.3: Plot of loss during training for FCN-8

Table 5.2: Results for FCN-8

Tiling          Dice   IoU    Acc
No overlap      0.405  0.254  0.891
24 px overlap   0.413  0.260  0.892
256 px overlap  0.407  0.256  0.892


U-Net

The U-Net architecture was trained using three different loss functions: binary cross-entropy, dice coefficient, and IoU loss. The plots of accuracy and loss during training of the U-Net with these loss functions can be seen in Figures 5.4 to 5.9. Metrics for the segmentation quality of these trained networks can be seen in Table 5.3.

The network trained with the IoU loss function performed best in terms of all metrics, reaching a dice coefficient score of 0.645, an IoU score of 0.476, and an accuracy of 89.8 %. The output segmentation for the three different overlaps can be seen in Figures 5.13 to 5.15, and the ground truth for these predictions can be seen in Figure 5.12. Gravel and sand are depicted by white and black, respectively, in these figures.


Figure 5.4: Plot of accuracy during training for U-Net with cross-entropy loss


Figure 5.5: Plot of cross-entropy loss during training for U-Net


Figure 5.6: Plot of accuracy during training for U-Net with dice coefficient loss


Figure 5.7: Plot of dice coefficient loss during training for U-Net


Figure 5.8: Plot of accuracy during training for U-Net with IoU loss


Figure 5.9: Plot of IoU loss during training for U-Net

Table 5.3: Results for U-Net with the different loss functions

                Cross-entropy loss       Dice coef. loss          IoU loss
Tiling          Dice   IoU    Acc        Dice   IoU    Acc        Dice   IoU    Acc
No overlap      0.533  0.363  0.781      0.601  0.430  0.883      0.643  0.474  0.897
24 px overlap   0.538  0.368  0.784      0.606  0.435  0.883      0.645  0.476  0.898
256 px overlap  0.539  0.369  0.785      0.606  0.435  0.883      0.637  0.467  0.898


Best segmentation

Figures 5.13, 5.14, and 5.15 show the segmentation output from the best performing trained network, U-Net trained with the IoU loss function. The ground truth for this segmentation is shown in Figure 5.12. Black represents sand, and white represents gravel. When non-overlapping patches are stitched together, the borders between patches are obvious, which can be seen in Figure 5.10. When using a small overlap, the stitched image is more continuous, but some borders are still visible, as seen in Figure 5.11. The more extreme overlapping strategy of 256 pixels looks continuous without any obvious borders.

Figure 5.10: Example of discontinuity when using no overlap

Figure 5.11: Example of discontinuity when using 24px overlap


Figure 5.13: No overlap when stitching

Figure 5.14: 24 pixels overlap when stitching

Figure 5.15: 256 pixels overlap when stitching

6 Discussion

In this chapter, the results and method of the study are discussed.

6.1 Results

In this section, the results of the conducted study are discussed. First, the result of the manual segmentation is discussed, followed by the discussion of the prediction results.

Manual segmentation

Much time was spent manually segmenting the dataset, and the overall result looks to be of high quality. However, it is difficult to assess the quality of the segmentation in more detail. The segmentation is mostly highly detailed and accurate, but it is unfortunately apparent that some of the images are segmented more finely than others, which may well have affected the prediction results of the architectures.

As discussed by A. Meyer-Baese and V. J. Schmid [32] for the medical field, manual segmentation is best done by an expert, and even then the segmentation is not highly precise and is prone to intraobserver variability. The manual segmentation in this thesis was not performed by an expert in the field, and as the act of segmentation was stretched over a longer time, there is probably variance between the segmented images.

Prediction results

When manually inspecting the prediction results for the test dataset, it is obvious that segmenting SAS data with a CNN is possible. However, the segmentation results from the networks in this thesis leave much room for improvement. When comparing the metrics of the predictions to the results of other studies in the similar SAR domain, the values are similar: D. Malmgren-Hansen and M. Nobel-Jorgensen [30] achieve dice coefficient scores in a comparable range.


Even though the stitching method in this thesis was simple, looking at the prediction results, it is apparent that some sort of overlapping strategy is needed when predicting patches of a large image. Without overlapping, the prediction contains many sharp edges, and it is apparent to the viewer that the large image consists of patches stitched together. Using a very large overlap does not seem to contribute much to the segmentation quality, but it is important to use at least a small overlap to overcome the diminishing accuracy near the edges.

It is worth noting that the loss function used when training the architecture seems to have a very large impact on the prediction result, and using a loss function corresponding to a specific metric does not necessarily guarantee the best result in terms of that metric.

6.2 Method

In this section, the method of the conducted study is discussed.

Manual segmentation

When manually segmenting the data, the pixels were classified as either sand or gravel. It would, of course, be more interesting to include even more classes, like man-made objects, mud, or larger rocks. However, the available data contained too little of other classes. Another problem that could have occurred with more classes is misclassification when annotating the images. It would have been hard to distinguish the borders between more classes by just looking at the finalized imagery from the sonar, so the dataset would have to be constructed in another way. For example, known objects could be placed on the seabed beforehand, as in D. P. Williams's study [39].

The data was segmented fully manually, rather than with a computer-aided approach, primarily because no other suitable method was found for this kind of data. Much of the available software is developed for segmentation of objects in conventional photography, where classes are separated by distinct color differences or obvious shapes. Also, if a semi-automatic approach is used, it is difficult to assess how much the operator contributes to the segmentation. One pitfall could be that the semi-automatic approach is more automatic than thought, resulting in a thesis where an already automated approach is merely mimicked by a deep learning method, which is a more or less trivial problem.

Predictions

There are many viable approaches to implementing the concepts in this thesis. However, the segmentation result is not affected by using Keras with TensorFlow and an Nvidia GPU rather than any other framework and hardware.

The architectures were chosen due to their popularity and performance in other fields, and surely other architectures could perform better for the data in this thesis. However, the goal of this thesis was not to find the most optimal architecture. Batch normalization and dropout are more or less standard in all modern CNNs and have well-documented performance, and were therefore included.

Due to time constraints, the same number of epochs and the same optimizer were used for both architectures, rather than trying to optimize the performance of each. This enabled the research to quickly focus on the better performing architecture, the U-Net.

The stitching approach used in this thesis is quite simple, but the goal of evaluating stitching was not to maximize performance. The stitching was evaluated to answer whether stitching


is needed at all and whether a very large overlap would yield any performance gain. Also, B. Huang et al. [16] compare stitching methods and find that there is only a small performance gain from using computationally heavy averaging methods rather than just clipping the edges before stitching.

Source criticism

Overall, the selected sources for this thesis are of very high quality thanks to the field being incredibly popular. Almost all of the sources used in this thesis are peer-reviewed scientific publications with many citations in other highly regarded publications. However, some of the cited sources were published quite recently, making it more difficult to assess their impact on the field and, by extension, their quality.

6.3 The work in a wider context

In a wider context, what is investigated in this thesis could have ethical or societal impacts. The work and result itself cannot be considered very impactful, but the concept of analyzing seabed data could be very beneficial for a better understanding of marine ecosystems or more eco-friendly means of gathering minerals. Of course, the same analysis and maybe increased autonomy could also be used for malicious purposes.

7 Conclusion

The purpose of this thesis was to investigate whether deep learning can be used to segment seabed sonar image data. This was investigated by creating a dataset with corresponding ground truth for sand and gravel and answering the following research questions:

• What performance can be expected when applying an existing CNN for segmenting acoustic seabed image data?

• How do image tiling strategies affect performance for segmenting acoustic seabed image data?

Both the FCN-8 and U-Net architectures were applied, and the best segmentation result was achieved using U-Net with the IoU loss function. In terms of metrics, the segmentation reached a pixel accuracy of 90 % together with dice coefficient and IoU scores of 0.65 and 0.48, respectively. Using an overlapping tiling strategy proved beneficial to the segmentation quality in terms of continuity, but the impact on the metrics was minimal. Using a very large overlap did not seem to contribute much to the segmentation quality.

The contributions of this thesis consist of a training and testing dataset to be used for further studies, and a result that clearly shows that CNNs are a viable approach for this problem. This thesis also provides early results to be used as a reference point for future work in the field.

7.1 Future work

For future work, there is a need for a larger, and possibly community-shared, dataset which includes more classes. Not only would such a dataset ensure the quality of future results, it would also enable comparison of results between different researchers. Efforts should be made to maximize the performance of using a CNN for segmenting this data by comparing different architectures and tuning parameters, and, in the long run, creating a domain-specific architecture for segmentation of SAS data. Also, the concept of using the phase information in the SAS data with a complex-valued CNN seems promising and should be examined.


Bibliography

[1] Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla. “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling”. In: arXiv preprint arXiv:1505.07293 (2015).

[2] Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. “Semantic object classes in video: A high-definition ground truth database”. In: Pattern Recognition Letters 30.2 (2009), pp. 88–97.

[3] Australian Transport Safety Bureau. A shipwreck discovered in December 2015. com-mons.wikimedia.org/wiki/File:A_shipwreck_discovered_in_December_2015.jpg. https://creativecommons.org/licenses/by/4.0/legalcode. May 2019.

[4] Chabacano. Overfitting. https://commons.wikimedia.org/wiki/File:Overfitting.svg. https://creativecommons.org/licenses/by-sa/4.0/legalcode. Mar. 2019.

[5] Sizhe Chen and Haipeng Wang. “SAR target recognition based on deep learning”. In: 2014 International Conference on Data Science and Advanced Analytics (DSAA). 2014, pp. 541–547.

[6] Hyeong-Seok Choi, Jang-Hyun Kim, Jaesung Huh, Adrian Kim, Jung-Woo Ha, and Kyogu Lee. “Phase-Aware Speech Enhancement with Deep Complex U-Net”. In: international conference on learning representations (2019).

[7] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. “The Cityscapes Dataset for Semantic Urban Scene Understanding”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 3213–3223.

[8] Gabriela Csurka, Diane Larlus, and Florent Perronnin. “What is a good evaluation measure for semantic segmentation”. In: British Machine Vision Conference 2013. 2013.

[9] Thomas G. Dietterich. “Overfitting and undercomputing in machine learning”. In: ACM Computing Surveys 27.3 (1995), pp. 326–327.


[11] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena-Martinez, and José García Rodríguez. “A Review on Deep Learning Techniques Applied to Semantic Segmentation”. In: arXiv preprint arXiv:1704.06857 (2017).

[12] Roy Edgar Hansen. “Introduction to synthetic aperture sonar”. In: Sonar systems. Inte-chOpen, 2011.

[13] M. P. Hayes and P. T. Gough. “Synthetic Aperture Sonar: A Review of Current Status”. In: IEEE Journal of Oceanic Engineering 34.3 (July 2009), pp. 207–224. ISSN: 0364-9059. DOI: 10.1109/JOE.2009.2020853.

[14] Corentin Henry, Seyedmajid Azimi, and Nina Marie Merkle. “Road Segmentation in SAR Satellite Images With Deep Fully Convolutional Neural Networks”. In: IEEE Geoscience and Remote Sensing Letters 15.12 (2018), pp. 1867–1871.

[15] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. “Improving neural networks by preventing co-adaptation of feature detectors”. In: arXiv preprint arXiv:1207.0580 (2012).

[16] Bohao Huang, Daniel Reichman, Leslie M. Collins, Kyle Bradbury, and Jordan M. Malof. “Tiling and Stitching Segmentation Output for Remote Sensing: Basic Challenges and Recommendations”. In: arXiv: Computer Vision and Pattern Recognition (2018).

[17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. “Densely Connected Convolutional Networks”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 2261–2269.

[18] Zhongling Huang, Zongxu Pan, and Bin Lei. “Transfer Learning with Deep Convolutional Neural Network for SAR Target Classification with Limited Labeled Data”. In: Remote Sensing 9.9 (2017), p. 907.

[19] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: international conference on machine learning (2015), pp. 448–456.

[20] Diederik P. Kingma and Jimmy Lei Ba. “Adam: A Method for Stochastic Optimization”. In: international conference on learning representations (2015).

[21] Sara Kirby. Sheep Herding 2. flickr.com/photos/yarncoffee/5822299095/. https://creativecommons.org/licenses/by/2.0/legalcode. June 2019.

[22] Ron Kohavi. “A study of cross-validation and bootstrap for accuracy estimation and model selection”. In: IJCAI’95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2. Vol. 2. 1995, pp. 1137–1143.

[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. 2012, pp. 1097–1105.

[24] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. “Backpropagation Applied to Handwritten Zip Code Recognition”. In: Neural Computation 1.4 (Dec. 1989), pp. 541–551. ISSN: 0899-7667. DOI: 10.1162/neco.1989.1.4.541.

[25] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553 (2015), pp. 436–444.

[26] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. “Microsoft COCO: Common Objects in Context”. In: european conference on computer vision (2014), pp. 740–755.

[27] Antti Lipponen. SanFrancisco20170107. flickr.com/photos/150411108@N06/32175309746. https://creativecommons.org/licenses/by/2.0/legalcode. May 2019.


[28] Geert J. S. Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A. W. M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. “A survey on deep learning in medical image analysis”. In: Medical Image Analysis 42 (2017), pp. 60–88.

[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 3431–3440.

[30] David Malmgren-Hansen and Morten Nobel-Jorgensen. “Convolutional neural networks for SAR image segmentation”. In: 2015 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). 2015, pp. 231–236.

[31] José Luís Melo and Aníbal Matos. “Survey on advances on terrain based navigation for autonomous underwater vehicles”. In: Ocean Engineering 139 (2017), pp. 250–264.

[32] Anke Meyer-Baese and Volker J. Schmid. “Pattern Recognition and Signal Analysis in Medical Imaging”. In: (2014).

[33] Lutz Prechelt. “Early Stopping — But When?” In: Neural Networks: Tricks of the Trade: Second Edition. Ed. by Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 53–67. ISBN: 978-3-642-35289-8. DOI: 10.1007/978-3-642-35289-8_5.

[34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-Net: Convolutional Networks for Biomedical Image Segmentation”. In: medical image computing and computer assisted intervention (2015), pp. 234–241.

[35] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. “How Does Batch Normalization Help Optimization”. In: neural information processing systems (2018), pp. 2488–2498.

[36] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: international conference on learning representations (2015).

[37] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. “SUN RGB-D: A RGB-D scene understanding benchmark suite”. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 567–576.

[38] David P. Williams. “Exploiting Phase Information in Synthetic Aperture Sonar Images for Target Classification”. In: 2018 OCEANS - MTS/IEEE Kobe Techno-Oceans (OTO). 2018.

[39] David P. Williams. “Fast Target Detection in Synthetic Aperture Sonar Imagery: A New Algorithm and Large-Scale Performance Analysis”. In: IEEE Journal of Oceanic Engineering 40.1 (2015), pp. 71–92.

[40] David P. Williams. “Underwater target classification in synthetic aperture sonar imagery using deep convolutional neural networks”. In: 2016 23rd International Conference on Pattern Recognition (ICPR). 2016, pp. 2497–2502.

[41] Zhimian Zhang, Haipeng Wang, Feng Xu, and Ya-Qiu Jin. “Complex-Valued Convolutional Neural Network and Its Application in Polarimetric SAR Image Classification”. In: IEEE Transactions on Geoscience and Remote Sensing 55.12 (2017), pp. 7177–7188.

[42] Hongyuan Zhu, Fanman Meng, Jianfei Cai, and Shijian Lu. “Beyond pixels: A comprehensive survey from bottom-up to semantic image segmentation and cosegmentation”. In: Journal of Visual Communication and Image Representation 34 (2016), pp. 12–27.
