
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2018

Transforming Thermal Images to Visible Spectrum Images using Deep Learning

Adam Nyberg


Master of Science Thesis in Electrical Engineering

Transforming Thermal Images to Visible Spectrum Images using Deep Learning

Adam Nyberg
LiTH-ISY-EX–18/5167–SE

Supervisor: Abdelrahman Eldesokey

isy, Linköpings universitet

David Gustafsson

FOI

Examiner: Per-Erik Forssén

isy, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2018 Adam Nyberg


Abstract

Thermal spectrum cameras are gaining interest in many applications due to their long wavelength which allows them to operate under low light and harsh weather conditions. One disadvantage of thermal cameras is their limited visual interpretability for humans, which limits the scope of their applications. In this thesis, we try to address this problem by investigating the possibility of transforming thermal infrared (TIR) images to perceptually realistic visible spectrum (VIS) images by using Convolutional Neural Networks (CNNs). Existing state-of-the-art colorization CNNs fail to provide the desired output as they were trained to map grayscale VIS images to color VIS images. Instead, we utilize an auto-encoder architecture to perform cross-spectral transformation between TIR and VIS images. This architecture was shown to quantitatively perform very well on the problem while producing perceptually realistic images. We show that the quantitative differences are insignificant when training this architecture using different color spaces, while there exist clear qualitative differences depending on the choice of color space. Finally, we found that a CNN trained from day time examples generalizes well on tests from night time.


Acknowledgments

First I would like to thank FOI for providing me with the opportunity to do my master thesis at FOI. A big thanks to my supervisors David Gustafsson at FOI and Abdelrahman Eldesokey at Linköping University for supporting me with both technical details and the thesis writing. I would also like to thank my examiner who was responsive, supportive and believed in the thesis. An extra thanks goes to Henrik Petersson, David Bergström and Jonas Nordlöf at FOI for being supportive and helping me with the hardware and other issues. A final thanks is addressed to my fellow master thesis students at FOI who helped make the time at FOI wonderful.

Linköping, June 2018 Adam Nyberg


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem formulation
  1.3 Research questions
  1.4 Related work
  1.5 Limitations
  1.6 Thesis outline

2 Background
  2.1 Camera calibration
  2.2 Image registration
  2.3 Image representation
  2.4 Artificial Neural Network (ANN)
  2.5 Convolutional Neural Network (CNN)
    2.5.1 Convolutional layer
    2.5.2 Pooling layer
    2.5.3 Objective function
    2.5.4 Training a CNN
    2.5.5 Dropout layer
    2.5.6 Batch normalization

3 Method
  3.1 Data acquisition system
    3.1.1 Camera calibration
    3.1.2 Image registration
    3.1.3 Data collection
    3.1.4 Data split
  3.2 CNN models
    3.2.1 Gray to color model
    3.2.2 Thermal to color model
  3.3 Pansharpening

4 Experiments
  4.1 Evaluation metrics
  4.2 Experimental setup
  4.3 Experiments
    4.3.1 Pretrained colorization model on TIR images
    4.3.2 Pretrained TIR to VIS transformation
    4.3.3 Color space comparison
    4.3.4 Night time generalization
  4.4 Quantitative results
  4.5 Qualitative analysis

5 Discussion
  5.1 Method
  5.2 Results

6 Conclusions
  6.1 Future work

Bibliography


1 Introduction

Thermal infrared (TIR) cameras are becoming increasingly popular due to their long wavelengths, which allow them to work under low-light conditions. TIR cameras require no active illumination as they sense emitted heat from objects. This opens up many applications, such as driving in complete darkness or bad weather. However, one disadvantage of TIR cameras is their limited visual interpretability for humans. In this thesis, we investigate the problem of transforming TIR images, specifically long-wavelength infrared (LWIR), to visible spectrum (VIS) images. Figure 1.1 shows an example of a TIR image and the corresponding VIS image. The problem of transforming TIR images to VIS images is inherently challenging as the two do not contain the same information in the electromagnetic spectrum. Therefore, one TIR image can correspond to multiple different VIS images; e.g. blue and green balls of the same material and temperature would look almost identical in a LWIR image but very different in a VIS image.

1.1 Motivation

Transforming TIR images to VIS images can open up a wide variety of applications, for example enhanced night vision while driving in the dark. Another possible application is detecting objects in the transformed TIR images that are difficult to see in regular TIR images. This thesis was conducted at the Swedish Defence Research Agency (FOI), which has three main applications in mind for the transformation of TIR images:

– Enhanced night vision where users should be able to detect and classify targets by looking at the transformed TIR images.

– Detection of Improvised Explosive Devices (IED) in natural environments such as fields, forests or along roads. The transformed TIR images can potentially improve the user's ability to detect IEDs.

– Improved ability to drive at night by making it easier to find stones and other obstacles that are hard to see using only VIS. The intended use may be while driving off-road or on roads with little traffic.

Figure 1.1: A pair of thermal and visible spectrum images from the KAIST-MS dataset [24]. (a) A LWIR image. (b) A visible spectrum image.

1.2 Problem formulation

FOI has developed a co-axial imaging system that makes it possible to capture both TIR and VIS images with pixel-to-pixel correspondence. Figure 1.2 shows the principle of this co-axial imaging system. The dichroic filter is a very accurate color filter that selectively refracts one range of the electromagnetic spectrum while reflecting other ranges. This thesis investigates different supervised learning methods for the transformation from TIR images to VIS images. To achieve this, we investigate the process of geometric calibration and image registration between the thermal and visible spectrum cameras so that a pixel-to-pixel correspondence is possible between the image pairs. The robustness of calibration and registration is crucial for the learning process; therefore, we evaluate them using a special calibration board that is visible in both the visible and infrared spectrum. Furthermore, we develop a GUI application that can be used when collecting image pairs and for transforming a stream of TIR images into VIS images. Finally, since collecting VIS images for training under low-light conditions is not feasible, we investigate the generalization capabilities of the proposed methods by training them on image pairs from daytime and evaluating on images from night time.


Figure 1.2: A co-axial imaging system with one LWIR camera and one visible spectrum camera sharing the same optical axis. The thermal light is reflected and the visible light is passed through a hot mirror, also called a dichroic filter. A hot mirror is designed to refract visible wavelengths and reflect thermal spectrum wavelengths. (a) Top-down view of the system. (b) View from behind with the system mounted in a car.

1.3 Research questions

In this thesis, we investigate the following research questions:

1. How would an existing pretrained CNN model for colorizing grayscale images perform on the task of transforming TIR images to VIS images?

2. Which color space is more suitable for the task of TIR to VIS transformation?

3. Does a model trained on thermal images from daytime generalize well on images from night time?

1.4 Related work

Colorizing grayscale images has been extensively investigated in the literature. Scribble-based methods [22] require the user to manually apply strokes of color to different regions of a grayscale image, and the algorithm makes the assumption that neighboring pixels with the same luminance should have the same color. Transfer techniques [33] use the color palette from a reference image and apply that palette to a grayscale image by matching luminance and texture. For example, when colorizing a grayscale image of a forest, the user would typically have to provide another image of a forest with an appropriate color palette. Both scribbles and transfer techniques require manual input for colorization. Other colorization methods are based on automatic transformation, i.e. the only input to the method is the greyscale image without explicit information about colors. Recently, promising results have been published in the area of automatic transformation using Convolutional Neural Networks (CNNs), owing to their ability to model semantic representations in images [5, 13, 19, 34].

Figure 1.3: Illustration of the relative positions of the visible spectrum, NIR, short-wavelength infrared (SWIR) and LWIR in the electromagnetic spectrum.

In the infrared spectrum, less research has been done specifically on transforming TIR images to VIS images. Limmer and Lensch [23] proposed a CNN to transform near-infrared (NIR) images into VIS images by training on a dataset that contains images of highway roads and surroundings. Their method was shown to perform well as NIR and VIS images are highly correlated. Contrarily, LWIR images are less correlated with VIS images, which makes the problem more challenging. Figure 1.3 shows the distribution of different IR spectra and how close they are to the visible spectrum. The KAIST-MS dataset [24] is a large dataset consisting of LWIR and VIS image pairs. It was collected by mounting both a LWIR camera and a visible spectrum camera on a car and capturing images during both day and night in city environments. Kniaz et al. [17] explored CNN-based transformation of synthetic TIR images. Generative Adversarial Networks (GANs) [9] have also shown promising results in colorization when used on greyscale VIS images [4], NIR images [29, 30] and thermal images [15].

The most relevant work is a CNN-based method by Berg et al. [2], which trains a model to transform TIR images to VIS images. The KAIST-MS dataset was used to train and evaluate the model, and the model was able to produce perceptually realistic results. In addition, they compared the CNN model to a naive approach of using an existing greyscale colorization method directly on the thermal images. In contrast to this, we introduce a new dataset captured using our co-axial imaging system. The new dataset is used to train a CNN-based method that transforms TIR images to VIS images. We also compare the result of converting the VIS image into different color spaces, specifically RGB and CIELAB. Finally, we provide a qualitative analysis of the generalization capabilities of the proposed methods under different lighting conditions.


1.5 Limitations

This thesis is limited by a number of factors, most significantly time and hardware. The thesis was conducted during 5 months, with about half of that time spent on collecting and processing the dataset and the rest spent on training and evaluating different models. Because the training time of the models used was long, no significant time was spent on fine-tuning hyperparameters. The dataset collected only contains scenes that include fields and forests along rural roads. By limiting the types of scenes, a relatively small dataset was enough, which in turn decreased collection and training time. Another limiting factor is that the night time TIR dataset only contains one scene and no VIS images of that scene, which makes evaluating research question 3 hard.

FOI provided a computer equipped with a GeForce GTX 1080 Ti graphics card. In addition, FOI also provided a thermal and a visible spectrum camera mounted in a co-axial system. The system did not include any hardware for syncing the cameras. Another limitation of the system was that it was not possible to guarantee that the cameras were mounted at the exact same position or angle.

1.6 Thesis outline

A basic introduction to the theory and associated background is given in chapter 2; this includes camera calibration, image registration, image representation and basic concepts used for training and inference of CNNs. Chapter 3 describes the method used to collect data, the processing of the data and the models used in the experiments. Chapter 4 presents the experimental setup, the specific experiments, quantitative results and qualitative analysis. An analysis of the method and results is given in chapter 5. Finally, chapter 6 presents the conclusions of this thesis together with a section about future work.


2 Background

This chapter includes the basic theory used in this thesis. The chapter starts by describing how camera calibration and image registration are done, followed by the theory behind digital image representation and color spaces. Lastly, this chapter introduces the theory behind CNNs.

2.1 Camera calibration

Geometric camera calibration is the process of estimating parameters of the lens and the sensor of a camera [35]. The camera parameters can be divided into intrinsics, extrinsics, and distortion coefficients. The intrinsic parameters, which include the focal length (f_x, f_y) and the optical center (c_x, c_y), are unique for every camera and they are commonly grouped into a camera model matrix expressed as

M = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}.   (2.1)

The extrinsic parameters determine the rotation and translation from world coordinates to camera coordinates [12] and are expressed as

E = \begin{bmatrix} R & t \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}   (2.2)

where R is a rotation matrix and t is a translation vector. To go from world points (X, Y, Z, 1) to pixel points (x, y, 1) the following formula is used

\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = w M E \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}   (2.3)

where w is a scale factor. The distortion coefficients are used to estimate image distortions due to the camera lens. Distortion will make straight objects in the real world look curved in the acquired images, as shown in figure 2.1.

Figure 2.1: Illustration of two different types of distortions. (a) Barrel distortion. (b) Pincushion distortion.
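As a minimal illustration of equations 2.1–2.3, the sketch below builds a camera matrix and an extrinsic matrix from made-up values and projects a homogeneous world point to pixel coordinates; none of the numbers correspond to the cameras calibrated in this thesis.

```python
import numpy as np

# Intrinsic camera matrix M (equation 2.1); the values are made up for illustration.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 256.0
M = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Extrinsics E = [R | t] (equation 2.2): identity rotation and a small translation.
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])
E = np.hstack([R, t])                      # 3x4 matrix

# Homogeneous world point (X, Y, Z, 1).
X_world = np.array([1.0, 0.5, 10.0, 1.0])

# Projection (equation 2.3); the scale factor ends up in the third component.
p = M @ E @ X_world
x, y = p[0] / p[2], p[1] / p[2]
print(f"pixel coordinates: ({x:.1f}, {y:.1f})")
```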

2.2 Image registration

Image registration is the process of mapping multiple images into a reference coordinate system [36]. The images can be taken from different sensors or from the same sensor but from different angles. By finding common points in these images, it is possible to register them to the same coordinate system by estimating a transformation. The parameters for this transformation are represented as a homography matrix and estimated from at least four corresponding points. The transformation can be expressed as

s \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix},   (2.4)

p_1 = s \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix},   (2.5)

p_2 = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}   (2.6)

where s is a scaling factor, H is the 3x3 homography matrix, (x', y') are the coordinates of a point in the target image and (x, y) are the source coordinates being transformed into the target coordinate system [36]. One way to evaluate the registration is to calculate the transfer error. The transfer error used in this thesis is defined as

\frac{1}{n} \sum_{i=1}^{n} d(p_1^i, p_2^i),   (2.7)

where n is the number of points being evaluated, p_1^i is the measured point, p_2^i is the corresponding point mapped from the first image and d(\cdot) is the Euclidean distance between the points [11].
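The homography mapping and the transfer error of equations 2.4–2.7 can be expressed in a few lines; the sketch below uses an assumed homography and made-up point correspondences purely for illustration.

```python
import numpy as np

def apply_homography(H, pts):
    """Map Nx2 source points through a 3x3 homography (equation 2.4)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # to homogeneous coordinates
    mapped = (H @ pts_h.T).T
    return mapped[:, :2] / mapped[:, 2:3]               # divide by the scale factor s

def transfer_error(H, src_pts, dst_pts):
    """Mean Euclidean distance between mapped and measured points (equation 2.7)."""
    mapped = apply_homography(H, src_pts)
    return np.mean(np.linalg.norm(mapped - dst_pts, axis=1))

# Assumed homography: a pure translation of 2 pixels in x, for illustration only.
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
src = np.array([[10.0, 20.0], [30.0, 40.0]])
dst = np.array([[12.5, 20.0], [32.0, 40.5]])
print(f"transfer error: {transfer_error(H, src, dst):.2f} pixels")
```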

2.3 Image representation

Digital images are stored in matrices with dimensions (W, H, C) where W and H are the width and height in pixels and C is the number of channels. The number of discrete colors that can be represented in a digital image depends on the number of bits used per pixel. For example, an 8-bit-per-pixel grayscale image can represent 2^8 = 256 shades of grey, and a color image of 24 bits per pixel (8 bits times 3 channels) can represent 2^24 = 16,777,216 colours [31].

There exist multiple ways of representing color within three channels. The red, green, blue (CIERGB) color space was standardized in the 1930s by the Commission internationale de l'éclairage (CIE). The CIERGB color space required mixing of negative light, so the CIE also developed another color space called XYZ or CIEXYZ, where the transformation from CIERGB to XYZ is defined as

\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \frac{1}{0.17697} \begin{bmatrix} 0.49 & 0.31 & 0.20 \\ 0.17697 & 0.81240 & 0.01063 \\ 0.00 & 0.01 & 0.99 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}   (2.8)

The CIEXYZ color space separates the luminance (Y axis) from the chrominance (X and Z axes) because humans perceive luminance and chrominance differently. Since humans are much more sensitive to changes in luminance than in chrominance, CIE developed CIELAB, which is a non-linear re-mapping of the XYZ color space that compensates for this perceived difference, where L represents the luminance and AB represent the chrominance. The three components of the CIELAB color space are defined as:


Figure 2.2: The HSL and HSV color spaces mapped to cylinders. (a) HSL color space. (b) HSV color space.

L = 116 f\left(\frac{Y}{Y_n}\right),   (2.9)

f(t) = \begin{cases} t^{1/3} & t > \delta^3 \\ t/(3\delta^2) + 2\delta/3 & \text{otherwise,} \end{cases}   (2.10)

A = 500 \left( f\left(\frac{X}{X_n}\right) - f\left(\frac{Y}{Y_n}\right) \right),   (2.11)

B = 200 \left( f\left(\frac{Y}{Y_n}\right) - f\left(\frac{Z}{Z_n}\right) \right)   (2.12)

where Y_n is the luminance value for nominal white, X_n and Z_n are the corresponding chromaticity values, and f is a finite-slope approximation to the cube root with δ = 6/29 [31]. The hue-chroma-luminance (HCL) color space is based on the CIELAB color space but translated into polar coordinates [26]. Other color spaces are HSL (hue, saturation, lightness) and HSV (hue, saturation, value), which also reflect perceived changes in color better than the RGB color space. A visualization of the HSL and HSV color spaces can be seen in figure 2.2.
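The sketch below follows equations 2.8–2.12 as written above to convert an RGB triplet to CIELAB; the reference white is assumed to be the XYZ of RGB = (1, 1, 1), and note that the commonly used CIE definition additionally subtracts an offset of 16 from L.

```python
import numpy as np

# CIERGB -> CIEXYZ (equation 2.8).
RGB_TO_XYZ = (1.0 / 0.17697) * np.array([[0.49,    0.31,    0.20],
                                         [0.17697, 0.81240, 0.01063],
                                         [0.00,    0.01,    0.99]])

def f(t, delta=6.0 / 29.0):
    """Finite-slope approximation to the cube root used in CIELAB (equation 2.10)."""
    return np.where(t > delta ** 3, np.cbrt(t), t / (3 * delta ** 2) + 2 * delta / 3)

def rgb_to_lab(rgb):
    """Convert a CIERGB triplet in [0, 1] to CIELAB via XYZ (equations 2.9-2.12).

    The reference white is taken as the XYZ of RGB = (1, 1, 1); the thesis does
    not state which white point was used, so this is an assumption.
    """
    X, Y, Z = RGB_TO_XYZ @ np.asarray(rgb, dtype=float)
    Xn, Yn, Zn = RGB_TO_XYZ @ np.ones(3)
    L = 116.0 * f(Y / Yn)
    A = 500.0 * (f(X / Xn) - f(Y / Yn))
    B = 200.0 * (f(Y / Yn) - f(Z / Zn))
    return float(L), float(A), float(B)

print(rgb_to_lab([0.5, 0.5, 0.5]))   # mid grey: L around 92, A and B equal to 0
```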

2.4 Artificial Neural Network (ANN)

An ANN is an acyclic layered graph consisting of nodes and weights, where input is fed into the network at the first layer and the model's predictions come out at the last layer. These networks are also called multilayer perceptrons (MLPs). In a fully connected network, all nodes are connected with weights to all nodes in both the previous and the next layer. When training ANNs, the weights between nodes are learned using algorithms such as backpropagation [10].


2.5 Convolutional Neural Network (CNN)

For high dimensional data, such as images, fully-connected ANNs would require an enormous number of weights to train, which is inefficient. Instead, Convolutional Neural Networks (CNNs) utilize weight sharing to reduce the number of weights that need to be trained and make the process efficient [10]. A CNN, compared to an ANN, is more capable of abstracting and finding patterns while also requiring less training data. For instance, a fully connected network trained to identify a bird would need training data showing all the possible positions a bird can appear in within an image. A CNN, on the other hand, would utilize a convolutional layer to learn the shape of a bird and can then identify a bird at any position in the image [18].

2.5.1 Convolutional layer

The major element of CNNs is the convolutional layer, which uses kernels to traverse the input and produce feature maps. The feature maps and architecture of a CNN can be seen in figure 2.3. The weights of the kernel are modified when the network is training [20]. The convolutional layer has four hyperparameters:

• the number of filters K,
• the filter spatial extent (width, height) F,
• the stride S,
• the amount of zero padding P.

Given an input of size (W_1, H_1, D_1) the convolutional layer outputs an array of size (W_2, H_2, D_2) where

W_2 = \frac{W_1 - F + 2P}{S} + 1,   (2.13)

H_2 = \frac{H_1 - F + 2P}{S} + 1,   (2.14)

D_2 = K.   (2.15)

The filters are the matrices containing the weights that are convolved over the image. The stride is the distance between two consecutive filter positions, and zero padding adds a border of zeros around the image in order to increase the spatial dimension of the output. A dilated kernel provides an expansion of the receptive field without loss of resolution by adding space between points in the kernel [6].
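The output size of a convolutional layer (equations 2.13–2.15) can be computed directly; the helper below uses example layer parameters chosen only for illustration.

```python
def conv_output_size(w1, h1, k, f, s, p):
    """Output dimensions of a convolutional layer (equations 2.13-2.15)."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    d2 = k
    return w2, h2, d2

# A 3x3 kernel with stride 1 and 1 pixel of zero padding keeps the spatial size.
print(conv_output_size(320, 256, k=64, f=3, s=1, p=1))   # (320, 256, 64)
# The same kernel with stride 2 halves the spatial size.
print(conv_output_size(320, 256, k=64, f=3, s=2, p=1))   # (160, 128, 64)
```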


Figure 2.3: A visualization of a simple convolutional network. The blue squares are feature maps and the grey squares represent filters.

2.5.2 Pooling layer

Pooling layers are usually used to reduce the size of feature maps and to enable deep layers to have a lower-level perspective of the data. This is performed by applying a pooling operator to the feature map in a sliding-window manner with a specific stride. There are several pooling operators, but the most common are max and average pooling. Max pooling may perform better than average pooling because it responds more to high signal activations [3]. The pooling process reduces the number of parameters and the computational complexity of the network and also helps the model to find larger structures if the kernel size is constant [21].

2.5.3 Objective function

The objective function steers the direction of the weights during training. This function is chosen depending on the task being solved by the network. The two most common objective functions are the L_1 and L_2 losses, which are defined as

L_1 = \frac{1}{W \cdot H \cdot C} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{c=1}^{C} |\hat{y}_{ij}^c - y_{ij}^c|,   (2.16)

L_2 = \frac{1}{W \cdot H \cdot C} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{c=1}^{C} (\hat{y}_{ij}^c - y_{ij}^c)^2,   (2.17)

where \hat{y}_{ij}^c is the predicted value, y_{ij}^c is the target value, (W, H) are the width and height in pixels and C is the number of channels per pixel [21].

Different domains require different objective functions; for example, when using the L_2 function, the network will be more sensitive to outliers and will tune the weights to dampen those outliers. For image generation problems, Structural Similarity (SSIM) is a good metric for the similarity between images and is defined as


SSIM(x, y) = \frac{1}{M} \sum_{j=1}^{M} \frac{(2\mu_{x_j}\mu_{y_j} + C_1)(2\sigma_{x_j y_j} + C_2)}{(\mu_{x_j}^2 + \mu_{y_j}^2 + C_1)(\sigma_{x_j}^2 + \sigma_{y_j}^2 + C_2)}   (2.18)

where x and y are two images, M is the number of positions for a sliding window, \mu_{x_j} and \mu_{y_j} are the average luminance values within sliding window j, \sigma_{x_j}^2 and \sigma_{y_j}^2 are the variances within sliding window j, \sigma_{x_j y_j} is the covariance of x and y within sliding window j, and C_1 and C_2 are constants. MS-SSIM extends SSIM by repeatedly downsizing the image and computing SSIM at each scale. This makes MS-SSIM less dependent on the scale of the image [32]. From SSIM it is possible to derive the Structural Dissimilarity (DSSIM), which can be used as an objective function and is defined as

DSSIM(x, y) = \frac{1 - SSIM(x, y)}{2}.   (2.19)

It is also possible to change the regression problem into a classification problem by quantization of the color space as demonstrated by Zhang et al. [34].
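As a minimal illustration of these objective functions, the sketch below implements the L1 and L2 losses together with a simplified single-window version of SSIM and DSSIM computed over the whole image; equation 2.18 and the thesis use a sliding window, so this is only an approximation.

```python
import numpy as np

def l1_loss(y_pred, y_true):
    """Mean absolute error over all pixels and channels (equation 2.16)."""
    return np.mean(np.abs(y_pred - y_true))

def l2_loss(y_pred, y_true):
    """Mean squared error over all pixels and channels (equation 2.17)."""
    return np.mean((y_pred - y_true) ** 2)

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM computed over the whole image instead of sliding windows."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def dssim(x, y):
    """Structural dissimilarity (equation 2.19) based on the simplified SSIM."""
    return (1.0 - ssim_global(x, y)) / 2.0

rng = np.random.default_rng(0)
a = rng.random((64, 64))
print(dssim(a, a))          # identical images give 0.0
print(dssim(a, 1.0 - a))    # an inverted image gives a value close to 1.0
```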

2.5.4 Training a CNN

Before training a CNN, all weights must be initialized to some value. One way is to draw the initial weights from a statistical distribution, which has been shown to produce low error rates and fast convergence [8]. Training of CNNs can be done using the backpropagation algorithm [21]. The basic principle is to do a forward pass through the network and then compare the output with the true value in an objective function, also called a loss function. The loss function returns an error, and by using the chain rule it is possible to derive each weight's contribution to the loss. Then stochastic gradient descent is performed to calculate updates to the weights that will minimize the loss function. The ADAM optimizer has been shown to perform well [16] as a method for gradient descent optimization. The ADAM optimizer uses a combination of two techniques to improve gradient descent. The first one is to use individual learning rates for all weights. The second technique is to include estimates of the first and second moments of the gradients. The network is trained by iterating the algorithm described above over the training samples [7].

2.5.5 Dropout layer

One of the major risks during CNN training is overfitting, where the trained model fits the training data too closely. There exist several methods to prevent overfitting, but one of the methods that has been shown to perform very well is dropout. In dropout layers, for every training iteration a portion of the neurons are removed and their weights are not updated. This helps prevent the network from overfitting to the training data and improves its generalization capabilities. The layer operates by dropping neurons with probability p. Dropout layers should only be active during training; during test time all weights are used [27].

2.5.6 Batch normalization

Batch normalization addresses the variance in layer inputs by normalizing each mini-batch. Each mini-batch is normalized to zero mean and standard deviation 1. The statistics calculated for each mini-batch during training are then used during inference. This allows for faster training, decreases the importance of good initial weights and in some cases removes the need for dropout layers as a regularizer [14].


3 Method

This chapter describes the method used to answer the research questions. First, all steps of the preprocessing of the data are specified, including camera calibration and image registration. Finally, the models used are described in detail.

3.1 Data acquisition system

Two cameras were used for all data collection: a thermal camera, FLIR A65¹, and a visible spectrum camera, XIMEA MC023CG-SY². Technical details about the cameras can be seen in table 3.1. The fields of view (FOV) of the cameras are different: the thermal camera has a wider vertical angle and the visible spectrum camera a wider horizontal angle. Therefore, the images need to be cropped after registration. An illustration of the resulting FOV can be seen in figure 3.1. The co-axial system was assembled by hand and therefore perfect alignment of the cameras was not guaranteed.

3.1.1 Camera calibration

The calibration of both cameras, LWIR and VIS, was done using the OpenCV library³. Images of a chessboard were taken from different angles, covering all parts of the image; a total of 23 images were used for the LWIR camera and 37 images for the VIS camera. The chessboard used for calibrating the LWIR camera was made with two materials of different reflectivity. Examples of images used for the calibration can be seen in figure 3.2. We do not care about the camera's world position or the actual size of objects because of the assumption that the two cameras have the same position.

¹ http://www.flir.co.uk/automation/display/?id=56345
² https://www.ximea.com/en/products/usb-31-gen-1-with-sony-cmos-xic/mc023cg-sy
³ https://docs.opencv.org/3.1.0/dc/dbb/tutorial_py_calibration.html


                    FLIR A65         XIMEA MC023CG-SY
Resolution          640 x 512        1936 x 1216
FOV                 24.6° x 19.8°    25.6° x 16.2°
Pixel pitch         17 µm            5.86 µm
Sample frequency    30 Hz            5 Hz
Bit depth           8                24

Table 3.1: Specifications of the cameras used in the data acquisition system. The cameras can be configured to use other bit depths and sample frequencies, but this table shows the settings used in this thesis.

Therefore, only the intrinsics and distortion coefficients were estimated in the calibration. The intrinsic camera matrix and the distortion coefficients were then stored and a transformation was applied to all images.
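A minimal sketch of the OpenCV calibration flow described above is shown below; the chessboard pattern size and the file paths are assumptions and not the exact values used in this thesis.

```python
import glob
import cv2
import numpy as np

# Hypothetical inner-corner count and image paths, for illustration only.
pattern_size = (9, 6)
obj_p = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
obj_p[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calibration/vis/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(obj_p)
        img_points.append(corners)

# Estimate the intrinsic matrix and distortion coefficients; extrinsics are discarded.
ret, mtx, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

# Undistort an image with the stored parameters.
img = cv2.imread("calibration/vis/example.png")
undistorted = cv2.undistort(img, mtx, dist)
```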

3.1.2 Image registration

When doing image registration, the most important part is to find common points in both images. To ensure a good registration, multiple images of the chessboard were taken at different distances. The chessboard used has 88 corners and at least 5 images were used in every registration, resulting in at least 440 common points for estimating the homography matrix. Image registration was done before and after every collection of data to ensure good registration. The quality of the registration was measured by taking new images of the chessboard and then calculating the transfer error as described in section 2.2. For all collections of data the transfer error was between 0.8 and 2.2 pixels.

3.1.3 Data collection

The co-axial imaging system was mounted on a car facing sideways. About 5 VIS images per second and 30 TIR images per second were captured together with timestamps in milliseconds. The car was driving between 10 km/h and 30 km/h, which meant that there was no reason to collect more than 5 VIS images per second. For simplicity, the cameras were set to capture 8 bits per channel. Every VIS image was then paired with the TIR image that was closest in time, using the timestamps. The average time difference between the paired images in the dataset was 9.3 milliseconds. The dataset was collected on three different occasions, in both cloudy and sunny weather. Examples of images from the dataset can be seen in figure 3.3.
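A sketch of how such timestamp-based pairing could be implemented is shown below; the timestamps are made up, and the exact pairing code used in the thesis is not available.

```python
import numpy as np

def pair_by_timestamp(vis_times_ms, tir_times_ms):
    """For every VIS frame, pick the index of the TIR frame closest in time.

    Both inputs are assumed to be arrays of timestamps in milliseconds.
    """
    tir_times_ms = np.asarray(tir_times_ms)
    pairs = []
    for t in vis_times_ms:
        idx = int(np.argmin(np.abs(tir_times_ms - t)))
        pairs.append((t, idx, abs(int(tir_times_ms[idx]) - int(t))))
    return pairs

# Illustrative timestamps: VIS at roughly 5 Hz, TIR at roughly 30 Hz.
vis = [0, 200, 400]
tir = list(range(0, 430, 33))
for vis_t, tir_idx, diff in pair_by_timestamp(vis, tir):
    print(f"VIS at {vis_t} ms -> TIR frame {tir_idx} ({diff} ms apart)")
```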

After pairing all images, the VIS images were registered into the TIR image coordinates. This leaves the top and bottom parts of the resulting images empty because of the differences in FOV of the cameras. Images collected on different occasions resulted in cropped images with different spatial resolution because the cameras were not perfectly aligned. Therefore, all images were cropped and downsampled to a common resolution of 320x256 pixels.

Figure 3.1: Illustration of the view angles of the cameras. The red rectangle is the thermal camera and the green rectangle is the visible spectrum camera. The blue rectangle in the middle is the part of the images that was used when training the models.

Figure 3.2: Example images used for camera calibration. The top row shows images used for the VIS camera and the bottom row shows images used for the TIR camera.

3.1.4 Data split

A total of 9562 image pairs were collected and split into 5736 image pairs for training, 1913 for validation and 1913 for test. The training set is used for training the network, the validation set is used during the training to make sure that the network generalizes to unseen data and to evaluate different hyperparameters. The test set is used for the final evaluation of the model on unseen data. A dataset containing 701 TIR images taken during night time was also recorded. This dataset was only used for a qualitative evaluation.
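A possible way to produce such a split is sketched below; the thesis does not state exactly how the split was made (for example, whether it was random or by recording occasion), so this is only an assumed implementation.

```python
import random

def split_dataset(pairs, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle image pairs and split them into train/validation/test sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_frac)
    n_val = int(len(pairs) * val_frac)
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(9562))
print(len(train), len(val), len(test))   # 5737 1912 1913, close to the split above
```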

3.2 CNN models

We test two models on the new dataset: a grayscale-to-color model and a thermal-to-color model. Both models were tested using TensorFlow [1] and are described in the following sections.

3.2.1 Gray to color model

The model used as a baseline is the existing colorization model proposed in Colorful Image Colorization [34]. The model takes a greyscale image as input and outputs a color image in the CIELAB color space, where the L (luminance) channel is taken directly from the greyscale image and the A and B channels are predicted. Table 3.2 provides a detailed description of the model used for inference. The model is pretrained on images from the ImageNet training set [25].

3.2.2 Thermal to color model

Berg et al. [2] proposed a model specifically designed for TIR to VIS transformation, which is based on a convolutional autoencoder architecture with skip connections. Figure 3.4 shows the autoencoder architecture including details about the network parameters.

3.3 Pansharpening

Pansharpening (PAN) [28] is the process of merging a high spatial resolution panchromatic image with a lower spatial resolution multispectral image to create a high spatial resolution color image. Because the transformed TIR images looked blurry and the input TIR image is sharp, a naive PAN method was used to increase the sharpness of the generated VIS images. The naive method involved a single step of replacing the L channel of the generated VIS image in CIELAB color space with the inverted L channel from the input TIR image.
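A sketch of this naive pansharpening step, using OpenCV's 8-bit CIELAB conversion, is shown below; the file names are hypothetical and the exact normalization used in the thesis is not stated.

```python
import cv2

def naive_pan_sharpen(generated_vis_bgr, tir_gray):
    """Replace the L channel of the generated VIS image (in CIELAB) with the
    inverted TIR image, as in the naive pansharpening step described above.

    Both images are assumed to be 8-bit and to have the same resolution.
    """
    lab = cv2.cvtColor(generated_vis_bgr, cv2.COLOR_BGR2LAB)
    inverted_tir = 255 - tir_gray          # bright (warm) areas become dark
    lab[:, :, 0] = inverted_tir            # the L channel carries the sharp detail
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

# Hypothetical file names used only for illustration.
vis = cv2.imread("generated_vis.png")                     # 3-channel model output
tir = cv2.imread("input_tir.png", cv2.IMREAD_GRAYSCALE)   # 1-channel thermal input
cv2.imwrite("generated_vis_pan.png", naive_pan_sharpen(vis, tir))
```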


Figure 3.3: Examples of images collected, with the TIR image to the left and the corresponding VIS image to the right. As seen in the images, the scenes were limited to fields and forests.


Layer     X    C    S   D  Sa  De  BN   L
input     224  3    -   -  -   -   -    -
conv1_1   224  64   1   1  1   1   -    -
conv1_2   112  64   2   1  1   1   Yes  -
conv2_1   112  128  1   1  2   2   -    -
conv2_2   56   128  2   1  2   2   Yes  -
conv3_1   56   256  1   1  4   4   -    -
conv3_2   56   256  1   1  4   4   -    -
conv3_3   28   256  2   1  4   4   Yes  -
conv4_1   28   512  1   1  8   8   -    -
conv4_2   28   512  1   1  8   8   -    -
conv4_3   28   512  1   1  8   8   Yes  -
conv5_1   28   512  1   2  8   16  -    -
conv5_2   28   512  1   2  8   16  -    -
conv5_3   28   512  1   2  8   16  Yes  -
conv6_1   28   512  1   2  8   16  -    -
conv6_2   28   512  1   2  8   16  -    -
conv6_3   28   512  1   2  8   16  Yes  -
conv7_1   28   256  1   1  8   8   -    -
conv7_2   28   256  1   1  8   8   -    -
conv7_3   28   256  1   1  8   8   Yes  -
conv8_1   56   128  .5  1  4   4   -    -
conv8_2   56   128  1   1  4   4   -    -
conv8_3   56   128  1   1  4   4   -    Yes

Table 3.2: The architecture of the colorization model. X is the spatial resolution of the output, C is the number of channels, S is the convolution stride where values greater than 1 denote downsampling and values less than 1 denote upsampling before the convolution, D is the kernel dilation, Sa is the accumulated stride across all preceding layers, De is the effective dilation of the layer with respect to the input, BN indicates whether batch normalization was used after the layer, and L indicates whether a 1x1 convolution and cross-entropy loss layer were used.


Figure 3.4: Illustration of the architecture used by Berg et al. [2]. The black block is the input image with dimensions (W, H, 1). The gray blocks each contain one convolution, batch normalization and LeakyReLU layer. The green blocks contain one convolution, batch normalization, dropout, concatenation and ReLU layer. The last two green blocks also contain one upsampling layer. The blue block contains a convolution, batch normalization and dropout layer. The last red block is the output image with dimensions (W, H, 3). The black arrows represent skip connections.


4 Experiments

This chapter presents the metrics used for evaluation and the experimental setup, including all necessary information about the training of the models. The chapter also specifies the experiments conducted, together with both the quantitative and qualitative results.

4.1 Evaluation metrics

In colorization tasks, root-mean-square error (RMSE), peak signal-to-noise ratio (PSNR) and Structural Similarity (SSIM), defined in equation 2.18, are commonly used. The RMSE is defined as

RMSE = \sqrt{MSE}   (4.1)

where MSE is defined as

MSE = \frac{1}{W \cdot H \cdot C} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{c=1}^{C} (\hat{y}_{ij}^c - y_{ij}^c)^2,   (4.2)

where \hat{y}_{ij}^c is the predicted value, y_{ij}^c is the target value, (W, H) are the width and height in pixels and C is the number of channels per pixel. The PSNR is defined as

PSNR = 10 \cdot \log_{10} \frac{MAX^2}{MSE}   (4.3)

where MAX is the maximal pixel value.
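The RMSE and PSNR metrics of equations 4.1–4.3 can be computed as below; the test data here is randomly generated for illustration only.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root-mean-square error over all pixels and channels (equations 4.1-4.2)."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def psnr(y_pred, y_true, max_value=1.0):
    """Peak signal-to-noise ratio in dB (equation 4.3)."""
    mse = np.mean((y_pred - y_true) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

# SSIM over sliding windows can be computed with skimage.metrics.structural_similarity.
rng = np.random.default_rng(0)
target = rng.random((256, 320, 3))
prediction = np.clip(target + rng.normal(scale=0.05, size=target.shape), 0.0, 1.0)
print(f"RMSE: {rmse(prediction, target):.3f}, PSNR: {psnr(prediction, target):.2f} dB")
```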


4.2 Experimental setup

All training and evaluation of the models was performed on a GeForce GTX 1080 Ti graphics card¹. The experiments were evaluated using the test dataset defined in section 3.1.4. The colorization model used pretrained weights, which were downloaded from GitHub². The TIR to VIS transformation model was trained with the same hyperparameters as in Berg et al. [2], as specified below:

• Batch size 16,
• 750 epochs,
• Learning rate 0.001,
• ADAM optimizer with parameters β1 = 0.9, β2 = 0.999 and ε = e^{-10} [16],
• LeakyReLU layers with α = 0.2,
• Dropout rate of 0.5,
• DSSIM loss function with a 4 × 16 pixel window size.

The training converged after about 700 epochs, which is why 750 epochs was chosen.

4.3 Experiments

A number of experiments were performed to be able to answer the research questions in section 1.3. The results for all experiments were evaluated by comparing the L_1, RMSE, PSNR and SSIM metrics as in [2].

4.3.1 Pretrained colorization model on TIR images

The first experiment is to use the pretrained colorization model, described in section 3.2.1, to colorize TIR images. The generated images were then evaluated.

4.3.2 Pretrained TIR to VIS transformation

This experiment aims to evaluate how the TIR to VIS transformation model, trained on the KAIST-MS dataset, would perform on the new dataset produced for this thesis. A pretrained model provided by the authors of [2] was used.

¹ https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti
² https://github.com/nilboy/colorization-tf


4.3.3 Color space comparison

In this experiment we train the TIR to VIS transformation model twice, using two different color spaces for the VIS images. The first training was done with the VIS images in RGB color space, using the DSSIM objective function on the three color channels. The second training was done by converting the VIS images into CIELAB color space. For the second training, the DSSIM objective function was used on the L channel (L_L) and the L_1 loss on the AB channels (L_AB). The total loss was then defined as

L_{LAB} = L_L + L_{AB} = DSSIM(y_L, \hat{y}_L) + L_1(y_{AB}, \hat{y}_{AB})   (4.4)

where the subscript L denotes the L channel and AB denotes the AB channels. All VIS image values were normalized to be between 0 and 1 in both the RGB and CIELAB training. The hyperparameters used for training are defined in section 4.2.
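A hedged sketch of the combined loss in equation 4.4, written against the tf.keras API, is shown below; it uses tf.image.ssim with its default square window rather than the 4 x 16 window used in the thesis, and the channel layout is an assumption.

```python
import tensorflow as tf

def lab_loss(y_true, y_pred):
    """Combined loss of equation 4.4: DSSIM on the L channel and L1 on the AB channels.

    Channel 0 is assumed to be L and channels 1-2 to be AB, with all values
    normalized to [0, 1]; inputs are assumed to be [batch, height, width, 3].
    """
    l_true, l_pred = y_true[..., :1], y_pred[..., :1]
    ab_true, ab_pred = y_true[..., 1:], y_pred[..., 1:]

    ssim_l = tf.image.ssim(l_true, l_pred, max_val=1.0)
    dssim_l = (1.0 - ssim_l) / 2.0                                      # equation 2.19
    l1_ab = tf.reduce_mean(tf.abs(ab_true - ab_pred), axis=[1, 2, 3])   # equation 2.16
    return dssim_l + l1_ab

# model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=lab_loss)
```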

4.3.4 Night time generalization

For this experiment a special dataset was used. The dataset includes TIR images captured during the night but no VIS images. The special dataset only contained one scene, but from different angles. This experiment aims to answer whether the TIR to VIS transformation method also works during night time. As it is difficult to collect VIS images during night, the evaluation of this experiment was done qualitatively, assessing how realistic the output images are. The model used was the TIR to VIS transformation model trained in RGB color space from experiment 4.3.3.

4.4 Quantitative results

Table 4.1 shows the results for all experiments. All metrics were calculated in the RGB color space with the values for the channels normalized between 0 and 1, with standard deviation denoted as ±. The colorization model performed the worst and the pretrained TIR to VIS transformation achieved a significant improvement over the colorization model. The trained TIR to VIS models in different color spaces outperformed the other models. The TIR-to-RGB model performed the best on all metrics except SSIM. It is not possible to conclude a single best score in the SSIM metric because the two best models, TIR-to-CIELAB and TIR-to-RGB, had similar results with a large deviation. Adding PAN sharpening to the trained TIR to VIS model did not improve the quantitative metrics but there exists a clear qualitative difference as seen in figure 4.5.

4.5 Qualitative analysis

Figure 4.1 shows example images for the different models. The figure shows that the colorization model has a strong tendency to paint the ground blue. The pretrained TIR to VIS transformation model found some structures, but the colors did not match the target VIS image. The TIR to VIS transformation model trained in RGB color space is able to produce the most perceptually realistic results on this dataset.


Experiment                                   L1             RMSE           PSNR           SSIM
Pretrained colorization model [34]           0.479 ±0.089   0.526 ±0.081   5.691 ±1.478   0.455 ±0.069
TIR-to-CIELAB trained on KAIST-MS [2]        0.321 ±0.105   0.394 ±0.112   8.478 ±2.682   0.544 ±0.060
TIR-to-CIELAB trained on our dataset         0.141 ±0.044   0.170 ±0.050   15.78 ±2.452   0.815 ±0.073
TIR-to-RGB trained on our dataset            0.115 ±0.043   0.147 ±0.046   17.05 ±2.522   0.811 ±0.067
TIR-to-RGB + PAN trained on our dataset      0.135 ±0.048   0.159 ±0.054   16.37 ±2.552   0.728 ±0.116

Table 4.1: Quantitative results calculated on channel values normalized between 0 and 1. The TIR-to-RGB model has the best L1, RMSE and PSNR, while TIR-to-CIELAB has the best SSIM. We can see that the TIR to VIS transformation models trained on our dataset outperform the other models and that there is no significant difference between the two color spaces.

Figure 4.2 shows a comparison of the two best TIR to VIS transformation models. The model trained in RGB color space tends to have more saturated colors than the model trained in CIELAB color space. Both models produce perceptually realistic colors, but the images are blurry. Figure 4.3 illustrates some handpicked examples where the models fail. The dataset contains few houses and, therefore, the models do not know how to accurately transform them to VIS images. The second row shows a problem of the models not recognizing the change from field to forest.

Generalization capabilities under night conditions. Figure 4.4 shows transformations of night TIR images. These images differ significantly from the rest of the dataset, which resulted in poor quality of the transformed images. The first two rows show somewhat realistic transformations, while the last three rows show instances where all models fail.

PAN sharpening. Since the results are blurry, a naive version of PAN sharpening was applied to the generated VIS images. Figure 4.5 shows both the result from the TIR to VIS transformation model and the result after applying PAN sharpening.


Figure 4.1: Comparison between images generated by the colorization model, the pretrained TIR to VIS model and the retrained TIR to VIS transformation model trained in RGB color space. (a) TIR input. (b) Pretrained colorization model. (c) TIR-to-CIELAB trained on KAIST-MS [2]. (d) TIR-to-RGB trained on our dataset. (e) Target VIS.


Figure 4.2: Comparison between images generated by the TIR to VIS transformation model trained in the two color spaces RGB and CIELAB. (a) TIR input. (b) RGB. (c) CIELAB. (d) Target VIS.


Figure 4.3: Examples of bad transformations; the first row shows the problem of transforming houses and the second row shows the problem of distinguishing between field and forest. (a) TIR input. (b) RGB. (c) CIELAB. (d) Target VIS.


Figure 4.4: Comparison between images generated from night TIR images. The retrained TIR to VIS models have problems with transforming humans and with scenes where the sky is not visible in the TIR image. (a) TIR input. (b) RGB. (c) CIELAB.


Figure 4.5: Illustration of the increase in sharpness and detail from adding PAN sharpening. These examples clearly show how the trees tend to be brighter when PAN sharpening is added. (a) TIR input. (b) RGB. (c) RGB + PAN sharpening. (d) Target VIS.


5 Discussion

This chapter discusses both the method used and the results in this thesis.

5.1 Method

This thesis investigates different supervised methods, which means that the training data is of significant importance. Therefore, a new dataset containing pairs of TIR and VIS images was collected using a co-axial camera system. During testing of the system, the transfer error between features in the TIR and VIS images was on average about 1 pixel. However, when visually inspecting the collected images it is possible to see a transfer error as high as 7 pixels in some image pairs. This is due to two main reasons. The first reason is that there was no hardware for syncing the cameras, combined with the fact that the vehicle was moving during collection. This resulted in image shifts in the horizontal direction. This was mitigated as much as possible by collecting the timestamp of the images; all TIR images were then paired with the VIS image that had the least time difference. The second reason for the high transfer error in the collected dataset was that the two cameras were mounted by hand in the co-axial system, resulting in the two cameras not being perfectly aligned and creating parallax effects. This was a problem because the images were registered at a distance of about 10 meters, where parallax is significant, while the scenes captured in the dataset ranged from 30 to 500 meters.

When evaluating the question of which color space to use when training a TIR to VIS model, we used different objective functions. It is possible to make the argument that no conclusions can be drawn when changing multiple hyperparameters, such as the objective function. However, to achieve the best result for each color space we decided to use the most appropriate objective function for that color space. Therefore, using different objective functions is valid in order to achieve the best possible result for each color space.

5.2 Results

Good results in the quantitative metrics also seem to correspond to perceptually good results in the qualitative analysis, which means that the chosen metrics were suitable for the task. It is quite obvious that the pretrained colorization model did not perform well at all. The L_1 and RMSE errors are off by about 0.5 for values normalized between 0 and 1. This can be seen in figure 4.1, where the colorization model colorizes the ground in all TIR images with shades of blue. The pretrained TIR to VIS model by Berg et al. [2] performs better than the colorization model but still has issues with finding structures and applying accurate colors. This is most likely a result of training the model on a dataset with completely different types of scenes. The KAIST-MS dataset used for training mostly contains streets in a city environment, compared to the fields and forests used for evaluation. Most generated images look brown, with some images having accurate colors for the bright sky. The models trained on the new dataset had the best performance.

Training the same model in different color spaces produces similar but perceptually different results. It is interesting to note that the quantitative metrics are almost identical between the TIR-to-RGB and TIR-to-CIELAB models. The TIR-to-RGB version had the best performance for L_1, RMSE and PSNR, while the SSIM metric was similar for both the TIR-to-RGB and TIR-to-CIELAB versions. The two versions seem to find the same semantic structures, but we can see that there are clear color differences between the two versions in figure 4.2. The TIR-to-RGB version produces more saturated colors, which could be because the TIR-to-CIELAB version relies more on the L channel than on the AB channels. A common problem for both models is that they produce some images with the forest transformed as field, which can be seen in the second row of figure 4.3. This is because in some images there is no clear luminance difference between the field and the trees in the TIR images. Another problem is the lack of some objects, e.g. houses, in the training set. This results in incorrect transformations, as shown in the first row of figure 4.3 where the model attempts to transform the house to the color of fields. Given enough training examples, the models should be able to transform houses accurately because they are easy to distinguish in both TIR and VIS images. Another reason for the problem with houses is the transfer error, which may affect sharp edges like those of houses more than the relatively soft edges of a field.

Assessing the night images shown in figure 4.4 is difficult as the night dataset only contains images of one scene and no corresponding VIS images are available. Another factor that makes evaluating the images hard is that the TIR images contain a human, and the training set did not contain any humans. However, it is possible to analyze the images by evaluating whether the generated images are perceptually realistic or not. The first two rows in figure 4.4 show two successful example transformations and the three last rows show examples where the TIR to VIS model fails to produce realistic VIS images.


As noted earlier, the generated images are blurry; to mitigate that we applied a naive PAN sharpening method. The result of the PAN sharpening can be seen in figure 4.5. Applying the PAN sharpening does increase the perceived sharpness of the generated images while still maintaining most of the colors. We see that the fields and skies look great, but trees tend to be white or gray. This is because the trees are not very colorful, combined with the fact that the added luminance is very bright for trees. Depending on the application, applying a PAN sharpening method can significantly increase the level of detail in the generated images.


6 Conclusions

This chapter answers the research questions from section 1.3.

How would an existing pretrained CNN model for colorizing grayscale images perform on the task of transforming TIR images to VIS images? By assessing both the quantitative results in table 4.1 and the example images in figure 4.1, it is possible to conclude that existing pretrained colorization models perform poorly when applied directly to TIR images. TIR images are too different from VIS grayscale images.

Which color space is more suitable for the task of TIR to VIS transformation? We compare one model trained in both the RGB and CIELAB color spaces. According to the quantitative results in this thesis, the model trained in RGB is marginally more suitable for the task of TIR to VIS transformation. This is also confirmed by evaluating example images: the images from the TIR-to-RGB model have more saturated colors and look slightly better than the images from the TIR-to-CIELAB model.

Does a model trained on thermal images from daytime generalize well on images from night time? This question can only be evaluated by analyzing generated images. As we can see in figure 4.4, the model is able to transform TIR images to perceptually realistic colors in some cases. Images that include a human are harder because warm foreground objects reduce the contrast of the background.

6.1 Future work

For future work it would be interesting to collect a larger and more accurate dataset, more accurate in terms of a lower transfer error for objects at any distance. This would require a system that includes hardware to position the cameras and hardware to sync them. It may also be interesting to investigate if GANs can be used to reduce the blurriness of the generated images. A known feature of GANs is their ability to compose objects to make the image look more realistic; depending on the application this could be a benefit or a drawback. For example, when driving based on generated images, it is important that the models do not add arbitrary objects, such as stones, to the generated image to make it look more realistic. FOI is also developing another system with a short-wavelength infrared (SWIR) camera instead of LWIR. Testing the algorithms developed in this thesis on the SWIR system could also be interesting because SWIR is more correlated with the visible spectrum, and therefore the results may be even better.


Bibliography

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org. Cited on page 18.

[2] Amanda Berg, Jorgen Ahlberg, and Michael Felsberg. Generating visible spectrum images from thermal infrared. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018. Cited on pages 4, 18, 21, 24, 26, 27, and 34.

[3] Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 111–118, 2010. Cited on page 12.

[4] Yun Cao, Zhiming Zhou, Weinan Zhang, and Yong Yu. Unsupervised diverse colorization via generative adversarial networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 151–166. Springer, 2017. Cited on page 4.

[5] Zezhou Cheng, Qingxiong Yang, and Bin Sheng. Deep colorization. In Proceedings of the IEEE International Conference on Computer Vision, pages 415–423, 2015. Cited on page 4.

[6] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016. Cited on page 11.


[7] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010. Cited on page 13.

[8] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010. Cited on page 13.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. Cited on page 4.

[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. Cited on pages 10 and 11.

[11] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision, second edition. Cambridge University Press, 2000. Cited on page 9.

[12] Janne Heikkila and Olli Silven. A four-step camera calibration procedure with implicit image correction. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 1106–1112. IEEE, 1997. Cited on page 7.

[13] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (Proc. of SIGGRAPH 2016), 35(4), 2016. Cited on page 4.

[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. Cited on page 14.

[15] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arxiv, 2016. Cited on page 4.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Cited on pages 13 and 24.

[17] V. V. Kniaz, V. S. Gorbatsevich, and V. A. Mizginov. Thermalnet: a Deep Convolutional Network for Synthetic Thermal Image Generation. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pages 41–45, May 2017. doi: 10.5194/isprs-archives-XLII-2-W4-41-2017. Cited on page 4.


[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf. Cited on page 11.

[19] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, volume 2, page 8, 2017. Cited on page 4.

[20] Y. LeCun. A theoretical framework for back-propagation. In P. Mehra and B. Wah, editors, Artificial Neural Networks: concepts and theory, Los Alamitos, CA, 1992. IEEE Computer Society Press. Cited on page 11.

[21] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 5 2015. ISSN 0028-0836. doi: 10.1038/nature14539. Cited on pages 12 and 13.

[22] Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. In ACM Transactions on Graphics (TOG), volume 23, pages 689–694. ACM, 2004. Cited on page 3.

[23] Matthias Limmer and Hendrik PA Lensch. Infrared colorization using deep convolutional neural networks. In Machine Learning and Applications (ICMLA), 2016 15th IEEE International Conference on, pages 61–68. IEEE, 2016. Cited on page 4.

[24] Jingjing Liu, Shaoting Zhang, Shu Wang, and Dimitris N Metaxas. Multispectral deep neural networks for pedestrian detection. arXiv preprint arXiv:1611.02644, 2016. Cited on pages 2 and 4.

[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. Cited on page 18.

[26] M Sarifuddin and Rokia Missaoui. A new perceptually uniform color space with associated color similarity measure for content-based image and video retrieval. In Proc. of ACM SIGIR 2005 Workshop on Multimedia Information Retrieval (MMIR 2005), pages 1–8, 2005. Cited on page 10.

[27] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2627435.2670313. Cited on page 14.


[28] Tania Stathaki. Image fusion: algorithms and applications. Elsevier, 2011. Cited on page 18.

[29] Patricia L Suárez, Angel D Sappa, and Boris X Vintimilla. Learning to colorize infrared images. In International Conference on Practical Applications of Agents and Multi-Agent Systems, pages 164–172. Springer, 2017. Cited on page 4.

[30] P. L. Suárez, A. D. Sappa, and B. X. Vintimilla. Infrared image colorization based on a triplet dcgan architecture. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 212–217, July 2017. doi: 10.1109/CVPRW.2017.32. Cited on page 4.

[31] Richard Szeliski. Computer vision: algorithms and applications. Springer Science & Business Media, 2010. Cited on pages 9 and 10.

[32] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 2, pages 1398–1402. IEEE, 2003. Cited on page 13.

[33] Tomihisa Welsh, Michael Ashikhmin, and Klaus Mueller. Transferring color to greyscale images. In ACM Transactions on Graphics (TOG), volume 21, pages 277–280. ACM, 2002. Cited on page 3.

[34] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016. Cited on pages 4, 13, 18, and 26.

[35] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22(11):1330– 1334, 2000. Cited on page 7.

[36] Barbara Zitova and Jan Flusser. Image registration methods: a survey. Image and vision computing, 21(11):977–1000, 2003. Cited on pages 8 and 9.
