UPTEC F 18047

Degree project 30 credits, July 2018

Noise Reduction in Flash X-ray Imaging Using Deep Learning

Tobias Sundman


Abstract

Noise Reduction in Flash X-ray Imaging Using Deep Learning

Tobias Sundman

Recent improvements in deep learning architectures, combined with the strength of modern computing hardware such as graphics processing units, have led to significant results in the field of image analysis. In this thesis work, locally connected architectures are employed to reduce noise in flash X-ray diffraction images. The layers in these architectures use convolutional kernels, but without shared weights. This combines the benefit of the lower model memory footprint of convolutional networks with the higher model capacity of fully connected networks. Since the camera used to capture the diffraction images has pixelwise unique characteristics, and thus lacks equivariance, this compromise can be beneficial.

The background images of this thesis work were generated with an active laser but without injected samples. Artificial diffraction patterns were then added to these background images, allowing U-Net architectures to be trained to separate them. Architecture A achieved a performance of 0.187 on the test set, roughly translating to 35% fewer photon errors than a model similar to the state of the art.

After smoothing the photon errors, this performance increased to 0.285, since the U-Net architectures managed to remove flares where the state of the art could not. This could be taken as a proof of concept that locally connected networks are able to separate diffraction from background in flash X-ray imaging.

ISSN: 1401-5757, UPTEC F 18047. Examiner: Tomas Nyberg. Subject reader: Prashant Singh. Supervisor: Carl Nettelblad

Popular science summary

Recent years have brought major research advances in, among other areas, artificial intelligence, deep learning and image analysis. This, combined with the continued development of computing hardware, has opened the floodgates for what is possible to achieve.

This thesis investigates ways of using modern neural networks and deep learning to separate interesting signals from underlying noise in images. The images are generated at the Linac Coherent Light Source (LCLS), a facility operated by Stanford University in the USA. X-rays are fired at bioparticles, for example viruses. On contact with the particle the rays are diffracted, creating so-called diffraction patterns that can then be captured by a camera. Physical models can then be used to obtain, from such a pattern, a likely image of what the particle looks like. This gives insights into the microscopic world that even very powerful microscopes struggle to access.

Since the patterns are very weak, often comprising only thousands of photons, the process is very sensitive to noise. In this work a deep neural network inspired by recent deep learning research is trained. With a hierarchical structure, where different parts of the network learn to process different length scales in the image, the network manages to separate artificially generated diffraction patterns from real background. Particularly successful is the removal of non-linear phenomena such as randomly occurring flares of light in the background.

This is something that current methods have long struggled with. The work can be seen as a proof of concept that neural networks can be used to separate diffraction patterns from background.

Acknowledgments

First and foremost I would like to thank my supervisor Carl Nettelblad at Uppsala University. Without his continued support the work would not be where it is today. I also want to thank my subject reader Prashant Singh for the helpful technical input given during the thesis work. Furthermore I want to thank Filipe Maia for technical support with the Davinci cluster, which was originally funded by the European Research Council.

Last but not least I want to thank Alberto Pietrini and Filipe Maia who gathered and preprocessed the data used in this thesis work.

Contents

Notation

1 Introduction
  1.1 Problem formulation
  1.2 Background
    1.2.1 Free electron lasers
    1.2.2 Flash X-ray imaging
    1.2.3 CSPAD
    1.2.4 Noise and artifacts in the CSPAD

2 Theory
  2.1 Machine learning
  2.2 Feedforward networks
  2.3 Learning using optimization
    2.3.1 Loss functions
    2.3.2 Regularization
    2.3.3 Back-propagation
    2.3.4 Initialization
    2.3.5 Batch normalization
    2.3.6 Hyperparameters
  2.4 Convolutional networks
    2.4.1 Convolutional layer
    2.4.2 Pooling layer
    2.4.3 Locally connected layer
  2.5 Autoencoders
    2.5.1 Encoder
    2.5.2 Decoder
    2.5.3 Denoising autoencoder

3 Implementation
  3.1 Tools
  3.2 Dataset
    3.2.1 Dataset split
    3.2.2 Artificial diffraction data
    3.2.3 Preprocessing
  3.3 Linear model
  3.4 Partitioned autoencoder
  3.5 Layer architectures
    3.5.1 Locally connected layers
  3.6 Residual networks
    3.6.1 Memory constraints
    3.6.2 U-Net inspired architecture

4 Results
  4.1 Performance measure
  4.2 Linear model
  4.3 Architecture A
  4.4 Architecture B
  4.5 Architecture C

5 Discussion
  5.1 Limitations
  5.2 Difficulties

6 Conclusion
  6.1 Future work

7 Appendix
  A Architecture code
  B Time and memory usage
  C Architecture A
  D Architecture B
  E Architecture C

8 Bibliography

Notation

Acronyms

ADU     Arbitrary Digital Unit
ANN     Artificial Neural Network
CNN     Convolutional Neural Network
CSPAD   Cornell-SLAC Pixel Array Detector
FEL     Free Electron Laser
FXI     Flash X-ray Imaging
LCLS    Linac Coherent Light Source
MSE     Mean Squared Error
NN      Neural Network
PCA     Principal Component Analysis
ReLU    Rectified Linear Unit
SASE    Self-Amplified Spontaneous Emission
SELU    Scaled Exponential Linear Unit
SGD     Stochastic Gradient Descent

Numbers

w       A scalar
w       A vector
W       A matrix
W       A tensor

Operators

⊙       Hadamard product (elementwise multiplication)

Symbols

Ñ(µ, σ²)   Normal distribution with mean µ and standard deviation σ, truncated at µ ± 2σ
N(µ, σ²)   Normal distribution with mean µ and standard deviation σ

1 Introduction

1.1 Problem formulation

This thesis investigates the applicability of neural networks for denoising flash X-ray images. There are no specific usage restrictions on network memory footprint and evaluation time. A network with short evaluation time would be beneficial since it could be used in an online fashion. The signal consists of diffraction patterns recorded using a camera-like pixel detector. This signal has to be separated from the background. The background can be of comparable strength to the diffraction signal, highlighting the need to separate them for proper analysis of the signal, but also making this noise-reduction problem challenging.

1.2 Background

1.2.1 Free electron lasers

The Free Electron Laser (FEL) was invented in the early 1970s by Madey [1]. As a beam of relativistic electrons passes through a periodic magnetic structure it interacts with the emitted radiation, which concentrates the energy into a single mode of an electromagnetic wave [2]. In the beginning optical cavities were used as oscillators. This changed in the early 1980s when Self-Amplified Spontaneous Emission (SASE), where a high-gain FEL itself acts as an amplifier, began seeing use instead. Using SASE the initially generated radiation amplifies exponentially, making one mode predominant. During the 2000s multiple high-gain X-ray FELs were constructed that could generate multi-gigawatt femtosecond coherent X-ray pulses, greatly reducing the cost compared to the previous optical cavities [2].

1.2.2 Flash X-ray imaging

Radiation damage is the main impediment in determining the structure of biological macromolecules using microscopy [3]. In Flash X-ray Imaging (FXI) an X-ray FEL is used to generate femtosecond X-ray pulses. A sample is hit generating diffraction images that, through solving an inverse problem, give insight into the structure of the original object. The pulses have a very high X-ray dose rate and they exit the samples before radiation-induced damage takes place, meaning useful structural information can be extracted [4]. This phenomenon has been coined diffraction-before-destruction [2]. The pulses fluctuate in properties, mainly because of poor longitudinal coherence due to the SASE process [5], meaning single-pulse detection is necessary for the data to be useful [6].

1.2.3 CSPAD

The Linac Coherent Light Source (LCLS) hosts an FXI instrument consisting of Cornell-SLAC Pixel Array Detectors (CSPADs). The CSPAD is a general purpose hybrid pixel X-ray camera built by SLAC using the ASIC component developed by Cornell [7]. The data in this thesis work was generated in the 0.1 µm chamber, with an interaction region that has a nominal 0.1 µm FWHM focal spot [8]. Each detector consists of multiple CSPAD 2 × 1 modules (containing two ASICs each) comprising 388 × 185 pixels. Each pixel in turn spans an area of 110 µm × 110 µm and reads out a voltage proportional to the number of incident photons on that pixel at 120 Hz [9]. The setup contains the large camera, consisting of 64 ASICs, with the task of collecting wide-angle scattering. A second detector, the back detector, consists of four ASICs and is placed downstream, collecting small-angle scattering which is allowed to pass through a hole in the large camera.

1.2.4 Noise and artifacts in the CSPAD

The raw readout from each pixel in a CSPAD is measured in Arbitrary Digital Units (ADUs). This readout depends on the electrical properties of the individual pixel which has to be remedied through correction [10]. The readout approximately follows a linear function of the number of incident photons with an offset (pedestal) and a scaling (gain) that have to be chosen appropriately from measurements. Determining the pedestals can be accomplished using dark runs (frames with inactive laser) where the median ADU value in each pixel corresponds to that pixel’s pedestal. The gain can be estimated by finding peaks in the ADU histogram corresponding to zero- and one-photon frames in a run [9] [11].
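To make the linear correction concrete, the sketch below estimates per-pixel pedestals from a stack of dark frames and converts a raw frame from ADU to approximate photon counts. The array shapes, the synthetic values and the fixed gain of 25 ADU per photon are illustrative assumptions, not the actual calibration pipeline used at LCLS.

    import numpy as np

    # Hypothetical data: 100 dark frames (laser inactive) and one raw frame, one ASIC of 185 x 388 pixels.
    rng = np.random.default_rng(0)
    dark_frames = rng.normal(1200.0, 5.0, size=(100, 185, 388))   # ADU readouts with no photons
    raw_frame = rng.normal(1230.0, 20.0, size=(185, 388))         # frame to be corrected

    # Pedestal: per-pixel median ADU over the dark run.
    pedestal = np.median(dark_frames, axis=0)

    # Gain: assumed constant here (roughly 25 ADU per photon); in practice it is estimated
    # per pixel from the zero- and one-photon peaks of the ADU histogram.
    gain = np.full_like(pedestal, 25.0)

    # Approximately linear model: ADU = gain * photons + pedestal.
    photons = (raw_frame - pedestal) / gain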

On top of this approximately linear pixel behavior there are a number of non-linear phenomena. The response of each pixel follows a non-linear function of the incident intensity, mostly due to crosstalk. Since the intensity of an FEL pulse is highly variable, this non-linear behavior is very important to quantify. Data processing attempts have been made to remedy the non-linear behavior [6], but further investigation is needed. The ADU values can also saturate at higher intensities, meaning the global gain has to be adjusted if one wants to perform high-intensity imaging [12]. Another observed non-linear behavior besides crosstalk is drift of the pedestal and gain values in each pixel [13].

There is also a frame-wise offset which is cancelled out by subtracting the median value of all pixels on a per ASIC basis. The median value should, for weak photon signals, correspond to a zero-photon readout. Besides the noise due to electrical properties and the global frame-wise offset, there are also other more complex frame-wise artifacts. One example is an artifact in the form of a horizontal gradient. It has been shown that this artifact can be reduced by subtracting a per column, per ASIC common mode [9].

When recording background with an enabled laser there will also typically be extraneous photons scattered from equipment, solvent and/or gases [14]. Parts of this last source of background noise could potentially have some structure originating from the positioning of equipment in the chamber in which scattering occurs.

2 Theory

Sections 2.1-2.5 are adapted from the book Deep Learning written by Ian Goodfellow, Yoshua Bengio and Aaron Courville [15]. This book is a strong foundation if one seeks a broader introduction to the fields of machine learning, and more specifically deep learning, than the one presented here.

2.1 Machine learning

The field of machine learning tackles the task of using computers to solve various tasks without explicitly programming them to do so. Mitchell [16] gave a more formal definition of a machine learning algorithm: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

The input of a machine learning algorithm consists of examples. Each example is a collection of features, often expressed as a vector x ∈ R^n, where each entry in the vector represents one feature. For example, if we wish to classify the species of a flower, interesting features could include its color and size.

A multitude of machine learning tasks with significantly different characteristics exist. In classification the task is to infer a function f : R^n → {1, ..., k} from the feature space to a discrete set of classes using only the available data, usually called the training set. Regression instead tries to infer a function f : R^n → R. An example could be one where the features represent a person and the output the amount of money they will spend on vacation this year. Beyond these two fairly simple tasks many others exist, such as anomaly detection, denoising and machine translation.

When creating machine learning algorithms it is important to know how well they perform. To accomplish this one uses performance measures that give some quantitative insight into the error of the model. In classification this can take the form of accuracy, which is given by the number of correctly classified examples divided by the total number of examples. When training machine learning models one typically uses so-called loss functions that output a number expressing the performance of the current iteration.

When speaking of the experience of a model it is typically further subdivided into unsupervised learning and supervised learning. In supervised learning, the experience takes the form of examples x with an associated target y, usually in the form of a label or target value. The algorithm then tries to learn to predict y from x, for example through maximizing p(y|x). Unsupervised learning on the other hand does not have a ground truth in the form of labels y. Instead it tries to learn useful properties of the dataset. In the example of clustering this takes the form of trying to find meaningful groupings in the dataset.

One of the main difficulties in machine learning is avoiding overfitting or underfitting in the model. A model exhibiting overfitting performs very well on the training set, but poorly on the test set (data intentionally removed from the original dataset for use in performance metrics). This is usually due to the presence of noise in the training set, which we do not want our function to use since it generalizes poorly to other data. The goal of machine learning models is typically to perform as well as possible on the test set, minimizing the test error. An overfit model may, in some cases, perform well on the test set but still have poor generalization. This would typically be due to a poorly constructed test set that aligns too nicely with the training set. An underfit model on the other hand performs poorly on both the test and training sets.

The ability of a model to overfit is determined by its capacity. An example of how capacity can manifest is trying to fit a polynomial function to some data points. If we allow the polynomial to have a very high degree, meaning high capacity, it will have a low training error since it can intersect all points in the training set. Such a model will most likely generalize poorly, having difficulty with unseen data due to the presence of noise. This would be a case of overfitting. Likewise if the degree of the polynomial is low it will fit the training data poorly, but the test error will be similar to the training error. These low capacity models usually incur underfitting. Finding the balance between over- and underfitting is usually accomplished through a process called regularization which punishes high capacity models during training by, for example, incurring a higher loss.

2.2 Feedforward networks

Early neural network models were inspired by the inner workings of the brain and the complex interaction pattern of its billions of neurons. Due to this influence, one of the names the models have since gone by is Artificial Neural Networks (ANNs). A simple feedforward ANN, without feedback connections, consists of an input layer, some intermediate hidden layers, and an output layer. Each layer usually consists of a linear function followed by a non-linear transform. The input layer consists of the feature vector x. Each hidden layer can then be seen as a function composed with the output of the previous layer. In this setting a three layer ANN would take the form ŷ = f^{(3)}(f^{(2)}(f^{(1)}(x))), where f^{(1)} represents the first hidden layer and f^{(2)} the second. The final layer, f^{(3)} in this case, is usually called the output layer and yields the prediction ŷ. The term deep network seen in deep learning usually refers to networks with a large number of hidden layers. Deep networks, with their multiple introduced non-linearities, have the ability to learn very complex patterns in the given data.

An example to illustrate feedforward neural networks is a three layer fully connected network. First the input feature vector x is multiplied by a weight matrix W^{(1)⊤} and then a bias b^{(1)} is added. The result is passed through a non-linear transformation g^{(1)}, applied elementwise, yielding the first activation h^{(1)}. This procedure is repeated twice more using the activations of the previous layer, resulting in the equations

h^{(1)} = g^{(1)}(W^{(1)⊤} x + b^{(1)})
h^{(2)} = g^{(2)}(W^{(2)⊤} h^{(1)} + b^{(2)})
ŷ = g^{(3)}(W^{(3)⊤} h^{(2)} + b^{(3)})   (2.1)

with the output prediction ŷ. All weights W and biases b are learnable parameters, meaning they need to be optimized so that ŷ actually approaches the true target y. Each entry in the activation vectors h can be seen as a node in the network. The weights then determine the strength of the links between those nodes and the ones in the previous layer.
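A minimal NumPy sketch of the forward pass in (2.1) is given below; the layer widths, the random parameters and the choice of ReLU for the hidden activations are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def relu(z):
        return np.maximum(0.0, z)

    # Assumed layer widths: 8 input features, two hidden layers of 16 nodes, 4 outputs.
    sizes = [8, 16, 16, 4]
    W = [rng.normal(0.0, 0.1, size=(sizes[i], sizes[i + 1])) for i in range(3)]
    b = [np.zeros(sizes[i + 1]) for i in range(3)]

    x = rng.normal(size=sizes[0])          # input feature vector

    # Forward propagation as in (2.1): each layer applies g(W^T h_prev + b).
    h1 = relu(W[0].T @ x + b[0])
    h2 = relu(W[1].T @ h1 + b[1])
    y_hat = W[2].T @ h2 + b[2]             # linear output layer, i.e. g^(3) is the identity here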

The non-linear transformations are usually called activation functions and can be chosen in many ways. A very common choice is the so-called Rectified Linear Unit (ReLU), which has the form g(x) = max(0, x). Other typical choices are the sigmoid function and the hyperbolic tangent function. The choice is mostly made to guarantee good properties when optimizing the weights and biases.

Figure 2.1: The scaled exponential linear unit selu(x) proposed by Klambauer et al.

Klambauer et al. [17] propose using their Scaled Exponential Linear Unit (SELU) as the activation function in neural networks, which can be seen in Figure 2.1. The function has the form

selu(x) = λ \begin{cases} x, & \text{if } x > 0 \\ α e^{x} − α, & \text{if } x ≤ 0, \end{cases}   (2.2)

where α and λ are fixed constants chosen as described by the authors. The advantage of using the SELU is that it guarantees close to zero mean and unit variance activations throughout the network. This is highly beneficial when training the network since it remedies problems with vanishing or exploding gradients that can appear when growth or decrease in activations accumulate exponentially through many layers.
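A direct NumPy transcription of (2.2) is shown below; the constants are the values published by Klambauer et al., quoted here to a few decimals.

    import numpy as np

    # Constants from Klambauer et al. (rounded); they give the zero-mean, unit-variance fixed point.
    ALPHA = 1.6733
    LAMBDA = 1.0507

    def selu(x):
        x = np.asarray(x, dtype=float)
        return LAMBDA * np.where(x > 0, x, ALPHA * np.exp(x) - ALPHA)

    print(selu([-2.0, 0.0, 2.0]))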

2.3 Learning using optimization

In the case of the feedforward neural network described in Section 2.2 there are learnable parameters W and b that need to be optimized. To do this we gather all learnable parameters in θ for easier notation and then formulate the cost function J(θ), sometimes called the objective function. Typically this cost function can be expressed as an average per-example loss over the training set as

J(θ) = \frac{1}{m} \sum_{i=1}^{m} L( f(x^{(i)}; θ), y^{(i)} ),   (2.3)

where L gives the loss of the specific training example (x^{(i)}, y^{(i)}), f is the function composition expressing the full network and m is the number of examples in the training set. We can then take the gradient of the cost function J with respect to the parameters θ

∇_θ J(θ) = \frac{1}{m} \sum_{i=1}^{m} ∇_θ L( f(x^{(i)}; θ), y^{(i)} ).   (2.4)

One of the simplest ways of optimizing using this gradient is to take a step in the negative gradient direction as

θ ← θ − ε ∇_θ J(θ),   (2.5)

where the learning rate ε adjusts the size of the steps. This basic optimization method is called gradient descent.

Since the gradient typically is an expectation it can be estimated using a smaller subset of the training set. In each step we instead sample a minibatch of m′ < m examples from the training set and estimate the gradient as

∇_θ J(θ) = \frac{1}{m'} \sum_{i=1}^{m'} ∇_θ L( f(x^{(i)}; θ), y^{(i)} ).   (2.6)

The minibatch can be chosen to be very small, typically no more than a few hundred examples, meaning each gradient descent step can be computed much faster. This adaptation of gradient descent is called Stochastic Gradient Descent (SGD). Each time the optimization algorithm has seen the full training set we say that one epoch has passed.
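The sketch below runs one SGD epoch on a toy least-squares problem to illustrate the sampling in (2.6) and the update in (2.5); the data, model and learning rate are placeholders chosen only for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy linear regression data: y = X w_true + noise.
    X = rng.normal(size=(1000, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=1000)

    theta = np.zeros(5)        # parameters to learn
    lr = 0.05                  # learning rate epsilon
    batch_size = 32            # minibatch size m'

    # One epoch: visit the training set once in random minibatches.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        residual = Xb @ theta - yb
        grad = 2.0 * Xb.T @ residual / len(idx)   # gradient of the minibatch MSE
        theta -= lr * grad                        # gradient descent step as in (2.5)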

Many further adaptations of SGD have been created, most using so-called momentum. The momentum is a decaying sum of previously seen gradients that, when added to the actual gradient, helps it overcome regions of low gradient in some parameter direction.

One very popular variant of SGD currently in use is the Adam optimizer (short for adaptive moment estimation) [18]. It incorporates the first and second order moments of the gradient, as well as bias correction. The correction is added to alleviate the fact that the moment estimates are biased towards zero in the early steps. For a full definition of the Adam optimizer see Algorithm 1.
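A NumPy sketch of the Adam update is given below; the hyperparameter defaults follow common practice and the quadratic objective is an illustrative stand-in for a training loss.

    import numpy as np

    def adam_step(theta, grad, r, s, t, lr=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
        """One Adam update; r and s are the running first and second moment estimates."""
        t += 1
        r = beta1 * r + (1 - beta1) * grad            # biased first moment estimate
        s = beta2 * s + (1 - beta2) * grad * grad     # biased second raw moment (elementwise)
        r_hat = r / (1 - beta1 ** t)                  # bias correction
        s_hat = s / (1 - beta2 ** t)
        theta = theta - lr * r_hat / (np.sqrt(s_hat) + delta)
        return theta, r, s, t

    # Minimize the quadratic f(theta) = ||theta - 3||^2 as a stand-in for a training loss.
    theta, r, s, t = np.zeros(4), np.zeros(4), np.zeros(4), 0
    for _ in range(2000):
        grad = 2.0 * (theta - 3.0)
        theta, r, s, t = adam_step(theta, grad, r, s, t, lr=0.05)
    print(theta)   # approaches [3, 3, 3, 3]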

2.3.1 Loss functions

The choice of loss function is highly dependent on the specific task it will be used for. In the case of image autoencoders, where the output should resemble the input as closely as possible, a common choice is the so-called Mean Squared Error (MSE), defined as

J_MSE(θ) = \frac{1}{m} \sum_{i=1}^{m} ( f(x^{(i)}; θ) − y^{(i)} )²,   (2.7)

where (x^{(i)}, y^{(i)}) are the examples over which we wish to calculate the MSE.

2.3.2 Regularization

Regularization can be done explicitly by augmenting the loss function J with a regularization factor Ω(θ), resulting in the regularized objective function

J̃(θ) = J(θ) + α Ω(θ),   (2.8)

Algorithm 1: The Adam algorithm. The Hadamard product ⊙ takes two matrices of the same dimensions, multiplies them elementwise, and outputs a matrix of the same dimensions.

Require: ε: Learning rate
Require: β1, β2 ∈ [0, 1): Decay rates
Require: δ: Small constant for numerical stability under division
Require: θ: Initial parameters

Initialize first moment r ← 0
Initialize second moment s ← 0
Initialize time step t ← 0
while θ not converged (change in loss larger than some constant ρ) do
    t ← t + 1
    Sample a minibatch of size m′ from the training set
    Compute gradient g ← ∇_θ J(θ) = \frac{1}{m'} \sum_{i=1}^{m'} ∇_θ L(f(x^{(i)}; θ), y^{(i)})
    Update biased first moment estimate r ← β1 r + (1 − β1) g
    Update biased second raw moment estimate s ← β2 s + (1 − β2) (g ⊙ g)
    Compute bias-corrected first moment estimate r̂ ← r / (1 − β1^t)
    Compute bias-corrected second raw moment estimate ŝ ← s / (1 − β2^t)
    Update parameters θ ← θ − ε r̂ / (√ŝ + δ)
end
return optimized parameters θ

where α ∈ [0, ∞) is a hyperparameter that scales the relative importance of the penalty.

Regularization is typically not applied to the biases b. We therefore use w to refer to the parameters in θ to which regularization will be applied. Typical choices of regularization Ω include L2 regularization, Ω(θ) = (1/2)‖w‖₂², which biases the weights towards zero, and L1 regularization, Ω(θ) = ‖w‖₁, which results in a sparse solution.

Another common form of regularization is so-called dropout [19]. When using dropout each node (kernel for convolutional networks) in the network has a probability, a hyperparameter, of having its output explicitly set to zero. In practice this works very well as regularization and can intuitively be seen as training an ensemble of weaker models that then work together to give a stronger prediction. One way of using the model at test time is to multiply each node output in the network by the probability of including it.

This has no theoretical proof but works very well in practice.
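The snippet below shows the two ideas in isolation: an L2 penalty added to a loss as in (2.8), and a dropout mask applied to an activation vector during training with the corresponding rescaling at test time. The activation values, rates and the placeholder data loss are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # L2 (weight decay) term added to the data loss, as in (2.8) with Omega = (1/2) ||w||_2^2.
    w = rng.normal(size=10)
    data_loss = 0.42                        # placeholder value of J(theta)
    alpha = 1e-4
    total_loss = data_loss + alpha * 0.5 * np.sum(w ** 2)

    # Dropout on an activation vector h with keep probability p_keep.
    h = rng.normal(size=10)
    p_keep = 0.8
    mask = rng.random(10) < p_keep
    h_train = h * mask                      # training: randomly zero out node outputs
    h_test = h * p_keep                     # test: scale outputs by the keep probability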

2.3.3 Back-propagation

An important question one might have is how we actually calculate the gradient ∇_θ J(θ). The answer is that we use the chain rule of calculus to successively compute the derivatives. This process is usually called back-propagation, since the derivatives are calculated starting from the output layer. In contrast, the process of calculating the activations is usually called forward propagation, since it works its way from the input layer. By organizing the network in a computational graph, where each node contains a tensor variable and the directed links signify some operation with known properties under differentiation, we can work out all derivatives in a simple way. An example of calculating the derivatives using the computational graph can be seen in Figure 2.2.

Figure 2.2: The computational graph of the function e = ab + c, where a and b feed a multiplication node producing d = ab, and d and c feed an addition node producing e. Starting from the top of the graph, ∂e/∂e = 1, ∂e/∂d = 1 and ∂e/∂c = 1. Further down the tree one uses the multivariate chain rule: ∂e/∂b = (∂e/∂d)(∂d/∂b) = 1 · a = a and ∂e/∂a = (∂e/∂d)(∂d/∂a) = 1 · b = b. The same logic holds for more complex graphs, as long as the derivatives between adjacent nodes are well-defined.
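The same graph can be differentiated automatically. The sketch below uses TensorFlow's GradientTape, an API from TensorFlow 2 that is newer than the version used in this thesis, to recover ∂e/∂a = b, ∂e/∂b = a and ∂e/∂c = 1.

    import tensorflow as tf

    a = tf.Variable(2.0)
    b = tf.Variable(5.0)
    c = tf.Variable(-1.0)

    with tf.GradientTape() as tape:
        d = a * b          # intermediate node d = ab
        e = d + c          # output node e = ab + c

    grads = tape.gradient(e, [a, b, c])
    print([g.numpy() for g in grads])   # [5.0, 2.0, 1.0], i.e. b, a and 1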

2.3.4 Initialization

If the parameters W and b in a feedforward network are initialized at zero many of the gradients of different parameters will be identical, leading to many nodes learning the same mapping. To encourage different nodes learning different mappings it is common to initialize the weights with small random values. Since the symmetry is broken by the random initialization of the weights W the biases b are usually initialized to some small positive number if one uses ReLU as activation function. This ensures that the nodes are in the active regime at network initialization.

One problem created by using the initialization strategy above is that the variance in activations shrinks for each layer we forward propagate through. This will also shrink the gradients and make learning difficult. Glorot and Bengio [20] try to remedy this by sampling the weights between layers i and j from a distribution with

Var(W_{i,j}) = \frac{2}{n_i + n_j},   (2.9)

where n_i is the number of nodes in layer i. This so-called Xavier initialization preserves variance in the layer activations under the assumption that there are linear activation functions and zero biases. This assumption does not usually hold, but it has been shown to work well in practice. Recent papers suggest that one needs to double the variance when using ReLU, instead using

Var(W_{i,j}) = \frac{2}{n_i},   (2.10)

due to the fact that approximately half the outputs of a ReLU layer are zero, thus halving the variance [21]. The paper did not average over the sizes of the input and output layers, so the 2 remains in the numerator even though the variance is effectively doubled.
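A small sketch of the two sampling schemes; the layer sizes are placeholders and the normal distribution is one common choice (uniform variants with the same variance are also used).

    import numpy as np

    rng = np.random.default_rng(0)

    def xavier_init(n_in, n_out):
        # Var(W) = 2 / (n_in + n_out), as in (2.9).
        std = np.sqrt(2.0 / (n_in + n_out))
        return rng.normal(0.0, std, size=(n_in, n_out))

    def he_init(n_in, n_out):
        # Var(W) = 2 / n_in, as in (2.10), intended for ReLU layers.
        std = np.sqrt(2.0 / n_in)
        return rng.normal(0.0, std, size=(n_in, n_out))

    W1 = xavier_init(256, 128)
    W2 = he_init(256, 128)
    print(W1.var(), W2.var())   # close to 2/384 and 2/256 respectively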

2.3.5 Batch normalization

Using gradients to optimize parameters is a strong tool, but not without flaws. When using the gradient to adjust parameters we assume that we change only one and keep the rest fixed. Since all layers are adjusted at the same time this will yield unforeseen changes in activations due to the many function compositions. Batch normalization [22] tries to remedy this by forcing the activations h(j)i of a layer j, for a specific example x(i) in a minibatch of size m0, into a unit Gaussian distribution with the transformation

h̃_i^{(j)} = \frac{h_i^{(j)} − µ^{(j)}}{σ^{(j)}}.   (2.11)

The expected value µ^{(j)} and standard deviation σ^{(j)} are taken on a per-dimension basis across the entire minibatch as

µ^{(j)} = \frac{1}{m'} \sum_{i=1}^{m'} h_i^{(j)}   (2.12)

and

σ^{(j)} = \sqrt{ \frac{1}{m'} \sum_{i=1}^{m'} ( h_i^{(j)} − µ^{(j)} )² }.   (2.13)

Since this process can reduce the expressive power of the network it is common to introduce new learnable parameters γ^{(j)} and β^{(j)} to allow the network to relearn the identity mapping, or any mapping in between. Formalizing this yields the expression

h′_i^{(j)} = γ^{(j)} ⊙ h̃_i^{(j)} + β^{(j)},   (2.14)

where h′_i^{(j)} is the output of the batch normalization layer for a given example x^{(i)} and ⊙ is the Hadamard product. At test time µ^{(j)} and σ^{(j)} are estimated from the training data, for example using running averages computed during training.
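A NumPy sketch of the training-time transformation (2.11)-(2.14) for one minibatch of activations; the minibatch, γ and β are placeholders, and the running statistics used at test time are omitted.

    import numpy as np

    rng = np.random.default_rng(0)

    h = rng.normal(2.0, 3.0, size=(32, 64))     # minibatch of m' = 32 activation vectors
    gamma = np.ones(64)                         # learnable scale
    beta = np.zeros(64)                         # learnable shift
    eps = 1e-5                                  # small constant for numerical stability

    mu = h.mean(axis=0)                         # per-dimension mean over the minibatch, (2.12)
    sigma = h.std(axis=0)                       # per-dimension standard deviation, (2.13)
    h_tilde = (h - mu) / (sigma + eps)          # normalization, (2.11)
    h_out = gamma * h_tilde + beta              # scale and shift, (2.14)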

2.3.6 Hyperparameters

There are many hyperparameters in neural networks that need to be tuned to obtain the lowest generalization error possible. Typical hyperparameters include the learning rate, the dropout rate, the number and size of layers and the convolutional kernel width. When there are hyperparameters that need to be optimized it is typical to split the original dataset into three parts. Two are the usual test and training sets, but the third is the so-called validation set. We optimize the hyperparameters with regard to their performance on the validation set. To avoid overfitting the hyperparameters to the validation set, we use the test set to measure the actual generalization performance.

In this thesis work the hyperparameters were optimized using random sampling. Each hyperparameter is sampled uniformly from a set of eligible values and the best performing set of hyperparameters is chosen. A preferable way of performing hyperparameter optimization could be Latin hypercube sampling [23].
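A sketch of this kind of random sampling; the hyperparameter ranges, the number of trials and the train_and_evaluate function are hypothetical placeholders, not the values used in the thesis.

    import random

    random.seed(0)

    search_space = {
        "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
        "dropout_rate": [0.0, 0.1, 0.2, 0.3],
        "kernel_size": [3, 5, 7],
    }

    def train_and_evaluate(config):
        # Placeholder: train a model with these hyperparameters and return the validation loss.
        return random.random()

    best_config, best_loss = None, float("inf")
    for _ in range(20):                       # number of random trials
        config = {name: random.choice(values) for name, values in search_space.items()}
        loss = train_and_evaluate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    print(best_config, best_loss)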

2.4 Convolutional networks

One flaw of the fully connected networks described in Section 2.2 is that they discard the spatial structure of the data. This is most noticeable in images, where it is often reasonable to assume that nearby pixels exhibit some form of spatial correlation. In 1989 LeCun introduced the concept of Convolutional Neural Networks (CNNs) [24].

By restricting weights to only be locally connected, meaning one node only connects to spatially close nodes in the following layer, we bias the network towards considering the spatial structure of the data. Using the concept of weight sharing, where many weights are constrained to have the same value, we can reduce the number of learnable parameters in the model by many orders of magnitude. These realizations catapulted the field of machine learning in image processing, since models could now be trained much more efficiently. The ideas can be summarized using the concept of convolution.

Discrete convolution in its simplest form is a type of weighting of different parts of the input. For example, if we have a time series of the value x(t) of some stock it could be reasonable to weight more current values higher when predicting the future. This is done by convolving the input x with the kernel w(a), where a is the age of a measurement.

The result is the convolution s given by

s(t) = (x ∗ w)(t) = \sum_{a=−∞}^{∞} x(a) w(t − a),   (2.15)

often referred to as the feature map.

When applying convolution in neural networks the kernels K are typically three-dimensional, since the images usually have two spatial dimensions and one channel dimension. In the input the channels could represent different colors, for example in RGB format. In convolutional layers it is typical to generate multiple feature maps that can be interpreted as channels for the following layer. The resulting feature map S after convolving image I with kernel K is

S(i, j) = (I ∗ K)(i, j) = \sum_{l,m,n} I(i + l, j + m, n) K(l, m, n).   (2.16)

As seen in (2.16) it is typical to have the kernel size in the channel dimension be the same as the number of available channels. Multiple such feature maps, with different shared weights, would then be generated as channels for the next layer.
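A direct, unoptimized NumPy transcription of (2.16) for a single output feature map with valid padding; the image and kernel are random placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    image = rng.normal(size=(8, 8, 3))      # H x W x channels
    kernel = rng.normal(size=(3, 3, 3))     # d x d x channels, one output feature map

    H, W, _ = image.shape
    d = kernel.shape[0]
    out = np.zeros((H - d + 1, W - d + 1))  # valid convolution output size

    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over the spatial window and all channels, as in (2.16).
            out[i, j] = np.sum(image[i:i + d, j:j + d, :] * kernel)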

2.4.1 Convolutional layer

A regular convolutional layer consists of a kernel K containing weights W and a single bias b. We introduce the concept of stride to describe the distance between different evaluations of the dot product between image and kernel. If the stride s is the same as the kernel size, input pixels will only affect one output pixel since the applications of the kernel do not overlap. Assuming the kernel to have size d in the spatial dimensions means the output image will have size

D_2 = \frac{D_1 − d + 2P}{s} + 1,   (2.17)

if D_1 is the spatial size of the input image. P stands for the amount of zero padding applied to each edge to preserve the spatial size. Zero padding is a typical way to handle edges during convolution. Three different notable types of zero padding are used. In valid convolution no padding is added and the convolution is applied where possible, resulting in a smaller output image. During same convolution enough zeroes are added to preserve the image size. Finally, full convolution means adding enough zeroes so that each pixel in the input affects the same number of pixels in the output.
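A small helper evaluating (2.17); the example values are placeholders (388 is the ASIC width mentioned in Section 1.2.3).

    def conv_output_size(d1, d, s=1, p=0):
        """Spatial output size of a convolution, as in (2.17); assumes the division is exact."""
        return (d1 - d + 2 * p) // s + 1

    print(conv_output_size(388, 3, s=1, p=0))   # valid convolution: 386
    print(conv_output_size(388, 3, s=1, p=1))   # same convolution: 388
    print(conv_output_size(388, 2, s=2, p=0))   # strided downsampling: 194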

After convolution, a bias b, shared among all output pixels, is added to each entry of the feature map and the result is passed through an activation function, such as the ReLU. The result is an activation map, typically still referred to as a feature map after this activation. Typically more than one filter is used, resulting in a set of activation maps, previously referred to as channels.

2.4.2 Pooling layer

Pooling is a collection of methods used to replace pixel values with summary statistics of their neighborhood, possibly reducing the image size while doing so. For example, max pooling uses a kernel of size d × d × 1 and stride s = d, outputting the maximum value in the current image location. Pooling has the additional effect of making the image more invariant to small translations in the input.

2.4.3 Locally connected layer

Locally connected layers are related to convolutional layers. If one removes the weight sharing in a convolutional layer the result is a locally connected one (see Figure 2.3 for a comparison). The number of learnable weights then scales linearly with the number of image pixels, rather than with the number of layers as in a convolutional network. This yields a smaller memory footprint than in fully connected networks, but a larger one than in convolutional networks. Up- and downsampling can be accomplished using the same methodologies as in convolutional networks.
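The difference to a convolutional layer is easiest to see in one dimension, as in Figure 2.3: the sketch below applies a kernel of size 3 with a separate weight vector and bias for every output position. The input length and random parameters are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    x = rng.normal(size=32)                     # one-dimensional input
    d = 3                                       # kernel size
    n_out = len(x) - d + 1                      # valid output length

    # Convolutional layer: one shared weight vector and bias for all positions.
    w_shared = rng.normal(size=d)
    b_shared = 0.0
    conv_out = np.array([x[i:i + d] @ w_shared + b_shared for i in range(n_out)])

    # Locally connected layer: unshared weights and biases, one set per output position.
    w_local = rng.normal(size=(n_out, d))
    b_local = rng.normal(size=n_out)
    local_out = np.array([x[i:i + d] @ w_local[i] + b_local[i] for i in range(n_out)])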

2.5 Autoencoders

An autoencoder is a form of unsupervised learning where we have an input and wish to recreate it. This can be trivially accomplished by copying the input to the output which is pointless. The task is instead to first encode the input onto a set of latent variables (the code) which has a lower dimension than the input. After this we apply a decoder that tries to restore the input as well as possible. This restoration will in most scenarios be lossy, but the code will contain some meaningful low dimensional representation of the input data. An illustration of a simple autoencoder can be seen in Figure 2.4.

In an undercomplete autoencoder the code dimension is smaller than the input dimension. If the decoder is linear and the cost function is the mean squared error, this setup learns the same projection as Principal Component Analysis (PCA) would.

If we allow the decoder to be nonlinear, the low dimensional representation can become even more general than PCA. Instead of having a smaller code dimension than the input dimension one could use a regularized autoencoder, applying regularization to the latent variables. This regularization will force the autoencoder not to learn the identity mapping, thus having a similar effect as the undercomplete autoencoder.

Figure 2.3: A comparison of (a) a locally connected network and (b) a convolutional network. The locally connected network has unshared weights for each output node h_n, whereas the convolutional network has the same weights for all outputs. The pictured networks have kernel size 3 and a single one-dimensional input and output feature map.

Figure 2.4: A simple autoencoder where the difference between the input and output is minimized. The network consists of an input layer, a hidden layer, a latent layer, another hidden layer and an output layer; here the latent layer contains a single latent variable.

Figure 2.5: In a denoising autoencoder the input x is first distorted through some process C into x̃. This is then encoded onto some latent variables h by the encoder f and decoded to the output y by the decoder g. The encoder and decoder are optimized so that y is similar, in some cost function sense, to the clean input x.

2.5.1 Encoder

The encoder part of an autoencoder is typically a regular network found in other parts of deep learning theory. It can for example be a densely connected network, a convolutional network, or something more sophisticated. One can usually apply models developed for other tasks when building the architecture.

2.5.2 Decoder

In the case of autoencoders we wish to output an image at the end of the network. Since the input image has been reduced to a low dimensional latent feature space by the encoder, it needs to be brought back to a high dimensional space. Typically this has been done through a process called transposed convolution. It is also referred to as deconvolution, but this name is discouraged since transposed convolution differs from the mathematical inverse of the convolution operation. A more descriptive name is fractionally strided convolution. Convolution as described in Section 2.4 can be expressed as a regular matrix multiplication with some weights sharing the same value. Due to this fact one can transpose this matrix to get an operation that returns the original shape. In fact, a convolution with stride s = 1, kernel size d and zero padding p has an associated transposed convolution with stride s′ = s, kernel size d′ = d and zero padding p′ = d − p − 1 [25].

If the kernel size is not evenly divisible by the stride, it is very likely that the transposed convolution will create checkerboard artifacts due to uneven overlap of kernel applications [26]. The model could theoretically learn to counteract these artifacts by changing the weights, but this is rarely seen in practice. Another approach to upsampling that has enjoyed a recent upswing is interpolation, where common choices are nearest-neighbor interpolation and bilinear interpolation [26].
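A sketch of the interpolation alternative: nearest-neighbor upsampling by an integer factor, after which a regular (same) convolution can be applied to refine the result. The feature map is a placeholder; np.kron simply repeats each value in a 2 x 2 block.

    import numpy as np

    rng = np.random.default_rng(0)

    feature_map = rng.normal(size=(4, 4))              # low-resolution feature map from the encoder

    # Nearest-neighbor upsampling by a factor of 2 in both spatial dimensions.
    upsampled = np.kron(feature_map, np.ones((2, 2)))  # shape (8, 8), each value repeated 2 x 2

    # A learned same-convolution would normally follow to smooth and refine the upsampled map.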

2.5.3 Denoising autoencoder

A traditional Denoising AutoEncoder (DAE) requires access to noise-free data. We have some function C which, when applied to the examples x, yields corrupted versions x̃. The task of a DAE is then to minimize the cost

L(x, g(f(x̃))),   (2.18)

where f and g are the usual encoder and decoder functions. This requires that the distorting function C is known in order to evaluate the cost function. An example of a DAE can be seen in Figure 2.5.
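Schematically, evaluating the cost (2.18) for one minibatch looks as follows; the corruption process, encoder and decoder are trivial stand-in functions, with additive Gaussian noise playing the role of C.

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt(x):                     # the corruption process C (here: additive Gaussian noise)
        return x + 0.1 * rng.normal(size=x.shape)

    def encoder(x):                     # stand-in for f: keep only a few latent variables
        return x[:, :4]

    def decoder(h):                     # stand-in for g: map back to the input dimension
        return np.pad(h, ((0, 0), (0, 12)))

    x = rng.normal(size=(32, 16))       # a minibatch of clean examples
    x_tilde = corrupt(x)                # corrupted input
    y = decoder(encoder(x_tilde))       # reconstruction g(f(x_tilde))
    loss = np.mean((y - x) ** 2)        # MSE version of the cost (2.18), compared to the clean x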

3 Implementation

3.1 Tools

The models were implemented using TensorFlow 1.4.1, with GPU support, in the Python programming language. Everything in TensorFlow is stored as a tensor of one form or another. This allows for efficient implementations of a wide array of functions since the inputs are of consistent form. One can make use of predefined functions in a modular way to build all but the most obscure modern network architectures. This modularity stems from the fact that TensorFlow builds a computational graph, as described in Section 2.3.3. As long as the module has a well-defined gradient it can easily be integrated with other TensorFlow functions. If one wants to create something novel it can usually be accomplished, but may be difficult to implement efficiently.

All networks presented in this thesis were trained using the Davinci GPU cluster at Uppsala University. The cluster is composed of one set of 32 nodes and a second set of 24 nodes. Each node in the set of 32 contains two Intel Xeon E5-2620, 64GB of RAM and four NVIDIA GTX 680 with 4GB of RAM. The nodes in the set of 24 each contain two Intel Xeon E5-2620 v2, 64GB of RAM and four NVIDIA GTX Titan Black with 6GB of RAM.

3.2 Dataset

The background dataset contains 12904 background images of size 388×370 pixels (two ASICs). The beam will saturate and possibly damage the pixels closest to the incident point, so a physical beam blocker is placed in front of the beam. The blocked pixels therefore do not contain diffraction patterns and have been removed using the mask seen in Figure 3.1. There is also a small number of pixels that are damaged or otherwise misbehaving, which have also been masked out.

An example of a background image can be seen in Figure 3.2. The pixel values in the images were given in ADU, with values in the range of thousands to tens of thousands.

Figure 3.1: The mask applied to each image in the dataset. Black represents pixels that are removed and white the ones that are kept.


Figure 3.2: A single example image from the background dataset. (a) contains an unmasked image and (b) contains the same image after the mask has been applied.


Figure 3.3: (a) mean value and (b) standard deviation on a per pixel basis over the entire dataset of 12904 images.

One could describe the images as bi-modal, with a scattering component containing large ADU values superimposed over a dark run with mostly low ADU values. The scattering component also has a higher average standard deviation as seen in Figure 3.3.

A typical photon registers as an ADU increase of approximately 25 in the pixel, meaning the standard deviation in the scattering component is orders of magnitude larger. The standard deviation in the dark run component on the other hand is of the same order of magnitude as a photon.

3.2.1 Dataset split

The beam intensity varied non-randomly when the dataset was captured, as can be seen in Figure 3.4. The images with high beam intensity are the most useful ones since they have discernible diffraction patterns. Due to this fact the data examples with indices between 6000 and 8664 were chosen to be the test set. This is not representative of the full dataset, but a good performance on such high intensity images is preferable. The other 10240 images were randomly separated into a training set containing 80% of the remaining images and a validation set containing 20%.
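In code, the split described above amounts to the following; the indices and ratios are taken from the text, while the choice of endpoint convention (2664 test images, leaving 10240) and the image array itself are assumptions for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    n_images = 12904
    all_indices = np.arange(n_images)

    # High-intensity region chosen as the test set: 2664 images.
    test_idx = all_indices[6000:8664]

    # Remaining 10240 images split randomly into 80% training and 20% validation.
    remaining = rng.permutation(np.concatenate([all_indices[:6000], all_indices[8664:]]))
    n_train = int(0.8 * len(remaining))
    train_idx = remaining[:n_train]
    val_idx = remaining[n_train:]

    print(len(test_idx), len(train_idx), len(val_idx))   # 2664, 8192, 2048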


Figure 3.4: The ADU sum of each individual image in the dataset after being masked. (a) contains the unaltered ADU sum, whereas (b) has been smoothed with a window size of 50.

3.2.2 Artificial diffraction data

An artificial dataset of clean diffraction images was generated. A 40×40 square of random values sampled from the folded normal distribution abs(N (0, 1)) was inserted in the center of a zero matrix of size 2048 × 2048. A two dimensional fast Fourier transform was then performed on the matrix. The quadrants are switched diagonally and the squared absolute value is taken per pixel to achieve a more general, but still realistic, diffraction pattern.

An example can be seen in Figure 3.5a.

To increase realism the pattern intensity was scaled by a random factor drawn from the distribution abs(Ñ(0, (2429 · 0.5)²)), where the normal distribution is truncated. The value 2429 was arrived at empirically and shifts the maximum intensity to approximately one photon per unmasked pixel. Multiplying with this intensity factor guarantees a high proportion of low-intensity patterns while maintaining a large intensity range. This results in an average number of photons in each masked image close to the number of unmasked pixels. A Poisson sampling is then performed using these values as the per-pixel rate of photons. These are then converted to pseudo ADU values by multiplying pixelwise with N(25, 2²), resulting in an image as in Figure 3.5b. The patterns are also shifted vertically and horizontally by random numbers drawn from a distribution Ñ(0, 5²) truncated at two standard deviations and then rounded. The resulting artificial diffraction patterns can then be added to background images by aligning the center of the pattern with the center of the beam blocker in the center left of the background.
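The generation procedure can be sketched as follows. The parameter names are mine, and the details, in particular how the pattern is normalized before the intensity scaling and how the final crop onto the detector is done, are simplified assumptions based on the description above rather than the thesis code.

    import numpy as np

    rng = np.random.default_rng(0)

    # Clean pattern: a 40 x 40 square of folded-normal values in the centre of a 2048 x 2048
    # zero matrix, transformed with a 2D FFT; quadrants are swapped and the squared magnitude taken.
    field = np.zeros((2048, 2048))
    field[1004:1044, 1004:1044] = np.abs(rng.normal(size=(40, 40)))
    pattern = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2

    # Random intensity factor: folded normal with sigma = 0.5 * 2429, truncated at two sigma.
    sigma = 0.5 * 2429
    factor = rng.normal(0.0, sigma)
    while abs(factor) > 2 * sigma:
        factor = rng.normal(0.0, sigma)
    factor = abs(factor)

    # Expected photons per pixel; max-normalization of the pattern is assumed for illustration,
    # so the strongest possible factor corresponds to roughly one photon per pixel.
    rate = pattern / pattern.max() * (factor / 2429.0)

    # Poisson-sample photon counts, convert to pseudo ADU values and shift the pattern slightly.
    photons = rng.poisson(rate)
    adu = photons * rng.normal(25.0, 2.0, size=photons.shape)
    shift = np.round(np.clip(rng.normal(0.0, 5.0, size=2), -10, 10)).astype(int)
    adu = np.roll(adu, (int(shift[0]), int(shift[1])), axis=(0, 1))

    # The result would then be cropped to the detector geometry, aligned with the beam blocker
    # and added to a background image.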

3.2.3 Preprocessing

Unit Gaussian inputs to a neural network allow the optimization process to converge much more smoothly, meaning some preprocessing was preferable. A single scalar mean pixel value was calculated over the unmasked pixels in the training set. Using this mean value, a standard deviation could be calculated. The data was then rescaled by subtracting the mean and dividing by the standard deviation. This does not ensure that the dimensions separately have a unit Gaussian distribution, but it does ensure that the pixel values as a whole are unit Gaussian distributed. This process is typically called standardization.
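A sketch of this standardization; the training images and the mask are placeholders, and details such as biased versus unbiased standard deviation are unimportant here.

    import numpy as np

    rng = np.random.default_rng(0)

    train_images = rng.normal(1200.0, 300.0, size=(100, 388, 370))   # placeholder training set (ADU)
    mask = np.ones((388, 370), dtype=bool)                           # True for pixels that are kept

    # A single scalar mean and standard deviation over all unmasked training pixels.
    values = train_images[:, mask]
    mean = values.mean()
    std = values.std()

    def standardize(image):
        return (image - mean) / std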

Figure 3.5: Simulated diffraction pattern with (a) the theoretical number of incident photons per pixel and (b) Poisson-sampled ADU values using the photon numbers as the expected number of occurrences. (c) shows the diffraction pattern superimposed on a background image. The depicted diffraction pattern has a high intensity and is therefore easily distinguishable after being superimposed. (Axes in pixels; colour scales in photons per pixel for (a) and in ADU for (b) and (c).)
