
Semi-Supervised Domain Adaptation for Pick Classification in Pick and Place Machines

MITRA STRANDBERG


mitrast@kth.se

Master in Computer Science
Date: June 24, 2019

Supervisor: Hedvig Kjellström
Host Supervisor: Javier Cabello
Examiner: Joakim Gustafson

School of Electrical Engineering and Computer Science


Abstract

Pick and Place (PnP) machines capture large amounts of images of Printed Circuit Board (PCB) components. The data is used to train automated image analysis methods to improve the decisions in the mounting process. Previous work with Neural Networks has shown promising results in the classification of the component status. However, the characteristics of the data change over time as new PCBs, components, and PnP machines are deployed.

This work applies a Semi-supervised Domain Adaptation method named Associative Domain Adaptation to enable learning of a new and unlabeled data set. The networks reach high performance despite skew class distributions, but the final results do not outperform the current classification algorithm in the PnP machines. However, ensembling of different methods can make use of the strengths of both the current classification system and the method proposed in this thesis, where the ability to learn from unlabeled data is a promising advantage.


Sammanfattning

The electronics industry uses assembly lines with Pick and Place machines. These machines pick up components and then decide whether they should be mounted or rejected. By improving the analysis of a picked component, the use of material and resources can be made more efficient. Large amounts of images of picked components have enabled applications of neural networks that determine whether a component is suitable to mount. Since the industry continuously develops new components and Pick and Place machines, the appearance of the images that the network must classify changes over time. For a network trained with supervised learning to adapt to the latest data, experts must continuously spend time assigning labels to the new images.

To make it possible for neural networks to use new data that lacks labels, this work applies Semi-Supervised Domain Adaptation. A further problem with the data is that faulty components are rare. The chosen method, Associative Domain Adaptation, manages to adjust the learned distribution to the new domain even though the classes of the data sets have a skew distribution. The final networks reach high precision but do not outperform the existing decision systems. However, there is potential to exploit the strengths of both classification systems by combining them, where the networks' capacity to learn from data without labels is a promising advantage.


Acknowledgements

I would like to thank my supervisor, Hedvig Kjellström, for her guidance and valuable feedback during our meetings throughout the project. I want to thank my supervisor at Mycronic AB, Javier Cabello, for his great tutoring and innovative discussions in the office, along with Stefan Fu who also helped to form the data sets.

I would also like to thank Hossein Azizpour, who set up a meeting at the beginning of the project where I got an overview of different methods in domain adaptation, and Sofia Broomé for her tips on generators. Last but not least, I want to thank my family for their support and proofreading of the report.


Contents

1 Introduction
1.1 Research Question
1.2 Objective
1.3 Scope
1.4 Outline

2 Background
2.1 Transfer Learning
2.2 Domain Adaptation
2.2.1 Common Domain Adaptation Methods
2.3 Associative Domain Adaptation
2.3.1 Cross-Entropy Loss
2.3.2 Association Loss
2.3.3 Feed-Forward Neural Network
2.4 Embedding Visualization with t-SNE
2.5 Performance Metrics
2.5.1 Precision Recall
2.5.2 F1 Score
2.5.3 Precision Recall Curve (PRC)
2.6 Similar Problems and Applications

3 Data
3.1 PCB Components
3.2 Data Set Visualization with t-SNE
3.3 Skew Class Distribution
3.4 Oversampling and Image Augmentation
3.4.1 CLAHE
3.4.2 Inverse Gamma Correction
3.4.3 New Class Distribution

4 Methods
4.1 Input
4.1.1 Cropping
4.1.2 Scaling
4.1.3 Rotations
4.1.4 Image Normalization
4.2 Network Architectures
4.2.1 SVHN Models
4.2.2 MNIST Models
4.3 Default Hyper-Parameters and Intervals
4.4 Measure the Effect of Domain Adaptation
4.4.1 Baseline
4.4.2 Benchmark
4.4.3 Embeddings in t-SNE
4.5 Experiments
4.5.1 Oversampling
4.5.2 Optimizers
4.5.3 Hyper-Parameter Optimization

5 Results
5.1 Experiment 1: Oversampling
5.2 Experiment 2: Optimizers
5.3 Experiment 3: Hyper-Parameter Optimization
5.4 Embeddings from t-SNE

6 Discussion
6.1 Experiment 1: Oversampling
6.2 Experiment 2: Optimizers and Batch Normalization
6.3 Experiment 3: Hyper-Parameter Optimization
6.4 Embedding Distributions
6.5 Ethical, Sustainable, and Social Aspects
6.6 Further Research

7 Conclusions

Bibliography


Introduction

Multilayer Perceptrons have the ability to learn various tasks on high-dimensional data given that the data set is sufficiently large, for example in image classification, object recognition, language translation, and more. This thesis uses large image data sets with skew distributions of labels and features. Skew class distributions add bias in favor of the dominant classes while, at the same time, rare details can be of high priority in the decision making [1]. The labeling process may require human expertise, which is both time-consuming and expensive for the company. In addition, if the labeled subset is non-uniformly selected, erroneous, or constitutes a small fraction of the data, it may not be representative of the true distribution. All mentioned impediments raise interest in techniques to enforce a generalized distribution in the network weights, for example data set augmentation techniques [1] and unsupervised or semi-supervised learning [2].

When labels are scarce, unsupervised and semi-supervised learning exploit information in unlabeled data. Unsupervised methods make use of unlabeled data to find patterns. Semi-supervised learning can use an insufficient set of labeled data and complement it with unlabeled data to capture the whole distribution. Reports have shown that semi-supervised methods can reach the performance of their supervised counterparts even though a substantial portion of the data is unlabeled [3] [4] [5] [6] [7] [8] [9] [10]. This work applies a semi-supervised method named Associative Domain Adaptation [3] to a new data set that differs from commonly used benchmarks.


1.1 Research Question

Can semi-supervised methods under domain shift, such as Associative Domain Adaptation, be applied to a binary classification problem on images of Printed Circuit Board (PCB) components?

The binary classification problem is to determine whether the Pick and Place (PnP) machine has picked a component correctly, so that it can be placed on a PCB. The labels are ok pick (OK) or not ok pick (NOT OK). Semi-supervised methods are needed because the labeled data set is not representative of the data during test time. A model that can learn from unlabeled data as well has a greater chance to capture the whole distribution. The labeled data set, also referred to as the source domain data set (DS), can be used for supervised training. The second data set (the target domain data set, DT) requires semi-supervised or unsupervised methods. Domain adaptation exploits similarities between the two data sets by creating mappings between similar data points. By virtue of proper mappings, a classifier that performs well on DS can improve the performance on DT.

1.2 Objective

The objective of this thesis from the principal’s perspective is to investigate if semi-supervised neural networks can be used to classify new data. Due to neural networks’ heavy consumption of computational resources, it is of interest to adapt the network without fully retraining it.

Associative Domain Adaptation, developed by Haeusser et al. [3], is applied to an image data set that differs from common benchmark data sets. Reports that present new architectures or methods for neural networks often use data sets such as CIFAR [11], MNIST [12], SVHN [13], GTSRB [14] and more. Data sets in the industry can differ significantly, and a method that works well on benchmark data sets is not necessarily suitable in some industrial applications.

In our data set, a small detail in the pick angle is as important as a major fault such as an upside-down component. At the same time, the performance must be extremely high.


1.3 Scope

The classification task is restricted to the component status OK or NOT OK.

Figure 1.1 shows examples of one OK component and the different error classes that are all considered NOT OK. The errors Stop Production and Wrong Component are left out from the data sets since these are caused by corrupt pick tools or other faults in the assembly line.

The data sets are also limited to small PCB components. Large components have such large variations in appearance that even a qualified expert may find it difficult to determine the pick status from looking at a single image (see Figure 3.2 in Chapter 3). It is mentioned in Further Research (Section 6.6) that an extension to our solution is to evaluate the impact on a larger data set.

The existing data consists of approximately 900 000 images, where about 300 000 images are labeled by human experts. The distribution of labels is not uniform across all of the available data. For a supervised solution, the data during train time would differ from the data during deployment. Due to the huge amount of images that would need labeling for the data set to correspond to the true distributions, it is out of the scope of this thesis to label all of the images. Semi-supervised learning is applied to adapt a network to the unlabeled data set, which is also referred to as the Target domain (DT). The labeled data is referred to as the Source domain (DS) and it is the same data set as in the work by Kolibacz [15]. Figure 3.3b in Chapter 3 shows the result of dimensionality reduction with t-SNE to provide an intuition of the different distributions in DS and DT.

Results from the Domain Adapted networks ($CNN_{DA}$) will be compared to two corresponding networks: 1) a baseline Convolutional Neural Network (CNN) that is trained in a supervised setting on source domain data; 2) a supervised benchmark CNN trained on the target domain using labels generated by an automatic vision system. If the domain adapted network can outperform both of the supervised networks, it might be applicable in the industry.

The research question is adequately answered if at least one semi-supervised learning method is applied to DS and DT. The desired outcome is that Associative Domain Adaptation will perform well enough to be useful in the PnP machines.


[Figure 1.1 panels: (a) Ok, (b) Billboarded, (c) Corner pick, (d) Damaged, (e) Not picked, (f) Spinning, (g) Tombstoned, (h) Upside down, (i) Wrong pick angle, (j) Stop production, (k) Wrong component]

Figure 1.1: The images are taken with a camera placed under the picked component. (a) shows a correctly picked component, labeled OK. (b)-(i) show examples of wrongly picked and damaged components, labeled NOT OK. The error classes (j) and (k) are excluded from the data sets. A sample of Wrong Component can be picked correctly, so this error class would only add noise to the data sets. Stop Production occurs if the picking tool is contaminated; our problem regards the component's pick status.


1.4 Outline

The report consists of six additional chapters:

• Chapter 2, Background: Describes general Domain Adaptation and relevant academic work, which leads to Associative Domain Adaptation and the AdamW optimizer. Skew distributions in the data sets argue for specific performance metrics.

• Chapter 3, Data: Contents of the image data sets are explored, and differences between labeled and unlabeled data motivate Domain Adaptation. Oversampling and image augmentations are explained.

• Chapter 4, Methods: The work is divided into three main experiments, all of which employ image pre-processing and different hyper-parameters.

• Chapter 5, Results: Corresponds to each of the three experiments.

• Chapter 6, Discussion: Analysis of the results in the context of Semi-supervised Domain Adaptation and a general note on ethical aspects. Suggestions on further research regard both the used method and other approaches to the problem.

• Chapter 7, Conclusions: Evaluation of the final network's applicability in the industry.


Background

In practice, a model that is trained on a specific data set and task may not be of value if the setting changes to a similar problem; the input data can change over time, or the novel task might be so different that the old model does not apply. This chapter begins with an explanation of domain and task transfer in machine learning. Different approaches in the area are presented, including Associative Domain Adaptation [3], which is applied in this thesis. The data sets have skew class distributions, so the performance is measured with F1 score and PRC plots [16].

2.1 Transfer Learning

A challenge with applications in machine learning is to find training and validation sets that are representative of the problem. Examples of impediments are skewed class distributions, insufficient labels, and a continuously changing stream of data during test-time. Training a network from scratch can also be inefficient. For example, when analyzing the behavior of a large number of users the data can be too divergent for a reasonably sized network. In these different scenarios, it is of interest to reuse information and efficiently learn the necessary distributions through additional training and architecture modifications.

A neural network can be viewed as a task that is learned for a given input domain. The feature space is denoted $\mathcal{X}$, and a data set of input images is a set of elements in the feature space,

$$X = \{x_1, x_2, \ldots, x_n\} \in \mathcal{X}$$

The marginal distribution $P(X)$ is the distribution over all possible images. A domain is a collective term for both the feature space and the distribution,

$$\mathcal{D} = \{\mathcal{X}, P(X)\}$$

The task consists of a label space $\mathcal{Y}$ and a mapping function $\phi: \mathcal{X} \rightarrow \mathcal{Y}$,

$$\mathcal{T} = \{\phi(X), \mathcal{Y}\}$$

Any of the parameters in $\mathcal{D}$ or $\mathcal{T}$ can vary in transfer learning. The initial scenario consists of the source domain and source task. The target domain and task constitute the goal of the transfer. There are three main approaches in transfer learning that undertake the different scenarios [17]:

1. In inductive transfer learning the source task $\phi_S$ and target task $\phi_T$ are different, so the model must learn a new target mapping function. The domains can either be the same or different [17]. An example of inductive transfer learning is multi-task learning, where a model learns multiple tasks simultaneously. Previous work has shown that multi-task learning can result in stronger generalization [18].

2. In transductive transfer learning the task remains the same, $\mathcal{T}_S = \mathcal{T}_T$, while the domains are different, $\mathcal{D}_S \neq \mathcal{D}_T$. Labels are only available in $\mathcal{D}_S$. The domains can vary in two ways. Variation in feature space, $\mathcal{X}_S \neq \mathcal{X}_T$, means that new types of data points might occur. For example, the size of the images changes or RGB images are substituted with gray-scale images. Variation in the marginal distributions, $P_S(X) \neq P_T(X)$, means that there are different probabilities of observing some of the features in feature space. The domains have a weak difference and the problem is referred to as domain adaptation [17].

3. Unsupervised transfer learning handles the case where no labeled data is available in $\mathcal{D}_S$ nor $\mathcal{D}_T$. The source and target tasks or domains can be different but should be related [17].

This thesis focuses on transductive transfer learning in a semi-supervised setting. The variation resides in the domains; specifically, the difference is in the marginal distributions, $P_S(X) \neq P_T(X)$. The images in the two domains are very similar, but some features in the images vary, such as contrast and components. The feature space is coherent because the types of data points (image size, gray-scale, etc.) are consistent, $\mathcal{X} = \mathcal{X}_S = \mathcal{X}_T$. The same holds for the label space, $\mathcal{Y} = \mathcal{Y}_S = \mathcal{Y}_T = \{OK, NOT\ OK\}$. The mapping function $\phi$ connects $\mathcal{Y}$ to samples in feature space $\mathcal{X}$, and thus the source and target tasks are equal, $\mathcal{T}_S = \mathcal{T}_T$.

2.2 Domain Adaptation

Domain adaptation is useful when a model trained on a source domain needs to learn the same task on a different but related target domain. Domain variations mainly reside in the marginal distributions. This is a common problem when the labeled data set is not representative of the data during test time, so the training data suffers from a shift in the distribution of features [4]. Semi-supervised Domain Adaptation exploits similarities between the two domains to learn tasks in the unlabeled target domain. In the setting of a classification task, a classifier on DS that learns to map samples from DT to DS will also be able to classify in DT.

Deep Neural Networks transform high-dimensional data, for example images, to vectors in a lower dimension. The vector in the last layer is referred to as an embedding. The theory proposed by Ben-David et al. [19] [20] introduces domain adaptation by assuming the existence of a classifier that performs well on both domains. The classifier learns embeddings that minimize the distance between the domains. The distance between the distributions is measured by the H-divergence. It is large if samples from DS have low probability to occur in DT, and it is small if the domains overlap. However, the distribution functions are not available, so the H-divergence is approximated using samples from both of the domains. By virtue of the estimated H-divergence $d_{\mathcal{H}}(D_S, D_T)$ and the classification error on the source domain $\epsilon_S$, the upper bound on the target domain error $\epsilon_T$ is defined as,

$$\epsilon_T \leq \epsilon_S + \frac{1}{2} d_{\mathcal{H}}(D_S, D_T) \quad (2.1)$$

The equation above shows the two main notions of successful domain adaptation. The network must learn discriminative features, represented by the source error $\epsilon_S$. A minimized source error yields accurate estimations of class labels. The learned features must also be domain invariant, which is achieved by minimizing the distance between the domains. In practice, training should favour features that exist in both DS and DT.


2.2.1 Common Domain Adaptation Methods

This section walks through different methods that are useful when the distribution of the data changes during test-time. We begin with the most intuitive approach of fine-tuning a pre-trained network, followed by proxy-labeling, and finally Adversarial Networks. All approaches are relevant for the research in Domain Adaptation and were considered for the problem in this thesis. Our chosen approach is presented in the next section, but these methods would be applicable either if DT contained a labeled subset or if the access to computational power increased.

Fine-Tuning

Classic Transfer Learning, or Fine-Tuning, applies when the difference between the two domains is weak. Intuitively, the first layers in a Convolutional Neural Network (CNN) extract basic features; the internal image representations are reminiscent of Gabor Filters or color blobs. The deeper layers capture details in higher dimensions [21]. Because of this, the last layers can be substituted with newly initialized layers and tuned with gradient descent in a supervised setting. The approach requires a smaller data set with labels in DT. Since assigning labels is out of scope and there exists a large pool of unlabeled data, it is of interest to use semi-supervised learning to capture as much of the distribution as possible.

Proxy-Labels

One approach to learning from unlabeled data is to let a classifier on DS assign pseudo-labels to samples in DT. The classifications with high confidence are viewed as ground truth by the model. In the beginning, the most confident pseudo-labels are added to the training set, and as training progresses the model automatically gains confidence in more intricate samples. Asymmetric Tri-Training lets four CNNs train jointly [5]. Two networks generate pseudo-labels by training on both DS and the most confident samples in DT. A third network trains only on the pseudo-labeled data set. The fourth network is a feature extractor that outputs features that the other three networks jointly train on. The result is a network that can classify samples in DT [5]. Given the amount of computational power available in this work, it is of interest to use a simpler architecture.


Adversarial Domain Adaptation

The last couple of years have delivered various new methods for Adversarial Domain Adaptation. It makes use of a generator-discriminator framework similar to the one used in Generative Adversarial Networks (GAN) [22]. Specifically for Domain Adaptation, the generator takes images from DS and transforms them to DT, and the discriminator distinguishes between the domains. UNIT [7] and CycleGAN [8] have shown success in transforming images between domains. These types of methods have delivered photo-realistic images, but the high-dimensional transformations are computationally heavy. In addition, if DS systematically excludes features present in DT, it might not be possible to transfer between the domains.

A more lightweight approach is to use only low-dimensional representations of the images (embeddings). Some of the features are not present in the embedding space and thus do not require reconstruction. One example is Domain Adversarial Neural Networks (DANN) [4], shown in Figure 2.1. It makes use of three sub-networks and trains them jointly, which results in a classifier that bases predictions on features that exist in both domains.

Figure 2.1: Illustration of DANN from Ganin et al. [4]. All networks use standard feed-forward propagation. The label predictor updates the network weights with regular gradient descent, but the domain classifier multiplies the loss by a negative constant during back-propagation. This ensures that the features in the embeddings are similar regardless of the true domain [4].


2.3 Associative Domain Adaptation

This section describes the method applied in this thesis. Associative Domain Adaptation [3] performs Semi-supervised Domain Adaptation with a simple feed-forward neural network architecture. The idea behind Associative Domain Adaptation builds on learning by association, a learning concept utilized by both humans and animals where old knowledge and ideas reinforce new ideas. A child that learns what a dog is can recognize different types of dogs without being exposed to several samples of each type of dog [23].

Associative Domain Adaptation [3] applies learning by association [23] to neural networks with the help of a new loss function. The new loss maximizes the probability of correctly classifying labeled data points while simultaneously bringing the distributions of DS and DT closer to each other. It is defined as the sum of a classification loss and an association loss. The classification loss ($\mathcal{L}_{classification}$) is a regular cross-entropy loss that encourages correct classifications in DS, and the association loss ($\mathcal{L}_{association}$) creates mappings between the domains. The association loss is controlled by a weight factor $\alpha$, where $\alpha = 0$ lets the network train on DS in a supervised setting,

$$\mathcal{L} = \mathcal{L}_{classification} + \alpha \mathcal{L}_{association} \quad (2.2)$$

This section goes into depth on the loss, followed by explanations of concepts used in the feed-forward networks such as activation functions, batch normalization, and optimizers.

2.3.1 Cross-Entropy Loss

The supervised classification loss is a cross-entropy loss. Let $y$ be a one-hot vector for the label and $p$ the estimated softmax distribution. A high $p_i$ implies strong confidence that the label is $Y_i$. The cross-entropy $H$ is non-negative and is minimized when the classifier is confident in the correct classifications,

$$\mathcal{L}_{classification} = H(y, p) = -\sum_{i}^{|\mathcal{Y}|} y_i \log(p_i) \quad (2.3)$$

In the context of domain adaptation, the classification loss enforces discriminative features. It is minimized when the learned embeddings generate correct classifications of samples from DS.


2.3.2 Association Loss

The second term in Eq. 2.2 enables domain invariance by forcing the network to create similar embeddings for samples from different domains that belong to the same class. The walker loss ($\mathcal{L}_{walker}$) maximizes the probability that a sample in DS is transferred back to the same class in DS via a sample from DT. The transfer probabilities are based on how similar two samples are, so we need to avoid that training only creates mappings to the most generic samples. The visit loss ($\mathcal{L}_{visit}$) adds a randomizing factor to help the network evaluate the more intricate samples in DT as well. The total association loss is the weighted sum,

$$\mathcal{L}_{association} = \beta_1 \mathcal{L}_{walker} + \beta_2 \mathcal{L}_{visit} \quad (2.4)$$

where $\beta_1$ and $\beta_2$ are weight factors named the walker weight and the visit weight.

Batches of images from DS and DT are fed to an L-layer CNN, denoted as a function $\phi_{0:L}: \mathbb{R}^{N_0} \rightarrow \mathbb{R}^{N_L}$. Let $x^s$ denote an image sample from the source domain and $x^t$ a sample from the target domain. The network's internal representations of the images are then defined as,

$$S_i = \phi_{0:L-1}(x_i^s), \quad T_j = \phi_{0:L-1}(x_j^t) \quad (2.5)$$

and the measure of similarity for an embedding pair is the dot product,

$$M_{ij} = S_i \cdot T_j$$

The new loss function makes the network learn mappings from $S_i$ to another embedding $S_k$ in the same class. The mapping transfers via any of the target domain embeddings in $T$. The probability of specifically choosing $T_j$ is computed with softmax over the similarity measures, and it is maximized when the samples are similar,

$$P_{ij}^{st} = P(T_j | S_i) = \frac{e^{M_{ij}}}{\sum_{j'=0}^{|T|} e^{M_{ij'}}} \quad (2.6)$$

The total round-trip probability is the joint probability to also transfer back to the source domain, from $T_j$ to $S_k$,

$$P_{ik}^{sts} = (P^{st} P^{ts})_{ik} \quad (2.7)$$

This can be explained as an imaginary walker that switches between the source and target domain. The more similar instances in the different domains are, the more likely they are chosen by the walker. The walker loss is a cross-entropy loss that is minimized when the round trip stays in one class in DS,

$$\mathcal{L}_{walker} = H(U, P^{sts}) \quad (2.8)$$

where,

$$U_{ik} = \begin{cases} 1/|S_i|, & \text{if } class(S_i) = class(S_k) \\ 0, & \text{otherwise} \end{cases}$$

The more similar an image pair $(S, T)$ is, the more likely the network will learn to map them. This approach keeps the distinction between classes after the domain invariant features are extracted.

However, $\mathcal{L}_{walker}$ does not regard the aspect that the model should consider the whole feature spaces $\mathcal{X}_S$ and $\mathcal{X}_T$. It suffers from the risk of getting stuck in a local minimum by only creating mappings to images in the target domain that are similar to an average representation of the source domain. This means that the model converges to round-trip probabilities $P^{sts}$ where the target embedding $T_j$ is the sample that is closest to an average of $S_i$ and $S_k$, which hinders the network from adapting to the whole target domain. A visit loss is added to promote mappings to different embeddings from the target domain. It spreads the probability of visiting a sample of $T$ given $S_i$. It is defined as the cross-entropy of the uniform distribution over all target samples and the probability of visiting a target sample $T_j$ by starting in any of the source samples $S$,

$$\mathcal{L}_{visit} = H(V, P^{visit}) \quad (2.9)$$

where,

$$V_j = \frac{1}{|T|} \quad \text{and} \quad P_j^{visit} = \sum_{x_i \in D_S} P_{ij}^{st}$$

The visit weight $\beta_2$ in Equation 2.4 controls how much to spread the probability of visiting all target samples. Similar class distributions in DS and DT allow a high $\beta_2$, but it introduces bias when the class distributions are different. When the visit loss enforces mappings to samples in the target domain with rare features, the uniform selection of samples assumes that the class distribution in DT is equal to the one in DS. The bias can be reduced with a low visit weight relative to the walker weight $\beta_1$.
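To make the round-trip computation concrete, the following is a minimal NumPy sketch of the walker and visit losses under the definitions above. It is an illustration, not the thesis implementation (which used TensorFlow); the function and variable names are ours, and the visit probability is normalized as a mean over source samples so that it forms a proper distribution (it differs from the sum in Eq. 2.9 only by a constant factor).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def association_loss(S, T, labels_s, beta1=1.0, beta2=0.5, eps=1e-8):
    """Walker and visit losses (Eqs. 2.4-2.9) for a batch of embeddings.

    S: (n_s, d) source embeddings, T: (n_t, d) target embeddings,
    labels_s: (n_s,) integer class labels for the source batch.
    """
    M = S @ T.T                          # similarity M_ij = S_i . T_j
    P_st = softmax(M, axis=1)            # P(T_j | S_i), Eq. 2.6
    P_ts = softmax(M.T, axis=1)          # P(S_k | T_j)
    P_sts = P_st @ P_ts                  # round-trip probabilities, Eq. 2.7

    # U: uniform over source samples of the same class as S_i (Eq. 2.8).
    same_class = (labels_s[:, None] == labels_s[None, :]).astype(float)
    U = same_class / same_class.sum(axis=1, keepdims=True)
    walker = -np.mean(np.sum(U * np.log(P_sts + eps), axis=1))

    # Visit loss: push the walker to visit all target samples (Eq. 2.9).
    P_visit = P_st.mean(axis=0)          # probability of visiting each T_j
    V = np.full(T.shape[0], 1.0 / T.shape[0])
    visit = -np.sum(V * np.log(P_visit + eps))

    return beta1 * walker + beta2 * visit
```

The total training loss is then the cross-entropy classification loss on the source batch plus $\alpha$ times this association loss, as in Eq. 2.2.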


2.3.3 Feed-Forward Neural Network

Besides the new association loss, Associative Domain Adaptation uses a regular feed-forward CNN. This section describes relevant activation functions, weight initialization, Batch Normalization, regularization, and optimizers.

ELU

Exponential Linear Unit (ELU) [24] is an activation function that also takes negative values, which pushes the output mean towards zero. The result is a reduced bias shift and faster learning. ELU has been shown to converge faster and reach a higher accuracy than ReLU with Batch Normalization for Deep Residual Networks [25]. Let $x$ denote the unit input and let $\alpha$ be a positive constant; the function is then,

$$ELU(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha(e^x - 1), & \text{if } x < 0 \end{cases} \quad (2.10)$$

Vanishing gradients are avoided as the function slowly approaches the lower bound $-\alpha$ for large negative inputs. There is a risk of exploding gradients when $x$ is large, but this can be avoided with proper initialization of the unit weights and Batch Normalization.

Batch Normalization

Another approach to avoid unstable gradients is to normalize the hidden layer outputs. Batch Normalization calculates the mean ($\mu$) and variance ($\sigma^2$) to normalize each batch. Both parameters can be estimated with moving averages, where $m$ is the momentum that controls how much the current batch affects the moving averages,

$$\mu = m\mu + (1 - m)\mu_{batch} \quad (2.11)$$
$$\sigma^2 = m\sigma^2 + (1 - m)\sigma^2_{batch}$$

The batch normalization layer updates the moving averages and normalizes the outputs with $\mu$ and $\sigma^2$ for each training iteration, and during test time it reuses the learned parameters.
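A minimal sketch of this update for a 2-D batch of activations; the learned scale and shift parameters of full Batch Normalization are omitted, and the class name is ours.

```python
import numpy as np

class MovingAverageBatchNorm:
    """Normalize with batch statistics while training (updating the moving
    averages per Eq. 2.11), and with the stored averages at test time."""

    def __init__(self, dim, momentum=0.99, eps=1e-5):
        self.m, self.eps = momentum, eps
        self.mu = np.zeros(dim)    # moving mean
        self.var = np.ones(dim)    # moving variance

    def __call__(self, x, training=True):
        if training:
            mu_b, var_b = x.mean(axis=0), x.var(axis=0)
            self.mu = self.m * self.mu + (1 - self.m) * mu_b
            self.var = self.m * self.var + (1 - self.m) * var_b
        else:
            mu_b, var_b = self.mu, self.var
        return (x - mu_b) / np.sqrt(var_b + self.eps)
```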


Adam

Adaptive optimizers compute different learning rates for each weight to prevent training from getting stuck in flat regions of the objective function, for example in local minima. Adaptive Moment Estimation (Adam) [26] adjusts the learning rate according to estimates of the first order moment ($m$) and second order moment ($v$) of the gradients. The moments are initialized to 0 and updated with a fraction of the gradient $g$ for each training time step $t$,

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad (2.12)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$\beta_1$ and $\beta_2$ are suggested to be set to 0.9 and 0.999. The initialization causes $m$ and $v$ to be biased towards zero, which motivates the following bias corrections,

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad (2.13)$$

Lastly, the weight parameters are updated with a general learning rate $\alpha$,

$$w_t = w_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \quad (2.14)$$

In cases where the function does not have a second order moment, $\epsilon$ prevents division by zero. The original paper by Kingma and Ba [26] uses $\alpha = 10^{-3}$ and $\epsilon = 10^{-8}$.
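A sketch of a single Adam step following Eqs. 2.12-2.14; the function name and the dictionary holding the moment estimates are illustrative.

```python
import numpy as np

def adam_step(w, g, state, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at time step t >= 1."""
    state["m"] = beta1 * state["m"] + (1 - beta1) * g        # Eq. 2.12
    state["v"] = beta2 * state["v"] + (1 - beta2) * g ** 2
    m_hat = state["m"] / (1 - beta1 ** t)                    # Eq. 2.13
    v_hat = state["v"] / (1 - beta2 ** t)
    return w - alpha * m_hat / (np.sqrt(v_hat) + eps)        # Eq. 2.14

# Usage: state = {"m": np.zeros_like(w), "v": np.zeros_like(w)} before step 1.
```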

Regularization

For any statistical model, the bias-variance trade-off describes the balancing act between learning details of the data and maintaining a general distribution. Deep Neural Networks have strong variance with the ability to learn complex functions in high-dimensional spaces. The risk of over-fitting is reduced by adding bias. Ridge regression (L2) adds a penalty term to the loss function, i.e. the sum of the squared weights. The optimizer will then minimize both the loss and the weight values, so the hidden units cannot fit perfectly to each data point. For a layer $l$, let $W_l$ denote the weights and $\lambda$ scale the amount of regularization,

$$Loss \leftarrow Loss + \lambda \sum_{w \in W_l} w^2 \quad (2.15)$$


AdamW

Shortly after Associative Domain Adaptation was published, Loshchilov et al. showed that L2 regularization is not equivalent to weight decay when combined with adaptive optimizers [27]. When L2 is used, the gradients are computed for both the loss and the regularization term in Eq. 2.15. The Adam optimizer uses these gradients to compute the first and second order moments (Eq. 2.12), which are then normalized in the weight update (Eq. 2.14). Because the regularization only enters through the current gradient $g_t$ and is then normalized, weights with large gradient magnitudes are regularized by a smaller relative amount than weights with small gradients. Decoupled weight decay for Adam (AdamW) regularizes each weight equivalently. Instead of adding L2 regularization to the loss function, AdamW adds the gradient of the L2 regularization term to the weight update in Eq. 2.14 [27]. The new version of the weight update is,

$$w_t = w_{t-1} - \left( \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_{t-1} \right) \quad (2.16)$$

2.4 Embedding visualization with t-SNE

Domain Adaptation must form embeddings that are discriminative and domain invariant. The learned distributions use features that exist in both domains to distinguish between the different classes [19] [20]. If the embeddings are visualized as points in two or three dimensions, they will form one cluster per class where the domains are inseparable within each cluster. The learned distribution is often visualized with t-Distributed Stochastic Neighbor Embedding (t-SNE) [3] [4] [5] [6]. It is an unsupervised machine learning algorithm that clusters data based on similarities and transforms each sample to a point in two- or three-dimensional space. Pairs of samples are randomly chosen and the conditional probability that the two are neighbors is computed based on the similarity of the two vectors. The conditional probabilities are used to find the Kullback-Leibler divergence for each pair, where high conditional probabilities imply low divergence. t-SNE uses the sum of all Kullback-Leibler divergences as the cost function and minimizes it with gradient descent. The algorithm is not deterministic, so the results differ between runs [28].
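For illustration, a typical invocation with scikit-learn's implementation (the thesis used the Multicore t-SNE library [44]; the array of embeddings here is randomly generated as a stand-in):

```python
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(500, 128)   # stand-in for 128-dim network embeddings
points_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
# points_2d has shape (500, 2); results vary between runs since t-SNE
# is not deterministic.
```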


2.5 Performance Metrics

The choice of performance measure plays a vital role in the heuristics for estimation of the optimal network. A binary classifier learns the probability distributions of positives and negatives illustrated in Figure 2.2. Overlapping distributions lead to false predictions and raise the complexity in determining the boundary.

Figure 2.2: Example histogram with positives in yellow, negatives in red, and the classifier's decision boundary as a black line. True Positives (TP) and True Negatives (TN) are the correct classifications. Incorrect classifications occur in the overlap, as False Positives (FP) and False Negatives (FN).

Let positives correspond to correctly picked components (OK) and negatives to the incorrectly picked components (NOT OK). The True Positive Rate (TPR) is the proportion of correctly classified positive samples, and the False Positive Rate (FPR) expresses the proportion of incorrectly classified negative samples,

$$TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN}$$

A perfect classifier has $TPR = 1$ and $FPR = 0$ on a test set that represents the true distribution of the data. The Receiver Operating Characteristic (ROC) curve is a common metric to show a classifier's ability to distinguish between positives and negatives for different thresholds. In this work, the class distribution is heavily skewed, where positive samples dominate the data. A drawback of ROC curves is the inability to measure the performance in relation to the distribution. For example, if the number of negatives (FP and TN) is doubled, FPR remains unchanged,

$$FPR_{new} = \frac{2FP}{2FP + 2TN} = \frac{FP}{FP + TN} = FPR$$

2.5.1 Precision Recall

An alternative measure of performance that is appropriate for skew data is Precision Recall (PR), where Recall equals TPR and Precision substitutes for FPR,

$$Recall = TPR = \frac{TP}{TP + FN}, \quad Precision = \frac{TP}{TP + FP}$$



Precision tells the proportion of correct classifications among the positive predictions. We repeat the example where the total number of negatives is doubled. As the number of FP increases, the Precision will decrease,

$$Precision_{new} = \frac{TP}{TP + 2FP} < Precision$$

The PR metric captures variations in performance when the data distribution changes. Therefore, it is one of the preferred metrics for skew data sets [16]. Training on skew data introduces a bias in favour of the overrepresented features [29]. So if positives instead correspond to the NOT OK class, the PR metric carefully evaluates the underrepresented class.

2.5.2 F1 Score

When comparing different models it is convenient to represent the performance in one metric. The F1 score is the harmonic mean of Precision and Recall. The score equals one for a perfect classifier, and in the worst case it is zero,

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

2.5.3 Precision Recall Curve (PRC)

A Precision Recall Curve (PRC) plot (Figure 2.3) measures the precision and recall for different classification thresholds between 0 and 1. It shows how well the model separates the different classes. As in Figure 2.3, the y-axis measures precision and the x-axis shows recall. The performance of a random classifier is referred to as the baseline ($b$). It depends on the distribution of positives ($P$) and negatives ($N$),

$$b = \frac{P}{P + N}$$

Figure 2.3: Examples of PRC plots for different data distributions. The left plot has a balanced distribution of positives and negatives, with baseline $b = 1000/2000 = 0.5$. The right plot shows the same classifier on unbalanced data, where the baseline drops to $b = 1000/11000 \approx 0.09$. The best performing models are close to the upper right corner and the area is close to 1. Figure from Saito and Rehmsmeier [16].

A perfect classifier follows the top of the PRC plot toward the upper right corner, so precision stays perfect while recall grows with the classification threshold [16]. The collective performance over all thresholds is summarized as the area under the PRC. Measures such as Area Under Curve (AUC) use linear interpolations between each point. We use the more pessimistic Average Precision (AP) because it interpolates using the lower bounds. Let $P_n$ and $R_n$ denote Precision and Recall at the $n$th threshold,

$$AP = \sum_n (R_n - R_{n-1}) P_n$$
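A sketch of how a PRC and AP can be computed with SKLearn Metrics [45], which the thesis also used; the labels and scores here are made-up placeholders.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([0, 0, 1, 1, 1])               # 1 = positive class
y_score = np.array([0.1, 0.6, 0.35, 0.8, 0.9])   # classifier confidences

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)    # area via lower-bound interpolation
```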

2.6 Similar Problems and Applications

In Surface Mount Technology (SMT) and construction of PCBs, machine vision is used for Automated Optical Inspection (AOI) to detect errors and to improve the performance of the mounting process [30] [31] [15].


In 2011, Mar, Yarlagadda, and Fookes [30] used images of PCBs to diagnose the solder paste circuits printed onto the board. Features in the images are extracted with illumination normalization, solder paste localization, and segmentation. The solder joints on the PCBs are then classified with Log Gabor Filters, the Discrete Wavelet Transform, and the Discrete Cosine Transform. Even earlier, Kim, Cho, and Kim [31] developed an AOI system for classification of solder joints by combining a feature extractor for the circuits with a neural network classifier. The neural network approach degraded in performance because of the complexity of the solder paste patterns on each PCB [30]. Large variations in the data call for large training sets to cover the whole feature space.

Another example, where the images appear more similar to the PCB chip components, is the work by Wang et al. [32], where a CNN was trained to detect pills in images. The domain adaptation component in that work was to bring the domains closer by applying image pre-processing to make edge detection more robust [32].

A particular impediment in this work is the skew distributions in the data and an unbalanced relationship between the importance of false positives and false negatives. Often, but not always, it is more costly to mount a defective component (false negative) than to discard a correctly picked component (false positive). This is a common problem in medical diagnostics because a false negative may miss a case of cancer or other severe diseases. Data from ill patients can be difficult to find and the labeling process is expensive because it often requires specialized doctors. Cohen et al. [33] tackle the problem of imbalanced data with over- and undersampling and then classify the samples with a support vector machine that strictly penalizes false negatives by adding slack variables with different weights for the two classes. Baur, Albarqouni, and Navab [34] compare different types of GANs to generate realistic images of skin lesions, which can be used to balance the class distribution.

Regarding applications of Semi-supervised Domain Adaptation for CNNs, Lunga et al. [35] classified satellite images in the target domain by virtue of interactive learning. The training process selects samples from the source domain based on a relevance measure, where a source sample is relevant if its semantic representation is close to a set of target samples. The training loop also requests labels from a human expert for misclassified samples in the target domain.


Data

This chapter describes the data collection and characteristics of the labeled and unlabeled data. Differences between the two data sets motivate the relevance of domain adaptation, where the labeled and the unlabeled data respectively belong to the source domain (DS) and target domain (DT). The skew class distributions can be adjusted with both undersampling and oversampling.

3.1 PCB Components

A Printed Circuit Board (PCB) connects transistors, buses, chips, and other components that are used in electrical devices. When Surface Mount Technology (SMT) emerged in the 1960s, the wired circuits were replaced with solder paste printed onto the surface of the board. The new technique made it possible to build compact PCBs, and the modern assembly line must handle broad variations of components, circuit architectures, manufacturing speeds, and precision. The components are mounted onto the PCB by a Pick and Place (PnP) machine. It picks a component, takes a picture, and decides whether to mount or discard the component. Possible causes of a discarded component are previous errors in the assembly line, corrupt components, or a failed pick by the PnP machine. This thesis regards classification of the state of the component, recall Figure 1.1 in Section 1.3. The excluded error classes are Stop Production and Wrong Component. Stop Production shows random errors mostly caused by a pick tool that is contaminated with solder paste. This is considered out of scope because it expresses the status of the pick tool rather than the component. In the case of Wrong Component, the pick status can be correct but the component should not be mounted onto the PCB. Inclusion of this error class would only add noise to the data.

In a previous thesis, Kolibacz [15] successfully applied a CNN to the binary classification problem of OK/NOT OK picked chip components. A common problem in the deployment of machine learning algorithms, including neural networks, is that the data can change over time. For example, the data traffic in a communication network can change if the provider improves the bandwidth or speed, and in email spam filters the distribution of emails that reach the inbox varies between users. In both examples, the marginal distribution during training is different from deployment. For the PCB component images, new features cause a shift in the marginal distribution such that $P_S(X) \neq P_T(X)$. Neural networks maximize the likelihood of observing a label L given the data X and weights w, $P(L|X, w)$. Thus, if the data change, the network must adapt the likelihood function. The source domain data set DS is similar to the data set used in the previous thesis by Kolibacz [15]. The target domain data set DT has new characteristics:

1. Updated versions of the PnP machines use different cameras. The images have different contrast, lighting, and higher resolution.

2. Customers introduce new components to the pool of objects that need classification.

For the neural network to be useful in practice, it either needs to continuously adapt to the latest data or obtain strong generalization of the image embeddings. Figure 3.1 shows the difference between the source and the target domain. The components are similar in shape and the images are taken with different cameras.

Large size components are excluded due to the variation in shapes, see Figure 3.2. In domain adaptation it must be possible to create mappings between source and target domain images. The components in source and target domain should be similar in order to derive proper mappings. The closer the domains are, the more features can be used to achieve discriminativeness [19] [20] [3].

To facilitate correct classifications the data sets only contain component sizes 0201, 0402 and 0603, see Table 3.1.

(31)

(a) Sample from DS. (b) Sample from DT.

Figure 3.1: Examples of components of the same size from each of the domains. The contrast and resolution are often higher in the target domain. Both the calibration of the cameras and new components that occur in DT are possible reasons for the drop in accuracy of the current supervised classifier.

Figure 3.2: Examples of large components. Large component sizes are excluded from the data sets because of the large variation in appearance, and there is more room for improvement in the classification of small components.


Table 3.1: Three different sizes of components exist in the data sets. The images show examples of each component size, and the width and height dimensions are specified.

Component type   Height   Width
0201             0.5 mm   0.3 mm
0402             1 mm     0.5 mm
0603             1.5 mm   0.8 mm


3.2 Data Set Visualization with t-SNE

t-SNE can reveal properties of the distribution of the raw images in DS and DT. In Figure 3.3a, the OK samples form distinct clusters while the NOT OK images appear to be almost randomly spread out in the plane. Intuitively, this can be interpreted as different types of components having different features, for example as in Table 3.1, which shows samples of different component sizes. The NOT OK images are of a more random nature and do not cluster in the same way. Figure 3.3b shows some distinct clusters for samples in DS and DT. Both domains contain samples that t-SNE failed to distinguish, but in general the algorithm manages to find differences between the two domains. This can also indicate why networks that have high accuracy in DS fail to capture the distribution in DT. The desired outcome of domain adaptation is to make it difficult for t-SNE to separate the domains and at the same time simplify the distinction between OK and NOT OK samples.

(a) Colors of OK (yellow) and NOT OK (purple). (b) Colors of DS (green) and DT (blue).

Figure 3.3: t-SNE dimensionality reduction on raw image data from the test sets in DS and DT, with 2000 samples from each domain. Perplexity was 46, the learning rate 263, and the algorithm ran for 7967 iterations.


3.3 Skew Class Distribution

According to the automatic vision system, the complete data set has an approximate ratio of 150:1 (OK:NOT OK) components. If the deployed classifier assigns the label OK to all images, it will reach an accuracy of 99.3 %. So, it is important to consider the skew class distribution in the training phase and use appropriate metrics to evaluate the network. Ideally, the network would train on the real distribution. However, our labeled data set is shifted towards incorrect picks. The approximate source domain ratio is 2:1 and the selected target domain has a similar distribution, see Table 3.2. The following section describes approaches to balance the data combined with different augmentation techniques.

Table 3.2: Original distributions of DS and DT. The distribution of OK/NOT OK images is unbalanced, which introduces the risk of additional bias. Oversampling with image augmentation will be applied to give both of the classes equal importance in the classifier. The distribution of different components is also skew, where the smallest components are rare, especially in DT.

Domain   # Images   OK              NOT OK          0201   0402   0603
DS       321746     207639 (65 %)   114107 (35 %)   11 %   52 %   37 %
DT       550000     365576 (66 %)   184424 (34 %)   2 %    49 %   49 %
Total    871746     573215 (66 %)   298531 (34 %)

3.4 Oversampling and Image Augmentation

Skew class distributions are common in practical machine learning problems, for example in medical diagnostics [1] [33] and fraud detection [36]. Skew distributions of the data complicate learning of the relevant features by adding a bias in favor of the overrepresented classes. There exist a variety of methods to reduce the bias, either by ensembling [36], adjustment of the network weights during training, or modifications of the data set [1] [29].

Random Majority Undersampling adjusts class imbalance by randomly selecting samples to remove from the overrepresented classes. Experiments have shown that neural networks trained on undersampled data sets are inferior to networks trained on the original data set unless the original distribution is heavily skewed [1] [29]. In our case the labeled data set is undersampled, but this is not necessarily a disadvantage because of the extremely skewed original distribution. Random Minority Oversampling is a form of selection with replacement applied to the underrepresented classes; a sketch of the idea follows below. Appropriate augmentations of the resampled images will encourage generalization instead of overfitting [29], and the result is a reduced bias by balancing the data without removing information. It is emphasized that the training set must be large compared to the number of features; otherwise, the additional features from the augmentations outvote the characteristics of the original data set [1]. The augmentations are applied with caution to preserve the classification-critical features in each sample. For example, an image of a correct pick should not be rotated to an incorrect angle. The following sections describe the augmentation techniques and lastly the oversampled data distribution.
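A minimal sketch of random minority oversampling for a binary label array; the function is illustrative and assumes augmentations are applied to the duplicated images afterwards.

```python
import numpy as np

def oversample_minority(images, labels, minority_class):
    """Resample the minority class with replacement until the split is 50/50."""
    minority_idx = np.where(labels == minority_class)[0]
    n_extra = len(labels) - 2 * len(minority_idx)   # copies needed for balance
    extra = np.random.choice(minority_idx, size=n_extra, replace=True)
    # Augmentations (e.g. CLAHE, inverse gamma) would be applied to images[extra].
    return (np.concatenate([images, images[extra]]),
            np.concatenate([labels, labels[extra]]))
```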

3.4.1 CLAHE

Contrast Limited Adaptive Histogram Equalization (CLAHE) increases contrast and enhances details of an image by stretching the range of pixel values. Regular Histogram Equalization stretches the pixel values to fill the range [0, 255]. For images with low contrast, the noise is amplified to the extent that the original features disappear. CLAHE reduces the noise by dividing the image into segments and then applying Histogram Equalization to each segment. A clip limit is used to penalize the amount of noise in each segment. The result is an enhancement of both contrast and details.

Previous work has used CLAHE for underwater image segmentation [37] and enhancement of fingerprints [38] and mammograms [39]. All papers show examples where CLAHE reinforces details that are not clearly visible in the original images. In mammography, the additional noise is still too harsh to allow reliable diagnostics because every detail in the tissue is important. In our data, the largest monotonous segment is the dark background. Noise in the background resembles solder paste, which contaminates the pick tool and is a common source of NOT OK picks. The augmentation is only applied to NOT OK images and will therefore not interfere with the component status. Variations in contrast are a natural property of the data sets because of the cameras with different calibrations.
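In OpenCV, CLAHE is available directly; a sketch follows (the file name is hypothetical, and the clip limit and grid size are typical defaults rather than the thesis settings):

```python
import cv2

gray = cv2.imread("component.png", cv2.IMREAD_GRAYSCALE)  # hypothetical image
# clipLimit caps noise amplification per segment; tileGridSize sets the segments.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)
```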


3.4.2 Inverse Gamma Correction

Inverse gamma correction is a non-linear operation to adjust the illumination in an image. The human eye is sensitive to variations of luminance in dark settings, so details in an image appear enhanced when the pixel values are increased in dark regions. The operation is used in digital photo and video editing to optimize the pixel values for the human eye, but it has also been used for image pre-processing for convolutional neural networks [40]. Inverse Gamma Correction transforms the pixel values according to a parameter gamma ($\gamma$). $\gamma = 1$ is equivalent to a linear transformation. For $\gamma < 1$, the luminance increases particularly in the darker regions, and $\gamma > 1$ results in darkened pixels especially in the bright regions, see Figure 3.4. Let $P_{in}$ denote the original pixel values and $P_{out}$ the augmented pixels; the operation is then defined as,

$$P_{out} = P_{in}^{\frac{1}{\gamma}}$$

Figure 3.4: Demonstration of Inverse Gamma Correction for different $\gamma$. The x-axis shows the original pixel values and the y-axis presents the corresponding outputs.
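For 8-bit images the transform can be applied through a 256-entry lookup table; a sketch follows (the lookup-table approach is our assumption, as the thesis does not specify the implementation):

```python
import cv2
import numpy as np

def inverse_gamma(image, gamma):
    """Apply P_out = P_in^(1/gamma) on normalized 8-bit pixel values."""
    scale = np.arange(256) / 255.0                            # normalize to [0, 1]
    table = (255.0 * scale ** (1.0 / gamma)).astype(np.uint8)
    return cv2.LUT(image, table)                              # per-pixel lookup
```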

3.4.3 New class distribution

After oversampling, both DS and DT consist of 50 % OK and 50 % NOT OK images; the exact final distributions are presented in Table 3.3. The method chapter, specifically Figure 4.2, describes the application of the augmentations, including CLAHE and inverse gamma correction.


Table 3.3: Class distribution in the oversampled training sets. The distribution of OK and NOT OK images is almost equal in each domain. Note that the label division for DT is based on the automatic vision system and is not completely accurate.

Domain   # Images   # Oversamples   OK              NOT OK
DS       412426     90169           207639 (50 %)   207787 (50 %)
DT       731100     181128          365576 (50 %)   365709 (50 %)
Total    1143711    190197          573215 (50 %)   573496 (50 %)


Methods

Associative Domain Adaptation [3] was chosen to tackle the problem of strictly limited access to labels in the target domain. The fact that the method only requires a single feed-forward network is an advantage compared to adversarial approaches when computational capacity is limited. It is of high value to effectively adapt to unlabeled data because the data in the assembly line change over time. In this chapter, we describe how Associative Domain Adaptation was applied to images of PCB components. Section 4.1 describes the pre-processing, Section 4.2 illustrates the network architectures, and Section 4.3 treats the hyper-parameters. The three main experiments are finally described in Section 4.5.

All implementations were written in Python 3.6 with Tensorflow [41] and the Tensorflow-Slim API [42]. The images were augmented with OpenCV 2 [43] for Python, stored on disk as TFRecords files, and then loaded into run-time memory in batches with Tensorflow's Dataset API. Dimensionality reductions with t-SNE used the open-source library Multicore t-SNE, available on GitHub [44]. Additional packages were Numpy for calculations, Matplotlib to generate graphs, and SKLearn Metrics to compute a PRC [45].

4.1 Input

Pre-processing was used to convert image data to inputs for a CNN. The image size must equal the dimensions of the network's input layer; in our case the images were resized and slightly cropped. This section describes the pre-processing that was used for all experiments.

4.1.1 Cropping

All images were cropped to be square. Most images were already in a square shape; occasionally the width and height differed by a few pixels.

4.1.2 Scaling

CNNs only accept inputs of fixed size, so all images were resized with Linear Interpolation from OpenCV 2. [43]

4.1.3 Rotations

The components in the raw data sets can either stand vertically or lie horizontally, depending on how the component should be placed on the PCB. Since the CNNs are not aware of the intended angle, the data set will contain contradictions. Rotations in the pre-processing make use of the goal angle from the PnP machine's vision system. All samples with horizontal goal angles are rotated by 90 degrees with OpenCV 2 [43]. The rotations showed promising results both in our experiments and in previous work by Kolibacz [15], so rotations were used for all oversampled data sets. Other types of rotations for random minority oversampling augmentations are described in Section 4.5.1.
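A sketch of the alignment step; the convention for detecting a horizontal goal angle is our assumption:

```python
import cv2

def align_component(image, goal_angle_degrees):
    """Rotate horizontally-oriented picks by 90 degrees so that all
    components share the same orientation."""
    if goal_angle_degrees % 180 == 90:   # assumed convention for "horizontal"
        return cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
    return image
```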

4.1.4 Image Normalization

The pixel values of images in jpg or png format span the range [0, 255]. In our data set, the images of Not Picked components have pixel values close to zero, while other images that show components have brighter areas. Large variations of input values can destabilize the gradient descent and cause vanishing or exploding gradients. To avoid this issue, all inputs were normalized to zero mean within the interval [−1, 1].
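A one-line sketch of the mapping from 8-bit pixels to the normalized input range:

```python
import numpy as np

def normalize(image):
    """Map 8-bit pixel values [0, 255] to zero-centered floats in [-1, 1]."""
    return image.astype(np.float32) / 127.5 - 1.0
```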


Figure 4.1: CNN architectures. (a) shows the SVHN Model, with nine convolutional layers (Conv) and one fully connected layer (FC). (b) shows the shallower MNIST Model architecture [3]. The convolution kernels had dimension 3x3 and the max-pooling filters had kernel 2x2. Some of the experiments added Batch Normalization (BN), marked in green.

4.2 Network Architectures

Haeusser et al. [3] used the CNN architecture we refer to as the SVHN Model, and they also provided the shallower MNIST Model. The SVHN Model shows state-of-the-art performance on various benchmark data sets, but our data is also reminiscent of the MNIST data set, which consists of grayscale images with a black background and a light object in the center. This work tests both models.

All hidden units used the ELU activation function with α = 1.0, as suggested by Clevert, Unterthiner, and Hochreiter [24], and the classification outputs used Softmax activation.

4.2.1 SVHN Models

The first model had nine convolutional layers with 3x3 kernels and stride 1. After every third convolution layer, the inputs were down-sampled with max-pooling with 2x2 filters and stride 2. The final embeddings were generated in a fully connected layer of 128 units. The following Softmax layer outputs a classification vector with one element per class [3]. The experiments that used batch normalization applied it after each convolutional layer, see Figure 4.1a.
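A minimal TF-Slim sketch of such an architecture is given below. The filter depths (32, 64, 128 per block) are an assumption for illustration; the exact widths are defined by Haeusser et al. [3].

    import tensorflow as tf
    import tensorflow.contrib.slim as slim

    def svhn_model(images, num_classes=2, emb_size=128):
        # Sketch of the SVHN Model: 3x3 convolutions in three blocks of
        # three, max-pooling after each block, then a 128-unit embedding.
        with slim.arg_scope([slim.conv2d], kernel_size=[3, 3], stride=1,
                            activation_fn=tf.nn.elu):
            net = images
            for depth in [32, 64, 128]:  # assumed filter depths per block
                for _ in range(3):
                    net = slim.conv2d(net, depth)
                net = slim.max_pool2d(net, [2, 2], stride=2)
        net = slim.flatten(net)
        embeddings = slim.fully_connected(net, emb_size, activation_fn=tf.nn.elu)
        logits = slim.fully_connected(embeddings, num_classes, activation_fn=None)
        return embeddings, logits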

4.2.2 MNIST Models

In case the SVHN Model is unnecessarily deep, the shallower MNIST Model was also tested on the data sets, see Figure 4.1b. The only difference was that the MNIST Model had six convolutional layers.

4.3 Default Hyper-Parameters and Intervals

The number of hyper-parameters to tune had to be restricted due to limited computational power and time. The first restriction was to keep the two architectures from the original applications of Associative Domain Adaptation [3]. This section addresses the remaining hyper-parameters; their default values and tuning intervals are listed in Table 4.1.

The image size of 56x56 pixels was randomly selected in the interval [32, 80]. The lower bound comes from the image size used in the original paper, and the upper bound was selected because the GPUs could not handle larger image sizes with a reasonable batch size for the two network architectures.

It is important that all classes are represented in each batch to make it possible to generate correct mappings [3]. When the SVHN Model was fed 56x56 pixel images, the system could handle a maximum batch size of 1400, with 700 samples from DT and 350 samples from each class in DS. From here on, the batch setup is summarized as (DT:DS), in this case (700:2x350). For the shallower MNIST Model, the GPUs could handle batch size 2000, which also was used by Haeusser et al. [3]. Thus, the random search interval was between 400 (200:2x100) and 2000 (1000:2x500).
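As an illustration, a (700:2x350) batch could be assembled as in the following sketch, where the pool names and the sampling strategy are hypothetical:

    import numpy as np

    def assemble_batch(target_pool, source_ok, source_not_ok,
                       n_target=700, n_per_class=350):
        # One batch: unlabeled target samples plus an equal number of
        # labeled source samples per class, so that both classes are
        # represented in every batch.
        pick = lambda pool, n: pool[np.random.choice(len(pool), n, replace=False)]
        return (pick(target_pool, n_target),
                pick(source_ok, n_per_class),
                pick(source_not_ok, n_per_class))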


Table 4.1: Default values for the hyper-parameters, used in all experiments unless stated otherwise. Epochs are calculated on the target data set. The rightmost column shows the probability distributions for the hyper-parameters that were optimized with random search (RS) in Section 4.5.3. Uniform distribution is denoted U(min, max) and log-uniform distribution Ulog(min, max).

Parameter                            Notation    Default   RS
Image size (pixels)                  -           56x56     -
Batch size                           -           1400      U(400, 2000)
Batch normalization momentum         m           0.99      U(0.95, 0.999)
Early stopping (epochs)              e           7         10
Maximum learning rate                -           1e-4      Ulog(1e-6, 1e-4)
Minimum learning rate                -           1e-6      Ulog(1e-8, 1e-6)
Learning rate decay                  LRD         0.33      U(0.1, 0.5)
Learning rate decay delay (epochs)   LRD delay   8.6       U(5, 10)
Domain adaptation delay (epochs)     DA delay    0.5       U(0.2, 5)
Walker weight                        β1          1.0       -
Visit weight                         β2          0.2       U(0.1, 1.0)
L2 multiplier                        λL2         1e-4      -
Weight decay                         λWD         1e-4      Ulog(1e-7, 1e-1)

Batch Normalization was applied to both network architectures because the ELU activations run the risk of exploding gradients. Batch normalization can stabilize the gradients in the deeper layers, as it normalizes the output of each convolution layer. The implementation in the TF-Slim API updates the moving averages for each batch according to the momentum m in Equation 2.11.
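A sketch of how such a layer can be declared with TF-Slim, where the momentum m of Equation 2.11 maps to the decay argument:

    import tensorflow.contrib.slim as slim

    def bn_layer(net, is_training, momentum=0.99):
        # Each batch updates the stored statistics as an exponential
        # moving average: moving_mean <- m * moving_mean + (1 - m) * batch_mean
        return slim.batch_norm(net, decay=momentum, is_training=is_training)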

In experiments with early stopping, the training loop let the network train until the F1 score on the target validation set had not improved for e epochs. The default value e = 7 was judged sufficient, given that some networks found an optimum in fewer than 7 epochs; see the results in Section 5.2.
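In pseudocode form (the callback names are ours), the early-stopping loop amounts to:

    def train_with_early_stopping(train_one_epoch, validation_f1, patience=7):
        # Stop once the target-validation F1 score has not improved
        # for `patience` consecutive epochs.
        best_f1, epochs_since_best = 0.0, 0
        while epochs_since_best < patience:
            train_one_epoch()
            f1 = validation_f1()
            if f1 > best_f1:
                best_f1, epochs_since_best = f1, 0
            else:
                epochs_since_best += 1
        return best_f1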

The learning rate was adjusted with step-wise exponential decay [3]. In Table 4.1, LRD denotes the learning rate's decay factor and LRD delay the interval between decay steps. In the original paper, the learning rate delay was 9000 iterations, which corresponds to ∼ 8.6 epochs for our oversampled target data set and batch size 1400.
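With the defaults from Table 4.1, the schedule could be expressed in Tensorflow as follows; clipping at the minimum learning rate with tf.maximum is our assumption about how the lower bound was enforced:

    import tensorflow as tf

    global_step = tf.train.get_or_create_global_step()
    learning_rate = tf.maximum(
        tf.train.exponential_decay(
            learning_rate=1e-4,   # maximum learning rate
            global_step=global_step,
            decay_steps=9000,     # LRD delay, in batch iterations
            decay_rate=0.33,      # LRD
            staircase=True),      # step-wise rather than continuous decay
        1e-6)                     # minimum learning rate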

Before a network starts to adapt to DT, it trains on DS in a supervised setting. The purpose is to let the network learn some classification before also creating mappings [3]. It trained only on DS for the number of epochs specified by the domain adaptation delay (DA delay). This was controlled by initial inactivation of the association loss (Eq. 2.2), where α = 0. The default value for the DA delay originates from the original paper, which used 500 batch iterations; this corresponds to ∼ 0.5 epochs with the default batch settings.

Once the DA delay had expired, the loss was adjusted with α = 1 and the network started training with both the classification loss (Eq. 2.3) and the association loss (Eq. 2.4). The latter introduces two new weights, one for the walker loss (β1) and another for the visit loss (β2). The default values are the same as in the original paper, but the visit loss weight can be adjusted depending on the different class distributions in the source and target domains [3]. We used Random Search to explore appropriate values.
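Putting the pieces together, the combined objective can be sketched as follows; the argument names are ours and the individual loss terms are assumed to be computed elsewhere:

    def total_loss(classification_loss, walker_loss, visit_loss,
                   alpha, beta1=1.0, beta2=0.2):
        # alpha is 0 during the domain adaptation delay and 1 afterwards;
        # beta1 and beta2 weight the walker and visit losses that make up
        # the association loss.
        return classification_loss + alpha * (beta1 * walker_loss
                                              + beta2 * visit_loss)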

The experiments with the Adam optimizer also used L2 regularization, where λL2 is the regularization weight in Equation 2.15. The experiments with the AdamW optimizer used weight decay with the regularization weight λWD (Equation 2.16). Both optimizers were available in Tensorflow.
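In Tensorflow 1.x these correspond roughly to the following constructors; this is a sketch with a fixed learning rate, not the thesis configuration:

    import tensorflow as tf

    # Adam, with lambda_L2 times the sum of squared weights added to the
    # loss separately (Eq. 2.15):
    adam = tf.train.AdamOptimizer(learning_rate=1e-4)

    # AdamW, with decoupled weight decay lambda_WD applied inside the
    # update step (Eq. 2.16):
    adam_w = tf.contrib.opt.AdamWOptimizer(weight_decay=1e-4,
                                           learning_rate=1e-4)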

4.4 Measure the Effect of Domain Adaptation

The impact of domain adaptation is often evaluated by comparison to two supervised networks, one trained on the source domain and one trained on the target domain [3][4][5][6]. The performance of the domain adapted network should be higher than that of the network trained only on DS. Ideally, the domain adapted networks match the performance of networks that have trained only on DT in a supervised setting. The performance was measured with F1 score and PRC because of the skewed class distributions in the test sets [16].
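Both measures are available in SKLearn's metrics module; a toy example with made-up scores:

    import numpy as np
    from sklearn.metrics import f1_score, precision_recall_curve

    # y_score: the network's Softmax output for the NOT OK class (toy values).
    y_true = np.array([0, 0, 1, 1])
    y_score = np.array([0.1, 0.4, 0.35, 0.8])

    f1 = f1_score(y_true, y_score > 0.5)                 # F1 at a fixed threshold
    precision, recall, _ = precision_recall_curve(y_true, y_score)  # PRC points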

4.4.1 Baseline

If domain adaptation manages to learn features in the target domain, it will outperform supervised networks that have only trained on the source domain. This was evaluated by comparing the domain adapted networks (CNNDA) to a baseline network with the same hyper-parameters, denoted CNNsource.

References
