Degree Project in Computer Science and Engineering, Second Cycle, 30 Credits
Stockholm, Sweden 2019
Learning from noisy labels by importance reweighting:
a deep learning approach
TONGTONG FANG
Abstract
Noisy labels can cause severe degradation of classification performance. For deep neural networks in particular, noisy labels can be memorized and lead to poor generalization. Recently, label-noise-robust deep learning has outperformed traditional shallow learning approaches in handling complex input data without prior knowledge of the label noise generation process. Learning from noisy labels by importance reweighting is well studied. Existing deep learning work in this line failed to provide a reasonable importance reweighting criterion and thus achieved unsatisfactory experimental performance. Targeting this knowledge gap and inspired by domain adaptation, we propose a novel label-noise-robust deep learning approach based on importance reweighting. Noisy labeled training examples are weighted by minimizing the maximum mean discrepancy between the loss distributions of noisy labeled and clean labeled data. In experiments, the proposed approach outperforms other baselines. The results show a vast research potential in applying domain adaptation to the label noise problem by bridging the two areas. Moreover, the proposed approach potentially motivates other interesting problems in domain adaptation by enabling importance reweighting to be used in deep learning.
Keywords: noisy label, importance reweighting, deep learning, do-
main adaptation
Sammanfattning
Incorrect annotations can degrade classification performance. For deep networks in particular, this can lead to poor generalization. Recently, noise-robust deep learning has outperformed other learning methods in handling complex input data. Existing deep learning results, however, fail to provide reasonable reweighting criteria. To address this knowledge gap, and inspired by domain adaptation, we propose a new robust deep learning method that uses reweighting. The reweighting is done by minimizing the maximum mean discrepancy between the loss distributions of mislabeled and correctly labeled data. In experiments, the proposed method outperforms other methods. The results show great research potential in applying domain adaptation. In addition, the proposed method motivates the investigation of other interesting problems in domain adaptation by enabling smart reweighting.
Keywords: annotated data, reweighting, deep learning, domain adaptation
Acknowledgement
This work would not have been possible without the aid and support of Prof. Masashi Sugiyama, the director of RIKEN AIP, who accepted me to work in his team at RIKEN AIP, gave me the freedom to explore my research interests, and provided useful critiques of my work. I would also like to express my great appreciation to Dr. Gang Niu. I learned a lot from his insightful guidance, encyclopedic knowledge, and his wisdom towards life. I am particularly grateful for the frequent discussions with Nan Lu, Miao Xu, Bo Han, Feng Liu, Wenkai Xu and Xuyang Zhao at AIP, and for the help from Yifan Zhang and Tianyi Zhang at the University of Tokyo in developing this work.
Besides, I would like to thank my KTH supervisor Prof. Henrik Boström for his valuable and constructive suggestions during the development of this research work. I will not forget his very elaborate comments on my thesis, unveiling the secrets of how to write a thesis properly. I also want to offer my special thanks to Prof. Magnus Boman. He was not only my course instructor, thesis examiner, and research supervisor, but also the person who witnessed how I gradually grew up at KTH and generously supported my career goals.
Finally, nobody has been more important in the pursuit of my career goals than my parents. I wish to thank Guoguo and Huairen for their selfless love. They shared my every feeling in developing this work: no matter how cheerful or depressed I was, they were always there beside me in my heart.
List of Figures
1 Research process flow of this work.
2 Basic structures of neural networks.
3 The architecture of LeNet-5.
4 A residual building block.
5 The structure of ResNet-32.
6 Examples from MNIST and CIFAR.
7 Label noise transition matrix.
8 Architecture of the proposed approach.
9 Results on MNIST with 0.2 symmetric label noise.
10 Training accuracy on MNIST with 0.2 symmetric label noise.
11 Results on MNIST with 0.3 pairflip label noise.
12 Training accuracy on MNIST with 0.3 pairflip label noise.
13 Results on MNIST with 0.45 pairflip label noise.
14 Training accuracy on MNIST with 0.45 pairflip label noise.
15 Results on MNIST with 0.5 symmetric label noise.
16 Training accuracy on MNIST with 0.5 symmetric label noise.
17 Results on CIFAR-10 with 0.2 symmetric label noise.
18 Training accuracy on CIFAR-10 with 0.2 symmetric label noise.
19 Training loss on CIFAR-10 with 0.2 symmetric label noise.
20 Results on CIFAR-10 with 0.3 pairflip label noise.
21 Training accuracy on CIFAR-10 with 0.3 pairflip label noise.
22 Training loss on CIFAR-10 with 0.3 pairflip label noise.
23 Results on CIFAR-10 with 0.45 pairflip label noise.
24 Training accuracy on CIFAR-10 with 0.45 pairflip label noise.
25 Training loss on CIFAR-10 with 0.45 pairflip label noise.
26 Results on CIFAR-10 with 0.5 symmetric label noise.
27 Training accuracy on CIFAR-10 with 0.5 symmetric label noise.
28 Training loss on CIFAR-10 with 0.5 symmetric label noise.
29 Histogram of the learned weight distribution on MNIST with 0.2 symmetric label noise.
30 Histogram of the learned weight distribution on MNIST with 0.3 pairflip label noise.
31 Histogram of the learned weight distribution on MNIST with 0.45 pairflip label noise.
32 Histogram of the learned weight distribution on MNIST with 0.5 symmetric label noise.
33 Histogram of the learned weight distribution on CIFAR-10 with 0.2 symmetric label noise.
34 Histogram of the learned weight distribution on CIFAR-10 with 0.3 pairflip label noise.
35 Histogram of the learned weight distribution on CIFAR-10 with 0.45 pairflip label noise.
36 Histogram of the learned weight distribution on CIFAR-10 with 0.5 symmetric label noise.
37 The effect of clean labeled data size on test accuracy.
38 Histogram of weight distribution learned by Reweight (without rescaling) on MNIST with symmetric label noise.
39 Histogram of weight distribution learned by Reweight (without rescaling) on MNIST with pairflip label noise.
40 Histogram of weight distribution learned by Reweight (without rescaling) on CIFAR-10 with symmetric label noise.
41 Histogram of weight distribution learned by Reweight (without rescaling) on CIFAR-10 with pairflip label noise.
List of Tables
1 Examples of f-divergence.
2 Summary of the datasets and base models used in experiments.
3 Comparison of label noise problem and domain adaptation.
4 Average test accuracy ± standard error on MNIST over the last ten epochs.
5 Average test accuracy ± standard deviation on CIFAR-10 over the last ten epochs.
6 Effect of clean labeled data size: average test accuracy of the proposed approach on CIFAR-10 with 0.2 symmetric label noise over the last ten epochs.
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goal
1.5 Research Methodology
1.6 Ethics, Benefits and Sustainability
1.7 Outline
2 Extended background
2.1 Learning from noisy labels
2.2 Domain adaptation
2.3 Distribution divergence measures
2.3.1 Maximum mean discrepancy (MMD)
2.3.2 Other alternative divergences
2.3.3 Divergence measures for importance reweighting
2.4 Deep learning basics
2.4.1 Neural network architectures
2.4.2 Optimization for training networks
3 Methodology
3.1 Choice of method
3.1.1 Learning from noisy labels
3.1.2 Choice of divergence measure
3.1.3 Hyperparameter tuning
3.2 Experimental setup
3.2.1 Datasets and base models
3.2.2 Label noise transition matrix and noise rate
3.2.3 Baselines
3.2.4 Experimental details
3.3 Performance evaluation
3.3.1 Evaluation metrics
3.3.2 Evaluation procedure
3.3.3 Success criteria
4 Proposed approach
4.1 Problem settings
4.2 Bridging label noise problem and domain adaptation
4.3 Importance reweighting criterion
4.4 Deep architecture implementation
5 Results
5.1 Performance by label noise type and noise rate
5.1.1 Results on MNIST
5.1.2 Results on CIFAR-10
5.2 Distribution of the learned weights
5.2.1 The case of MNIST
5.2.2 The case of CIFAR-10
5.3 Size of clean labeled dataset
5.4 Discussion
6 Conclusion
References
A Weight distribution without rescaling
1 Introduction
1.1 Background
In the real world, data with noisy labels/corrupted labels (i.e. a mixture of correctly labeled and wrongly labeled data) is ubiquitous for many reasons, such as crowdsourcing [1, 2] or online queries [3, 4]. Noisy labels can lead to a deterioration of performance in classification problems [5]. The situation is even worse when deep neural networks are used for classification, because the networks can easily memorize these noisy labels and thus generalize poorly [6].
As deep learning has grown popular, learning from noisy labels with deep learning has become a hot research topic in machine learning and has resulted in many approaches to tackle this problem. Generally, such approaches try to make the training of neural networks robust even under extreme label noise.
The problem of learning from noisy labels has been studied extensively since 1988 [7]. Scott et al. proposed a general framework for classification with noisy labels [8], named the mutually contaminated (MC) distributions framework in [9]. Within this framework, Menon et al. [9] studied the problem of binary noisy labels using class-probability estimation and provided fruitful theoretical insights into solving the noisy label problem. Patrini et al. [10] further extended this work to multi-class classification and invented forward and backward loss correction approaches for the noisy label problem. Their approach works well when the noise transition matrix, which indicates how the noisy labels are generated, is known. However, this matrix is usually unknown in practice; when it is unknown, they provide an estimation, but the estimation did not work well in their experiments [10]. Liu et al. [11] also tackled the label noise classification problem by importance reweighting. They formalized the problem and proved a consistency guarantee: for any surrogate loss function, the classifier learned with importance reweighting converges to the optimal classifier of the noise-free case. Still, their performance relies heavily on an accurate estimate of the noise transition matrix, which remains an open problem.
Recently, deep learning has empowered label noise research to handle more complex input data with high label noise rates. Deep learning based approaches have outperformed the state-of-the-art methods in learning from noisy labels and, most importantly, they are free of estimating the label noise transition matrix. Usually, in a deep learning based approach, a sample selection criterion is designed to select the examples that most probably have clean labels and to let these selected examples contribute more to training the deep neural networks. Examples are Co-teaching [12], MentorNet [13] and Reweight [14]. Co-teaching [12] exchanges the examples with small training loss between two neural networks to reduce the training error, where examples with small training loss are assumed to be clean labeled. MentorNet and Reweight assume the availability of a small amount of clean labeled data. MentorNet maintains two networks, MentorNet and StudentNet, simultaneously, letting MentorNet train on the clean labeled data to guide StudentNet to train first on the examples most probably carrying clean labels. Unlike Co-teaching and MentorNet, which simply select or discard examples, Reweight tries to use as many examples as possible in training, where examples with clean labels are more likely to be assigned a larger weight and vice versa. In [14], the examples are reweighted according to their sensitivities measured via influence functions, i.e. how much the examples change under a small local perturbation.
1.2 Problem
Traditionally, learning from noisy labels by importance reweighting is well studied but has drawbacks: it requires knowledge of the label noise generation process and has difficulties dealing with complex input data. Extending this approach with deep learning is worth studying because deep learning based approaches do not suffer from these limitations. However, the representative work in this line of research, Reweight [14], fails to provide a reasonable example reweighting criterion. As previously mentioned, Reweight assigns weights to examples by their sensitivity to a small perturbation, explained as matching the gradient directions of the clean labeled and noisy labeled data. But matching only this local change of the loss function is not enough to keep training robust to noisy labels; in other words, they wrongly take sensitivity as the importance of examples. Moreover, in [14], a large amount of the training data is not used in training because it is assigned a weight of 0. These failures lead to poor experimental performance: in symmetric label noise experiments, their performance is similar to that obtained with random weights.
In short, learning from noisy labels by importance reweighting using deep learning is promising to study, yet existing work fails to propose a reasonable reweighting criterion. This indicates the need for a novel criterion that works well for the label noise problem and yields good experimental performance.
1.3 Purpose
Motivated by the failures of Reweight, we aim to propose an importance reweighting approach for the label noise problem with a reasonable reweighting criterion. We therefore pose the following research question:
How can we design an effective importance reweighting criterion that yields a novel approach outperforming the current importance reweighting based label noise approaches?
In this work, we first need to design an effective importance reweighting criterion that behaves as intuitively expected. Secondly, this reweighting criterion should be integrated into the proposed approach. The novel approach is expected to keep the training of neural networks robust in the presence of noisy labels, i.e. the performance of the neural networks should not be significantly devastated by the introduction of noisy labels.
1.4 Goal
The goal of this work is to motivate research on learning from noisy labels by proposing a novel approach. The work would also potentially encourage more research on using domain adaptation techniques to solve the label noise problem by bridging the two areas. Moreover, we aim to boost the growth of industries where noisy labels are pervasive, such as crowdsourcing platforms and recommendation systems. By lessening the effect of label noise, our work helps these industries better analyze user behavior and improve their services and products.
1.5 Research Methodology
Figure 1 shows the research process flow of this work. As data-driven research, we adopt a quantitative research method in which a model is trained from data for specific tasks. This work is fundamental research, in which a new method B is proposed for a defined research problem by challenging an existing method A. The performance of our proposed method is evaluated by training and testing on benchmark datasets in experiments. At the end of the experiments, we can confirm which method, A or B, better solves the research problem according to their experimental performance. Therefore, we use an experimental research strategy in this work.
Figure 1: Research process flow of this work.
1.6 Ethics, Benefits and Sustainability
This work does not directly involve ethical issues. All datasets used in the experiments are open-source datasets commonly used in machine learning research, and no participant other than the author was needed to carry out the work. However, our work may indirectly involve personal data when applied in real-world applications. For example, this work could potentially enhance the growth of label-noise-intensive industries, e.g. recommendation systems, and those industries may use personal data in different ways, such as analyzing user behavior.
Regarding benefits, our work would benefit both the research community and industry. On the one hand, for the research community, this work is the first to connect the study of learning from noisy labels and domain adaptation, and it points out a huge research potential in this interdisciplinary area. Moreover, the study provides a novel importance reweighting approach to the problem of learning from noisy labels. On the other hand, industry could gain a robust algorithm for settings where noisy labels are numerous. For example, crowdsourcing platforms and recommendation systems are two main sources of numerous noisy labels. Our work can potentially encourage their growth by solving label noise problems. The growth of these companies would further provide new job opportunities and contribute to overall economic growth. This relates to Goal 8: Decent work and economic growth, defined in the framework of the sustainable development goals by the United Nations¹. Besides, since the work is closely linked to industry benefits, it also aligns with Goal 9: Industries, Innovation and Infrastructure. All of this indicates that our work conforms to the universal call of building a sustainable planet.
1.7 Outline
The thesis is organized as follows:
• Extended background provides a literature review of learning from noisy labels and domain adaptation, together with background knowledge on distribution divergence measures and deep learning basics.
• Methodology elaborates how we choose our method to answer the research problem. It also includes the experimental setup and the evaluation metrics used in the experiments.
• Proposed approach presents the problem settings and how we approach the targeted research problem, including how to bridge the label noise problem and DA, our proposed reweighting criterion, and the deep architecture implementation.
• Results shows our experimental results, their interpretations, and the necessary discussion.
• Conclusion concludes the main part of the work.
¹ https://www.undp.org/content/undp/en/home/sustainable-development-goals.html
2 Extended background
2.1 Learning from noisy labels
Traditional supervised learning usually requires domain experts to provide annotations as data labels. With the emergence of deep learning, the demand for large amounts of labeled data has greatly increased, so manual data labeling has become unrealistic for the wide range of deep learning applications. For quick and cheap labeling, online queries and crowdsourcing [2, 1] are frequently used to collect data labels: the former assigns labels according to the user's query keywords from search engines, and the latter acquires labels from a large number of annotators in a distributed manner. However, both lead to a considerable amount of data with noisy labels (i.e. wrongly labeled data). For example, in crowdsourcing, aggregating labels from different annotators is still a problem, and the quality of the annotators is hard to control. Some annotations even come from spammers [2], i.e. annotators who assign random labels because they lack the knowledge required for labeling. Noisy labels severely weaken the performance of neural networks [6] because neural networks can easily memorize the noisy labels and generalize poorly. Therefore, research on learning from noisy labels is necessary and important to ensure the robust training of deep neural networks.
The research on learning from noisy labels has a long history, dating back to 1988 [7]. Later, Menon et al. [9] studied the theory of label corruption processes within the mutually contaminated (MC) distribution framework [8] using class-probability estimation. MC learning [9, 8] is a general framework for learning from noisy labels, where samples are observed from a label-corrupted distribution $\tilde{D}$ under some unknown noise parameter $\alpha$ that generates the noisy labels. Class-conditional label noise (CCN learning) and positive-unlabeled learning (PU learning) are two special cases of MC learning. The work [9] by Menon et al. exposed several important theoretical conclusions. For example, it proved that optimizing the balanced error (BER) or the area under the ROC curve (AUC) on clean labeled data is equivalent to optimizing it on the corrupted data. This implies that the noise parameter $\alpha$ need not be known for learning from noisy labels, and it guarantees the possibility of learning from noisy labels alone, without clean labels. Moreover, it reveals that $\alpha$ can be estimated by a class-probability estimator under some assumptions.
In 2017, Patrini et al. [10] extended [9] from the binary to the multi-class setting and proposed forward and backward loss correction for robust learning from label noise, which has become one of the main approaches in label noise research. According to [10], if the noise transition matrix T is known and non-singular, the backward-corrected loss $\ell_B$ is

\ell_B(\hat{p}(y|x)) = T^{-1}\, \ell(\hat{p}(y|x))   (1)

where $\hat{p}(y|x)$ is a column vector of softmax outputs approximating the class-conditional probability $p(y|x)$. By applying $T^{-1}$, the corrected loss becomes a linear combination of the original loss whose coefficients account for the probability of each true label $y$ given the observed corrupted label. In contrast to backward correction, forward loss correction corrects the model prediction instead:

\ell_{F,\phi}(h(x)) = \ell(T^{\top} \phi^{-1}(h(x)))   (2)

where $\ell_{F,\phi}$ is the forward-corrected loss, $\phi$ is a link function, and $h(x)$ is the predicted label. When T is unknown, [10] provides an approach to estimate it, but the estimate is often inaccurate.
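As a toy illustration, the linear recombination in Equation (1) can be checked numerically. The transition matrix and softmax values below are illustrative numbers of our own, not taken from [10]:

```python
import numpy as np

# Backward loss correction, Equation (1): with a known, non-singular noise
# transition matrix T, the per-class losses are linearly recombined by T^{-1}.
T = np.array([[0.8, 0.2],     # T[i, j] = P(observed label j | true label i)
              [0.2, 0.8]])

p_hat = np.array([0.7, 0.3])              # softmax output \hat p(y|x)
loss = -np.log(p_hat)                     # per-class cross-entropy losses
loss_backward = np.linalg.inv(T) @ loss   # corrected losses l_B = T^{-1} l

# Applying T to the corrected losses recovers the original losses, which is
# the sense in which the correction undoes the mixing induced by the noise.
```

Per [10], minimizing the backward-corrected loss on noisy labels is, in expectation over the noise process, equivalent to minimizing the original loss on clean labels.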
The label noise problem can also be approached via importance sampling. In importance sampling [15], we want to approximate integrals of the form

\mathbb{E}[f] = \int f(x)\, p(x)\, dx.   (3)

The idea is to use samples from a proposal distribution q(x) to approximate the expectation over the exact distribution p(x); usually the density q(x) is simpler than p(x):

\mathbb{E}[f] = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{m} \sum_{i=1}^{m} w_i f(x_i)   (4)

In Equation (4), m is the total number of samples drawn from the proposal distribution and the importance weight is $w_i = p(x_i)/q(x_i)$. It is clear from $w_i$ that the weight is large if the sample $x_i$ is more likely under p(x) than under q(x). As a result, by assigning importance weights to the samples, the distribution of samples from q(x) is forced to match the target distribution p(x).
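The Monte Carlo approximation in Equation (4) can be sketched in a few lines of NumPy. The target, proposal and test function below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x): standard normal N(0, 1); proposal q(x): wider normal N(0, 2^2).
# We estimate E_p[f(x)] with f(x) = x^2, whose true value under p is 1.
def p_pdf(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)

m = 100_000
x = rng.normal(0.0, 2.0, size=m)   # samples from the proposal q
w = p_pdf(x) / q_pdf(x)            # importance weights w_i = p(x_i)/q(x_i)
estimate = np.mean(w * x**2)       # (1/m) sum_i w_i f(x_i), Equation (4)
# `estimate` is close to the true expectation E_p[x^2] = 1.
```

Samples that are more plausible under p than under q receive weights above 1, exactly as the discussion of $w_i$ above describes.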
Let $R_{L,D}(f)$ denote the expected risk of a classifier f with respect to the distribution D and loss L. In classification, f is learned to minimize the expected risk. Liu et al. [11] formulated the label noise classification problem via importance reweighting as follows:

R_{L,D}(f) = \mathbb{E}_{(x,y)\sim D}\left[ L(f_\theta(x), y) \right]
           = \mathbb{E}_{(x,\tilde{y})\sim \tilde{D}}\left[ \frac{P_D(x, y)}{P_{\tilde{D}}(x, \tilde{y})}\, L(f_\theta(x), \tilde{y}) \right]
           = \mathbb{E}_{(x,\tilde{y})\sim \tilde{D}}\left[ \beta(x, \tilde{y})\, L(f_\theta(x), \tilde{y}) \right]
           = R_{\beta,L,\tilde{D}}(f).   (5)

Since in the noisy label problem $P_D(x) = P_{\tilde{D}}(x)$, the weights are computed as

\beta(x, \tilde{y}) = \frac{P_D(x, y)}{P_{\tilde{D}}(x, \tilde{y})} = \frac{P_D(y|x)\, P_D(x)}{P_{\tilde{D}}(\tilde{y}|x)\, P_{\tilde{D}}(x)} = \frac{P_D(y|x)}{P_{\tilde{D}}(\tilde{y}|x)}.   (6)

They also proved that the classifier learned by this importance reweighting approach converges to the optimal classifier obtained in the label-noise-free case. Still, their approach relies heavily on an accurate estimate of the label noise transition matrix, which remains a hard open problem.
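To make Equation (6) concrete, the following sketch computes β for binary labels under symmetric noise with a known flip rate. The posteriors and the flip rate are illustrative assumptions of ours; estimating them accurately is exactly the hard part noted above:

```python
import numpy as np

# beta(x, y~) = P_D(y~|x) / P_D~(y~|x) for binary labels under symmetric
# label noise with flip rate rho: P_D~(y~|x) = (1-rho) P_D(y~|x) + rho P_D(1-y~|x).
def beta_weights(eta_clean, y_noisy, rho):
    p_clean = np.where(y_noisy == 1, eta_clean, 1 - eta_clean)  # P_D(y~|x)
    p_noisy = (1 - rho) * p_clean + rho * (1 - p_clean)         # P_D~(y~|x)
    return p_clean / p_noisy

eta = np.array([0.9, 0.8, 0.1])   # assumed clean posteriors P_D(y=1|x)
y_tilde = np.array([1, 0, 1])     # observed (possibly flipped) labels
w = beta_weights(eta, y_tilde, rho=0.2)
# Examples whose observed label agrees with the likely clean label get a
# weight above 1; likely-flipped examples get a weight below 1.
```

This matches the intuition behind Equation (6): the weight up-weights examples whose observed label is plausible under the clean distribution and down-weights the rest.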
Recently, deep learning based approaches have achieved superior performance in label noise research. Although these approaches are less transparent and less theoretically grounded, they are free of estimating the noise transition matrix T. To keep training robust to noisy labels, these approaches select the data with clean labels based on some criterion to guide the training procedure. Some representative works are MentorNet [13], Co-teaching [12] and the Reweight [14] architecture. The MentorNet architecture [13] consists of two neural networks: a MentorNet and a StudentNet. During training, MentorNet learns a curriculum on a small clean labeled dataset and then uses this curriculum to guide the StudentNet's training on the corrupted labeled data. The learned curriculum can select the samples that are most probably correctly labeled. At every mini-batch step, MentorNet updates its curriculum using the features (loss, change of loss along moving averages, label and training progress) provided by the StudentNet.
Unlike MentorNet, where the error flow accumulates during training, Co-teaching [12] handles noisy labels by training two neural networks simultaneously and reduces the training error by exchanging small-loss data between the two networks. This work assumes that data with small training loss are more likely to be correctly labeled, so by transferring these presumably clean labeled data to the peer network, the error can be filtered out. This approach outperforms MentorNet on several benchmark datasets.
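The small-loss selection idea behind Co-teaching can be sketched as follows. The losses, the forget rate and the function names are our own illustrative choices, not from the paper's implementation:

```python
import numpy as np

# Each network ranks examples by its own per-example loss and keeps the
# (1 - forget_rate) fraction with the smallest loss, assumed to be clean.
def select_small_loss(losses, forget_rate):
    n_keep = int(len(losses) * (1 - forget_rate))
    return np.argsort(losses)[:n_keep]

loss_net_a = np.array([0.1, 2.5, 0.3, 3.0, 0.2, 0.15])
loss_net_b = np.array([0.2, 2.0, 0.25, 2.8, 3.1, 0.1])

# Each network hands its small-loss selection to its peer, which then
# updates only on the peer-selected (presumably clean) subset.
idx_for_b = select_small_loss(loss_net_a, forget_rate=0.5)
idx_for_a = select_small_loss(loss_net_b, forget_rate=0.5)
```

Because the two networks disagree on which examples look clean, the cross-exchange prevents either network's selection errors from reinforcing themselves, which is the stated advantage over MentorNet's accumulated error flow.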
Reweight [14] has attracted much attention in learning with noisy labels. In the setting of [14], there is a small set of correctly labeled data and a massive dataset with noisy labels, e.g. in a ratio of roughly 1/50, and both the data and the noisy labels are balanced across classes. The examples are reweighted according to a certain criterion to reduce the negative effect of training on noisy labeled data. A core question of this work is how to assign the weights $w_i$ properly. [14] trains a neural network on the noisy labeled data $\{(x_i^t, y_i^t)\}_{i\in[1,m]}$. At each gradient step of training, the clean labeled data $\{(x_i^v, y_i^v)\}_{i\in[1,n]}$, $n \ll m$ (called the validation set in [14]) is fed into the same network to compute the loss on the validation set. They claim that the optimal weights should be selected based on the performance on the clean labeled data, that is,

w^* = \arg\min_{w,\, w \geq 0} \frac{1}{n} \sum_{i=1}^{n} f_i^v(\theta^*(w))   (7)

where $\theta^*(w)$ is the optimal model parameter minimizing the weighted training loss $f_i^t$:

\theta^*(w) = \arg\min_{\theta} \sum_{i=1}^{m} w_i f_i^t(\theta).   (8)

The weights of the noisy labeled examples are then assigned by comparing the gradient directions obtained on the noisy labeled and clean labeled data.
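A highly simplified sketch of the gradient-alignment intuition behind Equations (7)-(8): an example whose training gradient aligns with the average clean (validation) gradient receives a positive weight, while opposed gradients are clipped to zero. The vectors and the function name are our own toy choices, not the actual meta-learning procedure of [14]:

```python
import numpy as np

# Weight each training example by how well its gradient aligns with the
# average gradient on the clean validation set; negative alignment -> 0.
def reweight_by_gradient_alignment(train_grads, val_grad):
    sims = train_grads @ val_grad             # inner products g_i . g_val
    w = np.maximum(sims, 0.0)                 # clip negative alignment to 0
    return w / w.sum() if w.sum() > 0 else w  # normalize to sum to 1

train_grads = np.array([[1.0, 0.0],    # aligned with the validation gradient
                        [-1.0, 0.0],   # opposed (likely a noisy label)
                        [0.5, 0.5]])
val_grad = np.array([1.0, 0.0])
w = reweight_by_gradient_alignment(train_grads, val_grad)
# The opposed example receives weight 0.
```

The zero weight on the opposed example illustrates why, as criticized in Section 1.2, a large amount of the training data can end up unused under this criterion.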
2.2 Domain adaptation
Domain adaptation (DA) attempts to leverage knowledge from one or multiple labeled source domains to learn a classifier for a target domain [16]. Depending on what is transferred between the domains, domain adaptation approaches can be categorized into instance transfer, feature representation transfer, parameter transfer and relational knowledge transfer approaches [17]. The instance transfer approach transfers samples to the target domain by importance reweighting. The feature representation approach learns a feature representation that minimizes the difference between the source and target domains. Parameter transfer finds shared parameters between the source and target domains, and the relational knowledge transfer approach transfers relationships within the data from the source to the target domain.
Depending on whether deep features or deep architectures are adopted, DA methods can be categorized into shallow and deep DA methods. Shallow methods usually try to match the distributions of the source and target domains either by reweighting samples [18, 19, 20] or by learning a shared space for distribution matching [21, 22, 23]. Deep DA methods have attracted more attention in recent research. During training, deep DA models aim to minimize a defined loss (e.g. a classification loss) while maximizing a domain-confusion factor [24]. The factor can be computed by a discrepancy loss or an adversarial loss [16, 25]. In the following we review only discrepancy-loss-based methods, as these are the most relevant to this work.
Among deep DA methods, discrepancy-loss-based methods minimize the distribution discrepancy between the source and target domains measured by some criterion. Maximum mean discrepancy (MMD) or a variant of it is the most commonly used criterion [26, 27, 28, 29, 30, 31]; the details of MMD are given in Section 2.3. Ghifary et al. [27] first used MMD in feedforward neural networks to match cross-domain representations in the latent space. Then, to exploit the strong representational power of convolutional neural networks (CNNs), MMD was extended to deep CNNs in the deep domain confusion network (DDC) [28]. DDC minimizes the following loss:

L = L_c(X^L, y) + \lambda\, D_k^2(X^S, X^T)   (9)

where $L_c(X^L, y)$ is the classification loss on the labeled training data $X^L$, $D_k^2(X^S, X^T)$ is the MMD between the source data $X^S$ and the unlabeled target data $X^T$, and the hyperparameter $\lambda$ controls the degree of domain confusion. Later, Long et al. used multiple MMDs between multiple adaptation layers to design deep adaptation networks (DANs) [30], but their assumption on the conditional distributions was quite strong. To relax this assumption, Long et al. proposed joint adaptation networks (JANs) [31], where a joint distribution discrepancy, rather than a sum, is adopted for the deep features. Moreover, by combining MMD-based feature adaptation with residual layers, residual transfer networks (RTNs) [29] jointly learn the adaptation of both features and classifiers.
2.3 Distribution divergence measures
Measuring the divergence of two distributions is a fundamental problem in machine learning with a wide range of applications, such as binary classification and two-sample testing. In this section, we review commonly used criteria for measuring distribution divergence, with a particular focus on maximum mean discrepancy (MMD) and the f-divergence.
2.3.1 Maximum mean discrepancy (MMD)
As described in Section 2.2, MMD is a domain discrepancy criterion widely used in machine learning, especially in DA. Unlike parametric criteria such as the t-test, MMD is a non-parametric discrepancy measure, which is more suitable for real-world problems because non-parametric measures need no strong prior assumptions on the data distribution [32]. It has also been proven equivalent to the energy distance from the statistics literature [33]. Specifically, MMD [26] measures the distance between two distribution embeddings in a reproducing kernel Hilbert space (RKHS) [33]. Before formally introducing MMD, we first give some background on kernels, distribution embeddings and RKHSs.
To start with, complex machine learning tasks cannot be solved using only linear decision boundaries. To obtain a non-linear decision boundary, a non-linear mapping $\Phi$ is used to map data from the input space $\mathcal{X}$ to a high-dimensional space $\mathcal{H}$ in which the samples can be linearly separated. Since the dimension of $\mathcal{H}$ can be extremely high, kernel functions are adopted to reduce the heavy cost of inner product computations in $\mathcal{H}$. For example, we are given a dataset $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$, where the domains $\mathcal{X}$ and $\mathcal{Y}$ are nonempty sets containing the inputs $x$ and targets $y$ respectively. A kernel, as a similarity measure on $\mathcal{X}$, is defined as

k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad (x, x') \mapsto k(x, x')   (10)

satisfying, for all $x, x' \in \mathcal{X}$,

k(x, x') = \langle \Phi(x), \Phi(x') \rangle   (11)

where $\Phi : \mathcal{X} \to \mathcal{H}$ is a mapping to some dot product space $\mathcal{H}$ and $\langle \cdot, \cdot \rangle$ denotes the dot product defined in $\mathcal{H}$. In this way, we can construct algorithms in $\mathcal{H}$ without explicitly computing the mapping $\Phi$, by substituting $k(x, x')$ for $\langle \Phi(x), \Phi(x') \rangle$.
Definition 2.1 (RKHS) Let H be a Hilbert space of real-valued functions defined on X. A function k : X × X → R is called a reproducing kernel of H if the following conditions are satisfied:

∀x ∈ X, k(·, x) ∈ H,  and  ∀x ∈ X, ∀f ∈ H, ⟨f, k(·, x)⟩_H = f(x).

H is a reproducing kernel Hilbert space (RKHS) if it has a reproducing kernel. H_k denotes an RKHS H with reproducing kernel k. Once the RKHS is defined, we then show that MMD can be expressed as the distance between the mean embeddings of two distributions in H [26, 34].
Definition 2.2 (MMD) Let k be a kernel defined on X and μ_k(u) the kernel embedding of a distribution u in H_k. The maximum mean discrepancy (MMD) D_k between two distributions P and Q is

D_k(P, Q) = ∥μ_k(P) − μ_k(Q)∥_{H_k},  where μ_k(u) = E_{x∼u}[Φ(x)].
According to [34, 33], the squared MMD can be easily obtained in the RKHS as

D_k²(P, Q) = E_{x,x′}[k(x, x′)] + E_{y,y′}[k(y, y′)] − 2 E_{x,y}[k(x, y)]  (12)

where x, x′ are i.i.d. samples from P and y, y′ are i.i.d. samples from Q.
Straightforwardly, in the RKHS a biased empirical estimate [26, 35, 34] of the squared MMD is derived as

D_k²(P, Q) = (1/n²) Σ_{i,i′=1}^{n} k(x_i, x_{i′}) + (1/m²) Σ_{j,j′=1}^{m} k(y_j, y_{j′}) − (2/nm) Σ_{i=1}^{n} Σ_{j=1}^{m} k(x_i, y_j)  (13)

given that X = {x_1, ..., x_n} are i.i.d. samples from P and Y = {y_1, ..., y_m} are i.i.d. samples from Q.
Therefore, we can measure the discrepancy between two distributions by sampling data from each of them and computing the empirical estimate of MMD, as indicated in (13).
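As a concrete illustration, the biased estimate in (13) can be computed in a few lines of NumPy. The Gaussian RBF kernel and the bandwidth sigma below are assumptions made for this sketch, not choices mandated by the text:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased empirical estimate of squared MMD, as in equation (13)."""
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    n, m = len(x), len(y)
    return k_xx.sum() / n**2 + k_yy.sum() / m**2 - 2 * k_xy.sum() / (n * m)

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
diff = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
# Samples from the same distribution give a much smaller estimate than
# samples from two well-separated distributions.
```

The biased estimate is a squared RKHS norm, so it is always non-negative, and it grows as the two sample sets become more dissimilar.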
2.3.2 Other alternative divergences
Other commonly used divergences include the f-divergence, energy distance and Wasserstein distance. Given two distributions P and Q with absolutely continuous densities p and q, the f-divergence [36] with respect to a measure dx on a domain Ω is defined as

D_f(P ∥ Q) = ∫_Ω q(x) f( p(x)/q(x) ) dx  (14)
where f is a convex function satisfying f(1) = 0. There are many commonly used f-divergences, including but not limited to: the Kullback-Leibler (KL) divergence, reverse KL divergence, Pearson χ² divergence and Jensen-Shannon divergence (see Table 1).
Divergence               Corresponding f(u)
Kullback-Leibler (KL)    u log u
Reverse KL               −log u
Pearson χ²               (u − 1)²
Jensen-Shannon           −(u + 1) log((1 + u)/2) + u log u

Table 1: Examples of f-divergences.
The energy distance is a statistical distance between two probability distributions, defined as

D_E(P, Q) = 2 E_{XY}∥X − Y∥ − E_{XX′}∥X − X′∥ − E_{YY′}∥Y − Y′∥  (15)

where X, X′, Y, Y′ are independent random variables, P is the cumulative distribution function (cdf) of X and X′, Q is the cdf of Y and Y′, and ∥·∥ represents the Euclidean norm. The energy distance has been proven to be a special case of MMD [33].
The Wasserstein distance is another distance measure, arising from optimal transport. If J(P, Q) denotes the set of all joint distributions J for x and y with marginal distributions P and Q, the Wasserstein distance is defined as

W_p(P, Q) = ( inf_{J∈J(P,Q)} ∫ ∥x − y∥^p dJ(x, y) )^{1/p}  (16)

The Wasserstein distance is widely used in machine learning, for example in generative adversarial networks [37] and restricted Boltzmann machines [38].
2.3.3 Divergence measures for importance reweighting
In importance reweighting, one approach is to compute the weights by minimizing a divergence measure between a weighted distribution and the target distribution. The solution ŵ of the optimization is expected to approximate the optimal weight w, that is

min_w D(p ∥ wq) ≈ min_ŵ D(p ∥ ŵq)  (17)

where D can in principle represent any distribution divergence measure. Some of the divergences discussed in Section 2.3 are frequently used due to their advantages when reformulated as optimization problems.
For example, using the KL divergence in the optimization of Equation (17) gives:

min_ŵ D_KL(p ∥ ŵq) = min_ŵ ∫ log( p(x) / (ŵ(x)q(x)) ) p(x) dx
                   = min_ŵ ∫ −log ŵ(x) p(x) dx
                   ≈ max_ŵ (1/n) Σ_{i=1}^{n} log ŵ(x_i),  x_i ∼ p(x)  (18)

(in the second line, terms independent of ŵ are dropped) with constraints ŵ(x) ⩾ 0 and (1/m) Σ_{j=1}^{m} ŵ(x_j) = 1, x_j ∼ q(x). This is called the Kullback-Leibler importance estimation procedure (KLIEP).
When applying a squared Euclidean distance, the optimization problem becomes

min_ŵ (1/2) ∫ (ŵ(x) − w(x))² q(x) dx
= min_ŵ [ (1/2) ∫ ŵ(x)² q(x) dx − ∫ ŵ(x) p(x) dx + (1/2) ∫ w(x)² q(x) dx ]
= min_ŵ [ (1/2) ∫ ŵ(x)² q(x) dx − ∫ ŵ(x) p(x) dx ]
≈ min_ŵ [ (1/2m) Σ_{j=1}^{m} ŵ(x_j)² − (1/n) Σ_{i=1}^{n} ŵ(x_i) ],  x_i ∼ p(x), x_j ∼ q(x)  (19)

where w(x)q(x) = p(x) is used in the cross term and the last term, which does not depend on ŵ, is dropped. This is called unconstrained least-squares importance fitting (uLSIF) [39].
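The empirical objective in (19) becomes a closed-form linear solve once ŵ is modeled as a linear combination of basis functions. The sketch below follows the usual uLSIF recipe (Gaussian basis functions centered at the p-samples, plus an ℓ2 regularizer), but the specific choices of sigma, lam and the centers are assumptions of this sketch, and the synthetic p and q are made-up examples:

```python
import numpy as np

def gauss_basis(a, c, sigma=1.0):
    """Gaussian basis functions evaluated at points a, centered at c."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(c**2, 1)[None, :] - 2 * a @ c.T
    return np.exp(-d2 / (2 * sigma**2))

def ulsif_weights(x_p, x_q, sigma=1.0, lam=1e-3):
    """Sketch of uLSIF: model w(x) = sum_l alpha_l k(x, c_l) with centers c_l = x_p;
    minimizing the empirical objective (19) plus an l2 regularizer gives a
    closed-form solution alpha = (H + lam I)^{-1} h."""
    centers = x_p
    k_q = gauss_basis(x_q, centers, sigma)  # basis evaluated on samples from q
    k_p = gauss_basis(x_p, centers, sigma)  # basis evaluated on samples from p
    H = k_q.T @ k_q / len(x_q)              # from the (1/2m) sum w(x_j)^2 term
    h = k_p.mean(axis=0)                    # from the (1/n) sum w(x_i) term
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return np.maximum(0.0, k_q @ alpha)     # clip negatives to satisfy w >= 0

rng = np.random.default_rng(1)
x_p = rng.normal(1.0, 1.0, (200, 1))  # samples from p
x_q = rng.normal(0.0, 1.0, (500, 1))  # samples from q
w = ulsif_weights(x_p, x_q)
# q-samples lying in the high-density region of p should receive larger weights.
```

Here the true ratio p(x)/q(x) is increasing in x, so the estimated weights should be larger for q-samples on the right of the q distribution.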
Moreover, MMD can also be used in importance reweighting. Minimizing an MMD to compute the weights is called the kernel mean matching (KMM) procedure [20]. Specifically, KMM solves the following optimization problem

min_β ∥ E_{x_j∼q(x)}[β(x_j)Φ(x_j)] − E_{x_i∼p(x)}[Φ(x_i)] ∥  (20)

subject to β(x) ≥ 0 and E_{x_j∼q(x)}[β(x_j)] = 1, where ∥·∥ represents the MMD (RKHS norm) defined in Section 2.3.1.
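In sample form, the squared objective of (20) is a quadratic in β, namely (1/m²)βᵀKβ − (2/mn)βᵀκ up to a constant, and KMM is usually solved as a quadratic program. The sketch below is a dependency-free stand-in that minimizes the same quadratic by gradient descent with an approximate projection (clip to non-negative, then rescale to mean one); the RBF kernel, step size and step count are assumptions of the sketch:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * sigma**2))

def kmm_weights(x_p, x_q, sigma=1.0, steps=500, lr=50.0):
    """Sketch of kernel mean matching (equation (20)): minimize the squared RKHS
    distance between the beta-weighted mean embedding of q-samples and the mean
    embedding of p-samples, keeping beta >= 0 with mean(beta) = 1."""
    m, n = len(x_q), len(x_p)
    K = rbf(x_q, x_q, sigma)                  # m x m Gram matrix on q-samples
    kappa = rbf(x_q, x_p, sigma).sum(axis=1)  # m-vector: sums over p-samples
    beta = np.ones(m)
    for _ in range(steps):
        grad = 2.0 * (K @ beta) / m**2 - 2.0 * kappa / (m * n)
        beta = np.maximum(0.0, beta - lr * grad)  # enforce beta >= 0
        beta = beta / beta.mean()                 # approximate projection to mean 1
    return beta

rng = np.random.default_rng(2)
x_p = rng.normal(1.0, 1.0, (200, 1))
x_q = rng.normal(0.0, 1.0, (400, 1))
beta = kmm_weights(x_p, x_q)
```

As with uLSIF, q-samples in the high-density region of p end up with larger weights, which is exactly the behavior used later for reweighting noisy labeled examples.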
2.4 Deep learning basics
In this section, we introduce basic knowledge of deep learning, includ- ing popular neural network architectures and the optimization proce- dure when training deep neural networks.
2.4.1 Neural network architectures
First, we introduce the basic structures of neural networks. Then, we present two well-known deep architectures adopted as base models: LeNet and ResNet. The selected base models fit their corresponding benchmark datasets well in supervised learning, i.e. when using clean labeled data in model training and testing.
Basic structure
In 1958, the perceptron was invented to model how information is stored and organized in the brain [40]. Nowadays, perceptrons are the basic units of modern neural networks, as shown in Figure 2a. In a perceptron, the data inputs, their corresponding weights and the bias are summed and then passed through an activation function to generate an output. The activation function is a non-linear transformation applied to the data input and enables neural networks to learn complex representations. Commonly used activation functions are the rectified linear unit (ReLU), relu(x) = max(0, x); the hyperbolic tangent, tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}); the sigmoid, sigmoid(x) = 1 / (1 + e^{−x}); and the softmax, softmax(x_i) = exp(x_i) / Σ_j exp(x_j).
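The four activation functions above translate directly into NumPy; this sketch adds nothing beyond the formulas themselves (the max-subtraction in softmax is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def relu(x):
    """relu(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def tanh(x):
    """tanh(x) = (e^x - e^-x) / (e^x + e^-x)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """softmax(x_i) = exp(x_i) / sum_j exp(x_j)."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
probs = softmax(z)  # a probability vector: non-negative entries summing to 1
```

Softmax is the usual choice for the output layer of a classifier, since it turns raw scores into class probabilities, while ReLU is the default for hidden layers.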
In a multi-layer perceptron (MLP), layers are stacked together, where each layer contains a certain number of neurons connected with the neurons of the previous layer. For example, Figure 2b shows an MLP with three layers: one input layer, one hidden layer in the middle and one output layer. Deep MLPs usually have many hidden layers between the input and output layers to obtain a strong representation capacity.
A convolutional neural network (CNN) is a neural network designed for processing data with a grid-like topology, such as images. CNNs have greatly accelerated the advancement of deep learning and computer vision through their power of feature extraction. Features are extracted from low level to high level by the layers from bottom to top. For example, in face recognition, the first few layers learn representations of basic shapes and colors, while the later layers learn more detailed representations of eyes and noses.

(a) A basic perceptron. (b) A multi-layer perceptron.

Figure 2: Basic structures of neural networks.

A CNN typically has the following types of layers:
• Convolutional layers (Conv), applying a convolution operation between a predefined small filter and the local region of the input connected to it;
• Pooling layers (Pool), applying a downsampling operation;
• Fully connected layers (Fc), each neuron in this layer connecting with all neurons in the previous layer to compute class scores.
Next, two CNN architectures, LeNet and ResNet, are introduced. They are commonly used as base models for comparing algorithm perfor- mance on benchmark datasets.
LeNet
Proposed in 1998, LeNet [41] is a simple convolutional neural network widely used for handwritten digit recognition. At the time, it freed humans from the heavy labor of manually extracting data features and achieved state-of-the-art performance on MNIST [41] classification. A typical LeNet architecture, LeNet-5, consists of 7 layers, including convolutional layers, subsampling layers and fully connected layers, as shown in Figure 3. In this work, LeNet is used as the base model for the experiments on MNIST.
ResNet
ResNet [42], short for residual network, is a successful deep architecture tackling the notorious "vanishing/exploding gradients" problem [43] in deep learning, making the training of very deep neural networks possible. It has achieved compelling performance on many benchmark datasets and greatly boosted research in machine learning and computer vision.

Figure 3: The architecture of LeNet-5.
Figure 4: A residual building block.
The core idea of ResNet is to add residual blocks (shown in Figure 4) to deep neural networks. A residual block provides a shortcut connection that skips one or more layers and performs an identity mapping. This makes very deep neural networks easier to optimize. The authors of [42] assume that, by adding identity mappings to extra layers, the training error of deep architectures should not be larger than that of their shallow counterparts; meanwhile, the deep architectures can achieve a large performance gain by increasing the network depth. ResNet can be very deep, for example with 32, 44, 56 or 110 layers. Figure 5 shows the structure of ResNet-32, i.e. a ResNet with depth 32. ResNet has become one of the most popular architectures for visual recognition tasks.
Figure 5: The structure of Resnet-32.
Note that the layers are stacked from top to the bottom.
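The computation of the residual block in Figure 4, y = relu(F(x) + x), can be sketched in a few lines. This is a toy NumPy stand-in: the weight matrices W1, W2 and the two-linear-layer form of F are hypothetical placeholders for the convolutional layers of a real ResNet block:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8  # feature dimension of the toy block
W1 = rng.normal(0, 0.1, (d, d))
W2 = rng.normal(0, 0.1, (d, d))

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x):
    """y = relu(F(x) + x): the shortcut connection adds the input (an identity
    mapping) to the output of the stacked layers F before the final activation."""
    f = relu(x @ W1) @ W2  # F(x): two toy linear layers with a ReLU in between
    return relu(f + x)     # identity shortcut, then activation

x = rng.normal(0, 1, (4, d))
y = residual_block(x)
```

If F were the zero function, the block would reduce to relu(x), i.e. close to an identity mapping, which is why stacking extra residual blocks does not make training harder, as argued in [42].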
2.4.2 Optimization for training networks
Stochastic gradient descent (SGD) is one of the most popular algorithms for optimizing neural networks. In SGD, the network parameters θ are updated according to the objective function J(θ):

θ = θ − α ∇_θ E[J(θ)].  (21)

Here α is the learning rate, deciding how much an update step of the algorithm influences the current weights. To speed up network training, a mini-batch of samples is drawn from the dataset and fed to the network for a fast parameter update:

θ = θ − α ∇_θ J(θ; x_i, y_i),  (22)

where (x_i, y_i) represent the pairs of sampled data in the mini-batch. The expectation E in equation (21) is approximated by averaging the results over the whole dataset.
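The mini-batch update in equation (22) can be sketched on a toy problem. The linear regression setup below is a made-up example chosen so that the gradient has a simple closed form; the learning rate, batch size and step count are likewise assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy problem: J(theta) = mean over the data of (x theta - y)^2
X = rng.normal(0, 1, (1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta + 0.01 * rng.normal(size=1000)

theta = np.zeros(3)
alpha, batch = 0.1, 32
for step in range(2000):
    idx = rng.integers(0, len(X), batch)         # sample a mini-batch (eq. (22))
    xb, yb = X[idx], Y[idx]
    grad = 2 * xb.T @ (xb @ theta - yb) / batch  # gradient of the mini-batch loss
    theta = theta - alpha * grad                 # SGD parameter update
```

Each step uses only 32 examples instead of the full dataset, which is exactly the trade-off equation (22) describes: noisier gradients in exchange for much cheaper updates.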
Overfitting happens when a neural network learns the noise within the training data and thus fails to generalize well to the unseen test dataset. To reduce overfitting, regularization techniques can be added to the optimization, such as dropout, L1 and L2 regularization (L2 regularization is also called weight decay). In dropout, a proportion of nodes in the neural network is randomly selected to be ignored during training. In L1 and L2 regularization, an additional regularization term Ω is added to the objective function, so the regularized objective function J̃(θ; x, y) becomes

J̃(θ; x, y) = J(θ; x, y) + αΩ(θ).  (23)

The Ω term in L1 regularization is Ω(θ) = ∥w∥_1 = Σ_i |w_i|, and in L2 regularization it is Ω(θ) = (1/2)∥w∥_2².
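The two penalty terms and the regularized objective of equation (23) are one-liners; the weight vector and coefficient below are made-up example values:

```python
import numpy as np

def l1_penalty(w):
    """Omega(theta) = ||w||_1 = sum_i |w_i| (L1 regularization)."""
    return np.abs(w).sum()

def l2_penalty(w):
    """Omega(theta) = (1/2) ||w||_2^2 (L2 regularization, i.e. weight decay)."""
    return 0.5 * np.sum(w**2)

def regularized_loss(loss, w, alpha, penalty):
    """Equation (23): J~(theta; x, y) = J(theta; x, y) + alpha * Omega(theta)."""
    return loss + alpha * penalty(w)

w = np.array([0.5, -1.0, 2.0])
j_l2 = regularized_loss(1.0, w, alpha=0.01, penalty=l2_penalty)
```

The L1 penalty pushes individual weights exactly to zero (sparsity), while the L2 penalty shrinks all weights smoothly, which is why the latter is the one folded into optimizers as "weight decay".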
3 Methodology
As mentioned in Section 1.5, we adopt an empirical research method.
The evaluation of approaches is based on experiments on benchmark datasets. A novel approach to learning from noisy labels is proposed to overcome the limitations of current approaches and is expected to outperform all baselines in the experiments. In the following, we describe how we choose a suitable approach to learning from noisy labels and the divergence measure used in the example reweighting criterion among all alternatives. We also present our experimental setup and performance evaluation.
3.1 Choice of method
Here we explain why we choose to use a deep learning based impor- tance reweighting approach to tackle the label noise problem and which divergence to choose in designing a reweighting criterion.
3.1.1 Learning from noisy labels
As mentioned in Section 2.1, there are two major categories of research approaches to learning from noisy labels: traditional shallow learning based approaches and deep learning based approaches. Traditional approaches usually rely heavily on an accurate estimate of the label noise transition matrix, which is still a hard open problem. Also, traditional approaches can only handle cases with relatively simple data inputs and low label noise rates. On the contrary, deep learning based approaches are free of these limitations, although they are less theoretically grounded. Therefore, we decide to design a deep learning based approach to the problem of learning from noisy labels.
Among all deep learning based approaches, we use importance reweighting to select relatively clean samples. This is because importance reweighting is a well-studied topic in statistics, and it has also been applied in traditional research approaches to learning from noisy labels. Instead of simply keeping or discarding samples, as most deep learning based approaches do, importance reweighting lets samples contribute differently, or even somewhere in between keeping and discarding, to the network training, so the number of samples involved in training can be kept at a maximum.
When revisiting the traditional importance reweighting based research on learning from noisy labels [11], we found that their approach is hard to scale to large and complex data inputs. More effort is needed to combine it with deep architectures. Reweight [14] is the first attempt in this direction but fails to give a reasonable importance reweighting criterion. Here begins our work to fill this knowledge gap.
3.1.2 Choice of divergence measure
In this work, we consider instance based domain adaptation because importance reweighting is, at its core, a method of sample selection. As described in equation (4) in Section 2.1, we need to compute the optimal weight w_i = p(x_i)/q(x_i) for samples with noisy labels. Although density estimation can be used to estimate the two densities separately, it severely suffers from the curse of dimensionality [44]. An alternative approach is to solve an optimization problem that minimizes a distribution divergence measure. Possible divergence measures for importance reweighting are introduced in Section 2.3.3.
MMD is a commonly used divergence in importance reweighting. KLIEP is less computationally efficient than KMM because of the non-linearity of its objective function [39]. Although KMM requires a careful choice of kernel width, the heuristic of choosing the median distance between samples works well in practice. We implemented both uLSIF and KMM, and in our setting KMM performs better in both computation time and accuracy. We therefore adopt KMM as the importance reweighting method in this work.
3.1.3 Hyperparameter tuning
The hyperparameters used in this work include batch size, learning
rate, the positions and values of learning rate decay (for CIFAR-10 only),
weight decay, number of total epochs and the kernel width. For batch
size, we use the largest batch size that could be fitted into the memory
of the available computing cluster. We use a total of 400 epochs for
letting the approaches converge. Then, we try different initial learning
rates and find the most suitable one for each dataset. Once the ini-
tial learning rate is fixed, we adjust different values of weight decay to
achieve the best regularization effect. Positions and values of learning
rate decay on CIFAR-10 are determined by trying different combina-
tions. For kernel width, we extract some minibatches of loss distri-
butions of clean labeled and noisy labeled data and compute weights by kernel mean embedding to find an optimal kernel width that could force the loss distributions of noisy labeled data to match that of clean labeled data.
To ensure computational efficiency, we do not use cross-validation in hyperparameter tuning. This may cause an overoptimistically biased performance estimate, because a small amount of information from the test data is used to guide the hyperparameter tuning. This risk should not be large, however, since in our setting we assume a small amount of clean labeled test data is available for reweighting examples, and the test dataset is not directly used in the training procedure. Also, we evaluate all alternative approaches under this same setting. In practice, when the dataset is large and computing resources are sufficient, we still recommend using cross-validation to tune the hyperparameters.
3.2 Experimental setup
In the experiments, we use commonly used benchmark datasets and the corresponding base models for convenient comparison. Then, we flip the data labels according to a predefined label corruption matrix (label noise transition matrix) and corruption rate to generate noisy labeled datasets. To evaluate our approach, we also design several baselines for performance comparison. The details of the experiments, e.g. software versions and hyperparameter settings, are given at the end of this subsection.
3.2.1 Datasets and base models
The benchmark datasets used in the experiments are MNIST [41] and CIFAR-10 [45]. Examples from the datasets are shown in Figure 6. MNIST is a dataset of handwritten digit images, with digits ranging from 0 to 9. The training set contains 60000 samples and the test set contains 10000 samples. The frequently used corresponding base model is LeNet (more details in Section 2.4.1). The CIFAR-10 dataset is a collection of 60000 real-world object images in 10 classes, with 50000 images for training and 10000 for testing. Each class has 6000 32×32 color images. To compare our work with [14], we use LeNet as the base model for MNIST and ResNet-32 as the base model for CIFAR-10. The details of the base models are given in Section 2.4.1. Table 2 summarizes the details of the datasets and base models used in the experiments.
(a) MNIST. (b) CIFAR.
Figure 6: Examples from MNIST and CIFAR.
Dataset    # training  # testing  # class  Image size  Base model
MNIST      60,000      10,000     10       28 × 28     LeNet
CIFAR-10   50,000      10,000     10       32 × 32     ResNet-32

Table 2: Summary of the datasets and base models used in the experiments.
3.2.2 Label noise transition matrix and noise rate
We study the case of class-conditional label noise (CCN learning), where the data labels are randomly flipped with probabilities conditioned on the classes. In this work, the labels of the corrupted data are generated according to a predefined label noise transition matrix T, where T_{ij} = P(ỹ = j | y = i). We consider two common types of label noise transition matrices (Figure 7), called pairflip and symmetric in [12]. In the pairflip case, the labels in every class only flip to one neighbor class. Without loss of generality, we define the neighbor class of class i to be class i + 1 when i < n, where n is the total number of classes; when i = n, the neighbor class of class i is class 1. This is reasonable in practice because labels can be wrongly assigned between two very similar classes. The other noise type is symmetric, where the labels can randomly flip to any of the other n − 1 classes with equal probability. The label corruption rate, denoted by δ, is the probability of label flipping. Note that the label noise transition matrix and label corruption rate are unknown to the model. The design of this matrix and corruption rate is only for the convenience of data generation and method comparison.
Pairflip:
    | 1−δ   δ    0   ...   0   |
    |  0   1−δ   δ   ...   0   |
    |  ⋮          ⋱          ⋮  |
    |  0    0   ...  1−δ   δ   |
    |  δ    0   ...   0   1−δ  |

Symmetric:
    | 1−δ      δ/(n−1)  ...  δ/(n−1) |
    | δ/(n−1)  1−δ      ...  δ/(n−1) |
    |  ⋮                  ⋱      ⋮    |
    | δ/(n−1)  ...  δ/(n−1)  1−δ     |

Figure 7: Label noise transition matrices. Top: pairflip label noise; bottom: symmetric label noise.
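The two transition matrices of Figure 7 and the corresponding label corruption can be sketched directly from their definitions; the function names and the toy dataset are illustrative choices of this sketch:

```python
import numpy as np

def pairflip_T(n, delta):
    """Pairflip noise: class i flips to class i+1 (mod n) with probability delta."""
    T = (1 - delta) * np.eye(n)
    for i in range(n):
        T[i, (i + 1) % n] = delta
    return T

def symmetric_T(n, delta):
    """Symmetric noise: class i flips to each of the other n-1 classes
    with probability delta / (n - 1)."""
    T = np.full((n, n), delta / (n - 1))
    np.fill_diagonal(T, 1 - delta)
    return T

def corrupt_labels(y, T, rng):
    """Draw a noisy label y~ for each clean label y with P(y~ = j | y = i) = T[i, j]."""
    return np.array([rng.choice(len(T), p=T[yi]) for yi in y])

rng = np.random.default_rng(100)
T = symmetric_T(10, 0.5)
y = rng.integers(0, 10, 10000)
y_noisy = corrupt_labels(y, T, rng)
# With delta = 0.5, roughly half of the labels end up flipped.
```

Every row of either matrix sums to one, so each row is a valid conditional distribution over noisy labels, which is what makes drawing ỹ with `rng.choice` well defined.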
3.2.3 Baselines
We consider the following baselines to compare with our method.

• Clean only: use only the limited clean labeled data in training.

• Random: assign random weights generated from a rectified Gaussian distribution, β_i = max(0, s_i) / Σ_i max(0, s_i), where s_i ∼ N(0, 1). This is the same baseline as used in [14].

• Uniform: assign the same weight to all training examples.

• Reweight: proposed by [14].
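The "Random" and "Uniform" baselines are simple enough to sketch directly from their definitions; the batch size of 128 below is just an example value:

```python
import numpy as np

def random_weights(n, rng):
    """'Random' baseline: rectified-Gaussian weights
    beta_i = max(0, s_i) / sum_i max(0, s_i), with s_i ~ N(0, 1)."""
    s = rng.normal(0.0, 1.0, n)
    r = np.maximum(0.0, s)
    return r / r.sum()  # note: assumes at least one s_i > 0, which is
                        # essentially certain for any realistic batch size

def uniform_weights(n):
    """'Uniform' baseline: the same weight 1/n for every training example."""
    return np.full(n, 1.0 / n)

rng = np.random.default_rng(5)
beta = random_weights(128, rng)
```

Both baselines produce non-negative weights summing to one, matching the normalization used by the proposed reweighting approach, so the comparison isolates the effect of how the weights are chosen rather than how they are scaled.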
3.2.4 Experimental details
The experiments are implemented in Python 3.6.5 with PyTorch 0.4.0 and computed on the RAIDEN GPU cluster provided by RIKEN AIP.
As in [14], for a fair comparison, 1000 clean labeled data points are randomly selected in our experiments, and the baselines "Random" and "Uniform" are fine-tuned in the last 10 epochs using the 1000 clean labeled data. We consider the following experimental settings: symmetric label noise with corruption rates 0.2 and 0.5, and pairflip label noise with corruption rates 0.3 and 0.45, tested on the MNIST and CIFAR-10 datasets. We repeat the experiments under each setting 5 times on MNIST and 3 times on CIFAR-10.
For the MNIST experiments, we use SGD without momentum as the optimizer in our proposed approach, with weight decay 0.01 and learning rate 0.00001. For the best performance of the baseline "Reweight", we adopt SGD with momentum 0.9, weight decay 0.01 and learning rate 0.0001. The other baselines use SGD without momentum, weight decay 0.01 and learning rate 0.0001. For the experiments on CIFAR-10, the optimizer is SGD without momentum, with weight decay 0.0002. The initial learning rate is 0.05, decaying by a factor of 0.1 at epochs 180, 290 and 390, for a total of 400 epochs. All experiments on CIFAR-10 use this same setting. In both the MNIST and CIFAR-10 experiments, the batch size is 128 for both the noisy labeled training data and the clean labeled data. The kernel width is selected as the 1st quantile of the distances between examples. The random seed for all experiments is 100.
Note that we do not use any data augmentation technique in data preprocessing, unlike [14]. This is the exact reason why our reproduced results of Reweight are not as good as those reported in [14].
3.3 Performance evaluation
For performance evaluation, we first present the evaluation metric selected in this work. Second, we show how the evaluation procedure is conducted in the experiments. Third, success criteria are given to demonstrate how we judge the proposed approach to be successful and how the observed performance in the experiments leads to a conclusion.
3.3.1 Evaluation metrics
In this work, we target a balanced class classification problem. So the training objective is to minimize a 0-1 loss, defined as
l(ˆ y, y) = 1 {
arg max
c