Degree Project in Computer Science and Engineering, Second Cycle, 30 Credits
Stockholm, Sweden 2019
Learning from noisy labels by importance reweighting:
a deep learning approach
TONGTONG FANG
Abstract
Noisy labels can cause severe degradation of classification performance. For deep neural networks in particular, noisy labels can be memorized and lead to poor generalization. Recently, label-noise-robust deep learning has outperformed traditional shallow learning approaches in handling complex input data without prior knowledge of the label noise generation process. Learning from noisy labels by importance reweighting is well studied. Existing deep learning work in this line failed to provide a reasonable importance reweighting criterion and thus achieved unsatisfactory experimental performance. Targeting this knowledge gap and inspired by domain adaptation, we propose a novel label-noise-robust deep learning approach based on importance reweighting. Noisy labeled training examples are weighted by minimizing the maximum mean discrepancy between the loss distributions of noisy labeled and clean labeled data. In experiments, the proposed approach outperforms other baselines. The results show a vast research potential in applying domain adaptation to the label noise problem by bridging the two areas. Moreover, the proposed approach potentially motivates other interesting problems in domain adaptation by enabling importance reweighting to be used in deep learning.
Keywords: noisy label, importance reweighting, deep learning, do-
main adaptation
Sammanfattning
Incorrect annotations can degrade classification performance. For deep networks in particular, this can lead to poor generalization. Recently, noise-robust deep learning has outperformed other learning methods in handling complex input data. Existing deep learning results, however, fail to provide reasonable reweighting criteria. To address this knowledge gap, and inspired by domain adaptation, we propose a new robust deep learning method that uses reweighting. The reweighting is done by minimizing the maximum mean discrepancy between the loss distributions of mislabeled and correctly labeled data. In experiments, the proposed method outperforms other methods. The results show great research potential in applying domain adaptation. In addition, the proposed method motivates the investigation of other interesting problems in domain adaptation by enabling smart reweighting.
Keywords: annotated data, reweighting, deep learning, domain adaptation
Acknowledgement
This work would not have been possible without the aid and support of Prof. Masashi Sugiyama, the director of RIKEN AIP, who accepted me to work in his team at RIKEN AIP, gave me the freedom to explore my research interests, and provided useful critiques of my work. I would also like to express my great appreciation to Dr. Gang Niu. I learned a lot from his insightful guidance, encyclopedic knowledge, and his wisdom towards life. I am particularly grateful for the frequent discussions with Nan Lu, Miao Xu, Bo Han, Feng Liu, Wenkai Xu and Xuyang Zhao at AIP, and for the help from Yifan Zhang and Tianyi Zhang at the University of Tokyo in developing this work.
Besides, I would like to thank my KTH supervisor Prof. Henrik Boström for his valuable and constructive suggestions during the development of this research work. I will not forget his very elaborate comments on my thesis, unveiling the secrets of how to write a thesis properly. I also want to offer my special thanks to Prof. Magnus Boman. He was not only my course instructor, thesis examiner, and research supervisor, but also the person who witnessed how I gradually grew up at KTH and generously supported my career goals.
Finally, nobody has been more important in the pursuit of my career goals than my parents. I wish to thank Guoguo and Huairen for their selfless love. They shared my every feeling in developing this work: no matter how cheerful or depressed I was, they were always there beside me in my heart.
List of Figures
1 Research process flow of this work.
2 Basic structures of neural networks.
3 The architecture of LeNet-5.
4 A residual building block.
5 The structure of ResNet-32.
6 Examples from MNIST and CIFAR.
7 Label noise transition matrix.
8 Architecture of the proposed approach.
9 Results on MNIST with 0.2 symmetric label noise.
10 Training accuracy on MNIST with 0.2 symmetric label noise.
11 Results on MNIST with 0.3 pairflip label noise.
12 Training accuracy on MNIST with 0.3 pairflip label noise.
13 Results on MNIST with 0.45 pairflip label noise.
14 Training accuracy on MNIST with 0.45 pairflip label noise.
15 Results on MNIST with 0.5 symmetric label noise.
16 Training accuracy on MNIST with 0.5 symmetric label noise.
17 Results on CIFAR-10 with 0.2 symmetric label noise.
18 Training accuracy on CIFAR-10 with 0.2 symmetric label noise.
19 Training loss on CIFAR-10 with 0.2 symmetric label noise.
20 Results on CIFAR-10 with 0.3 pairflip label noise.
21 Training accuracy on CIFAR-10 with 0.3 pairflip label noise.
22 Training loss on CIFAR-10 with 0.3 pairflip label noise.
23 Results on CIFAR-10 with 0.45 pairflip label noise.
24 Training accuracy on CIFAR-10 with 0.45 pairflip label noise.
25 Training loss on CIFAR-10 with 0.45 pairflip label noise.
26 Results on CIFAR-10 with 0.5 symmetric label noise.
27 Training accuracy on CIFAR-10 with 0.5 symmetric label noise.
28 Training loss on CIFAR-10 with 0.5 symmetric label noise.
29 Histogram of the learned weight distribution on MNIST with 0.2 symmetric label noise.
30 Histogram of the learned weight distribution on MNIST with 0.3 pairflip label noise.
31 Histogram of the learned weight distribution on MNIST with 0.45 pairflip label noise.
32 Histogram of the learned weight distribution on MNIST with 0.5 symmetric label noise.
33 Histogram of the learned weight distribution on CIFAR-10 with 0.2 symmetric label noise.
34 Histogram of the learned weight distribution on CIFAR-10 with 0.3 pairflip label noise.
35 Histogram of the learned weight distribution on CIFAR-10 with 0.45 pairflip label noise.
36 Histogram of the learned weight distribution on CIFAR-10 with 0.5 symmetric label noise.
37 The effect of clean labeled data size on test accuracy.
38 Histogram of weight distribution learned by Reweight (without rescaling) on MNIST with symmetric label noise.
39 Histogram of weight distribution learned by Reweight (without rescaling) on MNIST with pairflip label noise.
40 Histogram of weight distribution learned by Reweight (without rescaling) on CIFAR-10 with symmetric label noise.
41 Histogram of weight distribution learned by Reweight (without rescaling) on CIFAR-10 with pairflip label noise.
List of Tables
1 Examples of f-divergence.
2 Summary of the datasets and base models used in experiments.
3 Comparison of label noise problem and domain adaptation.
4 Average test accuracy ± standard error on MNIST over the last ten epochs.
5 Average test accuracy ± standard deviation on CIFAR-10 over the last ten epochs.
6 Effect of clean labeled data size: average test accuracy of the proposed approach on CIFAR-10 with 0.2 symmetric label noise over the last ten epochs.
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goal
1.5 Research Methodology
1.6 Ethics, Benefits and Sustainability
1.7 Outline
2 Extended background
2.1 Learning from noisy labels
2.2 Domain adaptation
2.3 Distribution divergence measures
2.3.1 Maximum mean discrepancy (MMD)
2.3.2 Other alternative divergences
2.3.3 Divergence measures for importance reweighting
2.4 Deep learning basics
2.4.1 Neural network architectures
2.4.2 Optimization for training networks
3 Methodology
3.1 Choice of method
3.1.1 Learning from noisy labels
3.1.2 Choice of divergence measure
3.1.3 Hyperparameter tuning
3.2 Experimental setup
3.2.1 Datasets and base models
3.2.2 Label noise transition matrix and noise rate
3.2.3 Baselines
3.2.4 Experimental details
3.3 Performance evaluation
3.3.1 Evaluation metrics
3.3.2 Evaluation procedure
3.3.3 Success criteria
4 Proposed approach
4.1 Problem settings
4.2 Bridging label noise problem and domain adaptation
4.3 Importance reweighting criterion
4.4 Deep architecture implementation
5 Results
5.1 Performance by label noise type and noise rate
5.1.1 Results on MNIST
5.1.2 Results on CIFAR-10
5.2 Distribution of the learned weights
5.2.1 The case of MNIST
5.2.2 The case of CIFAR-10
5.3 Size of clean labeled dataset
5.4 Discussion
6 Conclusion
References
A Weight distribution without rescaling
1 Introduction
1.1 Background
In the real world, data with noisy labels/corrupted labels (i.e. a mixture of correctly labeled and wrongly labeled data) is ubiquitous for many reasons, such as crowdsourcing [1, 2] or online queries [3, 4]. Noisy labels can lead to a deterioration of performance in classification problems [5]. The situation is even worse when deep neural networks are used for classification, because the networks can easily memorize these noisy labels and thus generalize poorly [6].
As deep learning has grown popular, learning from noisy labels with deep learning has become a hot research topic in machine learning and has resulted in many approaches to tackle this problem. Generally, such approaches try to make the training of neural networks robust even under extreme label noise.
The problem of learning from noisy labels has been studied extensively since 1988 [7]. Scott et al. proposed a general framework for classification with noisy labels [8], named the mutually contaminated (MC) distributions framework in [9]. Within this framework, Menon et al. [9] studied the problem of binary noisy labels using class-probability estimation and provided fruitful theoretical insights into solving the noisy label problem. Patrini et al. [10] further extended this work to multi-class classification and invented forward and backward loss correction approaches for the noisy label problem. Their approach works well when the noise transition matrix, which indicates how the noisy labels are generated, is known. However, this matrix is usually unknown in practice; when it is unknown, they provide an estimation, but the estimation did not work well in their experiments [10]. Liu et al. [11] also tackled the label noise classification problem by importance reweighting. They formalized the problem and proved a consistency guarantee: for any surrogate loss function, the classifier learned with importance reweighting converges to the optimal classifier of the noise-free case. Still, their performance relies heavily on an accurate estimate of the noise transition matrix, which remains an open problem.
Recently, deep learning has empowered label noise research to handle more complex input data with high label noise rates. Deep learning based approaches have outperformed the state-of-the-art methods in learning from noisy labels and, most importantly, they are free of estimating the label noise transition matrix. Usually, in a deep learning based approach, a sample selection criterion is designed to select the examples that most probably have clean labels and to let these selected examples contribute more to training the deep neural networks. Examples are Co-teaching [12], MentorNet [13] and Reweight [14]. Co-teaching [12] exchanges the examples with small training loss between two neural networks to reduce the training error, where examples with small training loss are assumed to be clean labeled. MentorNet and Reweight assume the availability of a small amount of clean labeled data. MentorNet maintains two networks, MentorNet and StudentNet, simultaneously, letting MentorNet train on the clean labeled data to guide StudentNet to train first on the examples most probably carrying clean labels. Unlike Co-teaching and MentorNet, which simply select or discard examples, Reweight tries to use as many examples as possible in training, where examples with clean labels are more likely to be assigned a larger weight and vice versa. In [14], the examples are reweighted according to their sensitivities measured via influence functions, i.e. how much the examples change under a small local perturbation.
1.2 Problem
Traditionally, learning from noisy labels by importance reweighting is well studied but has drawbacks: it requires knowledge of the label noise generation process and has difficulties dealing with complex input data. Extending this approach with deep learning is worth studying because deep learning based approaches do not suffer from these limitations. However, the representative work in this line of research, Reweight [14], fails to provide a reasonable example reweighting criterion. As previously mentioned, Reweight assigns weights to examples by their sensitivity to a small perturbation, explained as matching the gradient directions of the clean labeled and noisy labeled data. But matching only this local change of the loss function is not enough to keep training robust to noisy labels; in other words, they wrongly take sensitivity as the importance of examples. Moreover, in [14], a large amount of the training data is not used in training because it is assigned a weight of 0. These failures lead to poor experimental performance: in symmetric label noise experiments, their performance is similar to that obtained with random weights.
In short, learning from noisy labels by importance reweighting using deep learning is promising to study, yet existing work fails to propose a reasonable reweighting criterion. This indicates the need for a novel criterion that works well for the label noise problem and yields good experimental performance.
1.3 Purpose
Motivated by the failures of Reweight, we aim to propose an importance reweighting approach for the label noise problem with a reasonable reweighting criterion. We therefore pose the following research question:
How can we design an effective importance reweighting criterion that yields a novel approach outperforming the current importance reweighting based label noise approaches?
In this work, we first need to design an effective importance reweighting criterion that behaves as intuitively expected. Secondly, this reweighting criterion should be integrated into the proposed approach. The novel approach is expected to keep the training of neural networks robust in the presence of noisy labels, i.e. the performance of the neural networks should not be significantly devastated by the introduction of noisy labels.
1.4 Goal
The goal of this work is to motivate research on learning from noisy labels by proposing a novel approach. The work would also potentially encourage more research on using domain adaptation techniques to solve the label noise problem by bridging the two areas. Moreover, we aim to boost the growth of industries where noisy labels are pervasive, such as crowdsourcing platforms and recommendation systems. By lessening the effect of label noise, our work helps these industries better analyze user behavior and improve their services and products.
1.5 Research Methodology
Figure 1 shows the research process flow of this work. As data-driven research, we adopt a quantitative research method in which a model is trained from data for specific tasks. This work is fundamental research, in which a new method B is proposed for a defined research problem by challenging an existing method A. The performance of our proposed method is evaluated by training and testing on benchmark datasets in experiments. At the end of the experiments, we can confirm which method, A or B, better solves the research problem according to their experimental performance. Therefore, we use an experimental research strategy in this work.
Figure 1: Research process flow of this work.
1.6 Ethics, Benefits and Sustainability
This work does not directly involve ethical issues. All datasets used in the experiments are open-source datasets commonly used in machine learning research, and no participant other than the author was needed to carry out the work. However, our work may indirectly involve personal data when applied in real-world applications. For example, this work could potentially enhance the growth of label-noise-intensive industries, e.g. recommendation systems, and those industries may use personal data in different ways, such as analyzing user behavior.
Regarding benefits, our work would benefit both the research community and industry. On the one hand, for the research community, this work is the first to connect the study of learning from noisy labels and domain adaptation, and it points out a huge research potential in this interdisciplinary area. Moreover, the study provides a novel importance reweighting approach to the problem of learning from noisy labels. On the other hand, industry could gain a robust algorithm for settings where noisy labels are numerous. For example, crowdsourcing platforms and recommendation systems are two main sources of numerous noisy labels. Our work can potentially encourage their growth by solving label noise problems. The growth of these companies would further provide new job opportunities and contribute to overall economic growth. This relates to Goal 8: Decent work and economic growth, defined in the framework of the sustainable development goals by the United Nations¹. Besides, since the work is closely linked to industry benefits, it also aligns with Goal 9: Industries, Innovation and Infrastructure. All of this indicates that our work conforms to the universal call of building a sustainable planet.
1.7 Outline
The thesis is organized as follows:
• Extended background provides a literature review of learning from noisy labels and domain adaptation, together with background knowledge on distribution divergence measures and deep learning basics.
• Methodology elaborates how we choose our method to answer the research problem. It also includes the experimental setup and the evaluation metrics used in the experiments.
• Proposed approach presents the problem settings and how we approach the targeted research problem, including how to bridge the label noise problem and DA, our proposed reweighting criterion, and the deep architecture implementation.
• Results shows our experimental results, their interpretations, and the necessary discussion.
• Conclusion concludes the main part of the work.
¹ https://www.undp.org/content/undp/en/home/sustainable-development-goals.html
2 Extended background
2.1 Learning from noisy labels
Traditional supervised learning usually requires domain experts to provide annotations as data labels. With the emergence of deep learning, the demand for large amounts of labeled data has greatly increased, so manual data labeling has become unrealistic for the wide range of deep learning applications. For quick and cheap labeling, online queries and crowdsourcing [2, 1] are frequently used to collect data labels: the former assigns labels according to the user's query keywords from search engines, and the latter acquires labels from a large number of annotators in a distributed manner. However, both lead to a considerable amount of data with noisy labels (i.e. wrongly labeled data). For example, in crowdsourcing, aggregating labels from different annotators is still a problem, and the quality of the annotators is hard to control. Some annotations even come from spammers [2], i.e. annotators who assign random labels because they lack the knowledge required for labeling. Noisy labels severely weaken the performance of neural networks [6] because neural networks can easily memorize the noisy labels and generalize poorly. Therefore, research on learning from noisy labels is necessary and important to ensure the robust training of deep neural networks.
The research on learning from noisy labels has a long history, dating back to 1988 [7]. Later, Menon et al. [9] studied the theory of label corruption processes within the mutually contaminated (MC) distribution framework [8] using class-probability estimation. MC learning [9, 8] is a general framework for learning from noisy labels, where samples are observed from a label-corrupted distribution $\tilde{D}$ under some unknown noise parameter $\alpha$ that generates the noisy labels. Class-conditional label noise (CCN learning) and positive-unlabeled learning (PU learning) are two special cases of MC learning. The work [9] by Menon et al. exposed several important theoretical conclusions. For example, it proved that optimizing the balanced error (BER) or the area under the ROC curve (AUC) on clean labeled data is equivalent to optimizing it on the corrupted data. This implies that the noise parameter $\alpha$ need not be known for learning from noisy labels, and it guarantees the possibility of learning from noisy labels alone, without clean labels. Moreover, it reveals that $\alpha$ can be estimated by a class-probability estimator under some assumptions.
In 2017, Patrini et al. [10] extended [9] from the binary to the multi-class setting and proposed forward and backward loss correction for robust learning from label noise, which has become one of the main approaches in label noise research. According to [10], if the noise transition matrix T is known and non-singular, the backward-corrected loss $\ell_B$ is

\ell_B(\hat{p}(y|x)) = T^{-1}\, \ell(\hat{p}(y|x))   (1)

where $\hat{p}(y|x)$ is a column vector of softmax outputs approximating the class-conditional probability $p(y|x)$. By applying $T^{-1}$, the corrected loss becomes a linear combination of the original loss whose coefficients account for the probability of each true label $y$ given the observed corrupted label. In contrast to backward correction, forward loss correction corrects the model prediction instead:

\ell_{F,\phi}(h(x)) = \ell(T^{\top} \phi^{-1}(h(x)))   (2)

where $\ell_{F,\phi}$ is the forward-corrected loss, $\phi$ is a link function, and $h(x)$ is the predicted label. When T is unknown, [10] provides an approach to estimate it, but the estimate is often inaccurate.
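As a toy illustration, the linear recombination in Equation (1) can be checked numerically. The transition matrix and softmax values below are illustrative numbers of our own, not taken from [10]:

```python
import numpy as np

# Backward loss correction, Equation (1): with a known, non-singular noise
# transition matrix T, the per-class losses are linearly recombined by T^{-1}.
T = np.array([[0.8, 0.2],     # T[i, j] = P(observed label j | true label i)
              [0.2, 0.8]])

p_hat = np.array([0.7, 0.3])              # softmax output \hat p(y|x)
loss = -np.log(p_hat)                     # per-class cross-entropy losses
loss_backward = np.linalg.inv(T) @ loss   # corrected losses l_B = T^{-1} l

# Applying T to the corrected losses recovers the original losses, which is
# the sense in which the correction undoes the mixing induced by the noise.
```

Per [10], minimizing the backward-corrected loss on noisy labels is, in expectation over the noise process, equivalent to minimizing the original loss on clean labels.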
The label noise problem can also be approached via importance sampling. In importance sampling [15], we want to approximate integrals of the form

\mathbb{E}[f] = \int f(x)\, p(x)\, dx.   (3)

The idea is to use samples from a proposal distribution q(x) to approximate the expectation over the exact distribution p(x); usually the density q(x) is simpler than p(x):

\mathbb{E}[f] = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{m} \sum_{i=1}^{m} w_i f(x_i)   (4)

In Equation (4), m is the total number of samples drawn from the proposal distribution and the importance weight is $w_i = p(x_i)/q(x_i)$. It is clear from $w_i$ that the weight is large if the sample $x_i$ is more likely under p(x) than under q(x). As a result, by assigning importance weights to the samples, the distribution of samples from q(x) is forced to match the target distribution p(x).
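The Monte Carlo approximation in Equation (4) can be sketched in a few lines of NumPy. The target, proposal and test function below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x): standard normal N(0, 1); proposal q(x): wider normal N(0, 2^2).
# We estimate E_p[f(x)] with f(x) = x^2, whose true value under p is 1.
def p_pdf(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)

m = 100_000
x = rng.normal(0.0, 2.0, size=m)   # samples from the proposal q
w = p_pdf(x) / q_pdf(x)            # importance weights w_i = p(x_i)/q(x_i)
estimate = np.mean(w * x**2)       # (1/m) sum_i w_i f(x_i), Equation (4)
# `estimate` is close to the true expectation E_p[x^2] = 1.
```

Samples that are more plausible under p than under q receive weights above 1, exactly as the discussion of $w_i$ above describes.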
Let $R_{L,D}(f)$ denote the expected risk of a classifier f with respect to the distribution D and loss L. In classification, f is learned to minimize the expected risk. Liu et al. [11] formulated the label noise classification problem via importance reweighting as follows:

R_{L,D}(f) = \mathbb{E}_{(x,y)\sim D}\left[ L(f_\theta(x), y) \right]
           = \mathbb{E}_{(x,\tilde{y})\sim \tilde{D}}\left[ \frac{P_D(x, y)}{P_{\tilde{D}}(x, \tilde{y})}\, L(f_\theta(x), \tilde{y}) \right]
           = \mathbb{E}_{(x,\tilde{y})\sim \tilde{D}}\left[ \beta(x, \tilde{y})\, L(f_\theta(x), \tilde{y}) \right]
           = R_{\beta,L,\tilde{D}}(f).   (5)

Since in the noisy label problem $P_D(x) = P_{\tilde{D}}(x)$, the weights are computed as

\beta(x, \tilde{y}) = \frac{P_D(x, y)}{P_{\tilde{D}}(x, \tilde{y})} = \frac{P_D(y|x)\, P_D(x)}{P_{\tilde{D}}(\tilde{y}|x)\, P_{\tilde{D}}(x)} = \frac{P_D(y|x)}{P_{\tilde{D}}(\tilde{y}|x)}.   (6)

They also proved that the classifier learned by this importance reweighting approach converges to the optimal classifier obtained in the label-noise-free case. Still, their approach relies heavily on an accurate estimate of the label noise transition matrix, which remains a hard open problem.
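To make Equation (6) concrete, the following sketch computes β for binary labels under symmetric noise with a known flip rate. The posteriors and the flip rate are illustrative assumptions of ours; estimating them accurately is exactly the hard part noted above:

```python
import numpy as np

# beta(x, y~) = P_D(y~|x) / P_D~(y~|x) for binary labels under symmetric
# label noise with flip rate rho: P_D~(y~|x) = (1-rho) P_D(y~|x) + rho P_D(1-y~|x).
def beta_weights(eta_clean, y_noisy, rho):
    p_clean = np.where(y_noisy == 1, eta_clean, 1 - eta_clean)  # P_D(y~|x)
    p_noisy = (1 - rho) * p_clean + rho * (1 - p_clean)         # P_D~(y~|x)
    return p_clean / p_noisy

eta = np.array([0.9, 0.8, 0.1])   # assumed clean posteriors P_D(y=1|x)
y_tilde = np.array([1, 0, 1])     # observed (possibly flipped) labels
w = beta_weights(eta, y_tilde, rho=0.2)
# Examples whose observed label agrees with the likely clean label get a
# weight above 1; likely-flipped examples get a weight below 1.
```

This matches the intuition behind Equation (6): the weight up-weights examples whose observed label is plausible under the clean distribution and down-weights the rest.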
Recently, deep learning based approaches have achieved superior performance in label noise research. Although these approaches are less transparent and less theoretically grounded, they are free of estimating the noise transition matrix T. To keep training robust to noisy labels, these approaches select the data with clean labels based on some criterion to guide the training procedure. Some representative works are MentorNet [13], Co-teaching [12] and the Reweight [14] architecture. The MentorNet architecture [13] consists of two neural networks: a MentorNet and a StudentNet. During training, MentorNet learns a curriculum on a small clean labeled dataset and then uses this curriculum to guide the StudentNet's training on the corrupted labeled data. The learned curriculum can select the samples that are most probably correctly labeled. At every mini-batch step, MentorNet updates its curriculum using the features (loss, change of loss along moving averages, label and training progress) provided by the StudentNet.
Unlike MentorNet, where the error flow accumulates during training, Co-teaching [12] handles noisy labels by training two neural networks simultaneously and reduces the training error by exchanging small-loss data between the two networks. This work assumes that data with small training loss are more likely to be correctly labeled, so by transferring these presumably clean labeled data to the peer network, the error can be filtered out. This approach outperforms MentorNet on several benchmark datasets.
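The small-loss selection idea behind Co-teaching can be sketched as follows. The losses, the forget rate and the function names are our own illustrative choices, not from the paper's implementation:

```python
import numpy as np

# Each network ranks examples by its own per-example loss and keeps the
# (1 - forget_rate) fraction with the smallest loss, assumed to be clean.
def select_small_loss(losses, forget_rate):
    n_keep = int(len(losses) * (1 - forget_rate))
    return np.argsort(losses)[:n_keep]

loss_net_a = np.array([0.1, 2.5, 0.3, 3.0, 0.2, 0.15])
loss_net_b = np.array([0.2, 2.0, 0.25, 2.8, 3.1, 0.1])

# Each network hands its small-loss selection to its peer, which then
# updates only on the peer-selected (presumably clean) subset.
idx_for_b = select_small_loss(loss_net_a, forget_rate=0.5)
idx_for_a = select_small_loss(loss_net_b, forget_rate=0.5)
```

Because the two networks disagree on which examples look clean, the cross-exchange prevents either network's selection errors from reinforcing themselves, which is the stated advantage over MentorNet's accumulated error flow.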
Reweight [14] has attracted much attention in learning with noisy labels. In the setting of [14], there is a small set of correctly labeled data and a massive dataset with noisy labels, e.g. in a ratio of roughly 1/50, and both the data and the noisy labels are balanced across classes. The examples are reweighted according to a certain criterion to reduce the negative effect of training on noisy labeled data. A core question of this work is how to assign the weights $w_i$ properly. [14] trains a neural network on the noisy labeled data $\{(x_i^t, y_i^t)\}_{i\in[1,m]}$. At each gradient step of training, the clean labeled data $\{(x_i^v, y_i^v)\}_{i\in[1,n]}$, $n \ll m$ (called the validation set in [14]) is fed into the same network to compute the loss on the validation set. They claim that the optimal weights should be selected based on the performance on the clean labeled data, that is,

w^* = \arg\min_{w,\, w \geq 0} \frac{1}{n} \sum_{i=1}^{n} f_i^v(\theta^*(w))   (7)

where $\theta^*(w)$ is the optimal model parameter minimizing the weighted training loss $f_i^t$:

\theta^*(w) = \arg\min_{\theta} \sum_{i=1}^{m} w_i f_i^t(\theta).   (8)

The weights of the noisy labeled examples are then assigned by comparing the gradient directions obtained on the noisy labeled and clean labeled data.
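A highly simplified sketch of the gradient-alignment intuition behind Equations (7)-(8): an example whose training gradient aligns with the average clean (validation) gradient receives a positive weight, while opposed gradients are clipped to zero. The vectors and the function name are our own toy choices, not the actual meta-learning procedure of [14]:

```python
import numpy as np

# Weight each training example by how well its gradient aligns with the
# average gradient on the clean validation set; negative alignment -> 0.
def reweight_by_gradient_alignment(train_grads, val_grad):
    sims = train_grads @ val_grad             # inner products g_i . g_val
    w = np.maximum(sims, 0.0)                 # clip negative alignment to 0
    return w / w.sum() if w.sum() > 0 else w  # normalize to sum to 1

train_grads = np.array([[1.0, 0.0],    # aligned with the validation gradient
                        [-1.0, 0.0],   # opposed (likely a noisy label)
                        [0.5, 0.5]])
val_grad = np.array([1.0, 0.0])
w = reweight_by_gradient_alignment(train_grads, val_grad)
# The opposed example receives weight 0.
```

The zero weight on the opposed example illustrates why, as criticized in Section 1.2, a large amount of the training data can end up unused under this criterion.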
2.2 Domain adaptation
Domain adaptation (DA) attempts to leverage knowledge from one or multiple labeled source domains to learn a classifier for a target domain [16]. Depending on what is transferred between the domains, domain adaptation approaches can be categorized into instance transfer, feature representation transfer, parameter transfer and relational knowledge transfer approaches [17]. The instance transfer approach transfers samples to the target domain by importance reweighting. The feature representation approach learns a feature representation that minimizes the difference between the source and target domains. Parameter transfer finds shared parameters between the source and target domains, and the relational knowledge transfer approach transfers relationships within the data from the source to the target domain.
Depending on whether deep features or deep architectures are adopted, DA methods can be categorized into shallow and deep DA methods. Shallow methods usually try to match the distributions of the source and target domains either by reweighting samples [18, 19, 20] or by learning a shared space for distribution matching [21, 22, 23]. Deep DA methods have attracted more attention in recent research. During training, deep DA models aim to minimize a defined loss (e.g. a classification loss) while maximizing a domain-confusion factor [24]. The factor can be computed by a discrepancy loss or an adversarial loss [16, 25]. In the following we review only discrepancy-loss-based methods, as these are the most relevant to this work.
Among deep DA methods, discrepancy-loss-based methods minimize the distribution discrepancy between the source and target domains measured by some criterion. Maximum mean discrepancy (MMD) or a variant of it is the most commonly used criterion [26, 27, 28, 29, 30, 31]; the details of MMD are given in Section 2.3. Ghifary et al. [27] first used MMD in feedforward neural networks to match cross-domain representations in the latent space. Then, to exploit the strong representational power of convolutional neural networks (CNNs), MMD was extended to deep CNNs in the deep domain confusion network (DDC) [28]. DDC minimizes the following loss:

L = L_c(X^L, y) + \lambda\, D_k^2(X^S, X^T)   (9)

where $L_c(X^L, y)$ is the classification loss on the labeled training data $X^L$, $D_k^2(X^S, X^T)$ is the MMD between the source data $X^S$ and the unlabeled target data $X^T$, and the hyperparameter $\lambda$ controls the degree of domain confusion. Later, Long et al. used multiple MMDs between multiple adaptation layers to design deep adaptation networks (DANs) [30], but their assumption on the conditional distributions was quite strong. To relax this assumption, Long et al. proposed joint adaptation networks (JANs) [31], where a joint distribution discrepancy, rather than a sum, is adopted for the deep features. Moreover, by combining MMD-based feature adaptation with residual layers, residual transfer networks (RTNs) [29] jointly learn the adaptation of both features and classifiers.
2.3 Distribution divergence measures
Measuring the divergence of two distributions is a fundamental problem in machine learning with a wide range of applications, such as binary classification and two-sample testing. In this section, we review commonly used criteria for measuring distribution divergence, with a particular focus on maximum mean discrepancy (MMD) and the f-divergence.
2.3.1 Maximum mean discrepancy (MMD)
As described in Section 2.2, MMD is a domain discrepancy criterion widely used in machine learning, especially in DA. Unlike parametric criteria such as the t-test, MMD is a non-parametric discrepancy measure, which is more suitable for real-world problems because non-parametric measures need no strong prior assumptions on the data distribution [32]. It has also been proven equivalent to the energy distance from the statistics literature [33]. Specifically, MMD [26] measures the distance between two distribution embeddings in a reproducing kernel Hilbert space (RKHS) [33]. Before formally introducing MMD, we first give some background on kernels, distribution embeddings and RKHSs.
To start with, complex machine learning tasks cannot be solved using only linear decision boundaries. To obtain a non-linear decision boundary, a non-linear mapping $\Phi$ is used to map data from the input space $\mathcal{X}$ to a high-dimensional space $\mathcal{H}$ in which the samples can be linearly separated. Since the dimension of $\mathcal{H}$ can be extremely high, kernel functions are adopted to reduce the heavy cost of inner product computations in $\mathcal{H}$. For example, we are given a dataset $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$, where the domains $\mathcal{X}$ and $\mathcal{Y}$ are nonempty sets containing the inputs $x$ and targets $y$ respectively. A kernel, as a similarity measure on $\mathcal{X}$, is defined as

k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad (x, x') \mapsto k(x, x')   (10)

satisfying, for all $x, x' \in \mathcal{X}$,

k(x, x') = \langle \Phi(x), \Phi(x') \rangle   (11)

where $\Phi : \mathcal{X} \to \mathcal{H}$ is a mapping to some dot product space $\mathcal{H}$ and $\langle \cdot, \cdot \rangle$ denotes the dot product defined in $\mathcal{H}$. In this way, we can construct algorithms in $\mathcal{H}$ without explicitly computing the mapping $\Phi$, by substituting $k(x, x')$ for $\langle \Phi(x), \Phi(x') \rangle$.
Definition 2.1 (RKHS) Let H be a Hilbert space of real-valued functions defined on X. A function k : X × X → R is called a reproducing kernel of H if the following conditions are satisfied:

∀x ∈ X, k(·, x) ∈ H,  and  ∀x ∈ X, ∀f ∈ H, ⟨f, k(·, x)⟩_H = f(x).

H is a reproducing kernel Hilbert space (RKHS) if it has a reproducing kernel. H_k denotes an RKHS H with reproducing kernel k. Once the RKHS is defined, we then show that MMD can be expressed as the distance between the mean embeddings of two distributions in H [26, 34].
Definition 2.2 (MMD) Let k be a kernel defined on X and μ_k(u) the kernel embedding of a distribution u in H_k. The maximum mean discrepancy (MMD) D_k between two distributions P and Q is

D_k(P, Q) = ∥μ_k(P) − μ_k(Q)∥_{H_k},  where μ_k(u) = E_{x∼u}[Φ(x)].
According to [34, 33], the squared MMD can be easily obtained in the RKHS as

D_k²(P, Q) = E_{x,x′}[k(x, x′)] + E_{y,y′}[k(y, y′)] − 2 E_{x,y}[k(x, y)]  (12)

where x, x′ are i.i.d. samples from P and y, y′ are i.i.d. samples from Q.
Straightforwardly, in the RKHS a biased empirical estimate [26, 35, 34] of the squared MMD is derived as

D_k²(P, Q) = (1/n²) Σ_{i,i′=1}^{n} k(x_i, x_{i′}) + (1/m²) Σ_{j,j′=1}^{m} k(y_j, y_{j′}) − (2/nm) Σ_{i=1}^{n} Σ_{j=1}^{m} k(x_i, y_j)  (13)

given that X = {x_1, ..., x_n} are i.i.d. samples from P and Y = {y_1, ..., y_m} are i.i.d. samples from Q.
Therefore, we can measure the discrepancy between two distributions by sampling data from each of them and computing the empirical estimate of MMD, as indicated in (13).
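As a concrete illustration, the biased estimate in (13) can be computed in a few lines of NumPy. The Gaussian RBF kernel and the bandwidth sigma below are assumptions made for this sketch, not choices mandated by the text:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased empirical estimate of squared MMD, as in equation (13)."""
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    n, m = len(x), len(y)
    return k_xx.sum() / n**2 + k_yy.sum() / m**2 - 2 * k_xy.sum() / (n * m)

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
diff = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
# Samples from the same distribution give a much smaller estimate than
# samples from two well-separated distributions.
```

The biased estimate is a squared RKHS norm, so it is always non-negative, and it grows as the two sample sets become more dissimilar.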
2.3.2 Other alternative divergences
Other commonly used divergences include the f-divergence, energy distance and Wasserstein distance. Given two distributions P and Q with absolutely continuous densities p and q, the f-divergence [36] with respect to a measure dx on a domain Ω is defined as

D_f(P ∥ Q) = ∫_Ω q(x) f( p(x)/q(x) ) dx  (14)
where f is a convex function satisfying f(1) = 0. There are many commonly used f-divergences, including but not limited to: the Kullback-Leibler (KL) divergence, reverse KL divergence, Pearson χ² divergence and Jensen-Shannon divergence (see Table 1).
Divergence               Corresponding f(u)
Kullback-Leibler (KL)    u log u
Reverse KL               −log u
Pearson χ²               (u − 1)²
Jensen-Shannon           −(u + 1) log((1 + u)/2) + u log u

Table 1: Examples of f-divergences.
The energy distance is a statistical distance between two probability distributions, defined as

D_E(P, Q) = 2 E_{XY}∥X − Y∥ − E_{XX′}∥X − X′∥ − E_{YY′}∥Y − Y′∥  (15)

where X, X′, Y, Y′ are independent random variables, P is the cumulative distribution function (cdf) of X and X′, Q is the cdf of Y and Y′, and ∥·∥ represents the Euclidean norm. The energy distance has been proven to be a special case of MMD [33].
The Wasserstein distance is another distance measure, arising from optimal transport. If J(P, Q) denotes the set of all joint distributions J for x and y with marginal distributions P and Q, the Wasserstein distance is defined as

W_p(P, Q) = ( inf_{J∈J(P,Q)} ∫ ∥x − y∥^p dJ(x, y) )^{1/p}  (16)

The Wasserstein distance is widely used in machine learning, for example in generative adversarial networks [37] and restricted Boltzmann machines [38].
2.3.3 Divergence measures for importance reweighting
In importance reweighting, one approach is to compute the weights by minimizing a divergence measure between a weighted distribution and the target distribution. The solution ŵ of the optimization is expected to approximate the optimal weight w, that is

min_w D(p ∥ wq) ≈ min_ŵ D(p ∥ ŵq)  (17)

where D can in principle represent any distribution divergence measure. Some of the divergences discussed in Section 2.3 are frequently used due to their advantages when reformulated as optimization problems.
For example, using the KL divergence in the optimization of Equation (17) gives:

min_ŵ D_KL(p ∥ ŵq) = min_ŵ ∫ log( p(x) / (ŵ(x)q(x)) ) p(x) dx
                   = min_ŵ ∫ −log ŵ(x) p(x) dx
                   ≈ max_ŵ (1/n) Σ_{i=1}^{n} log ŵ(x_i),  x_i ∼ p(x)  (18)

(in the second line, terms independent of ŵ are dropped) with constraints ŵ(x) ⩾ 0 and (1/m) Σ_{j=1}^{m} ŵ(x_j) = 1, x_j ∼ q(x). This is called the Kullback-Leibler importance estimation procedure (KLIEP).
When applying a squared Euclidean distance, the optimization problem becomes

min_ŵ (1/2) ∫ (ŵ(x) − w(x))² q(x) dx
= min_ŵ [ (1/2) ∫ ŵ(x)² q(x) dx − ∫ ŵ(x) p(x) dx + (1/2) ∫ w(x)² q(x) dx ]
= min_ŵ [ (1/2) ∫ ŵ(x)² q(x) dx − ∫ ŵ(x) p(x) dx ]
≈ min_ŵ [ (1/2m) Σ_{j=1}^{m} ŵ(x_j)² − (1/n) Σ_{i=1}^{n} ŵ(x_i) ],  x_i ∼ p(x), x_j ∼ q(x)  (19)

where w(x)q(x) = p(x) is used in the cross term and the last term, which does not depend on ŵ, is dropped. This is called unconstrained least-squares importance fitting (uLSIF) [39].
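The empirical objective in (19) becomes a closed-form linear solve once ŵ is modeled as a linear combination of basis functions. The sketch below follows the usual uLSIF recipe (Gaussian basis functions centered at the p-samples, plus an ℓ2 regularizer), but the specific choices of sigma, lam and the centers are assumptions of this sketch, and the synthetic p and q are made-up examples:

```python
import numpy as np

def gauss_basis(a, c, sigma=1.0):
    """Gaussian basis functions evaluated at points a, centered at c."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(c**2, 1)[None, :] - 2 * a @ c.T
    return np.exp(-d2 / (2 * sigma**2))

def ulsif_weights(x_p, x_q, sigma=1.0, lam=1e-3):
    """Sketch of uLSIF: model w(x) = sum_l alpha_l k(x, c_l) with centers c_l = x_p;
    minimizing the empirical objective (19) plus an l2 regularizer gives a
    closed-form solution alpha = (H + lam I)^{-1} h."""
    centers = x_p
    k_q = gauss_basis(x_q, centers, sigma)  # basis evaluated on samples from q
    k_p = gauss_basis(x_p, centers, sigma)  # basis evaluated on samples from p
    H = k_q.T @ k_q / len(x_q)              # from the (1/2m) sum w(x_j)^2 term
    h = k_p.mean(axis=0)                    # from the (1/n) sum w(x_i) term
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return np.maximum(0.0, k_q @ alpha)     # clip negatives to satisfy w >= 0

rng = np.random.default_rng(1)
x_p = rng.normal(1.0, 1.0, (200, 1))  # samples from p
x_q = rng.normal(0.0, 1.0, (500, 1))  # samples from q
w = ulsif_weights(x_p, x_q)
# q-samples lying in the high-density region of p should receive larger weights.
```

Here the true ratio p(x)/q(x) is increasing in x, so the estimated weights should be larger for q-samples on the right of the q distribution.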
Moreover, MMD can also be used in importance reweighting. Minimizing an MMD to compute the weights is called the kernel mean matching (KMM) procedure [20]. Specifically, KMM solves the following optimization problem

min_β ∥ E_{x_j∼q(x)}[β(x_j)Φ(x_j)] − E_{x_i∼p(x)}[Φ(x_i)] ∥  (20)

subject to β(x) ≥ 0 and E_{x_j∼q(x)}[β(x_j)] = 1, where ∥·∥ represents the MMD (RKHS norm) defined in Section 2.3.1.
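In sample form, the squared objective of (20) is a quadratic in β, namely (1/m²)βᵀKβ − (2/mn)βᵀκ up to a constant, and KMM is usually solved as a quadratic program. The sketch below is a dependency-free stand-in that minimizes the same quadratic by gradient descent with an approximate projection (clip to non-negative, then rescale to mean one); the RBF kernel, step size and step count are assumptions of the sketch:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * sigma**2))

def kmm_weights(x_p, x_q, sigma=1.0, steps=500, lr=50.0):
    """Sketch of kernel mean matching (equation (20)): minimize the squared RKHS
    distance between the beta-weighted mean embedding of q-samples and the mean
    embedding of p-samples, keeping beta >= 0 with mean(beta) = 1."""
    m, n = len(x_q), len(x_p)
    K = rbf(x_q, x_q, sigma)                  # m x m Gram matrix on q-samples
    kappa = rbf(x_q, x_p, sigma).sum(axis=1)  # m-vector: sums over p-samples
    beta = np.ones(m)
    for _ in range(steps):
        grad = 2.0 * (K @ beta) / m**2 - 2.0 * kappa / (m * n)
        beta = np.maximum(0.0, beta - lr * grad)  # enforce beta >= 0
        beta = beta / beta.mean()                 # approximate projection to mean 1
    return beta

rng = np.random.default_rng(2)
x_p = rng.normal(1.0, 1.0, (200, 1))
x_q = rng.normal(0.0, 1.0, (400, 1))
beta = kmm_weights(x_p, x_q)
```

As with uLSIF, q-samples in the high-density region of p end up with larger weights, which is exactly the behavior used later for reweighting noisy labeled examples.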
2.4 Deep learning basics
In this section, we introduce basic knowledge of deep learning, includ- ing popular neural network architectures and the optimization proce- dure when training deep neural networks.
2.4.1 Neural network architectures
First, we introduce the basic structures of neural networks. Then, we present two well-known deep architectures adopted as base models: LeNet and ResNet. The selected base models fit their corresponding benchmark datasets well in supervised learning, i.e. when using clean labeled data in model training and testing.
Basic structure
In 1958, the perceptron was invented to model how information is stored and organized in the brain [40]. Nowadays, perceptrons are the basic units of modern neural networks, as shown in Figure 2a. In a perceptron, the data inputs, their corresponding weights and the bias are summed and then passed through an activation function to generate an output. The activation function is a non-linear transformation applied to the data input and enables neural networks to learn complex representations. Commonly used activation functions are the rectified linear unit (ReLU), relu(x) = max(0, x); the hyperbolic tangent, tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}); the sigmoid, sigmoid(x) = 1 / (1 + e^{−x}); and the softmax, softmax(x_i) = exp(x_i) / Σ_j exp(x_j).
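The four activation functions above translate directly into NumPy; this sketch adds nothing beyond the formulas themselves (the max-subtraction in softmax is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def relu(x):
    """relu(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def tanh(x):
    """tanh(x) = (e^x - e^-x) / (e^x + e^-x)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """softmax(x_i) = exp(x_i) / sum_j exp(x_j)."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
probs = softmax(z)  # a probability vector: non-negative entries summing to 1
```

Softmax is the usual choice for the output layer of a classifier, since it turns raw scores into class probabilities, while ReLU is the default for hidden layers.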
In a multi-layer perceptron (MLP), layers are stacked together, where each layer contains a certain number of neurons connected with the neurons of the previous layer. For example, Figure 2b shows an MLP with three layers: one input layer, one hidden layer in the middle and one output layer. Deep MLPs usually have many hidden layers between the input and output layers to obtain a strong representation capacity.
A convolutional neural network (CNN) is a neural network designed for processing data with a grid-like topology, such as images. CNNs have greatly accelerated the advancement of deep learning and computer vision through their power of feature extraction. Features are extracted from low level to high level by the layers from bottom to top. For example, in face recognition, the first few layers learn representations of basic shapes and colors, while the later layers learn more detailed representations of eyes and noses.

(a) A basic perceptron. (b) A multi-layer perceptron.

Figure 2: Basic structures of neural networks.

A CNN typically has the following types of layers:
• Convolutional layers (Conv), applying a convolution operation between a predefined small filter and the local region of the input connected to it;
• Pooling layers (Pool), applying a downsampling operation;
• Fully connected layers (Fc), each neuron in this layer connecting with all neurons in the previous layer to compute class scores.
Next, two CNN architectures, LeNet and ResNet, are introduced. They are commonly used as base models for comparing algorithm perfor- mance on benchmark datasets.
LeNet
Proposed in 1998, LeNet [41] is a simple convolutional neural network widely used for handwritten digit recognition. At the time, it freed humans from the heavy labor of manually extracting data features and achieved state-of-the-art performance on MNIST [41] classification. A typical LeNet architecture, LeNet-5, consists of 7 layers, including convolutional layers, subsampling layers and fully connected layers, as shown in Figure 3. In this work, LeNet is used as the base model for the experiments on MNIST.
ResNet
ResNet [42], short for residual network, is a successful deep architecture tackling the notorious "vanishing/exploding gradients" problem [43] in deep learning, making the training of very deep neural networks possible. It has achieved compelling performance on many benchmark datasets and greatly boosted research in machine learning and computer vision.

Figure 3: The architecture of LeNet-5.
Figure 4: A residual building block.
The core idea of ResNet is to add residual blocks (shown in Figure 4) to deep neural networks. A residual block provides a shortcut connection that skips one or more layers and performs an identity mapping. This makes very deep neural networks easier to optimize. The authors of [42] assume that, by adding identity mappings to extra layers, the training error of deep architectures should not be larger than that of their shallow counterparts; meanwhile, the deep architectures can achieve a large performance gain by increasing the network depth. ResNet can be very deep, for example with 32, 44, 56 or 110 layers. Figure 5 shows the structure of ResNet-32, i.e. a ResNet with depth 32. ResNet has become one of the most popular architectures for visual recognition tasks.
Figure 5: The structure of Resnet-32.
Note that the layers are stacked from top to the bottom.
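The computation of the residual block in Figure 4, y = relu(F(x) + x), can be sketched in a few lines. This is a toy NumPy stand-in: the weight matrices W1, W2 and the two-linear-layer form of F are hypothetical placeholders for the convolutional layers of a real ResNet block:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8  # feature dimension of the toy block
W1 = rng.normal(0, 0.1, (d, d))
W2 = rng.normal(0, 0.1, (d, d))

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x):
    """y = relu(F(x) + x): the shortcut connection adds the input (an identity
    mapping) to the output of the stacked layers F before the final activation."""
    f = relu(x @ W1) @ W2  # F(x): two toy linear layers with a ReLU in between
    return relu(f + x)     # identity shortcut, then activation

x = rng.normal(0, 1, (4, d))
y = residual_block(x)
```

If F were the zero function, the block would reduce to relu(x), i.e. close to an identity mapping, which is why stacking extra residual blocks does not make training harder, as argued in [42].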
2.4.2 Optimization for training networks
Stochastic gradient descent (SGD) is one of the most popular algorithms for optimizing neural networks. In SGD, the network parameters θ are updated according to the objective function J(θ):

θ = θ − α ∇_θ E[J(θ)].  (21)

Here α is the learning rate, deciding how much an update step of the algorithm influences the current weights. To speed up network training, a mini-batch of samples is drawn from the dataset and fed to the network for a fast parameter update:

θ = θ − α ∇_θ J(θ; x_i, y_i),  (22)

where (x_i, y_i) represent the pairs of sampled data in the mini-batch. The expectation E in equation (21) is approximated by averaging the results over the whole dataset.
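The mini-batch update in equation (22) can be sketched on a toy problem. The linear regression setup below is a made-up example chosen so that the gradient has a simple closed form; the learning rate, batch size and step count are likewise assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy problem: J(theta) = mean over the data of (x theta - y)^2
X = rng.normal(0, 1, (1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta + 0.01 * rng.normal(size=1000)

theta = np.zeros(3)
alpha, batch = 0.1, 32
for step in range(2000):
    idx = rng.integers(0, len(X), batch)         # sample a mini-batch (eq. (22))
    xb, yb = X[idx], Y[idx]
    grad = 2 * xb.T @ (xb @ theta - yb) / batch  # gradient of the mini-batch loss
    theta = theta - alpha * grad                 # SGD parameter update
```

Each step uses only 32 examples instead of the full dataset, which is exactly the trade-off equation (22) describes: noisier gradients in exchange for much cheaper updates.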
Overfitting happens when a neural network learns the noise within the training data and thus fails to generalize well to the unseen test dataset. To reduce overfitting, regularization techniques can be added to the optimization, such as dropout, L1 and L2 regularization (L2 regularization is also called weight decay). In dropout, a proportion of nodes in the neural network is randomly selected to be ignored during training. In L1 and L2 regularization, an additional regularization term Ω is added to the objective function, so the regularized objective function J̃(θ; x, y) becomes

J̃(θ; x, y) = J(θ; x, y) + αΩ(θ).  (23)

The Ω term in L1 regularization is Ω(θ) = ∥w∥_1 = Σ_i |w_i|, and in L2 regularization it is Ω(θ) = (1/2)∥w∥_2².
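The two penalty terms and the regularized objective of equation (23) are one-liners; the weight vector and coefficient below are made-up example values:

```python
import numpy as np

def l1_penalty(w):
    """Omega(theta) = ||w||_1 = sum_i |w_i| (L1 regularization)."""
    return np.abs(w).sum()

def l2_penalty(w):
    """Omega(theta) = (1/2) ||w||_2^2 (L2 regularization, i.e. weight decay)."""
    return 0.5 * np.sum(w**2)

def regularized_loss(loss, w, alpha, penalty):
    """Equation (23): J~(theta; x, y) = J(theta; x, y) + alpha * Omega(theta)."""
    return loss + alpha * penalty(w)

w = np.array([0.5, -1.0, 2.0])
j_l2 = regularized_loss(1.0, w, alpha=0.01, penalty=l2_penalty)
```

The L1 penalty pushes individual weights exactly to zero (sparsity), while the L2 penalty shrinks all weights smoothly, which is why the latter is the one folded into optimizers as "weight decay".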
3 Methodology
As mentioned in Section 1.5, we adopt an empirical research method.
The evaluation of approaches is based on experiments on benchmark datasets. A novel approach to learning from noisy labels is proposed to overcome the limitations of current approaches and is expected to outperform all baselines in the experiments. In the following, we describe how we choose a suitable approach to learning from noisy labels and the divergence measure used in the example reweighting criterion among all alternatives. We also present our experimental setup and performance evaluation.
3.1 Choice of method
Here we explain why we choose to use a deep learning based impor- tance reweighting approach to tackle the label noise problem and which divergence to choose in designing a reweighting criterion.
3.1.1 Learning from noisy labels
As mentioned in Section 2.1, there are two major categories of research approaches to learning from noisy labels: traditional shallow learning based approaches and deep learning based approaches. Traditional approaches usually rely heavily on an accurate estimate of the label noise transition matrix, which is still a hard open problem. Also, traditional approaches can only handle cases with relatively simple data inputs and low label noise rates. On the contrary, deep learning based approaches are free of these limitations, although they are less theoretically grounded. Therefore, we decide to design a deep learning based approach to the problem of learning from noisy labels.
Among all deep learning based approaches, we use importance reweighting to select relatively clean samples. This is because importance reweighting is a well-studied topic in statistics, and it has also been applied in traditional research approaches to learning from noisy labels. Instead of simply keeping or discarding samples, as most deep learning based approaches do, importance reweighting lets samples contribute differently, or even somewhere in between keeping and discarding, to the network training, so the number of samples involved in training can be kept at a maximum.
When revisiting the traditional importance reweighting based research on learning from noisy labels [11], we found that their approach is hard to scale to large and complex data inputs. More effort is needed to combine it with deep architectures. Reweight [14] is the first attempt in this direction but fails to give a reasonable importance reweighting criterion. Here begins our work to fill this knowledge gap.
3.1.2 Choice of divergence measure
In this work, we consider instance based domain adaptation because importance reweighting is, at its core, a method of sample selection. As described in equation (4) in Section 2.1, we need to compute the optimal weight w_i = p(x_i)/q(x_i) for samples with noisy labels. Although density estimation can be used to estimate the two densities separately, it severely suffers from the curse of dimensionality [44]. An alternative approach is to solve an optimization problem that minimizes a distribution divergence measure. Possible divergence measures for importance reweighting are introduced in Section 2.3.3.
MMD is a commonly used divergence in importance reweighting. KLIEP is less computationally efficient than KMM because of the non-linearity of its objective function [39]. Although KMM requires a careful choice of kernel width, the heuristic of choosing the median distance between samples works well in practice. We implemented both uLSIF and KMM, and in our setting KMM performs better in both computation time and accuracy. We therefore adopt KMM as the importance reweighting method in this work.
3.1.3 Hyperparameter tuning
The hyperparameters used in this work include batch size, learning
rate, the positions and values of learning rate decay (for CIFAR-10 only),
weight decay, number of total epochs and the kernel width. For batch
size, we use the largest batch size that could be fitted into the memory
of the available computing cluster. We use a total of 400 epochs for
letting the approaches converge. Then, we try different initial learning
rates and find the most suitable one for each dataset. Once the ini-
tial learning rate is fixed, we adjust different values of weight decay to
achieve the best regularization effect. Positions and values of learning
rate decay on CIFAR-10 are determined by trying different combina-
tions. For kernel width, we extract some minibatches of loss distri-
butions of clean labeled and noisy labeled data and compute weights by kernel mean embedding to find an optimal kernel width that could force the loss distributions of noisy labeled data to match that of clean labeled data.
To ensure computational efficiency, we do not use cross-validation in hyperparameter tuning. This may cause an overoptimistically biased performance estimate, because a small amount of information from the test data is used to guide the hyperparameter tuning. This risk should not be large, however, since in our setting we assume a small amount of clean labeled test data is available for reweighting examples, and the test dataset is not directly used in the training procedure. Also, we evaluate all alternative approaches under this same setting. In practice, when the dataset is large and computing resources are sufficient, we still recommend using cross-validation to tune the hyperparameters.
3.2 Experimental setup
In the experiments, we use commonly used benchmark datasets and the corresponding base models for convenient comparison. Then, we flip the data labels according to a predefined label corruption matrix (label noise transition matrix) and corruption rate to generate noisy labeled datasets. To evaluate our approach, we also design several baselines for performance comparison. The details of the experiments, e.g. software versions and hyperparameter settings, are given at the end of this subsection.
3.2.1 Datasets and base models
The benchmark datasets used in the experiments are MNIST [41] and CIFAR-10 [45]. Examples from the datasets are shown in Figure 6. MNIST is a dataset of handwritten digit images, with digits ranging from 0 to 9. The training set contains 60000 samples and the test set contains 10000 samples. The frequently used corresponding base model is LeNet (more details in Section 2.4.1). The CIFAR-10 dataset is a collection of 60000 real-world object images in 10 classes, with 50000 images for training and 10000 for testing. Each class has 6000 32×32 color images. To compare our work with [14], we use LeNet as the base model for MNIST and ResNet-32 as the base model for CIFAR-10. The details of the base models are given in Section 2.4.1. Table 2 summarizes the details of the datasets and base models used in the experiments.
(a) MNIST. (b) CIFAR.
Figure 6: Examples from MNIST and CIFAR.
Dataset    # training  # testing  # class  Image size  Base model
MNIST      60,000      10,000     10       28 × 28     LeNet
CIFAR-10   50,000      10,000     10       32 × 32     ResNet-32

Table 2: Summary of the datasets and base models used in the experiments.
3.2.2 Label noise transition matrix and noise rate
We study the case of class-conditional label noise (CCN learning), where the data labels are randomly flipped with probabilities conditioned on the classes. In this work, the labels of the corrupted data are generated according to a predefined label noise transition matrix T, where T_{ij} = P(ỹ = j | y = i). We consider two common types of label noise transition matrices (Figure 7), called pairflip and symmetric in [12]. In the pairflip case, the labels in every class only flip to one neighbor class. Without loss of generality, we define the neighbor class of class i to be class i + 1 when i < n, where n is the total number of classes; when i = n, the neighbor class of class i is class 1. This is reasonable in practice because labels can be wrongly assigned between two very similar classes. The other noise type is symmetric, where the labels can randomly flip to any of the other n − 1 classes with equal probability. The label corruption rate, denoted by δ, is the probability of label flipping. Note that the label noise transition matrix and label corruption rate are unknown to the model. The design of this matrix and corruption rate is only for the convenience of data generation and method comparison.
Pairflip:
    | 1−δ   δ    0   ...   0   |
    |  0   1−δ   δ   ...   0   |
    |  ⋮          ⋱          ⋮  |
    |  0    0   ...  1−δ   δ   |
    |  δ    0   ...   0   1−δ  |

Symmetric:
    | 1−δ      δ/(n−1)  ...  δ/(n−1) |
    | δ/(n−1)  1−δ      ...  δ/(n−1) |
    |  ⋮                  ⋱      ⋮    |
    | δ/(n−1)  ...  δ/(n−1)  1−δ     |

Figure 7: Label noise transition matrices. Top: pairflip label noise; bottom: symmetric label noise.
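The two transition matrices of Figure 7 and the corresponding label corruption can be sketched directly from their definitions; the function names and the toy dataset are illustrative choices of this sketch:

```python
import numpy as np

def pairflip_T(n, delta):
    """Pairflip noise: class i flips to class i+1 (mod n) with probability delta."""
    T = (1 - delta) * np.eye(n)
    for i in range(n):
        T[i, (i + 1) % n] = delta
    return T

def symmetric_T(n, delta):
    """Symmetric noise: class i flips to each of the other n-1 classes
    with probability delta / (n - 1)."""
    T = np.full((n, n), delta / (n - 1))
    np.fill_diagonal(T, 1 - delta)
    return T

def corrupt_labels(y, T, rng):
    """Draw a noisy label y~ for each clean label y with P(y~ = j | y = i) = T[i, j]."""
    return np.array([rng.choice(len(T), p=T[yi]) for yi in y])

rng = np.random.default_rng(100)
T = symmetric_T(10, 0.5)
y = rng.integers(0, 10, 10000)
y_noisy = corrupt_labels(y, T, rng)
# With delta = 0.5, roughly half of the labels end up flipped.
```

Every row of either matrix sums to one, so each row is a valid conditional distribution over noisy labels, which is what makes drawing ỹ with `rng.choice` well defined.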
3.2.3 Baselines
We consider the following baselines to compare with our method.

• Clean only: use only the limited clean labeled data in training.

• Random: assign random weights generated from a rectified Gaussian distribution, β_i = max(0, s_i) / Σ_i max(0, s_i), where s_i ∼ N(0, 1). This is the same baseline as used in [14].

• Uniform: assign the same weight to all training examples.

• Reweight: proposed by [14].
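The "Random" and "Uniform" baselines are simple enough to sketch directly from their definitions; the batch size of 128 below is just an example value:

```python
import numpy as np

def random_weights(n, rng):
    """'Random' baseline: rectified-Gaussian weights
    beta_i = max(0, s_i) / sum_i max(0, s_i), with s_i ~ N(0, 1)."""
    s = rng.normal(0.0, 1.0, n)
    r = np.maximum(0.0, s)
    return r / r.sum()  # note: assumes at least one s_i > 0, which is
                        # essentially certain for any realistic batch size

def uniform_weights(n):
    """'Uniform' baseline: the same weight 1/n for every training example."""
    return np.full(n, 1.0 / n)

rng = np.random.default_rng(5)
beta = random_weights(128, rng)
```

Both baselines produce non-negative weights summing to one, matching the normalization used by the proposed reweighting approach, so the comparison isolates the effect of how the weights are chosen rather than how they are scaled.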
3.2.4 Experimental details
The experiments are implemented in Python 3.6.5 with PyTorch 0.4.0 and computed on the RAIDEN GPU cluster provided by RIKEN AIP.
As in [14], for a fair comparison, 1000 clean labeled data points are randomly selected in our experiments, and the baselines "Random" and "Uniform" are fine-tuned in the last 10 epochs using the 1000 clean labeled data. We consider the following experimental settings: symmetric label noise with corruption rates 0.2 and 0.5, and pairflip label noise with corruption rates 0.3 and 0.45, tested on the MNIST and CIFAR-10 datasets. We repeat the experiments under each setting 5 times on MNIST and 3 times on CIFAR-10.
For the MNIST experiments, we use SGD without momentum as the optimizer in our proposed approach, with weight decay 0.01 and learning rate 0.00001. For the best performance of the baseline "Reweight", we adopt SGD with momentum 0.9, weight decay 0.01 and learning rate 0.0001. The other baselines use SGD without momentum, weight decay 0.01 and learning rate 0.0001. For the experiments on CIFAR-10, the optimizer is SGD without momentum, with weight decay 0.0002. The initial learning rate is 0.05, decaying by a factor of 0.1 at epochs 180, 290 and 390, for a total of 400 epochs. All experiments on CIFAR-10 use this same setting. In both the MNIST and CIFAR-10 experiments, the batch size is 128 for both the noisy labeled training data and the clean labeled data. The kernel width is selected as the 1st quantile of the distances between examples. The random seed for all experiments is 100.
Note that we do not use any data augmentation technique in data preprocessing, unlike [14]. This is the exact reason why our reproduced results of Reweight are not as good as those reported in [14].
3.3 Performance evaluation
For performance evaluation, we first present the evaluation metric selected in this work. Second, we show how the evaluation procedure is conducted in the experiments. Third, success criteria are given to demonstrate how we judge the proposed approach to be successful and how the observed performance in the experiments leads to a conclusion.
3.3.1 Evaluation metrics
In this work, we target a balanced class classification problem. So the training objective is to minimize a 0-1 loss, defined as
l(ˆ y, y) = 1 {
arg max
c