
A Label-based Conditional Mutual Information Estimator using Contrastive Loss Functions

ZIWEI YE

KTH Royal Institute of Technology
Master in Information and Network Engineering
Date: September 22, 2020
Supervisor: Hanwei Wu
Examiner: Markus Flierl

Abstract

In the field of machine learning, representation learning is a collection of techniques that transform raw data into a form that can be effectively exploited by machine learning algorithms. In recent years, deep neural network-based representation learning has been widely used in image learning, recognition, classification and other fields, and one representative approach is the mutual information estimator/encoder. In this thesis, a new form of contrastive loss function that can be applied to existing mutual information encoder networks is proposed.

Deep learning-based representation learning is very different from traditional machine learning feature extraction algorithms. In general, the features obtained by feature extraction are surface-level and can be understood by humans, while representation learning learns the underlying structures of the data, which are easy for machines to understand but difficult for humans. Based on this difference, when the scale of the data is small, human prior knowledge of the data can play a big role, so feature extraction algorithms have a greater advantage; when the scale of the data increases, the contribution of prior knowledge declines sharply. At this point, the strong computing ability of deep learning is needed to make up for this deficiency, so representation learning performs better. The research in this thesis is mainly aimed at a more special situation, where the scale of the training data is small and the scale of the test data is large. In this case, two issues need to be considered: one is the distribution representation of the model, and the other is the overfitting problem of the model. The LMIE (label-based mutual information estimator) model proposed in this thesis has certain advantages regarding both issues.


Sammanfattning

In machine learning, representation learning is a collection of techniques that transform raw data into a form that can be effectively exploited by machine learning. In recent years, deep neural network-based representation learning technology has been used extensively for image learning, recognition, classification and other areas, and one of the representative approaches is the mutual information estimator/encoder. This thesis proposes a new form of contrastive loss function that can be applied to existing mutual information encoder networks.

Deep learning-based representation learning differs greatly from traditional machine learning feature extraction algorithms. In general, the features obtained through feature extraction are surface-level and can be understood by humans, while representation learning learns the underlying structures of the data, which are easy for machines to understand but difficult for humans. Based on the above differences, human prior knowledge of the data can play a large role when the scale of the data is small, so the feature extraction algorithm has a greater advantage; when the scale of the data increases, the contribution of prior knowledge decreases sharply. The strong computational ability of deep learning is then needed to compensate for this deficiency, which is why representation learning performs better. The research done in this thesis is mainly focused on a more special situation where the training data is small and the scale of the test data is large. In this case, two issues must be considered: one is the distribution representation of the model, and the other is the overfitting problem of the model. The LMIE model (label-based mutual information estimator) proposed in this thesis has certain advantages with regard to both issues.


Contents

1 Introduction
2 Representation Learning and Feature Extraction
2.1 Traditional Visual Feature Extraction
2.2 Representation Learning Algorithm based on Matrix Decomposition and Sparse Representation
2.3 Deep Learning-based Representation Learning
3 Generative Model
3.1 Variational Autoencoder
3.2 Generative Adversarial Network
4 Mutual Information-based Technology
4.1 Information Entropy
4.2 Mutual Information and Calculation
4.3 Label-based Conditional Mutual Information Estimator
5 Implementation
5.1 Architecture
5.1.1 Encoder
5.1.2 Loss Function Calculation
5.1.3 Linear Classifier
5.2 Training
5.2.1 Dataset
5.2.2 Data Augmentation
5.2.3 Implementation Details
6 Results
6.1 Evaluation Metrics
6.2 Model Performance
6.2.1 Binary Classification
6.2.2 Multiple Classification
6.2.3 Comparison between 4 Versions of LMIE
6.2.4 Relationship between Temperature and Accuracy
6.3 Result Analysis
7 Discussion
7.1 Conclusion
7.2 Future Work


1 Introduction

Images are an important part of media display, and the huge number of images brings great challenges to image management. In most situations, such as search engines or e-commerce product displays, only accurate identification and classification of image categories can give a better user experience. Image classification algorithms are usually based on the fact that similar objects in images should have similar color, texture, spatial information or other features. An algorithm mainly includes two stages, image feature extraction and classifier training, and the quality of the feature extractor directly affects the performance of the classifier.

The common image feature extraction method is to combine and transform the underlying visual features, such as color, texture and shape features, to construct image features. However, the underlying features of an image are susceptible to factors such as brightness, occlusion and deformation. Therefore, there is no guarantee that the same objects will have similar features. In other words, between the underlying visual features of the image and the semantic features in the higher layers there is a semantic gap. Therefore, the underlying features of the image are not suitable to be used as classification features.

At present, representation learning is widely used in image feature extraction. In the field of machine learning, representation learning is a set of techniques generally used to transform raw data into a new form that can be effectively recognized by a machine. The goal of representation learning is not to predict an observation by learning the original data, but to learn the underlying structure of the data so that other characteristics of the original data can be analyzed. Representation learning allows the machine to learn how to extract features, which can also be called "learning how to learn". In order to fulfill the feature extraction function, a neural network-powered encoder is usually used, which converts the image into latent feature vectors or matrices. A neural network is intended to simulate the neural network structure of the human brain and extract features from the data through multiple layers. Its structure is complex and can theoretically approximate arbitrarily complex nonlinear functions. There are multiple ways to determine the performance of the encoder, and this thesis mainly focuses on the mutual information-based way.

Mutual information (MI), also known as information gain, has been successfully used in the context of deep learning and deep reinforcement learning (e.g., VIME, EMI) to measure the statistical dependence between two representations. The main idea of mutual information-based representation learning is to discover useful representations, and one simple idea being explored is to train a representation-learning function, for example an encoder, to maximize the mutual information (MI) between its inputs and outputs. The relationship between two representations is shown in figure 1.1. It can be seen that the more correlated the two representations are, the higher the mutual information gain is.

Figure 1.1: a Venn Diagram of Mutual Information

However, maximizing the MI between the whole input and its representation may also capture dependence irrelevant to the task. For example, trivial noise local to pixels is useless for image classification, so a representation may not benefit from this information if the end goal is to classify. Moreover, since the capacity of the high-level representation is fixed, such irrelevant information may squeeze out some valuable information. Therefore, the input image can be divided into different patches, and the MI between the representation and each patch of the input image can be maximized. This is the so-called local MI, which can be used to reduce the above impact caused by irrelevant factors.

Previous research has shown that methods for learning low-dimensional representations, in which MI estimators play an important role, have been widely used in unsupervised learning. Deep InfoMax (DIM) is introduced in [2] and is claimed to outperform many popular unsupervised learning methods; the paper also demonstrates that structure matters: incorporating knowledge about the structure of the input data can significantly influence a representation's quality. A Mutual Information Neural Estimator, known as MINE, was presented in paper [1], with several applications where this method can be used to minimize or maximize MI. Besides, paper [3] proposed an approach named Contrastive Predictive Coding (CPC), which is based on the InfoNCE lower bound. The core model works by using an autoregressive model on extracted features to predict future features in the latent space.

However, most of the existing MI estimators require large data sets as training data. A huge amount of data can indeed help the estimator improve its accuracy. Nevertheless, in some special situations, such as when the data is difficult or very expensive to obtain, only a small amount of data can be used during training. In this case, it is hard for estimators designed around large data sets to guarantee their accuracy. The problem caused by a small-scale data set is not only a decline in prediction accuracy, but also overfitting, that is, the network parameters after training are too strict, resulting in a high training accuracy but a low test accuracy. To solve these problems caused by small training data sets, one possible way is to design a new loss function: essentially, the effectiveness of a deep neural network always needs to be judged through the loss function.

The estimator proposed in this thesis therefore focuses on dependencies between variables in conditional settings. The label-based latent variable model has the objective to maximize the mutual information between the latent variables and the label information conditioned on the encoded observed variables.


2 Representation Learning and Feature Extraction

Image representation learning refers to the representation and description of image data. The quality of representation learning is crucial to an image recognition system because it directly affects the final recognition result. Common image representation learning includes image pre-processing and feature learning. Specifically, image pre-processing refers to processing the image data itself by some means to obtain a higher-quality image that is clearer or more suitable for the requirements of an algorithm; feature learning is also called feature extraction, which means extracting the key valid information from the original data as image features and using the features for the downstream recognition task. In other words, as long as a sufficiently discriminative image representation is obtained, even a simple data classifier can obtain the best recognition performance, which is the ultimate goal of image representation learning [4]. Due to the large number and variety of algorithms involved in representation learning, this chapter only summarizes a few algorithms that are closely related to this thesis and the current research progress.

2.1 Traditional Visual Feature Extraction

Prior to 2003, the features used in most traditional image recognition methods were expressed as global features constructed using texture and color information, such as color histograms, color correlation maps, GIST [5], Edgel [6], etc. Due to the limited ability of such global features to represent image content, the image recognition methods at this stage were only suitable for copy image retrieval, which refers to the situation where the query image is basically consistent with the target image.

After 2003, the mainstream approach to image feature extraction began to slowly shift to manually designed visual features such as the color histogram, the color correlation map [7], the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT) [8], etc. These features often require an elaborate design to represent the image well, but achieved impressive representation capabilities. In particular, the proposal of the local scale-invariant feature "SIFT" brought great progress in feature extraction technology. Since SIFT can detect local regions with significant visual characteristics in the image and generate feature descriptions with very stable description capabilities for these regions, it is very suitable for constructing visual words to describe images. In this stage of the research framework, most of the research focused on topics such as how to effectively generate visual dictionaries (such as using k-means, hierarchical k-means [9], approximate k-means [10] and other clustering methods), how to aggregate the expression of local features in an image (using an inverted index directly, or using the Fisher Vector [11], VLAD [12] or other methods), how to use the spatial context of local features for correlation verification [13, 14], and how to hash or quantize features (such as the use of Hamming code embedding [15], product quantization [16], scalar quantization [17] or other technologies). Because local visual features like SIFT have better resistance to translation, zoom, rotation and other deformations, the major visual retrieval methods in this period turned from copy image retrieval to partial-copy image retrieval, similar image retrieval, instance retrieval, etc.


2.2 Representation Learning Algorithm based on Matrix Decomposition and Sparse Representation

The matrix decomposition-based representation learning algorithm realizes the transformation of data from the original high-dimensional feature space to a new low-dimensional feature space by learning a transformation matrix and using it to map the original data. The main purpose is to obtain a more effective representation of the data, to improve the data separability, and to improve the generalization performance of the algorithm. The purpose of matrix decomposition is to decompose the original data feature matrix into the product of multiple matrices to obtain data representations in different spaces, thus obtaining a better feature representation. Matrix decomposition is widely used and studied. The major focus of the research is the decomposition method, and different algorithms are obtained according to the decomposition method. The existing classical representation learning algorithms based on matrix decomposition mainly include principal component analysis (PCA), singular value decomposition (SVD), non-negative matrix factorization (NMF), linear discriminant analysis (LDA), etc. [18]

Multilinear principal component analysis (MPCA) extends PCA to tensor data by iteratively performing PCA in each mode of the tensor. MPCA has been used for facial recognition, gait recognition, etc. Subsequently, MPCA was further extended to uncorrelated MPCA, non-negative MPCA, and robust MPCA. Paper [22] proposed a convex sparse principal component analysis (CSPCA) algorithm, which adds the latest sparsity and robust PCA techniques to a collaborative framework so that they promote each other. CSPCA is the first convex sparse and robust PCA algorithm, which theoretically ensures that the algorithm reaches the global optimum.

NMF is a representation algorithm based on local characteristics. It decomposes the data into a non-negative basis matrix (feature matrix) and a non-negative coefficient matrix (weight matrix). The basis matrix represents local characteristics in the data (for example, the nose in a face image), and the coefficient matrix indicates the degree to which the local characteristics reconstruct the data [23]. NMF achieves dimensionality reduction by projecting data along the non-negative basis matrix that retains local characteristics in the data. Negative values are meaningless in some practical problems; for example, there can be no pixels with negative values in image data. Therefore, the matrices obtained by the constrained decomposition of the standard NMF algorithm are non-negative. NMF extracts the local features of the original data, and each sample is represented by an additive combination of multiple local features. NMF has a wide range of application fields, such as image analysis, text clustering, speech processing, biomedical engineering, and chemical engineering. Among them, the most successful applications are in image analysis and processing for face recognition [24, 25, 26]. In order to improve the convergence speed and recognition performance of NMF, two different error metrics have emerged: Euclidean distance and KL divergence. There are many extension studies based on NMF, most of which add new restrictions on the basis of the standard NMF algorithm, such as the sparsity-enhanced NMF algorithm, the discriminant NMF algorithm, the weighted NMF algorithm, and the local non-negative matrix factorization (LNMF) algorithm widely used in face recognition. Paper [27] proposed a double-graph sparse non-negative matrix factorization (DSNMF) based on feature selection, combining the double-graph model with NMF, which can preserve the geometric structure of the data space and the feature space at the same time, and whose non-negative matrix factors can be updated iteratively and interactively to fully tap the potential information in the data.


2.3 Deep Learning-based Representation Learning

The auto encoder (AE) is similar to the RBM in that it includes the two processes of encoding and decoding, and is mainly used for data dimensionality reduction or feature extraction. It is a three-layer neural network composed of an input layer, a hidden layer (encoding layer) and an output layer (decoding layer), and in the best case its output should be the same as its input. The auto encoder is unsupervised learning: it uses the back-propagation algorithm to train the network weights, making the target value equal to the input value, so that the hidden layer learns good features of the input data. Subsequently, the stacked auto encoder (SAE) deep structure was proposed [32], which has a stronger representation learning ability. By adding constraints to the construction of the auto encoder, new learning methods are obtained, such as the sparse auto encoder (Sparse AE) [33], in which an L1 regularization term is added on the basis of the auto encoder; the sparse representation is often more effective than other representations. The denoising auto encoder (DAE) [34] is a simple variant of the auto encoder, which adds random noise to the training data and reconstructs the noise-free original data from the noisy data. The DAE is trained to eliminate this noise and recover the raw data that has not been corrupted by the noise. Therefore, the denoising auto encoder can learn a more robust feature representation of the input data, which is why its generalization performance is stronger than that of other encoders.

Such models have also been applied to tasks such as sentence generation [38] and multi-label document classification [39].

The Convolutional Neural Network (CNN) is currently the most widely used deep representation learning method in the field of computer vision. CNN is essentially a special feed-forward neural network. The earliest CNN model was successfully applied to the problem of handwritten character recognition [40]. Due to various reasons, CNN's development stalled for a long time and gradually faded out of view. It wasn't until 2012 that the CNN model regained attention and gradually became favored by academia, because of the amazing results achieved by Krizhevsky and others using the newly invented AlexNet in the ImageNet image classification competition [41]. The structure of the AlexNet model is shown in figure 2.1.

Figure 2.1: AlexNet Structure [41]

A typical CNN model is mainly composed of layers such as the convolutional layer, pooling layer, fully connected layer and loss function layer. By combining different types of network layers of different sizes, researchers can design a variety of CNN models according to the needs of the task.

3 Generative Model

The task of supervised learning is to learn a model and apply this model to predict the corresponding output for a given input. The general form of this model is the decision function Z = f(X) or the conditional probability distribution P(Z|X). According to the learning direction, it can be divided into two types: the generative model and the discriminant model. The generative model learns the joint probability distribution P(X, Z) from the training data and then obtains the conditional probability distribution P(Z|X) as the prediction model. Its most basic principle in probability theory is

P(Z|X) = P(X, Z) / P(X)    (3.1)

The Bayesian formula is shown as formula 3.2.

P(X, Z) = P(X|Z)P(Z) = P(Z|X)P(X)    (3.2)

According to the Bayesian formula, the generative model can also be expressed as formula 3.3.

P(Z|X) = P(X|Z)P(Z) / P(X)    (3.3)

In formula 3.3, P(X|Z) is the posterior probability of X and P(Z) is the prior probability. In fact, the conditional probability P(Z|X) is also a posterior probability; the conditional probability and the posterior probability have similar meanings but are expressed differently. Specifically, P(Z|X) is the conditional probability of Z after X occurs, and is also called the posterior probability of Z obtained from the value of X; P(X|Z) is the conditional probability of X when Z is known, and is also called the posterior probability of X obtained from the value of Z. Therefore, the conditional probability to be solved in the supervised model is actually the conditional probability of the output Z after the input X occurs. When using the formula P(Z|X) = P(X|Z)P(Z)/P(X), the first quantity to be solved is P(X|Z), so modelling P(X|Z) can be understood as a generative model (the probability distributions P(Z) and P(X) need to be known at the same time), while modelling P(Z|X) directly gives a discriminant model. The discriminant model focuses on the relationship between X and Z, or the law or distribution of Z given a certain X; the generative model attempts to describe the joint distribution of X and Z. The generative method learns the joint distribution P(X, Z), so it can represent the distribution of the data from a statistical point of view, which can reflect the similarity of similar data, but it does not care where the classification boundary that divides each class lies [42]. The conditional distribution can be derived from the joint probability distribution P(X, Z) in the generative method, but not in the discriminant method. The generative method converges faster, that is, when the sample size increases, the learned model converges to the real model faster. When there are hidden variables, the generative method can still be used for learning, but the discriminant method cannot.

There were many generative models before deep learning, but because generative models are difficult to describe and model, researchers encountered many challenges. The emergence of deep learning has helped solve many of these problems. At present, the two fastest-growing generative models are the variational autoencoder (VAE) and the generative adversarial network (GAN).

3.1 Variational Autoencoder

The variational autoencoder (VAE) [43] is a generative model proposed by Diederik P. Kingma and Max Welling in 2013. It is an unsupervised learning model that can be used for image classification, dimensionality reduction, visualization, and so on.

In VAE, the generation process z → x is represented by p_θ(x|z); from the perspective of the autoencoder, this is the decoder. The process x → z is the recognition model, with q_φ(z|x) representing its probability distribution, similar to the encoder of the autoencoder; therefore it is called a variational autoencoder. In fact, however, it has little to do with the mathematical basis of the autoencoder. The variational autoencoder generates two vectors, a mean (µ) and a standard deviation (σ), each time, while the autoencoder generates a single latent vector z. The main difference between the two can be seen more clearly in figures 3.2 and 3.3.

Figure 3.1: the graph model of VAE; solid lines represent the generation model p_θ(x|z), while dashed lines represent q_φ(z|x), which approximates the posterior probability p_θ(z|x)

Figure 3.2: AE model

The sampled latent variable z is the parameter of the conditional distribution, which is taken to be Gaussian, as shown in formula 3.4.

Figure 3.3: VAE model

The loss function of the variational autoencoder is composed of the difference between the generated image data and the real image data, namely the reconstruction loss, and the divergence between the posterior probability density produced by the encoder and the prior probability density of the latent variable. Therefore, in the variational autoencoder, the entire network is continuously adjusted by using the gradient descent method to optimize the loss function, that is, adjusting θ and φ to reduce the loss. The reconstruction loss can be calculated using cross-entropy. Whether the distribution produced by the encoder is close to the standard distribution is calculated using the Kullback-Leibler divergence [44]. The KL divergence is a distance measure between two probability distributions: the closer the two distributions are, the smaller the KL divergence, and vice versa.

Given a dataset X = {x^{(i)}}, i = 1, 2, ..., N. From the graph model of VAE, x^{(i)} is generated by first sampling z^{(i)} ∼ p_θ(z) and then x^{(i)} ∼ p_θ(x|z). The marginal log-likelihood is composed of the sum of the marginal log-likelihoods of the individual data points, $\log p_\theta(x^{(1)}, x^{(2)}, \dots, x^{(N)}) = \sum_{i=1}^{N} \log p_\theta(x^{(i)})$. Every single data point can be expressed as formula 3.5.

$$\log p_\theta(x^{(i)}) = D_{KL}\big(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z|x^{(i)})\big) + \mathcal{L}(\theta, \phi; x^{(i)}) \qquad (3.5)$$

The first term on the right side of formula 3.5 is the KL divergence between the approximate and the true posterior probability, and the KL divergence is non-negative, so the variational lower bound can also be written as formula 3.6.

$$\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x^{(i)})}\big[\log p_\theta(x^{(i)}|z)\big] \qquad (3.6)$$

The purpose of the next step is to make the lower bound computable by selecting suitable distributions. For this purpose, z is first reparameterized by introducing a differentiable transform g_φ(ε, x) of a noise random variable ε. The definition is given in formula 3.7.

$$\tilde{z} = g_\phi(\epsilon, x), \quad \epsilon \sim p(\epsilon) \qquad (3.7)$$

Specifically, when applying VAE, formula 3.7 is rewritten in the form of formula 3.8.

$$z^{(i,l)} = g_\phi(\epsilon^{(l)}, x^{(i)}) = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}, \quad \epsilon^{(l)} \sim \mathcal{N}(0, I) \qquad (3.8)$$

where l stands for the l-th sample of the noise ε, i stands for the i-th data point, and ⊙ is the element-wise product operator. µ^{(i)} and σ^{(i)} are the outputs of the nonlinear mapping (encoder) applied to x^{(i)}. According to formula 3.8, the following formula can be derived.

$$q_\phi(z|x^{(i)}) = \mathcal{N}\big(z;\, \mu^{(i)}, (\sigma^{(i)})^2 I\big) \qquad (3.9)$$

At the same time, suppose the prior of the latent variable z is a multivariate Gaussian distribution p_θ(z) = N(z; 0, I). In this way, the first term on the right side of formula 3.6 can be derived as formula 3.10.

$$-D_{KL}\big(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)\big) = \frac{1}{2}\sum_{j=1}^{J}\Big(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\Big) \qquad (3.10)$$

Then the stochastic gradient descent method can easily be used to maximize the lower bound, which is called auto-encoding variational Bayes (AEVB).

$$\mathcal{L}(\theta, \phi; x^{(i)}) \simeq \frac{1}{2}\sum_{j=1}^{J}\Big(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\Big) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta\big(x^{(i)}|z^{(i,l)}\big) \qquad (3.11)$$

where $z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}$ and $\epsilon^{(l)} \sim \mathcal{N}(0, I)$.
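To make the lower bound in formulas 3.10 and 3.11 concrete, the sketch below shows how the AEVB objective can be written in PyTorch, assuming a hypothetical encoder that outputs mu and log_var and a decoder that returns Bernoulli logits for inputs scaled to [0, 1]; it is an illustrative single-sample (L = 1) estimate, not the implementation used later in this thesis.

import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # Formula 3.8: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def negative_elbo(x, x_logits, mu, log_var):
    # Reconstruction term E_q[log p_theta(x|z)], estimated with one sample of z.
    recon = -F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # Closed-form -KL(q_phi(z|x) || N(0, I)) from formula 3.10.
    neg_kl = 0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp())
    return -(recon + neg_kl)  # minimize the negative lower bound of formula 3.11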

The authors of [45] proposed a conditional variational autoencoder model (CVAE). The input of its encoder includes not only the original data X = {x_i}, i = 1, ..., N, but also part of the labels Y = {y_i}, i = 1, ..., N, to control the generation of samples of specific classes. Although CVAE turns the traditional VAE into semi-supervised learning, it does not solve well the shortcoming of generating ambiguous samples, and there is large room for optimization. Louizos et al. [46] proposed another derivative form of the semi-supervised variational autoencoder model in 2015, the variational fair autoencoder (VFAE). Its purpose is to separate the noise factor from the hidden variable information, so that the model can learn the representation of the invariant factors more clearly. In VFAE, the maximum mean discrepancy (MMD) is added as a regularization term, penalizing the posterior distribution q_φ(z|s) of the latent variable Z obtained from the noise S, which weakens the dependence between the hidden variable Z and the noise factor S. Another variant structure is the importance weighted autoencoder (IWAE) [47] proposed by Burda et al. in 2015, whose purpose is to improve the poor generalization ability of the traditional VAE. Compared with VAE, IWAE has a similar structure, and the optimization goal is also to maximize the variational lower bound of the model. However, IWAE hopes that by increasing the number of samples of the hidden variable z_i corresponding to the sample x_i, its variational lower bound will be tighter and closer to the true log-likelihood, so as to improve the adaptability of the generation network to different distribution forms, while also having good generalization ability for distribution forms that do not meet the VAE assumptions.

3.2 Generative Adversarial Network

A generative adversarial network consists of two parts, a generative model and a discriminant model; the discriminant model performs the data classification. The generative model adjusts the generative network according to the classification results fed back by the discriminant model, in order to obtain the generative network parameters with the smallest loss.

The basic generative adversarial network uses two multi-layer perceptrons that supervise and train each other, namely the generation model G and the discriminant model D. The generation model G samples noise and generates virtual data conforming to the data distribution. The discriminant model judges between the output of the generation model G and the real data, and outputs the probability that a sample comes from the real data distribution. During training, the discriminant model adjusts its neural network parameters based on back-propagation, and the generative model adjusts its neural network parameters based on the prediction results of the discriminant model, until the overall objective function reaches the optimal condition and training ends. Therefore, the training and generation of the entire network can move from unsupervised training to supervised training. The basic structure of a generative adversarial network is shown in figure 3.4.

Figure 3.4: GAN structure

The generative model can continuously adjust its ability to reconstruct samples according to the results of the discriminant model. The generation model G is a multi-layer perceptron with parameters θ_g; the noise vector z is drawn from the distribution p_noise(z), and the mapping of this noise into the data space P_data is called G(z; θ_g). Training the generative model amounts to minimizing log(1 − D(G(z))). At the same time, a second multi-layer perceptron D(x; θ_d) is built, which outputs the probability that a sample comes from the real data; for generated samples this expression tends to 0. Both the generative model and the discriminant model are trained using convolutional neural networks, and the parameters are adjusted using the back-propagation algorithm. The adjustment direction of the generative model is that the probability that a generated sample passes the discriminant model tends to 1, and the parameter adjustment direction of the discriminant model is to judge the probability of an image from the generative model to be close to 0. What the discriminant model achieves is the maximization of log(1 − D(G(z))). Therefore, the generative model and the discriminant model will converge to a common point. The optimization formula for the generative adversarial network is shown in formula 3.12.

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (3.12)$$

As training of the generative adversarial network proceeds, the generative model approaches the global optimal solution p_g = p_data, and the model can be shown to converge to this point. In this case, when p_g(x) = p_data(x), the discriminant model cannot distinguish between the two. At this time, the model as a whole reaches the optimal value, and the optimal discriminator output is D(x) = 1/2.
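As an illustration of how the value function in formula 3.12 is optimized by alternating updates, the following PyTorch sketch assumes hypothetical networks G and D, where D ends with a sigmoid and outputs a (batch, 1) probability; the generator update uses the common non-saturating variant rather than minimizing log(1 − D(G(z))) directly.

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, noise_dim=100):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)
    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    fake = G(torch.randn(batch, noise_dim)).detach()
    d_loss = F.binary_cross_entropy(D(real), ones) + F.binary_cross_entropy(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: push D(G(z)) towards 1 (non-saturating form of formula 3.12).
    g_loss = F.binary_cross_entropy(D(G(torch.randn(batch, noise_dim))), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()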

On the basis of ordinary GANs, the Conditional Generative Adversarial Nets (CGAN) proposed the use of labels to assist the training of the generative model [49], and can obtain stable image output; subsequently, the Wasserstein Generative Adversarial Nets (WGAN) were proposed to solve the single loss function problem in the adversarial generative model [50]. The generative adversarial network can then be trained with a simpler function without being limited to the cross-entropy loss function.

4 Mutual Information-based Technology

4.1 Information Entropy

As early as the 1940s, Shannon put forward the concepts of information theory [53] in the field of communication, laying the foundation of information theory. Since then, information theory has brought new ideas to scientific and technical issues in many fields, such as applied mathematics, electrical engineering, bioinformatics, statistics, and computer science. In information theory, entropy is a very important concept. It was originally used in thermodynamics; later, this concept was introduced into information theory by Shannon. Entropy refers to the number of bits required to describe a random variable [54] and can be regarded as a measure of the average uncertainty of a random variable. Considering only random variables with discrete values, and assuming that the discrete random variable X takes values in a discrete alphabet χ with p(x) = Pr{X = x}, x ∈ χ, the entropy of X is defined as

$$H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x) \qquad (4.1)$$

It can be seen from the definition that the size of the entropy does not depend on the actual values of the samples, but on the probability distribution of the random variable. Entropy can be seen as a measure of the average uncertainty of a random variable; in the average sense, it describes the number of bits required by the random variable. Similarly, the conditional entropy is the entropy of a random variable given another random variable. Assuming that the random variable Z is known and takes values in the discrete alphabet γ, the conditional entropy is expressed as

$$H(X|Z) = \sum_{z \in \gamma} p(z) H(X|Z=z) = -\sum_{x \in \chi} \sum_{z \in \gamma} p(x, z) \log p(x|z) \qquad (4.2)$$

In formula 4.2, p(x|z) represents the posterior probability of X under the condition that Z is known. It can be seen from the definition that if X completely depends on Z, then H(X|Z) is zero. This means that given Z, no additional information is needed to describe X. On the contrary, H(X|Z) = H(X) indicates that Z does not make any contribution to the information concerning X.

When X and Z are continuous variables, formulas 4.1 and 4.2 change to 4.3 and 4.4.

$$H(X) = -\int_x p(x) \log_2 p(x) \, dx \qquad (4.3)$$

$$H(X|Z) = -\int_x \int_z p(x, z) \log p(x|z) \, dz \, dx \qquad (4.4)$$

The joint entropy of two random variables X and Z can be defined as

$$H(X, Z) = H(X) + H(Z|X) = H(Z) + H(X|Z) \qquad (4.5)$$
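As a small worked example of formulas 4.1, 4.2 and 4.5, the NumPy snippet below computes these quantities from a made-up 2 × 2 joint probability table; the numbers are purely illustrative.

import numpy as np

# Hypothetical joint distribution p(x, z) over binary X and Z.
p_xz = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xz.sum(axis=1)   # marginal p(x)
p_z = p_xz.sum(axis=0)   # marginal p(z)

H_X = -np.sum(p_x * np.log2(p_x))        # formula 4.1
H_Z = -np.sum(p_z * np.log2(p_z))
H_XZ = -np.sum(p_xz * np.log2(p_xz))     # joint entropy
H_X_given_Z = H_XZ - H_Z                 # formula 4.5 rearranged, equals formula 4.2
print(H_X, H_X_given_Z, H_XZ)            # 1.0, ~0.72, ~1.72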

4.2 Mutual Information and Calculation

Mutual information is a measure of the correlation between random variables based on Shannon entropy. Because the main research objects of this thesis are continuous data, unless otherwise specified, the variables appearing in the rest of this chapter are continuous variables. The mutual information between X and Z can be understood as the reduction of uncertainty in X given Z. The mutual information of X and Z is defined as

$$I(X; Z) = \int_x \int_z p(x, z) \log\frac{p(x, z)}{p(x)\,p(z)} \, dz \, dx \qquad (4.6)$$

Mutual information measures how far X and Z are from being independent of each other. It is symmetric in X and Z and has a non-negative value. The value is zero if and only if X and Z are independent of each other. When the correlation between X and Z is high, the value of the mutual information I(X; Z) is large. Mutual information is also a measure of the amount of information that one random variable contains about another random variable: it is the reduction in the uncertainty of the original random variable caused by knowing the other random variable.

Mutual information has two important properties that distinguish it from other dependency measures. First, mutual information can measure any two random variables, regardless of whether they are independent, dependent, or otherwise. This property is rooted in the fact that mutual information is calculated from the joint probability density function and the marginal probability density functions, without using any hierarchical or sequential statistical information; that is to say, mutual information is symmetric. Second, mutual information is invariant under spatial transformations. This is because the calculation of mutual information is composed of logarithms of ratios; therefore, mutual information is a dimensionless metric, and its value does not change with a change of the coordinate system. This property also ensures that mutual information is preserved under invertible transformations.

For continuous variables, the situation is more complicated. Most mutual information estimation algorithms are based on the mathematical definition of mutual information. From formula 4.6, it can be seen that the value of I(X; Z) mainly depends on the similarity between the joint distribution p(x, z) and the product of the marginal distributions p(x)p(z). In this case, mutual information has the following more general form.

$$I(X; Z) = \int_{\mathcal{X} \times \mathcal{Z}} \log\frac{dP_{XZ}}{dP_X \otimes dP_Z} \, dP_{XZ} \qquad (4.7)$$

where $P_{XZ}$ is the joint probability distribution of X and Z, and $P_X = \int_{\mathcal{Z}} dP_{XZ}$ and $P_Z = \int_{\mathcal{X}} dP_{XZ}$ are the marginal distributions.

Based on the fundamental knowledge above, there is another statement of mutual information: the mutual information between X and Z is equivalent to the Kullback-Leibler (KL-) divergence between the joint distribution $P_{XZ}$ and the product of the marginals $P_X \otimes P_Z$ [1]. The formula is shown below.

$$I(X; Z) = D_{KL}(P_{XZ} \,\|\, P_X \otimes P_Z) := \mathbb{E}_{P_{XZ}}\Big[\log\frac{dP_{XZ}}{dP_X \otimes dP_Z}\Big] \qquad (4.8)$$

In practical applications, the KL-divergence is often expressed in another equivalent form, which is called the "Donsker-Varadhan (DV-) representation" [57]. The KL-divergence admits the following dual representation:

$$D_{KL}(P_{XZ} \,\|\, P_X \otimes P_Z) = \sup_{T:\Omega \to \mathbb{R}} \mathbb{E}_{P_{XZ}}[T] - \log\big(\mathbb{E}_{P_X \otimes P_Z}[e^{T}]\big) \qquad (4.9)$$

where T is a function that can be represented either by equations or by a parameterized neural network. The supremum is taken over all functions T for which the two expectations are finite.

In practice it is impossible to find the best function among all functions T. However, if the best one among a set of functions of a similar form can be found, a clear lower bound for the above expression can still be obtained. Let F be any subset of functions that satisfies the constraints of the DV-representation; the lower bound is expressed as

$$I(X; Z) \geq \sup_{T \in \mathcal{F}} \mathbb{E}_{P_{XZ}}[T] - \log\big(\mathbb{E}_{P_X \otimes P_Z}[e^{T}]\big) \qquad (4.10)$$

In this way, the problem of estimating the maximal mutual information is transformed into the problem of training a network to maximize a lower bound on the mutual information. Paper [3] applied this theorem and came up with a lower bound on MI called "InfoNCE", which is expressed as

$$\widehat{I}^{\,(InfoNCE)}_{\theta,\psi}\big(X; E_\psi(X)\big) := \mathbb{E}_X\Big[T_{\theta,\psi}\big(x, E_\psi(x)\big) - \mathbb{E}_{\tilde{X}}\Big[\log \sum_{x'} e^{T_{\theta,\psi}(x', E_\psi(x))}\Big]\Big] \qquad (4.11)$$

In equation 4.11, Z has been replaced by $E_\psi(x)$. In most cases the aim of a mutual information estimator is to maximize the mutual information between the input data X and the latent representation of the input data, which is obtained from a network with parameters ψ; therefore, $E_\psi(x)$ stands for this latent representation. $T_{\theta,\psi}$ is the overall critic function that contains the two trained networks. x is an input sample from the data set X, and x' is an input sample from another set X̃ = X. The purpose of this processing is to obtain two independent distributions in the calculation of the marginal distributions.
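In practice, bounds of this family are often implemented as a softmax-style classification loss over a batch, scoring each matching pair (x, E_ψ(x)) against the mismatched pairs in the same batch. The PyTorch sketch below illustrates this with a simple bilinear critic; the critic form and the dimensions are assumptions for illustration, not the networks used in this thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearCritic(nn.Module):
    """Critic T(x_feat, z) = x_feat^T W z used inside an InfoNCE-style bound."""
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(x_dim, z_dim))

    def forward(self, x_feat, z):
        # scores[i, j] = T(x_feat_i, z_j); the diagonal holds the matching (positive) pairs.
        return x_feat @ self.W @ z.t()

def infonce_loss(critic, x_feat, z):
    scores = critic(x_feat, z)               # (batch, batch) score matrix
    labels = torch.arange(scores.size(0))    # the positive pair of row i is column i
    # Cross-entropy over each row implements the log-softmax contrast of equation 4.11.
    return F.cross_entropy(scores, labels)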

In addition to the methods described above, non-KL divergences can also be used. For example, a Jensen-Shannon divergence (JSD) lower bound [58] for MI estimation is defined as

$$\widehat{I}^{\,(JSD)}_{\theta,\psi}\big(X; E_\psi(X)\big) := \mathbb{E}_X\big[-\mathrm{sp}\big(-T_{\theta,\psi}(x, E_\psi(x))\big)\big] - \mathbb{E}_{X\tilde{X}}\big[\mathrm{sp}\big(T_{\theta,\psi}(x', E_\psi(x))\big)\big] \qquad (4.12)$$

The sp function in equation 4.12 is the softplus, sp(z) = log(1 + e^z). Theoretically speaking, a JSD-based estimator should behave in a similar manner to DV-based estimators like InfoNCE, because both lower bounds try to maximize the expected log ratio of the joint distribution over the product of the marginals.

4.3 Label-based Conditional Mutual Information Estimator

The objective of the label-based conditional mutual information estimator studied in this thesis is to maximize the mutual information between the latent variables and the label information conditioned on the encoded observed variables, instead of the mutual information between the latent variables and the observed data.

Consider a latent variable model where X ∈ χ denotes the observed variable, Z is the latent variable, either continuous or discrete, and Y is the label. First, a neural network encoder is used to map the observed variable into the latent space, g_θ(X) = C, where θ represents the parameters of the neural network. In this case a Markov chain can be created as C ↔ X ↔ Y ↔ Z. In order to optimize the parameters of the encoder, the mutual information between the latent variables and the label information conditioned on the encoded observed variables should be maximized, as shown in formula 4.13.

$$\arg\max_{\theta} \; I\big(Y; Z \,|\, g_\theta(X)\big) \qquad (4.13)$$

According to the data processing inequality and the D-separation principle, there is

$$I(Y; X, Z) \geq I\big(Y; g_\theta(X), Z\big) = I(Y; C, Z) \qquad (4.14)$$

Meanwhile, according to the chain rule of mutual information, the conditional mutual information I(Y; Z|X) can be expressed as

$$I(Y; Z|X) = I(Y; Z, X) - I(Y; X) = I(Y; Z, X) - \log K \qquad (4.15)$$

K in formula 4.15 is the number of labels. When the labels are assumed to have a uniform prior distribution, I(Y; X) = log K. In this case, the objective in formula 4.13 can be seen as a lower bound on the mutual information between the latent variables and the label information conditioned on the observed variables, as shown in formula 4.16.

$$I(Y; Z|C) = I(Y; C, Z) - \log K \leq I(Y; Z|X) \qquad (4.16)$$

The expression I(Y; C, Z) is the mutual information between Y and the pair (C, Z) for discrete distributions, as defined in [53].

$$I(Y; C, Z) = \mathbb{E}_{p(c,z,y)}\Big[\log\frac{p(c, z, y)}{p(c, z)\,p(y)}\Big] = \mathbb{E}_{p(c,z,y)}\Big[\log\frac{p(c, z|y)}{p(c, z)}\Big] \qquad (4.17)$$

This quantity is difficult to compute analytically for high-dimensional vectors. However, a lower bound on it can be derived with the help of tractable critic functions that approximate the density ratio $\log\frac{p(c,z|y)}{p(c,z)}$ [59].

Ideally, it is expected that the encoder learns a mapping from the observed variable space to the latent space such that data points from the same class are represented close together in the latent space. On the contrary, latent representations belonging to different classes of the observed variable should be dissimilar to each other. In order to fulfill this requirement, as well as to take the logarithmic form in formula 4.17 into account, an exponential function with a similarity measure of c and z as its argument is chosen as the critic function: $f(z, c) = e^{d(z,c)} \approx \frac{p(c,z|y)}{p(c,z)}$, where $d(z, c) = \max\big(\frac{\langle c, z\rangle}{\|c\| \|z\|}, 0\big)$. The normalized inner product is used so that the dimension of the latent vectors does not have much impact on the critic function. A more intuitive explanation of this critic function is: if z and c are from the same conditional distribution p(·|y), that is to say z is in the same class as c, then the output of the function will be much larger than 1. On the contrary, if z is drawn from a class different from that of c, then the output of the critic function will be close to 1. In this way, the critic function enforces a similarity-preserving mapping at the encoder. [59]

Using this critic function, a lower bound on the original objective function 4.13 can be derived as 4.18.

$$
\begin{aligned}
I(Y; C, Z) - \log 2 &= \mathbb{E}_{p(c,z,y)}\Big[\log\frac{p(c,z|y)}{2\,p(c,z)}\Big] = -\mathbb{E}_{p(c,z,y)}\Big[\log\frac{2\,p(c,z)}{p(c,z|y)}\Big] \\
&\geq -\mathbb{E}_{p(c,z,y)}\Big[\log\Big(\frac{p(c,z)}{p(c,z|y)} + 1\Big)\Big] \\
&\approx -\mathbb{E}_{p(c,z,y)}\Big[\log\Big(\frac{p(c,z)}{p(c|y)\,p(z|y)}\cdot\frac{p(c,\bar{z}|y)}{p(c,\bar{z})} + 1\Big)\Big] \\
&= \mathbb{E}_{p(c,z,y)}\Big[\log\frac{1}{\frac{p(c,z)}{p(c|y)p(z|y)}\frac{p(c,\bar{z}|y)}{p(c,\bar{z})} + 1}\Big] \\
&= \mathbb{E}_{p(c,z,y)}\Big[\log\frac{\frac{p(c|y)p(z|y)}{p(c,z)}}{\frac{p(c|y)p(z|y)}{p(c,z)} + \frac{p(c,\bar{z}|y)}{p(c,\bar{z})}}\Big] \\
&\approx \mathbb{E}_{p(c,z,y)}\Big[\log\frac{f(c,z)}{f(c,z) + f(c,\bar{z})}\Big] = \mathbb{E}_{p(c,z,y)}\Big[\log\frac{e^{d(c,z)}}{e^{d(c,z)} + e^{d(c,\bar{z})}}\Big]
\end{aligned}
\qquad (4.18)
$$

5 Implementation

5.1 Architecture

The label-based conditional mutual information estimator introduced in this thesis mainly contains three parts: an encoder, a loss function calculation module, and a linear classifier. The encoder applied in this thesis is basically a convolutional neural network; it is used for mapping observed variables (images) into latent vectors. The construction of the encoder is explained in detail in the following section. The latent vectors, as well as the label variables, are fed into the loss function calculation module in order to obtain the loss used by the encoder for back-propagation. For the model training process, only these two parts are needed. The linear classifier is used for model testing. In this thesis, the classifier is mainly based on the "logistic regression" algorithm. The classifier is fed with the latent vectors output by the encoder, and outputs a predicted label for every vector, which is compared to the real labels to determine the performance of the whole model. The schematic diagram of the whole model is shown in figure 5.1.
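To summarize the data flow described above, the sketch below (PyTorch plus scikit-learn, with the encoder, loss module and data objects as placeholders) shows how the three parts interact during training and testing; it illustrates the pipeline rather than reproducing the exact training script.

import torch
from sklearn.linear_model import LogisticRegression

def train_encoder(encoder, loss_module, optimizer, loader, epochs):
    for _ in range(epochs):
        for images, labels in loader:
            latents = encoder(images)            # map images to latent vectors
            loss = loss_module(latents, labels)  # label-based contrastive loss (section 5.1.2)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def evaluate(encoder, train_images, train_labels, test_images, test_labels):
    # Freeze the encoder and fit a simple linear classifier on the latent vectors.
    with torch.no_grad():
        z_train = encoder(train_images).numpy()
        z_test = encoder(test_images).numpy()
    clf = LogisticRegression()
    clf.fit(z_train, train_labels)
    return clf.score(z_test, test_labels)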

5.1.1 Encoder

The encoder networks used in this thesis are not existing network models, but several self-built networks based on the mutual information estimator introduced in paper [2]. There are four encoder networks in this thesis; for convenience, they will be named LMIE (Label-based Mutual Information Estimator) v1-v4 in the following text. They incorporate some advanced features, such as: (a) an Inception structure that contains 3x3 convolutional kernels, which keeps the receptive field while decreasing the computation cost; (b) the leaky ReLU activation function, which makes the model converge faster. LMIEv1 is the simplest encoder. Its structure is shown in figure 5.2(a), and the structure of LMIEv2 is shown in figure 5.2(b).


Figure 5.2: LMIEv1 and v2 structure

The difference between the above two networks is the way the latent vectors are obtained. In LMIEv1, the latent vectors are assumed to have an impulse (deterministic) distribution; that is, the latent vectors are taken directly from the output of the encoder. In LMIEv2, the latent vectors are assumed to have a Gaussian distribution. In this case, the encoder outputs a "mean vector" and a "variance vector" for each image, and from these two vectors a new "latent vector sample" is drawn from the corresponding Gaussian distribution.

The structure of LMIEv3 and v4 is shown in figure 5.3(a) and 5.3(b).

The difference between LMIEv1 and v3 is that v3 is more complex and deeper than v1. Finally, network LMIEv4 combines the changes of v2 and v3. The different results obtained by these four networks can be used to comprehensively consider the impact of network and algorithm complexity on the results. In summary, LMIE v1 to v4 have the following characteristics:

v1: simple encoder network + non-parametric latent features;

v2: simple encoder network + stochastic latent features;

v3: deeper encoder network + non-parametric latent features;

v4: deeper encoder network + stochastic latent features.

Figure 5.3: LMIEv3 and v4 structure

5.1.2 Loss Function Calculation

The loss function applied in the model is called the "contrastive loss". The proposal of this loss function is based on formula 4.18. Consider the content inside the expectation operation, $\log\frac{e^{d(c,z)}}{e^{d(c,z)} + e^{d(c,\bar{z})}}$. It can be modified as

$$\hat{y} = -\log\frac{e^{d(c,z)}}{e^{d(c,z)} + e^{d(c,\bar{z})}} \qquad (5.1)$$

Formula 5.1 can serve as a basis for judging whether the latent pair z and z̄ is sampled from the same class or not. Specifically speaking, if z and z̄ are sampled from the same class, ŷ should be close to 1; on the contrary, if z and z̄ are sampled from different classes, ŷ is close to 0. However, because c, z and z̄ are all normalized vectors, the extreme dynamic range of 5.1 is not [0, 1] but [0.183, 1]. Inspired by paper [60], another parameter called the "temperature" (τ) can be added to the formula to alleviate this problem to some extent. The new formula after modification is shown as formula 5.2.

$$\hat{y} = -\log\frac{e^{d(c,z)/\tau}}{e^{d(c,z)/\tau} + e^{d(c,\bar{z})/\tau}} = -\log\frac{1}{1 + e^{(d(c,\bar{z}) - d(c,z))/\tau}} \qquad (5.2)$$

Choosing an appropriate value of the temperature τ is also an important step in model testing.

Accordingly, a new label can be created for every latent pair, as shown in formula 5.3.

$$\tilde{y} = \begin{cases} 0, & \text{if } \mathrm{label}(z) = \mathrm{label}(\bar{z}) \\ 1, & \text{if } \mathrm{label}(z) \neq \mathrm{label}(\bar{z}) \end{cases} \qquad (5.3)$$

With these two newly generated quantities, the contrastive loss is defined as

$$L = \frac{1}{\tilde{N}} \sum_{i}^{\tilde{N}} \hat{y}_i \tilde{y}_i + (1 - \hat{y}_i)(1 - \tilde{y}_i) \qquad (5.4)$$

Note that Ñ in formula 5.4 is no longer the number of raw images, but the number of latent pairs. If the number of raw images is N, then in the extreme case Ñ can reach a maximum of N(N−1)/2.
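The following PyTorch sketch shows one way to compute the loss of formulas 5.2-5.4 for a batch of latent pairs; the pairing strategy and tensor shapes are illustrative assumptions, not the exact implementation of the thesis.

import torch
import torch.nn.functional as F

def clipped_cosine(a, b):
    # d(a, b) = max(<a, b> / (||a|| ||b||), 0), as defined in section 4.3.
    return torch.clamp(F.cosine_similarity(a, b, dim=-1), min=0.0)

def contrastive_loss(c, z, z_bar, pair_labels, tau=0.1):
    """c, z, z_bar: (N_pairs, dim) latent vectors; pair_labels: 0 if z and z_bar share a class, else 1."""
    d_pos = clipped_cosine(c, z)
    d_neg = clipped_cosine(c, z_bar)
    # Formula 5.2: y_hat = log(1 + exp((d(c, z_bar) - d(c, z)) / tau)).
    y_hat = torch.log1p(torch.exp((d_neg - d_pos) / tau))
    y_tilde = pair_labels.float()
    # Formula 5.4, averaged over the latent pairs.
    return torch.mean(y_hat * y_tilde + (1.0 - y_hat) * (1.0 - y_tilde))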

5.1.3 Linear Classifier

After the encoder is trained, it is extremely important to verify the effectiveness of the encoder. In other words, it is necessary to verify whether the latent vectors obtained through the encoder can achieve a certain degree of automatic clustering with the help of a classifier. Assuming that the effect of the encoder is significant, then theoretically speaking, the simplest linear classifier can do the clustering task on the latent vectors. With this in mind, a "Logistic Regression" classifier is chosen.

Although Logistic Regression (LR) carries the word "regression" in its name, it is essentially a classification algorithm. It has aliases such as logit regression, maximum-entropy classification (MaxEnt) and the log-linear classifier. LR is based on the linear regression model, using the sigmoid function to compress the result of the linear model w^T x to the interval [0, 1], so that it has a probabilistic meaning. Its essence is still a linear model, and its implementation is relatively simple.

LR is named after the logistic distribution. The density function and distribution function of this distribution are shown in formulas 5.5 and 5.6.

$$p(x; \mu, s) = \frac{e^{-(x-\mu)/s}}{s\,\big(1 + e^{-(x-\mu)/s}\big)^2} \qquad (5.5)$$

$$P(x; \mu, s) = \frac{1}{1 + e^{-(x-\mu)/s}} \qquad (5.6)$$

where µ is the location parameter and s is the scale parameter. The probability density function of the logistic distribution for different µ and s is roughly shown in figure 5.4.

Figure 5.4: the probability density function of the logistic distribution

It can be seen that the graph of the logistic distribution is similar to the Gaussian distribution; that is, it changes faster near the center and changes slowly at both ends. When µ = 0, s = 1, the probability distribution function of the logistic distribution is often called the sigmoid function.

5.2 Training

5.2.1 Dataset

CIFAR-10 is a dataset of labeled images mainly used for image classification. It contains 60,000 color images divided into 10 classes. Among them, 50,000 images are included in the "training" set and the other 10,000 are included in the "test" set. All the images have a size of 32 × 32 pixels.

Due to model limitations, only the 10,000 images in the "test" set are used in this thesis. For each class, 100 randomly chosen images are used to train the encoder, and the remaining 900 images are used for the validation test.

Figure 5.5: thumbnails of 10 random images for each class in CIFAR-10

5.2.2 Data Augmentation

Since only a small number of labeled images are used for training, the training samples are combined into pairs, and each pair is labeled according to whether the two samples are from the same class or not. This effectively creates much more labeled training data.
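A straightforward way to realize this pairing, sketched below, is to enumerate all unordered pairs of training samples and label each pair by class agreement, following formula 5.3; with N images this gives up to N(N−1)/2 pairs, matching the bound in section 5.1.2. The helper is illustrative only.

from itertools import combinations

def make_pairs(labels):
    """Return (i, j, pair_label) for all unordered pairs; pair_label = 0 if same class, else 1."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        pairs.append((i, j, 0 if labels[i] == labels[j] else 1))
    return pairs

# Example: 1,000 training images (100 per class, 10 classes) yield 1000 * 999 / 2 = 499,500 candidate pairs.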

5.2.3 Implementation Details

All the model training and testing are done on a personal laptop with eight-core i5-8250U CPU (no GPU). All data such as training time are based on this hardware and hence are for reference only.

Both the encoder network and the classifier have some parameters that need to be set before training. Table 5.1 below shows the configuration of the encoder.

Table 5.1: configuration of the encoder

Parameter                      Value
Batch size (binary classes)    16
Batch size (multiple classes)  40
Start learning rate            0.0001
Learning policy                Adam
Learning rate decay            0.002

Table 5.2 shows the parameters used when training the LR classifier.

Table 5.2: configuration of the classifier

Parameter               Value
Regularization penalty  L2
Solver                  lbfgs
Maximum iterations      10
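For reference, the classifier configuration in table 5.2 corresponds roughly to the scikit-learn call sketched below, applied to the latent vectors produced by the trained encoder; the latent vectors and labels here are random placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder latent vectors and labels standing in for the encoder outputs.
z_train, y_train = np.random.randn(1000, 64), np.random.randint(0, 10, 1000)
z_test, y_test = np.random.randn(9000, 64), np.random.randint(0, 10, 9000)

# L2 penalty, lbfgs solver and a small iteration budget, as in table 5.2.
clf = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=10)
clf.fit(z_train, y_train)
print(clf.score(z_test, y_test))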

6 Results

6.1 Evaluation Metrics

Three criteria are used as evaluation metrics for the models: the speed of training, the train and test accuracy, and the Matthews correlation coefficient (MCC). The speed of training uses "seconds/epoch" as its unit. Since no GPU was used in this thesis, the absolute numbers have little practical meaning; however, the relative numbers between multiple configurations are still valuable. Train and test accuracy are the most commonly used criteria in image recognition and classification. Generally speaking, the higher the test accuracy, the better the model's performance. Meanwhile, another quantity is also worth noting, namely the gap between training and test accuracy. This quantity is mainly used to judge whether the model is effective at avoiding the problem of overfitting or underfitting.

The Matthews correlation coefficient is a concept based on the confusion matrix. For a binary classification problem, the actual values are only positive and negative. If an instance is positive and is predicted to be positive, it is a true positive (TP); if it is negative and is predicted to be positive, it is a false positive (FP); if it is negative and is predicted to be negative, it is a true negative (TN); and if it is positive and is predicted to be negative, it is a false negative (FN). MCC is defined as shown in formula 6.1.

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (6.1)$$

It can be seen that MCC is an index that unites TP, TN, FP and FN. It is therefore a more balanced evaluation metric. The interval of MCC is [−1, 1]. MCC = 1 means the prediction is exactly the same as the real labels, and MCC = 0 means the prediction is almost equivalent to random guessing. MCC essentially describes the correlation coefficient between the prediction result and the actual situation. For multi-classification, the following steps are taken. First, set one class as the positive class and the others as negative, and calculate the value of MCC for this single class. Second, do the same for all 10 classes. Third, average the 10 values to get the MCC value over all classes.
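The per-class averaging described above can be written compactly as below, using scikit-learn's matthews_corrcoef on one-vs-rest binarized labels; the macro-averaging scheme is the one stated in the text, not a built-in library option.

import numpy as np
from sklearn.metrics import matthews_corrcoef

def macro_mcc(y_true, y_pred, n_classes=10):
    """One-vs-rest MCC per class, averaged over all classes (section 6.1).
    y_true, y_pred: integer class labels as NumPy arrays."""
    scores = [matthews_corrcoef(y_true == k, y_pred == k) for k in range(n_classes)]
    return float(np.mean(scores))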

6.2 Model Performance

In this section, all the models introduced above are examined. In addition, two other representation learning models are also trained and tested on the same dataset for comparison: an "Autoencoder" model and "supervised DIM", which uses almost the same network as the "Deep InfoMax" mutual information estimator (DIM) in paper [2]; only the loss layer is changed to a cross-entropy loss, which makes "supervised DIM" a supervised model.

6.2.1 Binary Classification

Since there are ten classes in the CIFAR-10 dataset, there are 45 possible ways to choose two classes from them, which is too many. Therefore, in the experiments of this thesis, only 8 of the 45 possibilities are randomly selected, and the average values are reported in this subsection.

The speed of training for each encoder is shown in table 6.1, and the train accuracy, test accuracy and MCC are shown together in table 6.2.

Table 6.1: speed of training in binary classification

Encoder model              Speed of training (s/epoch)
AE                         3.07

Table 6.2: model performance in binary classification

Encoder model              Train acc        Test acc        MCC
AE                         0.66 ± 0.01      0.695 ± 0.01    0.38 ± 0.01
supervised DIM             0.99 ± 0.01      0.80 ± 0.02     0.62 ± 0.01
LMIE (best performance)    0.995 ± 0.005    0.87 ± 0.03     0.73 ± 0.02

6.2.2 Multiple Classification

The performance of the model in the multi-classification task is an important basis for deciding whether the model can be applied in practice, because real-world tasks are mostly multi-class. The speed of training is shown in table 6.3, and the other three criteria are shown in table 6.4.

Table 6.3: speed of training in multiple classification

Encoder model              Speed of training (s/epoch)
AE                         6.63
supervised DIM             23.26
LMIE (best performance)    200.44

Table 6.4: model performance in multiple classification

Encoder model              Train acc       Test acc       MCC
AE                         0.15 ± 0.01     0.14 ± 0.01    0.08 ± 0.01
supervised DIM             0.98 ± 0.01     0.50 ± 0.01    0.42 ± 0.01
LMIE (best performance)    0.74 ± 0.03     0.47 ± 0.02    0.41 ± 0.02

6.2.3 Comparison between 4 Versions of LMIE

The results shown in the previous two subsections are the best performance among the 4 LMIE versions. This subsection compares the results of the 4 versions, using multiple classification as an example. The results are shown in table 6.5.


Table 6.5: results comparison between 4 versions of LMIE

Encoder model    Speed of training (s/epoch)    Train acc      Test acc       MCC
v1               200.44                         0.74 ± 0.03    0.47 ± 0.02    0.41 ± 0.02
v2               211.37                         0.72 ± 0.03    0.46 ± 0.02    0.39 ± 0.02
v3               415.94                         0.68 ± 0.01    0.44 ± 0.01    0.38 ± 0.01
v4               474.27                         0.68 ± 0.02    0.42 ± 0.01    0.35 ± 0.02

The relation between the parameter τ and the test accuracy is shown in figure 6.1.

Figure 6.1: line chart of τ and test accuracy in multiple classification

6.3 Result Analysis


One explanation is that the softmax loss function only considers intra-class aggregation, while the contrastive loss also considers maximizing the distance between classes. This advantage plays a greater role when the number of classification labels is small.

When comparing the four versions of LMIE, it can be seen that LMIEv1, the relatively simplest network, has the best performance. One possible conjecture is that the model performs best when its complexity matches the size of the data, so a more complicated model reduces performance. Another possible explanation is that the more complex models are constrained by the limited computational power available and would require higher-performance equipment.


Discussion

This final chapter presents the conclusions drawn from the experimental results, as well as potential improvements for future work.

7.1 Conclusion

This thesis proposed a label-based conditional mutual information estimator using a contrastive loss function and a neural network-powered encoder. The proposed estimator sacrifices training efficiency but performs better in classifying a subset extracted from CIFAR10. Compared with the supervised DIM estimator, the test accuracy of the proposed model is about 7 percentage points higher in the binary classification task, while in the multi-classification task its performance is close to, but does not exceed, that of the supervised DIM model. The contrastive loss function can therefore be effective in some specific situations, and a model based on this loss function can be seen as a prototype for classification tasks with little training data.

7.2 Future Work

As noted in the previous chapter, more complicated networks decreased the test accuracy. The reason for this is still not clear and requires further study by testing more networks with different structures. Another limitation of this work is that only the ten classes of CIFAR10 were used. To ensure that the model is suitable for real-life practice, it may be necessary to test it on other, more complex datasets.

References

[1] Mohamed I B; Aristide B; et al. "Mutual Information Neural Estimation". In: ICML (2018).

[2] Devon H; Alex F; et al. "Learning Deep Representations by Mutual Information Estimation and Maximization". In: ICLR (2019).

[3] Aaron O; Yazhe L; Oriol V. "Representation learning with contrastive predictive coding". In: arXiv (2018).

[4] Zheng Z; Yong X. "Research on Image Representation Learning based on Structure and Discriminative Semantic Embedding". In: Harbin Institute of Technology (2019).

[5] Oliva A; Torralba A. "Modeling the shape of the scene: A holistic representation of the spatial envelope". In: International Journal of Computer Vision 42 (2001), pp. 145–175.

[6] Cao Y; Wang C; Zhang L; et al. "Edgel Index for large-scale sketch-based image search". In: CVPR (2011), pp. 767–768.

[7] Huang J; Kumar S R; Mitra M; et al. "Image indexing using color correlograms". In: CVPR (1997), pp. 762–768.

[8] Lowe D G. "Distinctive image features from scale-invariant keypoints". In: International Journal of Computer Vision (2004), pp. 91–110.

[9] Nister D; Stewenius H. "Scalable recognition with a vocabulary tree". In: CVPR (2006), pp. 2161–2168.

[10] Philbin J; Chum O; Isard M; et al. "Object retrieval with large vocabularies and fast spatial matching". In: CVPR (2007), pp. 1–8.

[11] Perronnin F; Sanchez J; Mensink T. "Improving the Fisher kernel for large-scale image classification". In: ECCV (2010), pp. 143–156.

[12] Jegou H; Douze M; Schmid C; et al. "Aggregating local descriptors into a compact image representation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2010), pp. 3304–3311.

[13] Zhou W; Li H; Lu Y; et al. "SIFT match verification by geometric coding for large-scale partial-duplicate web image search". In: ACM Transactions on Multimedia Computing, Communications and Applications 4 (2013).

[14] Chu L; Jiang S; Wang S; et al. "Robust spatial consistency graph model for partial duplicate image retrieval". In: IEEE Transactions on Multimedia 15(8) (2013), pp. 1982–1996.

[15] Jegou H; Douze M; Schmid C. "Hamming embedding and weak geometric consistency for large scale image search". In: ECCV (2008), pp. 304–317.

[16] Jegou H; Douze M; Schmid C. "Product quantization for nearest neighbor search". In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2011), pp. 117–128.

[17] Zhou W; Yang M; Wang X; et al. "Scalable feature matching by dual cascaded scalar quantization for image retrieval". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 38(1) (2016), pp. 159–171.

[18] Weiting S; Hongwei G. "Extreme Learning Machine Based Representation Learning". In: Dalian University of Technology (2019).

[19] Gorban A N; Kegl B; Wunsch D C; et al. "Principal Manifolds for Data Visualization and Dimension Reduction". In: Springer Berlin Heidelberg (2008).

[20] Kriegel H P; Schubert E; Zimek A. "A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms". In: International Conference on Scientific and Statistical Database Management, Springer-Verlag (2008), pp. 418–435.

[21] Lu H; Plataniotis K N; Venetsanopoulos A N. "A survey of multilinear subspace learning for tensor data". In: Pattern Recognition 44(7) (2011), pp. 1540–1551.

[23] Jae S L; Lee D D; et al. "Non-negative matrix factorization of dynamic images in nuclear medicine". In: IEEE Nuclear Science Symposium Conference Record 4 (2001), pp. 2027–2030.

[24] Lin Q. "Improved Face Recognition Method Based on NMF". In: Computer Science (2012).

[25] Yuanyuan W; Shuyi W; Bin M; et al. "Correntropy Induced Metric Based Graph Regularized Non-negative Matrix Factorization". In: Neurocomputing (2016).

[26] Yuanyuan W; Naiyang G; et al. "Translation non-negative matrix factorization with fast optimization". In: IEEE International Conference on Systems, Man, and Cybernetics (SMC) (2014), pp. 2871–2874.

[27] Meng Y; Shang R; Jiao L; et al. "Feature Selection Based Dual-graph Sparse Non-negative Matrix Factorization for Local Discriminative Clustering". In: Neurocomputing 209 (2018), pp. 87–99.

[28] Wright J; Ganesh A; Zhou Z; et al. "Robust face recognition via sparse representation". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2) (2009), pp. 210–227.

[29] Qiao L; Chen S; et al. "Sparsity preserving projections with applications to face recognition". In: Pattern Recognition 43(1) (2010), pp. 331–341.

[30] Yang W; Wang Z; Sun C. "Collaborative representation based projections method for feature extraction". In: Pattern Recognition 48(1) (2015), pp. 20–27.

[31] Hinton G E; Osindero S; Teh Y W. "A Fast Learning Algorithm for Deep Belief Nets". In: Neural Computation 18(7) (2006), pp. 1527–1554.

[32] Hinton G E; Salakhutdinov R R. "Reducing the Dimensionality of Data with Neural Networks". In: Science 313(5786) (2006), pp. 504–507.

[33] Hu D K; Duan G D. "Learning Facial Expression Codes with Sparse Auto-Encoder". In: Applied Mechanics and Materials (2013), pp. 334–337.

[35] Graves A. "Supervised Sequence Labelling with Recurrent Neural Networks". In: Springer Berlin Heidelberg (2012).

[36] Tai K S; Socher R; Manning C D. "Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks". In: Computer Science 5(1) (2015), p. 36.

[37] Li J W; Luong M T; Jurafsky D. "A Hierarchical Neural Autoencoder for Paragraphs and Documents". In: Annual Meeting of the Association for Computational Linguistics (2015).

[38] Chen X; Ma L; Jiang W; et al. "Regularizing RNNs for Caption Generation by Reconstructing The Past with The Present". In: IEEE Conference on Computer Vision and Pattern Recognition (2018).

[39] Yan Y; Wang Y; Gao W C; et al. "LSTM2: Multi-Label Ranking for Document Classification". In: Neural Processing Letters 7 (2017), pp. 1–22.

[40] LeCun Y; Bottou L; Bengio Y; Haffner P. "Gradient-based learning applied to document recognition". In: Proceedings of the IEEE 86 (1998), pp. 2278–2324.

[41] Krizhevsky A; Sutskever I; Hinton G E. "Imagenet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems (2012), pp. 1097–1105.

[42] Ng A Y; Jordan M I. "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes". In: NIPS, pp. 841–848.

[43] Diederik P K; Max W. "Auto-encoding variational Bayes". In: ICLR (2014).

[44] Zhou Q. "Research on privacy protection of social media users based on KL divergence". In: Intelligence Exploration 9 (2015).

[45] Makhzani A; Shlens J; Jaitly N; et al. "Adversarial autoencoders". In: arXiv 1511.05644 (2015).

[46] Louizos C; Swersky K; Li Y; et al. "The variational fair autoencoder". In: arXiv 1511.00830 (2015).

[47] Burda Y; Grosse R; Salakhutdinov R. "Importance weighted autoencoders". In: arXiv 1704.02916 (2015).

[49] Shen M F; Chen J L; et al. "Digital image restoration based on image decomposition and region segmentation". In: Journal of Electronic Measurement and Instrument 23(9) (2009).

[50] Fu S C; Lou S T. "Image repair algorithm based on regional texture synthesis". In: Journal of Electronic Measurement and Instrument 31(6) (2009).

[51] Bao J; Chen D; Wen F; et al. "CVAE-GAN: fine-grained image generation through asymmetric training". In: arXiv 1703.10155 (2017).

[52] Creswell A; Bharath A A; Sengupta B. "Conditional autoencoders with adversarial information factorization". In: arXiv 1711.05175 (2017).

[53] Cover T M; Thomas J A. "Elements of information theory". In: John Wiley & Sons (2012).

[54] McEliece R. "The theory of information and coding". In: Cambridge University Press (2002).

[55] Tesmer M; Estevez P A. "AMIFS: adaptive feature selection by using mutual information". In: IEEE International Joint Conference on Neural Networks (2004).

[56] Weiwei Z; Niranjan J; Brady M. "Colorectal MRI image registration using phase mutual information from non-parametric probability density function estimator". In: 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro (2008).

[57] Donsker M; Varadhan S. "Asymptotic Evaluation of certain Markov Process Expectations for large time". In: Communications on Pure and Applied Mathematics (1975).

[58] Sebastian N; Botond C; Ryota T. "Training generative neural samplers using variational divergence minimization". In: arXiv (2016).

[59] Hanwei W; Ather G; Markus F. "Conditional Mutual Information-based Contrastive Loss for Financial Time Series Forecasting". In: arXiv 2002.07638 (2020).

