
Deep Convolutional

Nonnegative Autoencoders

YANN DEBAIN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

(2)

Nonnegative Autoencoders

YANN DEBAIN

Master in Information and Network Engineering
Supervisor: Stanislaw Gorlow, PhD

Examiner: Saikat Chatterjee, PhD

School of Electrical Engineering and Computer Science
Host company: Dolby Sweden AB

Email: debain@kth.se


In this thesis, nonnegative matrix factorization (NMF) is viewed as a feedbackward neural network and generalized to a deep convolutional architecture with forwardpropagation under β-divergence. NMF and feedforward neural networks are put in relation and a new class of autoencoders is proposed, namely the nonnegative autoencoders. It is shown that NMF is essentially the decoder part of an autoencoder with nonnegative weights and input. The shallow autoencoder with fully connected neurons is extended to a deep convolutional autoencoder with the same properties. Multiplicative factor updates are used to ensure nonnegativity of the weights in the network. As a result, a shallow nonnegative autoencoder (NAE), a shallow convolutional nonnegative autoencoder (CNAE) and a deep convolutional nonnegative autoencoder (DCNAE) are developed. Finally, all three variants of the nonnegative autoencoder are tested on different tasks, such as signal reconstruction and signal enhancement.


In this report, nonnegative matrix factorization (NMF) is viewed as a feedbackward neural network. NMF is generalized to a deep convolutional architecture with forwardpropagation and β-divergence. NMF and feedforward neural networks are compared, and a new type of autoencoder is presented, called the nonnegative autoencoder. NMF is viewed as the decoder part of an autoencoder with nonnegative weights and input. The shallow, fully connected autoencoder is extended to a deep convolutional autoencoder with nonnegative weights and input. In this report, a shallow nonnegative autoencoder (NAE), a shallow convolutional nonnegative autoencoder (CNAE) and a deep convolutional nonnegative autoencoder (DCNAE) were developed. Finally, the three variants of the nonnegative autoencoder are tested on a few different tasks, such as signal reconstruction and signal enhancement.


I want to thank my supervisors for the freedom and advice they shared during the development of this project. From KTH’s side, thank you, Saikat Chatterjee, for the freedom you gave me during this thesis. From Dolby’s side, thank you Stanislaw Gorlow and Pedro J. Villasana T. for all those nice conversations regarding our ideas. I also want to thank all the people at Dolby for helping me during the thesis, the discussions we had during lunch or fika, and all our foosball games. I also thank Grenoble INP – Phelma for letting me do this thesis. Finally, I want to thank my parents for supporting me during the thesis.


Contents vi

List of Figures viii

List of Tables xi

1 Introduction 1

1.1 Motivation . . . . 2

1.2 Objective . . . . 2

1.3 Contributions . . . . 2

1.4 Notation . . . . 3

2 Background 5

2.1 Nonnegative Matrix Factorization . . . . 5

2.1.1 Overview . . . . 5

2.1.2 Loss Function . . . . 6

2.1.3 Multiplicative Factor Updates . . . . 8

2.1.4 Convolutional Extension . . . . 9

2.1.5 Multilayer Extension . . . . 11

2.2 Feedforward Neural Networks . . . . 12

2.2.1 Overview . . . . 13

2.2.2 Main Operations . . . . 13

2.2.3 Architectures . . . . 15

2.2.4 Example . . . . 16

2.3 Comparison . . . . 17

3 Multilayer β-CNMF 21

3.1 β-CNMF . . . . 21

3.2 Multilayer β-CNMF . . . . 22

3.2.1 Data Model . . . . 22


3.2.2 Forwardpropagation . . . . 22

4 Deep Convolutional Nonnegative Autoencoders 25

4.1 Nonnegative Autoencoders . . . . 25

4.1.1 Concept . . . . 25

4.1.2 Architecture . . . . 26

4.1.3 Derivation . . . . 26

4.1.4 Simulation . . . . 27

4.2 Convolutional Nonnegative Autoencoders . . . . 31

4.2.1 Concept . . . . 31

4.2.2 Architecture . . . . 31

4.2.3 Derivation . . . . 32

4.2.4 Simulation . . . . 33

4.3 Deep Convolutional Nonnegative Autoencoders . . . . . 37

4.3.1 Concept . . . . 38

4.3.2 Architecture . . . . 38

4.3.3 Derivation . . . . 38

4.3.4 Simulation . . . . 40

4.4 Conclusion . . . . 45

5 Experiments 47

5.1 Signal Reconstruction . . . . 47

5.1.1 Scope . . . . 47

5.1.2 Input . . . . 50

5.1.3 Results . . . . 50

5.2 Signal Enhancement . . . . 60

5.2.1 Scope . . . . 60

5.2.2 Data . . . . 62

5.2.3 Results . . . . 64

5.3 Conclusion . . . . 67

6 Take-Aways 69

6.1 Conclusions . . . . 69

6.2 Open Questions . . . . 70

Bibliography 71


List of Figures

2.1 β-divergence for different values of β for v = 1 (image borrowed from [43]) . . . . 8
2.2 Multilayer CNMF . . . . 12
2.3 An illustration of a biological neuron and its mathematical model (image borrowed from [41]) . . . . 13
2.4 Three popular activation functions . . . . 14
2.5 The filters that represent all the main characteristics of an exemplary image (image borrowed from [32]) . . . . 15
2.6 Architecture of LeNet-5, a convolutional neural network for digits recognition (image borrowed from [22]) . . . . 17
2.7 Nodal architecture of a shallow autoencoder and NMF . . . . 18
2.8 Comparison of the convolution operation in a neural network and NMF . . . . 19
3.1 Original spectrogram (V) . . . . 24
3.2 Reconstruction (U) without and with forwardpropagation. The corresponding reconstruction error using β-divergence after 500 iterations is 0.10 and 0.11, respectively . . . . 24
4.1 Input used to analyze the behavior of the NAE (black = 1, white = 0) . . . . 28
4.2 Output, weights and activations extracted from the decoder part of the NAE . . . . 29
4.3 Reconstruction of some training data after 100 iterations. Reconstruction error: 0.004, 0.010 . . . . 30
4.4 16 random weights learned . . . . 30
4.5 Input used to analyze the behavior of the CNAE (black = 1, white = 0) . . . . 33
4.6 Results for convolution in x . . . . 34
4.7 Results for convolution in x and y . . . . 35
4.8 Reconstruction of one image from the MNIST dataset. Reconstruction error: 0.31, 0.29 . . . . 36
4.9 The patterns learned to reconstruct the reference image . . . . 37
4.10 Network architecture . . . . 40
4.11 Layer 2 of the decoder . . . . 41
4.12 Layer 1 of the decoder . . . . 41
4.13 Reconstruction of one image in the MNIST dataset with DCNAE and multilayer CNMF. Reconstruction error: 0.18, 0.27 . . . . 43
4.14 Reconstruction of one image from the MNIST dataset with CNAE and CNMF . . . . 44
4.15 The patterns learned by the filters to reconstruct the reference image - 1 . . . . 44
4.16 The patterns learned by the filters to reconstruct the reference image - 2 . . . . 45
5.1 Block diagram for a coding task . . . . 48
5.2 Input spectrogram (sound given by Dolby) . . . . 50
5.3 Weights and reconstruction using NAE . . . . 51
5.4 Weights and reconstruction using CNAE . . . . 53
5.5 Convolution of $W_i$ with $W_i^{+}$, for i = 1, 2, ..., 4 . . . . 53
5.6 Weights and reconstruction using DCNAE . . . . 55
5.7 Comparison between multiplicative update rule and projected gradient descent . . . . 56
5.8 Comparison between multiplicative update rules and projected gradient descent after 500 iterations . . . . 57
5.9 Comparison between NAEs and real-valued AEs: (a) NAE vs AE (η = 0.1), (b) CNAE vs CAE (η = 0.01), (c) DCNAE vs DCAE (η = 0.02) . . . . 58
5.10 Comparison between NAEs and real-valued AEs after 500 iterations . . . . 59
5.11 Block diagram for signal enhancement/source separation task . . . . 61
5.12 Mixture (sound given by Dolby) . . . . 63
5.13 Speech and non-speech magnitude spectrograms (sounds given by Dolby) . . . . 63
5.14 Speech reference vs semi-supervised speech enhancement results . . . . 65
5.15 Speech reference vs supervised speech enhancement results . . . . 66
5.16 Deep NMF vs DCNAE . . . . 67


List of Tables

2.1 Summary of the comparison . . . . 20
4.1 Model Parameters – NAE vs NMF . . . . 30
4.2 Model Parameters – CNAE/CNMF . . . . 36
4.3 Model Parameters – DCNAE/Multilayer CNMF . . . . 43
5.1 DCNAE architecture . . . . 49


Introduction

Deep learning has emerged as one of the most successful machine learning methods that have outperformed previous state-of-the-art results for speech recognition and computer vision, among others. The success of deep learning is due to the increase in computing power, the availability of massive amounts of data and the development of new computing systems.

However, some of the drawbacks of deep neural networks are that they require massive amounts of data, their feature representations are hard to interpret and they are difficult to analyze mathematically.

Other techniques for compact data representation and feature extraction, such as nonnegative matrix factorization (NMF), combine data modelling with non-convex optimization to learn interpretable and consistent features from data. Previously criticized for its computational complexity, NMF became extremely popular when an efficient algorithm to solve the factorization was proposed in 1999 by Lee and Seung [24, 23]. Later, several variants of NMF were developed: a convolutional model [35, 37], different loss functions [9, 44], sparse coding [16] or a multilayer NMF model [40, 1, 31], to name a few.

In this thesis, firstly, NMF is viewed as a feedbackward neural network and generalized to a deep convolutional architecture with forwardpropagation under β-divergence. Secondly, it is shown that NMF is essentially the decoder part of a variant of a feedforward neural network known as the autoencoder (AE). The main part of the thesis is devoted to generalizing the shallow autoencoder with a single hidden layer and fully connected neurons to a deep convolutional autoencoder while constraining its input and filter coefficients to the domain of nonnegative real numbers.


By using the multiplicative update rule to find a local minimum of the β-divergence via gradient descent, the range of the autoencoder is likewise restricted to nonnegative real numbers. The nonnegative autoencoder (NAE) and its variants are tested on the following tasks: signal reconstruction and signal enhancement.

1.1 Motivation

Tools like NMF for feature extraction are easily interpretable and easy to analyze [16, 7, 10]. The multilayer extension offers a hierarchical feature representation [40, 1, 31] but is suboptimal because the model is not trained end-to-end, meaning that the factorization is solved in each layer independently, without considering how the error propagates to the other layers. On the other hand, because of their black-box nature, neural networks are not always well understood mathematically and have a feature representation that is hard to interpret [27]. The advantage of neural networks is that they are easily converted into a deep architecture that is end-to-end trainable as a single optimization task. It is therefore of great interest to examine an alternative path by combining deep neural networks and NMF into a deep nonnegative neural network. In doing so, we can combine the strengths of the two approaches.

1.2 Objective

The objective of this thesis is, firstly, to make the multilayer convolutional NMF end-to-end trainable under β-divergence. Secondly, NMF shall be considered as the decoder part of a shallow autoencoder. The shallow autoencoder shall be extended to a deep convolutional architecture with a nonnegativity constraint on its inputs and weights. The nonnegative autoencoder and its variants should be end-to-end trainable as a single optimization task for better accuracy. Lastly, algorithms that use this new architecture are to be tested in the context of signal reconstruction and signal enhancement with application to speech and audio.

1.3 Contributions

In this section, I briefly summarize the main contributions of this thesis.

These are:


• A multilayer convolutional NMF is derived under β-divergence with multiplicative factor updates and trained as a single optimization problem using forwardpropagation alias uppropagation [1].

• The NMF model is baked into a nonnegative autoencoder (NAE) [15, 3]. The NAE is the basis for further research reported in this thesis. It is used as a core tool for further algorithm design.

• The applicability and performance of the new algorithms: NAE, convolutional NAE (CNAE) and deep CNAE (DCNAE) are investigated on different tasks, such as signal reconstruction and signal enhancement.

1.4 Notation

Throughout this thesis, several mathematical conventions are used. When addressing numbers, $\mathbb{R}$, $\mathbb{R}_{\geq 0}$ and $\mathbb{C}$ stand for the set of real, nonnegative real and complex numbers, respectively. Multidimensional variables are denoted in bold letters, with bold lower case letters standing for vectors and bold upper case letters for matrices. A matrix of ones of size K × N is denoted as $1_{K \times N}$. Conventional matrix multiplication between the matrices A and B is denoted as AB. The operations $A \circ B$, $A^{\circ p}$ and $\frac{A}{B}$ denote element-wise multiplication (Hadamard product), element-wise exponentiation and element-wise division, respectively. $\{X_i\}^{I} \in \mathbb{R}^{L \times M \times C_i \times C_o}$ denotes a set of I convolutional filters with a kernel size of L × M that receives $C_i$ input channels and produces $C_o$ output channels.


Background

In this chapter, a brief discussion of the essential background needed to follow the thesis is presented. Section 2.1 introduces nonnegative matrix factorization (NMF), β-divergence, multiplicative factor updates, the convolutional NMF and the multilayer NMF. Section 2.2 presents feedforward neural networks and their deep architecture. In Section 2.3, a comparison is made between NMF and a neural network.

2.1 Nonnegative Matrix Factorization

In this section, nonnegative matrix factorization (NMF) is presented together with the β-divergence, multiplicative factor updates, and the convolutional and multilayer extensions.

2.1.1 Overview

NMF, originally introduced as positive matrix factorization by Paatero and Tapper [29], is a method where a matrix $V \in \mathbb{R}_{\geq 0}^{K \times N}$ is factorized as a product of two nonnegative matrices $W \in \mathbb{R}_{\geq 0}^{K \times I}$ and $H \in \mathbb{R}_{\geq 0}^{I \times N}$, where the rank of the factorization $I \ll \min(K, N)$:

$$ V \simeq U = WH. \qquad (2.1) $$

Eq. (2.1) can be understood as a decomposition of V into I representative basis vectors stacked as the columns of W, where each basis vector is weighted by the respective row of H. Usually, W is referred to as the dictionary, V as the observation and H as the coefficient matrix that is required to approximate V by U.

To perform NMF, a loss function L(V, U) is required to measure the dissimilarity between V and U. The factorization is obtained by solving the optimization problem

$$ (W_{\mathrm{opt}}, H_{\mathrm{opt}}) = \operatorname*{argmin}_{W,\,H} L(V, U) \quad \text{s.t.} \quad w_{ki}, h_{in} \in \mathbb{R}_{\geq 0}. \qquad (2.2) $$

2.1.2 Loss Function

Several functions can be used as the loss function to solve the NMF problem. The most common ones are the squared Euclidean distance, the Kullback–Leibler divergence and the Itakura–Saito divergence.

Squared Euclidean Distance

The Euclidean distance is the straight-line distance between two points in Euclidean space. The squared Euclidean (SE) distance between u and v is defined as

$$ d_{\mathrm{SE}}(v, u) = \tfrac{1}{2}(u - v)^2. \qquad (2.3) $$

Kullback–Leibler Divergence

The Kullback–Leibler (KL) divergence is a measure of the dissimilarity between two probability distributions. In NMF, V and U are considered as histograms. In this thesis the generalized KL divergence is addressed, which is defined as

$$ d_{\mathrm{KL}}(v, u) = v \log\left(\frac{v}{u}\right) - v + u. \qquad (2.4) $$

Itakura–Saito Divergence

Another popular measure of dissimilarity between two distributions is the Itakura–Saito (IS) divergence. This divergence is a measure of dissimilarity usually used in connection with voicegrams. Using NMF, V and U can be considered as voicegrams. The IS divergence is defined as

$$ d_{\mathrm{IS}}(v, u) = \frac{v}{u} - \log\left(\frac{v}{u}\right) - 1. \qquad (2.5) $$


β-Divergence

The loss functions mentioned previously are used depending on the context of the problem at hand. However, to avoid using different loss functions for each problem, a more general function that includes all these distance metrics is desired. Such a function is the β-divergence, originally introduced as a power divergence [13]. The β-divergence between two points u and v is defined as

$$ d_\beta(v, u) = \begin{cases} \dfrac{v\left(v^{\beta-1} - u^{\beta-1}\right)}{\beta - 1} - \dfrac{v^{\beta} - u^{\beta}}{\beta}, & \text{for } \beta \notin \{0, 1\}, \\[1ex] \dfrac{v}{u} - \log\left(\dfrac{v}{u}\right) - 1, & \text{for } \beta = 0, \\[1ex] v \log\left(\dfrac{v}{u}\right) - v + u, & \text{for } \beta = 1. \end{cases} \qquad (2.6) $$

By appropriately choosing the β-parameter we obtain:

$$ d_0(v, u) \equiv d_{\mathrm{IS}}(v, u), \qquad (2.7) $$
$$ d_1(v, u) \equiv d_{\mathrm{KL}}(v, u), \qquad (2.8) $$
$$ d_2(v, u) \equiv d_{\mathrm{SE}}(v, u). \qquad (2.9) $$

Accordingly, the β-divergence between two matrices V and U is defined entrywise as

$$ D_\beta(V, U) = \sum_{k=1}^{K} \sum_{n=1}^{N} d_\beta(v_{kn}, u_{kn}). \qquad (2.10) $$

When the β-divergence is used in minimization tasks, the convergence of the multiplicative update rule (see Section 2.1.3) was proven for β ∈ [0, 2] [24, 8, 18]. The β-divergence has a unique minimum at v = u, as shown in Figure 2.1.


Figure 2.1: β-divergence for different values of β for v = 1 (image borrowed from [43])
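For concreteness, the entrywise β-divergence (2.6), summed over all entries as in (2.10), can be written as a short NumPy function. The sketch below is only an illustration of the definition; the `eps` clipping is an added safeguard against zero entries, not part of the definition itself.

```python
import numpy as np

def beta_divergence(V, U, beta, eps=1e-12):
    """Entrywise beta-divergence D_beta(V, U) as in (2.6) and (2.10).

    V and U are nonnegative arrays of equal shape; eps guards the
    logarithms and divisions against exactly zero entries."""
    V = np.maximum(V, eps)
    U = np.maximum(U, eps)
    if beta == 0:                      # Itakura-Saito divergence (2.5)
        d = V / U - np.log(V / U) - 1.0
    elif beta == 1:                    # generalized Kullback-Leibler divergence (2.4)
        d = V * np.log(V / U) - V + U
    else:                              # general case, beta not in {0, 1}
        d = (V * (V**(beta - 1) - U**(beta - 1)) / (beta - 1)
             - (V**beta - U**beta) / beta)
    return d.sum()
```

For β = 2 the general branch reduces to the squared Euclidean distance (2.3), which is a quick sanity check of the formula.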

2.1.3 Multiplicative Factor Updates

Multiplicative factor updates were proposed by Lee and Seung in [23, 24].

They are based on the gradient descent method applied element-wise to the loss function with respect to each factor X (W or H):

$$ X \leftarrow X - \eta_X \circ \nabla_X L(V, WH), \qquad (2.11) $$

where $\eta_X$ is a learning-rate matrix of the same size as X and $\nabla_X L(V, WH)$ is the gradient of the loss function with respect to X. By splitting the gradient into positive and negative parts, i.e.

$$ \nabla_X L(V, WH) = \nabla_X^{+} L(V, WH) - \nabla_X^{-} L(V, WH), \qquad (2.12) $$

and allowing $\eta_X$ to change at every iteration, we can set

$$ \eta_X = \frac{X}{\nabla_X^{+} L(V, WH)}. \qquad (2.13) $$

So the multiplicative update rule reads:

$$ X \leftarrow X \circ \frac{\nabla_X^{-} L(V, WH)}{\nabla_X^{+} L(V, WH)}. \qquad (2.14) $$

The advantage of this form is that there is no need to tune the learning rate $\eta_X$, and the nonnegativity of the elements of the factor X is ensured at each iteration. These multiplicative updates are also known as the heuristic updates in the NMF literature [33, 2, 23, 10].

If $L(V, U) = D_\beta(V, U)$, with U = WH, the convergence of the heuristic updates for NMF was proven in [24, 8, 18] for β ∈ [0, 2]. These are

$$ W \leftarrow W \circ \frac{\left[V \circ U^{\circ(\beta-2)}\right] H^{\mathsf T}}{U^{\circ(\beta-1)} H^{\mathsf T}}, \qquad (2.15) $$

$$ H \leftarrow H \circ \frac{W^{\mathsf T} \left[V \circ U^{\circ(\beta-2)}\right]}{W^{\mathsf T}\, U^{\circ(\beta-1)}}. \qquad (2.16) $$

2.1.4 Convolutional Extension

As shown in Section 2.1.1, the observation V is factorized into a product between the dictionary W and the coefficient matrix H. Should W and/or H evolve in one or two dimensions, one can assume that the current state or value of these matrices is correlated with their past and future states. We can take this into account by replacing the matrix multiplication in our model by a convolution.

Convolution in Time

An extension of NMF, called convolutional nonnegative matrix factorization (CNMF), was introduced by Smaragdis in [37]. In CNMF in time, the model is allowed to evolve over time to track evolution on M consecutive columns of V. The approximation of V becomes

$$ V \simeq U = \sum_{m=0}^{M-1} W(m)\, \overset{m\rightarrow}{H}, \qquad (2.17) $$

where $\overset{m\rightarrow}{X}$ shifts X to the right by inserting m zero-columns at the beginning while maintaining the original size of X. This is equivalent to a sum of I truncated convolutions between the basis matrices $\{W_i\} \in \mathbb{R}_{\geq 0}^{K \times M}$ and their corresponding coefficient vectors $\{h_i\} \in \mathbb{R}_{\geq 0}^{1 \times N}$, i = 1, 2, ..., I,

$$ V \simeq U = \sum_{i=1}^{I} (W_i * h_i), \qquad (2.18) $$

where ∗ denotes convolution in 1D. Note that if M = 1 the factorization reduces to the original non-convolutional NMF.
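A small sketch of the reconstruction in (2.17), assuming the lag-dependent dictionary is stored as a (K, I, M) array so that W[:, :, m] plays the role of W(m); the layout is an implementation choice.

```python
import numpy as np

def shift_right(H, m):
    """The m-> operator of (2.17): insert m zero columns at the front,
    keeping the original width of H."""
    if m == 0:
        return H
    return np.concatenate([np.zeros((H.shape[0], m)), H[:, :-m]], axis=1)

def cnmf_time_reconstruct(W, H):
    """U = sum_m W(m) shift_m(H), with W of shape (K, I, M) and H of shape (I, N)."""
    K, I, M = W.shape
    U = np.zeros((K, H.shape[1]))
    for m in range(M):
        U += W[:, :, m] @ shift_right(H, m)
    return U
```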


Convolution in Time and Frequency

Schmidt and Mørup extended the CNMF algorithm to a 2D version [35].

Now, the coefficients of W and H are allowed to change over time and frequency, thus exploiting the structure in an L × M neighborhood around each element of V:

$$ V \simeq U = \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} \overset{l\downarrow}{W}(m)\, \overset{m\rightarrow}{H}(l), \qquad (2.19) $$

where $\overset{l\downarrow}{X}$ is the zero-fill down-shift operator. This operation can be compared to a sum of I truncated convolutions between the box filters $\{W_i\} \in \mathbb{R}_{\geq 0}^{K \times M}$ and the corresponding coefficient matrices $\{H_i\} \in \mathbb{R}_{\geq 0}^{L \times N}$ for i = 1, 2, ..., I, i.e.

$$ V \simeq U = \sum_{i=1}^{I} (W_i \ast\ast H_i), \qquad (2.20) $$

where ∗∗ denotes convolution in 2D. Note that the original NMF is embedded when L = M = 1.
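The truncated-convolution form (2.20) can be sketched with SciPy's full 2D convolution cropped to the size of V; storing the filters and coefficient matrices as Python lists is an assumption of this sketch, not of the model.

```python
import numpy as np
from scipy.signal import convolve2d

def cnmf_2d_reconstruct(Ws, Hs, K, N):
    """U = sum_i W_i ** H_i as in (2.20), with ** a 2D convolution
    truncated to the K x N size of the observation V.

    Ws is a list of K x M basis filters, Hs a list of L x N coefficient matrices."""
    U = np.zeros((K, N))
    for W_i, H_i in zip(Ws, Hs):
        U += convolve2d(W_i, H_i, mode='full')[:K, :N]
    return U
```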

CNMF Multiplicative Factor Updates

Multiplicative update rules were used under β-divergence, β ∈ [0, 2], for the convolutional extension of NMF with proof of convergence given in [44].

The multiplicative update rules for CNMF in time are:

$$ W(m) \leftarrow W(m) \circ \frac{\left[V \circ U^{\circ(\beta-2)}\right] \overset{m\rightarrow}{H}{}^{\mathsf T}}{U^{\circ(\beta-1)}\, \overset{m\rightarrow}{H}{}^{\mathsf T}}, \qquad (2.21) $$

$$ H \leftarrow H \circ \frac{\sum_m W^{\mathsf T}(m) \left[\overset{m\leftarrow}{V} \circ \overset{m\leftarrow}{U}{}^{\circ(\beta-2)}\right]}{\sum_m W^{\mathsf T}(m)\, \overset{m\leftarrow}{U}{}^{\circ(\beta-1)}}. \qquad (2.22) $$

The multiplicative update rules for CNMF in time and frequency are:

$$ W(m) \leftarrow W(m) \circ \frac{\sum_l \left[\overset{l\uparrow}{V} \circ \overset{l\uparrow}{U}{}^{\circ(\beta-2)}\right] \overset{m\rightarrow}{H}{}^{\mathsf T}(l)}{\sum_l \overset{l\uparrow}{U}{}^{\circ(\beta-1)}\, \overset{m\rightarrow}{H}{}^{\mathsf T}(l)}, \qquad (2.23) $$

$$ H(l) \leftarrow H(l) \circ \frac{\sum_m \overset{l\downarrow}{W}{}^{\mathsf T}(m) \left[\overset{m\leftarrow}{V} \circ \overset{m\leftarrow}{U}{}^{\circ(\beta-2)}\right]}{\sum_m \overset{l\downarrow}{W}{}^{\mathsf T}(m)\, \overset{m\leftarrow}{U}{}^{\circ(\beta-1)}}. \qquad (2.24) $$

2.1.5 Multilayer Extension

A multilayer extension to NMF was proposed in [40, 1, 31] to learn hierarchical features. The basic idea is to perform NMF in several layers.

This way a hierarchical representation of the input can be learned. One can interpret this as a more and more abstract representation of the input with increasing hierarchy levels. With the multilayer extension come new possibilities to update the layers [1, 31]:

1. Update the multilayer NMF by iterating layer by layer;

2. Update the multilayer NMF by propagating the error through all layers (forwardpropagation alias uppropagation [1]);

3. Update the multilayer NMF by iterating layer by layer, then update the model by propagating the error through all layers.

In the first scheme, each layer is optimized separately, beginning from the lowest layer. An advantage of this scheme is the extensibility of the framework: as all parameters of one layer are independent of the parameters of the other layers, another layer can be added without changing the previous layers. The disadvantage of this scheme is that it is suboptimal because in most cases the global minimum of the whole network is not found [1].

In the second scheme, the layers in the network are optimized simultaneously. The advantage of this scheme is the minimization of the overall cost function, which leads to a better reconstruction. The drawback is the introduction of a relation between the layers, which makes the network more difficult to extend [31].


The third scheme combines the two previous schemes. Firstly, each layer in the network is considered as independent from the other layers and optimized in that way. Secondly, the layers are combined and trained simultaneously to achieve a better reconstruction [1].

A variant of the multilayer NMF was implemented using CNMF under β-divergence (β-CNMF) in [43]. The basic relation between two consecutive layers is illustrated in Figure 2.2.

Figure 2.2: Multilayer CNMF

The multilayer β-CNMF performs CNMF sequentially in multiple layers, each layer l having a different dictionary $\{W_i^{(l)}\}^{I^{(l)}}$ and signal representation $\{H_i^{(l)}\}^{I^{(l)}}$. The representation from the lower layer is factorized into the next higher layer's dictionary and new representation. The multilayer CNMF was used in speech enhancement tasks, where the objective was to be more discriminant with each layer, so any potential leakage that could occur between speech and non-speech in one layer would be reduced in the next layer by the use of a different dictionary [43]. This method corresponds to the first scheme discussed previously.

2.2 Feedforward Neural Networks

This section discusses feedforward neural networks, both shallow and deep. The main operations in between the layers that simulate the connections in the brain are discussed. Finally, a concrete example of a deep convolutional neural network, called LeNet-5, is shown.


2.2.1 Overview

Artificial neural networks (ANNs) are computing systems inspired by the biological neural networks that constitute a brain. Such systems learn to perform tasks by considering data examples, generally without being programmed with task-specific rules.

An artificial neural network consists of a collection of simulated neurons. Each neuron is a node which is connected to other nodes via links that correspond to biological axon-synapse-dendrite connections. Each link has a weight that determines the strength of one node’s influence on another node [46]. Figure 2.3 is an illustration of this mechanism for one biological neuron and how it is modelled mathematically.

Figure 2.3: An illustration of a biological neuron and its mathematical model (image borrowed from [41])

An artificial neuron has a body in which computations are performed, and several input channels and one output channel, similar to a real biological neuron.

2.2.2 Main Operations

The neurons are typically organized into multiple layers. In the biological neuron model, the neurons of one layer connect only to neurons of the immediately preceding and immediately following layers. The layer that receives the input data is the input layer. The layer that produces the final result is the output layer. In between the input and output layer can be zero or several hidden layers. Between two layers, various operations are possible:

Linear Transformation

Mathematically, we can think of a fully connected layer as a function that applies a linear transformation W on a vectorial input x of dimension P and outputs a vector y of dimension Q. Usually, the linear operation has a bias parameter b:

$$ y = Wx + b. \qquad (2.25) $$

This layer is the first representation of the axon-synapse-dendrite-neuron connections that happen in the brain, see Figure 2.3.

Activation Function

An activation function is used to simulate the 1-0 impulse carried away from the cell body [45]. In the neuron model, the activation function is used to simulate this property and also to add nonlinearity to the model. The ability of the neural network to approximate any function, and especially a non-convex function, is related to the nonlinear activation function. Figure 2.4 shows three popular activation functions.

Figure 2.4: Three popular activation functions
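As a concrete illustration of the linear transformation (2.25) followed by a nonlinear activation, the snippet below applies a fully connected layer and a ReLU to a random input; the dimensions P and Q are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q = 8, 4                      # input and output dimensions (example values)
W = rng.standard_normal((Q, P))  # weight matrix of the fully connected layer
b = rng.standard_normal(Q)       # bias vector
x = rng.standard_normal(P)       # input vector

y = W @ x + b                    # linear transformation (2.25)
a = np.maximum(y, 0.0)           # ReLU activation applied element-wise
```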

Convolution

Convolutional layers, as the name alludes, convolve the input X with a set of K filters $f_k$ and pass the result to the next layer via $Y_k$. Usually, the convolution operation has a bias parameter $b_k$:

$$ Y_k = f_k * X + b_k, \qquad (2.26) $$

where ∗ denotes convolution. This is similar to the response of a neuron in the visual cortex to a specific stimulus [49]. Each convolutional neuron processes data only for its receptive field. Figure 2.5 shows an example of learned filters representing receptive fields that trigger only for certain patterns in the input image.

Figure 2.5: The filters that represent all the main characteristics of an exemplary image (image borrowed from [32])

Because the weights in the convolution operation are shared over the entire image, the number of free parameters is reduced.

Pooling

The pooling operation is used to reduce the computations and reduce the detail level for the next operation. The pooling operation reduces the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron.

Normalization

The normalization operation consists in shifting inputs to have a zero mean and unit variance. This operation is used to make the inputs of each trainable layer comparable across features.

2.2.3 Architectures

Neural networks are often organized into N layers of neurons and are therefore also called N-layer neural networks. A 1-layer neural network means a network with no hidden layers, i.e. the input layer is directly connected to the output layer. When N ≥ 3, i.e. when two or more layers are hidden, we say that the neural network is deep [34]. Thus a deep neural network is an N-layer neural network with N ≥ 3.


The universal approximation theorem [50] states that a 2-layer neural network can approximate any function over some compact set, provided that it has enough neurons in the hidden layer. Determining the number of neurons in the hidden layer is difficult and computationally expensive.

To improve the performance of a neural network, it is common to make it deeper, adding more hidden layers, to learn more abstract features throughout the hidden layers. It was shown in [11] that deep neural networks outperform shallow neural networks across various tasks and domains and argued that the number of neurons in a shallow neural network grows exponentially with task complexity. So, to be useful, a shallow neural network might need to be very big in terms of the number of neurons — possibly much bigger than a deep neural network.

Training a neural network (shallow or deep) is not much different from training any other machine learning model with gradient descent [11].

Updating each weight in a neural network to minimize a cost function is possible via backpropagation [22]. Backpropagation means computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, and iterating backwards from the last layer to avoid redundant calculations of intermediate terms in the chain rule [48]. It is common to manually explore the space of hyperparameters such as the learning rate or weight regularization, not to mention architectural hyperparameters like the number of hidden layers, to obtain the best performance in terms of both accuracy and training time.

2.2.4 Example

A lot of convolutional architectures have been developed from the 1990s onwards. In this section, I present a well-known deep convolutional neural network developed by Yann LeCun, LeNet-5 [22]. Figure 2.6 shows the architecture of LeNet-5.


Figure 2.6: Architecture of LeNet-5, a convolutional neural network for digits recognition (image borrowed from [22])

This kind of architecture is one of the first successful applications of convolutional neural networks (CNNs). LeNet-5 was first used to read zip codes and digits on bank cheques in the United States and then used in other classification tasks: it was able to achieve an error rate below 1% on the MNIST dataset [21], which was very close to the state of the art at that time.

By modern standards, LeNet-5 is a very simple network. It only has 7 layers: three convolutional layers (C1, C3 and C5), two subsampling (pooling) layers (S2 and S4), and one fully connected layer (F6), followed by the final output layer. Convolutional layers use 5-by-5 convolutions with stride 1. Sub-sampling layers are 2-by-2 average pooling layers. The hyperbolic tangent is used as the activation function after the convolutional layers C1, C3 and C5 and the fully connected layer F6.

Each filter from the convolution operations represents a receptive field that is sensitive to a particular area of the input digits. The fully connected layers at the end of the network play the role of a classifier that uses the patterns learned from the convolutions to determine the digit. LeNet-5 can be seen in action at [12, 20].
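A compact PyTorch sketch of the layer sequence described above (5-by-5 convolutions, 2-by-2 average pooling, tanh activations, 32-by-32 single-channel input as in [22]); details such as the original trainable subsampling coefficients and the RBF output layer are simplified here.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Simplified LeNet-5: C1-S2-C3-S4-C5-F6-output with tanh activations."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.c1 = nn.Conv2d(1, 6, kernel_size=5)     # 32x32 -> 28x28x6
        self.s2 = nn.AvgPool2d(2)                    # -> 14x14x6
        self.c3 = nn.Conv2d(6, 16, kernel_size=5)    # -> 10x10x16
        self.s4 = nn.AvgPool2d(2)                    # -> 5x5x16
        self.c5 = nn.Conv2d(16, 120, kernel_size=5)  # -> 1x1x120
        self.f6 = nn.Linear(120, 84)
        self.out = nn.Linear(84, num_classes)

    def forward(self, x):
        x = torch.tanh(self.c1(x))
        x = self.s2(x)
        x = torch.tanh(self.c3(x))
        x = self.s4(x)
        x = torch.tanh(self.c5(x)).flatten(1)
        x = torch.tanh(self.f6(x))
        return self.out(x)

logits = LeNet5()(torch.randn(1, 1, 32, 32))   # one 32x32 grayscale digit
```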

2.3 Comparison

In this section, a comparison of NMF and a neural network is made. Firstly, a comparison of the architecture, constraints and optimization between NMF and a neural network is done. Secondly, the findings from the comparison are summarized in Table 2.1 at the end of the section.


Architecture

Neural networks, usually being feedforward, and NMF, being feedbackward, have different architectures, which can easily be shown using nodal analysis. In the NMF case, there is no input layer, the coefficient matrix H plays the role of the hidden layer, the dictionary W contains the linear weights and U is the approximation of the visible data V. One can view the nodal architecture of NMF as the decoder part of a shallow autoencoder. A nodal representation of both architectures is shown in Figure 2.7.

(a) Neural Network (b) NMF

Figure 2.7: Nodal architecture of a shallow autoencoder and NMF

Different steps were taken to make NMF resemble a neural network, such as projective NMF [51], the mini-batch method [36] or the semi-NMF neural network alternative [38, 39] ($W \in \mathbb{R}$ and $H \in \mathbb{R}_{\geq 0}$), but none of them combine the properties of neural networks with those of NMF in a proper sense.

A major difference between the neural network and NMF architectures is that the nodal interpretation of NMF is not a feedforward architecture. NMF has an inverse structure, meaning that the hidden states are adjusted to the outputs. In the case of deep convolutional networks, the convolution operation is carried out in different ways. More precisely, in a modern deep neural network architecture, the set of convolutional filters is shared across all inputs of one layer to generate the input of the next layer. In multilayer NMF [43], each input has its own set of filters to generate the input for the next layer. Figure 2.8 contrasts the convolution operation between the lth and (l + 1)th layer for both methods.


(a) CNN (b) CNMF

Figure 2.8: Comparison of the convolution operation in a neural network and NMF

It should be noted that the different operations (matrix multiplication and convolution) in between the layers are carried out without the bias vector in NMF and its variants.

Constraints

Although it is possible to impose hard constraints on a convolutional neural network, it is more common to have soft constraints such as weight regularization or a constraint violation penalty. An auxiliary regularization term R(W) or R(H), corresponding to the constraint, is introduced and added to the cost J on top of the general loss function $L(X, \hat{X})$:

$$ J = L(X, \hat{X}) + R(\cdot). \qquad (2.27) $$

The natural constraint for NMF is nonnegativity. Nonnegativity is ensured by the multiplicative updates [23, 24] or by other methods like projected gradient [25]. It is possible to add more constraints to NMF, such as sparsity [16] or graph regularization [30]. The constraints for multiplicative updates are incorporated in the form

$$ X \leftarrow X \circ \frac{\nabla_X^{-} L(V, U) + \nabla_X^{-} R(\cdot)}{\nabla_X^{+} L(V, U) + \nabla_X^{+} R(\cdot)} \qquad (2.28) $$

to ensure nonnegativity.


Optimization Methods

As seen in Sections 2.1 and 2.2, neural networks and NMF are trained using a variant of gradient descent. While neural networks use backpropagation and adjust the weights, NMF commonly uses alternating multiplicative updates. It is also common in NMF to normalize the basis vectors:

$$ V \simeq U = WH = WDD^{-1}H, \qquad (2.29) $$

where D is a diagonal matrix,

$$ D = \operatorname{diag}\left(\|W_1\|_p^{-1}, \|W_2\|_p^{-1}, \ldots, \|W_I\|_p^{-1}\right), \qquad (2.30) $$

where $W_i$ denotes the ith column of W and

$$ \|W_i\|_p = \left(\sum_{k=1}^{K} w_{ki}^{p}\right)^{\frac{1}{p}}. \qquad (2.31) $$

It should be noted that introducing D does not change the behavior of NMF. Weight normalization would be possible in a neural network only if the network was linear.
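The normalization in (2.29)–(2.31) can be applied after every update without changing the product WH; a small NumPy sketch, with p = 2 chosen for the example:

```python
import numpy as np

def normalize_bases(W, H, p=2):
    """Rescale each column of W to unit p-norm and compensate in H,
    so that W @ H is unchanged, cf. (2.29)-(2.31)."""
    norms = np.sum(np.abs(W)**p, axis=0)**(1.0 / p)   # ||W_i||_p per column
    norms = np.maximum(norms, 1e-12)                  # guard against zero columns
    return W / norms, H * norms[:, None]
```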

Summary

Even if there are significant differences between NMF and neural networks, the fact that they both use gradient descent and can be represented as a nodal architecture is an encouragement to develop an architecture that combines the strengths of neural networks and NMF. Table 2.1 provides a brief summary of the findings from the comparison.

                             ANN                     NMF
Data flow                    Feedforward             Feedbackward
Transformation               Nonlinear               Linear
Domain of weights            ℝ                       ℝ≥0
Filters in multilayer case   Shared                  Not shared
Update rules                 Additive                Multiplicative
Nonlinearity                 Activation functions    Nonnegativity constraint
Optimization                 Backpropagation         Forwardpropagation

Table 2.1: Summary of the comparison


Multilayer β-CNMF

In this chapter, a multilayer CNMF based on a convolutional data model in 2D with forwardpropagation [1] is developed. In Section 3.1, a summary of CNMF under the β-divergence is given. A multilayer extension, which generalizes the work in [1, 43], is elaborated in Section 3.2.

3.1 β-CNMF

As mentioned in Section 2.1.5, a convolutional extension of NMF has been implemented in [37, 35] for different cost functions. To avoid implementing several algorithms, each one minimizing a different cost function, new update rules were proposed based on β-divergence in [44], correcting the errors in the previous publications.

The general form of NMF when using a 2D convolutional model is

$$ V \simeq U = \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} \overset{l\downarrow}{W}(m)\, \overset{m\rightarrow}{H}(l). \qquad (3.1) $$

From (3.1), one can see that NMF and the CNMF proposed in [37] and [23, 24] are special cases of CNMF in 2D. As a matter of fact, (3.1) is equivalent to NMF for L = M = 1, and (3.1) becomes CNMF in time [37] for L = 1. Therefore, for the rest of the thesis, CNMF stands for the general form shown in (3.1).

The update rules for CNMF under β-divergence (β-CNMF) are given in (2.23) and (2.24). Note that the update rules for NMF, (2.15) and (2.16), are obtained by choosing L = M = 1.


3.2 Multilayer β-CNMF

The idea of using NMF in multiple layers can be found in the existing literature. In [40, 1, 31], NMF was extended to multiple layers with different cost functions. In [43, 4], the convolution operation was used in between layers, resulting in a multilayer CNMF. However, no formulation as a single optimization problem was proposed for the multilayer CNMF.

In this section, an approach similar to the forwardpropagating algorithm from [1] is developed for multilayer CNMF.

3.2.1 Data Model

Following the idea from [43], CNMF is employed in each layer independently:

$$ V \simeq U = \sum_{i=1}^{I^{(1)}} W_i^{(1)} \ast\ast H_i^{(1)} \;\rightsquigarrow\; H_i^{(1)} = \sum_{j=1}^{I^{(2)}} W_j^{(2)} \ast\ast H_j^{(2)} \;\rightsquigarrow\; \ldots, \qquad (3.2) $$

where $\{W_i^{(l)}\}$ is a set of basis filters in the lth layer for the corresponding coefficient matrices $\{H_i^{(l)}\}$, i = 1, 2, ..., $I^{(l)}$, and l = 1, 2, ..., L, where L is the depth of the network and ∗∗ denotes convolution as defined in Section 2.1.4.

In each layer l ≤ L, the coefficient matrices $\{H_i^{(l)}\}$ are transformed into new representations. The idea is that each new representation will be more discriminant than the previous representation, so any similarity between components in one layer can potentially be resolved in the next higher layer with the employment of a more specific dictionary. Therefore, better performance for specific tasks such as signal enhancement can be achieved.

3.2.2 Forwardpropagation

In [43], a multilayer CNMF is trained layer-by-layer, which is suboptimal.

To tackle this shortcoming, a new algorithm inspired by [1] is developed.

The entire network is trained by propagating the error through all the layers, starting with the deepest hidden layer.

To extend the β-CNMF to the multilayer case, let us introduce two new matrices A and B to the update rule:

$$ W_i \leftarrow W_i \circ \frac{A \star\star H_i}{B \star\star H_i}, \qquad (3.3) $$

$$ H_i \leftarrow H_i \circ \frac{W_i \star\star A}{W_i \star\star B}, \qquad (3.4) $$

where ⋆⋆ denotes cross-correlation in 2D and with

$$ A = V \circ U^{\circ(\beta-2)}, \qquad (3.5) $$

$$ B = U^{\circ(\beta-1)}. \qquad (3.6) $$

In the multilayer case, if we obtain $A_i^{(l)}$ and $B_i^{(l)}$ in the lth layer of the network, the update rules become

$$ W_i^{(l)} \leftarrow W_i^{(l)} \circ \frac{A_i^{(l)} \star\star H_i^{(l)}}{B_i^{(l)} \star\star H_i^{(l)}}, \qquad (3.7) $$

$$ H_i^{(L)} \leftarrow H_i^{(L)} \circ \frac{W_i^{(L)} \star\star A_i^{(L)}}{W_i^{(L)} \star\star B_i^{(L)}}, \qquad (3.8) $$

where $A_i^{(l)}$ and $B_i^{(l)}$ are computed using the chain rule:

$$ A_i^{(l+1)} = W_i^{(l)\mathsf T} \star\star A_i^{(l)}, \qquad (3.9) $$

$$ B_i^{(l+1)} = W_i^{(l)\mathsf T} \star\star B_i^{(l)}, \qquad (3.10) $$

where $X_i^{(l)\mathsf T}$ denotes the transpose of $X_i^{(l)}$, for l = 1, 2, ..., L, and

$$ A_i^{(1)} = A = V \circ U^{\circ(\beta-2)}, \qquad (3.11) $$

$$ B_i^{(1)} = B = U^{\circ(\beta-1)}. \qquad (3.12) $$
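To make the flow of (3.7)–(3.12) concrete, the sketch below uses the non-convolutional special case (one component per layer and 1×1 filters, so the 2D cross-correlations reduce to matrix products and the model collapses to V ≈ W⁽¹⁾W⁽²⁾⋯W⁽ᴸ⁾H⁽ᴸ⁾). It is meant only to illustrate how A and B are propagated; the full convolutional version follows the same pattern.

```python
import numpy as np

def forwardprop_update(V, Ws, H, beta, eps=1e-12):
    """One forwardpropagated update of V ~ W1 W2 ... WL H (1x1-filter sketch
    of Section 3.2.2).  Ws is a list of L dictionaries, H is the deepest
    coefficient matrix."""
    L = len(Ws)
    U = np.linalg.multi_dot(Ws + [H])                 # reconstruction through all layers
    # Layer-1 terms (3.11)-(3.12), then the chain rule (3.9)-(3.10)
    A = [V * U**(beta - 2)]
    B = [U**(beta - 1)]
    for l in range(L - 1):
        A.append(Ws[l].T @ A[l])
        B.append(Ws[l].T @ B[l])
    # Multiplicative updates (3.7) for every dictionary ...
    for l in range(L):
        # effective coefficient matrix seen by layer l: W_{l+1} ... W_L H
        H_l = np.linalg.multi_dot(Ws[l + 1:] + [H]) if l < L - 1 else H
        Ws[l] = Ws[l] * (A[l] @ H_l.T) / (B[l] @ H_l.T + eps)
    # ... and (3.8) for the deepest coefficient matrix
    H = H * (Ws[-1].T @ A[-1]) / (Ws[-1].T @ B[-1] + eps)
    return Ws, H

# Usage sketch: a 2-layer model on random nonnegative data
rng = np.random.default_rng(0)
V = rng.random((64, 200)) + 1e-3
Ws, H = [rng.random((64, 32)), rng.random((32, 8))], rng.random((8, 200))
for _ in range(200):
    Ws, H = forwardprop_update(V, Ws, H, beta=1.0)
```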

Simulation

As in [43], a 2-layer β-CNMF is implemented with the same parameters. This model is trained twice: once without using forwardpropagation, i.e. the model is trained layer-by-layer, and once using forwardpropagation, i.e. the model is trained in a single optimization task.

The models are trained under β-divergence with β = 0.5 (arbitrarily chosen) for 500 iterations on a piece of audio shown in Figure 3.1. The corresponding approximations are shown in Figure 3.2.


Figure 3.1: Original spectrogram (V)

Figure 3.2: Reconstruction (U) without and with forwardpropagation. The corresponding reconstruction error using β-divergence after 500 iterations is 0.10 and 0.11, respectively.

It can be seen from Figure 3.2 that both architectures output a “good” reconstruction, with similar reconstruction errors. However, using forwardpropagation the network was able to achieve this reconstruction with fewer computations (17856 components updated per iteration instead of 33088 for the architecture from [43]), which can speed up the training. As deep networks are prone to getting stuck in a local minimum, the layer-by-layer optimization can be used as a pre-training step to provide a good starting point for the forwardpropagation, as in [14].


Deep Convolutional Nonnegative Autoencoders

In this chapter, new variants of a nonnegative autoencoder (NAE) are developed by incorporating the underlying ideas of NMF, CNMF and neural networks. Section 4.1 introduces the concept of shallow NAEs.

In Section 4.2, a convolutional NAE (CNAE) is developed. Section 4.3 concludes the chapter with a deep convolutional NAE (DCNAE) with three hidden layers.

4.1 Nonnegative Autoencoders

In this section, the concept of nonnegative autoencoders (NAEs) is presented. Firstly, it is shown how to reinterpret NMF to convert it into a neural network. Secondly, an architecture is proposed and the constraints are defined. Then, the update rules are derived to enforce nonnegativity.

Finally, the new architecture is simulated on a toy example and compared to NMF on a more complex dataset.

4.1.1 Concept

The purpose of an autoencoder is to learn a representation for a set of data, typically for dimensionality reduction [47]. NMF and autoencoders are suited for the same goal: dimensionality reduction. Moreover, we have seen in Section 2.3 that the nodal representation of NMF can be interpreted as the decoder part of an autoencoder, see Figure 2.7. Therefore, an autoencoder seems to be the right neural network architecture to bridge the gap between neural networks and NMF.

4.1.2 Architecture

Let us reinterpret NMF as a linear autoencoder. The obvious formulation is

$$ V \simeq U = WH = WW^{+}V \qquad (4.1) $$

with $H = W^{+}V$ representing the hidden layer and $U = WH$ representing the output layer, respectively. We further enforce the constraints $W \in \mathbb{R}_{\geq 0}$ and $H \in \mathbb{R}_{\geq 0}$. The nonnegative matrices W and H would correspond to their namesakes in NMF, whereas the matrix $W^{+} \in \mathbb{R}_{\geq 0}$ is some sort of pseudo-inverse of W that produces a nonnegative H.

The shallow NAE has the same formula

$$ V \simeq U = WW^{+}V \qquad (4.2) $$

with $V, U \in \mathbb{R}_{\geq 0}^{K \times N}$, $W \in \mathbb{R}_{\geq 0}^{K \times I}$ and $W^{+} \in \mathbb{R}_{\geq 0}^{I \times K}$, where I is the rank of the factorization.

In the autoencoder terminology, the first layer weights $W^{+}$ are referred to as the encoder, which produces a code representing the input, and the weights W are referred to as the decoder, which uses the code to reconstruct the input. The shallow NAE is trained to minimize the reconstruction error

$$ L(V, U) \equiv D_\beta(V, U), \qquad (4.3) $$

which is equal to the entrywise β-divergence.

4.1.3 Derivation

Like most machine learning techniques, gradient descent is used to train the shallow NAE. The gradients of L(V, U) with respect to the weights W and $W^{+}$ are computed with L(V, U) defined in (4.3):

$$ \nabla_{W} L(V, U) = \left(U^{\circ(\beta-1)} - \left[V \circ U^{\circ(\beta-2)}\right]\right) V^{\mathsf T} W^{+\mathsf T}, \qquad (4.4) $$

$$ \nabla_{W^{+}} L(V, U) = W^{\mathsf T} \left(U^{\circ(\beta-1)} - \left[V \circ U^{\circ(\beta-2)}\right]\right) V^{\mathsf T}. \qquad (4.5) $$

From (4.4) and (4.5), we can see that $\nabla_{W} L(V, U)$ and $\nabla_{W^{+}} L(V, U)$ can be written in the form

$$ \nabla_{W} L(V, U) = \nabla_{W}^{+} L(V, U) - \nabla_{W}^{-} L(V, U), \qquad (4.6) $$

$$ \nabla_{W^{+}} L(V, U) = \nabla_{W^{+}}^{+} L(V, U) - \nabla_{W^{+}}^{-} L(V, U) \qquad (4.7) $$

with

$$ \nabla_{W}^{+} L(V, U) = U^{\circ(\beta-1)} V^{\mathsf T} W^{+\mathsf T}, \qquad (4.8) $$

$$ \nabla_{W}^{-} L(V, U) = \left[V \circ U^{\circ(\beta-2)}\right] V^{\mathsf T} W^{+\mathsf T}, \qquad (4.9) $$

$$ \nabla_{W^{+}}^{+} L(V, U) = W^{\mathsf T} U^{\circ(\beta-1)} V^{\mathsf T}, \qquad (4.10) $$

$$ \nabla_{W^{+}}^{-} L(V, U) = W^{\mathsf T} \left[V \circ U^{\circ(\beta-2)}\right] V^{\mathsf T}. \qquad (4.11) $$

By using the gradient descent algorithm

$$ W \leftarrow W - \eta_{W} \circ \nabla_{W} L(V, U), \qquad (4.12) $$

$$ W^{+} \leftarrow W^{+} - \eta_{W^{+}} \circ \nabla_{W^{+}} L(V, U), \qquad (4.13) $$

and setting

$$ \eta_{W} = \frac{W}{\nabla_{W}^{+} L(V, U)}, \qquad (4.14) $$

$$ \eta_{W^{+}} = \frac{W^{+}}{\nabla_{W^{+}}^{+} L(V, U)}, \qquad (4.15) $$

we obtain the following multiplicative updates:

$$ W \leftarrow W \circ \frac{\left[V \circ U^{\circ(\beta-2)}\right] V^{\mathsf T} W^{+\mathsf T}}{U^{\circ(\beta-1)} V^{\mathsf T} W^{+\mathsf T}}, \qquad (4.16) $$

$$ W^{+} \leftarrow W^{+} \circ \frac{W^{\mathsf T} \left[V \circ U^{\circ(\beta-2)}\right] V^{\mathsf T}}{W^{\mathsf T} U^{\circ(\beta-1)} V^{\mathsf T}}. \qquad (4.17) $$
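A minimal NumPy sketch of the updates (4.16)–(4.17) for the shallow NAE; recomputing U between the two updates and the eps guards in the denominators are implementation choices of this sketch.

```python
import numpy as np

def nae_update(V, W, Wp, beta, eps=1e-12):
    """One pass of the multiplicative updates (4.16)-(4.17) for the
    shallow NAE  V ~ U = W Wp V  (Wp plays the role of W+)."""
    U = W @ (Wp @ V)
    A = V * U**(beta - 2)               # "negative" gradient part
    B = U**(beta - 1)                   # "positive" gradient part
    W *= (A @ V.T @ Wp.T) / (B @ V.T @ Wp.T + eps)      # (4.16)
    U = W @ (Wp @ V)                    # recompute with the new decoder
    A = V * U**(beta - 2)
    B = U**(beta - 1)
    Wp *= (W.T @ A @ V.T) / (W.T @ B @ V.T + eps)       # (4.17)
    return W, Wp
```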

4.1.4 Simulation

In this section, the shallow NAE is implemented and trained on an artificially created example. The simulation helps us understand the nature of the weights and activations learned by the network. The decoder part of the NAE (activations, weights and output) is then compared to an NMF implementation on a more complex dataset.


Input

The input, shown in Figure 4.1, is a 30-by-90 image composed of two horizontal (1-by-10) and two vertical (10-by-1) lines.


Figure 4.1: Input used to analyze the behavior of the NAE (black = 1, white = 0)

Data Model

As the input is of dimension 30-by-90 and is composed of only two distinct components (horizontal and vertical lines), the rank of factorization, i.e. the dimensionality parameter, is set to 2. Therefore, the shallow NAE has the following formula:

$$ V \simeq U = WW^{+}V \qquad (4.18) $$

with $V, U \in \mathbb{R}_{\geq 0}^{K \times N}$, $W \in \mathbb{R}_{\geq 0}^{K \times I}$ and $W^{+} \in \mathbb{R}_{\geq 0}^{I \times K}$, and K = 30, N = 90 and I = 2. The model is trained to minimize the loss

$$ L(V, U) \equiv D_{\beta=2}(V, U). \qquad (4.19) $$
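Tying this data model to the `nae_update` sketch from Section 4.1.3 above, the toy input can be built and factorized as follows; the exact line positions are illustrative only, since the precise coordinates of Figure 4.1 are not stated in the text.

```python
import numpy as np

# 30-by-90 toy image with two horizontal (1x10) and two vertical (10x1) lines;
# the positions below are hypothetical, chosen to resemble Figure 4.1.
V = np.zeros((30, 90))
V[24, 10:20] = 1.0     # first horizontal line
V[24, 60:70] = 1.0     # second horizontal line
V[10:20, 35] = 1.0     # first vertical line
V[10:20, 80] = 1.0     # second vertical line

# Shallow NAE with rank I = 2, trained with the multiplicative updates above
rng = np.random.default_rng(0)
K, N, I = 30, 90, 2
W, Wp = rng.random((K, I)), rng.random((I, K))
for _ in range(500):
    W, Wp = nae_update(V, W, Wp, beta=2.0)   # beta = 2: squared Euclidean loss
H = Wp @ V                                   # activations of the hidden layer
```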

Results

The weights from the decoder W and the activations in the hidden layer $H = W^{+}V$ are extracted from the trained architecture. The results are shown in Figure 4.2.


Figure 4.2: Output, weights and activations extracted from the decoder part of the NAE

From the figure above one can see that the NAE has successfully learned the basic elements present in the input image. The weights W contain the basis vectors needed to explain the input data, i.e. a vertical line at the centre of the y-axis (1st component) and a dot on the 25th pixel of the y-axis (2nd component). The activations tell where the basis vectors are located on the x-axis.

Comparison with NMF

To compare the NAE with NMF, each architecture was trained on a subset (5000 images) of the MNIST dataset [21]. For the comparison to be fair, each architecture was designed to have approximately the same number of adjustable variables, so that each architecture consumes the same energy in a given time. The rank of the factorization for NMF was chosen using singular value decomposition (SVD) such that 95% of the energy was preserved, while the rank of factorization for the NAE was adjusted to have a comparable number of free parameters. Table 4.1 shows the parameterization of the methods, where the number of parameters for the NAE is the number of elements in the weight matrices (2 × K × I) and the number of parameters for NMF is the number of elements in the dictionary W (K × I) and the coefficient matrix H (I × N).


Parameters                  NAE          NMF
Input dimension (K × N)     784 × 5000   784 × 5000
Rank (I)                    440          120
Number of parameters        689920       694080

Table 4.1: Model Parameters – NAE vs NMF

Figure 4.3 shows the reconstruction of some training data after 100 iterations.

Figure 4.3: Reconstruction of some training data after 100 iterations. Reconstruction error: 0.004, 0.010

Figure 4.4 shows some weights learned by each model.

Figure 4.4: 16 random weights learned


It can be seen in Figure 4.3 that the NAE and NMF give a good reconstruction of the input digits. However, the digits reconstructed with the NAE are less blurred than the ones reconstructed with NMF because of the larger latent space dimension. The weights learned by the models (see Figure 4.4) are different: the ones learned by the NAE are sparser and resemble small receptive fields, while the weights learned by NMF resemble parts of digits. By learning sparse weights, the network is able to achieve comparable performance with fewer computations [42, 26].

4.2 Convolutional Nonnegative Autoencoders

In this section, the shallow NAE is extended to a shallow convolutional nonnegative autoencoder (CNAE). Firstly, CNMF is reinterpreted as a nonnegative convolutional neural network. Secondly, an architecture is proposed and the constraints are defined. Then, the update rules are derived. Finally, the new architecture is simulated on a toy example and compared to CNMF on a more complex dataset.

4.2.1 Concept

As with NMF, an autoencoder seems to be the right neural network architecture for the reinterpretation of CNMF. This time, the matrix product between the layers is replaced by a convolution. As mentioned in Section 2.2.2, the convolution helps us find important patterns with a certain structure in the data.

4.2.2 Architecture

To reinterpret CNMF as a linear convolutional autoencoder, the I different signal representations $H_i$ are stacked together such that $\{H_i\}^{I}$ can be interpreted as a signal with I channels. With this kind of signal representation, the convolution used in [41, 49] can be used with the CNMF model. The obvious formulation to reinterpret CNMF as a linear convolutional autoencoder is

$$ V \simeq U = \sum_i W_i \ast\ast H_i = \sum_i W_i \ast\ast \left(W_i^{+} \star\star V\right) \qquad (4.20) $$

with $H_i = W_i^{+} \star\star V$ representing the hidden layer and $U = \sum_i W_i \ast\ast H_i$ representing the output layer, respectively. The nonnegativity constraint
