
Deep Convolutional

Nonnegative Autoencoders

YANN DEBAIN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

(2)

Nonnegative Autoencoders

YANN DEBAIN

Master in Information and Network Engineering
Supervisor: Stanislaw Gorlow, PhD

Examiner: Saikat Chatterjee, PhD

School of Electrical Engineering and Computer Science
Host company: Dolby Sweden AB

Email: debain@kth.se


In this thesis, nonnegative matrix factorization (NMF) is viewed as a feedbackward neural network and generalized to a deep convolutional architecture with forwardpropagation under β-divergence. NMF and feedforward neural networks are put in relation and a new class of autoencoders is proposed, namely the nonnegative autoencoders. It is shown that NMF is essentially the decoder part of an autoencoder with nonnegative weights and input. The shallow autoencoder with fully connected neurons is extended to a deep convolutional autoencoder with the same properties. Multiplicative factor updates are used to ensure nonnegativity of the weights in the network. As a result, a shallow nonnegative autoencoder (NAE), a shallow convolutional nonnegative autoencoder (CNAE) and a deep convolutional nonnegative autoencoder (DCNAE) are developed. Finally, all three variants of the nonnegative autoencoder are tested on different tasks, such as signal reconstruction and signal enhancement.


In this report, nonnegative matrix factorization (NMF) is viewed as a feedbackward neural network. NMF is generalized to a deep convolutional architecture with forwardpropagation and β-divergence. NMF and feedforward neural networks are compared, and a new type of autoencoder is presented, called the nonnegative autoencoder. NMF is viewed as the decoder part of an autoencoder with nonnegative weights and input. The shallow, fully connected autoencoder is extended to a deep convolutional autoencoder with nonnegative weights and input. In this report, a shallow nonnegative autoencoder (NAE), a shallow convolutional nonnegative autoencoder (CNAE) and a deep convolutional nonnegative autoencoder (DCNAE) were developed. Finally, the three variants of the nonnegative autoencoder are tested on a few different tasks, such as signal reconstruction and signal enhancement.


I want to thank my supervisors for the freedom and advice they shared during the development of this project. From KTH’s side, thank you, Saikat Chatterjee, for the freedom you gave me during this thesis. From Dolby’s side, thank you Stanislaw Gorlow and Pedro J. Villasana T. for all those nice conversations regarding our ideas. I also want to thank all the people at Dolby for helping me during the thesis, the discussions we had during lunch or fika, and all our foosball games. I also thank Grenoble INP – Phelma for letting me do this thesis. Finally, I want to thank my parents for supporting me during the thesis.


Contents vi

List of Figures viii

List of Tables xi

1 Introduction 1

1.1 Motivation . . . . 2

1.2 Objective . . . . 2

1.3 Contributions . . . . 2

1.4 Notation . . . . 3

2 Background 5

2.1 Nonnegative Matrix Factorization . . . . 5

2.1.1 Overview . . . . 5

2.1.2 Loss Function . . . . 6

2.1.3 Multiplicative Factor Updates . . . . 8

2.1.4 Convolutional Extension . . . . 9

2.1.5 Multilayer Extension . . . . 11

2.2 Feedforward Neural Networks . . . . 12

2.2.1 Overview . . . . 13

2.2.2 Main Operations . . . . 13

2.2.3 Architectures . . . . 15

2.2.4 Example . . . . 16

2.3 Comparison . . . . 17

3 Multilayer β-CNMF 21

3.1 β-CNMF . . . . 21

3.2 Multilayer β-CNMF . . . . 22

3.2.1 Data Model . . . . 22


3.2.2 Forwardpropagation . . . . 22

4 Deep Convolutional Nonnegative Autoencoders 25

4.1 Nonnegative Autoencoders . . . . 25

4.1.1 Concept . . . . 25

4.1.2 Architecture . . . . 26

4.1.3 Derivation . . . . 26

4.1.4 Simulation . . . . 27

4.2 Convolutional Nonnegative Autoencoders . . . . 31

4.2.1 Concept . . . . 31

4.2.2 Architecture . . . . 31

4.2.3 Derivation . . . . 32

4.2.4 Simulation . . . . 33

4.3 Deep Convolutional Nonnegative Autoencoders . . . . . 37

4.3.1 Concept . . . . 38

4.3.2 Architecture . . . . 38

4.3.3 Derivation . . . . 38

4.3.4 Simulation . . . . 40

4.4 Conclusion . . . . 45

5 Experiments 47

5.1 Signal Reconstruction . . . . 47

5.1.1 Scope . . . . 47

5.1.2 Input . . . . 50

5.1.3 Results . . . . 50

5.2 Signal Enhancement . . . . 60

5.2.1 Scope . . . . 60

5.2.2 Data . . . . 62

5.2.3 Results . . . . 64

5.3 Conclusion . . . . 67

6 Take-Aways 69

6.1 Conclusions . . . . 69

6.2 Open Questions . . . . 70

Bibliography 71


List of Figures

2.1 β-divergence for different values of β for v = 1 (image borrowed from [43]) . . . . 8
2.2 Multilayer CNMF . . . . 12
2.3 An illustration of a biological neuron and its mathematical model (image borrowed from [41]) . . . . 13
2.4 Three popular activation functions . . . . 14
2.5 The filters that represent all the main characteristics of an exemplary image (image borrowed from [32]) . . . . 15
2.6 Architecture of LeNet-5, a convolutional neural network for digits recognition (image borrowed from [22]) . . . . 17
2.7 Nodal architecture of a shallow autoencoder and NMF . . . . 18
2.8 Comparison of the convolution operation in a neural network and NMF . . . . 19
3.1 Original spectrogram (V) . . . . 24
3.2 Reconstruction (U) without and with forwardpropagation. The corresponding reconstruction error using β-divergence after 500 iterations is 0.10 and 0.11, respectively . . . . 24
4.1 Input used to analyze the behavior of the NAE (black = 1, white = 0) . . . . 28
4.2 Output, weights and activations extracted from the decoder part of the NAE . . . . 29
4.3 Reconstruction of some training data after 100 iterations. Reconstruction error: 0.004, 0.010 . . . . 30
4.4 16 random weights learned . . . . 30
4.5 Input used to analyze the behavior of the CNAE (black = 1, white = 0) . . . . 33
4.6 Results for convolution in x . . . . 34
4.7 Results for convolution in x and y . . . . 35
4.8 Reconstruction of one image from the MNIST dataset. Reconstruction error: 0.31, 0.29 . . . . 36
4.9 The patterns learned to reconstruct the reference image . . . . 37
4.10 Network architecture . . . . 40
4.11 Layer 2 of the decoder . . . . 41
4.12 Layer 1 of the decoder . . . . 41
4.13 Reconstruction of one image in the MNIST dataset with DCNAE and multilayer CNMF. Reconstruction error: 0.18, 0.27 . . . . 43
4.14 Reconstruction of one image from the MNIST dataset with CNAE and CNMF . . . . 44
4.15 The patterns learned by the filters to reconstruct the reference image - 1 . . . . 44
4.16 The patterns learned by the filters to reconstruct the reference image - 2 . . . . 45
5.1 Block diagram for a coding task . . . . 48
5.2 Input spectrogram (sound given by Dolby) . . . . 50
5.3 Weights and reconstruction using NAE . . . . 51
5.4 Weights and reconstruction using CNAE . . . . 53
5.5 Convolution of $W_i$ with $W_i^{+}$, for i = 1, 2, ..., 4 . . . . 53
5.6 Weights and reconstruction using DCNAE . . . . 55
5.7 Comparison between multiplicative update rule and projected gradient descent . . . . 56
5.8 Comparison between multiplicative update rules and projected gradient descent after 500 iterations . . . . 57
5.9 Comparison between NAEs and real-valued AEs: (a) NAE vs AE (η = 0.1), (b) CNAE vs CAE (η = 0.01), (c) DCNAE vs DCAE (η = 0.02) . . . . 58
5.10 Comparison between NAEs and real-valued AEs after 500 iterations . . . . 59
5.11 Block diagram for signal enhancement/source separation task . . . . 61
5.12 Mixture (sound given by Dolby) . . . . 63
5.13 Speech and non-speech magnitude spectrograms (sounds given by Dolby) . . . . 63
5.14 Speech reference vs semi-supervised speech enhancement results . . . . 65
5.15 Speech reference vs supervised speech enhancement results . . . . 66
5.16 Deep NMF vs DCNAE . . . . 67


List of Tables

2.1 Summary of the comparison . . . . 20
4.1 Model Parameters – NAE vs NMF . . . . 30
4.2 Model Parameters – CNAE/CNMF . . . . 36
4.3 Model Parameters – DCNAE/Multilayer CNMF . . . . 43
5.1 DCNAE architecture . . . . 49


Introduction

Deep learning has emerged as one of the most successful machine learning methods that have outperformed previous state-of-the-art results for speech recognition and computer vision, among others. The success of deep learning is due to the increase in computing power, the availability of massive amounts of data and the development of new computing systems.

However, some of the drawbacks of deep neural networks are that they require massive amounts of data, their feature representations are hard to interpret and they are difficult to analyze mathematically.

Other techniques for compact data representation and feature extraction, such as nonnegative matrix factorization (NMF), combine data modelling with non-convex optimization to learn interpretable and consistent features from data. Previously criticized for its computational complexity, NMF became extremely popular when an efficient algorithm to solve the factorization was proposed in 1999 by Lee and Seung [24, 23]. Later, several variants of NMF were developed: a convolutional model [35, 37], different loss functions [9, 44], sparse coding [16] or a multilayer NMF model [40, 1, 31], to name a few.

In this thesis, firstly, NMF is viewed as a feedbackward neural network and generalized to a deep convolutional architecture with forwardpropagation under β-divergence. Secondly, it is shown that NMF is essentially the decoder part of a variant of a feedforward neural network known as the autoencoder (AE). The main part of the thesis is devoted to generalizing the shallow autoencoder with a single hidden layer and fully connected neurons to a deep convolutional autoencoder while constraining its input and filter coefficients to the domain of nonnegative real numbers.


By using the multiplicative update rule to find a local minimum of the β-divergence via gradient descent, the range of the autoencoder is likewise restricted to nonnegative real numbers. The nonnegative autoencoder (NAE) and its variants are tested on the following tasks: signal reconstruction and signal enhancement.

1.1 Motivation

Tools like NMF for feature extraction are easily interpretable and easy to analyze [16, 7, 10]. The multilayer extension offers a hierarchical feature representation [40, 1, 31] but is suboptimal because the model is not trained end-to-end, meaning that the factorization is solved in each layer independently, without considering how the error propagates to the other layers. On the other hand, because of their black-box nature, neural networks are not always well understood mathematically and have a feature representation that is hard to interpret [27]. The advantage of neural networks is that they are easily converted into a deep architecture that is end-to-end trainable as a single optimization task. It is therefore of great interest to examine an alternative path by combining deep neural networks and NMF into a deep nonnegative neural network. In doing so, we can combine the strengths of the two approaches.

1.2 Objective

The objective of this thesis is, firstly, to make the multilayer convolutional NMF end-to-end trainable under β-divergence. Secondly, NMF shall be considered as the decoder part of a shallow autoencoder. The shallow autoencoder shall be extended to a deep convolutional architecture with a nonnegativity constraint on its inputs and weights. The nonnegative autoencoder and its variants should be end-to-end trainable as a single optimization task for better accuracy. Lastly, algorithms that use this new architecture are to be tested in the context of signal reconstruction and signal enhancement with application to speech and audio.

1.3 Contributions

In this section, I briefly summarize the main contributions of this thesis.

These are:


• A multilayer convolutional NMF is derived under β-divergence with multiplicative factor updates and trained as a single optimization problem using forwardpropagation alias uppropagation [1].

• The NMF model is baked into a nonnegative autoencoder (NAE) [15, 3]. The NAE is the basis for further research reported in this thesis. It is used as a core tool for further algorithm design.

• The applicability and performance of the new algorithms: NAE, convolutional NAE (CNAE) and deep CNAE (DCNAE) are investigated on different tasks, such as signal reconstruction and signal enhancement.

1.4 Notation

Throughout this thesis, several mathematical conventions are used. When addressing numbers, $\mathbb{R}$, $\mathbb{R}_{\geq 0}$ and $\mathbb{C}$ stand for the set of real, nonnegative real and complex numbers, respectively. Multidimensional variables are denoted in bold letters, with bold lower case letters standing for vectors and bold upper case letters for matrices. A matrix of ones of size K × N is denoted as $1_{K \times N}$. Conventional matrix multiplication between the matrices A and B is denoted as AB. The operations $A \circ B$, $A^{\circ p}$ and $\frac{A}{B}$ denote element-wise multiplication (Hadamard product), element-wise exponentiation and element-wise division, respectively. $\{X_i\}^{I} \in \mathbb{R}^{L \times M \times C_i \times C_o}$ denotes a set of I convolutional filters with a kernel size of L × M that receives $C_i$ input channels and produces $C_o$ output channels.


Background

In this chapter, a brief discussion of the essential background needed to follow the thesis is presented. Section 2.1 introduces nonnegative matrix factorization (NMF), β-divergence, multiplicative factor updates, the convolutional NMF and the multilayer NMF. Section 2.2 presents feedforward neural networks and their deep architecture. In Section 2.3, a comparison is made between NMF and a neural network.

2.1 Nonnegative Matrix Factorization

In this section, nonnegative matrix factorization (NMF) is presented together with the β-divergence, multiplicative factor updates, and the convolutional and multilayer extensions.

2.1.1 Overview

NMF, originally introduced as positive matrix factorization by Paatero and Tapper [29], is a method where a matrix $V \in \mathbb{R}_{\geq 0}^{K \times N}$ is factorized as a product of two nonnegative matrices $W \in \mathbb{R}_{\geq 0}^{K \times I}$ and $H \in \mathbb{R}_{\geq 0}^{I \times N}$, where the rank of the factorization $I \ll \min(K, N)$:

$$ V \simeq U = WH. \qquad (2.1) $$

Eq. (2.1) can be understood as a decomposition of V into I representative basis vectors stacked as the columns of W, where each basis vector is weighted by the respective row of H. Usually, W is referred to as the dictionary, V as the observation and H as the coefficient matrix that is required to approximate V by U.

To perform NMF, a loss function L(V, U) is required to measure the dissimilarity between V and U. The factorization is obtained by solving the optimization problem

$$ (W_{\mathrm{opt}}, H_{\mathrm{opt}}) = \operatorname*{argmin}_{W,\,H} L(V, U) \quad \text{s.t.} \quad w_{ki}, h_{in} \in \mathbb{R}_{\geq 0}. \qquad (2.2) $$

2.1.2 Loss Function

Several functions can be used as the loss function to solve the NMF problem. The most common ones are the squared Euclidean distance, the Kullback–Leibler divergence and the Itakura–Saito divergence.

Squared Euclidean Distance

The Euclidean distance is the straight-line distance between two points in Euclidean space. The squared Euclidean (SE) distance between u and v is defined as

$$ d_{\mathrm{SE}}(v, u) = \tfrac{1}{2}(u - v)^2. \qquad (2.3) $$

Kullback–Leibler Divergence

The Kullback–Leibler (KL) divergence is a measure of the dissimilarity between two probability distributions. In NMF, V and U are considered as histograms. In this thesis the generalized KL divergence is addressed, which is defined as

$$ d_{\mathrm{KL}}(v, u) = v \log\left(\frac{v}{u}\right) - v + u. \qquad (2.4) $$

Itakura–Saito Divergence

Another popular measure of dissimilarity between two distributions is the Itakura–Saito (IS) divergence. This divergence is a measure of dissimilarity usually used in connection with voicegrams. Using NMF, V and U can be considered as voicegrams. The IS divergence is defined as

$$ d_{\mathrm{IS}}(v, u) = \frac{v}{u} - \log\left(\frac{v}{u}\right) - 1. \qquad (2.5) $$


β-Divergence

The loss functions mentioned previously are used depending on the context of the problem at hand. However, to avoid using different loss functions for each problem, a more general function that includes all these distance metrics is desired. Such a function is the β-divergence, originally introduced as a power divergence [13]. The β-divergence between two points u and v is defined as

$$ d_\beta(v, u) = \begin{cases} \dfrac{v\left(v^{\beta-1} - u^{\beta-1}\right)}{\beta - 1} - \dfrac{v^{\beta} - u^{\beta}}{\beta}, & \text{for } \beta \notin \{0, 1\}, \\[1ex] \dfrac{v}{u} - \log\left(\dfrac{v}{u}\right) - 1, & \text{for } \beta = 0, \\[1ex] v \log\left(\dfrac{v}{u}\right) - v + u, & \text{for } \beta = 1. \end{cases} \qquad (2.6) $$

By appropriately choosing the β-parameter we obtain:

$$ d_0(v, u) \equiv d_{\mathrm{IS}}(v, u), \qquad (2.7) $$
$$ d_1(v, u) \equiv d_{\mathrm{KL}}(v, u), \qquad (2.8) $$
$$ d_2(v, u) \equiv d_{\mathrm{SE}}(v, u). \qquad (2.9) $$

Accordingly, the β-divergence between two matrices V and U is defined entrywise as

$$ D_\beta(V, U) = \sum_{k=1}^{K} \sum_{n=1}^{N} d_\beta(v_{kn}, u_{kn}). \qquad (2.10) $$

When the β-divergence is used in minimization tasks, the convergence of the multiplicative update rule (see Section 2.1.3) was proven for β ∈ [0, 2] [24, 8, 18]. The β-divergence has a unique minimum at v = u, as shown in Figure 2.1.


Figure 2.1: β-divergence for different values of β for v = 1 (image borrowed from [43])
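For concreteness, the entrywise β-divergence (2.6), summed over all entries as in (2.10), can be written as a short NumPy function. The sketch below is only an illustration of the definition; the `eps` clipping is an added safeguard against zero entries, not part of the definition itself.

```python
import numpy as np

def beta_divergence(V, U, beta, eps=1e-12):
    """Entrywise beta-divergence D_beta(V, U) as in (2.6) and (2.10).

    V and U are nonnegative arrays of equal shape; eps guards the
    logarithms and divisions against exactly zero entries."""
    V = np.maximum(V, eps)
    U = np.maximum(U, eps)
    if beta == 0:                      # Itakura-Saito divergence (2.5)
        d = V / U - np.log(V / U) - 1.0
    elif beta == 1:                    # generalized Kullback-Leibler divergence (2.4)
        d = V * np.log(V / U) - V + U
    else:                              # general case, beta not in {0, 1}
        d = (V * (V**(beta - 1) - U**(beta - 1)) / (beta - 1)
             - (V**beta - U**beta) / beta)
    return d.sum()
```

For β = 2 the general branch reduces to the squared Euclidean distance (2.3), which is a quick sanity check of the formula.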

2.1.3 Multiplicative Factor Updates

Multiplicative factor updates were proposed by Lee and Seung in [23, 24].

They are based on the gradient descent method applied element-wise to the loss function with respect to each factor X (W or H):

$$ X \leftarrow X - \eta_X \circ \nabla_X L(V, WH), \qquad (2.11) $$

where $\eta_X$ is a learning-rate matrix of the same size as X and $\nabla_X L(V, WH)$ is the gradient of the loss function with respect to X. By splitting the gradient into positive and negative parts, i.e.

$$ \nabla_X L(V, WH) = \nabla_X^{+} L(V, WH) - \nabla_X^{-} L(V, WH), \qquad (2.12) $$

and allowing $\eta_X$ to change at every iteration, we can set

$$ \eta_X = \frac{X}{\nabla_X^{+} L(V, WH)}. \qquad (2.13) $$

So the multiplicative update rule reads:

$$ X \leftarrow X \circ \frac{\nabla_X^{-} L(V, WH)}{\nabla_X^{+} L(V, WH)}. \qquad (2.14) $$

The advantage of this form is that there is no need to tune the learning rate $\eta_X$, and the nonnegativity of the elements of the factor X is ensured at each iteration. These multiplicative updates are also known as the heuristic updates in the NMF literature [33, 2, 23, 10].

If $L(V, U) = D_\beta(V, U)$, with U = WH, the convergence of the heuristic updates for NMF was proven in [24, 8, 18] for β ∈ [0, 2]. These are

$$ W \leftarrow W \circ \frac{\left[V \circ U^{\circ(\beta-2)}\right] H^{\mathsf T}}{U^{\circ(\beta-1)} H^{\mathsf T}}, \qquad (2.15) $$

$$ H \leftarrow H \circ \frac{W^{\mathsf T} \left[V \circ U^{\circ(\beta-2)}\right]}{W^{\mathsf T}\, U^{\circ(\beta-1)}}. \qquad (2.16) $$

2.1.4 Convolutional Extension

As shown in Section 2.1.1, the observation V is factorized into a product between the dictionary W and the coefficient matrix H. Should W and/or H evolve in one or two dimensions, one can assume that the current state or value of these matrices is correlated with their past and future states. We can take this into account by replacing the matrix multiplication in our model by a convolution.

Convolution in Time

An extension of NMF, called convolutional nonnegative matrix factorization (CNMF), was introduced by Smaragdis in [37]. In CNMF in time, the model is allowed to evolve over time to track evolution on M consecutive columns of V. The approximation of V becomes

$$ V \simeq U = \sum_{m=0}^{M-1} W(m)\, \overset{m\rightarrow}{H}, \qquad (2.17) $$

where $\overset{m\rightarrow}{X}$ shifts X to the right by inserting m zero-columns at the beginning while maintaining the original size of X. This is equivalent to a sum of I truncated convolutions between the basis matrices $\{W_i\} \in \mathbb{R}_{\geq 0}^{K \times M}$ and their corresponding coefficient vectors $\{h_i\} \in \mathbb{R}_{\geq 0}^{1 \times N}$, i = 1, 2, ..., I,

$$ V \simeq U = \sum_{i=1}^{I} (W_i * h_i), \qquad (2.18) $$

where ∗ denotes convolution in 1D. Note that if M = 1 the factorization reduces to the original non-convolutional NMF.
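A small sketch of the reconstruction in (2.17), assuming the lag-dependent dictionary is stored as a (K, I, M) array so that W[:, :, m] plays the role of W(m); the layout is an implementation choice.

```python
import numpy as np

def shift_right(H, m):
    """The m-> operator of (2.17): insert m zero columns at the front,
    keeping the original width of H."""
    if m == 0:
        return H
    return np.concatenate([np.zeros((H.shape[0], m)), H[:, :-m]], axis=1)

def cnmf_time_reconstruct(W, H):
    """U = sum_m W(m) shift_m(H), with W of shape (K, I, M) and H of shape (I, N)."""
    K, I, M = W.shape
    U = np.zeros((K, H.shape[1]))
    for m in range(M):
        U += W[:, :, m] @ shift_right(H, m)
    return U
```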


Convolution in Time and Frequency

Schmidt and Mørup extended the CNMF algorithm to a 2D version [35].

Now, the coefficients of W and H are allowed to change over time and frequency, thus exploiting the structure in an L × M neighborhood around each element of V:

$$ V \simeq U = \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} \overset{l\downarrow}{W}(m)\, \overset{m\rightarrow}{H}(l), \qquad (2.19) $$

where $\overset{l\downarrow}{X}$ is the zero-fill down-shift operator. This operation can be compared to a sum of I truncated convolutions between the box filters $\{W_i\} \in \mathbb{R}_{\geq 0}^{K \times M}$ and the corresponding coefficient matrices $\{H_i\} \in \mathbb{R}_{\geq 0}^{L \times N}$ for i = 1, 2, ..., I, i.e.

$$ V \simeq U = \sum_{i=1}^{I} (W_i \ast\ast H_i), \qquad (2.20) $$

where ∗∗ denotes convolution in 2D. Note that the original NMF is embedded when L = M = 1.
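The truncated-convolution form (2.20) can be sketched with SciPy's full 2D convolution cropped to the size of V; storing the filters and coefficient matrices as Python lists is an assumption of this sketch, not of the model.

```python
import numpy as np
from scipy.signal import convolve2d

def cnmf_2d_reconstruct(Ws, Hs, K, N):
    """U = sum_i W_i ** H_i as in (2.20), with ** a 2D convolution
    truncated to the K x N size of the observation V.

    Ws is a list of K x M basis filters, Hs a list of L x N coefficient matrices."""
    U = np.zeros((K, N))
    for W_i, H_i in zip(Ws, Hs):
        U += convolve2d(W_i, H_i, mode='full')[:K, :N]
    return U
```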

CNMF Multiplicative Factor Updates

Multiplicative update rules were used under β-divergence, β ∈ [0, 2], for the convolutional extension of NMF with proof of convergence given in [44].

The multiplicative update rules for CNMF in time are:

$$ W(m) \leftarrow W(m) \circ \frac{\left[V \circ U^{\circ(\beta-2)}\right] \overset{m\rightarrow}{H}{}^{\mathsf T}}{U^{\circ(\beta-1)}\, \overset{m\rightarrow}{H}{}^{\mathsf T}}, \qquad (2.21) $$

$$ H \leftarrow H \circ \frac{\sum_m W^{\mathsf T}(m) \left[\overset{m\leftarrow}{V} \circ \overset{m\leftarrow}{U}{}^{\circ(\beta-2)}\right]}{\sum_m W^{\mathsf T}(m)\, \overset{m\leftarrow}{U}{}^{\circ(\beta-1)}}. \qquad (2.22) $$

The multiplicative update rules for CNMF in time and frequency are:

$$ W(m) \leftarrow W(m) \circ \frac{\sum_l \left[\overset{l\uparrow}{V} \circ \overset{l\uparrow}{U}{}^{\circ(\beta-2)}\right] \overset{m\rightarrow}{H}{}^{\mathsf T}(l)}{\sum_l \overset{l\uparrow}{U}{}^{\circ(\beta-1)}\, \overset{m\rightarrow}{H}{}^{\mathsf T}(l)}, \qquad (2.23) $$

$$ H(l) \leftarrow H(l) \circ \frac{\sum_m \overset{l\downarrow}{W}{}^{\mathsf T}(m) \left[\overset{m\leftarrow}{V} \circ \overset{m\leftarrow}{U}{}^{\circ(\beta-2)}\right]}{\sum_m \overset{l\downarrow}{W}{}^{\mathsf T}(m)\, \overset{m\leftarrow}{U}{}^{\circ(\beta-1)}}. \qquad (2.24) $$

2.1.5 Multilayer Extension

A multilayer extension to NMF was proposed in [40, 1, 31] to learn hierarchical features. The basic idea is to perform NMF in several layers.

This way a hierarchical representation of the input can be learned. One can interpret this as a more and more abstract representation of the input with increasing hierarchy levels. With the multilayer extension come new possibilities to update the layers [1, 31]:

1. Update the multilayer NMF by iterating layer by layer;

2. Update the multilayer NMF by propagating the error through all layers (forwardpropagation alias uppropagation [1]);

3. Update the multilayer NMF by iterating layer by layer, then update the model by propagating the error through all layers.

In the first scheme, each layer is optimized separately, beginning from the lowest layer. An advantage of this scheme is the extensibility of the framework: as all parameters of one layer are independent of the parameters of the other layers, another layer can be added without changing the previous layers. The disadvantage of this scheme is that it is suboptimal because in most cases the global minimum of the whole network is not found [1].

In the second scheme, the layers in the network are optimized simultaneously. The advantage of this scheme is the minimization of the overall cost function, which leads to a better reconstruction. The drawback is the introduction of a relation between the layers, which makes the network more difficult to extend [31].


The third scheme combines the two previous schemes. Firstly, each layer in the network is considered as independent from the other layers and optimized in that way. Secondly, the layers are combined and trained simultaneously to achieve a better reconstruction [1].

A variant of the multilayer NMF was implemented using CNMF under β-divergence (β-CNMF) in [43]. The basic relation between two consecutive layers is illustrated in Figure 2.2.

Figure 2.2: Multilayer CNMF

The multilayer β-CNMF performs CNMF sequentially in multiple layers, each layer l having a different dictionary $\{W_i^{(l)}\}^{I^{(l)}}$ and signal representation $\{H_i^{(l)}\}^{I^{(l)}}$. The representation from the lower layer is factorized into the next higher layer's dictionary and new representation. The multilayer CNMF was used in speech enhancement tasks, where the objective was to be more discriminant with each layer, so any potential leakage that could occur between speech and non-speech in one layer would be reduced in the next layer by the use of a different dictionary [43]. This method corresponds to the first scheme discussed previously.

2.2 Feedforward Neural Networks

This section discusses feedforward neural networks, both shallow and deep. The main operations in between the layers that simulate the connections in the brain are discussed. Finally, a concrete example of a deep convolutional neural network, called LeNet-5, is shown.


2.2.1 Overview

Artificial neural networks (ANNs) are computing systems inspired by the biological neural networks that constitute a brain. Such systems learn to perform tasks by considering data examples, generally without being programmed with task-specific rules.

An artificial neural network consists of a collection of simulated neurons. Each neuron is a node which is connected to other nodes via links that correspond to biological axon-synapse-dendrite connections. Each link has a weight that determines the strength of one node’s influence on another node [46]. Figure 2.3 is an illustration of this mechanism for one biological neuron and how it is modelled mathematically.

Figure 2.3: An illustration of a biological neuron and its mathematical model (image borrowed from [41])

An artificial neuron has a body in which computations are performed, and several input channels and one output channel, similar to a real biological neuron.

2.2.2 Main Operations

The neurons are typically organized into multiple layers. In the biological neuron model, the neurons of one layer connect only to neurons of the immediately preceding and immediately following layers. The layer that receives the input data is the input layer. The layer that produces the final result is the output layer. In between the input and output layer can be zero or several hidden layers. Between two layers, various operations are possible:

Linear Transformation

Mathematically, we can think of a fully connected layer as a function that applies a linear transformation W on a vectorial input x of dimension P and outputs a vector y of dimension Q. Usually, the linear operation has a bias parameter b:

$$ y = Wx + b. \qquad (2.25) $$

This layer is the first representation of the axon-synapse-dendrite-neuron connections that happen in the brain, see Figure 2.3.

Activation Function

An activation function is used to simulate the 1-0 impulse carried away from the cell body [45]. In the neuron model, the activation function is used to simulate this property and also to add nonlinearity to the model. The ability of the neural network to approximate any function, and especially a non-convex function, is related to the nonlinear activation function. Figure 2.4 shows three popular activation functions.

Figure 2.4: Three popular activation functions
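As a concrete illustration of the linear transformation (2.25) followed by a nonlinear activation, the snippet below applies a fully connected layer and a ReLU to a random input; the dimensions P and Q are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q = 8, 4                      # input and output dimensions (example values)
W = rng.standard_normal((Q, P))  # weight matrix of the fully connected layer
b = rng.standard_normal(Q)       # bias vector
x = rng.standard_normal(P)       # input vector

y = W @ x + b                    # linear transformation (2.25)
a = np.maximum(y, 0.0)           # ReLU activation applied element-wise
```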

Convolution

Convolutional layers, as the name alludes, convolve the input X with a set of K filters $f_k$ and pass the result to the next layer via $Y_k$. Usually, the convolution operation has a bias parameter $b_k$:

$$ Y_k = f_k * X + b_k, \qquad (2.26) $$

where ∗ denotes convolution. This is similar to the response of a neuron in the visual cortex to a specific stimulus [49]. Each convolutional neuron processes data only for its receptive field. Figure 2.5 shows an example of learned filters representing receptive fields that trigger only for certain patterns in the input image.

Figure 2.5: The filters that represent all the main characteristics of an exemplary image (image borrowed from [32])

Because the weights in the convolution operation are shared over the entire image, the number of free parameters is reduced.

Pooling

The pooling operation is used to reduce the computations and reduce the detail level for the next operation. The pooling operation reduces the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron.

Normalization

The normalization operation consists in shifting inputs to have a zero mean and unit variance. This operation is used to make the inputs of each trainable layer comparable across features.

2.2.3 Architectures

Neural networks are often organized into N layers of neurons and are therefore also called N-layer neural networks. A 1-layer neural network means a network with no hidden layers, i.e. the input layer is directly connected to the output layer. When N ≥ 3, i.e. when two or more layers are hidden, we say that the neural network is deep [34]. Thus a deep neural network is an N-layer neural network with N ≥ 3.


The universal approximation theorem [50] states that a 2-layer neural network can approximate any function over some compact set, provided that it has enough neurons in the hidden layer. Determining the number of neurons in the hidden layer is difficult and computationally expensive.

To improve the performance of a neural network, it is common to make it deeper, adding more hidden layers, to learn more abstract features throughout the hidden layers. It was shown in [11] that deep neural networks outperform shallow neural networks across various tasks and domains and argued that the number of neurons in a shallow neural network grows exponentially with task complexity. So, to be useful, a shallow neural network might need to be very big in terms of the number of neurons — possibly much bigger than a deep neural network.

Training a neural network (shallow or deep) is not much different from training any other machine learning model with gradient descent [11].

Updating each weight in a neural network to minimize a cost function is possible via backpropagation [22]. Backpropagation means computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, and iterating backwards from the last layer to avoid redundant calculations of intermediate terms in the chain rule [48]. It is common to manually explore the space of hyperparameters such as the learning rate or weight regularization, not to mention architectural hyperparameters like the number of hidden layers, to obtain the best performance in terms of both accuracy and training time.

2.2.4 Example

A lot of convolutional architectures have been developed from the 1990s onwards. In this section, I present a well-known deep convolutional neural network developed by Yann LeCun, LeNet-5 [22]. Figure 2.6 shows the architecture of LeNet-5.


Figure 2.6: Architecture of LeNet-5, a convolutional neural network for digits recognition (image borrowed from [22])

This kind of architecture is one of the first successful applications of convolutional neural networks (CNNs). LeNet-5 was first used to read zip codes and digits on bank cheques in the United States and then used in other classification tasks: it was able to achieve an error rate below 1% on the MNIST dataset [21], which was very close to the state of the art at that time.

By modern standards, LeNet-5 is a very simple network. It only has 7 layers: three convolutional layers (C1, C3 and C5), two subsampling (pooling) layers (S2 and S4), and one fully connected layer (F6), followed by the final output layer. Convolutional layers use 5-by-5 convolutions with stride 1. Sub-sampling layers are 2-by-2 average pooling layers. The hyperbolic tangent is used as the activation function after the convolutional layers C1, C3 and C5 and the fully connected layer F6.

Each filter from the convolution operations represents a receptive field that is sensitive to a particular area of the input digits. The fully connected layers at the end of the network play the role of a classifier that uses the patterns learned from the convolutions to determine the digit. LeNet-5 can be seen in action at [12, 20].
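A compact PyTorch sketch of the layer sequence described above (5-by-5 convolutions, 2-by-2 average pooling, tanh activations, 32-by-32 single-channel input as in [22]); details such as the original trainable subsampling coefficients and the RBF output layer are simplified here.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Simplified LeNet-5: C1-S2-C3-S4-C5-F6-output with tanh activations."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.c1 = nn.Conv2d(1, 6, kernel_size=5)     # 32x32 -> 28x28x6
        self.s2 = nn.AvgPool2d(2)                    # -> 14x14x6
        self.c3 = nn.Conv2d(6, 16, kernel_size=5)    # -> 10x10x16
        self.s4 = nn.AvgPool2d(2)                    # -> 5x5x16
        self.c5 = nn.Conv2d(16, 120, kernel_size=5)  # -> 1x1x120
        self.f6 = nn.Linear(120, 84)
        self.out = nn.Linear(84, num_classes)

    def forward(self, x):
        x = torch.tanh(self.c1(x))
        x = self.s2(x)
        x = torch.tanh(self.c3(x))
        x = self.s4(x)
        x = torch.tanh(self.c5(x)).flatten(1)
        x = torch.tanh(self.f6(x))
        return self.out(x)

logits = LeNet5()(torch.randn(1, 1, 32, 32))   # one 32x32 grayscale digit
```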

2.3 Comparison

In this section, a comparison of NMF and a neural network is made. Firstly, a comparison of the architecture, constraints and optimization between NMF and a neural network is done. Secondly, the findings from the comparison are summarized in Table 2.1 at the end of the section.


Architecture

Neural networks, usually being feedforward, and NMF, being feedbackward, have different architectures, which can easily be shown using nodal analysis. In the NMF case, there is no input layer, the coefficient matrix H plays the role of the hidden layer, the dictionary W contains the linear weights and U is the approximation of the visible data V. One can view the nodal architecture of NMF as the decoder part of a shallow autoencoder. A nodal representation of both architectures is shown in Figure 2.7.

(a) Neural Network (b) NMF

Figure 2.7: Nodal architecture of a shallow autoencoder and NMF

Different steps were taken to make NMF resemble a neural network, such as projective NMF [51], the mini-batch method [36] or the semi-NMF neural network alternative [38, 39] ($W \in \mathbb{R}$ and $H \in \mathbb{R}_{\geq 0}$), but none of them combine the properties of neural networks with those of NMF in a proper sense.

A major difference between the neural network and NMF architectures is that the nodal interpretation of NMF is not a feedforward architecture. NMF has an inverse structure, meaning that the hidden states are adjusted to the outputs. In the case of deep convolutional networks, the convolution operation is carried out in different ways. More precisely, in a modern deep neural network architecture, the set of convolutional filters is shared across all inputs of one layer to generate the input of the next layer. In multilayer NMF [43], each input has its own set of filters to generate the input for the next layer. Figure 2.8 contrasts the convolution operation between the lth and (l + 1)th layer for both methods.


(a) CNN (b) CNMF

Figure 2.8: Comparison of the convolution operation in a neural network and NMF

It should be noted that the different operations (matrix multiplication and convolution) in between the layers are carried out without the bias vector in NMF and its variants.

Constraints

Although it is possible to impose hard constraints on a convolutional neural network, it is more common to have soft constraints such as weight regularization or a constraint violation penalty. An auxiliary regularization term R(W) or R(H), corresponding to the constraint, is introduced and added to the cost J on top of the general loss function $L(X, \hat{X})$:

$$ J = L(X, \hat{X}) + R(\cdot). \qquad (2.27) $$

The natural constraint for NMF is nonnegativity. Nonnegativity is ensured by the multiplicative updates [23, 24] or by other methods like projected gradient [25]. It is possible to add more constraints to NMF, such as sparsity [16] or graph regularization [30]. The constraints for multiplicative updates are incorporated in the form

$$ X \leftarrow X \circ \frac{\nabla_X^{-} L(V, U) + \nabla_X^{-} R(\cdot)}{\nabla_X^{+} L(V, U) + \nabla_X^{+} R(\cdot)} \qquad (2.28) $$

to ensure nonnegativity.


Optimization Methods

As seen in Sections 2.1 and 2.2, neural networks and NMF are trained using a variant of gradient descent. While neural networks use backpropagation and adjust the weights, NMF commonly uses alternating multiplicative updates. It is also common in NMF to normalize the basis vectors:

$$ V \simeq U = WH = WDD^{-1}H, \qquad (2.29) $$

where D is a diagonal matrix,

$$ D = \operatorname{diag}\left(\|W_1\|_p^{-1}, \|W_2\|_p^{-1}, \ldots, \|W_I\|_p^{-1}\right), \qquad (2.30) $$

where $W_i$ denotes the ith column of W and

$$ \|W_i\|_p = \left(\sum_{k=1}^{K} w_{ki}^{p}\right)^{\frac{1}{p}}. \qquad (2.31) $$

It should be noted that introducing D does not change the behavior of NMF. Weight normalization would be possible in a neural network only if the network was linear.
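The normalization in (2.29)–(2.31) can be applied after every update without changing the product WH; a small NumPy sketch, with p = 2 chosen for the example:

```python
import numpy as np

def normalize_bases(W, H, p=2):
    """Rescale each column of W to unit p-norm and compensate in H,
    so that W @ H is unchanged, cf. (2.29)-(2.31)."""
    norms = np.sum(np.abs(W)**p, axis=0)**(1.0 / p)   # ||W_i||_p per column
    norms = np.maximum(norms, 1e-12)                  # guard against zero columns
    return W / norms, H * norms[:, None]
```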

Summary

Even if there are significant differences between NMF and neural networks, the fact that they both use gradient descent and can be represented as a nodal architecture is an encouragement to develop an architecture that combines the strengths of neural networks and NMF. Table 2.1 provides a brief summary of the findings from the comparison.

                             ANN                     NMF
Data flow                    Feedforward             Feedbackward
Transformation               Nonlinear               Linear
Domain of weights            ℝ                       ℝ≥0
Filters in multilayer case   Shared                  Not shared
Update rules                 Additive                Multiplicative
Nonlinearity                 Activation functions    Nonnegativity constraint
Optimization                 Backpropagation         Forwardpropagation

Table 2.1: Summary of the comparison


Multilayer β-CNMF

In this chapter, a multilayer CNMF based on a convolutional data model in 2D with forwardpropagation [1] is developed. In Section 3.1, a summary of CNMF under the β-divergence is given. A multilayer extension, which generalizes the work in [1, 43], is elaborated in Section 3.2.

3.1 β-CNMF

As mentioned in Section 2.1.5, a convolutional extension of NMF has been implemented in [37, 35] for different cost functions. To avoid implementing several algorithms, each one minimizing a different cost function, new update rules were proposed based on β-divergence in [44], correcting the errors in the previous publications.

The general form of NMF when using a 2D convolutional model is

$$ V \simeq U = \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} \overset{l\downarrow}{W}(m)\, \overset{m\rightarrow}{H}(l). \qquad (3.1) $$

From (3.1), one can see that NMF and the CNMF proposed in [37] and [23, 24] are special cases of CNMF in 2D. As a matter of fact, (3.1) is equivalent to NMF for L = M = 1, and (3.1) becomes CNMF in time [37] for L = 1. Therefore, for the rest of the thesis, CNMF stands for the general form shown in (3.1).

The update rules for CNMF under β-divergence (β-CNMF) are given in (2.23) and (2.24). Note that the update rules for NMF, (2.15) and (2.16), are obtained by choosing L = M = 1.


3.2 Multilayer β-CNMF

The idea of using NMF in multiple layers can be found in the existing literature. In [40, 1, 31], NMF was extended to multiple layers with different cost functions. In [43, 4], the convolution operation was used in between layers, resulting in a multilayer CNMF. However, no formulation as a single optimization problem was proposed for the multilayer CNMF.

In this section, an approach similar to the forwardpropagating algorithm from [1] is developed for multilayer CNMF.

3.2.1 Data Model

Following the idea from [43], CNMF is employed in each layer independently:

$$ V \simeq U = \sum_{i=1}^{I^{(1)}} W_i^{(1)} \ast\ast H_i^{(1)} \;\rightsquigarrow\; H_i^{(1)} = \sum_{j=1}^{I^{(2)}} W_j^{(2)} \ast\ast H_j^{(2)} \;\rightsquigarrow\; \ldots, \qquad (3.2) $$

where $\{W_i^{(l)}\}$ is a set of basis filters in the lth layer for the corresponding coefficient matrices $\{H_i^{(l)}\}$, i = 1, 2, ..., $I^{(l)}$, and l = 1, 2, ..., L, where L is the depth of the network and ∗∗ denotes convolution as defined in Section 2.1.4.

In each layer l ≤ L, the coefficient matrices $\{H_i^{(l)}\}$ are transformed into new representations. The idea is that each new representation will be more discriminant than the previous representation, so any similarity between components in one layer can potentially be resolved in the next higher layer with the employment of a more specific dictionary. Therefore, better performance for specific tasks such as signal enhancement can be achieved.

3.2.2 Forwardpropagation

In [43], a multilayer CNMF is trained layer-by-layer, which is suboptimal.

To tackle this shortcoming, a new algorithm inspired by [1] is developed.

The entire network is trained by propagating the error through all the layers, starting with the deepest hidden layer.

To extend the β-CNMF to the multilayer case, let us introduce two new matrices A and B to the update rule:

$$ W_i \leftarrow W_i \circ \frac{A \star\star H_i}{B \star\star H_i}, \qquad (3.3) $$

$$ H_i \leftarrow H_i \circ \frac{W_i \star\star A}{W_i \star\star B}, \qquad (3.4) $$

where ⋆⋆ denotes cross-correlation in 2D and with

$$ A = V \circ U^{\circ(\beta-2)}, \qquad (3.5) $$

$$ B = U^{\circ(\beta-1)}. \qquad (3.6) $$

In the multilayer case, if we obtain $A_i^{(l)}$ and $B_i^{(l)}$ in the lth layer of the network, the update rules become

$$ W_i^{(l)} \leftarrow W_i^{(l)} \circ \frac{A_i^{(l)} \star\star H_i^{(l)}}{B_i^{(l)} \star\star H_i^{(l)}}, \qquad (3.7) $$

$$ H_i^{(L)} \leftarrow H_i^{(L)} \circ \frac{W_i^{(L)} \star\star A_i^{(L)}}{W_i^{(L)} \star\star B_i^{(L)}}, \qquad (3.8) $$

where $A_i^{(l)}$ and $B_i^{(l)}$ are computed using the chain rule:

$$ A_i^{(l+1)} = W_i^{(l)\mathsf T} \star\star A_i^{(l)}, \qquad (3.9) $$

$$ B_i^{(l+1)} = W_i^{(l)\mathsf T} \star\star B_i^{(l)}, \qquad (3.10) $$

where $X_i^{(l)\mathsf T}$ denotes the transpose of $X_i^{(l)}$, for l = 1, 2, ..., L, and

$$ A_i^{(1)} = A = V \circ U^{\circ(\beta-2)}, \qquad (3.11) $$

$$ B_i^{(1)} = B = U^{\circ(\beta-1)}. \qquad (3.12) $$
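To make the flow of (3.7)–(3.12) concrete, the sketch below uses the non-convolutional special case (one component per layer and 1×1 filters, so the 2D cross-correlations reduce to matrix products and the model collapses to V ≈ W⁽¹⁾W⁽²⁾⋯W⁽ᴸ⁾H⁽ᴸ⁾). It is meant only to illustrate how A and B are propagated; the full convolutional version follows the same pattern.

```python
import numpy as np

def forwardprop_update(V, Ws, H, beta, eps=1e-12):
    """One forwardpropagated update of V ~ W1 W2 ... WL H (1x1-filter sketch
    of Section 3.2.2).  Ws is a list of L dictionaries, H is the deepest
    coefficient matrix."""
    L = len(Ws)
    U = np.linalg.multi_dot(Ws + [H])                 # reconstruction through all layers
    # Layer-1 terms (3.11)-(3.12), then the chain rule (3.9)-(3.10)
    A = [V * U**(beta - 2)]
    B = [U**(beta - 1)]
    for l in range(L - 1):
        A.append(Ws[l].T @ A[l])
        B.append(Ws[l].T @ B[l])
    # Multiplicative updates (3.7) for every dictionary ...
    for l in range(L):
        # effective coefficient matrix seen by layer l: W_{l+1} ... W_L H
        H_l = np.linalg.multi_dot(Ws[l + 1:] + [H]) if l < L - 1 else H
        Ws[l] = Ws[l] * (A[l] @ H_l.T) / (B[l] @ H_l.T + eps)
    # ... and (3.8) for the deepest coefficient matrix
    H = H * (Ws[-1].T @ A[-1]) / (Ws[-1].T @ B[-1] + eps)
    return Ws, H

# Usage sketch: a 2-layer model on random nonnegative data
rng = np.random.default_rng(0)
V = rng.random((64, 200)) + 1e-3
Ws, H = [rng.random((64, 32)), rng.random((32, 8))], rng.random((8, 200))
for _ in range(200):
    Ws, H = forwardprop_update(V, Ws, H, beta=1.0)
```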

Simulation

As in [43], a 2-layer β-CNMF is implemented with the same parameters. This model is trained twice: once without using forwardpropagation, i.e. the model is trained layer-by-layer, and once using forwardpropagation, i.e. the model is trained in a single optimization task.

The models are trained under β-divergence with β = 0.5 (arbitrarily chosen) for 500 iterations on a piece of audio shown in Figure 3.1. The corresponding approximations are shown in Figure 3.2.


Figure 3.1: Original spectrogram (V)

Figure 3.2: Reconstruction (U) without and with forwardpropagation. The corresponding reconstruction error using β-divergence after 500 iterations is 0.10 and 0.11, respectively.

It can be seen from Figure 3.2 that both architectures output a “good” reconstruction, with similar reconstruction errors. However, using forwardpropagation the network was able to achieve this reconstruction with fewer computations (17856 components updated per iteration instead of 33088 for the architecture from [43]), which can speed up the training. As deep networks are prone to getting stuck in a local minimum, the layer-by-layer optimization can be used as a pre-training step to provide a good starting point for the forwardpropagation, as in [14].


Deep Convolutional Nonnegative Autoencoders

In this chapter, new variants of a nonnegative autoencoder (NAE) are developed by incorporating the underlying ideas of NMF, CNMF and neural networks. Section 4.1 introduces the concept of shallow NAEs.

In Section 4.2, a convolutional NAE (CNAE) is developed. Section 4.3 concludes the chapter with a deep convolutional NAE (DCNAE) with three hidden layers.

4.1 Nonnegative Autoencoders

In this section, the concept of nonnegative autoencoders (NAEs) is presented. Firstly, it is shown how to reinterpret NMF to convert it into a neural network. Secondly, an architecture is proposed and the constraints are defined. Then, the update rules are derived to enforce nonnegativity.

Finally, the new architecture is simulated on a toy example and compared to NMF on a more complex dataset.

4.1.1 Concept

The purpose of an autoencoder is to learn a representation for a set of data, typically for dimensionality reduction [47]. NMF and autoencoders are suited for the same goal: dimensionality reduction. Moreover, we have seen in Section 2.3 that the nodal representation of NMF can be interpreted as the decoder part of an autoencoder, see Figure 2.7. Therefore, an autoencoder seems to be the right neural network architecture to bridge the gap between neural networks and NMF.

4.1.2 Architecture

Let us reinterpret NMF as a linear autoencoder. The obvious formulation is

$$ V \simeq U = WH = WW^{+}V \qquad (4.1) $$

with $H = W^{+}V$ representing the hidden layer and $U = WH$ representing the output layer, respectively. We further enforce the constraints $W \in \mathbb{R}_{\geq 0}$ and $H \in \mathbb{R}_{\geq 0}$. The nonnegative matrices W and H would correspond to their namesakes in NMF, whereas the matrix $W^{+} \in \mathbb{R}_{\geq 0}$ is some sort of pseudo-inverse of W that produces a nonnegative H.

The shallow NAE has the same formula

$$ V \simeq U = WW^{+}V \qquad (4.2) $$

with $V, U \in \mathbb{R}_{\geq 0}^{K \times N}$, $W \in \mathbb{R}_{\geq 0}^{K \times I}$ and $W^{+} \in \mathbb{R}_{\geq 0}^{I \times K}$, where I is the rank of the factorization.

In the autoencoder terminology, the first layer weights $W^{+}$ are referred to as the encoder, which produces a code representing the input, and the weights W are referred to as the decoder, which uses the code to reconstruct the input. The shallow NAE is trained to minimize the reconstruction error

$$ L(V, U) \equiv D_\beta(V, U), \qquad (4.3) $$

which is equal to the entrywise β-divergence.

4.1.3 Derivation

Like most machine learning techniques, gradient descent is used to train the shallow NAE. The gradients of L(V, U) with respect to the weights W and $W^{+}$ are computed with L(V, U) defined in (4.3):

$$ \nabla_{W} L(V, U) = \left(U^{\circ(\beta-1)} - \left[V \circ U^{\circ(\beta-2)}\right]\right) V^{\mathsf T} W^{+\mathsf T}, \qquad (4.4) $$

$$ \nabla_{W^{+}} L(V, U) = W^{\mathsf T} \left(U^{\circ(\beta-1)} - \left[V \circ U^{\circ(\beta-2)}\right]\right) V^{\mathsf T}. \qquad (4.5) $$

From (4.4) and (4.5), we can see that $\nabla_{W} L(V, U)$ and $\nabla_{W^{+}} L(V, U)$ can be written in the form

$$ \nabla_{W} L(V, U) = \nabla_{W}^{+} L(V, U) - \nabla_{W}^{-} L(V, U), \qquad (4.6) $$

$$ \nabla_{W^{+}} L(V, U) = \nabla_{W^{+}}^{+} L(V, U) - \nabla_{W^{+}}^{-} L(V, U) \qquad (4.7) $$

with

$$ \nabla_{W}^{+} L(V, U) = U^{\circ(\beta-1)} V^{\mathsf T} W^{+\mathsf T}, \qquad (4.8) $$

$$ \nabla_{W}^{-} L(V, U) = \left[V \circ U^{\circ(\beta-2)}\right] V^{\mathsf T} W^{+\mathsf T}, \qquad (4.9) $$

$$ \nabla_{W^{+}}^{+} L(V, U) = W^{\mathsf T} U^{\circ(\beta-1)} V^{\mathsf T}, \qquad (4.10) $$

$$ \nabla_{W^{+}}^{-} L(V, U) = W^{\mathsf T} \left[V \circ U^{\circ(\beta-2)}\right] V^{\mathsf T}. \qquad (4.11) $$

By using the gradient descent algorithm

$$ W \leftarrow W - \eta_{W} \circ \nabla_{W} L(V, U), \qquad (4.12) $$

$$ W^{+} \leftarrow W^{+} - \eta_{W^{+}} \circ \nabla_{W^{+}} L(V, U), \qquad (4.13) $$

and setting

$$ \eta_{W} = \frac{W}{\nabla_{W}^{+} L(V, U)}, \qquad (4.14) $$

$$ \eta_{W^{+}} = \frac{W^{+}}{\nabla_{W^{+}}^{+} L(V, U)}, \qquad (4.15) $$

we obtain the following multiplicative updates:

$$ W \leftarrow W \circ \frac{\left[V \circ U^{\circ(\beta-2)}\right] V^{\mathsf T} W^{+\mathsf T}}{U^{\circ(\beta-1)} V^{\mathsf T} W^{+\mathsf T}}, \qquad (4.16) $$

$$ W^{+} \leftarrow W^{+} \circ \frac{W^{\mathsf T} \left[V \circ U^{\circ(\beta-2)}\right] V^{\mathsf T}}{W^{\mathsf T} U^{\circ(\beta-1)} V^{\mathsf T}}. \qquad (4.17) $$
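A minimal NumPy sketch of the updates (4.16)–(4.17) for the shallow NAE; recomputing U between the two updates and the eps guards in the denominators are implementation choices of this sketch.

```python
import numpy as np

def nae_update(V, W, Wp, beta, eps=1e-12):
    """One pass of the multiplicative updates (4.16)-(4.17) for the
    shallow NAE  V ~ U = W Wp V  (Wp plays the role of W+)."""
    U = W @ (Wp @ V)
    A = V * U**(beta - 2)               # "negative" gradient part
    B = U**(beta - 1)                   # "positive" gradient part
    W *= (A @ V.T @ Wp.T) / (B @ V.T @ Wp.T + eps)      # (4.16)
    U = W @ (Wp @ V)                    # recompute with the new decoder
    A = V * U**(beta - 2)
    B = U**(beta - 1)
    Wp *= (W.T @ A @ V.T) / (W.T @ B @ V.T + eps)       # (4.17)
    return W, Wp
```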

4.1.4 Simulation

In this section, the shallow NAE is implemented and trained on an artificially created example. The simulation helps us understand the nature of the weights and activations learned by the network. The decoder part of the NAE (activations, weights and output) is then compared to an NMF implementation on a more complex dataset.


Input

The input, shown in Figure 4.1, is a 30-by-90 image composed of two horizontal (1-by-10) and two vertical (10-by-1) lines.


Figure 4.1: Input used to analyze the behavior of the NAE (black = 1, white = 0)

Data Model

As the input is of dimension 30-by-90 and is composed of only two distinct components (horizontal and vertical lines), the rank of factorization, i.e. the dimensionality parameter, is set to 2. Therefore, the shallow NAE has the following formula:

$$ V \simeq U = WW^{+}V \qquad (4.18) $$

with $V, U \in \mathbb{R}_{\geq 0}^{K \times N}$, $W \in \mathbb{R}_{\geq 0}^{K \times I}$ and $W^{+} \in \mathbb{R}_{\geq 0}^{I \times K}$, and K = 30, N = 90 and I = 2. The model is trained to minimize the loss

$$ L(V, U) \equiv D_{\beta=2}(V, U). \qquad (4.19) $$
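Tying this data model to the `nae_update` sketch from Section 4.1.3 above, the toy input can be built and factorized as follows; the exact line positions are illustrative only, since the precise coordinates of Figure 4.1 are not stated in the text.

```python
import numpy as np

# 30-by-90 toy image with two horizontal (1x10) and two vertical (10x1) lines;
# the positions below are hypothetical, chosen to resemble Figure 4.1.
V = np.zeros((30, 90))
V[24, 10:20] = 1.0     # first horizontal line
V[24, 60:70] = 1.0     # second horizontal line
V[10:20, 35] = 1.0     # first vertical line
V[10:20, 80] = 1.0     # second vertical line

# Shallow NAE with rank I = 2, trained with the multiplicative updates above
rng = np.random.default_rng(0)
K, N, I = 30, 90, 2
W, Wp = rng.random((K, I)), rng.random((I, K))
for _ in range(500):
    W, Wp = nae_update(V, W, Wp, beta=2.0)   # beta = 2: squared Euclidean loss
H = Wp @ V                                   # activations of the hidden layer
```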

Results

The weights from the decoder W and the activations in the hidden layer $H = W^{+}V$ are extracted from the trained architecture. The results are shown in Figure 4.2.


Figure 4.2: Output, weights and activations extracted from the decoder part of the NAE

From the figure above one can see that the NAE has successfully learned the basic elements present in the input image. The weights W contain the basis vectors needed to explain the input data, i.e. a vertical line at the centre of the y-axis (1st component) and a dot on the 25th pixel of the y-axis (2nd component). The activations tell where the basis vectors are located on the x-axis.

Comparison with NMF

To compare the NAE with NMF, each architecture was trained on a subset (5000 images) of the MNIST dataset [21]. For the comparison to be fair, each architecture was designed to have approximately the same number of adjustable variables, so that each architecture consumes the same energy in a given time. The rank of the factorization for NMF was chosen using singular value decomposition (SVD) such that 95% of the energy was preserved, while the rank of factorization for the NAE was adjusted to have a comparable number of free parameters. Table 4.1 shows the parameterization of the methods, where the number of parameters for the NAE is the number of elements in the weight matrices (2 × K × I) and the number of parameters for NMF is the number of elements in the dictionary W (K × I) and the coefficient matrix H (I × N).


Parameters                  NAE          NMF
Input dimension (K × N)     784 × 5000   784 × 5000
Rank (I)                    440          120
Number of parameters        689920       694080

Table 4.1: Model Parameters – NAE vs NMF

Figure 4.3 shows the reconstruction of some training data after 100 iterations.

Figure 4.3: Reconstruction of some training data after 100 iterations. Reconstruction error: 0.004, 0.010

Figure 4.4 shows some weights learned by each model.

Figure 4.4: 16 random weights learned


It can be seen in Figure 4.3 that the NAE and NMF give a good reconstruction of the input digits. However, the digits reconstructed with the NAE are less blurred than the ones reconstructed with NMF because of the larger latent space dimension. The weights learned by the models (see Figure 4.4) are different: the ones learned by the NAE are sparser and resemble small receptive fields, while the weights learned by NMF resemble parts of digits. By learning sparse weights, the network is able to achieve comparable performance with fewer computations [42, 26].

4.2 Convolutional Nonnegative Autoencoders

In this section, the shallow NAE is extended to a shallow convolutional nonnegative autoencoder (CNAE). Firstly, CNMF is reinterpreted as a nonnegative convolutional neural network. Secondly, an architecture is proposed and the constraints are defined. Then, the update rules are derived. Finally, the new architecture is simulated on a toy example and compared to CNMF on a more complex dataset.

4.2.1 Concept

As with NMF, an autoencoder seems to be the right neural network architecture for the reinterpretation of CNMF. This time, the matrix product between the layers is replaced by a convolution. As mentioned in Section 2.2.2, the convolution helps us find important patterns with a certain structure in the data.

4.2.2 Architecture

To reinterpret CNMF as a linear convolutional autoencoder, the I different signal representations $H_i$ are stacked together such that $\{H_i\}^{I}$ can be interpreted as a signal with I channels. With this kind of signal representation, the convolution used in [41, 49] can be used with the CNMF model. The obvious formulation to reinterpret CNMF as a linear convolutional autoencoder is

$$ V \simeq U = \sum_i W_i \ast\ast H_i = \sum_i W_i \ast\ast \left(W_i^{+} \star\star V\right) \qquad (4.20) $$

with $H_i = W_i^{+} \star\star V$ representing the hidden layer and $U = \sum_i W_i \ast\ast H_i$ representing the output layer, respectively. The nonnegativity constraint
