
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Single Channel Spectrum-based Speech Enhancement Using Neural Networks

FILIP WEN-FWU TSAI

KTH ROYAL INSTITUTE OF TECHNOLOGY


Abstract

The ability to communicate is fundamental to forming relationships, and it is a necessity for a well-functioning society. Since a major part of our daily communication takes place orally, the ability to perceive speech is important. However, it is not always easy to perceive the message, especially when background noise partially masks the speech. For a person with hearing impairment, the situation is even worse.

The impact of background noise also challenges other domains, one of which concerns virtual assistants, which have recently become more common due to technological advancements. Since virtual assistants allow us to interact with our technological devices in daily life, our dependence on them working correctly becomes more critical, especially when we are required to interact with them by speech. In both of these cases, background noise remains an issue to some degree. Hence, the possibility of reducing the influence of noise is likely to play a significant role in how our society develops.

In this report, we evaluate the possibility of reducing background noise. To do so, we propose a new neural network architecture based on the principles of the extreme learning machine. Since this report works with spectrum-based speech, appropriate constraints have been imposed to ensure non-negativity in our optimization problem. Moreover, different configurations of the architecture have been examined, including unprocessed vs. pre-processed features, a masking filter, and stacking several single-architecture layers.

The results show that the proposed architecture performs better with the unprocessed noisy speech as input than with an input pre-processed by a well-known method. Another finding is that relaxing the constraint yielded better enhancement of noisy speech than the non-negative convex-constrained solution.


Sammanfattning

Being able to communicate is fundamental to forming relationships and is therefore a necessity for a well-functioning society. Since a significant part of daily communication is spoken, the ability to perceive what is said is important. Perceiving speech is, however, not always easy, especially when the level of background noise partially masks it, and it becomes even harder for people with some form of hearing impairment.

The effects of background noise also pose challenges in other areas. One such area concerns virtual assistants, which have become increasingly common due to technological advances. Since virtual assistants have made it possible for us to interact with our technical devices, it becomes increasingly important that they work, especially in situations where we are required to communicate with them by speech. The remaining question is how background noise should be reduced, since in both of the above cases it impairs intelligibility to varying degrees. With this said, the ability to reduce background noise is likely to have a significant impact on how society develops.

In this report, we evaluate the possibility of reducing background noise. To achieve this, we propose a new artificial neural network architecture based on ideas from the extreme learning machine. Since the processing is performed on spectrum-based speech, appropriate constraints are applied to guarantee a non-negative output when formulating the optimization problem. Furthermore, different configurations have been investigated, including pre-processed versus unprocessed spectra as input, filtering, and stacking several layers of a given architecture.

The obtained results show that the proposed architecture performs better with an unprocessed noisy-speech input than with an input refined by a well-known method in this field. Another finding is that relaxing the constraint yielded a better enhancement of noisy speech than the non-negative convex-constrained solution.


Acknowledgment

I would, first and foremost, like to thank my supervisor and examiner Saikat Chatterjee for his patience and for the many suggestions he has given me. The meetings have been inspiring and, as a result, I have reviewed several topics further.

Secondly, I would like to thank Alireza M. Javid, who helped guide me past potential hazards and instructed me on how to use the cloud-based computing service. Thirdly, I would like to thank the Department of Information Science and Technology for giving me access to the cloud-based computing service. I would also like to thank Pol del Aguila Pla, who gave me some advice on how to conduct a project in general.

Lastly, I would like to thank my family, who have supported me throughout this project, and also through the steps I was required to take to reach this point.

Thank you ...



Contents

1 Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Thesis Project
  1.4 Contribution
  1.5 Thesis Outline

2 Background
  2.1 Non-negative Matrix Factorization
  2.2 Progressive Learning Network
    2.2.1 Progression Property
    2.2.2 Artificial Neural Network
    2.2.3 Extreme Learning Machine
    2.2.4 PLN

3 Modified Progressive Learning Network
  3.1 Modified PLN Architecture
  3.2 Input
  3.3 Non-negative mPLN
    3.3.1 cNmPLN
    3.3.2 aNmPLN
  3.4 Masking Filter
  3.5 Theoretical Comparison with NMF
  3.6 Corresponding PLN Architecture

4 Implementation
  4.1 Feature Extraction
    4.1.1 Audio Files
    4.1.2 Short Time Fourier Transform and Its Inverse
    4.1.3 NMF Feature Extraction
  4.2 Solving the Output Matrix
    4.2.1 Tikhonov Regularization
    4.2.2 Approximation for Computational Efficiency
    4.2.3 Solutions for N-PLN
  4.3 Evaluation Measures
  4.4 Specific Sub-setting Clarification
  4.5 Initialization and Parameter Choices

5 Preliminary Results and Discussion
  5.1 Theoretical Evaluation
  5.2 Preliminary Results
    5.2.1 NMSE-based Evaluation
    5.2.2 The Performance of NMF When Used as Input
    5.2.3 Speech Enhancement Performance

6 Final Results and Discussion
  6.1 Justifying the Chosen Configurations
    6.1.1 NMSE-based Evaluation
  6.2 Optimal Output Matrices in Masking Filter Condition
  6.3 Speech Enhancement Performance
    6.3.1 Speech Estimate
    6.3.2 Listening Experiment
  6.4 General Discussion

7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work

Bibliography

Appendix A Mathematical Derivations and Claim
  A.1 Tikhonov Regularization
    A.1.1 Standard Solution
    A.1.2 Non-negative Constraint
  A.2 Orthonormal Random Matrix

Appendix B Transparency and Reproducibility
  B.1 Dataset - Details
  B.2 Reproducibility

Appendix C Various Preliminary Results
  C.1 Complementary Measurement Results
    C.1.1 NMSE – Mismatched Condition
    C.1.2 Performance Measurement Results
  C.2 Doubling the Number of Basis Vectors

Appendix D Various Final Results


List of Figures

2.1 Architecture of an artificial neural network with L+1 layers. The top diagram illustrates how the neural network often is drawn, while the bottom only illustrates the essential features.

2.2 Architecture of PLN. The matrix variables D1 and D2 are auxiliary variables to more easily see the structure of Y_{r_l}.

2.3 Illustration of the two constraint choices. The red diamond implies that O_l = U_l, each blue dashed-line circle ensures PP with a defined scaling factor (α), and each black solid-line circle corresponds to an arbitrarily chosen constraint (ϵ).

3.1 Architecture of the modified PLN.

3.2 The block diagram to calculate the input to mPLN.

3.3 An architecture block diagram of the NmPLN with the masking filter Φ̂_PLN(·) and an additional ReLU function g(·) at the end of every single layer.

3.4 The N-PLN architecture which accounts for the estimation adjustment made in NmPLN. The matrix variables D1 and D2 are auxiliary variables to more easily see the structure of Y^S_{r_l}.

5.1 Spectrogram of two optimal matrices computed based on the two multi-input cases. The NMF-based input is masked.

5.2 Spectrogram of two optimal matrices computed based on the two multi-input cases. The NMF-based input is non-masked.

5.3 NMSE for some methods with specific configurations. A hollow circle denotes a new layer with a new output matrix, a solid circle with black borders denotes the last new node added in the specific layer, and a black solid circle denotes a new layer with the last known output matrix.

6.1 NMSE for the N-PLN and NmPLN pair with ‘Mix’ input. A hollow circle denotes a new layer with a new output matrix, a solid circle with black borders denotes the last new node added in the specific layer, and a black solid circle denotes a new layer with the last known output matrix.

6.2 NMSE of aNmPLN with ‘NMF’ input. A hollow circle denotes a new layer with a new output matrix, a solid circle with black borders denotes the last new node added in the specific layer, and a black solid circle denotes a new layer with the last known output matrix.

6.3 A set of optimal output matrices Ω for one simulation experiment of aNmPLN with the specific setting ‘Mix-Iterative-MaskAll’.

6.4 Spectrogram of a speech and a noisy speech sequence from the mismatched condition. The noisy speech is a mixture of a female utterance and cafeteria noise at an SNR level of 5 dB.

6.5 Spectrograms of speech estimate sequences from selected configurations from the mismatched condition. The enhancement is done on noisy speech based on a female utterance and cafeteria noise at an SNR level of 5 dB.


List of Tables

5.1 NMSE training scores of the theoretical claims across all SNR levels. The best specific setting is bolded.

5.2 NMSE test scores of the theoretical claims. The best specific setting for each SNR level within the same condition is bolded.

5.3 The preliminary choices which were carefully tuned (independent variables) for the five most suitable N-PLN and NmPLN of each problem formulation. The horizontal bar, which some dependent variables have, indicates the value averaged over several experimental simulations.

5.4 The NMSE performance in the matched condition for an N-PLN architecture. The best specific setting for each SNR level is bolded.

5.5 The NMSE performance in the matched condition for an NmPLN architecture. The best specific setting for each SNR level is bolded.

5.6 Speech enhancement performance from the most promising specific settings. The best specific setting in each speech performance measure for each SNR level is bolded.

6.1 The final parameter choices which were carefully tuned (independent variables). The horizontal bar, which some dependent variables have, indicates the value averaged over several experimental simulations.

6.2 The NMSE score based on the convex formulation. The best specific setting for each SNR level within the same condition is bolded.

6.3 The NMSE performance based on the aN-PLN architecture. The best specific setting for each SNR level within the same condition is bolded.

6.4 The NMSE performance based on the aNmPLN architecture. The best specific setting for each SNR level is bolded.

6.5 Speech enhancement performance from the most promising specific settings. The best specific setting in each speech performance measure for each SNR level is bolded.

B.1 The speech dataset used for training, validating and testing. All data is of dialect 1 (DR1), which corresponds to the dialect region of New England.

B.2 The noise dataset for training, validating and testing.

C.1 The NMSE score based on the N-PLN architecture in the mismatched condition.

C.2 The NMSE score based on the NmPLN architecture in the mismatched condition.

C.3 The SDR score based on the N-PLN architecture.

C.4 The SDR score based on the NmPLN architecture.

C.5 The SIR score based on the N-PLN architecture.

C.6 The SIR score based on the NmPLN architecture.

C.7 The SAR score based on the N-PLN architecture.

C.8 The SAR score based on the NmPLN architecture.

C.9 The PESQ score based on the N-PLN architecture.

C.10 The PESQ score based on the NmPLN architecture.

C.11 NMSE and speech enhancement performance of two simulations based on NMF when the number of basis vectors is doubled (R = 80). Bolded values correspond to performance equal to or better than the corresponding promising specific settings in the preliminary results (cf. Table 5.6).

D.1 The SDR score based on the convex formulation.

D.2 The SDR score based on aN-PLN.

D.3 The SDR score based on aNmPLN.

D.4 The SIR score based on the convex formulation.

D.5 The SIR score based on aN-PLN.

D.6 The SIR score based on aNmPLN.

D.7 The SAR score based on the convex formulation.

D.8 The SAR score based on aN-PLN.

D.9 The SAR score based on aNmPLN.

D.10 The PESQ score based on the convex formulation.

D.11 The PESQ score based on aN-PLN.

D.12 The PESQ score based on aNmPLN.


Acronyms

ADMM Alternating Direction Method of Multipliers.

aNmPLN ad-hoc formulated Non-negative modified Progressive Learning Network.

ANN Artificial Neural Network.

cNmPLN convex formulated Non-negative modified Progressive Learning Network.

DNN Deep Neural Network.

ELM Extreme Learning Machine.

FFT Fast Fourier Transform.

ISTFT Inverse Short-Time Fourier Transform.

LS Least Square.

MMSE Minimum Mean Square Error.

mPLN modified Progressive Learning Network.

MSE Mean Square Error.

NMF Non-negative Matrix Factorization.

NmPLN Non-negative modified Progressive Learning Network.

NMSE Normalized Mean Square Error.

PESQ Perceptual Evaluation of Speech Quality.

PLN Progressive Learning Network.

PP Progression Property.

ReLU Rectified Linear Unit.

SAR Sources to Artifacts Ratio.

SDR Source to Distortion Ratio.

SIR Source to Interferences Ratio.

SNR Signal-to-Noise Ratio.

STFT Short-Time Fourier Transform.

TFR Time-Frequency Representation.


Chapter 1

Introduction

1.1 Motivation

Speech is one of humanity's most basic and greatest tools for communication. The quality of the message can, however, be obscured in a noisy environment. For a person with fully functioning hearing, separating the speech from a noisy mixture is usually not a problem thanks to our sophisticated auditory system, but it is problematic for those with hearing impairment [1]. Nonetheless, understanding noisy speech, as in a cafeteria, requires significant listening effort, which most would likely prefer to avoid.

Another area in which suppressing noise in noisy speech has attracted increasing interest in recent years is the interaction between humans and virtual assistants, which relies on automatic speech recognition (ASR). Virtual assistants that have been integrated into our daily lives include, among others, Siri (Apple), Alexa (Amazon), and Google Assistant. To integrate further into our lives, the assistants need to adapt to our different noise conditions [2].

Although speech enhancement may be associated with solutions for the financially well-off, it can be considered to have a more significant impact on those in a less favorable financial situation, at least if a cost-efficient solution is available. After all, disabling hearing loss is more common and more severe in less developed countries [3]. These countries also have a high rate of childhood deafness and hearing impairment; consequently, affected children often find it harder to learn and to get educated enough to move out of poverty.

With regard to globalization, solving global challenges has gained higher priority among politicians, as we are more dependent on each other than ever before. In fact, the United Nations has addressed several global challenges known as the Sustainable Development Goals, which are, according to them, "the blueprint to achieve a better and more sustainable future for all" [4]. Among these goals, goal 9 (Industry, Innovation and Infrastructure) concerns, among other things, development and innovation within information and communication technology [5], and we consider that our report can contribute to this goal. To accommodate this, we attempt to present our findings, code, and requirements with transparency and reproducibility in mind.

All in all, the ability to develop better noise-suppression techniques appears to be important. Fortunately, with technological advancement, computational capacity has multiplied; as a result, artificial neural networks (ANN) have been able to flourish. Together with well-documented noise reduction methods such as spectrum-based speech enhancement, this presents several possibilities.

1.2 Related Work

Early works that made significant contributions to the development of spectrum-based speech enhancement include spectral subtraction [6] and parametric Wiener filtering [7], [8]. A summary of several methods available at the time, including the above, is presented in a paper by Lim and Oppenheim [9]. Undoubtedly, with technological advancement, computational capacity has increased manyfold over the past decades. Even so, the main speech enhancement ideas remain the same, while more powerful, computationally demanding techniques are used to process the data.

As a spectrum is by nature non-negative, Grais and Erdogan [10] solved a speech-music separation problem with a method called non-negative matrix factorization (NMF). The authors' idea of using NMF was to represent each source by a linear combination of a basis matrix and a coefficient matrix in a supervised learning approach. The algorithm they used to solve the NMF problem was based on a well-known work by Lee and Seung. In the same paper, Grais and Erdogan also investigated the use of masking filters as a method to improve the separation process. Their results showed that the best result was not obtained by applying a Wiener filter, which is an optimal filter in a minimum mean square error (MMSE) sense.

When applying NMF, the subspaces of the different sources are assumed to be orthogonal to each other, but this is often not the case in practice. With some modifications that impose orthogonality for each source, the orthogonality problem appears to be satisfied in a paper by Ding et al. [11]. However, this criterion only holds for independently trained sources; hence, for mixed sources it remains unsolved. Fortunately, Kang et al. [12] demonstrated an idea to improve this situation for mixed sources, namely to combine an NMF estimation with a non-linear system. In their case, the non-linear system was a deep neural network (DNN). Ultimately, their idea of using a DNN was to make it "learn complicated inter-dependencies between variables", which cannot be achieved when the sources are trained independently.

Although the DNN is a powerful and well-established ANN architecture, it has, depending on the structure, some significant drawbacks that are important to consider. One of these occurs when using a DNN with backpropagation, as backpropagation is a computationally expensive operation. An architecture that addresses this drawback is the extreme learning machine (ELM) [13], which also belongs to the ANN family. The idea in the ELM framework is to randomly choose the weights of every hidden layer without the need to change them later on, as would be required if one applied backpropagation to fit the reference signal. Instead, ELM solves for the output matrix by minimizing the norm between the reference signal and the estimated signal. Thus, the gain of using ELM instead of a conventional feedforward neural network is that it can produce good generalization performance while significantly reducing the training time.

To obtain a better-performing architecture than ELM, Chatterjee et al. [14] have in their work taken advantage of the computational gain from using ELM. By applying the idea of ELM together with an architecture design that can always satisfy the progression property, they developed what they refer to as a progressive learning network (PLN). With the use of the progression property, they are able to prove, from a theoretical mathematical standpoint, that the cost is non-increasing when adding more nodes to a hidden layer and when adding several hidden layers.

1.3 Thesis Project

Motivated by the strict mathematical justification together with the promising practical results of PLN, this thesis proposes a new architecture based on its concept. The proposed architecture is intended to deal with a spectrum-based speech enhancement problem formulation, but it can be extended to other applications as well.

Furthermore, as the PLN architecture is described as one that requires minimal parameter tuning, the intention is to retain this strength and instead focus on different configurations. After all, speech quality is a subjective phenomenon; thus, providing different configurations allows for deeper analysis when examining the speech enhancement measurements. The different configurations that are observed and compared are as follows:

• NMF computed signal and unprocessed mix signal,

• solving a convex optimization problem with non-negative constraints and an ad-hoc approach to relax the imposed constraints,

• masking filter arrangements, and

• the number of stacked single layers.

1.4 Contribution

Based on this thesis and related work, a manuscript for submission to a conference is in preparation. The manuscript should, when finished, be available at the following site: https://sites.google.com/site/filiptsai.

1.5 Thesis Outline

The rest of this report is organized as follows:

Chapter 2 provides the necessary background which is the main building block for the proposed architecture. Our focus is mainly to cover NMF and PLN.

Chapter 3 explains the core contribution of this thesis, which is the proposed architecture. Here, the different configurations are also presented.


Chapter 4 describes the procedures used for the experimental setup, including algorithm descriptions, parameter choices, and the feature extraction procedure.

Chapter 5 presents and discusses a simple theoretical experiment. Afterward, the preliminary results are presented to give an overview of the proposed ideas.

Chapter 6 presents and discusses the final results both numerically and visually. Here, the configurations used are based on the knowledge gained from the previous chapter. A general discussion is also included to highlight possible benefits that our method can contribute in a wider sense.

Chapter 7 concludes this report and gives possibilities of future work.


Chapter 2

Background

In this chapter, the foundation needed to build our proposed architecture is presented. The chapter therefore focuses on NMF and PLN, but a brief review of ANN and ELM is also included, as they form the foundation of PLN. For a more in-depth review of ANN and ELM, we recommend the book by Bishop [15] and the paper by the main ELM inventor Huang [16], respectively.

2.1 Non-negative Matrix Factorization

As briefly expressed in section 1.2, NMF is a method to express a non-negative signal as a linear combination in which each individual component retains the property of non-negativity. To demonstrate how it is computed, first assume that we have $K$ observations of a spectrum-based signal with a resolution of $J$ frequency bins, $\mathbf{t}^{(k)} \in \mathbb{R}^{J \times 1}_{\geq 0}$, where $k = 1, \dots, K$ and $\mathbb{R}_{\geq 0}$ denotes non-negative elements. In a more compact, matrix form, we can express the full spectrum sequence as $\mathbf{T} = [\mathbf{t}^{(1)}, \dots, \mathbf{t}^{(K)}] \in \mathbb{R}^{J \times K}_{\geq 0}$. By representing the full spectrum sequence as a linear combination, we can express it as

$$\mathbf{T} = \widetilde{\mathbf{T}} + \mathbf{E} = \mathbf{W}\mathbf{C} + \mathbf{E}, \qquad (2.1)$$

where $\widetilde{\mathbf{T}}$ is the non-negative linear-combination estimate of $\mathbf{T}$, $\mathbf{W} \in \mathbb{R}^{J \times R}_{\geq 0}$ is the basis matrix, $\mathbf{C} \in \mathbb{R}^{R \times K}_{\geq 0}$ is the coefficient (content) matrix, and $\mathbf{E} \in \mathbb{R}^{J \times K}$ is the residual noise matrix resulting from an imperfect representation. The value of $R$, the number of basis vectors, is oftentimes chosen smaller than $J$ and $K$ for computational reasons, at the cost of increasing the noise term. It can be observed that in the case $R = J$ we can obtain perfect reconstruction.[i] For illustrative purposes, in human speech $\mathbf{W}$ could be associated with the dimensions of the speech organs, while $\mathbf{C}$ could be seen as how much air leaves the mouth and the nose at different time instants.

As one can see in equation 2.1, expressing $\mathbf{T}$ by a linear combination yields a residual noise term $\mathbf{E}$, which is an undesirable by-product. To reduce the presence of the residual noise, one can impose a cost function with the aim of minimizing the cost. One such cost function is the following Kullback-Leibler (KL) divergence:

$$D_{\mathrm{KL}}(\mathbf{T} \,\|\, \mathbf{W}\mathbf{C}) = \sum_{jk} \left( T_{jk} \log \frac{T_{jk}}{(\mathbf{W}\mathbf{C})_{jk}} - T_{jk} + (\mathbf{W}\mathbf{C})_{jk} \right), \qquad (2.2)$$

where $j$ and $k$ correspond to the $j$'th row and $k$'th column. Although the problem is not convex in $\mathbf{W}$ and $\mathbf{C}$ jointly, meaning that there is no guarantee of finding a global solution, an algorithm to minimize the cost function still exists. This algorithm, developed by Lee and Seung [17], is a multiplicative update rule which ensures that the overall cost is non-increasing for each additional iteration. Furthermore, it has been used in several works with promising results, including [18], [10], and [12]. Their multiplicative update rule, in matrix operations, can be expressed as follows:

$$\mathbf{C} \leftarrow \mathbf{C} \odot \left[ \frac{\mathbf{W}^{T}\!\left(\mathbf{T} / (\mathbf{W}\mathbf{C})\right)}{\mathbf{W}^{T}\mathbf{1}} \right], \qquad (2.3a)$$

$$\mathbf{W} \leftarrow \mathbf{W} \odot \left[ \frac{\left(\mathbf{T} / (\mathbf{W}\mathbf{C})\right)\mathbf{C}^{T}}{\mathbf{1}\mathbf{C}^{T}} \right], \qquad (2.3b)$$

where $\mathbf{1}$ is a matrix of ones with the same dimension as $\mathbf{T}$, $\odot$ is the Hadamard product (element-wise multiplication), and all division operations are element-wise.

[i] If one chooses $R = J$, one feasible solution is $\mathbf{W} = \mathbf{I}_J$ ($\mathbf{I}_J$ is an identity matrix of dimension $J$) and $\mathbf{C} = \mathbf{T}$, in other words a perfect reconstruction.
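To make the update rules concrete, the following is a minimal NumPy sketch of the multiplicative updates in 2.3. The function name, the initialization, and the small synthetic test data are illustrative assumptions and not part of the thesis implementation.

```python
import numpy as np

def kl_nmf(T, R, n_iter=200, eps=1e-12, seed=0):
    """Minimal KL-divergence NMF via the Lee-Seung multiplicative updates (2.3).

    T : (J, K) non-negative spectrogram matrix.
    R : number of basis vectors.
    Returns W (J, R) and C (R, K) such that T ~ W @ C.
    """
    rng = np.random.default_rng(seed)
    J, K = T.shape
    W = rng.random((J, R)) + eps          # random non-negative initialization
    C = rng.random((R, K)) + eps
    ones = np.ones_like(T)
    for _ in range(n_iter):
        WC = W @ C + eps
        C *= (W.T @ (T / WC)) / (W.T @ ones + eps)   # update rule (2.3a)
        WC = W @ C + eps
        W *= ((T / WC) @ C.T) / (ones @ C.T + eps)   # update rule (2.3b)
    return W, C

# Tiny usage example on synthetic non-negative data.
T = np.abs(np.random.default_rng(1).normal(size=(64, 100)))
W, C = kl_nmf(T, R=10)
print("reconstruction error:", np.linalg.norm(T - W @ C))
```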

Let us now consider the possibility of separating a source mixture into $Q$ different sources. In an NMF framework, we would then be required to obtain $Q$ different $\mathbf{W}$ and $\mathbf{C}$ pairs, where the mixed source could be expressed as

$$\mathbf{T}_{\mathrm{mix}} = \widetilde{\mathbf{T}}_{\mathrm{mix}} + \mathbf{E}_{\mathrm{mix}} = \mathbf{W}_{\mathrm{mix}}\mathbf{C}_{\mathrm{mix}} + \mathbf{E}_{\mathrm{mix}}, \qquad (2.4)$$

where $\mathbf{W}_{\mathrm{mix}} = [\mathbf{W}_1 \; \mathbf{W}_2 \; \cdots \; \mathbf{W}_Q]$, $\mathbf{C}_{\mathrm{mix}} = [\mathbf{C}_1^T \; \mathbf{C}_2^T \; \cdots \; \mathbf{C}_Q^T]^T$, and $\mathbf{E}_{\mathrm{mix}} = \sum_{q=1}^{Q} \mathbf{E}_q$ is the residual noise from each linear-combination approximation. An option for estimating source $q$ from a source mixture is to learn each basis matrix $\mathbf{W}_q$ beforehand, independently of the other sources, according to the update rules in 2.3. To then find the content matrix for a source mixture, and to represent each source estimate, we could minimize the following KL divergence:

$$D_{\mathrm{KL}}(\mathbf{T}_{\mathrm{mix}} \,\|\, \mathbf{W}_{\mathrm{mix}}\widetilde{\mathbf{C}}_{\mathrm{mix}}), \qquad (2.5)$$

where $\widetilde{\mathbf{C}}_{\mathrm{mix}}$ is the content matrix estimate given $\mathbf{W}_{\mathrm{mix}}$. As $\mathbf{W}_{\mathrm{mix}}$ is considered a known variable, $\widetilde{\mathbf{C}}_{\mathrm{mix}}$ can be found by computing this update rule (cf. update rule 2.3a):

$$\widetilde{\mathbf{C}}_{\mathrm{mix}} \leftarrow \widetilde{\mathbf{C}}_{\mathrm{mix}} \odot \left[ \frac{\mathbf{W}_{\mathrm{mix}}^{T}\!\left(\mathbf{T}_{\mathrm{mix}} / (\mathbf{W}_{\mathrm{mix}}\widetilde{\mathbf{C}}_{\mathrm{mix}})\right)}{\mathbf{W}_{\mathrm{mix}}^{T}\mathbf{1}} \right]. \qquad (2.6)$$

Note that each segment of $R$ rows (at an integer multiple of $R$) in $\widetilde{\mathbf{C}}_{\mathrm{mix}}$ corresponds to a particular source, as determined by $\mathbf{W}_{\mathrm{mix}}$.
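As a sketch of this supervised separation step, the code below assumes that per-source basis matrices have already been learned (for instance with the `kl_nmf` sketch above), concatenates them into a mixture basis, and updates only the content matrix as in 2.6. Variable names and the reconstruction by simple per-source synthesis are illustrative assumptions.

```python
import numpy as np

def separate_with_fixed_bases(T_mix, W_list, n_iter=200, eps=1e-12, seed=0):
    """Estimate per-source spectra from a mixture using pre-trained bases (2.4)-(2.6)."""
    W_mix = np.hstack(W_list)                        # W_mix = [W_1 ... W_Q]
    rng = np.random.default_rng(seed)
    C_mix = rng.random((W_mix.shape[1], T_mix.shape[1])) + eps
    ones = np.ones_like(T_mix)
    for _ in range(n_iter):                          # update rule (2.6): W_mix stays fixed
        WC = W_mix @ C_mix + eps
        C_mix *= (W_mix.T @ (T_mix / WC)) / (W_mix.T @ ones + eps)
    # Split C_mix into per-source segments of R_q rows and reconstruct each source.
    estimates, row = [], 0
    for W_q in W_list:
        R_q = W_q.shape[1]
        estimates.append(W_q @ C_mix[row:row + R_q, :])
        row += R_q
    return estimates
```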

Even though NMF is used in several applications, as presented in the introduction, it still has some limitations. Thus, the PLN is described in the section below; it can be seen as a substitute for the DNN in order to "learn complicated inter-dependencies between basis matrices" [12].


2.2 Progressive Learning Network

The main idea of a PLN is to improve the input by learning non-linear attributes of the given data. This is accomplished by exploiting the progression property (PP) and non-tuned random-valued matrices in combination with a non-linear function [14]. Therefore, before describing the PLN further, it is beneficial to demonstrate the progression property and to briefly present a standard neural network as well as the extreme learning machine (ELM).

2.2.1 Progression Property

The purpose of the PP is to find a relationship in which the input signal, $\boldsymbol{\gamma} \in \mathbb{R}^{N}$, equals the output signal through some wise choices of linear transformations. As this property is not limited to linear relationships, a non-linear function $g(\cdot)$ can also be applied. One feasible mathematical expression, used in the study by Chatterjee et al. [14], is the following:

$$\mathbf{U}_N\, g(\mathbf{V}_N \boldsymbol{\gamma}) = \boldsymbol{\gamma}, \qquad (2.7)$$

where $\mathbf{V}_N \in \mathbb{R}^{M \times N}$ and $\mathbf{U}_N \in \mathbb{R}^{N \times M}$ are known linear transformations. A specific non-linear function that satisfies the PP is the rectified linear unit (ReLU) function,

$$g(\boldsymbol{\gamma}) = \max(\boldsymbol{\gamma}, 0), \qquad (2.8)$$

in combination with the linear transformation variables chosen as $\mathbf{V}_N^T = \mathbf{U}_N = [\mathbf{I}_N \; -\mathbf{I}_N]$, where $\mathbf{I}_N$ is an identity matrix of dimension $N$.
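A quick numerical check of the progression property with the ReLU choice above; this is only a sketch, and the vector and its dimension are arbitrary assumptions.

```python
import numpy as np

N = 5
gamma = np.random.default_rng(0).normal(size=(N, 1))   # arbitrary input, may be negative

V = np.vstack([np.eye(N), -np.eye(N)])                  # V_N = [I_N; -I_N], so V_N^T = U_N
U = np.hstack([np.eye(N), -np.eye(N)])                  # U_N = [I_N  -I_N]
relu = lambda z: np.maximum(z, 0)                       # g(.) in (2.8)

# U_N g(V_N gamma) reproduces gamma exactly, since relu(x) - relu(-x) = x.
assert np.allclose(U @ relu(V @ gamma), gamma)
```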

2.2.2 Artificial Neural Network

An artificial neural network is a mathematical representation inspired by the brain, more specifically by the activity of a cluster of neurons forming a network [19]. A standard ANN architecture, see Figure 2.1, usually consists of three types of layers: an input layer, a number of hidden layers, and an output layer.

The output signal is calculated sequentially, starting from the input layer, in which the current feature vector, in this case $\mathbf{x}^{(k)} \in \mathbb{R}^{\Gamma}$, is linearly transformed by the weight matrix $\Theta_1$.[ii] This transformed signal is then passed through the non-linear transformation (NLT) variable $\mathbf{G}_1$, which yields a new current feature vector. This newly generated feature vector then repeats the above steps, with the layer index $l$ incremented in $\Theta_l$ and $\mathbf{G}_l$, until the output layer is reached. This process is referred to as the forward pass. However, as each weight matrix is often not optimal, since the matrices are often randomly chosen, the deviation between $\tilde{\mathbf{t}}^{(k)}$ and the ground truth $\mathbf{t}^{(k)}$ is large. To reduce this deviation, one strategy is to formulate a mean square error (MSE) cost function such as

$$C_{\mathrm{ANN}} = \mathrm{Cost}(\tilde{\mathbf{t}}^{(k)}) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{t}^{(k)} - \tilde{\mathbf{t}}^{(k)} \right\|^2 \qquad (2.9)$$

and then take the gradient of the cost. The obtained gradient serves as an indication of how much each element in the last weight matrix should be adjusted.

Figure 2.1: Architecture of an artificial neural network with $L+1$ layers. The top diagram illustrates how the neural network often is drawn, while the bottom only illustrates the essential features.

The adjustment of each weight matrix happens layer by layer, starting from the output layer and moving towards the input layer, where a new gradient matrix is computed in each layer. This action of propagating backwards layer by layer and adjusting each element of each weight matrix is known as backpropagation. After all weight matrices have been adjusted, a new forward pass is carried out, yielding another estimate with some deviation from the ground truth.

This back-and-forth process continues until a stopping criterion is met. Ideally, the step size chosen for the gradient search should monotonically decrease the cost function at each iteration. Unfortunately, there is no strict mathematical formulation for choosing an appropriate step size, meaning that finding one can become a tedious process. This fact, together with backpropagation itself being a computationally demanding process, has led us to look into another architecture, namely the ELM.

[ii] Oftentimes, a bias vector with the same dimension as the number of output nodes is included in $\Theta$. In this case, the row dimension of $\Theta$ is the number of input nodes plus one.
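For concreteness, below is a minimal sketch of a single forward pass and the MSE cost in 2.9. The layer sizes, the choice of ReLU for every $\mathbf{G}_l$, and the random weights are illustrative assumptions; a real network would also include bias terms and be trained with backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

# Feature/target pairs x^(k), t^(k) stacked as columns (toy dimensions).
K, dim_in, dim_hidden, dim_out = 100, 20, 40, 10
X = rng.normal(size=(dim_in, K))
T = rng.normal(size=(dim_out, K))

# Randomly initialized weight matrices Theta_1, Theta_2 (no bias terms here).
Theta1 = rng.normal(size=(dim_hidden, dim_in))
Theta2 = rng.normal(size=(dim_out, dim_hidden))

# Forward pass: linear transform, non-linearity, then the output layer.
A1 = relu(Theta1 @ X)          # G_1(Theta_1 x) for all k at once
T_hat = Theta2 @ A1            # estimates t~^(k)

# MSE cost (2.9); backpropagation would use its gradient to adjust each Theta_l.
C_ann = np.mean(np.sum((T - T_hat) ** 2, axis=0))
print("C_ANN =", C_ann)
```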

2.2.3 Extreme Learning Machine

ELM is a feedforward neural network, and the idea of the ELM architecture arose as a response to the computationally demanding backpropagation operation. The solution that ELM brings is, instead of randomly generating the last weight matrix, to solve for it in a least squares (LS) problem formulation, that is,

$$\Theta_L^{*} = \arg\min_{\Theta_L} \sum_{k=1}^{K} \left\| \mathbf{t}^{(k)} - \Theta_L \mathbf{a}_L^{(k)} \right\|^2, \qquad (2.10)$$

where $\mathbf{a}_L^{(k)} = \mathbf{G}_{L-1}\Theta_{L-1} \cdots \mathbf{G}_2\Theta_2\mathbf{G}_1\Theta_1 \mathbf{x}^{(k)}$ is the output after computing the $(L-1)$'th non-linear transformation (cf. Figure 2.1). Then, to evaluate the cost of the solved LS estimate, we can once again use an MSE cost, that is,

$$C_{\mathrm{ELM}} = \mathrm{Cost}(\Theta_L^{*}) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{t}^{(k)} - \Theta_L^{*} \mathbf{a}_L^{(k)} \right\|^2. \qquad (2.11)$$

To cite one of the main inventors of ELM, it should, assuming an appropriate architecture size is chosen, yield "better generalization performance than the gradient-based learning such as backpropagation in most cases" [13].
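Continuing the toy example from the previous sketch, an ELM-style variant keeps the hidden weights random and solves only the last weight matrix in the LS sense of 2.10, here with `numpy.linalg.lstsq` (a regularized solver could equally be used). Names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

K, dim_in, dim_hidden, dim_out = 100, 20, 40, 10
X = rng.normal(size=(dim_in, K))
T = rng.normal(size=(dim_out, K))

# Random, never-updated hidden layer: a_L^(k) = G_1(Theta_1 x^(k)).
Theta1 = rng.normal(size=(dim_hidden, dim_in))
A = relu(Theta1 @ X)

# Solve Theta_L in the LS sense (2.10): min || T - Theta_L A ||^2.
ThetaL = np.linalg.lstsq(A.T, T.T, rcond=None)[0].T

# Evaluate the MSE cost (2.11).
C_elm = np.mean(np.sum((T - ThetaL @ A) ** 2, axis=0))
print("C_ELM =", C_elm)
```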

2.2.4 PLN

Architecture Design

The PLN design is based on the idea of ELM, meaning that backpropagation is not necessary when computing the estimated signal. As a result, it is a rather computationally efficient method for handling big data. The key aspect that separates the two methods is that PLN ensures that the cost on the training data never increases when the architecture size grows, while ELM does not. The implication is that the time required to find a reasonable architecture size can be reduced by imposing a threshold instead of manually tuning the architecture size.

The key component that permits PLN to achieve a non-increasing cost lies in the possibility of always being able to satisfy the PP in a convex problem formulation. In PLN, the optimized signal can be interpreted as two separate signals joined together. One of them is a deterministic reference signal, which is better than or equal to the input signal, while the second is a stochastic signal. When solving a general convex problem formulation, the output can be better than, the same as, or worse than the input signal. However, by taking advantage of the PP, it is ensured that in the worst-case scenario the output matrix yields the deterministic reference signal by forcing the influence of the stochastic part to zero; thereby the non-increasing cost is guaranteed.

Before providing the mathematical problem formulation that characterizes PLN, an overview of the PLN process is given in five steps. For the sake of simplicity, from here onward we use variables in matrix notation instead of vector notation.

1. An optimal output signal estimate is calculated based on the input signal ($\mathbf{X}$) if it is the first layer ($l = 1$), or on the previous output estimate ($\widetilde{\mathbf{T}}_{l-1}$) for consecutive layers, by applying the optimal linear transformation ($\mathbf{H}^{*}$ or $\mathbf{O}^{*}_{l-1}$). The resulting optimal estimate is afterwards linearly transformed by the sign splitter matrix ($\mathbf{V}_J$) with column dimension $J$.

2. In parallel, a random matrix ($\mathbf{R}_{n_l}$) is applied, which together with the initial input data ($\mathbf{X}$) or the previous hidden layer output matrix ($\mathbf{Y}_{r_{l-1}}$) describes some of the intrinsic non-linear attributes. Here, $n_l$ is the number of random nodes and $r_l = 2J + n_l$ is the total number of nodes in the hidden layer output matrix.

3. The results of the above two steps are then combined and passed through a non-linear function, in this case the ReLU function ($g(\cdot)$), which yields the current hidden layer output matrix ($\mathbf{Y}_{r_l}$).

4. An optimal estimate ($\widetilde{\mathbf{T}}_l$) is calculated based on an optimal linear matrix ($\mathbf{O}^{*}_l$) between the hidden layer output matrix ($\mathbf{Y}_{r_l}$) and the true output signal ($\mathbf{T}$). To ensure a non-increasing cost function, the defined set of possible optimal linear matrices needs to include the linear matrix ($\mathbf{O}_l$) that satisfies the PP (for more detail, see The Mathematical Formulation below).

5. If there is an indication that the performance can be further improved (a predetermined threshold is not yet satisfied), the above steps are repeated. Depending on the reason for the potential performance improvement, the architecture either grows by a certain number of random nodes or a new single layer is introduced.

To visualize the PLN architecture, a block diagram can be seen in Figure 2.2.

Figure 2.2: Architecture of PLN. The matrix variables $\mathbf{D}_1$ and $\mathbf{D}_2$ are auxiliary variables to more easily see the structure of $\mathbf{Y}_{r_l}$.

The Mathematical Formulation

From the above description, we can express the hidden layer output matrix as

$$\mathbf{Y}_{r_l} = \begin{cases} g\!\left(\begin{bmatrix} \mathbf{V}_J \mathbf{H}^{*} \mathbf{X} \\ \mathbf{R}_{n_l} \mathbf{X} \end{bmatrix}\right), & l = 1, \\[1.5ex] g\!\left(\begin{bmatrix} \mathbf{V}_J \mathbf{O}^{*}_{l-1} \widetilde{\mathbf{T}}_{l-1} \\ \mathbf{R}_{n_l} \mathbf{Y}_{r_{l-1}} \end{bmatrix}\right), & l \geq 2, \end{cases} \qquad (2.12)$$

and the optimal estimate as

$$\widetilde{\mathbf{T}}_l = \begin{cases} \mathbf{H}^{*}\mathbf{X}, & l = 0, \\ \mathbf{O}^{*}_l \mathbf{Y}_{r_l}, & l \geq 1. \end{cases} \qquad (2.13)$$

To obtain an optimal estimate, the corresponding input signal needs to be optimally linearly transformed. Here, the optimal linear transformations $\mathbf{H}^{*}$ and $\mathbf{O}^{*}_l$ are defined as

$$\mathbf{H}^{*} = \arg\min_{\mathbf{H}} \; \|\mathbf{T} - \mathbf{H}\mathbf{X}\|_a^a \quad \text{subject to} \quad \|\mathbf{H}\|_b^b \leq \epsilon, \qquad (2.14a)$$

$$C_0 = \mathrm{Cost}(\mathbf{H}^{*}) = \|\mathbf{T} - \mathbf{H}^{*}\mathbf{X}\|_a^a, \qquad (2.14b)$$

and

$$\mathbf{O}^{*}_l = \arg\min_{\mathbf{O}_l} \; \|\mathbf{T} - \mathbf{O}_l\mathbf{Y}_{r_l}\|_a^a \quad \text{subject to} \quad \|\mathbf{O}_l\|_b^b \leq \alpha\, \|\mathbf{U}_l\|_b^b, \qquad (2.15a)$$

$$C_l = \mathrm{Cost}(\mathbf{O}^{*}_l) = \|\mathbf{T} - \mathbf{O}^{*}_l\mathbf{Y}_{r_l}\|_a^a, \qquad (2.15b)$$

where $\epsilon$ is an arbitrary constant, $\alpha \geq 1$ is a chosen constant, and $\|\mathbf{U}_l\|_b^b = 2J$ for $b = 1$ and $b = 2$. For clarification, the notation $\|\cdot\|_d$, where $d$ is either $a$ or $b$, corresponds to the matrix $\ell_d$-norm. Note that both optimization problems, in 2.14 and 2.15, are convex.

As the foundation of PLN relies on the PP being a possible outcome, it is worth observing the difference in constraints between the two optimal linear transformation calculations. At first, the difference might appear insignificant, as it is mainly a scalar factor of $\|\mathbf{U}_l\|_b^b$. However, the constraint in 2.15a enforces that $\|\mathbf{O}_l\|_b^b$ can always equal $\|\mathbf{U}_l\|_b^b$, that is, it is always possible to obtain $\mathbf{O}_l = \mathbf{U}_l$. As a result, the output estimate is in the worst case equal to the input signal, which ensures a non-increasing cost.

In contrast, when solving for the optimal linear transformation $\mathbf{H}^{*}$, an issue occurs in determining the value of $\epsilon$ if we choose it arbitrarily without any guidance. After all, $\epsilon$ might be chosen too small for the PP to be satisfied; thus the estimated output might in the worst case be a poorer representation of the reference signal than the current input signal. On the other hand, $\epsilon$ could also be chosen unnecessarily large, which implies that a significantly larger defined set might need to be evaluated; as a result, it can require unnecessary computational effort. Still, this does not necessarily imply that increasing the defined set is a bad idea. To visualize the different scenarios, a two-dimensional case can be seen in Figure 2.3. From the figure, it can be observed that the circles made by the blue dashed lines with $\alpha$ values ensure that $\mathbf{O}_l = \mathbf{U}_l$ is attainable. To demonstrate why an expanded defined set might be good, consider the possibility that a strictly better solution lies in the defined region for $\alpha = 2$ but not for $\alpha = 1$. Consequently, choosing a more extensive defined set may yield a better estimate at the expense of higher computational demand.


Figure 2.3: Illustration of the two constraint choices. The red diamond implies that Ol = Ul, each blue dashed lined circle ensures PP with a defined scaling factor (α) and each black solid lined circle corresponds to arbitrarily chosen constraint (ϵ).
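To illustrate how the PP guarantees a non-increasing training cost, the sketch below builds one hidden layer as in 2.12 and solves the output matrix with a plain, unconstrained least squares step. Instead of enforcing the norm constraint in 2.15a explicitly, it simply falls back to the PP solution $\mathbf{O}_l = \mathbf{U}_l$ whenever the new cost would exceed the previous one. This fallback is a simplification of the constrained convex problem actually used in PLN, and all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

J, K, n_l = 10, 200, 30
X = rng.normal(size=(J, K))               # input signal
T = rng.normal(size=(J, K))               # reference (target) signal

V = np.vstack([np.eye(J), -np.eye(J)])    # sign splitter V_J
U = np.hstack([np.eye(J), -np.eye(J)])    # U_l satisfying the PP
lstsq = lambda A, B: np.linalg.lstsq(A.T, B.T, rcond=None)[0].T  # M with M @ A ~ B

H = lstsq(X, T)                           # layer 0: optimal linear map H (2.14a, unconstrained)
T_prev = H @ X
cost_prev = np.mean(np.sum((T - T_prev) ** 2, axis=0))

# One PLN layer (2.12): stack the sign-split optimal estimate and a random projection of X.
R = rng.normal(size=(n_l, J))
Y = relu(np.vstack([V @ T_prev, R @ X]))

O = lstsq(Y, T)                           # unconstrained LS in place of (2.15a)
cost_new = np.mean(np.sum((T - O @ Y) ** 2, axis=0))
if cost_new > cost_prev:                  # PP fallback: O_l = U_l reproduces T_prev exactly
    O = np.hstack([U, np.zeros((J, n_l))])
    cost_new = cost_prev
print(cost_prev, "->", cost_new)
```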


Chapter 3

Modified Progressive Learning Network

We have now arrived at the main part of this report, namely the contribution.

In this chapter, the proposed architecture is presented; it is a modification of PLN that retains its useful properties. Additionally, particular emphasis is placed on the input signal, non-negativity (of the spectrum), and the masking filter.

3.1 Modified PLN Architecture

Following the idea of the PLN, we propose a new architecture, see Figure 3.1, which we call the modified PLN (mPLN). Compared to PLN, mPLN computes the linear transformation and the random matrix one after another. In relation to the five steps described in section 2.2.4, the first two steps can in the mPLN architecture be seen as block diagrams in series instead of in parallel as in the PLN architecture. The implication of having them in series is that we only have one main input signal into the system, which we denote $\mathbf{X}^{S}_{\mathrm{PLN}} \in \mathbb{R}^{J \times K}$, where $S$ denotes the speech/signal part. For further specification of the input signal $\mathbf{X}^{S}_{\mathrm{PLN}}$, see the next section.

Figure 3.1: Architecture of the modified PLN.

From an estimated input signal $\widetilde{\mathbf{T}}^{S}_l$, with $\widetilde{\mathbf{T}}^{S}_0 = \mathbf{X}^{S}_{\mathrm{PLN}}$, the signal is passed through a full-rank random-valued matrix $\mathbf{R}_{n_l} \in \mathbb{R}^{n_l \times J}$ with $n_l \geq J$, followed by a linear transformation $\mathbf{V}_{n_l}$ and a non-linear function $g(\cdot)$ (cf. subsection 2.2.1). The estimated output signal can as a result be expressed as

$$\widetilde{\mathbf{T}}^{S}_l = \mathbf{O}_{n_l} \mathbf{Y}^{S}_{n_l}, \qquad (3.1)$$
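A minimal sketch of one mPLN layer in the series arrangement described above. Because the chapter text is cut off here, the exact composition is an assumption based on the description (random projection, then the sign splitter of subsection 2.2.1, then ReLU), and the unconstrained least squares solve for the output matrix is a placeholder for the non-negativity-constrained problems introduced later in the thesis. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

J, K, n_l = 10, 200, 16                     # n_l >= J, as required
X_pln = np.abs(rng.normal(size=(J, K)))     # spectrum-based input X^S_PLN (non-negative)
T = np.abs(rng.normal(size=(J, K)))         # clean-speech reference spectrum

R = rng.normal(size=(n_l, J))               # full-rank random-valued matrix R_{n_l}
V = np.vstack([np.eye(n_l), -np.eye(n_l)])  # sign splitter V_{n_l} (cf. subsection 2.2.1)

# In series: random projection, then V_{n_l}, then the ReLU, giving Y^S_{n_l}.
Y = relu(V @ (R @ X_pln))

# Output matrix O_{n_l} solved here by plain least squares; the thesis additionally
# constrains the estimate to remain non-negative.
O = np.linalg.lstsq(Y.T, T.T, rcond=None)[0].T
T_est = O @ Y                               # estimated output signal (3.1)
```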
