
BLIND SOURCE SEPARATION IN ULTRA-LOW POWER DEVICES

by

DESHAWN ANDRE BROWN

BSEE, University of Denver, 2016

A thesis submitted to the Graduate Faculty of the

University of Colorado Colorado Springs

in partial fulfillment of the

requirement for the degree of

Master of Science

Department of Electrical and Computer Engineering


This thesis for the Master of Science degree by

DESHAWN ANDRE BROWN

has been approved for the

Department of Electrical and Computer Engineering

by

Mark Wickert, Chair

Jugal Kalita

Byeong Kil Lee

Date: December 17, 2020


BLIND SOURCE SEPARATION IN ULTRA-LOW POWER DEVICES

Thesis directed by Professor Mark Wickert

ABSTRACT

Blind Source Separation (BSS) is the process of decomposing a mixed signal into a set of linearly independent vector components. The Blind Source Separation problem arises organically in day-to-day human interaction in the form of the Cocktail Party Problem: when found in an auditory environment characterized by simultaneous speakers (say speaker A and speaker B), a human can narrow their auditory attention to a single target source (speaker A) and ignore all other sources (speaker B). While seemingly trivial for a human, this task has historically proven to be difficult to solve algorithmically. Advancements in the multi-disciplinary field of Artificial Intelligence (AI) have given way to Neural Network based approaches to BSS. In tandem, TinyML, an emerging field spearheading the deployment of Neural Networks onto resource-starved microcontrollers, yields promise of proliferation of AI on platforms previously thought to be computationally impractical.

In this work, we explore the feasibility of real-time BSS on a low-cost, commercially available microcontroller. The LibriSpeech dataset is used to create 1200 unique Male-Female speech mixtures. These mixtures are used to create a TensorFlow-based audio conditioning pipeline for training and deploying two BSS Deep Neural Network (BDNN) architectures. Once trained, TensorFlow Lite for Microcontrollers is used to compress and deploy the BDNNs onto the Arduino Nano 33 BLE Sense microcontroller. For each architecture, we … be deployed on the Arduino Nano 33 BLE Sense. While the implemented architectures are algorithmically functional on the target device, analysis reveals a fundamental tradeoff between model size (number of parameters), speech intelligibility, model throughput, and …

ACKNOWLEDGMENTS

Mom, Dad, Terrell, Jamal, and Tee-Tee - I am eternally grateful for your support as I've embarked on this journey in higher education. A special thanks to my colleagues at Northrop Grumman who were especially supportive in allowing me an incredible amount of flexibility while I have pursued this MSEE. I've thoroughly enjoyed my time at UCCS. Dr. Wickert, thank you for your limitless support in pursuing this thesis and giving me the creative liberty to explore such a bleeding edge concept. This thesis is the culmination of …

CHAPTER

1. Introduction
2. Core Concepts and Related Work
   2.1 Overview of the Blind Source Separation Problem
       2.1.1 Source Separation as an Inverse Problem
   2.2 The Cocktail Party Problem
       2.2.1 Principal Components Analysis (PCA) for Audio Source Separation
       2.2.2 Independent Components Analysis (ICA) for Audio Source Separation
       2.2.3 Constraints of PCA and ICA for Ultra-Low Power Device Integration
       2.2.4 Computational Auditory Scene Analysis (CASA)
       2.2.5 Time-Frequency Masking
   2.3 Artificial Neural Networks
       2.3.0.1 Convolutional Neural Networks
       2.3.0.2 Recurrent Neural Networks
       2.3.1 Neural Network Approaches to Blind Source Separation
   2.4 TensorFlow
       2.4.1 TensorFlow Lite for Microcontrollers
   2.5 Real-Time Signal Processing on Embedded Devices Considerations
   3.1 Audio Dataset
   3.2 TensorFlow Audio Processing Pipeline
       3.2.1 Short-Time Fourier Transform Frame Optimization
   3.3 Hardware
   3.4 Neural Network Architectures
       3.4.1 Neural Network Compression
       3.4.2 TensorFlow Lite Conversion and Microcontroller Deployment
   3.5 Latency Estimation
   3.6 Source Separation Figures-of-Merit
4. Results
   4.1 TensorFlow Neural Network Model Assessment
   4.2 Prototype Autoencoder Neural Network Model
   4.3 Lite Autoencoder Neural Network Model
   4.4 Super-Lite Autoencoder Neural Network Model
5. Discussion and Conclusion
   5.1 Conclusion
   5.2 Limitations and Challenges
   5.3 Future Work

BIBLIOGRAPHY


TABLE

3.1 Google Colaboratory Specifications
3.2 Short-Time Fourier Transform Frame Parameters
3.3 Deep Feed-Forward Autoencoder Prototype Architecture
3.4 Lite Autoencoder Neural Network Architecture
3.5 Super-Lite Autoencoder Neural Network Architecture
3.6 Deep Neural Network Training Parameters
4.1 TensorFlow Lite Model Sizes for Prototype Autoencoder
4.2 Prototype Autoencoder Inference Latency
4.3 Prototype Autoencoder Separation Metrics
4.4 TensorFlow Lite Model Sizes for Lite Autoencoder
4.5 Lite Autoencoder Inference Latency
4.6 Lite Autoencoder Separation Metrics
4.7 TensorFlow Lite Model Sizes for Super-Lite Autoencoder
4.8 Source Separation Metrics for Super-Lite Autoencoder
4.9 Super-Lite Autoencoder Inference Latency


FIGURE

2.1 General Formulation of an Inverse Problem
2.2 Cocktail Party Problem with 2 Sources and 2 Observers
2.3 PCA fails to align the Principal Axes of Non-Gaussian Mixtures
2.4 y = abs(x) is an example of a function that is uncorrelated but dependent
2.5 ICA is able to align Non-Gaussian mixtures with the Principal Axes
2.6 Time-Frequency Representation of a Speech Signal and its associated Hard Mask
2.7 Generic Feedforward Neural Network Topology
2.8 Generic Structure for a Recurrent Neural Network
2.9 An example of a Computational Graph
3.1 Effect of Normalization on STFT frame histogram
3.2 Packet-Streaming Concept for Frame-Based Source Separation
3.3 TensorFlow-Based Training Data Generation Pipeline
3.4 Effect of STFT parameters on Time-Frequency Resolution
3.5 Inappropriate STFT parameters cause estimated masks to ignore male-female speech features
3.6 TensorFlow Computational Graph for Densely-Connected Feed-Forward Architectures developed for this thesis
3.8 TensorFlow to TensorFlow Lite Deployment Scheme
4.1 Target Mask for Ideal Male-Speech Separation
4.2 Reconstructed Male Audio Sequence using Prototype Autoencoder Network
4.3 Reconstructed Female Audio Sequence using Prototype Autoencoder Network
4.4 Cross-Correlation Sequences for Ideal Male Speech and Prototype Autoencoder-Estimated Male Speech
4.5 Cross-Correlation Sequences for Ideal Male Speech and Prototype Autoencoder-Estimated Male Speech
4.6 Estimated Male Masks for Prototype Autoencoder Network
4.7 Reconstructed Male Audio for Lite Autoencoder Network
4.8 Reconstructed Female Audio for Lite Autoencoder Network
4.9 Cross-Correlation Sequences for Ideal Male Speech and Lite Autoencoder Estimated Male Speech
4.10 Cross-Correlation Sequences for Ideal Male Speech and Lite Autoencoder Estimated Female Speech
4.11 Estimated Male Masks for Lite Autoencoder Network
4.12 Reconstructed Male Audio for Super-Lite Autoencoder Network
4.13 Reconstructed Female Audio for Super-Lite Autoencoder Network
4.14 Cross-Correlation Sequences for Ideal Male Speech and Super-Lite Autoencoder Estimated Male Speech
4.15 Cross-Correlation Sequences for Ideal Male Speech and Super-Lite Autoencoder Estimated Female Speech


INTRODUCTION

The ubiquity of Blind Source Separation (BSS) techniques in audio signal processing is clear: from audio enhancement to vocal isolation, the ability to separate a target signal from the environment is a crucial driver in many of today's audio processing pipelines. Historically, algorithmic approaches to the BSS problem have been multi-pronged, with research efforts split between computational modeling of the human auditory system, via Computational Auditory Scene Analysis (CASA), and matrix-based approaches such as Independent Components Analysis (ICA) [1].

CASA and ICA largely differ in the classes of BSS that they aim to address. The goal of ICA is to separate all target sources from the input mixture, whereas the goal of CASA is to separate a target speaker from the mixing environment [2]. This difference becomes more clear when the performance of ICA and CASA are compared in different scenarios. A direct comparison of ICA and CASA demonstrates a dichotomy of performance when the mixing environment is altered [2]. Specifically, it is shown that CASA techniques … the converse is shown to be true for ICA. Auditorily, we find that CASA systems perform best when competing sources are tonal or narrowband in nature [3]. In contrast, ICA performs best when competing sources are both broadband in nature, as is the case for mixtures containing competing speech signals and/or Additive White Gaussian Noise (AWGN) [3]. It is worth noting that ICA is a special case of the BSS problem: performance only holds if the underlying mixing matrix is linear and the source signals are non-Gaussian.

In spite of these constraints, the ICA formulation of the BSS problem still finds use because it yields meritorious results in the standard formulation of the Cocktail Party Problem. Here, we have a mixture of concurrent speakers and have the explicit goal of isolating all target speakers. In these circumstances, ICA can reliably isolate the speakers in the mixture. Many communities, including the hearing aid community, have long held an interest in solving the problem in real-time [4]. Solving BSS on platforms like the hearing aid is unique because of the inherent resource-constrained nature of the device and the need for relatively low latency [4]. Unfortunately, the computationally intensive matrix-based mathematics that underlie ICA have prevented the technique from seeing widespread usage in real-time applications. While real-time approaches for ICA have been demonstrated using FPGAs [5], special hardware is required, and, because of the need to program in Verilog, ad-hoc tuning, modification, or addition of algorithms is not straightforward.

In recent years, the machine learning (ML) community has become increasingly involved with the problem [6]. Due to advancements in Graphical Processing Unit (GPU) technology and neural network theory, we have seen neural networks successfully used to … or exceeds the state-of-the-art. In speech mixtures, it has been shown that time-frequency masking, a core component of CASA, yields performance on-par with ICA due to favorable orthogonality properties in the time-frequency plane [7]. Neural network architectures have been developed that are able to estimate the time-frequency mask for a Short-Time Fourier Transform (STFT) frame of audio data. Once a time-frequency mask is known, the original speech signal can be recovered by applying the time-frequency mask to the STFT frame and then computing the inverse STFT.

The relevancy of these advancements in neural network design is bolstered by concurrent advancements in the development of TensorFlow, Google's computational graph framework [8], and the invention of TinyML, which is the concept of deploying ML algorithms on low-power microcontrollers [9]. Innovation in these areas makes the commercial use of algorithms previously seen as computationally infeasible a possibility. TensorFlow Lite for Microcontrollers, a portable build of TensorFlow, allows full-fledged TensorFlow models to be compressed into a low-memory footprint that is capable of running on commercially available microcontrollers [9]. When taken holistically, these factors give rise to the potential of solving the BSS problem on low-end hardware.

In this work, we address the issue of solving the source separation problem in real-time. Specifically, we address the viability of using the Arduino Nano 33 BLE Sense microcontroller [10] and the i.MX RT600 microcontroller [11] for estimating the time-frequency masks needed for separating linearly-mixed speech signals. Following this introductory chapter, Chapter 2 of this thesis manuscript presents a literature review that outlines the core concepts and related work that lay the foundation for our proposed … of neural networks, TensorFlow, and TensorFlow Lite model conversion. Following the establishment of these core concepts, Chapter 3 describes the audio processing pipeline model that was designed in TensorFlow and the corresponding conversion in TensorFlow Lite. This chapter includes details about the datasets used, the neural network processing architecture, and details the conversion from TensorFlow to TensorFlow Lite. Chapter 4 presents the results of the study, including performance comparisons to ICA and ideal time-frequency masks when applicable. Chapter 5 describes future work that may be pursued following this study and provides a summary of the work conducted as well as future …

CORE CONCEPTS AND RELATED WORK

This section describes the concepts and principles that govern this work. Specifically, we provide a mathematical overview of the Blind Source Separation problem and outline the two classical approaches to solving the problem: CASA and ICA. Mitianoudis' work on ICA serves as the foundation for the ICA section of this chapter [12]. Time-frequency masking, a newer technique that assigns time-frequency bins to a particular source and the driver for the neural network approach outlined in this thesis manuscript, is also discussed.

For its seminal role in this work, TensorFlow, Google's open-source framework and interface for developing and deploying machine learning algorithms, will be described in detail. TensorFlow Lite for Microcontrollers (TFMicro), a framework for converting and greatly compressing TensorFlow models so that they can run on embedded devices, will be described as well.

2.1 Overview of the Blind Source Separation Problem

2.1.1 Source Separation as an Inverse Problem

Source separation belongs to the broader class of mathematical problems known as inverse problems. In the forward problem, we wish to determine a function that maps a set of input parameters to a set of measurements. Formally, this mapping is known as the measurement operator M. The measurement operator transforms parameters in the function space X to the space of the data D. That is:

y = M(x) for x ∈ X and y ∈ D. (2.1)

The inverse problem is the opposite: in this formulation, we are tasked with finding points x ∈ X using observations y ∈ D such that:

y ≈ M(x) for x ∈ X and y ∈ D. (2.2)

In other words, we use our observations y ∈ D to determine the input parameters x ∈ X that would have been needed to yield the observations. This is shown visually in Figure 2.1. In order to uniquely reconstruct these parameters, we require the measurement operator to be injective. That is,

M(x1) = M(x2) implies x1 = x2, for all x1, x2 ∈ X. (2.3)

Because the measurements of a mixture more often than not contain noise, injectivity is often only approximate. For practical problems of interest, such as the source separation problem, we can treat a mixture's measurement operator as an approximation of an injective measurement operator. We can then find an inversion operator M−1 that maps the range of M to unique elements in X. The inversion operator is characterized by its estimated stability. The stability estimate takes the form of a modulus of continuity:

‖x1 − x2‖X ≤ ω(‖M(x1) − M(x2)‖D), (2.4)

where ω : R+ → R+ is an increasing function such that ω(0) = 0. Equation (2.4) quantifies the continuity of the mapping.

An inverse problem is considered well-posed when the reconstruction error can be described by some constant C, such that ω(x) = Cx. When strong noise is present in the observations or the inverse mapping landscape is otherwise unstable, such that small changes in parameters yield large changes in observations, then an inverse problem is deemed to be ill-posed. In general, the BSS problem is characterized as an ill-posed inverse problem. Attempting to solve the inverse problem using numerical methods requires that we regularize the problem. The classical means of doing so is by making a-priori assumptions about the sources or the mixing matrix. The use of these assumptions gives rise to matrix-based …

Figure 2.1: General Formulation of an Inverse Problem.

2.2 The Cocktail Party Problem

This work is focused on the Cocktail Party formulation of the BSS problem. This formulation is named after the auditory scenario that arises in multi-speaker listening environments. In a multi-speaker auditory scene characterized by competing audio sources, we wish to discover an unmixing matrix that, when applied, is able to isolate the independent audio sources that are present in the mixture.

The audio source separation problem can be characterized as described below. Suppose that there are N speakers or sources transmitting audio signals S, represented as an N × 1 column vector,

S = [ s1[n]  s2[n]  ...  sN[n] ]^T, (2.5)

and the observations at each listener or microphone (called observer from this point on) X are given by:

X = [ x1[n]  x2[n]  ...  xM[n] ]^T. (2.6)

Here, n denotes the time index in a discrete-time system model where samples have spacing Ts. The mixing environment (or mixing method) can be represented by the M × N mixing matrix A. We can represent the additive noise present with respect to each observer as:

E = [ ε1[n]  ε2[n]  ...  εM[n] ]^T, (2.7)

where E describes the time-varying additive noise present on each observer. The measurements captured by each observer can then be expressed as a linear system:

X = AS + E. (2.8)

Figure 2.2: Cocktail Party Problem with 2 Sources and 2 Observers.

In Figure 2.2, we represent the cocktail party problem as finding (or recovering) the audio from speakers S1 and S2 given a monaural mixture collected on a microphone (either X1 or X2). Assuming that the inverse exists, we can recover the original source vectors S by estimating a matrix operator W that can invert the mixing operator A:

Ŝ = WX ≈ S. (2.9)
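To make the linear mixing model concrete, the following Python sketch (NumPy only; the sample rate, source waveforms, and mixing coefficients are illustrative assumptions, not values used elsewhere in this thesis) constructs a two-source, two-observer instantaneous mixture per equation (2.8) and recovers the sources in the idealized case where A is known exactly.

    import numpy as np

    fs = 16000                                      # assumed sample rate in Hz
    n = np.arange(fs)                               # one second of samples
    s1 = np.sin(2 * np.pi * 220 * n / fs)           # source 1: 220 Hz tone
    s2 = np.sign(np.sin(2 * np.pi * 313 * n / fs))  # source 2: 313 Hz square wave
    S = np.vstack([s1, s2])                         # S is N x samples (N = 2 sources)

    A = np.array([[0.8, 0.3],                       # arbitrary M x N mixing matrix A
                  [0.4, 0.7]])
    E = 0.01 * np.random.randn(2, S.shape[1])       # additive observer noise E
    X = A @ S + E                                   # observations X = AS + E (equation 2.8)

    # With A known and invertible, W = A^-1 recovers the sources up to the noise term.
    W = np.linalg.inv(A)
    S_hat = W @ X                                   # estimate of the original sources

In practice A is unknown, which is precisely what makes the problem blind; the remainder of this chapter concerns ways of estimating an unmixing operator from X alone.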

2.2.1 Principal Components Analysis (PCA) for Audio Source Separation

PCA is a foundational technique in machine learning that also finds relevancy in the discussion of audio source separation. Unlike frequency domain approaches, PCA is statistical in that it relies on the vector space properties of the input data in order for the technique to work.

The goal of PCA is to transform an input into an alternate vector space representation that eliminates as many redundant parameters as possible while also retaining as much unique information relevant to the signal as possible. PCA achieves this elimination of redundancy by projecting the data onto a new set of axes that satisfy a set of constraints [13]. Mathematically, it is found that this projection is one that maximizes the variance of the variables (thus maximizing the linear independence).

PCA builds off the principles of the previously formulated cocktail party problem, but additional constraints are imposed because of the underlying linearity assumptions of the PCA algorithm. The first constraint is that mixing of the source vectors with the mixing operator must happen instantaneously. This assumption can be approximated for cases where microphones are close to the sources with respect to the speed of sound. The second assumption is that the mixing operator must be linear. This means that convolution-based mixing operators, such as room reverberation, violate the assumptions of PCA.

Mathematically, we establish the case for PCA as follows. Suppose that we have a zero-mean, discrete-time stochastic process vector, i.e.,

x[n] = [ x1[n]  x2[n]  ...  xN[n] ]^T, with E{x[n]} = 0, (2.10)

Cx = E{x[n] x[n]^T}, (2.11)

where the E operator denotes the expected value. The stochastic process vector has a covariance matrix given by Cx. In PCA, we wish to identify the (in)dependence structure in all dimensions and discover an orthogonal transformation matrix W of size L × N, mapping RN to RL, where L < N, such that the L-dimensional output vector y[n] = Wx[n] captures the essence of the input vector and where the covariance matrix of the output vector, Cy, is a diagonal matrix D with elements arranged in decreasing order [1]. The stochastic process vector can then be reconstructed according to:

x̂[n] = W^T W x[n]. (2.12)

The objective of PCA is to find an optimal value of the orthogonal matrix W, denoted W̃, such that the reconstruction error J is minimized, where

J = E{ ‖x[n] − x̂[n]‖ }. (2.13)

The rows of the transform matrix W̃ are called the principal components of the stochastic process vector. These principal components, which are also the solution to the aforementioned optimization problem, are given by the eigenvectors of the covariance matrix Cx. The subspace that is spanned by the eigenvectors forms the PCA subspace.

PCA acts as a dimensionality reduction operation and finds use in data reduction, reducing the number of independent variables that are under consideration when fitting a model (thus acting as a regularizer to prevent overfitting). PCA provides a transformation that maps the observation space to m bases of ascending importance [13].

One of the commonplace algorithms for calculating the eigenvalues and eigenvectors for PCA is the use of Singular Value Decomposition (SVD). To perform SVD, we first multiply the observation vector by a matrix containing the eigenvectors of the covariance matrix. We then multiply the previous result with the diagonal matrix containing the inverse of the square root of the corresponding eigenvalues. That is,

SVD := DVX, (2.14)

where

D = diag( 1/√d1  1/√d2  ...  1/√dN ), (2.15)

the elements d1, ..., dN of the diagonal matrix are the eigenvalues of the covariance matrix, and

V = [ e1  e2  ...  eN ]^T, (2.16)

where the elements of the vector V are the eigenvectors of the covariance matrix. Observation of the algorithm for SVD reveals a striking concern: computing eigenvalues and eigenvectors is computationally intensive. Furthermore, because PCA is statistical, an input vector of sufficient length is needed in order to get a decomposition that is representative of the source data. The sufficient length comes about because time averages replace statistical expectations when the covariance matrix is estimated from data.
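As a point of reference, the decorrelation step described above can be written in a few lines of NumPy; the sketch below is a minimal illustration, assuming the observations are stored as a channels x samples array and that the covariance is estimated with a time average.

    import numpy as np

    def pca_whiten(X):
        """Decorrelate and unit-variance scale the rows of X (channels x samples)."""
        X = X - X.mean(axis=1, keepdims=True)       # enforce zero mean per channel
        Cx = (X @ X.T) / X.shape[1]                 # time-average estimate of E{x x^T}
        d, V = np.linalg.eigh(Cx)                   # eigenvalues d, eigenvectors V (columns)
        D = np.diag(1.0 / np.sqrt(d))               # diag(1/sqrt(d_i)), cf. equation (2.15)
        return D @ V.T @ X                          # uncorrelated, unit-variance observations

    Y = pca_whiten(np.random.randn(2, 1000))        # example call on random data
    print(np.round((Y @ Y.T) / Y.shape[1], 2))      # approximately the identity matrix

As the following paragraphs show, this decorrelation alone is not enough to separate speech mixtures.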

Even if computational overhead were not a concern, we find an issue with PCA when we attempt to use the algorithm on speech mixtures. Because PCA performs dimensionality reduction by discovering a subspace that maximizes the variance [13], we would expect that PCA would allow for audio source separation for a frame of interest. Unfortunately, analysis reveals that this is not the case.

Figure 2.3: PCA fails to align the Principal Axes of Non-Gaussian Mixtures.

After applying PCA, we find that the signals are uncorrelated, but are not aligned with the principal axes, implying that the signals were unable to be separated. This is because we cannot uniquely identify non-Gaussian signals using second-order statistics [12]. To see this, consider the scatter plots in Figure 2.3. The left plot shows the observation basis that is estimated from the eigenvectors of the covariance matrix for a non-Gaussian mixture. The right-hand plot shows the result of representing the mixture in the basis estimated by PCA. After representing the data in this basis, we see that the data is rotated, but is not aligned with the principal axes u1 and u2. This misalignment indicates co-dependence within the distribution. This demonstrates a crucial finding for PCA: statistical decorrelation is not a sufficient condition for source separation.

Figure 2.4: y = abs(x) is an example of a function that is uncorrelated but dependent.

Statistical independence is required for source separation. Mathematically, we find that decorrelation does not imply independence [12], i.e. uncorrelated random variables are not independent unless Gaussian. To see that decorrelation does not imply independence, consider the absolute value function, y = abs(x) (Figure 2.4). Because the relationship between x and y = abs(x) is not linear over the full domain (its slope changes sign at x = 0), x and y are uncorrelated. However, the variables are dependent because the needed property of probabilistic independence is violated: by knowing a value of x, we can predict with absolute certainty what the value of y will be.

This is not to say that PCA does not have merit in the context of BSS: when the observations contain high levels of noise, PCA can be used as a preprocessing step prior to the application of other source separation techniques. The use of PCA in this way yields …

2.2.2 Independent Components Analysis (ICA) for Audio Source Separation

Independent Component Analysis (ICA), like PCA, is a special case of the BSS problem. Unlike the general formulation, ICA assumes that the source signals present in the mixing environment are statistically independent (rather than simply uncorrelated). Source vectors S = (s1, s2, ..., sN)^T are deemed independent if their joint probability is equal to the product of their probabilities. That is,

P(S) = P(s1, s2, ..., sN) = P(s1)P(s2)...P(sN), (2.17)

where the P operator represents the probability density. ICA also imposes the constraint that at most one of the source vectors can have Gaussian statistics [12]. This condition should be satisfied for all practical human speech separation tasks of interest.

There are many approaches to ICA, ranging from neural networks to statistical signal processing. Instead of rigorously addressing all formulations, we instead describe the concept of ICA in moderate detail. Further detail about additional approaches for ICA can be found in [12]. Given the aim of this thesis, the ICA formulations discussed in this section are chosen because of their relevancy to the cocktail party source separation problem.

In the following sections, we will establish the general formulation for ICA. While many ICA algorithms have been developed over the years, including formulations for Maximum Likelihood Estimation (MLE), entropy maximization, and maximization of non-Gaussianity, analysis in this thesis will be conducted with respect to FastICA [16], an algorithm that uses the non-Gaussianity formulation of ICA. If all assumptions are met, the … The goal of ICA is to maximize the statistical independence of the resultant output vectors using either the relative entropy (also called the Kullback-Leibler divergence), maximum likelihood, or maximization of non-Gaussianity [17]. While the ICA algorithm has historical pedigree in source separation, we note that the ICA algorithm imposes permutation and scale ambiguities on the source estimates. The permutation ambiguity means that we cannot control the order of the source estimates [12]. The scaling ambiguity says that, because the mixing matrix and the original source vectors are not known a-priori, scalar factors are lost in the mixing process [12].

Because ICA is predicated on the non-Gaussianity of the inputs, it is important to understand why this is the case. Consider an n-dimensional, Gaussian-distributed random vector x, e.g.,

x = [ x1  x2  ...  xn ]^T. (2.18)

The Gaussian distribution is unique in that all orthogonal projections of x will have the same probability distribution [18]. In other words, because the Gaussian process vector has a symmetric covariance matrix, we cannot discover unique orthogonal transformations of the Gaussian distribution. Consequently, we cannot discover unique sources in a mixing matrix A if all of the underlying sources are Gaussian-distributed.

The Central Limit Theorem (CLT) plays a large part in ICA techniques too. The CLT shows that the sum of independent random variables, irrespective of the underlying distribution, tends to a Gaussian distribution. If we instead assume that all source vectors …

An independent component y can be represented by a linear combination of the estimated source vectors xi and an estimated mixing matrix W. That is,

y = W^T x = Σi wi xi. (2.19)

Here, y denotes an independent component. In practice, the true mixing matrix A is not known, so we must simultaneously estimate W and x. To approximate these terms, we make the following change of variables:

z = A^T W, (2.20)

then

y = W^T x = W^T A s = z^T s. (2.21)

The above shows that y is a linear combination of the sources si, with weights given by zi [18]. By the CLT, we know that a sum of random variables tends to a Gaussian distribution. Consequently, z^T s is more Gaussian than the underlying source vectors si. Conversely, z^T s is least Gaussian when it is equal to one of the source signals [18]. The independent components can thus be discovered by maximizing the non-Gaussianity of w^T x. To discover all of the independent components, we search the optimization surface for all local maxima. The optimization surface in the n-dimensional space of vectors w contains 2n local maxima, two for each source vector [18]. These local maxima map to the source vectors. The FastICA algorithm applies an orthogonalization in the PCA subspace that corresponds to constraining the search space to a region where uncorrelated estimates are in close proximity [18].

Figure 2.5: ICA is able to align Non-Gaussian mixtures with the Principal Axes.

We demonstrate the utility of ICA in Figure 2.5. Unlike PCA, after projecting the non-Gaussian mixture into the ICA subspace, we see that the data is properly aligned with the principal axes. Alignment with the principal axes is indicative of successfully resolving the mixture into its independent components.
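For completeness, the kind of result shown in Figure 2.5 can be reproduced with an off-the-shelf FastICA implementation; the sketch below uses scikit-learn (the library choice, source waveforms, and mixing matrix are illustrative assumptions rather than part of this thesis' pipeline) and recovers the sources only up to the permutation and scaling ambiguities noted above.

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 1, 8000)
    s1 = np.sin(2 * np.pi * 5 * t)                  # non-Gaussian source 1
    s2 = np.sign(np.sin(2 * np.pi * 3 * t))         # non-Gaussian source 2
    S = np.c_[s1, s2]                               # samples x sources

    A = np.array([[1.0, 0.5],
                  [0.4, 1.0]])                      # "unknown" mixing matrix
    X = S @ A.T                                     # observed mixtures (samples x observers)

    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X)                    # estimated sources (order and scale ambiguous)
    W_est = ica.components_                         # estimated unmixing matrix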

2.2.3 Constraints of PCA and ICA for Ultra-Low Power Device Integration

The cornerstone issue underpinning the viability of either ICA or PCA in a real-time context, despite their ability to blindly separate mixtures, is the exceedingly long processing frames needed to produce suitable statistics. Frames 9-10 seconds in duration were used in Mitianoudis 2004 [12], which is well beyond the processing timelines that are needed for real-time audio processing applications. Latency in the neighborhood of 20-30 ms …

2.2.4 Computational Auditory Scene Analysis (CASA)

CASA refers to algorithms that attempt to mimic the behavior of the human auditory system. Classical research on auditory scene analysis shows that deconstruction of an auditory scene is a two-step process [3]. In the first stage, elements or features are extracted from the auditory scene. These elements are representative of key elements within the environment. In the second stage, we combine elements that are likely to be paired with the same acoustic source, such as a human speaker. Relevant features may be derived using heuristic-based approaches or through unsupervised means such as neural networks [3]. Once these features have been discovered, we can invert the auditory representation in order to recover the original source vectors.

Time-frequency masking is a key concept used in some CASA systems to tag source signals in an acoustic mixture. Time-frequency masking, in the context of CASA, is inspired by acoustic masking in the human auditory system, in which stronger signals overpower weaker signals that are in the same (or nearly the same) time-frequency region of interest [3]. In this schema, high-weight values are assigned to dominant sources, while lower weights are applied to weaker sources. By applying this mask to the time-frequency representation and then inverting the representation, we can effectively recover the source signals [3].

Construction of these time-frequency masks requires that the clean speech signals are known a-priori. Using the clean speech signal as a reference, we find time-frequency regions of the mixture where the energy is within 3 dB of the clean speech signal. These regions are assigned a value of 1 if they exceed the energy threshold, and 0 otherwise. If we have a time-frequency representation of the speech signals, where t represents the time axis and f represents the frequency axis, then the ideal binary mask m(t, f) for source signal s(t, f) with respect to a competing noise signal n(t, f) can be expressed as:

m(t, f) = 1 if s(t, f) > n(t, f), and 0 otherwise. (2.22)

Creating such masks without a-priori knowledge of the source signals is a tenet of the modeling approaches employed in CASA systems. As mentioned earlier, CASA systems typically use a multi-stage processing scheme. The first stage typically involves the use of time-frequency analysis. Many CASA systems use biologically-inspired operations, but Short-Time Fourier Transforms (STFTs) or wavelets can be used as well [3]. In a biologically inspired CASA system, a bank of bandpass filters is used to simulate the effect of frequencies being associated with various locations on the basilar membrane [3], and a gammatone filter bank is used to simulate the impulse responses associated with the nerve fibers in the human auditory system [3]. The gammatone filter bank has a continuous-time impulse response of the form:

gi(t) = t^(n−1) exp(−2π bi t) cos(2π fi t + φi) u(t), (1 ≤ i ≤ N), (2.23)

where N is the number of filter channels, n is the filter order, t is the time in seconds, and bi, fi, and φi are the bandwidth, center frequency, and phase of the i-th filter, respectively; u(t) is the unit step function.

The parameters of these filters are tuned to match a specified psycho-acoustic environment. In the human auditory system, it is found that the bandwidth of the bandpass filters increases non-linearly with center frequency, as described by the mel-scale [20]. For human subjects, it is found that the equivalent rectangular bandwidth (ERB) is equal to:

ERB(f) = 24.7(4.37f/1000 + 1), (2.24)

where f is the frequency in Hz and ERB is in Hz.
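A direct implementation of equation (2.23) is straightforward; the sketch below is a minimal illustration in NumPy, where the channel count, sample rate, and the common choice bi = 1.019 ERB(fi) are assumptions rather than values taken from [20] or from this thesis.

    import numpy as np

    def gammatone_ir(fc, fs, n_order=4, duration=0.05, phase=0.0):
        """Sampled gammatone impulse response of equation (2.23) for center frequency fc."""
        t = np.arange(0.0, duration, 1.0 / fs)       # t >= 0, so u(t) = 1 everywhere
        erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)      # equation (2.24), fc in Hz
        b = 1.019 * erb                              # assumed bandwidth scaling
        g = t ** (n_order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)
        return g / np.max(np.abs(g))                 # normalize the peak amplitude

    fs = 16000
    center_freqs = np.geomspace(80, 6000, num=32)    # 32 channels (assumed spacing)
    filter_bank = [gammatone_ir(fc, fs) for fc in center_freqs]

Filtering a mixture with each impulse response (e.g. via np.convolve) yields the multi-channel representation that the grouping stage described next operates on.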

Once filtered, the next stage of the CASA system uses the time-frequency information in order to identify and group relevant features. The fundamental frequency is one of the most important features for such grouping. In the human auditory system, humans are able to use differences in fundamental frequencies in order to cluster the harmonics of one source and segregate the interferer. In order to identify the fundamental frequency, the correlogram is used [3]. A correlogram can be interpreted as a multi-channel autocorrelation, where we compute the autocorrelation for each channel of our filter bank:

A(t, l, τ) = Σ_{n=0}^{N−1} h(t − n, l) h(t − n − τ, l) w(n), (2.25)

where h(t, l) represents the filter response for channel l at time t, τ is the autocorrelation lag, and w is a window function of N samples. Fundamental frequencies form peaks in the correlogram which can then be found using standard peak detection algorithms. Once these fundamental frequencies have been identified, pitch tracking algorithms can then be applied; several such algorithms have been proposed, including Hidden Markov Models (HMMs) and Particle Filter based approaches [3].
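Equation (2.25) is simply a windowed autocorrelation evaluated independently in every filter-bank channel. A minimal sketch is shown below, assuming the filter-bank outputs are stored as the rows of a channels x samples array and that t is large enough for all indices to be valid.

    import numpy as np

    def correlogram_frame(h, t, frame_len, max_lag):
        """A(t, l, tau) of equation (2.25) for every channel l and lag tau at time index t."""
        w = np.hanning(frame_len)                          # window function w(n)
        A = np.zeros((h.shape[0], max_lag))
        for l in range(h.shape[0]):
            seg = h[l, t - frame_len + 1:t + 1] * w        # h(t - n, l) w(n), n = 0..N-1
            for tau in range(max_lag):
                lagged = h[l, t - frame_len + 1 - tau:t + 1 - tau]   # h(t - n - tau, l)
                A[l, tau] = np.sum(seg * lagged)
        return A

    # Fundamental-frequency candidates appear as peaks along the lag axis of the
    # pooled correlogram, i.e. A.sum(axis=0), as in equation (2.32) later in this section.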

Though these models are functionally robust, we avoid them in the context of this thesis because of the associated computational and memory overheads [2][3]. The neural oscillator network approach, though much simpler than HMMs and Particle Filters, is successful at re-synthesizing both speech vectors and noise vectors in mixtures [2]. The neural oscillator network is an oscillation-based network that is used to perform auditory grouping [20]. The neural oscillator network proposed in Wang 1999 [20] forms a perceptual stream if the oscillator neurons are phase-locked. This phase-locked stream is desynchronized from the oscillator neurons that map to different streams. When this grouping is learned, we can re-synthesize all sources by inverting the representation that is output from the neural network. Functionally, the oscillator neural network acts as an image segmentation algorithm.

The first layer of the network is a segmentation layer. Oscillator neurons in the network are defined as reciprocally connected excitatory and inhibitory variables xij and yij, respectively. Excitatory and inhibitory variables in the 2-dimensional grid are described by the equations,

ẋij = 3xij − xij^3 + 2 − yij + Iij + Sij + ρ, (2.26)

ẏij = ε(γ(1 + tanh(xij/β)) − yij). (2.27)

Here, Iij represents an external input to the oscillator neuron, Sij denotes coupling from other oscillators in the network, and ρ is the amplitude of a Gaussian noise term [20]. We choose ε to be a small number. If noise effects are ignored and the external input is constant, then ẋij = 0, called the x-nullcline, is a cubic function, and ẏij = 0, the so-called y-nullcline, is the sigmoid function S(x). Here, the sigmoid function is defined as:

S(x) = 1 / (1 + e^(−x)). (2.28)

When Iij > 0, the nullclines intersect at a single point. This configuration represents an oscillator with two time-scales. We choose the tuning parameter β to be a small value. The oscillator configuration yields a stable limit cycle for sufficiently small values of ε [20] and is considered enabled.

The limit cycle contains two distinct phases, referred to as silent and active. These two phases map to the left and right branches of the cubic function. Transition between these two states occurs on short time scales, with the γ parameter determining the relative times spent in each state. A larger γ parameter increases the amount of time spent in each phase. When Iij < 0, the nullclines intersect at a stable point and no oscillation occurs. In this state, the oscillator is considered excitable [20].

Sij represents segments formed in the first layer of the neural network. These segments are described mathematically as:

Sij = Σ_{kl ∈ N(i,j)} Wij,kl H(xkl − θx) − WZ H(z − θz). (2.29)

Here, Wij,kl is the connection weight from oscillator (i, j) to oscillator (k, l), and N(i, j) represents the set of nearest neighbors at the grid location (i, j). θx is a threshold value between the left and right branches of the cubic function. H is the unit step function. WZ is described as the weight of inhibition from the global inhibitor z [20], defined as:

ż = σ − z, (2.30)

where σ = 1 if xij ≥ θz for at least one oscillator and σ = 0 otherwise. A lateral potential is used for noise rejection purposes, and is described by the equation:

ṗij = (1 − pij) u( Σ_{kl ∈ Np(i,j)} u(xkl − θx) − θp ) − ε pij. (2.31)

The lateral potential is added as a gating term to the external excitatory input Iij. If the activity of each neighbor k ∈ N1(i) is greater than the threshold θx, then the outer unit step function will be equal to unity and the oscillator will accumulate potential. Oscillators that accumulate enough potential are designated leaders. Followers are neighboring oscillators that can transition phases (referred to in the model as jumping). Noisy fragments are unable to transition phases beyond a short period of time because they cannot become leaders or become followers because of a lack of nearest-neighbor stimulation (essentially becoming isolated pixels in a 2D representation of the layer). This scheme allows the network to reject noise artifacts. The strength of the external input is directly correlated to the peak magnitudes found in a given channel's autocorrelation function.

The consequence of this configuration is that an individual oscillator has no impact … in order to create the phase locking property that is fundamental to grouping. The second layer of the network is the grouping layer. We will touch on the grouping scheme at a high level; additional detail can be found in [20]. Recall the form of the correlogram from earlier. For each time frame, we can create a pooled correlogram by summing the channels across frequency. If we define the pooled correlogram as s(j, τ), where time-frame j and lag τ are the 2-D variables, then:

s(j, τ) = Σ_{i=1}^{N} A(i, j, τ). (2.32)

For each time frame, a fundamental frequency estimate from the pooled correlogram is used to classify the frequency channels. Channels are classified as either being consistent with the fundamental frequency P, or otherwise inconsistent with the fundamental frequency (two categories). For a given delay τM corresponding to the peak magnitude of the pooled correlogram for a given channel i at time frame j, we tag the channels where:

A(i, j, τM) / A(i, j, 0) > θd. (2.33)

Here, A(i, j, 0) denotes the energy in channel i at time frame j. The resulting energy-based tagging is analogous to the time-frequency masking approach mentioned earlier. Analysis conducted in [3] and [2] highlights the flaws in this method, largely that CASA techniques are sensitive to the structure of the source vectors. This sensitivity highlights differences between ICA and CASA based methods for source separation.

Like ICA, CASA is used for source separation, but the two are designed to address different classes of the BSS problem. This difference introduces the potential for variability in separation performance due to dependencies on the structure of the source vectors. The CASA system well-separates signals that are localized in time-frequency, but poorly separates signals that are broadband in nature (such as human speech). The opposite finding is true for ICA, where the JADE ICA algorithm yielded degraded performance compared to CASA when the interferer was tonal or narrowband. This degradation is attributed to the comparatively poor higher-order statistics in these scenarios [3]. These findings indicate that ICA and CASA are in fact complementary methods, and neither is well-suited to separating arbitrary sources. Ultimately, because we are interested in broadband separation (as in human speech mixtures encountered in the cocktail party problem), the CASA approach is not a viable source separation technique for our application.

2.2.5 Time-Frequency Masking

Time-frequency masking, proposed by Yilmaz and Rickard in 2004 [21], solves the Blind Source Separation problem by exploiting the assumption of separability of mixed speech signals when transformed to the time-frequency domain. This finding is attributed to the approximate W-disjoint orthogonality of speech [7]. It is shown that when source vectors are W-disjoint orthogonal, we can fully recover the source signal in a mixture.

Two signals s1 and s2 are said to be W-disjoint orthogonal if for a given window function W(t), the supports of the windowed Fourier transforms are disjoint [21]. The windowed Fourier transform of sj is given by:

F^W(sj(·))(ω, τ) = (1/√(2π)) ∫ W(t − τ) sj(t) e^(−jωt) dt, (2.34)

where the integral is taken over −∞ < t < ∞. If we refer to the windowed Fourier transform (equation (2.34)) as ŝj(ω, τ), then we achieve W-disjoint orthogonality if

ŝ1(ω, τ) ŝ2(ω, τ) = 0, ∀(ω, τ). (2.35)

Approximate W-disjoint orthogonality allows us to assume that time-frequency bins have little-to-no overlap, and that high-energy time-frequency bins can be attributed to a single speaker. Comparisons between mask-based source separators and recent algorithms utilizing statistical model-based speech enhancement and non-negative matrix factorization demonstrated better performance for the mask-based supervised learning techniques [6].

For each source, binary masking is used to partition the time-frequency frame of interest into a binary threshold map: bins assumed to be associated with the source are assigned a 1, and all other bins are assigned a map value of 0. Specifically, we tag time-frequency bins where the source of interest exceeds an SNR threshold α as a 1, and a 0 otherwise. The ideal binary mask (IBM) is given by:

IBM = 1 if SNR(t, f) > α, and 0 otherwise. (2.36)

Alternatively, we can consider the Ideal Ratio Mask (IRM) [6] to preserve relative power ratios instead of applying a hard threshold:

IRM = s^2(t, f) / (s^2(t, f) + N^2(t, f)), (2.37)

where s^2(t, f) is the speech energy and N^2(t, f) is the noise energy.
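The sketch below computes both masks for a known speech/noise pair and applies one to a mixture; it uses SciPy's STFT helpers, and the frame length, sample rate, and 0 dB threshold are illustrative assumptions rather than the parameters chosen in Chapter 3.

    import numpy as np
    from scipy.signal import stft, istft

    def ideal_masks(speech, noise, fs=16000, nperseg=512, alpha_db=0.0):
        """Return (IBM, IRM) from the STFTs of the clean speech and noise signals."""
        _, _, S = stft(speech, fs=fs, nperseg=nperseg)
        _, _, N = stft(noise, fs=fs, nperseg=nperseg)
        snr_db = 10 * np.log10((np.abs(S) ** 2 + 1e-12) / (np.abs(N) ** 2 + 1e-12))
        ibm = (snr_db > alpha_db).astype(np.float32)                       # equation (2.36)
        irm = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)   # equation (2.37)
        return ibm, irm

    def apply_mask(mixture, mask, fs=16000, nperseg=512):
        """Mask the mixture's STFT and invert back to a time-domain estimate."""
        _, _, X = stft(mixture, fs=fs, nperseg=nperseg)
        _, x_hat = istft(mask * X, fs=fs, nperseg=nperseg)
        return x_hat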

This screen, when applied to the time-frequency mixture and inverse transformed via the Inverse Short-Time Fourier Transform (ISTFT), returns an estimate of the elementary source vector. We see an example of this screen in Figure 2.6. The plots on the left-hand side of the figure show the spectrograms for a male and a female speech signal, respectively. After linearly combining the speech signals and taking the STFT, we obtain the mixed speech spectrogram. After applying the masking algorithm described by equation (2.36), we obtain the masked speech signal found in the bottom-right plot. The IRM has shown higher levels of intelligibility compared to the IBM, but this difference has been shown to be marginal [6]. Yilmaz and Rickard (2004) show that the ability to separate speech signals via binary masking is attributed to their approximately sparse nature: in the time-frequency domain the majority of the meaningful speech information is contained in a comparatively small number of Gabor coefficients; the other coefficients are approximately zero. The Gabor coefficients refer to the magnitude values of the Discrete STFT of a frame of speech data. In human speech, we see this behavior because energy is concentrated at frequencies that are multiples of the fundamental frequency [4]. Approximate sparsity allows signals in a mixture to be assumed W-disjoint orthogonal [21]. Approximate W-disjoint orthogonality is shown to be a strong enough condition to yield …

Figure 2.6: Time-Frequency Representation of a Speech Signal and its associated Hard Mask.

Advances in neural network technology, especially in regards to deployment on low-power microcontrollers (a target device for this thesis), make the use of supervised learning techniques a clear choice for investigating the viability of real-time source separation on constrained devices. Further detail on neural networks and TensorFlow (Google's platform to facilitate neural network deployment) is provided in the subsequent sections of this chapter.

2.3 Artificial Neural Networks

In this section, we provide an overview of Artificial Neural Networks (ANNs). The focus of this section is on concepts directly applicable to the scope of this thesis. The mathematical derivation of the feedforward neural network discussed in this section is largely based on Hastie 2009 [14]. The generic feed-forward neural network topology can be seen in Figure 2.7.

ANNs are a class of supervised and unsupervised machine learning methods that attempt to learn a nonlinear mapping from a set of inputs, referred to as features, to a series of outputs. ANNs can be used for both regression and classification problems.

Figure 2.7: Generic Feedforward Neural Network Topology.

Mathematically, the learning of the weights associated with a neural network is actually the solving of a gradient descent style optimization problem. Neural networks use weight vectors, called neurons, and transfer functions, called activation functions, in order to describe linear and non-linear relationships between variables. An optional biasing term b can also be included. We can then express the output y as:

y = σ(WX + b), (2.38)

where σ is the activation function, W is the weight vector, and b is the biasing term. Many activation functions have been used historically, but most neural networks employ the sigmoid function (Section 2.2.4) and the Rectified Linear Unit (ReLU) function σ(x):

ReLU := σ(x) = x+ = max(0, x). (2.39)

In speech processing applications, it has been shown that the use of the rectified linear unit activation function leads to better generalization and more reliable training of deep neural networks when compared to the sigmoid function [22]. This is attributed to a favorable optimization surface that is approximately linear and convex, and because the internal representation learned by the neural network is more regularized, reducing the likelihood of overfitting [22]. As a result, we use the rectified linear unit for all neural network architectures.
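Equation (2.38) with a ReLU activation corresponds to a single densely connected layer; the short Keras sketch below (the layer width and input dimension are placeholders, not the architectures of Chapter 3) shows the forward computation.

    import tensorflow as tf

    layer = tf.keras.layers.Dense(units=128, activation="relu")   # y = relu(Wx + b)
    x = tf.random.normal([8, 257])        # a batch of 8 input frames (sizes assumed)
    y = layer(x)                          # W and b are created and applied on first call
    print(y.shape)                        # (8, 128)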

The feedforward neural network uses a learning rule to modify the weights of the neural network by optimizing with respect to a cost function (called the loss function). The cost function can vary depending on the class of problem that we wish to solve (e.g. regression vs. classification). In general, we find that classification problems use a cross-entropy loss and regression problems use a minimum mean squared error loss. In order to adjust the weights of the neural network, we present the neural network with a multitude of examples, called training data. Once the neural network weights have been initialized, we propagate training examples through the network and compare the result of the neural network to the true solution. To minimize the cost function, we perform gradient descent. In the context of ANNs, this is referred to as backpropagation. We use the backpropagation example from [14]. Consider the squared error loss function:

R(θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} (yik − fk(xi))^2. (2.40)

The gradients are given by:

∂Ri/∂βkm = −2(yik − fk(xi)) g′k(βk^T zi) zmi, (2.41)

∂Ri/∂αml = −2(yik − fk(xi)) g′k(βk^T zi) βkm σ′(αm^T xi) xil. (2.42)

The gradient descent update at the (r + 1) iteration is expressed as:

βkm^(r+1) = βkm^(r) − γr Σ_{i=1}^{N} ∂Ri/∂βkm^(r), (2.43)

αml^(r+1) = αml^(r) − γr Σ_{i=1}^{N} ∂Ri/∂αml^(r), (2.44)

where γr is the learning rate, a hyperparameter that determines the gradient descent step size. Then

∂Ri/∂βkm = δki zmi, (2.45)

∂Ri/∂αml = smi xil. (2.46)

Here, δki and smi are the errors from the model at the output layer and hidden layer, respectively [14]. These errors satisfy the equation

smi = σ′(αm^T xi) Σ_{k=1}^{K} βkm δki. (2.47)

Using these relations yields the equations used for backpropagation. In the forward pass, the weights are fixed and we propagate those fixed values through the network. For the backward pass, we compute the errors δki and then backpropagate these errors through the network to obtain smi. Once we have these errors, we can compute the gradient for the current update step. A similar scheme is used for other loss functions [14].
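In TensorFlow, the forward and backward passes sketched above are handled by automatic differentiation; the minimal example below (squared-error loss, a single hidden layer, and all sizes assumed for illustration) performs one gradient-descent weight update in the spirit of equations (2.43) and (2.44).

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(4),
    ])
    x = tf.random.normal([64, 16])                    # training batch (assumed sizes)
    y_true = tf.random.normal([64, 4])
    learning_rate = 1e-2                              # gamma_r in equation (2.43)

    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)              # forward pass with fixed weights
        loss = tf.reduce_mean(tf.square(y_true - y_pred))       # squared-error loss (2.40)
    grads = tape.gradient(loss, model.trainable_variables)      # backward pass
    for var, g in zip(model.trainable_variables, grads):
        var.assign_sub(learning_rate * g)             # gradient-descent update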

2.3.0.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs), proposed by LeCun et al. [23], address the usage of neural networks with image-class data. Though traditional multi-layer perceptrons are capable of handling image data (by flattening the 2D image into a column vector), the number of weights that need to be calculated quickly balloons due to the large number of inputs [23]. Additionally, such networks do not have translation or distortion invariance, compromising generalization power [23].

For speech and image processing applications, we can retain spatial and temporal correlation that is otherwise lost in fully-connected neural network topologies [23]. CNNs differ from standard feed-forward networks in that their hidden layers contain convolutional filters whose kernels are convolved with the input. The discrete convolution operation for a 2D input is defined as:

y[m, n] = h[m, n] ∗ x[m, n] = Σ_{j=−∞}^{∞} Σ_{i=−∞}^{∞} h[i, j] x[m − i, n − j], (2.48)

where x is the input signal, h is the 2-D impulse response, and y is the output [24].

During training, we learn the weights associated with each filter using the backpropagation scheme mentioned in the previous section.
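Equation (2.48) is realized in TensorFlow by a 2-D convolution layer whose kernel weights are learned during training; a small sketch follows (the kernel count, kernel size, and input shape are illustrative assumptions).

    import tensorflow as tf

    # Treat an STFT magnitude patch as a single-channel "image" (shape assumed)
    x = tf.random.normal([1, 64, 64, 1])              # batch, height, width, channels
    conv = tf.keras.layers.Conv2D(filters=16, kernel_size=3, padding="same",
                                  activation="relu")
    feature_maps = conv(x)                            # 16 learned 3x3 kernels convolved with x
    print(feature_maps.shape)                         # (1, 64, 64, 16)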

2.3.0.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are another popular class of neural networks. RNNs differ from the standard feedforward neural network in that they can incorporate context from other states when they make predictions (see Figure 2.8). A bidirectional neural network, for example, can use information from previous states and future states to make predictions. The general RNN applies a recurrence formula at each time step to estimate the state:

ht = fW(ht−1, xt), (2.49)

where ht is the state estimate, fW is a transforming function with parameters W, ht−1 is the previous state, and xt is the input vector.

This state, called a hidden vector, is then used to compute an output [25]. Consider the hidden vector for a language model as an example [25]:

ht = tanh(Whh ht−1 + Wxh xt), (2.50)

yt = Why ht. (2.51)

Figure 2.8: Generic Structure for a Recurrent Neural Network.
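The hand-rolled NumPy step below makes the recurrence of equations (2.49)-(2.51) explicit; the tanh nonlinearity and the matrix shapes are conventional choices following [25] and are not tied to any architecture used later in this thesis.

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h):
        """One application of h_t = f_W(h_{t-1}, x_t) followed by y_t = W_hy h_t."""
        h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)   # recurrence, cf. equation (2.49)
        y_t = W_hy @ h_t                                  # output, cf. equation (2.51)
        return h_t, y_t

    hidden, inputs, outputs = 8, 4, 3
    rng = np.random.default_rng(0)
    W_xh = rng.standard_normal((hidden, inputs))
    W_hh = rng.standard_normal((hidden, hidden))
    W_hy = rng.standard_normal((outputs, hidden))
    h = np.zeros(hidden)
    for x_t in rng.standard_normal((5, inputs)):          # unroll over 5 time steps
        h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, np.zeros(hidden))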

In general, we find that backpropagation for RNNs is similar to the case of standard feedforward networks, though various permutations, such as backpropagation computed over truncated chunks of the sequence, do exist. Variants of the RNN where states can go stale over time also exist (called Long Short Term Memory networks) [25]. We do not use … deep neural networks. In these cases, the states take the form of frames of the STFT. These approaches are described broadly in the next section.

2.3.1 Neural Network Approaches to Blind Source Separation

The Source Separation problem has traditionally been solved using ICA. ICA solves the source separation problem by discovering a basis that describes N linearly independent source vectors. Solving the time-frequency masking problem using supervised and unsupervised learning techniques has gained traction in the machine learning community. In these supervised learning schemes, features are extracted from the mixture and used to create an algorithm that learns to compute the mask from the mixture [6].

Deep Learning approaches using Recurrent Neural Networks have been proposed for source separation, and produce results consistent with the ideal binary mask, but assume knowledge of the number of source signals a-priori (Weninger et al. (2014) and Huang et al. (2015)).

Xie et al. (2016) proposed the Deep Embedded Clustering (DEC) algorithm [26], an unsupervised clustering algorithm that bins data by non-linearly embedding the otherwise high-dimensional mixed signal in a low-dimensional latent space Z. The DEC algorithm then uses a Deep Neural Network (DNN) to estimate the parameters of the transformation function R that embeds the mixed signal in Z.

In Hershey et al. (2016) [27], it was shown that the DEC algorithm was extensible to the Blind Source Separation problem. Application of DEC allowed for self-discovery of three unknown speakers in an additive signal. Time-frequency masking is used to perform the separation: a binary mask is estimated for each source determined to be in the mixture. These binary masks, when multiplied by the frame's spectrogram and passed through the ISTFT, produce the N source vectors that comprised the mixture. A Bidirectional Recurrent Neural Network (BRNN) is used to assign an embedding to each bin of the spectrogram.

In Kolbaek et al. 2017 [28] and Fan et al. (2018) [29], an end-to-end training framework for DEC-enabled Source Separation, called utterance-level permutation invariant training (uPIT), is proposed. There are two estimation objectives in this framework: firstly, we wish to blindly (i.e. without a-priori knowledge) determine the number of unique sources present in the mixed signal. In addition, we seek the spectral mask estimate for each identified source. The uPIT framework allows the estimation of both parameters to be made jointly [30] [31].

In Hershey et al. 2016 [27], Deep Embedded Clustering and binary masking are applied to the Blind Source Separation problem. Deep Embedded Clustering is shown to accurately cluster sources in an unknown, mixed audio signal. Kolbaek et al. 2017 [28] and Chen et al. 2017 [32] propose permutation-invariant neural networks to address potential pitfalls in the method proposed in Hershey et al. 2016 [27], namely the fact that the Deep Clustering algorithm is not permutation-invariant. This lack of permutation-invariance arises because the algorithm's objective function is defined with respect to the mapped sources in the low-dimensional latent space, rather than the unmapped sources in the initial basis. Consequently, the order in which bins are processed will affect how the source … Both works adopt a Permutation-Invariant Training (PIT) framework to allow for end-to-end mapping, but there are slight differences in how permutation-invariance is achieved. Kolbaek et al. 2017 [28] achieve permutation invariance by implementing a Bidirectional Recurrent Neural Network (BRNN). Chen et al. 2017 [32] implement a novel neural network architecture, called an Attractor Neural Network. Once the time-frequency data has been transformed to the low-dimensional subspace, reference points are defined for each source, and time-frequency bins belonging to that source will become attracted to it.

2.4 TensorFlow

TensorFlow is a framework developed by Google for solving large-scale machine learning problems. TensorFlow is based on the notion of a computational graph, where all operations are expressed as a set of nodes within the graph. This graph describes the execution flow of a processing pipeline. The computational graph seen in Figure 2.9 represents the pipeline for a feedforward neural network. TensorFlow can handle arbitrarily sized tensors so long as operations on the defined tensors are mathematically valid. Tensors of different types, including string, float, and int, are supported, and calculations can be performed at varying levels of precision, including single and double precision. Operations in TensorFlow are encapsulated in blocks called kernels. Kernels are written for execution on target hardware such as CPUs or GPUs [8].
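As a brief illustration of the computational-graph idea described above, the sketch below traces a small feedforward network with tf.function and lists the operations (nodes) in the resulting graph. The layer sizes and input shape are arbitrary placeholders.

    import tensorflow as tf

    # A small feedforward network; tracing it with tf.function builds a
    # computational graph in which each operation becomes a node.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])

    @tf.function
    def forward(x):
        return model(x)

    # Trace once with a fixed input signature, then list the first few graph nodes.
    concrete = forward.get_concrete_function(tf.TensorSpec([None, 8], tf.float32))
    for op in concrete.graph.get_operations()[:10]:
        print(op.type, op.name)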

The use of TensorFlow as a data processing pipeline was mandatory because of the size of the dataset used in this thesis. TensorFlow supports a queuing system while training: data is pushed into device memory as it becomes available, and popped from device memory when processing is complete. This allowed for the creation of a batch-processing pipeline for the audio data of interest. This batch processing pipeline is described in detail in Chapter 3.

Figure 2.9: An example of a Computational Graph.

2.4.1 TensorFlow Lite for Microcontrollers

TensorFlow Lite for Microcontrollers (TFMicro) is an interpreter designed for deploying TensorFlow models to embedded systems. The basic operating principle of TFMicro is that we wish to port a model that we have trained and developed with the full TensorFlow framework with minimal changes. TensorFlow Lite is used to convert a trained model into a compact flatbuffer data structure; this data structure is recognized by the TFMicro interpreter and is used to execute the model on the target device [9].
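A minimal sketch of this conversion step, assuming a placeholder Keras model in place of a trained BSS network, is shown below. Enabling the converter's default optimizations applies post-training quantization; full integer quantization additionally requires a representative dataset, which is omitted here.

    import tensorflow as tf

    # Placeholder model standing in for a trained mask-estimation network.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(257,)),
        tf.keras.layers.Dense(257, activation="sigmoid"),
    ])

    # Convert the Keras model to the TensorFlow Lite flatbuffer consumed by the
    # TFMicro interpreter; Optimize.DEFAULT enables post-training quantization.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("model.tflite", "wb") as f:
        f.write(tflite_model)
    print(f"{len(tflite_model)} bytes")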

In order to fully port a model, every kernel it uses must have an equivalent kernel in TFMicro. Because TFMicro is still early in development, a sizable number of TensorFlow kernels are not yet implemented [9].

The vast majority of TensorFlow kernels use floating point operations. This does not cause problems when training TensorFlow models on high-performance servers. When running on resource-constrained devices, where memory ceilings can be on the order of kilobytes, avoiding floating point operations becomes a necessity. TFMicro offers specialized support for certain platforms in the form of highly optimized processing kernels. The ARM CMSIS-NN library, a highly optimized neural network library for ARM Cortex M processors, is utilized. The use of these accelerated kernels yields substantial speed-ups compared to the reference kernels [9]. Because the Arduino Nano 33 BLE Sense and i.MX-RT600 microcontrollers both use ARM Cortex M processors, TFMicro will automatically utilize these optimized kernels for deployed neural networks [9]. The TFMicro interpreter is heavily leveraged in order to deploy neural architectures onto the target devices.

2.5 Real-Time Signal Processing Considerations on Embedded Devices

A processing constraint that we must address in this thesis is the need for the BSS algorithm to run in real time on the target device. A key consequence of performing signal processing on an embedded platform is the large reduction in compute capability. Compared to mobile platforms, an embedded device has at least 100-1000x less compute power [9]. This reduction in compute power, especially in the context of real-time computing, means that the problem size at any given instant needs to be greatly reduced to accommodate the platform.

For real-time processing applications, we utilize a frame-based processing scheme to

handle data in a packet-like format. The binary masking technique employed in this thesis

passes an STFT frame into a feed-forward neural network for mask inference. Note that the STFT output is a matrix,

X(f) = \left[\, X_1(f) \;\; X_2(f) \;\; X_3(f) \;\; \cdots \;\; X_k(f) \,\right], \quad (2.52)

where the m-th element of the matrix is given by

X_m(f) = \sum_{n=-\infty}^{\infty} x(n)\, g(n - mR)\, e^{-j 2\pi f n}, \quad (2.53)

where g(n) is the window function of length M, X_m(f) is the Discrete Fourier Transform (DFT) of the windowed data centered about time mR, and R is the hop size between successive DFTs [33].
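A direct (if unoptimized) evaluation of Eq. (2.53) over a fixed-size packet can be sketched as follows. The packet length, window length M, hop size R, and Hann window are placeholder choices; the per-frame DFTs are computed relative to the start of each frame, so the absolute-time phase term in Eq. (2.53) is omitted, as is conventional.

    import numpy as np

    def stft_packet(x, M=256, R=128):
        # x: fixed-size packet of samples, M: window length, R: hop size.
        # Row m of the result is X_m(f), the DFT of the windowed data starting
        # at sample m*R (Eq. 2.52 collects these rows into the matrix X(f)).
        g = np.hanning(M)
        frames = [x[m * R: m * R + M] * g
                  for m in range((len(x) - M) // R + 1)]
        return np.fft.rfft(np.array(frames), axis=-1)

    fs = 16000
    packet = np.random.randn(fs // 4)  # a 250 ms packet of placeholder samples
    X = stft_packet(packet)
    print(X.shape)  # (number of frames k, M // 2 + 1 frequency bins)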

In the real-time context, we cannot perform this summation over all time. Instead, we evaluate the STFT on a fixed-size packet. The size of the tensor for a packet depends on the sampling rate, the hop size, and the FFT size used in each window. We can optimize these parameters to reduce the amount of data processed per packet while retaining what is needed for intelligible speech in the reconstruction stage. For example, decreasing the sampling rate reduces the amount of data that we feed through our neural network at any one time, reducing the latency and increasing the throughput of the model. Because the ideal binary mask can be easily constructed, we can evaluate these trade-offs before the neural network is trained. To be deemed real-time, the neural network must be capable of processing STFT frames faster than it receives them. This is discussed in more detail in the chapters that follow.
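The real-time criterion can be checked on the development machine with a simple timing loop: the average inference time per frame is compared against the frame arrival period R/fs. The placeholder network and dimensions below are illustrative assumptions, and host-side timings only bound the problem; the definitive measurement must be made on the target microcontroller.

    import time
    import tensorflow as tf

    fs, hop = 16000, 128
    frame_period = hop / fs  # seconds between successive STFT frames

    # Placeholder mask-estimation network: one 129-bin spectral frame in, one mask out.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(129,)),
        tf.keras.layers.Dense(129, activation="sigmoid"),
    ])

    frame = tf.random.normal([1, 129])
    model(frame)  # warm-up call so one-time setup is not timed

    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(frame)
    latency = (time.perf_counter() - start) / runs

    print(f"per-frame latency: {latency * 1e3:.2f} ms, budget: {frame_period * 1e3:.2f} ms")
    print("meets the real-time criterion" if latency < frame_period else "too slow")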


TECHNICAL APPROACH

This chapter describes the technical approach used to address the topic of real-time BSS on a low-power microcontroller. We outline the initial audio dataset; describe the TensorFlow audio processing pipeline used to create the mixture speech dataset needed for binary mask estimation; describe the process used to select suitable STFT parameters for neural network input; detail the hardware used, the neural network architectures evaluated, and the compression techniques used to shrink the networks; and address the methods used to evaluate latency and power usage.

The Google Colaboratory cloud environment is used for neural network development and training. This environment was chosen because of access to high-end GPUs that could be used to accelerate training. The TensorFlow audio processing pipeline, which is described in the following sections, was also run in this environment.

Table 3.1: Google Colaboratory Specifications

    Number of CPUs      2
    CPU Model Name      Intel(R) Xeon(R) CPU @ 2.00GHz
    Cache Size (KB)     39424
    CPU Speed (MHz)     2000
    Number of GPUs      1
    GPU Model Name      NVIDIA Tesla V100-SXM2
    GPU VRAM            16 GB

3.1 Audio Dataset

Two datasets were considered for this research: the LibriSpeech ASR corpus and the Mozilla Common Voice dataset.

The Mozilla Common Voice dataset is a crowdsourced dataset containing approximately 1400 hours of speech in multiple languages, but it was deemed inappropriate for neural network training in this work. One of the core assumptions made when developing the neural network model is that all two-speaker speech mixtures were derived from different speakers and that male and female speakers could be identified a-priori. Because the Mozilla Common Voice dataset is crowdsourced, all data, including the audio files themselves and metadata regarding the speech (e.g., age, gender, accent), are manually input by users. A rating system is used to gauge the accuracy of labels in the dataset. Assessment of the audio data revealed many instances where a speaker's gender is misclassified. Additionally, there were many instances where the same speaker recorded multiple lines of dialogue. Because speakers are not assigned a unique ID, the only way to reconcile such an artifact would be to listen to all 1400 hours of audio manually.

Instead, this research leverages the LibriSpeech ASR corpus. The LibriSpeech ASR corpus contains 1000 hours of English speech sampled at 16 kHz. Though the LibriSpeech ASR corpus contains fewer examples, the speech has already been cleaned and is already labeled, segmented, and aligned. In this thesis, the LibriSpeech ASR corpus is used to create the mixture dataset that is then used to train the designed neural networks.

Each LibriSpeech ASR corpus example contains a few seconds of speech for a given male or female speaker. All examples are stored in separate Free Lossless Audio Codec (FLAC) audio files. FLAC is a lossless audio compression format developed by Xiph [34].

3.2 TensorFlow Audio Processing Pipeline

TensorFlow is used to design the audio processing pipeline used in this work. The goal of our neural network is to learn a binary mask estimate from a spectral frame. Because the LibriSpeech dataset contains single-speaker examples, all contained in separate data files of varying lengths, a TensorFlow audio processing pipeline was developed to make the audio data uniform to support input into a neural network.

Given the large number of audio files associated with the LibriSpeech dataset, it was not reasonable to perform neural network training by fitting all of the data in memory (12 GB of RAM on the Google Colab server used for training). To get around the memory overhead, we leveraged the TFRecord binary file format to store audio examples.

The TFRecord format is a proprietary data format used by TensorFlow that facilitates distributed training. As mentioned earlier, the dataset that we used is derived from the LibriSpeech ASR corpus.
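The sketch below illustrates how audio examples can be serialized into a TFRecord file and streamed back in batches with tf.data, which is the mechanism that avoids holding the full dataset in memory. The feature names, clip length, and batch size are illustrative assumptions rather than the exact layout used in this thesis.

    import numpy as np
    import tensorflow as tf

    def make_example(audio, speaker_id):
        # Serialize one fixed-length audio clip and an integer label into a
        # tf.train.Example record.
        feature = {
            "audio": tf.train.Feature(
                float_list=tf.train.FloatList(value=audio.tolist())),
            "speaker_id": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[speaker_id])),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))

    # Write a handful of one-second placeholder clips to a TFRecord file.
    with tf.io.TFRecordWriter("speech.tfrecord") as writer:
        for i in range(4):
            clip = np.random.randn(16000).astype("float32")
            writer.write(make_example(clip, i).SerializeToString())

    # Stream the records back in batches without loading everything into memory.
    spec = {
        "audio": tf.io.FixedLenFeature([16000], tf.float32),
        "speaker_id": tf.io.FixedLenFeature([1], tf.int64),
    }
    dataset = (tf.data.TFRecordDataset("speech.tfrecord")
               .map(lambda record: tf.io.parse_single_example(record, spec))
               .batch(2)
               .prefetch(tf.data.AUTOTUNE))

    for batch in dataset:
        print(batch["audio"].shape, batch["speaker_id"].shape)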
