BLIND SOURCE SEPARATION IN ULTRA-LOW POWER DEVICES
by
DESHAWN ANDRE BROWN
BSEE, University of Denver, 2016
A thesis submitted to the Graduate Faculty of the
University of Colorado Colorado Springs
in partial fulfillment of the
requirements for the degree of
Master of Science
Department of Electrical and Computer Engineering
This thesis for the Master of Science degree by
DESHAWN ANDRE BROWN
has been approved for the
Department of Electrical and Computer Engineering
by
Mark Wickert, Chair
Jugal Kalita
Byeong Kil Lee
Date: December 17, 2020
BLIND SOURCE SEPARATION IN ULTRA-LOW POWER DEVICES
Thesis directed by Professor Mark Wickert
ABSTRACT
Blind Source Separation (BSS) is the process of decomposing a mixed signal into a set of
linearly independent vector components. The Blind Source Separation problem arises
organically in day-to-day human interaction in the form of the Cocktail Party Problem: when
found in an auditory environment characterized by simultaneous speakers (say speaker
A and speaker B), a human can narrow their auditory attention to a single target source
(speaker A) and ignore all other sources (speaker B). While seemingly trivial for a human,
this task has historically proven to be difficult to solve algorithmically. Advancements in
the multi-disciplinary field of Artificial Intelligence (AI) have given way to Neural
Network based approaches to BSS. In tandem, TinyML, an emerging field spearheading the
deployment of Neural Networks onto resource-starved microcontrollers, yields promise of
proliferation of AI on platforms previously thought to be computationally impractical. In
this work, we explore the feasibility of real-time BSS on a low-cost, commercially
available microcontroller. The LibriSpeech dataset is used to create 1200 unique Male-Female
speech mixtures. These mixtures are used to create a TensorFlow-based audio conditioning
pipeline for training and deploying two BSS Deep Neural Network (BDNN) architectures.
Once trained, TensorFlow Lite for Microcontrollers is used to compress and deploy the
BDNNs onto the Arduino Nano 33 BLE Sense microcontroller. For each architecture, we evaluate whether the model can be deployed on the Arduino Nano 33 BLE Sense. While the implemented architectures are algorithmically functional on the target device, analysis reveals a fundamental tradeoff between model size (number of parameters), speech intelligibility, and model throughput.
ACKNOWLEDGMENTS
Mom, Dad, Terrell, Jamal, and Tee-Tee - I am eternally grateful for your support as I’ve
embarked on this journey in higher education. A special thanks to my colleagues at Northrop
Grumman who were especially supportive in allowing me an incredible amount of
flexibility while I have pursued this MSEE. I’ve thoroughly enjoyed my time at UCCS. Dr.
Wickert, thank you for your limitless support in pursuing this thesis and giving me the
creative liberty to explore such a bleeding edge concept. This thesis is the culmination of
CHAPTER
1. Introduction 1
2. Core Concepts and Related Work 5
2.1 Overview of the Blind Source Separation Problem . . . 6
2.1.1 Source Separation as an Inverse Problem . . . 6
2.2 The Cocktail Party Problem . . . 8
2.2.1 Principal Components Analysis (PCA) for Audio Source Separation . . . 10
2.2.2 Independent Components Analysis (ICA) for Audio Source Separation . . . 15
2.2.3 Constraints of PCA and ICA for Ultra-Low Power Device Integration . . . 18
2.2.4 Computational Auditory Scene Analysis (CASA) . . . 19
2.2.5 Time-Frequency Masking . . . 26
2.3 Artificial Neural Networks . . . 30
2.3.0.1 Convolutional Neural Networks . . . 33
2.3.0.2 Recurrent Neural Networks . . . 34
2.3.1 Neural Network Approaches to Blind Source Separation . . . 36
2.4 TensorFlow . . . 38
2.4.1 TensorFlow Lite for Microcontrollers . . . 39
2.5 Real-Time Signal Processing on Embedded Devices Considerations . . . . 40
3.1 Audio Dataset . . . 44
3.2 TensorFlow Audio Processing Pipeline . . . 45
3.2.1 Short-Time Fourier Transform Frame Optimization . . . 50
3.3 Hardware . . . 52
3.4 Neural Network Architectures . . . 53
3.4.1 Neural Network Compression . . . 56
3.4.2 TensorFlow Lite Conversion and Microcontroller Deployment . . . 58
3.5 Latency Estimation . . . 60
3.6 Source Separation Figures-of-Merit . . . 61
4. Results 63
4.1 TensorFlow Neural Network Model Assessment . . . 63
4.2 Prototype Autoencoder Neural Network Model . . . 64
4.3 Lite Autoencoder Neural Network Model . . . 69
4.4 Super-Lite Autoencoder Neural Network Model . . . 74
5. Discussion and Conclusion 80
5.1 Conclusion . . . 80
5.2 Limitations and Challenges . . . 81
5.3 Future Work . . . 82
BIBLIOGRAPHY 89
TABLE
3.1 Google Colaboratory Specifications. . . 44
3.2 Short-Time Fourier Transform Frame Parameters. . . 52
3.3 Deep Feed-Forward Autoencoder Prototype Architecture. . . 57
3.4 Lite Autoencoder Neural Network Architecture. . . 57
3.5 Super-Lite Autoencoder Neural Network Architecture. . . 58
3.6 Deep Neural Network Training Parameters. . . 58
4.1 TensorFlow Lite Model Sizes for Prototype Autoencoder. . . 68
4.2 Prototype Autoencoder Inference Latency. . . 69
4.3 Prototype Autoencoder Separation Metrics. . . 69
4.4 TensorFlow Lite Model Sizes for Lite Autoencoder. . . 73
4.5 Lite Autoencoder Inference Latency. . . 74
4.6 Lite Autoencoder Separation Metrics. . . 74
4.7 TensorFlow Lite Model Sizes for Super-Lite Autoencoder. . . 78
4.8 Source Separation Metrics for Super-Lite Autoencoder. . . 79
4.9 Super-Lite Autoencoder Inference Latency. . . 79
FIGURE
2.1 General Formulation of an Inverse Problem. . . 8
2.2 Cocktail Party Problem with 2 Sources and 2 Observers. . . 9
2.3 PCA fails to align the Principal Axes of Non-Gaussian Mixtures. . . 13
2.4 y = abs(x) is an example of a function that is uncorrelated but dependent. . 14
2.5 ICA is able to align Non-Gaussian mixtures with the Principal Axes. . . 18
2.6 Time-Frequency Representation of a Speech Signal and its associated Hard Mask. . . 29
2.7 Generic Feedforward Neural Network Topology. . . 30
2.8 Generic Structure for a Recurrent Neural Network. . . 35
2.9 An example of a Computational Graph. . . 39
3.1 Effect of Normalization on STFT frame histogram. . . 47
3.2 Packet-Streaming Concept for Frame-Based Source Separation. . . 48
3.3 TensorFlow-Based Training Data Generation Pipeline. . . 49
3.4 Effect of STFT parameters on Time-Frequency Resolution. . . 51
3.5 Inappropriate STFT parameters cause estimated masks to ignore male-female speech features. . . 52
3.6 TensorFlow Computational Graph for Densely-Connected Feed-Forward Architectures developed for this thesis. . . 56
3.8 TensorFlow to TensorFlow Lite Deployment Scheme. . . 60
4.1 Target Mask for Ideal Male-Speech Separation. . . 64
4.2 Reconstructed Male Audio Sequence using Prototype Autoencoder Network. . . 66
4.3 Reconstructed Female Audio Sequence using Prototype Autoencoder Network. . . 66
4.4 Cross-Correlation Sequences for Ideal Male Speech and Prototype Autoencoder-Estimated Male Speech. . . 67
4.5 Cross-Correlation Sequences for Ideal Male Speech and Prototype Autoencoder-Estimated Female Speech. . . 67
4.6 Estimated Male Masks for Prototype Autoencoder Network. . . 68
4.7 Reconstructed Male Audio for Lite Autoencoder Network. . . 70
4.8 Reconstructed Female Audio for Lite Autoencoder Network. . . 71
4.9 Cross-Correlation Sequences for Ideal Male Speech and Lite Autoencoder Estimated Male Speech. . . 71
4.10 Cross-Correlation Sequences for Ideal Male Speech and Lite Autoencoder Estimated Female Speech. . . 72
4.11 Estimated Male Masks for Lite Autoencoder Network. . . 73
4.12 Reconstructed Male Audio for Super-Lite Autoencoder Network. . . 75
4.13 Reconstructed Female Audio for Super-Lite Autoencoder Network. . . 76
4.14 Cross-Correlation Sequences for Ideal Male Speech and Super-Lite Autoencoder Estimated Male Speech. . . 76
4.15 Cross-Correlation Sequences for Ideal Male Speech and Super-Lite Autoencoder Estimated Female Speech. . . 77
INTRODUCTION
The ubiquity of Blind Source Separation (BSS) techniques in audio signal processing
is clear: from audio enhancement to vocal isolation, the ability to separate a target signal
from the environment is a crucial driver in many of today’s audio processing pipelines.
Historically, algorithmic approaches to the BSS problem have been multi-pronged, with
research efforts split between computational modeling of the human auditory system, via
Computational Auditory Scene Analysis (CASA) and matrix-based approaches such as
Independent Components Analysis (ICA) [1].
CASA and ICA largely differ in the classes of BSS that they aim to address. The
goal of ICA is to separate all target sources from the input mixture, whereas the goal of
CASA is to separate a target speaker from the mixing environment [2]. This difference
becomes more clear when the performance of ICA and CASA are compared in different
scenarios. A direct comparison of ICA and CASA demonstrates a dichotomy of performance when the mixing environment is altered [2]. Specifically, it is shown that CASA techniques excel in certain mixing environments, while the converse is shown to be true for ICA. Auditorily, we find that CASA systems perform best when competing sources are tonal or narrowband in nature [3]. In contrast, ICA performs best when competing sources are both broadband in nature, as is the case for mixtures containing competing speech signals and/or Additive White Gaussian Noise (AWGN) [3].
It is worth noting that ICA is a special case of the BSS problem: performance only holds if
the underlying mixing matrix is linear and the source signals are non-Gaussian.
In spite of these constraints, the ICA formulation of the BSS problem still finds use
because it yields meritorious results in the standard formulation of the Cocktail Party
Problem. Here, we have a mixture of concurrent speakers and the explicit goal of isolating all target speakers. In these circumstances, ICA can reliably isolate the speakers in the mixture. Many communities, including the hearing aid community, have long held an interest in solving the problem in real-time [4]. Solving BSS on platforms like the hearing aid is unique because of the inherent resource-constrained nature of the device,
and the need for relatively low latency [4]. Unfortunately, the computationally intensive
matrix-based mathematics that underlie ICA have prevented the technique from seeing
widespread usage in real-time applications. While real-time approaches for ICA have been
demonstrated using FPGAs [5], special hardware is required, and, because of the need to
program in Verilog, ad-hoc tuning, modification, or addition of algorithms is not
straightforward.
In recent years, the machine learning (ML) community has become increasingly
involved with the problem [6]. Due to advancements in Graphical Processing Unit (GPU) technology and neural network theory, we have seen neural networks successfully used to achieve performance that matches or exceeds the state-of-the-art. In speech mixtures, it has been shown that time-frequency
masking, a core component of CASA, yields performance on-par with ICA due to
favorable orthogonality properties in the time-frequency plane [7]. Neural network architectures
have been developed that are able to estimate the time-frequency mask for a Short-Time
Fourier Transform (STFT) frame of audio data. Once a time-frequency mask is known, the
original speech signal can be recovered by applying the time-frequency mask to the STFT
frame and then computing the inverse STFT.
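The mask-and-invert recovery described above can be sketched numerically. The following is a minimal illustration, not the thesis pipeline: it uses synthetic tones standing in for speakers, an oracle mask computed from the known sources, and `scipy.signal`'s STFT routines (assumed available).

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(2 * fs) / fs
s1 = np.sin(2 * np.pi * 200 * t)   # low-frequency "speaker A" (hypothetical)
s2 = np.sin(2 * np.pi * 3000 * t)  # high-frequency "speaker B" (hypothetical)
mix = s1 + s2

# STFT of the mixture and of each source (oracle, for illustration only)
_, _, Zmix = stft(mix, fs, nperseg=256)
_, _, Z1 = stft(s1, fs, nperseg=256)
_, _, Z2 = stft(s2, fs, nperseg=256)

# Binary time-frequency mask: 1 where speaker A dominates
mask = (np.abs(Z1) > np.abs(Z2)).astype(float)

# Apply the mask to the mixture's STFT, then invert to recover speaker A
_, est = istft(mask * Zmix, fs, nperseg=256)
n = min(len(s1), len(est))
corr = np.corrcoef(s1[:n], est[:n])[0, 1]
```

Because the two tones occupy disjoint frequency bins, the masked reconstruction correlates almost perfectly with the target; real speech mixtures overlap far more, which is where the estimated masks discussed in later chapters come in.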
The relevancy of these advancements in neural network design is bolstered by
concurrent advancements in the development of TensorFlow, Google’s computational graph
framework [8], and the invention of TinyML, which is the concept of deploying ML
algorithms on low-power microcontrollers [9]. Innovation in these areas makes the commercial
use of algorithms previously seen as computationally infeasible a possibility. TensorFlow
Lite for Microcontrollers, a portable build of TensorFlow, allows full-fledged TensorFlow
models to be compressed into a low-memory footprint that is capable of running on
commercially available microcontrollers [9]. When taken holistically, these factors give rise to
the potential of solving the BSS problem on low-end hardware.
In this work, we address the issue of solving the source separation problem in
real-time. Specifically, we address the viability of using the Arduino Nano 33 BLE Sense
microcontroller [10] and the i.MX RT600 microcontroller [11] for estimating the
time-frequency masks needed for separating linearly-mixed speech signals. Following this
introductory chapter, Chapter 2 of this thesis manuscript presents a literature review that
outlines the core concepts and related work that lay the foundation for our proposed approach, including discussions of neural networks, TensorFlow, and TensorFlow Lite model conversion. Following the
establishment of these core concepts, Chapter 3 describes the audio processing pipeline and model that were designed in TensorFlow and the corresponding conversion to TensorFlow
Lite. This chapter includes details about the datasets used, the neural network
processing architecture, and details the conversion from TensorFlow to TensorFlow Lite. Chapter
4 presents the results of the study, including performance comparisons to ICA and ideal
time-frequency masks when applicable. Chapter 5 provides a summary of the work conducted and describes future work that may be pursued following this study.
CORE CONCEPTS AND RELATED WORK
This section describes the concepts and principles that govern this work. Specifically,
we provide a mathematical overview of the Blind Source Separation problem and outline
the two classical approaches to solving the problem: CASA and ICA. Mitianoudis’ work
on ICA serves as the foundation for the ICA section of this chapter [12]. Time-Frequency
masking, a newer technique that assigns time-frequency bins to a particular source and the driver for the neural network approach outlined in this thesis manuscript, is also discussed.
For its seminal role in this work, TensorFlow, Google’s open-source framework and
interface for developing and deploying machine learning algorithms, will be described in
detail. TensorFlow Lite for Microcontrollers (TFMicro), a framework for converting and
greatly compressing TensorFlow models so that they can run on embedded devices, will also be described.
2.1
Overview of the Blind Source Separation Problem
2.1.1 Source Separation as an Inverse Problem
Source separation belongs to the broader class of mathematical problems known as
inverse problems. In the forward problem, we wish to determine a function that maps a
set of input parameters to a set of measurements. Formally, this mapping is known as the measurement operator M. The measurement operator transforms parameters in the
function space X to the space of the data D. That is:
y = M(x) for x ∈ X and y ∈ D. (2.1)
The inverse problem is the opposite: in this formulation, we are tasked with finding points x ∈ X using observations y ∈ D such that:
y ≈ M(x) for x ∈ X and y ∈ D. (2.2)
In other words, we use our observations y ∈ D to determine the input parameters x ∈ X
that would have been needed to yield the observations. This is shown visually in Figure 2.1.
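A small numeric example makes the forward/inverse relationship concrete. The operator below is hypothetical (a 3×2 matrix, not drawn from the thesis); because it has full column rank, it is injective and a least-squares inverse recovers the parameters from noisy observations.

```python
import numpy as np

# Forward problem: a known measurement operator M maps parameters x in X
# to observations y in D (hypothetical 3x2 linear operator)
M = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, -1.0]])
x_true = np.array([0.5, -1.5])
rng = np.random.default_rng(0)
y = M @ x_true + 1e-3 * rng.standard_normal(3)  # noisy observations

# Inverse problem: recover x from y; least squares is stable here
# because M is (approximately) injective
x_hat, *_ = np.linalg.lstsq(M, y, rcond=None)
```

When noise grows or the operator becomes ill-conditioned, the recovered parameters drift far from the truth, which is exactly the ill-posedness discussed next.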
In order to uniquely reconstruct these parameters, we require the measurement operator to
be injective. That is,

M(x1) = M(x2) =⇒ x1 = x2 for x1, x2 ∈ X. (2.3)

Because measurements in a mixture more often than not contain noise, the injectivity is often only approximate. For practical problems of interest, such as the source
separation problem, we can treat a mixture’s measurement operator as an approximation
of an injective measurement operator. We can then find an inversion operator M−1 that
maps the range of M to unique elements in X. The inversion operator is characterized by
its estimated stability. The stability estimate takes the form of a modulus of continuity:
‖x1 − x2‖X ≤ ω(‖M(x1) − M(x2)‖D), (2.4)
where ω : R+ −→ R+ is an increasing function such that ω(0) = 0. Equation (2.4)
quantifies the continuity of the mapping.
An inverse problem is considered well-posed when the reconstruction error can be described by some constant C, such that ω(x) = Cx. When strong noise is present in the observations or the inverse mapping landscape is otherwise unstable, such that small changes
in parameters yield large changes in observations, then an inverse problem is deemed to
be ill-posed. In general, the BSS problem is characterized as an ill-posed inverse problem.
Attempting to solve the inverse problem using numerical methods requires that we regularize the problem. The classical means of doing so is by making a-priori assumptions about the sources or the mixing matrix. The use of these assumptions gives rise to matrix-based
Figure 2.1: General Formulation of an Inverse Problem.
2.2
The Cocktail Party Problem
This work is focused on the Cocktail Party formulation of the BSS problem. This
formulation is named after the auditory scenario that arises in multi-speaker listening
environments. In a multi-speaker auditory scene characterized by competing audio sources, we
wish to discover an unmixing matrix that, when applied, is able to isolate the independent
audio sources that are present in the mixture.
The audio source separation problem can be characterized as described below.
Suppose that there are N speakers or sources transmitting audio signals S, represented as an N × 1 column vector,

S = [s1[n] s2[n] ... sN[n]]T, (2.5)
and the observations at each listener or microphone (called observer from this point on) X
are given by:
X = [x1[n] x2[n] ... xM[n]]T. (2.6)
Here, n denotes the time index in a discrete-time system model where samples have spacing Ts.

Figure 2.2: Cocktail Party Problem with 2 Sources and 2 Observers.

The mixing environment (or mixing method) can be represented by the mixing matrix A. We can represent the time-varying additive noise present with respect to each observer as:
E = [ε1[n] ε2[n] ... εM[n]]T, (2.7)

where E describes the time-varying additive noise present on each observer. The measurements captured by each observer can then be expressed as a linear system:
X = AS + E. (2.8)
In Figure 2.2, we represent the cocktail party problem as finding (or recovering) the audio from speakers S1 and S2 given a monaural mixture collected on a microphone (either X1 or X2). Assuming that the inverse exists, we can recover the original source vectors S by estimating a matrix operator W that can invert the mixing operator A:

Ŝ = WX, where W ≈ A−1. (2.9)
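The linear mixing model X = AS + E of Equation (2.8) and its inversion can be sketched numerically. The sources, mixing matrix, and noise level below are all hypothetical; A is treated as known here, which sidesteps the blind problem but shows what the unmixing operator must accomplish.

```python
import numpy as np

# Hypothetical sources: a sinusoid and a square wave (2 x n)
n = 1000
t = np.linspace(0, 1, n)
S = np.vstack([np.sin(2 * np.pi * 5 * t),
               np.sign(np.sin(2 * np.pi * 9 * t))])

A = np.array([[0.8, 0.3],
              [0.4, 0.9]])              # hypothetical mixing matrix
rng = np.random.default_rng(1)
E = 1e-4 * rng.standard_normal((2, n))  # small additive noise per observer

X = A @ S + E        # observations, as in Equation (2.8)

# With A known (not the blind case), W = A^{-1} recovers the sources
W = np.linalg.inv(A)
S_hat = W @ X
```

In the blind setting, A is unknown, and the remainder of this chapter concerns how W can be estimated from X alone.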
2.2.1 Principal Components Analysis (PCA) for Audio Source Separation
PCA is a foundational technique in machine learning that also finds relevancy in
the discussion of audio source separation. Unlike frequency domain approaches, PCA is
statistical in that it relies on the vector space properties of the input data in order for the
technique to work.
The goal of PCA is to transform an input into an alternate vector space representation
that eliminates as many redundant parameters as possible while also retaining as much
unique information relevant to the signal as possible. PCA achieves this elimination of redundancy by projecting the data onto a new set of axes that satisfy a set of constraints
[13]. Mathematically, it is found that this projection is one that maximizes the variance of
the variables (thus maximizing the linear independence).
PCA builds off the principles of the previously formulated cocktail party problem,
but additional constraints are imposed because of the underlying linearity assumptions of
the PCA algorithm. The first constraint is that mixing of the source vectors with the mixing
operator must happen instantaneously. This assumption can be approximated for cases
where microphones are close to the sources with respect to the speed of sound. The second
assumption is that the mixing operator must be linear. This means that convolution-based
mixing operators, such as room reverberation, violate the assumptions of PCA.
Mathematically, we establish the case for PCA as follows. Suppose that we have a
zero-mean, discrete-time stochastic process vector x[n]. The stochastic process vector has a covariance matrix given by

Cx = E{x[n]x[n]T}, (2.11)

where the E operator denotes the expected value. In PCA, we wish to identify the (in)dependence structure
in all dimensions and discover an orthogonal transformation matrix W of size L × N, mapping RN to RL where L < N, such that the L-dimensional output vector y[n] = WX[n] captures the essence of the input vector and where the covariance matrix of the output vector, Cy, is a diagonal matrix D with elements arranged in decreasing order [1]. The stochastic process
vector can then be reconstructed according to:
X̂[n] = WTWX[n]. (2.12)
The objective of PCA is to find an optimal value of the orthogonal matrix W, denoted W̃, such that the reconstruction error J is minimized, where

J = E{‖X[n] − X̂[n]‖2}. (2.13)
The rows of the transform matrix W̃ are called the principal components of the
stochastic process vector. These principal components, which are also the solution to the
aforementioned optimization problem, are given by the eigenvectors of the covariance
matrix Cx. The subspace that is spanned by the eigenvectors forms the PCA subspace.
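The eigenvector computation behind PCA can be sketched in a few lines of NumPy. The data below are hypothetical correlated observations; `np.linalg.eigh` returns eigenvalues in ascending order, so they are reordered to match the decreasing-variance convention above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical zero-mean correlated observations (2 x 5000)
X = np.array([[2.0, 0.0],
              [1.2, 0.5]]) @ rng.standard_normal((2, 5000))
X = X - X.mean(axis=1, keepdims=True)

# Principal components: eigenvectors of the covariance matrix C_x
Cx = np.cov(X)
eigvals, eigvecs = np.linalg.eigh(Cx)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to decreasing variance
W = eigvecs[:, order].T                 # rows of W are principal components

Y = W @ X        # projection onto the PCA subspace
Cy = np.cov(Y)   # diagonal, with decreasing entries on the diagonal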
PCA acts as a dimensionality reduction operation and finds use in data reduction
in order to reduce the number of independent variables that are under consideration when
fitting a model (thus acting as a regularizer to prevent overfitting). PCA provides a
transformation that maps the observation space to m bases of ascending importance [13].
One of the commonplace algorithms for calculating the eigenvalues and eigenvectors
for PCA is the use of Singular Value Decomposition (SVD). To perform SVD, we first
multiply the observation vector by a matrix containing the eigenvectors of the covariance
matrix. We then multiply the previous result with the diagonal matrix containing the inverse
of the square root of the corresponding eigenvalues. That is,
SVD := DVX, (2.14)

where D = diag(1/√d1, 1/√d2, ..., 1/√dN), (2.15)
where the di are the eigenvalues of the covariance matrix, and

V = [e1 e2 ... eN]T, (2.16)

where the elements of the vector V are the eigenvectors of the covariance matrix. Observation of the algorithm for SVD reveals a striking concern: computing eigenvalues and
Ob-servation of the algorithm for SVD reveals a striking concern: computing eigenvalues and
eigenvectors is computationally intensive. Furthermore, because PCA is statistical, an input
vector of sufficient length is needed in order to get a decomposition that is representative of
the source data. The sufficient length comes about because time averages replace statistical ensemble averages.
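The whitening transform in Equations (2.14)-(2.16), rotation by the eigenvectors followed by scaling each direction by the inverse square root of its eigenvalue, can be sketched as follows (hypothetical data; the covariance of the whitened output should be the identity up to floating-point error).

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical zero-mean correlated observations (2 x 10000)
X = np.array([[1.5, 0.4],
              [0.3, 0.9]]) @ rng.standard_normal((2, 10000))
X = X - X.mean(axis=1, keepdims=True)

# Eigendecomposition of the covariance matrix
d, V = np.linalg.eigh(np.cov(X))   # eigenvalues d, eigenvectors in columns of V

# Whitening: D = diag(1/sqrt(d_i)) applied after rotation by V^T
D = np.diag(1.0 / np.sqrt(d))
Xw = D @ V.T @ X

Cw = np.cov(Xw)   # approximately the identity after whitening
```

Even this toy example hints at the cost concern raised above: the eigendecomposition dominates the computation, and its cost grows quickly with the frame length needed for stable statistics.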
Even if computational overhead was not a concern, we find an issue with PCA when
we attempt to use the algorithm on speech mixtures. Because PCA performs dimensionality
reduction by discovering a subspace that maximizes the variance [13], we would expect
that PCA would allow for audio source separation for a frame of interest. Unfortunately,
analysis reveals that this is not the case.
Figure 2.3: PCA fails to align the Principal Axes of Non-Gaussian Mixtures.
After applying PCA, we find that the signals are uncorrelated, but are not aligned with
the principal axes, implying that the signals were unable to be separated. This is because
we cannot uniquely identify non-Gaussian signals using second-order statistics [12]. To see
this, consider the scatter plots in Figure 2.3. The left plot shows the observation basis that
is estimated from the eigenvectors of the covariance matrix for a non-Gaussian mixture.
The right-hand plot shows the result of representing the mixture in the basis estimated by
PCA. After representing the data in this basis, we see that the data is rotated, but is not
aligned with the principal axes u1 and u2. This misalignment indicates co-dependence in the distribution. This demonstrates a crucial finding for PCA: statistical decorrelation is not a sufficient condition for source separation.

Figure 2.4: y = abs(x) is an example of a function that is uncorrelated but dependent.
Statistical independence is required for source separation. Mathematically, we find that decorrelation does not imply independence [12], i.e. uncorrelated random variables are
not independent unless Gaussian. To prove that decorrelation does not imply independence,
consider the absolute value function, y = abs(x) (Figure 2.4). Because the relationship between x and y is not linear over the full domain (its slope changes sign at x = 0), x and y can be uncorrelated. However, we see that the variables are dependent because the required property of probabilistic independence is violated: by knowing a value of x, we can predict with absolute certainty what the value of y will be.
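The abs(x) example is easy to verify numerically: for a symmetric zero-mean x, the sample correlation between x and |x| is near zero, even though y is a deterministic function of x.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, 200_000)   # symmetric, zero-mean samples
y = np.abs(x)                          # y is a deterministic function of x

# Near-zero sample correlation: E[x * |x|] = 0 by symmetry
r = np.corrcoef(x, y)[0, 1]

# ...yet complete dependence: knowing x fixes y exactly
max_err = np.max(np.abs(y - np.abs(x)))
```

The correlation coefficient shrinks toward zero as the sample size grows, while the dependence is exact, decorrelation alone tells us nothing about separability here.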
This is not to say that PCA does not have merit in the context of BSS: when the
observations contain high levels of noise, PCA can be used as a preprocessing step prior to
the application of other source separation techniques. The use of PCA in this way yields
2.2.2 Independent Components Analysis (ICA) for Audio Source Separation
Independent Component Analysis (ICA), like PCA, is a special case of the BSS
problem. Unlike the general formulation, ICA assumes that the source signals present in the
mixing environment are statistically independent (rather than simply uncorrelated). Source
vectors S = (s1, s2, ..., sN)T are deemed independent if their joint probability density is equal to the product of their marginal densities. That is,

P(S) = P(s1, s2, ..., sN) = P(s1)P(s2)...P(sN), (2.17)
where the P operator represents the probability density. ICA also imposes the constraint
that at most one of the source vectors can have Gaussian statistics [12]. This condition
should be satisfied for all practical human speech separation tasks of interest.
There are many approaches to ICA, ranging from neural networks to statistical signal
processing. Instead of rigorously addressing all formulations, we describe the concept of ICA in moderate detail. Further detail about additional approaches for ICA can be
found in [12]. Given the aim of this thesis, the ICA formulations discussed in this section
are chosen because of their relevancy to the cocktail party source separation problem.
In the following sections, we will establish the general formulation for ICA. While
many ICA algorithms have been developed over the years, including formulations for
Maximum Likelihood Estimation (MLE), entropy maximization, and maximization of non-Gaussianity, analysis in this thesis will be conducted with respect to FastICA [16], an algorithm that uses the non-Gaussianity formulation of ICA. If all assumptions are met, the goal of ICA is to maximize the statistical independence of the resultant output vectors using either the relative entropy (also called the Kullback-Leibler divergence), maximum likelihood, or maximization of non-Gaussianity [17]. While the ICA algorithm has historical pedigree
in source separation, we note that the ICA algorithm imposes permutation and scale
ambiguities on the source estimates. The permutation ambiguity means that we cannot control
the order of the source estimates [12]. The scaling ambiguity says that, because the mixing
matrix and the original source vectors are not known a-priori, scalar factors are lost in the
mixing process [12].
Because ICA is predicated on the non-Gaussianity of the inputs, it is important to understand why this is the case. Consider an n-dimensional, Gaussian-distributed random vector x, e.g.,

x = [x1 x2 ... xn]T. (2.18)
The Gaussian distribution is unique in that all orthogonal projections of x will have
the same probability distribution [18]. In other words, because the Gaussian process vector
has a symmetric covariance matrix, we cannot discover unique orthogonal transformations
of the Gaussian distribution. Consequently, we cannot discover unique sources in a mixing
matrix A if all of the underlying sources are Gaussian-distributed.
The Central Limit Theorem (CLT) plays a large part in ICA techniques too. The
CLT shows that the sum of independent random variables, irrespective of the underlying
distribution, tends to a Gaussian distribution. If we instead assume that all source vectors are non-Gaussian, this tendency can be exploited to find the independent components. An independent component y can be represented by a linear combination of the estimated
source vectors xi and an estimated mixing matrix W. That is,

y = WTx = Σi wixi. (2.19)
Here, y denotes an independent component. In practice, the true mixing matrix A is not
known, so we must simultaneously estimate W and x. To approximate these terms, we
make the following change of variables:
z = ATW, (2.20)

then

y = WTx = WTAs = zTs. (2.21)
The above shows that y is a linear combination of the sources si, with weights given by zi
[18]. By the CLT, we know that a sum of random variables tends to a Gaussian distribution.
Consequently, zTs is more Gaussian than the underlying source vectors si. Conversely, zTs is least Gaussian when it is equal to one of the source signals [18]. The independent components can thus be discovered by maximizing the non-Gaussianity of wTx. To discover all of the independent components, we search the convex optimization surface
for all local maxima. The optimization surface in the n-dimensional space of vectors w
contains 2n local maxima, one for each source vector [18]. These local maxima map to the independent components.

Figure 2.5: ICA is able to align Non-Gaussian mixtures with the Principal Axes.

The FastICA algorithm applies an orthogonalization in the PCA subspace that corresponds to constraining
of the search space to a place where uncorrelated estimates are in close proximity [18].
We demonstrate the utility of ICA in Figure 2.5. Unlike PCA, after projecting the
non-Gaussian mixture in the ICA subspace, we see that the data is properly aligned with the
principal axes. Alignment with the principal axes is indicative of successfully resolving the
mixture into its independent components.
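A compact NumPy sketch of the FastICA fixed-point iteration ties the pieces above together: whitening in the PCA subspace, a tanh contrast function, and a deflation (Gram-Schmidt) scheme to find one component at a time. The sources and mixing matrix below are hypothetical, and this is an illustrative implementation, not the one evaluated in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
t = np.linspace(0, 8, n)
S = np.vstack([np.sign(np.sin(3 * t)),   # square wave (non-Gaussian)
               np.sin(5 * t)])           # sinusoid (non-Gaussian)
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])               # hypothetical mixing matrix
X = A @ S

# Center and whiten the observations (PCA-subspace preprocessing)
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Xw = np.diag(1.0 / np.sqrt(d)) @ E.T @ X

# FastICA fixed-point iteration, g(u) = tanh(u), deflation scheme
W = np.zeros((2, 2))
for i in range(2):
    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        wx = w @ Xw
        g, gp = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        w_new = (Xw * g).mean(axis=1) - gp.mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)   # orthogonalize vs. found rows
        w = w_new / np.linalg.norm(w_new)
    W[i] = w

Y = W @ Xw   # estimated independent components (up to order and sign)
```

Each recovered row of Y matches one of the original sources up to permutation, sign, and scale, exactly the ambiguities noted earlier in this section.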
2.2.3 Constraints of PCA and ICA for Ultra-Low Power Device Integration
The cornerstone issue underpinning the viability of either ICA or PCA in a real-time context, despite their ability to blindly separate mixtures, is the exceedingly long processing frames needed to produce suitable statistics. Frames 9-10 seconds in duration were used in Mitianoudis 2004 [12], which is well beyond the processing timelines needed for real-time audio processing applications, where latency in the neighborhood of 20-30 ms is typically the target.
2.2.4 Computational Auditory Scene Analysis (CASA)
CASA refers to algorithms that attempt to mimic the behavior of the human auditory
system. Classical research on auditory scene analysis shows that deconstruction of an
auditory scene is a two-step process [3]. In the first stage, elements or features are extracted
from the auditory scene. These elements are representative of key elements within the
environment. In the second stage, we combine elements that are likely to be paired with
the same acoustic source – such as a human speaker. Relevant features may be derived
using heuristic-based approaches or through unsupervised means such as neural networks
[3]. Once these features have been discovered, we can invert the auditory representation in
order to recover the original source vectors.
Time-frequency masking is a key concept used in some CASA systems to tag source signals in an acoustic mixture. Time-frequency masking, in the context of
CASA, is inspired by acoustic masking in the human auditory system, in which stronger
signals overpower weaker signals that are in the same (or nearly the same) time-frequency
region of interest [3]. In this schema, high-weight values are assigned to dominant sources,
while lower weights are applied to weaker sources. By applying this mask to the
time-frequency representation and then inverting the representation, we can effectively recover the source signals [3].
Construction of these time-frequency masks requires that the clean speech signals are known a-priori. Using the clean speech signal as a reference, we find time-frequency regions of the mixture where the energy is within 3 dB of the clean speech signal. These regions are assigned a mask value of 1 if they satisfy the energy threshold, and 0 otherwise. If we have a time-frequency representation of the speech signals, where t represents the time axis and f represents the frequency axis, then the ideal binary mask m(t, f) for source signal s(t, f) with respect to a competing noise signal n(t, f) can be expressed as:

m(t, f) = 1, if s(t, f) > n(t, f); 0, otherwise. (2.22)
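As a sketch (not the thesis's code), the mask in equation (2.22) can be computed directly with NumPy, given hypothetical time-frequency energy grids for the clean source and the competing noise:

```python
import numpy as np

def ideal_binary_mask(source_tf, noise_tf):
    """Equation (2.22): mask is 1 where the source dominates the noise."""
    return (source_tf > noise_tf).astype(np.float32)

# Toy 2x3 time-frequency energy grids (rows = time, cols = frequency).
source_tf = np.array([[4.0, 0.1, 2.0],
                      [0.2, 3.0, 0.5]])
noise_tf = np.array([[1.0, 2.0, 0.5],
                     [2.0, 1.0, 1.0]])

mask = ideal_binary_mask(source_tf, noise_tf)
# mask == [[1, 0, 1], [0, 1, 0]]
```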
Creating such masks without a-priori knowledge of the source signals is a tenet of the
modeling approaches employed in CASA systems. As mentioned earlier, CASA systems
typically use a multi-stage processing scheme. The first stage typically involves the use
of time-frequency analysis. Many CASA systems use biologically-inspired operations,
but Short-Time Fourier Transforms (STFTs) or wavelets can be used as well [3]. In a
biologically inspired CASA system, a bank of bandpass filters is used to simulate the effect
of frequencies being associated with various locations on the basilar membrane [3], and a
gammatone filter bank is used to simulate the impulse responses associated with the nerve fibers in the human auditory system [3]. The gammatone filter bank has a continuous-time impulse response of the form:

g_i(t) = t^(n−1) exp(−2π b_i t) cos(2π f_i t + φ_i) u(t), (1 ≤ i ≤ N), (2.23)

where N is the number of filter channels, n is the filter order, t is the time in seconds, f_i is the center frequency of channel i, b_i is the bandwidth parameter of channel i, φ_i is the phase of channel i, and u(t) is the unit step function.
The parameters of these filters are tuned to match a specified psycho-acoustic environment. In the human auditory system, it is found that the bandwidth of the bandpass filters increases non-linearly with center frequency, as described by the mel-scale [20]. For human subjects, it is found that the equivalent rectangular bandwidth (ERB) is equal to:

ERB(f) = 24.7(4.37 f/1000 + 1), (2.24)

where f is the frequency in Hz and ERB is in Hz.
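Equations (2.23) and (2.24) can be sketched in a few lines of NumPy. The parameter choices below (4th-order filters, bandwidth set proportional to the ERB at each center frequency) are common conventions and are assumptions here, not values from the thesis:

```python
import numpy as np

def erb(f_hz):
    """Equation (2.24): equivalent rectangular bandwidth in Hz, f in Hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone_ir(f_i, fs, n=4, duration=0.05, phi=0.0):
    """Equation (2.23): gammatone impulse response, sampled at rate fs."""
    t = np.arange(0, duration, 1.0 / fs)
    b_i = 1.019 * erb(f_i)  # bandwidth proportional to ERB (an assumed convention)
    return t ** (n - 1) * np.exp(-2 * np.pi * b_i * t) * np.cos(2 * np.pi * f_i * t + phi)

print(erb(1000.0))                     # 24.7 * 5.37 = 132.639 Hz
g = gammatone_ir(1000.0, fs=16000)     # one channel of the filter bank
```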
Once filtered, the next stage of the CASA system uses the time-frequency information in order to identify and group relevant features. The fundamental frequency is one of the most important features for such grouping. In the human auditory system, humans are able to use differences in fundamental frequencies in order to cluster the harmonics of one source and segregate the interferer. In order to identify the fundamental frequency, the correlogram is used [3]. A correlogram can be interpreted as a multi-channel autocorrelation, where we compute the autocorrelation for each channel of our filter bank:

A(t, l, τ) = Σ_{n=0}^{N−1} h(t − n, l) h(t − n − τ, l) w(n), (2.25)

where h(t, l) represents the filter response for channel l at time t, τ is the autocorrelation lag, and w is a window function of N samples. Fundamental frequencies form peaks in the correlogram, which can then be found using standard peak detection algorithms. Once these fundamental frequencies have been identified, pitch tracking algorithms can then be applied; several have been proposed, including Hidden Markov Models (HMMs) and Particle Filter based approaches [3].
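Equation (2.25) can be sketched per channel as a windowed autocorrelation. The helper below is hypothetical (not from the thesis) and uses a pure tone standing in for one filter-bank channel; the autocorrelation peaks at the pitch period:

```python
import numpy as np

def correlogram_frame(h, t, lags, N):
    """Equation (2.25): windowed autocorrelation of one filter channel h,
    evaluated at time index t over the given lags, with an N-sample window."""
    w = np.hanning(N)
    n = np.arange(N)
    return np.array([np.sum(h[t - n] * h[t - n - tau] * w) for tau in lags])

fs = 8000
samples = np.arange(2 * fs)
h = np.sin(2 * np.pi * 200 * samples / fs)   # 200 Hz tone: period = 40 samples

lags = np.arange(20, 61)
A = correlogram_frame(h, t=4000, lags=lags, N=512)
peak_lag = int(lags[np.argmax(A)])           # expect ~40 (one pitch period)
```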
Though these models are functionally robust, we avoid them in the context of this thesis because of the associated computational and memory overheads [2] [3]. The neural oscillator network approach, though much simpler than HMMs and Particle Filters, is successful at re-synthesizing both speech vectors and noise vectors in mixtures [2]. The neural
oscillator network is an oscillation-based network that is used to perform auditory grouping
[20]. The neural oscillator network proposed in Wang 1999 [20] forms a perceptual stream
if the oscillator neurons are phase-locked. This phase-locked stream is desynchronized
from the oscillator neurons that map to different streams. When this grouping is learned,
we can re-synthesize all sources by inverting the representation that is output from the
neural network. Functionally, the oscillator neural network acts as an image segmentation
algorithm.
The first layer of the network is a segmentation layer. Oscillator neurons in the network are defined as reciprocally connected excitatory and inhibitory variables x_ij and y_ij, respectively. The excitatory and inhibitory variables in the 2-dimensional grid are described by the equations,

ẋ_ij = 3x_ij − x_ij³ + 2 − y_ij + I_ij + S_ij + ρ, (2.26)

ẏ_ij = ǫ [γ(1 + tanh(x_ij/β)) − y_ij]. (2.27)
Here, I_ij represents an external input to the oscillator neuron, S_ij denotes coupling from other oscillators in the network, and ρ is the amplitude of a Gaussian noise term [20]. We choose ǫ to be a small number. If noise effects are ignored and the external input is constant, then ẋ_ij = 0, called the x-nullcline, is a cubic function, and ẏ_ij = 0, the so-called y-nullcline, is a sigmoid function, S(x). Here, the sigmoid function is defined as:

S(x) = 1 / (1 + e^{−x}). (2.28)
When I_ij > 0, the nullclines intersect at a single point along the middle branch of the cubic. This configuration represents an oscillator with two time-scales. We choose the tuning parameter β to be a small value. The oscillator configuration yields a stable limit cycle for sufficiently small values of ǫ [20] and is considered enabled.
The limit cycle contains two distinct phases, referred to as silent and active. These two phases map to the left and right branches of the cubic function, respectively. Transitions between these two states occur on short time scales, with the γ parameter determining the relative times spent in each state. A larger γ parameter increases the amount of time spent in each phase. When I_ij < 0, the nullclines intersect at a stable fixed point and no oscillation occurs. In this state, the oscillator is considered excitable [20].
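The relaxation-oscillator behavior described above can be illustrated with a forward-Euler simulation of a single uncoupled unit of equation (2.26). The ẏ dynamics and all parameter values below follow the standard Terman-Wang form and are illustrative assumptions, not values taken from the thesis:

```python
import numpy as np

# Illustrative parameters (assumptions): eps small, gamma and beta as tuning values.
eps, gamma, beta = 0.02, 6.0, 0.1
I = 0.2          # positive external input -> enabled (oscillating) regime
dt, steps = 0.01, 40000

x, y = 0.1, 0.0
xs = np.empty(steps)
for k in range(steps):
    dx = 3 * x - x ** 3 + 2 - y + I                    # fast excitatory variable (no coupling/noise)
    dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)   # slow inhibitory variable
    x, y = x + dt * dx, y + dt * dy
    xs[k] = x

# x alternates between the active (right) and silent (left) branches of the cubic.
```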
S_ij represents segments formed in the first layer of the neural network. These segments are described mathematically as:

S_ij = Σ_{kl ∈ N(i,j)} W_{ij,kl} H(x_kl − θ_x) − W_z H(z − θ_z). (2.29)

Here, W_{ij,kl} is the connection weight from oscillator (i, j) to oscillator (k, l), and N(i, j) represents the set of nearest neighbors at the grid location (i, j). θ_x is a threshold value between the left and right branches of the cubic function. H is the unit step function. W_z is the weight of inhibition from the global inhibitor z [20], defined by:

ż = σ_∞ − z, (2.30)
where σ_∞ = 1 if x_ij ≥ θ_z for at least one oscillator, and σ_∞ = 0 otherwise. A lateral potential is used for noise rejection purposes, and is described by the equation:

ṗ_ij = (1 − p_ij) u[ Σ_{kl ∈ N_p(i,j)} u(x_kl − θ_x) − θ_p ] − ǫ p_ij. (2.31)
The lateral potential is added as a gating term to the external excitatory input I_ij. If the activity of each neighbor k ∈ N1(i) is greater than the threshold θ_x, then the outer unit step function will equal unity and the oscillator will accumulate potential. Oscillators that accumulate enough potential are designated leaders. Followers are neighboring oscillators that can transition phases (referred to in the model as jumping). Noisy fragments are unable to transition phases beyond a short period of time because they cannot become leaders, nor can they become followers, because of a lack of nearest-neighbor stimulation (essentially becoming isolated pixels in a 2D representation of the layer). This scheme allows the network to reject noise artifacts. The strength of the external input is directly correlated to the peak magnitudes found in a given channel's autocorrelation function.
A consequence of this configuration is that no individual oscillator drives the network on its own; coupling between neighboring oscillators is required in order to create the phase-locking property that is fundamental to grouping. The second
layer of the network is the grouping layer. We will touch on the grouping scheme at a high
level. Additional detail can be found in [20]. Recall the form of the correlogram introduced earlier. For each time frame, we can create a pooled correlogram by summing the channels across frequency. If we define the pooled correlogram as s(j, τ), where time-frame j and lag τ are the 2-D variables, then:

s(j, τ) = Σ_{i=1}^{N} A(i, j, τ). (2.32)
For each time frame, a fundamental frequency estimate from the pooled correlogram is used to classify the frequency channels. Channels are classified as either consistent with the fundamental frequency P, or inconsistent with the fundamental frequency (two categories). For a given delay τ_m corresponding to the peak magnitude of the pooled correlogram for a given channel i at time frame j, we tag the channels where:

A(i, j, τ_m) / A(i, j, 0) > θ_d. (2.33)
Here, A(i, j, 0) denotes the energy in channel i at time frame j. The resulting energy-based tagging is analogous to the time-frequency masking approach mentioned earlier. Analysis
conducted in [3] and [2] highlights the flaws in this method – largely that CASA techniques
are sensitive to the structure of the source vectors. This sensitivity highlights differences
between ICA and CASA based methods for source separation.
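The pooled-correlogram grouping of equations (2.32) and (2.33) can be sketched as follows; the data here is a hypothetical two-channel array, not output from a real filter bank:

```python
import numpy as np

def tag_channels(A, theta_d=0.85):
    """A has shape (channels, lags): A[i, tau] for one time frame.
    Pool across channels (eq. 2.32), find the dominant lag, then tag channels
    whose normalized response at that lag exceeds theta_d (eq. 2.33)."""
    pooled = A.sum(axis=0)                  # s(j, tau)
    tau_m = int(np.argmax(pooled[1:]) + 1)  # skip lag 0 (always the maximum)
    consistent = A[:, tau_m] / A[:, 0] > theta_d
    return tau_m, consistent

# Two synthetic channels: channel 0 peaks strongly at lag 5, channel 1 is flat.
A = np.array([
    [10.0, 1.0, 1.0, 1.0, 1.0, 9.5, 1.0, 1.0],
    [10.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
])
tau_m, consistent = tag_channels(A)
# tau_m == 5; channel 0 is tagged consistent, channel 1 is not
```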
Like ICA, CASA is used for source separation, but the two are designed differently, giving rise to the potential for variability in separation performance due to dependencies on the structure of the source vectors. The CASA system separates well those signals that are localized in
time-frequency, but poorly separates signals that are broadband in nature (such as human
speech). The opposite finding is true for ICA, where the JADE ICA algorithm yielded degraded performance compared to CASA when the interferer was tonal or narrowband. This degradation is attributed to the comparatively poor higher-order statistics in these scenarios [3]. These findings indicate that ICA and CASA are in fact complementary methods, and neither is well-suited to separating arbitrary sources. Ultimately, because we are interested in broadband separation (as in the human speech mixtures encountered in the cocktail party problem), the CASA approach is not a viable source separation technique for our
application.
2.2.5 Time-Frequency Masking
Time-frequency masking, proposed by Yilmaz and Rickard in 2004 [21], solves the
Blind Source Separation problem by exploiting the assumption of separability of mixed
speech signals when transformed to the time-frequency domain. This finding is attributed
to the approximate W-disjoint orthogonality of speech [7]. It is shown that when source
vectors are W-disjoint orthogonal, we can fully recover the source signal in a mixture.
Two signals s1 and s2 are said to be W-disjoint orthogonal if for a given window
function W (t), the supports of the windowed Fourier transforms are disjoint [21]. The
windowed Fourier transform of s_j is given by:

F^W(s_j)(ω, τ) = (1/√(2π)) ∫_{−∞}^{∞} W(t − τ) s_j(t) e^{−jωt} dt. (2.34)

If we refer to the windowed Fourier transform (equation (2.34)) as ŝ_j(ω, τ), then we achieve W-disjoint orthogonality if

ŝ_1(ω, τ) ŝ_2(ω, τ) = 0, ∀(ω, τ). (2.35)
Approximate W-disjoint orthogonality allows us to assume that time-frequency bins
have little-to-no overlap, and that high-energy time-frequency bins can be assumed to be
attributed to a single speaker. Comparisons between mask-based source separators and
recent algorithms utilizing statistical model-based speech enhancement and non-negative
matrix factorization demonstrated better performance for the mask-based supervised learning techniques [6].
For each source, binary masking is used to partition the time-frequency frame of interest into a binary threshold map: bins assumed to be associated with the source are assigned a 1, and all other bins are assigned a map value of 0. Specifically, we tag time-frequency bins where the source of interest exceeds an SNR threshold α with a 1, and a 0 otherwise. The ideal binary mask (IBM) is given by:

IBM(t, f) = 1, if SNR(t, f) > α; 0, otherwise. (2.36)
Alternatively, we can consider the Ideal Ratio Mask (IRM) [6] to preserve relative power ratios instead of applying a hard threshold:

IRM(t, f) = s²(t, f) / (s²(t, f) + N²(t, f)), (2.37)

where s²(t, f) is the speech energy and N²(t, f) is the noise energy.
This screen, when applied to the time-frequency mixture and inverse transformed via the Inverse Short-Time Fourier Transform (ISTFT), returns an estimate of the elementary source vector. We see an example of this screen in Figure 2.6. The plots on the left-hand side of the figure show the spectrograms for a male and a female speech signal, respectively. After linearly combining the speech signals and taking the STFT, we obtain the mixed speech spectrogram. After applying the masking algorithm described by equation (2.36), we obtain the masked speech signal found in the bottom-right plot. The IRM has shown higher levels of intelligibility compared to the IBM, but this difference has been shown to be marginal [6]. Yilmaz and Rickard (2004) show that the ability to separate speech signals via binary masking is attributed to their approximately sparse nature: in the time-frequency domain, the majority of the meaningful speech information is contained in a comparatively small number of Gabor coefficients; the other coefficients are approximately zero. The Gabor coefficients refer to the magnitude values of the Discrete STFT of a frame of speech data. In human speech, we see this behavior because energy is concentrated at frequencies that are multiples of the fundamental frequency [4]. Approximate sparsity allows signals in a mixture to be assumed W-disjoint orthogonal [21].
Approximate W-disjoint orthogonality is shown to be a strong enough condition to yield
Figure 2.6: Time-Frequency Representation of a Speech Signal and its associated Hard Mask.
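A minimal end-to-end sketch of this mask-and-invert pipeline, using SciPy's STFT/ISTFT with two synthetic narrowband "sources" (pure tones stand in for speech here, so the setup is illustrative only and uses the clean signals as an oracle for the mask):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
s1 = np.sin(2 * np.pi * 440 * t)    # stand-in "source" 1
s2 = np.sin(2 * np.pi * 1000 * t)   # stand-in "source" 2
mix = s1 + s2

# STFT of the sources and the mixture.
_, _, S1 = stft(s1, fs, nperseg=256)
_, _, S2 = stft(s2, fs, nperseg=256)
_, _, M = stft(mix, fs, nperseg=256)

# Binary mask: keep bins where source 1 dominates, then invert with the ISTFT.
mask = (np.abs(S1) > np.abs(S2)).astype(float)
_, s1_hat = istft(mask * M, fs, nperseg=256)

s1_hat = s1_hat[: len(s1)]
corr = np.corrcoef(s1_hat, s1)[0, 1]   # close to 1 for disjoint tones
```

Because the two tones occupy disjoint frequency bins, the masked reconstruction correlates strongly with the target source; real speech only approximates this disjointness.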
Advances in neural network technology, especially in regard to deployment on low-power microcontrollers (a target device for this thesis), make the use of supervised learning techniques a clear choice for investigating the viability of real-time source separation on constrained devices. Further detail on neural networks and TensorFlow (Google's platform to facilitate neural network deployment) is provided in the subsequent sections of this chapter.
2.3
Artificial Neural Networks
In this section, we provide an overview of Artificial Neural Networks (ANNs). The focus of this section is on concepts directly applicable to the scope of this thesis. The mathematical derivation of the feedforward neural network discussed in this section is largely based on Hastie 2009 [14]. The generic feed-forward neural network topology can be seen
in Figure 2.7.
ANNs are a class of supervised and unsupervised machine learning methods that
attempt to learn a nonlinear mapping from a set of inputs, referred to as features, to a series of outputs. ANNs can be used for both regression and classification problems.
Figure 2.7: Generic Feedforward Neural Network Topology.
Mathematically, learning the weights associated with a neural network amounts to solving a gradient descent style optimization problem. Neural networks use weight vectors, called neurons, and transfer functions, called activation functions, in order to describe linear and non-linear relationships between variables. An optional biasing term, b, may also be included. A single layer computes its output y as:

y = σ(WX + b), (2.38)
where σ is the activation function, W is the weight vector, and b is the biasing term. Many activation functions have been used historically, but most neural networks employ the sigmoid function (Section 2.2.4) or the Rectified Linear Unit (ReLU) function:

ReLU := σ(x) = x⁺ = max(0, x). (2.39)
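Equations (2.38) and (2.39) amount to a few lines of NumPy. This is a sketch with arbitrary example values, not the thesis implementation:

```python
import numpy as np

def relu(x):
    """Equation (2.39): max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def dense_layer(W, X, b):
    """Equation (2.38): y = sigma(W X + b), here with a ReLU activation."""
    return relu(W @ X + b)

W = np.array([[1.0, -2.0],
              [0.5,  1.0]])
X = np.array([1.0, 2.0])
b = np.array([0.5, 0.0])

y = dense_layer(W, X, b)
# W @ X + b = [1 - 4 + 0.5, 0.5 + 2 + 0] = [-2.5, 2.5] -> ReLU -> [0, 2.5]
```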
In speech processing applications, it has been shown that the use of the rectified linear
unit activation function leads to better generalization, and more reliable training of deep
neural networks when compared to the sigmoid function [22]. This is shown to be attributed
to a favorable optimization surface that is approximately linear convex, and because the
internal representation learned by the neural network is more regularized, reducing the
likelihood of overfitting [22]. As a result, we use the rectified linear unit for all neural
network architectures.
The feedforward neural network uses a learning rule to modify the weights of the neural network by optimizing with respect to a cost function (called the loss function). The cost function can vary depending on the class of problem that we wish to solve (e.g. regression vs. classification). In general, we find that classification problems use a cross-entropy loss and regression problems use a minimum mean squared error loss. In order to adjust the weights of the neural network, we present the neural network with a multitude of examples, called training data. Once the neural network weights have been initialized, we compare the result of the neural network to the true solution. To minimize the cost function, we perform gradient descent. In the context of ANNs, this is referred to as backpropagation. We use the backpropagation example from [14]. Consider the squared error loss function:
R(θ) = Σ_{i=1}^{N} Σ_{k=1}^{K} (y_ik − f_k(x_i))². (2.40)
The gradients are given by:

∂R_i/∂β_km = −2(y_ik − f_k(x_i)) g′_k(β_k^T z_i) z_mi, (2.41)

∂R_i/∂α_ml = −Σ_{k=1}^{K} 2(y_ik − f_k(x_i)) g′_k(β_k^T z_i) β_km σ′(α_m^T x_i) x_il. (2.42)
The gradient descent update at the (r + 1)-th iteration is expressed as:

β_km^(r+1) = β_km^(r) − γ_r Σ_{i=1}^{N} ∂R_i/∂β_km^(r), (2.43)

α_ml^(r+1) = α_ml^(r) − γ_r Σ_{i=1}^{N} ∂R_i/∂α_ml^(r), (2.44)
where γ_r is the learning rate, a hyperparameter that determines the gradient descent step size. The gradients can then be written as

∂R_i/∂β_km = δ_ki z_mi, (2.45)

∂R_i/∂α_ml = s_mi x_il. (2.46)
Here, δ_ki and s_mi are the errors from the model at the output layer and hidden layer, respectively [14]. These errors satisfy the equation

s_mi = σ′(α_m^T x_i) Σ_{k=1}^{K} β_km δ_ki. (2.47)
These are known as the back-propagation equations. In the forward pass, the weights are fixed and we propagate the inputs through the network. For the backward pass, we compute the errors δ_ki and then backpropagate these errors through the network to obtain s_mi. Once we have these errors, we can compute the gradients for the current update step. A similar scheme is used for other loss functions [14].
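The two-pass procedure can be checked numerically. Below, a single-hidden-layer network with the squared-error loss of equation (2.40) is differentiated analytically (in the spirit of equations (2.45)-(2.47)) and compared against a finite-difference estimate; the network shape and all values are arbitrary, and the output units are taken as linear:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)       # input x_i
y = rng.normal(size=2)       # target y_i
A = rng.normal(size=(4, 3))  # hidden-layer weights (alpha)
B = rng.normal(size=(2, 4))  # output-layer weights (beta)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def loss(A, B):
    z = sigmoid(A @ x)             # hidden activations z_mi
    f = B @ z                      # linear output units (g = identity)
    return np.sum((y - f) ** 2)    # squared-error R_i

# Backward pass: output errors delta, then hidden errors s (eq. 2.47 analogue).
z = sigmoid(A @ x)
f = B @ z
delta = -2.0 * (y - f)                 # dR/df
s = (B.T @ delta) * z * (1.0 - z)      # backpropagated hidden error
grad_B = np.outer(delta, z)            # dR/dbeta  (eq. 2.45 analogue)
grad_A = np.outer(s, x)                # dR/dalpha (eq. 2.46 analogue)

# Finite-difference check on one weight from each layer.
h = 1e-6
B2 = B.copy(); B2[0, 0] += h
A2 = A.copy(); A2[0, 0] += h
fd_B = (loss(A, B2) - loss(A, B)) / h
fd_A = (loss(A2, B) - loss(A, B)) / h
# fd_B ~= grad_B[0, 0] and fd_A ~= grad_A[0, 0]
```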
2.3.0.1 Convolutional Neural Networks
Convolutional Neural Networks (CNNs), proposed by LeCun et al. [23], address the usage of neural networks with image-class data. Though traditional multi-layer perceptrons are capable of handling image data (by flattening the 2D image into a column vector), the number of weights that need to be calculated quickly balloons due to the large number of inputs [23]. Additionally, such networks have no built-in translation or distortion invariance, compromising generalization power [23].
For speech and image processing applications, CNNs can retain spatial and temporal correlation that is otherwise lost in fully-connected neural network topologies [23]. CNNs differ from standard feed-forward networks in that their hidden layers contain convolutional layers, in which learned filter kernels are convolved with the input. The discrete convolution operation for a 2D input is defined as:

y[m, n] = h[m, n] ∗ x[m, n] = Σ_{j=−∞}^{∞} Σ_{i=−∞}^{∞} h[i, j] x[m − i, n − j], (2.48)

where x is the input signal, h is the 2-D impulse response, and y is the output [24].
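A direct (unoptimized) implementation of equation (2.48) for finite inputs, with a small worked example:

```python
import numpy as np

def conv2d_full(x, h):
    """Equation (2.48): full 2-D discrete convolution y = h * x."""
    Mx, Nx = x.shape
    Mh, Nh = h.shape
    y = np.zeros((Mx + Mh - 1, Nx + Nh - 1))
    for m in range(y.shape[0]):
        for n in range(y.shape[1]):
            for i in range(Mh):
                for j in range(Nh):
                    # Only sum terms whose indices land inside the finite input.
                    if 0 <= m - i < Mx and 0 <= n - j < Nx:
                        y[m, n] += h[i, j] * x[m - i, n - j]
    return y

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
h = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = conv2d_full(x, h)
# y == [[1, 2, 0], [3, 5, 2], [0, 3, 4]]
```

In a CNN the entries of h are the learned filter weights; libraries replace these loops with highly optimized kernels.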
During training, we learn the weights associated with each filter using the backpropagation scheme mentioned in the previous section.
2.3.0.2 Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are another popular class of neural networks.
RNNs differ from the standard feedforward neural network in that they can incorporate
context from other states when they make predictions (see Figure 2.8). A bidirectional
neural network, for example, can use information from previous states and future states
to make predictions. The general RNN applies a recurrence formula at each time step to
estimate the state:
h_t = f_W(h_{t−1}, x_t), (2.49)

where h_t is the state estimate, f_W is a transforming function with parameters W, h_{t−1} is the previous state, and x_t is the input vector.
This state, called a hidden vector, is then used to compute an output [25]. Consider the hidden vector for a language model as an example [25]:

h_t = tanh(W_hh h_{t−1} + W_xh x_t), (2.50)

Figure 2.8: Generic Structure for a Recurrent Neural Network.

y_t = W_hy h_t. (2.51)
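A single step of the recurrence in equation (2.49) is a handful of matrix products. The sketch below assumes tanh as the transforming function with a linear readout; the dimensions and weight values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n_hidden, n_input, n_output = 4, 3, 2

W_hh = rng.normal(size=(n_hidden, n_hidden)) * 0.1
W_xh = rng.normal(size=(n_hidden, n_input)) * 0.1
W_hy = rng.normal(size=(n_output, n_hidden)) * 0.1

def rnn_step(h_prev, x_t):
    """One application of the recurrence h_t = f_W(h_{t-1}, x_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

# Run the recurrence over a short input sequence.
h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):
    h, y = rnn_step(h, x_t)
# h now carries context from all five inputs; y is the output at the final step.
```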
In general, we find that backpropagation for RNNs is similar to the case of standard feedforward networks, though various permutations, such as backpropagation computed over truncated chunks of the sequence, do exist. Variants of the RNN also exist that address states going stale over time (called Long Short Term Memory networks) [25]. We do not use recurrent networks in this thesis, opting instead for feedforward deep neural networks; in our case, the inputs take the form of frames of the STFT. These approaches are described broadly in the next section.
2.3.1 Neural Network Approaches to Blind Source Separation
The Source Separation problem has traditionally been solved using ICA. ICA solves the source separation problem by discovering a basis that describes N linearly independent source vectors. Solving the time-frequency masking problem using supervised and unsupervised learning techniques has gained traction in the machine learning community.
In these supervised learning schemes, features are extracted from the mixture and used to
create an algorithm that learns to compute the mask from the mixture [6].
Deep Learning approaches using Recurrent Neural Networks have been proposed for
source separation, and produce results consistent with the ideal binary mask, but assume
knowledge of the number of source signals a-priori (Weninger et al. (2014) and Huang et
al. (2015)).
Xie et al. (2016) proposed the Deep Embedded Clustering (DEC) algorithm [26], an
unsupervised clustering algorithm that bins data by non-linearly embedding the otherwise
high-dimensional mixed signal in a low-dimensional latent space Z. The DEC algorithm then uses a Deep Neural Network (DNN) to estimate the parameters of the transformation function R that embeds the mixed signal in Z.
In Hershey et al. (2016) [27], it was shown that the DEC algorithm was extensible to the Blind Source Separation problem. Application of DEC allowed for self-discovery of three unknown speakers in an additive signal. Time-frequency masking is used for separation: a binary mask is estimated for each source determined to be in the mixture. These binary masks, when multiplied by the frame's spectrogram and passed through the ISTFT, produce the N source vectors that comprised the mixture. A Bidirectional Recurrent Neural Network (BRNN) is used to assign an embedding to each bin of the spectrogram.
In Kolbaek et al. 2017 [28] and Fan et al. (2018) [29], an end-to-end training framework for DEC-enabled Source Separation, called utterance-level permutation invariant training (uPIT), is proposed. There are two estimation objectives in this framework: Firstly,
we wish to blindly (i.e. without a-priori knowledge) determine the number of unique
sources present in the mixed signal. In addition, we seek the spectral mask estimate for
each identified source. The uPIT framework allows the estimation of both parameters to
be made jointly. As noted earlier, RNN-based deep learning approaches produce results consistent with the ideal binary mask but assume a-priori knowledge of the number of source signals [30] [31].
In Hershey et al. 2016 [27], Deep Embedded Clustering and Binary Masking are applied to the Blind Source Separation problem. Deep Embedded Clustering is shown to accurately cluster sources in an unknown, mixed audio signal. Kolbaek et al. 2017 [28] and Chen et al. 2017 [32] propose Permutation-Invariant neural networks to address potential pitfalls in the method proposed in Hershey et al. 2016 [27], namely the fact that the Deep Clustering algorithm is not permutation-invariant. This lack of permutation-invariance arises because the algorithm's objective function is defined with respect to the mapped sources in the low-dimensional latent space, rather than the unmapped sources in the initial basis. Consequently, the order in which bins are processed will affect how the sources are assigned. Both works use the Permutation-Invariant Training (PIT) framework to allow for end-to-end mapping, but there
are slight differences in how permutation-invariance is achieved. Kolbaek et al 2017 [28]
achieve permutation invariance by implementing a Bidirectional Recurrent Neural Network
(BRNN). Chen et al 2017 [32] implement a novel Neural Network architecture, called an
Attractor Neural Network. Once the time-frequency data has been transformed to the low-dimensional subspace, reference points are defined for each source, and the time-frequency bins belonging to that source become attracted to it.
2.4
TensorFlow
TensorFlow is a framework developed by Google for solving large-scale machine learning problems. TensorFlow is based on the notion of a computational graph, where all operations are expressed as a set of nodes within the graph. This graph describes the execution flow of a processing pipeline. The computational graph seen in Figure 2.9 represents the pipeline for a feedforward neural network. TensorFlow can handle arbitrarily sized tensors so long as operations on the defined tensors are mathematically valid. Tensors may hold different data types, including string, float, and int, and calculations can be performed at varying levels of precision, including single and double precision. Operations in TensorFlow are encapsulated in blocks called kernels. Kernels are written for execution on target hardware such as CPUs or GPUs [8].
Figure 2.9: An example of a Computational Graph.

The use of TensorFlow as a data processing pipeline was mandatory because of the size of the dataset used in this thesis. TensorFlow supports a queuing system while training: data is pushed into device memory as it becomes available, and popped from device memory when processing is complete. This allowed for the creation of a batch-processing pipeline for the audio data of interest. This batch processing pipeline is described in detail in Chapter 3.
2.4.1 TensorFlow Lite for Microcontrollers
TensorFlow Lite for Microcontrollers (TFMicro) is an interpreter that is designed for deploying TensorFlow models to embedded systems. The basic operating principle for TFMicro is that we wish to port a model that we have trained and developed on the full TensorFlow client with minimal changes. TensorFlow Lite is used to convert a trained model into a compact serialized data structure; this data structure is recognized by the TFMicro interpreter and is used to execute the model on the target device [9].
In order to fully port a model, all of the kernels must have an equivalent kernel in
TFMicro. Because TFMicro is still early in development, a sizable number of TensorFlow
kernels are not yet implemented [9].
The vast majority of TensorFlow kernels use floating point operations. This does not cause problems when training TensorFlow models on high-performance servers. When running on resource-constrained devices, where memory ceilings can be on the order of kilobytes, avoiding floating point operations becomes a necessity. TFMicro offers specialized support for certain platforms in the form of highly optimized processing kernels. The ARM CMSIS-NN library, a highly optimized neural network library for ARM Cortex-M processors, is utilized. The use of these accelerated kernels yields substantial speed-ups compared to the reference kernels [9]. Because the Arduino Nano 33 BLE Sense and i.MX-RT600 microcontrollers both leverage ARM Cortex-M processors, TFMicro will automatically utilize these optimized kernels on deployed neural networks [9]. The TFMicro interpreter is heavily leveraged in order to deploy neural architectures onto the target devices.
2.5
Real-Time Signal Processing Considerations on Embedded Devices
A processing constraint that we must address in this thesis, given the need to perform BSS processing on an embedded platform, is the large reduction in compute capability. Compared to mobile platforms, an embedded device has at least 100-1000x less compute power [9]. This reduction in compute power, especially in the context of real-time computing, means that the problem size at any given instant needs to be greatly reduced to accommodate the platform.
For real-time processing applications, we utilize a frame-based processing scheme to handle data in a packet-like format. The binary masking technique employed in this thesis passes an STFT frame into a feed-forward neural network for mask inference. Note that the STFT output is a matrix,

X(f) = [X_1(f) X_2(f) X_3(f) ... X_k(f)], (2.52)

where the m-th element of the matrix is given by:

X_m(f) = Σ_{n=−∞}^{∞} x(n) g(n − mR) e^{−j2πfn}, (2.53)

where g(n) is the window function of length M, X_m(f) is the Discrete Fourier Transform (DFT) of the windowed data centered about time mR, and R is the hop size between successive DFTs [33].
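The frame geometry determines both the tensor size and the real-time deadline, and can be tabulated before any training. The parameter values below are illustrative assumptions, not the thesis's final choices:

```python
# Frame geometry for a streaming STFT (illustrative parameter choices).
fs = 16000        # sampling rate (Hz)
nfft = 256        # FFT size per window
hop = 128         # hop size R between successive DFTs

bins_per_frame = nfft // 2 + 1        # one-sided spectrum: 129 bins
frames_per_second = fs / hop          # 125 frames arrive each second
frame_period_ms = 1000.0 * hop / fs   # 8 ms: the real-time deadline per frame

# To be "real-time", mask inference on one frame must finish in < frame_period_ms.
print(bins_per_frame, frames_per_second, frame_period_ms)  # 129 125.0 8.0
```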
In the real-time context, we cannot perform this summation over all time. Instead, we evaluate the STFT on a fixed-size packet. The size of the tensor for a packet is dependent on the sampling rate, the hop size, and the FFT size used in each window. We can optimize these parameters against the quality needed for intelligible speech in the reconstruction stage. For example, decreasing the sampling rate reduces the amount of data that we feed through our neural network at any one time, reducing the latency and increasing the throughput of the model. Because the ideal binary mask can be easily constructed, we can evaluate these trade-offs before the neural network is trained. To be deemed real-time, the neural network must be capable of processing STFT frames faster than it can receive them. This is discussed in more detail in
TECHNICAL APPROACH
This chapter provides a description of the technical approach used to address the topic of real-time BSS on a low-power microcontroller. We outline the initial audio dataset, describe the TensorFlow audio processing pipeline that was used to create the mixture speech dataset needed for binary mask estimation, describe the process used to select suitable STFT parameters for neural network input, detail the hardware used, detail the neural network architectures evaluated, detail the compression techniques used to shrink the size of the networks, and address the methods used to evaluate latency and power usage.
The Google Colaboratory Cloud environment is used for neural network development and training. This environment was chosen because of access to high-end GPUs that could be used to accelerate training. The TensorFlow audio processing pipeline, which is
Table 3.1: Google Colaboratory Specifications.
Number of CPUs 2
CPU Model Name Intel(R) Xeon(R) CPU @ 2.00GHz
Cache Size (KB) 39424
CPU Speed (MHz) 2000
Number of GPUs 1
GPU Model Name NVIDIA Tesla V100-SXM2
GPU VRAM 16 GB
3.1
Audio Dataset
Two datasets were considered for this research: the LibriSpeech ASR corpus and the Mozilla Common Voice dataset.
The Mozilla Common Voice dataset, a crowdsourced dataset containing approximately 1400 hours of speech in multiple languages, was considered, but was deemed inappropriate for neural network training. One of the core assumptions made when developing the neural network model is that all two-speaker mixtures are derived from different speakers and that male and female speakers can be identified a-priori. Because the Mozilla Common Voice dataset is crowdsourced, all data, including the audio files themselves and metadata regarding the speech (e.g. age, gender, accent, etc.), are manually input by users. A rating system is used to gauge the accuracy of labels in the dataset. Assessment of the audio data revealed many instances where a speaker's gender is misclassified. Additionally, there were many instances where the same speaker recorded multiple lines of dialogue. Because speakers are not assigned a unique ID, the only way to reconcile such an artifact would be to listen to all 1400 hours of audio manually.
Instead, this research leverages the LibriSpeech ASR corpus. The LibriSpeech ASR corpus contains 1000 hours of English speech sampled at 16 kHz. Though the LibriSpeech ASR corpus contains fewer examples, the speech has already been cleaned and is already labeled, segmented, and aligned. In this thesis, the LibriSpeech ASR corpus is used to create the mixture dataset that is then used to train the designed neural networks.
Each LibriSpeech ASR corpus example contains a few seconds of speech for a given
male or female speaker. All examples are stored in separate Free Lossless Audio Codec
(FLAC) audio files. FLAC is a lossless audio compression format developed by Xiph [34].
3.2
TensorFlow Audio Processing Pipeline
TensorFlow is used to design the audio processing pipeline used in this work. The goal of our neural network is to learn a binary mask estimate from a spectral frame. Because the LibriSpeech dataset contains single-speaker examples, all contained in separate data files of varying lengths, a TensorFlow audio processing pipeline was developed to make the audio data uniform to support input into a neural network.
Given the large number of audio files associated with the LibriSpeech dataset, it was
not reasonable to perform neural network training by fitting all of the data in memory (12
GB of RAM on the Google Colab server used for training). To get around the memory
overheads, we leveraged the TFRecord binary file format to store audio examples.
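As a sketch of the approach (not the thesis's actual pipeline code), a waveform can be packed into a `tf.train.Example` and written to a TFRecord file; the feature names and label here are hypothetical:

```python
import tempfile

import numpy as np
import tensorflow as tf

# A hypothetical audio clip: one second of 16 kHz float32 samples.
audio = np.random.default_rng(0).normal(size=16000).astype(np.float32)

def make_example(audio, label):
    """Pack raw sample bytes and an integer label into a tf.train.Example."""
    return tf.train.Example(features=tf.train.Features(feature={
        "audio": tf.train.Feature(bytes_list=tf.train.BytesList(value=[audio.tobytes()])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

path = tempfile.mktemp(suffix=".tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(make_example(audio, label=1).SerializeToString())

# Records can later be streamed batch-by-batch with tf.data.TFRecordDataset(path),
# so the full dataset never needs to fit in memory at once.
```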
The TFRecord format is a proprietary data format used by TensorFlow to facilitate distributed training. As mentioned earlier, the dataset that we used is derived