
Voice Activity Detection and Noise Estimation for Teleconference Phones

Björn Eliasson

June 20, 2015

Master's Thesis, 30 Credits
Department of Mathematics and Mathematical Statistics

Copyright © Björn Eliasson. All rights reserved.

VOICE ACTIVITY DETECTION AND NOISE ESTIMATION FOR TELECONFERENCE PHONES

Submitted in partial fulfillment of the requirements for the degree Master of Science in Industrial Engineering and Management

Department of Mathematics and Mathematical Statistics
Umeå University
SE-901 87 Umeå, Sweden

Supervisors:
Jun Yu, Umeå University
Nils Östlund, Konftel AB

Examiner:
Patrik Ryden, Umeå University


Abstract

When communicating via a teleconference phone, the transmitted signal (speech) needs to be crystal clear so that all participants experience good communication. However, many environmental conditions contaminate the signal with background noise, i.e. sounds not of interest for communication purposes, which impedes communication. Noise can be removed from the signal if it is known, and this work has therefore evaluated different ways of estimating the characteristics of the background noise. The focus was on using speech detection to define the noise, i.e. the non-speech part of the signal, but methods that do not rely solely on speech detection and instead exploit characteristics of the noisy speech signal were also included. The implemented techniques were compared and evaluated against the current solution used by the teleconference phone in two ways: first for their speech detection ability, and second for their ability to correctly estimate the noise characteristics. The evaluation was based on simulations of the methods' performance in various noise conditions, ranging from harsh to mild environments. The proposed method, as implemented in this study, improved on the existing solution in terms of speech detection ability, and its noise estimate showed improvement in certain conditions. It was also concluded that the proposed method would provide two sources of noise estimation instead of the current single source, and it was suggested to investigate how utilizing two noise estimators could affect performance.

Keywords: Voice Activity Detection (VAD), noise estimation, continuous noise estimation (CNE), statistical model-based VAD, improved minima-controlled recursive average (IMCRA), Rangachari noise estimation (RNE or MCRA-2), likelihood ratio approach, signal-to-noise ratio dependent recursive average, teleconferencing

Summary

When communicating via a conference phone, the transmitted signal (speech) must be sufficiently clear for all parties to experience good communication. In practice there are many environmental factors that contaminate the signal with background noise, i.e. sounds that are not of interest from a communication perspective, and that impede communication through interfering sounds. Background noise can be reduced from the transmitted signal if its characteristics are known, and different methods for estimating the characteristics of the background noise have therefore been evaluated. The focus was on using speech detection to define the noise, i.e. the signal without speech, but methods that exploit the characteristics of the noisy signal were also included. The implemented methods were compared with and evaluated against the current noise estimation solution in two ways: first for the ability to correctly detect speech, and second for the ability to correctly characterize the noise.

The evaluation was based on a simulation study of the methods in several noise environments, spanning mild to very harsh conditions. The proposed method showed an improvement compared to the existing solution, as implemented in this study, for speech detection, and for the noise estimate in certain conditions. Furthermore, the proposed method provides two sources for noise estimation, in contrast to the current solution's single source. It was proposed as further work to study how these two sources could be combined.


Acknowledgements

I would like to thank my extraordinary supervisor Professor Jun Yu at Umeå University for all the time spent helping me, be it encouraging words, lending of expertise or a thorough report review. Moreover, I direct a special thanks to my Konftel supervisor Dr. Nils Östlund for providing insight and keeping me on track throughout the project. Also, thank you to the members of project group Frost for letting me partake in the daily work at Konftel; it has been a great learning experience. Lastly, I would like to express my sincerest thanks to everyone at Konftel for being so incredibly nice to me this semester. It made me feel right at home!


Contents

1 Introduction
  1.1 Background
    1.1.1 Noise and Speech
    1.1.2 Noise Estimation
  1.2 Aim
  1.3 Scope and Limitations
  1.4 Outline

2 Theory
  2.1 Acoustic Theory
    2.1.1 Sinusoidal and Complex Waves
    2.1.2 Decibel
  2.2 Spectral Analysis
    2.2.1 Representing an Analogue Signal
    2.2.2 Discrete-Time Fourier Transform
    2.2.3 Discrete Fourier Transform
    2.2.4 Power Spectrum
    2.2.5 Frame Processing
  2.3 Additional Theory
    2.3.1 Loss, Risk and Bayes Risk Function
    2.3.2 Likelihood Ratio Test
  2.4 Markov Models
    2.4.1 Maximum Likelihood for the HMM
    2.4.2 Forward-Backward Algorithm

3 Method
  3.1 VAD Methods
    3.1.1 Aurora
    3.1.2 ETSI
    3.1.3 Statistical Model-Based VAD
  3.2 CNE Methods
    3.2.1 Likelihood Ratio Approach
    3.2.2 Improved Minima-Controlled Recursive Averaging
    3.2.3 Rangachari Noise Estimation
    3.2.4 SNR Dependent Recursive Averaging
  3.3 Methods of Comparison
    3.3.1 Comparison of VAD Methods
    3.3.2 Comparison of Noise Estimation
  3.4 Implementation
    3.4.1 Test Signals
    3.4.2 Implementation of VAD Comparison
    3.4.3 Implementation of Noise Estimation Comparison

4 Results
  4.1 Voice Activity Detection
    4.1.1 Cafeteria Noise
    4.1.2 Street Noise
    4.1.3 White Noise
  4.2 Noise Estimation
    4.2.1 Cafeteria Noise
    4.2.2 Street Noise
    4.2.3 White Noise
  4.3 Comfort Noise

5 Discussion and Conclusion
  5.1 VAD Evaluation
    5.1.1 Statistical Model-Based VAD
    5.1.2 ETSI
    5.1.3 Aurora
  5.2 Noise Estimation Evaluation
    5.2.1 Continuous Noise Estimation
    5.2.2 ETSI
    5.2.3 Aurora
  5.3 VAD vs. CNE
  5.4 Critique
    5.4.1 Evaluation Format
    5.4.2 Model Assumptions
  5.5 Conclusion
  5.6 Further Studies


Abbreviations

CNE      Continuous Noise Estimation
DD       Decision-Directed
DFT      Discrete Fourier Transform
DTFT     Discrete-Time Fourier Transform
ETSI     European Telecommunications Standards Institute
FFT      Fast Fourier Transform
HMM      Hidden Markov Model
IMCRA    Improved Minima-Controlled Recursive Average
LRA      Likelihood Ratio Approach
LRT      Likelihood Ratio Test
MedSE    Median Squared Error
MMSE     Minimum Mean Square Error
MSE      Mean Squared Error
PSD      Power Spectral Density
ROC      Receiver Operating Characteristics
SNR      Signal-to-Noise Ratio
SNRDRA   Signal-to-Noise-Ratio Dependent Recursive Average
SMVAD    Statistical Model-Based Voice Activity Detector
VAD      Voice Activity Detector


Chapter 1

Introduction

This chapter introduces the background together with the aim, scope and limitations for this work.

1.1 Background

Being a global contender in the teleconference solution scene is no easy feat and requires good products. An example of a product offered by Konftel can be seen in Figure 1.1. In a teleconference setting good products equal high quality audio, which is imperative for the users to be able to communicate properly and with ease. One necessity in creating good audio is a good background noise estimate.

An estimate of the background noise, or simply the noise, is important for many aspects of generating high quality audio. It is needed for reducing the noise transmitted to the other participants in the teleconference, and it plays a part in cancelling the echo created when the microphone picks up the loudspeaker signal. One way to estimate the noise is via a voice activity detector (VAD), which, as the name suggests, tries to detect the presence of speech. Knowing when someone is speaking is a tool to estimate the noise, i.e. the sound of non-speech. Other methods rely on key characteristics of speech to update the noise estimate; these methods are herein called continuous noise estimators (CNE). The basics of both strategies are explained in the following subsections. Not only must an efficient noise estimator represent the background noise well, it must also do so in real-time with minimum delay. The implementation and evaluation of noise estimation methods is the subject of this work.

1.1.1 Noise and Speech

As this work is concerned with estimating the background noise, it is reasonable to explain what noise is, even though it may appear trivial. Noise is everywhere around us and constitutes what we perceive as sounds not of interest, which in the case of teleconferencing translates to all sounds that do not originate from a participant speaking. In essence, this means that every signal can be decomposed into two parts: the speech signal and the noise signal. Together they form the noisy speech signal. As an example, in a quiet room one's perception is often that there is complete silence, but this is not true, as there is always a noise background as long as you are in a medium where sound can exist. The noise background is more noticeable in environments like a busy street, the office or a restaurant, because there the sounds coming towards you are more prominent.

Figure 1.1: Teleconference phone Konftel 300IP.

There are two types of noise: stationary and non-stationary. For stationary noise the noise characteristics do not change over time (e.g. a fan), and for non-stationary noise they do (e.g. an accelerating car or people talking next door). Intuitively it would seem that stationary noise is easier to estimate than its counterpart. Not surprisingly, this is very much the case: a method estimating ever-changing noise characteristics must adapt to changes constantly, whereas stationary noise only needs to be characterized once.

There are two main groups of speech, voiced and unvoiced. When the vocal folds are tensed and air is pushed through, the resulting vibration produces voiced sounds such as vowels. Unvoiced sounds are produced when the vocal folds do not vibrate but tense up and come closer together, allowing the air stream to become turbulent. 'H' in house is an example of an unvoiced sound. 'S' and 't' are also unvoiced sounds, produced when the tongue and lips impose limitations on the vocal tract. Not surprisingly, different types of spoken sounds are more or less easy to detect against a noise background, as they more or less resemble noise [17].


1.1.2 Noise Estimation

There are two main strategies for estimating the noise: one relies on a VAD decision, and the other is a CNE scheme that utilizes some key characteristic of the noisy speech signal [17]. Before the basics of these methods are discussed, it is worth understanding why a noise estimate is needed. The importance of a good noise estimation procedure is made clear through its use in speech enhancement, including noise suppression.

Noise Suppression and Other Uses

The importance of noise suppression is explained in [13] with the teleconference setting as an example. In a conference call the background noise from each participant is picked up and additively combined at the network bridge. This means that each loudspeaker will reproduce the combined sum of the background noises from the other participants. As the number of participants increases, the combined background noise will overpower the desired signal, making communication impossible.

This makes it clear that the noise must be attenuated without affecting the speech, which is much of the issue when dealing with noise suppression. A noise suppression system removes the estimated noise from the noisy speech signal, and the resulting signal will hopefully contain speech only. This problem of removing the estimated noise from the noisy signal reveals a difference between over- and underestimating the noise. Overestimation can cause speech distortion, as too much of the noisy speech signal is removed, including speech. Underestimation can leave background noise present in the noise-suppressed signal. In practice this is not exactly how it works, but it gives an idea of the difference between over- and underestimating the noise.

While noise suppression is a very important aspect, and the main focus of this work, it is not the only use of the noise estimate. In communication devices it is common to use comfort noise, which is simply an estimated noise background transmitted to the far-end user as an assurance that the connection is still working. Comfort noise may also be used to cover some residual echo, making it less audible. The methods for generating the transmitted noise range from using static colored noise (noise with more power in some part of the frequency range) to adaptive schemes trying to emulate changes in the noise process [12], and it is for the latter methods that an up-to-date noise estimate is needed. A noise estimate is also needed for echo cancellation, i.e. removing the echo created when the teleconference phone's microphone picks up its own loudspeaker signal, as well as for double-talk detection, which tries to detect when two parties of the teleconference speak simultaneously.


Voice Activity Detection

The idea behind a VAD is simple and can be summarized in three steps. First, a signal feature is extracted. Second, a decision rule is employed, deciding whether these features are those of speech or of noise. Third, it is common to apply some kind of decision alteration, which is usually more empirical in nature and tuned to specific needs [22]. When a VAD is used for noise estimation it estimates the noise during noise periods only, i.e. when no speech is detected it treats the entire incoming signal as noise.

Several different signal features and decision rules have been suggested for speech detection throughout the years, as there is no single all-defining feature that captures all the complexity of speech. In an overview by Ramírez [22] the most common features used for a VAD are explained. Tracking the energy of the signal is a useful and intuitively simple method, as it can be assumed that speech contains more energy than noise. Here the presence of speech is assumed when the signal energy is greater than some threshold. These energy-based VADs are used in both the time and frequency domains (see section 2.1 and section 2.2). As an addition to the energy-based thresholding scheme, some methods use frequency analysis tools based on tracking the minimum and maximum energies of the low and high frequencies. Assuming an initial noise period, the energy envelope can be tracked and compared to incoming energy values using a simple difference measure, which in turn is used in the VAD decision logic. Other methods assume that there are inherent differences between speech and noise in terms of the periodicity, frequency or pitch of the signal (see section 2.1). Pitch is a non-linear function of frequency [9].
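The energy-thresholding idea can be sketched in a few lines. This is a minimal illustration only, not the Aurora, ETSI or statistical VADs evaluated in this work; the 160-sample frame (20 ms at 8 kHz) and the -30 dB threshold are illustrative choices.

```python
import math

def frame_energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Mark each frame as speech (True) or noise (False) by its energy.

    Frames whose mean power, in dB relative to full scale (amplitudes
    assumed in [-1, 1]), exceeds the threshold are taken to contain
    speech; all other frames could be fed to a noise estimator.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        level_db = 10.0 * math.log10(power + 1e-12)  # avoid log(0)
        decisions.append(level_db > threshold_db)
    return decisions
```

A loud tone frame is flagged as speech while a near-silent frame is not; a real VAD would add hangover logic and adaptive thresholds, i.e. the empirical third step mentioned above.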

While these VADs are heuristic in nature, other researchers have focused on developing statistical models for speech detection. In [26] a VAD was introduced that modeled the speech and noise as independent Gaussian processes. This method was later improved by adding contextual information to the decision rule; however, this meant the method was no longer causal [23]. Other methods use statistical models based on Laplacian or Gamma distributions [3].

Continuous Noise Estimation

The idea of the CNE schemes is to utilize some key characteristics of speech and the noisy speech signal to constantly update the noise estimate regardless of speech presence. There are three main classes of CNE algorithms, the time-recursive averaging algorithms, the minimum tracking algorithms and the histogram-based algorithms, each based on one or more of three key characteristics of speech.

The first characteristic is that "silent" portions of speech do not occur only in noise periods when looking at a frequency band (a subset of all frequencies, e.g. all frequencies between 0 and 8000 Hz). As an example, a low-frequency vowel will affect the lower part of the frequencies, enabling estimation of the noise in the higher frequencies. In short, it is possible to update the noise for every frequency not containing speech; the time-recursive averaging algorithms exploit this. The second characteristic is that the noisy speech signal often decays to the power of the noise, so remembering this lowest energy level is a tool to estimate the background noise. This is the idea behind the minimum tracking algorithms. The third characteristic is that a histogram of the energy values for every frequency reveals the most common energy level, taken to be the noise energy level. This assumption leads to the histogram-based noise estimation schemes [17]. Figure 1.2 shows a visualization of the three mentioned characteristics.

Figure 1.2: Visualization of the three key characteristics of speech and the noisy speech signal utilized by the CNE. In (a) speech is shown to exist only in the lower part of the frequency band (assuming speech has more energy than noise), allowing for estimation of the noise in the upper frequencies, where the energy level is low and thus assumed to be noise only. (b) shows the idea behind the minimum statistics algorithms: the power of the speech signal (red) decays to the power of the noise signal (blue) between utterances. In (c) the histogram of the logarithmic energy level of the noisy speech signal shows the most common energy level, taken to be the background noise power level.

The reasoning behind a constant update is to have a more recent noise estimate than a VAD-based estimate can offer when long speech segments are present (or assumed to be present). This is especially important in the case of non-stationary noise (e.g. noise from inside a cafeteria), where a VAD might indicate a long speech period while the background noise changes, making the noise estimate outdated. The VAD might even be performing well, i.e. it correctly classifies the long period as speech, but this does not change the fact that the noise estimate has not been updated for a long period. In the case of stationary noise a continuous update scheme loses its purpose, as a few noise estimates would be sufficient to fully characterize the noise. Salas [25] presents a good overview of the different types of CNE schemes from all three groups.
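As a concrete illustration of the minimum-tracking characteristic, the following sketch recursively smooths the noisy-speech power of a single frequency bin and takes the minimum over a short sliding window as the noise estimate. It is a toy version of the idea, not the IMCRA or other CNE algorithms evaluated later; alpha and window are illustrative parameters.

```python
def minimum_tracking(frame_powers, alpha=0.9, window=8):
    """Noise-floor estimate for one frequency bin over time.

    Recursively smooth the noisy-speech power, then take the minimum
    of the smoothed values inside a sliding window as the noise
    estimate: between utterances the power decays to the noise floor,
    so the windowed minimum keeps tracking the noise even through
    short speech bursts.
    """
    smoothed = []
    noise_estimates = []
    s = frame_powers[0]
    for p in frame_powers:
        s = alpha * s + (1.0 - alpha) * p  # recursive averaging
        smoothed.append(s)
        noise_estimates.append(min(smoothed[-window:]))
    return noise_estimates
```

Fed a power track with a short loud burst in the middle, the estimate stays near the noise floor during the burst instead of jumping with it, which is exactly the behavior a VAD-based estimator lacks during long detected-speech periods.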

Difficulties in Estimating the Noise

The biggest difficulty facing anyone trying to characterize the noise is the fact that in practice only the noisy speech signal picked up by the microphone is available for analysis. In other words the problem of noise estimation boils down to separating what is the speech signal and what is the noise signal with only the information available in the noisy speech signal. As you are trying to separate these two signals it is inevitable that the speech signal will affect the noise signal estimate and vice versa. The goal is to minimize these effects to allow for the most accurate estimate of each separate signal.

When using a VAD-based noise estimation scheme, this problem occurs when the incoming signal is classified as containing noise only while in reality it also contains speech. Speech components will then be incorporated into the noise estimate, as the entire signal is treated as noise. To minimize these effects it is common to mark the signal as noise only when it is very likely to be noise only; this, of course, increases the chance of background noise being wrongly classified as speech. The CNE does not use the binary speech decision, so the problem of the two signals affecting each other's estimates takes another form, but it is still present.

Current Noise Estimation Solution

The background noise is currently estimated by the use of a VAD, as part of the Aurora Audio Algorithm employed in the teleconference phones, henceforth known simply as Aurora. As previously mentioned, relying on a VAD to characterize the background noise introduces a few possible kinks. The noise estimate will only be updated during periods marked as containing noise only, and this becomes a problem for Aurora in two ways. Firstly, the Aurora algorithm is very sensitive to non-stationary noise, meaning that in these noise conditions the VAD will be tricked into believing that speech is present in the signal. This ultimately leads to the noise estimate becoming outdated when long segments of the signal are falsely deemed to contain speech. Secondly, the current method depends on signal strength. With its current design it will interpret strong signals as speech, even when they are merely strong noise, which yields the same result as before: falsely marked long speech segments make the noise estimate outdated. A real example of signal strength being an issue is the practice of using strong white-noise generators in office landscapes to create a more preferable noise background. These generators can emit very loud sounds to counter the original background noise, and this strong signal will be interpreted as speech by the Aurora VAD. Both of these issues can be attributed to the empirical nature of the Aurora implementation. An empirically tuned VAD method is hard to predict in various conditions, and it is therefore important that the proposed method is not so empirical as to introduce new unforeseen problems. This is of course hard to verify, but can be taken into account by choosing a tried and tested method.

Possible Solutions

There are two possible solutions to these problems. The first is simply to improve the VAD's speech detection ability, making it better at distinguishing between noise and speech. The second is to implement a CNE that does not rely on the binary decision of a VAD to update the noise estimate and is therefore possibly better at dealing with non-stationary noise.

1.2 Aim

To stay ahead in a competitive environment, product improvements are crucial. As the work of developing a new generation of teleconference phones is under way, the idea was to research alternatives to the current noise estimator. In order to improve upon the current noise estimator, both solution alternatives discussed above will be evaluated.

The aim of this work was twofold. The first part was to implement and evaluate different VAD methods and compare these to the current VAD implemented in the teleconference phones in terms of speech detection ability. The second part was the implementation and evaluation of different methods for noise estimation only, i.e. trying to correctly characterize the noise.

1.3 Scope and Limitations

This work is limited to three VAD methods and four CNE methods, implemented and analysed. The performance of the chosen methods will be evaluated in various conditions associated with the teleconference setting and compared to the existing solution and to each other. These settings include different noise strengths combined with various noise environments. The speech and noise signals used for the comparison are sound files recommended by the International Telecommunication Union for use as test signals; the sound files used are a subset of those discussed in the ITU-T P.501 standard [14]. The evaluation process is limited to simulations using the aforementioned sound files; no real-time implementation in a conference telephone will be done.

There are physical as well as computational limitations to the platform in which the noise estimator is to be implemented, and the proposed estimation schemes must take these limitations into consideration. This means that any proposed method must be able to handle a real-time implementation with minimum delay so as not to interfere with the communication, and it cannot be too computationally complex. Another important requirement is that the method must be causal. This restriction is imposed on the system so that processing the signal does not cause too much delay.

1.4 Outline

This work is structured as follows: Chapter 2 introduces the relevant theory about acoustics, digital spectral analysis, loss and risk functions, the likelihood ratio test and hidden Markov models. Readers familiar with these subjects may skim the chapter to become familiar with the notation used in this work.

In Chapter 3 the chosen methods for VAD and CNE are introduced, along with the performance evaluation format and implementation specifics. Chapter 4 presents the results, and in Chapter 5 the results are discussed, conclusions are drawn and further work is proposed.


Chapter 2

Theory

This chapter gives a short introduction to acoustic theory, spectral analysis, loss and risk functions, the likelihood ratio test and hidden Markov models. Readers familiar with these subjects may skip or skim parts of the chapter, but it is recommended to become familiar with the notation used.

2.1 Acoustic Theory

Sound can be defined in two ways, either as a physical wave propagating through any elastic medium, or as the excitation of our hearing mechanism resulting in the psychophysical perception of sound [9].

2.1.1 Sinusoidal and Complex Waves

The wave form that often describes sound, and various other kinds of signals, is the sinusoidal wave. To define a periodic sinusoidal wave, the signal amplitude, frequency and phase are needed. The amplitude is the maximum absolute value of the signal. The frequency is the number of complete periods per second, where one period is the time between two wave peaks, and is measured in Hertz (Hz). The phase is a shift along the time axis and indicates where the first zero crossing occurs. The sinusoidal wave can be expressed as

x(t) = A · cos(ωt + φ). (2.1)

where t is a time unit (a signal expressed like this is called a time-domain signal, as it is a function of time), A is the signal amplitude, φ is the phase in radians and ω is the angular frequency in radians per time unit. ω can be used to derive the period of the wave as T = 2π/|ω|. The wavelength is the distance the wave travels in one period [21]. In the case of sound the sinusoidal wave represents the degree of displacement (compression and rarefaction) of air particles in relation to the prevailing atmospheric pressure.


Figure 2.1: Sine wave in acoustical application

Figure 2.1 shows the basics of the sinusoidal wave in acoustical application. The simple sinusoidal wave doesn’t seem to be of very much use for representing the complex wave of speech, since the wave shapes of speech look drastically different from the simple sinusoidal wave. However, no matter what shape the wave is it can be reduced to sinusoidal wave components as long as it is periodic. This means that any periodic complex wave can be synthesized using sinusoidal waves of different amplitudes, frequencies and phases [9].
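The claim that any periodic complex wave can be synthesized from sinusoids can be illustrated with the square wave, whose Fourier series contains only odd harmonics with amplitudes 4/(πk). The following sketch sums a finite number of those harmonics:

```python
import math

def square_wave_partial_sum(t, n_terms=25, freq=1.0):
    """Partial Fourier synthesis of a unit-amplitude square wave.

    Summing odd harmonics k = 1, 3, 5, ... with amplitudes 4/(pi*k)
    approaches a square wave, illustrating that a complex periodic
    wave is a sum of simple sinusoids of different amplitudes and
    frequencies.
    """
    total = 0.0
    for k in range(1, 2 * n_terms, 2):  # odd harmonics 1, 3, 5, ...
        total += (4.0 / (math.pi * k)) * math.sin(2.0 * math.pi * k * freq * t)
    return total
```

With 25 harmonics the synthesized wave is already within a few percent of ±1 over most of the period, ringing near the jumps aside.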

Expressing the sinusoidal signals in terms of complex exponentials makes available some useful tools for analyzing the signal, e.g. the Fourier decomposition of signals discussed in the next section. By Euler's formula the sinusoidal wave relates to the complex exponential by

e^{iωt} = cos(ωt) + i sin(ωt),

where i denotes the imaginary unit. Hence the sinusoidal wave from equation (2.1) can be expressed in terms of complex exponentials as [21]

A · cos(ωt + φ) = (A/2) e^{iφ} e^{iωt} + (A/2) e^{−iφ} e^{−iωt}.

2.1.2 Decibel

Any power level of magnitude W1 can be expressed in terms of a reference power W2 as

L1 = 10 · log10(W1/W2) decibels.

Magnitudes other than acoustic power can be expressed in dB. For example, acoustic power is proportional to the squared acoustic pressure p, hence the pressure level is

Lp = 20 · log10(p1/p2) decibels. (2.2)


These two equations define two useful relationships between levels. As sound pressure is a common parameter to measure in acoustics, equation (2.2) is often used [9]. The decibel is important to the concept of signal-to-noise ratio (SNR), which by definition is the ratio of the signal power to the noise power, usually measured in dB.
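The two level definitions above can be written directly as code; the function names are illustrative:

```python
import math

def power_level_db(w1, w2):
    """Level of power w1 relative to reference power w2, in dB."""
    return 10.0 * math.log10(w1 / w2)

def pressure_level_db(p1, p2):
    """Level of pressure p1 relative to reference p2 (equation 2.2).
    The factor 20 arises because power is proportional to squared
    pressure, and log10(p^2) = 2 log10(p)."""
    return 20.0 * math.log10(p1 / p2)
```

Doubling the pressure raises the level by about 6 dB, while doubling the power raises it by about 3 dB; an SNR in dB is simply `power_level_db(signal_power, noise_power)`.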

2.2 Spectral Analysis

There are several advantages to moving into the frequency, or spectral, domain when analysing signals. In the spectral domain the signal is expressed as a function of frequency, in contrast to the time domain where it is expressed as a function of time. First, the spectral domain yields a better separation of speech and noise, as these signals usually contain different frequency content; e.g. speech does not exist at very high frequencies while noise can. Naturally, this makes it easier to implement an optimal or heuristic approach to VAD and CNE. Second, in the spectral domain the spectral components are decorrelated, which means that to some extent the frequency information can be treated independently, simplifying statistical models [1]. Before the spectral analysis can begin, a few concepts need to be introduced.

2.2.1 Representing an Analogue signal

For a computer to analyze an analogue signal it must first be converted into a digital signal. This is done with an analogue-to-digital converter through the process of sampling. Sampling an analogue signal means measuring it at discrete time intervals, so the sampled signal is a discrete representation of the continuous signal. Given a signal x(t), where t denotes a continuous time variable, the sampling model replaces t with the discrete value nTs, where the integer n indexes the samples and the sampling period Ts is the time between samples. The sampling process can be described as [30]

x[n] = x(nTs).

2.2.2 Discrete-Time Fourier Transform

To move from the time domain into the frequency domain the Discrete-Time Fourier Transform (DTFT) is used. Given a discrete-time signal x[n], the DTFT maps the signal to the linear combination of complex exponentials it consists of. The DTFT of x[n] is given by

X(e^{iω}) = Σ_{n=−∞}^{+∞} x[n] e^{−iωn} (2.3)


The DTFT is invertible in the sense that given X(e^{i\omega}) the original signal may be restored by the inverse DTFT (IDTFT), given by

x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{i\omega}) e^{i\omega n} \, d\omega \quad (2.4)

These two equations are called the analysis equation (2.3) and the synthesis equation (2.4) and are used to move from the time domain to the frequency domain and vice versa [21]. In practice the DTFT of a noisy speech signal is generally a complex-valued function of e^{i\omega} which can be expressed in polar form as X(e^{i\omega}) = |X(e^{i\omega})| e^{i\phi(\omega)}, where |X(e^{i\omega})| is the magnitude spectrum and \phi(\omega) is the phase spectrum [17].

2.2.3 Discrete Fourier Transform

In the case of digital signal processing the DTFT is replaced by the Discrete Fourier Transform (DFT), because the DTFT is a function of a continuous variable e^{i\omega} which is not compatible with digital computation. In practice the time signal x[n] consists of N samples and is therefore finite, contrary to equation (2.3), and so the DTFT can be sampled at N uniformly spaced intervals by using \omega_k = \frac{2\pi k}{N}, often referred to as frequency bins. Sampling the DTFT this way yields the DFT, given by

X[k] = \sum_{n=0}^{N-1} x[n] e^{-i 2\pi n k / N} \quad (2.5)

As in the case of the DTFT, the DFT is invertible by

x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] e^{i 2\pi n k / N} \quad (2.6)

Due to the computational complexity of the DFT a Fast Fourier Transform (FFT) algorithm is instead employed to compute the DFT [30].
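As an illustration, equations (2.5) and (2.6) can be evaluated directly. The following is a minimal pure-Python sketch using the O(N^2) direct sums, not the FFT a real implementation would use; the function names are illustrative only.

```python
import cmath
import math

def dft(x):
    # Direct evaluation of equation (2.5); O(N^2), whereas an FFT is O(N log N).
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft(X):
    # Inverse DFT, equation (2.6).
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
            for n in range(N)]

# A pure cosine at bin 2: for a real signal its energy appears in bins 2 and N - 2.
N = 8
x = [math.cos(2 * math.pi * 2 * n / N) for n in range(N)]
X = dft(x)
```

Applying idft to X recovers the original samples up to floating-point rounding.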

2.2.4 Power Spectrum

Most signals used in applications cannot be predicted exactly and can only be described by probabilistic statements. A random signal can be characterized by a power spectral density (PSD). The power spectral density is the frequency-domain specification of the second-order moment of the signal. To express the PSD the auto-covariance sequence of the stationary signal is needed, given by

r(k) = E[x[n] x^*[n-k]]


where {}^* denotes the complex conjugate and E[\cdot] is the expectation operator. The PSD is the DTFT of the covariance sequence and is thus calculated as

P(e^{i\omega}) = \sum_{k=-\infty}^{\infty} r(k) e^{-i\omega k}

The idea of spectral estimation is to estimate how the total signal power is distributed over frequency, given finite discrete observations of a stationary process. There are both parametric and non-parametric techniques for estimating the power spectrum. In practice the non-parametric periodogram method is often used to estimate the PSD due to its simplicity. The periodogram can be computed as

\hat{P}(k) = \frac{1}{N} |X[k]|^2 \quad (2.7)

where X[k] is the DFT of the data sequence x[n] and N is the length of the data sequence. This yields an efficient way of estimating the PSD with the help of the FFT [28].
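Equation (2.7) can be sketched in a few lines. This illustrative snippet computes the DFT by its direct sum for clarity; in practice the FFT would be used, and the function name is an assumption.

```python
import cmath
import math

def periodogram(x):
    # Equation (2.7): P(k) = |X[k]|^2 / N, with X[k] from the direct DFT sum (2.5).
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
         for k in range(N)]
    return [abs(Xk) ** 2 / N for Xk in X]

# A pure tone at bin 3: its power concentrates in that frequency bin.
N = 16
tone = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
psd = periodogram(tone)
```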

2.2.5 Frame Processing

Even though speech is a non-stationary process it can be assumed to be stationary for short periods of time, between 10 and 30 ms. This assumption is necessary for models employing the DTFT, and because of it the practice of frame processing is used. A frame is simply a short segment of the sampled signal, processed individually. So, in terms of a VAD or CNE, this means that each processed frame produces a binary speech presence decision and a noise estimate. When dealing with frames it is also common to use a window function, which affects the PSD of the signal. The simplest window function is the rectangular window, defined to be 1 inside the window and 0 outside; applying it for the purposes of the DFT is identical to framing the signal and applying the DFT to each frame. Another common window function is the Hamming window, which adds different weights to the signal samples, giving more weight to the mid samples than to the side samples. Using a Hamming window instead of a rectangular one makes it easier to spot differences between far-apart frequency bins, at the expense of making it harder to separate frequency bins close to each other. Statistically speaking, a Hamming window would, compared to a rectangular window, increase the correlation of nearby frequency bins and reduce the correlation of more distant frequency bins [6].

As a Hamming window gives less weight to some samples, the frames are overlapped to make sure that in the end every sample receives equal weight. The amount of overlap between frames differs, but around 50% is commonly used. This means that when the signal processing step is complete the modified signal must be restored using the overlap-add method [17] and the IDFT from equation (2.6).
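The framing, windowing and overlap-add steps above can be sketched as follows. This is a minimal illustration assuming a periodic Hamming window with 50% overlap, where the overlapped windows sum to the constant 1.08 so the interior of the signal is recovered exactly; all names are illustrative.

```python
import math

def hamming(M):
    # Periodic Hamming window; pairs w[n] + w[n + M/2] sum to the constant 1.08.
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / M) for n in range(M)]

def frame_signal(x, frame_len, hop):
    # Split the signal into overlapping frames (hop = frame_len/2 gives 50% overlap).
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def overlap_add(frames, hop):
    # Reassemble frames by summing them at their original positions.
    out = [0.0] * (hop * (len(frames) - 1) + len(frames[0]))
    for i, frame in enumerate(frames):
        for n, v in enumerate(frame):
            out[i * hop + n] += v
    return out

M, hop = 16, 8
x = [float(n % 7) for n in range(64)]
w = hamming(M)
windowed = [[w[n] * f[n] for n in range(M)] for f in frame_signal(x, M, hop)]
y = [v / 1.08 for v in overlap_add(windowed, hop)]
# Away from the first and last half-frame, y matches x exactly.
```

In a real processing chain the DFT, the modification and the IDFT would sit between the windowing and the overlap-add.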


2.3 Additional Theory

Given an observation vector x and a vector of target variables θ, the goal is to predict θ for a new set of observations. This amounts to determining the posterior probability density function (PDF) of θ, i.e. p(θ|x). This PDF can with the help of Bayes' theorem be expressed as

p(\theta|x) = \frac{p(x|\theta) p(\theta)}{p(x)} \quad (2.8)

where p(x|\theta) is the likelihood of the observations for a given target variable, p(\theta) is the prior target variable PDF and p(\theta|x) the corresponding posterior density [2].
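As a small numerical illustration of equation (2.8), consider a hypothetical binary target (speech vs. noise); the prior and likelihood values below are made up for the example, not estimates from this work.

```python
# Discrete illustration of Bayes' theorem (2.8) with made-up numbers.
prior = {"speech": 0.3, "noise": 0.7}           # p(theta)
likelihood = {"speech": 0.8, "noise": 0.2}      # p(x | theta) for one observation x
evidence = sum(prior[t] * likelihood[t] for t in prior)              # p(x)
posterior = {t: prior[t] * likelihood[t] / evidence for t in prior}  # p(theta | x)
```

Here the observation shifts the belief from a 0.3 prior on speech to a posterior of about 0.63.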

2.3.1 Loss, Risk and Bayes Risk function

When estimating any parameter it is in most cases important to distinguish between types of estimation errors. For example, in the case of noise estimation through VAD, marking speech as noise is worse than the other way around, since the speech would then contaminate the noise estimate. Also, when estimating the noise, a large error should be penalized more than a small error, as the large error will have a significantly greater effect in the noise suppression step. To accomplish this a loss function is introduced. The loss function expresses the loss incurred for every error in the estimates. The loss function is usually denoted as

L = L(θ, ˆθ)

where L corresponds to the loss incurred for the estimate \hat{\theta} of \theta. The idea is then to choose the estimator that minimizes the loss. However, in practice \theta is unknown, making it hard to find the optimal estimator \hat{\theta}, and so the aim is instead to minimize the expected loss, called the risk function and denoted by R(\theta) = E[L] [2].

An estimator that relies on Bayes' rule from equation (2.8) is considered a Bayesian estimator and can be derived using a Bayesian risk function. The most important feature of these risk functions is that they enable perceptual weighting in the estimators, i.e. they make it possible to include psychoacoustics, trying to emulate our hearing mechanism, in the estimator. The Bayesian risk function is the expectation of the risk function and is given by

\Re = E[R(\theta)] = \iint L(\theta, \hat{\theta}) \, p(x, \theta) \, dx \, d\theta \quad (2.9)

as the parameter \theta is now a stochastic variable. The minimization of the Bayes risk function with respect to \hat{\theta} for any given loss function yields a variety of estimators [17].


2.3.2 Likelihood Ratio Test

Let X_1, X_2, \ldots, X_n be a random sample from the stochastic variable X with PDF p(X; \theta). The likelihood function is defined as

\ell(\theta) := \prod_{i=1}^{n} p(X_i; \theta).

The likelihood ratio test (LRT) is used for hypothesis testing given a set of observations and two hypotheses. The likelihood ratio is a measure of how much more likely one hypothesis is than the other. The LRT for an observation vector X conditioned on the two hypotheses H_0 and H_1 can be defined as

\Lambda = \frac{\ell(\theta_{H_1})}{\ell(\theta_{H_0})} \quad (2.10)

and the decision in favor of either hypothesis depends on a threshold determining the acceptable false alarm rate [18].

2.4 Markov Models

This section is based on [2]. When speech is very prominent in the noisy signal the task of classifying between speech and non-speech is simple. However, this will not always be the case in practice. The detection of weak speech endings, especially unvoiced speech, is troublesome as they often resemble noise. To reduce the risk of clipping the speech short, the correlative nature of speech occurrences can be modelled into the LRT decision. To express this correlative behavior in a probabilistic manner a Markov model can be used. With the help of the product rule the joint distribution for a sequence of N observations can be expressed as

p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | x_1, \ldots, x_{n-1})

where x_1, \ldots, x_N are the observation vectors. If each conditional distribution on the right-hand side is independent of all previous observations except the most recent one, the model becomes a first-order Markov chain. The joint distribution of the first-order Markov chain for N observations is

p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | x_{n-1}).

For every observation xn a corresponding unobservable variable zn is introduced.

Under the assumption that the Markov chain is instead formed by the unobservable variables, a so-called state space model is obtained. The joint distribution for this model is given by

p(x_1, \ldots, x_N, z_1, \ldots, z_N) = p(z_1) \prod_{n=2}^{N} p(z_n | z_{n-1}) \prod_{n=1}^{N} p(x_n | z_n).

If the unobservable variables z_n of the state space model are discrete, the hidden Markov model (HMM) is obtained. Let the probability distribution of z_n depend on the previous state of the unobservable variable z_{n-1} via the conditional distribution p(z_n | z_{n-1}). The unobservable variables are binary, meaning that the conditional distribution corresponds to the so-called transition probabilities, given by A_{jk} = p(z_{nk} = 1 | z_{n-1,j} = 1), where z_{nk} denotes the unobservable variable attaining state k. The transition probabilities are collected in the matrix A. As they are probabilities they satisfy 0 \leq A_{jk} \leq 1 and \sum_k A_{jk} = 1. The conditional distribution for K different states can be expressed as

p(z_n | z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}}. \quad (2.11)

The initial unobservable variable z_1 cannot be defined as in (2.11) and is instead defined by a vector of probabilities \pi with elements \pi_k = p(z_{1k} = 1), so that

p(z_1 | \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}}

where \sum_k \pi_k = 1. To complete the HMM the conditional distributions of the observed variables, p(x_n | z_n, \phi), need to be defined. These conditional distributions, with parameter set \phi = [\phi_1, \ldots, \phi_K], are called emission probabilities. The emission distribution can for example be Gaussian, in which case \phi represents the parameter set needed to define the emission distribution. The emission probabilities for K states can be represented as

p(x_n | z_n, \phi) = \prod_{k=1}^{K} p(x_n | \phi_k)^{z_{nk}}.

A homogeneous HMM shares the parameters A across all of the conditional distributions of the unobservable variables, as well as \phi across all of the conditional emission distributions. The joint distribution over both unobservable and observed variables is therefore given by

p(X, Z | \theta) = p(z_1 | \pi) \prod_{n=2}^{N} p(z_n | z_{n-1}, A) \prod_{m=1}^{N} p(x_m | z_m, \phi) \quad (2.12)

where X = [x_1, \ldots, x_N], Z = [z_1, \ldots, z_N] and \theta = [\pi, A, \phi] is the set of model parameters.


2.4.1 Maximum Likelihood for the HMM

Given observed data X, the parameters of the HMM can be estimated using maximum likelihood. The likelihood function is obtained from (2.12) by summing over the unobservable variables

p(X | \theta) = \sum_{Z} p(X, Z | \theta).

To efficiently maximize the likelihood function of an HMM the expectation maximization (EM) algorithm is used. The EM algorithm is initialized by a selection of the model parameters, denoted \theta^{old}. The model parameters are often initialized randomly, subject to model constraints. In the first step of the algorithm (E), the model parameters are used to find the posterior distribution of the unobservable variables, p(Z | X, \theta^{old}). This distribution is used to evaluate the expected value of the logarithm of the complete-data likelihood function, as a function of the parameters \theta, giving the function Q(\theta, \theta^{old}) defined by

Q(\theta, \theta^{old}) = \sum_{Z} p(Z | X, \theta^{old}) \ln p(X, Z | \theta). \quad (2.13)

\gamma(z_n) is introduced as the marginal posterior distribution of an unobservable variable z_n, and \xi(z_{n-1}, z_n) as the joint posterior distribution of two successive unobservable variables, such that

\gamma(z_n) = p(z_n | X, \theta^{old})
\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, \theta^{old}).

\gamma(z_{nk}) is used to denote the conditional probability that z_{nk} = 1, and \xi(z_{n-1,j}, z_{nk}) is defined in a similar fashion. As z_{nk} and z_{n-1,j} z_{nk} are binary variables, the expected value of each is just the probability that it takes the value 1. Substituting the joint distribution from equation (2.12) into equation (2.13) and using the definitions of \gamma and \xi yields

Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk} + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n | \phi_k).

The expectation step of the EM algorithm is used to evaluate \gamma(z_{nk}) and \xi(z_{n-1,j}, z_{nk}). This can be done using the forward-backward algorithm. The maximization step treats \gamma(z_{nk}) and \xi(z_{n-1,j}, z_{nk}) as constants and maximizes Q(\theta, \theta^{old}) with respect to the parameters \theta. The maximization of Q(\theta, \theta^{old}) with respect to \pi and A gives

\pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}


A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}. \quad (2.14)

2.4.2 Forward-Backward Algorithm

There are several forward-backward algorithms. Here the alpha-beta algorithm is used. First \gamma(z_{nk}) needs to be evaluated. According to Bayes' theorem, \gamma(z_n) can be expressed as

\gamma(z_n) = p(z_n | X) = \frac{p(X | z_n) p(z_n)}{p(X)}

where the dependence on \theta^{old} is left implicit henceforth. With the help of the conditional independence property and the product rule of probability, \gamma(z_n) can be further expressed as

\gamma(z_n) = \frac{p(x_1, \ldots, x_n, z_n) \, p(x_{n+1}, \ldots, x_N | z_n)}{p(X)} = \frac{\alpha(z_n) \beta(z_n)}{p(X)} \quad (2.15)

with

\alpha(z_n) = p(x_1, \ldots, x_n, z_n)
\beta(z_n) = p(x_{n+1}, \ldots, x_N | z_n).

The computation of \alpha(z_n) and \beta(z_n) can be done recursively. With the help of the conditional independence properties, along with the sum and product rules, \alpha(z_n) can be expressed in terms of \alpha(z_{n-1}) as

\alpha(z_n) = p(x_n | z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) \, p(z_n | z_{n-1}). \quad (2.16)

An initial condition of \alpha(z_1) = p(z_1) p(x_1 | z_1) is needed to start the recursion.

During an EM optimization the value of the likelihood function p(X) is evaluated by summing both sides of equation (2.15) over z_n and using the fact that \gamma(z_n) is a normalized distribution. This can be expressed as

p(X) = \sum_{z_n} \alpha(z_n) \beta(z_n).

In the case where only the likelihood function is of interest, the equation above can be evaluated with n = N. This means that there is no need for a \beta recursion, which reduces the computational cost. The evaluation of \xi(z_{n-1}, z_n) can be derived using Bayes' theorem, the conditional independence property (citation) and the definitions of \alpha(z_n) and \beta(z_n) as

\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n | X) = \frac{\alpha(z_{n-1}) \, p(x_n | z_n) \, p(z_n | z_{n-1}) \, \beta(z_n)}{p(X)}.

Hence \xi(z_{n-1}, z_n), to be used in equation (2.14) to estimate the transition probabilities, is computable using the results of the \alpha and \beta recursions.
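The \alpha recursion of equation (2.16) can be sketched for a discrete-emission HMM. This is an illustrative forward pass on a hypothetical two-state model whose probabilities are made up for the example; it returns p(X) by summing the final \alpha values.

```python
def forward(obs, pi, A, B):
    # Alpha recursion, equation (2.16):
    #   alpha(z_1) = p(z_1) p(x_1 | z_1)
    #   alpha(z_n) = p(x_n | z_n) * sum_j alpha(z_{n-1,j}) p(z_n | z_{n-1,j})
    # Returns p(X) = sum over states of the final alpha values.
    K = len(pi)
    alpha = [pi[k] * B[k][obs[0]] for k in range(K)]
    for x in obs[1:]:
        alpha = [B[k][x] * sum(alpha[j] * A[j][k] for j in range(K))
                 for k in range(K)]
    return sum(alpha)

# Hypothetical 2-state HMM (e.g. speech / non-speech) with two observation symbols.
pi = [0.6, 0.4]                        # initial state probabilities
A = [[0.9, 0.1], [0.2, 0.8]]           # transition probabilities A[j][k]
B = [[0.7, 0.3], [0.1, 0.9]]           # emission probabilities p(x | z)
p_X = forward([0, 1, 0], pi, A, B)
```

For long observation sequences the \alpha values underflow, so practical implementations rescale each step or work with logarithms.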


Chapter 3

Method

In this chapter the chosen methods for both VAD and CNE are introduced, along with the evaluation format and implementation specifics. This work was divided into two parts. The first part revolved around the implementation and evaluation of VAD, i.e. the problem of detecting speech presence in a signal, whereas the second part was about noise estimation, i.e. the problem of trying to characterize the background noise.

3.1 VAD Methods

For this work 3 VAD methods were implemented and evaluated; however, these methods were modified for a total of 10 variants of the original 3 methods. The 3 methods were a standard VAD, a VAD method from a literature review and the method currently used by Konftel. The standard method chosen was the ES 202 050 standard by the European Telecommunications Standards Institute [8]. The ES 202 050 standard is a common benchmark in the comparison of VAD methods and will henceforth be referred to as ETSI. 2 variants of ETSI were implemented, described in detail in Section 3.4.2. The method chosen from the literature review is the statistical model-based VAD (SMVAD) as described in [26].

As mentioned in the introduction, over the years there have been some additions to this SMVAD, but they are less suited for the real-time implementation needed in a conference telephone. This is because of the non-causality of their decision rules, introduced when future observations are used to classify the current frame. To avoid this the causal decision rule was chosen. Another prominent reason for choosing the SMVAD was its flexibility, i.e. heuristic additions can easily be added. It is a solid base which can be expanded upon depending on the needs of Konftel and time constraints. A total of 7 variants of the SMVAD were implemented, described in detail in Section 3.4.2. The third and final method is the VAD used by Konftel as part of the Aurora Audio Algorithm.
