
Voice Activity Detection and Noise Estimation for Teleconference Phones

Björn Eliasson

June 20, 2015

Master's Thesis, 30 Credits
Department of Mathematics and Mathematical Statistics

Copyright © Björn Eliasson. All rights reserved.

VOICE ACTIVITY DETECTION AND NOISE ESTIMATION FOR TELECONFERENCE PHONES

Submitted in partial fulfillment of the requirements for the degree Master of Science in Industrial Engineering and Management

Department of Mathematics and Mathematical Statistics
Umeå University
SE-901 87 Umeå, Sweden

Supervisors:
Jun Yu, Umeå University
Nils Östlund, Konftel AB

Examiner:
Patrik Ryden, Umeå University


Abstract

When communicating via a teleconference phone, the transmitted signal (speech) needs to be crystal clear so that all participants experience good communication. However, many environmental conditions contaminate the signal with background noise, i.e. sounds not of interest for communication purposes, which impedes communication. Noise can be removed from the signal if it is known, and this work has therefore evaluated different ways of estimating the characteristics of the background noise. The focus was on using speech detection to define the noise, i.e. the non-speech part of the signal, but methods that do not rely solely on speech detection and instead exploit characteristics of the noisy speech signal were also included. The implemented techniques were compared and evaluated against the current solution used by the teleconference phone in two ways: first for their speech detection ability, and second for their ability to correctly estimate the noise characteristics. The evaluation was based on simulations of the methods' performance in various noise conditions, ranging from harsh to mild environments. The proposed method, as implemented in this study, improved on the existing solution in terms of speech detection ability, and its noise estimate showed improvement in certain conditions. It was also concluded that the proposed method would provide two sources of noise estimation instead of the current single source, and it was suggested to investigate how utilizing two noise estimators could affect performance.

Keywords: Voice Activity Detection (VAD), noise estimation, continuous noise estimation (CNE), statistical model-based VAD, improved minima-controlled recursive average (IMCRA), Rangachari noise estimation (RNE or MCRA-2), likelihood ratio approach, signal-to-noise ratio dependent recursive average, teleconferencing

Summary

When communicating via a conference phone, the transmitted signal (speech) must be sufficiently clear for all parties to experience good communication. In practice there are many environmental factors that contaminate the signal with background noise, i.e. sounds that are not of interest from a communication perspective, and that impede communication through interfering sounds. Background noise can be reduced from the transmitted signal if its characteristics are known, and different methods for estimating the characteristics of the background noise have therefore been evaluated. The focus was on using speech detection to define the noise, i.e. the signal without speech, but methods that exploit the characteristics of the noisy signal were also included. The implemented methods were compared with and evaluated against the current noise estimation solution in two ways: first for the ability to correctly detect speech, and second for the ability to correctly characterize the noise.

The evaluation was based on a simulation study of the methods in several noise environments, spanning mild to very harsh conditions. The proposed method showed an improvement compared to the existing solution, as implemented in this study, for speech detection, and for the noise estimate in certain conditions. Furthermore, the proposed method provides two sources for noise estimation, in contrast to the current solution's single source. It was proposed as further work to study how these two sources could be combined.


Acknowledgements

I would like to thank my extraordinary supervisor Professor Jun Yu at Umeå University for all the time spent helping me, be it encouraging words, lending of expertise or a thorough report review. Moreover, I direct a special thanks to my Konftel supervisor Dr. Nils Östlund for providing insight and keeping me on track throughout the project. Also, thank you to the members of project group Frost for letting me partake in the daily work at Konftel; it has been a great learning experience. Lastly, I would like to express my sincerest thanks to everyone at Konftel for being so incredibly nice to me this semester. It made me feel right at home!


Contents

1 Introduction
  1.1 Background
    1.1.1 Noise and Speech
    1.1.2 Noise Estimation
  1.2 Aim
  1.3 Scope and Limitations
  1.4 Outline

2 Theory
  2.1 Acoustic Theory
    2.1.1 Sinusoidal and Complex Waves
    2.1.2 Decibel
  2.2 Spectral Analysis
    2.2.1 Representing an Analogue Signal
    2.2.2 Discrete-Time Fourier Transform
    2.2.3 Discrete Fourier Transform
    2.2.4 Power Spectrum
    2.2.5 Frame Processing
  2.3 Additional Theory
    2.3.1 Loss, Risk and Bayes Risk Function
    2.3.2 Likelihood Ratio Test
  2.4 Markov Models
    2.4.1 Maximum Likelihood for the HMM
    2.4.2 Forward-Backward Algorithm

3 Method
  3.1 VAD Methods
    3.1.1 Aurora
    3.1.2 ETSI
    3.1.3 Statistical Model-Based VAD
  3.2 CNE Methods
    3.2.1 Likelihood Ratio Approach
    3.2.2 Improved Minima-Controlled Recursive Averaging
    3.2.3 Rangachari Noise Estimation
    3.2.4 SNR Dependent Recursive Averaging
  3.3 Methods of Comparison
    3.3.1 Comparison of VAD Methods
    3.3.2 Comparison of Noise Estimation
  3.4 Implementation
    3.4.1 Test Signals
    3.4.2 Implementation of VAD Comparison
    3.4.3 Implementation of Noise Estimation Comparison

4 Results
  4.1 Voice Activity Detection
    4.1.1 Cafeteria Noise
    4.1.2 Street Noise
    4.1.3 White Noise
  4.2 Noise Estimation
    4.2.1 Cafeteria Noise
    4.2.2 Street Noise
    4.2.3 White Noise
  4.3 Comfort Noise

5 Discussion and Conclusion
  5.1 VAD Evaluation
    5.1.1 Statistical Model-Based VAD
    5.1.2 ETSI
    5.1.3 Aurora
  5.2 Noise Estimation Evaluation
    5.2.1 Continuous Noise Estimation
    5.2.2 ETSI
    5.2.3 Aurora
  5.3 VAD vs. CNE
  5.4 Critique
    5.4.1 Evaluation Format
    5.4.2 Model Assumptions
  5.5 Conclusion
  5.6 Further Studies


Abbreviations

CNE      Continuous Noise Estimation
DD       Decision-Directed
DFT      Discrete Fourier Transform
DTFT     Discrete-Time Fourier Transform
ETSI     European Telecommunications Standards Institute
FFT      Fast Fourier Transform
HMM      Hidden Markov Model
IMCRA    Improved Minima-Controlled Recursive Average
LRA      Likelihood Ratio Approach
LRT      Likelihood Ratio Test
MedSE    Median Squared Error
MMSE     Minimum Mean Square Error
MSE      Mean Squared Error
PSD      Power Spectral Density
ROC      Receiver Operating Characteristics
SNR      Signal-to-Noise Ratio
SNRDRA   Signal-to-Noise-Ratio Dependent Recursive Average
SMVAD    Statistical Model-Based Voice Activity Detector
VAD      Voice Activity Detector


Chapter 1

Introduction

This chapter introduces the background together with the aim, scope and limitations for this work.

1.1 Background

Being a global contender in the teleconference solution scene is no easy feat and requires good products. An example of a product offered by Konftel can be seen in Figure 1.1. In a teleconference setting good products equal high quality audio, which is imperative for the users to be able to communicate properly and with ease. One necessity in creating good audio is a good background noise estimate.

An estimate of the background noise, or simply the noise, is important for many aspects of generating high quality audio. It is needed for reducing the noise transmitted to the other participants in the teleconference, and it plays a part in cancelling the echo created when the microphone picks up the loudspeaker signal. One way to estimate the noise is via a voice activity detector (VAD), which, as the name suggests, tries to detect the presence of speech. Knowing when someone is speaking is a tool to estimate the noise, i.e. the sound of non-speech. Other methods rely on key characteristics of speech to update the noise estimate; these methods are herein called continuous noise estimators (CNE). The basics of both strategies are explained in the following subsections. Not only must an efficient noise estimator represent the background noise well, it must also do so in real-time with minimum delay. The implementation and evaluation of noise estimation methods is the subject of this work.

1.1.1 Noise and Speech

As this work is concerned with estimating the background noise, it is reasonable to explain what noise is, even though it may appear trivial. Noise is everywhere around us and constitutes what we perceive as sounds not of interest, which in the case of teleconferencing translates to all sounds that do not originate from a participant speaking. In essence, this means that every signal can be decomposed into two parts: the speech signal and the noise signal. Together they form the noisy speech signal. As an example, in a quiet room one's perception is often that there is complete silence, but this is not true, as there is always a noise background as long as you are in a medium where sound can exist. The noise background is more noticeable in environments like a busy street, the office or a restaurant, because there the sounds coming towards you are more prominent.

Figure 1.1: Teleconference phone Konftel 300IP.

There are two types of noise: stationary and non-stationary. For stationary noise the noise characteristics do not change over time (e.g. a fan), and for non-stationary noise they do (e.g. an accelerating car or people talking next door). Intuitively it would seem that stationary noise is easier to estimate than its counterpart. Not surprisingly, this is very much the case: a method estimating ever-changing noise characteristics must adapt to changes constantly, whereas stationary noise only needs to be characterized once.

There are two main groups of speech, voiced and unvoiced. When the vocal folds are tensed and air is pushed through, the resulting vibration produces voiced sounds such as vowels. Unvoiced sounds are produced when the vocal folds do not vibrate but tense up and come closer together, allowing the air stream to become turbulent. 'H' in house is an example of an unvoiced sound. 'S' and 't' are also unvoiced sounds, produced when the tongue and lips impose limitations on the vocal tract. Not surprisingly, different types of spoken sounds are more or less easy to detect against a noise background, as they more or less resemble noise [17].


1.1.2 Noise Estimation

There are two main strategies for estimating the noise: one relies on a VAD decision, and the other is a CNE scheme that utilizes some key characteristic of the noisy speech signal [17]. Before the basics of these methods are discussed, it is worth understanding why a noise estimate is needed. The importance of a good noise estimation procedure is made clear through its use in speech enhancement, including noise suppression.

Noise Suppression and Other Uses

The importance of noise suppression is explained in [13] with the teleconference setting as an example. In a conference call the background noise from each participant is picked up and additively combined at the network bridge. This means that each loudspeaker will reproduce the combined sum of the background noises from the other participants. As the number of participants increases, the combined background noise will overpower the desired signal, making communication impossible.

This makes it clear that the noise must be attenuated without affecting the speech, which is much of the issue when dealing with noise suppression. A noise suppression system removes the estimated noise from the noisy speech signal, and the resulting signal will hopefully contain speech only. This problem of removing the estimated noise from the noisy signal reveals a difference between over- and underestimating the noise. Overestimation can cause speech distortion, as too much of the noisy speech signal is removed, including speech. Underestimation can leave background noise present in the noise-suppressed signal. In practice this is not exactly how it works, but it gives an idea of the difference between over- and underestimating the noise.

While noise suppression is a very important aspect, and the main focus of this work, it is not the only use of the noise estimate. In communication devices it is common to use comfort noise, which is simply an estimated noise background transmitted to the far-end user as an assurance that the connection is still working. Comfort noise may also be used to cover some residual echo, making it less audible. The methods for generating the transmitted noise range from using static colored noise (noise with more power in some part of the frequency range) to adaptive schemes trying to emulate changes in the noise process [12], and it is for the latter methods that an up-to-date noise estimate is needed. A noise estimate is also needed for echo cancellation, i.e. removing the echo created when the teleconference phone's microphone picks up its own loudspeaker signal, as well as for double-talk detection, which tries to detect when two parties of the teleconference speak simultaneously.


Voice Activity Detection

The idea behind a VAD is simple and can be summarized in three steps. First, a signal feature is extracted. Second, a decision rule is employed, deciding whether these features are those of speech or of noise. Third, it is common to apply some kind of decision alteration, which is usually more empirical in nature and tuned to specific needs [22]. When a VAD is used for noise estimation it estimates the noise during noise periods only, i.e. when no speech is detected it treats the entire incoming signal as noise.

Several different signal features and decision rules have been suggested for speech detection throughout the years, as there is no single all-defining feature that captures all the complexity of speech. In an overview by Ramírez [22] the most common features used for a VAD are explained. Tracking the energy of the signal is a useful and intuitively simple method, as it can be assumed that speech contains more energy than noise. Here the presence of speech is assumed when the signal energy is greater than some threshold. These energy-based VADs are used in both the time and frequency domains (see section 2.1 and section 2.2). As an addition to the energy-based thresholding scheme, some methods use frequency analysis tools based on tracking the minimum and maximum energies of the low and high frequencies. Assuming an initial noise period, the energy envelope can be tracked and compared to incoming energy values using a simple difference measure, which in turn is used in the VAD decision logic. Other methods assume that there are inherent differences between speech and noise in terms of the periodicity, frequency or pitch of the signal (see section 2.1). Pitch is a non-linear function of frequency [9].
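The energy-thresholding idea can be sketched in a few lines. This is a minimal illustration only, not the Aurora, ETSI or statistical VADs evaluated in this work; the 160-sample frame (20 ms at 8 kHz) and the -30 dB threshold are illustrative choices.

```python
import math

def frame_energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Mark each frame as speech (True) or noise (False) by its energy.

    Frames whose mean power, in dB relative to full scale (amplitudes
    assumed in [-1, 1]), exceeds the threshold are taken to contain
    speech; all other frames could be fed to a noise estimator.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        level_db = 10.0 * math.log10(power + 1e-12)  # avoid log(0)
        decisions.append(level_db > threshold_db)
    return decisions
```

A loud tone frame is flagged as speech while a near-silent frame is not; a real VAD would add hangover logic and adaptive thresholds, i.e. the empirical third step mentioned above.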

While these VADs are heuristic in nature, other researchers have focused on developing statistical models for speech detection. In [26] a VAD was introduced that modeled the speech and noise as independent Gaussian processes. This method was later improved by adding contextual information to the decision rule; however, this meant the method was no longer causal [23]. Other methods use statistical models based on Laplacian or Gamma distributions [3].

Continuous Noise Estimation

The idea of the CNE schemes is to utilize some key characteristics of speech and the noisy speech signal to constantly update the noise estimate regardless of speech presence. There are three main classes of CNE algorithms, the time-recursive averaging algorithms, the minimum tracking algorithms and the histogram-based algorithms, each based on one or more of three key characteristics of speech.

The first characteristic is that "silent" portions of speech do not occur only in noise periods when looking at a frequency band (a subset of all frequencies, e.g. all frequencies between 0 and 8000 Hz). As an example, a low-frequency vowel will affect the lower part of the frequencies, enabling estimation of the noise in the higher frequencies. In short, it is possible to update the noise for every frequency not containing speech; the time-recursive averaging algorithms exploit this. The second characteristic is that the noisy speech signal often decays to the power of the noise, so remembering this lowest energy level is a tool to estimate the background noise. This is the idea behind the minimum tracking algorithms. The third characteristic is that a histogram of the energy values for every frequency reveals the most common energy level, taken to be the noise energy level. This assumption leads to the histogram-based noise estimation schemes [17]. Figure 1.2 shows a visualization of the three mentioned characteristics.

Figure 1.2: Visualization of the three key characteristics of speech and the noisy speech signal utilized by the CNE. In (a) speech is shown to exist only in the lower part of the frequency band (assuming speech has more energy than noise), allowing for estimation of the noise in the upper frequencies, where the energy level is low and thus assumed to be noise only. (b) shows the idea behind the minimum statistics algorithms: the power of the speech signal (red) decays to the power of the noise signal (blue) between utterances. In (c) the histogram of the logarithmic energy level of the noisy speech signal shows the most common energy level, taken to be the background noise power level.

The reasoning behind a constant update is to have a more recent noise estimate than a VAD-based estimate can offer when long speech segments are present (or assumed to be present). This is especially important in the case of non-stationary noise (e.g. noise from inside a cafeteria), where a VAD might indicate a long speech period while the background noise changes, making the noise estimate outdated. The VAD might even be performing well, i.e. it correctly classifies the long period as speech, but this does not change the fact that the noise estimate has not been updated for a long period. In the case of stationary noise a continuous update scheme loses its purpose, as a few noise estimates would be sufficient to fully characterize the noise. Salas [25] presents a good overview of the different types of CNE schemes from all three groups.
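As a concrete illustration of the minimum-tracking characteristic, the following sketch recursively smooths the noisy-speech power of a single frequency bin and takes the minimum over a short sliding window as the noise estimate. It is a toy version of the idea, not the IMCRA or other CNE algorithms evaluated later; alpha and window are illustrative parameters.

```python
def minimum_tracking(frame_powers, alpha=0.9, window=8):
    """Noise-floor estimate for one frequency bin over time.

    Recursively smooth the noisy-speech power, then take the minimum
    of the smoothed values inside a sliding window as the noise
    estimate: between utterances the power decays to the noise floor,
    so the windowed minimum keeps tracking the noise even through
    short speech bursts.
    """
    smoothed = []
    noise_estimates = []
    s = frame_powers[0]
    for p in frame_powers:
        s = alpha * s + (1.0 - alpha) * p  # recursive averaging
        smoothed.append(s)
        noise_estimates.append(min(smoothed[-window:]))
    return noise_estimates
```

Fed a power track with a short loud burst in the middle, the estimate stays near the noise floor during the burst instead of jumping with it, which is exactly the behavior a VAD-based estimator lacks during long detected-speech periods.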

Difficulties in Estimating the Noise

The biggest difficulty facing anyone trying to characterize the noise is the fact that in practice only the noisy speech signal picked up by the microphone is available for analysis. In other words the problem of noise estimation boils down to separating what is the speech signal and what is the noise signal with only the information available in the noisy speech signal. As you are trying to separate these two signals it is inevitable that the speech signal will affect the noise signal estimate and vice versa. The goal is to minimize these effects to allow for the most accurate estimate of each separate signal.

When using a VAD-based noise estimation scheme, this problem occurs when the incoming signal is classified as containing noise only while in reality it also contains speech. Speech components will then be incorporated into the noise estimate, as the entire signal is treated as noise. To minimize these effects it is common to mark the signal as noise only when it is very likely to be noise only; this, of course, increases the chance of background noise being wrongly classified as speech. The CNE does not use the binary speech decision, so the problem of the two signals affecting each other's estimates takes another form, but it is still present.

Current Noise Estimation Solution

The background noise is currently estimated by the use of a VAD, as part of the Aurora Audio Algorithm employed in the teleconference phones, henceforth known simply as Aurora. As previously mentioned, relying on a VAD to characterize the background noise introduces a few possible kinks. The noise estimate will only be updated during periods marked as containing noise only, and this becomes a problem for Aurora in two ways. Firstly, the Aurora algorithm is very sensitive to non-stationary noise, meaning that in these noise conditions the VAD will be tricked into believing that speech is present in the signal. This ultimately leads to the noise estimate becoming outdated when long segments of the signal are falsely deemed to contain speech. Secondly, the current method depends on signal strength. With its current design it will interpret strong signals as speech, even when they are merely strong noise, which yields the same result as before: falsely marked long speech segments make the noise estimate outdated. A real example of signal strength being an issue is the practice of using strong white-noise generators in office landscapes to create a more preferable noise background. These generators can emit very loud sounds to counter the original background noise, and this strong signal will be interpreted as speech by the Aurora VAD. Both of these issues can be attributed to the empirical nature of the Aurora implementation. An empirically tuned VAD method is hard to predict in various conditions, and it is therefore important that the proposed method is not so empirical as to introduce new unforeseen problems. This is of course hard to verify, but can be taken into account by choosing a tried and tested method.

Possible Solutions

There are two possible solutions to these problems. The first is simply to improve the VAD's speech detection ability, making it better at distinguishing between noise and speech. The second is to implement a CNE that does not rely on the binary decision of a VAD to update the noise estimate and is therefore possibly better at dealing with non-stationary noise.

1.2 Aim

To stay ahead in a competitive environment, product improvements are crucial. As the work of developing a new generation of teleconference phones is under way, the idea was to research alternatives to the current noise estimator. In order to improve upon the current noise estimator, both solution alternatives discussed above will be evaluated.

The aim of this work was twofold. The first part was to implement and evaluate different VAD methods and compare these to the current VAD implemented in the teleconference phones in terms of speech detection ability. The second part was the implementation and evaluation of different methods for noise estimation only, i.e. trying to correctly characterize the noise.

1.3 Scope and Limitations

This work is limited to three VAD methods and four CNE methods, implemented and analysed. The performance of the chosen methods will be evaluated in various conditions associated with the teleconference setting and compared to the existing solution and to each other. These settings include different noise strengths combined with various noise environments. The speech and noise signals used for the comparison are sound files recommended by the International Telecommunication Union for use as test signals; the sound files used are a subset of those discussed in the ITU-T P.501 standard [14]. The evaluation process is limited to simulations using the aforementioned sound files; no real-time implementation in a conference telephone will be done.

There are physical as well as computational limitations to the platform in which the noise estimator is to be implemented, and the proposed estimation schemes must take these limitations into consideration. This means that any proposed method must be able to handle a real-time implementation with minimum delay so as not to interfere with the communication, and it cannot be too computationally complex. Another important requirement is that the method must be causal. This restriction is imposed on the system so that processing the signal does not cause too much delay.

1.4 Outline

This work is structured as follows: Chapter 2 introduces the relevant theory about acoustics, digital spectral analysis, loss and risk functions, the likelihood ratio test and hidden Markov models. Readers familiar with these subjects may skim the chapter to become familiar with the notation used in this work.

In Chapter 3 the chosen methods for VAD and CNE are introduced, along with the performance evaluation format and implementation specifics. Chapter 4 presents the results, and in Chapter 5 the results are discussed, conclusions are drawn and further work is proposed.


Chapter 2

Theory

This chapter gives a short introduction to acoustic theory, spectral analysis, loss and risk functions, the likelihood ratio test and hidden Markov models. Readers familiar with these subjects may skip or skim parts of the chapter, but it is recommended to become familiar with the notation used.

2.1 Acoustic Theory

Sound can be defined in two ways, either as a physical wave propagating through any elastic medium, or as the excitation of our hearing mechanism resulting in the psychophysical perception of sound [9].

2.1.1 Sinusoidal and Complex Waves

The wave form that often describes sound, and various other kinds of signals, is the sinusoidal wave. To define a periodic sinusoidal wave, the signal amplitude, frequency and phase are needed. The amplitude is the maximum absolute value of the signal. The frequency is the number of complete periods per second, where one period is the time between two wave peaks, and is measured in Hertz (Hz). The phase is a shift along the time axis and indicates where the first zero crossing occurs. The sinusoidal wave can be expressed as

x(t) = A · cos(ωt + φ). (2.1)

where t is a time unit (a signal expressed like this is called a time-domain signal, as it is a function of time), A is the signal amplitude, φ is the phase in radians and ω is the angular frequency in radians per time unit. ω can be used to derive the period of the wave as T = 2π/|ω|. The wavelength is the distance the wave travels in one period [21]. In the case of sound the sinusoidal wave represents the degree of displacement (compression and rarefaction) of air particles in relation to the prevailing atmospheric pressure.


Figure 2.1: Sine wave in acoustical application

Figure 2.1 shows the basics of the sinusoidal wave in acoustical application. The simple sinusoidal wave doesn’t seem to be of very much use for representing the complex wave of speech, since the wave shapes of speech look drastically different from the simple sinusoidal wave. However, no matter what shape the wave is it can be reduced to sinusoidal wave components as long as it is periodic. This means that any periodic complex wave can be synthesized using sinusoidal waves of different amplitudes, frequencies and phases [9].
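The claim that any periodic complex wave can be synthesized from sinusoids can be illustrated with the square wave, whose Fourier series contains only odd harmonics with amplitudes 4/(πk). The following sketch sums a finite number of those harmonics:

```python
import math

def square_wave_partial_sum(t, n_terms=25, freq=1.0):
    """Partial Fourier synthesis of a unit-amplitude square wave.

    Summing odd harmonics k = 1, 3, 5, ... with amplitudes 4/(pi*k)
    approaches a square wave, illustrating that a complex periodic
    wave is a sum of simple sinusoids of different amplitudes and
    frequencies.
    """
    total = 0.0
    for k in range(1, 2 * n_terms, 2):  # odd harmonics 1, 3, 5, ...
        total += (4.0 / (math.pi * k)) * math.sin(2.0 * math.pi * k * freq * t)
    return total
```

With 25 harmonics the synthesized wave is already within a few percent of ±1 over most of the period, ringing near the jumps aside.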

Expressing the sinusoidal signals in terms of complex exponentials makes available some useful tools for analyzing the signal, e.g. the Fourier decomposition of signals discussed in the next section. By Euler's formula the sinusoidal wave relates to the complex exponential by

e^{iωt} = cos(ωt) + i sin(ωt),

where i denotes the imaginary unit. Hence the sinusoidal wave from equation (2.1) can be expressed in terms of complex exponentials as [21]

A · cos(ωt + φ) = (A/2) e^{iφ} e^{iωt} + (A/2) e^{−iφ} e^{−iωt}.

2.1.2 Decibel

Any power level of magnitude W1 can be expressed in terms of a reference power W2 as

L1 = 10 · log10(W1/W2) decibels.

Magnitudes other than acoustic power can be expressed in dB. For example, acoustic power is proportional to the squared acoustic pressure p, hence the pressure level is

Lp = 20 · log10(p1/p2) decibels. (2.2)


These two equations define two useful relationships between levels. As sound pressure is a common parameter to measure in acoustics, equation (2.2) is often used [9]. The decibel is important to the concept of signal-to-noise ratio (SNR), which by definition is the ratio of the signal power to the noise power, usually measured in dB.
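The two level definitions above can be written directly as code; the function names are illustrative:

```python
import math

def power_level_db(w1, w2):
    """Level of power w1 relative to reference power w2, in dB."""
    return 10.0 * math.log10(w1 / w2)

def pressure_level_db(p1, p2):
    """Level of pressure p1 relative to reference p2 (equation 2.2).
    The factor 20 arises because power is proportional to squared
    pressure, and log10(p^2) = 2 log10(p)."""
    return 20.0 * math.log10(p1 / p2)
```

Doubling the pressure raises the level by about 6 dB, while doubling the power raises it by about 3 dB; an SNR in dB is simply `power_level_db(signal_power, noise_power)`.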

2.2 Spectral Analysis

There are several advantages to moving into the frequency, or spectral, domain when analysing signals. In the spectral domain the signal is expressed as a function of frequency, in contrast to the time domain where it is expressed as a function of time. First, the spectral domain yields a better separation of speech and noise, as these signals usually contain different frequency content; e.g. speech does not exist at very high frequencies while noise can. Naturally, this makes it easier to implement an optimal or heuristic approach to VAD and CNE. Second, in the spectral domain the spectral components are decorrelated, which means that to some extent the frequency information can be treated independently, simplifying statistical models [1]. Before the spectral analysis can begin, a few concepts need to be introduced.

2.2.1 Representing an Analogue signal

For a computer to analyze an analogue signal it must first be converted into a digital signal. This is done with an analogue-to-digital converter through the process of sampling. Sampling an analogue signal means measuring it at discrete time intervals, so the sampled signal is a discrete representation of the continuous signal. Given a signal x(t), where t denotes a continuous time variable, the sampling model replaces t with the discrete value nTs, where the integer n indexes the samples and the sampling period Ts is the time between samples. The sampling process can be described as [30]

x[n] = x(nTs).

2.2.2 Discrete-Time Fourier Transform

To move from the time domain into the frequency domain the Discrete-Time Fourier Transform (DTFT) is used. Given a discrete-time signal x[n], the DTFT maps the signal to the linear combination of complex exponentials it consists of. The DTFT of x[n] is given by

X(e^{iω}) = Σ_{n=−∞}^{+∞} x[n] e^{−iωn} (2.3)


The DTFT is invertible in the sense that given X(e^{i\omega}) the original signal may be restored by the inverse DTFT (IDTFT), given by

x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{i\omega}) e^{i\omega n} \, d\omega \quad (2.4)

These two equations are called the analysis equation (2.3) and the synthesis equation (2.4) and are used to move from the time domain to the frequency domain and vice versa [21]. In practice the DTFT of a noisy speech signal is generally a complex-valued function of e^{i\omega} which can be expressed in polar form as X(e^{i\omega}) = |X(e^{i\omega})| e^{i\phi(\omega)}, where |X(e^{i\omega})| is the magnitude spectrum and \phi(\omega) is the phase spectrum [17].

2.2.3 Discrete Fourier Transform

In the case of digital signal processing the DTFT is replaced by the Discrete Fourier Transform (DFT), because the DTFT is a function of a continuous variable e^{i\omega} which is not compatible with digital computation. In practice the time signal x[n] consists of N samples and is therefore finite, contrary to equation (2.3), and so the DTFT can be sampled at N uniformly spaced intervals by using \omega_k = \frac{2\pi k}{N}, often referred to as frequency bins. Sampling the DTFT this way yields the DFT, given by

X[k] = \sum_{n=0}^{N-1} x[n] e^{-i 2\pi n k / N} \quad (2.5)

As in the case of the DTFT, the DFT is invertible by

x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] e^{i 2\pi n k / N} \quad (2.6)

Due to the computational complexity of the DFT a Fast Fourier Transform (FFT) algorithm is instead employed to compute the DFT [30].
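As an illustration, equations (2.5) and (2.6) can be evaluated directly. The following is a minimal pure-Python sketch using the O(N^2) direct sums, not the FFT a real implementation would use; the function names are illustrative only.

```python
import cmath
import math

def dft(x):
    # Direct evaluation of equation (2.5); O(N^2), whereas an FFT is O(N log N).
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft(X):
    # Inverse DFT, equation (2.6).
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
            for n in range(N)]

# A pure cosine at bin 2: for a real signal its energy appears in bins 2 and N - 2.
N = 8
x = [math.cos(2 * math.pi * 2 * n / N) for n in range(N)]
X = dft(x)
```

Applying idft to X recovers the original samples up to floating-point rounding.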

2.2.4 Power Spectrum

Most signals used in applications cannot be predicted exactly and can only be described by probabilistic statements. A random signal can be characterized by a power spectral density (PSD). The power spectral density is the frequency-domain specification of the second-order moment of the signal. To express the PSD the auto-covariance sequence of the stationary signal is needed, given by

r(k) = E[x[n] x^*[n-k]]


where {}^* denotes the complex conjugate and E[\cdot] is the expectation operator. The PSD is the DTFT of the covariance sequence and is thus calculated as

P(e^{i\omega}) = \sum_{k=-\infty}^{\infty} r(k) e^{-i\omega k}

The idea of spectral estimation is to estimate how the total signal power is distributed over frequency, given finite discrete observations of a stationary process. There are both parametric and non-parametric techniques for estimating the power spectrum. In practice the non-parametric periodogram method is often used to estimate the PSD due to its simplicity. The periodogram can be computed as

\hat{P}(k) = \frac{1}{N} |X[k]|^2 \quad (2.7)

where X[k] is the DFT of the data sequence x[n] and N is the length of the data sequence. This yields an efficient way of estimating the PSD with the help of the FFT [28].
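Equation (2.7) can be sketched in a few lines. This illustrative snippet computes the DFT by its direct sum for clarity; in practice the FFT would be used, and the function name is an assumption.

```python
import cmath
import math

def periodogram(x):
    # Equation (2.7): P(k) = |X[k]|^2 / N, with X[k] from the direct DFT sum (2.5).
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
         for k in range(N)]
    return [abs(Xk) ** 2 / N for Xk in X]

# A pure tone at bin 3: its power concentrates in that frequency bin.
N = 16
tone = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
psd = periodogram(tone)
```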

2.2.5 Frame Processing

Even though speech is a non-stationary process it can be assumed to be stationary for short periods of time, between 10 and 30 ms. This assumption is necessary for models employing the DTFT, and because of it the practice of frame processing is used. A frame is simply a short segment of the sampled signal, processed individually. So, in terms of a VAD or CNE, this means that each processed frame produces a binary speech presence decision and a noise estimate. When dealing with frames it is also common to use a window function, which affects the PSD of the signal. The simplest window function is the rectangular window, defined to be 1 inside the window and 0 outside; applying it for the purposes of the DFT is identical to framing the signal and applying the DFT to each frame. Another common window function is the Hamming window, which adds different weights to the signal samples, giving more weight to the mid samples than to the side samples. Using a Hamming window instead of a rectangular one makes it easier to spot differences between far-apart frequency bins, at the expense of making it harder to separate frequency bins close to each other. Statistically speaking, a Hamming window would, compared to a rectangular window, increase the correlation of nearby frequency bins and reduce the correlation of more distant frequency bins [6].

As a Hamming window gives less weight to some samples, the frames are overlapped to make sure that in the end every sample receives equal weight. The amount of overlap between frames differs, but around 50% is commonly used. This means that when the signal processing step is complete the modified signal must be restored using the overlap-add method [17] and the IDFT from equation (2.6).
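The framing, windowing and overlap-add steps above can be sketched as follows. This is a minimal illustration assuming a periodic Hamming window with 50% overlap, where the overlapped windows sum to the constant 1.08 so the interior of the signal is recovered exactly; all names are illustrative.

```python
import math

def hamming(M):
    # Periodic Hamming window; pairs w[n] + w[n + M/2] sum to the constant 1.08.
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / M) for n in range(M)]

def frame_signal(x, frame_len, hop):
    # Split the signal into overlapping frames (hop = frame_len/2 gives 50% overlap).
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def overlap_add(frames, hop):
    # Reassemble frames by summing them at their original positions.
    out = [0.0] * (hop * (len(frames) - 1) + len(frames[0]))
    for i, frame in enumerate(frames):
        for n, v in enumerate(frame):
            out[i * hop + n] += v
    return out

M, hop = 16, 8
x = [float(n % 7) for n in range(64)]
w = hamming(M)
windowed = [[w[n] * f[n] for n in range(M)] for f in frame_signal(x, M, hop)]
y = [v / 1.08 for v in overlap_add(windowed, hop)]
# Away from the first and last half-frame, y matches x exactly.
```

In a real processing chain the DFT, the modification and the IDFT would sit between the windowing and the overlap-add.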


2.3 Additional Theory

Given an observation vector x and a vector of target variables θ, the goal is to predict θ for a new set of observations. This amounts to determining the posterior probability density function (PDF) of θ, i.e. p(θ|x). This PDF can with the help of Bayes' theorem be expressed as

p(\theta|x) = \frac{p(x|\theta) p(\theta)}{p(x)} \quad (2.8)

where p(x|\theta) is the likelihood of the observations for a given target variable, p(\theta) is the prior target variable PDF and p(\theta|x) the corresponding posterior density [2].
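As a small numerical illustration of equation (2.8), consider a hypothetical binary target (speech vs. noise); the prior and likelihood values below are made up for the example, not estimates from this work.

```python
# Discrete illustration of Bayes' theorem (2.8) with made-up numbers.
prior = {"speech": 0.3, "noise": 0.7}           # p(theta)
likelihood = {"speech": 0.8, "noise": 0.2}      # p(x | theta) for one observation x
evidence = sum(prior[t] * likelihood[t] for t in prior)              # p(x)
posterior = {t: prior[t] * likelihood[t] / evidence for t in prior}  # p(theta | x)
```

Here the observation shifts the belief from a 0.3 prior on speech to a posterior of about 0.63.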

2.3.1 Loss, Risk and Bayes Risk function

When estimating any parameter it is in most cases important to distinguish between types of estimation errors. For example, in the case of noise estimation through VAD, marking speech as noise is worse than the other way around, since the speech would then contaminate the noise estimate. Also, when estimating the noise, a large error should be penalized more than a small error, as the large error will have a significantly greater effect in the noise suppression step. To accomplish this a loss function is introduced. The loss function expresses the loss incurred for every error in the estimates. The loss function is usually denoted as

L = L(θ, ˆθ)

where L corresponds to the loss incurred for the estimate \hat{\theta} of \theta. The idea is then to choose the estimator that minimizes the loss. However, in practice \theta is unknown, making it hard to find the optimal estimator \hat{\theta}, and so the aim is instead to minimize the expected loss, called the risk function and denoted by R(\theta) = E[L] [2].

An estimator that relies on Bayes' rule from equation (2.8) is considered a Bayesian estimator and can be derived using a Bayesian risk function. The most important feature of these risk functions is that they enable perceptual weighting in the estimators, i.e. they make it possible to include psychoacoustics, trying to emulate our hearing mechanism, in the estimator. The Bayesian risk function is the expectation of the risk function and is given by

\Re = E[R(\theta)] = \iint L(\theta, \hat{\theta}) \, p(x, \theta) \, dx \, d\theta \quad (2.9)

as the parameter \theta is now a stochastic variable. The minimization of the Bayes risk function with respect to \hat{\theta} for any given loss function yields a variety of estimators [17].


2.3.2 Likelihood Ratio Test

Let X_1, X_2, \ldots, X_n be a random sample from the stochastic variable X with PDF p(X; \theta). The likelihood function is defined as

\ell(\theta) := \prod_{i=1}^{n} p(X_i; \theta).

The likelihood ratio test (LRT) is used for hypothesis testing given a set of observations and two hypotheses. The likelihood ratio is a measure of how much more likely one hypothesis is than the other. The LRT for an observation vector X conditioned on the two hypotheses H_0 and H_1 can be defined as

\Lambda = \frac{\ell(\theta_{H_1})}{\ell(\theta_{H_0})} \quad (2.10)

and the decision in favor of either hypothesis depends on a threshold determining the acceptable false alarm rate [18].

2.4 Markov Models

This section is based on [2]. When speech is very prominent in the noisy signal the task of classifying between speech and non-speech is simple. However, this will not always be the case in practice. The detection of weak speech endings, especially unvoiced speech, is troublesome as they often resemble noise. To reduce the risk of clipping the speech short, the correlative nature of speech occurrences can be modelled into the LRT decision. To express this correlative behavior in a probabilistic manner a Markov model can be used. With the help of the product rule the joint distribution for a sequence of N observations can be expressed as

p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | x_1, \ldots, x_{n-1})

where x_1, \ldots, x_N are the observation vectors. If each conditional distribution on the right-hand side is independent of all previous observations except the most recent one, the model becomes a first-order Markov chain. The joint distribution of the first-order Markov chain for N observations is

p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | x_{n-1}).

For every observation xn a corresponding unobservable variable zn is introduced.

Under the assumption that the Markov chain is instead formed by the unobservable variables, a so-called state space model is obtained. The joint distribution for this model is given by

p(x_1, \ldots, x_N, z_1, \ldots, z_N) = p(z_1) \prod_{n=2}^{N} p(z_n | z_{n-1}) \prod_{n=1}^{N} p(x_n | z_n).

If the unobservable variables z_n of the state space model are discrete, the hidden Markov model (HMM) is obtained. Let the probability distribution of z_n depend on the previous state of the unobservable variable z_{n-1} via the conditional distribution p(z_n | z_{n-1}). The unobservable variables are binary, meaning that the conditional distribution corresponds to the so-called transition probabilities, given by A_{jk} = p(z_{nk} = 1 | z_{n-1,j} = 1), where z_{nk} denotes the unobservable variable attaining state k. The transition probabilities are collected in the matrix A. As they are probabilities they satisfy 0 \leq A_{jk} \leq 1 and \sum_k A_{jk} = 1. The conditional distribution for K different states can be expressed as

p(z_n | z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}}. \quad (2.11)

The initial unobservable variable z_1 cannot be defined as in (2.11) and is instead defined by a vector of probabilities \pi with elements \pi_k = p(z_{1k} = 1), so that

p(z_1 | \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}}

where \sum_k \pi_k = 1. To complete the HMM the conditional distributions of the observed variables, p(x_n | z_n, \phi), need to be defined. These conditional distributions, with parameter set \phi = [\phi_1, \ldots, \phi_K], are called emission probabilities. The emission distribution can for example be Gaussian, in which case \phi represents the parameter set needed to define the emission distribution. The emission probabilities for K states can be represented as

p(x_n | z_n, \phi) = \prod_{k=1}^{K} p(x_n | \phi_k)^{z_{nk}}.

A homogeneous HMM shares the parameters A across all of the conditional distributions of the unobservable variables, as well as \phi across all of the conditional emission distributions. The joint distribution over both unobservable and observed variables is therefore given by

p(X, Z | \theta) = p(z_1 | \pi) \prod_{n=2}^{N} p(z_n | z_{n-1}, A) \prod_{m=1}^{N} p(x_m | z_m, \phi) \quad (2.12)

where X = [x_1, \ldots, x_N], Z = [z_1, \ldots, z_N] and \theta = [\pi, A, \phi] is the set of model parameters.


2.4.1 Maximum Likelihood for the HMM

Given observed data X, the parameters of the HMM can be estimated using maximum likelihood. The likelihood function is obtained from (2.12) by summing over the unobservable variables

p(X | \theta) = \sum_{Z} p(X, Z | \theta).

To efficiently maximize the likelihood function of an HMM the expectation maximization (EM) algorithm is used. The EM algorithm is initialized by a selection of the model parameters, denoted \theta^{old}. The model parameters are often initialized randomly, subject to model constraints. In the first step of the algorithm (E), the model parameters are used to find the posterior distribution of the unobservable variables, p(Z | X, \theta^{old}). This distribution is used to evaluate the expected value of the logarithm of the complete-data likelihood function, as a function of the parameters \theta, giving the function Q(\theta, \theta^{old}) defined by

Q(\theta, \theta^{old}) = \sum_{Z} p(Z | X, \theta^{old}) \ln p(X, Z | \theta). \quad (2.13)

\gamma(z_n) is introduced as the marginal posterior distribution of an unobservable variable z_n, and \xi(z_{n-1}, z_n) as the joint posterior distribution of two successive unobservable variables, such that

\gamma(z_n) = p(z_n | X, \theta^{old})
\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, \theta^{old}).

\gamma(z_{nk}) is used to denote the conditional probability that z_{nk} = 1, and \xi(z_{n-1,j}, z_{nk}) is defined in a similar fashion. As z_{nk} and z_{n-1,j} z_{nk} are binary variables, the expected value of each is just the probability that it takes the value 1. Substituting the joint distribution from equation (2.12) into equation (2.13) and using the definitions of \gamma and \xi yields

Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk} + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n | \phi_k).

The expectation step of the EM algorithm is used to evaluate \gamma(z_{nk}) and \xi(z_{n-1,j}, z_{nk}). This can be done using the forward-backward algorithm. The maximization step treats \gamma(z_{nk}) and \xi(z_{n-1,j}, z_{nk}) as constants and maximizes Q(\theta, \theta^{old}) with respect to the parameters \theta. The maximization of Q(\theta, \theta^{old}) with respect to \pi and A gives

\pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}


A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})}. \quad (2.14)

2.4.2 Forward-Backward Algorithm

There are several forward-backward algorithms. Here the alpha-beta algorithm is used. First \gamma(z_{nk}) needs to be evaluated. According to Bayes' theorem, \gamma(z_n) can be expressed as

\gamma(z_n) = p(z_n | X) = \frac{p(X | z_n) p(z_n)}{p(X)}

where the dependence on \theta^{old} is left implicit henceforth. With the help of the conditional independence property and the product rule of probability, \gamma(z_n) can be further expressed as

\gamma(z_n) = \frac{p(x_1, \ldots, x_n, z_n) \, p(x_{n+1}, \ldots, x_N | z_n)}{p(X)} = \frac{\alpha(z_n) \beta(z_n)}{p(X)} \quad (2.15)

with

\alpha(z_n) = p(x_1, \ldots, x_n, z_n)
\beta(z_n) = p(x_{n+1}, \ldots, x_N | z_n).

The computation of \alpha(z_n) and \beta(z_n) can be done recursively. With the help of the conditional independence properties, along with the sum and product rules, \alpha(z_n) can be expressed in terms of \alpha(z_{n-1}) as

\alpha(z_n) = p(x_n | z_n) \sum_{z_{n-1}} \alpha(z_{n-1}) \, p(z_n | z_{n-1}). \quad (2.16)

An initial condition of \alpha(z_1) = p(z_1) p(x_1 | z_1) is needed to start the recursion.

During an EM optimization the value of the likelihood function p(X) is evaluated by summing both sides of equation (2.15) over z_n and using the fact that \gamma(z_n) is a normalized distribution. This can be expressed as

p(X) = \sum_{z_n} \alpha(z_n) \beta(z_n).

In the case where only the likelihood function is of interest, the equation above can be evaluated with n = N. This means that there is no need for a \beta recursion, which reduces the computational cost. The evaluation of \xi(z_{n-1}, z_n) can be derived using Bayes' theorem, the conditional independence property (citation) and the definitions of \alpha(z_n) and \beta(z_n) as

\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n | X) = \frac{\alpha(z_{n-1}) \, p(x_n | z_n) \, p(z_n | z_{n-1}) \, \beta(z_n)}{p(X)}.

Hence \xi(z_{n-1}, z_n), to be used in equation (2.14) to estimate the transition probabilities, is computable using the results of the \alpha and \beta recursions.
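The \alpha recursion of equation (2.16) can be sketched for a discrete-emission HMM. This is an illustrative forward pass on a hypothetical two-state model whose probabilities are made up for the example; it returns p(X) by summing the final \alpha values.

```python
def forward(obs, pi, A, B):
    # Alpha recursion, equation (2.16):
    #   alpha(z_1) = p(z_1) p(x_1 | z_1)
    #   alpha(z_n) = p(x_n | z_n) * sum_j alpha(z_{n-1,j}) p(z_n | z_{n-1,j})
    # Returns p(X) = sum over states of the final alpha values.
    K = len(pi)
    alpha = [pi[k] * B[k][obs[0]] for k in range(K)]
    for x in obs[1:]:
        alpha = [B[k][x] * sum(alpha[j] * A[j][k] for j in range(K))
                 for k in range(K)]
    return sum(alpha)

# Hypothetical 2-state HMM (e.g. speech / non-speech) with two observation symbols.
pi = [0.6, 0.4]                        # initial state probabilities
A = [[0.9, 0.1], [0.2, 0.8]]           # transition probabilities A[j][k]
B = [[0.7, 0.3], [0.1, 0.9]]           # emission probabilities p(x | z)
p_X = forward([0, 1, 0], pi, A, B)
```

For long observation sequences the \alpha values underflow, so practical implementations rescale each step or work with logarithms.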


Chapter 3

Method

In this chapter the chosen methods for both VAD and CNE are introduced, along with the evaluation format and implementation specifics. This work was divided into two parts. The first part revolved around the implementation and evaluation of VAD, i.e. the problem of detecting speech presence in a signal, whereas the second part was about noise estimation, i.e. the problem of trying to characterize the background noise.

3.1 VAD Methods

For this work 3 VAD methods were implemented and evaluated; however, these methods were modified for a total of 10 variants of the original 3 methods. The 3 methods were a standard VAD, a VAD method from a literature review and the method currently used by Konftel. The standard method chosen was the ES 202 050 standard by the European Telecommunications Standards Institute [8]. The ES 202 050 standard is a common benchmark in the comparison of VAD methods and will henceforth be referred to as ETSI. 2 variants of ETSI were implemented, described in detail in Section 3.4.2. The method chosen from the literature review is the statistical model-based VAD (SMVAD) as described in [26].

As mentioned in the introduction, over the years there have been some additions to this SMVAD, but they are less suited for the real-time implementation needed in a conference telephone. This is because of the non-causality of their decision rules, introduced when future observations are used to classify the current frame. To avoid this the causal decision rule was chosen. Another prominent reason for choosing the SMVAD was its flexibility, i.e. heuristic additions can easily be added. It is a solid base which can be expanded upon depending on the needs of Konftel and time constraints. A total of 7 variants of the SMVAD were implemented, described in detail in Section 3.4.2. The third and final method is the VAD used by Konftel as part of the Aurora Audio Algorithm.
