Speech Signal Extraction: A Multichannel Approach

Nedelko Grbić

Ronneby, November 1999


Department of Telecommunications and Signal Processing, University of Karlskrona/Ronneby, S-372 25 Ronneby, Sweden


ISBN: 91-630-8841-X Published 1999

Printed by Lund University, Reprocentralen, Lund 1999

Preface

This Licentiate thesis summarizes my work in the field of speech signal extraction. The work is mainly aimed at speech enhancement in communication systems such as conference telephony and handsfree mobile telephony. The work has been carried out at the Department of Telecommunications and Signal Processing at the University of Karlskrona/Ronneby, in collaboration with Ericsson Mobile Communications. The thesis consists of three parts:

N. Grbić, J. Nordberg, S. Nordholm, “Subband Acoustic Echo Cancelling using LMS and RLS,” Research Report 1999:5, ISSN: 1103-1581, University of Karlskrona/Ronneby.

N. Grbić, M. Dahl, I. Claesson, “Acoustic Echo Cancelling and Noise Suppression with Microphone Arrays,” Research Report 1999:4, ISSN: 1103-1581, University of Karlskrona/Ronneby.

N. Grbić, X. J. Tao, S. Nordholm, I. Claesson, “Blind Speech Signal Separation using Overcomplete Subband Representation,” submitted to IEEE Transactions on Speech and Audio Processing, Nov. 1999.

Parts of the papers have been presented as:

M. Dahl, I. Claesson, S. Nordholm, N. Grbić, “Adaptive Microphone Array System for Speech Enhancement,” in Proc. COST 254 Second Workshop, Toulouse, France, July 1997.

N. Grbić, M. Dahl, I. Claesson, “Neural Network Based Adaptive Microphone Array System for Speech Enhancement,” 1998 IEEE World Congress on Computational Intelligence, Anchorage, Alaska, USA, May 1998.

Acknowledgments

I would like to thank all colleagues at the Department of Telecommunications and Signal Processing for the nice atmosphere they have created for my research studies. I would especially like to thank my supervisors and dear friends, Prof. Sven Nordholm and Prof. Ingvar Claesson, for valuable mentorship and guidance in my research as well as in my studies. I wish to thank my dear mentor, closest colleague and dear friend Dr. Xiao-Jiao Tao for intense collaboration in my research studies.

My thanks also go to M.Sc. Timothy Samuels and Dr. Abbas Mohammed for their help and careful proofreading of the manuscript.

I am indebted to my family and all my friends for their support during my studies.

Finally, I express my gratitude to my beloved fiancée Marina for her understanding and comfort during my studies.

Nedelko Grbić

Ronneby, January 1999

Introduction

In speech signal extraction the aim is to extract human speech in a physical environment by using microphones. In any real world environment there are many disturbance sources that cause unwanted sound, which in turn may degrade the comprehension of the wanted speech at the microphones. These disturbances vary with the environment. They may, for instance, consist of indoor ventilators, computer fans and other disturbing noise sources. One way to reduce the environmental noise is to passively cover each source with sound absorbing material. There are, however, several drawbacks with this approach. First, there are often several minor sources that individually cause small speech degradations, but when added together can cause severe degradation. Secondly, not all sources of disturbance originate from physical devices; humans speaking in the background can also cause disturbing noise. Obviously, it can be inconvenient to cover all disturbing sources.

There are other approaches that can be used for extracting the desired speech. The microphone signals are converted from analog to digital form and fed to digital filters. By using digital signal processing techniques one can then extract the speech signal from the disturbing environment through an appropriate design of these filters. The fundamental principle that makes this extraction possible is that the physical properties and the location of the speech source differ from those of most noise sources. This implies that both spatial and temporal information may be used in the extraction process. Depending on the nature of the surrounding noise environment and the interfering sources, there are different approaches that can extract speech successfully. In this thesis, the formulation of the problem is divided into three different perspectives, into which many real world applications can be classified.

• One major disturbing and known source is to be cancelled from the desired speech.

• Several disturbing and unknown sources are present but information about the spatial location of the desired speech is known.

• Several disturbing and unknown sources are present and no information about the spatial location of the desired speech is available.

The thesis consists of three parts, each of which deals with the problem from one of these perspectives.

Part A

The first case appears in a low noise handsfree situation. The user of a handsfree set leads a conversation with a person at the far end of a communication link. The user hears the other person from a loudspeaker, and a microphone is used to pick up the user’s speech. Since both the loudspeaker and the microphone are often located close to the speaker, the microphone will also sense the speech originating from the loudspeaker. This effect will cause the communication system to send some of the speech back to the far end user, where it appears as an echo. Speech signal extraction in this context is often referred to as “acoustic echo cancelling”. The disturbing signal is known but has passed through an unknown channel. In order to subtract the disturbance from the microphone signal, the echo canceller should perform an accurate and adaptive channel estimation. The degree of difficulty of the estimation process depends highly on the room characteristics. The size and shape of the room and the material of the walls are such factors, while rearrangement of furniture inside the room as well as people’s movement will change these characteristics and in turn require a re-estimation.

The first part of this thesis deals with this estimation problem. Two different situations are evaluated: a car cabin and a conference room. The focus is on evaluation of a delayless room estimation performed in frequency subbands combined with echo cancelling performed in the time domain. It is shown that for a large room, such as the conference room, the resulting echo cancellation is improved as compared with conventional time domain techniques, in terms of both the amount of cancellation and the speed of channel estimation. It is important for the echo canceller to perform the channel estimation accurately in a short time, both to adapt initially and to track variations in the room conditions.

Part B

The second case also takes place in the handsfree situation, when the surrounding noise is of a complex structure and each noise source cannot be discerned individually. The handsfree situation in an automobile is such an example. The noise is made up of several fundamentally different sources. Examples are wind and tire friction, and fan and engine noise travelling over mechanical structures. One way to perform speech signal extraction in this situation is to allow for spatial selectivity, i.e. directional hearing. This can be accomplished by using several microphones separated in space. The principle follows from the fact that signals from different locations will impinge on the microphones at different time instants. Since digital filters have the ability to delay signals arbitrarily, one can design filters in such a way that only signals from the desired direction will pass the system. By steering the microphone array towards the person speaking one can perform speech signal extraction in situations where several noise sources are present. Temporal information can simultaneously be taken into consideration, thereby allowing discrimination of noise sources from the same direction as the speech source, provided they have a different spectral content.

The second part of this thesis deals with the problem of extracting a single speaker in an automobile in the application of handsfree communication. This is done by placing several microphones along a line in front of the driver. Since the driver keeps the same position, or only alters his position slightly, one may use this information to steer the hearing towards the correct position. Once the direction is known, there are many ways to find optimal filters, but they require the microphones to be positioned with high accuracy or carefully calibrated. This leads to a high cost of the equipment and it also causes the filters to rely solely on spatial diversity. An alternative approach is evaluated in this thesis. The core of the operation is that a signal with human speech properties, and additional known disturbances, are emitted from their respective positions and recorded. This recorded information is then used in an adaptive manner to extract the speech from the physical environment and simultaneously suppress all the unwanted noise sources. Recording signals from the real environment gives information about microphone placement and channel propagation properties, and provides more flexibility when installing a system.

Part C

The third case is the most general one, encountered in situations where many sources are present and some (or all) of the sources are to be extracted. The positions of the sources are considered unknown and the principle assumes no access to the properties of disturbing or desired sources. The extraction relies on information theoretic measures of the microphone signals. The problem is undetermined in the sense that exact restoration of the original sources cannot be accomplished. This follows from the fact that both the sources and the channels they have passed through are unknown. This, in turn, leads to an indeterminacy of scale and permutation: when the speaker lowers his voice and the channels provide less attenuation, the exact same signal will appear at the microphone. If two sources trade places and at the same time the microphones trade places, the same output signals will appear. Nevertheless, the sources can still be extracted from the environment, but without restoration of the exact scale and permutation. The problem can be viewed as an inverse multichannel estimation process and it is often referred to as “blind signal separation”.
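The scale and permutation indeterminacy can be illustrated with a small numerical sketch. It assumes an instantaneous (memoryless) 2x2 mixing model, which is a simplification of the convolutive room channels treated in the thesis; the matrix `A`, the source values `s` and the helper `mix` are hypothetical and chosen only for illustration.

```python
def mix(channels, sources):
    """Instantaneous mixing x = A s for one sample (lists of floats)."""
    return [sum(a * s for a, s in zip(row, sources)) for row in channels]

# Assumed toy mixing matrix (rows: microphones, columns: sources).
A = [[1.0, 0.5],
     [0.3, 1.0]]
s = [0.8, -0.2]

# Scale ambiguity: halve source 0 and double the gain of its channel column.
A_scaled = [[2.0 * A[0][0], A[0][1]],
            [2.0 * A[1][0], A[1][1]]]
s_scaled = [0.5 * s[0], s[1]]

# Permutation ambiguity: swap the two sources together with their columns.
A_perm = [[row[1], row[0]] for row in A]
s_perm = [s[1], s[0]]

x = mix(A, s)
x_scaled = mix(A_scaled, s_scaled)
x_perm = mix(A_perm, s_perm)
print(x, x_scaled, x_perm)
```

All three source and channel combinations produce the same microphone samples, so an algorithm that only observes the microphones cannot tell them apart.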

The third part of this thesis deals with the problem of two persons speaking simultaneously in a conference room, where two microphones receive the speech signals. The extraction is performed in the time domain with linear filters, whereas the inverse channel estimation is performed in the frequency domain. Both artificial and real world scenarios are evaluated and compared. The extension of the principle to more sources and microphones is straightforward.

Subband Acoustic Echo Cancelling using LMS and RLS

Nedelko Grbić, Jörgen Nordberg, Sven Nordholm

University of Karlskrona/Ronneby, Department of Signal Processing, Sweden

Abstract

The increasing use of modern hands free communication systems such as video conferencing, computer communications, and vehicle mounted cellular telephones brings the demand for high-quality acoustic echo cancellation into focus. In these applications the echo path to be identified typically has a long time duration, on the order of 100 ms, so the identification filter must be long.

This report evaluates the Normalized Least Mean Square (NLMS) and the Weighted Recursive Least Square (WRLS) algorithms for acoustic echo cancelling using a delayless subband scheme. Subband signal processing has been shown to be efficient with respect to both convergence rate and level of echo suppression.

The evaluation is performed for real speech signals sampled from a conversation using a hands free set mounted in an automobile, and a conversation using conference telephony equipment in a conference room. A comparison of subband and fullband algorithms is made both with respect to the computational cost and level of echo suppression.

Results show that when the impulse response is very long, i.e. in such environments as conference rooms, the subband approach is beneficial. In a car environment the size of enclosure and damping means that the response is quite short and a conventional echo canceller could perform as well as a subband echo canceller. In the study, finite word length effects have not been considered.

The LMS algorithm can perform as well as the RLS algorithm when implemented in the subband scheme and using an energy detector. The computational cost is reduced substantially for the RLS algorithm when implemented in subbands, while keeping most of its performance.


1 Introduction

In modern hands free communication systems such as hands free car phones, loudspeaker phones and video conferencing systems, it is necessary to perform an acoustic echo cancellation of the far-end speaker [2, 3, 4]. The echo cancellation system is made adaptive in order to track variations in the acoustic channel. The filter length of the acoustic canceller is typically 500-1500 FIR taps for normal sampling frequencies. Long filters imply a large computational burden and a slow convergence rate. The slow convergence rate is especially obvious for signals with a large spectral dynamic range, such as speech signals. A subband echo canceller [5, 6] gives several advantages when compared to a fullband echo canceller, such as:

1. The computational burden is essentially reduced by the number of subbands due to decimation.

2. A faster convergence since the spectral dynamic range in each subband will be less.

3. A signal controlled adaptation can be performed in each subband individually, hence enhanced performance.

4. A well separated structure for parallel implementation.

This paper evaluates a version of a delayless subband adaptive filter presented by Morgan and Thi [6]. The evaluation is performed for speech signals where the suppression and the convergence are compared for the Normalized Least Mean Square (NLMS) algorithm and the Weighted Recursive Least Square (WRLS) algorithm. The evaluation also includes the use of a simple energy detector in the subbands.

2 System Overview

An acoustic echo canceller, see Figure 1, identifies the channel between the loudspeaker and the hands free microphone. This identified impulse response is then employed to achieve a suppression of the echo. One of the fundamental characteristics of this channel is the bulk delay. A typical distance between loudspeaker and microphone is 1 m. This separation corresponds to a 3 ms delay and with 8-12 kHz sample frequency this corresponds to about 20-30 samples. However, an FIR filter with 50 taps will only characterize the direct wave and give a suppression of about 5-10 dB. In order to achieve the suppression goal, which is 30-40 dB, filter lengths of 500-2000 FIR taps become necessary. The filter should also be able to track variations in the acoustic environment. An appealing approach is to use a multirate technique, since such a technique reduces the computational burden and also gives a faster convergence rate. The latter is due to the reduction of spectral dynamic range in each subband. Since the identification of the acoustic path must be done on the basis of speech signals, the spectral range plays an important role in the final performance. A major drawback is the delay which is introduced by the filter bank. This delay can, however, be circumvented by using a modified structure for the subband adaptive filter [6]. In conventional subband structures the delay introduced by the filter bank acts on the signals as well as on the adaptation; in this modified structure only the adaptation is affected by this delay.
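The delay and filter length figures above can be checked with a few lines of arithmetic. The sketch below assumes a speed of sound of 343 m/s, which is not stated in the report, and the function names are hypothetical. Under that assumption a 1 m path gives roughly 23 to 35 samples at 8 to 12 kHz, the same order as the figures quoted above, and an echo tail of about 100 ms at 8 kHz calls for several hundred taps.

```python
def bulk_delay_samples(distance_m: float, sample_rate_hz: float,
                       speed_of_sound_m_s: float = 343.0) -> float:
    """Direct-path propagation delay expressed in samples (343 m/s assumed)."""
    delay_s = distance_m / speed_of_sound_m_s
    return delay_s * sample_rate_hz

def taps_for_duration(duration_s: float, sample_rate_hz: float) -> int:
    """Number of FIR taps needed to span an impulse response of given length."""
    return round(duration_s * sample_rate_hz)

for fs in (8000.0, 12000.0):
    print(f"{fs:.0f} Hz: bulk delay {bulk_delay_samples(1.0, fs):.1f} samples, "
          f"100 ms tail {taps_for_duration(0.1, fs)} taps")
```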

2.1 The Delayless Subband Adaptive Filter

The delayless attribute of this technique comes from the fact that the new adaptive weights are computed in subbands and then transformed to an equivalent fullband filter by means of an inverse FFT, see Figure 1. The filter works in real time on the loudspeaker signal. The coefficients are calculated separately in each band. They can be calculated either by employing the error signal e(k) (closed loop case) or the microphone input signal d(k) (open loop case). When the signal d(k) is used, a local error signal is created in each band. In this case the calculations do not need to be performed in real time. This approach will, however, give less final suppression since the algorithm is working blindly with respect to the real error signal. The fullband signal is divided into several subband signals by using a polyphase FFT technique [7].

2.2 Polyphase FFT Filter Bank

A set of $M$ filters is said to be a uniform DFT filter bank if they are related as

$$H_l(z) = H_0(zW^l) = \sum_{n=-\infty}^{\infty} h_0(n)\,(zW^l)^{-n}, \tag{1}$$

where $W = e^{-j2\pi/M}$ and $l \in [0, M-1]$. The polyphase decomposition can be used to implement such a filter bank in an efficient manner [7]. The number of filters in the filter bank is $M$, thus the passband frequency of the prototype filter should be set to $\frac{1}{2M}$. Since only fullband filters with real coefficients are considered, it is enough to calculate $\frac{M}{2}+1$ complex subband signals. In order to reduce aliasing, the signals in the filter bank are decimated by a factor of only $\frac{M}{2}$. The polyphase decomposition of the DFT filter bank is performed accordingly. The resulting filters after decimation will have passbands centered at dc for even subbands, while passbands for odd subbands will be centered at the normalized frequency $\frac{1}{2}$, see Figure 2.
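Eq. (1) says that every filter in the bank is a frequency-shifted copy of the prototype. A minimal sketch of this relation follows, assuming a short causal FIR prototype with made-up coefficients; the function names are hypothetical. Expanding $(zW^l)^{-n}$ gives the impulse response $h_l(n) = h_0(n)W^{-ln}$, and the frequency response of band $l$ is the prototype response shifted by $2\pi l/M$.

```python
import cmath
import math

def modulated_filter(h0, l, M):
    """Impulse response of H_l(z) = H_0(z W^l), with W = exp(-j 2 pi / M).

    Expanding (z W^l)^(-n) gives h_l(n) = h0(n) * W^(-l n).
    """
    W = cmath.exp(-2j * math.pi / M)
    return [h * W ** (-l * n) for n, h in enumerate(h0)]

def freq_response(h, omega):
    """Evaluate H(e^{j omega}) = sum_n h(n) e^{-j omega n}."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h))

M = 8
h0 = [0.1, 0.25, 0.3, 0.25, 0.1]   # toy lowpass prototype (assumed values)
l = 3
h_l = modulated_filter(h0, l, M)

omega = 1.0
band_value = freq_response(h_l, omega)
shifted_prototype = freq_response(h0, omega - 2 * math.pi * l / M)
print(abs(band_value - shifted_prototype))
```

The printed difference is at rounding level, which confirms that only the prototype needs to be designed; the other bands follow by modulation.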

The prototype filter $H_0(z)$ is polyphase decomposed as

$$H_0(z) = \sum_{n=-\infty}^{\infty} h_0(n)\,z^{-n} = \sum_{m=0}^{M/2-1} z^{-m} \sum_{n=-\infty}^{\infty} h_0\!\left(n\tfrac{M}{2}+m\right) z^{-nM/2}. \tag{2}$$

Figure 1: Delayless subband acoustic echo canceller; position A gives the open loop configuration and position B the closed loop configuration.

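For an arbitrary filter in the filter bank, Eqs. (1) and (2) yield

$$H_l(z) = \sum_{n=-\infty}^{\infty} h_0(n)\,(W^l z)^{-n} = \sum_{m=0}^{M/2-1} (W^l z)^{-m} \sum_{n=-\infty}^{\infty} h_0\!\left(n\tfrac{M}{2}+m\right) (W^l z)^{-nM/2}, \tag{3}$$

where

$$W^{-lnM/2} = \left(e^{j\pi l}\right)^{n} = \begin{cases} (-1)^n, & l \text{ odd} \\ 1, & l \text{ even.} \end{cases} \tag{4}$$

Eq. (4) indicates that odd and even subbands are treated slightly differently.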

For odd $l$, Eq. (3) yields

$$H_l(z) = \sum_{m=0}^{M/2-1} (W^l z)^{-m} \sum_{n=-\infty}^{\infty} h_0\!\left(n\tfrac{M}{2}+m\right) (-1)^n z^{-nM/2}. \tag{5}$$

Defining $E'_m(z)$ as

$$E'_m(z) = \sum_{n=-\infty}^{\infty} h_0\!\left(n\tfrac{M}{2}+m\right) (-1)^n z^{-n}, \tag{6}$$

Eq. (5) can be rewritten as

$$H_l(z) = \sum_{m=0}^{M/2-1} (W^l z)^{-m} E'_m(z^{M/2}). \tag{7}$$

For even $l$, Eq. (3) yields

$$H_l(z) = \sum_{m=0}^{M/2-1} (W^l z)^{-m} \sum_{n=-\infty}^{\infty} h_0\!\left(n\tfrac{M}{2}+m\right) z^{-nM/2}. \tag{8}$$

Defining $E_m(z)$ as

$$E_m(z) = \sum_{n=-\infty}^{\infty} h_0\!\left(n\tfrac{M}{2}+m\right) z^{-n}, \tag{9}$$

Eq. (8) can be rewritten as

$$H_l(z) = \sum_{m=0}^{M/2-1} (W^l z)^{-m} E_m(z^{M/2}). \tag{10}$$

This means that the polyphase filter bank can be divided into two filter structures: one for even subbands and one for odd subbands, see Figure 3.

Figure 2: Filter bank response for even and odd subbands after decimation.

Figure 3: A filter bank design with polyphase FFT where even and odd subbands are calculated separately.
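The split into even and odd structures can be verified numerically. The sketch below assumes a causal FIR prototype with arbitrary coefficients and evaluates the band filter at one point on the unit circle in two ways: directly from Eq. (1), and through the polyphase form of Eqs. (7) and (10), where the sign factor of Eq. (4) is applied only for odd $l$. The function names and test values are hypothetical.

```python
import cmath
import math

def direct_response(h0, l, M, z):
    """H_l(z) = sum_n h0(n) (W^l z)^(-n), as in Eq. (1)."""
    W = cmath.exp(-2j * math.pi / M)
    return sum(h * (W ** l * z) ** (-n) for n, h in enumerate(h0))

def polyphase_response(h0, l, M, z):
    """H_l(z) = sum_m (W^l z)^(-m) E_m(z^(M/2)), as in Eqs. (7) and (10)."""
    D = M // 2
    W = cmath.exp(-2j * math.pi / M)
    sign = -1.0 if l % 2 else 1.0      # Eq. (4): (-1)^n for odd l, 1 for even l
    total = 0j
    for m in range(D):
        branch = h0[m::D]              # h0(n*D + m) for n = 0, 1, ...
        e_m = sum(c * sign ** n * (z ** D) ** (-n) for n, c in enumerate(branch))
        total += (W ** l * z) ** (-m) * e_m
    return total

M = 8
h0 = [0.02, 0.05, 0.1, 0.15, 0.18, 0.18, 0.15, 0.1, 0.05, 0.02, 0.01]  # assumed
z = cmath.exp(0.7j)
worst = max(abs(direct_response(h0, l, M, z) - polyphase_response(h0, l, M, z))
            for l in range(M))
print(worst)
```

Only the sign pattern of the branch filters differs between even and odd bands, which is why two polyphase structures are sufficient for the whole bank.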

2.3 Transformation of subband filter coefficients to fullband filter coefficients

If the fullband filter has $L$ taps, the filter length in each subband will be $L/D$, where $D = M/2$. An $L/D$ point FFT will be calculated based on the adaptive weights in each subband. These subband weights are subsequently stacked to form an $L/2$ element array, $[1, 2, \ldots, L/2]$. The array is then completed by setting the element indexed $L/2+1$ to zero and using the complex conjugate of elements $[2, 3, \ldots, L/2]$ in reverse order. Finally, the new $L$ element array is transformed by an $L$ point inverse FFT to obtain the fullband filter weights.

The rule for this transformation in the FFT-domain can be described as follows.

Denote the fullband filter FFT bins as $H_p(k)$ and the $i$-th subband filter FFT bins as $H_s^i(n)$, where $i = \{0, 1, 2, \ldots, M/2\}$, $n = \{1, 2, \ldots, L/D\}$ and $k = \{1, 2, \ldots, L\}$.

By observing Figure 2 the relation between the fullband and the subband frequency mapping can be determined. Since FFT is used, the transformation rule becomes a stacking procedure according to the following:

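$$
H_p(k) =
\begin{cases}
H_s^{0}\!\left(k \bmod \tfrac{L}{2M}\right), & 1 \le k \le \tfrac{L}{2M} \\[4pt]
H_s^{i}\!\left(k \bmod \tfrac{L}{2M}\right), & i \text{ odd},\ (2i-1)\tfrac{L}{2M}+1 \le k \le (2i+1)\tfrac{L}{2M} \\[4pt]
H_s^{i}\!\left(k \bmod \tfrac{L}{2M} + 3\tfrac{L}{2M}\right), & i \text{ even},\ (2i-1)\tfrac{L}{2M}+1 \le k \le \tfrac{2iL}{2M} \\[4pt]
H_s^{i}\!\left(k \bmod \tfrac{L}{2M}\right), & i \text{ even},\ \tfrac{2iL}{2M} < k \le (2i+1)\tfrac{L}{2M} \\[4pt]
H_s^{M/2}\!\left(k \bmod \tfrac{L}{2M}\right), & (M-1)\tfrac{L}{2M}+1 \le k \le \tfrac{L}{2}
\end{cases}
$$

where $i$ is an index determined by

$$i = \operatorname{floor}\!\left(\frac{kM}{L} + \frac{1}{2} - \frac{1}{2M}\right).$$

Floor means the largest integer not exceeding the argument.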

Now, since the fullband FIR filter is real valued and the FFT operator is defined on discretized frequencies in the range $[0, 2\pi]$, the conjugate is taken in reverse order to determine the mirror part of $H_p(k)$ as

$$H_p(k) = \operatorname{conj}\{H_p(L-k+2)\}, \quad \tfrac{L}{2}+2 \le k \le L,$$

and

$$H_p\!\left(\tfrac{L}{2}+1\right) = 0.$$

The fullband time-domain representation is determined by $h_p(n) = \operatorname{IFFT}\{H_p(k)\}$.
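The completion step can be sketched as follows. The code uses 0-based indices, whereas the text uses 1-based ones: bin $L/2$ is set to zero and bin $L-k$ is the conjugate of bin $k$, which guarantees a real valued inverse transform. The stacked bins in `half` are made-up values with a real dc bin, and a plain inverse DFT stands in for the FFT purely for self-containment; the function names are hypothetical.

```python
import cmath
import math

def complete_spectrum(half):
    """Build L bins from the first L/2 stacked bins (0-based indexing).

    Bin L/2 is set to zero and bins L/2+1 .. L-1 are the conjugates of
    bins L/2-1 .. 1, so that the inverse transform is real valued.
    """
    mirror = [c.conjugate() for c in reversed(half[1:])]
    return list(half) + [0j] + mirror

def inverse_dft(H):
    """Plain O(L^2) inverse DFT; an FFT would be used in practice."""
    L = len(H)
    return [sum(H[k] * cmath.exp(2j * math.pi * k * n / L) for k in range(L)) / L
            for n in range(L)]

half = [1.0 + 0j, 0.5 - 0.2j, -0.1 + 0.3j, 0.05 + 0.1j]   # toy stacked bins
H_p = complete_spectrum(half)
h_p = inverse_dft(H_p)
print(max(abs(v.imag) for v in h_p))
```

The printed imaginary residue is at rounding level, so the real parts of `h_p` can be used directly as fullband FIR weights.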

3 Structure Evaluation

The delayless subband echo canceller is evaluated by using the Normalized Least Mean Square (NLMS) algorithm and the Weighted Recursive Least Square (WRLS) algorithm in the subbands according to Figure 1. The suppression ratio is evaluated for the acoustic response in a situation using a car-mounted mobile hands free set and for the response in a conference room environment. The performances are compared with the conventional fullband implementation for the same two environments.

3.1 Least Mean Square versus Recursive Least Square

It is well known that the NLMS algorithm has low complexity, but slower convergence and higher excess mean square error when compared to the WRLS algorithm [8]. These characteristics are not always the case when dealing with speech signals as presented in the simulations. The NLMS performance can in some cases be boosted by introducing an energy detector, as described in Section 3.3.

The NLMS algorithm will be referred to as LMS and the WRLS will be referred to as RLS in the following.
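For reference, a minimal real valued NLMS identification loop is sketched below. It is a generic textbook form, not the subband implementation evaluated in this report (which operates on complex subband signals); the step size `mu`, the regularization `eps`, the toy echo path and the white-noise excitation are all assumptions made for illustration.

```python
import random

def nlms_identify(x, d, num_taps, mu=0.5, eps=1e-8):
    """Normalized LMS: adapt FIR weights w so that the filtered x tracks d.

    Returns the final weights and the a-priori error sequence e(k).
    """
    w = [0.0] * num_taps
    buf = [0.0] * num_taps            # most recent input sample first
    errors = []
    for xk, dk in zip(x, d):
        buf = [xk] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))
        e = dk - y
        norm = sum(bi * bi for bi in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        errors.append(e)
    return w, errors

random.seed(0)
true_channel = [0.0, 0.6, -0.3, 0.1]               # toy echo path (assumed)
x = [random.gauss(0.0, 1.0) for _ in range(2000)]  # white-noise excitation
d = [sum(h * x[k - i] for i, h in enumerate(true_channel) if k - i >= 0)
     for k in range(len(x))]
w, e = nlms_identify(x, d, num_taps=4)
print([round(wi, 4) for wi in w])
```

With white excitation and no noise the weights settle on the toy channel quickly; with speech input the convergence is slower, which is the behaviour the subband structure is meant to improve.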


3.2 Fullband versus Subband

The problem at hand is to identify the acoustic channel in the region of frequencies where the input signal has energy. The behavior of the identification in the frequency regions with no excitation can be arbitrary while still yielding high performance in the cancellation. These regions have, of course, some energy due to finite time effects, but the magnitude is small. The ratio between the highest and the lowest regional magnitude gives a measure of the condition of the problem. A lower ratio gives a better condition [9].
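The link between spectral dynamic range and conditioning can be illustrated with the smallest possible case, a 2x2 autocorrelation matrix. This is a toy sketch with assumed correlation values, not data from the report. For the symmetric Toeplitz matrix [[r0, r1], [r1, r0]] the eigenvalues are r0 + r1 and r0 - r1, so a strongly correlated (spectrally colored) input gives a large eigenvalue ratio, while a white input gives a ratio of one.

```python
def eigenvalue_spread_2x2(r0: float, r1: float) -> float:
    """Ratio of largest to smallest eigenvalue of [[r0, r1], [r1, r0]].

    The eigenvalues of this symmetric Toeplitz matrix are r0 + r1 and r0 - r1.
    """
    lam_max = r0 + abs(r1)
    lam_min = r0 - abs(r1)
    return lam_max / lam_min

white = eigenvalue_spread_2x2(1.0, 0.0)     # uncorrelated input
colored = eigenvalue_spread_2x2(1.0, 0.9)   # strongly correlated input
print(white, colored)
```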

In the fullband realization of the identification using RLS, the solution is very unstable because the ratio between the maximal and the minimal singular value of the correlation matrix is very high [10]. Figure 4 shows the singular values of the estimated autocorrelation matrix, $R_{xx}$, of the input signal to the echo canceller in the car hands free situation. The solid line shows the maximum accuracy for determining the inverse of this matrix for the method used. The inverse is calculated as a pseudo inverse where the singular values of magnitude below this accuracy are discarded. This value is chosen such that the instability of the inversion will not be increased due to quantization.

Figure 4: Eigenvalue plot of the estimated autocorrelation matrix using the fullband scheme (eigenvalue magnitude in dB versus eigenvalue number in ascending order); the dashed line shows the minimal eigenvalue possible for the pseudo inverse.

This result shows that the fullband identification problem is ill-conditioned. Regarding the problem instead as several subband identification problems will result in several individually well-conditioned problems. Figure 5 shows the singular values of the first 9 subband estimated correlation matrices for the 16-subband implementation. Since the input signal is real valued there is no additional information in the last 7 subbands. The correlation matrices are estimated for the whole speech sequence for which the evaluation is made. It shows that the ratio between the largest and the smallest singular values has been reduced. It should be noted that the singular values have been estimated over a 4 second sequence. However, at a certain time instant the spectral content in the signal can be such that the correlation matrix estimate is singular due to the weighting, which in turn will lead to an unstable RLS algorithm implementation.

Figure 5: Eigenvalue plots of the estimated autocorrelation matrices using the subband scheme (eigenvalue magnitude in dB versus eigenvalue number, one panel per subband 0 to 8); the solid line shows the minimal eigenvalue possible for the pseudo inverse. The number of subbands is 16 and the first 9 are shown.

3.3 The use of energy detectors in the subbands

An energy detector (ED) can be introduced in order to stop updating the algorithm in those subbands where the excitation is poor. In this way the performance will be kept high in the fullband identification problem, by keeping the filter weights unaltered. Since the condition number in the fullband identification problem depends on the input signal’s spectral content, which is time-varying, the energy detector will act as a limiter of the instantaneous worst-case condition. The use of an energy detector is therefore crucial when the input has the character of speech signals. The use of ED gives more additional identification accuracy for the LMS than for the RLS. This comes from the fact that the RLS by itself equalizes the spectral ratios, as long as they do not exceed the dynamic range of the processor.

The introduction of energy detectors gives better fullband accuracy and also increases the convergence rate of the total system, as shown in Section 3.4.
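A simple form of such a detector can be sketched as follows. The report does not specify how the detector is built, so the block-wise power estimate, the fixed threshold and the function names below are assumptions chosen for illustration: the adaptation is run only in the subbands whose short-time input power exceeds the threshold, and the weights in the remaining subbands are left unaltered.

```python
def short_time_power(block):
    """Mean squared magnitude of a block of (possibly complex) samples."""
    return sum(abs(v) ** 2 for v in block) / len(block)

def bands_to_adapt(subband_blocks, threshold):
    """Return the indices of subbands whose excitation power exceeds threshold."""
    return [i for i, block in enumerate(subband_blocks)
            if short_time_power(block) > threshold]

blocks = [
    [0.001, -0.002, 0.001],        # poorly excited band: skip the update
    [0.5 + 0.1j, -0.4j, 0.3],      # well excited band: adapt
    [0.0, 0.0, 0.0],               # silent band: skip the update
]
active = bands_to_adapt(blocks, threshold=1e-3)
print(active)
```

Skipping the update also saves the computations for the inactive bands, which is consistent with the cost reduction reported for the RLS with ED.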

3.4 The subband identification problem

Even though the main objective of the whole system is to make an accurate fullband identification of the acoustic channel, the subband identifications are also sensitive to a high spectral range. Figure 6 shows the individual subband condition numbers. It can be seen that the condition is poor mainly in the low frequency subbands. For the speech signal evaluated, the excitation is low in this range. Therefore the subbands affected by the energy detector are the ones which have high condition numbers locally. Thus, by adapting the algorithms only in those subbands where excitation exists, the fullband identification as well as the subband identification becomes more stable and in most cases more accurate.

Figure 6: Magnitude of the condition numbers (in dB, Euclidean norm) of the subband autocorrelation matrices versus subband number, shown for 16, 32, 64 and 128 subbands.

4 Performance Evaluation

4.1 Evaluation preliminaries

All results are based on a four second sequence of true speech sampled in a real environment. Two environments are evaluated here: a hands free mobile telephone set mounted inside a car cabin, and a conference telephony set placed in a typical conference room. The algorithms are compared for the fullband and the subband structures in terms of computational cost, measured in flops per sample of input data, and suppression ratio. The suppression ratios presented in the following sections are calculated as (using notation as in Figure 1)

$$\text{Average Suppression [dB]} = \frac{10}{N} \sum_{i=1}^{N} \log_{10} \frac{|d(i)|^2}{|e(i)|^2},$$

where $N$ denotes the number of samples in the sequence over the period of time where speech exists. The suppression figures presented in the appendix show the suppression over time as short-time (80 ms) power estimates.
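The measure can be computed directly from the microphone and error sequences. The sketch below assumes that the sequences contain no zero samples (the text restricts the average to periods where speech exists) and uses made-up values; the function name is hypothetical.

```python
import math

def average_suppression_db(d, e):
    """(10 / N) * sum_i log10(|d(i)|^2 / |e(i)|^2) over N sample pairs.

    Assumes that d and e have equal length and contain no zero samples.
    """
    n = len(d)
    return 10.0 / n * sum(
        math.log10(abs(di) ** 2 / abs(ei) ** 2) for di, ei in zip(d, e))

# Toy data: the error is the microphone signal attenuated by a factor of ten.
d = [1.0, -2.0, 4.0, 0.5]
e = [0.1, -0.2, 0.4, 0.05]
print(average_suppression_db(d, e))
```

Because the logarithm is taken per sample before averaging, this is an average of per-sample suppression in dB rather than a ratio of total energies.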

The evaluation is performed for two structures: 32 and 64 subbands for the car hands free evaluation, and 256 and 512 subbands for the conference room evaluation. The fullband solutions to the same problems are calculated for comparison. The number of FIR filter parameters for the acoustic identification is 512 for the car hands free situation and 2048 for the conference telephony situation. Since the condition of the fullband identification problem is poor, as described in Section 3, the ordinary RLS does not converge. For comparison, the fullband least square solution is calculated instead. This calculation is done by omitting the singular values below a certain threshold for the pseudo inverse, as shown in Figure 4. The Wiener-Hopf equation is then solved off-line by using this pseudo inverse of the correlation matrix for the input sequence.

4.2 The hands free automobile environment

Average suppressions for the subband implementations are shown in Tables 1 and 2 for 32 and 64 subbands, respectively. These results show that significant improvement can be achieved by introducing the energy detector for the LMS algorithm, but it still does not perform as well as the fullband solution.

The computational cost per input sample is shown in Tables 3 and 4. The gain in computational cost when using the subband approach is significant for the RLS algorithm. For the LMS algorithm the break-even point relative to the fullband computational cost comes at 64 subbands. The use of the energy detector also gives substantial savings in computations for the RLS, while the suppression level is almost the same.

Average Suppression (dB), #SB=32    LMS     RLS       Improvement RLS   Figure
Fullband                            16.3    17.1 *)   0.8               11
Open Loop                           8.6     15.2      4.6               12
Closed Loop                         3.8     10.3      6.5               13
Open Loop with ED                   13.9    14.7      0.8               14
Closed Loop with ED                 14.1    14.4      0.3               15
Average Improvement ED (off/on)     7.8     1.8       -                 -

Table 1. Average suppression in decibels by using a 32 subband implementation.

*) The fullband RLS is calculated off-line by solving the Wiener-Hopf equations.

Average Suppression (dB), #SB=64    LMS     RLS       Improvement RLS   Figure
Fullband                            16.3    17.1 *)   0.8               11
Open Loop                           7.8     12.3      4.5               16
Closed Loop                         -6.6    11.5      18.1              17
Open Loop with ED                   13.2    11.9      -1.3              18
Closed Loop with ED                 12.9    11.2      -1.7              19
Average Improvement ED (off/on)     12.5    -0.35     -                 -

Table 2. Average suppression in decibels by using a 64 subband implementation.

*) The fullband RLS is calculated off-line by solving the Wiener-Hopf equations.

Computational Flops/sample x1000, #SB=32   LMS     RLS          Difference
Fullband                                   2.55    1039.90 *)   1659.75
Open Loop                                  4.46    29.86        40.64
Closed Loop                                4.19    29.61        40.67
Open Loop with ED                          4.19    11.40        11.53
Closed Loop with ED                        3.93    11.32        11.83
Average Improvement ED (off/on)            0.42    29.40        28.98

Table 3. Computational cost per input sample for a 32 subband implementation.

Computational Flops/sample x1000, #SB=64   LMS     RLS          Difference
Fullband                                   2.55    1039.90 *)   1659.75
Open Loop                                  2.73    9.22         10.38
Closed Loop                                2.60    9.09         10.39
Average Improvement ED (off/on)            0.21    7.52         7.31

Table 4. Computational cost per input sample for a 64 subband implementation.

Figures 7 and 8 show the relative average suppression calculated as

$$\text{Average Suppression}(x\,\%)\ [\text{dB}] = \frac{10}{N(1 - x/100)} \sum_{i = Nx/100 + 1}^{N} \log \frac{|d(i)|^2}{|e(i)|^2}$$

which is the average taken when leaving out the first x percent of the sequence, normalized by the suppression achieved by the fullband LMS. This average shows the final values of the suppression after initial convergence. It can be seen that all subband implementations reach almost the same suppression at the end of the sequence for the 32 subband implementation. For the 64 subband implementation the closed loop LMS system has still not converged. It should be noted that the accuracy of the suppression levels decreases as x increases, since the average is taken over a shorter sequence.
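The measure used in Figures 7 and 8 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the logarithm is taken as base 10 (since the result is given in dB), d is read as the signal before cancellation and e as the residual after it, and the small floor `eps` as well as both function names are additions for this illustration only.

```python
import math


def average_suppression_db(d, e, x_percent, eps=1e-12):
    """Average of 10*log10(|d(i)|^2 / |e(i)|^2) over the samples that
    remain after skipping the first x_percent of the sequence.

    eps is a small floor (an assumption, not part of the formula) that
    avoids taking the logarithm of zero on silent samples.
    """
    n = len(d)
    start = int(n * x_percent / 100)  # 0-based index of the first kept sample
    if start >= n:
        raise ValueError("x_percent leaves no samples to average")
    total = 0.0
    for i in range(start, n):
        total += math.log10((abs(d[i]) ** 2 + eps) / (abs(e[i]) ** 2 + eps))
    return 10.0 * total / (n - start)


def relative_suppression(d, e, e_fullband_lms, x_percent):
    """Suppression of one structure divided by that of the fullband LMS."""
    return (average_suppression_db(d, e, x_percent)
            / average_suppression_db(d, e_fullband_lms, x_percent))


# Toy data: a constant-amplitude signal and two constant residuals.
d = [1.0] * 100
e_subband = [0.1] * 100     # about 20 dB suppression
e_fullband = [0.01] * 100   # about 40 dB suppression
print(relative_suppression(d, e_subband, e_fullband, x_percent=25))
```

With these toy signals the ratio is close to 0.5, i.e. the hypothetical subband structure reaches half the fullband suppression measured in dB.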

Figure 7: Average suppression after some initial time percentage of sequence. Number of subbands is 32. Car hands free evaluation. Vertical axis: suppression relative to the fullband LMS; horizontal axis: x %. Curves: Fullband LMS, Fullband RLS *), OL.LMS.ed, OL.RLS.ed, CL.LMS.ed, CL.RLS.ed, OL.LMS, OL.RLS, CL.LMS, CL.RLS.

4.3 The conference telephony environment

Average suppressions for the subband implementations are shown in Tables 5 and 6 for 256 and 512 subbands, respectively. The computational cost per input sample is shown in Tables 7 and 8. In this environment the open loop subband implementation is more efficient than the fullband solution in terms of both suppression and computational cost. There is a trade-off between cost and performance in the choice of the number of subbands.

In the 256 subband realization the computational cost is cut to about one half while the suppression level is improved compared with the fullband LMS realization. Increasing the number of subbands reduces the performance and gives only a slightly lower computational cost. The choice of the number of subbands is crucial for the total echo cancelling performance.

The energy detector has a great impact on the suppression performance when the LMS algorithm is used in the subbands. With the energy detector, the suppression level is substantially improved for the LMS algorithm while it remains the same for the RLS. In this environment the saving in computational cost from the energy detector is small for both the LMS and the RLS realizations, since the amount of computation needed for the adaptation is already small due to the high number of subbands.

Figure 8: Average suppression after some initial time percentage of sequence. Number of subbands is 64. Car hands free evaluation.

It is interesting to notice that, in the closed loop case, the RLS performs poorly in this environment. This deficiency can be explained by the delay that the filter bank introduces on the error signal. The higher the number of subbands, the more delay is introduced. Since the error signal is delayed relative to the input signal, the direct least squares solution is misled by the most recent samples of input data. Here, the weighting plays an important role. On the one hand, the weighting should be set so that the channel tracking requirements are met; on the other hand, the delay of the error signal degrades performance when recent information is weighted more heavily. This trade-off depends strongly on the acoustic situation and is difficult to resolve in practice.
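One way to get a feel for this trade-off is to assume that the weighting is the usual exponential forgetting factor, written here as `lam`, so that a sample k steps in the past receives the weight lam**k. Under that assumption the sketch below computes the approximate memory of the criterion and the share of the total weight that falls on the newest samples. The forgetting factors, the delay value and the function names are hypothetical and are not taken from the evaluation.

```python
def effective_memory(lam):
    """Sum of lam**k for k >= 0: the approximate number of samples that an
    exponentially weighted least squares criterion 'remembers'."""
    return 1.0 / (1.0 - lam)


def recent_weight_fraction(lam, delay):
    """Share of the total weight carried by the `delay` newest samples:
    sum(lam**k, k < delay) / sum(lam**k, k >= 0) = 1 - lam**delay."""
    return 1.0 - lam ** delay


delay = 256  # hypothetical filter bank delay, in samples
for lam in (0.99, 0.999, 0.9999):
    print(lam,
          round(effective_memory(lam)),
          round(recent_weight_fraction(lam, delay), 3))
```

A forgetting factor close to one spreads the weight over a long window, so a fixed delay touches only a small share of it, but tracking of the acoustic path becomes slower. A smaller factor tracks faster while concentrating the weight on exactly the samples affected by the delay.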

Figures 9 and 10 show the average suppression relative to the fullband LMS implementation after initial convergence time, for the structures evaluated.

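| Average Suppression (dB), #SB=256 | LMS | RLS | Improvement RLS | Figure |
|---|---|---|---|---|
| Fullband | 13.2 | 15.6 *) | 2.4 | 20 |
| Open Loop | -2.0 | 16.9 | 18.9 | 21 |
| Closed Loop | -20.9 | -9.2 | 11.7 | 22 |
| Open Loop with ED | 13.3 | 16.3 | 3.0 | 23 |
| Closed Loop with ED | 9.8 | -9.2 | -19 | 24 |

Table 5. Average suppression in decibels by using a 256 subband implementation.

| Average Suppression (dB), #SB=512 | LMS | RLS | Improvement RLS | Figure |
|---|---|---|---|---|
| Fullband | 13.2 | 15.6 *) | 2.4 | 20 |
| Open Loop | -0.6 | 11.2 | 11.8 | 25 |
| Closed Loop | -29.3 | 1.6 | 30.9 | 26 |
| Open Loop with ED | 9.9 | 9.8 | -0.1 | 27 |
| Closed Loop with ED | 6.1 | 1.6 | -4.5 | 28 |

Table 6. Average suppression in decibels by using a 512 subband implementation.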

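| Computational Flops/sample x1000, #SB=256 | LMS | RLS | Difference |
|---|---|---|---|
| Fullband | 9.82 | 16128.02 *) | 16118.2 |
| Open Loop | 5.69 | 11.88 | 6.19 |
| Closed Loop | 5.56 | 11.75 | 6.19 |

Table 7. Computational cost per input sample for a 256 subband implementation.

| Computational Flops/sample x1000, #SB=512 | LMS | RLS | Difference |
|---|---|---|---|
| Fullband | 9.82 | 16128.02 *) | 16118.2 |
| Open Loop | 4.81 | 6.43 | 1.62 |
| Closed Loop | 4.74 | 6.36 | 1.62 |

Table 8. Computational cost per input sample for a 512 subband implementation.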

5 Summary and Conclusions

A comparison between a fullband and a delayless subband adaptive acoustic echo canceller has been carried out. The acoustic echo cancelling problem can be viewed as an identification problem in which the acoustic path is identified. The comparison measures have been suppression level and computational cost for the NLMS and the WRLS algorithms.

Figure 9: Average suppression after some initial time percentage of sequence. Number of subbands is 256. Conference telephony evaluation.

The spread of eigenvalues in the correlation matrix is a measure of how well the problem is conditioned. For the fullband identification problem there is a high spread in eigenvalues and it is therefore an ill-conditioned problem. When the problem is transformed into several subband identifications, the conditioning is improved considerably. The computational savings for the WRLS are high in the subband approach. For the NLMS algorithm the savings in the number of computations are moderate.
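The effect of splitting the band on conditioning can be illustrated with an idealised calculation. The eigenvalues of a correlation matrix lie between the minimum and the maximum of the power spectral density of the input, so the ratio of these two values bounds the eigenvalue spread. The sketch below evaluates that ratio for a hypothetical first-order autoregressive input, first over the full band and then over each of 32 equal-width bands. The input model, the coefficient `a` and the function names are assumptions for this illustration, and the shapes of the analysis filters and the effect of decimation are ignored.

```python
import math


def ar1_psd(omega, a):
    """Power spectral density (up to a constant factor) of the process
    x(n) = a * x(n-1) + v(n) driven by white noise v(n)."""
    return 1.0 / (1.0 - 2.0 * a * math.cos(omega) + a * a)


def spectral_dynamic_range(a, omega_lo, omega_hi, points=1001):
    """Max/min of the PSD sampled on [omega_lo, omega_hi]. This ratio is an
    upper bound on the eigenvalue spread of the correlation matrix of a
    signal whose spectrum is confined to that band."""
    step = (omega_hi - omega_lo) / (points - 1)
    values = [ar1_psd(omega_lo + k * step, a) for k in range(points)]
    return max(values) / min(values)


a = 0.9  # hypothetical colouring of the input signal
fullband = spectral_dynamic_range(a, 0.0, math.pi)

num_subbands = 32
width = math.pi / num_subbands
worst_subband = max(
    spectral_dynamic_range(a, k * width, (k + 1) * width)
    for k in range(num_subbands)
)

print("fullband bound:", round(fullband, 1))
print("worst subband bound:", round(worst_subband, 2))
```

For this toy input the fullband bound equals ((1 + a) / (1 - a))², while the bound within any single band is far smaller, which is consistent with the improved conditioning described above.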

Introducing an energy detector brings several benefits. The convergence rate for the NLMS algorithm is improved substantially, and the computational cost is reduced further for the WRLS.
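As a rough illustration of how an energy detector can save computations, the sketch below gates a real-valued, fullband NLMS update on the short-term input energy, so that no adaptation is performed while the input is essentially silent. The gating rule, the threshold, the function names and the toy echo path are assumptions made for this sketch; they are not taken from the evaluated system, which uses energy detectors in the subbands.

```python
import random


def fir_filter(x, h):
    """Convolve x with the impulse response h (output has len(x) samples)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def gated_nlms(x, d, num_taps, mu=0.5, energy_threshold=1e-3, eps=1e-8):
    """NLMS identification of d from x that skips the coefficient update
    whenever the mean energy of the input buffer is below the threshold."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps          # most recent input samples, newest first
    errors = []
    skipped = 0
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))
        e = dn - y
        errors.append(e)
        energy = sum(b * b for b in buf)
        if energy / num_taps < energy_threshold:
            skipped += 1            # energy detector: no adaptation here
            continue
        gain = mu * e / (energy + eps)
        w = [wi + gain * bi for wi, bi in zip(w, buf)]
    return w, errors, skipped


random.seed(0)
h = [0.5, -0.3, 0.2, 0.1]           # hypothetical echo path
x = [random.gauss(0.0, 1.0) for _ in range(1500)] + [0.0] * 500
d = fir_filter(x, h)
w, errors, skipped = gated_nlms(x, d, num_taps=len(h))
print([round(wi, 3) for wi in w], skipped)
```

In this toy run the updates are skipped during the silent tail of the input, while the coefficients identified during the active part are left untouched.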

In the open loop implementation, i.e. when the subband algorithms work on local error signals, the convergence rate is in general higher than in the closed loop case.

The fullband NLMS solution is still to be preferred for echo cancelling in an automobile. For echo cancelling problems such as conference telephony, where the echo path is much longer in duration and therefore demands a longer impulse response in the echo canceller, the subband implementation is shown to give better results in terms of both suppression performance and computational load.

Overall, the difference in performance between the NLMS and the WRLS algorithms is small when they are implemented in subbands. This result favors the NLMS algorithm because of its lower complexity. The best structure has been shown to be the open loop incorporating a simple energy detector.

The advantage of utilizing a subband approach is reinforced when the acoustic path increases in length and complexity.

Figure 10: Average suppression after some initial time percentage of sequence. Number of subbands is 512. Conference telephony evaluation.

References

[1] S. Haykin, Adaptive Filter Theory, Prentice Hall, 1996.

[2] B. Widrow, S. D. Stearns, Adaptive Signal Processing, Prentice Hall, 1985.

[3] M. M. Sondhi, W. Kellermann, "Adaptive Echo Cancellation for Speech Signals," Advances in Speech Signal Processing, New York: Marcel Dekker, 1992, ch. 11.

[4] D. R. Morgan, "Slow Asymptotic Convergence of LMS Acoustic Echo Cancelers," IEEE Trans. on Speech and Audio Processing, vol. 3, no. 2, pp. 126-136, March 1995.

[5] Y. Ono, H. Kiya, "Performance Analysis of Subband Adaptive Systems using an Equivalent Model," IEEE Proc. ICASSP'94 (Adelaide, Australia), part III, pp. 53-56, 1994.

[6] D. R. Morgan, J. C. Thi, "A Delayless Subband Adaptive Filter Architecture," IEEE Trans. on Signal Processing, vol. 43, no. 8, pp. 1819-1830, Aug. 1995.

[7] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall, 1993.

[8] J. R. Deller, J. G. Proakis, J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, 1993.

[9] T. Söderström, P. Stoica, System Identification, Prentice Hall International, 1989.

[10] R. M. Gray, "On the Asymptotic Eigenvalue Distribution of Toeplitz Matrices," IEEE Trans. on Information Theory, vol. IT-18, pp. 725-730, 1972.


A Figures-Evaluation

Figure 11: Suppression of Speech signal using Fullband Scheme. Car hands free evaluation. Vertical axis: magnitude [dB]; horizontal axis: samples. Curves: LMS, RLS *).

*) The fullband RLS is calculated off-line by solving the Wiener-Hopf equations.

Figure 12: Suppression of Speech signal with no Energy detector, Open loop, #SB=32. Car hands free evaluation.

Figure 13: Suppression of Speech signal with no Energy detector, Closed loop, #SB=32. Car hands free evaluation.

Figure 14: Suppression of Speech signal with Energy detector, Open loop, #SB=32. Car hands free evaluation.

Figure 15: Suppression of Speech signal with Energy detector, Closed loop, #SB=32. Car hands free evaluation.

Figure 18: Suppression of Speech signal with Energy detector, Open loop, #SB=64. Car hands free evaluation.

Figure 20: Suppression of Speech signal using Fullband Scheme. Conference telephony evaluation. Vertical axis: suppression ratio magnitude [dB]; horizontal axis: samples. Curves: LMS, RLS *).

*) The fullband RLS is calculated off-line by solving the Wiener-Hopf equations.

Figure 21: Suppression of Speech signal with no Energy detector, Open loop, #SB=256. Conference telephony evaluation.

Figure 23: Suppression of Speech signal with Energy detector, Open loop, #SB=256. Conference telephony evaluation.

Figure 27: Suppression of Speech signal with Energy detector, Open loop, #SB=512. Conference telephony evaluation.


Acoustic Echo Cancelling and Noise Suppression with Microphone Arrays

ISSN: 1103-1581, University of Karlskrona/Ronneby.
