Equalization of Audio Channels A Practical Approach for Speech Communication


Nils Westerlund


Abstract

Many occupations today require the use of personal protective equipment, such as a mask to protect the employee from dangerous substances or a pair of ear-muffs to damp high sound pressure levels. The goal of this Master's thesis is to investigate the possibility of placing a microphone for communication purposes inside such a protective mask, as well as the possibility of placing the microphone inside a person's auditory meatus, and of performing a digital channel equalization on the speech path in question in order to enhance speech intelligibility.


Acknowledgements


Chapter 1

Channel Equalization: An Introduction


Figure 1.1: System with input and output signals and the corresponding system in the z-domain.

A linear time-invariant system h(n) takes an input signal x(n) and produces an output signal y(n), which is the convolution of x(n) and the unit sample response h(n) of the system; see fig. 1.1. The input, the output and the system are assumed to be real, and only real signals will be considered in this thesis. The convolution described above can be written as

$$ y(n) = x(n) * h(n) \tag{1.1} $$

where the convolution operation is denoted by an asterisk ∗. In the z-domain, convolution corresponds to multiplication:

$$ Y(z) = X(z)\,H(z) \tag{1.2} $$

where Y (z) is the z-transform of the output y(n), X(z) is the z-transform of the input x(n) and H(z) is the z-transform of the unit sample response h(n) of the system.
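The minimal Python sketch below makes (1.1) and (1.2) concrete. It convolves two short example sequences and checks that the z-transform of the result equals the product of the individual z-transforms at an arbitrary point. The sequences, the test point and the helper names are hypothetical choices for illustration.

```python
def convolve(x, h):
    """Linear convolution y(n) = sum_k x(k) h(n - k) of two finite sequences, cf. (1.1)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def z_transform(seq, z):
    """Evaluate sum_n seq(n) z^(-n) for a finite sequence starting at n = 0."""
    return sum(v * z ** (-n) for n, v in enumerate(seq))


x = [1.0, 2.0, -1.0]   # example input x(n)
h = [0.5, 0.25]        # example unit sample response h(n)
y = convolve(x, h)
print(y)               # [0.5, 1.25, 0.0, -0.25]

z0 = 0.9 + 0.3j        # arbitrary test point in the z-plane
# Y(z0) and X(z0) H(z0) agree up to rounding, cf. (1.2)
print(abs(z_transform(y, z0) - z_transform(x, z0) * z_transform(h, z0)))
```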


1.1 Non-Adaptive Methods

A cascade connection of a system h(n) and its inverse hI(n) is illustrated in fig. 1.2.

Figure 1.2: System h(n) cascaded with its inverse system hI(n) results in an identity system, whose output d(n) equals the input x(n).

Suppose the distorting system has an impulse response h(n) and let hI(n) denote the impulse response of the inverse system. We can then write

$$ d(n) = x(n) * h(n) * h_I(n) = x(n) \tag{1.3} $$

where d(n) is the desired signal, i.e. the original input signal x(n). This implies that

$$ h(n) * h_I(n) = \delta(n) \tag{1.4} $$

where δ(n) is the unit impulse. In the z-domain, (1.4) becomes

$$ H(z)\,H_I(z) = 1 \tag{1.5} $$

Thus, the transfer function for the inverse system is

$$ H_I(z) = \frac{1}{H(z)} \tag{1.6} $$

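Note that the zeros of H(z) become the poles of the inverse system, and vice versa. Consequently, a causal inverse system is stable only if all zeros of H(z) lie inside the unit circle.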
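The inverse relation (1.4)-(1.6) can be sketched as follows, assuming an FIR channel with h(0) ≠ 0. The hypothetical helper causal_inverse performs the long division 1/H(z) one sample at a time. The two example channels are made up: one has its zero inside the unit circle and the other has its zero outside it.

```python
def causal_inverse(h, n_taps):
    """First n_taps samples of the causal inverse hI(n) of an FIR filter h(n).

    Solves h(n) * hI(n) = delta(n), cf. (1.4), by recursive long division
    of 1 by H(z). The result decays, i.e. the inverse is stable, only if
    all zeros of H(z) lie inside the unit circle.
    """
    if h[0] == 0:
        raise ValueError("h(0) must be nonzero for a causal inverse")
    h_inv = []
    for n in range(n_taps):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, min(n, len(h) - 1) + 1):
            acc -= h[k] * h_inv[n - k]
        h_inv.append(acc / h[0])
    return h_inv


def convolve(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


h = [1.0, 0.5]                   # zero at z = -0.5 (inside the unit circle)
h_inv = causal_inverse(h, 20)
cascade = convolve(h, h_inv)     # approximates delta(n), cf. (1.4)
print(h_inv[:4])                 # [1.0, -0.5, 0.25, -0.125]
print(max(abs(v) for v in cascade[1:]))   # only a tiny truncation tail remains

h_bad = [1.0, 2.0]               # zero at z = -2 (outside the unit circle)
print(causal_inverse(h_bad, 5))  # [1.0, -2.0, 4.0, -8.0, 16.0]: grows without bound
```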

If the characteristics of the system are unknown, it is often necessary to excite the system with a known input signal, observe the output, compare it with the input and then determine the characteristics of the system. This operation is called system identification [1]. If we obtain an output signal y(n) from a system h(n) excited with a known input signal x(n), we could of course use the z-transforms of y(n) and x(n) to form

$$ H(z) = \frac{Y(z)}{X(z)} \tag{1.7} $$

However, this is only an analytical expression, and the impulse response corresponding to H(z) is most likely infinite in duration. A more practical approach is based on a correlation method. The crosscorrelation of the signals x(n) and y(n) is given by

$$ r_{xy}(l) = \sum_{n=-\infty}^{\infty} x(n)\,y(n-l), \qquad l = 0, \pm 1, \pm 2, \ldots \tag{1.8} $$

The index l is the lag parameter¹, and the subscripts xy on the crosscorrelation sequence rxy(l) indicate the sequences being correlated. If the roles of x(n) and y(n) are reversed, we obtain

$$ r_{yx}(l) = \sum_{n=-\infty}^{\infty} y(n)\,x(n-l), \qquad l = 0, \pm 1, \pm 2, \ldots \tag{1.9} $$

Thus,

$$ r_{xy}(l) = r_{yx}(-l) \tag{1.10} $$

Note the similarity between computing the crosscorrelation of two sequences and computing their convolution. Hence, if the sequence x(n) and the folded sequence y(−n) are provided as inputs to a convolution algorithm, the convolution yields the crosscorrelation rxy(l), i.e.

$$ r_{xy}(l) = x(l) * y(-l) \tag{1.11} $$

In the special case when x(n) = y(n), the operation results in the autocorrelation of x(n), rxx(l).
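The following is a quick numerical check of (1.8)-(1.11) using two short made-up sequences. The crosscorrelation computed directly from its definition is compared with the convolution of x(n) and the folded sequence y(−n). Both helpers are hypothetical and assume finite sequences starting at n = 0.

```python
import math


def crosscorr(x, y, lag):
    """r_xy(l) = sum_n x(n) y(n - l), cf. (1.8), for finite sequences starting at n = 0."""
    return sum(x[n] * y[n - lag] for n in range(len(x)) if 0 <= n - lag < len(y))


def convolve(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


x = [1.0, 3.0, -2.0, 0.5]
y = [2.0, -1.0, 4.0]

# Convolve x(n) with the folded sequence y(-n), cf. (1.11). Because y(-n) is
# stored as y[::-1], output index i corresponds to lag l = i - (len(y) - 1).
via_conv = convolve(x, y[::-1])
lags = list(range(-(len(y) - 1), len(x)))
direct = [crosscorr(x, y, l) for l in lags]
print(all(math.isclose(a, b) for a, b in zip(direct, via_conv)))   # True

# Autocorrelation special case: r_xx(l) = r_xx(-l)
print(all(math.isclose(crosscorr(x, x, l), crosscorr(x, x, -l)) for l in range(4)))   # True
```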

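Recall that y(n) = x(n) ∗ h(n). Inserting this expression for y(n) into (1.11) yields

$$ r_{xy}(l) = h(-l) * r_{xx}(l) \tag{1.12} $$

In the z-domain, (1.12) becomes

$$ P_{xy}(z) = H^*(z)\,P_{xx}(z) \tag{1.13} $$

where H∗(z) is the complex conjugate of H(z) and Pxx is the power spectral density of x(n). Strictly, the z-transform of h(−l) is H(z⁻¹), which for a real h(n) equals the complex conjugate of H(z) on the unit circle z = e^{jω}. The transfer function for the identified system is then

$$ H^*(z) = \frac{P_{xy}(z)}{P_{xx}(z)} \tag{1.14} $$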

where Pxy(z) is the cross spectral density between x(n) and y(n). If rxy(l) is replaced by ryx(−l) in (1.12), the complex conjugate in (1.14) is eliminated and we obtain the following estimate of the transfer function:

$$ H(z) = \frac{P_{yx}(z)}{P_{xx}(z)} \tag{1.15} $$

The MatLab² function tfe uses this method to estimate the transfer function of the system in question [4]. In later sections it will become clear that this method is both straightforward and powerful for identifying a given system.
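The estimate (1.15) can be sketched in plain Python as below. This is not MatLab's tfe. The function name estimate_tf, the section length, the non-overlapping Hann-windowed sections and the two-tap test channel are all assumptions made for illustration. A naive DFT keeps the example free of external libraries.

```python
import cmath
import math
import random


def dft(seg):
    """Naive O(N^2) DFT, used only to keep the sketch dependency-free."""
    n = len(seg)
    return [sum(seg[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def estimate_tf(x, y, nfft):
    """Averaged-periodogram estimate H(k) = Pyx(k) / Pxx(k), cf. (1.15).

    x and y are split into non-overlapping sections of length nfft, each
    weighted by a periodic Hann window (MatLab's hanning is defined slightly
    differently). Only bins 0..nfft/2, i.e. a one-sided spectrum, are returned.
    """
    w = [0.5 - 0.5 * math.cos(2 * math.pi * m / nfft) for m in range(nfft)]
    half = nfft // 2 + 1
    pxx = [0.0] * half
    pyx = [0j] * half
    for start in range(0, len(x) - nfft + 1, nfft):
        xs = dft([w[m] * x[start + m] for m in range(nfft)])
        ys = dft([w[m] * y[start + m] for m in range(nfft)])
        for k in range(half):
            pxx[k] += abs(xs[k]) ** 2
            pyx[k] += ys[k] * xs[k].conjugate()   # Y X*, the DTFT of r_yx(l)
    return [pyx[k] / pxx[k] for k in range(half)]


random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(64 * 40)]        # white test noise
# Hypothetical two-tap channel h = [1.0, 0.5], i.e. H(z) = 1 + 0.5 z^-1
y = [x[n] + 0.5 * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]
H = estimate_tf(x, y, 64)
print(abs(H[0]), abs(H[32]))   # close to |H| at DC (1.5) and at Fs/2 (0.5)
```

Averaging over many sections is what makes the ratio a useful estimate. With a single section the window edges and the randomness of the noise would dominate.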

1.2 Adaptive Channel Equalization

Another way to equalize a channel is to use adaptive algorithms. Adaptive algorithms have a vast number of application areas, and their mathematical theory is quite complex and beyond the scope of this thesis. Therefore, only a brief description of the basic principles of adaptive filtering will be given in this section [2].

¹ Also commonly referred to as the (time) shift parameter.

² MatLab is a trademark of The MathWorks, Inc.

A block diagram of an adaptive filter is shown in fig. 1.3. It consists of a shift-varying filter and an adaptive algorithm for updating the filter coefficients. The goal of an adaptive FIR filter is to find the Wiener filter w(n) that minimizes the mean-square error

$$ \xi(n) = E\{|d(n) - \hat{d}(n)|^2\} = E\{|e(n)|^2\} \tag{1.16} $$

where E{·} denotes the expected value and d̂(n) is the estimate of the desired signal d(n).

Figure 1.3: Basic structure for an adaptive filter. The adaptive filter has input x(n) and output d̂(n), and the error e(n) = d(n) − d̂(n) drives the adaptive algorithm.

If x(n) and d(n) are jointly wide-sense stationary processes, the filter coefficients that minimize the mean-square error ξ(n) are found by solving the Wiener-Hopf equations [2]

$$ R_{xx}\,w = r_{dx} \tag{1.17} $$

where Rxx denotes the autocorrelation matrix of x(n), w denotes the vector containing the filter coefficients, and rdx denotes the crosscorrelation vector of d(n) and x(n).
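The code below is a sketch of (1.17). It estimates Rxx and rdx from a finite data record and solves the Wiener-Hopf equations directly, for the simple case where d(n) is the output of a known three-tap system. That system is an assumed example, not the mask. Biased autocorrelation estimates and plain Gaussian elimination are illustrative choices, and the helper names are hypothetical.

```python
import random


def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def wiener_fir(x, d, p):
    """Estimate R_xx and r_dx from data and solve (1.17) for a p-tap FIR filter."""
    N = len(x)
    r = [sum(x[n] * x[n - k] for n in range(k, N)) / N for k in range(p)]
    r_dx = [sum(d[n] * x[n - k] for n in range(k, N)) / N for k in range(p)]
    R = [[r[abs(i - j)] for j in range(p)] for i in range(p)]   # Toeplitz R_xx
    return solve(R, r_dx)


random.seed(0)
h = [0.8, -0.4, 0.2]                       # hypothetical unknown system
x = [random.gauss(0.0, 1.0) for _ in range(5000)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0) for n in range(len(x))]
w = wiener_fir(x, d, p=4)
print([round(v, 3) for v in w])            # close to [0.8, -0.4, 0.2, 0.0]
```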

Solving the Wiener-Hopf equations is computationally demanding, since it involves inverting the autocorrelation matrix Rxx. If the input signal or the desired signal is nonstationary, this operation would have to be repeated as the statistics change. Instead, the requirement that w(n) should minimize the mean-square error at each time n can be relaxed, and a coefficient update equation of the form

$$ w(n+1) = w(n) + \Delta w(n) \tag{1.18} $$

can be used. In this equation, ∆w(n) is a correction that is applied to the filter coefficients w(n) at time n to form a new set of coefficients, w(n + 1), at time n + 1. Equation (1.18) is the heart of all adaptive algorithms used in this thesis.

Since the mean-square error ξ(n) is a quadratic function of the filter coefficients, it can be viewed as a “bowl” with the minimum error at the bottom. The idea of adaptive filters is to find the optimal vector w(n) by taking small steps towards this minimum. The update equation for this vector is

$$ w(n+1) = w(n) - \mu\,\nabla\xi(n) \tag{1.19} $$

where µ is the step size and ∇ξ(n) is the gradient vector of ξ(n). Note that the steps are taken in the negative direction of the gradient vector, since this vector points in the direction of steepest ascent.

The gradient can be estimated directly from the instantaneous product −e(n)x(n), where x(n) here denotes the vector of the most recent input samples; constant factors are absorbed into µ. Introducing this estimate in (1.19) yields

$$ w(n+1) = w(n) + \mu\,e(n)\,x(n) \tag{1.20} $$

which is the well-known Least Mean Squares (LMS) algorithm. Further developments of this algorithm include Normalized LMS (NLMS) and Leaky LMS (LLMS). All of these algorithms will be evaluated in later sections of this thesis [2].
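Below is a minimal LMS sketch following (1.20), used for system identification of an assumed three-tap system driven by white noise. The function name, step size and filter length are illustrative choices. This is not the implementation listed in appendix A.1.

```python
import random


def lms(x, d, p, mu):
    """LMS adaptive FIR filter, w(n+1) = w(n) + mu e(n) x(n), cf. (1.20).

    x: input signal, d: desired signal, p: number of taps, mu: step size.
    Returns the final coefficient vector and the error sequence e(n).
    """
    w = [0.0] * p
    errors = []
    for n in range(len(x)):
        # Regressor x(n) = [x(n), x(n-1), ..., x(n-p+1)], zeros before n = 0
        xv = [x[n - k] if n - k >= 0 else 0.0 for k in range(p)]
        d_hat = sum(wk * xk for wk, xk in zip(w, xv))
        e = d[n] - d_hat
        w = [wk + mu * e * xk for wk, xk in zip(w, xv)]
        errors.append(e)
    return w, errors


random.seed(0)
h = [0.8, -0.4, 0.2]                 # hypothetical unknown system
x = [random.gauss(0.0, 1.0) for _ in range(5000)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0) for n in range(len(x))]
w, e = lms(x, d, p=4, mu=0.01)
print([round(v, 3) for v in w])      # close to [0.8, -0.4, 0.2, 0.0]
```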

In fig. 1.4, a block scheme that can be used for adaptive channel equalization is shown. The original signal s(n) is passed through some sort of system (a channel) that distorts it, and this distorted signal x(n) is then used as input to the adaptive filter. The output signal d̂(n) from the adaptive causal filter is subtracted from the desired signal d(n), here the original signal delayed by ∆, and the result forms the error e(n). The error is the second input signal to the adaptive algorithm.

A non-trivial system will not only affect the spectral characteristics of the input signal but also delay it. This is why the delay ∆ is so important.

Another important property is that if the channel to be equalized is causal, the equalizing filter will in general be non-causal if no delay of the filtered signal x(n) is acceptable. However, only causal Finite Impulse Response (FIR) adaptive filters will be used in this thesis, and these filters will indeed introduce a delay on the signal. Also note that an FIR filter can, of course, only approximate an Infinite Impulse Response (IIR) filter with a certain precision, should such a filter be needed for an optimal solution [3].

Figure 1.4: Block scheme for adaptive channel equalization (input signal s(n), channel, delay ∆, adaptive filter, adaptive algorithm; signals x(n), d(n), d̂(n) and e(n)).
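The scheme in fig. 1.4, and the reason the delay ∆ matters, can be illustrated with the following sketch. It assumes a hypothetical two-tap channel whose zero lies outside the unit circle, so that no stable causal inverse exists. An LMS equalizer is run once without delay and once with ∆ = 8 samples, i.e. half of an assumed filter length of 16. All parameter values are illustrative assumptions.

```python
import random


def lms_equalizer(s, h, p, mu, delay):
    """LMS channel equalizer in the spirit of fig. 1.4.

    s: original signal s(n), h: channel impulse response, p: number of taps,
    mu: step size, delay: the delay Delta used to form d(n) = s(n - Delta).
    Returns the squared error e(n)^2 for every n.
    """
    # Channel output x(n) = (s * h)(n) is the input to the adaptive filter
    x = [sum(h[k] * s[n - k] for k in range(len(h)) if n - k >= 0)
         for n in range(len(s))]
    w = [0.0] * p
    sq_err = []
    for n in range(len(s)):
        xv = [x[n - k] if n - k >= 0 else 0.0 for k in range(p)]
        d = s[n - delay] if n - delay >= 0 else 0.0
        e = d - sum(wk * xk for wk, xk in zip(w, xv))
        w = [wk + mu * e * xk for wk, xk in zip(w, xv)]
        sq_err.append(e * e)
    return sq_err


random.seed(2)
s = [random.gauss(0.0, 1.0) for _ in range(10000)]
h = [0.5, 1.0]   # hypothetical channel; its zero at z = -2 lies outside the unit circle
mse = {}
for delay in (0, 8):
    sq = lms_equalizer(s, h, p=16, mu=0.005, delay=delay)
    mse[delay] = sum(sq[-2000:]) / 2000     # MSE after convergence
    print(delay, mse[delay])
```

Without delay, this particular channel leaves the equalizer with an MSE of roughly 0.75 or more, because the best causal estimate of s(n) cannot use the later samples of x(n) that carry most of the information about s(n). With ∆ = 8, the causal FIR filter can approximate the delayed inverse closely.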

Chapter 2

Equalization of Mask Channel

In this chapter, a protective mask is studied. The goal was to equalize the distortion of human speech caused by this mask. To collect the necessary data for this study, a measurement setup was assembled to record data on site.

2.1 Gathering of Measurement Data

The measurement data was gathered with the help of a test dummy head¹, two DAT-recorders² and a signal analyzer. The test dummy was constructed specially for audio measurements and was equipped with a loudspeaker placed in its mouth. A microphone was mounted on the inside of the mask, and the mask was then attached to the test dummy head; see fig. 2.1. To damp disturbing environmental noise, the complete arrangement was placed behind particle boards covered with insulation wool.

The signal analyzer was used to generate noise bandlimited to 12.8 kHz. One of the DAT-recorders, the SV3800 model, was used to record noise and speech sequences, while the other was used for playback of speech sequences. The sampling frequency was 48 kHz with a resolution of 16 bits, and the information on the DAT tapes was then stored as wav-files using the software CoolEdit 2000. The wav-files were finally read by MatLab for further processing. A block scheme of the complete setup is shown in fig. 2.2.

The first action taken was to reduce the amount of data by sampling rate conversion. Using the MatLab function decimate, the sampling frequency was reduced in two steps: first from 48 kHz to 24 kHz, and then from 24 kHz to 12 kHz. Hence, the amount of data was reduced to one fourth. For a detailed description of how decimate works, see [5].

¹ Head Acoustic

² Sony TCD-D8 and Panasonic SV3800

Figure 2.1: (a) Test dummy head equipped with a loudspeaker in its mouth. The microphone is placed inside the mask. (b) Placement of the microphone in the mask.

Figure 2.2: Block scheme of the complete measuring setup.

2.2 Coherence Function

A powerful tool for investigating the properties of input-output signals is the coherence function. Let Pxx and Pyy be the power spectral densities of the input signal x(n) and the output signal y(n), respectively, and let Pxy be the cross spectrum of the input and output signals. The coherence function Cxy can then be calculated as

$$ C_{xy} = \frac{|P_{xy}|^2}{P_{xx}\,P_{yy}} \tag{2.1} $$

The coherence is evaluated frequency by frequency and lies between zero and one. A coherence equal to one means that a perfectly linear and noise-free system is being measured, while lower values indicate noise or nonlinear behaviour. Thus, the coherence function gives a direct measure of the quality of the estimated frequency response at each frequency.
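The estimate (2.1) can be sketched as below, with the spectra averaged over non-overlapping Hann-windowed sections. The test signals and the section length are assumptions made for illustration: a hypothetical two-tap channel, once noise-free and once with independent noise added. This is not MatLab's own coherence routine.

```python
import cmath
import math
import random


def dft(seg):
    """Naive O(N^2) DFT, used only to keep the sketch dependency-free."""
    n = len(seg)
    return [sum(seg[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def coherence(x, y, nfft):
    """Estimate C_xy(k) = |P_xy|^2 / (P_xx P_yy), cf. (2.1), for bins 0..nfft/2.

    The spectra are averaged over non-overlapping, Hann-windowed sections;
    with a single section the estimate would be identically one.
    """
    w = [0.5 - 0.5 * math.cos(2 * math.pi * m / nfft) for m in range(nfft)]
    half = nfft // 2 + 1
    pxx, pyy, pxy = [0.0] * half, [0.0] * half, [0j] * half
    for start in range(0, len(x) - nfft + 1, nfft):
        xs = dft([w[m] * x[start + m] for m in range(nfft)])
        ys = dft([w[m] * y[start + m] for m in range(nfft)])
        for k in range(half):
            pxx[k] += abs(xs[k]) ** 2
            pyy[k] += abs(ys[k]) ** 2
            pxy[k] += xs[k] * ys[k].conjugate()
    return [abs(pxy[k]) ** 2 / (pxx[k] * pyy[k]) for k in range(half)]


random.seed(3)
x = [random.gauss(0.0, 1.0) for _ in range(64 * 40)]
clean = [x[n] + 0.5 * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]   # hypothetical channel
noisy = [v + random.gauss(0.0, 1.0) for v in clean]                          # plus independent noise
c_clean = coherence(x, clean, 64)
c_noisy = coherence(x, noisy, 64)
print(min(c_clean), sum(c_noisy) / len(c_noisy))   # near one vs. clearly below one
```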


The coherence function Cxy of the mask is shown in fig. 2.3. The length of the FFTs (Fast Fourier Transforms) used for calculating Cxy was 2048.

Figure 2.3: The coherence function Cxy of the mask versus frequency [Hz]. The input signal was a flat bandlimited noise sequence with variance σ² = 1 (FFT length 2048).

2.3 Channel Equalization using tfe

First, the impulse response of the system was estimated using the MatLab function tfe. For a detailed description of how this function works, see [4]. A short summary of the theory behind tfe is given in section 1.1. An alternative function, written by the author, is listed in appendix A.6.

The data was divided into non-overlapping sections, which were then windowed by a Hanning window. The squared magnitudes of the Discrete Fourier Transforms (DFTs) of the input noise sections were averaged to form Pxx. The products of the DFTs of the input and output noise sections, with one of the two factors complex conjugated, were averaged to form Pxy.

A one-sided spectrum is returned by tfe, so in order to perform an Inverse FFT (IFFT), the spectrum has to be converted to a two-sided spectrum. This spectrum can then be used as input to the MatLab function ifft, which gives the impulse response corresponding to the transfer function. For a detailed description of the MatLab function ifft, see [5].
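The conversion described above can be sketched as follows for an even FFT length. The missing negative-frequency bins are filled in by conjugate symmetry, which holds because the impulse response is real, and an inverse DFT then gives the impulse response. The example impulse response is made up, the helper names are hypothetical, and the naive DFT only stands in for MatLab's ifft.

```python
import cmath
import math


def dft(seq, inverse=False):
    """Naive O(N^2) DFT; with inverse=True it computes the IDFT, including 1/N."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = [sum(seq[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def impulse_response_from_one_sided(h_one_sided, nfft):
    """Rebuild the two-sided spectrum from bins 0..nfft/2 (nfft even) and invert it.

    A real impulse response has a conjugate-symmetric spectrum,
    H(nfft - k) = conj(H(k)), which supplies the missing bins nfft/2+1..nfft-1.
    """
    two_sided = list(h_one_sided)
    two_sided += [h_one_sided[k].conjugate() for k in range(nfft // 2 - 1, 0, -1)]
    return [v.real for v in dft(two_sided, inverse=True)]


h = [0.5, 0.3, -0.2, 0.1]                   # hypothetical impulse response
spectrum = dft(h + [0.0] * 4)               # 8-point spectrum
one_sided = spectrum[:5]                    # bins 0..4, as a one-sided estimate
h_rec = impulse_response_from_one_sided(one_sided, 8)
print([round(v, 6) for v in h_rec])         # h followed by (numerically) zero taps
```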

The channel transfer function and impulse response for different filter lengths are shown in fig. 2.4. A channel equalizing filter for the mask is easily calculated with tfe simply by switching the input parameters. That is, if the tfe function call to estimate a channel is Txy=tfe(inputNoise, outputNoise), the function call to estimate an equalizing filter for the same channel is Txy_inv=tfe(outputNoise, inputNoise). The result of this operation is shown in fig. 2.5.

2.4 Adaptive Channel Equalization

Figure 2.4: The left column shows impulse responses for the mask (amplitude versus filter taps). The filters were calculated using the correlation method. The filter lengths are L=50, L=110, L=256 and L=512. The right column shows the corresponding transfer functions (dB versus frequency [Hz]). The transfer functions were calculated using the MatLab function freqz [5].

2.4.1 The LMS Algorithm

The first adaptive algorithm used for channel equalization was the LMS algorithm, according to

$$ w(n+1) = w(n) + \mu\,e(n)\,x(n) \tag{2.2} $$

An implementation of the LMS algorithm is listed in appendix A.1.

Step size

The correct choice of the step size µ is of great importance when using the LMS algorithm or other LMS-based algorithms. The maximum step size can be approximated by

$$ 0 < \mu < \frac{2}{p\,E\{|x(n)|^2\}} \tag{2.3} $$

where p is the filter length and E{|x(n)|²} is estimated with

Figure 2.5: The left column shows impulse responses for the mask channel equalizing filter. The filters were calculated using the correlation method. The filter lengths are L=50, L=110, L=256 and L=512. The right column shows the corresponding transfer functions. The transfer functions were calculated using the MatLab function freqz.

In practice, this maximum step size can seldom, if ever, be used. Instead, as a rule of thumb, a step size at least an order of magnitude smaller than the maximum allowed value should be used [2]. Nevertheless, there are applications that may allow larger step sizes.
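The following is a small worked example of (2.3). It assumes unit-variance white noise and the filter length p = 200 used in fig. 2.6. The power E{|x(n)|²} is estimated by the sample mean of x(n)², which is an assumption of this sketch. The factor 0.1 follows the rule of thumb above, and the helper name is hypothetical.

```python
import random


def max_lms_step(x, p):
    """Upper bound 2 / (p E{|x(n)|^2}) from (2.3).

    E{|x(n)|^2} is estimated here by the sample mean of x(n)^2, which is
    an assumed estimator.
    """
    power = sum(v * v for v in x) / len(x)
    return 2.0 / (p * power)


random.seed(4)
x = [random.gauss(0.0, 1.0) for _ in range(10000)]   # unit-variance noise, sigma^2 = 1
mu_max = max_lms_step(x, p=200)
mu = 0.1 * mu_max      # an order of magnitude below the bound, cf. the rule of thumb
print(mu_max, mu)      # mu_max is close to 2 / 200 = 0.01
```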

Delay

The choice of delay has a substantial effect on the quality of the channel equalizer, measured here by the Mean Square Error (MSE). As a rule of thumb, the delay can be chosen equal to half the adaptive filter length [3]. In fig. 2.6 the MSE is plotted as a function of the delay. For a filter length of 200, a delay of about 100 samples gives the lowest MSE. Note that introducing a delay is crucial for the quality of a channel equalizer, but the exact length of the delay is not critical. According to the figure, the delay could have been as short as 50 samples or as long as 150 samples while the MSE remained low. Leaving out the delay, however, results in an unacceptably high MSE.

Figure 2.6: Mean square error plotted as a function of the delay ∆ [samples]. The length of the adaptive filter is L=200.

to the system is best illustrated with a plot of the impulse response of the mask, i.e. the crosscorrelation between the loudspeaker and microphone signals. This plot is shown in fig. 2.7 and is based on an estimate made by the Hewlett-Packard 36570A signal analyzer. Note that the amplitude of the impulse response is not correctly scaled.

The crosscorrelation is approximately zero during the first 0.2 ms. This delay is due to the propagation time of the first sound wave that reaches the microphone. If we approximate the speed of sound as c ≈ 330 m/s and the delay as ∆ ≈ 2 · 10⁻⁴ s, the distance L between the loudspeaker and the microphone can be calculated by

$$ L = \Delta \cdot c \tag{2.5} $$

which yields a distance between the loudspeaker and the microphone of about 6.5 cm. This corresponds well to the real distance.

Filter length

Figure 2.7: The plots show the impulse response of the mask, i.e. the crosscorrelation between the loudspeaker and the microphone, versus time [s]. The lower plot is a zoomed version of the upper plot.

choice of filter length, fig. 2.8 shows the MSE plotted as a function of the filter length. The results from all LMS-based adaptive algorithms used in this thesis are plotted. Note that when the filter length increases beyond a certain point, the MSE actually increases. The reason is that as the number of filter coefficients increases, the error due to stochastic “jumps” of these coefficients on the error surface also increases. This error is called the excess MSE.

Results

The LMS algorithm was used to perform both a channel identification and a channel equalization. The corresponding plots are shown in figs. 2.9–2.10.

2.4.2 The NLMS Algorithm

Normalized LMS (NLMS) uses a time-varying step size as follows:

$$ \mu(n) = \frac{\beta}{x^T(n)\,x(n) + \epsilon} = \frac{\beta}{\|x(n)\|^2 + \epsilon} \tag{2.6} $$

where the superscript T denotes transpose. Also note that, to avoid division by zero, a small constant ε is introduced in the denominator.

Figure 2.8: The MSE plotted as a function of filter length for LMS, NLMS, LLMS and RLS. The curves show LMS and LLMS with 0.025, 0.050 and 0.100 of the maximum step size, NLMS with β = 0.05, 0.10 and 0.20, and RLS with λ = 1. The delay is half the length of the filter plus eight samples due to the physical delay introduced by the system. Three different step sizes were used for each algorithm (except for the RLS algorithm). The input signal was 50 000 samples of flat bandlimited noise with variance σ² = 1, except for the RLS algorithm, where 10 000 samples of noise were used.

If (2.6) is inserted into (2.2), we obtain

$$ w(n+1) = w(n) + \beta\,\frac{x(n)}{\|x(n)\|^2 + \epsilon}\,e(n) \tag{2.7} $$

Under suitable statistical assumptions, it can be shown that the NLMS algorithm converges in the mean-square sense if 0 < β < 2 [2]. Therefore, the NLMS algorithm requires no knowledge of the statistics of the input signal in order to choose the step size.

Another advantage of the NLMS algorithm is that it is insensitive to the amplification of gradient noise caused by a high-amplitude input signal. This insensitivity comes from the normalization in (2.7).
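Below is an NLMS sketch following (2.6) and (2.7), with β = 0.1 and ε = 10⁻⁶ as assumed values. The same hypothetical three-tap system is identified twice, once with unit-variance input and once with the input scaled up by 40 dB. It illustrates that the normalization makes the choice of β independent of the input level.

```python
import random


def nlms(x, d, p, beta, eps=1e-6):
    """NLMS: w(n+1) = w(n) + beta x(n) e(n) / (||x(n)||^2 + eps), cf. (2.6)-(2.7)."""
    w = [0.0] * p
    for n in range(len(x)):
        xv = [x[n - k] if n - k >= 0 else 0.0 for k in range(p)]
        e = d[n] - sum(wk * xk for wk, xk in zip(w, xv))
        mu_n = beta / (sum(xk * xk for xk in xv) + eps)   # time-varying step size
        w = [wk + mu_n * e * xk for wk, xk in zip(w, xv)]
    return w


random.seed(5)
h = [0.8, -0.4, 0.2]                      # hypothetical system to identify
results = {}
for scale in (1.0, 100.0):                # the same kind of noise, 40 dB louder
    x = [scale * random.gauss(0.0, 1.0) for _ in range(3000)]
    d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0) for n in range(len(x))]
    results[scale] = nlms(x, d, p=4, beta=0.1)
    print(scale, [round(v, 3) for v in results[scale]])
```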

Figure 2.9: The left column shows impulse responses for the mask. The filters were calculated using the LMS algorithm. The filter lengths are L=50, L=110, L=256 and L=512. The right column shows the corresponding transfer functions. The transfer functions were calculated using the MatLab function freqz.

2.4.3 The LLMS Algorithm

If some eigenvalues of the autocorrelation matrix are zero, the LMS algorithm does not converge as expected. The LLMS (Leaky LMS) algorithm solves this problem by introducing a “leakage coefficient” γ into the coefficient update according to

$$ w(n+1) = (1 - \mu\gamma)\,w(n) + \mu\,e(n)\,x(n) \tag{2.8} $$

This leakage coefficient drives the filter coefficients towards zero if either the input signal or the error signal becomes zero. The obvious drawback of this method is that a bias is introduced into the solution. This bias becomes evident in fig. 2.8, where the LLMS algorithm has approximately twice the MSE of the other algorithms in the plot.

The delay requirement is the same as when using the LMS and NLMS algorithms.
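The following sketch runs the leaky update (2.8) on an assumed three-tap system driven by white noise. In the mean, the steady-state coefficients of the leaky update satisfy (Rxx + γI)w = rdx. For unit-variance white input they should therefore settle near h/(1 + γ) instead of h, which is the bias discussed above. The values γ = 0.1 and µ = 0.002 are illustrative.

```python
import random


def leaky_lms(x, d, p, mu, gamma):
    """Leaky LMS, w(n+1) = (1 - mu gamma) w(n) + mu e(n) x(n), cf. (2.8)."""
    w = [0.0] * p
    for n in range(len(x)):
        xv = [x[n - k] if n - k >= 0 else 0.0 for k in range(p)]
        e = d[n] - sum(wk * xk for wk, xk in zip(w, xv))
        w = [(1.0 - mu * gamma) * wk + mu * e * xk for wk, xk in zip(w, xv)]
    return w


random.seed(6)
h = [0.8, -0.4, 0.2]                       # hypothetical system
x = [random.gauss(0.0, 1.0) for _ in range(10000)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0) for n in range(len(x))]
w_lms = leaky_lms(x, d, p=3, mu=0.002, gamma=0.0)     # gamma = 0 gives plain LMS
w_leaky = leaky_lms(x, d, p=3, mu=0.002, gamma=0.1)
print([round(v, 3) for v in w_lms])     # close to h
print([round(v, 3) for v in w_leaky])   # close to h / 1.1, i.e. biased towards zero
```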

2.4.4 The RLS Algorithm

Figure 2.10: The left column shows impulse responses for the equalizing filter. The filters were calculated using the LMS algorithm. The filter lengths are L=50, L=110, L=256 and L=512. The right column shows the corresponding transfer functions. The transfer functions were calculated using the MatLab function freqz.

description of this algorithm, see [2]. One important property of the RLS algorithm is that the step size depends on the size of the error. If the estimate d̂(n) is close to the desired signal d(n), only small corrections of the filter coefficients are made. Hence, the step size will be large at the beginning of the convergence and then become smaller and smaller as d̂(n) approaches d(n).

A plot of the MSE as a function of the filter length is shown in fig. 2.8. Due to the complexity of the algorithm, the plot has been calculated from 10 000


2.5 Minimum-Phase Approach

If we were to design a channel equalizer for hi-fi audio purposes, a linear-phase filter would be the obvious choice, since such a filter delays all frequencies equally. In the case of the mask, this constraint is substantially relaxed. This channel equalizer is intended to operate in a telephone network (PSTN) using the frequency band 300–3400 Hz. Since the channel equalizer is designed to operate as part of such a large system, it is desirable to reduce the delay caused by the filtering. This keeps the total delay introduced by the whole system, i.e. the PSTN, as small as possible.

One powerful method of minimizing the delay of a system is to design it as a minimum-phase filter. A minimum-phase filter has all of its zeros inside or possibly on the unit circle. This type of filter can be obtained from a linear-phase filter by reflecting all of the zeros that lie outside the unit circle to the inside, i.e. by replacing each such zero z₀ by 1/z₀*. The resulting filter will have minimum phase and, except for a scaling factor, the same magnitude response as the linear-phase filter [6].
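Reflecting zeros can be sketched as follows. The example zero set is an assumed linear-phase example rather than the filter in fig. 2.11: a complex-conjugate pair at radius 0.5 together with its reciprocal pair at radius 2. The sketch checks that the two magnitude responses differ only by a constant factor, here the product of the moduli of the reflected zeros (2 · 2 = 4). The helper names are hypothetical.

```python
import cmath
import math


def poly_from_zeros(zeros):
    """Coefficients b_0..b_M of prod_k (1 - z_k z^-1), ordered by powers of z^-1."""
    b = [1.0 + 0j]
    for z0 in zeros:
        # Multiply by (1 - z0 z^-1): new b_k = b_k - z0 b_(k-1)
        b = [bk - z0 * bprev for bk, bprev in zip(b + [0j], [0j] + b)]
    return b


def freq_response(b, w):
    """B(e^{jw}) = sum_k b_k e^{-jwk}."""
    return sum(bk * cmath.exp(-1j * w * k) for k, bk in enumerate(b))


def reflect_to_min_phase(zeros):
    """Replace every zero outside the unit circle by 1 / conj(z0)."""
    return [1 / z0.conjugate() if abs(z0) > 1 else z0 for z0 in zeros]


a = 0.5 * cmath.exp(1j * math.pi / 3)
zeros = [a, a.conjugate(), 1 / a, 1 / a.conjugate()]   # hypothetical linear-phase zero set
b_lin = poly_from_zeros(zeros)
b_min = poly_from_zeros(reflect_to_min_phase(zeros))
ratios = [abs(freq_response(b_lin, w)) / abs(freq_response(b_min, w))
          for w in (0.1, 0.7, 1.5, 2.9)]
print([round(c.real, 4) for c in b_lin])   # symmetric coefficients: linear phase
print(ratios)                              # constant, approximately 4
```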


Figure 2.11: The upper plot shows the impulse response for a minimum-phase filter and the lower plot shows the impulse response for a linear-phase filter. Note how the “centre of gravity” of the linear-phase filter has been shifted to form the minimum-phase filter.

will be 64 samples due to its symmetry. In fig. 2.12 the corresponding amplitude functions are plotted. It is clear that the minimum-phase filter indeed results in approximately the same amplitude as the linear-phase filter.

Figure 2.12: Amplitude [dB] versus frequency [Hz] for the minimum-phase filter (upper plot) and the linear-phase filter (lower plot), each shown together with the desired frequency response. Note that the overall performance is approximately the same for both filters.

Another interesting question is how the phase behaves over the frequency band. This is illustrated in fig. 2.13.

The group delay τg is defined as

$$ \tau_g = -\frac{d\theta(\omega)}{d\omega} \tag{2.9} $$

where θ(ω) is the phase response of the filter.
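The group delay (2.9) can be evaluated numerically as sketched below, using a central difference of the phase. The filters are hypothetical examples. The first is a symmetric five-tap FIR, whose group delay is (5 − 1)/2 = 2 samples at all frequencies. The other two are two-tap filters with identical magnitude responses, where the minimum-phase one has the smaller group delay.

```python
import cmath


def phase(b, w):
    """Phase theta(w) of B(e^{jw}) = sum_k b_k e^{-jwk}, wrapped to (-pi, pi]."""
    return cmath.phase(sum(bk * cmath.exp(-1j * w * k) for k, bk in enumerate(b)))


def group_delay(b, w, dw=1e-5):
    """tau_g(w) = -d theta / d w, cf. (2.9), by a central difference, in samples.

    Assumes B(e^{jw}) != 0 and no phase wrap between w - dw and w + dw.
    """
    return -(phase(b, w + dw) - phase(b, w - dw)) / (2 * dw)


b_lin = [0.2, 0.5, 1.0, 0.5, 0.2]    # hypothetical symmetric (linear-phase) FIR, 5 taps
print([round(group_delay(b_lin, w), 4) for w in (0.3, 1.0, 2.0)])   # [2.0, 2.0, 2.0]

# Same magnitude response, different phase: zero at z = -0.5 versus z = -2
print(round(group_delay([1.0, 0.5], 0.0), 4), round(group_delay([0.5, 1.0], 0.0), 4))   # 0.3333 0.6667
```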

Figure 2.13: The phase [radians] versus frequency [Hz] for the minimum-phase filter (upper plot) and the linear-phase filter (lower plot).

Figure: Group delay [samples] versus frequency [Hz] for the minimum-phase and linear-phase filters.

2.6 Results of Mask Channel Equalization

When talking about speech quality and speech intelligibility, it is hard to decide what is “high quality speech” and what is “low quality speech”. Some sort of measure is needed to conclude whether one speech sample is “better” than another. Nevertheless, judgements of speech quality and intelligibility remain largely subjective.

In the case of the mask channel equalization, both correlation methods and adaptive methods proved to be powerful tools. Both methods substantially improved the speech quality and intelligibility using reasonable filter lengths. A subjective listening test showed that at a sampling frequency of Fs = 12 kHz, a filter length of about L = 100 taps significantly improved the speech quality. There was also little or no difference between the results of the different adaptive algorithms. This is why only the LMS algorithm is used as the adaptive method in chapter 3, where an equalization of the mouth-ear channel is performed.

Chapter 3

Equalization of Mouth-Ear Channel

In chapter 2 we saw that it is possible to equalize the channel that a mask represents using both correlation methods and adaptive methods. We now move on to the next issue: placing the microphone inside a person's auditory meatus, and identifying and equalizing the channel between the mouth and the ear. The first problem that arises is how to generate a noise signal. When using the test dummy head, a signal analyzer could be used to generate the reference input signals (see section 2.1). Now, when the microphone is placed inside a human auditory meatus, the skull itself represents the channel to be equalized. Thus, the test subject himself must generate a broadband noise-like signal to excite the channel/skull. This may seem like an impossible task, but it is in fact quite possible to generate a broadband noise-like sound. The power spectral density of such a noise-like sound, made by a human speech organ, is shown in fig. 3.1.

Figure 3.1: Power spectral density (amplitude [dB] versus frequency [Hz]) of a noise-like sound made by a human speech organ.

3.1 Gathering of Measurement Data

The equipment used was a DAT-recorder, a custom-made microphone amplifier, two microphones and a pair of ear-muffs. One of the microphones was placed in front of the test subject's mouth and the other was placed inside the test subject's auditory meatus. The ear-muffs were then placed on the test subject's head. This is advantageous since the signal path outside the skull is damped considerably. Also, a pair of ear-muffs damps disturbing or even harmful environmental noise.

The test subject was placed in a semi-damped room and pronounced a number of sentences chosen a priori. He also tried to make noise-like sounds. The two-channel data was recorded at 44.1 kHz, and the sampling frequency was then reduced to 11.025 kHz in the same manner as the data from the mask measurements (see section 2.1). For a block scheme of the complete measurement setup, see figs. 3.2 and 3.3.

Figure 3.2: Block scheme of the complete measurement setup.

Figure 3.3: Microphone placement in auditory meatus.


3.2 Coherence Function of Mouth-Ear Channel

Using the noise signal generated by the human speech organ, the coherence was calculated as in (2.1). The result is shown in fig. 3.4. As described in

Figure 3.4: Coherence function of the mouth-ear channel versus frequency [Hz] (FFT length 2048).

3.3 Channel Equalization Using tfe

The MatLab function tfe uses correlations to calculate a transfer function, as described in (1.15), section 1.1. The transfer functions and impulse responses for a number of different filter lengths are shown in figs. 3.5–3.6. The procedure for calculating the impulse response from the transfer function given by tfe was the same as in section 2.3. It is evident that the skull performs a relatively simple low-pass filtering, with a cut-off frequency of about 500 Hz and a stop-band attenuation of about 30–40 dB.

The strange behaviour of the impulse response can probably be explained by the complicating circumstances mentioned in section 3.2.


Figure 3.5: The left column shows impulse responses for the mouth-ear channel. The filters were calculated using the correlation method. The filter lengths are L=50, L=110, L=256 and L=512. The right column shows the corresponding transfer functions. The transfer functions were calculated using the MatLab function freqz.

3.4 Adaptive Channel Equalization


Figure 3.6: The left column shows impulse responses for the mouth-ear channel equalizing filter. The filters were calculated using the correlation method. The filter lengths are L=50, L=110, L=256 and L=512. The right column shows the corresponding transfer functions. The transfer functions were calculated using the MatLab function freqz.

3.4.1 The LMS Algorithm

As in the case of the mask, a number of parameters must be chosen to obtain an effective equalizing filter.

Step size and Filter Length

To find a proper step size, the MSE was plotted as a function of the filter length L. Different fractions of the maximum step size were used, and the result is shown in fig. 3.7. According to this figure, a step size of about one fifth of the maximum allowed step size seems to be a reasonable choice. Initially, the delay was chosen as half the filter length. Later, a more thorough investigation of the optimal delay is performed.

Delay

Figure 3.7: The MSE plotted as a function of filter length for the LMS algorithm, using 0.025, 0.05, 0.1, 0.2 and 0.4 of the maximum step size µ. The delay was half the filter length. The input signal was noise generated by a human speech organ.

its “mouth”. Hence, the speech or noise originated at a single, isolated point. When a person is talking, this is not the case: the vocal cords act together with the throat, mouth and nasal cavities to form sounds, so the sound is not generated at one isolated point but is the result of several cooperating systems. Since we are forced to use a human skull instead of a test dummy head to collect data, it is difficult to predict an optimal delay for a mouth-ear channel equalizer.


[Figure 3.8 panels: MSE (×10⁻³) versus delay in samples for L=10, 30, 50, 70, 90, 110, 256 and 512]

Figure 3.8: The MSE plotted as a function of delay for eight mouth-ear channel equalizing filters, each with a different length L and with delays from 0 to 2L samples. The filters were calculated using the LMS algorithm.
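Because no optimal delay can be predicted, it has to be found empirically, as in fig. 3.8. A minimal Python sketch of such a sweep follows: for every delay Δ from 0 to 2L, an LMS equalizer is adapted with the reference delayed by Δ samples as the desired signal, and the mean square error is recorded. Treating the second half of each run as steady state, the function names, and the synthetic channel in the usage are assumptions for illustration, not details taken from the thesis.

```python
import random


def lms_error(x, d, mu, nord):
    """Run a plain LMS filter and return its error sequence."""
    w = [0.0] * nord
    buf = [0.0] * nord  # newest sample first
    err = []
    for n in range(len(x)):
        buf = [x[n]] + buf[:-1]
        e = d[n] - sum(wk * xk for wk, xk in zip(w, buf))
        w = [wk + mu * e * xk for wk, xk in zip(w, buf)]
        err.append(e)
    return err


def delay_sweep(s, x, nord, mu):
    """Steady-state MSE of an LMS equalizer for every delay 0..2*nord.

    s: reference signal, x: channel output fed to the equalizer.
    The second half of each run is treated as steady state (an assumption).
    Returns (best_delay, mse_list).
    """
    mse = []
    for delay in range(2 * nord + 1):
        d = [s[n - delay] if n >= delay else 0.0 for n in range(len(s))]
        tail = lms_error(x, d, mu, nord)[len(s) // 2:]
        mse.append(sum(v * v for v in tail) / len(tail))
    best = min(range(len(mse)), key=mse.__getitem__)
    return best, mse


if __name__ == "__main__":
    # Hypothetical channel: two samples of delay followed by 1 + 0.5 z^-1.
    random.seed(2)
    s = [random.gauss(0.0, 1.0) for _ in range(4000)]
    x = [(s[n - 2] if n >= 2 else 0.0) + 0.5 * (s[n - 3] if n >= 3 else 0.0)
         for n in range(len(s))]
    best, mse = delay_sweep(s, x, nord=8, mu=0.02)
    print(best, mse[best])
```

With the hypothetical channel in the usage, delays shorter than its two-sample bulk delay cannot be equalized by a causal filter, so their MSE stays near the signal power; the minimum typically appears at or shortly after that bulk delay.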

3.5 Results of Mouth-Ear Channel Equalization


[Figure panels: impulse responses (filter taps) and transfer functions (dB, 0 to 5000 Hz) for L=50, 110, 256 and 512]


[Figure panels: impulse responses (filter taps) and transfer functions (dB, 0 to 5000 Hz) for L=50, 110, 256 and 512]


Chapter 4

Identification of “True” Mouth-Ear Channel

Most types of measurement affect the item being measured in some way. In the case of mouth-ear channel identification and equalization, the cables, analog-to-digital converters (ADCs) and microphones form a system that distorts the signal. However, it is possible to equalize this system as well and in this way find an approximation of the “true” channel.

4.1 Basic Approach

In fig. 4.1, a principal block scheme illustrates how the measurements are performed. $S_E$ is the signal recorded in the auditory meatus, i.e. the Ear, and $S_M$ is the signal recorded at the Mouth. $H_T$ is the true ear-mouth channel and $H$ is the true ear-mouth channel distorted by the measurement equipment. The microphones, cables and ADCs can be viewed as a system.

Suppose we perform a measurement using equipment/system $G_M$ to record data at the mouth and equipment/system $G_E$ to record data in the auditory meatus. We then have the situation shown in fig. 4.2, where $H_1$ is the first estimate of the channel. This setup means that

$$S_M = S_E H_T \qquad (4.1)$$

The output from $H_1$ will be $S_E G_E H_1$ and the output from $G_M$ will be $S_M G_M$. This means that

$$H_1 = \frac{S_M G_M}{S_E G_E} \qquad (4.2)$$

Then, the microphones are switched so that the equipment that was used to record data in the auditory meatus in the first measurement is now placed in front of the mouth, and vice versa. Fig. 4.3 shows this new setup. Note that $G_E$ and $G_M$ are switched. This means that

$$H_2 = \frac{S_M G_E}{S_E G_M} \qquad (4.3)$$

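If (4.1) is substituted into (4.2) and (4.3), and $H_1$ and $H_2$ are multiplied, we obtain

$$H_1 H_2 = \frac{S_M G_M}{S_E G_E}\cdot\frac{S_M G_E}{S_E G_M} = \left(\frac{S_M}{S_E}\right)^2 = H_T^2 \quad\Longrightarrow\quad H_T = \sqrt{H_1 H_2},$$

i.e. the true channel.

[Figure 4.1 block scheme: $S_E$, $H_T$, $S_M$ and the channel $H$ as seen through the two "ADC etc." measurement chains]

Figure 4.1: Block scheme of how the measurements are performed.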

The result of applying the operations described in this section to the mouth-ear channel equalizing problem is shown in fig. 4.4.
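Multiplying (4.2) by (4.3) cancels the equipment responses, since $H_1 H_2 = H_T^2$. The following minimal Python sketch applies this per frequency bin; it is not the thesis's `truechan` function. It assumes $H_1$ and $H_2$ are already available as complex responses on a common frequency grid (for example from the H1-estimate of Appendix A.6). Since the square root of a complex number is ambiguous in sign, the sketch returns only the magnitude $|H_T| = \sqrt{|H_1||H_2|}$, which is what an amplitude plot in dB needs. The function names are illustrative.

```python
import cmath
import math


def true_channel_magnitude(H1, H2):
    """Return |H_T| = sqrt(|H1 * H2|) for each frequency bin.

    H1: first measurement,  H1 = H_T * G_M / G_E   (cf. eq. 4.2)
    H2: equipment swapped,  H2 = H_T * G_E / G_M   (cf. eq. 4.3)
    In the product H1 * H2 = H_T**2 the equipment responses cancel.
    """
    if len(H1) != len(H2):
        raise ValueError("H1 and H2 must use the same frequency grid")
    return [math.sqrt(abs(a * b)) for a, b in zip(H1, H2)]


def to_db(mag):
    """Convert magnitudes to dB."""
    return [20.0 * math.log10(m) for m in mag]


if __name__ == "__main__":
    # Hypothetical responses at three frequency bins.
    HT = [2.0, cmath.rect(0.5, 1.0), 1j]
    GM = [1.5, cmath.rect(0.8, -0.3), 0.9]
    GE = [0.4, cmath.rect(1.2, 0.7), 2.0j]
    H1 = [t * m / e for t, m, e in zip(HT, GM, GE)]
    H2 = [t * e / m for t, m, e in zip(HT, GM, GE)]
    print(to_db(true_channel_magnitude(H1, H2)))  # close to 20*log10(|HT|)
```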


Figure 4.3: Block scheme of the second measurement where $G_M$ and $G_E$ are switched.


[Plots: amplitude in dB versus frequency (0 to 5000 Hz) for the transfer function with the left channel at the mouth, the transfer function with the right channel at the mouth, and the estimate of the "true" channel; a second panel shows an impulse response versus filter taps (0 to 500)]


Chapter 5

Conclusions

The goal of this Master thesis has been to investigate the possibility of placing a microphone for communication purposes inside a protective mask, as well as the possibility of placing the microphone inside a person's auditory meatus and digitally equalizing the speech path in question. A number of methods, both adaptive and non-adaptive, have been evaluated. The work shows that the correlation method is a powerful and straightforward way of identifying a system. Subjective listening tests indicate that this method was able to identify and equalize the mask channel with a satisfactory result and with reasonable filter lengths.

The mouth-ear channel presented more difficulties because of its “non-ideal” circumstances. The mask was attached to a test dummy head equipped with a loudspeaker in its mouth, and bandlimited noise was used as the reference signal. When the mouth-ear channel was to be identified, a real human skull had to be used and the test subject had to excite the skull himself. Partly because of this, a proper transfer function for this channel was difficult to find. The work also shows that the speech signal detected inside the auditory meatus is substantially attenuated; the resulting low SNR raises the requirements on the measurement equipment. Another factor that affects the final result is that the excitation signal of the skull is not used as the reference/desired signal. Instead, the speech at the test subject's mouth is used as the desired signal when identifying an equalizing filter. This makes the identification process far more complex than in the case of the test dummy head and the protective mask.

Nevertheless, subjective listening tests revealed that a substantial improvement in speech intelligibility was achieved when using the correlation method. The adaptive methods performed less well, mainly because of convergence problems.

5.1 Further Work

Further improvements may be achieved by using one or more of the suggestions below:


• It has been shown that the sound pressure level varies depending on where inside the auditory meatus the microphone is placed [8]. A small change in the position of the microphone may therefore increase the SNR to some extent.


Appendix A

MatLab functions

A.1 LMS Algorithm

function [yout,eout,f]=lms(x,d,mu,nord)
% [yout,eout,f]=lms(x,d,mu,nord)
%
% x    - Input Signal
% d    - Desired Signal
% mu   - Step size
% nord - Filter length
% yout - Filter output
% eout - Error during convergence


A.2 NLMS Algorithm

function [yout,eout,f]=nlms(x,d,mu,nord)
% [yout,eout,f]=nlms(x,d,mu,nord)
%
% x    - Input Signal
% d    - Desired Signal
% mu   - Step size
% nord - Filter length
% yout - Filter output
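Only the header of `nlms` survives above. For orientation, here is a minimal Python sketch of the textbook normalized LMS update, w ← w + μ e x / (ε + xᵀx); the small regularizer `eps` is an added assumption that avoids division by zero, and the code is not claimed to match the thesis's routine. Normalizing by the instantaneous input energy makes 0 < μ < 2 the stable range regardless of input level, which is why the usage deliberately scales the input by 100.

```python
import random


def nlms(x, d, mu, nord, eps=1e-8):
    """Normalized LMS adaptive FIR filter; returns (yout, eout, w)."""
    w = [0.0] * nord
    buf = [0.0] * nord  # tapped delay line, newest sample first
    yout, eout = [], []
    for n in range(len(x)):
        buf = [x[n]] + buf[:-1]
        y = sum(wk * xk for wk, xk in zip(w, buf))
        e = d[n] - y
        norm = eps + sum(xk * xk for xk in buf)  # instantaneous input energy
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, buf)]
        yout.append(y)
        eout.append(e)
    return yout, eout, w


if __name__ == "__main__":
    random.seed(3)
    h = [1.0, -0.4]  # hypothetical system to identify
    x = [100.0 * random.gauss(0.0, 1.0) for _ in range(3000)]
    d = [h[0] * x[n] + (h[1] * x[n - 1] if n >= 1 else 0.0) for n in range(len(x))]
    _, _, w = nlms(x, d, mu=0.5, nord=2)
    print(w)
```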


A.3 LLMS Algorithm

function [yout,eout,f]=llms(x,d,mu,gamma,nord)
% [yout,eout,f]=llms(x,d,mu,gamma,nord)
%
% x     - Input Signal
% d     - Desired Signal
% mu    - Step size
% gamma - Leakage factor
% nord  - Filter length
% yout  - Filter output
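Similarly, a minimal Python sketch of the standard leaky LMS update, w ← (1 − μγ)w + μ e x, with γ the leakage factor. Whether the thesis's `llms` scales the leakage by μ in this way is an assumption. The leakage pulls the taps toward zero, keeping them bounded when the input is poorly exciting, at the price of a bias: for white, unit-variance input the taps settle near h/(1 + γ) instead of h.

```python
import random


def llms(x, d, mu, gamma, nord):
    """Leaky LMS adaptive FIR filter; returns (yout, eout, w)."""
    w = [0.0] * nord
    buf = [0.0] * nord  # tapped delay line, newest sample first
    leak = 1.0 - mu * gamma  # assumed leakage scaling
    yout, eout = [], []
    for n in range(len(x)):
        buf = [x[n]] + buf[:-1]
        y = sum(wk * xk for wk, xk in zip(w, buf))
        e = d[n] - y
        w = [leak * wk + mu * e * xk for wk, xk in zip(w, buf)]
        yout.append(y)
        eout.append(e)
    return yout, eout, w


if __name__ == "__main__":
    random.seed(4)
    h = [0.6, -0.3]  # hypothetical system to identify
    x = [random.gauss(0.0, 1.0) for _ in range(20000)]
    d = [h[0] * x[n] + (h[1] * x[n - 1] if n >= 1 else 0.0) for n in range(len(x))]
    _, _, w = llms(x, d, mu=0.002, gamma=0.5, nord=2)
    print(w)  # biased toward zero, near h / 1.5
```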


A.4 RLS Algorithm

function [W]=rls(x,d,nord,lambda)
% [W]=rls(x,d,nord,lambda)
%
% x      - Input Signal
% d      - Desired Signal
% nord   - Filter length
% lambda - Forgetting factor
% W      - Filter taps
%
% (c)Nils Westerlund, 2000
x=x(:)';
d=d(:)';
delta=0.001;
P=inv(delta)*eye(nord);
xflip=fliplr(x);
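The rest of `rls` is not shown. A minimal Python sketch of the exponentially weighted RLS recursion follows, using the same initialization as the fragment above, P(0) = δ⁻¹I with δ = 0.001; the remaining steps follow the textbook form and are not claimed to match the thesis's code line for line. `lam` plays the role of `lambda`, which is a reserved word in Python.

```python
import random


def rls(x, d, nord, lam, delta=0.001):
    """Exponentially weighted RLS; returns the final tap vector W."""
    w = [0.0] * nord
    P = [[(1.0 / delta) if i == j else 0.0 for j in range(nord)] for i in range(nord)]
    buf = [0.0] * nord  # tapped delay line, newest sample first
    for n in range(len(x)):
        buf = [x[n]] + buf[:-1]
        z = [sum(P[i][j] * buf[j] for j in range(nord)) for i in range(nord)]  # P u
        denom = lam + sum(buf[i] * z[i] for i in range(nord))  # lam + u' P u
        g = [zi / denom for zi in z]  # gain vector
        e = d[n] - sum(wk * uk for wk, uk in zip(w, buf))  # a priori error
        w = [wk + gk * e for wk, gk in zip(w, g)]
        # P <- (P - g u' P) / lam; P is symmetric, so u' P = z'
        P = [[(P[i][j] - g[i] * z[j]) / lam for j in range(nord)] for i in range(nord)]
    return w


if __name__ == "__main__":
    random.seed(5)
    h = [0.3, 0.7, -0.2]  # hypothetical system to identify
    x = [random.gauss(0.0, 1.0) for _ in range(200)]
    d = [sum(h[k] * x[n - k] for k in range(3) if n - k >= 0) for n in range(len(x))]
    print(rls(x, d, nord=3, lam=0.99))
```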


A.5 Minimum-Phase Filter Design

function [x2,h]=minfas(Admag,Ndft)
% [x2,h]=minfas(Admag,Ndft)
%
% Admag - Desired frequency response
% Ndft  - Length of DFT
% x2    - Minimum-phase Impulse response
% h     - Linear-phase Impulse response
%
Admag=Admag(:);
Admag=Admag';
fs=12000;
f=100*(1.2589).^(3:length(Admag)+2);   % third-octave centre frequencies from 200 Hz
Admagi=Admag;
Admagi=[Admagi 0];                      % append 0 dB as the last (Nyquist) bin
Ad=10.^(Admagi/20);                     % dB -> linear magnitude
Ad=Ad(:);
Mag(1:Ndft/2+1)=Ad;                     % bins 0..Ndft/2
Mag(Ndft/2+2:Ndft)=flipud(Ad(2:Ndft/2));  % mirror onto the negative frequencies
xehat=real(ifft(log(Mag)));             % real cepstrum of the log magnitude
xhat=2*xehat;                           % fold: double c(n) for n >= 1 ...
xhat(1)=xehat(1);                       % ... but keep c(0) unchanged
N=Ndft/2;
x2=real(ifft(exp(fft(xhat(1:N),Ndft))));
x2=x2(1:N);
% --- Linear phase - FFT method
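The core of `minfas` is the cepstral (homomorphic) construction of a minimum-phase response from a magnitude. The minimal Python sketch below isolates that step under simplifying assumptions: the magnitude is already linear (not in dB) and given on the full DFT grid of even length N, and a naive O(N²) DFT is used to stay dependency-free. The real cepstrum c(n) of log|H| is folded onto n ≥ 0 (c(0) and c(N/2) kept, c(n) doubled for 0 < n < N/2, the rest zeroed) and exponentiated back. The names are illustrative.

```python
import cmath
import math


def dft(seq, inverse=False):
    """Naive DFT; the inverse includes the 1/N factor."""
    n_pts = len(seq)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n_pts):
        acc = sum(seq[n] * cmath.exp(sign * 2j * math.pi * k * n / n_pts)
                  for n in range(n_pts))
        out.append(acc / n_pts if inverse else acc)
    return out


def min_phase_from_magnitude(mag):
    """Minimum-phase impulse response whose DFT magnitude approximates mag.

    mag: positive magnitudes on the full DFT grid (length N even,
    mag[k] == mag[N - k]).
    """
    n_pts = len(mag)
    ceps = [c.real for c in dft([math.log(m) for m in mag], inverse=True)]
    folded = [0.0] * n_pts
    folded[0] = ceps[0]
    for n in range(1, n_pts // 2):
        folded[n] = 2.0 * ceps[n]
    folded[n_pts // 2] = ceps[n_pts // 2]
    spectrum = [cmath.exp(c) for c in dft(folded)]  # back to the frequency domain
    return [v.real for v in dft(spectrum, inverse=True)]


if __name__ == "__main__":
    n_pts = 64
    # |1 + 0.5 e^{-jw}| is also the magnitude of the maximum-phase 0.5 + z^-1.
    mag = [abs(1.0 + 0.5 * cmath.exp(-2j * math.pi * k / n_pts)) for k in range(n_pts)]
    h = min_phase_from_magnitude(mag)
    print([round(v, 6) for v in h[:4]])
```

Both the minimum-phase 1 + 0.5z⁻¹ and the maximum-phase 0.5 + z⁻¹ have the magnitude used in the usage; the construction returns the minimum-phase one, approximately [1, 0.5, 0, ...].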


A.6 Coherence Function and Estimate of Transfer Function

function [Txy_H1,Txy_H2,Cxy]=sysest(x,y,nfft,winflag)
% [Txy_H1,Txy_H2,Cxy]=sysest(x,y,nfft,winflag)
%
% x       - Input Signal
% y       - Output Signal
% nfft    - FFT Length
% winflag - 1 -> windowing, 0 -> no windowing
% Txy_H1  - H1-estimate of transfer function
% Txy_H2  - H2-estimate of transfer function
% Cxy     - Coherence Function


Txy_H2=Pyy./conj(Pxy);           % H2-estimate
Txy_H1=Txy_H1(1:nfft/2);         % keep the first nfft/2 bins
Txy_H2=Txy_H2(1:nfft/2);
Cxy=(abs(Pxy).^2)./(Pxx.*Pyy);   % magnitude-squared coherence
Cxy=Cxy(1:nfft/2);
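The estimators at the end of `sysest` can be sketched in Python under stated assumptions: non-overlapping, unwindowed segments (the `winflag` option is not modelled), a naive DFT, and the convention Pxy = Σ conj(X)·Y, under which the H2-estimate Pyy/conj(Pxy) in the fragment equals Pyy/Pyx. The H1-estimate is taken as Pxy/Pxx, the usual definition, since the corresponding MatLab line is not shown.

```python
import cmath
import math
import random


def dft(seq):
    """Naive forward DFT."""
    n_pts = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def sysest(x, y, nfft):
    """H1, H2 and coherence from segment-averaged spectra.

    Non-overlapping, unwindowed segments of length nfft. Returns three lists
    for bins 0 .. nfft/2 - 1. Convention: Pxy = sum(conj(X) * Y).
    """
    n_seg = len(x) // nfft
    if n_seg == 0:
        raise ValueError("signals must be at least nfft samples long")
    pxx = [0.0] * nfft
    pyy = [0.0] * nfft
    pxy = [0j] * nfft
    for s in range(n_seg):
        X = dft(x[s * nfft:(s + 1) * nfft])
        Y = dft(y[s * nfft:(s + 1) * nfft])
        for k in range(nfft):
            pxx[k] += abs(X[k]) ** 2
            pyy[k] += abs(Y[k]) ** 2
            pxy[k] += X[k].conjugate() * Y[k]
    half = nfft // 2
    h1 = [pxy[k] / pxx[k] for k in range(half)]              # H1 = Pxy / Pxx
    h2 = [pyy[k] / pxy[k].conjugate() for k in range(half)]  # H2 = Pyy / Pyx
    coh = [abs(pxy[k]) ** 2 / (pxx[k] * pyy[k]) for k in range(half)]
    return h1, h2, coh


if __name__ == "__main__":
    random.seed(6)
    x = [random.gauss(0.0, 1.0) for _ in range(512)]
    y = [2.0 * v + random.gauss(0.0, 1.0) for v in x]  # gain 2 plus output noise
    h1, h2, coh = sysest(x, y, nfft=32)
    print(abs(h1[3]), abs(h2[3]), coh[3])
```

Because |H2|/|H1| = 1/γ², the two estimates agree only where the coherence is 1; output noise pushes H2 above H1.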

A.7 Estimate of “True” Channel

function [Htrue,htrue,lchm_Hinv,rchm_Hinv]=...
    truechan(lchm_innoise,lchm_outnoise,rchm_innoise,rchm_outnoise)
% [Htrue,htrue,lchm_Hinv,rchm_Hinv]=...
%     truechan(lchm_innoise,lchm_outnoise,rchm_innoise,rchm_outnoise)
%
% lchm_innoise  - Left channel at mouth, input noise
% lchm_outnoise - Left channel at mouth, output noise
% rchm_innoise  - Right channel at mouth, input noise
% rchm_outnoise - Right channel at mouth, output noise
% Htrue         - "True" channel transfer function
% htrue         - Impulse response for "true" channel
% lchm_Hinv     - Est. of equ. transfer func., left ch. at mouth
% rchm_Hinv     - Est. of equ. transfer func., right ch. at mouth


Bibliography

[1] Proakis J. G., Manolakis D. G. (1996). Digital Signal Processing, Principles, Algorithms and Applications (Prentice-Hall).

[2] Hayes M. H. (1996). Statistical Digital Signal Processing and Modeling (Wiley).

[3] Widrow B., Stearns S. D. (1985). Adaptive Signal Processing (Prentice-Hall).

[4] MatLab Reference Guide - System Identification Toolbox.

[5] MatLab Reference Guide.

[6] Parks T. W., Burrus C. S. (1987). Digital Filter Design (Wiley).

[7] Håkansson B., Carlsson P., Brandt A., Stenfelt S. (1995). “Linearity of sound transmission through the human skull in vivo,” J. Acoust. Soc. Am. 99, 2239-2243.
