
MASTER'S THESIS

Noise Reduction in Mobile Phones

Simon Christensson

Master of Science in Engineering Technology Engineering Physics and Electrical Engineering


Noise Reduction in Mobile Phones

Simon Christensson
Master's Thesis, September 18, 2011

Examiner: Arne Nykänen
Supervisor: Marcus Larsson


Preface

In my pursuit of a master's thesis I came across an ad with the title "Noise Reduction in Mobile Phones". My first thought was: Perfect! Besides receiving my degree in electrical engineering and sound design from Luleå University of Technology, I would spend six months doing something I am fascinated by.

Two months later I left Luleå for Gothenburg to begin my master's thesis at ASCOM.

I want to thank my family for their unconditional support and love, and my friends for all the relief during hard times. To my newfound friends at Ascom: thanks for the encouragement, good times and ice creams stolen from the company fridge during long and troublesome days. I also want to thank my supervisors, Fredrik Bode and Markus Larsson, for their support, guidance and the opportunity given to me.

Simon Christensson, Author


Abstract

This project has been about reducing acoustical noise transferred in a communication system intended for speech. The communication system is a mobile phone, and the intended use is in environments with high noise intensity. Spectral subtraction, adaptive filtering, fixed beamforming and adaptive beamforming are the noise reduction methods that have been explored. For complexity reduction a polyphase subband structure has been proposed. Implementations and evaluations have been made in Matlab.

An analog beamformer provided by National Semiconductor (LMV1089) has been calibrated according to the manufacturer's instructions and tested in simulated noise environments.

For evaluation purposes four microphones were integrated in a mobile phone.

Recordings were made in an isolated chamber with the phone strapped to the head of a dummy torso. With one loudspeaker in every corner and one in the mouth of the dummy head, different noise and speech situations were recorded. These recordings were used as guidelines to simulate the noise in Matlab. Using the recordings directly for evaluation failed, possibly because of a very small time delay between the two channels.

The spectral subtraction method is based on a noise estimate made during speech-free segments. Using a voice activity detector (VAD), the mean value of the noise is continuously updated. Since the VAD and the noise estimation are both based on a mean value, non-stationary noise will diminish the performance. There is also a chance that the altered noise gets even more distracting. When using an adaptive filter as a noise canceler, a second microphone is needed as a reference signal. In a mobile handset application, difficulties in isolating the reference source from the wanted signal limit the noise reduction considerably. A beamformer utilizes the interference pattern that occurs when adding multiple inputs with different time delays to isolate sounds coming from specific angles. It is therefore dependent on the predefined spatial relationship between speaker and microphone placement. A beamformer can be made both stationary and adaptive. The effectiveness of the stationary beamformer depends on the spatial relationship, while an adaptive beamformer can adjust to changes in the spatial domain. Considering complexity, required hardware and convenience for the user, digital beamforming is recommended for further study and real-time implementation.


Contents

1 Background
  1.1 Speech and Human Perception
  1.2 Noise

2 ANC - Active Noise Cancellation
  2.1 Spectral Subtraction
    2.1.1 VAD - Voice Activity Detector
  2.2 Adaptive Filtering
    2.2.1 Recursive Least Square
  2.3 Beamforming
    2.3.1 Fixed Beamformer
    2.3.2 Adaptive Beamformers
    2.3.3 Analog Beamforming

3 Polyphase Subbands
  3.1 Polyphase Subband Decomposition
  3.2 Polyphase Subband Reconstruction

4 Simulation and Measurement
  4.1 SPL Gain
  4.2 Calibration
    4.2.1 Digital Beamformers
    4.2.2 Analog Beamformers - LMV1089
  4.3 Evaluation Noise and Speech

5 Results
  5.1 Spectral Subtraction
    5.1.1 Complexity of Spectral Subtraction
  5.2 Adaptive Filters
    5.2.1 Complexity of Adaptive Filters
  5.3 Beamformer
    5.3.1 Fixed Beamformer
    5.3.2 Adaptive Beamformer
    5.3.3 Complexity of Beamforming
    5.3.4 Analog Beamforming - LMV1089

6 Improvements
  6.1 Spectral Subtraction
  6.2 Adaptive Filtering
  6.3 Beamforming
  6.4 Polyphase Subbands

7 Conclusion

8 Appendix
  A


Introduction

"Can you please speak up!" A phrase uttered by most cellphone users at some point or another. The difficulties of trying to communicate in the presence of noise are well known. Whenever a signal is transferred it is bound to pick up noise along the way. Though the noise can take many forms, all forms have one thing in common: they store no information useful for the receiver. Since acoustical noise is constantly present, the human auditory system performs its own processing of noisy signals. It allows us to ignore unnecessary information and focus on the important part of the signal. Effective as this may be, it takes its toll and has its limits.

This thesis has been carried out at ASCOM AB with the purpose of exploring the production possibilities of different noise reduction systems. All algorithms have been designed for future real-time implementation.


Chapter 1

Background

When designing an audio communication system, noise always has to be considered. The noise may be due to electrical disturbances, coding artifacts or other environmental noise coupled with the information-bearing signal. Unwanted noise consumes energy and deteriorates the audibility of the signal. There is a variety of methods available to reduce noise, both analog and digital. Each method has its own advantages and areas of expertise, and the efficiency is often dependent on the type of noise. In this project the communication system is a mobile phone to be used in industrial facilities. When designing a communication system for human perception, audio masking, frequency dependency and listening habits are all phenomena that have to be considered. Intelligibility is a necessary condition for acceptability, but not a sufficient one: the altered noise and speech might be more inconvenient even with an increased signal-to-noise ratio (SNR) [1].

The objectives of the project were to:

• Research noise reduction systems possible for a mobile communication system

• Compare and evaluate different methods

• Achieve an SNR improvement of at least 20 dB when measuring diffuse and direct speech with the listener sidetone average D measurement.

Some wiggle room is implied in the last goal. The listener sidetone average D is the difference in sensitivity between the wanted signal and any noise [2]. For this to be an accurate measurement the system must be independent of the type of incoming signals, which is rarely the case for most active noise cancelers. Instead, the most important aspect is that the receiver is not disturbed by the noise surrounding the sender. All in all, the goal was to make it sound good! Since the system is intended for a real-time application, the constraint set by the complexity is considered throughout the report.

1.1 Speech and Human Perception

Speech signals are classified as non-stationary signals, meaning that characteristics such as expectation value and variance change during the progression. It is difficult, and sometimes impossible, to create a model for a non-stationary process. Success has however been achieved by assuming that the process is momentarily stationary, i.e. wide-sense stationary. A linear prediction model can then be utilized for speech recognition and reconstruction. One such model is the hidden Markov model, where training vectors allow for an accurate model of a spoken word [3]. The upper frequency limit of human hearing starts around 20 kHz and slowly deteriorates over time. The rate of hearing loss is higher for those who have been exposed to high-intensity sounds. The intensity perceived by the ear is frequency dependent and logarithmic [4]. The Sound Pressure Level (SPL) is expressed in dB, where 0 dB defines the threshold of hearing and 120 dB represents the pain threshold, both at 1 kHz.

1.2 Noise

Noise is no more than unwanted signals carrying no information of any value for the receiver. By this definition there is no way to model or define the characteristics of all types of noise. The kid on the bus screaming for ice cream has a different frequency spectrum, expectation value and variance compared to the hum introduced by the power line that is biasing your stereo. What they do have in common is that they are both annoying and therefore labeled as noise. This implies that when designing a noise reduction system, the type of noise will affect the performance. An ice cream to the screaming kid will eliminate that noise, while a notch filter takes care of the hum in your stereo.

This is not to say there is only one solution to both these problems, but that some are more appropriate than others for a given environment. The intended application of this project is in industrial facilities. In such environments the acoustic noise intensity issuing from large machinery can be quite substantial. In Figure 1.1 the characteristics of a recording made in a typical industrial noise environment are outlined. Since the noise spectrum overlaps the human voice spectrum, Figure 1.1(b), a simple bandpass filter cannot be used to separate the two. The noise has a Gaussian distribution, Figure 1.1(c), with an expectation value of 0. In Figure 1.1(d) it can be seen that the autocorrelation is periodic. This is an effect of high-intensity noise emitted from machines working in a wide-sense stationary fashion, which implies that after a certain time the noise repeats itself.


Figure 1.1: Recording of a typical industrial facility. (a) Time signal, (b) frequency spectrum in 1/3 octave bands, (c) probability density function with σ = 0.12 and µ = 0, (d) autocorrelation.


Chapter 2

ANC - Active Noise Cancellation

The increase in computational power of digital processors has enabled more complicated and efficient noise cancellation methods. As always, there is a give and take between efficiency and overall cost. When reducing noise in a commercial product, computational load, power consumption and user convenience have to be considered. Spectral subtraction, adaptive filtering and beamforming are the main ideas that will be presented in this thesis.

2.1 Spectral Subtraction

Spectral subtraction is one of the few algorithms that operate using a single input. Other single-input methods might have a slightly different approach but share the same constraints [5]. The outline of the spectral subtraction algorithm can be seen in Figure 2.1. If the speech is denoted s(k) and the coupled noise n(k), the incoming signal is written as

y(k) = s(k) + n(k) (2.1)

As a first step, y(k) is divided into frames, where each frame has a 50 % overlap with the previous frame. To prevent spectral leakage, each frame is windowed using a symmetrical window. The Fourier transform is then taken for each windowed frame:

$$Y(e^{j\omega}) = S(e^{j\omega}) + N(e^{j\omega}) \quad (2.2)$$

By assuming that the first frames contain no speech, an average $\mu(e^{j\omega})$ of the noise spectrum $N(e^{j\omega})$ is made. The spectral subtraction estimator $\hat{S}(e^{j\omega})$ is written as


$$\hat{S}(e^{j\omega}) = \left[\,|Y(e^{j\omega})| - \alpha\,\mu(e^{j\omega})\,\right] e^{j\theta_y(e^{j\omega})} \quad (2.3)$$

where $\theta_y(e^{j\omega})$ is the phase of $Y(e^{j\omega})$ and $\alpha$ is a subtraction factor proportional to $\mathrm{SNR}^{-1}$ [6], where

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{k=b}^{e} |Y(k)|^2}{\sum_{k=b}^{e} |\mu(k)|^2} \quad (2.4)$$

$\alpha$ is then recalculated for each frame according to

$$\alpha = \begin{cases} \alpha_{\max} & \mathrm{SNR} < \mathrm{SNR}_{\min} \\ \alpha_{\max} + \alpha_{slope}\,\mathrm{SNR} & \mathrm{SNR}_{\min} \le \mathrm{SNR} \le \mathrm{SNR}_{\max} \\ \alpha_{\min} & \mathrm{SNR} > \mathrm{SNR}_{\max} \end{cases} \quad (2.5)$$

$\alpha_{\max}$, $\alpha_{\min}$, $\mathrm{SNR}_{\min}$ and $\mathrm{SNR}_{\max}$ are predefined constants. The slope of the smoothing curve is given by

$$\alpha_{slope} = \frac{\alpha_{\min} - \alpha_{\max}}{\mathrm{SNR}_{\max} - \mathrm{SNR}_{\min}} \quad (2.6)$$

If $\hat{S}(e^{j\omega})$ given in equation 2.3 were to have negative components, a rectification is performed according to

$$|\hat{S}(e^{j\omega})| = \begin{cases} |\hat{S}(k)| & \hat{S} > 0 \\ \beta\,|\hat{S}(k)| & \text{else} \end{cases} \quad (2.7)$$

where $\beta$ sets the noise floor. By using a Voice Activity Detector, the noise estimate is continually updated using speech-free frames.
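The procedure of equations 2.1-2.7 can be sketched compactly. The thesis implementation was made in Matlab; the following is a minimal Python/NumPy sketch with a fixed subtraction factor α (the thesis recalculates α per frame via equation 2.5) and the rectification floor tied to the noisy magnitude, a common variant. All parameter values are illustrative assumptions.

```python
import numpy as np

def spectral_subtract(y, frame_len=256, alpha=3.0, beta=0.02, noise_frames=10):
    """Basic spectral subtraction (eqs. 2.1-2.7) with a fixed alpha."""
    hop = frame_len // 2                      # 50 % frame overlap
    win = np.hanning(frame_len)               # symmetric window against leakage
    n_frames = (len(y) - frame_len) // hop + 1

    # Noise magnitude estimate mu from the first (assumed speech-free) frames
    mu = np.zeros(frame_len)
    for m in range(noise_frames):
        mu += np.abs(np.fft.fft(win * y[m * hop : m * hop + frame_len]))
    mu /= noise_frames

    out = np.zeros(len(y))
    for m in range(n_frames):
        seg = win * y[m * hop : m * hop + frame_len]
        Y = np.fft.fft(seg)
        mag = np.abs(Y) - alpha * mu          # eq. 2.3: subtract scaled noise
        floor = beta * np.abs(Y)              # noise floor for rectification (eq. 2.7)
        mag = np.where(mag > 0, mag, floor)
        S = mag * np.exp(1j * np.angle(Y))    # reuse the noisy phase
        out[m * hop : m * hop + frame_len] += np.real(np.fft.ifft(S))
    return out
```

For a stationary-noise input the output energy drops close to the β-scaled floor, which is the behavior that makes the method attractive for constant noise floors.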


Figure 2.1: Overview of proposed spectral subtracter

2.1.1 VAD - Voice Activity Detector

The purpose of the VAD is to recognize the presence of speech in a frame. This information is then used to decide whether to update the noise estimate $\mu(e^{j\omega})$. VAD algorithms are divided into two main categories: time domain and frequency domain techniques [7]. Since the spectral subtractor operates in the frequency domain, the proposed VAD is also in the frequency domain.

Spectral Threshold Voice Activity Detector

The VAD makes use of what is already available from the spectral subtractor, that is, the short-time spectra and the initial noise estimate. A distortion factor is calculated as

$$D(e^{j\omega}) = |Y(e^{j\omega})| - |\mu(e^{j\omega})| \quad (2.8)$$


Negative values are set to zero and a mean value $\hat{D}$ is calculated according to

$$\hat{D} = \frac{1}{f_{\max} - f_{\min}} \sum_{k=f_{\min}}^{f_{\max}} D(k) \quad (2.9)$$

where $f_{\min}$ and $f_{\max}$ denote the spectral boundaries typical for speech. The speech indicator is based on the predefined threshold value $N_{thres}$ as

$$\mathrm{SpeechFlag} = \begin{cases} \text{true} & \hat{D} \le N_{thres} \\ \text{false} & \hat{D} > N_{thres} \end{cases} \quad (2.10)$$

The proposed VAD also introduces a counter that keeps track of the number of consecutive frames used for the noise estimation. This introduces a hangover from frame to frame, ensuring that a decision is not made immediately after detecting an inactive frame [8].

2.2 Adaptive Filtering

For a noise canceler to be equally effective in multiple environments, adaptation is a must. A static filter is designed to enhance or reject predetermined parts of a signal, while an adaptive filter alters its selectivity based on previous success. In figure 2.2 the outline of a noise canceler using an adaptive filter is seen. As inputs, the noise canceler requires a noise reference n(k) and a corrupted speech signal. The noise-polluted speech channel can be written as

x(k) = s(k) + n0(k) (2.11)

where s(k) is the desired speech and n0(k) is related to the noise reference n(k) as

$$n_0(k) = \sum_{b=-\infty}^{\infty} h(b)\, n(k-b) \quad (2.12)$$

The filter h(k) represents the physical changes the noise undergoes when propagating from one microphone to another. The objective of the adaptive filter is to identify the transfer function h(k). The error function e(k) used to adapt the filter coefficients is calculated as

$$e(k) = x(k) - \sum_{b=-\infty}^{\infty} h(b)\, n(k-b) \quad (2.13)$$


The general form of the filter adaptation is then written as

h(k + 1) = h(k) + µ(k)G(e(k), x(n), Φ(k)) (2.14) where G is a vector-valued nonlinear function and µ(k) is a step size param- eter. Φ is a vector of states holding relevant data about the characteristic of the input and error signal. Since it is the noise that is being modeled and subtracted in equation 2.13 the error function converges to the desired speech signal [9].

Two conventional adaptive filters are the Least Mean Square (LMS) and the Recursive Least Square (RLS). The LMS filter is often the more practical solution because it requires fewer computations; the downside is slow convergence. In environments with non-stationary noise signals the RLS is superior, at the cost of complexity. By utilizing subband structures the complexity can be reduced.

Figure 2.2: Outline of an adaptive noise canceler [9]
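The structure of figure 2.2 can be sketched with the LMS update, the simpler of the two filters just mentioned (eq. 2.14 with the gradient term G = e(k)·n(k)). This is a Python/NumPy sketch rather than the thesis' Matlab code; the tap count and step size are illustrative assumptions.

```python
import numpy as np

def lms_cancel(x, n_ref, taps=32, mu=0.01):
    """LMS adaptive noise canceler (fig. 2.2).

    x: corrupted speech x(k) = s(k) + n0(k), n_ref: noise reference n(k).
    Returns the error signal e(k), which converges to the speech."""
    h = np.zeros(taps)
    e = np.zeros(len(x))
    for k in range(taps, len(x)):
        n_vec = n_ref[k - taps:k][::-1]   # most recent reference samples
        e[k] = x[k] - h @ n_vec           # eq. 2.13: subtract the noise estimate
        h += mu * e[k] * n_vec            # eq. 2.14 with the LMS gradient term
    return e
```

With a clean reference and a simple coupling filter, the residual noise decays towards zero, illustrating the convergence behavior described above.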

2.2.1 Recursive Least Square

The RLS adaptive filter is a sample-adaptive formulation of the least square error Wiener filter [10]. If the input signals were stationary, the RLS filter would converge to the optimal solution of the Wiener filter. The advantages of a sample-by-sample process are lower processing delay and faster convergence. The downside is the increase in computations [11].

RLS algorithm

First an initialization is made:

$$\Phi(0) = \delta I, \qquad \omega(0) = h$$

where $I$ is the $P \times P$ identity matrix, $\delta$ a bias constant, and $h = [h_1 = 1, h_2 = 0, \ldots, h_P = 0]^T$. Each signal segment of the input defined in 2.11 is organized as $y(k) = [x(k-1), x(k-2), x(k-3), \ldots, x(k-P)]^T$. The filter gain is defined as

$$g(k) = \frac{\Phi(k-1)\, y(k)}{\lambda + y(k)^T\, \Phi(k-1)\, y(k)} \quad (2.15)$$

In equation 2.15, $\lambda$ is the forgetting factor; it decides the impact older samples have on the adaptation. The sample-by-sample error signal is written as

$$e(k) = x(k) - \omega^T(k-1)\, y(k) \quad (2.16)$$

and the coefficients are updated as $\omega(k) = \omega(k-1) + g(k)\, e(k)$. Before starting over with a new input, $\Phi(k)$ is updated:

$$\Phi(k) = \lambda^{-1} \Phi(k-1) - \lambda^{-1} g(k)\, y^T(k)\, \Phi(k-1) \quad (2.17)$$
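The recursion of equations 2.15-2.17 can be sketched as follows, again in Python/NumPy rather than the thesis' Matlab. δ, λ and the tap count are illustrative assumptions, and the reference samples are used as the regressor as in the noise-canceler setting of figure 2.2.

```python
import numpy as np

def rls_cancel(x, n_ref, taps=8, lam=0.99, delta=100.0):
    """RLS adaptive noise canceler (eqs. 2.15-2.17).

    x: corrupted speech, n_ref: noise reference. Returns the a priori
    error signal, which converges to the speech."""
    w = np.zeros(taps)
    P = delta * np.eye(taps)                 # Phi(0) = delta * I
    e = np.zeros(len(x))
    for k in range(taps, len(x)):
        y = n_ref[k - taps:k][::-1]          # y(k) = [n(k-1) ... n(k-P)]^T
        g = P @ y / (lam + y @ P @ y)        # eq. 2.15: gain vector
        e[k] = x[k] - w @ y                  # eq. 2.16: a priori error
        w = w + g * e[k]                     # coefficient update
        P = (P - np.outer(g, y @ P)) / lam   # eq. 2.17
    return e
```

Compared with the LMS sketch earlier, the same coupling is identified in far fewer samples, at the cost of the matrix update in each step.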

2.3 Beamforming

Beamforming capitalizes on the spatial information available. By combining the elements of a receiving array, certain angles can be made to experience constructive interference [12]. There are I sensors, each receiving a correlated version of the speech signal s[k], a noise signal n_i[k] and a set of fixed-point noise sources n_id, d = 1, 2, ..., D. The input to each sensor is written as

$$x_i[k] = s_i[k] + \sum_{d=1}^{D} n_{id}[k] + n_i[k], \qquad i \in [1, 2, \ldots, I] \quad (2.18)$$

In figure 2.4 the outline of a beamformer can be seen. Each channel is passed through a filter $\omega_i$ and the filtered channels are summed to create the output

$$y[k] = \sum_{i=1}^{I} \sum_{j=0}^{L-1} \omega_i[j]\, x_i[k-j] \quad (2.19)$$

If the desired output y[k] is to be no more than the speech, the problem can be formulated as an optimization task:

$$\begin{cases} \max\; y[k] \cong s[k] \\ \min\; \sum_{i=1}^{I} \sum_{j=0}^{L-1} \omega_i[j] \left[ \sum_{d=1}^{D} n_{id}[k-j] + n_i[k-j] \right] \end{cases} \quad (2.20)$$

Beamformers are divided into two main categories: adaptive and fixed.


Figure 2.3: Simple Beamformer [12]

Figure 2.4: Beamformer using I inputs [12]
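The interference principle behind figure 2.4 is easiest to see in a plain delay-and-sum form: align each channel to the look direction, then average. Below is a minimal Python/NumPy sketch with integer sample delays (fractional delays would require interpolation); it is an illustration, not the thesis' implementation.

```python
import numpy as np

def delay_and_sum(x, delays):
    """Delay-and-sum beamformer.

    x: (I, N) array of sensor signals, delays[i]: integer steering delay in
    samples for sensor i. Signals matching the steering delays add coherently,
    while uncorrelated noise is averaged down."""
    I, _ = x.shape
    y = np.zeros(x.shape[1])
    for i in range(I):
        y += np.roll(x[i], delays[i])   # align channel i to the look direction
    return y / I                        # average the aligned channels
```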

2.3.1 Fixed Beamformer

If the wanted signal has a predefined and fixed location, static optimal beamformers can be constructed, meaning the filter weights only have to be calculated once. Optimality is guaranteed as long as no changes in source location are made. In the delay and sum beamformer, as seen in figure 2.4, spatial information can be defined through pre-recordings where only the sources of interest are active. By minimizing the mean square error,

$$\omega_{opt} = \arg\min_{\omega} E\left\{ |y[k] - s_r[k]|^2 \right\}, \qquad r \in [1, 2, \ldots, I], \quad (2.21)$$

an optimal solution is obtained. In equation 2.21, $s_r$ is the signal taken from a chosen reference microphone receiving only the source of interest. The ideal reference would be the actual calibration signal sent to the speaker. If the transfer function from speaker to nearest input is unknown, a chosen reference microphone will have to do [12].

Optimal Near-field Minimum Mean Square Error Beamformer

It can be shown that the filter weights $\omega_{opt}$ are given by

$$\omega_{opt} = [R_{ss} + R_{nn}]^{-1} r_s \quad (2.22)$$

where

$$R_{ss} = \begin{bmatrix} R_{s_1 s_1} & R_{s_1 s_2} & \cdots & R_{s_1 s_I} \\ R_{s_2 s_1} & R_{s_2 s_2} & \cdots & R_{s_2 s_I} \\ \vdots & \vdots & \ddots & \vdots \\ R_{s_I s_1} & R_{s_I s_2} & \cdots & R_{s_I s_I} \end{bmatrix} \quad (2.23)$$

$$R_{s_i s_j} = \begin{bmatrix} r_{s_i s_j}[0] & r_{s_i s_j}[1] & \cdots & r_{s_i s_j}[L-1] \\ r_{s_i s_j}[1] & r_{s_i s_j}[0] & \cdots & r_{s_i s_j}[L-2] \\ \vdots & \vdots & \ddots & \vdots \\ r_{s_i s_j}[L-1] & r_{s_i s_j}[L-2] & \cdots & r_{s_i s_j}[0] \end{bmatrix} \quad (2.24)$$

and $r_{s_i s_j}[l]$ is the cross correlation between channels $s_i[k]$ and $s_j[k]$:

$$r_{s_i s_j}[l] = E\left\{ s_i[k]\, s_j[k+l] \right\}, \qquad l = 0, 1, \ldots, L-1 \quad (2.25)$$

$R_{nn}$ is the noise correlation matrix, defined in the same way as in equation 2.23, for the speech-free noise signal [12].

2.3.2 Adaptive Beamformers

If there is a model mismatch in the spatial domain, the static filters in the fixed beamformers will not realize the optimal solution outlined in 2.22. Adaptive beamformers, on the other hand, are equipped both to track variations and to compensate for any model mismatch. Using the RLS adaptive filter, a superior performance is achieved, but at the cost of higher computational load. Implementing a polyphase subband structure will diminish the computational burden, as explained in the next chapter.

Calibrated Weighted RLS Beamformer

The adaptive beamformer proposed in [12] is a hybrid of the RLS adaptive filter and the fixed beamformer presented previously. During an acquisition phase, estimates of the speech and noise correlations defined in equation 2.23 are made:

$$\hat{R}_{ss} = R_{ss}, \quad \hat{r}_{ss} = r_{ss}, \qquad \hat{R}_{nn} = R_{nn}, \quad \hat{r}_{nn} = r_{nn} \quad (2.26)$$

The correlation matrices are memorized in diagonalized form:

$$\hat{R}_{ss} + \hat{R}_{nn} = Q^H \Gamma Q \quad (2.27)$$

$Q$ contains the eigenvectors, $Q = [q_1, q_2, \ldots, q_I]$, and $\Gamma = \mathrm{diag}([\gamma_1, \gamma_2, \ldots, \gamma_I])$, where the $\gamma_i$ are the eigenvalues. The filter weights are initialized as in the RLS filter, and the inverse of the correlation matrix, $P_k$, as

$$P_0 = Q^H \Gamma^{-1} Q \quad (2.28)$$

For each $k = 1, 2, \ldots$:

$$x_k = [x_1[k], x_2[k], \ldots, x_I[k]]^T \quad (2.29)$$

$$\tilde{P} = \lambda^{-1} P_{k-1} - \frac{\lambda^{-2}\, P_{k-1} x_k x_k^H P_{k-1}}{1 + \lambda^{-1} x_k^H P_{k-1} x_k} \quad (2.30)$$

$$P_k = \tilde{P} - \frac{\gamma_p (1-\lambda)\, \tilde{P} q_p q_p^H \tilde{P}}{1 + \gamma_p (1-\lambda)\, q_p^H \tilde{P} q_p} \quad (2.31)$$

where the index $p = (k \bmod I) + 1$ (modulo operation: remainder of division), and

$$\omega_k = \alpha\, \omega_{k-1} + (1-\alpha)\, P_k \hat{r}_s \quad (2.32)$$

The output from the adaptive beamformer then becomes

$$y[k] = \omega_k^H x_k \quad (2.33)$$


2.3.3 Analog Beamforming

When working in the analog domain, the available operations are limited. It is however possible to create a simple sum and delay beamformer. One such circuit is the LMV1089 from National Semiconductor [13].

LMV1089

The LMV1089 is a sum and delay beamformer consisting of a dual input and an EEPROM with a corresponding I2C interface, see Figure 2.5. The I2C allows for a calibration to compensate for any gain mismatch between the two inputs. The compensation is made at three frequencies: 300 Hz, 1 kHz and 3 kHz. The LMV1089 is pre-defined as an end-fire configuration where near-field voice signals pass straight through while far-field noise is suppressed. Any source within 4 cm is considered near-field, while any disturbing source more than 50 cm away is considered far-field.

Figure 2.5: Overview of the LMV1089 [13]


Chapter 3

Polyphase Subbands

When applying an adaptive filter in real time, any computational reduction is always welcome. The length of the filter is the obvious variable to adjust in the pursuit of speed: reducing the number of filter weights gives a substantial complexity reduction, but what you gain in speed you pay for in performance. Subband coding is one method of reducing the need for a large number of filter taps without losing performance. Dividing the incoming data into subbands is the same as distributing the information, and less information per band requires fewer filter weights. The polyphase structure uses the noble identity, which allows for a decimation before dividing the signal according to frequency [14].

3.1 Polyphase Subband Decomposition

A uniform DFT filter bank divides the input signal x[n] into K subbands according to frequency content. Each subband is made using a low-pass prototype filter, h[n], constructed with a Hamming window technique [12]. Each subband filter is related to the prototype filter as

$$H_k(z) = H_0(z W_K^k) = \sum_{n=-\infty}^{\infty} h_0[n] (z W_K^k)^{-n}, \qquad k = 0, \ldots, K-1 \quad (3.1)$$

where $W_K = e^{-j2\pi/K}$. The polyphase structure is a way of implementing the DFT filter bank in an efficient manner. Writing the total filter as the sum of its polyphase components gives

$$H_0(z) = \sum_{n=-\infty}^{\infty} h_0[n] z^{-n} = \sum_{l=0}^{K-1} z^{-l} H_{0l}(z^K) \quad (3.2)$$

$H_{0l}$ is the $l$-th polyphase component, given as

$$H_{0l}(z) = \sum_{n=-\infty}^{\infty} h_0[Kn + l] z^{-n}, \qquad l = 0, \ldots, K-1 \quad (3.3)$$

Each subband filter can then be expressed as a sum of the polyphase components:

$$H_k(z) = \sum_{l=0}^{K-1} z^{-l} W_K^{-kl} H_{0l}(z^K) \quad (3.4)$$

In figure 3.1 the magnitude response of a K = 32 subband filter bank has been plotted. The implementation has been made with a prototype filter of 128 taps. Using equation 3.4, each subband signal can be written as

$$X_k(z) = X(z) \sum_{l=0}^{K-1} z^{-l} W_K^{-kl} H_{0l}(z), \qquad k = 0, \ldots, K-1 \quad (3.5)$$

For fast realization of equation 3.5 the fast Fourier transform can be used, as illustrated in figure 3.2. The $z^{-1}$ component is a filter resulting in no more than one sample delay. K is the number of subbands, which is equal to the down-sampling factor. It should be noted that the IFFT is applied across all subbands, resulting in Fourier transform operations of size K for each time instant.

Figure 3.1: Subband filters for K = 32 using a 128-tap prototype filter


Figure 3.2: Polyphase subband decomposition structure [12]

3.2 Polyphase Subband Reconstruction

The reconstruction phase in the polyphase implementation is the reverse of the decomposition, see figure 3.3. The reconstructed signal $\hat{X}(z)$ is given by

$$\hat{X}(z) = X(z) \sum_{l=0}^{K-1} z^{-l} z^{-(K-l-1)} H_{0l}(z) F_{0l}(z) = z^{-(K-1)} X(z) \sum_{l=0}^{K-1} H_{0l}(z) F_{0l}(z) \quad (3.6)$$

A simplification in the reconstruction phase is to set the reconstruction filters $F_{0l}$ equal to the decomposition filters $H_{0l}$, see equation 3.3.

Figure 3.3: Polyphase subband reconstruction [12]
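The decomposition of equations 3.2-3.5 can be sketched in a few lines: form the polyphase components of the prototype filter, filter the decimated branches, and combine them with an inverse FFT. This Python/NumPy sketch simplifies the input commutator (no per-branch delay chain) and assumes the prototype length is a multiple of K; it illustrates the structure rather than reproducing the thesis' Matlab implementation.

```python
import numpy as np

def polyphase_analysis(x, h, K):
    """Simplified uniform-DFT polyphase analysis bank (eqs. 3.2-3.5).

    x: input signal, h: low-pass prototype (length a multiple of K),
    K: number of subbands = decimation factor. Returns a (K, N/K) array
    of complex subband signals."""
    H = h.reshape(-1, K).T       # row l holds h_0l[n] = h[K*n + l] (eq. 3.3)
    N = (len(x) // K) * K
    X = x[:N].reshape(-1, K).T   # row l holds the decimated branch x[K*n + l]
    # Filter each decimated branch with its polyphase component
    V = np.stack([np.convolve(X[l], H[l])[:X.shape[1]] for l in range(K)])
    # IFFT across branches realizes the W_K^{-kl} combination of eq. 3.5
    return np.fft.ifft(V, axis=0) * K
```

A DC input with a boxcar prototype lands entirely in subband 0, which is a quick sanity check of the structure.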


Chapter 4

Simulation and Measurement

For evaluation purposes a realistic noise environment was simulated. The setup consisted of:

• Acoustical dummy head from Brüel&Kjær

• Four electret condenser microphones integrated in a DH5 (Ascom) mobile phone

• Acoustical chamber with one equalized loudspeaker in each corner

• Pre-recorded noise and speech

• The Audio Precision APx525 audio analyzer for recording (two chan- nels)

Four microphones were placed in a DH5 phone, Figure 4.1. They are condenser microphones, requiring a bias voltage of around 2.2 V, and their frequency response can be considered flat and omnidirectional at the frequencies of interest. This is the standard microphone used in the phones produced by ASCOM. The microphones were placed facing 3 mm holes drilled in the outer body of the phone. An acoustic dummy head equipped with a loudspeaker placed in the mouth was used as the source of interest, Figure 4.2. The setup was placed in an isolated chamber with a loudspeaker in every corner. Two channels at a time were recorded at a sampling frequency of 44.1 kHz using the Audio Precision signal generator.


Figure 4.1: Position of implemented microphones

Figure 4.2: Test dummy head

4.1 SPL Gain

By recording an equalized sound file with a known SPL, a translation from sample value to sound pressure value was obtained. The SPL was measured at the mouth of the dummy head. The input to the reference microphone, Mic.1 in Figure 4.1, was assumed to have the same level. A gain value could then be calculated, enabling a translation from recorded sample value to SPL.


4.2 Calibration

4.2.1 Digital Beamformers

The spatial information needed in the calibration of the digital beamformers was obtained by recording white noise played from the mouth of the dummy head in a noise-free environment. The spatial information of known noise sources was recorded in a speech-free environment. The phone was placed as in Figure 4.2.

4.2.2 Analog Beamformers - LMV1089

The calibration of the LMV1089 was made as instructed in the datasheet provided by National Semiconductor [13]:

• Two microphones were placed with equal distance to a speaker.

• The input to the LMV1089 was set to an appropriate amplitude (100 mVp−p for a 1 kHz tone)

• A sequence consisting of three tones was created in Matlab and played through the speaker. The last tone was a few seconds longer than needed.

• Timing of the I2C calibration pins (PE and CAL) was realized with a comparator. The calibration audio file was set as input to the comparator, giving a high output whenever calibration tones were played and a low output otherwise.

• The speaker was turned off after a few seconds of the last tone, leaving some silence in activated calibration mode.

• When the audio sequence stopped the comparator went low, ending the calibration.

4.3 Evaluation Noise and Speech

Three different noise types have been used for evaluation purposes: Noise.1 containing white noise, Noise.2 containing industrial noise and Noise.3 containing music (Figures 2-4). The source signal used for evaluation was a male speaker uttering the phrase:

"No longer a hurricane, tropical storm George remains a menace to the Gulf coast today, bringing rain, more rain and an uncertainty of the full extent of damage"


Corresponding spectrum of the speech can be seen in Figure 4.3.

Figure 4.3: Time plot and corresponding spectrum in 1/3 octave band of a male speaker, original sound file


Chapter 5

Results

As said before, nothing worth having comes free! Higher computational cost will often result in an overall better algorithm, or at least a safer algorithm, meaning that the risk of making a bad situation worse is negligible. Artifacts and constrained noise reduction are the price paid for fast and easy computations. In this section the results of the implemented algorithms will be outlined. Different environments, constraints and cost functions will be the main focus.

5.1 Spectral subtraction

Spectral subtraction is a fast and potentially effective algorithm. Since it is based on an estimate of the noise made in the frequency domain, abrupt changes in noise characteristics will degrade the performance. Its area of expertise is therefore static noise reduction. Whenever the noise floor is constant, as with most machinery, a large reduction can be achieved.

For evaluation, Noise.1, Noise.2 and Noise.3 were set to 75 dB and individually coupled with an 80 dB speech signal. In figure 5.1 the noisy speech has been plotted. The result of applying spectral subtraction can be seen in figure 5.2. Comparing figures 5.1 and 5.2, we see that the noise floor has diminished considerably. Listening to the sound files, the reduction in noise is evident, but as said before, artifacts are being introduced. Since humans are accustomed to a certain kind of noise, alterations might lead to a more disturbing noise landscape even with an increase in SNR. In the spectral subtraction output in figure 5.2 the aggressiveness of the algorithm has been set high in order to demonstrate both the downside and the upside of a simple spectral subtractor. The aggressiveness of the algorithm is adjusted with the subtraction factor α, the allowed maximum and minimum SNR (equation 2.5) and the noise floor β. The change in SPL values in figure 5.2 is 40 dB, 17 dB and 2 dB for noise situations Noise.1, Noise.2 and Noise.3 respectively. Assuming the speech is left unaffected, this can be considered to be the resulting noise reduction. Spectral subtraction is most effective when the noise floor is at a constant level, as in the situation with white noise. More variation in the noise characteristics, such as musical noise, will reduce the effectiveness.

Spectral subtraction is also dependent on a solid VAD algorithm. If the noise drowns out the speech, the speech energy will be smeared across the spectrum. So the best situation for spectral subtraction is a well-defined noise with a high SNR. The advantages of spectral subtraction are low complexity, high noise reduction and a single-channel requirement. The disadvantages are possible artifacts and VAD dependency.

Figure 5.1: Speech (80 dB) coupled with white noise, industrial noise and musical noise (75 dB)


Figure 5.2: Output of spectral subtraction: speech (80 dB) coupled with white noise, industrial noise and musical noise (75 dB)

5.1.1 Complexity of Spectral Subtraction

The cost function of the spectral subtraction is dominated by the Fourier transform. If implementation is made with the fast Fourier transform the lower bound1 of the complexity function for each sample becomes:

O(4 · log2(L)) (5.1)

where L is the length of a frame [9]. Because of the overlap both the FFT and IFFT operation are done twice for each sample, thereby multiplication of four in equation 5.1. Using a sampling frequency of F s the minimum required clock cycle for realtime implementation becomes

clkmin = Fs · O(4 · log2(L)) (5.2)

8 kHz is the standard sampling frequency used in phones produced by Ascom. Inserted into equation 5.2 this gives a minimum required clock frequency of around 320 kHz when using frames of L = 1024.
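For illustration, equation 5.2 can be evaluated directly (a simple check using the constants from the text):

```python
import math

def spectral_subtraction_min_clock(fs, frame_len):
    """Lower bound on operations per second for real-time spectral
    subtraction: four FFT passes per sample (equation 5.2)."""
    return fs * 4 * math.log2(frame_len)

print(spectral_subtraction_min_clock(8000, 1024))  # -> 320000.0
```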

¹The overall cost is always higher but converges towards O(·) as the FFT length L → ∞


5.2 Adaptive Filters

The adaptive filters are dependent on a reference microphone. For the sake of evaluation it was therefore assumed that one of the microphones received a correlated version of the noise, but no speech. In figure 5.3 the inputs to the RLS filter have been plotted; Noise.1 has been used as the reference noise source. The corresponding error signal, seen in figure 5.4, converges fast and effectively towards the speech signal. That is the result when a good reference microphone is available. The effectiveness of the algorithm depends on the complexity of the transfer function between the two inputs. To investigate the possibility of a reference microphone in a mobile phone, the two microphones furthest apart, Mic.1 and Mic.4, were considered. Recordings when only the talker of interest was active reveal that Mic.4 receives 50 % of the energy content of Mic.1. Using the same setup as in figure 5.3, but with 50 % speech leakage into the reference signal, the error signal of the RLS filter becomes as in figure 5.5. In this case the filter converges to an optimal solution, but as soon as the speech starts the noise goes up. It should be noted that there is a noise reduction, but not nearly as much as with a perfect reference source. This makes the RLS filter hard to use for noise reduction in a mobile phone. Imagine a noise that only appears whenever you try to speak! A constant noise floor would probably be preferred.
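The RLS noise canceller evaluated above can be sketched as follows. This is a minimal NumPy version, not the thesis implementation; the filter length, forgetting factor λ and initialization constant δ are assumptions:

```python
import numpy as np

def rls_cancel(primary, reference, L=8, lam=0.999, delta=0.01):
    """RLS noise canceller: primary = speech + filtered noise,
    reference = correlated noise only. The returned error signal
    converges towards the speech."""
    w = np.zeros(L)          # filter weights
    P = np.eye(L) / delta    # inverse correlation matrix estimate
    x = np.zeros(L)          # reference tap-delay line
    e = np.zeros(len(primary))
    for n in range(len(primary)):
        x = np.roll(x, 1)
        x[0] = reference[n]
        k = P @ x / (lam + x @ P @ x)   # gain vector
        e[n] = primary[n] - w @ x       # a priori error = speech estimate
        w = w + k * e[n]
        P = (P - np.outer(k, x @ P)) / lam
    return e
```

With a clean reference the error signal converges to the speech, as in figure 5.4; with speech leakage in the reference the filter would partly cancel the speech as well.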

Figure 5.3: Input for adaptive filtering


Figure 5.4: Error signal of RLS filter when input as in figure 5.3

Figure 5.5: Error signal of RLS filter, 50 % leakage into the reference microphone


5.2.1 Complexity of Adaptive Filters

In the pursuit of optimal filter weights four larger matrix multiplications are performed, see equations 2.15–2.17. Using L taps, the dimension of each multiplication becomes (L × L) · (L × 1). This results in L² operations per multiplication, dominating the cost function as

O(4 · L²) (5.3)

The minimum number of operations per second is then:

clkmin = Fs · O(4 · L²) (5.4)

The required number of filter weights depends on the transfer function from the reference microphone to the speech microphone. Since the complexity grows quadratically, a small increase in filter length gives a large upswing in complexity.
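A quick evaluation of equation 5.4 shows how fast the requirement grows with the filter length (the tap counts are arbitrary examples):

```python
def rls_min_clock(fs, L):
    """Lower bound on operations per second for the RLS update:
    four L x L matrix-vector products per sample (equation 5.4)."""
    return fs * 4 * L ** 2

print(rls_min_clock(8000, 32))  # -> 32768000, about 33 MHz
print(rls_min_clock(8000, 64))  # four times higher for a doubled length
```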

5.3 Beamformer

To avoid the need for an isolated reference microphone, beamforming can be used. This allows the spatial domain to be exploited, under the assumption that the location of the talker is known. Noise.1 was used as a calibration sequence recorded in a noise-free environment. Noise.2 was used as a diffuse noise source.

5.3.1 Fixed Beamformer

In the fixed case there is no change in the filter weights, so any change in the position of the phone relative to the speaker will worsen the performance. In figure 5.6 a typical industrial noise environment has been outlined together with the output of the beamformer. In this simulated noise environment Noise.2 has been added to speech. The noise is assumed to be diffuse while the speech arrives in an end-fire configuration². The delay from Mic.1 to Mic.2 is set to 4 samples. Using a sampling frequency of 44.1 kHz this corresponds to a distance of approximately 3 cm.

The energy received in Mic.1 is set to be larger than in Mic.3. In a fixed environment an ideal beamformer can be made without any adaptation. In this case both the noise and the speech source are known, resulting in an optimal beamformer; the resulting SNR is effectively infinite. In figure 5.7 the speech spectrum before and after the beamformer has been plotted. As an effect of the beamforming procedure there is an energy loss at lower frequencies, related to the distance between the microphones. Because of the long wavelengths there is little change in either phase or amplitude from one microphone to another, so lower frequencies are interpreted as a diffuse sound field, just like the noise. To be able to register differences in sounds with long wavelengths the distance between the inputs has to be increased.

²The speaker is in line with the microphone array, with Mic.1 closest to the speaker and Mic.3 furthest away
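As an illustration of the principle (a minimal delay-and-sum sketch, not the calibrated filter-and-sum design evaluated above), a two-microphone beamformer for the end-fire case could look like this:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Fixed delay-and-sum beamformer. channels is a list of equal-length
    microphone signals; delays[i] is the integer number of samples by which
    channel i lags the source, so each channel is advanced by its delay
    before averaging."""
    n = len(channels[0])
    out = np.zeros(n)
    for sig, d in zip(channels, delays):
        aligned = np.zeros(n)
        if d > 0:
            aligned[:n - d] = sig[d:]   # advance the late channel
        else:
            aligned[:] = sig
        out += aligned
    return out / len(channels)
```

Coherent speech adds in phase while uncorrelated noise averages down, which is the mechanism behind the SNR gain discussed above.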

Figure 5.6: Speech coupled with noise recorded at Mic.1 and resulting beamformer output


Figure 5.7: Power spectrum of a speech signal before and after beamforming

5.3.2 Adaptive Beamformer

The adaptive beamformer will compensate for any changes in the spatial domain. To test the adaptive filter a real environment must be simulated: without any movement of the talker the adaptive beamformer makes no filter weight adjustments, and the result is then identical to the fixed beamformer. Since no such recordings were made in this thesis, a real-time evaluation remains to be done.

5.3.3 Complexity of Beamforming

Fixed Beamformer

The number of operations required per sample in the proposed fixed beamformer is dominated by the filter operation. If implemented with fast convolution, the lower bound on operations per sample becomes

O(I · log2(L)) (5.5)

where I is the number of input channels. The minimum clock frequency required becomes

clkmin = Fs · I · log2(L) (5.6)

Complex sound environments and microphone positioning require higher filter orders. If using 1024 taps, as in figure 5.6, and a sampling frequency of 8 kHz, a clock frequency of at least 120 kHz is required.

Adaptive Beamformer

The complexity of the proposed adaptive beamformer is obtained similarly to that of the RLS adaptive filter. The difference is that there is a filter for each of the I channels:

O(4(L · I)2) (5.7)

To make use of the adaptive beamformer the polyphase structure has to be used, while still keeping the filter lengths to a minimum. If the speed issue can be solved, the next problem is memory: more subbands allow for smaller filters but increase the required memory space proportionally [12].

5.3.4 Analog Beamforming - LMV1089

The LMV1089 circuit was calibrated as instructed in the production datasheet. For evaluation, two different microphone setups were used:

(a) The microphone array is on the side of the mouth of the dummy head (like a harmonica)

(b) The microphone array is in line with the mouth of the dummy (like a cigarette)

Since beamforming is independent of the type of signals and the relation between them, noise and speech could be recorded and evaluated separately. Using white noise as source, the system response of the LMV1089 was identified with the Wiener solution [11]. In figure 5.8 the sensitivities to noise and speech for setup (a) have been plotted. In the second plot of figure 5.8 the noise sensitivity has been subtracted from the speech sensitivity. This can be considered the average D value. The speech has an even curve, just as desired and expected; the sensitivity to noise is more frequency dependent. In figure 5.9 the input and output of the LMV1089 have been plotted in normalized form. By comparison it is clear that a noise reduction has been made. Listening to the sound files, the noise reduction is most evident at lower frequencies, and the audible speech distortion is negligible. In figure 5.10 the sensitivity and average D value for setup (b) can be seen. In this case the noise behaves similarly as for setup (a), while some changes are seen in the speech sensitivity. The noise reduction is larger in the more critical areas for setup (b). No speech distortion is audible and, just as preferred, the noise reduction sounds even across the spectrum.


The main advantage of an analog beamformer is the low power consumption. The downside is the obvious dependency on the array placement.

Figure 5.8: Sensitivity to noise and speech and resulting average D in 1/3 octave bands for setup (a)

Figure 5.9: Input (microphone closest to the mouth) and output of the LMV1089 for setup (a).


Figure 5.10: Sensitivity to noise and speech and resulting average D in 1/3 octave bands for setup (b)

Figure 5.11: Input (microphone closest to the mouth) and output of the LMV1089 for setup (b).


Chapter 6

Improvements

All proposed noise reduction techniques have their limitations and areas of strength. Improvements are often made through knowledge of the intended environment or at the cost of complexity. In this section some weak spots of the presented algorithms are highlighted and further work is discussed.

6.1 Spectral Subtraction

Depending on the noise, the spectral subtraction method can be a powerful algorithm. Two main drawbacks are the unfamiliar noise floor and the VAD constraints. Even with a considerable noise reduction, the remaining noise floor might be perceived as more disturbing. Some means of controlling the noise floor have been introduced in the form of the aggressiveness factor α. A larger aggressiveness interval will reduce the noise floor but also increase the probability of speech distortion. The VAD constraint is the main reason for speech distortion: if the noise drowns the speech, a faulty noise estimate will be made. A smaller frame size would allow for a more accurate VAD at the cost of low-frequency resolution. There are more complex VAD algorithms that operate under certain assumptions about the noise, such as a Gaussian probability density function [15]. Since most industrial machinery fulfills these assumptions, more accuracy might be gained if such a VAD were incorporated into the spectral subtraction algorithm. During the progression of the work some main thoughts were:

Manual Calibration

If the user were able to do a manual calibration with higher priority than the automatic one, the accuracy of the VAD would become less important. This would allow for a higher aggressiveness factor without fear of affecting the speech in environments with high noise intensity.

Noise Masking

Whenever a polyphonic sound hits the human ear there will always be some degree of masking¹. This phenomenon could be used to hide unwanted noise and control the remaining noise floor perceived by the ear. White noise has for instance been shown to have a soothing effect [16]. If such noise were coupled with the low-intensity residual noise, masking would occur. The disturbing noise would then be replaced by a more ignorable noise, a sort of comfort noise.
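A sketch of the idea, assuming white comfort noise injected at a fixed level below the signal RMS (the level and the approach are assumptions for illustration, not something evaluated in the thesis):

```python
import numpy as np

def add_comfort_noise(signal, floor_db=-40.0, rng=None):
    """Add a steady white comfort-noise floor at floor_db relative to the
    signal RMS, masking low-level residual processing artifacts."""
    rng = rng or np.random.default_rng()
    rms = np.sqrt(np.mean(signal ** 2))
    level = rms * 10 ** (floor_db / 20)  # dB relative to RMS -> linear gain
    return signal + level * rng.standard_normal(len(signal))
```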

Adaptive Threshold Margin

The proposed VAD is dependent on a pre-defined threshold value: whenever the mean value of a subtracted frame is above a certain threshold, the noise flag goes high. By monitoring past frames the threshold could be made to adapt to the current situation, thereby improving the VAD accuracy.
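The adaptive threshold described above could be sketched as follows (the margin and smoothing factor are hypothetical parameters, not values from the thesis):

```python
def adaptive_vad(frame_energies, margin=2.0, alpha=0.95):
    """Energy-based VAD with an adaptive threshold. The noise-level
    estimate is an exponential average over frames classified as noise;
    a frame is flagged as speech when its energy exceeds the estimate
    by the given margin."""
    noise_level = frame_energies[0]
    flags = []
    for e in frame_energies:
        if e > margin * noise_level:
            flags.append(True)    # speech frame: leave the estimate alone
        else:
            flags.append(False)   # noise frame: update the estimate
            noise_level = alpha * noise_level + (1 - alpha) * e
    return flags
```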

6.2 Adaptive Filtering

The adaptive filter is a solid and well-established noise reduction technique. The main drawbacks when used in mobile phone applications are the high complexity and the lack of a good reference source.

Microphone Separation

The obvious drawback of adaptive filtering is the speech leakage into the reference microphone. If it were possible to separate the two microphones further, better results could be obtained. An external reference microphone attached to the telephone user could be a solution, though the transmission cost would have to be considered: a wireless link would cost both size and power for the transmitter and receiver, while a wired solution might be too inconvenient. Separating the microphones also increases the complexity of the transfer function from the noise source to the speech microphone, so the effectiveness will decrease unless more filter weights are applied.

1Perception of one sound is affected by the presence of another sound


Complexity Reduction

As proposed previously, the polyphase structure can be implemented to reduce the complexity of the RLS adaptive filter. If further reduction is wanted, the Least Mean Squares algorithm could be implemented instead of the Recursive Least Squares. The performance would become more sensitive to noise variations, but considering the intended environment this will probably not be a major concern.
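A normalized LMS version of the canceller costs O(L) operations per sample instead of O(L²). The sketch below uses an assumed step size; the thesis itself did not implement LMS:

```python
import numpy as np

def nlms_cancel(primary, reference, L=8, mu=0.2, eps=1e-8):
    """Normalized LMS noise canceller: O(L) per sample, converging
    more slowly than RLS but far cheaper."""
    w = np.zeros(L)
    x = np.zeros(L)
    e = np.zeros(len(primary))
    for n in range(len(primary)):
        x = np.roll(x, 1)
        x[0] = reference[n]
        e[n] = primary[n] - w @ x
        w += mu * e[n] * x / (x @ x + eps)   # normalized gradient step
    return e
```

The step size μ trades convergence speed against steady-state misadjustment, which is the sensitivity to noise variations mentioned above.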

6.3 Beamforming

The fixed beamformer is a straightforward and easily implemented method. Its main downside is the sensitivity to any changes in the spatial domain. The adaptive beamformer deals with these variations in an optimal way; how much there is to gain from the adaptation remains to be evaluated, and this will have to be done in a real environment. The downside of the adaptation is the increase in complexity. The polyphase structure reduces the number of computations needed, but implementing such a scheme in a low-level language is not an easy task. Besides the many matrix multiplications, and the fact that complex values are involved, there is also the relatively large memory requirement on the DSP to consider.

Array Position

The performance of the beamformer is in some sense dependent on the positions of the microphones. As explained, the spacing of the microphones sets the lower frequency limit. Frequencies whose wavelength is a multiple of the array spacing will also be interpreted as diffuse sound, and therefore suppressed. There is also the question of how much the outline of the phone affects the performance. This problem could be avoided by placing all microphones on the same side, preferably facing the speaker.
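The spacing argument can be illustrated with the inter-microphone phase difference for an end-fire source (the speed of sound is assumed to be 343 m/s; the frequencies and 3 cm spacing are example values):

```python
def phase_difference_deg(freq_hz, spacing_m, c=343.0):
    """End-fire phase difference between two microphones. When this is
    only a few degrees the array cannot separate the source from a
    diffuse field, which sets the lower frequency limit."""
    return 360.0 * freq_hz * spacing_m / c

print(round(phase_difference_deg(100, 0.03), 1))   # -> 3.1 degrees: too small
print(round(phase_difference_deg(3000, 0.03), 1))  # -> 94.5 degrees: usable
```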

Array Size

Using two inputs is the minimum for a beamformer. More inputs give more information to work with, so it is natural to assume better performance. How much improvement can be made with more than two microphones has not been investigated in this thesis.

References
