Enhancement of Speech Intelligibility using Beamforming Techniques

Master Thesis

Electrical Engineering Thesis no: MEE 2010-2012

Leela Krishna Gudupudi

School of Engineering

Blekinge Institute of Technology 371 79 Karlskrona

Sweden

This thesis is submitted to the School of Engineering at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering.

Contact Information:

Author:

Leela Krishna Gudupudi

E-mail: legu10@student.bth.se, lkgudupudi30@gmail.com

University advisors:

Dr. Nedelko Grbic

Department of Electrical Engineering School of Engineering

Blekinge Institute of Technology, Karlskrona, Sweden

Email: nedelko.grbic@bth.se

Dr. Benny Sällberg

Department of Electrical Engineering School of Engineering

Blekinge Institute of Technology, Karlskrona, Sweden

Email: bsa@bth.se

ABSTRACT

Speech enhancement is the process of improving the intelligibility or quality of speech using audio processing algorithms. It is central to many applications such as teleconferencing systems, speech recognition, VoIP, mobile phones and hearing aids [1]. The appeal of single-microphone speech enhancement algorithms has faded, and microphone arrays with beamforming techniques, for which significantly higher signal-to-noise ratios (SNR) are observed, are expected to dominate the market soon. A considerable amount of research has already been done on the feasibility and advantages of microphone arrays, and several beamforming techniques are widely applied by researchers around the globe.

In this thesis, eight selected beamforming techniques are applied to enhance the intelligibility of speech degraded by interference. To demonstrate the algorithms, a graphical user interface (GUI) has been developed in MATLAB, in which the user can select the number of sensors in the linear array and the spacing between them, the source and interference directions, the type of algorithm, and so on. The user can also compare all the beamformers under a default environment. Performance is measured in terms of signal-to-noise ratio (SNR), speech and noise distortion, and ITU-T PESQ values.

This thesis describes the opportunities and constraints for application of eight beamforming techniques using linear microphone arrays.

Keywords: Beamforming, Speech Enhancement, Microphone Arrays, PESQ, Delay-and-Sum

ACKNOWLEDGEMENTS

I would never have been able to finish my thesis without the guidance of my university faculty, help from friends, and support from my family and fiancée.

I would like to express my deepest gratitude to my supervisor, Dr. Nedelko Grbic, for his excellent guidance, care and patience, and for providing me with an excellent atmosphere for doing this work. I am also grateful to Dr. Benny Sällberg for his support and valuable advice throughout the thesis work.

I would like to thank Ramesh Telagareddy and Bulli KoteswaraRao, who, as good friends, were always willing to discuss the concepts and share their thoughts.

I would also like to thank my parents and family, who always supported and encouraged me with their best wishes.

Finally, I would like to thank my fiancée, Vineela Musti. She was always there cheering me up and stood by me through the good times and the bad. My thesis would not have been possible without her help.

Leela Krishna Gudupudi

To My Parents

Table of Contents

ABSTRACT
ACKNOWLEDGEMENTS
List of Figures
List of Tables
List of Acronyms
INTRODUCTION
1 MICROPHONES
1.1 Microphone Polar Patterns
1.2 Microphone Arrays
1.3 Spatial Aliasing
2 BEAMFORMING TECHNIQUES
2.1 Fixed Beamforming
2.2 Adaptive Beamforming
2.3 Generalized Side-lobe Canceller (GSC)
2.4 The Wiener Beamformer
2.5 The Elko Beamformer
3 DESIGN AND SIMULATIONS
3.1 Simulation Procedure
4 PERFORMANCE METRICS
4.1 Signal-to-Noise Ratio
4.2 Normalized Speech Distortion
4.3 Normalized Noise Distortion
4.4 Perceptual Evaluation of Speech Quality (PESQ)
5 SIMULATION RESULTS
5.1 Delay-and-Sum Beamformer
5.2 Wiener Beamformer
5.3 DSB-DSB GSC
5.4 DSB-WBF GSC
5.5 WBF-DSB GSC
5.6 WBF-WBF GSC
5.8 The Elko-Wiener Beamformer
5.9 Comparison Plots
6 SUMMARY AND CONCLUSION
7 FUTURE WORK
REFERENCES

List of Figures

Figure 1.1: Omni-Directional Microphone polar pattern
Figure 1.2: Bi-Directional Microphone polar pattern
Figure 1.3: Cardioid microphone polar pattern
Figure 1.4: Example of spatial aliasing
Figure 2.1: Delay and Sum Beamforming
Figure 2.2: An adaptive beamformer
Figure 2.3: Griffiths-Jim Generalised Side-lobe Canceller
Figure 2.4: (a) Linear equispaced differential microphone array (b) Diagram of two element differential array composed of two omni-directional mic’s and a time delay.
Figure 2.5: Directional responses of the two-element array shown in Fig. 2.4 (a) T=0, (b) T=(d/c)/2, (c) T=d/c
Figure 2.6: Implementation of first-order differential microphone using back-to-back cardioids
Figure 2.7: Directional responses of back-to-back cardioid arrangement (a) β=0; (b) β=0.383; (c) β=1
Figure 3.1: The delay-and-sum beamforming
Figure 3.2: Desired and Interference Matrices for Wiener Beamforming
Figure 3.3: Pictorial form of the method followed
Figure 3.4: Wiener Beamformer Structure
Figure 3.5: Novel GSC structure
Figure 3.6: The structure of DSB-DSB GSC
Figure 3.7: Structure of the DSB-WBF GSC
Figure 3.8: Structure of the WBF-DSB GSC
Figure 3.9: Structure of the WBF-WBF GSC beamformer.
Figure 3.10: Structure of the Elko-Wiener Beamformer
Figure 5.1: Simulation set-up in an anechoic chamber
Figure 5.2: The Delay and Sum Beamformer
Figure 5.3: Signal plots before and after DSB
Figure 5.4: DSB SNR plots
Figure 5.5: DSB Speech Distortion plots
Figure 5.6: DSB Noise Distortion plot
Figure 5.7: DSB PESQ MOS plot
Figure 5.8: The Wiener Beamformer
Figure 5.9: Signal plots before and after WBF
Figure 5.10: WBF SNR plot
Figure 5.11: WBF Speech Distortion plot
Figure 5.12: WBF Noise Distortion plot
Figure 5.13: WBF PESQ MOS plot
Figure 5.14: DSB-DSB GSC
Figure 5.15: Signal plots before and after DSB-DSB GSC
Figure 5.16: DSB-DSB GSC SNR plots
Figure 5.17: DSB-DSB GSC Speech Distortion plots
Figure 5.18: DSB-DSB GSC Noise Distortion plots
Figure 5.19: DSB-DSB GSC PESQ MOS plots
Figure 5.20: DSB-WBF GSC
Figure 5.21: Signal plots before and after DSB-WBF GSC
Figure 5.22: DSB-WBF GSC SNR plot
Figure 5.23: DSB-WBF GSC Speech Distortion plot
Figure 5.24: DSB-WBF GSC Noise Distortion plot
Figure 5.25: DSB-WBF GSC PESQ MOS plot
Figure 5.26: WBF-DSB GSC
Figure 5.27: Signal plots before and after WBF-DSB GSC
Figure 5.28: WBF-DSB GSC SNR plot
Figure 5.29: WBF-DSB GSC Speech Distortion plot
Figure 5.30: WBF-DSB GSC Noise Distortion plot
Figure 5.31: WBF-DSB GSC PESQ MOS plot
Figure 5.32: WBF-WBF GSC
Figure 5.33: Signal plots before and after WBF-WBF GSC
Figure 5.34: WBF-WBF GSC SNR plot
Figure 5.35: WBF-WBF GSC Speech Distortion plot
Figure 5.36: WBF-WBF GSC Noise Distortion plot
Figure 5.37: WBF-WBF GSC PESQ MOS plot
Figure 5.38: Signal plots before and after Elko Beamformer
Figure 5.39: Elko Beamformer SNR plot
Figure 5.40: Elko Beamformer Speech Distortion plot
Figure 5.41: Elko Beamformer Noise Distortion plot
Figure 5.42: Elko Beamformer PESQ MOS plot
Figure 5.43: Elko-Wiener Beamformer
Figure 5.44: Signal plots before and after Elko-Wiener Beamformer
Figure 5.45: Elko-Wiener Beamformer SNR plot
Figure 5.46: Elko-Wiener Beamformer Speech Distortion plot
Figure 5.47: Elko-Wiener Beamformer Noise Distortion plot
Figure 5.48: Elko-Wiener Beamformer PESQ MOS plot
Figure 5.49: SNR comparison plot
Figure 5.50: SNR comparison plot for Elko based beamformers

List of Tables

Table 4.1: Quality opinion scale used in the development of PESQ
Table 5.1: Delay and Sum Beamformer Observation
Table 5.2: DSB PESQ MOS Values
Table 5.3: Wiener Beamformer Observation
Table 5.4: WBF PESQ MOS Values
Table 5.5: DSB-DSB GSC Observation
Table 5.6: DSB-DSB GSC PESQ MOS Values
Table 5.7: DSB-WBF GSC Observation
Table 5.8: DSB-WBF GSC PESQ MOS Values
Table 5.9: WBF-DSB GSC Observation
Table 5.10: WBF-DSB GSC PESQ MOS Values
Table 5.11: WBF-WBF GSC Observation
Table 5.12: WBF-WBF GSC PESQ MOS Values
Table 5.13: Elko Beamformer Observation
Table 5.14: Elko Beamformer PESQ MOS Values
Table 5.15: Elko-Wiener Beamformer Observation
Table 5.16: Elko Beamformer PESQ MOS Values

List of Acronyms

1. DSB: Delay-and-Sum Beamformer
2. WBF: Wiener Beamformer
3. GSC: Generalized Side-lobe Canceller
4. SNR: Signal-to-Noise Ratio
5. SNRI: SNR Improvement
6. PESQ: Perceptual Evaluation of Speech Quality
7. SD: Speech Distortion
8. ND: Noise Distortion
9. GUI: Graphical User Interface
10. dB: Decibels
11. NLMS: Normalized Least Mean Square
12. PSD: Power Spectral Density
13. NSD: Normalized SD
14. NND: Normalized ND
15. DOA: Direction of Arrival
16. MOS: Mean Opinion Score

INTRODUCTION

Beamforming is a well-established and widely used technique in which an array of sensors forms a beam in the direction of the desired speech; signals arriving within this beam are processed and the rest are suppressed. In this study, a female voice recording is used as the desired source and a male voice recording as the interference. Eight beamforming algorithms are applied in this scenario, with the main aim of recovering the desired female voice in the look direction from the interference. This thesis summarizes the performance of the following eight beamforming techniques:

1. Delay and Sum Beamformer (DSB)
2. The DSB-DSB Generalised Side-lobe Canceller (GSC)
3. Wiener Beamformer (WBF)
4. The WBF-WBF GSC
5. The DSB-WBF GSC
6. The WBF-DSB GSC
7. Elko Beamformer
8. The Elko-Wiener Beamformer

Linear microphone arrays are used throughout the study, with a user-defined number of sensors and a constant, user-defined spacing between them. All the beamforming algorithms are implemented in the time domain. Fractional delays are taken into account when calculating the sensor data [14]. The performance measures are the SNR, spectral distortions and the ITU-T recommended PESQ (Perceptual Evaluation of Speech Quality). The entire project was implemented in MATLAB. No reflections are considered anywhere in the project, and the speed of sound is taken as 343 m/s.

1 MICROPHONES

A microphone is a transducer that converts sound into an electrical signal. It is an electromechanical device in which a vibration, usually caused by an air pressure wave, produces an electrical signal proportional to that vibration. Microphones are classified by their transducer principle, their physical characteristics and their directional characteristics.

1.1 Microphone Polar Patterns

Not all microphones pick up sound equally from all directions. Microphones are therefore also categorized according to how well they pick up sound from certain directions. The three most common microphone patterns are described here:

• Omni-directional:

An omnidirectional microphone is the simplest design and accepts all sounds regardless of their point of origin. Such microphones are easy to use and have a good frequency response. The smaller the diameter of the microphone, the better its omnidirectional characteristics at high frequencies, because the flattening of the response is directly proportional to the diameter of the microphone.

Figure 1.1: Omni-Directional Microphone polar pattern¹

¹ Image Courtesy: www.prosoundweb.com

• Bi-directional:

A bi-directional microphone is also called a “figure-of-eight” microphone.

A mic with a figure-of-eight polar pattern picks up sound arriving from the front and from the rear, but not from the sides, i.e., at 90 degrees. Most ribbon microphones are bi-directional. The frequency response is just as good as that of an omnidirectional mic, at least for sounds that are not too close to the mic.

Figure 1.2: Bi-Directional Microphone polar pattern¹

• Cardioid:

A cardioid microphone is most sensitive in the look direction and least sensitive at the back, i.e., it picks up the sounds it is pointed at. It rejects unwanted ambient sound and is much more resistant to feedback than omnidirectional mics. This makes cardioid mics particularly suitable for loud stages.

Figure 1.3: Cardioid microphone polar pattern¹

1.2 Microphone Arrays

To achieve better directionality, two or more microphones are positioned close together to form a microphone array. If they are placed along a line, the arrangement is called a linear microphone array; if, in addition, the distance between adjacent mics is equal, it is called a linear equispaced microphone array. Because an incoming sound wave arrives at each mic at a slightly different time, a microphone array achieves better directionality than a single microphone. Sounds are distinguished by the spatial location of their sources by filtering and combining the individual microphone signals.

The important concepts used in the design of microphone arrays are beamforming, array directivity and beam width.

• Beamforming:

If a “beam” is formed using the signals of all the microphones, the microphone array acts as a highly directional microphone. This beam can be steered electronically to point at the desired source. Beamforming is discussed in more detail in the next chapter.

• Array Directivity:

The higher the directivity of the microphone array, the greater the reduction in captured ambient noise and reverberation. Given the aperture response as a function of frequency and direction of arrival, the directivity pattern for a linear equispaced array of N microphones with spacing d meters is given by [3],

$$ D(f, \theta) = \sum_{n=-\frac{N-1}{2}}^{\frac{N-1}{2}} w_n(f)\, e^{j \frac{2\pi f}{c} n d \cos\theta} \tag{1} $$

where w_n(f) are the complex sensor weights, θ is the direction of arrival and c is the speed of sound.

From the above equation we see that the directivity pattern depends upon:

o The number of microphones in the array, N
o The spacing between the mics, d
o The frequency, f
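As an illustration of Eq. (1), the following minimal Python sketch evaluates the directivity pattern of a linear equispaced array. It is not the MATLAB code of this thesis; the function name `directivity`, the uniform weights w_n = 1/N and the example values (N = 4, d = 0.05 m) are assumptions chosen only for demonstration.

```python
import cmath
import math

def directivity(f, theta_deg, n_mics, d, c=343.0, weights=None):
    """Evaluate D(f, theta) of Eq. (1) for a linear equispaced array.

    Uniform weights 1/N are assumed when none are given (illustrative choice).
    """
    if weights is None:
        weights = [1.0 / n_mics] * n_mics
    theta = math.radians(theta_deg)
    total = 0j
    for idx, w in enumerate(weights):
        # Symmetric index n running from -(N-1)/2 to (N-1)/2
        n = idx - (n_mics - 1) / 2.0
        total += w * cmath.exp(1j * 2 * math.pi * f / c * n * d * math.cos(theta))
    return total

# Broadside (90 degrees): all phase terms vanish, so |D| = 1 for any frequency
broadside = abs(directivity(1000.0, 90.0, 4, 0.05))
# End-fire (0 degrees) at 1715 Hz: the four phasors cancel for this geometry
endfire = abs(directivity(1715.0, 0.0, 4, 0.05))
```

With these assumed values the pattern has unit gain at broadside and a null towards end-fire at 1715 Hz, which shows how N, d and f jointly shape the response.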

• Beam Width:

The beam width and the side-lobe level are the two important characteristics of the directivity pattern. These characteristics depend on the inter-element spacing (d), the frequency (f) and the number of elements in the array (N).

As the frequency increases, the beam width and the side-lobe level decrease. As the effective array length (L = N*d) increases, the beam width and the side-lobe level also decrease. Thus, to obtain a constant beam width we must ensure that the product f*d remains relatively constant.

1.3 Spatial Aliasing

Spatial aliasing means insufficient sampling of the data along the space axis. It results in the appearance of grating lobes. Microphone arrays implement spatial sampling, and, as with temporal aliasing, an analogous requirement exists to avoid grating lobes in the directivity pattern,

$$ f_s = \frac{1}{d} \ge 2 f_{\max} \tag{2} $$

where f_s is the spatial sampling frequency in samples per meter and f_max is the highest spatial frequency component in the signal. This leads to the relation,

$$ f_{\max} = \frac{1}{\lambda_{\min}} \tag{3} $$

where λ_min is the minimum wavelength in the signal of interest, and consequently the requirement is

$$ d \le \frac{\lambda_{\min}}{2} \tag{4} $$

Equation 4 is known as the spatial sampling theorem and must be satisfied in order to prevent spatial aliasing in the directivity pattern of a sensor array. Figure 1.4 illustrates the effect of spatial aliasing.
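As a small worked example of the spatial sampling theorem, the sketch below computes the largest admissible spacing for a given highest signal frequency, using λ_min = c / f_max. The function name `max_spacing` and the 4 kHz example are assumptions for illustration.

```python
def max_spacing(f_max_hz, c=343.0):
    """Largest inter-element spacing (in meters) satisfying d <= lambda_min / 2."""
    lambda_min = c / f_max_hz  # shortest wavelength in the signal
    return lambda_min / 2.0

# For speech band-limited to 4 kHz (assumed value) and c = 343 m/s
d_limit = max_spacing(4000.0)  # about 0.0429 m
```

Under these assumptions, microphones must be no more than roughly 4.3 cm apart to avoid grating lobes up to 4 kHz.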

Figure 1.4: Example of spatial aliasing

2 BEAMFORMING TECHNIQUES

Spatially propagating signals are often corrupted by interference and noise. If the desired signal and the interference occupy the same temporal frequency band, temporal filtering cannot separate the signal from the interferers. However, the desired and interfering signals usually originate from different spatial locations, and this spatial separation can be exploited using a beamformer.

Beamforming is a spatial signal processing technique used to control the directionality of reception of a microphone array. It exploits interference between the microphone signals: the signals from all the microphones of the array are combined in such a way that signals arriving from the desired angles experience constructive interference, while those from all other angles experience destructive interference. Beamforming techniques are algorithms for determining the complex sensor weights w_n(f) (see eqn. 1) in order to implement a desired shaping and steering of the array directivity pattern [3].

Beamforming techniques are broadly classified into two types: data-independent and data-dependent.

Data-independent or fixed beamformers are algorithms whose parameters are fixed during operation. The delay-and-sum beamformer is an example of a fixed beamformer.

Data-dependent or adaptive beamformers continuously update their parameters based on the received input signals. The Griffiths-Jim beamformer is an example of an adaptive beamformer.

2.1 Fixed Beamforming

• Delay and Sum Beamformer (DSB):

The DSB is the simplest of all beamforming techniques. It is based on the idea that, with a linear equispaced microphone array, the output of each microphone is the same except that each is delayed by a different amount. If the output of each mic is delayed appropriately and all the outputs are then added together, the desired signal propagating across the array is reinforced, while noise and interference tend to be suppressed. A block diagram is shown in Fig. 2.1.

Figure 2.1: Delay and Sum Beamforming

The delays are calculated from the inter-element distance of the array.

The crucial factors in the DSB are the geometric arrangement of the mics and the weights associated with each mic. The signal-to-noise ratio (SNR) of the output signal is greater than that of any individual mic signal.

The major drawback of the DSB is that a large number of mics is required to improve the SNR. If the noise is uncorrelated between the mics and with the desired signal, every doubling of the number of mics yields an additional 3 dB of SNR. Another disadvantage is that no nulls are placed directly in the direction of the interfering signal.
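The 3 dB figure can be checked with a short calculation. The sketch below assumes perfectly time-aligned desired signals and noise of equal power that is uncorrelated between mics, in which case the SNR gain of an N-element DSB is 10·log10(N). The function name `dsb_snr_gain_db` is hypothetical.

```python
import math

def dsb_snr_gain_db(n_mics):
    """Ideal DSB SNR gain in dB, assuming noise uncorrelated between mics."""
    # Signal amplitudes add coherently (power grows as N^2), noise powers add (N)
    return 10.0 * math.log10(n_mics)

gains = {n: dsb_snr_gain_db(n) for n in (1, 2, 4, 8, 16)}
# Each doubling of the number of mics adds about 3.01 dB
```

Under these idealized assumptions, 16 microphones give only about 12 dB of improvement, which illustrates why the DSB needs many sensors.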

2.2 Adaptive Beamforming

Data-dependent beamforming techniques adaptively filter the incoming signals in order to pass the signal from the desired direction while rejecting interference or noise coming from other directions. An adaptive beamformer is a data-dependent beamformer that is able to separate signals that share the same frequency band but are separated in the spatial domain [2]. It automatically optimizes the microphone array pattern by adjusting the element weights until a prescribed objective function is satisfied. The choice of the adaptive algorithm used to derive the weights plays a key role in determining the convergence speed and the system complexity. Adaptive algorithms commonly used for beamforming include the Least Mean Squares (LMS) algorithm, the Normalised LMS (NLMS) algorithm and the Recursive Least Squares (RLS) algorithm. Each algorithm has its own advantages and disadvantages. A typical adaptive beamformer structure is shown in Fig. 2.2.

Figure 2.2: An adaptive beamformer

The weight vector w is chosen according to the input signal vector x(t). The main aim of the adaptive beamformer is to optimize the system response with respect to a prescribed criterion, so that the output signal y(t) is free from noise or interference. The optimization criteria commonly used for adaptive beamformers are Minimum Mean Square Error, Maximum Signal-to-Noise/Interference Ratio and Minimum Variance [2].

There are two basic adaptive approaches: block adaptation and continuous adaptation. In block adaptation, statistics are estimated from a temporal block of array data and used in an optimum weight equation. In continuous adaptation, the weights are updated as the input data are sampled, such that the resulting weight vector sequence converges to the optimum solution [13].

2.3 Generalized Side-lobe Canceller (GSC)

The GSC provides a simple way of implementing Linearly Constrained Minimum Variance (LCMV) beamformers. The basic idea behind LCMV beamforming is to constrain the response of the beamformer so that signals from the look direction are passed with a specified gain and phase [13, 15]. The weights are chosen to minimize the output variance subject to this response constraint. This has the effect of preserving the desired signal while minimizing the strength of the noise or interfering signal.

Figure 2.3: Griffiths-Jim Generalised Side-lobe Canceller

A block structure of the generalised side-lobe canceller is shown in Fig. 2.3 [2].

The GSC separates the adaptive beamformer into two main processing blocks [3]. The first implements a standard fixed beamformer (say, a DSB) which extracts the desired signal. The second block is a set of adaptive filters that adaptively minimize the power in the output. A blocking matrix B is used to eliminate the desired signal from this second block, ensuring that it is the noise power that is minimised.

• Blocking Matrix:

The purpose of the blocking matrix is to prevent the desired signal from entering the second stage of the GSC structure while allowing the interference or noise signals to pass.

Blocking occurs if the elements of each row of the blocking matrix sum to zero [3], i.e., sum(B(i,:)) = 0. Since B multiplies the vector of microphone signals, its number of columns equals the number of elements in the array, N; the standard Griffiths-Jim blocking matrix has N-1 rows and is given by [2]

$$ B = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & -1 & \cdots & 0 & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & \cdots & 0 & 1 & -1 \end{bmatrix} \tag{11} $$

The output of the blocking matrix is adaptively filtered and summed to give the final output of the lower block. This output contains only noise and interference signals and is subtracted from the output of the upper block to yield the final result.
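The blocking property can be illustrated with a few lines of Python. This is a minimal sketch, not the thesis implementation; the helper names `griffiths_jim_blocking_matrix` and `apply_matrix` are hypothetical, and the desired signal is assumed to be perfectly time-aligned (identical) across the mics.

```python
def griffiths_jim_blocking_matrix(n_mics):
    """Build the (N-1) x N matrix whose rows are [.., 1, -1, ..]."""
    rows = []
    for i in range(n_mics - 1):
        row = [0] * n_mics
        row[i] = 1
        row[i + 1] = -1
        rows.append(row)
    return rows

def apply_matrix(b, x):
    """Multiply matrix b by the vector x of simultaneous mic samples."""
    return [sum(bij * xj for bij, xj in zip(row, x)) for row in b]

B = griffiths_jim_blocking_matrix(4)
# A time-aligned desired sample is identical at all mics and is blocked
blocked = apply_matrix(B, [0.7, 0.7, 0.7, 0.7])
# A sample that differs between mics (e.g. interference) passes through
passed = apply_matrix(B, [1.0, 0.0, -1.0, 0.0])
```

Each row takes the difference of two adjacent channels, so any component that is equal in all channels cancels exactly, while components that differ between channels remain.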

In the original GSC method [2], the optimum array parameters are adaptively estimated from the available data set. The LMS algorithm was used for the adaptive filtering (LMS GSC) because of its low computational complexity. The GSC is the most widely used adaptive beamformer because of its structural flexibility. In practice, however, the GSC structure distorts the desired signal because of signal leakage through the blocking matrix. In most cases the blocking matrix fails to block the target signal completely, and this leakage results in distortion of the desired signal at the final output.

2.4 The Wiener Beamformer

The Wiener beamformer is also referred to as the minimum mean square error beamformer. Its weights are defined as those that minimize the mean square difference between the beamformer output (when all sources are present) and a single microphone output (when only the desired signal is present) [5]:

$$ \mathbf{w}_{opt} = \arg\min_{\mathbf{w}} E\left\{ \left| y[n] - s_r[n] \right|^2 \right\}, \quad r \in \{1, 2, \ldots, N\} \tag{11} $$

The optimal weights that yield the optimum output signal are given by equation (11). The output y[n] of the beamformer is given by

$$ y[n] = \sum_{i=1}^{N} \sum_{j=0}^{L-1} w_i[j]\, x_i[n-j] \tag{12} $$

where L-1 is the order of the filters, w_i[j], j = 0, 1, ..., L-1, are the filter taps for the i-th mic, N is the number of mics and x_i[n] is the i-th mic observation. s_r[n] in eqn. (11) is the single reference mic observation when only the desired signal is chosen as input. The optimal weights which minimize the mean square error between the output and the reference signal are given by [6]

$$ \mathbf{w}_{opt} = \left[ \mathbf{R}_{ss} + \mathbf{R}_{nn} \right]^{-1} \mathbf{r}_s \tag{13} $$

where R_ss and R_nn are the autocorrelation matrices of the signal of interest and the noise, respectively, and are given by [5]

$$ \mathbf{R}_{ss} = \begin{bmatrix} R_{s_1 s_1} & R_{s_1 s_2} & \cdots & R_{s_1 s_N} \\ R_{s_2 s_1} & R_{s_2 s_2} & \cdots & R_{s_2 s_N} \\ \vdots & \vdots & \ddots & \vdots \\ R_{s_N s_1} & R_{s_N s_2} & \cdots & R_{s_N s_N} \end{bmatrix} \tag{14} $$

$$ \mathbf{R}_{nn} = \begin{bmatrix} R_{n_1 n_1} & R_{n_1 n_2} & \cdots & R_{n_1 n_N} \\ R_{n_2 n_1} & R_{n_2 n_2} & \cdots & R_{n_2 n_N} \\ \vdots & \vdots & \ddots & \vdots \\ R_{n_N n_1} & R_{n_N n_2} & \cdots & R_{n_N n_N} \end{bmatrix} \tag{15} $$

r_s is the cross-correlation vector and is given by

$$ \mathbf{r}_s = \begin{bmatrix} \mathbf{r}_1 & \mathbf{r}_2 & \cdots & \mathbf{r}_N \end{bmatrix} \tag{16} $$

where each element is given by

$$ r_i(k) = E\left\{ s_i[n]\, s_r^{*}[n+k] \right\}, \quad i = 1, 2, \ldots, N, \quad r \in \{1, 2, \ldots, N\}, \quad k = 0, 1, \ldots, L-1 \tag{17} $$

The cross-correlation vector r_s is the centre column of the autocorrelation matrix of the desired signal, R_ss. The optimal weights, w, are arranged as

$$ \mathbf{w} = \begin{bmatrix} \mathbf{w}_1^{T} & \mathbf{w}_2^{T} & \cdots & \mathbf{w}_N^{T} \end{bmatrix} \tag{18} $$

The performance of this Wiener beamforming algorithm and the results are presented in the following sections.
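To make Eq. (13) concrete, the following Python sketch solves it for a deliberately tiny case: N = 2 mics and a single filter tap per mic (L = 1). The correlation values are made-up illustrative numbers, not results from this thesis, and the small `solve` helper (Gaussian elimination with partial pivoting) is only there to keep the example free of external libraries.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

# Assumed statistics: unit-power desired signal, identical at both mics,
# plus noise of power 0.5 that is uncorrelated between the mics.
r_ss = [[1.0, 1.0], [1.0, 1.0]]
r_nn = [[0.5, 0.0], [0.0, 0.5]]
# With L = 1 and reference mic r = 1, r_s is the first column of R_ss
r_s = [row[0] for row in r_ss]

r_sum = [[r_ss[i][j] + r_nn[i][j] for j in range(2)] for i in range(2)]
w_opt = solve(r_sum, r_s)  # Eq. (13): [R_ss + R_nn]^(-1) r_s
```

For these assumed statistics both weights come out equal (0.4 each), i.e. the Wiener solution averages the two mics and scales the result down slightly to trade a little signal attenuation for lower noise.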

2.5 The Elko Beamformer

Gary W. Elko designed an algorithm for directional microphone arrays [7]. In hands-free communications, background noise has a detrimental effect on the microphone output. By placing a null in the rear half-plane of the microphone array, a significant improvement in the signal-to-noise ratio can be obtained. This is the concept behind adaptive first-order differential microphones.

This can be achieved using two closely spaced omni-directional microphones. Any first-order array can be realized by a weighted subtraction of these two outputs [7]. The null angle can be steered by choosing the combination weight appropriately. The Elko algorithm is very easy to implement and its system complexity is low. The elements of the differential microphone array are combined with alternating signs so that the directionality can be realized. A differential microphone array is super-directional, as it has high directivity compared to an omni-directional microphone array. Therefore, these types of arrays are well suited to personal communication devices and teleconferencing.

2.5.1 Design

Figure 2.4: (a) Linear equispaced differential microphone array (b) Diagram of two element differential array composed of two omni-directional mic’s and a time delay.

When a plane sound wave s(t) with spectrum S(ω) is incident on a two-element linear microphone array with spacing d, making an angle θ with its axis as shown in Fig. 2.4 (b), the wave reaches the mics with different delays determined by the inter-element distance d. The time difference τ can be written as

$$ \tau = a/c = (d \cos\theta)/c \tag{19} $$

where a = d cos θ is the path length difference and c is the velocity of sound (343 m/s). One microphone signal is delayed by T and subtracted from the other microphone signal to give the output y(t) [8],

$$ y(t) = s(t) - s(t - \tau - T) = s(t) - s\left(t - (d\cos\theta)/c - T\right) \tag{20} $$

Transforming Eqn. (20) into the frequency domain,

$$ Y(\omega, \theta) = S(\omega)\left(1 - e^{-j\omega T}\, e^{-j\omega \frac{d}{c}\cos\theta}\right) \tag{21} $$

The directional response of the array is shown in Fig. 2.5. The null location can be steered by changing the time delay T. From Eqn. (21), the null lies where T + (d/c) cos θ = 0, so changing T from 0 to d/c moves the null from 90° to 180°.

Figure 2.5: Directional responses of the two-element array shown in Fig. 2.4 (a) T=0, (b) T=(d/c)/2, (c) T=d/c

A cardioid system, as shown in Fig. 2.5(c), is easy to develop with very low computational complexity using the set-up shown in Fig. 2.4. The conditions for cardioid directivity are: 1) a unit sample delay, T = 1 sample, and 2) a sampling period equal to d/c. By combining two back-to-back cardioid mics, one can implement a first-order differential microphone [7].
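The dependence of the null on the delay T in Eqn. (21) can be sketched as follows. This is an illustrative Python fragment, not the thesis code; the function names and the 2 cm spacing are assumptions. The null angle follows from setting T + (d/c)·cos θ = 0 in Eqn. (21).

```python
import cmath
import math

def diff_array_response(omega, theta_deg, delay_t, d, c=343.0):
    """Y/S of Eqn. (21) for angular frequency omega (rad/s)."""
    tau = d * math.cos(math.radians(theta_deg)) / c
    return 1.0 - cmath.exp(-1j * omega * (delay_t + tau))

def null_angle_deg(delay_t, d, c=343.0):
    """Angle at which the response of Eqn. (21) vanishes for 0 <= T <= d/c."""
    x = -c * delay_t / d
    x = max(-1.0, min(1.0, x))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(x))

d = 0.02                      # assumed spacing of 2 cm
omega = 2 * math.pi * 1000.0  # 1 kHz test tone
null_dipole = null_angle_deg(0.0, d)           # T = 0
null_mid = null_angle_deg(0.5 * d / 343.0, d)  # T = (d/c)/2
null_cardioid = null_angle_deg(d / 343.0, d)   # T = d/c
```

For T = 0 the null is at 90° (a dipole), for T = (d/c)/2 it is at 120°, and for T = d/c it is at 180° (a cardioid with its null at the rear).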

Figure 2.6: Implementation of first-order differential microphone using back-to-back cardioids

The back-to-back cardioid arrangement is as shown in Fig. 2.6. The directivity patterns using the arrangement in Fig. 2.6 are shown in Fig. 2.7.

Figure 2.7: Directional responses of back-to-back cardioid arrangement (a) β=0; (b) β=0.383; (c) β=1

By examining Fig. 2.6 and Eqn. (20), we can write the expressions for the forward cardioid C_F(t) and the backward cardioid C_B(t), with T = d/c,

$$ C_F(t) = s(t) - s(t - \tau - T) = s(t) - s\left(t - (d/c) - (d\cos\theta)/c\right) \tag{22} $$

$$ C_B(t) = s(t - \tau) - s(t - T) = s\left(t - (d\cos\theta)/c\right) - s\left(t - (d/c)\right) \tag{23} $$

$$ y(t) = C_F(t) - \beta\, C_B(t) \tag{24} $$

Transforming Eqn. (24) into the frequency domain,

$$ Y(\omega, \theta) = C_F(\omega, \theta) - \beta\, C_B(\omega, \theta) \tag{25} $$

$$ Y(\omega, \theta) = S(\omega)\left[ 1 - e^{-j\omega \frac{d}{c}(1 + \cos\theta)} - \beta\left( e^{-j\omega \frac{d}{c}\cos\theta} - e^{-j\omega \frac{d}{c}} \right) \right] \tag{26} $$

Here the null location can be steered from 180° to 90° by changing the value of β from 0 to 1. A unit time delay (T = 1 sample) is used. The directional responses of the system in Fig. 2.6 are plotted in Fig. 2.7 for different values of β.
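The relation between β and the null angle can be made explicit under a small-spacing approximation, which is an assumption added here and not stated in the text above. For ωd/c much smaller than 1, Eqn. (26) gives C_F proportional to (1 + cos θ) and C_B proportional to (1 − cos θ), so y vanishes where cos θ = (β − 1)/(β + 1). A minimal Python sketch (the function name is hypothetical):

```python
import math

def beta_null_angle_deg(beta):
    """Null angle of y = C_F - beta * C_B under the small-spacing approximation."""
    return math.degrees(math.acos((beta - 1.0) / (beta + 1.0)))

null_beta_0 = beta_null_angle_deg(0.0)  # forward cardioid only
null_beta_1 = beta_null_angle_deg(1.0)  # dipole
```

This reproduces the two end points quoted above: β = 0 puts the null at 180° and β = 1 puts it at 90°, with intermediate values of β steering the null through the rear half-plane.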

To make the system adaptive, a simple LMS algorithm is used for the back-to-back cardioid adaptive first-order differential array.

Squaring both sides of Eqn. (24) and differentiating with respect to β gives

$$ \frac{d\, y^2(t)}{d\beta} = -2\, y(t)\, C_B(t) \tag{27} $$

Thus, the LMS update of β is given as

$$ \beta_{t+1} = \beta_t + 2\mu\, y(t)\, C_B(t) \tag{28} $$

The normalized LMS update of β is therefore

$$ \beta_{t+1} = \beta_t + 2\mu\, \frac{y(t)\, C_B(t)}{\langle C_B^2(t) \rangle} \tag{29} $$

where µ is the step size and ⟨C_B²(t)⟩ denotes the time average used to normalize µ.
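A minimal Python sketch of the update in Eqn. (29) is given below. It is illustrative only: the exponential smoothing used for the time average, its initial value, the clamping of β to [0, 1] and the synthetic test signal (a single source at 120°, with the cardioid outputs modelled by the small-spacing approximation C_F ∝ 1 + cos θ, C_B ∝ 1 − cos θ) are all assumptions, not details taken from this thesis.

```python
import math

def adapt_beta(c_f, c_b, mu=0.05, smooth=0.99, eps=1e-12):
    """NLMS adaptation of beta following Eqn. (29); returns the final beta."""
    beta = 0.0
    power = 1.0  # running estimate of <C_B^2(t)>; the initial value is assumed
    for cf, cb in zip(c_f, c_b):
        y = cf - beta * cb  # Eqn. (24)
        power = smooth * power + (1.0 - smooth) * cb * cb
        beta += 2.0 * mu * y * cb / (power + eps)  # Eqn. (29)
        beta = min(1.0, max(0.0, beta))  # assumed: keep the null in the rear half-plane
    return beta

# Synthetic single source at 120 degrees (assumed test case)
theta = math.radians(120.0)
source = [math.sin(0.1 * t) for t in range(2000)]
c_f = [(1.0 + math.cos(theta)) * s for s in source]
c_b = [(1.0 - math.cos(theta)) * s for s in source]

beta = adapt_beta(c_f, c_b)
```

In this noise-free test the output power is minimized when β = (1 + cos θ)/(1 − cos θ) = 1/3, which is the value that places the null on the source at 120°, and the adaptation settles there.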

3 DESIGN AND SIMULATIONS

A graphical user interface (GUI) has been designed using MATLAB to study, demonstrate and experiment with beamforming techniques. A linear equispaced microphone array is used throughout the thesis, where the user can choose the number of elements in the array and the inter-element distance. Two sound sources s(n) and I(n), recordings of English-speaking female and male talkers, are used as the desired and interference signals respectively. The user can also select the incidence angles of the desired and interference sources. A total of eight beamforming algorithms have been implemented: 1) the delay-and-sum beamformer (DSB), 2) the Wiener beamformer (WBF), 3) the DSB-DSB GSC, 4) the WBF-DSB GSC, 5) the DSB-WBF GSC, 6) the WBF-WBF GSC, 7) the Elko beamformer and 8) the Elko-Wiener beamformer. The performance measures used in the GUI are: 1) signal-to-noise ratio (SNR), 2) speech distortion plot, 3) noise distortion plot and 4) perceptual evaluation of speech quality (PESQ).

3.1 Simulation Procedure

Let s(n) and I(n) be the sequences of the desired and interference sound sources respectively, arriving at angles of incidence ϴ1 and ϴ2 with respect to the centre of an N-element linear microphone array. The sound field is composed of plane waves. Let d meters be the inter-element distance. The velocity of sound is c = 343 m/s. The arrival time differences of the two speech signals at each microphone in the array are computed, and the delayed signals are added to get the microphone response. The delays are determined by the steering direction of the array:

$$ \tau_n = \frac{(n-1)\, d \cos\theta}{c} \tag{30} $$

where n is the microphone index, 1 ≤ n ≤ N, and τ_n is the time the sound wave takes to travel between the reference sensor and the n-th sensor. Obtaining the microphone array response is the common first step in the simulation procedure of all the beamformers mentioned above.
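The delay computation can be sketched as follows. The function names are hypothetical, and rounding the delays to whole samples at a sampling rate `fs_hz` is an assumption made here for simplicity; the thesis does not state how fractional delays are handled.

```python
import math

def arrival_delays(num_mics, spacing_m, theta_deg, c=343.0):
    """Delay in seconds of microphone n = 1..N relative to the reference microphone."""
    cos_t = math.cos(math.radians(theta_deg))
    return [(n - 1) * spacing_m * cos_t / c for n in range(1, num_mics + 1)]

def delays_in_samples(delays_s, fs_hz):
    """Round delays in seconds to whole samples (assumed simplification)."""
    return [round(t * fs_hz) for t in delays_s]

# Example: 4 microphones, 4 cm apart, source along the array axis, 16 kHz sampling
delays = arrival_delays(4, 0.04, 0.0)
print(delays_in_samples(delays, 16000))
```

The microphone response is then obtained by delaying each source by its own delay vector and adding the two delayed sources at every microphone.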


3.1.1 The Delay-and-sum Beamformer (DSB):

The delay-and-sum beamformer is very easy to implement once the response of each microphone in the array is available. As the name indicates, appropriate delays are inserted after each microphone to compensate for the arrival time differences of the desired speech signal. The outputs of the delays are time-aligned signals, which are then added together to get the beamformer output [see Fig. 2.2]. The appropriate delays are obtained from the input delay vector of the desired speech signal, calculated using Eqn. (30): flip this vector and insert each element as the delay for the corresponding microphone, so that the microphone at which the desired signal impinges first gets the highest delay. This has the effect of reinforcing the desired signal in phase while attenuating the interference, which is likely to be out of phase across the microphones.

Figure 3.1: The delay-and-sum beamforming
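The flip-and-delay step described above can be sketched as follows, assuming non-negative whole-sample delays and zero padding at the start of each channel. The function names are hypothetical.

```python
def delay(signal, k):
    """Delay a list by k >= 0 samples, zero-padding the start and keeping the length."""
    return [0.0] * k + signal[:len(signal) - k]

def delay_and_sum(mic_signals, desired_delays):
    """Apply the flipped desired-source delay vector, then add the aligned channels."""
    compensation = desired_delays[::-1]   # earliest microphone gets the largest delay
    aligned = [delay(x, k) for x, k in zip(mic_signals, compensation)]
    return [sum(col) for col in zip(*aligned)]

# Toy example: the same pulse reaches three microphones 0, 1 and 2 samples late
s = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 0.0, 0.0, 0.0]
arrival = [0, 1, 2]
mics = [delay(s, k) for k in arrival]
out = delay_and_sum(mics, arrival)
print(out)
```

After compensation every channel carries the desired signal with the same total delay, so the three copies add coherently.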


3.1.2 The Wiener Beamforming (WBF):

The simulation procedure for Wiener beamforming starts before the desired and interference signals are added at each microphone to get the microphone response. Consider the desired and interference signals separately at each microphone and represent them in matrix form. Let the total number of samples of the speech and interference signals at each microphone be S. The desired and interference matrices, M_S and M_I, are therefore of order N x S, as shown in Fig. 3.2. Each row of a matrix holds the speech or interference response of the corresponding microphone.

Figure 3.2: Desired and Interference Matrices for Wiener Beamforming

Recall Eqn. (13), which gives the optimum weights that minimize the mean square error between the output and the reference signal. It requires the auto-correlation matrices of the desired and interference signals, as well as the cross-correlation vector. The method I used to obtain these quantities from M_S and M_I while implementing the Wiener beamformer is as follows. Let the order of the Wiener filter be 64.

Consider an empty matrix X of order 64N x K, where K is the integer part of S/64.

Now, take the first 64 columns of the matrix M_S and reshape this N x 64 matrix into a single column vector x1. Replace the first column of X with x1. Next, take the following 64 columns (65 to 128) of M_S, convert them into a single column vector x2, and replace the second column of X with x2. Follow this procedure until the end of the matrix M_S to obtain the matrix X. Apply the same method to the matrix M_I to obtain the matrix Y. This method of finding the X and Y matrices is shown in Figure 3.3.

Figure 3.3: Pictorial form of the method followed
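A minimal sketch of this block-stacking step is given below. The text does not say in which order the N x 64 block is unrolled into a column; the sketch assumes the rows (microphones) are stacked one after another, which matches the later splitting of the weight vector into one 64-tap block per microphone. The function name and the list-of-columns representation are hypothetical.

```python
def block_matrix(m, order=64):
    """Build the columns x_1..x_K from an N x S matrix m (one row per microphone).

    Each column stacks `order` consecutive samples of every microphone, microphone
    by microphone (assumed ordering), so each column has order * N elements.
    """
    num_samples = len(m[0])
    k_blocks = num_samples // order          # K = integer part of S / order
    columns = []
    for k in range(k_blocks):
        col = []
        for row in m:
            col.extend(row[k * order:(k + 1) * order])
        columns.append(col)
    return columns

# Tiny example with N = 2 microphones, S = 5 samples and order 2 instead of 64
m_s = [[1, 2, 3, 4, 5],
       [6, 7, 8, 9, 10]]
x_cols = block_matrix(m_s, order=2)
print(x_cols)
```

Samples beyond the last complete block are discarded, in line with K being the integer part of S/64.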

After getting the X and Y matrices, the auto-correlation matrices of the desired signal s(n) and interference signal I(n) can be calculated using the following equations:


The auto-correlation matrices of the desired signal (R_ss) and the interference signal (R_nn) are:

$$ R_{ss} = \frac{1}{K} \sum_{i=1}^{K} x_i x_i^T \tag{31} $$

$$ R_{nn} = \frac{1}{K} \sum_{i=1}^{K} y_i y_i^T \tag{32} $$
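The averaged outer product used for both auto-correlation matrices can be sketched as follows. The function name is hypothetical, and plain nested lists are used instead of a matrix library to keep the example self-contained.

```python
def autocorrelation(columns):
    """Return R = (1/K) * sum_i x_i x_i^T for a list of K equal-length column vectors."""
    k = len(columns)
    dim = len(columns[0])
    r = [[0.0] * dim for _ in range(dim)]
    for x in columns:
        for a in range(dim):
            for b in range(dim):
                r[a][b] += x[a] * x[b] / k
    return r

# Tiny example with K = 2 columns of length 2
r_ss = autocorrelation([[1.0, 2.0], [3.0, 4.0]])
print(r_ss)
```

Applying the same function to the columns of Y gives R_nn. The result is symmetric by construction.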

The cross-correlation vector term (r_s) of Eqn. (13) is given by the centre column of the auto-correlation matrix R_ss. If the user wants to concentrate on the interference alone, the cross-correlation vector should instead be chosen as the centre column of the auto-correlation matrix R_nn. We now have all the elements required in Eqn. (13) to calculate the optimum weight vector W_opt. In this particular case, W_opt is of order 64N x 1. Take the first 64 elements of W_opt, flip them upside down and call the result w_N. Then take the next 64 elements of W_opt, i.e. elements 65 to 128, flip them upside down and call the result w_(N-1). Follow the same procedure until the end of the vector W_opt, that is, until w_1 is obtained. These individual weight vectors are then used to filter the N microphone responses in the array. Adding all the filtered outputs gives the Wiener beamformer response s'(n), as shown in Fig. 3.4.

Figure 3.4: Wiener Beamformer Structure
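The splitting of W_opt into per-microphone filters and the filter-and-sum step can be sketched as below. The function names are hypothetical, and the filters are applied as causal FIR filters with zero initial state, which is an assumption about the implementation.

```python
def split_weights(w_opt, num_mics, order=64):
    """Return [w_1, ..., w_N]; the j-th block of w_opt, flipped, becomes w_(N-j+1)."""
    filters = [None] * num_mics
    for j in range(num_mics):
        block = w_opt[j * order:(j + 1) * order]
        filters[num_mics - 1 - j] = block[::-1]   # first block -> w_N, last block -> w_1
    return filters

def fir_filter(x, h):
    """Causal FIR filter y[n] = sum_k h[k] * x[n - k], same length as x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def filter_and_sum(mic_signals, filters):
    """Filter each microphone signal with its own weights and add the results."""
    outputs = [fir_filter(x, h) for x, h in zip(mic_signals, filters)]
    return [sum(col) for col in zip(*outputs)]

# Tiny example: N = 2 microphones and filter order 2 instead of 64
w = split_weights([1.0, 2.0, 3.0, 4.0], num_mics=2, order=2)
out = filter_and_sum([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], w)
print(w, out)
```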


3.1.3 Generalized Side-lobe Canceller (GSC):

In this thesis, a few efficient algorithms are implemented based on the GSC structure discussed in section 2.3. The performance of the suggested realizations of the NLMS-based GSC methods is illustrated in the context of beamforming applications. The major problem with the traditional GSC [2] is the sensitivity of the blocking matrix, which leads to target signal leakage and thereby to cancellation of the desired signal at the final output. So, why not use another beamformer in the second path of the GSC structure instead of a blocking matrix?

Based on this idea, and using the DSB and WBF beamformers, I implemented four kinds of beamformer structures. The structure of the beamformer with N microphones is shown in Fig. 3.5.

Figure 3.5: Novel GSC structure

Therefore, the four implemented beamformers are the DSB-DSB GSC, the DSB-WBF GSC, the WBF-DSB GSC and the WBF-WBF GSC.


3.1.4 The DSB-DSB GSC:

The DSB-DSB GSC structure with N microphones is shown in Fig. 3.6. The GSC structure includes two delay-and-sum beamformers, one in the upper path and the other in the lower path, and an adaptive filter. The upper-path DSB enhances the target signal; let its output be d1(n), where n is the sample index. The lower-path DSB enhances the interference signal and rejects the target signal; let its output be d2(n). The NLMS adaptive filter adaptively subtracts the components correlated with d2(n) from d1(n), which gives the final output E(n).

Here, the lower DSB acts as the blocking matrix. If the number of sound sources increases, the number of lower paths in the GSC structure increases accordingly, where each lower path consists of a beamformer and an adaptive filter. All the outputs of the adaptive filters are summed before being subtracted from the upper beamformer output d1(n).

Figure 3.6: The structure of DSB-DSB GSC
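The adaptive subtraction stage shared by the GSC variants can be sketched with a standard NLMS canceller. This is a generic sketch rather than the thesis code: the filter order, step size and regularization constant are assumed values, the function name is hypothetical, and the toy signals stand in for the two beamformer outputs d1(n) and d2(n).

```python
import math
import random

def nlms_cancel(d1, d2, order=8, mu=0.5, eps=1e-8):
    """Return E(n) = d1(n) - w^T [d2(n), d2(n-1), ...] and the final weights w."""
    w = [0.0] * order
    buf = [0.0] * order                       # most recent samples of d2
    out = []
    for x1, x2 in zip(d1, d2):
        buf = [x2] + buf[:-1]
        estimate = sum(wi * bi for wi, bi in zip(w, buf))
        e = x1 - estimate                     # final output sample E(n)
        norm = sum(b * b for b in buf) + eps  # input power used for normalization
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out, w

# Toy example: d1 = speech + leaked interference, d2 = interference reference
rng = random.Random(0)                        # fixed seed for repeatability
interf = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
speech = [0.5 * math.sin(0.05 * n) for n in range(4000)]
d2 = interf
d1 = [s + 0.6 * i0 - 0.3 * i1
      for s, i0, i1 in zip(speech, interf, [0.0] + interf[:-1])]
e, w = nlms_cancel(d1, d2)
```

When d2(n) contains no desired speech, the filter removes the interference leaked into d1(n). Any desired speech that leaks into d2(n) is also partly cancelled, which is the leakage problem discussed for the blocking matrix.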


3.1.5 The DSB-WBF GSC:

Target signal leakage can be expected from the DSB in the above method, especially in the lower path, which leads to cancellation of the desired signal at the final output. Replacing the DSB in the lower path with a Wiener beamformer (WBF) leads to the DSB-WBF GSC. The structure of the DSB-WBF GSC with N microphones and two sound sources is shown in Fig. 3.7. In this method, the DSB enhances the desired signal s(n); let the output of the DSB be d1(n).

The WBF in the lower path acts like a blocking matrix, which extracts the interference signal and rejects the desired signal. The output of the WBF, d2(n), is free from leakage problems. The NLMS adaptive filter adaptively subtracts the components correlated with d2(n) from d1(n), which gives the final output E(n). Better signal enhancement can be observed at the output compared to the DSB-DSB GSC.

Figure 3.7: Structure of the DSB-WBF GSC


3.1.6 The WBF-DSB GSC:

Let us swap the beamformer sections of the above method, that is, use a Wiener beamformer in the upper path to enhance the desired signal and a delay-and-sum beamformer in the lower path to obtain the interference signal. As before, d1(n) and d2(n) are the outputs of the upper and lower paths, here the WBF and the DSB respectively. The structure of the WBF-DSB GSC with N microphones and two sound sources is shown in Fig. 3.8. The NLMS adaptive filter adaptively subtracts the components correlated with d2(n) from d1(n), which gives the final output E(n). A better signal-to-noise ratio can be observed at the output compared to the DSB-WBF GSC, because of the robust performance of the WBF in the upper path.

Figure 3.8: Structure of the WBF-DSB GSC


3.1.7 The WBF-WBF GSC:

If there is no chance of signal leakage in either the upper or the lower path of the GSC structure, the result is the ideal system structure with the best signal-to-noise ratio. The WBF-WBF GSC is such a system, and it builds on the Wiener beamformer. It comprises two signal flow paths: a WBF in the upper path enhances the target signal, while the other WBF in the lower path approximates the interference signal. Let the outputs of the upper- and lower-path beamformers be d1(n) and d2(n) respectively. The interference signal d2(n) is adaptively filtered using an NLMS filter and subtracted from the approximated target signal d1(n) to obtain the final output. The WBF-WBF GSC structure with an N-microphone array is shown in Fig. 3.9.

Figure 3.9: Structure of the WBF-WBF GSC beamformer.


3.1.8 The Elko Beamformer:

The design of the Elko beamformer was discussed in section 2.5. The initial steps of the implementation, up to obtaining the microphone array response, are the same as described in section 3.1. The necessary steps in implementing the Elko beamformer include an alternating arrangement of forward- and backward-facing cardioids, with unit delays after each cardioid. Since two speech recordings are used throughout the thesis and the Elko beamformer is used to suppress background sound sources, the desired speech is allowed to impinge from the forward direction (-90 to 90 degrees) and the interference from the backward direction (-90 to -180 or 90 to 180 degrees). After choosing the directions of arrival of the sound sources, consider adjacent pairs of microphones and find the individual angles of arrival (say ϴ1 and ϴ2 for the desired and interference sources) at the respective microphone pair. Use these new angles to get the forward and backward cardioid data.

$$ C_F(n) = S(n) * \sin\left(\frac{kd(1+\cos\theta_1)}{2}\right) + I(n) * \sin\left(\frac{kd(1+\cos\theta_2)}{2}\right) \tag{30} $$

$$ C_B(n) = S(n) * \sin\left(\frac{kd(1-\cos\theta_1)}{2}\right) + I(n) * \sin\left(\frac{kd(1-\cos\theta_2)}{2}\right) \tag{31} $$

where the wavenumber k = ω/c, and ϴ1 and ϴ2 are the angles of arrival of the desired and interference signals for the selected pair of microphones.
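The cardioid data for one microphone pair can be sketched as below, reading the two expressions as gains of the form sin(kd(1 + cos ϴ)/2) for the forward cardioid and sin(kd(1 - cos ϴ)/2) for the backward cardioid, applied to each source. The function names and the example value of kd are assumptions made for illustration.

```python
import math

def cardioid_gains(theta_deg, kd):
    """Forward and backward cardioid gains for a source arriving at theta_deg."""
    c = math.cos(math.radians(theta_deg))
    return math.sin(kd * (1 + c) / 2), math.sin(kd * (1 - c) / 2)

def cardioid_signals(s, i, theta1_deg, theta2_deg, kd):
    """Combine the desired signal s and the interference i into C_F(n) and C_B(n)."""
    f1, b1 = cardioid_gains(theta1_deg, kd)
    f2, b2 = cardioid_gains(theta2_deg, kd)
    c_f = [f1 * a + f2 * b for a, b in zip(s, i)]
    c_b = [b1 * a + b2 * b for a, b in zip(s, i)]
    return c_f, c_b

# Desired source straight ahead (0 degrees), interference straight behind (180 degrees)
c_f, c_b = cardioid_signals([1.0, 2.0], [3.0, 4.0], 0.0, 180.0, kd=0.5)
print(c_f, c_b)
```

With these two directions the forward cardioid contains only the desired signal and the backward cardioid contains only the interference.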

After obtaining the cardioid responses, using equations 2.4 and 2.9, we get the Elko beamformer output y1(n) for the first pair of microphones. Similarly, we get the Elko beamformer outputs for all the pairs of microphones in the array.

$$ y(n) = y_1(n) + y_2(n) + \cdots + y_m(n) \tag{24} $$

Finally, summing all the outputs gives the first-order differential microphone array response. If the number of microphones in the array is odd, overlapping pairs of forward- and backward-facing microphones are taken: if the first pair consists of a forward and a backward cardioid, then the next pair consists of the backward cardioid of the previous pair and the next adjacent forward cardioid.

3.1.9 The Elko-Wiener Beamformer:

The Elko-Wiener beamformer builds on the Elko beamformer. It comprises two sections: the Elko beamformer, which minimizes the microphone output power by locating the single first-order microphone null in the rear half-plane, and a typical Wiener beamformer, which enhances the desired speech signal in the Elko beamformer output. The structure of the N-microphone Elko-Wiener beamformer is shown in Fig. 3.10.

Figure 3.10: Structure of the Elko-Wiener Beamformer

The intelligibility of the desired speech signal at the output is considerably higher than with the Elko beamformer alone, owing to the Wiener beamforming. If the number of microphones in the array is odd, the microphones should be paired as [(1, 2), (2, 3), (3, 4) ... (N-2, N-1), (N-1, N)] to perform the Elko algorithm.
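The overlapping pairing and the final summation of the pair outputs can be sketched as follows. The function names are hypothetical, and the pair outputs in the example are placeholder values rather than real beamformer outputs.

```python
def adjacent_pairs(num_mics):
    """Overlapping adjacent pairs (1, 2), (2, 3), ..., (N-1, N), using 1-based indices."""
    return [(n, n + 1) for n in range(1, num_mics)]

def sum_pair_outputs(pair_outputs):
    """Add the Elko outputs of all pairs sample by sample."""
    return [sum(col) for col in zip(*pair_outputs)]

pairs = adjacent_pairs(5)
total = sum_pair_outputs([[1.0, 2.0], [3.0, 4.0]])   # two placeholder pair outputs
print(pairs, total)
```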


4 PERFORMANCE METRICS

A speech performance metric is a measure of the performance of the hands-free speech enhancement system with respect to the quality of the speech it reproduces.

An ideal speech communication system should produce the same perceived sound impression on the listener's side as in a situation where a speaker and a listener communicate by speech in a close and quiet anechoic chamber [9]. A system performing noise/interference suppression should preserve the quality of the speech. Assessing the performance of speech enhancement systems requires measures that evaluate the overall quality of the received speech [10].

The objective measures that I used in this thesis to evaluate the performance of the beamforming algorithms are: 1) signal-to-noise ratio (SNR) at the input and output of the systems, with output SNR vs. input SNR graphs occasionally plotted; 2) measures of speech and noise spectral distortion caused by the beamforming filters; and 3) perceptual evaluation of speech quality (PESQ), an ITU-T standard. These metrics have the advantage of being repeatable and efficiently computable.

To compute the performance metrics, a system with a particular beamforming algorithm is first executed with both sound sources (desired and interference speech) present. After the system has converged, all the beamforming filter weights W are saved. Next, the interference source is disabled, only the desired speech signal is passed through the saved system, and the response is saved. Then the desired source is disabled, only the interference is passed through the saved system, and the response is saved. The performance measures are computed from these two individual responses.

4.1 Signal-to-Noise Ratio

This metric measures the dominance of the speech signal with respect to the noise/interference. Although no noise is used in this thesis, the interference is treated as noise. So, instead of signal-to-interference ratio (SIR), I use the more familiar term signal-to-noise ratio (SNR). The input SNR is given by the ratio of the variance of the desired speech signal (E_Si) to the variance of the interference signal (E_Ni) at a particular reference microphone. Similarly, the output SNR is given by the ratio of the variance of the output when only the desired speech is given as input (E_So) to the variance of the output when only the interference is given as input (E_No).

$$ SNR_{IN}(dB) = 10 \log_{10}\left(\frac{E_{Si}}{E_{Ni}}\right) \tag{32} $$

$$ SNR_{OUT}(dB) = 10 \log_{10}\left(\frac{E_{So}}{E_{No}}\right) \tag{33} $$

Therefore, the difference of the output SNR and input SNR values gives the SNR improvement of the beamforming algorithm.
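The SNR computation from the separately saved responses can be sketched as below. The function names are hypothetical, and the population variance (division by the number of samples) is an assumption, since the text does not specify the estimator.

```python
import math

def variance(x):
    """Population variance of a list of samples (assumed estimator)."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def snr_db(speech, interference):
    """10 * log10 of the ratio of speech variance to interference variance."""
    return 10 * math.log10(variance(speech) / variance(interference))

def snr_improvement(s_in, i_in, s_out, i_out):
    """Output SNR minus input SNR, both in dB."""
    return snr_db(s_out, i_out) - snr_db(s_in, i_in)

# Toy example: equal powers at the input, interference attenuated tenfold at the output
s = [1.0, -1.0, 1.0, -1.0]
i_out = [0.1, -0.1, 0.1, -0.1]
print(snr_improvement(s, s, s, i_out))
```

Here `s_in` and `i_in` are the desired and interference signals at the reference microphone, while `s_out` and `i_out` are the responses of the frozen beamformer to each source alone.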

$$ SNRI = SNR_{OUT} - SNR_{IN} \tag{34} $$

4.2 Normalized Speech Distortion

The normalized speech distortion (NSD) is defined as the deviation between the power spectral density (PSD) of the clean speech at the reference microphone and that of the processed speech signal at the output. To establish a power reference level, the enhanced output signal is normalized by the target speech signal gain of the beamforming structure [10]. The power normalization constant C_S is computed as

$$ C_S = \frac{\mathrm{mean}(Y_S(\Omega))}{\mathrm{mean}(S(\Omega))} \tag{35} $$

where S(Ω) is the PSD of s(n) and Y_S(Ω) is the PSD of y_s(n).

$$ \hat{Y}_S(\Omega) = \frac{1}{C_S} * Y_S(\Omega) \tag{36} $$

Now, the speech distortion is given by

$$ NSD = 10 \log_{10}\left( \frac{\mathrm{abs}(\hat{Y}_S(\Omega) - S(\Omega))}{\mathrm{abs}(S(\Omega))} \right) \tag{37} $$

where abs() gives the absolute value.


4.3 Normalized Noise Distortion

Normalized noise distortion (NND) is similar to NSD and is defined as the deviation between the PSD of the clean interference signal at the reference microphone and that of the processed interference signal at the output. To establish a power reference level, the output interference signal is normalized by the interference signal gain of the beamforming structure [10]. The power normalization constant C_I is computed as

$$ C_I = \frac{\mathrm{mean}(Y_I(\Omega))}{\mathrm{mean}(I(\Omega))} \tag{38} $$

where I(Ω) is the PSD of I(n) and Y_I(Ω) is the PSD of y_I(n).

$$ \hat{Y}_I(\Omega) = \frac{1}{C_I} * Y_I(\Omega) \tag{39} $$

Now, the noise distortion is given by

$$ NND = 10 \log_{10}\left( \frac{\mathrm{abs}(\hat{Y}_I(\Omega) - I(\Omega))}{\mathrm{abs}(I(\Omega))} \right) \tag{40} $$

where abs() gives the absolute value.

4.4 Perceptual Evaluation of Speech Quality (PESQ)

Subjective evaluation of the speech quality of a transmission system is always a lengthy and expensive process. PESQ is an objective analysis method recommended by the International Telecommunication Union in ITU-T P.862. This tool predicts the results of subjective listening tests based on the schematic procedure described by the PESQ application guide ITU-T P.862.3 [12].

Let X(n) be the pre-processed signal and Y(n) be the processed signal; the PESQ tool then determines the quality of Y(n) by using a large database of subjective tests.

The final PESQ score is a linear combination of the average disturbance value and the average asymmetrical disturbance value [12]. The perceived listening quality is expressed in terms of the Mean Opinion Score (MOS), an average quality score over a large set of subjects. In most cases the output lies between 1.0 and 4.5. Most of the subjective experiments used in the development of PESQ used the Absolute Category Rating (ACR) opinion scale shown in Table 4.1.

PESQ has the advantage of being rapid and repeatable, and it requires little computation time.

Table 4.1: Quality opinion scale used in the development of PESQ

Quality of the Speech    Score
Excellent                4.5
Good                     4
Fair                     3
Poor                     2
Bad                      1
