
Methods for Improving Voice Activity Detection in Communication Services

Amardeep

Institutionen för informationsteknologi

Department of Information Technology


Due to its limited display area, a video conferencing application can only display the most active sites. These sites are identified using a voice activity detector (VAD), and a list of the most vocally active sites is maintained. In a typical video conferencing room there will be people typing on their computers or laptops, and this causes problems when the VAD classifies the keyboard typing signals as speech activity even though nobody is talking in the room. As a result, a vocally inactive site is not removed from the list of active sites and thus blocks another vocally active site from being added to the list, creating a very bad user experience in the video conference. Current VADs often classify keyboard typing as active speech.

In this thesis work we explore two main approaches to solving the problem. The first approach is based on identifying keystroke signals in the mixed audio data (speech and keyboard signal). Here we explore various audio signal classification methods based on temporal and spectral features of speech and keystroke signals, as well as classification based on a prediction model. We evaluate and compare this approach by varying its parameters, maximizing the percentage of true-keystroke frames correctly classified as keystroke frames while minimizing the number of non-keystroke frames falsely classified as keystroke frames. The evaluated keystroke identification approach is based on thresholding the model error and achieved 85% accuracy using one previous and one future frame. The frames falsely classified as keystroke frames in this approach are mainly due to plosive sounds in the audio signal, which have characteristics similar to those of keystroke signals.

The second approach is based on finding a mechanism to complement the VAD so that it does not trigger on keystroke signals. For this purpose we explore different methods for improving the pitch detection functionality of the VAD. We evaluate a new pitch detector which computes the pitch using the autocorrelation of the normalized signal frames. We then design a new speech detector, consisting of the new pitch detector together with hangover addition, that separates the mixed audio data into speech and non-speech regions in real time. The new speech detector does not trigger on keystroke frames, i.e. it places the keystroke frames in the non-speech region, and hence solves the problem.

Printed by: Reprocentralen ITC

Sponsor: Multimedia Division, Ericsson Research, EAB IT 13 001

Examiner: Lisa Kaati

Subject reviewer: Magnus Lundberg Nordenvaad

Supervisor: Erlendur Karlsson


Acknowledgement:

This Master’s thesis is the final achievement for my degree of Master of Science in Computer Science at Uppsala University. The work has been performed at the Audio Technology Department of Ericsson Research in Stockholm.

I would like to acknowledge and thank my supervisor Erlendur Karlsson for his tremendous support during the work and writing of this thesis. I would also like to thank Magnus Lundberg Nordenvaad for his valuable feedback and suggestions during the report writing. Finally, I would like to express my gratitude to all others who directly or indirectly helped me in achieving this goal.


Contents

1 Introduction ... 5

1.1 Thesis outline ... 5

2 Voice Activity Detector ... 7

2.1 Sub-band Filter Bank ... 8

2.2 Pitch detection ... 11

2.3 Tone detection... 12

2.4 Complex signal analysis ... 13

2.5 VAD decision ... 13

2.5.1 SNR computation ... 14

2.5.2 Background Noise Estimation ... 14

2.5.3 Threshold Adaption ... 14

2.5.4 Comparison block ... 15

2.5.5 Hangover block ... 15

2.6 Problems with VAD ... 16

3 Signal Characteristics ... 17

3.1 Signal Recording ... 17

3.1.1 Keystroke signals ... 17

3.1.2 Speech signals ... 17

3.2 Keystroke Signal Characteristics ... 18

3.2.1 Keypress signal ... 20

3.2.2 Keyrelease signal ... 20

3.2.3 Low frequency component of Keystroke signal ... 21

3.3 Speech Signal ... 22

3.3.1 Voiced Sounds ... 23

3.3.2 Fricative or unvoiced sounds ... 23

3.3.3 Plosive sounds ... 24

3.4 Acoustic Phonetics ... 24

3.4.1 Vowels ... 24

3.4.2 Diphthongs ... 24

3.4.3 Semivowels ... 25

3.4.4 Nasals ... 25

3.4.5 Unvoiced Fricatives ... 25

3.4.6 Voiced Fricatives ... 25

3.4.7 Voiced Stops ... 25

3.4.8 Unvoiced Stops ... 26

4 Alternative Signal Classification Approaches ... 27

4.1 Feature Vector Based Classification ... 27

4.1.1 Frame Based Feature Vector ... 27

4.1.2 Texture Based Feature Vector ... 28

4.2 Temporal and Spectral Feature Extraction ... 28

4.2.1 Zero Crossing Rate (ZCr) ... 28

4.2.2 Cepstral feature ... 31

4.2.3 Short Time Fourier Transform (STFT) ... 34

4.3 Temporal Prediction Based Signal Classification ... 35

4.3.1 Prediction Model for smooth speech signals ... 35

4.3.2 Classification based on thresholding the norm of the modeling error vector ... 43

4.3.3 Computation of variance σ²n,k ... 45


4.3.4 Weight computation ... 48

4.4 Pitch detection based on Autocorrelation of the Normalized Signal ... 50

5 Performance Evaluation ... 55

5.1 Test Signals ... 56

5.2 Evaluation Terminology ... 57

5.3 Evaluation of keystroke signal classification approaches ... 58

5.3.1 Approach based on combined prediction of the previous and next frame ... 58

5.3.2 Approach based on prediction using previous and next frame separately... 64

5.4 Evaluation of new speech detector based on the new pitch detection algorithm ... 71

5.5 Conclusion ... 72

5.5.1 Similarities and differences between Speech and Keystrokes ... 72

5.5.2 Comparison of the classification approaches ... 73

5.5.3 Future work ... 74

6 Appendix A.1: Automatic detection of keystroke in a keystroke only file ... 77

7 Appendix A.2: Data collection of Audio classification approach using prediction model (by varying parameters) ... 79

7.1 Plots for how the classification criteria works using variance based on the average of the previous and next frame ... 79

7.2 Plots for how the classification criteria works using variance based on the filtering of squared model error ... 81

7.3 Plots and tables for hit rate, false alarm and false alarm in speech region ... 82

Abbreviations:

3GPP     3rd Generation Partnership Project
AMR      Adaptive Multi Rate
AMR-WB+  Adaptive MultiRate WideBand Plus
ASC      Audio Signal Classification
CCF      Cross-Correlation Function
DFT      Discrete Fourier Transform
FFT      Fast Fourier Transform
ITU      International Telecommunication Union
SNR      Signal to Noise Ratio
STFT     Short Time Fourier Transform
VAD      Voice Activity Detector

1 Introduction

The use of video conferencing is rapidly growing in various domains like government, law, education, health, medicine and business. This development is enabled by the availability of affordable high speed internet connections (fixed and mobile) and recent technological innovations in high quality video coding and affordable high resolution displays, and is driven by a vision of sustainable growth. The strain on the environment is reduced by the drastic reduction in travel (saving airplane fuel and decreasing CO2 emissions), and meetings become more efficient and more affordable, as the participants do not need to spend time on long distance travel and the companies and organizations can make big savings on travel and hotel expenses.

A video conferencing site usually has a limited display area to show the videos from the other participating sites and very often the number of the other participating sites is so large that it is impossible to display them all at the same time. This situation is typically handled by identifying the most active sites and only displaying those. The identification of the most active sites is usually achieved by using a voice activity detector (VAD) on the audio signals from each of the sites and maintaining a list of the most active sites, where a site is taken from the list when it becomes vocally inactive and a site is added to the list when it goes from vocally inactive to vocally active. For this to work it is imperative that the VAD does a good job of properly identifying the voice activity in each audio signal.

In a typical video conferencing room there will be people typing on their computers/laptops, and this can cause problems when the VAD classifies the keyboard typing signals as speech activity when there is nobody talking in the room. This can result in a vocally inactive site not being removed from the list of active sites, thus blocking another vocally active site from being added to the list and creating a very bad user experience in the video conference.

Current VADs often classify keyboard typing as active speech. In this thesis work we will be looking into methods for complementing those VADs with a mechanism that will enable them to distinguish between true speech and keyboard typing. To be able to achieve this we will of course need to understand the inner functions of the current VADs and the main signal characteristics of speech and keyboard typing signals.

1.1 Thesis outline

In the first chapter we study the narrowband AMR VAD (voice activity detector) as it relates to our purpose. The problems with the VAD are also described using some test data.

The second chapter focuses on the signal characteristics of keystroke and speech signals. In the third chapter we look into the frame based signal processing used in real time classification.

The fourth chapter describes various alternative approaches for signal classification. Feature vector based and maximum likelihood based approaches are described in detail. A derived approach based on a combination of the above features is also described.

The fifth chapter provides an evaluation of the approaches described earlier and a summary of the obtained results.

In the appendix, one section covers automatic detection of keystroke signals. Another section includes data from the tests in the evaluation chapter.

Following are the main modules of the thesis work:

 VAD (Voice Activity Detection)
   o Functionality
   o Problems related to our purpose
 Signal Characteristics
   o Speech and keyboard signal
 Frame based signal processing
   o Multirate signal processing
   o Effect of windowing on the signal
 Alternative signal processing approaches
   o Feature vector based classification
   o Maximum likelihood criteria based classification
   o New approach to solve our problem based on the above methods and data analysis
   o Improved pitch detection
 Classification results and conclusion

2 Voice Activity Detector

In this chapter we describe the functionality of the VAD. The VAD we have used in this study is based on the narrowband AMR VAD that is used in the AMR speech coder, so our functional description will focus on this particular VAD. For more details, we refer the reader to the 3GPP technical documentation on AMR [1].

A VAD is based on techniques used in speech processing to detect the presence or absence of speech. Its main usage is in speech coding and speech recognition. In speech coding it is used to avoid unnecessary coding/transmission of silent frames, saving both computational effort and network bandwidth.

The input signal to the VAD is sampled at 8 kHz, has a bandwidth of 4 kHz, and the VAD processes the signal in 20ms signal frames. The VAD consists of four analysis (feature extraction) blocks that provide signal features to the VAD decision block, which outputs a speech flag for each frame, as illustrated in figure 1.

Figure 1: Block diagram of the VAD

The analysis blocks are:

 A sub-band filter bank delivering the signal levels in 9 sub-bands.

 A pitch detection block that delivers a pitch flag for each frame.

 A tone detection block that delivers a tone flag for each frame.

 A complex signal analysis block that identifies and delivers a complex flag for each frame.

2.1 Sub-band Filter Bank

The input signal to the sub-band filter bank is sampled at 8 kHz, has a bandwidth of 4 kHz, and is processed in 20ms signal frames. The sub-band filter bank uses 8 LP/HP QMF filter blocks organized in 4 stages to generate 9 sub-band signals as illustrated in figure 2.

Since most of the speech energy is contained in the lower frequencies, the frequency band resolution is higher at the lower frequencies than the higher frequencies.

Figure 2: The sub-band filter bank, built from 5th and 3rd order LP/HP filter blocks in 4 stages, producing the 9 sub-bands 0-250 Hz, 250-500 Hz, 500-750 Hz, 750-1000 Hz, 1-1.5 kHz, 1.5-2 kHz, 2-2.5 kHz, 2.5-3 kHz and 3-4 kHz


A 2-channel LP/HP alias free QMF filter block [2] is shown in figure 3. In the figure, $H_0$ and $H_1$ are the low pass and high pass filters respectively. For an alias free realization, the high pass filter is made the mirror image of the low pass filter with a cutoff frequency of $\pi/2$, as shown in figure 4. So $H_0$ and $H_1$ can be related as

$$H_1(e^{j\omega}) = H_0(e^{j(\omega-\pi)})$$

Figure 3: A LP/HP QMF block

Figure 4: High pass and low pass filter

In the z domain this can be represented as $H_1(z) = H_0(-z)$. An efficient way to implement the 2-channel alias free QMF block is in polyphase form. A 2-band Type 1 polyphase representation of $H_0$ and $H_1$ is shown below, using the alias free condition:

$$H_0(z) = E_0(z^2) + z^{-1}E_1(z^2)$$
$$H_1(z) = E_0(z^2) - z^{-1}E_1(z^2)$$


Now figure 3 can be redrawn using the above equations as shown in figure 5.

Figure 5 A 2-channel alias free polyphase QMF block

The cascade implementation of the filter block in figure 5 is shown in figure 6. The cascade implementation saves a lot of processing time by cutting the computation work in half in each channel, since we downsample the signal first. If we downsample later, we process every sample and then throw away every other one, so the computation cost is reduced by downsampling the signal first. In the case of the 5th order filter blocks, $E_0(z) = A_1(z)$ and $E_1(z) = A_2(z)$, where $A_1(z)$ and $A_2(z)$ are all pass filters. In the case of the 3rd order filter blocks, $E_0(z) = 1$ and $E_1(z) = A_3(z)$, where $A_3(z)$ is an all pass filter.

Figure 6: Cascade implementation of QMF block

In the case of all pass filters there is only phase distortion; the magnitude is preserved. The filters $A_1(z)$, $A_2(z)$ and $A_3(z)$ are first order direct form all-pass filters, whose transfer function is given by equation (1):

$$A_k(z) = \frac{C_k + z^{-1}}{1 + C_k z^{-1}} \qquad (1)$$

where $C_k$ is the filter coefficient.
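To make the polyphase structure above concrete, the sketch below splits a signal into two half-bands using two first order all-pass branches as in equation (1) and the downsample-first cascade of figures 5 and 6. It is only an illustration: the coefficients c1 and c2 are arbitrary placeholders, not the values used in the AMR filter bank.

```python
import numpy as np
from scipy.signal import lfilter

def allpass1(x, c):
    """First order all-pass A(z) = (c + z^-1) / (1 + c z^-1), as in equation (1)."""
    return lfilter([c, 1.0], [1.0, c], x)

def qmf_split(x, c1=0.15, c2=0.6):
    """Split x into low/high half-bands with the polyphase structure of
    figures 5 and 6: downsample first, then filter each polyphase branch
    with an all-pass component (E0 = A1, E1 = A2)."""
    x = np.asarray(x, dtype=float)
    branch0 = x[0::2]                                          # x[2n]   -> E0 path
    branch1 = np.concatenate(([0.0], x[1::2]))[:len(branch0)]  # x[2n-1] -> E1 path (z^-1, then downsample)
    y0 = allpass1(branch0, c1)
    y1 = allpass1(branch1, c2)
    return 0.5 * (y0 + y1), 0.5 * (y0 - y1)                    # low band, high band

if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
    low, high = qmf_split(x)
    print("low-band power:", np.mean(low ** 2), "high-band power:", np.mean(high ** 2))
```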

As shown in the figure 2, after each stage, the signal is downsampled by a factor of 2. So the final 9 output sub-band signals are sampled at different sampling frequencies and have different bandwidths as summarized in table 1. Table 1 also shows the number of samples per frame for each of the sub-band signals.

For each frame, the signal level is summed over a 24 ms time interval [1]. Since the input frame size is 20 ms, the remaining 4 ms must be obtained from the previous frame. To exemplify this in number of samples we can look at sub-band 9: the required number of samples is 48, the current frame has 40 samples and 8 samples are taken from the previous frame.

Table 1: Frequency distribution in the sub-bands

Band no | Frequency (Hz) | Output from stage number | Sampling rate (Hz) | No of samples in 20 ms frame
1 | 0-250    | 4 | 500 | 10
2 | 250-500  | 4 | 500 | 10
3 | 500-750  | 4 | 500 | 10
4 | 750-1000 | 4 | 500 | 10
5 | 1-1.5 k  | 3 | 1 k | 20
6 | 1.5-2 k  | 3 | 1 k | 20
7 | 2-2.5 k  | 3 | 1 k | 20
8 | 2.5-3 k  | 3 | 1 k | 20
9 | 3-4 k    | 2 | 2 k | 40

2.2 Pitch detection

This block computes the autocorrelation of the current signal frame and decides whether the signal frame is pitched or not. Examples of pitched speech signals are vowel sounds and other periodic signals. In the AMR VAD [1], the pitch detection is done by comparing open-loop lags or delays, which are computed by the open-loop pitch analysis function. This function computes autocorrelation maxima and the delays corresponding to them.

In the AMR VAD, the delays are divided into three ranges and one autocorrelation maximum is selected in each range. Instead of choosing the global autocorrelation maximum, logic is used to select among them, favoring the autocorrelation maximum corresponding to the lower delay range.

In the AMR VAD, if the bit rate mode is 5.15 or 4.75 kbit/s, open-loop pitch analysis is performed once per frame (every 20 ms). The computation of the autocorrelation is given by equation (2).

$$O_k = \sum_{n=0}^{159} x(n)\, x(n-k) \qquad (2)$$

where $O_k$ is the autocorrelation of the 20 ms signal frame $x(0{:}159)$ at delay $k$. The delay is divided into three ranges as shown in table 2.

Table 2: Delay ranges

Range number (i) | Delay range (k)
3 | 20, ..., 39
2 | 40, ..., 79
1 | 80, ..., 143

Maxima are computed in each range and they are normalized by dividing by the signal power of the correspondingly delayed frame. The normalized maxima and corresponding delays are denoted by $(M_i, t_i)$, $i = 1, 2, 3$. An autocorrelation maximum $M(T_{op})$ corresponding to the delay $T_{op}$ is selected, favoring the lower delay ranges over the higher ones. The autocorrelation maximum corresponding to the largest delay range is assigned first, and then it is updated with the autocorrelation maxima corresponding to the lower delay ranges using the following logic:

    M(T_op) = M_1, T_op = t_1
    if M_2 > 0.85 * M(T_op)
        M(T_op) = M_2, T_op = t_2
    end
    if M_3 > 0.85 * M(T_op)
        M(T_op) = M_3, T_op = t_3
    end

A counter variable lagcount stores the number of lags for the current frame. The difference between consecutive open-loop lags (delays) is computed, and if it is smaller than a threshold the lagcount is incremented. If the sum of the lagcounts of two consecutive frames is higher than the pitch threshold, the pitch flag is set.
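To make the selection logic above concrete, the sketch below computes the normalized autocorrelation maxima in the three delay ranges of table 2 and applies the 0.85 weighting that favors the lower delay ranges. It is a simplified reading of the procedure, not the AMR reference code; the normalization detail and the small constant guarding against division by zero are assumptions.

```python
import numpy as np

DELAY_RANGES = [(80, 143), (40, 79), (20, 39)]  # ranges 1, 2, 3 of table 2

def open_loop_pitch(frame):
    """Return (M(T_op), T_op): the favored normalized autocorrelation
    maximum and its delay, preferring the lower delay ranges."""
    x = np.asarray(frame, dtype=float)        # 20 ms frame x(0:159) at 8 kHz
    maxima = []
    for lo, hi in DELAY_RANGES:
        best_m, best_t = -np.inf, lo
        for k in range(lo, hi + 1):
            o_k = np.sum(x[k:] * x[:-k])              # eq. (2), truncated sum
            norm = np.sum(x[:-k] ** 2) + 1e-12        # power of the delayed frame
            if o_k / norm > best_m:
                best_m, best_t = o_k / norm, k
        maxima.append((best_m, best_t))
    m, t = maxima[0]                          # largest delay range assigned first
    for m_i, t_i in maxima[1:]:               # then favor the lower delay ranges
        if m_i > 0.85 * m:
            m, t = m_i, t_i
    return m, t
```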

2.3 Tone detection

The main functionality of this block is to detect sinusoidal signals such as information tones. One way is to use a second order AR model and look at the locations of the model's poles relative to the unit circle. If the poles are close to the unit circle, the signal is classified as tonal.

In the AMR VAD [1], a pitched signal is classified as tonal if the pitch gain is higher than the tone threshold (TONE_THR), and otherwise as non-tonal. The normalized autocorrelation maximum from the open-loop pitch analysis function is compared to the tone threshold; if it is higher, the tone flag is set to 1, otherwise to 0.

2.4 Complex signal analysis

This block detects highly correlated signals in the high pass weighted speech domain. One example of a signal having a high correlation value is music.

2.5 VAD decision

The block diagram of the VAD decision algorithm [1] is shown in the figure 7.

Figure 7: Block diagram for VAD decision

As shown in figure 7, the signal levels from the sub-band filter bank, the pitch flag from the pitch detection block, the tone flag from the tone detection block and the complex warning flag from the complex signal analysis block are the inputs to this block. The SNR (signal to noise ratio) is computed from the signal level of the frame and the background noise estimate; background noise estimation is described in sub-section 2.5.2. The computed SNR is compared to an adaptive threshold which depends on the noise level. If the SNR is higher than the threshold, the intermediate VAD flag (vad_reg) is set, which together with the hangover determines the VAD flag (vad_flag). How the hangover works with the intermediate VAD flag is described in sub-section 2.5.5.


Each block in the figure 7 is described in the following sub-sections, starting with the SNR Computation block.

2.5.1 SNR computation

For SNR computation, the ratio between the signal level of the input frame and the background noise estimate is computed in each band, and the sum over the bands is stored in the output variable snr_sum for the current frame, as shown in equation (3):

$$snr\_sum = \sum_{n=1}^{9} \left( MAX\!\left(1.0,\; \frac{level[n]}{bckr\_est[n]}\right) \right)^{2} \qquad (3)$$

where level[n] and bckr_est[n] are the signal level and level of background noise estimate at band n respectively.

2.5.2 Background Noise Estimation

The background noise estimate is updated using the amplitude levels of the previous frame in the non-speech region only. In the VAD, the background noise is estimated using a first order IIR filter in each band, as shown in equation (4):

$$bckr\_est_{m+1}[n] = (1-\alpha)\cdot bckr\_est_{m}[n] + \alpha \cdot level_{m-1}[n] \qquad (4)$$

where $m$ is the current frame, $n$ is the band number and $level_{m-1}[n]$ is the signal level of the previous frame. The variable $\alpha$ is set by comparing the background noise estimate of the current frame with the signal level of the previous frame. A pseudo-code for this is as below:

    if (bckr_est_m[n] < level_{m-1}[n])
        α = α_UP
    else
        α = α_DOWN
    end

where the variables α_UP and α_DOWN are set according to the complex signal hangover, the pitch flag and the intermediate VAD decision.
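Equation (4) can be read as a per-band leaky integrator whose adaptation speed switches depending on whether the level lies above or below the current estimate. The sketch below illustrates this; the values of ALPHA_UP and ALPHA_DOWN are placeholders, not the constants of the AMR specification.

```python
import numpy as np

ALPHA_UP = 0.03    # placeholder adaptation speeds, not the AMR constants
ALPHA_DOWN = 0.15

def update_noise_estimate(bckr_est, prev_level):
    """One update of eq. (4) for all 9 sub-bands.

    bckr_est   -- current background noise estimate per band, shape (9,)
    prev_level -- signal level of the previous frame per band, shape (9,)
    """
    bckr_est = np.asarray(bckr_est, dtype=float)
    prev_level = np.asarray(prev_level, dtype=float)
    alpha = np.where(bckr_est < prev_level, ALPHA_UP, ALPHA_DOWN)
    return (1.0 - alpha) * bckr_est + alpha * prev_level
```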

2.5.3 Threshold Adaption

The threshold is tuned according to the background noise level. The threshold is tuned to a lower value when the noise level is higher, so that speech is detected reliably, although some noise frames may then be classified as speech frames.

The average background noise level is computed by adding the noise estimates of all bands, as shown in equation (5):

$$noise\_level = \sum_{n=1}^{9} bckr\_est[n] \qquad (5)$$

The VAD threshold is calculated from the average noise level as shown in equation (6):

$$vad\_thr = VAD\_SLOPE \cdot (noise\_level - VAD\_P1) + VAD\_THR\_HIGH \qquad (6)$$

where VAD_SLOPE, VAD_P1 and VAD_THR_HIGH are constants.

2.5.4 Comparison block

The inputs to this block are snr_sum from the SNR computation block and vad_thr from the threshold adaption block. The intermediate VAD decision (vad_reg) is made by comparing snr_sum to vad_thr; if it is higher than vad_thr, vad_reg is set to 1, otherwise to 0, as below:

    if (snr_sum > vad_thr)
        vad_reg = 1
    else
        vad_reg = 0
    end

2.5.5 Hangover block

Figure 8: Hangover addition

Hangover is added to avoid detecting the silence between two spoken words as non-speech frames. It combines two conditions: if a certain number of consecutive frames (burst_len) have the intermediate VAD decision set to 1, the hangover flag is set to one, and the next hang_len frames will then have the VAD flag set to 1 even though the intermediate VAD flags become 0. This is illustrated in figure 8.

The green dots in figure 8 indicate frames with the intermediate VAD flag set to 1 and the orange dots indicate frames with the intermediate flag set to 0. After hangover addition, hang_len of the orange dots become green.
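Putting sub-sections 2.5.1 to 2.5.5 together, the per-frame decision chain can be sketched as below. This is only a schematic rendering with placeholder constants and hangover lengths; it omits the pitch, tone and complex-signal contributions to the final decision.

```python
import numpy as np

VAD_SLOPE, VAD_P1, VAD_THR_HIGH = -0.002, 50.0, 30.0   # placeholder constants
BURST_LEN, HANG_LEN = 3, 5                              # placeholder hangover lengths

class VadDecision:
    def __init__(self):
        self.burst = 0      # consecutive frames with vad_reg == 1
        self.hang = 0       # remaining hangover frames

    def step(self, level, bckr_est):
        """One 20 ms frame: eq. (3), eq. (6), comparison and hangover."""
        snr_sum = np.sum(np.maximum(1.0, level / bckr_est) ** 2)        # eq. (3)
        noise_level = np.sum(bckr_est)                                  # eq. (5)
        vad_thr = VAD_SLOPE * (noise_level - VAD_P1) + VAD_THR_HIGH     # eq. (6)
        vad_reg = 1 if snr_sum > vad_thr else 0                         # comparison block
        if vad_reg:
            self.burst += 1
            if self.burst >= BURST_LEN:
                self.hang = HANG_LEN               # arm the hangover
            vad_flag = 1
        else:
            self.burst = 0
            vad_flag = 1 if self.hang > 0 else 0   # hangover keeps the flag up
            self.hang = max(0, self.hang - 1)
        return vad_flag
```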

2.6 Problems with VAD

We tested the VAD with a test signal that is a mixture of clean speech and a keyboard typing signal. A plot of the mixed signal along with the speech flag from the VAD is shown in figure 9. The first subplot shows the mixed signal, whereas the second subplot shows the speech flag with respect to frame number; a flag value of one indicates that the frame contains speech, whereas a flag value of zero indicates that it does not.

Figure 9: Test run using VAD

From figure 9 it is clear that the VAD is marking almost all keystroke frames as speech frames. We will therefore look into the signal characteristics in the next chapter to find out why keystroke signals are classified as speech by the VAD.

3 Signal Characteristics

This chapter describes the general characteristics of keystroke and speech signals in the time and frequency domains such as the duration and frequency range of a keystroke and the phonetic components of speech signals.

To study these signals we have had to record a representative collection of both types of signals. How we went about this is described in section 3.1. In section 3.2 we describe the signal characteristics of keystroke signals, and section 3.3 covers the signal characteristics of speech signals.

3.1 Signal Recording

In this section we describe how we covered the variation of the keystroke signals (3.1.1) and the speech signals (3.1.2).

3.1.1 Keystroke signals

Keystroke signals can vary quite a bit, depending on the keyboard used, the person typing on the keyboard and the mood of that person. It is, therefore, important to obtain keystroke signals that cover this variation space reasonably well. Keyboards from different manufacturers have different mechanics that affect the characteristics of the keystroke signal. To cover that variation we have used the following keyboards in our recordings:

 HP keyboard

 Logitech Keyboard

 Logitech wireless keyboard

 Mac laptop keyboard

Another factor that influences the signal characteristics is the person typing on the keyboard, because different persons have different typing styles, which are also affected by the mood of the person. Some people type very fast, so the time between keystrokes will be short compared to people typing slowly. In our work, we recorded two minutes of typing by ten people on each of the above keyboards. The recording room was noise free and insulated from the outside environment.

3.1.2 Speech signals

Speech signals vary from person to person and with the phonetic sequences of the sentences being spoken. To cover this variation we have used recordings made by Ericsson of different English speaking persons speaking different sentences with well chosen phonetic content.

There were 160 files of speech data, each 8 seconds long. To cover the phonetic variation, ten speech files, each consisting of two sentences, were recorded by each person in a group of seven female and nine male speakers.

3.2 Keystroke Signal Characteristics

A keystroke begins with a key press signal component and ends with a key release signal component [3][4]. The duration of a keystroke is between 60 and 200 ms [5]. The key press and key release signal components are high frequency components, and the middle component between the two is a low frequency component.

Considering the typing mechanism on a keyboard, a person first touches the key, then hits it and finally releases it, all in quick succession. This mechanism results in three peaks: a touch peak, a hit peak and a release peak [3]. Generally the touch and hit peaks are very close and sometimes overlapping. The touch peak and hit peak form the key press component, and the release peak forms the key release sound. The duration of the key press and key release components varies from 10 to 35 ms, as shown in figures 10 and 11.


Figure 11: Keystroke duration

Spectrally, keystroke signals are highly random due to the typing style, the key sequence and the mechanics of the keyboard. The signal power of a key press is larger than that of the key release. The main frequency range of the key press is 1-8 kHz. The spectrum of a typical keystroke is shown in figure 12.

Figure 12: Spectrum of a keystroke signal

In subsection 3.2.1 the variations of the keypress signal are described in more detail. The keyrelease and the middle component of the keystroke signal are described in sub-sections 3.2.2 and 3.2.3 respectively.


3.2.1 Keypress signal

As mentioned earlier, the keystroke signal varies with the keyboard mechanics and the typing style of the person typing on the keyboard. It was observed that the peaks in a keypress vary with the keyboard mechanics. Mainly one-peak and two-peak keypress signals were observed in our data set.

A one-peak key press component was observed on the HP keyboard. Here the touch peak and hit peak are not distinguishable. The key press is followed by the low frequency middle component and the key release, which may have more than one peak. The key press component is stronger than the key release component. The average width of the key press is 10-15 ms. A typical keystroke for the HP keyboard is shown in figure 13.

A two-peak key press component is observed on the Logitech wired, Logitech wireless and MacBook laptop keyboards. Here the touch peak and hit peak are a bit further apart, so the key press has two peaks, which are followed by the low frequency middle component and the key release. The width of the key press varies from 15 to 25 ms. A typical keystroke for the Logitech keyboard is shown in figure 14.

Sometimes the keyboard signal also exhibits multiple peaks in the keypress, as shown in figure 15, due to the mechanics of the key in the device. This behavior is found for specific keys of the keyboard, like the spacebar and the enter key. In this case the width of the keypress varies between 20 and 35 ms.

Figure 13: One peak key press

3.2.2 Keyrelease signal

Keyrelease signals have weak signal strength compared to keypress signals. The keyrelease peaks are not as sharp as the keypress peaks. The width of the keyrelease peak varies between 10 and 25 ms. A typical keyrelease is shown in figure 10.


3.2.3 Low frequency component of Keystroke signal

The part of the keyboard signal between the keypress and the keyrelease contains low frequency components. The spectrum of the low frequency component is shown in figure 16.

Figure 14: Two peak key press


Figure 16: Spectrum of high and low frequency component of keyboard signal

3.3 Speech Signal

In this section a brief description of speech signals is given. Speech signals can be divided into three distinct classes according to the mode of excitation [6]: voiced sounds (3.3.1), fricative sounds (3.3.2) and plosive sounds (3.3.3). Figure 17 shows a cross sectional view of the vocal tract system.


3.3.1 Voiced Sounds

They are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation. These periodic pulses excite the vocal tract. Examples of voiced sounds are /U/, /d/, /w/, /i/ and /e/ [6]; one is shown in figure 18.

Figure 18: /i/ in finished

3.3.2 Fricative or unvoiced sounds

They are generated by forming a constriction at some point in the vocal tract (usually towards the end of the mouth) and forcing air through the constriction at a high enough velocity to produce turbulence. This creates a broad spectrum noise source that excites the vocal tract. An example of a fricative sound is /sh/, labeled /∫/ [6], shown in figure 19.


3.3.3 Plosive sounds

They result from a complete closure of the front of the vocal tract, building up pressure behind the closure and abruptly releasing it. A typical example of a plosive sound is /t∫/ [6], shown in figure 20.

Figure 20: /t/ in town

3.4 Acoustic Phonetics

Most languages can be described in terms of a set of distinctive sounds or phonemes. There are 42 phonemes for American English, including vowels, diphthongs, semivowels and consonants [6]. There are many ways to study phonetics, e.g. the study of distinctive features or characteristics of the phonemes.

3.4.1 Vowels

They are produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by vibration of the vocal cords. The variation of the cross-sectional area along the vocal tract determines the resonant frequencies of the tract (also called formants). The dependence of the cross-sectional area upon the distance along the tract is called the area function of the vocal tract. The position of the tongue determines the area function of a particular vowel, but the positions of the jaw, lips, and, to a small extent, the velum also influence the resulting sound. Each vowel is characterized by its vocal tract configuration. Examples of vowels are /a/ in "father" and /i/ in "eve" [6].

3.4.2 Diphthongs

A diphthong is a gliding monosyllabic speech item that starts at or near the articulatory position for one vowel and moves to or toward the position for another. There are six diphthongs in American English, including /eI/ (as in bay), /oU/ (as in boat), /aI/ (as in buy), /aU/ (as in how), /oI/ (as in boy) and /ju/ (as in you) [6].


3.4.3 Semivowels

They are characterized by a gliding transition in the vocal tract area function between adjacent phonemes. Thus the acoustic characteristics of these sounds are strongly influenced by the context in which they occur. Examples of semivowels are /w/, /l/, /r/ and /y/. They are called semivowels because they sound like vowels [6].

3.4.4 Nasals

They are produced by glottal excitation with the vocal tract constricted at some point. Due to the lowering of the velum, the air flows through the nasal tract, and the mouth serves as a resonant cavity. They are characterized by resonances which are spectrally broader than those of vowels. Examples are /m/ and /n/ [6].

3.4.5 Unvoiced Fricatives

They are produced by exciting the vocal tract with a steady air flow and a constriction at some location in the vocal tract. The constriction location determines the fricative sound. Examples of unvoiced fricatives and their constriction places are shown in table 3 [6]. As shown in figure 19, unvoiced fricative sounds are non-periodic.

Table 3: Constriction place for unvoiced fricatives

Unvoiced fricative | Constriction place
/f/  | Lips
/θ/  | Teeth
/s/  | Middle of the oral tract
/sh/ | Back of the oral tract

3.4.6 Voiced Fricatives

In the production of voiced fricatives, the constriction place is some point near the glottis and is the same for all of them, but there are two excitation sources involved: one excitation source is at the glottis, and the other is the turbulence produced at the constriction [6].

3.4.7 Voiced Stops

These sounds are produced by building up pressure at some constriction point in the oral tract and releasing it suddenly. Examples of voiced stops and their constriction point are shown in table 4 [6].

Table 4: Constriction place for voiced stops

Voiced stop | Constriction point
/b/ | Lips
/d/ | Back of teeth
/g/ | Near velum

These sounds are highly dynamic and their properties depend on the vowel which follows them. Their waveforms give little information about the particular voiced stop.


3.4.8 Unvoiced Stops

They are very similar to the voiced stops except that the vocal cords do not vibrate in this case. Examples are /p/, /t/ and /k/. Their duration and frequency content also vary with the stop consonants [6].

4 Alternative Signal Classification Approaches

Chapter 2 described the VAD that is currently used to classify/detect speech content in an audio signal. As described there, the current VAD has a problem in that it is prone to classifying non-speech content as speech content. In this chapter we describe two approaches that can be used to classify/detect keystrokes in audio signals, as well as an improved pitch detection method, which can be integrated into the VAD to greatly improve its speech detection performance through a big reduction in incorrectly classified non-speech frames.

Audio signal classification is explained in sections 4.1-4.3. Feature vector based classification is explained in section 4.1. Section 4.2 describes the temporal and spectral features used in window based signal processing. Temporal prediction based classification is explained in section 4.3, where we present two approaches based on a prediction model: the first is based on a combined prediction of the current frame from the previous and the following frames (combined backward and forward prediction), whereas the second uses separate backward and forward predictions.

The improved pitch detection method is explained in section 4.4, where we present an approach based on the autocorrelation of the normalized signal to compute the pitch of the audio signal.

4.1 Feature Vector Based Classification

A feature vector is used to classify signals into their corresponding classes. The selection of features in the feature vector plays an important role in obtaining a reliable and robust signal classification. There are two different approaches for selecting these features [7]:

 Frame based feature vectors

 Texture based feature vectors.

4.1.1 Frame Based Feature Vector

In the frame based feature vector approach, the input signal is broken into small blocks and a feature vector is computed for each block. The blocks are called analysis windows, and the analysis window also defines the window length. The feature vector is computed over time intervals between 10 and 40 ms [7]. The frame based feature vector approach is widely used in real time classification of audio signals.
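As a minimal illustration of the frame based approach, the sketch below slices a signal into analysis windows of a chosen length and hop and computes a simple placeholder feature vector per window; the 20 ms window and 10 ms hop are example values, not prescribed here.

```python
import numpy as np

def frame_signal(x, fs, win_ms=20.0, hop_ms=10.0):
    """Split x into overlapping analysis windows (frame based processing)."""
    x = np.asarray(x, dtype=float)
    win = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    n_frames = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop: i * hop + win] for i in range(n_frames)])

def frame_features(frames):
    """One simple feature vector per analysis window: [energy, peak amplitude]."""
    energy = np.sum(frames ** 2, axis=1)
    peak = np.max(np.abs(frames), axis=1)
    return np.column_stack([energy, peak])
```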


4.1.2 Texture Based Feature Vector

The major drawback of the frame based approach is that it does not take into account long-term characteristics which can improve the classification result. For example, in music classification the rhythmic structure can help in detecting the genre, but with a 40 ms frame we cannot capture the rhythmic structure. Similarly, envelope detection also helps in the classification. For this purpose we need a structural description, so a longer time interval is required. In music classification, not only a feature itself but also its variation helps a lot in obtaining a better classification. A texture window is used for this purpose, as it contains a long term segment (in the range of seconds) consisting of many analysis windows [7]. In this approach mainly statistical measures over the analysis windows, such as the mean, the standard deviation, the mean of the derivative and the standard deviation of the derivative, are computed for classification.

The texture based feature vector is not suitable for real time classification, because it needs to process a large number of frames and may introduce a large delay, which defeats the purpose of the classification.

4.2 Temporal and Spectral Feature Extraction

In this section various temporal and spectral features are explained that are widely used in classification of audio signals.

4.2.1 Zero Crossing Rate (ZCr)

Zero crossing (ZC) is the number of times the signal crosses zero during one analysis window and is often used to obtain a rough estimate of the fundamental frequency of voiced signals [8][12]. In the case of a complex signal it gives a measure of noisiness. Generally the short-time ZCr is helpful in differentiating between voiced and unvoiced segments of speech due to their differing spectral energy concentration. If the signal is spectrally deficient, like a sinusoid, it will cross the zero line twice per cycle. However, if it is spectrally rich it might cross the zero line many more times per cycle. The zero crossings of a speech signal, a keystroke signal and a mixed signal are shown in figures 21, 22 and 23 respectively.

Zero crossing (ZC) is defined as the number of times the signal amplitude changes sign during one analysis window, as shown in equation (7). The zero crossing rate (ZCr) is defined as the change of the zero crossing count of the current frame with respect to the previous frame, as shown in equation (8) [7].

$$ZC = \frac{1}{2} \sum_{n=1}^{N} \left| \operatorname{sign}(x(n)) - \operatorname{sign}(x(n-1)) \right| \qquad (7)$$

where the sign function is defined by

$$\operatorname{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

$$ZC_r = ZC_{current} - ZC_{prev} \qquad (8)$$

where $ZC_r$ is the zero crossing rate, $ZC_{current}$ is the zero crossing count of the current analysis window and $ZC_{prev}$ is the zero crossing count of the previous analysis window.

Figure 21 : Zero crossing of clean speech signal

As shown in figure 21, the ZC of voiced speech is lower than that of unvoiced speech and keystrokes, so this feature can help in differentiating among them. One important advantage of ZC is that it is very fast to calculate: it is a time domain feature, so we do not need to compute the spectrum [8].
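A direct implementation of equations (7) and (8) can look like the sketch below; it operates on one analysis window at a time and leaves the window length to the caller.

```python
import numpy as np

def zero_crossings(window):
    """Zero crossing count ZC of one analysis window, eq. (7)."""
    s = np.sign(window)
    return 0.5 * np.sum(np.abs(np.diff(s)))

def zero_crossing_rate(curr_window, prev_window):
    """Zero crossing rate ZCr = ZC_current - ZC_prev, eq. (8)."""
    return zero_crossings(curr_window) - zero_crossings(prev_window)
```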

Figure 22 shows the zero crossing counts of keystroke signal frames. The figure shows that the zero crossing count of the keystroke signal is less than 350.


Figure 22 : Zero crossing of keystroke signal


4.2.2 Cepstral feature

Spectral features are more general purpose features for different kinds of problems, compared to temporal features, which are limited to tasks such as instrument recognition, genre recognition or speaker recognition. The cepstral feature is a commonly used spectral feature in speech processing. The idea behind using the cepstrum is to find the range of the resonant frequencies, which can be computed by extracting a smooth envelope from the lower cepstral coefficients. The following three types of cepstrum are commonly used:

 Power Cepstrum

 Real Cepstrum

 Complex Cepstrum

The Power Cepstrum is often used to determine the pitch of a human speech signal. It is defined as the squared magnitude of the Fourier transform of the log of the squared magnitude of the Fourier transform of the signal [11][18]:

$$power\_ceps = \left| \mathrm{fft}\!\left( \log_{10} \left| \mathrm{fft}(signal) \right|^{2} \right) \right|^{2} \qquad (9)$$

The Complex Cepstrum is used in homomorphic signal processing. It is defined as the Fourier transform of the log of the Fourier transform of the signal. The Complex Cepstrum can be used for complete reconstruction of the signal, as it includes the phase information along with the magnitude information:

$$complex\_ceps = \mathrm{fft}\!\left( \log_{10}\!\left( \mathrm{fft}(signal) \right) \right) \qquad (10)$$

Homomorphic system theory states that if two signals are convolved in the time domain, one having high frequency components while the other has low frequency components, then those signals can be extracted separately through a selection of the cepstral coefficients. The lower cepstral coefficients will then represent the low frequency signal and the higher cepstral coefficients the high frequency signal [6].

A very important property of the complex cepstral domain is that the convolution of two signals can be expressed as the addition of their cepstra [6][9]. Suppose the signal $x$ is the convolution of two signals $x_1$ and $x_2$ in the time domain; then the Fourier transform of $x$ is the product of the Fourier transforms of $x_1$ and $x_2$:

$$x = x_1 * x_2 \;\xrightarrow{\;\mathrm{FFT}\;}\; X = X_1\, X_2 \qquad (11)$$

where $X$, $X_1$, $X_2$ are the Fourier coefficients of the signals $x$, $x_1$, $x_2$ respectively. The complex cepstrum of equation (11) is shown in equation (12):

$$X_{cc} = \mathrm{fft}\!\left(\log(X_1 X_2)\right) = \mathrm{fft}\!\left(\log X_1\right) + \mathrm{fft}\!\left(\log X_2\right) = X_{1cc} + X_{2cc} \qquad (12)$$

where $X_{cc}$, $X_{1cc}$, $X_{2cc}$ are the complex cepstral coefficients of the signals $x$, $x_1$, $x_2$ respectively.

If the signal $x_1$ is a low frequency signal and $x_2$ a high frequency signal, then the lower cepstral coefficients of $X_{cc}$ will be dominated by the lower cepstral coefficients of $X_{1cc}$, and the higher cepstral coefficients of $X_{cc}$ will be dominated by the higher cepstral coefficients of $X_{2cc}$.

Another important application is the extraction of a smooth envelope of the log of the Fourier transform of the signal, so that resonant frequencies are easier to detect. As the log of the FT contains both low and high frequency components, we can get rid of the high frequency components by selecting only the lower spectral coefficients and then estimating the spectrum by taking the inverse Fourier transform of these lower spectral coefficients only. The cutoff threshold for the spectral index should, however, be chosen carefully.

The Real Cepstrum is defined as the real part of the inverse Fourier transform of the log of the absolute value of the Fourier transform of the signal [10]. Only the real part is taken, because a very small imaginary part is produced during the computation. As the phase information is removed in the computation, there is a significant reduction in the amount of information being processed.

$$real\_ceps = \mathrm{real}\!\left( \mathrm{ifft}\!\left( \log_{10}\!\left( \mathrm{abs}\!\left( \mathrm{fft}(signal) \right) \right) \right) \right) \qquad (13)$$

For a signal $x$, its real cepstrum is defined as below:

$$X_{rc} = \mathrm{real}\!\left( \mathrm{ifft}\!\left( \log_{10}\!\left( \mathrm{abs}\!\left( \mathrm{fft}(x) \right) \right) \right) \right) = \mathrm{real}\!\left( \mathrm{ifft}(X) \right) \qquad (14)$$

where $X_{rc}$ is the real cepstrum of the signal $x$ and $X$ is the log of the absolute value of the Fourier transform of the signal $x$, as shown in equation (15):

$$X = \log_{10}\!\left( \mathrm{abs}\!\left( \mathrm{fft}(x) \right) \right) \qquad (15)$$

We are interested in the smooth part of the real cepstrum, i.e. its envelope $X_{rc\_envelop}$, which can be computed by selecting only the lower spectral coefficients of $X$ and then taking the real part of the inverse Fourier transform of these lower spectral coefficients. This works because $X$ is real and symmetric.

$$X_{rc\_envelop} = \mathrm{real}\!\left( \mathrm{ifft}\!\left( X_{smooth} \right) \right) \qquad (16)$$

where $X_{rc\_envelop}$ is the estimated envelope of the real cepstrum and $X_{smooth}$ contains the lower spectral coefficients of $X$.

In the thesis work we computed the smooth real cepstrum coefficients. The plot of $X$ is shown by the red line in the first window of figure 24. The green line shows the estimate obtained using the first 50 cepstral coefficients. It is seen from the figure that the plot is not very smooth, so it may be difficult to find the resonant frequency in the general case. In this particular case of the keystroke shown in figure 24, the resonant frequency is approximately 6 kHz.

Figure 24 : Estimated real cepstrum of the keypress signal and its estimate
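A small numerical sketch of the envelope idea behind equations (13)-(16) is given below, assuming the usual reading of truncating the real cepstrum and transforming back to the log-frequency domain; the 50 retained coefficients match the number mentioned for figure 24, and the small constant added before the logarithm is an assumption to avoid log(0).

```python
import numpy as np

def log_spectrum(x):
    """X = log10(|fft(x)|), eq. (15)."""
    return np.log10(np.abs(np.fft.fft(x)) + 1e-12)

def real_cepstrum(x):
    """X_rc = real(ifft(log10(|fft(x)|))), eq. (13)/(14)."""
    return np.real(np.fft.ifft(log_spectrum(x)))

def spectral_envelope(x, n_coeffs=50):
    """Smooth envelope of the log spectrum obtained by keeping only the
    lowest (and symmetric highest) n_coeffs cepstral coefficients, in the
    spirit of eq. (16)."""
    c = real_cepstrum(x)
    c_smooth = np.zeros_like(c)
    c_smooth[:n_coeffs] = c[:n_coeffs]                  # low-quefrency part
    c_smooth[-(n_coeffs - 1):] = c[-(n_coeffs - 1):]    # keep the symmetric part
    return np.real(np.fft.fft(c_smooth))                # back to the log-frequency domain
```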

The log of the FT of voiced speech and of unvoiced speech is shown in figures 25 and 26 respectively. As shown in figure 25, the resonant frequencies of voiced speech lie below 1 kHz. Fricative sounds are spectrally flat compared to voiced speech sounds; their resonant frequencies lie in a higher frequency region (10-15 kHz), as shown in figure 26.


Figure 25 : Estimated real cepstrum of the voiced speech signal and its estimate

Figure 26 : Estimated real cepstrum of the fricative sound signal and its estimate

4.2.3 Short Time Fourier Transform (STFT)

FFT based features are widely used in signal classification. The Short Time Fourier Transform is obtained by taking the Fourier transform of the windowed signal. The width of the window function determines the resolution: for good frequency resolution the window size is increased, whereas for good time resolution the window size is decreased.


There are many spectral features based on the STFT of the signal, such as the spectral flux, where the change of the STFT of the current frame with respect to the previous frame is compared with a threshold.
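As an example of such an STFT based feature, the sketch below computes a simple spectral flux per frame; the Hanning window, the normalization and the use of magnitude (rather than power) spectra are choices made here for illustration only.

```python
import numpy as np

def spectral_flux(frames):
    """Spectral flux per frame: L2 distance between the magnitude spectra
    of consecutive analysis windows (frames: n_frames x window_length)."""
    window = np.hanning(frames.shape[1])
    mags = np.abs(np.fft.rfft(frames * window, axis=1))
    mags /= (np.sum(mags, axis=1, keepdims=True) + 1e-12)   # normalize each spectrum
    flux = np.sqrt(np.sum(np.diff(mags, axis=0) ** 2, axis=1))
    return np.concatenate(([0.0], flux))                     # first frame has no predecessor
```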

4.3 Temporal Prediction Based Signal Classification

Speech signals are smooth and highly correlated in the temporal domain (frame to frame) and can therefore be modeled with an AR model. This model is explained in section 4.3.1. The classification, which is based on the smoothness criterion, is explained in section 4.3.2. We describe the weight computation used to optimize the classification algorithm in section 4.3.4.

4.3.1 Prediction Model for smooth speech signals

As discussed earlier, the speech signal is a smooth signal, so the STFT of a speech signal can be modeled with an autoregressive model [17][18], as shown in equation (17) [5]:

$$Y(n,k) = \sum_{m=1}^{M} a_{n,k,m}\, Y(n-m, k) + X_{n,k} \qquad (17)$$

where $n$ and $k$ represent the frame index and the frequency index respectively, $m$ is the delay, $a_{n,k,m}$ are the prediction coefficients for the frames used in the prediction and $X_{n,k}$ is the model error. The model error is modeled here as a white Gaussian stochastic process in the temporal domain (over $n$) with zero mean and variance $\sigma_{n,k}^2$, i.e. $X_{n,k} \sim N(0, \sigma_{n,k}^2)$, with the probability density function given by equation (18).

$$p(X_{n,k}) = \frac{1}{\sqrt{2\pi \sigma_{n,k}^2}}\, e^{-\frac{X_{n,k}^2}{2\sigma_{n,k}^2}} \qquad (18)$$

and independence between the sub-bands is given by equation (19):

$$E\!\left[ X_{n,k}\, X_{m,q} \right] = \begin{cases} \sigma_{n,k}^2 & \text{if } n = m \text{ and } k = q \\ 0 & \text{otherwise} \end{cases} \qquad (19)$$

where E is the expectation operator. The model error vector for each frame is:

$$\mathbf{X}_n = \left[\, X_{n,1} \;\; X_{n,2} \;\; \ldots \;\; X_{n,k} \,\right]$$


and the normalized model error vector is given as $\bar{\mathbf{X}}_n = \left[\, \bar{X}_{n,1} \;\; \bar{X}_{n,2} \;\; \ldots \;\; \bar{X}_{n,k} \,\right]$, where $\bar{X}_{n,k} = X_{n,k}/\sigma_{n,k}$ are the normalized model errors.

If we assume that there is no correlation between the frequency components of a given frame [5], the pdf of the normalized vector $\bar{\mathbf{X}}_n$ is given by equation (20):

$$p(\bar{\mathbf{X}}_n) = p(\bar{X}_{n,1})\, p(\bar{X}_{n,2}) \cdots p(\bar{X}_{n,k}) = \frac{1}{(2\pi)^{k/2}}\, e^{-\frac{1}{2}\left( \bar{X}_{n,1}^2 + \bar{X}_{n,2}^2 + \ldots + \bar{X}_{n,k}^2 \right)} = \frac{1}{(2\pi)^{k/2}}\, e^{-\frac{1}{2}\sum_{k} \bar{X}_{n,k}^2} \qquad (20)$$

Using this pdf, we can evaluate the probability that the normalized model error vector lies within a k-dimensional sphere of radius $R$ as:

$$P\!\left( \left\| \bar{\mathbf{X}}_n \right\| \le R \right) = \int_{\left\| \bar{\mathbf{X}}_n \right\| \le R} p(\bar{\mathbf{X}}_n)\; d\bar{X}_{n,1} \cdots d\bar{X}_{n,k} = \int_{\left\| \bar{\mathbf{X}}_n \right\| \le R} \frac{1}{(2\pi)^{k/2}}\, e^{-\frac{1}{2}\sum_{k} \bar{X}_{n,k}^2}\; d\bar{X}_{n,1} \cdots d\bar{X}_{n,k} \qquad (21)$$

where $P(\|\bar{\mathbf{X}}_n\| \le R)$ is a monotonically increasing function of $R$. For a given probability $P_0 \in (0,1)$, we can find the $R$ that satisfies $P(\|\bar{\mathbf{X}}_n\| \le R) = P_0$ and use the criterion shown in equation (22) for distinguishing between smooth and impulsive signals. With $P_0 = 0.95$ we would then be classifying the smooth signals correctly with probability 0.95.

$$\left\| \bar{\mathbf{X}}_n \right\| \le R \qquad (22)$$

$P(\|\bar{\mathbf{X}}_n\| \le R)$ as a function of $R$ can be expressed in terms of the known Gamma function as shown below. First the normalized model error vector is mapped to polar coordinates:


$$\begin{aligned}
\bar{X}_{n,1} &= r \cos\varphi_1 \\
\bar{X}_{n,2} &= r \sin\varphi_1 \cos\varphi_2 \\
\bar{X}_{n,3} &= r \sin\varphi_1 \sin\varphi_2 \cos\varphi_3 \\
&\;\;\vdots \\
\bar{X}_{n,k-1} &= r \sin\varphi_1 \sin\varphi_2 \cdots \sin\varphi_{k-2} \cos\varphi_{k-1} \\
\bar{X}_{n,k} &= r \sin\varphi_1 \sin\varphi_2 \cdots \sin\varphi_{k-2} \sin\varphi_{k-1}
\end{aligned} \qquad (23)$$

where $r$ can be computed as shown in equation (24):

$$r^2 = \bar{X}_{n,1}^2 + \bar{X}_{n,2}^2 + \ldots + \bar{X}_{n,k}^2 \qquad (24)$$

and the limits of the polar coordinates are:

$$0 \le r \le R, \qquad 0 \le \varphi_1, \ldots, \varphi_{k-2} \le \pi, \qquad 0 \le \varphi_{k-1} \le 2\pi \qquad (25)$$

The mapping of the differential volume from Cartesian to polar coordinates follows from the above equations:

$$d\bar{X}_{n,1} \cdots d\bar{X}_{n,k} = r^{k-1}\, \sin^{k-2}\varphi_1\, \sin^{k-3}\varphi_2 \cdots \sin\varphi_{k-2}\; dr\, d\varphi_1 \cdots d\varphi_{k-1} \qquad (26)$$

Using the above polar transformations, equation (21) can now be mapped from Cartesian to polar coordinates as shown in equation (27):

$$P\!\left( \left\| \bar{\mathbf{X}}_n \right\| \le R \right) = \frac{1}{(2\pi)^{k/2}} \int_{r,\varphi_1,\ldots,\varphi_{k-1}} r^{k-1} e^{-\frac{r^2}{2}}\, \sin^{k-2}\varphi_1\, \sin^{k-3}\varphi_2 \cdots \sin\varphi_{k-2}\; dr\, d\varphi_1 \cdots d\varphi_{k-1} \qquad (27)$$

As the function inside the integral in equation (27) is separable in the polar variables, we can integrate over each polar variable separately. First we integrate over all the angular variables. To simplify the notation, let us define

$$I_n(\varphi) = \int_0^{\varphi} \sin^n\theta\; d\theta \qquad (28)$$


$$P\!\left( \left\| \bar{\mathbf{X}}_n \right\| \le R \right) = \frac{1}{(2\pi)^{k/2}}\, I_0(2\pi)\, I_1(\pi)\, I_2(\pi) \cdots I_{k-2}(\pi) \int_0^R r^{k-1} e^{-\frac{r^2}{2}}\, dr \qquad (29)$$

The first three of the integrals defined in (28) are computed as below:

$$I_0(2\pi) = \int_0^{2\pi} d\theta = 2\pi, \qquad I_1(\pi) = \int_0^{\pi} \sin\theta\; d\theta = 2, \qquad I_2(\pi) = \int_0^{\pi} \sin^2\theta\; d\theta = \frac{\pi}{2}$$

The rest of the integrals over the angular variables can be computed recursively from the previous integrals through the following recursion formula:

$$I_n = \int \sin^n\theta\; d\theta = \underbrace{-\frac{\cos\theta\, \sin^{n-1}\theta}{n}}_{\text{1st term}} + \underbrace{\frac{n-1}{n}\, I_{n-2}}_{\text{2nd term}} + const \qquad (30)$$

When integrating $\theta$ from $0$ to $\pi$, the 1st term in equation (30) is zero, so the recursion reduces to equation (31):

$$I_n(\pi) = \frac{n-1}{n}\, I_{n-2}(\pi) \qquad (31)$$

The integration over all the angular dimensions in (29), using (31), results in the constant term shown in equation (32):

$$C = I_0(2\pi)\, I_1(\pi) \cdots I_{k-2}(\pi) = \frac{2\,\pi^{k/2}}{\left(\frac{k}{2}-1\right)!} \qquad (32)$$

Using equation (32) in equation (29), we get:

$$P\!\left( \left\| \bar{\mathbf{X}}_n \right\| \le R \right) = \frac{C}{(2\pi)^{k/2}} \int_0^R r^{k-1} e^{-\frac{r^2}{2}}\, dr = \frac{1}{2^{\frac{k}{2}-1}\left(\frac{k}{2}-1\right)!} \int_0^R r^{k-1} e^{-\frac{r^2}{2}}\, dr \qquad (33)$$

We can write the integral in equation (33) in terms of the incomplete gamma function

$$\Gamma(m+1, x) = \int_x^{\infty} t^{m} e^{-t}\, dt$$

Using the following integration formula [15]

$$\int x^{2m+1} e^{-\frac{x^2}{2}}\, dx = -2^{m}\, \Gamma\!\left(m+1,\, \frac{x^2}{2}\right) + const$$

where $\Gamma(m, x)$ is the incomplete gamma function. For an integer $m$, the incomplete gamma function is given by:

$$\Gamma(m, x) = (m-1)!\; e^{-x} \sum_{n=0}^{m-1} \frac{x^n}{n!}$$

Equation (33) then reduces to equation (34):

$$P\!\left( \left\| \bar{\mathbf{X}}_n \right\| \le R \right) = \frac{1}{\left(\frac{k}{2}-1\right)!} \left[ \Gamma\!\left(\frac{k}{2},\, 0\right) - \Gamma\!\left(\frac{k}{2},\, \frac{R^2}{2}\right) \right] = 1 - \frac{\Gamma\!\left(\frac{k}{2},\, \frac{R^2}{2}\right)}{\left(\frac{k}{2}-1\right)!} \qquad (34)$$

where the left hand side of equation (34) is the probability that the norm of the k-dimensional normalized vector $\bar{\mathbf{X}}_n$ is less than the threshold $R$. Equation (34) thus expresses this probability in terms of the incomplete gamma function of the dimension $k$ and the threshold $R$. In our case $k$ is the number of frequency components; for example, if the window length is 15 ms and the sampling rate is 48 kHz, then $k$ will be 360. The plot of the probability $P$ versus the threshold $R$ is shown in figure 27 for $k = 360$.


Figure 27 Plot of Probability P vs Threshold R

The figure shows that to correctly identify impulsive signals with 80.23% accuracy, the threshold must be greater than 19.56. In other words, signals falling below the threshold will be classified as smooth signals.
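Since equation (34) is the regularized upper incomplete gamma function, it can be evaluated and inverted with standard library routines. The sketch below computes P(R), finds the threshold R for a chosen target probability and applies the criterion of equation (22); it is an illustration of the formulas above, not code from the thesis.

```python
import numpy as np
from scipy.special import gammaincc  # regularized upper incomplete gamma Q(a, x)
from scipy.optimize import brentq

def prob_within_radius(R, k):
    """P(||Xbar_n|| <= R) = 1 - Gamma(k/2, R^2/2) / (k/2 - 1)!  (eq. (34)).
    gammaincc(a, x) is already normalized by Gamma(a), so no factorial is needed."""
    return 1.0 - gammaincc(k / 2.0, R ** 2 / 2.0)

def radius_for_probability(p0, k):
    """Find R such that P(||Xbar_n|| <= R) = p0, e.g. p0 = 0.95."""
    return brentq(lambda R: prob_within_radius(R, k) - p0, 1e-6, 10.0 * np.sqrt(k))

def is_smooth_frame(model_error, sigma, R):
    """Smoothness criterion of eq. (22): a frame is 'smooth' (speech-like)
    if the norm of the normalized model error stays below R."""
    x_bar = np.asarray(model_error) / np.asarray(sigma)
    return np.linalg.norm(x_bar) <= R

if __name__ == "__main__":
    k = 360                               # frequency components (15 ms at 48 kHz)
    print(prob_within_radius(19.56, k))   # close to the 80% figure quoted for figure 27
    print(radius_for_probability(0.95, k))
```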

To illustrate equation (34) we consider the case $k = 2$. The inequality in equation (22) can then be written as equation (35):

$$\sqrt{\bar{X}_{n,1}^2 + \bar{X}_{n,2}^2} \le R \qquad (35)$$

This is the equation of a circle, which says that $\bar{\mathbf{X}}_n$ will lie inside or on the circle of radius $R$, as shown by the green region in figure 28.

• Utfallet från årets kvalitetskontroll för impregnerat trä • Resultat från provningar med alternativa material till.. det välkända