
Evaluation of Text-Independent and Closed-Set Speaker Identification Systems

BERK GEDIK

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Text-Independent and Closed-Set Speaker Identification Systems

BERK GEDIK

Master in Computer Science

Date: November 26, 2018

Supervisor: Iman Sayyaddelshad <salehs@kth.se>, David Buffoni <david.buffoni@gmail.com>

Examiner: Olov Engwall <engwall@kth.se>

Swedish title: Utvärdering av textoberoende talaridentifieringssystem

School of Electrical Engineering and Computer Science


Abstract

Speaker recognition is the task of recognizing the speaker of a given speech record, and it has wide application areas. In this thesis, various machine learning models, such as the Gaussian Mixture Model (GMM), the k-Nearest Neighbor (k-NN) model and Support Vector Machines (SVM), and feature extraction methods, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC), are investigated for the speaker recognition task. Combinations of these models and feature extraction methods are evaluated on many datasets varying in the number of speakers and the training data size. This way, the performance of the methods in different settings is analyzed. The results show that the GMM and k-NN methods provide good accuracy and that LPCC performs better than MFCC.

Also, the effect of audio recording duration, training data duration and the number of speakers on the prediction accuracy is analyzed.


Sammanfattning

Speaker recognition refers to techniques that aim to identify a speaker given a recording of their voice; these techniques have a broad range of applications. In this degree project, a number of machine learning models are applied to the task of recognizing speakers. The models are the Gaussian Mixture Model (GMM), k-Nearest Neighbour (k-NN) and the Support Vector Machine (SVM). Different techniques for extracting features for the modeling are tried, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC). The suitability of these techniques for speaker recognition is examined. Combinations of the above-mentioned models and techniques are evaluated on many different datasets that differ in the number of speakers and the amount of data. In this way, the different methods are evaluated and analyzed under different conditions. The results include the findings that both GMM and k-NN give high accuracy, while LPCC gives higher accuracy than MFCC. The effect of the recording length of the individual voices, the total length of the training data and the number of speakers on the different models is also analyzed and presented.


Acknowledgements

First of all, I would like to express my gratitude to Iman Sayyaddelshad and David Buffoni, the two supervisors of this thesis, for providing me with valuable knowledge and suggestions and for helping me throughout the writing of this thesis. Without their guidance, it would have been much more difficult to structure this thesis in a systematic way and to grasp enough technical knowledge to decide on proper methods.

Also, I would like to thank Ailsa Meechan-Maddon, my colleague and my friend, for all her proofreading and constructive comments. Our discussions definitely helped me improve the quality of the thesis in terms of language to a great extent and motivated me to do my best.

I would like to finish with a huge thanks to my mother for helping me to have such a great opportunity of working at this amazing university and also for supporting me during my entire life.


1 Introduction 1

1.1 Speaker Identification Problem . . . 1

1.2 Relevance of the Problem . . . 4

1.3 Thesis Objective and Contribution . . . 5

1.4 Delimitations . . . 7

1.5 Ethical and Societal Aspects . . . 7

2 Background 8

2.1 Data Augmentation . . . 8

2.2 Preprocessing . . . 9

2.2.1 Removal Of Unvoiced Audio Segments . . . 9

2.2.2 Framing . . . 9

2.3 Feature Extraction . . . 10

2.3.1 Mel-Frequency Cepstral Coefficients . . . 11

2.3.2 Linear Predictive Cepstral Coefficients . . . 14

2.3.3 Experiments on Feature Extraction Methods . . . 15

2.4 Machine Learning Methods For Classification . . . 17

2.4.1 Gaussian Mixture Models . . . 17

2.4.2 k-Nearest Neighbor Models . . . 20

2.4.3 Self-Organizing Maps . . . 23

2.4.4 Support Vector Machines . . . 23

2.4.5 Neural Networks . . . 25

3 Methods 27

3.1 Dataset . . . 27

3.1.1 Preprocessing . . . 27

3.1.2 Training Datasets . . . 29

3.1.3 Test Datasets . . . 32

3.1.4 Statistics on Speakers . . . 33



3.2 Framing . . . 33

3.3 Feature Extraction . . . 34

3.3.1 Configurations for Mel-Frequency Cepstral Coefficients Methods . . . 35

3.3.2 Configurations for Linear Predictive Cepstral Coefficients Methods . . . 36

3.3.3 Normalization . . . 36

3.4 Classification . . . 37

3.4.1 Gaussian Mixture Models . . . 37

3.4.2 k-Nearest Neighbor Model . . . 39

3.4.3 Support Vector Machines . . . 40

4 Results 41

4.1 Striding . . . 41

4.1.1 Closeness of Adjacent Vectors . . . 42

4.1.2 Effect of Striding on Accuracy . . . 43

4.2 Effect of Training Data Duration . . . 43

4.3 Effect of Number of Speakers . . . 47

4.4 Effect of Test Speech Duration . . . 50

4.5 Effect of Feature Extraction Methods . . . 53

4.5.1 Comparison of Feature Extraction Methods: LPCC versus MFCC . . . 53

4.5.2 Effect of Feature Vector Dimension . . . 54

4.6 Effect of Machine Learning Models . . . 56

4.6.1 KNN . . . 56

4.6.2 GMM . . . 56

4.6.3 General Comparison . . . 60

4.7 Analysis of Errors . . . 60

5 Discussion and Conclusions 65

5.1 Feature Extraction Methods . . . 65

5.2 Machine Learning Models . . . 66

5.3 Future Work . . . 67

Bibliography 69

A Tables for All Experimental Results 76


Introduction

1.1 Speaker Identification Problem

Nowadays, as the amount of available audio data grows rapidly, applications which have cognitive capabilities for automatically analyzing an audio signal become more and more attractive. Systems that are capable of performing speech recognition and speaker recognition tasks are widely studied. Among those tasks, the latter, namely the speaker recognition task, is of interest in this thesis. Speaker identification is the task of recognizing the speaker of a given speech record, based on extracted vocal features. Vocal features contain linguistic data, such as clues about the spoken words, low-level acoustic information on vocal tract characteristics, and high-level prosodic information. Vocal tract characteristics are extracted by dividing the speech record into many small frames (hence low-level) and applying frequency analysis to each frame [47].

High-level prosodic information is related to longer units of the speech record (hence high-level), and it includes speed-related and pause-related features [44].

Both acoustic and prosodic features contain useful information that can be used for speaker identification. Each person differs in vocal tract shape, larynx size and vocal organ peculiarities. In addition to that, each person has a different manner of speaking, accent, rhythm and intonation [25]. For these reasons, it is reasonable to consider that the speech record of each person has distinguishable acoustic and prosodic features.


Figure 1.1: Steps for Speaker Identification [19]

Hence, the first step of an automatic speaker identification system is to extract useful acoustic and/or prosodic features from a speech record.

In the thesis work, feature extraction methods that are designed to extract only acoustic features were experimented with. It should be noted that, in a speaker identification system where only such feature extraction methods are used, eliminating noisy and silent parts of the audio before the feature extraction step can increase the accuracy, as such parts do not contain useful acoustic information about the speaker. These steps, namely Preprocessing and Feature Extraction, are shown as the first steps of a speaker identification system in Figure 1.1.

Then, as shown in the same figure, the final step, namely the Classification step, involves the use of a classifier to predict the speaker of a given speech record, based on the features extracted in the previous step. For this purpose, one can use a rule-based approach, a generative model such as a Gaussian Mixture Model (GMM) or Vector Quantization (VQ), or a discriminative model such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN).

There are two variations of speaker identification: closed-set identification and open-set identification. In the former, it is known that during evaluation every speech record belongs to a speaker in a predefined set. In the latter, there is no such limitation: the system must be able to classify the speaker of a speech record as an unknown speaker in case the speaker is not in the predefined speaker set. Open-set identification is more realistic, and it is harder to achieve good accuracy in such a use case because the system must be able to handle the case of an unknown speaker.

Figure 1.2: Different Variations of Speaker Identification

The speaker identification problem has different variations depending also on the constraints put on the spoken text. First, in the variation called text-dependent identification, the spoken text is known beforehand and a speaker is asked to say a sentence given by the system. In this use case, it is easier to get good accuracy because the speaker identification system can take advantage of knowing the spoken text. In text-independent identification, there is no such constraint, so the system is expected to identify the speaker regardless of the spoken text. It is more flexible than text-dependent identification, and it is also more challenging to get good accuracy in such a use case.

Finally, vocabulary-dependent identification puts a constraint on the set of words that can be used, but apart from that, it does not limit what a speaker can say. All variations of the speaker identification task that were mentioned are shown in Figure 1.2.


1.2 Relevance of the Problem

Speaker identification is a very important problem as there are many possible application areas where such systems can be useful. In this section, some interesting use cases for such systems will be provided.

First of all, speaker identification techniques can provide security-critical systems with a new form of authentication. The Siri service developed by Apple is a good example of such a use case. Siri is a personalized assistant which can be activated by the "Hey Siri" command only if it is spoken by the owner of the device [57]. Biometric features such as voice are unique for an individual, so they are hard to imitate [61].

For this reason, in addition to classical authentication methods such as challenge-response authentication, where a predefined password or a piece of personal information is requested from a user as proof of identity, speech-based authentication, where a user is identified by a speaker identification system, can increase security and make the system harder for an intruder to trick. Such a use case has been deployed by Barclays Bank [4]. The system does not require any challenge question and uses only voice for authentication: it extracts voiceprints from the first few seconds of the conversation. It is a passive authentication system in the sense that it does not force a user to actively spend time only on authentication. Instead, the user can immediately start talking about his or her actual request, and meanwhile the system can identify the user from the audio signal.

Another practical case where speaker identification can be extremely useful is in the development of efficient forensics and surveillance tools. Nowadays, the amount of available data is growing rapidly, and voice data have become much more readily available. However, in order to build efficient forensics and surveillance tools, automated analysis must be performed on such data. Speaker identification systems can be used to analyze sound signals coming from speech records or telephone conversations, and the people speaking in such records can be detected. For instance, this can help in detecting people who were at a crime scene or in finding the relevant sound records among many records. Such a system would not only improve efficiency, but it would also improve accuracy. According to [35], such systems are used by many police institutions, such as the Guardia Civil de España, the Cyber Security Police in Malaysia, the Police Nationale of France and the Russian Federation Ministry of Justice.

Furthermore, a speaker identification system can be used together with speech recognition systems. A speaker identification system can detect the speaker, and this information can be used to fine-tune a speech recognizer or to pick a speech recognizer which has been trained for that particular speaker. This approach can increase the accuracy of a speech recognizer to a great extent. The paper [7] describes a similar approach, where an automatic speech recognition system is supported by a speaker identification system. In their experiment, speech records contain heterogeneous regions of multiple speakers. First, they split the audio into homogeneous regions where only one speaker speaks, which is called diarization [7]. For this process, and for omitting the non-voiced parts of the sound signal, they use a speaker identification system.

Speaker identification can be used in real-time communication applications, such as call centers and online conferences [15, 35]. For call centers, a speaker identification system can determine the speakers in a call and the time intervals during which each speaker speaks. For online conferences, a speaker identification system can detect the current speaker and show this information to all participants. This can be useful in a conference where many people do not know each other.

Finally, such systems make it possible to develop efficient search systems working on a database of speech records. Thanks to speaker identification systems, one can produce metadata that keeps information about the speakers in each record, and then develop a search system to find speech records by speaker. This would be a good example of a data mining system operating on speech data.

1.3 Thesis Objective and Contribution

The aim of the thesis is to investigate various machine learning models and feature extraction methods that can be used efficiently for the text-independent, closed-set speaker identification problem. Speaker identification has been widely studied and many efficient and accurate methods have been proposed. However, it is not always possible to compare machine learning models or feature extraction methods in a scientific way. This is due to the fact that in much research, machine learning models have been evaluated on different datasets or with different feature extraction methods. However, in order to compare two different components such as classifiers or feature extraction methods, all other factors must remain identical while using those two components. For this reason, this thesis is beneficial in the sense that it provides comparative experimental results for different combinations of classifiers and feature extraction methods applied to the same dataset. Using those experimental results, the thesis shows how the choice of machine learning model (classifier) and feature extraction method affects the accuracy of speaker identification. In order to achieve that objective, the possible combinations of classifiers and feature extraction methods were applied to a preprocessed dataset for training and testing. This leads to an unbiased evaluation and comparison among classifiers and feature extraction methods.

The thesis has focused on three machine learning models, namely Support Vector Machines (SVM), Gaussian Mixture Models (GMM) and k-Nearest Neighbor (KNN). Regarding the feature extraction methods, Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC) are used in the experiments.

For each classifier and feature extraction method, a detailed theoretical explanation is provided throughout the thesis report and the reasons for preferring such methods for speaker identification are discussed.

Also, the impact of the test audio duration, the training data amount and the number of speakers on accuracy is evaluated. For such evaluations, many training datasets which differ in the number of speakers and the available data amount were created. Each combination of classifier and feature extraction method was trained independently on every training dataset. Each such process produced one speaker identification system. Then each obtained system was tested on the same test dataset and the accuracy was calculated.


1.4 Delimitations

As stated in Section 1.3, this thesis focuses on the closed-set, text-independent speaker identification problem. Open-set and text-dependent variants of the problem are out of the scope of this thesis.

Furthermore, preprocessing was only used to remove silences, and noise removal was not applied as it was assumed that the dataset was noise free.

Also, data augmentation was not used on the dataset. It is a good way to make a machine learning-based classifier more robust [11, 53]; however, it is out of the scope of this thesis.

Finally, for the experiments, it was assumed that only one speaker was speaking in a particular speech record. In many applications, it is possible that multiple speakers speak in a single speech record. In this case, a speaker identification system must be able to perform segmentation and then identify the speaker for each segment. Such cases are not covered in this thesis.

1.5 Ethical and Societal Aspects

Regarding the societal aspect, considering all the use cases introduced in the previous sections, any work dedicated to improving speaker recognition systems would have a positive impact on society, as such tasks would be solved in a more efficient way.

Research on the speaker recognition task contributes to the economy as well, as it provides an opportunity to build applications that use speaker recognition and are useful in many areas, such as secure authentication. Also, it does not cause any loss of jobs: such tasks are not done by humans, as humans are not capable of recognizing a speaker among many candidates just by considering voice data.

For these reasons, this thesis work addresses the ethical concerns related to societal well-being. As stated, improvements in speaker recognition do not harm society and, in fact, contribute to it.


Background

A speaker identification system is composed of four components: the data augmentation component, the audio preprocessor, the feature extraction method and the classifier. All of them are crucial for the performance of the system. For this reason, to obtain an accurate system, reasonable decisions must be made regarding all of these components. The role of each component is explained in the following subsections.

2.1 Data Augmentation

Data augmentation is an approach for building a more robust speaker identification system. For any machine learning model, there is always the possibility that the model learns patterns which are specific to the training data and not shared among all possible data. This phenomenon is known as overfitting. In order to prevent this phenomenon and to force the model to learn the real patterns in the data, one must use large amounts of training data. One way to obtain such data is to use data augmentation. In this approach, new training samples are created from existing samples. For instance, by varying the volume of an existing audio recording or adding some background noise to it, a new training sample can be created and richer training data can be obtained.

If a system can be trained with augmented data, it can be more robust against noise, as there will be noisy samples in the training data and the model will learn to ignore such noise. In [14], a successful approach is described for adding background noise to speech signals. The authors reported that applying that approach decreased the error rate under degraded speech conditions.
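A minimal sketch of the two augmentations mentioned above (volume variation and additive background noise), assuming the audio is held as a floating-point numpy array; the function name, gain range and target SNR are illustrative choices, not values taken from this thesis:

```python
import numpy as np

def augment(signal, rng=None, noise_snr_db=20.0, gain_range=(0.7, 1.3)):
    """Create a new training sample by varying the volume and adding
    background noise at (roughly) the requested signal-to-noise ratio."""
    if rng is None:
        rng = np.random.default_rng()

    # Random volume change within the given gain range.
    augmented = signal * rng.uniform(*gain_range)

    # Additive white noise scaled relative to the signal power.
    signal_power = np.mean(augmented ** 2)
    noise_power = signal_power / (10.0 ** (noise_snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=augmented.shape)
    return augmented + noise

# Example: double the training set by augmenting every original sample.
# original_samples = [...]  # list of 1-D numpy arrays (16 kHz mono audio)
# augmented_samples = [augment(s) for s in original_samples]
```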


2.2 Preprocessing

In a speaker identification system, an audio signal must be processed first. The preprocessing consists of two important steps. In the first step, unvoiced audio segments are removed and, in the second step, the audio signal is divided into frames. The importance of each step is explained and some ways to perform each step are described briefly in the following subsections.

2.2.1 Removal Of Unvoiced Audio Segments

In a speaker identification system which considers the acoustic features in an audio signal to predict the speaker, silent parts or parts where there is no human voice should be removed, as those segments cannot contain any useful acoustic information about the speaker. Removing them makes the system more robust. Some efficient methods capable of performing this removal are discussed in [3, 37]. Those works are based on the idea of extracting waveform-domain parameters and frequency-domain parameters as features for intervals of an audio signal, and clustering the intervals by considering those features.

2.2.2 Framing

In a speaker identification system, a speech signal is usually divided into small, equally sized and possibly overlapping frames. Due to the non-stationary nature of speech signals, it is unlikely that useful information can be extracted by using the speech signal as a whole. Thus, segmenting the signal into multiple frames and analyzing each frame separately is the preferred way [13].

The preferred frame size depends on what features one wants to capture from a speech signal. The effects of different frame sizes on the features that can be captured have been widely discussed in the literature, and some examples will be provided in the following subsections.

Subsegmental Frame Size

In this case, frames with a size of around 3 ms (subsegmental frames) are used. This is a good approach to extract the characteristics of the excitation source, which contain speaker-related information [52].


Segmental Frame Size

In this case, frames with a size of 10-30 ms (segmental frames) are used to extract vocal-tract-related features (the vocal tract being the cavity in humans where the sound is produced), which contain useful information related to speech [47].

Supra Segmental Frame Size

In this case, frames with a size of 100-300 ms (supra-segmental frames) are used to extract features arising from behavioral traits of the speaker, such as word duration and intonation.
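A minimal sketch of such framing for a 16 kHz signal, assuming the signal is a 1-D numpy array; the 25 ms frame length and 10 ms step are illustrative segmental-frame values, not the configuration used in this thesis:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split a 1-D audio signal into equally sized, overlapping frames.
    Assumes the signal is at least one frame long."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step = int(sample_rate * step_ms / 1000)         # hop between frame starts
    n_frames = 1 + max(0, (len(signal) - frame_len) // step)
    frames = np.stack([signal[i * step : i * step + frame_len]
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)
```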

2.3 Feature Extraction

After preprocessing the audio signal, a feature extraction component should be used for the speaker identification task. Feature extraction is a process to obtain a compact representation of an audio frame. This is preferred over using the raw audio signal directly. The main reason for this preference is the high dimensionality of the raw audio data. To see why, assume a 50 ms frame length is being used on mono audio data with a 16 kHz sample rate. In such a setting, each frame would contain 0.05 * 1 * 16000 = 800 data points, so each frame would be represented by an 800-dimensional vector. This is not practical due to the high computational complexity and the overfitting problem (the infamous curse of dimensionality). Instead of using such a large vector, it is better to have a smaller vector which contains useful features for the speaker identification task.

Raw audio data contain some redundant information, such as background noise, which is not related to the speaker. It is desirable that an extracted feature vector varies only among different speakers and does not vary due to noise. This way, a feature vector can contain speaker-related information and leave redundant data out. A good feature extraction method is expected to have this property.

In this section, the theory of the feature extraction methods that are applied to speech signals segmented into segmental frames, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC), is explained. After that, some interesting experiments in which different feature extraction methods were tried are presented.

2.3.1 Mel-Frequency Cepstral Coefficients

The Mel-Frequency Cepstral Coefficients (MFCC) method was introduced by S. B. Davis and P. Mermelstein [12]. Since then, it has become one of the most popular feature extraction methods for speaker identification [22]. In this section, the mathematical steps to obtain MFCC features, which have also been used in the implementation for this thesis, will be explained.

Steps

In order to extract the MFCC features of a segmental frame, the following steps are applied:

• Apply the Discrete Fourier Transform (DFT) to the frame

Given that s(n) is the nth sample in the frame, N is the number of samples in the frame, and K is the predefined length of the DFT domain (1 ≤ x ≤ K), the xth value S(x) in the DFT domain is calculated in the following way:

    S(x) = \sum_{n=1}^{N} s(n) \, e^{-2\pi i x n / N}    (2.1)

In Figure 2.1, an example output of the DFT process for an arbitrary audio signal is shown. The first graph represents the audio signal and the second graph shows the output of the DFT process applied to that signal.

• Calculate power spectrum

Given that the power spectrum function is represented by P, the power spectrum is calculated in the following way:

    P(x) = \frac{|S(x)|^2}{N}    (2.2)


Figure 2.1: DFT output for an audio signal [17]

• Apply Mel-filter banks

In this step, F filter banks are computed, where F is a value that can be picked before the computations. The computation of the filter banks involves the Mel scale. This scale is preferred as it imitates the sound perception of the human ear: the human ear can discriminate better at lower frequencies and worse at higher frequencies [56]. The Mel scale imitates this fact by scaling a pair of frequencies with a fixed distance closer to each other at high frequencies and further from each other at low frequencies. Mathematically, for a frequency value f on the Hertz scale, the corresponding value m on the Mel scale is calculated as:

    m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)    (2.3)

The F filter banks are computed by the following steps:

– The frequency range to be used is determined. In the Fourier domain, frequencies outside of this range are ignored during the computations.

– F + 2 points are placed in the frequency range, where the distance between each pair of adjacent points is equal on the Mel scale, and the first and last points are at the lowest and highest frequencies of the range.

– The frequency value of each point is converted from the Mel scale into the Hertz scale by applying (2.3).

– For each point i, given its frequency value h(i) on the Hertz scale, the corresponding index f(i) in the DFT domain with N dimensions and sample rate S is computed by the following formula:

    f(i) = \left\lfloor \frac{(N + 1) \, h(i)}{S} \right\rfloor    (2.4)

– The kth filter bank F_k(x) (1 ≤ k ≤ F) is given below as:

    F_k(x) =
    \begin{cases}
    0, & x < f(k-1) \\
    \dfrac{x - f(k-1)}{f(k) - f(k-1)}, & f(k-1) \le x < f(k) \\
    \dfrac{f(k+1) - x}{f(k+1) - f(k)}, & f(k) \le x \le f(k+1) \\
    0, & x > f(k+1)
    \end{cases}

– For each filter bank F_k(x), the corresponding coefficient is computed by performing the dot product of the filter bank F_k and the power spectrum P(x); this coefficient is called A_k in the next step.

Figure 2.2: Example representation of filter banks [55]

In Figure 2.2, an example representation of the filter bank functions is shown. Each color represents a different filter bank function.

• Apply Discrete Cosine Transform (DCT)


In this step, the Discrete Cosine Transform (DCT) is applied to the coefficient vector computed in the previous step. Those coefficients are highly correlated; for this reason, the DCT is applied to the logarithm of the coefficient vector to decorrelate the coefficients. This operation computes the cepstral coefficients. From the coefficients A_i computed in the previous step, the kth cepstral coefficient C_k is computed by:

    C_k = \sqrt{\frac{2}{F}} \sum_{i=1}^{F} \log(A_i) \cos\left(\frac{k \pi (i - 0.5)}{F}\right)    (2.5)

where F is the number of filter banks.

After that computation, the cepstral coefficients with high indices are discarded, as they fail to have high discriminating capabilities [65], and the remaining cepstral coefficients constitute the MFCC vector.

• Apply Liftering

Then, a process called liftering is applied to give the higher coefficients less weight, as they provide less discrimination; this approach has been shown to provide a better representation of the audio signal [42]. To apply the liftering process, a liftering weight is first calculated for each feature in the MFCC vector. For a predefined liftering coefficient D, the ith liftering weight w_i is calculated as:

    w_i = 1 + \frac{D}{2} \sin\left(\frac{\pi i}{D}\right)    (2.6)

Then the ith feature MFCC_i in the MFCC vector is updated as:

    MFCC_i' = MFCC_i \cdot w_i    (2.7)
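A condensed sketch of the steps above using numpy and scipy; the function name, the number of filter banks, the number of kept coefficients and the liftering coefficient are illustrative choices, not the configuration used in this thesis (that is described in the Methods chapter):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # equation (2.3)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate=16000, n_filters=26, n_ceps=13, lifter=22):
    n = len(frame)
    # DFT and power spectrum (equations (2.1) and (2.2)); only the first
    # half of the spectrum is kept because the input is real-valued.
    power = (np.abs(np.fft.rfft(frame)) ** 2) / n
    n_bins = len(power)

    # Mel filter banks: n_filters + 2 points equally spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bin_idx = np.floor((n + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)  # (2.4)
    bin_idx = np.clip(bin_idx, 0, n_bins - 1)

    fbank = np.zeros((n_filters, n_bins))
    for k in range(1, n_filters + 1):
        left, center, right = bin_idx[k - 1], bin_idx[k], bin_idx[k + 1]
        for x in range(left, center):
            fbank[k - 1, x] = (x - left) / max(center - left, 1)
        for x in range(center, right + 1):
            fbank[k - 1, x] = (right - x) / max(right - center, 1)

    # Filter-bank coefficients, logarithm and DCT (equation (2.5)).
    energies = np.maximum(fbank @ power, 1e-10)
    ceps = dct(np.log(energies), type=2, norm='ortho')[:n_ceps]

    # Liftering (equations (2.6) and (2.7)).
    i = np.arange(n_ceps)
    weights = 1.0 + (lifter / 2.0) * np.sin(np.pi * i / lifter)
    return ceps * weights
```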

2.3.2 Linear Predictive Cepstral Coefficients

Linear Predictive Coding (LPC) is a method used to represent an audio signal in a compressed form, and Linear Predictive Cepstral Coefficients (LPCC) are based on the LPC method. For speaker identification, LPCC is known as a good method that can provide high accuracy, and for noise-free test cases it has been suggested that LPCC can perform better than MFCC [5].

For a predefined coefficient number K, the xth sample f_x of a frame can be expressed as:

    f_x = e_x + \sum_{k=1}^{K} c_k f_{x-k}    (2.8)

This is a linear approximation of the sample f_x, where e_x is the approximation error, and the aim is to compute values for the coefficients c_1, c_2, ..., c_K that minimize (e_x)^2. Those coefficients constitute the LPC feature vector for the frame.

Finally, an LPCC feature vector is obtained by computing the cepstral coefficients of the LPC vector.
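A minimal sketch of this computation; it relies on librosa.lpc for the LPC step and on the standard LPC-to-cepstrum recursion, which is an assumption about tooling rather than the implementation used in this thesis, and the function name and coefficient counts are illustrative:

```python
import numpy as np
import librosa

def lpcc(frame, order=12, n_ceps=12):
    """Compute LPCC features for one frame: LPC coefficients followed by
    the standard LPC-to-cepstrum recursion."""
    # librosa.lpc returns [1, a_1, ..., a_order] for the inverse filter A(z);
    # the prediction coefficients c_k of equation (2.8) are -a_k.
    a = librosa.lpc(frame.astype(float), order=order)
    lpc = -a[1:]

    ceps = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = lpc[n - 1] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                acc += (k / n) * ceps[k - 1] * lpc[n - k - 1]
        ceps[n - 1] = acc
    return ceps
```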

2.3.3 Experiments on Feature Extraction Methods

One of the first experiments which showed the usefulness of spectral analysis in the frequency domain for speaker identification was conducted by S. Pruzansky in 1963 [46]. In this experiment, vectors for each speech record were produced by applying a 17-channel filter bank to the signal, where the frequency domain was divided into equal intervals, and then producing an energy-frequency vector. The authors used a test dataset consisting of 10 speakers and achieved an accuracy of 89%.

In [16], a similar experiment was conducted. Interestingly, the authors claim that the power spectrum varies a lot during the articulation of a phoneme, since the human articulators are in constant motion; however, this is not the case for nasal consonants. Therefore, they proposed to apply spectral analysis only to speech segments where the phoneme /n/ was articulated. According to the same paper, they achieved 93% accuracy with a test dataset consisting of 30 speakers.

In [32], another feature representation method, based on cepstral analysis, was experimented with. In this experiment, a speech segment was represented by a 34-dimensional vector, where the first 32 dimensions were cepstral coefficients and the other 2 were pitch and length values. Then, the nearest neighbor method was used for classification. In an experiment on speaker verification where the authors used 4 authorized speakers and 30 impostors, error rates of 6% to 13% were obtained.

In [2], a feature extraction approach based on LPC was investigated. In this approach, a speech sample was approximated as a linear combination of a fixed number of past speech samples. Then, the coefficients used in the linear combination were used as a feature vector. It was shown that LPC together with the cepstrum function (LPCC) performed better than raw LPC coefficients, and 93% accuracy was achieved in tests with a test dataset consisting of 10 speakers.

Another feature extraction method, named Mel-Frequency Cepstral Coefficients (MFCC), has been widely investigated and used in many experiments such as [30, 43, 49, 33]. In all those experiments, it has been shown that MFCC is effective in representing speaker-related information and that it can provide good accuracy on the speaker identification task.

In another study [59], conducted by I. Trabelsi et al., MFCC, LPC and another feature extraction method called Perceptual Linear Prediction (PLP) were compared on a set of speech records consisting of 14 female speakers. Using a support vector machine (SVM) model with both a linear kernel and a kernel based on the radial basis function, the authors concluded that both MFCC and LPC performed better than PLP.

A study conducted by H. S. Jayana et al. [21] suggested that using multiple frame sizes and rates can be useful, especially when the training dataset is limited. They also proposed that using delta and delta-delta coefficients together with MFCC increases the accuracy. Briefly, delta coefficients keep information about the changes of the MFCC coefficients over time and, likewise, delta-delta coefficients contain information about the changes of the delta coefficients over time.

Finally, the effect of using high-level features such as pitch-related data, pause durations, pause frequencies and the average length of voiced segments has been investigated. In [44], where a combination of such features was used together with a k-nearest neighbor (kNN) classifier, some decent results were obtained on the 2001 NIST Speaker Recognition Evaluation Corpus (for more information about the dataset, the reader can check [1]). The authors reported a 25.2% error rate when using pause-related features, such as the frequency and duration of pauses in the speech, and a 14.8% error rate when using pitch-related features. It was claimed that such features would improve accuracy if used together with features extracted from segmental frames.

2.4 Machine Learning Methods For Classification

2.4.1 Gaussian Mixture Models

Gaussian Mixture Models (GMM) have been widely used in speaker identification. GMM is considered as the state-of-the-art approach for speaker identification and good results with high accuracy have been reported in many experiments such as [49, 48, 66, 39].

GMM is a statistical approach in which each speaker is considered a random source that produces the feature vectors representing the frames of a speech record [48]. Hence, each speaker is modeled by a mixture of Gaussian density functions. The mixture itself is called the probability density function (pdf) and, formally, it is a linear combination of M Gaussian probability density functions, which are called components. The parameter M is a predefined number. A careful decision is needed when picking a value for M, because using too few components might cause the model to underfit the data, whereas using too many components might cause overfitting.

More formally, P_A, the probability density function of the model for speaker A, is defined by:

    P_A(x \mid W_A, \mu_A, \Sigma_A) = \sum_{i=1}^{M} w_{A,i} \, G(x \mid \mu_{A,i}, \Sigma_{A,i})    (2.9)

where W_A = {w_{A,1}, w_{A,2}, ..., w_{A,M}} is the set of component weights, \mu_A = {\mu_{A,1}, \mu_{A,2}, ..., \mu_{A,M}} is the set of component mean vectors, and \Sigma_A = {\Sigma_{A,1}, \Sigma_{A,2}, ..., \Sigma_{A,M}} is the set of component covariance matrices. The Gaussian probability density function G(x \mid \mu_{A,i}, \Sigma_{A,i}) is:

    G(x \mid \mu_{A,i}, \Sigma_{A,i}) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_{A,i}|^{1/2}} \, e^{-\frac{1}{2} (x - \mu_{A,i})^T \Sigma_{A,i}^{-1} (x - \mu_{A,i})}    (2.10)

where d is the dimension of the feature vector.

In the GMM approach, the model for each speaker is trained independently by the Expectation Maximization (EM) algorithm [9]. In this approach, training a model M_A for speaker A means finding good parameter values for W_A, \mu_A and \Sigma_A, so that the likelihood of observing the feature vectors given in the training data of speaker A is high. It is assumed that unseen feature vectors produced by the same speaker will share similarities with the training data for this speaker. For this reason, a model trained with feature vectors of a particular speaker is expected to produce a high likelihood also for unseen feature vectors of the same speaker.

The parameters of a model are initialized randomly and improved by the Expectation Maximization (EM) algorithm. This is an iterative algorithm which tries to increase the likelihood of the training feature vectors in each iteration by updating the parameters W_A, \mu_A and \Sigma_A. It keeps updating the values of those parameters until the algorithm converges (no more improvement in likelihood can be obtained, or the improvement is smaller than a certain threshold) or until it reaches a certain number of iterations. More detail on the EM process is given later in this section.

In Figure 2.3, a GMM trained on a random dataset, created in a way that forms three clusters, is presented. The GMM is chosen to have 3 components to fit the training data well. Each contour line shows the negative log-likelihood of observing a data point on it; so, the likelihood of observing a data point on the line with value 3.33 is larger than that of the line with value 23.33.

Figure 2.3: A GMM model trained on data

Then, for a given feature vector during evaluation, each model outputs the probability of the vector being observed according to its probability density function. The speaker is predicted as the corresponding speaker of the model that outputs the highest probability.

Training - Expectation Maximization Algorithm

Here is the procedure for training a model M_A for speaker A, assuming M Gaussian components are used, the training data contain N feature vectors, and x_i is the ith feature vector in the training dataset:

1) Initialization: Initialize W_A, \mu_A and \Sigma_A randomly.

2) Iteration: Until convergence, do the following:

2.1) E-Step: For all j ∈ {1, 2, ..., M} and i ∈ {1, 2, ..., N}, calculate

    T_{j,i} = \frac{w_{A,j} \, G(x_i \mid \mu_{A,j}, \Sigma_{A,j})}{\sum_{y=1}^{M} w_{A,y} \, G(x_i \mid \mu_{A,y}, \Sigma_{A,y})}    (2.11)

2.2) M-Step: Update the parameters

    w_{A,j} = \frac{1}{N} \sum_{i=1}^{N} T_{j,i}    (2.12)

    \mu_{A,j} = \frac{\sum_{i=1}^{N} T_{j,i} \, x_i}{\sum_{i=1}^{N} T_{j,i}}    (2.13)

    \Sigma_{A,j} = \frac{\sum_{i=1}^{N} T_{j,i} (x_i - \mu_{A,j})(x_i - \mu_{A,j})^T}{\sum_{i=1}^{N} T_{j,i}}    (2.14)

It has been proven that in each iteration the likelihood of the training feature vectors increases and that the algorithm converges. Interested readers are encouraged to read [38] for a formal proof of convergence and of the increase in likelihood in each iteration.

As was seen above, the initialization is crucial for the EM algorithm. For this reason, it has been claimed that using a Universal Background Model (UBM) could improve the results. In the UBM approach, a Gaussian mixture model is trained using all speech data from all speakers, and this model determines the initialization of the speaker-specific models. This approach was reported to improve accuracy [66]. In the experiment described in that paper, on the 2000 NIST Speaker Recognition Evaluation dataset, the GMM-UBM system achieved a 31.2% relative error reduction compared to a GMM system with 128 components.

2.4.2 k-Nearest Neighbor Models

The k-Nearest Neighbor (kNN) method is a simple approach that has been used for many classification problems [28, 24, 64]. In the regular kNN algorithm, the feature vectors are stored in an efficient data structure. When an unseen feature vector V is to be evaluated, the k closest feature vectors to V among the already-stored vectors are determined. Then the prediction can be made according to the determined feature vectors and their corresponding labels. One approach might be to apply a simple majority vote; another might be to use a weighted majority vote, where the weights depend on the distance of a feature vector to the query vector.
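A minimal sketch of this scheme with scikit-learn's KNeighborsClassifier, classifying each frame of a test utterance and taking a majority vote over the per-frame predictions; the function names, k value and distance weighting are illustrative, not the configuration used in this thesis:

```python
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def fit_knn(X_train, y_train, k=5, weighted=True):
    """X_train: (n_frames_total, n_dims) feature vectors from all speakers,
    y_train: the speaker id of each training frame."""
    knn = KNeighborsClassifier(n_neighbors=k,
                               weights='distance' if weighted else 'uniform',
                               algorithm='ball_tree')
    return knn.fit(X_train, y_train)

def predict_utterance(knn, test_frames):
    """Classify each frame of the test utterance, then take a majority
    vote over the per-frame predictions."""
    frame_predictions = knn.predict(test_frames)
    return Counter(frame_predictions).most_common(1)[0][0]
```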


Figure 2.4: Effects of different k values on k-NN [10]

The decision on the value of k is crucial for the accuracy of the system. Too small a k value might cause the system to overfit to seen data, whereas too large a k value might involve unrelated data points in the decision-making process, which can cause accuracy loss. In Figure 2.4, the effect of different k values is shown. It can be observed that for small k values, many isolated regions appear, whereas for large k values, the decision boundary is smoother.

Increasing Computational Efficiency

The requirement of computing the distance between each stored feature vector and the query vector might make the naive k-NN approach infeasible, especially when a huge number of feature vectors is stored. In order to improve the speed, the feature vectors can be partitioned by ball-tree data partitioning. This method, and different ways to perform such partitioning, are discussed in [41]. Such partitioning methods are necessary to reduce the required number of distance calculations.

This partitioning method splits the input space into spheres defined by a center point C and a radius r. Each feature vector belongs to exactly one sphere. Thanks to this approach, it is possible to avoid performing many distance calculations. Before explaining the reason, it should be shown that for a query vector Q and a feature vector x, it is possible to know a lower limit for the distance between x and Q without explicitly calculating it. This is proven below:

Theorem 1. Assume Q is a query vector, and C_Y, r_Y are respectively the center point and radius of a sphere Y (the related schema is shown in Figure 2.5). X_Y is the set of feature vectors that belong to sphere Y, and D(a, b) is the function that gives the Euclidean distance between two arbitrary vectors a, b. In this case, ∀x ∈ X_Y: D(Q, C_Y) − r_Y ≤ D(Q, x).

Figure 2.5: Schema for the theorem

Proof. This can be proved by contradiction. Assume otherwise, so that ∃x ∈ X_Y such that D(Q, x) < D(Q, C_Y) − r_Y (1).

First of all, by the definition of the partitioning, D(x, C_Y) ≤ r_Y (2).

By (1) and (2), D(Q, x) + D(x, C_Y) < D(Q, C_Y) (3).

By the triangle inequality, D(Q, C_Y) ≤ D(Q, x) + D(x, C_Y) (4).

By (3) and (4), we get D(Q, C_Y) < D(Q, C_Y), which is a contradiction.

Thanks to the information about lower bounds on the distance between each vector and the query vector, many distance computations can be skipped, as some vectors may turn out to have a lower bound so high that they cannot be candidates for one of the k closest vectors to the query vector. This way, the speed of the system is improved.
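This lower-bound pruning is what ball-tree query implementations exploit internally. A minimal sketch using scikit-learn's BallTree; the data here are random placeholders:

```python
import numpy as np
from sklearn.neighbors import BallTree

X = np.random.rand(10000, 13)   # stored feature vectors
Q = np.random.rand(5, 13)       # query vectors, one row per query frame

tree = BallTree(X, leaf_size=40)          # partition the vectors into balls
distances, indices = tree.query(Q, k=5)   # k nearest stored vectors per query
```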


2.4.3 Self-Organizing Maps

Another interesting experiment, which used a self-organizing map (SOM) for speaker identification [33], was conducted by A. T. Mafra et al. In that experiment, a SOM was trained for each speaker. For each SOM, all speech segment vectors of the corresponding speaker were used. After training, a vector representing a test sample was classified by considering the quantization error at each SOM. In other words, the speaker whose SOM gave the minimum quantization error for a given test vector was chosen as the predicted speaker. In the experiment on the dataset which consisted of 14 speakers, the authors obtained 98.57% accuracy [33].

2.4.4 Support Vector Machines

For the speaker identification problem, one of the other machine learning models that has been experimented with is Support Vector Machines (SVM) [62, 8, 59].

SVM works by finding a hyperplane that separates the instances of two classes. Formally, assuming W represents the separating hyperplane, x_i is the vector of the ith training instance, t_i ∈ {−1, 1} is the class of that training instance, and ε > 0, there are two constraints. The first one is:

    t_i = 1 \implies W x_i \ge \varepsilon    (2.15)

and the second one is:

    t_i = -1 \implies W x_i \le -\varepsilon    (2.16)

These constraints guarantee that the hyperplane linearly separates the instances of the two classes, and they can be combined into the following single constraint:

    \forall t_i, x_i : \; W x_i t_i \ge \varepsilon    (2.17)

Among all possible hyperplanes, it is desirable to choose one such that the margin is maximized. The margin is the distance between the instances and the hyperplane. A wide margin is preferred to decrease the risk of bad classification: it is assumed that a separating hyperplane with a wide margin generalizes better. Considering a point P on the margin of the hyperplane, the distance between P and the hyperplane is given by:

    \frac{|P W|}{\|W\|} = \frac{\varepsilon}{\|W\|}    (2.18)

This shows that the width of the margin is inversely proportional to the length of W. SVM training is therefore an optimization problem that minimizes \|W\| (equivalently, maximizes the margin) under the constraints described above.

Sometimes, in a classification problem, there might be outliers which cause SVM to find a separating hyperplane with a narrow margin, or prevent SVM from finding any separating hyperplane at all. To prevent such cases, slack variables are used in SVM. These variables are introduced to obtain a wider margin at the expense of possible margin violations and misclassified instances; without slack variables, margin violations are not allowed. Together with the slack variables, a regularization coefficient C is introduced. This coefficient controls the trade-off between the margin size and the amount of misclassified data. The larger C is, the narrower the margin will be, and the smaller C is, the wider the margin will be, at the expense of more misclassified training data and more margin violations. Especially in cases where the training dataset includes many noisy instances, slack variables can be particularly useful, as they prevent the system from being forced to classify the noisy instances correctly and, as a consequence, losing the opportunity to compute a hyperplane with a good margin. Including the slack variables S_1, S_2, ..., S_N ≥ 0 and the regularization coefficient C, where N is the number of training instances, the problem becomes to compute:

    \arg\min_{W, S} \; \|W\| + C \times \sum_{i} S_i    (2.19)

under the following constraints:

    \forall t_i, x_i : \; W x_i t_i \ge \varepsilon - S_i    (2.20)

There are cases where a dataset is not linearly separable and using slack variables does not help. For such cases, it might be a good idea to project the data into a higher-dimensional space and try to find a separating hyperplane in that space. For this aim, a computational method called the kernel trick is used. The kernel trick provides a way to solve the optimization problem as if the data were projected into the high-dimensional space. This way, the system can have the advantage of the projection and, at the same time, it does not waste computational resources on actually projecting the data. The way to achieve this is to use a kernel function that takes two points from the original space as input and computes their dot product in the high-dimensional space.

SVMs with different kernel functions, such as the polynomial kernel and the Fisher kernel, have been tried and their performance was compared with GMM-based systems [62]. According to the results of those experiments, SVM-based systems can obtain performance similar to that of GMM-based systems.

Regular SVM is used for binary classification. To perform multinomial classification, which is generally necessary for speaker identification, various methods have been proposed in [18, 45]. The method used in this thesis implementation will be described in the Methods section.
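A minimal sketch using scikit-learn's SVC, which extends the binary formulation to multiple classes internally (via a one-vs-one scheme); the C value, the RBF kernel and the random placeholder data are illustrative, not the configuration used in this thesis:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 13))     # frame-level feature vectors
y_train = rng.integers(0, 4, size=400)   # speaker id of each training frame
X_test = rng.normal(size=(20, 13))

# C controls the trade-off between margin width and slack (misclassified or
# margin-violating frames); the RBF kernel applies the kernel trick
# implicitly, so no explicit projection is computed.
svm = SVC(C=10.0, kernel='rbf', gamma='scale').fit(X_train, y_train)
predicted = svm.predict(X_test)
```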

2.4.5 Neural Networks

Neural networks have also been studied for speaker identification, and promising results have been reported. For instance, Vector Quantization methods have been widely studied in [23, 31, 6] for the speaker identification task. Bennani et al. obtained 97% identification accuracy [6] on a dataset of 10 speakers by using a Vector Quantization (VQ) approach. The authors claimed that their system worked well with short speech records and that the training process was fast.

J. Oglesby et al. used the Radial Basis Function (RBF) model [40] for the speaker verification task. On a dataset consisting of 40 speakers, they achieved the best performance (an 8% true speaker rejection rate and a 1% impostor acceptance rate) by using one hidden layer with 384 hidden nodes. They also claimed that the RBF model outperformed both a standard Multilayer Perceptron (MLP) and a Vector Quantization (VQ) based system [40]. Apart from that, deep neural networks have also been tried and reported to give good results [50].

Another interesting approach was proposed by L. Rudasi et al. [51]. Their method relies on building many small neural networks for binary classification. In other words, for N speakers, N(N−1)/2 small neural networks (one for each speaker pair in the dataset) are formed, and each network is responsible for discriminating between those two speakers. They reported that they obtained 100% accuracy on their dataset [51]. However, the approach is not feasible for datasets with a large number of speakers, as the necessary number of neural networks grows proportionally to the square of the number of speakers.

Another neural network method that is worth mentioning here was proposed in [29]. This experiment used a deep neural network to extract features from audio data and to represent the signal in a compact way. First, the system extracted MFCC and Gabor features and, by feeding them into a neural network similar to an autoencoder, reduced the dimensionality and produced a feature vector which was more robust against noise. The authors concluded that this feature vector was better for classifying the audio than MFCC and Gabor features alone.


Methods

In this section, the creation of the datasets that have been used during the experiments will be explained first. Afterwards, all parts included in the experiment pipeline, such as framing, feature extraction and classification, will be explained in detail.

3.1 Dataset

In this thesis, speech records provided by Inovia AB are used as the main source for creating the training and test datasets for the experiments. The main audio source consists of speech records of 321 male and 259 female speakers. All speakers had been anonymized, thus only age and gender information was available. For each speaker, 18 to 525 speeches had been recorded. The speeches were recorded in 16 kHz, mono, 16-bit format.

3.1.1 Preprocessing

As mentioned before, the task of removing noisy and silent frames is crucial for building a robust speaker identification system. This task is called voice activity detection (VAD). VAD refers to the problem of segmenting an audio signal and labeling each segment as either voiced, where a human voice is observed, or non-voiced. After such labeling, each voiced segment can be extracted as a separate audio signal, which is called a chunk.


Figure 3.1: Voice Activity Detection [54]. Note that out of this signal, 3 voiced chunks can be extracted as the output of the VAD process.

An example of the VAD process is shown in Figure 3.1. For the audio signal shown in the figure, 3 voiced segments have been detected, which would result in the extraction of 3 chunks.

In this thesis implementation, the WebRTC VAD library [60] was used for the VAD process. This library has a configuration property called aggressiveness, which can be set to an integer between 0 and 3. The value 0 makes the detector the least aggressive in filtering out non-speech, and 3 makes it the most aggressive. In other words, it defines the trade-off between precision and recall for voiced segments. In this thesis implementation, the value 3 was used for this property because precision was more important than recall.
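A minimal sketch of how the Python webrtcvad binding can be driven window by window; the function name and the 30 ms window are illustrative, and the grouping of consecutive voiced windows into chunks is omitted:

```python
import webrtcvad

def voiced_flags(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=3):
    """Label each 30 ms window of 16-bit mono PCM audio as voiced or not,
    using the WebRTC VAD at the most aggressive filtering setting."""
    vad = webrtcvad.Vad(aggressiveness)
    bytes_per_window = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    flags = []
    for start in range(0, len(pcm_bytes) - bytes_per_window + 1, bytes_per_window):
        window = pcm_bytes[start:start + bytes_per_window]
        flags.append(vad.is_speech(window, sample_rate))
    return flags
```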

After chunks were extracted for each speech in the main audio source, chunks whose durations were shorter than 3 seconds were removed. This was due to the observation that among such chunks, the precision was relatively worse: many audio segments in this group consisted only of breathing or murmuring sounds, so many of them were false positives.


Each dataset that was used in training or testing can be seen as a mixture of chunks. For each dataset, it is desirable to have sound signals coming from various speeches; this is the preferred approach to have a representative training dataset and a realistic test dataset. Considering that a chunk consists of sound signals coming from only one speech, having a long chunk in a dataset would cause the dataset to be dominated by a particular speech. To prevent this undesired effect, chunks whose durations were longer than 20 seconds were removed as well.

3.1.2 Training Datasets

After the chunk extraction process, various training datasets were derived using the chunks. The derived training datasets differ from each other in the number of speakers and the training data duration. For the number of speakers, the values 40, 80, 120 and 160 were used, in order to observe the effect of different numbers of speakers on the prediction accuracy. For each training dataset, an equal number of male and female speakers was used; for instance, a dataset with 120 speakers consists of 60 male and 60 female speakers. For the training data duration, 10, 30, 60, 90 and 120 seconds were used; this value is the sum of the durations of the chunks used in training for each speaker. It should be noted that chunks are created after the silence removal process. This means that the chunks contain no silence and are completely filled with the voice of a speaker. Before explaining the derivation of the training datasets in more detail, some definitions are stated below to make the explanation more systematic and easier to follow.

Definition 3.1. Train_{s,t} represents the training dataset with s speakers which contains t seconds of chunks for each speaker.

Definition 3.2. Speakers(Train_{s,t}) represents the set of speakers of the training dataset Train_{s,t}.

Definition 3.3. Chunks(Train_{s,t}, X) represents the set of chunks for speaker X in the training dataset Train_{s,t}.

Below, the properties that each training dataset follows are given. In these definitions, t_1, t_2 stand for any possible durations of training chunks and s_1, s_2 stand for any possible numbers of speakers.


Property 1. Speaker coverage: For two datasets where the latter has at least as many speakers as the former, the set of speakers of the latter covers the set of speakers of the former. Formally:

    s_2 \ge s_1 \implies Speakers(Train_{s_1,t_1}) \subseteq Speakers(Train_{s_2,t_2})

Property 2. Chunk set coverage: For two datasets that both contain speaker X, if the latter uses a training data duration per speaker that is greater than or equal to that of the former, then the latter includes all chunks for speaker X that the former includes. Formally:

    t_2 \ge t_1 \wedge X \in Speakers(Train_{s_1,t_1}) \wedge X \in Speakers(Train_{s_2,t_2}) \implies Chunks(Train_{s_1,t_1}, X) \subseteq Chunks(Train_{s_2,t_2}, X)

The speaker coverage property implies that two datasets that have the same number of speakers must have the same set of speakers. Also, the two properties together imply that if two datasets have the same number of speakers and the latter dataset uses a larger data duration than the former, the latter dataset includes all chunks that the former contains.

The reason for applying such rules in the dataset derivation process is to minimize the effect of randomness on the evaluation. The necessity of the speaker coverage property can be explained by the following example: assume that the speaker coverage property is not used and that, for the training dataset Train_{40,10}, 40 speakers that are very hard to distinguish were picked. Then, for the training dataset Train_{80,10}, it is possible to randomly pick 80 other speakers which are much easier to distinguish from each other. In this case, given that all other factors are the same, it is very likely that Train_{80,10} gives better accuracy than Train_{40,10}, which is unfair and misleading. The chunk set coverage property is also very important for a fair evaluation. To see the importance of this property, consider two datasets, Train_{40,10} and Train_{40,30}. Without the chunk set coverage property, it is possible to pick very representative chunks for each speaker for the first dataset and to pick chunks that fail to adequately represent some speakers for the second dataset. In this case, Train_{40,10} can provide better accuracy than Train_{40,30}, which is again unfair and misleading.
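One way to satisfy both coverage properties by construction is to fix a single ordering of the speakers and of each speaker's chunks, and let every dataset take prefixes of those orderings. A minimal sketch, assuming chunk durations are known in advance; the function name is illustrative, and the gender balancing and the preference for chunks from different speeches described above are omitted:

```python
import random

def build_training_datasets(chunks_by_speaker, speaker_counts=(40, 80, 120, 160),
                            durations=(10, 30, 60, 90, 120), seed=0):
    """chunks_by_speaker maps a speaker id to a list of (chunk_id, duration)
    pairs. Speakers and chunks are ordered once, and every dataset takes a
    prefix of those orderings, so the speaker-coverage and chunk-set-coverage
    properties hold by construction."""
    rng = random.Random(seed)
    speakers = sorted(chunks_by_speaker)
    rng.shuffle(speakers)                       # one fixed ordering of speakers

    datasets = {}
    for s in speaker_counts:
        for t in durations:
            selection = {}
            for speaker in speakers[:s]:        # prefix => speaker coverage
                chosen, total = [], 0.0
                for chunk_id, dur in chunks_by_speaker[speaker]:
                    if total >= t:
                        break
                    chosen.append(chunk_id)     # prefix => chunk set coverage
                    total += dur
                selection[speaker] = chosen
            datasets[(s, t)] = selection
    return datasets
```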


Figure 3.2: 20 training datasets and the relations among them. In each box, the first number shows the duration of the training data for each speaker in seconds and the second number shows the number of speakers. The arrows show the coverage relations among the datasets. Orange arrows show the chunk set coverage relation between two datasets and blue arrows show the speaker set coverage relation.


To reduce the effect of intersession variability, training chunks are preferably taken from different speeches. The term intersession variability refers to the differences between two speeches of the same speaker. Such differences can occur due to changes in the recording environment, transmission circumstances, background noise and variations in the speaker's voice [58]. During tests, it is not possible to know the background noise or recording environment beforehand. For that reason, it is advisable to train the system on as many different conditions as possible in order to obtain a robust speaker identification system.

Figure 3.2 shows all training datasets that were used during the experiments. In each box, the first number shows the training data duration for each speaker in seconds and the second number shows the number of speakers. The arrows show the coverage relations among the datasets: orange arrows show the chunk set coverage relation between two datasets and blue arrows show the speaker set coverage relation. It should be noted that the relations shown by the arrows are transitive.

For instance, it can be concluded that 60/40 has chunk set coverage over 30/80, because 60/80 has chunk set coverage over 30/80 and 60/40 has chunk set coverage over 60/80.

3.1.3 Test Datasets

Regarding the evaluation, 4 test datasets were created. They have 40, 80, 120 and 160 speakers respectively, and they are denoted by $Test_{40}$, $Test_{80}$, $Test_{120}$ and $Test_{160}$. As expected, the speaker set of each test dataset matches the speaker set of the corresponding training dataset. Formally, for any number of speakers $X$ and training data duration $t$,

$Speakers(Test_X) = Speakers(Train_{X,t})$   (3.1)

This implies that the speaker coverage property is preserved for the test datasets as well.

For each speaker, the number of test cases varies between 25 and 48, and each test case has a duration of 5 seconds. Test cases that are too long would make the experiments computationally expensive, whereas test cases that are too short would prevent observing the full potential of the systems, as the accuracy would be lower. For this reason, picking 5 seconds as the test case duration seemed to be a good compromise.


Each test case was created by merging adjacent chunks produced from the same speech, such that the sum of the chunk durations is at least 5 seconds. Then, from the merged audio signal, the first 5 seconds were taken as the test case record.
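As an illustration of this construction, the following sketch merges adjacent chunks of one speech and keeps only the first 5 seconds. Names such as chunks_of_speech and SAMPLE_RATE are assumptions, not the thesis code; each chunk is assumed to be a NumPy array of 16 kHz samples.

```python
import numpy as np

SAMPLE_RATE = 16000
TEST_CASE_SECONDS = 5

def build_test_case(chunks_of_speech):
    """Merge adjacent chunks from one speech until at least 5 seconds are
    accumulated, then keep only the first 5 seconds as the test case."""
    merged, total_samples = [], 0
    for chunk in chunks_of_speech:
        merged.append(chunk)
        total_samples += len(chunk)
        if total_samples >= TEST_CASE_SECONDS * SAMPLE_RATE:
            break
    if total_samples < TEST_CASE_SECONDS * SAMPLE_RATE:
        return None  # this speech is too short to yield a test case
    signal = np.concatenate(merged)
    return signal[: TEST_CASE_SECONDS * SAMPLE_RATE]
```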

During the evaluations, the speaker of each test case is predicted 5 times. In each attempt, the first 1, 2, 3, 4 and 5 seconds of the record are considered, respectively. This way, the effect of the duration of the speech record on the accuracy is evaluated.
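A minimal sketch of this prefix-based evaluation, assuming a trained model exposed through a hypothetical predict_speaker function, could look as follows.

```python
def evaluate_test_case(test_signal, true_speaker, predict_speaker,
                       sample_rate=16000):
    """Predict the speaker from the first 1..5 seconds of a 5-second test case
    and record whether each prefix-based prediction is correct."""
    results = {}
    for seconds in range(1, 6):
        prefix = test_signal[: seconds * sample_rate]
        results[seconds] = (predict_speaker(prefix) == true_speaker)
    return results
```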

It is important to note that the test cases for a particular speaker are the same in any two test datasets that contain that speaker. This is also due to the aim of minimizing the effect of randomness on the evaluation. Assuming that $Cases(Test_X)$ gives the set of test cases of $Test_X$, it holds that $s_2 > s_1 \implies Cases(Test_{s_1}) \subset Cases(Test_{s_2})$. Also, each test case was extracted from a different speech, in order to capture intersession variability and hence obtain more realistic tests.

3.1.4 Statistics on Speakers

In Figure 3.3, Figure 3.4, Figure 3.5 and Figure 3.6, the age and gender statistics for each speaker set are shown. In each pie chart label, the numbers show the age interval for the corresponding group, and f stands for female and m stands for male. The number shown in a slice of the pie chart is the number of speakers that belong to the corresponding speaker group.

3.2 Framing

As discussed in Section 2.2, framing is a crucial process before the feature extraction step. In the thesis implementation, a frame size of 25 milliseconds was used and the step length between successive frames was set to 10 milliseconds.
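A minimal framing sketch under these parameters (25 ms frames, 10 ms step, 16 kHz audio) is shown below; the function name and data layout are assumptions rather than the thesis code.

```python
import numpy as np

def frame_chunk(chunk, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split one silence-free chunk into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples per frame
    step_len = int(sample_rate * step_ms / 1000)     # 160 samples between frame starts
    frames = []
    for start in range(0, len(chunk) - frame_len + 1, step_len):
        frames.append(chunk[start:start + frame_len])
    return np.array(frames)
```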

During the framing process, each chunk was segmented into frames. After this process, a training dataset can be seen as a dictionary in which the keys are the speakers and the value for each key is the set of frames produced from the chunks belonging to that particular speaker.


[Figures 3.3 to 3.6 are pie charts of the age and gender distribution of each speaker set, with the eight groups 18-29 f, 30-39 f, 40-49 f, +50 f, 18-29 m, 30-39 m, 40-49 m and +50 m.]

Figure 3.3: Statistics for T40 (5 speakers in each group)

Figure 3.4: Statistics for T80 (10 speakers in each group)

Figure 3.5: Statistics for T120 (15 speakers in each group)

Figure 3.6: Statistics for T160 (20 speakers in each group)

Likewise, a test dataset can be seen as a dictionary whose keys are the test case identifiers and whose value for each key is the set of frames produced from the chunks belonging to that test case.

3.3 Feature Extraction

After the framing process is complete, a feature extraction method is applied to each frame. After the feature extraction process, each frame in the training and test datasets is replaced by the corresponding feature vector computed by the applied feature extraction method.


It should be noted that, after all feature vectors had been extracted, a striding approach was applied in which only every other feature vector is kept and the rest are removed. The reason is that adjacent feature vectors were observed to be much more similar to each other than arbitrary pairs of vectors. It was therefore assumed that, for each pair of adjacent feature vectors, discarding one would not harm the accuracy, since the remaining vectors would still be representative enough, while providing a large gain in computation time. Indeed, this approach proved to be very efficient, as it caused a negligible loss in accuracy and a large boost in performance. The experiments related to this approach are explained in detail in the Striding section of the thesis.
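In code, the striding step described above amounts to keeping every other per-frame feature vector; a minimal sketch, with assumed names, is:

```python
def stride_features(feature_vectors, step=2):
    """Keep every `step`-th per-frame feature vector (list or NumPy array)."""
    return feature_vectors[::step]
```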

For this thesis, two feature extraction methods were used, namely the Mel-Frequency Cepstral Coefficients (MFCC) method and the Linear Predictive Cepstral Coefficients (LPCC) method. Each of them was used with 4 different configurations, which gives 8 different feature extraction configurations in total. In the following sections, the configurations used for the MFCC and LPCC methods during the thesis experiments are explained.

3.3.1 Configurations for Mel-Frequency Cepstral Coefficients Methods

First of all, for each configuration of the MFCC method, the length of the DFT domain K is predefined as 512 (the default value suggested by the library used for MFCC feature extraction in the implementation). Moreover, as a frame size of 25 milliseconds was used and the sample rate was 16 kHz, the number of samples in each frame N was the same for every configuration and was computed as:

N = 16000 × 0.025 = 400 (3.2)
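For illustration, an MFCC configuration with these parameters could be expressed with the python_speech_features library as sketched below; the library choice, the input file name and the default number of cepstral coefficients are assumptions and not necessarily those used in the thesis.

```python
import scipy.io.wavfile as wav
from python_speech_features import mfcc

# Hypothetical chunk file; 16 kHz mono audio is assumed.
sample_rate, signal = wav.read("chunk.wav")

features = mfcc(signal,
                samplerate=sample_rate,  # 16000 Hz
                winlen=0.025,            # 25 ms frames -> N = 400 samples
                winstep=0.01,            # 10 ms step between frames
                nfft=512)                # length of the DFT domain K

# `features` has one row per frame; each row is one MFCC feature vector.
```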
