
Evaluation of Text-Independent and Closed-Set Speaker Identification Systems

BERK GEDIK

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Text-Independent and Closed-Set Speaker Identification Systems

BERK GEDIK

Master in Computer Science

Date: November 26, 2018

Supervisor: Iman Sayyaddelshad <salehs@kth.se>, David Buffoni <david.buffoni@gmail.com>

Examiner: Olov Engwall <engwall@kth.se>

Swedish title: Utvärdering av textoberoende talaridentifieringssystem

School of Electrical Engineering and Computer Science


Abstract

Speaker recognition is the task of recognizing the speaker of a given speech record, and it has wide application areas. In this thesis, various machine learning models, such as the Gaussian Mixture Model (GMM), the k-Nearest Neighbor (k-NN) model and Support Vector Machines (SVM), and feature extraction methods, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC), are investigated for the speaker recognition task. Combinations of these models and feature extraction methods are evaluated on many datasets varying in the number of speakers and the training data size. This way, the performance of the methods in different settings is analyzed. The results show that the GMM and k-NN methods provide good accuracy and that LPCC performs better than MFCC.

Also, the effect of audio recording duration, training data duration and the number of speakers on the prediction accuracy is analyzed.


Sammanfattning

Speaker recognition refers to techniques that aim to identify a speaker given a recording of their voice; these techniques have a broad range of applications. In this degree project, a number of machine learning models are applied to the task of recognizing speakers. The models are the Gaussian Mixture Model (GMM), k-Nearest Neighbour (k-NN) and the Support Vector Machine (SVM). Different techniques for extracting features for the modeling are tried, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC). The suitability of these techniques for speaker recognition is examined. Combinations of the above-mentioned models and techniques are evaluated on many different datasets that differ in the number of speakers and the amount of data. In this way, the different methods are evaluated and analyzed under different conditions. The results include the findings that both GMM and k-NN give high accuracy, while LPCC gives higher accuracy than MFCC. The effect of the recording length of the individual voices, the total length of the training data and the number of speakers on the different models is also analyzed and presented.


Acknowledgements

First of all, I would like to express my gratitude to Iman Sayyaddelshad and David Buffoni, the two supervisors of this thesis, for providing me with valuable knowledge and suggestions and for helping me throughout the writing of this thesis. Without their guidance, it would have been much more difficult to structure this thesis in a systematic way and to grasp enough technical knowledge to decide on proper methods.

Also, I would like to thank Ailsa Meechan-Maddon, my colleague and my friend, for all her proofreading and constructive comments. Our discussions definitely helped me improve the quality of the thesis in terms of language to a great extent and motivated me to do my best.

I would like to finish with a huge thanks to my mother for helping me to have such a great opportunity of working at this amazing university and also for supporting me during my entire life.


1 Introduction 1

1.1 Speaker Identification Problem . . . 1

1.2 Relevance of the Problem . . . 4

1.3 Thesis Objective and Contribution . . . 5

1.4 Delimitations . . . 7

1.5 Ethical and Societal Aspects . . . 7

2 Background 8

2.1 Data Augmentation . . . 8

2.2 Preprocessing . . . 9

2.2.1 Removal Of Unvoiced Audio Segments . . . 9

2.2.2 Framing . . . 9

2.3 Feature Extraction . . . 10

2.3.1 Mel-Frequency Cepstral Coefficients . . . 11

2.3.2 Linear Predictive Cepstral Coefficients . . . 14

2.3.3 Experiments on Feature Extraction Methods . . . 15

2.4 Machine Learning Methods For Classification . . . 17

2.4.1 Gaussian Mixture Models . . . 17

2.4.2 k-Nearest Neighbor Models . . . 20

2.4.3 Self-Organizing Maps . . . 23

2.4.4 Support Vector Machines . . . 23

2.4.5 Neural Networks . . . 25

3 Methods 27

3.1 Dataset . . . 27

3.1.1 Preprocessing . . . 27

3.1.2 Training Datasets . . . 29

3.1.3 Test Datasets . . . 32

3.1.4 Statistics on Speakers . . . 33



3.2 Framing . . . 33

3.3 Feature Extraction . . . 34

3.3.1 Configurations for Mel-Frequency Cepstral Coefficients Methods . . . 35

3.3.2 Configurations for Linear Predictive Cepstral Coefficients Methods . . . 36

3.3.3 Normalization . . . 36

3.4 Classification . . . 37

3.4.1 Gaussian Mixture Models . . . 37

3.4.2 k-Nearest Neighbor Model . . . 39

3.4.3 Support Vector Machines . . . 40

4 Results 41

4.1 Striding . . . 41

4.1.1 Closeness of Adjacent Vectors . . . 42

4.1.2 Effect of Striding on Accuracy . . . 43

4.2 Effect of Training Data Duration . . . 43

4.3 Effect of Number of Speakers . . . 47

4.4 Effect of Test Speech Duration . . . 50

4.5 Effect of Feature Extraction Methods . . . 53

4.5.1 Comparison of Feature Extraction Methods: LPCC versus MFCC . . . 53

4.5.2 Effect of Feature Vector Dimension . . . 54

4.6 Effect of Machine Learning Models . . . 56

4.6.1 KNN . . . 56

4.6.2 GMM . . . 56

4.6.3 General Comparison . . . 60

4.7 Analysis of Errors . . . 60

5 Discussion and Conclusions 65

5.1 Feature Extraction Methods . . . 65

5.2 Machine Learning Models . . . 66

5.3 Future Work . . . 67

Bibliography 69

A Tables for All Experimental Results 76


Introduction

1.1 Speaker Identification Problem

Nowadays, as the amount of available audio data grows rapidly, applications which have cognitive capabilities for automatically analyzing an audio signal become more and more attractive. Systems that are capable of performing speech recognition and speaker recognition tasks are widely studied. Among those tasks, the latter, namely the speaker recognition task, is of interest in this thesis. Speaker identification is the task of recognizing the speaker of a given speech record, based on extracted vocal features. Vocal features contain linguistic data, such as clues about the spoken words, low-level acoustic information on vocal tract characteristics, and high-level prosodic information. Vocal tract characteristics are extracted by dividing the speech record into many small frames (hence low-level) and applying frequency analysis to each frame [47].

High-level prosodic information is related to longer units of the speech record (hence high-level), and it includes speed-related and pause-related features [44].

Both acoustic and prosodic features contain useful information that can be used for speaker identification. Each person differs in vocal tract shape, larynx size and vocal organ peculiarities. In addition to that, each person has a different manner of speaking, accent, rhythm and intonation [25]. For these reasons, it is reasonable to consider that the speech record of each person has distinguishable acoustic and prosodic features.


Figure 1.1: Steps for Speaker Identification [19]

Hence, the first step of an automatic speaker identification system is to extract useful acoustic and/or prosodic features from a speech record.

In the thesis work, feature extraction methods that are designed to extract only acoustic features were experimented with. It should be noted that, in a speaker identification system where only such feature extraction methods are used, eliminating noisy and silent parts of the audio before the feature extraction step can increase the accuracy, as such parts do not contain useful acoustic information about the speaker. These steps, namely Preprocessing and Feature Extraction, are shown as the first steps of a speaker identification system in Figure 1.1.

Then, as shown in the same figure, the final step, namely the Classification step, involves the use of a classifier to predict the speaker of a given speech record, based on the features extracted in the previous step. For this purpose, one can use a rule-based approach, a generative model such as a Gaussian Mixture Model (GMM) or Vector Quantization (VQ), or a discriminative model such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN).

There are two variations of speaker identification: closed-set identification and open-set identification. In the former, it is known that during evaluation every speech record belongs to a speaker in a predefined set. In the latter, there is no such limitation: the system must be able to classify the speaker of a speech record as an unknown speaker in case the speaker is not in the predefined speaker set. Open-set identification is more realistic, and it is harder to achieve good accuracy in such a use case because the system must be able to handle the case of an unknown speaker.

Figure 1.2: Different Variations of Speaker Identification

The speaker identification problem has different variations depending also on the constraints put on the spoken text. First, in the variation called text-dependent identification, the spoken text is known beforehand and a speaker is asked to say a sentence given by the system. In this use case, it is easier to get good accuracy because the speaker identification system can take advantage of knowing the spoken text. In text-independent identification, there is no such constraint, so the system is expected to identify the speaker regardless of the spoken text. It is more flexible than text-dependent identification, and it is also more challenging to get good accuracy in such a use case.

Finally, vocabulary-dependent identification puts a constraint on the set of words that can be used, but apart from that, it does not limit what a speaker can say. All variations of the speaker identification task that were mentioned are shown in Figure 1.2.


1.2 Relevance of the Problem

Speaker identification is a very important problem as there are many possible application areas where such systems can be useful. In this section, some interesting use cases for such systems will be provided.

First of all, speaker identification techniques can provide security-critical systems with a new form of authentication. The Siri service developed by Apple is a good example of such a use case. Siri is a personalized assistant which can be activated by the "Hey Siri" command only if it is spoken by the owner of the device [57]. Biometric features such as voice are unique for an individual, so they are hard to imitate [61].

For this reason, in addition to classical authentication methods such as challenge-response authentication, where a predefined password or a piece of personal information is requested from a user as proof of identity, speech-based authentication, where a user is identified by a speaker identification system, can increase security and make the system harder for an intruder to trick. Such a use case has been deployed by Barclays Bank [4]. The system does not require any challenge question and uses only voice for authentication: it extracts voiceprints from the first few seconds of the conversation. It is a passive authentication system in the sense that it does not force a user to actively spend time only on authentication. Instead, the user can immediately start talking about his or her actual request, and meanwhile the system can identify the user from the audio signal.

Another practical case where speaker identification can be extremely useful is in the development of efficient forensics and surveillance tools. Nowadays, the amount of available data is growing rapidly, and voice data have become much more readily available. However, in order to build efficient forensics and surveillance tools, automated analysis must be performed on such data. Speaker identification systems can be used to analyze sound signals coming from speech records or telephone conversations, and the people speaking in such records can be detected. For instance, this can help in detecting people who were at a crime scene or in finding the relevant sound records among many records. Such a system would not only improve efficiency, but it would also improve accuracy. According to [35], such systems are used by many police institutions, such as the Guardia Civil de España, the Cyber Security Police in Malaysia, the Police Nationale of France and the Russian Federation Ministry of Justice.

Furthermore, a speaker identification system can be used together with speech recognition systems. A speaker identification system can detect the speaker, and this information can be used to fine-tune a speech recognizer or to pick a speech recognizer which has been trained for that particular speaker. This approach can increase the accuracy of a speech recognizer to a great extent. The paper [7] describes a similar approach, where an automatic speech recognition system is supported by a speaker identification system. In their experiment, speech records contain heterogeneous regions of multiple speakers. First, they split the audio into homogeneous regions where only one speaker speaks, which is called diarization [7]. For this process, and for omitting the non-voiced parts of the sound signal, they use a speaker identification system.

Speaker identification can be used in real-time communication applications, such as call centers and online conferences [15, 35]. For call centers, a speaker identification system can determine the speakers in a call and the time intervals during which each speaker speaks. For online conferences, a speaker identification system can detect the current speaker and show this information to all participants. This can be useful in a conference where many people do not know each other.

Finally, such systems make it possible to develop efficient search systems working on a database of speech records. Thanks to speaker identification systems, one can produce metadata that keeps information about the speakers in each record, and then develop a search system to find speech records by speaker. This would be a good example of a data mining system operating on speech data.

1.3 Thesis Objective and Contribution

The aim of the thesis is to investigate various machine learning models and feature extraction methods that can be used efficiently for the text-independent, closed-set speaker identification problem. Speaker identification has been widely studied and many efficient and accurate methods have been proposed. However, it is not always possible to compare machine learning models or feature extraction methods in a scientific way. This is due to the fact that in much research, machine learning models have been evaluated on different datasets or with different feature extraction methods. However, in order to compare two different components such as classifiers or feature extraction methods, all other factors must remain identical while using those two components. For this reason, this thesis is beneficial in the sense that it provides comparative experimental results for different combinations of classifiers and feature extraction methods applied to the same dataset. Using those experimental results, the thesis shows how the choice of machine learning model (classifier) and feature extraction method affects the accuracy of speaker identification. In order to achieve that objective, the possible combinations of classifiers and feature extraction methods were applied to a preprocessed dataset for training and testing. This leads to an unbiased evaluation and comparison among classifiers and feature extraction methods.

The thesis has focused on three machine learning models, namely Support Vector Machines (SVM), Gaussian Mixture Models (GMM) and k-Nearest Neighbor (KNN). Regarding the feature extraction methods, Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC) are used in the experiments.

For each classifier and feature extraction method, a detailed theoretical explanation is provided throughout the thesis report and the reasons for preferring such methods for speaker identification are discussed.

Also, the impact of the test audio duration, the training data amount and the number of speakers on accuracy is evaluated. For such evaluations, many training datasets which differ in the number of speakers and the available data amount were created. Each combination of classifier and feature extraction method was trained independently on every training dataset. Each such process produced one speaker identification system. Then each obtained system was tested on the same test dataset and the accuracy was calculated.


1.4 Delimitations

As stated in Section 1.3, this thesis focuses on the closed-set, text-independent speaker identification problem. Open-set and text-dependent variants of the problem are out of the scope of this thesis.

Furthermore, preprocessing was only used to remove silences, and noise removal was not applied as it was assumed that the dataset was noise free.

Also, data augmentation was not used on the dataset. It is a good way to make a machine learning-based classifier more robust [11, 53]; however, it is out of the scope of this thesis.

Finally, for the experiments, it was assumed that only one speaker was speaking in a particular speech record. In many applications, it is possible that multiple speakers speak in a single speech record. In this case, a speaker identification system must be able to perform segmentation and then identify the speaker for each segment. Such cases are not covered in this thesis.

1.5 Ethical and Societal Aspects

Regarding the societal aspect, considering all the use cases introduced in the previous sections, any work dedicated to improving speaker recognition systems would have a positive impact on society, as such tasks would be solved in a more efficient way.

Research on the speaker recognition task contributes to the economy as well, as it provides an opportunity to build applications that use speaker recognition and are useful in many areas, such as secure authentication. Also, it does not cause any loss of jobs: such tasks are not done by humans, as humans are not capable of recognizing a speaker among many candidates just by considering voice data.

For these reasons, this thesis work addresses the ethical concerns related to societal well-being. As stated, improvements in speaker recognition do not harm society and, in fact, contribute to it.


Background

A speaker identification system is composed of four components: the data augmentation component, the audio preprocessor, the feature extraction method and the classifier. All of them are crucial for the performance of the system. For this reason, to obtain an accurate system, reasonable decisions must be made regarding all of these components. The role of each component is explained in the following subsections.

2.1 Data Augmentation

Data augmentation is an approach for building a more robust speaker identification system. For any machine learning model, there is always the possibility that the model learns patterns which are specific to the training data and not shared among all possible data. This phenomenon is known as overfitting. In order to prevent this phenomenon and to force the model to learn the real patterns in the data, one must use large amounts of training data. One way to obtain such data is to use data augmentation. In this approach, new training samples are created from existing samples. For instance, by varying the volume of an existing audio recording or adding some background noise to it, a new training sample can be created and richer training data can be obtained.

If a system can be trained with augmented data, it can be more robust against noise, as there will be noisy samples in the training data and the model will learn to ignore such noise. In [14], a successful approach is described for adding background noise to speech signals. The authors reported that applying that approach decreased the error rate under degraded speech conditions.
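A minimal sketch of the two augmentations mentioned above (volume variation and additive background noise), assuming the audio is held as a floating-point numpy array; the function name, gain range and target SNR are illustrative choices, not values taken from this thesis:

```python
import numpy as np

def augment(signal, rng=None, noise_snr_db=20.0, gain_range=(0.7, 1.3)):
    """Create a new training sample by varying the volume and adding
    background noise at (roughly) the requested signal-to-noise ratio."""
    if rng is None:
        rng = np.random.default_rng()

    # Random volume change within the given gain range.
    augmented = signal * rng.uniform(*gain_range)

    # Additive white noise scaled relative to the signal power.
    signal_power = np.mean(augmented ** 2)
    noise_power = signal_power / (10.0 ** (noise_snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=augmented.shape)
    return augmented + noise

# Example: double the training set by augmenting every original sample.
# original_samples = [...]  # list of 1-D numpy arrays (16 kHz mono audio)
# augmented_samples = [augment(s) for s in original_samples]
```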


2.2 Preprocessing

In a speaker identification system, an audio signal must be processed first. The preprocessing consists of two important steps. In the first step, unvoiced audio segments are removed and, in the second step, the audio signal is divided into frames. The importance of each step is explained and some ways to perform each step are described briefly in the following subsections.

2.2.1 Removal Of Unvoiced Audio Segments

In a speaker identification system which considers the acoustic features in an audio signal to predict the speaker, silent parts or parts where there is no human voice should be removed, as those segments cannot contain any useful acoustic information about the speaker. Removing them makes the system more robust. Some efficient methods capable of performing this removal are discussed in [3, 37]. Those works are based on the idea of extracting waveform-domain parameters and frequency-domain parameters as features for intervals of an audio signal, and clustering the intervals by considering those features.

2.2.2 Framing

In a speaker identification system, a speech signal is usually divided into small, equally sized and possibly overlapping frames. Due to the non-stationary nature of speech signals, it is unlikely that useful information can be extracted by using the speech signal as a whole. Thus, segmenting the signal into multiple frames and analyzing each frame separately is the preferred way [13].

The preferred frame size depends on what features one wants to capture from a speech signal. The effects of different frame sizes on the features that can be captured have been widely discussed in the literature, and some examples will be provided in the following subsections.

Subsegmental Frame Size

In this case, frames with a size of around 3 ms (subsegmental frames) are used. This is a good approach to extract the characteristics of the excitation source, which contain speaker-related information [52].


Segmental Frame Size

In this case, frames with a size of 10-30 ms (segmental frames) are used to extract vocal-tract-related features (the vocal tract being the cavity in humans where the sound is produced), which contain useful information related to speech [47].

Supra Segmental Frame Size

In this case, frames with a size of 100-300 ms (supra-segmental frames) are used to extract features arising from behavioral traits of the speaker, such as word duration and intonation.
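A minimal sketch of such framing for a 16 kHz signal, assuming the signal is a 1-D numpy array; the 25 ms frame length and 10 ms step are illustrative segmental-frame values, not the configuration used in this thesis:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split a 1-D audio signal into equally sized, overlapping frames.
    Assumes the signal is at least one frame long."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step = int(sample_rate * step_ms / 1000)         # hop between frame starts
    n_frames = 1 + max(0, (len(signal) - frame_len) // step)
    frames = np.stack([signal[i * step : i * step + frame_len]
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)
```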

2.3 Feature Extraction

After preprocessing the audio signal, a feature extraction component should be used for the speaker identification task. Feature extraction is a process to obtain a compact representation of an audio frame. This is preferred over using the raw audio signal directly. The main reason for this preference is the high dimensionality of the raw audio data. To see why, assume a 50 ms frame length is being used on mono audio data with a 16 kHz sample rate. In such a setting, each frame would contain 0.05 * 1 * 16000 = 800 data points, so each frame would be represented by an 800-dimensional vector. This is not practical due to the high computational complexity and the overfitting problem (the infamous curse of dimensionality). Instead of using such a large vector, it is better to have a smaller vector which contains useful features for the speaker identification task.

Raw audio data contain some redundant information, such as background noise, which is not related to the speaker. It is desirable that an extracted feature vector varies only among different speakers and does not vary due to noise. This way, a feature vector can contain speaker-related information and leave redundant data out. A good feature extraction method is expected to have this property.

In this section, the theory of the feature extraction methods that are applied to speech signals segmented into segmental frames, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC), is explained. After that, some interesting experiments in which different feature extraction methods were tried are presented.

2.3.1 Mel-Frequency Cepstral Coefficients

The Mel-Frequency Cepstral Coefficients (MFCC) method was introduced by S. B. Davis and P. Mermelstein [12]. Since then, it has become one of the most popular feature extraction methods for speaker identification [22]. In this section, the mathematical steps to obtain MFCC features, which have also been used in the implementation for this thesis, will be explained.

Steps

In order to extract the MFCC features of a segmental frame, the following steps are applied:

• Apply the Discrete Fourier Transform (DFT) to the frame

Given that s(n) is the nth sample in the frame, N is the number of samples in the frame, and K is the predefined length of the DFT domain (1 ≤ x ≤ K), the xth value S(x) in the DFT domain is calculated in the following way:

    S(x) = \sum_{n=1}^{N} s(n) \, e^{-2\pi i x n / N}    (2.1)

In Figure 2.1, an example output of the DFT process for an arbitrary audio signal is shown. The first graph represents the audio signal and the second graph shows the output of the DFT process applied to that signal.

• Calculate power spectrum

Given that the power spectrum function is represented by P, the power spectrum is calculated in the following way:

    P(x) = \frac{|S(x)|^2}{N}    (2.2)


Figure 2.1: DFT output for an audio signal [17]

• Apply Mel-filter banks

In this step, F filter banks are computed, where F is a value that can be picked before the computations. The computation of the filter banks involves the Mel scale. This scale is preferred as it imitates the sound perception of the human ear: the human ear can discriminate better at lower frequencies and worse at higher frequencies [56]. The Mel scale imitates this fact by scaling a pair of frequencies with a fixed distance closer to each other at high frequencies and further from each other at low frequencies. Mathematically, for a frequency value f on the Hertz scale, the corresponding value m on the Mel scale is calculated as:

    m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)    (2.3)

The F filter banks are computed by the following steps:

– The frequency range to be used is determined. In the Fourier domain, frequencies outside of this range are ignored during the computations.

– F + 2 points are placed in the frequency range, where the distance between each pair of adjacent points is equal on the Mel scale, and the first and last points are at the lowest and highest frequencies of the range.

– The frequency value of each point is converted from the Mel scale into the Hertz scale by applying (2.3).

– For each point i, given its frequency value h(i) on the Hertz scale, the corresponding index f(i) in the DFT domain with N dimensions and sample rate S is computed by the following formula:

    f(i) = \left\lfloor \frac{(N + 1) \, h(i)}{S} \right\rfloor    (2.4)

– The kth filter bank F_k(x) (1 ≤ k ≤ F) is given below as:

    F_k(x) =
    \begin{cases}
    0, & x < f(k-1) \\
    \dfrac{x - f(k-1)}{f(k) - f(k-1)}, & f(k-1) \le x < f(k) \\
    \dfrac{f(k+1) - x}{f(k+1) - f(k)}, & f(k) \le x \le f(k+1) \\
    0, & x > f(k+1)
    \end{cases}

– For each filter bank F_k(x), the corresponding coefficient is computed by performing the dot product of the filter bank F_k and the power spectrum P(x); this coefficient is called A_k in the next step.

Figure 2.2: Example representation of filter banks [55]

In Figure 2.2, an example representation of the filter bank functions is shown. Each color represents a different filter bank function.

• Apply Discrete Cosine Transform (DCT)


In this step, the Discrete Cosine Transform (DCT) is applied to the coefficient vector computed in the previous step. Those coefficients are highly correlated; for this reason, the DCT is applied to the logarithm of the coefficient vector to decorrelate the coefficients. This operation computes the cepstral coefficients. From the coefficients A_i computed in the previous step, the kth cepstral coefficient C_k is computed by:

    C_k = \sqrt{\frac{2}{F}} \sum_{i=1}^{F} \log(A_i) \cos\left(\frac{k \pi (i - 0.5)}{F}\right)    (2.5)

where F is the number of filter banks.

After that computation, the cepstral coefficients with high indices are discarded, as they fail to have high discriminating capabilities [65], and the remaining cepstral coefficients constitute the MFCC vector.

• Apply Liftering

Then, a process called liftering is applied to give the higher coefficients less weight, as they provide less discrimination; this approach has been shown to provide a better representation of the audio signal [42]. To apply the liftering process, a liftering weight is first calculated for each feature in the MFCC vector. For a predefined liftering coefficient D, the ith liftering weight w_i is calculated as:

    w_i = 1 + \frac{D}{2} \sin\left(\frac{\pi i}{D}\right)    (2.6)

Then the ith feature MFCC_i in the MFCC vector is updated as:

    MFCC_i' = MFCC_i \cdot w_i    (2.7)
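A condensed sketch of the steps above using numpy and scipy; the function name, the number of filter banks, the number of kept coefficients and the liftering coefficient are illustrative choices, not the configuration used in this thesis (that is described in the Methods chapter):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # equation (2.3)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate=16000, n_filters=26, n_ceps=13, lifter=22):
    n = len(frame)
    # DFT and power spectrum (equations (2.1) and (2.2)); only the first
    # half of the spectrum is kept because the input is real-valued.
    power = (np.abs(np.fft.rfft(frame)) ** 2) / n
    n_bins = len(power)

    # Mel filter banks: n_filters + 2 points equally spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bin_idx = np.floor((n + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)  # (2.4)
    bin_idx = np.clip(bin_idx, 0, n_bins - 1)

    fbank = np.zeros((n_filters, n_bins))
    for k in range(1, n_filters + 1):
        left, center, right = bin_idx[k - 1], bin_idx[k], bin_idx[k + 1]
        for x in range(left, center):
            fbank[k - 1, x] = (x - left) / max(center - left, 1)
        for x in range(center, right + 1):
            fbank[k - 1, x] = (right - x) / max(right - center, 1)

    # Filter-bank coefficients, logarithm and DCT (equation (2.5)).
    energies = np.maximum(fbank @ power, 1e-10)
    ceps = dct(np.log(energies), type=2, norm='ortho')[:n_ceps]

    # Liftering (equations (2.6) and (2.7)).
    i = np.arange(n_ceps)
    weights = 1.0 + (lifter / 2.0) * np.sin(np.pi * i / lifter)
    return ceps * weights
```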

2.3.2 Linear Predictive Cepstral Coefficients

Linear Predictive Coding (LPC) is a method used to represent an audio signal in a compressed form, and Linear Predictive Cepstral Coefficients (LPCC) are based on the LPC method. For speaker identification, LPCC is known as a good method that can provide high accuracy, and for noise-free test cases it has been suggested that LPCC can perform better than MFCC [5].

For a predefined coefficient number K, the xth sample f_x of a frame can be expressed as:

    f_x = e_x + \sum_{k=1}^{K} c_k f_{x-k}    (2.8)

This is a linear approximation of the sample f_x, where e_x is the approximation error, and the aim is to compute values for the coefficients c_1, c_2, ..., c_K that minimize (e_x)^2. Those coefficients constitute the LPC feature vector for the frame.

Finally, an LPCC feature vector is obtained by computing the cepstral coefficients of the LPC vector.
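A minimal sketch of this computation; it relies on librosa.lpc for the LPC step and on the standard LPC-to-cepstrum recursion, which is an assumption about tooling rather than the implementation used in this thesis, and the function name and coefficient counts are illustrative:

```python
import numpy as np
import librosa

def lpcc(frame, order=12, n_ceps=12):
    """Compute LPCC features for one frame: LPC coefficients followed by
    the standard LPC-to-cepstrum recursion."""
    # librosa.lpc returns [1, a_1, ..., a_order] for the inverse filter A(z);
    # the prediction coefficients c_k of equation (2.8) are -a_k.
    a = librosa.lpc(frame.astype(float), order=order)
    lpc = -a[1:]

    ceps = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = lpc[n - 1] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                acc += (k / n) * ceps[k - 1] * lpc[n - k - 1]
        ceps[n - 1] = acc
    return ceps
```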

2.3.3 Experiments on Feature Extraction Methods

One of the first experiments which showed the usefulness of spectral analysis in the frequency domain for speaker identification was conducted by S. Pruzansky in 1963 [46]. In this experiment, vectors for each speech record were produced by applying a 17-channel filter bank to the signal, where the frequency domain was divided into equal intervals, and then producing an energy-frequency vector. The authors used a test dataset consisting of 10 speakers and achieved an accuracy of 89%.

In [16], a similar experiment was conducted. Interestingly, the authors claim that the power spectrum varies a lot during the articulation of a phoneme, since the human articulators are in constant motion; however, this is not the case for nasal consonants. Therefore, they proposed to apply spectral analysis only to speech segments where the phoneme /n/ was articulated. According to the same paper, they achieved 93% accuracy with a test dataset consisting of 30 speakers.

In [32], another feature representation method, based on cepstral analysis, was experimented with. In this experiment, a speech segment was represented by a 34-dimensional vector, where the first 32 dimensions were cepstral coefficients and the other 2 were pitch and length values. Then, the nearest neighbor method was used for classification. In an experiment on speaker verification where the authors used 4 authorized speakers and 30 impostors, error rates of 6% to 13% were obtained.

In [2], a feature extraction approach based on LPC was investigated. In this approach, a speech sample was approximated as a linear combination of a fixed number of past speech samples. Then, the coefficients used in the linear combination were used as a feature vector. It was shown that LPC together with the cepstrum function (LPCC) performed better than raw LPC coefficients, and 93% accuracy was achieved in tests with a test dataset consisting of 10 speakers.

Another feature extraction method, named Mel-Frequency Cepstral Coefficients (MFCC), has been widely investigated and used in many experiments such as [30, 43, 49, 33]. In all those experiments, it has been shown that MFCC is effective in representing speaker-related information and that it can provide good accuracy on the speaker identification task.

In another study [59], conducted by I. Trabelsi et al., MFCC, LPC and another feature extraction method called Perceptual Linear Prediction (PLP) were compared on a set of speech records consisting of 14 female speakers. Using a support vector machine (SVM) model with both a linear kernel and a kernel based on the radial basis function, the authors concluded that both MFCC and LPC performed better than PLP.

A study conducted by H. S. Jayana et al. [21] suggested that using multiple frame sizes and rates can be useful, especially when the training dataset is limited. They also proposed that using delta and delta-delta coefficients together with MFCC increases the accuracy. Briefly, delta coefficients keep information about the changes of the MFCC coefficients over time and, likewise, delta-delta coefficients contain information about the changes of the delta coefficients over time.

Finally, the effect of using high-level features such as pitch-related data, pause durations, pause frequencies and the average length of voiced segments has been investigated. In [44], where a combination of such features was used together with a k-nearest neighbor (kNN) classifier, some decent results were obtained on the 2001 NIST Speaker Recognition Evaluation Corpus (for more information about the dataset, the reader can check [1]). The authors reported a 25.2% error rate when using pause-related features, such as the frequency and duration of pauses in the speech, and a 14.8% error rate when using pitch-related features. It was claimed that such features would improve accuracy if used together with features extracted from segmental frames.

2.4 Machine Learning Methods For Classification

2.4.1 Gaussian Mixture Models

Gaussian Mixture Models (GMM) have been widely used in speaker identification. GMM is considered as the state-of-the-art approach for speaker identification and good results with high accuracy have been reported in many experiments such as [49, 48, 66, 39].

GMM is a statistical approach in which each speaker is considered a random source that produces the feature vectors representing the frames of a speech record [48]. Hence, each speaker is modeled by a mixture of Gaussian density functions. The mixture itself is called the probability density function (pdf) and, formally, it is a linear combination of M Gaussian probability density functions, which are called components. The parameter M is a predefined number. A careful decision is needed when picking a value for M, because using too few components might cause the model to underfit the data, whereas using too many components might cause overfitting.

More formally, P_A, the probability density function of the model for speaker A, is defined by:

    P_A(x \mid W_A, \mu_A, \Sigma_A) = \sum_{i=1}^{M} w_{A,i} \, G(x \mid \mu_{A,i}, \Sigma_{A,i})    (2.9)

where W_A = {w_{A,1}, w_{A,2}, ..., w_{A,M}} is the set of component weights, \mu_A = {\mu_{A,1}, \mu_{A,2}, ..., \mu_{A,M}} is the set of component mean vectors, and \Sigma_A = {\Sigma_{A,1}, \Sigma_{A,2}, ..., \Sigma_{A,M}} is the set of component covariance matrices. The Gaussian probability density function G(x \mid \mu_{A,i}, \Sigma_{A,i}) is:

    G(x \mid \mu_{A,i}, \Sigma_{A,i}) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_{A,i}|^{1/2}} \, e^{-\frac{1}{2} (x - \mu_{A,i})^T \Sigma_{A,i}^{-1} (x - \mu_{A,i})}    (2.10)

where d is the dimension of the feature vector.

In the GMM approach, the model for each speaker is trained independently by the Expectation Maximization (EM) algorithm [9]. In this approach, training a model M_A for speaker A means finding good parameter values for W_A, \mu_A and \Sigma_A, so that the likelihood of observing the feature vectors given in the training data of speaker A is high. It is assumed that unseen feature vectors produced by the same speaker will share similarities with the training data for this speaker. For this reason, a model trained with feature vectors of a particular speaker is expected to produce a high likelihood also for unseen feature vectors of the same speaker.

The parameters of a model are initialized randomly and improved by the Expectation Maximization (EM) algorithm. This is an iterative algorithm which tries to increase the likelihood of the training feature vectors in each iteration by updating the parameters W_A, \mu_A and \Sigma_A. It keeps updating the values of those parameters until the algorithm converges (no more improvement in likelihood can be obtained, or the improvement is smaller than a certain threshold) or until it reaches a certain number of iterations. More detail on the EM process is given later in this section.

In Figure 2.3, a GMM trained on a random dataset, created in a way that forms three clusters, is presented. The GMM is chosen to have 3 components to fit the training data well. Each contour line shows the negative log-likelihood of observing a data point on it; so, the likelihood of observing a data point on the line with value 3.33 is larger than that of the line with value 23.33.

Figure 2.3: A GMM model trained on data

Then, for a given feature vector during evaluation, each model outputs the probability of the vector being observed according to its probability density function. The speaker is predicted as the corresponding speaker of the model that outputs the highest probability.

Training - Expectation Maximization Algorithm

Here is the procedure for training a model M_A for speaker A, assuming M Gaussian components are used, the training data contain N feature vectors, and x_i is the ith feature vector in the training dataset:

1) Initialization: Initialize W_A, \mu_A and \Sigma_A randomly.

2) Iteration: Until convergence, do the following:

2.1) E-Step: For all j ∈ {1, 2, ..., M} and i ∈ {1, 2, ..., N}, calculate

    T_{j,i} = \frac{w_{A,j} \, G(x_i \mid \mu_{A,j}, \Sigma_{A,j})}{\sum_{y=1}^{M} w_{A,y} \, G(x_i \mid \mu_{A,y}, \Sigma_{A,y})}    (2.11)

2.2) M-Step: Update the parameters

    w_{A,j} = \frac{1}{N} \sum_{i=1}^{N} T_{j,i}    (2.12)

    \mu_{A,j} = \frac{\sum_{i=1}^{N} T_{j,i} \, x_i}{\sum_{i=1}^{N} T_{j,i}}    (2.13)

    \Sigma_{A,j} = \frac{\sum_{i=1}^{N} T_{j,i} (x_i - \mu_{A,j})(x_i - \mu_{A,j})^T}{\sum_{i=1}^{N} T_{j,i}}    (2.14)

It has been proven that in each iteration the likelihood of the training feature vectors increases and that the algorithm converges. Interested readers are encouraged to read [38] for a formal proof of convergence and of the increase in likelihood in each iteration.

As was seen above, the initialization is crucial for the EM algorithm. For this reason, it has been claimed that using a Universal Background Model (UBM) could improve the results. In the UBM approach, a Gaussian mixture model is trained using all speech data from all speakers, and this model determines the initialization of the speaker-specific models. This approach was reported to improve accuracy [66]. In the experiment described in that paper, on the 2000 NIST Speaker Recognition Evaluation dataset, the GMM-UBM system achieved a 31.2% relative error reduction compared to a GMM system with 128 components.

2.4.2 k-Nearest Neighbor Models

The k-Nearest Neighbor (kNN) method is a simple approach that has been used for many classification problems [28, 24, 64]. In the regular kNN algorithm, the feature vectors are stored in an efficient data structure. When an unseen feature vector V is to be evaluated, the k closest feature vectors to V among the already-stored vectors are determined. Then the prediction can be made according to the determined feature vectors and their corresponding labels. One approach might be to apply a simple majority vote; another might be to use a weighted majority vote, where the weights depend on the distance of a feature vector to the query vector.
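A minimal sketch of this scheme with scikit-learn's KNeighborsClassifier, classifying each frame of a test utterance and taking a majority vote over the per-frame predictions; the function names, k value and distance weighting are illustrative, not the configuration used in this thesis:

```python
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def fit_knn(X_train, y_train, k=5, weighted=True):
    """X_train: (n_frames_total, n_dims) feature vectors from all speakers,
    y_train: the speaker id of each training frame."""
    knn = KNeighborsClassifier(n_neighbors=k,
                               weights='distance' if weighted else 'uniform',
                               algorithm='ball_tree')
    return knn.fit(X_train, y_train)

def predict_utterance(knn, test_frames):
    """Classify each frame of the test utterance, then take a majority
    vote over the per-frame predictions."""
    frame_predictions = knn.predict(test_frames)
    return Counter(frame_predictions).most_common(1)[0][0]
```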


Figure 2.4: Effects of different k values on k-NN [10]

The decision on the value of k is crucial for the accuracy of the system. Too small a k value might cause the system to overfit to seen data, whereas too large a k value might involve unrelated data points in the decision-making process, which can cause accuracy loss. In Figure 2.4, the effect of different k values is shown. It can be observed that for small k values, many isolated regions appear, whereas for large k values, the decision boundary is smoother.

Increasing Computational Efficiency

The requirement of computing the distance between each stored feature vector and the query vector might make the naive k-NN approach infeasible, especially when a huge number of feature vectors is stored. In order to improve the speed, the feature vectors can be partitioned by ball-tree data partitioning. This method, and different ways to perform such partitioning, are discussed in [41]. Such partitioning methods are necessary to reduce the required number of distance calculations.

This partitioning method splits the input space into spheres defined by a center point C and a radius r. Each feature vector belongs to exactly one sphere. Thanks to this approach, it is possible to avoid performing many distance calculations. Before explaining the reason, it should be shown that for a query vector Q and a feature vector x, it is possible to know a lower limit for the distance between x and Q without explicitly calculating it. This is proven below:

Theorem 1. Assume Q is a query vector, and C_Y, r_Y are respectively the center point and radius of a sphere Y (the related schema is shown in Figure 2.5). X_Y is the set of feature vectors that belong to sphere Y, and D(a, b) is the function that gives the Euclidean distance between two arbitrary vectors a, b. In this case, ∀x ∈ X_Y: D(Q, C_Y) − r_Y ≤ D(Q, x).

Figure 2.5: Schema for the theorem

Proof. This can be proved by contradiction. Assume otherwise, so that ∃x ∈ X_Y such that D(Q, x) < D(Q, C_Y) − r_Y (1).

First of all, by the definition of the partitioning, D(x, C_Y) ≤ r_Y (2).

By (1) and (2), D(Q, x) + D(x, C_Y) < D(Q, C_Y) (3).

By the triangle inequality, D(Q, C_Y) ≤ D(Q, x) + D(x, C_Y) (4).

By (3) and (4), we get D(Q, C_Y) < D(Q, C_Y), which is a contradiction.

Thanks to the information about lower bounds on the distance between each vector and the query vector, many distance computations can be skipped, as some vectors may turn out to have a lower bound so high that they cannot be candidates for one of the k closest vectors to the query vector. This way, the speed of the system is improved.
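This lower-bound pruning is what ball-tree query implementations exploit internally. A minimal sketch using scikit-learn's BallTree; the data here are random placeholders:

```python
import numpy as np
from sklearn.neighbors import BallTree

X = np.random.rand(10000, 13)   # stored feature vectors
Q = np.random.rand(5, 13)       # query vectors, one row per query frame

tree = BallTree(X, leaf_size=40)          # partition the vectors into balls
distances, indices = tree.query(Q, k=5)   # k nearest stored vectors per query
```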


2.4.3 Self-Organizing Maps

Another interesting experiment, which used a self-organizing map (SOM) for speaker identification [33], was conducted by A. T. Mafra et al. In that experiment, a SOM was trained for each speaker. For each SOM, all speech segment vectors of the corresponding speaker were used. After training, a vector representing a test sample was classified by considering the quantization error at each SOM. In other words, the speaker whose SOM gave the minimum quantization error for a given test vector was chosen as the predicted speaker. In the experiment on the dataset which consisted of 14 speakers, the authors obtained 98.57% accuracy [33].

2.4.4 Support Vector Machines

For the speaker identification problem, one of the other machine learning models that has been experimented with is Support Vector Machines (SVM) [62, 8, 59].

SVM works by finding a hyperplane that separates the instances of two classes. Formally, assuming W represents the separating hyperplane, x_i is the vector of the ith training instance, t_i ∈ {−1, 1} is the class of that training instance, and ε > 0, there are two constraints. The first one is:

    t_i = 1 \implies W x_i \ge \varepsilon    (2.15)

and the second one is:

    t_i = -1 \implies W x_i \le -\varepsilon    (2.16)

These constraints guarantee that the hyperplane linearly separates the instances of the two classes, and they can be combined into the following single constraint:

    \forall t_i, x_i : \; W x_i t_i \ge \varepsilon    (2.17)

Among all possible hyperplanes, it is desirable to choose one such that the margin is maximized. The margin is the distance between the instances and the hyperplane. A wide margin is preferred to decrease the risk of bad classification: it is assumed that a separating hyperplane with a wide margin generalizes better. Considering a point P on the margin of the hyperplane, the distance between P and the hyperplane is given by:

    \frac{|P W|}{\|W\|} = \frac{\varepsilon}{\|W\|}    (2.18)

This shows that the width of the margin is inversely proportional to the length of W. SVM training is therefore an optimization problem that minimizes \|W\| (equivalently, maximizes the margin) under the constraints described above.

Sometimes, in a classification problem, there might be outliers which cause SVM to find a separating hyperplane with a narrow margin, or prevent SVM from finding any separating hyperplane at all. To prevent such cases, slack variables are used in SVM. These variables are introduced to obtain a wider margin at the expense of possible margin violations and misclassified instances; without slack variables, margin violations are not allowed. Together with the slack variables, a regularization coefficient C is introduced. This coefficient controls the trade-off between the margin size and the amount of misclassified data. The larger C is, the narrower the margin will be, and the smaller C is, the wider the margin will be, at the expense of more misclassified training data and more margin violations. Especially in cases where the training dataset includes many noisy instances, slack variables can be particularly useful, as they prevent the system from being forced to classify the noisy instances correctly and, as a consequence, losing the opportunity to compute a hyperplane with a good margin. Including the slack variables S_1, S_2, ..., S_N ≥ 0 and the regularization coefficient C, where N is the number of training instances, the problem becomes to compute:

    \arg\min_{W, S} \; \|W\| + C \times \sum_{i} S_i    (2.19)

under the following constraints:

    \forall t_i, x_i : \; W x_i t_i \ge \varepsilon - S_i    (2.20)

There are cases where a dataset is not linearly separable and using slack variables does not help. For such cases, it might be a good idea to project the data into a higher-dimensional space and try to find a separating hyperplane in that space. For this aim, a computational method called the kernel trick is used. The kernel trick provides a way to solve the optimization problem as if the data were projected into the high-dimensional space. This way, the system can have the advantage of the projection and, at the same time, it does not waste computational resources on actually projecting the data. The way to achieve this is to use a kernel function that takes two points from the original space as input and computes their dot product in the high-dimensional space.

SVMs with different kernel functions, such as the polynomial kernel and the Fisher kernel, have been tried and their performance was compared with GMM-based systems [62]. According to the results of those experiments, SVM-based systems can obtain performance similar to that of GMM-based systems.

Regular SVM is used for binary classification. To perform multinomial classification, which is generally necessary for speaker identification, various methods have been proposed in [18, 45]. The method used in this thesis implementation will be described in the Methods section.
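A minimal sketch using scikit-learn's SVC, which extends the binary formulation to multiple classes internally (via a one-vs-one scheme); the C value, the RBF kernel and the random placeholder data are illustrative, not the configuration used in this thesis:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 13))     # frame-level feature vectors
y_train = rng.integers(0, 4, size=400)   # speaker id of each training frame
X_test = rng.normal(size=(20, 13))

# C controls the trade-off between margin width and slack (misclassified or
# margin-violating frames); the RBF kernel applies the kernel trick
# implicitly, so no explicit projection is computed.
svm = SVC(C=10.0, kernel='rbf', gamma='scale').fit(X_train, y_train)
predicted = svm.predict(X_test)
```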

2.4.5 Neural Networks

Neural networks have also been studied for speaker identification, and promising results have been reported. For instance, Vector Quantization methods have been widely studied in [23, 31, 6] for the speaker identification task. Bennani et al. obtained 97% identification accuracy [6] on a dataset of 10 speakers by using a Vector Quantization (VQ) approach. The authors claimed that their system worked well with short speech records and that the training process was fast.

J. Oglesby et al. used the Radial Basis Function (RBF) model [40] for the speaker verification task. On a dataset consisting of 40 speakers, they achieved the best performance (an 8% true speaker rejection rate and a 1% impostor acceptance rate) by using one hidden layer with 384 hidden nodes. They also claimed that the RBF model outperformed both a standard Multilayer Perceptron (MLP) and a Vector Quantization (VQ) based system [40]. Apart from that, deep neural networks have also been tried and reported to give good results [50].

Another interesting approach was proposed by L. Rudasi et al. [51]. Their method relies on building many small neural networks for binary classification. In other words, for N speakers, N(N−1)/2 small neural networks (one for each speaker pair in the dataset) are formed, and each network is responsible for discriminating between those two speakers. They reported that they obtained 100% accuracy on their dataset [51]. However, the approach is not feasible for datasets with a large number of speakers, as the necessary number of neural networks grows proportionally to the square of the number of speakers.

Another neural network method that is worth mentioning here was proposed in [29]. This experiment used a deep neural network to extract features from audio data and to represent the signal in a compact way. First, the system extracted MFCC and Gabor features and, by feeding them into a neural network similar to an autoencoder, reduced the dimensionality and produced a feature vector which was more robust against noise. The authors concluded that this feature vector was better for classifying the audio than MFCC and Gabor features alone.


Methods

In this section, the creation of the datasets that have been used during the experiments will be explained first. Afterwards, all parts included in the experiment pipeline, such as framing, feature extraction and classification, will be explained in detail.

3.1 Dataset

In this thesis, speech records provided by Inovia AB are used as the main source for creating the training and test datasets for the experiments. The main audio source consists of speech records of 321 male and 259 female speakers. All speakers had been anonymized, thus only age and gender information was available. For each speaker, 18 to 525 speeches had been recorded. The speeches were recorded in 16 kHz, mono, 16-bit format.

3.1.1 Preprocessing

As mentioned before, the task of removing noisy and silent frames is crucial for building a robust speaker identification system. This task is called voice activity detection (VAD). VAD refers to the problem of segmenting an audio signal and labeling each segment as either voiced, where a human voice is observed, or non-voiced. After such labeling, each voiced segment can be extracted as a separate audio signal, which is called a chunk.


Figure 3.1: Voice Activity Detection [54]. Note that out of this signal, 3 voiced chunks can be extracted as the output of the VAD process.

An example of the VAD process is shown in Figure 3.1. For the audio signal shown in the figure, 3 voiced segments have been detected, which would result in the extraction of 3 chunks.

In this thesis implementation, the WebRTC VAD library [60] was used for the VAD process. This library has a configuration property called aggressiveness, which can be set to an integer between 0 and 3. The value 0 makes the detector the least aggressive in filtering out non-speech, and 3 makes it the most aggressive. In other words, it defines the trade-off between precision and recall for voiced segments. In this thesis implementation, the value 3 was used for this property because precision was more important than recall.
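A minimal sketch of how the Python webrtcvad binding can be driven window by window; the function name and the 30 ms window are illustrative, and the grouping of consecutive voiced windows into chunks is omitted:

```python
import webrtcvad

def voiced_flags(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=3):
    """Label each 30 ms window of 16-bit mono PCM audio as voiced or not,
    using the WebRTC VAD at the most aggressive filtering setting."""
    vad = webrtcvad.Vad(aggressiveness)
    bytes_per_window = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    flags = []
    for start in range(0, len(pcm_bytes) - bytes_per_window + 1, bytes_per_window):
        window = pcm_bytes[start:start + bytes_per_window]
        flags.append(vad.is_speech(window, sample_rate))
    return flags
```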

After chunks were extracted for each speech in the main audio source, chunks whose durations were shorter than 3 seconds were removed. This was due to the observation that among such chunks, the precision was relatively worse: many audio segments in this group consisted only of breathing or murmuring sounds, so many of them were false positives.


Each dataset that was used in training or testing can be seen as a mixture of chunks. For each dataset, it is desirable to have sound signals coming from various speeches; this is the preferred approach to have a representative training dataset and a realistic test dataset. Considering that a chunk consists of sound signals coming from only one speech, having a long chunk in a dataset would cause the dataset to be dominated by a particular speech. To prevent this undesired effect, chunks whose durations were longer than 20 seconds were removed as well.

3.1.2 Training Datasets

After the chunk extraction process, various training datasets were derived using the chunks. The derived training datasets differ from each other in the number of speakers and the training data duration. For the number of speakers, the values 40, 80, 120 and 160 were used, in order to observe the effect of different numbers of speakers on the prediction accuracy. For each training dataset, an equal number of male and female speakers was used; for instance, a dataset with 120 speakers consists of 60 male and 60 female speakers. For the training data duration, 10, 30, 60, 90 and 120 seconds were used; this value is the sum of the durations of the chunks used in training for each speaker. It should be noted that chunks are created after the silence removal process. This means that the chunks contain no silence and are completely filled with the voice of a speaker. Before explaining the derivation of the training datasets in more detail, some definitions are stated below to make the explanation more systematic and easier to follow.

Definition 3.1. Train_{s,t} represents the training dataset with s speakers which contains t seconds of chunks for each speaker.

Definition 3.2. Speakers(Train_{s,t}) represents the set of speakers of the training dataset Train_{s,t}.

Definition 3.3. Chunks(Train_{s,t}, X) represents the set of chunks for speaker X in the training dataset Train_{s,t}.

Below, the properties that each training dataset follows are given. In these definitions, t_1, t_2 stand for any possible durations of training chunks and s_1, s_2 stand for any possible numbers of speakers.


Property 1. Speaker coverage: For two datasets where the latter has at least as many speakers as the former, the set of speakers of the latter covers the set of speakers of the former. Formally:

    s_2 \ge s_1 \implies Speakers(Train_{s_1,t_1}) \subseteq Speakers(Train_{s_2,t_2})

Property 2. Chunk set coverage: For two datasets that both contain speaker X, if the latter uses a training data duration per speaker that is greater than or equal to that of the former, then the latter includes all chunks for speaker X that the former includes. Formally:

    t_2 \ge t_1 \wedge X \in Speakers(Train_{s_1,t_1}) \wedge X \in Speakers(Train_{s_2,t_2}) \implies Chunks(Train_{s_1,t_1}, X) \subseteq Chunks(Train_{s_2,t_2}, X)

The speaker coverage property implies that two datasets that have the same number of speakers must have the same set of speakers. Also, the two properties together imply that if two datasets have the same number of speakers and the latter dataset uses a larger data duration than the former, the latter dataset includes all chunks that the former contains.

The reason for applying such rules in the dataset derivation process is to minimize the effect of randomness on the evaluation. The necessity of the speaker coverage property can be explained by the following example: assume that the speaker coverage property is not used and that, for the training dataset Train_{40,10}, 40 speakers that are very hard to distinguish were picked. Then, for the training dataset Train_{80,10}, it is possible to randomly pick 80 other speakers which are much easier to distinguish from each other. In this case, given that all other factors are the same, it is very likely that Train_{80,10} gives better accuracy than Train_{40,10}, which is unfair and misleading. The chunk set coverage property is also very important for a fair evaluation. To see the importance of this property, consider two datasets, Train_{40,10} and Train_{40,30}. Without the chunk set coverage property, it is possible to pick very representative chunks for each speaker for the first dataset and to pick chunks that fail to adequately represent some speakers for the second dataset. In this case, Train_{40,10} can provide better accuracy than Train_{40,30}, which is again unfair and misleading.
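One way to satisfy both coverage properties by construction is to fix a single ordering of the speakers and of each speaker's chunks, and let every dataset take prefixes of those orderings. A minimal sketch, assuming chunk durations are known in advance; the function name is illustrative, and the gender balancing and the preference for chunks from different speeches described above are omitted:

```python
import random

def build_training_datasets(chunks_by_speaker, speaker_counts=(40, 80, 120, 160),
                            durations=(10, 30, 60, 90, 120), seed=0):
    """chunks_by_speaker maps a speaker id to a list of (chunk_id, duration)
    pairs. Speakers and chunks are ordered once, and every dataset takes a
    prefix of those orderings, so the speaker-coverage and chunk-set-coverage
    properties hold by construction."""
    rng = random.Random(seed)
    speakers = sorted(chunks_by_speaker)
    rng.shuffle(speakers)                       # one fixed ordering of speakers

    datasets = {}
    for s in speaker_counts:
        for t in durations:
            selection = {}
            for speaker in speakers[:s]:        # prefix => speaker coverage
                chosen, total = [], 0.0
                for chunk_id, dur in chunks_by_speaker[speaker]:
                    if total >= t:
                        break
                    chosen.append(chunk_id)     # prefix => chunk set coverage
                    total += dur
                selection[speaker] = chosen
            datasets[(s, t)] = selection
    return datasets
```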


Figure 3.2: 20 training datasets and the relations among them. In each box, the first number shows the duration of the training data for each speaker in seconds and the second number shows the number of speakers. The arrows show the coverage relations among the datasets. Orange arrows show the chunk set coverage relation between two datasets and blue arrows show the speaker set coverage relation.


To reduce the effect of intersession variability, training chunks are preferably taken from different speeches. The term intersession variability refers to the differences between two speeches of the same speaker. Such differences can occur due to changes in the recording environment, transmission circumstances, background noise and variations in the speaker's voice [58]. During tests, it is not possible to know the background noise or recording environment beforehand. For that reason, it is advisable to train the system on as many different conditions as possible in order to obtain a robust speaker identification system.

Figure 3.2 shows all training datasets that were used during the experiments. In each box, the first number shows the training data duration for each speaker in seconds and the second number shows the number of speakers. The arrows show the coverage relations among the datasets: orange arrows show the chunk set coverage relation between two datasets and blue arrows show the speaker set coverage relation. It should be noted that the relations shown by the arrows are transitive.

For instance, it can be concluded that 60/40 has chunk set coverage over 30/80, because 60/80 has chunk set coverage over 30/80 and 60/40 has chunk set coverage over 60/80.

3.1.3 Test Datasets

Regarding the evaluation, 4 test datasets were created. They have 40, 80, 120 and 160 speakers respectively, and they are denoted by $Test_{40}$, $Test_{80}$, $Test_{120}$ and $Test_{160}$. As expected, the speaker set of each test dataset matches the speaker set of the corresponding training dataset. Formally, for any number of speakers $X$ and training data duration $t$,

$Speakers(Test_X) = Speakers(Train_{X,t})$   (3.1)

This implies that the speaker coverage property is preserved for the test datasets as well.

For each speaker, the number of test cases varies between 25 and 48, and each test case has a duration of 5 seconds. Test cases that are too long would make the experiments computationally expensive, whereas test cases that are too short would prevent observing the full potential of the systems, as the accuracy would be lower. For this reason, picking 5 seconds as the test case duration seemed to be a good compromise.


Each test case was created by merging adjacent chunks produced from the same speech, such that the sum of the chunk durations is at least 5 seconds. Then, from the merged audio signal, the first 5 seconds were taken as the test case record.
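As an illustration of this construction, the following sketch merges adjacent chunks of one speech and keeps only the first 5 seconds. Names such as chunks_of_speech and SAMPLE_RATE are assumptions, not the thesis code; each chunk is assumed to be a NumPy array of 16 kHz samples.

```python
import numpy as np

SAMPLE_RATE = 16000
TEST_CASE_SECONDS = 5

def build_test_case(chunks_of_speech):
    """Merge adjacent chunks from one speech until at least 5 seconds are
    accumulated, then keep only the first 5 seconds as the test case."""
    merged, total_samples = [], 0
    for chunk in chunks_of_speech:
        merged.append(chunk)
        total_samples += len(chunk)
        if total_samples >= TEST_CASE_SECONDS * SAMPLE_RATE:
            break
    if total_samples < TEST_CASE_SECONDS * SAMPLE_RATE:
        return None  # this speech is too short to yield a test case
    signal = np.concatenate(merged)
    return signal[: TEST_CASE_SECONDS * SAMPLE_RATE]
```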

During the evaluations, the speaker of each test case is predicted 5 times. In each attempt, the first 1, 2, 3, 4 and 5 seconds of the record are considered, respectively. This way, the effect of the duration of the speech record on the accuracy is evaluated.
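A minimal sketch of this prefix-based evaluation, assuming a trained model exposed through a hypothetical predict_speaker function, could look as follows.

```python
def evaluate_test_case(test_signal, true_speaker, predict_speaker,
                       sample_rate=16000):
    """Predict the speaker from the first 1..5 seconds of a 5-second test case
    and record whether each prefix-based prediction is correct."""
    results = {}
    for seconds in range(1, 6):
        prefix = test_signal[: seconds * sample_rate]
        results[seconds] = (predict_speaker(prefix) == true_speaker)
    return results
```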

It is important to note that the test cases for a particular speaker are the same in any two test datasets that contain that speaker. This is also due to the aim of minimizing the effect of randomness on the evaluation. Assuming that $Cases(Test_X)$ gives the set of test cases of $Test_X$, it holds that $s_2 > s_1 \implies Cases(Test_{s_1}) \subset Cases(Test_{s_2})$. Also, each test case was extracted from a different speech, in order to capture intersession variability and hence obtain more realistic tests.

3.1.4 Statistics on Speakers

In Figure 3.3, Figure 3.4, Figure 3.5 and Figure 3.6, the age and gender statistics for each speaker set are shown. In each pie chart label, the numbers show the age interval for the corresponding group, and f stands for female and m stands for male. The number shown in a slice of the pie chart is the number of speakers that belong to the corresponding speaker group.

3.2 Framing

As discussed in Section 2.2, framing is a crucial process before the feature extraction step. In the thesis implementation, a frame size of 25 milliseconds was used and the step length between successive frames was set to 10 milliseconds.
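A minimal framing sketch under these parameters (25 ms frames, 10 ms step, 16 kHz audio) is shown below; the function name and data layout are assumptions rather than the thesis code.

```python
import numpy as np

def frame_chunk(chunk, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split one silence-free chunk into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples per frame
    step_len = int(sample_rate * step_ms / 1000)     # 160 samples between frame starts
    frames = []
    for start in range(0, len(chunk) - frame_len + 1, step_len):
        frames.append(chunk[start:start + frame_len])
    return np.array(frames)
```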

During the framing process, each chunk was segmented into frames. After this process, a training dataset can be seen as a dictionary in which the keys are the speakers and the value for each key is the set of frames produced from the chunks belonging to that particular speaker.


[Figures 3.3 to 3.6 are pie charts of the age and gender distribution of each speaker set, with the eight groups 18-29 f, 30-39 f, 40-49 f, +50 f, 18-29 m, 30-39 m, 40-49 m and +50 m.]

Figure 3.3: Statistics for T40 (5 speakers in each group)

Figure 3.4: Statistics for T80 (10 speakers in each group)

Figure 3.5: Statistics for T120 (15 speakers in each group)

Figure 3.6: Statistics for T160 (20 speakers in each group)

Likewise, a test dataset can be seen as a dictionary whose keys are the test case identifiers and whose value for each key is the set of frames produced from the chunks belonging to that test case.

3.3 Feature Extraction

After the framing process is complete, a feature extraction method is applied to each frame. After the feature extraction process, each frame in the training and test datasets is replaced by the corresponding feature vector computed by the applied feature extraction method.


It should be noted that, after all feature vectors had been extracted, a striding approach was applied in which only every other feature vector is kept and the rest are removed. The reason is that adjacent feature vectors were observed to be much more similar to each other than arbitrary pairs of vectors. It was therefore assumed that, for each pair of adjacent feature vectors, discarding one would not harm the accuracy, since the remaining vectors would still be representative enough, while providing a large gain in computation time. Indeed, this approach proved to be very efficient, as it caused a negligible loss in accuracy and a large boost in performance. The experiments related to this approach are explained in detail in the Striding section of the thesis.
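In code, the striding step described above amounts to keeping every other per-frame feature vector; a minimal sketch, with assumed names, is:

```python
def stride_features(feature_vectors, step=2):
    """Keep every `step`-th per-frame feature vector (list or NumPy array)."""
    return feature_vectors[::step]
```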

For this thesis, two feature extraction methods were used, namely the Mel-Frequency Cepstral Coefficients (MFCC) method and the Linear Predictive Cepstral Coefficients (LPCC) method. Each of them was used with 4 different configurations, which gives 8 different feature extraction configurations in total. In the following sections, the configurations used for the MFCC and LPCC methods during the thesis experiments are explained.

3.3.1 Configurations for Mel-Frequency Cepstral Coefficients Methods

First of all, for each configuration of the MFCC method, the length of the DFT domain K is predefined as 512 (the default value suggested by the library used for MFCC feature extraction in the implementation). Moreover, as a frame size of 25 milliseconds was used and the sample rate was 16 kHz, the number of samples in each frame N was the same for every configuration and was computed as:

N = 16000 × 0.025 = 400 (3.2)
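For illustration, an MFCC configuration with these parameters could be expressed with the python_speech_features library as sketched below; the library choice, the input file name and the default number of cepstral coefficients are assumptions and not necessarily those used in the thesis.

```python
import scipy.io.wavfile as wav
from python_speech_features import mfcc

# Hypothetical chunk file; 16 kHz mono audio is assumed.
sample_rate, signal = wav.read("chunk.wav")

features = mfcc(signal,
                samplerate=sample_rate,  # 16000 Hz
                winlen=0.025,            # 25 ms frames -> N = 400 samples
                winstep=0.01,            # 10 ms step between frames
                nfft=512)                # length of the DFT domain K

# `features` has one row per frame; each row is one MFCC feature vector.
```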
