Development of a Speaker Recognition Solution in Vidispine

Karen Farnes

June 20, 2013

Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Henrik Björklund
Supervisor at Codemill: Thomas Knutsson

Examiner: Frank Drewes

Umeå University

Department of Computing Science
SE-901 87 UMEÅ

SWEDEN


Abstract

A video database contains an enormous amount of information. In order to search through the database, metadata can be attached to each video. One such type of metadata is labels stating which speakers occur and where they are speaking. With the help of speaker recognition, this type of metadata can be assigned to each video automatically. In this thesis a speaker recognition plug-in for Vidispine, an API media asset management platform, is presented.

The plug-in was developed with the help of the LIUM SpkDiarization toolkit for speaker diarization and the ALIZE/LIA RAL toolkit for speaker identification.

The choice of using the GMM-UBM method that ALIZE/LIA RAL offers was made through an in-depth theoretical study of different identification methods. The in-depth study is presented in its own chapter. The goal for the plug-in was an identification rate of 85%. However, the result unfortunately ended up as low as 63%. Among the issues the plug-in faces, its low performance on female speakers was shown to be crucial.


Contents

1 Introduction
  1.1 Thesis Structure
2 Background
  2.1 Speech Parameterization
  2.2 Gaussian Mixture Models
  2.3 Hidden Markov Model
  2.4 Speaker Diarization
    2.4.1 Voice Activity Detection
    2.4.2 Segmentation
    2.4.3 Clustering
  2.5 Speaker Identification
    2.5.1 Closed- and Open-Set
    2.5.2 Supervision
  2.6 Vidispine
3 Problem Description
  3.1 Problem Statement
  3.2 Goal and Purpose
  3.3 Methods
  3.4 Related Work
4 Comparison of Speaker Identification Methods
  4.1 Gaussian Mixture Model
    4.1.1 EM-ML Algorithm
    4.1.2 Identification
  4.2 Vector Quantization
    4.2.1 Codebook Construction
  4.3 Support Vector Machine
  4.4 Artificial Neural Network
    4.4.1 Backpropagation
  4.5 Modern Additions
    4.5.1 GMM-UBM
  4.6 Comparison
    4.6.1 Evaluation
5 Vidispine Plug-in
  5.1 Implementation Choices
  5.2 Vidispine
  5.3 Speaker Diarization
    5.3.1 Preprocessing
    5.3.2 Voice Activity Detection
    5.3.3 BIC Segmentation
    5.3.4 Linear Clustering
    5.3.5 Hierarchical Clustering
    5.3.6 Initialize GMM
    5.3.7 EM Computation
    5.3.8 Viterbi Decoding
    5.3.9 Adjustment of Segment Boundaries
    5.3.10 Speech/non-speech Filtering
    5.3.11 Splitting of Long Segments
    5.3.12 Gender and Bandwidth Selection
    5.3.13 Final Clustering
    5.3.14 Postprocessing
  5.4 Speaker Identification
    5.4.1 Preprocessing
    5.4.2 Training
    5.4.3 Testing
    5.4.4 Postprocessing
6 Results
  6.1 Evaluation Methods
    6.1.1 Testing and Training Set
    6.1.2 Errors
    6.1.3 Metrics
  6.2 Testing Results
    6.2.1 DET Curve and EER
    6.2.2 Identification Rate
  6.3 Discussion
    6.3.1 Gender Difference
    6.3.2 Clustering
    6.3.3 Metrics
    6.3.4 Test Data and Filtering
    6.3.5 Possible Improvements
7 Conclusions
  7.1 Present Work
  7.2 Future Work
8 Acknowledgements
References


List of Figures

2.1 The result obtained by performing speaker diarization
2.2 Illustration of speaker identification for an open-set database
4.1 Conceptual presentation of two-dimensional VQ matching for a single speaker
4.2 The principles behind SVM
4.3 An MLP with inputs x, a single hidden layer and an output y
5.1 The use of LIUM for speaker diarization
5.2 Illustration of the speaker identification process
6.1 DET curve
6.2 FA- and ME-probability in relationship to the threshold, together with the number of SegE for the whole dataset


List of Tables

6.1 Threshold and EER for the dataset
6.2 Overview of the distribution of the errors
6.3 Identification and segment accuracy


Chapter 1

Introduction

When humans listen to other people talk, they automatically extract information about what they are hearing. Not only do they understand what is being said, but also who is speaking and in what language. Making software that can do the same automatically is not a simple task. These are examples of three types of information that can be extracted when working with a speech signal.

This further leads to these three recognition fields:

Speech recognition: Extraction of words, to answer “What is being said?”.

Language recognition: Recognition of language, to answer “Which language is spoken?”.

Speaker recognition: Recognition of speakers, to answer “Who is speaking?”.

Of these three types of automatic recognition, this thesis will focus on the last, Speaker Recognition.

Speaker recognition is the task of automatically identifying who is speaking. It can be divided into two tasks: speaker verification and speaker identification. The former seeks to verify a speaker against a stated speaker model. This is what can be called a one-to-one comparison, and it is often used to check that a person speaking really is who he/she claims to be. Speaker verification is therefore commonly studied with regard to the application area of biometric identification. This thesis will however focus on speaker identification. The goal of this task is more general: simply to find out who is speaking. A speech segment is therefore compared with a database of speaker models, a so-called one-to-many comparison. To obtain this database there must exist recorded speech from all the possible speakers in the database [10].

Automatically recognizing speakers is in itself a fascinating ability, but what can it be used for? As mentioned, the simpler task of speaker verification is mainly used for authentication. More specifically, one can imagine using it to gain access to devices, facilities and web pages. Another area for speaker recognition technology is forensics, where the goal can be to prove that a specific person said something or was at a certain location.

The area that this thesis will focus on is information structuring. For example, one may wish to have an archive of videos and be able to search through them by speaker. These types of applications are often useful in the media industry. The purpose might be to build a list of actors and where they are speaking in a movie, or to identify speakers in recorded or online meetings.

To achieve high accuracy, one can imagine that it would be convenient to record specific words, and in that way recognize speakers by how they pronounce these words. In many cases, however, this type of data is not available. For example, one may want to recognize a person who does not know he/she is being identified. Therefore, a more flexible technique for speaker recognition is often used. Recognition that does not rely on specific words is called text-independent and is the type of recognition in focus in this thesis [3].

A speaker recognition system is often divided into two separate phases that are performed sequentially. The first phase is speaker diarization, a form of preprocessing of the sound file in which the segments where a speaker speaks are identified. The output from speaker diarization is a set of labels stating in which parts of the sound file different unknown speakers are speaking. This can then be sent to a speaker identification module that identifies who these unknown speakers are. Speaker diarization and speaker identification are often treated as two separate research fields.

1.1 Thesis Structure

Chapter 2 of this thesis gives some background theory on the speaker recognition field, including a description of the two phases, speaker diarization and speaker identification. Chapter 3 describes the thesis project, including a problem statement and goals. After this, Chapter 4 presents a set of speaker identification methods together with a comparison of them; this chapter is part of the in-depth study for this thesis. A description of the implementation of a prototype for a speaker recognition system can be found in Chapter 5. The methods for evaluating the prototype are explained in Chapter 6, together with the results and a discussion of them. Lastly, Chapter 7 summarizes the work done in this thesis together with suggestions for future work. This chapter is followed by the Bibliography.


Chapter 2

Background

This chapter gives some theoretical background in the field of speaker recognition. How speech is represented and a possible model for representing a speaker's speech are presented, together with more detailed information about speaker diarization and speaker identification.

2.1 Speech Parameterization

In order to analyze speech signals statistically and represent them in a compact manner, feature vectors are extracted from the speech signal. The aim is to capture the characteristic patterns in a speaker's speech and then produce a model that can represent these characteristics. Though there exist multiple types of feature representations, the most commonly used is Mel-frequency cepstral coefficients (MFCC) [3]. Feature extraction with MFCC consists of a set of steps, which will now be presented.

High frequencies are often attenuated during the production of speech in the throat; a filter is therefore applied to amplify these frequencies. Then, a window smaller than the whole signal is placed at the start of the signal and repeatedly shifted by a given time interval until it reaches the end of the signal. The Fast Fourier Transform (FFT) is applied to each window. To emphasize the lower frequency components, mel-scale frequency warping is done. This includes multiplying the spectrum with a filterbank, so that the frequencies are transferred into the mel-frequency scale. This scale is similar to the frequencies perceived by the human ear. In the filterbank each filter has a triangular bandpass frequency response. Applying a filter for each component in the mel-frequency scale produces the average spectrum around each center frequency, with increasing bandwidth. The formula to convert the frequencies f into the mel scale is as follows [23]:

$f_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f_{lin}}{700}\right) \qquad (2.1)$


After this, the logarithm of the spectral envelope is computed to acquire the spectral envelope in dB. Finally, the discrete cosine transform is applied; it is defined as:

$c_n = \sum_{k=1}^{K} S_k \cdot \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \quad n = 1, 2, \ldots, L \qquad (2.2)$

where $K$ is the number of log-spectral coefficients, $S_k$ are the log-spectral coefficients and $L$ is the number of cepstral coefficients to be calculated (bounded by $L \le K$).
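To make the steps above concrete, the following is a minimal numpy/scipy sketch of the MFCC pipeline (pre-emphasis, windowing, FFT, mel filterbank, log, DCT). The frame length, hop size, filter count and number of coefficients are illustrative assumptions, not values prescribed by this thesis, which relies on existing toolkits for feature extraction.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # Equation 2.1

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26, n_ceps=13):
    # Pre-emphasis: amplify the high frequencies attenuated during speech production.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slide a Hamming window over the signal and take the FFT power spectrum of each frame.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * np.hamming(frame_len)
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel-spaced filterbank (mel-scale frequency warping).
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Logarithm of the mel spectrum, then the DCT of Equation 2.2; keep the first n_ceps coefficients.
    log_mel = np.log(power @ fbank.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]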

2.2 Gaussian Mixture Models

A common approach in text-independent speaker recognition is to use a Gaussian Mixture Model (GMM) to model each speaker. A model consists of a multi-modal Gaussian probability distribution, which in other words means that it is a weighted sum of M component densities, and can be defined as:

$p(\vec{x}|\lambda) = \sum_{i=1}^{M} w_i\, p_i(\vec{x}) \qquad (2.3)$

where $\vec{x}$ is a D-dimensional feature vector, the $w_i$ are the mixture weights with the constraint that $\sum_{i=1}^{M} w_i = 1$, and $p_i(\vec{x})$ is a uni-modal Gaussian density of the form:

$p_i(\vec{x}) = \frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}-\vec{\mu}_i)'\,\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i)} \qquad (2.4)$

When GMM is used for speaker identification, each speaker to be identified is represented by a GMM. A speaker's model, and its parameters, is referred to as $\lambda = \{w_i, \vec{\mu}_i, \Sigma_i\},\ i = 1, \ldots, M$. GMM is more thoroughly described in Section 4.1.

2.3 Hidden Markov Model

Another approach to speaker modeling is the Hidden Markov Model (HMM). HMMs can be trained to model phrases or phonemes; because of this, HMMs are most commonly used in text-dependent speaker recognition. An HMM is a type of finite state machine with:

– A set of hidden states Q

– An output alphabet/observations O
– Transition probabilities A

– Output emission probabilities B


– Initial state probabilities Π

The system is in only one state at a time, and this state is not observable, but the output probabilities B produced from each state are accessible. Q and O are fixed, and the parameters of the HMM can therefore be defined as Λ = {A, B, Π} [19].

2.4 Speaker Diarization

Before starting to identify who is speaking, some preprocessing has to be done on the audio file used as input. The goal of speaker diarization is to divide the audio file into segments containing speech and to cluster together those segments belonging to the same speaker. In short, the task is to answer “Who is speaking when?”. Speaker diarization is a two-step task, where the first step is segmentation and the second is clustering. Figure 2.1 illustrates the result of speaker diarization.

Figure 2.1: The result obtained by performing speaker diarization

2.4.1 Voice Activity Detection

In order to analyze which person is speaking, the parts of the audio file that include speech need to be extracted; this process is often referred to as Voice Activity Detection (VAD). There exist multiple techniques for extracting speech, including measuring frame energy and using GMMs. The former is only able to distinguish speech from non-speech, while a method like GMM makes it possible to add categories like laughter, music and background noise.

This thesis will primarily be concerned with speech/non-speech detection.

2.4.2 Segmentation

In order to assess where there is a change of speaker, the segmentation looks for change-points in the audio. Segmentation methods generally fall into two main categories:

Metric based: Determines if two acoustic segments originate from the same speaker by computing the distance between them. Metric based methods treat a speaker's speech as having a Gaussian distribution.

Model based: Models are trained, with the help of supervision, to recognize speakers. The models are then used to estimate where there are change-points in the audio file. In practice, the segmentation task becomes an identification task, where the models often are GMMs.

Metric based segmentation is the most popular type of method and is often favored because no prior knowledge is needed [1, 31]. One of these metric based methods is Bayesian Information Criterion (BIC) segmentation.

Bayesian Information Criterion

BIC segmentation is a type of metric-based segmentation. BIC is what is called a penalized maximum likelihood model selection criterion. For an acoustic segment X, how well a model M fits the data is represented by a BIC value, which is defined as:

$BIC(M) = \log L(X|M) - \frac{\lambda}{2}\,\#(M)\,\log N \qquad (2.5)$

where $\log L(X|M)$ is the log-likelihood, λ is a penalty weight usually set to 1, N is the number of frames and #(M) is the number of parameters in M. When detecting change-points, a window is initialized and continuously expanded while looking for change-points. The search for change-points is really a comparison between two models, one with two Gaussians and one with just a single Gaussian. The difference between these models is expressed as

$\Delta BIC(i) = R(i) - \lambda P \qquad (2.6)$

R(i) is the likelihood ratio and is defined as:

$R(i) = N \log|\Sigma| - N_1 \log|\Sigma_1| - N_2 \log|\Sigma_2| \qquad (2.7)$

where the Σs are the sample covariance matrices from all the data, from $\{x_1, \ldots, x_i\}$ and from $\{x_{i+1}, \ldots, x_N\}$, respectively.

The penalty P , where d is the feature dimension, is expressed as:

$P = \frac{1}{2}\left(d + \frac{1}{2}\,d(d+1)\right)\log N \qquad (2.8)$

If ΔBIC(i) is positive, the model with two Gaussians fits the data best, and sample i represents a change-point [5].
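As an illustration, the following is a hedged numpy sketch of ΔBIC-based change-point detection over one analysis window, following Equations 2.6–2.8. The window handling, penalty weight and margin are illustrative assumptions rather than the exact scheme used by the toolkits in this thesis.

import numpy as np

def delta_bic(X, i, lam=1.0):
    # ΔBIC for a candidate change-point after frame i (Equations 2.6-2.8); X is an (N, d) array.
    N, d = X.shape
    logdet = lambda Z: np.linalg.slogdet(np.cov(Z, rowvar=False))[1]
    R = N * logdet(X) - i * logdet(X[:i]) - (N - i) * logdet(X[i:])     # Equation 2.7
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)                       # Equation 2.8
    return R - lam * P                                                  # Equation 2.6

def find_change_point(X, lam=1.0, margin=10):
    # Return the frame with the largest positive ΔBIC, or None if the window holds one speaker.
    scores = [delta_bic(X, i, lam) for i in range(margin, len(X) - margin)]
    best = int(np.argmax(scores))
    return best + margin if scores[best] > 0 else None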

2.4.3 Clustering

When clustering segments together, the similarity between segments is examined and a hierarchy of clusters is constructed. Two main approaches to building such a hierarchy exist:


Agglomerative: This is a bottom-up approach. Initially there is a single cluster for each segment, and clusters are iteratively merged together until the set of clusters reaches an optimal size.

Divisive: This is a top-down approach. In the beginning all segments are stored in one big cluster. A new speaker is recursively introduced and the cluster is split into smaller and smaller clusters.

Because agglomerative clustering offers a good balance between simplicity of structure and performance, it is the approach most commonly used [7].

In some way clustering is a simple form of identification. However, clustering differs from speaker identification because clustering methods are performed without previous knowledge of the speakers in the audio file.

2.5 Speaker Identification

The next phase in speaker recognition is to identify which clusters correspond to which speaker. The phase itself consists of two sub-phases, the first being the training phase and the second being the testing phase. The training phase involves training a model for each speaker.

Though the exact definition of what a model consists of depends on the methods used, the main idea is to have a set of parameters that can represent the characteristics in a speaker’s speech. A description of different methods for extracting such a model can be found in Chapter 4. The training phase can be done independently of a recognition system. If a set of speaker models is trained and a database of these models is made, the database can be incorporated as existing data in the recognition system. Functionality for adding new speakers may of course be added into such a system.

The testing phase takes data from a test speaker and compares this data to each model in the database. In this way the speaker in each cluster can be identified.

2.5.1 Closed- and Open-Set

Speaker identification can be performed on a closed set or an open set of speakers. If the set is closed, it is assumed that all the speakers being identified belong to the database. If the set is open, however, a decision about whether or not a speaker belongs to the database needs to be made. When dealing with an open-set speaker recognition system, the identification task consists of both a scoring phase and a verification phase, as illustrated in Figure 2.2. A sample of the speaker is compared with the models of the speakers in the database, giving each speaker a score based on how well the sample and the model match. Using a decision logic, the model having the best match to the speaker according to the score value is chosen. The verification phase is then entered, and a decision is made about whether or not the speaker is in the database. This is in most cases decided with the help of a threshold, and the decision to label the speaker as known or unknown is based on which side of the threshold the score lies [24, 18].

Figure 2.2: Illustration of speaker identification for an open-set database

2.5.2 Supervision

Methods are often divided into supervised and unsupervised. Supervised methods are methods where the identification models are trained using data from all speakers to be identified. These methods often model a speaker by the difference between that specific speaker and all other speakers, and because of this the data needs to be labeled in some way. In this way the identification method can know which portion of the data belongs to the speaker being modeled and which belongs to the “others”. The labels need to be assigned by a supervisor. For unsupervised methods the identification models are trained using data from only the speaker they represent; labeling is therefore not necessary.

2.6 Vidispine

Vidispine is an API media asset management platform, which among other things has support for file management, metadata management and media conversion. The metadata management allows Vidispine to offer advanced search over a media database. To facilitate the metadata management, Vidispine offers the ability to automatically populate the metadata for a video through a set of plug-ins that perform automatic recognition.


Chapter 3

Problem Description

The goal of this project is to develop a robust solution for speaker recognition integrated into Vidispine (as a plug-in). The focus of this thesis will be on the postprocessing of audio streams in order to extract the identity of the speaker from a segment, typically from a recorded meeting, video conference or TV-debate.

The result should include a demo that shows how it is possible to annotate the media asset with metadata from the speaker recognition solution. Existing libraries that could form the foundation for this solution should be evaluated.

The solution will handle the following, possibly individual tasks:

– The speaker segmentation task is to divide an audio stream into segments, each with a single speaker.

– The speaker clustering task is grouping the segments with the same speaker together.

– The speaker identification task is to determine which speaker out of a group produced the voice segments in a cluster.

– The learning task is where the system learns how to identify a specific speaker.

Limitations

– The solution only needs to handle a limited group of known possible speakers; in other words, the number of speakers and which voices are used in a specific audio stream are known.

– Learning could be a manual task.


3.1 Problem Statement

Is it possible to produce a speaker identification system with an 85% success rate for a set of 12 speakers where the gender division is fairly equal? The test data should be a TV-debate, and unknown speakers should occur.

3.2 Goal and Purpose

The goal of this project is to develop a robust solution for speaker recognition integrated into Vidispine (as a plug-in). When speaker recognition is integrated into Vidispine, the system can allow searching for time labels for where a specific person is speaking in a video. Furthermore, combined with speech recognition, the plug-in can be a powerful tool that allows searching for where a certain person says a specific phrase.

There should also exist a theoretical comparison of different speaker identification methods, and an evaluation of which method is best suited for this application. The purpose of this research is to examine the field of speaker recognition and the methods used for speaker identification. The theoretical research should take into account that the implementation will be used as a plug-in for Vidispine. The project is done in cooperation with, and at the request of, the IT consultancy firm Codemill located in Umeå.

3.3 Methods

To achieve the goals set for the thesis, the following methods should be applied.

Literature Research: A general study of the theoretical background of speaker recognition should be performed. Additionally, speaker identification should be studied more extensively.

Evaluation: Discussion and evaluation of the speaker identification methods, together with an evaluation of the existing software libraries that may be relevant for the implementation of these methods.

Design: Planning and designing the implementation of the plug-in and the evaluation methods used for testing.

Implementation: Programming a prototype of the plug-in in C++ as this is the programming language used in Vidispine. The prototype functionality needs to be tested and refined.

Testing: Writing an evaluation program that can test the prototype against a manually acquired correct speaker-label file.


3.4 Related Work

Codemill has two other students working on similar projects connected to video information analysis, namely speech and face recognition. The goal is to combine these three applications into one powerful video analyzer and search tool.

Examples of articles that focus on speaker identification in the context of broadcast and video analysis are [12] and [2].


Chapter 4

Comparison of Speaker Identification Methods

As a part of an in-depth study for this thesis a presentation of four different speaker identification methods will be given. The chapter will end with a comparison of the methods together with a discussion about which method is best suited for the problem presented in this thesis. The methods to be analyzed are:

– Gaussian Mixture Model
– Vector Quantization
– Support Vector Machine
– Artificial Neural Network

The methods chosen are all classic and basic approaches to speaker identification. Some studies show that combining methods leads to better performance, but since the main goal of this in-depth study is to go deeper into the methods rather than to seek the ultimate result, the methods are studied separately. After the description of the mentioned methods, a brief presentation of more modern approaches is given.

4.1 Gaussian Mixture Model

Gaussian Mixture Model (GMM) was introduced in Section 2.2 and is discussed in more detail in this section. Each speaker in the database is represented by a GMM λ. When training the GMMs, the goal is to estimate the parameters of λ. The Expectation-Maximization (EM)-Maximum Likelihood (ML) algorithm is the most popular method for this task.


4.1.1 EM-ML Algorithm

The ML estimation calculates the model parameters maximizing the likelihood of the GMM, according to a set of training data. This likelihood is defined as:

$p(X|\lambda) = \prod_{t=1}^{T} p(\vec{x}_t|\lambda) \qquad (4.1)$

where $T$ is the number of training vectors $X = \{\vec{x}_1, \ldots, \vec{x}_T\}$.

The ML parameter estimates are obtained by applying the iterative EM algorithm. Each EM iteration begins with an initial model λ and ends with a new estimated model $\bar{\lambda}$ such that $p(X|\bar{\lambda}) \ge p(X|\lambda)$. The new model becomes the initial model for the next iteration. The process continues in this fashion until the maximum is found with an accuracy within a given threshold [25].

4.1.2 Identification

To identify who is speaking, a search for the λ that maximizes the probability is performed. The identification rule for a set of S speakers and a set of X = {~x1, . . . , ~xT} observations from a speaker (e.g. extracted from a sound file) can be defined as:

$\hat{S} = \arg\max_{1 \le k \le S} \sum_{t=1}^{T} \log p(\vec{x}_t|\lambda_k) \qquad (4.2)$

where $p(\vec{x}_t|\lambda_k)$ was defined in Equation 2.3 [25, 3].
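The following is a hedged scikit-learn sketch of this approach: per-speaker GMMs are trained with EM-ML and an unknown segment is assigned to the model that maximizes the summed log-likelihood of Equation 4.2. The number of mixture components, the diagonal covariances and the dictionary-based interface are illustrative assumptions, not the configuration used in the thesis prototype.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(training_features, n_components=32):
    # training_features: dict mapping speaker name -> (T, D) array of MFCC vectors.
    models = {}
    for speaker, feats in training_features.items():
        # EM-ML estimation of the GMM parameters λ for this speaker.
        models[speaker] = GaussianMixture(n_components=n_components,
                                          covariance_type='diag').fit(feats)
    return models

def identify(models, test_features):
    # Return the speaker whose model maximizes the summed log-likelihood (Equation 4.2).
    scores = {spk: gmm.score_samples(test_features).sum() for spk, gmm in models.items()}
    return max(scores, key=scores.get)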

4.2 Vector Quantization

Vector quantization (VQ) originates from the 1980s and is a simple and easily implemented model for text-independent speaker identification [32]. Originally intended for lossy data compression, VQ is in speaker modeling used to represent a large amount of data as a small set of points. This is done by constructing, from training data, a VQ codebook for each speaker containing a small number of entries. These entries are code vectors $C = \{c_1, \ldots, c_K\}$, and each code vector is a representation of a cluster (not to be confused with the audio cluster) formed by clustering the feature vectors of a speaker. The code vector is the centroid of the cluster.

The average distortion function is calculated to match two speakers:

$D_Q(A, C) = \frac{1}{T} \sum_{i=1}^{T} \min_{1 \le j \le K} d(a_i, c_j) \qquad (4.3)$

given a set of T feature vectors $A = \{a_1, \ldots, a_T\}$ from an unknown speaker, where d is a distance measure between a vector and the closest centroid in the codebook.


The smaller the value of $D_Q$ is, the higher the likelihood that the feature vectors originate from the same speaker. This means that when performing VQ for closed-set speaker recognition, the speaker corresponding to the codebook that yields the smallest $D_Q$ is chosen. A visual presentation of the concept of VQ is shown in Figure 4.1 [32, 8].

Figure 4.1: Conceptual presentation of two-dimensional VQ matching for a single speaker
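A hedged numpy sketch of VQ matching via the average distortion of Equation 4.3 is given below. Codebooks are assumed to be arrays of centroids and d the squared Euclidean distance; both are illustrative choices.

import numpy as np

def avg_distortion(features, codebook):
    # features: (T, D) vectors from an unknown speaker; codebook: (K, D) centroids.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2) ** 2
    return dists.min(axis=1).mean()            # Equation 4.3 with squared Euclidean d

def identify_vq(codebooks, features):
    # Closed-set decision: the speaker whose codebook gives the smallest average distortion.
    return min(codebooks, key=lambda spk: avg_distortion(features, codebooks[spk]))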

4.2.1 Codebook Construction

For each speaker in the database, a VQ codebook is constructed. Though [15] concludes that the crucial point in generating a codebook is not the method used for clustering but the size of the codebook, a well-known algorithm is presented here, namely the LBG algorithm. The algorithm is named after its authors Y. Linde, A. Buzo and R. Gray, who presented their work in [17]. The algorithm can be described as a split-and-cluster algorithm, because a codebook is generated by, in each iteration, splitting each codebook vector in two and then clustering all the new codebook vectors [35]. An overview of the algorithm is presented in Algorithm 1, and a small code sketch follows below.
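The following is a hedged Python sketch of the LBG split-and-cluster procedure outlined in Algorithm 1, assuming squared Euclidean distance; the splitting parameter, the stopping rule (change in distortion rather than the raw average distance) and the power-of-two codebook size are illustrative assumptions.

import numpy as np

def lbg_codebook(train, size=64, eps=0.01, xi=1e-4, max_iter=50):
    # train: (T, D) training vectors; size should be a power of two in this sketch.
    codebook = train.mean(axis=0, keepdims=True)            # start with one centroid
    while len(codebook) < size:
        # Split every centroid into (1 + eps)*c and (1 - eps)*c.
        codebook = np.vstack([(1 + eps) * codebook, (1 - eps) * codebook])
        prev = np.inf
        for _ in range(max_iter):
            # Assign every training vector to its nearest centroid.
            d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            distortion = d.min(axis=1).mean()
            # Update each centroid as the mean of its cluster (empty clusters are left unchanged).
            for k in range(len(codebook)):
                if np.any(labels == k):
                    codebook[k] = train[labels == k].mean(axis=0)
            if prev - distortion < xi:                       # distortion has converged
                break
            prev = distortion
    return codebook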

4.3 Support Vector Machine

Support vector machine (SVM) is a powerful and robust method for speaker identification. The idea behind the method is to create a hyperplane that can separate two speakers from each other with as large a margin as possible. When creating an SVM, the process starts with a given set of training vectors for two speaker classes. These training vectors are labeled with 1 or −1 by a supervisor. Using these training data, a hyperplane is created in such a manner that the margin between the hyperplane and the training vectors is maximized. The vectors that define how big this margin is are called support vectors.


Data: A set of training vectors
Result: A codebook of size N
Initialize a codebook C0 with a single centroid generated by taking the average of the training vectors;
N = desired codebook size;
ξ = threshold;
M = 1;
while codebook size M < N do
    Split every codebook vector in two, so that Ci = (1 + ε)Ci and Cj = (1 − ε)Ci for splitting parameter ε;
    M = 2M;
    while average distance ≥ ξ do
        Assign every training vector to a cluster according to the centroids in the current codebook;
        Update the centroid for each cluster;
    end
end
Algorithm 1: LBG algorithm for codebook construction

Methods that make decisions based on how the feature space is divided by separating hyperplanes are called discriminative. Figure 4.2 shows the principles behind SVM.

The identification of a speaker is done by the function:

$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \qquad (4.4)$

where the $x_i$ are support vectors derived from the training vectors, $y_i \in \{-1, +1\}$ are the labels of the support vectors and $\alpha_i$ are their weights. $K(x, x_i)$ is the kernel function, defined as

$K(x, y) = \Phi(x)^T \Phi(y) \qquad (4.5)$

so that $\Phi(x)$ maps from the input space to a kernel feature space of higher dimension. This is done because it is easier to find a separating linear hyperplane in this higher dimension [16, 14].

Since SVM is a two-class classifier, there exist methods for making multi-class SVMs. Two of the approaches are called one-against-all and one-against-one. The former constructs k SVM models for k speakers, where each model has a positive class representing a single speaker and a negative class representing all other speakers. The SVM that has the highest value of f(x) defines which class the speaker belongs to. The second approach, one-against-one, pairs up all speakers and constructs k(k − 1)/2 SVM models. Each SVM is trained to model the difference between two speakers. Deciding which class a speaker belongs to can be done in multiple ways. For example, in a strategy called “max wins”, each SVM is run, and for each “win” a speaker gets, it receives a vote. The speaker class that receives the most votes in the end is the winner. Another strategy is a knock-out system, where the winner of each “fight” is paired up again for a new fight until there is one winner. For practical applications, the one-against-one method has been shown to be favorable [13, 11].

Figure 4.2: The principles behind SVM
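The following is a hedged scikit-learn sketch of one-against-one SVM identification on frame-level features: SVC builds the k(k − 1)/2 pairwise classifiers internally and applies max-wins voting per frame, and a majority vote over the frames of a segment is added on top. The RBF kernel and the frame-voting step are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

def train_svm(training_features):
    # training_features: dict mapping speaker name -> (T, D) feature array.
    X = np.vstack(list(training_features.values()))
    y = np.concatenate([[spk] * len(f) for spk, f in training_features.items()])
    # decision_function_shape='ovo' exposes the internal one-against-one decision values.
    return SVC(kernel='rbf', decision_function_shape='ovo').fit(X, y)

def identify_svm(clf, test_features):
    # Classify every frame, then let the speaker winning the most frames take the segment.
    frame_labels = clf.predict(test_features)
    labels, counts = np.unique(frame_labels, return_counts=True)
    return labels[np.argmax(counts)]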

4.4 Artificial Neural Network

Artificial Neural Network (ANN) is a method inspired by the way neurons operate in the human brain. The method works very well for problems where only limited information about the problem and its input characteristics is known. Though there exist multiple ways to represent an ANN, the focus here is on a type of feed-forward network called the multilayer perceptron (MLP). The MLP architecture consists of an input layer, zero or more hidden layers and an output layer.

The layers and their nodes are connected with weights [9]. Figure 4.3 gives an illustration of an MLP.

The usage of MLPs to identify a speaker is similar to the methods used for SVM. For k speakers, k(k − 1)/2 MLPs are created, where each model decides between two speakers, and a strategy for selecting the ultimate winner has to be chosen. An alternative is to make k MLPs, where each model decides between a single speaker and an input representing all other speakers.

Research by Rudasi and Zahorian (1991) concludes that using this type of binary-pair network gives better performance than constructing a single large network [28]. The single network especially suffers from long training time, which grows exponentially with the number of speakers. MLPs generally have problems with large populations. For example, the MLP often converges to a local optimum instead of the global one. Also, if the majority of inputs should yield 0 and only a small percentage yield 1, the positive answers are interpreted as deviations and the MLP learns that all inputs should yield 0 [22].


Figure 4.3: An MLP with inputs x, a single hidden layer and an output y

4.4.1 Backpropagation

There exist several methods for training the MLP. Here, a method called backpropagation is presented. If the goal is to make a binary network, it is possible to create an MLP where the inputs $x_i$ are feature vectors from a speaker and the output layer has one node yielding an output $y = \{0, 1\}$. Each node consists of an input function and an output function. These functions can differ, but one example is to let the first function sum the total input, $\sum_{j}^{N} w_{j,i}\, x_j$, over a node i's preceding nodes j, while the output activation function is a simple sigmoid function. When the network is being trained, the process consists of finding the set of weights that, to the highest degree possible, ensures that for a given input the desired output is acquired [29].

Initially all the weights are random values. For each training example, two phases are conducted: feed-forward and backpropagation [30]. The feed-forward phase feeds the data to the network, looks at the resulting output, and calculates the output error E = y − d, where d is the desired output. The backpropagation phase iterates backwards through the network trying to correct the weights. The derivative of the activation function is used to estimate how much the hidden layers affect the output. The weights leading to the output are updated according to

$w_{j,i} = w_{j,i} + \alpha\, a_j\, E\, g'(in_i) \qquad (4.6)$

where $a_j$ is the output from the activation function in the hidden node and $g'$ is the derivative of the activation function. This is followed by backpropagation through the rest of the layers, updating the weights according to

$w_{k,j} = w_{k,j} + \alpha\, I_k\, \Delta_j \qquad (4.7)$


where α is the learning rate, $I_k$ is the input to the weight, and

$\Delta_j = g'(in_j) \sum_i w_{j,i}\, E_i\, g'(in_i). \qquad (4.8)$

4.5 Modern Additions

Here is a presentation of some modern approaches to speaker identification.

Fusion: The method of combining pieces of information from several methods to enhance the overall result [16].

Supervectors: Mapping a large set of low-dimensional vectors into a single high-dimensional vector. This is for example used with GMM speaker models, whose parameters are transformed into a supervector that is used as input to an SVM [33, 16].

GMM-UBM: A technique to make the GMM perform well for a small set of training data. Since each model is trained independently, a small amount of training data leads to a very sparse model [27]. The GMM-UBM adds knowledge of the whole world of possible speakers and therefore produces a richer model. The method is further described in the following subsection.

4.5.1 GMM-UBM

Instead of creating a single independent model for each speaker, a speaker-independent Universal Background Model (UBM) $\lambda_0$ is constructed. The UBM is trained with data from all speakers. In order to derive the speaker-dependent model λ, Maximum A Posteriori (MAP) estimation is performed [27]. To get the score for an observation, the log-likelihood ratio between a speaker model and the UBM is computed:

$S(X'|\lambda, \lambda_0) = \log \frac{p(X'|\lambda)}{p(X'|\lambda_0)} = \frac{1}{T'} \sum_{t=1}^{T'} \log \frac{p(\vec{x}'_t|\lambda)}{p(\vec{x}'_t|\lambda_0)} \qquad (4.9)$

where $X' = \{\vec{x}'_1, \ldots, \vec{x}'_{T'}\}$ is a set of feature vectors [34].
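The scoring step of Equation 4.9 can be sketched as follows, assuming that the UBM and the MAP-adapted speaker model are both available as scikit-learn GaussianMixture objects; the MAP adaptation itself is not shown, and the use of scikit-learn is an illustrative assumption (the thesis prototype uses ALIZE for this).

import numpy as np

def llr_score(speaker_gmm, ubm, test_features):
    # score_samples returns log p(x_t | model) for every frame, so the mean of the
    # difference is (1/T') * sum_t [log p(x'_t|λ) - log p(x'_t|λ0)] of Equation 4.9.
    return np.mean(speaker_gmm.score_samples(test_features)
                   - ubm.score_samples(test_features))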

4.6 Comparison

As explained in Section 2.5.2, identification methods can be either supervised or unsupervised. Of the ones introduced in this chapter, the Support Vector Machine and the Artificial Neural Network are supervised, while the Gaussian Mixture Model and Vector Quantization are unsupervised. The supervised methods have the advantage that they create models defining the difference between a specific model and the rest of the database of models. This task demands a big set of training data, and it makes the methods more computationally heavy than the unsupervised methods. The unsupervised methods have the advantage that it is not necessary to reconfigure the whole database of models every time a new speaker is added. The unsupervised methods also benefit from demanding a smaller amount of training data, making them less computationally heavy. SVM and ANN are discriminative methods, where a model describes the difference between speakers, while GMM and VQ are generative methods. Generative methods describe the distribution of the variation in a speaker's speech, and likelihoods and probabilities are used for making the identification decisions [16].

The main advantage of GMM is that it has proven high accuracy and is computationally inexpensive, but it does require a sufficient amount of data to perform well [25, 26]. SVM has also been shown to have good performance, and the method makes it easy to determine a strict boundary between two separate classes [16]; on the other hand, the performance depends on the kernel function, which is not always easy to choose. This is in contrast to VQ, which has a very simple structure to implement; it is also computationally efficient and only needs a small set of training data. VQ is, however, less robust and is therefore easily affected by noisy data and the variability that occurs in speech. The discriminative function used in ANN makes it very powerful and the architecture is adaptable, but finding the ideal structure and number of hidden layers, together with the number of nodes in each layer, is difficult. Additionally, the performance decreases rapidly when the number of speakers in the database increases.

4.6.1 Evaluation

When evaluating which method to use in the application, a set of properties of the data used in the application needs to be considered. The considerations to be made can be summarized as follows:

– There is a limited amount of data

• When a new speaker is added to the database, the user should only need to add a small and convenient amount of training data. The amount of data for testing will vary, and the application should therefore be able to handle small amounts of test data.

– Only a limited number of different speakers are to be recognized, which means that the database will be small.

• Although this thesis focuses on a given small set of speakers, it would be convenient if new speakers could easily be added to the system. The system should have the potential to support a larger database.

– The data originates from many different types of settings.


• There is little control over the number of microphones and amount of noise. Therefore, the method should be as robust as possible.

– The goal is a robust solution with high accuracy.

Of these properties, robustness and high accuracy are the most important. VQ has both lower performance than the other methods and is easily affected by noise, so it can quickly be excluded from the best choices. The methods that fulfill accuracy and robustness best are SVM and GMM. GMM only performs well when given a sufficient amount of training data. On the other hand, the performance of SVM is highly affected by finding the right kernel function, which is similar to a problem with ANN: choosing to implement ANN would mean that a lot of time is consumed searching for the right architecture. This is of course relevant because the work in this thesis is highly limited by time. Additionally, ANN has problems meeting the wish for a speaker database that can be extended.

Based on this short summary, GMM and SVM seem like the best alternatives, even though there still exist some problems with both. GMM has the advantage that there exists an easy extension to the method in the form of GMM-UBM. This method eliminates GMM's main problem concerning the amount of training data needed to ensure good performance. What is left is a method that offers high accuracy despite limited training data. SVM also holds these properties, but in comparison with GMM, SVM is more sensitive to noisy training data, making it a less robust method than GMM. In addition, SVM is more computationally heavy. GMM, and more specifically GMM-UBM, therefore becomes the favored alternative. Though it is less important in the reasoning in this chapter, the fact that GMM-UBM is easier to implement in the practical context of this thesis is a very positive property.


Chapter 5

Vidispine Plug-in

This chapter includes a description of how the prototype of the Vidispine plug-in is implemented, with focus on the software libraries that are used.

5.1 Implementation Choices

The prototype consists of two main modules, one for speaker diarization and one for speaker identification. Dividing the prototype into two separate modules in this way means that there is a clear division between the unsupervised part and the supervised part of the system. The speaker diarization prepares the audio file in such a manner that the speaker identification module can concentrate solely on recognizing the individual speakers. Speaker identification often requires a sufficient amount of data to perform well, and it is therefore an advantage that all the data belonging to the same speaker is clustered together. In that way the identification can be performed on a larger set of data, and hopefully achieve higher accuracy.

The plug-in is implemented as a text-independent speaker identification system for an open set. The application of the plug-in, as a video analyzer, clearly makes this a speaker identification problem. The first of two reasons for making the plug-in text-independent is the difficulty of obtaining recordings of all speakers in the database speaking specific words. The second reason is that a text-independent system is also language-independent. It is unclear which languages the plug-in will be used for, but both Swedish and English are likely candidates. In practice, it is hard to have models for all speakers that may appear in a video, and it is for this reason the set is chosen to be open.

Open-set text-independent speaker identification is considered the hardest type of speaker recognition [12].


5.2 Vidispine

To perform speaker recognition, a user communicates with Vidispine by requesting that speaker recognition be performed on a specified video. Vidispine then converts the video file into a WAVE audio file with one channel and a sample rate of 16 kHz. This becomes the input to the speaker recognition system, whose first phase is the speaker diarization. When the speaker recognition is finished, it converts the extracted information into a metadata format and updates the metadata in the Vidispine database.

5.3 Speaker Diarization

To implement speaker diarization, several open source tools were considered, among them LIUM SpkDiarization, ALIZE's LIA RAL package, SHoUT and AudioSeg. A short summary of the author's impressions of the tools:

LIUM SpkDiarization: A set of tools written in Java dedicated to speaker diarization, developed for use on TV shows. Since it is written in Java, it is necessary to call the JAR file through the command line in order to use the tools. The results are, however, satisfying, after finding the right type of test data.

ALIZE/LIA RAL: A high-level toolkit written in C++. Even though several articles state that they used ALIZE/LIA RAL for their implementation, implementing something that produces satisfying results was not successful. Though time was a factor, the main problem was the lack of documentation.

AudioSeg: A toolkit devoted to audio segmentation and indexing, written in C. Since the tools were at such a high level, great flexibility was not available. With limited documentation, figuring out how to manipulate the result to become acceptable was hard. In addition, AudioSeg itself states that the purpose of the toolkit is to help prototyping.

SHoUT: A speech recognition toolkit written in C++, developed solely by Marijn Huijbregts during his Ph.D. work. The toolkit includes support for speech/non-speech detection and speaker diarization.

LIUM SpkDiarization (from here on referred to only as LIUM) was chosen as the tool for implementing the speaker diarization module. ALIZE/LIA RAL was excluded because of the lack of good documentation, and AudioSeg offered too little flexibility.

Tool homepages:
LIUM SpkDiarization: http://www-lium.univ-lemans.fr/diarization/doku.php/welcome
ALIZE/LIA RAL: http://mistral.univ-avignon.fr/index_en.html
SHoUT: http://shout-toolkit.sourceforge.net/index.html
AudioSeg: audioseg.gforge.inria.fr/


Both LIUM and SHoUT have good documentation with an online wiki, but LIUM was preferred because it allows greater flexibility. The ability to manually tweak some of the parameters was seen as more important than using a library in the same programming language that could be linked directly into the program. Figure 5.1 shows the process of using the tools in LIUM to perform speaker diarization [21].

5.3.1 Preprocessing

LIUM takes an audio file and a feature file as input. The input formats that LIUM accepts are Sphinx, SPro4, gzipped text and HTK. The chosen format for this task is Sphinx, which is the format that the examples in the documentation use. Before the audio file received from Vidispine can be sent to LIUM, some audio transformation needs to be done.

Parameterization

The first step is to transform the audio file into the .sph file type using SoX. Then, features are extracted from the audio file into a feature file in Sphinx format. The tool sphinx_fe is used to produce a feature file with 13 MFCC coefficients. A segmentation file with one big segment from frame 0 to n − 1, where n is the total number of frames in the audio file, is initialized. The parameterization step ends with safety checks on the feature file: the first check makes sure that the file is as long as it is supposed to be, while the second ensures that series of multiple identical vectors do not exist. The initial segmentation and the checks are done using the LIUM tool MSegInit.
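A hedged sketch of this conversion and feature extraction step is shown below, invoking SoX and sphinx_fe as external programs from Python. The exact sphinx_fe flags are assumptions and should be checked against the sphinxbase documentation.

import subprocess

def parameterize(wav_path, sph_path, mfc_path):
    # SoX: resample to 16 kHz mono; the output format is inferred from the .sph extension.
    subprocess.run(['sox', wav_path, '-r', '16000', '-c', '1', sph_path], check=True)
    # sphinx_fe: extract 13 MFCC coefficients from the Sphere file (flag names are illustrative).
    subprocess.run(['sphinx_fe', '-i', sph_path, '-o', mfc_path,
                    '-nist', 'yes', '-ncep', '13'], check=True)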

5.3.2 Voice Activity Detection

To segment away music and jingles, the LIUM tool Decode is used. This tool performs basic Viterbi decoding, with an input of 8 one-state HMMs, where each state is represented by a GMM composed of 64 Gaussians with diagonal covariance. These GMMs have been trained by LIUM using EM-ML, with a feature vector consisting of 12 MFCC coefficients (C0 is removed) together with δ coefficients. The 8 HMMs represent silence (wide and narrow band), clean speech, speech over noise, speech over music, speech (narrow band), jingles and music.

Format and tool homepages:
Sphinx: http://cmusphinx.sourceforge.net/
SPro4: http://www.irisa.fr/metiss/guig/spro/spro-4.0.1/spro_4.html
HTK: http://htk.eng.cam.ac.uk/
SoX: http://sox.sourceforge.net/


5.3.3 BIC Segmentation

BIC segmentation is performed using the LIUM tool MSeg. The tool does not take the output from VAD as input, but rather performs change detection on the audio file as a whole. The goal of this segmentation is to separate the audio file into “homogeneous” segments.

5.3.4 Linear Clustering

In order to merge adjacent segments originating from the same speaker, linear clustering is run. This is done with the LIUM tool MClust with the option --cMethod=l. This option makes the algorithm go from start to end in the audio file. The algorithm evaluates the similarity value on the diagonal of the similarity matrix in the Gaussian model. The clustering algorithm uses a BIC measure similar to the one used for segmentation, but here only consecutive segments are considered.

5.3.5 Hierarchical Clustering

In order to perform agglomerative hierarchical clustering, the LIUM tool MClust with the option --cMethod=h is used. As described in Section 2.4.3, each segment starts out as a cluster, and clusters that are similar (according to BIC) are iteratively merged together until no $\Delta BIC_{i,j} > 0$ for clusters i and j.

5.3.6 Initialize GMM

For each segment a GMM is initialized, and this is done by running the LIUM tool MTrainInit.

5.3.7 EM Computation

To train the GMMs initialized in the previous step, the LIUM tool MTrainEM is run. The training is done on the segments of the clusters computed in the hierarchical clustering, and aims to estimate the GMM parameters λ using ML estimation together with the EM algorithm.

5.3.8 Viterbi Decoding

A new segmentation is now generated by Viterbi decoding using the LIUM tool MDecode. The clustered segments from the hierarchical clustering are used as input together with the GMMs trained in the previous step. The Viterbi decoding models a cluster as an HMM with one state represented by a GMM, and finds the most likely sequence of hidden states from the HMM parameters and the observation sequence [19].


5.3.9 Adjustment of Segment Boundaries

Because the segment boundaries produced by the Viterbi decoding are not perfect, the LIUM tool SAdjSeg is used to adjust them. The tool ensures that the boundaries are set in areas with low energy.

5.3.10 Speech/non-speech Filtering

At this point the speech/non-speech segmentation conducted early in the process is brought back. The segmentation is filtered with the speech/non-speech segmentation using the LIUM tool SFilter.

5.3.11 Splitting of Long Segments

It is often convenient that the segments are shorter than 20 seconds. To ensure this, the LIUM tool SSplitSeg is used to split up long segments at low-energy areas. The tool takes GMMs trained by LIUM as input.

5.3.12 Gender and Bandwidth Selection

Detection of gender and bandwidth is done with the LIUM tool MScore together with GMMs trained by LIUM. Each cluster gets a female/male and narrow/wide band label, chosen so that the characteristics of the GMM maximize the likelihood of the features in the cluster.

5.3.13 Final Clustering

All previous steps use features that are not normalized, because this preserves information and helps distinguish speakers. However, the process is now at a point where each cluster should contain a single speaker. It might be a problem at this stage that multiple clusters contain the same speaker. Therefore, a final hierarchical agglomerative clustering with the LIUM tool MClust with the option --cMethod=ce is run. The Universal Background Model (UBM) trained by LIUM is used, which is a fusion of the GMMs found in the gender and bandwidth step.

After going through all these steps, the result is in a form that can be used by the speaker identification module. The important properties of the final result are that the segments are shorter than 20 seconds and contain a single voice that is labeled with gender and bandwidth.

5.3.14 Postprocessing

The final result from the LIUM computation is a label file where each line represents a segment in the audio file. Each line states, among other things, the start time, length, gender and which cluster the segment belongs to. To prepare for speaker identification, the audio file is split up and merged into a single audio file for each cluster.


Figure 5.1: The use of LIUM for speaker diarization.

5.4 Speaker Identification

The selection of a speaker identification method was based on the in-depth study done in Chapter 4, where it is argued that GMM is the best alternative. An important additional factor to consider when choosing a method is the availability of software libraries. A list of some libraries that are available for speaker identification is presented below:

Torch: A state-of-the-art SVM and neural network library written in Lua.

HTK: A low-level HMM toolkit primarily directed towards speech recognition.

ALIZE/LIA RAL: A state-of-the-art GMM toolkit, specialized in speaker verification and written in C++.

Toolboxes for MATLAB: for example the Statistical Pattern Recognition Toolbox.

Though there exist multiple examples of articles using ALIZE/LIA RAL (from here on referred to only as ALIZE) for speaker diarization [7], the amount of documentation for that purpose is sparse. For speaker identification, more examples and documentation exist, and ALIZE is therefore easier to use for this purpose. Since it is a widely used toolkit and offers GMM-UBM, ALIZE is chosen as the software library for speaker identification in this implementation.

Though ALIZE is written in C++ and therefore could be linked directly with the plug-in, it is more convenient to use the existing set of pre-compiled programs and start them through the command line.

Library homepages:
Torch: http://www.torch.ch/
HTK: http://htk.eng.cam.ac.uk/
Statistical Pattern Recognition Toolbox: http://cmp.felk.cvut.cz/cmp/software/stprtool/


An illustration of the speaker identification module is shown in Figure 5.2 [4].

Figure 5.2: Illustration of the speaker identification process

5.4.1 Preprocessing

For all data that is used (training and testing), some preprocessing needs to be done. Though ALIZE can be configured to use multiple feature file formats, SPro4 is recommended by ALIZE. All training files are converted to .sph format through SoX, and then the feature files are extracted using the SPro command sfbcep. The command produces MFCC feature vectors with a total of 34 coefficients.

Silence Removal

To decide which vectors in the feature files are relevant, speech detection using the ALIZE tool EnergyDetector is performed. Speech segments in the training data are acquired by taking the frames with the highest energy. The tool produces a label file which states which portions of the data are speech. Both before and after silence removal, the energy coefficients in the feature files are normalized. For the normalization performed afterwards, only the segments labeled as speech are considered. The normalization is done using the ALIZE tool NormFeat, which uses zero-mean and unit-variance normalization.


5.4.2 Training

Before one can start to recognize the speakers, a model for each speaker in the database needs to be trained. First the UBMs are created, and then an individual GMM model for each speaker is extracted from the UBMs. Training is done as a separate process, and the UBMs and GMM models are saved as data in Vidispine.

UBM training

To increase accuracy, two UBMs are created, one for each gender. This can be done because the information about the gender of the segments is obtained through the diarization process. Both UBMs are created using the ALIZE tool TrainWorld and are trained with the EM algorithm on data from the speakers of each gender.

Model Extraction

For each speaker to be identified, a GMM model is extracted. The model is extracted using data from the speaker and the UBM matching the gender of the speaker. This is done by the ALIZE tool TrainTarget, which uses MAP estimation. To increase performance, the training data used for model extraction is different from the data used to train the UBM.

5.4.3 Testing

When testing in the speaker identification module, the segments corresponding to an unknown speaker, acquired from the speaker diarization, are used. To test an unknown segment, the ALIZE tool ComputeTest is used. As input it takes the feature file of the segment, all the speaker GMMs and the UBM corresponding to the gender of the segment's speaker. This process outputs a score for each speaker considered. The speaker with the highest score is chosen as the identity of the segment's speaker. Since the thesis focuses on an open-set problem, the module also checks whether the best score exceeds a threshold. If it is lower than the threshold, the module labels the speaker as unknown.
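
The decision rule can be sketched as follows: each frame of the segment is scored against every speaker GMM of the matching gender and against the UBM, the average log-likelihood ratio is taken as the speaker score, and the best score is compared with the threshold. The model objects are assumed to expose score_samples as scikit-learn's GaussianMixture does; this illustrates the rule, not ComputeTest itself.

import numpy as np

def identify_segment(segment_features, speaker_models, ubm, threshold):
    """Return (speaker name, score) for the best-scoring model, or
    ("unknown", score) if the best score does not reach the threshold."""
    ubm_ll = ubm.score_samples(segment_features)   # per-frame log-likelihood under the UBM
    scores = {name: float(np.mean(gmm.score_samples(segment_features) - ubm_ll))
              for name, gmm in speaker_models.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return "unknown", scores[best]
    return best, scores[best]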

5.4.4 Postprocessing

The result obtained by performing the previous steps is similar to that from the speaker diarization module. The difference is that instead of the label file stating a cluster, the file contains the name of the speaker.

The information obtained so far is somewhat unrefined, and a cleanup is required to better suit the application of the plug-in. The speaker recognition will occasionally recognize small sound outbursts as someone talking. In most cases these are just random sounds such as laughter, background noise, etc. Therefore, to clean up the information, segments shorter than 1.9 seconds that have no neighboring segment within 5 seconds identified as the same speaker are filtered away.

The second postprocessing task performed on the output from the speaker identification is merging. Considering that the application of the system is adding metadata to a video specifying where different speakers are speaking, and that this information is supposed to be used by, for example, a search application, the information should be as compact as possible without losing important data. For these reasons, neighboring segments that have at most 5 seconds between them and are identified as belonging to the same speaker are merged together. As a practical example, one can imagine a person searching for where a specific actor is speaking in a movie. The person searching is not interested in getting a long list of segments when several of the segments belong to the same long monologue; multiple segments may just be the result of the actor taking small thinking pauses. Instead, the user wants a single segment for the whole monologue. In theory, the merged segments that are constructed are true segments as described in the theory section. Ideally these would already have been constructed perfectly by the speaker diarization itself, but in most cases errors and inaccuracies exist. The merging is, in a way, a reassurance that a reasonable result is obtained.
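
The two cleanup rules can be expressed compactly as below. The sketch operates on (start, end, speaker) tuples in seconds, sorted by start time; it illustrates the rules as described above rather than the plug-in's actual implementation.

def postprocess(segments, min_len=1.9, neighbor_gap=5.0, merge_gap=5.0):
    """Drop short, isolated segments and merge close same-speaker neighbors."""
    def has_close_same_speaker(i):
        s, e, spk = segments[i]
        return any(j != i and spk2 == spk and
                   (abs(s2 - e) <= neighbor_gap or abs(s - e2) <= neighbor_gap)
                   for j, (s2, e2, spk2) in enumerate(segments))

    # Rule 1: keep short segments only if a same-speaker neighbor is nearby.
    kept = [seg for i, seg in enumerate(segments)
            if (seg[1] - seg[0]) >= min_len or has_close_same_speaker(i)]

    # Rule 2: merge consecutive same-speaker segments separated by a small gap.
    merged = []
    for s, e, spk in kept:
        if merged and merged[-1][2] == spk and s - merged[-1][1] <= merge_gap:
            prev_s, prev_e, _ = merged[-1]
            merged[-1] = (prev_s, max(prev_e, e), spk)
        else:
            merged.append((s, e, spk))
    return merged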


Chapter 6

Results

In this chapter, the evaluation methods for testing the prototype are presented together with a discussion of the test results.

6.1 Evaluation Methods

In order to evaluate the performance of the system, a video will be used as test data, and the number of correctly identified segments will be counted. This video has been manually labeled so that each label contains a true segment. Each label contains information about the start and length of the segment, along with the correct speaker. A simple evaluation program comparing the manually derived label file with the label file produced by the system is used. This program also notes which types of errors the system has made, which enables an evaluation of the factors that contribute to reducing the accuracy of the system.

The manual labeling of the test file is done by the author without the help of any tools. As with all elaborate manual work, the result is not perfect: the label file contains some inaccuracy in the bounds of each segment. One can argue that for the main purpose of the application, perfect accuracy of the boundaries is not necessary. Furthermore, recognizing a small segment that is shot into a sequence where someone else is speaking is very difficult to handle. An example of this is when a person is speaking and another person interjects with a single word. The information that a person says one or maybe two words is not that interesting for someone searching for where somebody is saying something valuable. Taking very short segments into account can of course be interesting in some applications, but in this thesis a choice is made to ignore them. To compensate for these two possible flaws, a 5 second margin is allowed for the segment bounds, and small segments (less than 2 seconds) in between segments from the same speaker may be ignored. In that way, the two surrounding segments are seen as one large segment.
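
Under these assumptions, the matching rule used by the evaluation program can be sketched as follows: a reference segment counts as correctly identified if the system's label file contains a segment with the same speaker whose bounds lie within the 5 second margin. The function and tuple layout are hypothetical; the thesis's evaluation program is a separate tool.

def is_correct(ref_segment, hyp_segments, margin=5.0):
    """ref_segment and hyp_segments hold (start, end, speaker) tuples in seconds."""
    r_start, r_end, r_spk = ref_segment
    return any(abs(h_start - r_start) <= margin and
               abs(h_end - r_end) <= margin and
               h_spk == r_spk
               for h_start, h_end, h_spk in hyp_segments)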


The evaluation program will be executed using a set of different thresholds. For the threshold producing the best result, a closer examination will be performed with focus on the diarization part and the identification part.

6.1.1 Testing and Training Set

As stated in the problem description, the test data for this system should be a TV-debate. The number of speakers is allowed to be limited, but to ensure that the system is not just lucky, a sufficient number of speakers is chosen. Two episodes from a Swedish TV-debate series with a total duration of 3 hours, 27 minutes and 34 seconds are chosen. The data contains 15 people, of whom 6 are female and 9 are male. Models are trained for all of these people except 1 female and 2 males.

The training data is taken from an environment similar to that of the test data. Two separate 1-minute speech samples from each speaker are used to train the UBMs and the GMM speaker models.

6.1.2 Errors

When the speaker recognition system defines a segment, it can make four different errors. If a segment is recognized as belonging to a speaker in the database and the identity is wrong, the error can either be that the segment does not belong to any speaker in the database or that the segment is identified as the wrong speaker. In this thesis, the first error will be known as false accept (FA) and the second as speaker error (SpkE). If a segment is identified as belonging to an unknown speaker while the speaker really does belong to the database, the error will be called false reject (FR). Note that this error is caused by the threshold decision in the identification phase; the segment may originally have been identified as either the correct or the wrong speaker. The last error is when a segment defined by the system has bounds that do not exist in the reference. This error is purely derived from the speaker diarization task in the system and is called segment error (SegE).
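
For reference, the four error types can be summarized as a small enumeration (an illustrative summary of the definitions above, not part of the evaluation program):

from enum import Enum

class EvaluationError(Enum):
    FALSE_ACCEPT = "FA"     # unknown speaker identified as a database speaker
    SPEAKER_ERROR = "SpkE"  # known speaker identified as the wrong speaker
    FALSE_REJECT = "FR"     # known speaker labeled as unknown
    SEGMENT_ERROR = "SegE"  # segment bounds that do not exist in the reference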

6.1.3 Metrics

The main goal of this thesis is to look at the identification rate of the system. To obtain a better understanding of the performance of the system, some other metrics are also considered.

Detection Error Trade-off (DET)

A common evaluation metric for speaker verification is the detection error trade-off (DET), which is illustrated with a curve where the miss probability is plotted against the FA probability for a set of thresholds. As proposed by [12], SpkE and FR can be grouped together into what is here called a miss error (ME). This can be done because, although the cause of the error is different, the result in both cases is that a speaker segment belonging to a known speaker is wrongfully identified. The FA probability is defined as the number of FA divided by the number of segments that truly are unknown. Similarly, the ME probability is defined as the number of ME divided by the number of segments that truly belong to a speaker.

The point of the DET plot is to show the trade-off between the two types of errors (ME and FA) as the threshold varies. The closer the curve is to the origin, the better. In relation to this there exists a performance measure called the equal error rate (EER), defined as the point where ME and FA are equal. Good performance is indicated by a small EER, and the EER determines the optimal threshold. In addition, a curve that is close to a straight line indicates a normal likelihood distribution in the underlying system [20].
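
These definitions translate directly into a computation over the scored trials. The sketch below assumes each trial is recorded as (best score, whether the true speaker is in the database, whether the best-scoring model is the true speaker); it merely illustrates the definitions above.

def det_points(trials, thresholds):
    """Return a list of (FA probability, ME probability) pairs, one per threshold."""
    n_unknown = sum(1 for _, known, _ in trials if not known)
    n_known = sum(1 for _, known, _ in trials if known)
    points = []
    for t in thresholds:
        fa = sum(1 for score, known, _ in trials if not known and score >= t)
        me = sum(1 for score, known, correct in trials
                 if known and (score < t or not correct))
        points.append((fa / max(n_unknown, 1), me / max(n_known, 1)))
    return points

def equal_error_rate(points):
    """Approximate the EER as the point where FA and ME probabilities are closest."""
    fa, me = min(points, key=lambda p: abs(p[0] - p[1]))
    return (fa + me) / 2.0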

Identification Rate

The accuracy of the system is measured by an identification rate (IR), which is defined as the number of correctly found and identified segments divided by the total number of true segments in the video. As explained previously, the system is allowed to ignore small segments within or between another speaker's segments, which means that the total number of true segments is not a constant number. The program goes through the reference label file and searches for the same segments in the label file that the system has computed.

Since the evaluation program counts the number of segments the system got right, the identification rate benefits from identifying many small segments rather than recognizing fewer, longer segments. Choosing this type of segment-based evaluation instead of a duration-based evaluation is done because the purpose of the system is to index where people are speaking. When searching, the duration is secondary information; one can imagine searching for where a speaker speaks and then watching the video from that point on.

Identification and Segment Accuracy

The system consists of two independently working modules, which together contribute to the IR value for the whole system. To get a better understanding of how well these two parts work, the accuracy of each module is computed. The segment accuracy is defined as the number of segments found divided by the number of segments that truly exist. The system is tested as a black box and therefore this value can only say something about the segmentation part of the diarization and nothing about the clustering. The identification accuracy is the number of correctly identified segments divided by the number of segments found.
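
Expressed as ratios, the two module accuracies are simply (a trivial sketch of the definitions above, with hypothetical argument names):

def module_accuracies(n_true_segments, n_found_segments, n_correct_identifications):
    """Segment accuracy reflects the diarization module, identification
    accuracy the identification module."""
    segment_accuracy = n_found_segments / n_true_segments
    identification_accuracy = n_correct_identifications / n_found_segments
    return segment_accuracy, identification_accuracy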
