DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021
Experiments in speaker diarization using speaker vectors
MING CUI
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master's Programme in Software Engineering of Distributed Systems
Date: October 19, 2020
Supervisor: Jens Edlund
Examiner: Jonas Beskow
School of Electrical Engineering and Computer Science
Host company: EDAI Technology AB
Swedish title: Experiment med talarvektorer för diarisering
Abstract
Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. It has emerged as an increasingly important and dedicated domain of speech research. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data.
Our research focuses on existing speaker diarization algorithms. In particular, the thesis targets the differences between supervised and unsupervised methods. The aim of this thesis is to examine state-of-the-art algorithms and analyze which algorithm is most suitable for our application scenarios. Its main contributions are (1) an empirical study of speaker diarization algorithms; (2) appropriate corpus data pre-processing; (3) an audio embedding network for creating d-vectors; (4) experiments on different algorithms and corpora and a comparison between them; (5) a recommendation suited to our requirements.
The empirical study shows that, for the embedding extraction module, because neural networks can be trained on large datasets, diarization performance can be significantly improved by replacing i-vectors with d-vectors. Moreover, the differences between supervised and unsupervised methods lie mostly in the clustering module. The thesis uses only d-vectors as the input of the diarization network and selects two main algorithms as objects of comparison: Spectral Clustering, representing the unsupervised methods, and the Unbounded Interleaved-state Recurrent Neural Network (UIS-RNN), representing the supervised methods.
Keywords: Speaker Diarization, Embedding Extraction Module, Deep Learning, Supervised method, Unsupervised method
Sammanfattning
Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. It has emerged as an increasingly important and dedicated domain within speech research. It was originally proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. In recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data.
Our research focuses on existing speaker diarization algorithms. In particular, the thesis addresses the differences between supervised and unsupervised methods. The aim of this thesis is to examine state-of-the-art algorithms and analyze which algorithm is best suited to our application scenarios. Its main contributions are (1) an empirical study of speaker diarization algorithms; (2) appropriate pre-processing of corpus data; (3) an audio embedding network for creating d-vectors; (4) experiments on different algorithms and corpora and a comparison between them; (5) a recommendation suited to our requirements.
The empirical study shows that, for the embedding extraction module, because neural networks can be trained on large datasets, diarization performance can be improved considerably by replacing i-vectors with d-vectors. Furthermore, the differences between supervised and unsupervised methods lie mostly in the clustering module. The thesis uses only d-vectors as input to the diarization network and selects two main algorithms as objects of comparison: spectral clustering represents the unsupervised methods and the Unbounded Interleaved-state Recurrent Neural Network (UIS-RNN) represents the supervised methods.
Keywords: Speaker Diarization, Embedding Extraction Module, Deep Learning, Supervised method, Unsupervised method
Contents
1 Introduction 1
1.1 Motivation . . . . 1
1.2 Problem . . . . 2
1.3 Purpose . . . . 3
1.4 Goal . . . . 3
1.4.1 Ethics and Sustainability . . . . 3
1.5 Thesis Contributions . . . . 4
1.6 Research Methodology . . . . 5
1.7 Outline . . . . 5
2 Background 6
2.1 Related Works . . . . 7
2.1.1 Voice Activity Detection . . . . 7
2.1.2 Change Point Detection . . . . 8
2.1.3 Audio Embedding Network . . . 10
2.1.4 Spectral Offline Clustering . . . 11
2.1.5 Unbounded Interleaved-state Recurrent Neural Network 12
2.1.6 Some previous related AMI-based system . . . 13
3 Methods and Implementation 15
3.1 Data Processing . . . 15
3.2 Data Augmentation . . . 16
3.2.1 Select Sub-Sequence . . . 16
3.2.2 Input Vector Randomisation . . . 16
3.3 Embedding Network Building . . . 16
3.4 Spectral Clustering Modifications . . . 18
3.4.1 Initial Centroids By Roulette Wheel Selection . . . 19
3.4.2 Initial K Centroids Randomly . . . 19
3.5 Evaluation Metric . . . 19
4 Experiments and Results 21
4.1 Models . . . 21
4.2 Dataset . . . 21
4.3 Experiments Setup . . . 22
4.4 Result . . . 22
4.4.1 Experiments without Data Augmentation training on AMI . . . 23
4.4.2 Experiments without Data Augmentation training on ICSI . . . 25
4.4.3 Experiments with Data Augmentation training on AMI 25
4.4.4 Spectral Offline Clustering with Change Point Module 28
4.5 Building a diarization system based on the AMI Meeting Corpus 28
5 Conclusions 31
5.1 Applications . . . 31
5.2 Discussions . . . 32
6 Future work 35
Bibliography 37
Chapter 1 Introduction
The thesis presents methods for data pre-processing and the implementation of speaker diarization algorithms. In this introductory chapter, I motivate the research question, introduce the related work, and set out the direction for the remainder of the thesis.
1.1 Motivation
Speaker diarization consists of segmenting and clustering a speech recording into speaker-homogeneous regions. In other words, given an audio track of a meeting, a speaker diarization system will automatically discriminate between and label the different speakers ("Who spoke when?") [1, 2]. This involves speech/non-speech detection ("When is there speech?") and overlap detection and resolution ("Who is overlapping with whom?"), as well as speaker identification.
With the development of deep learning, speaker diarization algorithms have become more diverse. With the rise of online video and audio chats and meetings, speaker diarization has become one of the most popular research directions. A complete speaker diarization system actually includes many different parts, and each part plays a different role in the diarization process. While numerous methods exist for speaker embedding and the clustering process, speaker diarization remains a challenging problem. In our project, we want to build a diarization system based on the AMI Meeting Corpus that contains a Voice Activity Detection module, a Change Point Detection module, a speaker embedding network, and a Clustering module, so that it can handle our classroom scenario and transcribe class contents automatically.
To achieve this, we first need to understand the implementations
of our target diarization algorithms, and we believe that different algorithms have unique advantages in different scenarios and that different parts have different degrees of influence on the results. In order to build our system, we perform experiments on the two main speaker diarization algorithms in use today.
The flow chart of the existing methods and of the speaker diarization steps on which we want to perform experiments is as follows (Fig. 1.1):
Figure 1.1: Speaker diarization system pipeline; the dotted line indicates that change point detection is not necessarily required
1.2 Problem
Manual transcription of audio contents in different scenarios is a tedious and time-consuming task in which voice transcription workers need to stay focused at all times.
This thesis addresses the problem of researching the factors that affect diarization results and of performing speaker diarization automatically in classroom scenarios. Which algorithm is most suitable for classroom scenarios, and what are the key factors affecting its accuracy?
1.3 Purpose
The objective of the project is to understand the implementations of current speaker diarization methods and compare the differences between them. We identify each algorithm's advantages, analyze which algorithm is best for our scenario, and finally build a diarization system for our online classroom scenario.
1.4 Goal
First, the input data are audio recordings without labels. All audio recordings are converted into acoustic features and passed through Voice Activity Detection (VAD). For the supervised method, we need to partition each utterance into non-overlapping segments as input to the speaker embedding network. This network embeds the segments, labels them, and then passes them to the diarization module. For the unsupervised method, the output of the speaker embedding network is passed to the clustering module without labels.
This project applies neural networks and clustering methods to segment and cluster each speaker in an audio recording into their own region, and selects the better method to achieve better results.
1.4.1 Ethics and Sustainability
This project supports sustainable development by enabling a speaker diarization system, which benefits both speech and meeting scenarios. Speaker diarization can be used for speaker adaptation in automatic speech recognition.
More importantly, it can help speaker retrieval and rich text transcription in many scenarios, such as online classes, telephone conversations, and online meetings, which is essential for automatic speech algorithms. Our project will have a great impact on online distance teaching: as the project continues to mature, we will no longer need to transcribe classroom contents manually. Students and teachers will no longer need to record the classroom content in detail or repeatedly listen to classroom audio. All they have to do is view the transcription text, and they can quickly become familiar with the class.
An ethical prerequisite for processing audio recordings is that the processing complies with privacy laws and respects the integrity of users. The data
used to produce the results presented in this thesis are non-confidential and processed solely for scientific purposes. For the data collection part, I have obtained permission from the data owner. The most critical issue here is whether class recording infringes the intellectual property rights of teachers and students. We believe that for classrooms where both teachers and students speak, we need to obtain mutual consent before we record and transcribe.
1.5 Thesis Contributions
The thesis contributions can be summarized as:
(1) Data collection and appropriate pre-processing of the experiment corpora. The original datasets are relatively small and include only audio recordings, which do not meet the input conditions of the neural network.
We need to transform the original audio recordings into log-mel spectrograms and record them according to the role of the speakers. Although the AMI Meeting Corpus consists of approximately 100 hours of audio recordings, it still seems small when we want to train the clustering network on AMI, so we use Sub-sequence Randomisation and Diaconis Augmentation for data augmentation.
(2) Building the speaker embedding network. We build the speaker embedding network with a 3-layer LSTM and a linear layer, and train the network on the LibriSpeech dataset with the generalized end-to-end loss function.
(3) Modifying the spectral clustering with cosine distance. We modify the last step of spectral clustering: the initial implementation by Google uses sklearn.cluster.KMeans from Python, which does not support a custom distance metric. We implement this step ourselves, initializing the centroids and computing the cosine distance.
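As an illustration, a k-means that assigns points by cosine distance might look as follows. This is a minimal numpy sketch under our own assumptions: the function name `cosine_kmeans`, the greedy farthest-point initialization, and the spherical centroid update are illustrative choices, not the thesis implementation.

```python
import numpy as np

def cosine_kmeans(X, k, n_iter=50):
    """Spherical k-means: assign points by cosine distance (1 - cosine
    similarity) instead of the Euclidean distance used by sklearn's KMeans."""
    # Work on unit vectors so a dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Greedy farthest-point initialization: deterministic, avoids duplicate seeds.
    idx = [0]
    for _ in range(1, k):
        sims = X @ X[idx].T
        idx.append(int(sims.max(axis=1).argmin()))
    centroids = X[idx].copy()
    for _ in range(n_iter):
        # Minimum cosine distance is the same as maximum cosine similarity.
        labels = (X @ centroids.T).argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centroids[j] = c / np.linalg.norm(c)  # re-project onto the sphere
    return labels, centroids
```

Because centroids are renormalized after every update, the algorithm stays consistent with the cosine geometry of d-vectors, which is the property the stock Euclidean KMeans lacks.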
(4) Comparing the differences between spectral clustering and UIS-RNN and analyzing the factors and reasons that influence the results.
We perform various experiments on spectral clustering and UIS-RNN respectively. The experiments run on different types of data to find the factors that influence the final results. After the experiments, we analyze which algorithm is most suitable for our classroom scenarios.
(5) Building a speaker diarization system based on the AMI Meeting Corpus. After completing all the above tasks, our final aim is to build an AMI-based speaker diarization system that can provide a diarization service for our classroom scenario and transcribe the audio contents.
1.6 Research Methodology
The research methodology consists of quantitative evaluations and comparisons of different solutions. Some theories already exist on the topic of our work; the conducted research builds on previous work by testing the performance of different algorithms under different circumstances, with the goal of providing new solutions for online distance teaching.
1.7 Outline
The remainder of this thesis is structured as follows. Chapter 2 introduces the background and related works of this thesis. Chapter 3 describes the methodology and implementation of the automation system and the neural networks. Chapter 4 summarizes the performance of each network and makes the comparison. Our group discussions and conclusions are presented in Chapter 5. Finally, Chapter 6 outlines the future work of this project.
Chapter 2 Background
Today's speaker diarization systems [3, 4, 5] almost always have four main independent components: (1) a voice activity detection module, which removes the non-speech parts and divides the utterances into small segments; (2) an embedding neural network, which extracts d-vectors; (3) a clustering module, which determines the number of speakers and assigns speaker identities to their regions; (4) a re-segmentation module [4], which further refines the diarization results. The typical speaker diarization pipeline is as follows (Fig 2.1):
Figure 2.1: The typical pipeline of speaker diarization
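The four-stage pipeline can be summarized as a driver function. This is a hypothetical skeleton: the stage functions (`vad`, `embed`, `cluster`, `resegment`) are placeholders for real modules, not the actual components used in this thesis.

```python
# Hypothetical skeleton of the four-stage diarization pipeline described above.
# Each stage is a pluggable callable; a real system would supply a VAD model,
# an embedding network, and a clustering backend.

def diarize(audio, vad, embed, cluster, resegment=None):
    """Return (segment, speaker_id) pairs for one recording."""
    segments = vad(audio)                      # 1. drop non-speech, split utterance
    embeddings = [embed(s) for s in segments]  # 2. one d-vector per segment
    speakers = cluster(embeddings)             # 3. assign speaker identities
    result = list(zip(segments, speakers))
    if resegment is not None:                  # 4. optional refinement pass
        result = resegment(audio, result)
    return result
```

Keeping the stages as independent callables mirrors the modularity of the pipeline: each component can be swapped (e.g. spectral clustering vs. UIS-RNN) without touching the others.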
The most popular algorithms today fall into two categories, unsupervised [3, 6] and supervised [7]. The spectral offline clustering [3] proposed by Wang et al. and the Unbounded Interleaved-state Recurrent Neural Network (UIS-RNN) [7] proposed by Google are representative algorithms. Google has proposed another model called Joint Speech Recognition and Speaker Diarization
via Sequence Transduction [8], which improves the word-level diarization error rate from 15.8% to 2.2%. However, it is not included in the scope of our experiments.
The difference between supervised and unsupervised methods comes almost entirely from the clustering module. In unsupervised methods, the clustering algorithms that have been applied in diarization systems for different scenarios include Gaussian mixture models [9, 10], mean shift, agglomerative hierarchical clustering [5], k-means [3, 11], Links [3, 12], and spectral clustering [3]. Among fully supervised methods, the model proposed by Google called the Unbounded Interleaved-state Recurrent Neural Network (UIS-RNN) [7] seems to perform better than the others.
2.1 Related Works
2.1.1 Voice Activity Detection
Voice Activity Detection (VAD) (Figure 2.2) refers to determining the areas in an audio recording that contain a speaker's voice. Depending on the type of audio recording being processed, the non-speech areas may contain silence, music, room noise, or background noise. Effective voice activity detection is a very important part of a speaker segmentation and clustering system: if non-speech frames are included in the clustering process, they will have a great impact on the correct distinction between speakers. Effective voice activity detection can broadly be divided into the following 4 categories:
(1) Voice activity detection based on energy/spectrum.
(2) Model-based speech/non-speech detection.
(3) Mixed speech/non-speech detection.
(4) Voice activity detection based on multiple channels.
Energy-based voice detection is usually used for telephone speech, because in telephone speech non-speech generally includes only silence and slowly changing noise. In meeting scenarios, there are various types of noise, such as the rustle of paper and the sound of tables and chairs shaking.
Due to the limitations of energy-based voice detection methods, model-based voice detection methods are used in many speech segmentation and clustering systems, because they can characterize various acoustic features.
Figure 2.2: Voice Activity Detection; NS represents non-speech parts and S represents speech parts

In the system of Wooters et al. [13], only speech and non-speech models are used. In the complex system of Nguyen et al. [14], four models that distinguish between gender and channel bandwidth are used. In the paper [15], Zhu modeled noise and music. In their system, the audio file consists of 5 parts, namely voice, music, noise, voice superimposed with music, and voice superimposed with noise. The literature [16] divides the voice types in voice documents in more detail.
The model-based approach also has its limitations: it requires labeled datasets to train the speech/non-speech model. Moreover, a mismatch between the training set and the test set will seriously affect the generalization performance of the system. To solve these problems, a hybrid speech/non-speech detection method has been introduced. The method consists of two steps: the first step performs simple energy-based detection; the second step performs model-based detection. The model is trained on the test data itself, so no additional training data is required.
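The energy-based first step of such a hybrid detector can be sketched in a few lines. The frame sizes and the relative-dB threshold below are illustrative assumptions, not values from this thesis.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Crude first-pass VAD: mark a frame as speech (True) when its energy
    is within `threshold_db` dB of the loudest frame in the recording.
    Defaults assume 16 kHz audio: 25 ms frames with a 10 ms hop."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Per-frame energy in dB; the epsilon avoids log(0) on silent frames.
    energy_db = np.array([10 * np.log10(np.mean(f ** 2) + 1e-10) for f in frames])
    return energy_db > energy_db.max() + threshold_db
```

Because the threshold is relative to the recording's own peak energy, the detector adapts to the overall level; this is exactly the simplicity (and the weakness in noisy meeting rooms) that motivates the model-based second step.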
2.1.2 Change Point Detection
Time series analysis has become increasingly important in diverse research fields. Change Point Detection (CPD) (Figure 2.3) detects abrupt changes in data when a property of the time series changes [17]. In speech recognition, change point detection is applied for audio segmentation and for recognizing boundaries between silence, sentences, words, and noise [18]. Effective change point detection methods can be divided into five categories:
(1) Cumulative sum control chart (CUSUM). Based on small deviations in the accumulated data, it can detect whether the data distribution has changed. This is also the oldest and most primitive method. However, it requires tuning a threshold: if the threshold is too small, the model will be too sensitive, while too large a threshold leads to a rough model.

Figure 2.3: Change Point Module; the line represents where a change point happens
(2) Probability density estimation. For a time series, the probability density distribution before and after a change point will differ. Using the first N points, we can build probability density models and estimate the probability density function. A score then measures the change in the probability density distribution after each new point is added: the higher the score, the higher the probability that this point is a change point.
(3) Direct computation. Because the probability density distribution is difficult to estimate accurately, this approach does not estimate the probability density distribution itself but directly computes the difference in the probability distribution before and after a point. For the data before and after a point, some models and algorithms can measure the differences.
(4) Probabilistic methods. This category focuses on directly predicting whether a point is a change point. Gaussian processes and Bayesian processes are the most typical methods here. However, it is difficult to define a prior function, and the result is inaccurate if it is not well defined.
(5) Clustering methods. Clustering methods divide the time series into many clusters, for example with hierarchical clustering or graph clustering. If the behavior of a time series differs substantially from the other members of the same cluster, it is regarded as a change.
The CUSUM chart was developed by E. S. Page of the University of Cambridge for change point detection. It detects whether the data distribution has changed according to the slight deviation of the accumulated data. However, it needs a tuned threshold, and the model scale influences the accuracy: if the scale is too small, the model will be too sensitive, and otherwise too rough.
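The CUSUM idea, accumulating small deviations from a reference mean and flagging the point where the cumulative sum exceeds a threshold, can be sketched directly; the `slack` and `threshold` parameters are illustrative and would need the tuning discussed above.

```python
def cusum_change_point(series, target_mean, slack=0.5, threshold=5.0):
    """One-sided CUSUM: accumulate positive deviations from `target_mean`
    (minus a slack term that absorbs normal fluctuation) and return the
    first index where the cumulative sum exceeds `threshold`, or None."""
    s = 0.0
    for i, x in enumerate(series):
        # Reset to zero whenever deviations stay below the slack allowance.
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None
```

The sensitivity trade-off described above is visible in the two parameters: a small `threshold` flags noise as change points, while a large one delays or misses real changes.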
Probability density estimation detects the change point according to the difference in the probability density distribution before and after the change point. According to the paper [19], the basic idea is to build probability density models based on the first n points; when a new data point is added, an evaluation score measures the difference in the probability estimate. The higher the score, the higher the probability that this point is a change point.
Because probability density estimation needs a large amount of data and it is hard to achieve high accuracy, direct computation methods have been proposed. In the non-parametric model proposed by Naoki [20], an SST technique computes a change point score as the detection criterion, alongside other simple statistics that can be computed, such as the mean and variance. The disadvantages of this method are that the result is easily influenced by data noise and that it does not perform well in high dimensions.
In the paper [21], Saatci et al. proposed a model combining Bayesian online change point detection with Gaussian processes to create a non-parametric time series model. They present methods to reduce the computational burden of the model. Generally, offline algorithms perform better than online algorithms, and Gaussian processes can achieve more accuracy than Bayesian processes.
The last change point detection approach is the clustering method. It clusters many time series; within the same category, if the behavior of one time series differs substantially from the other members of the same cluster, it is regarded as a change.
2.1.3 Audio Embedding Network
Nowadays, speaker embedding neural networks are mostly modified from existing speaker verification networks [22, 23, 24, 25]. Wan et al. introduced an LSTM-based speaker embedding network for speaker verification by proposing a new loss function called the generalized end-to-end loss (GE2E) [26]. The GE2E loss function pushes each embedding towards the centroid of the true speaker and away from the false speakers. Compared with the tuple-based end-to-end loss function they proposed previously, the generalized end-to-end loss is more efficient and is confirmed to converge to a better model in a shorter time. Their model is trained on fixed-length segments extracted from a large corpus. They also propose a technique called MultiReader, which is similar to a regularization technique: when the target dataset does not have sufficient data, training on this dataset will lead to overfitting, so they require the network to also perform well on other datasets to regularize it.
As a result, they combine different data sources and give each data source a weight in order to indicate the importance of each data source.
In their experiments, they evaluate both text-dependent and text-independent speaker verification. For text-dependent speaker verification, they use 128 hidden nodes and a projection size of 64. For text-independent speaker verification, they use 768 hidden nodes with a projection size of 256. They lower the speaker verification EER to 3.55% by applying MultiReader to make their models support multiple keywords and multiple languages.
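The GE2E objective ("pull towards the true speaker's centroid, push from the others") can be illustrated with a simplified numpy sketch. This is an illustrative re-statement, not Wan et al.'s implementation: the learned scale and bias of the similarity are fixed here to w = 1, b = 0, and only the softmax variant of the loss is shown.

```python
import numpy as np

def ge2e_softmax_loss(emb):
    """Simplified GE2E softmax loss for an array of shape
    (n_speakers, n_utterances, dim): each embedding should be most similar
    to its own speaker's centroid and dissimilar to the other centroids."""
    n_spk, n_utt, _ = emb.shape
    emb = emb / np.linalg.norm(emb, axis=2, keepdims=True)
    centroids = emb.mean(axis=1)
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    loss = 0.0
    for j in range(n_spk):
        for i in range(n_utt):
            # Leave-one-out centroid for the true speaker, as in the paper,
            # so the embedding is not compared against a centroid containing itself.
            own = (emb[j].sum(axis=0) - emb[j, i]) / (n_utt - 1)
            own = own / np.linalg.norm(own)
            sims = emb[j, i] @ centroids.T   # cosine similarity to every speaker
            sims[j] = emb[j, i] @ own        # replace the true entry (leave-one-out)
            # Softmax cross-entropy with the true speaker j as the target class.
            loss += -sims[j] + np.log(np.exp(sims).sum())
    return loss / (n_spk * n_utt)
```

Minimizing this quantity drives within-speaker similarity up and between-speaker similarity down simultaneously, which is what makes GE2E more efficient than the earlier tuple-based loss.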
2.1.4 Spectral Offline Clustering
With the rise of deep learning in various domains, audio embeddings based on neural networks, also known as d-vectors, have revealed higher performance than i-vectors. The paper [3] presents a model combining LSTM-based d-vector audio embeddings with non-parametric clustering algorithms.
In their experiments, they evaluate four different clustering methods: (1) naive online clustering, a prototypical online clustering algorithm that represents each cluster by all of its corresponding embeddings; (2) Links online clustering, which estimates cluster probability distributions and models their substructure based on the embedding vectors; (3) k-means offline clustering, which determines the number of speakers by using the derivatives of the conditional Mean Squared Cosine Distance (MSCD) between each embedding and its cluster centroid; (4) spectral offline clustering. They observe that d-vector-based systems [26, 9] always achieve a significantly lower Diarization Error Rate than i-vector-based systems [27].
Spectral offline clustering can be contrasted with parametric models such as k-means. When clustering, we often assume that the data follow a Gaussian distribution; however, speech data are often non-Gaussian, which makes the centroid of a cluster an insufficient representation. Cluster imbalance will also influence the results: imbalanced speech data from different speakers can lead clustering algorithms to split large clusters into several small ones. Moreover, speakers can fall into groups according to features with large differences, such as gender; in this situation, the clustering process may cluster the speaker embeddings into just males and females. With spectral offline clustering, these problems are mitigated to some degree.
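A toy version of spectral clustering over d-vectors, using the eigengap of the graph Laplacian to estimate the number of speakers, might look as follows. The affinity construction and the deterministic k-means initialization here are simplifications of our own; the actual pipeline in [3] applies additional affinity refinement steps not shown.

```python
import numpy as np

def spectral_diarize(embeddings, max_speakers=8):
    """Toy spectral clustering: cosine-affinity matrix -> graph Laplacian ->
    eigengap estimate of the speaker count -> k-means on the leading
    eigenvectors. A sketch, not the refined pipeline of the paper."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, 1.0)          # cosine affinity in [0, 1]
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = np.diag(d) - A                      # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    gaps = np.diff(vals[:max_speakers + 1])
    k = int(gaps.argmax()) + 1              # eigengap estimate of k
    Y = vecs[:, :k]                         # spectral embedding of each segment
    # Tiny k-means on the spectral embedding with deterministic init.
    cents = Y[np.linspace(0, len(Y) - 1, k).astype(int)]
    for _ in range(50):
        labels = ((Y[:, None, :] - cents[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                cents[j] = Y[labels == j].mean(0)
    return labels
```

Because the clustering happens in the Laplacian's eigenspace rather than in the raw d-vector space, non-Gaussian and imbalanced clusters of the kind described above are handled more gracefully than by plain k-means.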
Figure 2.4: Speaker clustering; different colors of bars represent d-vectors embedded from different speakers
2.1.5 Unbounded Interleaved-state Recurrent Neural Network
The other clustering algorithm we want to experiment with is the Unbounded Interleaved-state Recurrent Neural Network proposed by Google. In UIS-RNN, the developers observe that one component is still unsupervised: the clustering module (Figure 2.4).
In the paper, they create a fully supervised model by using a speaker embedding network similar to the one used with spectral clustering. They split the model into three parts that separately model sequence generation, speaker assignment, and speaker change. In the decoding process, the maximum a posteriori probability criterion is applied, and beam search is used for processing.
To solve the problem of the unknown number of speakers, the authors add a distance-dependent Chinese restaurant process (ddCRP) [28]. The Chinese restaurant process is a Dirichlet process (DP) mixture model describing a scenario in which customers come to a restaurant to take a seat. It assumes that the restaurant has unlimited tables. The first customer sits at the first table, and the probability that the $n$-th customer sits at the $k$-th table is:

$$p(\text{table } k) = \frac{n_k}{\alpha_0 + n - 1}$$

where $n_k$ is the number of customers at the $k$-th table and $\alpha_0$ is a given scaling parameter. The customer can also sit at a new table, with probability:

$$p(\text{new table}) = \frac{\alpha_0}{\alpha_0 + n - 1}$$
The difference between the Chinese Restaurant Process and the distance-dependent Chinese Restaurant Process is that in ddCRP distributions, the seating plan probability is described in terms of the probability of a customer sitting with each of the other customers, not with tables. The sitting probability depends only on the distance between customers; the distance can be time, space, or another characteristic.

$$p(c_i = j \mid D, \alpha) \propto f(d_{ji}), \quad j \neq i$$
$$p(c_i = j \mid D, \alpha) \propto \alpha, \quad j = i$$

Here $c_i$ denotes the customer with whom the $i$-th customer sits, and $d_{ji}$ is the distance measurement between customers $i$ and $j$. The function $f$ is a decay function. This addition helps UIS-RNN determine the number of speakers in an utterance and is the most important part of the clustering process.
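The plain CRP seating rule is easy to simulate: the sketch below draws table assignments from exactly the two probabilities given above (here `alpha` denotes the scaling parameter $\alpha_0$; the function is an illustration, not part of UIS-RNN's code).

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Simulate table assignments under a plain CRP: customer n joins
    existing table k with probability n_k / (alpha + n - 1) and opens a
    new table with probability alpha / (alpha + n - 1)."""
    rng = random.Random(seed)
    tables = [1]        # customer 1 sits at the first table
    assignments = [0]
    for n in range(2, n_customers + 1):
        # Existing tables weighted by occupancy, then one slot for a new table;
        # the weights sum to alpha + n - 1, matching the formula's denominator.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables
```

The "rich get richer" behavior is visible in the weights: crowded tables attract more customers, while `alpha` controls how readily new tables (new speakers, in the diarization analogy) are created.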
Another innovative aspect of UIS-RNN is that the authors train the embedding network with three different strategies and obtain d-vectors in three versions. They first train the embedding network with just the basic datasets, and then use the far-field training set to train the network again. After that, they find that the training window is too large when they run a small window in the diarization system, so they retrain a new model using variable-length windows.
To validate the model, the authors choose the CALLHOME dataset (2000 NIST SRE, disk-8) with 5-fold cross-validation, and the model is measured by DER evaluation. In addition, two off-domain datasets are used: Switchboard (2000 NIST SRE, disk-6) and the ICSI conference corpus. The final Diarization Error Rate can be as low as 7.6%, and the segmentation performance exceeds the existing state-of-the-art algorithms.
2.1.6 Some previous related AMI-based system
The paper [29] presented diarization results on different speech databases from RT-04S to RT-06S. The important point of their result is that they first split the audio input into seven conditions, with experiments mainly conducted under two main conditions: multiple distant microphones (MDM) and a single distant microphone (SDM). This is very detailed, and the number, direction, and type of microphones can indeed influence the audio input. Second, they split the meeting scenarios into conference meetings and lecture meetings. It is well known that the type, number, and placement of sensors have a significant impact on the performance. Conference meetings are primarily goal-oriented, decision-making exercises and can vary from moderated meetings to group consensus-building meetings. As such, these meetings are highly interactive, and multiple participants contribute to the information flow and decisions made. In contrast, lecture meetings are educational events where a single lecturer briefs the audience on a particular topic. While the audience occasionally participates in question and answer periods, it rarely controls the direction of the interchange or the outcome [29].
The paper [30] proposed a new method called Discriminative Neural Clustering (DNC), evaluated on the AMI Meeting Corpus, and produced considerably more advanced results than other systems. They were the first to run a system fully on the AMI Corpus; they re-aligned the AMI data and thereby fundamentally solved a problem of AMI itself. They used a neural Change Point module whose overall output can be regarded as a binary classifier. The input is a temporal RNN: the static features of single frames in the first n frames and the last n frames around the current frame are extracted and then aggregated through RNNs. The feature outputs of the RNNs are combined by a point-wise multiplication and passed through further layers to the final binary classification output. Regarding the speaker clustering module, the authors extract features (embeddings) from each segment and then cluster them, so the key is how the features are extracted. The authors designed many structures: for example, a self-attention structure for each segment, producing a weighted feature for clustering. In addition, they designed a 2D Consecutive Self-attentive Combination, which is in effect an ensemble of a variety of feature models followed by an overall self-attention. Furthermore, for the ensemble, in addition to self-attention, they also tried feature fusion with bi-linear pooling. They selected the Speaker Error Rate (SER) as their evaluation metric and achieved an SER of 13.90%.
Chapter 3
Methods and Implementation
3.1 Data Processing
Our application scenarios are class audio recordings of different kinds, each consisting of one teacher and several students. However, because of COVID-19, EDAI Technology could not collect data from their online class applications, so I could only choose public datasets that are as similar as possible to our scenario. I finally collected data from the AMI Meeting Corpus, which contains 100 hours of meeting recordings captured by different devices, such as distant microphones and headset microphones.
Two-thirds of the AMI Meeting Corpus consists of recordings in which groups of four people played four different roles in a team. The remaining third of the corpus contains recordings of other types of meetings.
To transform the raw audio recordings into the input format of the embedding network, I first split the data into two parts, 90% for training and 10% for testing, then loaded the utterance audios and split each utterance into partial utterances by voice detection; here I simply used the decibel level as the detection criterion. I transformed the audio signals into frames of width 25 ms with a step of 10 ms. We also experimented with widths of 20 ms and 40 ms; both performed slightly worse than 25 ms on our experiment data. Too short a width carries too little information, while too long a width leads to under-fitting in the next processing stage. However, 25 ms may not be the optimal choice, and further tuning is possible.
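The decibel-based voice detection described above can be sketched as follows. This is a minimal, hypothetical helper (the sample rate, frame sizes and -40 dB threshold are illustrative assumptions, not the thesis's exact values):

```python
import numpy as np

def split_on_silence(signal, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Mark frames whose RMS level (in dB) exceeds a threshold, then
    return (start, end) sample ranges of the contiguous voiced runs."""
    win, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - win) // hop)
    voiced = []
    for i in range(n):
        frame = signal[i * hop : i * hop + win]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-10  # avoid log(0)
        voiced.append(20 * np.log10(rms) > threshold_db)
    runs, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * hop                      # run opens
        if not v and start is not None:
            runs.append((start, i * hop + win))  # run closes
            start = None
    if start is not None:
        runs.append((start, len(signal)))
    return runs
```

A partial utterance is then each `(start, end)` slice of the signal.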
After applying the short-time Fourier transform, Mel filters and a log function, I obtained the log-mel spectrograms of the utterances and saved them as numpy files.
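This STFT-to-log-mel pipeline can be sketched in NumPy alone. The parameters below (16 kHz audio, 25 ms/10 ms framing, 512-point FFT, 40 mel bands) are assumptions for illustration, not necessarily the thesis's exact settings:

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, win_ms=25, hop_ms=10,
                        n_mels=40, n_fft=512):
    """Frame the signal, take the STFT power spectrum, and map it
    through a triangular mel filterbank followed by a log."""
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (n_frames, n_fft//2+1)

    # Triangular mel filterbank
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    return np.log(spec @ fbank.T + 1e-6)             # (n_frames, n_mels)
```

The resulting array can be saved with `np.save` for later training.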
3.2 Data Augmentation
Although the AMI Meeting Corpus contains 100 hours of meeting recordings, it is still relatively small for building a diarization system entirely on AMI. Under these circumstances we need data augmentation, and we used two methods.
3.2.1 Select Sub-Sequence
The simplest method is to select sub-sequences of the full sequence. This gives us more training examples, since each sub-sequence can be treated as a short meeting, and every new sub-sequence is given its own fresh labels.
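A minimal sketch of this selection, under the assumption that sub-sequences are taken with a fixed length and stride and that speaker labels are renumbered from zero within each slice (the exact slicing scheme used in the thesis is not specified):

```python
import numpy as np

def subsequences(vectors, labels, length, stride):
    """Cut a full meeting (vectors + speaker labels) into overlapping
    sub-sequences; each slice acts as an independent short meeting,
    with speaker labels renumbered so every sub-sequence gets new labels."""
    out = []
    for s in range(0, len(vectors) - length + 1, stride):
        segment = labels[s : s + length]
        remap = {lab: i for i, lab in enumerate(dict.fromkeys(segment))}
        out.append((vectors[s : s + length], [remap[l] for l in segment]))
    return out
```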
3.2.2 Input Vector Randomisation
The next augmentation method is called input vector randomisation. From the speaker embedding network we obtain the utterance vectors and their labels; for example, a vector sequence X with labels Y = [1, 3, 2, 1, 1]. We then alter the input vectors based on the labels, replacing each vector with another vector spoken by the same speaker while preserving the label sequence. This augmentation process is depicted in Figure 3.1.
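The swap can be sketched as follows; this is an illustrative implementation in which each position is resampled uniformly among that speaker's vectors, which is one plausible reading of the method:

```python
import numpy as np

def randomise_inputs(X, y, rng=None):
    """Replace each vector X[i] with a randomly chosen vector from the
    same speaker y[i], so the label sequence y is preserved exactly."""
    rng = rng or np.random.default_rng()
    X_new = np.empty_like(X)
    for spk in set(y):
        idx = [i for i, lab in enumerate(y) if lab == spk]
        src = rng.choice(idx, size=len(idx), replace=True)  # same-speaker sources
        X_new[idx] = X[src]
    return X_new
```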
3.3 Embedding Network Building
The embedding network is relatively simple compared with the diarization network. It consists of a 3-layer LSTM followed by a linear layer; the last-frame output of the LSTM is taken as the d-vector representation.
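As an illustration, a minimal sketch of such a network in PyTorch; the layer sizes (40 mel features, hidden size 256, 64-dimensional d-vector) are assumptions, not the thesis's configuration:

```python
import torch
import torch.nn as nn

class DVectorNet(nn.Module):
    """3-layer LSTM followed by a linear projection; the last-frame
    output is L2-normalized and used as the d-vector."""
    def __init__(self, n_mels=40, hidden=256, dim=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, x):                 # x: (batch, frames, n_mels)
        out, _ = self.lstm(x)             # out: (batch, frames, hidden)
        d = self.proj(out[:, -1])         # last-frame output only
        return d / d.norm(dim=1, keepdim=True)  # unit-length d-vector
```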
The important part of the embedding network is the loss function. I trained the network with the state-of-the-art generalized end-to-end (GE2E) loss. GE2E training operates on a large number of utterances at once: I put N different speakers with M utterances each into one batch. I denote the output of the entire network as f(x_{ji}; ω), where x_{ji} is the i-th utterance of speaker j and ω represents the parameters of the network. The d-vector is then the L2 normalization of the network output:

e_{ji} = f(x_{ji}; ω) / ||f(x_{ji}; ω)||_2    (3.1)
Figure 3.1: Input vector randomisation. X represents d-vectors embedded from different speakers, and the list Y represents the speaker identities.
I also needed to compute the centroid of each speaker's cluster:

c_k = (1/M) Σ_{i=1}^{M} e_{ki}    (3.2)
The difference between TE2E and GE2E is that TE2E computes a scalar similarity between an embedding vector and a single centroid, whereas GE2E builds a similarity matrix containing the similarities between every embedding vector and all centroids:

S_{ji,k} = w · cos(e_{ji}, c_k) + b    (3.3)

Our purpose is to push the embedding of each utterance far from other speakers' centroids while keeping it close to its own speaker's centroid. According to the paper, there are two ways to achieve this; the first applies a softmax function over the similarity matrix:

L(e_{ji}) = −S_{ji,j} + log Σ_{k=1}^{N} exp(S_{ji,k})    (3.4)
The second method computes a contrast loss defined on positive pairs and the most aggressive negative pairs. In this project, I chose the softmax function.

According to the research work by Wan et al., when computing the centroid of the true speaker, it is necessary to exclude e_{ji}, which makes the training stable. Following the paper, I therefore compute the centroids and similarities as follows:
c_j^{(−i)} = (1/(M−1)) Σ_{m=1, m≠i}^{M} e_{jm}    (3.5)

S_{ji,k} = w · cos(e_{ji}, c_j^{(−i)}) + b,  if k = j    (3.6)
S_{ji,k} = w · cos(e_{ji}, c_k) + b,  otherwise    (3.7)

The final GE2E loss is the sum of all losses over the similarity matrix:

L_G(x; w) = L_G(S) = Σ_{j,i} L(e_{ji})    (3.8)
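The loss above can be sketched in NumPy for one batch of N speakers with M utterances each. This is a hypothetical batch-level implementation of the softmax variant, with assumed values for the learnable scale (w = 10) and bias (b = −5); it is not the thesis's training code:

```python
import numpy as np

def ge2e_softmax_loss(e, w=10.0, b=-5.0):
    """e: (N, M, D) L2-normalized d-vectors for N speakers, M utterances
    each. Returns the GE2E softmax loss, Eqs. (3.2)-(3.8)."""
    N, M, D = e.shape
    c = e.mean(axis=1)                                     # Eq. 3.2, (N, D)
    c_excl = (e.sum(axis=1, keepdims=True) - e) / (M - 1)  # Eq. 3.5, (N, M, D)

    def cos(a, b_):
        return (a * b_).sum(-1) / (np.linalg.norm(a, axis=-1)
                                   * np.linalg.norm(b_, axis=-1))

    # Similarity matrix S[j, i, k], Eqs. 3.6-3.7
    S = np.empty((N, M, N))
    for k in range(N):
        S[:, :, k] = w * cos(e, c[k]) + b            # vs. other centroids
    for j in range(N):
        S[j, :, j] = w * cos(e[j], c_excl[j]) + b    # own centroid excludes e_ji

    # Eq. 3.4 summed over all (j, i), Eq. 3.8
    true_sim = S[np.arange(N)[:, None], np.arange(M)[None, :],
                 np.arange(N)[:, None]]
    return float((-true_sim + np.log(np.exp(S).sum(-1))).sum())
```

Well-clustered embeddings should yield a markedly lower loss than random ones.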
3.4 Spectral Clustering Modifications
In this part, I implemented the spectral clustering algorithm by modifying the original spectral clustering. The paper[3] proposes that in the last step of spectral clustering, the K-Means algorithm should be applied to cluster the new embeddings and produce their labels. However, the model shown by Google uses the sklearn.cluster package in Python, which uses Euclidean distance and does not support customized distances such as cosine distance[31]. So I implemented the K-Means module of the spectral clustering algorithm myself, in two variants.
3.4.1 Initial Centroids By Roulette Wheel Selection
Before the final K-Means step, the spectral clustering algorithm has already determined the number of clusters k using the maximal eigen-gap through a series of processing steps. The clustering then proceeds as follows:

(1) Randomly choose a data point as the first centroid.
(2) For each sample x, compute the minimal cosine distance D(x) between the sample and the centroids chosen so far.
(3) Using roulette wheel selection, choose the next centroid with each sample's probability of being selected given by

D(x) / Σ_{i=1}^{N} D(x_i)

(4) Repeat (2) and (3) until all k initial centroids are found.
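Steps (1)-(4) above can be sketched as follows (a minimal implementation with cosine distance; the helper names are mine, not the thesis code):

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def roulette_init(X, k, rng=None):
    """Pick k initial centroids: the first at random, each subsequent one
    with probability proportional to its minimal cosine distance to the
    centroids already chosen (roulette wheel selection)."""
    rng = rng or np.random.default_rng()
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        d = np.array([min(cosine_dist(x, c) for c in centroids) for x in X])
        d = np.maximum(d, 0.0)            # guard against float round-off
        centroids.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.array(centroids)
```

This is essentially the k-means++ seeding strategy with cosine distance in place of squared Euclidean distance.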
3.4.2 Initial K Centroids Randomly
Because the K-Means clustering algorithm is sensitive to its initial values, I also implemented the following initialization method:

(1) Randomly initialize k centroids from the whole set D.
(2) For each remaining sample in D, compute the cosine distance between it and the centroids, and assign the sample to the cluster with the shortest distance.
(3) According to the clustering result, recalculate each cluster's centroid.
(4) Re-cluster the samples in D based on the new centroids.
(5) Repeat (3) and (4) until the cluster assignment remains unchanged.

Because the centroids are selected randomly, a single run may not yield the best result, so we also set an iteration count and repeat the whole procedure several times.
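The loop in steps (1)-(5) can be sketched as a cosine-distance K-Means (a simplified single-run sketch; restarts and tie-breaking policies are assumptions):

```python
import numpy as np

def cosine_kmeans(X, k, n_iter=100, rng=None):
    """K-Means with cosine distance and random initial centroids.
    Returns the cluster label of each row of X."""
    rng = rng or np.random.default_rng()
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = Xn[rng.choice(len(Xn), size=k, replace=False)]  # step (1)
    labels = None
    for _ in range(n_iter):
        cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        new_labels = (Xn @ cn.T).argmax(axis=1)      # step (2): max cosine sim
        if labels is not None and np.array_equal(new_labels, labels):
            break                                    # step (5): converged
        labels = new_labels
        for j in range(k):                           # steps (3)-(4)
            if (labels == j).any():
                centroids[j] = Xn[labels == j].mean(axis=0)
    return labels
```

In practice this would be wrapped in a loop of several random restarts, keeping the run with the best objective.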
3.5 Evaluation Metric
The main metric used in our experiments is the Diarization Error Rate (DER), as described and used by NIST in the RT evaluations[29]. We calculate not only its overall value but also its three components: Confusion, False Alarm and Miss, as follows:
DER = E_confusion + E_FA + E_Miss    (3.9)

Confusion, also called speaker error, is the percentage of scored time for which a speaker ID is assigned to the wrong speaker:
E_confusion = Σ_{s=1}^{S} dur(s) · (min(N_ref(s), N_hyp(s)) − N_correct(s)) / T_score    (3.10)

The terms N_ref(s) and N_hyp(s) denote the number of speakers speaking in segment s in the reference and the hypothesis respectively, and N_correct(s) denotes the number of speakers in segment s that have been correctly matched. False Alarm is the percentage of scored time for which a hypothesized speaker is labelled as non-speech in the reference, and Miss is the percentage of scored time for which a hypothesized non-speech segment corresponds to a reference speaker segment:
E_FA = Σ_{s=1}^{S} dur(s) · (N_hyp(s) − N_correct(s)) / T_score    (3.11)
E_Miss = Σ_{s=1}^{S} dur(s) · (N_ref(s) − N_correct(s)) / T_score    (3.12)
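As a concrete illustration, the standard NIST-style per-segment decomposition of DER can be computed as follows. This is a hypothetical helper (the segment tuple format is assumed), not the NIST scoring tool, and it uses the usual max(0, ·) form of the Miss and False Alarm terms:

```python
def der_components(segments):
    """segments: list of (dur, n_ref, n_hyp, n_correct) per scored segment.
    Returns (confusion, false_alarm, miss, DER), each as a fraction of
    the total scored speaker time T_score."""
    t_score = sum(d * nr for d, nr, _, _ in segments)
    miss = sum(d * max(0, nr - nh) for d, nr, nh, _ in segments) / t_score
    fa   = sum(d * max(0, nh - nr) for d, nr, nh, _ in segments) / t_score
    conf = sum(d * (min(nr, nh) - nc) for d, nr, nh, nc in segments) / t_score
    return conf, fa, miss, conf + fa + miss
```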