DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021
Experiments in speaker diarization using speaker vectors
MING CUI
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master's Programme in Software Engineering of Distributed Systems
Date: October 19, 2020
Supervisor: Jens Edlund
Examiner: Jonas Beskow
School of Electrical Engineering and Computer Science
Host company: EDAI Technology AB
Swedish title: Experiment med talarvektorer för diarisering
Abstract
Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. It has emerged as an increasingly important and dedicated domain of speech research. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data.
Our research focuses on existing speaker diarization algorithms. In particular, the thesis targets the differences between supervised and unsupervised methods. The aim of this thesis is to examine state-of-the-art algorithms and analyze which algorithm is most suitable for our application scenarios. Its main contributions are (1) an empirical study of speaker diarization algorithms; (2) appropriate corpus data pre-processing; (3) an audio embedding network for creating d-vectors; (4) experiments on different algorithms and corpora and a comparison between them; (5) a recommendation suited to our requirements.
The empirical study shows that, for the embedding extraction module, because neural networks can be trained on large datasets, diarization performance can be significantly improved by replacing i-vectors with d-vectors. Moreover, the differences between supervised and unsupervised methods lie mostly in the clustering module. The thesis uses only d-vectors as the input of the diarization network and selects two main algorithms as objects of comparison: Spectral Clustering, representing the unsupervised methods, and the Unbounded Interleaved-state Recurrent Neural Network (UIS-RNN), representing the supervised methods.
Keywords: Speaker Diarization, Embedding Extraction Module, Deep Learning, Supervised method, Unsupervised method
Sammanfattning
Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. It has emerged as an increasingly important and dedicated domain within speech research. It was originally proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. In recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data.
Our research focuses on existing speaker diarization algorithms. In particular, the thesis addresses the differences between supervised and unsupervised methods. The aim of this thesis is to examine state-of-the-art algorithms and analyze which algorithm is best suited to our application scenarios. Its main contributions are (1) an empirical study of speaker diarization algorithms; (2) appropriate pre-processing of corpus data; (3) an audio embedding network for creating d-vectors; (4) experiments on different algorithms and corpora and a comparison between them; (5) a recommendation suited to our requirements.
The empirical study shows that, for the embedding extraction module, because neural networks can be trained on large datasets, diarization performance can be improved considerably by replacing i-vectors with d-vectors. Furthermore, the differences between supervised and unsupervised methods lie mostly in the clustering module. The thesis uses only d-vectors as input to the diarization network and selects two main algorithms as objects of comparison: spectral clustering represents the unsupervised methods and the Unbounded Interleaved-state Recurrent Neural Network (UIS-RNN) represents the supervised methods.
Keywords: Speaker Diarization, Embedding Extraction Module, Deep Learning, Supervised method, Unsupervised method
Contents
1 Introduction 1
1.1 Motivation . . . . 1
1.2 Problem . . . . 2
1.3 Purpose . . . . 3
1.4 Goal . . . . 3
1.4.1 Ethics and Sustainability . . . . 3
1.5 Thesis Contributions . . . . 4
1.6 Research Methodology . . . . 5
1.7 Outline . . . . 5
2 Background 6
2.1 Related Works . . . . 7
2.1.1 Voice Activity Detection . . . . 7
2.1.2 Change Point Detection . . . . 8
2.1.3 Audio Embedding Network . . . 10
2.1.4 Spectral Offline Clustering . . . 11
2.1.5 Unbounded Interleaved-state Recurrent Neural Network 12
2.1.6 Some previous related AMI-based system . . . 13
3 Methods and Implementation 15
3.1 Data Processing . . . 15
3.2 Data Augmentation . . . 16
3.2.1 Select Sub-Sequence . . . 16
3.2.2 Input Vector Randomisation . . . 16
3.3 Embedding Network Building . . . 16
3.4 Spectral Clustering Modifications . . . 18
3.4.1 Initial Centroids By Roulette Wheel Selection . . . 19
3.4.2 Initial K Centroids Randomly . . . 19
3.5 Evaluation Metric . . . 19
4 Experiments and Results 21
4.1 Models . . . 21
4.2 Dataset . . . 21
4.3 Experiments Setup . . . 22
4.4 Result . . . 22
4.4.1 Experiments without Data Augmentation training on AMI . . . 23
4.4.2 Experiments without Data Augmentation training on ICSI . . . 25
4.4.3 Experiments with Data Augmentation training on AMI 25
4.4.4 Spectral Offline Clustering with Change Point Module 28
4.5 Building a diarization system based on the AMI Meeting Corpus 28
5 Conclusions 31
5.1 Applications . . . 31
5.2 Discussions . . . 32
6 Future work 35
Bibliography 37
Chapter 1 Introduction
The thesis presents methods for data pre-processing and the implementation of speaker diarization algorithms. In this introductory chapter, I motivate the research question, introduce the related work, and set out the direction for the remainder of the thesis.
1.1 Motivation
Speaker diarization consists of segmenting and clustering a speech recording into speaker-homogeneous regions. In other words, given an audio track of a meeting, a speaker diarization system will automatically discriminate between and label the different speakers ("Who spoke when?") [1, 2]. This involves speech/non-speech detection ("When is there speech?") and overlap detection and resolution ("Who is overlapping with whom?"), as well as speaker identification.
With the development of deep learning, speaker diarization algorithms have become more diverse. With the rise of online video and audio chats and meetings, speaker diarization has become one of the most popular research directions. A complete speaker diarization system actually includes many different parts, and each part plays a different role in the diarization process. While numerous methods exist for speaker embedding and the clustering process, speaker diarization remains a challenging problem. In our project, we want to build a diarization system based on the AMI Meeting Corpus that contains a Voice Activity Detection module, a Change Point Detection module, a speaker embedding network, and a Clustering module, so that it can handle our classroom scenario and transcribe class contents automatically.
To achieve this, we first need to understand the implementations
of our target diarization algorithms, and we believe that different algorithms have unique advantages in different scenarios and that different parts have different degrees of influence on the results. In order to build our system, we perform experiments on the two main speaker diarization algorithms in use today.
The flow chart of the existing methods and of the speaker diarization steps on which we want to perform experiments is as follows (Fig. 1.1):
Figure 1.1: Speaker diarization system pipeline; the dotted line indicates that change point detection is not necessarily required
1.2 Problem
Manual transcription of audio contents in different scenarios is a tedious and time-consuming task in which voice transcription workers need to stay focused at all times.
This thesis addresses the problem of researching the factors that affect diarization results and of performing speaker diarization automatically in classroom scenarios. Which algorithm is most suitable for classroom scenarios, and what are the key factors affecting its accuracy?
1.3 Purpose
The objective of the project is to understand the implementations of current speaker diarization methods and compare the differences between them. We identify each algorithm's advantages, analyze which algorithm is best for our scenario, and finally build a diarization system for our online classroom scenario.
1.4 Goal
First, the input data are audio recordings without labels. All audio recordings are converted into acoustic features and passed through Voice Activity Detection (VAD). For the supervised method, we need to partition each utterance into non-overlapping segments as input to the speaker embedding network. This network embeds the segments, labels them, and then passes them to the diarization module. For the unsupervised method, the output of the speaker embedding network is passed to the clustering module without labels.
This project applies neural networks and clustering methods to segment and cluster each speaker in an audio recording into their own region, and selects the better method to achieve better results.
1.4.1 Ethics and Sustainability
This project supports sustainable development by enabling a speaker diarization system, which benefits both speech and meeting scenarios. Speaker diarization can be used for speaker adaptation in automatic speech recognition.
More importantly, it can help speaker retrieval and rich text transcription in many scenarios, such as online classes, telephone conversations, and online meetings, which is essential for automatic speech algorithms. Our project will have a great impact on online distance teaching: as the project continues to mature, we will no longer need to transcribe classroom contents manually. Students and teachers will no longer need to record the classroom content in detail or repeatedly listen to classroom audio. All they have to do is view the transcription text, and they can quickly become familiar with the class.
An ethical prerequisite for processing audio recordings is that the processing complies with privacy laws and respects the integrity of users. The data
used to produce the results presented in this thesis are non-confidential and processed solely for scientific purposes. For the data collection part, I have obtained permission from the data owner. The most critical issue here is whether class recording infringes the intellectual property rights of teachers and students. We believe that for classrooms where both teachers and students speak, we need to obtain mutual consent before we record and transcribe.
1.5 Thesis Contributions
The thesis contributions can be summarized as:
(1) Data collection and appropriate pre-processing of the experiment corpora. The original datasets are relatively small and include only audio recordings, which do not meet the input conditions of the neural network.
We need to transform the original audio recordings into log-mel spectrograms and record them according to the role of the speakers. Although the AMI Meeting Corpus consists of approximately 100 hours of audio recordings, it still seems small when we want to train the clustering network on AMI, so we use Sub-sequence Randomisation and Diaconis Augmentation for data augmentation.
(2) Building the speaker embedding network. We build the speaker embedding network with a 3-layer LSTM and a linear layer, and train the network on the LibriSpeech dataset with the generalized end-to-end loss function.
(3) Modifying the spectral clustering with cosine distance. We modify the last step of spectral clustering: the initial implementation by Google uses sklearn.cluster.KMeans from Python, which does not support a custom distance metric. We implement this step ourselves, initializing the centroids and computing the cosine distance.
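As an illustration, a k-means that assigns points by cosine distance might look as follows. This is a minimal numpy sketch under our own assumptions: the function name `cosine_kmeans`, the greedy farthest-point initialization, and the spherical centroid update are illustrative choices, not the thesis implementation.

```python
import numpy as np

def cosine_kmeans(X, k, n_iter=50):
    """Spherical k-means: assign points by cosine distance (1 - cosine
    similarity) instead of the Euclidean distance used by sklearn's KMeans."""
    # Work on unit vectors so a dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Greedy farthest-point initialization: deterministic, avoids duplicate seeds.
    idx = [0]
    for _ in range(1, k):
        sims = X @ X[idx].T
        idx.append(int(sims.max(axis=1).argmin()))
    centroids = X[idx].copy()
    for _ in range(n_iter):
        # Minimum cosine distance is the same as maximum cosine similarity.
        labels = (X @ centroids.T).argmax(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centroids[j] = c / np.linalg.norm(c)  # re-project onto the sphere
    return labels, centroids
```

Because centroids are renormalized after every update, the algorithm stays consistent with the cosine geometry of d-vectors, which is the property the stock Euclidean KMeans lacks.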
(4) Comparing the differences between spectral clustering and UIS-RNN and analyzing the factors and reasons that influence the results.
We perform various experiments on spectral clustering and UIS-RNN respectively. The experiments run on different types of data to find the factors that influence the final results. After the experiments, we analyze which algorithm is most suitable for our classroom scenarios.
(5) Building a speaker diarization system based on the AMI Meeting Corpus. After completing all the above tasks, our final aim is to build an AMI-based speaker diarization system that can provide a diarization service for our classroom scenario and transcribe the audio contents.
1.6 Research Methodology
The research methodology consists of quantitative evaluations and comparisons of different solutions. Some theories already exist on the topic of our work; the conducted research builds on previous work by testing the performance of different algorithms under different circumstances, with the goal of providing new solutions for online distance teaching.
1.7 Outline
The remainder of this thesis is structured as follows. Chapter 2 introduces the background and related works of this thesis. Chapter 3 describes the methodology and implementation of the automation system and the neural networks. Chapter 4 summarizes the performance of each network and makes the comparison. Our group discussions and conclusions are presented in Chapter 5. Finally, Chapter 6 outlines the future work of this project.
Chapter 2 Background
Today's speaker diarization systems [3, 4, 5] almost always have four main independent components: (1) a voice activity detection module, which removes the non-speech parts and divides the utterances into small segments; (2) an embedding neural network, which extracts d-vectors; (3) a clustering module, which determines the number of speakers and assigns speaker identities to their regions; (4) a re-segmentation module [4], which further refines the diarization results. The typical speaker diarization pipeline is as follows (Fig 2.1):
Figure 2.1: The typical pipeline of speaker diarization
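The four-stage pipeline can be summarized as a driver function. This is a hypothetical skeleton: the stage functions (`vad`, `embed`, `cluster`, `resegment`) are placeholders for real modules, not the actual components used in this thesis.

```python
# Hypothetical skeleton of the four-stage diarization pipeline described above.
# Each stage is a pluggable callable; a real system would supply a VAD model,
# an embedding network, and a clustering backend.

def diarize(audio, vad, embed, cluster, resegment=None):
    """Return (segment, speaker_id) pairs for one recording."""
    segments = vad(audio)                      # 1. drop non-speech, split utterance
    embeddings = [embed(s) for s in segments]  # 2. one d-vector per segment
    speakers = cluster(embeddings)             # 3. assign speaker identities
    result = list(zip(segments, speakers))
    if resegment is not None:                  # 4. optional refinement pass
        result = resegment(audio, result)
    return result
```

Keeping the stages as independent callables mirrors the modularity of the pipeline: each component can be swapped (e.g. spectral clustering vs. UIS-RNN) without touching the others.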
The most popular algorithms today fall into two categories, unsupervised [3, 6] and supervised [7]. The spectral offline clustering [3] proposed by Wang et al. and the Unbounded Interleaved-state Recurrent Neural Network (UIS-RNN) [7] proposed by Google are representative algorithms. Google has proposed another model called Joint Speech Recognition and Speaker Diarization
via Sequence Transduction [8], which improves the word-level diarization error rate from 15.8% to 2.2%. However, it is not included in the scope of our experiments.
The difference between supervised and unsupervised methods comes almost entirely from the clustering module. In unsupervised methods, the clustering algorithms that have been applied in diarization systems for different scenarios include Gaussian mixture models [9, 10], mean shift, agglomerative hierarchical clustering [5], k-means [3, 11], Links [3, 12], and spectral clustering [3]. Among fully supervised methods, the model proposed by Google called the Unbounded Interleaved-state Recurrent Neural Network (UIS-RNN) [7] seems to perform better than the others.
2.1 Related Works
2.1.1 Voice Activity Detection
Voice Activity Detection (VAD) (Figure 2.2) refers to determining the areas in an audio recording that contain a speaker's voice. Depending on the type of audio recording being processed, the non-speech areas may contain silence, music, room noise, or background noise. Effective voice activity detection is a very important part of a speaker segmentation and clustering system: if non-speech frames are included in the clustering process, they will have a great impact on the correct distinction between speakers. Effective voice activity detection can broadly be divided into the following 4 categories:
(1) Voice activity detection based on energy/spectrum.
(2) Model-based speech/non-speech detection.
(3) Mixed speech/non-speech detection.
(4) Voice activity detection based on multiple channels.
Energy-based voice detection is usually used for telephone speech, because in telephone speech non-speech generally includes only silence and slowly changing noise. In meeting scenarios, there are various types of noise, such as the rustle of paper and the sound of tables and chairs shaking.
Due to the limitations of energy-based voice detection methods, model-based voice detection methods are used in many speech segmentation and clustering systems, because they can characterize various acoustic features.
Figure 2.2: Voice Activity Detection; NS represents non-speech parts and S represents speech parts

In the system of Wooters et al. [13], only speech and non-speech models are used. In the complex system of Nguyen et al. [14], four models that distinguish between gender and channel bandwidth are used. In the paper [15], Zhu modeled noise and music. In their system, the audio file consists of 5 parts, namely voice, music, noise, voice superimposed with music, and voice superimposed with noise. The literature [16] divides the voice types in voice documents in more detail.
The model-based approach also has its limitations: it requires labeled datasets to train the speech/non-speech model. Moreover, a mismatch between the training set and the test set will seriously affect the generalization performance of the system. To solve these problems, a hybrid speech/non-speech detection method has been introduced. The method consists of two steps: the first step performs simple energy-based detection; the second step performs model-based detection. The model is trained on the test data itself, so no additional training data is required.
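The energy-based first step of such a hybrid detector can be sketched in a few lines. The frame sizes and the relative-dB threshold below are illustrative assumptions, not values from this thesis.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Crude first-pass VAD: mark a frame as speech (True) when its energy
    is within `threshold_db` dB of the loudest frame in the recording.
    Defaults assume 16 kHz audio: 25 ms frames with a 10 ms hop."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Per-frame energy in dB; the epsilon avoids log(0) on silent frames.
    energy_db = np.array([10 * np.log10(np.mean(f ** 2) + 1e-10) for f in frames])
    return energy_db > energy_db.max() + threshold_db
```

Because the threshold is relative to the recording's own peak energy, the detector adapts to the overall level; this is exactly the simplicity (and the weakness in noisy meeting rooms) that motivates the model-based second step.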
2.1.2 Change Point Detection
Time series analysis has become increasingly important in diverse research fields. Change Point Detection (CPD) (Figure 2.3) detects abrupt changes in data when a property of the time series changes [17]. In speech recognition, change point detection is applied for audio segmentation and for recognizing boundaries between silence, sentences, words, and noise [18]. Effective change point detection methods can be divided into five categories:
(1) Cumulative sum control chart (CUSUM). Based on small deviations in the accumulated data, it can detect whether the data distribution has changed. This is also the oldest and most primitive method. However, it requires tuning a threshold: if the threshold is too small, the model will be too sensitive, while too large a threshold leads to a rough model.

Figure 2.3: Change Point Module; the line represents where a change point happens
(2) Probability density estimation. For a time series, the probability density distribution before and after a change point will differ. Using the first N points, we can build probability density models and estimate the probability density function. A score then measures the change in the probability density distribution after each new point is added: the higher the score, the higher the probability that this point is a change point.
(3) Direct computation. Because the probability density distribution is difficult to estimate accurately, this approach does not estimate the probability density distribution itself but directly computes the difference in the probability distribution before and after a point. For the data before and after a point, some models and algorithms can measure the differences.
(4) Probabilistic methods. This category focuses on directly predicting whether a point is a change point. Gaussian processes and Bayesian processes are the most typical methods here. However, it is difficult to define a prior function, and the result is inaccurate if it is not well defined.
(5) Clustering methods. Clustering methods divide the time series into many clusters, for example with hierarchical clustering or graph clustering. If the behavior of a time series differs substantially from the other members of the same cluster, it is regarded as a change.
The CUSUM chart was developed by E. S. Page of the University of Cambridge for change point detection. It detects whether the data distribution has changed according to the slight deviation of the accumulated data. However, it needs a tuned threshold, and the model scale influences the accuracy: if the scale is too small, the model will be too sensitive, and otherwise too rough.
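The CUSUM idea, accumulating small deviations from a reference mean and flagging the point where the cumulative sum exceeds a threshold, can be sketched directly; the `slack` and `threshold` parameters are illustrative and would need the tuning discussed above.

```python
def cusum_change_point(series, target_mean, slack=0.5, threshold=5.0):
    """One-sided CUSUM: accumulate positive deviations from `target_mean`
    (minus a slack term that absorbs normal fluctuation) and return the
    first index where the cumulative sum exceeds `threshold`, or None."""
    s = 0.0
    for i, x in enumerate(series):
        # Reset to zero whenever deviations stay below the slack allowance.
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None
```

The sensitivity trade-off described above is visible in the two parameters: a small `threshold` flags noise as change points, while a large one delays or misses real changes.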
Probability density estimation detects the change point according to the difference in the probability density distribution before and after the change point. According to the paper [19], the basic idea is to build probability density models based on the first n points; when a new data point is added, an evaluation score measures the difference in the probability estimate. The higher the score, the higher the probability that this point is a change point.
Because probability density estimation needs a large amount of data and it is hard to achieve high accuracy, direct computation methods have been proposed. In the non-parametric model proposed by Naoki [20], an SST technique computes a change point score as the detection criterion, alongside other simple statistics that can be computed, such as the mean and variance. The disadvantages of this method are that the result is easily influenced by data noise and that it does not perform well in high dimensions.
In the paper [21], Saatci et al. proposed a model combining Bayesian online change point detection with Gaussian processes to create a non-parametric time series model. They present methods to reduce the computational burden of the model. Generally, offline algorithms perform better than online algorithms, and Gaussian processes can achieve more accuracy than Bayesian processes.
The last change point detection approach is the clustering method. It clusters many time series; within the same category, if the behavior of one time series differs substantially from the other members of the same cluster, it is regarded as a change.
2.1.3 Audio Embedding Network
Nowadays, speaker embedding neural networks are mostly modified from existing speaker verification networks [22, 23, 24, 25]. Wan et al. introduced an LSTM-based speaker embedding network for speaker verification by proposing a new loss function called the generalized end-to-end loss (GE2E) [26]. The GE2E loss function pushes each embedding towards the centroid of the true speaker and away from the false speakers. Compared with the tuple-based end-to-end loss function they proposed previously, the generalized end-to-end loss is more efficient and is confirmed to converge to a better model in a shorter time. Their model is trained on fixed-length segments extracted from a large corpus. They also propose a technique called MultiReader, which is similar to a regularization technique: when the target dataset does not have sufficient data, training on this dataset will lead to overfitting, so they require the network to also perform well on other datasets to regularize it.
As a result, they combine different data sources and give each data source a weight in order to indicate the importance of each data source.
In their experiments, they evaluate both text-dependent and text-independent speaker verification. For text-dependent speaker verification, they use 128 hidden nodes and a projection size of 64. For text-independent speaker verification, they use 768 hidden nodes with a projection size of 256. They lower the speaker verification EER to 3.55% by applying MultiReader to make their models support multiple keywords and multiple languages.
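The GE2E objective ("pull towards the true speaker's centroid, push from the others") can be illustrated with a simplified numpy sketch. This is an illustrative re-statement, not Wan et al.'s implementation: the learned scale and bias of the similarity are fixed here to w = 1, b = 0, and only the softmax variant of the loss is shown.

```python
import numpy as np

def ge2e_softmax_loss(emb):
    """Simplified GE2E softmax loss for an array of shape
    (n_speakers, n_utterances, dim): each embedding should be most similar
    to its own speaker's centroid and dissimilar to the other centroids."""
    n_spk, n_utt, _ = emb.shape
    emb = emb / np.linalg.norm(emb, axis=2, keepdims=True)
    centroids = emb.mean(axis=1)
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    loss = 0.0
    for j in range(n_spk):
        for i in range(n_utt):
            # Leave-one-out centroid for the true speaker, as in the paper,
            # so the embedding is not compared against a centroid containing itself.
            own = (emb[j].sum(axis=0) - emb[j, i]) / (n_utt - 1)
            own = own / np.linalg.norm(own)
            sims = emb[j, i] @ centroids.T   # cosine similarity to every speaker
            sims[j] = emb[j, i] @ own        # replace the true entry (leave-one-out)
            # Softmax cross-entropy with the true speaker j as the target class.
            loss += -sims[j] + np.log(np.exp(sims).sum())
    return loss / (n_spk * n_utt)
```

Minimizing this quantity drives within-speaker similarity up and between-speaker similarity down simultaneously, which is what makes GE2E more efficient than the earlier tuple-based loss.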
2.1.4 Spectral Offline Clustering
With the rise of deep learning in various domains, audio embeddings based on neural networks, also known as d-vectors, have revealed higher performance than i-vectors. The paper [3] presents a model combining LSTM-based d-vector audio embeddings with non-parametric clustering algorithms.
In their experiments, they evaluate four different clustering methods: (1) naive online clustering, a prototypical online clustering algorithm that represents each cluster by all of its corresponding embeddings; (2) Links online clustering, which estimates cluster probability distributions and models their substructure based on the embedding vectors; (3) k-means offline clustering, which determines the number of speakers by using the derivatives of the conditional Mean Squared Cosine Distance (MSCD) between each embedding and its cluster centroid; (4) spectral offline clustering. They observe that d-vector-based systems [26, 9] always achieve a significantly lower Diarization Error Rate than i-vector-based systems [27].
Spectral offline clustering can be contrasted with parametric models such as k-means. When clustering, we often assume that the data follow a Gaussian distribution; however, speech data are often non-Gaussian, which makes the centroid of a cluster an insufficient representation. Cluster imbalance will also influence the results: imbalanced speech data from different speakers can lead clustering algorithms to split large clusters into several small ones. Moreover, speakers can fall into groups according to features with large differences, such as gender; in this situation, the clustering process may cluster the speaker embeddings into just males and females. With spectral offline clustering, these problems are mitigated to some degree.
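A toy version of spectral clustering over d-vectors, using the eigengap of the graph Laplacian to estimate the number of speakers, might look as follows. The affinity construction and the deterministic k-means initialization here are simplifications of our own; the actual pipeline in [3] applies additional affinity refinement steps not shown.

```python
import numpy as np

def spectral_diarize(embeddings, max_speakers=8):
    """Toy spectral clustering: cosine-affinity matrix -> graph Laplacian ->
    eigengap estimate of the speaker count -> k-means on the leading
    eigenvectors. A sketch, not the refined pipeline of the paper."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, 1.0)          # cosine affinity in [0, 1]
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = np.diag(d) - A                      # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    gaps = np.diff(vals[:max_speakers + 1])
    k = int(gaps.argmax()) + 1              # eigengap estimate of k
    Y = vecs[:, :k]                         # spectral embedding of each segment
    # Tiny k-means on the spectral embedding with deterministic init.
    cents = Y[np.linspace(0, len(Y) - 1, k).astype(int)]
    for _ in range(50):
        labels = ((Y[:, None, :] - cents[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                cents[j] = Y[labels == j].mean(0)
    return labels
```

Because the clustering happens in the Laplacian's eigenspace rather than in the raw d-vector space, non-Gaussian and imbalanced clusters of the kind described above are handled more gracefully than by plain k-means.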
Figure 2.4: Speaker clustering; different colors of bars represent d-vectors embedded from different speakers
2.1.5 Unbounded Interleaved-state Recurrent Neural Network
The other clustering algorithm we want to experiment with is the Unbounded Interleaved-state Recurrent Neural Network proposed by Google. In UIS-RNN, the developers observe that one component is still unsupervised: the clustering module (Figure 2.4).
In the paper, they create a fully supervised model by using a speaker embedding network similar to the one used with spectral clustering. They split the model into three parts that separately model sequence generation, speaker assignment, and speaker change. In the decoding process, the maximum a posteriori probability criterion is applied, and beam search is used for processing.
To solve the problem of the unknown number of speakers, the authors add a distance-dependent Chinese restaurant process (ddCRP) [28]. The Chinese restaurant process is a Dirichlet process (DP) mixture model describing a scenario in which customers come to a restaurant to take a seat. It assumes that the restaurant has unlimited tables. The first customer sits at the first table, and the probability that the $n$-th customer sits at the $k$-th table is:

$$p(\text{table } k) = \frac{n_k}{\alpha_0 + n - 1}$$

where $n_k$ is the number of customers at the $k$-th table and $\alpha_0$ is a given scaling parameter. The customer can also sit at a new table, with probability:

$$p(\text{new table}) = \frac{\alpha_0}{\alpha_0 + n - 1}$$
The difference between the Chinese Restaurant Process and the distance-dependent Chinese Restaurant Process is that in ddCRP distributions, the seating plan probability is described in terms of the probability of a customer sitting with each of the other customers, not with tables. The sitting probability depends only on the distance between customers; the distance can be time, space, or another characteristic.

$$p(c_i = j \mid D, \alpha) \propto f(d_{ji}), \quad j \neq i$$
$$p(c_i = j \mid D, \alpha) \propto \alpha, \quad j = i$$

Here $c_i$ denotes the customer with whom the $i$-th customer sits, and $d_{ji}$ is the distance measurement between customers $i$ and $j$. The function $f$ is a decay function. This addition helps UIS-RNN determine the number of speakers in an utterance and is the most important part of the clustering process.
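The plain CRP seating rule is easy to simulate: the sketch below draws table assignments from exactly the two probabilities given above (here `alpha` denotes the scaling parameter $\alpha_0$; the function is an illustration, not part of UIS-RNN's code).

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Simulate table assignments under a plain CRP: customer n joins
    existing table k with probability n_k / (alpha + n - 1) and opens a
    new table with probability alpha / (alpha + n - 1)."""
    rng = random.Random(seed)
    tables = [1]        # customer 1 sits at the first table
    assignments = [0]
    for n in range(2, n_customers + 1):
        # Existing tables weighted by occupancy, then one slot for a new table;
        # the weights sum to alpha + n - 1, matching the formula's denominator.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables
```

The "rich get richer" behavior is visible in the weights: crowded tables attract more customers, while `alpha` controls how readily new tables (new speakers, in the diarization analogy) are created.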
Another innovative aspect of UIS-RNN is that the authors train the embedding network with three different strategies and obtain d-vectors in three versions. They first train the embedding network with just the basic datasets, and then use the far-field training set to train the network again. After that, they find that the training window is too large when they run a small window in the diarization system, so they retrain a new model using variable-length windows.
To validate the model, the authors choose the CALLHOME dataset (2000 NIST SRE, disk-8) with 5-fold cross-validation, and the model is measured by DER evaluation. In addition, two off-domain datasets are used: Switchboard (2000 NIST SRE, disk-6) and the ICSI conference corpus. The final Diarization Error Rate can be as low as 7.6%, and the segmentation performance exceeds the existing state-of-the-art algorithms.
2.1.6 Some previous related AMI-based system
The paper [29] presented diarization results on different speech databases from RT-04S to RT-06S. The important point of their result is that they first split the audio input into seven conditions, with experiments mainly conducted under two main conditions: multiple distant microphones (MDM) and a single distant microphone (SDM). This is very detailed, and the number, direction, and type of microphones can indeed influence the audio input. Second, they split the meeting scenarios into conference meetings and lecture meetings. It is well known that the type, number, and placement of sensors have a significant impact on the performance. Conference meetings are primarily goal-oriented, decision-making exercises and can vary from moderated meetings to group consensus-building meetings. As such, these meetings are highly interactive, and multiple participants contribute to the information flow and decisions made. In contrast, lecture meetings are educational events where a single lecturer briefs the audience on a particular topic. While the audience occasionally participates in question and answer periods, it rarely controls the direction of the interchange or the outcome [29].
The paper [30] proposed a new method called Discriminative Neural Clustering (DNC), evaluated on the AMI Meeting Corpus, and produced considerably more advanced results than other systems. They were the first to run a system fully on the AMI Corpus; they re-aligned the AMI data and thereby fundamentally solved a problem of AMI itself. They used a neural Change Point module whose overall output can be regarded as a binary classifier. The input is a temporal RNN: the static features of single frames in the first n frames and the last n frames around the current frame are extracted and then aggregated through RNNs. The feature outputs of the RNNs are combined by a point-wise multiplication and passed through further layers to the final binary classification output. Regarding the speaker clustering module, the authors extract features (embeddings) from each segment and then cluster them, so the key is how the features are extracted. The authors designed many structures: for example, a self-attention structure for each segment, producing a weighted feature for clustering. In addition, they designed a 2D Consecutive Self-attentive Combination, which is in effect an ensemble of a variety of feature models followed by an overall self-attention. Furthermore, for the ensemble, in addition to self-attention, they also tried feature fusion with bi-linear pooling. They selected the Speaker Error Rate (SER) as their evaluation metric and achieved an SER of 13.90%.
Chapter 3
Methods and Implementation
3.1 Data Processing
Our application scenarios are class audio recordings of different kinds, each consisting of one teacher and several students. However, because of COVID-19, EDAI Technology could not collect data from their online class applications, so I could only choose public datasets that are as similar as possible to our scenario. I finally collected data from the AMI Meeting Corpus, which contains 100 hours of meeting recordings captured by different devices, such as distant microphones and headset microphones.
Two-thirds of the AMI Meeting Corpus consists of recordings in which groups of four people played four different roles in a team. The remaining third of the corpus contains recordings of other types of meetings.
To transform the raw audio recordings into the input format of the embedding network, I first split the data into two parts, 90% for training and 10% for testing, then loaded the utterance audios and split each utterance into partial utterances by voice detection; here I simply used the decibel level as the detection criterion. I transformed the audio signals into frames of width 25 ms with a step of 10 ms. We also experimented with widths of 20 ms and 40 ms; both performed slightly worse than 25 ms on our experiment data. Too short a width carries too little information, while too long a width leads to under-fitting in the next processing stage. However, 25 ms may not be the optimal choice, and further tuning is possible.
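The decibel-based voice detection described above can be sketched as follows. This is a minimal, hypothetical helper (the sample rate, frame sizes and -40 dB threshold are illustrative assumptions, not the thesis's exact values):

```python
import numpy as np

def split_on_silence(signal, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Mark frames whose RMS level (in dB) exceeds a threshold, then
    return (start, end) sample ranges of the contiguous voiced runs."""
    win, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - win) // hop)
    voiced = []
    for i in range(n):
        frame = signal[i * hop : i * hop + win]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-10  # avoid log(0)
        voiced.append(20 * np.log10(rms) > threshold_db)
    runs, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * hop                      # run opens
        if not v and start is not None:
            runs.append((start, i * hop + win))  # run closes
            start = None
    if start is not None:
        runs.append((start, len(signal)))
    return runs
```

A partial utterance is then each `(start, end)` slice of the signal.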
After applying the short-time Fourier transform, Mel filters and a log function, I obtained the log-mel spectrograms of the utterances and saved them as numpy files.
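This STFT-to-log-mel pipeline can be sketched in NumPy alone. The parameters below (16 kHz audio, 25 ms/10 ms framing, 512-point FFT, 40 mel bands) are assumptions for illustration, not necessarily the thesis's exact settings:

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, win_ms=25, hop_ms=10,
                        n_mels=40, n_fft=512):
    """Frame the signal, take the STFT power spectrum, and map it
    through a triangular mel filterbank followed by a log."""
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (n_frames, n_fft//2+1)

    # Triangular mel filterbank
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    return np.log(spec @ fbank.T + 1e-6)             # (n_frames, n_mels)
```

The resulting array can be saved with `np.save` for later training.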
3.2 Data Augmentation
Although the AMI Meeting Corpus contains 100 hours of meeting recordings, it is still relatively small for building a diarization system entirely on AMI. Under these circumstances we need data augmentation, and we used two methods.
3.2.1 Select Sub-Sequence
The simplest method is to select sub-sequences of the full sequence. This gives us more training examples, since each sub-sequence can be treated as a short meeting, and every new sub-sequence is given its own fresh labels.
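A minimal sketch of this selection, under the assumption that sub-sequences are taken with a fixed length and stride and that speaker labels are renumbered from zero within each slice (the exact slicing scheme used in the thesis is not specified):

```python
import numpy as np

def subsequences(vectors, labels, length, stride):
    """Cut a full meeting (vectors + speaker labels) into overlapping
    sub-sequences; each slice acts as an independent short meeting,
    with speaker labels renumbered so every sub-sequence gets new labels."""
    out = []
    for s in range(0, len(vectors) - length + 1, stride):
        segment = labels[s : s + length]
        remap = {lab: i for i, lab in enumerate(dict.fromkeys(segment))}
        out.append((vectors[s : s + length], [remap[l] for l in segment]))
    return out
```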
3.2.2 Input Vector Randomisation
The next augmentation method is called input vector randomisation. From the speaker embedding network we obtain the utterance vectors and their labels; for example, a vector sequence X with labels Y = [1, 3, 2, 1, 1]. We then alter the input vectors based on the labels, replacing each vector with another vector spoken by the same speaker while preserving the label sequence. This augmentation process is depicted in Figure 3.1.
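The swap can be sketched as follows; this is an illustrative implementation in which each position is resampled uniformly among that speaker's vectors, which is one plausible reading of the method:

```python
import numpy as np

def randomise_inputs(X, y, rng=None):
    """Replace each vector X[i] with a randomly chosen vector from the
    same speaker y[i], so the label sequence y is preserved exactly."""
    rng = rng or np.random.default_rng()
    X_new = np.empty_like(X)
    for spk in set(y):
        idx = [i for i, lab in enumerate(y) if lab == spk]
        src = rng.choice(idx, size=len(idx), replace=True)  # same-speaker sources
        X_new[idx] = X[src]
    return X_new
```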
3.3 Embedding Network Building
The embedding network is relatively simple compared with the diarization network. It consists of a 3-layer LSTM followed by a linear layer; the last-frame output of the LSTM is taken as the d-vector representation.
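As an illustration, a minimal sketch of such a network in PyTorch; the layer sizes (40 mel features, hidden size 256, 64-dimensional d-vector) are assumptions, not the thesis's configuration:

```python
import torch
import torch.nn as nn

class DVectorNet(nn.Module):
    """3-layer LSTM followed by a linear projection; the last-frame
    output is L2-normalized and used as the d-vector."""
    def __init__(self, n_mels=40, hidden=256, dim=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, x):                 # x: (batch, frames, n_mels)
        out, _ = self.lstm(x)             # out: (batch, frames, hidden)
        d = self.proj(out[:, -1])         # last-frame output only
        return d / d.norm(dim=1, keepdim=True)  # unit-length d-vector
```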
The important part of the embedding network is the loss function. I trained the network with the state-of-the-art generalized end-to-end (GE2E) loss. GE2E training operates on a large number of utterances at once: I put N different speakers with M utterances each into one batch. I denote the output of the entire network as f(x_{ji}; ω), where x_{ji} is the i-th utterance of speaker j and ω represents the parameters of the network. The d-vector is then the L2 normalization of the network output:

e_{ji} = f(x_{ji}; ω) / ||f(x_{ji}; ω)||_2    (3.1)
Figure 3.1: Input vector randomisation. X represents d-vectors embedded from different speakers, and the list Y represents the speaker identities.
I also needed to compute the centroid of each speaker's cluster:

c_k = (1/M) Σ_{i=1}^{M} e_{ki}    (3.2)
The difference between TE2E and GE2E is that TE2E computes a scalar similarity between an embedding vector and a single centroid, whereas GE2E builds a similarity matrix containing the similarities between every embedding vector and all centroids:

S_{ji,k} = w · cos(e_{ji}, c_k) + b    (3.3)

Our purpose is to push the embedding of each utterance far from other speakers' centroids while keeping it close to its own speaker's centroid. According to the paper, there are two ways to achieve this; the first applies a softmax function over the similarity matrix:

L(e_{ji}) = −S_{ji,j} + log Σ_{k=1}^{N} exp(S_{ji,k})    (3.4)
The second method computes a contrast loss defined on positive pairs and the most aggressive negative pairs. In this project, I chose the softmax function.

According to the research work by Wan et al., when computing the centroid of the true speaker, it is necessary to exclude e_{ji}, which makes the training stable. Following the paper, I therefore compute the centroids and similarities as follows:
c_j^{(−i)} = (1/(M−1)) Σ_{m=1, m≠i}^{M} e_{jm}    (3.5)

S_{ji,k} = w · cos(e_{ji}, c_j^{(−i)}) + b,  if k = j    (3.6)
S_{ji,k} = w · cos(e_{ji}, c_k) + b,  otherwise    (3.7)

The final GE2E loss is the sum of all losses over the similarity matrix:

L_G(x; w) = L_G(S) = Σ_{j,i} L(e_{ji})    (3.8)
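The loss above can be sketched in NumPy for one batch of N speakers with M utterances each. This is a hypothetical batch-level implementation of the softmax variant, with assumed values for the learnable scale (w = 10) and bias (b = −5); it is not the thesis's training code:

```python
import numpy as np

def ge2e_softmax_loss(e, w=10.0, b=-5.0):
    """e: (N, M, D) L2-normalized d-vectors for N speakers, M utterances
    each. Returns the GE2E softmax loss, Eqs. (3.2)-(3.8)."""
    N, M, D = e.shape
    c = e.mean(axis=1)                                     # Eq. 3.2, (N, D)
    c_excl = (e.sum(axis=1, keepdims=True) - e) / (M - 1)  # Eq. 3.5, (N, M, D)

    def cos(a, b_):
        return (a * b_).sum(-1) / (np.linalg.norm(a, axis=-1)
                                   * np.linalg.norm(b_, axis=-1))

    # Similarity matrix S[j, i, k], Eqs. 3.6-3.7
    S = np.empty((N, M, N))
    for k in range(N):
        S[:, :, k] = w * cos(e, c[k]) + b            # vs. other centroids
    for j in range(N):
        S[j, :, j] = w * cos(e[j], c_excl[j]) + b    # own centroid excludes e_ji

    # Eq. 3.4 summed over all (j, i), Eq. 3.8
    true_sim = S[np.arange(N)[:, None], np.arange(M)[None, :],
                 np.arange(N)[:, None]]
    return float((-true_sim + np.log(np.exp(S).sum(-1))).sum())
```

Well-clustered embeddings should yield a markedly lower loss than random ones.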
3.4 Spectral Clustering Modifications
In this part, I implemented the spectral clustering algorithm by modifying the original spectral clustering. The paper[3] proposes that in the last step of spectral clustering, the K-Means algorithm should be applied to cluster the new embeddings and produce their labels. However, the model shown by Google uses the sklearn.cluster package in Python, which uses Euclidean distance and does not support customized distances such as cosine distance[31]. So I implemented the K-Means module of the spectral clustering algorithm myself, in two variants.
3.4.1 Initial Centroids By Roulette Wheel Selection
Before the final K-Means step, the spectral clustering algorithm has already determined the number of clusters k using the maximal eigen-gap through a series of processing steps. The clustering then proceeds as follows:

(1) Randomly choose a data point as the first centroid.
(2) For each sample x, compute the minimal cosine distance D(x) between the sample and the centroids chosen so far.
(3) Using roulette wheel selection, choose the next centroid with each sample's probability of being selected given by

D(x) / Σ_{i=1}^{N} D(x_i)

(4) Repeat (2) and (3) until all k initial centroids are found.
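Steps (1)-(4) above can be sketched as follows (a minimal implementation with cosine distance; the helper names are mine, not the thesis code):

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def roulette_init(X, k, rng=None):
    """Pick k initial centroids: the first at random, each subsequent one
    with probability proportional to its minimal cosine distance to the
    centroids already chosen (roulette wheel selection)."""
    rng = rng or np.random.default_rng()
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        d = np.array([min(cosine_dist(x, c) for c in centroids) for x in X])
        d = np.maximum(d, 0.0)            # guard against float round-off
        centroids.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.array(centroids)
```

This is essentially the k-means++ seeding strategy with cosine distance in place of squared Euclidean distance.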
3.4.2 Initial K Centroids Randomly
Because the K-Means clustering algorithm is sensitive to its initial values, I also implemented the following initialization method:

(1) Randomly initialize k centroids from the whole set D.
(2) For each remaining sample in D, compute the cosine distance between it and the centroids, and assign the sample to the cluster with the shortest distance.
(3) According to the clustering result, recalculate each cluster's centroid.
(4) Re-cluster the samples in D based on the new centroids.
(5) Repeat (3) and (4) until the cluster assignment remains unchanged.

Because the centroids are selected randomly, a single run may not yield the best result, so we also set an iteration count and repeat the whole procedure several times.
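The loop in steps (1)-(5) can be sketched as a cosine-distance K-Means (a simplified single-run sketch; restarts and tie-breaking policies are assumptions):

```python
import numpy as np

def cosine_kmeans(X, k, n_iter=100, rng=None):
    """K-Means with cosine distance and random initial centroids.
    Returns the cluster label of each row of X."""
    rng = rng or np.random.default_rng()
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = Xn[rng.choice(len(Xn), size=k, replace=False)]  # step (1)
    labels = None
    for _ in range(n_iter):
        cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        new_labels = (Xn @ cn.T).argmax(axis=1)      # step (2): max cosine sim
        if labels is not None and np.array_equal(new_labels, labels):
            break                                    # step (5): converged
        labels = new_labels
        for j in range(k):                           # steps (3)-(4)
            if (labels == j).any():
                centroids[j] = Xn[labels == j].mean(axis=0)
    return labels
```

In practice this would be wrapped in a loop of several random restarts, keeping the run with the best objective.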
3.5 Evaluation Metric
The main metric used in our experiments is the Diarization Error Rate (DER), as described and used by NIST in the RT evaluations[29]. We calculate not only its overall value but also its three components: Confusion, False Alarm and Miss, as follows:
DER = E_confusion + E_FA + E_Miss    (3.9)

Confusion, also called speaker error, is the percentage of scored time for which a speaker ID is assigned to the wrong speaker:
E_confusion = Σ_{s=1}^{S} dur(s) · (min(N_ref(s), N_hyp(s)) − N_correct(s)) / T_score    (3.10)

The terms N_ref(s) and N_hyp(s) denote the number of speakers speaking in segment s in the reference and the hypothesis respectively, and N_correct(s) denotes the number of speakers in segment s that have been correctly matched. False Alarm is the percentage of scored time for which a hypothesized speaker is labelled as non-speech in the reference, and Miss is the percentage of scored time for which a hypothesized non-speech segment corresponds to a reference speaker segment:
E_FA = Σ_{s=1}^{S} dur(s) · (N_hyp(s) − N_correct(s)) / T_score    (3.11)
E_Miss = Σ_{s=1}^{S} dur(s) · (N_ref(s) − N_correct(s)) / T_score    (3.12)
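As a concrete illustration, the standard NIST-style per-segment decomposition of DER can be computed as follows. This is a hypothetical helper (the segment tuple format is assumed), not the NIST scoring tool, and it uses the usual max(0, ·) form of the Miss and False Alarm terms:

```python
def der_components(segments):
    """segments: list of (dur, n_ref, n_hyp, n_correct) per scored segment.
    Returns (confusion, false_alarm, miss, DER), each as a fraction of
    the total scored speaker time T_score."""
    t_score = sum(d * nr for d, nr, _, _ in segments)
    miss = sum(d * max(0, nr - nh) for d, nr, nh, _ in segments) / t_score
    fa   = sum(d * max(0, nh - nr) for d, nr, nh, _ in segments) / t_score
    conf = sum(d * (min(nr, nh) - nc) for d, nr, nh, nc in segments) / t_score
    return conf, fa, miss, conf + fa + miss
```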