
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2020

A study of the temporal resolution in lipreading using event vision

KTH Bachelor Thesis Report

Marcus Enström

Axel Alness Borg


Abstract

Mechanically analysing visual features from lips to extract spoken words consists of finding patterns in movements, which is why machine learning has been applied in previous research to address this problem.

In previous research, conventional frame based cameras have been used with good results. Classifying visual features is computationally expensive, and capturing just enough information can therefore be important. Event cameras are a type of camera inspired by human vision: they only capture changes in the scene and offer very high temporal resolution. In this report we investigate the importance of temporal resolution when performing lipreading and whether event cameras can be used for lipreading.

A trend of initially increasing accuracy, which peaks at a maximum and later decreases as the frame rate is increased further, can be observed. We were therefore able to conclude that, when using a frame based representation of event data, increasing the temporal resolution does not necessarily increase classification accuracy. It is however difficult to be certain about this conclusion, since there are many other parameters that could affect the accuracy, such as an increasing temporal resolution requiring a larger dataset and the parameters of the neural network used.


Abstract (Swedish)

Mechanical analysis of visual features from lips to extract spoken words consists of finding patterns in movements, which is why machine learning has been used in previous research to address this problem.

In previous research, conventional frame based cameras have been used with good results. Classifying visual features is expensive, and capturing just enough information can be important. Event cameras are a new type of camera inspired by how human vision works; they only capture changes in the scene and offer very high temporal resolution. In this report we investigate the importance of temporal resolution in lipreading and whether an event camera can be used for lipreading.

A trend can be observed where the accuracy initially increases, peaks at a maximum and then decreases as the frame rate is increased further. The research can therefore conclude that, when a frame based representation of event data is used, increasing the temporal resolution does not necessarily increase the classification accuracy. It is however difficult to be certain about this conclusion, since there are many other parameters that can affect the accuracy, such as a higher temporal resolution requiring a larger dataset and the parameters of the neural network used.


Authors

Axel Alness Borg, axelborg@kth.se
Marcus Enström, menst@kth.se

Information and Communication Technology
KTH Royal Institute of Technology

Examiner

Örjan Ekeberg

KTH Royal Institute of Technology

Supervisor

Jörg Conradt


Contents

1 Introduction
1.1 Research question
1.2 Scope
2 Background
2.1 Lipreading
2.2 Event Camera
2.3 Machine Learning
2.4 Related Work
3 Method
3.1 Dataset
3.2 Deep Learning Network
4 Results and Analysis
4.1 Accuracy and Frame Rate
4.2 Speaker Level Accuracy
4.3 Word Level Accuracy
5 Discussion
5.1 Accuracy and Frame Rate
5.2 Event Camera vs Normal Camera
5.3 Speaker Level Accuracy
5.4 Word Level Accuracy
6 Conclusion

1 Introduction

Analysing and interpreting speech is one of the most, if not the most, important aspects of human interaction, since it is how we express ourselves and communicate with others. As a result, doing so mechanically is of great interest, with some everyday applications being voice assistants in phones, automatic transcription of audio and video, and aiding the hearing impaired.

Understanding human speech consists of analysing visual features, such as the movement of the lips, and auditory features produced in the form of sound waves. Although the majority of the information can be gained from auditory features, as shown by the success of voice recognition in phones, there are cases where visual features can act as a valuable complement. These cases include noisy environments where the audio alone is not sufficient to classify the spoken words, videos without audio, and words which sound very similar, such as 'bar' and 'far', where visual features can help differentiate them. The latter is known as the McGurk effect [26].

Analysis of the visual features of speech is also referred to as lip reading. Lip reading aims to evaluate a speaker's lips during speech and determine the word being pronounced, which is a very difficult task for humans. Even those who are trained in the area achieve very poor accuracy compared to modern computer aided lipreading.

In recent years speech to text recognition has received a lot of attention and has gained a lot of traction, partly due to advances in neural networks and the increase in computational power. These systems have proven to be very successful. Previous research shows that the accuracy of these models can be increased by extracting lip features and using these as input data in combination with audio, especially when the audio contains noise. [5]

One challenge for video based classification is that each frame in the video often contains regions that are not relevant for the prediction, which requires unnecessary processing power. A possible alternative to a frame-based camera is an event based camera, a type of sensor whose development began roughly 20 years ago. These event based cameras work in a radically different way compared to traditional cameras that take pictures at a constant rate. The event camera's pixels work asynchronously, where each pixel detects changes in brightness. This gives it some advantages compared to a traditional camera: it has a higher dynamic range, does not suffer from motion blur, has microsecond temporal resolution and consumes much less power. Temporal resolution refers to the resolution of a measurement with respect to time; low temporal resolution for a normal camera would mean a low number of captured frames per second, and vice versa.

Event cameras have not previously been used specifically for lip reading, but they have been used for action recognition, which is a similar problem. For example, in [17] an event based camera was used for recognising hand symbols.

1.1 Research question

This thesis will study the importance of the temporal resolution when performing lip reading. Using an event based camera that can also capture normal frames, we aim to answer these questions:

• Does a higher temporal resolution increase the classification accuracy of speech?

• How does a normal camera compare to an event based camera in terms of classification accuracy of speech?

As for the first question, our hypothesis is that higher classification accuracy will be achieved with a higher temporal resolution as it will allow for more information to be captured. Our hypothesis for the second question is that an event camera will perform better due to its high temporal resolution.

1.2 Scope

The scope of our thesis is limited by the dataset. Since no suitable dataset currently exists, one is recorded using an event camera. As the time available for this thesis is limited, recording a large, extensive dataset is not possible. A vocabulary of 8 different words is used in order to make sure that enough recordings of each word are collected. The dataset is also collected using a total of 4 different speakers, in an attempt to make sure that the results are not specific to a single speaker.

2 Background

2.1 Lipreading

Human classification accuracy is as low as 17 ± 12 % for a limited subset of only 30 monosyllabic words and 21 ± 11 % for 30 compound words [8]. The accuracy achieved by a human can be expected to be low, as movements are quick and the patterns of different words have similarities. Visually classifying spoken words is a task of pattern recognition, which is why the field of machine learning has proven successful in solving it.

Most previous work has been done on word level lipreading, but the accuracy can be improved when lipreading is done on sentences: on the same dataset, [10] achieved 86.4 % accuracy while [1] achieved 95.2 % accuracy when doing sentence level lipreading.

2.2 Event Camera

An event camera like the Dynamic Vision Sensor (DVS) [16] is a relatively new type of camera that works radically differently from traditional cameras. The event camera has independent pixels that work asynchronously to detect intensity changes. The output of the camera is therefore not frames but rather a stream of events. Each event contains the x and y coordinates of the pixel that detected the event, the timestamp of the event and the polarity of the event, where the polarity indicates whether the intensity increased or decreased. One advantage of event cameras is their high temporal resolution, which for [4] is as low as 3 µs; that sensor also has a dynamic range of 130 dB. These figures are much better than those of traditional cameras. In many situations the event camera can also be much more energy efficient, as data is only captured when there is a change in the scene. For real-time tracking, event cameras can use up to 100 times less power compared to traditional cameras [20].
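To make the event-stream format concrete, the sketch below stores a stream of events as a NumPy structured array with the four fields described above. The field names and example values are purely illustrative assumptions, not part of any specific camera SDK.

```python
# Minimal sketch of an event stream as described above: each event carries
# x, y, a microsecond timestamp and a polarity (+1 brightness increase,
# -1 brightness decrease). Field names are illustrative assumptions.
import numpy as np

event_dtype = np.dtype([
    ("x", np.uint16),   # pixel column
    ("y", np.uint16),   # pixel row
    ("t", np.int64),    # timestamp in microseconds
    ("p", np.int8),     # polarity: +1 or -1
])

# Three hand-made events for illustration.
events = np.array([(120, 80, 1_000, 1),
                   (121, 80, 1_250, -1),
                   (119, 81, 1_400, 1)], dtype=event_dtype)

# Events arrive asynchronously, so they are naturally ordered by timestamp.
events = np.sort(events, order="t")
print(events["t"], events["p"])
```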

The event camera described in [4] does not only output events but also global shutter frames. This combination allows for recognition and classification with the global shutter frames while still having the ability to track fast moving objects.


Since event cameras are relatively new, one issue is the lack of available datasets compared to traditional cameras, which have been around for a long time. The datasets that do exist for event cameras are often small and limited in comparison to those available for conventional cameras. This means that most research which uses an event camera requires a dataset to be collected.

2.2.1 Representation of event data

Figure 2.1: Different ways of representing event data [9]

One key aspect of working with event cameras is deciding how to represent the acquired event data. There are several different approaches; figure 2.1 showcases some examples of how the event data can be represented.

Figure 2.2: Event frames generated over different amounts of time in [18]

One of the simplest approaches is to collect events into 2D histograms to create frames. There are two ways to collect events into frames. The first approach collects all events within a time period into a frame. The choice of how much time to collect events over is important: choosing too small a time will create less discriminative images that are harder to learn from, while too large an integration time will lead to motion blur [18]. This can be seen in figure 2.2. The other approach is to generate each frame from a fixed number of events [19]. Both approaches have different advantages, so a choice must be made depending on the scenario. Since each event has a polarity, p ∈ {−1, 1}, one can either collect frames with two channels containing negative and positive events or weigh these together into a single channel frame.
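As a concrete illustration of the two frame-building strategies, the sketch below (our own illustration, not code from [18] or [19]) accumulates the structured event array from the previous example either over a fixed time window or over a fixed number of events, producing a two-channel histogram per frame.

```python
# Illustrative sketch of the two accumulation strategies described above,
# assuming an `events` structured array with fields x, y, t (µs), p (+1/-1).
import numpy as np

def events_to_histogram(chunk, height, width):
    """Accumulate one chunk of events into a 2-channel (positive/negative) histogram."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    for channel, polarity in enumerate((1, -1)):
        sel = chunk[chunk["p"] == polarity]
        np.add.at(frame[channel], (sel["y"], sel["x"]), 1.0)
    return frame

def frames_by_time(events, window_us, height, width):
    """First strategy: one frame per fixed time window (this sets the frame rate)."""
    t0, t1 = events["t"][0], events["t"][-1]
    edges = np.arange(t0, t1 + window_us, window_us)
    for start, stop in zip(edges[:-1], edges[1:]):
        chunk = events[(events["t"] >= start) & (events["t"] < stop)]
        yield events_to_histogram(chunk, height, width)

def frames_by_count(events, n_events, height, width):
    """Second strategy: one frame per fixed number of events."""
    for i in range(0, len(events) - n_events + 1, n_events):
        yield events_to_histogram(events[i:i + n_events], height, width)
```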

A more complex approach is to use time surfaces. A time surface is a 2D map where each pixel holds the timestamp of its most recent event, which better preserves the rich temporal information of the event stream [15]. Another advantage of this representation is that it can be updated asynchronously. There are additional ways to represent the data, for example Voxel Grids [2] and 3D point sets [25].
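A time surface can be sketched in a few lines: each pixel stores the timestamp of its most recent event, and the map is typically read out through an exponential decay so that recent activity dominates, in the spirit of [15]. The snippet below is a simplified illustration under those assumptions, not the HOTS implementation itself.

```python
# Simplified time-surface sketch: each pixel keeps the timestamp of its most
# recent event; an exponential decay turns the timestamps into a 2D feature map.
import numpy as np

surface = np.full((180, 240), -np.inf)  # sensor height x width (e.g. the 240x180 sensor in [4])

def update_time_surface(last_t, event):
    """Asynchronous update: overwrite the pixel with the new event's timestamp."""
    last_t[event["y"], event["x"]] = event["t"]
    return last_t

def read_time_surface(last_t, now_us, tau_us=50_000.0):
    """Decay older events; pixels that never fired decay to zero."""
    return np.exp(-(now_us - last_t) / tau_us)
```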

2.3 Machine Learning

Machine learning is used to find regularities in data, which can then be used to perform actions such as classification and prediction. It is applied to all types of data and there exist many different machine learning methods. This report will focus on artificial neural networks (ANN), which are inspired by how the neurons in the human brain work [3].

Deep neural networks are a machine learning method based on ANNs with multiple layers [23]. A class of deep learning networks are convolutional neural networks (CNN), which are often applied for analyzing visual imagery [28]. They do this by using a kernel of a chosen fixed size which acts as a filter and extracts relevant spatial and temporal information. Convolutional layers are often followed by pooling layers, which are used to reduce the dimensionality by extracting the dominant features. There are two types of pooling: max pooling, which takes the maximum value covered by the kernel, and average pooling, which takes the average value covered by the kernel.

More complex patterns in the data may require going deeper with several layers. One disadvantage of stacking several convolutional layers together is the well known problem of vanishing gradients. One way of addressing this problem is to use residual networks (ResNet), which work by adding shortcut connections [12].

Recurrent neural networks (RNN) are a type of ANN commonly used for speech recognition [22]. They give the network memory by using loops that allow information to persist, but they have problems with longer-term dependencies. A solution to this is the use of long short-term memory (LSTM) [13]. LSTM networks have input, output and forget gates. In our research we used a simpler variant of the LSTM called the gated recurrent unit (GRU); the main difference is that it lacks an output gate [11].

While normal RNNs only preserve information from the past, a bidirectional recurrent neural network (BRNN) runs the input in two directions in order to preserve information from the future as well as the past [24]. In our research a BRNN variant of the GRU was used, called the bidirectional gated recurrent unit (BiGRU).
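The difference between a unidirectional and a bidirectional GRU can be seen directly in a framework such as PyTorch (an assumption about tooling; the actual network from [21] is discussed later): the bidirectional variant concatenates a forward and a backward pass, doubling the feature size per time step.

```python
# Illustrative comparison of a GRU and a bidirectional GRU (BiGRU) in PyTorch.
import torch
import torch.nn as nn

x = torch.randn(4, 15, 64)                  # (batch, time steps, features)

gru = nn.GRU(64, 128, num_layers=2, batch_first=True)
bigru = nn.GRU(64, 128, num_layers=2, batch_first=True, bidirectional=True)

out_fwd, _ = gru(x)    # (4, 15, 128): past context only
out_bi, _ = bigru(x)   # (4, 15, 256): forward and backward states concatenated
print(out_fwd.shape, out_bi.shape)
```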

2.4 Related Work

There have been some major advances in lip reading during the last couple of years thanks to developments within neural networks. At the same time, the relatively new event cameras are being applied in more and more areas and have proven to be useful in many of them.

2.4.1 Lipreading

In [21] an audiovisual speech recognition system is designed; the system can be seen in figure 2.3. It consists of two streams, one visual and one audio stream, which work independently and are then combined. The visual stream is very similar to the one used in [27], which is displayed in figure 2.4. The system in [27] was the state of the art at the time of this thesis, with the best accuracy on word level lipreading on the "Lip Reading in the Wild" dataset [6]. It achieved an accuracy of 83.0 % while [21] achieved 82.0 % accuracy with their visual stream. Both of these systems outperform previous work on the same dataset by quite a margin, as the accuracy achieved in [7] was 76.2 % and 61.1 % in [6].

Figure 2.3: The model used in [21] for audiovisual speech recognition

Both [21] and [27] use a 3D convolution followed by a rectified linear unit (ReLU) and a 3D max pooling layer to reduce the spatial size of the 3D feature map from the 3D convolution. This is then fed into a 34-layer ResNet. After this layer the two systems differ: [21] uses a 2-layer BiGRU while [27] uses a 2-layer bidirectional long short-term memory (Bi-LSTM). Another difference is that [21] used a fixed mouth region of interest (ROI) while [27] used facial landmarks to detect a different mouth ROI for each video.

All of this work has been done at approximately the same frame rate of around 30 fps. To the best of our knowledge there has not been a study on whether a different frame rate could provide a better recognition rate. In our study we intend to fill this gap in knowledge.

2.4.2 Event cameras for gesture recognition

Event cameras have been used for different types of gesture recognition. In [17] event cameras are used for recognising different hand symbols. The representation of the event data was similar to [19], where event frames were accumulated over a fixed number of events. A single channel was used where pixels with no events were assigned the value 0.5 and each positive and negative event contributed 1/200 and -1/200 respectively.

Figure 2.4: The system used in [27]

To the best of our knowledge, event cameras have not been used for lip reading. This study intends to close this gap in knowledge.

3 Method

In this study a dataset of spoken words was recorded using an event camera. An open source lip reading network with small adjustments was used for doing lip reading on the recorded dataset. This is explained in more detail in the following sections.

3.1 Dataset

The dataset was recorded with a total of 4 different speakers, where each speaker spoke all 8 different words of car brands in the dataset. The recording was done using the DAVIS 346 event camera along with audio recording; both event data and normal frames were recorded. The recordings were gathered by letting each speaker consecutively pronounce each word repeatedly, starting and ending with a closed mouth, for a fixed amount of time with at least one second of silence in-between each pronunciation. Audio was recorded separately and was synced by starting each recording with a clapping sound. The audio was used to programmatically split each pronunciation based upon the non-silent parts. This method allowed for easy labeling of the data.
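The report does not name the tool used for the audio-based splitting; as one hedged sketch of the idea, the snippet below uses librosa (our assumption, with a hypothetical file name) to find non-silent intervals, which can then be mapped onto the event and frame timestamps after aligning the clap at the start of each recording. It also includes the filtering of outlier segment lengths described further below.

```python
# Hypothetical sketch of splitting a recording into word segments from the
# non-silent parts of the audio track. librosa is an assumption; any tool that
# returns non-silent intervals would serve the same purpose.
import librosa
import numpy as np

audio, sr = librosa.load("speaker1_session.wav", sr=None)   # hypothetical file name

# Non-silent intervals in samples; top_db controls the silence threshold.
intervals = librosa.effects.split(audio, top_db=40)
segments_s = intervals / sr                                  # convert to seconds

# Keep only segments within ±30 % of the mean length, as described below.
lengths = segments_s[:, 1] - segments_s[:, 0]
mean_len = lengths.mean()
valid = segments_s[np.abs(lengths - mean_len) <= 0.3 * mean_len]
print(f"kept {len(valid)} of {len(segments_s)} segments")
```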

Table 3.1: Words used in dataset

Audi, Bentley, Ferrari, Jaguar, Nissan, Saab, Toyota, Volvo

All non-silent segments that deviated more than ±30 % from the mean length were discarded, in order to remove segments that either contained two words or were too short and only contained noise. The timestamps for each valid segment were collected and the normal camera frames were gathered for each segment.

Table 3.2: Total number of pronunciations of each word in the dataset

Word      Pronunciations
Audi      300
Bentley   257
Ferrari   225
Jaguar    286
Nissan    242
Saab      257
Toyota    266
Volvo     227
Total:    2060

Table 3.3: Total number of spoken words by each of the 4 speakers

Speaker   Pronounced words
1         666
2         557
3         478
4         359
Total:    2060

For each normal frame in a segment, facial landmarks were detected. These landmarks were then used to find the mouth region, using points on the nose and chin for the height and two points on the outer edges of the eyes for the width. All events within this region in each segment were then collected into event frames, using the region from the normal frame that was closest in time before each event.
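The report does not specify which landmark detector was used; as a hedged illustration, the sketch below uses dlib's standard 68-point predictor (an assumption, with a hypothetical frame file) and builds a mouth region from the nose and chin points for the height and the outer eye corners for the width, as described above.

```python
# Hypothetical mouth-ROI extraction with dlib's 68-point facial landmarks.
# Using dlib at all is our assumption, not something stated in the report.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_roi(gray_frame):
    """Return (x0, y0, x1, y1) bounding the mouth region of the first detected face."""
    faces = detector(gray_frame, 1)
    if not faces:
        return None
    pts = predictor(gray_frame, faces[0])
    nose_y = pts.part(33).y    # tip of the nose -> top of the region
    chin_y = pts.part(8).y     # chin -> bottom of the region
    left_x = pts.part(36).x    # outer eye corners -> region width
    right_x = pts.part(45).x
    return left_x, nose_y, right_x, chin_y

frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frame
roi = mouth_roi(frame)
```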

The event frames were created by assigning an initial weight of 0.5 to each pixel. Each positive event then contributed 0.2 and each negative event -0.2; these values were chosen based on testing different values. Each pixel value was clamped between 0 and 1. The event frames were generated by collecting events over different amounts of time, resulting in different frame rates. Each event frame was resized to 50x50 pixels using linear interpolation. Each frame rate was also interpolated to a fixed number of frames; the number of frames used for each frame rate can be seen in table 3.4. The normal frames were recorded at 25 fps and were interpolated to 10 frames.

Table 3.4: The different frame rates and the number of frames generated for each frame rate

Frames per second   Number of frames
25                  10
37.5                15
50                  20
62.5                25
75                  30
87.5                30
100                 30
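A hedged sketch of the frame-generation scheme just described: pixels start at 0.5, positive events add 0.2, negative events subtract 0.2, values are clipped to [0, 1], and the frame is resized to 50x50 with linear interpolation. OpenCV is our assumed tool for the resize, and the clipping is applied once after accumulation for simplicity.

```python
# Sketch of the event-frame generation described above. The event chunk is a
# structured array with fields x, y, p (coordinates assumed to be relative to
# the cropped mouth region); cv2 is assumed for the linear resize.
import cv2
import numpy as np

def make_event_frame(chunk, height, width, out_size=50):
    frame = np.full((height, width), 0.5, dtype=np.float32)          # neutral baseline
    weights = (0.2 * chunk["p"]).astype(np.float32)                  # +0.2 / -0.2 per event
    np.add.at(frame, (chunk["y"], chunk["x"]), weights)              # accumulate events
    np.clip(frame, 0.0, 1.0, out=frame)                              # saturate at 0 and 1
    return cv2.resize(frame, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```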

3.2 Deep Learning Network

The open source deep learning network from [21] was used and can be seen in figure 3.1. It is an end-to-end audiovisual speech recognition system with a visual and an audio stream; only the visual stream was used in our research. The visual stream consists of a 3D convolution followed by a ReLU and a 3D max pooling layer that reduces the spatial size of the 3D feature map from the 3D convolution. This is then fed into a 34-layer ResNet and a 2-layer BiGRU. The output layer is a softmax which assigns a label to each sequence based on the word with the highest probability. Some minor changes to the parameters of the system were made to make it work with our input. We changed the convolution kernel size from 5x7x7 to 3x7x7 (time/width/height) in order for the system to work with the smaller number of frames at the lower frame rates. The number of classes was changed to 8, as our dataset contained 8 classes, and the number of frames was adjusted for each frame rate.
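To make the layout above concrete, the sketch below is a simplified PyTorch illustration of the visual stream, under the stated kernel size and class count. It is not the actual code from [21]: in particular, the 34-layer ResNet is replaced by a small per-frame stand-in trunk.

```python
# Simplified sketch (our own illustration, not the code from [21]) of the visual
# stream: 3D conv front-end, a 2D trunk standing in for the 34-layer ResNet,
# a 2-layer bidirectional GRU back-end and a linear layer over the 8 word classes.
import torch
import torch.nn as nn

class VisualStream(nn.Module):
    def __init__(self, num_classes=8, gru_hidden=256):
        super().__init__()
        # 3D convolution with the 3x7x7 (time/width/height) kernel, followed by
        # ReLU and a 3D max pooling layer that shrinks the spatial dimensions.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Stand-in for the 34-layer ResNet: a small per-frame 2D stack ending in
        # global average pooling (a deliberate simplification).
        self.trunk = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # 2-layer BiGRU back-end over the per-frame feature vectors.
        self.backend = nn.GRU(256, gru_hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * gru_hidden, num_classes)

    def forward(self, x):
        # x: (batch, 1, frames, 50, 50) single-channel event frames
        f = self.frontend(x)                        # (B, 64, T, H, W)
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, -1)         # (B, T, 256)
        out, _ = self.backend(f)                    # (B, T, 2 * gru_hidden)
        return self.classifier(out.mean(dim=1))     # logits; the softmax picks the word

# Example: a batch of two sequences of 15 event frames (37.5 fps, 50x50 pixels).
logits = VisualStream()(torch.randn(2, 1, 15, 50, 50))
print(logits.softmax(dim=-1).shape)  # torch.Size([2, 8])
```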

3.2.1 Training

Figure 3.1: The model used in [21] for audiovisual speech recognition

During training, data augmentation was performed on the sequences, where horizontal flips were made with 50 % probability. The training was done in three steps in the same way as in [27], since training the stream end-to-end directly leads to non-optimal performance. First, a temporal convolutional back-end was used instead of the 2-layer BiGRU and trained for 12 epochs, and the model with the best performance on the validation set was chosen. The temporal convolution was then removed and the BiGRU back-end attached; this back-end was trained for 12 epochs while keeping the weights of the 3D convolution and the ResNet fixed. Finally, the model with the best performance on the validation set was used and the whole stream was trained end-to-end. The 5 best performing models out of 12 on the validation set were selected for evaluation; the reason for choosing 5 was that the top 5 models had roughly the same accuracy on the validation set. The Adam algorithm [14] was used for training with an initial learning rate of 0.0003 and a mini-batch size of 18.
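The key training settings above can be summarised in a short, hedged sketch. It assumes PyTorch and reuses the hypothetical VisualStream module sketched earlier; it is not the training script that was actually used.

```python
# Hedged sketch of the training configuration described above, reusing the
# hypothetical VisualStream module from the earlier sketch (not the real script).
import torch

model = VisualStream(num_classes=8)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0003)  # initial learning rate
criterion = torch.nn.CrossEntropyLoss()                      # softmax + log-loss over the 8 words
BATCH_SIZE = 18

def augment(clip):
    """Flip a whole (1, T, H, W) frame sequence horizontally with 50 % probability."""
    if torch.rand(1).item() < 0.5:
        clip = torch.flip(clip, dims=[-1])   # flip along the width axis
    return clip

def freeze_frontend(model):
    """Stage two: train only the BiGRU back-end, keeping the 3D conv and trunk fixed."""
    for module in (model.frontend, model.trunk):
        for p in module.parameters():
            p.requires_grad = False
```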

3.2.2 Evaluation

After training the models at the different frame rates, the 5 models with the highest accuracy on the validation set were used for evaluation. Each model was evaluated on the accuracy for seen speakers and then for the unseen speaker. Further evaluation was done by looking into the performance for each speaker and each word and observing the average accuracy at each frame rate. When evaluating at word level, the evaluation was done separately for seen speakers and the unseen speaker. Evaluation of the classification accuracy when only using normal frames was also performed. This was done in the same way as for the event frames, where the 5 best models after the last step of training were used for evaluation. This was then used for comparing the accuracy of lipreading between normal cameras and event cameras.

4 Results and Analysis

We investigated the accuracy at different frame rates with event frames and compared it with normal frames. A deeper investigation was also done at both speaker and word level.

4.1 Accuracy and Frame Rate

Figure 4.1: Accuracy against frame rate for seen speakers. Each colored circle represents the average accuracy of a model. The black triangle represents the average accuracy at each frame rate and the bars represent one standard deviation from the average.

As seen in table 4.1, the models trained on data at 37.5 frames per second achieve the best average accuracy. There is also a trend of decreasing accuracy for higher frame rates in figure 4.1, with 100 frames per second achieving the worst accuracy. Note that the accuracy in the figure ranges from 80 % to 100 %. Normal frames perform slightly better than event frames at the same frame rate, with a difference of 0.6 percentage points, which equals less than a third of the standard deviation of both models.

Table 4.1: Average accuracy and standard deviation for seen speakers

Frame rate      Average accuracy (%)   Standard deviation (percentage points)
Normal frames   94.9                   2.0
25              94.3                   2.0
37.5            96.9                   1.3
50              96.0                   0.6
62.5            94.4                   1.8
75              94.4                   1.8
87.5            92.9                   3.2
100             88.2                   2.3

When comparing the average accuracy of the models trained on normal frames with the average accuracy of the best event frame models, the event frames have the higher average accuracy. The difference is 2.0 percentage points, which is exactly the standard deviation of the normal frames and around two thirds of the standard deviation of the event frames.

The standard deviation ranges from 0.6 to 3.2 percentage points, where the lowest standard deviation was achieved at 50 frames per second and the highest at 87.5 frames per second. A tendency of increasing standard deviation with frame rate can be seen.

Table 4.2: Average accuracy and standard deviation for unseen speakers

Frame rate      Average accuracy (%)   Standard deviation (percentage points)
Normal frames   60.0                   6.2
25              45.8                   0.9
37.5            58.5                   2.7
50              37.6                   4.1
62.5            32.4                   4.0
75              43.6                   5.8
87.5            33.4                   4.4
100             37.3                   6.5

Looking at table 4.2, the models trained on normal frames achieve the best average accuracy. The trend of accuracy against frame rate is roughly the same as for the seen speakers, as can be seen in figure 4.2: the average accuracy first increases from 25 frames per second to 37.5 frames per second and then drops at 50 frames per second. After this the trend is not as clear as for the seen speakers; the accuracy stays within a range from 32.4 % to 43.6 % with no clear trend of increase or decrease. Notice that the accuracy in figure 4.2 ranges from 20 % to 80 %.

Figure 4.2: Accuracy against frame rate for the unseen speaker. Each colored circle represents the average accuracy of a model. The black triangle represents the average accuracy at each frame rate and the bars represent one standard deviation from the average.

When comparing the average accuracy for the seen speakers in table 4.1 with the unseen speaker in table 4.2, the accuracy is worse for the unseen speaker at all frame rates and for the normal frames.

The average accuracy of the models trained on normal frames and the models trained on event frames at the same frame rate differ by 14.2 percentage points. This difference is bigger than the standard deviation of both the models trained on normal frames and the models trained on event frames. When comparing the normal frames with the event frames with the best average accuracy, the normal frames achieve the better average accuracy. The difference is only 1.5 percentage points, which is less than one fourth of the standard deviation of the models trained on normal frames and just above half the standard deviation of the models trained on event frames.

The standard deviation ranges from 0.9 to 6.5 percentage points, with the lowest standard deviation achieved at 25 frames per second and the highest at 100 frames per second. There is a clear trend of increasing standard deviation with higher frame rates for the models trained on event frames.

4.2 Speaker Level Accuracy

Figure 4.3: Average accuracy against frame rate for the 4 speakers. Each color represents the average accuracy for the models of a frame rate.

Table 4.3 and figure 4.3 show that speaker 3 has the worst average accuracy of the seen speakers for most frame rates. It is also the speaker with the fewest pronounced words of the seen speakers. All three seen speakers have their highest average accuracy at different frame rates, ranging from 37.5 frames per second to 62.5 frames per second: the trained models achieved the best average accuracy for speaker 3 at 37.5 frames per second, for speaker 2 at 50 frames per second and for speaker 1 at 62.5 frames per second. The models trained on 100 frames per second achieved the worst accuracy for all three seen speakers.

Table 4.3: Average accuracy for each speaker at each frame rate

Frame rate      Speaker 1 (%)   Speaker 2 (%)   Speaker 3 (%)   Speaker 4 (%)
Normal frames   92.3            95.8            97.2            60.0
25              98.5            94.5            86.2            45.9
37.5            94.4            98.0            98.8            58.5
50              95.7            99.3            92.5            37.6
62.5            99.4            96.5            84.8            32.4
75              91.8            92.4            95.5            43.6
87.5            94.2            94.7            88.2            33.4
100             91.5            91.0            80.4            37.3

Of the seen speakers, the models trained on normal frames achieve their highest average accuracy on speaker 3. For most frame rates this was the opposite for the models trained on event frames, where speaker 3 was the speaker with the worst accuracy.

It is clear from table 4.3 that all models have the worst average accuracy for speaker 4, the unseen speaker.

4.3 Word Level Accuracy

Table 4.4: Accuracy on the 8 different words at different frame rates for seen speakers

Frame rate      Audi (%)  Bentley (%)  Ferrari (%)  Jaguar (%)  Nissan (%)  Saab (%)  Toyota (%)  Volvo (%)
Normal frames   97.6      97.1         100.0        96.8        77.1        94.5      97.4        98.9
25              96.0      95.2         100.0        92.8        100.0       95.5      91.3        87.7
37.5            100.0     92.4         98.9         92.8        99.0        99.1      100.0       92.2
50              96.0      93.3         95.8         98.4        98.1        98.2      95.7        91.1
62.5            94.4      93.3         98.9         97.6        100.0       91.8      82.6        97.8
75              90.4      100.0        85.3         91.2        81.0        99.1      99.1        98.9
87.5            96.8      95.2         81.1         99.2        91.4        80.0      98.3        98.9
100             97.6      50.5         100.0        100.0       94.3        82.7      92.2        84.4


Figure 4.4: Average accuracy against frame rate for the 8 different words for the seen speakers. Each color represents the average accuracy for the models of a frame rate.

Looking at figure 4.4, one data point stands out: the average accuracy of the word Bentley for the models trained on 100 frames per second event frames. Looking in more detail at the wrong guesses of these models, we see that they in the majority of cases guess Ferrari or Jaguar, although a few other guesses were made. Both Ferrari and Jaguar have a 100 % average accuracy for the models trained on the 100 frames per second event frames.

As seen in table 4.4, Volvo is the word with the lowest accuracy for the models trained on 25 frames per second event frames. This accuracy increases with higher frame rates, except for the highest frame rate where it drops. The average accuracy of the word Saab drops at the two highest frame rates. The other words do not seem to show any clear trend with the frame rate of the event frames.

For the normal frames, Nissan is the only word with clearly lower accuracy, at 77.1 % while the others achieve an accuracy of 94.5 % or better. This is 17.4 percentage points worse than the word with the second worst accuracy. For the event frames, the average accuracy on the word Nissan was generally good at all frame rates, with the exception of the models trained on 75 frames per second.

Figure 4.5: Average accuracy against frame rate for the 8 different words for the unseen speaker. Each color represents the average accuracy for the models of a frame rate.

Table 4.5: Accuracy on the 8 different words at different frame rates for unseen speakers

Frame rate      Audi (%)  Bentley (%)  Ferrari (%)  Jaguar (%)  Nissan (%)  Saab (%)  Toyota (%)  Volvo (%)
Normal frames   98.2      23.9         0.0          34.7        99.5        61.1      74.2        90.6
25              48.2      2.9          81.4         32.0        91.3        89.7      40.0        1.8
37.5            93.2      44.2         90.0         44.4        95.9        73.0      9.8         21.3
50              74.3      7.1          66.7         0.9         84.6        2.7       0.0         59.1
62.5            30.4      8.3          99.0         54.7        72.8        1.1       0.0         0.4
75              63.2      6.7          54.8         74.7        31.3        34.6      27.1        51.1
87.5            35.0      1.3          78.1         8.9         75.4        4.9       21.3        46.8
100             59.6      0.0          98.6         21.3        80.5        49.2      0.0         0.0

As seen in figure 4.5, the average accuracy of each word is very scattered between the different frame rates for the unseen speaker. The average accuracy of the word Nissan is quite good for the three lowest frame rates and for the normal frames, which can be seen in table 4.5. It decreases at higher frame rates but is generally quite good compared to the other words. Another word with quite good average accuracy for event frames is Ferrari, with an accuracy ranging from 54.8 % to 99.0 %; for the normal frames, however, its accuracy is 0 %, making it the word with the lowest average accuracy for the normal frames.

The words Bentley and Toyota both have low average accuracy for all event frames. Bentley also has low average accuracy for the normal frames, with 23.9 % accuracy, but Toyota has clearly better average accuracy for normal frames, with an accuracy of 74.2 %.

5 Discussion

As mentioned in the background, choosing a representation of the event data is important. In this thesis event frames were chosen. One reason is that an already existing state of the art network could be used, and that the results would be easier to compare with previous research using normal cameras. Another reason for using event frames was the simplicity of varying the temporal resolution to investigate how it impacts the accuracy. The frames were generated over time rather than over a fixed number of events, as this allowed for a more explicit evaluation of the temporal resolution's effect.

5.1 Accuracy and Frame Rate

The results show that the accuracy of predicted words does increase initially but does not keep increasing as the frame rate is increased further. This is the case for both the seen speakers and the unseen speaker, where results peak at 37.5 frames per second. The results are therefore not in line with our hypothesis of increasing performance as the temporal resolution is increased.

One possible explanation for this is described in [18]. The event frames with higher frame rate are generated over a shorter time interval and thus contain fewer events, meaning less information in each individual frame than for those with lower frame rate. This leads to less discriminative images when the motion is not high enough, which makes them harder to learn from. On the other hand, when the frame rate is too low compared to the motion, the generated event frames suffer from motion blur and the contours become washed out. This explains why the performance is worse for 25 compared to 37.5 frames per second.

The fact that accuracy peaks at 37.5 frames per second does not necessarily mean that the higher temporal resolution performs worse in general but that the gain in temporal resolution does not outweigh the loss in performance due to too few events in each frame when using this method. In order to resolve this issue another representation of the event data would likely be needed.

Another reason for the peak at 37.5 frames per second could be the fixed kernel size used in the network; especially the temporal dimension of the kernel could play an important role when evaluating the results. It is possible that the kernel size used is optimal for 37.5 fps in this case, but further research would be required to be certain.

When looking at the standard deviation of the accuracy for seen speakers, we see that it lies between around 1 and 3 percentage points, which can be considered quite small. That is why it is reasonable for us to use the average accuracy between the models for each frame rate when looking into the performance at speaker and word level. For the unseen speaker the standard deviation is a bit higher, ranging from around 1 to 6 percentage points. This is still reasonably low, so it should be appropriate to use the average accuracy between the models when looking more deeply into the accuracy at speaker and word level for the unseen speaker as well.

Another observation that can be made is a clear trend of increasing standard deviation of the average accuracy with frame rate for the unseen speaker. This trend also seems to exist for the seen speakers, although it is not as clear. This might mean that the models need to be trained for more epochs at higher frame rates in order to get closer to optimal performance.

5.2 Event Camera vs Normal Camera

The accuracy of the event frames is almost as good as that of the normal frames at the same frame rate, and slightly better at the best performing frame rate. This shows that event cameras are comparable with normal cameras in accuracy. This might be due to the fact that event cameras capture the essential information for lipreading by detecting the motions needed for pronunciation of the words. Considering that we used a network designed for normal frames, it is possible that event cameras could outperform normal cameras with another network, for example one better designed for event data.

5.3 Speaker Level Accuracy

For the seen speakers, one can observe in table 4.3 that the best average accuracy for the different speakers is achieved at different frame rates for the event frames. An explanation for this could be a difference in how the words are uttered: speaker 3, who achieved the best average accuracy at the lowest frame rate of the seen speakers, probably uttered the words with slower lip movements, which led to fewer events generated per second. As mentioned earlier, this leads to less discriminative images that are harder to learn from. Speaker 1 achieved the best average accuracy at the highest frame rate of the seen speakers, which could mean that speaker 1 spoke faster and therefore generated more events per second. This could lead to motion blur at the lower frame rates, making a higher frame rate more advantageous for that particular speaker. Whether or not this is the case is uncertain, and a more in-depth study would be needed to confirm it.

While the event frame models generally achieved the worst accuracy for speaker 3, the normal frames achieved the best performance for speaker 3. This, in combination with the fact that the best accuracy for speaker 3 was achieved at the lowest frame rate of the seen speakers, strengthens the reasoning that speaker 3 generated fewer events: lower frame rates mean more motion blur that could decrease performance, and since the normal frames are recorded at 25 fps, slower lip movements could mean better accuracy. This suggests that collecting events over time might not be the most appropriate method of generating the event frames when it comes to getting a less speaker dependent model; collecting a fixed number of events into each frame might be better. This could also increase the performance on unseen speakers, since it would not depend as much on the fact that different speakers generate different amounts of events for the different words. While a solution like this might decrease speaker dependence, it might achieve lower accuracy; this would need to be investigated further.

The accuracy of the event frames is slightly worse than that of the normal frames on the unseen speaker for the best performing frame rate. This indicates that the event frames differ more between speakers than normal frames. That would likely mean that, for the model to generalize as well with event frames, it would need to be trained with more speakers compared to normal frames, which are more similar between speakers.

5.4 Word Level Accuracy

There do not appear to be any words that are clearly more difficult than others for seen speakers. This can be seen in figure 4.4, where all data points show that the accuracy between words is quite similar, except for one outlier. The reason for this is most likely that none of the words are similar in pronunciation. Since our vocabulary is very limited, this result is not very unexpected; for a larger vocabulary where words are more similar, this would probably not be the case. In [27] it was noted that words like spend and spent were often confused by their network, which is very similar to the one used in this study, so this would probably also be the case for the network used here.

The outlier point is the average accuracy for Bentley at 100 frames per second. Looking at the false guesses that were made, they were either Ferrari or Jaguar, and the average accuracy of these models was 100 % for both of those words. This suggests that the models were biased towards these two words.

Another thing that can be noted when looking at word level for the seen speakers is the increasing accuracy for Volvo at higher frame rates. This could suggest that pronouncing Volvo requires faster lip movements compared to most of the other words. Another word for which we can see a trend of accuracy against frame rate is Saab: at the two highest frame rates the accuracy drops, which could suggest that the lip movements are slower in comparison to the other words. This would mean that fewer events were generated during pronunciation, as mentioned before for the different speakers.

6 Conclusion

This thesis explored the significance of the temporal resolution of the visual data when classifying lip movements of spoken words. This was done by representing event data as multiple frames, using the frames as input to a neural network and varying the number of frames used. We have attempted to find a trend between the frame rate used and the accuracy of the predicted words. A trend of initially increasing accuracy, which peaks at a maximum and later decreases as the frame rate is increased further, can be observed. We were therefore able to conclude that, when using a frame based representation of event data, increasing the temporal resolution does not necessarily increase classification accuracy. It is however difficult to be certain about this conclusion, since there are many other parameters that could affect the accuracy, such as an increasing temporal resolution requiring a larger dataset, and the kernel size used.

Another aspect explored in this thesis was how an event-based camera compares to a normal frame based camera. Our research was able to conclude that event-based cameras are able to capture enough visual information to offer at least the same classification accuracy as a normal frame based camera. As little to no consideration was given to optimising the event data representation or the network design, it is quite possible that the accuracy could be higher when using event based data, which could be an interesting area for future research.

References

[1] Assael, Yannis M. et al. “LipNet: Sentence-level Lipreading”. In: CoRR abs/1611.01599 (2016). arXiv:1611.01599. URL: http://arxiv.org/abs/1611.01599.

[2] Bardow, P., Davison, A. J., and Leutenegger, S. “Simultaneous Optical Flow and Intensity Estimation from an Event Camera”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 884–892.

[3] Bishop, Christopher. Pattern Recognition and Machine Learning (Information Science and Statistics). Oct. 2007. ISBN: 0387310738.

[4] Brandli, C. et al. “A 240 × 180 130 dB 3 µs Latency Global Shutter Spatiotemporal Vision Sensor”. In: IEEE Journal of Solid-State Circuits 49.10 (2014), pp. 2333–2341.

[5] “Real-time lip tracking for audio-visual speech recognition applications”. In: Computer Vision — ECCV ’96. Ed. by Bernard Buxton and Roberto Cipolla. Springer Berlin Heidelberg, 1996, pp. 376–387. ISBN: 978-3-540-49950-3.

[6] Chung, J. S. and Zisserman, A. “Lip Reading in the Wild”. In: Asian Conference on Computer Vision. 2016.

[7] Chung, Joon Son et al. “Lip Reading Sentences in the Wild”. In: CoRR abs/1611.05358 (2016). arXiv:1611.05358. URL: http://arxiv.org/abs/1611.05358.

[8] Easton, Randolph and Basala, Marylu. “Perceptual dominance during lipreading”. In: Perception & Psychophysics 32 (Nov. 1982), pp. 562–570. DOI: 10.3758/BF03204211.

[9] Gehrig, Daniel et al. “End-to-End Learning of Representations for Asynchronous Event-Based Data”. In: CoRR abs/1904.08245 (2019). arXiv:1904.08245. URL: http://arxiv.org/abs/1904.08245.


[10] Gergen, Sebastian et al. “Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR”. In: Interspeech 2016. Sept. 2016, pp. 2135–2139. DOI: 10.21437/Interspeech.2016-166.

[11] Gers, F. A., Schmidhuber, J., and Cummins, F. “Learning to forget: continual prediction with LSTM”. In: 1999 Ninth International Conference on Artificial Neural Networks ICANN 99. (Conf. Publ. No. 470). Vol. 2. 1999, 850–855 vol.2.

[12] He, Kaiming et al. “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385 (2015). arXiv:1512.03385. URL: http://arxiv.org/abs/1512.03385.

[13] Hochreiter, Sepp and Schmidhuber, Jürgen. “Long Short-Term Memory”. In: Neural Computation 9.8 (1997), pp. 1735–1780.

[14] Kingma, Diederik P. and Ba, Jimmy. “Adam: A Method for Stochastic Optimization”. In: CoRR abs/1412.6980 (2015).

[15] Lagorce, Xavier et al. “HOTS: A Hierarchy of Event-Based Time-Surfaces for Pattern Recognition”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (July 2016). DOI: 10.1109/TPAMI.2016.2574707.

[16] Lichtsteiner, P., Posch, C., and Delbruck, T. “A 128 × 128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor”. In: IEEE Journal of Solid-State Circuits 43.2 (2008), pp. 566–576.

[17] Lungu, I. A., Liu, S., and Delbruck, T. “Incremental Learning of Hand Symbols Using Event-Based Cameras”. In: IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9.4 (2019), pp. 690–696.

[18] Maqueda, Ana I. et al. “Event-based Vision meets Deep Learning on Steering Prediction for Self-driving Cars”. In: CoRR abs/1804.01310 (2018). arXiv:1804.01310. URL: http://arxiv.org/abs/1804.01310.

[19] Moeys, D. P. et al. “Steering a predator robot using a mixed frame/event-driven convolutional neural network”. In: 2016 Second International Conference on Event-based Control, Communication, and Signal Processing (EBCCSP). 2016, pp. 1–8.


[20] Ni, Z. et al. “Asynchronous event-based high speed vision for microparticle tracking”. In: Journal of Microscopy 245.3 (2012), pp. 236–244. DOI: 10.1111/j.1365-2818.2011.03565.x. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1365-2818.2011.03565.x.

[21] Petridis, Stavros et al. “End-to-end audiovisual speech recognition”. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018, pp. 6548–6552.

[22] Sak, H., Senior, Andrew, and Beaufays, F. “Long short-term memory recurrent neural network architectures for large scale acoustic modeling”. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Jan. 2014), pp. 338–342.

[23] Schmidhuber, Jürgen. “Deep learning in neural networks: An overview”. In: Neural Networks 61 (2015), pp. 85–117.

[24] Schuster, Mike and Paliwal, Kuldip. “Bidirectional recurrent neural networks”. In: Signal Processing, IEEE Transactions on 45 (Dec. 1997), pp. 2673–2681. DOI:10.1109/78.650093.

[25] Sekikawa, Yusuke, Hara, Kosuke, and Saito, Hideo. “EventNet: Asynchronous recursive event processing”. In: CoRR abs/1812.07045 (2018). arXiv:1812.07045. URL: http://arxiv.org/abs/1812.07045.

[26] Silsbee, P. L. and Bovik, A. C. “Computer lipreading for improved accuracy in automatic speech recognition”. In: IEEE Transactions on Speech and Audio Processing 4.5 (1996), pp. 337–351.

[27] Stafylakis, Themos and Tzimiropoulos, Georgios. “Combining Residual Networks with LSTMs for Lipreading”. In: CoRR abs/1703.04105 (2017). arXiv:1703.04105. URL: http://arxiv.org/abs/1703.04105.

[28] Valueva, M.V. et al. “Application of the residue number system to reduce hardware costs of the convolutional neural network implementation”. In: Mathematics and Computers in Simulation 177 (2020), pp. 232–243. ISSN: 0378-4754. DOI: 10.1016/j.matcom.2020.04.031. URL: http://www.sciencedirect.com/science/article/pii/S0378475420301580.
