Faculty of Technology and Society Computer Engineering
Deep Learning Models for Human Activity Recognition
Deep Learning modeller för mänsklig aktivitets igenkänning
George Albert Florea
Exam: Bachelor of Science in Engineering in Computer Science 180hp
Area: Computer Science
Examiner: Thomas Pederson
Supervisor: Radu-Casian Mihailescu
The Augmented Multi-party Interaction (AMI) Meeting Corpus database is used to investigate group activity recognition in an office environment. The database provides researchers with remote-controlled scenario meetings and natural meetings, each held in a four-person office room. To perform group activity recognition, video frames and two-dimensional audio spectrograms were extracted from the AMI database. The video frames were RGB images and the audio spectrograms had one color channel. The video frames were produced in batches so that temporal features could be evaluated together with the audio spectrograms. It has been shown that including temporal features, both during model training and when predicting an activity, increases the validation accuracy compared to models that only use spatial features. Deep learning architectures have been implemented to recognize different human activities in the AMI office environment using the data extracted from the AMI database.
The neural network models were built using the Keras API together with the TensorFlow library. The architectures investigated in this project were the Residual Neural Network (ResNet), Visual Geometry Group 16 (VGG16), Inception V3 and a recurrent convolutional network (RCNN). ImageNet weights were used to initialize the base models. These weights were provided by the Keras API and were optimized for each base model. The base models use the ImageNet weights when extracting features from the input data.
The feature extraction using ImageNet weights or random weights together with the base models showed promising results. Both the Deep Learning using dense layers and the LSTM spatio-temporal sequence prediction were implemented successfully.
The AMI Meeting Corpus (AMI) database is used to investigate group activity recognition. The database provides researchers with remote-controlled meetings and natural meetings in an office environment, each held in a four-person office room. To achieve group activity recognition, image sequences from videos and two-dimensional audio spectrograms from the AMI database were used. The image sequences are RGB images and the audio spectrograms have one color channel. The image sequences were produced in batches so that temporal features could be evaluated together with the audio spectrograms. It has been shown that including temporal features, both during model training and when predicting an activity, increases the validation accuracy compared to models that only use spatial features. Deep learning architectures have been implemented to recognize different human activities in the AMI office environment using data extracted from the AMI database.
The neural network models were built using the Keras API together with the TensorFlow library. There are different types of neural network architectures. The architectures investigated in this project were Residual Neural Network, Visual Geometry Group 16, Inception V3 and RCNN (LSTM). ImageNet weights were used to initialize the weights of the neural network base models. The ImageNet weights are provided by the Keras API and are optimized for each base model. The base models use ImageNet weights when extracting features from the input data.
The feature extraction using ImageNet weights or random weights together with the base models showed promising results. Both the deep learning approach using dense layers and the LSTM spatio-temporal sequence prediction were implemented successfully.
The realization of this thesis would not have been possible without the support of our supervisor Radu-Casian Mihailescu at the Internet of Things and People Research Centre at Malmö University. His support with ideas and possible solutions to the problems we had to face while producing the results and writing this thesis has been very important. His enthusiasm and guidance helped us during the whole process and his insight into Neural Networks was crucial.
1 Introduction
1.1 Background
1.2 Activity recognition
1.3 The AMI Corpus dataset
1.3.1 Audio and video signals
1.4 Problem statement
1.4.1 Research questions
1.5 Limitations
2 Theoretical background
2.1 Machine Learning
2.2 Artificial Neural Network (ANN)
2.2.1 Multilayer Perceptron
2.3 Deep Learning
2.3.1 Recurrent Neural Network
2.3.2 Long-Short Term Memory
2.3.3 Convolutional Neural Network
2.4 Deep Learning Architectures
2.4.1 VGG16
2.4.2 Inception v3
2.4.3 Residual Network
3 Related work
3.1 An investigation of transfer learning for deep architectures in group activity recognition
3.1.1 Background
3.1.2 Method
3.1.3 Results
3.1.4 Comments
3.2 Towards Robust Human Activity Recognition from RGB Video Stream with Limited Labeled Data
3.2.1 Comments
3.3 Deep Residual Learning for Image Recognition
3.3.1 Comments
3.4 Robust Audio Sensing with Multi-Sound Classification
3.4.1 Comments
4 Method
4.1 Construct a Conceptual Framework
4.2 Develop a System Architecture
4.2.1 Google Colaboratory
4.3 Analyze and Design the System
4.3.1 Cifar-10 dataset
4.3.2 Cats Vs. Dogs dataset
4.4 Feature extraction by using transfer learning
4.4.1 Data augmentation
4.4.2 Parameters during model training
4.5.1 Frame extraction
4.5.2 Video dataset
4.5.3 Audio dataset
4.5.4 Video and Audio combined
4.6 Observe and Evaluate the System
5 Results
5.1 Cats vs Dogs dataset
5.1.1 Implementing the system architecture
5.2 Cifar-10 dataset
5.2.1 Benchmarking the system architecture on Cifar-10 dataset
5.3 AMI Corpus Video and Audio
5.3.1 Video dataset
5.3.2 Audio dataset
5.3.3 LSTM: Audio and video sequence
5.4 Looking deeper into ResNet
5.4.1 Audio: Mixed audio ES and IS meetings
5.4.2 Audio: Mixed ES and IS meetings with random weight initialization
5.4.3 Audio: Edinburgh Scenario meetings LSTM audio sequence
5.4.4 Audio: Edinburgh Scenario meetings LSTM audio sequence with random weight initialization
6 Discussion
6.1 Shuffling the data
6.2 Audio data compared to video data
6.3 ResNet compared to VGG16 and InceptionV3
6.4 ResNets with random weights
6.5 LSTM results
6.6 Parameter optimization
6.7 Model architecture
7 Conclusion and future work
7.1 Contribution
AMI Augmented Multi-party Interaction.
ANN Artificial Neural Network, a type of machine learning.
ResNet Residual Network, a type of ANN architecture.
CNN Convolutional Neural Network, a type of ANN architecture used for image processing and classification.
VGG Visual Geometry Group, a type of ANN architecture based on CNN.
MLP Multilayer Perceptron, a type of ANN architecture.
DL Deep Learning, a type of ANN architecture.
LSTM Long Short-Term Memory.
RCNN Recurrent Convolutional Neural Network.
MFCC Mel-frequency cepstral coefficient.
ES Edinburgh Scenario.
IS Idiap Scenario.
1 Introduction
In this chapter, we introduce the concept of using Deep Learning models for the recognition of human activities. The chapter also presents the research questions and aim of the thesis.
1.1 Background
Neural networks have been applied successfully to classification problems and are capable of solving non-linear problems.
Deep learning, a subfield of neural networks, is a technique that emulates the information processing of the human brain and has contributed to a breakthrough in object recognition in image data. Since that breakthrough, Deep Learning has been adopted as an approach to object recognition in image data. One of the next big challenges in computer vision for video sequence processing is to allow computers to recognize not only the objects in a video but also the human activities occurring in it, both during playback and in a live stream. Human activity recognition covers a single human activity, an interactive activity between a human and an object, or a group activity involving two or more humans.
1.2 Activity recognition
Group activities are more difficult to classify than objects because of the diverse ways in which group activities can be carried out.
Activity recognition can be used for surveillance, human-machine interaction, multimedia retrieval, determining human emotional states, and smart environments at home in the kitchen or at the office. It can also be used in smart homes based on IoT solutions. Recognizing people's activities from different views is a difficult task because research on activity recognition is usually view dependent, meaning that it is not invariant to the view angle: a model built for one view angle only works for that view angle. Implementing a surveillance system that uses computer vision to recognize human activity is also important, since it would free up the human resources otherwise needed to constantly monitor a video feed for certain human activities.
Recognizing group activities in an office environment has shown higher validation accuracy when spatio-temporal features are incorporated than when only spatial features are used. By considering a short time sequence, the group activity becomes easier for the Deep Learning model to recognize. For example, the temporal domain makes it possible to recognize whether a person is sitting down in a chair or getting up from it.
1.3 The AMI Corpus dataset
The AMI Meeting Corpus was created by a European-funded project that aims to improve the effectiveness of meetings. The corpus consists of 100 hours of meeting recordings. The meetings are annotated with, for example, topics (presentation, discussion and closing), decisions, and intense discussions. The data can be used for different purposes, for example in linguistics and social psychology, but in this thesis the meetings are used specifically for video and audio processing in order to extract features.
The AMI Corpus dataset consists of synchronized video and audio sequences that are 10-60 minutes long. These sequences take place in an office environment. Most meetings are remote-controlled scenario meetings in which the participants play roles, while others are real meetings. Both real and scenario meetings were conducted in a similar manner, so no distinction was made between remote-controlled scenario meetings and natural meetings. The participants are not always the same people, but there are always four people present, each given one of four roles: the project manager (PM), who runs the meetings, the marketing expert (ME), the user interface designer (UI) and the industrial designer (ID). The different characteristics of each role result in a variety of behaviors.
The participants had no prior professional training or experience in their roles, so their behavior was expected to differ from that of expert designers. The decision to use inexperienced participants was based on economic and logistical difficulties. Participants would be affected by past experience, so this was taken into consideration in order to produce replicable behavior. Randomly assigning the roles left participants unhappy with roles that did not fit them, and the teams performed poorly. Instead, the participants were asked who wanted to do what. They were given training at the beginning of the task and were each assigned a personal coach, who gave hints by e-mail on how to do their job. The disadvantages of role-playing were taken into consideration. For example, there is no guarantee that the participants will care enough for the data to be comparable to natural interactions. However, no natural meeting data was used because there were no annotations available.
The AMI Corpus group had past experience with similar team tasks, which suggested that the approach described for the AMI Corpus dataset would result in behavior that generalizes well to real groups.
1.3.1 Audio and video signals
The recording of the audio and video signals differs depending on the room. The rooms are the Edinburgh Room, the Idiap Room, and the TNO Room. Details about the video and audio setup and recording in these rooms can be found on the AMI Corpus website. The details that are important for this thesis are presented here. The videos used in this thesis have a camera angle that shows the whole room. The camera angles used are the overhead view (seen from directly above) and the corner view (camera positioned at the top of a corner of the room), which can be seen in figure 1.
Figure 1: Images showing different camera angles: (a) corner camera angle, (b) overhead camera angle
Audio from far-field microphones and video from room-view cameras were used. The meetings were recorded in English and include mostly non-native speakers. The audio had a sample rate of 16 kHz and was saved as WAV files. At each time frame, an omnidirectional microphone was used to sample the sound with the most energy. An automatic energy threshold derived from a simple energy-based technique was applied to classify each frame as speech or silence. The speech and silence segments were smoothed with a low-pass filter.
The scenario meetings range from 700 seconds to 3400 seconds. The video signals from Edinburgh and Idiap were stored on disk using the DivX AVI codec 5.2.1. The encoding bit rate was 23 000 Kbps, with a maximum interval of 25 frames between two consecutive MPEG keyframes in order to reduce redundant information. The image resolution of the videos (720x576) is high enough for person localization and facial feature analysis.
Special hardware is used to provide synchronization signals, and the recordings use a range of signals synchronized to a common timeline. The audio and video signals are therefore synchronized, which makes it possible to combine them and thereby to answer the research questions.
1.4 Problem statement
It is difficult to use computer vision to recognize human activities in everyday environments. The difficulties include varying background lighting, bad camera angles and lack of information. In this project, we look specifically at the office environment and try to classify a few human activities based on video data and/or audio data. Audio data, alone or combined with video data, might provide another perspective that could not be achieved using video data alone.
1.4.1 Research questions
... environment including four humans. ResNet was chosen because it is state-of-the-art in image classification and has been used to win contests.
• RQ 1: What is the relationship between the complexity of the ResNet architecture and the validation accuracy for activity recognition in office environments?
– RQ 1.1: With what accuracy can ResNet classify the activities in the AMI meeting office environment?
– RQ 1.2: How do the validation accuracy and complexity compare to the known literature and to already implemented models, namely the VGG16 and Inception V3 models?
– RQ 1.3: Among the models that are tested in this project, which one has the highest performance regarding validation accuracy and what might be the reason?
1.5 Limitations
This thesis is limited to the deep learning models ResNet, VGG16 and Inception V3. It is also limited to one specific set of video recordings, namely the AMI Meeting Corpus ES and IS video recordings.
2 Theoretical background
This chapter explains the following concepts: Machine Learning, Artificial Neural Networks (ANN), Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), Deep Learning (DL), Residual Network (ResNet) and Visual Geometry Group (VGG).
2.1 Machine Learning
Machine learning is a field of study that aims to give computers the ability to learn without being explicitly programmed. Machine learning algorithms build models from sample data, known as "training data", which are used to make predictions or decisions without being explicitly programmed to perform the task.
In machine learning, there are different types of tasks, namely supervised and unsupervised learning. The majority of practical machine learning uses supervised learning. Both types have an input variable (X), but only supervised learning has a corresponding output variable (Y).
Supervised learning uses an algorithm to learn the mapping function from the input to the output:
$$Y = f(X) \tag{1}$$
The goal of supervised learning is to be able to approximate the mapping function so well that whenever you have new input data (X) the model can predict the output (Y) for that data.
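As a minimal sketch of approximating the mapping in equation (1), the example below fits a straight line to a handful of labeled points and then predicts the output for an unseen input. The least-squares line fit, the function names and the toy data are assumptions chosen for illustration; they are not the models used in this thesis.

```python
def fit_line(xs, ys):
    """Least-squares estimate of slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def predict(model, x):
    """Apply the learned approximation of f to a new input."""
    a, b = model
    return a * x + b


# Toy training data (X, Y) generated by the hidden mapping y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

model = fit_line(train_x, train_y)
new_prediction = predict(model, 10.0)  # input that was not in the training data
```

The learned parameters recover the hidden mapping from the labeled pairs alone, which is the sense in which the model "approximates the mapping function" for new data.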
Since unsupervised learning does not have any output variable (Y), the goal instead becomes to model the underlying structure or distribution of the data in order to learn more about the data.
2.2 Artificial Neural Network (ANN)
ANNs are a type of machine learning architecture inspired by how biological neurons operate; they try to mimic how the human brain works. An ANN by itself is not an algorithm, but rather a framework used by numerous different machine learning algorithms. The ANN architecture consists of layers, neurons, and weights.
2.2.1 Multilayer Perceptron
The Multilayer Perceptron (MLP) is a type of ANN architecture that has one or more hidden layers. There are three layer types: the input layer, hidden layers, and the output layer. Hidden layers lie between the input layer and the output layer. Every layer has one or more neurons. The input layer and each hidden layer also have one bias neuron each. The bias neuron has a constant input value.
The information is propagated forward through the ANN by using weights and transfer functions. The output from one neuron in a layer $l$ becomes the input for another neuron in the next layer $l+1$; these networks are called feed-forward NNs. The weight $\omega_{ij}^{l}$ is a connection between a neuron $n_i^{l}$ and a neuron $n_j^{l+1}$. The output of $n_i^{l}$ is multiplied by the weight $\omega_{ij}^{l}$ and becomes an input for the neuron $n_j^{l+1}$, which can be seen in figure 2. All inputs to the neuron $n_j^{l+1}$ are summed:
$$\nu_j^{l+1} = \sum_i n_i^{l}\,\omega_{ij}^{l}, \tag{2}$$
where $n_i^{l}$ is the input neuron, $\omega_{ij}^{l}$ is the neuron weight and $\nu_j^{l+1}$ is the summed input. The output is then
$$o_j^{l+1} = \varphi(\nu_j^{l+1}), \tag{3}$$
that is, the output $o_j^{l+1}$ from the neuron $n_j^{l+1}$ is calculated by the transfer function $\varphi(\cdot)$.
Figure 2: MLP 
The sum of inputs $\nu_j^{l+1}$ is fed to the transfer function. The output from the transfer function becomes the new output of the neuron $n_j^{l+1}$. This output is sent as an input to the neurons in the next layer, and the process repeats until the output layer is reached.
The transfer function is a non-linear function that is differentiable. The Rectifier transfer function is a non-linear transfer function that has shown promising results when used in the domain of computer vision, speech recognition and deep learning. The Rectifier transfer function:
$$\varphi(\nu) = \begin{cases} 0, & \nu \le 0 \\ a\nu, & \nu > 0 \end{cases} \tag{4}$$
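The forward propagation described in this section can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the implementation used in the thesis: the weights are made-up values, the bias neuron is modelled as a constant 1.0 appended to the outputs of each layer, and the rectifier (with a = 1) is applied in every layer, including the output layer.

```python
def rectifier(v, a=1.0):
    """Rectifier transfer function: 0 for v <= 0, a*v for v > 0."""
    return a * v if v > 0 else 0.0


def forward_layer(outputs, weights):
    """Propagate one layer forward.

    outputs[i]    is the output of neuron i in layer l
    weights[i][j] is the weight between neuron i in layer l
                  and neuron j in layer l+1
    """
    n_next = len(weights[0])
    summed = [sum(outputs[i] * weights[i][j] for i in range(len(outputs)))
              for j in range(n_next)]
    return [rectifier(v) for v in summed]


def forward(inputs, layers):
    """Feed the inputs through every weight matrix in turn. A constant bias
    neuron (value 1.0) is appended to the outputs of each layer first."""
    activations = list(inputs)
    for weights in layers:
        activations = forward_layer(activations + [1.0], weights)
    return activations


# Hypothetical weights: 2 inputs (+ bias) -> 2 hidden neurons (+ bias) -> 1 output.
hidden_weights = [[0.5, -1.0],
                  [0.25, 0.25],
                  [0.0, 0.0]]
output_weights = [[2.0],
                  [3.0],
                  [-0.5]]

hidden = forward([1.0, 2.0], [hidden_weights])
output = forward([1.0, 2.0], [hidden_weights, output_weights])
```

In this example the second hidden neuron receives a negative summed input, so the rectifier sets its output to zero and it contributes nothing to the output layer.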
A training dataset with labeled classes is used during supervised learning. If the network output deviates from the labels in the training dataset, the ANN has to adjust its weights. The process of adjusting the weights of the ANN is called training. During training, the weights are adjusted to minimize the network error.
The network error is calculated by evaluating a loss function. The loss function compares the network’s output to the corresponding output value in the dataset. There are different algorithms for adjusting the weights. One of the algorithms is called backpropagation. The backpropagation algorithm propagates the error back through the network and makes adjustments to all the weights. Every weight is adjusted according to how much the weight contributed to the measured network error.
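The idea of adjusting every weight according to its contribution to the error can be illustrated on a deliberately tiny network. The sketch below assumes one input, one hidden rectifier neuron, one linear output neuron, a squared-error loss and plain gradient descent; these choices, the learning rate and the starting weights are assumptions made for the example and do not reflect the training setup of the thesis.

```python
def forward(x, w1, w2):
    """One input -> one hidden rectifier neuron -> one linear output neuron."""
    v = w1 * x                 # summed input of the hidden neuron
    h = v if v > 0 else 0.0    # rectifier output
    y = w2 * h                 # network output
    return v, h, y


def loss(y, target):
    """Squared-error loss between the network output and the label."""
    return 0.5 * (y - target) ** 2


def backprop(x, target, w1, w2):
    """Gradients of the loss with respect to both weights (chain rule)."""
    v, h, y = forward(x, w1, w2)
    d_y = y - target                       # dL/dy
    grad_w2 = d_y * h                      # dL/dw2
    d_h = d_y * w2                         # error propagated back to the hidden output
    d_v = d_h * (1.0 if v > 0 else 0.0)    # back through the rectifier
    grad_w1 = d_v * x                      # dL/dw1
    return grad_w1, grad_w2


def train(x, target, w1, w2, lr=0.1, steps=50):
    """Repeatedly move each weight against its own gradient."""
    for _ in range(steps):
        g1, g2 = backprop(x, target, w1, w2)
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2


x, target = 1.0, 2.0
w1, w2 = 0.5, 0.5
initial_loss = loss(forward(x, w1, w2)[2], target)
w1_new, w2_new = train(x, target, w1, w2)
final_loss = loss(forward(x, w1_new, w2_new)[2], target)
```

Each gradient is the product of the output error and the path from that weight to the output, so a weight that had no influence on the output (for example behind an inactive rectifier) receives no update.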
2.3 Deep Learning
Deep Learning (DL) is a type of machine learning architecture based on ANNs. A DL architecture is an MLP with many neurons and many layers. DL is computationally demanding, and training a model can take a lot of time. The amount of data required to train a model scales with the size of the model.
2.3.1 Recurrent Neural Network
A recurrent neural network (RNN) is built to allow information to persist. For example, when we humans read a text, we understand each word based on our understanding of the previous words; we do not throw everything away and start thinking from scratch for each word. Traditional neural networks cannot do this, but RNNs address the issue by using loops, see figure 3.
The network, A, takes some input $x_t$ and outputs a value $h_t$, and the loop allows information to be passed from one step of the network to the next. Recurrent neural networks are not that different from normal neural networks. If the loop is unrolled, it looks like multiple copies of the same network, each passing information on to the next one, as shown in figure 4.
Figure 4: Recurrent neural network loop unrolled 
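A minimal sketch of the unrolled loop is given below. It assumes a simple scalar recurrence, h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), which is one common way to realise the network A; the specific update rule and the weight values are assumptions for illustration and are not taken from the thesis.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One pass through the loop: combine the input with the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Unrolled loop: the same weights are reused at every time step."""
    h = 0.0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# The two sequences differ only in their first element.
states_with_signal = run_rnn([1.0, 0.0, 0.0])
states_without_signal = run_rnn([0.0, 0.0, 0.0])
```

Although the last two inputs are identical, the final states differ, because the state carried through the loop still holds information about the first input.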
2.3.2 Long-Short Term Memory
Long Short-Term Memory, or LSTM as it is usually called, is a special kind of RNN that was introduced in 1997 by Hochreiter & Schmidhuber. This work was later refined and popularized by many people. LSTMs are widely used in deep learning.
LSTM uses feedback connections that allow it to process not only single data points (such as images) but also sequences of data (such as speech and video). The most common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals, while the gates regulate the flow of information into and out of the cell, as visualised in figure 5.
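To make the roles of the cell and the three gates concrete, the sketch below implements a single LSTM step for scalar values. It follows the commonly used formulation with sigmoid gates and a tanh candidate, which is assumed here because the text does not spell out the equations; the parameter dictionary and its values are hypothetical and chosen so that the gates are almost fully open or closed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for a scalar input, hidden state and cell state.

    p maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w, u, b = p[name]
        return w * x_t + u * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell value to keep
    i = sigmoid(pre("input"))         # how much of the new candidate to store
    o = sigmoid(pre("output"))        # how much of the cell to expose
    g = math.tanh(pre("candidate"))   # candidate value for the cell
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t


# Forget gate almost fully open, input gate almost closed: the cell remembers.
keep = {"forget": (0.0, 0.0, 20.0), "input": (0.0, 0.0, -20.0),
        "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
h, c = 0.0, 0.7
for x_t in [1.0, -1.0, 0.5, 2.0, -3.0]:
    h, c = lstm_step(x_t, h, c, keep)
kept_cell = c

# Forget gate almost closed as well: the stored value is erased in one step.
erase = dict(keep, forget=(0.0, 0.0, -20.0))
_, erased_cell = lstm_step(1.0, 0.0, 0.7, erase)
```

With the forget gate open and the input gate closed, the cell value survives five steps of unrelated inputs almost unchanged, which is the "remember over arbitrary time intervals" behaviour described above.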
2.3.3 Convolutional Neural Network
Convolutional Neural Networks (CNNs) are a special class of MLP. CNNs are neurobiologically inspired and are well suited for pattern classification. They are designed to recognize two-dimensional shapes and can handle a high degree of distortion.
The first step of a CNN is feature extraction: each neuron takes its inputs from a local receptive field in the previous layer. The second step is feature mapping. Each computational layer of the CNN has multiple feature maps, and each feature map takes the form of a plane in which the individual neurons are required to share the same weights. The third step is subsampling, which occurs after each convolutional layer. Subsampling is a computational process that lowers the resolution of the feature map. This lowers the sensitivity of the feature map output, which enables the output to better withstand distortion.
The input to a CNN is a two-dimensional array where every element corresponds to a pixel value. Inputs that use the RGB color scheme have three input channels instead of one. The first hidden layer performs convolution. It consists of a number of filters of the same size. Each neuron is assigned a two-dimensional receptive field. This receptive field is pre-determined and is also called a kernel. The kernel size determines the output dimension of the convolution. The second hidden layer performs subsampling and local averaging. The size of the filters is smaller than in the previous layer.
This pattern repeats until the output layer is reached. The output layer is flat (1x1) and has as many outputs as there are class labels. Through the network, the spatial resolution is reduced while the number of feature maps is increased.
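The convolution and subsampling steps can be sketched on a small single-channel input. The example assumes no padding, a stride of 1 and non-overlapping 2x2 average pooling, and, as is usual in CNN libraries, it slides the kernel without flipping it; the input values and the kernel are made up for illustration.

```python
def convolve2d(image, kernel):
    """'Valid' convolution (no padding, stride 1): every output neuron sees a
    local receptive field the size of the kernel, and all share its weights."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def average_pool(feature_map, size=2):
    """Subsampling by local averaging over non-overlapping size x size blocks."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [[sum(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size)) / (size * size)
             for c in range(out_w)]
            for r in range(out_h)]


image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]  # 5x5 input
kernel = [[0.0, 1.0],
          [1.0, 0.0]]                                              # 2x2 kernel

feature_map = convolve2d(image, kernel)   # (5 - 2 + 1) x (5 - 2 + 1) = 4x4
pooled = average_pool(feature_map)        # 4x4 -> 2x2
```

The output size of the convolution follows directly from the kernel size (input size minus kernel size plus one under these assumptions), and the pooling step halves the resolution again.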
2.4 Deep Learning Architectures
2.4.1 VGG16
VGG16 is a 16-layer CNN model proposed by K. Simonyan and A. Zisserman in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. Neural networks prior to VGG used bigger receptive fields (7x7 and 11x11) compared to the 3x3 fields in VGG16, but they were not as deep as VGG16.
By default, the input to the first layer is a 224 x 224 RGB image. The image is passed through a stack of convolutional layers whose filters have a very small receptive field: 3x3. The convolutional layers are followed by three fully connected layers, or dense layers: the first two have 4096 channels each, and the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The third dense layer should always be changed to match the number of classes being trained on. The final layer is the softmax layer. This is all visualised in figure 6.
Figure 6: VGG16 architecture based on “Very Deep Convolutional Networks for Large-Scale Image Recognition” 
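As a small worked example of what changing the third dense layer means in terms of model size, the sketch below counts the parameters of the three dense layers. It assumes that the convolutional stack ends in a 7x7 feature map with 512 channels, which is the standard VGG16 configuration for 224x224 input but is not stated in the text above; the function names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output channel of a fully connected layer."""
    return n_in * n_out + n_out


FLATTENED = 7 * 7 * 512   # assumed size of the last pooled feature map


def head_params(n_classes):
    """Parameter counts of the three dense layers for a given number of classes."""
    return [dense_params(FLATTENED, 4096),
            dense_params(4096, 4096),
            dense_params(4096, n_classes)]


imagenet_head = head_params(1000)    # original 1000-way ILSVRC classifier
three_class_head = head_params(3)    # e.g. three activity classes
```

Under these assumptions, replacing the 1000-way layer with a three-class layer shrinks the last dense layer from 4,097,000 to 12,291 parameters, while the first two dense layers are unaffected.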
2.4.2 Inception v3
Inception v3 is one of several iterative improvements on the Inception network. It was introduced in the same paper as Inception v2, "Rethinking the Inception Architecture for Computer Vision" by Szegedy et al. Inception v3 is a widely used image recognition model that has been shown to reach greater than 78.1% accuracy on the ImageNet dataset.
The model is made up of symmetric and asymmetric building blocks, including convolutions, average pooling, max pooling, concats, dropouts, and fully connected layers. Batch normalization is used extensively throughout the model and is applied to the activation inputs. The loss is computed using Softmax. This can be seen in figure 7.
2.4.3 Residual Network
A Residual Network (ResNet) is a type of ANN architecture. ResNet was created to combat a problem that affects deep networks with more than 30 layers: known techniques such as ReLU, dropout and batch normalization do not work well for such deep networks. To solve this, the idea was to use blocks, see figure 8, that re-route the input past the stacked layers and add it to what those layers learned.
Figure 8: ResNet block
With $x$ as the input vector and $H(x)$ as the underlying mapping, the residual function can be computed as
$$F(x) = H(x) - x \tag{5}$$
So the original function becomes the sum of the residual function and the input vector:
$$H(x) = F(x) + x$$
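The relationship between the residual function and the block output can be sketched with plain lists standing in for vectors. The two layer functions below are hypothetical stand-ins for trained layers and are only there to show how the shortcut behaves.

```python
def residual_block(x, layers):
    """Return F(x) + x, where F is what the stacked layers compute."""
    f_x = x
    for layer in layers:
        f_x = layer(f_x)
    return [f + xi for f, xi in zip(f_x, x)]


def zero_layer(v):
    """A layer that has learned nothing yet: outputs all zeros."""
    return [0.0 for _ in v]


def scale_layer(v):
    """Hypothetical layer for illustration: scales every element by 0.1."""
    return [0.1 * e for e in v]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, [zero_layer])   # F(x) = 0, so H(x) = x
learned_out = residual_block(x, [scale_layer])   # H(x) = 0.1 * x + x

# Recover the residual from the block output, as in equation (5).
residual = [h - xi for h, xi in zip(learned_out, x)]
```

When the stacked layers output zero, the block reduces to the identity, so adding such a block cannot make the mapping worse than passing the input straight through.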
3 Related work
Neural networks (NNs) have been applied successfully to classification problems and are capable of solving non-linear problems. There are different types of NNs, and the type investigated here is the Residual Network (ResNet), because this model won the ILSVRC 2015 image classification competition. ResNet will be implemented to recognize different human activities in an office environment by evaluating the data from a video stream. Similar models will also be evaluated and compared with ResNet with regard to complexity, training accuracy and validation accuracy.
The Residual Neural Network architecture has shown better results than earlier Deep Learning architectures. The Visual Geometry Group 16 network has been implemented successfully on the AMI office environment with a high degree of validation accuracy, correctly classifying a finite number of well-defined human group activities. Another network architecture used in that work was Inception V3, which is also implemented in this thesis.
3.1 An investigation of transfer learning for deep architectures in group activity recognition
In this article, the authors investigated how DL architectures with high performance on activity recognition problems could benefit from transfer learning. Transfer learning means that a pre-trained DL architecture is reused to solve another problem by peeling away the last layer.
The researchers tried out the DL architectures in a set of controlled experiments using the AMI Meeting Corpus database. The database was used because it offered a controlled environment where the conditions were exactly the same. These conditions include the camera angles, the lighting, and the participants. The database offered up to six different camera angles: four cameras in the middle, each directed towards one participant, a centered top view of the whole room and a top corner view of the whole room. The researchers used the data from the camera angles that gave an overview of the entire room.
Three activity classes were used: presentation, meeting and empty. These three classes were chosen because they were present during every meeting. From the controlled set of experiments, the researchers selected the best performing DL architectures. These DL architectures were applied to a dataset captured from two different cameras in the IoTaP Lab at Malmö University. That dataset was not relevant for this project, because this project aims to use only the AMI Meeting Corpus database.
The videos from the AMI Meeting Corpus database were pre-processed before being used to train the DL networks. Each video was copied once and flipped horizontally in order to create a mirror image and thereby expand the total volume of data. Each video was exactly five minutes long. In order to avoid overfitting, the images that belonged to the same video were kept together. The video was split into 25 170 frames. Each frame of the video was extracted as a JPEG and resized to 224x224 pixels. The test set consisted of 15% of each activity class.
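Two of the pre-processing ideas above, mirroring and keeping frames from the same video together, can be sketched as follows. Frames are represented as small nested lists instead of real images, and the video identifiers and the choice of test video are hypothetical; this is an illustration of the principle and not the pipeline used in the cited article.

```python
def flip_horizontal(frame):
    """Mirror a frame stored as a list of pixel rows."""
    return [list(reversed(row)) for row in frame]


def split_by_video(frames, test_videos):
    """Assign whole videos to one split so that frames from the same video
    never end up in both the training set and the test set.

    frames is a list of (video_id, frame) pairs.
    """
    train, test = [], []
    for video_id, frame in frames:
        (test if video_id in test_videos else train).append((video_id, frame))
    return train, test


frame = [[1, 2, 3],
         [4, 5, 6]]
mirrored = flip_horizontal(frame)

# Each hypothetical video contributes its original and its mirrored frame.
frames = [("video_a", frame), ("video_a", mirrored),
          ("video_b", frame), ("video_b", mirrored)]
train_set, test_set = split_by_video(frames, test_videos={"video_b"})
```

Splitting at the video level rather than the frame level matters because neighbouring frames are nearly identical; if they were spread across both sets, the test accuracy would overstate how well the model generalizes.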
The network models used were VGG16 with randomly initialized weights, VGG16 with weights pre-trained on the ImageNet dataset, and Inception V3 pre-trained on ImageNet. Each network model was trained for one hour using a GTX 1060 video card. The activity classes presentation and meeting have similar features compared to the activity class empty, so the researchers excluded the class empty in order to investigate classification bias towards it.
The researchers changed the original models by removing and adding specific layers. The layers that were not modified had their weights locked during training. To incorporate temporal features in the experiments, the researchers added RNN elements and an LSTM layer to the original models. The temporal features of the network models had to be incorporated into the validation process, so the validation process was redesigned so that image frames were grouped together. Image frames per video segment were classified independently during validation.
The researchers investigated temporal features further and therefore implemented a 3D CNN. The 3D CNN model required the data to be reprocessed. This was done by scaling the images to a width and height of 32x32 pixels and stacking them to a depth of 10 images. The previous models mentioned are not applicable to a 3D CNN, so the researchers used a previously published pre-trained model. The model was trained for 100 epochs.
The models that used temporal features scored the highest validation accuracy among the models tested in the paper. The results of the controlled experiments on the AMI database showed that the model with the highest validation accuracy was the 3D CNN, with 94.8%. The second highest validation accuracy, 88.0%, was achieved by the RCNN combining VGG16 features with LSTM layers.
The researchers tried changing the input dimensions and using unidirectional or bidirectional LSTM layers in order to enhance the VGG16 model. They were able to raise the validation accuracy of the VGG16 model to 92%. The changes that resulted in high accuracy were unidirectional LSTM layers and high-dimensional input.
This thesis builds on the work described above and extends it by incorporating both ResNet and audio.
3.2 Towards Robust Human Activity Recognition from RGB Video Stream with Limited Labeled Data
This article  investigates how to recognize human activity from an RGB video stream that has limited data. The limited data in this instance is the lack of depth information as opposed to RGB-D video that has depth data.
They propose a framework that couples skeleton data extracted from RGB video with a deep BLSTM model for activity recognition. In order to train this model effectively, they brought forward a set of algorithmic techniques. This solution can even outperform the state-of-the-art RGB-D video stream solutions and can be widely deployed using ordinary cameras.
Their proposed architecture combines deep BLSTM layers and MLP, with five consecutive BLSTM with dropout . They utilize Batch Normalization after each BLSTM layer and then feed the output of the BLSTM layers to the MLP.
This article was chosen as relevant for getting an idea of combining layers and what type of activation layers and optimizers might be useful.
3.3 Deep Residual Learning for Image Recognition
In this article  the authors write about the problem of vanishing/exploding gradients in DL networks. At a certain network depth, the training error will start to increase when more layers are added to the network. The researchers investigated the ResNet architectural approach to avoid vanishing/exploding gradients. Using identity mapping and identity shortcuts the researchers added the output of a previous layer to the output of stacked layers. The identity shortcuts add neither parameters nor computational cost.
The researchers assumed that if non-linear functions can be approximated by a few NN layers then the residual function can also be approximated. As described in equation (5), the original underlying function becomes the sum of the residual function and the identity shortcut, as seen in equation (6).
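The identity shortcut described above can be illustrated with a small NumPy sketch. It is not the thesis implementation: the two-layer residual function, the layer sizes and the names `residual_block`, `w1` and `w2` are assumptions chosen for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F(x) = relu(x @ w1) @ w2 is the residual function."""
    residual = relu(x @ w1) @ w2   # F(x): two stacked layers (illustrative choice)
    return relu(residual + x)      # identity shortcut: adds x, no extra parameters

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 samples with 8 features (illustrative sizes)
w1 = np.zeros((8, 8))              # zero weights make F(x) vanish
w2 = np.zeros((8, 8))
y = residual_block(x, w1, w2)      # the block then only passes relu(x) through
```

When the stacked layers contribute nothing, the block reduces to the shortcut path, which is why adding such blocks should not make the training error worse.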
This article is relevant because it shows that using ResNet can help to avoid vanishing/exploding gradients which will decrease the training error. This thesis is using ResNet which makes the article useful.
3.4 Robust Audio Sensing with Multi-Sound Classification
The authors in  explore different approaches in multi-sound classification and propose a stacked classifier based on recent advancements in deep learning. The proposed approach can robustly classify sound categories among mixed acoustic signals, without the need to know about the number and signature of sounds in the mixed signals.
To do this the authors apply a state-of-the-art CNN called VGGish pre-trained on AudioSet. To extract MFCC (Mel Frequency Cepstral Coefficient) features, the authors compute a spectrogram using the magnitude of the STFT (Short-Time Fourier Transform) of each frame of the original time-domain signal. The STFT is configured using a window size of 25 ms, a hop of 10 ms and a periodic Hanning window. The spectrogram is then mapped to 64 Mel bins, which produces a Mel spectrogram. In order to stabilize the spectrogram, the authors took the logarithm with an offset of 0.01 to avoid taking the logarithm of zero. Lastly, the audio features are extracted as a log Mel spectrogram with a shape of 96x64 (96 frames of 10 ms each, with a range of 64 Mel bands).
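The processing chain described above (STFT magnitude, 64 Mel bins, logarithm with an offset of 0.01) can be sketched in NumPy as follows. This is a minimal sketch and not the authors' code: the FFT length of 512, the band edges `f_min` and `f_max`, the triangular filter shape and all function names are assumptions.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate, f_min, f_max):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    hz_edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2))
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lower, centre, upper = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (fft_freqs - lower) / (centre - lower)
        falling = (upper - fft_freqs) / (upper - centre)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(signal, sample_rate=16000, window_s=0.025, hop_s=0.010,
                        n_mels=64, f_min=125.0, f_max=7500.0, log_offset=0.01):
    win = int(round(window_s * sample_rate))      # 400 samples per window
    hop = int(round(hop_s * sample_rate))         # 160 samples between windows
    n_fft = 2 ** int(np.ceil(np.log2(win)))       # 512, an assumed FFT length
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(win) / win)  # periodic Hann
    n_frames = 1 + (len(signal) - win) // hop     # no padding at the signal edges
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))
    mel = magnitude @ mel_filterbank(n_mels, n_fft, sample_rate, f_min, f_max).T
    return np.log(mel + log_offset)               # offset avoids log(0)

t = np.arange(16000) / 16000.0                    # one second of a 440 Hz test tone
spec = log_mel_spectrogram(np.sin(2.0 * np.pi * 440.0 * t))
```

The result has one row per STFT frame and one column per Mel bin; grouping rows into fixed-length examples (such as the 96 frames mentioned in the text) would be a separate step.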
The window size of 25 ms, hop of 10 ms, mapping to 64 Mel bins and taking the logarithm of the produced Mel spectrogram are implemented in this thesis. The results in  are promising as the accuracy never drops by a significant amount when going from 1 to 5 mixed events of similar and distinct sound categories. On average the accuracy only dropped 6.75% when going from 1 to 5 mixed sounds.
The method of choice for this research is the systems development method written by Nunamaker and Chen. This systems development process is done in five stages. Nunamaker and Chen's method was chosen because it allows an iterative process until the research questions are answered. Unforeseen difficulties that may be encountered will force the project to take a step back and re-evaluate an earlier stage.
Figure 9: Nunamaker 
4.1 Construct a Conceptual Framework
The first stage of this research methodology was to break down the problem domain into smaller research questions and to perform a literature study. The research questions have already been defined and are presented in section 1.4.1. A literature study was done to gain more knowledge of the problem domain, which helped to provide information for the theoretical background. The theoretical background is presented in chapter 2. The work done in  was also analyzed and the same approach was used. The method of approach is described in 3.1.
4.2 Develop a System Architecture
With the first stage complete, the next step is to develop a system architecture based on the conceptual framework. Based on the information from the conceptual framework the components and requirements needed were acquired.
The development of the system architecture was done using Google Colab and by using Keras API . Also, Kaggle (competitions based on machine learning problems) was used for inspiration on how to implement and build a prototype system. Kaggle provided a variety of implementations regarding image classification using Keras API.
4.2.1 Google Colaboratory
Colaboratory is a so-called Jupyter notebook in the cloud, requiring no setup to start using. Jupyter notebook is a web application allowing you to create and share documents containing live code, equations, visualizations, and narrative text. By using Colaboratory we get access to powerful hardware which speeds up the training time for our deep learning model.
4.3 Analyze and Design the System
At this stage, we have a system architecture and it is time to analyze whether the system architecture meets the requirements so that results with good validity can be produced.
4.3.1 Cifar-10 dataset
The Cifar-10 dataset is a multi-class image classification problem. The Cifar-10 dataset consists of 60000 32x32 color images in 10 classes. There are 6000 images per class. For training, there are 50000 images and for validation, there are 10000 images. The Cifar-10 dataset is a moderately difficult dataset because of the similarity of some of the classes. The target names of Cifar-10 are as follows: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset is provided by Keras API and the models used for feature extraction to answer the research questions are also provided by Keras API. The AMI Corpus dataset is a custom dataset and, unlike Cifar-10, is not supported by Keras API. This means that the AMI Corpus data has to be loaded in a different way compared to the Cifar-10 dataset in order to use Keras API.
In order to validate the implemented system design, the results given by the models on Cifar-10 were compared to known official benchmark scores, which shows the validity of the system implementation. The image size used for Cifar-10 was 32x32 pixels.
4.3.2 Cats Vs. Dogs dataset
The Cats Vs. Dogs dataset is a two class image classification problem. The dataset consists of 8000 training samples and 2000 validation samples. The image sizes are not the same and the images are reshaped to the same size.
The Cats Vs. Dogs dataset was used in order to analyze the design of the system. This had to be done because the folder structure of the Cats Vs. Dogs dataset is not the same as for the Cifar-10 dataset in 4.3.1. This means that the Cats Vs. Dogs dataset undergoes a different data loading process and the result has to be analyzed.
The Cats Vs. Dogs dataset is used in order to verify that the loading process with the folder structure provided works. The same loading process of the data will also be used for the AMI dataset.
4.4 Feature extraction by using transfer learning
Transfer learning was achieved by removing the top layer of the base model. Through transfer learning the features of each image were extracted. The image features were then used to train a smaller NN model to classify the image as belonging to one of the three classes described in subsection 4.5.1.
4.4.1 Data augmentation
Data augmentation is used to increase the sample size, producing a larger variety of unique data samples. Keras API has a built-in data augmentation feature. The built-in data augmentation feature was used so that the model would be able to generalize better and thereby have a higher validation accuracy and less risk of overfitting the data. Data augmentation was only done on the training dataset.
The data augmentation that is done:
- horizontal flip = True
- rotation range = 5
- shear range = 0.2
- zoom range = 0.2
- shuffle = False
- seed = 4
Some of the data augmentations used are inspired by  but are also standard augmentation techniques for images. The horizontal flip does not change the information in the image but represents it in another way. For instance, in the Cats Vs. Dogs dataset, if the cat image is flipped horizontally it still looks like a cat. Using rotation, shear and zoom also creates augmented images that show the same visual information represented in another way. Distorting the image too much could lead to worse performance because the augmented images do not represent reality any more. So the values used were small but still manage to create augmentations that are distinguishable from the original image.
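The effect of the horizontal flip can be sketched with NumPy. This only illustrates the idea and does not reproduce the Keras generator or its random stream; the function name, the image sizes and the flip probability of 0.5 are assumptions, and only the seed value 4 is taken from the settings listed in the text.

```python
import numpy as np

def random_horizontal_flip(images, rng):
    """Flip each image left-to-right with probability 0.5; images has shape (n, h, w, c)."""
    flip = rng.random(len(images)) < 0.5
    out = images.copy()
    out[flip] = out[flip][:, :, ::-1, :]   # reverse the width axis of the chosen images
    return out, flip

rng = np.random.default_rng(4)             # seed value from the listed settings
images = rng.integers(0, 256, size=(6, 32, 32, 3), dtype=np.uint8)
augmented, flipped = random_horizontal_flip(images, rng)
```

A flipped image contains exactly the same pixel values as the original, only mirrored, which is why the label of the sample does not change.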
Keras data generators were used to augment the data and to extract image features through the transfer learning process. The generators used the base models to extract the image features through online predictions on the training and testing datasets. The extracted features were then used for training and evaluating the model. This approach solved the problem of not being able to use large image sizes due to memory overflow in the Google Colaboratory environment. Memory overflow did not occur because the base models predicted online and only loaded a small enough number of images at a time. The image size used for Dogs Vs Cats and for AMI was 244x244. The loading process sorts the data in alphanumeric order. The AMI data has a unique file stamp for every sample in every video batch. The batches will be sorted by the Keras data generators without shuffling the data samples, so the integrity of the sample order within each batch of data is maintained. Maintaining the order of the samples within each batch is of importance for the use of temporal features.
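The batch-wise ("online") feature extraction with a fixed sample order can be sketched as below. The sketch uses stand-ins instead of Keras: `fake_load_image` and `fake_base_model` are hypothetical placeholders for image loading and for a pre-trained base model, and the file name pattern is invented for illustration.

```python
import numpy as np

def iter_batches(file_names, batch_size):
    ordered = sorted(file_names)              # alphanumeric order, no shuffling
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]

def extract_features(file_names, load_image, base_model, batch_size=32):
    features = []
    for batch in iter_batches(file_names, batch_size):
        images = np.stack([load_image(name) for name in batch])
        features.append(base_model(images))   # only one batch of images is held in memory
    return np.concatenate(features, axis=0)

def fake_load_image(name):
    # Stand-in: a deterministic pseudo-image derived from the file name.
    seed = sum(ord(ch) for ch in name)
    return np.random.default_rng(seed).random((8, 8, 3))

def fake_base_model(images):
    # Stand-in for a pre-trained network: global average pooling to (batch, 3).
    return images.mean(axis=(1, 2))

names = [f"video01_{i:04d}.jpg" for i in reversed(range(70))]
feats = extract_features(names, fake_load_image, fake_base_model)
```

Plain alphanumeric sorting only matches the temporal order when the frame numbers in the file stamps are zero-padded, as in the hypothetical names above.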
4.4.2 Parameters during model training
The parameters have been chosen by considering the paper . The batch size used was 1000 for Cifar-10. The Cats vs Dogs dataset is smaller so a smaller batch size was used during training; batch size of 256.
The optimizer chosen was Stochastic Gradient Descent (SGD) with a learning rate of 0.0001, a learning rate decay of 0 and a momentum of 0.9.
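As an illustration of these settings, the following sketch applies the classical momentum update for a single scalar weight. It is a simplified stand-in for the optimizer, not the Keras implementation itself; the function name and the example gradient are assumptions. A decay of 0 means that the learning rate stays constant between steps.

```python
def sgd_momentum_step(weight, velocity, gradient, learning_rate=0.0001, momentum=0.9):
    """One SGD step with momentum: the velocity accumulates past gradients."""
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity

w, v = 1.0, 0.0
w1, v1 = sgd_momentum_step(w, v, gradient=2.0)    # first step: plain gradient step
w2, v2 = sgd_momentum_step(w1, v1, gradient=2.0)  # second step: momentum enlarges the step
```

With a constant gradient the second step is 1.9 times as large as the first, which shows how momentum speeds up movement along a consistent direction.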
Figure 11: Neural Network after applying dropout
The dropout sets a fraction of the inputs to 0 and is used to counteract the model overfitting the data. The inputs that are set to 0 are chosen randomly at each time step. The dropout rate is set to 0.4 and any changes will be specified.
A variety of approaches have been tested to speed up and reproduce the results. The approach that has been adopted includes implementing data generators and also saving the features extracted by the base models. The AMI Corpus data used is available in the git repository for future use.
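A minimal NumPy sketch of dropout with a rate of 0.4 is given below. The function name and array sizes are assumptions. The rescaling of the kept inputs by 1/(1 - rate) is not mentioned in the text; it is included because it is how dropout layers commonly behave during training, so that the expected sum of the inputs is unchanged.

```python
import numpy as np

def dropout(inputs, rate, rng):
    """Zero a random fraction `rate` of the inputs and rescale the remaining ones."""
    keep = rng.random(inputs.shape) >= rate
    return inputs * keep / (1.0 - rate), keep

rng = np.random.default_rng(0)
x = np.ones((1000, 64))
y, keep = dropout(x, rate=0.4, rng=rng)
dropped_fraction = 1.0 - keep.mean()   # close to 0.4 for a large input
```

A new random mask is drawn on every call, which corresponds to the inputs being chosen randomly at each step.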
The model architecture chosen was inspired by :
Figure 12: Model architecture used after the base model feature extraction.
4.5 Build the (prototype) System
This stage is all about the implementation of the system that was designed. Here we do all the programming needed for the system.
4.5.1 Frame extraction
The AMI Corpus NITE XML Toolkit was used to extract the annotations for each video. A program was created to read the data in each Excel file and to combine and save all the data into one Excel file. This Excel file contained the most important data needed, which was the name of the video, the topic, the start time and end time for each topic and the class assigned to each topic. The assumption was made that video segments where the topic name included the word "presentation" or "drawing" were classified as a presentation. Topic names that included the word "closing" were classified as "empty" and the rest of the topics were classified as "meeting".
Another program was created in order to extract a batch of frames from the video signals. The batch of video frames was stored according to the classification that was made based on the topic name. For each batch of video frames, a 2-dimensional audio spectrogram was extracted. The batch of video frames and the audio spectrogram were synchronized, which is a requirement for answering the research questions.
The extraction of samples was done by reading a video segment containing frames and storing all the video segments in a list. The list was then randomly shuffled and the class with the fewest samples set the limit for how many video segments were to be extracted from each class. Then the samples were extracted from the shuffled list, so the samples were chosen randomly. There were several lists, such as the class label, file stamp, spectrogram and batch of video frames. The class "closing" usually occurs only at the end of the video or never. The class that has the most occurrences is the "meeting" class. By limiting the extraction the dataset becomes more balanced regarding the samples per class. The extraction process has been thoroughly tested.
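The topic-name rule described above can be written as a small function. This is a sketch under assumptions: the function name is hypothetical, the matching is made case-insensitive, and the "presentation"/"drawing" check is given precedence over "closing" if both words appear, which the text does not specify. The example topic names are invented.

```python
def classify_topic(topic_name):
    """Map an annotated topic name to one of the three activity classes."""
    name = topic_name.lower()              # case-insensitive matching (assumption)
    if "presentation" in name or "drawing" in name:
        return "presentation"
    if "closing" in name:
        return "empty"
    return "meeting"

labels = [classify_topic(t) for t in
          ["Interface specialist presentation", "Drawing exercise", "closing", "discussion"]]
```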
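The balancing step can be sketched as follows: shuffle all segments, then keep as many segments per class as the rarest class provides. The function name, the seed and the segment identifiers are hypothetical and only serve as an illustration.

```python
import random
from collections import defaultdict

def balance_segments(segments, seed=0):
    """segments: list of (class_label, segment_id) pairs. Returns a dict label -> ids."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)              # random order before selection
    by_class = defaultdict(list)
    for label, segment_id in shuffled:
        by_class[label].append(segment_id)
    limit = min(len(ids) for ids in by_class.values())  # rarest class sets the limit
    return {label: ids[:limit] for label, ids in by_class.items()}

segments = ([("meeting", f"m{i}") for i in range(10)]
            + [("presentation", f"p{i}") for i in range(4)]
            + [("empty", f"e{i}") for i in range(2)])
balanced = balance_segments(segments)
```

In this invented example the "empty" class has only two segments, so two randomly chosen segments are kept from every class.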
4.5.2 Video dataset
The same approach will be used as in 3.1.2 where three activity classes (presentation, empty and meeting) are to be classified by the model. The datasets used can be augmented as described in section 4.4.1 to expand the total volume of samples. The dataset consisted of images with size 144x144. Each of the three classes had batches of 32 frames. The 32 frames were captured by taking every fifth frame until reaching a batch size of 32 frames. The frame rate of the videos was 25 frames/second, meaning that a batch of 32 frames had a duration of 6.2 seconds. From each video, it was possible to extract one or maybe two batches for each class, for training and validation respectively. From some videos, no batches were extracted because it was necessary to keep the dataset balanced, and the class empty occurred only once in the video or not at all. The empty class also occurred during a short time span in the video. We solved this by manually moving the batches from validation folders to training folders. The frames were re-scaled to 144x144 and placed in training and validation folders, each containing subfolders for the classes 0, 1 and 2. The images were then uploaded to GitHub.
Google Colab was used to implement the base models for transfer learning. The images had to be resized to 244x244. Because of limited RAM, the images had to be resized and predicted online in batches so that the RAM limit would not be exceeded.
The base models made online predictions on batches with a sample size of 32 frames. The results of the predictions are the extracted image features. The extracted features were stored in variables to be used by the model during training. Only the model was trained; the base models were only used to extract the image features.
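The batch duration of 6.2 seconds follows from taking every fifth frame at 25 frames/second, as this short calculation shows. The constant and function names are chosen for illustration.

```python
FRAME_RATE = 25      # frames per second of the videos
FRAME_STEP = 5       # every fifth frame is kept
BATCH_FRAMES = 32    # frames per batch

def batch_frame_indices(start_frame=0):
    """Indices of the original video frames that make up one batch."""
    return [start_frame + i * FRAME_STEP for i in range(BATCH_FRAMES)]

indices = batch_frame_indices()
duration_s = (indices[-1] - indices[0]) / FRAME_RATE   # 31 gaps of 0.2 s each
```

The 32 kept frames are 0.2 seconds apart, so the time between the first and the last frame is 31 x 0.2 = 6.2 seconds.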
Presentation Empty/No-activity Meeting
Figure 13: Shows what the ideal setting for each of the three classes can look like. The presentation class should have someone standing up and speaking. The empty/no-activity class should be an empty room. The meeting class should have all four participants conducting a conversation.
4.5.3 Audio dataset
The same approach as in  was used. Mel spectrograms were extracted from the audio files that were part of the videos. The audio sample rate was 16000 frames/second. The FFT window had a time span of 25 ms with a window hop of 10 ms. This meant that each FFT window contained 400 audio frames and that a new FFT window was created for every 160 audio frames, so there was an overlap of 240 audio frames between consecutive FFT windows. The number of Mel bins used was 64 and the cut-off frequencies were [175, 7500] Hz. The logarithm of the spectrograms was taken to stabilize them. The spectrograms were uploaded to GitHub with a size of 144x144. In Google Colab they were resized to 244x244 for feature extraction by the base model.
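The window arithmetic can be checked with a few lines of Python. A 25 ms window at 16000 audio frames/second spans 400 audio frames, a 10 ms hop spans 160, and consecutive windows therefore share 240 audio frames. The count of windows per 6.2 second segment assumes that no padding is applied at the segment edges; this assumption and the names are not from the text.

```python
SAMPLE_RATE = 16000   # audio frames per second
WINDOW_S = 0.025      # FFT window length in seconds
HOP_S = 0.010         # hop between consecutive windows in seconds

window_samples = round(WINDOW_S * SAMPLE_RATE)
hop_samples = round(HOP_S * SAMPLE_RATE)
overlap_samples = window_samples - hop_samples

def num_windows(duration_s):
    """Number of full FFT windows in a segment, assuming no padding at the edges."""
    total = round(duration_s * SAMPLE_RATE)
    return 1 + (total - window_samples) // hop_samples

windows_per_segment = num_windows(6.2)
```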
Presentation Empty/No-activity Meeting
Figure 14: shows the Mel audio spectra corresponding to the figures in Fig 13.
4.5.4 Video and Audio combined
To combine the extracted video and audio features the data had to be concatenated. The concatenation was done along the feature axis of the audio and video data. An LSTM layer was used to analyze the temporal data. The last layer is an output layer with three outputs, one for each class, so the whole model makes three probability predictions indicating which class the model thinks the batch of data belongs to.
The simple architecture, for which one LSTM layer was enough, was used for testing audio and video combined, only audio and only video:
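The concatenation along the feature axis can be sketched with NumPy. The array shapes are (sequences, time steps, features); the feature sizes 6 and 4 are small placeholders and not the sizes produced by the base models. The softmax function is included only to illustrate what three probability outputs look like and is applied to made-up scores.

```python
import numpy as np

rng = np.random.default_rng(0)
video = rng.random((2, 32, 6))    # 2 sequences, 32 time steps, 6 video features (placeholder)
audio = rng.random((2, 32, 4))    # same sequences and time steps, 4 audio features (placeholder)

# Video features first, then audio features, for every time step.
combined = np.concatenate([video, audio], axis=-1)

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

probabilities = softmax(np.array([[2.0, 0.5, 0.1]]))         # one value per class
```

Swapping the order of the two arrays in the list changes which features come first along the last axis, while the time axis stays aligned between video and audio.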
Figure 15: Model architecture used after feature extraction done.
Each audio frame used for training will be of 0.2 seconds length. A total of 32 audio frames make up one sequence and will be combined with corresponding video batch along the extracted feature axis. The pre-processing of the images was done in the same way as in section 4.5.3:
Figure 16: shows the Mel audio spectra for three consecutive frames in the meeting class. Each image spans over 0.2 seconds.
4.6 Observe and Evaluate the System
The final stage is to observe and evaluate the results of our prototype system. Here we can determine if the results were as expected or not.
The AMI Corpus dataset was successfully loaded into the system without any concern that the system would be at fault because of earlier testing (see 4.3.1 and 4.3.2).
Test simulations done with the Cats Vs. Dogs dataset showed promising results with a validation accuracy of more than 90%. The training accuracy did not change and the reason could be that the dataset was too small.
Because online prediction was needed for the image size 244x244, the Cifar-10 and Cats Vs Dogs datasets were tested with the online predict generator. The results were similar to the results presented in Fig 20 and in 19b.
In this section, the results from various experiments will be presented in detail.
5.1 Cats vs Dogs dataset
Figure 17: Resulting images from the Cats Vs Dogs dataset by implementing the keras data generator.
In Fig. 17 the generator is set to a batch size of 1000 training samples. Because the Cats Vs Dogs dataset only has 8000 training samples, the sequence repeats itself for every eighth image. The generator can be used to generate more samples than the total sample size; the extra samples can be augmented, thereby expanding the total training size.
(a) VGG16 (b) ResNet50 (c) Inception V3
Figure 18: shows the model accuracy for the Cats vs Dogs data set described in 4.3.2. The data has been augmented. The base models used for feature extraction are shown under each subplot. The model architecture used for training and prediction is shown in Fig. 15
5.1.1 Implementing the system architecture
The implemented system architecture for loading data, creating the data set described in Fig. 17, extracting the features and training a model based on the extracted features results in the high model validation accuracy shown in Fig. 18. The training accuracy is around 50% and there is no definite explanation for why the training accuracy is not higher. There could be a lot of variance in the Cats Vs Dogs data set, which reduces overtraining. Also dropout is used which lowers the training accuracy so that the model
(a) VGG16 (b) ResNet50 (c) Inception V3
Figure 19: shows the confusion matrix for the Cats vs Dogs dataset described in 4.3.2. The data has not been augmented. The data processing type is shown under each subplot. The model architecture used for training and prediction after the feature extraction is shown in Fig. 15.
The confusion matrices in Fig. 19 show that the classification done by the model does indeed work. They also show that some samples are not classified correctly, meaning that the model is not able to determine the correct class for those samples.
Table 1 shows that the model performs better when the data set is shuffled. The data is read class by class and shuffling the data samples seems to have an impact on the learning. If the data is not shuffled then the model will first learn the class cats and then dogs, which can make the model more biased towards classifying the data as cats.
Table 1: Shows the F1-score for the resulting confusion matrix in Fig. 19 using no shuffle with a ratio between training and validation samples of 50/50, shuffling the data with a ratio of 80/20, and no shuffling with a ratio of 80/20 (data generated by using Keras generators). The data is not augmented. The accuracy for each class is the same as the F1-score.
Data processing F1-score[%] samples: cats samples: dogs
No shuffle 94 1000 1000
Shuffle + 80/20 99 1003 988
5.2 Cifar-10 dataset
Figure 20: The Cifar-10 example data of the first 20 images that are preprocessed and then undergo feature extraction.
Figure 21: Shows the first 20 images of the Cifar-10 dataset, the same as in Fig. 20, with the difference that these images are also augmented before anything else is done. The augmentation done is described in section 4.4.1.
5.2.1 Benchmarking the system architecture on Cifar-10 dataset
The model was trained on the Cifar-10 dataset to test how well the system architecture is implemented. These tests are to ensure that the base models and transfer learning work correctly and that no anomalies are present in the system architecture.
Figure 22: Accuracy and loss for Cifar-10 using ResNet50 where (a) shows accuracy and (b) shows loss
Figure 23: Confusion matrix based on the result from Fig. 22
Table 2: Shows the precision, recall, F1-score and number of samples of the Fig. 22 and Fig. 23. Data is shuffled but not augmented.
Class Precision[%] Recall[%] F1-score[%] Samples
Airplane 90 90 90 1165
Automobile 95 94 95 1218
Bird 87 86 86 1213
Cat 76 76 76 1214
Deer 85 85 85 1201
Dog 82 79 80 1176
Frog 90 94 92 1181
Horse 90 89 90 1206
Ship 92 93 93 1223
Truck 91 94 92 1197
The validation accuracy in Fig. 22 shows that the feature extraction works properly and that the model is able to learn from the extracted features and classify the validation data set with a high validation accuracy. There is also clear evidence that the training accuracy of the model exceeds the validation accuracy at around epoch 60. The confusion matrix in Fig. 23 shows the misclassifications done by the model, where some samples of the classes dog and cat seem to confuse the model, as do ship and airplane. The model gets confused because there are similarities between these classes, so some samples are classified wrongly.
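The per-class precision, recall and F1-score reported in the tables can be derived from a confusion matrix as sketched below. The function name and the 2x2 example matrix are made up for illustration and are not results from this thesis. The sketch assumes that every class is predicted at least once, since it does not guard against division by zero.

```python
import numpy as np

def per_class_metrics(confusion):
    """confusion[i, j]: number of samples of true class i predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    true_positive = np.diag(confusion)
    precision = true_positive / confusion.sum(axis=0)   # column sums: all predicted as class
    recall = true_positive / confusion.sum(axis=1)      # row sums: all samples of the class
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

example = [[90, 10],
           [5, 95]]
precision, recall, f1 = per_class_metrics(example)
```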
5.3 AMI Corpus Video and Audio
5.3.1 Video dataset
(a) (b) (c)
Figure 24: Shows the accuracy for ResNet50 without augmentation on AMI dataset for the Edinburgh Scenario meetings. Only two classes to categorize, presentation and meeting. The data set is shuffled and balanced so that the ratio between training and validation is 80/20. Figure (a) represents the corner camera, (b) the overhead camera and (c) the corner and overhead camera angles combined.
The model has a high validation accuracy for the corner camera angle, the overhead camera angle and the data from the two camera angles used together in Fig. 24. There are extracted features in the images that are similar. The features could also be different but very unique for each camera angle, meaning that the model finds a pattern that results in a high validation accuracy even if the camera angles are different. So the model's complexity allows the model to optimize its weight configuration in such a way that it can reach a high validation accuracy regardless of the camera angle used in Fig. 24.
(a) (b) (c)
Figure 25: Shows the loss for ResNet50 without augmentation on AMI data set for the Edinburgh Scenario meetings. The network architecture can be seen in the Appendix at Fig. ?? . Only two classes to categorize, presentation and meeting. The data set is shuffled and balanced so that the ratio between training and validation is 80/20. Figure (a) represents the corner camera, (b) the overhead camera and (c) the corner and overhead camera angles combined.
The model loss shown in Fig. 25 does vary depending on the camera angle. Some camera angles make it more difficult for the model to learn the classification pattern. Comparing Fig. 25a and Fig. 25b it can be observed that the loss function is steeper for Fig. 25b. The steeper loss function indicates that the data in the overhead camera angle is easier for the model to learn compared to the corner camera angle. Putting the camera angles together creates a model that can classify data from different camera angles.
(a) (b) (c)
Figure 26: Shows the confusion matrix for ResNet50 without augmentation on AMI dataset showing the confusion matrix for the different datasets used. Only two classes to categorize, presentation and meeting. The dataset is shuffled and balanced so that the ratio between training and validation is 80/20. Figure (a) represents the corner camera, (b) the overhead camera and (c) the corner and overhead camera angles combined.
The confusion matrices in Fig. 26 show that the model is more likely to misclassify presentation as meeting. The data is not clear enough regarding what a "presentation" or a "meeting" setting looks like. For the corner camera angle in Fig. 26a the model is more likely to misclassify presentation as meeting compared to Fig. 26b.
Table 3: Shows the accuracy, precision and recall for Fig 24 shuffling the data and a training/validation ratio of 80/20. Data is not augmented.
Meeting, angle accuracy[%] presentation/meeting precision[%] presentation/meeting recall[%]
ES, corner 98 97/99 99/97
ES, mixed 99 99/99 99/99
ES, overhead 99 98/99 99/98
5.3.2 Audio dataset
(a) (b) (c)
Figure 27: Shows validation accuracy for each of the three base models on the audio data set for the Edinburgh Scenario and Idiap Scenario meetings combined without augmentation. Figure (a) represents the base model InceptionV3, figure (b) represents the ResNet50 and figure (c) the VGG16 base model.
The validation accuracy in Fig. 27 suggests that the audio data does not provide enough unique features for the model to be able to distinguish between the different classes. The lack of smooth curves
spans over 6.2 seconds. ResNet50 reaches the highest validation accuracy. This shows that ResNet50 has extracted features that make it easier for the model to classify the audio data correctly.
(a) (b) (c)
Figure 28: Shows the loss for the different base models. Same dataset used as in Fig. 27. Figure (a) represents the base model InceptionV3, figure (b) represents the ResNet50 and figure (c) the VGG16 base model.
The ResNet50 model in Fig. 28 has a steeper loss function compared to both InceptionV3 and VGG16.
(a) (b) (c)
Figure 29: Shows the confusion matrix for each base model. Same data and training parameters as shown in Fig. 27. Figure (a) represents the base model InceptionV3, figure (b) represents the ResNet50 and figure (c) the VGG16 base model.
Table 4: Shows the validation accuracy and F1-score for each class for the resulting confusion matrix in Fig. 29 for InceptionV3, ResNet50 and VGG16. Data ratio is 80/20 between training and validation.
Base model accuracy [%] presentation/empty/meeting F1-score[%]
InceptionV3 66 62/73/64
ResNet50 78 74/81/79
VGG16 50 54/50/47
According to the confusion matrix in Fig. 29 the InceptionV3 base model misclassified presentation as meeting. VGG16 misclassified empty as meeting and was also confused regarding presentation and meeting. The model that excelled is ResNet50, although it also got confused about presentation and meeting. All three base models had difficulties distinguishing between presentation and meeting. It is reasonable to think that the audio spanning over 6.2 seconds is similar for both presentation and meeting, which makes it difficult for the model to distinguish between one or more persons speaking.
5.3.3 LSTM: Audio and video sequence
(a) (b) (c)
Figure 30: Shows the validation accuracy during model training using the LSTM model architecture. Using ResNet50 as base model for the feature extraction. Figure (a) represents only audio with 65% accuracy, figure (b) is only video with 100% accuracy and (c) is both audio and video features combined with 100% accuracy.
As seen in Fig. 30, the LSTM model reached a high validation accuracy using only video sequences, compared to using only audio sequences. Combining the audio and video sequences along their feature axis lowered the validation accuracy by about 5 percentage points. The audio samples span over 0.2 seconds and for every video sample there is a corresponding audio sample. One sequence is made up of 32 samples. The dataset is shuffled and balanced so that the ratio between training and validation is 80/20. The distribution of validation batches over the classes (presentation, empty/no activity, meeting) is (3, 13, 15). The distribution of training batches over the same classes is (69, 34, 17).
(a) (b) (c)
Figure 31: Shows the loss during model training using the LSTM model architecture. Figure (a) represents only audio, (b) only video and (c) both audio and video features combined.
The loss function for the audio sequences in Fig. 31 shows that the model has difficulties finding a unique pattern in the data for each class. On the other hand, the loss function for video is steep and indicates that the model does see a pattern in the spatio-temporal video data.
(a) (b) (c)
Figure 32: Shows the confusion matrix after training. Same data and training parameters as shown in Fig. 30. Figure (a) represents only audio, (b) only video and (c) both audio and video features combined.
Table 5: Shows the validation accuracy and F1-score for each class for the resulting confusion matrix in Fig. 32. Using ResNet50 as the base model for feature extraction of only audio data, only video data, and audio and video data combined. The ratio between training and validation data is 90/10.
Input data accuracy[%] presentation/empty/meeting F1-score[%]
Audio 65 40/64/76
Video 100 100/100/100
Audio & Video 97 80/100/97
The confusion matrices in Fig. 32 show that the audio data confuses the model and results in misclassification of empty and meeting.
Figure 33: Shows Fig. 30b as Fig. (a), which represents the data where video features come first and then audio features, with 100% validation accuracy. Fig. (b) represents the data where audio features come first and then video features, with 97% validation accuracy.
Fig. 33 shows that the training outcome depends on whether the combined data is arranged with audio features first or with video features first. If the audio features come first, the training results in a worse validation accuracy than if the video features come first.
5.4 Looking deeper into ResNet
Figure 34: Shows the training and validation accuracy during training. Same data and training parameters as shown in Fig. 27. The validation accuracy is 75% in (a), 66% in (b) and 69% in (c). Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
The validation accuracy does not increase with a deeper ResNet base model, as can be seen in Fig. 34. This suggests that there are no deeper features hidden in the data for the ResNet to extract when the base model is made more complex.
Figure 35: Shows the loss function during training. Same data and training parameters as shown in Fig. 34. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
Of the three base models shown in Fig. 35, ResNet50 performs best. Increasing the complexity of the ResNet base model was expected to increase the performance. It should be noted, however, that the number of samples available for training and validation may be too small.
Figure 36: Shows the confusion matrix after training. Same data and training parameters as shown in Fig. 34. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
5.4.2 Audio: Mixed ES and IS meetings with random weight initialization
Figure 37: Shows the training and validation accuracy during training using random weight initialization. Same data and training parameters as shown in Fig. 27. The validation accuracy is 56% in (a), 67% in (b) and 58% in (c). Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
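With random weight initialization, the deeper ResNet101 and ResNet152 base models reach a higher validation accuracy than ResNet50, as can be seen in Fig. 37. The increase is not monotonic, however: ResNet101 reaches 67% and ResNet152 58%, compared to 56% for ResNet50.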
Figure 38: Shows the confusion matrix after training. Same data and training parameters as shown in Fig. 37. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
5.4.3 Audio: Edinburgh Scenario meetings LSTM audio sequence
Figure 39: Shows the validation accuracy. Same data and training parameters as shown in Fig. 30. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152. The validation accuracy is 65% in (a), 55% in (b) and 42% in (c).
The validation accuracy decreases with a deeper ResNet base model, as can be seen in Fig. 39. This suggests that the extracted features are of poorer quality and do not enable the model to classify the data better.
Figure 40: Shows the loss during model training using the LSTM model architecture. Same data and training parameters as shown in Fig. 39. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
Figure 41: Shows the confusion matrix after training. Same data and training parameters as shown in Fig. 39. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
5.4.4 Audio: Edinburgh Scenario meetings LSTM audio sequence with random weight initialization
Figure 42: Shows the training and validation accuracy during training using random weight initialization. Same data and training parameters as shown in Fig. 30. The validation accuracy is 42% in (a), 58% in (b) and 71% in (c). Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
The validation accuracy also increases for the LSTM sequence data with a deeper ResNet base model, as can be seen in Fig. 42.
Using the ImageNet weights instead of training the base models provides an advantage regarding the model training time. The base models are quite large, with many parameters, so randomly initializing their weights would mean a time-consuming optimization of those weights. Initializing the base models with ImageNet weights and then training them further could lead to better results. There is also the option of adjusting the weights of only some of the layers in the base model rather than all of them. The benefits are fewer parameters to optimize and the use of the already optimized base model for feature extraction. There was no intention of going any deeper into adjusting the weights of the base models or optimizing the base models from randomly initialized weights. However, the results on the AMI dataset show the power of transfer learning and of utilizing an already optimized base model.
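The option of adjusting only some layers of the base model could look roughly like the following Keras sketch. It was not part of the experiments in this thesis. The function name `build_partially_frozen_resnet50`, the choice of 10 trainable layers and the input shape are assumptions made for illustration. The sketch uses `weights=None` so that it runs without downloading anything; passing `weights="imagenet"` would load the pretrained ImageNet weights instead.

```python
import numpy as np
import tensorflow as tf


def build_partially_frozen_resnet50(num_trainable_layers=10, weights=None):
    """Build a ResNet50 base model where only the last layers are trainable.

    weights=None gives random initialization; weights="imagenet" would load
    the pretrained ImageNet weights provided by Keras.
    """
    base = tf.keras.applications.ResNet50(
        weights=weights,
        include_top=False,           # drop the ImageNet classification head
        input_shape=(224, 224, 3),
        pooling="avg",               # one feature vector per input image
    )
    base.trainable = True
    # Freeze every layer except the last `num_trainable_layers`.
    for layer in base.layers[: len(base.layers) - num_trainable_layers]:
        layer.trainable = False
    return base


def count_params(variables):
    """Total number of scalar parameters in a list of weight variables."""
    return int(sum(np.prod(tuple(v.shape)) for v in variables))


base = build_partially_frozen_resnet50()
trainable = count_params(base.trainable_weights)
frozen = count_params(base.non_trainable_weights)
print(f"trainable: {trainable}, frozen: {frozen}")
```

Only the parameters counted as trainable would be updated by the optimizer, which is the reduction in parameters to optimize mentioned above.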
6.1 Shuffling the data
Shuffling the data resulted in a higher validation accuracy for the CatsVsDogs and Cifar-10 datasets. Shuffling the AMI datasets also resulted in a higher accuracy compared to the unshuffled datasets. The reason why shuffling is so successful might be the order in which the data is presented to the model. Without shuffling, the model would train on the first class, then the second class, and so on. The model weights would then approach a local minimum for the first class and then one for the second class. Shuffling the data might therefore increase the probability of minimizing the cost function without getting stuck in a local minimum.
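A minimal sketch of such shuffling is shown below: features and labels are reordered with the same random permutation, so that the classes are interleaved while every sample keeps its label. The function name `shuffle_together` and the toy data are assumptions for this example only.

```python
import numpy as np


def shuffle_together(features, labels, seed=0):
    """Shuffle features and labels with one shared random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    return features[order], labels[order]


# Toy data sorted by class: samples 0-9 are class 0, 10-19 class 1, 20-29 class 2.
features = np.arange(30).reshape(30, 1)
labels = features[:, 0] // 10

shuffled_features, shuffled_labels = shuffle_together(features, labels)
print(shuffled_labels)  # classes now appear in mixed order
```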
Fig. 33 shows that the order of the features affects the training. The feature order affects not only the end result but also the course of the training: the training curves of the two models in Fig. 33 differ.
6.2 Audio data compared to video data
The audio spectrogram does not depend on the environment in the same way as video. The images generated from the audio are not affected by the different camera angles, nor by people behaving in a way that does not fit the class. Both camera angles share the same audio, so in this respect audio seems more beneficial than video. Audio also has its drawbacks, however: the same audio pattern can be recorded for entirely different visual behavior. Presentation and meeting could therefore be interpreted as the same class by a model that is given only audio data, which might be the case in Fig. 29 and Fig. 32. The problem with using only audio might thus be that the presentation class is similar enough to the meeting class for the model to mix the two up. In the audio-only part of Fig. 32, the model does seem to confuse the classes meeting and empty. There is not enough data for the presentation class to say anything about the classification pattern between presentation and the other two classes. The validation accuracy when using only audio is around 70% for the dense models in Fig. 27. For the LSTM model in Fig. 30, audio alone has a validation accuracy of around 50%. The datasets for audio are small but