
Linköping University | Department of Electrical Engineering

Master's thesis, 30 ECTS | Datateknik

2020 | LIU-ISY/LITH-EX-A--20/5316--SE

Automatic gait recognition

using deep learning

Automatisk gångstilsigenkänning

Martin Persson

Supervisor: Karl Holmquist
Examiner: Jörgen Ahlberg

Linköpings universitet
SE-581 83 Linköping


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Recent improvements in pose estimation have opened up the possibility of new areas of application. One of them is gait recognition, the task of identifying persons based on their unique style of walking, which is increasingly being recognized as an important method of biometric identification.

This thesis has explored the possibilities of using a pose estimation system, OpenPose, together with deep Recurrent Neural Networks (RNNs) in order to see if there is sufficient information in sequences of 2D poses to use for gait recognition. For this to be possible, a new multi-camera dataset consisting of persons walking on a treadmill was gathered, dubbed the FOI dataset.

The results show that this approach has some promise. It achieved an overall classification accuracy of 95.5% on classes it had seen during training and 83.8% for classes it had not seen during training. It was unable to recognize sequences from angles it had not seen during training, however. For that to be possible, more data pre-processing will likely be required.


Acknowledgments

I would like to thank the staff and my fellow master thesis students at FOI for making my stay there a pleasant and informative one, in particular my FOI supervisor Henrik Petersson who has provided me with help, guidance and support over the course of the project. I would also like to thank my examiner Jörgen Ahlberg and my LiU supervisor Karl Holmquist for their help in matters relating to both machine learning theory and the structure of the project.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
1.4 Delimitations

2 Background and theory
2.1 Automatic gait recognition
2.2 Pose estimation
2.3 Datasets in the public domain
2.4 Neural networks
2.5 Deep learning
2.6 K-nearest neighbours
2.7 Evaluation metrics
2.8 Related work

3 Method
3.1 Resources
3.2 Datasets
3.3 Classification of known persons
3.4 Classification of unseen angles
3.5 Classification of unknown persons
3.6 Evaluation

4 Results
4.1 TUM-GAID Dataset
4.2 FOI Dataset

5 Discussion
5.1 Results
5.2 Method
5.3 The work in a wider context

6.1 Future work


List of Figures

2.1 An image showing a false detection of a partial human being in the pattern of the curtains, alongside a correct detection of a person walking on a treadmill.
2.2 OpenPose output 2D keypoints overlaid on top of an image from the TUM-GAID database. The output itself only contains the 2D keypoints, shown as colored circles in the image. The limbs connecting the keypoints only exist for visualization purposes.
2.3 A simple neural network with one internal hidden layer and three hidden units. The hidden units, or neurons, sum up the values of the input layer scaled by the weights of the different connections. The activation functions each output an activation depending on the value of the sum and pass that activation onward to the output layer.
2.4 A simple recurrent neural network with an input layer x, one internal hidden layer h consisting of three hidden units and an output layer o. The later outputs benefit from both current and past information.
2.5 A bidirectional recurrent neural network. Outputs benefit from current, past and future information because of the g units passing information backward and the h units passing information forward.
2.6 The internal structure of an LSTM cell.
3.1 An example image from the TUM-GAID database [12]
3.2 Another example image from the TUM-GAID database [12]
3.3 The FOI dataset data collection experiment setup. The different camera positions are highlighted by the red circles.
3.4 The view of the front camera of the FOI dataset.
3.5 The view of the front-side camera mounted in the ceiling of the FOI dataset.
3.6 The view of the side camera of the FOI dataset.
3.7 The view of the side-back camera of the FOI dataset.
3.8 The view of the back camera of the FOI dataset.


List of Tables

3.1 The hyperparameters used when training the network.
4.1 Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages.
4.2 Confusion matrix.
4.3 Precision, recall and f1-score for individual classes using a kNN classifier as well as their weighted and unweighted (macro) averages for classes not seen during training.
4.4 Confusion matrix for classes not seen during training.
4.5 Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for front angle sequences.
4.6 Confusion matrix for front angle sequences.
4.7 Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for front-side angle sequences.
4.8 Confusion matrix for front-side angles.
4.9 Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for side angle sequences.
4.10 Confusion matrix for side angles.
4.11 Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for side-back angle sequences.
4.12 Confusion matrix for side-back angles.
4.13 Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for back angle sequences.
4.14 Confusion matrix for back angles.


1

Introduction

In recent years, surveillance cameras and other technical evidence have become increasingly important tools in criminal investigations. Many countries have invested significantly into developing and deploying closed-circuit television (CCTV) surveillance systems. The adoption of such systems has, particularly when combined with different types of law enforcement interventions, resulted in substantial decreases in crime [26].

While the mere presence of CCTV systems can be enough to deter potential criminal activity, criminal investigators and prosecutors would benefit immensely from also being able to identify suspects caught on tape. Currently, this is generally achieved by looking at the face or the clothes worn by the persons the camera is observing, but this method has the obvious drawback of being susceptible to a change of clothes or masking of the face. Therefore, developing an alternative method of analysing and identifying potential suspects is highly desirable [25].

In the search for useful alternative biometric identifiers, gait is considered an interesting characteristic to analyze for several reasons. Unlike fingerprints or DNA, gait data can be collected from surveillance cameras at a distance without the person of interest being aware of the fact. Unlike other video-based means of identification like facial or clothing recognition, gait is difficult to obscure, not dependent on high resolution images and is thought to be relatively constant and unique [22, 21]. Therefore, an automatic gait recognition framework would be of great use in keeping track of and identifying persons of interest fleeing a crime scene or frequently located near protected property. The goal of this thesis is to implement such a framework using recent developments in deep learning in order to recognize persons based on their gait and extract a metric that can be used to compare arbitrary gait sequences.

1.1

Motivation

Traditionally, gait recognition has been performed by utilizing hand-crafted features manually extracted from a video sequence that has been pre-processed in some manner, commonly by extracting the silhouette or the skeletal structure of the walking person of interest [33]. Previous approaches have typically utilized hand-crafted features like angles between limbs and joints [2] or movement parameters like speed and stride length [36]. Additionally, some sort of feature selection is typically performed on the often high-dimensional input feature space.


The features are then classified using some shallow machine learning technique, like Support Vector Machines or Naive Bayes classifiers.

Recent developments in the field of deep learning, a subset of machine learning techniques using multi-level neural networks, have opened new possibilities. Namely, deep neural networks can learn more abstract feature descriptions and capture information about the topic that shallow or hand-crafted features might miss [29]. In particular, the field of deep metric learning, where the deep neural network learns some alternative representation of the input data that better separates different classes from each other, seems interesting for the purposes of automatic gait detection. In addition, recent developments in the field of pose estimation have provided a promising way of extracting gait information.

1.2

Aim

The underlying purpose of the thesis project is to investigate the possibilities of using deep learning for the purpose of identifying walking persons based on their gait. Specifically, it aims to investigate if it is possible to do so for 2D features that can be extracted from a regular camera in an unobtrusive manner without requiring any additional sensors.

1.3

Research questions

This thesis has implemented a framework for automatic gait recognition that utilizes OpenPose to extract poses from image sequences. A deep neural network was then trained on gait sequences labeled with the identity of the person walking and then used to predict the identity of the corresponding persons for unseen sequences.

1. Can a deep neural network trained on 2D keypoint sequences be used for person recognition?

Previous studies [7] have used 3D keypoint sequences extracted from motion capture devices for the purposes of gait recognition. In real life scenarios, the persons being observed by surveillance cameras will not be wearing motion capture sensors. Therefore, it would be preferable if gait recognition could be performed on 2D data that can be extracted from a regular surveillance camera.

2. Can a deep neural network trained for person recognition be used to identify persons not seen during training?

It is practically infeasible to train a neural network with data for every single potential person of interest. For this reason, it is interesting to explore how a deep neural network trained for gait recognition on data for some persons can be used to recognize other persons unseen by the network during training.

3. How robust is the network to changes in camera angles?

Since a person of interest may have been captured on several different cameras in different positions, it would be interesting to study how constant the features learned by the neural network are to changes in camera angle.

1.4

Delimitations

This study will only consider model-based gait recognition, i.e. the use of a skeleton keypoint representation of walking persons instead of silhouettes or other alternative means of representation.


2

Background and theory

In this chapter background information about relevant technologies and datasets will be presented, as well as some previous work done within the field of gait recognition.

2.1

Automatic gait recognition

Automatic gait recognition is the task of identifying a walking person of interest based on the unique gait of the person in question. In comparison to other means of identifying persons, the use of gait as a biometric identifier has several benefits. Compared to facial recognition, gait is harder to conceal and not as dependent on high resolution images. Compared to identifiers such as fingerprints or DNA, gait information can be collected at a distance in an unobtrusive manner without the person of interest being aware of the fact.

Generic automatic gait recognition frameworks are considered to consist of five different steps [33]:

1. Acquisition

This step involves the capture of video data containing human walking. In this thesis project, data from the TUM-GAID database [12] as well as data collected specifically for the purposes of the thesis was used. Details regarding the datasets used can be found in section 2.3.

2. Pre-Processing

This step involves extracting the relevant gait segment of the video from the background. There are two main approaches, model based and model free. Model based approaches use skeletal representations of walking persons while model free approaches typically use their silhouette. In this thesis the OpenPose real-time pose estimation system [3] was used to extract a stick figure representation of walking persons and thus a model based approach was used. That system is described in more detail in section 2.2.

3. Feature Extraction

This step involves extracting useful features from the representation obtained in the previous step. Previous approaches have typically utilized hand-crafted features like angles between limbs and joints [2] or movement parameters like speed and stride length


[36]. This thesis intends to explore the possibility of using deep learning instead. Deep learning is described in section 2.5. In 2018, Coskun et al. [7] employed deep metric learning on gait sequences captured using motion capture sensors to learn a distance metric for keypoint sequences and identify persons and actions, achieving state-of-the-art performance. Typically, 3D joint-angle features have been used, but since OpenPose by default extracts 2D coordinates, this was not done in this thesis work. Also, in practice it would be desirable to be able to perform gait analysis on 2D features, since regular surveillance cameras would have a much easier time extracting such features compared to 3D features.

4. Feature Selection

This step involves reducing the dimensionality of the features. This was not done manually in this thesis project. Instead, it was left to the deep neural network.

5. Classification

This step involves classifying a gait sequence as belonging to some class. In this thesis the classes used were the persons appearing in the datasets, but it could also be the mode of walking, e.g. walking slowly or running.

Gait Energy Image

Most state-of-the-art gait recognition approaches utilize or are inspired by a way of representing gait known as a Gait Energy Image (GEI). GEIs present a way to represent a gait sequence composed of several 2D stills as a single 2D image, without losing the temporal information the sequence contains. A GEI G is calculated from a sequence of size normalized and aligned 2D silhouettes B according to the following formula:

G(x, y) = \frac{1}{N} \sum_{t=1}^{N} B_t(x, y)    (2.1)

where x and y are 2D coordinates, N is the total number of frames in the sequence and t is the index of an individual frame. GEIs were developed by Han and Bhanu in the 2006 study "Individual Recognition Using Gait Energy Image" [11].
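As a concrete illustration, the sketch below computes Eq. (2.1) with NumPy. It assumes the silhouettes have already been size normalized and aligned, as the definition requires; the function name is illustrative.

    import numpy as np

    def gait_energy_image(silhouettes):
        # silhouettes: array of shape (N, H, W) holding N aligned,
        # size-normalized binary silhouettes B_t with values in {0, 1}.
        silhouettes = np.asarray(silhouettes, dtype=np.float64)
        # Eq. (2.1): average the silhouettes over the time axis.
        return silhouettes.mean(axis=0)

The resulting image is bright where the body is present in most frames, such as the torso, and dim where it moves, such as the limbs, which is how the temporal information survives the collapse to a single image.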

2.2

Pose estimation

Human pose estimation is the problem of discovering the location and orientation of the different joints and parts of the human body in an image or a sequence of images. The detected body parts are typically represented as the coordinates of important skeletal joints. In this thesis project, pose estimation was utilized to extract sequences of such coordinates for use as inputs to a deep learning neural network architecture. Pose estimation was not the main field of study, and therefore existing solutions with readily available code were considered the most interesting for the purposes of this thesis project.

Contemporary methods of doing pose estimation are commonly divided into top-down and bottom-up approaches [14]. A top-down approach starts by first detecting each individual in an image and then estimating their pose. A bottom-up approach does it the other way around and starts by detecting the joints and other relevant features in an image, and then assigning them to an individual. Advantages of the top-down version include their ability to capture global context and strong structural information [14], but they are worse at handling complex poses and occlusion than bottom-up models. Bottom-up models on the other hand do not utilize human structural information which makes them prone to detect humans where there are none, as exemplified in Figure 2.1. The OpenPose framework that has been used in this thesis project utilizes a bottom-up approach.


Figure 2.1: An image showing a false detection of a partial human being in the pattern of the curtains, alongside a correct detection of a person walking on a treadmill.

OpenPose

The most promising pose estimation system with regard to the nature and aim of this thesis project was OpenPose [3]. OpenPose is a 2D pose estimation system capable of detecting and tracking the poses of multiple subjects in the same image simultaneously at a respectable framerate (20-30 frames per second). The framework can handle both image and video inputs. By default, OpenPose outputs 2D coordinates for 25 different skeletal joints and other potentially interesting human features, e.g. ears. Additionally, it has the potential to output 3D coordinates if multiple views of the same scene from different cameras are available and a camera calibration is performed.


Figure 2.2: OpenPose output 2D keypoints overlaid on top of an image from the TUM-GAID database. The output itself only contains the 2D keypoints, shown as colored circles in the image. The limbs connecting the keypoints only exist for visualization purposes.

The OpenPose pipeline takes as an input a color image and outputs the 2D coordinates of select keypoints for all the persons found in the image [3]. It utilizes a bottom-up approach to the problem of pose estimation. As the first step of the OpenPose pipeline, a feed-forward neural network predicts a confidence map of the locations of the keypoints as well as a part affinity field (PAF) for each human limb. A limb in this context means a part of the body located between a pair of OpenPose keypoints. PAFs are 2D vector fields for the limbs [4]. The part affinity itself is a 2D vector that for each pixel belonging to a certain limb points in the direction from the keypoint at one end of the limb to the keypoint at the other end. The PAFs can therefore keep track of both the position and the orientation of human body parts. Using the PAFs, keypoints belonging to the same person are then paired together. This makes it easier to estimate the poses of multiple persons standing close to each other.
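For reference, OpenPose can write its detections to one JSON file per frame, and the keypoints can then be read back as plain arrays. The sketch below follows the field names of OpenPose's documented JSON output for the BODY_25 model; the single-person filter mirrors the pre-processing described in chapter 3.

    import json
    import numpy as np

    def load_openpose_frame(path):
        # Parse one OpenPose JSON output file into a (25, 3) array of
        # (x, y, confidence) rows, one per BODY_25 keypoint.
        with open(path) as f:
            frame = json.load(f)
        people = frame["people"]
        if len(people) != 1:
            # No detection, or a spurious extra detection: discard the frame.
            return None
        flat = people[0]["pose_keypoints_2d"]  # [x0, y0, c0, x1, y1, c1, ...]
        return np.array(flat).reshape(-1, 3)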

2.3

Datasets in the public domain

There are multiple datasets containing labeled sequences of human gait in existence. These datasets all differ in regard to the number of subjects involved, the number of sequences for each subject, the viewing angles and the availability of the dataset. A short summary of datasets used or considered for this thesis project will be presented in this section.

• TUM-GAID database [12]

The TUM-GAID database contains gait information about 305 different persons. 32 of these have 20 recorded gait sequences each and the rest have 10. It is one of the larger datasets in terms of number of subjects, but the image data only contains footage from one angle, 90° from the walking direction of the subject. This means the dataset does not simulate realistic conditions. In addition to video footage, the dataset also contains sound files and depth maps, which could be of interest for sensor-fusion based approaches. This thesis project only utilizes video data, however. The researchers at TUM responded to the request and delivered the data in a timely manner.

• CASIA Gait Database [13]

There are four different CASIA datasets. Dataset A contains footage of 20 different subjects, with each subject having 12 sequences. The sequences are captured from three different angles, parallel, 45° and 90°, with 4 sequences for each of these angles. Dataset B contains 124 subjects captured from 11 different angles between 0° and 180°. This is the dataset with the most view variation. Datasets C and D contain thermal footage and foot pressure readings, respectively, and are not of interest to this thesis project. The CASIA B dataset shows quite a bit of promise, but unfortunately the researchers responsible for the CASIA database did not respond to the request, so it is possible the dataset is no longer available.

• KS20 VisLab Multi-View Kinect skeleton dataset [1]

This dataset consists of 300 sequences of skeletal keypoints taken from 20 subjects from five different angles, 0°, 30°, 90°, 130° and 180°. The researchers responded to the request in a timely manner, but did not deliver any video data, only the keypoints themselves. Since the 3D keypoints are captured using Kinect motion capture, they are better than anything a regular surveillance camera could capture and thus make for an unrealistic scenario; the dataset was therefore not used.

• KinectREID dataset [28]

This dataset consists of 483 sequences of video data showing 71 subjects, with 7 videos for each subject, from three viewpoints: three near-frontal, three near-rear and one lateral. The researchers responded in a timely manner, but the files use the cumbersome .xed file extension, which means they can only be processed in Kinect Studio, a major disadvantage.

• The CMU Motion of Body (MoBo) Database [10]

This database consists of 25 subjects walking on a treadmill captured from six cameras at different angles. This sounds promising, but the responsible researchers did not respond to the request, so it is possible the dataset is no longer available.

One major shortcoming of all the existing datasets is the small number of sequences for each subject. Since deep learning models require large amounts of data to train properly, additional data was recorded specifically for this thesis project, in order to acquire significantly more sequences from a multitude of different viewpoints.

2.4

Neural networks

An artificial neural network is a learning system that is made up of interconnected nodes called neurons [19]. They were inspired by the biological workings of mammalian brains. The neurons are arranged into different layers, where each layer may have a different number of neurons, and the layers are connected to each other by unidirectional links between their neurons. Each link has a numerical weight that decides how the previous node affects the following. The nodes in turn have activation functions of varying types. These activation


functions send a signal to the next node of the network. The strength of that signal depends on the information passed from the previous nodes and the weights of the links. In this manner, a neural network is able to simulate complex non-linear functions. A simple neural network is depicted in Figure 2.3.

Supervised neural networks learn by comparing the outputs for a certain input value to the desired output of that input value, known as ground truth. The input features travel from the visible input layer through the hidden layers of the network to the visible output layer in a process known as forward propagation. The resulting output is compared to the ground truth, and a measure of how well the network has performed, called loss, is calculated according to a user-defined loss function. The weights are then updated by moving in the negative direction of the gradient of this loss function, a process known as gradient descent. The calculations are done using an efficient method known as backpropagation. The gradient descent algorithm is not guaranteed to find the global minimum of the loss function, only a local minimum.
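A minimal numerical example of these steps, using the small network of Figure 2.3 as a template; the tanh activations, squared-error loss and learning rate are illustrative assumptions, not choices made in the thesis.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(2, 3))   # input -> hidden weights
    W2 = rng.normal(size=(3, 1))   # hidden -> output weights
    x = np.array([[0.5, -1.0]])    # one training sample
    y = np.array([[1.0]])          # ground truth
    lr = 0.1                       # learning rate

    # Forward propagation.
    h = np.tanh(x @ W1)            # hidden activations
    o = h @ W2                     # network output
    loss = 0.5 * np.sum((o - y) ** 2)

    # Backpropagation: gradients of the loss with respect to the weights.
    d_o = o - y
    dW2 = h.T @ d_o
    d_h = (d_o @ W2.T) * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ d_h

    # One gradient descent step in the negative gradient direction.
    W1 -= lr * dW1
    W2 -= lr * dW2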

Figure 2.3: A simple neural network with one internal hidden layer and three hidden units. The hidden units, or neurons, sum up the values of the input layer scaled by the weights of the different connections. The activation functions each output an activation depending on the value of the sum and pass that activation onward to the output layer.

In order to make good predictions, a neural network needs to learn how to generalize. Even if it is capable of making good predictions for the data it is trained on, it might perform significantly worse for data it did not see during the training process. This phenomenon is known as overfitting.

2.5

Deep learning

A neural network consisting of many layers is known as a deep neural network. Deep neural networks are capable of learning more abstract representations of information than regular shallow neural networks due to their higher number of neuron layers, and have proven more capable than these for many different applications [29].

Recurrent Neural Networks

A recurrent neural network (RNN) is a neural network designed for handling sequential data [9]. Advantages of such networks in comparison to regular neural networks include their


ability to scale to much longer sequences and to handle input sequences of variable size. This is possible because RNNs share their weights across different time steps. An RNN takes as input a sequence of vector values x_1, ..., x_n, where the index stands for a point in time or a position in the sequence. A simple RNN is depicted in Figure 2.4.

Figure 2.4: A simple recurrent neural network with an input layer x, one internal hidden layer h consisting of three hidden units and an output layer o. The later outputs benefit from both current and past information.

Bidirectional RNN

When processing the input at time t, a regular RNN is only capable of considering the past values x_1, ..., x_{t-1} in addition to the current input x_t. Since relevant information may also be found in future values, Bidirectional RNNs were developed [9]. A Bidirectional RNN is made up of two regular RNNs, each processing the input sequence in a different order. One processes it from beginning to end and the other processes it from end to beginning. Therefore, the bidirectional RNN is capable of considering past, present and future sequence values when making predictions for a time step t. A bidirectional RNN is shown in Figure 2.5.


Figure 2.5: A bidirectional recurrent neural network. Outputs benefit from current, past and future information because of the g units passing information backward and the h units passing information forward.

Long Short-Term Memory

A long short-term memory (LSTM) is a type of RNN [9]. Ordinary RNNs typically suffer from the vanishing gradient problem. As the gradient propagates back through the earlier layers of the RNN, the values shrink and as a consequence, the earlier layers learn very little for longer input sequences. Instead of the simpler hidden units found in regular RNNs, LSTMs have cells with complex internal structures. The cells contain gates and self-loops that make it possible for the network to learn what information to keep and what to discard. The gates are sigmoid functions whose outputs decide what information to pass forward. There are three gates in a regular LSTM. First there is the input gate, which decides if the input features calculated by a regular neuron unit, also contained in the cell, can be added to the internal state of the cell. Then there is the forget gate, which decides how much of the previous internal state to remember. Finally, there is an output gate that decides what values of the internal state to output. Just like the neurons of regular neural networks have weights that are learned during the training process, the gates of an LSTM have weights that learn what information to keep and what information to forget. The internal structure of an LSTM cell is shown in Figure 2.6.
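The gate logic can be made concrete with a small sketch of one cell step; the stacked parameter layout below is one common convention and an assumption, not the internals of the Keras implementation used later in the thesis.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W: (4n, d), U: (4n, n), b: (4n,) hold the parameters of the
        # input (i), forget (f) and output (o) gates and the candidate
        # state (g) for a cell with n hidden units and input size d.
        n = h_prev.shape[0]
        z = W @ x + U @ h_prev + b
        i = sigmoid(z[0:n])          # input gate: admit new information?
        f = sigmoid(z[n:2*n])        # forget gate: keep old internal state?
        o = sigmoid(z[2*n:3*n])      # output gate: expose the state?
        g = np.tanh(z[3*n:4*n])      # candidate internal state
        c = f * c_prev + i * g       # new internal (cell) state
        h = o * np.tanh(c)           # new hidden state / output
        return h, c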


Figure 2.6: The internal structure of an LSTM cell.

Network optimization

This section details methods used to avoid overfitting and increase the performance of the network.

Batch normalization

Batch normalization is a means of adaptive reparametrization for a deep neural network [9]. When a gradient makes an update to some layer of the model, it does so under the assumption that no other layer of the model changes. This is not actually the case, however. As a consequence of this faulty assumption, unexpected results may occur. Neural networks are often trained on subsets of the entire training data set called minibatches. Batch normalization helps mitigate this problem by normalizing the activations of such a minibatch. If H is a minibatch of activations, its normalization H' is calculated by

H' = \frac{H - \mu}{\sigma}    (2.2)

where µ is a vector of means and σ is a vector of standard deviations. Batch normalization has been empirically observed to reduce overfitting and to improve the generalization ability of neural networks.
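Eq. (2.2) translates directly into code; the small epsilon below is a standard practical guard against division by zero and is not part of the equation itself.

    import numpy as np

    def batch_normalize(H, eps=1e-5):
        # H: minibatch of activations, one row per sample.
        mu = H.mean(axis=0)       # vector of per-feature means
        sigma = H.std(axis=0)     # vector of per-feature standard deviations
        return (H - mu) / (sigma + eps)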

Dropout

Applying dropout to a deep neural network means randomly selecting hidden units or input units of a neural network to be multiplied by zero, effectively removing them from the model [9]. This is done while training the network in order to reduce overfitting and make the network better at generalizing. Each such unit is removed with a user-defined probability, independently of each other. To compensate for the fact that some units were zeroed out


during training, the weights of the final network are scaled by the probability with which a unit was kept, once the training of the network is finished.

Hyperparameters

When training a neural network certain parameters like the neuron weights or the gate weights of an LSTM are learned through backpropagation. In addition to these, there are parameters involved in the training process that are not learned but must be chosen by the user. These parameters are called hyperparameters.

Minibatch size

The minibatch size hyperparameter decides how many samples of the training dataset to include in each update of the neural network weights [9]. A larger minibatch size usually results in a more accurate gradient approximation and a faster training runtime. The possible minibatch sizes are limited by the training hardware, however.

Learning rate

The learning rate hyperparameter decides how large steps the gradient descent algorithm should take. Too large values of this hyperparameter may result in the model not converging, while too small values may result in very slow rates of learning or risk the model getting stuck at poor local minima [9].

Weight decay

In order to avoid overfitting, it is possible to add a penalty factor λ to the weights of the model to discourage overly high values. If the value of this hyperparameter is too high, the weights may be penalized to such a degree that the model is unable to learn complex patterns and will underfit instead [9].

2.6

K-nearest neighbours

The k-nearest neighbours (kNN) algorithm is a simple supervised learning algorithm. When making predictions for some sample input x, the kNN algorithm looks at the k closest points in the training data set and classifies x as belonging to the majority class, with ties being decided randomly. The constant k is a user-defined parameter [9].
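As a small example, scikit-learn's KNeighborsClassifier implements this algorithm; the points and labels below are made up, and note that scikit-learn's tie-breaking differs from the random tie-breaking described above.

    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy training points
    y_train = [0, 0, 1, 1]                       # toy labels

    knn = KNeighborsClassifier(n_neighbors=3)    # k is user-defined
    knn.fit(X_train, y_train)
    print(knn.predict([[0.9, 0.2]]))  # -> [1]: two of the three nearest
                                      # neighbours belong to class 1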

2.7

Evaluation metrics

Predictions made by a neural network can be divided into four different categories. There are true positives (tp) and true negatives (tn), predictions that are correctly classified as either belonging to or not belonging to some class, false positives (fp), predictions that are incorrectly classified as belonging to some class, and false negatives (fn), predictions that are incorrectly classified as not belonging to some class. These categories are used to calculate a number of evaluation metrics. The ones used in this project are outlined below.

Accuracy = \frac{tp + tn}{tp + tn + fp + fn}    (2.3)

Accuracy measures how many of the predictions made were actually correct.

Precision = \frac{tp}{tp + fp}    (2.4)

Precision measures how many of the predictions that were made for some class actually belong to that class.

Recall = \frac{tp}{tp + fn}    (2.5)

Recall measures how many of all possible correct predictions were actually made.

F_1 = \frac{2 \times precision \times recall}{precision + recall}    (2.6)

The F1 score is the harmonic mean of precision and recall.

2.8

Related work

In this section, previous work relating to the current study will be presented.

Parametric elliptic Fourier descriptors for automated extraction of gait features for people identification

This 2015 study by Bouchrika [2] uses elliptic Fourier descriptors to extract joint positions from images of walking persons. From these positions, various features, like the angles between different limbs, are derived. The resulting feature vectors are classified using kNN classifiers.

Robust view-invariant multiscale gait recognition

In this 2015 study by Choudhury and Tjahjadi [6], a two stage process for gait recognition is proposed. In the first step, the angle of the image being analysed is determined by using principal component analysis and a Euclidean distance classifier, and in the second the image is compared to other images of the same view using multiscale shape analysis. This study represents image sequences as Gait Energy Images.

Human Gait Recognition using Deep Convolutional Neural Network

In this 2019 study, Nithyakani et al. use a type of deep neural network known as a convolutional neural network (CNN) to classify Gait Energy Images. It is trained on the TUM-GAID dataset, and is able to overcome the small number of sequences per person in that dataset because of the compact representation the GEI provides.

Gait Recognition Using Deep Convolutional Features

In this 2019 study, Min et al. also use CNNs to classify persons based on their gait. They do a more thorough evaluation of what activation functions to use than Nithyakani et al., and achieve better results, albeit on the larger CASIA-B gait dataset. This study also utilizes Gait Energy Images.


3

Method

This chapter will describe the frameworks and libraries used during the thesis project. It will also describe the datasets that were used and how they were processed. Finally, it will describe the experiments that were performed to answer the research questions.

3.1

Resources

In order to extract sequences of skeletal keypoints to use for deep learning, the OpenPose open source 2D pose estimation system was used [3]. OpenPose was the first pose estimation system researched for the purposes of the thesis work and proved robust and efficient enough to use in the thesis project. Additionally, it has an easily installable implementation available on GitHub along with an active community. Judging from available comparisons [23], none of the competitors could deliver any significant increase in performance or ease of use.

The machine learning aspects of the thesis work were done using the Keras API [16] with TensorFlow as the backend, an open source machine learning library primarily developed for the Python programming language [35]. Keras and TensorFlow were chosen over the main competitor, PyTorch [27], mainly because of the perceived ease of use. The neural networks used for the project were trained on a Nvidia GeForce RTX 2070 Super graphics card with 8 gigabytes of GPU memory. The GPU computing platform CUDA and the deep neural network library cuDNN by Nvidia were used to speed up the training process.

3.2

Datasets

The first gait dataset that was used is the TUM-GAID [12] database, described in more detail in chapter 2. Of all the datasets outlined there, the TUM-GAID database was chosen primarily because the researchers responsible for it were quick to answer the request and delivered the data in a usable format, but it also has the most subjects of any dataset considered for this thesis work. The gait image sequences of the database were processed by OpenPose to extract the skeletal keypoint sequences the neural networks used for the project were trained on.

Figure 3.1: An example image from the TUM-GAID database [12]

Figure 3.2: Another example image from the TUM-GAID database [12]

TUM-GAID dataset pre-processing

The pose keypoint coordinates generated by OpenPose are pixel coordinates using the lower left corner of the image being processed as the origin. Since the TUM-GAID database contains sequences of persons walking in both directions, the coordinates were normalized so that they were centered around the middle of the hip bone instead, since that keypoint is centrally located and relatively stationary compared to other joints, like those of the arms or legs.

The image sequences of the TUM-GAID database generally start with the person in question walking into the view of the camera from off-screen and continue until the person has walked out of view on the opposite side. This results in several frames where the person is only partially visible, and this makes it hard for OpenPose to extract a correct pose. To counteract this phenomenon, all frames where OpenPose was unable to detect the middle hip bone were discarded. The middle hip bone was selected for this purpose for the same reason it was selected for use as the origin, since it is centrally located in the body and since it is relatively stationary compared to the joints of the arms or the legs.

Additionally, since OpenPose occasionally detects inanimate objects as human beings and the dataset should contain no images featuring more than one person at a time, all frames where multiple persons are detected are automatically discarded. This is done to reduce the impact of such OpenPose errors.
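A sketch of the hip-centering step and the discarding of frames where the middle hip was not detected, assuming OpenPose's BODY_25 keypoint layout, in which the middle hip is keypoint index 8 and undetected keypoints are reported with zero confidence:

    import numpy as np

    MID_HIP = 8  # middle-hip index in the BODY_25 layout (assumed here)

    def center_on_hip(keypoints):
        # keypoints: (25, 3) array of (x, y, confidence) for one frame.
        if keypoints[MID_HIP, 2] == 0:
            return None  # hip not detected: caller discards the frame
        centered = keypoints.copy()
        centered[:, :2] -= keypoints[MID_HIP, :2]  # hip becomes the origin
        return centered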

Each keypoint sequence shorter than 144 frames, the length of the longest sequence in the dataset, was zero padded to be 144 frames long so the RNNs would receive a fixed-size input. This was desirable from a practical standpoint. Since the TUM-GAID database contains relatively few sequences per class and a large number of classes, the sequences were


split into shorter pieces. Both splitting them in two and splitting them in three were attempted. In the first case, all sequences longer than 60 frames were split in two and the sequences fed into the models were of length 60. Sequences shorter than 60 frames were zero padded to that length. In the second case all sequences longer than 40 frames were split into three and the sequences fed into the models were of length 40. Sequences shorter than 40 frames were zero padded to that length.
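One way the splitting and zero padding could be realized; the thesis does not spell out the exact splitting logic, so this sketch simply chunks a sequence into fixed-length pieces and pads the last one.

    import numpy as np

    def split_and_pad(seq, target_len):
        # seq: (frames, features) keypoint sequence; target_len: e.g. 60 or 40.
        pieces = [seq[i:i + target_len] for i in range(0, len(seq), target_len)]
        padded = []
        for p in pieces:
            pad = np.zeros((target_len - len(p), seq.shape[1]))
            padded.append(np.vstack([p, pad]))  # zero pad up to target_len
        return padded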

Data collection

The TUM-GAID database was the sole available dataset suitable for the purposes of this thesis. In order to counteract the flaws of that dataset, it was decided that additional data was to be collected on site at FOI. This was done by requisitioning one of the treadmills from the FOI gym and recording video data of persons walking on it.

Experiment Setup

The cameras used for data collection were Panasonic cameras of the model "HDC-SD700", which could record footage at 1080p resolution and 50 frames per second. The cameras had been fitted into 3D-printed shells and connected to an Arduino chip. This made it possible to start and stop all cameras simultaneously. In addition, they were equipped with XLR connectors that could be used to connect them to a GPS unit. This unit sends an audible synchronization signal to all cameras, which could be used to align the frames from the different cameras.

The five available cameras were placed at five different angles in a half-circle surrounding a treadmill. Each camera was placed 45° apart from the others, with the first camera facing toward the front of the treadmill and the last one facing toward the back of the treadmill. Four of the cameras were placed on tripods and one of them was mounted to a support beam in the ceiling. The experiment setup is depicted in Figure 3.3.


Figure 3.3: The FOI dataset data collection experiment setup. The different camera positions are highlighted by the red circles.

There were 10 participants in the study, seven males and three females. Approximately 10 minutes of data were collected for each participant and angle.


Figure 3.4: The view of the front camera of the FOI dataset.

Figure 3.5: The view of the front-side camera mounted in the ceiling of the FOI dataset.

Figure 3.6: The view of the side camera of the FOI dataset.

Figure 3.7: The view of the side-back camera of the FOI dataset.

Figure 3.8: The view of the back camera of the FOI dataset.

FOI dataset pre-processing

All keypoint sequences longer than the shortest one, 30500 frames, were cut down to that length in order for the sequences to have the same length. Since the recorded videos were relatively similar in length, with the only differences arising from slight differences in the cameras' response times to the remote control and in how quickly the stop button was pressed after the 10 minute alarm had rung, this did not result in the loss of very many frames. In contrast, the shortest sequence of the TUM-GAID dataset was 17 frames long and the longest 144 frames, more than eight times as long.

The 30000 frame long sequences were then split into shorter sequences of 350 frames. This resulted in a total of 85 sequences per person and angle, significantly more than for the TUM-GAID database. Just like for the TUM-GAID dataset, the coordinates were normalized to be centered around the middle hip bone and frames where multiple persons were detected were discarded.

3.3

Classification of known persons

The method used for gait recognition was an LSTM based classifier. The problem to solve was a multiclass classification problem where a single label is to be predicted for each sequence, with 305 labels to choose from in the TUM-GAID dataset and 10 labels to choose from in the FOI dataset. The model was trained on 80% of the sequences for each class, tested on 16% and validated on the remaining 4%.

For certain deep learning problems, there are existing neural network architectures that have already proven themselves effective, e.g. AlexNet [17] for image classification. Since there were no pre-existing models designed to solve multiclass classification problems using keypoint sequences, one was created based on the model described in Coskun et al. [7]. The model used for the classification was a stacked bidirectional LSTM model, consisting of four bidirectional LSTM layers stacked on top of one another, each followed by a batch normalization layer and the final LSTM followed by a dropout layer as well. The final layer is a dense


layer with a softmax activation function that outputs probabilities for the different classes, 10 in the case of the FOI dataset and 305 in the case of the TUM-GAID dataset. The model was trained for 40 epochs in all experiments performed using the FOI dataset and for 150 epochs in the experiments using the TUM-GAID dataset. Hyperparameter values were set to the values shown in Table 3.1.

Batch size    Learning rate    Dropout rate    Weight decay
128           0.001            0.5             0.001

Table 3.1: The hyperparameters used when training the network.
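A Keras sketch of the described architecture follows. The function name and layer width, the optimizer and the loss function are illustrative assumptions not stated in the thesis, and the weight decay from Table 3.1 is realized here as an L2 kernel regularizer.

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    def build_classifier(seq_len, n_features, n_classes):
        l2 = regularizers.l2(0.001)  # weight decay, Table 3.1
        model = keras.Sequential()
        model.add(keras.Input(shape=(seq_len, n_features)))
        for i in range(4):  # four stacked bidirectional LSTM layers
            model.add(layers.Bidirectional(layers.LSTM(
                128,  # layer width: an illustrative assumption
                return_sequences=(i < 3),  # last layer emits one vector
                kernel_regularizer=l2)))
            model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.5))  # dropout rate, Table 3.1
        model.add(layers.Dense(n_classes, activation="softmax"))
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model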


3.4

Classification of unseen angles

In order to find out how robust the model was to changes in perspective, the sequences pertaining to one angle were left out while training the model. It was then evaluated on those sequences alone.

3.5

Classification of unknown persons

In a practical scenario, the persons that are to be identified are unlikely to be persons the network has seen during training. It would be much more practical if the network could be used to identify unseen individuals as well. To see if the network was capable of this, the final softmax layer of the model was removed. The resulting model outputs embedding vectors describing the input sequences instead of class probabilities. This model was trained using only the first five persons in the dataset. Once the model had been trained, embedding vectors were extracted for the five remaining classes, the ones the model had not seen during training. The vectors were then used to train a kNN classifier with a k value of 1. This k value was determined to be the best through a grid search of k values ranging from 1 to 25.
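A sketch of this procedure, reusing the hypothetical build_classifier from section 3.3; the input sizes and the placeholder arrays standing in for sequences of the unseen persons are illustrative only.

    import numpy as np
    from tensorflow import keras
    from sklearn.neighbors import KNeighborsClassifier

    model = build_classifier(seq_len=350, n_features=50, n_classes=5)
    # ... train model on the five known persons here, then drop the softmax
    # layer so it outputs embedding vectors instead of class probabilities:
    embedder = keras.Model(inputs=model.input,
                           outputs=model.layers[-2].output)

    # Placeholder arrays standing in for sequences of the unseen persons.
    X_gallery = np.random.rand(20, 350, 50)     # labeled reference sequences
    y_gallery = np.repeat(np.arange(5, 10), 4)  # their identities
    X_probe = np.random.rand(5, 350, 50)        # sequences to identify

    knn = KNeighborsClassifier(n_neighbors=1)   # k = 1, found via grid search
    knn.fit(embedder.predict(X_gallery), y_gallery)
    predictions = knn.predict(embedder.predict(X_probe))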

3.6

Evaluation

The model was evaluated by calculating the following metrics:

1. Accuracy for the entire test set
2. Precision, recall and f1-score for each individual class
3. Macro average precision, recall and f1-score calculated over all individual classes
4. Weighted average precision, recall and f1-score calculated over all individual classes

In the case of the classification for unseen persons, the metrics were calculated for the class predictions made by the kNN classifier. For the other experiments, they were calculated for the class predictions made by the network itself.
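All of the listed metrics are available in scikit-learn, so the evaluation could be carried out as in this sketch; y_true and y_pred are placeholder label arrays.

    from sklearn.metrics import accuracy_score, classification_report

    y_true = [0, 0, 1, 1, 2, 2]  # placeholder ground-truth labels
    y_pred = [0, 1, 1, 1, 2, 0]  # placeholder predicted labels

    print(accuracy_score(y_true, y_pred))               # metric 1
    # Metrics 2-4: per-class precision, recall and f1-score, plus their
    # macro and weighted averages.
    print(classification_report(y_true, y_pred, digits=3))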


4

Results

In this chapter the results from the experiments performed on the FOI and the TUM-GAID datasets are presented. For the FOI dataset, this includes classification using a model trained and evaluated on all angles and classes, a model trained on all angles and half of the classes and evaluated on all angles and the remaining classes, and a model trained on all classes but only four angles and evaluated on all classes and the fifth angle. For the TUM-GAID dataset, only the first scenario is included.

4.1

TUM-GAID Dataset

The confusion matrix and classification report for the TUM-GAID dataset are very large, owing to the 305 classes it contains. Therefore, not all results are presented here. The overall accuracy of the classifier trained and evaluated on data containing all persons and angles on the TUM-GAID dataset was 59.8 % on test data. The performance varies widely between the different classes, however. The best performing class, number 103, has a precision, recall and f1-score of 1.0 while the worst performing class has a precision, recall and f1-score of 0.0. Due to the significantly worse results obtained for the TUM-GAID dataset compared to those of the FOI dataset, the remaining experiments were only performed on the FOI dataset.


4.2

FOI Dataset

The results presented in this section were obtained from the experiments performed on the FOI dataset.

Classification

In Table 4.1, the classification results obtained from the LSTM model trained and evaluated on all subjects and angles are presented.

              precision    recall    f1-score    support
0             0.953        0.967     0.960       275
1             0.975        0.982     0.978       277
2             0.965        0.975     0.970       284
3             0.948        0.951     0.949       266
4             0.962        0.958     0.960       263
5             0.968        0.910     0.938       266
6             0.940        0.961     0.951       311
7             0.959        0.959     0.959       294
8             0.939        0.939     0.939       294
9             0.948        0.948     0.948       270
accuracy                             0.955       2800
macro avg     0.956        0.955     0.955       2800
weighted avg  0.955        0.955     0.955       2800

Table 4.1: Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages.

In Table 4.2 the confusion matrix showing how the different subjects were classified is presented.

       0    1    2    3    4    5    6    7    8    9
0    266    0    1    1    1    0    6    0    0    0
1      4  272    0    0    1    0    0    0    0    0
2      0    0  277    0    0    2    5    0    0    0
3      0    0    0  253    0    0    0    1   12    0
4      0    0    0    5  252    0    0    0    5    1
5      3    5    8    1    0  242    3    2    0    2
6      2    2    0    0    0    0  299    6    1    1
7      0    0    1    0    0    6    0  282    0    5
8      0    0    0    6    7    0    0    0  276    5
9      4    0    0    1    1    0    5    3    0  256

Table 4.2: Confusion matrix.


Classification for unseen classes

These results were obtained by training the network on the first five classes and evaluating it on the remaining five classes using the embeddings of the second-to-final layer of the network and a kNN classifier.

              precision    recall    f1-score    support
5             0.727        0.738     0.732       332
6             0.815        0.803     0.809       356
7             0.845        0.884     0.864       345
8             0.929        0.896     0.913       367
9             0.870        0.863     0.867       350
accuracy                             0.838       1750
macro avg     0.837        0.837     0.837       1750
weighted avg  0.839        0.838     0.839       1750

Table 4.3: Precision, recall and f1-score for individual classes using a kNN classifier as well as their weighted and unweighted (macro) averages for classes not seen during training.

       5    6    7    8    9
5    245   28   28    9   22
6     40  288   14    7    7
7     16   16  305    4    4
8      9   15    3  329   11
9     25    5   14    5  301

Table 4.4: Confusion matrix for classes not seen during training.


Classification for unseen angles

The results in this subsection were obtained by training the network on four of the available angles and evaluating it on sequences of the fifth unseen angle. The tables in the subsection show the results for all five combinations of such a split.

Unseen front angle

              precision    recall    f1-score    support
0             0.000        0.000     0.000       279
1             0.350        0.463     0.399       281
2             0.345        0.874     0.494       277
3             0.128        0.643     0.214       272
4             0.000        0.000     0.000       279
5             0.000        0.000     0.000       279
6             0.000        0.000     0.000       282
7             0.000        0.000     0.000       291
8             0.480        0.594     0.531       278
9             0.000        0.000     0.000       282
accuracy                             0.254       2800
macro avg     0.130        0.257     0.164       2800
weighted avg  0.129        0.254     0.162       2800

Table 4.5: Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for front angle sequences.

       0    1    2    3    4    5    6    7    8    9
0      0    8    0  265    0    0    0    0    6    0
1      0  130    2  136    0    0    0    0   13    0
2      1   10  242    9    0    0   15    0    0    0
3      0   24   73  175    0    0    0    0    0    0
4      5    5    1  221    0    0    0    0   47    0
5      0  153  100   22    0    0    0    0    4    0
6      0   32  141  109    0    0    0    0    0    0
7      0    2  143  146    0    0    0    0    0    0
8      0    1    0  112    0    0    0    0  165    0
9      0    6    0  167    0    0    0    0  109    0

Table 4.6: Confusion matrix for front angle sequences.


Unseen front-side angle

              precision    recall    f1-score    support
0             0.000        0.000     0.000       279
1             0.000        0.000     0.000       281
2             0.000        0.000     0.000       277
3             0.000        0.000     0.000       272
4             0.358        0.989     0.525       279
5             0.244        0.928     0.386       279
6             0.000        0.000     0.000       282
7             0.000        0.000     0.000       291
8             0.271        0.903     0.417       278
9             0.000        0.000     0.000       282
accuracy                             0.281       2800
macro avg     0.087        0.282     0.133       2800
weighted avg  0.087        0.281     0.132       2800

Table 4.7: Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for front-side angle sequences.

       0    1    2    3    4    5    6    7    8    9
0      0    0    0    0  151    6    0    0  122    0
1      0    0    0    0   90   49    0    7  135    0
2      0    0    0    0    6  253    0   18    0    0
3      0    0    0    0   93   20    0    8  151    0
4      0    0    0    0  276    0    0    1    1    1
5      0    0    0    0    3  259    0    0   17    0
6      0    0    0    0    1  276    0    0    5    0
7      0    0    0    0    3  194    0    0   94    0
8      0    0    0    0   23    2    0    2  251    0
9      0    0    0    0  126    3    0    2  151    0

Table 4.8: Confusion matrix for front-side angles.


Unseen side angle

              precision    recall    f1-score    support
0             0.308        0.147     0.199       279
1             0.397        0.786     0.527       281
2             0.234        0.975     0.377       277
3             0.352        0.342     0.347       272
4             0.700        0.810     0.751       279
5             0.116        0.036     0.055       279
6             1.000        0.014     0.028       282
7             0.006        0.003     0.004       291
8             0.231        0.011     0.021       278
9             0.010        0.004     0.005       282
accuracy                             0.311       2800
macro avg     0.335        0.313     0.231       2800
weighted avg  0.334        0.311     0.230       2800

Table 4.9: Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for side angle sequences.

       0    1    2    3    4    5    6    7    8    9
0     41   11  135    2    0    0    0    0    0   90
1     18  221    9   15    1   15    0    0    0    2
2      0    1  270    0    0    0    0    6    0    0
3     46   71   27   93   24    6    0    0    4    1
4      3    8    5   34  226    1    0    1    1    0
5      2   17  173    0   15   10    0   57    5    0
6      1    4  123    1    4   42    4  101    0    2
7      0    0  289    0    1    0    0    1    0    0
8      8  100    0  113   47    7    0    0    3    0
9     14  124  125    6    5    5    0    2    0    1

Table 4.10: Confusion matrix for side angles.


Unseen side-back angle

              precision    recall    f1-score    support
0             0.000        0.000     0.000       279
1             0.200        0.028     0.050       281
2             0.690        0.538     0.604       277
3             0.333        0.051     0.089       272
4             0.185        0.541     0.276       279
5             0.314        0.685     0.431       279
6             0.787        0.379     0.512       282
7             0.000        0.000     0.000       291
8             0.184        0.025     0.044       278
9             0.276        0.840     0.416       282
accuracy                             0.309       2800
macro avg     0.297        0.309     0.242       2800
weighted avg  0.296        0.309     0.242       2800

Table 4.11: Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for side-back angle sequences.

      0    1    2    3    4    5    6    7    8    9
0     0    1    0    0  149    0    0    0    0  129
1    26    8    0    5   88    0    1    0   22  131
2     0    0  149    0    2  122    1    3    0    0
3     3    6    0   14  148    0    1    0    6   94
4     1    0    0    1  151    0    0    0    0  126
5     0    3   65    2    0  191   17    0    0    1
6     0    1    2    5   37   18  107    9    0  103
7     0    0    0    0    0  276    8    0    0    7
8     1    1    0    7  232    0    0    0    7   30
9     3   20    0    8    9    1    1    0    3  237

Table 4.12: Confusion matrix for side-back angle sequences (rows: true class, columns: predicted class).


Unseen back angle

             precision  recall  f1-score  support

0                0.218   0.943     0.354      279
1                0.016   0.014     0.015      281
2                0.929   0.993     0.960      277
3                0.000   0.000     0.000      272
4                0.249   0.892     0.389      279
5                0.000   0.000     0.000      279
6                0.000   0.000     0.000      282
7                0.000   0.000     0.000      291
8                0.000   0.000     0.000      278
9                0.171   0.021     0.038      282

accuracy                           0.285     2800
macro avg        0.158   0.286     0.176     2800
weighted avg     0.157   0.285     0.174     2800

Table 4.13: Precision, recall and f1-score for individual classes as well as their weighted and unweighted (macro) averages for back angle sequences.

      0    1    2    3    4    5    6    7    8    9
0   263   13    2    0    1    0    0    0    0    0
1   161    4    0    0  113    0    0    0    0    3
2     1    0  275    0    1    0    0    0    0    0
3    17   19    0    0  231    0    0    0    0    5
4     7    6    2    2  249    0    0    0    4    9
5   154  100    9    0   15    0    0    0    0    1
6   243   18    4    0   14    0    0    0    0    3
7   135    1    2    0  153    0    0    0    0    0
8    29   28    0    2  211    0    0    0    0    8
9   195   65    2    0   14    0    0    0    0    6

Table 4.14: Confusion matrix for back angle sequences (rows: true class, columns: predicted class).
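The per-class tables and confusion matrices in this chapter follow the layout of scikit-learn's standard metric functions. As a minimal sketch (assuming that library; the thesis does not state which tooling produced the tables), numbers in this format can be generated from lists of true and predicted labels:

from sklearn.metrics import classification_report, confusion_matrix

# Illustrative labels only; in the thesis each entry would be the true
# person identity and the LSTM's predicted identity for one sequence.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Per-class precision, recall and f1-score plus macro and weighted
# averages, in the same layout as the tables above.
print(classification_report(y_true, y_pred, digits=3))

# Rows are true classes and columns are predicted classes, as in the
# confusion matrices above.
print(confusion_matrix(y_true, y_pred))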


5 Discussion

This chapter contains an analysis of the method and the results, as well as a comparison with state-of-the-art results for gait recognition. Additionally, it contains a discussion about the possibility of practical usage of the LSTM classifier and the ethical concerns surrounding the field of gait recognition and mass surveillance in general.

5.1 Results

Classification

The results achieved for classification when all persons and angles were available to the neural network are comparable to non-deep learning state-of-the-art results like the ones presented in the studies by Bouchrika [2] and Choudhury et al. [6], as well as those achieved using deep convolutional neural networks in the studies by Nithyakani et al. [24] and Min et al. [20]. They all report accuracy scores in the mid to high nineties in their respective publications. This suggests that a keypoint-LSTM based approach is a viable alternative to these solutions, although it is beaten by the best of them by 4.5 %. However, the keypoint-LSTM approach also seems to be significantly more data-dependent than any of the other approaches. The CNN approach used by Nithyakani et al. performed slightly worse than the classifier developed in this thesis when the latter was trained on the FOI dataset, 94.7 % accuracy vs. 95.5 %, but that CNN was trained on the significantly smaller TUM-GAID dataset. In comparison, the LSTM classifier was only able to achieve 60 % accuracy on the TUM-GAID dataset, and training it on fewer classes from that dataset did not improve performance. The fact that even the other deep learning approaches are so much less reliant on large amounts of data is likely due to the effectiveness of the Gait Energy Image and its ability to represent an entire gait sequence with a single image. The FOI dataset also has fewer classes than any of the other available gait datasets, a factor that should be considered when comparing results.

Classification of unseen persons

The results obtained when classifying unseen persons suggest that it is indeed possible to use the trained model to identify persons it has not seen during training, which indicates that there is quite a bit of potential in the use of LSTM models for gait recognition. Ideally, such a model would not be a classifier trained to output class probabilities, but a deep metric learning [15] network that outputs a new representation of gait sequences, which can then be used with a kNN classifier for recognition purposes.
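As a rough illustration of that idea, the sketch below identifies unseen persons by nearest-neighbour search in an embedding space. The embed function is a hypothetical stand-in for the trained network (for example its final LSTM hidden state); all names, dimensions and the choice of k are assumptions for illustration, not the thesis implementation:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the trained network: maps a keypoint sequence
# of shape (frames, coordinates) to a fixed-size embedding vector. A real
# system would return an intermediate-layer activation of the LSTM here.
def embed(sequence):
    return sequence.mean(axis=0)

rng = np.random.default_rng(0)

# Gallery: a couple of labelled example sequences per unseen person
# (random placeholders standing in for OpenPose keypoint sequences).
gallery_sequences = [rng.normal(size=(100, 50)) for _ in range(20)]
gallery_labels = [i % 10 for i in range(20)]  # 10 persons, 2 sequences each

gallery_X = np.stack([embed(s) for s in gallery_sequences])
gallery_y = np.array(gallery_labels)

# A probe sequence is identified by finding its nearest neighbour
# among the gallery embeddings.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(gallery_X, gallery_y)
probe = embed(rng.normal(size=(100, 50)))[np.newaxis, :]
print(knn.predict(probe)[0])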

Classification of unseen angles

The results obtained when classifying unseen angles were not very impressive. The features learned by the model are evidently not viewpoint invariant. If a view-invariant LSTM model is to be developed, better data or additional data pre-processing is likely required. One possibility would be to use 3D coordinates instead of 2D. This is possible with OpenPose if multiple cameras are used and they are calibrated properly. However, the reliance on multiple cameras makes that approach less likely to be usable under realistic conditions, except at high-security locales like airports or nuclear power plants.
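To illustrate what such multi-camera 3D extraction would involve, the sketch below triangulates matching 2D keypoints from two calibrated views into 3D points with OpenCV. The projection matrices and keypoint coordinates are placeholder values; in practice they would come from camera calibration and from OpenPose detections of the same joints in both views:

import numpy as np
import cv2

# 3x4 projection matrices for two calibrated cameras (placeholders; real
# values come from intrinsic/extrinsic calibration).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
R, _ = cv2.Rodrigues(np.array([0.0, 0.3, 0.0]))  # second camera rotated
t = np.array([[-1.0], [0.0], [0.0]])             # and translated sideways
P2 = np.hstack([R, t])

# Matching 2D keypoints in the two views, shape (2, N): one column per
# joint, rows are x and y image coordinates.
pts1 = np.array([[0.10, 0.30], [0.20, 0.40]])
pts2 = np.array([[0.12, 0.29], [0.21, 0.41]])

# Triangulate to homogeneous coordinates, then normalise to 3D.
points_h = cv2.triangulatePoints(P1, P2, pts1, pts2)
points_3d = (points_h[:3] / points_h[3]).T
print(points_3d)  # one (X, Y, Z) row per joint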

5.2 Method

In this section, the method used during the project will be discussed and evaluated.

Data collection

The decision to record additional data for the project was made relatively late during the work on the thesis. Because of this, there was only time for a limited number of recordings. If data collection had begun at the start of the project, more classes could have been included in the dataset. Additionally, interesting topics could have been investigated, such as how clothing variation, lighting conditions, or whether the person is carrying something affect accuracy. It might also have been possible to extract 3D keypoints instead of 2D, which might have improved the results of the unseen angle classification.

The fact that the persons participating in the study were walking on a treadmill instead of on the ground is also something to consider. The decision was made primarily based on the ease of the experimental setup compared to outdoor free walking, but many of the subjects felt that their gait was artificial on the treadmill compared to their ordinary style of walking. If those impressions are correct, then the results obtained in this thesis might not transfer directly to regular walking. Fortunately, biomechanical studies indicate that the differences are relatively few [18]. Still, an automatic gait recognition system for practical use ought to be trained on regular walking instead of treadmill walking for maximum reliability.

Classification

While the classifier was useful for determining whether it was feasible to use keypoint sequences for the task of gait identification, it is of limited interest if practical applicability is to be considered. Only being able to make predictions for predefined classes is a major disadvantage and would make it necessary to train the network on an enormous number of classes for it to be used in practice. This downside is mitigated somewhat by the fact that the classifier, trained on only a few classes, could be used to classify sequences of persons it had not seen with reduced but still respectable accuracy. Of course, this is not the ideal approach. The best way would be to use a model designed explicitly to learn embeddings, i.e. a deep metric learning model. This thesis has demonstrated that even a simple way of learning embeddings can achieve respectable results, which shows great promise for the future application of deep metric learning to the task of gait recognition, as do recent applications such as the one presented in the study by Su et al. [34], published as this thesis was being finished.
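As a sketch of what such a deep metric learning model could look like, the snippet below trains an LSTM-based embedding network with a triplet loss, which pulls sequences from the same person together and pushes different persons apart in embedding space. The framework (PyTorch), architecture and dimensions are illustrative assumptions, not the model used in this thesis:

import torch
import torch.nn as nn

# Illustrative embedding network: an LSTM over 2D keypoint sequences whose
# final hidden state is projected to a fixed-size embedding.
class GaitEmbedder(nn.Module):
    def __init__(self, n_coords=50, hidden_size=128, emb_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(n_coords, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, emb_dim)

    def forward(self, x):           # x: (batch, frames, n_coords)
        _, (h_n, _) = self.lstm(x)  # final hidden state per sequence
        return self.proj(h_n[-1])

model = GaitEmbedder()
loss_fn = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# anchor/positive: two sequences from the same person; negative: a
# sequence from a different person (random placeholders here).
anchor, positive, negative = (torch.randn(8, 100, 50) for _ in range(3))

loss = loss_fn(model(anchor), model(positive), model(negative))
loss.backward()
optimizer.step()  # one training step; a real run would loop over batches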


The results of the classification could probably have been improved if some sort of hyperparameter optimization, like grid search or random search, had been used on the model trained on the FOI dataset. As it stands, a grid search was only done when the model was being trained on the TUM-GAID dataset, and the resulting best hyperparameters were reused when the model was trained on the FOI dataset. Since the resulting performance of the model trained on the FOI dataset was deemed sufficiently good to demonstrate the usability of the chosen approach, an additional grid search was not performed.
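For reference, a grid search of the kind mentioned here amounts to exhaustively trying every combination of candidate hyperparameters and keeping the one with the best validation score. A minimal sketch; the parameter names, value ranges and the train_and_validate helper are made up for illustration and do not reflect the grid actually searched:

from itertools import product

def train_and_validate(hidden_units, learning_rate, dropout):
    # Hypothetical stand-in: a real implementation would train the LSTM
    # classifier with these settings and return its validation accuracy.
    return 0.9 - abs(learning_rate - 1e-3) - 0.01 * dropout

# Candidate values; the grid used in the thesis is not reproduced here.
grid = {
    "hidden_units": [64, 128, 256],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "dropout": [0.0, 0.2, 0.5],
}

best_score, best_params = float("-inf"), None
for hidden, lr, drop in product(*grid.values()):
    score = train_and_validate(hidden, lr, drop)
    if score > best_score:
        best_score, best_params = score, (hidden, lr, drop)

print(best_params, best_score)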

Source criticism

Most of the theoretical information about the different deep learning approaches used in the study was based on the book Deep Learning by Goodfellow et al. [9]. The book is very well cited, with almost 17000 citations on Google Scholar [31]. The authors of the book, Ian Goodfellow, Yoshua Bengio and Aaron Courville, are all well-respected authors within the field of deep learning with tens of thousands of citations each [31] [32] [30]. The scientific articles and conference papers cited in the report were gathered from scientific databases such as Google Scholar, IEEE Xplore, ResearchGate and arXiv. Most recent work on gait recognition has received relatively few citations, likely due to the somewhat niche character of the field. This should not be taken to mean that it lacks credibility. One of the primary influences for the approach taken in this thesis project is "Human Motion Analysis with Deep Metric Learning" by Coskun et al. [7]. It has been cited a mere 16 times since 2018, but among its authors is Nassir Navab, Professor at the Technische Universität München, which adds a measure of credibility to the study.

5.3 The work in a wider context

In this section, the possible impact automatic gait recognition might have on society and how it should be handled will be discussed.

Ethical concerns

Although camera surveillance can be a useful tool for tracking down suspects or deterring potential criminals, there are a number of concerns regarding its use. Indiscriminate camera surveillance in public places makes it possible for those in control of the surveillance system to map out and monitor the lives and behaviour of criminals and the general public alike. For example, the Chinese government has rolled out a program called "social credit", which involves mass surveillance in combination with a punishment and reward system in order to reinforce behaviour deemed good [5]. China has also displayed an interest in the use of gait recognition, likely as a part of the social credit system [8]. When it comes to camera surveillance done by private actors like security companies, it is theoretically possible for the local government to place restrictions on what data the private actors are allowed to store and what they are allowed to use it for, but the matter is not so simple when it comes to sovereign nation states. The very definition of who is a criminal worthy of being watched depends entirely on the legislative authority where the surveillance is conducted, and might in places include political opponents or oppressed minorities. The means available to the international community for preventing such abuse are extremely limited. Indeed, there is no guarantee that the international community itself has any better standards of judgement.

Since there are so many ethical pitfalls involved in the topic of surveillance and privacy, any researcher whose work relates to such matters needs to be aware of what consequences their work might have. The question of what responsibility the scientific community has for the application of its work for ethically questionable purposes is one that should always be on the mind of any researcher or scientist. Is it morally justifiable for a researcher to make their discoveries public if the risks and consequences of misuse are too severe? Is it correct for scientists to make decisions that might dramatically impact the future of society, or should those decisions be left to political leaders who can be held accountable to the public? Answering these questions once and for all falls far outside the scope of this report, but they are nevertheless questions of paramount importance that ought to be discussed more frequently. They are not just questions for the scientific community alone; ideally, they ought to be discussed by scientists, ethical philosophers, politicians and members of the general public together.


6 Conclusion

This study evaluates the possibility of using existing pose estimation systems in conjunction with deep learning techniques in order to implement an automatic gait recognition framework. Such a framework could be of great use for law-enforcement agencies in their quest to apprehend criminals and prevent crime. However, care must be taken when additional surveillance is concerned. It is important that such measures are implemented in a way that respects the privacy of the general public. Gathering and using biometric information can give governments additional power and control over their citizens and subjects, so it is important that any further usage of such methods is accompanied by an honest discussion of the possible consequences it might have.

The evaluation was realized by using the OpenPose pose estimation system to capture gait sequences and then using these sequences to train an LSTM classifier that predicts which person a given gait sequence belongs to. The classifier achieved close to state-of-the-art performance when predicting on sequences belonging to persons it had seen during training. When predicting on sequences for persons it had not seen before, performance decreased. The classifier was not able to predict sequences from angles it had not seen during training with anything resembling good performance.

In order to acquire sufficient amounts of data to train the classifier properly, a dataset of subjects walking on a treadmill was created. In comparison to existing datasets, it has significantly more data per test subject and a lower number of test subjects.

Now, to answer the research questions previously posed:

• Can a deep neural network trained on 2D keypoint sequences be used for person recog-nition?

Yes, with an overall accuracy of 95.5 %.

• Can a deep neural network trained for person recognition be used to identify persons not seen during training?

Yes, by extracting their embeddings and classifying them using a kNN classifier. The overall accuracy is 83.8 %.

• How robust is the network to changes in camera angles?

The network is not robust to changes in camera angle; more work will be required before it can handle angles not seen during training.

References
