
LiU-ITN-TEK-A--21/052-SE

Evaluation of Methods for Person Re-identification between Non-overlapping Surveillance Cameras

Henrik Nilsson

Thesis work carried out in Medieteknik at Tekniska högskolan at Linköpings universitet.

Norrköping, 2021-06-23

Department of Science and Technology (Institutionen för teknik och naturvetenskap)
Linköping University, SE-601 74 Norrköping, Sweden

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Henrik Nilsson

ABSTRACT

This thesis compares several state-of-the-art methods for re-identifying a person across non-overlapping views captured by surveillance cameras. Since 2014, the field of person re-identification has been heavily oriented towards approaches employing neural networks, due to the increase in performance this approach has shown. Three methods employing convolutional neural networks for automatic person re-identification are the main focus of the evaluation in this thesis: Spatial-Temporal Person Re-identification (ST-ReID), Top DropBlock Network (Top-DB-Net), and Adaptive L2 Regularization. A fourth method, the Multiple Expert Brainstorming Network (MEB-Net), which uses domain adaptation, serves as a point of comparison when the trained models from the other three methods are applied to an unseen environment. Two approaches are explored for improving the results on an unseen environment. The first segments the person from the background by creating a mask that encapsulates the person while disregarding the background, instead of using a rectangular cropped image for training and evaluating the methods; Mask-RCNN, a framework for object instance segmentation, is used for this purpose. The second approach applies automatic white balancing to remove the effect of the illumination source of the scenes before the person images are extracted. Both approaches show positive results when a model is applied to an unseen environment compared to using the unchanged person images, although the results do not match those obtained using domain adaptation.


Acknowledgments

I would like to thank the people at FOI for providing me with the opportunity to do this thesis work, especially my supervisor Erik Valldor, who has always been available when needed while giving me a great deal of freedom in choosing my own path for the thesis. I would also like to thank my examiner Sasan Gooran and my supervisor Daniel Nyström at Linköping University for always responding rapidly when I have had questions or concerns. Lastly, I would like to extend my gratitude to the other students working on their own theses at FOI for their good company and many rewarding discussions.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research Questions
  1.4 Delimitations

2 Theory
  2.1 Technical Background
  2.2 Person Re-identification History

3 Method
  3.1 Evaluated Methods
  3.2 Evaluation Metrics
  3.3 Datasets
  3.4 Approaches To Improving The Result In An Unknown Environment

4 Results
  4.1 Results Of The Different Methods
  4.2 Results After Applying Mask-RCNN
  4.3 Results After Applying Automatic White Balancing

5 Discussion
  5.1 Results
  5.2 Method
  5.3 The Work In A Wider Context

6 Conclusion
  6.1 Research Questions
  6.2 Future Work

A CMC Plots Of Models Trained On The Unchanged Datasets
  A.1 Models Trained On Images From The Market-1501 Dataset
  A.2 Models Trained On Images From The DukeMTMC-reID Dataset
  A.3 Models Trained On Images From The CUHK03 Dataset

B CMC Plots Of Models Trained On Masked Images
  B.1 Models Trained On Images From The Market-1501 Dataset
  B.2 Models Trained On Images From The DukeMTMC-reID Dataset
  B.3 Models Trained On Images From The CUHK03 Dataset

C CMC Plots Of Models Trained On White Balanced Images
  C.1 Top DropBlock Model Trained On White Balanced Images
  C.2 Adaptive L2 Regularization Model Trained On White Balanced Images

Bibliography

List of Figures

2.1  The architecture of a 3-layered artificial neural network, with 3 neurons belonging to the input layer, 5 neurons belonging to the hidden layer, and 2 neurons belonging to the output layer.
2.2  Shortcut connection of a residual network.
3.1  (a) Camera topology of DukeMTMC-reID. (b) The spatial-temporal distribution of images contained in the DukeMTMC-reID dataset.
3.2  The architecture of the Top DropBlock Network (Top-DB-Net).
3.3  The original image with its color histogram along with the corrected image with its histogram, from left to right.
5.1  An example of incorrect classification showcasing the Top-DB-Net model focusing too much on background features. The leftmost image is the query image, followed by the top 5 ranked matches in descending order.
5.2  The same query image as in 5.1 correctly classified by the Top-DB-Net model trained on the DukeMTMC-reID dataset.
5.3  The same query image as in 5.1 correctly classified by the Top-DB-Net model trained on the DukeMTMC-reID dataset.
5.4  (a) Original image. (b) Gray World algorithm applied. (c) White Patch algorithm applied. (d) Histogram stretch algorithm applied. (e) Deep-WB approach applied.
5.5  (a) Original image. (b) Gray World algorithm applied. (c) White Patch algorithm applied. (d) Histogram stretch algorithm applied. (e) Deep-WB approach applied.
A.1  Ranks 1 to 20 of models trained on images from the Market-1501 dataset and evaluated on images from the Market-1501 dataset.
A.2  Ranks 1 to 20 of models trained on images from the Market-1501 dataset and evaluated on images from the DukeMTMC-reID dataset.
A.3  Ranks 1 to 20 of models trained on images from the Market-1501 dataset and evaluated on images from the CUHK03 dataset.
A.4  Ranks 1 to 20 of models trained on images from the DukeMTMC-reID dataset and evaluated on images from the Market-1501 dataset.
A.5  Ranks 1 to 20 of models trained on images from the DukeMTMC-reID dataset and evaluated on images from the DukeMTMC-reID dataset.
A.6  Ranks 1 to 20 of models trained on images from the DukeMTMC-reID dataset and evaluated on images from the CUHK03 dataset.
A.7  Ranks 1 to 20 of models trained on images from the CUHK03 dataset and evaluated on images from the Market-1501 dataset.
A.8  Ranks 1 to 20 of models trained on images from the CUHK03 dataset and evaluated on images from the DukeMTMC-reID dataset.
A.9  Ranks 1 to 20 of models trained on images from the CUHK03 dataset and evaluated on images from the CUHK03 dataset.
B.1  Ranks 1 to 20 of models trained on masked images from the Market-1501 dataset and evaluated on masked images from the Market-1501 dataset.
B.2  Ranks 1 to 20 of models trained on masked images from the Market-1501 dataset and evaluated on masked images from the DukeMTMC-reID dataset.
B.3  Ranks 1 to 20 of models trained on masked images from the Market-1501 dataset and evaluated on masked images from the CUHK03 dataset.
B.4  Ranks 1 to 20 of models trained on masked images from the DukeMTMC-reID dataset and evaluated on masked images from the Market-1501 dataset.
B.5  Ranks 1 to 20 of models trained on masked images from the DukeMTMC-reID dataset and evaluated on masked images from the DukeMTMC-reID dataset.
B.6  Ranks 1 to 20 of models trained on masked images from the DukeMTMC-reID dataset and evaluated on masked images from the CUHK03 dataset.
B.7  Ranks 1 to 20 of models trained on masked images from the CUHK03 dataset and evaluated on masked images from the Market-1501 dataset.
B.8  Ranks 1 to 20 of models trained on masked images from the CUHK03 dataset and evaluated on masked images from the DukeMTMC-reID dataset.
B.9  Ranks 1 to 20 of models trained on masked images from the CUHK03 dataset and evaluated on masked images from the CUHK03 dataset.
C.1  Ranks 1 to 20 of the Top DropBlock model trained on white balanced images from the PRW dataset and evaluated on white balanced images from the PRW dataset.
C.2  Ranks 1 to 20 of the Top DropBlock model trained on white balanced images from the PRW dataset and evaluated on white balanced images from the Saivt-SoftBio dataset.
C.3  Ranks 1 to 20 of the Adaptive L2 Regularization model trained on white balanced images from the PRW dataset and evaluated on white balanced images from the PRW dataset.
C.4  Ranks 1 to 20 of the Adaptive L2 Regularization model trained on white balanced images from the PRW dataset and evaluated on white balanced images from the Saivt-SoftBio dataset.


List of Tables

4.1  Results of models trained on images from the Market-1501 dataset.
4.2  Results of models trained on images from the DukeMTMC-reID dataset.
4.3  Results of models trained on images from the CUHK03 dataset.
4.4  Results of models trained on segmented person images from the Market-1501 dataset.
4.5  Results of models trained on segmented person images from the DukeMTMC-reID dataset.
4.6  Results of models trained on segmented person images from the CUHK03 dataset.
4.7  Results of the evaluation of the Top-DB-Net model trained on white balanced images from the PRW dataset.
4.8  Results of the evaluation of the Adaptive L2 Regularization model trained on white balanced images from the PRW dataset.


1 Introduction

When a crime has been committed in an area containing surveillance cameras, it is often the case that the cameras do not overlap. When this is the case, it can be difficult to manually track the offender throughout the area where the crime has been committed. Being able to automatically recognize an offender across several areas surveilled by different cameras can therefore be an advantage when trying to find the specific person that has committed the crime. Successfully doing this might lead to a better representation of the offender, and therefore help to more accurately track the offender's movements in the surveilled area.

Attempting automatic re-identification of a person observed in multiple camera views removes the cost of having a person do this manually, and can give more consistent results compared to manual re-identification, which will vary depending on the person performing the task. However, automatic re-identification introduces a number of problems that would not have to be considered as heavily in manual re-identification. Firstly, different camera views will likely not be illuminated the same way, giving varying pixel values for the same person. A person that is partially occluded will be harder to match to a person that is not occluded in a different view. The viewing angle also alters the representation of a person, which can make it problematic to match a person seen from different angles in different camera views. Background clutter is another factor that can alter the representation of a person, as an automatic system could easily confuse the background as being part of the person.

Besides factors such as these, which will vary depending on the surveilled areas, the surveillance cameras themselves might not be of the same model, meaning they would differ in their representation of the scene. All of these problems can be easily bypassed in manual re-identification, as a person viewing video footage from different cameras could easily find the same person in different views even though the views contain, for example, a large difference in illumination. The image quality of surveillance cameras is often too low to make use of highly detailed areas such as a person's face, and the typical approach is instead to look at other factors that are not as detailed and use them to represent a person for comparison purposes.

1.1 Motivation

Determining automatically that a person visible in different scenes is indeed the same person can be broken down into smaller segments used to build a representation of a person, which can then be matched to another representation to find how similar they are to each other. Stating that two people in two different camera views are the same person simply because they are both wearing a blue shirt would not be enough, but using this fact in combination with other similarities could be enough to determine that the person in both images is likely the same.

The field of automatic person re-identification, from here on referred to as person re-ID, has seen much improvement since the term was first introduced in 2005 [42]. Proposed methods for person re-ID have consistently shown better results since then; however, the problem of person re-ID is still far from solved. Most proposed methods are only tested in a closed-world scenario using databases that have been specifically produced for person re-ID. This means the actual recognition of what is a person in the scene does not have to be considered, as these databases already contain separate images of the people observed, together with their true matches showing which people are in fact the same. Many popular databases are furthermore based on the assumption that every person is always observed in at least two camera views, which does not match reality. Databases are constructed this way for training and evaluation purposes, allowing easier comparison of different methods focusing only on the re-identification of a person; nevertheless, additional factors have to be considered if a method is to be used in an open-world scenario. Further on, long-term person re-ID, where a person previously seen several days or months earlier is to be re-identified, is far from as well studied as short-term person re-ID. Short-term person re-ID methods commonly compare descriptors such as the texture and color of a person's clothing, which is not a viable strategy when attempting long-term person re-ID, as other descriptors that do not change with time instead need to be compared.

1.2 Aim

The aim of this thesis is to compare different person re-ID methods that have previously been presented. Most state-of-the-art person re-ID methods are not evaluated on the same datasets, and produce varying results depending on which dataset the method is evaluated on. This makes comparing different methods hard, and there is interest in a more thorough evaluation of methods on the same datasets so that the results of these methods can be compared to each other. Methods based on deep learning have been proven to greatly outperform methods using hand-crafted features, and all state-of-the-art methods at the time of this report use neural networks to attempt efficient automatic person re-ID. Training neural networks is, however, a time-consuming but necessary process. In the optimal case, a system for person re-ID should be able to be set up in a new environment where it can re-identify people between different camera views instantly, without requiring further training. A solution allowing this does not yet exist, and several approaches exploring how such a solution could be possible will be explored in this thesis.

1.3 Research Questions

1. Which currently existing methods for person re-identification have the highest accuracy on popular re-identification datasets?
2. How does the accuracy of the methods change when applied on different datasets obtained from different environments?
3. What approaches can be taken to improve the result of applying a method on a previously unknown environment?

1.4 Delimitations

A longer training time for a certain method is a small sacrifice if the method achieves greater results; furthermore, the possibility of using a trained model will be dependent on the hardware of the machine handling the computations. The computation time of the methods evaluated in this thesis will therefore not be considered. Secondly, as the goal of this thesis work is to compare different methods for person re-identification, any methods of interest need to have their source code publicly available. If no source code is available, the method will not be investigated further.


2 Theory

This chapter is split into two main parts: a Technical Background, which explains many of the concepts used in methods attempting automatic person re-ID, and a section on Person Re-identification History, which details earlier works within the area of person re-ID.

2.1 Technical Background

Since all state-of-the-art methods attempting automatic person re-ID at the time of this report employ neural networks, this section details how different types of neural networks commonly used for image classification are constructed, and how they can be used for automatic re-identification of a person.

Artificial Neural Networks

To be able to understand many of the methods explained in this chapter, it is crucial that the underlying concepts are understood. This section therefore explains the concept of Artificial Neural Networks (ANNs). ANNs are the basis of deep learning, which is a subfield of machine learning. The idea of ANNs is to mimic the way the human brain works. The human brain contains a multitude of biological neurons that each activate in specific situations, generating an electrochemical impulse that is sent off to other neurons.

Each neuron receiving that impulse might in turn be activated, depending on what other impulses that specific neuron receives. Everything that happens within the body depends on which combination of neurons is activated; for example, if you touch something very hot with your hand, certain neurons will activate, making you move your hand. If you got burned, other neurons will activate, making you feel pain. The brain contains close to 90 billion neurons, together forming a complex network.

ANNs try to mimic this behaviour by creating layers, where each layer consists of a multitude of artificial neurons, from here on simply referred to as neurons. Every ANN contains three different types of layers: an input layer, one or several hidden layers, and an output layer. The neurons belonging to these layers have connections to the neurons of the adjacent layers. A fully connected layer is a layer where each neuron belonging to that layer is connected to every neuron of the adjacent layers. The neurons of the input layer receive some sort of data; this can be the pixel values describing an image, the sound frequencies of some audio, or any other type of data. That data is then sent from the neurons in the input layer to the connected neurons in the first hidden layer, where each neuron will have an output depending on what inputs it has received. Those outputs are again sent to the neurons of the next layer. This process is repeated until the data has reached the output layer. A simplistic view of the architecture of an ANN with 3 layers can be seen in Figure 2.1.

Figure 2.1: The architecture of a 3-layered artificial neural network, with 3 neurons belonging to the input layer, 5 neurons belonging to the hidden layer, and 2 neurons belonging to the output layer.

More specifically, each neuron belonging to a layer (excluding the input layer) receives an input multiplied by a weight from each of the connected neurons in the previous layer, and these weighted inputs are added up. That sum is then added together with a bias and fed into an activation function to form the output of that specific neuron. Weights and biases are together often referred to as the parameters of a network.

The weight is a learnable factor meant to define the importance of the neuron that it belongs to, and the bias is a constant that is added to the sum of the weighted inputs in order to shift the activation function to the left or the right. The activation function scales the output of the neuron, and can be constructed in a number of ways. It could for example be a step function that sets the output to 1 if the input is above a set threshold, and to 0 otherwise. A commonly used activation function is the rectified linear activation function (ReLU), which sets the output equal to the input as long as the input is positive. If the input is negative, the output is instead set to 0. Mathematically, the output value of a neuron is shown in Equation 2.1, where the inputs gained from neurons in the previous layer are denoted by x_i, their respective weights are denoted by w_i, and the activation function is denoted by φ.

$$\mathrm{Output} = \phi\Big(\sum_{i=1}^{n} w_i x_i + \mathrm{bias}\Big) \qquad (2.1)$$

In order for an ANN to give a good result for any task, it first has to be trained on the type of data it is meant to process. This is done by adjusting the weights w_i so that each neuron influences the output of the ANN optimally. The output of the ANN is compared to a desired output through a loss function, which represents the difference between the expected output and the current output as a single number, where a number close to 0 means that the current output is close to the expected output. This loss function can be constructed in several ways, where one of the more intuitive approaches is the mean squared error seen in Equation 2.2, where y_i denotes the desired output, ŷ_i denotes the current output, and n is the number of samples used to train the network.

$$L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.2)$$

Training an ANN is done by attempting to minimize the value of this loss function, which is done through an algorithm known as gradient descent. Based on the gradients of the loss function, the weights w_i are updated in an attempt to output a value closer to the desired output, as per Equation 2.3. In Equation 2.3, α is known as the learning rate, and simply decides how much the weight should be updated depending on the gradient. Setting a lower learning rate will require training the network for more iterations, but if the learning rate is set too high the weights will be updated too quickly, which might bring them further from their optimal values.

$$w_i^{+} = w_i - \alpha \frac{\partial L}{\partial w_i} \qquad (2.3)$$
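To make Equations 2.1-2.3 concrete, the short sketch below computes a single neuron's output, a squared-error loss, and one gradient descent step in plain NumPy. The input values, weights, and learning rate are made-up toy numbers; the sketch only illustrates the formulas above and is not part of any evaluated method.

```python
import numpy as np

def relu(z):
    # Rectified linear activation: returns z for positive inputs, 0 otherwise.
    return np.maximum(0.0, z)

# Toy data: one input vector, its target output, and initial parameters.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.1, 0.4, 0.2])
bias = 0.05
target = 1.0
learning_rate = 0.01

# Equation 2.1: the neuron's output is the activation of the weighted sum plus the bias.
pre_activation = np.dot(weights, inputs) + bias
output = relu(pre_activation)

# One term of the mean squared error in Equation 2.2.
loss = (target - output) ** 2

# Gradient of the loss with respect to each weight (chain rule);
# the ReLU derivative is 1 where the pre-activation is positive, else 0.
relu_grad = 1.0 if pre_activation > 0 else 0.0
dloss_dw = -2.0 * (target - output) * relu_grad * inputs

# Equation 2.3: move each weight a small step against its gradient.
weights = weights - learning_rate * dloss_dw
print(loss, weights)
```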

In order to update the weights as per Equation 2.3, the gradients of the loss function with respect to the individual weights have to be found. This is done through the backpropagation algorithm, in which the gradients are calculated one layer at a time using the chain rule, starting from the last hidden layer of the ANN. After updating all the weights, the output of the ANN should be closer to the desired output. The weights are continuously updated this way until the error (the result of the loss function) approaches zero. The weights are initialized with some value, which could be the same for all weights or just a random number.

By repeating the process explained above a large number of times, a network gets better at the task it is supposed to be used for. Sending the same data point continuously to the network will however only train the network on that specific data point, which is not a viable approach. Training an ANN is therefore typically done on a large amount of training data which is sent to the ANN iteratively. When all of the training data has been sent through the ANN and the weights have been updated accordingly, one epoch has been completed. ANNs are trained for a set number of epochs, which intuitively means the weights have to be updated a large number of times. A common way to speed up this process is by sending multiple data samples at a time into the ANN. The batch of samples sent into the ANN is known as a mini-batch and is typically of size 2^n, where n is a positive integer. If an ANN is trained for too many epochs, it tends to become too closely tied to the data it is training on, and will struggle when used on other similar data, which is the opposite of what is wanted. This is known as overfitting a model, and can be avoided through a number of approaches besides manually finding a good number of epochs.

Although the concept of ANNs was first introduced in 1943 by McCulloch and Pitts [23], they have been used much more within the past two decades. This is mainly due to hardware constantly becoming better at efficiently handling the vast amount of computations needed, and the fact that the amount of publicly available training data has become much larger through the production of datasets such as ImageNet [6], introduced in 2009 and containing more than 14 million labelled images.

Convolutional Neural Networks

ANNs can be subdivided into several different types that tend to be used for different types of tasks. One of these is the Convolutional Neural Network (CNN), which is a very popular choice when the input data are in the form of images. A CNN contains at least one convolutional layer, which uses filters defined as N-by-M matrices containing the values of the weights. Each filter operates on an N-by-M segment of the image by performing a convolution operation in order to output a single value for that area. The filter repeats this process for other N-by-M areas of the image until the entire image has been processed, and the result is stored in a feature map.
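As an illustration of how a single filter slides over an image to build a feature map, the following sketch performs the convolution described above in plain NumPy. The 3-by-3 edge-detecting filter, the toy 8-by-8 image, and the stride of 1 without padding are example choices made here, not values taken from the thesis.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide a kernel over a single-channel image and collect one value per position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y * stride:y * stride + kh, x * stride:x * stride + kw]
            # Convolution step: elementwise multiply the patch with the
            # filter weights and sum the result into a single value.
            feature_map[y, x] = np.sum(patch * kernel)
    return feature_map

image = np.random.rand(8, 8)                     # toy 8x8 grayscale image
vertical_edge_filter = np.array([[1, 0, -1],
                                 [1, 0, -1],
                                 [1, 0, -1]])    # responds to vertical edges
print(convolve2d(image, vertical_edge_filter).shape)  # (6, 6): smaller than the input
```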

The feature map will, as the name suggests, show different features of the image, which depend on the values of the weights in the filter used to construct that feature map. Feature maps output from earlier layers in the CNN will locate simple patterns such as vertical or horizontal lines in the image; however, these patterns become more complex for each new convolutional layer, as simple patterns are combined into more advanced ones, and could several layers in represent, for example, a human eye. The size of the feature map depends on the size of the filters, the number of filters used, and their stride. The feature map is stored as a tensor of size h × w × c, where h denotes height, w denotes width, and c denotes the number of channels, where one channel is the result of applying one filter.

Whenever a filter larger than size 1 × 1 is used on an image, the resulting feature map will have a smaller width and height than the image due to the convolution operation. To avoid this, zero-padding is often used, where a border of pixels with value 0 is added to the input so that the resulting feature map keeps the same size as the input while not differing in the values contained within the feature map. Each filter moves along the image from left to right, and from top to bottom, according to the stride of that filter. The stride decides how many pixels the filter should be shifted; a stride of 2, for example, would always move the filter two pixels before the convolution is calculated again, resulting in a feature map of roughly half the height and width compared to using a stride of 1.

Images typically contain a high number of pixels, which intuitively means that a lot of data has to be processed even for a simple network containing only a few layers, and this has to be done for each image. Besides the common approach of pre-processing images to reduce their resolution, certain layers are often used in a network whose only task is to reduce the amount of data sent to the next layer. A common way to do this is by using pooling layers, which effectively reduce the size of a feature map by representing each N-by-N area with a new value, often the maximum pixel value of that N-by-N area or the average pixel value of that area. Typically, this approach does not reduce the quality of the output all that much, while getting rid of at least 75% of the data (in the case of a pooling layer with filter size 2 × 2), which severely reduces the amount of data flowing through the network and therefore reduces training time. Intuitively, pooling layers will only reduce the height and the width of the feature map. If the number of channels is getting unnecessarily large, a common approach is to use a convolutional layer with a filter size of 1 × 1. This has no impact on the width and height of a feature map, but the number of channels can be controlled by using as many filters as wanted channels in the output, and the result is a summary of the previous feature map.

Typically, the output layer in a CNN will use the softmax activation function, transforming the output into an ordered list containing the probabilities that the input belongs to each class, depending on which features were found in the image.
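The shape bookkeeping described above (pooling halving height and width, 1 × 1 convolutions controlling the channel count) can be seen directly by printing tensor shapes; the short PyTorch sketch below does this for an arbitrary 64-channel feature map. The specific sizes are made up for the example.

```python
import torch
import torch.nn as nn

feature_map = torch.randn(1, 64, 24, 8)   # (batch, channels, height, width)

# 2x2 max pooling keeps the channels but halves height and width,
# discarding 75% of the values.
pooled = nn.MaxPool2d(kernel_size=2)(feature_map)
print(pooled.shape)                        # torch.Size([1, 64, 12, 4])

# A 1x1 convolution leaves height and width untouched but lets us choose
# the number of output channels (here 16), summarizing the 64 input channels.
channel_reduction = nn.Conv2d(in_channels=64, out_channels=16, kernel_size=1)
reduced = channel_reduction(feature_map)
print(reduced.shape)                       # torch.Size([1, 16, 24, 8])
```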

These probabilities will always be between 0 and 1, and together they sum to 1, representing 100%. In the case of person re-identification, these classes represent different people. Depending on whether the identity of the predicted person matches the identity of the input image or not, the matches are typically divided into four different types. A True Positive is a sample that the CNN has correctly classified as belonging to the same identity as the input image. A False Positive is instead a sample which has been wrongfully classified as belonging to the same identity as the input image, even though the images contain two different people. A True Negative is a sample that matches the identity of the input image, but where the actual class is incorrect; for example, if a mannequin is correctly matched to the same mannequin in a different view, that sample would be a True Negative. A False Negative is instead a sample where both the class of the object and the prediction of the model are wrong.

The expected output of a data point is known by labelling that piece of data, so that it is known which class it belongs to. This allows the loss function to measure the distance between the currently predicted class and the true class that the input belongs to. CNNs used for image classification tasks such as person re-ID often use other loss functions besides the previously mentioned mean squared error (MSE) loss function. A loss function that is often used instead of MSE is the cross entropy loss function, which is similar to MSE but punishes misclassifications more heavily and is usable only when each data point can belong to a single class. The formula for the cross entropy loss function can be seen in Equation 2.4.

$$L = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) \qquad (2.4)$$

In other words, the cross entropy loss function measures the distance between the desired predictions y_i and the current output predictions of the network ŷ_i with the help of the logarithmic function. N denotes the number of classes (the number of unique person IDs), and i is the current ID.

Another loss function used for classification problems is the triplet loss function, where an input image, commonly referred to as the anchor, is compared to two other images: one known as the positive, which is an image belonging to the same class as the anchor, and one known as the negative, which is an image belonging to a different class than the anchor. The idea is to simultaneously minimize the distance between the positive and the anchor while maximizing the distance between the negative and the anchor, in order to train the model so that it more closely connects images of the same class while differentiating more between images of different classes. The triplet loss function is defined as per Equation 2.5, where A, P, and N denote the anchor, the positive, and the negative respectively. D is a distance function, which could for example return the Euclidean distance between two samples, and m is a margin defining how much closer the positive should be to the anchor compared to the negative.

$$L(A, P, N) = \max(D(A, P) - D(A, N) + m, 0) \qquad (2.5)$$
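A compact illustration of Equation 2.5, assuming Euclidean distance as D and operating on already-extracted feature vectors; the margin of 0.3 and the toy vectors are example values chosen here, not parameters from any of the evaluated methods.

```python
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def triplet_loss(anchor, positive, negative, margin=0.3):
    # Equation 2.5: push the positive closer to the anchor than the
    # negative by at least the margin; otherwise the loss is zero.
    d_ap = euclidean_distance(anchor, positive)
    d_an = euclidean_distance(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)

# Toy feature vectors for an anchor image, an image of the same person,
# and an image of a different person.
anchor = np.array([0.9, 0.1, 0.3])
positive = np.array([0.8, 0.2, 0.35])
negative = np.array([0.1, 0.9, 0.7])
print(triplet_loss(anchor, positive, negative))
```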

The three samples A, P, and N together form a triplet that is often classified as easy, semi-hard, or hard. Easy triplets are those that give a loss of 0, meaning that D(A, P) + m < D(A, N). Semi-hard triplets instead have a loss greater than 0, where the negative is further from the anchor than the positive, which means that D(A, P) < D(A, N) < D(A, P) + m. Hard triplets are those where the loss is greater than 0 and the negative is closer to the anchor than the positive, which can be defined as D(A, N) < D(A, P).

When performing image classification, it is common to use pre-defined network architectures that have proven to work well for similar tasks as a baseline model. The typical approach is to use a network that has been pre-trained on a large-scale dataset, such as ImageNet [6] or MS-COCO (Microsoft Common Objects in Context) [20], and then adapt the model by introducing changes or additions to the network architecture. This has often been shown to give good results, as the model will already differentiate well between different objects, and using a baseline model with pre-trained parameters greatly reduces the training time, which would otherwise be much higher. A few network architectures that are commonly used for image classification, and are all used by the methods detailed in this thesis, are Residual Networks [11], Densely Connected Networks [15], and Inception Networks [35], which are all explained below.

Residual Network

The idea behind the Residual Network (ResNet) is to allow for a higher number of layers compared to what was previously viable. Generally speaking, a deeper network containing more layers will be able to distinguish more complex features compared to a shallower network containing fewer layers, and constructing deeper networks will therefore often give better results for image classification as it allows locating more complex patterns. However, as a network gets deeper, it becomes problematic to train due to the vanishing gradient problem. As the error gradient is propagated from the output layer to the input layer, it becomes increasingly smaller. This in turn means that the parameters belonging to the earlier layers of the network will eventually not be updated, due to the value of the gradient being very small, and the network will therefore not be trained properly. In 2015, He et al. [11] introduced ResNets as a solution to this problem; they have shown promising results and have therefore become a popular network architecture. The idea is to solve the vanishing gradient problem by introducing shortcut connections, as seen in Figure 2.2, where the output of a layer is added to the output of another, deeper layer in the network.

This process is repeated every few layers, and allows the gradients to skip past certain layers and therefore not be diminished during backpropagation, successfully allowing a network to become deeper without suffering from the vanishing gradient problem. Compared to common network architectures introduced prior to ResNets, such as VGG-19 [32], which was previously considered very deep with its 19 layers, the authors of ResNet showed that by introducing these shortcut connections it was possible to train a network of over 1000 layers without suffering from vanishing gradients.

Figure 2.2: Shortcut connection of a residual network.

Densely Connected Network

Similarly to how residual networks are constructed, Densely Connected Networks (DenseNets) also combine the values of different layers to solve the vanishing gradient problem. However, while the ResNet architecture will only ever add the outputs of two different layers, a DenseNet concatenates each previous layer to each following layer. The output of the last layer is therefore the result of concatenating the outputs of all previous layers. Compared to ResNet, where the output values are simply added together, DenseNets concatenate the outputs, keeping all of the original information. This results in feature maps that quickly become much larger, and DenseNets are for that reason very computationally demanding given their size. Concatenating layers like this has made it possible to construct networks that perform well with a much lower number of channels in the layers, although as the outputs are concatenated, the channels of the feature map increase much faster compared to other network architectures. To counter this behaviour, 1 × 1 convolutional layers are used, which allows reducing the number of channels of the feature maps, as previously mentioned.
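A minimal sketch of the shortcut connection in Figure 2.2, written in PyTorch under the assumption that the two convolutions keep the spatial size and channel count unchanged (real ResNet blocks also handle dimension changes with a projection, which is omitted here). The final comment notes how a DenseNet-style block would differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added to the block's input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        # Shortcut connection: add the input so gradients can flow past
        # the convolutions during backpropagation.
        return F.relu(out + x)

block = ResidualBlock(channels=64)
x = torch.randn(1, 64, 24, 8)
print(block(x).shape)   # same shape as the input: torch.Size([1, 64, 24, 8])

# A DenseNet-style block would instead do: torch.cat([x, out], dim=1),
# growing the channel count with every layer.
```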

Inception Network

In a typical CNN, each layer will produce a feature map by performing a convolution using filters containing different weights, where all the filters are of the same size (for example 5 × 5 filters containing 25 weights each). The main idea behind the Inception network proposed by Szegedy et al. [35] is to instead perform convolutions with differently sized filters in a single layer, and concatenate the result into a single feature map. Specifically, filters of size 1 × 1, 3 × 3, and 5 × 5 are all used, together with a 3 × 3 max pooling operation. Applying these four operations to the input results in four different outputs, which are then concatenated, forming a single feature map. Performing convolutions with a larger filter means an increase in the amount of computation needed if the convolution operations are otherwise performed in the same way; similarly to how the feature maps of DenseNet are reduced in size, the Inception network uses 1 × 1 convolutions to reduce the input size before applying the 3 × 3 and 5 × 5 convolutions. The max pooling operation is followed by a 1 × 1 convolution in the same manner to reduce the number of channels in the output feature map. The explanation above defines one inception module, which takes one input and produces one output. The Inception network is the result of stacking multiple inception modules together, which has proven to be several times faster compared to similarly performing networks that do not utilize the same idea. To further improve the speed, Szegedy et al. later modified the network, detailing Inception-V2 and Inception-V3 in [36], where operations using filters larger than 1 × 1 are split into consecutive operations with smaller filters. This effectively reduces the amount of computation needed while giving the same result: for example, a 5 × 5 filter containing 25 weights can be represented by two 3 × 3 filters containing 18 weights in total, which in turn can be reduced into 1 × 3 and 3 × 1 sized filters containing a total of 12 weights. Further on, 7 × 7 convolutions were implemented into each module to be part of the module's resulting feature map, and label smoothing was implemented to regularize the model in order to prevent overfitting.

2.2 Person Re-identification History

There are many methods that have been suggested to solve the person re-ID problem, and the area has seen much improvement in the past years. This section details some of the earlier important works that have been part of reaching the stage the area is in today, together with an explanation of the pipeline detailing the required steps for an automatic person re-identification system.

Pipeline

In order to compare an image of a person to other images of people, these people first have to be separated from the rest of the image. Methods attempting person re-ID typically only attempt to solve the actual re-identification of a person, while taking the remainder of the pipeline of a usable system for granted. Generally, a functional automatic person re-ID system can be summarized in the following 5 steps:

1. Video data in the form of images first has to be gathered from one or more cameras.
2. In the images that contain one or multiple people, these people have to be separated from the background. The common approach is to draw a bounding box around each person, which is made into a new image. This removes much background noise which would otherwise be an obstacle when comparing images of people.
3. In order to train a model, labelled training data is needed. The images of people obtained through bounding boxes therefore need to be labelled such that all instances of the same person have the same label. This is often done manually, which is a major reason why pre-labelled datasets are commonly used when testing and evaluating person re-ID methods.
4. Once a sufficient amount of data has been labelled, a model can be trained. The previous steps are generally not part of most methods proposed for person re-ID, since they are mostly evaluated using datasets that have been specifically produced for the sake of person re-ID. Using such datasets means the previous steps do not have to be considered, as the datasets already contain labelled, cropped person images.
5. When a model has been sufficiently trained, it can be evaluated and used. This is done by comparing the person-of-interest (query image) to a gallery containing other images of all the people that have been observed by the system. This results in a ranking list that, in the optimal case, has all other known images of the same person at the top of that list (a minimal sketch of this ranking step is shown below).
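The sketch below illustrates step 5, assuming feature vectors have already been extracted for the query and gallery images by some trained model; ranking by cosine similarity is one common choice and is used here only as an example, with made-up identity labels.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_gallery(query_feature, gallery_features, gallery_ids):
    """Return gallery person IDs sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_feature, g) for g in gallery_features]
    order = np.argsort(scores)[::-1]          # highest similarity first
    return [(gallery_ids[i], scores[i]) for i in order]

# Toy example: three gallery images with known identity labels.
query = np.array([0.9, 0.1, 0.2])
gallery = [np.array([0.85, 0.15, 0.25]),      # same person, different camera
           np.array([0.1, 0.9, 0.4]),
           np.array([0.3, 0.2, 0.9])]
ids = ["person_12", "person_7", "person_31"]
print(rank_gallery(query, gallery, ids))
```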

The start of person re-identification

The first attempt at non-overlapping, multi-camera tracking was presented by Huang and Russell [16] in 1997. Using a Bayesian approach, a method was suggested for finding the probability that a vehicle in one camera view matches another vehicle appearing in a different camera view. Although this approach was oriented towards vehicles, the concept is similar to that of person re-ID. In 2005, Zajdel et al. proposed a similar strategy in [42] that is oriented towards re-identifying human subjects, and it is the first well-known report that contains the term person re-identification. [42] details a method to be used with a mobile robot in order to track humans in indoor environments. This is done by labelling every person that enters the robot's field of view, and then comparing these labels to one another with the help of a dynamic Bayesian network, making use of color and spatio-temporal information to find out how closely two labels match.

Methods using hand-crafted features

For the following years, re-ID methods focused mainly on using hand-crafted features for the purpose of matching a person between different camera feeds. These hand-crafted features use information in the image to find patterns through different algorithms, and which algorithms to use therefore has to be manually selected. Bak et al. [2] adapt a HOG (Histogram of Oriented Gradients) based technique originally used for face detection in order to detect human beings. Once detected, the person is separated from the background and compared to another person using Haar-like features [19] and Dominant Color Descriptors [40]. In [4], Cheng et al. propose a method based on pictorial structures [8], where each body model is split into segments containing the different body parts. These are then used to assign an ID signature to the person, which is compared to other ID signatures through a distance function. Hirzer et al. [14] encode the color structure of a person in the LAB color space through a histogram, and then learn a representation of the person by using Large Margin Nearest Neighbor metric learning [39]. By looking at the Mahalanobis distance between two samples, they then conclude whether the samples are likely to represent the same person or not.

Methods using deep neural networks

In 2014, the first methods employing deep neural networks were presented in close proximity by Yi et al. [41] and Li et al. [18]. In [41], a Siamese Convolutional Neural Network (SCNN) consisting of two CNNs is used to find the similarity between two person images. By splitting each of the two person images into three sections containing the upper, middle, and bottom parts of the person, the different parts can be compared by feeding the images belonging to one person into one CNN, and the images belonging to the other person into the second CNN. A similarity score is then obtained through a cosine function, which gives a measurement of how similar the two images are.

Similarly to [41], the method proposed in [18] consists of feeding two images, split into even smaller sections, into an SCNN to find the similarity between the images. Both of the methods detailed in [18] and [41] achieved results slightly outperforming most previous methods based on hand-crafted features, and even though the increase in accuracy was not groundbreaking, these works are highly important for furthering the research in the area of person re-ID, which has since been heavily oriented towards deep neural networks. Due to the large amount of data needed to train a neural network sufficiently, [18] also introduced a new dataset named CUHK03. Previously popular datasets produced for evaluating re-ID methods, such as VIPeR [10], i-LIDS [48] and ETHZ [31], were not sufficiently large for training neural networks, and there was a need for new datasets containing more data. Following CUHK03, other large-scale datasets such as Market-1501, presented by Zheng et al. [46], have therefore been produced for these types of methods. In 2018, Zhang et al. [45], besides developing a method for person re-ID, also measured the performance of 10 professional annotators by presenting them with a query image and a small portion of either the CUHK03 dataset or the Market-1501 dataset, in order for them to find the person in one of the images that matches the query image. Comparing the performance of the human annotators to the performance of the method proposed in the same paper showed that their method surpassed human-level performance on both Market-1501 [46] and CUHK03 [18]. Even though automatic methods for person re-ID have been proven to outperform manual person re-ID, the situation changes as the time interval between observations increases.

Methods for long-term person re-identification

All of the previously mentioned methods rely on visual features of individuals, most importantly their clothing, since it occupies most of the body area of a person. While these methods have been proven to work well in short-term scenarios where such visual features rarely change, the same methodology is not reliable for long-term scenarios where the appearance of an individual will change continuously. To be able to reliably re-identify a person after several days or months, a method needs to identify characteristics that do not change as time passes, but that are still unique to each individual. Person re-ID methods for long-term scenarios are far from as well researched as methods viable for short-term scenarios, and the first method that focuses on long-term person re-ID was suggested in 2018 by Zhang et al. [44]. Based on the hypothesis that people keep constant motion patterns under non-distractive walking conditions, Zhang et al. present a model where the human body is divided into several body-action primitives, each of which is encoded using Fisher vectors [30], resulting in multiple motion descriptors.

Concatenating these motion descriptors gives a motion representation of the person, which can then be compared to another person's motion representation to find similarities. Since 2018, several other methods have been proposed to better handle long-term person re-ID; however, the accuracy of these methods is still far from as high as that of methods using only visual features. Furthermore, most long-term re-ID methods that have been proposed require additional information that a standard RGB camera cannot gather. One of the more intuitive approaches is to use depth information, which can be gathered using an RGB-D camera. Several methods [25][17][27] use this additional depth information to get a more accurate representation of the person. Since the body shape can be estimated more easily using depth information, a better long-term representation of the person can be found. A different approach is presented in [7], where radio frequencies are used to find the shape and size of a person. Using a radio together with an RGB camera, the authors create a better long-term representation of a person by using radio frequency signals that reflect off the human body.


3 Method

This chapter details the different methods that have been evaluated in this thesis work, and furthermore specifies the parameters used for training the networks used in the different methods. A detailed description of the evaluation metrics used for comparing the different methods follows, and the specifications of the datasets evaluated upon are explained. Lastly, a few approaches attempted in order to improve the result of re-identification in an unknown domain are detailed.

3.1 Evaluated Methods

Spatial-Temporal Person Re-identification

Spatial-Temporal Person Re-identification (ST-ReID), as proposed by Wang et al. [37], makes use of a Part-Based Convolutional Baseline (PCB) [33], which in turn uses a Residual Network [11] of 50 layers (ResNet-50) as a backbone network. This baseline is used to extract visual features of each person image, resulting in a feature vector for each of these images. Two feature vectors are then compared to each other through the cosine distance, as seen in Equation 3.1, where x_i and x_j are two feature vectors extracted from two images i and j.

$$s(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \times \|x_j\|} \qquad (3.1)$$

The ST-ReID method improves on the result of only using PCB for feature vector comparison by also introducing a spatial-temporal stream, making the probability that two people match depend on the time interval between when the two images were taken, on top of their visual representations. By estimating spatial-temporal histograms mapping the frequency of each person's re-appearance to the time interval between two cameras, the estimated probability of a person re-appearing in a different camera view after a certain amount of time can be found. The histogram is smoothed with the Parzen window method, resulting in a continuous function estimating the probability that a person re-appears in a camera view after being observed in a different camera view. Figure 3.1 shows the spatial-temporal distribution of images between camera 6 and all other cameras in the DukeMTMC-reID dataset, together with the camera topology of the cameras used to obtain these images.

Figure 3.1: (a) Camera topology of DukeMTMC-reID. (b) The spatial-temporal distribution of images contained in the DukeMTMC-reID dataset.

The spatial-temporal distribution seen in Figure 3.1b is obtained by estimating a spatial-temporal histogram describing the probability of a positive image pair according to Equation 3.2, where k is the k:th bin of a histogram for two different cameras c_i and c_j, n^k_{c_i c_j} is the number of person image pairs whose time difference falls in that k:th bin, and the sum over m of n^m_{c_i c_j} is the total number of person image pairs found in both cameras c_i and c_j. This histogram is smoothed by the Parzen window method as per Equation 3.3, in which the Gaussian function seen in Equation 3.4 is used as the filter K, where σ is the Gaussian filter parameter used to change the size of the filter K.

$$\hat{p}(y = 1 \mid k, c_i, c_j) = \frac{n^k_{c_i c_j}}{\sum_m n^m_{c_i c_j}} \qquad (3.2)$$

$$p(y = 1 \mid k, c_i, c_j) = \frac{1}{L}\sum_{l} \hat{p}(y = 1 \mid l, c_i, c_j)\, K(l - k) \qquad (3.3)$$

$$K(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}} \qquad (3.4)$$
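A rough sketch of Equations 3.2-3.4, assuming the time differences of positive image pairs between one camera pair have already been collected into a list of bin indices; the bin count, the toy data, and the value of σ below are arbitrary example values, not the ones used by ST-ReID.

```python
import numpy as np

def histogram_probability(pair_bins, num_bins):
    # Equation 3.2: fraction of positive pairs whose time difference
    # falls in each bin, for one pair of cameras (c_i, c_j).
    counts = np.bincount(pair_bins, minlength=num_bins).astype(float)
    return counts / counts.sum()

def gaussian_kernel(x, sigma):
    # Equation 3.4
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def parzen_smooth(p_hat, sigma=3.0):
    # Equation 3.3: smooth the histogram with a Gaussian filter so that the
    # estimate becomes a continuous function of the time-difference bin k.
    num_bins = len(p_hat)
    smoothed = np.zeros(num_bins)
    for k in range(num_bins):
        weights = gaussian_kernel(np.arange(num_bins) - k, sigma)
        smoothed[k] = np.sum(p_hat * weights) / num_bins
    return smoothed

# Toy data: time-difference bins of positive pairs between two cameras.
pair_bins = np.array([3, 3, 4, 5, 5, 5, 6, 9, 10, 10])
p_hat = histogram_probability(pair_bins, num_bins=20)
print(parzen_smooth(p_hat))
```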

$p(y = 1 | k, c_i, c_j) = \frac{1}{L}\sum_{l}\hat{p}(y = 1 | l, c_i, c_j)K(l - k)$    (3.3)

$K(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{x^2}{2\sigma^2}}$    (3.4)

By combining the spatio-temporal probability from Equation 3.3 with the similarity score from Equation 3.1, a joint probability relying on both is found. To make the probability more robust to rare events where the time difference of two observations is significantly larger or smaller, a logistic smoothing approach is used, as seen in Equation 3.5. The resulting joint probability function is then given by Equation 3.6.

$f(x; \lambda, \gamma) = \frac{1}{1 + \lambda e^{-\gamma x}}$    (3.5)

$p(y = 1 | x_i, x_j, k, c_i, c_j) = \frac{1}{1 + \lambda_0 e^{-\gamma_0 s(x_i, x_j)}} \cdot \frac{1}{1 + \lambda_1 e^{-\gamma_1 p(y = 1 | k, c_i, c_j)}}$    (3.6)

The hyperparameters used to train the method have been chosen as closely as possible to match the parameters used by Wang et al. [37] in order to achieve similar results. The batch size has, however, been set to 16 instead of the proposed 32 due to hardware limitations, and according to the linear scaling rule proposed by Goyal et al. [9] the learning rate has therefore been multiplied by the same factor as the batch size, namely 0.5, meaning an initial learning rate of 0.05 has been used instead of the authors' proposed value of 0.1.

Top DropBlock Network

Top DropBlock Network (Top-DB-Net), as proposed by Quispe and Pedrini [26], attempts to create a model that focuses more on less informative regions of a person by removing more informative areas of the person image during training. This forces the network to train on the data that is not removed, meaning the features found in these areas will be of higher importance to the network. Top-DB-Net is constructed with the Batch DropBlock (BDB) Network [5] as the baseline, where the BDB Network itself uses a modified ResNet-50 [11] as its backbone. Compared to the originally proposed ResNet-50, this modified version used by the BDB Network removes the last pooling layer, resulting in a larger feature map of size 2048x24x8. Top-DB-Net is split into three different streams, where two of the streams require information directly from this feature map; therefore, the feature map is duplicated and sent to both of

these streams. The first stream, called the Global Stream, simply transforms the feature map into a 2048-dimensional feature vector by using a global average pooling layer, which contains the average values from the feature map. A 1x1 convolution layer is furthermore used to reduce the number of features to 1024. The second stream, called the Top-DropBlock Stream, finds the most activated regions of the feature map, from which it removes horizontal stripes. This is done by first passing the feature map through 2 bottleneck layers for computational efficiency, whose output will from here on be referred to as G. G is transformed into an activation map as shown by Equation 3.7, where c is the number of channels of the feature map, and $F_i$ contains the part of the feature map G belonging to channel i. By calculating the sum of the values of the activation map belonging to a certain row, the most activated row of the feature map is found according to Equation 3.8, where j denotes the current row and k denotes the current column of A.

$A = \sum_{i=1}^{c}|F_i|$    (3.7)

$r_j = \frac{\sum_{k=1}^{w}A_{j,k}}{w}$    (3.8)

When the largest value of $r_j$ is found, the elements belonging to that row are all replaced by 0. This process is repeated until 30% of the rows contain zeroes, resulting in a feature map where the most activated areas have been dropped. The resulting feature map is combined with G through the dot product, upon which a global maximum pooling layer is appended, resulting in a feature vector. A fully connected layer is lastly used to set the number of elements of the feature vector to the same number as the result from the Global Stream, namely 1024. The third stream, called the regularizer stream, is only used during the training of the network as a way to counteract false positives which may otherwise be generated. Larger regions of an image that have been dropped by the Batch DropBlock stream tend to create noise in G, which leads to these false positives being generated as unique regions between different ID inputs are removed. As the model is updated through backpropagation, the result of this feature dropping will update the parameters of the bottleneck layers in a sub-optimal way, which leads to noise being generated in G that will negatively affect the model. The regularizer stream is therefore introduced, which simply appends a global average pooling layer to G, allowing for a separate loss function to be used to update the parameters of the bottleneck layers optimally, without taking the feature dropping procedure into account. The loss function used for all three streams is a combination of the cross entropy with label smoothing regularizer loss function, as presented by Szegedy et al. [34], and the triplet loss

with hard positive-negative mining, as presented by Hermans et al. [13]. Figure 3.2 shows a visual representation of the architecture of the model.

Figure 3.2: The architecture of the Top DropBlock Network (Top-DB-Net).

During the training process of Top-DB-Net, the hyperparameters have been set according to the suggestions of the authors.

Adaptive L2 Regularization

This method, proposed by Ni et al. [24], introduces an adaptive L2 regularization mechanism, which is used to prevent overfitting of a model by introducing an additional term to the loss function that effectively shifts the weights away from the values which would otherwise overfit the model. L2 regularization is typically defined as per Equation 3.9, and is used to update the loss function by adding the L2 regularization term to the output of the loss function, as seen in Equation 3.10, where $L(P)$ is the loss function used.

$L_2 = \lambda\sum_{n}^{N}\|w_n\|^2$    (3.9)

$L(P) = L(P) + \lambda\sum_{n}^{N}\|w_n\|^2$    (3.10)

When using L2 regularization, the regularization factor λ is typically chosen manually and remains constant throughout the entire training procedure. Ni et al. [24] instead argue that a better approach would be to introduce multiple regularization factors $\lambda_n$ which are associated with each individual weight instead of the entire sum of weights, updating the loss function as seen in Equation 3.11 instead.

$L(P)_{new} = L(P) + \sum_{n}^{N}\lambda_n\|W_n\|^2$    (3.11)
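The effect of Equation 3.11 can be sketched in PyTorch as below, where one learnable factor is attached to each parameter tensor (a simplification of the per-weight factors), kept between 0 and 1 by the hard sigmoid of Equation 3.12 that follows, and scaled by an amplitude A as in Equation 3.13. The names, the amplitude value, and the per-tensor simplification are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class AdaptiveL2(nn.Module):
    def __init__(self, model, amplitude=1.0, c=2.5):
        super().__init__()
        self.model = model
        self.amplitude = amplitude
        self.c = c
        # One learnable regularization variable theta_n per parameter tensor.
        self.thetas = nn.ParameterList(
            [nn.Parameter(torch.zeros(1)) for _ in model.parameters()]
        )

    def penalty(self):
        # Hard sigmoid keeps each factor between 0 and 1 (Equation 3.12);
        # the amplitude A controls the strength of the term (Equation 3.13).
        term = 0.0
        for theta, w in zip(self.thetas, self.model.parameters()):
            factor = torch.clamp(theta / (2 * self.c) + 0.5, 0.0, 1.0)
            term = term + self.amplitude * factor * (w ** 2).sum()
        return term

# Training step: loss = task_loss + reg.penalty(), so backpropagation
# updates both the network weights and the variables theta_n.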

Manually trying to find optimal values for all these regularization factors $\lambda_n$ would be immensely time consuming, and the regularization factors are instead replaced by scalar variables $\theta_n$ which are updated iteratively using backpropagation, in the same manner as the weights are updated. Negative values of the regularization factors could quickly minimize the updated loss function $L(P)_{new}$; however, this would make the regularization term dominant, which in turn would result in useless feature descriptors. To maintain positive regularization factors, a hard sigmoid function is therefore applied, which effectively scales the values of the scalar variables $\theta_n$ between 0 and 1 depending on a constant c, according to Equation 3.12. In the experiments performed by the authors, and in the experiments performed in this thesis, the value of c is set to 2.5. Finally, the regularization factors are multiplied with an amplitude A that is used to control the effect of the regularization term, helping to avoid excessively large values of the regularization factors. The final loss function is regularized according to Equation 3.13.

$f(\theta_n) = \begin{cases} 0, & \text{if } \theta_n < -c \\ 1, & \text{if } \theta_n > c \\ \frac{\theta_n}{2c} + 0.5, & \text{otherwise} \end{cases}$    (3.12)

$L(P)_{new} = L(P) + \sum_{n}^{N}\left(A f(\theta_n)\|W_n\|^2\right)$    (3.13)

The method proposed by Ni et al. uses a slightly modified Residual Network [11], which has been pre-trained on ImageNet [6], as the backbone model. The first four blocks of the backbone model are kept as originally introduced in [11]. The stride is reduced to 1 (from the original value of 2) in the first layer of block 5, resulting in a feature map of twice the width and height. To improve the model's performance, several strategies are applied. The robustness of the model is improved by randomly flipping the images sent to the network horizontally with a probability of 50%; furthermore, random erasing [49] is applied to the images, which replaces a randomly located area within the image with a rectangular shape of random pixel values, simulating occlusion. Further on, the loss function used is a combination of evenly weighted categorical cross-entropy loss and triplet loss. By replicating the fifth block of the backbone model, the resulting feature map is sliced into two horizontal stripes on which dimensionality reduction is performed. These feature maps are sent into a global average pooling layer, which reduces their spatial dimensions into a single feature vector each. The elements within these feature vectors are then restricted between 0 and 1 using a clipping layer that simply sets all negative values to 0 and all values greater than 1 to 1, before being normalized using a batch normalization [22] layer. Lastly, a fully connected layer is used in order to match the number of features to the number of identities in the gallery. The second copy of the fifth block

is subject to the same order of operations except for the slicing and dimensionality reduction procedures, and the three resulting feature vectors from the fully connected layers are summed up. As presented in [24], each one of these strategies has increased the performance of the model, as they have been added one at a time when evaluating the model. Further on, conventional L2 regularization is attempted but is shown not to achieve results as good as when using multiple regularization factors, demonstrating that the main contribution of the model, namely the adaptive L2 regularization, is a successful approach. For fairer comparison purposes, the residual network used as backbone model for this method contains 50 layers, to be consistent with the previously explained methods. All hyperparameters used for the model have been defined as proposed by the authors.

Multiple Expert Brainstorming Network

Multiple Expert Brainstorming Network (MEB-Net), as proposed by Zhai et al. [43], attempts ensemble learning as a way of transferring knowledge between different networks. This allows for unsupervised learning on a dataset that does not have labelled images, by first training several models using different network architectures on another, labelled dataset, and using the information gained to create pseudo-labels for the images in the unlabelled dataset. To avoid confusion, the first dataset with labelled images will from here on be referred to as the source dataset, and the second, unlabelled dataset will be referred to as the target dataset. By introducing several backbone models that are each trained on the same data from the source dataset, an attempt is made at transferring the knowledge between these backbone models to be used on images in the target dataset. The training is performed in two stages, where the first stage consists of training 3 different models that differ in network architecture on images from the source dataset. The loss function for these models is defined as the sum of an entropy loss function with label smoothing, as seen in Equation 3.14, and a softmax triplet loss function, as seen in Equation 3.16. In Equation 3.14, $N_s$ is the number of sample images, which have $M_s$ unique identities. $q_j$ is defined as per Equation 3.15, where $\epsilon$ is a small constant, in this case set to 0.1, making $q_j$ a significantly larger value when the true person identity of the current sample image $y_{s,i}$ matches the currently predicted identity j. $p_j(x_{s,i}|\theta^k)$ denotes the probability predicted by the model that image $x_{s,i}$ belongs to the identity j, using the parameters $\theta^k$ of the current model k. In Equation 3.16, $x_{s,i+}$ denotes the hardest positive sample of the anchor $x_{s,i}$, meaning it is the image most similar to image $x_{s,i}$ that belongs to the same identity. Similarly, $x_{s,i-}$ denotes the hardest negative sample of $x_{s,i}$, which is the most similar image to $x_{s,i}$ that does not belong to the same

identity as $x_{s,i}$. The loss function used to update the parameters of all three models is defined as the sum of Equation 3.14 and Equation 3.16 when training the models on images from the source dataset, and can be seen in Equation 3.17.

$L^k_{s,id} = -\frac{1}{N_s}\sum_{i=1}^{N_s}\sum_{j=1}^{M_s} q_j \log p_j(x_{s,i}|\theta^k)$    (3.14)

$q_j = \begin{cases} 1 - \epsilon + \frac{\epsilon}{M_s}, & \text{if } j = y_{s,i} \\ \frac{\epsilon}{M_s}, & \text{otherwise} \end{cases}$    (3.15)

$L^k_{s,tri} = -\frac{1}{N_s}\sum_{i=1}^{N_s}\log\frac{e^{\|f(x_{s,i}|\theta^k) - f(x_{s,i-}|\theta^k)\|}}{e^{\|f(x_{s,i}|\theta^k) - f(x_{s,i+}|\theta^k)\|} + e^{\|f(x_{s,i}|\theta^k) - f(x_{s,i-}|\theta^k)\|}}$    (3.16)

$L^k_s = L^k_{s,id} + L^k_{s,tri}$    (3.17)

The three different network architectures that have been used are the previously explained Residual Network with 50 layers [11], a Dense Convolutional Network with 121 layers [15], and the Inception-V3 Network [36] consisting of 48 layers. Once these three models have been trained on the same training data of the source dataset, an attempt is made at transferring the knowledge they have gained to the target dataset. Firstly, each of the trained models k extracts convolutional features $f(X_t|\theta^k)$ from the images in the target dataset. Average features are then calculated as shown by Equation 3.18, where K is the number of models (in this case 3), and $f(X_t|\Theta^k)$ are the features of a temporally average model with parameters $\Theta^k$ that are initialized with the same values as the parameters $\theta^k$ of each model. These average features are then classified into different clusters through a mini-batch k-means clustering process, where each cluster is represented by a specific ID that is used as a pseudo-label for a certain image. This puts images of people that have a similar set of features in the same cluster, where those images optimally show the same person. Using $\Theta^k$ initialized with the values of $\theta^k$ allows the average features of images in the target dataset to remain closer to their initial values, by updating them a small amount compared to $\theta^k$. $\Theta^k$ is updated for each iteration T as shown by Equation 3.19, where the value of α decides how much the parameters of the temporally average model should be updated depending on the parameters $\theta^k$ of the model. The value of α has in this thesis been set to 0.999 as per the suggestions of the authors.

$f(X_t) = \frac{1}{K}\sum_{k=1}^{K} f(X_t|\Theta^k)$    (3.18)
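A rough sketch of the pseudo-labelling step around Equation 3.18 is given below: features from each temporally average model are averaged across the experts and clustered with mini-batch k-means, and the cluster index becomes the pseudo-label. The helper extract_features, the number of clusters, and the use of scikit-learn's MiniBatchKMeans are illustrative assumptions rather than the authors' exact pipeline.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def pseudo_label(target_images, avg_models, extract_features, num_clusters=500):
    # Equation 3.18: average the features produced by the K temporally
    # average models (K = len(avg_models)); each entry has shape (N, D).
    feats = [extract_features(model, target_images) for model in avg_models]
    mean_feats = np.mean(np.stack(feats, axis=0), axis=0)
    # Cluster the averaged features; each cluster ID acts as a pseudo
    # identity label for the corresponding target image.
    kmeans = MiniBatchKMeans(n_clusters=num_clusters, random_state=0)
    return kmeans.fit_predict(mean_feats)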

$\Theta^k_T = \alpha\Theta^k_{T-1} + (1 - \alpha)\theta^k$    (3.19)

The discrimination capability of all three models is likely to differ on sample images in the target dataset. Because of this, an authority regularization scheme is used to find the discrimination capability of each model and regulate the authority of that model thereafter. This is done by clustering the samples into $M_t$ different clusters depending on their features, in the same manner as previously explained. The scatter of the samples within each separate cluster (intra-cluster scatter) as well as the scatter of all samples (inter-cluster scatter) of a model are compared to find how well the model discriminates between different identities. More specifically, the intra-cluster scatter is found by comparing the squared Euclidean distance of each sample within a cluster to the average value of that cluster, as seen in Equation 3.20, where $f(x|\Theta_T)$ are the features of the sample image x extracted with the parameters $\Theta_T$ at epoch T. $C_i$ denotes the cluster with label i, and $n^i_t$ is the number of samples contained in the cluster, meaning the average sample within the cluster is subtracted from each sample within the cluster.

$S^i_{intra} = \sum_{x \in C_i}\left\|f(x|\Theta_T) - \frac{1}{n^i_t}\sum_{x \in C_i} f(x|\Theta_T)\right\|^2$    (3.20)

Similarly, Equation 3.21 is used to calculate the inter-cluster scatter by finding the squared Euclidean distance between the average sample $\frac{1}{n^i_t}\sum_{x \in C_i} f(x|\Theta_T)$ of cluster $C_i$ and the average sample of all clusters $\frac{1}{N_t}\sum_{i=1}^{N_t} f(x_{t,i}|\Theta_T)$, where $N_t$ denotes the number of clusters.

$S_{inter} = \sum_{i=1}^{M_t} n^i_t\left\|\frac{1}{n^i_t}\sum_{x \in C_i} f(x|\Theta_T) - \frac{1}{N_t}\sum_{i=1}^{N_t} f(x_{t,i}|\Theta_T)\right\|^2$    (3.21)

The discrimination capability of a model can be summarized as the relationship between the inter- and intra-cluster scatter according to Equation 3.22, where a larger value of J shows better discrimination capability of the model, meaning the distance between samples within each cluster is small and/or the distance between different clusters is large. In other words, for larger values of J the model manages to more closely connect samples belonging to the same identity and/or better separate different identities. Lastly, the value of J is used for each model k to define its authority $w^k$, which is utilized when transferring knowledge between the different models and can be seen in Equation 3.23.

$J = \frac{S_{inter}}{\sum_{i=1}^{M_t} S^i_{intra}}$    (3.22)
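Read as code, Equations 3.20-3.22 amount to the following NumPy sketch, which computes the intra- and inter-cluster scatter from an (N, D) array of features and the cluster labels from the previous step, and returns the discrimination score J. Variable names are illustrative.

import numpy as np

def discrimination_score(features, labels):
    # features: (N, D) target-sample features of one expert; labels: cluster IDs.
    global_mean = features.mean(axis=0)
    s_intra, s_inter = 0.0, 0.0
    for c in np.unique(labels):
        cluster = features[labels == c]
        center = cluster.mean(axis=0)
        # Equation 3.20: squared distances of the samples to their cluster centre.
        s_intra += ((cluster - center) ** 2).sum()
        # Equation 3.21: size-weighted squared distance of the cluster centre
        # to the average of all samples.
        s_inter += len(cluster) * ((center - global_mean) ** 2).sum()
    # Equation 3.22: larger J means tighter and/or better separated clusters.
    return s_inter / s_intra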

$w^k = \frac{3\,J^k}{\sum_{n=1}^{3} J^n}$    (3.23)

For each iteration T of each training epoch, a batch of images from the target dataset is fed to all three models, which produce feature representations and predict the confidence of each sample belonging to the different identities. This information is used to calculate the cross entropy loss between the predicted confidence values of a model and the predicted confidence values of a second, temporally average model (as previously defined in Equation 3.19), following Equation 3.24, where k denotes the model gaining knowledge from the model e, and the cross entropy loss is dependent on the number of sample images $N_t$ and the predicted confidence values $p_j$ of sample image $x_{t,i}$ belonging to identity j.

$L^{k \leftarrow e}_{ce} = -\frac{1}{N_t}\sum_{i=1}^{N_t}\sum_{j=1}^{M_t} p_j(x_{t,i}|\Theta^e_T)\log p_j(x_{t,i}|\theta^k_T)$    (3.24)

The loss for each of the three models is calculated as the average loss with respect to the other models, according to Equation 3.25, where the previously explained authority $w^e$ for model e is used.

$L^k_{ce} = \frac{1}{2}\sum_{e \neq k} w^e L^{k \leftarrow e}_{ce}$    (3.25)

Similarly, a binary cross entropy loss is calculated for each model according to Equation 3.26, using the softmax of the feature distance between negative sample pairs $S_i(\theta^k)$ defined in Equation 3.27, using the same notation as Equation 3.16. Just as with the cross entropy loss above, the loss for each model is calculated as the average loss with respect to the other models, as seen in Equation 3.28.

$L^{k \leftarrow e}_{bce} = -\frac{1}{N_t}\sum_{i=1}^{N_t}\left[S_i(\Theta^e_T)\log S_i(\theta^k) + (1 - S_i(\Theta^e_T))\log(1 - S_i(\theta^k))\right]$    (3.26)

$S_i(\theta^k) = \frac{e^{\|f(x_{s,i}|\theta^k) - f(x_{s,i-}|\theta^k)\|}}{e^{\|f(x_{s,i}|\theta^k) - f(x_{s,i+}|\theta^k)\|} + e^{\|f(x_{s,i}|\theta^k) - f(x_{s,i-}|\theta^k)\|}}$    (3.27)

$L^k_{bce} = \frac{1}{2}\sum_{e \neq k} w^e L^{k \leftarrow e}_{bce}$    (3.28)

The overall loss which is to be minimized is dependent on the cross entropy loss in Equation 3.25 and the binary cross entropy loss in Equation 3.28, together with the previously explained loss function seen in Equation 3.17.
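The knowledge transfer of Equations 3.24 and 3.25 can be sketched in PyTorch as below: a soft cross entropy between the class predictions of the learning expert k and the temporally average model of each other expert e, weighted by that expert's authority; the binary cross entropy of Equations 3.26-3.28 follows the same authority-weighted averaging. Variable names are illustrative and the snippet is a simplified sketch, not the authors' implementation.

import torch.nn.functional as F

def soft_cross_entropy(student_logits, teacher_logits):
    # Equation 3.24: cross entropy between the teacher's predicted confidences
    # and the student's predicted confidences over the pseudo identities.
    teacher_p = F.softmax(teacher_logits, dim=1)
    student_logp = F.log_softmax(student_logits, dim=1)
    return -(teacher_p * student_logp).sum(dim=1).mean()

def mutual_ce_loss(k, expert_logits, avg_expert_logits, authority):
    # Equation 3.25: authority-weighted average over the other experts e != k.
    loss = 0.0
    for e in range(len(expert_logits)):
        if e != k:
            loss = loss + authority[e] * soft_cross_entropy(expert_logits[k], avg_expert_logits[e])
    return loss / (len(expert_logits) - 1)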
