Master Thesis

HALMSTAD UNIVERSITY

Network Forensics, 60 credits

The Effect of Beautification Filters on Image Recognition:

"Are filtered social media images viable Open Source Intelligence?"

Digital Forensics, 15 credits

Halmstad 2021-06-11

Pontus Hedman, Vasilios Skepetzis

The Effect of Beautification Filters on Image Recognition:

"Are filtered social media images viable Open Source Intelligence?"

Pontus Hedman, ponhed15@student.hh.se
Vasilios Skepetzis, vasske17@student.hh.se

Halmstad University

Master in Network Forensics, 60 Credits 2021-06-11

Supervisors: Josef Bigun, Kevin Hernández Diaz, Fernando Alonso-Fernandez

Examiner: Eric Järpe

Foreword

We would like to thank our supervisors Josef Bigun, Kevin Hernández Diaz, and Fernando Alonso-Fernandez for their support and advice during the project. We would also like to thank our family and friends for their patience and support.

Abstract

In light of the emergence of social media and its abundance of facial imagery, facial recognition is useful from an Open Source Intelligence standpoint. Images uploaded on social media are likely to be filtered, which can destroy or modify biometric features. This study examines the effort required to identify individuals from their facial images after filters have been applied. The social media image filters studied occlude parts of the nose and eyes, with a particular interest in filters occluding the eye region.

Our proposed method uses a Residual Neural Network model to extract features from images, with recognition of individuals performed by distance measures on the extracted features. Individuals are also classified using a Linear Support Vector Machine and an XGBoost classifier. In an attempt to increase recognition performance for images completely occluded in the eye region, we present a method to reconstruct this information using a variation of a U-Net, and from the classification perspective, we also train the classifier on filtered images to increase recognition performance.

Our experimental results showed good recognition of individuals when filters did not occlude important landmarks, especially around the eye region. Our proposed solution shows an ability to mitigate the occlusion caused by filters through either reconstruction or training on manipulated images, in some cases with an increase in the classifier's accuracy of approximately 17 percentage points with reconstruction alone, 16 percentage points when the classifier is trained on filtered data, and 24 percentage points when both are used together. When training on filtered images, we observe an average increase in performance, across all datasets, of 9.7 percentage points.

Keywords: Face Recognition, OSINT, Machine Learning, Deep Learning, Convolutional Neural Network, Social Media Filters, U-Net, Residual Neural Network

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Related Works
  1.3 Purpose
    1.3.1 Research Questions
    1.3.2 Problematization & Positioning
    1.3.3 Scope of Research
2 Methodology
  2.1 Outline of the Study
  2.2 Data
    2.2.1 Labeled Faces in the Wild
    2.2.2 CelebA
    2.2.3 Application of Beautification Filters
  2.3 Problematization of Method
3 Theory
  3.1 Image Preparation
    3.1.1 Max-Margin Object Detection (MMOD) - Crop
  3.2 Feature extraction Model - RESNET
  3.3 Identification
    3.3.1 Identification by Distance
    3.3.2 Identification by Machine Learning Classification
    3.3.3 Evaluation - Identification
  3.4 Verification
    3.4.1 Evaluation - Verification
  3.5 Distance measures
  3.6 Image Reconstruction / Generation
    3.6.1 U-NET
4 Data Processing
  4.1 Experimental Setup
    4.1.1 The Scale of Information
    4.1.2 Image Selection and Initial Preparation
    4.1.3 Creation of Filtered Images
    4.1.4 Feature Extraction
  4.2 Results
5 Reconstruction
  5.1 Experimental Setup
  5.2 Results
6 Recognition
  6.1 Experimental Setup
    6.1.1 Identification by Machine Learning Classification
    6.1.2 Identification & Verification by Distance
  6.2 Results
    6.2.1 Identification
    6.2.2 Verification
7 Discussion
  7.1 Execution and Results
    7.1.1 Data Processing
    7.1.2 Reconstruction
    7.1.3 Identification
    7.1.4 Verification
  7.2 Future Works
8 Conclusion
9 References
A Additional Figures
  A.1 Macro and Weighted Average Results for Identification by Machine Learning Classification
  A.2 Additional Verification Results
B RAW Output From Machine Learning Classification Models
  B.1 Linear Support Vector Machine Model Output
  B.2 XGBoost Model Output
C Calculations of Macro and Weighted Average Statistics
  C.1 Averaging with macro and weighted performance

List of Tables

4.1 Distance between the eyes of all face images.
4.2 General presentation of number of images for each person, where people with less than 10 images are removed.
4.3 Number of images in each dataset of encoded images.
5.1 Number of images in each reconstructed dataset of encoded images.
6.1 Summary description of the 8 datasets further processed in the study.
6.2 Hyperparameters for the LinearSVM model.
6.3 Hyperparameters for the XGBoost model.
6.4 Micro averaged performance values (Accuracy, Precision, F1-score, and Recall) for distance-based identification.
6.5 Linear Support Vector Machine micro performance values.
6.6 XGBoost micro performance values.
6.7 Equal Error Rates as given graphically by Figures A.13, A.14, A.15, in appendix.

List of Figures

2.1 General outline of the research project.
3.1 ResNet node.
3.2 Identification.
3.3 Linear Support Vector Machine Illustration. Binary classification of two classes, in 2-dimensional space.
3.4 Illustration of OVR approach to multi-class classification, with 4 classes. Each class and its corresponding binary classifier has its own color.
3.5 Verification.
3.6 Illustration of the U-NET proposed in [48].
4.1 Example of landmarks given on a face image by the landmarks function.
4.2 Total number of images left after people with less than threshold amount of images are removed.
4.3 Discarded images due to background faces, when filters were applied to the original images.
4.4 The distribution of applied Instagram filters.
4.5 Examples of the 9 various Instagram filters pseudo randomly applied to the dataset images.
4.6 Examples of the four (4) various AR filters applied to the dataset images.
4.7 An individual from the dataset with the two sunglasses filters applied, displayed in a heatmap showcasing the first 10 pixel values to illustrate the information that is left by opacity.
4.8 The face detection rate for the feature extraction process for the various datasets.
4.9 Example images in light of failure to detect the face. Image 4.9a shows failure due to finding multiple faces. Image 4.9b shows failure due to occluded biometric features.
4.10 Class separation for the various datasets, as dimensions are reduced to 2 dimensions, by t-SNE method, on the datasets after feature extraction.
4.11 Class separation between the datasets, using t-SNE method. The macro perspective showing the clusters for all individuals, while the intra-cluster perspective is of the most frequent individual.
5.1 Illustration of our U-NET model. The blue rectangles are results of convolutions during the compression, the yellow results of convolutions during transpose convolutions, and finally, the green symbolizes the addition of blue and yellow through the add operations.
5.2 Examples of the reconstruction on the shades dataset.
5.3 The face detection rate for the feature extraction process for the reconstructed datasets shades with and without leakage.
6.1 Identification euclidean distance.
6.2 Micro averaged classification metrics for euclidean distance measure.
6.3 Micro performance of Linear SVM trained on benchmark dataset.
6.4 Micro performance of Linear SVM trained on filter dataset.
6.5 Micro performance of XGBoost trained on benchmark dataset.
6.6 Micro performance of XGBoost trained on filter dataset.
6.7 DET curve for the various filters, euclidean.
A.1 The distribution of the 158 classes and 4,291 records.
A.2 Statistics as people with images less than threshold are removed.
A.3 Macro performance of Linear SVM trained on benchmark dataset.
A.4 Weighted performance of Linear SVM trained on benchmark dataset.
A.5 Macro performance of Linear SVM trained on filter dataset.
A.6 Weighted performance of Linear SVM trained on filter dataset.
A.7 Macro performance of XGBoost trained on benchmark dataset.
A.8 Weighted performance of XGBoost trained on benchmark dataset.
A.9 Macro performance of XGBoost trained on filter dataset.
A.10 Weighted performance of XGBoost trained on filter dataset.
A.11 Micro averaged classification metrics for manhattan distance measure.
A.12 Micro averaged classification metrics for cosine distance measure.
A.13 Verification euclidean distance.
A.14 Verification Manhattan distance.
A.15 Verification Cosine distance.
A.16 Identification Manhattan distance.
A.17 Identification Cosine distance.
A.18 DET curve for the various filters, Manhattan.
A.19 DET curve for the various filters, cosine.


1 Introduction

This chapter consists of a problem statement, related works, and the purpose of this study. The aforementioned sections are then reduced into a statement of research questions, followed by a problematization and positioning of the presented questions. Lastly, the scope of our research is presented.

1.1 Problem Statement

The concept of Open Source Intelligence (OSINT), first defined in 2001 in the "NATO Open Source Intelligence Handbook" [49], is an intriguing new aspect of the modern interconnected world [17]. Recently, OSINT has been studied for its value as an evidence source due to the magnitude of information posted and made available by users worldwide [13],[18]. More specifically, social media is a big part of OSINT and has been studied accordingly [44]. Therein lies the inspiration for this research. Identifying individuals from an OSINT source may be crucial in an investigation, as crimes are captured on easily accessed devices such as smartphones and posted online. There are multiple examples of crimes being captured on mobile devices [3],[41], the most striking recent one being the use of posted videos from the US Capitol riot to identify and apprehend rioters [35].

The challenge lies in that an individual's facial features may be changed or distorted by applied image filters. There is, therefore, an interest in studying the consequences of different levels of social media effects on facial recognition systems. The different levels of effects refer to the degree of facial concealment or manipulation of contrast, which can be critical to facial recognition systems. Lastly, it would also be of interest to try to remove the filter's effect in order to increase recognition performance.

1.2 Related Works

Studies on facial manipulation through physical or digital means focus on identifying manipulated images, measuring the effect such manipulation has on facial recognition, or removing it entirely. A survey by Zheng et al. [64] clearly describes, among others, the different subtypes of Image Manipulation that can take place. Image manipulation can be split into Image Forgery and Image Steganography. As Image Steganography has the explicit goal of hiding information in an image while it appears unchanged, it is the former, Image Forgery, that is of interest for this study [64]. Image Forgery, the act of manipulation aiming to deliver misleading information through the image, can be split further into compositing, morphing, retouching, enhancing, computer generating, and computer painting [64]. This study's interest lies mainly in Image Tampering, a subterm under Image Forgery, meaning the manipulation of one or several parts of the picture, which includes most of the Image Forgery subtypes [64]. Zheng et al. [64] also describe several approaches to the detection of Image Tampering, such as SURF and the Discrete Wavelet Transform (DWT), which are also mentioned in other research. In [25], a Convolutional Neural Network was used in the hope of detecting GAN-generated (altered) images, and [5] used semi-supervised autoencoders in combination with an SVM to detect retouching.

In [60], the authors describe the issue of forged images and forgery detection, compare different approaches from the past, and propose a model that uses DWT to transform the image into four (4) more distinct sub-images, which are then fed to the SURF algorithm to acquire features for a Support Vector Machine (SVM) that finally performs binary classification between forged and unforged images. A similar approach was taken by Rathgeb et al. [47] to detect image retouching, where they proposed a Multi-Biometric Approach to the issue. The approach consisted of extracting characteristics of interest from the picture (features, texture, and deep features) and feeding them into multiple SVMs, combining the resulting predictions using a weighted score-level function.

The impact of image retouching on face recognition was briefly examined in [47], showing the relative robustness of widely used face recognition systems against this kind of tampering. Similar conclusions on the relatively small impact of facial retouching on face recognition were drawn using different face recognition systems in [4]. The authors also proposed a Supervised Restricted Boltzmann Machine (SRBM) to detect said retouching [4]. Ferrera et al. [14] examine image alterations, concluding that while most face recognition systems may overcome limited alterations (slight simulated surgical enhancement), they will stumble on more heavily manipulated images (heavy simulated surgical enhancement). In [46], an overview is given, grounded in earlier studies on the effect of plastic surgery and cosmetics on face recognition, and analogies are drawn with the digital version of these alterations. Antitza et al. [10] studied the effects various kinds of makeup have on several face recognition systems, finding an increase in errors when testing against pictures with makeup.

Finally, the effects of partial occlusion on face recognition have also been studied. In [39], the effect of occlusion on facial recognition was examined using Linear Regression Classification (LRC) and Principal Component Analysis (PCA) on pictures occluded by sunglasses and scarves. In [30], a probabilistic method was presented to combat the effect of partial occlusion. For the same purpose, Song et al. [56] used a Pairwise Differential Siamese network for mask learning. Furthermore, efforts to remove elements that produce occlusion were made in [61] with Cycle Consistent Generative Adversarial Networks, and in [43] with PCA for removing eyeglasses to improve facial recognition.

1.3 Purpose

This research is done from a law enforcement perspective, intending to help apprehend criminals and safeguard the general population. We seek to study the impact that the use of beautification filters, or similar obfuscation techniques, has on the automated process of recognizing individuals in image format, using facial recognition models.

1.3.1 Research Questions

We reduce our problem statement and stated purpose into the following questions:

1. Performance of recognition

(a) Does our choice of beautification methods (social media filters) affect the recognition performance?

2. Attempts of increasing performance due to image manipulation

(a) Does training with synthetic versions of predicted filters increase performance?

(b) Can the manipulation of the image be reversed (nullified)?

i. If so, does the recognition performance increase?

1.3.2 Problematization & Positioning

Question (1a) aims to create the benchmark of our chosen models of recognition. Three approaches will be taken for this purpose, given a feature vector produced by state-of-the-art (SOTA) convolutional neural networks: (1) a classification model trained through machine learning, (2) a verification model, and (3) an identification model based on distance measures. The three approaches fulfill different purposes. The classification model is less likely to be seen in a real-world application of a recognition system, as it needs to be retrained for every new addition to a database. Although unlikely in practice, it provides the best performance, as it further trains on the extracted feature vector and can therefore become the harshest "critic". The distance-based verification and identification approaches are more common and will present a more realistic view. Identification would be the approach used when searching a database for a match given a picture and its distance from the registered individuals. In this aspect, the approach is similar to classification but without the extra training, and is therefore a more practical approach to the addition of new identities to the database.

After establishing the benchmark, question (2a) aims to define the chosen social media filters' effect on recognition. Although the ideal solution would be to use the social media applications' own filter techniques, the APIs of social media platforms do not provide such functionality. Even if the functionality existed, the sheer number of pictures makes it unlikely that the platforms would allow such usage. Therefore, we recreate the filters ourselves, which may introduce unexpected bias to the model. Furthermore, the choice of filters is based on the most popular or common filters, such as the dog filter, a digital representation of a dog's nose applied over the nose area of a human face.

The manipulation of the picture is not an overlapping but a merging process, meaning that it is impossible to peel off the filter in order to recover the original picture. Therefore, question (2b) is an issue of generating what is obfuscated by the filter by use of CNNs, which by itself introduces bias. After what blueprint does the generation occur? This is an important consideration. A model mainly trained on Caucasians will tend to generate features resembling average Caucasian features, whether bone structure or skin hue. Of course, this inherent bias may be desirable if a community consists, in the vast majority, of Caucasian citizens, where a more general model may produce worse performance. It is also imperative to note that the generation may fail entirely or create an average human face, meaning that in the first case the generated image is unusable, while in the second case the resulting face must not be considered conclusive and may at most provide an indication or lead to the individual's identity. Generative algorithms may also be sensitive to noise in the image and can therefore be fooled.

1.3.3 Scope of Research

The focus of this thesis lies on the effect that the chosen types of filters have on recognition performance. The results of this study should be considered in light of the choice of filters and how they occlude the face; filters occluding other parts of the face, or occluding it in a different manner than presented in this study, are not comparable. The study does not examine the detection of faces in pictures, nor does it examine in detail the process of extracting the feature vector from a given image.


2 Methodology

This chapter details the overall methodology of this study. The general structure of the thesis is presented, and in the following subsections, the data is described. Lastly, the proposed research is problematized and examined from an ethical perspective. The theoretical background of the models and methods used is presented in the following chapter, 3. Theory.

2.1 Outline of the Study

This thesis begins with an initial examination of the existing literature on facial recognition. The selection of models for this study is based on the intersection between publications of state-of-the-art models, as implemented in public libraries available in programming languages such as Python, and models which we were able to implement with the limited resources of the project. Within this scope, an experimental study is constructed, the setup and results of which are presented in each of the result chapters 4. Data Processing, 5. Reconstruction, and 6. Recognition. See Figure 2.1 for a flowchart illustration of the outline of the study.

2.2 Data

The data in this study consists of face imagery. This section details the datasets of images used in the study and the method of creating the “beautified” filtered images from the original dataset. The detailed processing and use of the data are presented in chapter 4. Data Processing.

It is important to note that the facial images from the dataset presented in this report are pixelated. This is done to avoid a potential breach of copyright or infringement on the depicted individuals' integrity. The purpose of the illustrations presented in this study is only to show how we perform the study and what our results are.

2.2.1 Labeled Faces in the Wild

The main dataset used in this paper is the Labeled Faces in the Wild (LFW) dataset [22],[28],[23]. The dataset was created at the University of Massachusetts by Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller [22], and is made available for academic purposes1. It consists of 13,233 pictures of 5,749 people, with 1,680 individuals having two or more pictures. We make use of the updated version with faces aligned by funneling, produced in [23].

1The dataset is made available here: http://vis-www.cs.umass.edu/lfw/


These pictures initially shape the database for the distance-based verification and identification scenarios, as well as train and test the classifier, thus creating the benchmark against which the following results will be measured. Afterwards, datasets of these same pictures are created with the addition of beautification filters; see Section 2.2.3 Application of Beautification Filters for further detail.


Figure 2.1: General outline of the research project.


2.2.2 CelebA

The CelebA dataset was produced by the authors of [31], originally to study facial features, and is used here to train the generative model discussed in a later chapter. The training and validation of the generative model is not performed on the main dataset (LFW) because of the limited data it provides in contrast to the CelebA dataset (totaling a little over 202,000 pictures). The testing can still be performed on the LFW dataset, as the main point of the testing is to use unseen data to document the model's ability to generalize and thus create the reconstructed images used in comparison to the benchmark and filtered images for the fourth research question.

2.2.3 Application of Beautification Filters

This study requires a significant number of images to derive statistically significant results, thus the reliance on a dataset of images. To our knowledge, there are no available datasets of images with applied beautification filters of the kind commonly seen on social media applications. The creation of such a dataset would ideally be made through the use of these particular applications. However, our initial research shows that social media sites such as Instagram or Snapchat do not publicly offer APIs or other means for us to apply their filters to large amounts of images. This leads us to recreate these filters and apply them to our images ourselves. An alternative approach would be to directly create our whole dataset from the various social media sites, abandoning the use of a large dataset. However, this would be difficult to scale to accommodate the great number of images desired. There would also be ethical concerns with storing face data and incentivizing people to participate and share it with us. Furthermore, a public and already used dataset allows for comparisons of the results with other studies, an important aspect of academic research, allowing for duplication of existing results as well as framing new results against earlier efforts.

Contrast & Lighting

Instagram is a social media platform centered around the sharing of images. The service offers several ways to filter the user's images, to "beautify" them before upload. A study on another platform, FLICKR, found that filtered images are more likely to be viewed and commented on, thereby achieving higher engagement on social media platforms [2]. The filters offered on Instagram are varied. We focus on the majority of filters, which change the contrast and lighting of the image. This type of filtering is arguably what is most associated with the service today. The number of available filters is high, and we choose to use the 10 most "popular" "selfie filters" according to Canvas.com [36]. The ranking is based on the number of images with a particular filter, using the hashtag "#selfie". Hashtags are tags on the service, where users themselves categorize their uploaded images.

We recreate the Instagram filters by the use of the Instafilter library in Python [21]. This library uses a four-layer fully connected neural network to learn the RGB, lightness, and saturation changes of various Instagram filters [20].


Augmented Reality

The application of augmented reality (AR) filters on static, already captured images can be a difficult task. Our use of already captured images reduces our possibilities of utilizing these types of filters, the most evident being the types of filters applied during the actual capture process. We choose to recreate these filters based on two (2) factors. The first is the "type" of filters we observe on various platforms, examples being social media platforms such as Snapchat and the conference application Zoom; both offer the "adding of information" to the image. The second factor is the type of filter we are capable of introducing to a static image, given the positioning of the person in the image. We use augmented reality filters that obfuscate parts of the face, expecting to make identification more difficult. Thus, our intention in adding filters is to "hide" key areas of the face, such as the eyes and nose area.

Based on the face areas we wish to obscure, we choose four (4) images to apply to the face images. These are images of a "Dog nose", "Transparent glasses", "Sunglasses with slight transparency", and "Sunglasses with absolutely no transparency". The images are scaled to size and applied to the face images using the landmarks given by the "face landmarks" function in the Python library "face recognition" [16].
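To make this procedure concrete, the following is a minimal, hedged sketch of such an overlay step in Python. The sizing and placement rules are simplified compared to the ones used in the study (which also use eyebrow height and nose bridge quartiles), and the helper function and file paths are illustrative assumptions; only the face_recognition landmark calls come from the library referenced above.

```python
# Hedged sketch: pasting a PNG filter (e.g., sunglasses) onto a face image
# using landmarks from the face_recognition library. Placement is simplified:
# width is scaled to the eyebrow span and the overlay is centered on the
# middle of the nose bridge.
import face_recognition
from PIL import Image

def apply_overlay(face_path, overlay_path, out_path):
    image = face_recognition.load_image_file(face_path)      # RGB numpy array
    landmarks = face_recognition.face_landmarks(image)
    if len(landmarks) != 1:                                   # discard multi/no-face images
        return False
    marks = landmarks[0]
    brows = marks["left_eyebrow"] + marks["right_eyebrow"]
    xs = [p[0] for p in brows]
    width = max(xs) - min(xs)                                  # total eyebrow width

    overlay = Image.open(overlay_path).convert("RGBA")
    scale = width / overlay.width
    overlay = overlay.resize((int(overlay.width * scale), int(overlay.height * scale)))

    bridge = marks["nose_bridge"]
    cx, cy = bridge[len(bridge) // 2]                          # center of the nose bridge
    base = Image.fromarray(image).convert("RGBA")
    dest = (max(0, cx - overlay.width // 2), max(0, cy - overlay.height // 2))
    base.alpha_composite(overlay, dest)                        # merge, respecting alpha
    base.convert("RGB").save(out_path)
    return True
```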

2.3 Problematization of Method

The intent of helping law enforcement in apprehending suspects is pure. However, we recognize that “The road to hell is paved with good intentions”. Technology such as this can and has been used to oppress people, and enhancing the capabilities of such technology could potentially do more harm than good. Every advancement in science, however small, can be abused. Therefore, even our findings should be used with caution, transparency, and responsibility.

The datasets are used for a purely academic purpose, as stipulated by the original creators of the dataset. The pictures are not used in any way other than the one intended by them, and the data is not further distributed in any way. The example images presented in this report are pixelated to reduce possible infringement on the integrity of the individuals depicted in the dataset. Do note that the actual images processed for the study are not pixelated; this is only for the written report.

The use of generative models, as suggested in this paper, and more importantly in the context of law enforcement, should be approached with extreme caution. The model generates pictures based on the dataset that was used to train it. This is the first and foremost source of bias and error in the recreation. Any application of this form should not be autonomous but at most used as a helpful lead under the critical judgment of an informed user. It would be tragic if a picture created by this kind of model, with its bias and possible errors, were to be used as the sole or primary evidence that places someone in jail.


3 Theory

This chapter details the principal methods, such as models and frameworks, utilized to derive results. The practical implementation, use, and application of each presented model are further detailed in the experimental setup section of each results chapter.

3.1 Image Preparation

The data preparation step is an essential part of any experimental study, even more so when the data studied consists of images. Image data is stored in an encoded format; thus, working with this type of data requires methods to represent it in a format acceptable to the various other models used to study it. In this study, the image preparation stage consists of extracting the faces from the given pictures with a CNN model. This is done to reduce noise in the background of the images. The produced face images are then scaled to a common standard size.

3.1.1 Max-Margin Object Detection (MMOD) - Crop

The detection and extraction of the individual faces are performed through the Max- Margin Object Detection presented by Davis E. King in [27]. The MMOD algorithm uses the Max-Margin approach in the optimization part, intending to select only the sub-images most likely to be a certain label, in this case, a face [27]. As this method uses CNNs, it is not ideal for real-time applications without significant resources, but it performs outstandingly compared to more traditional methods such as HOG [27].

3.2 Feature extraction Model - RESNET

Figure 3.1: ResNet node

When working with human faces, features of the individual's face are extracted. Thus, a given image is encoded into a feature vector, suitable for further study and for input into the presented models.

This study makes use of a Residual Neural Network (ResNet) model for the feature extraction process, as outlined in [19]. A residual network is a solution to the degradation problem: normally, as the network depth increases, the accuracy saturates and finally degrades [58]. A residual network boosts performance by avoiding the degradation problem through a combination of identity mapping and residual mapping. The innovative part is the introduction of the residual mapping, as shown in Figure 3.1. It allows the network to pass information from the "past" to complement the next operation, providing stability.
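As an illustration of this identity-plus-residual idea (not the feature extraction network actually used in this study), a single residual block can be sketched as follows; the Keras framework and layer sizes are assumptions made purely for the example.

```python
# Illustrative sketch: a residual block computing y = F(x) + x, i.e., the
# residual mapping F(x) added to the identity mapping of the input.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                                  # identity mapping ("the past")
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)              # residual mapping F(x)
    y = layers.Add()([shortcut, y])                               # F(x) + x
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(64, 64, 32))
outputs = residual_block(inputs, 32)
model = tf.keras.Model(inputs, outputs)
model.summary()
```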

3.3 Identification

In biometrics, identification and verification are central concepts, as explained in [62]. Verification is presented in Section 3.4, while identification is presented below, both as a straightforward distance-measurement approach and as a multi-class classification problem in Section 3.3.2. In contrast to verification, identification involves no claim of identity; the system itself determines the identity.

3.3.1 Identification by Distance

Identification in closed-set applications aims to identify the individual based on how "close" their presented biometric signal is to the expected stored signal, see Figure 3.2. The identification process works by comparing the given signal to all registered individuals to ascertain the individual's identity [62]. This approach only requires one (1) input: the biometric signal of the present user, e.g., the face image.

Figure 3.2: Identification
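A minimal sketch of this closed-set, distance-based identification is shown below. The gallery structure and variable names are illustrative assumptions; the principle is simply that the probe encoding is assigned the identity of its nearest enrolled encoding.

```python
# Minimal sketch of closed-set identification by distance: compare the probe
# encoding against every enrolled encoding and return the closest identity.
import numpy as np

def identify(probe, gallery):
    """probe: 128-d feature vector; gallery: dict {identity: list of 128-d vectors}."""
    best_id, best_dist = None, np.inf
    for identity, encodings in gallery.items():
        dist = min(np.linalg.norm(np.asarray(e) - probe) for e in encodings)
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id, best_dist

# toy example with random encodings standing in for real feature vectors
gallery = {"alice": [np.random.rand(128)], "bob": [np.random.rand(128)]}
print(identify(np.random.rand(128), gallery))
```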

3.3.2 Identification by Machine Learning Classification

The identification problem can also be approached by use of a classifier trained through machine learning. Classification is a supervised learning technique within the field of machine learning. The task requires ML algorithms that learn how to assign a class label to events; thus, a prerequisite for training the prediction models is that the ground-truth labels of events are known.

This thesis approaches the task as multi-class classification of face images and their corresponding names/IDs.

An important consideration, which allows the use of classification models to answer this study's problem, is that our data is structured. The structured data is acquired in the feature extraction phase.

Support Vector Machine

The Support Vector Machine (SVM) classifier is a well-known and established machine learning method, originally authored by Vladimir Vapnik [9]. SVM is a linear model capable of handling both classification and regression problems. Depending on the kernel chosen as a hyperparameter, the model is capable of solving both linear and non-linear problems.

The algorithm works by creating a hyperplane (a line in two dimensions) which separates data into classes. The algorithm's goal is to find the hyperplane which most distinctly classifies the data points within their clusters. The model works in N-dimensional space, where N is the number of features in the data. However, the model is best illustrated in 2-dimensional space.

In 2-dimensional space, we look at separating two classes with a line (the hyperplane for 2-dimensional space). The objective is to find the maximum margin to each cluster of data points (the two classes). This is done via support vectors, the data points of the respective cluster closest to the hyperplane. The distance between the support vectors and the hyperplane is maximized, which achieves the class separation. The support vectors influence the position and orientation of the hyperplane. The choice of margin, whether soft or hard, between the hyperplane and the support vectors drastically affects the position of the decision boundary (the hyperplane). A hard margin means that few (if any) data points are allowed inside the support vector boundary, whereas a soft margin allows certain data points to be ignored where the support vector touches the cluster data points. This means that a soft margin is less affected by noise than a hard margin, but also becomes increasingly less accurate as the number of ignored data points increases. See Figure 3.3 for an illustration of the 2-dimensional case for a linear support vector machine with two (2) classes.

Figure 3.3: Linear Support Vector Machine Illustration. Binary classification of two classes, in 2-dimensional space.

(22)

As the number of input features increases, so does the dimensionality of the decision boundary. Thus, such a system is hard to illustrate when the data contains more than three (3) features.

At its core, SVM works by separating data points into two classes; it does not natively support multi-class classification. However, multi-class classification can be achieved by treating the problem as multiple binary classification problems. The approach used in this study is the One-vs-Rest (OVR) approach [1], where each class independently gets its own binary classifier, relying on the assumption that the classification of one class is independent of the classification of all others. In the OVR approach, a hyperplane separates one class from all the other classes, with all points taken into consideration; e.g., an attempt is made to separate all images of Alice from every other person in the set, and then the same for Bob, Carol, and so on for all people (classes). See the following Figure 3.4 for an illustration with four (4) classes. Each class has its own color, and the line (hyperplane) of the corresponding color is the binary classifier separating that class from the rest.

Figure 3.4: Illustration of OVR approach to multi-class classification, with 4 classes.

Each class and its corresponding binary classifier has its own color.
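As a concrete illustration, the sketch below trains a linear SVM with the One-vs-Rest strategy on synthetic 128-dimensional vectors standing in for face encodings. It uses scikit-learn's LinearSVC; the hyperparameters here are placeholders, not the values used in this study (those are given in Table 6.2).

```python
# Minimal sketch of multi-class identification with a linear SVM (One-vs-Rest),
# on stand-in 128-d face encodings. Data is synthetic for illustration only.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X = np.random.rand(400, 128)             # stand-in for 128-d face encodings
y = np.random.randint(0, 4, size=400)    # 4 identity classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LinearSVC(multi_class="ovr", C=1.0, max_iter=10000)   # one binary classifier per class
clf.fit(X_train, y_train)
print(clf.predict(X_test[:5]))            # predicted identity labels for five probes
```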

XGBoost

The Classifier Extreme Gradient Boosting (XGBoost) is chosen for this study, due to its wide adoption in the industry and proven record of being an efficient and flexible model for real-world applications. The model has won first place in multiple contemporary challenges within the field of Machine learning [8].

The XGBoost model utilizes gradient boosting in its algorithm [7]. Boosting is a technique where an ensemble of multiple weak classifiers, typically decision trees, is created. Weak or strong in this context refers to how correlated the learners are with the actual target label. Together, the multiple models create a stronger model in the aggregate, where the learners are trained sequentially. Gradient boosting adds to the concept of boosting: instead of assigning different weights to the classifiers at each training iteration, gradient descent is utilized to minimize the loss when updating and fitting new models.

For solving multi-class classification problems, XGBoost produces decision trees for each class [37]1. Each tree classifier produces a margin score for the corresponding class, and the class with the highest margin score is chosen as the most likely candidate. A softmax function is used to give the probability distribution over the list of classes [63]2. Thus, the output, otherwise a vector of real values, is transformed into a vector of real values that sum to 1 [55].
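The following sketch shows a multi-class XGBoost classifier with the multi:softprob objective on synthetic stand-in encodings; the hyperparameters are placeholders rather than the values listed in Table 6.3.

```python
# Minimal sketch of multi-class classification with XGBoost: the multi:softprob
# objective yields a per-class probability distribution (rows sum to 1).
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(300, 128)              # stand-in for 128-d face encodings
y = np.random.randint(0, 5, size=300)     # 5 identity classes

clf = XGBClassifier(objective="multi:softprob", n_estimators=50, max_depth=4)
clf.fit(X, y)

proba = clf.predict_proba(X[:3])          # probability distribution over the classes
print(proba, clf.predict(X[:3]))          # the argmax gives the most likely identity
```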

3.3.3 Evaluation - Identification

When evaluating the identification case without the use of a classifier, distance measures, described in Section 3.5, can be used to examine the resemblance of two samples. The rate of true positive identifications is then illustrated in a Cumulative Match Characteristic (CMC) curve for a number in [1, n] of best matches [5],[59] to measure performance.

The machine learning evaluation techniques for classification presented below are also used when evaluating distance-based identification, to enable a fairer comparison between the performance of the two approaches.

When using a classifier, the evaluation follows the usual evaluation of a multi-class classification problem. Machine learning classification metrics are derived from the confusion matrix (also known as the error matrix), a table layout that presents and tabulates the performance of a supervised learning algorithm. The metrics are derived by observing and counting the true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) [29].

Accuracy

Accuracy is an easy metric to interpret: it is the fraction of predictions that are correct. However, a high accuracy does not necessarily mean that the classifier performs well; the metric says nothing about whether FNs or FPs are more common [29]. An example of a poor application of this metric is when the classes in a dataset are unbalanced: predicting that every instance belongs to the majority class would show a misleadingly high accuracy.

1Comment made by the lead maintainer of the XGBoost library used in this study, see https://github.com/hcho3

2Note that the implementation of XGBoost in this study automatically sets the objective function to multi:softprob, which is similar to softmax, when the input to the classifier is multi-class.

\[ \mathrm{Accuracy} = \frac{\text{Correct classifications}}{\text{Number of classifications}} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.1} \]

Precision

Precision, also known as predictive value, captures both TPs and FPs and is a measurement of the proportion of predicted positives that are correct. However, neither TNs nor FNs are captured [29]. "A very conservative test that predicts only one subject will have the disease — the case that is most certain — has a perfect precision score, even though it misses any other affected subjects with a less certain diagnosis" [29].

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{3.2} \]

Recall

The recall measurement, also called the true positive rate, is a useful metric for understanding FNs. This measurement is the proportion of known positives that are predicted correctly. However, neither TNs nor FPs are captured in this metric. A classifier that predicts all data points as positive would show a high recall [29].

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{3.3} \]

F1-Measure

F1-Measure is an aggregate metric that seeks to present a complete summary of the confusion matrix. This metric balances recall and precision equally. As the F1-Measure metric is based on Precision and Recall, it does not capture TNs [29].

\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \tag{3.4} \]

Averaging with micro statistic

Averaging is the aggregate statistic used to represent performance in the multi-class case. The above calculations for accuracy, precision, recall, and F1-measure use FP, TP, FN, and TN counts for each class. This study examines many classes (158), and the classification performance cannot effectively be presented for each of the 158 classes. Thus, performance results are presented as an average metric.

In the micro-average case, the TP, FP, FN, and TN are calculated for each class. The final TP is then the sum over all classes, such that $TP = TP_1 + TP_2 + \dots + TP_n$, and likewise for FP, FN, and TN. When precision, recall, accuracy, and F1-measure are calculated, these total sums are used in the calculations, as presented above for the respective performance metrics.
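A small worked example of micro averaging, using scikit-learn's average="micro" option on synthetic labels, is given below; note that for single-label multi-class data the micro-averaged precision, recall, and F1 all coincide with accuracy.

```python
# Minimal sketch of micro-averaged metrics: per-class TP/FP/FN are summed
# before the metric is computed, which is what average="micro" does.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2, 2, 3]   # synthetic ground-truth class labels
y_pred = [0, 1, 1, 1, 2, 0, 2, 3]   # synthetic predictions

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```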


For calculations of macro and weighted performance, see Appendix C.1.

3.4 Verification

Verification is the process of comparing an individual's biometric signal (e.g., face or fingerprint) against the corresponding biometric signal stored in a database to verify the claimed identity [62]. Thus, verification applications need two (2) inputs: the identity that the individual claims to have, and the biometric signal presented at the time, see Figure 3.5. Evaluation of verification must allow quantification of errors from both the legitimate client's perspective and the impostor's.

Figure 3.5: Verification

3.4.1 Evaluation - Verification

Similarly to the identification case, the verification system makes its decision based on the distance measures described in Section 3.5. The evaluation of verification consists of comparing the presented signal (e.g., the face) against the one stored in the database corresponding to the claimed identity. The matching error is computed; if it is above a threshold, the presenter is considered an impostor, otherwise a client. Evidently this decision can be wrong: when this happens to the legitimate user (client), it is called a false rejection, and if the machine erroneously decides that an impostor is a client, it is called a false acceptance. The verification evaluation consists of quantifying the false-rejection and false-acceptance rates. These are computed by counting the rate of images falsely verified as true, known as the False Match/Acceptance Rate (FAR), and the rate of images falsely rejected as false, known as the False Non-Match/Rejection Rate (FRR), at different thresholds according to the selected distance measure [24]. The point of equilibrium between the two errors, FAR and FRR, known as the Equal Error Rate (EER), provides the optimum threshold for a balanced application [11]. Other applications may have different needs; for example, security applications may require lower FAR levels. The EER can also be used to compare the performance of systems against each other. Another way to find the EER is a Detection Error Tradeoff (DET) curve [62], used to illustrate the tradeoff between FAR and FRR for every filter. The EER can be found at the intersection of the diagonal passing through zero and the curves.
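A minimal sketch of this evaluation is given below: a distance threshold is swept over synthetic genuine and impostor comparison scores, FAR and FRR are computed at each threshold, and the EER is read off where the two curves meet. The score distributions are invented for illustration only.

```python
# Minimal sketch of FAR/FRR computation over a threshold sweep and the EER
# at the point of equilibrium between the two error rates.
import numpy as np

genuine = np.random.normal(0.4, 0.1, 500)    # distances for genuine (client) comparisons
impostor = np.random.normal(0.8, 0.1, 500)   # distances for impostor comparisons

thresholds = np.linspace(0, 1.2, 600)
far = np.array([(impostor <= t).mean() for t in thresholds])  # impostors falsely accepted
frr = np.array([(genuine > t).mean() for t in thresholds])    # clients falsely rejected

idx = np.argmin(np.abs(far - frr))           # point of equilibrium (EER)
print(f"EER ~ {(far[idx] + frr[idx]) / 2:.3f} at threshold {thresholds[idx]:.3f}")
```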


3.5 Distance measures

The following distance measures are utilized in this study in the calculations for the dissimilarity between identities.

Euclidean distance

\[ d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{3.5} \]

Manhattan distance

\[ d(p, q) = \sum_{i=1}^{n} |p_i - q_i| \tag{3.6} \]

Cosine distance

\[ \mathrm{similarity} = \frac{\sum_{i=1}^{n} A_i \cdot B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}} \tag{3.7} \]

\[ \mathrm{distance} = 1 - \mathrm{similarity} \tag{3.8} \]
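For reference, the three measures can be computed on two feature vectors with SciPy as in the following sketch (SciPy's cosine function implements Equation 3.8, i.e., one minus the cosine similarity; the Manhattan distance is called cityblock there).

```python
# Minimal sketch of the three distance measures applied to two 128-d vectors.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

p = np.random.rand(128)
q = np.random.rand(128)

print("euclidean:", euclidean(p, q))   # Eq. 3.5
print("manhattan:", cityblock(p, q))   # Eq. 3.6 (SciPy's name is cityblock)
print("cosine:   ", cosine(p, q))      # Eqs. 3.7-3.8: 1 - similarity
```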

3.6 Image Reconstruction / Generation

Following the increase in computing power available in recent years, deep learning has become more common. It has been used in multiple applications, image generation by generative models being one of them. The most commonly known are deepfakes, a combination of the word "deep" from deep learning and the word "fake". They usually consist of a combination of autoencoders, as generators, trained against a discriminator (a CNN) in an adversarial manner [38].

3.6.1 U-NET

The U-Net was originally presented in [48] as a solution for biomedical image segmentation with a low requirement for training data that would outperform more complex networks on the same task. The network consists of a combination of convolutions with max-pooling to compress the input, while creating links with cropped residual information from previous layers that is combined with the upsampling-convolution blocks during decompression, as can be seen in Figure 3.6. This gives the network a U-shape, hence its name [48]. The findings in [48] show that the model performed better and faster than its competitors while training, through extensive augmentation, on the limited data that characterizes the field.

3As described in the scipy function used in the code: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html. If $A_i, B_i \in \mathbb{R}$, then by the Cauchy-Schwarz inequality $-1 \le \mathrm{similarity} \le 1$, so that the distance in this software is restricted to $\mathrm{distance} \in [0, 2]$.


Figure 3.6: Illustration of the U-NET proposed in [48]


4 Data Processing

This chapter details the data processing steps conducted on the original data to enable the later chapter’s Reconstruction and Recognition. The original data is analyzed and selected, filters are created and applied to images, and features are extracted from images to represent data in an appropriate format for further study.

4.1 Experimental Setup

This section presents the methods used to produce our results.

4.1.1 The Scale of Information

To convey the scale of the information used in each image, we provide the measured distance, in pixels, between the eyes in each image. This measurement gives insight into the amount of information in the face area of the image, particularly around the eyes.

4.1.2 Image Selection and Initial Preparation

The original dataset consists of 5,749 individual people, totaling 13,233 images. The choice was made to remove the people, and their images, with fewer than ten (10) images each, so that there would be a sufficient number of images for each person.

Each image is cropped from the original size of 250x250 pixels to 145x145 pixels around its center, based on the realization that the images in the dataset are already centered around this area. This method reduces the scale of the data, which lowers the processing requirements in later steps, and removes background faces, which may have further complicated the feature extraction portion of this study.
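A minimal sketch of this center crop, assuming the image is already loaded as a NumPy array, is shown below; the helper name is ours.

```python
# Minimal sketch of the 250x250 -> 145x145 center crop.
import numpy as np

def center_crop(image, size=145):
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

image = np.zeros((250, 250, 3), dtype=np.uint8)   # placeholder 250x250 image
print(center_crop(image).shape)                   # (145, 145, 3)
```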

4.1.3 Creation of Filtered Images

Five (5) filter datasets are created based on the cropped images of the original dataset. The landmarks function used for applying the AR filters requires finding all faces in the input image. If the number of faces found by the function is more than one (1), the image is discarded. Furthermore, if the function does not find a face at all, the image is also discarded.

The Instagram filter images are created using the Instafilter library [21]. For each image, one (1) of the nine (9) most popular "selfie" filters is randomly applied1. Randomization is achieved by use of the Python pseudorandom function choice from the built-in library random.

1Note that one of the top 10 filters, as detailed in the method chapter, is no filter.
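A hedged sketch of this filtering step is shown below. It combines random.choice with the Instafilter library as described above; the filter list mirrors the nine filters shown later in Figure 4.5, and the exact Instafilter API may differ between library versions.

```python
# Minimal sketch of the pseudo-random Instagram-style filtering.
import random
from instafilter import Instafilter
import cv2

FILTERS = ["Aden", "Ashby", "Dogpatch", "Gingham", "Hudson",
           "Ludwig", "Skyline", "Slumber", "Valencia"]

def apply_random_filter(in_path, out_path):
    name = random.choice(FILTERS)       # pseudo-random filter selection
    model = Instafilter(name)           # loads the learned colour/lightness transform
    filtered = model(in_path)           # returns the filtered image as a BGR array
    cv2.imwrite(out_path, filtered)
    return name
```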


The augmented reality filters are applied to the images based on calculations of the region of interest in the image for the particular filter. The face landmarks function from the face recognition library [15] gives the landmarks of the face. The application of, e.g., shades is thus done by scaling the original shades image to the appropriate size, calculating the new width and height based on the total width of the eyebrows, the height between the eyebrows, and the third quartile of the nose bridge. The filter image is then applied, by merging, at the appropriate position of the original image, calculated from the center position of the nose bridge. See the following Figure 4.1 for an example of the visible landmarks on a face image.

(a) Original image (b) Landmarks image

Figure 4.1: Example of landmarks given on a face image by the landmarks function.

4.1.4 Feature Extraction

The face of the person(s) in each image is extracted with the face recognition library's face locations function, with the model specified as CNN rather than the default HOG model. Some images produce multiple faces; this is the recognition of faces in the background2. Each original image that produces more than one (1) face is skipped, so neither it nor its faces are used. Furthermore, in some images the face is not found at all, increasingly so for the filtered images. Together, these failures constitute the face detection error for the images.

The found and selected face image is then downsized to an arbitrary 64x64 image size to ease the storage and processing requirements in later parts of the study. The resize is done with the OpenCV library (cv2) [40] resize function, specifying the interpolation as INTER_LANCZOS4.

The found face image, given by the face locations function, is encoded into a feature vector by a pre-trained ResNet with 29 layers, which was trained using a combination of the VGG dataset, the face scrub dataset, and a large number of pictures that the author scraped from the internet [26]. The encoding is made with the face recognition library's face encoding function. This function is forced to consider the whole found face image as the individual's face, thus not prompting the function to try to find the regions of the face a second time. The resulting feature vector is a 128-dimensional array of values representing a compressed version of the most important aspects of the image, as produced by the ResNet.

2Do note that this function is different from face landmarks. The CNN model used to find the face in this step is more sensitive; the landmark function is sufficient for applying filters. The face encoding function, later used to encode the face image, requires a face as input, and the input image to this function needs to be as certain as possible to be the correct person (not including background noise). For this reason, the face locations function is called in order to use the CNN model to find the face, and the face encoding function is then forced to consider the found face when encoding. If this is not done, there is a risk of encoding an image that is a poor representation of the individual.
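Putting the steps of this subsection together, the following is a hedged sketch of the per-image extraction: CNN-based face location, 64x64 Lanczos resize, and encoding with the detection forced onto the whole resized crop. The function names come from the face_recognition and OpenCV libraries referenced above; the wrapper itself is an illustrative assumption.

```python
# Hedged sketch of the feature-extraction step described above.
import cv2
import face_recognition

def encode_face(path):
    image = face_recognition.load_image_file(path)
    boxes = face_recognition.face_locations(image, model="cnn")   # CNN, not default HOG
    if len(boxes) != 1:                         # skip images with zero or multiple faces
        return None
    top, right, bottom, left = boxes[0]
    face = image[top:bottom, left:right]
    face = cv2.resize(face, (64, 64), interpolation=cv2.INTER_LANCZOS4)
    # pass the full resized crop as the known face location so the encoder
    # does not search for the face a second time
    encodings = face_recognition.face_encodings(face, known_face_locations=[(0, 64, 64, 0)])
    return encodings[0] if encodings else None   # 128-d feature vector
```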

4.2 Results

The following Table 4.1 shows the measured distance between the eyes for all the original images. As mentioned in Section 4.1, this measurement is provided so that future researchers can better understand the size of the face and the pixel detail of the studied images.

Table 4.1: Distance between the eyes of all face images.

Mean Std Q1 Median Q3

40.93 4.17 38.18 40.50 43.18

Figure 4.2 shows the number of images left in the total dataset as the individuals with fewer images than the threshold are removed. The choice of removing all people, and their images, with fewer than ten (10) images results in a dataset of 158 individuals, totaling 4,324 images.

Figure 4.2: Total number of images left after people with less than threshold amount of images are removed.


Further information about the change to the distribution of pictures is shown in Figure A.2 in the appendix. The figure shows various statistics as people with fewer images than the threshold are removed; it visualizes the distribution of images per person when considering various thresholds for the minimum number of images per person. The selected threshold of 10, where each person with fewer than 10 images is removed, results in each person having on average 27.37 images, see Table 4.2.

Table 4.2: General presentation of number of images for each person, where people with less than 10 images are removed.

Total number of images  Mean   Std    Q1    Median  Q3
4,324                   27.37  47.45  12.0  17.0    26.0

When the filters are applied with the face landmarks function, three (3) original images produce multiple faces (background faces)3 and are thus discarded, see Figure 4.3. Of the remaining images, 30 were discarded due to the function not finding a face at all. This results in images of 158 people, totaling 4,291 images, for each dataset: benchmark, Instagram, dog nose, glasses, shades (leakage), and shades (no leakage). Note that this is not the final dataset distribution for the encoded images; these are the numbers of images to which we successfully applied filters.

(a) Jean Charest 0004 (b) Julianne Moore 0001 (c) Julianne Moore 0008

Figure 4.3: Discarded images due to background faces, when filters were applied to the original images.

3Two (2) of these three (3) images are duplicates in the original dataset; "Julianne Moore 0001" and "Julianne Moore 0008" are the same image.


The following Figure 4.4 shows the distribution of the pseudorandom application of the top 9 "selfie" Instagram filters.

Figure 4.4: The distribution of applied Instagram filters.

There are 158 individual classes and a total of 4,291 records for all datasets. See the class distribution of this data in Figure A.1 in the appendix.


The following Figure 4.5 shows the nine (9) various Instagram filters applied to the images.

(a) Aden (b) Ashby (c) Dogpatch

(d) Gingham (e) Hudson (f) Ludwig

(g) Skyline (h) Slumber (i) Valencia

Figure 4.5: Examples of the 9 various Instagram filters pseudo randomly applied to the dataset images.


The following Figure 4.6 shows the four (4) various AR filters applied to the images.

(a) Dog nose (b) Glasses

(c) shades 95% opacity (leakage of latent features).

(d) shades 100% opacity (no leakage of latent features).

Figure 4.6: Examples of the four (4) various AR filters applied to the dataset images.


The difference between the two sunglasses filters, due to the use of the alpha channel (opacity), is not discernible to the naked eye but can be observed in Figure 4.7, a heatmap made to examine the range of 0 to 10 pixel values, a range otherwise unnoticeable to the human eye. The left picture displays the information preserved in the picture, although very vaguely and not visible at first, while the picture on the right shows the complete destruction of all underlying information; only the filter can be seen.

Figure 4.7: An individual from the dataset with the two sunglasses filters applied, displayed in a heatmap showcasing the first 10 pixel values to illustrate the information that is left by the opacity.

The following Figure 4.8 details the face detection rate of the feature extraction step for the various filter datasets. The total number of images is the same across all datasets, at 4,291 images. The final number of encoded images further processed, for each dataset, is given by the success of the detection phase. Thus, the number of images in each encoded dataset, given by the successful face detection rate (read from Figure 4.8), is presented in Table 4.3.

Figure 4.8: The face detection rate for the feature extraction process for the various datasets.


Table 4.3: Number of images in each dataset of encoded images.

Benchmark Dog Glasses Instagram Shades leak Shades no leak

4,276 4,229 3,666 4,277 3,851 3,825

Failure to detect an encoded image was defined in Section 4.1.4 as any image which produced no face or more than one (1) face. A majority of the failures are due to the face locations function not finding the face, prior to running the face encodings function. The two (2) images in Figure 4.9 show examples: the first is a failure due to multiple detected faces, and the second is a failure to detect due to obstructed biometric features. It is thus worth noting that the face locations function, with the model specified as CNN, is more sensitive than the face detector used by the face landmarks function, which was used to apply the filters in the preceding steps.

(a) Unfiltered image of “Hans Blix 0039”. (b) Image “Alejandro Toledo 0017” filtered with shades without leakage.

Figure 4.9: Example images illustrating failure to detect the face. Image 4.9a shows a failure due to multiple detected faces. Image 4.9b shows a failure due to occluded biometric features.
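The detection and encoding step could be implemented along the following lines. This is a minimal sketch, assuming the face_recognition Python library, which the function names in the text suggest; the helper name and file path are hypothetical. An image is kept only if exactly one face is found with the CNN detector, matching the failure definition above.

```python
import face_recognition

def encode_single_face(path):
    """Return the 128-dimensional encoding, or None if detection fails."""
    image = face_recognition.load_image_file(path)
    locations = face_recognition.face_locations(image, model="cnn")
    if len(locations) != 1:   # no face, or more than one face -> detection failure
        return None
    return face_recognition.face_encodings(image, known_face_locations=locations)[0]

# Example usage with a hypothetical path:
# encoding = encode_single_face("some_face.jpg")
```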


The following Figure 4.10 details the class separation for the various datasets after feature extraction. Only the five (5) most frequent classes are colored, due to the limited number of available distinct colors. The t-SNE parameters used are the defaults of the “scikit-learn” library function [50], with the perplexity parameter at 30.

(a) Benchmark (b) Instagram

(c) Dog (d) Glasses

(e) Shades 95% Opacity (f) Shades 100% Opacity

Figure 4.10: Class separation for the various datasets, as dimensions are reduced to 2 dimensions, by t-SNE method, on the datasets after feature extraction.
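For reference, the dimensionality reduction behind plots of this kind can be obtained with scikit-learn's TSNE at its defaults (perplexity = 30), as stated above. The following is a minimal sketch; the random arrays are placeholders standing in for the stacked face encodings and their class labels, and the coloring of only the five most frequent classes is not reproduced.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X = np.random.rand(500, 128)            # placeholder for the encoded dataset
y = np.random.randint(0, 5, size=500)   # placeholder for the class labels

# Reduce the 128-dimensional encodings to 2 dimensions with default parameters.
embedded = TSNE(n_components=2, perplexity=30).fit_transform(X)
plt.scatter(embedded[:, 0], embedded[:, 1], c=y, s=5)
plt.title("t-SNE of face encodings")
plt.show()
```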


The following Figure 4.11 shows the cluster separation between the various datasets after feature extraction. Each macro cluster corresponds to one of the 158 individuals in the dataset. The intra-clusters are the six (6) various datasets. The most frequent individual is shown as an example to illustrate the intra-cluster distances.

(a) t-SNE all datasets (b) t-SNE all datasets, the most frequent individual

Figure 4.11: Class separation between the datasets, using the t-SNE method. The macro perspective shows the clusters for all individuals, while the intra-cluster perspective is of the most frequent individual.


5 Reconstruction

In this chapter, the experimental setup and results of the reconstruction efforts are presented.

5.1 Experimental Setup

The network used is largely based on the U-Net proposed in [48] and explained in Section 3.5.1. This network was chosen because of its ability to produce good results with surprisingly low data volume and processing resources, a concern that followed throughout the project. Another important reason for choosing this kind of network is the residual links, which provide information from various compression stages of the data. This allowed the model, as will be seen in the results, to largely ignore the parts of the image that remained the same and focus on the parts that were modified by the filter. Some modifications were made to the model proposed in [48], as the task is different. Inspired by [57], the max-pooling and upsampling operations were changed to strided convolutions and strided transpose convolutions, respectively, in an effort to stabilize the network's training and performance. Also, the crop and copy operations were replaced by add operations, which add information from earlier stages to the whole feature vectors, with the intent of retaining as much of the valuable information as possible while blending in the newly gained knowledge of how it should change. Finally, batch normalization was performed after every second convolution to further assist the learning process. The model is displayed in Figure 5.1.

Figure 5.1: Illustration of our U-Net model. The blue rectangles are results of convolutions during compression, the yellow are results of convolutions during the transpose-convolution (expansion) path, and the green symbolizes the addition of blue and yellow through the add operations.
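A minimal sketch in Keras of the modified U-Net described above is given below: strided convolutions replace max-pooling, strided transpose convolutions replace upsampling, skip connections are merged with Add instead of crop-and-copy, and batch normalization follows every second convolution. The depth and filter counts are illustrative assumptions, not the exact thesis configuration.

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions, with batch normalization after the second one.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

def build_unet(input_shape=(64, 64, 3), base_filters=32, depth=3):
    inp = layers.Input(input_shape)
    x, skips = inp, []
    # Compression path: strided convolutions halve the spatial resolution.
    for d in range(depth):
        x = conv_block(x, base_filters * 2 ** d)
        skips.append(x)
        x = layers.Conv2D(base_filters * 2 ** (d + 1), 3, strides=2,
                          padding="same", activation="relu")(x)
    x = conv_block(x, base_filters * 2 ** depth)
    # Expansion path: strided transpose convolutions restore the resolution,
    # and Add merges information from the corresponding compression stage.
    for d in reversed(range(depth)):
        x = layers.Conv2DTranspose(base_filters * 2 ** d, 3, strides=2,
                                   padding="same", activation="relu")(x)
        x = layers.Add()([x, skips[d]])
        x = conv_block(x, base_filters * 2 ** d)
    # Reconstructed RGB image, assuming inputs scaled to [0, 1].
    out = layers.Conv2D(3, 1, activation="sigmoid")(x)
    return Model(inp, out)

model = build_unet()
```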


The network was trained on three versions of the CelebA dataset: the original, and two where the filters shades leak and shades no leak were applied. These two filters were selected for reconstruction as they cover the eyes and obstructed recognition the most among the filters, as presented in Chapter 6, Recognition. The data was fed to the network in batches of 64 pictures of dimensions 64x64x3, while constantly observing the validation error in order to avoid overfitting as much as possible. A strict rule of ending the training at the first epoch where the validation error increased, instead of decreased, was adopted. This may not be the optimal choice, but was chosen because of resource constraints. The optimizer was Adam with mean squared error as the loss function.
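The training rule described above can be expressed as follows. This is a minimal sketch using the Keras model from the previous sketch: Adam with mean squared error, batches of 64 images of size 64x64x3, and training stopped at the first epoch where the validation error increases (EarlyStopping with patience=0). The random arrays are placeholders standing in for the filtered CelebA images (inputs) and the corresponding originals (targets).

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Placeholder data; in practice these are filtered images and their originals.
x_train, y_train = np.random.rand(256, 64, 64, 3), np.random.rand(256, 64, 64, 3)
x_val, y_val = np.random.rand(64, 64, 64, 3), np.random.rand(64, 64, 64, 3)

model.compile(optimizer="adam", loss="mean_squared_error")
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=64,
          epochs=50,  # upper bound only; early stopping ends training sooner
          callbacks=[EarlyStopping(monitor="val_loss", patience=0)])
```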

After training, the images from the LFW dataset that had the above-mentioned filters applied to them are fed to the U-Net in order to be reconstructed. The resulting pictures have their features extracted and compared to the originals.

5.2 Results

After training, the network was fed the manipulated images, as described in the experimental setup. Figure 5.2 illustrates the effect of the network on an example image. Figure 5.2c is the reconstruction of an image to which the shades filter with 95% opacity glass had been applied, and Figure 5.2d the reconstruction of an image with the 100% opacity filter; both filtered images look like Figure 5.2b to the human eye. The results show a clear reconstruction of the information behind the filter in the first case, while the second shows at most a vague general understructure.

During the extraction of the feature vectors needed in the recognition step, the images that previously had the 95% opacity filter were approximately six (6) times less likely to produce a face detection error than those with 100% opacity. That said, the face detection errors in both cases were still quite low, as shown in Figure 5.3. The number of images in each encoded reconstructed dataset is given by the successful face detection rate (read from Figure 5.3) and presented in Table 5.1.

Table 5.1: Number of images in each reconstructed dataset of encoded images.

Shades recon no leak Shades recon leak

4,271 4,288


(a) Original face image. (b) Shades filter applied to the original image. (c) Reconstruction of the image with shades at 95% opacity (leakage of latent features). (d) Reconstruction of the image with shades at 100% opacity (no leakage of latent features).

Figure 5.2: Examples of the reconstruction on the shades dataset.

Figure 5.3: The face detection rate for the feature extraction process for the reconstructed datasets, shades with and without leakage.


6 Recognition

This chapter forms the main examination of our problem statement and research questions. It includes the experimental setup for the different modes of recognition (Identification by Machine Learning Classification, and Identification and Verification by Distance) and the results given in each mode for all filters and reconstructed images.

6.1 Experimental Setup

The data processing steps combined with the results from reconstruction provide a total of eight (8) datasets, which are the datasets further processed in this chapter, see Table 6.1.

Table 6.1: Summary description of the 8 datasets further processed in the study.

Name Description

Benchmark The processed original images, without applied filter.

Dog Applied dog nose.

Glasses Applied transparent glasses.

Instagram Applied Instagram filters.

Shades leak Applied shades, with 95% opacity.

Shades no leak Applied shades, with 100% opacity.

Shades recon leak Reconstruction of shades images, with 95% opacity.

Shades recon no leak Reconstruction of shades images, with 100% opacity.

Identification by Machine Learning Classification has a different experimental setup than Identification by use of distance measures. Furthermore, Verification by use of distance measures differs from Identification, as Verification needs a claim of identity. The differences are specified here.

6.1.1 Identification by Machine Learning Classification

The datasets are each split into a train and test set by use of the scikit-learn [42] [6] function train_test_split [51]. The resulting split is 80% training data and 20% test data. The hyperparameter random_state is initialized with the same value for each split of the datasets, resulting in the same split across all datasets and making the results between them comparable.

The data is scaled using MinMaxScaler [53] to the range [0, 1], by X_scaled = (X − X_min) / (X_max − X_min). The target values are label encoded with LabelEncoder [52].
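The preprocessing above could be carried out as in the following minimal sketch. The random arrays are placeholders standing in for the face encodings and person labels, and the random_state value is an illustrative assumption; the thesis only states that the same value is reused across datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

X = np.random.rand(400, 128)                                  # placeholder encodings
y = np.array(["person_%d" % (i % 20) for i in range(400)])    # placeholder labels

# Label-encode the targets and make the same 80/20 split for every dataset.
y_enc = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_enc, test_size=0.20, random_state=42)                # fixed seed, reused everywhere

# Min-max scaling to [0, 1], fitted on the training data only.
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```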

In the classification case of using an SVM classifier, the Linear Support Vector Machine from scikit-learn, LinearSVC [54], is used. The classifier is initiated with default hyperparameters, with the maximum number of iterations increased from the default 1000 to 10000 (max_iter), see the following Table 6.2. This is done after having observed failure to converge with the lower number of iterations.

Table 6.2: Hyperparameters for the LinearSVM model.

penalty   loss            dual   tol      C     multi_class   max_iter
l2        squared_hinge   True   0.0001   1.0   ovr           10000
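A minimal sketch of this LinearSVC setup follows: default hyperparameters except max_iter, raised from 1000 to 10000 as stated above. X_train, y_train, and X_test are assumed from the preprocessing sketch.

```python
from sklearn.svm import LinearSVC

# penalty='l2', loss='squared_hinge', C=1.0, tol=0.0001 remain at their defaults.
svm = LinearSVC(max_iter=10000)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
```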

XGBoost is used as a classifier through the xgboost library in Python [45]. The classifier is initiated with default hyperparameters, specifying the choice of booster as linear; the implementation itself adapts to the multi-classification case by changing the objective function to multi:softprob and the evaluation metric to mlogloss [63], see Table 6.3.

Table 6.3: Hyperparameters for the XGBoost model.

base score booster eval metric learning rate n estimators objective

0.5 gblinear mlogloss 0.5 100 multi:softprob
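A minimal sketch of the XGBoost setup in Table 6.3, via the xgboost Python package, is given below: a linear booster, with the learning rate and estimator count from the table; for a multi-class target the library selects the multi:softprob objective itself. X_train, y_train, and X_test are assumed from the preprocessing sketch.

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(booster="gblinear",
                    learning_rate=0.5,
                    n_estimators=100,
                    eval_metric="mlogloss")   # objective becomes multi:softprob for >2 classes
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
```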

Evaluation metric (performance) values are generated using precision_score, recall_score, f1_score, and accuracy_score, all from the scikit-learn library. The results are presented as bar graphs using the matplotlib library in Python [33].
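The evaluation itself is a short step; a minimal sketch follows, using the four scikit-learn metrics with micro-averaging (as used for the results later in this chapter). y_test and y_pred are assumed from the classifier sketches above.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

scores = {
    "precision": precision_score(y_test, y_pred, average="micro"),
    "recall": recall_score(y_test, y_pred, average="micro"),
    "f1": f1_score(y_test, y_pred, average="micro"),
    "accuracy": accuracy_score(y_test, y_pred),
}
print(scores)
```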

6.1.2 Identification & Verification by Distance

In order to examine the performance levels of the various filters against the original pictures using distance measures, a database containing a feature vector for every individual in the dataset must first be built. This is done by considering the first original picture of every individual as the ”registration” image enrolled in the database, and thus the one that the rest are compared to.

From there, it is a question of running a loop comparing every picture against the database from a verification or identification perspective as defined in Section 3.4.
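The enrollment and matching loop could look like the following minimal sketch, assuming each dataset is available as a list of (person, 128-dimensional encoding) pairs: the first encoding per person is enrolled as the registration template, and probes are matched against the database with Euclidean distance. The verification threshold is illustrative, not the thesis value.

```python
import numpy as np

def enroll(samples):
    """Keep the first encoding seen for every person as the enrolled template."""
    database = {}
    for person, encoding in samples:
        database.setdefault(person, np.asarray(encoding))
    return database

def identify(database, probe):
    """Closed-set identification: enrolled identities sorted by increasing distance."""
    distances = {p: np.linalg.norm(np.asarray(probe) - t) for p, t in database.items()}
    return sorted(distances, key=distances.get)

def verify(database, claimed_person, probe, threshold=0.6):
    """Verification: accept the identity claim if the distance is below the threshold."""
    return np.linalg.norm(np.asarray(probe) - database[claimed_person]) < threshold
```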

6.2 Results

This section presents the results of the recognition chapter and is divided into Identification and Verification. The identification part consists of the results of identification using a distance-based approach, as well as the two trained classifiers, SVM and XGBoost. The verification part shows the performance in the verification mode by use of distance measures.


6.2.1 Identification

This section presents the results in the identification mode and is divided into identification using distance measures, and identification using two machine learning classifiers.

Identification by Distance

In the identification mode, with Euclidean distance as the distance measure, the results show that the benchmark (original) pictures and the pictures with Instagram filters applied have very similar performance, with only three pictures needed for the identification rate to reach ≈ 97.5%, and a rank-1 identification rate of ≈ 93%. From there, all filters appear to affect the identification rate to various degrees, although all quite drastically. The best performance among the remaining filters, considering 20 samples, is observed with the dog filter, at ≈ 96%. In descending order follow the glasses filter, which at best reaches ≈ 95%, the shades leak filter at ≈ 88%, and lastly shades no leak at ≈ 86%. After reconstruction, the pictures that had the shades leak filter applied to them show a drastic increase, reaching levels slightly above the dog filter. The opposite is observed with the shades no leak pictures, which decrease in identification rate after reconstruction, see Figure 6.1.

Figure 6.1: Identification performance using Euclidean distance.

Similar behaviour is observed with the other two distance measures, see Figures A.16 and A.17.

Performance results using the micro-averaged statistic, with the four (4) performance metrics (from the machine learning perspective), shown graphically in Figure 6.2, are analogous to those seen previously in the CMC curves, see Figure 6.1.
