Facial Identity Embeddings for Deepfake Detection in Videos

(1)

Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2020

Facial Identity Embeddings

for Deepfake Detection in

Videos

(2)

Emir Alkazhami LiTH-ISY-EX--20/5341--SE Supervisor: David Gustafsson

FOI

Erik Valldor

FOI

Pavlo Melnyk

isy_{, Linköpings universitet}

Examiner: Amanda Berg

isy_{, Linköpings universitet}

Computer Vision Laboratory Department of Electrical Engineering

(3)

Abstract

Forged videos of swapped faces, so-called deepfakes, have gained a lot of atten-tion in recent years. Methods for automated detecatten-tion of this type of manipu-lation are also seeing rapid progress in their development. The purpose of this thesis work is to evaluate the possibility and effectiveness of using deep embed-dings from facial recognition networks as base for detection of such deepfakes. In addition, the thesis aims to answer whether or not the identity embeddings con-tain information that can be used for detection while analyzed over time and if it is suitable to include information about the person’s head pose in this analysis. To answer these questions, three classifiers are created with the intent to answer one question each. Their performances are compared with each other and it is shown that identity embeddings are suitable as a basis for deepfake detection. Temporal analysis of the embeddings also seem effective, at least for deepfake methods that only work on a frame-by-frame basis. Including information about head poses in the videos is shown to not improve a classifier like this.

(4)

(5)

Acknowledgments

Firstly, I would like to thank my supervisors at FOI: David Gustafsson and Erik Valldor, both for the helpful discussions and ideas during the project, but also for giving me the opportunity to have a fun and interesting master thesis at FOI. It has been a great experience for me.

I would also like to thank Pavlo Melnyk for the insightful comments, both on the project itself, but also on the report. The feedback have been very crucial for the quality of my report.

Lastly, I would like to thank my examiner Amanda Berg, who even at her time away from work, goes out her way to help me complete my masters thesis.

Thank you all for the great support.

Linköping, September 2020 Emir Alkazhami

(6)

(7)

2.4 Face detection . . . 9 2.5 Metric learning . . . 10 2.6 Facial landmarks . . . 11 2.7 Perspective-n-point . . . 12 2.8 DeepFake creation . . . 13 2.8.1 Graphics-based methods . . . 13 2.8.2 Encoder-decoder methods . . . 14 3 Related Work 17 3.1 Deepfake detection . . . 17 3.2 Datasets . . . 18 3.2.1 FaceForensics++ . . . 18 3.2.2 Celeb-DF . . . 20 4 Method 23 4.1 Data extraction . . . 23 4.1.1 Identity embeddings . . . 23 4.1.2 Head poses . . . 24 4.2 Classifiers . . . 26 4.2.1 Image classifier . . . 27 4.2.2 Temporal classifier . . . 28 vii

(8)

4.2.3 Temporal classifier with head poses . . . 30 4.2.4 Data preprocessing . . . 30 4.3 Training details . . . 30 5 Results 33 5.1 Celeb-DF . . . 34 5.2 FF++ deepfake . . . 36 5.3 FF++ faceswap . . . 38 6 Discussion 41 6.1 Results . . . 41 6.2 Method . . . 43 6.3 Conclusions . . . 43 6.4 Ethical Aspects . . . 43 6.5 Closing comments . . . 44 A Examples of estimated head poses 47 B Head pose dependency in the identity embeddings 49

(9)

Notation

Abbreviations

Abbreviation Meaning

GAN Generative adversarial network CNN Convolutional neural network ROC Receiver operating characteristic MMOD Max-margin object detection

HOG Histogram of oriented gradients PnP Perspective-n-point

AUC Area under curve

LSTM Long short-term memory

(10)

(11)

1

Introduction

Automated video and image manipulation have seen great progress in recent years. Especially forged videos of swapped faces, so-called deepfakes, have gained a lot of attention due to their many possible usages in fraud, defamation, entertainment and the spread of misinformation. Today, techniques are getting to the point where humans have a hard time distinguishing between real and fake video content of faces [22], which has led to a rising concern towards this technology.

Methods for automated detection of this type of manipulation are evolving alongside the synthesising algorithms. These are meant to reduce the risk of exploitation of deepfakes, but due to the rapid progress, new approaches for de-tection are constantly in need to be evaluated.

1.1 Purpose

The purpose of this thesis work is to evaluate the possibility and effectiveness of using deep embeddings from facial recognition networks as base for detection of swapped faces in videos. In addition, the thesis aims to answer whether or not the facial embeddings contain relevant information while analyzed over time, and also if it is suitable to include information about the head pose in a deepfake detector.

1.2 Motivation

Multiple types of face manipulation techniques exist, for instance methods that alter a person’s facial expression, hairstyle, or apparent age. However, the meth-ods used in this study, often classified under the names deepfake and face swap,

(12)

are not one of these. They differentiate from the other, because they work with facial replacement instead of facial manipulation. They use videos of two differ-ent iddiffer-entities and aim to replace the source person’s iddiffer-entity with the iddiffer-entity of a target person by swapping the whole face. They do this while retaining other properties such as head pose and facial expression. This motivates the use of fa-cial identity embeddings in a classifier for this type of data. The reasoning is that if a deepfake method alters a person’s identity, it might also leave traces and arte-facts in these attributes. State-of-the-art detection methods commonly use hand-crafted low-level features, general pre-trained object recognition embeddings, or are directly data driven with CNNs, but they rarely use recognition networks specially trained for facial identities. This calls for an evaluation of embeddings based on identity separation to further expand the understanding of deepfake detection.

Many facial replacement methods work on a frame-by-frame basis, swapping the faces of each frame of a video independent from the other frames. This lack of temporal awareness in the deepfake algorithms prompts the use of some tem-poral analysis in a detector. It is therefore reasonable to believe that a classifier based on identity embeddings might improve if some temporal metric is used. These properties are also analyzed in this study.

Temporal variations in the identity embeddings could be correlated with the movement of the head in a video. So it is also motivated to include information about the head pose to further improve a detector based on a temporal analysis. This would make it possible to disentangle the temporal artefacts caused by the swapped face, from the temporal variation due to head movement.

1.3 Research questions

• Can identity embeddings from facial recognition networks be used to dis-tinguish between real and swapped faces in single images?

• Can temporal information in facial identity embeddings be used to distin-guish between real and swapped faces in videos?

• Can information about the head pose in a video be incorporated with the facial identity embeddings to further improve a classifier?

To answer these research questions, three classifiers are created and tested. The first classifier works on a frame-by-frame basis, the second uses temporal information to classify videos, and the third also incorporates information about head poses. The classifiers are tested and compared with each other and also with state-of-the-art classifiers to get a grasp of their effectiveness and the impact of each addition. The goal is not to create a perfect classifier, but rather to analyze how well the deep embeddings can be used in this context. Both frame-by-frame and over time and also with or without information about the head poses.

(13)

1.4 Scope and delimitations 3

1.4 Scope and delimitations

The intent of this study is to evaluate how suitable identity embeddings are for detection of swapped faces. Therefore, no data from facial manipulation methods will be used in the evaluation. Neither will completely synthetically generated images and videos be analyzed. Only data in which the actual face area have been swapped with another face is of interest for this thesis. More specifically, three different datasets are used with videos created from three different methods for swapping faces.

Methods that create synthetic data of full bodies exist as well, but these meth-ods are also outside the scope of this thesis.

(14)

(15)

2

Background

This chapter presents underlying theory and information in regards to the meth-ods used in this thesis work. The theory is in many cases presented as descrip-tions of methods that solve certain relevant problems. For most of these problems multiple techniques exist with valid solutions, but only the ones with the closest connection to this thesis work will be presented. For instance, in section 2.4 face detection with CNNs and the MMOD-loss is described since it is used in this project. One could also use a face detection technique based on HOG-features, but this will not be described any further.

2.1 Neural networks

An artificial neural network is a learning system with a general structure that is inspired by mammalian brains. They are fundamentally utilized as ways to approximate a function that maps a specific input domain to a specific output domain. The networks are mainly comprised of a high number of interconnected computational nodes, arranged in layers and with numerical weights assigned to each connection between the nodes. When passing an input sample through the network, a linear combination is calculated in every node. The calculations use all the node inputs with their corresponding weights as coefficients. The linear combination is then typically passed through some non-linear function, called an activation function. This produces the output of the node, called its activation. These calculations are done in all nodes and layer-wise to eventually produce the final output of the network. An illustration of such a structure can be seen in Figure 2.1.

(16)

Input

vector Output

Input layer Hidden layer Output layer

Figure 2.1:The structure of a small fully connected neural network. The main goal of such a system is to find appropriate weights in its connec-tions so that the overall network approximates a function that maps the input domain to the output domain. This can be done with supervised learning, in which we have some samples with known values in both domains. We let these samples pass through the network and we observe the resulting outputs. We compare the outputs from the network with the correct values for each sample. It is then determined how close the network’s approximations were via some loss function. From this point, the learning process is basically a non-convex opti-mization problem in which we want to optimize the loss function with respect to the weights of the network. This optimization can for instance be performed with gradient descent, in which we update the weights by moving in the negative direction of the gradient of the loss function. The gradient is usually computed in an efficient manner with a process called backpropagation [9].

One of the most important challenges to overcome for a learning system like this is the ability to generalize to unseen data. The algorithm must perform well on previously unseen data, not only on the samples that it has already been trained on. Typically, when training a neural network, we use a training set. This is a subset of the total amount of data available and it is only used when opti-mizing the model. We can calculate the so-called training error as some error measure of how well the network transfers samples in this dataset from input to output domain. In a pure optimization task, we would only care about minimiz-ing the trainminimiz-ing error, but learnminimiz-ing processes are about generalization. We are therefore also using a test set with data from the same distribution as the train-ing set, but with samples that the network has not been trained on. This will produce a so-called test error or generalization error, which we want to be as low as possible [9].

If the system has a low training error but a high generalization error we say that the network is overfitted to the training data. This indicates that no gen-eral structures or patterns in the data have been learned. The model have simply memorized all the training samples, but not learned anything about the distribu-tion that the samples come from. This is unwanted, and we should adjust the model with some available regularization technique. In the end, we should strive

(17)

2.2 Convolutional neural networks 7

for a model with a generalization error as low as possible [9].

2.2 Convolutional neural networks

CNNs are a special kind of neural networks that are designed to work with grid-like data. They are outstanding tools for many processing techniques in computer vision and signal processing. For example, the 2D grid topology in images or the 1D grid in time series data are suitable to analyze with a CNN [9].

A CNN is characterized by its convolutional layers. These layers are different from the fully connected ones, that can be seen in Figure 2.1, in a multitude of ways. The layers incorporate the convolution operation to transfer the data from their input to their output. In this operation, the layer use a convolution kernel, which is applied to the data in a sliding window process. The parameters in the kernel are the weights that are adjusted in the optimization algorithm, which gives the layer so-called sparse connectivity since it uses fewer weights than a fully connected layer. This is accomplished by making the convolution kernel smaller than the input, and thus reducing the memory requirements and the number of parameters [9].

Different variations of convolutions are used in machine learning to achieve different results. One usage is to apply a dilated convolution, which expands the kernel with empty elements. This produces a downsampling effect on the data and also gives the network a large receptive field. Figure 2.2 shows three different levels of dilation in a convolution kernel.

Figure 2.2: An illustration of different levels of dilation in a convolution kernel, applied in a 2D grid.

2.3 Receiver operating characteristic curves

When assessing the performance of different classifiers, a suitable performance measure must be chosen. Standard classification accuracy is not appropriate in the case of binary classification of classes with a different number of samples. This is because the smaller class is not going to affect the total accuracy percent-age as much as the larger class. Accuracy reports the fraction of correctly pre-dicted labels in relation to the total number of samples. Therefore, if one class contains more samples than the other, the accuracy will be biased. In a situa-tion like this, ROC curves are often used instead. To explain ROC curves, we introduce some terminology. Firstly, we distinguish between the two classes by

(18)

referring to them as thepositive (P) and the negative (N) class. All the samples that

a classifier correctly predicts as belonging to the positive class aretrue positives (TP), and likewise, all the samples correctly predicted as belonging to the

nega-tive class aretrue negatives (TN). Misclassified samples are in the same manner false positives (FP) and false negatives (FN). Many different metrics can be derived

from these quantities, but those used in the ROC curve are thetrue positive rate (TPR) and the false positive rate (FPR), defined as

T P R = T P P = T P T P + FN; (2.1) FP R = FP N = FP FP + T N. (2.2)

Now imagine representing the class labels of the positive and negative class numerically as one and zero, respectively. A classifier could then be designed to output its certainty in its class prediction as a value between these limits. To then achieve the actual prediction from the classifier, we simply threshold this certainty at some value. Then, to generate a ROC curve, we vary this threshold all the way from zero to one and calculate the T P R and FP R at every stage. Then we plot T P R variations against the FP R variations [10].

An optimal classifier will have a T P R of one for all possible thresholds larger than zero. It will consequently also have a FP R of zero for all thresholds smaller than one. This means that the classifier correctly predicts every sample’s class belonging with full certainty. If the classifier has anything else than full certainty, it will produce some errors at some thresholds. This will show up in the ROC curves as a T P R less than one and a FP R larger than zero. In Figure 2.3, two different ROC curves are drawn to illustrate their typical appearance [10].

Figure 2.3: Two typical ROC curves. The blue curve is the result of a clas-sifier that is better than the clasclas-sifier that produced the red one. The black line show the limit at which ROC curves indicate a performance just as good as a random guess.

(19)

2.4 Face detection 9

A common performance measure for classifiers is the area under their ROC curve (AU C). The AUC can vary between zero and one and summarizes the ROC curve with a single value. An AUC score of one represents the perfect classifier described earlier, while a score of 0.5 represents a classifier that is just as good as a random guess. Somewhat simplified, we can say that the higher the AUC, the better the classifier [10].

2.4 Face detection

Object detection, or in this case face detection, infers the detection of all objects of relevance in an image. A detection is commonly represented as a bounding box that contains the whole visible area of the object, and the area outside of a bounding box should not contain any of the objects searched for.

When facial detection is performed with a CNN and the MMOD-loss, a train-ing step on annotated data is required. The annotations are represented as bound-ing boxes for all faces in an image, and the purpose of the trainbound-ing step is to learn suitable model parameters for the CNN to output a heat map of probable bound-ing box locations. The CNN works in a slidbound-ing window fashion, principally scan-ning every image location and examiscan-ning whether or not there is a face in each location. The desired output from a model like that is a heat map that contains larger values in areas that represent the locations of a face in the input image and smaller values elsewhere. The problem with this approach is that slightly different sliding window locations, that all contain a face, will create an entire area of large values in the heat map. This is undesired since only one detection is wanted per face. Therefore a non-maximum suppression is applied on the heat map. This is the general setup for the detection algorithm, but the MMOD-loss comes into action with some important details [14].

The basic postulation for the MMOD-loss is the assumption of non-overlapping labels. Let R denote the set of all rectangular areas that are scanned by the detection system. Then let Y be a subset of R that contains all non-overlapping rectangles, in which non-non-overlapping for two rectangular areas

r1, r2 ∈ Ris defined as

Area(r1∩r2)

Area(r1∪r2) < 0.5. (2.3)

With this notation, Y will be the set of all possible non-overlapping bounding-boxes in an image that the system analyzes [15].

Now we introduce F(x, y) as the sum of region scores for an image x and a set of rectangles y ∈ Y . That is, the summation of the values from a heat map in areas corresponding to the regions y. The heat map is the output from a CNN with the parameters w. The face detection problem can now be expressed as finding the set of bounding-boxes y∗that maximizes the sum of region scores [15]:

y∗= argmax

y∈Y

(20)

From this formulation we seek to find the parameters w that lead to as few in-correct labelings as possible. For a set of images {x1, x2, ..., xn}and corresponding

bounding box labels {y1, y2, ..., yn}, we want the correct labeling score to be higher

than the score for all the incorrect labelings, thus leading to the constraint

F(xi, yi) ≥ max_y

j,yi

F(xi, yj). (2.5)

With a max-margin approach, in which we require the label for each training sample to be correctly predicted with a large margin, we get to the following optimization problem [15]: min w 1 2||w|| 2 s.t. F(xi, yi) ≥ max_y j,yi (F(xi, yj) + ∆(yj, yi)), ∀i, (2.6)

in which ∆(yj, yi) is the loss for predicting a label as yj when the true label is yi.

Equation 2.6 is similar to the objective function of a support vector machine, which also is based on the max-margin principle. The difference between the MMOD objective function and the one of a standard support vector machine is that F(x, y) is based on the output of a CNN and thus optimized with some non-convex optimization technique, instead of using the standard Lagrangian multi-plier method used for optimization of support vector machines [15].

This setup is a general object detector. To specifically have a face detector, we train the system on data with annotated faces.

2.5 Metric learning

Machine learning algorithms are supposed to learn appropriate behaviors based on the data they are presented with. This will, of course, make these algorithms completely dependent on the quality of the data. For example, in the context of a face recognition problem, we expect the data to contain multiple images of differ-ent persons. Differdiffer-ent images of the same person are expected to have a similar visual appearance, and images of different identities are expected to have a differ-ent visual appearance. If this is the case, we could imagine a machine learning algorithm to learn the connection between appearance and identity. However, many real-world datasets are not this suitable for direct learning, and often have their own inherent problems. For a dataset like this, we may, for instance, have to deal with factors such as pose variations, illumination differences, scaling, back-ground, occlusion, and facial expression. Each factor will vary the appearance of images within each class, making it more difficult to directly learn the connection between appearance and identity. We can see from this example that a classifier benefits from data in which samples of the same class are similar, and samples of different classes are dissimilar. In other words, it is beneficial to have a simi-larity metric that, based on the data, reduces intra-class variations and increases

(21)

2.6 Facial landmarks 11

inter-class variations. Metric learning is the practice of finding such a similarity metric, based on analysis of patterns in the data [12].

Deep metric learning with the triplet loss is used in [23] to find a suitable met-ric for the specific case of face recognition. It utilizes mainly two core concepts. Firstly, a deep CNN architecture which is able to produce embeddings from face images, and secondly the triplet loss, which enforces that the model learns to output embeddings in accordance with a suitable distance metric [23].

Let xa, xpand xnrespectively denote an anchor sample, a positive sample, and a negative sample, all representing images of faces in this case. The anchor and the positive sample are two images of the same identity, while the negative sam-ple represents a different identity. Now let f (x) ∈ Rd _{denote the L}

2-normalized

embedding of a sample x, in other words the L2-normalized output of the CNN

given the input x. We can now formulate the criteria of low intra-class variation and high inter-class variation in the CNN outputs as

k_{f (x}a i) − f (x p i)k22+ α < kf (xia) − f (xin)k22 ∀ _(xa i, x p i, xni) ∈ T . (2.7)

All possible triplets in the training data are here denoted T , |T | = N , and α is a margin that is enforced between the positive and negative classes [23].

From this criteria we can formulate the triplet loss as

L = N X i max0, kf (xa_i) − f (xp_i)k2₂− k_{f (x}a i) − f (xin)k22+ α . (2.8)

Optimizing the CNN to minimize this loss will cause the embeddings to have the properties sought after in equation 2.7. In the face recognition case, it means that we have created a CNN that for images of the same identity, will output embeddings close to each other in terms of the Euclidean distance. For images of different identities, the embeddings will have a large distance from each other.

2.6 Facial landmarks

Facial landmarks are points with distinct locations that can be found on any face. In this thesis, a method with 68 points is used to find landmarks in images [14]. It uses a model with landmarks that are located along the chin, the eyebrows, the eyes, the nose, and the mouth (see Figure 2.4).

(22)

Figure 2.4:A facial model with 68 landmarks [14].

The method is based on [13], in which they use an ensemble of regression trees together with gradient boosting to iteratively update the locations of each landmark’s position in a face image.

2.7 Perspective-n-point

PnP estimation is a way to find the pose of a calibrated camera. The estimation re-quires at least three known 3D points in world coordinates and their correspond-ing 2D-projections through the camera in the image plane. PnP can, therefore, be described as the problem of finding the projection matrix of a camera, from n given projection correspondences and known camera internals. The camera pro-jection matrix C is a 3 × 4 matrix, describing both the internal parameters of the camera and its pose [20]. With the notation C ∼ K(R|t) it is possible to separate the internal parameters K from the camera pose (R|t):

K=         f s γ u0 0 f sσ v0 0 0 1         , (R|t) =         r11 r12 r13 t1 r21 r22 r23 t2 r31 r32 r33 t3         (2.9)

The pose (R|t) has six degrees of freedom: three for the rotation matrix and three for the translation vector. It is therefore, as mentioned, sufficient with three point correspondences to find a solution [20].

Many different efficient implementations and solutions exist for this type of estimation. In this thesis, an implementation fromOpenCV is used [21].

(23)

2.8 DeepFake creation 13

2.8 DeepFake creation

Mainly three different types of algorithms exist for swapping faces in videos. The most simple one is the graphics-based technique [3, 8], which for the most part is used in light-weight mobile applications such asSnapchat. The second type of

algorithm relies on some latent feature space to separate facial traits from identi-ties. These algorithms are considered more advanced and require target specific training with autoencoders [1, 2, 8, 18]. Lastly, methods based on GANs also exist [8, 16, 19], but are more common for generating single images instead of videos.

Variations of these three base-approaches are widely adopted in different open-source implementations. These are the implementations that are the most easily accessible to the public, and can even come with simple to use GUI-software. Implementations such asDeepFaceLab [1], deepfake [2], and FaceSwap

[3] are probably used for a majority of the deepfake material circulating on the internet because of their simplicity and availability. More advanced state-of-the-art methods also exist, e.g.,FSGAN [19], but without any simple user interfaces.

2.8.1 Graphics-based methods

The graphics-based methods are often referred to as faceswap, instead of deep-fake, since they do not make use of any deep-learning in their algorithms. Results from these methods are rarely convincing and often contain some face warping artefacts, as can be seen in Figure 2.5.

Figure 2.5:Example of face swapping with a graphics-based method [3]. Im-ages from [22] (Left: source image, Middle: swapped identity, Right: target image).

The general algorithm make use of a moldable 3D model of a face, together with facial landmarks from the source and target images. The landmarks are used as anchor-points to first fit the target texture to the 3D model, and then fit the model to the source face.

FaceForensics++ [22] incorporates a graphics-based method [3] in their dataset.

It uses a face detector based on HOG-features to crop out a facial region from the source and the target video. Then it uses facial landmarks from the cropped regions to fit a facial template model to the texture from the source face. There-after, the model is back-projected onto the target face by minimizing the distance between the projections and the targets landmarks. Finally, the rendered face model is blended into the target image and color correction is applied to remove

(24)

obvious borders between the original face and the swapped area. This is applied on a frame-by-frame basis.

2.8.2 Encoder-decoder methods

The separation of facial identities and facial expressions is an important task for creating convincing deepfake material, and the encoder-decoder based algo-rithms are mainly designed around this premise. The methods seek to extract the facial expression of a target image, and apply it to a subject identity. They make use of an autoencoder architecture in which the encoder is shared between all identities, while the decoders are target specific. The goal is to find a latent fea-ture space, via the encoder, that describes the generic properties of a face. From this feature space, the decoders are then supposed to recreate the image based on some identity. Encoder Encoder Decoder A Decoder B L1-loss L1-loss

Figure 2.6: Training process of the encoder-decoder method (face images from [18]).

Figure 2.6 illustrates the training process for the base approach. The encoder-decoder pairs optimize the L1-loss between the original and the recreated image in an unsupervised manner. The encoder is shared between all identities during training, but the decoders are target specific. This forces the encoder to learn

(25)

2.8 DeepFake creation 15

encodings for the generic properties that all identities share. This can for instance be facial expression, orientation, lighting, or shades. The decoders on the other hand learn how to transfer the generic properties back to an image of a specific identity.

Figure 2.7 illustrates how the trained encoder and decoders are used to gen-erate swapped faces. Firstly, the target image is fed into the encoder to extract its identity-independent properties. Then instead of using a decoder trained for the same identity to recreate the image, a decoder trained for the source identity is used. In other words, the generic properties from the target image and the identity specific properties from the source image are combined.

Encoder Decoder

B

Figure 2.7:Generation process of the encoder-decoder method (face images from [18]).

It is worth to point out that both the training and the generation steps are only performed on cropped facial regions instead of the whole images. This is to ensure that the autoencoder learns properties strictly related to the faces. To generate whole images, the recreated faces are spliced into the facial region of the target image, and boundary smoothing is applied to hide the splicing boundary. An example based on [2] is shown in Figure 2.8.

Figure 2.8:Example of face swapping with an encoder-decoder method [2]. Images from [22] (Left: source image, Middle: swapped identity, Right: tar-get image).

(26)

(27)

3

Related Work

This chapter gives an overview of the current state of deepfake detectors. It also presents the datasets that are being used in this thesis.

3.1 Deepfake detection

A large number of methods for deepfake detection already exists and new pro-posals are constantly being tested. Initiatives such as theDeepfake Detection Chal-lenge [8] with over 2000 contesting teams push the pace of development even

further. A few categorizations can be made to more easily get an overview of the many available techniques. The first distinction is whether the method works on a frame-by-frame basis or has temporal awareness and uses the information across frames for the detection. To further categorize, it is worth to notice the three different premises that these methods are typically based on: either they use high-level artefacts, low-level artefacts, or are directly data-driven. High-level artefacts can, for instance, be the lack of eye blinking, or incoherent facial expressions, or head poses in deepfake videos. Low-level artefacts are different image qualities in different parts of the image, visible image splicing boundaries, or areas with mismatching color. Finally, the directly data-driven methods do not rely on specific artefacts, but find distinguishing properties by directly training on the deepfake data with convolutional networks.

One problem with all current detection algorithms is their lack of ability to generalize to different deepfake methods. The algorithms often have very good performances on deepfake data generated by the same method as the data they have been trained on. But no classifier is consistently able to perform good on deepfake methods it has never seen before. It seems like it is a difficult task to find common discrepancies between different types of deepfakes, and it limits the detection algorithms from being used as reliable tools in real-world scenarios.

(28)

3.2 Datasets

A small number of public datasets exists for evaluation of detection methods of deepfake-data. Two different datasets are used for this thesis work, namely, FaceForensics++ and Celeb-DF, which both are used for training and evaluation.

3.2.1 FaceForensics++

The FaceForensics++ dataset was released in January 2019 and it is used for eval-uation in many different detection methods [22]. It contains multiple forgery methods, both facial manipulation and facial replacement. However, only two facial replacement methods are used for evaluation in this thesis. One of them is a graphics-based method while the other is an encoder-decoder method. These methods are respectively from this point on referred to as FF++ faceswap and FF++ deepfake.

The dataset contains 1000 real videos collected from the internet, and mainly shows interview scenarios or newscasters. The fake videos are made up of com-binations of the real videos, in which two real videos are combined into one fake video. The fake videos are of relatively poor quality compared to other modern deepfake methods and contain visual artefacts in many cases. Some examples are shown in Figures 3.1-3.

Many studies have used FaceForensics++ for training and evaluation. Table 3.1 shows the detection performances on this dataset for state-of-the-art classi-fiers that were developed in some of these studies.

Table 3.1: FF++ classification results (AUC %). The results are from three different state-of-the-art classifiers.

Study Dataset FF++ faceswap FF++ deepfake Afchar et al. [6] 93.0 % 94.0 % Rössler et al. [22] 97.0 % 98.0 % Agarwal and Farid [7] 96.3 %

-The FaceForensics++ dataset contain duplicate examples of each video, but at different compression levels. Only data at compression level c23 have been used in this thesis, and all presented results concern data at this compression level.

(29)

3.2 Datasets 19

Figure 3.1:Examples of swapped faces from the FF++ faceswap dataset.

(30)

Figure 3.3:Real faces, corresponding to the fake examples in Figure 3.1 and 3.2.

3.2.2 Celeb-DF

Celeb-DF [18] is developed with the intent of providing a large-scale and chal-lenging deepfake video dataset, which matches the quality of deepfake videos circulated on the internet. It uses an improved version of the encoder-decoder synthesis algorithm, and the results have few visual artefacts. In Figure 2.8, the swapped face shows some of the typical flaws that come with the encoder-decoder methods. Lower resolution in the swapped area is a common problem due to con-structing the generated faces from a feature space of relatively low dimension, therefore losing a lot of information. A clear color mismatch can also be seen between the background and facial region. Both of these problems, among oth-ers, are handled in the improved version used in Celeb-DF. This also includes temporal regulations to reduce flickering in the deepfake videos. They use fa-cial landmarks to crop out the fafa-cial regions, and the temporal sequence of the landmarks are filtered using a Kalman smoothing algorithm to reduce imprecise variations between each frame.

The dataset consists of 5639 deepfake videos and 890 real videos of celebrities, mainly in interview scenarios. It was released in September 2019 and the average length of the videos is 13 seconds. Some examples are shown in Figure 3.4.

An evaluation of existing detection-methods’ performances on Celeb-DF is also carried out in [18], showing that further improvements in the detection tech-niques are needed.

(31)

3.2 Datasets 21

Figure 3.4:Examples of swapped faces from the Celeb-DF dataset. The top row shows the real face for each column.

(32)

(33)

4

Method

This chapter presents the methods used to extract all the relevant data from the deepfake videos. It also shows the structure of the three different classifiers that are used to detect the deepfakes.

4.1 Data extraction

The complete pipeline of the detection methods can be divided into two main parts: the extraction of the relevant data, and the classification based on this data. The information that is extracted from the videos is mainly the identity embeddings from each frame, and also the head poses.

4.1.1 Identity embeddings

The first act in the detection process is to extract the identity embeddings from the videos. This is done in two steps. To begin with, a face detection step finds all the image regions containing a face. This is done in accordance with the method described in Section 2.4. These image regions are then used in a facial recognition network to produce the embeddings. The recognition method used to do this is based on a variation of theResNet-34 CNN-architecture [11], pre-trained

end-to-end in accordance with theFaceNet deep metric learning [23], to produce

128-dimensional embeddings for each face. The training process of this network has led it to produce embeddings that separate images based on their facial identities. The used model is provided pre-trained and with Python bindings through [4].

This model is applied to extract one embedding vector for every frame in every video. This requires there to be only one face in every frame, but in a few exceptions, the videos contain multiple faces. In these cases, only the face closest to the center of the image is used and all other faces are discarded.

(34)

To summarize, the identity embeddings are the direct output from a face recognition CNN that is trained for identity separation, and a typical identity embedding looks as follows:

h

0.0546, 0.0261, −0.0310, −0.1337, . . . , 0.0175, −0.0584, −0.1034, 0.1569i.

They usually contain both positive and negative elements, with each element’s magnitude rarely larger than 0.2. These embeddings are saved as one matrix for every video with one embedding vector for every frame, as follows:

N frames 128 embedding dimensions z }| {                              at11 at12 · · · at1128 at21 at22 · · · at2128 .. . ... . .. ... atN1 atN2 · · · atN128               . (4.1)

4.1.2 Head poses

To estimate the head pose from an image, three major steps are used. Firstly, the detection of the face takes place. This is the same detection as the one de-scribed in Section 4.1.1, which produces a cropped region of the image contain-ing the face. Secondly, the positions of a set of facial landmarks are located in the cropped image region. These are found with the method described in Section 2.6. Here the landmarks are represented as 2D pixel coordinates.

In the last step, the landmarks’ orientation in relation to a general 3D face model is found with PnP-estimation. The 3D model is provided via [17], and it is used for all videos. It is comprised of points that represent the same land-marks as in the images but in 3D. Both the landland-marks in the images and the 3D model are enumerated in the same predefined order, making it possible to find corresponding points between a face image and the 3D model. These 2D to 3D point correspondences are used to find the extrinsic camera parameters R, t in the model shown in Figure 4.1.

(35)

4.1 Data extraction 25

Figure 4.1: An illustration of how facial landmarks in an image are used together with a 3D model in world coordinates to find camera pose with PnP estimation.

The landmarks in the 3D model are given in so-called world coordinates and the camera describes how these are projected onto the image plane. We are there-fore finding the camera position in relation to the 3D model by finding its ex-trinsic parameters. After estimating the rotation matrix, we use it to calculate the corresponding rotations about each axis, i.e., we find the Euler angles for the camera in relation to the world coordinate system [24]. If we now instead use the camera as our point of reference, we can use the same angles to describe the rota-tion of the face in the image in relarota-tion to the 3D model in the world coordinate system. We then use the 3D model’s rotation as our definition of looking straight forward. Now it becomes possible to interpret the Euler angles we just found as the rotation of the face in the image with respect to looking straight forward. A typical example of such angles looks as follows:

h

11.1336, 5.5668, −_7.6815i_.

The angles are given in degrees and take both positive and negative values, mainly in the range -20° to 20°.

Such rotations are saved as the three angles around each axis, row-wise for each frame and in one matrix for each video, as follows:

(36)

N frames

Rotation angles around each axis z }| {                               rt1x rt1y rt1z rt2x rt2y rt2z .. . ... ... rtNx rtNy rtNz                . (4.2)

For illustration purposes, the world coordinate axes are projected through the camera matrix and drawn onto the image. Figure 4.2 shows an example. This method is used to visually evaluate the performance of the pose estimations. A few more examples are provided in Appendix A.

Figure 4.2: An example of the estimated head pose, with facial landmarks and face axes drawn on the image. The original image is from [22].

4.2 Classifiers

The classifiers that are used in this thesis are designed to answer one of the initial research questions each. The most simple classifier is utilized to categorize single images, while the other two methods employ temporal information to classify videos. The two temporal classifiers are mainly structured in the same way, but one of them only uses the identity embeddings while the other also includes the head poses.

(37)

4.2 Classifiers 27

4.2.1 Image classifier

The image classifier works on a frame-by-frame basis. It uses the identity embed-dings from single frames as training samples together with their corresponding labels. It takes 128-dimensional embedding vectors as its input and passes them through a fully connected network to make a binary prediction.

Figure 4.3 describes the full structure of the image classifier.

Embeddings Face detection Videos Data preprocessing Fully connected, 128 Batch normalization Dropout, 0.5 Fully connected, 8 Batch normalization Dropout, 0.5 Fully connected, 1 ReLU ReLU Prediction Sigmoid

(38)

4.2.2 Temporal classifier

The design of the temporal classifier was not decided at the start of the project. Instead multiple different techniques were tested to find a suitable method. All the tested methods follow the same structure as shown in Figure 4.4, but with different techniques to analyze the data over time.

Embeddings Face detection Pose data Head pose estimation Videos Data preprocessing Fully connected, 128 Batch normalization Dropout, 0.5 Fully connected, 8 Batch normalization Dropout, 0.5 Fully connected, 1 ReLU ReLU Method specific descriptor Prediction Sigmoid Method specific descriptor Data preprocessing Sequence descriptor

Figure 4.4: The structure of the temporal algorithm. The step named Se-quence descriptor is where the seSe-quence data is analyzed over time and

(39)

4.2 Classifiers 29

The tested classifiers all use a fully connected network as their last step, which requires a vector as input. Therefore, a way to compress the embedding data from the matrix format shown in equation 4.1 to a single vector that describes a whole video must be found. In Figure 4.4, this is described as finding a sequence descriptor.

A few different methods were tested for this task by creating prototype clas-sifiers. These classifiers were created without any parameter tuning or extensive work put into them. They were merely prototypes, constructed as simple as pos-sible to find out what methods seem promising to use for the temporal analysis.

One prototype classifier looked at the mean difference between the elements of the identity embeddings over time as a way to compress the temporal mea-sures into a single vector. Another classifier used an LSTM to produce a sequence descriptor from the temporal data. The third tested prototype used dilated 1D convolutions in a CNN to gradually down-sample the temporal dimension into a single vector. These prototype classifiers were tested on all the datasets, but had bad performance and saw no further development. Instead, one technique that showed promising results was kept and developed into the final temporal classifier. This method essentially estimates the temporal distribution for each element in the input by calculating their mean and standard deviation over time.

It starts off by taking in the sequence data, which is formatted as each frame’s embedding stacked on top of each other in a matrix, as follows:

              at11 at12 · · · at1128 at21 at22 · · · at2128 .. . ... . .. ... atN1 atN2 · · · atN128               .

Then it calculates the mean and standard deviation for each of the embedding elements over the whole sequence. This can be seen as a maximum likelihood estimation for each embedding element’s temporal variations, using 1D Gaussian distributions:

"mean1 mean2 · · · mean128

std1 std2 · · · std128

#

.

The purpose of this operation is to express each embedding element’s tempo-ral distribution with only two values. This makes it possible to vectorize the data by repeating all distribution parameters, pairwise for each embedding element:

h

mean1 std1 mean2 std2 · · · mean128 std128

i

.

This vector now represents the entire video sequence and is the input to the fully connected network.

(40)

4.2.3 Temporal classifier with head poses

When including the head pose data in the temporal classifier, the embeddings in equation 4.1 are concatenated with the head rotation angles in equation 4.2 to form a data structure as follows:

N frames Concatenated data z }| {                               at11 at12 · · · at1128 rt1x rt1y rt1z at21 at22 · · · at2128 rt2x rt2y rt2z .. . ... . .. ... ... ... ... atN1 atN2 · · · atN128 rtNx rtNy rtNz                . (4.3)

The classification method thereafter is the same as in the classifier that only uses the identity embeddings. It calculates the mean and standard deviation column-wise to produce a sequence descriptor, and then passes it through a fully connected network.

The raw embeddings and the raw pose data initially differ in their numerical sizes. The rotation angles often have values a hundred times larger than the ones found in the embeddings, which makes a concatenation like this unsuitable to use directly in a neural network. However, in the data preprocessing step, all descriptor elements are centered around zero and scaled to unit variance to avoid this problem.

4.2.4 Data preprocessing

The preprocessing step is the same for all three classifiers and is applied on the input vector to the neural network. Each element in the input vector is normal-ized: independently centered around zero and scaled to have unit variance. This is done by calculating the mean and standard deviation, for each vector element, over all samples in the subset currently in use. The mean is then subtracted from each corresponding element to center the values around zero. The scaling is per-formed by dividing each vector element by its corresponding standard deviation. In this way, every element is standard normal distributed over all the samples.

4.3 Training details

To produce reliable results from the classifiers, some precautions have to be taken when splitting the datasets up to training, validation, and testing subsets. The different subsets should contain as little common and similar data as possible, while still coming from the same distribution. In the data used in this thesis, this problem becomes apparent because there exist multiple videos of the same person, but in different scenarios. This becomes even more problematic since all the deepfake videos are created from different combinations of two identities from different scenarios. Thus, many videos either share the same backgrounds

(41)

4.3 Training details 31 but with different facial identities, or the same facial identities but placed in different backgrounds.

To solve this problem, a number of identities are reserved for each subset. To place a certain video in a specific subset, both the facial identity and the back-ground in the scene must come from a real video in which the original identity has been reserved for the subset in question. As a consequence, some videos become unusable since they are created from two identities that are not reserved for the same subset. However, the reserved identities are varied between different test runs, so that most videos will be used in the end.

The Celeb-DF dataset contains videos from 59 different persons, but the num-ber of videos per identity varies. It is therefore not possible to choose a certain proportion of the data that should go to each subset, while also following the data splitting method described above. To at least have a crude control over the sub-set proportions, the number of identities reserved for each subsub-set is sub-set constant. The training set has 35 reserved identities while the test and validation sets have 12 identities each.

Since the number of available deepfake videos is much larger than the number of available real videos, class weights are used to penalize the models differently depending on the sample class. The weights are equal to the inverse proportion of the number of samples for each class.

During training, early stopping is used. The model will continue the optimiza-tion process until no improvements in the validaoptimiza-tion loss have been seen for 20 epochs. Then the model in the iteration with the best validation loss will be used for the testing. The best performing model is on average found around epoch 18.

The loss function used is the binary cross-entropy, and the optimization method is the Adam algorithm with the default hyperparameters from Keras [5].

(42)

(43)

5

Results

This chapter presents the performances of the three different classifiers when applied on the three different datasets. Table 5.1 show the results for the Celeb-DF data, Table 5.2 show the results for the FF++ deepfake data, and Table 5.3 show the results for the FF++ faceswap data.

Each column in the tables represent a specific classifier. Framewise refers to

the image classifier in Section 4.2.1,Temporal refers to the temporal classifier in

Section 4.2.2, andTemporal + Pose refers to the temporal classifier that also uses

the head poses, described in Section 4.2.3.

In each case, the classifiers were trained, validated, and tested on subsets from the same dataset. For instance, the classifiers that were trained with Celeb-DF data were also validated and tested on Celeb-DF data, and so on with all the datasets. Each subset was divided according to the method in Section 4.3 so that no identities were shared between the train, validation and test subsets.

Each classifier was used in twenty different experiments and an average score was calculated together with the standard deviation over all attempts. Each at-tempt was assigned a fixed random seed to be used for all the classifiers. This makes the data partitioning and the weight initialization to be the same for all the classifiers over each attempt. Due to this, we can compare the different clas-sifiers not only by their averages, but also for each attempt. We can also see that some data partitioning are more difficult to use than others. For instance, attempt 16 in Table 5.1 shows lower scores than the average for all three classifiers.

Figure 5.1-6 are included to contextualize the performances and give a better understanding of the data that is used. Figure 5.1, 5.3, and 5.5 show challenging examples of fake faces that the classifier managed to correctly classify as fake. Figure 5.2, 5.4, and 5.6 show fake faces of both good and poor quality, that were miss-classified as being real faces. All the predictions in Figure 5.1-6 are done by the temporal classifier that does not use head pose data.

(44)

5.1 Celeb-DF

Table 5.1:Celeb-DF classification results (AUC %).

Attempt

Classifier

Framewise Temporal Temporal +Pose 1 78.8 79.6 80.4 2 82.5 76.8 76.4 3 80.0 74.7 75.1 4 72.6 72.5 72.0 5 72.2 75.1 73.0 6 69.9 60.3 53.4 7 75.4 63.0 64.9 8 76.8 77.5 77.7 9 82.1 84.5 83.2 10 81.1 81.4 82.1 11 73.3 78.6 76.5 12 81.5 71.0 73.2 13 69.0 73.8 72.7 14 80.4 72.2 72.7 15 78.4 76.1 73.0 16 64.0 63.0 62.6 17 80.3 77.4 80.0 18 77.4 86.1 84.6 19 82.1 75.5 75.9 20 76.9 70.8 73.8 Average 76.7 ± 5.0 74.4 ± 6.5 74.2 ± 7.1

(45)

5.1 Celeb-DF 35

Figure 5.2:Fake faces from Celeb-DF wrongly classified as real. The top row show samples of poor quality (blurry eyes, blurry mouth), while the bottom row show samples of high quality.

(46)

5.2 FF++ deepfake

Table 5.2:FF++ deepfake classification results (AUC %).

Attempt

Classifier

(47)

5.2 FF++ deepfake 37

Figure 5.4: Fake faces from FF++ deepfake wrongly classified as real. The top row show samples of poor quality, while the bottom row show samples of high quality.

(48)

5.3 FF++ faceswap

Table 5.3:FF++ faceswap classification results (AUC %).

Attempt

Classifier

(49)

5.3 FF++ faceswap 39

Figure 5.6: Fake faces from FF++ faceswap wrongly classified as real. The top row show samples of poor quality, while the bottom row show samples of high quality.

(50)

(51)

6

Discussion

6.1 Results

The image classifier seems to perform relatively well on all three datasets. Al-though slightly worse on Celeb-DF, than on FF++ deepfake and FF++ faceswap, which can be seen in Table 5.1-3, in the column named Framewise. This is

ex-pected since the Celeb-DF data is more visually convincing and of higher quality. It also uses an improved technique to generate deepfakes, which likely makes the detection process more difficult. Nevertheless, the image classifier manages to achieve a rather good AUC score for all datasets. The identity embeddings must, therefore, contain a substantial amount of information that reveals whether or not an image is real or fake. This conclusion can be drawn since the image classi-fier utilizes no other information than the embeddings themselves, but still man-ages to classify the imman-ages somewhat reliably.

When we instead look at the temporal classifier, we obtain a large perfor-mance boost for the FF++ deepfake and FF++ faceswap data compared to the image classifier. The AUC score increases by 10.0% and 12.7% respectively, which indicates that there is relevant information to extract at the temporal level. It be-comes even more apparent that identity embeddings analyzed over time are use-ful for deepfake detection when we compare the results in Table 5.2 and 5.3 with the state-of-the-art classifiers in Table 3.1. The temporal classifier achieved scores of 90.7% and 95.6% respectively for the FF++ deepfake and FF++ faceswap data, while the state-of-the-art classifier reached scores of 98.0% and 97.0%. The AUC scores in Table 5.2 and 5.3 are of course lower, but still high enough to show that the temporal information in the embeddings is of relevance.

For the Celeb-DF data, on the other hand, the temporal classifier actually performs worse than the image classifier. This could be explained by the fact that FF++ generates their videos by applying the deepfake methods frame-wise,

(52)

while Celeb-DF also uses temporal regulations to reduce flickering movements. It might be possible that these regulations hide the discrepancies that the tem-poral classifier otherwise could have found and thus reducing the effectiveness of the temporal classifier. However, one could also make the argument that tem-poral regulations themselves introduce discrepancies that a temtem-poral classifier should be able to detect. It is thus unclear what exactly causes the temporal de-tector to perform worse on the Celeb-DF data. One possible explanation could be that the temporal regulations introduce some higher level of discrepancies while removing the low level flickering ones. The temporal classifier might not be able to detect such high level artefacts, thus leading to the results presented in Table 5.1.

The last classifier, that includes the head poses in the temporal detection al-gorithm, referred to asTemporal + Pose in Section 5, have a small impact on the

performance compared to using only the embeddings. The AUC score increases by 0.5% for the FF++ deepfake data and by 0.9% for the FF++ faceswap data, while it decreases it by 0.2% for the Celeb-DF data. Therefore, the head pose data does not seem to be very relevant for a classifier like this.

Multiple arguments can be made against the head poses. One might assume that the identity embeddings already encode information about the head poses and therefore no new knowledge is added by explicitly computing the head ro-tations. However, it is somewhat unlikely since the embeddings are based on an identity separating metric, learned from face images. A metric learning sys-tem like this is encouraged to produce embeddings that are invariant to rotations since rotation is a generic property for all faces and not something used for iden-tity separation.

Another argument against the head poses is that they simply are too noisy. The head pose extraction was only evaluated visually since no reference data was available. So it might very well be the case that the poses are inconsistent and noisy. It is also possible that some videos contain completely wrong pose estima-tions, since it was not possible to examine every video manually. However, the videos that were examined seemed to be correct.

One could also reason that the temporal algorithm described in Section 4.2.2 is designed in a way that does not allow to utilize the pose data to its full extent. In the start of the project, the idea was to use the rotations to disentangle the embedding variations that are based on head movement from the embedding variations based on some deepfake artefact. But this is not how the rotations are used in the temporal algorithm. The head poses over time are simply summarized as their mean and standard deviation, which hides the exact pose at each frame. The head poses could have been more useful if some other algorithm was chosen for the temporal analysis.

To analyze the head poses further, an additional test was performed to ex-amine the head pose dependency in the identity embeddings. It is presented in Appendix B. This test showed that there usually is small deviance in the embed-dings, depending on the head rotations, with larger rotations giving larger de-viances. This seem to indicate that the embeddings are not completely invariant to rotations and thus contain some information about the head pose. However,

(53)

6.2 Method 43 the deviations seem minor, and the results varied a lot between different videos. Furthermore, it was not possible to see any discrepancies between the real and fake videos in this test. This indicates that the head poses themselves are not very relevant when trying to distinguish between real and fake embeddings.

To summarize, the embeddings have some rotational dependencies, although minor. But it is not obvious that a fake embedding has a different rotational dependency compared to a real embedding, thus signaling that the head poses are not very relevant for an embedding based classification.

6.2 Method

The method to analyze the embedding properties is completely based on classifi-cation results. This is somewhat ineffective since it can be unclear what proper-ties are related to the embeddings themselves and what properproper-ties belong to the classifier in question. For instance, the temporal classifier with the head poses added saw no improvements compared to the temporal classifier without the head poses. It is not obvious whether the head poses are an ineffective means for the detection, or the classifier simply does not utilize them in an effective manner. It is however possible to motivate conclusions based on some other analysis, as, for instance, the tests presented in Appendix B.

6.3 Conclusions

The identity embeddings definitely contain information that can be used as the basis for deepfake detection, even when analyzing single frames.

If the embeddings are studied over time in a video, it is possible to extract even more information of relevance for deepfake detection. And it even seems to contain enough information to make close to perfect predictions, depending on what temporal algorithm the embeddings are used in. On the other hand, it is relatively easy to hide the temporal artefacts in the embeddings with temporal regulations, as shown by the comparatively bad results for the temporal classi-fier on the Celeb-DF dataset. Therefore, precautions must be taken if a general deepfake detector based on temporal analysis of identity embeddings is to be developed. Such a detector must not only be able to detect simple flickering dis-crepancies, but also higher level artefacts, caused by the regulations themselves.

Finally, head poses are not suitable to be combined with classifiers based on identity embeddings. They seem to contain some relevant information, but are not an effective addition to the embeddings.

6.4 Ethical Aspects

Deepfakes have become a controversial topic themselves, since they many times are created with malicious intentions. Therefore, a palpable concern towards this technology have arisen. Deepfake researchers are aware of this worry, and they

(54)

often motivate their work by making their implementations available for anyone to use. The reasoning is: if deepfakes become more common in the everyday life of people, they lose some of their harmful traits. Deepfakes already exists online, but if more people are informed about this technology, they will be more observant and approach video data with more caution.

Facial recognition is also a heavily criticised technology because of its possible privacy violating usages. However, no such violations occur in this thesis work. Whenever the technology is used in this project, it is applied to videos that depict public figures, such as newscasters and celebrities. These persons are aware of their own publicity and can not expect these videos nor their facial identities to be private.

6.5 Closing comments

This thesis has shown that identity embeddings from a ResNet-34

CNN-architecture trained in accordance with the FaceNet deep metric learning, do

contain information that is suitable for deepfake detection. But there exists a multitude of different face recognition models apart from this one, that will pro-duce different types of identity embeddings. The premise of face recognition is the same regardless of the model in question, but different models are obviously beneficial for different tasks. It is therefore relevant to analyze different types of embeddings and their suitability for deepfake detection in some future work. It would also be interesting to focus on the development of a detector that uti-lizes the temporal analysis of the identity embeddings and the head poses more optimally to find out what the limitations of an embedding based classifier are.

(55)

(56)

(57)

A

Examples of estimated head poses

The following images, Figure A.1-4, show examples of estimated head poses. The estimations are performed as described in Section 4.1.2, and the images contain both facial landmarks, and facial axes to illustrate the estimated poses.

Videos with the same type of illustrations as in Figure A.1-4 were used to visually evaluate the head pose estimation method.

Figure A.1

(58)

Figure A.2

Figure A.3

(59)

B

Head pose dependency in the identity

embeddings

Figure B.1-5 show examples of plots generated from a test that was designed to ex-amine if the identity embeddings have a dependency on the head poses. This test found a key-frame in every video, in which the person had a head pose as straight as possible. Or in other words when they looked straight to the camera. The em-bedding from the key-frame was saved and compared with the emem-beddings from every other frame in terms of their Euclidean distance from each other. The dis-tances were plotted against the yaw angle of the head in every frame, to find how much the embedding deviations are dependent on the head rotation. Tests that used the pitch and roll angles were also performed, but they are not presented here, since the results were very similar to those of the yaw angle. Furthermore, the pitch and roll angles often varied a lot less than the yaw angle, thus making it more difficult to estimate a reliable dependency over large angles.

This test showed that there usually is small deviance in the embeddings, dependent on the head rotations, with larger rotations giving larger deviances. Commonly, this deviance is smaller than 0.6, which in this learned distance met-ric, is a relatively small deviance.

Figure B.1-5 contain both the extracted key-frames for each dataset, and also the plots of the corresponding embedding deviances. The plots are drawn for samples from the FaceForensic++ dataset. Lines are drawn through the point clouds to more easily see the relation between embedding deviance and head rotation angle.

(60)

(61)

51

(62)

Facial Identity Embeddings for Deepfake Detection in Videos

Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2020