Master Thesis

HALMSTAD UNIVERSITY
Master of Science in Engineering, Computer Science and Engineering, 300 credits

Writer identification using semi-supervised GAN and LSR method on offline block characters

Computer Science and Engineering, 30 credits
Halmstad 2020-10-19
Adrian Leo Hagström, Rustam Stanikzai

Adrian L. Hagström & Rustam Stanikzai: Writer identification using semi-supervised GAN and LSR method on offline block characters, © October 19, 2020

ABSTRACT

Block characters are often used when filling out forms, for example when writing one's personal number. This raises the question of whether there is recoverable biometric (identity-related) information within the individual digits of handwritten personal numbers.

This thesis investigates the question by using both handcrafted features and features extracted via Deep Learning (DL) models, while successively limiting the number of available training samples. Some recent works using DL have presented semi-supervised methods that use Generative Adversarial Network (GAN) generated data together with a modified Label Smoothing Regularization (LSR) function. Using this training method might improve performance over a baseline fully supervised model when doing authentication. This work additionally proposes a novel modified LSR function named Bootstrap Label Smoothing Regularizer (BLSR), designed to mitigate some of the problems of previous methods, and compares it to the others. The DL feature extraction is done by training a ResNet50 model to recognize the writers of personal numbers and then extracting the feature vector from the second-to-last layer of the network.

Results show a clear indication of recoverable identity-related information within the handwritten (personal number) digits in boxes.

Our results indicate an authentication performance, expressed in Equal Error Rate (EER), of around 25% with handcrafted features. The same performance measured in EER was between 20% and 30% when using the features extracted from the DL model. The DL methods, while showing potential for greater performance than the handcrafted ones, seem to suffer from fluctuating (noisy) results, making conclusions on their use in practice hard to draw. Additionally, when using 1-2 training samples the handcrafted features easily beat the DL methods.

When using the LSR-variant semi-supervised methods there is no noticeable performance boost, and BLSR gets the second-best results among the alternatives.


ACKNOWLEDGEMENTS

We would first like to thank our supervisors Josef Bigun and Fernando Alonso-Fernandez for great feedback and wise guidance, especially through these tough times, having online meetings during the pandemic. We would also like to thank Kevin Hernandez Diaz, who served as opponent for this thesis and gave great feedback allowing us to improve our work, as well as the examiner Slawomir Nowaczyk, who helped by pushing us to do better.

Additionally, last but not least, we thank our friends and family, who provided support through the most stressful of times.


CONTENTS

1 Introduction
  1.1 Background
2 Problem Formulation
3 Related Work
4 Theory
  4.1 Identification and Authentication
    4.1.1 Identification Metric
    4.1.2 Authentication Metric
  4.2 Handcrafted Contour Features
  4.3 Convolutional Neural Networks
  4.4 Generative Adversarial Networks
  4.5 Label Smoothing Regularization
    4.5.1 Label Smoothing Regularization for Outliers
    4.5.2 Weighted Label Smoothing Regularization
    4.5.3 Distributed Pseudo Label
    4.5.4 Residual Neural Network
5 Methodology
  5.1 Block Character Data
    5.1.1 Data Digitization
    5.1.2 Data Description
    5.1.3 Data Pre-Processing
  5.2 Experiment Methodology
    5.2.1 Handcrafted Feature Experiment
    5.2.2 Deep-Learning Experiment
6 Results
  6.1 Handcrafted Feature Results
    6.1.1 Identification with Handcrafted Features
    6.1.2 Authentication with Handcrafted Features
  6.2 Generative Adversarial Network Results
  6.3 Deep-Learning Model Results
    6.3.1 Deep-Learning Model Identification Results
    6.3.2 Deep-Learning Model Authentication Results
    6.3.3 Optuna Learning Rates
7 Discussion
  7.1 Personal Number Biometric Data
  7.2 Proposed Deep Learning Training Method
8 Conclusion
I Appendix
9 Appendix A
  9.1 DCGAN Layers
  9.2 ResNet50 Layers
Bibliography


LIST OF FIGURES

Figure 1: Graphical illustration of contour direction (f1), contour hinge (f2) and horizontal direction co-occurrence (f3h). The figure was taken from paper [9], and credit goes to the authors.

Figure 2: How some N number of training samples are combined and then compared to a single test sample. Figure 2a shows the digit-by-digit approach, which yields 6 distances, one for each digit; the final distance is calculated as the mean of those 6 distances. Figure 2b shows the technique of combining all digits and then comparing train and test. "Combining" multiple samples refers to taking their features, adding them together and normalizing. "Comparing" means calculating the χ2 distance for each feature, as described in Equation 1.

Figure 3: Two bar plots showing how LSR takes a ground truth and sets a target function with an α of 0.5. Figure 3a shows the given prediction and Figure 3b the resulting target function.

Figure 4: Image of the form with the blocks where the digits and other characters would be written.

Figure 5: Figure 5a shows a histogram with the number of samples on the x-axis and the number of writers on the y-axis. Figure 5b is a box plot of writers and number of samples; the vertical axis is the number of samples. The minimum number of samples for a writer is 1, the maximum is 70, the median is 8 and the average is 13.6. The red crosses are considered outliers.

Figure 6: Bar plot of digit counts. The vertical and horizontal axes correspond to the number of digits and the digit in question, respectively.

Figure 7: Two bar plots showing how BLSR takes a prediction and sets a target function with an α of 0.5. Figure 7a shows the given prediction and Figure 7b the resulting target function.

Figure 8: Drawings of the DCGAN, including the general structure of the GAN in Figure 8a, the structure of the generator in Figure 8b and the discriminator in Figure 8c.

Figure 9: Drawing of the semi-supervised model, showing how the GAN data with an LSR-variant loss is switched with the original labeled data and the cross-entropy loss function with a simple ground-truth label.

Figure 10: How N training samples are combined and compared to a single test sample. In this case, "combining" the multiple samples means extracting their feature vectors and calculating their mean; this is done when taking the distance to the mean feature. "Comparing" refers to calculating the Euclidean distance between feature vectors.

Figure 11: Drawing of how the distance is calculated via "distance to mean feature" in Figure 11a and "mean of distances" in Figure 11b. Both drawings mark several feature vectors as fN, where N ∈ {1, 2, 3, 4, 5}, that are to be combined, with a distance calculated to a single test feature vector ft. In "distance to mean feature" a mean feature vector fm is calculated, and the final distance d is the Euclidean distance from fm to ft. The "mean of distances" is calculated by taking the Euclidean distance from each fN to ft, giving the set of distances dN; the final distance is the mean of dN.

Figure 12: Top-1 identification accuracy using the two methods of feature comparison discussed in Section 4.2, taking the mean over all handcrafted feature combinations. The vertical axis shows identification accuracy as a percentage; the horizontal axis shows the number of training samples.

Figure 13: Comparison of all the feature combinations, showing identification accuracy on the vertical axis against different top-N accuracies on the horizontal axis. A legend indicates which line shows which feature combination.

Figure 14: Identification accuracy with handcrafted features f1 & f2 on the vertical axis and the number of training samples on the horizontal axis. Ten lines represent the accuracies of top-1 through top-10 identification, with a legend detailing which line corresponds to which.

Figure 15: Comparison of all the feature combinations, showing EERs on the vertical axis against the number of training samples on the horizontal axis. A legend indicates which line shows which feature combination.

Figure 16: EERs of features f1 & f2 on the vertical axis against the number of training samples on the horizontal axis. Figure 16b shows the same data as Figure 16a but is zoomed in on the vertical axis.

Figure 17: FA and FR rates using handcrafted features f1 & f2, plotted on the vertical axis against the threshold value on the horizontal axis. Figure 17a plots results with 1 training sample, while Figure 17b plots results with 10.

Figure 18: Example images generated by the GAN models. Figure 18a was trained with 1 training sample per person, while Figure 18b was trained with 5 and Figure 18c with 10. The rightmost column of each contains random examples from the dataset used to train the GANs.

Figure 19: Identification results from the baseline model using no data augmentation, with identification accuracy on the vertical axis and the number of training samples on the horizontal axis. In Figure 19a the model is trained with frozen layers, while in Figure 19b it is trained with unfrozen layers. Ten lines represent the accuracies of top-1 through top-10 identification, with legends detailing which line corresponds to which.

Figure 20: Identification results from the baseline model using the digit-shuffle data augmentation, with identification accuracy on the vertical axis and the number of training samples on the horizontal axis. In Figure 20a the model is trained with frozen layers, while in Figure 20b it is trained with unfrozen layers. Ten lines represent the accuracies of top-1 through top-10 identification, with legends detailing which line corresponds to which.

Figure 21: Mean EERs over the frozen and unfrozen variants of the baseline model, trained with and without data augmentation, compared with the EER using the handcrafted features, with the number of training samples along the horizontal axis and the EER on the vertical axis. The handcrafted results are the same as shown in Figure 16. Figure 21a shows the difference between calculating distances with and without using the mean of multiple training samples, without data augmentation, while Figure 21b shows the same with data augmentation.

Figure 22: Results of the baseline model compared to results using handcrafted features, with the EER on the vertical axis and the number of training samples on the horizontal axis. The handcrafted results are the same as shown in Figure 16. Figure 22a shows the result without data augmentation and Figure 22b the result with the digit-shuffle data augmentation.

Figure 23: Results of the baseline model trained with LSRO compared to results using handcrafted features, with the EER on the vertical axis and the number of training samples on the horizontal axis. The handcrafted results are the same as shown in Figure 16. Figure 23a shows the result without data augmentation and Figure 23b the result with the digit-shuffle data augmentation.

Figure 24: As Figure 23, but for the baseline model trained with WLSR.

Figure 25: As Figure 23, but for the baseline model trained with DPL.

Figure 26: As Figure 23, but for the baseline model trained with BLSR with the α parameter set to 0.

Figure 27: As Figure 23, but for BLSR with the α parameter set to 0.2.

Figure 28: As Figure 23, but for BLSR with the α parameter set to 0.4.

Figure 29: As Figure 23, but for BLSR with the α parameter set to 0.6.

Figure 30: As Figure 23, but for BLSR with the α parameter set to 0.8.

Figure 31: Comparison of the mean EERs of the BLSR training method for different α parameters, both with and without data augmentation. The mean is calculated over all training sample amounts, 1 through 10, and is displayed on the vertical axis; the α parameter is displayed along the horizontal axis. A legend indicates which data is with and without data augmentation. Figure 31a shows the same data as Figure 31b but is zoomed in on the vertical axis.

Figure 32: Box plots of the final learning rates of the base model in Figure 32a and of LSRO in Figure 32b. Each plot contains 100 learning rates, from combinations of training sample amount and cross validation, and compares with and without freezing the model as well as with and without data augmentation.

Figure 33: Comparison of the learning rates of all the different models with box plots. For each model, 400 learning rates are used, from the combinations of training sample amount, cross validations and with/without data augmentation/frozen network.


LIST OF TABLES

Table 1: Comparison of authentication mean EERs for the different models.

Table 2: Layers for the DCGAN discriminator.

Table 3: Layers for the DCGAN generator.

Table 4: Layers for the ResNet50.


ACRONYMS

AUC Area Under Curve

BLSR Bootstrap Label Smoothing Regularizer

CER Crossover Error Rate

CNN Convolutional Neural Network

DCGAN Deep Convolutional Generative Adversarial Network

DL Deep Learning

DML Distance Metric Learning

DNN Deep Neural Network

DoB Date of Birth

DPL Distributed Pseudolabel

EER Equal Error Rate

FA False Acceptance

FR False Rejection

GAN Generative Adversarial Network

LSR Label Smoothing Regularization

LSRO Label Smoothing Regularization for Outliers

NN Neural Network

PDF Probability Distribution Function

SLSR Sparse Label Smoothing Regularization

TA True Acceptance

TR True Rejection

WLSR Weighted Label Smoothing Regularization


1 INTRODUCTION

1.1 Background

Today there are many different biometric techniques for identifying and authenticating users, such as fingerprint, hand geometry, iris, written signatures, voice detection, etc. Physical characteristics such as hand geometry and iris are more accurate than written signatures and voice detection according to some studies [25, 27]. Biometric systems are often deployed to perform one of two types of task: authentication or identification. In authentication, a user provides the system with a biometric sample, i.e. a signature, fingerprint, etc., along with a claim on the identity, and the system checks if the user is who they claim to be. In identification, the objective of the system is to identify the user input among all the other users who are part of the system, and thus return a number of potential matches. In practice there might also not be a large amount of data to train a system with, so it is valuable if the system can produce good results with a small amount of data.

Authentication and identification by written signature is one of the oldest methods of authenticating and identifying people; many documents are signed by hand, which is still to this day accepted in legal documents as proof of the veracity of an agreement. It is known that signatures contain biometric information that can authenticate and identify users, and the task of signature authentication is to authenticate the identity of a person based on the person's handwriting.

Block characters are single characters, either a letter or a number, written within a box/block. Such block writings are commonly used when claiming an identity, typically by a personal number, which in Sweden is a unique number derived from the birth date. However, such unique numbers exist in nearly every country, e.g. the social security number in the USA, the pension number in Switzerland, etc. Since people sign their names frequently on paper, for example when students write exams or when people sign bank cheques, one might suspect that there is also some biometric information in writing one's own name even in block characters. Substantiation or rejection of this suspicion is what the present thesis studies.

Biometric information is very private, since it is used to authenticate people to various systems which contain sensitive data. It is thus very important to keep biometric information safe.


In relation to our study there are two types of signature authentication systems.

• Online signature authentication systems, which are used on a device such as a tablet or a phone that can capture dynamic information such as time, pen orientation, pressure, direction, etc.

• Offline signature authentication systems, where a signature is written offline, i.e. on a sheet of paper, and then scanned as a digital image.

Offline signature authentication is a more complex task to solve since there is no dynamic information, in contrast to online signature authentication; this makes online systems more powerful and more difficult to forge by traditional impersonation methods. On the other hand, online systems are less ubiquitous, as they demand more from their providers in terms of equipment and infrastructure, and they may simply not be acceptable in courts of law.

For a long time, a large part of the research effort has gone into finding good handcrafted features for offline signature identification/authentication; among these handcrafted features are geometric features, as shown in [19], and texture features, as in [32].

But in recent years, thanks to advancements in Deep Learning (DL) using Neural Networks (NNs), there exist techniques which do not need to rely on handcrafted features. These techniques instead use feature representations learnt from raw data and discover which features discriminate the classes from each other. When doing DL and working with images, it is preferable to use Convolutional Neural Networks (CNNs), a type of Deep Neural Network (DNN) that uses filters to represent and extract local spatial information in images. One of the greatest leaps forward for DNNs is the Generative Adversarial Network (GAN), introduced in [11], a type of DNN model used to generate high-quality complex data.

There are some issues with DL techniques, however, and such issues are discussed in several papers. One is that NNs are mostly black boxes and there is a lack of insight into how they operate, though work has been done to mitigate this: for example, in [33] the authors propose a method of visualizing network layers in order to gain insight into what features matter to CNN decisions. Additionally, in [30] the issue of "adversarial examples" is discussed, where it was shown that the cutting-edge CNNs of 2014, such as AlexNet [18], would massively misclassify images after simply adding some barely perceptible perturbations; such techniques are called "adversarial attacks". Similarly, DeepFool [21] is an algorithm which computes adversarial examples that fooled the state-of-the-art classifiers of 2016. That study also showed that augmenting training data with adversarial perturbations improves the robustness of the classifier. Another paper [28] showed the same type of adversarial attack but restricted to changing a single pixel; unlike the others, this study uses a black-box attack, meaning it did not need access to the inner workings of the model. In paper [22] an algorithm is presented that can craft inputs which fool otherwise well-performing networks into suggesting, with high reported certainty, that unrecognisable visual data are images of real objects. This implies that CNNs do not inherently generalize well, or at least not in an intuitive sense. Another issue that has been discovered is the concept of adversarial reprogramming, described in [7]: a method of attack on CNNs where an adversary can craft an input that reprograms a model to do something it was not intended to do, which an attacker can take advantage of to steal computational resources. Vulnerabilities of NNs will not be investigated here, however, even though we are cognisant of the seriousness of these problems.

Label Smoothing Regularization (LSR) is a method of regularization presented in [29]. Instead of having a single ground-truth label, it uses a "label distribution" over all classes, with the ground truth getting the largest value. Doing this keeps the network from becoming overconfident, which regularizes it. In [10] it was shown that LSR was able to increase adversarial robustness, and LSR is described in general as having several advantages over other methods: it is easy to implement by modifying the cross-entropy loss, requires no change in architecture and does not increase training time. A review of LSR as a regularizer was done in [23], where it was found that LSR consistently improved the state-of-the-art models of 2017 across a wide range of tasks, suggesting wide applicability as a regularizer.

LSR has also found use in a semi-supervised context. In paper [34] an LSR variant called Label Smoothing Regularization for Outliers (LSRO) is used to train a network with GAN data by setting the label value to be uniform across all classes. Training a model like this forces it to ignore the most unimportant features and learn the most important ones. This method is not flawless, however, as it has a tendency to oversmooth: intuitively, one can imagine the GAN generating a sample very close to one of the classes, close enough to be considered a part of it. In such a case a uniform distribution would probably not give the best result. Since its introduction, alternative variants have been proposed that relieve this problem, some of which are described in Chapter 3; some of them will be tested along with this paper's own proposed variant, designed to mitigate this problem.

When using DL to do authentication, one can use Distance Metric Learning (DML) methods, which come in the form of siamese networks [17, 6], triplet-loss networks [15], etc. These DML networks work by teaching an NN model to differentiate between multiple sets of data by feeding it data pairwise and telling the model that pairs from different data sets should be classified with, say, a 1 and pairs from the same set with a 0. Recent research [16] has shown that these DML methods perform better in feature extraction than standard multi-class DL models, where a model makes a prediction among many classes, when using a small amount of data. However, this flips when a large amount of data is available. This leads to the question of whether it could be possible to achieve better results with low amounts of data using a standard multi-class model somehow, for example with a training method such as the semi-supervised method with LSR variants and GAN data, which would force the model to learn the most important features.

2 PROBLEM FORMULATION

The goal is twofold. The first is to find out whether there is any recoverable identity-related information when one writes one's own personal identification number in character blocks, using only the images of the written digits on paper, and, if there is such information, to find out how much there is. The second is to investigate whether the semi-supervised training method using LSR variants improves the performance of a multi-class CNN when doing authentication with low amounts of data, and to propose a new LSR variant, Bootstrap Label Smoothing Regularizer (BLSR), and compare it to other variants such as LSRO.

Since the block character data consist of images, there is no temporal information, and the problem is categorized as offline recognition.

Earlier work described in Chapter 3 has done similar writer identification with block characters; however, it used characters written on a mobile device. Because of this it has inherent advantages over the offline problem: it contains temporal information that offline data does not have access to, and it does not suffer from problems such as different pens producing different stroke thicknesses, or other data noise such as smudges.

Studying offline block character recognition gives useful knowledge, as character boxes are widely used in forms. The potential biometric information from these could be (ab)used to authenticate the identity of the person filling in the form. For this reason, any data containing biometric information will need to be handled appropriately in order to protect the integrity of data donors. Since in a real-world scenario one is not likely to have many samples for each person, the experiments will also be limited to a set amount of training samples for each writer.

In order to find out whether there is information about the identity of the writer in the images of the written digits, machine learning and handcrafted features will be used to attempt to discriminate whether or not a certain written personal number belongs to a person. If the results are statistically significant compared to a random guess, the assessment will be that there is some biometric information about the writer within the image. Also, by determining how well the algorithms can identify individual writers given some of their written digits, the amount of biometric information will be quantified and compared to other biometrics.

The DL method of this paper will use a baseline CNN model trained with a semi-supervised method using GAN data and the proposed LSR variant called BLSR, and investigate how well it performs and learns features from the given digit images. This method will be compared to the baseline CNN model trained with a standard supervised method, to see how BLSR affects the end results. Additionally, the CNN will be trained with LSRO as well as the other LSR variants described in Chapter 3, in order to tell how well the proposed method does in comparison to other LSR variants. All of these will ultimately also be compared to a more traditional approach using handcrafted features from [9, 3], in order to gain insight into how they compare to more established techniques. These methods will be evaluated both on how well they do identification and, more importantly, authentication.

Authentication is the more interesting result, as identification of the unique string of numbers that is the personal number is more or less trivial: it is essentially just learning to read, which CNNs have already been shown to do well. Authentication will be tested with unskilled forgeries generated by taking digits written by others and combining them to get the same string of digits; these are referred to as "pseudo forgeries".

The semi-supervised method using GAN and LSR is useful since it is a general method that can be applied to many different problems. And if the proposed LSR variant improves performance beyond LSRO and the other types, then we hope that this will be a meaningful contribution.

3 RELATED WORK

In paper [19] the authors focus on writer verification/identification using handwritten cursive texts and block character texts from two different data sets. The database relevant to the block characters is the Secure Password DB 150, composed of 150 users, where each user has 18 samples of so-called "single character words", which are strings of 8 independently written characters, much like the block characters the present project uses. Although the authors use an online verification/identification system, they also explore how different types of features, such as geometrical features, statistical features and temporal features, perform on their own. However, since the data is captured dynamically, it has inherent advantages over offline image scans, as there is much more information available.

The work in [8] trains a CNN on images of free-flow handwriting to identify the writer. The second-to-last layer is then used as a feature vector, which is next used with a K-Nearest-Neighbour algorithm in order to identify the writer and do writer retrieval. This method could also be used for authentication by extracting features from two writings and using the measured distance between the two to decide on a rejection or acceptance. Such a method of authentication is used in paper [13], which authenticates offline signatures against skilled forgeries with a CNN: the network is first trained to identify signatures from different people, then trained with the forgeries in order to learn discriminative features of forgeries, and finally the second-to-last layer is used to get features for a Support-Vector Machine that does the authentication. Their method achieves state-of-the-art results and is shown to generalize well to other databases. This work will use the method proposed in [8], as it provides a way to use a multi-class CNN to train for authentication, while [13] shows that the method can transfer well to authentication tasks.

The alternatives to this method are DML-based methods, which are designed to determine a similarity between two inputs. One such method is the siamese network used in paper [17], which combines two NNs in parallel in order to discriminate whether two inputs belong to the same class. Such a siamese network was implemented in the context of signature authentication in paper [6] with good results. There is also the triplet loss used in [15], inspired by the siamese network, which utilizes three instances of the same network with the same parameters; it is fed three samples, one of which is used as reference. The outputs are then compared to get two distances to the reference in question, which are used to determine similarity. These methods have benefits such as not having to retrain when more classes are added, being robust to imbalances in data and generally performing well with low amounts of data. However, the models will generally be larger, with more parameters, and according to research [16] they are outperformed by standard DNN models in feature extraction when large amounts of data are available.

The proposed method will rely on the use of GANs, which were first introduced in paper [11]. The GANs in this paper will generate handwritten digits similar to the commonly used MNIST dataset. Paper [24] introduced the Deep Convolutional Generative Adversarial Network (DCGAN), which uses convolutions/deconvolutions in the generator/discriminator; this method shows good results when used with the MNIST dataset. Because of this, DCGAN is chosen as the GAN used to generate data for the unsupervised method of this paper.

The semi-supervised approach in this work is based on LSRO, which was introduced in [34]. The authors of the paper train a CNN for person re-identification from CCTV footage in a semi-supervised manner, using generated GAN data along with the modified loss function LSRO, causing a regularizing effect. The paper presents results showing the method improves performance over the baseline model. The details of LSRO are described in Section 4.5.1.

Since LSRO has a tendency to oversmooth, an alternative proposed in [1] can be used. The method, called Sparse Label Smoothing Regularization (SLSR), works like LSRO except that instead of having a uniform target label distribution over all labels, it first clusters the classes and uses GANs to generate new samples from each cluster. Next, it uses a target label distribution which is uniform over the cluster that the sample in question was generated from. This mitigates the problem of oversmoothing somewhat; however, the problem of oversmoothing within each cluster can still remain.

In paper [4] the authors use a similar semi-supervised technique with ResNet-50 to train the network for writer identification, along with an LSR variant and non-generated unlabeled data. They propose Weighted Label Smoothing Regularization (WLSR), which assigns a weighted target label distribution to the unlabeled data according to the amount of labeled data of each class, as opposed to the LSRO method, where the target labels are uniform. The extra unlabeled data should also be located near the original training data in sample space, otherwise the regularization of WLSR will be ineffective. The authors show that the results of their implementation are competitive with other popular methods, and also that using a semi-supervised method provides a better result than a fully supervised one on a baseline model. They also discuss using GANs in future work in order to get a sample space closer to the labeled data, which is what this work will do. A more formal description of WLSR is found in Section 4.5.2.

Another type of LSR, presented in [26], is referred to as "bootstrapping" and is used in the context of noisy labels, in other words when the ground truth of some samples is wrong or missing. The method works by setting the target label as a combination of the training label and the current prediction of the model. The reasoning is that the model prediction will become better as time goes on, and thus this method will mitigate bad labels. Paper [31] presents a similar approach called Distributed Pseudolabel (DPL), using GAN-generated samples in a semi-supervised manner. DPL works by setting the target labels of the top three predictions of the model to λ1, λ2, λ3 (chosen in the paper as 0.4, 0.2 and 0.2 respectively), with λ1 corresponding to the most likely prediction according to the model, etc. The authors report improvements upon a baseline model on several datasets, and a comparison to LSRO shows DPL getting better results. An in-depth description of DPL can be found in Section 4.5.3.

In order to gain an idea of how well the proposed method does, a comparison to the already well-established method of papers [9] and [3] will be made. Both of these papers use the same handcrafted features based on the contours of written text, which will make implementation and interpretation of results easier. While the former uses the feature extraction on written signatures, the latter uses it on characters segmented out from written text, making it similar but not equivalent to the problem of writer identification and authentication using written block character digits. The method works by having the features be Probability Distribution Functions (PDFs), which are matrices representing the probability of some variable, in this case different contour elements. These PDFs are used to identify/authenticate a signature and find the closest matching writing. The results they report are comparable to other contemporary methods using handcrafted features. The theory of these features is explained in more detail in Section 4.2.

Paper [12] presents a literature review of different techniques used for offline signature verification. The techniques presented use different feature extractions, mainly handcrafted features such as geometric features, graphometric features, directional features, etc., and representation learning using DL. Since pre-processing is important in pattern recognition problems, the authors present some of the main pre-processing techniques used in signature verification, such as signature extraction, noise removal, etc. Most state-of-the-art models use either a Support-Vector Machine or a Hidden Markov Model, but recently error rates have dropped significantly thanks to the advancements in DL. The paper also presents tables of state-of-the-art performances on different datasets, where learning features using CNNs shows significant improvements compared to the other presented methods.

4 THEORY

4.1 Identification and Authentication

When working with biometrics one generally works on two related but different problems: identification and authentication. Whereas identification finds the best matching sample out of many samples, authentication answers the question of whether two samples are from the same donor, or belong to the same individual/class.

4.1.1 Identification Metric

Identification performance is generally measured by some kind of top-N metric, found by calculating the percentage of cases where the correct signature was found among the top N matching samples, where N is a positive integer below the total number of classes. Because of how the top-N metric works, it is important to note that, generally, the larger the number of classes to classify, the harder it becomes to get a high top-N score; this may distort results when comparing two systems to each other.
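To make the metric concrete, below is a minimal sketch (not the thesis code) of computing top-N accuracy from a precomputed query-to-class distance matrix; the function and variable names are illustrative assumptions.

```python
import numpy as np

def top_n_accuracy(distances, true_ids, n):
    """Top-N identification accuracy.

    distances: (num_queries, num_classes) matrix of distances,
               smaller meaning a better match.
    true_ids:  (num_queries,) index of the correct class per query.
    Returns the fraction of queries whose correct class is among
    the n closest classes.
    """
    ranked = np.argsort(distances, axis=1)[:, :n]  # n best matches per query
    hits = [true_ids[i] in ranked[i] for i in range(len(true_ids))]
    return float(np.mean(hits))
```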

4.1.2 Authentication Metric

Authentication is measured with False Acceptance (FA) and False Rejection (FR), or alternatively True Acceptance (TA) and True Rejection (TR), which hold the same information. Acceptances and rejections are the decisions made by the system on whether two samples are from the same person or not, respectively. When the system is tested with two samples from different people it is called "impostor testing", as in reality this would happen when someone tries to impersonate someone else. On the other hand, when the two samples are from the same person it is called "client testing". In order to compare how well a certain signature matches another, a metric of similarity or dissimilarity is used: similarity metrics have high values when the samples are similar, while dissimilarity or distance metrics have high values when the samples are dissimilar and values closer to 0 when they are similar.

With neural networks this distance could be computed from the outputs of one of the final layers. The calculated distance is used when deciding whether the two samples are from the same person, by comparing it to some threshold. If the distance is greater than the threshold, the system decides on a rejection, otherwise an acceptance. If the system made a correct decision it is counted as true, else as false.

Equal Error Rate (EER), also known as Crossover Error Rate (CER), is a common way to quantify an authentication system with a single number: the EER is the error rate at the point where acceptances and rejections have the same error rate. The lower the EER, the better.

In order to quantify these results, one can look at the FA/FR rates and plot how they change at different threshold cutoffs, starting at distance 0, where similarity is absolute, and increasing rightward on the threshold axis. Doing this causes the FR rate to be 100% at threshold cutoff distance 0 and continuously decrease to 0% as the threshold cutoff distance increases, while the FA rate inversely starts at 0% and increases to 100%. With this visualization one can also easily read off the EER at the point where the FA and FR lines intersect.
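As an illustration, here is a minimal sketch of reading off the EER from sets of client and impostor distances by sweeping the threshold; the function names and the simple linear sweep are assumptions, not the thesis implementation.

```python
import numpy as np

def equal_error_rate(client_d, impostor_d):
    """Estimate the EER by sweeping the decision threshold.

    client_d:   distances from client (genuine) comparisons.
    impostor_d: distances from impostor comparisons.
    At each candidate threshold, FR is the fraction of client pairs
    rejected and FA the fraction of impostor pairs accepted; the EER
    is taken where the two rates are closest.
    """
    thresholds = np.sort(np.concatenate([client_d, impostor_d]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        fr = np.mean(client_d > t)     # genuine pairs wrongly rejected
        fa = np.mean(impostor_d <= t)  # impostor pairs wrongly accepted
        if abs(fr - fa) < best_gap:
            best_gap, eer = abs(fr - fa), (fr + fa) / 2
    return eer
```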

4.2 Handcrafted Contour Features

This work utilizes handcrafted features presented in the earlier works [9, 3]. These features are based on the contours of writings and are used to identify the writer. More specifically, this work makes use of 3 of the features, each a vector or matrix Probability Distribution Function (PDF) of values describing the contour direction, contour hinge and direction co-occurrence. PDFs in this case are simply a set of values in a vector or matrix that sum to 1 and represent the probability function of a variable. The contour direction PDF is a vector representing the probability of a part of the contour being angled in a certain direction; this feature is referred to as f1. Next is the contour hinge PDF, a 2D matrix representing the probability of a part of the contour, referred to as a hinge, being curved with two angles; this second feature is referred to as f2. The third feature is split into two parts, horizontal and vertical direction co-occurrence, both represented by 2D matrix PDFs that contain the probability of some part of the contour having one direction while a horizontally/vertically adjacent part of the contour has another direction. This feature is referred to as f3h/v, depending on whether it looks horizontally or vertically, and can be thought of as a measure of the roundness of the strokes. Each of the features is illustrated in Figure 1 for additional clarity.

With these features it is possible to combine multiple writings into one set of features. This is done by adding all f1 PDFs together and then normalizing, and doing the same with f2 and f3; once done there will only be a single set of PDFs (f1, f2, f3). Combining data samples like this allows for a more accurate depiction of the inherent feature PDFs of a person.

Figure 1: Graphical illustration of contour direction (f1), contour hinge (f2) and horizontal direction co-occurrence (f3h). The figure was taken from paper [9], and credit goes to the authors.

When combining multiple samples with respect to personal numbers, this can be done in two ways. The first is digit by digit: combining the features extracted from, say, the first digit of the personal number, then the second, and so on; doing this, one ends up with 6 feature sets, one for each digit. The second way is to simply combine all PDFs obtained from a complete personal number; doing this, one ends up with a single feature set. Figure 2 shows visually the difference between doing digit-by-digit comparison and combining all digits.

In order to find the distance between two feature sets, to decide how similar two data samples are to each other, the authors of paper [9] use the χ2 distance between each feature and then take the average of those distances. The χ2 distance is a way to compute similarity between PDFs and is calculated according to the following equation:

$$ \chi^2_{qi} = \sum_{n=0}^{N} \frac{(p[n]_q - p[n]_i)^2}{p[n]_q + p[n]_i} \qquad (1) $$

where N is the total number of elements in the PDF, n is the PDF element index, p denotes the PDF in question, and the variables q and i represent the two samples being matched.
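A minimal sketch of Equation 1 and of the sum-and-normalize combination described above; the function names and the small epsilon guard against division by zero are assumptions.

```python
import numpy as np

def chi2_distance(p_q, p_i, eps=1e-12):
    """Chi-squared distance between two feature PDFs (Equation 1).
    eps guards against division by zero in empty bins."""
    p_q = np.asarray(p_q, dtype=float).ravel()
    p_i = np.asarray(p_i, dtype=float).ravel()
    return np.sum((p_q - p_i) ** 2 / (p_q + p_i + eps))

def combine_pdfs(pdfs):
    """Combine one feature's PDFs from several samples of a writer:
    element-wise sum followed by renormalization so it sums to 1."""
    total = np.sum(pdfs, axis=0)
    return total / total.sum()
```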

4.3 Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a type of NN, introduced in [5], designed to work with types of data where the elements have some spatial relation to each other, such as pixels. The CNN architecture works by applying filters to the data; the filters are defined by their weights, which are the learnable parameters of the network. The filter application is additionally defined by the number of filters applied, the size of the filters, the activation function, etc.

There are also operations such as max-pooling, where one reduces the size of the image by a factor, say 2, along rows and columns: one takes the max of every (non-overlapping) 2 × 2 square to produce the output. In opposition to max-pooling there is also up-sampling, which makes the image a factor larger, filling in values with the use of the existing data. Applying a layer of filters and max-pooling is known as convolution, while using up-sampling is called deconvolution in the NN literature; doing this over several layers produces an increasingly abstract representation of the data.

Figure 2: The figure shows how some N number of training samples are combined and then compared to a single test sample. Figure 2a shows the digit-by-digit approach: one ends up with 6 distances, one for each digit, and the final distance is calculated as the mean of those 6 distances. Figure 2b shows the technique of combining all digits and then comparing train and test. "Combining" multiple samples refers to taking their features, adding them together and normalizing. "Comparing" means calculating the χ2 distance for each feature, as described in Equation 1.
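As a concrete illustration of the non-overlapping 2 × 2 max-pooling described above, here is a minimal NumPy sketch (the function name is an assumption):

```python
import numpy as np

def max_pool_2x2(img):
    """Non-overlapping 2x2 max-pooling of a 2D array: the image is
    cropped to even dimensions, and each output pixel is the max of
    one 2x2 square, halving the size along rows and columns."""
    h, w = img.shape
    img = img[:h - h % 2, :w - w % 2]
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```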

4.4 Generative Adversarial Networks

In the 2014 paper [11] a new type of DNN was introduced, called the Generative Adversarial Network (GAN); this model is used to generate new high-quality and varied data from already existing data. The method makes use of two NNs, one generator and one discriminator. These two networks compete against each other: the generator creates new data as close to the existing data as possible, while the discriminator tries to decide whether its input is real or was created by the generator. This method of producing a model that can generate new data yields impressive results and has been a common area of research since its inception.
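To make the adversarial game concrete, below is a hedged PyTorch sketch of one GAN training step, assuming a generator `gen` mapping noise to images and a discriminator `disc` outputting a probability (sigmoid) of its input being real; all names and shapes are assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_training_step(gen, disc, real, opt_g, opt_d, z_dim=100):
    """One adversarial update: the discriminator learns to score real
    images as 1 and generated ones as 0, then the generator learns to
    make the discriminator score its outputs as 1."""
    b = real.size(0)
    # Discriminator step (generator output detached so it is not updated)
    fake = gen(torch.randn(b, z_dim)).detach()
    loss_d = bce(disc(real), torch.ones(b, 1)) + bce(disc(fake), torch.zeros(b, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    # Generator step: try to fool the discriminator
    loss_g = bce(disc(gen(torch.randn(b, z_dim))), torch.ones(b, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```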

4.5 Label Smoothing Regularization

The method of Label Smoothing Regularization (LSR) has existed since the early days of NNs; however, in 2016 the concept was revived when discussed in [29]. With a standard training method the ground-truth target label is set to 1, while all other labels are set to 0, as shown in Equation 2:

$$ q(k) = \begin{cases} 1 & k = k_{gt} \\ 0 & k \neq k_{gt} \end{cases} \qquad (2) $$

where q is the "target label function", which essentially describes what prediction is desired from the model, k is a class, and k_gt is the ground-truth class. The cross-entropy loss is calculated according to Equation 3:

$$ L = -\sum_{k=1}^{N} \log(p(k))\, q(k) \qquad (3) $$

where p(k) is the model's prediction of an input being class k, and N is the total number of classes (note that the loss is minimized when p(k) = q(k)). This means the loss for the target of Equation 2 is simply calculated with Equation 4:

$$ L = -\log(p(k_{gt})) \qquad (4) $$

LSR works instead by setting all the target labels to α/N, where α is a number between 0 and 1, and then adding 1 − α to the ground truth, as shown in Equation 5:

$$ q(k) = \begin{cases} 1 - \alpha + \frac{\alpha}{N} & k = k_{gt} \\ \frac{\alpha}{N} & k \neq k_{gt} \end{cases} \qquad (5) $$

with the cross-entropy loss again being calculated with Equation 3. Using LSR makes the loss-minimizing model less confident of the correct answer, thus having a regularizing effect on the model. Figure 3 shows an example of how LSR creates a target function q(k) from a ground truth using an α parameter of 0.5.

Figure 3: Two bar plots showing how LSR takes a ground truth and sets a target function with an α of 0.5. Figure 3a shows the given prediction and Figure 3b the resulting target function.
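A minimal sketch of the LSR target (Equation 5) and the cross-entropy loss (Equation 3); the function names and the tiny constant inside the logarithm are assumptions.

```python
import numpy as np

def lsr_target(k_gt, num_classes, alpha):
    """Label-smoothed target (Equation 5): every class gets
    alpha/N and the ground truth additionally gets 1 - alpha."""
    q = np.full(num_classes, alpha / num_classes)
    q[k_gt] += 1.0 - alpha
    return q

def cross_entropy(p, q, tiny=1e-12):
    """Cross-entropy loss (Equation 3) between prediction p and target q."""
    return -np.sum(q * np.log(p + tiny))
```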

4.5.1 Label Smoothing Regularization for Outliers

A new training method using a type of LSR was proposed in 2017 in paper [34]. It suggests using LSR combined with GANs in order to regularize a model through semi-supervised training. The so-called Label Smoothing Regularization for Outliers (LSRO) method trains the network with the unlabeled GAN images, defining the target label function to have a uniform value. The resulting target label function becomes:

$$ q(k) = 1/N \qquad (6) $$

which is inserted in Equation 3 in order to calculate the cross-entropy loss.

This results in the model being unsure about which class the GAN data belongs to, which is desirable since the data should be a mix of all the classes. This forces the model to learn more discriminative features while ignoring the trivial ones. One note on LSRO, or any of the other variants: the unlabeled data being used should be close to the same sample space as the labeled data it trains on, otherwise the effect would be undesirable [34, 4].
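In code, the LSRO target of Equation 6 is simply a uniform vector (a sketch with assumed names):

```python
import numpy as np

def lsro_target(num_classes):
    """Uniform target for GAN-generated samples (Equation 6)."""
    return np.full(num_classes, 1.0 / num_classes)
```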

4.5.2 Weighted Label Smoothing Regularization

Another LSR variant that has been proposed is Weighted Label Smoothing Regularization (WLSR), from paper [4]. With WLSR the target label function is defined so that each class has a value proportional to the number of labeled images in that class. The WLSR target label function is described in Equation 7:

$$ q(k) = \frac{\sum_{n=1}^{N} I(y_n = k)}{N} \qquad (7) $$

Equation 7 is inserted into Equation 3 in order to calculate the cross-entropy loss. Here I(y_n = k) denotes an indicator function giving a value of 1 when the data sample y_n is from class k, and 0 otherwise. When first conceived, WLSR was used as a semi-supervised method in conjunction with unlabeled data not generated by a GAN; however, the method can also be applied to GAN-generated data, since the generated data will be a combination of the data used to train the GAN, and the generated data is more likely to resemble data that the GAN was given more of.
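A minimal sketch of the WLSR target of Equation 7, taking N as the number of labeled training samples (names are assumptions):

```python
import numpy as np

def wlsr_target(labels, num_classes):
    """WLSR target (Equation 7): class k's value is the fraction of
    labeled samples with label k. `labels` is the vector y_1..y_N."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / len(labels)
```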

4.5.3 Distributed Pseudo Label

A third method for semi-supervised training, Distributed Pseudolabel (DPL), was introduced in paper [31]. The suggested target label function uses the model's own predictions when setting the target labels for GAN data: the top three predicted classes are set to predetermined values, while the rest are set to a uniform value. The function is described more formally in Equation 8:

$$ q(k) = \begin{cases} \lambda_1 & k = k_{top1} \\ \lambda_2 & k = k_{top2} \\ \lambda_3 & k = k_{top3} \\ \frac{1 - (\lambda_1 + \lambda_2 + \lambda_3)}{N - 3} & \text{otherwise} \end{cases} \qquad (8) $$

and again Equation 8 is inserted into Equation 3 in order to calculate the cross-entropy loss, with λ1, λ2, λ3 being pre-set hyperparameters (set to 0.4, 0.2 and 0.2 in paper [31]) and k_topx being the top-x class prediction of the model.
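A minimal sketch of the DPL target of Equation 8 given a model prediction vector (names are assumptions; more than 3 classes assumed):

```python
import numpy as np

def dpl_target(p, lambdas=(0.4, 0.2, 0.2)):
    """DPL target (Equation 8): the model's top-3 predicted classes
    get lambda_1..lambda_3; the remaining N - 3 classes share the
    leftover probability mass uniformly."""
    n = len(p)
    q = np.full(n, (1.0 - sum(lambdas)) / (n - 3))
    top3 = np.argsort(p)[::-1][:3]  # indices of the 3 largest predictions
    q[top3] = lambdas
    return q
```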

4.5.4 Residual Neural Network

A ResNet [14] is an artificial neural network which consists of convolutional layers with skip connections added between every stack of two 3 × 3 convolutions. These skip connections turn the network into its residual counterpart; they decrease the training error and generalize better to validation data. For deeper ResNets, such as ResNet50 with 50 layers, the block between each skip connection is replaced with a stack of 3 layers instead of 2, called a bottleneck. These bottlenecks consist of 1 × 1, 3 × 3 and 1 × 1 convolutions.
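As an illustration, here is a hedged PyTorch sketch of one such bottleneck block, assuming an identity shortcut (matching channel counts, stride 1); ResNet50 additionally uses strided and projection variants not shown here.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1, 3x3, 1x1 convolutions with a
    skip connection adding the block input to its output."""
    def __init__(self, channels, mid_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # skip connection
```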

5 METHODOLOGY

5.1 Block Character Data

5.1.1 Data Digitization

The data provided for this work are pages with handwritten digits and characters within boxes. The pages are scanned and then rotated, scaled and shifted into alignment using a set of spiral markers ([20], also known as spiral codes) as references. Next, in order to get the digit of each individual box, the boxes are cut out into 35 × 27 pixel sub-images using a script, assuming the boxes are at the same pixel coordinates in each aligned image. An example of what these character boxes look like is shown in Figure 4.

Figure 4: Image of the form with the blocks where the digits and other characters would be written.

Furthermore, the digit data was provided unlabeled, with only the Date of Birth (DoB) part of the personal number provided. The 4 extra digits unrelated to the birth date, which would allow identification with some effort, were concealed from the authors; likewise, name and email data were concealed. Some writers wrote their DoB as "YYYYmmdd" while other writers used the format "YYmmdd". Because writers used different formats when writing their DoB, it was noticed that the writers who wrote in the format "YYYYmmdd" were missing the days of their DoB, i.e. the format "YYYYmm" was retrieved, because only the first 6 digits of their personal number were received. The indices of the samples that used the 8-digit format for the DoB were saved and sent to the data provider in order to retrieve the days of the DoB. The provided birth-date digits (in individual squares on a black background) were then manually labeled and stored in a hash map with the DoB as key and a vector of the pages it appears on as the value. This DoB information is used to distinguish between individuals without knowing their identities. The data was "cleared" before being made available to the authors, so that there was no data from two individuals with the same birth date.


5.1.2 Data Description

The data given for this work consists of 317 people writing their DoB in six (YYmmdd) or eight (YYYYmmdd) digits, with a varying number of samples per person, as seen in Figure 5a: some have over 50 samples, but most are in the range of 5-20 samples per person, as seen in Figure 5b, with the median being 8 samples per writer and the average 13.6. In total there are 4290 written DoBs with 28484 individual written digits; Figure 6 shows the distribution of the individual digits.

5.1.3 Data Pre-Processing

The alignment precision is high, with large overlap of box boundaries among the 28484 individual boxes. However, due to the varying photonic exposure of each sample entering the scanner and the small but nonetheless non-zero motion blur in the scanner, a small portion of the vertical box borders still does not overlap with the black borders of the average box. Thus, when cutting the scanned image along fixed locations after alignment, there are usually still weak borders left in the sub-images. This is combated by looking at the maximum values of each row and column, from which it can be discerned where the borders end; those rows/columns are removed.
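A minimal NumPy sketch of this border removal is shown below. It assumes a residual border shows up as edge rows/columns whose maximum intensity exceeds a threshold; the polarity, threshold and scan margin are our assumptions, not the thesis' actual settings.

```python
import numpy as np

def remove_box_borders(img, border_thresh=200, margin=5):
    """Crop residual box borders from a digit sub-image by scanning inwards
    from each edge and dropping rows/columns that still look like border."""
    h, w = img.shape
    top, bottom, left, right = 0, h, 0, w
    while top < margin and img[top].max() > border_thresh:
        top += 1
    while bottom > h - margin and img[bottom - 1].max() > border_thresh:
        bottom -= 1
    while left < margin and img[:, left].max() > border_thresh:
        left += 1
    while right > w - margin and img[:, right - 1].max() > border_thresh:
        right -= 1
    return img[top:bottom, left:right]
```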

The scanned images also tend to have some remaining weak background noise. In order to remove it, Otsu's method is applied to binarize the image, a dilate filter is then applied twice, and the binary image is multiplied with the original image. This leaves the area around the written digit untouched while the rest is set to zero.
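The noise suppression step could look roughly like the following OpenCV sketch; the 3 × 3 kernel size is an illustrative choice, and masking with cv2.bitwise_and is equivalent to multiplying by the 0/1 binary image.

```python
import cv2
import numpy as np

def suppress_background(img):
    """Zero out background noise around the written digit: Otsu
    binarization, two dilations to grow the digit mask, then masking
    the original image with the dilated binary mask."""
    _, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    mask = cv2.dilate(mask, np.ones((3, 3), np.uint8), iterations=2)
    return cv2.bitwise_and(img, img, mask=mask)
```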

5.2 Experiment Methodology

Tests of identification and authentication of personal identity from the writing styles of digits are in this study done with different methods: handcrafted contour features, and a DL model pre-trained on ImageNet. The contour features, which are briefly explained in Section 4.2, are from papers [9, 3] and their usage is described below in Section 5.2.1. The DL model will be trained with a fully supervised method as a baseline, as well as with the method proposed in this work along with the other similar methods discussed earlier. The precise implementation of these is described in Section 5.2.2.

The experiments are set up so that every experiment uses a fixed number of training samples per writer, in order to study how the number of training samples affects identity recognition performance.


Figure 5: Figure 5a shows a histogram with the number of samples on the x-axis and the number of writers on the y-axis. Figure 5b is a box plot of writers and number of samples; the vertical axis is the number of samples. Minimum samples for a writer is 1, maximum is 70, median is 8 and average is 13.6. The red crosses are considered outliers.


Figure 6: Bar plot of digit counts. The vertical and horizontal axes correspond to the digit count and the digit in question, respectively.

5.2.1 Handcrafted Feature Experiment

The handcrafted features are extracted using code from [9] and are used for both identification and authentication. Since previous papers have shown the features to carry different amounts of information, the experiments were run with all possible feature combinations in order to find the best possible results. When the word "train" is used for the handcrafted features, it does not refer to training a model in the way DL does; in this experiment "train" simply refers to combining multiple data samples to get a more accurate prediction. A description of how the features are combined or "trained", and how distances are calculated, can be found in Section 4.2.

5.2.1.1 Handcrafted Feature Identification Experiment

In order to test the identification ability of the handcrafted features, all data is split into training and test data. When N training samples are being tested, N training data points (i.e. personal numbers) are picked randomly from each person, and the rest of the data is used as test data. If a person has fewer than N data samples, then all of that person's data is used for training. After the data has been selected, the features are extracted and the training data for each person is combined in both ways described in Section 4.2: digit-by-digit merging, or combining all digits into one.

The next step is to go through all the validation samples and calculate the distance between them and the training data, then sort which class is closest to each validation sample according to the training data. Using this information, the Top-N accuracy is extracted by taking the number of times the correct class was among the top N closest classes for each validation sample and dividing by the total number of validation samples. Results from both combination methods will be presented.
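Expressed as code, the Top-N computation amounts to the following NumPy sketch; the distance matrix layout and names are our illustration:

```python
import numpy as np

def top_n_accuracy(distances, true_classes, n=1):
    """Top-N identification accuracy. `distances` is a
    (num_validation_samples x num_classes) matrix of distances between
    each validation sample and each writer's combined training data."""
    ranked = np.argsort(distances, axis=1)[:, :n]  # n closest classes per sample
    hits = [true_classes[i] in ranked[i] for i in range(len(true_classes))]
    return float(np.mean(hits))
```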

5.2.1.2 Handcrafted Feature Authentication Experiment

Because this work limits the number of training samples to some amount N per writer, standard cross-validation would not have worked given the constraints, so a replacement method for cross-validation is created. When doing the authentication experiment with N training samples per person, all people with fewer than N samples are excluded, since they do not have enough data for any client tests. For writers with more than N data samples a "fold" is made: N data samples are selected as a training set while the rest is used as validation data. If the writer has more than 2 ∗ N data samples, another "fold" is made, in which another set of N data samples that have not yet been used in training are selected, with the rest used for validation. This repeats for writers with more than 3 ∗ N data samples, and so on. All the samples in a training set are combined, and this combined training set is used to calculate distances to all the validation samples; these distances are used to calculate the FR rate. In order to test the FA rate, a "pseudo forgery" is created for every validation sample: random digit entries from different people are used to build the same personal number as the current writer being tested. The pseudo forgeries are used as imposters, which the system should reject when presented. The distances between the combined training sample and the pseudo forgeries are calculated and used when calculating the FA rate.
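A sketch of the pseudo-forgery construction follows, assuming a pool that maps each digit 0-9 to (writer, image) entries; the names and data structure are our illustration, not the thesis' code.

```python
import random

def make_pseudo_forgery(digit_pool, personal_number, claimed_writer):
    """Assemble an imposter sample for a claimed personal number: each
    digit image is drawn at random from writers other than the claimed one."""
    forgery = []
    for d in personal_number:  # personal_number as a string, e.g. "920115"
        candidates = [img for writer, img in digit_pool[int(d)]
                      if writer != claimed_writer]
        forgery.append(random.choice(candidates))
    return forgery
```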

Using this method creates an imbalance where writers with a greater total number of data samples contribute more results per sample than writers with less data. In order to fix this imbalance, a weight for each distance is calculated such that the total contribution of each sample is the same. This is done by making the weight the inverse of the number of folds the writer is able to do.
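Putting the fold scheme and the weighting together, a minimal sketch could look as follows; this is our reading of the described procedure, not the thesis' exact code.

```python
import random

def make_folds(samples, n_train):
    """Split one writer's samples into disjoint training folds of size
    n_train; the remaining samples of each fold act as validation data.
    Each fold is weighted by the inverse of the writer's fold count so
    that every writer's total contribution is balanced."""
    if len(samples) <= n_train:
        return []  # writer excluded: not enough data for client tests
    shuffled = random.sample(samples, len(samples))
    n_folds = len(shuffled) // n_train
    folds = []
    for f in range(n_folds):
        train = shuffled[f * n_train:(f + 1) * n_train]
        val = [s for s in shuffled if s not in train]
        folds.append({"train": train, "val": val, "weight": 1.0 / n_folds})
    return folds
```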

5.2.2 Deep-Learning Experiment

The proposed method consists of using a DL model to identify and authenticate the writer of a personal number while employing a semi-supervised learning method using the LSR variants described in Section 4.5, as well as the proposed LSR variant named "BLSR", described in Section 5.2.2.1. The semi-supervised method uses labeled data along with data generated by GANs; the creation and training of the GAN is described in Section 5.2.2.2. The DL model used is described in Section 5.2.2.3. The details of how the model is trained to identify
