
BETTER LEARNING THROUGH IMPROVED DISTRIBUTIONAL MODELING

by

ETHAN MCAVOY RUDD

B.S., Trinity University, 2012

M.S., University of Colorado Colorado Springs, 2014

A dissertation submitted to the Graduate Faculty of the

University of Colorado Colorado Springs

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2017


This dissertation for the Doctor of Philosophy degree by Ethan McAvoy Rudd

has been approved for the Department of Computer Science

by

Terrance E. Boult (Chair)

Walter J. Scheirer

Jonathan Ventura

Manuel Günther

Rama Chellappa


Rudd, Ethan McAvoy (PhD, Engineering)

Better Learning through Improved Distributional Modeling

Thesis directed by Professor Terrance E. Boult

ABSTRACT

With the 21st century well underway, machine learning algorithms have advanced considerably in their ability to tackle difficult recognition problems. However, machine recognition is still rife with challenges, some of which are a direct result of the advances made. For instance, as applications of machine recognition generalize, it is increasingly important for algorithms to “know” when they have no basis to make a prediction. Similarly, algorithms must be able to draw correlations across different training instances and to generalize from a training set how to make inferences about similarly correlated but differently distributed samples at query time. With the advent of algorithms and frameworks that can leverage datacenter-scale and general-purpose graphics processing unit (GP-GPU) computation, and the necessity of gathering datasets that are too large for hand-labelling to train such models, increasingly noisy training data constitutes another problem that has emerged. In this dissertation, we investigate these three problems and propose and evaluate solutions that tackle them by incorporating additional distributional information into optimization objective functions. We further explore specific applications including machine vision and malware recognition.


ACKNOWLEDGEMENTS

I would like to begin by thanking all of my dissertation committee members: Dr. Terry Boult, Dr. Walter Scheirer, Dr. Manuel Günther, Dr. Jonathan Ventura, and Dr. Rama Chellappa. I chose each of them due to their remarkable intelligence and track record of quality research. Dr. Terry Boult, my advisor, I shall further acknowledge in subsequent paragraphs. Dr. Walter Scheirer has spent untold hours outside of the normal purview of his job to immensely improve the quality of research, both of publications submitted to peer-reviewed conferences and journals and of this manuscript. We are very lucky to have Dr. Manuel Günther at the UCCS Vision and Security Technology (VAST) Lab, and I am very grateful for his advice on both research directions and development practices, tips, and tricks. He has spent a great deal of time creating and sharing very high quality code repositories, assisting myself and others in writing, modifying, and debugging code, and co-authoring, proofreading, and editing papers. His efforts have contributed greatly to the overall quality of this manuscript. Dr. Jonathan Ventura has been a tremendous asset to the VAST Lab and has significantly improved UCCS's research capacity in machine vision and machine learning. He is taking the lab in new and very interesting directions in the augmented reality (AR) and robotics spaces. Finally, Dr. Rama Chellappa has been a global force unto himself, both in the advancement of computer vision and pattern recognition technology and in securing the funding for that advancement. It is largely thanks to his efforts that the UCCS VAST Lab has been able to collaborate with the University of Maryland (UMD) in the IARPA Janus project that made much of the research presented in this dissertation possible.

Second, I would like to thank all of my friends and colleagues at the VAST Lab, and in particular, I would like to call out three names: 1.) Andras Rozsa, for being a wonderful co-author on several publications and for never hesitating to voice differences of opinion, 2.) Chad Mello, for suggesting and being willing to pursue many heterogeneous projects and problems, and 3.) Steve Cruz, for keeping our servers up and running, doing installs, and repeatedly driving down to our server room in University Hall to reboot and upgrade many machines. There are untold others from the VAST Lab and the UCCS faculty, staff, and student body who have encouraged and assisted me in pursuing this Ph.D. program, and I would like to acknowledge them too.

The person who deserves absolutely the greatest acknowledgement, both for the process of this dissertation and for the success of my Ph.D. overall, is my advisor, Dr. Terry Boult, for numerous reasons. First, for hiring me at his company, Securics Inc., thereby funding my Master's degree and, in the process, getting me interested in some very interesting fields that I did not even know existed. Second, for spending countless hours suggesting research directions in a number of very heterogeneous research areas, some very unrelated to the normal focus of his research. Third, Terry has taught me how to publish, and through conferences and other means, including a brief stint at Google, has put me in touch with remarkably intelligent and accomplished individuals in the computer vision and machine learning communities. Terry's level of intelligence is exceptionally rare, and it has been a tremendous privilege to work with someone of his intellect and enthusiasm. I also appreciate the enormous amounts of effort he spends providing resources for the lab, running conferences, reviewing papers, founding and maintaining the Bachelor of Innovation, and helping others start companies. UCCS is extremely lucky to have him.


Ginger Boult bears special acknowledgement as well for the amount of time that she has spent helping myself and other graduate students with travel and conference logistics, as well as organizing conferences herself. From helping students navigate the infernal Concur system to scheduling conference venues and running registration desks, Ginger does a tremendous amount of work – voluntarily, I must add – that is responsible for the seamless running of pivotal events in the computer science research community. In my opinion, she has done far more for the computer vision community than most researchers in the area. I would also like to thank both Terry and Ginger for their enormous generosity to myself and others, not only in terms of their time, but also in terms of the resources that they have provided, including the use of their Keystone condo. They are two of the most generous people I know.

I am also immensely grateful to family members who have encouraged and inspired me over the years, starting with my late grandparents, George and Cleta Rudd, who recognized the value of higher education. I thank my parents, Jack and Emily Rudd, not only for their excellent genetics, but also for instilling intelligence and critical thinking as values to strive for, being very supportive of my early education, paying for my undergraduate degree, and, unlike many parents, encouraging me to pursue graduate study. I would also like to thank my older sister, Melissa Rudd, particularly for serving as a positive role model in my formative years – I was very academically lazy until around age 20, but still quite far ahead by most standards thanks to the extreme peer pressure imposed by Melissa. Additionally, I cannot go without acknowledging my uncle, James Brown, for consistently demonstrating a winner's mentality, even in the face of tremendous adversity. This is a mentality that I hope has rubbed off on me, but that I am still working on cultivating.


Finally, I would like to thank the funding agencies that made this work possible. Terry's former company, Securics Inc., supported much of my Master's and Ph.D. coursework. My Ph.D. research has been supported by research contracts through the VAST Lab, particularly:

• This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

• This work was also supported in part by the National Science Foundation, NSF grant number IIS-1320956: Open Vision - Tools for Open Set Computer Vision and Learning.


TABLE OF CONTENTS

CHAPTER

I Introduction

II Background and Related Work
  2.1 The Open Set Problem
    2.1.1 Background, Assumptions, and Distributional Information
    2.1.2 Related Work
  2.2 The Multi-Objective Domain Adaptation Problem
    2.2.1 Background, Assumptions and Distributional Information
    2.2.2 Related Work
  2.3 The Noisy Training Set Problem
    2.3.1 Background, Assumptions, and Distributional Information
    2.3.2 Related Work
  2.4 Towards an Invariant Feature Space
    2.4.1 Background, Assumptions, and Distributional Information
    2.4.2 Related Work
  2.5 Open Set Recognition of Stealth Malware
    2.5.1 Background, Assumptions, and Distributional Information
    2.5.2 Related Work

III Addressing the Open Set Problem
  3.1 Introduction
  3.2 Theoretical Foundation
    3.2.2 Decision Function
  3.3 EVM Formulation
    3.3.1 Model Reduction
    3.3.2 Incremental Learning
  3.4 Algorithm Details
  3.5 Experimental Evaluation
    3.5.1 Multi-class Open Set Recognition on Letter
    3.5.2 Open World Recognition on ImageNet
  3.6 Practical Considerations for the EVM
  3.7 Weibull Parameter Distributions
  3.8 EVM-SVM Analogy
  3.9 Chronological Evolution of EVM Approaches
  3.10 Conclusion

IV Addressing the Multi-Objective Domain Adaptation Problem
  4.1 Introduction
  4.2 Approach
  4.3 Experiments
    4.3.1 Dataset
    4.3.2 Evaluating MOON on CelebA
    4.3.3 CelebAB: A Balancing Act
  4.4 Discussion
    4.4.1 Handling Mis-aligned Images
    4.4.2 Different Test Demographics
    4.4.3 Face Verification on LFW
  4.5 Conclusion

V Addressing the Noisy Label Problem
  5.1 Introduction
  5.2 Approach
  5.3 Implementation
    5.3.1 Caffe Layer Implementation
    5.3.2 Probability of Label Validity
  5.4 Experiments
    5.4.1 Effects of Different Noise Types
    5.4.2 Noisy Label Discrimination
    5.4.3 Gradient Re-Weighting Experiments
  5.5 Discussion
  5.6 Conclusion

VI Towards an Invariant Feature Space Representation
  6.1 Introduction
  6.2 Analysis of Face Networks
    6.2.1 Attribute Prediction Experiments
    6.2.2 Pose Prediction Experiments
  6.3 Toward an Invariant Representation
    6.3.1 Preliminary Evaluation on Rotated MNIST
    6.3.2 Proposed Architectures to Enhance Invariance
    6.3.3 Analysis and Visualization of the Proposed Representations
  6.4 Conclusions and Future Work

VII Open Set Recognition of Stealth Malware
  7.1 Introduction
  7.2 Pattern-Based Stealth Malware Countermeasures
    7.2.1 Signature Analysis
    7.2.2 Behavioral/Heuristic Analysis
    7.2.3 Feature Space Models vs. State Space Models
  7.3 Toward Adaptive Models for Stealth Malware Recognition
    7.3.1 Six Flawed Assumptions
    7.3.2 An Open Set Recognition Framework
    7.3.3 Open World Archetypes for Stealth Malware Intrusion Recognition
  7.4 Preliminary Case Study on the KDD '99 Dataset
  7.5 Conclusions and Open Issues

VIII Conclusions

REFERENCES


LIST OF FIGURES

FIGURE

3.1 Extreme Value Machine Ψ-models

3.2 Multi-class open set recognition performance on OLETTER

3.3 Open world recognition performance on ImageNet

3.4 Distributions of Ψ-model scale (λ) and shape (κ) parameters over the first 200 classes of the ImageNet training set and over all classes of the Letter training set

4.1 Three approaches to attribute learning (and other multi-task problems)

4.2 CelebA Dataset Bias

4.3 Error Rates on CelebA

4.4 Score distributions

4.5 Attribute distributions across demographics

4.6 Mis-classified by all algorithms

4.7 MOON Errors

4.8 Face Tracer Errors

5.1 Performance degradation from different noise types

5.2 ROC Curves

6.2 Attribute Prediction from Deep Features

6.3 Angle Prediction

6.4 Digit Classification Errors

6.5 PAIR Network Topology

6.6 Scree diagrams for derived LeNet feature spaces

6.7 Heat maps for derived LeNet feature spaces

6.8 Diagonalizations of the data matrices of LeNet

7.1 Feature Space vs. State Space OPCODE classification

7.2 Problems with the closed world assumption

7.3 Intrusion Hierarchies

7.4 Open Set Evaluation


LIST OF TABLES

TABLE

3.1 Extreme vectors retained with batch size

4.1 Error Rates on CelebA

4.2 Accuracies for different test demographics and networks

4.3 Test set subsamplings for CelebA

5.1 Performance of LeNet on noisy MNIST

5.2 Results of EVOLVE Training on MNIST under 60% multinomial noise

6.1 Predicting Pose from Deep Features


CHAPTER I

INTRODUCTION

Machine learning differs from more purely mathematical areas of probability and statistics in several respects, but one of the most noticeable differences is the generally accepted success criterion in the literature. In contrast to statistical approaches, the success of which tends to be measured by how mathematically expressive the model is under distributional assumptions, machine learning models are commonly evaluated based on how well they perform in a predictive capacity on benchmarks that aim to simulate real-world applications. The models that perform the best in benchmark settings are then operationally deployed in the application settings. Model selection is based on the assumption that the empirical benchmark results map well to real performance. In turn, the directions in which applied machine learning theory advances converge around the theory behind the best performing approaches – at least until better ones based on different theory are devised.

This benchmark-driven nature of applied machine learning has yielded tremendous advances in the field as a whole and has brought forth remarkable technological and scientific achievements. However, it also has certain drawbacks, because researchers tend to lose sight of reality in favor of the benchmark. Once performance has saturated for one benchmark, researchers develop a new benchmark that is more difficult but reflects the same idealized modeling assumptions as its predecessor. Benchmarks are simplifications of reality by design – in some cases extreme over-simplifications – and the benchmark-driven nature of applied machine learning research has led some aspects of reality to be consistently under-addressed, both in benchmarks and in the development of objective functions.

For example, the closed set assumption – that query samples belong to regions of known support in hypothesis space – is unfounded for many real-world applications. Training sets consist of only several out of nearly infinitely many classes of samples, and the represented classes are heavily sub-sampled. This is referred to as the open set problem in the literature [170]. While many classifiers are optimized for an objective that minimizes confusion of classes in the training set – the empirical risk of misclassification – there is no constraint in their objective functions to avoid mis-labeling unsupported space – the open space risk. This stems from a discrepancy between idealized benchmarks and reality. Most benchmark test sets contain samples only from classes that are also in the training set, so there is no penalty for ascribing unknown hypothesis space to a known class. In real deployment settings, like object recognition, malware recognition, or face recognition, however, there will be queries from unsupported regions that should be rejected as “unknown”.

Another example of a problem in which standard benchmark and objective function assumptions seldom match reality is the multi-label attribute classification problem. When describing an object in terms of semantic attributes, approaches in the literature (e.g., [112, 96]) typically assume that attributes are independent of one another and that distributions of attributes and corresponding frequency bias between training and query sets will be identical. In reality, however, the independent and identical distribution assumptions are invalid. Rather, attribute labels are correlated, and frequencies vary between training and query sets. These assumptions are also related in how attributes should be treated: rather than optimizing over each attribute independently during training, a more realistic approach is to optimize a representation over multiple attributes simultaneously. Multi-objective optimization, however, requires per-attribute compensation for frequency differences between training and operational test sets. This domain adaptation requirement might occur when training face attribute classifiers on a dataset consisting of celebrities who are young and attractive and not chubby for use on the older, less attractive, and chubbier general public.

Related to the multi-label domain adaptation problem discussed in the previous paragraph is the noisy label problem. Objective functions in classical machine learning generally treat labels as a pure and correct description of the content of a sample, penalizing deviations of the model's hypothesis from each label by the same loss function. In reality, however, labels do not convey complete information about a sample; they are noisy, subjective, and in rare cases almost entirely incorrect. This problem has gotten worse – not better – as dataset sizes have increased, since the big data trend has led to widespread use of semi-supervised labeling methods. Classical objective functions are ill-prepared to deal with the noisy label problem, since they assess all loss contributions equally rather than considering loss contributions in proportion to label confidence.

We could name additional examples, but these three problems:

1. the Open Set Problem,

2. the Multi-Objective Domain Adaptation Problem, and

3. the Noisy Label Problem

arise because classical formulations do not enforce realistic constraints in their optimization objective functions. In this dissertation, we enforce these constraints by augmenting the respective objective functions with key terms that draw distributional information from input samples, labels, and feature space representations. Hence, the title of this work: Better Learning through Improved Distributional Modeling. The remainder of this dissertation is structured as follows:

Chapter 2 of this work discusses the relationships between the three problems, associated distributional assumptions, and related work. Solutions to these three problems are presented in Chapters 3, 4, and 5.

To incorporate distributional knowledge of the support of known data into a classifier that addresses the Open Set Problem, we introduce the Extreme Value Machine (EVM) in Chapter 3. While other work in the field of Open Set Recognition has been generally well-received, these approaches have attempted to model data support in terms of calibrated distances from a decision boundary, which introduces implicit limitations in distributional modeling from the rigid parameterization of the decision boundary (e.g., hyperplanar or hyperspherical parameterizations). The EVM, by contrast, yields a much richer model of the probability of inclusion by assigning a theoretically grounded extreme value distribution to every point in the dataset and distilling the model to a set of extreme vectors – a non-redundant subset of these (point, distribution) pairs. This compact but highly accurate representation of the support of the training data allows the EVM to be efficiently and incrementally updated with data from novel classes or novel types of data from existing classes, which means that the EVM addresses not only the Open Set Problem, but also the Open World Problem [15], in which a classifier must learn novel labels incrementally in the presence of unknown data. The experiments in this chapter demonstrate that the EVM advances the state of the art in Open World Recognition.

(20)

5

We address the Multi-Objective Domain Adaptation Problem in Chapter 4 by correcting for the source/target distributional discrepancy in the loss layer of a multi-label neural network. This inherently leads to a corrective re-weighting of backpropagated gradients during network training. The incorporation of source/target distributional information can be viewed as mixing the objective of domain balancing with the multi-label objectives, leading to a Mixed Objective Optimization Network (MOON) architecture. While this architecture and its derivatives have advanced the state of the art on the world's largest publicly available facial attribute recognition dataset, CelebA [112], we believe that it can be extended to many similar problems.

We address the Noisy Label Problem in Chapter 5 by combining concepts from Chapters 3 and 4. Our approach, Extreme Value Objective Validity Evolution (EVOLVE), consists of iteratively training a neural network via backpropagation, extracting scores across the training set, using these extracted scores to fit distributions that reflect confidence of label validity, and renormalizing the losses associated with samples in the training set by these distributions in the next round of backpropagation. Probability of label validity is assessed via distributions derived from statistical Extreme Value Theory over consistent network hypotheses, and the network update is conducted via MOON-like re-weighting.

The next chapter, Chapter 6, discusses a more theoretical, but related, problem: how to force more homogeneous score distributions at arbitrary layers of a network for very heterogeneous inputs of the same class. Invariance in score distributions over arbitrary layers is a desirable characteristic for many applications, particularly commercial biometric systems, which use truncated versions of networks as feature space extractors for enrolling templates of subjects that were not present in the original networks' training sets. Enforcing invariance during optimization, particularly if the invariance holds for novel data, could conceivably yield a more favorable feature space for open set and open world decision machines and novel class discovery, e.g., in active learning problems [65].

The penultimate chapter, Chapter 7, offers novel directions for future research. In particular, this chapter discusses a heretofore untapped application of several of the concepts discussed in this work: open world detection and recognition of stealth malware. As the author will shortly (after the publication of this dissertation) be working as a researcher for a security software and hardware company, this chapter can be read as a forethought and, hopefully, a flash-forward to some very interesting research to come in the computer security space.

(22)

CHAPTER II

BACKGROUND AND RELATED WORK

This dissertation does not focus on any single algorithm, dataset, or problem, but rather on the technique of improving machine learning algorithms for a variety of tasks by explicitly incorporating distributional information into their optimization objectives during training so as to better match real-world assumptions. In each section of this chapter, we describe the assumptions that we address and, where applicable, the distributional information that we incorporate into the objective function in order to address them. We also survey related work.

2.1. The Open Set Problem

Our proposed approach to the open set problem is presented in Chapter 3.

2.1.1 Background, Assumptions, and Distributional Information

The open set problem arises due to the incorrect assumption that all samples seen at query time will come from regions of known support. This is an unrealistic assumption because training sets constitute only a subsampling of the hypothesis space over which we wish to generalize, and not all data that might be present at query time (from nearly infinitely many other classes) can be feasibly represented by parameterizing a classifier over the training set. The realistic assumption that we enforce is that a classifier can only make reasonable inferences about samples in a region of hypothesis space corresponding to learnt support. To enforce this assumption, we introduce the notion of a sample-wise continuous margin distribution, which may vary in bandwidth depending on the region of hypothesis space in which the sample lies. The probability of being an outlier with respect to the margin distribution for a particular sample is then given by the CDF of the margin distribution, and the probability of sample inclusion is given by one minus this CDF – a rejection model. The distributional assumption that we make is that the margin distribution is governed by statistical Extreme Value Theory (EVT), and the optimization objective is then to find the EVT distribution of best fit for each sample. In Chapter 3, we additionally discuss the open world problem, in which the classifier is incrementally updated to recognize novel classes under an open set recognition regime.
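To make the rejection model concrete, the following is a minimal sketch of how such a Weibull-based probability of inclusion could be computed and used to reject unsupported queries. The function names and the threshold are illustrative, per-sample shape and scale parameters are assumed to have been fit already, and the full formulation is developed in Chapter 3.

```python
import numpy as np

def psi(x_i, x_query, kappa, lam):
    """Probability of inclusion of x_query w.r.t. the class of x_i.

    The margin distribution around x_i is modeled as a Weibull with
    shape kappa and scale lam; its CDF gives the probability that a
    point at this distance is an outlier, so inclusion is 1 - CDF.
    """
    dist = np.linalg.norm(np.asarray(x_i) - np.asarray(x_query))
    outlier_prob = 1.0 - np.exp(-((dist / lam) ** kappa))  # Weibull CDF
    return 1.0 - outlier_prob  # = exp(-(dist / lam) ** kappa)

def predict_or_reject(points, labels, kappas, lams, x_query, tau=0.5):
    """Label of the most supportive point, or "unknown" if no fitted
    margin distribution supports the query strongly enough."""
    probs = [psi(p, x_query, k, l)
             for p, k, l in zip(points, kappas, lams)]
    best = int(np.argmax(probs))
    return labels[best] if probs[best] >= tau else "unknown"
```

Here the Weibull CDF plays the role of the outlier probability under the margin distribution, so inclusion falls off smoothly with distance at a bandwidth (the scale parameter) that varies per sample.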

2.1.2 Related Work

With respect to classifiers that mitigate open space risk at classification time, the 1-vs-Set machine [170] approaches the problem of open set recognition by replacing the half-space of a binary linear classifier with two hyperplanes that bound the positive data. An algorithm similar to the 1-vs-Set machine was described by Cevikalp and Triggs [26] for object detection, where a binary classifier with a slab is combined with a nonlinear support vector data description (SVDD) classifier for just the positive class. In later work, Scheirer et al. introduced the Weibull-calibrated SVM (W-SVM) for multi-class open set recognition problems using nonlinear kernels, with provable guarantees of open space risk reduction [161]. These nonlinear models were more accurate, but also more costly to compute and store. For the more expansive problem of open world recognition, Bendale and Boult modified the Nearest Class Mean [117] algorithm by limiting open space risk for model combinations and transformed spaces, resulting in the nearest non-outlier (NNO) algorithm introduced in Sec. 3.1, which we will use as a baseline for comparison in Sec. 3.5.

Other approaches exist for related problems involving unknown class data, such as multi-class novelty detection [20], domain adaptation [61], and zero-shot classification [100]. However, these problems need not be addressed by a classifier that is incrementally updated over time with class-specific feature data. More closely related is the problem of life-long learning, where a classifier receives tasks and is able to adapt its model in order to perform well on new task instances. Pentina and Ben-David [137] lay out a cogent theoretical framework for SVM-based life-long learning tasks, but leave the door open to specific implementations that embody it. Along these lines, Royer and Lampert [152] describe a method for classifier adaptation that is effective when inherent dependencies are present in the test data. This works for fine-grained recognition scenarios, but does not address unknown classes that are well separated in visual appearance from the known and other unknown classes. The problem most related to our work is rare class discovery, for which Haines and Xiang have proposed an active learning method that jointly addresses the tasks of class discovery and classification [65]. We consider their classification algorithm in Sec. 3.5, even though we do not make distinctions between common and rare unknown classes.

There is growing interest in statistical extreme value theory for visual recognition. With the observation that the tails of any distance or similarity score distribution must always follow an EVT distribution [169], highly accurate probabilistic calibration models became possible, leading to strong empirical results for multi-biometric fusion [167], describable visual attributes [165], and visual inspection tasks [58]. EVT models have also been applied to feature point matching, where the Rayleigh distribution was used for efficient guided sampling for homography estimation [54], and the notion of extreme value sample consensus was used in conjunction with RANSAC for similar means [53]. Work in machine learning has shown that EVT is a suitable model for open set recognition problems, where one-sided [79] and two-sided [161, 226] calibration models of decision boundaries lead to better generalization. However, these are post hoc approaches that do not apply EVT at training time.

2.2. The Multi-Objective Domain Adaptation Problem

Our proposed approach to the Multi-Objective Domain Adaptation Problem is discussed in Chapter 4.

2.2.1 Background, Assumptions and Distributional Information

The multi-objective domain adaptation problem occurs when, in multi-objective learning, there is a mismatch between the distributions of the target and the source, and this domain shift is non-uniform over tasks. Simply re-sampling the input data is not an option, because different and contradictory samplings would need to be conducted for each task, resulting in a zero-size training set. Multi-objective domain adaptation is a generic problem in machine learning, solutions to which could benefit many applications. In Chapter 4 we focus particularly on the facial attribute recognition application, in which we would like to leverage the same training set to create classifiers that work well for heterogeneous target demographics. However, our approach with respect to network topology is generic and can be extended to a wide variety of other problems. Our realistic assumptions are that the target demographic distribution will not match the source demographic distribution and that considerable labeled data is available across the source demographic, but only attribute distributional frequency information is available across the target demographic. The distributional information that we introduce is the frequency discrepancy between the source and target distributions for each task, and we incorporate it into a multi-task optimization objective by using it to re-weight the per-task (per-attribute) loss.
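As a rough illustration of the kind of loss re-weighting this implies, the sketch below assumes binary attributes and known per-attribute positive-label frequencies in the source and target; the exact formulation appears in Chapter 4, and the helper names here are illustrative.

```python
import numpy as np

def attribute_weights(y, f_src, f_tgt):
    """Per-sample, per-attribute loss weights that re-balance the
    source label distribution toward the target distribution.

    y:     (N, M) binary attribute labels from the source domain
    f_src: (M,) positive-label frequency in the source training set
    f_tgt: (M,) desired positive-label frequency in the target domain
    """
    w_pos = f_tgt / f_src                  # re-weight positive samples
    w_neg = (1.0 - f_tgt) / (1.0 - f_src)  # re-weight negative samples
    return np.where(y == 1, w_pos, w_neg)  # broadcasts to (N, M)

def mixed_objective_loss(h, y, w):
    """Weighted Euclidean loss: the gradients backpropagated from this
    loss are scaled per task, which is the corrective re-weighting."""
    return float(np.sum(w * (h - y) ** 2))
```

In expectation, each attribute's loss then reflects the target frequency rather than the (possibly very different) source frequency, without re-sampling the shared training set.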

2.2.2 Related Work

Multi-task learning has been applied to several areas that rely on learning fine-grained discriminations or localizations under the constraint of a global correlating structure. In these problems, multiple target labels or objective functions must simultaneously be optimized. In object recognition problems, multiple objects may be present in a training image, and their co-occurrences should be explicitly learnt [207]. In text classification problems, joint inference across all characters in a word yields performance gains over independent classification [78]. In multi-label image tagging/retrieval [212, 75], representations of the contents of an image across modalities (e.g., textual descriptions, voice descriptions) are jointly inferred from the images. The resulting classifiers can then be used to generate descriptions of novel images (tagging) or to query images based on their descriptions (retrieval). Closer to this work, facial model fitting and landmark estimation [32, 19] is another multi-task problem, which requires a fine-grained fit due to tremendous diversity in facial features, poses, lighting conditions, expressions, and many other exogenous factors. Solutions also benefit from global information about the space of face shapes and textures under different conditions. Optimization with respect to local gradients and textures is necessary for a precise fit, while considering the relative locations of all points is important to avoid violating facial topologies.


In Chapter 4, we apply multi-task learning to facial attributes. Applications of facial attributes include searches based on semantically meaningful descriptions (e.g., “Caucasian female with blond hair”) [95, 97, 166], verification systems that explain in a human-comprehensible form why verification succeeded or failed [96], relative relations among attributes [133], social relation/sentiment analysis [229], and demographic profiling. Facial attributes also provide information that is more or less independent of that distilled by conventional recognition algorithms, potentially allowing for the creation of more accurate and robust systems, narrowing down search spaces, and increasing efficiency at match time.

The classification of facial attributes was pioneered by Kumar et al. [96]. Their classifiers depended heavily on face alignment with respect to a frontal template, with each attribute using AdaBoost-learnt combinations of features from hand-picked facial regions (e.g., cheeks, mouth, etc.). The feature spaces were simplistic by today's standards, consisting of various normalizations and aggregations of color spaces and image gradients. Different features were learnt for each attribute, and a single RBF-SVM per attribute was independently trained for classification. Although novel, the approach was cumbersome due to high-dimensional varying-length features for each attribute, leading to inefficiencies in feature extraction and classification [209].

In recent years, approaches have been developed to leverage more sophisticated feature spaces. For example, gated CNNs [86] use cross-correlation across an aligned training set to determine which areas of the face are most relevant to particular attributes. The outputs of an ensemble of CNNs, one trained for each of the relevant regions, are then joined together into a global feature vector. Final classification is performed via independent binary linear SVMs. Zhang et al. [229] use CNNs to learn facial attributes, with the ultimate goal of using these features as part of an intermediate representation for a Siamese network to infer social relations between pairs of identities within an image. Liu et al. [112] use three CNNs – a combination of two localization networks (LNets) and an attribute recognition network (ANet) – to first localize faces and then classify facial attributes in the wild. The localization networks propose locations of face images, while the attribute network is trained on face identities and attributes and is used to extract features, which are fed to independent linear SVMs for final attribute classification. Prior to our approach in Chapter 4, theirs achieved state-of-the-art performance on the CelebA dataset, and it serves as a basis of comparison. In contrast to our approach, Liu et al. and many other recent works do not directly use attribute data in learning a feature space representation, but instead use truncated networks trained for other tasks. While research suggests that coarse-grained attribute data (e.g., image-level) can be indirectly embedded into the hidden layers of large-scale identification networks [45], the efficacy of this approach has not been well studied for inferring fine-grained (e.g., facial) attribute representations, and findings from [231] suggest that optimal implicit representations reside across different layers depending on the attribute.

Surprisingly, multi-task learning has not been widely applied to the problem of facial attribute recognition. Only very recently has it been addressed: Ehrlich et al. [43] developed a Multi-Task Restricted Boltzmann Machine (MT-RBM). In terms of joint inference for facial attributes, it is the first approach we could find in the literature, but it deviates radically from DCNN approaches in many other respects as well: the MT-RBM is generative and non-convolutional, and it is unclear what contributed most to the improvement over [112].

While there has been significant prior work in visual domain adaptation [135], including more recent work for CNNs [194], the main problem that we address in this chapter – incorporating domain adaptation into the training procedure for multi-objective attribute classifiers – has heretofore not been addressed, neither in DCNN multi-task learning nor in facial attribute research. For facial attributes in particular, we contend that domain adaptation is essential when building classifiers fit to chosen target demographics. Recently, Wang et al. [202] demonstrated that, even throughout New York City, a relatively compact geographic region, differences in demographic profile are so prominent as a function of geolocation that binned geolocation can be used to derive a powerful unsupervised facial attribute feature space representation. In order to leverage the attribute data we have for training demographic-specific classifiers, domain adaptation during training is vital to provide a balanced representation and mitigate problems from an over-correlated representation [82].

2.3. The Noisy Training Set Problem

Our proposed approach to the Noisy Training Set Problem is presented in Chapter 5.

2.3.1 Background, Assumptions, and Distributional Information

Much of the machine learning literature implicitly makes the naive and idealistic assumption that training labels constitute a pure and correct description of training sample content, because it does not call the veracity of training labels into question. In reality, this assumption rarely holds. Instead, we approach the problem by assuming that:

• Labels will be noisy, but informative. This means that for the classification of hard categories, individual labels can be completely incorrect, but the plurality of the labels for each class of training samples will be correct.

• The probability that a label is valid depends on its consistency with a general learnt concept for the majority of the data (i.e., the non-outliers).


• A general concept can be learnt from noisy data, but to get the details correct, incorrectly labeled samples should be down-weighted as evidence in proportion to their expected validity. In terms of the classical bias-variance tradeoff [18], the bias of the dataset should be learnt first, then the variance.

To incorporate these assumptions into a deep neural network's objective function, once some training has been conducted, we use the learnt representation to assess the probability of label validity by fitting EVT distributions over the network outputs for hypotheses that agree with their respective labels. We then evaluate the resultant distributions across the entire training set, updating the weight on each sample's gradient contribution to the next network update. This process is repeated at each epoch.
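A minimal sketch of one such epoch-level re-weighting step appears below, using SciPy's Weibull fit as a stand-in for the EVT fitting procedure detailed in Chapter 5 (and, for brevity, fitting over all label-consistent scores rather than only their extrema); the names are illustrative.

```python
import numpy as np
from scipy.stats import weibull_min

def label_validity_weights(scores, labels, hypotheses):
    """Per-sample gradient weights for the next training epoch.

    scores:     (N,) network score for each sample's labeled class
    labels:     (N,) training labels (possibly noisy)
    hypotheses: (N,) current argmax predictions of the network
    """
    weights = np.ones_like(scores, dtype=float)
    for c in np.unique(labels):
        mask = labels == c
        agree = mask & (hypotheses == labels)  # label-consistent samples
        if agree.sum() < 2:
            continue  # too few consistent samples to fit a distribution
        shape, loc, scale = weibull_min.fit(scores[agree], floc=0.0)
        # Samples scoring far below the consistent cohort get a low CDF
        # value, down-weighting their gradients as likely mislabeled.
        weights[mask] = weibull_min.cdf(scores[mask], shape,
                                        loc=loc, scale=scale)
    return weights
```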

2.3.2 Related Work

Noisy training data is a fundamental problem in machine learning because as models become more expressive, which is necessary to model complicated nonlinear decision boundaries in high dimensions, they also become more prone to overfitting. A popular way to deal with this is through some form of regularization [18], which penalizes the complexity of the model's parameter space in the optimization objective. In analogy to Occam's Razor, regularization aims to find the simplest decision surface that performs reasonably well for the task at hand, with just enough nonlinearity – but no more than necessary. For example, a standard Euclidean loss function over an N-sample, M-label-per-sample dataset,

$$J(X,Y;\theta) = \sum_{j=1}^{N} \sum_{i=1}^{M} \big( h_i(X_j;\theta) - Y_{ji} \big)^2, \qquad (2.1)$$

becomes

$$J(X,Y;\theta) = \sum_{j=1}^{N} \sum_{i=1}^{M} \big( h_i(X_j;\theta) - Y_{ji} \big)^2 + \upsilon \|\theta\|_p, \qquad (2.2)$$

under an $L_p$ regularization on the parameter space, where $\|\cdot\|_p$ is the $L_p$ norm and $\upsilon$ is a constant hyperparameter.¹ When $p = 1$, i.e., $L_1$ norm regularization, this is commonly referred to as Least Absolute Shrinkage and Selection Operator (LASSO) regression, and tends to enforce a sparse representation in which a good portion of the elements of $\theta$ become very close to zero [127]. Under $L_2$ regularization (a.k.a. Tikhonov regularization), (2.2) is commonly referred to as ridge regression. This type of regularization is used by default in the solvers of most deep learning frameworks (in Caffe solver files it corresponds to the “weight_decay” hyperparameter).

¹In much of the literature, $\upsilon$ is commonly written as $\lambda$. We chose this notation to disambiguate it from the Weibull scale parameter, which makes an appearance later in this dissertation.
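For concreteness, (2.2) amounts to the following small computation, with $p = 1$ encouraging sparsity and $p = 2$ giving ridge regression; the function name is illustrative:

```python
import numpy as np

def regularized_loss(h, Y, theta, upsilon, p):
    """Eq. (2.2): Euclidean data term plus an L_p penalty on theta.
    p=1 encourages sparsity (LASSO); p=2 is ridge / weight decay."""
    data_term = np.sum((h - Y) ** 2)
    penalty = upsilon * np.linalg.norm(np.ravel(theta), ord=p)
    return float(data_term + penalty)
```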

The effectiveness of regularization techniques is widely credited as one of the reasons that deep neural networks have dominated much of the machine learning space in recent years [200] – because without good regularization it was very hard to fit such models. While regularization helps in obtaining a decision surface of sufficient complexity with which to fit noisy or subsampled data, samples in the training set that are simply mislabeled will still be detrimental to the solution. Viewing regularization as a smoother, the regularized solution may in some cases yield an even worse decision boundary than a non-regularized solution, because what would otherwise be a spike or a point mass in a nonlinear decision boundary could instead be smoothed to mis-label a much larger span of space.

For such samples, it would be better to either remove them entirely from the training set or significantly down-weight their contribution in optimizing the classifier. The study of detecting and removing mislabeled training samples is not new; it has its origins in the removal of high-residual outliers in linear regression [208, 18]. Much of the fundamental work on classifying the correctness of sample labels was conducted by Brodley and Friedl [21], who used ensembles of classifiers trained across separate folds of the training set and compared the classifier outputs to the given labels. To ascertain the correctness of a labeling, they used ensemble majority voting and ensemble consensus, and then pruned mislabeled samples. The two strategies – voting and consensus – yield different precision and recall characteristics, since consensus enforces a more stringent criterion to reject a sample as improperly labeled. Both strategies yielded improved classification performance on several noisy datasets. However, the classifiers used in this work were very simple – k nearest neighbor, decision trees, and linear discriminants – all shallow methods with no latent variables. Moreover, a fundamental limitation of this method is that it assumes that labeling errors are independent and random. Similar methods which do not remove data, but improve classification performance under mislabeling, include bagging, boosting, and ensemble prediction [18, 90].
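A minimal sketch of the voting and consensus filters follows, using cross-validated predictions from the same classifier families that Brodley and Friedl considered; this scikit-learn-based implementation is our illustration, not their original code.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def mislabel_flags(X, y, rule="vote"):
    """Flag samples whose labels the cross-validated ensemble disputes.

    rule="vote":      a majority of classifiers disagree with the label.
    rule="consensus": all classifiers disagree (more stringent, so
                      fewer samples are pruned, at higher precision).
    """
    models = [KNeighborsClassifier(),
              DecisionTreeClassifier(),
              LinearDiscriminantAnalysis()]
    preds = np.stack([cross_val_predict(m, X, y, cv=5) for m in models])
    disagree = preds != y  # (num_models, N) boolean
    if rule == "consensus":
        return disagree.all(axis=0)
    return disagree.sum(axis=0) > len(models) / 2

# Pruning: keep = ~mislabel_flags(X, y); X_clean, y_clean = X[keep], y[keep]
```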

Surprisingly, little follow-up work on the noisy training set problem was published over the first decade of the 21st century – largely, we suspect, due to the strong notion of a dichotomy between “supervised” and “unsupervised” learning, and the fact that public datasets were small enough that they could be reasonably hand-labeled. Once deep neural networks started to advance the state of the art for many problems, which occurred largely as a result of GPU-based frameworks like Caffe [84] and TensorFlow [1], much of the research community opted for larger datasets that could only be collected by unsupervised or semi-supervised means, e.g., by mining online tagged video [3] and photo [29] search engine databases. While acquiring web-scale data using semi-supervised methods allowed researchers to advance the state of the art [29], some researchers, the author of this dissertation included, noticed that these datasets were rife with errors. Consequently, from 2014 until now (2017), several authors have re-visited the noisy training set problem – both directly and indirectly.

Recent work of ours [9] used training set pruning, on the hypothesis that it would improve classification accuracy, but the approach was largely heuristic and oriented toward a fairly niche application: detection of sensitive text for data leak prevention systems. Several approaches have proposed using label confidences to alter deep neural network training algorithms, including [146] and [185]. However, both of these works aim to model noise distributions, requiring at least some very confident labels and placing tremendous assumptions on the type of noise. In particular, Sukhbaatar and Fergus [185] use a domain adaptation-like approach, in which they add a new layer prior to the loss that changes the “probabilities” output from softmax into a distribution that better matches the noise of the label space. This approach may be viable assuming particular types of systematic error; however, it cannot generalize well to an unknown noise distribution. Reed et al. [146] also introduce an approach that aims to estimate confidence of the labels, by jointly optimizing an autoencoder to reproduce the raw input and the noisy labels simultaneously; this approach does not require modeling noise. In [122], Mnih and Hinton introduce an approach to mitigate errors from omission noise² and registration noise³ in remote sensing images with noisy map data as ground truth. Their approach is similar to ours in its use of a re-weighted objective function, but unlike ours, it uses an assumed noise model.

In particular manifestations of the noisy label problem, e.g., in which there is good reason to assume that noise can be attributed to systematic subjective biases [121], or in which training labels are simply incomplete, e.g., in image-level labeling of objects for a classifier whose objective is to detect and localize the objects [107], modeling of the noise distributions may be sensible via a curriculum-like [29] or domain-adaptive [121, 107] approach. In many cases, however, barring exhaustive manual inspection, modeling noise is problematic, because labeling errors may occur for many reasons and defining a probability of mislabeling based on mislabeled data is a difficult task. It is also statistically ungrounded to model a meaningful probability over an unknown generating stochastic process. Instead, we approach the problem by re-invoking statistical Extreme Value Theory from Chapter 3 and fitting an EVT probability distribution, at each training epoch, only over extrema of samples with labels that agree with the network hypotheses. Using this probability, we can then re-weight updates to the classifier according to the probability of inclusion for each class in question under a MOON-like topology (cf. Sec. 4.2). This approach is formalized in Sec. 5.2.

²The object is present in the image but not in the map.

³The object is in both the map and the image, but local pixel-wise alignment is inaccurate.

2.4. Towards an Invariant Feature Space

Our proposed approaches to obtain invariant feature space representations are presented in Chapter 6. This chapter also studies the extent to which feature spaces are invariant across different face biometric recognition tasks.

2.4.1 Background, Assumptions, and Distributional Information

This chapter differs in structure from Chapters 3-5, insofar as it is largely an exploratory study of the extent and effects of invariance in deep feature spaces with respect to cross-task application, for example using outputs from the intermediate layers of a face identity recognition network to perform facial attribute recognition (a topic discussed in Chapter 4). However, we also propose incorporating distributional information into a deep network's objective function with a different motivation than in Chapters 3-5: rather than modifying the objective function in order to reflect real-world assumptions, we explore whether we can force a deep network's representation to generate outputs that have particular distributional properties, e.g., decreased variance for the outputs of a given class. In other words, by modifying the network objective function we would like to endow deep neural networks with a pre-assumed property while maintaining comparable or better classification performance. An invariant feature space representation with respect to class, irrespective of whether the sample (or class) is present in the original training set, could, for example, yield favorable performance characteristics, both for canonical open set problems and for template enrollment problems commonly seen in biometrics – where a trained network serves as a feature extractor to create gallery templates (during the enrollment stage) with which to compare probe templates (during the query stage). For this reason, the exploratory study in Chapter 6 consists of two parts. The first part examines invariance characteristics of intermediate feature space representations from trained face recognition networks, while the second part focuses on forcing the intermediate feature space representations of a network to have invariant characteristics.
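As a sketch of what forcing such a property could look like, one could penalize the within-class variance of an intermediate layer's outputs alongside the usual classification loss. This illustrates the general idea only; the concrete architectures are proposed in Chapter 6.

```python
import numpy as np

def within_class_variance(features, labels):
    """Mean within-class variance of an intermediate layer's outputs.
    Adding upsilon * this term to the classification loss pressures
    the layer toward class-invariant (low-spread) representations."""
    classes = np.unique(labels)
    total = 0.0
    for c in classes:
        f_c = features[labels == c]  # (n_c, D) features of class c
        total += np.mean((f_c - f_c.mean(axis=0)) ** 2)
    return total / len(classes)

# total_loss = classification_loss + upsilon * within_class_variance(F, y)
```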

2.4.2 Related Work

Template representations, in which samples are stored as feature vectors for rapid comparison to all other samples, are common for problems that require accurate representations with inexpensive update mechanisms. For example, modern face biometric systems use large face datasets to train a network as a feature space representation; then, during enrollment, they extract and sometimes aggregate features for gallery faces. At query time, features extracted from probe images are compared with the gallery features and a similarity score is assigned, often as a normalized inverse distance measure [134, 189]. This deep-feature driven approach has proven quite effective; for example, Taigman et al. [189] were able to reach human-level performance on the once thought difficult Labeled Faces in the Wild (LFW) benchmark [74], and system performances have improved further since. Crucially, the network is never run in an end-to-end manner, but is instead truncated one or more layers before the output. The underlying intuition is that, given enough face biometric training data, a discriminative feature space representation can be learnt even for novel face identities; this assumption allows for the enrollment of novel identities in the gallery with little overhead.

Taking the intuition a step further, researchers have developed methods which extend deep features across tasks. For example, the facial attribute extraction approach by Liu et al. [112], with which we compare in Chapter 4, uses a network trained on face identity data as a feature space representation for facial attributes. Moving beyond faces, the use of deep learnt representations across visual tasks, including the aforementioned face biometric systems, is widespread and can be traced back to the seminal work of Donahue et al. [41], in which the authors used a truncated version of AlexNet [93] to arrive at a Deep Convolutional Activation Feature (DeCAF) for generic visual recognition.

Use of generic features is compelling for several reasons, but unlike the outputs of end-to-end networks, which are optimized to be invariant to exogenous input-level variations, deep feature space representations at earlier layers have no such explicit constraint. Template enrollment implicitly assumes a degree of invariance – i.e., that non-training images of similar characteristics will at least lie in similarly discriminable regions as the remainder of the training set – but research conducted to remove or adapt variations in features that transfer across domains or tasks [106, 145, 194] has shown that the assumption of an invariant feature space does not hold.

While invariance seems desirable for template representations, for a generic cross-task feature space, invariance is not necessarily a favorable property. For example, Liu et al. [112] were able to classify the Smiling attribute using a feature space derived from face identities, which suggests that despite the optimization of the output layer, the derived feature space was not entirely invariant to facial expression.

There has been surprisingly little published research devoted to examining relationships between input-level variations and feature-level variations, or the feasibility of enforcing feature-space-level invariance. The closest work that we could find, conducted by Parde et al. [132], investigates how well properties of deep features can predict non-identity-related image properties. They examined images from the IJB-A dataset [91] and concluded that “DCNN features contain surprisingly accurate information about yaw and pitch of a face.” Additionally, they revealed that it is not possible to determine individual elements of the deep feature vector that contain the pose information, but that pose is encoded in a different set of deep features for each identity. Our work builds on theirs by investigating pose issues in much greater detail, exploring even more generic invariance for attribute-related information content across identity-trained deep representations, and proposing novel topologies that aim to enforce invariance.

2.5. Open Set Recognition of Stealth Malware

Applying open set recognition to the unique challenges of stealth malware recognition is discussed in Chapter 7. This chapter is applications-oriented, so no new algorithms are introduced.

2.5.1 Background, Assumptions, and Distributional Information

Machine learning offers tremendous potential to improve anti-malware applications, particularly with respect to recognizing stealth malware, but its application has been limited due to six flawed assumptions: 1.) That intrusions are closed set. 2.) That anomalies imply class labels. 3.) That static models are sufficient. 4.) That no feature space transformation is required. 5.) That model interpretation is optional, and 6.) That class distributions are Gaussian.

We discuss these flawed assumptions in detail in Chapter 7. We hypothesize that they predominantly stem from the fact that intrusion recognition has historically been approached from the paradigm of closed world machine-learnt classifiers, when it should be approached as an open world problem, as presented in Chapter 3. By re-framing the problem in this context (cf. Chapter 7), we illustrate theoretically how the six flawed assumptions can be obviated, and offer experimental support by evaluating a canonical network intrusion dataset under an open set protocol.
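To make the protocol concrete, the following sketch shows the skeleton of such an open set evaluation: some classes are withheld from training to simulate novel attacks, and low-confidence predictions are rejected as unknown. The helper names are illustrative; the actual protocol is specified in Chapter 7.

```python
import numpy as np

def open_set_split(X, y, unknown_classes):
    """Withhold selected intrusion classes from training entirely;
    at test time they simulate novel attack types that must be
    rejected rather than forced into a known class."""
    known = ~np.isin(y, unknown_classes)
    return (X[known], y[known]), (X[~known], y[~known])

def open_set_predict(pred_labels, pred_scores, threshold, unknown=-1):
    """Thresholded rejection: low-confidence queries become 'unknown'."""
    return np.where(pred_scores >= threshold, pred_labels, unknown)
```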

2.5.2 Related Work

Machine learning has been commercially applied in anti-malware systems since the early 1990s [188], and many academic papers have been written on the subject. A few of the more prominent examples include [38, 150, 142, 105, 71, 56], though we present a far more comprehensive discussion in Chapter 7. Unfortunately, these works, with some exceptions (e.g., [153]), have an uncanny pattern of applying off-the-shelf classical machine learning algorithms and reporting results on pre-canned, outdated datasets [191] like KDD CUP '99, which is now nearly 20 years old.

Addressing the dataset problem is a challenge that the academic community has not met well. Though there have been some attempts, e.g., [36, 35, 34, 126], these collections are too small to reflect realistic usage environments. Industrial and government datasets have been kept largely proprietary, although some commercial data collection APIs [192] and industrial data sharing efforts [160] have begun to reach the academic space.


Datasets aside, however, there are tremendous disconnects between industry and academia that have prevented widespread use of machine learning in practical intrusion detection and recognition systems. Several of these disconnects were outlined by Sommer and Paxson [181]. We further deduced that these disconnects are due to six flawed assumptions, which we outlined in a comprehensive survey on stealth malware [153]. These assumptions are discussed in detail in Sec. 7.3.1.

As we identify in our survey [153] and in Chapter 7, many of these flawed assumptions are eliminated if we simply examine the problem through an open set or open world recognition paradigm (cf. Chapter 3), which deals more gracefully with previously unseen attack types and novel attacks – marking them as “unknown” and prioritizing an order with which to update the classifier. In terms of open set intrusion recognition, empirical risk minimization corresponds to minimizing the risk of confusing known signatures, while open space risk minimization [163, 162] corresponds to minimizing the risk of matching signatures from novel exploits to types of either intrusive or normal behavior not present in the (closed) training set.


CHAPTER III

ADDRESSING THE OPEN SET PROBLEM

It is often desirable to be able to recognize when inputs to a recognition function learned in a supervised manner correspond to classes unseen at training time. With this ability, new class labels could be assigned to these inputs by a human operator, allowing them to be incorporated into the recognition function — ideally under an efficient incremental update mechanism. While good algorithms that assume inputs from a fixed set of classes exist, e.g., artificial neural networks and kernel machines, it is not immediately obvious how to extend them to perform incremental learning in the presence of unknown query classes. Existing algorithms take little to no distributional information into account when learning recognition functions and lack a strong theoretical foundation. We address this gap by formulating a novel, theoretically sound classifier — the Extreme Value Machine (EVM). The EVM has a well-grounded interpretation derived from statistical Extreme Value Theory (EVT), and is the first classifier to be able to perform nonlinear kernel-free variable bandwidth incremental learning. Compared to other classifiers in the same deep network derived feature space, the EVM is accurate and efficient on an established benchmark partition of the ImageNet dataset.


3.1. Introduction

Recognition problems which evolve over time require classifiers that can incorporate novel classes of data. What are the ways to approach this problem? One is to periodically retrain classifiers. However, in situations that are time or resource constrained, periodic retraining is impractical. Another possibility is an online classifier that incorporates an efficient incremental update mechanism. While methods have been proposed to solve the incremental learning problem, they are computationally expensive [33, 216, 102, 109], or provide little to no characterization of the statistical distribution of the data [15, 87, 116, 149]. The former trait is problematic because it is contrary to the fundamental motivation for using incremental learning — that of an efficient update system — while the latter trait places limitations on the quality of inference.

There is also a more fundamental problem in current incremental learning strategies. When the recognition system encounters a novel class, that class should be incorporated into the learning process at subsequent increments. But in order to do so, the recognition system needs to identify novel classes in the first place. For this type of open set problem in which unknown classes appear at query time, we cannot rely on a closed set classifier, even if it supports incremental learning, because it implicitly assumes that all query data is well represented by the training set.

Closed set classifiers have been developed for approximating the Bayesian optimal posterior probability, $P(C_l \mid x'; C_1, C_2, \ldots, C_M)$, $l \in \{1, \ldots, M\}$, for a fixed set of classes, where $x'$ is an input sample, $l$ is the index of class $C_l$ (a particular known class), and $M$ is the number of known classes. When $\Omega$ unknown classes appear at query time, however, the Bayesian optimal posterior becomes $P(C_{\tilde{l}} \mid x'; C_1, C_2, \ldots, C_M, U_{M+1}, \ldots, U_{M+\Omega})$, $\tilde{l} \in \{1, \ldots, M+\Omega\}$, where $U_{M+1}, \ldots, U_{M+\Omega}$ are unknown. Making closed set assumptions in training leads to regions of unbounded support for an open set problem because a sample $x'$ from an unknown class $U_{\tilde{l}}$ will be misclassified as a known class $C_l$. For classifiers that assess confidence in terms of signed distance from a decision boundary, or some calibration thereof, this misclassification will occur with high confidence if $x'$ is far from any known data — a result that is very misleading. Scheirer et al. [170] termed this problem open space risk.
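As a minimal synthetic illustration of this failure mode (not drawn from any experiment in this dissertation), consider a linear classifier trained on two known classes and queried far from both; scikit-learn's logistic regression stands in here for any classifier whose confidence is a calibration of signed distance from a decision boundary.

```python
# Minimal synthetic illustration: a closed set classifier queried far from
# all training data. Logistic regression stands in for any classifier whose
# confidence is a calibration of signed distance from a decision boundary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
known_a = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2))
known_b = rng.normal(loc=[+2.0, 0.0], scale=0.5, size=(100, 2))
X = np.vstack([known_a, known_b])
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression().fit(X, y)
far_query = np.array([[50.0, 50.0]])      # nowhere near either known class
print(clf.predict_proba(far_query))       # ~[[0., 1.]]: near-certain label
# The farther the query drifts from the data, the *more* confident the
# closed set model becomes, which is exactly the open space risk problem.
```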

More formally, let $f$ be a measurable recognition function for known class $C$, let $O$ be the open space, and let $S_o$ be a ball of radius $r_o$ that includes all of the known positive training examples $x \in C$ as well as the open space $O$. Open space risk $R_O(f)$ for class $C$ can be defined as
$$R_O(f) = \frac{\int_O f_C(x)\,dx}{\int_{S_o} f_C(x)\,dx},$$
where open space risk is considered to be the relative measure of positively labeled open space compared to the overall measure of positively labeled space. In this probabilistic formulation, the objective of open set recognition is to exercise a rejection option [18] for queries that lie beyond the reasonable support of known data, thus mitigating this risk.
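The ratio can be estimated numerically. The following is a minimal Monte Carlo sketch under illustrative assumptions: $f_C$ is a toy binary positive-labeling function, $S_o$ is centered at the origin, and open space is taken to be everything farther than a fixed margin from the positive training data; none of these choices come from the dissertation.

```python
# Monte Carlo sketch of the open space risk ratio R_O(f) defined above,
# under illustrative assumptions: f_C is a binary "labeled positive"
# indicator, S_o is a ball of radius r_o about the origin, and the open
# space O is the part of S_o farther than `margin` from every positive
# training example.
import numpy as np

rng = np.random.default_rng(0)
train_pos = rng.normal(size=(50, 2))      # known positive examples
r_o, margin = 20.0, 3.0

def f_C(x):
    # Toy classifier that labels an overly large region positive.
    return np.linalg.norm(x, axis=1) < 10.0

# Uniform samples inside the ball S_o via rejection from a bounding box.
pts = rng.uniform(-r_o, r_o, size=(50_000, 2))
pts = pts[np.linalg.norm(pts, axis=1) < r_o]

# A point is in open space if it is far from all positive training data.
d_min = np.linalg.norm(pts[:, None, :] - train_pos[None, :, :], axis=2).min(axis=1)
in_open_space = d_min > margin

pos = f_C(pts)
R_O = pos[in_open_space].sum() / pos.sum()    # ratio of the two integrals
print(f"estimated open space risk: {R_O:.2f}")
```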

Open set recognition [170, 161] and, more generally, novelty/outlier detection [114, 70] are well-established areas in their own right, but much less research has been conducted on how to treat unknown samples in an incremental context, which is the focus of this chapter. When an open set recognition system labels a sample as unknown, it suggests that the model was not trained with data from the class corresponding to that sample. In response, the classifier’s decision boundaries should be updated so that the system can incorporate this new class information for future decision making. But there is a caveat: full retraining is not always feasible, depending on timing constraints and the availability of computational resources.


Recent work [15] extended the open set recognition problem to include the incremental learning of new classes in a regime dubbed open world recognition, a problem with which we are also concerned in this chapter. An effective open world recognition system must perform four tasks: detecting unknowns, choosing which points to label for addition to the model, labeling the points, and updating the model. An algorithm, nearest non-outlier (NNO), was proposed as a demonstration of these elements — the first of its kind. Unfortunately, NNO lacks strong theoretical grounding. The algorithm uses thresholded distances from the nearest class mean as its decision function, and otherwise ignores distributional information. Weak classifiers are a persistent problem for this task: it is not immediately obvious how one might extend class boundary models from classical machine learning theory (e.g., neural networks and kernel machines) to incorporate both incremental learning and open set constraints. A new formulation is required.
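The following sketch captures the flavor of such an NNO-style decision rule: thresholded distance to the nearest class mean, with rejection of anything beyond the threshold. The published algorithm [15] operates in a metric-learned feature space with calibrated scores; the threshold and data below are purely illustrative.

```python
# Sketch of an NNO-style decision rule: thresholded distance to the nearest
# class mean, with everything beyond the threshold rejected as "unknown".
# The threshold value and the synthetic data here are illustrative only.
import numpy as np

class NearestNonOutlier:
    def __init__(self, tau):
        self.tau = tau                  # rejection threshold
        self.means = {}                 # class label -> mean vector

    def fit_increment(self, X, label):
        self.means[label] = X.mean(axis=0)   # adding a class is just one mean

    def predict(self, x):
        label, d = min(((c, np.linalg.norm(x - m)) for c, m in self.means.items()),
                       key=lambda t: t[1])
        return label if d <= self.tau else "unknown"

rng = np.random.default_rng(0)
nno = NearestNonOutlier(tau=2.0)
nno.fit_increment(rng.normal(loc=0.0, size=(50, 2)), "A")
nno.fit_increment(rng.normal(loc=5.0, size=(50, 2)), "B")
print(nno.predict(np.array([0.1, -0.2])))   # -> "A"
print(nno.predict(np.array([20.0, 20.0])))  # -> "unknown"
```

Note how the decision function uses only the class means; all other distributional information about each class is discarded, which is precisely the weakness identified above.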

We address the construction of a compact representation of open world decision boundaries based on the distribution of the training data. Obtaining this representation is difficult because training points that do not contribute to a decision boundary at one point in time may be extremely relevant in defining a decision boundary later on, and retraining on all points is infeasible at large scales. Moreover, by the definition of the open world problem, the hypothesis space will be under-sampled, so in many cases linearity of the decision boundaries cannot be guaranteed and the data bandwidth is unknown. So how does one obtain a compact statistical model without discarding potentially relevant points — especially in regions where the data bandwidth is unknown? To this end, we introduce the Extreme Value Machine (EVM), a model which we derive from statistical Extreme Value Theory (EVT).

EVT dictates the functional form for the radial probability of inclusion of a point with respect to the class of another. By selecting the points and distributions that best summarize each class, i.e., those that are least redundant with respect to one another, we arrive at a compact probabilistic representation of each class’s decision boundary, characterized in terms of its extreme vectors (EVs), which provides an abating bound on open space risk. This is depicted in schematic form in Fig. 3.1. When new data arrives, these EVs can be efficiently updated. The EVM is a scalable nonlinear classifier, with radial inclusion functions that are in some respects similar to RBF kernels, but that, unlike RBF kernels, assume variable bandwidths and skew that are data derived and grounded in EVT.

Fig. 3.1 EXTREME VALUE MACHINE Ψ-MODELS. A solution from the proposed EVM algorithm trained on four classes: dots, diamonds, squares, and stars. The colors in the isocontour rings show a Ψ-model (probability of sample inclusion) for each extreme vector (EV) chosen by the algorithm, with red near 1 and blue near .005. Via kernel-free nonlinear modeling, the EVM supports open set recognition and can reject the three “?” inputs that lie beyond the support of the training set as “unknown.” Each Ψ-model has its own independent shape and scale parameters learnt from the data, and supports a soft margin. For example, the Ψ-model for the blue dots corresponding to extreme vector A has a more gradual fall-off, due to the effect of the outlier star during training.

3.2. Theoretical Foundation

As discussed in Sec. 3.1 and as illustrated in Fig. 3.1, each class in the EVM’s training set is represented by a set of extreme vectors, where each vector is associated with a radial inclusion function modeling the Probability of Sample Inclusion (PSI or Ψ). Here we derive the functional form for Ψ from EVT; this functional form is not just a mathematically convenient choice — it is statistically guaranteed to be the limiting distribution of relative proximity between data points under the minor assumptions of continuity and smoothness.
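As a preview of where the derivation lands, the sketch below assumes the Weibull-shaped inclusion function that this formulation yields, with a shape parameter (kappa) and scale parameter (lambda) fit per extreme vector; the parameter values used here are made up for illustration.

```python
# Sketch of a per-EV radial inclusion function, assuming the Weibull-shaped
# form this formulation arrives at; kappa (shape) and lam (scale) would be
# fit per extreme vector, and the values below are made up for illustration.
import numpy as np

def psi(dist, kappa, lam):
    """Probability of sample inclusion as a function of distance to an EV."""
    return np.exp(-(dist / lam) ** kappa)

d = np.linspace(0.0, 5.0, 6)
print(psi(d, kappa=2.5, lam=2.0))   # decays from 1 toward 0 past the margin
# Unlike a fixed-bandwidth RBF kernel, each EV carries its own kappa and
# lam, so bandwidth and skew vary per point, as learnt from the data.
```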

The EVM formulation developed herein stems from the concept of margin distributions. This idea is not new; various definitions and uses of margin distributions have been explored [176, 57, 148, 7, 136], involving techniques such as maximizing the mean or median margin, taking a weighted combination margin, or optimizing the margin mean and variance. Leveraging the margin distribution itself can provide better error bounds than those offered by a soft-margin SVM classifier, which in some cases translates into reduced experimental error. We model Ψ in terms of the distribution of sample half-distances relative to a reference point, extending margin distribution theory from a per-class formulation [176, 57, 148, 7, 136] to a sample-wise formulation. The model is fit on the distribution of margins — half distances to the nearest negative samples — for each positive reference point. From this distribution, we derive a radial inclusion function which carves out a posterior probability of association with respect to the reference point. This radial inclusion function falls toward zero at the margin.

3.2.1 Probability of Sample Inclusion

To formalize the Ψ-model, let $x \in \mathcal{X}$ be training samples in a feature space $\mathcal{X}$. Let $y_i \in \mathcal{C} \subset \mathbb{N}$ be the class label for $x_i \in \mathcal{X}$. Consider, for now, only a single positive instance $x_i$ for some class with label $y_i$. Given $x_i$, the maximum margin distance would be given by half the distance to the closest training sample from a different class. However, the closest point is just one sample, and we should consider the potential maximum margins under different samplings. We define the margin distribution as the distribution of the margin distances of the observed data. Thus, given $x_i$ and $x_j$, where $\forall j,\, y_j \neq y_i$, consider the margin distance to the decision boundary that would be estimated for the pair $(x_i, x_j)$ if $x_j$ were the closest instance. The margin estimates are thus $m_{ij} = \|x_i - x_j\|/2$ for the $\tau$ closest points.¹ Considering these $\tau$ nearest points to the margin, our question then becomes: what is the distributional form of the margin distances?
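A minimal sketch of computing the margin samples just defined, assuming Euclidean distance (footnote 1 permits other distances or divergences) and illustrative synthetic negatives:

```python
# Sketch of the margin samples just defined: for a positive reference point
# x_i, the tau smallest half-distances m_ij = ||x_i - x_j|| / 2 over all
# negatives x_j. Euclidean distance is an illustrative choice (footnote 1).
import numpy as np

def margin_samples(x_i, X_neg, tau):
    dists = np.linalg.norm(X_neg - x_i, axis=1)   # distances to all negatives
    return 0.5 * np.sort(dists)[:tau]             # tau smallest half-distances

rng = np.random.default_rng(0)
x_i = np.zeros(2)                                 # one positive reference point
X_neg = rng.normal(loc=3.0, size=(200, 2))        # synthetic negative samples
print(margin_samples(x_i, X_neg, tau=10))
```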

To estimate this distribution, we turn to the Fisher-Tippett Theorem [51], also known as the Extreme Value Theorem.² Just as the Central Limit Theorem dictates that sums of random variables generated by certain stochastic processes converge to Gaussian distributions, EVT dictates that, given a well-behaved overall distribution of values, e.g., a distribution that is continuous and has an inverse, the distribution of the maximum or minimum values can assume only limited forms. To find the appropriate form, let us first recall:

¹ In this chapter, we deviate from convention and let $\|x_i - x_j\|$ denote an arbitrary distance or divergence between two vectors $x_i$ and $x_j$, including but not limited to a subtractive norm.

² There are other types of extreme value theorems; e.g., the second extreme value theorem, the Pickands-Balkema-de Haan Theorem [139], addresses probabilities conditioned on the process exceeding a sufficiently high threshold.
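To connect the theorem back to the margin samples above, the low tail of the margin distances can be fit with a Weibull, the Fisher-Tippett limiting form for minima of a bounded-below process. The sketch below uses scipy's weibull_min as a stand-in for the EVT fitting the EVM uses; pinning the location parameter at zero (floc=0.0) and the synthetic margins are illustrative choices, not the dissertation's procedure.

```python
# Sketch of fitting the low tail of the margin distances with a Weibull,
# per the Fisher-Tippett limiting form for minima. scipy's weibull_min is a
# stand-in for the EVT fitting the EVM uses; pinning the location at zero
# (floc=0.0) is an illustrative choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for distances to negatives; keep the tau smallest halves.
margins = 0.5 * np.sort(rng.gamma(shape=4.0, scale=1.0, size=2000))[:40]

kappa, loc, lam = stats.weibull_min.fit(margins, floc=0.0)
print(f"shape kappa = {kappa:.2f}, scale lambda = {lam:.2f}")
# Such per-point (kappa, lam) estimates are what would parameterize a
# Psi-model's radial inclusion function for that reference point.
```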
