
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Understanding people movement and detecting anomalies using probabilistic generative models

AGNES HANSSON


Understanding people movement and detecting anomalies using probabilistic generative models

AGNES HANSSON

Degree Projects in Mathematical Statistics (30 ECTS credits)

Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2020

Supervisor at Assa Abloy: Kenneth Pernyér
Supervisor at KTH: Pierre Nyquist

Examiner at KTH: Pierre Nyquist


TRITA-SCI-GRU 2020:388
MAT-E 2020:095

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Understanding people movement and detecting anomalies using probabilistic generative models

Agnes Hansson

Supervisor: Pierre Nyquist
External Project Partner: Assa Abloy

December 2020

Abstract

As intelligent access solutions begin to dominate the world, the statistical learning methods that account for their behaviour need attention, as there is no clear answer to how an algorithm could learn and predict exactly how people move. This project aims at investigating whether, with the help of unsupervised learning methods, it is possible to distinguish anomalies from normal events in an access system, and whether the most probable choice of cylinder to be unlocked by a user can be calculated.

The starting point is a data set of the previous events in an access system, together with the access configurations, and the algorithms that were used consisted of an auto-encoder and a probabilistic generative model.

The auto-encoder successfully encoded the high-dimensional data set into one of significantly lower dimension, and the probabilistic generative model, which was chosen to be a Gaussian mixture model, identified clusters in the data and assigned a measure of unexpectedness to the events.

Lastly, the probabilistic generative model was used to compute the conditional probability that a user, given all the details of an event except which cylinder was chosen, would choose a certain cylinder. The result of this was a correct guess in 65.7 % of the cases, which can be seen as a satisfactory number for something originating from an unsupervised problem.


1 Sammanfattning

As intelligent access solutions take over in society, it is necessary to give the statistical learning methods behind them sufficient attention, since there is no obvious answer to how an algorithm should be able to learn and predict people's exact movement patterns.

The goal of this project is to investigate, with the help of unsupervised learning, whether it is possible to distinguish anomalies from normal observations, and whether the lock cylinder that a user is most likely to try to unlock can be calculated.

The starting point for the project is a data set containing events from an access system, together with the corresponding access configurations. The algorithms used in the project consisted of an auto-encoder and a probabilistic generative model.

The auto-encoder succeeded, with satisfactory results, in encoding the high-dimensional data into data of considerably lower dimension, and the probabilistic generative model, which was chosen to be a Gaussian mixture model, succeeded in identifying clusters in the data and in assigning each observation a measure of its unexpectedness.

Finally, the probabilistic generative model was used to compute a conditional probability for which lock cylinder the user would choose, given all attributes of an event except the cylinder that the user tried to open.

The result of this was a correct guess in 65.7 % of the cases, which can be seen as a satisfactory figure for something originating from an unsupervised problem.


2 Acknowledgements

This Master’s thesis was written in collaboration between Assa Abloy AB and the Division of Mathematical Statistics at KTH Royal Institute of Technology.

Despite the unfortunate circumstances due to COVID-19 during the time that this thesis was carried out, I want to express my sincere gratitude to Kenneth Pernyér, Research Director at ASSA ABLOY, for providing me with ideas and input, and for cheering on when working remotely full-time started to take its toll on all of us. I want to thank Daniel Garmén, Manager Project Leaders at Assa Abloy, who made sure I was equipped with everything needed and that the project was running according to plan. I want to thank Gustav Ryd, from Assa Abloy, for providing me with the data and helping me with technical questions; I wish we could have been at the office so that I could have asked questions more often. The same goes for Anders Sahlström, also from Assa Abloy, who provided me with input on the project, and with whom I would have liked to talk more about the auto-encoders. Lastly, I want to thank Pierre Nyquist at KTH, who provided me with guidance and insights about the project.


Contents

1 Sammanfattning
2 Acknowledgements
3 Introduction
   3.1 Outline of the Report
   3.2 Aims and Objectives
   3.3 Data Description
   3.4 Research Question
4 Theoretical Background
   4.1 Unsupervised Learning
   4.2 Model Inference
      4.2.1 Maximum Likelihood Inference
   4.3 Model Assessment and Selection
   4.4 Dimensionality reduction
   4.5 Neural Networks and Auto-Encoders
      4.5.1 The Auto-Encoder
   4.6 K-means clustering
   4.7 Gaussian Mixture Models
   4.8 Expectation Maximization (EM)
      4.8.1 EM in the GMM setting
5 Methodology
6 Results
7 Discussion
   7.1 Auto-Encoder and GMM Results
   7.2 Further Research
8 References


3 Introduction

The techniques behind door locks and openings have evolved throughout history, providing companies and people with a possibility to secure their properties and homes, in an ever-changing fashion. As humanity exploits new techniques – such as the use of wood, forged metal, electricity, plastic materials and the world wide web – the perception of what a lock and a key is also changes.

Figure 1: A classical versus a modern lock and a key.

In 1939, August Stenman AB (ASSA) manufactured their first door lock in their recently acquired factory in Eskilstuna. Over the years, in parallel with a chain of acquisitions and mergers with other companies, the product range has widened and become increasingly advanced. In 1994, ASSA became Assa Abloy as they merged with the Finnish company Abloy, and the company is now a leader within many areas of the world, including Europe, North America and the Pacific region [1]. The part of the company called ASSA ABLOY Opening Solutions, which works on developing what they call access solutions, has set its own goal for how they wish people to experience the ultimate lock and key solution when they move around in buildings or systems. The seamless opening is the access solution where the user is barely required to make any effort at all, as the access solution converges to some form of an omniscient entity that learns movement patterns and when a certain user approaches.

One of the big challenges to overcome before the seamless opening can become reality is how to learn the movement patterns. Accompanying this rather advanced task is a range of interesting questions, some too complex to answer here yet useful for understanding the motives behind this project, and others of more reasonable proportion to be treated here. The most interesting question, yet the most complicated and with an answer that lies distantly in the future, is: given the necessary data, such as time stamps, key configurations, what the user attempted to do and the results, is it possible to learn how people typically move around a building or a system? Also, is it possible to design an algorithm that predicts the next move of the user so well that he never needs to waste time standing in front of a locked door again? The answers to those two questions are not provided here, but are left for further research. Of more relevance for this project is the latter question: is it also possible to distinguish anomalies in the movement patterns, and what degree of certainty is needed to justify taking action on such suspicious activities? As there is very limited or no labeled data that may tell the distinction between an anomaly and a normal event, it is impossible to classify an event as either of them with complete certainty.

This degree project constitutes one of the first steps towards a long-term goal, where there would also exist a map of the physical locations of the entities of the access system and their attributes, such as in Figure 2. Then, ultimately, the algorithm could tell how to evacuate buildings in cases of emergency, or learn how to shut down buildings in case of a terror attack. As this, it needs to be emphasized, is not yet the reality, this project focuses on some of the more basic statistical learning approaches, no less needed than the more advanced ones, to join the search for a perfect machine learning algorithm.

Figure 2: A simple example of a physical access system (PACS) at Assa Abloy.


3.1 Outline of the Report

The outline of this report has the sequential design background–theory–methods–results, which hopefully allows the reader to become acquainted with the somewhat deep mathematical theory that is needed to comprehend the methods, the solution and the problems that one might encounter. The report begins with Section 3, which contains the background and introduction, including a presentation of the data set, the aims and objectives and the intended research question, followed by a thorough review of the theory needed to execute the investigation and computations in Section 4. Next comes Section 5, where the method is described, including a rough description of the algorithms and the prerequisites of the data. Following this come the last sections, 6 and 7, with the numerical results, a discussion of these and a conclusion that summarizes the findings and the knowledge obtained here.

3.2 Aims and Objectives

The aim of this project is to use the theory and methods in the field of supervised and unsupervised learning to unravel and investigate the structure and underlying mechanisms of a given set of unlabeled data. The data consists of logs from an access system, and it will remain unknown whether this system consists of doors inside a building, or whether they lie separate with hundreds of kilometers in between them. Furthermore, the objective of the project is to design and implement supervised and unsupervised learning methods to gain insights into how people move.

3.3 Data Description

The data set to be used in this project, called the Audit Trail data, is provided by Assa Abloy and originates from one of their product systems, with its administration interface seen in Figure 3, that has thousands of customers worldwide.

The Audit Trail data constitutes no more than a small fraction of all the data that this system has produced, but is used for different experiments and future development projects. The data set consists of information about 13.5 million different events, where the following information is specified each time an event is registered in the database:

• cylinderPlug: A unique number for each door or entrance. The cylinders are not wirelessly connected to any grid, and thus do not deliver real-time data. The data is obtained manually, in batches, a procedure that requires on-site personnel.

• userKey: A unique number for each key that exists. The keys are physical objects and work mechanically, but do not open any door unless there is also an electronically registered access granted to that specific door. The keys are regularly updated by the user, to ensure that they do not expire.

• system: The location where the event took place. What confines a system is not exactly specified; it could be a building, a remote power station, or a collection of buildings considered to belong to the same group.

• commandCode: The command code for the specific event (key X tries to open cylinder Y with the given command code).

• resultCode: The result code for the specific event (key X tries to open cylinder Y resulting in a response from the system).

• dateAndTime: The date and time for each specific event.

Figure 3: Graphical user interface for the Audit Trails administration system.

3.4 Research Question

Is it possible, with the help of unsupervised learning on a dataset with previous events in a system, to build a generative model of the data? Can it be used to assign each new event a measure of how expected the event is and thereby detect any anomalies?

Given a specific user key and the other event attributes, is it possible to predict which cylinder plug it has tried to access? Can we understand what each cluster in the fitted model corresponds to in terms of typical events?


4 Theoretical Background

4.1 Unsupervised Learning

As there is no list where previous events have already been analyzed and classified into different categories such as "normal", "suspicious" or "fraudulent", it is not clear how one can apply supervised learning techniques here. In supervised learning techniques, the algorithm that is constructed is trained on already labeled training data, from which it learns to assign labels to previously unseen data points [11]. After successful training it can then predict the labels of previously unseen data with high accuracy.

What must instead be used here, where the available data comprises a list with millions of unlabeled events, is unsupervised learning. In the context of unsupervised learning, one has a set of N observations, X = (x1, x2, x3, ..., xN), of a random vector x, which has some presumably unknown joint probability distribution p(x). The probability distribution p may be a probability density function (pdf) or a probability mass function (pmf). By skillfully selecting some unsupervised technique, one could optimally draw conclusions about this joint probability density p(x) and hence answer questions about the underlying origin or mechanism of the data source [7].

4.2 Model Inference

When fitting a model to data, some kind of method to calculate how well the model fits the data is required – and there are a range of alternatives to choose amongst. Two examples of these include minimizing the cross-entropy (CE) for classification and minimizing the sum of squares for regression – which both will be discussed in Section 4.5. Other methods of inference are the Bayesian method and the bootstrap, but these will not be discussed in detail here.

4.2.1 Maximum Likelihood Inference

Another method that provides a tool for inference is the maximum likelihood approach, which seeks to maximize with respect to θ the likelihood function

\[
p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta), \tag{1}
\]

defined as the probability of the observed data. Here, it is assumed that the observations x_i are independent given some parameter vector θ, and p(x | θ) is some model likelihood [7].

A common assumption is that x = x is a normally distributed variable (in the one-dimensional case), and hence the unknown parameters θ consist of a mean µ and a variance σ², which gives θ = (µ, σ²). The probability density function in this setting is then given by

\[
p(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},
\]

and (1), with this pdf inserted, becomes

\[
p(X \mid \theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}, \tag{2}
\]

which consists of the product of all N contributions of the pdf. Next, in order to maximize the likelihood function, the fact that products turn into sums when the logarithm is applied is utilized, which gives the log-likelihood

\[
l(\theta) = \sum_{i=1}^{N}\log p(x_i \mid \theta)
= -\sum_{i=1}^{N}\left( \log\!\left(\sqrt{2\pi}\,\sigma\right) + \frac{(x_i-\mu)^2}{2\sigma^2} \right). \tag{3}
\]

Now, by determining the µ and σ² that maximize the log-likelihood l(θ), by setting ∂l/∂µ = 0 and ∂l/∂σ² = 0, the maximum likelihood estimates θ̂ = (µ̂, σ̂²) are obtained as the parameters that maximize the likelihood of the data, i.e. the parameters that allow the best prediction of the data [7].

For the multivariate case, an important probability distribution, which will be used extensively in this project, is the multivariate Gaussian distribution, with the pdf

\[
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}}\,\frac{1}{|\Sigma|^{1/2}}
\exp\!\left( -\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu) \right), \tag{4}
\]

where the vector x and the mean vector µ are both D-dimensional, the covariance matrix Σ has the shape D × D, and |Σ| denotes the determinant of Σ.


Furthermore, the log-likelihood to be used for the derivation of the ML estimates is

\[
l(\theta) = -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma|
- \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^{T}\Sigma^{-1}(x_n-\mu),
\]

and the ML estimate θ_ML = (µ_ML, Σ_ML), omitting the details of the derivations, is given by

\[
\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n,
\qquad
\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^{T}.
\]

Calculations involving other probability density functions are performed in a similar manner, but can become difficult to solve if the expression inside the sum is not in closed form. An example of a case where the parameters cannot be calculated in the straightforward manner seen above is the Gaussian mixture model (GMM), where the pdf consists of not one but several Gaussian components. Tackling these kinds of problems requires more advanced techniques, which will be discussed later.
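As a brief illustration of the estimates above, the following sketch (not from the thesis; the true parameter values are arbitrary assumptions) computes µ_ML and Σ_ML for a simulated multivariate Gaussian sample with NumPy.

```python
# Maximum likelihood estimates for a multivariate Gaussian (Equation 4).
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)  # N x D data matrix

mu_ml = X.mean(axis=0)                          # (1/N) sum_n x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / X.shape[0]   # (1/N) sum_n (x_n - mu)(x_n - mu)^T
print(mu_ml, Sigma_ml)
```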

4.3 Model Assessment and Selection

Consider the case of several possible models to choose amongst. To evaluate which of these models is best suited for the data set, the trade-off between the two conflicting goals

• Data fit – the desire to predict the data as accurately as possible, which is the same as maximizing the likelihood. An increased number of clusters often increases the likelihood.

• Model complexity – to not choose an unnecessarily complicated model, something that is measured by the number of model parameters.

must be taken into consideration, which can be done with some validation criterion. Two such criteria include the

• Akaike Information Criterion, AIC,

\[
\log p(X) \simeq \log p(X \mid \theta_{ML}) - M,
\]

where M is the number of estimated parameters, and the

• Bayesian Information Criterion, BIC,

\[
\log p(X) \simeq \log p(X \mid \theta_{ML}) - \tfrac{1}{2}\,M\log N.
\]
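As a hedged illustration of how such criteria can be used in practice, the sketch below sweeps the number of mixture components with scikit-learn's GaussianMixture, which the thesis states was used for the GMM. Note that scikit-learn reports AIC and BIC on the −2·log-likelihood scale, so lower values are better; the synthetic data and the range of candidate K are assumptions.

```python
# Model selection by AIC/BIC over the number of mixture components.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 4)) for c in (0.0, 3.0, 6.0)])

for k in (1, 2, 3, 4, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, gmm.aic(X), gmm.bic(X))   # pick K where the criteria stop improving
```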

4.4 Dimensionality reduction

An important question to answer before trying to apply any unsupervised learning techniques is whether the number of dimensions is reasonable, so that

• the curse-of-dimensionality problem can be avoided, and

• feature learning can be done with an improved performance [12].

Conventional dimensionality reduction methods include principal component analysis (PCA), linear discriminant analysis (LDA) and L2-norm regularization methods, for example ridge regression and regularized discriminant analysis (RDA) [12]. These, however, are limited in the sense that they require strong linearity assumptions about the data. Another alternative for dimensionality reduction is to use an auto-encoder, which is a kind of neural network, but also a nonlinear generalization of PCA [12].

4.5 Neural Networks and Auto-Encoders

A neural network, typically described by Figure 4, can be used both for classification and regression problems. It consists of an input layer, one or several hidden layers and one output layer. The input layer has the same number of units as there are features in the considered data set, and the output layer typically has one unit in the one-dimensional regression case, and K units in the case of classification with K classes. The kth unit, where k = 1, ..., K, models the probability of the input being of class k.


Figure 4: A schematic picture of an artificial neural network, with fully con- nected layers [12].

A target observation y = (y_1, ..., y_K) may be represented by a one-hot vector, where each y_k is 0 or 1 and represents the ground-truth class of the corresponding input observation. The target observations are used to compute the loss, which can be computed with a range of different functions [7]. Here, no more than two of them are needed, and hence the rest are omitted. The first is the cross-entropy loss, used in classification problems,

\[
\mathrm{CE} = -\frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{N} y_{i,k}\log f_k(x_i),
\]

where f_k is the kth output unit of the neural network, and the second is the (one-dimensional) mean-squared error loss

\[
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2,
\]

used for regression problems.

For each hidden layer of width M, derived features are computed, each consisting of a linear combination of the inputs from the previous layer inserted into an activation function σ. Depending on the shape of the data, different activation functions can be used. For each layer, this results in M derived features, Z = (Z_1, Z_2, ..., Z_M), where

\[
Z_m = \sigma(\alpha_{0m} + \alpha_m^{T} x), \qquad m = 1, \ldots, M,
\]

and α_{0m} and α_m are the weights: parameters that are initially unknown and are tuned when the network is trained. Correctly tuned weights are what make the model fit the training data.

After computing the features Z, which is repeated for each of the hidden layers, these are multiplied by weights and shifted by a constant, so that we obtain T = (T_1, ..., T_K) with

\[
T_k = \beta_{0k} + \beta_k^{T} Z, \qquad k = 1, \ldots, K,
\]

and

\[
f_k(x) = g_k(T), \qquad k = 1, \ldots, K,
\]

where g_k(T) constitutes the last step, in which some output function is used to compute the result from the outputs T. For regression, the output function is simply the identity function, and for K-class classification it is common to use the softmax,

\[
\mathrm{softmax}_k(Z) = \frac{e^{Z_k}}{\sum_{l=1}^{K} e^{Z_l}}.
\]

The corresponding classifier, used to determine which class the input belongs to, is then

\[
\operatorname*{argmax}_k f_k(x).
\]

Other commonly used activation functions are the sigmoid function

\[
\mathrm{sigm}(Z) = \frac{1}{1 + e^{-Z}}
\]

and the rectified linear unit (ReLU),

\[
\mathrm{ReLU}(Z) = \max(0, Z).
\]

In order to train the model, the complete set of weights θ, consisting of α_{0m}, α_m for m = 1, ..., M and β_{0k}, β_k for k = 1, ..., K, is chosen by minimizing a total loss function R(θ), which may be a summed combination of the cross-entropy and mean-squared error losses if the output has both classification and regression components. By using the chain rule for differentiation, which in this setting is called back-propagation, the gradient of R(θ) can be derived and used in a stochastic gradient descent algorithm, which, given the right assumptions, eventually leads to a minimum of R(θ).

Due to the risk of overfitting the solution, finding the global minimum of R(θ) is in most cases not desirable. This scenario can be avoided by using some regularization technique, for example by early stopping or by adding some penalty term [7].
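To make the notation above concrete, the following minimal sketch (not from the thesis) computes the forward pass Z_m = σ(α_{0m} + α_m^T x), T_k = β_{0k} + β_k^T Z, the softmax output and the cross-entropy loss with NumPy; all dimensions and weight values are illustrative assumptions.

```python
# Forward pass of a single-hidden-layer classification network.
import numpy as np

rng = np.random.default_rng(0)
N, D, M, K = 5, 30, 17, 7                   # samples, inputs, hidden units, classes
X = rng.normal(size=(N, D))                 # batch of inputs
Y = np.eye(K)[rng.integers(0, K, size=N)]   # one-hot targets

alpha0, alpha = rng.normal(size=M), rng.normal(size=(D, M))
beta0, beta = rng.normal(size=K), rng.normal(size=(M, K))

def relu(z):
    return np.maximum(0.0, z)

def softmax(t):
    e = np.exp(t - t.max(axis=1, keepdims=True))   # numerically stable
    return e / e.sum(axis=1, keepdims=True)

Z = relu(alpha0 + X @ alpha)       # derived features Z_m
T = beta0 + Z @ beta               # T_k = beta_0k + beta_k^T Z
F = softmax(T)                     # f_k(x) = g_k(T)

cross_entropy = -np.mean(np.sum(Y * np.log(F + 1e-12), axis=1))
predicted_class = F.argmax(axis=1)  # argmax_k f_k(x)
print(cross_entropy, predicted_class)
```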

4.5.1 The Auto-Encoder

The auto-encoder (AE), or replicator neural network [4], constitutes a branch of the artificial neural networks, and has long been used to reduce the dimension of data sets by efficiently learning the data codings, which ultimately allows each data point to be described by a vector of fewer dimensions than the original. This unsupervised learning technique is schematically described by Figures 5 and 6, where the picture of the coffee cup (the input x) to the left is fed through the algorithm by first transforming it into something of lower dimension, the encoding part. The decoding part takes the compressed picture and expands it into a picture x̂ that, if the algorithm has been trained successfully, resembles the original picture with some desired level of accuracy.

Ideally, the output x̂ from the decoder is exactly the same as the input to the encoder, x.

The AE relates to the principal component analysis technique by the fact that it works exactly like a PCA tool under certain circumstances. In the setting where the activation function in the AE is a linear function, the code has r units, and the error function used to calculate R(θ) is the mean-squared error, the encoder will learn to project the input data x onto its first r principal components [12].

Figure 5: A schematic picture of an auto-encoder [12].


Figure 6: A schematic picture of an auto-encoder [12].

Consequently, the AE can be seen as a generalization of PCA, that instead of using linear methods to encode the data, can capture more of the information by using adaptive, nonlinear methods [12].

4.6 K-means clustering

Clustering is a technique where a given data set is divided into a number K of homogeneous groups [3]. K-means clustering, which uses this technique, is an unsupervised learning method, and although it will not be explicitly used in any calculations in this thesis, the theory of GMMs makes some references to it that are worth addressing.

The idea of the K-means clustering algorithm is that the algorithm is given a set of data and a number K of means (which will be the centers of the different clusters), from which it iteratively computes where the centers should be located and which points are assigned to them.

The K-means clustering algorithm, also illustrated in Figure 7, consists of the following steps:

1. Randomly assign every data point to one of the K classes.

2. Calculate the center of each cluster.

3. Iterate until convergence:

(a) For each center k, find which data points lie closer to it than to any other center, and assign them to class k.

(b) Compute the mean vector of every cluster, which will constitute the new center of that specific cluster [7].

Figure 7: The iterative process of the K-means algorithm [10].
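A minimal sketch of these steps with scikit-learn's KMeans on synthetic data might look as follows; the cluster count and the data are assumptions for illustration only.

```python
# K-means clustering on three synthetic, well-separated groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0.0, 2.0, 4.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the K mean vectors (cluster centers)
print(kmeans.labels_[:10])       # hard cluster assignment for each point
```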

4.7 Gaussian Mixture Models

The Gaussian mixture model (GMM) is a suitable model when a simple Gaussian distribution fails to capture the nature of the data [2]. An example of this is shown in Figure 8, where a single Gaussian distribution is plotted to the left, and a mixture of several Gaussian distributions is plotted to the right.


Figure 8: The GMM – as a superposition of three Gaussian densities.

More precisely, the GMM can be described as a linear superposition of K Gaussian densities, written as

\[
p(x) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k). \tag{5}
\]

Each of the Gaussian densities N(x | µ_k, Σ_k) constitutes a component of the mixture model and has its own mean µ_k and covariance Σ_k. Furthermore, π_k is the mixing coefficient, with the properties 0 ≤ π_k ≤ 1 and Σ_k π_k = 1, which makes the π_k fulfil the requirements to be probabilities of a categorical random variable. These mixing coefficients can then be seen as the prior probabilities, in other words the probability that component k is chosen. This, used together with the fact that the density N(x | µ_k, Σ_k) = p(x | z = k), where z is a random variable indicating the class of x, allows Equation 5 to be rewritten as

\[
p(x) = \sum_{k=1}^{K} p(z = k)\, p(x \mid z = k). \tag{6}
\]

Now, in order to set the values of the parameters π, µ, Σ, which decide the shape of the GMM, the log-likelihood function is to be maximized, which in accordance with Equation 1 looks like

\[
l(\theta) = \log p(X \mid \pi, \mu, \Sigma)
= \sum_{i=1}^{N}\log\!\left( \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right). \tag{7}
\]

In this case, due to the summation over k inside the logarithm on the right-hand side, there is no closed analytical form for the solution. Consequently, some other approach is needed here, and this is where the Expectation Maximization algorithm, described in Section 4.8, can be utilized.

By adjusting the mixing coefficients, the means and covariances, it is possible to use a GMM for fitting almost any continuous density, up to an arbitrary level of accuracy.

The GMM resembles the K-means clustering algorithm in the sense that it can be seen as a somewhat more advanced extension of it: it also uses K different clusters, but the data points are now assigned to them in a manner that maximizes the likelihood, which here is the probability of seeing the data given the (iteratively) calculated model parameters. The algorithm to calculate the parameters of the GMM resembles that of the K-means algorithm, with the modification that the former uses the two alternating steps of the Expectation Maximization algorithm. For the sake of instructiveness, these are roughly described below, and in a more exhaustive manner in the following section.

The EM algorithm consists of two steps: the expectation (E) step and the maximization (M) step.

1. In the E-step, every data point is assigned a weight for each Gaussian component, which depends on the likelihood of that point with respect to that component. If the point is close to the center of one Gaussian, it will be assigned a weight close to 1 for that Gaussian and close to 0 for all the others. In case the data point is right in between two centers, it will divide its weight between those two centers.

2. In the M-step, each data point contributes with its assigned weights, and the weighted mean and covariance of every Gaussian are computed.

In contrast to K-means clustering, a GMM is a probabilistic model, which means that it can be used to calculate probabilities, not just to assign the data points to classes 1, ..., K. A Gaussian mixture model is thus a model containing a number of different Gaussian distributions, each with its own mean and covariance, which are iteratively calculated with the ultimate goal of maximizing the likelihood of the data points [7].
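The following short sketch (not from the thesis) evaluates the mixture density of Equation 5 directly with SciPy for an assumed two-component model in two dimensions, which also makes the connection to the unexpectedness measure −log p(x) used later explicit.

```python
# Evaluating p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for an illustrative GMM.
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.3, 0.7])                        # mixing coefficients, sum to 1
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def gmm_density(x):
    # Linear superposition of the K Gaussian component densities.
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(pi, mu, Sigma))

print(gmm_density(np.array([1.0, 1.0])))              # density of a typical point
print(-np.log(gmm_density(np.array([10.0, -5.0]))))   # high "unexpectedness"
```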

4.8 Expectation Maximization (EM)

As discussed in Section 4.2.1, problems may arise when trying to derive the maximum likelihood estimates, for example when there is no closed form that expresses the maximum likelihood. This is the problem that one encounters in models that contain latent variables, and an example of such a model is the GMM. A powerful tool to circumvent the problem with the latent variables is the EM algorithm, which finds maximum likelihood solutions by iterating over the possible parameters and solutions, eventually reaching a set level of convergence in the parameters or the likelihood.

To explain the EM algorithm, assume that there is a data set X = (x_1, ..., x_N) of independent realizations of x = (y, z), where y is observed and z is unobserved. Furthermore, assume that the data set has the conditional density p(X | θ) given some parameters θ and the log-likelihood function

\[
l(\theta; X) = \log p(X \mid \theta).
\]

Now, the log-likelihood function for the observed data Y = (y_1, ..., y_N) is

\[
l_{\mathrm{obs}}(Y \mid \theta) = \log \int p(X \mid \theta)\, dz.
\]

The maximization of this log-likelihood includes computing the integral, which may be difficult if there is no closed form. Instead, to maximize l_obs(Y | θ) with respect to θ, the EM algorithm is an iterative procedure consisting of two steps [8]:

1. The E-step: at iteration i, compute Q(θ | θ^(i)) = E_{θ^(i)}[ l(θ; X) | Y ].

2. The M-step: set θ^(i+1) = argmax_θ Q(θ | θ^(i)).

4.8.1 EM in the GMM setting

As seen previously, the EM algorithm can be used to calculate a maximum likelihood estimate of the parameters of a GMM by iterating between the E-step and the M-step. Ultimately, this leads to the likelihood of the data points being maximized without access to all the data that would be needed, and herein lies the finesse of this algorithm.

Now, after going through the details of the EM algorithm, it is possible to give a more in-depth description of the E-step and the M-step in the setting of the GMM. Let γ_nk = p(z_n = k | x_n) be the so-called posterior probabilities, where

\[
\gamma_{nk} = p(z_n = k \mid x_n)
= \frac{\pi_k\,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{l}\pi_l\,\mathcal{N}(x_n \mid \mu_l, \Sigma_l)}
\]

by Bayes' theorem.

by Bayes’ theorem. The specific conditions that must be fulfilled at a maximum of the likelihood function are


1. The derivative of Equation 7 with respect to the means µ_k must be zero, which gives

\[
0 = -\sum_{n=1}^{N}
\frac{\pi_k\,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j}\pi_j\,\mathcal{N}(x_n \mid \mu_j, \Sigma_j)}\,
\Sigma_k (x_n - \mu_k).
\]

Identifying the responsibilities γ_nk from Equation 8 and multiplying with the inverse of the covariance, Σ_k^{-1}, gives

\[
\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,x_n,
\qquad \text{where} \qquad
N_k = \sum_{n=1}^{N}\gamma_{nk}.
\]

2. Differentiating Equation 7 in a similar manner, but now with respect to the covariances Σ_k, gives

\[
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}(x_n - \mu_k)(x_n - \mu_k)^{T}.
\]

Algorithm:

1. Propose some starting values for µ_k and Σ_k, and evaluate the log-likelihood.

2. The E-step: evaluate the responsibilities γ_nk with the given parameters as

\[
\gamma_{nk} = \frac{\pi_k\,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n \mid \mu_j, \Sigma_j)}. \tag{8}
\]

3. The M-step: re-estimate the parameters using the current responsibilities,

\[
\mu_k^{\mathrm{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,x_n,
\qquad
\Sigma_k^{\mathrm{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}(x_n - \mu_k^{\mathrm{new}})(x_n - \mu_k^{\mathrm{new}})^{T},
\qquad
\pi_k^{\mathrm{new}} = \frac{N_k}{N},
\]

where

\[
N_k = \sum_{n=1}^{N}\gamma_{nk}.
\]

4. Then evaluate the log-likelihood again,

\[
\ln p(X \mid \pi, \mu, \Sigma)
= \sum_{n=1}^{N}\ln\!\left(\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right).
\]

The requirements to be fulfilled for the algorithm to end are either

1. convergence of the parameters, or

2. convergence of the log-likelihood.

If neither of these criteria is satisfied, the algorithm returns to step 2.
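For illustration, a compact NumPy implementation of the E- and M-steps listed above is sketched below on synthetic two-dimensional data with K = 2. This is not the thesis implementation, which relied on scikit-learn's GaussianMixture; the data, initialization and stopping tolerance are assumptions.

```python
# EM for a Gaussian mixture model, following the algorithm above.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(4.0, 1.0, size=(150, 2))])
N, D, K = X.shape[0], X.shape[1], 2

# Starting values for pi_k, mu_k and Sigma_k.
pi = np.full(K, 1.0 / K)
mu = X[rng.choice(N, size=K, replace=False)]
Sigma = np.stack([np.eye(D) for _ in range(K)])

def log_likelihood(X, pi, mu, Sigma):
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                            for k in range(K)])
    return np.sum(np.log(dens.sum(axis=1)))

prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibilities gamma_nk (Equation 8).
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                            for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted means, covariances and mixing coefficients.
    Nk = gamma.sum(axis=0)
    mu = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pi = Nk / N

    # Stop on convergence of the log-likelihood.
    ll = log_likelihood(X, pi, mu, Sigma)
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll

print(pi, mu)
```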

5 Methodology

The methodology that was applied to answer the research questions, and which concerns the numerical part of the project, was conducted in the following order:

• Data exploration

• Selection of an appropriate part of the data, to be the subject of the analysis

• Data cleaning

• Dimensionality reduction with an AE

• Fitting a GMM

• Calculation of the corresponding values of unexpectedness, from that GMM

• Decoding samples from the GMM

• Use the GMM as a generative model, and calculate the conditional probabilities p(cylinderPlug | userKey) that will answer the research question.

Now, how could the methods of unsupervised learning be applied to the problem of this report?

Let us begin by assuming that the input data originating from the Audit Trail system contains N observations, (x_1, x_2, x_3, ..., x_N), each consisting of m variables. Furthermore, assume that (h_1, h_2, h_3, ..., h_N) is the same set of observations, but where some kind of dimensionality reduction method has been applied, so that the number of dimensions is now n, where m ≫ n. In the case of this project, this dimension reduction method will consist of the encoder part φ of an auto-encoder, so that h = φ(x). As each input x consists of a blend of continuous and categorical variables, a mean-squared error is applied on the continuous variables and the cross-entropy loss is applied on the categorical variables, represented by one-hot encodings. The total loss to be minimized is then their sum.

Now, the ultimate desire here is to fit a suitable probability density function to x, in order to obtain a probabilistic model that can be used for making predictions on yet unseen observations, but the high dimensionality of x makes this an intractable task. By instead fitting a probability density function to the encodings h, a density for x is obtained according to

\[
p(x) = p(\varphi(x)) = p(h). \tag{9}
\]

In particular, a GMM is fitted to the encoded observations (h_1, h_2, h_3, ..., h_N) by an EM algorithm. This leads to a generative model.

Lastly, it is possible to use the above model to make predictions of cylinderPlug given an observation of all the other variables. Let x̃ be an observation of userKey, commandCode, resultCode and dateAndTime (the system is fixed). The conditional probability of the corresponding cylinderPlug being number k is then given by

\[
p(\text{cylinderPlug} = k \mid \tilde{x})
= \frac{p(\text{cylinderPlug} = k,\ \tilde{x})}{\sum_{l=1}^{K} p(\text{cylinderPlug} = l,\ \tilde{x})}. \tag{10}
\]

Note that each probability on the right-hand side is computable using the generative model.

The above methods were implemented in Python using a JupyterLab interface.

In particular, the auto-encoder was constructed using TensorFlow, and scikit-learn was used for the GMM and the accompanying EM algorithm.
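A hedged sketch of how Equation 10 could be evaluated in this pipeline is given below. The helper names encoder (the encoding part of the auto-encoder), gmm (a fitted scikit-learn GaussianMixture) and the assumed column layout of the 30-dimensional event vector are hypothetical; the thesis code itself is not reproduced here.

```python
# Predicting the cylinderPlug by scoring every candidate one-hot block.
import numpy as np

N_CYLINDERS = 7
CYL_SLICE = slice(23, 30)   # assumed position of the cylinderPlug one-hot block

def predict_cylinder(x_tilde, encoder, gmm):
    """Return p(cylinderPlug = k | x_tilde) for k = 0..6, as in Equation (10)."""
    candidates = np.repeat(x_tilde[None, :], N_CYLINDERS, axis=0)
    candidates[:, CYL_SLICE] = np.eye(N_CYLINDERS)   # try each cylinder in turn
    h = encoder.predict(candidates, verbose=0)       # encode to the low-dimensional code space
    joint = np.exp(gmm.score_samples(h))             # p(cylinderPlug = k, x_tilde), up to a constant
    return joint / joint.sum()                       # normalize over the candidates

# probs = predict_cylinder(event_vector, encoder, gmm); probs.argmax() is the guess.
```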

6 Results

The auto-encoder is designed according to Figures 9, 10 and 11, with the details of the design as in Table 1.

The lowest training and validation error that was generated by the auto-encoder is found in Table 2.


Table 1: Design of the resulting auto-encoder

Layer            No. of units    Activation function
Input layer      30              -
Hidden layer     17              ReLU
Coder            4               Sigmoid
Gaussian noise   4               -
Hidden layer     17              ReLU
Output layer     30              Sigmoid/softmax

Table 2: Training and validation errors obtained with the auto-encoder.

Error              Result
Training error     0.0403
Validation error   0.0301
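As an illustration of the design in Table 1, a minimal TensorFlow/Keras sketch with those layer sizes and activations is shown below; the noise level, optimizer and the single sigmoid output activation (in place of the mixed sigmoid/softmax output and the summed MSE/cross-entropy loss described in the methodology) are simplifying assumptions, not the thesis code.

```python
# Auto-encoder with the layer sizes and activations of Table 1.
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(30,))                   # 30 event features
h = layers.Dense(17, activation="relu")(inputs)      # hidden layer, 17 units
code = layers.Dense(4, activation="sigmoid")(h)      # 4-dimensional code
noisy = layers.GaussianNoise(0.05)(code)             # Gaussian noise layer
h = layers.Dense(17, activation="relu")(noisy)       # hidden layer, 17 units
outputs = layers.Dense(30, activation="sigmoid")(h)  # reconstruction

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, code)                        # used later to obtain h = phi(x)

autoencoder.compile(optimizer="adam", loss="mse")    # simplification of the mixed loss
# autoencoder.fit(X_train, X_train, epochs=1000, validation_data=(X_test, X_test))
```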

Figure 9: The resulting auto-encoder: input data, encoder, code part, decoder and reconstructed data.


Figure 10: Illustration of the encoder part of the auto-encoder: input layer (x1, ..., x30), hidden layer (a1, ..., a17) and "code" layer (y1, ..., y4).

Figure 11: Illustration of the decoder part of the auto-encoder: "code" layer (y1, ..., y4), hidden layer (a1, ..., a17) and output layer (z1, ..., z30).

The quality of the regenerated data is depicted in Figure 12. The auto-encoder succeeds with its predictions in most cases, although it appears to have some trouble with the time stamps to the left (at indexes 1, 2 and 3).

Figure 12: Results of the auto-encoder. The 30 different predictors are found as separate indexes on the x-axis; Access granted (0 or 1) on x = 0, the time stamp (weekday, hour, minute) at 1, 2 and 3, Commandcode on 4 and 5, Resultcode at 6, 7 and 8, userKey between 9 and 22 and cylinderPlug between 23 and 29.

Every predictor except time stamp is categorical, and a ”1” indicates that the data point contains this predictor. If the orange line completely overlaps the blue, the result is 100 % accurate.

The auto-encoder training was run for 1000 epochs and exhibited a fast and simultaneous decay of both the training and test errors, as seen in Figure 13.


Figure 13: A plot of the auto-encoder training results from the 1000 epochs that the auto-encoder was run.

In Figure 14, the correlation between the four units in the code part is plotted, as calculated for each of the 200 test points. As there appear to be fewer than 200 points in each box, what is seen is not single data points but clusters of points. Located on the diagonal are the histograms of the correlations for each code unit.


Figure 14: A plot of the resulting correlation between the 4 coding units in the auto-encoder.

To fit a Gaussian mixture model on the data, now with dimension 4 due to the dimensionality reduction with the auto-encoder, a model with 20 components was chosen. The reasons for that choice are the calculated BIC and AIC values, plotted in Figure 15, which appeared to no longer be decreasing around n = 20.

This, together with the fact that approximately 20 clusters of points could be found in each box in Figure 14, motivates the choice of a GMM with 20 clusters.


Figure 15: A plot of the resulting Bayesian information criterion (BIC) and Akaike information criterion (AIC) for the fitted GMM model (with 8 dimensions).

In Figure 16, the resulting GMM is plotted. The dimension of the model is 4, but for practical reasons, so that the model can be visualized, the dimension is here further reduced to 2 with a t-SNE algorithm [6]. What can also be seen in the plot are the encoded train and test points, whose placement in the density reveals the probability of seeing that point. The further out on the pink areas that a point is located, the lower the expectation of that event happening.


Figure 16: A plot of the train and test data (blue and orange dots) together with the fitted GMM model, embedded with t-SNE.

The risk that an event x constitutes an anomaly is quantified as −log p(x), which can now be plotted for all points in the data. This is done in Figure 17.

Figure 17: A plot of the unexpectedness.
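A brief sketch of how the unexpectedness −log p(x) in Figure 17 can be obtained from a fitted scikit-learn GaussianMixture is given below; gmm and the encoded events H are assumed inputs from the pipeline described in Section 5.

```python
# Unexpectedness of encoded events under the fitted GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def unexpectedness(gmm: GaussianMixture, H: np.ndarray) -> np.ndarray:
    """-log p(h) for each encoded event; larger values are more anomalous."""
    return -gmm.score_samples(H)

# Example threshold: flag the 1 % most unexpected events.
# scores = unexpectedness(gmm, H_test); alerts = scores > np.quantile(scores, 0.99)
```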


Lastly, an attempt to predict which cylinderPlug was most likely to be seen was made, and the result can be seen in Figure 18. The blue line represents the conditional probability of each cylinderPlug, given all the other variables. This was calculated by running the test data points through the algorithm and testing the expectedness for each of the 7 cylinders.

Figure 18: Results of the cylinder plug prediction. The orange line represents the ground truth, and the blue line represents the predictions. Four randomly chosen test points were used for this report.

Table 3: Results obtained with the auto-encoder

Measure                         Result
Cross entropy                   1.287
Mean probability                0.6577
Mean misclassification error    0.36


7 Discussion

7.1 Auto-Encoder and GMM Results

The simple, yet efficient, architecture of the auto-encoder can be motivated by the sparse structure of the data. The fact that the data contains 30 dimensions – 1 for access (Yes/No), 3 for time, 2 for the commandCode, 3 for the resultCode, 16 for the userKey and 7 for the cylinderPlug – makes it relatively low-dimensional, yet too high to be practical. The reason behind this is that a Gaussian mixture model fitted on a data set with 30 dimensions would indeed constitute a generative model, but with a high risk of poor predictive quality on new data points, due to overfitting. Overfitting can occur when there are too many parameters to estimate and too few data points to train the model on [3]. A GMM with 30 dimensions, which would have been the case without the auto-encoder, and presumably 20 clusters, entails 9919 parameters to estimate. As the data set to be used contains 3100 observations, this means that the number of parameters to be decided by the EM algorithm is roughly three times the number of data points. When processed with the auto-encoder, the encoded data set consists of 4 dimensions, and with 20 clusters this resulted in a GMM with 299 parameters to estimate, which is much more reasonable.
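The parameter counts quoted above can be checked with a short calculation: a full-covariance GMM with K components in D dimensions has K − 1 free mixing coefficients, K·D mean entries and K·D(D + 1)/2 covariance entries. The sketch below (not from the thesis) reproduces the two numbers.

```python
# Number of free parameters in a full-covariance Gaussian mixture model.
def gmm_n_parameters(K: int, D: int) -> int:
    return (K - 1) + K * D + K * D * (D + 1) // 2

print(gmm_n_parameters(20, 30))  # 9919, the 30-dimensional case
print(gmm_n_parameters(20, 4))   # 299, after encoding to 4 dimensions
```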

The unexpectedness plot shows that the first part of the research question

”Is it possible, with the help of unsupervised learning on a dataset with previous events in a system, to build a generative model of the data? Can it be used to assign each new event a measure of how expected the event is and thereby detect any anomalies?”

is possible to answer. The location of the encoded test points in Figure 16 makes it possible to tell by inspection how likely or unlikely the observations are, and the calculation of the unexpectedness gives a numerical result for the same observations. This value could, for example, be compared against a threshold that triggers an alert when an observation exceeds it.

Last to discuss is the result of the cylinder plug prediction, which is supposed to answer the second part of the research question:

”Given a specific user key and the other event attributes, is it possible to predict which cylinder plug it has tried to access? Can we understand what each cluster in the fitted model corresponds to in terms of typical events?”

The result seen in Figure 18 and Table 3 confirms that the first question is possible to answer, but not with 100 % accuracy. If 100 % of the cylinder predictions had been correct, the mean probability would have been 1 and not 0.6577 – but would that have been a reasonable result? The Gaussian mixture model is a probabilistic, generative model, which has to be balanced between its generality and the number of parameters to estimate.

The wrong balance between these two will lead to overfitting in case there are too many parameters, and the opposite if there are too few, which will result in a model that is too general, and not case-specific enough. For the second question, it is possible to sample hidden encodings from each cluster and then use the decoder to obtain a representative sample of typical events.

7.2 Further Research

As was mentioned in the introduction, obtaining a solution to the research question of this Master's thesis still does not solve the problem of how to achieve the seamless opening solution. The algorithm behind the seamless opening solution is supposed to be able to learn and predict people's movement so well that transitioning through a building is experienced as seamless. Due to the sparse information provided in the data set, there are a few already proposed approaches that, provided additional data is available, could take a similar project further along the way towards that solution. These further research suggestions include

• Variational auto-encoders. The variational auto-encoder is similar to the auto-encoder (AE) discussed in Section 4.5, but unlike the AE, the variational AE is a generative model, which can be used to directly draw samples from a distribution that is similar enough to the distribution of the original data [9].

• Pattern recognition by using Time Series Analysis, where sequences of events from the access systems are analyzed instead of single data points.

• Markov random fields. One way to predict the next move of a user in an access system would be to derive a joint probability distribution for the whole system, which would then allow the construction of the full conditional distributions. An advantage of this approach is that a Markov random field is described by an undirected graph, which makes it usable for the access system, where it remains unknown in which direction (in or out of a room) a user is moving. The drawback of this method is that the spatial coordinates of the access system (the PACS) are required to calculate the joint probability distribution, unless Markov chain Monte Carlo techniques were used (a task that requires an additional level of complexity, and was therefore deemed not appropriate for this project). Nevertheless, it might be a good choice for further research in the future.

• Graph databases. An idea that never made it to the implementation part of this thesis was a graph database, which was supposed to store all the access configuration data and the door opening attempts. The implementation of such a graph database, together with the PACS mentioned in Section 3 also represented in the graph database, could offer another way of solving this problem. Although that research might not contain the explicit development of statistical learning algorithms, at least not to the same extent as in this project, such algorithms are already incorporated in graph database programs and used to calculate the results.

Graph databases are used by many large companies and provide a way to easily store and access data in a way that enables visualisation of what is happening inside a system. To obtain a solution that succeeds in identifying anomalies in real time, e.g. stopping thieves from using a stolen access key or locking a suspected terrorist out of a building, a statistical learning algorithm with an almost non-existent reaction time would be needed. This is hard to achieve, and the implementation of a graph database could help tackle that problem. Statistical learning and data analysis do require sophisticated algorithms to, for example, detect anomalies or predict the future, but must nonetheless also tackle the problem of efficient data collection, storage and accessibility [5]. Access solution systems with perhaps thousands or millions of entries per day to store and make predictions on will create slow, demanding and possibly infeasible calculations and visualisation attempts.

This problem could be solved by using a graph database, at least for some parts of the analysis, which is therefore suggested as further research for this project.


8 References

[1] Assa Abloy. Assas historia – en översikt. 2020. URL: https://www.assaabloyopeningsolutions.se/sv/local/se/om-oss/assas-historia/assas-historia-en-oversikt/.

[2] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. ISBN: 0387310738.

[3] Charles Bouveyron, Stéphane Girard, and Cordelia Schmid. "High-dimensional data clustering". In: Computational Statistics & Data Analysis 52.1 (2007), pp. 502–519.

[4] Charu C. Aggarwal. Neural Networks and Deep Learning: A Textbook. 2018.

[5] Gaurav Deshpande et al. Native Parallel Graphs – The Next Generation of Graph Database for Real-Time Deep Link Analytics. TigerGraph, Inc., 2018.

[6] scikit-learn developers. t-distributed Stochastic Neighbor Embedding. 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.

[7] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.

[8] Henrik Hult. Lecture Notes, no. 8. 2010. URL: https://www.math.kth.se/matstat/gru/Statistical%20inference/Lecture8.pdf.

[9] Diederik P. Kingma and Max Welling. "Auto-encoding variational Bayes". In: arXiv preprint arXiv:1312.6114 (2013).

[10] Stack Overflow: K-means clustering. 2020. URL: https://stackoverflow.com/questions/51263331/kmeans-save-each-iteration-step (visited on 11/19/2020).

[11] Sebastian J. Wetzel. "Unsupervised learning of phase transitions: From principal component analysis to variational autoencoders". In: Physical Review E 96.2 (2017), p. 022140.

[12] Haitao Zhao et al. Feature Learning and Understanding: Algorithms and Applications. Jan. 2020. ISBN: 978-3-030-40793-3. DOI: 10.1007/978-3-030-40794-0.


