
Statistical Modeling of Dynamic Risk in Security Systems

GURPREET SINGH

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES


Degree Projects in Financial Mathematics (30 ECTS credits)
Master's Programme in Applied and Computational Mathematics
KTH Royal Institute of Technology, year 2020
Supervisor at STANLEY Security: Tom Vetterlein
Supervisor at KTH: Anja Janssen
Examiner at KTH: Anja Janssen


TRITA-SCI-GRU 2020:076 MAT-E 2020:039

Royal Institute of Technology School of Engineering Sciences KTH SCI

SE-100 44 Stockholm, Sweden URL: www.kth.se/sci


Abstract

Big data has been used regularly in finance and business to build forecasting models. It is, however, a relatively new concept in the security industry. This study predicts the technology related alarm codes that will sound in the coming 7 days at a location L by observing the past 7 days. Logistic regression and neural networks are applied to solve this problem. Because the problem is of a multi-labeled nature, logistic regression is applied in combination with binary relevance and classifier chains. The models are trained on data that has been labeled with two separate methods: the first method labels the data by observing only location L, while the second considers L and L's surroundings.

As the problem is multi-labeled, the labels are likely to be unbalanced; thus a resampling technique, SMOTE, and random over-sampling are applied to increase the frequency of the minority labels. Recall, precision, and F1-score are calculated to evaluate the models. The results show that the second labeling method performs better for all models and that the classifier chains and binary relevance models perform similarly. Resampling the data with the SMOTE technique increases the macro average F1-scores for the binary relevance and classifier chains models; however, the neural network's performance decreases. The SMOTE resampling technique also performs better than random over-sampling. The neural networks model outperforms the other two models on all methods and achieves the highest F1-score.


Statistisk Modellering av Dynamisk Risk i Säkerhetssystem

Sammanfattning (Summary)

Big data has been used regularly in finance to build forecasting models; it is, however, a relatively new concept in the security industry. This study predicts which alarm codes will sound during the coming 7 days at location L by observing the past 7 days. Logistic regression and neural networks are used to solve this problem. Since the problem is of a multi-label nature, logistic regression is applied in combination with binary relevance and classifier chains. The models are trained on data that has been annotated with two separate methods: the first method annotates the data by observing only location L, and the second method considers L and L's surroundings. Since the problem is multi-labeled, the labels are likely to be unbalanced, and therefore the resampling method SMOTE and random over-sampling are used to increase the frequency of the minority labels. Recall, precision and F1-score were measured to evaluate the models. The results show that the second annotation method performed better for all models and that classifier chains and binary relevance performed similarly. The binary relevance and classifier chains models trained on the data resampled with SMOTE achieved a higher macro average F1-score; however, the performance of the neural network decreased. The SMOTE resampling method also performed better than random over-sampling. The neural network model outperformed the other two models on all methods and achieved the highest F1-score.


Acknowledgements

Firstly, I would like to thank my supervisor at KTH Royal Institute of Technology, Anja Janssen, for her support and guidance throughout the project. I would also like to thank my supervisor at STANLEY Security, Tom Vetterlein, for the opportunity to write my master's thesis and for providing me with the necessary data and material. Finally, I would like to thank my friends and family for not only supporting me throughout this thesis but also through my time at KTH.


List of Figures

2.1 The logistic function transforms the input value to the range (0, 1).
2.2 A neuron with three input variables and one output.
2.3 ReLU activation function with one-dimensional input space, b = 0, w = 1.
2.4 A simple neural network with three layers and five neurons.
2.5 A general neural network structure with n hidden layers.
2.6 One path of the backpropagation algorithm. Figure source: [6].
4.1 Correlation matrix for input data used for the first labeling method; the axes represent the input values in X, i.e. 0 represents the alarm 100 and 20 represents the alarm 120.
4.2 Correlation matrix for technology related alarms.
4.3 Correlation matrix for input data used for the second labeling method; the axes represent the input values in X.


List of Tables

1.1 Alarm codes and their description.
1.2 A made up example of the data.
2.1 The binary relevance transformation. Here x = [1, 0, 0, 1, 0, 1, 1, 1, 0] and ŷ = [1, 0, 0, 1, 1]. Every single-labeled classifier, hi, is trained to predict ŷi ∈ {0, 1}.
2.2 The classifier chains transformation. Here x = [1, 0, 0, 1, 0, 1, 1, 1, 0] and ŷ = [1, 0, 0, 1, 1]. Every single-labeled classifier, hi, is trained to predict ŷi ∈ {0, 1}.
3.1 Frequency of each alarm with the first labeling method before and after resampling.
3.2 Frequency of each alarm with the second labeling method before and after resampling.
4.1 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the binary relevance model with the first labeling method. The support column represents the frequency of each alarm in the test data.
4.2 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the classifier chains model with the first labeling method.
4.3 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the neural networks model with the first labeling method.
4.4 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the binary relevance model with the first labeling method with resampled data.
4.5 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the classifier chains model with the first labeling method with resampled data.
4.6 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the neural networks model with the first labeling method with resampled data.
4.7 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the binary relevance model with the second labeling method.
4.8 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the classifier chains model with the second labeling method.
4.9 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the neural networks model with the second labeling method.
4.10 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the binary relevance model with the second labeling method with resampled data.
4.11 Recall, precision and F1-score for all technology related alarms and their micro and macro averages for the classifier chains model with the second labeling method with resampled data.
4.12 Precision, recall and F1-score achieved for all technology related alarms with the classifier chains model.
4.13 The summarized average micro and macro results from the first labeling method with and without over-sampling using ReS11. Here: BR - Binary Relevance, CC - Classifier Chains, NN - Neural Networks.
4.14 The summarized average micro and macro results from the second labeling method with and without over-sampling using ReS21. Here: BR - Binary Relevance, CC - Classifier Chains, NN - Neural Networks.
4.15 The summarized average micro and macro results from methods ReS12, ReS13 and ReS22. Here: BR - Binary Relevance, CC - Classifier Chains, NN - Neural Networks.


Contents

1 Introduction
  1.1 Problem Description
  1.2 Background
  1.3 Data Description

2 Theory
  2.1 Logistic Regression
  2.2 Binary Relevance
  2.3 Classifier Chains
  2.4 Neural Networks
    2.4.1 Linear Perceptron
    2.4.2 Sigmoid
    2.4.3 Rectified Linear Unit (ReLU)
    2.4.4 Neural Network Structure
    2.4.5 Gradient Descent
    2.4.6 Adaptive Moment Estimation (Adam) Optimizer
    2.4.7 Backpropagation
  2.5 Resampling
    2.5.1 SMOTE (Synthetic Minority Over-sampling Technique)

3 Methodology
  3.1 Labeling the Data
    3.1.1 Method 1
    3.1.2 Method 2
  3.2 Logistic Regression
  3.3 Neural Networks
    3.3.1 Epochs and Validation Data
    3.3.2 Building the Neural Network Model in Python
  3.4 Resampling
  3.5 Evaluating the Models

4 Results
  4.1 Method 1 Labeling
    4.1.1 Binary Relevance
    4.1.2 Classifier Chains
    4.1.3 Neural Networks
  4.2 Labeling Method 1, Resampled with ReS11
    4.2.1 Binary Relevance
    4.2.2 Classifier Chains
    4.2.3 Neural Networks
  4.3 Labeling Method 2
    4.3.1 Binary Relevance
    4.3.2 Classifier Chains
    4.3.3 Neural Networks
  4.4 Labeling Method 2, Resampled with ReS21
    4.4.1 Binary Relevance
    4.4.2 Classifier Chains
    4.4.3 Neural Networks
  4.5 Summary
  4.6 Resampling methods ReS12, ReS13 and ReS22

5 Discussion
  5.1 Models
  5.2 Resampling
  5.3 Labeling Method 1 and 2
  5.4 Precision or Recall
  5.5 Future Work

6 Conclusion


Chapter 1

Introduction

Although big data and forecasting models are regularly used in finance and business, they are relatively new concepts in the security industry, crime prevention and criminology. The difficulties with analyzing big data are its messiness, abundance, variety and the fact that it is often generated without a specific question in mind. Historically, handling these challenges has been difficult and expensive; however, due to improved technology and new analytical methods it is now possible. The main approach is to construct machine learning models that can recognize patterns and predict outcomes. It is, however, difficult to predict which model will perform best on a given data set, as all models have strengths and weaknesses [1].

This study takes such an approach to analyzing big data. The data is provided by STANLEY Security, a security company that, among other things, provides its customers with sensors, cameras and alarms to ensure their safety.

1.1 Problem Description

STANLEY Security has an alarm centre that collects incoming alarms and notes down the alarm type, time, location, postal code, and alarm code for each alarm. The data is currently only being placed in a repository. This data will be used to provide STANLEY Security’s customers with a risk measure.

The risk measure will be based on the alarm codes that will sound in the future at location L. This study will lay the groundwork for this risk measure. The future alarms will be predicted by the alarms that have sounded in the past at location L and its surroundings.


This risk measure will give STANLEY Security's customers a new way to determine their risk levels, and it might also be useful for taking countermeasures to prevent these risks.

The first step in predicting the future alarms is to label the data. The data was labeled in two different ways. The first method labels the data by observing the alarms at location L over the past z days and then the alarms at location L over the coming z days. These observations are used to define the labels, which are based on which alarms sounded at location L in the past and which will sound there in the future. This process is repeated for every location L.

The second labeling method takes the surroundings of L into consideration. It labels the data by first observing the alarms over the past z days at location L and its surroundings (within d kilometers of L), and then the alarms over the coming z days at location L. A comparison of these two labeling methods will indicate whether the information from the surroundings of L is of any importance.

Thereafter, the models are created and trained on a training data set. The models tested in this study are logistic regression, in combination with binary relevance and classifier chains, and neural networks. The models are assessed on the test data, and the results of this assessment determine which model is superior for this data set.

The research questions for this thesis are:

• Which model is preferable when predicting future alarms based on past alarms?

• Does a model which accounts for the surroundings of location L improve performance?

1.2 Background

A few approaches have been taken to tackle similar problems. Crime forecasting models using crime data from a police database were built in the study [2]. The models were spatiotemporal, relying on the locations at which different crimes were committed. The data was collected from the Crime Anticipation System, which collects historic data on more than 200 parameters, for instance crime, demographic and socioeconomic parameters. The study, however, kept the number of parameters low and only selected relevant parameters based on previous studies. Data such as robbery, burglary and battery, along with environmental and proximity variables, were collected from 2011 to 2014 in bi-weekly periods. The city was divided into a grid of 200 m × 200 m cells. The 3% highest-risk areas were mapped again with a resolution of 125 m × 125 m, which resulted in 1575 cells in total. The data was assigned to the cells by geocoding. Some of the data only had information on street level; these values were assigned to a cell by a random process that took the relative area of the street in the cell into consideration. The data was then aggregated so that each cell only had one observation per bi-weekly period, which resulted in a total of 163 800 observations (grids × years × weeks in a year / 2).

The problem was also analyzed with a distinction between crimes committed during the day, defined as 07:00-19:00, and at night, 19:00-07:00. Some crimes only had a time window in which the crime had been committed; for these, the average time of the window was used in the models. Crimes with a time window longer than 24 hours were disregarded.

The problem was analyzed with three models: a logistic regression model, a neural networks model and an ensemble model, where the ensemble model is a combination of the logistic regression model and the neural networks model.

Before training the models the data was split into a training data set and a testing data set. The data from 2011 to 2013 was used as the training data and the data from 2014 was used as the testing data. However, the testing was done using a rolling window approach, i.e. the first period of 2014 was first predicted by using the data from 2011 to 2013, then the second period of 2014 was predicted by using the data from the second period of 2011 to the first period of 2014.

A neural network with one hidden layer and a binary logistic regression model were fitted to the data, and their results were averaged using ensemble averaging. The output from the models was the probability of a crime occurring, and the cutoff probability was set to 20%. The cutoff probability was set quite low in order to combat conservative estimates from the models. It was determined by the Receiver Operating Characteristic (ROC) method, which optimized the models' ability to predict crimes correctly while at the same time minimizing the false positives.


The models were evaluated by measuring recall (the proportion of crime that was correctly predicted, see Section 3.5), precision (the proportion of correct predictions out of the total predictions, see Section 3.5) and the prediction index (the ratio of direct hits to the proportion of the total area predicted as being high risk).

The results showed that for the first method the results for each model were very similar and that the ensemble model did not provide any benefits. Only the recall of the neural network was significantly higher than that of the other methods, being 6.62% higher. Predictions made by the models were generally 26.45% to 32.95% correct. The results also show that the prediction index is similar for battery and burglary but higher for street robbery.

The results showed that the models predicted significantly better with monthly data where the distinction between day and night was considered. The predictions made by the models were now generally 51.51% to 60.47% correct. All of the models performed similarly; however, the ensemble model reached a balance between the high recall of the neural networks model and the high precision of the logistic regression model.

The authors concluded that only minimal improvements from logistic regression to neural networks were seen; however, the neural networks model did have higher recall. Thus they stated that neural networks should be considered if recall is more important than precision, otherwise the logistic regression model is preferable as it is simpler. They also discuss that the ensemble method was able to attain a balance between the two models, however only in the case where a distinction between day and night was made. They finally conclude that the results from the study should be taken cautiously and that more research has to be done in the field in order to determine the stability of the predictions.

A similar study, [3], models the frequency with which certain types of crimes occur. The secondary goal of the study was to investigate how a model that encompasses all crime types differs from a specialized model that only considers one crime type. The data used in the study was acquired from the website data.police.uk, which is a website for open crime data in Wales, England and Northern Ireland. The first record of the data dates back to December 2010 and the data is updated monthly. The paper narrows the data down to only include Hampshire Constabulary for the period December 2010 to March 2014. The data contained information such as crime id, date, location, coordinates, crime type and LSOA code. The LSOA code refers to the Lower Layer Super Output Area, which describes which area the crime was committed in. There are in total 1454 unique LSOAs in Hampshire Constabulary. There are 16 different crime types; the five most frequent crimes in descending order are anti-social behaviour, burglary, criminal damage and arson, violent crime and other thefts. The data consisted of 609 418 records where each record has 12 attributes; however, 19% of the records had missing values.

The problem was modeled using instance-based learning, linear regression and decision trees. Instance-based learning, also called memory-based learning, trains the model on specific queries only when they need to be answered. That is, instead of creating a model, the method tries to classify new data by finding relevant data in the training set and matching the results, using a distance function. This requires the data to be stored in the memory of a computer. The decision trees model has a tree-like structure with nodes and leaves. For classification purposes the decision trees model outputs a class attribute value, whereas it outputs a numerical value for prediction purposes.

The models were evaluated on the test data, which is a part of the data that is withheld from the model when training it; thus the model is tested against data it has not seen. A 10-fold cross validation was performed to evaluate the models. This method splits the data into 10 parts, 9 of which are used as training data while the remaining part is used as testing data. This process is repeated 10 times, i.e. until all parts have been used as testing data, and the scores are averaged over all repetitions. The evaluation metrics used in the paper were the mean absolute error, the root mean-squared error and the correlation coefficient. The mean absolute error is a measurement of the average magnitude of the individual errors. The root mean-squared error looks at the average of the squared differences between the predicted values and the real values. The correlation coefficient measures the correlation between the real values and the predicted values. The range of the correlation coefficient is [−1, 1], where −1 represents perfect inverse correlation, 1 represents perfect correlation and 0 represents no correlation.

Three experiments were conducted to analyze the problem. The first experiment predicted the crime frequency by the LSOA codes. The second experiment predicted the crime frequency by postcodes, and the last experiment predicted the anti-social behaviour frequency. The first experiment was conducted by only taking five attributes into consideration, namely the date, LSOA codes, LSOA names, crime type and frequency. The three models were then trained on the data to predict the crime frequency by the LSOA codes. The second experiment is very similar to the first experiment, the only difference being the addition of the postal codes to the data. The third experiment only predicts anti-social behaviour crimes; this is done to see if a bespoke model will outperform the general model. Anti-social behaviour crimes were selected as they were the most frequent. The time for training and testing was also evaluated for all models.

The results from the experiments showed that the decision trees model performed the best, with the lowest mean absolute error, the lowest root mean-squared error and the highest correlation coefficient. The linear regression model was not tested in the first experiment as the training of the model was stopped after 900 hours (making the model impractical). Decision trees performed the best in the second experiment as well, and the two other methods performed similarly. However, there were no major improvements in the models from also adding the post codes. In the last experiment the linear regression and decision trees models performed similarly, with a correlation coefficient equal to 0.85, and the instance-based learning model performed the worst.

The results showed that the instance-based learning model was quick; however, it performed poorly and did not produce an explicit model. It scored the highest on the specialized model and the authors thus concluded that it might be best suited for specialized cases. The linear regression model was the slowest and also performed best on the specialized case. The authors discuss that this might be due to the fact that there is less linearity in the data when considering all crimes, thus making the model perform worse. They also state that the model might not be useful if it needs to be updated often, as it is very slow. The decision trees model performed best in all cases and was the second fastest model to train. It was also marginally worse for the global model compared to the specialized model.

1.3 Data Description

The data consists of events that have occurred. Each event consists of an alarm type, the time (measured to the second), the address, the postal code and the alarm code. The alarm codes give a short description of the cause of the alarm. The codes are divided into categories, as seen in Table 1.1. A made up example of what the data looks like can be seen in Table 1.2; a made up example is shown because the real data is classified and requires authorization to view.

Alarm Codes   Alarm Description
100-110       Crime Related
111-130       Customer Related
131-150       Contractor Related
151-160       Environment Related
161-180       Technology Related
181-199       Other Causes

Table 1.1: Alarm codes and their description.

Alarm Type      Alarm Added            Location             Postal Code   Alarm Code
Camera Alarm    2019-01-01 00:00:02    Lindstedtsvägen 25   Stockholm     176
Fire Alarm      2019-01-01 00:00:06    Stora Algatan 4      Lund          155
Burglar Alarm   2019-01-01 00:00:13    Hörselgången 4       Gothenburg    175

Table 1.2: A made up example of the data.

Before labeling the data it had to be processed. Several rows in which the location or alarm codes were missing had to be removed. Some rows had locations that did not exist, which also had to be removed.

The data was then labeled so that the model could predict technology related alarms based on all incoming alarms from each location. Only the technology related alarms were considered, as they are the most frequent. It might also be interesting to model the crime related alarms; however, these are very rare (13 alarms in a year with 670 000 alarms), which would make the model biased towards the few events that occurred.

The technology related alarms are alarms that sound when, for instance, a detection unit is faulty, there is a power outage, or there is a disturbance in the system. Crime related alarms sound if there has been a break-in, an assault, shoplifting, etc. The customer related alarms are caused by a customer or sound when, for example, a window has been left open. Contractor related alarms are caused for instance by a cleaning worker, a technician or a security guard. Environment related alarms sound when there is a fire, flooding, high temperatures, etc. Other causes for alarms are animals, false alarms (i.e. alarms without a cause), alarm tests, etc.


Chapter 2 Theory

2.1 Logistic Regression

One of the tools used in this study to predict future alarms is logistic regression, as described for example in [4]. Logistic regression is a mathematical model that uses a set of independent variables, x = (x_1, x_2, . . . , x_n), to predict a dichotomous dependent variable, y. A dichotomous variable is a nominal variable which only has two categories. For instance, 0 can represent that an alarm did NOT sound and 1 can represent that an alarm did sound.

The logistic model relies on the logistic function

f(z) = 1 / (1 + e^{−z}),   (2.1)

where z is a continuous variable. As seen in Figure 2.1, when z → ∞, f(z) → 1, and when z → −∞, f(z) → 0; thus the logistic function has the range (0, 1). Due to this fact the logistic function is a useful aid for predicting dichotomous variables, where the resulting value indicates the probability of the outcome.

In order to obtain the logistic model from the logistic function, z is rewritten as a linear function

z = α + β_1 x_1 + β_2 x_2 + · · · + β_k x_k.   (2.2)

Here α and the β_i's are unknown constants and the x_i's are the independent input variables. This z is then inserted into the logistic function (2.1). So, given the observations x_i, a prediction of the dependent variable is calculated. This prediction is a probability, which can be written as the conditional probability P(y = 1 | x_1, x_2, . . . , x_k) = f(α + β_1 x_1 + β_2 x_2 + · · · + β_k x_k).


Figure 2.1: The logistic function transforms the input value to the range (0,1)

The constants α and the β_i's are estimated using maximum likelihood.
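As a small illustration of Equations (2.1) and (2.2), the sketch below computes P(y = 1 | x) for one observation; the coefficient values and the observation are invented for the example and are not estimates from the alarm data.

```python
import numpy as np

def logistic(z):
    """Logistic function, Equation (2.1): maps any real z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Invented coefficients (alpha and the betas) -- not fitted values from the study.
alpha = -1.5
beta = np.array([0.8, -0.3, 1.2])

# One observation of the independent variables x = (x1, x2, x3).
x = np.array([1.0, 0.0, 1.0])

# z = alpha + beta_1 x_1 + ... + beta_k x_k, Equation (2.2).
z = alpha + beta @ x

# The prediction P(y = 1 | x1, ..., xk), e.g. the probability that an alarm sounds.
print(f"P(y = 1 | x) = {logistic(z):.3f}")
```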

Although logistic regression is highly useful, it does not extend directly to a multi-labeled problem. A multi-label classification is a supervised learning method where the output may be multiple labels simultaneously, as opposed to a single-labeled (binary, or multi-class) classification where the output can only take on one single class label. For the multi-label classification the problem has to go through a transformation, where the multi-labeled problem is transformed to one or several single-labeled problems. Thus, the predictions are made for each single-labeled problem and are then transformed back to a multi-labeled prediction. Examples of these methods are binary relevance and classifier chains, as given in [5].

2.2 Binary Relevance

The binary relevance model simply transforms the multi-labeled problem into N single-labeled problems, one for each label. Each single-labeled model, hi, only predicts one label at a time. The binary relevance model is one of the simplest models to use on a multi-labeled problem and does not model correlations between labels. For instance, for a given constellation of independent variables, if it is known that the fire alarm sounded then it might be the case that there is a higher probability for the heat sensor to have sounded. However, the binary relevance model will not be able to model this correlation. Although the binary relevance model is simple, it does have some redeeming qualities.


The model does not over-fit certain label combinations as it does not expect them to be correlated. If certain labels are considered useless, they can be removed without causing any disruption to the rest of the model. The computational complexity is also low, which is useful when dealing with large data sets. Finally, the binary relevance model is very intuitive, which makes it a good starting model [5]. Table 2.1 shows how each prediction is made with the binary relevance model.

h:    x                              ŷ
h1    [1, 0, 0, 1, 0, 1, 1, 1, 0]    1
h2    [1, 0, 0, 1, 0, 1, 1, 1, 0]    0
h3    [1, 0, 0, 1, 0, 1, 1, 1, 0]    0
h4    [1, 0, 0, 1, 0, 1, 1, 1, 0]    1
h5    [1, 0, 0, 1, 0, 1, 1, 1, 0]    1

Table 2.1: The binary relevance transformation. Here x = [1, 0, 0, 1, 0, 1, 1, 1, 0] and ŷ = [1, 0, 0, 1, 1]. Every single-labeled classifier, hi, is trained to predict ŷi ∈ {0, 1}.
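A minimal sketch of the binary relevance idea behind Table 2.1: one independent single-label classifier is fitted per label. The tiny training matrices below are invented just to make the example run; the thesis itself uses the skmultilearn BinaryRelevance wrapper described in Section 3.2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy multi-labeled training data: 4 instances, 9 inputs, 5 labels.
X_train = np.array([[1, 0, 0, 1, 0, 1, 1, 1, 0],
                    [0, 1, 1, 0, 0, 0, 1, 0, 1],
                    [1, 1, 0, 0, 1, 1, 0, 1, 0],
                    [0, 0, 1, 1, 1, 0, 0, 0, 1]])
Y_train = np.array([[1, 0, 0, 1, 1],
                    [0, 1, 1, 0, 0],
                    [1, 0, 1, 1, 0],
                    [0, 1, 0, 0, 1]])

# Binary relevance: one classifier h_i per label, each trained independently.
classifiers = [LogisticRegression().fit(X_train, Y_train[:, i])
               for i in range(Y_train.shape[1])]

# Predict all five labels for the input x from Table 2.1.
x = np.array([[1, 0, 0, 1, 0, 1, 1, 1, 0]])
y_hat = [int(h_i.predict(x)[0]) for h_i in classifiers]
print(y_hat)
```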

2.3 Classifier Chains

The classifier chains method, in contrast to the binary relevance model, does model the correlation between labels. The model does this by first transforming the multi-labeled problem into N single-labeled problems, hi, one for each label. A chain h = (h1, h2, . . . , hN) is constructed, where each hj learns and predicts the jth label. The classification process begins with h1, which performs a binary classification. Thereafter, h2 is used; however, now the classification made by h1 is used as an input in h2, see [5]. The process then continues and hj predicts the jth label given the output predictions from the earlier links in the chain, as seen in Table 2.2.


h:    x                                          ŷ
h1    [1, 0, 0, 1, 0, 1, 1, 1, 0]                1
h2    [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]             0
h3    [1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0]          0
h4    [1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]       1
h5    [1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1]    1

Table 2.2: The classifier chains transformation. Here x = [1, 0, 0, 1, 0, 1, 1, 1, 0] and ŷ = [1, 0, 0, 1, 1]. Every single-labeled classifier, hi, is trained to predict ŷi ∈ {0, 1}.
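A sketch of the chaining step in Table 2.2, using the same invented toy data: each link hj is trained on the original inputs augmented with the labels of the earlier links, and at prediction time it is fed the predictions of the earlier links.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1, 0, 0, 1, 0, 1, 1, 1, 0],
                    [0, 1, 1, 0, 0, 0, 1, 0, 1],
                    [1, 1, 0, 0, 1, 1, 0, 1, 0],
                    [0, 0, 1, 1, 1, 0, 0, 0, 1]])
Y_train = np.array([[1, 0, 0, 1, 1],
                    [0, 1, 1, 0, 0],
                    [1, 0, 1, 1, 0],
                    [0, 1, 0, 0, 1]])

# Train the chain: h_j sees the inputs plus the true labels y_1, ..., y_{j-1}.
chain = [LogisticRegression().fit(np.hstack([X_train, Y_train[:, :j]]), Y_train[:, j])
         for j in range(Y_train.shape[1])]

# Predict: each h_j is fed the predictions made by the earlier links in the chain.
x = np.array([[1, 0, 0, 1, 0, 1, 1, 1, 0]])
preds = []
x_aug = x
for h_j in chain:
    preds.append(int(h_j.predict(x_aug)[0]))
    x_aug = np.hstack([x_aug, [[preds[-1]]]])  # append the new prediction as input
print(preds)
```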

2.4 Neural Networks

The second method used to classify the data is neural networks, as described for example in [6]. A neural network consists of layers. There are three types of layers: one input layer, one output layer and one or more hidden layers. The hidden layers are simply defined as "not the output or input layers". Each layer consists of one or several neurons which, for a given input x = (x_1, x_2, . . . , x_n), compute a simple calculation defined by the activation function. An example of a neuron with three input variables and one output can be seen in Figure 2.2. There are several types of activation functions; in this paper we will only discuss the linear perceptron, the sigmoid and the rectified linear unit (ReLU) activation functions. The aim of the neural network model is to recognize patterns in the data. The network is first trained with training data to do just this. The model is then used to make predictions on the test data.

Figure 2.2: A neuron with three input variables and one output.


2.4.1 Linear Perceptron

The linear perceptron activation function is the most basic activation function.

Given an input vector, x, the linear perceptron calculates x · w^T + b, where w is the weight vector and b is the bias. The function then checks whether this value is larger than 0 or not and gives a single output:

output = 1 if x · w^T + b > 0, and output = 0 if x · w^T + b ≤ 0.

This, however, is not useful when trying to optimize the weights, as the function is not smooth (the function is not differentiable).

2.4.2 Sigmoid

The sigmoid activation function, also known as the logistic activation function, uses the logistic function given in Equation (2.1) as its activation function, with z = x · w^T + b. This gives an output which is continuous in the (0, 1) range. This makes it easier to optimize the weights, as small changes in the weights give small changes in the output (the function is smooth). The graphical representation of the sigmoid activation function, for a one-dimensional input space, is seen in Figure 2.1.

2.4.3 Rectified Linear Unit (ReLU)

The ReLU activation function performs better than the sigmoid activation function with respect to computational cost and statistical performance, [7]. This is due to the fact that the ReLU disables negative activation (the negative values are set to zero, as seen in Equation (2.3)), which makes it sparse and improves its generalization performance, [8]. The output of the ReLU function is given by

output = max(0, x · w^T + b).   (2.3)

The graphical representation of the ReLU function, for a one-dimensional input space, can be seen in Figure 2.3.
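A small numeric sketch of the three activation functions for a single neuron; the input, weight vector and bias are invented for the illustration.

```python
import numpy as np

def perceptron(x, w, b):
    """Linear perceptron: outputs 1 if x.w^T + b > 0, otherwise 0 (not differentiable)."""
    return 1.0 if x @ w + b > 0 else 0.0

def sigmoid(x, w, b):
    """Sigmoid activation: the logistic function applied to x.w^T + b, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def relu(x, w, b):
    """ReLU activation, Equation (2.3): negative activations are set to zero."""
    return max(0.0, x @ w + b)

x = np.array([0.5, -1.0, 2.0])   # three input variables, as in Figure 2.2
w = np.array([0.4, 0.1, -0.3])   # invented weights
b = 0.2                          # invented bias

print(perceptron(x, w, b), sigmoid(x, w, b), relu(x, w, b))
```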


Figure 2.3: ReLU activation function with one dimensional input space, b = 0, w = 1

2.4.4 Neural Network Structure

The neurons are assembled in layers which are then connected to the next layer.

Consider the simple network in Figure 2.4.

Figure 2.4: A simple neural network with three layers, and five neurons.

The network is a three-layered network with only one hidden layer. The left layer is the input layer with two neurons, the middle layer is the hidden layer, also with two neurons, and the right layer is the output layer with only one neuron. It might seem as if each neuron has multiple outputs; however, that is not the case. Each neuron only has one output, but the same output is used as an input to each neuron in the next layer. Consider now the more complex network seen in Figure 2.5.

This is an (n + 2)-layered network with n hidden layers. The hidden layers can have varying numbers of neurons. The number of neurons in the input layer is determined by the input data. For instance, if the input data consists of 100 alarm codes then there will be 100 neurons in the input layer. The number of neurons in the output layer is determined by what is being predicted.


Figure 2.5: A general neural network structure with n hidden layers.

If the aim of the network is to predict the technology related alarms (alarms 161-180, see Table 1.1; in total there are 20 alarm types), then there will be 20 neurons in the output layer. It is not as trivial to determine the number of hidden layers and the number of neurons in each layer. There are several heuristics that one can follow in order to determine which set-up works best for one's problem, [6]. The hidden layers can also be determined by manually experimenting with different layouts.

2.4.5 Gradient Descent

In order to train the model, the weights and biases have to be adjusted to improve performance; one basic method of doing so is called gradient descent.

Before discussing gradient descent a few preliminaries have to be defined.

The first step of training a network is to feed in training data. The training data is a portion of the entire data set. This is done so that the model can be tested with a completely new data (test data) set after it has been trained.

The training data is labeled with input x_train = (x_1, x_2, . . . , x_n) and output y_train = (y_1, y_2, . . . , y_k). To calculate all the weights and variables we first have to define a cost function:

C(w, b) ≡ (1 / 2n) Σ_{i=1}^{n} ‖f(x_i) − y_i‖²,   (2.4)

where f(x_i) is the output predicted by the network for observation i, n is the total number of training inputs, w the weights and b the biases.


With the cost function, as in (2.4), it is easy to define how well the network is performing. The better the network performs, the smaller the cost function will be; hence, the goal is to find weights and biases that minimize the cost function.

One might ask why there is a need to define a cost function rather than simply trying to maximize the number of codes correctly classified by the network. The problem with this approach is that the number of correctly classified codes is not a smooth function of the weights and biases. That is, small changes in the weights and biases will most likely not affect the number of correctly classified codes, which makes it difficult to correctly modify the weights. A quadratic cost function, however, is a smooth function of the weights and biases, and when the cost function is smooth, it is possible to apply gradient descent to it.

The main idea of gradient descent is to find the gradient of C, which is defined as

∇C ≡ (∂C/∂w_1, ∂C/∂w_2, . . . , ∂C/∂w_n, ∂C/∂b_1, ∂C/∂b_2, . . . , ∂C/∂b_n)^T.   (2.5)

The change in the cost function, ∆C, is determined by the changes in the weights, ∆w, and the biases, ∆b, and is given by

∆C ≈ (∂C/∂w_1)∆w_1 + · · · + (∂C/∂w_n)∆w_n + (∂C/∂b_1)∆b_1 + · · · + (∂C/∂b_n)∆b_n.   (2.6)

The aim is to select (∆w_1, . . . , ∆w_n, ∆b_1, . . . , ∆b_n) = ∆v such that ∆C is negative. ∆C has to be negative in order to minimize the cost function. By inserting Equation (2.5) into Equation (2.6), we get:

∆C = ∆v · ∇C. (2.7)

Suppose now that ∆v is defined as:

∆v = −η∇CT, (2.8)

where η is a small, positive parameter, also known as the learning rate. By defining ∆v as in Equation (2.8), we can guarantee that ∆C will be negative (∆C = −η∇C^T · ∇C = −η‖∇C‖²). Thus, the cost function C will decrease. Now, by using Equation (2.8) it is possible to update the weights and biases, v, by

v = v + ∆v = v − η∇C^T.   (2.9)

This process is then repeated until, hopefully, we reach the global minimum of the cost function. The parameter η is defined as

η = ε / ‖∇C‖,   ε = ‖∆v‖.   (2.10)

Essentially, what is happening here is that we are finding the gradient and taking small steps in the direction of the negative gradient. This can be seen more intuitively as moving down a hill: we find the slope and then take small steps down the hill. The process continues until we are at the bottom of the hill, i.e. in a global or local minimum of the cost function.
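A minimal numeric sketch of the update rule (2.9) applied to the quadratic cost (2.4) for a model with a single weight and bias; the data and learning rate are invented for the example.

```python
import numpy as np

# Invented one-dimensional training data: y is roughly 2*x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

w, b = 0.0, 0.0   # initial parameters v
eta = 0.05        # learning rate

for _ in range(500):
    f = w * x + b                   # network output f(x_i)
    # Gradient of C(w, b) = 1/(2n) * sum ||f(x_i) - y_i||^2, Equation (2.4).
    grad_w = np.mean((f - y) * x)
    grad_b = np.mean(f - y)
    # Update v = v - eta * grad C, Equation (2.9).
    w -= eta * grad_w
    b -= eta * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")  # approaches the generating values 2 and 1
```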

Instead of using the quadratic cost function given in Equation (2.4), we will use the categorical cross-entropy cost function, given as

C_cce(w, b) ≡ − Σ_{i=1}^{n} y_i · log(f(x_i)),   (2.11)

which, when applied after a sigmoid activation, is the best method for classification, as stated in [9].

2.4.6 Adaptive Moment Estimation (Adam) Optimizer

Although gradient descent is a viable option and has been a key method for previous machine learning models, it has now been surpassed by several different methods. One of these methods is called the Adam optimizer. Adam is an efficient optimizer that, like gradient descent, only requires a first order gradient. As the name suggests, the Adam optimizer works by calculating the first and second order moments of the gradient, which makes it less prone to converging to a local minimum. The convergence time for the Adam algorithm is also generally lower than for gradient descent, [10]. The pseudo code for the Adam optimizer is given in Algorithm 1.


Algorithm 1 Adam Algorithm

Good general parameter values for machine learning problems are: α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10^−8. All vector operations are performed element-wise.

Require: α - stepsize
Require: β1, β2 ∈ [0, 1) - exponential decay rates for the moment estimates
Require: C(v) - cost function with input parameters v
Require: v_0 - initial parameters

m_0 = 0  (first moment vector)
θ_0 = 0  (second moment vector)
t = 0    (timestep)
while v_t has not converged do
    t = t + 1                                      (update timestep)
    g_t = ∇C_t(v_{t−1})                            (get the gradients)
    m_t = β1 · m_{t−1} + (1 − β1) · g_t            (update first moment)
    θ_t = β2 · θ_{t−1} + (1 − β2) · g_t ◦ g_t      (update second moment; ◦ is the Hadamard product)
    m̂_t = m_t / (1 − β1^t)                         (bias-corrected first moment)
    θ̂_t = θ_t / (1 − β2^t)                         (bias-corrected second moment)
    v_t = v_{t−1} − α · m̂_t / (√θ̂_t + ε)           (update parameters)
end
Return: v_t  (final parameters)
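A sketch of the Adam update from Algorithm 1 in NumPy, using the suggested default hyperparameters; the gradient is the same invented quadratic-cost gradient as in the gradient descent example above.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

def gradient(v):
    """Gradient of the quadratic cost with respect to the parameters v = [w, b]."""
    f = v[0] * x + v[1]
    return np.array([np.mean((f - y) * x), np.mean(f - y)])

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
v = np.zeros(2)            # initial parameters v_0
m = np.zeros_like(v)       # first moment vector
theta = np.zeros_like(v)   # second moment vector

for t in range(1, 10001):
    g = gradient(v)
    m = beta1 * m + (1 - beta1) * g                 # update first moment
    theta = beta2 * theta + (1 - beta2) * g * g     # update second moment
    m_hat = m / (1 - beta1 ** t)                    # bias-corrected first moment
    theta_hat = theta / (1 - beta2 ** t)            # bias-corrected second moment
    v = v - alpha * m_hat / (np.sqrt(theta_hat) + eps)

print(v)  # moves toward the generating values [2, 1]
```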


2.4.7 Backpropagation

Backpropagation is the algorithm that calculates the gradient ∇C of the cost function. As said before, it is essential that the cost function is a smooth function of the weights and biases in order to determine the gradient; thus, a small change in the weights and biases will lead to a small change in the cost function. Suppose that we make a small change, ∆w^l_{jk}, to a weight. Here, w^l_{jk} is the weight from the kth neuron in the (l − 1)th layer to the jth neuron in the lth layer. For example, the weight w^3_{4,2} would be the weight from the 2nd neuron in the 2nd (3 − 1) layer to the 4th neuron in the 3rd layer.

This change in the weight will change the output from the neuron, which will change the output of the next neuron, which in turn will change the output of the following neurons, and so on. This will eventually lead to a change in the cost function. The change in the cost function can be approximated by

∆C ≈ (∂C/∂w^l_{jk}) ∆w^l_{jk}.   (2.12)

In order to compute ∂C/∂w^l_{jk}, the propagation of the change in w^l_{jk} has to be tracked. The change ∆w^l_{jk} causes the change ∆a^l_j, which is the change in the activation of the jth neuron in the lth layer, where a^l_j is simply the activation function of that neuron. The change ∆a^l_j is given as

∆a^l_j ≈ (∂a^l_j/∂w^l_{jk}) ∆w^l_{jk}.   (2.13)

This change will in turn cause a change in all activation functions in the next layer (layer l + 1). To simplify the expression we will only look at one neuron in the next layer, a^{l+1}_q. The change in neuron a^{l+1}_q will be

∆a^{l+1}_q ≈ (∂a^{l+1}_q/∂a^l_j) ∆a^l_j.   (2.14)

By inserting Equation (2.13) into Equation (2.14) we obtain:

∆a^{l+1}_q ≈ (∂a^{l+1}_q/∂a^l_j)(∂a^l_j/∂w^l_{jk}) ∆w^l_{jk}.   (2.15)

This process continues until the last layer, which determines the cost function. Suppose that the path goes through the activations a^l_j, a^{l+1}_q, . . . , a^{L−2}_p, a^{L−1}_n, a^L_m, where L is the last layer; then the expression turns out to be


∆C ≈ (∂C/∂a^L_m)(∂a^L_m/∂a^{L−1}_n)(∂a^{L−1}_n/∂a^{L−2}_p) · · · (∂a^{l+1}_q/∂a^l_j)(∂a^l_j/∂w^l_{jk}) ∆w^l_{jk}.   (2.16)

This, however, is only the change along this particular path, and there are generally many more paths one could take from the initial change in ∆w^l_{jk}. The change over all paths is simply the sum over all paths, i.e.,

∆C ≈ Σ_{mnp...q} (∂C/∂a^L_m)(∂a^L_m/∂a^{L−1}_n)(∂a^{L−1}_n/∂a^{L−2}_p) · · · (∂a^{l+1}_q/∂a^l_j)(∂a^l_j/∂w^l_{jk}) ∆w^l_{jk},   (2.17)

where m, n, p, . . . , q are the paths which can be taken from the weight w^l_{jk} to the cost function. Now inserting Equation (2.12) into Equation (2.17) gives

∂C/∂w^l_{jk} ≈ Σ_{mnp...q} (∂C/∂a^L_m)(∂a^L_m/∂a^{L−1}_n)(∂a^{L−1}_n/∂a^{L−2}_p) · · · (∂a^{l+1}_q/∂a^l_j)(∂a^l_j/∂w^l_{jk}).   (2.18)

An illustration of what ∂a^l_j/∂w^l_{jk}, etc. represents, and the backpropagation algorithm for one path, can be seen in Figure 2.6.

Figure 2.6: One path of the backpropagation algorithm, Figure source: [6].

The right-hand side of Equation 2.18 is easily computable with a little bit of calculus as the a’s in Equation 2.18 are the activation functions.


2.5 Resampling

Due to the nature of multi-labeled classification, the data is often imbalanced. This means that the labels in the data set are not represented equally, [11]. Two naive approaches to fixing this issue are random over-sampling and random sub-sampling. The random over-sampling approach replicates the data where the less frequent labels occur, and the random sub-sampling approach takes a sub-sample (a small portion of the data) from the data where the more frequent labels occur in order to balance the data. These methods are flawed, as the random over-sampling method is prone to overfitting to the replicated labels and the sub-sampling method loses useful information. In order to combat these shortcomings the SMOTE (Synthetic Minority Over-sampling Technique) is used.

2.5.1 SMOTE (Synthetic Minority Over-sampling Technique)

The SMOTE method uses an over-sampling technique which is less prone to overfitting, as the new samples are not exact replications of the original samples (seed samples). The new samples are created by interpolating the seed samples, [12]. That is, we find the k nearest neighbours of each minority sample, select one of them at random and create a new sample by interpolation. The pseudo code for SMOTE is given in Algorithm 2.


Algorithm 2 SMOTE

Require: S - seed samples of the minority label, y_i, i = 1, 2, . . . , m
Require: MaxS - number of samples in the majority label
Require: MinS - number of samples in the minority label
r = (MaxS − MinS) / MinS  (imbalance factor)
k - number of nearest neighbours
for i = 1, 2, . . . , m do
    Compute the distances ‖y_i − y_j‖_2 for all j ≠ i
    Find the k nearest neighbours of y_i
    n = round(r)  (number of new samples to be created)
    for z = 1, 2, . . . , n do
        Draw a random integer ℓ ∈ [1, k] and the associated neighbour y_ℓ
        Draw a random vector with uniform distribution, λ ∼ U(0, 1)
        New sample s_{zi} = y_i + λ ◦ (y_ℓ − y_i)  (◦ is the Hadamard product)
    end
end
Return: n × m matrix of new samples

MaxS in Algorithm 2 represents the number of samples a label should have after resampling. MaxS can also be set to any majority label and does not have to be set to the largest majority label. As the problem is multi-labeled and we are using an interpolation method, the new samples will also contain majority labels. This is caused by the fact that minority labels appear together with majority labels in some instances, and thus the labels can never be completely balanced when applying the SMOTE algorithm.
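As an aside, the SMOTE implementation in the imbalanced-learn package can be used for the single-label case; the sketch below is an assumption about tooling and uses invented data, while the thesis's own resampling set-up (ReS11, ReS21, etc.) is described in Section 3.4.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# Invented imbalanced binary-label data: 90 majority and 10 minority samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 5)),
               rng.normal(2.0, 1.0, size=(10, 5))])
y = np.array([0] * 90 + [1] * 10)

# k_neighbors is the k of Algorithm 2: new minority samples are interpolations
# between a seed sample and one of its k nearest minority neighbours.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

print(Counter(y), Counter(y_res))   # the minority label is over-sampled to 90
```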


Chapter 3

Methodology

The data was acquired from STANLEY Security's alarm central. The original data was split into two parts: one part contained data from 2018 and the other contained data from 2019. The 2018 data had 952 278 instances before processing, where an instance is one row of an incoming alarm with columns as seen in Table 1.2. The 2019 data had 1 001 630 instances before processing. The processing began by deleting rows which had missing information, such as missing locations, postal codes or alarm codes. Many rows had wrong or unusable information in the location column; these were deleted as well. After processing, the 2018 data had 634 983 instances and the 2019 data had 691 050 instances.

3.1 Labeling the Data

As mentioned in Section 1.1, the data had to be labeled before inserting it into the models, as the raw data cannot be applied to the models directly. The data was labeled by going through the data row by row and observing the codes that appeared. First the empty vectors y = [0, 0, . . . , 0, 0] and x = [0, 0, . . . , 0, 0] were created. The vector y represents the technology related alarms that will sound in the coming 7 days and x represents all alarms that have sounded in the past 7 days. These vectors are filled with two different methods: the first method looks at each location separately, while the second method takes each location's surroundings into consideration. The two methods are described in more detail below.


3.1.1 Method 1

As mentioned previously, this method only considers the current location, loc_r, when labeling the data. The labeling process starts at row r = 0 of the data and works through all rows. Imagine that a pointer, p, is placed on row r. The location in row r is noted as loc_r and the time as time_r. The pointer is then moved to row r + 1 and, if loc_r = loc_{r+1} is true, the change x(alarm_{r+1} − 100) = 1 is made (all alarms are used as input for the predictions) and the pointer is moved to row r + 2. For instance, if an alarm with code 128 sounded at loc_r, then the 28th (128 − 100) value in the vector x is changed to 1. If, however, loc_r = loc_{r+1} is false, the pointer is moved to row r + 2 without making any changes to x. This process continues until time_{r+k} exceeds time_r + 7 days, which means that the process has gone through 7 days' worth of rows (the past 7 days). The vector x is placed as row r in the matrix X, which has the dimensions instances × 100, where one instance is one row in the data.

The process then continues and the pointer is moved to row r + k. The location, loc_{r+k}, and time, time_{r+k}, are noted. The pointer is then moved to row r + k + 1 and if

loc_{r+k} = loc_{r+k+1} and 161 ≤ alarm_{r+k+1} ≤ 180 (technology related alarm)   (3.1)

is satisfied, where alarm_{r+k+1} is the alarm code on row r + k + 1, then the change y(alarm_{r+k+1} − 161) = 1 is made and the pointer is moved to row r + k + 2. For instance, if there is an alarm with code 176 at the same location the pointer was initially placed at, then the 15th (176 − 161) value in the vector y is changed to 1. However, if Equation (3.1) is not satisfied, the pointer still moves to row r + k + 2. This process continues until row r + k + n, where time_{r+k+n} exceeds time_{r+k} + 7 days, which means that the process has gone through the coming 7 days' worth of rows. The vector y is placed as row r in the matrix Y, which has the dimensions instances × 20.

The vectors y and x are then reset as zero vectors, the pointer is moved to row r + 1 and the process starts again. This outer loop continues until fewer than 14 days' worth of rows remain after row r. A pseudo code for this process is given in Algorithm 3.


Algorithm 3 Labeling the data, method 1

Require: input data set D
r = 0
time_r = time at row r
loc_r = location (address) at row r
y, x = zero vectors of length 20 and 100
Y, X = zero matrices of size instances × 20 and instances × 100, where instances is the number of rows in D
while time_r + 14 days < time_last do
    k = r
    while time_k < time_r + 7 days do
        if loc_r == loc_k then
            code_k = alarm code at location loc_k and row k
            x[code_k − 100] = 1
        end
        k += 1
    end
    Insert vector x as row r of the matrix X
    while time_k < time_r + 14 days do
        code_k = alarm code at row k
        if loc_r == loc_k and 160 < code_k < 181 then
            y[code_k − 161] = 1
        end
        k += 1
    end
    Insert vector y as row r of the matrix Y
    r += 1
    Reset x and y as zero vectors
end

This process was done twice, once for the 2018 data and once for the 2019 data. After the labeling was done, the data sets were combined and randomly shuffled in order to create a more robust model. Thereafter, the data was split so that 80% of the data was used as training data, Xtrain and Ytrain, and 20% of the data was used as testing data, Xtest and Ytest.
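A sketch of the shuffle-and-split step with scikit-learn, assuming the labeled matrices X and Y have already been built by Algorithm 3; the placeholder matrices here only stand in for the real labeled data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder matrices standing in for X (instances x 100) and Y (instances x 20).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 100))
Y = rng.integers(0, 2, size=(1000, 20))

# Shuffle and split: 80 % training data and 20 % test data.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, Y, test_size=0.2, shuffle=True, random_state=0)

print(Xtrain.shape, Xtest.shape)   # (800, 100) (200, 100)
```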


3.1.2 Method 2

Before labeling the data using the second method, the locations (addresses) in the data had to be transformed to coordinates. The locations are transformed to coordinates so that it becomes easier to define distances between different locations. The locations are transformed to coordinates with the package geopy.geocoders in Python. The function Nominatim from the package geopy.geocoders is used: the locations are inserted into the function Nominatim, and the coordinates for these locations are output by the function. Some locations had spelling errors, incompatible locations and general human errors; these locations could not be transformed to coordinates by the function Nominatim and were therefore discarded. After geocoding the locations, the 2018 data had 465 910 instances and the 2019 data had 504 952 instances. The data could now be labeled using the second method.
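A sketch of the geocoding step with geopy's Nominatim geocoder; the user_agent string and the addresses are placeholders, and geopy.distance.geodesic is shown as one possible way to compute the distance d used by the second method (the thesis does not state which distance function was used).

```python
from geopy.geocoders import Nominatim
from geopy.distance import geodesic

# user_agent is a caller-chosen identifier required by the Nominatim service.
geolocator = Nominatim(user_agent="alarm-risk-example")

# geocode() returns None for addresses it cannot resolve, which corresponds
# to the rows that were discarded before labeling.
loc_a = geolocator.geocode("Lindstedtsvägen 25, Stockholm")
loc_b = geolocator.geocode("Valhallavägen 79, Stockholm")

if loc_a is not None and loc_b is not None:
    a = (loc_a.latitude, loc_a.longitude)
    b = (loc_b.latitude, loc_b.longitude)
    d = geodesic(a, b).km   # distance d between the two locations in kilometers
    print(f"d = {d:.2f} km")
```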

The second method is very similar to the first method; the only difference comes when looking at the past 7 days and filling the vector x. The process starts as before with the pointer being placed at row r, and the location, loc_r, and time, time_r, are noted. The pointer is then moved to row r + 1 and, if loc_r = loc_{r+1} is true, the same change as in method 1 is made. However, if loc_r = loc_{r+1} is not true, then the distance, d, between loc_r and loc_{r+1} is checked. If d < 1 km the change x(alarm_{r+1}) = 1 is made; if instead 1 km < d < 2 km the change x(alarm_{r+1} + 100) = 1 is made. Finally, if 2 km < d < 5 km, the change x(alarm_{r+1} + 200) = 1 is made. By doing so we record which alarms have sounded within 1 km, 2 km and 5 km of loc_r, which provides the models with different information. The remaining process of method 2 is very similar to method 1, and the pseudo code for method 2 is given in Algorithm 4.


Algorithm 4 Labeling the data, method 2

Require: input data set D
r = 0
time_r = time at row r
loc_r = location (address) at row r
y, x = zero vectors of length 20 and 400
Y, X = zero matrices of size instances × 20 and instances × 400, where instances is the number of rows in D
while time_r + 14 days < time_last do
    k = r
    while time_k < time_r + 7 days do
        code_k = alarm code at row k
        if loc_r == loc_k then
            x[code_k − 100] = 1
        else
            d = distance between loc_r and loc_k in km
            if d < 1 then
                x[code_k] = 1
            else if d < 2 then
                x[code_k + 100] = 1
            else if d < 5 then
                x[code_k + 200] = 1
            else
                do nothing
            end
        end
        k += 1
    end
    Insert vector x as row r of the matrix X
    while time_k < time_r + 14 days do
        code_k = alarm code at row k
        if loc_r == loc_k and 160 < code_k < 181 then
            y[code_k − 161] = 1
        end
        k += 1
    end
    Insert vector y as row r of the matrix Y
    r += 1
    Reset x and y as zero vectors
end


As with method 1, this was done twice, once for the 2018 data and once for the 2019 data. The data was then shuffled and split as described for method 1.

3.2 Logistic Regression

Several libraries and packages were imported to build the logistic regression model in Python. The libraries were:

• numpy - To manipulate vectors/matrices with ease

• pandas - To import the data in the correct format

• from sklearn.linear_model import LogisticRegression

• from skmultilearn.problem_transform import ClassifierChain

• from skmultilearn.problem_transform import BinaryRelevance

• from sklearn.metrics import classification_report - To evaluate the models

The classifier chains model was first created by using the package ClassifierChain with the input value LogisticRegression. The order of the classifier chains model is that of the output variable y, i.e. the alarm code 161 is predicted first and the result for alarm code 161 is used to predict alarm code 162, etc. The model was then fitted to the training data using the command fit on the model with input variables Xtrain and Ytrain. The model was then used to predict Ypred with input Xtest. To evaluate the model, the predicted Ypred and the real Ytest were inserted into the classification_report function. The list below shows the code in a more structured way.

1. ClassifierChainModel = ClassifierChain(LogisticRegression())
2. ClassifierChainModel.fit(Xtrain, Ytrain)
3. Ypred = ClassifierChainModel.predict(Xtest)
4. print(classification_report(Ytest, Ypred))

The binary relevance model is created in a similar way; the only difference is that the package BinaryRelevance is used instead of ClassifierChain. The list below shows the binary relevance model in a more structured way.


1. BinaryRelevanceModel = BinaryRelevance(LogisticRegression())
2. BinaryRelevanceModel.fit(Xtrain, Ytrain)
3. Ypred = BinaryRelevanceModel.predict(Xtest)
4. print(classification_report(Ytest, Ypred))

3.3 Neural Networks

The neural network model is created by first importing a few packages seen in the list below.

• numpy

• pandas

• from keras.models import Sequential

• from keras.layers import Dense

• from sklearn.metrics import classification_report

The package Sequential simply creates a sequential neural network. A sequential neural network is one where the layers are stacked in a linear fashion, meaning that layer 1 is connected to layer 2, layer 2 is connected to layer 3, etc. The Dense package defines the layer type in the network; a dense layer is one where the neurons are fully connected. A neuron is fully connected if it is connected to every neuron in the adjacent layers, [13].
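A minimal sketch of a sequential, fully connected network for this problem, assuming the Method 1 input dimension of 100 alarm codes and the 20 technology-related output labels; the hidden-layer size is an arbitrary illustrative choice, not the architecture reported in Section 3.3.2, and the optimizer and loss follow Sections 2.4.5-2.4.6.

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# Hidden layer (illustrative size); input_dim = 100 alarm codes for Method 1.
model.add(Dense(64, activation='relu', input_dim=100))
# Output layer: one neuron per technology related alarm (codes 161-180).
model.add(Dense(20, activation='sigmoid'))

# Adam optimizer (Section 2.4.6) with the cross-entropy cost (Section 2.4.5).
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
```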

3.3.1 Epochs and Validation Data

The neural network is trained in epochs. Each epoch represents when the en- tire data has been used to train the model once, [14]. The entire dataset can not be passed through the model at once and has to be divided in batches. After every batch of data has been passed through the model, the weights and bi- ases are adjusted. After each epoch the model is validated against a validation data set. The validation data set is received by splitting the original training data into a training data set and a validation data set. This split is necessary when training neural networks due to the fact that neural network models are
