

Thesis no: MSCS-2016-06

Approaches for estimating the uniqueness of linked residential burglaries

Chakravarthy Gajvelly

Faculty of Computing

Blekinge Institute of Technology SE–371 79 Karlskrona, Sweden


This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

The thesis is equivalent to 20 weeks of full time studies.

Contact Information:

Author(s):

Chakravarthy Gajvelly

E-mail: chga14@student.bth.se

University advisor:

Dr. Martin Boldt

Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57


Abstract

Context. According to the Swedish National Council for Crime Prevention, residential burglary crimes increased by 2% in 2014 compared to 2013, and by 19% over the past decade. Law enforcement agencies could only solve three to five percent of the crimes reported in 2012. Multiple studies in the field of crime analysis report that most residential burglaries are committed by a relatively small number of offenders. Thus, the law enforcement agencies need to investigate the possibility of linking crimes into crime series.

Objectives. This study presents the computation of a median crime, the centre-most crime in a crime series, calculated using the statistical concept of the median. This approach is used to calculate the uniqueness of a crime series consisting of linked residential burglaries. The burglaries are characterised using temporal and spatial features and modus operandi.

Methods. A quasi-experiment with repeated measures is chosen as the research method.

The burglaries are linked based on their characteristics (features) by building a statistical model using the logistic regression algorithm to formulate estimated crime series. The study uses the median crime as an approach for computing the uniqueness of linked burglaries. The measure of uniqueness is compared between the estimated series and legally verified known series. In addition, the study compares the uniqueness of the estimated and known series to randomly selected crimes. The measure of uniqueness is used to assess the feasibility of using the formulated estimated series for investigation by the law bodies.

Results. The statistical model built for linking crimes achieved an AUC = 0.964, R2 = 0.770 and Dxy = 0.900 during internal evaluation, an AUC = 0.916 for predictions on the test data set, and an AUC = 0.85 for predictions on the known series data set. The uniqueness measure of the estimated series ranges from 0.526 to 0.715, and that of the known series from 0.359 to 0.442, across the different series. The uniqueness of randomly selected crimes ranges from 0.522 to 0.726 for the estimated series and from 0.636 to 0.743 for the known series. The values obtained are analysed and evaluated using the independent two-sample t-test, Cohen's d and the Kolmogorov-Smirnov test. From this analysis, it is evident that the uniqueness measure for the estimated series is high compared to the known series and closely matches that of randomly selected crimes. The uniqueness of the known series is clearly low compared to both the estimated series and randomly selected crimes.

Conclusions. The present study concludes that the estimated series formulated using the statistical model have high uniqueness measures and need to be further filtered before being used by the law bodies.

Keywords: Residential burglaries, Median crime, Uniqueness, Logistic Regression, Machine Learning, Data Mining.


Acknowledgments

I hereby take the opportunity to express my deepest appreciation to my supervisor Dr. Martin Boldt, who continually and convincingly provided valuable inputs and suggestions during the entire period of my master thesis. Despite my own efforts, this thesis would not have been possible without his excellent supervision, supreme guidance and extraordinary support.

I would also like to thank my grandfather, who has always been my inspiration, my parents, and all my colleagues and well-wishers for being a helping hand.


List of Figures

1.1 Linked Residential Burglaries
1.2 Overview of present study
3.1 Data collected for the study
4.1 Uniqueness of Estimated Series (ES) vs Randomly selected crime series (RS)
4.2 Uniqueness of Known Series (KS) vs Randomly selected crimes (RS)
5.1 Model evaluation using calibration curve
5.2 Predictions of test data set by LRM
5.3 Predictions of gmHash data set by LRM
5.4 Frequency of UM of Estimated series and corresponding Random crimes
5.5 Frequency of UM of Known series and corresponding Random crimes


List of Tables

3.1 Categories involved in Burglary form data
3.2 Attributes involved in Linked burglary data
3.3 Attributes involved in Unlinked burglary data
3.4 Preprocessed data records count
3.5 Description of known series (gmHash data set)
3.6 Description of data sets created
3.7 Features selected for building Logistic Regression Model (LRM)
4.1 Results of testing Data set
4.2 Predictions of LRM on gmHash data set
4.3 Crime count of uniqueness measures of Estimated series (ES) compared to Known series (KS)
4.4 Range of uniqueness measure (UM) values of entire distribution of Estimated series (ES) and Known series (KS)
4.5 Uniqueness measure of Estimated Series (ES) and Randomly selected Crimes (RS)
4.6 Summary of Uniqueness measure of Known series (KS) and Randomly selected crime series (RS)
5.1 Predictions of LRM on respective data sets
5.2 Summary of Uniqueness measure of Estimated series (ES) and Known series (KS)
5.3 Summary of Uniqueness measure of Estimated series (ES) and Randomly selected crimes
5.4 Summary of Uniqueness measure of Known series (KS) and Randomly selected crimes


List of Abbreviations

AUC . . . Area Under Curve
ES . . . Estimated Series
KS . . . Known Series
LRM . . . Logistic Regression Model
MO . . . Modus Operandi
RC . . . Randomly selected Crimes
ROC . . . Receiver Operating Characteristic
UM . . . Uniqueness Measure


Contents

Abstract
Acknowledgments
1 Introduction
1.1 Background
1.2 Data Mining
1.2.1 Supervised Learning
1.2.2 Unsupervised Learning
1.3 Predictive Analytics
1.3.1 Regression
1.3.2 Classification using Regression
1.4 Aim and Objectives
1.5 Research Questions
1.6 Hypothesis
1.6.1 Hypothesis 1
1.6.2 Hypothesis 2
1.7 Overview of the present work
2 Related Work
2.1 Identification of the Gap
2.2 Contribution
3 Research Method
3.1 Median Crime - Approach
3.1.1 Process of calculating median crime for a crime series
3.1.2 Uniqueness Measure
3.2 Algorithm - Logistic Regression
3.2.1 Reason for choosing Logistic Regression
3.2.2 Logistic Regression Algorithm
3.3 Distance measure
3.3.1 Jaccard Similarity Coefficient
3.3.2 Jaccard Similarity Coefficient for Binary Variables
3.4 Data
3.4.1 Data Collection and description
3.4.2 Burglary form data
3.4.3 Linked Burglary data
3.4.4 Unlinked Burglary data
3.4.5 GM-Hash burglary data
3.5 Data Pre-processing
3.5.1 Linking Burglaries
3.5.2 Pre-processing data for building the model
3.5.3 Pre-processing data for analysis
3.6 Data Sets
3.6.1 burgleform data set
3.6.2 linked data set
3.6.3 unlinked data set
3.6.4 gmHash data set
3.6.5 Feature selection
3.7 Statistical Model Coefficients
3.7.1 Receiver operating characteristic (ROC)
3.7.2 Coefficient of determination or R-squared
3.7.3 Somers' D Rank correlation
3.8 Experimental Design
3.8.1 Experiment type
3.8.2 Sampling Technique
3.9 Validity Threats
3.9.1 Internal Validity
3.9.2 External Validity
3.9.3 Statistical Conclusion Validity
3.9.4 Construct Validity
4 Results
4.1 Answering RQ1
4.1.1 Building the Logistic Regression Model
4.1.2 Calculating the median for estimated series
4.2 Answering RQ2
4.3 Answering RQ3
5 Analysis
5.1 Analysis for RQ1
5.2 Analysis for RQ2
5.3 Analysis for RQ3
5.4 Uniqueness Measure of different crime series
5.4.1 Interpretation of results
6 Conclusion and Future Work
6.1 Future work


Chapter 1

Introduction

According to the Swedish National Council for Crime Prevention, 22,453 burglaries were reported in 2014 [1]. Such crimes increased by 2 percent in comparison to 2013 and by 19 percent over the past decade. It is important to note that only reported crimes are included in these statistics, i.e., the actual number of burglaries committed is estimated to be higher. Law enforcement agencies know that a large proportion of the burglaries are committed by a rather small number of offenders. Among the reported burglaries, only 5 percent could be cleared such that a person could be tied to the crime.

This work presents different approaches for estimating the uniqueness of linked residential burglaries. Two or more burglaries committed by the same offender or group of offenders are termed linked burglaries. An offender is an individual who has committed two or more crimes of the same type. Linked burglaries formed into a group are termed a 'crime series'. This work addresses the problem of residential burglaries that occurred in Sweden. It presents a way to determine whether a crime series is unique relative to other series and to other individual crimes. If it is not unique, then there is a possibility that they are connected.

In this way the present study helps the law enforcement agencies to know the uniqueness of a crime series, or of a crime in a series, relative to other crimes. It could provide them with useful information that could enhance the investigation process and help them identify the suspect and tie them to the crime.

1.1 Background

Burglary is defined as an offence of either entering into the perimeter of a place or building or trespassing with an intent to steal something [2]. Residential burglary is defined as entering into a building with an intent to steal something whether or not the action is fulfilled [3].

An offender is a person who has committed at least one illegal act that is liable to punishment or serious action [4]. Offenders commit crimes individually, or a crime can be executed by a group of offenders [4]. An offender can also commit multiple crimes [4]. Linked crimes are a group of crimes committed by the same offender.

They are characterised by a similarity in patterns shared among a set of crimes, e.g. a similar method of operation, also known as modus operandi (MO) [5].

A crime series consists of a set of crimes that share the same MO characteristics and are suspected to be committed by the same offender(s).

Figure 1.1: Linked Residential Burglaries

Figure 1.1 gives a basic understanding of linked residential burglaries: the smaller circles correspond to individual crimes, and the outer circle wrapping a set of smaller circles indicates crimes committed by the same offender, which are grouped to form a series based on predefined criteria. The small circles filled in black are crimes that do not belong to that series.

Most of the evidence regarding a burglary is gathered from the crime scene. Every crime scene holds evidence, since the perpetrator leaves trace evidence at the scene [6]. Most crime scenes include evidence of interest. Based on the evidence available at different crime scenes, it is possible to estimate links between the crimes. Physical evidence such as fingerprints or DNA found at several crime scenes is enough to trace the offender and to determine the series. If there is a lack of physical evidence, then behavioural evidence can help in determining series. If physical evidence is present, behavioural evidence can complement the picture.

Beyond the above evidence, attributes that can help the law enforcement agencies are the modus operandi (MO) and the spatial and temporal characteristics of crimes [7]. Among these attributes the most important is the MO, a term that in this context describes the offender's method of committing the burglary.

Spatial data gives information about the geographical area and temporal data gives the time when the burglary was committed [4]. Using these attributes it is possible to find similarities between different crimes.


1.2 Data Mining

Machine Learning is the study of algorithms that improve through experience.

The applications range from data mining programs that discover general rules in large data sets to information filtering systems that automatically learn user interests [8]. It is a field of study in which a computer program is induced with some sort of learning ability rather than being explicitly programmed [9].

A computer program is said to learn from experience with respect to some class of tasks and a certain performance measure [9]. Predictive models analyse historical facts to make predictions about future or unknown events [8][9]. Unlike traditional static program instructions, these algorithms construct a model from the input data instances and make predictions based on the constructed model [8]. Machine learning is the approach used in the present work.

1.2.1 Supervised Learning

Supervised learning is a category of machine learning that involves inferring a function from training data consisting of labels [10]. Supervised learning algorithms generalise to instances that have not occurred yet based on the existing (historical) data. The training data set consists of a set of input values, each having a corresponding output value termed a label. The process of learning using such labelled data is termed supervised learning. The testing data set is a set of values for which the output label is to be predicted. Formally, a training set is a set of data used to discover potentially predictive relationships [10]. A test set is a set of data used to assess the strength and utility of a predictive relationship [10]. The algorithms falling under this category analyse the training data and produce an inferred function that is used for mapping new data examples [10]. Supervised learning algorithms include logistic regression, random forests, Support Vector Machines (SVM), neural networks, etcetera.
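As a brief illustration of this idea, the hedged sketch below fits a classifier on a handful of labelled crime-pair feature vectors and predicts labels for held-out pairs. The data values are made up and the use of scikit-learn is an assumption for illustration only, not something prescribed by the thesis.

```python
# Minimal sketch of supervised learning with labelled crime-pair feature
# vectors (hypothetical data; scikit-learn is assumed to be available).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Each row is a crime pair described by distance-based features;
# the label is 1 for a linked pair and 0 for an unlinked pair.
X = [[0.10, 0.05, 0.20], [0.90, 0.80, 0.75], [0.15, 0.10, 0.30], [0.85, 0.95, 0.70]]
y = [1, 0, 1, 0]

# Split into training data (used to fit the model) and test data
# (used to assess the learned predictive relationship).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

model = LogisticRegression().fit(X_train, y_train)  # learn from labelled examples
print(model.predict(X_test))                        # predicted labels for unseen pairs
```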

1.2.2 Unsupervised Learning

Unsupervised learning is a category of machine learning that, unlike supervised learning, involves finding hidden structure in unlabelled data. The training data consists of input objects without any output label. As the training data examples are unlabelled, there is no error signal for evaluating a solution, which distinguishes supervised from unsupervised learning [11]. Unsupervised learning includes Gaussian mixture models, multivariate analysis and clustering techniques like k-means and hierarchical clustering.


1.3 Predictive Analytics

Predictive analytics is the use of data, algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data [22][23]. Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown, whether past, present or future [23]. In a broader context, predictive analytics is an area of data mining used for extracting information from data and later using it to predict patterns in data.

The main aim of predictive analytics is to predict the outcomes for new instances of data, producing new insights that lead to better actions or improved decision making. Predictive models use already known relationships in data to develop a model that is used to predict values for new instances of such uniform data. The resulting predictions represent a probability of the response variable, dependent on the estimated significance of a set of input variables (predictors) [23]. The techniques used to conduct predictive analytics can be divided into regression and machine learning techniques [23].

The machine learning techniques include neural networks, multilayer perceptron (MLP), Naive Bayes, k-nearest neighbours, etcetera. The other category is regression analysis, which is of more interest in the present work.

1.3.1 Regression

Regression is also a part of supervised learning, where a real value is predicted for each item [26]. It is a statistical process for estimating relationships among variables. It is used to determine which of the independent variables have a relationship with the dependent variable, and also to explore that relationship. Regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the values of the other independent variables are held fixed [25]. Regression is used in situations where a continuous outcome is expected, and classification is used where a class is to be predicted for a chosen data point.

1.3.2 Classification using Regression

Classification is the problem of predicting the desired class based on an output label, e.g. classifying output into multiple classes like A, B, C, D or a binary class such as Yes/No or Male/Female. Classification using regression is done by performing the regression and then classifying the outputs into classes based upon a probability value. If the outcome is less than a chosen probability it is classified into one class and otherwise into another class for binomial classification, and ranges are used for multiclass classification [28].
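As a small illustration of the idea in the paragraph above, the sketch below turns a regression probability into a binary class using a cut-off; the 0.5 cut-off and the class names are illustrative assumptions rather than values fixed by the thesis.

```python
# Illustrative sketch of classification using regression output: the regression
# produces a probability, and a chosen cut-off turns it into a class label.
def classify(probability: float, cutoff: float = 0.5) -> str:
    """Binomial classification: below the cut-off -> one class, otherwise the other."""
    return "Linked" if probability >= cutoff else "Unlinked"

print(classify(0.82))  # Linked
print(classify(0.31))  # Unlinked
```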


1.4 Aim and Objectives

The aim of this thesis is to develop a method for finding the uniqueness of different crime series involving linked residential burglaries. The uniqueness of the crime series formulated for the study is compared to the uniqueness of legally verified known series to gather further implications. All the data required for the study is provided by the Swedish police.

This approach uses a state-of-the-art algorithm for linking the residential burglaries into different series. It is done by calculating the distance of each crime from each of the other crimes using a specific distance measure and then grouping them into series, termed estimated series, based on a chosen probability value. We then calculate the uniqueness of these estimated series and compare it with the uniqueness of the known series. The aim of this study will be fulfilled by accomplishing the following objectives:

• Generate training and testing data sets by sampling the residential burglary data set provided by the Swedish police, which contains residential burglaries that occurred in Sweden from 2011 to 2014. Perform an exploratory data analysis of the attributes related to each of the crime entries in the data, to gain knowledge of how those attributes can be used for different computations such as distance calculation, finding estimated series of crimes and calculating uniqueness.

• Calculate the distance between each of the crimes in the burglary data set by using a proper distance coefficient.

• Calculate estimated series of crimes based on the distances using a state-of-the-art algorithm. The algorithm takes the distance measures as input and outputs the crimes that are linked to form a series, termed the 'estimated series'.

• Create a data set consisting of known linked series of crimes based on the ground truth, i.e., crimes that share similarities and are legally verified as linked by the law bodies. It is termed the 'gmHash' data set.

• Calculate the distance metrics for each of the series identified in the estimated series and for each of the series identified in the gmHash data set.

• Calculate the uniqueness of the estimated series and compare them to the uniqueness of the gmHash data set.


1.5 Research Questions

The research questions are the backbone for any study. They determine the research method, result reporting and their analysis. Based on the above aims and objectives the following research questions are formulated to guide the present study in a systematic way.

RQ1: What methods exist in the literature and how suitable are they for calculating the uniqueness of crime series?

This question is formulated to identify different methods that are suitable for calculating the uniqueness of the formulated crime series. In the present work the crime series are formulated using a state-of-the-art algorithm and are termed 'estimated series'.

RQ2: How unique are the estimated series of linked crimes compared to the ground truth?

After obtaining the estimated series of linked crimes in the above step, RQ2 answers how unique the estimated series are compared to the already known, valid series of solved linked crimes termed 'known series', i.e., the ground truth provided by the police. This research question will be answered using an experiment.

RQ3: How unique are both estimated as well as known series of linked crimes compared to randomly selected crimes?

As an extension to RQ2, RQ3 is formulated to know whether both estimated and known series of crimes are more unique than crime series formulated by using randomly selected crimes from the sample. The result either accepts or rejects the above statement. It is answered by an experiment as well.

1.6 Hypothesis

A hypothesis is an educated prediction that provides an explanation for an observed event, which could be measurable or simply a condition. In the present work, it is a statement of assumption about the relationship between the variables chosen for the study. Unlike other design choices such as experiment type or sampling, there are no conclusive specifications for formulating a hypothesis; it is chosen according to the research question. If the results agree with the hypothesis it is accepted, or else it is rejected. In the present study two hypotheses are formulated, namely Hypothesis 1 and Hypothesis 2 for Research Questions RQ2 and RQ3 respectively, as presented below.


1.6.1 Hypothesis 1

The null hypothesis states that the uniqueness of the estimated series of linked crimes is less than or equal to that of the ground truth. The alternative hypothesis states that the uniqueness of the estimated series of linked crimes is greater than that of the ground truth.

This hypothesis is formulated to compare the uniqueness of the estimated series with that of the ground truth, or known series. The measure of uniqueness indicates how unique the estimated series formulated using the state-of-the-art algorithm are compared to the known series present in the gmHash data set.

Independent Variables : Residential burglaries

Dependent Variables : Uniqueness measure of estimated and known crime series

Way of addressing Hypothesis 1

The acceptance or rejection of the null hypothesis is guided by the results obtained after executing an experiment. The dependent variables in this case are the uniqueness measures of the estimated and known series. The independent variables are the crime pairs in both the estimated series and the known series (ground truth). The ground truth here refers to the known series formulated from the gmHash data set (see Table 3.5).

1.6.2 Hypothesis 2

The null hypothesis states that the uniqueness of both the estimated series and the known series is higher than that of randomly selected crimes. The alternative hypothesis states that the uniqueness of both the estimated and known series is less than or equal to that of randomly selected crimes.

This hypothesis is formulated to compare the uniqueness of both the estimated series and the known series with that of randomly selected crimes. A random crime is a crime selected randomly from the residential burglary data set, and the known series are the series formulated using the gmHash data set (see Table 3.5).

Independent Variables : Residential burglaries

Dependent Variables : Uniqueness measures of estimated, known and ran- domly selected crime series.

Way of addressing Hypothesis 2

The acceptance or rejection of the null hypothesis is guided by the results obtained after executing an experiment. The dependent variables are the uniqueness measures of the estimated, known and randomly selected series. The independent variables are the crime pairs from the estimated series, the known series, and the crime pairs formulated from randomly selected crimes.


1.7 Overview of the present work

The present study involves a lot of detailed description so that the reader can properly understand every step. Therefore, a brief overview of the study is presented below.

• Build a statistical model using a state-of-the-art algorithm and formulate estimated series of linked crimes using the model.

• Compute median crime for each of the estimated series, known series and randomly selected series.

• Compute uniqueness for each of these series.

• Compare the uniqueness of estimated series and known series.

• Compare the uniqueness of estimated and known series with randomly selected series.

A pictorial view of the work overview is presented below in Figure 1.2.


Figure 1.2: Overview of present study


Chapter 2

Related Work

Linking of residential burglaries is done based on similarities in the MO attributes of the crimes committed. A crime is uniquely characterised by spatial and temporal attributes and MO attributes, namely the types of goods stolen, the type of residential area targeted, etc. Researchers have used many techniques for grouping residential burglaries into crime series that share similarities in the attributes describing the crimes, e.g. by the use of machine learning and data mining [12][13].

Stofell et al. used a cluster-based methodology to extract links between robbery crimes and residential burglaries [14]. A methodology based on fuzzy set theory and data mining techniques is used for extracting information from the original forensic data to design an expert system. The expert system makes decisions based on generated fuzzy rules. This is done by fuzzy clustering of the data, membership function extraction and fuzzy rule generation. The system captured the dependencies or links in the data used for the study. It has the drawback of not finding any useful information for some sets of data due to the use of fuzzy clustering [14].

Wang et al. developed an association model for linking crimes based on modus operandi [15]. The study uses 150 robberies and residential burglary data collected from a local police department in Taiwan. It calculates information entropy, a measure to quantify the information content of each of the MO variables. The similarity measure for linking crimes is calculated using information retrieval concepts like frequency and inverse document frequency. An experiment was executed to validate the model. The results showed that the model can be used to find links between crimes and that it performs slightly better than the existing state-of-the-art system in their area, with 11% more links retrieved [15].

Markson et al. compared MO, spatial and temporal proximity for linking residential burglaries [16]. The study uses 160 solved residential burglary cases, 80 linked and 80 unlinked crime pairs. Logistic regression and Receiver Operating Characteristic (ROC) analysis were computed for each crime pair to determine whether they belong to the linked or unlinked pairs. The study shows that a combination of geographical and temporal proximity is effective compared to MO behaviour in establishing links between the crimes and in distinguishing linked crimes from unlinked crimes [16].

Tonkin et al. made an attempt to determine the extent to which behavioural case linkage is capable of linking crimes [17]. Their work was limited to samples of solved crime series provided by the Northamptonshire Police. The crimes were linked based on behavioural case attributes and were validated using already available solved crimes. The results show that spatial distance, temporal proximity, or both combined provided the necessary accuracy among crime categories, across crime types, and within crime types, as measured by ROC analysis. This indicated that behavioural case linkage has the potential to contribute to the linking of crimes and could also be used for distinguishing linked pairs from non-linked pairs [17].

Borg et al. developed a decision support system for linking residential burglaries in Sweden [12]. Firstly, the study proposes a systematic data collection method for collecting, managing and analysing residential burglary crime scene data. Modus operandi, goods stolen, residential types, and spatial and temporal proximities were the attributes used in analysing the burglaries. Secondly, similar burglaries were grouped into clusters. The calculated clusters were validated using cluster validation measurements. The use of clustering measurements such as the Modularity index and the Rand index showed that it was possible to reduce the number of residential burglaries by connecting crimes. The study was performed on the residential burglary database provided by the Swedish police.

2.1 Identification of the Gap

At present, the study of Borg et al. has presented a systematic data collection method with which crime-scene information for residential burglaries can be collected [12]. A Decision Support System (DSS) was developed to compare similarities between burglaries and to perform analysis and visualisation of the crimes.

Based on the above work, individual crimes can be linked to formulate a crime series. The formulation of a crime series increases the information obtained from the crime scenes pertaining to the different crimes in a series, thus increasing the amount of available evidence. Ground truth can be interpreted as the crimes that share a relationship, i.e., are linked or unlinked, in the real world. However, the formulation of a crime series based on the available ground truth data is a compound task. Therefore, the identification and validation of crime series formulated from the provided crime data using a state-of-the-art algorithm is the problem area addressed in the present work.


2.2 Contribution

The study involves developing a method for finding the uniqueness of a crime series containing estimated linked crimes. The present work can be described as two-fold. Firstly, crime series are formulated using a state-of-the-art algorithm and termed estimated series. Secondly, the uniqueness is calculated for these estimated series and is compared to the uniqueness of already solved crimes, termed known series, for validation. The purpose of this work is to help the law enforcement agencies during their investigations.

Crime series formulation enhances the investigation process by providing the investigators with additional evidence obtained by gathering all information from the crime scenes of each crime within a series, e.g. the pattern of MO within the crime series helps in identifying the suspect(s) for a crime series.

The uniqueness measure can be interpreted as a criterion for knowing whether the formulated estimated series are feasible for investigation by the law bodies. This is determined by comparing the uniqueness of the estimated series with the baseline, i.e., the ground truth or known series. If the uniqueness of an estimated series lies in the range of the known series then it is feasible for investigation.

It can also be inferred that a higher value of uniqueness corresponds to a weaker relationship between the crimes in a series, i.e. there is a lower probability that the crimes in the series are linked. A lower value of uniqueness corresponds to a stronger relationship between the crimes in a series, i.e. the crimes have a higher probability of being linked to one another. It can also be interpreted as follows: crime series with low uniqueness are preferred for investigation, as low uniqueness indicates that the crimes in a series are less unique, or more similar, to one another. Therefore, the present study concludes with a result stating whether the estimated crime series are useful for investigation or not.

Different ways of comparing individual crimes have been presented by various researchers, but no attempt has been made to compare entire crime series, which is what is discussed in the present work. This could be a baseline for future researchers working in the field of residential burglaries to gain knowledge about a way of comparing crime series. This is the final outcome of the present work and could be a contribution to the research community.


Chapter 3

Research Method

This section describes the approach used for calculating the uniqueness of different crime series consisting of residential burglaries. Next to the description of the approach, the organisation of this section can be described as three-fold. Firstly, the state-of-the-art algorithm used for building the statistical model that estimates links between burglaries, and the similarity coefficient used for calculating the distance between crimes, are presented. Secondly, the steps to be done prior to executing the research method, such as data collection, data preprocessing and formulation of data sets, are presented. Thirdly, the type of research method used in the present study, the hypotheses and the statistical model coefficients are described, concluding with a presentation of the validity threats.

3.1 Median Crime - Approach

A crime series can be defined as a collection of two or more residential burglaries that are suspected to be committed by the same offender(s). The median crime is chosen as the approach for calculating the uniqueness of the linked residential burglaries. The reason for choosing this approach is that no existing method is available for calculating the uniqueness of a crime series. Therefore, it can be interpreted as a new approach for finding the uniqueness of an entire crime series. Following the statistical definition of the median as the centre entity of a distribution or collection, the median crime is the centre-most crime in a given crime series. The median crime is of key interest in the present work as it is the criterion for calculating the uniqueness of a crime series. A description of uniqueness and its usage in deducing the results of the present work is presented in Section 3.1.2.

3.1.1 Process of calculating median crime for a crime series

Each crime in a crime series is characterised by spatial, temporal and MO attributes. The spatial data has geographical values, namely latitude and longitude, the temporal data has date-time values and the MO data has binary values. The computation of a median crime is not a difficult task if done with only one type of data, but in the present study we have binary and non-binary attributes. In the spatial case, the median is calculated as a geographical point whose distance is equal from all other points. In the temporal case, the median is calculated as a time point equidistant from the time values of all crimes in the series. For MO, the value with the maximum count for the corresponding attribute is selected as the median. For example, consider a crime series with 5 crimes and a specific MO attribute like goods stolen: 3 crimes have this attribute with value '1' and 2 crimes have the value '0'. Therefore the median for this attribute is '1' as it has the maximum count (3 > 2). In the case of an equal count, where there are equal numbers of crimes in the series with value '1' and value '0', either of them can be taken as the median.

There are several ways in which the crimes can be compared to each other, e.g. using spatio-temporal attributes, the MO, or a specific MO attribute like goods stolen or type of housing. But for comparing a whole crime series no explicit method exists; this is addressed by using the median crime in the present work. The median crime computation for a crime series also inherently gathers all the information common to all the crimes in that series.
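To make the procedure above concrete, the hedged sketch below computes a median crime from toy data: component-wise medians for the spatial and temporal parts (one simple interpretation of the central point described above, not necessarily the exact computation used in the thesis) and a per-attribute majority vote for the MO part. The field names and values are illustrative assumptions.

```python
# Hedged sketch of computing a "median crime" for a crime series.
from statistics import median

def median_crime(series):
    """series: list of dicts with 'lat', 'lon', 'days' (temporal offset in days)
    and 'mo' (list of 0/1 MO attribute values of equal length)."""
    lat = median(c["lat"] for c in series)   # central latitude (component-wise median)
    lon = median(c["lon"] for c in series)   # central longitude
    day = median(c["days"] for c in series)  # central time point
    n_mo = len(series[0]["mo"])
    # Majority vote per MO attribute; ties may go either way, here they become 1.
    mo = [1 if sum(c["mo"][i] for c in series) * 2 >= len(series) else 0
          for i in range(n_mo)]
    return {"lat": lat, "lon": lon, "days": day, "mo": mo}

crimes = [
    {"lat": 56.16, "lon": 15.58, "days": 10, "mo": [1, 0, 1]},
    {"lat": 56.18, "lon": 15.60, "days": 14, "mo": [1, 1, 1]},
    {"lat": 56.20, "lon": 15.59, "days": 21, "mo": [0, 0, 1]},
]
print(median_crime(crimes))
```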

3.1.2 Uniqueness Measure

The uniqueness measure is computed for a crime series to know how unique the crimes present in that crime series are. A higher value of uniqueness for a crime series indicates a weaker relationship between the crimes in that series, i.e. the crimes in the series have a lower possibility of being linked to one another. A lower value of uniqueness for a crime series indicates that the crimes in the series have a higher possibility of being linked to one another. After computing the median crime for a particular series, the uniqueness is computed by calculating the average distance of all crimes from the median crime of that crime series.

The uniqueness or Uniqueness Measure (UM) aids in knowing the feasibility of the estimated series being used for investigation by the law bodies. It can be interpreted as a validation measure for the estimated series, which are formulated based on predictions by a statistical model. The UM values of the estimated series should closely match, or lie in the range of, the UM values of the known series in order to use them for investigation purposes. These are the reasons for calculating the uniqueness of the estimated series before actually using them for investigation.

Generalising the calculation of the uniqueness measure, it can be inferred that a higher UM of any crime series indicates that the corresponding crime series should be filtered before investigation, and a lower UM of a crime series indicates that the crime series can be used for investigation. This is the advantage of calculating the uniqueness for comparison of an entire crime series.
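The sketch below illustrates the uniqueness measure as defined above: the average distance of every crime in a series from the series' median crime. The pairwise distance function is left as a parameter because the thesis combines spatial, temporal and MO (Jaccard) distances; the toy MO-mismatch distance and the data values are assumptions for illustration only.

```python
# Minimal sketch of the uniqueness measure: average distance from the median crime.
def uniqueness_measure(series, median, distance):
    """Average distance between each crime in `series` and the `median` crime."""
    return sum(distance(crime, median) for crime in series) / len(series)

# Toy distance over binary MO vectors (simple mismatch rate), for illustration.
def mo_mismatch(a, b):
    return sum(x != y for x, y in zip(a["mo"], b["mo"])) / len(a["mo"])

series = [{"mo": [1, 0, 1]}, {"mo": [1, 1, 1]}, {"mo": [0, 0, 1]}]
med = {"mo": [1, 0, 1]}
print(uniqueness_measure(series, med, mo_mismatch))  # lower value -> more similar crimes
```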


3.2 Algorithm - Logistic Regression

The algorithm used in the present study for grouping similar crimes to form a crime series is logistic regression. This algorithm suits the present problem of linking crimes because the crimes are linked based on a binary output, i.e. either linked or unlinked, which is exactly the kind of output logistic regression provides.

Logistic Regression is a probabilistic regression model developed by D.R. Cox and it belongs to the supervised category of machine learning algorithms [18].

In this regression model, training data containing labels is given as input and a model is constructed based on the input data. The constructed model is used for predicting a binary response to the new set of data instances. During construction of the model, a function is produced by analysing the input and the generated function is used for predicting the binary response to the new data instances.

The inverse of this logistic function is the "logit" function. The model gives its output in the form of probabilities ranging between 0 and 1, which makes it useful in cases where the final results are built upon the probability of the output [18].

The Logistic Regression Model (LRM) is a derivative of generalised linear models, but it is different in its own terms, i.e., it includes a function that transforms the continuous input into values between 0 and 1 [18]. It fits a line to the data points; the advantage of using this function is the flexibility it gives in fitting the line to data points based on the logit, i.e., instead of fitting the line to the binary outcomes, the LRM uses the logit, which is a transformation of the outcomes. Compared to linear models, logistic regression is much more complex in fitting values and is harder to evaluate [30].

3.2.1 Reason for choosing Logistic Regression

In the present work, similar crimes need to be grouped into crime series, and logistic regression is used to accomplish this task. One reason for its selection is the fact that it provides the desired binary response (linked or unlinked); another is that the present work is a continuation of the work done by Borg et al., which uses the logistic regression algorithm [6][19][20][21].

For the present study, we have a set of linked and unlinked crimes legally verified by the law bodies. These crimes are sampled into training and testing data. The training data serves as input for building the statistical model, the Logistic Regression Model (LRM), in the present work. The constructed model is tested using the testing data to ensure functionality. After ensuring that the model provides the desired binary output, we move on to the task of formulating the series from the data available in the burglary data set; these are termed the "estimated series".


3.2.2 Logistic Regression Algorithm

This section provides information about the logistic regression algorithm and how it is formulated using mathematical equations. It aims at finding a logit function for mapping the input instances.

• The logistic function σ(t): it is chosen because it accepts any input value between negative infinity and positive infinity, while its output is bounded between zero and one.

\sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}}, \qquad t = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \qquad (3.1)

F(x) = \frac{1}{1 + e^{-t}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}} \qquad (3.2)

• The logit function h(F(x)): it is the inverse of the logistic function, also called the log-odds or the natural logarithm of the odds, and it is equal to the linear regression expression.

h(F(x)) = \ln \frac{F(x)}{1 - F(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \qquad (3.3)

It takes any value between −∞ and +∞.

• Odds: the odds of the dependent variable equalling a case are equal to the exponential function of the linear regression expression.

\mathrm{Odds} = e^t = e^{(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)} \qquad (3.4)

Based on the above definitions, the following description provides information about the actual working of the algorithm. It presents the way the logit function is used to predict the outcome for new instances.

• Using the logit function, map the continuous independent variables to a dependent variable that is also continuous. The logit takes values from −∞ to +∞.

\mathrm{logit} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \qquad (3.5)

• Convert the logit into odds. This value ranges from 0 to +∞.

\mathrm{odds} = e^{\mathrm{logit}} = e^{(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)} \qquad (3.6)

• Convert the odds into a probability score. This value ranges between 0 and 1.

\mathrm{Probability}\ (p) = \frac{\mathrm{odds}}{\mathrm{odds} + 1} \qquad (3.7)

• Set up a cut-off value for p; a response above the cut-off is "Linked" and below the cut-off is "Unlinked".

The following information describes the various terms used in equations 3.1 to 3.7.

• h() refers to the logit function; h(F(x)) is the "logit", i.e., the natural logarithm of the odds, and it is equivalent to the linear regression expression.

• "t" is a linear function of the explanatory or independent variables.

• "ln" is the natural logarithm.

• β_0 is the intercept from the linear regression equation, i.e., the value of the criterion when the predictors are equal to zero.

• β_1 x_1 is a regression coefficient multiplied by some value of the predictor.

• "e" or base-e is the exponential function.

• F(x) is the probability that the dependent variable equals a case for some linear combination "x" of the predictors. The value of F(x) ranges between 0 and 1.

• + ∞ denotes positive infinity and - ∞ denotes negative infinity.
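As a hedged numerical sketch of the steps above (equations 3.5 to 3.7), the snippet below computes the logit from a set of coefficients and feature values, converts it to odds and then to a probability, and applies a cut-off. The coefficient values and the 0.5 cut-off are illustrative assumptions, not values taken from the thesis.

```python
import math

def predict_link(betas, x, cutoff=0.5):
    """betas: [b0, b1, ..., bn]; x: [x1, ..., xn] feature (distance) values."""
    logit = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))  # eq. 3.5
    odds = math.exp(logit)                                          # eq. 3.6
    p = odds / (odds + 1)                                           # eq. 3.7
    return ("Linked" if p >= cutoff else "Unlinked"), p

# Hypothetical coefficients and feature values for one crime pair.
label, p = predict_link(betas=[-1.0, 2.5, -3.0], x=[0.8, 0.1])
print(label, round(p, 3))  # Linked 0.668
```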

3.3 Distance measure

The logistic regression algorithm is used for building the statistical model to predict links between the crimes. The predictors for the algorithm are the features that are calculated using the different attributes in Table 3.2 and Table 3.3. Each feature is a distance value between different crime entries, calculated using a distance coefficient. The features identified for the present work, based on the different categories in Table 3.2 and Table 3.3, are Spatial, Temporal, Combined, Entry, Target, Goods, Trace and Victim [12].


3.3.1 Jaccard Similarity Coefficient

The Jaccard index (also known as the Jaccard similarity coefficient) is used to calculate similarity. It is a measure of the similarity of two finite sample sets X and Y. It is calculated using the formula below.

\mathrm{Jaccard\ Index}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \qquad (3.8)

The Jaccard distance measures the dissimilarity between the finite sample sets.

\mathrm{Jaccard\ Distance}(X, Y) = 1 - \mathrm{Jaccard\ Index}(X, Y) = 1 - \frac{|X \cap Y|}{|X \cup Y|} \qquad (3.9)

3.3.2 Jaccard Similarity Coefficient for Binary Variables

An alternative computation of the Jaccard similarity and Jaccard distance exists when the sample sets contain binary values (0 and 1), as is the case for the burglary data set in the present study.

\mathrm{Jaccard\ Index}(X, Y) = \frac{M_{11}}{M_{10} + M_{01} + M_{11}} \qquad (3.10)

As per the above modified formula for computation using binary values, the Jaccard distance is

\mathrm{Jaccard\ Distance}(X, Y) = 1 - \mathrm{Jaccard\ Index}(X, Y) = 1 - \frac{M_{11}}{M_{10} + M_{01} + M_{11}} \qquad (3.11)

M_{00} + M_{10} + M_{01} + M_{11} = n \qquad (3.12)


The following information describes the various terms used in equations 3.8 to 3.12.

• X, Y are both distinct and finite sample sets.

• M11 is the total number of attributes where both X and Y have the value "1".

• M00 is the total number of attributes where both X and Y have the value "0".

• M10 is the total number of attributes where X has the value "1" and Y has the value "0".

• M01 is the total number of attributes where X has the value "0" and Y has the value "1".

• n is the total number of the binary attributes involved in the computation.

• X ∪ Y is the union of X and Y; it contains all the attributes that have at least one "1" in X or Y. In the case of binary values, the union of X and Y returns the value "0" only when both X and Y are "0"; in all remaining cases it returns the value "1".

• X ∩ Y is the intersection of X and Y; it contains only the common attributes present in both X and Y. In the case of binary attributes, the intersection of X and Y returns the value "1" only when both X and Y are "1"; in all remaining cases it returns the value "0".

The Jaccard coefficient returns a value between 0 and 1. The two finite sample sets are identical if the value returned by the Jaccard Coefficient is 1 and are totally different (no intersection exists) if it returns a value 0.
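A small sketch of the binary Jaccard computation described above (equations 3.10 and 3.11) is given below; it counts the M11, M10 and M01 cases while ignoring the joint-zero positions. The example vectors are made up, and returning 1.0 when the denominator is zero is an illustrative convention, not something prescribed by the thesis.

```python
# Binary Jaccard index and distance for two 0/1 attribute vectors of equal length.
def jaccard_index(x, y):
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))  # both "1"
    m10 = sum(a == 1 and b == 0 for a, b in zip(x, y))  # only x is "1"
    m01 = sum(a == 0 and b == 1 for a, b in zip(x, y))  # only y is "1"
    return m11 / (m10 + m01 + m11) if (m10 + m01 + m11) else 1.0

def jaccard_distance(x, y):
    return 1 - jaccard_index(x, y)

a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]
print(jaccard_index(a, b))     # 2 / 4 = 0.5
print(jaccard_distance(a, b))  # 0.5
```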

3.4 Data

This section describes the process of data collection, gives a description of the collected data, explains how it is made ready for use by the research method and for further analysis (i.e., preprocessing), and shows how the different data sets are constructed from the preprocessed data.


3.4.1 Data Collection and description

The residential burglary data set provided by the Swedish police is used for the experiments conducted as part of this study. The present study deals with four types of data, divided into four different files. Each file consists of a number of rows termed entries, and each entry in the data is a residential burglary committed in Sweden.

The primary data used for the study is the 'Burglary Form' data, consisting of 1278 reports of residential burglaries. It provides access to the various attributes of a particular crime, making it easy to perform different computations. The data about the burglaries is collected using a standardised digital form consisting of different parameters that describe a burglary; the fields in this file are therefore uniform.

The second type of data, termed 'Linked Burglary', is a set of crimes for which there exists a suspect burglar; such crimes are known as linked crimes. They are established by the law bodies using different means, e.g. fingerprints, DNA, or the suspect being caught at the crime scene while committing the felony. The third type of data, termed 'Unlinked Burglary', is a set of crimes that do not have any such matches, and these are termed unlinked crimes.

The final type of data, termed 'GM-Hash burglary', also consists of linked burglaries, but these additionally have a hash digest of the suspect. The difference between Linked Burglary and GM-Hash Burglary is that linked crimes for which the suspect's hash digest is not available to the author are categorised as 'Linked Burglary', while linked crimes for which the suspect's hash digest is available to the author are categorised as 'GM-Hash Burglary'. A representation of all the above discussed files is shown in Figure 3.1.

Figure 3.1: Data collected for the study


The data collected for the study consists of a number of attributes. A detailed description of the types of data involved in each of the above mentioned records is given below.

3.4.2 Burglary form data

This is the primary data available for the study consisting of a total of 1278 entries.

Each entry consists of attributes that provide broad to minute details about the burglary, except for the suspect information, which is not present in this data. There are a total of 140 attributes/columns that describe the burglary in the data.

All the information about a burglary can be obtained from the burglary form data file. Details about the types of data are presented below in Table 3.1.

Type of data: Description
Time/Date: Date and time of the burglary
Distance: Distance between two different unique burglaries
Residential Area: The type of neighbourhood where the burglary took place
Residential Type: Type of residence where the burglary took place. Villa, apartment etc.
Burgle Alarm: If there is a burglar alarm, actions like triggered or not, sabotaged etc.
Standard Object: Objects in the vicinity of the crime scene. Vehicle in driveway, street light etc.
Plaintiff: Information collected from the victim or from his/her filed complaint
Access Object: The object used to break into the vicinity to commit the burglary
Ransacked: Search strategy used by the burglar inside
Goods: Type of goods or objects that are stolen
Trace Evidence: The evidence gathered from the crime scene, e.g. DNA, shoe prints etc.
Other: Information about witnesses, traceable goods etc.
Notes: Additional information from the crime scene, from the victim or witnesses

Table 3.1: Categories involved in Burglary form data


3.4.3 Linked Burglary data

This data set consists of crimes that are linked, i.e., they share the same offender, who has been linked to both crimes by e.g. DNA or fingerprints. Each entry consists of two linked crimes that can be differentiated using their "unique id", which is a unique identifier assigned to a crime. Every entry has a calculated Jaccard distance, which is a distance measure between identical attributes of each crime. The data set consists of 228 entries. Details about the type of data and the information this data provides are presented in Table 3.2.

Type of data: Description
unique id-1: Identification number for the first burglary in each entry
unique id-2: Identification number for the second burglary in each entry
Conn: Indicates the link criterion; here it is "1" as the burglaries are linked
Temporal: Jaccard distance of two burglaries for "temporal" data
Spatial: Jaccard distance of two burglaries for "spatial" data
Combined: Jaccard distance of two burglaries for "combined" data
Entry: Jaccard distance of two burglaries for "mode of entry" data
Target: Jaccard distance of two burglaries for "type of residential area targeted" data
Goods: Jaccard distance of two burglaries for "type of goods stolen" data
Trace: Jaccard distance of two burglaries for "trace of evidence" data
Victim: Jaccard distance of two burglaries for "victim" data

Table 3.2: Attributes involved in Linked burglary data


3.4.4 Unlinked Burglary data

This data set consists of crimes that are not linked, i.e. unlinked. Each entry consists of two unlinked crimes that can be differentiated using their "unique id", which is a unique code given to each crime. Every entry has a calculated Jaccard distance, which is a distance measure between identical attributes of the two burglaries. The data set consists of 17179 entries and contains some duplicates. Details about the type of data are presented below in Table 3.3.

Type of data: Description
unique id-1: Identification number for the first burglary in each entry
unique id-2: Identification number for the second burglary in each entry
Conn: Indicates the link criterion; here it is "0" as the burglaries are unlinked
Temporal: Jaccard distance of two burglaries for "temporal" data
Spatial: Jaccard distance of two burglaries for "spatial" data
Combined: Jaccard distance of two burglaries for "combined" data
Entry: Jaccard distance of two burglaries for "mode of entry" data
Target: Jaccard distance of two burglaries for "type of residential area targeted" data
Goods: Jaccard distance of two burglaries for "type of goods stolen" data
Trace: Jaccard distance of two burglaries for "trace of evidence" data
Victim: Jaccard distance of two burglaries for "victim" data

Table 3.3: Attributes involved in Unlinked burglary data


3.4.5 GM-Hash burglary data

Each entry in this data file has an identified suspect. Each entry is a burglary consisting of the "unique id" that identifies the burglary and its corresponding pHash value. The pHash is a hash digest of the suspect's name and date of birth. The data collected here contains some duplicates that must be cleaned before it is used for any further computations. This data has a total of 2824 entries.

Some of the categories of data presented in Table 3.1 and a few categories of Table 3.2 and Table 3.3 are characterised by multiple attributes. The Jaccard distances in these tables are calculated using the attributes mentioned below.

• Temporal: Distance between two crime records is calculated in days. Both of the records are distinct.

• Spatial: Distance between two crime records is calculated in kilometers. Both of the records are distinct.

• Combined: Jaccard distance between two crime records is calculated based on all categories of data except the temporal and spatial categories. Both of the records are distinct.

• Entry: Jaccard distance between two crime records is calculated based on the sub-categories in mode of entry. Both of the records are distinct. The sub-categories are listed under Access Object, which are patiodoor, mirrorpatiodoor, balconydoor, cellardoor, door, window, triplepanewindow.

• Target: Jaccard distance between two crime records is calculated based on the sub-categories of type of residence and its properties. Both of the records are distinct. The sub-categories are listed under Type of residence, which are house, farm, terrace/twin/town-house, apartment, co-operative apartment, multiple floors, single floors, basement, top-floor apartment, bottom-floor apartment.

• Goods: Jaccard distance between two crime records is calculated based on the sub-categories of type of goods stolen. Both of the records are distinct. The sub-categories are listed under Goods, which are alcohol or tobacco, electronics, gold/jewellery/cash, clothing, pharmaceutical, toys, weapons, safe or alarm-box, perfume, vehicle keys, passport or ID and other goods.

• Trace: Jaccard distance between two crime records is calculated based on physical evidence. Both of the records are distinct. The sub-categories are listed under Trace evidence, which are fingerprint, dna, shoes, gloves, tires, visiblefibre, compareglass, goodstosearch, toolmark, sect91na, smallmark, medmark, largemark, lte5marks, gte6marks, colormark, comparecolor.

• Victim: Jaccard distance between two crime records is calculated based on the sub-categories of Plaintiff and Standard Object. Both of the records are distinct. The sub-categories listed under Standard Object are letterbox emptied, lit indoors, lit outdoors, street lighting, vehicle in driveway, lawn moved/snow shuffled, and dog or sign indicating dog; the sub-categories listed under Plaintiff are at home during crime, entrepreneur, appears in company index, planned absence, spontaneous absence, household services, call from unknown number/person, document absence online, children at home, advertised buy or for sale, and vehicle at airport or border.

3.5 Data Pre-processing

Data preprocessing is performed on raw data to prepare it for further analysis. It is generally used to transform raw data into a format without inconsistencies, so that it is easy for an algorithm to learn from the data and provide appropriate and accurate output. Data preprocessing is necessary because the data gathering/collection process is often loosely controlled, which may result in, e.g., out-of-range (garbage) values, awkward data combinations and missing values in the collected data. Analysing data where such problems have not been addressed can produce misleading results.

Hence ensuring the quality of data is the first and foremost task before running an analysis.

Irrelevant, redundant, noisy or unreliable data makes knowledge discovery during training more difficult and does not produce the desired output. Data preparation and filtering can take a considerable amount of processing time and includes steps such as cleaning, normalization and transformation:

• Sampling, which selects a representative subset from a large population.

• Transformation, which manipulates raw data to produce a single input.

• Denoising, which removes noise from data.

• Normalization, which organises data for more efficient access.

• Feature extraction, which pulls out specified data that is significant in a particular context. It can be done automatically using tools, or manually if enough information about the data is available.

The final output of pre-processing is a data set ready to be used by a learning algorithm.


3.5.1 Linking Burglaries

The process of linking crimes is done using logistic regression, a supervised learning algorithm. The binomial response from the algorithm indicates the relation between two crimes (i.e., ’1’ for linked and ’0’ for unlinked). The predictors chosen for this process are selected based on the categories of data as per Table 3.2 and Table 3.3.
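A minimal sketch of this linking step is given below, assuming the crime pairs have already been expressed as a pandas DataFrame with the seven feature columns of Table 3.7 and a binary label column. The column names and the scikit-learn implementation are illustrative, not the exact setup used in the study.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["temporal", "spatial", "entry", "target", "goods", "trace", "victim"]


def fit_linkage_model(pairs: pd.DataFrame) -> LogisticRegression:
    """Fit a binomial logistic regression on pairwise crime features.

    `pairs` is assumed to hold one row per crime pair, the feature columns
    above, and a 'linked' column (1 = linked, 0 = unlinked).
    """
    model = LogisticRegression(max_iter=1000)
    model.fit(pairs[FEATURES], pairs["linked"])
    return model


# For new crime pairs, model.predict_proba(new_pairs[FEATURES])[:, 1]
# gives the estimated probability that the two crimes are linked.
```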

3.5.2 Pre-processing data for building the model

The data is preprocessed before being used for training the model, in order to avoid inconsistencies. The steps are described below.

• Remove attributes from the burgleform data that are not expected to be useful predictive indicators.

• Identify the different features and select the features required for predicting the outcome of the model (linked or unlinked).

3.5.3 Pre-processing data for analysis

The data to be used for analysis contains some duplicate records. Since each record is a unique crime entry that occurs only once in the real world, duplicates must be removed, as they would give misleading results. The steps are described below.

• Each entry in the linked burglary and unlinked burglary files (see Figure 3.1) has ’unique id-1’ and ’unique id-2’ attributes, one for the first burglary and another for the second burglary (see Table 3.2 and Table 3.3). Distinct entries are obtained by keeping only one entry when two or more entries have the same values for the combination of ’unique id-1’ and ’unique id-2’. All duplicate entries in the linked and unlinked burglary data are removed by this process.

• Entries in the GM-Hash burglary data (see Figure 3.1) differ in terms of ’pHash’. This data has only two attributes: ’unique id’, which identifies the crime, and ’pHash’, which is the suspect’s hash digest value. Therefore, only one entry is kept when two or more entries have the same values for the combination of ’pHash’ and ’unique id’; duplicates are removed by this process. These entries are further processed to formulate the crime series based on all unique pHash values, termed "known series" (a sketch of the duplicate removal is given after this list).
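The sketch below illustrates the duplicate removal from the two steps above with pandas. The file names are assumptions; the column names follow the attributes mentioned in the text.

```python
import pandas as pd

# File names are assumed for illustration; the columns follow the text above.
linked = pd.read_csv("linked_burglaries.csv")
unlinked = pd.read_csv("unlinked_burglaries.csv")
gm_hash = pd.read_csv("gm_hash_burglaries.csv")

# Keep one entry per combination of 'unique id-1' and 'unique id-2'.
linked = linked.drop_duplicates(subset=["unique id-1", "unique id-2"])
unlinked = unlinked.drop_duplicates(subset=["unique id-1", "unique id-2"])

# Keep one entry per combination of 'pHash' and 'unique id'.
gm_hash = gm_hash.drop_duplicates(subset=["pHash", "unique id"])
```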

After removing the duplicates, each of the files (see Figure 3.1) contains preprocessed data entries ready for further processing to formulate the data sets for the study. The new count of entries in each of these files is presented in Table 3.4 below.

Record count      Burgleform   Linked   Unlinked   GMHash
Original          1278         228      17179      2824
Pre-processed     1278         29       4627       1789

Table 3.4: Preprocessed data record count

3.6 Data Sets

The processed data is used in formulating several data sets for the present study.

The preprocessed data in the linked, unlinked and GM-Hash files is checked for valid entries in the burgleform (see Figure 3.1), since all information required at any stage of the study is present in the burgleform. Crime entries absent from the burgleform data are of no use, as no information about them is available. After removing the entries absent from the burgleform, the following data sets are formulated.

3.6.1 burgleform data set

This data set consists of unique residential burglaries with all the information about each crime that is necessary for the computations in the study. It consists of 1278 records.

3.6.2 linked data set

This data set consists of 29 unique records of linked burglary data and their information pertaining to the attributes mentioned in Table 3.2.

3.6.3 unlinked data set

This data set consists of 4627 unique records of unlinked burglary data and their information pertaining to the attributes mentioned in Table 3.3.

3.6.4 gmHash data set

This data set is formed after performing some computations on the preprocessed GM-Hash burglary data.

• The unique records obtained have a count of 1789 (see Table 3.4). Initially, the burglaries pertaining to all 1789 distinct pHash values are considered.


• pHash values having only one crime associated with them are excluded. After this exclusion, 456 pHash values remain that have more than one crime associated with them.

• Each of these pHash values is now associated with two or more crimes, and each such pHash value corresponds to a separate series. Under each pHash value, crimes without entries in the burgleform are excluded; only 9 pHash values remain after this exclusion.

After this exclusion, 9 pHash values remain, corresponding to 9 series that each contain more than one crime. This is the final data set generated in this step. The series formulated using this data set are termed "known series" (a sketch of this formulation is given after Table 3.5). A description of these crime series is presented in Table 3.5.

Series        1    2    3    4    5    6    7    8    9
Suspect       S1   S2   S3   S4   S5   S6   S7   S8   S9
Crime count   2    3    4    2    4    4    4    4    4

Table 3.5: Description of known series (gmHash data set)
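The sketch referenced above illustrates this formulation of known series with pandas: crimes are grouped by pHash, crimes absent from the burgleform are discarded, and only pHash values left with more than one crime are kept. The DataFrame and column names are assumptions made for the example.

```python
import pandas as pd


def formulate_known_series(gm_hash: pd.DataFrame, burgleform: pd.DataFrame) -> dict:
    """Map each remaining pHash value to the list of its crimes.

    Only crimes present in the burgleform are kept, and only pHash values
    with more than one remaining crime form a known series.
    """
    in_form = gm_hash[gm_hash["unique id"].isin(burgleform["unique id"])]
    crimes_per_suspect = in_form.groupby("pHash")["unique id"].apply(list)
    return {p: crimes for p, crimes in crimes_per_suspect.items() if len(crimes) > 1}
```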

Among these, series 1, 4, 3, 5, 6, 7 and 9 contain the same crimes but have different pHash values. This can be interpreted as these crimes being committed by a group of suspects (burglars) rather than a single burglar. The final data sets are thus formulated, and the crime count for each data set is shown in Table 3.6 below.

Data set     Description                                          Count
burgleform   Contains all the data attributes about a burglary    1278
linked       Contains data attributes of linked burglaries        29
unlinked     Contains data attributes of unlinked burglaries      4627
gmHash       Contains formulated series of crimes                 9

Table 3.6: Description of data sets created

3.6.5 Feature selection

Feature selection is the process of selecting a set of relevant features (predictors) for building the statistical model. Selecting features is beneficial as it simplifies the model, making it easier to interpret, and reduces training time and over-fitting, thus enhancing generalisation [29].

The features are identified based on the categories of data collected as per Table 3.2 and Table 3.3, for building the statistical model using the logistic regression algorithm, termed the Logistic Regression Model (LRM). All features included in Table 3.2 and Table 3.3 are selected except combined, as it is the sum of the values of all the features, which is inappropriate for the model. The different categories of data in Table 3.1 that fall under each feature are listed in Table 3.7 below.

Features   Type of data
Temporal   Time difference between two crimes
Spatial    Geographical distance between two crimes
Entry      Jaccard distance of Access object between two crimes
Target     Jaccard distance of Residence type between two crimes
Goods      Jaccard distance of Goods between two crimes
Trace      Jaccard distance of Trace Evidence between two crimes
Victim     Jaccard distance of Standard Object and Plaintiff between two crimes

Table 3.7: Features selected for building the Logistic Regression Model (LRM)

3.7 Statistical Model Coefficients

The statistical model coefficients are used for the internal evaluation of the model. As the present study uses the logistic regression algorithm, the following coefficients are of key interest.

3.7.1 Receiver operating characteristic (ROC)

The receiver operating characteristic (ROC), also known as the ROC curve, is a graphical plot that elucidates the performance of a dichotomous classification system (binary classifier) at various threshold levels [31]. It plots the true positive rate against the false positive rate of the system at various threshold levels. It can also be interpreted as a plot of sensitivity versus (1 − specificity) at various thresholds.

ROC analysis is one of the most popular methods for visualising and interpreting classifier performance [32].

Usage for analysis - AUC: To compare classifiers, we may want to reduce the ROC curve to a single scalar value representing predictive performance. An accepted method for such a transformation is to calculate the Area Under the ROC Curve (AUC), also known as the "c-index" [31][32]. The AUC is an area within the unit square and hence has a value ranging from 0 to 1. By definition, a random prediction produces the diagonal of the plot, which has an area of 0.5, so no reasonable classifier should have an AUC below 0.5. An AUC above 0.5 is acceptable and a value of 1 indicates perfect prediction. In many cases, a value greater than 0.8 is regarded as good for predicting outcomes [31]. An intuitive way of interpreting the AUC is that it tells how well the model would work on average, across all misclassification costs.
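As an illustration, the following sketch computes the ROC curve and the AUC (c-index) for a set of predicted linkage probabilities using scikit-learn; the labels and scores are toy values, not results from the study.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy observed labels (1 = linked, 0 = unlinked) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under the curve (c-index)
print(f"AUC (c-index): {auc:.3f}")
```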


3.7.2 Coefficient of determination or R-squared

The coefficient of determination indicates how well data fits a statistical model, sometimes simply a line or a curve. It determines the goodness of fit, i.e., the extent to which the statistical model fits the data points. It is a discrimination index presenting the percentage of variability the model accounts for, i.e., the percentage of variance in the outcome explained by the statistical model [33].

Usage for analysis - Range values of R-squared: A value of 0 indicates that the model does not fit any of the data points and that the outcome appears random, while a value of 1 indicates that the model fits all data points perfectly. It should be noted, however, that well-constructed statistical models that predict outcomes correctly may still have lower R-squared values in some cases [29][33].
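The text does not state which R-squared variant is reported; for logistic regression, a pseudo-R-squared such as Nagelkerke’s is commonly used. The sketch below shows, under that assumption, how it can be computed from the log-likelihoods of the null and fitted models (the input values are illustrative).

```python
import math


def nagelkerke_r2(ll_null: float, ll_model: float, n: int) -> float:
    """Nagelkerke pseudo-R-squared from null/fitted log-likelihoods and sample size."""
    cox_snell = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
    max_cox_snell = 1.0 - math.exp((2.0 / n) * ll_null)
    return cox_snell / max_cox_snell


print(nagelkerke_r2(ll_null=-120.0, ll_model=-45.0, n=200))  # ≈ 0.76
```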

3.7.3 Somers’ D Rank correlation

Somers’ D is a rank correlation between the predicted probabilities and the observed outcomes [34]. The values of this rank discrimination index range from 0 to 1, since it is calculated from the c-index (AUC). It has a simple relationship with the c-index:

Dxy = 2(c − 0.5). (3.13)

Usage for analysis - Range values of Dxy: A value of 0 for Dxy indicates that the model’s predictions are random, and a value of 1 indicates that the model discriminates perfectly [35].
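Equation 3.13 translates directly into code; for example, a c-index of 0.95 gives Dxy = 0.90.

```python
def somers_dxy(c_index: float) -> float:
    """Somers' Dxy rank correlation from the c-index (AUC), Equation 3.13."""
    return 2.0 * (c_index - 0.5)


print(somers_dxy(0.95))  # 0.90
```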

3.8 Experimental Design

An experiment is an operation carried out to justify, establish or test the validity of a formulated hypothesis [39]. The procedure of designing experiment(s) to obtain valid results, required for stating objective conclusions based on the chosen sample data, is called experimental design [37][38].

The research method chosen for the present study is an experiment. The experimenter has control over one set of variables in order to observe its effects on another set of variables and draw appropriate conclusions [36]. The present work reaches a conclusion by proving or disproving a formulated hypothesis based on the results obtained after executing the method. In an experiment, the researcher tests the influence of one or more variables on other variables. For these reasons, the experiment is preferred over other empirical methods for the present study.

• The variable or set of variables that the researcher/experimenter manipulates, or those which do not depend on other variables, are called independent variables [40].
