A Machine Learning Approach To Crime Investigation In The New York City Land Area



Faculty of Technology and Society

Department of Computer Science and Media Technology

Master Thesis Project 15p, Spring 2020



Yani Di Giovanni


Daniel Spikol


Johan Holmgren


Contact information


Yani Di Giovanni E-mail: digiovanniyani@yahoo.com


Daniel Spikol E-mail: Daniel.spikol@mah.se

Malmö University, Department of Computer Science and Media Technology.


Johan Holmgren

E-mail: johan.holmgren@mau.se



I would like to thank Daniel Spikol for his advice and feedback; moreover, I would like to thank Radu Mihailescu for discussing some of the statistical aspects concerning this dissertation.



Abstract

This dissertation discusses how machine learning, through a selection of its algorithms, can be used to investigate the various kinds of crime committed in the New York City land area, with special focus on the root causes that allegedly pave the way for the violation of certain areas of the law. After covering general background information on the history of this field, discussing a few examples from previous work as well as the history of crime in the geographical area of interest, the focus shifts first to retrieving the necessary numerical data dating back several years, since some of it may not be explicitly available. Once this task is fulfilled, the selected machine learning algorithms are implemented to gain insight into the relationship between the chosen variables. We conclude with the direction in which future research should be heading.

Popular Scientific Summary

Nowadays, computer science areas such as machine learning and data mining are considered a very important part of the prevention and detection of crime [1]. Crime analysts often spend a substantial amount of time analyzing large datasets to discover whether a specific crime follows a pattern, since such findings can be adopted to predict and prevent crime [2]. This data-driven approach is growing day by day; as a matter of fact, a few years ago the U.S. Department of Justice inaugurated initiatives supporting what is called 'predictive policing', which allows law enforcement agencies to perform their investigative tasks more effectively with fewer resources [2]. The aim of this dissertation is to analyze how the field of machine learning selects various social factors and evaluates if and how such aspects are responsible for shaping the tendency of certain crimes. In this sense, the aiding potential that predictive policing might receive is directly proportional to the performance of the ML technique analysis. Sometimes, these 'explanatory' elements are represented with a dataset that for the most part is public record. After substituting such values into some of the mathematical equations belonging to the machine learning field, we might be able to generate values very close to the ones representing the investigated crime, hinting that such unlawful activities are indeed 'molded' by the aforementioned elements. Such discoveries could clue law enforcement agencies in on the direction in which their investigation should head, besides influencing policy makers' decisions should they realize that particular laws and regulations have the potential to influence special classes of committed felonies. The performance of these equations will be evaluated according to numerical measures generated by applying our dataset to other statistical formulas; there are pre-established parameters within which said values have to lie in order to measure the strength of the relationships we attempt to prove or reject.


List of Acronyms

AI Artificial Intelligence

BK Brooklyn

BK TOT total crimes in Brooklyn

BX Bronx

BX TOT total crimes in the Bronx

HSGR High school graduation rate

INC Average household income

MH Manhattan

MH TOT total actual crimes in Manhattan

MHvsBK linear Manhattan predicted crimes with Brooklyn social factors using linear regression

MHvsBK poisson Manhattan predicted crimes with Brooklyn social factors using Poisson regression

MHvsBX linear Manhattan predicted crimes with the Bronx social factors using linear regression

MHvsBX poisson Manhattan predicted crimes with the Bronx social factors using Poisson regression

MHvsMH linear Manhattan predicted crimes with Manhattan social factors using linear regression

MHvsMH poisson Manhattan predicted crimes with Manhattan social factors using Poisson regression

MHvsQN linear Manhattan predicted crimes with Queens social factors using linear regression

MHvsQN poisson Manhattan predicted crimes with Queens social factors using Poisson regression

MHvsSI linear Manhattan predicted crimes with Staten Island social factors using linear regression

MHvsSI poisson Manhattan predicted crimes with Staten Island social factors using Poisson regression

ML Machine Learning

NYC New York City

NYPD New York Police Department

QN Queens

QN TOT total crimes in Queens

SF social factors: Unemployment, high school graduation rate, household income

SI Staten Island

SI TOT total crimes in Staten Island



Contents

1 Introduction
  1.1 Motivation
  1.2 Outline
  1.3 A Brief Summary of the History of Machine Learning
  1.4 A Brief History of New York City Crime and Attempts to Crime Reduction
2 Research Methods
  2.1 Linear Regression: Description, Outcome Evaluation and Validity
  2.2 Poisson Regression: Description, Outcome Evaluation and Validity
  2.3 Numerical Methods
3 Related Work and Contribution
4 Research Questions and Expected Results
5 Crime Analysis: Research Methods
  5.1 Data Collection and Cleaning
  5.2 Total Crime Trend
  5.3 Linear Regression Analysis
6 Poisson Regression Analysis
7 Conclusion
  7.1 Limitation and Future Research
  7.2 Results Evaluation, Real World Application and Future




Introduction

When violent crime affects a community, there is a natural inclination to look for ways to reduce such unlawful activities; to address the issue, methods such as prosecution by law authorities or prevention programs are not uncommon. In the process of addressing such issues, the root causes of violent crime are sometimes overlooked. The goal of this dissertation is to analyze, through ML algorithms, the possible reasons why certain violent crimes in the New York City land area take place. After our inquiry, we will have an idea of which social factors affect the occurrence of a specific class of violent unlawful activities.



Motivation

Crime is a tendency that for the most part relates to social as well as economic aspects [3][4]. Given that the socio-economic causes are usually represented by a large amount of data (e.g. unemployment, graduation rate), a need arises for data storage, processing and investigation, as well as for algorithms capable of handling such information with distinct precision [5]. Despite some data showing an overall decrease in various instances and within certain geographical areas [6][7], crime tends to be an ongoing process; therefore, just as for many other issues, comprehending the root causes exemplified in the aforementioned aspects has the potential to provide an excellent prevention tool. Being able to understand the source generating unlawful activities guarantees substantial aid to law-enforcement agencies as well as policy makers, not to mention that society as a whole will benefit from it. In this sense, machine learning methods can be used to confirm or deny whether specific crimes are a 'function' of one or more social characteristics. This particular geographical area was chosen because the available dataset contains large numbers, and because New York City is listed among the first 30 cities in the world for population density [8], raising the idea that the crime level in the examined boroughs might indeed be impacted by such factors.




Outline

Given the nature of this topic and the specificity of what this dissertation will analyze, an occasional chronological approach has been considered useful to provide the reader with a solid insight into this field and to gradually pave the way for our main goal; therefore, the adopted structure diverges to some degree from that of a typical computer science dissertation.

First of all, after motivating the choice of this project as well as of the geographical zone (see 1.1), we will gain insight into the scientific field and the geographical area we are dealing with by discussing the history of machine learning and the background of violent crime in the location in question. As the history of crime in the designated area can to some extent be considered a prequel to the related work, the natural inclination would be to discuss machine learning attempts within the same area in the following section; however, since research had to be done to find the needed information, it is useful to know which methods were selected to perform it, so the research methodology section has been placed directly after. As we will see, the latter section mentions the adopted ML techniques, so the part describing those methods, as well as the data retrieval methods, in detail is located in the same section. The related work section directly follows and paves the way for the contribution to the field included in its last paragraph. The contribution will attempt to fill one of the gaps discovered by the analysis of previous work. Once these sections are fully discussed, there is enough material to pose the research questions. The subsequent part of this thesis discusses applying the retrieved dataset to the selected techniques; once the outcomes are analyzed, the conclusion will answer the research questions, provide a direction for future research, and explain how our results might be applied to the real world.



A Brief Summary of the History of Machine Learning


Machine learning (ML), a subset of artificial intelligence (AI), can be defined as an area of computer science that deals with the development of machines capable of simulating, as accurately as possible, human activities such as learning and reasoning [9]. Given that this field is characterized by a wide array of different methods and tools, giving a general definition is not always feasible; moreover, every newly developed algorithmic technique paves the way for further possible uses, inevitably broadening the application field of this discipline. Despite its broadness, it is for the most part safe to say that ML comprises several mechanisms enabling intelligent machines to improve their ability to perform a designated task over time. In other words, computers become capable of improving their performance on specific chores through experience, or more specifically, their design enables them to make decisions based on past trends. The results reached within the fields of ML and AI come from a rather lengthy development process. The first experiments involving intelligent machines date back to the early 50's, when some mathematicians argued that probabilistic techniques could be adopted to build computers able to make decisions based on the likelihood of certain events happening. The British mathematician Alan Turing was among the first to use said methods for the purpose of creating intelligent machines, at a time when AI research in general witnessed both growth and decay, mostly due to lack of funds and distrust in this research field [10]. In 1952, the American computer science pioneer Arthur Samuel invented the first computer program able to learn while running, namely the Samuel Checkers-playing program; the machine improved its gaming skills proportionally to the amount of playing time [11].
Moving to more modern times, ML is used in a vast variety of fields such as image, speech and facial recognition, medical diagnosis, law enforcement, the military, financial services and many others. For instance, the Xbox Kinect, after being released in 2010 and experiencing a hiatus in the field of video gaming, has been at the center of attention among programmers and developers, who have been able to implement ML algorithms that use the tracked 32 human joints to detect specific movements indicating unique behavior. For example, one suggestion to law enforcement is to implement such algorithms in the Kinect to identify a suspect based on their tracked gait pattern [12]. Another example of an ML algorithm showing excellent prediction abilities came in 2017, when the AlphaGo Google AI program managed to defeat the world's best player at the Chinese game of Go [13]. As we have briefly seen, ML has shown, especially throughout recent years, that it can give substantial help in multiple fields engaging with daily tasks.


A Brief History of New York City Crime and Attempts to Crime Reduction

In terms of violent crimes, unlawful activities in New York City have been decreasing over the years. For instance, the number of homicides recorded in 2019 is around a third of what it used to be decades ago. Despite this improvement, crime is an ongoing process, and several geographical areas around the globe already use AI, and more specifically ML, to alleviate the issue as much as possible. Given that the whole city still faces law-violation events, entities like law-enforcement agencies, the real-estate market, the department of commerce and the department of labor would still gain a substantial benefit from being aware of the crime level in the different boroughs.

The crime level in the NYC land area was already showing large numbers during the 70's, when nearly 1700 murders were committed and over 250 weekly felonies took place in the subway system [14]. Authorities tried to implement all possible policies to ensure everyone's safety; for instance, subway wagon availability was cut in half for the purpose of allowing a larger number of passengers to huddle together and, consequently, be safer [15]. The 80's experienced an even higher boost in murders, adding up to exactly 1814 in 1980, while witnessing a population decline of a million people compared to the previous decade. This spike was largely attributed to the crack-heroin infestation. In terms of murders, 1990 saw another record: this time the number of committed murders went as high as 2245 [16]. In 1994, New York City elected Rudolph Giuliani as its new mayor, whose policies are credited by some as the reason behind the city's crime reduction. During those years, felony arrests increased from 50 to 70%, burglar arrests rose by 10% and burglaries decreased to 3.2%. Besides more arrests being performed overall, one of the most famous crime-reduction policies for which the Giuliani administration gets credit is the so-called "get-tough" policy, which consisted in tough policing against lower-level crimes like vandalism, based on the view that lowering the tolerance for such misdemeanors would spread a safer vibe among law-abiding citizens and, consequently, increase law-abiding behavior for more serious offenses [17]. William Bratton was the law-enforcement officer and commissioner whom many credit for the success of the Giuliani administration, as he was the mastermind behind going after smaller violations and aggressive policing, as well as other policies implemented by the same administration, giving rise to a much safer situation throughout the whole city. New York became a model city in terms of crime reduction as its numbers plummeted, reaching in 1998 a new low of roughly 1000 violent crimes; as a matter of fact, after resigning from his position with the NYPD, Bratton became a consultant to law-enforcement agencies and traveled to many other cities in the US as well as worldwide (e.g. Johannesburg, Mexico City, Berlin) to share his expertise with the respective authorities [18].
Whether the policies implemented throughout those years were the sole reason behind, or only part of, the lower crime results may still be considered a gray area: despite NYC witnessing a tremendous crime-rate decline, in those same years cities like Seattle and San Antonio experienced an even greater decrease despite more lenient policing; moreover, the crack epidemic, considered one of the main driving forces behind the rise of crime in the 80's, was falling as well, given that such substances were turned down by the new, younger generation [19]. In 2001, Michael R. Bloomberg was elected as the new mayor of New York City, and to ensure that the crime level would continue its decreasing pattern, he implemented policies that to some extent might be considered harsher than those of the previous administration. The Bloomberg administration became famous for the stop-and-frisk program, which licensed police officers to randomly select, interrogate, detain and body-search individuals solely based on their own suspicion of some form of illegality taking place. Although crime in the whole city kept its declining trend, attributing this to the stop-and-frisk program can be considered quite misleading, since in some instances, according to the NYPD's annual reports, roughly 90% of the suspects turned out to be innocent. This policy fueled controversy as well, given that for the most part areas populated by specific racial groups were the sole targets [20]; furthermore, this practice has been ruled to have been applied unconstitutionally in the city of New York [21]. Given the ambiguous interpretations of these outcomes, a stronger case arises for the benefits of ML methods aimed at investigating crime.


Research Methods

This work discusses the usage of machine learning methods as an alternative approach aimed at accomplishing results in our crime investigation case study.

In order to proceed with the fulfillment of our goals, we considered it important to begin by stating our motivation. To further legitimize our choice, articles from prominent newspapers (e.g. the New York Times) were used in section 1.3 to provide a historical background showing how the apparently successful outcome of some previous attempts to reduce crime turned out to be misleading, thereby reinforcing the idea of exploring new strategies.

Information has been gathered regarding previous cases where machine learning was used to investigate crime. Performing a literature review of related work first of all provided stronger insight into the potential of ML algorithms; furthermore, a better understanding of eventual gaps was achieved. For the purpose of seeking previous work related to our task, Malmö University's library website was primarily used, given that it mainly contains well-respected and reliable work; the database predominantly used is IEEE.

Our objective is to evaluate the way certain social factors affect various classes of crimes; therefore, an experiment is carried out after the gap identification in the related work. Firstly, raw data such as crime data as well as unemployment, average household income and high school graduation rate will in large part be retrieved from the official city hall website, the United States Census Bureau and the New York City Department of Education [22][23][24]. Once the dataset is gathered, it will be fed into the selected machine learning algorithms, namely linear and Poisson regression, whose results will possibly show a relationship between the independent variables (average household income, unemployment and high school graduation rate) and major felony offenses such as larceny, rape, murder and assault. An experiment was chosen as the research method because the practice involved consists in manipulating quantitative variables for the purpose of generating statistically analyzable data.
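As a purely illustrative sketch, the cleaned yearly data might be arranged in a pandas data frame, with the social factors as independent variables and the crime counts as the dependent variable. Every column name and number below is an invented placeholder, not a value from the actual dataset:

```python
# Hypothetical layout of the cleaned dataset; all figures are invented.
import pandas as pd

data = pd.DataFrame({
    "year":         [2014, 2015, 2016, 2017],
    "unemployment": [7.2, 6.8, 6.1, 5.9],           # percent
    "hs_grad_rate": [70.1, 72.3, 74.0, 75.5],       # HSGR, percent
    "avg_income":   [52000, 53500, 55000, 56800],   # INC, USD
    "total_crimes": [10250, 9900, 9400, 9100],      # e.g. MH TOT
})

# Independent variables (the SF set) and the dependent variable.
X = data[["unemployment", "hs_grad_rate", "avg_income"]]
y = data["total_crimes"]
```

A frame in this shape can be handed directly to the statistical fitting routines discussed in the following sections.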


Linear Regression: Description, Outcome Evaluation and Validity

Linear regression is a machine learning technique that represents the relationship between the selected variables, respectively dependent and independent, by fitting a linear equation to the chosen data; for example, this method could be useful for relating a student's grades to the hours spent studying. As we will see in later sections, linear regression was deemed suitable based on the scatterplot representing our data, which in some instances shows an approximately linear pattern. Concerning the validity of the outcome, some chosen variables could pose a threat: observed cases that at first glance seem strongly correlated might still not imply a strong relationship between the chosen variables. For instance, the amount of weight loss does not always increase proportionally with the time spent working out, as factors unbeknownst to the experimenter could affect this correlation (e.g. speed, metabolism, type of work-out, diet, muscle mass). Additionally, it is imperative that some 'common sense' is used in the choice of variables, since in the real world numerous examples of different categories may show a strong relationship once tested, without de facto having anything to do with each other. If, for instance, our choice leaned towards the number of ice creams consumed versus the number of bikers crossing a selected road, and the experiment were performed in the summertime, it is highly possible we would witness a strong positive correlation between these two variables, even though eating ice cream obviously does not relate to getting on a bike. Information included in the motivation section, as well as in the literature review, explains the relationship strength among the variables selected for the experiment, consequently ruling out the aforementioned threat. Another threat to validity is the presence of multicollinearity: in multiple regression, the independent variables must not be directly related, or more specifically, one variable cannot be a subset of the other. If, for instance, the selling price of various real estates must be predicted, and one of the chosen independent variables is the square footage, choosing the number of rooms as another variable would overlap the first one, given that the square footage is likely to increase in direct proportion to the number of rooms. A linear relationship needs to be assumed as well, which can be checked with the scatterplot; we need to keep in mind, however, that if the selected variables do not seem to be linearly related, this does not imply the absence of a relationship in general.
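As a quick, hedged illustration of the multicollinearity concern (using the square-footage/room-count example with invented numbers), the pairwise correlation between two candidate predictors can be inspected before fitting:

```python
# Toy multicollinearity check between two candidate predictors;
# the figures are invented for the sketch.
import numpy as np

square_footage = np.array([800, 1200, 1500, 2000, 2400], dtype=float)
num_rooms      = np.array([2, 3, 4, 5, 6], dtype=float)

# An absolute correlation near 1 warns that the two predictors largely
# duplicate each other and should not both enter the same regression.
r = np.corrcoef(square_footage, num_rooms)[0, 1]
```

A value of |r| close to 1 here would argue for dropping one of the two variables from the model.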

In order to have an explicit idea of the relationship strength for a single independent variable, a scatterplot is the most commonly used tool in linear regression; it visually shows the presence or absence of correlation, which can be measured with what is known as the correlation coefficient. Its value lies between -1 and 1 and, once calculated, provides an interpretation of the relationship, which can overall be defined as strongly positive (coefficient close to 1), strongly negative (coefficient close to -1) or weak (coefficient around 0). The following equation


represents the linear regression model:

\[
y = \beta_0 + \beta_1 x_1 + \varepsilon \qquad (1)
\]

where x and y are respectively the independent and the dependent variable, \(\beta_1\) is the slope, also called the regression coefficient, representing the strength of the rate of change in y, \(\beta_0\) is the y-intercept (the value of y where x = 0), and \(\varepsilon\) is the error term, reflecting that it is highly unlikely for this equation to fit the model perfectly. The slope and y-intercept values are respectively calculated the following way:

\[
\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1\bar{x} \qquad (2)
\]

and the correlation coefficient is calculated with the following formula:

\[
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}} \qquad (3)
\]

with \(\bar{x}\) and \(\bar{y}\) being respectively the means of our independent and dependent variable, and n corresponding to the sample size.
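The closed-form quantities in (2) and (3) can be computed directly; the following sketch uses a small invented sample rather than the thesis data:

```python
# Slope, intercept and correlation coefficient computed directly
# from the closed-form formulas (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
beta0 = y_bar - beta1 * x_bar                                         # intercept
r = np.sum((x - x_bar) * (y - y_bar)) / np.sqrt(
    np.sum((x - x_bar) ** 2) * np.sum((y - y_bar) ** 2)
)
```

For this near-linear toy sample, r comes out close to 1, matching the visual impression a scatterplot would give.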

In the case of simple linear regression (one independent variable) with not too many data entries, the problem can be solved manually, as substituting our numbers into the above formulas is not a particularly long and tedious procedure. However, if the dataset is particularly large, characterized by entries with sizable numbers, and/or there is more than one independent variable, we are dealing with multiple linear regression and therefore with matrix algebra, whose computation is for the most part only feasible numerically. The equation of multiple linear regression in standard form can be considered an 'extension' of eq. (1), and is written in the following form:

\[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i \qquad (4)
\]

where k corresponds to the number of chosen independent variables, each with n entries. For simplicity, (4) can be rewritten in matrix form, as our dataset can be represented in a way that 'resembles' matrix operations.


\[
y = X\beta + \varepsilon \qquad (5)
\]

where

\[
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} x_{11} & \dots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nk} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix} \qquad (6)
\]

and we call \(\beta\) the vector of regression coefficients, each entry of which represents the mean change of the target variable for a unit increase in the corresponding predictor; it is calculated the following way:

\[
\beta = (X^{T}X)^{-1}X^{T}y \qquad (7)
\]
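As a sketch of the matrix estimator \(\beta = (X^{T}X)^{-1}X^{T}y\) (with a column of ones prepended so the intercept is estimated alongside the other coefficients; all numbers are invented), the computation takes only a few NumPy lines:

```python
# Ordinary least squares in matrix form on toy data with two predictors.
import numpy as np

X_raw = np.array([[1.0, 10.0],
                  [2.0, 12.0],
                  [3.0, 15.0],
                  [4.0, 11.0],
                  [5.0, 14.0]])
y = np.array([12.0, 16.1, 21.0, 23.9, 28.2])

# Design matrix: a leading column of ones absorbs the intercept beta_0.
X = np.column_stack([np.ones(len(X_raw)), X_raw])
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.lstsq is the numerically preferred route; it yields the
# same coefficients without forming the explicit inverse.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice, statistical packages avoid the explicit inverse for numerical stability, which is one reason the thesis delegates the fitting to library routines.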

In multiple linear regression, in order to assess the reliability of our results, we no longer look at the correlation coefficient, but evaluate our measurement with the coefficient of determination R², which is obtained by dividing the regression sum of squares by the total sum of squares; it measures the extent to which the predicted values are close to the actual ones, varying from 0 to 1. R² is calculated the following way:

\[
R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} \qquad (8)
\]

where ŷ is the predicted value. When R² = 1, the implication is that the values observed for the dependent variable can be 100% explained by the independent variables; on the other hand, an R² close to 0 describes the relationship as weak. Note that in multiple linear regression the R² value can be deceiving, as it may be inflated by the sheer number of predictors, which is why most statistical software is equipped with an 'adjusted R²' for the purpose of taking care of this issue. For measuring the statistical significance of our results, the p-value is a useful point of reference, as it determines whether the relationships found in the observed sample exist in the larger population as well. The main purpose of the p-value is to display the existence of sufficient evidence to conclude that the tested results can be generalized to the whole population. The significance level is the threshold within which the p-value must lie in order for statistical significance to be maintained; when, as in several examples, a 95% confidence level is used, the corresponding significance level is 5%, so in order for the result to be considered statistically significant, the p-value must not be larger than 0.05.
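Equation (8) and the adjusted-R² correction can be sketched as follows; the observed and predicted values are invented, standing in for the output of a fitted model:

```python
# R^2 per equation (8) and the adjusted R^2 on toy observed/predicted values.
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # observed values
y_hat = np.array([3.2, 4.9, 7.1, 8.8, 11.0])   # predicted values
n, k = len(y), 2                               # sample size, no. of predictors

r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
# The adjustment penalizes R^2 for every extra predictor in the model.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

The adjusted value is always at or below the raw R², which is why it is the safer figure to report in the multiple-regression setting.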

In statistical terminology, we say that the p-value is designed to either reject or fail to reject the null hypothesis H_0 and eventually accept the alternative hypothesis H_a. In the case of linear regression, H_0: β = 0 states that the coefficients β in (1), or in (4) in the case of multiple variables, are equal to zero, so the outcome of the performed experiment is not proven to be applicable to population samples other than the one for which the relationship has been detected. The alternative scenario happens when we manage to reject the null hypothesis in favor of H_a: β ≠ 0, implying the existence of a positive or negative correlation, depending on the sign of the slope, throughout the sample space, which varies with the level of significance. Note that for the p-value to be meaningful, the dataset has to be considerably large (n > 30); additionally, the American Statistical Association has argued that the p-value alone does not provide a satisfactory measure concerning models or hypotheses [25].


Poisson Regression: Description, Outcome Evaluation and Validity

Poisson regression is an alternative linear model. Besides the linearity aspects, it shares some traits with multiple linear regression, since it deals with more than one independent variable. The observed outcomes of the dependent variable follow a distribution called the Poisson distribution, which describes events occurring at a constant rate within a specified time range. The reason behind choosing this additional method is that our dependent variable is a 'count'; in other words, our response-variable dataset is characterized by non-negative integers representing a certain number of events occurring within a designated time frame. Some examples that potentially follow such a distribution are the number of cars crossing a bridge every day, or how many people enter a shopping mall on a Saturday. On the other hand, if we consider the number of students walking through the main entrance of a college campus over the course of an entire week, this value will probably not be Poisson distributed, since what we will likely see is not a constant flow, but fewer students between lecture hours and more students during weekdays as opposed to weekends; for this reason, the specific nature of various cases could represent a threat to validity.

In order to calculate the probability that y events following a Poisson distribution happen within a time t, the following formula is used:

\[
P(y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y \in \mathbb{Z}_{\ge 0} \qquad (9)
\]

where λ is the number of events occurring per time t. If the value of λ, instead of being constant, changes from one observation to another, the assumption is that this variation occurs due to the influence of outside factors, in other words, independent variables. In this case, similarly to the predicted outcome in the linear regression equation, λ is calculated the following way:

\[
\lambda_i = e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}} \qquad (10)
\]

and (9) leads to the joint probability function, also called the likelihood function, for the i counts:

\[
P(y_i \mid x_i) = L(y, \beta) = \prod_{i=1}^{n} \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} \qquad (11)
\]

where β in (10) is the vector of regression coefficients which, as in linear regression, represents the expected rate of change; unlike in linear regression, however, each β represents a percentage change in the target variable for a unit change in the predictor. After substituting the value of λ from (10) into (11), taking the natural logarithm on both sides and setting the derivative with respect to β equal to zero, we obtain the maximum likelihood estimation (MLE) condition for β:

\[
\sum_{i=1}^{n}\left(y_i - e^{x_i\beta}\right)x_i = 0 \qquad (12)
\]

Solving this equation manually would not be a feasible process; therefore a numerical approach, enabling the process to be significantly more agile, is necessary here as well.
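One possible numerical route is a minimal Newton-Raphson iteration on synthetic data (a sketch, not the procedure used in this thesis, which delegates the fitting to statsmodels); each step moves β toward the root of equation (12):

```python
# Newton-Raphson for Poisson regression on synthetic count data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x])          # intercept + one predictor
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))        # counts from a known model

# Initialize the intercept at the log of the mean count, slope at zero.
beta = np.array([np.log(y.mean()), 0.0])
for _ in range(25):
    mu = np.exp(X @ beta)                     # lambda_i = e^{x_i beta}
    grad = X.T @ (y - mu)                     # left-hand side of (12)
    hess = X.T @ (X * mu[:, None])            # Fisher information matrix
    beta = beta + np.linalg.solve(hess, grad)
```

At convergence the gradient, i.e. the left-hand side of (12), is driven to zero, and the recovered coefficients sit near the generating ones.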

To measure the extent to which the observed data matches the expected values, also known as the goodness of fit, the deviance is used instead of the R² of multiple linear regression. The deviance formula is the following:

\[
D = 2\sum_{i=1}^{n} y_i \log\left(\frac{y_i}{\hat{y}_i}\right) \qquad (13)
\]

Note that, in order to qualify the Poisson regression as a good model, the ideal case is for the deviance to be equal to or less than the difference between the number of observations and the number of parameters; in the case of large values, this model might have to be rejected. Another threat to validity that is typical of this method is represented by the over-dispersion or under-dispersion of the dataset. Poisson regression assumes a constant flow, or an 'even' dispersion if you will, implying that the mean µ and the variance of the dataset are inferred to have the same value. The variance (σ²) measures the spread between the values of the dataset. The mean and the variance are calculated in the following way:

\mu = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}. \tag{14}

When σ² > µ we have a case of over-dispersion; when σ² < µ, on the other hand, we have a case of under-dispersion. If, after experimenting, one of such threats should appear, another ML approach should be suggested.
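The equi-dispersion assumption can be checked numerically; a small sketch on synthetic counts (not the thesis dataset):

```python
import numpy as np

def mean_and_variance(data):
    """Mean and (population) variance as in equation (14)."""
    data = np.asarray(data, dtype=float)
    mu = data.mean()
    sigma2 = ((data - mu) ** 2).mean()
    return mu, sigma2

rng = np.random.default_rng(1)
# Poisson counts are equi-dispersed: the variance stays close to the mean.
mu_eq, var_eq = mean_and_variance(rng.poisson(5.0, size=10_000))
# Negative-binomial counts are over-dispersed: the variance exceeds the mean.
mu_ov, var_ov = mean_and_variance(rng.negative_binomial(2, 0.3, size=10_000))
```

Comparing the two returned pairs shows the difference immediately: the Poisson sample has a variance-to-mean ratio near 1, while the negative-binomial sample does not.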



Numerical Methods

When using the above-mentioned methods, the programming language used is Python, and the libraries mainly relied upon are pandas, statsmodels, NumPy, Matplotlib, OpenCV and openpyxl. Python is a free, open-source, general-purpose programming language, released in 1991 and known for its simple syntax improving code readability [26]. Beyond being a general-purpose programming language, Python has extensive capabilities in the scientific computing field [27, p. 1].

• Pandas is a free, open-source library able to read Excel or CSV documents and create Python data frames, objects composed of rows and columns which are practical to deal with when applying statistical algorithms [28, pp. 4-5].

• Statsmodels is a Python library providing tools to estimate several statistical models. Besides statistical methods like linear and Poisson regression, this module is equipped with a command called summary which provides useful information like the regression and correlation coefficient values, confidence intervals and the number of observations [29].

• NumPy is a scientific computing library equipped, besides numerous other features, with array commands. In this case, this library is mainly needed to implement the correct numbering on the x-axis in the generated graphs [28, pp. 4-5].

• Matplotlib is a library useful for the creation of graphs, besides several other features such as static as well as animated visualizations [28, pp. 5-6].

• OpenCV is another free, open-source library, suited to work with languages like Python or C++. Its main functions deal with images as well as videos, and its algorithms are capable of accomplishing tasks like face and object recognition, tracking movements on camera and detecting the position of a specific object in an image; the latter capability is the most useful to our project, since we are interested in graph values [30, pp. 1-2].

• Openpyxl is a Python library suited to read and write Excel documents [31, pp. 249-255]. In this project, it was mainly used to transfer the newly generated data into a new Excel file.


Related Work and Contribution

As we have seen, throughout the past several years, attempts at crime reduction have been made by implementing policies; however, an effective strategy is to first understand the root cause behind certain unlawful actions, and subsequently to establish policies based on such analysis. To zero in on the root cause of crime, applying ML algorithms to numerical data representing social factors can be considered an excellent way of developing insight into how crime relates to such aspects.

The Poisson distribution has shown itself to be a convenient strategy concerning various aspects of criminal justice and criminology. For instance, around the year 1820, this method was adopted to analyze the rates of convictions in France. Other examples within the criminal justice system that benefit from this technique are the projection of prison populations and the estimation of the criminal population size [32, p. 23]. Several works involving ML have been done with NYC data, as well as with data from other cities around the globe. For instance, [34], after investigating the possibility of predicting future crime for specific places in cities like San Francisco using linear regression, showed that the most reported crime occurs in the central region; however, to obtain improved outcomes, the authors suggested analyzing other ML methods too. Concerning the root cause, [35] discovered through linear analysis that education and dropout rate were highly correlated with the crime level in Salinas, California. Another example can be taken from [36], who used similar strategies to answer questions related to the general safety of NYC throughout the past years, the most dangerous boroughs, the safest months of the year and other time-related questions; some of the results suggested by the experiments were that the coldest months are the lowest in crime and that Manhattan has the largest number of grand larcenies. In this instance, the author suggests improving results by engaging in a more in-depth analysis of blocks and streets. Another interesting case is discussed in [37], where linear regression was used to predict crime trends in Bangladesh. The authors discovered that different types of crime like robbery, kidnapping and theft were successfully modeled by this technique; specifically, they concluded that the crime rate rose proportionally with population growth, and they suggested that future research focus on crime location. In [38], crime was analyzed by taking into account independent variables like population size and the distribution of inhabitants by age. Three different ML techniques were compared to evaluate which one performed best, namely linear regression, additive regression and decision stump. Linear regression performed better than the other two algorithms, while decision stump performed quite poorly; to improve the field of crime prediction, the authors suggested incorporating methods like data mining. Another instance covering this field is found in [39], where car theft in Malaysia was analyzed via negative binomial regression, and the authors discovered that the areas with the highest population density were also the ones most prone to car theft.
In this project, inspired by the analyzed previous work, we aim to investigate the seven major felony offenses committed in the New York City land area in relation to explanatory variables such as unemployment, high school graduation rate and household average income; furthermore, predictors and predicted variables will be based on how each of the five New York City boroughs, considered as wholes, affect one another, which, as far as our knowledge goes, has never been experimented with before. Besides minor exceptions, our dependent-variable focus will revolve around the Manhattan borough, since it is the most central one and the borough with the highest average household income [40], in addition to being the one with the highest population density [41] and, as the graph generated with the Python command matplotlib.pyplot.plot in figure 1 shows, the lowest unemployment rate together with Queens.

Figure 1: New York City unemployment rate 2005-2019, years VS percentage, source:labor.ny.gov


Research Questions and Expected Results

As briefly mentioned above, the aim will be directed towards the explanatory/investigative power of the selected independent variables, and the chosen machine learning algorithms will give us an idea of the strength of the correlation between dependent and independent parameters.

Given that our choice for the dependent variable is crime and our independent variables will be unemployment, high school graduation rate and average household income, the questions that we will attempt to answer are:


• How do unemployment, high school graduation rate and average household income affect the crime level?

• Which one of the selected algorithms is more suited to investigate crime?

• After evaluating the performance of the algorithms, is there room for improvement?

In terms of what is expected, possibilities might vary between a somewhat robust positive/negative correlation and a similarity in patterns between the predicted and actual data. N.B. There is a possibility that, in some cases, the shown correlation is weak or non-existent, depending on the kind of crime, the chosen environmental factor and/or location, as well as the selected algorithm possibly being ill-suited for the designated task.


Crime Analysis: Research Methods

In order to discuss crime in all 5 boroughs, after analyzing what was discovered in the motivation section concerning the relationship between crime and social factors, and after a literature review aimed at discovering eventual gaps in previous work, the following independent variables were used for each borough: unemployment, high school graduation rate and median household income. Unlike the unemployment rate and the crime data, explicitly available in Excel format, special methods had to be developed to gather the rest of the entries. N.B. Said additional methods might produce outcomes with a very small margin of error. Once the data collection was completed, an experiment was performed by substituting the aforementioned dataset into the selected ML algorithms.


Data Collection and Cleaning

The NYC crime dataset can be easily retrieved from the New York City Hall official website [22], and the unemployment rate numbers are listed by the New York State Department of Labor [23]. On the other hand, the high school graduation rate was partially observed graphically, up to the year 2015, from a picture posted by the Furman Center, a branch of New York University's school of law which, among its missions, has the one of presenting NYC demographic data [42]. The median household income was also graphically displayed on the official website of the US Census Bureau [40]. The rest of the data (2015-2019) for the high school graduation rate was available on the NYCDOE official website [24]. In order to obtain our dataset from a graph in PNG format, a mouse-callback mechanism from the cv2 library was used, checking for the cv2.EVENT_LBUTTONDOWN event, which makes it possible to extract the exact location of various points of an image any time a mouse click is performed on it. Given that the location is given in row-column coordinates, an algorithm to obtain the right direction of the y-axis needed to be created, as the 'row parameter' increases from top to bottom; in addition to solving the y-axis direction issue (the column direction was not an issue, since this parameter increases from left to right), the above-mentioned cv2 event's outcome had to be re-scaled in order to obtain the dataset that we were interested in. The x and y coordinates were obtained by rescaling the row and column output of the image shape with the following formulas:

x_coord = (number of units in the x-axis / column coordinate) × mouse click + starting x value;

y_coord = −(number of units in the y-axis / row coordinate) × mouse click + ending y value
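As a hypothetical worked example of this rescaling (the axis spans, pixel counts and reference values below are invented, not taken from the actual graphs):

```python
def rescale_click(col_click, row_click):
    """Map a pixel click to graph units, mirroring the two formulas above.

    Assumed example image: the x-axis covers 15 years over 600 pixel
    columns starting at year 2005; the y-axis covers 10 percentage
    points over 400 pixel rows, with the top of the plot at 12 %."""
    x_units, x_pixels, x_start = 15, 600, 2005
    y_units, y_pixels, y_end = 10, 400, 12
    x = (x_units / x_pixels) * col_click + x_start
    y = -(y_units / y_pixels) * row_click + y_end
    return x, y

# A click at pixel column 200, row 100 maps to (year 2010, 9.5 %):
# the minus sign compensates for rows growing downward.
point = rescale_click(200, 100)
```

The negative factor on the row coordinate is what flips the top-to-bottom pixel direction into the bottom-to-top direction of the graph's y-axis.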

file_name = r"graph.png"
img = cv2.imread(file_name)
rows, columns = img.shape[:2]

def click_event(event, x, y, flags, param):
    # the placeholder values below are read off the axes of the analyzed graph
    if event == cv2.EVENT_LBUTTONDOWN:
        x_coord = (number_of_units_in_the_x_axis / columns) * x + starting_x_value
        y_coord = -(number_of_units_in_the_y_axis / rows) * y + ending_y_value
        print(x_coord, y_coord)

cv2.imshow("graph", img)
cv2.setMouseCallback("graph", click_event)
cv2.waitKey(0)

Note that, as already stated at the beginning of this section, although the outcomes from this method might not reach 100% accuracy, the error that might be present is so minimal that it can be considered negligible; therefore, such inaccuracies have no influence on the pattern of the analyzed data. Additionally, the median household income for the year 2019 has not yet been made available by the proper entities, therefore entries regarding that year have been estimated.

The first action taken was using pandas to implement the following command in Python, able to read the Excel file containing the following crimes that took place across the city since the year 2000: Murder, Rape, Robbery, Burglary, Assault, Larceny and Motor-vehicle theft.

data_crime = pd.read_excel(r'major-felony-offenses-by-precinct-2000-2019.xls')

Since the data is subdivided into police precincts, but the aim of this paper is directed towards the 5 boroughs, the various precincts needed to be checked against the New York City Hall official website [22] to see their borough correspondence. The following commands had to be used to sum the entries of the respective precincts into the same borough and to export them into the newly created Excel file briefly mentioned in the methods section. This function iterates through the whole document by selecting specific crimes according to the value of the 'start' parameter in the for-loop. The 'step' parameter has always been set equal to 8, since the document contains seven different kinds of crime plus an entry for the total of all crimes per precinct.

def total_crimes_manhattan(start, end, step):
    list_1 = []
    for i in range(start, end, step):
        list_1.append(data_crime.iloc[i, first_yr_position:last_yr_position])
    return list_1


After obtaining from the above function a number of lists equal to the number of precincts in the borough of interest, the per-year sum of all entries was performed with a list comprehension applied to the 'zipped' lists, using the zip command to aggregate the items in our lists.

zipped_list = zip(list_0, list_1, list_2, list_3, ...)
nr_total_murders_manhattan = [sum(i) for i in zipped_list]
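As a toy illustration of this zip-and-sum step (the precinct counts below are invented, not the thesis data):

```python
# Hypothetical yearly murder counts for three precincts of one borough.
list_0 = [3, 5, 2]
list_1 = [1, 0, 4]
list_2 = [2, 2, 2]

# zip() groups the i-th entries of all lists together, so summing each
# group gives the borough-wide total per year: [6, 7, 8].
borough_totals = [sum(year) for year in zip(list_0, list_1, list_2)]
```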

At this point, the newly generated data is exported to a new Excel document with various commands from openpyxl. After opening this document, a for-loop was needed to iterate through the column cells, and with the cell command the entries were filled with said dataset.

newdocname = 'newdoc.xlsx'
wb = openpyxl.load_workbook(newdocname)
ws = wb["Sheet 1"]

for i in range(1, len(nr_total_murders_manhattan)):
    wcell1 = ws.cell(i, column_number)
    wcell1.value = nr_total_murders_manhattan[i]


Total Crime Trend

As stated in the previous paragraphs, New York City crime in general has been declining over the years. Before analyzing the independent variables that might or might not have shaped this trend, we use pandas to sum all the columns from the newly generated Excel file containing the total crime for each borough, and then plot them with respect to the dates in years.

dates = list(range(2005, 2020))
str_dates = [str(i) for i in dates]

tot_crime = (data_crime['MH_TOT'] + data_crime['BK_TOT'] + data_crime['BX_TOT']
             + data_crime['QN_TOT'] + data_crime['SI_TOT'])

plt.xticks(np.arange(0, 15), str_dates, rotation=30)
plt.plot(tot_crime)


Figure 2: New York City seven major crimes 2005-2019

As we can tell from figure 2 right above, the decreasing pattern is clearly visible.


Linear Regression Analysis

To get a general idea, our first attempt consists of testing a few cases with simple linear regression; we will try the aforementioned independent variables individually, with data across the 5 boroughs, to see if a strong correlation is detectable. We will begin by plotting our data in a scatterplot and drawing the best-fitting line (see eq. 1 and 2) with the following code:

plt.plot(independent_variable, dependent_variable, 'o')
m, b = np.polyfit(independent_variable, dependent_variable, 1)
plt.plot(independent_variable, m*independent_variable + b)

After substituting selected samples from our data in the above commands, the following graphs are obtained:


[Figure 3, panels (a)-(f)]

Figure 3: Simple linear regression; the blue dots represent the actual values and the orange line represents the estimated ones


Judging by our images, a vague linear trajectory can be spotted in some cases; moreover, we can see from said pattern that certain independent variables have a larger impact on the crime level than others. If, for instance, we take a close look at figures 3e and 3f, the unemployment rate does not seem to have much of an effect on the total crimes committed, but the rest of the figures do show a stronger correlation between the selected variables; for instance, the Brooklyn high school graduation rate in figure 3d seems to hold a quite strong inverse correlation with the crimes committed in the same borough, and figure 3b shows that the average household income in Brooklyn is somewhat correlated with the Staten Island total crime. In order to have a clearer view of what we observed, the table below lists some of the experimented variables with the correlation coefficient (eq. 3) obtained with the stats.pearsonr command. The strength of the correlation coefficient has been described according to the guide found in [43].

X        Y        corr-coeff  strength     sign
QN UNEM  MH TOT   -0.25       weak         negative
BK INC   MH TOT   -0.67       strong       negative
BX HSGR  MH TOT   -0.87       very strong  negative
MH INC   MH TOT   -0.73       strong       negative
MH HSGR  MH TOT   -0.90       very strong  negative
MH INC   BK TOT   -0.89       very strong  negative
SI INC   BK TOT   -0.73       strong       negative
SI INC   BK TOT   -0.76       strong       negative
SI UNEM  SI TOT    0.17       very weak    positive
BX INC   BX TOT   -0.56       moderate     negative
SI UNEM  MH TOT   -0.37       weak         negative

Table 1: Correlation Coefficient Table
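The coefficients in table 1 were obtained with stats.pearsonr; the same quantity can be reproduced directly from the definition in eq. 3, shown here with invented numbers rather than the thesis data:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, the value stats.pearsonr
    returns as its first element."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm * ym).sum() / np.sqrt((xm**2).sum() * (ym**2).sum()))

# Invented example: a rising graduation rate paired with falling crime
# counts yields a strong negative coefficient.
grad_rate = [58, 60, 63, 66, 70, 74, 77]
crime_counts = [36795, 34740, 32392, 31256, 28111, 26847, 26104]
r = pearson_r(grad_rate, crime_counts)
```

With these invented series the coefficient lands close to −1, which table 1's strength guide would classify as a very strong negative correlation.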

So far, we have been able to get an idea of the impact of our independent variables. However, to gain a different perspective and, eventually, a higher accuracy, we ought to adopt methods including more than one independent variable. From the Python library statsmodels, we use the ols command to work with multiple linear regression. Let us now compare the various kinds of crime with all of the independent variables together. The predict command was used here to generate the predicted values (eq. 4), given the independent variables.
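A minimal sketch of this fit-and-predict step, on invented numbers rather than the thesis dataset; np.linalg.lstsq computes the same least-squares coefficients that statsmodels' ols(...).fit() produces, and X @ beta plays the role of the predict command:

```python
import numpy as np

# Invented yearly values for two explanatory variables and one crime series.
unemployment = np.array([9.1, 8.7, 8.0, 7.2, 6.1, 5.3, 4.8])
grad_rate = np.array([58.0, 60, 63, 66, 70, 74, 77])
crimes = np.array([36795.0, 34740, 32392, 31256, 28111, 26847, 26104])

# Design matrix with an intercept column, as in the regression equation.
X = np.column_stack([np.ones_like(grad_rate), unemployment, grad_rate])
beta, *_ = np.linalg.lstsq(X, crimes, rcond=None)
predicted = X @ beta           # the fitted values the model predicts
residuals = crimes - predicted
```

The residuals are, by construction, orthogonal to the columns of the design matrix, which is a quick way to check that the fit was computed correctly.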

Figure 4: Manhattan VS Brooklyn, Brooklyn VS Brooklyn


Figure 6: Manhattan VS Bronx, Bronx VS Bronx

Figure 7: Manhattan VS Staten Island, Staten Island VS Staten Island


Figure 9: Manhattan VS Bronx, larceny & assault VS murder & rape

Figure 10: Manhattan VS Brooklyn, larceny & assault VS murder & rape

All of the figures above have displayed interesting results. The outcomes obtained unequivocally point to the idea that the social factors in every single borough affect the crime level in Manhattan, and vice versa, with some predicted numbers being very close to the actual ones (see table 2), exception made for Brooklyn, the Bronx and Staten Island, which sometimes show predicted numbers a bit off from the actual ones; moreover, figures 6, 7 and 8 show in some instances that, by omitting the category of homicides and sexual assaults, the correlation proves to be at least as strong as when all crimes are factored in; on the other hand, including these two categories while excluding the rest does not show any relationship in the applied case. Interestingly enough, the first graphs of figures 7 and 8 confirm unemployment to have lower relevance compared to the other two variables, which is what was seen above with the simple regression. The table below shows all the predicted outcomes compared to the actual number of crimes committed in Manhattan, and as we can see, plenty of values are fairly close.


36795  34420.97  34599.01  36036.88  34702.02  35153.08
34740  34555.10  34557.40  35117.54  34152.26  34893.76
32392  33503.62  33799.75  32944.17  33961.99  33488.25
31256  32318.75  32253.06  30447.89  31815.41  31049.06
28111  27647.62  27773.26  26791.00  29357.38  26516.12
26847  26709.04  27470.33  26696.75  26642.46  27849.07
26104  27265.88  27209.75  27265.15  26725.35  27885.78
27397  26650.90  26550.16  27220.53  26078.12  27270.89
27527  26447.84  26187.43  26887.26  26767.98  26175.93
25955  27543.98  27038.74  27607.11  26610.52  26381.12
26823  28180.98  27645.65  27808.36  28157.64  27849.37
26612  26998.80  26527.31  26384.83  26814.17  27219.09
25949  26644.92  26355.86  26541.42  26305.59  25584.42
26780  25823.83  26263.14  26495.82  26324.27  26466.17
27103  25678.71  26160.09  26146.22  25975.77  26608.83

Table 2: Linear Regression Table: Manhattan total crimes actual number vs predicted numbers after implementing all independent variables from the rest of the boroughs

As we can see from the generated graphs, as well as from the predicted values listed in table 2, the linear regression did a very good job of producing a visual fit for our dataset; but to make sure that this is a reliable technique for our case, a close look at the adjusted-R² coefficient needs to be taken.


adjusted-R2  MH SF  QN SF  BK SF  BX SF  SI SF
MH TOT       0.89   0.85   0.88   0.92   0.88
QN TOT       0.92   0.94   0.91   0.98   0.90
BK TOT       0.82   0.87   0.82   0.94   0.82
BX TOT       0.76   0.84   0.71   0.82   0.72
SI TOT       0.57   0.61   0.60   0.72   0.56

Table 3: adjusted R-squared after mixing and matching all 5 boroughs' total crimes VS all social factors

Table 3, generated with the summary command and showing, among other statistical features, the adjusted-R² value, confirms what we have seen in figures 1-7, that is to say: Manhattan, Queens and sometimes Brooklyn show a strong correlation with each other's social factors; on the other hand, the Bronx and Staten Island seem to be less correlated in the way they are affected by Manhattan's social factors. One possibility that might explain this outcome could be related to the lowest household income for the former, and the geographical location for the latter.


Poisson Regression Analysis

Now that we have used multiple linear regression to analyze the relationship between several independent variables and the dependent variable, we are ready to compare the results with another ML algorithm we previously mentioned, namely the Poisson regression. In this case, we will analyze our dataset by making the same comparisons that we made for the multiple linear regression, and just as above, a command from the statsmodels library will be used, namely the poisson command. After implementing our variables, we obtained the following graphs:


Figure 11: Manhattan VS Brooklyn

Figure 12: Manhattan VS Queens


Figure 14: Manhattan VS Staten Island

The figures above show results that look very much like the ones for the linear regression; as a matter of fact, the predicted numbers from the two regressions are almost the same, as table 4 shows.


MH TOT  QN ALL    BK ALL    BX ALL    SI ALL    MH ALL
36795   32948.73  32126.79  32926.57  32392.53  33121.54
34740   34183.78  30777.23  33869.17  33391.00  34291.60
32392   33274.43  30366.60  33927.35  32630.14  33289.13
31256   33235.81  29820.44  31764.97  33121.10  31949.05
28111   29437.09  30276.31  28184.64  29466.81  28320.16
26847   26991.00  30649.67  27202.51  28606.63  26973.88
26104   27423.48  30616.19  27713.16  28730.31  25994.02
27397   27285.18  29539.61  28879.53  28499.18  26987.31
27527   27087.98  28713.29  28444.91  27993.88  27143.26
25955   27180.90  28349.14  27708.41  26108.16  26472.27
26823   27949.30  27047.73  27205.56  26671.69  27398.63
26612   25816.77  25865.65  26573.94  26114.84  26570.59
25949   25564.18  25310.38  25984.86  26018.45  26396.71
26780   26286.53  25641.44  25603.71  25851.66  24856.25
27103   25728.49  25279.73  24407.26  24800.98  25078.14

Table 4: Poisson Regression Table: Manhattan total crimes actual number vs predicted numbers after implementing all independent variables from the rest of the boroughs

Now that the predicted numbers have been calculated, the deviance needs to be calculated too in order to evaluate the performance of this method.
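The deviance in eq. (13) can be computed directly from actual and predicted counts; a sketch using the first three Manhattan totals and the corresponding MH ALL predictions from table 4:

```python
import numpy as np

def poisson_deviance(actual, predicted):
    """Deviance as in equation (13): D = 2 * sum(y_i * log(y_i / yhat_i))."""
    y = np.asarray(actual, dtype=float)
    y_hat = np.asarray(predicted, dtype=float)
    return float(2.0 * np.sum(y * np.log(y / y_hat)))

actual = [36795, 34740, 32392]              # Manhattan totals (table 4)
predicted = [33121.54, 34291.6, 33289.13]   # MH ALL predictions (table 4)
d = poisson_deviance(actual, predicted)

# A perfect fit gives zero deviance; larger values indicate a worse fit.
```

Note this illustrates the mechanics on a three-row slice only; the values in table 5 come from the full fifteen-observation series.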


MH SF 449.05 581.18 456.47 343.53 455.73

QN SF 389.48 268.88 382.76 104.92 453.23

BK SF 1045.2 764.33 1032.6 343.64 1051.5

BX SF 189.92 122.36 226.19 145.21 215.80

SI SF 187.4 168.74 171.88 124.48 188.27

Table 5: Deviance after mixing and matching all 5 boroughs' total crimes VS all social factors


The deviance values listed in table 5 are all large compared with the difference between the number of observations and the number of parameters, which is equal to 12; for this reason, despite obtaining a more than satisfactory visual outcome, it is safe to conclude that this model has not performed very well.




Limitation and Future Research

Despite the motivation section and the analysis of previous work building the foundation for a reasonable selection of all independent variables, the average household income could pose some limitation, as there is a possibility that the salary increase over the years did not keep up with inflation. This aspect could pose a threat to validity, and in order to mitigate it, a salary adjustment to the inflation rate might not suffice, as other aspects such as income tax rate, available services, wealth distribution and the prevailing wage rate could potentially further affect this measurement. A strategy that should be considered in terms of future research could be that of taking into consideration as many aspects affecting income as possible, in order to obtain better accuracy.


Results Evaluation, Real World Application and Future Research

As we have seen from both methods, at least two of the three selected independent variables, namely the high school graduation rate and the average household income, have shown a strong explanatory potential, given the strength of the displayed models. Moreover, we have observed that not all of the selected dependent variables are successfully modeled by our choice of explanatory variables. Tables 2 and 4, and the bar chart representing the dataset taken from these tables (fig. 15), showed that both models performed equally well, since the results generated were in some instances very close to each other and to the actual numbers.


Figure 15: Manhattan actual crime VS Manhattan predicted crime based on all independent variables considering each borough, both ML algorithms are compared

Despite the displayed outcome, after evaluating both algorithms according to their R² and deviance, we are able to deduce that the linear regression showed a much higher performance than the Poisson regression. In terms of potential for improvement, the linear regression R² values proved to be very high, in some instances larger than 0.9, but in order to achieve the highest possible value, a good strategy for future work could be that of selecting the independent variables after considering factors such as gentrification, the concentration of police patrolling and a closer analysis of wealthy areas, given that within the same borough it is quite possible to run into wide discrepancies concerning social status and demographics. The unsatisfactory evaluation of the Poisson regression could be attributed to the fact that this method assumes that the mean and the variance are equal, a scenario which is close to impossible to find with real-world data; therefore, a suggestion for where future research should be headed could be that of experimenting with other algorithms such as the negative binomial, where such assumptions regarding the dataset are not presumed. To summarize our results, we can conclude from our evaluation that at least two of the three analyzed social factors in every single borough, that is to say, high school graduation rate and average household income, affect the crime level in Manhattan, in addition to Queens and Brooklyn being affected as well by Manhattan and each other; furthermore, judging from the few cases where specific crimes were analyzed, as opposed to total crimes, it is safe to say that the aforementioned explanatory variables have only shown to shape theft and assault crimes, as when homicide and sexual misconduct felonies alone were analyzed, no correlation appeared.

Our analysis suggests that, to address theft and assault crimes, the analyzed ML algorithms could be used to predict future trends of such activities, giving policy makers a hint in terms of selecting which regulations should be implemented to address the problems behind the high school dropout phenomenon, to ensure that as many students as possible graduate, as well as addressing the income inequality element, which is known to be a serious issue in the United States: according to [44], in 2018 the top 10% of the richest people held 70% of the total wealth, while the bottom 50% held approximately 1%.

It is worth mentioning that some research suggests that lower incomes could depend on the achieved educational level, based on the idea that those without a high school diploma might be forced to apply for lower-wage occupations; as a matter of fact, according to [45], after 19 countries including the United States were analyzed, it was concluded that higher education promotes wage inequality reduction. Regardless of the strength of the relationship between the reached educational level and wage increase, it is clear that the wealth concentration is largely skewed towards a small percentage of citizens to begin with, and since we have seen that the average household income plays a substantial role in explaining theft and assault, policy makers could take advantage of what has been seen in our regression analysis to propose new regulations aimed at a proper taxation of wealth, in order to, among other aspects, promote a higher minimum wage, which would contribute to a fairer redistribution of assets, also considering that the current federal minimum wage of $7.25 has remained stagnant for the past several years while the cost of living has jumped by 18% [46]; note that in 2018, the richest 400 people in America paid a lower tax rate than the rest of the income groups [47]. In terms of the procedures that the police department could engage in based on our analysis, one could be that of selecting the areas to patrol, taking into consideration that unemployment, as we have seen, does not play a crucial role in theft; therefore, the areas with the highest unemployment are not necessarily the ones to keep under control. Moreover, speaking of unemployment, a possibility for future research could be that of discovering which other variables shape or are shaped by joblessness.

Felonies related to sexual misconduct and murder could be influenced by entirely different factors; that is why, in order to be able to predict them, deeper research must aim to select the right independent variables, as the ones selected for this experiment proved to be uncorrelated.


[1] L. McClendon, N. Meghanathan Using Machine Learning Algorithms to Analyze Crime Data Machine Learning and Applications: An International Journal (MLAIJ) Vol.2, No.1 (2015)

[2] C. Rudin, M. Sloan Predictive Policing: Using Machine Learning To Detect Patterns of Crime Retrievable at www.wired.com (2018)

[3] S. Kim, P. Joshi, P. S. Kalsi and P. Taheri Crime Analysis Through Machine Learning 9th Annual Information Technology, Electronics and Mobile Communication Conference (2018)

[4] J. Shingleton Crime Trend Prediction Using Regression Models for Salinas, California Naval Postgraduate School, Monterey, CA, USA (2002)


[5] D. Tyagi, S. Sharma An Approach to Crime Data Analysis, A Systematic Review International Journal of Engineering Technologies and Management Research (2018)

[6] C. Sun Analyzing crime in NYC: data, visuals and source code NYC Data Science Academy (2016)

[7] J. Manaliyo Townships as Crime 'Hot-Spot' Areas in Cape Town: Perceived Root Causes of Crime in Site B, Khayelitsha Mediterranean Journal of Social Sciences (2014)

[8] J. Gedeon '30 Most Crowded Cities in the World' msn news 24/7/2017. Available at: https://www.msn.com/en-us/money/realestate/30-most-crowded-cities-in-the-world (Accessed: 20 May 2020)

[9] J. Kok, E. Boers, W. Kosters, P. Van Der Putten Artificial Intelligence: Definition, Trends, Techniques, and Cases (1998).

[10] G. Piccinini Alan Turing and the Mathematical Objection Minds and Machines 13, 23–48 (2003).

[11] A. Samuel Some Studies in Machine Learning Using the Game of Checkers IBM Journal of Research and Development (1988), 366-367.

[12] R. Sahak, N. M. Tahir, D. Mustapha, A. Zabidi, I. M. Yassin, F. H. K. Zaman Gait recognition using kinect and locally linear embedding Journal of Fundamental and Applied Sciences (2017).

[13] P. Mozur ’Google’s AlphaGo Defeats Chinese Go Master in Win for A.I.’ New York Times 23/5/2017. Available at: www.nytimes.com/2017/05/23/business/google-deepmind-alphago-go-champion-defeat (Accessed: 20 May 2020)

[14] A. Samaha ’The Rise and Fall of Crime in New York City: A Timeline’ Village Voice 7/8/2014. Available at: www.villagevoice.com/2014/08/07/the-rise-and-fall-of-crime-in-new-york-city-a-timeline (Accessed: 20 May 2020)

[15] K. Baker ’Welcome to Fear City – the inside story of New York’s civil war, 40 years on’ The Guardian 18/5/2015. Available at: www.theguardian.com/cities/2015/may/18/welcome-to-fear-city-the-inside-story-of-new-yorks-civil-war-40-years-on (Accessed: 20 May 2020)

[16] C. Sterbenz ’New York City Used To Be A Terrifying Place’ Business Insider 12/7/2013. Available at: www.businessinsider.com/new-york-city-used-to-be-a-terrifying-place-photos (Accessed: 20 May 2020)

[17] K. Drum ’Crime Didn’t Drop In New York City Because of CompStat’ Mother Jones 2/3/2018. Available at: www.motherjones.com/kevin-drum/2018/03/crime-didnt-drop-in-new-york-city-because-of-compstat/ (Accessed: 20 May 2020)

[18] A. Nagy, J. Podolny William Bratton and the NYPD: Crime Control through Middle Management Reform (2008)

[19] B. Bowling The Rise and Fall of New York Murder: Zero Tolerance or Crack’s Decline? British Journal of Criminology (1999)

[20] New York Civil Liberties Union Retrievable at: www.nyclu.org/en/Stop-and-Frisk-data

[21] S. Holmes ’Reality Check: Who’s right about constitutionality of stop-and-frisk?’ CNN 1/10/2016. Available at: edition.cnn.com/2016/10/01/politics/fact-check-stop-and-frisk (Accessed: 20 May 2020)

[22] Official website of the City of New York Retrievable at: www1.nyc.gov

[23] New York City Department of Labor Retrievable at: labor.ny.gov

[24] New York City Department of Education Retrievable at: schools.nyc.gov


[25] R. Wasserstein, N. Lazar The ASA Statement on p-Values: Context, Process, and Purpose American Statistical Association (2016), 129-133.

[26] G. Van Rossum Python Tutorial CreateSpace Independent Publishing Platform (2020).

[27] C. Fuhrer, J. Solem, O. Verdier Scientific Computing with Python 3 Packt (2016), Birmingham, UK.

[28] W. McKinney Python for Data Analysis O’Reilly Media Publication (2013), Sebastopol, CA, USA.

[29] W. McKinney, J. Perktold, S. Seabold Time Series Analysis in Python with statsmodels 10th Python Science Conference (2011).

[30] J. Minichino, J. Howse Learning OpenCV 3 Computer Vision with Python Packt (2015), Birmingham, UK.

[31] J. Hunt Advanced Guide to Python 3 Programming Springer (2019) Chippenham, UK.

[32] W. Osgood Poisson-Based Regression Analysis of Aggregate Crime Rates Journal of Quantitative Criminology (2000), pp. 21–43.

[33] A. Awal, J. Rabbi, I. Hossain, A. Hashem Using Linear Regression to Forecast Future Trends in Crime of Bangladesh 5th International Conference on Informatics, Electronics and Vision (2016), Dhaka, Bangladesh.

[34] A. Wang, L. Perez Gaussian Processes for Crime Prediction (2014)

[35] J. Clarke, T. Onufer Understanding Environmental Factors that Affect Violence in Salinas, California Naval Postgraduate School, Monterey, CA, USA (2009)

[36] C. Sun Analyzing crime in NYC: data, visuals and source code NYC Data Science Academy (2016)

[37] M. Awal, J. Rabbi, S. Hossain, and M. Hashem Using Linear Regression to Forecast Future Trends in Crime of Bangladesh (2016)


[38] L. McClendon, N. Meghanathan Using Machine Learning Algorithms to Analyze Crime Data Machine Learning and Applications: An International Journal (2015)

[39] M. Zulkifli, A. Razalin, N. Masseran, I. Noriszura Statistical Analysis of Vehicle Theft Crime in Peninsular Malaysia using Negative Binomial Regression Model (2015)

[40] United States Census Bureau Retrievable at: www.census.gov

[41] New York City: Research and History: The Five Boroughs Lloyd Sealy Library of John Jay College of Criminal Justice (2020)

[42] NYU Furman Center Retrievable at: furmancenter.org

[43] J. Evans Straightforward statistics for the behavioral sciences Brooks Cole, Pacific Grove, CA, USA (1996)

[44] P. Dacosta ’America’s Humongous Wealth Gap Is Widening Further’ Forbes 29/5/2019. Available at: www.forbes.com/sites/pedrodacosta/2019/05/29/americas-humungous-wealth-gap-is-widening-further (Accessed: 20 May 2020)

[45] P. Martins, P. Pereira Does education reduce wage inequality? Quantile regression evidence from 16 countries (2004)

[46] A. Picchi ’The federal minimum wage sets a record – for not rising’ CBS News 15/6/2019. Available at: www.cbsnews.com/news/federal-minimum-wage-sets-record-for-length-with-no-increase (Accessed: 20 May 2020)

[47] D. Leonhardt ’The Rich Really Do Pay Lower Taxes Than You’ New York Times 6/10/2019. Available at: www.nytimes.com/interactive/2019/10/06/opinion/income-tax-rate-wealthy (Accessed: 20 May 2020)


List of figures and tables

Figure 1: New York City unemployment rate 2005-2019, years VS percentage, source: labor.ny.gov

Figure 2: New York City seven major crimes 2005-2019

Figure 3: Simple linear regression, the blue dots represent the actual values and the orange line represents the estimated ones

Figure 4: Manhattan VS Brooklyn, Brooklyn VS Brooklyn

Figure 5: Manhattan VS Queens, Queens VS Queens

Figure 6: Manhattan VS Bronx, Bronx VS Bronx

Figure 7: Manhattan VS Staten Island, Staten Island VS Staten Island

Figure 8: Manhattan VS Manhattan, larceny & assault VS murder & rape

Figure 9: Manhattan VS Bronx, larceny & assault VS murder & rape

Figure 10: Manhattan VS Brooklyn, larceny & assault VS murder & rape

Figure 11: Manhattan VS Brooklyn

Figure 14: Manhattan VS Staten Island

Figure 15: Manhattan actual crime VS Manhattan predicted crime based on all independent variables considering each borough, both ML algorithms are compared

Table 1: Correlation Coefficient Table

Table 3: Adjusted R-squared after mixing and matching all 5 boroughs' total crimes VS all social factors

Table 5: Deviance after mixing and matching all 5 boroughs' total crimes VS all social factors


