
Usage of Distributed Systems Data for Automated Financial Health Advice

ANA CANCINO


KTH Royal Institute of Technology

Dept. of Software and Computer Systems

Degree Project in the Master's Programme in Software Engineering of Distributed Systems

Usage of Distributed Systems Data for Automated Financial Health Advice

Author: Ana Cancino acca@kth.se

Supervisors: Mark T. Smith, KTH, Sweden

Sebastian Stentorp, Financial Innovation Manager, Sweden

Examiner: Mark T. Smith, KTH, Sweden


Abstract

Financial health is a subject that deeply affects the choices and decisions one makes every day, but for most people, financial concepts are difficult to understand. Financial illiteracy worldwide amounts to approximately 67%, and in Sweden, one of the world's least financially illiterate countries, it is around 29% [1].

The goal of this research is to provide more insight into an individual's financial health situation. During the development, several steps have been taken to build a program that predicts the customer's risk capacity, which is the risk an individual has to take in order to achieve his or her goals. First, a program is developed that maps occupations to their closest job titles using NLP, which is needed in order to use occupations for Data Mining. Then, a case-based recommender system is developed to gain insight into the customer's data about their personal financial situation. Finally, based on the above, the customer's risk capacity is predicted from their profile information, following the CRISP-DM methodology and using Machine Learning and Deep Learning techniques.


Sammanfattning

Financial health is a subject that deeply affects the choices and decisions one makes every day, but for most people, financial concepts are difficult to understand. Financial illiteracy worldwide amounts to approximately 67%, and in Sweden, one of the world's least financially illiterate countries, it is around 29% [1].

The goal of this research is to provide more insight into an individual's financial health situation. During the development, several steps have been taken to create a program that predicts the customer's risk capacity, that is, the risk an individual has to take in order to achieve his or her goals. First, a program was developed that maps occupations to their closest job titles using NLP, which is needed in order to use occupations for Data Mining. Then, a case-based recommender system is developed to gain insight into the customer's data about their personal financial situation. Finally, based on the above, the customer's risk capacity is predicted from their profile information, following the CRISP-DM methodology and using Machine Learning and Deep Learning techniques.


Acknowledgment

First and foremost, I would like to thank God and the Virgin Mary, for allowing me to get here.

To KTH and all the professors who, with their effort and dedication, gave me their knowledge, and taught me responsibility and persistence.

To my tutors, Mark and Sebastian. Thank you for your patience, dedication, judgement and motivation, encouraging me to achieve more than I intended.

To Professor Anders Stenkrona for his insights and knowledge of financial concepts.

To my Student Innovation Team, for giving me the opportunity to develop my capabilities, expand my knowledge, and push me forward.

To my Dad, my Mom, Ceci and Dario, for supporting me and giving me all the necessary tools to achieve all my goals. Especially my mother, for her sacrifice and effort, always giving me her understanding and affection.

To Mamina and my family, for always supporting me, even from afar.

To my husband, Francisco, who has been with me all the way, for always helping me to fulfill my ideals, always being my indispensable source of support.

Stockholm, 01 August 2019 Ana Cristina Cancino


Contents

1 Introduction 1

1.1 Background . . . 2

1.2 Problem Description . . . 3

1.3 Purpose and Goal . . . 3

1.3.1 Contributions . . . 4

1.3.2 Delimitations . . . 4

1.3.3 Benefits, Ethics and Sustainability . . . 4

1.4 Outline . . . 5

2 Extended Background 6

2.1 Natural Language Processing (NLP) . . . 6

2.2 Edit Distance . . . 7

2.3 Recommender Systems . . . 8

2.3.1 Kinds of Recommender Systems . . . 9

2.3.1.1 Collaborative Filtering . . . 9

2.3.1.2 Content-based Filtering . . . 9

2.3.1.3 Knowledge-based Filtering . . . 10

2.3.1.4 Case-based Filtering . . . 11

2.3.1.5 Hybrid Filtering . . . 11

2.3.1.6 Other Techniques . . . 11

2.4 Cross Industry Standard Process for Data Mining (CRISP-DM) . . . 12

2.5 Machine Learning (ML) . . . 15

2.5.1 Linear Regression . . . 16

2.5.1.1 Multiple Linear Regression . . . 16

2.5.1.2 Ridge Regression . . . 16

2.5.1.3 Lasso Regression . . . 17

2.5.2 Polynomial Regression . . . 17

2.5.3 Decision Tree Regression . . . 18

2.5.4 Ensemble Regression . . . 18

2.5.4.1 Random Forest Regression . . . 18

2.5.4.2 Gradient Boosting Regression . . . 19

2.5.4.3 XGBoost for Regression . . . 19

2.5.5 K-Nearest Neighbors Regression . . . 19

2.5.6 Support Vector Machine (SVM) for Regression . . . 20

2.6 Deep Learning (DL) . . . 20

2.6.1 Artificial Neural Networks (ANNs) . . . 20


3 Methods 23

3.1 Goals . . . 23

3.2 Tasks . . . 23

3.3 Evaluation . . . 24

4 Implementation 26

4.1 Occupation Mapping . . . 26

4.1.1 Set up . . . 29

4.1.2 Clean data . . . 29

4.1.3 Merge datasets . . . 29

4.1.4 Remove empty data . . . 30

4.1.5 Tokenize strings . . . 30

4.1.6 Remove stopwords . . . 30

4.1.7 Stemming . . . 30

4.1.8 Add uncommon cases . . . 31

4.1.9 Export . . . 31

4.2 Translator of Occupancies . . . 31

4.2.1 Set Up . . . 31

4.2.2 Clean Up . . . 33

4.2.3 Identify the language . . . 33

4.2.4 Tokenize strings . . . 33

4.2.5 Identify most common objects . . . 34

4.2.6 Remove Stopwords . . . 34

4.2.7 Stemming . . . 34

4.2.8 Classify the string . . . 34

4.2.9 Identify the ISCO codes and Major Groups . . . 36

4.2.10 Export . . . 37

4.2.11 Performance check . . . 37

4.3 Recommender System - Case-based . . . 37

4.3.1 Set Up . . . 37

4.3.2 Savings . . . 38

4.3.3 Emergency Buffer . . . 39

4.3.4 Security Buffer . . . 39

4.3.5 Sleeping Money . . . 39

4.3.6 Percentage of Savings . . . 39

4.3.7 Export . . . 40

4.4 Risk Capacity Prediction . . . 40

4.4.1 Business Understanding . . . 40

4.4.1.1 Determine the Business Objectives . . . 40

4.4.1.2 Assess the Situation . . . 41

4.4.1.3 Determine the Data Mining Goals . . . 41

4.4.1.4 Produce a Project Plan . . . 41

4.4.2 Data Understanding . . . 42

4.4.2.1 Collect the Initial Data . . . 42

4.4.2.2 Describe the Data . . . 42

4.4.2.3 Explore the Data . . . 42

4.4.2.4 Verify Data Quality . . . 43

4.4.3 Data Preparation . . . 43

4.4.3.1 Select Data . . . 43

4.4.3.2 Clean Data (Treat missing values) . . . 44

4.4.3.3 Value Transformation and Encoding . . . 44

4.4.3.4 Data Standardization or Scaling . . . 44

4.4.3.5 Feature Selection . . . 45

4.4.3.6 Dimensionality Reduction . . . 45

4.4.4 Modeling . . . 46

4.4.4.1 Select the Modeling Technique . . . 46

4.4.4.2 Generate Test Design . . . 46

4.4.4.3 Build the Model . . . 46

4.4.4.4 Assess the Model . . . 47

4.4.5 Evaluation . . . 47

4.4.5.1 Evaluate Results . . . 47

4.4.5.2 Review Process . . . 47

4.4.5.3 Determine Next Steps . . . 48

4.4.6 Deployment . . . 48

5 Results and Analysis 49

5.1 Performance analysis of translator of occupations . . . 49

5.1.1 Sample performance . . . 49

5.1.2 Total Performance . . . 50

5.2 Performance analysis of risk capacity prediction . . . 50

6 Conclusions 56

6.1 Goals Achieved . . . 56

6.2 Future Work . . . 57


List of Figures

2.1 NLP Types . . . 7

2.2 Types of Recommender Systems . . . 10

2.3 CRISP-DM Process Diagram . . . 13

2.4 Overfitting and underfitting . . . 15

2.5 Neuron (based on a Paint3D model) and Perceptron . . . 21

2.6 Multi-layered Perceptron . . . 22

3.1 Task description . . . 24

4.1 Occupation Mapping to generate occupations list . . . 28

4.2 Translator of occupations to provide a mapped occupation title . . . 32

4.3 Classification Flowchart for occupation titles . . . 35

4.4 Recommender system steps . . . 38

5.1 Original Dataset MSE Line Chart . . . 51

5.2 Original Dataset MSE Bar Chart . . . 51

5.3 Original Dataset R2 Line Chart . . . 52

5.4 Original Dataset R2 Bar Chart . . . 52

5.5 Filter Method Dataset MSE Line Chart . . . 52

5.6 Filter Method Dataset MSE Bar Chart . . . 53

5.7 Filter Method Dataset R2 Line Chart . . . 53

5.8 Filter Method Dataset R2 Bar Chart . . . 53

5.9 Filter Method and PCA Dataset MSE Line Chart . . . 54

5.10 Filter Method and PCA Dataset MSE Bar Chart . . . 54

5.11 Filter Method and PCA Dataset R2 Line Chart . . . 54

5.12 Filter Method and PCA Dataset R2 Bar Chart . . . 55


List of Tables

2.1 Example of Edit Distance calculation . . . 8

4.1 Occupation Mapping package versions . . . 29

4.2 Translator of occupations package versions . . . 33

4.3 Table of Mapping to Major Groups . . . 36

4.4 Recommender system package versions . . . 38

4.5 Risk capacity predictor package versions . . . 40


Acronyms

R2 R-squared or Coefficient of Determination.

AI Artificial Intelligence.

ANNs Artificial Neural Networks.

CRISP-DM Cross Industry Standard Process for Data Mining.

DL Deep Learning.

GDA Generalized Discriminant Analysis.

GDPR General Data Protection Regulation.

ILO International Labour Organization.

ISCO International Standard Classification of Occupations.

LDA Linear Discriminant Analysis.

ML Machine Learning.

MSE Mean Squared Error.

NaN Not a Number.

NLP Natural Language Processing.

NLTK Natural Language Toolkit.

NNs Neural Networks.

PCA Principal Component Analysis.

RBF Radial Basis Function.

RFE Feature ranking with Recursive Feature Elimination.


RITSs Recommendation Systems In-The-Small.

SCB Sverige Statistiska Centralbyrån (Swedish Central Bureau of Statistics).

SEK Svenska Krona (Swedish crown).

SSYK Standard för Svensk Yrkesklassificering (Standard for Swedish occupational classification).

SVM Support Vector Machine.

TF-IDF Term Frequency - Inverse Document Frequency.

XGBoost Extreme Gradient Boosting.


Chapter 1: Introduction

Financial health is a broad term that involves a set of personal financial concepts; applied to an individual's situation, it provides an overview of the state of that individual's finances [2].

Many diverse concepts, such as loans, stocks, debts and interest rates [3], are part of financial health and make the term even harder to grasp; one can become overwhelmed and make wrong decisions due to a lack of information and skill. Since the term is difficult to approach due to its complexity, it is not surprising that around 29% of people in Sweden have a hard time understanding their personal finances, a number that amounts to 67% worldwide [1].

Financial literacy is a subject most are scared to deal with; however, the earlier one starts to understand these concepts, the sooner one understands the importance of savings and asset management. Decisions regarding a customer's financial health can have a huge impact on the individual's future and can have serious repercussions when dealing with unforeseen circumstances.

One very important aspect of financial health is risk, because when investing there is a danger of losing money. The risk capacity is the amount of risk the customer is able or willing to take in order to achieve the goals they have set for their investments [4].

The risk that the individual is willing to take determines the wealth advice or the investment opportunities that are best suited to their situation [4]. Depending on the risk, the investments will be more conservative or more aggressive; that is, the volatility of the assets is something the customer can withstand, if handled correctly.

Every investment carries some degree of risk: bonds tend to have less risk, while stocks are normally associated with more risk but also more return. Bonds are a type of debt that a company promises to repay in the future, whereas stocks are shares or holdings of the company [5].

The best combination of bonds and stocks depends on the customer's risk capacity. If the customer can sustain a significant amount of risk, it can mean more revenue for the customer, and if the risk predicted for the customer is lower than that, it can translate into missed investment opportunities. On the other hand, if the predicted risk is more than what the customer can withstand, the customer may have to withdraw the investment when the market value is low and lose some of their investment.

Although the importance of money, as a means to achieve security and a good lifestyle, is globally known, in most schools how to handle money is not taught like other subjects and receives little focus [6]. As a student, one is taught to work for money, without thinking about how to understand it and how to make correct decisions given what one wants to achieve. This thesis sets out to apply Machine Learning to personal financial situations in order to predict individuals' risk capacity, thereby enabling better asset selection and thus improved financial health.

1.1 Background

There have been several studies showing that when a person has financial stability, they also have better self-control and feel less anxious about their future, and are hence prone to make better decisions [7]. There is even a correlation between the decrease in financial illiteracy and "just-in-time" recommendations, showing the advantages of this kind of education over prolonged years of schooling [8]. This means that if the recommendation system works as expected, there can also be a rise in financial literacy.

With the arrival of new technologies there has also been a drastic change in how people spend their money, since today it is very simple to go online and, after a few clicks, a person might have bought many things they do not need [9]. That is why, now more than ever, it is hard to keep track of expenses; even when they are easy to follow, most services nowadays are geared towards selling more and saving less. One is surrounded by publicity that wants to sell products at any cost, bombarded by advertisements, and it is harder than ever to stop and think about numbers, loans and financing opportunities that are not widely known.

It is supposed to be every individual's duty to learn personal finance management, because through the internet one has access to a lot of information [6]. However, the amount of information can also be a disadvantage: there is so much of it that it is very hard to know the correct steps to follow. People tend to avoid numbers, and it is also difficult to know whom to trust when following a path that can determine one's future.

There are already applications on the market that help you save and keep track of your finances [10], mostly external applications which need to ask permission from your bank to retrieve your financial information, which can also lead to a lack of security. When sharing information with another entity, the data can be stolen or even sold for profit; nowadays information is power, and the more distributed it is, the more susceptible it becomes. Because financial information is a very private and sensitive subject, it is particularly targeted.

1.2 Problem Description

The goal of this thesis is to provide part of a financial health analysis by predicting the risk capacity of the customer. The prediction of the risk capacity can also help advisors make accurate assessments and have more time to get to know the customer, while providing additional validation that the correct risk capacity is chosen.

Part of this project is to investigate the data points that correspond to the customer's financial health, and to determine how each of them correlates with the individual's financial risk capacity. Given the risk capacity, one can invest accordingly and choose funds and assets that help achieve one's objectives in a sustainable way.

Due to data privacy, this project uses test data consisting of approximately 1,000 samples of mock customers created with realistic characteristics by financial professionals. Even so, it is important to stress that any inferences made on these mock users are invalid until the model is applied to real customer data. The objective is to build a model that can determine which features relate to the individual's financial health, to learn the link between financial stability and an individual's profile, so that it can afterwards be used on real customer data.

1.3 Purpose and Goal

The purpose of the risk capacity prediction is that it could help people who do not have access to personalized advice to get understandable information about their financial position, through a program that can learn from advisors. Furthermore, the collective knowledge of many advisors could be gathered to provide an accurate prediction of the risk capacity, which serves as a baseline for accuracy.

In addition, different Python scripts will be developed to provide insights into an individual's financial health, understanding the data points, the importance of the different features and how they interact with the recommended advice. Different machine and deep learning algorithms will also be applied to compare results and obtain a set of models that can later be used on real data.

1.3.1 Contributions

This investigation will lead to several deliverables. At the center of the risk capacity prediction is the occupation mapping: to be able to use a free-text occupation, a full list of occupation and job names has to be built in order to map the words to an existing position. Hence, the first deliverable is an algorithm that can map occupations without any ground-truth values, with reasonable performance. The second is a recommender system that derives some basic variables with a large impact on an individual's risk capacity and personal finances. And finally, a template of data mining techniques that can be used to predict the customer's risk capacity.

1.3.2 Delimitations

In this project, the goal is to construct a basic model to perform risk analysis on the customer's financial data. Because of confidentiality and the sensitivity of the data that would be treated, it is not possible to use real customer data.

The fact that this research does not deal with real data makes the performance measures invalid, even if the model itself is still suitable for real data. The main objective is to provide a template that can also be used with real data and that, with few modifications, can give insights for customers.

1.3.3 Benefits, Ethics and Sustainability

Regarding the ethical aspect of recommendation systems, it can be challenging because such a system handles the customer's personal information, and this type of information is protected by law, for example by the General Data Protection Regulation (GDPR).

Making inferences of a personal nature can invade the customer's privacy. One has to be careful when presenting the information to the customer, to make sure they do not feel their personal privacy is violated by the proposed algorithm. Even so, the customer's personal information is needed to build a profile, make the correct assumptions and recommend the right product to the individual in the ideal context.

When the system is made available to everyone, it is important to follow the ethics of recommendation systems, by letting customers confirm their interests and their soft data points before giving them advice. More investigation into how a recommendation system is presented to customers has to be performed, in order to understand whether a system like this can be used directly by end customers, or whether it should be used by advisors to facilitate the input of the customer's current financial situation. The ethical appropriateness of the items, as perceived by end customers, could predict the trust and credibility of the system [11]. For the suggestions to be effective, the priority is not that the customer changes their behaviour, but that they learn about their financial health and how it can be improved.

Traditionally people are scared of change, and since technology is advancing so rapidly it can seem as if it is taking over too many responsibilities; yet for the companies that adapt, the benefits can be huge [12]. Even if excellent automated financial advisors can be developed, there will always be a need for human assistance and, most importantly, the personal touch that gives customers confidence. Perhaps the advisor figure will evolve into more of a facilitator who works with each individual to reach their goals.

Since the industrial revolution, with the development of new technology, many jobs have changed, and will continue to change, their focus to become useful in different ways. This type of algorithm can even make the financial advisor's job considerably easier, since the data is already known and they only need to correlate it and understand what is important to the particular customer.

In this project the intention is to make sure the recommendations given are suited to the correct audience, based on what advisors normally offer when providing a one-to-one experience. Since that experience can only be provided to a limited number of customers, this platform can help more people gain access to the same information, and even adapt it to the customer's particular needs. In order to provide access to everyone and deploy this solution to a broader audience, more insights about customer experience need to be taken into account.

1.4 Outline

In the development of this research, the following structure will be used:

• In Chapter 2 the extended background is discussed, to better understand the concepts and previous work behind the project.

• In Chapter 3 the goals of the project and the tasks needed for its full development are explained, along with the different evaluation techniques used to provide insights into the performance of the project.

• In Chapter 4 the implementation and design of each of the tasks are explained.

• In Chapter 5 the results are analyzed based on the performance measures.

• Finally, in Chapter 6, the goals achieved are discussed together with future work that can help future implementations.

Chapter 2: Extended Background

During the course of the project, several processing methods are used to infer as much as possible from the information that has been gathered. This chapter therefore presents the basic concepts used in this research.

2.1 Natural Language Processing (NLP)

Natural language is any of the languages used by humans to communicate with each other, which enabled us to form a civilized society. The problem is that it is complicated for machines to understand and process this kind of language, because some phrases can have more than one meaning, the meaning can depend on the context, and even the tone can change the message. This is the reason behind Natural Language Processing, a branch of computer science that focuses on analyzing this particular type of data and processing it so that a machine can work with it.

There are two clearly distinct branches of NLP, shown in Figure 2.1: the linguistic approach and the statistical approach [13]. The linguistic approach uses cases and language rules to understand the underlying structure of a sentence, in order to extract an exact measure of a particular sentence. Nowadays more complex methods take a statistical approach, using Artificial Intelligence (AI) and Machine Learning techniques.

It has been shown that the best results are achieved using probabilistic approaches [14], applied across the field, from sentiment analysis to E-learning with Augmented Reality applications [15].

Figure 2.1: NLP Types (linguistic approach and statistical approach)

There are several aspects of NLP to take into account, for example semantics, grammar and orthography, among others. The approach taken in this thesis is to deal with the morphology and syntax of the language, in order to filter and highlight only the most important information from a free-text field. ML techniques cannot be used to infer the corresponding meaning, because no mapped data is available for all the sentences in this research, so it would be impossible to train a model.

Normally, the analysis of text is done with a pipeline of different steps: tokenization (identifying the different words in the sentence), stemming (finding the root of each word) and Term Frequency - Inverse Document Frequency (TF-IDF), which shows the frequency of words in a text and determines the significance of each word. Other, simpler methods include cleaning the data by removing stopwords (words that do not add meaning to the sentence, such as prepositions), or assigning certain weights for category mapping [16].
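As an illustration of the tokenization, stopword-removal and stemming steps just described, the following is a minimal sketch using NLTK (the library listed later in Table 4.1). The sample sentence, the language choice and the Snowball stemmer are illustrative assumptions, not taken from the thesis code.

```python
# Minimal sketch of the preprocessing pipeline described above:
# tokenize, remove stopwords, then stem. The example string is illustrative only.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stopword lists (English, Swedish, ...)

text = "Senior software engineer working with distributed systems"

tokens = word_tokenize(text.lower())            # split into lowercase words
tokens = [t for t in tokens if t.isalpha()]     # drop punctuation and numbers

stops = set(stopwords.words("english"))         # "swedish" is also available
tokens = [t for t in tokens if t not in stops]  # remove stopwords

stemmer = SnowballStemmer("english")
stems = [stemmer.stem(t) for t in tokens]
print(stems)                                    # stemmed tokens without stopwords
```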

Since the idea is to map the free-text occupation field to the most probable entry in the occupation table, and there is no ground-truth table from which to learn the values, nor a list of mapped occupations large enough to take a Machine Learning approach, basic NLP methodologies have been used. The data is also treated in both English and Swedish, so the linguistic rules applied have to differ according to the language.

2.2 Edit Distance

Edit distance is a term used in computer science for a calculation of the difference (or distance) between two words that belong to the same alphabet [17]. This distance can be used for applications like Natural Language Processing (NLP) or spelling correction.

The most popular or classic edit distance, known as the Levenshtein distance [18], finds the minimum number of operations that turn one string into another. The operations defined by the Levenshtein distance are removal, insertion and substitution of characters. An example of calculating the edit distance from "tables" to "staple", which amounts to 3, is shown in Table 2.1.


Step number | Step description | Distance
1 | "tables" → "stables" (insertion of "s") | 1
2 | "stables" → "staples" (substitution of "b" for "p") | 2
3 | "staples" → "staple" (deletion of "s") | 3

Table 2.1: Example of Edit Distance calculation
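A small dynamic-programming implementation of the Levenshtein distance described above is sketched below; it reproduces the distance of 3 between "tables" and "staple" from Table 2.1.

```python
# Dynamic-programming Levenshtein (edit) distance between two strings.
def levenshtein(a, b):
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                              # delete all characters of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                              # insert all characters of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # removal
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

print(levenshtein("tables", "staple"))  # 3, as in Table 2.1
```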

2.3 Recommender Systems

Recommendation systems have been extremely useful for giving automated advice and guiding a person within a website to find similar products. They filter the data by drawing on the past actions made by the customer and predict new or similar products that could also be suitable. A well-thought-out recommendation system nowadays does not only suggest similar products; it also has to help the customer find new things they could not have predicted. This special kind of desired result is called a serendipitous result, referring to the discovery of a new product that does not have much in common with previous searches but is something really wanted, which makes the customer happy [19].

Examples of recommendation systems for financial services can be found since the beginning of Machine Learning implementations, because one of the basic implementations is to predict stock market values based on the history of stock exchange values. Predictions can be made using a variety of algorithms that model customer behavior, depending on the data points, the predictions needed and the distribution of the dataset.

Similar research has been done by several engineers, for example with multidimensional association rules [20]. That research is based only on the customer's age, job type, history of financial products and characteristics of those financial products, which is limiting because of the dataset they had available. Looking at previous implementations of recommendation systems: Amazon recommends other products to buyers [21], Netflix helps you choose a movie according to your previous interests [22, 23], Spotify helps you find similar songs according to your taste in music, and Facebook and LinkedIn help you connect with people, events and more. Today there are even apps that recommend other apps based on your download history, correlated with demographic and context-related features, as in Ant Financial [24].

There are two kinds of data that can be available when making assessments about a person's financial health: hard data refers to actual data points like transactions and account balances, whereas soft data refers to the person's feelings, collected through surveys and the individual's self-assessment of their psychological well-being [25].

The most important part of this project is to gather the data needed to make inferences based on the available information about the customer. By looking at all the data points, it is possible to make inferences about the customer's personal finances and provide more insightful recommendations. At the same time, the project helps the customer gain more insight into the correct path to better handle their finances, and supports financial stability by using the correct tools to achieve their goals [7].

Money should be seen as an instrument to achieve goals; it is just an instrument to help us develop ourselves. Nowadays money is associated with power, but money just sitting in a bank account provides security while losing value each day if the interest rate is negative. If one instead uses the money to generate more revenue, the money by itself provides more money, which helps the person become financially independent.

2.3.1 Kinds of Recommender Systems

Among recommendation systems there are five kinds, each offering a different approach for diverse kinds of implementations, which can be used depending on the content to recommend and the types of data [26].

2.3.1.1 Collaborative Filtering

This type of filtering uses the data of all the customers in the system to look for neighbors who share similar interests, and recommends based on similar behavior or preferences [26]. This filtering method helps to find patterns within the data, and it does not depend on the actual content or additional data in the dataset (this means it works regardless of the information one has about the actual item to recommend).

One example could be to have the financial advisors choose several fictional figures who have a good financial situation, and try to nudge other customers' behaviours to imitate that good behaviour. This is important because, as a bank, wrong recommendations can have a negative impact on the company's reputation and affect the trust the customer places in the bank.

2.3.1.2 Content-based Filtering

Without knowing the customer’s priorities or evaluation parameters, predict the customer behavior by approximating [26]. This means that by just having the

(22)

Recommender Systems Types Collaborative Filtering

Content-based Filtering

Knowledge-based Filtering

Case-based Filtering

Hybrid Filtering

Figure 2.2: Types of Recommender Systems

metadata of the customer, and a basic description of the searched item, then rec- ommend based on approximating what the customer wanted in the past and in the present.

This is the technique that can be used to predict the categorization in the customers expenses, so for example categorizing all supermarket expenses as groceries expenses, but maybe it is actually a gift or it may be another type of expense. Even though the customer will need to adapt this to their actual expenses to have a better overview of their financial situation it is a good way to predict the behaviour and make a first approach to the categorization.

2.3.1.3 Knowledge-based Filtering

Based on knowledge one already possesses about the customer's preferences, this method predicts their future behavior and proposes similar things [26]. It first tries to predict the customer's behavior and preferences, so as to have an idea of what they may want next, and then recommends things the customer may need.

For example, knowing that the customer has just bought a new house, and that after buying a house one usually needs home insurance, the system can recommend home insurance to the customer.

2.3.1.4 Case-based Filtering

When defining the different kinds of answers, it can prove advantageous to build a tree-like structure that maps customers to different kinds of recommendation cases [26]. This approach is very robust, since it guides the customer through a flow of what they may possibly need, based on customer research. It is also very reliable, because the system does not offer varying recommendations that can be unreliable, but proposes recommendations that are proven to help. This also means that the suggestions made can be easily traced and the results can be replicated, as in the sketch below.
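As a purely hypothetical sketch of how such tree-like cases could be encoded (the thesis later builds its own case-based recommender around savings-related variables in Section 4.3), the rules below check a customer profile against a fixed, traceable set of conditions. The variable names and thresholds are illustrative assumptions, not values from the thesis.

```python
# Hypothetical sketch of a tiny case-based recommender: each case is a
# condition plus a recommendation, checked in a fixed, traceable order.
# Variable names and thresholds are illustrative only.
def recommend(profile):
    advice = []
    monthly_surplus = profile["income"] - profile["expenses"]
    if monthly_surplus <= 0:
        advice.append("Expenses exceed income: review recurring costs first.")
    if profile["savings"] < 3 * profile["expenses"]:
        advice.append("Build an emergency buffer of about three months of expenses.")
    elif profile["savings"] > 12 * profile["expenses"]:
        advice.append("Large amount of 'sleeping money': consider investing part of it.")
    if monthly_surplus > 0 and monthly_surplus / profile["income"] < 0.10:
        advice.append("Try to save at least 10% of your income each month.")
    return advice

print(recommend({"income": 30000, "expenses": 24000, "savings": 20000}))
```

Because every recommendation is tied to an explicit condition, the output can be traced back to the rule that produced it, which is the replicability property described above.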

2.3.1.5 Hybrid Filtering

It is possible to choose several kinds of filters and merge them into a more complicated pattern [26]. This complex type of recommendation system combines models, depending on the data, into one that fits all the requirements.

When using hybrid filters there is a need to consider the prioritization of the different kinds of recommendations, to combine them and make them as accurate and timely as possible.

2.3.1.6 Other Techniques

It is also possible to implement Recommendation Systems In-The-Small (RITSs) [19], which could in fact help with customer consent, given that the computation would be done on the customer's own device. However, it is not a viable solution: if the computation is done on the customer's phone, it can affect battery life and application performance, which would in the end make customers feel that the application drains the phone's resources. Moreover, this solution would limit the performance and usage of Machine Learning techniques, because a lot of computing power is needed to perform this kind of analysis. And of course there is a security issue that has to be avoided: making the code accessible to everyone also makes the system vulnerable and prone to security issues.

On the same note, if only the customer's own data were used, without collaborative filtering, the insights could be irrelevant to the customer's financial health and could even affect the whole outcome. The idea is to provide as much insight as possible while trying to make the project usable for other applications afterwards.

It is important to consider that any kind of recommendation system that uses such private information needs special consent from the customers. So, in the banking sector, being able to test this kind of program with real data can be a challenge.

2.4 Cross Industry Standard Process for Data Mining (CRISP-DM)

Cross Industry Standard Process for Data Mining (CRISP-DM) is a process that defines a standard approach for applying data mining through a common methodology [27]. This process helps couple the business objectives with the actual analysis, which is the most important part, because the business value is the actual value of the outcome [28]. The model was created in 1996 by some of the most prominent engineers in the field, but even nowadays, with the emergence of Deep Learning (DL) and other new technologies, CRISP-DM provides strong guidance for even the most advanced algorithms [29].

The process is iterative and can be described in six steps, shown in Figure 2.3 and described below. Each step contains a number of tasks, which give more depth to the research.

• Business Understanding: Understanding the business value of the application, or the goal to be achieved by applying data mining. This step is the one that merges the engineering with the value and profit that can be derived from the project. The following tasks are carried out in order to better understand the business.

– Determine the Business Objectives
– Assess the Situation
– Determine the Data Mining Goals
– Produce a Project Plan

• Data Understanding: In this step the main goal is to get insights into the data, to make explorations and interpret the data. The data must also be checked for completeness and validity. The tasks in this step are the following.


Figure 2.3: CRISP-DM Process Diagram

– Collect the Initial Data
– Describe the Data
– Explore the Data
– Verify Data Quality

• Data Preparation: During this phase the data is cleaned and formatted so that it can be used for the modeling in the following step. Instead of the traditional methodology [27], an adaptation of it is used, as shown below [30]. This adaptation makes it possible to follow CRISP-DM while using the methods that process the data to give the models the best performance. Some of the possible tasks within this phase are as follows.

– Select Data
– Clean Data (Treat missing values)
– Value Transformation and Encoding
– Data Standardization or Scaling
– Feature Selection
– Dimensionality Reduction

• Modeling: In this step several models are tried, using Machine Learning and Deep Learning techniques. It also includes defining the training and test sets and assessing the models, as described in the following tasks.

– Select the Modeling Technique
– Generate Test Design
– Build the Model
– Assess the Model

• Evaluation: Assess whether the models predict correctly and check their performance. The following tasks are carried out to measure the accuracy of the models.

– Evaluate Results
– Review Process
– Determine Next Steps

• Deployment: Finally, the deployment process is the way to run and maintain the model on the system, to deliver the final results and to review the project. The following tasks show the deliverables for this phase.

– Plan Deployment
– Plan Monitoring and Maintenance
– Produce Final Report
– Review Project


2.5 Machine Learning (ML)

The idea that machines can learn and adapt to changes has always been an exciting proposition for Computer Science. Machine Learning algorithms are programs, normally based on statistical models, that allow the computer to predict values. These models are used to understand the underlying patterns of the data, in order to adapt to the data and predict the pattern at unseen points.

There are many types of ML algorithms, in this case, the focus will be on supervised learning. Supervised learning is based on inputs and outputs, so the algorithms know the expected outputs and adapt to the expected results (called labels), to make the least amount of errors possible [31]. Machine learning algorithms learn by adapting to a part of the dataset called the training set, and then are tested (based on the accuracy of the results) with the test set.

For this specific case, regression is used, where the goal is to estimate a real-valued variable given a certain pattern, such that the function is close to the observed values [32].

Another important detail to take into consideration when dealing with ML is the pair of terms overfitting and underfitting. Overfitting is when the model adapts too closely to the data, making it too reliant on the training set; when one then tries to fit the test data, the performance drops because the model cannot adapt adequately to different kinds of data. There are several methods to tackle this problem; the most common are cross validation, early stopping, removing nodes (pruning) and regularization (reducing the number of features and penalizing complexity). Underfitting is the opposite: the model is too simple to adapt to the data correctly, so it cannot follow the trend properly. Both cases are shown in Figure 2.4, which illustrates whether the model is capable of correctly following the data trend.

Figure 2.4: Overfitting and underfitting


2.5.1 Linear Regression

The most basic machine learning technique is Linear Regression, which is sometimes not even considered as a Machine Learning algorithm, due to its simplicity. It assumes that the relationship between the features and the target vector is approximately linear, following the linear equation below [33]. In the following subsections some models used will be discussed.

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots + \varepsilon \qquad (2.1)

2.5.1.1 Multiple Linear Regression

Since the prediction of this research has several features, multiple linear regression is used. This also follows a line, as shown in the following equation (based on [34]):

\hat{y} = X\beta + \varepsilon \qquad (2.2)

But using matrix notation to take into account the other dimensions corresponding to the other features that have to be considered.

X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{p1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{pmatrix}, \qquad
\hat{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \qquad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix} \qquad (2.3)

The algorithm fits the data and finds the best values for β and ε based on the Mean Squared Error (MSE), i.e. the distance of all the points to the fitted line (the predictions):

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 \qquad (2.4)
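A minimal sketch of fitting a multiple linear regression and evaluating it with the MSE of Equation 2.4 is shown below. scikit-learn is an assumption here (it is not named in this excerpt), and the data is synthetic.

```python
# Sketch: fit a multiple linear regression and evaluate it with the MSE of
# Equation 2.4. scikit-learn is assumed; the data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # 200 samples, 3 features
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)    # estimates beta_0 ... beta_p
y_pred = model.predict(X_test)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("test MSE:", mean_squared_error(y_test, y_pred))
```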

2.5.1.2 Ridge Regression

Regularization is an approach to tackle overfitting by adding more data, reducing the number of parameters, or adding a penalty for complexity.


Ridge regression uses L2 penalization to reduce the parameters [35]:

L2 : λkwk22= λ

m

X

j=1

wj2 (2.5)

In other words, adding a penalization to the least-squares cost function:

J(w)_{\mathrm{Ridge}} = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \lVert w \rVert_2^2 \qquad (2.6)

2.5.1.3 Lasso Regression

Lasso regression also uses penalization to reduce the complexity, but uses L1 penalization [35]:

L_1 : \quad \lambda \lVert w \rVert_1 = \lambda \sum_{j=1}^{m} \lvert w_j \rvert \qquad (2.7)

Which translates to the cost function:

J(w)_{\mathrm{LASSO}} = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \lVert w \rVert_1 \qquad (2.8)
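The sketch below compares the L2 penalty of Equation 2.6 with the L1 penalty of Equation 2.8, again assuming scikit-learn; its alpha parameter plays the role of λ, and the data is synthetic.

```python
# Sketch: Ridge (L2) and Lasso (L1) regularization; alpha plays the role of
# lambda in Equations 2.6 and 2.8. scikit-learn is assumed; data is synthetic.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only the first feature is informative

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)   # can set uninformative coefficients exactly to zero

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
```

Typically the Lasso coefficients for the uninformative features come out exactly zero, which illustrates why L1 penalization also acts as a form of feature selection.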

2.5.2 Polynomial Regression

With linear regression it is only possible to fit linear data; if the data is nonlinear, other types of models are needed. With this model it is possible to add powers of each feature as new features, and then train a linear model on this extended set of features [31]. With only one feature the equation would look as follows, reaching a degree d:

y = w_0 + w_1 x + w_2 x^2 + \ldots + w_d x^d \qquad (2.9)

This is applied to each of the features to make it multinomial (for many features). This model can help if a model underfits the training data and adding more training examples does not help, because it predicts with a more complex model.
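A sketch of Equation 2.9's idea, expanding the features with powers and fitting a linear model on top, is shown below; scikit-learn's PolynomialFeatures is assumed and the data is synthetic.

```python
# Sketch: polynomial regression as "powers of the features + linear model"
# (Equation 2.9), using scikit-learn's PolynomialFeatures (an assumption).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=80)).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + rng.normal(scale=0.2, size=80)  # nonlinear target

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)                      # fits w_0, w_1, w_2 of a degree-2 polynomial
print(model.predict([[1.0], [2.0]]))  # predictions for x = 1 and x = 2
```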


2.5.3 Decision Tree Regression

Decision trees are very effective models; they break down the information by making decisions based on the feature that results in the least error. By repeating the splitting steps, the data can be broken down until all the samples in a node belong to the same class [35].

To split the data into the optimal number of divisions and achieve the best results, it is necessary to minimize the Mean Squared Error (MSE), defined by the following equation:

\mathrm{MSE}(t) = \frac{1}{N_t} \sum_{i \in D_t} \left( y^{(i)} - \hat{y}_t \right)^2 \qquad (2.10)

where N_t is the number of training samples at node t and D_t is the training subset at node t. Finally, the equation for the predictions using the decision tree model is:

\hat{y}_t = \frac{1}{N_t} \sum_{i \in D_t} y^{(i)} \qquad (2.11)

2.5.4 Ensemble Regression

Ensemble means together; this type of model combines several models (for example Decision Trees) to provide better results.

2.5.4.1 Random Forest Regression

Random Forest is an ensemble model based on collections of Decision Trees. The idea is to average many of them, building a more dependable model that has less variance and is less prone to overfitting [35].

Several decision trees are trained, each on a sample of the training data and only a sample of the features. The final value is predicted by taking the average over all the trees.


2.5.4.2 Gradient Boosting Regression

This model is very similar to Random Forest, but instead of building the trees on random samples independently, Gradient Boosting builds the trees sequentially, each one correcting the mistakes of the previous tree; the correction aims to minimize the MSE. Gradient Boosting normally builds smaller trees, so it works faster, is more memory efficient and improves performance. It has a learning rate which controls how much each tree tries to correct the previous tree's mistakes; if the learning rate is high, the model becomes more complex.

2.5.4.3 XGBoost for Regression

Extreme Gradient Boosting (XGBoost) differs from Gradient Boosting in the greedy regularization technique it uses [36]. Regularization helps control overfitting, which also helps with performance.
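As a sketch, the three ensemble regressors discussed above can be fitted on the same data as follows; scikit-learn and the separate xgboost package are assumptions here, and the hyperparameters are illustrative.

```python
# Sketch: Random Forest, Gradient Boosting and XGBoost regressors on the same
# synthetic data. scikit-learn and the xgboost package are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

models = {
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingRegressor(learning_rate=0.1, random_state=0),
    "xgboost": XGBRegressor(n_estimators=200, learning_rate=0.1),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "train R^2:", round(model.score(X, y), 3))
```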

2.5.5 K-Nearest Neighbors Regression

The idea behind K-Nearest Neighbors is very simple: it is based on the distance between one row and all the other rows in the training set. The K (an amount that must be determined beforehand as a hyperparameter) nearest points to the row in question are chosen, and the predicted value is the average of these points.

Several types of distances can be used within this model; normally the Euclidean distance is used, which can be derived from the Minkowski distance between two vectors X and Y:

\mathrm{Minkowski\ Distance}(X, Y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^p \right)^{1/p} \qquad (2.12)

This is a generalization because using p = 2 (which is the most used) converts the formula into the Euclidean distance:

\mathrm{Euclidean\ Distance}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2.13)

Another distance that can be derived from the Minkowski distance is the Manhattan distance (p = 1), which can also be used for this model but is less common, as illustrated below.
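The relation between Equations 2.12 and 2.13 can be checked with a few lines of plain NumPy: the Minkowski distance with p = 2 reduces to the Euclidean distance, and p = 1 gives the Manhattan distance. The example vectors are arbitrary.

```python
# Sketch of Equations 2.12 and 2.13: the Minkowski distance reduces to the
# Euclidean distance for p = 2 and the Manhattan distance for p = 1.
import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, p=2))                 # 5.0, the Euclidean distance
print(minkowski(x, y, p=1))                 # 7.0, the Manhattan distance
print(np.linalg.norm(np.subtract(x, y)))    # 5.0, cross-check with NumPy
```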


2.5.6 Support Vector Machine (SVM) for Regression

Support Vector Machine (SVM) uses hyperplanes, which are (n-1)-dimensional subspaces of an n-dimensional space, and looks for the one that maximizes the margin between the separating hyperplane (or decision boundary) and the training samples that are closest to it [35]. SVM for regression then tries to fit as many instances as possible within the margin, while limiting margin violations [31].

There are several kinds of SVM kernels: linear, polynomial, Radial Basis Function (RBF) or sigmoid; depending on the data, the kernel can be changed to fit better and show better results. The RBF kernel, also known as the Gaussian kernel, helps to make approximations that consider all possible polynomials of all degrees, while the relevance of the features decreases for higher degrees [37].
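A minimal sketch of SVM for regression with an RBF kernel is shown below; scikit-learn's SVR is assumed, and C and epsilon are the usual hyperparameters controlling how strongly margin violations are penalized.

```python
# Sketch: support vector regression with an RBF kernel (scikit-learn assumed).
# C and epsilon control how strongly margin violations are penalized.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)   # noisy nonlinear target

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(svr.predict([[1.0], [2.5]]))   # predictions for two new points
```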

2.6 Deep Learning (DL)

Deep Learning is a subset of Machine Learning algorithms based on Artificial Neural Networks. Deep Learning includes other architectures such as Recurrent Neural Networks and Convolutional Neural Networks, but for this research only standard feed-forward Artificial Neural Networks (ANNs) will be used.

2.6.1 Artificial Neural Networks (ANNs)

ANNs are inspired by biological neural networks: how the brain works and exchanges information through neurons. The basic unit of an ANN is the perceptron, the equivalent of the brain's neuron. The similarities between the neuron and the single-layer perceptron are illustrated in Figure 2.5.

The perceptron's inputs are the different features corresponding to the input variables. The inputs are multiplied by weights, which change during learning in order to adjust to the inputs and outputs, and one constant is added as a bias to the operation. The weighted sum is then passed through an activation function to produce the output.

Usually, neural networks have three types of layers: the input layer, the hidden layers (there can be several) and the output layer. Figure 2.6 shows what a multi-layered perceptron (also called a feed-forward neural network) looks like.

DL is derived from the layers that increase the complexity of the ANNs, helping with the predictions.

The training of ANNs is done by backpropagation, where the weights are adapted in each epoch (every round of feeding the instances through the network). Backpropagation first makes a forward pass in which the prediction is calculated, then computes the error based on that prediction, and finally, in a backward pass, changes the weights to make the error smaller. How much each weight adapts to the changes is controlled by the chosen learning rate.

Figure 2.5: Neuron (based on a Paint3D model) and Perceptron

Choosing the number of hidden layers normally depends on the complexity of the problem: more complex problems need more layers. For the number of neurons in each layer, several rules of thumb exist, such as approximating it as a fraction of the number of inputs and outputs [38]. In the end, the only way of knowing is to try different values until one achieves the best accuracy.

Figure 2.6: Multi-layered Perceptron (input layer, hidden layers, output layer)
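As a hedged sketch of the multi-layered perceptron just described, the snippet below defines a small feed-forward ANN for regression. Keras/TensorFlow is an assumption (this excerpt does not name the DL library), and the layer sizes, learning rate and number of epochs are illustrative only.

```python
# Sketch of a small feed-forward ANN (multi-layered perceptron) for regression.
# Keras/TensorFlow is an assumption here; layer sizes and epochs are illustrative.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 8))                       # 8 input features
y = X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=500)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(8,)),  # hidden layer 1
    keras.layers.Dense(8, activation="relu"),                     # hidden layer 2
    keras.layers.Dense(1),                                        # single regression output
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")
model.fit(X, y, epochs=20, batch_size=32, verbose=0)              # backpropagation
print("training MSE:", model.evaluate(X, y, verbose=0))
```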


Chapter 3: Methods

The methodology used in this research project is based on the steps and the development of the project in each of its stages. Many different concepts are used, so the tasks carried out by the project help structure the overall view of the research. The project was developed in this way so that each of the sections can be separated and reused for other purposes.

3.1 Goals

The main goal of this research is to provide an implementation of a new program, which will help to predict the risk capacity, hence providing some insights about the customer’s personal finance. This project will provide a completely independent approach to predicting the amount of risk the customer should be comfortable taking when investing in assets.

Since the project also includes a recommender system, it is possible to use this part of the project in a separate environment to understand and develop a more conscious approach to financial health. The occupation mapping algorithm can also be reused for mapping other kinds of elements when no ground-truth values exist.

3.2 Tasks

In order to achieve the full development of this project, it was divided into different sections, as shown in Figure 3.1.

Figure 3.1: Task description — (1) Occupation Mapping, (2) Translator of Occupancies, (3) Recommender System (case-based), (4) Risk Capacity Prediction

The first two sections provide an NLP semantic approach. The Occupation Mapping generates two sets with all the possible occupation titles, to be used in the next step (the translator of occupations) and to have a list of all the occupations in both English and Swedish. This means that the first two steps provide an occupation mapped into one of the 10 major groups of the International Standard Classification of Occupations (ISCO) standard, instead of a free-text occupation, so that it can be used in Data Mining.

The next step provides a case-based recommender system that answers some simple questions about the customer's financial health. These simple variables trigger recommendations aimed at improving their personal finances. The variables are added to the table to be used in the customer's risk capacity prediction, for the purpose of getting some hard data points that give insights into their personal finances.

Finally, in the last step, different Machine Learning models are used to predict the risk capacity of the customer as accurately as possible. CRISP-DM is used as the methodology: for the prediction to provide accurate results, the data features to be considered must be complete and the data has to be correctly processed.

3.3 Evaluation

Regarding the evaluation of the program: the first section (the occupation mapping) only results in two datasets with all the mapped occupations, one in Swedish and one in English, so it is not possible to check performance in this first step. It is, however, possible to measure the performance of the translator of occupancies by providing an overall measure of the quantity of translated occupations, and to measure confidence using the Levenshtein distance, to provide some accuracy measure. Regarding the recommender system, it is not possible to provide any accuracy measure, since it is a case-based recommender system that triggers some new columns to get insights about the customer's financial behavior. Finally, for the risk capacity predictor, even though the data is not real, it is possible to fit an underlying model and measure its accuracy with the MSE and R2, to provide insights on how the model should be chosen when tested with a real dataset.
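The two evaluation measures mentioned above (MSE and R2) can be computed for any fitted regressor as sketched below; scikit-learn's metrics module is assumed and the true/predicted values are placeholders.

```python
# Sketch: computing the two evaluation measures mentioned above (MSE and R^2)
# for a fitted model. scikit-learn is assumed; y_test / y_pred are placeholders.
from sklearn.metrics import mean_squared_error, r2_score

y_test = [3.0, -0.5, 2.0, 7.0]        # true values (placeholder data)
y_pred = [2.5,  0.0, 2.1, 7.8]        # model predictions (placeholder data)

print("MSE:", mean_squared_error(y_test, y_pred))   # lower is better
print("R^2:", r2_score(y_test, y_pred))             # closer to 1 is better
```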


Chapter 4: Implementation

This project is implemented in the Python [39] language; it is open source and widely used, which translates into a large community that provides a lot of support.

Since this project is supposed to handle Big Data in the long run, Python [39] is used because it scales well and even integrates easily with Hadoop. It also provides a very wide range of open-source packages that are very useful when dealing with data analytics.

Among the packages used for this research, Pandas [40], for example, provides useful data structures called dataframes. These help the programmer inspect the data and apply other libraries to process it and produce statistical insights. NumPy [41] is also used; it helps with mathematical computing, working with multidimensional matrices and providing operations and predefined methods.

It is important to take into account that the program that translates and maps the occupations to one of the occupations in the list, needs to work very fast, because the occupation mapping is needed before applying the Machine Learning or Deep Learning algorithms.

4.1 Occupation Mapping

The aim is to provide a better classification of the occupation for mapping, and a correct allocation of the customer's occupation. The main problem with this field is that customers can write their occupation in their own words, which makes mapping the occupations a difficult job. For this value to provide useful information that can be used in the Machine Learning or Deep Learning algorithms, a clear set of occupations must first be mapped, providing a correspondence between what the customers wrote and the occupations table.


In order to find the correct occupation linked to each customer profile, it is imperative to use Natural Language Processing (NLP) to map and filter the words that carry meaning and provide a correct mapping of the words to a correct occupation. NLP is the branch of computer science that focuses on understanding natural languages (as used by people to communicate) and turning them into information that computers can process.

In this thesis in particular, some basic NLP approaches are used to clean the data in order to map it afterwards to the correct occupation. Two occupation lists are used, since customers must be able to use the language of their preference (either Swedish or English). The International Standard Classification of Occupations (ISCO), provided by the International Labour Organization (ILO) of the United Nations (UN) [42], provides a list of all possible occupations in English with the corresponding ISCO codes, and even the minor group occupations.

There are many versions of this standard; ISCO-08 will be used, which is also mapped by Sverige Statistiska Centralbyrån (Swedish Central Bureau of Statistics) (SCB) to the Standard för Svensk Yrkesklassificering (Standard for Swedish occupational classification) (SSYK) [43], of which the 2012 version will be used due to the clear connection between the two standards (ISCO-08 and SSYK-2012).

The ISCO file provides the list of ISCO-08 codes, the equivalent ISCO-88 codes (which are not going to be used in this project) and finally the English title of each occupation in the list. On the other hand, the SSYK file provides the SSYK code of 2012, along with the translation to the ISCO-08 code and a full list of the Yrkesbenämning (occupation names).

In order for this research to be of use, a list is needed with the ISCO-08 codes and the correct translation into the different groups of work. This is important because there are around 8,500 different occupation names in Swedish and another 7,000 in English. Using the convention, the different categories can be grouped and the subgroups to which the occupations belong can be identified; for that, a translation table is used, which is also provided by the International Labour Organization (ILO).
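As a minimal, hedged sketch of how this grouping can be derived: ISCO-08 unit-group codes are hierarchical, so the major, sub-major and minor groups can be read off by truncating the digits of the code. The function name and the example code below are illustrative assumptions, not part of the thesis implementation.

```python
def isco_groups(unit_code):
    """Derive the ISCO-08 group levels from a 4-digit unit-group code by truncating digits."""
    code = str(unit_code).strip()
    return {
        "major": code[:1],      # 10 major groups (first digit)
        "sub_major": code[:2],  # sub-major group (two digits)
        "minor": code[:3],      # minor group (three digits)
        "unit": code[:4],       # full unit group (four digits)
    }

# Hypothetical example: a 4-digit ISCO-08 code belonging to major group 2 (Professionals)
print(isco_groups("2512"))
```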

This first part of the project was done using public data sources and simple feature-cleaning procedures to produce the list that will be used to better understand the occupations. This methodology only has to be executed once, when the two files are generated; these files can then be used in the next steps without regenerating them, although some of the later programs will continue to change them. For this Occupation Mapping, the lists need to be treated and filtered by following the steps described in the figure below, to output two different files (one for Swedish and one for English) that contain all the occupation titles, the ISCO codes and the ISCO category titles.


1. Set up: import the libraries that will be used in the script and load the tables into dataframes.
2. Clean data: first convert all the strings to lowercase and delete repeated values, and take a look at the data to check everything looks as it should.
3. Merge datasets: combine the datasets to create the 2 final dataframes (one in Swedish and one in English).
4. Remove empty data: remove all the data that contains empty rows.
5. Tokenize strings: separate the strings by words.
6. Remove stopwords: remove all the words that do not give any more meaning to the string.
7. Stemming: add a new column with the stemmer output.
8. Add uncommon cases: add pensioner and student to the occupation list (since they are not mapped).
9. Export: finally convert the result to an Excel sheet that can be used for the mapping.

Figure 4.1: Occupation Mapping to generate occupations list


4.1.1 Set up

Start the process by importing all the Python libraries, for example the Natural Language Toolkit (NLTK) libraries that will be used in the script, as shown in Table 4.1. NLTK is one of the most used Python libraries for processing natural language because it has a lot of resources and interfaces to interpret human linguistics; it is open source and widely used.

Library Name    Version               Citation
Python          3.6.7 (for 64 bit)    [39]
Pandas          0.23.4                [40]
NLTK            3.3                   [44]

Table 4.1: Occupation Mapping package versions
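A minimal sketch of this set-up step is shown below; the exact import list is an assumption, based on the packages in Table 4.1 and the NLTK components used in the later steps.

```python
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

# One-off downloads of the pre-trained Punkt tokenizer and the stopword lists
nltk.download("punkt")
nltk.download("stopwords")
```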

Afterwards the pandas library is used to convert the tables into dataframes, in order to be able to use the data and apply different functions and analytics to the whole set.

Initially, there are three datasets: one with the ISCO codes and the English title of all the possible occupations; a second one with the SSYK codes (these are kept for possible future development), the corresponding ISCO codes and the Swedish occupation title (Yrkesbenämning); and a third one with the correspondence table between the ISCO code and the classification title (which includes the minor groups).
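Loading the three tables into dataframes could look as follows; the file names and sheet layouts are assumptions, since the thesis only states that three source tables are loaded.

```python
import pandas as pd

# File names are assumptions for illustration only.
isco_english = pd.read_excel("isco08_english_titles.xlsx")    # ISCO-08 code + English occupation title
ssyk_swedish = pd.read_excel("ssyk2012_yrkesbenamning.xlsx")  # SSYK-2012 + ISCO-08 code + yrkesbenämning
isco_structure = pd.read_excel("isco08_group_titles.xlsx")    # ISCO-08 code + classification title (minor groups)
```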

4.1.2 Clean data

The next step is to convert all the strings to lowercase and delete repeated values. The data is also inspected to make sure there are no empty fields and that everything looks as it should, and some characters that can produce errors, like hyphens ("-") and slashes ("/"), are filtered out.
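A small sketch of this cleaning step, under the assumption that each dataframe has a single title column to normalize, could be:

```python
def clean_titles(df, column):
    """Lowercase a title column, strip characters that break matching, and drop duplicates."""
    out = df.copy()
    out[column] = (
        out[column]
        .astype(str)
        .str.lower()
        .str.replace("-", " ", regex=False)
        .str.replace("/", " ", regex=False)
        .str.strip()
    )
    return out.drop_duplicates(subset=[column])
```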

4.1.3 Merge datasets

Combine the dataframes to create the 2 final dataframes (one in Swedish and one in English). The dataframes are merged on the ISCO code column, so there are two sets: an English dataframe containing the ISCO code, the English occupation titles and the ISCO classification with the minor groups, and an equivalent dataframe with the occupation titles in Swedish and the SSYK codes.
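With pandas, this merge on the shared ISCO code could be done as sketched below; the column name is an assumption.

```python
# Merge each title table with the classification table on the ISCO-08 code column.
english_df = isco_english.merge(isco_structure, on="isco_08_code", how="left")
swedish_df = ssyk_swedish.merge(isco_structure, on="isco_08_code", how="left")
```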


4.1.4 Remove empty data

To continue checking the dataset, it is important to make sure that there are no empty rows and to remove all the rows that contain empty strings.
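For example, assuming the occupation titles sit in a column called "title" (a hypothetical name), this step could be expressed as:

```python
# Drop rows whose occupation title is missing or contains only whitespace.
english_df = english_df.dropna(subset=["title"])
english_df = english_df[english_df["title"].str.strip() != ""]
```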

4.1.5 Tokenize strings

In this step, Natural Language Processing concepts start to emerge. The NLTK library includes a pre-trained tokenizer, called Punkt. Tokenizing the sentences means separating each sentence word by word while handling abbreviations and other important features.
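A short illustration with NLTK's word_tokenize (which relies on the pre-trained Punkt model for sentence splitting) is shown below; the example string is made up.

```python
from nltk.tokenize import word_tokenize

tokens = word_tokenize("computer and systems engineers", language="english")
# ['computer', 'and', 'systems', 'engineers']
```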

4.1.6 Remove stopwords

Remove all the stopwords, i.e. words that do not add any meaning to the string, which helps to improve the running time and the accuracy of the results. This is done using the corpus module of NLTK, which provides two lists of stopwords (one in Swedish and one in English), and removing any word that appears in the lists. Then, the punctuation marks are added to the list, because for this project it is not of interest to understand the tone of the sentence, just the correctly mapped occupation.
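A minimal sketch of combining both stopword lists with punctuation and filtering the tokens could look like this (continuing the hypothetical tokens from the previous example):

```python
import string
from nltk.corpus import stopwords

# Combine the English and Swedish stopword lists and add punctuation marks.
stop_set = set(stopwords.words("english")) | set(stopwords.words("swedish")) | set(string.punctuation)

tokens = ["computer", "and", "systems", "engineers"]
filtered = [t for t in tokens if t not in stop_set]
# ['computer', 'systems', 'engineers']
```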

4.1.7 Stemming

The next step is to add a new column with the corresponding root of each word. Stemming the dataset means that all words are reduced to their root: plurals and verb tenses are converted so that the roots of the words can be compared. For example, a string containing "computer engineers" would be transformed to "comput engin" after stemming. Lemmatization is closely related to stemming, but the words remain contextual (the canonical form of the word that exists in the vocabulary). Unfortunately, such a library is not available in Swedish, so stemming is used.

In this case, the Snowball stemmer is used, which is the most commonly recommended one. There are several other options, like the Porter stemmer, but it is less aggressive, and since stemming will only be used as a second option, that assertiveness is a required feature.
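NLTK ships Snowball stemmers for both languages used here; a minimal sketch is shown below (the expected output is the "comput engin" example given above).

```python
from nltk.stem.snowball import SnowballStemmer

english_stemmer = SnowballStemmer("english")
swedish_stemmer = SnowballStemmer("swedish")

# Expected to reproduce the "computer engineers" -> "comput engin" example above.
stemmed = " ".join(english_stemmer.stem(w) for w in ["computer", "engineers"])
print(stemmed)
```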


4.1.8 Add uncommon cases

There are some cases that are not covered by the occupation list. The list mostly contains occupations that produce some value, or for which one can receive payment. This is why other options, like pensioner, student and own business, had to be added to the occupation list.

4.1.9 Export

Finally, each of the two dataframes is exported to an Excel sheet that can be used for the next step, the mapping of the Occupations column in the full dataset.
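With pandas this export is a one-liner per dataframe; the output file names below are assumptions.

```python
# Write one Excel sheet per language for the translator step.
english_df.to_excel("occupations_english.xlsx", index=False)
swedish_df.to_excel("occupations_swedish.xlsx", index=False)
```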

4.2 Translator of Occupations

In this section the mapped lists of occupations will be used. The purpose of this part is to output the major group of the ISCO standard to which each occupation corresponds. The main goal is to take the free-text occupation, identify the job title it maps to and return the minor group, then the ISCO codes associated with that minor group, and finally map these to the 10 major groups, which are returned for the risk capacity prediction.

During the mapping process, several tasks were carried out, as shown in Figure 4.1. The complete process will be described in the upcoming sections, where all the assumptions and methods used will be thoroughly explained.

4.2.1 Set Up

Similarly to what was done for the mapping of the occupations list, all the libraries that will be used in the translator are imported. For this script the Natural Language Toolkit (NLTK) is also used, as well as Pandas and NumPy, as shown in Table 4.2.

For this script a library called LangId [47, 45] is used, which is a language identification package for Python. This package has a pre-trained model to help distinguish between 97 languages (including English and Swedish).
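A brief sketch of how LangId can be used for this purpose is shown below; the example strings are made up and the exact scores returned will vary.

```python
import langid

# Optionally restrict the model to the two expected languages.
langid.set_languages(["en", "sv"])

# classify() returns a (language code, score) tuple.
print(langid.classify("mjukvaruutvecklare inom bank"))  # expected language code: 'sv'
print(langid.classify("software developer at a bank"))  # expected language code: 'en'
```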

In this section, the three datasets that are going to be used are loaded as dataframes. First and most importantly, the occupations in free-text format are loaded; then the two datasets with the mapped occupations are loaded, one in Swedish and the other one in English.
