
Örebro University
School of Science and Technology
SE-701 82 Örebro, Sweden

Computer Engineering C, bachelor thesis, 15 credits

AUTOMATIC TAGGING OF SALARY DIFFERENCES

Emanuel Gustafsson

Computer engineering program, 180 credits, Örebro, spring 2021

Examiner: Martin Längkvist


Abstract

Workplace equality is still a big problem, especially when it comes to salaries. To help solve this problem, Sysarb has developed software for finding inequalities in salaries. One of their tools is a salary analysis tool for comparing the salaries in a company or organization to make sure they are fair across gender, age and other factors. When making a salary analysis, one of the most important parts is to explain the difference in salary between different groups of workers. This is currently done manually by the responsible managers, who select one of five premade tags. To speed up the process and make it less tedious for the managers, this report explores the possibility of automating it using machine learning. To achieve this goal, the algorithms boosted decision tree, random forest decision tree and logistic regression were evaluated to find which one best solved the problem. The models were trained on real-world data collected from Sysarb's salary software.

Keywords: machine learning, classification, salary analysis.

Sammanfattning

Jämlikhet på arbetsplatser är fortfarande ett stort problem särskilt när det kommer till

lönesättning. För att hjälpa till att lösa det problemet har Sysarb utvecklat olika verktyg för att hitta icke jämlika förhållanden i löner på företag och organisationer. Ett av de verktygen är ett löneanalysverktyg som hjälper till att jämföra löner så de är jämlika med avseende på kön, ålder och andra faktorer. Vid lönanalyser är en av de viktigaste delarna att förklara

skillnaderna i lön mellan olika grupper av arbetare. Denna process sker just nu manuellt av den chef som leder arbetet genom att välja en av fem förinställda taggar. För att snabba upp processen och göra den mindre repetitiv för chefer utforskar den här rapporten möjligheten att automatisera processen med maskinlärning. För att uppfylla det här målet var algoritmerna boosted decision tree, random forest decision tree och logistic regression utvärderade för att hitta den som bäst löste problemet. För att träna modellerna så användes data insamlad från Sysarbs löneanalysverktyg.


Preface

Thanks to Sysarb for letting me do this project, and special thanks to Pierre Hedstöm and Stephanie Lowry for their excellent support and help.


Table of Contents

1 Introduction
1.1 Background
1.2 Project
1.3 Requirements
2 Background
2.1 Machine learning
2.2 Classification
2.3 Metrics
2.3.1 Accuracy
2.3.2 Recall
2.3.3 Confusion matrix
2.4 Bias and variance
2.5 Feature selection
2.6 Decision trees
2.6.1 Random forest
2.6.2 Boosting
2.7 Logistic regression
3 Methods and tools
3.1 Methods
3.2 Tools
3.3 Other resources
4 Implementation
4.1 Feature engineering
4.2 Model development
5 Result
5.1 Model results
5.1.1 Boosted decision tree
5.1.2 Random forest decision tree
5.1.3 Logistic regression
5.2 Conclusions
6 Discussion
6.1 Fulfilling of the project requirements
6.2 Social and economic implications
6.3 Further development potential
7 Reflection on own learning
7.1 Knowledge and understanding
7.2 Skill and ability
7.3 Evaluation ability
8 References


1 Introduction

1.1 Background

The project aimed to develop a machine learning model that, given a difference in salary, could select the tag that best describes why the difference exists.

In Sweden, every company needs to perform a salary analysis every year to make sure there is no unseen bias in its salary policies. All companies with more than 10 employees must also send the result of this analysis to the government in the form of a report. Sysarb AB makes this process much simpler with its web-based software, which lets managers find and explain differences in salary between different groups of workers as well as generate the report.

The process of finding salary differences starts with the employer grouping all workers into groups of similar work; for example, all cleaners would be in one group and all teachers in another. When all workers have been divided, the employer, usually in collaboration with the workers' unions, scores the different groups based on a couple of criteria, such as the education required for a job or the responsibility that accompanies it. Given the scores, all groups are then divided into ranks. The idea is that the rankings should correspond to the salaries: groups of workers in a higher rank should have a higher salary than those in lower ranks, and groups in the same rank should have roughly the same salary.

Given the rankings, comparisons can now be made to see if there are any unexplained differences in salary. These comparisons can be done in three different ways:

1. Within a group.
2. Between groups in the same rank.
3. Between groups in different ranks.

The last step in the analysis is for the manager responsible for the salaries to explain the differences. This has traditionally been done manually, using a set of tags that represent all the lawful reasons for a difference in salary. The goal of this project was to create a proof-of-concept automated system for setting these tags. If a difference cannot be explained, the salaries should of course be adjusted.

1.2 Project

The project aimed to develop a proof of concept to see if the act of explaining differences in salaries could be automated. This was done to see whether the system could in the future be fully automated, or implemented as a complement to the manual input method to make the software easier for customers to use.

This was done by developing and testing three different machine learning models to see which was best suited for the problem. The three models were built from three optimized algorithms available in Azure Machine Learning Studio: multiclass boosted decision tree, multiclass decision forest and multiclass logistic regression.

1.3 Requirements

The requirement for the project was to develop a model able to predict the labels of salary differences. The model should furthermore have an accuracy higher than random chance; the higher, the better.


2 Background

Unless otherwise noted, the information below is drawn from The Hundred-Page Machine Learning Book [1], An Introduction to Statistical Learning [2], Data Science from Scratch [3], Pattern Recognition and Machine Learning [4] and Fundamentals of Machine Learning for Predictive Data Analytics [5].

2.1 Machine learning

Machine learning in the general sense refers to techniques for finding some sort of pattern or relationship in a dataset that can be learned in order to make decisions or predictions on similar data in the future. This could be guessing which products a user would want to buy based on previous purchases, deciding which way to drive a car based on camera input, or predicting the weather based on previous days' weather.

Machine learning is split into two main branches, supervised and unsupervised learning. The difference is that in supervised learning the data is labeled with the desired output, while in unsupervised learning the data does not contain any information about how it should be structured. In this project only supervised learning is discussed.

In supervised learning the goal is generally to find some label or number given a sample of data. These two types of problems are referred to as classification and regression problems: in classification the goal is to fit a label or class to the data, and in regression to find a real number that fits the data. A classic starting problem in regression is predicting a house price based on some rudimentary data about the house. For classification, an example would be predicting whether a person has breast cancer based on X-ray images [6]. There are many more kinds of problems in the area of supervised learning, but they can almost always be generalized to a regression or classification problem.

2.2 Classification

A good example of a classification problem is labeling images of rooms with the correct room label. This was the problem faced by a group of researchers who were looking into how the interiors of homes differ around the world. They did this by collecting thousands of images from Airbnb and labeling them with the correct room label, in this case labels like bedroom, kitchen and bathroom. The next step was to train a machine learning model using the set of images and labels so that it would output the correct label for a given image. The researchers could then give the model images it had never seen and have it output the correct room label. [7]

To develop machine learning models, different algorithms are used, ranging from simple linear regression to extremely complicated deep neural networks. But the goal of all machine learning algorithms is the same: to find a function F(x) such that F(x) = y, where x is the data sample and y is the value or prediction sought. In the example above, x would be the input image and y the room label. What y represents differs from problem to problem; it could be as simple as a label, or it could represent which way a car should drive given some video input.

Different algorithms go about finding F(x) in different ways. In linear regression a relatively simple calculation can be used, but in deep neural networks small tweaks are made to thousands or even millions of weights using gradient descent until the best function has been found. No matter how an algorithm goes about finding the function, the process is referred to as training and the function found at the end is referred to as the model.


Some algorithms can solve both regression problems and classification problems, but most have their strengths in one or the other.

2.3 Metrics

When training and developing models, it is useful to have different metrics to compare models against each other and see which one is best. The reason is that it is almost impossible to make a model that is 100% perfect. To predict perfectly, a model would need perfect information: all the data that could be measured would have to be measured, and all our measuring tools would have to be perfect and always give the exact value. This is in practice impossible to achieve. Given this, different metrics are used to decide which model best solves the problem. For example, if our data is non-linear, a linear model will generally perform worse than a non-linear one, and this will be reflected in the metrics.

However, it is possible to train a model to predict the data samples it is being trained on perfectly. This might seem like a good thing but has problematic implications for predicting unseen samples. The problem is that the model has fit the data too closely, modeling not only the general relationship between the data and the target but also the errors in the data. This is called overfitting. See figure 1 for an example of a regression model that is well fitted to the data and one that is overfitted.

Figure 1: Left, a well-fitted model; right, an overfitted model.

To avoid overfitting, the accuracy of a trained model is not measured on the data it was trained on. Instead, the data is split into two sets, the training set and the test set, with all testing done on the test set. This makes it possible to detect overfitting, since a model that has started reflecting the errors of the training data will do worse on the test set than one that has not. Different algorithms have a greater or smaller chance of overfitting depending on how flexible they are. For example, a linear model will never be able to model a nonlinear relationship, making its risk of overfitting small. A highly flexible model, on the other hand, will have a high chance of overfitting.

If our model is too unlike the data, the opposite problem can occur, where the model does not capture any relationship in the data at all; this is called underfitting. The solution to underfitting is to choose a more flexible model that can capture more complex relationships.
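As a minimal sketch of the train/test split described above, the snippet below holds out part of a toy dataset for testing; the data and split ratio are illustrative, and scikit-learn (used later in the project) provides the helper.

```python
# Hold out a test set so overfitting can be detected: a model whose
# score keeps rising on the training set while falling on the test
# set has started to model the noise in the training data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # toy attribute column
y = (X.ravel() > 50).astype(int)   # toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))  # 80 training samples, 20 test samples
```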


2.3.1 Accuracy

Accuracy is a basic metric to get a quick check on the predictive power of a trained model. The accuracy of a model is defined as the percent of the cases that were correctly classified out of all cases (1).

Accuracy = Correctly classified cases / All cases   (1)

2.3.2 Recall

Another useful metric is recall, which for a class is defined as the number of correct predictions for that class divided by the correct predictions plus the cases that should have been labeled as that class but were not (2).

Recall for class i = Correct predictions_i / (Correct predictions_i + Wrongful predictions_i)   (2)

Since the recall for a single class alone is not that useful, we also want the recall for the entire dataset. This can be computed in two ways. The first is called macro-recall, which simply averages the recall over all classes (3).

Macro recall = (Σ_{i=1..C} Recall_i) / C   (3)

Macro-recall gives each class the same weight even though the classes may constitute different shares of the dataset. This is where micro-recall comes in: it combines the per-class recall scores as a weighted average based on each class's proportion of the dataset (4). This gives a more truthful picture of the model's recall.

Micro recall = Σ_{i=1..C} (Recall_i × Share_i)   (4)

2.3.3 Confusion matrix

Metrics that apply to the entire dataset are useful for a quick overview and for comparing models, but to understand the workings of a model it is important to see the predictions for each class. This is where the confusion matrix comes in. The confusion matrix shows how well a model works on the individual classes by showing what the model predicted given what it should have predicted, see figure 2. This makes it possible to see where the model predicts wrong and what can be improved.


Figure 2: Example of a confusion matrix showing which class was predicted in each situation.
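A minimal sketch of these metrics using scikit-learn, on made-up tag labels. Note that what this report calls micro-recall, a share-weighted average of the per-class recalls, corresponds to scikit-learn's average='weighted'; scikit-learn's own average='micro' pools all samples and, for multiclass problems, equals plain accuracy.

```python
# Compute accuracy (1), per-class recall (2), macro-recall (3),
# share-weighted recall (4) and the confusion matrix (section 2.3.3)
# for a toy set of true and predicted tags.
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = ["Erfarenhet", "Erfarenhet", "Utökat ansvar", "Alternativ arbetsmarknad"]
y_pred = ["Erfarenhet", "Utökat ansvar", "Utökat ansvar", "Alternativ arbetsmarknad"]

print(accuracy_score(y_true, y_pred))                    # (1)
print(recall_score(y_true, y_pred, average=None))        # (2), one value per class
print(recall_score(y_true, y_pred, average="macro"))     # (3)
print(recall_score(y_true, y_pred, average="weighted"))  # (4)
print(confusion_matrix(y_true, y_pred))                  # rows: true, columns: predicted
```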

2.4 Bias and Variance

Two other important attributes of machine learning models are bias and variance. Bias consists of the assumptions a model makes about the form of the data. Models that can only capture linear relations therefore have higher bias than models that can capture more complex relationships. Variance is the amount the model would change if the training data were changed. High variance means the model would change a lot if the data changed, indicating that the model has not found the underlying relationships in the data but is modeling the data directly. Low variance, on the other hand, indicates that the model would not change significantly if the data changed, suggesting that it has found the underlying relations. Generally there is a tradeoff between the two, meaning that a decrease in bias will increase the variance and vice versa. This is called the bias-variance trade-off; see figure 3 for an example of how bias and variance change with the complexity of the model.


Figure 3: The impact of model complexity on bias and variance.

2.5 Feature selection

When training machine learning models it is not unusual to have large amounts of data with hundreds or even thousands of attributes. Because of this it is not always feasible to use all the attributes in the data when training models. Instead, a subset of attributes is selected carefully to make the best model possible without having to use them all. This is referred to as feature selection.

The goal of feature selection is to select the attributes of the data that have the highest predictive power on the target. One of the easiest ways to do this is using normal correlation, which works great when all data attributes and the target are numerical.

However, correlation has its drawbacks, mainly that it cannot handle categorical data in an easy way, and that it will only find linear relationships, missing any relationships that have a more complex form. It also expects the data to follow a normal distribution.

A better way to check the predictive power of attributes is the predictive power score. The predictive power score can handle categorical data and targets out of the box, finds nonlinear relationships, and is not symmetrical. That the score is not symmetrical means that even though one attribute predicts another well, the same is not necessarily true the other way around. For example, you might have one attribute city and another country: given the city, the country can easily be predicted, but given the country it is not possible to predict the city. The easiest way to go about feature selection is then to sort the attributes according to their predictive power on the target class and take the required number of attributes from the top.
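The asymmetry can be demonstrated with the ppscore library used later in the project; the city/country rows below are made up for illustration.

```python
# pps.score(df, x, y) measures how well column x predicts column y.
# City determines country, but country cannot pin down the city,
# so the two directions get very different scores.
import pandas as pd
import ppscore as pps

df = pd.DataFrame({
    "city":    ["Örebro", "Stockholm", "Oslo", "Bergen"] * 4,
    "country": ["Sweden", "Sweden", "Norway", "Norway"] * 4,
})
print(pps.score(df, "city", "country")["ppscore"])  # close to 1
print(pps.score(df, "country", "city")["ppscore"])  # close to 0
```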

2.6 Decision trees

Figure 4: An example decision tree.

Decision trees are a machine learning method based on binary trees. A prediction is made by following the tree down its branches, taking different paths depending on different attributes of the data. The algorithm eventually ends up in a leaf node, and each leaf node has a prediction attached to it, which becomes the prediction of the model. For example, in the tree in figure 4 the algorithm would start by checking the number of siblings a person has; if it is more than five, we take the right branch, then check the education, and depending on the education we end up in a leaf node that gives us our prediction, hired or not.

To train a decision tree we start with our dataset. A small example of what that could look like can be seen in table 1. Usually a dataset would have thousands of rows to make the predictions more accurate.

Table 1

Siblings  Living Conditions  Education    Married  Age  Hired
1         Apartment          High-School  Yes      20   No
2         House              Bachelor     No       37   Yes
3         House              Bachelor     Yes      26   No
5         House              Master       Yes      52   Yes
1         Apartment          High-School  Yes      50   No
2         Apartment          High-School  No       21   No
3         Apartment          High-School  Yes      35   Yes
1         House              High-School  Yes      24   Yes
2         House              Bachelor     Yes      46   No

Figure 5: Starting point for the decision tree algorithm.

At the start of the training process we only have one node, see figure 5, and all the data belongs to that node, so if a prediction were made at this point the result would be no. This is because the prediction of a leaf node is the same as the majority class of the data attached to it: since there are 6 no samples and 4 yes samples, the prediction is no. To calculate how good the tree is at any given point, the easiest way is to use the classification error rate (5); there are better methods, but they are all built on the same idea. The classification error rate gets smaller the more samples belong to the majority class in a node, meaning that if all samples in all leaf nodes belong to one class the classification error rate is 0.

Classification error rate = 1 − Percent of samples belonging to majority class   (5)

To find the classification error rate for the entire tree, the error rates of all leaf nodes are averaged. So in our minimal tree with only one node the error rate would be 0.4, since 60% of the samples belong to the majority class. Given this number the actual training can start, by finding the split of the data that minimizes the error rate of the entire tree. To do this the algorithm goes through the attributes one by one, and for each attribute it tries to split on every unique value in the attribute column. In our example data it would start with the siblings attribute and try splitting on all samples with 1 sibling or fewer; it would then move on to 2 siblings or fewer, and so on until all splits of the siblings attribute have been considered. It would then move on to the next attribute, and so on until all attributes have been covered.

Each split the algorithm checks yields two datasets, for which it calculates the error rate. The last step is selecting the split with the smallest error rate and creating two new leaf nodes, each with its own data. The goal is to end up with a tree where each leaf node contains only one class. This is not always possible: if the amount of training data is very big, the tree necessary to achieve it would become unwieldy, which is why the majority class of each leaf node is used for the prediction instead.

For our example tree the best split was on siblings, with the samples having 4 or fewer siblings going to one leaf and the rest to the other. The resulting tree can be seen in figure 6.


Figure 6: The sample decision tree after one iteration of the decision tree algorithm.

This process is then repeated until one of the stopping criteria has been met. The stopping criteria can be that a maximum tree depth has been reached, that the number of samples in a leaf node is fewer than some threshold, or that all samples in a leaf node belong to the same class. When deciding where to split while more than one leaf node exists, all possibilities are searched before a decision is made; this means the tree will not necessarily be symmetrical, as seen in figure 4.

Decision trees in their pure form have quite low bias, since they do not make any assumptions about the form of the data. However, they have quite high variance: a single new sample can change what the best first split is, which can yield a completely different tree in the end.

Decision trees are prone to overfitting if they are left to train for too long. This can be avoided in many ways; the most common is to monitor the accuracy of the model during training on both the training set and the test set. If the accuracy starts to go down on the test set while still going up on the training set, the model has overfitted. When this is detected, the model can be rolled back a couple of splits and the training is then finished.
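The whole procedure can be sketched with scikit-learn on the table 1 data (the project itself used Azure Machine Learning Studio's implementations, so this is only an illustration); categorical columns are dummy-encoded since scikit-learn's trees expect numeric input.

```python
# Fit a small decision tree to the table 1 toy data and predict
# whether the first applicant would be hired.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Siblings":  [1, 2, 3, 5, 1, 2, 3, 1, 2],
    "Living":    ["Apartment", "House", "House", "House", "Apartment",
                  "Apartment", "Apartment", "House", "House"],
    "Education": ["High-School", "Bachelor", "Bachelor", "Master", "High-School",
                  "High-School", "High-School", "High-School", "Bachelor"],
    "Married":   ["Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
    "Age":       [20, 37, 26, 52, 50, 21, 35, 24, 46],
})
hired = ["No", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"]

X = pd.get_dummies(data)          # one-hot encode the categorical columns
y = pd.Series(hired)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict(X.iloc[[0]]))  # prediction for the first row
```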

2.6.1 Random forest

Random forest is a method for reducing the chance of overfitting and improving the accuracy of decision trees. Instead of training one huge tree to classify the data, many small trees are trained and their predictions are combined at the end, giving a better prediction than a single tree could. To make sure the small trees are not all the same, each of them is trained on a subset of the original training data instead of the entire dataset. Furthermore, to reduce correlation between the trees, each small tree only considers a few of the attributes at each split instead of all of them, so that the attributes with the highest predictive power are not chosen every time. To combine all the trees into a prediction, simple voting is used: the class most of the trees choose becomes the final prediction.
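A corresponding sketch with scikit-learn, reusing X and y from the decision tree example above (again a stand-in for the Azure implementation used in the project):

```python
# Train 8 trees, each on a bootstrap sample of the data and with a
# random subset of features considered at every split, then let the
# trees vote on the final class.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=8, max_features="sqrt").fit(X, y)
print(forest.predict(X.iloc[[0]]))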

2.6.2 Boosting

Another method for getting better results from decision trees is boosting. Just as with random forest, boosting works by training multiple trees. Unlike random forest, boosting trains one tree at a time, with each new tree correcting for the mistakes of the last. This is done by changing the training data depending on the predictions of the previously trained tree: in a simplified description, the importance of the samples the tree predicted correctly is lowered, and the importance of the samples it did not predict correctly is raised. This reweighted data is what the next tree is trained on. Additionally, each new tree's impact is scaled down by a small amount given by the training rate. A small training rate generally gives better results but requires more trees in total for the training to be successful.

The process of training trees and changing the dataset is repeated a predefined number of times. The predictions of all the trees are then combined just as with random forest by voting.
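The sample-reweighting scheme described here matches AdaBoost, so a hedged sketch can use scikit-learn's AdaBoostClassifier (Azure's multiclass boosted decision tree is a gradient-boosted variant, so this is an analogy rather than the exact algorithm), again reusing X and y from the decision tree example:

```python
# Train 100 small trees in sequence; after each round the training
# samples are reweighted so the next tree focuses on the mistakes,
# and learning_rate scales down each tree's contribution.
from sklearn.ensemble import AdaBoostClassifier

booster = AdaBoostClassifier(n_estimators=100, learning_rate=0.1).fit(X, y)
print(booster.predict(X.iloc[[0]]))
```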

2.7 Logistic regression

The third and last algorithm used in this project is logistic regression. This algorithm is quite different from the tree-based algorithms, as it is an extension of linear regression changed to work with classification tasks. In linear regression the goal is to fit a line to the data using the standard linear equation (6).

y = kx + m   (6)

To extend this to work with multiple attributes, a new term can be added for each attribute (7).

y = Σ_{i=1..n} k_i x_i + m   (7)

To accommodate categorical attributes, dummy variables can be used. For example, if we have an attribute with the three values child, teenager and adult, we can represent it with two dummy variables x and z, where x = 1 indicates a child, z = 1 a teenager, and both being 0 an adult. In this way any categorical attribute can be represented using n − 1 dummy variables, where n is the number of categories. With this little trick both numerical and categorical attributes can be represented.

To change linear regression to handle categorical predictions, we can assign the desired output a number, just as with categorical attributes. For example, a yes/no column could be modeled as 1 and 0 (or any other pair of numbers), see table 2. Given this, a normal linear regression can be done to find a line that tries to fit the new numbers, see figure 7.

Table 2

Salary  Manager  Manager mapped
14328   No       0
15865   No       0
17658   No       0
18213   No       0
20251   No       0
22698   No       0
23448   No       0
24034   No       0
27858   No       0
28634   No       0
28710   No       0
28976   Yes      1
32627   No       0
35134   Yes      1
35833   Yes      1
43370   Yes      1
43456   Yes      1
44140   Yes      1
45632   Yes      1
47810   Yes      1

Figure 7: A linear model fitted to the example data in table 2.

Since the target data now only contains zeros and ones, the model should preferably only output numbers between zero and one, but a regression line will predict any number, as can be seen in figure 7. This can be solved by passing the prediction through a sigmoid function, see figure 8, which takes any number and outputs a number between zero and one. After the prediction has passed through the sigmoid function it can be rounded to find the actual class: a one is a yes and a zero is a no. To predict the value of an unseen data sample, the model feeds the new salary through the fitted linear equation, passes the result through the sigmoid function, and rounds that number to get a prediction. This is the same as looking at figure 8, finding where the salary lies on the x axis, and checking the line value right above it; rounding that value gives the same prediction as the calculation.


Figure 8: The linear model from figure 7 passed through the sigmoid function.

To extend this concept to scenarios with more than two target classes, one model can be trained per class, where each model's goal is to predict the probability that a sample belongs to its class. To combine the outputs of the different models, they are passed through the softmax function instead of the sigmoid function. Softmax can be thought of as a sigmoid for multiple variables: instead of outputting a single number between zero and one, it outputs a probability distribution over all the models. The final prediction is then simply the class with the highest probability.
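A minimal sketch of the two-class case, fitting a logistic regression to a handful of the table 2 rows with scikit-learn (which applies the sigmoid internally; for more than two classes it fits a multinomial, softmax-based model):

```python
# Fit manager/not-manager on salary alone, then read off the
# sigmoid probability and the rounded class for a new salary.
import numpy as np
from sklearn.linear_model import LogisticRegression

salary = np.array([[14328], [20251], [24034], [28976], [35134], [43370], [47810]])
manager = np.array([0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(salary, manager)
print(model.predict_proba([[30000]]))  # [P(no), P(yes)] after the sigmoid
print(model.predict([[30000]]))        # the rounded class
```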

Since the logistic regression algorithm is based on linear regression, it has a hard time modeling complex relationships and therefore has a high bias. However, just as with normal linear regression, the variance is low: a big part of the data must change before the model changes significantly. Logistic regression can be modified to handle non-linear relationships, but that is outside the scope of this project.


3 Methods and tools

3.1 Methods

The predictive power score [8] is a metric that aims to show the predictive power of one attribute on another. In many ways it tries to accomplish the same thing as simple correlation, but with the added benefits of picking up non-linear relationships between attributes, not being symmetrical, and handling categorical data out of the box. That the score is not symmetrical is useful when looking at relationships in data. For example, if you have one Boolean attribute and one integer attribute with thousands of distinct values, the Boolean attribute will never be able to correctly predict the integer attribute, but the integer might be able to predict the Boolean. There are, however, some things to keep in mind when using the predictive power score: first, the amount of computation compared to normal correlation is massive, and second, you can only compare predictive power scores that have been calculated against the same target attribute. This is because the predictive power score derives its scale, what counts as a 1 and a 0, from the target column.

In this project the predictive power score is used to calculate the relevance of including an attribute in the final model. This is done by selecting the attributes with the highest scores, since these should yield the best predictions.
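A sketch of this selection step with the ppscore library [8]; the DataFrame below is synthetic stand-in data, since the real dataset is confidential, and the column names are made up (the real attributes appear in table 3).

```python
# Rank every attribute by its predictive power on the target tag,
# then take the required number of attributes from the top.
import numpy as np
import pandas as pd
import ppscore as pps

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mean_salary_diff":   rng.normal(2000, 800, 400),
    "median_salary_diff": rng.normal(1800, 900, 400),
    "mean_age_diff":      rng.normal(0, 5, 400),
})
df["Tag"] = np.where(df["mean_salary_diff"] > 2200,
                     "Erfarenhet", "Alternativ arbetsmarknad")

ranking = pps.predictors(df, y="Tag")  # one score per attribute, sorted descending
print(ranking[["x", "ppscore"]])
top_attributes = ranking["x"].head(2).tolist()
```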

3.2 Tools

The models were developed in Azure Machine Learning Studio, an all-in-one solution for developing and deploying machine learning models that also handles storage of the datasets used for training. Access to Azure was provided by Sysarb. The main part of Azure Machine Learning Studio is the pipeline creator, which provides a simple drag-and-drop interface for creating machine learning models, see figure 9 for an example. The pipeline creator supports the most common data transformations, such as attribute selection, normalization and standardization. It has optimized versions of some of the most common algorithms for both regression and classification, and built-in tools for performing parameter sweeps to find the best parameters automatically. If more complex behavior is needed, the environment also allows Python nodes to be added, letting the user combine the simplicity of the drag-and-drop interface with the power of Python and its libraries.


Figure 9: Example of an Azure ML Studio model pipeline.

For data analysis, Python was used in a Jupyter notebook. The Jupyter notebook is a perfect environment for data exploration because of its cell-based structure, which allows the user to run code piece by piece and store intermediate results, speeding up the exploration.

To calculate the predictive power of different attributes against a target, the ppscore library was used. This library lets the user easily calculate the predictive power of any attribute given a target, even if the attribute or the target consists of categorical data. [8] For simple data manipulation, such as normalization and standardization, the Scikit-learn library was used. The library also contains tools for training and developing models directly in Python, but these were not used. [9]

3.3 Other resources

The models were trained on a dataset containing in excess of 5000 examples of salary differences and their corresponding tags; in this project, tag is synonymous with the class of a sample. The dataset was provided by Sysarb and was collected from multiple salary analyses in the Swedish public sector. Since the data contains salary information, a non-disclosure agreement was signed before it was handed over. The dataset contains some basic information about the groups being compared, such as the name of the group, the work code, mean salary, median salary, mean age, and the points awarded during the salary mapping. The dataset is labeled with five different tags with a quite uneven split, see figure 10. All in all the data contains 23 different attributes, including the tags.


Figure 10: Distribution of the data between the different tags in the dataset used for training the models.


4 Implementation

The development of the models was done by first calculating the predictive power of each attribute in the data to find which ones were best at predicting the tags. The models were then trained in an iterative fashion, with each iteration adding a new attribute to the model.

4.1 Feature engineering

To train the models, the predictive power score of each attribute was calculated with respect to the target label. The score measures how well each attribute predicts the label, not the other way around. The scores are presented in table 3.

Table 3

Attribute                            Predictive power score
MedellonJamförelsearbete             0,54
JamforelsearbeteKod                  0,54
MedianlonJamforelsearbete            0,52
VarderingspoangJamforelsearbete      0,52
SkillnadMedellon                     0,46
LonespridningJamforelsearbete        0,42
SkillnadMedianlon                    0,38
SkillnadVarderingspoang              0,34
Alderjamforelsearbete                0,34
Anställningstidjamforelsearbete      0,34
MedellonKvinnodomArbete              0,25
MedianlonKvinnodomineradeArbete      0,21
VarderingspoangKvinnodominerade      0,20
KvinnodomineratArbeteKod             0,17
Skillnadlonespridning                0,13
AntsallningstidKvinnomdominerade     0,10
SkillnadAlder                        0,09
SkillnadAnstallningstid              0,09
LonespridningKvinnodomineratArbete   0,04
AlderKvinnodominerade                0,01

The scores are relative to the target column: an attribute that completely predicts the target gets a score of one, and an attribute with no predictive power gets a score of zero. For this reason MedellonJamförelsearbete, with a score of 0.54, is the first to be included in the model, since it will help predict the label better than attributes with lower scores. To try to improve the predictive power of the attributes, each of them was in turn both standardized and normalized. The predictive power score was then recalculated on the standardized and normalized values and subtracted from the original scores to find the attributes that were improved. It turned out that the attributes improved by standardization were the same ones improved by normalization, and the improvements from normalization were smaller than those from standardization, so the normalized attributes were not used. The attributes that were improved by standardization can be seen in table 4. All other attributes were either not improved or negatively affected by being standardized.

Table 4

Attribute                 Improvement
SkillnadMedellon          0,000386
SkillnadVarderingspoang   0,002318
Skillnadlonespridning     0,000112
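The comparison can be sketched as follows, on synthetic stand-in data (attribute and target names are made up); note that because the predictive power score is computed with decision trees, which are largely scale-invariant, the changes are expected to be small, as table 4 also shows.

```python
# Recompute an attribute's predictive power after standardizing it
# and report the change against the original score.
import numpy as np
import pandas as pd
import ppscore as pps
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
demo = pd.DataFrame({"salary_diff": rng.normal(2000, 800, 400)})
demo["Tag"] = np.where(demo["salary_diff"] > 2200,
                       "Erfarenhet", "Alternativ arbetsmarknad")

before = pps.score(demo, "salary_diff", "Tag")["ppscore"]
demo["salary_diff_std"] = StandardScaler().fit_transform(demo[["salary_diff"]])
after = pps.score(demo, "salary_diff_std", "Tag")["ppscore"]
print(after - before)  # positive means standardization improved the score
```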

4.2 Model development

To train the models, the predictive power scores of the attributes were sorted in descending order. Training started using only the top attribute from the predictive power scores, and a parameter sweep was done on each of the three models to find its optimal parameters. This procedure was repeated, each iteration adding the next most effective attribute, stopping when new attributes no longer improved the accuracy of the models in any meaningful way. The final models used the following attributes: JamforelsearbeteKod, LonespridningJamforelsearbete, MedellonJamförelsearbete, MedianlonJamforelsearbete, SkillnadMedellon, VarderingspoangJamforelsearbete and SkillnadMedianlon.
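The loop can be sketched as below, reusing df and ranking from the feature selection sketch in chapter 3; the model, stopping threshold and cross-validation setup are stand-ins for the sweep Azure Machine Learning Studio performed.

```python
# Greedily add attributes in descending predictive-power order and
# stop when cross-validated accuracy no longer improves meaningfully.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

best_acc, chosen = 0.0, []
for attr in ranking["x"]:
    candidate = chosen + [attr]
    acc = cross_val_score(GradientBoostingClassifier(),
                          df[candidate], df["Tag"], cv=5).mean()
    if acc - best_acc < 0.005:   # no meaningful improvement: stop
        break
    best_acc, chosen = acc, candidate
print(chosen, best_acc)
```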

As discussed in chapter 3, different algorithms have different parameters that can be tweaked to improve the performance of the model being trained. The parameters used for the final versions of the models can be seen in table 5.

Table 5

Boosted Decision Tree
Learning rate                        0.1
Minimum number of samples per leaf   1
Number of trees                      100
Maximum number of leaves             8

Random Forest Decision Tree
Maximum depth of trees               16
Minimum number of samples per leaf   1
Number of trees                      8

Logistic Regression
Optimization tolerance               0
L2 regularization weight             0.01

The specific parameters of each model come from the specific implementations of each algorithm in Azure Machine Learning Studio. An example of a final pipeline implementation can be seen in figure 11.

Figure 11: The final pipeline implementation of the boosted decision tree model.

All the pipelines are similar; the big difference is the model node used, in this case multiclass boosted decision tree. Beyond that, the main difference lies in the parameters used for the different models.

5 Result

5.1 Model results

Table 6

Attribute                         Predictive power score
MedellonJamförelsearbete          0,54
JamforelsearbeteKod               0,54
MedianlonJamforelsearbete         0,52
VarderingspoangJamforelsearbete   0,52
SkillnadMedellon                  0,46
LonespridningJamforelsearbete     0,42
SkillnadMedianlon                 0,38

To train the models, the predictive power scores of the attributes in the data were used, see table 3. The first iteration of each model used only one attribute, and each following version added one more. In the end, the attributes seen in table 6 were used.

Figure 12: Accuracy of each model as a function of the number of attributes used for training.

As seen in figure 12, the accuracy of each model version increased quickly in the beginning but slowed down, with only small increases at the end.

One thing to note about the attributes included in the final models is that both SkillnadMedellon and SkillnadMedianlon are included; they represent the difference in mean and median salary between the two groups being compared. Because of this they are strongly correlated and in many ways represent the same information. This might be one reason the last version of the models barely improves, since the models already had that information. If the models were to be simplified, the difference in median salary could be removed so that the models would not need as many attributes.

5.1.1 Boosted decision tree

Table 7

Boosted Decision Tree
Micro recall   0,804
Macro recall   0,701
Accuracy       0,804


The boosted decision tree algorithm achieved the best results of the algorithms tested, see table 7.

Figure 13: The confusion matrix for the boosted decision tree model.

The model also reached roughly 50% or higher accuracy for each individual class, see figure 13.

Table 8

Boosted Decision Tree Parameters

Learning rate   Min samples per leaf   Number of trees   Max leaves per tree   Accuracy
0,1             1                      100               8                     0,788538
0,1             10                     100               8                     0,784585
0,2             1                      20                8                     0,784585
0,4             10                     20                8                     0,784585
0,2             10                     20                8                     0,782609
0,4             50                     20                8                     0,782609
0,025           1                      500               8                     0,780632
0,05            1                      100               8                     0,780632
0,4             1                      100               2                     0,778656
0,2             1                      100               2                     0,778656

The final parameters for the model were taken from a parameter sweep in which every possible combination of parameters was tested to see which performed best. The top ten results from the sweep can be seen in table 8.


Looking over the results from the parameter sweep, it is hard to single out a parameter with higher importance than the others. This probably has to do with the parameters being correlated in how they affect the accuracy.

5.1.2 Random forest decision tree

Table 9

Random Forest Decision Tree
Micro recall   0,789
Macro recall   0,67
Accuracy       0,789

The random forest decision tree algorithm performed just slightly worse than the boosted decision tree algorithm, see table 9.

Figure 14: Confusion matrix for the random forest decision tree model.

On some labels, however, it performed significantly worse, see figure 14. For example, on Erfarenhet it lost almost 10% in accuracy compared to the boosted decision tree algorithm.

As with the boosted decision tree model the parameters were found using a parameter sweep over all the possible combinations, see table 10.

Table 10

Random Forest Decision Tree Parameters

Maximum depth   Min samples per leaf   Number of trees   Accuracy
16              1                      8                 0,747036
16              4                      32                0,743083
64              1                      1                 0,741107
16              4                      1                 0,73913
16              1                      32                0,73913
64              4                      32                0,737154
16              4                      8                 0,735178
64              4                      1                 0,733202
16              16                     32                0,72332

Given the parameter sweep, the most important parameter seems to be the maximum depth of the trees, since changing it decreases the accuracy the most; changing the maximum depth drops the accuracy by about 2%.

5.1.3 Logistic regression

Table 11

Logistic Regression
Micro recall   0,767
Macro recall   0,659
Accuracy       0,767

The worst performing algorithm was logistic regression, see table 11. Worth noting, however, is that even though it performed worse overall, it managed to beat both the other algorithms on the class Utökat ansvar, see figure 15, outperforming them on this class by more than 10%.


Figure 15: Confusion matrix for the logistic regression model.

Just as for the other algorithms, a parameter sweep was done to find the optimal parameters. This was not as necessary for logistic regression as for the others, since it has fewer parameters to tune. The full list of evaluated parameters can be found in table 12.

Table 12

Logistic Regression Parameters

Optimization tolerance   L2 regularization weight   Accuracy
0                        0,01                       0,756917
0,00001                  0,01                       0,756917
0,00001                  0,1                        0,747036
0                        0,1                        0,747036
0                        1                          0,725296

In the table it is easy to see that the L2 regularization weight is the parameter with the highest impact on the accuracy: as the parameter decreases, the accuracy increases.

5.2 Conclusions

Given the accuracies of the different models, the most promising model is the boosted decision tree, since it had the best accuracy. But it did not perform best on all classes, see figure 16; the logistic regression model performed best on the tag Utökat ansvar. Given this, if the project were continued, a possible avenue for improvement would be to build an ensemble model combining the strengths of the boosted decision tree with the strengths of logistic regression to improve the overall accuracy.


Figure 16: Accuracy per class for each model.

An interesting phenomenon can be seen in the confusion matrices of the three models; the clearest example is in the boosted decision tree matrix, see figure 13. When the model tries to predict the tag Erfarenhet, the second most predicted class is Utökat ansvar, and the opposite is true for the tag Utökat ansvar, where the second most predicted class is Erfarenhet. This indicates that the data coupled to these two tags is very similar and that the models have a hard time telling them apart. This makes sense considering what the two tags mean: Erfarenhet means that one group of workers has a higher salary because they have worked longer and gained more experience, while Utökat ansvar means that a group has a higher salary because of extra responsibility, something usually given to workers with more experience. Because both tags can be applied in similar situations there is no easy way to tell them apart, which is reflected in the confusion matrices. A solution might be to rework the tags, giving them more specific and non-overlapping use cases.

Another thing to keep in mind about the results is the unbalanced dataset the models were trained on. Solutions to this problem were explored at the beginning of the project. The first option was to cut samples to end up with a balanced dataset; this was not done, since it would have required discarding half of all samples. The other option considered was artificially augmenting the data with fake samples; this was also not done, because of the difficulty of creating realistic fake data.

The models have by far the highest correct prediction rate for the tag Alternativ arbetsmarknad, which makes sense since that tag makes up roughly half the samples in the dataset, making it easy for the models to predict. Normally this would be unwanted behavior, but in this case it is not necessarily a bad thing, since the data expected in the future should have the same distribution between the classes as this dataset.

Since the accuracy is not perfect, the model will not be able to completely replace the manual process of selecting the right tag. It might, however, be accurate enough for a suggestion system, where the model predicts a tag and asks the user for confirmation; if the tag is wrong, the system could fall back to the old approach of letting the user choose. Given that machine learning models are never better than the data they are trained on, it stands to reason that they should improve with more data. In time it might be possible to improve the models to the point where a fully automatic system could be implemented; this is, however, not guaranteed.


6 Discussion

6.1 Fulfilling of the project requirements

The goal of developing a model to predict the correct labels was fulfilled with a decent margin. The improvement from the 20% expected from random chance to 80.4% is significant enough that the project should be considered a success, with a possible follow-up.

6.2 Social and economic implications

Since the models developed in this project aim to explain real-world salary differences, it is extremely important to make sure they are not biased in any significant way, since this could have real-world consequences for the people whose income depends on the salary analysis.

With all machine learning models it is extremely important to be wary of introducing unwanted bias. The most common cause of bias in a machine learning model is the use of a biased dataset. For example, it has been shown that many facial recognition models have built-in biases based on skin color due to the datasets used. These biases can take many forms, either making a model worse at predicting certain features or giving the wrong prediction altogether. [10]

The most prevalent cause of a biased dataset is human bias introduced during the collection or cleaning of the data. During collection, the risk is that only a fraction of the possible values is collected, making the model biased. In the facial recognition case this might be because most of the collected faces the model was trained on were white, making it harder for the model to correctly classify darker faces.

In the cleaning process, data samples with incorrect or missing values are either removed or filled in to fit the required data format. This can introduce bias if the samples with missing features mostly belong to one group, making the dataset less diverse when they are removed, or if the method used to replace missing data does so in a way that introduces bias. For example, say that an attribute gender exists with the possible values male, female and non-binary. If this attribute is to be predicted, it is important that the dataset is correctly labeled. For a general attribute of this kind, the values male and female will probably constitute most of the dataset, with non-binary making up a minority. If the replacement strategy is to replace a missing value with the most common value, it might skew the model towards wrongly classifying non-binary people as male or female. This is a simple example that would be rather easy to find; most bias introduced in data is much harder to spot. It is important to remember that bias is not always introduced consciously. As a matter of fact, the hardest bias to get rid of is the bias introduced unconsciously.

The biggest risk for bias in the models developed in this project lies in the data collection. Since the data samples are tags set by the people using the salary software, there is a considerable risk of human bias creeping in. The bias of any individual person should not be able to skew the results in a meaningful way, but if a significant bias exists among the people doing the salary analyses, that constitutes a problem.

Since the data for this project comes from salary analyses in the Swedish public sector, where the distribution of manager positions is skewed with a female majority, there might be a bias [11]. This is a relatively easy discrepancy to find, but more may exist. To combat bias introduced by this and similar problems, one solution might be to let an expert go through a portion of the dataset to make sure it is correctly labeled before training any further models.

It is hard to gauge what the impact of bias in the training data might be. In the worst-case scenario it might lead to an unfair analysis, which in the long run might affect the employees' salaries negatively; however, the bias might just as well be small enough to have no statistically significant effect on the result of the analysis.

6.3 Further development potential

Given the relative success of the project, the logical next step would be to implement a test run of the models in the software. This could either be a hidden prediction test running behind the scenes, comparing the predicted tag against the one chosen by a human, which would let Sysarb measure the model's success in the wild before exposing it to users, or a suggestion service released directly to the users, giving them the chance to change the label if the prediction turns out to be wrong.

If Sysarb wanted to further improve the accuracy of the models, a couple of things could be tested. First, since no model was best across the board, some improvement could probably be gained from creating an ensemble model; for example, logistic regression could be used to predict Utökat ansvar while the boosted decision tree is used everywhere else. Another thing to test is a neural network and deep learning; this would require a bigger dataset to be effective but might be worth exploring in the future. Other algorithms might also be researched and evaluated.

The project is in and of itself a success, and a big step towards being able to automate the entire salary analysis process.


7 Reflection on own learning

7.1 Knowledge and understanding

Since the project focused so heavily on machine learning, and more specifically on classification tasks, I have gained a good understanding of the basics in this area. However, there is of course an ocean of information left in the field. The biggest part not looked into during this project is regression problems, which I would like to learn more about. If more time were spent on this project, the next step would probably have been to investigate neural networks and deep learning for classification tasks. This was not done because of the time limit, but also because the dataset was too small for deep learning to be effective.

Since I started from zero when it comes to knowledge of machine learning, I think I have absolutely succeeded in gaining knowledge in one of the topics of the project. The way experiments were used to find the best model should show some understanding of the scientific method. I am, however, aware that with the knowledge I have now, at the end of the project, the experiments could have been done more thoroughly, for example by testing more methods for feature selection. This is one thing I would have liked to explore further if more time had been available.

7.2 Skill and ability

At the beginning of the project there was a lot of talk about exactly what should be included, but in the end the final project turned out to be a nicely packaged concept without too many branching parts, which made it easier to have a clean end goal. This of course made it easy for me to constrain the research that needed to be done. In a way this hindered me, since I might not have explored everything I should have. Overall, though, I think I managed to complete the project in a scientific way.

When it comes to the actual writing of the report, a huge thanks must go out to my excellent supervisors, both at the school and at the company, who helped me greatly in formulating the information in a clear and understandable way. Personally I tend to ramble on, which is something I really have had to train away during this project. In the end I think the language and structure of the report, although not perfect, are far better than what I could manage at the beginning of the project.

The presentation of the project went, by my own account, quite well; the hardest part was trying to explain highly technical concepts in a way that can be taken in in a short time. I would have liked to add a short discussion of the social impacts of the project, but that had to be cut due to the time constraint on the presentation.

7.3 Evaluation ability

The project allowed me to be part of Sysarb's AI team, where I had the opportunity to discuss problems I encountered as well as listen to and learn from the team. I was also part of the dev team's morning meetings, updating them on my status and following the progress of development at Sysarb. Taking part in daily meetings was a routine I felt helped me a lot, especially in getting the day started, but also in how the meetings forced me to have a plan to present each day.

Since the project might have a big social impact if handled incorrectly, one of the easiest parts was to place it in a social context; this was also the part of the project that I enjoyed writing the most. If I were to do another similar project in machine learning it would


8 References

1. Burkov A. The Hundred-Page Machine Learning Book [Internet]. Andriy Burkov; 2019. Available from: https://books.google.se/books?id=0jbxwQEACAAJ

2. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: with Applications in R [Internet]. Springer New York; 2014. (Springer Texts in Statistics). Available from: https://books.google.se/books?id=at1bmAEACAAJ

3. Grus J. Data Science from Scratch [Internet]. O'Reilly Media; 2015. Available from: https://books.google.se/books?id=7iLNrQEACAAJ

4. Bishop CM. Pattern Recognition and Machine Learning [Internet]. Springer; 2006. (Information Science and Statistics). Available from: https://books.google.se/books?id=kTNoQgAACAAJ

5. Kelleher JD, Namee BM, D'Arcy A. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies [Internet]. MIT Press; 2015. (The MIT Press). Available from: https://books.google.se/books?id=3EtQCgAAQBAJ

6. Amrane M, Oukid S, Gagaoua I, Ensarİ T. Breast cancer classification using machine learning. In: 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT). 2018. p. 1–4.

7. Liu X, Andris C, Huang Z, Rahimi S. Inside 50,000 living rooms: an assessment of global residential ornamentation using transfer learning. EPJ Data Sci. 2019 Dec;8(1):1–18.

8. 8080labs. 8080labs/ppscore [Internet]. 2021 [cited 2021 Apr 21]. Available from: https://github.com/8080labs/ppscore

9. scikit-learn: machine learning in Python — scikit-learn 0.24.2 documentation [Internet]. [cited 2021 May 4]. Available from: https://scikit-learn.org/stable/

10. Sengupta E, Garg D, Choudhury T, Aggarwal A. Techniques to Elimenate Human Bias in Machine Learning. In: 2018 International Conference on System Modeling Advancement in Research Trends (SMART). 2018. p. 226–30.

11. Andel kvinnor och män i chefspositioner [Internet]. Statistiska Centralbyrån. [cited 2021 May 4]. Available from: http://www.scb.se/hitta-statistik/statistik-efter-amne/arbetsmarknad/sysselsattning-forvarvsarbete-och-arbetstider/yrkesregistret-med-yrkesstatistik/pong/tabell-och-diagram/andel-kvinnor-och-man-i-chefspositioner/
