
September 2020

Comparison of methods applied to job matching based on soft skills

Emilia Elm

Institutionen för informationsteknologi


Comparison of methods applied to job matching based on soft skills

Emilia Elm

The expression "Hire for attitude, train for skills" is used as a motive to create a matching program in which the soft qualities of companies and job seekers are measured and compared against each other. Are there better or worse methods for this purpose, and how do they compare with each other?

By associating soft qualities with companies and job seekers, it is possible to generate a value for how well they match. Therefore, data has been collected on several companies and job seekers. Their associated qualities are then translated into numerical vectors that can be used for matching, where vectors closer together are more similar than vectors separated by greater distances. To analyze and compare the qualities, several methods have been used and compared, followed by a discussion of their suitability. One consequence of the lack of a proper standard for presenting the qualities of companies and job seekers is that the data is messy and varied. An expected conclusion from the results is that the most flexible method is the one that generates the most accurate results.

Examiner: Lars-Åke Nordén   Subject reviewer: Georgios Fakas   Supervisor: Andreas Samuelsson


With the expression "Hire for attitude, train for skills" as a guiding motive, a matching program is created in which the soft qualities of companies and job seekers are measured and compared against each other. Are there better and worse methods for this purpose, and how do they stand in a comparison against each other? By linking soft qualities to companies and job seekers, it becomes possible to generate a value for how well they match. Therefore, data has been collected on a number of companies and job seekers. Their associated qualities are then translated into numerical vectors that can be used for matching, where vectors closer to each other are more similar than vectors with greater distances. When it comes to analyzing and comparing the qualities, several methods have been used and compared, with a subsequent discussion of their suitability. A consequence of the lack of a proper standard for presenting the qualities of companies and job seekers is that the data is messy and varied. An expected conclusion from the results is that the most flexible method is the one that generates the most accurate results.


I want to thank Fred Isaksson and my supervisor Andreas Samuelsson at Ava for their enthusiastic support and encouragement throughout the project and for allowing me to pursue my Master's thesis with them. I also want to thank my reviewer Georgios Fakas for his guidance and expertise, which helped me during this project.


Contents

1 Introduction
2 Background & Related Work
  2.1 One hot encoding
  2.2 Similarity measures
  2.3 Jaccard similarity coefficient
  2.4 Cosine similarity
  2.5 Word2Vec
    2.5.1 Continuous Skip-Gram
    2.5.2 Continuous Bag of Words
  2.6 Summation of Word2Vec
  2.7 FastText
  2.8 Related Work
  2.9 Delimitations
3 Method
  3.1 Gather company descriptions
  3.2 Pre-processing descriptions
  3.3 Algorithm
    3.3.1 Jaccard & Cosine similarity
    3.3.2 FastText
4 Results and Discussion
  4.1 Similarity matrix results
  4.2 Best match
  4.3 Insights best match metrics
  4.4 Impact of additional soft skills
  4.5 Impact of additional company data
5 Conclusions
6 Future work

List of Figures

1  An example of a neural network with one hidden layer.
2  A plot of the words king, queen, man and woman which explains the corresponding arithmetic.
3  Example of company description data saved in one line of the text file in JSON format.
4  An example draft from the formaga file containing soft skill words.
5  Shows how company skills from job seeker data are added to description data in the column Skills.
6  Displays an example of the DataFrame holding Companies with their Descriptions, Headline/Name and Skills.
7  The similarity matrix result for job seeker 0-4 and seven companies when calculating Jaccard similarity.
8  The similarity matrix result for job seeker 0-4 and seven companies when calculating cosine similarity.
9  The similarity matrix result for job seeker 0-4 and seven companies when calculating similarity with FastText.
10 Best matches for job seeker 0-4 by Jaccard calculations using historical data.
12 Previous best matching companies for FastText with historical data, together with highest similarity to the previous best matched job seeker.
13 New best matching companies for FastText with historical data, together with highest similarity to the previous best matched job seeker.
14 Company and job seeker best match results from the methods Jaccard, Cosine and FastText. An X in their columns indicates they produced the resulting company-job seeker pair.

List of Tables

1  List of example queries of soft skills.
2  One hot encoded vector matrix in alphabetic order.
3  A summary of Jaccard similarities calculated for queries in Table 1.
4  One hot encoded vectors v and u for query A and B.
5  One hot encoded vectors v2 and u2 for query A and C.
6  One hot encoded vectors v3 and u3 for query A and D.
7  Jaccard and Cosine methods applied on soft skill queries in Table 1.
8  Jaccard, Cosine and Word2Vec similarities applied on soft skill queries from Table 1.
9  Jaccard, Cosine, Word2Vec and FastText methods applied on soft skill queries from Table 1.
10 Questions for the job seeker survey.
11 FastText similarity between misspelled and correctly spelled versions of the Swedish word administrativ.
12 All example equation results in a summary matrix.
13 An example of a match percentage matrix with job seeker 0-4 and three companies Ampillo, Pfc Clinic and Bygg VVS El Stockholm.
14 Soft skills of the first five job seekers from the similarity matrices.
15 Soft skills of the first seven companies from the similarity matrices.
16 Best company match for job seeker 0-4 with Jaccard; job seekers' soft skills are displayed in Table 14.
17 Best company match for job seeker 0-4 with Cosine; job seekers' soft skills are displayed in Table 14.
18 Best company match for job seeker 0-4 with FastText; job seekers' soft skills are displayed in Table 14.
19 Before adding soft skills to the companies; numbers in bold are the best match for the job seeker among the listed companies.
20 After adding soft skills to the companies; bold numbers are the best match between job seeker and listed companies. Underlined numbers are the best matches in Table 19.
21 Three cases explaining the events occurring in Table 20.
22 Four companies with their original soft skills and, in bold, the additional soft skills.

1 Introduction

At the company Ava, a digital tool is being developed which provides study and career advice to students in compulsory school and high school in Sweden. The tool includes features such as proposing education paths towards a certain job or assisting a job seeker in finding relevant positions and companies.

Most of Ava's clients are people far from the job market, who may have been sick long term or have recently arrived in Sweden. These clients may lack certain desired technical skills but can still be an asset to a company because of their soft skills. Once hired, they can learn the technical skills at the company, following the principle "hire for attitude, train for skill".

With this principle as a viewpoint, Ava sees a need for a company matchmaking system aimed at job seekers, in which the desired soft skills are considered rather than the usually clearly listed technical skills. A matchmaking system with these features would further support this group, which has difficulties finding a job. Considering only soft skills opens up the possibility of hiring a person based on the soft skills desired by the company as a whole, rather than hiring for a specific position or job advert. Therefore both descriptions of the company and their job adverts can be useful for deciding which soft skills are desired at the workplace. Are there methods suited to matching job seekers and companies based on their soft skills, and how do these methods compare with each other?

The Swedish Public Employment Service (Arbetsförmedlingen), from here on denoted as SPES, has listed over six million job adverts between 2016 and 2019 [5] and has thousands of company descriptions, making it an ideal data source for the matchmaking system. The goal is to create a system which matches a job seeker with one or more companies based on their soft skills. The data gathering is concentrated mainly on SPES's website, and the words that are considered soft skills have to be included in a specific file called Formaga, created by JobTech Development. As the language on SPES's website can vary but the job seekers are Swedish, the language used for the soft skills is Swedish. Extracting the soft skills from the descriptions and job adverts for the companies, and gathering job seekers' soft skills from a survey, results in comparable queries. Examples of these queries can be viewed in Table 1.


A  [fast, fun, playful, committed]
B  [quick, lively, frisky, devoted]
C  [quick, fun, fun, committed]
D  [administrative, comited, creative]

Table 1  List of example queries of soft skills

2 Background & Related Work

To compare the soft skills of a job seeker and a company, a computational comparison needs to take place. In order to apply a computational method to a text, it can be necessary to convert the human language into something the computer comprehends. One way of doing so is to use numerical vectors, where each vector represents a word or a phrase in the vocabulary; this is called word embedding.

There are several different techniques to choose from when implementing a word embedding method, and they can vary in accuracy depending on the task. A word embedding program is usually applied to a corpus, which essentially is a long list of sentences in a certain language. The more coverage the corpus has of a language, the larger its vocabulary, which increases the accuracy of the word embedding. Vocabulary refers to the collection of unique words existing in a given corpus. More words and context are thus available to the method, which yields a better result.

2.1 One hot encoding

When conducting word embedding there are several different methods that can be applied; one of them is one hot encoding. One hot encoding is one of the more basic methods for creating vector representations. Essentially, every word in a chosen corpus corresponds to one position in a vector, and the dimension of the vector equals the number of unique words (the vocabulary) in the corpus. For simplicity, the first position of the vector corresponds to the first word in the vocabulary in alphabetic order.

As an example, imagine a corpus consisting of only the words in query A in Table 1; they generate the one hot encoded vectors seen in Table 2.

word        committed  fast  fun  playful
committed   1          0     0    0
fast        0          1     0    0
fun         0          0     1    0
playful     0          0     0    1

Table 2  One hot encoded vector matrix in alphabetic order.

The vector representing the word fast is thus the vector with a "1" at fast's position. This is a rather straightforward method and quite easy to implement to obtain word representations, but it has its faults.

For example, it is not possible to represent relations between words. The words "flexible" and "adaptable" have similar meanings and are synonyms of one another, but their target "1"s are not near each other in their vectors, considering alphabetic order and a corpus containing all words in the English language. Another problematic aspect is sparsity: all the redundant zeros use a lot of space, which is not efficient. There are other representations which solve these issues, but at the cost of higher complexity.
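As a minimal sketch of the encoding described above, assuming the corpus is simply a list of words and the vocabulary is ordered alphabetically (the function name is illustrative, not taken from the project code):

    from typing import Dict, List

    def one_hot_vectors(corpus_words: List[str]) -> Dict[str, List[int]]:
        """Build a one hot vector for each unique word, ordered alphabetically."""
        vocabulary = sorted(set(corpus_words))
        vectors = {}
        for position, word in enumerate(vocabulary):
            vector = [0] * len(vocabulary)
            vector[position] = 1
            vectors[word] = vector
        return vectors

    # Query A from Table 1 used as the whole corpus
    print(one_hot_vectors(["fast", "fun", "playful", "committed"]))
    # {'committed': [1, 0, 0, 0], 'fast': [0, 1, 0, 0],
    #  'fun': [0, 0, 1, 0], 'playful': [0, 0, 0, 1]}

The output reproduces the rows of Table 2.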

2.2 Similarity measures

For the example calculations of the two similarity functions below, the queries in Table 1 are used. For the sake of these examples, query A can be viewed as a job seeker's soft skills and B-D as companies' soft skills.

2.3 Jaccard similarity coefficient

The Jaccard similarity coefficient, also known as the Jaccard index, is a method for measuring the similarity or diversity of two sets. The similarity of two sets is obtained by dividing the size of their intersection by the size of their union. The general equation for the Jaccard index is shown in Equation 1.

J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}    (1)

The output is a number between 0 and 1, corresponding to a percentage of similarity, where 0 stands for no similarity and 1 for 100% similarity, meaning the sets are equal. As an example, consider the queries in Table 1. When obtaining the Jaccard index for query A and query B, the similarity measure results in the value 0, as no entities in the queries are shared; the calculation steps can be seen in Equation 2.

J(A, B) = \frac{|\{fast, fun, playful, committed\} \cap \{quick, lively, frisky, devoted\}|}{|\{fast, fun, playful, committed\} \cup \{quick, lively, frisky, devoted\}|}
        = \frac{|\{\}|}{|\{fast, fun, playful, committed, quick, lively, frisky, devoted\}|}
        = \frac{0}{8} = 0    (2)

Next, consider the Jaccard index between query A and C from the example queries in Table 1. It yields the result 0.4, as the queries share two elements and have five unique elements in total; see Equation 3 for the calculations.

J(A, C) = \frac{|\{fast, fun, playful, committed\} \cap \{quick, fun, fun, committed\}|}{|\{fast, fun, playful, committed\} \cup \{quick, fun, fun, committed\}|}
        = \frac{|\{fun, committed\}|}{|\{fast, fun, playful, committed, quick\}|}
        = \frac{2}{5} = 0.4    (3)

Finally, consider the Jaccard index between query A and D from the example queries in Table 1. The calculation again yields 0, as the queries do not share any element; see the calculations in Equation 4.

J(A, D) = \frac{|\{fast, fun, playful, committed\} \cap \{administrative, creative, comited\}|}{|\{fast, fun, playful, committed\} \cup \{administrative, creative, comited\}|}
        = \frac{|\{\}|}{|\{fast, fun, playful, committed, administrative, creative, comited\}|}
        = \frac{0}{7} = 0    (4)

A summary of these calculations is displayed in Table 3.

Method    A-B   A-C   A-D
Jaccard   0     0.4   0

Table 3  A summary of Jaccard similarities calculated for queries in Table 1.

2.4 Cosine similarity

Cosine similarity is also a measure of similarity, specifically of the correlation between two non-zero vectors. The dot product of the two vectors divided by the product of their magnitudes yields the cosine of the angle between them, which is the cosine similarity. The general equation can be seen in Equation 5.

Cosine(v, u) = \frac{v \cdot u}{\|v\| \, \|u\|}    (5)

For the non-negative vectors used here, the result of the cosine similarity is bounded between 0 and 1, which also corresponds to a percentage of 0-100%, where 1 stands for identical vectors and 0 for no similarity at all. In order to calculate cosine similarity for the queries in Table 1, they first need to be converted from queries of words into vectors representing the words.

Word        v   u
committed   1   0
devoted     0   1
fast        1   0
frisky      0   1
fun         1   0
lively      0   1
playful     1   0
quick       0   1

Table 4  One hot encoded vectors v and u for query A and B.

Consider the two queries A and B as the complete corpus for conducting one hot encoded word embedding. As mentioned in Section 2.1, every word has its own vector with a significant "1" at the word's position in the corpus, but here a vector represents a whole query of words and will therefore contain several "1"s. More precisely, the entries are not restricted to "1"s: a position holds the number of times the corresponding word occurs in the query. This yields the two vectors v, representing query A, and u, representing query B, seen in Table 4.

When calculating the cosine similarity of these two vectors, the result equals 0, indicating that the vectors have no similarity. The calculation steps for the example vectors of query A and B are shown in Equation 6. Note that the zero terms in the denominator on the second line of the equation are omitted.

Cosine(v, u) = \frac{(1,0,1,0,1,0,1,0) \cdot (0,1,0,1,0,1,0,1)}{\|(1,0,1,0,1,0,1,0)\| \, \|(0,1,0,1,0,1,0,1)\|}
             = \frac{1 \cdot 0 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 1}{\sqrt{1^2+1^2+1^2+1^2} \cdot \sqrt{1^2+1^2+1^2+1^2}}
             = \frac{0}{2 \cdot 2} = 0    (6)

Applying the same method to create vectors for the two queries A and C from Table 1 yields the vectors seen in Table 5.

Word        v2   u2
committed   1    1
fast        1    0
fun         1    2
playful     1    0
quick       0    1

Table 5  One hot encoded vectors v2 and u2 for query A and C.

The cosine similarity for the two new vectors v2, representing query A, and u2, representing query C, yields the approximate result 0.61, implying that the two queries share several elements and are a 61% match.

Cosine(v2, u2) = \frac{(1,1,1,1,0) \cdot (1,0,2,0,1)}{\|(1,1,1,1,0)\| \, \|(1,0,2,0,1)\|}
               = \frac{1 \cdot 1 + 1 \cdot 0 + 1 \cdot 2 + 1 \cdot 0 + 0 \cdot 1}{\sqrt{1^2+1^2+1^2+1^2} \cdot \sqrt{1^2+2^2+1^2}}
               = \frac{3}{2\sqrt{6}} \approx 0.61    (7)

Creating vectors again, but for queries A and D, yields the vectors v3 for query A and u3 for query D. The vectors can be seen in Table 6.

Word            v3   u3
administrative  0    1
comited         0    1
committed       1    0
creative        0    1
fast            1    0
fun             1    0
playful         1    0

Table 6  One hot encoded vectors v3 and u3 for query A and D.

The cosine similarity for the two vectors v3, representing query A, and u3, representing query D, yields the result 0, as the two queries again do not share any elements. See the calculation steps in Equation 8; notice that the zero terms in the denominator on the second line are omitted.

Cosine(v3, u3) = \frac{(0,0,1,0,1,1,1) \cdot (1,1,0,1,0,0,0)}{\|(0,0,1,0,1,1,1)\| \, \|(1,1,0,1,0,0,0)\|}
               = \frac{0 \cdot 1 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 0 + 1 \cdot 0 + 1 \cdot 0}{\sqrt{1^2+1^2+1^2+1^2} \cdot \sqrt{1^2+1^2+1^2}}
               = \frac{0}{2\sqrt{3}} = 0    (8)

As the example queries used here are the same as in Section 2.3, the results of the two measures can be compared.

Method    A-B   A-C    A-D
Jaccard   0     0.4    0
Cosine    0     0.61   0

Table 7  Jaccard and Cosine methods applied on soft skill queries in Table 1.

The reason cosine similarity yields 0.61 for query A and C while the Jaccard index yields 0.4 is that cosine similarity considers the closeness of the vectors, not just the number of identical matches between vector elements. The Jaccard index, in contrast, considers the unique identical words relative to the total number of unique words and therefore ignores the second "fun" in query C, which is taken into account when calculating cosine similarity. Neither the Jaccard index nor cosine similarity considers synonyms, and both therefore yield 0 when calculating the similarity of queries A and B. For queries A and D there are no identical soft skills, so the result should be zero for both methods. A summary of both Jaccard and cosine similarities calculated from the queries in Table 1 is displayed in Table 7.
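To make Equation 5 concrete, the following is a minimal sketch that builds the word-count vectors from two queries and computes their cosine similarity; the helper name is illustrative and not taken from the project code:

    import math
    from collections import Counter

    def cosine_similarity(query_x, query_y):
        """Cosine similarity between the word-count vectors of two queries."""
        counts_x, counts_y = Counter(query_x), Counter(query_y)
        vocabulary = sorted(set(query_x) | set(query_y))
        v = [counts_x[w] for w in vocabulary]
        u = [counts_y[w] for w in vocabulary]
        dot = sum(a * b for a, b in zip(v, u))
        norm_v = math.sqrt(sum(a * a for a in v))
        norm_u = math.sqrt(sum(b * b for b in u))
        return dot / (norm_v * norm_u)

    A = ["fast", "fun", "playful", "committed"]
    C = ["quick", "fun", "fun", "committed"]
    print(round(cosine_similarity(A, C), 2))  # 0.61, matching Equation 7

Unlike the Jaccard sketch above, the Counter keeps the duplicate "fun" in query C, which is why the result is higher.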

2.5 Word2Vec

Word2Vec [17] is a collective name for word embedding models which take the context into account when creating vectors, solving one of the problems with the one hot encoding method described in Section 2.1. Word2Vec is essentially a neural network with a single hidden layer; see Figure 1 for an example. This means there are two weight matrices, an input-hidden matrix and a hidden-output matrix. There is no activation function in the hidden layer, but there is one activation function in the output layer. During the training process the network takes one hot encoded words as input and outputs a probability vector. The goal is for the probability vector to match the one hot encoded vectors of the words associated with the input word(s). The neural network is trained for a specific task, outputting the correct target/adjacent word vectors, but it is never actually used for this task, a common trick in machine learning. Instead, the real purpose of the training is to obtain the weights of the hidden layer, which become the vector representations of the words, instead of one hot encodings. For a word to receive a word embedding it needs to satisfy a frequency threshold, i.e. pass through training enough times to get a reasonable embedding.

Figure 1 An example of a neural network with one hidden layer.

The vector of a specified word is created using the vectors of adjacent words. Each word passes through the shallow, two-layer neural network, and the hidden layer encodes the vector. Where one hot encoding yields vectors with a dimension equal to the vocabulary size, Word2Vec decreases the dimension to the length of the hidden layer matrix, set beforehand to a number between 100 and 300. The word vectors from Word2Vec are therefore more space efficient, as a corpus for the English language would have a vocabulary size larger than 100 000.

Word relationships are preserved in the Word2Vec vector representation, and this enables arithmetic comparison of relationships of the form "word A relates to word B as word C relates to word D". Consider the words king, man, queen and woman: the relationship between king and man is the same as the relationship between queen and woman. Both king and queen are regents; the difference lies in the gender. As this is preserved in the numerical vectors, a relationship comparison corresponds to basic arithmetic, in this case "King - Man + Woman = Queen"; a plot of these words can be seen in Figure 2.

Word2Vec comes in two flavours: Continuous Bag of Words, from here on denoted as CBOW, and Continuous Skip-Gram, from here on denoted as SG.

Figure 2  A plot of the words king, queen, man and woman which explains the corresponding arithmetic.

2.5.1 Continuous Skip-Gram

A neural network based on the SG model receives a target word from a sentence in the corpus. Then, at random, it chooses a number of adjacent words based on a specified window size and outputs the probability of the occurrence of those adjacent words given the input word.

For example, assume the target word "fun" and a small corpus built from one sentence, query A in Table 1: "fast fun playful committed". Also assume a window size equal to 1; then the one hot encoded words "fast" and "playful" would be the adjacent words corresponding to the output probabilities of the neural network. The probabilities are then compared against the adjacent words' one hot encoded vectors to retrieve the error and update the weights in the network through backpropagation accordingly.

2.5.2 Continuous Bag of Words

CBOW is quite similar to the Skip-Gram algorithm but with one significant difference: the architecture is reversed. In Skip-Gram the target word is the input and a predicted context is the output, whereas in CBOW the context words are the input and the predicted target word is the output. This means that the neural network receives adjacent words of the target word, from all sentences containing the target word, as input. How many words are fed to the network depends on the window size setting but also on the number of occurrences of the target word in the corpus.

As an example of the process, again assume the target word "fun" and a small corpus built from query A in Table 1, "fast, fun, playful, committed". Also assume a window size equal to 1; then the one hot encoded words "fast" and "playful" would be the input to the neural network. The input vectors are multiplied with the input-hidden weight matrix, generating two weighted vectors. In the hidden layer the average of these weighted vectors is calculated and then multiplied with the hidden-output weight matrix to produce an output vector. The output vector is then processed through an activation function which yields a probability vector. This probability vector has the same shape as a one hot encoded vector, but instead of being populated with either 0 or 1 it holds values ranging from 0 to 1. The highest value in the probability vector should ideally correspond to the position of the "1" in the one hot encoded vector for "fun". To achieve this ideal output, an error between the two vectors is calculated and used to adjust the weight matrices in the neural network.

2.6 Summation of Word2Vec

To calculate the similarity of the example queries in Table 1, the mean of the corresponding word vectors is calculated for each query. The word vectors used here are collected from an example Word2Vec model. If a word is not present in the model, it is omitted from the calculation. When two queries are each represented by their mean vector, the similarity is calculated using cosine similarity. The result is displayed together with the results from Jaccard and from cosine alone in Table 8. Where Jaccard and plain cosine yield zero similarity, Word2Vec yields a non-zero result. However, the similarity between query A and C is lower when using Word2Vec compared to plain cosine. This can be because not all words in the queries are present in the Word2Vec model, which can lower the similarity result. Word2Vec is indeed useful and an improvement over only using Jaccard or cosine to calculate similarity, but it still has flaws. By assigning a discrete vector to every word, the morphology of the words is ignored.

Method     A-B    A-C    A-D
Jaccard    0      0.4    0
Cosine     0      0.61   0
Word2Vec   0.13   0.44   0.2

Table 8  Jaccard, Cosine and Word2Vec similarities applied on soft skill queries from Table 1.

This is a limitation which distinctly affects languages with a significant amount of rare words and large vocabularies. Also, there is no way to represent words that do not occur in the corpus often enough to be assigned a vector in the model, which puts higher requirements on the data.
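A minimal sketch of the summation step described above, assuming a trained gensim Word2Vec model is available as model and that out-of-vocabulary words are simply skipped (the helper name is illustrative):

    import numpy as np

    def query_similarity(model, query_x, query_y):
        """Cosine similarity between the mean word vectors of two queries."""
        def mean_vector(query):
            # Assumes at least one word of the query is present in the model
            vectors = [model.wv[w] for w in query if w in model.wv]
            return np.mean(vectors, axis=0)

        v, u = mean_vector(query_x), mean_vector(query_y)
        return float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))

    # gensim also provides an equivalent built-in helper:
    # model.wv.n_similarity(query_x, query_y)

This is the same mean-of-vectors construction that is later replaced by FastText's built-in similarity function in Section 3.3.2.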

2.7 FastText

FastText [3][10][11] is another library for text representation and classification, which can be seen as an extension of the Word2Vec library, specifically of the Skip-Gram algorithm. The difference between the two lies in how the words are viewed. Where Word2Vec uses a distinct vector for each word, FastText uses distinct vectors for the sub-words [3] of the words in the vocabulary. The sum of these vectors then becomes the vector representation of a word, just as the sum of the sub-words equals the word. A word is seen as a bag of character n-grams, where an n-gram is a sequence of n characters.

Consider the word fast: it has the bi-grams <f, fa, as, st and t>. The symbols < and > serve to distinguish whole words from n-grams and further allow capturing the meaning of prefixes and suffixes. For example, as is a sub-word of the word fast, which has a different vector from the n-gram <as> corresponding to the word as; this distinction is visible thanks to the use of the symbols. As the bi-grams are all sub-words of fast, the sum of their vectors results in the word embedding vector representing fast.
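As a small sketch of how the character n-grams above can be produced, with < and > as boundary symbols (the function name is illustrative only):

    def character_ngrams(word, n=2):
        """Character n-grams of a word, padded with boundary symbols < and >."""
        padded = f"<{word}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(character_ngrams("fast"))  # ['<f', 'fa', 'as', 'st', 't>']

The output matches the bi-grams of fast listed in the paragraph above.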

After training the neural network, word embedding vectors for all n-grams of the vocabulary are available. Compare this to Word2Vec, which ends up with vectors only for complete words. FastText's algorithm enables the model to properly represent rare words in the corpus, as there is a high probability that their n-grams also appear in other, more common words. For the same reason, it can also properly represent words which have not been part of the training data.

Method     A-B    A-C    A-D
Jaccard    0      0.4    0
Cosine     0      0.61   0
Word2Vec   0.13   0.44   0.2
FastText   0.30   0.72   0.35

Table 9  Jaccard, Cosine, Word2Vec and FastText methods applied on soft skill queries from Table 1.

Calculating the similarity between the queries in Table 1 using the FastText model and cosine similarity yields the results displayed in Table 9. For comparison, all the previous calculations are also displayed in the same table. The table shows that FastText yields a higher similarity percentage than all other methods. It is especially interesting to compare with Word2Vec, as both render word embedding vectors, but FastText yields a more reliable result than Word2Vec because all words can be represented and accounted for, including misspelled words such as "comited" in query D, which is not present in Word2Vec's model but is in FastText's.

2.8 Related Work

SkyHive [9] has a skill matching platform, as opposed to a job-based one, that uses AI and machine learning to match job seekers with possible employers based on both hard and soft skills. Apricot [15] also has a job matching tool based on hard and soft skills, where soft skills are gathered and evaluated first and an assessment of hard skills is then made to find the perfect match. Unfortunately, neither SkyHive nor Apricot reveals the mechanisms behind their tools, but their approach to evaluating skills resembles this project and what Ava aims to do.

In this project all soft skills are extracted and then word embedded through a library such as Word2Vec. However, when considering soft skills in a job advert or a company description, not all words considered soft skills are actually desired of the job seeker. As mentioned in the conference paper "Learning Representations for Soft Skill Matching" [22], extracting all soft skills also risks including so-called false positives. For example, the soft skill "friendly" may be used to describe a completely different entity than a desired soft skill of a future employee; in this project it would still be retrieved as a soft skill to match job seekers with companies and would therefore be a false positive. The project described in the conference paper goes a step further to ensure that the words extracted as soft skills are sought by the company in a possible employee. The common ground between these two projects is the use of Word2Vec word embeddings and the extraction of soft skills.

In order to meet a labor market with more frequent job transitions and changed requirements for workers, the Public Employment Services (PES) in the European Union are moving towards approaches that aim to provide more personalized services [2]. Soft skill based profiles and matching tools are part of this approach. However, there is a need for more analysis of soft skill matching tools, which is where this project can contribute: a further understanding of which underlying methods are suited for soft skill matching and how they differ is desired.

2.9 Delimitations

This project has endless possibilities, but given the time frame it is necessary to narrow it down. The main data will be gathered from various sites belonging to or strongly connected to SPES. The word embedding will be performed on the Swedish language only, and therefore only Swedish data will be used. To be defined as a soft skill, a word has to be part of the soft skill terms listed in the file Formaga [6] created by JobTech Development [7].


3 Method

The project consists of three main steps: first, gathering all the data from the main data source, SPES's web page; second, pre-processing all the data so that it is structured and manageable; and third, implementing and tweaking the matching algorithm. While this structure was the aim of the project, the characteristics of the data and the evaluations pursued required revisiting earlier steps.

3.1 Gather company descriptions

The main data source, the company descriptions, is found on SPES's web site. Web scraping became the method of choice to fetch this data because there was no direct access through an API. Web scraping refers to tools that automatically or manually access web pages. When done automatically, it is often implemented with a bot or web crawler. The web crawler systematically traverses the internet, through a browser or the HTTP protocol, and copies data down to a local device [18]. Several web scraping tools exist, but the framework Scrapy [14] was chosen.

Scrapy uses asynchronous calls when retrieving data, which makes the web crawling efficient; this is important here as there are more than 4000 pages with possibly relevant data. All the pages with company descriptions have URLs of the pattern http://www.arbetsformedlingen.se/foretagsprofil/ followed by the id number of the specific company description. As an example, the company description page of SPES itself has the id number three, so the URL http://www.arbetsformedlingen.se/foretagsprofil/3 leads to a description of SPES; note that the URL ends with the number three, i.e. "3" is the company id of SPES. To crawl all the company descriptions, all the company description links first have to be retrieved. These links are in an XML file [1], also available on SPES's web site, so that page was the first one to be crawled. The gathered URLs are then stored in a regular text file, where one line corresponds to one URL. The URLs are imported as a list into another crawler, which retrieves the company descriptions and names.

Most web pages are, however, more complicated than the previously mentioned XML file, which consists of only regular HTML tags; the company description pages include dynamically loaded JavaScript elements. The Scrapy crawler cannot find these elements because it does not get access to the dynamically loaded content. As the actual description text of the company is within JavaScript tags, it is necessary to overcome this obstacle. The solution is a Scrapy plugin called Scrapy-Splash [13], which is recommended by the Scrapy documentation [4] for pre-rendering JavaScript. When the JavaScript is pre-rendered, all the content on the website is accessible to the Scrapy crawler, and the descriptions can therefore be downloaded.

When accessing several pages on the same web domain, as done here, there is a risk that the web site will shut down the crawler's access as a safety measure. In order to prevent this, the crawler is limited in how many URL requests it can make before it needs to rest. After a certain time, the crawler can continue with another limited batch of requests. The downside of slowing down the data retrieval by letting the crawler rest is small compared to the risk of being denied access to the web site and having to manually restart the crawler.
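A minimal sketch of how such a crawler could be set up with Scrapy; the spider name, URL file and CSS selector are illustrative assumptions rather than the project's actual code, while DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED are standard Scrapy settings for limiting the request rate:

    import scrapy

    class CompanyDescriptionSpider(scrapy.Spider):
        name = "company_descriptions"
        # Throttle requests so the site does not block the crawler
        custom_settings = {
            "DOWNLOAD_DELAY": 1.0,
            "AUTOTHROTTLE_ENABLED": True,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        }

        def start_requests(self):
            # One company profile URL per line, gathered from the XML file
            with open("company_urls.txt") as url_file:
                for url in url_file:
                    yield scrapy.Request(url.strip(), callback=self.parse)

        def parse(self, response):
            # Hypothetical selector; the real description sits inside
            # JavaScript-rendered content and therefore needs Scrapy-Splash
            yield {
                "Number": int(response.url.rstrip("/").split("/")[-1]),
                "Description": " ".join(response.css("p::text").getall()),
            }

In the project the requests would additionally go through the Scrapy-Splash plugin so that the JavaScript-rendered description text is available to the selector.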

3.2 Pre-processing descriptions

After fetching the company descriptions with Scrapy, they are organized and put into a text file, with one row of text per company. Each row in the file holds the corresponding id number, found at the end of the URL, together with the accompanying description. To simplify the upcoming use of the data, it is saved in the same format as a JSON object, see Figure 3 for example data, so that each id number is clearly bound to the correct description.

{
    "Number": 12345,
    "Description": "This is a description example"
}

Figure 3 Example of company description data saved in one line of the text file in JSON format.

To retrieve the complete description text from the website, all content within the HTML description tag needs to be fetched. Irrelevant text, such as HTML tags and symbols within the class, is therefore also part of the retrieved description text and needs to be removed before storing the data in the text file. The descriptions are therefore passed to a parser within the crawler, which removes the unwanted characters.

To further utilize the data, it was loaded into a Jupyter Notebook [12], a web based application providing a computational environment. The reason for using a Jupyter Notebook is that the code can be modified and rerun without needing to reload the data into the program, a significant benefit with large files as they can take a while to load.
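A minimal sketch of this loading step, including the second filtering of lines that cannot be parsed as JSON (described below); the file name descriptions.txt is an assumption for illustration:

    import json
    import pandas as pd

    rows = []
    with open("descriptions.txt", encoding="utf-8") as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))  # {"Number": ..., "Description": ...}
            except json.JSONDecodeError:
                continue  # skip lines with problematic characters

    companies = pd.DataFrame(rows, columns=["Number", "Description"])
    print(companies.head())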

Although initial filtering already occurred when crawling the pages, problematic characters remained. Some of these characters prevented the lines from being loaded as JSON objects, and a second filtering of characters was therefore necessary. Having imported the data as JSON objects, it is a straightforward process to convert the data to a Pandas [21][16] DataFrame. A DataFrame is essentially a matrix built up from columns and rows; converting the data to a DataFrame here means sorting it into appropriate columns and rows. Each row represents a company, with the company name, company id and the description separated into different columns, which makes the data simple to overview and manage. With a good overview of the data, it became clear that it was both rough and messy. Some descriptions include personal reflections about the company rather than a description; SPES's web site probably lacks, or at least has lacked, verification checks of the company profiles. A variety of languages also occurs: English, Swedish, or sometimes both. This inconsistency is something to address during the pre-processing of the data, but how to categorize a description as a personal reflection or an actual description is difficult. By removing descriptions containing swearwords, at least the most unprofessional ones were omitted. The Swedish descriptions seem more relevant to use than the English ones, as Swedish job seekers are the target group; therefore, descriptions consisting of only English text were filtered out. Apart from the descriptions, the actual soft skills to consider needed to be organized. The choice fell on the list Formaga [6] created by JobTech Development [7], a development unit within SPES. The list is a CSV file containing a number of soft skills, or terms as they call them. In Figure 4 a selection from the Formaga file can be seen. The selection shows the different categories of data within the file, listed below.


Figure 4 An example draft from the formaga file containing soft skill words.

1. term - soft skills
2. uuid - concept id
3. concept - name for specified groups of terms
4. type - what type the concept is; here all entries are FORMAGA, referring to the Swedish word förmåga, meaning ability
5. term_uuid - term id
6. term_misspelled - True or False depending on whether the term is misspelled
7. version - build version

Here, again, several columns are redundant and not necessary in this case, such as the ids. Therefore the data was converted to a DataFrame and the irrelevant parts were filtered out. As mentioned before, the description texts are quite messy and not free of spelling errors. From this, the conclusion was drawn that it is relevant to keep the complete list of terms, even the misspelled ones.
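A small sketch of this filtering, assuming the list is stored locally as formaga.csv with the columns listed above (the file name is an assumption):

    import pandas as pd

    formaga = pd.read_csv("formaga.csv")

    # Keep only the columns needed for matching; ids and build metadata are dropped
    soft_skill_terms = formaga[["term", "term_misspelled"]]

    # Both correctly spelled and misspelled terms are kept, since the
    # descriptions themselves contain spelling errors
    print(len(soft_skill_terms["term"].unique()))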

3.3 Algorithm

The soft skill terms provided by JobTech are useful when extracting the soft skill words from the company description texts. The words in a text that match the terms are added to a list which, in turn, is added to the corresponding company's row under a column called Skills. As a final step, all companies with an empty skill list are removed.
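A minimal sketch of this extraction step; tokenization by simple lowercasing and splitting, as well as the example data, are assumptions for illustration:

    import re
    import pandas as pd

    def extract_soft_skills(description, term_set):
        """Return the words in a description that appear in the soft skill term set."""
        tokens = re.findall(r"\w+", description.lower())
        return [token for token in tokens if token in term_set]

    # Illustrative inputs; in the project these come from the crawled
    # descriptions and the Formaga term list
    companies = pd.DataFrame({
        "Number": [1, 2],
        "Description": ["Vi söker en driven och pålitlig person", "Ett byggföretag"],
    })
    terms = {"driven", "pålitlig", "effektiv"}

    companies["Skills"] = companies["Description"].apply(
        lambda text: extract_soft_skills(text, terms))
    # Drop companies where no soft skill words were found
    companies = companies[companies["Skills"].str.len() > 0]
    print(companies)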


1. If a close friend or family member were to describe your 3 clearest soft skills, what would they say?

2. What is your current workplace, i.e the company where you actively work? (if you are not working - write your latest)

3. What is your occupational title / role in that workplace?

4. Which 3 soft skills are most important for being able to perform that professional role in that workplace?

5. What was your previous professional title / role before it?

6. At what workplace, i.e at which company did you work? (Can be the same)

7. Which 3 soft skills were most important in order to perform that professional role in that workplace?

Table 10 Questions for job seeker survey

Ava provides the data source for the job seekers' soft skills. Each job seeker answered a survey regarding their soft skills and the companies they have worked at. They provided information such as which soft skills they possess, what work role they have had, and which soft skills they saw as necessary for that position and workplace. The complete list of questions can be seen in Table 10. The job seeker data was also loaded into a DataFrame to ease management. At best, two different companies with accompanying soft skills were provided by each job seeker, soft skills which can be added to the same company's skill set in the description DataFrame. First, the companies and their soft skills were extracted from the job seeker DataFrame into a new DataFrame. Then, for every company in the description DataFrame matching a company from the job seeker data, the soft skills from the job seeker data were appended to the soft skills list in the column Skills; see Figure 5 for an example of the appending process.

In order to do this, the headlines or names of the companies were needed, but they were not collected when fetching the company descriptions and id numbers. Therefore a second crawler had to be implemented to retrieve the company names, following the same principle as before. After retrieving the headlines, they were added to the DataFrame holding the companies and their descriptions with current skills.

Figure 5 Shows how company skills from job seeker data are added to description data in the column Skills

An example is displayed in Figure 6.

The job seekers, with their corresponding soft skills, are then used to measure the percentage of similarity between them and the companies, through calculations of the Jaccard index and cosine similarity on their soft skills.

3.3.1 Jaccard & Cosine similarity

The Jaccard index gives a baseline similarity measure, dividing what the sets share by what they contain in total. Similar to the standard percentage division "the part divided by the whole", it is a straightforward measure for grasping an understanding of the sets. Therefore, a calculation of the Jaccard index between job seekers and companies was conducted, following the equations seen in Section 2.3.

Figure 6 Displays an example of the DataFrame holding Companies with their Descriptions, Headline/Name and Skills


Jaccard is very hands-on and literal in its comparison of the sets, whereas cosine calculates an actual distance between vectors, here constructed from the soft skill lists. Cosine similarity is also the equation used by Word2Vec's and FastText's similarity functions; considering the expertise required to define both Word2Vec and FastText, their choice of similarity function is probably not arbitrary, and it is therefore relevant to use. Cosine is also among the more popular similarity measures [8], applied not only to text but also to clustering. As cosine similarity operates on vectors, the soft skills need to be converted from sets to vectors. The conversion to vectors and the calculation of cosine similarity can be seen in Section 2.4.

3.3.2 FastText

While both the Jaccard index and cosine similarity compare the actual word composition, a method that also includes the context of the words is necessary. For example, the words "reliable" and "dependable" are synonyms and will be seen in the same contexts, but in the previous calculations they would not be considered close or equal to each other, as their composition differs. In order to take context into account, a different word embedding program is necessary.

The criterion for this word embedding program was the ability to create word embeddings for Swedish words, or the existence of a pre-trained Swedish model. Early in the project a promising pre-trained model with Swedish word vectors by Kyubyong [19], implemented using Gensim's Word2Vec, was found. However, this pre-trained model was based on Swedish Wikipedia backup dumps, and the variety of words was not sufficient.

The specific words related to job searching were not included, as they probably did not occur frequently enough in the Wikipedia dumps to be considered. Therefore, the description texts were added to the existing Wikipedia corpus used by the Word2Vec function in order to train a new model more directed towards work related words. This new model was then imported into the program holding the description and job seeker data.

Now all soft skill words should have a corresponding vector in the Word2Vec model. As all words already have a vector, it is not necessary to create one hot encoded vectors; cosine similarity can be evaluated directly through Word2Vec's similarity function. A problem with Word2Vec's similarity function is that it only compares one word against another, while there are often several words associated with a job seeker and a company. The solution is to take the mean of the word vectors for both the job seeker's and the company's soft skills, and then take the similarity of the two mean vectors.

In order to retrieve a word vector from the Word2Vec model, the word is assumed to exist there; words not included in the model do not have a corresponding vector. This causes problems, as the data contains misspelled soft skill words, which occur even less often in the data than the correctly spelled words. Although the descriptions were added to the corpus used by Word2Vec, there is no guarantee that all the words have been assigned a vector. To ensure that every soft skill word, with all its different spellings, is included in the model, a more thorough and detailed management of the data would be necessary: a time consuming and storage draining process, as more accurate data would need to be collected and included. Another option, which was used in this project, is the word embedding library FastText. As FastText creates word vectors for sub-words of the words within the corpus, it is more flexible than Word2Vec. Practically, this means that FastText can create vectors for words not included in the corpus used when creating the FastText model, as the sub-words of such a word are probably already included.

As misspelled words are very similar in composition to their correctly spelled versions, they are built up from roughly the same sub-words, generating vectors close to one another. See Table 11 for the similarity between the correctly spelled Swedish word "administrativ" and the misspelled versions of it included in the soft skill list from JobTech Development.

Misspelled       Similarity to administrativ
Admnistrativ     0.956354
Admininstrativ   0.924669
Aministrativ     0.966103
Admistrativ      0.925482
Administartiv    0.851144

Table 11  FastText similarity between misspelled and correctly spelled versions of the Swedish word administrativ.

it’s misspelled versions included in the soft skill list from JobTech Develop- ment. Another upside with using FastText is that their similarity function

    # Similarity measure between query A and B
    fasttext_model.n_similarity(
        ['fast', 'fun', 'playful', 'committed'],
        ['quick', 'lively', 'frisky', 'devoted'])
    0.58

    # Similarity measure between query A and C
    fasttext_model.n_similarity(
        ['fast', 'fun', 'playful', 'committed'],
        ['quick', 'fun', 'fun', 'committed'])
    0.79

    # Similarity measure between query A and D
    fasttext_model.n_similarity(
        ['fast', 'fun', 'playful', 'committed'],
        ['administrative', 'genuine', 'reliable'])
    0.45

Listing 1 Output display of similarity calculation between A and B, C and D


A similarity measurement that considers both the context and the composition of the words can now be obtained between a job seeker's and a company's soft skill lists. Consider the example queries in Table 1, where query A is regarded as the soft skills of a job seeker and B-D as the soft skills of companies. The calculation of the similarity is simply the cosine similarity equation, but with vectors from the FastText model. An example of the code output can be seen in Listing 1.

Considering again the previous example calculations of the Jaccard index in Section 2.3 and of cosine similarity in Section 2.4, the matrix in Table 12 summarizes all the calculations.

Equation              A-B    A-C    A-D
Jaccard Index         0      0.4    0
Cosine Similarity     0      0.61   0
FastText Similarity   0.58   0.79   0.45

Table 12  All example equation results in a summary matrix

Job seeker   Ampillo AB   Pfc Clinic AB   Bygg VVS El Stockholm AB
0            0.621        0.475           0.634
1            0.793        0.687           0.596
2            0.522        0.320           0.668
3            0.610        0.450           0.492
4            0.560        0.377           0.660

Table 13  An example of a match percentage matrix with job seeker 0-4 and three companies Ampillo, Pfc Clinic and Bygg VVS El Stockholm

These three measures, Jaccard, cosine and FastText, are calculated for every company in the description DataFrame and every job seeker, which results in one match percentage matrix per measure; see Table 13 for a reference picture, where the number of columns equals the number of companies and the number of rows equals the number of job seekers. From each similarity matrix, the top three matching companies for every job seeker are extracted together with the skills and percentage, and can be evaluated against the results from the Jaccard and cosine similarity matrices.
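A minimal sketch of how such a matrix and the top-three extraction could look, assuming DataFrames job_seekers and companies that each hold a Skills column of word lists, a Headline column for the company names, and a loaded FastText model available as fasttext_model; all names are illustrative, not the project's actual code:

    import pandas as pd

    def similarity_matrix(job_seekers, companies, similarity):
        """One row per job seeker, one column per company, filled with similarities."""
        matrix = pd.DataFrame(index=job_seekers.index,
                              columns=companies["Headline"], dtype=float)
        for seeker_idx, seeker in job_seekers.iterrows():
            for _, company in companies.iterrows():
                matrix.loc[seeker_idx, company["Headline"]] = similarity(
                    seeker["Skills"], company["Skills"])
        return matrix

    # For FastText the similarity function is the model's n_similarity;
    # the jaccard and cosine_similarity helpers sketched earlier fit the same signature
    fasttext_matrix = similarity_matrix(
        job_seekers, companies,
        lambda a, b: float(fasttext_model.n_similarity(a, b)))

    # Top three matching companies for job seeker 0
    print(fasttext_matrix.loc[0].nlargest(3))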


4 Results and Discussion

The results are compared between the three similarity methods used: Jaccard, cosine and FastText. An analysis using Word2Vec is not part of this project, as the results would be faulty due to words missing in the model, especially misspelled ones. A comparison between variations in the amount of data used and the effect of varying the number of soft skills included is also conducted. The latter comparisons are attempts to gain further insight into the three methods used and how variations in the input impact them.

4.1 Similarity matrix results

Figure 7 The similarity matrix result for job seeker 0-4 and seven companies when calculating Jaccard similarity

An excerpt from the resulting similarity matrices obtained in this project is displayed in Figure 7, Figure 8 and Figure 9. The rows correspond to job seekers 0 through 4, while the columns correspond to the companies. The job seekers and companies appearing in these results are listed with their corresponding soft skills in Table 14 and Table 15.

Figure 8 The similarity matrix result for job seeker 0-4 and seven companies when calculating cosine similarity

Figure 9 The similarity matrix result for job seeker 0-4 and seven companies when calculating similarity with FastText

Job seeker   Soft skills
0            glad, ansvarstagande, organiserad
1            driven, kritiskt tänkande, tidsstrukturerande
2            omtänksam, pålitlig, lyhörd
3            hård skal mjuk insida, driven, sympatiskt
4            pålitlig, trygg, intelligent

Table 14  Soft skills of the first five job seekers from the similarity matrices.

Company                       Soft skills
Pfc Clinic AB                 konstruktiv, driv
Sjöfartsverket                effektiv, vänlig, effektiv
Enersize Advanced Rearch AB   logisk, logiskt, effektiv, effektiv, effektiv
+46 Sverige AB                effektiv, trevlig
Purus AB                      pålitlig, nytänkande, tänkande
Train Planet AB               snabb, driv
Tension education AB          driv, driv

Table 15  Soft skills of the first seven companies from the similarity matrices.

Observing the three matrices, it is apparent that FastText shows some level of similarity between all job seekers and companies, whereas Jaccard and cosine have zero similarity for the majority of the company-job seeker pairs in these matrices. There is, however, one cell where all three methods yield a non-zero result: the similarity between the company "Purus AB" and job seeker 4. In the Jaccard matrix the result is 0.2, in the cosine matrix it is approximately 0.33, and in the FastText matrix it is approximately 0.7. Extracting the soft skills of the job seeker and the company from Table 14 and Table 15 yields the soft skills "pålitlig", "nytänkande", "tänkande" as one query and "pålitlig", "trygg", "intelligent" as the other. The job seeker and the company share one soft skill, "pålitlig", which is the reason cosine and Jaccard get a similarity percentage larger than zero. Cosine results in a slightly higher similarity percentage because cosine divides by the number of possible attributes, whereas Jaccard divides by the size of the union of the two queries.

The same cell in the FastText matrix has a significantly higher similarity percentage. To reveal the reason behind this difference, a further analysis of the two involved queries is necessary. Consider the company query: the two words other than "pålitlig" are "nytänkande" and "tänkande". These words are synonyms, or close to synonyms, of the word "intelligent" from the job seeker query and will probably often appear in the same contexts. Also, the word "trygg" from the job seeker query is a synonym of the word "pålitlig", which strengthens the connection between the two queries further. To summarize, the two sets of soft skills contain two identical words, and the remaining words are synonyms that share many contexts. From this observation, a FastText similarity of 0.7 seems like a reasonable result.

Job seeker   Match   Company                               Company soft skills
0            0.25    Insikten HVB AB                       positiv, ansvarstagande
1            0.25    Jessica Lundqvist Konfektions AB      driv, driven
2            0.40    Inflight International Logistics AB   innovativ, tydlig, innovativ, pålitlig, omtänksam
3            0.25    Jessica Lundqvist Konfektions AB      driv, driven
4            0.25    NSG Sweden AB                         pålitlig, flexibel

Table 16  Best company match for job seeker 0-4 with Jaccard; job seekers' soft skills are displayed in Table 14.

4.2 Best match

The best matches between job seekers and companies have been collected and put together with three columns displaying the match percentage between job seeker and company, the company name, and the soft skills of the company. The matching is the result of calculations using Jaccard, cosine and FastText, which therefore yields three matrices. Excerpts of the best matches from these matrices are displayed in Table 16, Table 17 and Table 18.

Comparing the Jaccard matrix with the cosine matrix, there is barely any difference between them. Job seekers 0, 1 and 3 match with the same company in both the Jaccard and the cosine matrix. Although the match percentage is higher with cosine, this does not automatically mean it is a better result than Jaccard, as the job seekers still match with the same company. Consider job seekers 2 and 4, which differ between the two matrices. Job seeker 4 receives a better match calculated with cosine than with Jaccard because of the two occurrences of "pålitlig" in the company's soft skill query. Here, better is quantified as more connections between the two queries, which is reflected in a higher similarity percentage. When a soft skill occurs several times, it means the word has been mentioned several times in the description; such repetition can therefore indicate higher importance to the company.

Job seeker   Match   Company                            Company soft skills
0            0.408   Insikten HVB AB                    positiv, ansvarstagande
1            0.408   Jessica Lundqvist Konfektions AB   driv, driven
2            0.516   Kvalitetspartner Sverige AB        seriös, pålitlig, pålitlig
3            0.408   Jessica Lundqvist Konfektions AB   driv, driven
4            0.516   Kvalitetspartner Sverige AB        seriös, pålitlig, pålitlig

Table 17  Best company match for job seeker 0-4 with Cosine; job seekers' soft skills are displayed in Table 14.

Comparing the two situations, one where there is a match on a single word and one where there is a match on a single word with several occurrences, the latter is preferable.

Consider job seeker 2, whose situation differs from job seeker 4's: two words match in both matrices, but the words themselves differ between the matrices. In the Jaccard matrix, job seeker 2 matches on two different words in the company query, "omtänksam" and "pålitlig", whereas in the cosine matrix job seeker 2 matches on the double occurrence of the word "pålitlig" in the company's query. There is therefore no straightforward answer as to which matrix produced the better match; it depends on which soft skill the job seeker values the most.

Comparing the best match matrix obtained using FastText with the cosine matrix, there is no intersection between the companies. In the same comparison between the FastText matrix and the Jaccard matrix, there is one match that reoccurs: job seeker 2 and the company "Inflight International Logistics AB", the same pair that matched on two, but different, words in Jaccard and cosine, which made it difficult to decide which method made the best match. The soft skills corresponding to job seeker 2, from Table 14, are "omtänksam", "pålitlig" and "lyhörd", which translate to "thoughtful", "reliable" and "responsive".

Job seeker | Match | Company | Company soft skills
0 | 0.698 | Luleå kommun | engagerad, ansvarstagande
1 | 0.752 | Melleruds kommun | strategisk, driv, drivande, nytänkande, tänkande
2 | 0.789 | Inflight International Logistics AB | innovativ, tydlig, innovativ, pålitlig, omtänksam
3 | 0.630 | Mindsweep Consulting AB | strategisk, lagkänsla, driv, drivande, pragmatisk, resultatinriktad
4 | 0.752 | Nordens Invest Bygg AB | pålitlig, stabil

Table 18: Best company match for job seekers 0-4 with FastText; the job seekers' soft skills are displayed in Table 14.

Apart from those, the company's skill set contains "tydlig" and a double occurrence of "innovativ", which translate to "clear" and "innovative". Aside from the two identical words, there are no obvious synonyms or straightforward connections between the other words.

To get a better understanding of this matching, the similarity percentage is derived word for word. The word "innovativ" has the largest similarity with "lyhörd" at 0.23 and with "pålitlig" at 0.27, whereas the word "tydlig" is most similar to "pålitlig", at 0.36. A certain amount of clearness is probably involved in seeing someone as reliable; it helps to form an understanding of the person and of what can be expected of them. Applying the same analysis to cosine's matching company, with the soft skills "seriös" and the double occurrence of "pålitlig", the word "seriös" is most similar to the job seeker's soft skill "lyhörd", but only at 0.19, which is significantly lower than the other company's two words. This explains why FastText chose "Inflight International Logistics AB" and not "Kvalitetspartner Sverige AB" as cosine did.
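The word-for-word similarity values above can be reproduced along the lines of the sketch below, using a pretrained FastText model loaded with gensim. The model file name is a placeholder, and the exact model the thesis trained or downloaded may differ, so the numbers will not match exactly.

```python
from gensim.models.fasttext import load_facebook_vectors

# Placeholder model file; any pretrained Swedish FastText .bin model could be used here.
wv = load_facebook_vectors("cc.sv.300.bin")

seeker_2 = ["omtänksam", "pålitlig", "lyhörd"]                            # Table 14
inflight = ["innovativ", "tydlig", "innovativ", "pålitlig", "omtänksam"]  # Table 18

# Pairwise word similarities, mirroring the per-word analysis above.
for c in dict.fromkeys(inflight):  # keep order, drop duplicates
    for s in seeker_2:
        print(f"{c:>10}  {s:<10}  {wv.similarity(c, s):.2f}")
```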


4.3 Insights into the best-match metrics

The mean of all the best matches in Jaccard's best-match matrix is 0.298, where the highest value is 0.400 and the lowest is 0.250. The mean of cosine's best-match matrix is 0.521, where 0.707 is the highest value and 0.408 is the lowest. FastText's best-match matrix has a mean of 0.762, with 0.901 as the highest value and 0.621 as the lowest. FastText has a higher mean, but all methods have a mean roughly in the middle of their highest and lowest values, meaning an even spread of the results.
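These figures are plain aggregates over the best-match column of each matrix. A small sketch of the computation is shown below; it uses only the five Jaccard values excerpted in Table 16, whereas the statistics quoted above are computed over all job seekers.

```python
import numpy as np

# Best-match values for job seekers 0-4 from Table 16 (Jaccard); the full
# result set used in the thesis contains more job seekers than this excerpt.
jaccard_best = np.array([0.25, 0.25, 0.40, 0.25, 0.25])
print(jaccard_best.mean(), jaccard_best.max(), jaccard_best.min())
```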

4.4 Impact of additional soft skills

To be able to analyze the impact of the soft skill data, the similarity percentage between the job seekers and the companies occurring in the answers to the questions in Table 10 is calculated using FastText. More precisely, the comparison is made before and after the additional soft skills, mentioned by the job seekers in questions 4 and 7 in Table 10, are added to the crawled soft skill lists.

Table 19 displays the similarity percentage between job seekers 0 through 4 and four companies mentioned in the answer sheets, before the additional soft skills were added; the best match for each job seeker among these four companies is the highest value in the row. The same matrix, but after adding the further soft skills, is displayed in Table 20, where the best match can again be read as the highest value in the row and compared against the cell that held the best match in the "before" matrix.

Job seeker | Avantime Group AB | InUse | Svensk Fastighetsförmedling | Tritech Technology AB
0 | 0.654 | 0.630 | 0.651 | 0.455
1 | 0.655 | 0.671 | 0.612 | 0.386
2 | 0.422 | 0.616 | 0.613 | 0.734
3 | 0.510 | 0.557 | 0.491 | 0.312
4 | 0.414 | 0.651 | 0.534 | 0.710

Table 19: Similarity before adding soft skills to the companies; the best match for each job seeker among the listed companies is the highest value in its row.


By focusing on the "after" matrix, it is clear that two job seekers have their best match with the same company, InUse, in both the "before" and the "after" matrix. All best matches except one, job seeker 3 and InUse, have an increased similarity percentage, indicating that additional data does not guarantee a higher similarity percentage.

Job seeker | Avantime Group AB | InUse | Svensk Fastighetsförmedling | Tritech Technology AB
0 | 0.638 | 0.675 | 0.644 | 0.591
1 | 0.605 | 0.729 | 0.595 | 0.540
2 | 0.732 | 0.633 | 0.769 | 0.649
3 | 0.486 | 0.627 | 0.484 | 0.463
4 | 0.693 | 0.653 | 0.692 | 0.612

Table 20: Similarity after adding soft skills to the companies; the best match for each job seeker is the highest value in its row, to be compared with the best matches in Table 19.

To further analyze the similarity percentage and the change of best match, a compilation of the companies' soft skills is displayed in Table 22, where the soft skills in bold are the additional soft skills from the job seekers' observations. The soft skills of job seekers 0 through 4 are displayed in Table 14. Three different cases can be derived from Table 20; they are displayed in Table 21.

1. The best-match similarity percentage went up and the company was unchanged.

2. The best-match similarity percentage went up and the company changed.

3. The best-match similarity percentage went down and the company changed.

Table 21: Three cases explaining the events occurring in Table 20.

The first case is represented by the match between job seeker 1 and the company InUse. Job seeker 1 has the soft skills "driven", "kritiskt tänkande" and "tidsstrukturerande", which translate to driven, critical thinking and good at time management.

Avantime Group AB: nytänkande, tänkande, passionerad, kommunikativ, pålitlig, kunnig
InUse: rolig, rak, driv, drivkraft, engagerad, flexibilitet, modig, professionell, professionel, empatisk, lyssnande, tydlig, lyssnande, flexibel, kommunikativ
Svensk Fastighetsförmedling: rak, ansvarstagande, nytänkande, tänkande, omtänksam, affärsmässig, affärsmässighet, kommunikativ, pålitlig, kunnig
Tritech Technology AB: trevlig, fokuserad, vill lära sig nytt, bra på samarbete

Table 22: The four companies with their original soft skills and, in bold, the additional soft skills.

InUse has 15 soft skills, two of which are duplicates, including the misspelled duplicate of the word "professionell". As InUse has a comparatively large set of soft skills, only a few words will be analyzed. The two words "driv" and "drivkraft" have a straightforward connection to the word "driven" and were included in the set from the beginning. The additional soft skills do not contain any words with an apparent connection to job seeker 1's soft skills, which indicates why job seeker 1 continues to have its best match with InUse. Although there is no obvious connection, the added words cannot be too dissimilar either; otherwise, the similarity percentage would have decreased instead of increased after adding the further soft skills.

The second case is represented by job seeker 2 and the two companies Svensk Fastighetsförmedling and Tritech Technology AB. Job seeker 2 has the soft skills "omtänksam", "pålitlig" and "lyhörd". Before the further soft skills were added, job seeker 2 had the best match with Tritech, which only has one soft skill, "trevlig", even though Svensk Fastighetsförmedling has one soft skill with 100% similarity to the job seeker's soft skill "omtänksam". Probably Svensk Fastighetsförmedling's other soft skills were too far off from the job seeker's three soft skills as a whole, which pulled the similarity percentage down. In the "after" matrix, it is clear that job seeker 2 has the highest similarity with Svensk Fastighetsförmedling.
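A sketch of how this before/after comparison can be reproduced for job seeker 2 and Tritech is given below, assuming each skill list is represented by the mean of its FastText word vectors and compared with cosine similarity; the thesis may aggregate the word vectors differently, and the model file name is again a placeholder.

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors("cc.sv.300.bin")  # placeholder pretrained Swedish model

def list_vector(skills):
    # Mean word vector over all tokens; multi-word skills are split into tokens.
    tokens = [t for skill in skills for t in skill.split()]
    return np.mean([wv[t] for t in tokens], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

seeker_2 = ["omtänksam", "pålitlig", "lyhörd"]
tritech_before = ["trevlig"]
tritech_after = ["trevlig", "fokuserad", "vill lära sig nytt", "bra på samarbete"]

print(cosine(list_vector(seeker_2), list_vector(tritech_before)))
print(cosine(list_vector(seeker_2), list_vector(tritech_after)))
```

Recomputing the similarity this way makes it easy to see how much each added soft skill shifts the aggregate score for a given job seeker.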
