Linköpings universitet SE–581 83 Linköping

VT2019 | LIU-IDA/ERASMUS-A-2019/001-SE

Characterisation of a developer's experience fields using topic modelling

Vincent Déhaye

Supervisor: Rita Kovordanyi
Examiner: Ola Leifler


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

Finding the most relevant candidate for a position is a ubiquitous challenge for organisations. It can also be arduous for a candidate to explain in a concise resume what they have experience with. Because a candidate usually has to select which experiences to present and filter out others, relevant experience may not be detected by the person carrying out the search, even though the candidate does in fact have it. In the field of software engineering, building one's experience usually leaves traces behind: the code one has produced. This project explores approaches to tackle this screening challenge with an automated way of extracting experience directly from code, by defining common lexical patterns in code for different experience fields using topic modelling.

Two different techniques were compared. On the one hand, Latent Dirichlet Allocation (LDA) is a generative statistical model which has proven to yield good results in topic modelling. On the other hand, Non-Negative Matrix Factorization (NMF) factorises a matrix representing the code corpus as word counts per piece of code into two non-negative matrices. The code gathered consisted of up to 30 random repositories from each collaborator of the open-source Ruby-on-Rails project on GitHub, to which common natural language processing transformation steps were then applied.

The results of both techniques were compared using perplexity for LDA, reconstruction error for NMF, and topic coherence for both. The first two measure how well the data can be represented by the topics produced, while the latter estimates how well the elements of a topic hang and fit together, and can reflect human understandability and interpretability. Given that we did not have any similar work to benchmark against, the values obtained are hard to assess scientifically. However, the method seems promising, as we would have been rather confident in assigning labels to 10 of the topics generated.

The results imply that one could probably use natural language processing methods directly on code production in order to extend the detected fields of experience of a developer, with a finer granularity than traditional resumes and with field definitions evolving dynamically with the technology.

Acknowledgments

First of all, I want to thank Ola Leifler, Rita Kovordanyi and Elie Raad for the quality support they provided me with, as well as their responsiveness.

I would also like to thank Biniam Palaiologos for standing by my side all along the way, concluding his effort by being not only a good friend but also a great opponent.

I also want to express my gratitude to Arthur Devillard for his sound feedback and his presence when I needed it.

Finally, I want to thank my family for having brought me here and for having endured the stress probably even more than I did. This work is dedicated to my grandfather, may he be proud of me.

Contents

Abstract
Acknowledgments
Contents
List of Figures
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Theory
  2.1 Mining Software Repositories
  2.2 Topic Models
3 Method
  3.1 Pre-study
  3.2 Data
  3.3 Latent Dirichlet Allocation
  3.4 Non-negative Matrix Factorization
  3.5 Metrics
4 Results
  4.1 Perplexity and reconstruction error
  4.2 Coherence
5 Discussion
  5.1 Results
  5.2 Method
  5.3 The work in a wider context
6 Conclusion

List of Figures

2.1 Difference between the generative model and the statistical inference approach
2.2 Illustrating the impact of the hyperparameter on a symmetric Dirichlet distribution for three topics. The darker the color, the higher the probability. On the left, α = 4. On the right, α = 2.
2.3 Graphical description of the topic model using plate notation
2.4 Comparison of the matrix decompositions of LSA and the topic model
3.1 Number of diffs for the ten most occurring programming languages.
3.2 Example output of the git diff command on one file in one commit, with a context window of 3 lines.
3.3 Data preprocessing pipeline.
3.4 Length distribution of the data items after preprocessing.
3.5 Number of diffs for the twenty largest repositories in the dataset in terms of number of diffs.
3.6 Number of tokens for the twenty largest repositories in the dataset in terms of number of diffs.
3.7 Reconstruction error of NMF and number of iterations to converge as a function of the α value.
4.1 Evolution of perplexity as a function of the number of topics.
4.2 Evolution of reconstruction error as a function of the number of topics.
4.3 $C_{UCI}$ coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.
4.4 $C_{NPMI}$ coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.
4.5 $C_V$ coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.
4.6 $C_{UMass}$ coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.
4.7 Normalized evolution of NMF coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.
4.8 Normalized evolution of LDA coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.

1 Introduction

1.1 Motivation

Companies have a strong need to source the most relevant candidates for a position. This is particularly true for consulting companies, which look for candidates ready to be placed on a customer's project without any prior training. To do this, the classic approach is to scrutinize resumes and find candidates matching the required programming language and framework.

However, one can usually carry out very different tasks with the same programming language, possibly with no relation between them. From this point of view, knowing the programming language required for a position can be an asset for the candidate, but it would be better to be able to characterize the programmer's experience at a finer level of granularity. For example, instead of knowing whether a web developer has worked with a specific framework such as Ruby-on-Rails, you would prefer to determine whether they have worked with authentication systems in the past, because their effort will be focused on creating one for this project.

Additionally, individuals may exaggerate their skill set or, on the contrary, omit some of their skills in their self-authored resumes[18]. Traces provided by coding contributions are more verifiable than a list of achievements on a resume[29]. John Resig, the creator of jQuery¹, once tweeted: "When it comes to hiring, I'll take a Github commit log over a resume any day."² This approach is also useful for finding candidates who lack formal degrees or relevant job experience[10][18].

Software development is an incremental process: programmers verify the functionality of their code frequently throughout the process. They sometimes have to return to a previous version because unwanted behaviours have been introduced since then. They also often work in teams and need to modify the same piece of code at the same time for distinct reasons. In both of these situations, version control systems come to the rescue.

Thanks to the emergence of version control systems (VCSs), it has become easier for developers to work collaboratively, which in turn allows for larger and more complex software development projects involving more developers. Each programmer can now easily contribute their own specific skills, and the VCS makes it possible to assemble these contributions efficiently.

¹ A widespread cross-platform JavaScript library that simplifies the client-side scripting of HTML.
² https://twitter.com/jeresig/status/33968704983138304

The development process has become smoother, and the success of a project has shifted from depending on individual heroes to relying on collective intelligence.

As a consequence, computer-based solutions are more and more common, and they have found their way into the average person's life. The ever-increasing general interest in computer science causes the volume of data handled by version control systems to increase every day. At the same time, programming needs are becoming more specific to each domain. However, given the constant evolution of the field, it is hard to classify developers into static categories, which could quickly become obsolete.

The richest source of information about a developer's history is the code they have produced. It is a best practice for programmers to use meaningful identifier names, to insert comments that clarify the functionality of the code, and to output explanatory error messages, present in the code as string literals. If they follow these guidelines, the developer's intentions can be unveiled by the specific vocabulary they used in comments, identifier names and string literals. Once the purpose of the code, i.e. the developer's intentions with the code, is defined, it is straightforward to consider that the author of the code has some knowledge in the field corresponding to this purpose. Taken together, these facts suggest that it would be beneficial to be able to characterise a developer's fields of knowledge at a fine level of granularity based on their code production, which evolves alongside their programming experience.

Since the objective of this research is to create a new way to characterise and cluster developers, we do not know in advance which classes we will use; they will be defined by the clusters, so we cannot use any pre-labelled dataset. We therefore need to use unsupervised data mining techniques to discover the latent classes. The approach we will follow is topic modelling, a form of text mining aimed at analysing the topics a document deals with through analysis of word occurrences. We will evaluate the quality of the outcome using perplexity[4], equivalent to the likelihood of a held-out set of data, and topic coherence[34], which estimates how well the elements of a topic hang and fit together.

1.2 Aim

The aim of this thesis is to provide an innovative way to characterise a developer's experience fields. Based on the source code they have produced, we want to assess a developer's domains of knowledge, such as authentication systems or API development. The purpose of this characterisation is to lay foundations upon which further applications could build. Examples of such applications include:

• generating a pool of relevant developers for a position, project or team[21][2]

• automatically detecting what the programming efforts of a group of developers are focused on at the moment, or following their evolution over time[2][41]

• computing statistics for the different characteristics detected: demographics, education, positions...[3]

1.3 Research questions

The following research questions are to be answered:

1. Which topic modelling algorithm yields the best performance on source code: Latent Dirichlet Allocation (LDA)[6] or Non-negative Matrix Factorization (NMF)[27]? We will compare the performance based on the perplexity[4] and the coherence[34] of the topics generated by each method.


1.4 Delimitations

The data we will gather for this study will come from a VCS, because that is the most usual tool to store evolving source code. The typical data available from version control systems consists of the following:

• user information: name, public projects, age, location, organisation

• project information: users and their roles in the project (owner, contributor, viewer...), branches, description

• commit data: user, date and time, files involved, content of the modifications

Not all of this data is necessarily present in every version control system; there can be more or less, depending on the system. In order to base the research on a consistent data structure, we will select one system and work only with data from it. However, as this project aims to be useful in as many different contexts as possible, the data we base the approach on needs to be as generally available as possible. To this end, we will focus on the content of the commit modifications, which is a backbone of all version control systems. We will not consider the commit messages, which could have been informative as well, because we want to keep a homogeneous dataset, with the same granularity and every document being of the same nature, in this case only source code.

There has never been any formal study on which VCS is the most popular. However, considering multiple unofficial surveys, it seems to be git at the moment, and it has been for a while. In fact, git is used in most open source software projects and was created to support the development of one of the most famous open source projects, the Linux kernel. We will target open source project data for availability reasons. GitHub⁴ is one of the most popular online hosting services for version control using git. It will be our primary source of data, on the one hand because of its popularity among open source programmers, and on the other hand because it provides a convenient application programming interface (API)⁵.

⁴ https://github.com/
⁵ An API is a set of clearly defined methods of communication between various software components, intended to make the developer's job easier by providing ready-made building blocks. In this case, the GitHub API can be used to retrieve data.

2 Theory

Given the nature of the data, the techniques used to reach the goal belong to the field of natural language processing.

2.1 Mining Software Repositories

Mining Software Repositories (MSR) is a field of software engineering intended to analyse and understand the data available in repositories related to a software development project. These repositories can offer an informative and fine-grained view of the process of realizing a software system. They can contain different types of data, which leads to classifying them into different categories such as bug and vulnerability databases, mailing lists and chat logs, or the one of interest to us, source code repositories.

Source code repositories typically contain unstructured text. They can also include structured text, such as JSON or XML files, but due to the diversity of these possible structures, and the uncertainty of their relevance for characterising a developer, we will not consider this kind of document. As a matter of fact, these structured files often store data to be used by the software, which is not necessarily directly related to the purpose of the program and can be ambiguous, introducing noise into the data. The only data considered will be source code, with the intuition that the developer's intentions can be discovered from comments, identifier names and string literals. It is a best practice for programmers to use meaningful identifier names, to insert comments that clarify the functionality of the code, and to output explanatory error messages, present in the code as string literals. Applying natural language processing (NLP) and information retrieval (IR) techniques to these three sources of information should then yield good insights about what the developer is trying to achieve with this code, leaving the how aside[41].

As a matter of fact, a programming language is designed to be a means of communication between the human and the machine. For a human to learn it as smoothly as possible, it is better for the programming language to be as similar to human natural language as possible. For most existing programming languages, the natural language picked as a reference is English, which is why most programming languages' keywords are English-based. Given this, we will apply natural language processing tasks to English-based source code in order to characterise its authors.

What we are trying to detect in the source code is its purpose: what the goal of this piece of code is. Thanks to the fact that the data comes from a version control system, we are also able to determine which developer wrote this piece of code. Combining the two pieces of information, one can infer that the developer in question has some knowledge in the field corresponding to the purpose of their code. This defines the approach of this thesis to characterising a developer: assessing their domains of knowledge from their source code contributions.

As we are aiming at creating new ways of differentiating developers, there is no labelled data that we could use for training our model; we are in fact aiming at creating these labels. Hence, we have to find a suitable unsupervised learning model. The approach selected is topic modelling, which is a form of text mining aimed at discovering the latent topics a document deals with through analysis of word occurrences. We will equate the purpose of a piece of code with its topic, so that we can approach this problem from the point of view of the topic modelling field.

2.2 Topic Models

A topic model is a statistical model aimed at discovering the underlying topics which occur in a collection of documents, where a topic consists of a probability distribution over the words of a fixed vocabulary. Topic models are frequently used in text mining for discovering hidden semantic structures. Intuitively, if a document is about a particular topic, we can expect the words which are most frequent in this topic to appear often in this document. One fundamental assumption shared by all topic models is that a document is a mixture of multiple topics in different proportions. However, they differ slightly when it comes to the statistical assumptions.

In order to describe the model in a more formal way, we will introduce some notation for expressing it mathematically. Let $P(z)$ be the distribution over topics $z$ in a particular document and $P(w \mid z)$ the probability distribution over words $w$ given topic $z$. Each word $w_i$ in a document (where $i$ is the index of the word in the document) is generated by choosing a topic from the topic distribution, and then sampling a word from this topic's distribution over words. We use $P(z_i = j)$ for the probability that the $i$-th token was sampled from the $j$-th topic, and $P(w_i \mid z_i = j)$ for the probability of word $w_i$ under topic $j$. The model then has the following distribution over words in a document:

$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j) \qquad (2.1)$$

with $T$ the number of topics. To shorten the notation, we write $\Phi_j = P(w \mid z = j)$ for the multinomial distribution over words for topic $j$ and $\theta_d = P(z)$ for the multinomial distribution over topics for document $d$. The document collection consists of $D$ documents and each document $d$ is composed of $N_d$ words. We write $N = \sum_d N_d$ for the total number of words. The parameter $\Phi$ expresses the importance of each word for each topic, and $\theta$ the importance of each topic for each document.
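To make the notation concrete, the following minimal sketch computes the word distribution of Equation 2.1 for one document as a mixture of topic-word distributions; the array names Phi and theta and the toy sizes are assumptions made for illustration, not values from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

T, V = 3, 8                               # number of topics and vocabulary size (toy values)
Phi = rng.dirichlet(np.ones(V), size=T)   # Phi[j, w] = P(w | z = j), one row per topic
theta = rng.dirichlet(np.ones(T))         # theta[j] = P(z = j) for one document

# Equation 2.1: P(w_i) = sum_j P(w_i | z_i = j) * P(z_i = j)
p_w = theta @ Phi                         # mixture of the topic distributions, shape (V,)

assert np.isclose(p_w.sum(), 1.0)         # a proper distribution over the vocabulary
print(p_w)
```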

This minimal probabilistic approach to document modeling was first introduced by Hofmann in his Probabilistic Latent Semantic Indexing method (pLSI)[23], also known as the aspect model. The aspect model is a probabilistic version of the work on latent semantic analysis by Deerwester et al.[17], which highlighted the utility of the singular value decomposition of the document-term matrix. This version does not assume anything about how the mixture weights θ are generated, which makes it harder to test the generalizability of the model to new documents[39]. We will thus adopt the model extended by Blei et al. (2003), which is generative and introduces a Dirichlet prior on θ, hence its name: Latent Dirichlet Allocation (LDA). From the matrix factorization perspective of latent semantic analysis, LDA can also be considered as a type of principal component analysis for discrete data[8][9].

Generative Models and Latent Dirichlet Allocation

Generative Models

A generative model is a mathematical model describing the process by which all values of a phenomenon have been generated. Contrary to discriminative models, generative models try to depict the generation process of both the values that can actually be observed and the variables that can only be computed from the observations. To put it another way, discriminative models infer outputs from inputs, while generative models generate both inputs and outputs, typically based on some hidden parameters.

In practice, researchers try to find the best generative model, i.e. the one most likely to have generated the observed data, by searching for the best set of hidden variables. The best way to illustrate this is to follow a concrete example, which highlights the difference between approaching topic modeling from the point of view of a generative model and from the point of view of a statistical inference problem.

Figure 2.1: Difference between the generative model and the statistical inference approach

The example is illustrated in Figure 2.1. Let topic 1 be thematically related to education and topic 2 to eye anatomy. The topics consist of bags of words, each with a different distribution over words. Before generating a document, it is assumed that the document is provided with a specific distribution over the topics. Based on this distribution, different documents can be produced by picking words from a topic depending on the importance given to that topic. For example, documents 1 and 3 were generated by giving the maximum weight to topic 1 and topic 2 respectively, and no weight at all to the other topic, which means sampling from only one topic. Document 2 was generated by mixing topic 1 and topic 2 equally. Note that words can be part of different topics, as for the word "pupil" in this example. This allows topic models to take polysemy into account. In this example, both topic 1 and topic 2 would give a high probability to "pupil".

This generative process does not consider the structure of the documents at all: the order of the words is not important, the only information extracted is the number of occurrences of each word, their place in the document does not matter.

The right part of the figure illustrates the problem of statistical inference. Based on the observed data, the goal is to infer what topic model is most likely to have generated the data. In order to do that, one needs to infer:

• for each topic, its probability distribution over the vocabulary

• for each document, its probability distribution over topics

• often, for each word, which topic it has been sampled from[39]

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA)[6] is a generative statistical model which has proven to be a highly effective unsupervised learning technique for finding the different underlying topics of a collection of documents. It is based on the assumption that each document of the corpus has been generated by the following simple probabilistic procedure:

1. Randomly choose this document's distribution over the topics.

2. For each word in the document:

   a) Randomly choose a topic from the document's distribution over the topics.

   b) Randomly choose a word from the chosen topic's distribution over the vocabulary.

This process complies with the topic modeling assumption that each document exhibits multiple topics. One can also notice that each document has different proportions for the topics because of the first step, and that each word is drawn from one specific topic (step 2b) chosen from the document distribution over topics (step 2a). The distinguishing characteristic of LDA is that each document is based on the exact same set of topics; they only exhibit different proportions of each.
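As an illustration of this generative procedure, here is a minimal sketch in NumPy; the toy vocabulary, topic count and hyperparameter values are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["student", "teacher", "school", "pupil", "iris", "retina", "lens", "cornea"]
T, V = 2, len(vocab)
alpha, beta = 1.0, 0.5                         # symmetric Dirichlet hyperparameters (toy values)

# Topic distributions over the vocabulary, one per topic
phi = rng.dirichlet(np.full(V, beta), size=T)

def generate_document(n_words):
    theta = rng.dirichlet(np.full(T, alpha))   # step 1: this document's distribution over topics
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)             # step 2a: choose a topic
        w = rng.choice(V, p=phi[z])            # step 2b: choose a word from that topic
        words.append(vocab[w])
    return words

print(generate_document(10))
```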

As indicated by the name of the model, these per-document proportions are assumed to follow a Dirichlet distribution, simply because the Dirichlet is a conjugate prior for the multinomial, which simplifies the problem of statistical inference. A $T$-dimensional Dirichlet distribution over the multinomial distribution $p = (p_1, \ldots, p_T)$ has the following density:

$$\mathrm{Dir}(\alpha_1, \ldots, \alpha_T) = \frac{\Gamma\big(\sum_j \alpha_j\big)}{\prod_j \Gamma(\alpha_j)} \prod_{j=1}^{T} p_j^{\alpha_j - 1} \qquad (2.2)$$

Each hyperparameter $\alpha_j$ can be seen as a prior (before any observation) estimate of how many times topic $j$ is sampled in a document. If we do not have any prior information on the document, we should not assume the preponderance of any of the topics; they should all be equally probable. This case is convenient, as it is equivalent to setting all $\alpha_j$ to a single value $\alpha$, which we can then consider as a single hyperparameter. Applying a Dirichlet prior on $\theta$ allows for a smoothed topic distribution, the level of smoothing depending on $\alpha$. The effect of this parameter on the distribution is shown in Figure 2.2.

The Dirichlet prior on the topic distributions can be interpreted as forces on the topics, with $\alpha$ reflecting the variance of the distribution: the lower it is, the more concentrated the values are around the mean. If one wants to smooth the distribution a lot, one needs a high $\alpha$ value, so that many topics receive moderate probabilities. On the contrary, if one wants only a few topics per document, the selected topics should have high probabilities and the others probabilities close to zero, so one needs a low $\alpha$ value.
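A quick way to see this effect is to draw a few samples from symmetric Dirichlet distributions with different α values; the values below are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 5                                               # number of topics

for alpha in (0.1, 1.0, 4.0):
    theta = rng.dirichlet(np.full(T, alpha), size=3)
    # low alpha: most of the mass on one or two topics; high alpha: smoother proportions
    print(f"alpha = {alpha}")
    print(np.round(theta, 2))
```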

A variant of this model places another symmetric Dirichlet(β) prior on Φ as well. β can then be interpreted as the prior estimation of the number of times words are sampled from a topic, still before any actual observation. It allows for smoothing the distribution of topics over words.

The topic model introduced above can be represented clearly and concisely using the plate notation of Figure 2.3. In this notation, the shading of a variable expresses the fact that it is observed, whereas unshaded variables are latent. Arrows indicate conditional dependencies between variables, and plates (represented by boxes in the figure) express the repetition of sampling steps, with the number of samplings appearing in the lower right corner.

One can easily notice that the only actually observable data are the words in the documents, and what we are trying to infer are ϕ, θ and z: respectively the topic distributions over words, the document distributions over topics, and the topic each word has been sampled from. With regard to the final output, z is not very interesting; it will only be used to determine ϕ and θ, which will respectively allow us to define what each topic consists of and which topic each document is about.

Figure 2.2: Illustrating the impact of the hyperparameter on a symmetric Dirichlet distribution for three topics. The darker the color, the higher the probability. On the left, α = 4. On the right, α = 2.

Figure 2.3: Graphical description of the topic model using plate notation

The algorithm proposed in the original LDA paper is Variational Bayesian inference (VB). It estimates the true posterior by a simpler distribution q(z, θ, ϕ) indexed by a set of free parameters. Following David Blei et al.[6], we choose a fully factorized distribution as follows:

$$q(z_{d,i} = k) = \varphi_{d, w_{d,i}, k}, \qquad q(\theta_d) = \mathrm{Dir}(\theta_d; \gamma_d), \qquad q(\phi_k) = \mathrm{Dir}(\phi_k; \lambda_k) \qquad (2.3)$$

where $d \in D$ is the document index, $i$ the word index in the document (hence $w_{d,i}$ is the word at the $i$-th position in the $d$-th document), and $k$ the $k$-th topic. The posterior is then parameterized by the multinomial parameters $\varphi$ and $\gamma$, later referred to as the document topics, and $\lambda$, later referred to as the corpus topics. We will try to minimize the Kullback-Leibler divergence between $q(z, \theta, \phi)$ and $p(z, \theta, \phi \mid w, \alpha, \beta)$, which is equivalent to maximizing the Evidence Lower Bound (ELBO), defined as

$$\mathcal{L}(w, \varphi, \gamma, \lambda) = E_q[\log p(w, z, \theta, \phi \mid \alpha, \beta)] - E_q[\log q(z, \theta, \phi)] \le \log p(w \mid \alpha, \beta) \qquad (2.4)$$

This ELBO can be optimized via coordinate ascent over the variational parameters $\varphi$, $\gamma$ and $\lambda$:

$$\varphi_{d,w,k} \propto \exp\{ E_q[\log \theta_{d,k}] + E_q[\log \phi_{k,w}] \}, \qquad \gamma_{d,k} = \alpha + \sum_w n_{d,w}\,\varphi_{d,w,k}, \qquad \lambda_{k,w} = \beta + \sum_d n_{d,w}\,\varphi_{d,w,k} \qquad (2.5)$$

where $n_{d,w}$ is the number of tokens $w$ in document $d$. The expectations of $\log\theta$ and $\log\phi$ under $q$ are

$$E_q[\log \theta_{d,k}] = \psi(\gamma_{d,k}) - \psi\Big(\sum_{i=1}^{K} \gamma_{d,i}\Big), \qquad E_q[\log \phi_{k,w}] = \psi(\lambda_{k,w}) - \psi\Big(\sum_{i=1}^{W} \lambda_{k,i}\Big) \qquad (2.6)$$

with $\psi$ the digamma function, $\psi(x) = \frac{d}{dx}\ln\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}$. Based on the formulas introduced, the Variational Bayesian algorithm can then be described as:

Algorithm 1 Batch variational Bayes
1:  λ = random value
2:  while relative improvement in ELBO > threshold do
3:    for d = 1 to D do
4:      γ_{d,k} = 1 (arbitrary)
5:      while (1/K) Σ_k |change in γ_{d,k}| ≥ threshold do
6:        φ_{d,w,k} ∝ exp{ E_q[log θ_{d,k}] + E_q[log ϕ_{k,w}] }
7:        γ_{d,k} = α + Σ_w n_{d,w} φ_{d,w,k}
8:      end while
9:    end for
10:   λ_{k,w} = β + Σ_d n_{d,w} φ_{d,w,k}
11:  end while

The updates of the variational parameters will eventually converge to a stationary point of the ELBO[22]. Comparing this to the Expectation-Maximization (EM) algorithm[32], we could define the E step as the updates performed on lines 6 and 7 of the algorithm, and the M step as the update on line 10. Each succession of an E and an M step (later referred to as an iteration) improves the ELBO.
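To make the updates of Equations 2.5 and 2.6 concrete, here is a minimal NumPy/SciPy sketch of the E step for a single document; the function and variable names are ours, and production implementations such as gensim add many refinements omitted here.

```python
import numpy as np
from scipy.special import digamma

def e_step(n_dw, alpha, lam, max_inner=100, tol=1e-4):
    """Variational E step for one document (lines 4-8 of Algorithm 1).

    n_dw : (W,) word counts of the document
    lam  : (K, W) variational parameters lambda of the corpus topics
    """
    K, W = lam.shape
    Elog_phi = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))  # E_q[log phi_kw]
    gamma = np.ones(K)                                                 # line 4: arbitrary init
    for _ in range(max_inner):
        Elog_theta = digamma(gamma) - digamma(gamma.sum())             # E_q[log theta_dk]
        log_varphi = Elog_theta[:, None] + Elog_phi                    # line 6, unnormalised
        log_varphi -= log_varphi.max(axis=0)                           # numerical stability
        varphi = np.exp(log_varphi)
        varphi /= varphi.sum(axis=0)                                   # normalise over topics
        new_gamma = alpha + varphi @ n_dw                              # line 7
        if np.abs(new_gamma - gamma).mean() < tol:                     # line 5 stopping rule
            gamma = new_gamma
            break
        gamma = new_gamma
    return gamma, varphi

# The M step (line 10) then aggregates over documents:
# lam[k, w] = beta + sum over documents d of n_dw[w] * varphi_d[k, w]
```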

However, this algorithm requires a full pass through the whole corpus for each iteration. In the case of large data sets or a streaming input, this demands a lot of computation. This is the reason why an online version has been proposed[22], which converges much faster in these scenarios. First, let us introduce the factorized version of the ELBO defined in Equation 2.4:

$$\mathcal{L}(w, \varphi, \gamma, \lambda) = \sum_d \Big\{ E_q[\log p(w_d \mid \theta_d, z_d, \phi)] + E_q[\log p(z_d \mid \theta_d)] - E_q[\log q(z_d)] + E_q[\log p(\theta_d \mid \alpha)] - E_q[\log q(\theta_d)] + \big( E_q[\log p(\phi \mid \beta)] - E_q[\log q(\phi)] \big)/D \Big\} \qquad (2.7)$$

The last term of the sum (the contribution of the global variable $\phi$, divided by $D$) allows us to optimize the ELBO with respect to $\varphi$ and $\gamma$ for each document individually, which is necessary for the algorithm. Let $\gamma(n_d, \lambda)$ and $\varphi(n_d, \lambda)$ be the values of $\gamma$ and $\varphi$ produced by one E step. We are looking for a value of $\lambda$ that maximizes

$$\mathcal{L}(n, \lambda) = \sum_d \ell\big(n_d, \gamma(n_d, \lambda), \varphi(n_d, \lambda), \lambda\big) \qquad (2.8)$$

where the terms we sum are the individual contributions of the documents to the ELBO. We then compute $\tilde{\lambda}$, the setting of $\lambda$ that would be optimal, given $\varphi_t$, if the corpus were composed only of the document $n_t$ repeated $D$ times ($D$ being the number of documents in the corpus). For each iteration (each new document), we update $\lambda$ using a combination of its previous value and $\tilde{\lambda}$, setting its new value to $\lambda = (1 - \rho_t)\lambda + \rho_t \tilde{\lambda}$. The weight $\rho_t$ associated with the new value has two parameters: $\tau_0$ slows down the first iterations, and $\kappa$ defines the rate at which old values of $\tilde{\lambda}$ are forgotten. The algorithm terminates when the whole corpus has been processed.

Algorithm 2 Online variational Bayes
1:  ρ_t = (τ_0 + t)^(−κ), with κ ∈ (0.5, 1] and τ_0 ≥ 0
2:  λ = random value
3:  for t = 0 to ∞ do
4:    γ_{t,k} = 1 (arbitrary)
5:    while (1/K) Σ_k |change in γ_{t,k}| ≥ threshold do
6:      φ_{t,w,k} ∝ exp{ E_q[log θ_{t,k}] + E_q[log ϕ_{k,w}] }
7:      γ_{t,k} = α + Σ_w n_{t,w} φ_{t,w,k}
8:    end while
9:    λ = (1 − ρ_t) λ + ρ_t λ̃   (with λ̃ computed from φ_t as described above)
10: end for

Latent Semantic Analysis with Non-negative Matrix Factorization

Latent Semantic Analysis (LSA) is a natural language processing technique based on the assumption that words that are close in meaning should occur in similar pieces of text. It uses a matrix containing word counts per unit (a unit can be a paragraph, a chapter, a document...), which is then reduced by singular value decomposition, to compute word similarities. The singular value decomposition involves three matrices: a matrix of word vectors, a diagonal matrix with singular values and a matrix with document vectors. This process is graphically depicted in Figure 2.4.

The topic model can also be approached as a matrix factorization. In fact, like LSA, topic models find a representation of the content of a document corpus with a lower dimensionality than the original. The word-document co-occurrence matrix mentioned above is split into two parts: Θ, the document matrix, and Φ, the topic matrix. To make the similarity between both processes even more apparent, it is worth noting that the diagonal matrix with singular values D in LSA can be absorbed into either of its surrounding matrices.

However, the two approaches still differ in that, among other things, in topic models the vectors representing the words and the documents contain probabilities rather than counts, implying that the values are non-negative and sum to one. Moreover, the LDA model adds a priori constraints on the word and topic distributions. These two constraints are not present in the general LSA approach.

Figure 2.4: Comparison of the matrix decompositions of LSA and the topic model

In non-negative matrix factorization (NMF), however, the non-negativity constraints are met again. NMF consists of a basic matrix factorization with non-negativity constraints, thus always giving non-negative outputs, just like those of probabilistic methods. That is why it does not suffer from the interpretation difficulties of LSA, where the mixtures of elements have both positive and negative factors.

NMF inherently has a clustering property, which is what will let us look for topics in the data as topics are just clusters of specific words. It is worth noting that it is a soft clustering method, meaning that a document can belong to multiple clusters, compared to the k-means method for example where one item can only belong to one cluster (hard clustering). This is another valuable feature of NMF in the context of comparing it to the LDA approach, because LDA relies on the assumption that a document is generated from a mixture of topics, not only one, and that words can belong to multiple topics.

Metrics

Perplexity

In topic models, performing unbiased evaluation of different models with different topic dimensions or different variants is problematic, because a document usually encompasses a large number of latent variables, making the exact computation intractable[11][44]. A classic approach to evaluating such a model is to calculate the likelihood, or equivalently the perplexity, of a held-out set of data, the test set. The perplexity measures how well the model can predict a sample. It is defined[22] as the geometric mean of the inverse marginal probability of each word in the held-out set:

$$\mathrm{perplexity}(n^{test}, \lambda, \alpha) = \exp\left\{ -\frac{\sum_i \log p(n_i^{test} \mid \alpha, \phi)}{\sum_{i,w} n_{i,w}^{test}} \right\} \qquad (2.9)$$

with $n_i^{test}$ the vector of word counts for document $i$, $n_{i,w}^{test}$ the count of word $w$ in the $i$-th document, and $\lambda$ our variational approximation of the topics $\phi$. As we cannot compute $\log p(n_i^{test} \mid \alpha, \phi)$ directly, the ELBO serves here as an upper-bound proxy on the perplexity[neurvarinf]:

$$\mathrm{perplexity}(n^{test}, \lambda, \alpha) \le \exp\left\{ -\Big( \sum_i E_q[\log p(n_i^{test}, \theta_i, z_i \mid \alpha, \phi)] - E_q[\log q(\theta_i, z_i)] \Big) \Big/ \Big( \sum_{i,w} n_{i,w}^{test} \Big) \right\} \qquad (2.10)$$

However, Chang et al.[12] have shown that predictive likelihood and human judgement are often not correlated, and can even be anti-correlated. As the topics we obtain from the models are not automatically labelled, we would like a human to be able to scrutinize them and determine what they deal with, so as to give them an arbitrary label. That is where coherence measures step in.

Topic coherence

A coherence measure estimates how well the elements of a set hang and fit together. Coherence measures reflect human understandability and interpretability better than measures based on the evaluation of topic model distributions. Based on the work by Röder et al.[34], in our context of topic modelling we can view such a measure as a pipeline of four steps, each having its own variants; their different combinations thus create a large diversity of measures. These four steps are as follows:

• Segmentation of word subsets

• Confirmation measure

• Probability estimation

• Aggregation

Coherence of a set of words is about measuring how well single words, or subsets of them, hang and fit together. The segmentation step is simply a choice of which subsets will be compared: pairs of single words, specific subsets with single words, or specific subsets with other specific subsets? Let us define $W$ as the vocabulary of a specific topic. Some example segmentations are the following:

$$S^{one}_{one} = \{(W', W^*) \mid W' = \{w_i\};\ W^* = \{w_j\};\ w_i, w_j \in W;\ i \neq j\}$$
$$S^{one}_{pre} = \{(W', W^*) \mid W' = \{w_i\};\ W^* = \{w_j\};\ w_i, w_j \in W;\ i > j\}$$
$$S^{one}_{set} = \{(W', W^*) \mid W' = \{w_i\};\ w_i \in W;\ W^* = W\}$$

Once you have defined what to compare, you need to define a confirmation measure to compare these pairs of sets $S_i = (W', W^*)$. The confirmation measure scores the direct agreement of a pair of words or word subsets, i.e. how strongly they support each other. Confirmation measures are usually classified into direct and indirect ones. A direct confirmation measure directly computes the confirmation of a single pair, whereas an indirect confirmation measure assumes that, given $w \in W$, the direct confirmations of the words in one subset with respect to this given word are close to those of the words in the other subset. Examples of direct confirmation measures are the following:

$$m_{lc}(S_i) = \log \frac{P(W', W^*) + \epsilon}{P(W^*)}, \qquad m_{lr}(S_i) = \log \frac{P(W', W^*) + \epsilon}{P(W') \cdot P(W^*)}, \qquad m_{nlr}(S_i) = \frac{m_{lr}(S_i)}{-\log\big(P(W', W^*) + \epsilon\big)} \qquad (2.11)$$

$m_{lc}$ is called the log-conditional-probability measure, $m_{lr}$ the log-ratio measure and $m_{nlr}$ the normalized log-ratio measure. A small constant $\epsilon$ is added only to avoid the logarithm of zero or division by zero.

Let us illustrate the advantage of the indirect approach with an example. Let $w_1$ be a word that semantically supports $w_2$, but with a low joint probability because the two words do not appear together, for instance a British spelling of a word and its American equivalent, such as "modelling" and "modeling". They may never appear together because one document is usually written in one single variant of the language. However, they will appear in different documents surrounded by the same words. Their direct confirmation measure will be low, but their indirect confirmation measure will be high.

Let us formally define how to calculate an indirect confirmation measure of $S_i = (W', W^*)$. Whatever their cardinality, we represent the word sets $W'$ and $W^*$ by vectors whose dimension is the cardinality of the word set $W$. The vector of any word set $W'$, for any direct confirmation measure $m$, is given by:

$$\vec{v}_{m,\gamma}(W') = \left\{ \sum_{w_i \in W'} m(w_i, w_j)^{\gamma} \right\}_{j=1,\ldots,|W|} \qquad (2.12)$$

For a pair $S_i = (W', W^*)$, the indirect confirmation measure is computed as a vector similarity between the context vectors $\vec{u} = \vec{v}(W')$ and $\vec{w} = \vec{v}(W^*)$. The parameter $\gamma$ is used to adjust the weight given to higher values of the direct confirmation measure. Any vector similarity measure can be used, for example the cosine:

$$s_{cos}(\vec{u}, \vec{w}) = \frac{\sum_{i=1}^{|W|} u_i \cdot w_i}{\|\vec{u}\|_2 \cdot \|\vec{w}\|_2} \qquad (2.13)$$

Hence, finally, for a given similarity measure $sim$, a direct confirmation measure $m$ and a value for $\gamma$, an indirect confirmation measure $\tilde{m}$ is

$$\tilde{m}_{sim(m,\gamma)}(W', W^*) = s_{sim}\big(\vec{v}_{m,\gamma}(W'), \vec{v}_{m,\gamma}(W^*)\big) \qquad (2.14)$$

The confirmation measure, either direct or indirect, is based on a probability estimation of the co-occurrence of the words under consideration. The simplest method to calculate those probabilities is the boolean document method ($P_{bd}$), where the probability of a single word is simply the number of documents in which it occurs divided by the total number of documents. Straightforward variants of this are the boolean paragraph ($P_{bp}$) and the boolean sentence ($P_{bs}$), which have the exact same definition except that the document is replaced by a paragraph or a sentence. Another probability estimation method is the boolean sliding window ($P_{sw}$), in which the boolean document method is applied to a derived dataset created by sliding a window of a specific size over the documents, one token at a step, and considering the window content at each step as a new document.

Finally, in order for a human to quickly get a good overview of the coherence, all confirmation scores of all subset pairs $S_i$ need to be aggregated into a single coherence score. As before, any method can be chosen, but in the literature the arithmetic mean ($\sigma_a$) and the median ($\sigma_m$) are frequent options.
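As a small worked example of this pipeline (boolean document probabilities $P_{bd}$, segmentation $S^{one}_{pre}$, the $m_{lc}$ measure and arithmetic-mean aggregation), the sketch below computes a UMass-style coherence for one topic over a toy corpus; the documents and topic words are invented for illustration.

```python
import numpy as np

# Toy corpus: each document reduced to its set of tokens (for boolean document probabilities)
docs = [
    {"login", "password", "session", "token"},
    {"password", "hash", "salt"},
    {"render", "template", "view"},
    {"session", "token", "login"},
]
topic = ["login", "password", "session"]   # ranked top words of one invented topic
eps, D = 1e-12, len(docs)

def p(*words):
    """P_bd: fraction of documents containing all the given words."""
    return sum(all(w in d for w in words) for d in docs) / D

# Segmentation S_pre^one: pair each word with every word ranked before it (i > j),
# score each pair with m_lc = log((P(w_i, w_j) + eps) / P(w_j)), aggregate with sigma_a.
pairs = [(i, j) for i in range(len(topic)) for j in range(i)]
scores = [np.log((p(topic[i], topic[j]) + eps) / p(topic[j])) for i, j in pairs]
print("UMass-style coherence:", float(np.mean(scores)))
```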

3 Method

3.1 Pre-study

The pre-study consisted of a literature review in the subject area and an analysis of the available resources. The review explored the Mining Software Repositories (MSR)[20][13] field in the scope of the developer characterisation goal. A lot of research has been done in this field on the cognitive aspects of the development process[16], the personality of developers[37] and how they collaborate with each other[7][25]. A large number of researchers have also worked on determining software quality[1][31] and defect-proneness[14][5] from source code. We tried to understand what data could be mined, how it could be extracted, processed and mined, and what it could be used for. The most promising approach given our goal of characterising a developer's experience appeared to be topic modeling, hence our selection of this method. We then refined the literature review to focus more on topic modeling and its specificities, also looking for more concrete information such as parameter tuning[46][35], metrics[34][30][4], algorithmic complexity, etc. Finally, we searched for available implementations of the selected algorithms in Python, in order to work efficiently by building on existing components.

3.2 Data

Data acquisition

The data was acquired from the popular hosting service GitHub. The reasons we decided to work on open source data were multiple. The first, straightforward, one is that it was the easiest type of data to get, which also allows for good replicability of the experiment. Besides, we were looking for good quality data, which with regard to our goal means well-written code, i.e. code applying the naming conventions and clearly expressing the purpose of its different parts through the naming of entities. As the genesis of open source code is to be shared, and often to let different programmers collaborate on the same project, we had the intuition that this category of code was more likely to be carefully developed, as it would potentially be read and used by fellow developers. However, studies on this subject report that the quality and dependability of today's open source software is roughly on par with commercial and government-developed software[19]. Nonetheless, open source data is still more accessible, so given that it is of equal quality, it still gets our preference.

We built a shell crawler to retrieve and extract code from GitHub. The crawler was designed to work with the following inputs:

• an origin repository

• a maximum number of repositories to extract per user

• a size for the context window

The crawler would then retrieve the list of all the collaborators of the origin repository. It would clone repositories from all those collaborators randomly (up to the number specified as the second argument), as well as the origin repository. Next, the crawler would retrieve all the commits of the master branch of all the cloned repositories, using the git diff command on all of them with the provided context window size. These outputs would be split up file by file, and only the modifications to files with specific extensions, mostly those considered as "code", would be kept (cf. list in appendix 6). Finally, the results would be stored in a CSV file, with each line containing one diff for one file in one commit by one user of a particular repository.
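The crawler itself is not reproduced in this document; the following Python sketch only illustrates the kind of git invocations involved, and the repository path, context window and output format are assumptions made for the example.

```python
import csv
import subprocess

def diffs_for_repo(repo_path, context=3):
    """Yield one diff per (commit, file) for the history of a cloned repository."""
    commits = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for commit in commits:
        # Changes introduced by this commit, with the requested context window
        diff = subprocess.run(
            ["git", "-C", repo_path, "show", f"--unified={context}", "--format=", commit],
            capture_output=True, text=True, check=True,
        ).stdout
        # Split the output file by file; each per-file chunk starts with "diff --git"
        for chunk in diff.split("diff --git")[1:]:
            yield commit, "diff --git" + chunk

with open("diffs.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for commit, file_diff in diffs_for_repo("path/to/cloned/repo"):
        writer.writerow([commit, file_diff])
```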

In our case, a rather large open source project repository was selected: Ruby-on-Rails. We chose it mainly because of the number of its contributors. We wanted a number such that we could have a good diversity of contributors (potentially implying a good diversity of project fields), while still being able to characterise each of these contributors well, i.e. fetching a reasonable number of repositories from each of them (30 in this case), without getting too much data for the available resources. At that time, it had 3586 contributors and 69199 commits, which meant no more than 3586 × 30 = 107580 repositories. We defined a context window of 3 lines.

Data characteristics and preprocessing

It seemed rather logical to assume that most of the contributors of the Ruby-on-Rails repository would have used this framework for their personal projects, or would have contributed to other Ruby open source projects. That is probably the reason why the most frequent language in the retrieved code is Ruby, as displayed in Figure 3.1. It is also a decent explanation for JavaScript being the second most frequent language, as it is very often used together with Ruby-on-Rails, both being intended for web development. Based on its predominance in the dataset, we chose to focus on Ruby code and to filter out the other programming languages.

Figure 3.1: Number of diffs for the ten most occurring programming languages (rb, js, md, py, html, java, json, yml, go, c).

We then removed the meta information output by the git diff command, e.g. the number of lines edited. As this command shows the difference between two successive versions of the code, it displays both the previous and the latest version of the modified lines (prefixed by "-" and "+" respectively, as can be seen in Figure 3.2). We removed the lines prefixed by "+", in order not to double the weight of the modified lines compared to the context.

Figure 3.2: Example output of the git diff command on one file in one commit, with a context window of 3 lines.

The next step was to apply some common NLP transformations to the data, combined with ad-hoc transformations for code data:

• tokenisation, demarcating and possibly classifying sections of a string of input characters

• splitting, breaking composed variables or method names into shorter meaningful units

• lemmatisation, grouping together the inflected forms of a word so they can be analysed as a single item

Figure 3.3: Data preprocessing pipeline (output of git diff → keep a single programming language → remove everything but code → remove the new version of the code → tokenise → split names → lemmatise → preprocessed data).

The tokens were defined as sequences of at least 3 characters included in {[0-9] ∪ [a-z] ∪ [A-Z] ∪ {_}}. The splitting was necessary in this project because, in source code, if one wants to combine words to gain precision about the intended purpose of a named entity (e.g. a variable or a method), these words cannot be separated by spaces as they would be in natural language. They are usually separated by underscores or letter-case changes. The splitting thus identified the delimiter-separated words and the letter-case-separated words. Finally, the lemmatizer tried to attach a part-of-speech (POS)³ tag to each token, in order to use this information to better determine which form of which word it was considering, so as to reduce it to its stem if possible. The tagger⁴ was pre-trained, not specific to the dataset, and the POS considered were adjective, verb, noun and adverb.

³ A part of speech is a category of words which have a similar grammar.
⁴ https://www.nltk.org/book/ch05.html

In order to remove words which do not provide information about the purpose of the code, we decided to remove the programming language's specific keywords, as well as English stop words, following the recommendations of Sun et al.[40]. To this end, we had to keep only one programming language in the dataset, because removing keywords of programming language B from pieces of code written in programming language A might remove some relevant information. For example, "volatile" is a keyword in Java but not in Ruby, and it can be informative if used with another meaning in Ruby, such as referring to a bird.

These removals led to a number of very short data items: at that point, a little more than 9% of the dataset was composed of texts shorter than 10 tokens. This can be noticed in Figure 3.4, with the first bin of the histogram, corresponding to texts shorter than 4 tokens, amounting to 6%.
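A minimal sketch of these preprocessing steps (tokenisation, identifier splitting and POS-aware lemmatisation) could look as follows; it relies on NLTK's pre-trained tagger and WordNet lemmatizer, but the regular expressions and the POS mapping are our assumptions rather than the exact thesis code.

```python
import re
import nltk                                   # requires the 'averaged_perceptron_tagger' and 'wordnet' data
from nltk.stem import WordNetLemmatizer

TOKEN_RE = re.compile(r"[0-9A-Za-z_]{3,}")    # tokens: at least 3 characters in [0-9a-zA-Z_]
CAMEL_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+")
POS_MAP = {"J": "a", "V": "v", "N": "n", "R": "r"}   # adjective, verb, noun, adverb

lemmatizer = WordNetLemmatizer()

def split_identifier(token):
    """Break snake_case and camelCase identifiers into shorter units."""
    parts = []
    for piece in token.split("_"):
        parts.extend(CAMEL_RE.findall(piece))
    return [p.lower() for p in parts if p]

def preprocess(diff_text):
    tokens = []
    for raw in TOKEN_RE.findall(diff_text):
        tokens.extend(split_identifier(raw))
    tagged = nltk.pos_tag(tokens)             # pre-trained POS tagger
    return [lemmatizer.lemmatize(tok, POS_MAP.get(tag[0], "n")) for tok, tag in tagged]

print(preprocess("def create_user_session(auth_token):  # validates the AuthToken"))
```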

Figure 3.4: Length distribution of the data items after preprocessing (diff length in tokens vs. normed frequency; average diff length: 300 tokens, median: 106 tokens).

We can also notice that the diff length can vary greatly from one project to another, as only 1 of the 20 largest repositories in terms of number of diffs is also present among the 20 largest repositories in terms of number of tokens. Furthermore, several repositories are present multiple times because the same project has been forked by different users.


Figure 3.5: Number of diffs for the twenty largest repositories in the dataset in terms of number of diffs.


Figure 3.6: Number of tokens for the twenty largest repositories in the dataset in terms of number of diffs.

3.3 Latent Dirichlet Allocation

The LDA algorithms have different parameters to tune in order to optimize the results. We will try to find the best configuration so as to achieve the best performance possible given the limited resources and time scope of the project. Some parameters would probably increase the quality of the results if they were set to other values, but in return for extended computation time or memory allocation. With regard to these parameters, the aim is to find a good balance, not simply to choose the values which would yield the best results.

The number of topics is the most straightforward parameter, and probably one of the most important. It defines the precision of the topics, i.e. how narrow their subjects are. It is very dependent on the data, and there is no general value as a function of the size of the dataset or of its vocabulary. It is usually set after examining the evolution of the perplexity for different values, and keeping the setting which yields the lowest perplexity.

The next parameters to come to mind when considering topic models are those of the Dirichlet priors, α and β. Wallach et al.[43] advocate the use of an asymmetric α and a symmetric β, while Steyvers et al.[39] recommend both parameters to be symmetric, with values α = 50/T and β = 0.01, where T is the number of topics. Considering the relative agreement between both works on the β value, we set it to a symmetric value of 0.01. Regarding α, the Python library gensim, the topic modelling framework that we use for training, allows us to instantiate a symmetric alpha of 1/T and to update it during the training iterations.

There are also different learning methods available for LDA, as described in the theory chapter. Considering the rather large amount of data we are training on, which is too large in our setup to be loaded into memory at once, we have no choice but to use online learning (Algorithm 2).

An optional argument, the random state, is present to ensure the reproducibility of the experiments. It is used to initialise the random state of the model, which is simply a set of parameters for the pseudo-random function used as randomizer. The same parameters will always be obtained when given the same random state value, hence we are able to get the same initial state every time, even if it is a state that we cannot determine in advance.

The κ and τ_0 parameters of Algorithm 2 are respectively called decay and offset in the code. Based on the results obtained by Hoffman et al.[22], we used training chunks of 4096 documents each, with κ = 0.5 and τ_0 = 64. Those were the settings that yielded the best results, in terms of perplexity, in their work on two different corpora.

One can also decide, instead of going through the whole dataset once as expected in the online VB algorithm (Algorithm 2), to simulate a dataset larger than its actual size by going through it several times, i.e. making the for loop on line 3 longer. This would potentially yield better results, as the available data would be artificially extended. For our implementation, we decided to make only one pass through the whole dataset.

Finally, the option to iterate through multiple documents instead of only one between every update step is available. It can be useful for reducing the computation time. We decided to update the values after every single document.

Topic models do not perform well on short texts in general[24][45]. We removed from the dataset all the documents shorter than 40 tokens (19.39% of the total). In addition, we only considered the tokens which occurred more than 10 times in the whole dataset (60.91%), and were in less than 50% of the documents (100%).
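A sketch of this setup with gensim is given below; the corpus-loading step is a placeholder, and the parameter names follow gensim's LdaModel API (alpha updated during training, eta for β, and chunksize, decay and offset for the online algorithm).

```python
from gensim import corpora
from gensim.models import LdaModel

# `documents` is assumed to be the list of token lists produced by the preprocessing pipeline
documents = load_preprocessed_diffs()                  # placeholder

dictionary = corpora.Dictionary(documents)
# Approximation of the filters above: tokens in at least 10 documents and at most 50% of them
dictionary.filter_extremes(no_below=10, no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in documents]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=50,          # one point of the 5..100 sweep
    alpha="auto",           # start from a symmetric prior and update it during training
    eta=0.01,               # symmetric beta
    chunksize=4096,         # online VB settings from Hoffman et al.
    decay=0.5,              # kappa
    offset=64,              # tau_0
    passes=1,               # a single pass through the dataset
    random_state=2019,
)

print(lda.show_topics(num_topics=5, num_words=10))
# Held-out evaluation: gensim reports a per-word bound; perplexity = 2 ** (-bound)
# bound = lda.log_perplexity(test_corpus)
```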

3.4 Non-negative Matrix Factorization

As for LDA, there are different ways of achieving a non-negative matrix factorization, depending on several parameters. First of all, the data we try to reconstruct is not the same as for LDA. We took the same dataset we applied LDA on (without the short documents), but transformed it into vectors of Term Frequency-Inverse Document Frequency (TF-IDF) [36] scores, an approach inspired by Pauca et al. [33] and Choo et al. [15].

The objective function to optimize is

$0.5\,\lVert X - WH \rVert_{Fro}^{2} + \alpha \cdot l1\_ratio \cdot \lVert \mathrm{vec}(W) \rVert_{1} + \alpha \cdot l1\_ratio \cdot \lVert \mathrm{vec}(H) \rVert_{1} + 0.5\,\alpha\,(1 - l1\_ratio)\,\lVert W \rVert_{Fro}^{2} + 0.5\,\alpha\,(1 - l1\_ratio)\,\lVert H \rVert_{Fro}^{2}$ (3.1)

The term l1_ratio ∈ [0, 1] is the regularization mixing parameter. Setting it to 0 gives an elementwise L2 (Frobenius norm) penalty, whereas setting it to 1 gives an elementwise L1 penalty; a value between 0 and 1 defines a combination of L1 and L2. We set it to 0.5. The α parameter corresponds to the weight given to the regularization term: setting it to 0 disables regularization altogether. We set it to 0.1 in order to obtain a low reconstruction error, following Figure 3.7.
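A minimal scikit-learn sketch of this setup could look as follows, assuming the preprocessed documents are available as strings in a placeholder variable documents. Note that recent scikit-learn releases replace the single regularization weight alpha with alpha_W and alpha_H (and scale the penalties by the matrix dimensions), so the exact numerical behaviour may differ slightly from Equation 3.1.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# documents: preprocessed code documents as whitespace-separated token strings (placeholder)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # TF-IDF document-term matrix

T = 50  # example value; in practice swept from 5 to 100 topics
nmf = NMF(
    n_components=T,
    alpha_W=0.1,     # regularization weight (a single alpha parameter in older releases)
    alpha_H=0.1,
    l1_ratio=0.5,    # equal mix of L1 and L2 penalties
    max_iter=300,
    random_state=0,  # reproducibility
)
W = nmf.fit_transform(X)  # document-topic matrix
H = nmf.components_       # topic-term matrix
print(nmf.reconstruction_err_)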

Figure 3.7: Reconstruction error of NMF and number of iterations to converge as a function of the α value.

3.5 Metrics

The metrics we use to evaluate the results of this thesis are perplexity for LDA, reconstruction error for NMF, and topic coherence for both. Perplexity is only applicable to LDA, while reconstruction error is only applicable to NMF. They reflect the same idea of accuracy: how well the models represent the data they were trained on. However, they cannot be compared with each other, as LDA is a probabilistic method while NMF is simply a matrix factorization.

As numerous ways to calculate topic coherence exist (cf. Section 2.2), we select the following four:

$C_{UMass} = (P_{bd},\, S_{one}^{pre},\, m_{lc},\, \sigma_{a})$ (3.2)

$C_{UCI} = (P_{sw(10)},\, S_{one}^{pre},\, m_{lr},\, \sigma_{a})$ (3.3)

$C_{NPMI} = (P_{sw(10)},\, S_{one}^{pre},\, m_{nlr},\, \sigma_{a})$ (3.4)

$C_{V} = (P_{sw(110)},\, S_{one}^{set},\, \tilde{m}_{cos(nlr),1},\, \sigma_{a})$ (3.5)

It is worth noting that $C_{UCI}$ and $C_{NPMI}$ are very similar to one another, the only difference being that $C_{NPMI}$ is a normalised version of $C_{UCI}$, through its confirmation measure $m_{nlr}$ instead of $m_{lr}$. They differ from $C_{UMass}$ in their probability estimation (cf. Section 2.2), which is simply the Boolean document estimation in $C_{UMass}$ whereas the two others introduce a sliding window, but also in their confirmation measure, because the one of $C_{UMass}$ has no consideration for the overall probability of one of the two subsets (cf. Section 2.11). Finally, the coherence $C_V$ is very different from all the others. It uses a larger sliding window of 110 tokens, a different segmentation (cf. Section 2.2) and an indirect confirmation measure applying the cosine to the respective vectors of $m_{nlr}$ scores of the two word subsets considered. This last measure appeared to be the most correlated with human understanding in the study by Chang et al. [12].
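These measures can be computed with gensim's CoherenceModel. The sketch below reuses the placeholder objects from the earlier sketches (docs, dictionary, corpus, lda, vectorizer, H); for NMF the top words of each topic are first extracted from H, and the sketch assumes those words are also present in the gensim dictionary.

from gensim.models import CoherenceModel

topn = 10  # also evaluated with 20, 50 and 100 top words
feature_names = vectorizer.get_feature_names_out()
# Top words of each NMF topic, taken from the highest-weighted terms of each row of H
nmf_topics = [[feature_names[i] for i in row.argsort()[::-1][:topn]] for row in H]

for measure in ('u_mass', 'c_uci', 'c_npmi', 'c_v'):
    cm = CoherenceModel(
        topics=nmf_topics,     # or model=lda to evaluate the LDA topics instead
        texts=docs,            # tokenised documents, used by the sliding-window measures
        corpus=corpus,         # bag-of-words corpus, used by u_mass
        dictionary=dictionary,
        coherence=measure,
        topn=topn,
    )
    print(measure, cm.get_coherence())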

4 Results

This chapter presents the results obtained after training LDA and applying NMF on 80% of the dataset introduced in Section 3.2, the remainder being held out for test purposes. The split into train and test sets is made randomly among the documents, so that both sets contain documents from a majority of the original repositories. The results are evaluated with the different metrics described in Section 2.2. Several variants of the coherence measure are computed, and both models are trained with numbers of topics ranging from 5 to 100, both included, with a step of 5. We aim at minimising the perplexity and the reconstruction error, and at maximising the coherence.

4.1 Perplexity and reconstruction error

These two metrics allow us to analyse how well the different models represent the data they were trained on. Given that no similar work is available to benchmark the results against, we mostly consider the evolution of the values as a function of the number of topics rather than their absolute values.

As the exact perplexity is not computable directly (cf. Section 2.2), the results depicted in Figure 4.1 represent the value of the ELBO, which serves as a proxy, as explained in Section 2.4. This means that those values majorise the true perplexity value for each setting.
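For reference, this proxy can be obtained in gensim from the variational bound on held-out documents; the sketch below assumes the lda model and dictionary from Chapter 3 and a placeholder list test_docs of tokenised held-out documents.

import numpy as np

test_corpus = [dictionary.doc2bow(d) for d in test_docs]  # held-out documents in bag-of-words form

# log_perplexity returns the per-word ELBO, a lower bound on the log-likelihood,
# so exponentiating its negative gives an upper bound on the true perplexity.
bound = lda.log_perplexity(test_corpus)
perplexity = np.exp(-bound)
print(perplexity)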

One can notice that the values for both models decrease as the number of topics increases. However, the reconstruction error decreases rather linearly, while the perplexity decreases more exponentially.

4.2 Coherence

In this section, we will describe the results obtained for both models on the four selected coherence measures with respectively 10, 20, 50 and 100 top words kept for each topic.

The values obtained for the coherence measure $C_{UCI}$ defined in Equation 3.3 are depicted in Figure 4.3. All the curves have the same shape regardless of the number of top words kept. The values are also very similar, but we can note that the range of values shifts to slightly higher values as the number of top words increases. The LDA model always outperforms NMF with regards to this metric, apart from the settings with only 5 topics and fewer than 100 top words.

Figure 4.1: Evolution of perplexity as a function of the number of topics.

Figure 4.2: Evolution of the reconstruction error as a function of the number of topics.


Figure 4.3: $C_{UCI}$ coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.

In all four cases, the LDA values are always increasing while the NMF values remain almost stable.

Figure 4.4: $C_{NPMI}$ coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.


The results obtained for the coherence measure $C_{NPMI}$ defined in Equation 3.4 are depicted in Figure 4.4. All the observations made on the $C_{UCI}$ values are valid for $C_{NPMI}$ as well, which is expected since the latter is simply a normalised version of the former. Hence, only the range of values changed, from roughly [−9; −2] previously to [−0.325; −0.10] now, with the same shift following the increase in the number of top words as seen with $C_{UCI}$.


Figure 4.5: $C_V$ coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.

The results obtained for the coherence measure $C_V$ defined in Equation 3.5 are depicted in Figure 4.5. Once again, the range of values shifts as the number of top words increases, but this time more significantly: the lowest value of the first graph, 0.350, is less than half of the lowest value of the last graph, 0.80. The tendency of LDA to outperform NMF is confirmed here, apart from the setting with only 10 top words, where LDA starts better but ends weaker, with the two curves intersecting at 45 topics. The values for LDA and NMF are respectively decreasing and rather stable.

Finally, the results obtained for the coherence measure $C_{UMass}$ defined in Equation 3.2 are depicted in Figure 4.6. The LDA curves present a logarithmic shape: they start very high, drop very quickly and remain almost stable from 20 to 100 topics. The NMF values are once more rather stable, and this time globally higher than the LDA values. The range of values still shifts, but this time towards lower values, and not in the same way: the highest values change while the lowest barely do, which means the values concentrate towards the lower bound.

For all the different settings and coherence measures, we can notice that the curves get smoother as the number of top words increases.

In order to compare the measures more accurately, we now consider the normalised evolutions of the four coherence measures for all the settings.

The normalised evolutions of all coherence measures for LDA (Figure 4.8) show that, regardless of the number of top words, $C_V$ and $C_{UMass}$ are globally decreasing while $C_{NPMI}$ and $C_{UCI}$ are globally increasing with the number of topics.

With regards to NMF, however, the tendencies differ more from one setting to another and are less monotonic. They become more and more monotonic as the number of top words increases.



Figure 4.6: $C_{UMass}$ coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.


Figure 4.7: Normalised evolution of the NMF coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.

$C_V$ is ascending with the number of topics in the setting with 10 top words, descending with significant fluctuations with 20 top words, and descending more clearly and with fewer fluctuations as the number of top words increases further. $C_{UMass}$ is ascending for 10, 20 and 50 top words, but oscillates between rather dispersed values for 100 top words.



Figure 4.8: Normalised evolution of the LDA coherences as a function of the number of topics for (from left to right and top to bottom) 10, 20, 50 and 100 top words per topic.

Finally, $C_{UCI}$ and $C_{NPMI}$ do not show a clear tendency with 10 top words, but are ascending with 20, 50 and 100 top words, with again curves getting smoother as the number of top words increases.

To conclude, the LDA evolutions are rather monotonic, while the NMF ones fluctuate frequently and do not always exhibit a global tendency to ascend or descend. The $C_{NPMI}$ and $C_{UCI}$ curves are almost superimposed for NMF and completely superimposed for LDA.
