
Mapping Java Source Code to Architectural Concerns through Machine Learning

Alexander Florean and Laoa Jalal

Faculty of Health, Science and Technology Computer Science

C-level thesis, 15 hp

Supervisor: Sebastian Herold. Examiner: Mohammad Rajiullah. Date: 2021-05-31


Acknowledgments

We would like to extend our thanks to our supervisor at Karlstad University, Sebastian Herold, for both providing us the opportunity to do this study and the guidance to complete it.


Abstract

The explosive growth of software systems in both size and complexity has resulted in a recognised need for techniques to combat architectural degradation. Reflexion Modelling is a method commonly used for Software Architectural Consistency Checking (SACC). However, the steps needed to utilise the method involve manual mapping, which can become tedious depending on the system's size.

Recently, machine learning has been showing promising results, outperforming other approaches. However, neither a comparison of different classifiers nor a comprehensive investigation of how to best pre-process source code has yet been performed. This thesis compares different classifiers, relates their performance to the manual effort needed to train them, and examines how different pre-processing settings affect their accuracy. The study can be divided into two areas: pre-processing and how large the manual mapping should be to achieve satisfactory performance.

Across the three software systems used in this study, the overall best performing model, MaxEnt, achieved the following average results: accuracy 0.88, weighted precision 0.89 and weighted recall 0.88. SVM performed almost identically to MaxEnt.

Furthermore, the results show that Naive-Bayes, the algorithm used in recent related approaches, performs worse than SVM and MaxEnt.

The results showed that the pre-processing setting that extracts packages and libraries, together with the Bag-of-Words feature representation, had the best performance.

Furthermore, it was found that a manual mapping of a minimum of ten files per concern is needed for satisfactory performance. The research results represent a further step towards automating code-to-architecture mappings, as required in reflexion modelling and similar techniques.


Contents

Acknowledgments i

Abstract iii

Figures xii

Tables xiii

1 Introduction 1

1.1 Motivation . . . 2

1.2 Goals . . . 2

1.3 Summary of results . . . 3

1.4 Ethics . . . 3

1.5 Work split . . . 4

1.6 Dissertation layout . . . 4

2 Background 7

2.1 Software Architecture . . . 7

2.2 Machine Learning . . . 9

2.2.1 Supervised learning . . . 10

2.2.2 NLP & Text classification . . . 10

2.3 Classifiers . . . 11


2.3.1 Naive-Bayes . . . 12

2.3.2 Support Vector Machine . . . 13

2.3.3 Maximum Entropy . . . 14

2.4 Feature representation . . . 16

2.4.1 Bag-of-words . . . 16

2.4.2 Tf-idf . . . 17

2.5 Evaluation . . . 18

2.5.1 Cross-validation . . . 18

2.5.2 Accuracy . . . 19

2.5.3 Precision . . . 19

2.5.4 Recall . . . 20

2.5.5 F1-score . . . 20

2.5.6 Average & Weighted score . . . 20

2.6 Related Work . . . 21

3 Design 23

3.1 Research questions . . . 24

3.1.1 RQ:1 . . . 24

3.1.2 Sub-RQ:1 . . . 25

3.1.3 Sub-RQ:2 . . . 26

3.2 Subject systems & Classifiers . . . 27

3.2.1 Subject systems . . . 27

3.2.2 Classifiers . . . 28

3.3 Information Retrieval and Pre-processing . . . 28

3.3.1 What is relevant information to extract from the input? . . . 29

3.3.2 Pre-process settings . . . 29

3.4 Training . . . 31

3.4.1 Threshold . . . 31


3.4.2 Data-split: ratio . . . 31

3.4.3 Data-split: absolute size . . . 32

3.5 Experiments . . . 34

3.5.1 Finding pre-process settings . . . 34

3.5.2 Performance comparison: Feature-representations & pre-processing settings . . . 35

3.5.3 Experiment: Training size . . . 37

3.6 Repository . . . 39

4 Implementation 41

4.1 Tools and libraries . . . 42

4.1.1 Jupyter . . . 42

4.1.2 Scikit-learn . . . 43

4.1.3 Pandas . . . 43

4.1.4 Regex . . . 43

4.1.5 NLTK . . . 44

4.2 Data Collection . . . 44

4.3 Pre-processing . . . 46

4.3.1 Extraction options . . . 47

4.3.1.1 "c" . . . 47

4.3.1.2 "com". . . 48

4.3.1.3 "lib". . . 48

4.3.1.4 "pac". . . 48

4.3.1.5 "pm" . . . 48

4.3.1.6 "pv" . . . 49

4.3.1.7 "raw". . . 49

4.3.2 Parsing options . . . 49

4.3.2.1 "tow". . . 49


4.3.2.2 "jk" . . . 50

4.3.2.3 "lc" . . . 50

4.3.2.4 "nu" . . . 50

4.3.2.5 "sc" . . . 50

4.3.2.6 "scw". . . 51

4.3.2.7 "stem" . . . 51

4.3.2.8 "sw" . . . 51

4.3.3 Pre-process settings . . . 51

4.4 Training & Evaluation . . . 52

4.4.1 Classifiers . . . 52

4.4.2 Data-split . . . 53

4.4.2.1 ratio split: Stratified shuffle-split . . . 53

4.4.2.2 Absolute size . . . 53

4.4.2.3 threshold: minimum concern . . . 54

4.4.3 Feature representation . . . 55

4.5 Model evaluation . . . 56

5 Result 59

5.1 Sub-RQ:1 . . . 60

5.2 Sub-RQ:2 . . . 65

6 Discussion 71

6.1 Sub-RQ:1 . . . 71

6.2 Sub-RQ:2 . . . 73

6.3 RQ:1 . . . 74

6.4 Limitations & Hurdles . . . 75

6.5 Future Work . . . 76


7 Conclusion 77

References 78


List of Figures

2.1 Architectural overview . . . 8

2.2 Example of using text classification to detect spam . . . 11

2.3 Illustration of finding the hyperplane that creates the largest margin between the support vectors, with a linear kernel . . . 14

2.4 Shape of the logistic function . . . 15

2.5 Monte-carlo cross-validator . . . 19

3.1 Distribution between train and test files if training set size ratio is 0.1 . . 32

3.2 Distribution between train and test files when abs.size = 5 . . . 33

3.3 Single experiment run where different settings and feature representations are tested . . . 37

3.4 Structure of experiment when testing different training sizes . . . 38

4.1 Implementation overview . . . 42

4.2 Usage of regex on a java file . . . 44

4.3 The prepared folder structure for collecting mapped concerns . . . 45

4.4 Dataframe after gathering the dataset from the prepared folder structure from the JabRef system ’Main folder’ . . . 46

4.5 Jabref system . . . 54

4.6 Jabref system after filtering minority classes . . . 55


5.1 Average accuracy value of the feature representation methods over the three classifiers, with the pre-process setting s0 . . . 62

5.2 Number of files of each architectural concern, in respective subject systems . . . 63

5.3 The pre-process setting performances with Bag-of-Words, using the JabRef system . . . 63

5.4 The pre-process setting performances with Bag-of-Words, using the ProM system . . . 64

5.5 The pre-process setting performances with Bag-of-Words, using the TeamMates system . . . 64

5.6 Accuracy over train percentage from the train-test split, for respective subject systems and classifiers, with s0 and Bag-of-Words . . . 67

5.7 The average values of the performance metrics for the classifiers over the three subject systems, with increasing train percentage from the train-test split, with s0 and Bag-of-Words . . . 68

5.8 Accuracy over number of training files for respective subject systems and classifiers, with s0 and Bag-of-Words . . . 69

5.9 The average values of the performance metrics for the classifiers over the three subject systems, with the number of files used to train the classifiers, with s0 and Bag-of-Words . . . 70


List of Tables

2.1 Example of documents encoded, shown in matrix form . . . 16

2.2 Table when applying TF-IDF to table 2.1 . . . 17

2.3 Example of combining local metrics . . . 21

3.1 Subject systems . . . 27

3.2 Extraction options . . . 30

3.3 Parsing options . . . 31

3.4 The specific parsing settings that execute each extraction . . . 35

4.1 Demonstrating how we extract the "average" classifier from the list of classifiers . . . 57

5.1 Defines the pre-process settings, with respective extractions and the parsing options for each extraction. The numbers in the parsing cells states the order in which they are executed. The settings use the same color-code shown in Figures 5.3, 5.4, 5.5 . . . 61

5.2 The average value for the metric values over the subject systems, classifiers, and feature representation methods, with respective pre-process settings . . . 62

5.3 The average value of the performance metrics over the three subject systems, for each classifier with pre-process setting s0 and Bag-of-Words . . . 64


Chapter 1 Introduction

Today's software systems have grown substantially in size and sophistication compared to a few years back. As humans always strive for tools to automate and ease our everyday lives, the demand for new and more sophisticated software systems grows.

With time, a system starts to age, or with careless changes and updates it ends up with architectural erosion, which leads to an accumulation of technical debt [1, 2, 3].

It is common practice to utilise Software Architectural Consistency Checking (SACC) to counter architectural erosion. SACC checks whether the implemented architecture deviates from or conforms to the planned architecture [4]. Architectural recovery methods exist for either modifying or removing architectural erosion. Reflexion Modelling is a common architectural recovery method that extracts the implemented architecture and visualises the difference from the planned architecture [2].

The extraction of the implemented architecture is done manually, which becomes time-consuming with larger systems. The first step reflexion modelling uses to detect and visualise architectural inconsistencies in a system is to map elements of the source code to architectural concerns. The term architectural concern is used interchangeably with architectural module throughout the dissertation.

In the quest to minimise the manual work needed to use reflexion modelling, we are going to implement and test machine learning models that map source code to software architectural concerns. We limit this study to systems coded in Java.

1.1 Motivation

We believe that there is an insufficient amount of research regarding machine learning in this field of study. We have observed two papers [5, 6] which use Naive Bayes as a classification method. This motivates us to further explore the usage of different machine learning algorithms and see how well they perform alongside Naive Bayes for this particular task. Moreover, we extend this line of work by exploring different pre-processing settings and dataset sizes for the machine learning models.

1.2 Goals

The goals of the study are to implement and test machine learning models that map software architectural concerns in systems coded in Java. The study is guided by the following Research Questions (RQ).

RQ:1 How well does mapping software architectural concerns in Java source code through machine learning perform?

Sub-RQ:1 Which pre-processing results in the most satisfactory performance?

Sub-RQ:2 How large should a manual mapping be in order to train a satisfactorily performing classifier?

For motivation and further details see section 3.1.


1.3 Summary of results

In the first experiment, where different pre-processing settings were tested, the setting that resulted in the overall best performance was s0, which extracts packages and libraries from the input, together with Bag-of-Words. The second experiment, where we altered the size of the training data for each classifier, resulted in MaxEnt performing best when the training data was minimal, more specifically, five files per architectural concern.

When increasing the training size to more than 10% of the total subject system, MaxEnt and SVM consistently outperform Naive Bayes on average precision, recall, accuracy, and F1-score. It is hard to distinguish between MaxEnt and SVM; both models perform practically equally well when the training size is larger than five files per architectural concern. In conclusion, we believe machine learning is a very feasible approach to automating the mapping of architectural concerns in Java-coded systems. The overall best performance was achieved using MaxEnt, s0, Bag-of-Words, and a 10% train ratio, which results in the following metrics across the subject systems: accuracy 0.88, weighted precision 0.89 and weighted recall 0.88.

1.4 Ethics

With the recent explosive growth of Artificial Intelligence (AI), we give away more and more control over different aspects of daily life. In [7], the authors raise an interesting example regarding possible ethical issues in the usage of AI. In the illustration, they mention the possibility of a bank using a machine learning algorithm to recommend mortgage applications for approval, with the belief that the algorithm is blind to the race of the applicants. But contrary to the bank's belief, the statistics show that the bank's approval rate for applicants of certain races has been dropping.

They go on to say that the machine learning algorithm based its recommendations on the address information of applicants from predominantly poverty-stricken areas. Hence, it could conceivably be hypothesised that the more control we give to AI, the more likely it is that we miss important parameters that could cause ethical issues.

This brings us to the study, which does take us a step further, though a tiny one, in giving more control to AI; more specifically, a step closer to teaching AI to identify architectural concerns. However, it is a bit of a stretch to relate this work, which interprets program code as text, to machine learning processing data about human beings.

1.5 Work split

The study can roughly be divided into the following parts: preparation of data, pre-processing, training classifiers, and evaluation of said classifiers. For the implementation, Alexander Florean is responsible for the data preparation and pre-processing, whereas Laoa Jalal is responsible for the training and evaluation of the classifiers. The dissertation writing is divided according to said responsibilities. However, both authors are equally responsible for the analysis and conclusions drawn from the results.

1.6 Dissertation layout

Chapter 2 describes the theory and methods used in this study. It contains general information about software architecture, followed by the theory of machine learning, classifiers, feature representation, and evaluation techniques. This chapter should provide enough theoretical background for the rest of the dissertation.

Chapter 3 covers the design of the experiments. The research questions are motivated and detailed, followed by the motivation for the proposed classifiers. Furthermore, the reasoning behind what information is relevant to extract from source files is discussed.

The chapter ends by describing the techniques used to split the data and perform the experiments.

Chapter 4 describes how the design from chapter 3 is implemented. The different libraries and tools used throughout the experiment are described, as well as how the lab framework, in which the experiments for answering the research questions were performed, was created.

Chapter 5 presents the results gathered from the experiments.

Chapter 6 discusses the results gathered in chapter 5 and their validity, and hypothesises about why they appear as they do. Moreover, the problems that occurred throughout the project, the limitations, and propositions for possible future work are discussed.

Chapter 7 contains the concluding remarks of the study.


Chapter 2 Background

This chapter provides the context and background for the theory used in this study. Section 2.1 details the theory behind recovering architectural concerns, and section 2.2 gives an overview of what machine learning is and the area of machine learning that our study regards. Section 2.3 introduces the machine learning models and provides a brief overview of how they work. The methods by which the textual data is transformed into a numerical representation are described in section 2.4. The last section, 2.5, presents the evaluation metrics used in the study, describes what cross-validation is, and introduces the technique used in this study.

2.1 Software Architecture

The term software architecture is a tricky term to define, leading to numerous definitions. In [8], the author describes it as “. . . the structure of a system, which comprise software elements, the externally visible properties of those elements, and the relationships among them.”. Figure 2.1 depicts the elements in the system, starting from individual classes to a subsystem that uses the classes in collaboration to perform higher-level tasks. The subsystem can then interact with other subsystems, creating a new, higher-level element [9]. We mainly focus on the highest level of elements in the software system: the elements that describe the top-level architecture, which are referred to as architectural concerns in this study.

When developing a software system, the developer must have a clear understanding of the software architecture to achieve the desired quality attributes, which are usually more than one. Early on, it was argued that with time the software system starts to age, either due to changes to the domain the software exists in, or due to carelessly introduced changes to the system, which thus deteriorate the quality of the architectural attributes [1, 4].

Architectural erosion is a term that describes violations against the architectural design [2, 10]. In order to combat architectural erosion, practitioners utilize Software Architecture Consistency Checking (SACC), also known as Software Architecture Conformance Checking. SACC checks whether the implemented architecture deviates from or conforms to the planned architecture. The implemented architecture refers to the architecture extracted from the source code. The planned architecture is the intended architecture, which is the outcome of a design process [4].

Figure 2.1: Architectural overview

Common methods of repairing architectural erosion involve the utilization of architectural recovery methods, which are used to extract the implemented architectures [2]. A recurring technique for recovering the implemented architecture is Reflexion Modelling. Reflexion Modelling compares the implemented architecture against the planned architecture and visualizes the result. It achieves this by first defining the software architectural concerns (also known as software architectural modules) of the software system. Next, it maps units of source code to the architectural concerns, and lastly, it automatically computes the intended and implemented dependencies of the architectural concerns. The implementation is said to conform to its design if the dependencies of the architectural concerns match the dependencies of the source code entities. The software is said to violate its architecture if the dependencies do not match [11].

2.2 Machine Learning

Machine Learning (ML) is a subset of Artificial Intelligence that focuses on learning through experience to improve performance. The experience refers to the past information used for training the learner [12]. The data is crucial when training a machine learning algorithm, where the quality and size of the data are important aspects for the success of a model; model refers to a trained machine learning algorithm. This makes machine learning inherently related to data analysis and statistics, since the learning of an algorithm is deeply dependent on the data used [13].

When training a model, different scenarios exist, which differ in the type of training data the model is trained with, the order and method in which training data is received, and how the test data is used when evaluating the learning algorithm [13]. The scenario used in this thesis is supervised learning.

2.2.1 Supervised learning

Supervised learning considers scenarios in which the learning algorithm relies on learning from a dataset with a label mapped to each example.

Supervised learning can be applied to many fields, such as text or document classification, Natural Language Processing (NLP), speech processing, computer vision, and many more. In each of the mentioned subjects, the approach to solving the problem may vary. When approaching a supervised learning problem, we can further abstract the problem into two categories: classification tasks and regression tasks.

Classification refers to when a learner takes an input and returns an output of categorical nature, for example, when identifying an image and deciding whether the image portrays a cat or a dog [12]. Here, the image is the input, and cat and dog are the categories to which the image can map. In regression, the learner takes an input and predicts an output of continuous nature, that is, values which can be defined with a scalar value [14]. Regression could, for example, be applied in stock prediction problems [13]. In this study, we focus on applying machine learning to text classification problems, where we train a classifier with textual data, in our case source code, to predict which architectural concern (label) a source file belongs to.

2.2.2 NLP & Text classification

NLP is a subset of artificial intelligence that strives to achieve human-like language processing. Human-like language refers to written and/or orally defined text; thus the only requirement is that the data must be understood by humans [15].

Text classification is one of the most important tasks where supervised learning is applied alongside NLP. This ranges from tasks such as e-mail spam detection, anomaly detection, sentiment analysis, document classification and many more [16], see figure 2.2. More formally, text classification is the problem of classifying a document d_i from the set D, with predefined labels L = {l_1, l_2, ..., l_n}, so that each document d_i maps to one or more labels l_j [17]. In our study, the document d_i only maps to a single label, in our case, source file -> concern.

Figure 2.2: Example of using text classification to detect spam

When classifying text data, it is vital to find parsing rules that help distinguish each document with respect to its label. Instead of discussing all NLP techniques in this section, we thoroughly describe the NLP methods used to parse the source files in section 3.3. In the upcoming section 2.4, another important concept in NLP, namely feature extraction, is discussed.

2.3 Classifiers

A classifier can be described in mathematical terms as a function f which takes an input of independent variables <x_1, x_2, ..., x_n>, often referred to as features, and returns a label output y. In classification problems, the domain of y is discrete; that is, the available outputs are constrained to D = {y_1, y_2, ..., y_n}, which is typically defined when training the classifier. When training the classifier, the main goal is to create a model (hypothesis) that has captured the relationship between features and classes so that an unseen set of features can be mapped to its corresponding label [18].

There exist numerous classifiers, each with its strengths and weaknesses. Depending on factors like the training data size, some classifiers perform better when trained with small training samples, while others scale better with larger training sets.

The upcoming sections discuss the machine learning algorithms used when creating a classifier and the theory behind each algorithm.

2.3.1 Naive-Bayes

The Naive-Bayes classifier is a generative model commonly applied when working with text classification problems [19]. The theory behind Naive-Bayes derives from Bayes' theorem:

P(A \mid B) = \frac{P(A) \times P(B \mid A)}{P(B)}    (2.1)

Translating 2.1 to a text classification problem, A is the label which we want to predict and B is the document, represented as a vector of features B = <x_1, x_2, ..., x_n>, where each feature x_i refers to a string in that document. Integrating the new designation, we get the following:

P(\text{label} \mid \langle x_1, x_2, \ldots, x_n \rangle) = \frac{P(\text{label}) \times \prod_{i=1}^{n} P(x_i \mid \text{label})}{\prod_{i=1}^{n} P(x_i)}    (2.2)

In 2.2, we assume that P(<x_1, x_2, ..., x_n> | label) can be approximated by \prod_{i=1}^{n} P(x_i | label), i.e. that all words inside the document are independent. This assumption is why the classifier is called Naive-Bayes. The denominator of 2.2 can be removed as it does not depend on the label, and we get the following:

\text{label} \leftarrow \underset{\text{label}}{\operatorname{argmax}} \left( P(\text{label}) \times \prod_{i=1}^{n} P(x_i \mid \text{label}) \right)    (2.3)

The function 2.3 returns the label with the highest probability score.

Despite its simplicity and the strong assumptions made, Naive-Bayes has shown strong results when compared side-by-side with more sophisticated classification algorithms [19, 20].

There exist numerous variants of Naive-Bayes, each with its own purpose and benefits. The variant used in this study is multinomial Naive-Bayes, which assumes a multinomial distribution for each of the features. This variant is commonly used in problems regarding text classification [21].
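As an illustration of how such a classifier can be trained, the sketch below uses scikit-learn (the library used in the implementation, see chapter 4); the training documents and concern names are made up for the example and are not taken from the subject systems.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical, already pre-processed source-file contents and their concerns.
    train_docs = ["gui maintable entry editor swing", "logic importer bibtex parser"]
    train_concerns = ["GUI", "LOGIC"]

    vectorizer = CountVectorizer()               # Bag-of-Words feature representation
    X_train = vectorizer.fit_transform(train_docs)

    clf = MultinomialNB()                        # multinomial variant, suited to word counts
    clf.fit(X_train, train_concerns)

    # Predict the concern of an unseen, pre-processed source file.
    X_new = vectorizer.transform(["preferences dialog gui swing"])
    print(clf.predict(X_new))                    # e.g. ['GUI']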

2.3.2 Support Vector Machine

Support Vector Machine (SVM) has been proven to provide excellent results in many tasks such as text classification, object recognition, and handwritten digit recognition [22]. SVM can be described informally as a hyperplane that separates the data points in the feature space. The data points belong to the categories and are mapped according to the feature input. When mapping the categories, SVM uses a kernel function to project the data points into a higher-dimensional space. Alongside the transformation, SVM finds the hyperplane that separates the data points most optimally, that is, the hyperplane with the largest margin to the closest data points of each category. These closest points are often referred to as the support vectors [12]. Figure 2.3 illustrates how two classes, depicted with green and blue colours, are transformed with a linear kernel function and separated with a hyperplane. Alongside the linear kernel, there are other, non-linear kernel functions that can perform more complex transformations, for example polynomial and RBF kernels [23].


Figure 2.3: Illustration of finding the hyperplane that creates the largest margin between the support vectors, with a linear kernel

Like many other classifiers, SVM was originally designed to handle binary classification problems; that is, the possible output is of boolean nature. SVM tackles multi-class problems by reducing them into multiple binary classification problems. In this study, we use the one-vs-rest technique. One-vs-rest follows the approach of fitting one classifier per class. This results in a total of n classifiers, where each class is fitted against the other classes [24].
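A minimal sketch of the one-vs-rest scheme with a linear-kernel SVM in scikit-learn is shown below; the documents and concern labels are illustrative placeholders, and the hyperparameters are not the ones used in the thesis.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    docs = ["gui dialog swing table", "logic importer parser", "database sql connection"]
    concerns = ["GUI", "LOGIC", "DATABASE"]

    X = CountVectorizer().fit_transform(docs)

    # One binary SVM is fitted per concern, each separating that concern from the rest.
    ovr_svm = OneVsRestClassifier(SVC(kernel="linear"))
    ovr_svm.fit(X, concerns)

    print(len(ovr_svm.estimators_))   # three binary classifiers, one per concern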

2.3.3 Maximum Entropy

Maximum Entropy, also referred to as multinomial logistic regression, is a commonly used classifier in different classification problems such as text classification [25]. It is a generalization of regular logistic regression, which only provides two possible, discrete outcomes, while maximum entropy enables a finite set of discrete outcomes [25, 26].

Both these models are based on the sigmoid function:

f(x) = \frac{1}{1 + e^{-x}}    (2.4)


This function has the form depicted in Figure 2.4 and ranges between (0,1) on the y-axis. Suppose we have the input data X = <x_1, x_2, ..., x_n>, where the x_i are the independent variables and X is the feature vector. We can represent the feature vector with the expression \lambda = \gamma + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n, where \gamma and the \beta_i are the unknown parameters. The expression \lambda can represent a specific problem which we want to solve; in our case, \lambda represents the content of a source file. The sigmoid function can be modelled to represent a probability, and this gives us the logistic model:

P(\text{label} \mid \langle x_1, x_2, \ldots, x_n \rangle) = \frac{1}{1 + e^{-(\gamma + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}}    (2.5)

If the model is fitted with data, where the data corresponds to \lambda and its label, the values for \gamma and \beta can be obtained, which gives a model one can use to predict unseen data [26].

To extend Equation 2.5 to handle multiple classes, that is, to create a MaxEnt model, we use the fact that the sum of all probabilities in a probability distribution is equal to 1. The output variable is defined as C = <l_1, l_2, ..., l_h>, which gives the final model:

P(C = l_i \mid \langle x_1, x_2, \ldots, x_n \rangle_i) = \frac{e^{-(\gamma + \beta_1 x_1 + \ldots + \beta_n x_n)_i}}{1 + \sum_{k=1}^{h} e^{-(\gamma + \beta_1 x_1 + \ldots + \beta_n x_n)_k}}    (2.6)

When estimating the unknown parameters, a standard method is to use maximum likelihood estimation [27].

Figure 2.4: Shape of the logistic function
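The sketch below shows how a MaxEnt model can be obtained as a multinomial logistic regression in scikit-learn; the data is made up and the solver settings are assumptions rather than the configuration used in the study.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["gui dialog swing", "logic importer parser", "database sql driver"]
    concerns = ["GUI", "LOGIC", "DATABASE"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    # With the default lbfgs solver, scikit-learn fits the multinomial (MaxEnt)
    # formulation for multi-class problems; the parameters are estimated by
    # (regularised) maximum likelihood.
    maxent = LogisticRegression(max_iter=1000)
    maxent.fit(X, concerns)

    print(maxent.predict(vectorizer.transform(["swing dialog gui editor"])))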


2.4 Feature representation

When applying machine learning to text-based problems, the data has to be transformed into a representation feasible for the model to interpret. The representation is based on a set of rules; thus a document d can be represented in numerous ways depending on the technique used.

2.4.1 Bag-of-words

Bag-of-Words is a primitive yet valuable information retrieval technique used to convert text documents to a numerical representation [28]. The approach is that the set of documents D = {d_1, d_2, ..., d_n} creates a mutual vector encoding: all unique words across all documents are encoded to a specific position inside the vector. Each document is transformed based on the vector encoding, which often leads to sparse feature vectors due to the difference in textual data between documents [12]. Table 2.1 demonstrates four documents that are transformed according to the feature encoding "I", "like", "data", "science". When a document contains words that appear multiple times, the instances of the word are counted and stored in its feature vector. As seen in Doc 1, the word "like" appears three times, leading to a feature vector with the scalar value three at the position encoded as "like".

           Doc 1   Doc 2   Doc 3   Doc 4
I            1       1       0       1
like         3       2       0       1
data         0       1       1       1
science      1       1       1       1

Table 2.1: Example of documents encoded, shown in matrix form
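For illustration, the encoding of Table 2.1 can be approximated with scikit-learn's CountVectorizer, as in the sketch below; the four example documents are invented so that their word counts roughly match the table.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "I like like like science",   # Doc 1
        "I like like data science",   # Doc 2
        "data science",               # Doc 3
        "I like data science",        # Doc 4
    ]

    # The token pattern is relaxed so that the one-character word "I" is kept;
    # by default CountVectorizer drops single-character tokens.
    bow = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    X = bow.fit_transform(docs)

    print(bow.get_feature_names_out())   # ['data' 'i' 'like' 'science']
    print(X.toarray())                   # word counts per document, cf. Table 2.1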


2.4.2 Tf-idf

Term frequency-inverse document frequency (TF-IDF) is the second information retrieval technique used in our study. TF-IDF provides a way to handle the problem that occurs when commonly used words are over-represented in the feature vector [29]. TF-IDF is defined in Equation 2.7:

w(d,t) = tf \times idf    (2.7)

where

idf = \log\left(\frac{N}{d_t}\right)    (2.8)

and

tf = 1 + \log(TF)    (2.9)

The equation calculates the weight w(d,t) of the term t in document d by multiplying the term frequency, that is, the number of times the term occurs in the text divided by the total number of words (normalized by adding 1 and taking the logarithm), by the inverse document frequency. The inverse document frequency is calculated by taking the logarithm of the total number of documents N divided by the number of documents d_t containing the term t [30]. Table 2.2 shows the result of applying TF-IDF to the original Table 2.1. The table shows that the term "science" is set to a weight of zero: the term appears in all documents, and by observing 2.8, it follows that log(4/4) = 0, thus resulting in zero weight.

           Doc 1    Doc 2    Doc 3    Doc 4
I          0.037    0.049    0        0.049
like       0.097    0.087    0        0.049
data       0        0.049    0.087    0.049
science    0        0        0        0

Table 2.2: Table when applying TF-IDF to table 2.1
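A corresponding sketch with scikit-learn's TfidfVectorizer is shown below; note that scikit-learn's default TF-IDF formula adds smoothing and normalisation, so its weights will not match Table 2.2 exactly.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "I like like like science",
        "I like like data science",
        "data science",
        "I like data science",
    ]

    # smooth_idf=False and norm=None bring the weighting closer to Equations 2.7-2.9,
    # but scikit-learn still adds 1 to the idf term, so the numbers differ from Table 2.2.
    tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", smooth_idf=False, norm=None)
    X = tfidf.fit_transform(docs)

    print(tfidf.get_feature_names_out())
    print(X.toarray().round(3))   # terms occurring in every document get the lowest weight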


2.5 Evaluation

The performance of the machine learning models is evaluated with the help of a collection of metrics. For supervised learning, there are numerous defined metrics, for both the regression and classification domains. The relevancy of the different metrics depends on the label distribution, the size of the dataset, and more [12].

2.5.1 Cross-validation

Cross-validation (CV) is a widely used strategy for model selection. The idea is to split the dataset one or more times into training and testing samples, where each split corresponds to a test. The model is trained and evaluated with the sample pair, and the chosen CV technique specifies the number of times the dataset is split. When all iterations have been run and the classifier evaluated, the average metric scores are calculated [31].

In this study, we use the Monte Carlo cross-validation technique to evaluate the models. Monte Carlo CV performs n runs, where in each iteration the dataset is split into training and testing samples. The data should be randomly distributed between the training and testing samples, leading to the model being trained and tested with as much variation as possible. Monte Carlo CV has the benefit of producing a low variance in the gathered results, but the weakness of being a biased estimate [32]. Figure 2.5 shows an example of a Monte Carlo CV where 3/9 of the dataset is specified for testing the classifier, while the rest of the dataset is used for training the classifier.


Figure 2.5: Monte-carlo cross-validator
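In scikit-learn, a Monte Carlo split of this kind can be expressed with ShuffleSplit, or StratifiedShuffleSplit when class proportions should be preserved. The sketch below uses a small synthetic dataset for illustration only.

    import numpy as np
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.naive_bayes import MultinomialNB

    # Tiny synthetic document-term matrix and concern labels, for illustration only.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(60, 20))
    y = np.array(["GUI", "LOGIC", "DATABASE"] * 20)

    # 100 random train/test splits, with 10% of the data used for training each time.
    cv = StratifiedShuffleSplit(n_splits=100, train_size=0.10, random_state=0)

    scores = []
    for train_idx, test_idx in cv.split(X, y):
        clf = MultinomialNB().fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

    print(np.mean(scores))   # average accuracy over all random splits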

2.5.2 Accuracy

When describing the following metrics, these notations are used: TP stands for true positives, FP for false positives, TN for true negatives, and FN for false negatives. Accuracy is a powerful yet simple metric which is calculated by taking the total number of correct predictions divided by all the predictions made:

\text{Accuracy} = \frac{\text{All correct predictions}}{\text{All predictions}} = \frac{TP + TN}{TP + FP + TN + FN}    (2.10)

However, it should be noted that the accuracy metric suffers from a significant problem when the classes are imbalanced, i.e., one or a few of the classes have the majority of representation over the other classes in the dataset. It results in the metric becoming biased towards the major classes [12].

2.5.3 Precision

Precision is calculated by dividing the positively correct predictions by all positive predictions. The metric can be expressed as follows:

\text{Precision} = \frac{TP}{FP + TP}    (2.11)

In our case, this metric shows how well the model distinguishes the concerns in the system; that is, if the precision is high, a high proportion of the predicted source files are correctly mapped to their corresponding concerns.

2.5.4 Recall

The recall score is the ratio between the correctly predicted positives and the total number of actual positives. The metric can be expressed with the following:

\text{Recall} = \frac{TP}{FN + TP}    (2.12)

Unlike precision, the recall score indicates when there is a high number of missed positive predictions.

2.5.5 F1-score

The Fβ-score is a metric which calculates the weighted harmonic mean combining both precision and recall. Depending on the value of β, either recall or precision is set to be of higher priority, e.g. β > 1 lends more weight to recall, while β < 1 adds weight to precision. In this study, β is set to 1, thus giving the F1-score:

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (2.13)

F1-score is a commonly used metric when there exists a class imbalance inside the dataset [33].

2.5.6 Average & Weighted score

When calculating the precision, recall and F1-score for the classifiers, we calculate the metrics for each concern, which yields a local metric for that concern. In our study, we use macro and weighted averages.


Concern      GUI    Database   Logic   Average (macro) score
Precision    0.88   0.79       0.85    (0.88 + 0.79 + 0.85)/3
Recall       0.90   0.85       0.87    (0.90 + 0.85 + 0.87)/3
F1-score     0.89   0.82       0.86    (0.89 + 0.82 + 0.86)/3

Table 2.3: Example of combining local metrics

The macro score refers to taking the individual metrics from each concern and calculating the average, as demonstrated in Table 2.3. The macro score does not take class imbalance into account, which is why the weighted score is also used. The weighted score takes each local metric and multiplies it by a weight; in our case, the weight is the proportion of the class in the dataset. For example, if class A represents 95% of the total dataset and class B the remaining 5%, the metric M_weighted is calculated as M_weighted = 0.95 · M_classA + 0.05 · M_classB.
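Both averaging variants are available through the average parameter of scikit-learn's metric functions; the sketch below uses made-up predictions over three concerns.

    from sklearn.metrics import precision_score, recall_score, f1_score

    # Made-up ground truth and predictions over three concerns.
    y_true = ["GUI", "GUI", "GUI", "LOGIC", "LOGIC", "DATABASE"]
    y_pred = ["GUI", "GUI", "LOGIC", "LOGIC", "LOGIC", "DATABASE"]

    # average="macro": plain mean of the per-concern scores (as in Table 2.3).
    # average="weighted": per-concern scores weighted by each concern's share of the data.
    for avg in ("macro", "weighted"):
        p = precision_score(y_true, y_pred, average=avg)
        r = recall_score(y_true, y_pred, average=avg)
        f = f1_score(y_true, y_pred, average=avg)
        print(avg, round(p, 2), round(r, 2), round(f, 2))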

2.6 Related Work

Olsson et al. combine semantic and dependency information from a set of training files used as training data for a Naive Bayes classifier [5]. They compare their technique with HugMe, a semi-automatic mapping technique, on six different subject systems. When tested with the same amount of initially mapped data, they conclude that the Naive Bayes classifier achieves an improvement in average performance and an overall improvement in precision, recall, and F1-score. Mapping around 20% of the system files was shown to give the overall best performance.

Link et al. propose a new, open-source, concern-based recovery method called RELAX [6]. RELAX uses the Naive Bayes classifier to perform text classification. Link et al. compare their method against other architectural concern methods that are not based on machine learning models. The experiments were performed on eight subject systems, and their results showed that RELAX outperformed the other methods in five out of the eight subject systems. The report does not provide any information about the actual mapping size or any technical details.


Chapter 3 Design

This chapter describes the design of the experiments through the following sections. Section 3.1 motivates the relevance of the research questions, and section 3.2 presents the subject systems and classifiers used in the study. Section 3.3 explains the reasoning behind the data extraction from the source files and introduces the pre-processing settings. Section 3.4 describes the different ways of splitting the dataset. Section 3.5 details the approach used to answer the research questions. The last section, 3.6, contains the link to the repository for this thesis.


3.1 Research questions

The research questions guide the study. The following subsections motivate why these questions are relevant for this study and explain how we will find the answers to each of them. Throughout the following text, Research Question is abbreviated RQ.

The research questions are the following:

RQ:1 How well does mapping software architectural concerns in Java source code through machine learning perform?

Sub-RQ:1 Which pre-processing results in the most satisfactory performance?

Sub-RQ:2 How large should a manual mapping be in order to train a satisfactorily performing classifier?

3.1.1 RQ:1

How well does mapping software architectural concerns in Java source code through machine learning perform?

RQ:1 is the main reason for this study, and there are many different approaches to answering it. When deciding how to approach the answer to the research question, we had to keep in mind the time the study would take to complete.

Considering the time limitation, we decided to focus on three areas: pre-processing, the amount of manual mapping needed, and which classifier performs the best. The sub-questions were created around these three areas.

When looking for the answers to Sub-RQ:1 and Sub-RQ:2, we experiment with different classifiers, and given the time limitation, the decision was made to only experiment with three different classifiers. The answer to "Which classifier performs the best?" is found when looking at the gathered results from Sub-RQ:1 and Sub-RQ:2, and in the conclusion of RQ:1, which is why we believe it is not necessary to dedicate an RQ to the classifiers.

3.1.2 Sub-RQ:1

Which pre-processing results in the most satisfactory performance?

To answer RQ:1, we need to understand what input to use for the machine learning model to achieve good performance. This subsection motivates why Sub-RQ:1 is relevant to the study and describes our approach to finding the answer.

The input must represent the concerns well. If it contains overly specific data that is only present in the originating source file but not relevant to the architectural concern, this leads to a model overfitted towards that source file instead of generalised towards the concern it belongs to. There also exists the opposite situation, where the data is too generalised; this creates an underfitted model. With an underfitted model, the classifier will be unable to specify which architectural concern the source file should be mapped to. Overfitting and underfitting are two core problems that arise when creating machine learning models, which is why it is crucial to find the answer to Sub-RQ:1.

When pre-processing the input, we use NLP methods such as stemming, tokenisation, removing English stop words, separating compound words and changing uppercase letters to lowercase. Given the difference between source code and natural language, we use these methods on the informal part of the input. This is done to improve the performance of the model [34]. The model's performance is significantly influenced not only by which methods are used but also by the order in which they are used. We will find the answer to Sub-RQ:1 by using the classifiers' metrics as a guide to determine which combinations of pre-processing settings result in classifiers performing well.

Further details on the pre-processing can be found in section 3.3 and the implementation of the pre-processing in section 4.3.

3.1.3 Sub-RQ:2

How large should a manual mapping be in order to train a satisfactorily performing classifier?

To answer RQ:1, we think a good approach is to test different dataset sizes and see the effect on the classifiers' performance. In most areas where ML is applied, the amount of training data is not constrained by choice but by the data available for that subject. In our case, it is the other way around: we have access to all the data, but we want to use as little of it as possible and still achieve acceptable results.

This question is important because a machine learning model trained with little data often does not perform as well as one trained with a larger dataset [35]. In practice, we have to map the source files manually, and if we need a large training dataset, this process becomes tedious as the system grows in size. There is a trade-off between the training size and the performance of the classifiers.

To answer this question, we look for an optimal training size, where performance is acceptable and the effort of manually mapping source files is as small as possible.


System name   Version   Lines of code   Lines of comments   # of Java files   # of concerns
JabRef        3.7       88,562          17,187              845               6
ProM          6.9       69,492          22,763              867               15
TeamMates     5.110     102,072         12,514              812               6

Table 3.1: Subject systems

3.2 Subject systems & Classifiers

3.2.1 Subject systems

The three open-source systems that are the subject of this study were provided by the supervisor. The biggest reason for choosing these systems as the subjects of this study is that the mapping of the architectural concerns was done prior to this study, and the mapping was provided along with the source files. The mapping along with the source files is publicly available in the SAEroCon repository¹. This saves the time of finding and correctly mapping the concerns and shifts the focus of the study to the implementation and study of the machine learning models.

The systems that were provided are JabRef², ProM³ and TeamMates⁴. JabRef is an acronym for Java Alver Batada Reference. It is a cross-platform citation and reference management software that uses BibTeX and BibLaTeX. ProM is an extensible framework for Process Mining, which is where the name stems from. The framework provides means for monitoring and analysis of real-life processes. TeamMates is an online peer feedback system for student team projects.

Table 3.1 shows the size of the subject systems used in the study by specifying the lines of code, lines of comments, the number of Java source files and the number of architectural concerns.

¹ https://github.com/sebastianherold/SAEroConRepo
² https://www.jabref.org/
³ https://promtools.org/doku.php
⁴ https://teammatesv4.appspot.com/web/front/home


3.2.2 Classifiers

The classifiers tested in this experiment are Naive-Bayes, SVM and Maximum entropy.

We test Naive-Bayes because of its simplicity and the performance previously achieved when [6] used Naive-Bayes to map architectural concerns. Maximum entropy has shown itself to be a great candidate alongside Naive-Bayes: in [25], maximum entropy performs better than Naive-Bayes on 2 out of 3 datasets, which motivates us to investigate whether maximum entropy can outperform Naive-Bayes in our study. Alongside Naive-Bayes and maximum entropy, which both have the property of being probabilistic models, we decided to test SVM. SVM is a robust machine learning algorithm that has shown excellent results in topics regarding text classification [22].

When trying to answer the question of which classifier is the best for the task, we base the answer on the results of Sub-RQ:1 and Sub-RQ:2. We try to find an optimal classifier with respect to the pre-processing and the size of the training set. In this study, we have chosen not to experiment with the classifiers' hyperparameters; instead, we define a preset of hyperparameters for each classifier, see section 4.4.1.

3.3 Information Retrieval and Pre-processing

When looking for the answer to Sub-RQ:1, the focus is on three parts: the extraction of information from the input, the processing of said information with different combinations of NLP techniques, and the representation of the information as features.

This section motivates how we decide what to extract from the source code and how we approach the pre-processing.


3.3.1 What is relevant information to extract from the input?

Languages, in general, are a means of communication, and this includes programming languages. The difference is that for programming languages the syntax and semantics are formally defined, whereas natural languages have them too, but they are a lot more ambiguous. A parallel between programming languages and natural languages can be found in the informal information contained within both the naming of identifiers and the natural language used in comments. In this study, information retrieval techniques are used to extract architectural information from the vocabulary used within the informal information in the Java source files [36].

Comments contain the developer's messages and descriptions in the file's domain, which is a good reason for including them. The identifier names refer to the names used for packages, libraries, classes, methods, and local and global variables [34]. We focus specifically on public identifiers, for they are not always contained in a single source file but can exist in multiple source files; this creates a relationship between these files, which can carry relevant architectural information. All other fragments of the Java source code, such as private class members, method bodies, and literals of all types, were investigated and considered not to carry relevant architectural information in general.

3.3.2 Pre-process settings

In search of an approach to implementing the pre-processing, we needed three main things: a simple way to change what we extract and which NLP methods are used, a way to choose the order in which the NLP methods are applied, and a way to separate the extractions, so that different NLP methods can be used on the different extracted information.

To find the answer to Sub-RQ:1, we need to experiment with different combinations of information extractions and NLP methods; thus, changing the pre-processing settings every time directly in the source code would result in time-consuming work, which is why we wanted a simple way to change the settings. The order in which the NLP methods are used changes the result significantly. An example of this: if we have the word "getMessage" and decide to convert the word to lower case, "getmessage", and then try to separate the compound word, we would get "getmessage". This results in it not being able to recognise the word as two separate words, but if we change the order of the methods used, we get ["get", "message"]. Comments are typically written in natural language, whereas identifier names can be compound words with separators differing depending on the kind of identifier; for example, libraries and packages in Java source code are separated by full stops, but methods are in some instances separated by lower- and upper-case letters. This creates the need to differentiate the NLP methods used for the different extracted information.

Table 3.2 shows the implemented extraction settings, and Table 3.3 shows the implemented parsing options; further details on the implementation and the specifics of the pre-processing can be found in section 4.3. The pre-processing settings created from the experimentation can be found in section 5.1.

Setting name Description

"raw" Extracts everything from the source-file.

"c" Extracts the class identifiers.

"pm" Extracts the public method identifiers.

"pv" Extracts the public variable identifiers.

"lib" Extracts the library identifiers (import statements).

"pac" Extracts the package identifiers.

"com" Extracts the comments.

Table 3.2: Extraction options


Setting name Description

"lc" Convert words to lower case.

"sc" Remove single characters.

"sw" Remove English stop words.

"jk" Remove java key-words.

"nu" Remove numbers.

"scw" Separate compound words that is separated by lower- and upper-case letters.

"stem" Use stemming.

"tow" Tokenize only the words.

Table 3.3: Parsing options

3.4 Training

This section aims to discuss the different techniques used to split the data between training and testing sets. We also introduce a threshold value which small concerns must pass to be included when performing the experiments.

3.4.1 Threshold

The threshold specified in this thesis regards the minimum number of files mapped to each concern in the ground-truth architecture. If this number is low, a limited set of files would be available for training a classifier to detect mappings to this concern. Concerns not exceeding this threshold are ignored, i.e. the resulting classifier will not be trained nor tested for such concerns. We decided to set the threshold value to five files, which implies that the testing data must contain five or more files in addition to the training set.
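A sketch of how such a threshold could be applied to a pandas DataFrame of file-to-concern mappings is shown below; the column names and file names are assumptions for illustration, not the thesis's actual data layout.

    import pandas as pd

    THRESHOLD = 5   # minimum number of mapped files a concern must have to be kept

    df = pd.DataFrame({
        "file":    ["A.java", "B.java", "C.java", "D.java", "E.java", "F.java", "G.java"],
        "concern": ["GUI", "GUI", "GUI", "GUI", "GUI", "PREF", "PREF"],
    })

    # Keep only concerns with at least THRESHOLD mapped files.
    counts = df["concern"].value_counts()
    kept = counts[counts >= THRESHOLD].index
    df_filtered = df[df["concern"].isin(kept)]

    print(df_filtered["concern"].unique())   # PREF (2 files) is dropped, GUI (5 files) is kept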

3.4.2 Data-split: ratio

The first split technique is based on a ratio value. The ratio indicates, in percentage, how much of the original dataset is partitioned between the training and testing sets. Splitting data based on a ratio is a widespread technique in machine learning; thus, we think it is essential to include it in our study. When splitting with ratios, we apply stratification to maintain the proportion between the concerns in the system. For example, when experimenting with a ratio of 10% on a system with three concerns A, B and C, where A represents 50% of the system, B 30% and C the remaining 20%, the 10% partition will contain 50% A, 30% B and 20% C. Stratification is used to ensure that minority concerns that pass the threshold test are included in the training set when training the classifier.

Figure 3.1 shows how the data is split between training and testing in the JabRef system when the ratio is 0.1.

Figure 3.1: Distribution between train and test files if training set size ratio is 0.1
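The ratio-based, stratified split corresponds to scikit-learn's train_test_split with the stratify argument; the sketch below uses an invented 40-file system with the 50/30/20 proportions from the example above.

    from sklearn.model_selection import train_test_split

    # Invented system with 40 files: 50% GUI, 30% LOGIC, 20% DATABASE.
    files    = [f"File{i}.java" for i in range(40)]
    concerns = ["GUI"] * 20 + ["LOGIC"] * 12 + ["DATABASE"] * 8

    # 10% of the files go to training; stratify keeps the concern proportions in both parts.
    train_files, test_files, train_y, test_y = train_test_split(
        files, concerns, train_size=0.10, stratify=concerns, random_state=0)

    print(sorted(train_y))   # roughly 2 GUI, 1 LOGIC, 1 DATABASE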

3.4.3 Data-split: absolute size

When data is split based on the ratio, stratification-based splitting is used, which ensures that the ratio within the split mirrors the ratio between concerns inside the system. The usage of stratification is essential; otherwise, there is no certainty that all concerns are included when training the classifier. However, using stratification is based on knowing the ratio between concerns, which is often unknown. Therefore, we also want to split the data ignoring the ratio between concerns, focusing instead on an absolute size.

In this approach, the training data is based on an absolute size instead of the ratio between concerns. This technique creates a realistic scenario, where we assume that little is known about the system. To simulate this, we extract x files from each concern for training the classifier. The drawback of using this technique instead of the ratio is that the larger concerns may be underrepresented in the training set, while the minor concerns are over-represented. In figure 3.2, the JabRef system is split according to absolute size.

Figure 3.2: Distribution between train and test files when abs.size = 5


As figure 3.2 depicts, GUI and LOGIC, which are the larger concerns, will be underrepresented in the training set. At the same time, with only eleven files, PREF is over-represented, with almost half of the total number of files used for training.
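The absolute-size split can be sketched by drawing a fixed number of files per concern regardless of concern proportions; again, the DataFrame layout below is an assumption for illustration.

    import pandas as pd

    ABS_SIZE = 5   # number of training files drawn from every concern

    df = pd.DataFrame({
        "file":    [f"File{i}.java" for i in range(100)],
        "concern": ["GUI"] * 60 + ["LOGIC"] * 29 + ["PREF"] * 11,
    })

    # Draw ABS_SIZE random files per concern for training; everything else is for testing.
    train = df.groupby("concern").sample(n=ABS_SIZE, random_state=0)
    test = df.drop(train.index)

    print(train["concern"].value_counts())   # every concern contributes exactly 5 files
    print(len(test))                         # the remaining 85 files are left for testing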

3.5 Experiments

In this section, we discuss our approach to answer the research questions. We start by defining a set of pre-processing settings used in the first experiment, where we test the pre-processing settings, alongside the feature representations, on the classifiers.

The results gathered from the first experiment are then used to perform the second experiment to find an optimal training size for the classifiers.

3.5.1 Finding pre-process settings

Before starting the experiments, we have to define a set of pre-processing settings that will parse the source files accordingly. One major problem in finding the best settings is that there are endless ways to combine pre-process settings, making it highly impractical to find a setting that perfectly fits the classifier. Instead of using a brute-force strategy to find the best setting, we decided to find pre-processing settings that we believe extract the most relevant information. This means that there is a possibility of missing a combination of settings that could perform better than the proposed settings. With time being a factor, we decided on eight different pre-process settings and compared these against each other to find the setting resulting in the overall best performance.

For each extraction option, there is a specific list of parsing options, shown in Table 3.4; this limits the number of possible combinations of pre-processing settings and thus saves time. The descriptions of the parsing settings can be found in Table 3.3. For every extraction option, the tokenisation of only words "tow" is used, since the remaining parsing options do not work without tokenisation. With every extraction we remove Java keywords "jk", for they do not contain any relevant architectural information. For classes "c", we separate compound words, as it is common to use compound words as identifiers, and set the extracted identifiers to lowercase. For comments "com", because they are typically written in natural language, we set the extractions to lowercase "lc", remove English stop words "sw", and use stemming "stem". There is a similarity between the data extracted by "lib" and "pac", but with different implications for the architectural information, and thus a distinction is created between the two of them. The distinction is created by further manipulating the extraction of "lib" compared to "pac": the extracted data is set to lowercase and stemming is applied. For public methods "pm" it is common to create identifiers that are compound words, which is why "scw" is used, followed by "lc" and "sw". Public variables "pv" are treated similarly to public methods, with the exception of not using "sw".

Extraction option   Parsing options
"c"                 tow, jk, scw, lc
"com"               tow, lc, sw, stem
"lib"               tow, scw, jk, lc, stem
"pac"               tow, jk
"pm"                tow, jk, scw, lc, sw
"pv"                tow, jk, scw, lc

Table 3.4: The specific parsing settings that execute each extraction.

3.5.2 Performance comparison: Feature representations & pre-processing settings

In this experiment, we test Bag-of-Words and TF-IDF, alongside the pre-process settings defined in the previous section. This experiment focuses on gathering two results: the feature representation and the pre-process setting that together result in the best average performance.

As depicted in figure 3.3, we define the static settings used inside the Monte-Carlo CV. The static settings are the values that remain the same throughout the experiment. We specify the test/train split to a ratio of 90/10, where 10% of the dataset is used for training and 90% for testing. The number of iterations indicates how many times the CV will train and test the classifiers; in this experiment it is set to 100.
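As a minimal sketch, a stratified Monte-Carlo CV with these static settings can be realised with Sklearn's StratifiedShuffleSplit; the names X (pre-processed documents) and y (concern labels) are assumed to come from the pre-processing step and are not part of the original description.

from sklearn.model_selection import StratifiedShuffleSplit

# 100 stratified Monte-Carlo iterations, 10% training and 90% testing.
cv = StratifiedShuffleSplit(n_splits=100, train_size=0.10, test_size=0.90,
                            random_state=42)
for train_idx, test_idx in cv.split(X, y):
    # train and evaluate each classifier on this split
    pass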

Before evaluating the classifiers, we pick one of the eight predefined pre-processing settings to parse the subject system. After parsing the system according to the setting, we specify which feature representation to use and enter the stratified Monte-Carlo CV.

In each iteration inside the validator, the training and testing sets are randomised and stratified to the ratio specified in the static settings and transformed into the specified feature representation. The classifiers are trained and tested, and their metrics are stored.
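The transformation into the chosen feature representation can be sketched with Sklearn's text vectorizers; docs_train and docs_test are assumed names for the pre-processed token strings of the current split, and the vectorizer is fitted on the training data only.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow = CountVectorizer()                        # Bag-of-words
X_train_bow = bow.fit_transform(docs_train)    # learn the vocabulary on the training split
X_test_bow = bow.transform(docs_test)

tfidf = TfidfVectorizer()                      # Tf-idf
X_train_tfidf = tfidf.fit_transform(docs_train)
X_test_tfidf = tfidf.transform(docs_test)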

When all iterations inside the CV have finished, we obtain the classifiers' average accuracy, precision, and recall scores from that test. The individual results from each pre-processing setting are stored in a table with the labelled settings, the classifiers' metric scores, and the used feature representation. After iterating through all the pre-processing settings and testing the different feature representations, we sort the table according to the classifier that achieved the highest accuracy.
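A hypothetical layout of this result table, using pandas, is sketched below; the column names are our own and only illustrate the bookkeeping described above.

import pandas as pd

columns = ["setting", "representation", "classifier",
           "accuracy", "precision", "recall"]
results = pd.DataFrame(columns=columns)
# ... one row is appended per cross-validation run ...
sorted_results = results.sort_values("accuracy", ascending=False)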


Figure 3.3: Single experiment run where different settings and feature representations are tested

3.5.3 Experiment: Training size

The results from the first experiment determine which pre-processing setting and feature representation to use when examining the different training sizes. In this experiment, we focus on altering the training size and observing the effect this has on the performance of the classifiers. We decided to split the test into two parts: one where the training data is based on a ratio split, and one where it is based on an absolute size. See section 3.4 for more details.

The two tests are similar in their approach, and their overall structure is displayed in figure 3.4. Following the figure, both tests share the same static settings. The static settings refer to the number of iterations the cross-validator performs and the pre-processing setting and feature representation that gave the best results in the first experiment. Before evaluating the classifiers, we define the set of ratios and absolute training sizes that the tests are based on. The training/test set size ratios to be tested are {(10%/90%), (15%/85%), (20%/80%), (25%/75%)}, and the absolute numbers of files per concern are {5, 10, 15, 20, 25}.

Each of the specified ratios/absolute sizes is tested inside the cross-validator. Similar to the first experiment, we use the stratified Monte-Carlo cross-validator when testing the different ratios. When testing the different absolute file sizes, we use a regular Monte-Carlo cross-validator without stratification. After cross-validating, we store the average metrics calculated for the classifiers. To test all the specified ratios/absolute sizes, we iterate through the list of ratios/absolute sizes and perform the cross-validation for each entry. After iterating through the list, we have a list of average metrics containing the metrics from each test performed. We base our conclusions on this list; for example, we observe the increase or decrease in accuracy, precision, recall, and F1-score when increasing the training size from 10% to 15%.
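A sketch of the two training-size loops is given below; the variable names, and the per-concern sampling shown for the absolute-size test, are our own illustration of drawing a fixed number of files per concern, since the exact sampling of the regular Monte-Carlo cross-validator is not detailed here. y is assumed to hold the concern label of every file.

import random
from sklearn.model_selection import StratifiedShuffleSplit

ratios = [0.10, 0.15, 0.20, 0.25]
for r in ratios:
    cv = StratifiedShuffleSplit(n_splits=100, train_size=r, random_state=0)
    # ... cross-validate all classifiers with this ratio ...

files_per_concern_sizes = [5, 10, 15, 20, 25]
for n in files_per_concern_sizes:
    for _ in range(100):                       # Monte-Carlo iterations
        train_idx = []
        for concern in set(y):
            candidates = [i for i, label in enumerate(y) if label == concern]
            train_idx += random.sample(candidates, n)
        test_idx = [i for i in range(len(y)) if i not in train_idx]
        # ... train on train_idx, evaluate on test_idx ...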

Figure 3.4: Structure of experiment when testing different training sizes



3.6 Repository

All the Jupyter notebooks containing the data and figures found in this dissertation, along with the scripts and prepared systems, can be found in the following repository: https://git.cse.kau.se/alexflor100/bsc-javsococlassifier.


Chapter 4

Implementation

The structure of our implementation can be divided into two parts: the experiment settings and the experiment framework. The experiment settings are the variables that we want to analyze in order to see the effect they have on the performance of the classifiers. These settings are defined before performing the experiments. The experiment framework is the body of our experiment, which adjusts according to the experiment settings. The experiment settings, alongside the framework, are integrated into Jupyter notebooks. The notebooks are used as a playground when performing the experiments.

In the upcoming sections, we focus on the three parts that make up the experiment framework, namely data collection, pre-processing, and training & evaluation (see figure 4.1). Data collection reads the files from the subject system and prepares the unstructured data for the pre-processing phase. Inside the pre-processor, the data is parsed and transformed, according to the provided settings, into a format readable by the classifiers. In training & evaluation, we train and test the classifiers with the data sent from the pre-processor and analyze the models' performance based on the metrics they produce. Each of the three parts has its own section, where we describe more thoroughly how it is implemented in practice.


Figure 4.1: Implementation overview

4.1 Tools and libraries

When creating the lab environment, numerous tools are used to ease the process and save time. We use scikit-learn (Sklearn) to create the machine-learning models specified in section 3.2.2. Sklearn also contains functionality for calculating metrics, which is used for evaluating the classifiers. For data analysis, i.e. creating structured data from unstructured data, we use pandas. In the pre-processing phase, we use the libraries regex and nltk for data extraction and parsing.

4.1.1 Jupyter

Jupyter is an open-source project that provides an interactive web-tool environment, also called Jupyter notebooks. The notebooks provide the user with the ability to execute code from the browser, as well as in-browser editing using the Markdown markup language for commentary on the code [37]. Inside the notebooks, we integrate the scripts that define the experiment framework. Our study uses JupyterLab, which is an environment that provides functionality to store and run the notebooks.

4.1.2 Scikit-learn

Sklearn is a Python module for machine learning [38]. It provides an easy-to-use interface to highly efficient, state-of-the-art machine learning algorithms. Sklearn also provides well-documented class references, tutorials, and installation instructions, making it easier and faster for the user to start using the module [39]. In our study, mainly the classification tools are used.
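As a minimal sketch, the three classifiers from section 3.2.2 map onto Sklearn estimators as shown below; MaxEnt corresponds to (multinomial) logistic regression in Sklearn, and the hyper-parameters as well as the names X_train, y_train, X_test and y_test are assumptions rather than the settings used in the study.

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

classifiers = {
    "Naive-Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "MaxEnt": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                  # assumed training split
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0)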

4.1.3 Pandas

Pandas is a powerful data analysis tool that provides the user with a rich set of data structures and functions designed to handle data in a fast, easy, and expressive way [40]. In this study, the data structure DataFrame is used. A DataFrame is a two-dimensional tabular object with both column and row labeling [40]. We use DataFrame objects to represent the data, that is, the system architecture. How the data is converted to a DataFrame object is described in more detail in section 4.2.
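A minimal sketch of such a DataFrame is shown below; the column names, file names, and token strings are illustrative only, not those of the actual implementation.

import pandas as pd

data = pd.DataFrame({
    "file":    ["MainWindow.java", "Parser.java"],      # hypothetical file names
    "tokens":  ["main window frame button", "parse token read line"],
    "concern": ["GUI", "LOGIC"],                         # architectural concerns
})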

4.1.4 Regex

Regular expressions are used to find and extract various types of patterns in the source code. The patterns we use are specifically designed to extract information from Java files. A tool¹ was used to find and validate the regular expressions we needed.

¹ https://regexr.com
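As an illustration, patterns in the spirit of the "pac", "lib" and "com" extraction options might look as follows; the sketch uses Python's standard re module, the exact expressions in the implementation may differ, and Example.java is a hypothetical input file.

import re

PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE)                # "pac"
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.*]+)\s*;", re.MULTILINE)   # "lib"
COMMENT_RE = re.compile(r"//.*|/\*[\s\S]*?\*/")                                     # "com"

source = open("Example.java", encoding="utf-8").read()
packages = PACKAGE_RE.findall(source)
imports = IMPORT_RE.findall(source)
comments = COMMENT_RE.findall(source)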
