
Author: Helena Tevar Hernandez
Supervisor: Francis Palma
Semester: VT/HT 2020

Bachelor Degree Project
Evolution of Software Documentation Over Time


Abstract

Software developers, maintainers, and testers rely on documentation to understand the code they are working with. However, software documentation is often perceived as a waste of effort because it is usually outdated. How documentation evolves through a set of releases may show whether there is any relationship between time and quality. The results could help future developers and managers to improve the quality of their documentation and decrease the time developers spend analyzing code. Previous studies showed that documentation used to be scarce and low in quality; this research has therefore investigated different variables to check whether the quality of the documentation changes over time. To that end, we created a tool that extracts the comments of code blocks, classes, and methods, and calculates their quality. The results agree with the previous studies: the quality of the documentation is affected to some extent through the releases, with a tendency to decrease.


Preface


Contents

1 Introduction
  1.1 Background
    1.1.1 Quality Definition
    1.1.2 Jaccard ratio and Cosine similarity
    1.1.3 Java Language
  1.2 Related work
  1.3 Problem formulation
  1.4 Motivation
  1.5 Research Questions and Objectives
  1.6 Scope/Limitation
  1.7 Target group
  1.8 Outline
2 Method
  2.1 Natural Language Processing
  2.2 Reliability and Validity
  2.3 Ethical Considerations
3 Implementation
  3.1 Extraction
    3.1.1 Extracting comments
    3.1.2 Extracting classes
    3.1.3 Extracting methods
  3.2 Cohesion calculation
    3.2.1 Parsing and normalizing strings
    3.2.2 Jaccard algorithm
    3.2.3 Cosine algorithm
  3.3 Results of the extraction
4 Results
  4.1 RQ 1: What is the proportion of code blocks with and without documentation?
  4.2 RQ 2: What is the proportion of new code blocks with and without documentation?
  4.3 RQ 3: Does the code blocks documentation quality improve across the releases?
  4.4 RQ 4: Is there any relation between lines of code and quality of the documentation?
5 Analysis
6 Discussion
7 Conclusion
  7.1 Future work
References
A Appendix — Selection of projects
B Appendix — Evolution of quality
C Appendix — Lists of stop words

1 Introduction

Developers usually rely on low-level documentation, especially class- and method-level documentation, to comprehend, modify, and maintain a system that is continuously evolving. The documentation has to be related to the class or method where it is located, reflecting what it does and how it should be maintained. While creating and maintaining software is the job of developers, updating the documentation is often not seen as an important task [1, 2], and thus it is common to find documentation that has not been updated and does not reflect the actual functionality of the method or class where it is used. Because software evolves continuously, this study examines the cohesion between documentation and source code as a factor of documentation quality.

1.1 Background

During the process of developing source code artifacts, developers need to understand the functions of said artifacts by using source code documentation. This kind of documentation includes comments in source code that are used to explain blocks of code such as methods and classes. While good comments help developers with their jobs, the act of documenting is often seen as counterproductive and time-consuming [1, 2], especially for projects developed within Agile principles, which require fast-paced programming and continuous delivery. In other cases, the comments are outdated or difficult to create for legacy code [3], and changes are added in an undisciplined manner [4]. This creates problems for the future implementer and other stakeholders who work with the same code, such as testers and maintainers [2, 5]. Changes in code documentation and some aspects of quality have been studied previously [6, 7]; the research of Schreck, Dallmeier, and Zimmermann studied the quality of documentation by similarity ratios between natural language and source code, among other values [8]. Knowing the previous results, this research focuses on the similarity ratio, using different algorithms, to see how documentation quality evolves through time on a big sample of projects.

1.1.1 Quality Definition

Subjectivity is inherent to the discussion of quality; for instance, whether a text is difficult to understand is not universal to all humans. Metrics should include human insights [10], but that adds complexity to the studies. More objective variables that are related to factors of quality in the documentation are coherence, usefulness, completeness, and consistency, as mentioned by Steidl [11]. Coherence covers how comment and code are related, and is thus measurable. The relation between the comments and the code can be studied as the ability to paraphrase the machine language into natural language in order to give context to the source code. In that case, the documentation should reflect the contents of the code. This was already stated by McBurney and McMillan [12]: source code documentation should use keywords from the source code. For this reason, a way to investigate how the documentation refers to the source code is to measure the similarity between them.

Many algorithms have been developed to check the similarity between two texts. In the research by Steidl, Hummel, and Juergens, the similarity ratio used was the Levenshtein ratio [11]. The Levenshtein ratio defines the distance between two strings by counting the minimum number of operations needed to transform one string into the other [13]. There are two main branches of similarity ratios: string-based and corpus-based measures [13]. Corpus-based measures work best with large texts, which is not the case for this study. String-based measures are better fitted for small strings. This kind of algorithm includes character-based and term-based ratios. Character-based algorithms measure the distance between characters in two strings, like the Levenshtein ratio. For instance, words like "Sam" and "Samuel" would be similar in character measures because they share three characters; however, a term-based ratio would treat them as two different, unrelated words. Term-based similarity is the approach that can show the similarity between the developers' comments and the programming code. In this research, we elaborate on two algorithms for calculating the similarity ratio that have not been used before in this context: the Jaccard ratio and Cosine similarity.

1.1.2 Jaccard ratio and Cosine similarity

The Jaccard index is calculated as the size of the intersection of two sets divided by the size of the union of the sets, where each set contains the words of a string [14]. The Jaccard ratio calculates the similarity between sets of words, meaning that the repetition of words is ignored. Two strings that contain the same set of words will result in a Jaccard index of 1 because the sets overlap completely, while two strings with no words in common will result in an index of 0.

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
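For illustration, a minimal sketch of this set-based computation in Python; the example strings are hypothetical, not taken from the studied projects:

def jaccard(a, b):
    # Jaccard index of the word sets of two strings; repetitions are ignored.
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

print(jaccard("camel case split", "split camel case camel"))  # identical sets -> 1.0
print(jaccard("get url", "set name"))                         # disjoint sets -> 0.0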

Cosine similarity measures the cosine of the angle between two vectors, where the word counts of each analyzed string form a vector [15]. This ratio takes into account the repetition of words to create the required vectors. When two strings have the most similar and repeated words, the cosine of the angle will be closer to 1; in other words, the angle will be 0°. On the contrary, when two strings are different in words and repetitions, the cosine will be 0, so the angle formed by the two vectors will be 90°.

$$C(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

1.1.3 Java Language

Java has a particular syntax that developers have to follow to be able to compile an application. In this case, the most important ones for the research were the comment, class declaration, and method declaration syntax [16].

The comments are the main source of documentation in Java. They follow a clear syntax where a set of symbols written before a string makes the compiler ignore it, while developers can still use it to add extra information. The symbols used for comments in Java are: /*, /**, *, //, */. Even though comments can be used anywhere in the code, Oracle designated the space before class and method declarations as the position to write source code documentation, using Javadoc comments [17]. However, the background of the developer may affect the way source code documentation is written. For instance, Java's predecessor language, C++, uses block comments as source code documentation, and the Java compiler admits those C++ conventions.

The class declaration syntax is structured around two mandatory elements. The first is one of the keywords class, enum, or interface. After the keyword, the Java compiler needs an identifier, the actual name of the class: a set of Unicode characters that may not begin with a number. The declaration is followed by an opening curly bracket that marks the beginning of the contents of the class. Those are the minimum mandatory requirements to declare a class in Java. Additionally, developers may add modifiers before the class declaration, for example public, private, or protected. When there is no modifier present in the code, Java assumes that the class is visible within the corresponding package, but private otherwise [16].

Method declarations are composed of the following elements, in order: modifiers, return data types, identifiers, parameters, exceptions. Similarly to class declarations, there can be one or more modifiers. Unlike the modifiers, there must be only one return type. However, the data type can be any of the built-in Java data types or custom-made data types, in their single form or array form. The parameters' syntax is an input data type and its identifier.

1.2 Related work

The current state of the art shows an opposition of forces between those that consider source code documentation a reliable source of information [1, 18] and those that, while still agreeing on the importance of documentation, try to automate its creation so developers can avoid the task [19]. However, automatic documentation does not go without criticism.

Even though tools have been developed to create source code documentation [3], studies have shown that automatically generated summaries were more inconsistent and had less similarity to the source code [12]. Those results could indicate that source code documentation reaches a better perceived similarity when it is written by the developers. However, one of the biggest complaints about documentation is how badly maintained it is and how most of the time it is out of date. Studies have shown that JavaDoc comments change over time, especially when developers want to elaborate on use-tips in the JavaDoc annotations, but there is no information about how the quality of such documentation changed over the releases [6].

For the case of quality in documentation, the tool JavadocMiner studies this topic, among other variables [7]. However, like the previous study, the research behind JavadocMiner only works with the JavaDoc comments. This line of research was continued by Steidl, Hummel, and Juergens, who developed a project using machine learning on projects in Java and C/C++ that compares different quality aspects through source code exploration, combined with interviews with developers during the research [11].

The study behind the tool QUASOLEDO includes the documentation created by JavaDoc and the comments used in C++, block comments, and inline comments. The QUASOLEDO research studied variables related to the ratio of documented code blocks and the quantity of words used in the documentation. The results pointed out that only 12.1% of the changes made in the code were modifications to both the comment and code content of a block, 2.1% of the commits changed only comments, and 67.4% changed only code; changes to documentation made up only 32% of the total changes in a project [8]. The study points out how sub-optimal this is for development.


1.3 Problem formulation

This study aims to research the cohesion between source code documentation and the code. The use of similarity ratios for all kinds of comments has not been studied before and will add information to the field. We will study the source code and documentation of a group of projects through a range of releases, and the quality and cohesion of said documentation. We plan to use text-similarity metrics as a factor of the quality of the comments in consecutive releases, to find how much the documentation quality changes through time.

1.4 Motivation

This research contributes to showing the behavior of source code documentation for classes and methods from the perspective of string similarity. The results may help engineers, technical writers, and project managers to understand the behavior of software documentation and its evolution, and to better plan for software documentation maintenance according to the results.

1.5 Research Questions and Objectives

After reviewing the literature, we have not found related work that fully explores cohesion and similarity using term-based algorithms as a factor of quality. We will study that aspect of quality and its variation over consecutive releases of a project. The following research questions were used to plan the research:

• RQ 1: What is the proportion of code blocks with and without documentation?
We investigate whether the projects are documented in a large or small proportion.

• RQ 2: What is the proportion of new code blocks with and without documentation?
How much source code is documented at the beginning of its life will show how developers prioritize documentation during implementation.

• RQ 3: Does the code blocks' documentation quality improve across the releases?
We calculate the selected ratios as a factor of quality and study them over the time the releases were made, in order to see any change that may show a relation between documentation quality and release time.

• RQ 4: Is there any relation between lines of code and quality of the documentation?


In order to answer these research questions, the objectives presented in Table 1.1 were formulated:

O1 Study the difference between documented and non-documented code blocks among different releases and in total numbers.

O2 Calculate the cohesion ratios, Jaccard and cosine, of all the code blocks for each release.

O3 Perform statistical analysis to compare two sets of cohesion ratios for methods and classes for a release and its consecutive release.

O4 Perform statistical analysis to compare cohesion ratios with the lines of code of methods and classes.

Table 1.1: Thesis Project Objectives

1.6 Scope/Limitation

The data used for this project was limited to open-source projects, to have free access to the source code. No specific requirements, such as organization or size of the project, were applied in the selection of the projects studied. The only requirement was to have at least 10 releases available per project, which was exceeded in all of the projects. The projects were also recently updated, the oldest release studied having been published in 2017. The programming language studied was Java, as it is among the most widely used programming languages. The natural language was English, because it is the most common language in the technical field and because it shares keywords with Java. Those three factors were used to select the projects used in this research.

1.7 Target group


1.8 Outline

The rest of this report comprises the following sections:

Section 2 - Method: In this section we approach different possible methods to resolve the research questions. We elaborate on natural language processing and explain the threats to validity and reliability we encountered.

Section 3 - Implementation: The implementation phase was done using a Python 3 application that extracted the required data. This section explains how this process was implemented.

Section 4 - Results: The data gathered during implementation is shown without further analysis. The objective is to display objective data that will be used in the analysis.

Section 5 - Analysis: This section gives answers to the research questions by using the results as a source.

2 Method

Previous researchers have shown a pattern for studying documentation in source code: extract the declaration identifiers and their comments, and then process the extracted data. Similar studies followed this pattern, such as Steidl [11], which also used similarity ratios between the comment and the content of a block. The similarity ratios we decided to use as a measure of coherence, i.e., how comment and code are related, were term-based similarity ratios. However, to be able to compare two words in the most accurate way, we decided to use natural language processing to eliminate words that are not related or useful to the research, as well as to lemmatize words, reducing inflected forms to a common base form so that two words can be compared. In the next step, the processed data was used with two different term-based similarity algorithms. While previous work studied the similarity between characters, we decided to study how similar whole terms and words are between comment and source code. These two algorithms were selected among multiple options. First, term-based algorithms were better fitted because we understood that similarity should capture coherence: code and comments that share vocabulary are an indicator of a meaningful code block [11], and term-based algorithms fit that requirement. One single algorithm could answer these questions, but the difference between the Jaccard and Cosine similarity ratios concerning word repetitions made it clear that we could use both, covering more possibilities.

The planning for the method displayed in Figure 2.1 shows how the methodology was applied. After reviewing already created tools and packages, we decided to create a tool that would fit our particular requirements and calculations. The tool requirements were to read all the files of a project, extract comments and their code blocks, use a natural language process to parse the extracted strings, calculate both similarity ratios and size in lines of code, and finally save the raw data in a CSV (Comma Separated Values) file for the subsequent data analysis. Because the detection of code blocks included detecting classes and methods, we included that division to extract more refined data. The tool was tested with test files covering different cases that could create false positives, such as conditional blocks, nested classes and methods, throw statements, or LaTeX expressions.

Figure 2.1: Flowchart of the method: projects are selected; each source file is read; code blocks are extracted from Java files; the raw data is then analyzed for documented blocks, new blocks, and cohesion against time and size.


2.1 Natural Language Processing

Comments and some parts of the source code are written by developers in natural language; for instance, identifiers and variables are mostly written in common English. The natural language selected for this research was English, segmented by spaces. To get two strings that can be compared, the strings have to be parsed down to their most meaningful words. For that, we created a set of stop words from common English words, like prepositions and pronouns, that give little to no meaning to a string, so they were removed from the study. For the particular case of the computer language, keywords and typical words from the Java language were removed to extract the relevant words from the string [20].
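As an illustration, a minimal sketch of such a filter in Python; the word lists shown are abbreviated examples, not the full lists from Appendix C:

# Abbreviated example lists; the full stop-word lists are in Appendix C.
ENGLISH_STOP_WORDS = {"the", "a", "an", "of", "to", "in", "it", "this"}
JAVA_KEYWORDS = {"public", "private", "static", "void", "return", "new", "class"}
STOP_WORDS = ENGLISH_STOP_WORDS | JAVA_KEYWORDS

def remove_stop_words(text):
    # Keep only the words that carry meaning for the similarity comparison.
    return [w for w in text.lower().split()
            if w not in STOP_WORDS and not w.isdigit()]

print(remove_stop_words("Returns the index of the first element"))
# -> ['returns', 'index', 'first', 'element']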

2.2 Reliability and Validity

To have a good representation in the results, this study used a total of ten projects from open-source organizations, listed in Appendix A. The projects vary in owner organization and total size. After the selection of the projects, the research uses a total of 10 consecutive releases for each project, which means that this research includes 100 releases. Despite that, the results may not be representative, since we are using only open-source projects. Private repositories may behave differently and may include different variables that affect the maintenance of source code documentation.

While Oracle names the comments created for JavaDoc as the main source of documentation in source code, this study did not omit other types of comments. Developers come from different backgrounds and may use different code conventions. In the particular case of Java, this research accepted the code conventions of C++ and included inline and block comments written above class and method declarations as source code documentation. The main reason is that, as a descendant of C++, Java still admits C++-style comments in its language. However, the results may differ when using only Oracle JavaDocs.

To ensure reliability, this research provides information on the implementation and access to the raw data extracted from the projects, as well as the data resulting from further analysis, in Appendix A.

2.3 Ethical Considerations

3 Implementation

The particularities of the requirements for this research made it preferable to create our own tool, one that would include all kinds of comments written above class and method declarations as source code documentation. Other packages reviewed had difficulties reading non-JavaDoc comments or including non-declarative comments as documentation. By doing our own implementation, we ensured that our definition of documentation, which was the main requirement, was met. The application created for this research was developed in Python 3 with the packages numpy for analysis and matplotlib for plotting the results. The database used for this application is based on CSV (Comma Separated Values) files.

The application required a database that mapped the location of all the repositories studied. It then iterated over the projects' source code and ran the calculations required to get the information needed for the research, as seen in Figure 3.1. After the calculations, the results were written to multiple CSV files, one per release. The files included the basic identification for each block, the identification of its parent block (for classes at the top of the file tree, its own name), and the parent block location. The information about the parent block was used to avoid problems with name duplication when comparing two classes or methods that had the same name. When a block finished, that is, when the application encountered a closing bracket, the application saved the last part of the information: the lines of code of the closed block, the declaration comments, and the content of the block. The comments and contents were then run through the natural language process of eliminating stop words, lemmatizing the strings and, in the case of identifiers, separating words written in several naming conventions. After this process, the two strings were ready to go through the similarity algorithms. The results were saved in the CSV file together with the block name, the Jaccard ratio, and the Cosine ratio.

To ensure that the application worked properly, it ran on a test file that included the most problematic comments and structures that could be hard to parse. For instance, conditional blocks could be mistaken for methods without modifiers or return types, so a group of keywords was included to differentiate a method declaration from conditionals, throw statements, or nested lambda methods. After running the test file, the smallest project was run several times to verify that no keyword or illegal character was detected as a method declaration by the application. For instance, the example in Listing 3.1 should not be detected:

MyClass
    .DoSomething(); // This line should be avoided

Listing 3.1: Comment example in Java

Figure 3.1: Flowchart of the line-by-line parsing algorithm: each line is checked in turn for being a comment, a class declaration, a method declaration, or part of a body; comments are appended to the last block, an opening bracket adds a new block, and a closing bracket updates the last block's data, calculates the similarity, and saves and pops the block.

3.1 Extraction

The process followed to create the resulting database was an algorithm that reads the Java files line by line and proceeds with calculations. This algorithm used multiple string operations and regex operations to recognize whether a line was a block declaration, its type, or a comment, as seen in Figure 3.1. In the first instance, the application gathered all the Java file paths and iterated over them to get an array of strings that were the contents of each Java file. For each string, the application decided if it was a comment, a block declaration (class or method), or the contents of a block. The decision was made by using regex in the case of class declarations and comments, while the decision for method declarations required multiple string operations. Any line matching none of those cases was treated as the contents of a block.
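A minimal sketch of this decision loop in Python; the patterns are simplified placeholders (the real ones are shown in Listings 3.3 and 3.5), and the directory name is hypothetical:

import re
from pathlib import Path

# Simplified placeholder patterns; Listings 3.3 and 3.5 show the real ones.
COMMENT_RE = re.compile(r'^\s*(/\*\*?|\*|//|\*/)')
CLASS_RE = re.compile(r'\b(class|interface|enum)\s+(?P<id>\w+)')

def classify_line(line):
    # Decide whether a source line is a comment, a class declaration, or content.
    if COMMENT_RE.match(line):
        return "comment"
    if CLASS_RE.search(line):
        return "class"
    return "content"  # method detection needs the string operations of Section 3.1.3

for path in Path("project/src").rglob("*.java"):
    for line in path.read_text(encoding="utf-8").splitlines():
        kind = classify_line(line)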

3.1.1 Extracting comments

Comments are defined by the Java language as all those lines or blocks of lines that include the symbols /*, /**, *, //, */. For instance, the code in Listing 3.2 shows the types of comments in Java:

/**
 * This is a Javadoc.
 * @returns void
 */

// This is an inline comment example.

/*
 * This is a comment block.
 */

/*
This kind of comment is also valid for Java.
*/

    //Any number of blank spaces before the comment are also valid.

Listing 3.2: Comment example in Java

The algorithm checked whether the last comment saved opened a block, meaning the current line is part of that comment block, or whether, on the contrary, the last saved comment was an inline comment or included a closing comment-block symbol. The final regex used for detecting comments in strings is presented in Listing 3.3:

'^\s*((\*\s*/)+|(/\s*\*)|(/\s*\*\s*\*)|(\*\s*)|(/\s*/)).*'

Listing 3.3: Regex for comments
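A short demonstration of applying this pattern in Python; the sample lines are hypothetical:

import re

# The comment regex from Listing 3.3.
COMMENT_RE = re.compile(r'^\s*((\*\s*/)+|(/\s*\*)|(/\s*\*\s*\*)|(\*\s*)|(/\s*/)).*')

for line in ['// inline comment', '  * body of a block comment', 'int x = 0;']:
    print(bool(COMMENT_RE.match(line)), repr(line))
# True '// inline comment'
# True '  * body of a block comment'
# False 'int x = 0;'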

3.1.2 Extracting classes

Considering that class declarations in Java have a formal syntax, the extraction of data from them was similar to the comment extraction. A string has to match a regex built around the class syntax in order to be used for data extraction. In this particular case, the only requirement for a class declaration is to include one of the class keywords ('class', 'enum' or 'interface') and an identifier. Additionally, modifiers, superclasses, and interfaces can be added to the class declaration, but they are not mandatory. The list of modifiers is a closed group of keywords that can be used in a regex, because there is no possibility to create personalized modifiers. As a result, finding the identifier in a class declaration is straightforward: the identifier of a class is the word located after one of the class keywords. Anything declared after the identifier was consumed by a greedy operator in the regex, because it gives no more information about the class identifier, as demonstrated in Listing 3.4:

class ClassExample {
    // This is a valid class declaration
}

public final class ClassExample extends SuperClass implements IClass {
    // This is also a valid class declaration
}

Listing 3.4: Class example in Java

In addition, regex in Python can be used to extract parts of a string. In this case, we wanted to extract the identifier of the class from the string. For that reason, the word following the class keyword was extracted by using a regex named group. The tag (?P<id>...) marks a part of the pattern with a tag name, in this case 'id', that can be referenced to extract its contents. The final regex used is shown in Listing 3.5:

'^(Annotation|public|protected|private|static|abstract|final|native|synchronized|transient|volatile|strictfp)* (class|interface|enum) (?P<id>[a-zA-Z\_0-9]+)'

Listing 3.5: Regex for class declarations

3.1.3 Extracting methods

In the case of a method, more precise string manipulation was required. The only requirements for a method to be accepted are an identifier, a pair of parentheses, and a body. Without modifiers and return types, the application used the default values: package visibility for the modifier and void for the return type. It is important to point out that the return type does not only include the Java data types, but also self-made data types imported into the Java project. Moreover, the open bracket that opens the body of the method is not required to be on the same line as the method declaration, and some coding conventions and developers place it below the method declaration, as in Listing 3.6 below:

class ClassExample {

    public static myDataType myMethod (myDataType arg1, int arg2 )
    {
        // This is a valid method
    }

    myOtherMethod(){
        // This is also a valid public method that returns void
    }
}

Listing 3.6: Class example in Java

The syntax of methods is similar to the syntax of conditional and loop blocks. Without constraints before the identifier and the parameters, an 'if' statement can create false positives. One possibility was to check the parent class, but because of the existence of nested methods and classes, this was difficult to verify. The solution found for this issue was to create a list of keywords for loops, conditional blocks, test asserts, and symbols that cannot appear in method declarations. For instance, it is not possible to have arithmetic operations in a method declaration, so symbols like addition and equals mark a string as not being a method declaration. A regex could be a solution, but the flexibility of zero or many modifiers, self-made data types, and the like made such a regex possible but slow to process. Some of the problematic pieces of code found followed a syntax similar to Listing 3.7:

class FalsePositives {

    public static myDataType myMethod (myDataType arg1, int arg2 )
    {
        if (arg1) {
            // This line should not be accepted
        }
    }

    myOtherMethod(){
        assertThat(test) //This line should not be accepted
        method.something() // This line should not be accepted
    }

    hashCodeImpl(Object content, String mimeType, String language,
            URL url, URI uri, String name, String path, boolean internal,
            boolean interactive, boolean cached, boolean legacy) {

        //This method should be accepted
    }
}

Listing 3.7: Class example in Java

The solution that avoided the most false positives while also avoiding regex was a process that splits, trims, and extracts the needed substring, as seen in Figure 3.2. The requirement was to get the identifier, parentheses, and parameters of the string, ignoring modifiers, return data types, and any other information to the right of the parameters. For that, the string was first processed in reverse, finding the closing parenthesis of the parameters; any information between the closing parenthesis and the end of the string is ignored. After this trimming of the last part of the string, the string was iterated from the beginning to find the first open parenthesis. The segment from the first character to the first parenthesis is the part of the declaration that includes modifiers, data types, and the identifier. When creating an array of the words in that segment with the method 'split', the last word of the array is the identifier. The identifier plus the contents of the parameters form the final parsed string with the information required.

Figure 3.2: Example of trimming a method declaration: from 'public Collection<String> getUrlPrefixes (String a, String b) // comment', the trailing part after the closing parenthesis is trimmed, and splitting the head yields the identifier plus the parameter list: 'getUrlPrefixes (String a, String b)'.
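A minimal Python sketch of this reverse-trim-and-split approach; the function name is hypothetical:

def extract_method_signature(line):
    # Keep only the identifier and the parameter list of a declaration line.
    end = line.rfind(')')              # ignore anything after the parameter list
    if end == -1:
        return None
    head, sep, params = line[:end].partition('(')
    if not sep or not head.split():
        return None
    identifier = head.split()[-1]      # last word before '(' is the identifier
    return identifier + '(' + params + ')'

print(extract_method_signature(
    'public Collection<String> getUrlPrefixes (String a, String b) // comment'))
# -> getUrlPrefixes(String a, String b)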


3.2 Cohesion calculation

Before the cohesion calculations, a normalization of the two strings to be studied was required. For that, the two strings went through a natural language process. After the normalization of the strings, the cohesion ratios were calculated.

3.2.1 Parsing and normalizing strings

When encountering a closing bracket, the application knew that a code block was closing and, for that reason, that the block was ready to have its comments and contents processed. The research planned two cohesion ratios to calculate: Jaccard similarity and Cosine similarity. Before calculating the ratios, it was necessary to pass the comments and contents through a natural language process (NLP from now on) to avoid bloating the calculations with common words; however, even before processing a string, it was required to parse the method and class identifiers.

Until this moment, the comments and contents of a block were kept in lists. For further calculations, the lists were combined into one string for comments and one string for content. After that, the strings still required small adjustments before being used for calculations.

Naming a class or a method is not standardized, nor is it mandatory to follow any guideline. Developers have total freedom to name their code as they please; even so, some common naming conventions are used among developers. For this research, three naming conventions were used to divide multi-word method and class identifiers: camel case, Pascal case (also known as upper camel case), and underscores, as exemplified in Listing 3.8. All identifiers that followed those three conventions were sliced into multiple words. In doing so, the number of words in the content increased.

class Naming {
    public void camelCase(){
        // Became: camel case
    }

    public void DromedaryCase(){
        // Became: dromedary case
    }

    public void under_score(){
        // Became: under score
    }
}

Listing 3.8: Naming convention examples in Java
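A minimal Python sketch of this identifier splitting; the function name is hypothetical:

import re

def split_identifier(identifier):
    # Split camelCase, PascalCase, and under_score identifiers into words.
    words = identifier.replace('_', ' ')                   # under_score
    words = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', words)  # camelCase boundaries
    return words.lower()

for name in ['camelCase', 'DromedaryCase', 'under_score']:
    print(split_identifier(name))
# camel case
# dromedary case
# under score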

For this research, the package used to process natural language was NLTK, version 3. With NLTK, the two strings were parsed to remove words that are common in the English language, numbers, pronouns, and prepositions, and to lemmatize the remaining words. An example of lemmatization is changing 'numbers' to 'number'. In this way, the two strings were normalized as much as possible to include only the most important and relevant words in the same form. When this step was complete for both strings, they were ready for the calculation of their cohesion ratios.
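A minimal sketch of this normalization step with NLTK, assuming the 'stopwords' and 'wordnet' corpora have been downloaded:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download('stopwords'); nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
english_stops = set(stopwords.words('english'))

def normalize(text):
    # Drop English stop words and numbers, lemmatize the remaining words.
    return [lemmatizer.lemmatize(w) for w in text.lower().split()
            if w not in english_stops and not w.isdigit()]

print(normalize('returns the numbers of the lines'))
# -> ['return', 'number', 'line']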

3.2.2 Jaccard algorithm

The Jaccard similarity works with sets of words, no repetitions allowed, so the Python algorithm was defined as in Listing 3.9:

def calc_Jaccard(self):
    comment_set = set(self.comment.split())
    content_set = set(self.content.split())

    intersection = comment_set.intersection(content_set)
    denominator = (len(comment_set) + len(content_set) - len(intersection))

    if not denominator:
        self.Jaccard = 0.0
    else:
        self.Jaccard = float(len(intersection)) / denominator

Listing 3.9: Python Jaccard similarity algorithm

3.2.3 Cosine algorithm

The Cosine similarity algorithm was also implemented in Python by using the mathematical formula described in Section 1.1.2, resulting in the algorithm in Listing 3.10:

import math  # module-level import used by the method below

def calc_cosine(self):
    comment_vector = self.text_to_vector(self.comment)
    content_vector = self.text_to_vector(self.content)

    intersection = set(comment_vector.keys()) & set(content_vector.keys())

    numerator = sum([comment_vector[x] * content_vector[x]
                     for x in intersection])

    sum1 = sum([comment_vector[x] ** 2 for x in comment_vector.keys()])
    sum2 = sum([content_vector[x] ** 2 for x in content_vector.keys()])

    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        self.cosine = 0.0
    else:
        self.cosine = float(numerator) / denominator

Listing 3.10: Python cosine similarity algorithm
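Listing 3.10 relies on a text_to_vector helper that is not shown in the report; a minimal sketch consistent with the calls above, assuming word-count vectors:

from collections import Counter

def text_to_vector(self, text):
    # Map a normalized string to a word-count vector (word -> frequency).
    return Counter(text.split())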

3.3 Results of the extraction

After the cohesion calculations, the information of each code block was saved in the database with the following fields: project name, release name, release date, identifier, type (class or method), lines of code, owner (parent class), Jaccard ratio, Cosine ratio, comments (filtered string), and content (filtered string). The data obtained was used to perform statistical studies using Numpy and spreadsheets.

The analysis of the results was documented in three CSV files, produced in three steps: one for the number of documented blocks, one for average and percentile calculations, and a final variation-ratio document. Each project had 10 releases, so the output of raw data was 10 CSV files with classes, methods, and similarity ratios. By comparing the names and parents of each block, we calculated, for each release, how many blocks were documented and how many were not, as well as how many were newly added and how many of those were documented at their creation. A second step calculated the percentiles of the lines of code of the blocks by type (class or method). We used the library Numpy for Python to make the calculations. The percentiles documented in the result CSV file were percentiles 0, 5, 25, 50, 75, 95, and 100. Finally, to know how the similarity ratios evolved through the releases, we calculated the average of the similarity ratios of each release. We used four discrete groups for the size of the blocks, based on percentiles 25, 50, 75, and 95. In total, we got the average of the Jaccard and cosine ratios, for each type of block, for 4 percentiles, for 10 releases.

To make the data more readable, and because the similarity ratios by themselves were not interesting for the research, but rather their variation over time, one more calculation step was taken. The final CSV file was modified to calculate the variation ratio values for each release using the formula $v_n / v_{n-1}$, where $v_n$ indicates the similarity ratio value for release $n$ and $v_{n-1}$ the value for the previous release.
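A minimal sketch of these two aggregation steps with NumPy; the arrays hold hypothetical values, while the real input comes from the per-release CSV files:

import numpy as np

# Hypothetical lines-of-code values; percentiles define the four size groups.
lines_of_code = np.array([3, 8, 15, 40, 120, 300])
print(np.percentile(lines_of_code, [25, 50, 75, 95]))

# Hypothetical per-release average similarity ratios for one block type.
averages = np.array([0.31, 0.30, 0.32, 0.29, 0.30])

# Variation ratio v_n / v_(n-1) between consecutive releases.
variation = averages[1:] / averages[:-1]
print(variation)  # values close to 1.0 mean little change between releases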

4 Results

The raw data used in this research was uploaded to the repository referenced in Appendix A. It contains the information of all code blocks, their length, and their cohesion ratios. Due to the size of the database, it is not directly included in this report.

This study used a group of ten open-source projects and ten consecutive releases per project. In total, 100 releases were studied. The projects are diverse in length and ownership to provide a better representation of the data. More information about the projects used and all the data used, raw and processed, can be found in Appendix A.

4.1 RQ 1: What is the proportion of code blocks with and without documentation?

To answer this question, we needed to find all the code blocks that contain any documentation or comment. The blocks were divided by type of block: class or method. The results calculated over the projects defined before are shown in Table 4.1. The raw total numbers of documented code blocks can also be found in the repository referenced in Appendix A.

Project        Total documented   Classes   Methods   Total blocks
Maven          19.54%             9.75%     9.78%     6871
Jmeter         34.57%             6.59%     27.98%    13940
Che            21.78%             6.95%     14.78%    14608
Tomcat         22.39%             5.87%     16.53%    28882
Springboot     13.86%             8.15%     5.71%     35666
CXF            8.56%              2.44%     6.12%     58581
Guava          11.41%             2.59%     8.82%     61671
Graal          9.24%              2.96%     6.27%     90915
Elasticsearch  14.18%             3.73%     10.45%    128208
Netbeans       21.45%             7.06%     14.39%    451083
Average        17.70%             5.61%     12.09%

Table 4.1: Percentage of code documented by type of block and total

4.2 RQ 2: What is the proportion of new code blocks with and without documentation?

Figure 4.1: Percentage of new documented blocks. Bar values per project (classes / methods): maven 63.64 / 5.36, jmeter 0.0 / 50.0, che 14.03 / 32.89, tomcat 29.67 / 14.21, springboot 24.37 / 2.74, cxf 10.78 / 2.86, guava 26.71 / 11.14, graal 12.27 / 6.54, elastic 21.42 / 6.42, netbeans 38.01 / 7.41.

4.3 RQ 3: Does the code blocks documentation quality improve across the releases?

A statistical study of the raw results was done to get the outcome needed to answer the third research question: does the code blocks' documentation quality improve across the releases? For each project, we averaged the Jaccard and Cosine ratios using discrete sizes at percentiles q25, q50, q75, and q95, excluding the 5% lowest and highest. The aim of this research is to find how the quality of the documentation evolves over time; for this reason, we used the variation ratio of the results instead of the similarity values, to improve readability and clarity. The similarity ratios were translated to their variation ratio using the formula $v_n / v_{n-1}$, where $v_n$ indicates the similarity ratio value for release $n$ and $v_{n-1}$ the value for the previous release.

Figure 4.2: Evolution of similarity rates for project Maven. Four panels over releases 0-9 show the Jaccard variation for classes and for methods and the cosine variation for classes and for methods, each plotted for size percentiles q25, q50, q75, and q95.

The results of the evolution and variance can be seen in Appendix B and will be used to continue the study for the next research question in Table 4.2 and Table 4.3.

4.4 RQ 4: Is there any relation between lines of code and quality of the documentation?

The results include cases where the similarity ratios decreased over time. The ten projects studied resulted in average variation ratios of their quality that are presented in Figure 4.3, Figure 4.4, and Table 4.4.

Block type     Classes                          Methods
Percentile     25      50      75      95       25      50      75      95
Maven          1.0589  1.0212  1.0323  1.0242   0.9967  1.0073  1.0323  0.997
Jmeter         1.0     0.994   1.0001  1.0002   1.0001  1.0     1.0001  0.9997
Che            1.0003  1.0023  0.9999  0.9999   1.0015  1.0007  0.9999  0.9991
Tomcat         0.9981  0.9997  0.9997  0.9991   0.9999  1.9991  0.9997  0.9988
Springboot     1.0025  0.9932  1.0001  0.9966   0.999   0.9888  1.0001  0.993
Cxf            0.9957  0.9966  1.0033  0.9975   0.9969  0.9987  1.0033  0.9975
Guava          0.9913  0.9983  0.9993  0.9983   0.9965  1.0026  0.9993  0.9953
Graal          0.9952  0.9927  0.9998  1.0031   1.0021  1.0     0.9998  0.999
Elasticsearch  1.0002  0.9974  0.9959  0.9984   0.9942  0.9942  0.9959  0.9957
Netbeans       0.997   0.9996  1.0001  1.0      0.9992  0.9994  1.0001  0.9998

Table 4.2: Jaccard variation ratio by project, block type, and size percentile

Figure 4.3: Average variation for Jaccard ratios at percentiles 25, 50, 75, and 95 (classes: 1.00419, 1.00004, 1.00305, 1.00155; methods: 0.99861, 0.99918, 1.00305, 0.99749).

Block type     Classes                          Methods
Percentile     25      50      75      95       25      50      75      95
Maven          1.0461  1.0142  1.0315  1.0402   0.9968  1.0037  0.9948  0.997
Jmeter         1.0     0.9994  1.0     1.0      1.0001  1.0     1.0003  0.9997
Che            0.9993  1.001   1.0     0.9984   1.0015  1.0006  1.0015  0.9989
Tomcat         0.9983  0.9998  0.9994  0.9982   1.0     1.0001  0.9994  0.9993
Springboot     1.0018  0.9924  1.0006  0.997    0.9978  0.9889  0.9989  0.9989
Cxf            0.9957  0.9985  1.0017  0.9976   0.997   0.9991  1.0008  0.9976
Guava          0.9915  0.9992  1.0002  0.9984   0.9969  1.0017  0.9978  0.9951
Graal          0.9972  0.9914  1.0002  1.0036   1.0037  1.0     0.9933  0.9976
Elasticsearch  1.0     0.997   0.9961  0.9979   0.9942  0.9937  0.9947  0.9955
Netbeans       0.9996  0.9995  0.9999  1.0      0.9992  0.9994  0.9995  0.9997

Table 4.3: Cosine variation ratio by project, block type, and size percentile

Figure 4.4: Average variation for Cosine ratios at percentiles 25, 50, 75, and 95 (classes: 1.00295, 0.99924, 1.00296, 1.00313; methods: 0.99872, 0.99872, 0.9981, 0.99737).

Projects       Cosine variation   Jaccard variation   Project size (LOC)
Maven          1.0091             1.0056              7,167
Jmeter         1.0000             1.0000              13,943
Che            1.0002             1.0004              19,951
Tomcat         0.9994             0.9995              29,269
Springboot     0.9965             0.9965              41,318
Cxf            0.9984             0.9986              61,264
Guava          0.9979             0.9978              63,893
Graal          0.9993             0.9985              109,920
Elasticsearch  0.9955             0.9952              146,415
Netbeans       0.9996             0.9995              452,863
Average        0.9996             0.9992

Table 4.4: Average variation ratios and project sizes

5 Analysis

According to the results gathered from the 100 releases studied, we see neither a special pattern nor a distribution that may affect the quality of the documentation over time. The variation of the quality of the documentation remains close to 1.0, meaning that there is nearly no change in our similarity ratios.

RQ 1: What is the proportion of code blocks with and without documentation? The extracted data was divided to count how many of the code blocks had documentation, by code type, class, or method. The results were averaged for each release and over all releases to have a single data point for classes, methods, and in total, creating a percentage of the code that was documented, as displayed in Table 4.1. The results show a tendency to document the methods more than the classes. This could show a pattern where developers tend to document functionality over objects. Only an average of 5.61% of the classes have been documented, against 12.09% of the methods, while the total proportion of documented blocks was on average 17.70%. This shows a low intention of documenting the projects.

RQ 2: What is the proportion of new code blocks with and without documentation? The research covered 10 consecutive releases, so we had data over time to work with. For each release, we looked for code blocks that were not already in the previous release and observed whether those blocks were added to the project with documentation.

The general tendency in the results shows that classes tend to be more documented than methods at the beginning of their life, as presented in Figure 4.1. However, as seen in the previous question, during their lifetime methods overtake classes and end up being the majority of the documented blocks. This could be understood as a tendency of developers to document their classes when they first create them; over time, they keep adding comments mainly to methods. This shows how documentation does not happen in one step: classes and methods are not documented at the same moment, and developers add documentation over time.

RQ 3: Does the code blocks' documentation quality improve across the releases? The general quality of the documentation decreases over time. On average, the variation ratios of the cosine and Jaccard similarities decrease: the comments get less similar to the code, both as sets of words and counting repetitions. However, the ratios are close to 1.0, so even though there is a deterioration in quality, it is small, as can be seen in Appendix B. The resultant average over all blocks is 0.9996 for Cosine and 0.9992 for Jaccard. This shows a decrease in quality, but one so small that it could be assumed that there is no variation.

6 Discussion

The literature shows concern about the quality of source code documentation. There is a paradox between developers' complaints about how documentation is poorly maintained [5] while they are the ones responsible for that maintenance. Multiple studies have found that documentation only changes significantly when big changes are made in a project [6], and a generally low performance of the existing documentation on some aspects of quality [8, 11]. This research confirmed, using a bigger data set than previous research, that the quality of the documentation does not improve over time. Whatever the cohesion ratio of a project is at the beginning of the study, it does not show improvement but a small decrease in quality over time. A low quantity of documentation was also shown in the work by Steidl [11], where the five projects studied had between 5 and 20% of the classes documented and between 28 and 49% of the methods documented. The results presented in Table 4.1 also show a low tendency to document in general, but especially in regard to classes. We also confirmed that developers do not implement code together with its documentation, but rather implement first and add documentation, especially on methods, in later steps. As seen in Figure 4.1, classes are usually the most commented blocks when they are newly added to a project, but Table 4.1 shows how this changes in favor of methods. That shows how developers first document classes and add method documentation in later releases. However, limiting the sample to open-source projects may lead to unrepresentative results; it is possible to find different results if this research is continued with private projects.

7 Conclusion

This research aimed to discover how source code documentation evolves through time. For that reason, we formulated four research questions, which led to four objectives used to answer them. The data set used for the research included 100 releases from 10 open-source projects. The first and second research questions asked what the proportion of code blocks with and without documentation is, and what the proportion of new code blocks with and without documentation is. With that in mind, the aim of Objective 1 was to study the difference between documented and non-documented code blocks among different releases and in total numbers. The results showed a higher number of methods documented in comparison with classes, and low documentation overall, with an average of 17.70% of the code blocks documented. We also confirmed that the process of documenting source code happens in two steps: first documenting classes and later documenting methods. The third research question asked whether the quality of the code blocks improves across the releases. With this in mind, we planned two objectives. Objective 2 led us to calculate the cohesion ratios, Jaccard and cosine, of all the code blocks for each release. The aim of Objective 3 was to perform statistical analysis on the cohesion ratios. The results pointed out that there is no improvement, but a slight decrease in the quality of the documentation. The last research question asked whether there is any relation between lines of code and quality of the documentation, which was addressed with the last objective. Objective 4 required us to perform a statistical analysis to compare cohesion ratios with the lines of code of methods and classes. The results showed no relationship between the size of the block and the cohesion ratios.

The research uses a large data set, but all the projects used are open-source projects, which limits the results to the particularities of our data set. More extensive work could be done by studying private repositories, where other variables may affect the maintenance of the documentation, for example project deadlines.

7.1 Future work


References

[1] I. Sommerville, "Software documentation," in Software Engineering, vol. 2: The Supporting Processes, R. Thayer and M. Christensen, Eds. Wiley-IEEE, 2001, pp. 143–154. [Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.114.8853

[2] T. C. Lethbridge, J. Singer, and A. Forward, "How software engineers use documentation: The state of the practice," IEEE Software, vol. 20, no. 6, pp. 35–39, 2003. [Online]. Available: https://doi.org/10.1109/MS.2003.1241364

[3] L. Moreno, A. Marcus, L. Pollock, and K. Vijay-Shanker, "JSummarizer: An automatic generator of natural language summaries for Java classes," in Proceedings of the 21st International Conference on Program Comprehension, ser. ICPC '13. IEEE, 2013, pp. 230–232. [Online]. Available: https://doi.org/10.1109/ICPC.2013.6613855

[4] K. D. Welker, P. W. Oman, and G. G. Atkinson, "Development and application of an automated source code maintainability index," Journal of Software Maintenance: Research and Practice, vol. 9, no. 3, pp. 127–159, 1997. [Online]. Available: https://doi.org/10.1002/(SICI)1096-908X(199705)9:3<127::AID-SMR149>3.0.CO;2-S

[5] I. Sommerville, Software Engineering, ser. International Computer Science Series. Pearson, 2011. [Online]. Available: https://books.google.se/books?id=l0egcQAACAAJ

[6] L. Shi, H. Zhong, T. Xie, and M. Li, "An empirical study on evolution of API documentation," in Fundamental Approaches to Software Engineering, D. Giannakopoulou and F. Orejas, Eds. Springer Berlin Heidelberg, 2011, pp. 416–431. [Online]. Available: https://doi.org/10.1007/978-3-642-19811-3_29

[7] N. Khamis, J. Rilling, and R. Witte, "Assessing the quality factors found in in-line documentation written in natural language: The JavadocMiner," Data & Knowledge Engineering, vol. 87, pp. 19–40, 2013. [Online]. Available: https://doi.org/10.1016/j.datak.2013.02.001

[8] D. Schreck, V. Dallmeier, and T. Zimmermann, "How documentation evolves over time," in Proceedings of the Ninth International Workshop on Principles of Software Evolution, ser. IWPSE '07. ACM, 2007.

[9] American Society for Quality. (2020, Feb. 13) Quality glossary. [Online]. Available: https://asq.org/quality-resources/quality-glossary/q

[10] A. Wingkvist, M. Ericsson, R. Lincke, and W. Löwe, "A metrics-based approach to technical documentation quality," in Proceedings of the 2010 Seventh International Conference on the Quality of Information and Communications Technology, ser. QUATIC '10. IEEE, 2010, pp. 476–481. [Online]. Available: https://doi.org/10.1109/QUATIC.2010.88

[11] D. Steidl, B. Hummel, and E. Juergens, "Quality analysis of source code comments," in Proceedings of the 2013 21st International Conference on Program Comprehension, ser. ICPC '13. IEEE, 2013, pp. 83–92. [Online]. Available: https://doi.org/10.1109/ICPC.2013.6613836

[12] P. W. McBurney and C. McMillan, "An empirical study of the textual similarity between source code and source code summaries," Empirical Software Engineering, vol. 21, pp. 17–42, 2016. [Online]. Available: https://doi.org/10.1007/s10664-014-9344-6

[13] W. H. Gomaa and A. Fahmy, "A survey of text similarity approaches," International Journal of Computer Applications, vol. 68, pp. 13–18, 2013. [Online]. Available: https://doi.org/10.5120/11638-7118

[14] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura," in Bulletin de la Société Vaudoise des Sciences Naturelles, 1901, pp. 547–579. [Online]. Available: https://ci.nii.ac.jp/naid/10019961020/en/

[15] A. Singhal, "Modern information retrieval: A brief overview," IEEE Data Engineering Bulletin, vol. 24, Jan. 2001. [Online]. Available: http://sites.computer.org/debull/a01dec/a01dec-cd.pdf#page=37

[16] Oracle. (2020, Feb. 13) Java syntax. [Online]. Available: https://docs.oracle.com/javase/specs/jls/se7/html/jls-18.html

[17] Oracle. (2020, Feb. 13) Java code conventions. [Online]. Available: https://www.oracle.com/java/technologies/javase/codeconventions-comments.html

[18] J. Raskin, "Comments are more important than code," Queue, vol. 3, no. 2, pp. 64–65, 2005.
A Appendix — Selection of projects

The projects used in this study have been selected from open-source repositories. These include five projects from the Apache organization and five from diverse sources. The projects are ordered by size, from the smallest (Apache Maven) to the biggest (Apache Netbeans).

All the raw data extracted from these projects can be accessed in the following repository: https://gitlab.com/HelenaTevar/documentation-evolution

Project: Apache Maven
Repository: https://github.com/apache/Maven
Description: Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.
Releases:
3.5.0-beta-1 - 20 03 2017
3.5.0 - 03 04 2017
3.5.1 - 10 09 2017
3.5.2 - 18 10 2018
3.5.3 - 14 02 2018
3.5.4 - 17 06 2018
3.6.0 - 24 10 2018
3.6.1 - 04 04 2019
3.6.2 - 27 08 2019
3.6.3 - 19 11 2019

Project: Apache Jmeter
Repository: https://github.com/apache/jmeter
Description: Apache JMeter may be used to test performance both on static and dynamic resources and Web dynamic applications. It can be used to simulate a heavy load on a server, group of servers, network or object to test its strength or to analyze overall performance under different load types.
Releases:
5.2-rc1 - 07 10 2019
5.2-rc2 - 09 10 2019
5.2-rc3 - 15 10 2019
5.2-rc4 - 18 10 2019
5.2-rc5 - 29 10 2019
rel-v5.2 - 03 11 2019
5.2.1-rc1 - 12 11 2019
5.2.1-rc4 - 16 11 2019
5-2-1rc5 - 20 11 2019
rel-v5.2.1 - 24 11 2019

Project: Eclipse Che

Repository: https://github.com/eclipse/che

Description: Next-generation container development platform, developer workspace server and cloud IDE. Che is Kubernetes-native and places everything the developer needs into containers in Kube pods, including dependencies, embedded containerized runtimes, a web IDE, and project code.


Project: Apache Tomcat

Repository: https://github.com/apache/tomcat

Description: The Apache Tomcat software is an open source implementation of the Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket technologies. The Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket specifications are developed under the Java Community Process.

Releases: 9.0.22 - 04 07 2019; 9.0.23 - 14 08 2019; 9.0.24 - 14 08 2019; 9.0.25 - 16 09 2019; 9.0.26 - 16 09 2019; 9.0.27 - 07 10 2019; 9.0.28 - 14 11 2019; 9.0.29 - 16 11 2019; 9.0.30 - 07 12 2019; 9.0.31 - 05 02 2020

Project: Springboot - Spring

Repository: https://github.com/spring-projects/spring-boot

Description: Spring Boot makes it easy to create Spring-powered, production-grade applications and services with absolute minimum fuss. It takes an opinionated view of the Spring platform so that new and existing users can quickly get to the bits they need.


Project: Apache CXF

Repository: https://github.com/apache/cxf

Description: Apache CXF is an open source services framework. CXF helps you build and develop services using frontend programming APIs, like JAX-WS and JAX-RS. These services can speak a variety of protocols such as SOAP, XML/HTTP, RESTful HTTP, or CORBA and work over a variety of transports such as HTTP, JMS or JBI.

Releases: 3.2.5 - 18 06 2018; 3.2.6 - 08 08 2018; 3.2.7 - 24 10 2018; 3.2.8 - 24 01 2019; 3.3.0 - 24 01 2019; 3.3.1 - 28 02 2019; 3.3.2 - 10 05 2019; 3.3.3 - 08 08 2019; 3.3.4 - 21 10 2019; 3.3.5 - 10 01 2020

Project: Google Guava

Repository: https://github.com/google/guava

Description: Guava is a set of core Java libraries from Google that includes new collection types (such as multimap and multiset), immutable collections, a graph library, and utilities for concurrency, I/O, hashing, caching, primitives, strings, and more! It is widely used on most Java projects within Google, and widely used by many other companies as well.

Releases: 28.0 - 12 06 2019; 28.1 - 28 08 2019; 28.2 - 27 12 2019

Project: Oracle Graal

Repository: https://github.com/oracle/graal

Description: GraalVM is a universal virtual machine for running applications written in JavaScript, Python, Ruby, R, JVM-based languages like Java, Scala, Clojure, Kotlin, and LLVM-based languages such as C and C++.

Releases: 19.0.0 - 09 05 2019; 19.0.2 - 14 06 2019; 19.1.0 - 27 06 2019; 19.1.1 - 13 07 2019; 19.2.0 - 19 08 2019; 19.2.1 - 12 09 2019; 19.3.0 - 15 11 2019; 19.3.0.2 - 20 12 2019; 19.3.1 - 14 01 2020; 20.0.0 - 14 02 2020

Project: Elastic ElasticSearch

Repository: https://github.com/elastic/elasticsearch

Description: Elasticsearch is a distributed RESTful search engine built for the cloud.


Project: Apache Netbeans

Repository: https://github.com/apache/netbeans


B Appendix — Evolution of quality

Each project in this appendix is shown with four plots: the Jaccard and cosine variation for classes and for methods, measured across ten releases (0 to 9). Every plot traces the q25, q50, q75, and q95 quantiles of the variation.

[Figures for project jmeter: Jaccard and cosine variation for classes and for methods]

[Figures for project che: Jaccard and cosine variation for classes and for methods]

[Figures for project tomcat: Jaccard and cosine variation for classes and for methods]

[Figures for project springboot: Jaccard and cosine variation for classes and for methods]

[Figures for project cxf: Jaccard and cosine variation for classes and for methods]

[Figures for project guava: Jaccard and cosine variation for classes and for methods]

[Figures for project graal: Jaccard and cosine variation for classes and for methods]

[Figures for project elasticsearch: Jaccard and cosine variation for classes and for methods]

[Figures for project netbeans: Jaccard and cosine variation for classes and for methods]
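The curves above can be recomputed from the raw data in the repository referenced in Appendix A. The following is a minimal sketch of how such quantile curves could be produced, assuming a table with one similarity score per code block and release; the column names and the pandas/matplotlib tooling are illustrative assumptions, not the exact scripts used in this study.

# Minimal sketch: quantile "variation" curves for one project and one
# similarity measure (Jaccard or cosine).
import pandas as pd
import matplotlib.pyplot as plt

def plot_variation(df, title):
    # df is assumed to hold one row per code block, with an integer
    # 'release' column (0-9) and a 'score' column holding the similarity
    # between a comment and its code block.
    quantiles = df.groupby("release")["score"] \
                  .quantile([0.25, 0.50, 0.75, 0.95]).unstack()
    # Normalize each quantile curve by its value at the first release, so
    # that 1.0 means "unchanged"; how the thesis normalizes the variation
    # is an assumption here.
    variation = quantiles / quantiles.iloc[0]
    for q, label in zip([0.25, 0.50, 0.75, 0.95], ["q25", "q50", "q75", "q95"]):
        plt.plot(variation.index, variation[q], label=label)
    plt.xlabel("Releases")
    plt.ylabel("Variation")
    plt.title(title)
    plt.legend()
    plt.show()

For example, calling plot_variation(jmeter_classes, "Project jmeter: jaccard variation for classes") on the class-level Jaccard scores would produce a plot analogous to the first figure in this appendix.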

C Appendix — Lists of stop words

For this research, we used a natural language processing step that skips words that give little to no meaning to the text under study. The words skipped are listed below.
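As an illustration, the following minimal sketch shows this filtering step in Python. It assumes simple regular-expression tokenization; the function name and the abbreviated stop word set are illustrative only, and the full lists used are given in the subsections below.

# Illustrative sketch of the stop word filtering step. STOP_WORDS
# abbreviates the union of the three lists in this appendix.
import re

STOP_WORDS = {"i", "me", "my", "the", "and", "of", "public", "static", "void"}

def normalize(text):
    # Lowercase the text, split it on characters that cannot occur in a
    # word, and drop empty tokens and stop words.
    tokens = re.split(r"[^a-z0-9']+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

print(normalize("Returns the index of the first element"))
# ['returns', 'index', 'first', 'element']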

C.1 NLTK stop words

’i’, ’me’, ’my’, ’myself’, ’we’, ’our’, ’ours’, ’ourselves’, ’you’, "you’re", "you’ve", "you’ll", "you’d", ’your’, ’yours’, ’yourself’, ’yourselves’, ’he’, ’him’, ’his’, ’himself’, ’she’, "she’s", ’her’, ’hers’, ’herself’, ’it’, "it’s", ’its’, ’itself’, ’they’, ’them’, ’their’, ’theirs’, ’themselves’, ’what’, ’which’, ’who’, ’whom’, ’this’, ’that’, "that’ll", ’these’, ’those’, ’am’, ’is’, ’are’, ’was’, ’were’, ’be’, ’been’, ’being’, ’have’, ’has’, ’had’, ’having’, ’do’, ’does’, ’did’, ’doing’, ’a’, ’an’, ’the’, ’and’, ’but’, ’if’, ’or’, ’because’, ’as’, ’until’, ’while’, ’of’, ’at’, ’by’, ’for’, ’with’, ’about’, ’against’, ’between’, ’into’, ’through’, ’during’, ’before’, ’after’, ’above’, ’below’, ’to’, ’from’, ’up’, ’down’, ’in’, ’out’, ’on’, ’off’, ’over’, ’under’, ’again’, ’further’, ’then’, ’once’, ’here’, ’there’, ’when’, ’where’, ’why’, ’how’, ’all’, ’any’, ’both’, ’each’, ’few’, ’more’, ’most’, ’other’, ’some’, ’such’, ’no’, ’nor’, ’not’, ’only’, ’own’, ’same’, ’so’, ’than’, ’too’, ’very’, ’s’, ’t’, ’can’, ’will’, ’just’, ’don’, "don’t", ’should’, "should’ve", ’now’, ’d’, ’ll’, ’m’, ’o’, ’re’, ’ve’, ’y’, ’ain’, ’aren’, "aren’t", ’couldn’, "couldn’t", ’didn’, "didn’t", ’doesn’, "doesn’t", ’hadn’, "hadn’t", ’hasn’, "hasn’t", ’haven’, "haven’t", ’isn’, "isn’t", ’ma’, ’mightn’, "mightn’t", ’mustn’, "mustn’t", ’needn’, "needn’t", ’shan’, "shan’t", ’shouldn’, "shouldn’t", ’wasn’, "wasn’t", ’weren’, "weren’t", ’won’, "won’t", ’wouldn’, "wouldn’t".

C.2 Extra stop words

’aboard’, ’according’, ’across’, ’along’, ’alongside’, ’amid’, ’anti’, ’around’, ’aside’, ’atop’, ’behind’, ’beneath’, ’beside’, ’besides’, ’beyond’, ’concerning’, ’considering’, ’despite’, ’towards’, ’underneath’, ’unlike’, ’upon’, ’versus’, ’via’, ’within’, ’without’, ’we’, ’they’.

C.3 Java Keywords as stop words

The Java keywords used as stop words correspond to the reserved words of the Java language [16]: ’abstract’, ’assert’, ’boolean’, ’break’, ’byte’, ’case’, ’catch’, ’char’, ’class’, ’const’, ’continue’, ’default’, ’do’, ’double’, ’else’, ’enum’, ’extends’, ’final’, ’finally’, ’float’, ’for’, ’goto’, ’if’, ’implements’, ’import’, ’instanceof’, ’int’, ’interface’, ’long’, ’native’, ’new’, ’package’, ’private’, ’protected’, ’public’, ’return’, ’short’, ’static’, ’strictfp’, ’super’, ’switch’, ’synchronized’, ’this’, ’throw’, ’throws’, ’transient’, ’try’, ’void’, ’volatile’, ’while’.
