
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Measuring correlation between commit frequency and popularity on GitHub

JONATHAN JEFFORD-BAKER

MÅRTEN GRÖNLUND


Supervisor: Roberto Guanciale
Examiner: Örjan Ekeberg

Swedish title: Mätning av korrelation mellan commitfrekvens och popularitet på GitHub

Stockholm 2017

School of Computer Science and Communication, Kungliga Tekniska Högskolan


Abstract

This thesis studies the correlation between the commit frequency and popularity of Github projects. Over 12,000 projects were retrieved using the Github API, resulting in a dataset containing 85 projects after filtering out projects that were deemed unfit. The analysis of the projects consisted of calculating the Pearson Correlation Coefficient using the frequency of commits and popularity as variables. Different time intervals were studied along with several metrics of popularity based upon each project's metadata retrieved from Github. The results varied across the different time intervals and metrics of popularity, but none of the measurements resulted in a correlation coefficient indicating a strong or moderate correlation. This study therefore concluded that no correlation exists between commit frequency and popularity. Although no correlation was found, several potential measures of improvement for further research were discovered.


Sammanfattning

(Swedish abstract) This study examines the correlation between commit frequency and the popularity of Github projects. Over 12,000 projects were retrieved through the Github API, resulting in a dataset of 85 projects after unsuitable projects had been filtered out. The analysis of the projects consisted of calculating Pearson's correlation coefficient with commit frequency and popularity as variables. Based on the projects' metadata from Github, different time intervals were examined in combination with several metrics of popularity. The results varied across the different time intervals and popularity metrics, but none of the measurements yielded a correlation coefficient indicating a strong or moderate correlation. The study therefore concluded that no correlation exists between commit frequency and popularity. Although no correlation was found, several potential improvements for further research were identified.


Contents

1 Introduction
  1.1 Purpose
  1.2 Problem Statement
    1.2.1 Hypotheses
  1.3 Scope
  1.4 Outline

2 Background
  2.1 Github
    2.1.1 Github Terminology
  2.2 Open Source Development
  2.3 Correlation
    2.3.1 P-value
    2.3.2 Pearson Correlation Coefficient
    2.3.3 Spearman's Rank Correlation Coefficient
    2.3.4 Mann-Whitney U-test
    2.3.5 The Model Used in the Study
  2.4 Previous Research

3 Method
  3.1 Scraping Github
  3.2 Analysis
    3.2.1 Metrics and Limits
  3.3 Procedure

4 Results
  4.1 First Hypothesis
  4.2 Second Hypothesis
  4.3 Third Hypothesis
  4.4 Fourth Hypothesis
  4.5 Fifth Hypothesis

5 Discussion
  5.1 Result
    5.1.1 Correlation
    5.1.2 Abnormalities and Extreme Cases
  5.2 Possible Error Sources
    5.2.1 Source of Data
    5.2.2 Similarities between Metrics of Popularity
    5.2.3 Other Metrics of Popularity
    5.2.4 The Dataset
  5.3 Future Research

6 Conclusions


1 Introduction

Online code repositories have led to a new way of developing open source software by making the source code of various projects available over the internet and offering features such as bug tracking and task management. The availability of the source code allows people from all over the world to participate in different software projects and allows users to review the source code easily.

Some online code repositories have implemented version control systems, which supply metadata about the project such as the changes that have been made, which user implemented each change, reported bugs, requested features and more.

The code may be used by anywhere from a couple of users to thousands, depending on how popular the project is. However, this metadata is seldom utilised, even though it can bring insight into different aspects of the project, for example how quickly bugs in the code are attended to. There exist some websites that try to visualise or in other ways analyse this data for research, but the utilisation of this valuable resource needs to be recognised to a greater extent.

Among the vast number of repositories resides a large number of open source projects, but which of these projects are actually worth investigating? From an investor's point of view this question is highly important: which project should they choose to invest time, and possibly money, in? A common point of reference for humans is popularity; being herd animals, humans tend to follow the herd, which is also the case when it comes to software. Users tend to explore and use software that is popular, so a popular software project would be a good candidate for the investor.

Popularity is not easy to measure, though, especially in the case of open source software (OSS) projects. Some projects are small in size with a few contributors and users of the software, whilst other projects have thousands of contributors and hundreds of thousands of users. There are also projects with few contributors whose software sees immense usage, an example being OpenSSL, which has 15 developers and is used on millions of websites (see footnotes 1 and 2). It would therefore be inaccurate to measure popularity using the number of contributors as a metric, since the software could still be widely used.

1.1 Purpose

By analysing the metadata extracted from a large number of code repositories, one can develop a general image of the development process and detect patterns in it which have positive or negative consequences for a project's outcome. Detecting these similarities or differences between projects could result in conclusions which benefit the evolution of models for software development.

The purpose of this study is to extract the metadata from a large quantity of repositories and, based on that data, determine its effects on the software project. The focus will lie on a data type called “commits”, which essentially are the changes that have been made to the project and occur in repositories that use the Git version control system.

1. https://www.openssl.org/community/team.html

2. https://trends.builtwith.com/Server/OpenSSL


1.2 Problem Statement

Commits provide valuable information about a project, since together they constitute the complete history of changes made to the code. For a project to progress, changes have to be made for the code to evolve; in other words, commits have to be submitted for the project to move forward.

There is a huge number of open source software projects, but far from all of them are popular; in fact, many projects die out after a short period of time.

There have been earlier studies investigating what makes an open source project popular. Aggarwal et al. [1] focused on the correlation between documentation evolution and popularity, and Bissyandé et al. [2] studied the connection between reported issues and project success. However, there has not been a study where the correlation between commits and project popularity was investigated, and since many projects are unpopular it is interesting to study whether the frequency of commits in certain periods of a project's lifecycle has an impact on its popularity. Therefore the problem statement for this study is:

How does the frequency of commits over a project’s lifetime correlate to its popularity?

1.2.1 Hypotheses

The beginning and end of a project might give insight into how the project became popular or how well the project has kept its popularity until the end (or its current state). A project with high initial activity might interest others to contribute and build a userbase quickly. A project with high activity at the end of the project (or that is currently active) shows that the userbase is large enough for continuous development.

Hypothesis 1 - A high frequency of commits in the beginning of a project will give it a good start and therefore gain popularity throughout the project’s lifetime.

Hypothesis 2 - A high commit frequency at the end of a project implies that the project is/was very active at that point, and thus popular as well.

A high average frequency of commits throughout the project would result in high popularity: it would indicate an active project overall, even though the activity may fluctuate between different periods, resulting in a popular project.

Hypothesis 3 - A high average frequency of commits throughout the project should result in high popularity.

A project with a long period of continuous weekly activity is believed to be popular, since such a period would indicate consistent interest in the project. In contrast, a project with a long period without any commit activity is believed to be unpopular.

Hypothesis 4 - A high number of consecutive weeks containing commits implies a popular project.


Hypothesis 5 - A high number of consecutive weeks containing no commit activity implies an unpopular project.

1.3 Scope

There are several online repository services available; however, this study will focus on Github. There are different types of repositories on Github: the Public repository is available for anyone to browse and is the most commonly used, since unlike the others it is free of charge. Not all types of repositories are publicly available; Personal repositories as well as Enterprise repositories are private and only accessible if the owner has allowed access. This study will therefore only involve public repositories. So-called “forks”, which are repositories that have been cloned from an original project but continued being developed independently, will be disregarded, and only the original projects will be studied. Another limitation concerns the size of a project: any project repository with a size below 500 bytes will be excluded from the dataset. The programming language may affect the popularity of a project because of the popularity of the language itself; this study will not consider this relation and all projects will be treated equally in terms of programming language.

1.4 Outline

Section 2 provides an introduction to Github and the development of open source software. Furthermore, the section explains the principles of correlation and discusses several correlation models. Additionally, it introduces previous research in the same area as this study. Section 3 discusses and motivates the choice of dataset, metrics and time intervals, as well as the procedure for analysing the data. The results are presented in section 4 and discussed in section 5, including possible sources of error along with considerations for future research. Lastly, the conclusions are presented in section 6.


2 Background

This section provides the background necessary to understand the research methodology of this study. Section 2.1 gives an overview of Github and related terminology used throughout the report. Open source development is discussed in section 2.2. An overview of correlation, along with models for measuring it, is provided in section 2.3. Lastly, previous research relevant to this study is presented in section 2.4.

2.1 Github

Github is an online code repository for software development projects which has become immensely popular over time compared to similar platforms such as SourceForge and Google Code. This popularity is linked to the features Github offers, such as the ability to fork a project, issue tracking and watching projects [1]. In other words, Github offers a platform that eases social interaction between the participants in the development process, something SourceForge, for example, lacked when Github started off, which caused Github to surpass SourceForge [3].

2.1.1 Github Terminology

Fork - Forking a project essentially means that a developer creates their own version of the project which can become fully independent of the original.

Issue tracking system - A feature that enables users to report and discuss issues, which can be for example bug reports or suggestions for new features to be added to the project.

Watching - When a user watches a project they are subscribed to receive notifications whenever changes are made in the project, such as when new pull requests and new issues are submitted. Due to updates made by Github, Watchers are named Subscribers in the API, hence the latter term will be used in this thesis.

Pull request - Takes place when a developer has made changes in the code and wishes to submit them to the main project. This request is then reviewed by the core developers and if granted, the change is made in the code of the main project.

Contributor - A user who has contributed to the project by submitting at least one commit sometime during the project’s course.

Star - The star feature allows a user to bookmark a repository for easier access and at the same time showing appreciation to the maintainer of the repository.

Repository - As found in the study by Kalliamvakou et al. a repository need not necessarily be a project, therefore a distinction between the two should be made [4]. A repository denotes an arbitrary repository (either a base repository or a fork) while a project denotes the base repository with all its forks.


Commit - A commit to a repository is any change to the repository that adds a new file, deletes a file, or changes file contents or file structure. This change is recorded automatically by Git, and in order for the change to be pushed to Github, a commit has to be made. The commit also includes a commit message that should describe what changes have been made, as well as other metadata such as the author of the commit, a timestamp, a checksum, etc. (see footnote 3).

2.2 Open Source Development

The development process for open source projects hosted at online repositories such as Github often differs from more traditional ways of creating software.

The developers in Github projects often consist of people who develop the software in their spare time, more as a hobby than as a job [5]. In contrast to work-related software development, the developers are in other words not committed full time to the project. Furthermore, the developers are decentralised, contributing from different geographical locations, communicating and meeting over the internet instead of in person.

In contrast to traditional software development the team of developers is not limited to a specific range of people, anyone who wishes can contribute to a project on Github as long as the repository is public. Another difference is that there often is a lack of deadlines, software is developed “on the go” and the tempo in which progress is made varies over time [5].

2.3 Correlation

Correlation is a measurement in mathematical statistics which measures the dependence between two or more variables. The variables can also be denoted as observations, and the two terms will be used synonymously in this text. In this study, a correlation between commit frequency and popularity would indicate that there exists a relationship between different frequencies of commits and the popularity of a repository. It is important to distinguish between correlation and causality: although a result might indicate a correlation between variables, it does not imply that there exists a causal relationship between them. A causal relationship between variables is a situation where one or several variables cause the outcome of the others; for example, there exists a causality between the age of a child and its height. Correlation can indicate the existence of a potential causal relationship, but it does not tell anything about the underlying reason for the relationship. By measuring the correlation, this study will therefore only be able to answer whether there exists a relationship between the two variables mentioned, and not the causes of it.

2.3.1 P-value

An interesting value to study, besides the result of a statistical model (discussed below), is the p-value. The p-value represents the probability of obtaining results at least as extreme as those observed, given that the null hypothesis is true. In other words, it indicates whether the results given by the model were the product of a

3. https://developer.github.com/v3/repos/commits/


highly unlikely event or not. The null hypothesis in terms of this study is that no correlation between the variables exists.

When the p-value is lower than the selected significance level, typically 0.05 (5%), the null hypothesis should be rejected. A higher p-value indicates that the null hypothesis cannot be rejected. The significance level used in this study is 0.05.

2.3.2 Pearson Correlation Coefficient

Correlation is commonly used for linear relationships and can be measured with models that approximate a function. One common model is Pearson's product-moment coefficient, which measures how much data points deviate from a best-fit line. This model is only useful for two variables x and y that are on either an interval or a ratio scale [6]. The value of the Pearson Correlation Coefficient (PCC) ranges between -1 and 1; the closer a value is to either extreme, the stronger the correlation. The PCC is computed as:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where r denotes the PCC, x_i and y_i are members of the datasets {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_n} respectively, and \bar{x}, \bar{y} are the averages of all x and y respectively.

There are guidelines with intervals for interpreting the resulting coefficient, but it is important to note that these intervals depend on the dataset measured [6]. For some datasets a value of 0.5 could be a strong positive correlation, for example in the social sciences where there are many complicated factors that could influence the result [7]. In other datasets, for example measurements of physical laws made with very accurate tools, a correlation of 0.5 would indicate a very low correlation.

2.3.3 Spearman’s Rank Correlation Coefficient

Spearman's Rank Correlation Coefficient is very similar to the Pearson Correlation Coefficient but instead uses the ranked values of the variables; it assumes the existence of a monotonic relation between the variables and evaluates this relation [8]. A ranked value is computed by comparing all the observations and ranking them by their numerical value. A monotonic relation exists between two variables if both variables increase together, or if one decreases as the other increases. The Spearman coefficient is computed using the formula:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where n represents the number of observations and d_i = rg(x_i) - rg(y_i) denotes the difference between the two ranks of each observation.
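The computation can be sketched as follows, assuming no tied values (the shortcut formula with d_i is only exact in that case); the function names are ours:

```python
def rank_of(values):
    """Rank 1 = smallest value; assumes no ties, as the d_i shortcut formula requires."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(x, y):
    """Spearman's rank correlation coefficient via the formula above."""
    n = len(x)
    # d_i: difference between the ranks each observation gets in x and in y
    d2 = sum((a - b) ** 2 for a, b in zip(rank_of(x), rank_of(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

`scipy.stats.spearmanr` implements the general tie-handling version and also reports a p-value.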

2.3.4 Mann-Whitney U-test

Another ranked correlation test is the Mann-Whitney U-test, which assigns ranks to the different observations. The U value is computed by first assigning ranks to all observations, where rank 1 denotes the smallest value. If two


observations share the same value, instead of being assigned different ranks depending on order of occurrence, they are all given a rank equal to the average of the ranks they would otherwise receive. Secondly, the ranks for sample 1 (x_0, x_1, ..., x_n) are summed and the U value is then given by the equation:

$$U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}$$

where n_1 is the sample size of sample 1 and R_1 denotes the sum of ranks in sample 1. U_2 is then calculated using the same formula for sample 2 (y_0, y_1, ..., y_n), and the smaller of U_1 and U_2 is used for comparison in a statistical significance table.

Like the previous models, the Mann-Whitney U-test does not require an assumption that the variables are normally distributed, but it requires that the distribution under the null hypothesis is known. The null hypothesis states that it is equally probable that a value selected at random from one sample will be less than or greater than a value selected at random from a second sample [9].
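The ranking and U computation can be sketched as follows (tie handling via average ranks; `mann_whitney_u` is our own name for this illustration, and `scipy.stats.mannwhitneyu` is the standard implementation for real analyses):

```python
def mann_whitney_u(sample1, sample2):
    """Return the smaller of U1 and U2, using tie-averaged ranks over the pooled data."""
    pooled = sorted(sample1 + sample2)

    def avg_rank(v):
        # first rank the value would get, plus half the span of its ties
        first = pooled.index(v) + 1
        return first + (pooled.count(v) - 1) / 2

    n1, n2 = len(sample1), len(sample2)
    r1 = sum(avg_rank(v) for v in sample1)   # R1: sum of ranks of sample 1
    u1 = r1 - n1 * (n1 + 1) / 2              # the formula above
    u2 = n1 * n2 - u1                        # identity: U1 + U2 = n1 * n2
    return min(u1, u2)
```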

2.3.5 The Model Used in the Study

Ranked models usually fit very well in studies where the data takes values that are roughly on the same scale. If the majority of the values are close to each other but some deviate significantly, the ranking of the variables will be misleading. For example, a dataset containing the values {41, 42, 43, 2000} would receive the ranks {1, 2, 3, 4}, and a dataset of values {41, 42, 43, 44} would receive the same ranking {1, 2, 3, 4}, even though the values 44 and 2000 differ greatly from each other. As mentioned previously, there exist some projects on Github with a massive userbase, such as OpenSSL, but most projects are smaller. The popularity was therefore believed to range over a wide scale, with the majority of projects not very popular, which would increase the risk of a scenario similar to the one above. Therefore the two ranked models will not be used in this project, leaving the Pearson Correlation Coefficient as the remaining model of the three. The PCC is widely used and does not require any ranking of the data, and will therefore be used in this study.

2.4 Previous Research

There have been several studies investigating common features and correlations of OSS projects regarding active contributors, programming languages or structure, commits and other contributions [1, 2, 10, 11]. Many of the most recent studies focus mainly on Github development or datasets originating from the platform, such as stars, forks and issues. Some of the studies also include code or detailed descriptions of how the Github data was acquired [1, 10, 12]. Two of the mentioned studies used the Github v3 API (footnote 4) to extract the dataset [10, 12], while one of the studies used the GHTorrent (footnote 5) dataset [1]. The study by Peterson [12] also included an algorithm for selecting a repository at random.

The selection was done by choosing a random word from an arbitrary word list, which is passed to the Github API; the API in turn returns a list of repositories

4. https://developer.github.com/v3

5. www.ghtorrent.org


containing the selected word in their descriptions. Finally, a repository in that list is selected at random and its contents extracted.

A study by Aggarwal et al. focused on the part documentation plays in a project's popularity [1]. By extracting commits, forks, watchers and pull requests, Aggarwal et al. used these features of a repository to measure how popular it was. The study measured how much the documentation changed over the lifetime of a project and, using cross-correlation, investigated how this was related to the popularity of the project. The metric for popularity was defined by the formula Popularity = Stars + Forks + Pulls2; unfortunately, the model used for computing the cross-correlation is not described in detail.

The study found an apparent relationship between the popularity of a project and extensive (and consistent) documentation.

A study by Kalliamvakou et al. [4] shows that the majority of all Github repositories are personal and inactive, which may have a significant impact on what conclusions can be drawn from a dataset of Github repositories.

The conclusions were made by analysing parts of the GHTorrent dataset and sending out surveys to GitHub users. It was also found that most projects have very few commits; this should therefore be kept in mind when analysing commits on GitHub.

Weicheng et al. [11] measured the relation between the frequency of commits and evolution of file versions in eight large and successful projects on Github.

The conclusion was that the average interval between commits which involved at least five changed lines of code in at least five different files was 5.34 days, around three times less frequent than commits of lesser size. The frequency was related to the files in which changes were made, and small commits, in particular in core project files, resulted in large changes of code in following commits.


3 Method

3.1 Scraping Github

Collecting data from Github was an essential part of the project. There are several different methods of collecting data from Github that would yield the necessary dataset, each with advantages and drawbacks. By comparing the methods used in previous research (see section 2.4) we found that using the Github API would give us the largest quantity of useful data while still being completely free. This alternative was chosen over using Github Archive (footnote 6) along with Google BigQuery (footnote 7), due to the simplicity of using the API and the lower quantity of data it would imply, in contrast with Github Archive, which would require terabytes of storage.

Using a script (footnote 8) written in Python, we collected repositories and their respective commits by generating a random repository id as a variable when querying the Github API for a list of repositories. This list is represented in JSON format, where every repository is an object within the list. After getting a repository list we filtered out all repositories that were forks, in other words extracting only projects, as well as filtering out repositories smaller than 500 bytes. We also removed some key:value pairs in the repository objects that would not be used in the project, in order to remove most of the unneeded data (further reasoning about the filtered-out data can be found in section 3.2.1). For each remaining repository object in the list, all associated commits were then downloaded and saved as a JSON list called “commit” within the repository object.
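The fork/size filtering step can be sketched as follows. This is a minimal illustration, not the authors' actual script; `fork` and `size` are field names from the GitHub v3 repository JSON, and the 500 threshold mirrors the limit described in the text.

```python
def filter_repositories(repos, min_size=500):
    """Keep only base repositories (no forks) at or above the size threshold."""
    return [r for r in repos
            if not r.get("fork", False) and r.get("size", 0) >= min_size]

# illustrative input shaped like the API's repository objects
sample = [
    {"name": "base", "fork": False, "size": 1200},
    {"name": "a-fork", "fork": True, "size": 9000},
    {"name": "tiny", "fork": False, "size": 100},
]
kept = filter_repositories(sample)
```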

The data collection ran in parallel on two computers for an effective time of about 11 days. Since the API restricts the number of requests per hour to 5,000, the total number of projects ended up at 12,374, with an accompanying 1,246,462 commits. Due to an inconsistency (footnote 9) in the Github API and the way the script crawled through the commits, some projects ended up having exactly 30 commits in our dataset even though there were more on the website. This resulted in the development of another script which collected the remaining commits by going through all projects once again. After this procedure the final total number of commits ended up being 1,299,235, over 50,000 commits more than originally.

6. https://www.githubarchive.org/

7. https://cloud.google.com/bigquery/

8. Available at https://github.com/martengooz/github-scraper

9. Some API requests for certain repositories would not respond with the header pair “Link: [string value]” used for page traversal, as explained at https://developer.github.com/guides/traversing-with-pagination/. This made the script believe there were no more pages (i.e. commits) to request, and it therefore terminated after collecting only the default 30 commits on the first page.


3.2 Analysis

3.2.1 Metrics and Limits

Several metrics of popularity were used consisting of metadata extracted from the Github API. The metrics used to define popularity are listed below:

Stars

The number of stars was one of the metrics used, chosen primarily since Github itself uses stars as a metric to rank the currently trending repositories (footnote 10). This metric seems to be a good starting point and a general baseline for how popularity can be measured.

Forks

The number of forks of a project was chosen as a metric since a high number of forks could indicate a popular software project: a high number of people wish to modify the source code and create their own versions. This value is, however, generally much lower than the number of stars.

Stars + Forks + Subscribers

This metric combines the user aspect with the developer aspect. Forking a project is an action made by a developer, since it involves interaction with the project's code, whereas starring or subscribing to a project can be carried out by either a developer or a user of the project. This sum will henceforth be denoted SFS for shorter referencing.

The formula used by Aggarwal et al. [1] that was mentioned in section 2.4 was not used, due to difficulties retrieving the number of pulls from the API. Another problem with using the number of pulls as a metric of popularity is that only a small portion of projects on Github use pull requests [4]. Further discussion regarding issues with the different metrics can be found in section 5.2.2.

The limits for the different variables which a project had to fulfill in order to be included in the dataset used in the analysis were the following:

• Stars > 10

• Subscribers > 10

• Forks > 1

• Commits > 100

After weeding out the projects that did not fulfill the above limits, only 85 remained, a loss of nearly 99.3% of the originally downloaded projects.

As seen in section 5, several data points deviate significantly from the others. Due to the very low number of projects remaining, the choice was made to keep the deviating projects to avoid further reduction of the dataset.
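Applied to the repository metadata, the limits above amount to a simple threshold filter. This is a sketch with partly hypothetical field names: `stargazers_count`, `subscribers_count` and `forks_count` follow the GitHub v3 API, while `commit_count` stands in for the separately downloaded commit total.

```python
# Thresholds from the list above; a project must exceed every one to be kept.
LIMITS = {
    "stargazers_count": 10,   # Stars > 10
    "subscribers_count": 10,  # Subscribers > 10
    "forks_count": 1,         # Forks > 1
    "commit_count": 100,      # Commits > 100 (hypothetical field for the fetched commits)
}

def passes_limits(project):
    """True if the project strictly exceeds every threshold, as stated in the text."""
    return all(project.get(field, 0) > limit for field, limit in LIMITS.items())
```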

10. https://github.com/trending


3.3 Procedure

The data was analysed using a program written in Python, using the SciPy package to measure the correlation. In order to get a full picture of the correlation in the data, different time intervals of the projects were studied along with various metrics for measuring popularity. The first time periods examined were the first and the last month of a project. The commits in these intervals were counted and the average number of commits per day was calculated.

The next time period studied was the entire lifetime of a project, where the average commit frequency per hour was measured. Lastly, the weekly activity of projects was studied by measuring the highest number of consecutive weeks in which commits were made and, respectively, the highest number of consecutive weeks containing no commit activity at all.
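The interval metrics described here can be sketched as follows. These are our own illustrative helpers, not the study's program; commit timestamps are assumed to be Python `datetime` objects parsed from the API data, and the list is assumed non-empty.

```python
from datetime import datetime, date, timedelta

EPOCH_MONDAY = date(1970, 1, 5)  # a Monday, so week indices align on Mon-Sun weeks

def week_index(t):
    """Map a timestamp to a week number; consecutive calendar weeks map to consecutive integers."""
    return (t.date() - EPOCH_MONDAY).days // 7

def first_month_daily_average(commit_times):
    """Average commits/day during the 30 days following the first commit."""
    start = min(commit_times)
    cutoff = start + timedelta(days=30)
    return sum(1 for t in commit_times if t < cutoff) / 30

def longest_active_week_streak(commit_times):
    """Most consecutive weeks that each contain at least one commit."""
    weeks = sorted({week_index(t) for t in commit_times})
    best = cur = 1
    for a, b in zip(weeks, weeks[1:]):
        cur = cur + 1 if b == a + 1 else 1
        best = max(best, cur)
    return best

def longest_inactive_week_streak(commit_times):
    """Most consecutive commit-free weeks between the first and last commit."""
    weeks = sorted({week_index(t) for t in commit_times})
    return max((b - a - 1 for a, b in zip(weeks, weeks[1:])), default=0)
```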


4 Results

The results shown below are split into five parts connected to the hypotheses described in section 1.2.1. Each part consists of three plots, one for each metric (stars, forks, SFS), except for the fourth hypothesis, which includes six plots.

Each point in the graphs represents a single project, and the line represents the best-fitting line, with the Pearson Correlation Coefficient denoted r together with p, the p-value for the model.

4.1 First Hypothesis

The first month of a project is where the most frequent commits happen, at a total average of 1.74 commits/day. It is noticeable that the majority of the projects are located near the bottom of each graph, meaning that most projects have a low popularity score. The projects are also distributed mainly around the lower left corner, indicating that there are few commits per day. The PCC is positive for each metric, measuring 0.0479, 0.1396 and 0.1164 in the forks, stars and SFS cases respectively. The p-value is significantly higher when measuring popularity solely by forks, at 0.6632, compared to 0.2026 (stars) and 0.2888 (SFS).

Figure 1: Average daily commits during the first month. Panels: (a) SFS, (b) Stars, (c) Forks.


4.2 Second Hypothesis

The last month shows how much the end or current state of a project affects its popularity. In this instance the majority of the projects are also located near the bottom left of each graph. The average commit frequency is almost a fifth of that in the first month, at 0.39 commits/day. The PCC is slightly positive for each metric; when measuring popularity by the number of forks, the PCC is 0.2112 with a p-value of 0.0524, compared to the first month where forks had a lower PCC of 0.0479 and a higher p-value of 0.6632. The values for the other graphs stay within a 0.1 margin of the previous result, with the exception of the SFS p-value, which decreased from 0.2888 to 0.1225. There is also a noticeable vertical line at exactly one commit per day, consisting of 8 projects in total.

Figure 2: Average daily commits during the latest month. Panels: (a) SFS, (b) Stars, (c) Forks.

4.3 Third Hypothesis

When comparing the average commit frequency over the whole lifetime of the projects, we again see that most values are centered near the bottom left. The average commit frequency across all projects is 0.75. The results for the different popularity metrics are overall similar to each other; the PCC takes minor positive values between 0.0112 and 0.0659, and the p-value is close to 1 in both the stars and SFS cases.


Figure 3: Average daily commits over the repository's lifetime. Panels: (a) SFS, (b) Stars, (c) Forks.

4.4 Fourth Hypothesis

The longest streak in which there had been at least one commit each week proved to be a special case, where a single project made a significant difference to the PCC when included. For this reason we provide two graphs for the same metric, one with all projects included (the left column) and one with the project removed (the right column). The removed project is named “broadgsa/gatk” and can be seen rightmost in the images in the left column. The removal of the project more than doubled the PCC in each case, and the p-value decreased by 0.3483 on average, which resulted in both SFS and forks having a p-value under the 0.05 significance level threshold.


Figure 4: Weekly commit activity streak. Panels: (a) SFS, (b) SFS without extreme case, (c) Stars, (d) Stars without extreme case, (e) Forks, (f) Forks without extreme case.


4.5 Fifth Hypothesis

When measuring consecutive weeks of inactivity, the PCC is slightly negative at −0.08 and the p-value is 0.48 ± 0.015 in all cases. Overall, the most popular projects lie within the 0–60 week range of inactivity, and only eight projects have had over 2 years of inactivity before another commit was pushed.

Figure 5: Weekly inactivity streak. Panels: (a) SFS, (b) Stars, (c) Forks.


5 Discussion

The result was unexpected: such a weak correlation coefficient between the commit frequency in different time intervals and the different metrics of popularity was beforehand deemed highly unlikely. There are, however, certain aspects that might have contributed to this result. These involve the source of data (Github in this case), the metrics used to measure popularity, and the extracted dataset itself.

5.1 Result

The result is not as distinct as we hoped, but it does follow the expectations for a study of this character. Most of the projects are neither popular nor have high values for any of the other quantities measured in this report, which is in line with the conclusions of Kalliamvakou et al. [4].

5.1.1 Correlation

An overall convincing correlation cannot be found in any of the cases, since the values are distributed in such a way that a high popularity value does not always correspond to a high commit frequency during that period. This is also reflected by the PCC, whose absolute value does not exceed 0.3 except on one occasion, which, given the low p-value, could indicate a weak correlation according to the standards of Cohen (1988) [13]. However, all coefficients are slightly positive where the hypotheses expected them to be, and for the fifth hypothesis, where the coefficient should be negative in order to agree with the hypothesis, it is.

The absolute weakest correlation coefficient was found when examining hypothesis 3, measuring the correlation between popularity and the average commit frequency throughout a project's lifetime. Since the projects were randomly chosen, some have been stable for years with only a few regular maintenance commits each month, giving a lower average frequency than when they were new, while others are just in the starting period of development and have not yet reached their full potential in popularity. This makes comparison between projects of different lengths unfair, which could be why the PCC is so low.

Another reason could be that projects with a high commit frequency but low popularity are company or industry related projects that are not widely used outside that group, but are still being actively developed.

The biggest difference between the results for each hypothesis occurred when the number of forks determined a project's popularity, which in some cases had a noticeably higher PCC and in one case a much lower one. This raises the question of whether the number of forks was a good metric to describe popularity on Github, or whether stars and SFS gave a false picture; this topic is discussed later in section 5.2.2. The number of forks did end up having the highest PCC value (0.3480) and the lowest p-value (0.0012), which could mean a weak correlation according to Cohen (1988) [13]. This was, however, after a datapoint was removed as described in section 4.4, and with a larger dataset the values could either increase or decrease. Although the higher value might not be accurate in the real world, it is an indication of what could be closer to the real value if the removed project is an exception. For this study, though, this will not be considered a weak correlation, due to the circumstances of having a very small dataset.

5.1.2 Abnormalities and Extreme Cases

There are some extreme cases in the dataset, some of which correspond to our hypotheses and others which indicate the opposite by showing no correlation. For the fourth hypothesis a project was removed in order to show the magnitude of difference that a single project could make. Leaving the other deviating projects in the dataset has probably also affected the correlation coefficient negatively, but on the other hand it left more projects to study in the dataset.

In section 4.2 there is a noticeable vertical line at exactly 1 commit per day, which seemed unnatural at first sight since the next higher value is located at 2.2 on the x-axis. A possible explanation for this phenomenon is that these projects follow a policy of at least one commit per day in order to maintain development. Other alternatives are that the dataset became corrupt somewhere, or that the data analysis software has a bug. Although only 8 projects demonstrate this behaviour, it is quite a peculiar result.

Taking all this into account, the overall results show no direct correlation between commit frequency and popularity. However, the results indicate that although no direct correlation can be found, many projects follow hypotheses 1, 2, 4 and 5, with some exceptions showing the opposite. The conclusion would therefore be that many popular projects follow the pattern the hypotheses describe, but following it is not a recipe for success, at least not with the metrics used in this study.

5.2 Possible Error Sources

5.2.1 Source of Data

Being the biggest platform for code repositories, Github was chosen as the source of data for this study. Using Github has both advantages and disadvantages: owing to its position as the most popular platform, it offers a vast number of repositories, which allows large datasets to be built.

However, the adequacy of the repositories as test data varies to a large extent; a large number of the repositories retrieved initially were deemed unfit for the dataset due to their small number of commits, in accordance with the study by Kalliamvakou et al. [4]. This problem was addressed by screening out repositories with fewer than 100 commits, but this is an arbitrary limit chosen to filter out the worst repositories. 100 commits is a relatively small number of commits to study, and the small number of data points may therefore give an inaccurate distribution of commit frequency in the different intervals. An example of this can be seen in section 4.4.
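The 100-commit cut-off described above amounts to a simple filter over the retrieved repositories. The records and field names below are hypothetical stand-ins for the scraper's actual output:

```python
# Hypothetical repository records; only the field used for filtering
# is shown. Real records would carry the full Github metadata.
repos = [
    {"name": "a/tool", "commit_count": 2500},
    {"name": "b/demo", "commit_count": 17},
    {"name": "c/lib",  "commit_count": 104},
]

MIN_COMMITS = 100  # the arbitrary cut-off used in this study

# Keep only repositories with at least MIN_COMMITS commits.
kept = [r for r in repos if r["commit_count"] >= MIN_COMMITS]
```

Applied to the roughly 12 000 retrieved projects, this cut-off (together with the other prerequisites) left only 85 projects in the final dataset.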

As Kalliamvakou et al. [4] also stated, most projects are inactive, and they suggested filtering out projects with no recent commits or pull requests. Unfortunately there was no time to implement such filtering, as it would require further analysis of the dataset to define a time period that would classify a project as inactive. Inactive projects should not have a major effect on the result in this study, since the project lifetime is defined as the period ranging from the time of the first commit to the last commit. Another peril with Github mentioned by Kalliamvakou et al. [4] is the large number of repositories that are not used for software development. That study suggested analysis of the README file and the description as a solution. Initially, during the development of the scraper software, there were plans to include code for Natural Language Processing to analyse the description and README. This would have increased the complexity of the software even more, and the plans were abandoned when the corrupt commit data was discovered, which left no time to implement such features. As such, the quality of the dataset could have been affected by the existence of projects unfit for this study, which would have a negative impact on the results.

Github Archive could have been used as the source of data instead, which would provide additional interesting information, such as when a project gained a star or fork; this would have allowed metrics such as derivatives between the commits to be used. The downside of using Github Archive is the vast increase in data that would have to be processed: the activity of one year makes up around 600GB, compared to the 200MB file used in this study.

5.2.2 Similarities between Metrics of Popularity

Popularity is difficult to define; even though Github offers plenty of metadata through its API, it is still far from easy to determine which data represent a project's popularity. Studying the results, one can argue that the different metrics for measuring popularity have very low significance, as the results are very similar to each other. Had there been a noticeable difference between the results, it could have constituted grounds for arguing that one metric may be better than another, but as that is not the case, this study will not claim any of the chosen metrics to be better than the others. The underlying reason for the very similar results is believed to be rooted in the fact that projects gain very few stars, forks and subscribers in general, which makes these metrics unfit for the task of this study (see Figure 6 for the distribution of forks and stars). Furthermore, there exists a strong correlation, ranging from 0.7242 up to 0.7922, between stars, forks and subscribers, indicating that all metrics represent the popularity similarly and making one of the three metrics somewhat redundant.

Figure 7 shows the correlation between Stars/Forks and Stars/Subscribers, while the correlation between Forks/Subscribers is left out since it yielded a similar result of 0.7815.

Another problem with using only one of the metrics is that, for example, the number of stars is not tied to the extent of usage of a project. The OpenSSL project11 has nearly 4600 stars on Github while Node12 has over 34000, even though the former is used to a larger extent13,14. The low number of forks observed could be related to the usage of other platforms like Mercurial15: a project can easily be cloned from Github and then be further developed using another platform, which means a fork is never recorded.

11: https://github.com/openssl/openssl
12: https://github.com/nodejs/node
13: https://trends.builtwith.com/Server/OpenSSL
14: https://w3techs.com/technologies/details/ws-nodejs/all/all
15: https://www.mercurial-scm.org/


Figure 6: The distribution of forks and stars. Panels: (a) Forks, (b) Stars.

Figure 7: Correlation between different metrics of popularity. Panels: (a) Stars and forks, (b) Stars and subscribers.

5.2.3 Other Metrics of Popularity

There was one other type of metadata that was supposed to be used as a metric of popularity in this study, the number of pull requests, but due to problems with the API this data could not be extracted. This metric reflects the developer aspect, which could have led to a difference in the results. However, using pull requests as a metric would have meant further limitations of the dataset, since only a small fraction of repositories use them, as found by Kalliamvakou et al. [4]. A metric better suited to measuring popularity would probably be the number of downloads (or clones) of a project, as this represents the usage of a project more fairly than the metrics used here, since this action does not require users to be logged in. This data is not provided by Github, and therefore it could not be used in this study. It should be stressed, however, that the popularity represented by a metric need not reflect the actual popularity, and vice versa.


5.2.4 The Dataset

The extracted dataset has a significant impact on the result, and there are several aspects of it that need to be considered. Firstly, the initially received data was corrupted by incorrect replies from the Github API, as stated in section 3. Fortunately the corrupt commit data was discovered, but it was not the only incorrect data returned by the API: seven occurrences of corrupt timestamps were also found. This is a very small fraction in relation to the total number of commits, but the occurrences indicate that there might be more undiscovered incorrect data in the dataset. If such data exists it might have had an impact on the results, but it is impossible to estimate the extent of it.

Another discovery made in the dataset was that some projects had used the Git version control system before Github existed. This means that during the time between a project's creation and the time when it was uploaded to Github, it could not receive any metadata connected to Github. In other words, the project could not, for example, receive stars during this period, meaning the conditions for gaining popularity (as measured in this study) were different for some of the projects, and this needs to be considered.

Another important aspect of the dataset is its size. Over 12000 projects were retrieved from the Github API, but after weeding out the ones not fulfilling the prerequisites only 85 remained. This reduction clearly confirms the findings of Kalliamvakou et al. [4], which stated that most projects are irrelevant for research of this kind. The small number of projects has a significant impact on drawing conclusions from the result, since they constitute only a small subset of all the projects on Github. The results can therefore not be seen as representative of projects on Github, which significantly limits the possibility of drawing general conclusions; even if there had been an indication of a strong correlation, the result would not have been entirely trustworthy.

5.3 Future Research

Evaluating the results and the methods used to obtain them, there are several things that future studies would do well to consider. Regarding the source of data, there are probably advantages to using a platform that offers the number of downloads for a project, something Github does not, as mentioned previously.

An advantage of Github is the vast number of accessible repositories; using another platform, one might be limited in the number of repositories offered, depending on how extensive the research is.

One alternative source of data instead of the Github API is Github Archive, since it contains the complete history of all Github activity from February 2011 and thus records when certain events of a project took place, such as when it received a star. This alternative is quite demanding in terms of computing power and storage, so one has to account for this when using Github Archive as the source of data.

If Github is used, one has to consider the time and effort it takes to gather and filter the data. A large share of the repositories will probably be considered irrelevant by most studies, and weeding these out can be demanding and time-consuming work. There are several ways of extracting data from Github, but one of the easier ones is using the API. Using this option may result in receiving incorrect data: the instructions for page navigation supplied by Github resulted in a corrupt dataset during this study, which meant additional work to find a working solution and to revise the dataset to find the corrupt parts.

Checking the replies from the API against the actual repositories lowers the risk of a corrupt dataset, but one still needs to consider the possibility of incorrect replies from the API. The hard limit of 5000 requests per hour may also prevent a shorter research project from achieving the desired dataset size.
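A defensive pagination loop of the kind suggested above might look like the following sketch. Here fetch_page is a hypothetical stand-in for a real HTTP call to the Github API, stubbed with canned pages so the logic runs offline; a real client would issue the request and read the rate limit from the response headers:

```python
import time

# Canned pages standing in for paginated API responses; an empty page
# marks the end of the collection.
PAGES = {1: [1, 2, 3], 2: [4, 5], 3: []}

def fetch_page(page):
    # Stub: returns (items, remaining_rate_limit) for a page number.
    return PAGES.get(page, []), 5000 - page

def collect_all(max_pages=100):
    # Walk pages until an empty page is returned, rather than trusting
    # any server-supplied "last page" hint, and back off when the rate
    # limit is nearly exhausted.
    items = []
    for page in range(1, max_pages + 1):
        batch, remaining = fetch_page(page)
        if remaining < 10:
            time.sleep(0)  # in a real client: wait for the limit reset
        if not batch:
            break
        items.extend(batch)
    return items
```

Stopping on an empty page instead of a precomputed page count is one way to avoid the kind of navigation-induced corruption encountered in this study, though the replies themselves should still be spot-checked against the real repositories.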


6 Conclusions

The results show no strong indication of correlation between commit frequency and popularity. In all measurements the metric of popularity takes low values for most projects. One could argue that this indicates that most of the projects that received a low popularity score are unpopular, but that is not necessarily the case. Instead, the low popularity values can be a consequence of the inadequacy of the metric itself at representing popularity; the number of downloads, for example, would be a better alternative or complement for measuring popularity. Even though the metrics used in this study may in retrospect be deemed unfit, the results still give a weak indication of certain relationships between commit frequency and popularity. Long streaks of weekly commit activity could be correlated to popularity, as our results give a weak indication of such a relationship. Also, long streaks of weekly inactivity in terms of commits could have a negative relationship with popularity, as one of the measurements indicated a very weak - but still negative - correlation coefficient for the two variables. However, to be able to confirm these possible relationships, further research using a larger dataset and better metrics would have to be conducted.


References

[1] Karan Aggarwal, Abram Hindle, and Eleni Stroulia. “Co-evolution of Project Documentation and Popularity Within Github”. In: Proceedings of the 11th Working Conference on Mining Software Repositories. MSR 2014. New York, NY, USA: ACM, 2014, pp. 360–363. isbn: 978-1-4503-2863-0. doi: 10.1145/2597073.2597120. url: http://doi.acm.org/10.1145/2597073.2597120 (visited on 05/11/2017).

[2] T. F. Bissyande et al. “Got issues? Who cares about it? A large scale investigation of issue trackers from GitHub”. In: 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE). Nov. 2013, pp. 188–197. doi: 10.1109/ISSRE.2013.6698918.

[3] Klint Finley. Github Has Surpassed Sourceforge and Google Code in Popularity. June 2011. url: http://readwrite.com/2011/06/02/github-has-passed-sourceforge/ (visited on 05/11/2017).

[4] Eirini Kalliamvakou et al. “The Promises and Perils of Mining GitHub”. In: Proceedings of the 11th Working Conference on Mining Software Repositories. MSR 2014. New York, NY, USA: ACM, 2014, pp. 92–101. isbn: 978-1-4503-2863-0. doi: 10.1145/2597073.2597074. url: http://doi.acm.org/10.1145/2597073.2597074 (visited on 05/11/2017).

[5] Audris Mockus, Roy T. Fielding, and James D. Herbsleb. “Two Case Studies of Open Source Software Development: Apache and Mozilla”. In: ACM Trans. Softw. Eng. Methodol. 11.3 (July 2002), pp. 309–346. issn: 1049-331X. doi: 10.1145/567793.567795. url: http://doi.acm.org/10.1145/567793.567795 (visited on 05/11/2017).

[6] Pearson Product-Moment Correlation - When you should run this test, the range of values the coefficient can take and how to measure strength of association. url: https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php (visited on 05/11/2017).

[7] Pearson correlation coefficient. Apr. 2017. url: https://en.wikipedia.org/w/index.php?title=Pearson_correlation_coefficient&oldid=775345264 (visited on 05/11/2017).

[8] Spearman's Rank-Order Correlation. url: https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php (visited on 05/23/2017).

[9] Mann–Whitney U test. Feb. 2017. url: https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test (visited on 05/23/2017).

[10] Oskar Jarczyk et al. “GitHub Projects. Quality Analysis of Open-Source Software”. In: Social Informatics. Springer, Cham, Nov. 2014, pp. 80–94. doi: 10.1007/978-3-319-13734-6_6. url: https://link.springer.com/chapter/10.1007/978-3-319-13734-6_6 (visited on 05/11/2017).

[11] Y. Weicheng, S. Beijun, and X. Ben. “Mining GitHub: Why Commit Stops - Exploring the Relationship between Developer's Commit Pattern and File Version Evolution”. In: 2013 20th Asia-Pacific Software Engineering Conference (APSEC). Vol. 2. Dec. 2013, pp. 165–169. doi: 10.1109/APSEC.2013.133.

[12] Kevin Peterson. “The GitHub Open Source Development Process”. url: http://kevinp.me/github-process-research/github-process-research.pdf (visited on 05/11/2017).

[13] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Routledge, May 2013. isbn: 978-1-134-74277-6.

