
Bachelor Degree Project

Automatic detection of source code plagiarism in programming courses

Author: Adam Bergman
Supervisor: Daniel Toll
Semester: VT 2021


Abstract

Source code plagiarism is an ongoing problem in programming courses at higher academic institutions. For this reason, different automated source code plagiarism detection tools have been developed. However, they require several manual steps before the submissions can be compared. Linnaeus University uses GitLab to handle their students’ code-related assignments but lacks an integrated workflow for checking submissions for plagiarism. Instead, Linnaeus University’s plagiarism-checking process is done manually, which is challenging and time-consuming. This thesis is a case study of Linnaeus University, focusing on integrating one of the plagiarism detection tools with GitLab using continuous integration pipelines. The objectives have been to collect students’ submissions, communicate with the plagiarism tool, and visually present the results within GitLab. The prototype has been evaluated with a set of manually created submissions with different levels of plagiarism to ensure that the detection tool differentiates between plagiarized and non-plagiarized submissions. Teachers at Linnaeus University have tested the workflow and assessed whether the prototype fulfills their requirements.


Preface

I want to thank my supervisor, Daniel Toll, from Linnaeus University. Daniel arranged weekly seminars in small groups where ideas were exchanged, and he continuously contributed valuable suggestions and improvements. Moreover, I am thankful to the participating teachers from Linnaeus University who set the requirements and provided feedback during the demonstrations.


Contents

1 Introduction
 1.1 Background
 1.2 Related work
 1.3 Problem formulation
 1.4 Results
 1.5 Scope/Limitation
 1.6 Outline
2 Methodological approach
 2.1 Design science
  2.1.1 Activity 1-2 - Identifying the problem and consulting teachers
  2.1.2 Activity 3-5 - Development, demonstration, and evaluation
 2.3 Case study
 2.4 Requirements
 2.5 Reliability and Validity
 2.6 Ethical Considerations
3 Theoretical Background
 3.1 Git and version control systems
 3.2 Continuous integration
 3.3 GitLab and source code management
 3.4 Automated source code plagiarism detection tools
  3.4.1 JPlag
  3.4.2 Moss
4 Iterations
 4.1 Summary of all iterations
 4.2 Iteration 1
  4.2.1 Evaluation of iteration 1
 4.3 Iteration 2
  4.3.1 Evaluation of iteration 2
 4.4 Iteration 3
  4.4.1 Evaluation of iteration 3
 4.5 Iteration 4
  4.5.1 Evaluation of iteration 4
 4.6 Iteration 5
  4.6.1 Evaluation of iteration 5
 4.7 Iteration 6
  4.7.1 Evaluation of iteration 6
5 Results
 5.1 Architectural structure in GitLab
 5.2 Workflow in GitLab
 5.3 Scenarios
 5.4 Documentation
6 Evaluation
 6.1 Fulfillment of requirements
 6.2 Production use and generalizability
7 Discussion
8 Conclusion
 8.1 Future work
References
Appendix
 A Interviews in Swedish
  A.1 Interview with teacher 1
  A.2 Interview with teacher 2

1 Introduction

Source code plagiarism is an ongoing problem in introductory programming courses at higher academic institutions [1]. It is difficult and time-consuming for teachers to detect plagiarism manually by comparing numerous students’ submissions. Linnaeus University offers various programming-related courses and uses GitLab (as described further in section 3.3) to manage students’ submissions but lacks an automated process for detecting plagiarism in source code. This 15 HEC bachelor thesis in computer science focuses on integrating an existing automated source code plagiarism detection tool with GitLab by adapting to existing workflows for programming assignment submissions. The target group is educators in computer science.

1.1 Background

Plagiarism is a well-known phenomenon in the academic world. It is described as submitting someone else’s work as one’s own or copying phrases, sentences, or images without properly acknowledging the source [2]. In Sweden, copying someone else’s work can result in disciplinary actions, such as warnings or suspensions of up to six months [3], which might lead students to fall behind and potentially not complete their studies. The motivation of this study is based on the fact that Linnaeus University lacks an automated solution for detecting plagiarism in programming assignments. Instead, the process differs among teachers: some check manually, some have partially automated it, and some do not check at all.

Code plagiarism is defined as a student reusing someone else’s code, intentionally or unintentionally, without acknowledging it correctly, thereby submitting it as their own solution [2]. There are two known techniques of code plagiarism: lexical and structural changes. Lexical changes refer to changing variable names, formatting, comments, and output. Structural changes include reordering of operands, code blocks, and scope modifications [4]. There are various reasons why students tend to plagiarize others’ work, including time pressure, fear of failing, not understanding what is considered plagiarism, idleness, the amount of work, and poor time management [5]. A study conducted at two Australian universities has shown that cheating on programming assignments is widely accepted among their students, and approximately 77% of the surveyed students admitted to cheating [6]. Cheating does not necessarily include plagiarism; however, plagiarism is a form of cheating [7].

Source code solutions can be found in several ways, such as on the web, in textbooks, and from other students. An extensive network of websites offers freelance developers at a low cost that students can hire to complete programming assignments [1]. As a result, these methods of obtaining source code solutions make plagiarism easier for students and open an easy way to pass an assignment. Programming courses can contain a large number of students, which makes the process of manually checking for plagiarism infeasible. An American survey found that 52% of 178 teachers in programming courses perform nearly zero manual inspection of students’ submissions [8].


Several automated source code plagiarism detection tools have been developed to mitigate the occurrences of plagiarism. The different tools have been described and evaluated in section 3.4.

One of the motivations of this thesis has been to detect and prevent source code plagiarism in programming courses at Linnaeus University, which is essential to preserve a high quality of education. With the aid of an automated source code plagiarism detection tool, educators could be spared the time required to look for plagiarism manually. Performing manual plagiarism detection in source code is complex and carries the risk of missing plagiarism because of the human factor. The time saved might instead be spent educating and supporting students.

1.2 Related work

Nguyen et al. developed a plugin that integrates two source code plagiarism detection tools, JPlag and MOSS (described in more detail in section 3.4), into the virtual learning platform Moodle [9]. One goal was to promote source code plagiarism detectors as educational tools to help students understand the importance of academic integrity by providing feedback with similarity measurements. The authors let teachers and students evaluate the plugin, revealing that the tool could effectively assist teachers in detecting plagiarism. Students were deterred from sharing their solutions with others. However, there were some concerns about confidentiality and anxiety among students. Non-plagiarising students faced higher similarity values than they expected. The plugin used the native visual presentation of JPlag and MOSS with minor modifications when presented to students, to prevent them from seeing each other’s identities. The evaluation showed that the results from JPlag and MOSS should be incorporated into a single interface with the ability to filter by specific criteria [9].

Tim Schubert developed a Continuous integration (CI, further explained in section 3.2) script for GitLab specifically designed to automate the handling of students’ programming assignments. The script supports plagiarism detection for students’ solutions with the use of JPlag. A GitLab group needs to be forked to check for plagiarism, and a tag needs to be pushed to that group. The results from JPlag can be viewed from the build artifacts of the CI job. The script currently supports the programming language Java [10].

The company Internet Guru provides a GitLab CI script that downloads repositories from all branches within a GitLab project and submits them to MOSS for plagiarism detection. The CI pipeline output log displays a URL to MOSS where the results can be viewed [11].

Since no earlier research has been done on how automated source code plagiarism detection can be used within GitLab, a research gap has been identified. The outcome of this thesis is beneficial for academic institutions using GitLab.

1.3 Problem formulation

This thesis addresses the following research questions:

1. How can a source code plagiarism detection tool be integrated into GitLab to assist teachers in finding plagiarism?

2. How can the results from a source code plagiarism detection tool be presented within a GitLab group for teachers?

A prototype has been designed and developed according to requirements and feedback from different teachers at Linnaeus University that use GitLab daily. The prototype has been worked on using a course with a mix of dummy submissions and submissions from inactive students. The course was created by the author and was only used during this study. More information on the prototype and the course structure can be found in Chapter 4.

1.4 Results

Several steps have been taken to integrate a source code plagiarism detection tool with Linnaeus University’s GitLab instance. The first step was to investigate different source code plagiarism detection tools. As a second step, requirements were gathered from teachers at the computer science department of Linnaeus University. These requirements were considered when choosing the plagiarism detection tool. The third step was to examine the possibilities with GitLab, such as using CI pipelines and their API. The later steps were an iterative process of prototype development, demonstration, and evaluation.

1.5 Scope/Limitation

This thesis is a case study of Linnaeus University and is based on their requirements and current structure in GitLab. Other academic institutions may benefit from the results by adopting the same structure. Linnaeus University utilizes a self-managed instance of GitLab Enterprise Edition with an Ultimate license; the current version is 13.8.1. All of GitLab’s functionality has been available to this study. The project was limited to the built-in functionality of GitLab, since extending it would be infeasible for a 15 HEC thesis and time-consuming to maintain and adapt to future GitLab updates. The project has been limited to following the same structure as Linnaeus University uses for their courses and assignments, and it was limited by the functionality of the plagiarism detection tool used in terms of visualization and the number of supported programming languages.

1.6 Outline

The following is a breakdown of the structure of this report. The Methodological approach chapter describes why design science is an appropriate research method for the current problem, explains how design science has been used during the case study, and provides ethical considerations. The Theoretical Background chapter discusses the knowledge gap and the different techniques relevant to this report. Iterations describes the process used and the evolution of the prototype. The suggested prototype is presented in the Results chapter, demonstrated with a dummy course; it outlines the motivation behind the design decisions and how the prototype fulfills the teachers’ requirements. The Evaluation chapter evaluates the findings and whether the prototype can be used for actual courses. The Discussion chapter considers whether the issue was addressed and the problem resolved, and compares the outcome to the outcomes of similar studies. The Conclusion outlines the results, discusses aspects that could have improved the outcome, and gives suggestions for future work.

2 Methodological approach

A prototype has been created to answer the research questions described in section 1.3. The research strategy used is Design science together with the research design Case study.

2.1 Design science

Peffers et al. [12] describe that a common practice is to use design science when creating prototypes/artifacts and that design science research can be structured into the following activities.

Activity 1, Problem identification and motivation, identifies the research problem and motivates the value provided by the suggested prototype.

Activity 2, Objectives, collects requirements from the problem definition and provides knowledge of the possibilities.

Activity 3, Design and development, involves the creation of the prototype. From a conceptual perspective, an artifact within design research can be any designed object contributing to research. The activity includes decisions on the prototype’s intended functionality and its construction, followed by the creation of the prototype.

The Design and Development process is followed by Activity 4, Demonstration, in which the use of the prototype is presented to demonstrate how it solves the problem.

Activity 5, Evaluation, consists of observations on how well the prototype solves the problem. The evaluation process involves comparing the objectives of the solution and the actual results from the use of the prototype in the demonstration. In this thesis, evaluations have been provided by teachers at Linnaeus University. The teachers’ evaluations have been analyzed and interpreted by the author. The mentioned activities have been worked with iteratively to enhance the prototype.

Lastly comes Activity 6, Communication, in which the problem, its importance, the prototype, and its effectiveness are communicated to the research community. This thesis has been communicated via this report and via Linnaeus University’s GitLab instance.


Figure 2.1. The arrows pointing to an earlier activity represent how the author has worked, e.g., by directly moving from activity six to activity three.

2.1.1 Activity 1-2 - Identifying the problem and consulting teachers

The first activity, Problem identification and motivation, was conducted through discussions with teachers at Linnaeus University to comprehend the problem and understand why it is essential to research. These aspects have already been defined in the Introduction chapter.

The second activity, Objectives, was initially conducted during an early phase of the project by interviewing three teachers from the computer science department of Linnaeus University to gather requirements. The author handpicked the teachers using his own judgment; the criteria were that they taught programming and used GitLab for source code-related assignments. One teacher had previous experience with an automated plagiarism detection tool and was therefore included.

Due to the ongoing covid-19 pandemic, all interviews were conducted digitally to follow Swedish governmental recommendations and prevent the virus’s spread. It has been necessary to consider all requirements since the teachers use GitLab regularly for different courses, levels, and programming languages.

2.1.2 Activity 3-5 - Development, demonstration, and evaluation

Activities 3-5 have been worked on iteratively. The process in each iteration started with planning what should be done, why, and how. The goal was to restrict the development to fulfilling only the identified needs in order to streamline the progress of the prototype. However, there were occurrences when the author implemented more than planned during an iteration. The implementation was presented to the teachers, who gave feedback, followed by an evaluation against the requirements. During the last iteration, the prototype was evaluated against all requirements. The evolution of the prototype is found in Chapter 4.

2.3 Case study

A case study [13] dives deep into one specific situation, e.g., an organization. The aim is to create a detailed picture of an individual object or situation as a foundation for understanding a general phenomenon. A case study can be conducted using interviews and observations. When it comes to gathering requirements, case studies can help overcome some of the drawbacks of surveys. Case studies allow a more in-depth examination of stakeholder needs and requirements over an extended period, and the author may identify requirements not explicitly stated by the stakeholders. The deeper understanding obtained by a case study and the reduced reliance on stakeholders may result in more comprehensive and relevant requirements. However, case studies are time-demanding, rely heavily on the author’s competence, and may be biased due to the author’s preconceptions and interests.

A case study may help carry out the activities of the design science process. An in-depth understanding of the problem can be achieved when it is studied in a real environment, which helps formulate the problem and collect requirements. A case study is also a prominent alternative for evaluating the prototype, since the evaluation is done in the prototype’s real environment instead of an isolated one.

2.4 Requirements

The requirements have been defined after interviews with three teachers at Linnaeus University. The requirements and the motivation behind each of them can be found in Iteration 1, section 4.2. The requirements have been evaluated against the final version of the prototype by teachers from Linnaeus University.

2.5 Reliability and Validity

Bryman and Bell [14] point out that it is essential to address issues that may reduce the reliability of the study, such as errors in the method, and to find means to reduce their impact on the outcomes. Reliability refers to the reproducibility of the study: whether others would get the same results if they were to replicate it. Another vital aspect is validity, which refers to whether the research findings are valid and supported by the method and the data collection process.

There are several types of validity problems that might occur. Construct validity concerns the misinterpretation of formulations; an effort has been made to use clear and objective definitions to reduce such validity issues. Internal validity concerns whether the results and conclusions follow from the data collected. Bias is an internal validity problem for this thesis, since the requirements have been collected from the same persons who determined whether the prototype fulfilled them. Using external stakeholders could reduce the risk of such internal validity problems; however, it could result in a prototype that does not meet the requirements of the computer science department of Linnaeus University. External validity concerns generality: whether the method is general enough to ensure that the outcomes justify the conclusions. For this thesis, external validity is an issue because the research design is a case study. Other academic institutions might use GitLab in other ways than Linnaeus University, which would mean that the prototype does not apply to requirements from other institutions. However, this study has aimed to allow other institutions with different configurations in GitLab to benefit from the results by adopting the same configuration and architecture as Linnaeus University.

Because the research method is design science, the result might not be reliable for several reasons. The prototype could fail to fulfill the requirements because the solution lies outside the intended use of GitLab. The choice of technologies and insufficient knowledge on the author’s part might affect the outcome. The validity could also be affected, since the time frame could restrict the results, leaving less time to try different solutions to reach the most optimal result. Not including all explored paths within the report might affect the reliability. The author has made an effort to include all reasoning behind the choices made, but there could be uncertainties affecting the reliability of the project.

The requirements gathered from the teachers have been in their native language, Swedish, for the convenience of both the teachers and the author. Conducting interviews in the native language of both parties is beneficial, since it allows the interviewees to speak freely and explain in their own words without language limitations. The author has translated and reformulated the gathered requirements into English, where misinterpretations might occur. To avoid such misunderstandings, the requirements have been carefully reformulated and presented to every teacher. As mentioned earlier, the requirements were evaluated against technical possibilities and limitations due to time constraints. The author’s technical background has set the level of technical ability for this study, which may have affected the development process and the final prototype.

2.6 Ethical Considerations

Various ethical considerations may arise during different stages of a thesis, such as during data collection and analysis of findings [15]. The following aspects, as described by Bryman and Bell [14], were considered to conduct the interviews with the teachers from Linnaeus University effectively. The participants were informed in advance of the study’s goal, what the gathered information would be used for, and how it would be obtained. The interviewees gave informed consent, meaning they participated voluntarily and were informed that they could withdraw their participation and the provided information at any time during the project. They were informed that the interviews were voice-recorded and that the recordings would be stored securely and removed after the end of this thesis. The author guaranteed the interviewees that the retrieved information would only be used for this study. The interviewees were informed that they would be anonymously referred to as Teacher 1, 2, and 3. The questions to the interviewees were asked with an objective approach to avoid biased information.

As mentioned earlier, plagiarism is a severe issue, and it has been taken into consideration during all parts of this thesis. As a result, the author has been careful to credit and quote the proper author(s) where necessary. An effort has been made to avoid misrepresenting anyone else’s work as the author’s own.

During activities that involved executing the plagiarism detection tools, inactive students’ submissions were used to populate the study sample. An ethical concern would arise if the plagiarism detection tools found plagiarism in the inactive students’ submissions. The collected submissions have been anonymized to avoid such risks. The purpose of this study has not been to detect plagiarism among students but to create prerequisites for teachers to do so.

3 Theoretical Background

In this chapter, the different terms used in this thesis are explained further.

3.1 Git and version control systems

Git is currently one of the most widely used version control systems (VCS). Atlassian [16] describes version control as the practice of keeping track of and managing changes to source code. A VCS keeps track of every modification to the code, and if a mistake has been made, the current version can be compared to previous versions to detect where the mistake was made. Git is defined as a distributed version control system (DVCS): all contributors within a project have their own local copy of the entire code and can work on different features simultaneously without worrying about overwriting others’ changes (referred to as branching). In contrast to a centralized VCS, where contributors must synchronize their code with a server, a DVCS lets contributors work locally until they are done with a feature [16]. The workflow consists of three stages in the following order: add, commit, and push. Adding tracks the changes made, committing saves the tracked changes to the local repository, and pushing uploads the changes to the remote repository [17]. A repository is a virtual storage that holds different versions of the source code [18]. When working with branching, there is one main branch (often referred to as master), and sub-branches are created and worked on for bugs or features. When the implementation on a sub-branch has been completed, the changes are incorporated into the master branch, which is called merging [19].
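As an illustration of the three stages, the following minimal sketch uses the third-party GitPython library (an assumption for illustration; this thesis does not prescribe a client library), with the repository path, file name, and commit message as placeholders:

```python
# Illustrative only: the add/commit/push workflow via GitPython (pip install GitPython).
from git import Repo

repo = Repo("/path/to/local/clone")         # an existing local repository
repo.index.add(["src/App.js"])              # add: track the changes made
repo.index.commit("Implement feature X")    # commit: save to the local repository
repo.remote("origin").push()                # push: upload to the remote repository
```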

3.2 Continuous integration

Linnaeus University uses several pipelines for students’ assignments that help verify that the students’ code builds correctly, that it passes automated tests, and that the student has submitted their work before the deadline. Therefore, the possibilities of integrating source code plagiarism detection into the workflow Linnaeus University uses for pipelines have been investigated; more on that can be found in Chapter 4. Fowler [20] defines CI as a process within software development, focusing on how to collaborate in teams. The core principle is that each team member should integrate their source code changes at least daily. Integrating changes daily reduces the impact of potential integration errors between different contributors’ changes. By integrating frequently, the risk of conflicts is reduced, and conflicts that do arise are usually easier to resolve. A CI pipeline is an automated verification system that enables frequent integration. A pipeline is a set of instructions for automating the process of creating and testing an artifact; these instructions are called jobs. Everything needed to build and run an application should be in a repository. Every new integration is run through a pipeline to ensure that the changes do not break the build. A critical aspect of CI is that the master branch should always be stable and that the pipeline should quickly provide details on whether it succeeded or failed [20].

3.3 GitLab and source code management

An online source code management (SCM) system enables sharing of source code in a project. GitLab, Github, and Bitbucket are examples of SCM services. An SCM system stores the codebase and provides a user-friendly interface for handling merging and avoiding merge conflicts [21]. The action of merging one branch into another involves a pre-step: creating a request to merge. The request is referred to as a pull request in Github and Bitbucket and a merge request in GitLab. By adopting merge requests, other project members can perform code reviews and approve or reject the requested changes [19]. GitLab, Github, and Bitbucket have in common that they offer numerous features beyond storing source code. GitLab differs from the other examples in that it is open source. GitLab describes its application as a complete continuous integration toolkit that creates a streamlined software workflow [22]. A majority of Scrum concepts can be mapped onto GitLab features: backlog items are referred to as issues, and sprints are referred to as milestones [23]. Other SCM services offer similar functionality.

The computer science department of Linnaeus University uses GitLab in a self-hosted instance. This thesis has therefore focused on finding an optimal solution for this platform only. The vocabulary used is the one that is set up by GitLab. Two major topics are explained as follows to avoid misconceptions.

Projects are used to store source code (and are also known as repositories). Apart from that, they can be used for issue tracking, work planning, and code collaboration, and to continuously build, test, and deploy an application using the built-in continuous integration/continuous delivery features [24].

Groups can be used to store one or multiple projects that are related to each other. For instance, a group can be created for a company’s members, named company-team, and subgroups (backend, frontend, production) can be created for each team [25].

3.4 Automated source code plagiarism detection tools

Lancaster and Culwin [26] have compared 11 different automated source code plagiarism detection tools. The tools were compared in terms of the number of supported programming languages, classification techniques used, and visual presentation. It was concluded that MOSS and JPlag were the best-suited alternatives. MOSS was considered marginally better than JPlag regarding robustness and availability [26].

Hage et al. [27] have compared five different tools: MOSS, JPlag, SIM, Marble, and Plaggie. Performance and features were the aspects compared. Two experiments were done for the performance comparison: one comparing the sensitivity of the tools in detecting plagiarism and one comparing the top ten highest-scoring plagiaristic pairs of every tool. The features compared were the number of supported programming languages, visual presentation, extendability, usability, the ability to exclude files and template code, support for historical comparisons, file- or submission-based scoring, web-based or local operation, and whether the tools are open source. It was shown that the results from the top 10 comparisons between JPlag, MOSS, and Marble were similar and more accurate than the results reported from SIM and Plaggie. The feature comparison showed that JPlag and MOSS were superior in terms of the number of supported languages (six in JPlag and 23 in MOSS). JPlag received the highest score regarding usability and presentation of results [27].

Martins et al. [28] have compared nine plagiarism detection tools: CodeMatch, CPD, JPlag, Marble, MOSS, Plaggie, Sherlock, SIM, and YAP. The tools were compared based on the features offered and were further tested against different source codes. The features compared were the supported languages, extendability, accuracy of the reported results, visual presentation, the ability to exclude template code, submission type, local or web-based operation, and whether the tools are open source. The detection performance was tested on source code created explicitly for this purpose. It was shown that MOSS, JPlag, and Marble reported the most accurate plagiarism detection scores. The authors concluded that Marble, YAP, and Plaggie were less easy to use due to special file structure requirements and the need for compilation. The interface of JPlag was considered user-friendly and comprehensive, but since it was not available for local use, the authors recommended SIM as an alternative [28].

According to the comparisons by the authors mentioned above, no tool significantly outperformed the others. However, it appears that MOSS and JPlag are the most popular options. Had these studies been published in recent years, other conclusions could have been drawn, since JPlag has long been thought of as closed source and provided as a web-based version only, which is as of today not entirely accurate. The source code of JPlag is now available on Github, as is a local version. The authors of JPlag strongly advise against using the web-based version because it is out of date [29]. Therefore, JPlag and MOSS have been examined further and implemented in different ways.

3.4.1 JPlag

Guido Malpohl developed JPlag as a student project at Karlsruhe University in 1996. JPlag transforms source code files into token strings that represent a program’s structure. JPlag uses the Greedy String Tiling algorithm with various optimizations for improved efficiency [27]. JPlag currently supports the programming languages Java, C#, C, C++, Python 3, Scheme, and natural language text [29]. On their Github repository [30], JPlag provides information on how to add more languages. The results from JPlag [30] are generated as several HTML files that can be accessed from a web browser. The start page presents an overview of the analyzed submissions with clustering of pairs. Plagiarism rates are ranked according to similarity measurements as percentages between 0 and 100. The rates are divided into an average similarity and a maximum similarity score. The highest rates are presented at the top and the lowest at the bottom. Every reported pair of submissions is clickable so that the two submissions can be compared side-by-side. Colors are assigned to parts of the files that are similar on a token level. JPlag [29], [30], [31] supports the inclusion of template code. It is executed on a local computer, and it supports the comparison of all submissions against each other or of one submission against a bulk of other submissions. It does not support the comparison of submissions in different programming languages.
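As an intuition for the matching step, the following toy sketch implements Greedy String Tiling over token strings in a naive, unoptimized form; it is not JPlag’s actual code, and the token names are invented:

```python
# Toy Greedy String Tiling: repeatedly mark the longest common unmarked
# substring of the two token strings until no match of at least min_match
# tokens remains. Similarity is the fraction of tokens covered by tiles.
def greedy_string_tiling(a, b, min_match=2):
    marked_a, marked_b, tiles = set(), set(), []
    while True:
        best = None  # (length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and i + k not in marked_a and j + k not in marked_b):
                    k += 1
                if k >= min_match and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            return tiles
        length, i, j = best
        marked_a.update(range(i, i + length))
        marked_b.update(range(j, j + length))
        tiles.append(best)

tokens_a = ["VARDEF", "ASSIGN", "LOOP", "APPLY", "RETURN"]
tokens_b = ["LOOP", "APPLY", "RETURN", "VARDEF", "ASSIGN"]
print(greedy_string_tiling(tokens_a, tokens_b))  # [(3, 2, 0), (2, 0, 3)]
```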

The authors behind CodeGrade, a blended learning application explicitly developed for programming education [32], have extended the functionality of JPlag with added support for Javascript, PHP, R, Scala, Jupyter notebooks, and JSON [33].

3.4.2 Moss

MOSS is an abbreviation of Measure Of Software Similarity, and it was developed by Aiken et al. at Stanford University in 1994. It is available for free as a web service, accessible via a submission script obtained from the MOSS website [34]. MOSS detects similarity in source code using a document fingerprinting algorithm named winnowing, in which a sequence of characters is used to create a unique document fingerprint. The algorithm is not entirely publicly available, to prevent it from being circumvented. The system allows the inclusion of template code [35]. MOSS currently supports 25 programming languages, including Java, Python, C, C++, C#, Javascript, Scheme, Matlab, Haskell, Perl, and Assembly [34]. The results obtained from MOSS are accessible to everyone (except web crawlers) via their website through a URL containing randomly generated integers. The results are available for approximately 14 days [34]. The visual presentation was created by the same author as JPlag’s [28] and is similar to the output of JPlag [36]. MOSS presents a list of pairs, ordered by the number of matched tokens. Each pair is clickable, and the submissions are then presented side-by-side for manual inspection. Similarities are highlighted in the same color. The MOSS results are self-adjusting: parts of code common to most submissions are assumed to be provided code and are thus excluded from the calculations. This could result in two identical submissions being reported as less than 100% similar [26].
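The published winnowing idea [35] can be sketched as follows; this is a simplified illustration (the hash function, k-gram size, and window size are arbitrary choices here, and MOSS’s production implementation is not public):

```python
# Simplified winnowing: hash all k-grams of a document and keep the minimum
# hash of every window of consecutive k-gram hashes as the fingerprint.
import hashlib

def fingerprint(text, k=5, window=4):
    grams = [text[i:i + k] for i in range(len(text) - k + 1)]
    hashes = [int(hashlib.sha1(g.encode()).hexdigest(), 16) for g in grams]
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}

a = fingerprint("for i in range(10): print(i)")
b = fingerprint("for j in range(10): print(j)")
print(len(a & b) / len(a | b))  # rough fingerprint overlap between two documents
```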

4 Iterations

This chapter presents the evolution of the prototype, the decisions made during the process, the explored approaches, and how the results have been evaluated. Different approaches to integrating source code plagiarism detection into GitLab have been taken since this study has aimed to find the most suitable solution. A summary of how the project has evolved has been described in section 4.1. Complete iterations and evaluations can be found in the sections below, starting from 4.2.

The prototype has been worked on during six iterations. The iterations have included the activities planning, design, development, presentation, and evaluation. Planning involved deciding which requirements to consider; design covered the architectural decisions; development included implementation and manual testing; presentation meant demonstrating the prototype to teachers and receiving feedback; and evaluation compared the iteration against the requirements and expectations. The iterative process is illustrated in figure 4.1.

Figure 4.1. The green boxes refer to the development step. The first iteration included the overall planning of the project, followed by interviews with three teachers to collect requirements for the project. The emojis represent the teacher(s) involved, i.e., there were three teachers involved during the initial interviews, and three teachers were involved during the demonstrations.

4.1 Summary of all iterations

The first iteration began with gathering requirements by interviewing teachers at Linnaeus University. After the requirements had been collected, the development of a script was initiated. The script was executed locally, downloaded all merge requests for a particular milestone in a course created for test purposes, and performed plagiarism detection using MOSS. The results were uploaded to the course’s Wiki in GitLab.

During the second iteration, the script was extended to be more automated and secure. Instead of executing it for a particular milestone and course, the script contained a configuration file with multiple courses to keep track of. Whenever a milestone for any of the provided courses had been reached, the procedure described above followed.

During the third iteration, alternative ways of integrating plagiarism detection were explored. A CI pipeline (as explained in section 3.2) was created and used to run plagiarism detection upon merge requests. JPlag replaced MOSS to compare the source code within a merge request with old submissions.

The fourth iteration focused on how identified plagiarism should be presented within GitLab. A merge request with suspected plagiarized code received a label. The result from JPlag, available as a pipeline artifact, was password protected. The prototype was extended to support template code.

The fifth iteration consisted of removing the creation of labels for submissions suspected of plagiarism and instead generating test reports that failed or passed. Alarming plagiarism rates resulted in pipelines ending with warnings, which could be seen from the overview of all merge requests within the test course. The architectural structure was revised, and programming language-specific pipelines were created. The prototype was tested on Javascript, Java, and Python. Documentation was written to provide teachers with instructions on how they could use the prototype in their future assignments.

The sixth and last iteration involved relocating the reports from JPlag so that they are only accessible by teachers within the respective courses.

During the evolution of the prototype, a fictional course was created by the author and used only for this purpose. Whenever one or several students are referred to, they are not actual students but fictional students created by the author.

4.2 Iteration 1

The focus of the first iteration was to set the requirements of the prototype, examine MOSS and JPlag, and produce a prototype for further discussion.

The iteration was initiated by conducting interviews with three teachers at Linnaeus University to collect requirements for the prototype. The interviews were conducted digitally, were transcribed, and can be found in appendix A. One of the interviewees had previous experience with JPlag, and the others had no experience with any automated plagiarism detection tool. The interviewee who had used JPlag previously executed it on a local computer, and the submissions were collected manually. Another interviewee performs plagiarism detection manually once he/she notices that a submission being reviewed is similar to a previously viewed submission. The third interviewee performs no plagiarism detection activities but instead focuses on oral examinations where students need to explain their solutions. The identified requirements are described below. Each requirement is explained with the motivation for why it was included. They will later be referred to by their numbering, e.g., R1.

R1. The system must be automated with as few manual steps as possible

The interviewees agreed that an integration of an automated source code plagiarism detection tool needed to have few manual steps for efficiency and not significantly affect the time spent on correcting a submission.

R2. The system must be transparent for students in which the student can see if their solution is suspected of plagiarism without leaking any sensitive details, such as other students’ solutions and usernames

The interviewees expressed that they aim to maintain openness in all aspects, which should include plagiarism. However, they agreed that a student should not see what others have handed in or others’ plagiarism scores. If students could compare their solutions with each other, they could learn how to avoid detection in future assignments.

R3. Plagiarism detection should be executed when a deadline for an assignment has passed

The interviewee with experience of using JPlag thought that the submissions should be compared to each other only after all students have confirmed that they want to hand in, to prevent students from rerunning the plagiarism detection tool to check whether their code is unique enough.

R4. Plagiarism detection should be executed in connection with a student submitting their assignment via merge request

Linnaeus University has a working CI workflow where students’ code is automatically built, linted, and tested, and where it is checked whether the merge request was made before the deadline. The interviewees therefore thought it would be relevant to run plagiarism detection as a CI pipeline step.

R5. The system should include submissions from previous years when executed

The interviewees claimed that students might find assignments published or shared by students from previous years. Therefore, including old submissions was considered important.

R6. The student’s submission should be saved automatically together with the old solutions from previous years

To increase the chance of detecting plagiarism.

R7. The reports generated by the plagiarism detection tool must be available within GitLab, only accessible by teachers

The interviewees wanted to use GitLab to its full extent and only use one platform. Publishing the report on GitLab improves efficiency and reduces the number of manual steps.

R8. False positives from the system must be avoided to the greatest extent

The interviewee with previous experience of using JPlag was concerned with false positives reported by JPlag, due to the increased time it takes to go through them. Since this study relies on existing plagiarism detection tools, R8 may be difficult to address.

R9. The system must support the inclusion of template code

A common practice is that code-related assignments come with starter code, which should be excluded from the plagiarism comparison to avoid false positives. This feature could also be used when an assignment should be solved in a specific manner.

R10. It must be possible to filter by non-suspicious/suspected plagiarism when the system is executed upon every merge request

To allow examiners to start by correcting assignments not suspected of plagiarism or vice versa.

R11. Detected plagiarism must be available to view on repository-level by the submitting student and examiners

This relates to executing plagiarism detection upon a student’s merge request made from their repository. The interviewees wanted to view detected plagiarism within the merge request and use pipelines as they are intended to be used.

R12. The system must operate on Javascript, Java, and Python. The tool should also support C#

The prototype should work for the programming languages taught at the computer science department of Linnaeus University. There are no assignments in C#, but one interviewee claimed that they might use it in the near future.

After the requirements had been defined, the plagiarism detection tools MOSS and JPlag were explored. Since JPlag did not support Javascript, but the creators of CodeGrade had published their extended version of JPlag with Javascript support, an effort was made to run it. The author failed to run CodeGrade’s extended version of JPlag and contacted them via email for support. The original version of JPlag was installed and run successfully. However, with Javascript being one of the three programming languages required to be supported by the prototype, MOSS had to be the plagiarism detection tool to start with.

The initial thought was to run the plagiarism check when a student had performed a merge request. However, since using MOSS involves sending requests to their servers, it was discovered that MOSS’s servers had a rate limit. The workflow was reconsidered to instead send one request to MOSS with all the students’ submissions after a deadline was reached. JPlag runs locally, and the author reasoned that it would probably be technically feasible to perform a plagiarism check upon a student’s merge request, given that a database was available to store submissions from previous years; otherwise, it would be problematic that the first student to perform a merge request has no submissions from the current year to be compared with. The disadvantage identified with such an approach was the risk of high CPU load on the GitLab server, since JPlag seemed to support only comparison between all students, not one student against all others. The source code of JPlag was investigated to see if it would be feasible to add an option that performs such a comparison. Due to the time limitations, the conclusion was that it would not be feasible.

The implementation of the first version resulted in a locally executable script that expected the following arguments: an ID of the GitLab group, the name of the milestone, and the programming language used in the assignment. The script cloned all the repositories that had made a merge request against the provided milestone, visited each submission folder, and adjusted the folder structure so that each student submission contained the source code files directly. The reason for changing the folder structure was that MOSS expected the submissions to be structured in a certain way, without any subdirectories within each student folder, as demonstrated in figure 4.3.

Figure 4.3. Each submission folder needed its files to be available directly. Subdirectories and files with other extensions than the provided programming language were ignored.

The submissions were submitted to MOSS, and in return, MOSS outputted a URL where the results could be viewed. The script visited the URL, saved the HTML structure into a local file, and converted it to the markdown format. The markdown file was published as a Wiki page inside the provided GitLab group. The wiki page presented the results, but to view the side-by-side comparison between two submissions, any clicked link redirected to the results page hosted on MOSS’s servers. The wiki page was protected to the extent that only teachers were able to access it.
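A hypothetical reconstruction of this first script is sketched below (it is not the thesis’s actual code): the server URL, token, group ID, milestone name, and file extension are placeholders, the third-party python-gitlab, GitPython, and mosspy packages are assumed, and the wiki-publishing step is omitted:

```python
# Collect merge requests for a milestone, clone each submission, flatten the
# folder structure as in figure 4.3, and submit everything to MOSS.
from pathlib import Path

import gitlab          # pip install python-gitlab
import mosspy          # pip install mosspy
from git import Repo   # pip install GitPython

gl = gitlab.Gitlab("https://gitlab.example.edu", private_token="<token>")
group = gl.groups.get(1234)              # the GitLab group ID argument
work = Path("submissions")

for mr in group.mergerequests.list(milestone="Assignment 1", all=True):
    project = gl.projects.get(mr.project_id)
    dest = work / project.path
    Repo.clone_from(project.http_url_to_repo, dest, branch=mr.source_branch)
    for f in dest.rglob("*.js"):         # extension given by the language argument
        f.rename(dest / f.name)          # MOSS wants files directly in each folder

moss = mosspy.Moss("<moss-user-id>", "javascript")
for f in work.glob("*/*.js"):
    moss.addFile(str(f))
print(moss.send())                       # MOSS replies with a URL to the results
```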


4.2.1 Evaluation of iteration 1

The first version was presented to a teacher who works with GitLab daily. One idea that came up was to keep the folder structure in the names of each student’s source code files, i.e., src-components-App.js instead of only App.js, to make it more efficient for teachers to locate suspicious files. Running the script locally is convenient; however, it requires the manual work of executing it outside of GitLab. On the other hand, it could be considered safer than making it accessible via the internet. From a user perspective, it is not an ideal solution; it would be better to have the ability to execute the script from the UI within a GitLab group. Since MOSS sends requests over the internet and the student submissions are stored on their servers, it was concluded that the submissions require pre-handling by erasing personal information in file and folder names and in comments.

4.3 Iteration 2

The second iteration focused on extending and automating the prototype developed from the first iteration and excluding personal information when uploading the submissions to MOSS.

The author investigated the possibilities of executing the script automatically based upon the expiration of a milestone in GitLab. The explored paths were the usage of CI pipelines and webhooks. At the time, it turned out that GitLab supported neither the execution of pipelines nor webhooks based on milestone expiration. However, it was found that others have requested these features, which opens the possibility of using plagiarism detection in such a way in the future. With that in mind, the script was developed further to run based on cron jobs, i.e., executed at intervals. By doing so, the script would be more automated. The extended functionality made it possible to provide multiple GitLab groups inside a configuration file. The script performed checks against each group to see if any milestones had passed recently. The functionality that changed the folder structure of the submissions was extended to remove usernames from file and folder names and from comments in the source code files.
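A hypothetical sketch of the cron-driven check (the config format and the run_check helper are invented for illustration; the iteration-1 routine itself is not shown):

```python
# Read the courses to watch from a configuration file and trigger the
# plagiarism check for every milestone that expired within the last day.
import json
from datetime import date, timedelta

import gitlab  # pip install python-gitlab

def run_check(group, milestone, language):
    """Placeholder for the iteration-1 routine (clone, flatten, submit to MOSS)."""
    print(f"Checking {group.name} / {milestone.title} ({language})")

gl = gitlab.Gitlab("https://gitlab.example.edu", private_token="<token>")

with open("config.json") as fh:
    courses = json.load(fh)["courses"]  # e.g. [{"group_id": 1234, "language": "javascript"}]

for course in courses:
    group = gl.groups.get(course["group_id"])
    for ms in group.milestones.list(all=True):
        due = ms.due_date and date.fromisoformat(ms.due_date)
        if due and date.today() - timedelta(days=1) <= due < date.today():
            run_check(group, ms, course["language"])
```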

4.3.1 Evaluation of iteration 2

One problem identified was that errors would occur if a group contained milestones with merge requests in different programming languages, since the programming language was provided at the group level. This could be solved by changing the configuration file to include the name of each milestone with additional settings per milestone. However, the file would probably grow over time and potentially become unmanageable. As an alternative to the cron-based solution, it could be interesting to redesign the script so that it is executed via pipelines or webhooks based on commits, pushes, or releases to a repository inside each GitLab group, created only for this purpose. An action on this repository would need to be made manually by the examiner after reaching a milestone. The advantages of using pipelines over webhooks are many: a webhook would have to call a server hosted somewhere, which in turn clones the repositories and outputs the results to the group’s Wiki page, whereas with pipelines the script could be hosted within GitLab and the results presented via test reports or artifacts. Executing the script and presenting the results in the same environment would be the most user-friendly option. Thoughts on whether a custom integration in GitLab could be the next step forward were considered.

4.4 Iteration 3

The third iteration focused on shifting from a locally executable script to a prototype integrated into the user interface of GitLab. The prototype shifted from running the plagiarism check after milestone expiration to running it upon merge requests. Therefore, MOSS was exchanged for JPlag, since JPlag could perform plagiarism detection of one student against others.

The possibility of building an integration in GitLab was explored. An integration is a way of adding functionality to GitLab, similar to a plugin. The source code of GitLab was investigated, along with how other integrations had been implemented. It was found that these integrations were built into the large codebase of GitLab and contained numerous lines of code written in Ruby using the Ruby on Rails framework, which the author had no experience with. For this reason, the idea of building an integration was discarded.

Further, it was discovered that GitLab could be extended using File Hooks, which are a way of performing custom integrations without modifying the source code of GitLab. However, it was discovered that the user interface could not be altered using File Hooks, so the author decided to search for other alternatives. Developing a Google Chrome extension seemed like an option, since it allows manipulation of the HTML structure of websites. However, such a solution could not be published on the Chrome Web Store, since its script would need access to parts of Linnaeus University’s GitLab instance. It would therefore need to be installed manually from the Developer mode of Google Chrome. It would also require the teachers to use Google Chrome as their web browser. Another disadvantage of using a Chrome extension for this purpose was that the results from the plagiarism detection tool would only be available locally.

The author managed to run CodeGrade’s extended version of JPlag and discovered that their version supported the comparison of one student against all others by providing a path to the old submissions. The results from CodeGrade’s extended version of JPlag were stored in CSV files containing similarity measurements as percentages between the submissions. They did not include any data that could be accessed to view details of the submitted code. Therefore, the author investigated the source code and found that the functionality that outputs HTML files to enable the side-by-side view of two submissions was commented out. After re-enabling the code, the program was rebuilt, and JPlag outputted both CSV and HTML files.

A GitLab group was created with the same structure as Linnaeus University uses for their courses, containing the subgroups Student Projects, Course Content, Pipelines, Templates, Management, and Archive. Under Student Projects, a subgroup named ab224qr was created, containing a project named Assignment 1 to test a workflow using CI pipelines. By default, the project contained one master branch. Another branch named release was created to follow the workflow used by Linnaeus University. Linnaeus University has created instructions for how pipelines within groups should be created, and these have been followed. The pipeline was created with the step check_plagiarism, which checked whether the event that triggered the pipeline was a merge request onto the student’s release branch.
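A minimal sketch of that gate, assuming the job runs a Python script; the two environment variables are GitLab CI’s predefined variables:

```python
# Exit early unless the pipeline was triggered by a merge request targeting
# the student's release branch.
import os
import sys

if (os.environ.get("CI_PIPELINE_SOURCE") != "merge_request_event"
        or os.environ.get("CI_MERGE_REQUEST_TARGET_BRANCH_NAME") != "release"):
    print("Not a merge request onto release; skipping the plagiarism check.")
    sys.exit(0)

# ...otherwise fetch the old submissions and run JPlag.
```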

A way of collecting old submissions needed to be found. The author consulted the teachers on how old submissions could be collected. It turned out that there was no standardized way of separating students from previous years from the current students. Although an Archive group located within each course group could preferably be used to hold old student projects, this option is not widely used, and it was found that some archived student projects were empty. The author created a separate script that accepts a GitLab group name as an argument and clones all repositories inside it. This script could be used in two ways: by running it upon a student’s merge request to get submissions to compare the current student’s solution with, or by uploading all the cloned repositories into a separate repository dedicated to storing submissions. The first option would cause a high load on the GitLab server, since one course could contain hundreds of student projects. The latter option is beneficial since it means that only one request to the GitLab server needs to be made upon each merge request. This repository could store the submissions in different ways, such as separated by course, assignment, or programming language. A discussion with one teacher led to creating a GitLab group named Submissions with one project for each programming language containing student solutions. In such a way, the plagiarism comparison does not need to be restricted to the same assignment; however, it could cause unnecessary calculations with JPlag. Not much thought was put into the decision on the structure, since it could easily be changed in the future.
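A hypothetical sketch of that helper script (the server URL and token are placeholders; python-gitlab and GitPython are assumed):

```python
# Clone every repository inside the GitLab group given as a command-line
# argument, e.g. to populate the Submissions group with old solutions.
import sys

import gitlab         # pip install python-gitlab
from git import Repo  # pip install GitPython

gl = gitlab.Gitlab("https://gitlab.example.edu", private_token="<token>")
group = gl.groups.get(sys.argv[1])

for p in group.projects.list(include_subgroups=True, all=True):
    project = gl.projects.get(p.id)
    Repo.clone_from(project.http_url_to_repo, f"old-submissions/{project.path}")
```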

A project named Javascript was created inside the Submissions group and filled with five manually uploaded student solutions, as demonstrated in figure 4.4. One file of source code was uploaded to the ab224qr project. The author started to develop a Dockerfile to be used as the image for the check_plagiarism step in the pipeline. The Dockerfile referenced the executable version of JPlag that had been rebuilt with support for HTML output and copied it into the Docker environment. It also cloned the repository containing the Javascript parser provided by CodeGrade and moved the parser to a folder on the environment’s PATH, so that the javascript_to_jplag program could be executed from the directory where JPlag ran. A project named Docker images was created inside the subgroup Pipelines. The Docker image was built locally from the Dockerfile and pushed to the Docker images project’s container registry. The image was imported into the pipeline, and JPlag was executed successfully: the pipeline cloned the Javascript project and ran JPlag with the old submissions and the code belonging to the merge request. The output from JPlag was available from the pipeline’s artifacts.
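The publishing commands could look as follows, assuming an illustrative registry path for the Docker images project:

    # Build the JPlag image locally and push it to the project's container registry.
    docker login registry.gitlab.example.com
    docker build -t registry.gitlab.example.com/pipelines/docker-images/jplag:latest .
    docker push registry.gitlab.example.com/pipelines/docker-images/jplag:latest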


Figure 4.4. Each folder contains the source code of a student that was used to test plagiarism detection.

4.4.1 Evaluation of iteration 3

The progress made in the third iteration was substantial. Considerable effort was first spent on finding a way of performing plagiarism detection after a deadline had been reached. Because of the limitations discovered, the iteration took another path and implemented JPlag in a CI pipeline instead. With the pipeline implementation, R4 was met instead of R3. After demonstrating the prototype to one teacher, it was concluded that the CI pipeline implementation was the best solution so far. Such a workflow was considered superior to running plagiarism detection after a deadline had passed, which made R3 no longer relevant. The teacher stated a need to know whether the plagiarism check detects any alarming plagiarism rates without manually inspecting each student’s submission. During the demonstration, it was discovered that the artifacts from the pipeline were available to anyone with access to the student’s project, meaning that the student could see the full report from JPlag and thereby gain access to other students’ solutions and usernames. It is therefore crucial to restrict access to teachers only. An alternative solution could be to strip the output from JPlag of all personal information; however, revealing even anonymously written code to students was not considered suitable either. By making the reports visible only to teachers, there is no need to anonymize the students’ submissions, and the teachers can easily identify plagiarizing students by their usernames. The teacher would also like to see the prototype working with code provided by the teacher that should not be reported as plagiarism (template/boilerplate code).

4.5 Iteration 4

The fourth iteration focused on improving the prototype from the third iteration by presenting the reports from JPlag in a way that only allows access for teachers, and by reducing the number of false positives through providing template code to JPlag.


The author investigated how alarming plagiarism rates for a specific student could be reported without teachers needing to inspect each merge request manually. Alarming plagiarism rates refer to submissions that should be treated as plagiaristic, decided based on the highest similarity measurement found against any other submission. It was discovered that merge requests can be labeled, which Linnaeus University already uses to mark assignments as submitted or graded. Labeling was considered a suitable solution and was implemented in the same CI pipeline, applied whenever alarming rates were found. The label was named Plagiarism.
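A sketch of how the labeling could be done from within the pipeline, assuming an access token with API scope stored in the variable API_TOKEN; the GitLab API’s add_labels parameter and the predefined CI variables exist, but the exact step shown here is illustrative:

    # Attach the "Plagiarism" label to the merge request that triggered the pipeline.
    curl --request PUT --header "PRIVATE-TOKEN: $API_TOKEN" \
      --data "add_labels=Plagiarism" \
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/merge_requests/$CI_MERGE_REQUEST_IID"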

The author investigated how pipelines could be configured to show artifacts only to certain users or user roles. Such a feature was not available, which is understandable since the most common use case for pipelines involves viewing artifacts to verify that the source code builds correctly and passes automated tests. As an alternative, it was decided to password protect the artifacts: within the pipeline, the results from JPlag were password protected before being uploaded as artifacts. The password was entered in plain text at this point but should be provided via environment variables in the future.
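A minimal sketch of the protection step, assuming the results are in a jplag_results directory and the password is available in the variable ZIP_PASSWORD:

    # Wrap the JPlag output in a password-protected zip before exposing it as an artifact.
    # Note: zip's -P option uses weak encryption and is a deterrent rather than strong protection.
    zip -P "$ZIP_PASSWORD" -r report.zip jplag_results/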

To add the ability to include template code, a project named Assignment 1 was created inside the subgroup Templates. The project contained one of the files uploaded to ab224qr’s assignment project, to verify that the source code within this file would not be detected as plagiarism. Within the pipeline, a direct link to the repository was added as a variable. The repository was cloned, and the arguments provided to JPlag were extended to include the directory where the template code was located. The pipeline was executed, and JPlag did not report any plagiarism for the provided template code.
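An illustrative invocation, assuming the basecode (-bc) option of the bundled JPlag version; the language flag and directory names are placeholders, and flag semantics may differ between JPlag versions:

    # The template code is placed in a folder inside the submissions root and passed
    # to JPlag as basecode, excluding it from the similarity calculation.
    java -jar jplag.jar -l <language> -bc template -r ./results ./submissions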

4.5.1 Evaluation of iteration 4

The labeling of merge requests was presented to and discussed with two teachers, and it was found that a student who receives such a label would be able to remove it. Such a serious accusation could also have negative impacts on the student’s mental health. It was therefore decided that the labeling should be discarded and replaced with something that lets the student be aware of the highest detected plagiaristic rate. It was concluded that the pipeline should end with a warning, which allows teachers to quickly see which pipelines have warnings, and that alarming plagiarism rates should be reported in a separate artifact viewable by everyone with access to the repository. It was desired to implement the presentation in the same way as the automated tests for specific assignments. Next, the prototype should be tested with Java and Python, and the repository containing the old submissions should be pushed to upon each merge request. Later on, documentation should be written for teachers to follow in order to use the prototype in their courses. The pipeline should be moved from the test313 group into a group named Pipelines, containing a project with different stages already being used at Linnaeus University. The reason for moving the pipeline is that future changes to it then directly affect all courses using it.


4.6 Iteration 5

The visual presentation of the prototype was further improved, and the prototype’s structure was changed so that changes to it affect all courses where it is used. The prototype was tested against R12, and documentation was written to educate teachers on using the prototype in their courses.

The labeling of merge requests within the pipeline was removed. Instead, the pipeline was instructed to end with a warning using exit codes. The author investigated the possibilities of reporting the plagiarism rate in a separate artifact using GitLab’s pre-built user interface. It was discovered that JUnit test reports can be generated within pipelines; JUnit expects the artifact to contain test suites and test cases. An XML file was generated within the pipeline containing one test suite with one test case named Highest match: X, where X represents the percentage of the detected plagiarism. The test case always succeeded and was only generated when alarming plagiarism rates were detected. Furthermore, the pipeline was extended to upload the current student’s merge request into the old submissions repository, replacing the previous solution if the student had made an earlier merge request for the same assignment.
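A sketch of how the warning and the JUnit report could be wired together, assuming GitLab’s allow_failure exit-code support (available since GitLab 13.8) and a wrapper script that writes the highest match and an alarm flag to match.env; the wrapper and the variable names are assumptions:

    check_plagiarism:
      script:
        - ./run_jplag.sh            # assumed wrapper that runs JPlag and writes match.env
        - source match.env          # assumed to define HIGHEST_MATCH and ALARMING
        - |
          if [ "$ALARMING" = "true" ]; then
            {
              echo '<testsuite name="plagiarism" tests="1">'
              echo "  <testcase name=\"Highest match: ${HIGHEST_MATCH}%\"/>"
              echo '</testsuite>'
            } > report.xml
            exit 3                  # exit code listed below => pipeline warning, not failure
          fi
      allow_failure:
        exit_codes: [3]
      artifacts:
        when: always
        reports:
          junit: report.xml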

A project named check_plagiarism was created inside a subgroup named Stages inside the Pipelines group, to move the pipeline out of the course group. In the created project, a template pipeline was created that holds the functionality. For each programming language, a folder was created containing a pipeline specific to that language. These pipelines import the template pipeline and contain the necessary language-specific variables: language, file extension, and link to the repository with old submissions. Inside the Submissions group, Java and Python were added as projects at the same level as Javascript. These projects were filled with example submissions: the Java solutions were retrieved from JPlag’s official demonstration of how the tool presents its results, and the Python solutions were retrieved from a repository on GitHub containing several small programs developed by Geekcomputers. A subgroup named Docker Images was created inside Pipelines to move the Docker container out of the course. Inside the subgroup, a project named JPlag was created holding the same Dockerfile as used in the image created for the test313 group. The Dockerfile was built on the author’s local computer and pushed to the project’s container registry. The structure is demonstrated in figure 4.5.


Figure 4.5. The Dockerfile does not necessarily need to be stored in GitLab, since the Docker container is built locally. Each .check_plagiarism.yml file imports the .check_plagiarism-template.yml file, which is therefore only used within the Stages subgroup.
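As an illustration of this structure, a language-specific pipeline such as javascript/.check_plagiarism.yml could look as follows; the project paths and variable names are assumptions based on the description above:

    # Import the shared template that contains the actual plagiarism-check logic.
    include:
      - project: 'pipelines/stages/check_plagiarism'
        file: '/.check_plagiarism-template.yml'

    # Override only the language-specific settings.
    variables:
      LANGUAGE: "javascript"
      FILE_EXTENSION: ".js"
      SUBMISSIONS_REPO: "https://gitlab.example.com/submissions/javascript.git"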

To test the workflow, a new pipeline was created in the test313 group that imported the following pipeline stages:

- build
- lint
- check_deadline
- check_plagiarism

The first three stages were already set up and used in courses, so it seemed suitable to include them. The last stage, check_plagiarism, was extended within this pipeline to declare a link to the repository containing template code. An assignment template was created in the same course under Templates/Assignment 5. The template included the recently created pipeline and was used to create a project for the student ab224qr. A Javascript file was created in the student project containing 5-10 lines of source code retrieved from one of the submissions in the Javascript project inside the Submissions group. A merge request was created towards a release branch, whereupon the pipeline was triggered and the plagiarism control executed successfully.
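In sketch form, the assignment pipeline used in this test could look as follows; the include paths and the template-code variable are assumptions:

    include:
      # Stages already used at Linnaeus University.
      - project: 'pipelines/stages/build'
        file: '/.build.yml'
      - project: 'pipelines/stages/lint'
        file: '/.lint.yml'
      - project: 'pipelines/stages/check_deadline'
        file: '/.check_deadline.yml'
      # The new plagiarism stage, in its Javascript-specific variant.
      - project: 'pipelines/stages/check_plagiarism'
        file: '/javascript/.check_plagiarism.yml'

    variables:
      # Link to the repository containing the assignment's template code.
      TEMPLATE_REPO: "https://gitlab.example.com/test313/templates/assignment-5.git"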

The workflow was further documented so that teachers can replicate the steps taken and use the pipeline step in their own courses and assignments. As a final step, the password to the zip file containing the results from JPlag was moved from a plain-text declaration within the template pipeline to a masked CI/CD variable declared inside the course group.


4.6.1 Evaluation of iteration 5

The prototype was demonstrated to three teachers at Linnaeus University, which led to the following aspects to consider:

- The author, or any teacher from the computer science department of Linnaeus University, should test the workflow for a new course to verify that the documentation is complete.

- The zip file available as an artifact could be moved from the pipeline to a separate, protected repository containing JPlag results, viewable only by teachers. This would eliminate the risk of students cracking the password.

It was found that R10 could not be met, since GitLab did not support filtering merge requests by pipeline status; at the time of writing, there is an open issue about this on GitLab’s official website. The issue was discussed with the teachers, and it was concluded that the feature is not necessary, since they found it easy to get an overview of the suspected submissions when looking at all merge requests within a course. All requirements were reviewed, and it was agreed that the prototype would fulfill all of them once the zip file was removed from the artifacts and the results were moved into a separate project within the course. The teachers also wanted the source code of JPlag, the executable Java file, and the Javascript parser to be available within the Docker Images project inside the Pipelines group.

4.7 Iteration 6

The last iteration focused on minor changes to the prototype that evolved from the fifth iteration.

A project named Plagiarism Reports was created inside the test313 group’s Management subgroup. This subgroup is only available to teachers within a course, so no students or non-authorized teachers can access the reports from JPlag. The pipeline was extended to remove the zip file creation and instead upload the results to the Plagiarism Reports project. The source code of JPlag, the Javascript parser, and the executable Java file were uploaded to the Docker Images project.
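A sketch of the upload step, assuming a project access token with write access stored in REPORT_TOKEN and an illustrative repository path:

    # Publish the JPlag results to the teacher-only Plagiarism Reports project.
    git clone "https://oauth2:${REPORT_TOKEN}@gitlab.example.com/test313/management/plagiarism-reports.git"
    mkdir -p "plagiarism-reports/${CI_PROJECT_PATH_SLUG}"
    cp -r results/* "plagiarism-reports/${CI_PROJECT_PATH_SLUG}/"
    cd plagiarism-reports
    git add .
    git -c user.name="pipeline" -c user.email="pipeline@example.com" \
      commit -m "JPlag report for ${CI_PROJECT_PATH}"
    git push origin master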

4.7.1 Evaluation of iteration 6

The minor changes were presented to one of the teachers who had seen the results of the fifth iteration. It was concluded that the prototype fulfilled all requirements and that the author should not continue with a seventh iteration. The prototype from iteration six is described and evaluated in the results chapter.


5 Result: the design artifacts

This chapter presents the results and describes the prototype that has evolved from the sixth iteration. The prototype implements the plagiarism detection tool JPlag in GitLab. It includes an architectural structure (described in section 5.1), a workflow that illustrates how the prototype should be used (described in section 5.2), scenarios that verify how the prototype fulfills the requirements (described in section 5.3), and documentation on how teachers can use the prototype in their courses (described in section 5.4). During development, the prototype has been continuously tested manually by the author, and the scenarios have been presented to teachers who have validated the implementation.

5.1 Architectural structure in GitLab

The structure relies on the existence of three groups in GitLab: Pipelines, Submissions, and a course group. Students cannot access the Submissions group. The underlying structure of each group can be found in figures 5.2-5.4. Figure 5.1 explains the meanings of the different terms used in the following diagrams.

Figure 5.1. A GitLab group can contain subgroups and projects. A project can only exist inside a group. Folders and files exist within a project.

The Pipelines group has been illustrated in figure 5.2. The folder check_plagiarism contains one folder for each programming language, including a pipeline YAML file used for importing a pipeline step into an assignment. Each language-specific pipeline imports the template pipeline, which provides the functionality and logic, and overrides the variables needed for that programming language.

The JPlag project contains the Docker image that enables JPlag to be executed; the image is imported in the template pipeline and is thereby available to the language-specific pipelines.


Figure 5.2. When an assignment comes with template code, the Git repository URL of the template code should be provided in the assignment pipeline, overriding an empty variable set in each language-specific pipeline.

The Submissions group has been illustrated in figure 5.3. The group contains one project per programming language, holding the submissions that merge requests are compared against.

Figure 5.3. Each programming language-specific project contains submissions collected from students’ merge requests. Each folder is named PROJECT_ID_STUDENT_USERNAME, e.g. 10533_ab123ba.

Figure 5.4 illustrates the course structure, which had existed before this study was started. The addition is the Plagiarism Reports project that contains reports generated by JPlag.

Figure 5.4. In the subgroup Student Projects, all registered students in a course have their own group with one project for each assignment. When the prototype is to be used for an assignment, a pipeline project should be created inside the subgroup Pipelines, and an assignment template that uses the newly created pipeline should be created inside the subgroup Templates. The students’ assignment projects are generated by GitLab administrators at Linnaeus University from the created assignment template. The assignment template can be empty or contain template code.

5.2 Workflow in GitLab

Since the prototype involves different actors and numerous steps, two figures have been created. Figure 5.5 illustrates the overall steps and decisions for the different actors involved. Figure 5.6 illustrates the behavior of the pipeline upon students’ merge requests and the different scenarios that could occur.
