
Code Clone Detection for Equivalence Assurance

SARA ERSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Code Clone Detection for Equivalence Assurance

SARA ERSSON

Master in Computer Science
Date: August 17, 2020
Supervisor: Cyrille Artho
Examiner: Robert Lagerström

School of Electrical Engineering and Computer Science
Host company: King / Midasplayer AB

Swedish title: Kodklonsdetektering för att säkerställa ekvivalens


Abstract

Offering application programming interfaces (APIs) in multiple programming languages has become commonplace. However, this also brings the challenge of ensuring that the APIs are equivalent regarding their interface. To achieve this, code clone detection techniques were adapted to match similar function declarations in the APIs. Firstly, existing code clone detection tools were investigated. As they did not perform well, a tree-based syntactic approach was used, where all header files were compiled with Clang. The abstract syntax trees, which were obtained during the compilation, were then traversed to locate the function declaration nodes, and to store function names and parameter variable names.

When matching the function names, a textual approach was used, transforming the function names according to a set of implemented rules.

A strict rule compares transformations of full function names in a precise way, whereas a loose rule only compares transformations of parts of function names, and matches anything for the remainder. The rules were applied both individually and in different combinations, starting with the strictest rule, followed by the second strictest rule, and so forth.

The best-matching rules proved to be the ones which are strict and which are not affected by the order in which the functions are matched. These rules also proved to be very robust to API evolution, i.e., an increase in the number of public functions. Rules which are less strict and stable, and not robust to API evolution, can still be used, such as matching functions on the first or last word in the function names, but preferably as a complement to the stricter and more stable rules, once most of the functions have already been matched.

The tool has been evaluated on the two APIs in King’s software development kit, and covered 94% of the 124 available function matches.

Keywords: APIs, Code Clone Detection, API Mapping


Sammanfattning

För att stödja flera olika programmeringsspråk har det blivit alltmer vanligt att erbjuda applikationsprogrammeringsgränssnitt (API:er) på olika programmeringsspråk. Detta resulterar dock i utmaningen att säkerställa att API:erna är ekvivalenta angående deras gränssnitt. För att uppnå detta har kodklonsdetekteringstekniker anpassats, för att matcha liknande funktionsdeklarationer i API:erna. Först undersöktes existerande kodklonsverktyg. Eftersom de inte presterade bra, användes ett trädbaserat syntaktiskt tillvägagångssätt, där alla header-filer kompilerades med Clang. De abstrakta syntaxträden, som erhölls under kompileringen, traverserades sedan för att lokalisera funktionsdeklarationsnoderna, och för att lagra funktionsnamnen och parametervariabelnamnen. När funktionsnamnen matchades, användes ett textbaserat tillvägagångssätt, som omvandlade funktionsnamnen enligt en uppsättning implementerade regler.

En strikt regel jämför omvandlingar av hela funktionsnamn på ett exakt sätt, medan en lös regel bara jämför omvandlingar av delar av funktionsnamn, och matchar den resterande delen med vad som helst. Reglerna applicerades både var för sig och i olika kombinationer, där den striktaste regeln applicerades först, följt av den näst striktaste, och så vidare.

De regler som matchar bäst visade sig vara de som är striktast, och som inte påverkas av ordningen på funktionerna i vilken de matchas. Dessa regler visade sig vara väldigt robusta mot API-evolution, dvs. ett ökat antal publika funktioner i API:erna. Regler som är mindre strikta och stabila, och inte robusta mot API-evolution, kan fortfarande användas, men helst som ett komplement till de striktare och mer stabila reglerna, när de flesta av funktionerna redan har blivit matchade.

Verktyget har evaluerats på de två API:erna i Kings mjukvaruutvecklarkit, och täckte 94% av de tillgängliga funktionsmatchningarna.

Nyckelord: API:er, kodklonsdetektering


Acknowledgement

I want to thank Maria José Mera, my supervisor at King, and the rest of the Developer Relations team for supporting me throughout the whole project, and truly making me feel like a member of the team. I also want to thank Andreas Valter, my unofficial supervisor at King, who was of great help when trying to understand the technical environment. I would also like to thank my supervisor at KTH, Cyrille Artho, for always helping me when I needed support, giving the best advice, and making sense of the problems I faced along the way.

Furthermore, I am really grateful for my classmates and friends at KTH. Albin Byström helped me a lot during my first years when I struggled with keeping up with the programming. Emma Good has been by my side as a true programming partner and friend during the whole master's program. Finally, I am grateful for my partner Johannes Valck, who has always supported me, and convinced me that I have what it takes.


Contents

1 Introduction 1

1.1 Background . . . 1

1.2 Problem Statement . . . 2

1.3 Purpose . . . 3

1.4 Goal . . . 3

1.5 Ethics and Sustainability . . . 3

1.6 Methodology . . . 3

1.7 Delimitations . . . 4

1.8 Outline . . . 4

2 Background 5

2.1 Application Programming Interface . . . 5

2.1.1 Definition . . . 5

2.1.2 Types and Policies . . . 5

2.1.3 Advantages . . . 6

2.2 King . . . 6

2.2.1 Background . . . 7

2.2.2 King’s New SDK . . . 7

2.3 Code Clone Detection . . . 8

2.3.1 Basic Definitions . . . 9

2.3.2 Different Clone Types . . . 9

2.3.3 CCD for Software Improvement . . . 11

2.3.4 CCD Phases . . . 12

2.3.5 CCD Approaches . . . 14

2.3.6 Challenges . . . 16

2.4 Related Work . . . 16

2.4.1 CCD Tools . . . 16

2.4.2 Software Product Lines . . . 19

2.5 Summary . . . 20


3 Method 21

3.1 Controlled Experiment . . . 21

3.1.1 Test Files . . . 21

3.1.2 CCFinderX core . . . 22

3.1.3 Clang . . . 24

3.2 Analysis of Tool Design . . . 26

3.2.1 Defining CFs . . . 26

3.2.2 Proposed Solution . . . 26

3.2.3 Architecture . . . 27

3.2.4 Key Data Structures . . . 29

3.3 Matching Rules . . . 31

3.3.1 Word Separator Convention . . . 32

3.3.2 Prefix . . . 32

3.3.3 Container . . . 32

3.3.4 First Word . . . 32

3.3.5 Last Word . . . 33

3.3.6 Doublets . . . 33

3.4 Implementation . . . 33

3.4.1 Preprocessing . . . 34

3.4.2 Transformation . . . 34

3.4.3 Match Detection . . . 38

3.4.4 Formatting . . . 40

3.5 Evaluation . . . 41

3.6 Summary . . . 42

4 Results 43

4.1 Existing CCD Tools . . . 43

4.1.1 CCFinderX Core . . . 43

4.1.2 Clang . . . 46

4.2 Performance of Matching Rules . . . 47

4.2.1 Early Stages . . . 48

4.2.2 Late Stages . . . 51

4.2.3 Robustness of Matching Rules Against API Evolution . . . 54

4.3 Summary . . . 55

5 Discussion 56

5.1 Performance of Existing Tool . . . 56

5.2 Performance of Matching Rules . . . 57

5.3 Robustness of Matching Rules Against API Evolution . . . 58


6 Conclusions 60

7 Future Work 61

Bibliography 62

A Appendices 65


Abbreviations

API Application Programming Interface
AST Abstract Syntax Tree
CCD Code Clone Detection
CF Code Fragment
KSDK King Software Development Kit
PDG Program Dependency Graph
POC Proof-of-Concept
SDK Software Development Kit
SPL Software Product Line
USDK Unified Software Development Kit


Introduction

This thesis covers parsing of application programming interfaces (APIs) for a software development kit (SDK), written in C and C++ respectively. The APIs are supposed to provide the same interface, and the problem lies in how to compare them and decide whether they are equivalent or not. The purpose of this thesis is to investigate how to solve this problem, and the goal of this degree project is to implement a tool which can detect which public functions are missing in one of the APIs.

1.1 Background

King, also known as King Digital Entertainment, is a global video game company which develops games for the web, mobile phones, Facebook and Windows 10. King was founded in 2003 but had their breakthrough in 2012, after having released their cross-platform game Candy Crush Saga, using the freemium model. The name freemium is a mixture of the words free and premium, and means that the game is free but players can, if they want, purchase extra features. [1]

King has developed their own SDK containing APIs which provide game plat- form functionality. The SDK exists so that game teams can focus on the actual development of the games instead of having to look into multiple SDKs for adding support for different platforms, such as Android, iOS and Facebook.

According to King, the intention of the SDK is to make game development super easy for their game teams; their mission is to make other developers happy [2].


The SDK contains many different modules, such as the store module for in-app purchases, the social module for player authentication and integration with Facebook, and the notifications module for push and local notifications. It is used by both internal game teams, and by external partner studios. Moreover, the SDK is massive; it is created by compiling several different repositories, where each repository contains a lot of legacy code. At the same time there are about five to ten teams contributing to the SDK at any time. Furthermore, some game studios using the SDK are doing so in development environments that are black boxes, and the game engines that partner studios are using are new to King.

Because of the size of the SDK and its complexity, working with the SDK has become very inefficient. King's latest project has therefore been to create a new SDK, called the Unified SDK (USDK), with the purpose of making it easier to work with. At first, the USDK only had a C API available. However, after the internal game teams requested a C++ API, extending the current API with a C++ API became the latest project for the SDK team at King.

1.2 Problem Statement

King is currently writing a C++ API so that the USDK users can choose according to their needs, but the APIs will be completely isolated and will not know about each other. Furthermore, if developers want to add functionality to the USDK, there is nothing telling them to contribute to both of the APIs. This means that nothing ensures that both APIs provide the same interface. From this problem, the following research questions were formed:

RQ1: To what extent can existing code clone detection (CCD) tools find similarities between two APIs of different programming languages?

RQ2: How can CCD techniques be adapted to compare two APIs written in different programming languages to ensure they provide the same interface?

RQ3: How robust are given interface mapping rules over time, against new API functions being added?


1.3 Purpose

The purpose of this thesis is to investigate how a C API and a C++ API, written for the same software, can be parsed and compared to ensure that they provide the same interface. Firstly, it is of interest to test how well existing CCD tools perform on the API problem, and secondly, it is of interest to figure out how to combine and apply different approaches within CCD. Additionally, the purpose is to construct and test different interface mapping rules, and to see how well they work over time as the API interface is extended.

1.4 Goal

The goal of this degree project is to implement a security mechanism which can detect and alert when the C API lacks public functions that exist in the C++ API. The mechanism should state which function is missing and in which file.

The expected outcome of this degree project is that the users of the USDK are provided with assured equivalence between the two APIs regarding the interface.

1.5 Ethics and Sustainability

Since new programming languages have been, and will be, developed as time passes, there will be a higher demand for offering products in multiple programming languages. In the long term, migrating to new programming languages will therefore occur more often. Researching tools that help with this task is therefore important, to ensure that the customers of the products are provided with equivalent products, even as the number of programming languages increases.

1.6 Methodology

The work begins with understanding the problem and analysing the API structure. This is followed by conducting a literature study, researching different techniques and tools to detect similar code fragments. Selected tools are tested and evaluated, and an adaptation of the selected tools and techniques is implemented and evaluated. The results from the evaluation show how many functions have been matched in relation to how many matches there actually are, and using which rules.


1.7 Delimitations

Because the API translation has been done by humans and only involves C and C++, the results of this work will not be a general solution. There are many different naming conventions which could be followed when writing code, and even if one is decided upon, it might not always be followed. Comparing APIs of any given languages would also be a much more complex task, and out of the scope of this thesis work.

Since King wants to offer equivalent API versions for their customers, only the public functions are going to be taken into account, i.e., factors like function names, their parameters and return values are to be considered, and not, for example, structs and assignments.

Behavioral differences are not going to be considered, since the research is about finding a way to check whether a certain functionality is implemented or not. How that functionality is supposed to work is outside the scope of this work.

1.8 Outline

Chapter 1 gives an introduction to King as a company: what they do and how this degree project relates to earlier work. The introduction also includes a formulation of the problem, the purpose of the thesis and the goal of the degree project, of which ethics and sustainability are an important part. After that, the methodology is presented, and finally the delimitations of the thesis work are described. Chapter 2 covers the background to this thesis work, such as application programming interfaces (APIs), the structure of the SDK and its APIs, code clone detection and related work. Chapter 3 explains the method of the thesis work. Chapter 4 presents the results of the implementation, using different rules. In Chapter 5 the different results are discussed. Chapter 6 presents the conclusions of the thesis work. Chapter 7 presents future work.


Background

An SDK is a collection of software, written by a third party, which can be incorporated into applications with the purpose of supporting new capabilities [3]. Examples of such capabilities are tracking, social media integration, notifications and in-app purchases. An SDK always includes at least one API.

This chapter gives an introduction to APIs, a background to King’s SDKs and the structure of the new SDK, how similar code fragments can be detected, and what code clone detection is, and finally presents what challenges there are with this kind of detection.

2.1 Application Programming Interface

In this section we present the definition of APIs, followed by the different API types and policies, and the advantages of using, or providing, APIs.

2.1.1 Definition

An API allows someone outside of the company or project to use features, without having access to the actual software implementation. It can be seen as a balance between not revealing the code, and still letting someone use its functionalities.

2.1.2 Types and Policies

There are several different API types: regular interfaces, device APIs and web APIs. Device APIs allow communication between local applications through files, and do not require any network access, since the applications are communicating within the device. These APIs can use any communication style.

Web APIs, however, require network access and usually use SOAP, REST, UDDI and XML-RPC for communication [4]. The regular interfaces are the most common ones, and they are what this project is about.

When providing an API, one can do so following one of three different API policies [5]. Each policy comes with both advantages and disadvantages, and the three are listed below:

• private API: for internal use

• partner API: shared with specific business partners

• public API: available for everyone

Private APIs are for internal use only, and exist to increase productivity and efficiency within the company. They can also simplify collaboration across different departments internally. Private APIs are easier to modify, since they are only used internally, and not by any third parties who could be affected by API changes. Partner APIs, however, are shared with specific business partners according to an agreement. By letting customers use company APIs, the reach of the brand can be expanded to partner customers, which can boost the company brand on the market. Finally, public APIs are the ones available for anyone to use. These types of APIs enable the company brand to enter completely new markets, without having to actively collaborate.

2.1.3 Advantages

An API works like a contract. By exposing what functionalities there are and how they are used, the APIs cannot be changed without affecting the users' implementations. What can be changed, however, is how the functionalities are implemented, without affecting either the behaviour or the way they are used. Besides functionality assurance, APIs also promote and simplify business collaboration and brand reach.

2.2 King

This section covers a brief history of King's SDKs and gives an explanation of the structure of the new SDK, supported by a visualization of the APIs and their dependencies. The section also covers who uses the APIs and what challenges there are.

2.2.1 Background

The SDK used by all of King's games is called the King SDK (KSDK). The KSDK is used by the internal game teams and external partner studios. It exists in many different versions, and is created by compiling multiple repositories on King's GitHub. Since the KSDK has been contributed to for many years, it contains a lot of legacy code. Furthermore, the KSDK is not modular, meaning that users cannot select certain parts of it without receiving the whole KSDK. This also means that it cannot be shipped in parts; the KSDK has to be shipped all together. Another disadvantage of working with the KSDK is that it is not thread-safe: it easily crashes if calls do not come from the same thread.

Because of the inconvenience of working with the KSDK, King decided it was time to create a new SDK, called the Unified SDK (USDK). This one was going to have all code in one place, hence the term "Unified", and still be modular.

2.2.2 King’s New SDK

Initially, the USDK only had a C API available for its users. This was appreciated by the external partner studios, but the internal game teams requested a C++ API for several reasons. Few game developers have experience in programming in C, the C API is very different from the rest of the code, and it is harder to debug since there are several layers of conversions. Using the C API creates much boilerplate code and many wrappers, which the internal game developers have had to write themselves. Moreover, there is no automatic memory management in C.

When the demand for a C++ API became higher, the SDK team started to rethink the structure of the USDK. They wanted to introduce a C++ API without affecting any other APIs or existing implementations. The idea was to have two APIs, a C API and a C++ API, but with a common implementation. Even though they would share implementation, the APIs would have to be completely isolated and independent. To fulfill these requirements, an additional layer of complexity needed to be added. Instead of the C implementation having the functionality inside itself, it would have to call the C++ API. This meant that the C API would have to call the C implementation, which would use the C++ implementation, while the C++ API would call the C++ implementation directly.


Figure 2.1: USDK structure

In Figure 2.1, the dependencies of the two APIs and their implementations are shown. The thin arrow represents the C implementation calling the C++ API, while the unfilled arrow means that the API represents the functionality of the implementation.

2.3 Code Clone Detection

Lines of code that are similar in certain ways are called code clones or clone pairs. Finding code clones can be done using several different approaches, which can all be found within code clone detection (CCD).

This section will first introduce basic definitions within CCD, such as code fragments and clone pairs. Secondly, this section covers different clone types by providing code examples. Thirdly, we introduce CCD and its connection to improvement of software systems. Fourthly, we present the different phases of CCD, followed by different CCD approaches. Finally, we present the challenges with CCD.


2.3.1 Basic Definitions

Within CCD, many different definitions have been used to describe clones. In this report, however, the following definitions are used, derived from several CCD surveys.

Code Fragment

A code fragment (CF) can consist of a sequence of statements, a function, a method or begin-end blocks [6]. A CF usually consists of a few statements and should be of interest somehow. The key characteristic is that the CF is a part of the source code which is needed to run the program.

Code Clones vs Clone Pairs

A syntactically correct program will compile and run, but a semantically correct program will also behave the way the developer intended it to. This also means that programs written to perform the same task, but in different languages, will be syntactically different but semantically similar. When two CFs are syntactically or semantically similar, they are called each other's code clones [6]. If the two CFs instead are similar in another way, they are called a clone pair. An example of a clone pair is two corresponding function declarations, where the declarations are in different programming languages.
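As an illustration of such a cross-language clone pair, corresponding declarations can be matched on their extracted function names. The sketch below is illustrative only: the declaration strings and the name-extraction regular expression are assumptions, not the tool described later in this thesis.

```python
import re

def function_name(decl: str) -> str:
    """Extract the identifier immediately preceding the parameter list.

    Handles simple declarations like 'int foo(int x)'; qualified C++
    names like 'int king::foo(int x)' also yield 'foo'.
    """
    match = re.search(r"([A-Za-z_]\w*)\s*\(", decl)
    return match.group(1) if match else ""

def clone_pairs(c_decls, cpp_decls):
    """Pair up declarations from two APIs whose function names are equal."""
    cpp_by_name = {function_name(d): d for d in cpp_decls}
    return [(d, cpp_by_name[function_name(d)])
            for d in c_decls if function_name(d) in cpp_by_name]
```

Note that a C name such as `get_score` and a C++ name such as `getScore` would not match literally; handling such naming-convention differences is what the matching rules of Chapter 3 address.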

Clone Class

The CFs within a clone pair have a relation to one another. A set of clone pairs where each pair shares the same relation is called a clone class [6].

2.3.2 Different Clone Types

There are two groups of clone types. The clones in the first group are syntactically similar — they have similar syntax — and can be divided into three different types.

Type I – Exact clones

Two CFs are exactly the same, except for code without any effect on the execution, such as comments, blanks and white spaces [6]. Examples of exact clones are Listings 2.1 and 2.2.

int foo(int x, int y, int z) {
    if (x > y) {
        z += x; // 1st comment
    }
    else {
        z += y; // 2nd comment
    }
    return z;
}

Listing 2.1: Original CF

int foo(int x, int y, int z) {
    if (x > y)
    { // 1st comment
        z += x;
    }
    else
    { // 2nd comment
        z += y;
    }
    return z;
}

Listing 2.2: Type I - exact

Type II – Renamed and/or Parameterized Clones

Two CFs are exactly the same except for the names of identifiers, such as variables, functions, literals and types. If the names are changed consistently, they are called renamed and parameterized clones, and if not, they are just called renamed clones (see Listings 2.3 and 2.4) [7].

int foo(int x, int y, int z) {
    if (x > y) {
        z += x; // 1st comment
    }
    else {
        z += y; // 2nd comment
    }
    return z;
}

Listing 2.3: Original CF

int bar(int a, int b, int c) {
    if (b > a) {
        c += a; // 1st comment
    }
    else {
        c += b; // 2nd comment
    }
    return c;
}

Listing 2.4: Type II - renamed and parameterized

Type III – Near Miss/Gapped Clones

Some define near miss clones as a mixture of type I and II, but they have also been defined as CFs which, in addition to the characteristics from type I and II, also have modifications such as removed or added statements (see Listings 2.5 and 2.6) [7][6]. Thus, there are "code gaps" in the clones that distinguish them from each other. These types of clones are called gapped clones.


int foo(int x, int y, int z) {
    if (x > y) {
        z += x; // 1st comment
    }
    else {
        z += y; // 2nd comment
    }
    return z;
}

Listing 2.5: Original CF

int foo(int x, int y, int z) {
    if (x > y) {
        z += x; // 1st comment
        z -= y; // new statement
    }
    else {
        z += y; // 2nd comment
    }
    return z;
}

Listing 2.6: Type III - near miss/gapped

The clones in the other group are semantically similar — they have the same functionality, but not necessarily the same syntax — and the group consists of a single type.

Type IV – Semantic Clones

Two CFs are similar regarding behaviour, without having similar syntax (see Listings 2.7 and 2.8) [6].

int foo(int x, int y, int z) {
    if (x > y) {
        z += x; // 1st comment
    }
    else {
        z += y; // 2nd comment
    }
    return z;
}

Listing 2.7: Original CF

int foo(int x, int y, int z) {
    // same behaviour as the original CF, but different syntax
    z += (x > y) ? x : y;
    return z;
}

Listing 2.8: Type IV - semantic
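The clone types above connect directly to detection. Type I clones, for instance, can be found by normalizing away comments and white space and comparing what remains. The following is a minimal sketch under that idea, assuming C-style comments; it is not a production detector:

```python
import re

def normalize(cf: str) -> str:
    """Strip comments and all white space from a code fragment."""
    cf = re.sub(r"//[^\n]*", "", cf)               # drop line comments
    cf = re.sub(r"/\*.*?\*/", "", cf, flags=re.S)  # drop block comments
    return "".join(cf.split())                     # drop every whitespace char

def is_type1_clone(cf_a: str, cf_b: str) -> bool:
    """Two fragments are Type I clones if they match after normalization."""
    return normalize(cf_a) == normalize(cf_b)
```

Applied to Listings 2.1 and 2.2, both fragments normalize to the same string, so they are reported as Type I clones; detecting Types II-IV requires the transformations discussed in the following sections.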

2.3.3 CCD for Software Improvement

A common mistake within software development is to duplicate or re-use code.

Although it can seem easy and efficient to reuse code, it can result in many problems. If the original code contains bugs, it can introduce the same or similar defects in all of its re-implementations, also known as bug propagation.

Similarly, if the original code is poorly designed, hard to understand or inefficient, reusing it leads to more code having the same flaws. This becomes a maintainability problem when the code needs to be updated, because then all of its implementations require the update, but there is nothing assuring that they will in fact be updated. For the reasons mentioned above, CCD can be a very effective approach for improving software systems.

There are many reasons for releasing software in different languages. Today there are many more programming languages and platforms available than 20–30 years ago, which means that there is a larger variety amongst developer customers regarding the languages in which they develop. Some of the reasons to migrate software to different languages are to provide language-specific features or the need to support specific platforms. In this report, however, the focus lies in offering APIs in multiple languages to attract a wider range of developer customers. Since translating an API means reusing logic but adapting the syntax to the new programming language, it can be interpreted as reusing code. CCD across languages could thus be used to match similar function declarations, and thereby check whether two APIs in different languages are equivalent regarding their interface.

2.3.4 CCD Phases

Detecting code clones is not a simple task. It requires that several steps are performed. Since it is not known which CFs can be found multiple times, a large number of comparisons has to be made. To reduce the complexity of the comparison, the first step is to preprocess the source code, eliminating irrelevant code [6]. The next step is to transform the code, parsing it and in many cases building some kind of structure. After the transformation, the match detection phase starts, where similar transformed CFs are matched. Then, to represent the clones somehow, formatting is applied on the found clones. Finally, erroneous clones can be filtered out.

The different steps will be briefly presented in this subsection, and then further explained under each CCD approach in the next subsection.

Code Preprocessing

Essentially, this step is about removing irrelevant parts, converting source code into units and determining comparison units [7]. Irrelevant parts of the source code are pieces of code which do not affect the execution behaviour, such as white space, blanks and comments. When the uninteresting pieces of code are removed, the source code can be divided into a set of units. A unit in this case can, for example, be a file, a function, statements or a number of lines in the source code, depending on the method used. The units can then be further divided into smaller units for comparison, but just as the partitioning into units, this depends on what method is used and how complex the source code is.
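As a minimal sketch of this preprocessing step, assuming C-style comments and one unit per remaining source line:

```python
import re

def preprocess(source: str) -> list[str]:
    """Remove comments and blank lines, then split the source into
    line units for comparison (one unit per remaining code line)."""
    source = re.sub(r"//[^\n]*", "", source)               # line comments
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.S)  # block comments
    units = [line.strip() for line in source.splitlines()]
    return [u for u in units if u]                         # drop blank lines
```

A real tool would also have to decide on the unit granularity (file, function or statement), which this sketch fixes to lines for simplicity.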

Transformation

All approaches, except for the textual approach, perform some kind of transformation of the preprocessed source code. The concept is that the source code is represented in a way that simplifies, or even enables, the comparison. Different approaches use different ways to transform the source code; however, the three most common ones are tokenization, abstract syntax tree (AST) extraction and program dependency graph (PDG) extraction [6]. Tokenizing code means to parse it and convert each line of code into a sequence of tokens. Variations in identifiers are eliminated by replacing them with equal identifiers.
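A minimal tokenization sketch for C-like code follows; the token classes and keyword list are illustrative assumptions, not a complete lexer:

```python
import re

# Order matters: keywords must be tried before the generic identifier class.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:int|if|else|return|for|while)\b"),
    ("ID",      r"[A-Za-z_]\w*"),   # any identifier collapses to the token ID
    ("NUM",     r"\d+"),
    ("OP",      r"[{}()\[\];,+\-*/<>=!&|]+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(line: str) -> list[str]:
    """Convert one source line into a token sequence; every identifier
    is replaced by the same token, so naming differences disappear."""
    tokens = []
    for m in TOKEN_RE.finditer(line):
        kind = m.lastgroup
        tokens.append(m.group() if kind in ("KEYWORD", "OP") else kind)
    return tokens
```

With this transformation, `z += x;` and `c += a;` yield identical token sequences, which is exactly what makes Type II clones detectable.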

An AST is a tree representation of the syntactic structure of a source code, where each node in the tree symbolizes a construct in the source code. The AST nodes are modeled on a class hierarchy. From the root node, all nodes in the AST can be reached through traversal. For basic node types, such as declarations and statements, there are multiple larger hierarchies. The declaration class, which as expected represents declarations in the source code, in turn has many sub-classes [8].
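This thesis traverses such trees to locate function declaration nodes. The traversal can be sketched with a toy node type; the `Node` class below is only a stand-in assumption, since the real trees in this work come from Clang:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal stand-in for an AST node: a kind, an optional name,
    and a list of child nodes."""
    kind: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)

def function_declarations(root: Node) -> list[str]:
    """Depth-first traversal from the root, collecting the names of
    all function declaration nodes encountered."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == "FunctionDecl":
            found.append(node.name)
        stack.extend(reversed(node.children))  # keep left-to-right order
    return found
```

The same traversal could collect parameter names by also recording the children of each matched declaration node.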

A PDG is a graph representation of the semantic structure of a source code, where each node symbolizes an operation in the source code, and each edge represents a dependency between the operations [6]. Extracting a PDG means generating this graph representation, which is used to analyse the data and control flow of the source code.

Match Detection

This is the phase in which CFs actually are compared and matched. The input to this phase is the output from the transformation phase. Every CF is compared with every other CF using a matching algorithm that depends on the selected approach [6]. If tokenization has been applied, sub-sequences are compared, and if ASTs or PDGs have been generated, sub-trees or sub-graphs are compared. The matches are added to a set, a list or grouped together in some other way, to prepare the input for the next phase. If preferred, aggregation can also be applied in this phase. Essentially, multiple clone pairs can be aggregated into sets or groups with the aim of reducing the amount of data.
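The all-pairs comparison described above can be sketched as follows, where `same` stands in for whichever matching algorithm the chosen approach uses:

```python
from itertools import combinations

def detect_clone_pairs(fragments, same):
    """Compare every CF with every other CF and collect matching pairs."""
    return [(a, b) for a, b in combinations(fragments, 2) if same(a, b)]

# With normalized token sequences as CFs, `same` can be plain equality.
fragments = [("int", "ID", "=", "ID"), ("return", "ID"), ("int", "ID", "=", "ID")]
pairs = detect_clone_pairs(fragments, lambda a, b: a == b)
```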

Formatting

In this phase the output obtained from the match detection phase is re-connected to the source code. A very clean and visual way to format the output is to highlight the code clones in the source code [6].

Filtering

This phase is about removing false positives and negatives from the code clone collection [6]. This is an optional phase and is usually done by humans with a lot of experience within the area.

2.3.5 CCD Approaches

The techniques with which code clones can be detected are categorized into four different approaches: textual, lexical, syntactic and semantic. Within the two latter categories there are sub-categories [7].

Furthermore, there are several already implemented tools, of which a few are discussed in section 2.4.

Textual Approach

Techniques using a textual approach compare two CFs regarding their textual content, i.e., try to find sequences of equal strings [6]. Therefore very little code processing is needed, such as removing white space and comments, which is one of the reasons why these techniques are quite easy to implement. Not surprisingly, this approach is independent of programming languages, which can be both an advantage and a disadvantage. Theoretically it can even be used to detect clones across languages, but the more the languages differ, the worse the clone detection will work. Since the approach does not transform the code, it is very sensitive to differences. For example, it cannot detect clones with different line breaks, since these change the code layout. However, it generates few false positives (false matches).


Lexical (Token-Based) Approach

Rather than comparing textual content, the lexical approach converts each source code line into a sequence of tokens. All tokens representing an identifier, such as a variable name or a function name, are replaced with the same token, to exclude identifier variance from the comparison [7]. Similar token sub-sequences can then be identified when matching code lines.

This approach requires the order of the lines of code to remain the same, otherwise duplication will not be detected [6]. This means that lines can neither be added nor removed. Although this approach is more complex than the textual approach, it is quite scalable. There is no need for parsing, but it can generate many false positives.

Syntactic Approach

Within this category there are two sub-categories: tree-based and metric-based approaches [6]. Tree-based techniques parse the source code into an AST. Sub-trees in the AST can then be compared to find code clones using different tree-matching algorithms. Metric-based techniques calculate a number of different metrics within CFs, such as declaration statements, function calls, return statements and parameters. These metric vectors are then compared, instead of comparing lines of code or ASTs directly.

An advantage of tree-based techniques is that parsing source code into trees theoretically allows for cross-language clone detection, as language-specific syntax is removed to some extent [6]. This is something which is not possible with a textual approach. However, some tree-based approaches do not scale: tree structures can easily become very large when there are a lot of dependencies. Regarding the metric-based techniques, a disadvantage is that two CFs can have the same metric values without being similar either syntactically or semantically. At the same time, CFs can be similar both syntactically and semantically without being similar regarding metrics. Another disadvantage is that a PDG generator or a parser is required.

Semantic Approach

This approach uses static program analysis to detect CFs performing the same computations. Thus, the CFs do not necessarily have similar syntax. Within this category there are two sub-categories: graph-based and hybrid approaches.


Graph-based techniques use, as the name suggests, a graph for representing the control and data flow in each CF. Because of the control and data flow analysis, the CFs can be compared regarding semantics. Clones are, within this approach, detected as isomorphic sub-graphs. Two graphs, or sub-graphs, are isomorphic if they contain the same number of vertices and are connected in the same way [9]. The disadvantage of graph-based techniques is that they require PDG generators, which are not scalable.

Hybrid approaches use a combination of different techniques to be able to both use the advantages of some approaches and avoid their disadvantages. A tool for detecting type I and II clones using a token-based approach can, for example, be extended with a textual approach to be able to find type III clones as well [6].

2.3.6 Challenges

Several challenges can be derived from the different approaches mentioned within CCD. Firstly, a large aspect is scalability: ASTs and PDGs quickly become complex when the size of the source code increases. It is therefore harder to use a syntactic or a semantic approach on larger code bases, since they require AST parsers or PDG generators [6]. Secondly, another challenge is implementing the code transformation of the original source code, if it first needs to be parsed somehow. Thirdly, different approaches generate different amounts of false results, i.e., both false positives and false negatives (missing matches) [7].

2.4 Related Work

Based on the previously presented CCD approaches, several existing CCD tools using different approaches have been investigated for this section on related work. Each tool is briefly described regarding how it works and what clones it can find. We have also researched software product lines (SPLs).

2.4.1 CCD Tools

NICAD

NICAD stands for Accurate Detection of Near-miss Intentional Clones, and is a CCD tool which had its release in 2008 [10]. NICAD uses two different techniques: a textual and a tree-based syntactic approach. By combining these approaches it is able to detect clones of all syntactic types (I, II, III). It starts by extracting potential clones and prettyprints them. The prettyprinting divides statements into multiple lines, removing the impact of local changes, and ensures that all code is presented with a uniform layout. The prettyprinted potential clones are then normalized to remove redundant editing differences, such as white spaces and comments. Editing differences would otherwise affect the CCD negatively, resulting in false negatives (missed matches). NICAD then uses the longest common subsequence (LCS) algorithm to compare the prettyprinted, normalized potential clones.
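The LCS comparison can be sketched with the standard dynamic-programming formulation. This is our own minimal version, operating on any sequences (e.g. prettyprinted lines); the similarity ratio shown is one possible measure, not necessarily NICAD's exact one.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def similarity(a, b):
    """Share of the longer fragment covered by the common subsequence."""
    return lcs_length(a, b) / max(len(a), len(b))
```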

CCFinder

CCFinder was released in 2002, and is a CCD tool which uses a lexical/token-based approach [6]. It analyses independently of the number of files in the system under evaluation. CCFinder starts by performing a lexical analysis, where each line of source code is converted into tokens [11]. The tokens of all files are then merged into one single sequence of tokens. During the lexical analysis, white spaces and comments are removed and stored for the later reconstruction of the original source code files [11]. After the lexical analysis, CCFinder starts the transformation of the token sequence. First, it transforms the sequence according to a set of rules; two examples of rules are removal of namespace attributes and separation of function definitions [12]. It then replaces identifiers with a special token to enable detecting clones with different identifiers. When the token sequence has been transformed, equivalent sub-sequences can be identified, creating clone pairs. The tool represents a clone pair using four positions: the positions of the beginning and ending of both identified CFs. CCFinder also performs suffix tree matching to find clone classes.

CloneDr

CloneDr is one of the older CCD tools, released in 1998, and uses a tree-based syntactic approach [6]. It is able to find clones of type I and III, as well as refactored code. This means that it cannot find semantic clones. CloneDr begins by parsing the source code into an AST. After the AST is obtained, the tool divides sub-trees into buckets using a hash function. It then uses three different algorithms to detect potential clones within the same bucket. The algorithms look for sub-tree clones, similar declaration and statement sequences or other variable-size sequences of sub-tree clones, and complex near-miss clones using generalization of clone combinations.
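The bucketing step can be sketched as follows; `hash_fn` is a stand-in for CloneDr's actual hash function over sub-trees. Only sub-trees landing in the same bucket need to be compared in detail, which cuts the number of comparisons.

```python
from collections import defaultdict

def bucket_subtrees(subtrees, hash_fn):
    """Place sub-trees into buckets keyed by a hash of their structure."""
    buckets = defaultdict(list)
    for tree in subtrees:
        buckets[hash_fn(tree)].append(tree)
    return buckets

# With a structural hash (here simply the node count of a tuple-encoded
# sub-tree), only structurally similar sub-trees share a bucket.
subtrees = [("decl", "x"), ("decl", "y"), ("return",)]
buckets = bucket_subtrees(subtrees, hash_fn=len)
```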

MCD Finder

MCD Finder, where MCD stands for Metric-based Clone Detection, is a CCD tool released in 2013, written in Java. MCD Finder uses a metric-based syntactic approach, and can detect clones in Java programs. Instead of taking the Java source code as input and performing metric computation on it directly, it uses the byte code file. Java byte code is very similar to other low-level languages, and is generated using the compiler javac, which is found in the Java Development Kit (JDK) [6]. When the byte code file is obtained, MCD Finder performs the metrics calculation and stores the metrics in a database. The metrics are also mapped onto Excel sheets. MCD Finder then compares the metrics to detect potential clones [13]. The computed class and function metrics are listed below.

Class Metrics:

• total number of functions

• total number of conditional statements

• line of code

• total number of variables

• total number of public, private, protected and friend variables

Function Metrics:

• name of functions present in class

• number of variables present at function level

• total numbers of lines in a function

• return type of function

• number of arguments passed to a function

• how many times a function is called [13]
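As a hedged illustration of the metric-based idea (the extraction below works on source text and is far simpler than MCD Finder's actual byte-code analysis; the metric names are our own), a function can be summarized as a small metric vector, and two functions compared on those vectors alone:

```python
def function_metrics(source):
    """Compute a toy metric vector for a single function definition."""
    header = source[source.index("(") + 1:source.index(")")].strip()
    return {
        "lines": source.count("\n") + 1,                  # lines in the function
        "arguments": header.count(",") + 1 if header else 0,
        "returns": source.count("return"),                # return statements
    }

def metrics_match(a, b):
    """Two CFs are clone candidates if their metric vectors are equal."""
    return function_metrics(a) == function_metrics(b)
```

Note that, as stated above, equal metric vectors do not guarantee actual similarity: `add` and `mul` below match on metrics while computing different things.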


2.4.2 Software Product Lines

As software systems become more complex and require higher quality and performance, it is important that the systems live up to the expectations of the users. By designing software systems so that they can share a set of software assets, productivity can be increased, while the cost and time spent on developing and maintaining the systems, as well as the increased complexity, can be reduced [14]. A collection of such systems is called a software product line (SPL). Product-line engineering focuses on strategic reuse instead of opportunistic reuse, meaning that the different software attributes are designed to be reused and implemented based on the needs of the market, instead of being implemented for their specific use only [15]. The latter is common within single systems. If a single-system company wants to migrate to a product-line company, an interesting problem arises: how can a single system, which is not designed to easily be extended with more systems and features, migrate to an SPL? This problem could be solved using code clone detection: detecting what parts of the code the systems have in common, so that they can be refactored into common assets.

In 2005, Ronny Kolb et al. presented an industrial case study where they applied a reverse-engineering-driven approach, transforming a single system into reusable components [15]. The core of the new SPL infrastructure could then consist of these components. They conducted the case study investigating a product called Image Memory Handler (IMH), which is used in some copier machines, printers and multi-functional peripherals [15]. They wanted to improve the existing implementation with respect to maintainability and reusability without (at least significantly) changing how the software was used, worked and performed.

Their approach was divided into two cycles: one where basic refactorings were made, and one where more advanced refactorings were performed. In the first cycle, three types of improvements were made. Firstly, automatic improvements of the whole code were conducted. Secondly, selected parts of the code were improved manually, such as division of large files, changing of data types, moving functions from one module to another, and renaming of files and functions. Finally, components which had been identified as complex or risky were partially transformed to become less complex and safer.

In the second cycle, three more advanced improvements were made. In the first step of this cycle they used CCD to remove internal and external code clones, since code clones can lead to bug propagation and to systems which are hard to understand and maintain [15]. The second step consisted of merging the different implementations, and usage of conditional compilation to perform the realization. Finally, functions were reduced in complexity and scale.

2.5 Summary

An API is a set of functions which allows access to features without revealing the actual implementations. Independently of what policy the API follows, the implementation can be changed without affecting the users, but how the public functions are used needs to remain the same [5].

King’s new SDK is called the Unified SDK, and has a C API and a C++ API under construction. The C++ API was requested because of the inconveniences with C APIs in general. The two APIs need to have equivalent interfaces, since they share and represent the same implementation, but the problem is that the APIs are completely isolated.

CCD is a technique for detecting similar code fragments, and consists of different stages. There are also different approaches within CCD, which many existing tools have adopted. Re-using code, which results in similar code fragments, can lead to bug propagation and non-maintainable systems. CCD has therefore frequently been used for improving software systems [6].

Software product lines (SPLs) are used to increase productivity and lower costs, by designing systems that share assets. A system can be changed to use the SPL concept by analyzing and refactoring it. Among other techniques, code clone detection is used to reverse-engineer the system to prevent bug propagation and poor maintainability.


Method

As previously stated, there are two sets of APIs in two different programming languages that are provided as a collection of files. The APIs have been subject to a name transformation scheme that was applied manually.

The method of this thesis work is described in four sections, covering an evaluation of existing tools, the design and implementation of a new tool, and an evaluation of the new tool. The new tool was evaluated in terms of its design and its detailed rule set for code transformation.

3.1 Controlled Experiment

All the CCD tools found during the research were tools for detecting code clones or clone pairs within the same language. Many of them are also very old, hard to find and do not build anymore. Even so, it was of interest to see how at least one such tool would perform on the API equivalence problem.

As it performed poorly, the project moved on to another approach, combining multiple techniques into a hybrid approach.

Two different tools were tested during this step: CCFinderX core and Clang.

3.1.1 Test Files

To test the different tools, test files were needed. The SDK team had already implemented proof of concept (POC) files for the USDK, so the project decided to use them for testing, since they are straightforward and easy to understand.


The POC files consist of two modules: an actor module and a director module. Each module has a C and a C++ header file version. The actor module has one function, called say_something in C, and saySomething in C++. The only task the function has is to say whatever message it receives. The respective functions are shown in Listings 3.1 and 3.2.

struct actor_api {
    void (*say_something)(
        struct actor_state state,
        const char* message);
};

Listing 3.1: C version of actor API

public:
    virtual ~IActor() = default;
    virtual void saySomething(const std::string& something) = 0;

Listing 3.2: C++ version of actor API

The director module also has only one function, ask_actor_module_to_speak (C) and askActorModuleToSpeak (C++) respectively, which orders the actor to say whatever message it has received. The respective functions are shown in Listings 3.3 and 3.4. To get a better understanding of the POC files, they are provided in Appendix A.

struct director_api {
    void (*ask_actor_module_to_speak)(
        struct director_state state);
};

Listing 3.3: C version of director API

public:
    virtual ~IDirector() = default;
    virtual void askActorModuleToSpeak() = 0;

Listing 3.4: C++ version of director API
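The naming convention visible in the listings above (C snake_case vs. C++ camelCase) suggests a simple transformation for matching the declarations. A minimal sketch of such a rule (our own illustration, not the final rule set):

```python
def snake_to_camel(name):
    """Transform a C-style snake_case name into a C++-style camelCase name."""
    head, *rest = name.split("_")
    return head + "".join(word.capitalize() for word in rest)

def names_match(c_name, cpp_name):
    """A strict rule: the transformed C name must equal the C++ name."""
    return snake_to_camel(c_name) == cpp_name
```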

When testing the CCD tool, it was of interest to see if it would be able to find similarities between the C and the C++ versions of the same module. More precisely, we tested whether the tool would somehow mark the two function declarations as a clone pair or not.

3.1.2 CCFinderX core

CCFinderX core is a re-designed and re-implemented version of CCFinder, with the purpose of improving performance and enabling an interactive analysis. It also has the option to build with autoconf on Linux machines [16].

CCFinderX core was chosen because the source code could be found on GitHub, it builds, it has a lot of documentation and user guides, and it has performed well when used to find clones in previous studies [17].

To be able to run CCFinderX core on a Linux machine, a Docker container was created and run. Within the container, the following packages were installed and configured (if needed).

• build-essential, manpages-dev, git, software-properties-common;

• gcc-7, g++-7;

• python2.7, python-dev;

• libicu-dev;

• libboost-dev, libboost-thread-dev, libboost-system-dev;

• libtool;

• m4;

• autoconf.

To download, compile and run CCFinderX core on the prepared test files, the commands listed in Listing 3.5 were run. In Table 3.1 the options used are described.

git clone https://github.com/PerEr/ccfinderx-core
cd ccfinderx-core
./autoconf_init.sh
./configure
make && make install
ccfx d cpp -b 2 -t 2 -w w- -d ../test/ -o test
ccfx m test.ccfxd -c -o clonemetrics.tsv -f -o filemetrics.tsv

Listing 3.5: How to run CCFinderX core


Table 3.1: Different ccfx options and their descriptions.

d: Clone detection mode.
m (input: clone-data file; output: clone metrics file or file metrics file): Metrics calculation mode. Calculates and prints out metrics about each code clone or metrics about each source file.
-c: Calculate clone metrics.
-f: Calculate file metrics.
-b: The minimum length of the detected code clones.
-t: The minimum number of kinds of tokens in code fragments.
-w-: Specifies range: do not detect code clones within a file.
-d (input: target directory; output: clone-data file): Specify target directory, obtain clone-data file.

The output from running the commands in Listing 3.5 was a clone metrics file and a file metrics file. The file metrics file contains, amongst other data, file IDs, clone lengths, ratios of the tokens that are covered by any code clone, whether there are many repeated sections, and whether there is a large amount of code clones between one file and another. The clone metrics file contains, amongst other data, clone IDs, clone lengths, and whether there are many repeated sections.

3.1.3 Clang

Clang is a compiler for C, C++ and Objective-C, and is the default compiler on Mac OS X. The compiler is part of the LLVM project, which according to its website is "a collection of modular and reusable compiler and toolchain technologies" [18]. Other sub-projects within LLVM are LLVM Core, which provides optimization and code generation, and LLDB, a native debugger built on LLVM and Clang libraries [18].

The full Clang compiler process performs preprocessing, parsing, optimization, code generation, assembly and linking [19]. However, depending on what the user wants, Clang can be stopped after any stage of the compilation process.


Stages

In the preprocessing stage the input source file is tokenized. If there are macros, #includes or other code fragments which need to be handled for the compilation to succeed, they are expanded in this stage. In the next stage, the tokens from the previous stage are parsed into a tree. A semantic analysis is then performed on the parse tree to give expression types and to decide if the code is well-formed. If, or when, there are warnings and errors, they will most likely appear in this stage. When the parsing and semantic analysis are complete, an AST is output. The next stage handles code generation and optimization: the AST is translated into low-level intermediate code, and finally into optimized and target-specific machine code [19]. In the following stage, the generated code from the previous stage is translated by the target assembler into a target object file. Finally, in the last stage, the target linker merges several object files into a dynamic library or an executable.

Options

Clang provides multiple options for running the compiler, such as setting the language standard, stopping after the parsing and semantic analysis stage, and printing the AST. In Table 3.2 we describe a few options which are relevant for this degree project.

Table 3.2: Different clang flags and their descriptions.

-x <language>: Treat subsequent input files as having the given language.
-std=<standard>: Specify the language standard to compile for.
-I <directory>: Add the specified directory to the search path for include files.
-fsyntax-only: Run the preprocessor, parser and type checking stages.
-Xclang <arg>: Pass <arg> to the clang compiler.
-ast-dump: Print the AST (possible argument to -Xclang).

Since Clang is a very modern and frequently used compiler, which parses source code into ASTs with the possibility of stopping after the AST generation, we chose Clang [19]. To get a better understanding of what the ASTs would look like, the POC files were compiled with Clang in the terminal window, and the output was dumped.


POC C files:

clang -x c++ -std=c++17 -Xclang -ast-dump -fsyntax-only actor.h

POC C++ files:

clang -x c++ -std=c++17 -Xclang -ast-dump -fsyntax-only IActor.h

3.2 Analysis of Tool Design

After having tested the tools, and having learnt that CCFinderX core did not perform well at all on the API problem, the project moved forward with Clang and the combined tree-based and textual approach. In this section we first define which CFs will be used as the comparison units. Secondly, we present an overview of the proposed solution. This is followed by an overview of the architecture of the implementation, and a description of the key objects in the proposed implementation.

3.2.1 Defining CFs

Since the CFs are what is supposed to be matched, the CFs in this project are the function declarations. They are very short, often one line of code, and because they are in different languages, they have different syntax. This means that two corresponding CFs will never be exactly the same, and that their identifiers will therefore never be the only factors that differ between them. Since the CFs are declarations, neither will missing or added statements be the only differing factor. Therefore, when defining code clone types, all types in the group of syntactically similar clones (types I, II and III) could be ruled out.

The whole idea is that the implementations of the corresponding functions behave the same, i.e., are semantically similar. But since the CFs are the declarations of the functions, they lack the actual behaviour. Therefore the CFs cannot be clones of type IV either. Since none of the clone types fit the CFs, they are instead simply defined as clone pairs.

3.2.2 Proposed Solution

Since the function names follow stated naming conventions, the function names themselves can be extracted and compared to find their function declaration matches. To locate and extract the function declarations and their attributes in each file, a tree-based syntactic approach can be used, parsing the source files into ASTs. This can be done using the compiler Clang. The ASTs can then be traversed to locate, extract and store the function declaration attributes. To compare and match the function and parameter variable names, the tree-based approach can be complemented with a textual approach, comparing and matching the names as strings. Together this becomes a hybrid approach.

3.2.3 Architecture

The idea was to develop a script which would:

1. Locate the APIs in the USDK
2. Extract all header files
3. Compile all header files to obtain ASTs
4. Traverse the ASTs and locate function declarations
5. Store found function attributes with the corresponding file as a file data object
6. Match each C++ file data object with a C file data object
7. Match functions per file data object pair

Since we had prior experience in writing Python scripts, Python was chosen as the language in which the implementation would be written. Moreover, after some research, the Python library libclang was found, which enables file compilation with Clang. The structure of the implementation is shown in Figure 3.1.


Figure 3.1: Implementation structure
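Step 6 of the list above, matching the C++ file data objects with their C counterparts, can be sketched as follows. The assumption that a C++ interface header such as IActor.h corresponds to the C header actor.h (an "I" prefix plus a case change) is ours, based on the POC files.

```python
import os

def pair_headers(c_files, cpp_files):
    """Pair each C++ header with the C header of the same module."""
    c_by_module = {os.path.splitext(os.path.basename(f))[0].lower(): f
                   for f in c_files}
    pairs = {}
    for f in cpp_files:
        stem = os.path.splitext(os.path.basename(f))[0]
        # Strip the assumed interface prefix "I" before matching modules.
        module = stem[1:].lower() if stem.startswith("I") else stem.lower()
        pairs[f] = c_by_module.get(module)  # None if no C counterpart exists
    return pairs
```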


3.2.4 Key Data Structures

Several classes were designed and implemented to simplify the data transfer and storage when running the checker. In this subsection, the objects for resources, functions, file data and different results are described.

Figure 3.2: UML diagram of the classes Function, FileData, Processor and Resource.

Resource Object

The purpose of the Resource class, which is shown in Figure 3.2, is to provide language-specific data such as paths to include directories and different matching patterns. The paths that are needed are those to the includes for a third party, the C API, the C++ API, the C++ utilities and the includes for Clang. In addition to the include paths, the Resource object contains language-specific cursor kinds for AST traversal, and patterns for matching files and locating API declarations. This will be further explained in section 3.4.

Function Object

The purpose of the Function class, which is found in Figure 3.2, is to package all function data together to simplify the match detection stage. The name attribute represents the displayname of the declaration, and the language enum
