Mälardalen University, Västerås, Sweden

Thesis for the Degree of Master of Science in Computer Science with Specialization in Embedded Systems – 30.0 credits

CLONE DETECTION IN MODEL-BASED DESIGN: AN EVALUATION IN THE SAFETY-CRITICAL RAILWAY DOMAIN

Course code: DVA503
Student name: Christoffer Parkkila
Student id: cpa16002
Examiner: Thomas Nolte, Mälardalen University, Västerås, Sweden
Supervisors: Eduard Paul Enoiu, Mälardalen University, Västerås, Sweden
  Muhammad Abbas, Mälardalen University, Västerås, Sweden
Company supervisors: Melika Hozhabri, Addiva Software Technology AB, Västerås, Sweden
  Daran Smally


Abstract

Introduction: Software reuse by copying and modifying components to fit new systems is common in industrial settings. However, it can lead to multiple variants that complicate testing and maintenance. Therefore, it is beneficial to detect the variants in existing codebases to document them or incorporate them into a systematic reuse process. For this purpose, model-based clone detection and variability management can be used. Unfortunately, current tools have too high computational complexity to process multiple Simulink models while finding the commonalities and differences between them. Therefore, we explore a novel approach called MatAdd that aims to enable large-scale industrial codebases to be processed.

Objective: The primary objective is to process large-scale industrial Simulink codebases to detect the commonalities and differences between the models.

Context and method: The work was conducted in collaboration with Addiva and Alstom to detect variants in Alstom's codebase of Simulink models. Alstom has specific modeling guidelines and conventions that the developers follow. Therefore, we used an exploratory case study, which allowed the research direction to change depending on Alstom's considerations.

Results and Conclusions: The results show that MatAdd can process large-scale industrial Simulink codebases and detect the commonalities and differences between their models. MatAdd processed Alstom's codebase, which contained 157 Simulink models with 7820 blocks and 9627 lines, in approximately 90 seconds and returned some type-1, type-2, and type-3 clones. However, current limitations cause some signals to be missed, and a more thorough evaluation is needed to assess its future potential. In its current state, MatAdd assists developers in finding clones to manually encapsulate into reusable library components or in finding variants to document to facilitate maintenance.


Contents

1. Introduction
   1.1. Problem formulation
   1.2. Outline
2. Background
   2.1. Clone detection
      2.1.1. Clone types
      2.1.2. Granularity
      2.1.3. Preprocessing
      2.1.4. Transformation and match detection
      2.1.5. Post-processing
   2.2. Simulink
3. Related Work
   3.1. Code clone detection
   3.2. Model-based clone detection
   3.3. Variability management
   3.4. Standardization
   3.5. Limitations
   3.6. Research Contribution
4. Research Methodology
   4.1. Motivation
   4.2. Context
   4.3. Objective and data collection
   4.4. Data analysis
5. Implementation
   5.1. Development platform
   5.2. High-level design
   5.3. Low-level design
      5.3.1. Virtual types
      5.3.2. Representation
      5.3.3. Transformation
      5.3.4. Insertion
      5.3.5. Sorting
      5.3.6. Encoding and adding
      5.3.7. Frequency list
   5.4. Limitations
6. Experimental Evaluation
   6.1. Demonstration Set
7. Results
   7.1. Codebase
   7.2. Computational times
   7.3. Frequency list
   7.4. Expert opinions
8. Threats to validity
9. Discussion
10. Conclusions
11. Future Work
References

1. Introduction

Reuse of existing software is a common practice in software product lines (SPLs) to increase productivity in the development and validation of the products. Industrial SPLs commonly reuse software by clone-and-own [1,2], where existing software components are copied and modified to fit a new context or product [3]. Clone-and-own reuse lets the developers start from a verified codebase that they can change without violating project-specific requirements or introducing errors into other systems. However, it results in multiple functional variants across the codebase that complicate maintenance. Therefore, it is beneficial to find and document them or incorporate them into a systematic reuse process.

Clone detection [4,5] can assist in finding existing variants that are scattered across existing codebases. Clone detection can find structurally identical or similar parts in software artifacts, as well as functionally equivalent but structurally different parts. The detected parts are denoted as code clones and are classified into four types as follows: type-1 clones are structurally identical; type-2 clones are structurally identical but ignore data that can be parameterized; type-3 clones are structurally similar; and type-4 clones are structurally different but functionally equivalent. The first three types are typically an effect of clone-and-own reuse, but they can emerge independently in larger codebases as well.

Clone detection is used both in traditional software development and in model-based design. Several techniques and tools have been proposed to support clone detection in traditional software development [4,6,7,8]. However, in model-based design, the research is limited, which is unfortunate due to its wide use in safety-critical industrial software development. For instance, in the railway domain, Alstom uses Simulink to create software for their propulsion system. The Simulink models are automatically converted to IEC 61131-3 compliant code and uploaded into their electronic control units (ECUs).

Existing tools [9, 10] and research [11, 12, 13, 14] in model-based clone detection in Simulink are primarily based on graph theory, which transforms the models into multi-graphs and compares isomorphic subgraphs. Graph-based clone detectors are NP-complete [15] and can neither compare multiple models nor detect all commonalities since they only consider a connected pattern. SIMONE [10] is the only non-graph-based tool; it compares Simulink subsystems based on their textual representations and clusters them into classes that contain type-3 clones. These tools assist developers in discovering their codebases so they can manually encapsulate common patterns into reusable library components, but they do not assist developers in managing the variability.

Studies in Simulink variability management [16,17,18,19,20] have focused on encapsulating variants into 150% models that can instantiate them by selecting the variability through an interface. Alalfi et al. [16] used SIMONE to obtain type-3 clone clusters that were separately processed with graph-based techniques to find connected patterns. However, the variability consists of everything that is not part of the common pattern, which means that the only possible alternatives when recreating a variant are the models' entire differing parts. Schlie et al. [20] report on a more fine-grained approach that enables specific variation points to be selected from a 150% model to recreate a variant that it encapsulates. However, the computational complexity is too high to terminate on large-scale industrial codebases.

Limitations. Current tools and research in model-based clone detection and variability management are either too complex to process multiple models or do not consider variability or disconnected patterns.

Approach. These limitations motivate work that can (1) process realistic large-scale industrial codebases, (2) detect all commonalities and not solely a connected graph, and (3) designate the variation points. For this purpose, we explored a novel approach to process multiple Simulink models while denoting the commonalities and differences between them. Our approach transforms each model into a special kind of adjacency matrix that contains its connections. The matrix is uniform and allocates enough space to enable it to represent all models. The models are transformed into the matrix representation and encoded with increasing powers of two, so their sum yields the commonalities and differences between them.

Methodology. This thesis was conducted in collaboration with Addiva and Alstom to detect variants in Alstom's codebase of Simulink models. Alstom has specific design guidelines that the developers follow. Therefore, we used an exploratory case study, since such studies are flexible and study phenomena in their occurring context, so the research direction could be changed depending on Alstom's considerations. The explorative search consisted of reviewing the literature, studying Alstom's design documents, and inspecting the underlying representation of Simulink models while reading the documentation for useful insights. The findings were presented in focus group sessions that either changed or narrowed the thesis' direction. The feasibility was assessed by (1) processing Alstom's codebase of Simulink models for their propulsion system and (2) performing an experimental evaluation on a demonstration set, since Alstom's codebase is confidential and cannot be publicly disclosed. The experimental evaluation mainly showcases the approach's potential by using a small demonstration set that can be grasped, while Alstom's codebase shows its capability to process realistic large-scale industrial Simulink models.

Results. The results show that our approach can process large-scale industrial codebases and denote the commonalities and variation points between the Simulink models. Our approach processed Alstom's codebase, which contained 157 Simulink models with 7820 blocks and 9627 lines, in approximately 90 seconds and returned some type-1, type-2, and type-3 clones. In its current state, the approach assists developers in finding clones to encapsulate into reusable library components or in finding variants to document to facilitate maintenance. Also, we outline limitations that must be addressed to make it useful in variability management for creating 150% models with fine-grained variation points.

1.1. Problem formulation

Clone-and-own is commonly used in industrial SPLs since it lets the developers start from a verified codebase that they can change without violating other projects' requirements or introducing errors into other systems [1,2]. However, copying and modifying components results in multiple variants that complicate testing and maintenance.

The research project eXcellence in Variant Testing (XIVT) [21] aims to facilitate testing of variant-intensive embedded systems in the automotive, railway, industrial production, and telecommunication domains. The project currently develops an open-source toolchain that incorporates its own tools and third-party contributions. An important aspect of the project is the similarity analysis that detects reused components across systems to reduce testing efforts. Therefore, research in model-based clone detection and variability management is useful for its similarity analysis and industrial partners. However, the current tools in model-based clone detection cannot find all commonalities, whereas the research in variability management has too high computational complexity to be used in practice. An approach that can process large-scale industrial Simulink codebases and detect the commonalities and variation points between their models would be useful in XIVT's toolchain to detect reused components. Also, it would directly address a knowledge gap in the existing literature, and the findings could facilitate further research in automatically creating 150% models. Therefore, this thesis explores a novel method that addresses these issues. The following research question has guided the work:

RQ: How can large-scale industrial Simulink codebases be processed to detect the commonalities and differences between their models?

1.2. Outline

The remainder of the report is organized as follows: Section 2. provides the necessary background. Section 3. gives a brief overview of traditional clone detection and the state-of-the-art in model-based clone detection and variability management in Simulink. Section 4. outlines and motivates the research methodology that was followed to answer the research question. Section 5. delineates the implementation to enable reproduction of the results. Section 6. conducts an experimental evaluation using a smaller demonstration set to enable the approach's intuition to be grasped, whereas Section 7. elaborates on the results from processing Alstom's larger codebase. Sections 8., 9., 10., and 11., respectively, state the validity threats, discuss the work, conclude the work, and state necessary future research directions.

2. Background

This section defines the terminology and techniques that are used throughout the report. Section 2.1. defines clone detection in traditional software development and in model-based design, whereas Section 2.2. introduces Simulink.

2.1. Clone detection

Clone detection is the process of automatically finding clones, which are fragments in software artifacts "that are similar based on some definition of similarity" [4]. (The term code clone will be used to refer to clones in source code, whereas model clone refers to clones in models.) Code clone detection detects clones in traditional languages such as C/C++ and Java. In contrast, model-based clone detection detects clones in model-based design or dataflow languages such as Simulink and UML. Code clone detectors are either restricted to specific languages or cross-language compatible, whereas model-based clone detectors usually target specific languages, since models are hard to generalize due to their vastly different abstractions and semantics. The literature treats code clone detection and model-based clone detection separately, but the overall steps are similar. This section defines the general steps in clone detection and emphasizes the differences between code- and model-based clone detection. Figure 1 shows the general steps and will be used as a reference.

Figure 1: The general steps in clone detection: pre-processing, transformation, match detection, feature extraction, aggregation, mapping, and post-processing. Steps depicted with dashed boxes depend on the clone detector.

2.1.1. Clone types

There exist different types of clones [22] that clone detection research aims to find. These types are well defined for code clone detection but ill-defined for model-based clone detection, since only a handful of papers have considered it. However, there is little reason to differentiate between them since models are code too and fit reasonably well into the existing taxonomy. The following list intertwines their definitions based on code clone detection surveys [4,5] and model-based clone detection papers [10,11]:

• Type-1: Software fragments that are identical in structure except for whitespace, layout, coloring, labels, etc. Type-1 clones are also called exact clones.

• Type-2: Software fragments that are identical in structure except for identifier names (such as those of functions, classes, variables, and subsystems), types, and literal values, as well as the exceptions for type-1 clones. Type-2 clones are also called parameterized clones.

• Type-3: Software fragments that are similar in structure but can have changed, removed, or added statements, blocks, or connections, as well as the exceptions for type-1 and type-2 clones. Type-3 clones are also called near-miss clones.

• Type-4: Software fragments that are functionally equivalent but structurally different. Type-4 clones are also called semantic clones.

Type-1 clones are identical in functionality and structure. Type-2 clones are identical in functionality and structure except for parts that can be parameterized or abstracted with templates or generics. Type-3 clones are fragments that have likely resulted from clone-and-own reuse [4]. Type-4 clones are the most difficult to find [23, 24] since they differ in structure and often require computationally heavy static analysis. It is important to detect these clones since doing so would reduce memory footprint and bug propagation while mitigating the software evolution and maintenance issues that occur when multiple variants are scattered across systems.
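To make the taxonomy concrete, the following invented MATLAB fragments illustrate the four types against the reference fragment y = (a * 2) + b; they are illustrations only and do not come from any studied codebase:

    y = (a * 2) + b;        % type-1: identical apart from whitespace and comments
    r = (p * 2) + q;        % type-2: identifiers renamed, i.e., parameterizable
    y = (a * 2) + b + c;    % type-3: similar, but a term has been added
    y = a + a + b;          % type-4: different structure, same function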

Also, it is worth mentioning that these definitions are not the de facto standard in model-based clone detection. For instance, the documentation for MATLAB's built-in clone detector app states that it can detect identical and similar clone types. However, its definition of similar corresponds to a type-2 clone according to the reviewed literature.

2.1.2. Granularity

The granularity defines the comparison units (i.e., the software fragments to be compared as clones). The granularity can be fixed or free [22]: a fixed granularity defines comparison units within predefined syntactic boundaries such as functions, classes, or Simulink subsystems, whereas a free granularity does not restrict the comparisons. Fixed granularities compare meaningful units to enable refactoring, whereas free granularities can detect smaller sequences that might constitute a reusable pattern. The majority of papers [9, 11, 12, 13, 14] in model-based clone detection in Simulink have used a free granularity since their techniques are based on graph theory to find maximum common subgraphs. However, Alalfi et al. [10] compared subsystems since they were the most meaningful unit for their industrial partners.

2.1.3. Preprocessing

The first step in the clone detection pipeline is preprocessing, as illustrated in Figure 1. In this step, the comparison units are extracted and everything that is uninteresting is removed, such as whitespace, comments, layout, and coloring. The preprocessing depends on the overall approach.

2.1.4. Transformation and match detection

The second and third steps in the clone detection pipeline are transformation and match detection, respectively, as illustrated in Figure 1. In the transformation step, the comparison units are transformed into an intermediate representation that either facilitates the comparisons or enables useful features to be extracted and compared. In some papers, this step is excluded since the comparison works directly on the source code, whereas other papers use complex transformations to capture both syntax and semantics, for instance, by transforming the software artifacts into abstract syntax trees (ASTs) or program dependency graphs (PDGs). In the match detection step, the transformed comparison units are compared to decide whether they are clones.

The available tools and techniques are categorized depending on the transformations they use [4,5]:

• Text-based (textual): Text-based approaches [25,26,27,28] use the source code as a unit of comparison to detect clones. The match detection consists of comparing the plain text.

• Token-based (lexical): Token-based approaches [29,30] tokenize the source code to enable a higher abstraction that can find type-2 clones. The match detection consists of comparing the tokenized strings.

• Tree-based (syntactic): Tree-based approaches [31, 32, 33] create an AST or parse tree that can detect type-3 clones. The match detection usually consists of dynamic programming techniques to find similar subtrees.

• Graph-based (semantic): Graph-based approaches [34,35] create a PDG that captures control- and data-flow so that type-4 clones can be detected.

• Metric-based: Metric-based approaches [36,37] extract features from the source code or an intermediate transformation. The match detection calculates the similarity between the feature vectors or uses machine learning algorithms (a small sketch follows this list).

• Machine learning: Machine learning approaches [23,38,39,40,41,42,43] have not gained as much attention in the surveys and literature reviews as the other categories, but they use supervised learning to train a classifier to detect clones.
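As an illustration of the metric-based match detection, the following minimal MATLAB sketch compares two invented feature vectors with the cosine similarity; real detectors derive the features from metrics such as block counts or fan-in/fan-out, and the threshold is detector-specific:

    f1 = [4 2 7 1];                             % hypothetical features of fragment 1
    f2 = [4 2 6 1];                             % hypothetical features of fragment 2
    sim = dot(f1, f2) / (norm(f1) * norm(f2));  % cosine similarity of the vectors
    isClone = sim >= 0.95;                      % values close to 1 suggest a clone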

2.1.5. Post-processing

The last three steps in the clone detection pipeline are mapping, post-processing, and aggregation. The mapping denotes the code clones' locations in the original codebase. Post-processing involves a human oracle to manually verify the detected code clones and filter out false positives. The aggregation is optional and combines the code clone pairs into code clone classes. Specifically, some techniques only report which software fragments are clones in a pairwise manner, but others aggregate the pairs into clusters to form clone classes.
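As a minimal sketch of the optional aggregation step, clone pairs can be treated as edges of a graph so that clone classes fall out as connected components; the pair list below is invented:

    pairs = [1 2; 2 5; 3 4];              % each row: indices of a detected clone pair
    G = graph(pairs(:, 1), pairs(:, 2));  % undirected clone-pair graph
    classes = conncomp(G);                % classes(i) = clone class of fragment i
    % Here fragments 1, 2, and 5 form one clone class, and 3 and 4 another.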

2.2. Simulink

Simulink is a model-based language that is commonly used in industry to develop safety-critical software. It enables engineers to model and simulate their systems. Functional requirements that are impractical to test can be simulated to ensure the intended behavior. Simulink includes automatic code generation to compile and upload the models directly onto hardware, such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Simulink is integrated into MATLAB and provides a graphical environment for engineers to design and test models.

As can be seen in Figure 2, a Simulink model is composed of blocks, ports, and lines. A basic block is either a source that generates data, such as the inputs or constants; a function that transforms it, such as the logical operators; or a sink that consumes it, such as the outputs. The data are sent via the lines that connect one block to another. For instance, the bottom logical and block sends data via its only out-port, whereas the logical or block receives it in in-port two. Also, as can be seen, the signal branches, so its data is sent to the subsystem's in-port one as well. It is worth noting that a signal that goes from one out-port can be connected to multiple in-ports, whereas a signal that is connected to an in-port always has a distinct out-port it connects to.

On the one hand, a block’s in-ports can affect the output it generates depending on how the incoming signals are connected. For instance, the relational block performs a less than operation on its two incoming signals. The signal that is connected to in-port one corresponds to the left operand, whereas the signal that is connected to in-in-port two corresponds to the right. On the other hand, some blocks such as product and sum blocks are commutative and associative, so it does not matter how the incoming signals are connected. These blocks can take a variable number of inputs, as shown in the two different and blocks.

The block component is also used to encapsulate other blocks or to reference reusable components. Simulink has five types of these blocks:

• Subsystem: blocks used to encapsulate other blocks together to make a model more readable. Subsystems can contain basic blocks as well as other subsystems to create a hierarchical structure. Subsystems are independent, meaning that changes in a subsystem do not propagate to copies and only affect the model it is used in.

• Subsystem reference: blocks used to reference a subsystem. Subsystem references are stored in their own separate files. Any changes to the file affect all models that reference it.

• Linked subsystem: blocks that link to a library component. On the one hand, a linked subsystem works like a subsystem reference since any changes to the library component it is linked to are propagated to the models that use it. On the other hand, the link can be disabled if the designer has not made it restricted, making the linked subsystem independent.

• Variant subsystem: blocks that can contain multiple variants. The selected variant is active, while the others will not generate any code when compiling the models. A variant subsystem can be referenced or linked.

• Model reference: standalone models with their own use case and interface that other models can use with a model reference block.

Figure 2: A Simulink model that illustrates the basic building blocks, subsystems, and linked subsystems.

Simulink also enables developers to create custom interfaces to blocks by masking them. A mask can hide a block's underlying properties and functions and replace them with a custom-made interface to enforce or ease conventions.
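A brief sketch of adding such a mask programmatically through the documented Simulink.Mask API; the block name reuses the hypothetical demo model from above:

    maskObj = Simulink.Mask.create('demo/Or1');   % mask an existing block
    maskObj.addParameter('Type', 'edit', ...
                         'Prompt', 'Threshold:', 'Name', 'thr');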

3. Related Work

This section outlines the research in clone detection and related research areas. A brief overview of code clone detection establishes the basis, whereas a more detailed overview of model-based clone detection presents the state-of-the-art. The section ends with limitations in current methods and tools.

3.1. Code clone detection

The research in code clone detection is vast, and many tools and techniques exist, which has caused a need to standardize the field. Consequently, several surveys [4,5,7,44,45] and systematic literature reviews [6,8] have been conducted. Roy et al. [4] and Koschke et al. [5] have conducted the most recognized surveys, which introduce newcomers to the field by providing the necessary terminology and definitions. Also, they give an overview of the existing tools and categorize them into a taxonomy. Sheneamer et al. [7] extend their surveys by including new tools and research. Gautam et al. [44] and Min et al. [45] conclude from their surveys that even though code clone detection is an active research area, open questions exist that limit its applicability (e.g., type-4 clone detection and high computational complexity in some techniques). Rattan et al. [6] conducted the first systematic literature review, covering work up to 2013, whereas Ain et al. [8] reused their methodology to include the latest research up to 2019. Both conclude that the definitions are used inconsistently across the field and that code clone detection depends on the intended use case and programming language.

Code clone detection tools and techniques are categorized by their intermediate source code transformation. Text-based approaches [25, 26, 27, 28] use the source code as a unit of comparison to detect clones. The match detection consists of comparing the plain text. Token-based approaches [29, 30] tokenize the source code to enable a higher abstraction that can find type-2 clones. The match detection consists of comparing the tokenized strings. Tree-based approaches [31, 32, 33] create an abstract syntax tree or parse tree that can detect type-3 clones. The match detection usually consists of dynamic programming techniques to find similar subtrees. Graph-based approaches [34, 35] create a program dependency graph that captures control- and data-flow so that type-4 clones can be detected. The match detection compares isomorphic subgraphs by using dynamic programming. Metric-based approaches [36,37] extract features from the source code or an intermediate transformation. The match detection calculates the similarity between the feature vectors.

Code clone detection using supervised learning [23,38,39,40,41,42,43] has increased in the literature since classifiers can be trained to detect all clone types. Using supervised learning is possible since a labeled dataset [46] with mappings between code clones exists for a Java repository. A software fragment is represented as a 1-dimensional feature vector. The features are extracted with various techniques that capture syntax and semantics, whereas a sample is represented as two merged feature vectors. For example, if fragments a and b are functionally equivalent but structurally different, they are merged into one training sample and labeled as a type-4 clone.

3.2. Model-based clone detection

The research in model-based clone detection is limited, which is unfortunate due to its importance in safety-critical industrial software development. Reusing code clone detection techniques is problematic, since comparing models amounts to more than comparing their plain text. A source code's structure bears a meaning, while a graph's layout or underlying textual representation is irrelevant due to graph isomorphism. Also, using supervised learning is not possible since no labeled dataset exists.

Current model-based clone detection research in Simulink is primarily based on graph theory [9, 11, 12, 13, 14]. The most recognized is ConQAT [9], which transforms models into multi-graphs and compares isomorphic subgraphs within a model or across a model pair. Blocks are treated as nodes, whereas signals are treated as edges. Labeling functions are used that take a node or edge as input and return a label that is used in the subgraph comparisons. The labeling functions ignore data that can be parameterized without changing the model's semantics, such as the values of gain and constant blocks, while including additional information that otherwise would change them, such as the relational and trigonometric blocks' operators. Also, port information is only included when it is relevant. For instance, in-ports to a product block do not change the functionality and would decrease the recall if they changed the labels. ConQAT starts by expanding all subsystems, followed by a comparison of all pairwise block combinations to find the pair with the most similar neighborhood. The main algorithm expands their neighborhoods and recursively repeats the neighborhood comparison and expansion process.

ModelCD [13] reuses ConQAT's preprocessing pipeline and labeling functions but changes its clone detection algorithm to support the identification of type-3 clones. ModelCD is based on two separate stages that respectively detect type-1 and type-2 clones or type-3 clones. The first stage, eScan, finds identical isomorphic subgraphs and expands them into larger patterns. The second stage, aScan, includes a similarity measure to detect clone-and-own copies. Deissenboeck et al. [15] showed that ConQAT can process large real-world models with approximately 66000 connected blocks in 31.5 minutes, while ModelCD did not terminate after 24 hours of execution. Strüber et al. [49] also conclude that ConQAT can process large models in contrast to ModelCD, which does not scale due to its eScan implementation.

Hummel et al. [14] reuse ConQAT's preprocessing pipeline and labeling functions with added support for index-based clone retrieval. Their method requires that the initial model is processed once. Successive applications use the stored information to reduce the computational cost, supporting developers by detecting clones in real time.

Al-Batran et al. [11] reuse ConQAT to find type-4 clones. First, they normalize the models based on mathematical, logical, and structural equivalence rules. For example, by using the commutative and associative laws, consecutive product and summation blocks could be merged, and by using De Morgan's laws, logical operations and predicates could be uniformly expressed, all while maintaining the models' functionality. Second, they use ConQAT to find the clones. The results were manually verified and vaguely reported, so it is impossible to assess the feasibility in a real context. However, they claimed it could at least detect some semantic clones.

SIMONE [10] is the only text-based tool; it uses the models' underlying textual representation extracted from Simulink's mdl file format. SIMONE adapts the text-based code clone detector NICAD [26] to find type-3 subsystem clones by (1) removing superfluous textual information such as color and layout data, (2) sorting the text so it is standardized across models to overcome the graph isomorphism problem, and (3) renaming certain elements that must match across subsystems. A similarity threshold decides how much subsystems can differ while still being clones. By using a similarity value of 1, SIMONE can find type-2 clones as well. After the preprocessing, the algorithm extracts all subsystems and clusters them into clone classes.

However, SIMONE does not designate the commonalities across the subsystems. SIMONE only reports that a set of subsystems belongs to a clone class with regard to the user-configurable threshold. Also, the graph-based approaches can find clones across subsystem boundaries since the subsystems are expanded, whereas SIMONE uses the subsystem as the comparison granularity.

MATLAB also has a built-in clone detector app (https://se.mathworks.com/help/slcheck/ref/clonedetector-app.html) in its integrated development environment, but it only detects type-2 subsystem clones within a single model. Specifically, to be found as clones, subsystems must be identical except for block parameter values. Also, the built-in app cannot find commonalities within subsystems.

ConQAT and SIMONE are the only tools that are openly available, whereas ModelCD, the Naive Clone Detector [12], and the research by Al-Batran et al. are not. Consequently, it is hard to know how their methods work in real-world contexts. Also, it is worth noting that ConQAT is no longer supported or updated and has been replaced by TeamScale (https://www.cqse.eu/en/teamscale/overview), a commercial product that incorporates clone detection for multiple languages such as C/C++, Java, and Simulink, as well as IDE integration in IntelliJ, Eclipse, NetBeans, and Visual Studio. TeamScale is not primarily a clone detector but a tool that assists developers in producing high-quality software. Since it is a commercial product with no in-depth documentation regarding its clone detection feature, this review did not consider it.

3.3. Variability management

Variability management focuses on the management and maintenance of software systems. Research in Simulink variability management [16, 17, 18, 19, 20] has focused on creating a 150% model that encapsulates the commonalities and differences in a set of models and can instantiate the specific variants. Alalfi et al. [16] used SIMONE to obtain clusters of similar subsystems. They then used a method similar to ConQAT to detect all the commonalities within these clusters. They intend to encapsulate the variability into Simulink variant subsystem blocks in future work. In another study [17], they elaborated on their approach and discussed another text-based method, based on the UNIX diff command, to explicitly find the commonalities and variability. Schlie et al. [19] utilize the blocks' unique SIDs to create subsignal strings by concatenating the source block's SID, the signal name, and the destination block's SID. These strings are compared to directly detect any variability between the set of models. However, their approach assumes that one model is a direct copy of the other; otherwise, they would not share SIDs. Schlie et al. [20] have also presented a method that is capable of detecting all commonalities and variability. Their approach clusters similar models and then creates a 150% model that can instantiate all models in the cluster. The computational complexity is O(n^4), where n is the total number of blocks across all models, making it intractable in larger codebases.

3.4. Standardization

The lack of research in model-based clone detection has motivated Babur et al. [50] to collect the available information into a portal to facilitate and encourage researchers to participate. The portal aims to provide an overview of the field, including research, tools, and evaluation, to further define the field and improve its research contributions.


3.5. Limitations

The reviewed literature shows that model-based clone detection and variability management in Simulink is still an immature field. Graph-based methods cannot find disconnected patterns, so they cannot find all the commonalities. They are also NP-complete, so they will be inefficient in larger codebases. ConQAT and SIMONE both suffer from shadowing issues where clones can obscure other clones. ConQAT only considers the larger pattern, which means that other patterns with higher class cardinality are not detected. In contrast, SIMONE only considers clones within its user-configurable threshold, which means that more similar nested subsystems are not detected. Also, the research in variability management is still too complex for real-world scenarios. An approach that could find all commonalities and variability in a set of models while being computationally efficient would significantly facilitate system maintenance and management. These limitations motivate research to (1) process large-scale industrial codebases, (2) detect all commonalities, and (3) designate the variation points.

3.6. Research Contribution

The limitations in the related work led to the development of MatAdd, a novel approach that aims to process large-scale industrial codebases and find all their commonalities and differences. Reusing a graph-based approach such as ConQAT or the text-based approach SIMONE is not adequate for Alstom's use cases. Alstom wants to find variants in their codebase, which consists of many models, while denoting where all commonalities and differences exist. The graph-based approaches can only process one or two models and only return a single connected pattern, whereas SIMONE only clusters similar subsystems based on a user-configurable threshold without denoting commonalities or differences. Also, the research in variability management is too computationally complex to process Alstom's codebase. Therefore, it was necessary to explore a novel approach to overcome the limitations in the existing literature to meet Alstom's needs, or at least provide a foundation that can meet them in future work.

MatAdd transforms each model into a uniform matrix and adds them. The separate system matrices are encoded with increasing powers of two, so their sum yields the commonalities and differences in the codebase. Matrix operations are fast and embarrassingly parallel, so processing large-scale industrial codebases is not an issue. The challenge is to represent each system matrix as uniformly as possible, so their additions are meaningful.

4. Research Methodology

This section delineates and motivates the research process that was used to answer the research question. The process is best described as an exploratory case study as outlined by Runeson et al. [51]. Their reporting guidelines were followed to give a concise description of the process, including its context, objective, data collection, and analysis procedures. The high-level exploratory process is depicted in Figure 3, whereas Figure 4 shows how the research question was formulated and addressed during the work.


Figure 3: An illustration of the exploratory process. An initial need from the industry was formulated into a problem formulation and explored. We presented the findings weekly to the industrial partners, and the obtained feedback narrowed the scope of the thesis. This process continued until we reached a satisfactory state and was followed by a qualitative evaluation.

4.1. Motivation

This work was conducted in collaboration with Addiva and Alstom to detect variants in Alstom's codebase of Simulink models. Alstom has specific modeling conventions, guidelines, and development processes that the developers adhere to. Therefore, a case study was suitable, since case studies are flexible and study phenomena in their occurring context [51], so we could change the research direction depending on the collected data. Also, performing a quantitative evaluation with experiments and comparisons is hard since (1) the tools and research use different definitions of similarity and operate at different granularities, and (2) collecting the necessary metrics requires a priori knowledge of the clones. This became evident during the work as the case studies in the literature deviated from Alstom's needs. For instance, the industrial partners in [10] were mainly interested in comparing Simulink subsystems for similarity, whereas Alstom stated early on that this was not a useful scenario for them and that nothing interesting would be found if we aimed to do so. Therefore, an exploratory case study was a natural choice to enable a flexible research process that iteratively defined its direction based on Alstom's considerations while allowing a qualitative evaluation by domain experts.

4.2. Context

This work used Alstom’s codebase of Simulink models for their propulsion system as a unit of analysis. Alstom has its own library of linked subsystems and reference models that its developers can use in combination with basic blocks to create new models. The

(18)

[Figure: an iterative loop of literature study and focus-group data collection that defined the thesis' scope until the RQ was formulated, followed by literature study, design-document study, focus groups, implementation, and data analysis until the RQ was answered.]

Figure 4: An overview of the research process used to answer the research question. First, we defined the scope of the thesis iteratively by seeking ideas in the literature and presenting them to the industrial partners during weekly meetings. Second, after we had clearly defined the scope, a concrete research question was formulated that guided the rest of the work in a similar process. The weekly meetings worked as focus group sessions.

The models are converted with Simulink Embedded Coder and uploaded onto their target embedded computers. Alstom's design guidelines prevent the developers from using ordinary subsystems, to facilitate testing and verification. Consequently, their models are flat (i.e., they do not contain nested subsystems), since linked subsystems and reference models are treated as indivisible blocks.

4.3. Objective and data collection

The initial objective was to find reusable patterns in Alstom's codebase that could be encapsulated into reusable library components. However, their model-based design and propulsion lead stated that they were not sure what they wanted before they had it, while emphasizing the importance of a process that ensured identical behavior on their target systems. This led to the formulation of an initial research question that was later removed but helped in clearly defining the scope of the thesis:

RQ: What can a systematic reuse process that fits the industrial partner look like to facilitate software evolution and maintenance?

We used an iterative process of complementary methods to answer this initial question. First, we conducted a literature study to determine the state-of-the-art in clone detection, model-based clone detection, and variability management in Simulink. Second, as illustrated in Figure 4, we presented the findings during weekly meetings that worked as focus group sessions. We discussed whether any existing tool or research could be used and incorporated into a larger process. After a few iterations of searching, presenting, and defining the scope, we had arrived at a process that would be beneficial if successfully implemented as a tool. However, the process required a novel implementation that addressed limitations in the existing literature. This led to the research question that guided the rest of the work:

RQ: How can large-scale industrial Simulink codebases be processed to detect the commonalities and differences between their models?

We conducted an exploratory search that consisted of reviewing the literature, studying Alstom's design documents, studying the underlying representation of Simulink's .slx and .mdl files for useful insights, and using MATLAB's Simulink API to inspect the models' properties while reading the Simulink documentation for possible solutions. Again, we presented the findings during the weekly meetings and discussed the direction of the next explorative phase, which was either changed or narrowed.

4.4. Data analysis

We used the derived approach to process Alstom's codebase and shared the patterns that were found with domain experts, who performed a qualitative evaluation and shared their thoughts on the approach's feasibility. This type of qualitative evaluation of model-based clone detectors is typical in the literature, since their worth mainly depends on the industrial partners' needs and use cases. Also, other types of evaluations are problematic since no benchmarks or recommendations exist in the field beyond expert opinions.

5. Implementation

This section delineates the implementation to enable the results to be reproduced. First, the development platform is given. Second, a high-level design provides the motivation and intuition for the approach, whereas a low-level design elaborates on important steps. Finally, identified limitations that future work must address are outlined.

5.1. Development platform

The development platform included an i5-8600K CPU at 2.40 GHz, 2 GB RAM, and Windows 10, whereas MatAdd was implemented in MATLAB/Simulink R2021a. MATLAB's Simulink API (https://se.mathworks.com/help/simulink/referencelist.html?type=function) was used to load the models into memory and to get their blocks, handles, and signal connections.

5.2. High-level design

Current tools and research in model-based clone detection and variability management either restrict their comparisons to one or two models, or are intractable in larger codebases. This limitation led us to explore a novel approach called MatAdd that aims to process large-scale industrial codebases and denote the commonalities and variability between their models. MatAdd transforms each model into a special kind of adjacency matrix that contains its connections. The matrix is uniform and allocates enough space to enable it to represent all models. A first glance is given in Figure 5.

Figure 5: An illustration of MatAdd's representation matrix. [Figure: two systems, Sys 1 and Sys 2, shown as destination-to-source 0/1 matrices over the blocks SubC, Inport, Inport, Const, Gain, Out, and Out.]

The representation matrix stores its signals from destination to source. The first two columns in Figure 5 correspond to the in-ports of the Subsystem block named SUB_C. The first column corresponds to its first in-port, and the entries show the source out-port it connects to. Also, as can be seen, the two systems have some variability that the representation matrix must incorporate to enable both of them to be represented. For instance, Sys1 has a Gain block that is missing in Sys2, but its space is allocated to enforce a uniform representation that can encode all systems. A complete overview of MatAdd is illustrated in Figure 6 and contains the following steps:

1. Expansion: Subsystems that are not linked or referenced are expanded.

2. Virtual types: Simulink blocks that share a type but change the system's semantics due to their operators or references are mapped to new virtual types. For instance, both logical or and logical and have a block type of Logic. These blocks are respectively mapped to the new virtual types OR and AND. Also, linked subsystems and model references are mapped to virtual types that correspond to their paths to distinguish them, since their base types are Subsystem and ModelReference.

3. Representation: A uniform matrix is created that can represent all systems. The number of blocks of each type (including virtual types) in each model is calculated. The highest number is used to allocate space for the representation.

4. Transformation: The systems' blocks and handles are mapped to the uniform matrix's x-axis, so they are consistent across different systems.

5. Inserting: The separate system matrices are populated to show the connections between destination blocks' in-ports and source blocks' out-ports.

6. Sorting: The signals are sorted to ensure that the most similar signals are inserted at the same columns in the separate system matrices.

7. Encoding: The system matrices are encoded with increasing powers of two. For instance, if there exist four systems, then they are respectively encoded as 1, 2, 4, and 8.

8. Addition: The system matrices are added together. The encoding ensures that the commonalities and variability can be traced back to the different models.

9. Frequencies: The final result matrix's unique numbers are extracted, and their frequencies are calculated. A sorted list that contains the bit pattern, the number of models in that pattern, and their shared signals is created and used to retrieve models to inspect.

Figure 6: A complete overview of MatAdd. [Figure: the steps virtual types, transform, represent, insert, sort, encode, and add applied to two example systems, Sys 1 and Sys 2; their encoded matrices are added into a result matrix in which the entries 1, 2, and 3 designate signals unique to Sys 1, unique to Sys 2, and shared by both, respectively.]
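A minimal MATLAB sketch of steps 7, 8, and 9 (encoding, addition, and frequencies), assuming the uniform 0/1 system matrices already exist; the toy matrices and names below are invented and are not MatAdd's actual identifiers:

    M = {[1 0; 0 1], [1 0; 1 0], [1 1; 0 1]};  % toy uniform system matrices
    R = zeros(size(M{1}));
    for k = 1:numel(M)
        R = R + M{k} * 2^(k - 1);              % encode system k with bit k, add
    end
    % An entry r = R(i,j) encodes which systems contain signal (i,j):
    % system k contains it iff bitand(r, 2^(k-1)) > 0. Here R(1,1) = 7,
    % so that signal is shared by all three systems, whereas R(2,1) = 2
    % exists only in system 2, i.e., a variation point.
    [vals, ~, ic] = unique(R);                 % the unique bit patterns ...
    freq = accumarray(ic, 1);                  % ... and their frequencies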


5.3. Low-level design

This section elaborates on MatAdd’s steps depicted in Figure 6 and briefly explained in Section 5.2.

5.3.1. Virtual types

Simulink blocks that share a type but change the systems' semantics are mapped to new virtual types. This is necessary since MatAdd uses the blocks' types to store the signals uniformly across all systems in their respective matrices. For instance, ordinary subsystems, referenced subsystems, and linked subsystems share the type Subsystem. These blocks obviously change the systems' semantics depending on the actual subsystem or reference that is used. Table 1 shows the blocks that are mapped to virtual types.

Table 1: Block types to virtual types

Block type        Virtual types
Relational        ==, !=, <, <=, >, >=
Logic             AND, OR, XOR, NOT, NAND, NOR, NXOR
Subsystem*        ../path/name.slx
ModelReference    ../path/name.slx

As can be seen, relational and logic blocks use their operators as their virtual types, whereas linked subsystems, referenced subsystems, and referenced models use their paths. The asterisk on Subsystem indicates that only linked and referenced subsystems use their paths as their virtual types, whereas ordinary subsystems are expanded in a preprocessing step to obtain their basic blocks.
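A minimal sketch of how such a mapping could look with standard Simulink block parameters; MatAdd's actual implementation may differ, and subsystem references (the ReferencedSubsystem parameter) are omitted for brevity:

    function vType = virtual_type(h)
        switch get_param(h, 'BlockType')
            case {'Logic', 'RelationalOperator'}
                vType = get_param(h, 'Operator');      % e.g., 'AND' or '<='
            case 'SubSystem'
                ref = get_param(h, 'ReferenceBlock');  % library link path, if any
                if isempty(ref)
                    vType = 'SubSystem';               % ordinary: expanded earlier
                else
                    vType = ref;                       % linked: use its path
                end
            case 'ModelReference'
                vType = get_param(h, 'ModelName');     % referenced model
            otherwise
                vType = get_param(h, 'BlockType');
        end
    end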

5.3.2. Representation

The representation step determines a uniform matrix that can depict all systems. The representation matrix uses a destination-to-source representation and depicts the signals for each block in the columns, as can be seen in Figures 5 and 6. The system matrices must be uniformly represented to ensure that the addition is meaningful. Therefore, all systems must first be processed to determine how much memory to allocate to fit all systems. The calculation is given in Algorithm 1, which mixes pseudocode with Simulink API calls.

Algorithm 1: Calculate the representation matrix

for each model in models do
    sys  = load_system(model)
    hdls = find_system(sys, 'Block')
    blks = get_param(hdls, 'BlockType')
    [vTypes, idx] = create_virtual_types(hdls)
    blks(idx) = vTypes
    nBlksArr = save_max_blocks(blks, nBlksArr)
end
nInportsArr  = get_num_inports(nBlksArr)
nOutportsArr = get_num_outports(nBlksArr)
xSize = sum(nBlksArr .* nInportsArr)
ySize = sum(nBlksArr .* nOutportsArr)

In Algorithm 1, the models are iterated and loaded into memory with Simulink's API call load_system(...). The loaded system's handles and block types are retrieved with Simulink's API calls find_system(...) and get_param(...). The blocks that should get a new virtual type are determined, and the virtual types replace the original ones. The maximum number of each unique block type is updated in each iteration. After all models have been iterated, the number of in-ports and out-ports for each type is determined. The size of the x-axis is determined by summing the result of an elementwise multiplication of the maximum count of each block type with its number of in-ports, whereas the size of the y-axis uses the out-ports. This means that blocks with no in-ports are not allocated on the x-axis, whereas blocks with no out-ports are not allocated on the y-axis. For instance, Inport and Constant blocks do not have any in-ports, so they do not get an entry allocated on the x-axis.

Also, the representation matrix allocates entries for ports that must be differentiated to depict the signals correctly. For instance, Relational and Subsystem blocks must include space for their ports since they directly affect the semantics of the model. However, some blocks' in-ports do not affect their output, such as those of Logic, Product, and Sum blocks, so they are only allocated one column to (1) detect patterns that connect to them differently, and (2) avoid the problem of variability since they can take a variable number of inputs.
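As a hypothetical worked example of the axis sizing, with invented counts that loosely follow Figure 5:

    nBlksArr     = [1 2 1 1 2];  % max counts: SubC, Inport, Const, Gain, Out
    nInportsArr  = [2 0 0 1 1];  % in-ports per block type
    nOutportsArr = [2 1 1 1 0];  % out-ports per block type
    xSize = sum(nBlksArr .* nInportsArr);   % = 5 columns (destination in-ports)
    ySize = sum(nBlksArr .* nOutportsArr);  % = 6 rows (source out-ports)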

5.3.3. Transformation

The transformation step maps all blocks of each system to the matrix representation's x-axis, so they are consistent across systems. In Figure 6, the representation step determined that the x-axis [Sub_C, Gain, Out, Out] can depict all systems' signal connections from destination to source. The transformation step must now map each system's blocks to this axis to ensure consistency. For instance, in Figure 6, Sys1 is already consistent with the x-axis, whereas Sys2's second out-port is mapped to the last block position to avoid a Gain block and an out-port block being added to each other.

Also, the transformation step maps the destination blocks' handles into separate structures so they can be accessed when highlighting the models. However, future work will deal with improved highlighting and automatic model rebuilding, whereas this thesis mainly demonstrates the intuition and lays the foundations. Therefore, handle mapping and its use are not discussed further.

5.3.4. Insertion

The insertion step populates the systems' matrices to depict the destination blocks' connections. The representation matrix's axes are divided into two levels, as seen in Figure 6:

1. Block-level: the block-level denotes the indices of the blocks without considering their ports. This is a mapping from block to ports.

2. Port-level: the port-level denotes the indices of the ports. This is the memory that is allocated for the matrix representation's axes.

For instance, in Figure 6, the SUB_C block (i.e., the virtual type that has replaced the Subsystem block) on the x-axis has block-level index 1, which maps to its two in-ports at port-level indices 1 and 2, whereas the second Out block has block-level index 4, which maps to its only in-port at port-level index 5. These mappings are used for the y-axis as well, but they are different due to differences in in-ports and out-ports.
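The block-level to port-level mapping can be sketched as a cumulative sum over the per-block port counts; the names below are illustrative, not MatAdd's:

    nInports  = [2 1 1 1];                       % x-axis: Sub_C, Gain, Out, Out
    portStart = cumsum([1, nInports(1:end-1)]);  % = [1 3 4 5]
    % Block-level index 4 (the second Out block) starts at port-level
    % index 5, matching the example above.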

The blocks cannot have 1-to-1 mappings as in an ordinary adjacency matrix since they would not be consistent across systems. Specifically, if a system had multiple SUB_C blocks with 1-to-1 mappings to their sources, then the columns that depict the signals would not be populated uniformly across systems, since the mappings are system-specific. Therefore, a uniform insertion scheme is used that populates the matrices with the following rules:

Rule 1: If a destination block connects to itself, then insert a one at the first block-level index's port-level for the appropriate port.

Rule 2: If a destination block of type X connects to a different source block of type X, and no other block of type X, then insert a one at the second block-level index's port-level for the appropriate port.

Rule 3: If a destination block of type X connects to a source block of type Y, and no other block of type Y, then insert a one at the first block-level index's port-level for the appropriate port.

Rule 4: If a destination block of type X connects to multiple source blocks of type X, then insert ones at the block-level y-indices that the destination block connects most to, in descending order, but reserve the first block-level y-index for loops.

Rule 5: If a destination block of type X connects to multiple source blocks of type Y, then insert ones at the block-level y-indices that the destination block connects most to, in descending order.

The first two rules ensure that loops are detected, whereas the last two ensure that the destination blocks’ connections across the systems are placed at block-level indices that are most likely similar. For instance, if a block X with 10 out-ports connects 8 times to a block Y and 2 times to another block Y in two systems, then they use the same block-level y-index to fill in the entries for the source block with 8 signals to it which likely captures better patterns. However, it is worth mentioning that this can be a worse mapping in some cases, since it is possible that the 8 connections to Y in the two different systems are different ports, whereas they might have shared at least 2 connections if they were mapped differently. This insertion scheme can be improved by weighting in the ports they connect to, but it is not considered in the current state of MatAdd.

The insertion scheme iterates the block-level x-indices (i.e., the distinct destination blocks in a system) and uses the stated rules to process the block-level y-indices.

5.3.5. Sorting

The sorting step maps the most similar destination blocks to the same block-level x-indices across all systems. For example, as illustrated in Figure 6, after the insertion of Sys1 and Sys2, their out-port blocks are mapped, so an addition of them would not detect the shared signal to the SUB C block. However, if Sys2’s out-port blocks were swapped, then it would show that they share a connection to the second out-port in SUB C. The sorting is based on matrix multiplications that compare all systems’ ports with each other for a specific block type to determine the best block-level x-index that a block should be placed in. A high-level illustration is shown in Figure 7.

In Figure 7, three systems Sy1, Sys2, and Sys3 have been transformed into their respective matrix representations. These systems have several SubC blocks that must be sorted, so they reside at the most appropriate block-level x-indices. For this purpose, the following steps are taken:

(26)

0 0 0 0 0 0 0 0 0 1 0 SubC Inport Inport Const SubC Sys 1 0 0 0 0 1 0 0 0 0 0 0 1 0 Out Out 3. Transpose Sys1 and multiply

x

=

0 0 1 0 0 0 0 0 0 0 0 SubC 0 1 0 0 0 0 1 0 0 0 0 0 SubC 0 1 0 0 0 0 1 0 0 0 0 0 SubC Inport Inport Const SubC Sys 2 0 0 0 0 1 0 0 0 0 0 0 1 0 Out Out 0 0 0 0 0 0 0 0 0 1 0 SubC 0 0 0 1 0 0 0 0 0 0 0 0 SubC 0 0 0 0 0 0 0 0 0 0 1 0 SubC Inport Inport Const SubC Sys 3 1 0 0 0 1 0 0 0 0 0 0 1 0 Out Out 1 0 0 0 0 1 0 0 0 0 0 SubC 0 0 0 0 0 0 0 0 0 0 1 0 SubC 0 SubC SubC SubC SubC SubC SubC 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0

1. Begin sort block type SubC

0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 1 0

+

=

0 0 2 1 0 1 2 0

6. Use the mappings to sort Sys2's blocks

5. Find first max row and col, then mark as used,

continue until all is used.

2 [1, 2] -1 0 2 -1 -1 -1 -1 0 [2, 3] 2 -1 -1 -1 -1 -1 -1 -1 -1 2 [3, 1]

[1, 2] Place Sys2's second SubC block into its first slot

[2, 3] Place Sys2's third SubC block into its second slot

[3, 1] Place Sys2's first SubC block into its last slot

Sys2's port 1 mapping matrix

Port 1

mapping matrix mapping matrixPort 2

0 0 2 1 0 1 2 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0

Sys2's first port after sorting 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

x

=

1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 Sys3's port 1 mapping matrix 2. Slice first port

4. Obtain a mapping matrix for each port in the same way and add them

7. Vertically concat Sys2's first port and multiply with Sys3's first port. Repeat for all systems.

(27)

1. Select block group: Select block type to sort.

2. Slice ports: If the block type has more than one in-port, then slice them into their own matrices.

3. Multiply: Multiply the sliced matrices to obtain mapping matrices for the ports. 4. Add mapping matrices: Add all mapping matrices.

5. Determine mappings: Determine the highest number in the result mapping matrix. The row and column indicate the most similar columns in the systems being sorted. 6. Sort right side matrix: Use the mappings to sort the right side matrix in-place. 7. Integrate right side matrix: Vertically concatenate the right side matrix into the

transposed left side matrix. Repeat the process for all systems.

The steps ensure that all systems’ ports are compared with each other. The algorithm begins by first multiplying two matrices and then integrate the sorted left side matrix into the right side and repeats. This means that when marking indices as used in step 5, if the best match is located, for instance, at [6, 2], then row 3 and 6, and column 2, must all be marked as used.

5.3.6. Encoding and adding

The encoding step multiplies all systems’ matrices with the increasing power of twos as shown in Algorithm 2, whereas the addition step adds them together as shown in Algorithm 3. The encoding ensures that the commonalities and differences can be traced to the systems when they are added together.

Algorithm 2: Encode the system matrices

i ←0

for each matrix in matrices do

matrix ×2i i ← i+ 1

end

Algorithm 3: Add the system matrices

result ← {}

for each matrix in matrices do

result ← result+ matrix

end

5.3.7. Frequency list

The frequency list step extracts all unique numbers from the result matrix and creates a sorted list that contains the bit pattern, the number of models in that pattern, and their shared signals. For instance, if the pattern 00001101 occurs 100 times in the result matrix, then it means that model 1, 3, and 4, encoded respectively as 1, 4, and 8, share at least 100 signals. Also, the pattern 10011101 contains 1101, so this pattern is included in the frequency calculation as well. The frequency list is used to get an overview of potential clusters that contain clones. The frequency list can be seen as a comparison of all models with each other. For instance, the pattern 00000000 usually has the highest frequency,

(28)

but it just means that no models have the most shared signals together. That is all empty entries in the result matrix.

5.4. Limitations

This section highlights MatAdd’s current limitations that were discovered that future work must address.

• False patterns:

The insertion scheme enables models to be compared but causes block handle ambi-guity which is solved by keeping a handle table for each column. This is the source of false patterns in the final result matrix. For instance, if the first column contains a bit pattern that connects to a Constant block, whereas the second column contains the same pattern that also connects to a Constant block, then we do not know from the result matrix if the Constant block is the same or not in the separate systems. This might be solved by saving the handles and using them to resolve these false patterns.

• Sorting source blocks with multiple ports:

The insertion scheme can cause the sorting to miss patterns when a destination block connects multiple times to source blocks of the same type when they have more than

one out-port. For instance, if both s1 and s2 has a block A with two out-ports that

both connect from out-port 1 to a B in-port 1 and from out-port 2 to another block B in-port 2, then these patterns can be missed depending on which output connection

is processed first in s1 and s2.Specifically, they can be unaligned since they can use

two different block-level y-indices. Since they are vertically unaligned, the sorting does not consider them equal.

(29)

6.

Experimental Evaluation

This section shows the results from an experimental evaluation using MatAdd to process a demonstration set since Alstom’s codebase is confidential and cannot be publicly disclosed. Firstly, the models that are used to demonstrate MatAdd are described. Secondly, the frequency list, result matrix, and the patterns that MatAdd obtain are shown.

6.1. Demonstration Set

The set of models in Figure 8 will be used to demonstrate MatAdd. The first, second,

third, and fourth model will be respectively denoted as s1, s2, s3, and s4. These models

contain patterns that are worth detecting while being sufficiently small to grasp.

System s1 and s3 are similar and has likely occurred from clone-and-own. The

variabil-ity consist of a signal that branches to an additional Output and a extra Constant block

that connects to the bottom left Logical or. System s2 and s4 are identical copies of

each others, while sharing a common pattern with s1 and s3 that starts from the Switch

blocks and backwards. The layout of the common pattern differs between s1 and s3, and

s2 and s4. This can cause a manual inspection to miss it, but layout does not affect clone

types, so it should be ignored and found by a clone detector.

Figure 8: The demonstration set of Simulink models that are used in the experimental evaluation.

The first and third model are similar and differs only in a branched Output and an extra Constant that connects to the bottom left Logical or block. The second and fourth model are identical copies of each other, while sharing a common pattern with the first and third.

(30)

6.2. Demonstration results

The result matrix is shown in Figure 9, whereas the frequency list is shown in Table 2. The first, second, third, and fourth pattern in the frequency list are respectively depicted in Figure 10, 11, 12, and 13. MatAdd automatically colors the blue highlighting in these figures by using the indices from the result matrix to map to system-specific handle maps. However, since the automatic rebuilding of 150% models is not considered in this thesis, the details are left out from the implementation.

Figure 9: The result matrix for the demonstration set described in Section 6.1.

In Figure 9, the blue highlighting shows the signals that exist in all systems, whereas

the red shows the variation points. In this result matrix, s1, s2, s3, and s4, are encoded

respectively as, 1, 2, 4, and 8. As can be seen, the two Switch blocks has connections that

are shared in all systems since the entries in the columns equals 241 = 15. Also, the

three columns for the Output blocks shows that system s2, encoded as 4, has an additional

Outputblock that is not present in the other system, and connects to an Switch block.

(31)

Table 2: The frequency list calculated from the result matrix in Figure 9.

Bit pattern Num of models Frequency

1010 2 31

0101 2 23

1110 3 14

1111 4 13

In Table 2, the frequencies are calculated for all unique patterns that exist in the result matrix in Figure 9. As can be seen, there exist four different patterns that are worth inspecting. The models that are encoded into these patterns are opened, and the shared signals are colored on a block-to-block level by retrieving their system-specific block handles. The results for all entries are illustrated in Figure 10, 11, 12, and 13.

Figure 10: The first pattern in the frequency list. This pattern correspond to s2 and s4. As can

be seen in the figure and result matrix, MatAdd recognized these as identical clones.

As can be seen from these figures, MatAdd’s highlighting illustrates the existing pat-terns across these models. However, it is worth mentioning that we depict the signals by coloring the destination block and the source block, so expecting the patterns to be captured by the highlighting itself is not feasible nor the intent.

(32)

Figure 11: The second pattern in the frequency list. This pattern corresponds to s1 and s3. As

(33)

Figure 12: The third pattern in the frequency list. This pattern corresponds to s2, s3, and s4. In

this case, MatAdd detects the common pattern that is present in all systems, but for this particular bit pattern, the frequency is higher by one signal in comparison to all systems. The signal that

increases the frequency is the Constant to the Logical or block in s2. This signal is also present

(34)

Figure 13: The fourth pattern in the frequency list. This pattern corresponds to the entire demon-stration set. As can be seen in the figure and result matrix, MatAdd detected the the common pattern with the variability.

References

Related documents

Structure &amp; Navigation Design patterns in turn point to GUI Design patterns, but the Structure &amp; Navigation Design pattern in itself is not based on domain specific

This paper proposes a sequential, trust-region-like design optimization methodology, where the simulation model is validated through calibration. Starting with an

46 Konkreta exempel skulle kunna vara främjandeinsatser för affärsänglar/affärsängelnätverk, skapa arenor där aktörer från utbuds- och efterfrågesidan kan mötas eller

Exakt hur dessa verksamheter har uppstått studeras inte i detalj, men nyetableringar kan exempelvis vara ett resultat av avknoppningar från större företag inklusive

Both Brazil and Sweden have made bilateral cooperation in areas of technology and innovation a top priority. It has been formalized in a series of agreements and made explicit

The increasing availability of data and attention to services has increased the understanding of the contribution of services to innovation and productivity in

Av tabellen framgår att det behövs utförlig information om de projekt som genomförs vid instituten. Då Tillväxtanalys ska föreslå en metod som kan visa hur institutens verksamhet

Närmare 90 procent av de statliga medlen (intäkter och utgifter) för näringslivets klimatomställning går till generella styrmedel, det vill säga styrmedel som påverkar