
Author: Roger Chao

Supervisor: Mauro Caporuscio, Mirko D’Angelo

Semester: HT 2020

Subject: Computer Science

Bachelor Degree Project

Data analysis for Systematic Literature Reviews


Abstract

Systematic Literature Reviews (SLR) are a powerful research tool to identify and select literature to answer a certain question. However, an approach to extract the analytical insight inherent in the multi-dimensional data sets of Systematic Literature Reviews was lacking, and previous Systematic Literature Review tools do not incorporate the capability of providing such insight. Therefore, this thesis aims to provide a useful approach, comprising various algorithms and data treatment techniques, that gives the user analytical insight into their data that is not evident in the bare execution of a Systematic Literature Review. For this goal, a literature review has been conducted to find the most relevant techniques to extract data from multi-dimensional data sets, and the resulting approach has been tested on a survey regarding Self-Adaptive Systems (SAS) using a web application. As a result, we identify the most adequate techniques to incorporate into the approach this thesis provides.

Keywords: data analysis, systematic literature reviews, clustering, dimensionality reduction, self-adaptive systems, multi-dimensional data


Preface

I would like to thank both my supervisors, Mauro and Mirko, for the useful advice provided in every meeting and for the guidance in the last steps of this thesis. To all the friends I have met during my stay in Sweden who have in some way influenced me, with special regard to my flatmates Andrea, Erica, and Hannah. To my family and friends back in Barcelona who have had to bear with me from a distance. To Linnaeus University for having hosted me in these troubled times and for making my stay in Sweden a memorable one.


Contents

1 Introduction
   1.1 Background
   1.2 Related work
   1.3 Problem formulation
   1.4 Objectives
   1.5 Scope/Limitation
   1.6 Outline
2 Method
   2.1 Methods used and motivation
   2.2 Alternatives
   2.3 Reliability and Validity
      2.3.1 Reliability
      2.3.2 Validity
   2.4 Ethical Considerations
3 Theoretical Background
   3.1 Clustering
      3.1.1 Clustering method: Hierarchical Agglomerative Clustering | HAC
      3.1.2 Metric to calculate similarity: Gower Distance
      3.1.3 Assessment of the results: Silhouette value
      3.1.4 Display of the results
      3.1.5 Discarded approaches
   3.2 Dimensionality reduction
      3.2.1 Multiple Correspondence Analysis | MCA
      3.2.2 Application and result interpretation
      3.2.3 Display of the results
      3.2.4 Discarded approaches
4 Used technologies
   4.1 Software Architecture
   4.2 Firebase
   4.3 NodeJS
   4.4 R
   4.5 HTML, CSS, JS
5 Implementation
   5.1 Initial state
   5.2 Set-up of the back-end
   5.3 Data collection
   5.4 Scripting the application
   5.5 Integration of the scripts in the web application
   5.6 Display of the results
6 Results and Analysis
7 Discussion
8 Conclusion
   8.1 Q1: Which techniques are the most adequate to fulfill the requirements R1, R2 and R3?
   8.2 Q2: How is the set of techniques going to interact with the data to provide results?
   8.3 Q3: How is the application of the techniques going to provide the results to the user?
   8.4 Future Work
References


1 Introduction

The primary aim of this study is to identify a collection of techniques for providing analytical data inherent in a Systematic Literature Review's multi-dimensional data set to its consumer.

A Systematic Literature Review (SLR) identifies and selects scholarly studies in accordance with a predefined procedure in order to address a question or create a definition.

Before the analysis, a given protocol is used in which the requirements that the sources must meet are clearly specified. A review protocol specifies the methods that are going to be used to perform an SLR.

Without it, it is possible that the selection of literature may be biased. The protocol includes the elements of the SLR plus additional information such as the background, the research questions or the data extraction strategy [1].

We obtain data from an SLR in the form of a multi-dimensional data set that represents every scientific paper, report, or study selected for analysis on a specific topic. Each entry has a collection of attributes that include information about the entry, such as the date it was published and the authors, as well as other attributes unique to the SLR's field of study.

For this thesis, the data is obtained from an SLR on Self-Adaptive Systems used as an example [2], where we could find a classified set of literature that could be displayed in the shape of a multi-dimensional data set.

In the example we will use for this thesis, these attributes can include the type of algorithm addressed by the entry, the standards it adheres to, the models used, and so on.

This taxonomy will be discussed further in the Results and Analysis section.

From the previously defined context, we can state the problem addressed by this analysis: how to interpret the multi-dimensional data set that is the outcome of an SLR to determine underlying knowledge for the study on which the SLR is based.

The data for this thesis consists of a series of scientific papers, with each entry of the given data set including a series of attributes called categories that define various characteristics about the entry in question, such as the subject, algorithms used in the analysis, guidelines followed, and so on.

The set of techniques for this thesis is described as a collection of algorithms that, when used together, form a guideline for data analysis. As a consequence, the aim of this thesis will be to provide a method for analyzing these multidimensional data sets produced by an SLR in order to uncover underlying information that would otherwise be unintelligible.

We describe data analysis as obtaining a series of results from a raw data set that will assist the user in better understanding the trends and associations that occur within the data set entries.

Systematic Literature Reviews lack analytical insight on their results, so finding an approach that extracts this data optimally is the main motivation of this thesis.


This thesis will provide a technique that supplies researchers with analytical data regarding their Systematic Literature Reviews. This data will address an existing gap in how to obtain underlying information from the data sets that make up SLRs.

This will be useful to better understand the researched information, and to provide new data or contrast existing data that is not apparent at first sight.

The result of this thesis will assist scholars and/or engineers in studying the similarities and trends that occur between the various references they use in their analyses, which might not always be obvious if it is only presented by a Systematic Literature Review whose evidence has not been extensively analysed.

Nonetheless, while the techniques are tested using a data collection based on a study relating to Self-Adaptive Systems, the method is open to being applied to any kind of study and/or by any researcher.

We concluded that, while SLRs are very useful for finding relevant evidence, the ability to derive meaning from multidimensional data is lacking in current tools and techniques.


1.1 Background

The purpose of a Systematic Literature Review protocol is to restrict the body of work, which means that the reference point is a broad body of literature on a specific topic, and the end result is the most relevant material on that subject [3].

To narrow down the body of work, a literature search is performed first: titles are reviewed, a search using keywords specific to the Systematic Literature Review’s subject is conducted, and entries are filtered based on their date of publication, language, or the likelihood of bias.

When a reference is obtained repeatedly without yielding new data, the search on the literature comes to a halt [4].

The researcher then uses their questions to assess inclusion/exclusion requirements in order to minimize the body of work.

Figure 1.1: Image extracted from Xiao et al. [4] describing the process followed for the execution of a Systematic Literature Review.

Studies that are unrelated to the researcher's definition of a question or subject to be discussed should be omitted. Criteria may be based on the research process, subject, design, calculation, sampling, and so on.

Finally, the quality of the entries is measured, and therefore a more subjective type of reduction is performed, since there is no agreement about how quality evaluation should be handled and it is more dependent on the researcher's needs and opinion.

Following that, data extraction, interpretation, and reporting of findings are carried out, which addresses the intellectual value of the Systematic Literature Review now that the body of work has been reduced and adapted to the researcher's requirements.

The importance of extracting data from a data set used in the construction of an SLR lies in the provision of correlations and the identification of trends that are not detectable in raw data.

The results found in this study may be of interest to researchers who want to apply the set of techniques researched in this thesis to support (or contrast) the information given previously by the SLR itself.

This data must be able to provide the consumer with empirical knowledge specific to the SLR they are operating in, and determining how to do so is the task of this study.


1.2 Related work

Numerous procedures to conduct a Systematic Literature Review already exist; many only support a specific step, such as text mining or visualization, while others support the whole SLR process [5].

For instance, and to highlight some of the existing tools regarding the execution steps of a Systematic Literature Review, we find:

• Tools dedicated to Literature Searching: Here we find tools dedicated to text mining [6][7] and to analyzing search queries and creating search strategies via mining of keywords [8], which form a vital part of the creation of a Systematic Literature Review.

These tools are used to provide the initial material to be more thoroughly filtered in screening. They do not filter within the literature related to the topic of the SLR, but rather filter the whole set of literature for articles that might be related to the topic of the SLR, given keywords and other text mining techniques.

• Tools for screening: These kinds of tools extract the relevant entries obtained from the previous step and provide the SLR with what will be its final set of data.

The data set is represented as a multi-dimensional matrix, with each entry having a set of attributes that qualify it for inclusion in the body of work that this Systematic Literature Review will examine.

For instance, tools regarding this topic [9][10] use search engine functionality and machine learning to guarantee that the filtered entries are relevant enough to be part of the SLR.

• Tools for meta-analysis: A meta-analysis is the process of analyzing and merging results from different similar studies. Therefore, the set of techniques to be studied in this thesis would be a type of meta-analysis.

Certain approaches [11] that can be used to provide meta-data from SLRs are also interesting to mention because of their similarity to what this thesis aims to provide.

The difference between such a tool and the research gap of this thesis is that the aforementioned tool is focused on examining publication bias and plotting different models, while the research gap aims to address more specific topics such as reducing dimensionality and finding correlations and patterns within the already existing data.

To summarize, we observe that the current research gap for this thesis is to provide a layer of data analysis to extract insight from the result of an SLR that would not be obvious without applying a set of data analysis techniques to the multidimensional data set produced by performing an SLR.


1.3 Problem formulation

The gap found for this thesis to fill is to find a set of techniques that provide an SLR with data analysis related to pattern identification and correlations. As mentioned previously, there are other tools to analyze SLRs, and what this thesis aims for is to provide a set of techniques specific to the requirements that will be explained in this section.

For that, we aim to address the multi-dimensionality of an SLR data set to identify patterns within the entries in the data set and find correlations between the different dimensions of the data set.

With this gap now established, there are some questions to address it, which make up the problem of this thesis: how to fill the knowledge gap that we have established?

The existing requirements are:

• R1: Finding a suitable approach that includes the necessary analytic capabilities for multi-dimensional datasets.

• R2: The techniques found shall identify patterns in the dimensions that conform to every entry of the input. It is established that an element of the input should be a document and the attributes are the dimensions assigned to each element.

• R3: The techniques found shall be able to correlate the dimensions of the elements of the input to reduce the number of relevant dimensions if there exists a correlation between two or more dimensions.

Given these requirements, questions arise that should be answered to arrive at the solution this thesis aims for:

• Q1: Which techniques are the most adequate to fulfill the requirements R1, R2 and R3?

• Q2: How is the set of techniques going to interact with the data to provide results?

• Q3: How is the application of the techniques going to provide the results to the user?


1.4 Objectives

Given the previous explanation of the problem, the motivation for the research of this thesis, how it will be addressed, the requirements we have found, and the questions that have arisen when establishing the research gap, we can establish the following objectives that the conclusion of this thesis aims to fulfill:

O1 Study different techniques to find the most suitable ones for the scope of this thesis

O2 Find, within the studied techniques, a way to identify patterns in multidimensional data

O3 Find, within the studied techniques, a way to correlate the different dimensions that make up the input data.

O4 Test the chosen techniques in an example data set to obtain a set of results.

O5 Analyze the results displayed in O4 and demonstrate the insight that can be obtained from the application of the studied techniques.

1.5 Scope/Limitation

The limitations of this thesis include not analysing and researching every methodology currently available, so the focus is limited to those that address the topics identified as objectives in Section 1.4.

Also, as mentioned in Section 2, the method studied in this thesis is applied to a single case study rather than a larger sample of instances, as it would be extremely time consuming to apply and analyse the studied approach in multiple implementations or environments.

Another limitation of this thesis might be that we are not conducting the whole SLR but we are using material from an already existing one. Using a tailored SLR suited for the development of this thesis would be ideal to provide perfect results but it is also interesting to see how it will unfold with already existing SLRs.

Finally, since the method is still evolving and open to modification, the current approach to the method can be a source of limitations. As a result, while it still demonstrates the functionality of the studied approach, the programming or execution of the web application will not be 100 percent effective.


1.6 Outline

This report will be organized using the following structure:

• In Section 2 the Method will be discussed, as in how the project will be addressed, its reliability and ethical considerations.

• Section 3 consists of the Theoretical Background of the thesis, where the different techniques chosen for the studied approach are defined and explained using an example, and lastly, the discarded techniques are addressed.

• In Section 4 we can find a brief explanation of the Technologies Used for the development of the artefact this thesis gets its results from, as a way of demonstrating how the approach works.

• In Section 5 the Implementation of this aforementioned artefact is approached and discussed.

• In Section 6 the Results obtained from running the studied techniques on an example multi-dimensional data set, the product of a Systematic Literature Review, are displayed; a thorough Analysis of the obtained data is conducted and some conclusions about the information obtained are discussed.

• In Section 7 there is the Discussion on whether this thesis has fulfilled its objective and whether the problem this thesis revolves around has been answered, in other words, whether the knowledge gap found has been solved.

• Lastly, Section 8 presents the Conclusion and discusses Future Work.


2 Method

In order to achieve our objectives we apply a mixed method that includes:

• A Literature Review, to gather information on the different techniques that can be used to extract the desired analysis (to fulfill R2 and R3 and to answer Q1) from a multi-dimensional data set obtained from an SLR. This fulfills O1, O2 and O3.

• Following the Design Science method, to test the studied techniques obtained from the Literature Review and to provide this thesis with a set of results and analysis to answer Q1. This fulfills O4 and O5.

An explanation of the methods used and a justification will be given in this section.

2.1 Methods used and motivation

Finding documentation regarding technologies, methodologies and algorithms about data analysis that address O1 and O2 will be the first step for this thesis to approach a solution.

Therefore, conducting a Literature Review has been chosen as the first step in the definition of the method for this thesis. A Literature Review searches and evaluates the available literature in a given subject or chosen knowledge area.

The Literature Review has four main objectives:

• It surveys the literature in a certain topic.

• It synthesises the obtained information in a concise summary.

• The information is analysed to determine gaps in current knowledge and limitations of the different points of view.

• It presents the analysed data in an organised way.

A literature review demonstrates thorough knowledge of its subject and shows that the performed research takes part in an existing body of agreed knowledge.

The motivation for conducting a Literature Review is, as previously mentioned and following the definition, to gather as much information as possible regarding algorithms and methodologies related to data analysis to find the most complete answer for O2.

Also, to extract proof from the information obtained in the Literature Review, or demonstrate its functionality, an artefact has been developed in order to be able to implement the researched methods on a data set, plot the results they extract from the aforementioned data set and analyse them.

In Design Science, a new artefact is developed. Artefacts can be material, but they also include, as happens in this case, algorithms and system design methodologies.

The first step for the creation of an artefact, in this case, is collecting the necessary techniques that, when applied, will provide the expected results that answer Q1 by addressing RQ1 and RQ2.

This step is related to the previous method, the Literature Review, as its result is what provides the artefact with the techniques to be used, which will be tested in the next step.


When the artefact is defined, it has to be verified, as it has to fulfill its purpose of demonstrating the effectiveness of the researched techniques, and therefore, answering Q1.

This is approached by the formulation of RQ2 and RQ3, which will be the framework to test if the chosen techniques for the artefact fulfill the objectives of this thesis.

Creating said artefact generates new knowledge, as in this thesis' case, where using this artefact, made up of the information, techniques and algorithms gathered by the Literature Review, will provide new information. This is the motivation for this thesis to make use of this method.

Figure 2.2: This figure shows the process followed in the application of Design Science for this thesis.

2.2 Alternatives

While writing this thesis and searching for a proper method to address it, other alternatives were found and discarded as they did not fit the future development of this thesis.

A Controlled Experiment was proposed as an alternative to Design Science, as it too would fit the purpose of proving the effectiveness of the researched techniques and answering Q1.

To perform a Controlled Experiment, a controlled environment is established and quantitative data is measured to prove or disprove a point.

The dependent variable is what is measured for the study and the independent variables are inputs that affect the dependent variable.

An example of dependent and independent variables would be the position of the sun depending on the time of day: the position of the sun (dependent variable) varies depending on what time it is (independent variable).

The focus of a Controlled Experiment on the study and measurement of different variables, rather than on creating a direct way to extract new information by applying researched algorithms and analysing it, made us choose Design Science over this approach.


2.3 Reliability and Validity

2.3.1 Reliability

Reliability is defined as the capability of a researcher to replicate the results of a study when reproducing it.

Reliability problems can occur when using unconventional data collection methods, manual measurements that can be biased, or relying on third parties' participation.

Reliability issues in this thesis mainly lie in the user's interpretation of the results, since the user's criteria are relevant and can alter the final conclusions on the analysed data.

Applying the chosen techniques in a data set equivalent to the one used for this thesis should prevent reliability issues.

2.3.2 Validity

Validity is the extent to which the results and conclusions obtained are well-founded and correspond accurately to reality. There are three types of validity to consider:

• Construct Validity concerns the interpretation of theoretical constructs. For example, whether the words used when describing concepts will be applicable for every person reading the definition.

For this thesis, Construct Validity is addressed by not generalising about the efficiency of the researched algorithms, as their efficiency is strictly limited to the scope of this thesis.

• Internal Validity is based on the existence of bias in the study. For instance, if the data used is affected by variables that were not taken into consideration or if the person conducting the study modifies the results obtained to fit the hypothesis raised.

Internal Validity has to be addressed in the analysis of the results. There is no specific bias that has been detected in the analysis of the data set used.

Nonetheless, as Internal Validity states that bias can be generated by unknown variables that are not taken into consideration, it is not safe to assume anything and deem the results an absolute truth.

• External Validity regards the generality of the results, as the data set used in a study doesn't necessarily reflect the behaviours that might be obtained when using a larger pool of data.

External Validity issues are obviously present due to the restricted behaviour of the data set used. The data set used for this thesis regards a Systematic Literature Review on Self-Adapting Systems, which is a very specific topic.

Even though the nature of the researched methods and algorithms makes them applicable to a data set regardless of its topic, the attributes used in the data set are very specific to the topic and it can be hard for the reader to draw parallels with the data set used by another researcher.


2.4 Ethical Considerations

This study is based on straightforward research, experimentation, and analysis. In terms of ethical concerns, there is little to be worried about because all of the knowledge and data used is exclusively public, and therefore its use is not limited to private research.

There are no surveys, questionnaires, or direct questions to someone whose information could be compromised in an unsuitable environment for data storage with a potential violation of the GDPR, so this study is free of any ethical concerns about data confidentiality.

There is no chance of damage since this thesis is entirely virtual in nature.

The sampling of the data set used to provide insight into the studied methodology may be the only issue.

The entries were chosen solely based on accessibility, subject (as they all have to be within the same topic, Self-Adapting Systems), and randomness, which may have excluded certain entries from the data collection.

In terms of the artefact, since it is a research method, users need not be concerned with result filtering or unintended exposure because the findings are strictly informative to the user and there is no public display of any of the data analysed.

It is suggested that the use of the testing tool is implicit in the creation of this thesis, and that each study that wants to apply the offered approach should adapt the provided information to their study.

Given the preceding point, potential users of the disclosed approach should not be concerned with how their data will be handled, as the researcher is ideally in charge of how to use the techniques in each case.

We will find several variations of external libraries that are used for the implementation of the techniques researched in this study, but the decision about how to incorporate and use them is entirely up to the researcher.

However, since these techniques can change and alter data sets, the researcher must be wary of this and maintain a controlled environment in which to test their data without compromising its integrity.

The artefact used in this study does, in reality, change the data sets in order to make them meet the requirements of the various libraries used.

Following this study as a guideline requires the researcher to keep and handle their data in a secure manner in order to avoid changes that may affect the outcome of their research.


3 Theoretical Background

To fulfill O1, all of the techniques and algorithms studied are oriented toward pattern recognition, in order to fulfill O2, or toward relationships between categorical variables, in order to fulfill O3. This section is the outcome of applying the first method of this thesis, a Literature Review on techniques and algorithms that fit R2 and R3.

All of the techniques discovered are appropriate for multidimensional datasets containing qualitative data.

Based on the literature review conducted in this study, two techniques, clustering and dimensionality reduction, have been identified as the key candidates to address Q1.

The method chosen for clustering is Hierarchical Agglomerative Clustering, which will be discussed in Section 3.1 and the subsections that follow.

In terms of dimensionality reduction, we use the Multiple Correspondence Analysis method since it is perfectly suited to both a multidimensional data set and a data set with purely categorical data, as will be discussed in Section 3.2.

3.1 Clustering

Clustering is the process of grouping a certain collection of elements based on their similarity, resulting in distinct sets of elements that are identical to each other [12].

These groups are referred to as clusters. Clustering is established as a standard method for data analysis that requires pattern recognition, which is one of the criteria we previously established as necessary for the researched approach in this thesis.

Consider the following example to understand clustering: A farmer’s wife enjoyed fried unripe green tomatoes.

The farmer gathered all the tomatoes from the farm in a large box and began touching only the green ones; if they weren’t ripe yet, he put them into a smaller box to send to his wife, but if they were ripe, he left them in the large box.

Without knowing it, this farmer was clustering his tomatoes in two clusters based on colour and texture. The first cluster was the small box, fitting the farmer's wife's criteria, and the other one was the big box, with the tomatoes the farmer's wife didn't like [13].


This figure shows another example of clustering, found in [14], where we can see a bunch of fruit being clustered in 3 groups depending on their type. We could also cluster the fruit by colour, and we would then observe 2 clusters, one with bananas and one with strawberries and apples.

Addressing clustering as a process to be followed, it consists of three steps [15]:

1. Selection of a clustering method

2. Selection of a metric to calculate the similarity between items

3. Assessment of the obtained clusters

An example on Hierarchical Agglomerative Clustering will be shown in each point using this simplified example of a matrix of categorical data:

    tomato  extra   cheese
1   yes     cheese  vegan
2   yes     none    regular
3   no      none    regular
4   yes     cheese  regular
5   no      tomato  vegan

Table 3.1: 3x5 matrix of categorical data that will be used as an example in the application of Hierarchical Agglomerative Clustering. This example shows a multi-dimensional data set of categorical data containing what would be food orders in a pizza restaurant, where each entry defines a pizza order and each attribute involves an ingredient, replicating the kind of data set the thesis will use, since the attribute values can be a binary answer or have multiple possible values.
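For reference in the following subsections, the example above can be written as an R data frame. This is a minimal sketch; the column names tomato, extra and cheese are an assumption based on the flattened table header, not names given explicitly by the thesis.

```r
# The pizza example from Table 3.1 as an R data frame of categorical (factor)
# variables. Column names are an assumption reconstructed from the table header.
pizzas <- data.frame(
  tomato = factor(c("yes", "yes", "no", "yes", "no")),
  extra  = factor(c("cheese", "none", "none", "cheese", "tomato")),
  cheese = factor(c("vegan", "regular", "regular", "regular", "vegan"))
)
```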

3.1.1 Clustering method: Hierarchical Agglomerative Clustering | HAC

Since this study is about multidimensional data, it was important to find a multidimensional data clustering approach.

During the literature review for this thesis, several papers [16][17][18] use Hierarchical Agglomerative Clustering (HAC) for clustering data sets with many attributes that are close to the data set used in this thesis.

For example, the data set used in Lung and Zhou's paper [16] consists of two quantitative attributes and an 8-dimension matrix of qualitative data, which is close to the format of the data set used in this thesis.

Hierarchical Agglomerative Clustering is done from the bottom up, which means that each data point starts as a cluster and pairs of clusters merge based on their similarity to form a new cluster higher up in the hierarchy.

This is repeated recursively until the optimal number of clusters is reached.

The disclosure of how many clusters should be produced is discussed in the following section.


3.1.2 Metric to calculate similarity: Gower Distance

We will need to measure a distance metric appropriate for categorical data in order to use this algorithm to cluster our data.

The Gower Distance measure is a commonly used metric that is appropriate for the data set used in this study since it only contains categorical data. The Gower Distance is determined using the following criteria [19]:

\[
D_{\text{Gower}} = 1 - \frac{1}{p}\sum_{j=1}^{p} s_j(x_1, x_2) \tag{1}
\]

Where $p$ indicates the number of attributes that define an entry, and $s_j(x_1, x_2)$ is 1 when the attribute reviewed is equal for the two entries being compared, $x_1$ and $x_2$, and 0 when it is not.

To obtain the Gower Distance between two entries, $s_j$ equals 1 for each attribute value they share and 0 otherwise, so if two entries share 3 out of 5 attributes, the Gower Distance will be:

\[
D_{\text{Gower}} = 1 - \frac{1}{5}\,(1 + 0 + 1 + 1 + 0) = 1 - 0.6 = 0.4 \tag{2}
\]

Now, using the previously defined data set as an example, we obtain the following dissimilarity matrix by applying Gower's distance to each pair of entries:

      1      2      3      4
2   0.667
3   1      0.333
4   0.333  0.333  0.667
5   1      0.667  0.333  0.667

Table 3.2: This matrix shows the dissimilarity values between the entries of our example data set. The higher the value, the more dissimilar two entries are. For instance, entries 1 and 3 have a dissimilarity of 1 since they don't share any attribute value.
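The thesis computes this kind of dissimilarity with R. A minimal sketch, assuming the cluster package and the pizzas data frame defined above (the result can be compared with the matrix in Table 3.2):

```r
# Compute the pairwise Gower dissimilarities for the pizza example.
library(cluster)

gower_dist <- daisy(pizzas, metric = "gower")    # pairwise Gower distances
round(as.matrix(gower_dist), 3)                  # compare with the dissimilarity matrix above
```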

3.1.3 Assessment of the results: Silhouette value

The idea behind deciding the number of clusters when the clusters are formed is grouping similar data to achieve a minimum distance between data points while achieving a maximum distance between data groups.

With this in mind, the evaluation of clustering revolves around how compact and/or isolated they are.

The silhouette value approach was chosen for clustering evaluation because it is an efficient way to reveal whether the number of clusters is the appropriate one, given a measure of how close each member of a certain cluster is to the data points outside of its cluster [20].

To use the silhouette method to calculate the appropriate number of clusters, the highest average silhouette width should be determined while avoiding a very large number of clusters, since the higher the number of clusters is, the fewer data points each cluster contains, which is not ideal [21].


In this case, the silhouette value is determined as follows:

\[
a(i) = \frac{1}{|C_i| - 1}\sum_{j \in C_i,\, j \neq i} d(i, j) \tag{3}
\]

\[
b(i) = \min_{k \neq i} \frac{1}{|C_k|}\sum_{j \in C_k} d(i, j) \tag{4}
\]

And given $a(i)$ (3), which represents the average distance from $i$ to all the other points inside the same cluster $C_i$, and $b(i)$ (4), which represents the average distance of point $i$ to the points in the nearest cluster, we obtain $s(i)$, the silhouette value:

\[
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \tag{5}
\]

Using HAC on our example data collection, we obtain the average silhouette width metric, which is used to determine the clusters and pick the appropriate number of clusters:

                test1  test2  test3
cluster.number      2      3      4
avg.silwidth     0.22   0.33   0.10

Table 3.3: A table that compares the number of clusters produced in each test to the average silhouette width obtained in that test.

This data can be plotted for greater visual clarity, but in this example data set, it is obvious that k=3 will be chosen because it has the highest average silhouette value.
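A minimal sketch of this assessment in R, continuing from the gower_dist object above and assuming the cluster package. The thesis does not specify the linkage method, so complete linkage is assumed here, and the exact widths may therefore differ from Table 3.3.

```r
# Build the HAC tree and compare average silhouette widths for k = 2..4.
library(cluster)

hac <- hclust(as.dist(gower_dist), method = "complete")   # agglomerative clustering

for (k in 2:4) {
  labels <- cutree(hac, k = k)              # cluster assignment for each entry
  sil    <- silhouette(labels, gower_dist)  # per-entry silhouette values
  cat("k =", k, "avg.silwidth =", round(mean(sil[, "sil_width"]), 2), "\n")
}
```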


Figure 3.3: This plot shows the data of the previous table and shows an example on how the number of clusters should be assessed, as the highest average silhouette width is the one to be selected while the corresponding number of clusters is not too high. For instance, if two k values — 7 and 10 — share the same average silhouette width, k=7 is to be selected as a lower number of clusters is preferred in case of equality in the silhouette value.

3.1.4 Display of the results

The results of using Hierarchical Agglomerative Clustering can be directly plotted into a tree-type structure where the hierarchy occurring among the various entries can be seen.

On its own, however, this would fall short of the aim of finding trends in the dimensions that make up each entry in our multi-dimensional data set.

Instead, data will be shown on top of the dendrogram created by the algorithm to show the value of a given attribute of each entry. This is the most visually appealing way to view the Hierarchical Agglomerative Clustering algorithm's hierarchy. It demonstrates to the consumer how clusters are formed and which entries comprise each cluster based on their similarities.

For our example data set, the dendrogram showing the clustering done by the Hierarchical Agglomerative Clustering algorithm looks like this:


Figure 3.4: As we can observe, three clusters have been created, as determined in the previous step. We can observe that entry 5 is not similar to any of the other entries while the attributes from 1-4 and 2-3 are similar enough to group them in the other two clusters.
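A minimal sketch of how a dendrogram like this could be drawn from the hac object defined above, assuming base R graphics:

```r
# Plot the HAC dendrogram for the example and mark the three clusters
# selected via the silhouette assessment.
plot(hac, main = "HAC on the pizza example", xlab = "", sub = "")
rect.hclust(hac, k = 3, border = "red")   # draw a box around each of the 3 clusters
```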

3.1.5 Discarded approaches

Specifically in this area, since the variety of clustering algorithms and data evaluation methods is immense, various techniques have been rejected for not being appropriate for use on this concrete data set or for failing to provide the data that the user is looking for when using this technique.

Other clustering approaches, such as Hierarchical Divisive Clustering, were considered, but were rejected because, according to the literature, HAC is the most commonly used technique for obtaining the data that this study seeks [22].

The justification for this decision stems from the high computing costs and model selection issues encountered with this method.

Furthermore, as opposed to HAC, it begins with a cluster containing all of the entries and splits it gradually, making it more vulnerable to initialization than HAC due to the large number of potential divisions in the first stage with the initial cluster [23].

Another abandoned approach to data evaluation is the elbow method, which is an alternative to the silhouette method.

It is, arguably, the most common approach to assess the cluster number. It is simple, but not very precise or robust for estimating the adequate value of k (the number of clusters), which is obtained by plotting the mean squared distance between each instance and its centroid against the number of clusters.

In conclusion, considering its use in the consulted bibliography and the fact that it is referred to as the most insightful and adequate way to find the optimal number of clusters, the Silhouette Value is preferred over the Elbow Method [24].


3.2 Dimensionality reduction

Dimensionality reduction entails converting a given multi-dimensional data set into a low-dimensional representation that retains the properties expected from being a part of the starting data set [25], and then comparing it to a concept known as intrinsic dimension, which is defined as the smallest number of dimensions that a multi-dimensional data set can be reduced to without losing information.

The key goal of incorporating dimensionality reduction into this thesis project is to preprocess the data and deal with its high dimensionality by projecting the given multidimensional data set to an n-dimensional data set that is lower than the starting point [26].

3.2.1 Multiple Correspondence Analysis | MCA

Multiple Correspondence Analysis (MCA) allows the analysis of many categorical dependent variables.

It is analogous to Principal Component Analysis (PCA) where the factors are qualitative rather than quantitative, as in this thesis.

The aim of MCA is to summarise the measurements by defining new dimensions based on the similarities discovered between the original dimensions [27][28].

Since it is an exploratory tool, it is useful for exploring and combining groups in order to include insight about how the categories are compared to each other in a manner that the user will not be able to understand by simply reading the raw data [29].

In essence, MCA results are obtained by the use of a standard Correspondence Analysis given a matrix defined by 0 and 1 to enable the data to be categorical, and the Correspondence Analysis (CA) results later require correction and adaptation given the context of the data set with which it is used.

3.2.2 Application and result interpretation

Given the clarification in Section 3, this technique is used in this study because of its utility in identifying associations between attributes.

Furthermore, applying this algorithm to a specific multidimensional data set yields a low-dimensional map in which the points nearest to each other appear to choose the same values for the nominal variables[27].

The input of the MCA algorithm should be a data table where $n$ entries are described by $p$ variables (in the case of MCA, categorical variables), containing the values $x_{ij}$ of each variable $j$ for each entry $i$.

It is necessary to choose how many dimensions will be conserved in order to analyse the MCA data.

To do so, the dimensions to be kept, according to the average rule specified in the Lorenzo-Seva, Timmerman, and Kiers paper [30], are those with a variance value equal to or greater than 9 per cent.


For instance, applying MCA to the data set used in Table 3.1, we obtain the following table:

      eigenvalue  variance.percent  cumulative.variance.percent
Dim1       0.828            62.136                       62.136
Dim2       0.390            29.273                       91.410
Dim3       0.114             8.590                      100.000

Table 3.4: As previously said, in this case we will consider dimensions 1 and 2 because they are the only ones with a variance percentage greater than 9%. The cumulative variance percentage column merely accumulates the variance percentages of each dimension and, like the eigenvalue, is not used to determine the number of dimensions to consider. The eigenvalues reflect the amount of variance in the original data that has been preserved in the new dimension [31].
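A table like this is the kind of output R produces when MCA is applied to the example. A minimal sketch, assuming the FactoMineR package and the pizzas data frame defined earlier (the exact values depend on the data and the implementation):

```r
# Apply Multiple Correspondence Analysis to the pizza example and inspect
# how much variance each newly created dimension retains.
library(FactoMineR)

mca <- MCA(pizzas, graph = FALSE)
round(mca$eig, 3)   # columns: eigenvalue, % of variance, cumulative % of variance
```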

3.2.3 Display of the results

To display the results of the execution of the MCA technique, first of all, a decision has to be made on how many of the new dimensions to keep.

As mentioned beforehand, there is an average rule to select which generated dimensions are accurate enough to be considered, which uses the variance percentage displayed in the table generated by the application of this technique to the data set. In this study, as the literature suggests, a variance of 9% is the threshold to consider them as part of the results [30].

Once the value has been determined and considered, it can be used to show the characteristics that correspond to that dimension and the percentage contribution they provide to the newly created dimension.

This demonstrates how much they add to the new category created by the MCA algorithm, since a high contribution percentage reflects how important certain attribute values are to the development of the new dimension.

Now, by plotting the various entries, current attributes, and components created by the methodology, we can gain various insights depending on how we look at them:

• By displaying the rows (entries) and the columns (attributes) of the studied data set we extract the similarity between them, as in how close or far apart they are displayed. Similar points are close on the plot while dissimilar points are expressed far away from each other. See Figure 3.5.

• The distance of every entry point from the dimension axis shows us how relevant that entry, as in its attribute values, is to the composition of said dimension. See Figure 3.6.

• We can also display the contribution of each value in percentage for the creation of new dimensions. See Figure 3.7

Taking the explanation given above, we can plot the entries, given their value in each attribute, in each pair of dimensions to show their relevancy in the newly created dimensions.
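A minimal sketch of how such plots could be produced from the mca object above, assuming the factoextra package (the thesis does not name the exact plotting library it uses):

```r
# Visualise the MCA result: entries and attribute values on the new dimensions,
# correlation of the original variables with each dimension, and the top
# contributions to dimensions 1 and 2 (cf. Figures 3.5-3.7).
library(factoextra)

fviz_mca_biplot(mca)                                   # entries and attribute values together
fviz_mca_var(mca, choice = "mca.cor")                  # correlation of variables with the dimensions
fviz_contrib(mca, choice = "var", axes = 1, top = 5)   # top 5 contributions to dimension 1
fviz_contrib(mca, choice = "var", axes = 2, top = 5)   # top 5 contributions to dimension 2
```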


Figure 3.5: This plot shows us each entry in the multi-dimensional data set we used as an example and all the attribute values about the two dimensions generated by MCA. We can observe entries 1 and 2 very close to the 0 in dim2 axis, while they are very far from the 0 in dim1 axis. This shows us that entries 1 and 2 are relevant in the construction of dim1 while they are not so relevant in the construction of dim2.

(a) Using this plot we can obtain the variables most correlated to the newly created dimension 1.

(b) Using this plot we can obtain the variables most correlated to the newly created dimension 2.

Figure 3.6: We can plot the results to obtain the correlation between the variables and the newly created dimensions produced by using MCA.


(a) Using this plot we can obtain the top 5 most contributing factors used to determine the newly created dimension 1.

(b) Using this plot we can obtain the top 5 most contributing factors used to determine the newly created dimension 2.

Figure 3.7: As previously said, we can plot the MCA results to determine the relationship between attribute values and the newly generated dimensions resulting from the dimensionality reduction that MCA performed on our multi-dimensional data set.

3.2.4 Discarded approaches

Given the data’s strict categorical structure, there aren’t many methods to take or reject, so the only valid method discovered is the one used for this thesis, Multiple Correspondence Analysis.

Strictly speaking, as previously said, other approaches such as Principal Component Analysis or Multiple Factor Analysis may be used on categorical data encoded as binary data to obtain results similar to those obtained with Multiple Correspondence Analysis.

However, since there is a proper methodology devoted exclusively to dealing with categorical data, it makes no sense to take further measures to handle the data and eventually taint the outcome.


4 Used technologies

This chapter explains the results from the design science method used to answer RQ1 described on page 4. An application to derive the theoretical insight found in the example data set has been developed to apply the techniques found in the literature review. This artefact will also shed light on Q2 and Q3.

As part of the development, a back-end was added to the current web-application [32] to make its data set more portable and usable, and functions were created to run the techniques on the data set in order to obtain the results for this thesis.

To contextualize, the template this thesis is working with as a starting point is an application to visually display SLRs and provide metrics and filtering, like for instance show the similarity between entries or remove from display entries older than a certain year. The approach was to adapt a currently existing solution that is used to display the output of a Systematic Literature Review to include the techniques studied in this thesis to extract analytical information from it.

To give a brief overview, this application is part of the development of the artefact necessary to follow a Design Science method and will be used to answer Q2 and Q3 while fulfilling requirement R1, also fulfilling O4 and O5, as this development will both provide a testing environment for the set of techniques on an SLR example and gather the functionalities to display the results in a visual way.

Also, to introduce the architecture, it consists of a database, a hosting platform and a way to deploy functions that can be triggered by HTTP requests by the webpage and are therefore scalable and contained within the whole platform that hosts the website.

4.1 Software Architecture

Figure 4.8: This figure shows the interaction and the data flow between the different elements that are part of the web-application used in this thesis

As we can see, the back-end of the application provides us with three functionalities or structures: hosting for the website, access to a non-relational database from which the user can insert and extract data using the different functionalities the front-end provides to interact with it, and the Cloud Functions described below.

The information stored in the database is then either used by the Cloud Functions to apply the techniques researched in this study and provide an output for the front-end to display, or simply displayed as raw information from the database on the website.

There are two Cloud Functions, one for each algorithm (HAC and MCA), which are triggered by an HTTP call from the web application to obtain the data from the database and send an output after their execution that will be displayed in the front-end.

This is the main reason this approach has been selected, besides its simplicity since Firebase just provides it as a native functionality when a user hosts a web application using Firebase.

What the user interacts with, mainly, is HTML5 code that uses JS functions to display the multiple functionalities the application provides, using both Bootstrap and a personalised stylesheet for display purposes.

In the next sections we will see a more in-depth explanation of each and every one of the components used:

4.2 Firebase

The primary requirement of this thesis' artefact was to adapt the current solution to provide a solid back-end on which to create links to a database while still offering the application a hosting service.

Given the specific needs to implement the studied approach in the provided application, the first thought that came to mind after having previously used this service was to create a Firebase project and link it to the existing front-end to provide the web application with both hosting and a synced database to store the multi-dimensional dataset.

Firebase helps its users to set up their backend without having to manage the servers and it makes it scalable to the needs of each user. Another point that made us decide on using Firebase was to observe that we could host and store the data sets in the same place, while having the code be connected to the Firebase database and adding the externally hosted functions to the code for scalability reasons.

Other possibilities would have been to use AWS Amplify, as it is also a Backend as a Service provider with a free version, but Firebase was chosen due to familiarity with it and due to having, in the writer's opinion, much better documentation than AWS Amplify.

Given the ease of use of this process, the web-application is completely deployed, hosted in Firebase, and a database based on JSON documents is created and synced into the project in a matter of steps.

To go in-depth of the functionalities used from Firebase, a bullet list containing them will be displayed:

• Realtime Database was used in this project to contain the contents of the multi-dimensional database, with all entries imported via JSON format, since it is the way Firebase provides functions to view, get data from, inject data into, and delete data from databases using NodeJS libraries and functions, making the transition from the original state of code to a modified one using Firebase's facilities a simple process.

• Cloud Functions was used to provide the web-application with the various functions that are required to provide the user with the desired analytical data on their data collection, using the approach studied in this thesis to do so.

Since Firebase provides the ability to create Cloud Functions, as well as the ability to verify bugs, create logs on their execution, and view the results with a single click, this solution was chosen for its convenience and familiarity with the system itself.

• Hosting using Firebase is inherent to its use; as soon as you deploy your Firebase project, it is immediately hosted at a URL provided to you via the console where you execute the commands to deploy your project, or which you can check through your Firebase account.

4.3 NodeJS

For the implementation of the back-end, NodeJS was a requirement more than a choice for the project to be linked with Firebase and its components, since all the cases and examples proposed by the Firebase documentation used NodeJS whenever the project was aimed to be a web application. This is also related to the starting point of this application being coded in Javascript, as adapting it from Javascript to fit NodeJS was the most efficient approach.

Mainly, the use of NodeJS in this thesis is limited to the interaction with the already existing front-end of the web application to the Firebase project deployed for this thesis to act as a back-end.

On the other hand, it turned out to be useful for other tasks, like linking a function in the website with an R script (using a specific library for NodeJS), which helped to implement the data treatment necessary to provide the actual results to the user.

There were not many NodeJS libraries that could treat the data and apply the desired techniques/algorithms in a simple way directly in the web-application code, so having this new approach using NodeJS to require those external libraries made the implementation process simpler.

4.4 R

As previously stated, in order for this thesis' artefact to be able to implement the techniques analysed, a way was needed to treat the data format derived from the database and apply the appropriate algorithms/techniques to it in order to obtain the desired analytical insight.

There were several alternatives, such as doing it natively in NodeJS, but this method of running an R-script inside of the NodeJS code and directly obtaining its results simplified the coding and creation of this particular phase, since the checked literature already used R to display results from the execution of HAC and MCA in different datasets of variable structure and form of data.
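As an illustration of what such a script could look like, here is a minimal sketch of a hypothetical R script that the NodeJS back-end might invoke. This is not the thesis' actual script; the argument handling and field names are assumptions.

```r
#!/usr/bin/env Rscript
# Hypothetical clustering script invoked from the NodeJS back-end: reads the
# exported data set as JSON, computes Gower distances over the categorical
# attributes, runs HAC, and prints the cluster assignment of every entry as
# JSON for the web application to display.
library(jsonlite)
library(cluster)

args    <- commandArgs(trailingOnly = TRUE)   # args[1]: path to the exported data set
entries <- fromJSON(args[1])                  # data frame with id, categories, ...

# Treat every category as a categorical (factor) variable.
cats <- as.data.frame(lapply(entries$categories, factor))

gower <- daisy(cats, metric = "gower")
hac   <- hclust(as.dist(gower), method = "complete")
k     <- 3                                    # would be chosen via the silhouette assessment

cat(toJSON(data.frame(id = entries$id, cluster = cutree(hac, k))))
```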


4.5 HTML, CSS, JS

Knowledge of basic web programming was required for the adaptation of the web-application to interact with the back-end and to change some functionalities.

Of course, this is not a structural part of the thesis and the amount of work dedicated to actual basic web-programming was reduced, but it was still relevant for the making of the artefact and its adaptation to the studied techniques.


5 Implementation

Using the methodology discussed previously in this report, this project consisted of the construction of functionalities that give objective perspective to the recipient of a Systematic Literature Review.

To accomplish this, this study began with an already developed canvas [32] for the web framework, so the key progress has been focused on both the execution of the functions to fulfil the scope of this thesis and on optimization, via the use of a serverless back-end product and cloud functionalities [33] to alleviate the browser from computation processes.

The execution of this project was done sequentially to ensure that the code remained stable after each phase. Since it is an individual project, no agile methodologies were used, and the approach to the project's life cycle is closer to a waterfall model.

There are three main blocks regarding how the implementation of this project was approached:

1. Provision of analytical tools for the data: This block has been addressed using the Cloud Functions directly provided by the back-end structure deployed by Firebase.

This tool is used together with, as previously mentioned, the interaction between NodeJS for the strictly web-oriented programming and R scripts (via a NodeJS library) for the part of the coding more related to data treatment and calculations.

2. Structure a solid back-end: The web application must have an accessible database and a server side to be hosted on, providing the front-end with the required functionalities.

3. Optimization of resources: This part has been the most overshadowed one, as it is covered by the previous ones almost in its entirety, but changing how the web application works to optimize it has been kept in mind for the whole duration of the implementation.


In this section, the implementation of the code will be discussed progressively, covering every step in order of execution.

5.1 Initial state

The initial state of the artefact lacked a back-end: it had to be hosted somewhere to be accessible, connections with a database had to be implemented to obtain the data dynamically, and the functions to provide the web application with the studied techniques were, of course, not implemented.

5.2 Set-up of the back-end

As mentioned in Section 4, the technology used to provide the website of a back-end is Firebase, so for it to work properly for the needs of this project there is a list of steps that have to be followed:

1. Linkage: to set up the back-end for a web application in Firebase, a Firebase project must be created to host the resources that will be used as a back-end, in this project the Realtime Database and the Cloud Functions.

After that, there must be a local instance of the website, and the initialisation of the Firebase instance must be performed in the folder where the website is stored.

Through these two stages, the website is connected to a Firebase instance and has a database and a server configured for its deployment.

2. Database management: in the previous version of the website there was no database built in, so the data was stored in .json files.

To change that and increase scalability, both JSON files are stored in the Realtime Database so that they can be accessed remotely through the server and called whenever needed.

Using Firebase for this particular need greatly simplifies the code required: all of the tasks related to reading information from the database and writing new entries to it are well documented and straightforward to implement.

3. First deployment: now that every part of the back-end required for the time being is functioning properly, the code and the files created by Firebase must be deployed to Firebase so that the application runs and is accessible through a URL provided by the Firebase project.

This whole process is completed by simply running the command firebase deploy in a terminal located in the folder where the Firebase project was initialised.


5.3 Data collection

As the project developed, and given the code already present in the web application, some changes were needed in how the data was stored: both where it was stored, which has been addressed via the Realtime Database from Firebase, and the format in which the data was stored and treated afterwards.

Therefore, a new approach to the collection of the categories was developed, going from an array of answers to questions to a proper key-value structure where the key is the question and the value is the answer.

Finally, both the pre-existing data and every new entry generated by the web application follow the structure below:

authors: string containing the full name of each author, separated by ";"
categories: object containing the key-value pairs of the attributes the entry contains
first_author: string containing the last name of the first author in the authors field; it is used to calculate the id
id: string containing the fields first_author and year appended
reference: string containing the full reference to the document, for further referencing
title: string containing the name of the entry
url: string containing the URL where the document this entry references can be found
year: integer containing the year in which the document this entry references was published

If a field is not filled, for example if the URL is undefined or the relation is not defined, the field remains in the database as an empty value for greater compatibility in the data treatment portion of the functions.

Also, if a category is not filled, there will be no empty value in the categories section of the entry; this case is handled in the data-handling code and will not pose any problems in the subsequent calculations.
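
To illustrate how such gaps could be handled before the analysis, the following is a minimal R sketch, not the thesis' actual code: it normalises entries whose categories object omits some attributes into a single data frame with explicit NA values. The entry and attribute names are hypothetical.

# Illustrative sketch: normalise entries whose "categories" object omits some
# attributes into one data frame, filling the gaps with explicit NA values.
entries <- list(
  PS1 = list("Self-Property" = "Adaptation", "ML Approach" = "Bayesian Theory"),
  PS2 = list("Self-Property" = "Configuration")   # "ML Approach" missing here
)

all_attrs <- unique(unlist(lapply(entries, names)))

rows <- lapply(entries, function(e) {
  vals <- setNames(rep(NA_character_, length(all_attrs)), all_attrs)
  vals[names(e)] <- unlist(e)
  vals
})

df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = TRUE)
print(df)   # absent categories show up as NA and do not break later steps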


5.4 Scripting the application

The key aspect of this thesis' implementation is to provide the current web framework with the analytical capability needed to give the user the desired insight into the Systematic Literature Review.

As previously mentioned, both the data treatment and the application of the methodologies are performed through the JavaScript library r-script, which executes an R script containing every step the data supplied by the Realtime Database needs to go through in order to provide the web application with the data it can later present to the user.

To implement this, a three-step plan has been followed (a minimal sketch of these steps is shown after the list):

1. Data formatting: the data stored in the database is not exactly what will be used to extract the results, so to give the scripts an adequate input, the data set has to be retrieved and treated; the result of this step is what is passed to the scripts as input for the data-analysis algorithms.

2. Application of the techniques: once the input has been obtained, the R scripts corresponding to the HAC and MCA techniques receive it and run to generate the data and plot the results.

3. Result delivery: the artefact obtains the results and shows them to the user depending on what type of insight the user has selected.
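
As a rough illustration of these three steps, the sketch below is not the thesis' code: the data, column names, package choice and linkage method are assumptions made only for the example. It formats a small categorical data set, applies HAC over the Gower distance and MCA, and plots the results.

# Sketch of the three steps on a toy categorical data set.
library(cluster)      # daisy() computes the Gower distance
library(FactoMineR)   # MCA() performs Multiple Correspondence Analysis

# 1. Data formatting: a data frame of factors stands in for the formatted input.
df <- data.frame(
  SelfProperty = factor(c("Adaptation", "Adaptation", "Configuration", "Optimization")),
  Time         = factor(c("Proactive", "Reactive", "Reactive", "Reactive")),
  Model        = factor(c("Dynamic", "Dynamic", "Static", "Static"))
)

# 2. Application of the techniques.
d   <- daisy(df, metric = "gower")     # pairwise Gower dissimilarities
hc  <- hclust(d, method = "average")   # HAC (linkage chosen only for the example)
mca <- MCA(df, graph = FALSE)          # MCA on the same categorical data

# 3. Result delivery: here the results are simply plotted.
plot(hc)
plot(mca)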

5.5 Integration of the scripts in the web application

The scripts, as mentioned before, are coded in R, so a library was required to integrate them into the code: essentially, the web application runs an R shell in which the R scripts are executed.

The library found initially was not fit for the kind of scripts that were to be used, so an adaptation of the standard r-script library made by a user was used instead[34].

The output of those scripts is returned to the NodeJS code, which displays it.
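
The exact calling convention of the adapted r-script library is not reproduced here. Purely as an illustration, the sketch below assumes a hypothetical setup in which the wrapper pipes the entries to the R script as JSON on standard input and reads a JSON answer back from standard output.

# Hypothetical R-script side: read entries as JSON from stdin, cluster them,
# and write a JSON summary to stdout for the NodeJS code to pick up.
library(jsonlite)
library(cluster)

entries <- fromJSON(paste(readLines("stdin"), collapse = ""))  # data frame of attributes
entries[] <- lapply(entries, factor)                           # treat every column as categorical

d  <- daisy(entries, metric = "gower")
hc <- hclust(d, method = "average")

cat(toJSON(list(order = hc$order, heights = hc$height)))       # raw HAC output as JSON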

5.6 Display of the results

The results of executing HAC and MCA on the multi-dimensional data set produced by the Systematic Literature Review selected to test this thesis' approach are shown using R plots; they have to be interpreted by the user, as there is no objective way to display the results as definitive, 100% accurate answers.

Each plot relates to a certain kind of insight, but the data it provides is not quantitative, so the results are to be interpreted by the user.

In the Results section there are images of the plots generated by applying the techniques, covering both the execution of HAC and MCA and differentiating where the results come from, what they provide, and how they are to be interpreted.
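
As a small illustration of how such a plot could be handed to the front-end, the sketch below renders a dendrogram to a PNG file that the web application could serve; the file name and data are made up for the example and this is not the thesis' actual export mechanism.

# Sketch: render an HAC dendrogram to an image file the front-end can display.
library(cluster)

df <- data.frame(A = factor(c("x", "x", "y", "z")),
                 B = factor(c("p", "q", "q", "q")))
hc <- hclust(daisy(df, metric = "gower"), method = "average")

png("hac_dendrogram.png", width = 800, height = 600)   # hypothetical output path
plot(hc, main = "HAC dendrogram")
dev.off()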


6 Results and Analysis

The data derived from the use of the studied methodology and displayed in this section will assist us in understanding the use of these methods and what kind of perspective they will offer when applied in future studies.

As of now, we realise that this method is mostly used to gain two kinds of insight: trends contained within the entries of the multi-dimensional data set we deal with, that is, the results of a Systematic Literature Review, and trends within their attributes.

To gain this insight, and thus satisfy R2 and R3, the analysed techniques were applied to an example data set with the following structure:

Title | Self-Property | ML Approach            | Adaptation | Time
PS1   | Adaptation    | Bayesian Theory        | Behaviour  | Proactive
PS2   | Adaptation    | Reinforcement Learning | Behaviour  | Reactive
PS3   | Configuration | Fuzzy Learning         | Model      | Reactive
PS4   | Optimization  | Neural Network         | Framework  | Reactive
...

(continued: remaining attributes of the same entries)

Title | Purpose   | ML Control Method | Assessment         | Model   | Justification
PS1   | Modeling  | Non MAPE          | Experimental Study | Dynamic | Without Justification
PS2   | Reasoning | Non MAPE          | Without Assessment | Dynamic | Without Justification
PS3   | Modeling  | Non MAPE          | Experimental Study | Static  | Experiment Comparison
PS4   | Modeling  | Non MAPE          | Case Study         | Static  | Without Justification
...

Table 6.5: Table containing a snippet of the multi-dimensional data set where the studied approach was applied.
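
As an illustration only (this is not the thesis' code), a snippet like the one in Table 6.5 could be encoded for the R scripts as a data frame in which every attribute is a factor; the column names below are taken from the table.

# Table 6.5-style snippet encoded as a data frame of categorical factors.
slr <- data.frame(
  SelfProperty = c("Adaptation", "Adaptation", "Configuration", "Optimization"),
  MLApproach   = c("Bayesian Theory", "Reinforcement Learning",
                   "Fuzzy Learning", "Neural Network"),
  Adaptation   = c("Behaviour", "Behaviour", "Model", "Framework"),
  Time         = c("Proactive", "Reactive", "Reactive", "Reactive"),
  Purpose      = c("Modeling", "Reasoning", "Modeling", "Modeling"),
  Model        = c("Dynamic", "Dynamic", "Static", "Static"),
  row.names    = c("PS1", "PS2", "PS3", "PS4"),
  stringsAsFactors = TRUE
)
str(slr)   # every column is a factor, ready for the Gower distance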


Different kinds of insight have been obtained from this process and will be displayed and analyzed sequentially following the order that a user of this approach would follow:

1. Assessing the number of clusters for this object of study:

The number of clusters is determined by the Silhouette Value obtained from the execution of Hierarchical Agglomerative Clustering, where a plot (in this case, Figure 6.9) indicates the Average Silhouette Width obtained for each number of clusters.

The researcher inspects this plot to choose the number of clusters with the highest Average Silhouette Value while keeping a sensible number of clusters, which means ignoring values equal to the number of entries as well as values close to one.

As a result, since in this case the Average Silhouette Value for 10 and 11 clusters is equal to the one for 1 cluster (that is, all the entries grouped in a single cluster), a value of 10 clusters will be preferred to work with, since it is a local maximum.

It is obvious that the Silhouette Value for k=10 and k=11 is the same, but following the rationale presented above we prefer 10 clusters over a larger number of clusters, since the latter would be closer to the actual number of entries.

Figure 6.9: In this plot we can observe the result of plotting the previous table's data. Each point corresponds to the Average Silhouette Width given the number of clusters specified on the x axis.
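
A minimal sketch of how such a plot could be produced, using toy data in place of the full SLR data set (the data and the linkage method are assumptions made for the example):

# Sketch: Average Silhouette Width for a range of candidate cluster counts.
library(cluster)

df <- data.frame(
  SelfProperty = factor(c("Adaptation", "Adaptation", "Configuration",
                          "Optimization", "Adaptation", "Configuration")),
  Time         = factor(c("Proactive", "Reactive", "Reactive",
                          "Reactive", "Proactive", "Proactive")),
  Model        = factor(c("Dynamic", "Dynamic", "Static",
                          "Static", "Static", "Dynamic"))
)

d  <- daisy(df, metric = "gower")
hc <- hclust(d, method = "average")

ks        <- 2:(nrow(df) - 1)
avg_width <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})

plot(ks, avg_width, type = "b",
     xlab = "Number of clusters", ylab = "Average Silhouette Width")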


2. Displaying the clustering and understanding its results:

Now that we have established that the proper approach is to use k=10 for clustering, we can apply it to obtain what is shown in Figure 6.10, where we can see which entries belong to each cluster.

This will help us consider the distribution of the entries, as seen in Figure 6.11.

This can also be used to get an overview of the frequency of each value in each cluster; see Figure 6.12.

(a) Clustering applied to the tree generated by HAC

(b) The dendrogram is given a circular shape to increase clarity

Figure 6.10: To display the results of Hierarchical Agglomerative Clustering, a dendrogram with colours is used to show the user which entries from the data set are in which cluster and therefore share similarity with the other entries in the same cluster.
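
The thesis' figures use coloured branches and a circular layout; as a simpler illustration with made-up data, base R can highlight the k chosen clusters on the dendrogram with rect.hclust, as sketched below.

# Sketch: draw the HAC dendrogram and box the k chosen clusters.
library(cluster)

df <- data.frame(A = factor(c("x", "x", "y", "z", "y")),
                 B = factor(c("p", "q", "q", "q", "p")))
hc <- hclust(daisy(df, metric = "gower"), method = "average")

plot(hc, main = "HAC dendrogram with clusters highlighted")
rect.hclust(hc, k = 3, border = 2:4)   # one coloured box per cluster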


(a) This dendrogram shows the distribution of the values of the attribute Model

(b) The dendrogram shows the distribution of the values of the attribute Justification

Figure 6.11: With the same data obtained from the execution of Hierarchical Agglomerative Clustering we can highlight the entries that have a certain value for a given attribute. This gives us information about the distribution of that variable across the clusters and therefore shows whether a pattern exists, that is, whether this attribute is relevant for grouping a certain set of entries.

Figure 6.12: Using Hierarchical Agglomerative Clustering we can also show a heatmap of the distribution of the values in each cluster. This way we get the percentage of incidence of a given value of an attribute in each cluster, which lets us understand how relevant a certain value of an attribute has been in forming that cluster.

Among the patterns we see in the clusters, for example in cluster 5, 100% of the entries with architecture as a concern object for adaptation use reinforcement learning as the machine learning approach, and entries that evaluate the proposed machine learning technique with an experimental study have a higher chance of using a proactive time for adaptation.
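
A sketch of how such an incidence heatmap could be derived from the cluster assignments; the attribute names and data are invented for the example, and the thesis' actual heatmap may be built differently.

# Sketch: percentage of entries in each cluster holding each value of one attribute.
library(cluster)

df <- data.frame(
  Approach = factor(c("Reinforcement Learning", "Reinforcement Learning",
                      "Neural Network", "Fuzzy Learning",
                      "Neural Network", "Fuzzy Learning")),
  Time     = factor(c("Proactive", "Reactive", "Reactive",
                      "Reactive", "Proactive", "Proactive"))
)

d        <- daisy(df, metric = "gower")
clusters <- cutree(hclust(d, method = "average"), k = 3)

incidence <- 100 * prop.table(table(clusters, df$Approach), margin = 1)
heatmap(unclass(incidence),            # drop the "table" class to get a plain numeric matrix
        Rowv = NA, Colv = NA, scale = "none",
        xlab = "ML approach", ylab = "Cluster")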

References
