
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Comparison study on graph sampling algorithms for interactive visualizations of large-scale networks

ALEKSANDRA VOROSHILOVA


Comparison study on graph sampling algorithms for interactive visualizations of large-scale networks

Aleksandra Voroshilova

Master’s Thesis, 2019-06-20

Stockholm, Sweden

Examiner

Mihhail Matskin

KTH Royal Institute of Technology

Supervisor

Tino Weinkauf

KTH Royal Institute of Technology

Industry Supervisor


Abstract

Networks are present in computer science, sociology, biology, and neuroscience, as well as in applied fields such as transportation, communication, and the medical industry. The growing volume of collected data pushes the scalability and performance requirements of graph algorithms, and, at the same time, a need arises for a deeper understanding of these structures through visualization. Network diagrams or graph drawings can facilitate the understanding of data, making it intuitive to identify the largest clusters, the number of connected components, and the overall structure, and to detect anomalies, which is not achievable through textual or matrix representations. The aim of this study was to evaluate approaches that would enable the visualization of large-scale peer-to-peer live video streaming networks. The visualization of such large-scale graphs has technical limitations, which can be overcome by filtering the important structural data out of the networks. In this study, four sampling algorithms for graph reduction were applied to large peer-to-peer overlay network graphs and compared. The four algorithms cover different approaches: selecting links with the highest weight, selecting nodes with the highest cumulative weight, using betweenness centrality metrics, and constructing a focus-based tree. Through the evaluation process, it was discovered that the algorithm based on betweenness centrality approximation offers the best results. Finally, for each of the algorithms in the comparison, the resulting sampled graphs were visualized using a force-directed layout with a 2-step loading approach to depict the algorithms' effect on the representation of the graphs.

Keywords


Abstract (Swedish, translated)

Networks are found in computer science, sociology, biology, and neuroscience, as well as in applied fields such as transportation, communication, and the medical industry. The growing volume of data collection puts pressure on the scalability and performance requirements of graph algorithms, while at the same time a need arises for a deeper understanding of these structures through visualization. Network diagrams or graph drawings can facilitate the understanding of data, identify the largest clusters and the number of connected components, show the overall structure, and reveal anomalies, which cannot be achieved with textual or matrix representations. The aim of this study was to evaluate approaches that could enable the visualization of a large-scale peer-to-peer (P2P) live streaming network. The visualization of larger graphs has technical limitations, which can be overcome by extracting important structural data from the networks. In this study, four sampling algorithms for graph reduction were applied to large P2P overlay network graphs and then compared. The four algorithms are based on selecting links with the highest weight, selecting nodes with the highest cumulative weight, using betweenness centrality values, and constructing a focus-based tree with the longest paths excluded. During the evaluation process, it was discovered that the algorithm based on betweenness centrality approximation showed the best results. In addition, for each algorithm in the comparison, the resulting sampled graphs were visualized using a force-directed layout with a 2-step loading approach.

Keywords


Acknowledgements

I would like to thank my supervisors: Tino Weinkauf, for guiding the process, giving advice on the next steps, and keeping it always professional and fun; and Alexandros Gkogkas, for giving me the opportunity to start this interesting study, sharing his knowledge, and helping with everything I needed to successfully finish it. I thank Mihhail Matskin for examining the work and giving the final feedback. I also owe gratitude to the Hive Streaming team, who made working there very pleasant.


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Goal
  1.5 Benefits, Ethics and Sustainability
  1.6 Methodology
  1.7 Stakeholders
  1.8 Delimitations
  1.9 Outline

2 Background
  2.1 Network theory
  2.2 Graph visualization
    2.2.1 Graph layout
    2.2.2 Edge bundling
    2.2.3 Reduction and Clustering
  2.3 Related Work
  2.4 Sampling algorithms
    2.4.1 Simple random sample (SRS2)
    2.4.2 Weighted Independence Sampling (WIS)
    2.4.3 Edge Filtering based on betweenness centrality (BC)
    2.4.4 Focus-based Filtering (FF)

3 Comparative study design
  3.1 Research Process and Paradigm
  3.2 Experimental design
  3.3 Planned data analysis
  3.4 Test data collection
    3.4.1 Degree distribution
    3.4.2 Reduction size
  3.5 Test environment
  3.6 Assessing reliability and validity of the data collected
  3.7 Evaluation framework

4 Implementation
  4.1 Graph libraries
  4.2 Graph-tool graph library
  4.3 NetworkX graph library
  4.4 Graph sampling implementation
  4.5 Performance
  4.6 Visualization
  4.7 Implementation

5 Analysis and evaluation of results
  5.1 Algorithms performance comparison
    5.1.1 Hypothesis 1
    5.1.2 Hypothesis 2
    5.1.3 Hypothesis 3
    5.1.4 Hypothesis 4
    5.1.5 Hypothesis 5
    5.1.6 Hypothesis 6
    5.1.7 Conclusions
  5.2 Visualization results
    5.2.1 Test graphs visualization
    5.2.2 Overlay network visualization


Chapter 1

Introduction

Growing amounts of data are constantly generated, collected, analyzed, and stored for later use. As the data collection rate increases, data processing algorithms are pushed towards optimization for size and complexity. Furthermore, a need emerges for a deeper understanding of the underlying data structures, such as complex graphs. A graph data structure is a mathematical representation of a network, consisting of data points and the relationships among them. Common examples are social networks, neural networks, computer networks, traffic networks, and many others [1].

Networks of web pages are examples of tightly interconnected graphs, where the pages are the nodes and the links among them are directed edges. As more pages are added, the networks increase in size. For instance, as of the date of this paper, Wikipedia contains around 47 million pages in 293 languages, and 5.8 million interconnected articles in the English language alone [2]. Visualizing such a graph is a challenge from the computational, layout, and information visualization points of view.

Wikipedia is not the largest example of a network. According to Google, the World Wide Web consists of trillions of interconnected indexed pages. Large networks are also present in other sciences. For example, the neural network of the human brain includes 86 billion interconnected neurons [3].


It is hard to draw conclusions about the properties of a network by looking at a text file. A visual representation of a graph is easier to comprehend. The graph structure, connected components, and clusters can be determined with a quick glance; furthermore, it is also possible to detect anomalies and properties by examining the visualization thoroughly.

Technical and visual limitations are the two main challenges in the area of large graph visualization. Most graphics engines have a ceiling on the size of a graph they are able to render before running out of memory and computational capacity. The visual challenge is to lay out the graph data in a coordinate space such that the structure is reflected in the best way, and the user is able to understand the properties of the graph by looking at the visualization.

To tackle the technical limitations, it is necessary to reduce the size of the data structure. The emphasis of this work is on keeping the structural information while reducing the original graphs by applying a number of sampling algorithms in a resource-effective way. The study focuses on a comparison of the algorithms based on scalability and structural information retention. Additionally, the resulting sampled graphs are visualized using a web-based graphics engine.

1.1

Background

One example of a computer network is a peer-to-peer overlay: a computer system in which computers act as ”peers” connected among each other, forming an overlay network. The computers use their resources to execute tasks and utilize network bandwidth for sharing data. In a client-server architecture, the server acts as a supplier and the client as a consumer. Peer-to-peer nodes, conversely, can act both as suppliers and consumers.


Hive Streaming builds scalable, peer-to-peer (P2P) content distribution solutions with a focus on live video streaming. The company has millions of installed agents that facilitate the distribution of thousands of data streams daily. To track the usage and performance of a P2P distribution network a vast amount of data is collected and then distilled into insightful analytics and interactive data visualizations with exploratory capabilities.

Enterprise networks are usually arranged into LANs, grouped into sites. Hive Streaming solutions utilize them to distribute data among peers, without having to send the data to a separate server to be consumed by clients. Instead, the nodes spread the data via the internal network. Hive Streaming solutions are mostly used for live video streaming; they significantly increase streaming performance and ease the network load.

1.2

Problem

The peer-to-peer overlay network is a scale-free dynamic network. Nodes are added and removed as computers join or leave the video stream. These networks can reach a size of hundreds of thousands of edges and tens of thousands of nodes. Graphs of this size are not supported by most web-based libraries. Moreover, visualizing such a graph would lead to an unreadable, space-filling jumble of edge crossings. To tackle this visualization challenge, various graph pre-processing and clustering techniques can be used to reduce a graph and visualize a simplified version of it.


1.3

Purpose

The purpose of this study is to investigate the state of the art and present a solution fitting the requirements of an interactive and large scale visualization of a peer-to-peer distribution network.

1.4

Goal

The goal of this study is to visualize the overlay structure of the network using graph pre-processing techniques while retaining important links and the overall structure. This thesis intends to produce an implementation of compared algorithms and experimentally evaluate them. Moreover, it aims to provide suggestions for a visualization technique of the resulting sampled graphs running inside a browser, scaling up to hundreds of thousands of edges.

1.5

Benefits, Ethics and Sustainability

The study focuses on reducing the data so that insightful analytics can be shown while operating on less data, which results in decreased resource usage. The outcomes can be used in development processes for a quick overview of the network, to detect anomalies, and to get a better understanding of the structure. Moreover, customers can visually comprehend the overlay networks and evaluate the performance of the peer-to-peer distribution without having to dig into technical logs.


1.6

Methodology

The study uses the applied research method. It starts with a literature study, for which the inductive approach is used to collect state-of-the-art solutions and determine the algorithms to be implemented. A comparative study is then carried out on the chosen algorithms, and they are evaluated. [4]

1.7

Stakeholders

The stakeholders of this study are the Hive Streaming company and its customers. A visual representation of a peer-to-peer distribution network is useful for the evaluation and analysis of the dynamics and structure of the formed overlay networks, and can lead to improvements in the peer-to-peer distribution algorithm. While the basic representation of a network graph is stored in a text format, having it represented visually gives a quick and informative overview of the structure. As for the customers, they get to see the internals of the video stream distribution in a user-friendly way. Moreover, the visualization reveals the underlying structure of a network. Many companies do not have up-to-date information about their internal network structure, making such information very valuable. Therefore, among the benefits stated above, they can also use it to gain insights into their network structure.

1.8

Delimitations


1.9

Outline

The necessary network theory fundamentals and general graph visualization techniques are explained in Chapter 2, as well as graph reduction as an approach to visualization and the algorithms used for the comparative study.


Chapter 2

Background

To cover the ground knowledge required to understand the work, an introduction into the theory fundamentals is presented in section 2.1. The information on the network theory comes from the fundamental book by Mark Newman ”Networks: An Introduction” [5].

The graph visualization part of this chapter (Section 2.2) explains the required steps for visualizing a graph and common approaches for working with large graphs. Section 2.3 covers related work on the graph reduction and, finally, in Section 2.4 the four graph sampling algorithms used for the comparative study are presented.

2.1

Network theory


Network theory is an area of research in biology, physics, the social sciences, economics, and other sciences. It studies network structures and focuses on network analysis and optimization. Some examples of large-scale networks are biological protein chains, social networks, traffic connection networks, and the World Wide Web as the largest computer network in existence. Depending on the field, scientific interest focuses on different areas. In the social sciences, for instance, the focal point is the dynamics of relationships between social entities. Such research evaluates the connectivity density of social groups and assesses the probability that members of one group relate to members of another group. In the case of traffic regulation, the aim is to design the shortest routes and distinguish large hubs. In the biological field, the network structure is used to model interaction patterns between appropriate biological elements, such as biochemical networks, metabolic networks, protein-protein interaction networks, neural networks, and many others. [5]

Graph definition

A network can be described as a collection of data points with relationships encoded as links. In computer science, networks are represented as graphs: structures consisting of a set of objects with pairwise relations between them. Graphs are studied thoroughly in the mathematical sub-field of graph theory and are formally represented as

G = (V, E)

where V is a collection of data points, called vertices or nodes, and E ⊆ {(x, y) | x, y ∈ V ∧ x ≠ y} is the set of relationships between vertices, referred to as links or edges. This mathematical representation enables the analysis of networks by applying various graph measures and metrics [5].

Types of graphs


Otherwise, if no multiedges are present and the graph does not contain any self-loops (edges whose source and destination are the same node), it is called a simple graph.

Knowing whether a graph falls under a specific category is important, since certain graph properties can facilitate analysis and visualization. The basic types of graphs are a ring, a tree, and a complete graph (in which there exists a link from each node to every other node). Another network type is a small-world network, where the neighbors of any given node are likely to be reachable in a small number of hops. Word co-occurrence networks [6] and brain neuron networks [7] are examples of small-world networks. Power-law networks, also called scale-free networks, follow a power-law degree distribution: there are a few nodes with a large degree, called hubs, followed by a larger number of smaller hubs. The lower the degree, the more nodes of that degree are present. Social networks are one example of scale-free networks.

Adjacency matrix

A mathematical representation of a graph is an adjacency matrix A, such that Aij is 1 if there exists an edge between i and j, and 0 otherwise.

    A = | 0 1 1 |
        | 1 0 1 |
        | 0 0 1 |

The matrix above represents a graph with three nodes and five edges, one of which is a self-loop.
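The construction above can be sketched in plain Python (a hypothetical helper, not code from this thesis), reading the matrix as directed so that each of the five 1-entries corresponds to one edge:

```python
def adjacency_matrix(n, edges):
    """Return the n x n matrix A where A[i][j] = 1 iff there is an edge i -> j."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A

# Three nodes and five edges, one of which is a self-loop (node 2 to itself).
A = adjacency_matrix(3, [(0, 1), (0, 2), (1, 0), (1, 2), (2, 2)])
```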

A network can be weighted, with weights representing the strength of the corresponding edges. For instance, on the Internet, a link weight can represent the amount of traffic transferred between nodes [8]. In social networks, it can represent the strength of a relationship or the frequency of communication.


Shortest path

A geodesic path (or the shortest path) is a path between two vertices in a graph such that no shorter path exists. In weighted graphs, the sum of weights of its constituent edges is minimized to acquire the shortest path.
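The weighted case can be illustrated with a minimal Dijkstra sketch in plain Python (illustrative only; helper names are hypothetical, and the thesis itself relies on graph libraries):

```python
import heapq

def dijkstra(adj, s, t):
    """Weighted shortest (geodesic) path: minimize the sum of edge weights.
    adj maps each node to a {neighbor: weight} dict."""
    dist = {s: 0}
    prev = {}
    heap = [(0, s)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == t:
            break
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Reconstruct the path by walking the predecessor chain back to s.
    path, u = [t], t
    while u != s:
        u = prev[u]
        path.append(u)
    return dist[t], path[::-1]

# Going a -> b -> c (total weight 2) beats the direct a -> c edge (weight 4).
adj = {"a": {"b": 1, "c": 4}, "b": {"a": 1, "c": 1}, "c": {"a": 4, "b": 1}}
```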

Centrality metrics

There are several measures of centrality, which represent the relative importance of vertices and edges based on various parameters, chosen depending on the use case. The common ones are eigenvector centrality, closeness centrality, Katz centrality, page rank, and betweenness centrality.

Betweenness centrality is calculated using shortest-path metrics. The betweenness centrality of a node measures the extent to which the node lies on shortest paths between other nodes. For instance, in a star graph, the center node has the highest betweenness centrality. Formally,

BC(v) = Σ_{s ≠ v ≠ t} σst(v) / σst,  s, t, v ∈ V

where σst is the number of shortest paths between s and t, and σst(v) counts only those containing v. The betweenness centrality of an edge is calculated similarly, by measuring how many shortest paths of the graph contain the given edge.
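The definition can be illustrated with a brute-force computation, feasible only for tiny graphs (the thesis uses an approximation for real networks; the helper names below are hypothetical). The sketch sums σst(v)/σst over ordered pairs, so in a star graph with three leaves the center scores 6:

```python
from collections import deque
from itertools import permutations

def all_shortest_paths(adj, s, t):
    """Enumerate all shortest s-t paths: BFS distances, then backtracking
    along edges that increase the distance by exactly one."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if t not in dist:
        return []
    paths = []

    def walk(path):
        u = path[-1]
        if u == t:
            paths.append(list(path))
            return
        for v in adj[u]:
            if dist.get(v) == dist[u] + 1:
                walk(path + [v])

    walk([s])
    return paths

def betweenness(adj, v):
    """BC(v): sum over ordered pairs s != v != t of sigma_st(v) / sigma_st."""
    total = 0.0
    for s, t in permutations(adj, 2):
        if v in (s, t):
            continue
        paths = all_shortest_paths(adj, s, t)
        if paths:
            total += sum(v in p for p in paths) / len(paths)
    return total
```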

Clique


Transitivity

Another important property is transitivity. A relation ”∘” is transitive if a ∘ b and b ∘ c together imply a ∘ c. In networks, this means that if there is an edge between a and b and an edge between b and c, there is also an edge between a and c, resulting in a clique. Perfect transitivity is achieved in a complete graph, a graph in which all vertices are connected to one another.

Clustering coefficient

Partial transitivity is used to calculate the clustering coefficient of a network: the fraction of paths of length two in the network that belong to a clique.

C(G) = (number of closed paths of length two) / (number of paths of length two)

or, alternatively,

C(G) = (number of triangles) × 6 / (number of paths of length two)

C(G) = (number of triangles) × 3 / (number of connected triples)

The clustering coefficient represents the degree to which the network is clustered and, as the formulas above show, measures the frequency of loops of length three in the network. One application is estimating the probability that two random vertices u and v are connected:

P(u, v) = C(G) / n,  u, v ∈ V

where C(G) is the clustering coefficient and n is the total number of vertices.
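The triangle-based formula can be checked with a short plain-Python sketch (illustrative, not code from the thesis). Counting closed paths of length two centered at each vertex counts every triangle three times, which matches the ×3 formula over connected triples:

```python
from itertools import combinations

def global_clustering(adj):
    """Global clustering coefficient: closed paths of length two divided by
    all paths of length two. adj maps each node to a set of neighbors."""
    closed = 0
    total = 0
    for v in adj:
        for u, w in combinations(sorted(adj[v]), 2):  # paths u-v-w centered at v
            total += 1
            if u in adj[w]:
                closed += 1  # the path u-v-w is closed by the edge u-w
    return closed / total if total else 0.0
```

A triangle is perfectly clustered (C = 1.0), while a path of three nodes has one open triple and no triangles (C = 0.0).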

Assortativity


Assortativity is the tendency of nodes to connect to other nodes that are similar to them. In social networks, people of the same age, sex, race, language, cultural background, or geographical location tend to belong to the same group. In web networks, pages in the same language tend to be linked. In computer networks, computers belonging to the same LANs tend to be connected.

Multiple characteristics can be used to calculate assortativity. Discrete and scalar characteristics shall be considered separately. Discrete characteristics are limited sets of values, such as sex, race, language, or geographical location. Scalar characteristics have an order, and thus even if two values are not exactly equal, their difference can be measured. Examples of such characteristics in social networks are age and income.

2.2

Graph visualization

The area of graph visualization consists of different aspects such as graph layout, clustering algorithms, reduction algorithms, edge drawing, and others. Various algorithms and metrics from network theory can be applied to, for instance, find shortest paths, assign page ranks, and determine clusters all of which can be visualized, depending on the information aimed to be communicated.

In order to visualize the graph from a collection of data in a text format, layout algorithms are applied to allocate coordinates to each node in such a way that the structural information is communicated efficiently, and it is easy for the user to perceive the topological structure of a graph. That implies minimizing the number of edge crossings and occlusions and allocating optimal distance between nodes. Once the coordinates are assigned, the graph can be rendered using different graphics engines depending on system requirements.

2.2.1

Graph layout


To lay out a graph in a coordinate space, vertices have to be distributed in a given frame such that edge crossings are minimized, reflecting symmetry and graph structure while maintaining acceptable computational performance.

Planar graphs are the simplest case since, by definition, they can be embedded in a plane, meaning that it is possible to allocate coordinates in a 2D space without any edge crossings. Large power-law graphs are non-planar; therefore, layout algorithms need to be applied to produce a good layout.

One common approach is to use a force-directed placement [9]. The algorithm assigns physical forces to nodes and edges. The distribution of forces can vary depending on the implementation details. In a nutshell, nodes that are connected by an edge are pulled towards each other, while all nodes are pushed away from one another. There are implementation variants of this method [9, 10].
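A minimal force-directed sketch in plain Python (hypothetical constants; real implementations such as Fruchterman-Reingold add cooling schedules and smarter repulsion). Connected nodes attract, all node pairs repel, and positions move a small damped step per iteration:

```python
import math
import random

def force_directed(nodes, edges, iters=100, k=0.1, seed=0):
    """Toy force-directed layout: returns {node: [x, y]} positions."""
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    for _ in range(iters):
        disp = {n: [0.0, 0.0] for n in nodes}
        for a in nodes:                      # repulsion between every node pair
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += dx / d * f
                disp[a][1] += dy / d * f
        for a, b in edges:                   # attraction along edges only
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[a][0] -= dx / d * f
            disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f
            disp[b][1] += dy / d * f
        for n in nodes:                      # damped displacement step
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            step = min(d, 0.01)
            pos[n][0] += dx / d * step
            pos[n][1] += dy / d * step
    return pos
```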

The force-directed approach has several advantages, as evaluated by many papers: simplicity, good quality of results, intuitive design, and customizability. The strength of the forces can be adjusted to proportionally assign lengths to the edges. The known drawback of the basic algorithm is its long running time of O(n³). There exist extensions of the algorithm that improve performance [11].

An alternative algorithm is the spectral layout, which is based on the Laplacian matrix of a graph. The eigenvectors corresponding to the two largest eigenvalues are used to set the locations of the nodes in a 2D plane: the values of the first eigenvector are used as X-coordinates and the values of the second eigenvector as Y-coordinates. [12]


2.2.2

Edge bundling

Densely connected graphs can look very cluttered, with links filling the entire background in the worst cases. Geometric edge bundling attempts to address this problem by bending edges towards each other: similar edges are drawn together, creating empty white space in between and thus reducing visual complexity. Although the visual complexity is reduced with this approach, the computational complexity increases: not only does the graph have to be visualized, but new positions for each link also have to be calculated. [14]

2.2.3

Reduction and Clustering

To achieve higher information comprehension for the user and to tackle the technical limitations, large-scale graphs need to be reduced. Two common approaches used in the literature are graph clustering and graph sampling, also referred to as filtering. A combination of both approaches can also be used, so that a graph is both reduced and clustered.

The graph sampling approach addresses the problem of large graph size by removing edges or nodes. In contrast, clustering algorithms group graph data based on structural properties and create levels of hierarchy, which can then be used to visualize only these higher-level groupings instead of displaying all nodes and links.

2.3

Related Work


Graph sampling algorithms can be stochastic or deterministic. Leskovec and Faloutsos have made a comparative study of stochastic algorithms [15], evaluating sampling results from graphs with zero-weight (unweighted) edges. Due to randomness, stochastic algorithms may produce different results when applied with the same parameters.

Deterministic filtering, on the other hand, guarantees the same result after each run. Lee [16] and Boutin [17] use tree reconstruction methods to derive an approximate graph from the given one. Another approach [18] is to use betweenness centrality as a metric for selecting the edges to keep.

Apart from filtering, as stated above, a clustering approach is common to represent groups of nodes. There are many algorithms for grouping nodes into clusters [19]. Examples of such approaches include selecting nodes by authority metric [20], removing weak edges and thus grouping well-connected components [21], or selecting groups of nodes by their shortest path distance from selected nodes [22].

Since the graph data used in this study already consists of several layers of groups of clusters, created from the real network characteristics, it would be redundant to apply clustering algorithms. Because the graphs are very densely connected, having hundreds of thousands of edges and tens of thousands of vertices, the filtering (or sampling) approach was chosen instead.

2.4

Sampling algorithms


2.4.1

Simple random sample (SRS2)

The ”Simple Random Sample” sampling algorithm is presented in the paper ”Effectively Visualizing Large Networks Through Sampling” by Davood Rafiei [23]. A simple random sample of the edges is taken from an unweighted graph, and only vertices that are incident to those sampled edges are kept.

For a weighted graph, the same procedure is followed, with the difference that the selection criterion is the weight of an edge. The edges with the highest weight are kept, so nodes with the largest number of heavy-weight links end up in the sample. The algorithm does not guarantee connectivity. It is one of the most intuitive approaches to removing edges, and it is therefore used as a baseline for comparison.

Given the number of edges to keep in the sampled graph, the procedure consists of the following two steps:

1. Sort
   • Sort the graph edges by weight.

2. Reduce
   • Remove the edges with the lowest weight until the number of edges is reduced to the given threshold. Also remove nodes that are left without any edges.
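The two steps can be sketched in a few lines of plain Python (a hypothetical helper, not the thesis implementation):

```python
def srs2(edges, threshold):
    """Weighted SRS sketch: keep the `threshold` heaviest edges; nodes left
    without any edge are dropped implicitly. edges: list of (u, v, weight)."""
    kept = sorted(edges, key=lambda e: e[2], reverse=True)[:threshold]
    nodes = {u for u, _, _ in kept} | {v for _, v, _ in kept}
    return kept, nodes

edges = [("a", "b", 9), ("b", "c", 4), ("c", "d", 1), ("d", "a", 7)]
kept, nodes = srs2(edges, 2)  # keeps the two heaviest edges; "c" is dropped
```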

2.4.2

Weighted Independence Sampling (WIS)

The Weighted Independence Sampling algorithm is described in ”Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks” [24]. For a graph G, let w(u, v) be the weight of an edge (u, v) ∈ E. The weight of a node u ∈ V is

w(u) = Σ_{v ∈ N(u)} w(u, v)

where N(u) is the set of neighbors of u. Each node v ∈ V is assigned a probability proportional to its weight:

π(v) = w(v) / Σ_{u ∈ V} w(u)

Nodes are then sampled with replacement, independently at random, based on the calculated probabilities. Since the algorithm samples nodes, the output is based on a node filtering threshold rather than an edge threshold. Given the node threshold k, the algorithm is as follows:

1. Calculate vertex probabilities.
   • Assign to each node a probability π of being selected.

2. Select nodes.
   • Select k nodes at random with probability π.

3. Sample graph.
   • The sampled graph consists of the k selected nodes and all the links existing between those nodes in the original graph.
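A plain-Python sketch of these steps (hypothetical helper; the thesis implementation uses graph libraries). Node weights are accumulated from incident edges, nodes are drawn with replacement proportionally to weight, and the induced edges are kept:

```python
import random

def wis(edges, k, seed=0):
    """WIS sketch: sample k nodes (with replacement) proportionally to their
    cumulative incident edge weight; keep the induced subgraph's edges."""
    w = {}
    for u, v, wt in edges:
        w[u] = w.get(u, 0) + wt
        w[v] = w.get(v, 0) + wt
    nodes = sorted(w)
    total = sum(w.values())
    rng = random.Random(seed)
    sampled = set(rng.choices(nodes, weights=[w[n] / total for n in nodes], k=k))
    kept = [(u, v, wt) for u, v, wt in edges if u in sampled and v in sampled]
    return sampled, kept
```

Because sampling is with replacement, the result may contain fewer than k distinct nodes.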

2.4.3

Edge Filtering based on betweenness centrality (BC)

This algorithm was introduced in the publication ”On the Visualization of Social and other Scale-Free Networks” by Jia, Hoberock, Garland, and Hart [18]. It is specifically designed for visualizing scale-free networks and is based on the betweenness centrality metric.

The algorithm consists of two steps: reduction and post-processing.

1. Reduction
   • Calculate the betweenness centrality of each edge in G.
   • If a node has at least 3 edges, remove an edge.

2. Post-processing (executed only if the resulting reduced graph is disconnected)
   • Take the collection of edges removed in the reduction step and sort it by betweenness centrality in decreasing order.
   • Starting with the edges with the highest betweenness centrality, add back the edges that reconnect the disconnected components until the graph is connected.

One detail of the algorithm is that an edge cannot be removed if one of its vertices has fewer than 2 edges.

In addition, to optimize performance, an approximation of the betweenness centrality metric is applied instead of the full-graph betweenness centrality calculation.
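The reduction step can be sketched as follows (plain Python, hypothetical helper; the edge centrality values are assumed to be precomputed, e.g. by an approximation algorithm as in the paper):

```python
from collections import defaultdict

def reduce_by_centrality(edges, bc, target):
    """Reduction-step sketch: drop edges with the lowest betweenness
    centrality first, never removing an edge if one of its endpoints has
    fewer than 2 remaining edges. bc maps each edge to a centrality value."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    kept = set(edges)
    for e in sorted(edges, key=lambda e: bc[e]):
        if len(kept) <= target:
            break
        u, v = e
        if deg[u] >= 2 and deg[v] >= 2:
            kept.discard(e)
            deg[u] -= 1
            deg[v] -= 1
    return kept
```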

2.4.4

Focus-based Filtering (FF)

The Focus-based Filtering algorithm is described in the paper ”Focus-based filtering + clustering technique for power-law networks with small world phenomenon” by Boutin, Francois and Thievre [17]. The algorithm is designed for networks with a power-law degree distribution and ultimately produces a connected graph. It uses the shortest-path metric. The algorithm begins by selecting a root node called the filtering focus. It first builds a tree from the edges of the original graph, and then adds further edges back, with the longest paths excluded, until a threshold is reached.

1. Select the filtering focus node V1.
   • The root is the node with the highest cumulative weight of its edges.

2. Take the node Vn+1 that is connected to any node in Vn and has the highest degree.
   • Find all neighbors of Vn.
   • Connect it by selecting an edge to the node with the highest degree in Vn.
   The output of this step is a tree.

3. Dense component extraction.
   • Calculate the shortest-path distances in the tree between all pairs of nodes that are connected by an edge in the graph G.
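The tree-construction phase can be sketched in plain Python (a simplified, hypothetical rendering of the first two steps above, not the paper's implementation; ties are broken by node name for determinism):

```python
from collections import defaultdict

def focus_tree(edges):
    """Build a spanning tree from the filtering focus (the node with the
    highest cumulative edge weight), repeatedly attaching the highest-degree
    neighbor of the current tree. edges: list of (u, v, weight)."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    root = max(adj, key=lambda n: (sum(adj[n].values()), str(n)))
    in_tree = {root}
    tree = []
    while len(in_tree) < len(adj):
        candidates = {v for u in in_tree for v in adj[u] if v not in in_tree}
        if not candidates:
            break  # remaining nodes are unreachable from the focus
        nxt = max(candidates, key=lambda n: (len(adj[n]), str(n)))
        parent = max((u for u in in_tree if nxt in adj[u]),
                     key=lambda n: (len(adj[n]), str(n)))
        tree.append((parent, nxt))
        in_tree.add(nxt)
    return root, tree
```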


Chapter 3

Comparative study design

The following sections describe the process used to answer the research question, followed by a description of the experiment design, the planned analysis structure, and the characteristics of the test data and test environment.

3.1

Research Process and Paradigm

The applied research method is used to determine the appropriate algorithms for graph reduction. The graph sampling algorithms are run on several datasets, and the quantitative results are collected from the reduced graphs.

3.2

Experimental design

The implementations of the four algorithms produce different reduced graphs as output. To evaluate and compare them, different graph measurements are taken. These measurements are formulated into hypotheses, appropriate tests are executed, and the hypotheses are then confirmed or rejected.


Each of the four algorithms is run on the three real-life datasets with nineteen different thresholds, which corresponds to 228 reduced graphs.

3.3

Planned data analysis

The algorithms are producing sampled graphs that are smaller in size and may have different structural characteristics. To compare the produced graphs, their properties are measured and evaluated. Depending on the requirements on the expected outcome, different algorithms can be chosen for the end visualization implementation.

The goal of this study is to visualize a connected peer-to-peer network, giving an overview of the structure of the network. In this case, maintaining the structure means keeping the graph connected and preserving the dense clusters. A heavy reduction can lead to a complete loss of structure; for instance, components that are connected in the original graph can appear disconnected. This should be avoided: a balance must be maintained between reducing the graph to a more comprehensible size and keeping the connectivity and clustering information.

3.4 Test data collection

The real-life datasets are provided by the stakeholders and represent a snapshot of a peer-to-peer overlay network at a particular moment in time when the video stream was active. The test datasets are compiled from the reported usage during a video stream, where a node is a viewer and a link is a data connection with another viewer or with the source of the video stream. The graphs are weighted; the weight represents the amount of data in bytes transferred between two nodes.

The nodes are connected to each other, representing one connected component. Additionally, one node in the network is the source of the video stream.

Since even the reduced graphs are large, the tests are also run on sample graphs to ensure the correctness of the algorithms. The sample graphs are generated using a relaxed caveman graph model. A caveman graph consists of a number of groups, each a clique of size k. The graph is relaxed, meaning that edges are rewired with a given probability. [25]
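A relaxed caveman generator can be sketched in a few lines of plain Python; this is an illustrative sketch of the model described above (NetworkX also ships such a generator), not the exact generator used to produce the test graphs.

```python
import random

def relaxed_caveman(n_cliques, k, p, seed=0):
    """Build n_cliques cliques of size k, then rewire one endpoint of each
    edge to a random node with probability p (a relaxed caveman graph)."""
    rng = random.Random(seed)
    edges = set()
    for c in range(n_cliques):
        base = c * k
        for i in range(k):
            for j in range(i + 1, k):
                edges.add((base + i, base + j))
    rewired = set()
    for u, v in edges:
        if rng.random() < p:
            v = rng.randrange(n_cliques * k)  # rewire one endpoint
        if u != v:
            rewired.add((min(u, v), max(u, v)))
    return rewired

# With p = 0 the graph is exactly 2 disjoint 5-cliques: 10 nodes, 20 edges.
caveman_2_5 = relaxed_caveman(2, 5, 0.0)
```

With p = 0 the edge counts match the caveman rows of Table 3.4.1 (20 edges for 2×5, 135 for 3×10); increasing p rewires more intra-clique edges into random long-range links.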

Graph measurements for both the caveman and the real graphs are shown in Table 3.4.1, where "Avg degree" stands for the average degree and "GCC" for the global clustering coefficient.

ID            Nodes    Edges    Avg degree   GCC
cavemen 2 5      10       20         2.0     0.67
cavemen 3 10     30      135         4.5     0.62
cavemen 8 50    400     9800        24.5     0.76
9101          33561   656584        19.56    0.39
9001          17296   907800        52.48    0.74
9037          24854  1085964        47.5     0.57

Table 3.4.1: Test data
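The two measurements in the table can be computed from an edge list with a short, dependency-free sketch (a minimal illustration of the standard definitions, not the measurement code used in the study):

```python
from collections import defaultdict

def graph_stats(edges):
    """Average degree and global clustering coefficient (transitivity)
    for an undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    avg_degree = sum(len(nb) for nb in adj.values()) / len(adj)
    closed, triads = 0, 0
    for v, nb in adj.items():
        d = len(nb)
        triads += d * (d - 1) // 2          # connected triples centred at v
        nbl = sorted(nb)
        for i in range(len(nbl)):
            for j in range(i + 1, len(nbl)):
                if nbl[j] in adj[nbl[i]]:
                    closed += 1             # the triple at v is closed
    gcc = closed / triads if triads else 0.0
    return avg_degree, gcc

# A triangle: every connected triple is closed, so the coefficient is 1.0.
stats = graph_stats([(0, 1), (1, 2), (0, 2)])
```

A path graph, by contrast, contains no closed triples and yields a coefficient of zero.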

3.4.1 Degree distribution


of networks.

Networks having a power-law degree distribution are called scale-free networks. Examples of such networks are some social networks and the Internet. The test graphs are scale-free graphs as can be seen in Figure 3.4.1.

Figure 3.4.1: Degree distribution of the test graphs

3.4.2 Reduction size

The graphs can be reduced by any number of edges or vertices. To maintain the structure of the original graph, the edge cut should not be too large. The upper limit on the number of edges is determined by the graphics engine used for visualization: there is a limit on the number of nodes and edges that can be held in memory and rendered at a reasonable frame rate.


not exceed that.

3.5 Test environment

Running the implementations of the algorithms in a clean, identical environment is essential to guarantee reproducible and valid results; no other processes or programs should be running that could influence the performance or results. To ensure this, the tests were run on Amazon Web Services virtual instances. AWS is a cloud solution offering a variety of computing, storage, and other services; most importantly, it makes it possible to set up virtual machines to run code in parallel on clean instances.

The tests were run on clean EC2 instances with Ubuntu 18.04 (Bionic). The instance type is m5ad.large with 8 GiB RAM and 1 × 75 GB NVMe SSD, which is comparable performance-wise to an average computer. Each instance had the graph-tool Ubuntu package installed, along with Python and pip to install the dependencies. The graph files and the algorithm implementation code were copied to the instances and executed.

3.6 Assessing reliability and validity of the data collected


3.7 Evaluation framework


Chapter 4

Implementation

The work for this comparative study consists of two parts: the implementation of the graph sampling algorithms and the network visualization. The algorithms are implemented as Python scripts that can be deployed to a back-end, and the visualization is implemented in JavaScript and runs in a browser.

4.1 Graph libraries

Graph libraries provide basic graph input and output methods for reading and writing graph files. In addition, they support the graph data structure and implement basic graph analysis and measurement functions presented in Chapter 2.1.

There are plenty of Python graph libraries available. The core algorithms for graph analysis are implemented in various frameworks such as JUNG, NetworkX, graph-tool, and others. These frameworks differ in terms of implementation, performance, and the set of algorithms offered.

Graph data formats

Graphs may also carry attributes on nodes and edges. Such graphs can be stored in more complex formats such as GML, GraphML, JSON, Pajek, YAML, LEDA, and many others.

The right choice of a data format is determined by the structure of the data in question in combination with the libraries and software to be used. It is straightforward to convert from one format to another.

4.2 Graph-tool graph library

Graph-tool is a Python library for graph analysis. It has a well-documented set of APIs for core graph algorithms, measurements, and graph manipulation. It is essentially a C++ library wrapped in Python, with dependencies on Boost, expat, SciPy, NumPy, CGAL, and other optional libraries. The Boost Graph Library is a C++ graph library implementing standard mathematical graph algorithms. [26]

Installation

A common way to install Python libraries that are not included in a standard Python distribution is to use the pip package manager to pull them from PyPI (the Python Package Index), a repository used to distribute software for Python. Graph-tool is not present there, since it is essentially a C++ library with dependencies on other C++ libraries, which cannot be installed using pip. For GNU/Linux distributions and macOS, the installation can be done using package managers.

The fastest and the easiest way to get graph-tool is by downloading a Docker image. It is an OS-agnostic way, which requires minimal effort and does not cause compatibility issues since it is executed in an isolated Docker container.


require any adjustments in the code.

Performance

Graph-tool uses Open Multi-Processing API to run algorithms in parallel. It can be fully utilized when running on hardware with multiple cores, having parallel execution enabled.

It is based on the Boost C++ libraries and takes advantage of metaprogramming techniques to achieve a high level of performance. Moreover, it has APIs for accessing vertices and edges as NumPy arrays, without having to create complex object collections. Another feature that boosts performance is the filtering capability: it is an efficient way to filter out edges or vertices by assigning a property, without having to copy the entire graph or modify the original structure.

License

Graph-tool is distributed under the GNU General Public License, and its source code is publicly available.

4.3 NetworkX graph library

NetworkX is a pure Python library for graph analysis, first publicly released in 2005 [27]. It is well-established and has extensive documentation and example code. It can be installed with the pip package manager and is compatible with all major operating systems (MacOS, Linux, Windows) [28].

It includes most of the standard network analysis algorithms, with links to the implementations and usage examples.
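As an illustration of the NetworkX API used in this study, the sketch below computes the centrality and clustering measures on a toy weighted graph; the graph itself is made up for the example.

```python
import networkx as nx

# Toy weighted graph; "c" is a cut vertex on every path to "d".
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 3.0), ("b", "c", 1.0),
                           ("a", "c", 2.0), ("c", "d", 5.0)])

bc = nx.betweenness_centrality(G)        # node betweenness (normalized)
ebc = nx.edge_betweenness_centrality(G)  # edge betweenness
gcc = nx.transitivity(G)                 # global clustering coefficient
```

Because every shortest path to "d" passes through "c", that node receives the highest betweenness centrality.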

Installation


License

NetworkX is free open-source software distributed under the 3-clause BSD license. It can be redistributed and modified under the terms of the license.

4.4 Graph sampling implementation

The implementation of the sampling algorithms was developed using Visual Studio Code 1.33.1 for Mac, with additional Python plugins installed. Originally the implementation was done using NetworkX 2.2; however, graph loading and betweenness centrality calculation times increased significantly for the larger graphs, and the shortest-path calculation for the overlay network graphs took over an hour. Therefore, the final implementation was rewritten using the graph-tool library.

Graph-tool proved to have better performance, and in addition it has APIs for accessing vertices and edges as NumPy arrays, which enables implementing computations as matrix operations rather than loops over collections of objects. NumPy is a Python package for scientific computing; it supports multi-dimensional arrays and linear algebra computations and is used heavily in data science [29].
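As a small illustration of why the array APIs matter, the weighted degree of every node can be computed in one vectorized step rather than a Python loop. The edge arrays below are made-up example data in the parallel-array form graph-tool exposes:

```python
import numpy as np

# Edge list as parallel arrays: source node, target node, weight per edge.
src = np.array([0, 0, 1, 2])
dst = np.array([1, 2, 2, 3])
w = np.array([5.0, 1.0, 2.0, 4.0])
n_nodes = 4

# Weighted degree: sum of incident edge weights, with no explicit Python loop.
weighted_degree = (np.bincount(src, weights=w, minlength=n_nodes)
                   + np.bincount(dst, weights=w, minlength=n_nodes))
```

The same quantity computed with a Python loop over edge objects scales far worse on graphs with hundreds of thousands of edges.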

4.5 Performance

The difference is due to the fact that the libraries have different implementations: NetworkX is a pure Python library, while graph-tool has a C++ based implementation.

The performance measures are presented in Table 4.5.1. The "BC gt" column shows the betweenness centrality calculation time for each graph using graph-tool, and "BC nx" the same calculation using NetworkX. "SP gt" and "SP nx" show the time the all-pairs shortest-path distance calculation took using graph-tool and NetworkX respectively. The results are measured in seconds.

graph   BC gt     BC nx    SP gt      SP nx
2 5     0.002 s   0.005 s  0.004 s    0.02 s
3 10    0.002 s   0.02 s   0.0025 s   0.27 s
8 50    0.005 s   3.024 s  0.0352 s   206.93 s
9101    2.15 s    660 s    494.208 s  > 1 hour
9001    2.213 s   820 s    219.16 s   > 1 hour
9037    3.99 s    1122 s   363.5 s    > 1 hour

Table 4.5.1: Performance measurements for NetworkX and graph-tool

The performance is drastically different, especially the calculation of the shortest path which, when applied to the network overlay graphs, takes hours with NetworkX and only a couple of minutes with graph-tool.

4.6 Visualization


SVG

SVG (Scalable Vector Graphics) is, as the name suggests, a vector-based image format represented in an XML-like format. SVG images preserve proportions and shapes when scaled and are resolution independent. The format handles large high-resolution graphics well; however, it performs poorly when rendering many elements.

Canvas

Canvas is an HTML container element that provides a surface for drawing interactive graphics. Shapes such as paths, circles, boxes, text, or images are drawn onto it through a scripting API. The canvas element supports all basic mouse events, although the drawn shapes are not retained as individual DOM elements. The performance of Canvas decreases on large-resolution visualization surfaces.

WebGL and Three.js

WebGL is a low-level engine that enables both two- and three-dimensional drawing. It is able to render many large objects while maintaining reasonable performance, combining the strengths of Canvas and SVG. This is achieved through GPU-accelerated processing, which is not possible with regular Canvas and SVG elements. WebGL is widely used for 3D content creation; examples include game engines such as Unity and Unreal Engine 4. [30]

Since WebGL exposes low-level APIs, several libraries exist to facilitate developers' efforts by abstracting basic scene creation and manipulation: A-Frame for virtual reality programs, BabylonJS, PlayCanvas, three.js, OSG.JS, and others. Three.js is a popular JavaScript library built on top of WebGL, providing high-level support for drawing GPU-accelerated graphics. It abstracts away low-level WebGL calls and wraps repetitive bits and WebGL implementation details, resulting in lower overhead and ease of development. [31]

Since the graphs in this study are large, the Three.js based solution was chosen for the visualization implementation.

d3

D3 is a visualization library written in JavaScript. It can be used to visualize data in the form of interactive charts, tables, trees, and many other data structures. D3 can be integrated with Canvas, SVG, and WebGL; most of the documentation examples use SVG and Canvas to draw plots and animated data tables. [32]

It also has support for visualizing graph structures and reading graph data formats. In addition, it implements a force-directed graph layout in a sub-module called d3-force, which adds forces to nodes, enabling a physical simulation. [33]

4.7 Implementation

The proposed end solution consists of two blocks: the graph reduction algorithm and the visualization. The solution uses the betweenness centrality algorithm for graph sampling. The original graph is delivered in JSON format and converted to GraphML format to be read by the graph-tool library. The algorithm is then applied, and the graph is reduced to the specified threshold.
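The JSON-to-GraphML conversion step can be sketched as follows. The input schema (top-level "nodes"/"links" lists with a "bytes" weight field) is a hypothetical example, since the real export format is not specified here; the GraphML envelope follows the standard structure.

```python
import json

def to_graphml(doc):
    """Serialize a parsed JSON graph into a minimal GraphML string.
    The nodes/links/bytes keys are a hypothetical input schema."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<graphml xmlns="http://graphml.graphdrawing.org/xmlns">',
             '<key id="w" for="edge" attr.name="weight" attr.type="double"/>',
             '<graph edgedefault="undirected">']
    for node in doc["nodes"]:
        lines.append(f'<node id="{node["id"]}"/>')
    for link in doc["links"]:
        lines.append(f'<edge source="{link["source"]}" target="{link["target"]}">'
                     f'<data key="w">{link["bytes"]}</data></edge>')
    lines += ['</graph>', '</graphml>']
    return "\n".join(lines)

raw = json.loads('{"nodes": [{"id": "a"}, {"id": "b"}],'
                 ' "links": [{"source": "a", "target": "b", "bytes": 1024}]}')
graphml = to_graphml(raw)
```

The resulting string can be written to a .graphml file and loaded directly by graph-tool.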

The visualization client consumes two graphs: the original and the sampled one. The original graph is used to assign forces and allocate coordinates to the graph nodes; the sampled graph determines which edges to display. To avoid duplicating data and to optimize loading time, the reduced graph contains only the identifiers of the edges, while the graph metadata remains accessible from the original graph.


Chapter 5

Analysis and evaluation of results

In this chapter, the results of the comparison study on the selected graph sampling algorithms described in Chapter 2.4 are presented. In the first section, the algorithms are compared across six performance criteria. In the second section, the resulting reduced graph visualizations are presented for each algorithm.

5.1 Algorithms performance comparison


5.1.1 Hypothesis 1

Hypothesis: One of the algorithms performs better when it comes to keeping the percentage of the total edge weight.

Related Questions: How big is the difference between algorithms SRS2, FF, BC and WIS in terms of maintaining edges with the highest weight?

Test output (observed/dependent variables): A percentage of total edge weights retained after the application of the graph sampling algorithm.
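This metric can be computed directly from the edge lists; a minimal sketch (the edge data below is made up for illustration):

```python
def retained_weight_fraction(original_edges, sampled_edges):
    """Fraction of the total edge weight kept by a sampling algorithm.
    Both arguments are lists of (u, v, weight) triples."""
    total = sum(w for _, _, w in original_edges)
    return sum(w for _, _, w in sampled_edges) / total

original = [("a", "b", 8.0), ("b", "c", 2.0)]
sampled = [("a", "b", 8.0)]  # the heaviest edge survives the cut
fraction = retained_weight_fraction(original, sampled)
```

Here a 50% edge cut keeps 80% of the total weight, which is exactly the effect the hypothesis measures.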


5.1.2 Hypothesis 2

Hypothesis: Algorithms BC and FF perform better than WIS and SRS2 when it comes to keeping the graph connectivity.

Test output (observed/dependent variables): The structural connectivity of the resulting sampled graph expressed through the number of connected components.
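Counting connected components reduces to a graph traversal; a dependency-free sketch of the measurement (graph libraries provide equivalent functions):

```python
from collections import defaultdict, deque

def count_components(edges):
    """Number of connected components in an undirected graph given as
    (u, v) pairs (isolated vertices are not represented)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        queue = deque([start])    # breadth-first search from each new root
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return components
```

A sampled graph that stays in one piece yields 1; every additional component signals structure lost to the edge cut.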

Results (Figure 5.1.2): As expected, FF and BC produce a connected sampled graph. The FF algorithm starts by selecting a vertex, adds adjacent vertices and edges one by one to form a tree, and then adds back the edges between vertices that form the longest shortest paths in the constructed tree. Therefore, at all stages of the algorithm the sampled graph is connected.

The BC algorithm first removes edges with the lowest betweenness centrality, which results in a disconnected graph. Then the post-processing step is applied, guaranteeing the connectivity.


5.1.3 Hypothesis 3

Hypothesis: SRS2, FF, BC shall output graphs with the same degree distribution.

Test output (observed/dependent variables): the average node degree of a reduced graph.

Results (Figure 5.1.3): The average node degrees are the same for BC, FF, and SRS2. They decrease linearly, proportionally to the edge cut, indicating that the algorithms correctly reduce the number of edges in proportion to the specified edge threshold. A difference becomes visible as the edge cut gets larger; this is because BC and FF have a limit on the maximum number of removable edges, while SRS2 does not.


5.1.4 Hypothesis 4

Hypothesis: One of the algorithms shall perform better in terms of computation time of the graph reduction.

Test output (observed/dependent variables): running time in seconds for different algorithms.

Output: FF is the slowest (due to the shortest path calculation). SRS2 is the fastest algorithm.

Results (Figure 5.1.4):

SRS2 and WIS take only a few seconds to execute, since these algorithms do not use any time-demanding measurements. The more complex algorithms, FF and BC, have longer running times because of the shortest-path and betweenness centrality calculations required. The running time comparison of these measures, implemented with the two Python libraries, is presented in Section 4.5.


5.1.5 Hypothesis 5

Hypothesis: One of the algorithms shall perform better in terms of maintaining the global clustering structural characteristic.

Test output (observed/dependent variables): global clustering coefficient calculated using the formula described in Chapter 2.1 under the ”Clustering coefficient” section.

Results (Figure 5.1.5):

WIS has the highest global clustering coefficient (as well as the highest average degree, as shown in Hypothesis 3). As with the average degree, the global clustering coefficient increases as the node threshold decreases. The outputs of SRS2 and FF depend on the density of the original graph, as the results for 9037, 9001, and 9101 differ.


5.1.6 Hypothesis 6

Hypothesis: One of the algorithms shall perform better in terms of keeping the assortativity coefficient.

Test data: Only the real overlay network graphs were compared for this hypothesis. The assortativity of all reduced graphs is positive, indicating that they are assortative networks in which edges tend to connect vertices of the same type.

Test parameter: The type parameter is the site, or geographical location, where the network computer is physically situated. The overlay network is designed to have direct connections between computers in the same location, and also to connect to other sites.

Test output (observed/dependent variables): assortativity coefficient measuring the similarity of connections in the graph with respect to the given attribute.
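Newman's assortativity coefficient for a categorical attribute can be sketched in plain Python from its mixing-matrix definition (graph libraries provide equivalent functions; the site labels below are made up):

```python
def attribute_assortativity(edges, attr):
    """Newman's assortativity coefficient for a categorical node attribute.
    edges: (u, v) pairs; attr: mapping node -> category."""
    cats = sorted({attr[u] for u, v in edges} | {attr[v] for u, v in edges})
    idx = {c: i for i, c in enumerate(cats)}
    k = len(cats)
    e = [[0.0] * k for _ in range(k)]      # mixing matrix e_ij
    m = 2 * len(edges)                     # each edge counted in both directions
    for u, v in edges:
        e[idx[attr[u]]][idx[attr[v]]] += 1 / m
        e[idx[attr[v]]][idx[attr[u]]] += 1 / m
    trace = sum(e[i][i] for i in range(k))
    ab = sum(sum(e[i]) * sum(row[i] for row in e) for i in range(k))
    return (trace - ab) / (1 - ab)

# Two "sites"; every edge stays within a site, so the coefficient is 1.
site = {0: "x", 1: "x", 2: "y", 3: "y"}
r = attribute_assortativity([(0, 1), (2, 3)], site)
```

The coefficient is +1 when edges only connect same-type vertices and -1 when they only cross types, which is the scale the plots in Figure 5.1.6 are read against.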

Results (Figure 5.1.6):

The overlay network graphs are densely connected within the sites; therefore, the first priority for reduction purposes would be to decrease the number of edges within clustered components and keep the edges connecting different sites, to show the data flow.

Since the vertex type is not considered in any of the implemented sampling algorithms, the results vary. SRS2 keeps the assortativity at the same level but sacrifices connectivity, as was shown for Hypothesis 2. The largest assortativity coefficient appears at the threshold of 0.01%, indicating that edges are most likely to connect vertices of the same type once the edge cut is large.

The FF algorithm shows the steepest curve as the cut gets larger. This is because the base structure of the graph in this algorithm is a tree; therefore, all clusters except the one containing the filtering-focus node end up with leaf nodes that lack direct connecting links.

As the number of selected nodes gets smaller, the number of edges between nodes belonging to different sites still stays large.


5.1.7 Conclusions

The following patterns are observed in the test results. SRS2 is the fastest algorithm but produces a reduced graph that has little in common, structure-wise, with the original. WIS is also a very fast algorithm, and it keeps some of the sampled graphs connected even with half of the edges removed from the original overlay network. This is a good result, considering that the connectivity property is not specified in the algorithm description; the reason is that the test graphs are very densely connected. The sampled graph stays connected when reducing the graph by 30% or more.

The focus-filtering algorithm is based on creating a tree and then adding back edges. It works well when the number of sampled edges is at least twice the number of vertices, which guarantees the connectivity of the smaller components. However, as the number of edges decreases, the shape of the graph approaches a tree. This can be clearly seen when visualizing the graph with a force-directed layout, as shown in the following section. The algorithm starts with the selected source node and grows the tree from there. The result is unbalanced: nodes and links close to the source are present in the sampled graph, while those forming smaller clusters in the original graph are not selected and appear as leaf nodes.

The betweenness centrality based algorithm (BC) performs better in terms of speed than FF; however, its running time grows linearly as the edge cut gets higher. It always keeps the graph connected by design, and the sampled graphs are generally well balanced.

Overall, the betweenness centrality based algorithm is the best choice when the graph is expected to stay connected, the important links must stay in place, and a non-immediate execution time is acceptable. Therefore, the BC algorithm is the best choice for this study.

Source node

The source node is always selected as the ”filtering focus” node. The reason is that the source node has the highest number of edges in all the test graphs. Since the ”filtering focus” is the node with the highest summed edge weights, it gets selected as the root of the sampled tree. The outcome of the FF algorithm will therefore always contain edges connected to the source node.

The same applies to the WIS algorithm: the weight of the source node is high, so it is present in all reduced graphs, guaranteeing that the node originating the video stream will not be cropped off from the graph.

The source node also has a high betweenness centrality; after sampling the graph with the BC algorithm, the edges connecting to the source node have a high probability of staying.


5.2 Visualization results

In addition to the quantitative measures, this section presents a comparison of the visual representations of the filtered graphs with the original ones. Such visualizations facilitate the understanding of the different datasets and their structural similarities.

The end goal of this study is to effectively visualize the networks by creating the closest possible approximation of the original graph. The end result should have visually similar properties to the original network, and show the structure while having the optimal performance.

5.2.1 Test graphs visualization

A generated caveman graph is presented in Figure 5.2.1 using the solution explained in Chapter 4. The caveman graph consists of two randomly interconnected clusters of size five, resulting in 10 nodes and 20 edges.

Figure 5.2.1: Caveman 2x5


(a) BC (b) SRS2

(c) WIS (d) FF

Figure 5.2.2: Degraded caveman 2x5 graph after applying the sampling algorithms: 80% of edges left, original graph reduced by 20%.

graph structure quite similar; however, two visibly distinct components cannot be determined. WIS produces the graph with a structure most similar to the original one, by removing nodes. The FF algorithm keeps one component as it was originally, while the second one is quite disconnected.

By further reducing the graph by half, we see the extreme shapes each algorithm tends toward. It is not practical to reduce graphs this drastically for visualization; however, for deciding which algorithm to pick, it is crucial to determine what kind of structure they converge to. The resulting sampled graphs are presented in Figure 5.2.3.


(a) BC (b) SRS2

(c) WIS (d) FF


5.2.2 Overlay network visualization

In this section, the algorithms are applied to one of the original graphs of the network overlay structure provided by the company. The graph 9031 has 1821 vertices and 52343 edges, and three large distinct clusters, one of which has many sub-clusters and also contains a source node, shown in Figure 5.2.5.


Figure 5.2.5: Zoomed view on graph 9031

The graph was sampled with a filtering threshold of 8%, chosen empirically so that the output graph would still be connected while being small enough to guarantee good visualization performance. The resulting graphs are presented in Figure 5.2.6.

The BC algorithm keeps the components connected; however, it is hard to distinguish the original components with a force-directed layout. This is because the algorithm keeps the edges with the highest betweenness centrality and removes those with the lowest, so the smaller dense components become less connected. This is good in terms of structural properties, but the resulting force-directed visualization has one clear dense center and is not evenly distributed.


(a) Graph reduced with BC (b) Graph reduced with SRS2

(c) Graph reduced with WIS (d) Graph reduced with FF

satisfactory. The algorithm disconnects the connected component; as a result, many independent sub-graphs can be seen in the figure.

The FF result also appears closer to the original; however, as can be seen from the caveman samples, when the filtering threshold decreases, the edges of the sampled graph concentrate in the area of the largest component while the other two components become disconnected.

The WIS output size differs due to design differences; to get closer to the other samples, 25% of the nodes were sampled, resulting in a graph with 455 vertices and 5271 edges, approximately a 10% edge cut. The sampled graph has two components, since the weak nodes are removed.

2-step rendering

As can be seen in the previous section, once the graphs are reduced through sampling, the force distribution changes. When the reduction percentage is small, it is possible to maintain a layout similar to that of the original graph. However, the original graphs have hundreds of thousands of edges while the rendering limit is around tens of thousands of edges, so a large portion of the graph needs to be removed.

To keep the structure the same, a 2-step rendering technique was developed. The visualization code takes two graphs: the original and the sampled one. The force-directed layout algorithm is first applied to the original graph, and the resulting node coordinates are kept in memory. As a second step, only the links present in the sampled graph are visualized, using the coordinates allocated to the original graph.

This approach keeps the structure of the original graph while rendering only the edges that were chosen by the sampling algorithm. The same graphs from Figure 5.2.6 are redrawn using the 2-steps approach resulting in the visualizations in Figure 5.2.7.
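The core of the technique, independent of the rendering engine, can be sketched as follows; the layout coordinates and edge list below are made-up example data standing in for the output of step 1 and the sampling algorithm.

```python
def two_step_render(full_positions, sampled_edges):
    """Step 2 of the technique: draw only the sampled edges, but at the
    coordinates computed by the force layout on the FULL graph (step 1).
    Returns drawable line segments as ((x1, y1), (x2, y2)) pairs."""
    return [(full_positions[u], full_positions[v]) for u, v in sampled_edges]

# Hypothetical output of step 1: a layout computed on the original graph.
positions = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.5, 1.0)}
# Only one of the original edges survived sampling.
segments = two_step_render(positions, [("a", "b")])
```

Because node positions come from the full graph, removed edges change what is drawn but never where the remaining nodes sit.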


(a) Graph reduced with BC. (b) Graph reduced with SRS2.

(c) Graph reduced with WIS. 10% edge cut (d) Graph reduced with FF.


component has shrunk drastically and is hard to distinguish in the view. The BC and FF algorithms produce the graphs closest to the original, with a clearer structure. The difference between them lies in the density of the components: with BC the clusters are more densely connected to one another, while with FF the components are more connected internally. Moreover, with FF the smaller independent components are not as clearly represented.

Large overlay network visualization


Chapter 6

Conclusions

Throughout the thesis work, it became clear that graph reduction is necessary for visualization: none of the currently existing web-based graphics engines is able to load the original graphs as-is while keeping the nodes interactive. The investigation of sampling algorithms was therefore a valid approach.

The main outcome of this thesis is an evaluation of the reduction algorithms and a functional solution for the visual representation of the overlay network, allowing the user to identify network hubs, evaluate the general structure, and see differences between networks.

Discussion

The comparison study was done by evaluating a set of objective graph measurements. The chosen betweenness centrality based algorithm fits the particular visualization use-case, applied to a dense overlay network. The comparison study findings presented in Chapter 5 can be used as a general guide for the selection of a filtering approach; however, the choice of algorithm may differ depending on the requirements.

Future Work

A combined approach that reduces the graph in steps could lead to optimally reduced graphs and improved performance: for instance, using WIS first to reduce the number of nodes, then applying the betweenness centrality based algorithm to remove extra edges. On the visualization part of the solution, future work could focus on usability and user requirements.


Bibliography

[1] Hu, Yifan and Shi, Lei. “Visualizing large graphs”. In: Wiley

Interdisciplinary Reviews:Computational Statistics 7.2 (2015), pp. 115–

136.

[2] Wikipedia, community. Wikipedia:Statistics. 2019. URL: https : / / en . wikipedia.org/wiki/Wikipedia:Statistics (visited on 05/13/2019). [3] Azevedo, Frederico AC et al. “Equal numbers of neuronal and nonneuronal

cells make the human brain an isometrically scaled-up primate brain”. In:

Journal of Comparative Neurology 513.5 (2009), pp. 532–541.

[4] Håkansson, Anne. “Portal of research methods and methodologies for research projects and degree projects”. In: The 2013 World Congress

in Computer Science, Computer Engineering, and Applied Computing WORLDCOMP 2013 (2013), pp. 67–73.

[5] Newman, Mark. Networks: an introduction. Oxford university press, 2010. [6] Cancho, Ramon Ferrer I and Solé, Richard V. “The small world of human language”. In: Proceedings of the Royal Society of London. Series B:

Biological Sciences 268.1482 (2001), pp. 2261–2265.

[7] Bassett, Danielle Smith and Bullmore, ED. “Small-world brain networks”. In: The neuroscientist 12.6 (2006), pp. 512–523.

[8] Castells, Manuel. The Internet galaxy: Reflections on the Internet,

business, and society. Oxford University Press on Demand, 2002, pp. 9–

(70)

[9] Fruchterman, Thomas MJ and Reingold, Edward M. “Graph drawing by force-directed placement”. In: Software: Practice and experience 21.11 (1991), pp. 1129–1164.

[10] Eades,

Peter. “A heuristic for graph drawing”. In: Congressus numerantium 42 (1984), pp. 149–160.

[11] Quigley, Aaron and Eades, Peter. “Fade: Graph drawing, clustering, and visual abstraction”. In: International Symposium on Graph Drawing. Springer. 2000, pp. 197–210.

[12] Koren, Yehuda. “On spectral graph drawing”. In: International Computing

and Combinatorics Conference. Springer. 2003, pp. 496–508.

[13] Sugiyama, Kozo, Tagawa, Shojiro, and Toda, Mitsuhiko. “Methods for visual understanding of hierarchical system structures”. In: IEEE

Transactions on Systems, Man, and Cybernetics 11.2 (1981), pp. 109–125.

[14] Holten, Danny and Van Wijk, Jarke J. “Force-directed edge bundling for graph visualization”. In: Computer graphics forum. Vol. 28. 3. Wiley Online Library. 2009, pp. 983–990.

[15] Leskovec, Jure and Faloutsos, Christos. “Sampling from large graphs”. In: Proceedings of the 12th ACM SIGKDD international conference on

Knowledge discovery and data mining. ACM. 2006, pp. 631–636.

[16] Lee, Bongshin et al. “Treeplus: Interactive exploration of networks with enhanced tree layouts”. In: IEEE Transactions on Visualization and

Computer Graphics 12.6 (2006), pp. 1414–1426.

[17] Boutin, Francois, Thievre, Jérôme, and Hascoët, Mountaz. “Focus-based filtering+ clustering technique for power-law networks with small world phenomenon”. In: Visualization and Data Analysis 2006. Vol. 6060. International Society for Optics and Photonics. 2006, 60600Q.

(71)

[19] Herman, Ivan, Melançon, Guy, and Marshall, M Scott. “Graph visualization

and navigation in information visualization: A

survey”. In: IEEE Transactions on visualization and computer graphics 6.1 (2000), pp. 24–43.

[20] Auber, D, Munzner, T, and Archambault, D. “Visual exploration of complex time-varying graphs”. In: IEEE transactions on visualization and

computer graphics 12.5 (2006), pp. 805–812.

[21] Auber, David et al. “Multiscale visualization of small world networks”. In: IEEE Symposium on Information Visualization 2003 (IEEE Cat. No. 03TH8714). IEEE. 2003, pp. 75–81.

[22] Wu, Andrew Y, Garland, Michael, and Han, Jiawei. “Mining scale-free networks using geodesic clustering”. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2004, pp. 719–724.

[23] Rafiei, Davood. “Effectively visualizing large networks through sampling”. In: VIS 05. IEEE Visualization, 2005. IEEE. 2005, pp. 375–382.

[24] Kurant, Maciej et al. “Walking on a graph with a magnifying glass: stratified sampling via weighted random walks”. In: Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems. ACM. 2011, pp. 281–292.

[25] Fortunato, Santo. “Community detection in graphs”. In: Physics reports 486.3-5 (2010), pp. 75–174.

[26] Peixoto, Tiago P. “The graph-tool python library”. In: figshare (2014). DOI: 10.6084/m9.figshare.1164194. URL: http://figshare.com/articles/graph_tool/1164194 (visited on 09/10/2014).

[27] Hagberg, Aric. NetworkX first public release (NX-0.2). 2005. URL: https://mail.python.org/pipermail/python-announce-list/2005-April/003924.html (visited on 05/15/2019).


[29] NumPy developers. NumPy. 2019. URL: https://www.numpy.org/ (visited on 05/13/2019).

[30] Mozilla and individual contributors. WebGL: 2D and 3D graphics for the web. 2019. URL: https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API (visited on 05/13/2019).

[31] three.js. 2019. URL: https://threejs.org/ (visited on 05/13/2019).

[32] Bostock, Mike. Data-Driven Documents. 2019. URL: https://d3js.org/ (visited on 05/13/2019).


Appendix - Contents

Appendices

A  Network visualization
