
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Enterprise network topology discovery based on end-to-end metrics

Logical site discovery in enterprise networks based on application level measurements in peer-to-peer systems

JONATAN BODVILL


Abstract

In data intensive applications deployed in enterprise networks, especially applications utilizing peer-to-peer technology, locality is of high importance. Peers should aim to maximize data exchange with other peers where the connectivity is the best. In order to achieve this, locality information must be present which peers can base their decisions on. This information is not trivial to find as there is no readily available global knowledge of which nodes have good connectivity. Having each peer try other peers randomly until it finds good enough partners is costly and lowers the locality of the application until it converges. In this thesis a solution is presented which creates a logical topology of a peer-to-peer network, grouping peers into clusters based on their connectivity metrics. This can then be used to aid the peer-to-peer partner selection algorithm to allow for intelligent partner selection. A graph model of the system is created, where peers in the system are modelled as vertices and connections between peers are modelled as edges, with a weight in relation to the quality of the connection. The problem is then modelled as a weighted graph clustering problem, which is a well-researched problem with a lot of published work tied to it. State-of-the-art graph community detection algorithms are researched, selected depending on factors such as performance and scalability, optimized for the current purpose and implemented. The results of running the algorithms on the streaming data are evaluated against known information. The results show that unsupervised graph community detection algorithms create useful insights into networks' connectivity structure and can be used in peer-to-peer contexts to find the best partners to exchange data with.


Referat

In data intensive applications in enterprise networks, especially applications that use peer-to-peer technology, locality is important. Clients should try to maximize data exchange with other clients where the network connection is best. For the clients to be able to make such choices, information about which clients are located nearby must be available for them to base their decisions on. This information is not trivial to produce, as there is no ready-made global knowledge of which clients have good connectivity with other clients, and letting every client probe blindly until it finds the best partners is costly and lowers the locality of the application until it converges.

This report presents a solution that creates a logical view of a peer-to-peer network, grouping clients into clusters based on their connection quality. This view can then be used to improve the locality of the peer-to-peer application. A graph model of the system is created, where clients are modelled as vertices and connections between clients are modelled as edges with a weight related to the connection quality. The problem is then formulated as a weighted graph clustering problem, a well-documented research area with a large body of published work.

The most prominent graph clustering algorithms are then studied, selected based on the requirements, optimized for the problem at hand and implemented. The results produced by running the algorithms on streaming data are evaluated against known information. The results show that unsupervised graph clustering algorithms create useful information about the networks' connectivity structure and can be used in peer-to-peer application contexts to find the best partners to exchange data with.


Acknowledgements

I have had a lot of support throughout this degree project from many intelligent people. I would like to thank Riccardo Reale for his supervision and help as well as his enthusiasm and interest in this project. I would also like to thank Dr. Roberto Roverso for his support, feedback and guidance on my method and solution. Lastly I would like to thank everyone at Hive Streaming for their hospitality and support.


Contents

1 Introduction
  1.1 Background
    1.1.1 Hive
    1.1.2 Enterprise networks
    1.1.3 Peer selection
    1.1.4 Sites
  1.2 Problem description
  1.3 Purpose
  1.4 Goals
  1.5 Method
  1.6 Delimitations
  1.7 Outline

2 Background and Theory
  2.1 Enterprise networks
  2.2 Peer assisted video streaming distribution
    2.2.1 Hive
    2.2.2 Static site information
    2.2.3 Hive networks
  2.3 Related work
    2.3.1 Physical structure assessment
    2.3.2 End-to-end measurements & Network tomography
    2.3.3 Differences

3 Methodology
  3.1 Research strategy
  3.2 Literature study
  3.3 Data collection
  3.4 Data analysis
  3.5 Evaluation
  3.6 Test setup & Measurements

4 Solution
  4.1 Measurements
  4.2 Graph modelling
  4.3 Graph clustering
    4.3.1 Graph clustering requirements
    4.3.2 Graph clustering algorithms
    4.3.3 Weight heuristics
    4.3.4 Weight heuristic evaluation
  4.4 Implementation
    4.4.1 System design
    4.4.2 Graph processing
  4.5 Applying the solution

5 Results and Evaluation
  5.1 Evaluation
    5.1.1 Clustering metrics
    5.1.2 Normalized mutual information
    5.1.3 Fowlkes-Mallows Index
  5.2 Evaluation of weight heuristic
  5.3 Visualization
  5.4 Evaluation of clustering results
    5.4.1 Sweden network
    5.4.2 International network
    5.4.3 North America network

6 Conclusions and Discussion
  6.1 Conclusions from weight heuristic evaluation
  6.2 Conclusions from graph clustering results
    6.2.1 North America results
  6.3 Goals

7 Future work
  7.1 Evaluation of impact
  7.2 Incremental clustering
  7.3 Online clustering
  7.4 Network assessment service
  7.5 Summary

Appendices


Chapter 1

Introduction

With the increasing quality and power of technology for data transfer in real time, higher demand is put on systems and networks to handle the high load that this way of consuming and producing data produces. One way to distribute load is to allow hosts to participate in distributing content [1]. This requires knowledge of the network structure in order to efficiently utilize the available network resources.

This is especially true in enterprise networks because of their heterogeneous structure [2].

1.1 Background

"P2P is one of those rare ideas that is simply too good to go away." - The Economist 2001 [3]

Peer-to-peer is a software architecture which allows for serverless applications, letting the different clients, or "peers", coordinate and execute logic by communicating directly with each other. This architecture is powerful since problems in centralised client-to-server architectures, such as single points of failure, scalability issues and performance issues, can be addressed and avoided. In highly distributed peer-to-peer applications, where different peers are in different geographic locations as well as different networks, it is of interest to be able to tell which peers have good connectivity with each other. This is to enhance the performance and locality of the algorithms and reduce communication latency. This is particularly important in applications aimed at enterprise networks in low latency domains such as streaming.

1.1.1 Hive

Hive is an application developed by Hive streaming [4], a research and development company based in Stockholm. The application is based on publications [5] [6] and offers a solution for peer assisted video streaming distribution in enterprise networks.

The benefit over client-server solutions is that clients who are close to each other in the network can fetch the stream from each other instead of going to a server or a content delivery network (CDN). This will generate a lower load on bottlenecks in the network and utilize the network in a more efficient way while keeping the costs low. The Hive application has three main goals:

1. Maximize savings in the network through efficient peer-to-peer transfer, i.e. minimize the traffic to the stream source.

2. Maximize locality: peers should exchange data with other peers in the same network location to minimize load on weak external links.

3. Maximize quality of experience (QoE), minimizing buffering experienced by users.

1.1.2 Enterprise networks

In enterprise networks it is particularly important for a data-intensive network application to consider locality. This is because of the design of enterprise networks, which have the characteristic of very high quality connections within locations such as offices or cities but low connectivity between these locations [2]. Heavy traffic on the inter-connecting links can lead to congestion and failures, which will lower the performance of the application but also of the network as a whole.

1.1.3 Peer selection

An interesting problem then arises: “How does an arbitrary peer know which peers it should consider for peer-to-peer sharing of the stream?”. Since the system is distributing streaming video, the transfer of data must be fast. If a partner takes too long to share the stream data, the receiving peer lags behind and there is a low quality of experience, with video buffering or pausing. Also, if the peers share data in an inefficient and random manner, peer-to-peer traffic will travel chaotically between company offices, cities or even countries.

1.1.4 Sites

This introduces the concept of sites. A site is defined as a location where peers within the site have good connectivity among each other and where traffic between peers is encouraged. A site can be a department, an office, a city etc. Sites are interesting to consider because peers within a site can use the site information to help find high performing peers where the connectivity is good. While sites often correspond to physical locations and networks, the site concept is focused only on connectivity, and theoretically there is nothing that says that two clients in different subnets or physical locations cannot have good connectivity. Hence the concept of logical sites rather than physical sites.


Static site definitions

One way to address this is to obtain static physical site definitions from the networks where the Hive client will operate. This can be done, and is being done, by querying companies for information on their network structure. However, this has been shown to lead to two main issues: (1) all companies and customers might not be able or willing to produce and hand over these definitions for different reasons, and (2) these static definitions are known to be crude, to not cover the entire network and to sometimes be outright incorrect. Incorrect information has also been shown to do more harm than good if the system is trying to enforce intra-site peering with incorrect site definitions.

The dynamic problem

Without site information, the only reasonable approach is to let peers find partners through some peer sampling service in the system, keeping the high performing partners and evicting the low performing partners according to some policy. This approach is however not free from issues either: (1) in such a random approach, the system will take time to converge into a stable state where the peers have found the highest performing partners, and (2) this convergence must take place at every new streaming event, since no knowledge is persisted. Because of this, in previous work, the benefit of having a network topology which is locality aware has been identified [7] [8].

This covers the issues with static site definitions and ad hoc partner selection.

This leads us into the alternative: dynamic site definition. By dynamic we mean that (1) no information is required from the network beforehand and (2) after sites have been located, the peers in the system will not need to execute a search of the entire peer-to-peer system again to locate high performing peers in the same site.

1.2 Problem description

In a peer-to-peer application for streaming video content where the two main goals are quality of experience and maximization of locality in peer-to-peer data transfer, each peer in the application needs to identify which other peers it should communicate with to obtain the data it needs to achieve its goal. If this peer-selection information does not exist from the beginning, it needs to be created. Thus, the problems addressed in this thesis are:

1. How can a view of logical sites in a peer-to-peer application be generated based on content delivery metrics?

2. How can such a view be used in a system to increase the efficiency of partner selection and thus increase performance and network savings?


1.3 Purpose

In this thesis, a solution for graph clustering to identify logical sites in enterprise networks based on application level connectivity metrics is designed, developed and evaluated. This solution aims to aid the Hive peer-to-peer application, increasing its locality while maintaining its high performance and quality of experience, and lowering the saturation of weak network links.

1.4 Goals

The project goal is to develop a new unsupervised way to identify logical sites within the peer-to-peer video streaming distribution application Hive. The expected effect of this project is to further improve the application, allowing for more effective communication within the application while not requiring any additional information to do so. This will lead to a more refined product which is expected to have a positive impact on the company's goals.

The following sub-goals are defined in this project:

1. Finding a meaningful and efficient way to represent and model the system at hand.

2. Discover a way to analyze a peer network, regardless of pre-available information, to find communities of peers with high connectivity.

3. Create a view of these communities that can be used in the system to aid peers in their partner selection.

1.5 Method

In this project, a series of methods will be used. The project starts with a period of literature study in which related work in the areas of network topology discovery and the use of graph clustering to cluster networks is studied. This is to identify what other work has been done in the area, how it performs and whether those solutions could be incorporated in the project solution. When the literature study has produced the desired information on the subject, a proof-of-concept implementing the solution is developed. Furthermore, real-world data is gathered from experiment scenarios and the solution is run on this data and evaluated against a known truth. This known truth consists of networks chosen for the sole purpose of evaluating the algorithm, where every peer's location is well defined in order to make it possible to compare the clustering against the truth.


1.6 Delimitations

In this thesis, the possibility of running the clustering online, as a part of the application logic, is not explored. Nor does this thesis compare the performance of all the different graph clustering algorithms; there are other works that cover this very well and they will be referenced in this thesis. Instead, a subset of clustering algorithms will be chosen based on the requirements and the characteristics of the algorithms as well as the state-of-the-art.

1.7 Outline

Chapter 2 describes the background and theory as well as the results from the literature study, to create a deeper understanding of the problem and the different characteristics of the solution alternatives.

Chapter 3 describes the methods used in this degree project as a whole, from research of literature to design and development process and evaluation.

Chapter 4 describes the solution, the different design choices made and why they were made.

Chapter 5 describes the results from the network topology inference, both visualized and with evaluation metrics.

Chapter 6 discusses the results and draws conclusions from them.

Chapter 7 describes future work.


Chapter 2

Background and Theory

In this chapter, the theoretical background for the project is discussed.

2.1 Enterprise networks

Enterprise networks differ in design characteristics from the open internet [2] [9].

The design of these networks generally consists of highly intra-connected sites, offices or cities, with good connectivity within themselves, connected to each other over weak VPN-links as visualized in figure 2.1. In these networks the concept of network locality becomes important for any network application sending data between hosts in the network. Keeping the communication local will efficiently utilize the network capacity, while communication between sites incurs high costs and risks congesting the weak inter-connecting links, leading to network failures and slowness.

Figure 2.1: Structure of an enterprise network with highly intra-connected offices and weak inter-connecting VPN-links [10]


2.2 Peer assisted video streaming distribution

As the advancement of technology enables more complex and efficient software solutions, different ways of providing services are explored. Streaming of video is one of the services which has been made more accessible with the advancements in networking technology. It does, however, still generate a lot of traffic, which is an issue in computer networks with finite network resources, and in particular in enterprise networks. Hive was developed to allow for large-scale video streaming in corporate networks while maintaining a low saturation of the network, as seen in figure 2.2. By offloading the source, Hive can lower the load on the stream source by up to 96% [6] while maintaining the same quality of experience (QoE) as a CDN solution.

(a) Non-peer assisted solution (b) Peer assisted solution

Figure 2.2: Differences between source load in (a) non-peer assisted distributions and (b) peer assisted distributions. Red links are clients fetching the stream from its source. [10]

2.2.1 Hive

Streaming of data in large quantities puts a heavy load on the network. Naïve streaming architectures where every client pulls data from a server thus incur a cost and a problem for a corporate network. To solve this issue, one approach is to utilize CDN solutions, content delivery networks, to cache the stream so that different clients can access different nodes, or slaves, to acquire the stream. However, this approach leads to an increasing cost of maintaining the CDN solution as the user count scales up. Because of these issues, Hive provides a streaming distribution implemented as an HTTP cache in every client, which takes advantage of the fact that many clients are close to each other in the network and are watching the same stream. Because of this, clients can pull content not just from the CDN but also from each other in a peer-to-peer manner, as seen in figure 2.3. This distributes the network load and does not put a lot of concentrated strain on any single network link.


(a) naïve client-server system model (b) CDN system model

(c) Peer assisted system model

Figure 2.3: Difference in system designs for fetching stream data from a source. As seen in (a) the source is put under a load that is directly proportional to the number of clients while the CDN and peer assisted solution reduces the load, but the CDN solution requires additional CDN nodes in the network.

Peer selection

In order for peer-to-peer transfer of streaming video data to be feasible and efficient, both in terms of network utilization and quality of experience, it is important that clients choose peers intelligently. In the Hive client, peer selection is done by each client by receiving samples of peers from the system and trying them out, keeping partners which perform well and evicting low-performing peers from its partner set [6].

Sites

Introducing the site concept in the Hive system enables each client to favour a subset n ⊂ N of all the clients in the network which are likely to perform well and be close in the network structure. While a site is heavily correlated with physical partitions of the network such as offices or subnets, the concepts are decoupled, as the physical location requires physical information not present in the system while a site can be defined using logical connectivity metrics. The challenge then lies in obtaining this site information in order to allow clients to make proper decisions on which peers to choose. There are two scenarios: (1) there is partial network information which can be built upon to create complete knowledge of the network, and (2) there is no knowledge to start from.

2.2.2 Static site information

In some cases, customers provide partial network information to Hive in the form of physical network location data mapping clients' IP addresses to sites. This is useful information and can be used in the partner selection within the Hive client to guide peers to select partners from specific subsets of clients. However, from experience, this static site information has often been shown to be flawed, incorrect or completely missing.

Incorrect site information, stating that two peers p1 and p2 are in the same network location when they are not, will lead to negative consequences as the peers will try to fetch from each other even when they should not. In the cases where static information is readily available, there are several use-cases where dynamic site information can act as a complement, as described in figure 2.4.

(a) Validation: Validating sites (b) Extend: Extending site with new peers

(c) Include: Merging sites (d) Separate: Splitting sites

Figure 2.4: The different scenarios where a dynamic site information acts as a complement to static information. Peers are visualized as circles, the static site information is visualized as the ellipses including the peers in that site and the dynamic site information is visualized as the fill colors of the peers

In other scenarios, where there is no static information available at all, the dynamic site information would instead act as the only information. In these cases the benefit is obvious, as it will provide at least somewhat accurate network structure information as opposed to none at all.


2.2.3 Hive networks

Hive is deployed in a range of different enterprise networks all over the world. There are many different structures and characteristics of the networks in which Hive operates. In some networks, certain peers are only capable of communicating with other peers in the same location due to firewall settings in the network, leading to a Hive network structure that is similar to the physical structure, and the network is clustered. In other cases, no such restrictions are in place and any Hive peer is free to talk to any other Hive peer. Depending on the structure of the network, this has the potential of affecting the savings and locality negatively if peers happen to choose partners who are far from them in the physical network structure.

2.3 Related work

The problem of discovering and defining network topologies is a widely addressed problem with a variety of different approaches. Many solutions have been proposed and tested in different network environments for different purposes. They can mainly be divided into two classes: solutions based on physical network data or low-level metrics gathered from the network, and solutions solely based on application level connectivity metrics. Both were researched to understand the state-of-the-art in this topic of research.

2.3.1 Physical structure assessment

One of the most common ways to discover network topologies is to use network tools such as SNMP [11] [12]. These solutions all utilize network services on the second and third layer of the seven-layer OSI model [13], requiring access to routers and switches and specific network tools. Because of this, these approaches are not applicable in the scope of this project as these are not available. In [14] different methods for network topology discovery are covered and their characteristics are summarized as seen in table 2.1.

Methods   Layer of network   Apply scope   Topology granularity   Network load   Authentication   Implementation difficulty
ICMP      High               Middle        Middle                 High           None             Easy
SNMP      Low                Middle        Middle                 Low            Possess          Difficult
ARP       Low                Small         Middle                 Low            None             Middle
DNS       High               Large         Large                  Low            None             Middle
OSPF      Low                Middle        Middle                 Middle         None             Difficult

Table 2.1: Different methods of inferring network structure and their characteristics [14]


The approaches in the low layers of the network are not feasible due to the fact that there is no access to the tools and data needed to extract that information in the enterprise networks. That leaves trace route and ICMP [15] [16] as well as DNS. ICMP, and in particular trace route, is often used to trace network structure and is therefore interesting. However, Hive experience as well as the studies done in [14] have shown that networks often are configured to block ICMP in firewalls etc., making this approach unfeasible for this project. DNS is another protocol which inherently carries information on the network structure, by how high in the DNS hierarchy a query has to go in order to resolve the host. However, as stated in table 2.1, the granularity of the structure assessment is limited, as it relies on the presence of many DNS servers in the network architecture. This excludes DNS as a potential way of defining the network structure.

2.3.2 End-to-end measurements & Network tomography

The other class of network topology discovery methods is those based on end-to-end measurements [17] [18] [19]. These approaches are more interesting as they require no access to the underlying network; the topology discovery operates at the application level and is thus non-intrusive in the networks.

BitTorrent measurements

In [19] the authors propose a method of using BitTorrent to broadcast data and use the bandwidth metrics inferred from the behaviour of the BitTorrent algorithm to discover the topology of a grid network in France. While the approach is interesting, it is not applicable in the scope of this project for a couple of reasons. The network examined in that work is a single, high-performing grid network confined to a single country, while the enterprise networks in which Hive operates are of a different character and with a more heterogeneous structure. Furthermore, the authors of [19] only consider bandwidth as a metric to infer the network structure. In this project we consider collapsing several features to construct the weight heuristic, to examine the accuracy of the clustering when the network structures get more complex and the weights are composed of several metrics.

2.3.3 Differences

In general, there is not much work on end-to-end network topology discovery, and the work that is present is often focused on grid networks, which have different characteristics than enterprise networks and also require heavy measurements to be executed in the network to gather data. The Hive application implicitly does these measurements when running an event, thus no additional measurement systems are desired, nor are heavy measurements run just for the sake of measuring. The aim is also to study the impact of using the different metrics in the weight heuristics to determine which data is best fit to describe site affiliations.


Chapter 3

Methodology

In this chapter the different methods that were used in this project will be described.

The project started with a period of literature study to review relevant research in order to find similar work and literature containing information relevant for this project. Experiments were then conducted, using an instrumented version of the Hive application to carry out measurements gathering the connectivity data.

The data was then formatted and modelled and clustering algorithms were applied to the data to generate the results. These results were then evaluated utilizing different evaluation metrics based on ground truth. Lastly, conclusions are drawn from the results and discussed in relation to the goals of the degree project.

3.1 Research strategy

In order to understand the requirements and constraints of the problems addressed in this thesis, three main areas need research:

1. The nature of peer-to-peer applications and locality aware applications

2. Network topology inference

3. Graph clustering

All of these areas need to be researched in order to build knowledge on reasonable approaches to consider in the project. When the relevant work has been covered, the project can continue into a design and implementation phase where data aggregation and modelling techniques can be developed and applied and measurements can be executed. Here, a quantitative method is applied [20], in particular an experimental research approach where experiments are conducted using the developed solution and the result data is gathered and analyzed.


3.2 Literature study

The literature study aims to answer questions about (1) previous work within the creation of locality aware network topologies and network structure assessment, and (2) how the Hive application is designed and what is needed to help it improve its locality.

3.3 Data collection

Hive is deployed and running in many different companies in a production environment. These systems produce a large amount of metadata during events and feed it back to the backend of the system for analysis and logging. It is also possible to run experimental events on these networks, or subsets of them, in order to gather data for a specific cause. This data was made available for this thesis project to model and test algorithms on. The Hive application was instrumented to conduct random measurements on three well-defined networks, and the data was stripped of any indication of which customer it belonged to and added to the set of testing data for this project.

3.4 Data analysis

The analysis of the experimental data was done in an abductive manner [20], where deductive methods of applying statistics to the aggregated data were combined with inductive methods using previous experience and domain knowledge of the experiment networks. The combination of the two approaches led to a well-rounded analysis of the data. The data collected was used as input into a pipeline consisting of several steps:

1. Cleaning and formatting of data

2. Modelling data as a graph

3. Applying graph community detection algorithms

4. Evaluating results

3.5 Evaluation

Evaluation in this work is done in two main ways: information recovery metrics to rate the clustering, and visualization to understand the data on a more qualitative level.


3.6 Test setup & Measurements

Hive streaming events are run in corporate networks, generating real-world network data. The data is fed to the Hive backend and then pulled to a computer where the graph clustering pipeline is implemented. The algorithm produces a labelled graph which is the same as the input graph but where every vertex has an appended field containing its community association. This graph is then run through a script which re-formats the result, generating tuples of (ip, cluster). This data can then be used as site information in the live Hive client, replacing the static site information received from the customer.
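As a rough sketch of this last re-formatting step (the LabelledPeer attribute type and its field names are assumptions for illustration, not the actual Hive code), the (ip, cluster) tuples could be extracted from the labelled GraphX graph like this:

```scala
import org.apache.spark.graphx.Graph

// Hypothetical vertex attribute produced by the clustering step:
// the peer's IP address plus the appended community association field.
case class LabelledPeer(ip: String, cluster: Long)

// Extract (ip, cluster) tuples from the labelled graph and write them out
// as simple CSV lines that can be fed back as dynamic site information.
def exportSites(labelled: Graph[LabelledPeer, Double], path: String): Unit =
  labelled.vertices
    .map { case (_, peer) => s"${peer.ip},${peer.cluster}" }
    .saveAsTextFile(path)
```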


Chapter 4

Solution

In this chapter, the solution and approach are covered. The solution is implemented as a service which reads event stream metadata gathered from the back end of the Hive system, cleans the data, models it as a graph, runs the implemented graph community detection algorithm and outputs a site definition, clustering IP addresses into logical sites.

To accommodate the need for scaling depending on network size and data volumes, the service is implemented in the Apache Spark framework, utilizing the GraphX library, and deployed in Hive streaming's Spark cluster environment.

4.1 Measurements

In order to gather metrics to base the topology discovery on, the Hive client was instrumented to carry out measurements with low-bitrate video on Hive deployments in real-world customer networks. The measurement executed a random version of the Hive algorithm, sharing data with random subsets of nodes provided from a tracker. Running these tests for 10 minutes led to measurements being taken from every node to a high number of randomly distributed nodes in the network. Several metrics were gathered which were considered to have potential when rating the connection quality:

1. RTT: Round trip time. The delay in milliseconds that it took to establish a connection between the source peer and the destination peer.

2. rate: The throughput at which data could be sent from p1 to p2, in bits/second.

3. transferred: The total amount of data transferred between p1 and p2, in bytes.

4. p2pSuccess: The number of successful transfers between p1 and p2.

5. p2pFailure: The number of failed transfers between p1 and p2.


After the measurements were completed, a large amount of end-to-end communication quality metrics was available to use for the topology discovery. The data generated by the conducted measurements is:

1. Nodes - information about each node in the system: (Public IP, Private IP, GUID)

2. Connections - Detailed metrics describing the quality of connection between two nodes

This data was then modelled as a graph, where the nodes are modelled as vertices and the connection between two nodes is modelled as an edge between the two vertices that model those nodes. Using the graph data structure as a model for this system is intuitive, as graphs are often used to model networks [21]. It also gives access to graph clustering, which can allow for the detection of communities in the network.
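To make the modelling step concrete, the sketch below shows one possible representation of the node and connection records and how they could be mapped onto a weighted GraphX graph. The record layouts and the placeholder weight are assumptions for illustration; the actual weight heuristic is derived in section 4.3.3.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Assumed record layouts mirroring the measurement data described above.
case class Node(publicIp: String, privateIp: String, guid: String)
case class Connection(src: VertexId, dst: VertexId,
                      rttMs: Double, rateBps: Double, transferredBytes: Long)

// Peers become vertices; measured connections become weighted edges.
def buildGraph(sc: SparkContext,
               nodes: Seq[(VertexId, Node)],
               conns: Seq[Connection]): Graph[Node, Double] = {
  val vertices = sc.parallelize(nodes)
  val edges = sc.parallelize(conns.map { c =>
    // Placeholder weight (rate over RTT); see section 4.3.3 for the real heuristic.
    Edge(c.src, c.dst, c.rateBps / math.max(c.rttMs, 1.0))
  })
  Graph(vertices, edges)
}
```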

4.2 Graph modelling

The graph is a popular data structure often used to model data which has relations, e.g. networks [21]. Graphs G(V, E) are data structures consisting of a vertex set V and an edge set E. A vertex is a node with a set of attributes. An edge is a connection between two vertices which can either be directed or undirected and carry a weight or be unweighted. A graph where there is an equal probability for an edge e to exist between any pair of vertices is called a random graph [22]. Another structure that is more prominent in networks of different sorts is the class of graphs called small-world graphs, which have the characteristics of a low graph diameter and local cluster structures [23]. This class of graphs is more often found in nature [24]. Due to the random measurements conducted in the experiments, the graphs examined in this project are random graphs. This means that there is no cluster structure in the distribution of edges; the cluster information is instead found in the values of the edge weights, assuming a good weight heuristic which depends on connectivity metrics is used.

Graph analysis is the study of graphs and the characteristics of the data they represent. Graph analysis, and in particular graph clustering, has been identified to have great potential in many different fields, some of which are biology [25], medicine [26] and social media network analysis [27]. The structure of graphs allows for scaling and distribution of computation, making it possible to handle large sets of data and to elastically scale computation to the volume of data being handled.

4.3 Graph clustering

Many complicated problems can be solved by modelling data as graphs and applying the state-of-the-art in graph analysis. In graphs, there is a concept of ‘communities’ or ‘clusters’. A cluster in a graph G(V, E) is a subset of vertices V′ ⊂ V which has the characteristic that the vertices in that set are more densely connected to each other than to vertices outside that set. Depending on what the graph is modelling, these tight relationships between sets of vertices can mean different things, but they mean that the vertices have something in common. The concept of clusters is similar to that of connected components, with the exception that connected components do not allow for any inter-connecting edges between components. The looser constraints in graph clustering compared to connected components make it an NP-hard problem [28]. However, graph community detection, or graph clustering, is a heavily researched topic and there are a lot of approaches and different methods to detect and identify communities within graphs. These different approaches and algorithms all work slightly differently, with different strengths and weaknesses but with the same goal: to identify the different clusters or communities present in a graph.

4.3.1 Graph clustering requirements

Graph community detection algorithms can be divided into two classes: supervised graph clustering and unsupervised clustering. In supervised clustering, the goal is to cluster a graph G into k clusters; the number of clusters is thus known a priori, e.g. in k-means [29]. In some cases there is a requirement that the clusters should all be of a size v. This is called balanced clustering. In unsupervised clustering however, the algorithm only has the graph data to work with. Since the number of clusters in a peer-to-peer network will differ depending on the network, and since this information will not be available, algorithms which require knowledge of the number of clusters are not applicable.

In dialogue with Hive streaming, criteria were defined in order to find a suitable clustering algorithm which fits the need and scenario at hand. These criteria were:

• Scalability: The algorithm must allow for decentralized computation on graphs; little to no global knowledge of the system should be needed.

• Cluster quality: The communities detected by the algorithm must be meaningful.

• Integration with static knowledge: If partial site definitions are present, it would be beneficial if the algorithm could use that to aid it in its community detection.

• Speed: the algorithm should be able to identify communities in data sets within reasonable time.

• Weighted: it is crucial that the algorithm supports clustering in weighted graphs.

4.3.2 Graph clustering algorithms

In the domain of unbalanced graph clustering, much research has been done and numerous algorithms and strategies to detect communities have been developed. All of these have their strengths and weaknesses and fit different data. Based on the literature study that was conducted in this project, a few algorithms stood out as particularly interesting.

Louvain method

In [30] Blondel et al. propose a greedy modularity-maximizing approach to graph clustering. The theory is that the modularity metric proposed by Newman [31] defines the quality of a clustering of a graph and thus, by maximizing this metric, the best clustering can be found. The algorithm works in a hierarchical way by maximizing modularity locally, merging nodes found to be in the same cluster into super-vertices and repeating the method. The Louvain method is widely considered to be one of the most efficient graph clustering algorithms.

However, it requires global knowledge of the graph to function [32]. Furthermore, due to the nature of the modularity metric favouring clusterings with few, large clusters, modularity maximizing algorithms are known to struggle with fine granularity clustering [33]. In this project it is used as a reference for state-of-the-art centralized clustering.
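As a point of reference (this is the standard definition from Newman, not an equation reproduced from the thesis), the weighted modularity that these algorithms maximize can be written as:

```latex
Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
```

where A_ij is the (weighted) adjacency matrix, k_i = Σ_j A_ij is the weighted degree of vertex i, m is the total edge weight and δ(c_i, c_j) is 1 if vertices i and j are assigned to the same cluster and 0 otherwise.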

Label propagation algorithm

Due to the nature of the Hive peer-to-peer algorithm, decentralized algorithms are very interesting as they would allow the peer-to-peer network to be clustered in a distributed way. The Label Propagation Algorithm, or LPA [34], is a lightweight gossip-based algorithm which spreads community associations in an epidemic fashion as depicted in figure 4.1. At each superstep, every node gossips its community association to its neighbours. Each neighbour analyzes the different community associations it has received and chooses the most dominant one (the community that most of its neighbours associate with). In a tie situation, the algorithm picks one of the most dominant communities at random.


(a) Each vertex is initiated with a unique cluster tag. (b) Vertices spread their cluster tag and adopt the most dominant tag in their neighbourhood. (c) In cluster structures, one cluster tag will eventually dominate, (d) leading to heavily connected vertices all having the same cluster tag.

Figure 4.1: Visualization of the Label propagation algorithm working on a small graph with two identifiable communities, converging in four supersteps.

LPA has its strengths and weaknesses. Due to LPA's lightweight gossiping approach, which is inherently distributed and only requires local knowledge at every node, it is a very appealing algorithm to adopt in a peer-to-peer system. It is a very simple algorithm which has also been proven to be very potent in discovering communities. However, the decentralized and simple nature of LPA leads to weaknesses as well:

1. Consistency: Because of the random tie breaking in each node when choosing between labels of the same priority, and the order in which messages arrive at nodes, the result returned by LPA with the same graph as input can differ from run to run.

2. Label oscillation: If LPA is implemented in a synchronous computing environment, a node might be balancing between two labels and thus an endless oscillation of labels can occur, where two nodes exchange labels and choose each other's labels over and over again forever.

3. Trivial solution: Nodes in LPA try to propagate their label to everyone. This means that one accepted outcome is that every vertex is put in the same cluster, which is not the goal of clustering a graph. Thus, LPA relies on getting stuck in a local maximum of modularity, where some nodes' neighbours do not accept new labels. If the algorithm does not stop in this local maximum, LPA can produce the trivial solution to graph clustering; it considers every node in a connected component to be in the same cluster. This is not desired for obvious reasons.


4. Deteriorating cluster quality: In [35] the authors show that LPA has a peak where it produces the highest quality clusters at a specific number of supersteps. After this, the modularity deteriorates with each superstep.

5. Static clustering controls: As opposed to one of its successors, LabelRank [35], there is no parameter except for the number of iterations which can be altered to control the granularity of the clusters detected.

LabelRank

The decentralized design and simplicity of LPA have made it a popular graph clustering algorithm, and work has been put into addressing its weaknesses while keeping its strengths. LabelRank is an algorithm proposed in [35] which combines the decentralized gossiping design of LPA with the configurable and powerful cluster detection of flow-based clustering algorithms. It does so by maintaining the gossiping model of LPA, where every node is in charge of its own local computation and shares its community association with its neighbours. However, in LabelRank the local computation in every node is more powerful than just choosing the most dominant cluster ID in its incoming messages. LabelRank introduces several measures to combat the drawbacks of LPA. In LabelRank, the local computation in every node is similar to the computation done in another famous centralized graph clustering algorithm: the Markov clustering algorithm, or MCL, which clusters a graph by random walks and flow maximization [36]. There are four major operators in LabelRank to improve on LPA:

1. Propagation: In LPA, a node only keeps a single cluster tag as its state, and can thus only choose one other tag to adopt when going through its incoming messages. This means that tie-breaking can be needed and, in some cases, a future dominant label might be lost because another label was more dominant at a specific point in time. In LabelRank, a node keeps a vector of cluster tags instead of a single one. The vector contains the cluster tags and the associated probability that each is the correct cluster tag for the node. This probability is based on the tag's dominance in the neighbourhood.

This eliminates the need for tie breaking at the expense of a larger memory requirement for the state in each node.

2. Inflation: The inflation operator inflates the probabilities of the cluster tags and is the same as the inflation operator in [36]. The inflation operator is applied on the cluster tag vector at each node, increasing the probabilities of high probability tags and lowering the probabilities of low probability tags. The inflation operator also allows for control of the clustering: by changing the inflation parameter, the clustering can be tuned towards smaller, more fine-grained clusters or larger, more coarse-grained clusters.


3. Cutoff: To address the issue of growing states at each node, the cutoff operator is introduced. The operator drops any tags from the cluster tag vector which have a probability smaller than a given cutoff threshold t.

4. Conditional update: The authors of the LabelRank paper observed a decrease in the modularity of the clustering if LPA runs for too many iterations. The modularity peaked at a certain number of iterations, and after that the nodes kept switching cluster tags, producing worse clusters. This is an undesired trait of the algorithm since one needs to know where the modularity peaks to know when to stop. Thus, the conditional update operator is introduced in LabelRank, which allows a node to stop switching cluster tag associations if the changes in the cluster tag vector are smaller than a given threshold. Such small changes are an indication that the clustering has converged and that the node should stop updating. The authors of the LabelRank paper saw that this caused the modularity of the clustering produced by LabelRank to grow until the optimal number of iterations and then stay at that peak, instead of deteriorating as in LPA [35].
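A minimal sketch of the inflation and cutoff operators, expressed as operations on a node's label-probability map (an illustrative interpretation, not the thesis' or the LabelRank paper's actual code), could look as follows:

```scala
object LabelRankOperators {
  // A node's state: cluster tag -> probability that the tag is correct for this node.
  type LabelDist = Map[Long, Double]

  // Inflation: raise every probability to the power `in` and re-normalize,
  // boosting dominant tags and suppressing weak ones. Assumes a non-empty distribution.
  def inflation(dist: LabelDist, in: Double): LabelDist = {
    val raised = dist.map { case (tag, p) => tag -> math.pow(p, in) }
    val norm = raised.values.sum
    raised.map { case (tag, p) => tag -> p / norm }
  }

  // Cutoff: drop tags whose probability has fallen below the threshold `t`,
  // keeping the per-node state small.
  def cutoff(dist: LabelDist, t: Double): LabelDist =
    dist.filter { case (_, p) => p >= t }
}
```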

4.3.3 Weight heuristics

The assignment of edge weights is a crucial factor in order for the community detection algorithm to be able to perform as desired. Since the algorithm is designed to work also with the randomized measurements coming from the experiment version of Hive, where nodes randomly try other nodes in the system, the input is a random graph. In such a graph no unweighted clustering or community detection algorithm will work, because there is no clustered structure in the graph. However, the theory is that the quality of the connections will be significantly higher within sites than between sites. Thus, the weights of edges would be significantly higher within clusters and would thereby hold this cluster information. Finding a high quality heuristic for the weights is a central part of ensuring the quality of the clusters the clustering algorithm is able to produce. Therefore, studies were conducted on metrics from a network with known, accurate ground truth. The goal of this study is to detect any potential pattern in the correlation between the values of the different metrics and whether the connection was intra- or inter-site. Using the information gained from the study, different weight heuristics were proposed and later evaluated to identify the best performing one.

Metric study

In order to find a weight heuristic able to cluster nodes into sites, the metrics gathered in the measurement must contain a statistically significant correlation showing a relation between high quality connections and site affiliation. Thus, data gathered from known networks with a well-defined ground truth was examined. Several cases were considered where patterns in the metric data could fall short in clustering nodes into sites. One of these scenarios is heavily connected sites with small geographical distance and known good network links between the sites. Thus, the metric study was conducted on measurement data from such a network, with the intuition that if there are significant patterns in the data in this case, they will also appear in data from networks where the sites are further apart geographically and inter-connected with weaker links. The network studied is a well connected network split into four sites in Sweden; the results are presented in tables 4.1, 4.2 and 4.3.

Avg. Rate Site A Site B Site C Site D

Site A 2189 950 735 4393

Site B 396 2152 1021 839

Site C 696 1427 2985 1069

Site D 865 1182 698 2390

Table 4.1: Average rate for connections between sites

Avg. RTT Site A Site B Site C Site D

Site A 1 10 25 38

Site B 12 3 8 10

Site C 14 9 3 11

Site D 16 10 12 6

Table 4.2: Average RTT for connections between sites

Avg. Transferred Site A Site B Site C Site D

Site A 613040 256000 244024 32000

Site B 384000 403880 131064 116024

Site C 366064 128000 128000 1333136

Site D 952032 122040 390072 726056

Table 4.3: Average data transferred for connections between sites

Weight heuristic design

In the information gathered from the Hive clients, there is a lot of data tied to the quality of a connection between two peers. The goal is to define a function f(x1, x2, ..., xn) of these quality variables such that f(intraSiteConnection) > f(interSiteConnection). Finding this f will result in the edges in the graph being stronger within a site than between sites. Thus, with a clustering algorithm which takes edge weights into consideration, clusters can be identified even in a randomly connected graph. Due to the complex nature of the networks, and the high demand on network resources in the domain of video streaming distribution, several metrics need to be included in the weight heuristic. The connectivity metrics available in the data that are relevant are:

• Rate: The rate at which data flows between two peers

• Round trip time: The time it took for a peer to establish a connection to another peer

• Transferred: The total amount of data transferred

• p2psuccess: The number of successful data transfers

• p2pfailure: The number of failed data transfers

It is important not to use only one of these, but instead a combination of several features. This creates a more complete measure of the quality of a link, ensuring that no outliers in the metric data affect the final weight too heavily. It was also concluded that, due to the very application-specific nature of the p2pfailure and p2psuccess metrics, these should not be included in the final weight heuristic.

Additional parameters

As seen in the case example in tables 4.1, 4.2 and 4.3, with four highly connected sites, the metric data can sometimes be very similar intra- and inter-site. In table 4.2 one of the sites has better values towards another site than towards itself, and in table 4.3 there is not even an identifiable pattern. Thus, in addition to the connectivity metrics, the distance between the private IP addresses of the two peers was used as a variable in the weight heuristic. This is because there is a larger probability that two peers with highly similar private IP addresses should be clustered together than two peers with very different private IP addresses.

Figure 4.2: An example scenario of when the similarity of private IP comes into play in the weight heuristic. In this case, it is more likely that node A is in the same site as node B and should thus choose its label.

Considering that low RTT, high rate and similar private IP addresses are all factors that imply two peers being in the same site, and based on a study where different heuristics were evaluated against each other, the following weight heuristic was designed:

weight(metrics) = (rate / RTT) × δ(p1, p2)

where δ(p1, p2) is a function which returns the similarity between two IP addresses, p1 and p2, returning 2 if p1 and p2 are in the same class-C subnet and 1 if they are not. While this weight on IP similarity might seem high, the differences in the metrics are often so large that intra- and inter-site connections often differ by orders of magnitude in weight, thus a multiple of 2 for nodes in the same class-C subnet is reasonable.
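A minimal sketch of this heuristic (function and parameter names are illustrative, and a small guard against zero RTT is an added assumption) is:

```scala
object WeightHeuristic {
  // Returns 2 if the two private IPv4 addresses are in the same class-C (/24) subnet, 1 otherwise.
  def delta(ip1: String, ip2: String): Double = {
    def prefix(ip: String): String = ip.split('.').take(3).mkString(".")
    if (prefix(ip1) == prefix(ip2)) 2.0 else 1.0
  }

  // weight = (rate / RTT) * delta(p1, p2); a higher value indicates a more intra-site-like connection.
  def weight(rateBps: Double, rttMs: Double, privIp1: String, privIp2: String): Double =
    (rateBps / math.max(rttMs, 1.0)) * delta(privIp1, privIp2)
}
```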

4.3.4 Weight heuristic evaluation

The goal of the weight heuristic is to give a high score to good connections and a low score to bad connections. The defined heuristic does so by rewarding high rate and similar IP addresses and penalizing high RTT. While the parameters and design of the heuristic are intuitive and based on metric data, the heuristic needs to be tested empirically to see if it helps the clustering algorithm in its aim to identify correct clusters. This is achieved by comparing the results of the clustering algorithm running on the same data set while varying only the weight heuristic; the comparison is presented in chapter 5.

4.4 Implementation

The solution is implemented and made available to Hive as a service for dynamic site discovery.

4.4.1 System design

The system structure is designed as modules dedicated to completing specific tasks related to the solution and is shown in figure 4.3.

Figure 4.3: The structure of the dynamic site definition solution.


4.4.2 Graph processing

Since graphs are a common data structure with characteristics fitting for distributed computing, frameworks and models for graph processing have been developed to take advantage of the distributed nature of this data structure. These include GraphX [37], Giraph [38], Gelly [39] and PowerGraph [40]. The Hive environment utilizes a Spark cluster, and thus an implementation in this environment would increase its usability as a service. Therefore GraphX was chosen for the implementation.

Apache Spark

Apache Spark [41] is an in-memory distributed computational framework which was developed to improve on the acyclic data model of MapReduce, where data needed to be read from and written to disk between operations. Spark utilizes DAGs, directed acyclic graphs, to define computation on data. Every node in the DAG receives input from previous nodes, executes some operation on that input and generates output to the next node in the graph. A central point in the Spark framework is its immutable data structure, the resilient distributed dataset, or RDD. Spark also has a range of libraries spanning machine learning, streaming computation and graph computation.

GraphX

GraphX is a graph processing framework built on top of Apache Spark, utilizing the underlying framework to allow for easy graph computation by introducing Graph objects with VertexRDDs and EdgeRDDs [37]. It implements many common graph theory operators which make distributed computation on graphs simple for a developer. It also exposes a Pregel API which allows for the use of the GAS model as described in the PowerGraph paper [40].

Gather, apply, scatter

GAS is a computational model which divides computation on graphs into three steps: Gather, Apply and Scatter. In the GAS model, any computation on graphs can be expressed as to fit into these three steps. In the gather step, each node gathers information that has been propagated from its neighbours. In the apply step the node considers the information it has gathered and executes some local computation to either create a new state and/or produce new output. In the scatter step, the node shares its newly computed data with its neighbours. This process iterates on each node for a given number iterations. When every node in the graph has run an iteration of the three steps, the system has completed what is called a super-step. The implementation of the clustering algorithms was done in Scala on GraphX on Spark utilizing the GraphX Pregel API, formulating the steps in the algorithm in the GAS model. The first iteration was to implement simple


The first iteration was to implement simple LabelPropagation, in order to see if the implementation functioned well enough for the algorithm to still be considered. LPA was already implemented and could thus be used directly.

The results produced displayed two main issues, which were the basis for the move to the more sophisticated LabelRank algorithm instead. These were the ones covered in the theoretical treatment of the algorithms: the results were non-deterministic and there was no control over granularity, and LabelPropagation could not perform well enough in the random graph tests.

Algorithm 1 LabelPropagation

procedure Gather
    for message in inbox do
        state.cluster_map ← (message.cluster_tag, 1)
procedure Apply
    state.cluster ← state.cluster_map.max()
procedure Scatter
    for neighbour in neighbours do
        send(neighbour, state.cluster)

Label propagation simply chooses the cluster tag which is most dominant among its incoming messages and propagates it to its outgoing neighbours. Its simplicity is very appealing and it fits very well with the GAS implementation.
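For reference, a sketch of how the gather/apply/scatter steps above map onto the GraphX Pregel API is given below. It roughly mirrors the label propagation shipped with GraphX but takes edge weights into account; it is an illustrative sketch rather than the exact code used in Hive.

import org.apache.spark.graphx._

// Sketch of weighted label propagation expressed with the GraphX Pregel API.
// Vertex state is the current cluster label; messages are maps from label to
// the accumulated edge weight voting for that label.
object WeightedLabelPropagationSketch {
  type Label = VertexId
  type Votes = Map[Label, Double]

  def run[VD](graph: Graph[VD, Double], maxSteps: Int): Graph[Label, Double] = {
    // Every vertex starts in its own cluster.
    val lpaGraph = graph.mapVertices { case (vid, _) => vid }

    // Scatter: each endpoint votes for its current label along the edge,
    // weighted by the edge weight.
    def sendMessage(e: EdgeTriplet[Label, Double]): Iterator[(VertexId, Votes)] =
      Iterator(
        (e.srcId, Map(e.dstAttr -> e.attr)),
        (e.dstId, Map(e.srcAttr -> e.attr)))

    // Gather: merge incoming vote maps by summing the weight per label.
    def mergeMessage(a: Votes, b: Votes): Votes =
      (a.keySet ++ b.keySet).map { label =>
        label -> (a.getOrElse(label, 0.0) + b.getOrElse(label, 0.0))
      }.toMap

    // Apply: adopt the label with the highest accumulated weight.
    def vertexProgram(vid: VertexId, current: Label, votes: Votes): Label =
      if (votes.isEmpty) current else votes.maxBy(_._2)._1

    Pregel(lpaGraph, Map.empty[Label, Double], maxIterations = maxSteps)(
      vertexProgram, sendMessage, mergeMessage)
  }
}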

Algorithm 2 LabelRank

procedure Gather
    for message in inbox do
        state.cluster_map.merge(message.weight, message.cluster_map)
procedure Apply
    inflated_probabilities ← inflation(state.cluster_map, inflation_coefficient)
    pruned_probabilities ← cutoff(inflated_probabilities, cutoff_threshold)
    state.cluster ← pruned_probabilities
procedure Scatter
    for neighbour in neighbours do
        send(neighbour, state.cluster)

LabelRank is more complex and computationally heavier than label propagation, but it has the same core idea of propagating cluster associations by gossiping.
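As a complement to the pseudocode, the two operators used in the apply step can be sketched as follows. The label distribution is represented as a map from cluster tag to probability, and the renormalization details are assumptions of this sketch rather than the exact implementation.

// Sketch of the inflation and cutoff operators from the LabelRank apply step.
object LabelRankOperatorsSketch {
  type LabelDist = Map[Long, Double]

  // Inflation: raise each probability to the inflation coefficient and renormalize,
  // which sharpens the distribution towards its dominant labels.
  def inflation(dist: LabelDist, coefficient: Double): LabelDist = {
    val powered = dist.map { case (label, p) => label -> math.pow(p, coefficient) }
    val total = powered.values.sum
    powered.map { case (label, p) => label -> p / total }
  }

  // Cutoff: drop labels whose probability falls below the threshold, then renormalize.
  def cutoff(dist: LabelDist, threshold: Double): LabelDist = {
    val kept = dist.filter { case (_, p) => p >= threshold }
    if (kept.isEmpty) dist   // keep the old distribution if everything was pruned
    else {
      val total = kept.values.sum
      kept.map { case (label, p) => label -> p / total }
    }
  }
}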

4.5 Applying the solution

With the solution designed and implemented, it is ready to be run in the Hive environment to analyze data and detect sites in a dynamic, unsupervised way. The results of the clustering are presented in the next chapter.


Chapter 5

Results and Evaluation

In this chapter the results of the solution are presented in two forms: visualizations of the networks and their clusterings, and the evaluation metrics for each result. The metrics are displayed for all the weight heuristics that were tested, while the graph visualization shows the clustering obtained with the most accurate weight heuristic.

5.1 Evaluation

When evaluating graph clustering results, because of the nature of the data sets used, there is often no ground truth to evaluate against. This has led to a range of evaluation metrics capable of rating the quality of a clustering based on the definition of a cluster rather than on the accuracy with reference to some ground truth.

The most popular of these metrics are coverage, a simple metric comparing the number of edges within a cluster to the total number of edges, modularity [31] and conductance [42]. While these metrics are widely used and often good measures of how well a clustering has managed to identify clusters in a graph, they are very dependent on the structure of the graph data and favour coarse-grained clusterings [33]. However, in this project the experiments were conducted in such a way that a ground truth is available. This allows for the use of a range of different metrics for supervised classification. As these metrics accurately show how similar the clustering is to the truth and are not vulnerable to different structures of the data, they were used to evaluate the results in this thesis.

5.1.1 Clustering metrics

In order to give the most complete evaluation of the results, two widely used metrics for information recovery were used: NMI, normalized mutual information, and FMI, the Fowlkes-Mallows index.


5.1.2 Normalized mutual information

Normalized mutual information, NMI, is the normalized version of the mutual information, MI, of a clustering [43]. The MI of a clustering, where nodes belong to a class C_i and a cluster K_j, is defined as:

MI(C, K) = \sum_{i=1}^{|C|} \sum_{j=1}^{|K|} \frac{|C_i \cap K_j|}{N} \log \frac{|C_i \cap K_j| / N}{(|C_i| / N)(|K_j| / N)}

This measure is then normalized as:

NMI(C, K) = \frac{MI(C, K)}{\sqrt{H(C) H(K)}}

where H(C) is defined as

H(C) = -\sum_{i=1}^{|C|} \frac{|C_i|}{N} \log \frac{|C_i|}{N}

and H(K) is defined as

H(K) = -\sum_{i=1}^{|K|} \frac{|K_i|}{N} \log \frac{|K_i|}{N}

This value is in the range [0, 1], where 0 corresponds to a random clustering and 1 to an identical match between the clustering and the ground truth.
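A direct, non-distributed transcription of this definition, used here only to illustrate the computation (node ids and label types are arbitrary), could look as follows.

// Sketch: NMI computed directly from the definition above, for small assignments
// given as maps from node id to ground-truth class and to cluster label.
object NmiSketch {
  def nmi(classes: Map[Long, Int], clusters: Map[Long, Int]): Double = {
    val n = classes.size.toDouble
    val cSets = classes.groupBy(_._2).values.map(_.keySet).toList
    val kSets = clusters.groupBy(_._2).values.map(_.keySet).toList

    // MI(C, K): sum over all (class, cluster) pairs with a non-empty overlap.
    val mi = (for {
      c <- cSets
      k <- kSets
      overlap = (c intersect k).size / n
      if overlap > 0
    } yield overlap * math.log(overlap / ((c.size / n) * (k.size / n)))).sum

    // H(·): entropy of a partition of the node set.
    def entropy(parts: List[Set[Long]]): Double =
      -parts.map(p => (p.size / n) * math.log(p.size / n)).sum

    mi / math.sqrt(entropy(cSets) * entropy(kSets))
  }
}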

5.1.3 Fowlkes-Mallows Index

The Fowlkes-Mallows index is a score comparing a clustering or classification with a known ground truth [44]. It is defined as

FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}

where TP, the true positives, is the number of nodes from class C_i in the ground truth which are correctly assigned to the corresponding cluster K_i; FP, the false positives, is the number of nodes not in class C_i but still put in cluster K_i; and FN, the false negatives, is the number of nodes in class C_i but incorrectly not put into the corresponding cluster K_i. The FMI score, which is also in the range [0, 1], is a widely used score for clusterings where a ground truth is available.
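The formula translates directly into code; the snippet below is just a transcription of the definition above.

// Direct transcription of the FMI formula above.
def fmi(tp: Double, fp: Double, fn: Double): Double =
  tp / math.sqrt((tp + fp) * (tp + fn))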

5.2 Evaluation of weight heuristic

One of the central points in this work is to construct a heuristic that models the relationships between nodes by mapping the different connectivity metrics available to a weight applied to the edge between the two nodes in the graph.


Since the peer selection algorithm is random and every node can talk to every other node, the weights are the only data which can reveal site associations. Thus, different weight heuristics, composed of different combinations and weighings of the metrics, were proposed and tested to see which performed best. The weight heuristics were evaluated by running both the Louvain and the LabelRank clustering algorithms on the data set with each heuristic, to see which heuristic produced the highest score, i.e. clustered the nodes most accurately with reference to the ground truth. The heuristics tested are listed below (ordered from left to right in the figures); a sketch of fw5 follows the list:

• fw1: unweighted

• fw2: (rate)

• fw3: avg. rtt / rtt

• fw4: rate / rtt

• fw5: (rate / rtt) × δ(p1, p2)
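To make fw5 concrete, a hedged sketch is given below. The exact form of δ(p1, p2) used in the experiments is not reproduced here, so the prefix-based IP similarity in this sketch is an illustrative assumption.

// Illustrative sketch of fw5: reward rate, penalize rtt and boost peer pairs whose
// IP addresses look close to each other. The prefix-based delta is an assumption.
object WeightHeuristicSketch {
  // delta(p1, p2): 1 plus the number of leading IPv4 octets the two addresses share.
  def delta(p1: String, p2: String): Double = {
    val shared = p1.split('.').zip(p2.split('.')).takeWhile { case (a, b) => a == b }.length
    1.0 + shared
  }

  // fw5 = (rate / rtt) * delta(p1, p2)
  def fw5(rate: Double, rtt: Double, ip1: String, ip2: String): Double =
    (rate / rtt) * delta(ip1, ip2)
}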

The results in figures 5.1, 5.2 and 5.3 show an increase in both NMI and FMI when using the more sophisticated weight heuristics.

Figure 5.1: Graph showing the NMI and FMI values for different weight heuristics in the Sweden network


Figure 5.2: Graph showing the NMI and FMI values for different weight heuristics in the International network

Figure 5.3: Graph showing the NMI and FMI values for different weight heuristics in the North America network


