IT 17 036

Degree project, 15 credits, July 2017

An evaluation of random-walk based clustering of multiplex networks

Kristofer Sundequist Blomdahl


Faculty of Science and Technology, UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

An evaluation of random-walk based clustering of multiplex networks

Kristofer Sundequist Blomdahl

A network, or a graph, is a mathematical construct used for modeling relationships between different entities. An extension of an ordinary network is a multiplex network. A multiplex network enables one to model different kinds of relationships between the same entities, or even to model how relationships between entities change over time. A common network analysis task is to find groups of nodes that are unusually tightly connected. This is called community detection, and is a form of clustering. The multiplex extension complicates both the notion of what a community is, and the process of finding them.

This project focuses on a random-walk based local method that can be used to find communities centered around supplied seed nodes. An implementation of the method is made which is used to evaluate its ability to detect communities in different kinds of multiplex networks.

Supervisor: Roberto Interdonato


Contents

1 Introduction
2 Background
  2.1 Multiplex networks
  2.2 Community
  2.3 Related work
3 ACL cut
  3.1 Random walk
    3.1.1 Stationary distribution
  3.2 Conductance
  3.3 Personalized PageRank-vector approximation
  3.4 Sweep cut
4 Implementation
5 Experiments
  5.1 Accuracy
    5.1.1 Accuracy with respect to noise
    5.1.2 Real dataset
  5.2 Scalability
6 Results
  6.1 Accuracy over noise
  6.2 Accuracy on the Aucs network
  6.3 Scalability
    6.3.1 Network preparation
    6.3.2 Community extraction
    6.3.3 Accuracy as size increases
7 Future work
8 Conclusion
A Dataset statistics
B Experiment results


1 Introduction

A network is a construct for modeling relationships between different entities. A network is made up of a set of vertices which are connected through a set of edges. The vertices represent entities, and the edges represent relationships between entities. A popular way to analyze a network is the identification of communities. A community is a subset of vertices in a network that are highly connected to each other and well separated from the rest of the network. Finding these communities is a well studied subject with many applications.

Ordinary networks are limited in what systems they can model as they only describe one kind of relationship. The extension of multiplex networks, which are networks with many layers representing different kinds of relationships, enables one to accurately model systems with complex interactions. It however also complicates the notion of a community and the process of finding them.

This project focuses on a state-of-the-art community detection method described in "A local perspective on community structure in multilayer networks", as part of a larger comparison of multiplex community detection methods [1]. The method is based on finding parts of a network that act like bottlenecks on interlayer-transitioning random walks, with the assumption that these bottlenecks constitute good communities. The bottlenecks are found by approximating how the random walks spread from seed vertices, and by including vertices based on how easily they are reached from the seed. The use of seed vertices makes this a local method.

An implementation of the method is made which is used to evaluate its ability to detect communities in noisy networks as well as how it scales with network and community size.

The evaluation also includes a comparison with the ML-LCD algorithm, which is another local community detection method [10].

2 Background

This section covers background information related to networks, multiplex networks and the notion of a community in multiplex networks.

2.1 Multiplex networks

A network consists of a set of vertices v ∈ V which are connected through a set of edges (va, vb) ∈ E.

A multiplex network is an extension of an ordinary network. Multiplex networks consist of layers of ordinary networks, where each layer contains the same vertices, but with different edges connecting them. Entities represented in the network have a vertex representation on every layer, and the set of all its vertices across layers is called an actor.

The multiplex extension makes it possible to model systems where actors have many different kinds of relationships. For example, in a social network, one layer could represent a friendship relationship while a second layer could represent a co-worker relationship.

This extra dimension of information enables the discovery of many interesting patterns since one can study correlations between layers. Figure 1 illustrates a multiplex network.


Figure 1: A multiplex network with two layers: α and β. The two layers contain the same vertices, but have different edges. The vertices connected by a dotted edge represent the same actor.

The topology of an ordinary network is commonly represented by an adjacency matrix. An adjacency matrix encodes the edges of a network as in equation 1, where i and j are vertices in the network.

A_{ij} = { 1, (i, j) ∈ E
           0, otherwise        (1)

To represent multiplex networks, one can extend the adjacency matrix of an ordinary network to an adjacency tensor (i.e. a 3D matrix). Equation 2 shows the entry of the adjacency tensor AT describing the connection between vertex i in layer α and vertex j in layer β.

AT_{iα,jβ} = { 1, (iα, jβ) ∈ E
               0, otherwise        (2)
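As a concrete illustration of equation 2, a small two-layer adjacency tensor can be built directly. The toy network, the indexing convention, and the `add_edge` helper below are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

# Layers indexed 0 (alpha) and 1 (beta); actors 0..3. AT[a, i, b, j] is 1
# iff there is an edge between vertex i in layer a and vertex j in layer b.
n_actors, n_layers = 4, 2
AT = np.zeros((n_layers, n_actors, n_layers, n_actors), dtype=int)

def add_edge(a, i, b, j):
    """Add an undirected multiplex edge (i_a, j_b)."""
    AT[a, i, b, j] = 1
    AT[b, j, a, i] = 1

add_edge(0, 0, 0, 1)   # intra-layer edges on alpha...
add_edge(0, 1, 0, 2)
add_edge(1, 0, 1, 2)   # ...and on beta
for actor in range(n_actors):
    add_edge(0, actor, 1, actor)   # interlayer coupling per actor

print(int(AT.sum()) // 2)  # 7 undirected edges in total
```

Storing both (iα, jβ) and (jβ, iα) keeps the tensor symmetric, matching the undirected edges used throughout the thesis.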

2.2 Community

A popular way to analyze a network is the identification of communities. A community is a subset of vertices that are tightly connected within the subset, yet sparsely connected to its complement. For example, in a social network, a group of friends would likely constitute a community since the connections within the group would be very dense while the connections to people outside the group would be sparse in comparison. A community is related to the notion of a cluster.


The extension of multiplex networks makes the notion of a community more complicated, and it is not well defined. One way to define a multiplex community is to say that its vertices should be tightly connected on all layers. This definition uses information from all layers to verify that the set really is a community. Using this metric, the set of actors {5, 6, 7, 8} in figure 2 makes a good community, while the set {1, 2, 3, 4} creates a bad one since they are hardly connected on layer β.

A problem with this metric is that the community structure of {1α, 2α, 3α, 4α} is neglected. This warrants a community to be defined on a vertex level rather than on an actor level. This way one can isolate {1α, 2α, 3α, 4α} as a good community even though the corresponding β-set is bad. Depending on the problem domain, though, a vertex level definition may not make sense.

While a vertex level definition enables the {1α, 2α, 3α, 4α} community, it raises the question of when vertices on different layers should be in the same community. One answer is when vertices on different layers form similar communities, like with vertices {5α, 6α, 7α, 8α, 5β, 6β, 7β, 8β}.

Figure 2: A two layer multiplex network illustrating different community structures on the different layers.

Different definitions of a community make sense for different problem domains, but the general notion is that a community is a set of vertices where their connectivity groups them together. The question is how to fuse information from different layers to enhance understanding.

2.3 Related work

A multiplex network is a more expressive model than an ordinary network, and traditional community detection algorithms do not directly translate. There are currently three general approaches to community detection in multiplex networks [4].


The first approach is to flatten the network into a single layer, and then apply existing community detection algorithms on it. There are many different ways of flattening the networks, and they can lead to very different results. One example is to let the edges in the flattened graph be weighted by how many edges exist between the corresponding actors. This method is called the "Projection-Average" [3].

The second approach is to perform community detection on each layer individually, and then use this information to form multilayer communities. Given a community partitioning of each layer, one can measure the similarity between vertices by the number of communities shared. The similarity score can then be used to form multiplex communities [5].

The third approach is to operate directly on the multilayer structure. An example of this is LART (Locally Adaptive Random Transitions), which uses a special type of random walk to obtain distances between vertices which is then used to perform agglomerative clustering [6].

The common community detection approach is global, but like the method evaluated in this project, it can also be done locally. A local approach means that a community is found that is centered around a seed vertex. One method that uses this strategy is the ML-LCD algorithm [10]. The method works by iteratively expanding a community starting from a seed, including neighboring vertices that increase the internal connectivity compared to the external connectivity.

3 ACL cut

This section describes the random-walk based multiplex community detection algorithm proposed in A Local Perspective on Community Structure in Multilayer Networks [1].

This method differs from conventional community detection methods in that it is local: it will only find communities that are centered around a supplied set of seed vertices. This makes global partitioning clumsy, but it opens up opportunities for finding communities containing a specific set of vertices that conventional methods miss. It also makes the method fast.

The general idea of the algorithm is to approximate how random-walks spread through the network and find sets of vertices that act as bottlenecks on the spread. The process is to start from a set of seed vertices and locally approximate the spreading from those vertices. If the seed vertices are part of a community, then that community will act as a bottleneck on the random-walks, and can be detected efficiently.

3.1 Random walk

A random walk is a process that randomly traverses a network (in this case) given some rules of transition. A walker will be on a vertex in the network, and randomly transition to an adjacent vertex. The paper presents two different random walks: the classical and the relaxed.

The classical random walk is defined in equation 3, where P_{iα→jβ} denotes the probability to transition from vertex iα to vertex jβ. All vertices that correspond to the same actor are connected with the same weight ω ∈ [0, ∞), i.e. AT_{iα,iβ} = ω for α ≠ β. This random walk either transitions in the same layer to a vertex representing another actor, or transitions to another layer to a node representing the same actor [7].

P_{iα→jβ} = AT_{iα,jβ} / Σ_{jβ∈V} AT_{iα,jβ}        (3)

The relaxed random walk is defined in equation 4 where r ∈ [0, 1] is the probability to switch layers and δ(α, β) is 1 if α = β else 0. With probability 1 − r the walker stays within the same layer and transitions uniformly at random to an adjacent vertex. With probability r the random walker transitions uniformly at random to any vertex that is adjacent to any vertex corresponding to the same actor [8].

P_{iα→jβ} = (1 − r) δ(α, β) · AT_{iα,jβ} / Σ_{j∈Actor} AT_{iα,jβ} + r · AT_{iα,jβ} / Σ_{j∈Actor, β∈Layer} AT_{iα,jβ}        (4)

The parameters ω and r similarly control the interlayer behavior of the walks. They are parameters of the algorithm which control how the random walks spread across layers. For example, if ω is set to 0 in the classical random walk, and there is only one seed vertex, then the random walk will not transition between layers at all, and the only communities found will be on the same layer as the seed. If, on the other hand, ω is set high, meaning that the random walker is likely to switch layers, then the communities found will also span many layers, since a bottleneck to a layer-switching random walk has to be present on many layers.
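The classical walk's transition probabilities (equation 3) can be sketched by flattening the layers into a single supra-adjacency matrix and row-normalizing. The function name, the block indexing, and the toy two-layer network below are assumptions for illustration, not the thesis's C++ implementation.

```python
import numpy as np

def classical_transition_matrix(layers, omega):
    """Sketch of equation 3: build a supra-adjacency matrix with interlayer
    weight omega on each actor's cross-layer couplings, then row-normalize.
    layers: list of equally sized intra-layer adjacency matrices."""
    L, n = len(layers), layers[0].shape[0]
    A = np.zeros((L * n, L * n))
    for a, adj in enumerate(layers):
        A[a*n:(a+1)*n, a*n:(a+1)*n] = adj        # intra-layer edges
    for a in range(L):
        for b in range(L):
            if a != b:                            # couple i_a with i_b
                A[a*n:(a+1)*n, b*n:(b+1)*n] = omega * np.eye(n)
    rowsum = A.sum(axis=1, keepdims=True)
    rowsum[rowsum == 0] = 1.0                     # guard isolated vertices
    return A / rowsum

alpha = np.array([[0,1,1], [1,0,1], [1,1,0]], dtype=float)  # a triangle
beta  = np.array([[0,1,0], [1,0,0], [0,0,0]], dtype=float)  # a single edge
P = classical_transition_matrix([alpha, beta], omega=1.0)
print(P.shape)  # (6, 6): 3 actors x 2 layers
```

With ω = 1, vertex 0 in layer α has two intra-layer neighbors and one interlayer coupling, so each of its three transitions gets probability 1/3.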

3.1.1 Stationary distribution

The stationary distribution of a random walk is a probability distribution describing the random walker's likelihood of being at each vertex after it has made an infinite number of transitions. It is denoted p(∞); p_{iα}(∞) is the probability that the random walker is at vertex iα at time ∞. Figure 3 shows the stationary distribution of the classical random walk on a two-layered network. An interlayer weight of 1 was used, making a transition to the corresponding node in the other layer equally likely as a transition to any intra-layer neighbor. One can see that the probability of being at node 7β is higher than at 3β, since it is better connected.
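For a small single-layer example, the stationary distribution can be computed by plain power iteration. (The thesis implementation instead extracts it with the Eigen and Spectra libraries; the graph and function name below are illustrative.)

```python
import numpy as np

def stationary_distribution(P, iters=10_000):
    """Power iteration: repeatedly apply one random-walk step to an
    initially uniform distribution until it (numerically) converges."""
    p = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        p = p @ P
    return p

# Random walk on a triangle with a pendant vertex (row-normalized adjacency).
A = np.array([[0,1,1,0], [1,0,1,0], [1,1,0,1], [0,0,1,0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)
p_inf = stationary_distribution(P)
print(p_inf)  # proportional to degree: [2, 2, 3, 1] / 8
```

For an undirected graph the stationary probability of a vertex is proportional to its degree, which matches the observation above that better-connected vertices carry more stationary mass.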


Figure 3: The stationary distribution of the classical random-walk using an interlayer weight of 1.

3.2 Conductance

Conductance is, from a random-walk perspective, a measure of how easy it is to escape from a set of vertices. In the quest of community detection, a community can be viewed as a set of vertices with low conductance, i.e. a set of vertices that trap random walks.

Equation 5 presents the formula used to calculate the conductance of a set of vertices S [9].

φ(S) = ( Σ_{iα∈S} Σ_{jβ∉S} P_{iα→jβ} · p_{iα}(∞) ) / Σ_{iα∈S} p_{iα}(∞)        (5)

The definition of conductance given in equation 5 measures the probability to leave a set of vertices relative to the probability to be in the set.

The intuition is that it should be far more probable to stay within a community than to leave it, since a community is very tightly connected to itself and poorly connected to its complement.
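Equation 5 can be evaluated directly for a small single-layer walk. The two-triangle toy graph below is an illustrative assumption, chosen because it has an obvious community structure.

```python
import numpy as np

def conductance(P, p_inf, S):
    """Equation 5: stationary probability mass leaving S in one step,
    relative to the stationary mass inside S."""
    S = set(S)
    out = sum(p_inf[i] * P[i, j]
              for i in S for j in range(P.shape[0]) if j not in S)
    return out / sum(p_inf[i] for i in S)

# Two triangles joined by a single bridge edge: two obvious communities.
A = np.array([[0,1,1,0,0,0], [1,0,1,0,0,0], [1,1,0,1,0,0],
              [0,0,1,0,1,1], [0,0,0,1,0,1], [0,0,0,1,1,0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)
p_inf = A.sum(axis=1) / A.sum()    # degree / 2|E| for an undirected graph
print(conductance(P, p_inf, {0, 1, 2}))  # low: the triangle traps the walk
```

The triangle {0, 1, 2} scores 1/7, while an arbitrary pair such as {0, 1} scores 0.5; the lower conductance marks the better community, exactly the intuition above.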

3.3 Personalized PageRank-vector approximation

Given a seed vertex (or a set of vertices), the process of the algorithm is to approximate how random walks spread from the seed. To approximate this spreading, a personalized PageRank score is used. A personalized PageRank score measures the probability of being at each vertex, given that the random walker starts from the seed and is, at each transition, teleported back to the seed with a certain probability γ [13]. The probability mass is thus centered around the seed vertices, and describes the importance of each vertex as an information spreader from the seed. Figure 4 illustrates a personalized PageRank score using the node 1α as a seed. One can clearly see how nodes near the seed have a higher probability mass than nodes further away.

Figure 4: Personalized PageRank score of the classical random walk with interlayer weight 0.8 and with a teleportation rate of .1, using 1α as seed.

Calculating the personalized PageRank score of the whole network is computationally expensive, so a local approximation is instead used.

The personalized PageRank approximation is done by a series of push procedures from a residual vector to a PageRank vector representing the vertices in the network [2]. The residual vector starts off with all probability mass being at the seed and the PageRank vector starts off with zero probability mass.

The push procedure involves pushing a factor γ of a vertex’s residual to its PageRank, and then spreading half of the remaining residual to its adjacent vertices weighted by transition probability. This push procedure is repeated for all vertices that have enough residual.

Whether a vertex has enough residual depends on a truncation parameter ε and the stationary distribution: a push is made if the vertex's residual is bigger than ε · p_v(∞). If a vertex has a high stationary probability, then its residual also needs to be high, since a high stationary probability means that the vertex is highly connected. Such vertices will thus only be pushed once many of their adjacent vertices have pushed residual to them. The truncation parameter ε determines the exactness of the approximation; a low ε lowers the threshold for when a push is made, so the approximation is more exact and reaches further from the seed.

Figure 5 illustrates an approximation after one iteration.


Figure 5: One iteration into an approximation of a personalized PageRank vector using 1α as a seed, with γ = 0.1, using the classical random walk. Each vertex has an r and a pr score corresponding to the residual and PageRank score of the vertex.

This process is efficient since, apart from needing the stationary distribution, it only ever operates on the local environment of a vertex, which makes it independent of network size.

The result of this step is an ε-approximated personalized PageRank-vector with teleportation probability γ, describing the importance of each vertex in the information spreading from the seed.

3.4 Sweep cut

Given a personalized PageRank-vector, the sweep cut step orders the vertices by PageRank value, such that PageRank(v_n) ≥ PageRank(v_{n+1}), and returns the subsets {{v1}, {v1, v2}, {v1, v2, v3}, {v1, v2, v3, v4}, ...} together with the conductance of each subset.

The conductance values of each subset are calculated in an efficient online manner.

A sweep cut of the network in figure 4, with the same personalized PageRank score, would yield 16 subsets with their corresponding conductances. The ordering of the vertices would be: {1α, 2α, 3α, 4α, 1β, 3β, 2β, 4β, 6α, 5α, 6β, 7α, 7β, 8α, 5β, 8β}, with corresponding conductances of the sweeps: {1, 0.73, 0.5, 0.3, 0.35, 0.27, 0.24, 0.12, 0.2, 0.3, 0.4, 0.5, 0.7, 0.73, 1, 1}. One can see that the global minimum is at 8 nodes, i.e. the set {1α, 2α, 3α, 4α, 1β, 3β, 2β, 4β}, and interestingly there is also a local minimum at four nodes: the set {1α, 2α, 3α, 4α}.

Communities can be extracted from the sweep sets either by only taking the global minimum, or by also including local minima.
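The sweep cut can be sketched as below. Unlike the efficient online computation mentioned above, this version naively recomputes the conductance of every prefix, and the PageRank vector in the example is hand-made for illustration rather than produced by the approximation step.

```python
import numpy as np

def sweep_cut(P, p_inf, pr):
    """Return the prefix set with minimum conductance when vertices are
    ordered by decreasing PageRank value (global minimum only)."""
    order = np.argsort(-pr)
    best_set, best_phi = None, float("inf")
    for k in range(1, len(order)):          # the full vertex set is excluded
        S = set(order[:k].tolist())
        out = sum(p_inf[i] * P[i, j]
                  for i in S for j in range(len(pr)) if j not in S)
        phi = out / sum(p_inf[i] for i in S)
        if phi < best_phi:
            best_set, best_phi = S, phi
    return best_set, best_phi

# Two triangles joined by one edge; a hand-made PageRank vector that is
# concentrated on the left triangle.
A = np.array([[0,1,1,0,0,0], [1,0,1,0,0,0], [1,1,0,1,0,0],
              [0,0,1,0,1,1], [0,0,0,1,0,1], [0,0,0,1,1,0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)
p_inf = A.sum(axis=1) / A.sum()
pr = np.array([0.40, 0.20, 0.15, 0.10, 0.08, 0.07])
S, phi = sweep_cut(P, p_inf, pr)
print(S, phi)  # the left triangle has the lowest conductance
```

Returning only the global minimum mirrors the behavior described next in the implementation section; collecting local minima would require keeping all prefix conductances instead of just the best one.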


4 Implementation

The implementation of the algorithm has been made in C++ for efficiency and in order to be included in a multilayer network analysis library. It involves two steps: preparation and community extraction.

The preparation part of the implementation takes three parameters: a multilayer network data structure, a random walk type with an interlayer weight parameter, and a random teleportation parameter. It then translates the network into a transition matrix based on the random walk type and interlayer weight parameter. As the random walkers can potentially transition from any node to any other node in the network, the matrix has |actors| · |layers| rows and columns, i.e. |actors|² · |layers|² entries. The majority of these entries are however usually zero, so a sparse matrix implementation is used. The stationary distribution of a random walk is calculated by extracting either the largest eigenvector or a PageRank vector from the transition matrix, based on the random teleportation parameter [13]. The sparse matrix representation and the eigenvector and PageRank calculations are done using the software libraries Eigen and Spectra [14][15].

To extract a community, the personalized PageRank vector approximation and sweep cut procedure are performed, parameterized by ε, γ and a seed. The community returned is the global minimum conductance set from the sweep set.

As the network preparation can be costly, the same transition matrix and stationary distribution can be reused for different community extraction parameters, making extraction of multiple communities from the same network and random walk efficient.

5 Experiments

The experiments focus on evaluating the accuracy and scalability of the ACL-cut method.

A comparison is also made with a Python implementation of the ML-LCD algorithm to see how the ACL-cut method contrasts with it.

5.1 Accuracy

The main method used to measure accuracy is to compare produced communities with ground truth. The ground truth used is datasets with community labeled actors.

To produce a community using the ACL-cut, all vertices corresponding to the same actor are selected as a seed to the method. The resulting community is then simplified to an actor level by including the actor in the community if it is present on any layer.

The simplified community is then compared to the ground truth community of the actor used as seed. This is done for every actor in the network.
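The actor-level simplification described above can be sketched as follows; the representation of a vertex-level community as (actor, layer) pairs and the layer names are illustrative assumptions.

```python
def to_actor_level(vertex_community):
    """Keep an actor if any of its vertices is in the community."""
    return {actor for actor, layer in vertex_community}

# A vertex-level community: actor 1 appears on both layers, while
# actors 2 and 3 appear only on layer alpha.
community = {(1, "alpha"), (2, "alpha"), (3, "alpha"), (1, "beta")}
print(sorted(to_actor_level(community)))  # [1, 2, 3]
```

Note that actors 1 and 2 end up indistinguishable in the simplified set even though actor 1 is present on two layers; the Future work section returns to this limitation.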

The comparison with ground truth is made with the Jaccard index. The Jaccard index measures the similarity between two sets by the ratio between their intersection and their union [11]. A Jaccard index of 1 means equal sets, and a Jaccard index of 0 means no shared values. Equation 6 defines the Jaccard index.

J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|        (6)


Another metric used to compare produced communities with ground truth is size ratio.

The size ratio is defined in equation 7. SR(S1, S2) will approach 1 as |S1| increases compared to |S2|, and approach 0 as |S1| decreases compared to |S2|. Equally sized communities yield a size ratio of 0.5.

SR(S1, S2) = |S1| / (|S1| + |S2|)        (7)
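Equations 6 and 7 translate directly into code; the two example sets are illustrative.

```python
def jaccard(s1, s2):
    """Equation 6: |intersection| / |union|."""
    s1, s2 = set(s1), set(s2)
    return len(s1 & s2) / len(s1 | s2)

def size_ratio(s1, s2):
    """Equation 7: |S1| / (|S1| + |S2|)."""
    return len(s1) / (len(s1) + len(s2))

print(jaccard({1, 2, 3}, {2, 3, 4}))     # 0.5: 2 shared out of 4 total
print(size_ratio({1, 2, 3}, {2, 3, 4}))  # 0.5: equally sized sets
```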

5.1.1 Accuracy with respect to noise

The first experiment measures the accuracy of the ACL-cut method with respect to noise on synthetic datasets. Four datasets with ground truth containing 300 actors on 3 layers are used. The datasets have increasing levels of noise in them, making the communities less pronounced. Noise is here defined by the fraction of edges a vertex shares with vertices in other communities.

The experiment is run on the different datasets using different combinations of the parameters γ and ε to see how they affect the result.

5.1.2 Real dataset

The second experiment is run on a real dataset of 61 employees of a university. The dataset consists of 61 actors on 5 layers. The 5 layers represent the relationships between the actors with respect to Facebook friendship, having lunch, co-authorship, friendship and working together [12]. Most actors in the dataset also belong to research groups, which are used as ground truth communities.¹ The method is again run using different combinations of the parameters γ and ε to see how they affect the result.

5.2 Scalability

The scalability aspect of the method is tested by running the implementation on synthetic datasets of increasing size. Preparation of the network and community extraction are tested separately since the preparation can be reused for multiple community extractions.

The execution time and accuracy of the community extraction are measured using a sample of 100 actors in each dataset as seeds. The community structure of the datasets is fixed to approximately nine equally sized communities, making the community sizes a function of network size. The edge density of the datasets is also fixed, where edge density is defined as 2 · |Edges| / (|Vertices| · (|Vertices| − 1)). The method is again run with different parameter combinations to see how these affect the result.

The aim of this experiment is to measure the method's ability to detect big communities, and how execution time scales with network and community size.

¹ Only employees that belong to exactly one research group are used as seeds.


6 Results

This section covers results and interpretations of all experiments. Appendix A and B contain statistics about each dataset used, and complete results from all experiments.

6.1 Accuracy over noise

The first experiment measures accuracy over noise. Figure 6 illustrates the method’s performance with different parameter combinations as well as the performance of the ML-LCD method in networks with increasing levels of noise.

Figure 6: Results from the noise experiment. The datasets each contain 300 actors with increasing levels of noise. The y-axis shows the average Jaccard index between the produced communities and the ground truth. The x-axis shows the level of noise in the dataset. The functions are different executions of the ACL-cut method parameterized by ε/γ, as well as the ML-LCD algorithm. The relaxed random walk was used with an interlayer weight of .5.

The method performs very well on these datasets when noise is low, using a low ε and a high γ parameter, meaning that a good approximation is made of a random walker with a low probability of being teleported back to the seed. The 1/.998 and 1/.9 settings have almost equal performance on each dataset, and outperform the ML-LCD and the other parameter settings by far when noise is at .1 and .25. However, as noise increases to .4 and .5, a drastic performance dip is observed compared to the other settings. At these levels of noise, the 1/.998 and 1/.9 settings start producing very big communities compared to the ground truth, yet with much lower conductances than communities found near ground truth scale. An interesting question not explored here is whether the ground truth communities show up as local minima in the sweep sets.

The 10/.4, 5/.4, 10/.9 and 10/.998 settings consistently score low regardless of noise.

The communities produced by these settings are small compared to the ground truth, and their conductances are high. The probable explanation is that the approximations terminate before reaching the edge of the communities.

The 5/.9, 5/.998 and 1/.4 settings and ML-LCD all perform similarly as noise increases, with ML-LCD performing best at high levels of noise. Like the 10/.4, 5/.4, 10/.9 and 10/.998 settings, these settings also return smaller-than-ground-truth communities with high conductance, suggesting that these parameters too need to be set to explore a bigger part of the network to find the edge of the communities.

6.2 Accuracy on the Aucs network

The second experiment measures the method's accuracy on the real-world Aucs dataset.

Once again performance of different parameter settings and the ML-LCD is compared.

Figure 7 illustrates the performances.

Much like in the noise networks, the 1/.9 and 1/.998 settings found very big communities compared to the ground truth, but interestingly also with lower conductance values than the communities closer to ground truth.

The 1/.4, 5/.9, 10/.998 and 5/.998 all performed similarly and seem to best capture the ground truth, although their conductance values were quite high compared to the 1/.9 and 1/.998 settings.

The ML-LCD algorithm outperformed all tested parameter settings of the ACL-cut.


Figure 7: Results from the Aucs network experiment. The x values are the parameter settings of the algorithm: γ/ε. Avg Jaccard is the average Jaccard index between the produced communities and ground truth. Std Jaccard is the standard deviation of the Jaccard index between the produced communities and ground truth. Avg Conductance is the average conductance of the produced communities. Avg Size ratio is the average size ratio between the produced communities and the ground truth, i.e. SR(found, ground truth). The relaxed random walk was used with an interlayer weight of .5.

6.3 Scalability

6.3.1 Network preparation

The first scalability experiment measures the time it takes to translate a multilayer network data structure into a transition matrix and to calculate its stationary distribution.

The preparation is tested on datasets where the number of actors ranges between 200 and 10000 on three layers. Aside from the number of actors, the number of edges is also an important variable for the preparation time; the number of edges approximately quadruples between datasets, increasing from circa 3000 to 8000000. Figure 8 illustrates how the preparation time of the relaxed and classical random walks scales with the number of actors.

Results show that the preparation is quite fast for small networks, but that it does not scale well with network size when edge density is kept constant.

Figure 8: Results from the network preparation scalability experiment. The y-axis shows the number of seconds it took for the preparation to finish. The x-axis shows the number of actors of each dataset. The two functions describe how the network preparation part of the method scales for the classical and the relaxed random walks. The stationary distribution was calculated with a random teleportation parameter of .1, meaning that it corresponds to a PageRank vector.

6.3.2 Community extraction

The second scalability experiment measures how the community extraction scales. Figure 9 illustrates the average execution time of different parameter settings of the classical random walk as network and community sizes increase. The results show that different parameter settings scale differently. The 2/.998 setting has the steepest slope, as it makes the biggest and best approximation, while the 10/.4 setting scales almost constantly.

An important consideration is that the push threshold ε · p_v(∞) shrinks as the network grows, which leads to a fixed ε producing a better approximation in the big networks than it does in the small networks.

Bigger experiments to test the implementation's limits were prevented by the library's network data structure taking up a lot of memory.

(21)

The relaxed random walk was approximately twice as slow as the classical. Exact results can be found in Appendix B.

Figure 9: Results from executing the method on datasets ranging from size 200 to 10000. The y-axis shows the average amount of time it took to finish, in seconds. The x-axis shows the number of actors in the datasets. The functions are different executions of the ACL-cut method parameterized by ε/γ. The classical random walk with an interlayer weight of 1 was used.

Figure 10 illustrates the same executions as figure 9, but also includes the ML-LCD execution time as a comparison. The average ML-LCD time exceeds 100 seconds on the 1000-actor network, while the ACL-cut finishes in less than a second on the 10000-actor one.


Figure 10: This figure shows the same results as figure 9, but with the inclusion of the ML-LCD execution time for comparison.

6.3.3 Accuracy as size increases

The third scalability experiment measures how well the method captures communities as they increase in size. Figure 11 illustrates the average Jaccard index between the found communities and the corresponding ground truth as the network and community sizes increase using the classical random walk. The classical and the relaxed random walks scored almost identically.

The ML-LCD was only tested up to the 1000-actor network, as its execution time became excessive beyond that. It scored roughly 0.6 on the 200- and 500-actor networks, increasing to 0.63 on the 1000-actor network.

The ACL-cut captured the ground truth communities very well with the 2/.998 and 2/.9 settings, and scored progressively worse as the settings were relaxed. The biggest communities have 1111 actors, and the high Jaccard index suggests that the method is capable of finding communities of that size and larger.



Figure 11: Results from executing the method on datasets ranging from size 200 to 10000.

The y-axis shows the average Jaccard index between the produced communities and the ground truth. The x-axis shows the number of actors in the datasets. The functions are different executions of the ACL-cut method parameterized by ε/γ, as well as the ML-LCD algorithm. The classical random walk with an interlayer weight of 1 was used.

7 Future work

All communities produced by the ACL-cut have been simplified to an actor level by including an actor if it is present on any layer. This simplification was made in order to compare the produced communities, which are on a vertex level, with the actor-level ground truth. A problem with this simplification is that it does not distinguish between an actor found on all layers and an actor found on only one layer. A possible improvement would be to instead convert the ground truth to a vertex level by including all vertices of its actors, and then compare against the original vertex-level community found by the ACL-cut.
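Both directions of this conversion are straightforward when a vertex is modeled as an (actor, layer) pair; a small sketch with illustrative names:

```python
def to_actor_level(vertex_community):
    """Flatten a vertex-level community, given as (actor, layer) pairs,
    to the set of actors that appear on at least one layer."""
    return {actor for actor, _layer in vertex_community}

def to_vertex_level(actor_community, layers):
    """The reverse conversion: expand each actor to all of its
    vertices across the given layers."""
    return {(actor, layer) for actor in actor_community for layer in layers}

# An actor found on all three layers and one found on a single layer
# become indistinguishable after flattening:
community = {("a", 1), ("a", 2), ("a", 3), ("b", 2)}
actors = to_actor_level(community)
```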

For further evaluation, it would also be interesting to test the method on networks with vertex-level ground truth in which communities are present on only some layers. This would enable more experimentation with the interlayer parameters of the random walks.

Another interesting notion to explore is partial knowledge of communities. If one has knowledge of a subset of a community, then using that whole subset as the seed should increase the probability of finding the rest of the community, compared to using single vertices or actors by themselves.
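In the personalized PageRank formulation this is a small change: the initial (teleport) mass is spread over the known subset instead of being placed on a single vertex. A sketch, with illustrative naming:

```python
def seed_distribution(seeds):
    """Uniform teleport distribution over a set of known community
    members, usable as the starting residual of a personalized
    PageRank computation in place of a single-vertex seed."""
    seeds = list(seeds)
    return {v: 1.0 / len(seeds) for v in seeds}

# Three known members of the same community share the unit of seed mass:
r = seed_distribution({"u", "v", "w"})
```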

8 Conclusion

The ACL-cut algorithm, as described in the paper A local perspective on community structure in multilayer networks, finds communities in multiplex networks by identifying sets of vertices that act as a bottleneck for interlayer-transitioning random walks [1]. This is done by approximating how such walks spread from a set of seed vertices. The idea is that a random walker is likely to stay within a set of vertices that has higher internal than external connectivity. Vertices in the same community as the seed are thus far more likely to be reached than other vertices. A vertex’s probability of being reached is then used as a score for how good a candidate that vertex is for community inclusion.

The vertices are included in a preliminary community in order of this score, and the conductance of the community is tracked after each inclusion. Good communities can finally be extracted at the conductance minima of the preliminary community.
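This sweep can be sketched on a single-layer graph, with conductance as cut over volume [9]; the toy graph, scores and names are illustrative:

```python
def conductance(adj, S, total_volume):
    """phi(S) = cut(S) / min(vol(S), vol(V \\ S)) on an undirected graph."""
    S = set(S)
    vol = sum(len(adj[u]) for u in S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    denom = min(vol, total_volume - vol)
    return cut / denom if denom > 0 else 1.0

def sweep(adj, score):
    """Add vertices in order of degree-normalized score and record the
    conductance after each inclusion; minima mark good communities."""
    order = sorted(score, key=lambda u: score[u] / len(adj[u]), reverse=True)
    total_volume = sum(len(adj[u]) for u in adj)
    S, curve = [], []
    for u in order:
        S.append(u)
        curve.append((list(S), conductance(adj, S, total_volume)))
    return curve

# Two triangles joined by one edge; with scores concentrated on the first
# triangle, the sweep reaches its conductance minimum at that triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
curve = sweep(adj, {0: 0.4, 1: 0.3, 2: 0.25, 3: 0.03})
```

The returned curve can then be scanned for local minima, each of which is a candidate community.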

A C++ implementation of the ACL-cut algorithm has been made and evaluated. Experiments were run to test the method’s ability to detect communities and to see how it scales with network size.

To test the method’s accuracy, the implementation was executed on synthetic datasets with ground truth containing increasing levels of noise. The method identified the ground truth communities accurately in low-noise datasets, but tended to find very big communities compared to the ground truth as noise increased. An interesting point for further evaluation is whether the ground truth still shows up as local conductance minima in the preliminary community in the high-noise datasets.

The method was also tested on the real-world dataset Aucs, a multiplex network on 5 layers describing different relationships between 61 employees of a university [12]. Each employee belongs to a research group, which was used as that employee’s ground truth community. The method managed to capture the ground truth for some seeds, but performed poorly compared to the ML-LCD algorithm.

The method was, however, again able to find quite big low-conductance sets, which may correspond to other interesting structures.

To test the implementation’s scalability, it was executed on synthetic datasets of increasing size with a fixed number of ground truth communities, making community size a function of network size. The implementation ran orders of magnitude faster than a Python implementation of the ML-LCD algorithm. It was also able to capture communities of up to 1111 actors with no loss of accuracy, suggesting that it can detect communities bigger than this.

In summary, the method has proved to be very fast and proficient at finding low-conductance sets centered around seeds. The large number of parameters can, however, make the method difficult to use.


References

[1] Jeub, L. G., Mahoney, M. W., Mucha, P. J., & Porter, M. A. (2017). A local perspective on community structure in multilayer networks. Network Science, 1-20.

[2] Andersen, R., Chung, F., & Lang, K. (2006, October). Local graph partitioning using PageRank vectors. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on (pp. 475-486). IEEE.

[3] Loe, C. W., & Jensen, H. J. (2015). Comparison of communities detection algorithms for multiplex. Physica A: Statistical Mechanics and its Applications, 431, 29-45.

[4] Dickison, M. E., Magnani, M., & Rossi, L. (2016). Multilayer social networks. Cambridge University Press. 96-113.

[5] Tang, L., Wang, X., & Liu, H. (2009, December). Uncoverning groups via heterogeneous interaction analysis. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on (pp. 503-512). IEEE.

[6] Kuncheva, Z., & Montana, G. (2015, August). Community detection in multiplex networks using locally adaptive random walks. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 (pp. 1308-1315). ACM.

[7] De Domenico, M., Solé-Ribalta, A., Gómez, S., & Arenas, A. (2014). Navigability of interconnected networks under random failures. Proceedings of the National Academy of Sciences, 111(23), 8351-8356.

[8] De Domenico, M., Lancichinetti, A., Arenas, A., & Rosvall, M. (2015). Identifying modular flows on multilayer networks reveals highly overlapping organization in interconnected systems. Physical Review X, 5(1), 011027.

[9] Jerrum, M., & Sinclair, A. (1988, January). Conductance and the rapid mixing property for Markov chains: the approximation of the permanent resolved. In Proceedings of the twentieth annual ACM symposium on Theory of computing (pp. 235-244). ACM.

[10] Interdonato, R., Tagarelli, A., Ienco, D., Sallaberry, A., & Poncelet, P. (2017, March). Node-centric community detection in multilayer networks with layer-coverage diversification bias. In Workshop on Complex Networks CompleNet (pp. 57-66). Springer, Cham.

[11] Jaccard, P. (1912). The distribution of the flora in the alpine zone. New phytologist, 11(2), 37-50.

[12] Rossi, Luca & Magnani, Matteo (2015). Towards effective visual analytics on multi- plex and multilayer networks. Chaos, Solitons and Fractals, 72, 68-76.

[13] Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.


[14] Gaël Guennebaud and Benoît Jacob and others. (2017). Eigen v3. http://eigen.tuxfamily.org.

[15] Yixuan Qiu. (2017). Spectra. https://spectralib.org/.


A Dataset statistics

This appendix contains statistics about the synthetic datasets used. The table columns are as follows: nodes is the number of actors in the network; edges is the number of edges in the network; layers is the number of layers in the network; density is defined as 2 * |Edges| / (|Vertices| * (|Vertices| - 1)); avg degree is the average number of edges per vertex.

num communities is the number of ground truth communities. community mean size is the mean community size.
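The tabulated densities are consistent with the density formula evaluated per layer and then averaged over the layers, and the average degree with 2 * |Edges| divided by the number of actors. A small sketch under that reading (function names are illustrative):

```python
def density(actors, edges, layers):
    """Per-layer density averaged over layers: with n actors per layer,
    this is 2 * |Edges| / (layers * n * (n - 1))."""
    return 2 * edges / (layers * actors * (actors - 1))

def avg_degree(actors, edges):
    """Average number of edge endpoints per actor: 2 * |Edges| / n."""
    return 2 * edges / actors

# First row of Figure 12: 300 actors, 13040 edges, 3 layers
# gives density ~ 0.096916 and average degree ~ 86.93333.
d = density(300, 13040, 3)
k = avg_degree(300, 13040)
```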

Accuracy

The noise column corresponds to the fraction of edges a vertex shares with vertices in other communities.

noise nodes edges layers density avg degree num communities community mean size

0.1 300 13040 3 0.096916 86.93333 4 75

0.25 300 13707 3 0.101873 91.38 4 75

0.4 300 13507 3 0.100386 90.04667 6 50

0.5 300 13078 3 0.097198 87.18667 7 43

Figure 12: 300 actors with increasing noise dataset statistics.

Scalability

nodes edges layers density average degree num communities community mean size

200 2865 3 0.04799 28.65 9 22.22222

500 18432 3 0.049251 73.728 9 55.55556

1000 72437 3 0.04834 144.874 9 111.1111

2000 294291 3 0.049073 294.291 9 222.2222

5000 1826367 3 0.048713 730.5468 9 555.5556

10000 7389425 3 0.049268 1477.885 9 1111.111

Figure 13: Scalability dataset statistics

B Experiment results

This appendix contains the complete results from the noise, Aucs and scalability experiments. All experiments were executed on a computer with an Intel(R) Core(TM) i7-6700 CPU @ 3.4 GHz and 16 GB of DDR4 RAM @ 2133 MHz. The table columns are as follows: params is the ε/γ parameter setting of an execution; noise is the level of noise in the network; size is the number of actors in the network; avgjac is the average Jaccard index between the produced communities and the ground truth; stdjac is the standard deviation of that Jaccard index; avgcond is the average conductance of the produced communities; avgsizeratio is the average size ratio between the produced communities and the ground truth, i.e. the average SR(found, groundtruth) where SR is defined in equation 7; stdsizeratio is the standard deviation of the size ratios; avgtimes is the average execution time in seconds.


Accuracy over noise

Relaxed random walk

params noise avgjac stdjac avgcond avgsizeratio

10/0.4 0.1 0.0362559 0.0434039 0.964637 0.0454556

5/0.4 0.1 0.183692 0.20524 0.892813 0.15639

1/0.4 0.1 0.601723 0.262457 0.575776 0.400857

10/0.9 0.1 0.225032 0.24092 0.867465 0.181764

5/0.9 0.1 0.524421 0.281851 0.626812 0.356606

1/0.9 0.1 0.961057 0.0230311 0.106755 0.509999

10/0.998 0.1 0.255029 0.250341 0.851902 0.202655

5/0.998 0.1 0.603339 0.311837 0.5162 0.381819

1/0.998 0.1 0.957411 0.113166 0.099586 0.513155

10/0.4 0.25 0.0184038 0.0114907 0.958225 0.0453143

5/0.4 0.25 0.0874058 0.0885485 0.918743 0.128662

1/0.4 0.25 0.446453 0.207876 0.748061 0.394639

10/0.9 0.25 0.1329 0.121258 0.904435 0.166134

5/0.9 0.25 0.375325 0.200581 0.803931 0.345886

1/0.9 0.25 0.755557 0.145818 0.282854 0.574116

10/0.998 0.25 0.16537 0.143888 0.892208 0.191663

5/0.998 0.25 0.428477 0.209333 0.765231 0.373734

1/0.998 0.25 0.752655 0.163865 0.27256 0.576145

10/0.4 0.4 0.0331992 0.0339406 0.940761 0.0771425

5/0.4 0.4 0.113849 0.140003 0.885225 0.194674

1/0.4 0.4 0.37896 0.187945 0.755711 0.46108

10/0.9 0.4 0.14568 0.159962 0.874027 0.224926

5/0.9 0.4 0.334398 0.215099 0.80269 0.398211

1/0.9 0.4 0.246475 0.06533 0.375915 0.804413

10/0.998 0.4 0.176717 0.176649 0.86324 0.253127

5/0.998 0.4 0.376331 0.220372 0.780346 0.430609

1/0.998 0.4 0.247058 0.0657484 0.372936 0.804064

10/0.4 0.5 0.0354738 0.0335916 0.921107 0.104218

5/0.4 0.5 0.101322 0.126038 0.866776 0.221742

1/0.4 0.5 0.255074 0.1988 0.768076 0.440411

10/0.9 0.5 0.117442 0.145305 0.85958 0.242277

5/0.9 0.5 0.229359 0.206771 0.808205 0.383922

1/0.9 0.5 0.205299 0.0422679 0.39693 0.830689

10/0.998 0.5 0.13505 0.15831 0.849708 0.264971

5/0.998 0.5 0.255217 0.21418 0.798941 0.408758

1/0.998 0.5 0.205176 0.0422968 0.395848 0.830762

ML-LCD 0.1 0.5338779421867629 0.146651101854 0 0.34195433315068835
ML-LCD 0.25 0.44823296987633976 0.202394227371 0 0.3229629716177331
ML-LCD 0.4 0.39388134889301035 0.277715410254 0 0.31683434457834647
ML-LCD 0.5 0.26500926553594406 0.26225933535 0 0.3187538649906146

Figure 14: Results from the accuracy experiments using the relaxed random walk and the ML-LCD. Datasets used: 300 actors with increasing noise.


Classical random walk

params noise avgjac stdjac avgcond avgsizeratio

10/0.4 0.1 0.023974 0.026092 0.910498 0.0305058
5/0.4 0.1 0.159368 0.197802 0.862601 0.134117
1/0.4 0.1 0.559379 0.27146 0.590222 0.379102
10/0.9 0.1 0.205246 0.244989 0.835115 0.161503
5/0.9 0.1 0.502595 0.298117 0.603367 0.340761
1/0.9 0.1 0.960099 0.02406 0.100941 0.510255
10/0.998 0.1 0.237169 0.257268 0.818795 0.184618
5/0.998 0.1 0.595318 0.325251 0.492576 0.373948
1/0.998 0.1 0.960939 0.11208 0.0930402 0.512228

10/0.4 0.25 0.0160019 0.00853234 0.906777 0.0356281
5/0.4 0.25 0.0567042 0.0657609 0.882323 0.0956047
1/0.4 0.25 0.414328 0.23665 0.735099 0.369957
10/0.9 0.25 0.104044 0.118459 0.866632 0.135452
5/0.9 0.25 0.342772 0.233688 0.776019 0.315517
1/0.9 0.25 0.733787 0.155477 0.272377 0.582009
10/0.998 0.25 0.138982 0.144655 0.853561 0.163354
5/0.998 0.25 0.411389 0.230992 0.732664 0.358509
1/0.998 0.25 0.729851 0.169791 0.260578 0.584224

10/0.4 0.4 0.028356 0.0252647 0.891992 0.0639223
5/0.4 0.4 0.0914189 0.124805 0.847552 0.168601
1/0.4 0.4 0.332926 0.223226 0.725932 0.423091
10/0.9 0.4 0.11992 0.151538 0.831511 0.198659
5/0.9 0.4 0.29969 0.232497 0.765179 0.367331
1/0.9 0.4 0.249966 0.0676669 0.357197 0.802303
10/0.998 0.4 0.155161 0.172429 0.819186 0.232499
5/0.998 0.4 0.348819 0.23701 0.739357 0.407688
1/0.998 0.4 0.250123 0.0676531 0.346322 0.802203

10/0.4 0.5 0.0304374 0.0219873 0.872583 0.0860079
5/0.4 0.5 0.0845905 0.110363 0.822596 0.201625
1/0.4 0.5 0.228174 0.203898 0.725637 0.415772
10/0.9 0.5 0.101166 0.133265 0.811152 0.224385
5/0.9 0.5 0.200828 0.204298 0.758985 0.357126
1/0.9 0.5 0.209098 0.0450455 0.379571 0.828204
10/0.998 0.5 0.118732 0.149906 0.800055 0.249288
5/0.998 0.5 0.228874 0.217259 0.749346 0.38471
1/0.998 0.5 0.208772 0.0448574 0.370088 0.828406

Figure 15: Results from the accuracy experiments using the classical random walk. Datasets used: 300 actors with increasing noise.


Aucs

Relaxed random walk and ML-LCD

params avgjac stdjac avgcond avgsizeratio stdsizeratio

10/0.4 0.260113 0.14562 0.83511 0.23362 0.117368

5/0.4 0.463138 0.224217 0.693585 0.355674 0.124513

1/0.4 0.618607 0.197907 0.451391 0.589628 0.0930869

10/0.9 0.494218 0.246757 0.629287 0.373862 0.138364

5/0.9 0.600986 0.229305 0.487484 0.524965 0.130202

1/0.9 0.274055 0.173739 0.241365 0.796332 0.0861884

10/0.998 0.550293 0.262584 0.587611 0.413367 0.140206

5/0.998 0.607951 0.255991 0.442852 0.566011 0.13369

1/0.998 0.261381 0.10273 0.217594 0.798036 0.0647451

ML-LCD 0.800551335456996 0.299519544386 0 0.509811262477382 0.089177090832

Figure 16: Results from the Aucs dataset using the relaxed random walk with interlayer parameter 0.5 and the ML-LCD.

Classical random walk

params avgjac stdjac avgcond avgsizeratio stdsizeratio

10/0.4 0.150943 0.125003 0.474147 0.124679 0.0607898
5/0.4 0.178367 0.150198 0.470915 0.151114 0.0849807
1/0.4 0.488427 0.300882 0.383936 0.424656 0.230375
10/0.9 0.333034 0.295853 0.442062 0.244196 0.166585
5/0.9 0.570708 0.298129 0.363052 0.420487 0.182666
1/0.9 0.579357 0.267257 0.180703 0.650262 0.116768
10/0.998 0.399738 0.324468 0.415814 0.292934 0.19576
5/0.998 0.667347 0.281941 0.294644 0.4778 0.159217
1/0.998 0.339785 0.246229 0.116654 0.767199 0.115012

Figure 17: Results from the Aucs dataset using the classical random walk with an interlayer parameter of 1.

Scalability

ML-LCD

params size avgjac avgtimes avgsizeratio

ML-LCD 200 0.6006851851851852 0.10487611770629883 0.36758901872590566
ML-LCD 500 0.5967740740740741 5.611178197860718 0.36339029330162503
ML-LCD 1000 0.6331 101.83022415399552 0.38261364697363764

Figure 18: Results from the scalability experiments using the ML-LCD algorithm.


Relaxed random walk

params size avgjac avgcond avgtimes avgsizeratio

10/0.4 200 0.281647 0.815985 0.0000999 0.219396
5/0.4 200 0.507401 0.675671 0.00012932 0.346266
2/0.4 200 0.730483 0.464073 0.00017863 0.451609
10/0.9 200 0.687088 0.47709 0.00016764 0.412686
5/0.9 200 0.85495 0.222813 0.0002876 0.482311
2/0.9 200 0.949664 0.110435 0.00049353 0.513483
10/0.998 200 0.781904 0.28654 0.00021244 0.440881
5/0.998 200 0.903989 0.193501 0.0004253 0.48925
2/0.998 200 0.971285 0.100146 0.00098502 0.508242

10/0.4 500 0.241515 0.870706 0.00022519 0.201403
5/0.4 500 0.484082 0.727767 0.00039448 0.338195
2/0.4 500 0.697147 0.519181 0.00064291 0.44533
10/0.9 500 0.659311 0.534068 0.00057742 0.405594
5/0.9 500 0.857376 0.226768 0.00127474 0.48024
2/0.9 500 0.962546 0.106414 0.0024603 0.509646
10/0.998 500 0.774525 0.296154 0.00097176 0.437742
5/0.998 500 0.902001 0.198099 0.00185961 0.483941
2/0.998 500 0.985013 0.0993144 0.00489888 0.503822

10/0.4 1000 0.26266 0.880618 0.00068428 0.216233
5/0.4 1000 0.490601 0.734288 0.00139721 0.345454
2/0.4 1000 0.683783 0.553577 0.00232596 0.448036
10/0.9 1000 0.64863 0.566462 0.00205207 0.408924
5/0.9 1000 0.866396 0.220149 0.00465835 0.48471
2/0.9 1000 0.964552 0.105248 0.00789246 0.509122
10/0.998 1000 0.788331 0.281066 0.003557 0.443549
5/0.998 1000 0.912626 0.190644 0.00650271 0.486544
2/0.998 1000 0.990313 0.0978935 0.0180258 0.502466

10/0.4 2000 0.257007 0.886484 0.00208956 0.213468
5/0.4 2000 0.506775 0.732677 0.00481719 0.350745
2/0.4 2000 0.676898 0.547445 0.00783157 0.443973
10/0.9 2000 0.649821 0.570292 0.00686206 0.40951
5/0.9 2000 0.872423 0.21208 0.017187 0.482982
2/0.9 2000 0.968175 0.103208 0.0280946 0.508162
10/0.998 2000 0.803852 0.271374 0.0135313 0.444803
5/0.998 2000 0.921062 0.18039 0.0241019 0.487447
2/0.998 2000 0.989964 0.0969008 0.0594647 0.502551

10/0.4 5000 0.23557 0.902827 0.0107847 0.199896
5/0.4 5000 0.451558 0.764087 0.0258496 0.322299
2/0.4 5000 0.652725 0.567106 0.0454699 0.434025
10/0.9 5000 0.604786 0.602286 0.0377307 0.387128
5/0.9 5000 0.837271 0.242004 0.0975779 0.470036
2/0.9 5000 0.965769 0.102298 0.166789 0.508779
10/0.998 5000 0.758456 0.314185 0.0756757 0.424951
5/0.998 5000 0.892634 0.205979 0.136088 0.479675
2/0.998 5000 0.990114 0.0954997 0.299941 0.502511

10/0.4 10000 0.244516 0.8983 0.0426442 0.210367
5/0.4 10000 0.468101 0.752422 0.101864 0.334889
2/0.4 10000 0.679326 0.537619 0.181466 0.44899
10/0.9 10000 0.626624 0.578396 0.149135 0.40117
5/0.9 10000 0.872565 0.204907 0.39777 0.485298
2/0.9 10000 0.96398 0.104754 0.649179 0.509253
10/0.998 10000 0.798404 0.27194 0.307974 0.444425
5/0.998 10000 0.914282 0.181694 0.561228 0.485357
2/0.998 10000 0.992541 0.096689 1.0833131 0.501891
