IT 17 036
Examensarbete 15 hp Juli 2017
An evaluation of random-walk based clustering of multiplex networks
Kristofer Sundequist Blomdahl
Abstract
A network, or a graph, is a mathematical construct used for modeling relationships between different entities. An extension of an ordinary network is a multiplex network. A multiplex network enables one to model different kinds of relationships between the same entities, or even to model how relationships between entities change over time. A common network analysis task is to find groups of nodes that are unusually tightly connected. This is called community detection, and is a form of clustering. The multiplex extension complicates both the notion of what a community is, and the process of finding them.
This project focuses on a random-walk based local method that can be used to find communities centered around supplied seed nodes. An implementation of the method is made which is used to evaluate its ability to detect communities in different kinds of multiplex networks.
Supervisor: Roberto Interdonato
Contents

1 Introduction
2 Background
  2.1 Multiplex networks
  2.2 Community
  2.3 Related work
3 ACL cut
  3.1 Random walk
    3.1.1 Stationary distribution
  3.2 Conductance
  3.3 Personalized PageRank-vector approximation
  3.4 Sweep cut
4 Implementation
5 Experiments
  5.1 Accuracy
    5.1.1 Accuracy with respect to noise
    5.1.2 Real dataset
  5.2 Scalability
6 Results
  6.1 Accuracy over noise
  6.2 Accuracy on the Aucs network
  6.3 Scalability
    6.3.1 Network preparation
    6.3.2 Community extraction
    6.3.3 Accuracy as size increases
7 Future work
8 Conclusion
A Dataset statistics
B Experiment results
1 Introduction
A network is a construct for modeling relationships between different entities. A network is made up of a set of vertices which are connected through a set of edges. The vertices represent entities, and the edges represent relationships between entities. A popular way to analyze a network is the identification of communities. A community is a subset of vertices in a network that are highly connected to each other and well separated from the rest of the network. Finding these communities is a well studied subject with many applications.
Ordinary networks are limited in what systems they can model as they only describe one kind of relationship. The extension of multiplex networks, which are networks with many layers representing different kinds of relationships, enables one to accurately model systems with complex interactions. It however also complicates the notion of a community and the process of finding them.
This project focuses on a state-of-the-art community detection method described in A local perspective on community structure in multilayer networks as a part of a bigger comparison of multiplex community detection methods [1]. The method is based on finding parts of a network that act like bottlenecks on interlayer transitioning random walks, with the assumption that these bottlenecks constitute good communities. The bottlenecks are found by approximating how the random walks spread from seed vertices, and by including vertices based on how easily they are reached from the seed. The use of seed vertices makes this a local method.
An implementation of the method is made which is used to evaluate its ability to detect communities in noisy networks as well as how it scales with network and community size.
The evaluation also includes a comparison with the ML-LCD algorithm, which is another local community detection method [10].
2 Background
This section covers background information related to networks, multiplex networks and the notion of a community in multiplex networks.
2.1 Multiplex networks
A network consists of a set of vertices v ∈ V which are connected through a set of edges (va, vb) ∈ E.
A multiplex network is an extension of an ordinary network. Multiplex networks consist of layers of ordinary networks, where each layer contains the same vertices, but with different edges connecting them. Entities represented in the network have a vertex representation on every layer, and the set of all its vertices across layers is called an actor.
The multiplex extension makes it possible to model systems where actors have many different kinds of relationships. For example, in a social network, one layer could represent a friendship relationship while a second layer could represent a co-worker relationship.
This extra dimension of information enables the discovery of many interesting patterns since one can study correlations between layers. Figure 1 illustrates a multiplex network.
Figure 1: A multiplex network with two layers: α and β. The two layers contain the same vertices, but have different edges. The vertices connected by a dotted edge represent the same actor.
The topology of an ordinary network is commonly represented by an adjacency matrix.
An adjacency matrix encodes the edges of a network as in equation 1 where i and j are vertices in the network.
AM_{ji} = 1 if (i, j) ∈ E, and 0 otherwise.    (1)
To represent multiplex networks, one can extend the adjacency matrix of an ordinary network to an adjacency tensor (i.e. a multidimensional generalization of a matrix). Equation 2 shows the adjacency tensor AT describing the connection between vertex i in layer α and vertex j in layer β.
AT_{jβiα} = 1 if (iα, jβ) ∈ E, and 0 otherwise.    (2)
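As a concrete illustration of equation 2, the tensor for a small two-layer network can be built and flattened with NumPy. This is only an illustrative sketch: the index convention AT[β, j, α, i] and the helper name add_intra_edge are choices made here, not part of the thesis implementation.

```python
import numpy as np

# A two-layer multiplex network on 3 actors, stored as the adjacency
# tensor of equation 2: AT[beta, j, alpha, i] = 1 iff the edge
# (i_alpha, j_beta) exists. Intralayer edges have alpha == beta.
n_actors, n_layers = 3, 2
AT = np.zeros((n_layers, n_actors, n_layers, n_actors))

def add_intra_edge(layer, i, j):
    # Undirected intralayer edge between actors i and j on one layer.
    AT[layer, j, layer, i] = 1
    AT[layer, i, layer, j] = 1

add_intra_edge(0, 0, 1)  # layer 0: edges 0-1 and 1-2
add_intra_edge(0, 1, 2)
add_intra_edge(1, 0, 2)  # layer 1: edge 0-2

# Flattening the tensor gives a "supra-adjacency" matrix in which row
# beta * n_actors + j and column alpha * n_actors + i address the same
# entry as AT[beta, j, alpha, i].
supra = AT.reshape(n_layers * n_actors, n_layers * n_actors)
```

The flattened view is convenient because ordinary matrix tools (sparse storage, eigensolvers) then apply directly, which is also how section 4 describes the implementation's transition matrix.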
2.2 Community
A popular way to analyze a network is the identification of communities. A community is a subset of vertices that are tightly connected within the subset, yet sparsely connected to its complement. For example, in a social network, a group of friends would likely constitute a community since the connections within the group would be very dense while the connections to people outside the group would be sparse in comparison. A community is related to the notion of a cluster.
The extension of multiplex networks makes the notion of a community more complicated, and it is not well defined. One way to define a multiplex community is to say that its vertices should be tightly connected on all layers. This definition uses information from all layers to verify that the set really is a community. Using this metric, the set of actors {5, 6, 7, 8} in figure 2 makes a good community, while the set {1, 2, 3, 4} creates a bad one since they are hardly connected on layer β.
A problem with this metric is that the community structure of {1α, 2α, 3α, 4α} is neglected. This warrants a community to be defined on a vertex level rather than on an actor level. This way one can isolate {1α, 2α, 3α, 4α} as a good community even though the corresponding β-set is bad. Depending on the problem domain, though, a vertex level definition may not make sense.
While a vertex level definition enables the {1α, 2α, 3α, 4α} community, it raises the question of when vertices on different layers should be in the same community. One answer is when vertices on different layers form similar communities, like with vertices {5α, 6α, 7α, 8α, 5β, 6β, 7β, 8β}.
Figure 2: A two layer multiplex network illustrating different community structures on the different layers.
Different definitions of a community make sense for different problem domains, but the general notion is that a community is a set of vertices where their connectivity groups them together. The question is how to fuse information from different layers to enhance understanding.
2.3 Related work
A multiplex network is a more expressive model than an ordinary network, and traditional community detection algorithms do not directly translate. There are currently three general approaches to community detection in multiplex networks [4].
The first approach is to flatten the network into a single layer, and then apply existing community detection algorithms on it. There are many different ways of flattening the networks, and they can lead to very different results. One example is by letting the edges in the flattened graph be weighted by how many edges exist between the corresponding actors. This method is called the ”Projection-Average” [3].
The second approach is to perform community detection on each layer individually, and then use this information to form multilayer communities. Given a community partitioning of each layer, one can measure the similarity between vertices by the number of communities shared. The similarity score can then be used to form multiplex communities [5].
The third approach is to operate directly on the multilayer structure. An example of this is LART (Locally Adaptive Random Transitions), which uses a special type of random walk to obtain distances between vertices which is then used to perform agglomerative clustering [6].
Community detection is most commonly done globally, but, like the method evaluated in this project, it can also be done locally. A local approach means that a community is found that is centered around a seed vertex. One method that uses this strategy is the ML-LCD algorithm [10]. The method works by iteratively expanding a community starting from a seed by including neighboring vertices that increase the internal connectivity compared to the external connectivity.
3 ACL cut
This section describes the random-walk based multiplex community detection algorithm proposed in A Local Perspective on Community Structure in Multilayer Networks [1].
This method differs from conventional community detection methods in that it is local;
it will only find communities that are centered around a supplied set of seed vertices. This makes the task of global partitioning clumsy, but it opens up opportunities for finding communities containing a specific set of vertices missed by conventional methods. It also makes it fast.
The general idea of the algorithm is to approximate how random-walks spread through the network and find sets of vertices that act as bottlenecks on the spread. The process is to start from a set of seed vertices and locally approximate the spreading from those vertices. If the seed vertices are part of a community, then that community will act as a bottleneck on the random-walks, and can be detected efficiently.
3.1 Random walk
A random walk is a process that randomly traverses a network (in this case) given some rules of transition. A walker will be on a vertex in the network, and randomly transition to an adjacent vertex. The paper presents two different random walks: the classical and the relaxed.
The classical random walk is defined in equation 3, where P_{jβiα} denotes the probability to transition from vertex iα to vertex jβ. All vertices that correspond to the same actor are connected with the same weight ω ∈ [0, ∞), i.e. AT_{iβiα} = ω for α ≠ β. This random walk either transitions within the same layer to a vertex representing another actor, or transitions to another layer to a vertex representing the same actor [7].
P_{jβiα} = AT_{jβiα} / Σ_{jβ∈V} AT_{jβiα}    (3)
The relaxed random walk is defined in equation 4 where r ∈ [0, 1] is the probability to switch layers and δ(α, β) is 1 if α = β else 0. With probability 1 − r the walker stays within the same layer and transitions uniformly at random to an adjacent vertex. With probability r the random walker transitions uniformly at random to any vertex that is adjacent to any vertex corresponding to the same actor [8].
P_{jβiα} = (1 − r) δ(α, β) AT_{jαiα} / Σ_{j∈Actor} AT_{jαiα} + r AT_{jβiβ} / Σ_{j∈Actor, β∈Layer} AT_{jβiβ}    (4)

The parameters ω and r play similar roles: they control how the random walks spread across layers. For example, if ω is set to 0 in the classical random walk, and there is only one seed vertex, then the random walk will not transition between layers at all, and the only communities found will be on the same layer as the seed. If on the other hand ω is set high, meaning that the random walker is likely to switch layers, then the communities will also span many layers, since a bottleneck to a layer-switching random walk will have to exist on many layers.
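The two walks can be sketched as dense row-stochastic matrices. This is an illustrative reconstruction of equations 3 and 4, not the thesis's C++ implementation; the function names are invented here, and the relaxed walk assumes every vertex has at least one edge on every layer so that the normalizations are well defined.

```python
import numpy as np

def classical_transition(layers, omega):
    """Equation 3 sketch: couple the per-layer adjacency matrices with
    interlayer weight omega between copies of the same actor, then
    normalize each row into transition probabilities."""
    L, n = len(layers), layers[0].shape[0]
    A = np.zeros((L * n, L * n))
    for a in range(L):
        A[a*n:(a+1)*n, a*n:(a+1)*n] = layers[a]
        for b in range(L):
            if a != b:
                A[a*n:(a+1)*n, b*n:(b+1)*n] = omega * np.eye(n)
    return A / A.sum(axis=1, keepdims=True)

def relaxed_transition(layers, r):
    """Equation 4 sketch: with probability 1 - r stay in the current
    layer, with probability r transition over the actor's edges on any
    layer. Assumes every vertex has at least one edge on every layer."""
    L, n = len(layers), layers[0].shape[0]
    deg = [adj.sum(axis=1) for adj in layers]  # per-layer degrees
    total_deg = sum(deg)                       # degree over all layers
    P = np.zeros((L * n, L * n))
    for a in range(L):
        for b in range(L):
            block = r * layers[b] / total_deg[:, None]
            if a == b:
                block = block + (1 - r) * layers[a] / deg[a][:, None]
            P[a*n:(a+1)*n, b*n:(b+1)*n] = block
    return P

# Two layers on 3 actors: a triangle and a path.
tri  = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
Pc = classical_transition([tri, path], omega=1.0)
Pr = relaxed_transition([tri, path], r=0.5)
```

With ω = 1, vertex 1α has two intralayer neighbors plus the coupling to 1β, so each of its three transitions gets probability 1/3, matching the description of the classical walk.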
3.1.1 Stationary distribution
The stationary distribution of a random walk is a probability distribution which describes the random-walker's likelihood of being at each vertex after it has made an infinite number of transitions. It is denoted p_{iα}(∞), i.e. the probability that the random-walker is at vertex iα at time ∞. Figure 3 shows the stationary distribution of the classical random walk on a two layered network. An interlayer weight of 1 was used, making a transition to the corresponding vertex in the other layer as likely as a transition to any intra-layer neighbor. One can see that the probability of being at vertex 7β is higher than at 3β since it is better connected.
[Figure: each vertex is annotated with its stationary probability, e.g. p(∞) = 0.071 for 7β but 0.043 for 3β.]
Figure 3: The stationary distribution of the classical random-walk using an interlayer weight of 1.
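A minimal way to approximate the stationary distribution is power iteration. The actual implementation extracts it with eigensolvers from Eigen and Spectra (section 4); this sketch, with invented function names, is only illustrative.

```python
import numpy as np

def stationary(P, tol=1e-12, max_iter=100000):
    """Approximate the stationary distribution p(inf) of a row-stochastic
    transition matrix P by power iteration."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        nxt = pi @ P
        done = np.abs(nxt - pi).sum() < tol
        pi = nxt
        if done:
            break
    return pi

# For a walk that moves along undirected (weighted) edges, the stationary
# probability of a vertex is proportional to its total edge weight, which
# is why well-connected vertices like 7_beta in figure 3 score higher.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)
pi = stationary(P)
```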
3.2 Conductance
Conductance is, from a random-walk perspective, a measure of how easy it is to escape from a set of vertices. In the quest of community detection, a community can be viewed as a set of vertices with low conductance, i.e. a set of vertices that trap random walks.
Equation 5 presents the formula used to calculate the conductance of a set of vertices S [9].
φ(S) = Σ_{iα∈S} Σ_{jβ∉S} P_{jβiα} p_{iα}(∞) / Σ_{iα∈S} p_{iα}(∞)    (5)
The definition of conductance given in equation 5 measures the probability to leave a set of vertices relative to the probability to be in the set.
The intuition is that it should be far more probable to stay within a community than to leave it, since a community is very tightly connected to itself and poorly connected to its complement.
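Equation 5 translates directly into code. The sketch below recomputes conductance from a dense transition matrix and a precomputed stationary distribution; the function name and the small test network are illustrative choices, not part of the thesis implementation.

```python
import numpy as np

def conductance(S, P, pi):
    """Equation 5: probability that a stationary walker currently in S
    escapes S in one step, relative to the probability of being in S."""
    S = np.asarray(list(S))
    outside = np.setdiff1d(np.arange(len(pi)), S)
    escape = (pi[S, None] * P[np.ix_(S, outside)]).sum()
    return escape / pi[S].sum()

# A "barbell" network: two triangles {0,1,2} and {3,4,5} joined by the
# single edge 2-3. Each triangle traps random walks, so it should have
# low conductance.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
P = A / A.sum(axis=1, keepdims=True)
pi = A.sum(axis=1) / A.sum()  # stationary distribution of this walk
```

A single vertex without a self-loop has conductance 1 (the walker always escapes), while the triangle {0, 1, 2} only leaks over the bridge edge and scores much lower.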
3.3 Personalized PageRank-vector approximation
Given a seed vertex (or a set of vertices), the process of the algorithm is to approximate how random-walks spread from the seed. To approximate how random-walks spread, a personalized PageRank score is used. A personalized PageRank score measures the probability of being at each vertex, given that the random walker starts from the seed and is for each transition, with a certain probability γ, teleported back to the seed [13]. The probability mass is thus centered around the seed vertices, and describes the importance of each vertex as information spreaders from the seed. Figure 4 illustrates a
personalized PageRank score using the node 1α as a seed. One can clearly see how nodes near the seed have a higher probability mass than nodes further away.
[Figure: each vertex is annotated with its personalized PageRank score; the seed 1α scores 0.29, its layer-α neighbors around 0.09 to 0.105, and distant vertices such as 8β around 0.008.]
Figure 4: Personalized PageRank score of the classical random walk with interlayer weight 0.8 and with a teleportation rate of .1 using 1α as seed.
Calculating the personalized PageRank score of the whole network is computationally expensive, so a local approximation is instead used.
The personalized PageRank approximation is done by a series of push procedures from a residual vector to a PageRank vector representing the vertices in the network [2]. The residual vector starts off with all probability mass being at the seed and the PageRank vector starts off with zero probability mass.
The push procedure involves pushing a factor γ of a vertex's residual to its PageRank, and then spreading half of the remaining residual to its adjacent vertices weighted by transition probability, while the other half stays at the vertex. This push procedure is repeated for all vertices that have enough residual.
Having enough residual depends on a truncation parameter ε and the stationary distribution: a push is made if the residual is bigger than ε · p_{iα}(∞). If a vertex has a high stationary probability, then the residual also needs to be high, since a high stationary probability means that the vertex is highly connected. Such vertices will thus only be pushed if many of their adjacent vertices have pushed residual to them. The truncation parameter ε determines the exactness of the approximation; a low ε lowers the threshold for when a push is made, and thus the approximation is more exact and reaches further from the seed.
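The push loop can be sketched as follows. This is a simplified, queue-based reconstruction of the procedure described above, assuming a dense transition matrix and precomputed stationary distribution; the function name and the small test network are invented for illustration.

```python
import numpy as np
from collections import deque

def approx_ppr(P, pi, seeds, gamma, eps):
    """Approximate personalized PageRank by repeated pushes. A vertex is
    pushed while its residual exceeds eps * pi[v]: a gamma fraction of
    the residual moves into the PageRank vector p, half of the rest
    stays at the vertex, and half spreads to neighbours weighted by
    transition probability."""
    n = P.shape[0]
    p, r = np.zeros(n), np.zeros(n)
    for s in seeds:
        r[s] = 1.0 / len(seeds)
    queue, queued = deque(seeds), set(seeds)
    while queue:
        u = queue.popleft()
        queued.discard(u)
        if r[u] <= eps * pi[u]:
            continue
        res, r[u] = r[u], 0.0
        p[u] += gamma * res
        rest = (1 - gamma) * res
        r[u] += rest / 2
        neighbours = np.nonzero(P[u])[0]
        for v in neighbours:
            r[v] += (rest / 2) * P[u, v]
        for v in np.append(neighbours, u):
            if r[v] > eps * pi[v] and v not in queued:
                queue.append(v)
                queued.add(v)
    return p, r

# A small test network: two triangles joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
P = A / A.sum(axis=1, keepdims=True)
pi = A.sum(axis=1) / A.sum()
p, r = approx_ppr(P, pi, seeds=[0], gamma=0.1, eps=1e-3)
```

Probability mass is conserved between p and r, and on termination every residual is below its threshold ε · p_{iα}(∞), which is exactly the stopping condition described above.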
Figure 5 illustrates an approximation after one iteration.
[Figure: the seed 1α has r = 0.45 and pr = 0.1; its neighbors 2α, 3α, 4α and 1β each have r = 0.1125 and pr = 0; all other vertices have r = 0, pr = 0.]
Figure 5: One iteration into an approximation of a personalized PageRank vector using 1α as a seed with γ = 0.1 using the classical random walk. Each vertex has a r and a pr score that corresponds to the residual and PageRank score of the vertex.
This process is efficient since it, other than needing the stationary distribution, only ever operates on the local environment of a vertex which makes it independent of network size.
The result from this step is an ε-approximated personalized PageRank-vector with teleportation probability γ, describing the importance of each vertex in the information spreading from a seed.
3.4 Sweep cut
Given a personalized PageRank-vector, the sweep cut step orders the vertices by PageRank value such that PageRank(v_n) ≥ PageRank(v_{n+1}), and returns the subsets {{v1}, {v1, v2}, {v1, v2, v3}, {v1, v2, v3, v4}, ...} with the corresponding conductance of each subset.
The conductance values of each subset are calculated in an efficient online manner.
A sweep cut of the network in figure 4, with the same personalized PageRank score, would yield 16 subsets with their corresponding conductances. The ordering of the vertices would be: {1α, 2α, 3α, 4α, 1β, 3β, 2β, 4β, 6α, 5α, 6β, 7α, 7β, 8α, 5β, 8β}, with corresponding conductances of the sweeps: {1, 0.73, 0.5, 0.3, 0.35, 0.27, 0.24, 0.12, 0.2, 0.3, 0.4, 0.5, 0.7, 0.73, 1, 1}. One can see that the global minimum is at 8 nodes, i.e. the set {1α, 2α, 3α, 4α, 1β, 3β, 2β, 4β}, and interestingly there is also a local minimum at four nodes: the set {1α, 2α, 3α, 4α}.
Communities can be extracted from the sweep sets either by only taking the global minimum, or by also including local minima.
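A naive sweep cut can be sketched as below. For clarity it recomputes equation 5 for every prefix instead of using the efficient online update, and it considers only proper prefixes of the vertices the walk actually reached; all names are illustrative.

```python
import numpy as np

def sweep_cut(scores, P, pi):
    """Order vertices by PageRank score and return the prefix set with
    the lowest conductance (equation 5), together with that value."""
    order = np.argsort(-scores)
    order = order[scores[order] > 0]  # only vertices the walk reached
    best_set, best_phi = None, np.inf
    for k in range(1, len(order)):
        S = order[:k]
        outside = np.setdiff1d(np.arange(len(pi)), S)
        phi = (pi[S, None] * P[np.ix_(S, outside)]).sum() / pi[S].sum()
        if phi < best_phi:
            best_set, best_phi = set(S.tolist()), phi
    return best_set, best_phi

# Test network: two triangles joined by the edge 2-3; a personalized
# PageRank vector from seed 0 is computed by plain iteration here.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
P = A / A.sum(axis=1, keepdims=True)
pi = A.sum(axis=1) / A.sum()
seed = np.zeros(6)
seed[0] = 1.0
ppr = seed.copy()
for _ in range(1000):  # teleportation probability 0.15
    ppr = 0.15 * seed + 0.85 * (ppr @ P)
community, phi = sweep_cut(ppr, P, pi)
```

On this network the sweep recovers the seed's triangle as the global conductance minimum, which is the behavior the section describes for figure 4.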
4 Implementation
The implementation of the algorithm has been made in C++ for efficiency and in order to be included in a multilayer network analysis library. It involves two steps: preparation and community extraction.
The preparation part of the implementation takes three parameters: a multilayer network data structure, a random walk type with an interlayer weight parameter, and a random teleportation parameter. It then translates the network into a transition matrix based on the random walk type and interlayer weight parameter. As the random walkers can potentially transition from any vertex to any other vertex in the network, the matrix has |actors|² · |layers|² entries. Most of the possible edges are however usually absent, so a sparse matrix implementation is used. The stationary distribution of a random walk is calculated by extracting either the largest eigenvector or a PageRank vector from the transition matrix, based on the random teleportation parameter [13]. The sparse matrix representation and the eigenvector and PageRank calculations are done using the software libraries Eigen and Spectra [14][15].
To extract a community, the personalized PageRank vector approximation and sweep cut procedure are run, parameterized by ε, γ and a seed. The community returned is the global minimum conductance set from the sweep set.
As the network preparation can be costly, the same transition matrix and stationary distribution can be reused for different community extraction parameters, making extraction of multiple communities from the same network and random walk efficient.
5 Experiments
The experiments focus on evaluating the accuracy and scalability of the ACL-cut method.
A comparison is also made with a Python implementation of the ML-LCD algorithm to see how the ACL-cut method contrasts with it.
5.1 Accuracy
The main method used to measure accuracy is to compare produced communities with ground truth. The ground truth used is datasets with community labeled actors.
To produce a community using the ACL-cut, all vertices corresponding to the same actor are selected as a seed to the method. The resulting community is then simplified to an actor level by including the actor in the community if it is present on any layer.
The simplified community is then compared to the ground truth community of the actor used as seed. This is done for every actor in the network.
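The seeding and simplification steps are one-liners under a supra-index convention where vertex index = layer * |actors| + actor. That convention, and both function names, are assumptions of this example, not necessarily of the thesis code.

```python
def actor_seed(actor, n_actors, n_layers):
    """All supra-indices of one actor, i.e. its vertex on every layer,
    used together as the seed set (index = layer * n_actors + actor)."""
    return [layer * n_actors + actor for layer in range(n_layers)]

def to_actor_level(community, n_actors):
    """Collapse a vertex-level community to the set of actors that are
    present in it on at least one layer."""
    return {v % n_actors for v in community}
```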
The comparison with ground truth is made with the Jaccard index. The Jaccard index measures the similarity between two sets by the ratio between their intersection and their union [11]. A Jaccard index of 1 means equal sets, and a Jaccard index of 0 means no shared values. Equation 6 defines the Jaccard index.
J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|    (6)
Another metric used to compare produced communities with ground truth is size ratio.
The size ratio is defined in equation 7. SR(S1, S2) will approach 1 as |S1| increases compared to |S2|, and approach 0 as |S1| decreases compared to |S2|. Equally sized communities yield a size ratio of 0.5.
SR(S1, S2) = |S1| / (|S1| + |S2|)    (7)
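Both metrics are straightforward on Python sets; a minimal sketch of equations 6 and 7:

```python
def jaccard(s1, s2):
    """Equation 6: |intersection| / |union|."""
    s1, s2 = set(s1), set(s2)
    return len(s1 & s2) / len(s1 | s2)

def size_ratio(s1, s2):
    """Equation 7: |S1| / (|S1| + |S2|)."""
    return len(s1) / (len(s1) + len(s2))
```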
5.1.1 Accuracy with respect to noise
The first experiment measures the accuracy of the ACL-cut method with respect to noise on synthetic datasets. Four datasets with ground truth containing 300 actors on 3 layers are used. The datasets have increasing levels of noise in them, making the communities less pronounced. Noise is here defined by the fraction of edges a vertex shares with vertices in other communities.
The experiment is run on the different datasets using different combinations of the parameters γ and ε to see how they affect the result.
5.1.2 Real dataset
The second experiment is run on a real dataset of 61 employees of a university. The dataset consists of 61 actors on 5 layers. The 5 layers represent the relationships between the actors with respect to Facebook friendship, having lunch, co-authorship, friendship and working together [12]. Most actors in the dataset also belong to research groups, which are used as ground truth communities.¹ The method is again run using different combinations of the parameters γ and ε to see how they affect the result.
5.2 Scalability
The scalability aspect of the method is tested by running the implementation on synthetic datasets of increasing size. Preparation of the network and community extraction are tested separately since the preparation can be reused for multiple community extractions.
The execution time and accuracy of the community extraction are measured using a sample of 100 actors in each dataset as seeds. The community structure of the datasets is fixed to approximately nine equally sized communities, making the community sizes a function of network size. The edge density of the datasets is also fixed, where edge density is defined by 2 · |Edges| / (|Vertices| · (|Vertices| − 1)). The method is again run with different parameter combinations to see how these affect the result.
The aim of this experiment is to measure the method's ability to detect big communities, and how execution time scales with network and community size.
¹ Only employees that belong to exactly one research group are used as seeds.
6 Results
This section covers results and interpretations of all experiments. Appendix A and B contain statistics about each dataset used, and complete results from all experiments.
6.1 Accuracy over noise
The first experiment measures accuracy over noise. Figure 6 illustrates the method’s performance with different parameter combinations as well as the performance of the ML-LCD method in networks with increasing levels of noise.
Figure 6: Results from the noise experiment. The datasets each contain 300 actors with increasing levels of noise. The y-axis shows the average Jaccard index between the produced communities and the ground truth. The x-axis shows the level of noise in the dataset. The functions are different executions of the ACL-cut method parameterized by ε/γ, as well as the ML-LCD algorithm. The relaxed random walk was used with an interlayer weight of .5.
The method performs very well on these datasets when noise is low using a low ε and high γ parameter, meaning that a good approximation is made of a random walker with a low probability of being teleported back to the seed. The 1/.998 and 1/.9 settings have almost equal performance on each dataset, and outperform the ML-LCD and the other parameter settings by far when noise is at .1 and .25. However, as noise increases to .4 and .5, a drastic performance dip is observed compared to the other settings. At these levels of noise, the 1/.998 and 1/.9 settings start producing very big communities compared to the ground truth, yet with much lower conductances than communities found near the ground truth scale. An interesting question not explored here is whether the ground truth communities show up as local minima in the sweep sets.
The 10/.4, 5/.4, 10/.9 and 10/.998 settings consistently score low regardless of noise.
The communities produced by these settings are small compared to the ground truth, and the conductances are high. The probable explanation is that the approximations terminate before reaching the edge of the communities.
The 5/.9, 5/.998, 1/.4 and the ML-LCD all perform similarly as noise increases, with ML-LCD performing best at high levels of noise. Like the 10/.4, 5/.4, 10/.9 and 10/.998 settings, these settings also return smaller-than-ground-truth communities with high conductance, suggesting that these parameters also need to be adjusted so that a bigger part of the network is explored to find the edge of the communities.
6.2 Accuracy on the Aucs network
The second experiment measures the method's accuracy on the real world Aucs dataset.
Once again performance of different parameter settings and the ML-LCD is compared.
Figure 7 illustrates the performances.
Much like in the noise networks, the 1/.9 and 1/.998 settings found very big communities compared to the ground truth, but interestingly also with lower conductance values than the ones closer to ground truth.
The 1/.4, 5/.9, 10/.998 and 5/.998 all performed similarly and seem to best capture the ground truth, although their conductance values were quite high compared to the 1/.9 and 1/.998 settings.
The ML-LCD algorithm outperformed all tested parameter settings of the ACL-cut.
Figure 7: Results from the Aucs network experiment. The x values are the parameter settings of the algorithm: ε/γ. Avg Jaccard is the average Jaccard index between the produced communities and ground truth. Std Jaccard is the standard deviation of the Jaccard index between the produced communities and ground truth. Avg Conductance is the average conductance of the produced communities. Avg Sizeratio is the average size ratio between the produced communities and the ground truth, i.e. SR(found, groundtruth).
The relaxed random walk was used with an interlayer weight of .5.
6.3 Scalability
6.3.1 Network preparation
The first scalability experiment measures the time it takes to translate a multilayer net- work data structure into a transition matrix and to calculate its stationary distribution.
The preparation is tested on datasets where the number of actors ranges between 200 and 10000 on three layers. Aside from the number of actors, the number of edges is also an important variable for the preparation time, and the number of edges approximately
quadruples between datasets, increasing from circa 3000 to 8000000. Figure 8 illustrates how the preparation time of the relaxed and classical random walks scale with the number of actors.
Results show that the preparation is quite fast for small networks, but that it does not scale well with network size when edge density is kept constant.
Figure 8: Results from the network preparation scalability experiment. The y-axis shows the number of seconds it took for the preparation to finish. The x-axis shows the number of actors of each dataset. The two functions describe how the network preparation part of the method scales for the classical and the relaxed random walks. The stationary distribution was calculated with a random teleportation parameter of .1, meaning that it corresponds to a PageRank vector.
6.3.2 Community extraction
The second scalability experiment measures how the community extraction scales. Figure 9 illustrates the average execution time of different parameter settings of the classical random walk as network and community sizes increase. The results show that different parameter settings scale differently. The 2/.998 setting has the worst slope, as it does the biggest and best approximation, while the 10/.4 setting has almost constant scaling.
An important consideration is that the push threshold ε · p_{iα}(∞) shrinks as the network grows, since the individual stationary probabilities decrease with network size; a fixed ε therefore makes a better approximation in the big networks than it does in the small networks.
Bigger experiments to test the implementation's limits were prevented by the library's network data structure taking up a lot of memory.
The relaxed random walk was approximately 2x slower than the classical. Exact results can be found in Appendix B.
Figure 9: Results from executing the method on datasets ranging from size 200 to 10000.
The y-axis shows the average amount of time it took to finish in seconds. The x-axis shows the number of actors in the datasets. The functions are different executions of the ACL-cut method parameterized by ε/γ. The classical random walk with an interlayer weight of 1 was used.
Figure 10 illustrates the same executions as figure 9, but also includes the ML-LCD execution time as a comparison. The average ML-LCD time breaks 100 seconds on the 1000 actor network, while the ACL-cut finishes in less than a second on the 10000 one.
Figure 10: This figure shows the same results as figure 9, but with the inclusion of the ML-LCD execution time for comparison.
6.3.3 Accuracy as size increases
The third scalability experiment measures how well the method captures communities as they increase in size. Figure 11 illustrates the average Jaccard index between the found communities and the corresponding ground truth as the network and community sizes increase using the classical random walk. The classical and the relaxed random walks scored almost identically.
The ML-LCD was only tested up to the 1000 actor network as execution time became excessive. The ML-LCD scored circa .6 on the 200 and 500 networks, and increased to .63 on the 1000 actor network.
The ACL-cut was able to capture the ground truth communities very well using the 2/.998 and 2/.9 settings, and scored progressively worse with lower settings. The biggest communities are of size 1111, and the good Jaccard index suggests that the method is very capable of finding communities of that size and bigger.
Figure 11: Results from executing the method on datasets ranging from size 200 to 10000.
The y-axis shows the average Jaccard index between the produced communities and the ground truth. The x-axis shows the number of actors in the datasets. The functions are different executions of the ACL-cut method parameterized by ε/γ, as well as the ML-LCD algorithm. The classical random walk with an interlayer weight of 1 was used.
7 Future work
All communities produced by the ACL-cut have been simplified to an actor level by including an actor if it is present on any layer. This simplification was made in order to compare the produced communities, which are on a vertex level, with the actor level ground truth. A problem with this simplification is that it does not distinguish between an actor found on all layers and an actor found on only one layer. A possible improvement to this comparison would be to instead convert the ground truth to a vertex level by including all vertices of the actors, and then make the comparison with the original community found by the ACL-cut.
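Both conversions can be sketched as follows, under the assumption that a vertex is represented as an (actor, layer) pair; the function names are hypothetical and not from the thesis implementation:

```python
def to_actor_level(vertex_community):
    """Collapse a vertex level community, a set of (actor, layer) pairs,
    to the set of actors present on at least one layer."""
    return {actor for actor, _layer in vertex_community}

def to_vertex_level(actor_community, layers):
    """The reverse conversion suggested above: expand each actor to all
    of its vertices, one per layer."""
    return {(actor, layer) for actor in actor_community for layer in layers}

community = {("alice", "work"), ("alice", "lunch"), ("bob", "work")}
# Note the information loss: "bob" (one layer) becomes indistinguishable
# from "alice" (two layers) after the collapse.
print(sorted(to_actor_level(community)))  # -> ['alice', 'bob']
print(sorted(to_vertex_level({"bob"}, ["work", "lunch"])))
```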
For further evaluation, it would also be interesting to test the method on vertex level ground truth networks where the communities are present on only some layers. This would enable more experimentation with the interlayer parameters of the random walks.
Another interesting notion to explore is partial knowledge of communities. If one has knowledge of a subset of a community, then using that entire subset as the seed set should increase the probability of finding the rest of the community compared to using single vertices or actors by themselves.
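One plausible way to realize this, sketched here as an assumption rather than something implemented in this project, is to spread the initial probability mass of the personalization vector uniformly over the known subset instead of concentrating it on a single vertex:

```python
def seed_distribution(seeds, all_vertices):
    """Uniform starting distribution over a seed set. With a single seed
    this reduces to the usual one-hot personalization vector."""
    seeds = set(seeds)
    mass = 1.0 / len(seeds)
    return {v: (mass if v in seeds else 0.0) for v in all_vertices}

dist = seed_distribution({"a", "b"}, ["a", "b", "c", "d"])
print(dist["a"], dist["c"])  # -> 0.5 0.0
```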
8 Conclusion
The ACL-cut algorithm, as described in the paper A local perspective on community structure in multilayer networks, finds communities in multiplex networks by identifying sets of vertices that act as a bottleneck on interlayer transitioning random walks [1]. This is done by approximating how random walks spread from a set of seed vertices. The idea is that a random walker is likely to stay within a set of vertices that has higher internal than external connectivity. Vertices that are in the same community as the seed are thus far more likely to be reached than other vertices. A vertex's probability of being reached is then used as a score for how good a candidate the vertex is for community inclusion.
The vertices are included in a preliminary community in decreasing order of this score, and the conductance of the community is recorded after each inclusion. Good communities can finally be extracted by taking conductance minima from this sequence of preliminary communities.
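The sweep step summarized above can be sketched as follows. This is a simplified illustration on a single-layer graph given as an adjacency dict, not the thesis's C++ multiplex implementation; `scores` stands in for the approximated random walk probabilities:

```python
def conductance(graph, community):
    """cut(S) / min(vol(S), vol(V \\ S)) for an unweighted undirected graph."""
    cut = sum(1 for u in community for v in graph[u] if v not in community)
    vol = sum(len(graph[u]) for u in community)
    total = sum(len(graph[u]) for u in graph)
    denom = min(vol, total - vol)
    return cut / denom if denom > 0 else 1.0

def sweep_cut(graph, scores):
    """Add vertices in decreasing score order, track conductance after
    each inclusion, and return the prefix with the lowest conductance."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    prefix, best, best_cond = set(), None, float("inf")
    for v in ordered:
        prefix.add(v)
        c = conductance(graph, prefix)
        if c < best_cond and len(prefix) < len(graph):
            best, best_cond = set(prefix), c
    return best, best_cond

# Two triangles joined by a single edge; the seed-side triangle has the
# highest scores and is recovered as the minimum-conductance prefix.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.2, 4: 0.1, 5: 0.05}
print(sweep_cut(graph, scores))  # -> ({0, 1, 2}, 0.14285714285714285)
```

This sketch keeps only the global minimum; extracting several local minima, as discussed for the high noise datasets, would mean recording every prefix whose conductance is lower than its neighbours in the sweep.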
A C++ implementation of the ACL-cut algorithm has been made and evaluated.
Experiments have been made to test the method’s ability to detect communities and to see how it scales with network size.
To test the method’s accuracy, the implementation was executed on synthetic datasets with ground truth containing increasing levels of noise. The method identified the ground truth communities accurately in low noise datasets, but tended to find very big communities compared to the ground truth as noise increased. An interesting point to explore in further evaluation is whether the ground truth communities show up as local conductance minima in the preliminary community in the high noise datasets.
The method was also tested on the real world dataset Aucs, a multiplex network with 5 layers describing different relationships between 61 employees of a university [12]. Each employee belongs to a research group, which was used as that employee’s ground truth community. The method managed to capture the ground truth for some seeds, but performed poorly compared to the ML-LCD algorithm.
However, the method was again able to find quite big low-conductance sets, which may correspond to other interesting structures.
To test the implementation’s scalability, it was executed on synthetic datasets of increasing size with a fixed number of ground truth communities, making community size a function of network size. It was shown to have orders of magnitude better execution time than a Python implementation of the ML-LCD algorithm. It was also shown that it is able to capture communities up to size 1111 with no loss of accuracy, suggesting that it is capable of detecting communities bigger than this.
In summary, the method has proved to be very fast and proficient at finding low conductance sets centered around seeds. The high number of parameters can however make the method difficult to use.
References
[1] Jeub, L. G., Mahoney, M. W., Mucha, P. J., & Porter, M. A. (2017). A local perspective on community structure in multilayer networks. Network Science, 1-20.
[2] Andersen, R., Chung, F., & Lang, K. (2006, October). Local graph partitioning using pagerank vectors. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on (pp. 475-486). IEEE.
[3] Loe, C. W., & Jensen, H. J. (2015). Comparison of communities detection algorithms for multiplex. Physica A: Statistical Mechanics and its Applications, 431, 29-45.
[4] Dickison, M. E., Magnani, M., & Rossi, L. (2016). Multilayer social networks. Cambridge University Press, 96-113.
[5] Tang, L., Wang, X., & Liu, H. (2009, December). Uncoverning groups via heterogeneous interaction analysis. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on (pp. 503-512). IEEE.
[6] Kuncheva, Z., & Montana, G. (2015, August). Community detection in multiplex networks using locally adaptive random walks. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 (pp. 1308-1315). ACM.
[7] De Domenico, M., Solé-Ribalta, A., Gómez, S., & Arenas, A. (2014). Navigability of interconnected networks under random failures. Proceedings of the National Academy of Sciences, 111(23), 8351-8356.
[8] De Domenico, M., Lancichinetti, A., Arenas, A., & Rosvall, M. (2015). Identifying modular flows on multilayer networks reveals highly overlapping organization in interconnected systems. Physical Review X, 5(1), 011027.
[9] Jerrum, M., & Sinclair, A. (1988, January). Conductance and the rapid mixing property for Markov chains: the approximation of the permanent resolved. In Proceedings of the twentieth annual ACM symposium on Theory of computing (pp. 235-244). ACM.
[10] Interdonato, R., Tagarelli, A., Ienco, D., Sallaberry, A., & Poncelet, P. (2017, March). Node-centric community detection in multilayer networks with layer-coverage diversification bias. In Workshop on Complex Networks CompleNet (pp. 57-66). Springer, Cham.
[11] Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37-50.
[12] Rossi, L., & Magnani, M. (2015). Towards effective visual analytics on multiplex and multilayer networks. Chaos, Solitons and Fractals, 72, 68-76.
[13] Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
[14] Gaël Guennebaud, Benoît Jacob and others. (2017). Eigen v3. http://eigen.tuxfamily.org.
[15] Qiu Yixuan. (2017). Spectra. https://spectralib.org/.
A Dataset statistics
This appendix contains statistics about the synthetic datasets used. Following is a description of the table columns. nodes is the number of actors in the network. edges is the number of edges in the network. layers is the number of layers in the network. density is defined by 2·|Edges| / (|Vertices|·(|Vertices|−1)). avg degree is the average degree of the vertices, i.e. the average number of edges incident to each vertex.
num communities is the number of ground truth communities. community mean size is the mean community size.
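Two of the columns can be recomputed from the others, which gives a quick sanity check on the tables below. A small sketch, not the thesis tooling, using the first row of the scalability statistics (200 actors, 2865 edges, 9 communities) as an example:

```python
def avg_degree(num_edges, num_actors):
    """Average degree of an undirected network: 2 * |Edges| / |Vertices|."""
    return 2 * num_edges / num_actors

def mean_community_size(num_actors, num_communities):
    """Holds under the assumption, consistent with the tables, that every
    actor belongs to exactly one ground truth community."""
    return num_actors / num_communities

print(avg_degree(2865, 200))                   # -> 28.65, matching the table
print(round(mean_community_size(200, 9), 5))   # -> 22.22222
```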
Accuracy
The noise column corresponds to the fraction of edges a vertex shares with vertices in other communities.
noise nodes edges layers density avg degree num communities community mean size
0.1 300 13040 3 0.096916 86.93333 4 75
0.25 300 13707 3 0.101873 91.38 4 75
0.4 300 13507 3 0.100386 90.04667 6 50
0.5 300 13078 3 0.097198 87.18667 7 43
Figure 12: 300 actors with increasing noise dataset statistics.
Scalability
nodes edges layers density average degree num communities community mean size
200 2865 3 0.04799 28.65 9 22.22222
500 18432 3 0.049251 73.728 9 55.55556
1000 72437 3 0.04834 144.874 9 111.1111
2000 294291 3 0.049073 294.291 9 222.2222
5000 1826367 3 0.048713 730.5468 9 555.5556
10000 7389425 3 0.049268 1477.885 9 1111.111
Figure 13: Scalability dataset statistics
B Experiment results
This appendix contains complete results from the noise, Aucs and scalability experiments. All experiments were executed on a computer with an Intel(R) Core(TM) i7-6700 CPU @ 3.4 GHz processor and 16GB of DDR4 @ 2133 MHz RAM. Following is a description of what the table columns represent. params is the parameter settings of an execution, ε/γ. noise is the level of noise in the network. size is the number of actors in the network. avgjac is the average Jaccard index between the produced communities and the ground truth. stdjac is the standard deviation of the Jaccard index between the produced communities and the ground truth. avgcond is the average conductance of the produced communities. avgsizeratio is the average size ratio between the produced communities and the ground truth, i.e. the average SR(found, groundtruth) where SR is defined in equation 7. stdsizeratio is the standard deviation of the size ratios. avgtimes is the average execution time in seconds.
Accuracy over noise
Relaxed random walk
params noise avgjac stdjac avgcond avgsizeratio
10/0.4 0.1 0.0362559 0.0434039 0.964637 0.0454556
5/0.4 0.1 0.183692 0.20524 0.892813 0.15639
1/0.4 0.1 0.601723 0.262457 0.575776 0.400857
10/0.9 0.1 0.225032 0.24092 0.867465 0.181764
5/0.9 0.1 0.524421 0.281851 0.626812 0.356606
1/0.9 0.1 0.961057 0.0230311 0.106755 0.509999
10/0.998 0.1 0.255029 0.250341 0.851902 0.202655
5/0.998 0.1 0.603339 0.311837 0.5162 0.381819
1/0.998 0.1 0.957411 0.113166 0.099586 0.513155
10/0.4 0.25 0.0184038 0.0114907 0.958225 0.0453143
5/0.4 0.25 0.0874058 0.0885485 0.918743 0.128662
1/0.4 0.25 0.446453 0.207876 0.748061 0.394639
10/0.9 0.25 0.1329 0.121258 0.904435 0.166134
5/0.9 0.25 0.375325 0.200581 0.803931 0.345886
1/0.9 0.25 0.755557 0.145818 0.282854 0.574116
10/0.998 0.25 0.16537 0.143888 0.892208 0.191663
5/0.998 0.25 0.428477 0.209333 0.765231 0.373734
1/0.998 0.25 0.752655 0.163865 0.27256 0.576145
10/0.4 0.4 0.0331992 0.0339406 0.940761 0.0771425
5/0.4 0.4 0.113849 0.140003 0.885225 0.194674
1/0.4 0.4 0.37896 0.187945 0.755711 0.46108
10/0.9 0.4 0.14568 0.159962 0.874027 0.224926
5/0.9 0.4 0.334398 0.215099 0.80269 0.398211
1/0.9 0.4 0.246475 0.06533 0.375915 0.804413
10/0.998 0.4 0.176717 0.176649 0.86324 0.253127
5/0.998 0.4 0.376331 0.220372 0.780346 0.430609
1/0.998 0.4 0.247058 0.0657484 0.372936 0.804064
10/0.4 0.5 0.0354738 0.0335916 0.921107 0.104218
5/0.4 0.5 0.101322 0.126038 0.866776 0.221742
1/0.4 0.5 0.255074 0.1988 0.768076 0.440411
10/0.9 0.5 0.117442 0.145305 0.85958 0.242277
5/0.9 0.5 0.229359 0.206771 0.808205 0.383922
1/0.9 0.5 0.205299 0.0422679 0.39693 0.830689
10/0.998 0.5 0.13505 0.15831 0.849708 0.264971
5/0.998 0.5 0.255217 0.21418 0.798941 0.408758
1/0.998 0.5 0.205176 0.0422968 0.395848 0.830762
ML-LCD 0.1 0.5338779421867629 0.146651101854 0 0.34195433315068835
ML-LCD 0.25 0.44823296987633976 0.202394227371 0 0.3229629716177331
ML-LCD 0.4 0.39388134889301035 0.277715410254 0 0.31683434457834647
ML-LCD 0.5 0.26500926553594406 0.26225933535 0 0.3187538649906146
Figure 14: Results from the accuracy experiments using the relaxed random walk and the ML-LCD. Datasets used: 300 actors with increasing noise.
Classical random walk
params noise avgjac stdjac avgcond avgsizeratio
10/0.4 0.1 0.023974 0.026092 0.910498 0.0305058
5/0.4 0.1 0.159368 0.197802 0.862601 0.134117
1/0.4 0.1 0.559379 0.27146 0.590222 0.379102
10/0.9 0.1 0.205246 0.244989 0.835115 0.161503
5/0.9 0.1 0.502595 0.298117 0.603367 0.340761
1/0.9 0.1 0.960099 0.02406 0.100941 0.510255
10/0.998 0.1 0.237169 0.257268 0.818795 0.184618
5/0.998 0.1 0.595318 0.325251 0.492576 0.373948
1/0.998 0.1 0.960939 0.11208 0.0930402 0.512228
10/0.4 0.25 0.0160019 0.00853234 0.906777 0.0356281
5/0.4 0.25 0.0567042 0.0657609 0.882323 0.0956047
1/0.4 0.25 0.414328 0.23665 0.735099 0.369957
10/0.9 0.25 0.104044 0.118459 0.866632 0.135452
5/0.9 0.25 0.342772 0.233688 0.776019 0.315517
1/0.9 0.25 0.733787 0.155477 0.272377 0.582009
10/0.998 0.25 0.138982 0.144655 0.853561 0.163354
5/0.998 0.25 0.411389 0.230992 0.732664 0.358509
1/0.998 0.25 0.729851 0.169791 0.260578 0.584224
10/0.4 0.4 0.028356 0.0252647 0.891992 0.0639223
5/0.4 0.4 0.0914189 0.124805 0.847552 0.168601
1/0.4 0.4 0.332926 0.223226 0.725932 0.423091
10/0.9 0.4 0.11992 0.151538 0.831511 0.198659
5/0.9 0.4 0.29969 0.232497 0.765179 0.367331
1/0.9 0.4 0.249966 0.0676669 0.357197 0.802303
10/0.998 0.4 0.155161 0.172429 0.819186 0.232499
5/0.998 0.4 0.348819 0.23701 0.739357 0.407688
1/0.998 0.4 0.250123 0.0676531 0.346322 0.802203
10/0.4 0.5 0.0304374 0.0219873 0.872583 0.0860079
5/0.4 0.5 0.0845905 0.110363 0.822596 0.201625
1/0.4 0.5 0.228174 0.203898 0.725637 0.415772
10/0.9 0.5 0.101166 0.133265 0.811152 0.224385
5/0.9 0.5 0.200828 0.204298 0.758985 0.357126
1/0.9 0.5 0.209098 0.0450455 0.379571 0.828204
10/0.998 0.5 0.118732 0.149906 0.800055 0.249288
5/0.998 0.5 0.228874 0.217259 0.749346 0.38471
1/0.998 0.5 0.208772 0.0448574 0.370088 0.828406
Figure 15: Results from the accuracy experiments using the classical random walk.
Datasets used: 300 actors with increasing noise.
Aucs
Relaxed random walk and ML-LCD
params avgjac stdjac avgcond avgsizeratio stdsizeratio
10/0.4 0.260113 0.14562 0.83511 0.23362 0.117368
5/0.4 0.463138 0.224217 0.693585 0.355674 0.124513
1/0.4 0.618607 0.197907 0.451391 0.589628 0.0930869
10/0.9 0.494218 0.246757 0.629287 0.373862 0.138364
5/0.9 0.600986 0.229305 0.487484 0.524965 0.130202
1/0.9 0.274055 0.173739 0.241365 0.796332 0.0861884
10/0.998 0.550293 0.262584 0.587611 0.413367 0.140206
5/0.998 0.607951 0.255991 0.442852 0.566011 0.13369
1/0.998 0.261381 0.10273 0.217594 0.798036 0.0647451
ML-LCD 0.800551335456996 0.299519544386 0 0.509811262477382 0.089177090832
Figure 16: Results from the Aucs dataset using the relaxed random walk with interlayer parameter 0.5 and the ML-LCD.
Classical random walk
params avgjac stdjac avgcond avgsizeratio stdsizeratio
10/0.4 0.150943 0.125003 0.474147 0.124679 0.0607898
5/0.4 0.178367 0.150198 0.470915 0.151114 0.0849807
1/0.4 0.488427 0.300882 0.383936 0.424656 0.230375
10/0.9 0.333034 0.295853 0.442062 0.244196 0.166585
5/0.9 0.570708 0.298129 0.363052 0.420487 0.182666
1/0.9 0.579357 0.267257 0.180703 0.650262 0.116768
10/0.998 0.399738 0.324468 0.415814 0.292934 0.19576
5/0.998 0.667347 0.281941 0.294644 0.4778 0.159217
1/0.998 0.339785 0.246229 0.116654 0.767199 0.115012
Figure 17: Results from the Aucs dataset using the classical random walk with an inter- layer parameter of 1.
Scalability
ML-LCD
params size avgjac avgtimes avgsizeratio
ML-LCD 200 0.6006851851851852 0.10487611770629883 0.36758901872590566
ML-LCD 500 0.5967740740740741 5.611178197860718 0.36339029330162503
ML-LCD 1000 0.6331 101.83022415399552 0.38261364697363764
Figure 18: Results from the scalability experiments using the ML-LCD algorithm.
Relaxed random walk
params size avgjac avgcond avgtimes avgsizeratio
10/0.4 200 0.281647 0.815985 0.0000999 0.219396
5/0.4 200 0.507401 0.675671 0.00012932 0.346266
2/0.4 200 0.730483 0.464073 0.00017863 0.451609
10/0.9 200 0.687088 0.47709 0.00016764 0.412686
5/0.9 200 0.85495 0.222813 0.0002876 0.482311
2/0.9 200 0.949664 0.110435 0.00049353 0.513483
10/0.998 200 0.781904 0.28654 0.00021244 0.440881
5/0.998 200 0.903989 0.193501 0.0004253 0.48925
2/0.998 200 0.971285 0.100146 0.00098502 0.508242
10/0.4 500 0.241515 0.870706 0.00022519 0.201403
5/0.4 500 0.484082 0.727767 0.00039448 0.338195
2/0.4 500 0.697147 0.519181 0.00064291 0.44533
10/0.9 500 0.659311 0.534068 0.00057742 0.405594
5/0.9 500 0.857376 0.226768 0.00127474 0.48024
2/0.9 500 0.962546 0.106414 0.0024603 0.509646
10/0.998 500 0.774525 0.296154 0.00097176 0.437742
5/0.998 500 0.902001 0.198099 0.00185961 0.483941
2/0.998 500 0.985013 0.0993144 0.00489888 0.503822
10/0.4 1000 0.26266 0.880618 0.00068428 0.216233
5/0.4 1000 0.490601 0.734288 0.00139721 0.345454
2/0.4 1000 0.683783 0.553577 0.00232596 0.448036
10/0.9 1000 0.64863 0.566462 0.00205207 0.408924
5/0.9 1000 0.866396 0.220149 0.00465835 0.48471
2/0.9 1000 0.964552 0.105248 0.00789246 0.509122
10/0.998 1000 0.788331 0.281066 0.003557 0.443549
5/0.998 1000 0.912626 0.190644 0.00650271 0.486544
2/0.998 1000 0.990313 0.0978935 0.0180258 0.502466
10/0.4 2000 0.257007 0.886484 0.00208956 0.213468
5/0.4 2000 0.506775 0.732677 0.00481719 0.350745
2/0.4 2000 0.676898 0.547445 0.00783157 0.443973
10/0.9 2000 0.649821 0.570292 0.00686206 0.40951
5/0.9 2000 0.872423 0.21208 0.017187 0.482982
2/0.9 2000 0.968175 0.103208 0.0280946 0.508162
10/0.998 2000 0.803852 0.271374 0.0135313 0.444803
5/0.998 2000 0.921062 0.18039 0.0241019 0.487447
2/0.998 2000 0.989964 0.0969008 0.0594647 0.502551
10/0.4 5000 0.23557 0.902827 0.0107847 0.199896
5/0.4 5000 0.451558 0.764087 0.0258496 0.322299
2/0.4 5000 0.652725 0.567106 0.0454699 0.434025
10/0.9 5000 0.604786 0.602286 0.0377307 0.387128
5/0.9 5000 0.837271 0.242004 0.0975779 0.470036
2/0.9 5000 0.965769 0.102298 0.166789 0.508779
10/0.998 5000 0.758456 0.314185 0.0756757 0.424951
5/0.998 5000 0.892634 0.205979 0.136088 0.479675
2/0.998 5000 0.990114 0.0954997 0.299941 0.502511
10/0.4 10000 0.244516 0.8983 0.0426442 0.210367
5/0.4 10000 0.468101 0.752422 0.101864 0.334889
2/0.4 10000 0.679326 0.537619 0.181466 0.44899
10/0.9 10000 0.626624 0.578396 0.149135 0.40117
5/0.9 10000 0.872565 0.204907 0.39777 0.485298
2/0.9 10000 0.96398 0.104754 0.649179 0.509253
10/0.998 10000 0.798404 0.27194 0.307974 0.444425
5/0.998 10000 0.914282 0.181694 0.561228 0.485357
2/0.998 10000 0.992541 0.096689 1.0833131 0.501891