Hide and Seek in a Social Network


Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2017


Olle Abrahamsson

LiTH-ISY-EX--17/5038--SE

Supervisor: Erik G Larsson, ISY, Linköpings universitet

Examiner: Danyo Danyev, ISY, Linköpings universitet

Division of Communication Department of Electrical Engineering


Abstract

In this thesis a known heuristic for decreasing a node’s centrality scores while maintaining influence, called ROAM, is compared to a modified version specifically designed to decrease eigenvector centrality. The performances of these heuristics are also tested against the Shapley values of a cooperative game played over the considered network, where the game is such that influential nodes receive higher Shapley values. The modified heuristic performed at least as well as the original ROAM, and in some instances even better (especially when the terrorist network behind the World Trade Center attacks was considered). Both heuristics increased the influence score for a given targeted node when applied consecutively on the WTC network, and consequently the Shapley values increased as well. Therefore the Shapley value of the game considered in this thesis seems to be well suited for discovering individuals that are assumed to be actively trying to evade social network analysis.


Acknowledgements

Firstly I would like to thank my supervisor, professor Erik G. Larsson, for his never-ending enthusiasm for this subject and all the support he has given me, and for introducing me to the wonderful subject of network theory. Likewise I would like to thank my examiner, docent Danyo Danyev, for taking the time and showing interest in this thesis.

Secondly, many thanks to my opponent Andreas Christensen for feedback and valuable criticism.

Finally I am very grateful to my friend Christoffer Holm, whose knowledge of programming led to many valuable suggestions during the implementation phase of the project.


List of Figures

List of Tables

1 Introduction
1.1 Earlier work
1.2 Method
1.3 Demarcation
1.4 Thesis layout

2 Preliminaries
2.1 Graph theory
2.2 Centrality metrics
2.3 Influence metrics

3 The ROAM heuristic
3.1 Original ROAM
3.2 Modified ROAM

4 A game theoretic approach
4.1 Cooperative games
4.2 One game to consider

5 Experiment and results
5.1 Data sets
5.2 Experiments with ROAM and modified ROAM

6 Discussion and conclusions

7 Further research

A Supplementary theory
A.1 Calculating the Shapley values for the game
A.2 The Barabási-Albert (BA) model

B Source code

Bibliography

Index

List of Figures

2.1 An example of a graph, G, with 7 nodes and 7 edges.

2.2 A subgraph, G′, of the graph G.

2.3 The graph G, with a path (blue dashed edges) from v1 to v7.

2.4 An example of a network to illustrate the various centrality measures.

2.5 An example of a network to illustrate the linear threshold influence model.

2.6 The network after one iteration of linear threshold.

2.7 The network after two iterations of linear threshold.

2.8 The network after three iterations of linear threshold. In this case all nodes became activated.

3.1 The original ROAM heuristic with budget b = 4 applied to a small network where the dotted red edge is removed, and the dashed green edges are added. (a) The original network. (b) The network after one round of ROAM. (c) The network after two rounds of ROAM.

4.1 A graph G and an induced subgraph G[{2, 3, 4, 5}].

4.2 An example of a network to illustrate the game g1.

5.1 ROAM(3) and ROAMeig(3) run on the Facebook network.

5.2 ROAM(3) and ROAMeig(3) run on the WTC terrorist network. Mohamed Atta (node 26) was the target node.

5.3 ROAM(3) and ROAMeig(3) run on the smaller scale-free network.

5.4 ROAM(3) and ROAMeig(3) run on the larger scale-free network.

5.5 ROAM(3) and ROAMeig(3) run on the Facebook network (left) and the WTC terrorist network (right).

5.6 ROAM(3) and ROAMeig(3) run on the smaller scale-free network (left) and the larger scale-free network (right).

5.7 Left: Linear threshold model of influence for the terrorist network responsible for the 9/11 WTC attacks with Mohamed Atta (node 26, in red) as seed node. Right: The same model and network, but after 3 rounds of ROAM.

5.8 Left: Linear threshold model of influence for the terrorist network responsible for the 9/11 WTC attacks with Mohamed Atta (node 26, in red) as seed node. Right: The same model and network, but after 3 rounds of ROAMeig(3).

5.9 Left: Independent cascade model of influence for the terrorist network responsible for the 9/11 WTC attacks with Mohamed Atta (node 26, in red) as seed node. Right: The same model and network, but after 3 rounds of ROAM(3).

5.10 Left: Independent cascade model of influence for the terrorist network responsible for the 9/11 WTC attacks with Mohamed Atta (node 26, in red) as seed node. Right: The same model and network, but after 3 rounds of ROAMeig(3).

List of Tables

2.1 Node ranking with respect to degree, closeness and betweenness centralities for the example network.

5.1 Number of activated nodes in the WTC terrorist network before and after applying three rounds of ROAM(3) and ROAMeig(3), respectively.

5.2 Number of activated nodes in the Facebook network before and after applying three rounds of ROAM(3) and ROAMeig(3), respectively.

5.3 Number of activated nodes in the smaller scale-free network before and after applying three rounds of ROAM(3) and ROAMeig(3), respectively.

5.4 Number of activated nodes in the larger scale-free network before and after applying three rounds of ROAM(3) and ROAMeig(3), respectively.


Nomenclature

Most of the recurring letters and symbols are described here.

Letters

G Graph, network

v, v†, vi, for i ∈ N Vertices

vivj, for i, j ∈ N Edge between vi and vj

Symbols

CD(v) Degree centrality

CC(v) Closeness centrality

CB(v) Betweenness centrality

CE(v) Eigenvector centrality

CDm(vi) mth order degree mass

V (G) The set of vertices of a graph G

E(G) The set of edges of a graph G

N (v) Neighbourhood of the vertex v

Other conventions

End of proof


1 Introduction

We live in a global world that is becoming increasingly interconnected, which has spawned significant interest in social network analysis (SNA). This has led to a rapid development of tools to analyse our behaviour online, and among many other uses these tools can be employed in preventing and investigating the activities of terrorists and other malevolent actors. One must therefore assume that the adversaries take countermeasures to evade detection by such analyses. In this thesis we study one such technique called ROAM, first developed by Waniek et al. in [14], where it was shown how an influential individual or community can manage their connections in order to lower some of their centrality scores while maintaining a high degree of influence in the network.

1.1 Earlier work

Various countermeasures have been proposed to tackle the issue of online privacy, e.g. algorithmic solutions [3] or market mechanisms which enable social media users to monetise their private information [5]. However, little attention has been given to the study of evading SNA, and instead most research has been focused on developing progressively more advanced analysis tools. In [9] a new paradigm for SNA is suggested, whereby the strategic behaviour of network actors is explicitly modelled, and in [14] two heuristics (one of which, ROAM, has already been mentioned) are proposed for concealing individuals and communities, respectively.


1.2 Method

The contribution of this thesis is a modified version of the proposed ROAM algorithm, called ROAMeig, which takes into account information about the entire network, not just the individual’s immediate neighbours. This is accomplished by replacing the degree centrality measure in ROAM with the second order degree mass, a novel concept statistically correlated to the leading eigenvector of the adjacency matrix of the network. This first task is what the first part of the title, hide, refers to. The other part, seek, alludes to developing methods to counteract the evasive attempt. Therefore we also test how these methods for cloaking an important node stand up when we apply centrality analysis based on cooperative game theory. This is done by considering a cooperative game in which coalitions are highly rewarded if they maximise the social influence (defined as the number of agents in the coalition plus the number of agents outside of the coalition reachable by one hop). The centrality of each node is then the node’s Shapley value for the considered game. If the network under analysis is disconnected, the Myerson value is used instead.

1.3 Demarcation

We will restrict our study to undirected and unweighted networks without self-loops. This is because we are only interested in who knows whom, and we make no assumptions on the strength of these connections, so weights are not needed. In this context self-loops make little sense since the edges represent friendship or acquaintance. Both when considering the influence models in Chapter 2 and the cooperative games in Chapter 4, we assume that the agents represented by the nodes act rationally. From a game theoretical viewpoint this means that players strive to maximise their long term payoff.

1.4 Thesis layout

The thesis is structured as follows: In Chapter 2 we introduce the reader to some basic definitions and facts about graphs and network theory, especially centrality and influence metrics. In Chapter 3 the original and modified algorithms are described and discussed. Chapter 4 introduces a recently developed class of centrality measures based on cooperative game theory. In Chapter 5 some experimental results are reported where comparisons are made between the original and modified algorithms. Their abilities to cloak the targeted node are also tested with respect to discovering this node via the game theoretic centrality analysis introduced in Chapter 4. The results are discussed and commented on in Chapter 6. Finally, in Chapter 7 some ideas for further research are suggested. There are also two appendices: In Appendix A some supplementary theory is provided, and in Appendix B the source code for the implementations in Python is listed.


2 Preliminaries

This chapter will define notions used in the thesis. For a more thorough introduction to graph theory, see for instance [1].

2.1 Graph theory

The following definitions are taken from [2]. Our main objects of study are graphs, which are often called networks in applied contexts.

Definition 2.1. A graph G is a pair (V(G), E(G)) consisting of a set V(G) of vertices (or nodes) and a set E(G) of edges, where each edge connects two distinct vertices and no two vertices are connected by more than one edge. Two vertices are adjacent if they are connected by an edge.

Remark 2.2. We will often say that two adjacent vertices are neighbours. We will also use the terms graph and network interchangeably when suited.

Given a graph, we can take subsets of its vertex and edge sets and form a new graph, called a subgraph. This will be helpful in Chapter 4 when we will play games on such subgraphs (called coalitions in the language of cooperative game theory).

Definition 2.3. A graph G′ is a subgraph of the graph G if V(G′) ⊆ V(G) and E(G′) ⊆ E(G).

One of the most important features of a node in network theory is its degree, which tells us how many friends the person represented by the node has in the network. These friends constitute the node’s neighbourhood.

Definition 2.4. The degree of a vertex v ∈ V (G), denoted d(v), is the number of edges that are incident with v.


Definition 2.5. The neighbourhood of a vertex v, denoted N (v), is the set of vertices adjacent to v.

Two other concepts that are of great importance for us are path and distance which give us information about how many hops are needed to reach a certain node from a given starting node.

Definition 2.6. A path, denoted by P, is a non-empty graph or subgraph of the form P = (V, E), with $V = \{v_0, \dots, v_n\}$ and $E = \{v_0v_1, \dots, v_{n-1}v_n\}$ where all $v_i$ are distinct. We say that P joins the vertices $v_0$ and $v_n$, or that P is a $v_0$–$v_n$ path. The number of edges in a path is called the length of the path.

Definition 2.7. The distance between two vertices u and v, denoted by d(u, v), is the length of the shortest path joining u and v. If u and v are not connected by any path, then d(u, v) is infinite.

Lastly we shall define the adjacency matrix of a network. This is an algebraic representation of a graph which enables us to utilise all tools from linear algebra and matrix analysis and apply them to the study of the network in question.

Definition 2.8. The adjacency matrix A of a graph G is a matrix of the form $(a_{uv})$ where

$$a_{uv} = \begin{cases} 1, & \text{if } u \text{ and } v \text{ are neighbours} \\ 0, & \text{otherwise.} \end{cases}$$

Remark 2.9. In this thesis we will only consider undirected networks, that is, $v_iv_j \in E(G)$ is equivalent to $v_jv_i \in E(G)$, so we let $v_iv_j = v_jv_i$. In the following, we will therefore assume that the adjacency matrix A is symmetric, so that $A = A^T$.

These concepts will hopefully become clear (if they are not already) if we put them in context with an example, so let us do that.

Example 2.10

Figure 2.1: An example of a graph, G, with 7 nodes and 7 edges.

The graph G in Figure 2.1 has the following properties. It is a graph with |V(G)| = 7 nodes and |E(G)| = 7 edges. In Figure 2.2 we see a subgraph G′ with vertex set {v1, v2, v3} = V(G′) ⊂ V(G) and edge set {v1v3, v2v3} = E(G′) ⊂ E(G). The degree of node v5 is d(v5) = 3, its neighbourhood is N(v5) = {v4, v6, v7}.

Figure 2.2: A subgraph, G′, of the graph G.

Figure 2.3: The graph G, with a path (blue dashed edges) from v1 to v7.

In Figure 2.3 we also see a path v1v3v4v5v7, with dashed blue edges, from node v1 to node v7, and since this happens to be the shortest path between these nodes, they have distance d(v1, v7) = 4. (Note that one could for instance have taken the alternative path v1v3v4v5v6v7, but that path has one more edge than the one illustrated, so it is not the shortest path.) Finally, the adjacency matrix for the graph G is

$$A = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 0
\end{pmatrix}.$$
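The properties listed in Example 2.10 can be checked mechanically. The following Python sketch is not taken from the thesis's source code; the edge list is read off the adjacency matrix of G, and the helper names (`adjacency_matrix`, `neighbourhood`, `degree`, `distance`) are chosen here purely for illustration.

```python
from collections import deque

# Edge list of the example graph G (vertices v1..v7 are labelled 1..7).
EDGES = [(1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
N = 7


def adjacency_matrix(edges, n):
    """Return the symmetric 0/1 adjacency matrix as a list of rows."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1
        A[v - 1][u - 1] = 1
    return A


def neighbourhood(A, v):
    """N(v): the set of vertices adjacent to v (1-based labels)."""
    return {j + 1 for j, a in enumerate(A[v - 1]) if a == 1}


def degree(A, v):
    """d(v): the number of edges incident with v."""
    return sum(A[v - 1])


def distance(A, u, v):
    """Length of a shortest u-v path, found by breadth-first search."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in neighbourhood(A, x):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return float("inf")  # u and v are not connected by any path


A = adjacency_matrix(EDGES, N)
print(degree(A, 5), neighbourhood(A, 5), distance(A, 1, 7))
```

Breadth-first search is used for the distance because all edges are unweighted, so the first time a vertex is reached is along a shortest path.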


2.2 Centrality metrics

Indicators of centrality identify the most important vertex in some sense (depending on the measure chosen). In our context this simply means the most important individual with respect to the surrounding social network. It is worth pointing out that there are many other definitions of centralities than those listed here. However, the following are all we need for the task at hand. Let G = (V(G), E(G)) be a graph.

Definition 2.11. The degree centrality of a vertex v is

CD(v) = d(v).

Definition 2.12. The closeness centrality of a vertex v is

$$C_C(v) = \frac{|V(G)|}{\sum_{u} d(u, v)},$$

where d(u, v) is the distance between vertices u and v.

Remark 2.13. Note that since we are only considering (strongly) connected networks, the distance between any two nodes in the network is always finite, and so the denominator in the above definition is finite.

Definition 2.14. The betweenness centrality of a vertex v is

$$C_B(v) = \sum_{s \neq t \neq v \in V(G)} \frac{\sigma_{st}(v)}{\sigma_{st}},$$

where $\sigma_{st}$ is the total number of distinct shortest paths from vertex s to vertex t, and $\sigma_{st}(v)$ is the number of those paths that pass through v.

These three centralities have fairly straightforward interpretations. Degree centrality is simply the number of edges that a vertex is connected to, closeness is the reciprocal of the average shortest distance from a vertex to all other nodes, and betweenness is basically the proportion of all possible shortest paths which also pass through the given vertex.

Example 2.15

Figure 2.4: An example of a network to illustrate the various centrality measures.

Consider the network in Figure 2.4. The degree centralities are

CD(3) = CD(5) = 3

CD(4) = CD(6) = CD(7) = 2

CD(1) = CD(2) = 1.

Thus it might be reasonable to claim that nodes 3 and 5 are the most popular in the network. On the other hand, if we calculate the closeness centralities, we find that

CC(4) = 7/10

CC(3) = CC(5) = 7/11

CC(6) = CC(7) = 7/15

CC(1) = CC(2) = 7/16,

so node 4 could be considered the one which can most efficiently obtain information from every other node. Finally, if we instead are interested in betweenness centrality, we obtain that

CB(3) = 16/3

CB(4) = CB(5) = 13/3

CB(1) = CB(2) = CB(6) = CB(7) = 0,

from which we can say that node 3 is the one that can most frequently control information flow in the network.

As we can see, the nodes are ranked differently depending on what metric is used. The situation for the example network is summarised in Table 2.1.

Rank   CD(v)     CC(v)   CB(v)
1      3, 5      4       3
2      4, 6, 7   3, 5    4, 5
3      1, 2      6, 7    1, 2, 6, 7
4                1, 2

Table 2.1: Node ranking with respect to degree, closeness and betweenness centralities for the example network.
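The degree and closeness columns of the example can be reproduced with a few lines of Python. This is a sketch written for this section, not the thesis implementation: the adjacency dictionary `ADJ` encodes the example network, the function names are illustrative, and exact fractions are used so that the results can be compared directly with the values quoted in Example 2.15.

```python
from collections import deque
from fractions import Fraction

# Example network as {vertex: set of neighbours}.
ADJ = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3, 5},
       5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}


def bfs_distances(adj, source):
    """Distances d(source, u) to every vertex reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def degree_centrality(adj, v):
    """C_D(v) = d(v)."""
    return len(adj[v])


def closeness_centrality(adj, v):
    """C_C(v) = |V(G)| / sum_u d(u, v); assumes a connected network."""
    total = sum(bfs_distances(adj, v).values())
    return Fraction(len(adj), total)


degree = {v: degree_centrality(ADJ, v) for v in ADJ}
closeness = {v: closeness_centrality(ADJ, v) for v in ADJ}
for v in sorted(ADJ):
    print(v, degree[v], closeness[v])
```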

Another very common centrality measure is the eigenvector centrality, defined below, which is based on the eigenvectors of the adjacency matrix. It assigns relative scores to all of the nodes such that a vertex gets a high score if it is connected to other highly-scored vertices. Note how this recursion captures features of the global topology of the network, whereas the other three centrality measures are only influenced by the local topology.

Definition 2.16. The eigenvector centrality of a vertex vi is recursively defined as

$$C_E(i) = \frac{1}{\lambda} \sum_{k} a_{k,i} \, C_E(k),$$

where λ is the largest eigenvalue in magnitude and $a_{i,j}$ are the elements of the adjacency matrix A. In matrix form this can be written as

$$Ax = \lambda x.$$

Then $C_E(i) = x(i)$, the ith element of the eigenvector x.
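One common way to approximate the leading eigenvector is power iteration. The sketch below is an illustration under stated assumptions rather than the method used in the thesis: it iterates with A + I instead of A (a shift chosen here because it leaves the eigenvectors unchanged while avoiding oscillation on bipartite graphs), it normalises by the largest component, and the names `matvec` and `eigenvector_centrality` are hypothetical.

```python
# Adjacency matrix of the example graph from Section 2.1 (vertices 1..7).
A = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0],
]


def matvec(M, x):
    """Matrix-vector product M x for a list-of-rows matrix."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def eigenvector_centrality(A, iterations=1000):
    """Approximate (lambda, x) with A x = lambda x by power iteration."""
    x = [1.0] * len(A)
    for _ in range(iterations):
        Ax = matvec(A, x)
        y = [xi + axi for xi, axi in zip(x, Ax)]  # (A + I) x
        top = max(y)
        x = [yi / top for yi in y]  # rescale so the largest entry is 1
    Ax = matvec(A, x)
    lam = sum(Ax) / sum(x)  # equals lambda once x has converged
    return lam, x


lam, x = eigenvector_centrality(A)
print(round(lam, 4), [round(xi, 3) for xi in x])
```

On this graph the largest score goes to node 5, which sits in the triangle, even though nodes 3 and 5 have the same degree.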

Remark 2.17. By choosing λ to be the largest eigenvalue in magnitude, it follows from the Perron-Frobenius theorem that if the matrix A is irreducible, or equivalently if the graph it represents is connected¹, then the eigenvector solution x is both unique and positive.

¹ For directed graphs the condition is strongly connected. In this thesis directed graphs are not considered.

We also define a new class of centrality measures, the degree mass, which is a variant of degree centrality. This idea was introduced by Li et al. in [7, p. 3].

Definition 2.18. The mth order degree mass of a vertex vi is

$$C_{D^m}(v_i) = \sum_{k=1}^{m+1} \left( A^k u \right)_i = \sum_{j=1}^{N} \left[ \sum_{k=0}^{m} A^k \right]_{ij} d(v_j),$$

where A is the adjacency matrix, $u = (1, 1, \dots, 1)^T$ and N = |V(G)|.

As noted in [7, p. 7], the second order degree mass, that is when m = 2, correlates strongly with the components of the leading eigenvector of the adjacency matrix, and could therefore be a suitable approximation of eigenvector centrality since degree centrality has a linear time complexity, O(|V(G)|). The correlation between CD(v) and CE(v) is further supported both numerically and analytically, via the central limit theorem, in [11]. Given a vertex vi, we have for m = 2 that

$$C_{D^2}(v_i) = \sum_{j=1}^{N} \left[ \sum_{k=0}^{2} A^k \right]_{ij} d(v_j) = \sum_{j=1}^{N} \left[ I + A + A^2 \right]_{ij} d(v_j), \tag{2.1}$$

which can be interpreted as a weighted sum of all degrees.
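To make the two equivalent forms of the degree mass concrete, the following sketch evaluates both on the example graph from Section 2.1. It is an illustration only; the function names are assumptions, and the powers of A are never formed explicitly, since repeated matrix-vector products are enough.

```python
# Adjacency matrix of the example graph from Section 2.1 (vertices 1..7).
A = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0],
]


def matvec(M, x):
    """Matrix-vector product M x for a list-of-rows matrix."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def degree_mass(A, m):
    """First form: sum_{k=1}^{m+1} (A^k u)_i with u = (1, ..., 1)^T."""
    term = [1] * len(A)  # A^0 u
    mass = [0] * len(A)
    for _ in range(m + 1):
        term = matvec(A, term)  # A^k u for k = 1, ..., m + 1
        mass = [a + b for a, b in zip(mass, term)]
    return mass


def degree_mass_weighted(A, m):
    """Second form: sum_j [sum_{k=0}^{m} A^k]_{ij} d(v_j)."""
    term = [sum(row) for row in A]  # A^0 d, the degree vector
    mass = term[:]
    for _ in range(m):
        term = matvec(A, term)  # A^k d for k = 1, ..., m
        mass = [a + b for a, b in zip(mass, term)]
    return mass


print(degree_mass(A, 0))  # order 0 reduces to the plain degrees
print(degree_mass(A, 2))  # second order degree mass
```

The two forms agree because A u is exactly the degree vector, so shifting the summation index by one turns the first sum into the second.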

2.3 Influence metrics

A concept closely related to that of centrality is the influence of a vertex. In our context of social networks, this corresponds to how much influence an individual has on the other individuals in the network. Since our aim is to lower centrality but leave influence mainly unaffected, it is of great importance to have a clear definition of how to measure this property.

When a vertex is sufficiently influenced by its neighbour(s), we say that it becomes active, at which point it starts to influence any neighbour that has not yet become active. In order to initiate the process a set of vertices is activated from the start. This set is called the seed set.

Definition 2.19 (Active set). The subset of vertices that are active at time t is called the active set and is denoted by

I(t) ⊆ V (G), t = 0, 1, . . . .

The active set I(0) is called the seed set.

Active vertices influence inactive vertices differently depending on which influence metric is considered. We will use two metrics, independent cascade and linear threshold.

Definition 2.20 (Independent cascade). To every pair of vertices we assign an activation probability

p : V(G) × V(G) → [0, 1].

Then at every time t ≥ 1, every vertex v ∈ V(G) that became active at time t − 1 activates every inactive neighbour, w ∈ N(v) \ I(t − 1), with probability p(v, w). The process ends when I(t) = I(t − 1) (i.e. when there are no new active vertices).

Definition 2.21 (Linear threshold). To every vertex v ∈ V(G) we assign a threshold value $t_v$ which is sampled from the set {0, . . . , |N(v)|} according to some probability distribution. Then, at every time t ≥ 1, every inactive vertex v becomes active if $|I(t - 1) \cap N(v)| \geq t_v$. The process ends when I(t) = I(t − 1).
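The independent cascade process of Definition 2.20 can be simulated directly. The sketch below is a minimal illustration, not the thesis code: the graph is the example network from Section 2.1, the activation probability is passed in as a function `p(v, w)`, and a seeded `random.Random` instance is an assumption made here so that runs are reproducible.

```python
import random

# Example network from Section 2.1 as {vertex: set of neighbours}.
ADJ = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3, 5},
       5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}


def independent_cascade(adj, seeds, p, rng):
    """Run independent cascade and return the final active set."""
    active = set(seeds)
    frontier = set(seeds)  # vertices that became active in the last step
    while frontier:
        newly_active = set()
        for v in sorted(frontier):
            for w in sorted(adj[v]):
                if w in active or w in newly_active:
                    continue
                # v gets a single chance to activate its neighbour w.
                if rng.random() < p(v, w):
                    newly_active.add(w)
        active |= newly_active
        frontier = newly_active  # the process ends when this is empty
    return active


everyone = independent_cascade(ADJ, {4}, lambda v, w: 1.0, random.Random(0))
nobody = independent_cascade(ADJ, {4}, lambda v, w: 0.0, random.Random(0))
sample = independent_cascade(ADJ, {4}, lambda v, w: 0.5, random.Random(0))
print(everyone, nobody, sample)
```

With probability 1 on every edge the cascade reaches the whole connected network, and with probability 0 it never leaves the seed set; intermediate probabilities give a random outcome between the two.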


Example 2.22

Consider again the network in Figure 2.5, now with added threshold values on each node, and the seed node marked with red. In the first iteration, shown in Figure 2.6, we see that the seed node’s neighbours have thresholds t3 = 0 and t5 = 1, so both get activated. In the next iteration, shown in Figure 2.7, all remaining nodes except node 7 get activated. This is because node 7 needs at least two active neighbours to become activated itself, but at this step only one neighbour, node 5, is currently active. In the third and final iteration, however, node 6 is now newly activated, so node 7 has two active neighbours and thus becomes activated itself. This is illustrated in Figure 2.8. The process stops since the active set of nodes cannot grow any further, i.e. I(4) = I(3). Note however that it is only a coincidence that every node in the network became activated – this is typically not the case, as we shall see later on in Chapter 5.

Figure 2.5: An example of a network to illustrate the linear threshold influence model.

Figure 2.6: The network after one iteration of linear threshold.

Figure 2.7: The network after two iterations of linear threshold.

Figure 2.8: The network after three iterations of linear threshold. In this case all nodes became activated.
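The run described in Example 2.22 can be replayed with a short simulation of the linear threshold model. This is a sketch under explicit assumptions: only t3 = 0, t5 = 1 and the fact that node 7 needs two active neighbours are stated in the example, so the remaining threshold values and the choice of node 4 as seed are picked here so that the simulation follows the narrative. The function name is illustrative.

```python
# Example network from Section 2.1 as {vertex: set of neighbours}.
ADJ = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3, 5},
       5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}

# Thresholds t_v. Values for nodes 1, 2, 4 and 6 are assumptions.
THRESHOLDS = {1: 1, 2: 1, 3: 0, 4: 0, 5: 1, 6: 1, 7: 2}


def linear_threshold(adj, thresholds, seeds):
    """Return the list of active sets [I(0), I(1), ...] until no change."""
    history = [set(seeds)]
    while True:
        current = history[-1]
        # An inactive vertex activates once enough neighbours are active.
        newly_active = {
            v for v in adj
            if v not in current and len(adj[v] & current) >= thresholds[v]
        }
        if not newly_active:
            return history
        history.append(current | newly_active)


history = linear_threshold(ADJ, THRESHOLDS, seeds={4})
for t, active in enumerate(history):
    print(t, sorted(active))
```

The returned list stops at the last set that differs from its predecessor, so a further step would leave the active set unchanged, which is the stopping condition I(t) = I(t − 1).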


3 The ROAM heuristic

In this chapter we will describe a heuristic developed by Waniek et al. [14, pp. 3-5] called ROAM (Remove One, Add Many), which is used for concealing a person while maintaining their influence over the community to which the person belongs. Given a target vertex (typically a leader of the community), the idea is to disconnect the target from one of its neighbours and then connect this neighbour to some of the target’s other friends, where the number of edge modifications allowed is restricted. A modified version of ROAM adapted for the eigenvector centrality measure will also be suggested.

3.1 Original ROAM

Let G be a network with target vertex v† = arg max_{v∈V(G)} CD(v), and let b ∈ {1, 2, . . . } be the number of edges that we allow to be removed or added. We will henceforth refer to b as the budget of ROAM.

Algorithm 1 The ROAM heuristic
1: Input: G, b, v†
2: Output: A network G′
3: r ← |N(v†)| − 1
4: S ← {v0, v1, . . . , vr} such that CD(v0) ≥ CD(v1) ≥ · · · ≥ CD(vr), for all vi ∈ N(v†), i = 0, . . . , r
5: if r ≥ b − 1 then
6:     E(G′) ← (E(G) \ {v†v0}) ∪ {v0v_{r−(b−2)}, v0v_{r−(b−3)}, . . . , v0v_r}
7: else
8:     E(G′) ← (E(G) \ {v†v0}) ∪ {v0v1, v0v2, . . . , v0vr}
9: return G′ = (V(G), E(G′))

There are basically two critical steps in ROAM. The first is when the edge between the target vertex v† and the neighbour v0 is removed. Note that this removal can only decrease the degree of v†. It also decreases the closeness and betweenness of v† since it removes any shortest path between v† and any other vertex which runs through this edge. On the downside, v† no longer has any direct influence over v0. This is remedied in the second step, where v0 is connected to the b − 1 other friends of v† with the lowest degrees. These new, indirect paths between v† and v0 compensate for the lost direct path. This step does not affect the degree of v†, but it does of course increase the degrees of some of its neighbours, which further disguises the importance of v†. The closeness and betweenness of v† cannot increase either. To see this, suppose that vi ∈ N(v†) with vi ≠ v0. Then any path containing the edges v†vi and viv0 is longer than the direct path v†v0 which was removed in step one, and this does not increase the percentage of shortest paths going through v†.

Thus the overall effect of the heuristic is that the degree, closeness and betweenness centralities of v† are reduced while most of its influence over N(v†) is preserved.
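The two steps of ROAM translate into a few lines of Python. The sketch below is an illustrative reading of the heuristic as described in this section, not the implementation from Appendix B: the graph is the example network from Section 2.1 (the network of Figure 3.1 cannot be recovered from the text), ties between equal degrees are broken by vertex label, which is an arbitrary choice made here, and the function name `roam` is an assumption.

```python
# Example network from Section 2.1 as {vertex: set of neighbours}.
ADJ = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3, 5},
       5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}


def roam(adj, target, budget):
    """Apply one round of ROAM and return a new adjacency dictionary."""
    new_adj = {v: set(nbrs) for v, nbrs in adj.items()}
    # Neighbours of the target by decreasing degree; ties broken by label.
    ranked = sorted(adj[target], key=lambda v: (-len(adj[v]), v))
    if not ranked:
        return new_adj  # an isolated target has nothing to disconnect
    v0, others = ranked[0], ranked[1:]
    # Step 1: remove the edge between the target and v0.
    new_adj[target].discard(v0)
    new_adj[v0].discard(target)
    # Step 2: connect v0 to the (budget - 1) other neighbours of the
    # target with the lowest degrees, or to all of them if there are fewer.
    k = max(0, min(budget - 1, len(others)))
    for v in others[len(others) - k:]:
        new_adj[v0].add(v)
        new_adj[v].add(v0)
    return new_adj


hidden = roam(ADJ, target=3, budget=3)
print(sorted(hidden[3]), sorted(hidden[4]))
```

With node 3 as target and budget 3, the edge to node 4 is removed and node 4 is connected to nodes 1 and 2 instead, so node 3 loses one degree but still reaches node 4 in two hops.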

Example 3.1

Consider the network in Figure 3.1(a) and suppose that our budget is b = 4. In this case we assume that v† is the target because it has the highest degree centrality, which suggests that it might be a leader in this social community. First we locate its neighbours with the highest degree centrality (note that there may be several vertices with the same maximum degree). They are v0, v1, v2 and v4, all with degree 3. We choose one of them, say v0, and remove the edge between v† and v0.

Now we shall connect v0 to the b − 1 = 3 nodes in N(v†) with lowest degree centralities. We have N(v†) \ {v0} = {v1, v2, v3, v4, v5} with CD(v5) = 1, CD(v3) = 2 and CD(vi) = 3 for i = 1, 2, 4. When deciding which of these nodes to connect to v0, we naturally include v5 and v3 since they have the lowest degrees of all the considered neighbours, and we pick one of {v1, v2, v4} arbitrarily, say v4.

Repeating the entire procedure results in the network shown in Figure 3.1(c).

Figure 3.1: The original ROAM heuristic with budget b = 4 applied to a small network where the dotted red edge is removed, and the dashed green edges are added. (a) The original network. (b) The network after one round of ROAM. (c) The network after two rounds of ROAM.

3.2 Modified ROAM

In this modified version, we would like to somehow replace the degree centrality CD(v) with the eigenvector centrality CE(v). Here is where the degree mass comes in, since it is highly correlated with the eigenvector centrality as we discussed earlier. Consequently we expect that if CD(vi) is replaced with CD2(vi), then the ROAM algorithm would lower CE(vi) as well. Thus we propose the following modification:

Algorithm 2 The modified ROAM heuristic
1: Input: G, b, v†
2: Output: A network G′
3: r ← |N(v†)| − 1
4: S ← {v0, v1, . . . , vr} such that CD2(v0) ≥ CD2(v1) ≥ · · · ≥ CD2(vr), for all vi ∈ N(v†), i = 0, . . . , r
5: if r ≥ b − 1 then
6:     E(G′) ← (E(G) \ {v†v0}) ∪ {v0v_{r−(b−2)}, v0v_{r−(b−3)}, . . . , v0v_r}
7: else
8:     E(G′) ← (E(G) \ {v†v0}) ∪ {v0v1, v0v2, . . . , v0vr}
9: return G′ = (V(G), E(G′))

In view of Equation (2.1), the only practical difference from the original ROAM in terms of calculations is the extra cost of calculating A², which has a time complexity of O(|V(G)|²). While this is a relatively expensive operation, it is only performed once. Thus it should not affect the computational time severely. It might, however, put greater demands on the amount of data that has to be kept in memory.
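The modification only changes the quantity by which the target's neighbours are ranked. The following sketch illustrates this under the same assumptions as before: it uses the example network from Section 2.1, breaks ties by vertex label, and the names `degree_mass` and `roam_eig` are chosen here and are not taken from Appendix B. The degree mass is accumulated by summing over neighbours, so the matrix A² is never stored.

```python
# Example network from Section 2.1 as {vertex: set of neighbours}.
ADJ = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3, 5},
       5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}


def degree_mass(adj, m=2):
    """m-th order degree mass, i.e. the entries of (I + A + ... + A^m) d."""
    term = {v: len(adj[v]) for v in adj}  # the degree vector d
    mass = dict(term)
    for _ in range(m):
        # Summing a vector over neighbours is multiplication by A.
        term = {v: sum(term[w] for w in adj[v]) for v in adj}
        mass = {v: mass[v] + term[v] for v in adj}
    return mass


def roam_eig(adj, target, budget, m=2):
    """One round of ROAM with neighbours ranked by degree mass."""
    mass = degree_mass(adj, m)  # computed once per round
    new_adj = {v: set(nbrs) for v, nbrs in adj.items()}
    ranked = sorted(adj[target], key=lambda v: (-mass[v], v))
    if not ranked:
        return new_adj
    v0, others = ranked[0], ranked[1:]
    new_adj[target].discard(v0)
    new_adj[v0].discard(target)
    k = max(0, min(budget - 1, len(others)))
    for v in others[len(others) - k:]:
        new_adj[v0].add(v)
        new_adj[v].add(v0)
    return new_adj


print(degree_mass(ADJ))
print(roam_eig(ADJ, target=3, budget=3))
```

On this small graph both rankings pick the same neighbour of node 3, so the two heuristics coincide; differences can only appear when the degree and degree mass orderings of the neighbours disagree.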


4 A game theoretic approach

So far we have seen how an individual can evade conventional social network analysis with respect to the most common centrality metrics. One must therefore ask if it is still possible to detect these hidden individuals by other means. In this chapter we will discuss one such approach based on cooperative game theory, the study of games in which the participants form coalitions with each other, and compete against other coalitions in the same network. The cooperation is enforced on the players in the form of payoffs, so that certain alliances are more highly rewarded than others. We will first introduce a rather general concept, and then consider specialised methods applicable to social networks.

4.1 Cooperative games

Let G be a network. In this context the vertices in V(G) are called players, and the set of players is denoted by V. A subset S ⊆ V of the set of players is called a coalition.

Definition 4.1 (Characteristic function). Let V be a set of players. The characteristic function, denoted νG(S), is a function

$$\nu : 2^V \to \mathbb{R}$$

which assigns a payoff to every coalition S ⊆ V. The empty set, ∅, has the payoff value ν(∅) = 0.

A cooperative game is a pair (V, ν) of a set of players and a characteristic function, and a solution to such a game is a strategy to divide ν(V) – the payoff from cooperation – among the players. One such strategy is the Shapley value, introduced in 1953 by Lloyd Shapley in [12].


Definition 4.2. The Shapley value for a player i ∈ V is defined by

$$SV_i(\nu) = \sum_{S \subseteq V \setminus \{i\}} \xi_S \left( \nu(S \cup \{i\}) - \nu(S) \right),$$

for some characteristic function ν, where

$$\xi_S = \frac{|S|! \, (|V| - |S| - 1)!}{|V|!}.$$

This can be interpreted as a weighted average marginal contribution of a player to every coalition S this player could belong to. That is,

$$SV_i(\nu) = \sum_{\text{coalitions excluding } i} \frac{\text{marginal contribution of } i \text{ to coalition}}{\text{number of coalitions of this size, excluding } i},$$

divided by the number of players in the network.

The Shapley value is important since it is the unique strategy which exhibits the following four desirable properties:

(i) The value of a coalition, ν(V), equals the sum of the payoffs of the players in that coalition, so the strategy is efficient

(ii) Symmetric players obtain symmetric payoffs¹

(iii) Agents not contributing to the game obtain no payoff, and

(iv) The division scheme is additive²
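For small networks the Shapley value can be evaluated directly from Definition 4.2 by enumerating all coalitions. The sketch below is a brute-force illustration and not the method of Appendix A: the payoff function `influence` follows the informal description in Section 1.2 (members of the coalition plus vertices reachable by one hop) and is an assumption about the game, the graph is the example network from Section 2.1, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

# Example network from Section 2.1 as {vertex: set of neighbours}.
ADJ = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3, 5},
       5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}


def influence(coalition):
    """Assumed payoff: coalition members plus vertices one hop away."""
    reached = set(coalition)
    for v in coalition:
        reached |= ADJ[v]
    return len(reached)


def shapley_values(players, nu):
    """Evaluate SV_i(nu) for every player by summing over all coalitions."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = Fraction(0)
        for size in range(n):
            # xi_S depends only on the size of the coalition S.
            weight = Fraction(factorial(size) * factorial(n - size - 1),
                              factorial(n))
            for S in combinations(others, size):
                coalition = set(S)
                total += weight * (nu(coalition | {i}) - nu(coalition))
        values[i] = total
    return values


sv = shapley_values(sorted(ADJ), influence)
for player, value in sv.items():
    print(player, value)
```

The enumeration grows exponentially with the number of players, so it is only practical for toy examples such as this one; it is nevertheless a convenient way to check the efficiency property, since the values must add up to ν(V).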

We want to restrict these games to networks, but we still want to allow coalitions to be formed by sets of disjoint, connected components of the network. Such coalitions are said to be disconnected. In order to accomplish this, Myerson [10] considered a characteristic function defined over both connected and disconnected coalitions, which makes use of the notion of induced subgraphs.

Definition 4.3. Let G(V, E) be a graph, and consider a subset of vertices V′ ⊆ V. The induced subgraph G[V′] is the graph whose vertex set is V′ and whose edge set, E′, consists of all edges in E such that both endpoints are vertices in V′.

Example 4.4

In Figure 4.1(b) we see the subgraph of (a) induced by the vertex subset {2, 3, 4, 5}.

¹ This simply means that players that are strategically equivalent in the game receive the same payoff – they are symmetric.

² In other words, a transfer of an amount x from one player to another decreases the first player’s utility by x units, and increases the second player’s utility by x units. Consequently the players’ individual payoffs add up to the payoff of the coalition they are members of – the division scheme is additive.

Figure 4.1: A graph G and an induced subgraph G[{2, 3, 4, 5}]. (a) A graph G. (b) The induced subgraph G[{2, 3, 4, 5}].

Now we are ready to talk about the characteristic function considered by Myerson, which is defined by

$$\nu_G^M(S) = \begin{cases} \nu(S) & \text{if } S \in C(G), \\ \sum_{K_i \in K(S)} \nu(K_i) & \text{otherwise,} \end{cases}$$

where M stands for Myerson³, C(G) is the set of all connected induced subgraphs of G, and $K(S) = \{K_1, K_2, \ldots, K_m\}$ is the set of connected components of coalition S if it is disconnected. This means that the payoff of a disconnected coalition is the sum of the payoffs of its constituent components. Note that $\nu_G^M$ is defined over all $2^{|V|}$ coalitions, so the Shapley value can be applied. This is not the case with the simpler characteristic function $\nu_G : C(G) \to \mathbb{R}$, defined by $\nu_G(S) = \nu(S)$ for all coalitions $S \in C(G)$, which $\nu_G^M$ is a generalisation of. This is of course due to the fact that we only consider connected coalitions in the simpler function. However, Myerson famously proposed a solution for this obstacle in [10], namely the following. First, we define the Myerson value.

Definition 4.5. The Myerson value for a player $i \in V$ is defined by

$$MV_i(\nu_G) = SV_i(\nu_G^M),$$

where $SV_i(\cdot)$ is the Shapley value.

Now, let each player $v_i$ in coalition V be assigned the payoff $MV_i(\nu_G)$, i.e. the Myerson value. Then we obtain the unique payoff division scheme that is characterised by two axioms: efficiency by components (the payoff vectors are efficient⁴ in each maximal connected coalition for each network) and fairness (the loss of one bilateral communication implies the same loss of payment for the players involved in this edge).
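To make the construction concrete, the sketch below extends a characteristic function to disconnected coalitions by summing over connected components, and then applies the Shapley value to the extension. It is a minimal illustration under stated assumptions: the path graph 1 - 2 - 3, the function ν(S) = |S|², and all function names are hypothetical and chosen only for this example.

```python
from itertools import combinations
from math import factorial


def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    nodes, seen, comps = set(nodes), set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(w for w in adj[u] if w in nodes and w not in comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps


def myerson_extension(value, adj):
    """Extension of `value` that sums it over the components of a coalition."""
    return lambda s: sum(value(k) for k in components(s, adj))


def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions."""
    n = len(players)
    out = {}
    for i in players:
        others = [p for p in players if p != i]
        out[i] = sum(
            factorial(k) * factorial(n - k - 1) / factorial(n)
            * (value(frozenset(c) | {i}) - value(frozenset(c)))
            for k in range(n)
            for c in combinations(others, k)
        )
    return out


# Hypothetical example: the path 1 - 2 - 3 with nu(S) = |S|^2.
adj = {1: {2}, 2: {1, 3}, 3: {2}}
nu = lambda s: len(s) ** 2
mv = shapley_values([1, 2, 3], myerson_extension(nu, adj))
print(mv)
```

The coalition {1, 3} is disconnected, so its extended value is ν({1}) + ν({3}) = 2 rather than 4. As a result the middle node, which is needed for the end nodes to cooperate, receives the largest Myerson value.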

³ This is just a notation in order to differentiate it from other characteristic functions; it has no special technical meaning.

4.2 One game to consider

A reasonable game for our purpose, finding hidden individuals with strong influence, must be one that captures the social influence. Let us consider the following game, denoted by $g_1$ and suggested by Suri and Narahari [13], in which the value of a group of players is the number of players within and adjacent to the group. It is intuitively clear that the Shapley value of players in this game defines a centrality measure that is qualitatively better at capturing the influence property than the ones we studied in earlier chapters. Let us illustrate this argument with a small example network.

Example 4.6

Consider the network in Figure 4.2. While nodes $v_1$ and $v_5$ both have degree $C_D(v_1) = C_D(v_5) = 3$, node $v_5$ will receive a higher Myerson (or Shapley) value in the considered game since it is the only player that can influence node $v_6$. Therefore, with respect to influence, it seems reasonable to rank $v_5$ higher than $v_1$. This interesting feature, which is obvious in this small example, would not have been captured in the traditional centrality rankings.

Figure 4.2: An example of a network (nodes $v_1$ to $v_6$) to illustrate the game $g_1$.

A mathematical deduction of an algorithm for calculating the Shapley values for this game is given in Appendix A.
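The game can also be evaluated by brute force on a small network. The sketch below is an illustration, not the thesis implementation: the six-node adjacency `adj` is a hypothetical network built so that nodes 1 and 5 both have degree 3 and node 5 is the only neighbour of node 6; it is not claimed to be the network of Figure 4.2. The closed form used for comparison, the sum of 1/(1 + d(v_j)) over a node and its neighbours, is the standard consequence of the probability derived in Appendix A.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical six-node network (an assumption, not read off Figure 4.2).
adj = {
    1: {2, 3, 5},
    2: {1, 3},
    3: {1, 2},
    4: {5},
    5: {1, 4, 6},
    6: {5},
}


def g1(coalition):
    """Number of players within or adjacent to the coalition."""
    reached = set(coalition)
    for v in coalition:
        reached |= adj[v]
    return len(reached)


def shapley_by_permutations(nodes, value):
    """Average marginal contribution over all orderings of the players."""
    totals = {v: Fraction(0) for v in nodes}
    count = 0
    for order in permutations(nodes):
        count += 1
        before = set()
        for v in order:
            totals[v] += value(before | {v}) - value(before)
            before.add(v)
    return {v: t / count for v, t in totals.items()}


sv = shapley_by_permutations(sorted(adj), g1)

# Closed form: one term 1 / (1 + degree) for the node itself and each neighbour.
closed_form = {
    v: sum(Fraction(1, 1 + len(adj[u])) for u in adj[v] | {v}) for v in adj
}
print(sv[1], sv[5])  # 7/6 3/2
```

Although nodes 1 and 5 have the same degree, node 5 obtains the higher value because it is the sole contact of two low-degree nodes, which matches the argument of Example 4.6.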

5 Experiment and results

5.1 Data sets

The algorithms are tested on several different networks, all simple, undirected and without self-loops. The first two are real networks:

• Covert organizations: This category is represented by the terrorist network responsible for the attacks on the World Trade Center (WTC), September 11, 2001 [4].

• Social networks: We studied an anonymized fragment of Facebook, taken from SNAP – the Stanford Network Analysis Platform [6].

We also study randomly-generated networks:

• Two scale-free networks based on the Barabási-Albert (BA) model (for a technical explanation of this model, see Appendix A.2).

The following table contains basic data about the four networks.

Network            Nodes   Edges
Facebook           333     2519
WTC                60      124
Small scale-free   200     1945
Large scale-free   1000    9945

5.2 Experiments with ROAM and modified ROAM

From the results in Waniek et al. [14], it is clear that a higher budget b gives better results in general. However, for b > 4 the improvement seems to be negligible. For the purpose of comparing the original ROAM with the modified variant (hereinafter called ROAMeig) we fix the budget to b = 3, and the heuristics with this budget will be denoted by ROAM(3) and ROAMeig(3), respectively. This value is chosen since it is large enough to show interesting features of the heuristics, and is thus suitable for comparisons. Figures 5.1–5.4 below show how a chosen target node (in these examples the target node is the one with highest degree centrality, picked arbitrarily if there are several candidates¹) is affected with respect to betweenness, closeness and eigenvector centralities when ROAM(3) and ROAMeig(3) are run 8 consecutive times for each network. In Figures 5.5–5.6 the Shapley values for the respective target nodes are shown for the four networks (one network per subfigure), together with how they are affected by applying ROAM(3) and ROAMeig(3).

Figure 5.1: ROAM(3) and ROAMeig(3) run on the Facebook network. [Plots omitted: betweenness, closeness and eigenvector centrality of the target node versus iteration (1 to 8), for ROAM(3) and ROAMeig(3).]

Figure 5.2: ROAM(3) and ROAMeig(3) run on the WTC terrorist network. Mohamed Atta (node 26) was the target node. [Plots omitted: betweenness, closeness and eigenvector centrality of the target node versus iteration (1 to 8), for ROAM(3) and ROAMeig(3).]

¹ The one exception is the WTC terrorist network, where we know that node 26, representing Mohamed Atta, was one of the ringleaders and is therefore a priori considered the target node in this network.

Figure 5.3: ROAM(3) and ROAMeig(3) run on the smaller scale-free network. [Plots omitted: betweenness, closeness and eigenvector centrality of the target node versus iteration (1 to 8), for ROAM(3) and ROAMeig(3).]

Figure 5.4: ROAM(3) and ROAMeig(3) run on the larger scale-free network. [Plots omitted: betweenness, closeness and eigenvector centrality of the target node versus iteration (1 to 8), for ROAM(3) and ROAMeig(3).]

Figure 5.5: ROAM(3) and ROAMeig(3) run on the Facebook network (left) and the WTC terrorist network (right). [Plots omitted: Shapley value of the target node versus iteration (1 to 8), for ROAM(3) and ROAMeig(3).]

Figure 5.6: ROAM(3) and ROAMeig(3) run on the smaller scale-free network (left) and the larger scale-free network (right). [Plots omitted: Shapley value of the target node versus iteration (1 to 8), for ROAM(3) and ROAMeig(3).]

Tables 5.1–5.4 below show the number of nodes that became activated, first within the original networks, and then after three iterations of ROAM(3) and ROAMeig(3), according to the two influence models we are considering: linear threshold and independent cascade. Figures 5.7–5.10 illustrate the process for the WTC network with Mohamed Atta as the seed node. This particular network was chosen for the illustration since it is small enough to show a suitable level of detail. Each graph is the average of 10 runs of the respective model. The left graph in each pair shows the cascade before ROAM(3) or ROAMeig(3), and the right graph shows the cascade after the heuristics have been applied three consecutive times.

Influence model       Before   3×ROAM(3)   3×ROAMeig(3)
Linear threshold      42       42          57
Independent cascade   23       26          31

Table 5.1: Number of activated nodes in the WTC terrorist network before and after applying three rounds of ROAM(3) and ROAMeig(3), respectively.

Table 5.2: Number of activated nodes in the Facebook network before and after applying three rounds of ROAM(3) and ROAMeig(3), respectively.

Table 5.3: Number of activated nodes in the smaller scale-free network before and after applying three rounds of ROAM(3) and ROAMeig(3), respectively.

Table 5.4: Number of activated nodes in the larger scale-free network before and after applying three rounds of ROAM(3) and ROAMeig(3), respectively.

Figure 5.7: Left: Linear threshold model of influence for the terrorist network responsible for the 9/11 WTC attacks with Mohamed Atta (node 26, in red) as seed node. Right: The same model and network, but after 3 rounds of ROAM(3).

Figure 5.8: Left: Linear threshold model of influence for the terrorist network responsible for the 9/11 WTC attacks with Mohamed Atta (node 26, in red) as seed node. Right: The same model and network, but after 3 rounds of ROAMeig(3).

Figure 5.9: Left: Independent cascade model of influence for the terrorist network responsible for the 9/11 WTC attacks with Mohamed Atta (node 26, in red) as seed node. Right: The same model and network, but after 3 rounds of ROAM(3).

Figure 5.10: Left: Independent cascade model of influence for the terrorist network responsible for the 9/11 WTC attacks with Mohamed Atta (node 26, in red) as seed node. Right: The same model and network, but after 3 rounds of ROAMeig(3).

6 Discussion and conclusions

From the centrality value plots (Figures 5.1–5.4) it is clear that in general the ROAMeig algorithm performed just as well as the original ROAM, and for the WTC network it even seems to perform somewhat better than ROAM. There is no obvious reason why this should be the case (except for the eigenvector centrality); it might just be a coincidence. Alternatively, there is something about the topology of that particular network which plays a role here, but a more detailed analysis is needed before any conclusions can be drawn, and that lies outside the scope of this thesis. The general similarity in performance of the two heuristics is arguably best explained by the fact that they are very similarly constructed: the only difference is how the target node's b − 1 friends are chosen. The main improvement is with respect to eigenvector centrality, which is evident from the figures above. This should not be particularly surprising since ROAMeig was constructed with this very purpose in mind. The other decreased centrality values achieved by running ROAMeig can be explained by the same reasoning as in Section 3.1, since the two heuristics work very similarly in practice.

Regarding the influence, it is somewhat surprising that both models show an improved level of influence (in terms of number of successfully activated nodes). The increase of influence with respect to the linear threshold model is consistent with the results from ROAM in [14]. The corresponding increase with respect to the independent cascade model could at least partly be explained by how influence was measured: simply counting the average number of activated nodes after some number of iterations is a crude measure. Perhaps more accurate results could be obtained by considering how likely it is for a given node to become activated given an initial seed node, that is, its global activation probability. The influence of the target node could then be some linear combination of these probabilities, e.g. their sum, or a weighted sum if one wishes to capture different strengths of the edges as well.
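One way to make the idea of a global activation probability concrete is a Monte Carlo estimate: run the independent cascade model many times and record how often each node becomes activated. The sketch below is only an illustration of that suggestion. The four-node graph, the number of runs and the function names are hypothetical; the activation probability 0.4 is the uniform value used in indepCascade.py in Appendix B.

```python
import random
from collections import deque


def independent_cascade(adj, seed, prob, rng):
    """One run of the independent cascade model; returns the activated set."""
    activated = {seed}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            # each newly activated node gets one attempt per inactive neighbour
            if w not in activated and rng.random() <= prob:
                activated.add(w)
                queue.append(w)
    return activated


def activation_probabilities(adj, seed, prob, runs, rng):
    """Fraction of runs in which each node ends up activated."""
    counts = {v: 0 for v in adj}
    for _ in range(runs):
        for v in independent_cascade(adj, seed, prob, rng):
            counts[v] += 1
    return {v: c / runs for v, c in counts.items()}


# Hypothetical four-node cycle 1 - 2 - 4 - 3 - 1.
adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
rng = random.Random(0)
probs = activation_probabilities(adj, seed=1, prob=0.4, runs=2000, rng=rng)

# Influence score of the seed as the plain sum of activation probabilities.
influence = sum(probs.values())
print(probs, influence)
```

A weighted sum could be obtained by multiplying each probability by a node- or edge-specific weight before summing; the plain sum equals the expected number of activated nodes.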

Finally, the Shapley values for the game described in Section 4.2 decreased very little as ROAM and ROAMeig were run several consecutive times. This is expected since the Shapley values are designed to capture the influence of each node, and for our target nodes we saw how that increased with both heuristics. Thus, as a centrality measure, the Shapley value is robust against attempts to evade social network analysis by using ROAM and ROAMeig. It is worth noting that in this thesis a rather simple assumption is made on the sphere of influence of a coalition (as used in the cooperative game), namely that it consists of the members of the coalition themselves as well as any nodes reachable by at most one hop from a coalition member. This model could be developed further by e.g. considering only the nodes in the coalition and those which are reachable in at least k different ways from member nodes, which arguably is a more sophisticated assumption.
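The two spheres of influence mentioned above can be written as one characteristic function with a parameter k. The sketch below is an interpretation, not something specified in the thesis: "reachable in at least k different ways" is read here as "having at least k neighbours inside the coalition", and the adjacency `adj` and the name `sphere_value` are hypothetical.

```python
# Hypothetical six-node network used only for illustration.
adj = {
    1: {2, 3, 5},
    2: {1, 3},
    3: {1, 2},
    4: {5},
    5: {1, 4, 6},
    6: {5},
}


def sphere_value(adj, coalition, k=1):
    """Coalition members plus outside nodes with at least k neighbours inside."""
    members = set(coalition)
    fringe = {
        v for v in adj
        if v not in members and len(adj[v] & members) >= k
    }
    return len(members) + len(fringe)


print(sphere_value(adj, {1}, k=1))     # 4: node 1 and its three neighbours
print(sphere_value(adj, {2, 3}, k=2))  # 3: node 1 is adjacent to both members
```

With k = 1 this is the one-hop assumption used in the thesis; larger k makes it harder for a coalition to claim nodes on its fringe.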

The main conclusions to be drawn from the experiments are the following. Firstly, the ROAMeig heuristic seems to be at least as good as ROAM, and outperforms it with respect to eigenvector centrality, which it was designed for. Secondly, both algorithms increased the influence of the target node when considering the crude measurement of counting the number of successfully activated nodes. Finally, both heuristics are easily countered by considering the Shapley value for the game described in the thesis, and therefore we propose that the tools and methods from cooperative game theory deserve more attention when it comes to unveiling covert networks which we assume are actively trying to conceal themselves by using heuristics similar to ROAM and ROAMeig.

7 Further research

The questions asked in this thesis open up several possible explorations not covered here. What follows are some suggestions for further consideration.

• Is it possible to achieve similar results if we instead of individuals are trying to hide/find entire communities? Of course a more sophisticated heuristic has to be developed. In Waniek et al. [14] an algorithm, DICE, is suggested for this purpose. Perhaps this idea could be developed further, e.g. by considering game theoretic centrality measures as was done in this thesis.

• There are other more direct ways to modify eigenvectors during perturbations of the network. Could an algorithm based on these methods be developed with even better performance (in terms of complexity as well as achieved effect)?

• One could consider other spheres of influence – e.g. only neighbours that are reachable in k different ways. These might more realistically reflect influence dynamics in a social network.

• How well do the heuristics examined in this thesis perform if weighted networks are considered (where the weights can represent e.g. trust or strength of friendship)? The influence metrics would of course have to be modified in order to capture this extra structure.

• A better method of assigning influence score via the influence models could perhaps reveal more realistically how the edge modifications affect how information and ideas flow from the target node and outwards in the network. (In particular one expects that the influence modelled by independent cascade should decrease with each iteration of ROAM or ROAMeig.)

A Supplementary theory

A.1 Calculating the Shapley values for the game

We want to calculate the Shapley value of all players for the game discussed in Section 4.2. To find the value of a node $v_i$, we consider all possible permutations of the nodes in which $v_i$ would make a positive contribution to the coalition of nodes occurring before itself. Let the set of nodes occurring before node $v_i$ in a random permutation of nodes be denoted by $C_i$. As before we let $N(v_i)$ and $d(v_i)$ denote the neighbourhood and degree of $v_i$, respectively.

The key question to ask is: What is the necessary and sufficient condition for node $v_i$ to marginally contribute node $v_j \in N(v_i) \cup \{v_i\}$ to fringe($C_i$)? This clearly happens if and only if neither $v_j$ nor any of its neighbours are present in $C_i$, which formally translates to

$$\bigl(N(v_j) \cup \{v_j\}\bigr) \cap C_i = \emptyset.$$

We will now prove that this condition holds with probability $\frac{1}{1 + d(v_j)}$. The following proposition with proof is taken from [8].

Proposition A.1. The probability that in a random permutation none of the vertices from $N(v_j) \cup \{v_j\}$ occurs before $v_i$, where $v_j$ and $v_i$ are neighbours, is $\frac{1}{1 + d(v_j)}$.

Proof: We need to count the number of permutations π that satisfy

$$\pi(v_i) < \pi(v) \quad \forall\, v \in \bigl(N(v_j) \cup \{v_j\}\bigr) \setminus \{v_i\}, \tag{A.1}$$

i.e. permutations in which $v_i$ is mapped to the first position among the vertices in the considered set. To this end, choose $|N(v_j) \cup \{v_j\}|$ positions in the sequence of elements from V(G). This can be done in $\binom{|V|}{1 + d(v_j)}$ ways. In the last $d(v_j)$ chosen positions, place all elements from $\bigl(N(v_j) \cup \{v_j\}\bigr) \setminus \{v_i\}$. Directly before these, place the element $v_i$. The number of such line-ups is $(d(v_j))!$, and the remaining elements can now be arranged in $\bigl(|V| - (1 + d(v_j))\bigr)!$ different ways. Altogether, the number of permutations satisfying condition (A.1) is

$$\binom{|V|}{1 + d(v_j)} \, (d(v_j))! \, \bigl(|V| - (1 + d(v_j))\bigr)! = \frac{|V|!}{1 + d(v_j)},$$

implying that the probability of randomly choosing one such permutation is

$$\frac{1}{|V|!} \cdot \frac{|V|!}{1 + d(v_j)} = \frac{1}{1 + d(v_j)}.$$
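The counting argument can be confirmed by exhaustive enumeration on a small graph. The following sketch is illustrative only: the four-node adjacency `adj` and the function name `prob_first` are hypothetical. It counts the permutations in which $v_i$ precedes every other vertex of $N(v_j) \cup \{v_j\}$ and compares the resulting fraction with $1/(1 + d(v_j))$.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical graph: a triangle 1-2-3 with a pendant node 4 attached to 2.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}


def prob_first(adj, vi, vj):
    """Fraction of permutations in which vi precedes the rest of N(vj) + {vj}."""
    group = (adj[vj] | {vj}) - {vi}
    nodes = sorted(adj)
    good = total = 0
    for order in permutations(nodes):
        position = {v: k for k, v in enumerate(order)}
        total += 1
        good += all(position[vi] < position[v] for v in group)
    return Fraction(good, total)


# Node 1 is a neighbour of node 2, which has degree 3.
print(prob_first(adj, 1, 2))  # 1/4
```

Summing this probability over $v_j \in N(v_i) \cup \{v_i\}$ gives the expected number of nodes that $v_i$ adds to the coalition preceding it, which is its Shapley value in the game of Section 4.2.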

A.2 The Barabási-Albert (BA) model

This algorithm is used for constructing random scale-free¹ networks. Among many other applications, BA networks are thought to be good approximations of real-world social networks because of their scale-freeness. The BA networks are constructed as follows. An initial connected network of $m_0$ nodes is given. In each iteration, a new node is added and connected to $m \le m_0$ existing nodes. The probability $p_i$ that the new node will be connected to node i is

$$p_i = \frac{d(i)}{\sum_j d(j)},$$

where d(i) is the degree of node i, and the sum is taken over all pre-existing nodes. Thus it is more likely that the new node gets connected to a node with a high degree rather than one with a low degree. This results in a network where there are a few nodes with a large number of neighbours, while most nodes will have few neighbours.
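The construction can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the generator used for the experiments: the seed network is assumed to be a complete graph with $m_0 = m$ nodes, and the m distinct targets are drawn by repeated degree-proportional sampling, which only approximates $p_i$ once duplicates are rejected. With m = 10 the edge counts happen to coincide with those listed in Section 5.1 (1945 edges for 200 nodes and 9945 for 1000 nodes), which suggests, but does not confirm, that similar parameters were used.

```python
import random


def barabasi_albert(n, m, rng):
    """Grow a BA graph from a complete seed graph on m >= 2 nodes."""
    edges = set()
    # `pool` lists every node once per incident edge, so a uniform draw
    # from it selects node i with probability d(i) / sum_j d(j).
    pool = []
    for u in range(m):
        for v in range(u + 1, m):
            edges.add((u, v))
            pool += [u, v]
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:  # reject duplicates to get m distinct targets
            chosen.add(rng.choice(pool))
        for old in chosen:
            edges.add((old, new))
            pool += [old, new]
    return edges


small = barabasi_albert(200, 10, random.Random(1))
print(len(small))  # 1945
```

The edge count is deterministic (45 seed edges plus 10 per added node) even though the attachment targets are random.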

B Source code

ROAM.py

The original ROAM heuristic.

from snap import *
from sys import argv
import operator


def ROAM(*args):
    filename = argv[1]
    budget = int(argv[2])

    print("Friendly reminder: Run all networks through FileConverter.py first.")

    G = LoadEdgeList(PUNGraph, filename, 0, 1)
    maxCentr = float('-inf')

    # calculate degree centrality
    # NI = node iterator, constructed through the call
    # to the method G.Nodes()
    for NI in G.Nodes():
        degCentr = GetDegreeCentr(G, NI.GetId()) * (G.GetNodes() - 1)
        if degCentr > maxCentr:
            targetNode = NI.GetId()
            maxCentr = degCentr

    # an explicit target node on the command line overrides the default
    if len(argv) == 4:
        targetNode = int(argv[3])

    # print old centralities
    degCentr = GetDegreeCentr(G, targetNode) * (G.GetNodes() - 1)
    closeCentr = GetClosenessCentr(G, targetNode)
    Nodes = TIntFltH()
    Edges = TIntPrFltH()
    GetBetweennessCentr(G, Nodes, Edges, 1.0)
    inBtwCentr = Nodes[targetNode]

    NIdEigenH = TIntFltH()
    GetEigenVectorCentr(G, NIdEigenH)
    eigCentr = NIdEigenH[targetNode]

    print("Target node was: " + str(targetNode) + "\n")
    print('{}{}'.format("Old degree centrality: ", degCentr))
    print('{}{}'.format("Old closeness centrality: ", closeCentr))
    print('{}{}'.format("Old in-betweeness centrality: ", inBtwCentr))
    print('{}{}'.format("Old eigenvector centrality: ", eigCentr) + "\n")

    # find targetNode's friends
    friends = TIntV()
    for N in G.GetNI(targetNode).GetOutEdges():
        friends.Add(N)

    # Get subgraph induced by the neighbours of targetNode.
    subGraph = GetSubGraph(G, friends)

    degCentrVec = {}

    for NI in subGraph.Nodes():
        degCentr = GetDegreeCentr(subGraph, NI.GetId()) * (G.GetNodes() - 1)
        degCentrVec[NI.GetId()] = degCentr

    # sort the friends by their degree centrality within the subgraph
    # (sorted() returns the (node id, centrality) pairs in ascending order)
    sortedCntrVec = sorted(degCentrVec.items(), key=operator.itemgetter(1))
    if len(sortedCntrVec) <= budget:
        del sortedCntrVec[2:len(sortedCntrVec) - (budget - 1)]

    # let the friend with highest centrality be denoted by v0
    v0 = sortedCntrVec[0]

    # remove edge between targetNode and v0
    G.DelEdge(targetNode, v0[0])

    # add edges between v0 and the b-2 other friends
    for key, value in sortedCntrVec:
        if key != v0[0]:
            G.AddEdge(key, v0[0])

    # save graph
    SaveEdgeList(G, "output.txt", "Save as tab-separated list of edges")

    # print new centralities
    degCentr = GetDegreeCentr(G, targetNode) * (G.GetNodes() - 1)
    closeCentr = GetClosenessCentr(G, targetNode)
    Nodes = TIntFltH()
    Edges = TIntPrFltH()
    GetBetweennessCentr(G, Nodes, Edges, 1.0)
    inBtwCentr = Nodes[targetNode]

    NIdEigenH = TIntFltH()
    GetEigenVectorCentr(G, NIdEigenH)
    eigCentr = NIdEigenH[targetNode]

    print('{}{}'.format("New degree centrality: ", degCentr))
    print('{}{}'.format("New closeness centrality: ", closeCentr))
    print('{}{}'.format("New in-betweeness centrality: ", inBtwCentr))
    print('{}{}'.format("New eigenvector centrality: ", eigCentr))


if __name__ == '__main__':
    ROAM(argv)

ROAMeig.py

The modified ROAM heuristic.

from snap import *
from sys import argv
import numpy
import operator
import csv


def ROAMeig(*args):
    filename = argv[1]
    budget = int(argv[2])

    print("Friendly reminder: Run all networks through FileConverter.py first.")
    G = LoadEdgeList(PUNGraph, filename, 0, 1)
    maxCentr = float('-inf')

    # create adjacency list (the first four lines of the file are skipped)
    with open(filename, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        for i in range(0, 4):
            next(reader)
        adj_list = list(reader)

    # matrix size
    flatten = map(int, [val for sublist in adj_list for val in sublist])
    n = max(flatten)
    A = numpy.zeros((n, n))

    # build matrix (node ids in the file are 1-based)
    for i in adj_list:
        A[int(i[0]) - 1][int(i[1]) - 1] = 1
        A[int(i[1]) - 1][int(i[0]) - 1] = 1

    # calculate square matrix
    A2 = numpy.linalg.matrix_power(A, 2)

    # matrix sums
    matSum = numpy.identity(n) + A + A2

    # calculate degree centrality
    # NI = node iterator, constructed through the call
    # to the method G.Nodes()
    degCentrVec = numpy.zeros(n)

    for NI in G.Nodes():
        degCentr = GetDegreeCentr(G, NI.GetId()) * (G.GetNodes() - 1)
        degCentrVec[int(NI.GetId()) - 1] = degCentr
        if degCentr > maxCentr:
            targetNode = NI.GetId()
            maxCentr = degCentr

    # an explicit target node on the command line overrides the default
    if len(argv) == 4:
        targetNode = int(argv[3])

    # print old centralities
    degCentr = GetDegreeCentr(G, targetNode) * (G.GetNodes() - 1)
    closeCentr = GetClosenessCentr(G, targetNode)
    Nodes = TIntFltH()
    Edges = TIntPrFltH()
    GetBetweennessCentr(G, Nodes, Edges, 1.0)
    inBtwCentr = Nodes[targetNode]

    NIdEigenH = TIntFltH()
    GetEigenVectorCentr(G, NIdEigenH)
    eigCentr = NIdEigenH[targetNode]

    print("Target node was: " + str(targetNode) + "\n")
    print('{}{}'.format("Old degree centrality: ", degCentr))
    print('{}{}'.format("Old closeness centrality: ", closeCentr))
    print('{}{}'.format("Old in-betweeness centrality: ", inBtwCentr))
    print('{}{}'.format("Old eigenvector centrality: ", eigCentr) + "\n")

    # find targetNode's friends
    friends = TIntV()
    for N in G.GetNI(targetNode).GetOutEdges():
        friends.Add(N)

    # Get subgraph induced by the neighbours of targetNode.
    subGraph = GetSubGraph(G, friends)

    degMassCentrVec = {}

    # score each friend by the row of (I + A + A^2) dotted with the degree vector
    for NI in subGraph.Nodes():
        targetRow = matSum[int(NI.GetId()) - 1]
        degMassCentr = numpy.dot(targetRow, degCentrVec)
        degMassCentrVec[NI.GetId()] = degMassCentr

    # locate the b-1 friends with highest centrality
    # (sorted() returns the (node id, score) pairs in ascending order)
    sortedMassCntrVec = sorted(degMassCentrVec.items(),
                               key=operator.itemgetter(1))
    if len(sortedMassCntrVec) <= budget:
        del sortedMassCntrVec[2:len(sortedMassCntrVec) - (budget - 1)]

    # let the friend with highest centrality be denoted by v0
    v0 = sortedMassCntrVec[0]

    # remove edge between targetNode and v0
    G.DelEdge(targetNode, v0[0])

    # add edges between v0 and the b-2 other friends
    for key, value in sortedMassCntrVec:
        if key != v0[0]:
            G.AddEdge(key, v0[0])

    # save graph
    SaveEdgeList(G, "outputeig.txt", "Save as tab-separated list of edges")

    # print new centralities
    degCentr = GetDegreeCentr(G, targetNode) * (G.GetNodes() - 1)
    closeCentr = GetClosenessCentr(G, targetNode)
    Nodes = TIntFltH()
    Edges = TIntPrFltH()
    GetBetweennessCentr(G, Nodes, Edges, 1.0)
    inBtwCentr = Nodes[targetNode]

    NIdEigenH = TIntFltH()
    GetEigenVectorCentr(G, NIdEigenH)
    eigCentr = NIdEigenH[targetNode]

    print('{}{}'.format("New degree centrality: ", degCentr))
    print('{}{}'.format("New closeness centrality: ", closeCentr))
    print('{}{}'.format("New in-betweeness centrality: ", inBtwCentr))
    print('{}{}'.format("New eigenvector centrality: ", eigCentr))


if __name__ == '__main__':
    ROAMeig(argv)

indepCascade.py

The independent cascade influence model.

from snap import *
from sys import argv
from collections import deque
import numpy
import csv


def IndependentCascade(filename, targetNode):
    G = LoadEdgeList(PUNGraph, filename, 0, 1)
    nrOfEdges = G.GetEdges()
    # every edge transmits influence with the same probability
    uniformProb = 0.4
    # queue of newly activated nodes that still get to try their neighbours
    activeSet = deque([targetNode])
    activatedNodes = set([targetNode])
    while activeSet:
        temp = activeSet.popleft()
        for N in G.GetNI(temp).GetOutEdges():
            if (N not in activatedNodes
                    and numpy.random.uniform() <= uniformProb):
                activeSet.append(N)
                activatedNodes.add(N)
    # store the set of activated nodes
    with open('indepCascade.txt', 'w') as f:
        f.write(str(activatedNodes))
    print("Number of activated nodes: " + str(len(activatedNodes)))


if __name__ == '__main__':