
Community Detection applied to Cross-Device Identity Graphs

VALENTIN GEFFRIER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Stockholm, Sweden, 2017


Abstract

The personalization of online advertising has now become a necessity for marketing agencies. Tracking technologies such as third-party cookies give advertisers the ability to recognize internet users across different websites, to understand their behavior and to assess their needs and tastes. The amount of created data and interactions leads to a large cross-device identity graph that links different identifiers, such as email addresses, to different devices used on different networks. Over time, strongly connected components appear in this graph, too large to represent the identifiers or devices of a single person or household. The aim of this project is to partition these components according to the structure of the graph and the features associated with the edges, without separating identifiers used by the same person. The resulting size reduction of these components then isolates individuals and the identifiers associated with them. This thesis presents the design of a bipartite graph from the available data, the implementation of different community detection algorithms adapted to this specific case, and different validation methods designed to assess the quality of our partitions. Different graph metrics are then used to compare the outputs of the algorithms, and we observe how adapting an algorithm to the bipartite case can lead to better results.



I would first like to thank 1000mercis, the company which hosted this project and gave me the means and the support I needed, as well as the different supervisors who helped me throughout.

Pierre Colin and Romain Tailhades, both data scientists at 1000mercis, for their advice and ideas.

Erik Fransén, my supervisor at KTH, for his availability and his machine learning knowledge.

Johan Håstad, for accepting to examine my thesis.

Thanks also to the rest of the data science team at 1000mercis for helping me and creating a perfectly studious atmosphere.


Contents

1 Introduction
1.1 Context
1.2 Aim and Scope
1.3 Ethical Considerations
1.4 Limitations

2 Graph clustering
2.1 Graph theory
2.2 Community structure
2.3 Algorithms validation, metrics and partition distance
2.3.1 Modularity
2.3.2 Bipartite modularity
2.3.3 Local Modularity
2.3.4 The resolution parameter
2.3.5 Partition distances
2.3.6 Rewiring
2.3.7 Without a ground-truth partition

3 Unsupervised learning and clustering
3.1 Clustering algorithms
3.2 Graph Clustering algorithms
3.3 Algorithms in the scope of the project
3.3.1 The Louvain Algorithm
3.3.2 Bipartite Louvain
3.3.3 Local Modularity optimization
3.3.4 Infomap
3.3.5 Markov Clustering Algorithm (MCL)
3.3.6 A sparse version of MCL
3.3.7 Comparison on the karate club graph

4 Results
4.1 Studied graph
4.2 Performances of the Sparse MCL
4.2.1 Influence of the inflation factor
4.3 Measures of modularity
4.3.1 Influence of the resolution parameter
4.4 Validation
4.4.1 Partition distance
4.4.2 IP addresses
4.4.3 Lifetime of the inter-cluster edges
4.5 Time
4.6 Synthesis

5 Conclusion

Bibliography


Chapter 1

Introduction

1.1 Context

1000mercis is a French digital marketing company which provides Customer Relationship Management (CRM) services and online advertising to a wide panel of clients from different sectors such as banking, insurance, Consumer Packaged Goods brands or NGOs. It allows its clients to target specific groups of people on various channels: emails, online display, text messages. This has been made possible by the use of cookies, small text files stored by the web browser that contain an id unique to each browser of each device. They can be created when a user is exposed to one of the marketing campaigns, opens one of the emails or visits the website of one of the company's clients. The performance of a campaign can then be computed by counting the proportion of people who, after having been exposed, purchased a product on the client's website.

The cookies can then be used to target these specific people with ads tailored to their customer journey. When a user's cookie is observed together with an email address (for example after an email opening or a registration on a website), the id of the cookie and a hash of the email address (for confidentiality reasons) are stored in a database, along with the date and time of the observation, the IP address of the network and the type of device used. A graph is built with email addresses and cookies as nodes, where an edge represents an observation of an email address on a browser containing a given cookie. This is very useful for companies that have databases filled with email addresses of former customers they want to reach on a different channel. Like Facebook, Google, Amazon, Liveramp and many other companies, 1000mercis can find the cookies linked to these addresses and target them specifically online.

To measure the performance of marketing campaigns, most marketing agencies consider that a purchase on a client's website can be attributed to their campaign if at least one banner was shown to the buyer during the previous week. This may not measure the real effect of the campaign, because the buyer could have made the purchase without any advertising. That is why 1000mercis uses an A/B testing protocol to take into account the organic purchases on the website: one group of email addresses is exposed, another is not, and at the end of the campaign the effect of banner exposure is measured as an uplift of the buying rate between the two groups. This supposes that both groups are statistically equal and completely independent. To ensure the latter condition, the connected components of the graph are computed, and groups A and B are built so that email addresses and cookies belonging to the same connected component end up in the same group and the intersection between groups A and B is empty.

This solution was acceptable as long as the groups were statistically close, but after several years of aggregating online data, components merged into one big component of millions of emails and dozens of millions of cookies. This happened because devices were shared between different persons, the extreme example being a library computer which can be linked to one cookie and hundreds of different email addresses. This component can no longer be included in one group or the other without creating a bias, but removing it is a problem as it contains very active cookies (very active because they belonged to devices linked directly and indirectly to a lot of other devices), which would probably have been more responsive to a marketing campaign.

1.2 Aim and Scope

Using connected components gave the same importance to every link between nodes, and every link was kept in the analysis, which was a simple solution when data was sparse. Now that the graph is denser and more connected, some edges need to be ignored. This can be done by applying community detection algorithms to the graph. Furthermore, these algorithms can refine the non-problematic components we had before, in order to keep only trustworthy edges and identify more clearly the cookies and email addresses referring to only one person. This would lead to a better understanding of how these clusters form and how they interact.

Here, the original graph contains billions of edges and hundreds of millions of nodes, but due to technical limitations the different algorithms are applied to a subgraph of 100,000 nodes and 120,000 edges.

The goal of this thesis is to study, among the wide range of graph clustering methods, those that would fit our goal: suited to the type of clusters we are looking for, usable on large graphs, and taking into account the bipartite property of our graph if possible. In one specific case, we adapt a generic graph clustering algorithm in order to optimize a specific bipartite graph metric. We also adapt another algorithm to large sparse graphs with a more relevant data structure.


After obtaining different partitions as outputs on this graph, we will need tools such as graph metrics and partition distances to evaluate these partitions, which is harder in an unsupervised case such as ours. We will also use other available data that was not used by the algorithms. The results of this comparison are presented at the end of the thesis.

1.3 Ethical Considerations

With the emergence of big data, marketers realized what this amount of data could represent for the understanding of their customers’ behaviour, because they could now recognize people on the Internet and target them specifically.

In order to protect the confidentiality of users' personal data, rules have been designed to ensure that companies do not keep these data longer than a certain amount of time, keep them only if they are truly needed, and only if the user gave their consent to this use of their data.

Furthermore, in order to keep track of a user without being able to associate them with a real person, only the hash of the email address can be stored. This hash, if it is computed with a strong enough algorithm such as SHA-256, ensures that today's computers cannot trace back an email address from its hash. Even if two strings are really close, their hashes are completely different.
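As a small illustration of this property, two nearly identical addresses produce unrelated SHA-256 digests; the addresses below are made up for the example:

```python
# Two nearly identical strings hash to completely different digests.
import hashlib

for email in ("alice@example.com", "alice@example.net"):
    print(hashlib.sha256(email.encode()).hexdigest())
```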

This ensures that the data, if found by another person, cannot be associated with a person. It is also strictly forbidden for someone who has access to the data to hash an email address and check whether it matches some records of the database.

If this data were to fall into the wrong hands, knowing more specific information about specific people could, with tedious work, lead to a de-anonymization of the data [1]. This is why the anonymization of data cannot be the only safeguard before sharing or selling it. As there are still grey areas in the legal aspects, marketing companies should be aware of these limitations and better protect their users' data, even if it is not shared with others and even if the law can sometimes still be lenient about it.

The goal of this kind of data analysis is to group the different hashed ids that represent one person, in order to consolidate the data and better understand which ads would be interesting for them. It can be seen as removing noise from observations of different events without looking for the true identity of the users, and it therefore remains within the law. This consolidation of the cookies and email addresses will be used only inside the company for analysis purposes and will not be shared or sold to other companies.


1.4 Limitations

In order to use different machine learning libraries, we use Python to implement the different algorithms and produce these results. Even though this language can be slower than others, we think it is better suited to this kind of study, which aims at choosing one method among a wide panel. After this choice, it can be interesting to use a different language like C++ to speed up an algorithm, or tools such as Scala and Spark that enable distributed computations. For technical reasons, the use of the company's cluster was not possible during this project, so we first reduced the size of the graph in order to be able to test the different algorithms on a laptop with 16 GB of RAM and an Intel Core i7-7500U processor (2.70 GHz, 2 cores, 4 threads).


Chapter 2

Graph clustering

2.1 Graph theory

A graph is a data structure that is particularly useful to describe links between different elements. The elements are called nodes or vertices and the links are described by edges, which can be directed. In this thesis, because of the nature of the graph, we will focus on the case of undirected graphs. If the links do not all have the same strength, weights can be associated with them and the graph is then said to be weighted. We use the notation $G = (W, V, E)$ to designate the graph $G$, where $V$ refers to the set of vertices of the graph, $E$ to the set of edges and $W$ to the set of weights associated with the edges. For every pair of nodes $(n_a, n_b) \in V^2$, if $(n_a, n_b) \in E$, then $w_{a,b}$ denotes the weight associated with this edge. If the graph is unweighted, we consider all weights equal to 1.

For a node $a$, the degree is defined as the number of edges linked to $a$ in the case of an unweighted graph, or as the sum of the weights associated with these edges otherwise. Let $n$ be the number of nodes and $m$ half of the sum of the degrees of the nodes, which is the number of edges in the case of an unweighted graph.

A graph is bipartite if the set of nodes can be divided into two subsets such that every edge of the graph links a node from one subset to a node of the other. As we will see later, our graph is bipartite, as each edge represents the simultaneous observation of an email address and a cookie id: the cookie ids form one subset and the email addresses the other.

The graph can also be represented by its adjacency matrix $A$, with $A_{i,j} = w_{i,j}$ if $(n_i, n_j) \in E$ and $A_{i,j} = 0$ otherwise, for all $(n_i, n_j) \in V^2$. In the case of an undirected graph this matrix is symmetric, and if the graph is also bipartite we can keep only the rows referring to nodes of one set and the columns referring to nodes of the other set, as all other elements of the matrix will be 0. This representation is a basis for several methods and metrics, and its eigenvalues and eigenvectors can be interpreted as important indicators of the graph structure.

The output of our algorithms is a clustering of the graph, defined as a partition $\pi$ of the set of nodes $V$: $\pi = (V_1, \dots, V_K)$ with $0 < K \le |V|$, $\bigcup_{i=1}^{K} V_i = V$, and $V_i \cap V_j = \emptyset$ if $i \ne j$.

2.2 Community structure

What these algorithms see as communities are more densely connected groups of nodes, where the structure of the subgraph can be quite different from that of the global graph. These communities should also be quite isolated from each other. For example, in a people-based graph, a family or a group of friends will appear as a dense part of the graph where most nodes are linked by an edge, because its members interact with most of the other members of the group, whereas they are less connected to the rest of the graph.

This is why community detection is part of unsupervised learning when we do not have labels on the nodes. As in other clustering problems, we need to group similar points that are different from the other points, without a ground-truth partition. This task is even more difficult when, as in our case, we have a large graph that should be clustered into small communities containing the cookies and email addresses of only one user, because we do not know the number of clusters the partition should contain.

In most cases, the problem of finding the best partition according to a certain criterion is NP-hard, and even a clustering algorithm with only quadratic complexity can be prohibitively slow on graphs of our size.

2.3 Algorithms validation, metrics and partition distance

To measure the quality of a clustering, several papers have defined tools to express mathematically the idea of community and to assess the quality of the partitions given by the different algorithms.

2.3.1 Modularity

One of the most famous clustering measures is the modularity, designed by Newman [2], which has been used for more than a decade to compare most graph clustering methods. It is defined as the fraction of edges that fall within communities, minus the expected value of that fraction in a null model: a random graph where the nodes keep the same degrees but edges are otherwise placed uniformly at random. With our notations:

$$Q = \frac{1}{2m} \sum_{i,j \in V} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C(i), C(j))$$

where $C(i)$ is the cluster containing node $i$ and $\delta$ is the Kronecker delta. It can also be reformulated as:

$$Q = \sum_{c=1}^{C} \left( e_c - a_c^2 \right)$$

where $e_c$ is the fraction of edges of the global graph with both ends in community $c$, and $a_c$ is the fraction of edge ends attached to nodes of community $c$. The modularity was defined to be zero for the null model. It takes values between -1 and 1, but starting from 0.3 a partition is often considered relevant, meaning that there is indeed a community structure in the graph and that this structure is well represented by the partition.
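For illustration, the formulation $Q = \sum_c (e_c - a_c^2)$ can be computed directly from an edge list and a node-to-community mapping; this is a minimal sketch for an unweighted graph, not the implementation used in the thesis:

```python
# A direct implementation of Q = sum_c (e_c - a_c^2) for an unweighted,
# undirected graph; `partition` maps each node to its community label.
from collections import defaultdict

def modularity(edges, partition):
    m = len(edges)                 # number of edges
    e = defaultdict(float)         # e_c: fraction of edges with both ends in c
    a = defaultdict(float)         # a_c: fraction of edge ends attached to c
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        if cu == cv:
            e[cu] += 1.0 / m
        a[cu] += 0.5 / m           # each edge contributes two ends out of 2m
        a[cv] += 0.5 / m
    return sum(e[c] - a[c] ** 2 for c in a)
```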

2.3.2 Bipartite modularity

A variant of modularity has been defined for bipartite graphs by Barber [3], as the null model can be made more precise: we know that nodes of the same type cannot be linked together, so the expected number of edges is different. We can then define the bipartite modularity $Q_b$ as follows:

$$Q_b = \sum_{c=1}^{C} \left( e_c - a_c \times b_c \right)$$

where $a_c$ is the fraction of edges of the global graph whose first-type end is in community $c$, and $b_c$ the fraction of edges whose second-type end is in community $c$.
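The previous sketch adapts directly to Barber's $Q_b$; here `side` is an assumed mapping giving the type (0 or 1) of each node, since the text does not fix a representation:

```python
# The same sketch adapted to Q_b = sum_c (e_c - a_c * b_c).
from collections import defaultdict

def bipartite_modularity(edges, partition, side):
    m = len(edges)
    e = defaultdict(float)   # fraction of edges with both ends in c
    a = defaultdict(float)   # fraction of edges whose type-0 end is in c
    b = defaultdict(float)   # fraction of edges whose type-1 end is in c
    for u, v in edges:
        if partition[u] == partition[v]:
            e[partition[u]] += 1.0 / m
        for node in (u, v):  # each edge has one end of each type
            (a if side[node] == 0 else b)[partition[node]] += 1.0 / m
    return sum(e[c] - a[c] * b[c] for c in set(a) | set(b))
```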

2.3.3 Local Modularity

Although widely used, modularity has been shown to miss small communities in large graphs, or communities embedded in larger ones. Indeed, the assumption of the null model is that a node can be linked to any other node in the graph, whereas in most real cases nodes can only be linked to a small local part of the graph. As the network size increases, the expected number of edges between two clusters decreases and can drop below one, which leads to an automatic merge of small clusters: even if they have many internal edges and just a few edges linking them, merging them still increases the modularity.

In order to solve this issue, Muff [4] defined another metric, the local modularity, which for each community only considers the subgraph composed of the edges in the community and its neighbor communities. This means it considers the fraction of edges inside the community among the edges of this subgraph, and its expected value in the null model of this subgraph. This gives the same formula with:

$$Q_{loc} = \sum_{c=1}^{C} \left( \tilde{e}_c - \tilde{a}_c^2 \right)$$

where $\tilde{e}_c$ and $\tilde{a}_c$ are, this time, fractions of the edges of the subgraph neighboring the community $c$. This local version can also be adapted to the bipartite case.

2.3.4 The resolution parameter

Reichardt [5] introduced a Hamiltonian, generalized version of the modularity with a resolution parameter $r$ that weighs the two terms of the modularity as follows:

$$Q_r = \sum_{c=1}^{C} \left( r \times e_c - a_c^2 \right)$$

where $r$ is a real number greater than 0. When $r$ is greater than 1, the modularity takes higher values for partitions with bigger clusters, and vice-versa; when $r$ equals 1 we recover the original modularity. As we will see later, this parameter is useful for modularity optimization algorithms on large graphs, as it allows to look for partitions with bigger or smaller clusters. Indeed, we might not want the partition with the highest modularity among all possible partitions, but only among the partitions whose clusters have specific sizes, for example when we look for communities representing a single individual. This generalized version has been proven to have limits of its own [6].

2.3.5 Partition distances

For some training graphs, one can have a ground-truth partition representing the true underlying partition of the graph. In order to compare the output of an algorithm on such a graph with the real partition, several indices can be used [7]:

The Rand Index

This index, introduced by Rand [8], counts the number of agreements between two partitions over all pairs of nodes, divided by the total number of pairs of nodes. By agreement, we mean that the nodes $a$ and $b$ are in the same cluster in both partitions $X$ and $Y$, or in different clusters in both partitions.

If we note $t$ the number of pairs of nodes that are in the same cluster in both $X$ and $Y$, $u$ the number of pairs that are in different clusters in both $X$ and $Y$, and $N$ the number of nodes, then:

$$R = \frac{t + u}{N(N-1)/2}$$

This index takes values between 0 and 1, but one needs to know the value it takes in a random case for it to be relevant; that is why the ARI was created. As we will see in our case, when there are many clusters it is easy to get a Rand index close to 1: for example, the partition where all nodes are separated already agrees with the reference on every pair of nodes that should not be together, whereas we should focus on the pairs that are together in one of the partitions, because they are far less frequent.

The ARI or Adjusted Rand Index

This index is the corrected version of the Rand index, so that it yields a value of 1 if both partitions are identical, and 0 on average over random partitions. This means it can take negative values. With $X = \{X_1, X_2, \dots, X_r\}$, $Y = \{Y_1, Y_2, \dots, Y_s\}$, $n_{i,j} = |X_i \cap Y_j|$, $a_i = \sum_j n_{i,j}$ and $b_j = \sum_i n_{i,j}$, it can be written as follows:

$$ARI = \frac{R - \mathbb{E}[R]}{\max R - \mathbb{E}[R]} = \frac{\sum_{i,j} \binom{n_{i,j}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}$$

This version corrects the bias we exposed in the former section, as a positive value of the adjusted index now indicates a similarity higher than the random case, which was not guaranteed before. Nonetheless, as this method counts pairs, there is a quadratic relationship between this metric and cluster sizes, so the index is affected much more when cluster sizes vary.

The NMI or normalized mutual information

This index comes from information theory and measures the information shared by two partitions $X$ and $Y$, i.e. what we can learn about $Y$ knowing $X$.

The Shannon entropy for a partition $X$ is $h(X) = -\sum_{x \in X} p(x) \log(p(x))$, where $p(x)$ is the probability for a node to be labeled as a member of community $x$.

We define $h(Y)$ the same way, and $h(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log(p(x, y))$, where $p(x, y)$ is the probability for a node to be labeled as a member of community $x$ in partition $X$ and of community $y$ in partition $Y$.

Then the mutual information [9] is:

$$MI(X, Y) = h(X) + h(Y) - h(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

and its normalized version is obtained as follows:

$$NMI(X, Y) = \frac{MI(X, Y)}{\sqrt{h(X)\,h(Y)}}$$

NMI equals 0 when the partitions contain no information about one another, and 1 when they contain perfect information about one another, i.e. when they are identical.
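In practice all three indices are available in scikit-learn (rand_score requires scikit-learn 0.24 or later); a small usage sketch with made-up label vectors:

```python
# Computing the three indices with scikit-learn; the two lists give, for each
# node, its community in the ground truth and in the algorithm's output.
from sklearn.metrics import (rand_score, adjusted_rand_score,
                             normalized_mutual_info_score)

truth = [0, 0, 0, 1, 1, 2]   # made-up ground-truth labels
found = [0, 0, 1, 1, 1, 2]   # made-up algorithm output

print(rand_score(truth, found))
print(adjusted_rand_score(truth, found))
print(normalized_mutual_info_score(truth, found))
```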

2.3.6 Rewiring

In order to be sure that a metric such as modularity takes a certain value because of an underlying community structure that we have been able to detect, and not because of broader characteristics such as the degree distribution, one can rewire the graph [10] [11] by swapping the ends of two randomly chosen edges, repeated for a large enough number of pairs of edges. The same algorithms are then applied to this new graph and the modularity is computed, to check that its value is significantly lower than the one obtained on the original graph.
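With networkx, such a degree-preserving rewiring can be done with double_edge_swap; a sketch of the procedure on a copy of the graph:

```python
# Degree-preserving rewiring with networkx: double_edge_swap picks two edges
# (u, v) and (x, y) at random and replaces them with (u, x) and (v, y).
import networkx as nx

G = nx.karate_club_graph()
R = G.copy()
nx.double_edge_swap(R, nswap=10 * R.number_of_edges(), max_tries=10**6)
# re-run the clustering on R and compare its modularity with the original one
```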

2.3.7 Without a ground-truth partition

If there is no ground-truth partition available, it is still possible to use other data that was not used for the clustering. For example, the authors of the Louvain algorithm [12] used community detection algorithms on mobile phone graphs where each node is a user and an edge represents a call between two users. In order to test their clustering, they measured the homogeneity, in each cluster, of a feature they had not used before: the language associated with each user's device. The language homogeneity in each cluster showed that nodes in the same cluster were similar according to a criterion that had not been used in the clustering. The limit of their method is that nodes from different clusters can also be similar according to this criterion, so the trivial partition where every node is alone in its own cluster is perfect according to this measure alone; clustering is also about creating groups that are different from each other. In our case, we will use the IP addresses and design a rule in order to assess the quality of our partition.


Chapter 3

Unsupervised learning and clustering

Clustering is one of the most frequent problems of unsupervised learning. Supervised learning trains models by giving them, for each sample, the objective they should tend to. This job can seem easier, as the algorithm is given the groups from which it should learn and can then understand the similarities of points inside a group and the dissimilarities between groups. Here, we have to form the groups as well, and we cannot validate our results as easily.

3.1 Clustering algorithms

General clustering methods that were designed for vectors in a multi-dimensional space cannot be directly applied to graphs. One possibility is to first define a distance between nodes, such as the shortest-path length, which respects the definition of a distance. However, this requires either the Floyd-Warshall algorithm [13], which runs in $O(|V|^3)$, or Dijkstra's algorithm run from each node, which takes $O(|V||E| + |V|^2 \log |V|)$ and is better on a sparse graph where $|E|$ is significantly smaller than $|V|^2$. One method for large graphs uses "landmark nodes" to approximate the distance and obtain a complexity linear in the size of the graph [14]. From this distance matrix, one can already apply hierarchical algorithms such as complete, single and average linkage [15], or Ward's method [16]. If one wants to use algorithms that need nodes to be represented by vectors, such as K-means or Mixture Models, finding a multidimensional space where these nodes can be placed while respecting the distances between them is not simple and can require a lot of dimensions. One solution is to use the columns of the distance matrix as vectors, but this means we would still have as many dimensions as nodes. K-medoids [17] is also an adaptation that can work with the distance matrix as its only input.


3.2 Graph Clustering algorithms

In order to address the specific graph partitioning problem, various algorithms have been designed, and an exhaustive survey has been provided by Fortunato [18]. Some are variants of the algorithms presented above: hierarchical clustering that uses graph-based distances, or spectral clustering that places the nodes in the linear span of the first eigenvectors of a similarity matrix of the nodes and then applies the k-means algorithm. Divisive algorithms [19] delete the edges with the highest betweenness, i.e. the number of pairs of nodes whose shortest path goes through this edge, until the graph is separated into the desired number of clusters. These algorithms were not kept in the scope of this study, as their complexity was too high for the size of the graph, and most of them require the number of clusters as an input parameter, whereas it is unknown in our case and can take a wide range of values.

An important aspect of our graph is that it is bipartite. In that case, in order to avoid an abnormal behaviour of an algorithm caused by this specificity, one can use the projection graph, which keeps only one type of nodes and creates an edge between two of them if they had a common neighbour in the original graph. This technique reduces the number of nodes, but it can also increase the number of edges: one second-type node that was linked to 10 first-type nodes is replaced by an edge between each pair of nodes among these ten, which gives 45 edges, as the sketch below illustrates. Furthermore, in our case, the data carried by the edges of the original graph can no longer be used after the projection.
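With networkx, the one-mode projection can be computed with the bipartite module; this toy example, with made-up node names, shows one shared cookie turning three emails into a triangle:

```python
# One-mode projection of a small bipartite graph: the cookie c1, linked to
# three emails, becomes a triangle between them in the projection.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_edges_from([("e1", "c1"), ("e2", "c1"), ("e3", "c1"), ("e1", "c2")])
P = bipartite.projected_graph(B, {"e1", "e2", "e3"})
print(sorted(P.edges()))  # [('e1', 'e2'), ('e1', 'e3'), ('e2', 'e3')]
```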

To detect communities in the graph, we choose to focus on local approaches, algorithms that would build small clusters without requiring the number of clusters as an input, and with an appropriate scalability on large graphs to be able to test different algorithms with different parameters more quickly.

3.3 Algorithms in the scope of the project

3.3.1 The Louvain Algorithm

As modularity is widely used to evaluate graph partitions, some algorithms focus on its optimization. A brute-force optimization would try every possible partition of the graph and keep the one with the highest modularity, but this quickly becomes unfeasible as the number of nodes increases: the number of partitions of a graph of size $n$ is the $n$-th Bell number (for $n = 19$, this number is already 5,832,742,205,057).

The Louvain algorithm [12] aims at finding local optima of the modularity among all possible partitions, by starting from the partition where all nodes are separated and then applying a local moving heuristic that moves one node to another community as long as such moves make the modularity increase.


When it does not increase anymore, a new graph is built whose nodes are the communities of the former graph, and the local moving heuristic is applied again to these new nodes. The algorithm alternates these two steps until the partition no longer changes. The advantage of this method is that the modularity does not have to be recomputed entirely at each change: only the difference does, and its computation is easier. Optimizing the modularity is NP-hard, but this heuristic quickens the optimization considerably and gives quite good results. This algorithm can be seen as a modularity optimization in the partition space where merging neighbouring communities and moving one node from one community to another are the only authorized moves. It runs in $O(n \log n)$.

The algorithm can be written as follows:

input:
    G: graph
    c: initial assignment of nodes to communities (in our case, every node
       starts in its own cluster: c ← [1...NumberOfNodes(G)])
output:
    c: final assignment of nodes to communities

// Apply the local moving heuristic
c ← LocalMovingHeuristic(G, c)
if NumberOfCommunities(c) < NumberOfNodes(G) then
    // Produce the reduced network
    G_reduced ← ReducedNetwork(G, c)
    c_reduced ← [1...NumberOfNodes(G_reduced)]
    // Recursively call the algorithm
    c_reduced ← LouvainAlgorithm(G_reduced, c_reduced)
    // Merge the communities with the partition of the reduced graph
    c_old ← c
    for i ← 1 to NumberOfCommunities(c_old) do
        c(c_old = i) ← c_reduced(i)
    end for
end if
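This local moving + reduction scheme is implemented, for instance, in the python-louvain package (installed as python-louvain, imported as `community`), which also exposes the resolution parameter discussed below; a minimal usage sketch on the karate club graph:

```python
# Minimal usage of the python-louvain package on a networkx graph.
import networkx as nx
import community as community_louvain

G = nx.karate_club_graph()
partition = community_louvain.best_partition(G, resolution=1.0)
print(max(partition.values()) + 1, "communities found")
```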

Even if there is no self-loop in the original graph, when a new reduced graph is built from the communities, a self-loop is added to each new node with a weight equal to the sum of the weights of the edges inside the corresponding cluster (including self-loops). Edges between two communities are replaced by one edge between the two nodes representing these communities, with a weight equal to the sum of the weights of these edges.

The difference of modularity caused by the insertion of the node $i$ into the community $c$ can then be computed with the following formula:

$$\Delta Q = \left[ \frac{\Sigma_{in,c} + k_{i,c}}{2m} - \left( \frac{\Sigma_{tot,c} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in,c}}{2m} - \left( \frac{\Sigma_{tot,c}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]$$

Here, $\Sigma_{in,c}$ is the sum of the weights of the links inside the community $c$, $\Sigma_{tot,c}$ is the sum of the weights of the links incident to nodes of the community $c$, and $k_{i,c}$ is the sum of the weights of the links from $i$ to nodes of $c$. $k_{i,self}$ represents the weight of the self-loop of node $i$, which is equal to 0 in our case as long as the graph has not been reduced yet.

This algorithm can also be used with the resolution parameter $r$, and the formula becomes:

$$\Delta Q = \left[ r\,\frac{\Sigma_{in,c} + k_{i,c}}{2m} - \left( \frac{\Sigma_{tot,c} + k_i}{2m} \right)^2 \right] - \left[ r\,\frac{\Sigma_{in,c}}{2m} + \frac{k_{i,self}}{2m} - \left( \frac{\Sigma_{tot,c}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]$$
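For illustration, the generalized gain can be wrapped in a small helper following the reconstructed formula above; the argument names are ours, not the thesis's notation:

```python
# Gain of inserting an isolated node i into community c. sigma_in and
# sigma_tot are the weight sums defined above, k_i_c the weight of the links
# from i to community c, and r the resolution parameter.
def delta_q(sigma_in, sigma_tot, k_i, k_i_c, m, r=1.0, k_i_self=0.0):
    gain = (r * (sigma_in + k_i_c) / (2 * m)
            - ((sigma_tot + k_i) / (2 * m)) ** 2)
    loss = (r * sigma_in / (2 * m) + k_i_self / (2 * m)
            - (sigma_tot / (2 * m)) ** 2 - (k_i / (2 * m)) ** 2)
    return gain - loss
```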

Some variants have been designed in order to move more freely in the partition space, test more partitions and reach higher modularity values. Rotta [20] introduced a multilevel refinement which, once the original algorithm has been applied, recursively comes back to the previous graphs and applies the local moving heuristic again at each level; this adds another possible move in the partition space, namely moving one node from a community to another. The SLM algorithm [21] adds a step where, before working on the reduced graph, the first phase of the algorithm is applied to the nodes of each community taken alone as a subgraph, creating sub-communities that become the nodes of the reduced graph of the next step, which then starts with the communities as its partition; splitting the communities again is another possible move to explore the partition space. In the same paper, the authors showed that several iterations of the basic Louvain algorithm, where the output of each iteration is used as the input of the next, can still slightly increase the modularity of the final partition.

3.3.2 Bipartite Louvain

The Louvain algorithm can be adapted to the bipartite version of modularity. The heuristics stay exactly the same, and we adapt the modularity difference computation: we need to keep track, for each community, of the total degree of edges with both ends in the community, of edges with only their first-type end in the community, and of edges with only their second-type end in the community. We then find the expression of the bipartite modularity difference when inserting a node that was alone in another community, with the following formula:

$$\Delta Q_b = \left[ \frac{\Sigma_{in,c} + 2k_{i,c} + k_{i,self}}{2m} - \left( \frac{\Sigma_{tot,c,1} + k_{i,1}}{2m} \right) \left( \frac{\Sigma_{tot,c,2} + k_{i,2}}{2m} \right) \right] - \left[ \frac{\Sigma_{in,c} + k_{i,self}}{2m} - \left( \frac{\Sigma_{tot,c,1}}{2m} \right) \left( \frac{\Sigma_{tot,c,2}}{2m} \right) \right] - \left( \frac{2k_{i,1}}{2m} \right) \left( \frac{k_{i,2}}{2m} \right)$$

Here, $k_{i,1}$ represents the degree of $i$ pointing to first-type nodes, $k_{i,2}$ the degree of $i$ pointing to second-type nodes, and $\Sigma_{tot,c,1}$ and $\Sigma_{tot,c,2}$ can be understood the same way. During the first phase, when the graph has not been reduced yet, either $k_{i,1} = 0$ or $k_{i,2} = 0$ because our graph is bipartite. When we reduce the graph, we need to keep track, for each cluster, of which part of its degree has a first-type node and/or a second-type node in the community.

This algorithm can also be used with the resolution parameter r and its formula is easily obtained as for the classic Louvain algorithm.

3.3.3 Local Modularity optimization

We can also adapt the Louvain algorithm with the local version of modularity.

However, it is more complex. Indeed, in the cases of the classic and bipartite modularity, the contribution of each cluster to the modularity depends on the cluster and the rest of the graph as a whole, not on the neighbour clusters. This means that merging two communities or separating them will only change the contribution of these two communities.

For the local modularity, the contribution of each cluster depends on the size of its local subgraph and so on the sizes of the neighbour clusters. This means that if we merge two communities, this will change the contribution of these two clusters but also the contributions of all of their neighbours. This goes against the advantages of the local moving heuristic: the variation of the local modularity for the merging of two communities can not be computed as quickly. This is why we will not study this case in this thesis.

3.3.4 Infomap

Infomap [22] uses the local moving heuristic of the Louvain algorithm and can also be upgraded with its variants, except that it optimizes a different score. This score is the map equation, inspired by information theory, and it answers the following problem: imagine a random walker on the graph, walking from node to node. We want to assign codewords to the nodes so that any walk can be written as a sequence of these codewords; the codebook gives the correspondence between nodes and codewords. In order to minimize the length of this message, we use a Huffman code [23], giving each node one binary codeword. The Huffman algorithm ensures that the code is prefix-free, which means that no codeword is the prefix of another, so the message has only one interpretation. As Morse code does, the Huffman code gives shorter codewords to frequently visited nodes in order to reduce the length of the message.

We do not focus on one particular walk but on an average walk and the average message length needed per step. To compute this value, we need the visit probability of each node, which in the case of an undirected graph is the relative weight of the edges linked to this node:

$$p_l = \frac{\sum_{i \in V} w_{il}}{\sum_{i,j \in V} w_{i,j}}$$

Then Shannon's source coding theorem gives a lower bound on the average description length, namely the entropy of the codeword variable $X$: $h(X) = -\sum_{i \in V} p_i \log(p_i)$.
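A short sketch of these two quantities for a networkx graph, computing the visit probabilities from the (optionally weighted) degrees and the resulting entropy lower bound:

```python
# Visit probabilities p_i = k_i / 2m and the Shannon lower bound
# h(X) = -sum_i p_i log2(p_i) for an undirected (weighted) graph.
import math
import networkx as nx

def entropy_lower_bound(G):
    two_m = sum(d for _, d in G.degree(weight="weight"))
    probs = (d / two_m for _, d in G.degree(weight="weight"))
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_lower_bound(nx.karate_club_graph()))
```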

This method can be optimized further in the specific case of a graph, as codewords can only precede and follow a subgroup of codewords. More specifically, if there is an underlying community structure in the graph, a random walker will most probably stay inside a community for at least a few steps before leaving for another one. It is then more efficient to have a codebook specific to this community, with shorter codewords for each node and a codeword for when the walker exits the community, plus a codeword in a global codebook to specify in the message when the walker enters the community, so that the receiver of the message knows when to switch to this community codebook (an example showing the benefit of the additional codebooks, taken from the original paper, can be found in the appendix).

For one partition of the graph, we now have one global codebook with a codeword for each community, and a module codebook for each community with a codeword for each node and one for the exit. All of these codebooks are Huffman codes, but the same codeword can appear in several of these codebooks without compromising the unicity of the interpretation of the message.

Finding the best communities can now be seen as finding the partition that minimizes the length of the average walk description, which is represented by the map equation:

$$L(P) = q_{\curvearrowright}\, h(Q) + \sum_{c} p_{\circlearrowright}^{c}\, h_c(X_c)$$

where $h(Q)$ is the frequency-weighted average length of the codewords in the global index codebook, $c$ is one of the communities of the partition $P$, $q_{\curvearrowright} = \sum_c q_c^{\curvearrowright}$ is the probability of switching communities, with $q_c^{\curvearrowright}$ the probability of exiting community $c$; $h_c(X_c)$ is the frequency-weighted average length of the codewords in the codebook of community $c$, and $p_{\circlearrowright}^{c} = \sum_{i \in c} p_i + q_c^{\curvearrowright}$ is the probability of using codebook $c$ at each step. The map equation is thus the average of the codebooks' average codeword lengths, weighted by their probabilities of use.


After simplifying the equation, we can write in our case:

$$L(P) = q_{\curvearrowright} \log(q_{\curvearrowright}) - 2 \sum_{c} q_c^{\curvearrowright} \log(q_c^{\curvearrowright}) - \sum_{n \in V} p_n \log(p_n) + \sum_{c} \left( q_c^{\curvearrowright} + p_c \right) \log\left( q_c^{\curvearrowright} + p_c \right)$$

where $p_c = \sum_{i \in c} p_i$. This is the two-level version of Infomap. The multi-level version of Infomap [24] allows to create a hierarchy with more levels of module codebooks following the same rules, which can lead to a further decrease of the minimum description length.
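For an undirected graph, the simplified map equation can be evaluated directly from a partition, taking $p_n = k_n / 2m$ and the exit probability $q_c^{\curvearrowright}$ as the weight of the cut edges of community $c$ divided by $2m$; this is a sketch from the formula above, not the reference Infomap implementation:

```python
# Two-level map equation for an undirected (weighted) networkx graph.
import math

def plogp(x):
    return x * math.log2(x) if x > 0 else 0.0

def map_equation(G, partition):  # partition: node -> community label
    two_m = sum(d for _, d in G.degree(weight="weight"))
    p = {n: d / two_m for n, d in G.degree(weight="weight")}
    q, p_c = {}, {}                       # exit and total visit prob. per community
    for n in G:
        q.setdefault(partition[n], 0.0)
        p_c[partition[n]] = p_c.get(partition[n], 0.0) + p[n]
    for u, v, w in G.edges(data="weight", default=1.0):
        if partition[u] != partition[v]:  # cut edge: exit from both sides
            q[partition[u]] += w / two_m
            q[partition[v]] += w / two_m
    return (plogp(sum(q.values()))
            - 2 * sum(plogp(qc) for qc in q.values())
            - sum(plogp(pn) for pn in p.values())
            + sum(plogp(q[c] + p_c[c]) for c in q))
```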

3.3.5 Markov Clustering Algorithm (MCL)

The Markov Clustering Algorithm [25] also uses the random walker concept to detect communities. The idea is that a random walker stays more often inside dense communities, so we study the probability for a random walker starting from a node $i$ to arrive at a node $j$ in $N$ steps, as $N$ tends to infinity. To do so, we take the matrix $M = A + I$, with $A$ the adjacency matrix and $I$ the identity matrix, which allows the random walker to stay on a node and, in the bipartite case, allows the algorithm to converge. $M$ is then made stochastic by normalizing its columns, so that in each column $j$ the coefficient at row $i$ is the probability of being at node $i$ at the next step. The algorithm then repeats two phases at each iteration:

The expansion squares the matrix, doubling the number of steps the walker has already taken. The columns are then normalized again.

The inflation raises each coefficient to a certain power (usually 2), called the inflation factor, and then normalizes again. This step tends to increase the highest probabilities and decrease the lowest ones, which can speed up the algorithm. Changing the inflation factor influences the cluster sizes: a high inflation factor penalizes walks far from the starting point, so the walker more probably stays close to the departure node and the clusters are smaller. A low inflation factor conversely gives bigger clusters.

The Markov properties ensure that the matrix converges to a limit, an equilibrium state where nodes are part of attractor systems or point to one or several of these attractor systems. This means that in the limit matrix, the column associated with a node $i$ has a non-zero entry at row $i$ if $i$ is an attractor, and possibly other non-zero entries if there are other attractors in the same attractor system. Otherwise, the node points to at least one attractor, identified by the non-zero entries of the node's column. Attractors from the same system and the nodes pointing to them are grouped into clusters. Some random perturbations can be added if some nodes point to different attractor systems.

In the classic implementation, the algorithm converges when some values go below the threshold at which a float (about $2.2 \times 10^{-308}$ in Python) is considered a zero entry. This allows the only non-zero entry left in a column (when there is only one left) to become one after normalization.

The disadvantage of this method is that the matrix can take a lot of space (up to $N^2$ entries) if the graph has only one connected component, its diameter is low and the threshold value is very low, as the expansion quickly populates the matrix with non-zero entries.

3.3.6 A sparse version of MCL

The graph we work on is very sparse: even if it has only one connected component, the number of edges is not much higher than the number of nodes, which means that the ratio of non-zero entries in the adjacency matrix is close to $1/N$. One efficient way to store the matrix used by the MCL algorithm is the CSC (compressed sparse column) matrix format. This format reads the matrix column by column, left-to-right and top-to-bottom, and stores only the $nnz$ non-zero entries of the original matrix and their positions, with three arrays (see the illustration after this list):

1. The values array $val$, of size $nnz$, that stores the values of the entries in reading order.

2. The row indices array $rowind$, of size $nnz$, that stores the row index of each entry in reading order.

3. The column pointers array $colptr$, of size $N + 1$, with $colptr[0] = 0$ and $colptr[j] - colptr[j-1]$ the number of non-zero entries in column $j$, so that $colptr[j]$ is the number of entries in the first $j$ columns of the matrix. It is called the column pointers array because $colptr[j-1]$ is also the index in the $val$ array where the $j$-th column's values start.
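SciPy's csc_matrix stores exactly these three arrays, named data, indices and indptr; a small illustration on a 3x3 matrix:

```python
# The val / rowind / colptr arrays above correspond to the data / indices /
# indptr attributes of SciPy's CSC format.
import scipy.sparse as sp

A = sp.csc_matrix([[1, 0, 2],
                   [0, 0, 3],
                   [4, 5, 0]])
print(A.data)     # values column by column:  [1 4 5 2 3]
print(A.indices)  # row index of each value:  [0 2 2 0 1]
print(A.indptr)   # column pointers:          [0 2 3 5]
```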

This format is also used in a recent paper that presents a new distributed version of MCL [26]. As long as our matrix is sparse, the space requirement of $2\,nnz + N + 1$ stays better. Nonetheless, as said before, the expansion can at some point fill the entire matrix. To overcome this limit, we significantly increase the zero-value threshold from the float precision limit, around $10^{-308}$ in Python, to values closer to $10^{-15}$, so that small values are pruned earlier.
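A minimal sketch of this sparse MCL loop with SciPy, where the threshold pruning keeps the matrix sparse between iterations; the convergence test is a crude simplification, not the thesis's exact implementation:

```python
# Sparse MCL: expansion, inflation, pruning, column normalization.
import numpy as np
import scipy.sparse as sp

def normalize_columns(M):
    col_sums = np.asarray(M.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0
    return (M @ sp.diags(1.0 / col_sums)).tocsc()

def sparse_mcl(A, inflation=2.0, threshold=1e-7, max_iter=100):
    M = sp.csc_matrix(A, dtype=float) + sp.identity(A.shape[0], format="csc")
    M = normalize_columns(M)
    for _ in range(max_iter):
        last = M.copy()
        M = M @ M                         # expansion: two random-walk steps
        M = M.power(inflation)            # inflation: element-wise power
        M.data[M.data < threshold] = 0.0  # prune values below the threshold
        M.eliminate_zeros()
        M = normalize_columns(M)
        if abs(M - last).sum() < 1e-9:    # stop when the matrix stabilizes
            break
    return M  # rows with non-zero diagonal entries are the attractors
```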

3.3.7 Comparison on the karate club graph

In order to see how this sparse version of MCL performs, we can use it to detect the communities of the karate club graph [27] and see whether it differs from the original version of the algorithm. This comparison is indeed not possible on our graph, because the memory needed to store the adjacency matrix, and then the matrices the algorithm creates, exceeds the technical limitations of this project. For an inflation factor of 2 and zero-thresholds varying from $10^{-30}$ to $10^{-2}$, we find exactly the same partition for this graph as with the original version.

Table 3.3.1: Results of the Sparse MCL

                   MCL         Sparse MCL
Threshold          (10^-238)   10^-1   10^-2   10^-3   10^-6   10^-7
Iterations         18          9       11      12      12      13
Values stored      1156        850     1136    1140    1156    1156
Time (ms)          3           13      15      15      16      18

Here, we can see in the table that the sparse algorithm needs a few less iterations before convergence (9 to 13 instead of 18), and also needs less memory, as the original version stores the entire adjacency matrix (down to 850 stored values instead of 1156). The sparse version needs much more time on this graph, but as the graph is very small, this does not prove that it would also be the case on a bigger one.

Chapter 4

Results

4.1 Studied graph

The graph we apply these algorithms to is a bipartite graph of cookie ids and email addresses that have been attributed or observed by the company, each edge representing the observation of an email address on a client's website together with the id of the cookie the company put on the user's browser. It is made of billions of edges and hundreds of millions of emails and cookies. A first filter removes cookies linked to more than 10 emails and emails linked to more than 1,000 cookies, as they can be considered outliers. This allows to ignore email addresses such as test@test.com, or cookies of computers shared by many people, for example library computers.

Then the connected components of the graph are computed, which gives small components and one mega-component of millions of emails and still a few hundreds of millions of cookies and edges. Let us define the lifetime of an edge as the time between the first and the last simultaneous observation of the email and the cookie linked by this edge. In order to keep only the skeleton of this component, and because it is no longer possible to consider all edges equally important, we remove all the edges with a lifetime below 24 hours. This can be interpreted as saying: an email and a cookie are linked only if this email has been used for more than one day on the browser that contains the cookie. We then keep only the biggest connected component, which is the most problematic one.
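For illustration, the lifetime filter could be expressed as follows with pandas, under assumed column names (email_hash, cookie_id, timestamp), since the real observation schema is not shown here:

```python
# Keep only (email, cookie) pairs whose observations span at least 24 hours.
import pandas as pd

def skeleton_edges(obs: pd.DataFrame) -> pd.DataFrame:
    span = obs.groupby(["email_hash", "cookie_id"])["timestamp"].agg(
        lambda t: t.max() - t.min())          # lifetime of each edge
    return span[span >= pd.Timedelta("24h")].reset_index()[
        ["email_hash", "cookie_id"]]
```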

This final preprocessing gave one final connected component of 21,884 emails, 80,787 cookies and only 122,206 edges, which indicates that the graph is no longer very densely connected. Visualizations of this graph [28] with force-oriented algorithms, which look for node positions that could reveal a community pattern, do not give any insight. Other metrics give a little more information on the graph. The maximum length of the shortest path between two nodes over all pairs of nodes, i.e. the diameter of the graph, is 61 in this case. One can also wonder whether the graph is made of some kind of chains, long sequences of nodes of degree 2, but most of these chains were not longer than 3 edges. There are no nodes with very high degrees compared to the others, so the graph can be seen as a balanced network without clear patterns.

Nodes here carry almost no information. Cookie ids are generated randomly, and email addresses are hashed with the MD5 algorithm for confidentiality reasons. This algorithm is designed to make finding a string from its hash almost impossible; furthermore, two similar strings will have completely different hashes. This makes it impossible to use the domain of the email addresses, or to compute any relevant distance between node names.

However, edges carry a lot of information. For each observation of an edge, we have its date and time, the IP address of the internet connection, and the type of browser and device. This can be used later for validation.

4.2 Performances of the Sparse MCL

After comparing the MCL algorithm on the karate club graph, we can now see the differences on our graph. The normal version of MCL cannot be applied here, because the size of the adjacency matrix exceeds the memory limit. The results of the sparse MCL are much clearer now: even if the number of iterations does not decrease a lot with the threshold, we can see that for high thresholds the algorithm is much quicker and needs less memory (as it stores only the non-zero values and their positions). Furthermore, in this case too, the partitions found with the different thresholds were identical, except for thresholds close to 1 such as $10^{-1}$ or $10^{-2}$.

Table 4.2.1: Influence of the threshold on the graph of concern

Threshold          10^-4      10^-7      10^-10     10^-13     10^-16
Iterations         65         65         67         69         70
Non-zero values    1.2x10^7   2.4x10^7   1.5x10^8   2.6x10^8   3.0x10^8
Time (ms)          27         51         150        351        512

In the following results, we will keep the threshold of $10^{-7}$ and change only the inflation factor.

4.2.1 Influence of the inflation factor

One way to obtain different results from the MCL algorithm is to change the inflation factor. Here are the different metrics and performances for inflation factors from 1.2 to 6.


As expected, the number of clusters increases with the inflation factor, as the random walker is pushed by the inflation to stay in a closer neighborhood.

We can also see that the maximum number of non-zero values decreases a lot when the inflation factor increases, as lower values drop below the threshold more quickly. This means the algorithm converges faster, so the number of iterations and the time needed also decrease.

We can even check that the average time per iteration decreases, as the number of values to process is lower with a high inflation factor. There is thus a trade-off between the different metrics, depending on the average community size we want. A really low inflation factor would also mean increasing the zero-threshold under the same memory limit, because the maximum number of non-zero values can be much higher.

4.3 Measures of modularity

We can now compare the algorithms with different measures of modularity:


Table 4.3.1: Different modularity measures for every algorithm

Algorithm              Louvain   Bipartite   Infomap     Infomap
                                 Louvain     Two-Level   Multi-Level
Number of clusters     277       211         6615        15023
Modularity             0.9738    0.9730      0.8953      0.7803
Bipartite Modularity   0.9781    0.9785      0.8955      0.7804
Local Modularity       0.04      0.04        0.13        0.20

Algorithm              Sparse MCL
Inflation factor       1.2       2           3           4           6
Number of clusters     1454      17741       20838       23906       28397
Modularity             0.9554    0.7444      0.6912      0.6589      0.6134
Bipartite Modularity   0.9582    0.7445      0.6914      0.6590      0.6135
Local Modularity       0.10      0.13        0.14        0.14        0.14

Most of the algorithms reach high modularity values (as said earlier, values over 0.3 are considered high enough), which shows that there is indeed an underlying community structure. This is especially true for the Louvain algorithms, which is natural as they explicitly maximize this value.

We can then observe that the bipartite modularity is often very close to the classic modularity and does not seem to bring much more insight, likely because the second term of the modularity becomes small compared to the first one as graphs get bigger. Nonetheless, we can check that, with default parameters, the bipartite Louvain reaches a higher bipartite modularity than the original version, and the opposite holds for the classic modularity. Furthermore, the partitions we get from these two algorithms are not that close, so it is interesting to see how a different null model can change the output.

We can also see that the algorithms that output many small clusters have a lower modularity but a higher local modularity than those that produce fewer, bigger clusters. This really shows that different interpretations of the term "community" can lead to completely different rankings of the algorithms; we have here a trade-off between local and global metrics.

4.3.1 Influence of the resolution parameter

The classic version of the Louvain algorithm gives good results, but the clusters of the output partition can be too big if we want to identify the emails and cookies associated to a single person. Here are the results for resolution parameters from 0.0001 to 1, for the classic Louvain algorithm and the bipartite implementation:


The bipartite modularity is not shown, as it was still almost equal to the classic modularity. These values reveal that for smaller resolution parameters the Louvain algorithm indeed finds smaller communities, but the bipartite implementation, which optimizes a different score, actually gives partitions with higher modularity values than the classic version: for a resolution parameter of 0.0001, for example, the modularity obtained with the bipartite Louvain is 0.1 higher. This shows that, knowing the graph is bipartite, the algorithm can actually explore better partitions. Furthermore, even if the modularity and the bipartite modularity are very close because they are computed on a large graph, when the local moving heuristic is applied, the difference between the modularity and bipartite modularity variations is more significant, and it leads to different cluster mergings.

For low resolution parameters, we force the Louvain algorithm to optimize a score quite different from the classic modularity; that is why the latter can be much lower (around 0.5, compared to the almost perfect modularity of 0.974 we have in the classic case).


4.4 Validation

4.4.1 Partition distance

One classic way to validate the algorithms is to use a ground-truth partition, a known solution of the problem against which the output of an algorithm can be compared. In our case, we are able to find such a partition for 4,410 emails of our graph: in another database, these emails are associated with names and zip codes, and the partition is defined by the groups of emails linked to the same name and the same zip code. This data comes from client files storing regular customers, and these groups are much smaller because of the few interactions there are, compared to our graph based on web navigation. We can then compare, for these emails, the partitions given by the different algorithms with the ground-truth partition, using different indices.

Table 4.4.1: Partition distances to the ground-truth partition

Algorithm                       Louvain   Bipartite   Infomap     Infomap
                                          Louvain     Two-Level   Multi-Level
Number of clusters              277       211         6615        15023
Rand index                      0.98797   0.98464     0.99982     0.99994
Adjusted Rand Score             0.004     0.003       0.201       0.369
Normalized Mutual Information   0.020     0.019       0.134       0.287

Algorithm                       Sparse MCL
Inflation factor                1.2       2           3           4           6
Number of clusters              1454      17741       20838       23906       28397
Rand index                      0.99090   0.99995     0.99996     0.99997     0.99997
Adjusted Rand Score             0.005     0.344       0.266       0.199       0.146
Normalized Mutual Information   0.031     0.310       0.268       0.170       0.106

As we can see in the tables, most of the algorithms have an excellent Rand index (the maximum value being 1 for identical partitions), but this shows how irrelevant the index can be without normalization. In our case, most pairs of emails belong to different clusters in the ground-truth partition, and the same holds in the outputs of our algorithms; but even a random algorithm creating very small clusters, or a partition where each node is alone in its cluster, would still score close to 1, since most pairs of emails are supposed to be in different clusters.

NMI and ARI give much more relevant values: they account for the score a random partition would obtain and give more weight to the pairs of nodes that should be together, as these are much rarer. The values are now much lower, but different enough to allow us to compare the algorithms. According to this specific ground-truth partition, the multi-level version of the Infomap algorithm and the sparse MCL algorithm with an inflation factor of 2 seem to perform better than the others, with respective ARI and NMI of 0.344 and 0.310 for the sparse MCL, and 0.369 and 0.287 for the multi-level Infomap. We can also check whether different resolution parameters for the Louvain algorithm give better results.
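
A minimal sketch of this comparison with scikit-learn, assuming ground_truth (as built above) and algo_partition are dicts mapping each email to a cluster id:

    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    # Restrict the comparison to the emails present in both partitions.
    emails = sorted(set(ground_truth) & set(algo_partition))
    labels_true = [ground_truth[e] for e in emails]
    labels_pred = [algo_partition[e] for e in emails]

    print("ARI:", adjusted_rand_score(labels_true, labels_pred))
    print("NMI:", normalized_mutual_info_score(labels_true, labels_pred))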

For the different resolution parameters, we have the following results:

[Figure: ARI and NMI of the Louvain partitions as a function of the resolution parameter]

We can see that for low resolutions around 0.0005, the Louvain algorithm performs almost as well: its NMI and ARI get close to the values obtained with the sparse MCL and the multi-level Infomap, up to 0.3.

4.4.2 IP addresses

As our ground-truth partition is not built on exactly the same kind of interactions, we also validate the results of the algorithms with different methods and with data that has not been used yet. The first such data available for each edge is its set of IP addresses. For each observation, the device on which the cookie is stored accesses the Internet through a connection that appears with a specific IP address. The device can be connected to different networks, so a cookie linked to a single device can still be observed with different IP addresses. For each edge, if the device linked to the cookie was not a mobile (mobiles can be linked to too many different IPs, shared by many devices using the same carrier in the same zone), we keep the most frequent IP address linked to this device and use the following hypothesis.

If one email has been seen with different cookies but on the same IP, the email and these cookies should all be in the same cluster: it probably means that the cookies were either stored on the same device at different times, or stored on different devices that used the same household network. On the contrary, two different emails used on the same device and linked to the same cookie with the same IP address do not have to be in the same cluster, as this situation can occur when someone uses a friend’s computer and enters an email address different from the friend’s.
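
One possible reading of this rule as a counting procedure is sketched below; edges is a hypothetical list of (email, cookie, ip) tuples restricted to desktop devices and their most frequent IP, and partition maps each node to its cluster id:

    from collections import defaultdict

    def ip_rule_violations(edges, partition):
        # Group the cookies of each email by the IP they were seen on.
        by_email_ip = defaultdict(set)
        for email, cookie, ip in edges:
            by_email_ip[(email, ip)].add(cookie)

        inter, invalid = 0, 0
        for email, cookie, ip in edges:
            if partition[email] == partition[cookie]:
                continue  # intra-cluster edges cannot violate the rule
            inter += 1
            # The edge is invalid if the same email was seen on the same
            # IP with another cookie: they should all share a cluster.
            if len(by_email_ip[(email, ip)]) > 1:
                invalid += 1
        return invalid / inter if inter else 0.0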


Table 4.4.2: Performance according to the IP addresses rule

Algorithm             Louvain   Bipartite  Two-Level  Multi-Level
                                Louvain    Infomap    Infomap
Number of clusters    277       211        6615       15023
Inter-cluster edges   2154      1955       12734      26815
Invalid ones          328       304        3042       6488
Ratio                 15.2%     15.5%      23.9%      24.2%

Algorithm: Sparse MCL
Inflation factor      1.2       2          3          4          6
Number of clusters    1454      17741      20838      23906      28397
Inter-cluster edges   13381     31202      37702      41655      47226
Invalid ones          2721      6810       8207       8712       9086
Ratio                 20.3%     21.8%      21.8%      20.9%      19.2%

In order to evaluate a partition, we count the proportion of inter-cluster edges that do not respect this rule. As we can see, the results are not significantly different, except for the Louvain algorithms, which perform slightly better, and even more so when the resolution parameter gets closer to 0.005, with a ratio higher than 25%. The limit of this validation is that it can only be applied to edges linked to desktop devices, whereas mobiles represent an important part of the devices in the graph.

Furthermore, the IP address of a network can change quite often, so this data is not completely reliable. Finally, the value of this ratio should also be computed for a random partition, as its expected value probably changes with the number of clusters.
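
Such a baseline could be estimated by shuffling the cluster labels while keeping the cluster sizes, as in this sketch (reusing the hypothetical ip_rule_violations function above):

    import random

    def shuffled_partition(partition):
        nodes = list(partition)
        labels = [partition[n] for n in nodes]
        random.shuffle(labels)  # preserves the cluster-size distribution
        return dict(zip(nodes, labels))

    # baseline_ratio = ip_rule_violations(edges, shuffled_partition(partition))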

4.4.3 Lifetime of the inter-cluster-edges

To overcome the limit of the last validation method, we use the average lifetime of the edges between clusters, i.e. the edges the algorithms considered as links between communities and that have to be removed to reveal the underlying partition, and compare it with the average lifetime of the edges inside clusters. The lifetime of an edge is the time between the first and the last observation of its cookie and email together. The following table compares the average lifetime of inter-cluster and intra-cluster edges for the outputs of the different algorithms; a computation sketch follows the table.

Table 4.4.3: Average lifetime (in days) of the inter- and intra-cluster edges

Algorithm                           Louvain   Bipartite  Two-Level  Multi-Level
                                              Louvain    Infomap    Infomap
Number of clusters                  277       211        6615       15023
Inter-cluster edges avg. lifetime   38.1      39         53.2       68.5
Intra-cluster edges avg. lifetime   80.9      80.8       83.2       83.4

Algorithm: Sparse MCL
Inflation factor                    1.2       2          3          4          6
Number of clusters                  1454      17741      20838      23906      28397
Inter-cluster edges avg. lifetime   42.8      72.9       77         78.9       79.8
Intra-cluster edges avg. lifetime   81.6      82.6       81.5       80.7       80.2
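
A sketch of how this comparison can be computed, assuming a hypothetical list edges of (email, cookie, first_seen, last_seen) tuples with timestamps in days and a partition dict mapping each node to its cluster id:

    def average_lifetimes(edges, partition):
        inter, intra = [], []
        for email, cookie, first_seen, last_seen in edges:
            lifetime = last_seen - first_seen
            # An edge is intra-cluster when both endpoints share a cluster.
            bucket = intra if partition[email] == partition[cookie] else inter
            bucket.append(lifetime)
        mean = lambda values: sum(values) / len(values) if values else float("nan")
        return mean(inter), mean(intra)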

We can see here that the intra-cluster edges average lifetime is always higher than the inter-cluster edges average lifetime. However, the difference is much more significant for the algorithms that gave fewer clusters, with a minimum inter-cluster average lifetime of 38.1 for the Louvain algorithm. We can even see a pattern: when the number of clusters increases, so does the average lifetime of the inter-cluster edges. An interpretation is that the edges between the real communities have low lifetimes because, like anomalies, they occur rarely. The algorithms that create bigger communities have fewer edges to delete from the original graph in order to obtain the partition, and so they remove only the least trustworthy ones, which have low lifetimes. On the contrary, the algorithms that give small communities have more edges to cut to reach the final partition, and if they cut the edges roughly in ascending order of lifetime, the average can only be higher.

We can see that with lower values of the resolution parameter we can get much better results: we keep a low average lifetime for the inter-cluster edges but obtain a higher average lifetime for the intra-cluster edges, which was not the case for the other algorithms. This also validates the preprocessing of the graph, in which we deleted edges with very low lifetimes. For example, for a resolution parameter of 5 we have 300 clusters, and the inter-cluster edges average lifetime is 40 days, much lower than the intra-cluster edges average lifetime of around 81 days, which is very close to the global average. At the opposite end, for a resolution parameter of 0.0001, the inter-cluster edges average lifetime is around 65 days, but the intra-cluster edges average lifetime is between 95 and 100 days, significantly higher than the global average. This tendency confirms that this algorithm also cuts edges in ascending order of lifetime when it produces more clusters, even more so than the other algorithms.

4.5 Time

Most of these algorithms were implemented in Python. The implementation of Infomap available on http://www.mapequation.org [29] actually uses a wrapper around a C++ implementation. The running times are:


Table 4.5.1: Running time of the different algorithms

Algorithm          Louvain   Bipartite  Two-Level  Multi-Level
                             Louvain    Infomap    Infomap
Time (seconds)     28        31         3          9

Algorithm: Sparse MCL
Inflation factor   1.2       2          3          4          6
Time (seconds)     180       34         25         20         19

We can see that Infomap runs in less than 10 seconds, so it is much faster thanks to its C++ implementation. The Louvain algorithm is also quite fast, and the resolution parameter does not significantly influence the running time over our range of values.

4.6 Synthesis

This last validation method suggests that, to be cautious when building A/B testing groups, it is better to use the Louvain algorithms or the Sparse MCL with an inflation factor of 1.2: both seem to cut fewer but more relevant edges and give larger clusters, which makes mistakes less likely, and we can afford large communities in this context. Louvain is faster and has a linear memory complexity, so it will scale better on larger graphs.

If we need to identify communities that represent individuals and therefore must be much smaller, we would prefer the bipartite Louvain with a resolution parameter between 0.001 and 0.005.

References
