
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-A--20/057--SE

An Analysis of Notions of Differential Privacy for Edge-Labeled Graphs

En analys av olika uppfattningar om differentiell integritet i grafer med kantetiketter

Robin Christensen

Supervisor: Patrick Lambrix
Examiner: Olaf Hartig

Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

The user data in social media platforms is an excellent source of information that is beneficial for both commercial and scientific purposes. However, recent times have seen that the user data is not always used for good, which has led to higher demands on user privacy. With accurate statistical research data being just as important as the privacy of the user data, the relevance of differential privacy has increased. Differential privacy allows user data to be accessible under certain privacy conditions at the cost of accuracy in query results, which is caused by noise. The noise is based on a tuneable constant ε and the global sensitivity of a query. The query sensitivity is defined as the greatest possible difference in query result between the queried database and a neighboring database. Whereas the neighboring database is defined to differ by one record in a tabular database, there are multiple neighborhood notions for edge-labeled graphs.

This thesis considers the notions of edge neighborhood, node neighborhood, QL-edge neighborhood and QL-outedges neighborhood. To study these notions, a framework was developed in Java to function as a query mechanism for a graph database. ArangoDB was used as storage for the graphs, which were generated by parsing data sets in the RDF format as well as through a graph synthesizer in the developed framework. Querying a database in the framework is done with Apache TinkerPop, and a Laplace distribution is used when generating noise for the query results. The framework was used to study the privacy and utility trade-off of different histogram queries on a number of data sets, while employing the different notions of neighborhood in edge-labeled graphs. The level of privacy is determined by the value of ε, and the utility is defined as a measurement based on the L1-distance between the true and noisy result. In the general case, the notions of edge neighborhood and QL-edge neighborhood are the better alternatives in terms of privacy and utility. However, there are indications that node neighborhood and QL-outedges neighborhood are considerable options for larger graphs, where the level of privacy for edge neighborhood and QL-edge neighborhood appears to be negligible based on utility measurements.


Acknowledgments

I would like to express my gratitude towards Olaf Hartig for the valuable feedback, as well as his flexibility and pedagogical guidance. I would also like to thank my family for their encouragement and support throughout my education at Linköping University.


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables xi

1 Introduction 1
1.1 Motivation . . . 2
1.2 Aim . . . 2
1.3 Research questions . . . 3
1.4 Delimitations . . . 3

2 Theory 4
2.1 Statistical graph databases . . . 4

2.2 Differential privacy . . . 9

2.3 Noise generation . . . 10

2.4 Notions of differential privacy in edge-labeled graphs . . . 13

2.5 Related work . . . 16

3 Method 18
3.1 Analysis framework . . . 18
3.2 Framework components . . . 20
3.3 Evaluation . . . 23

4 Results 27
4.1 Implementation . . . 27
4.2 Framework robustness . . . 36

4.3 Query and notion analysis . . . 44

5 Discussion 49
5.1 Results . . . 49

5.2 Method . . . 53

5.3 The work in a wider context . . . 54

6 Conclusion 56
6.1 Future work . . . 57

Bibliography 58


B Query and notion analysis - Complete set of charts 79

C Graph synthesizer test results 122


List of Figures

2.1 Graph representation of a relation between two nodes. . . 5

2.2 Two communities in a graph. . . 5

2.3 Graph A is disconnected, graph B is weakly connected and graph C is strongly connected. . . 5

2.4 Simple histogram example. . . 7

2.5 Different node degree measurements on the format: [both, in, out]. . . 8

2.6 Simple histogram example. . . 8

2.7 Model of how the query mechanism M_Q works in practice. Query Q is sent to the database, then M_Q adds noise to the true result R to become R′. . . 10

2.8 Overview of Laplace distributions with different values on λ . . . . 12

2.9 Original graph used for comparison when demonstrating neighboring graph with some notion. . . 14

2.10 Neighboring graphs defined by the four different notions described in this section. . . 14

2.11 Graph consisting of six triangles, where the middle edge is involved in all of them. . . 15
2.12 Neighboring graphs of the graph in Figure 2.11 defined by both edge and node difference. . . 15

2.13 Worst case of betweenness queries for edge difference. . . 16

3.1 A parser goes through each triple in RDF file and creates a graph representation. . 21

3.2 Trade-off chart example. . . 26

4.1 Architecture of the framework. The framework includes the model in Figure 2.7, as well as functionalities for loading and creating a database. . . 27

4.2 Flow chart of the framework execution process. . . 29

4.3 Results from evaluation cases SEC1-3. . . 37

4.4 Results from evaluation cases SEC1,4-5. . . 37

4.5 Results from evaluation cases SEC1,6-7. . . 38

4.6 Results from evaluation cases SEC1,8-9. . . 38

4.7 Results from evaluation cases SEC10-12. . . 39

4.8 Results from evaluation cases SEC13-15. . . 39

4.9 Results from evaluation cases SEC1,16-17. . . 40

4.10 Results from evaluation cases SEC18-20. . . 40

4.11 Results from evaluation cases SEC21-23. . . 41

4.12 Results from evaluation case NEC1. . . 41

4.13 Results from evaluation case NEC4. . . 42

4.14 Results from evaluation case NEC5. . . 42

4.15 Results from evaluation case NEC8. . . 43

4.16 Results from evaluation case NEC9. . . 43

4.17 Results from evaluation case NEC12. . . 44

4.18 Privacy and utility trade-off for Q2 on capitals data set with ε-values ranging from 0.001 to 1. . . 45

4.19 Privacy and utility trade-off for Q3 and Q5 on capitals data set with ε-values ranging from 0.001 to 1. . . 46
4.20 Privacy and utility trade-off for Q4 and Q6 on capitals data set with ε-values ranging from 0.001 to 1. . . 46
4.21 Privacy and utility trade-off for Q2 on synthC data set with ε-values ranging from 0.001 to 1. . . 47
4.22 Privacy and utility trade-off for Q3 and Q5 on synthC data set with ε-values ranging from 0.001 to 1. . . 47
4.23 Privacy and utility trade-off for Q4 and Q6 on synthC data set with ε-values ranging from 0.001 to 1. . . 48
B.1 Privacy and utility trade-off for Q2 on continents.rdf where ε is in the interval [0.001,1]. . . 79
B.2 Privacy and utility trade-off for Q2 on continents.rdf over specific intervals. . . 80
B.3 Privacy and utility trade-off for Q3 and Q5 on continents.rdf where ε is in the interval [0.001,1]. . . 80
B.4 Privacy and utility trade-off for Q3 and Q5 on continents.rdf over specific intervals. . . 81
B.5 Privacy and utility trade-off for Q4 and Q6 on continents.rdf where ε is in the interval [0.001,1]. . . 81
B.6 Privacy and utility trade-off for Q4 and Q6 on continents.rdf over specific intervals. . . 82
B.7 Privacy and utility trade-off for Q2 on capitals.rdf where ε is in the interval [0.001,1]. . . 82
B.8 Privacy and utility trade-off for Q2 on capitals.rdf over specific intervals. . . 83
B.9 Privacy and utility trade-off for Q3 and Q5 on capitals.rdf where ε is in the interval [0.001,1]. . . 83
B.10 Privacy and utility trade-off for Q3 and Q5 on capitals.rdf over specific intervals. . . 84
B.11 Privacy and utility trade-off for Q4 and Q6 on capitals.rdf where ε is in the interval [0.001,1]. . . 84
B.12 Privacy and utility trade-off for Q4 and Q6 on capitals.rdf over specific intervals. . . 85
B.13 Privacy and utility trade-off for Q2 on currencies.rdf where ε is in the interval [0.001,1]. . . 85
B.14 Privacy and utility trade-off for Q2 on currencies.rdf over specific intervals. . . 86
B.15 Privacy and utility trade-off for Q3 and Q5 on currencies.rdf where ε is in the interval [0.001,1]. . . 86
B.16 Privacy and utility trade-off for Q3 and Q5 on currencies.rdf over specific intervals. . . 87
B.17 Privacy and utility trade-off for Q4 and Q6 on currencies.rdf where ε is in the interval [0.001,1]. . . 87
B.18 Privacy and utility trade-off for Q4 and Q6 on currencies.rdf over specific intervals. . . 88
B.19 Privacy and utility trade-off for Q2 on locations.rdf where ε is in the interval [0.001,1]. . . 88
B.20 Privacy and utility trade-off for Q2 on locations.rdf over specific intervals. . . 89
B.21 Privacy and utility trade-off for Q3 and Q5 on locations.rdf where ε is in the interval [0.001,1]. . . 89
B.22 Privacy and utility trade-off for Q3 and Q5 on locations.rdf over specific intervals. . . 90
B.23 Privacy and utility trade-off for Q4 and Q6 on locations.rdf where ε is in the interval [0.001,1]. . . 90
B.24 Privacy and utility trade-off for Q4 and Q6 on locations.rdf over specific intervals. . . 91
B.25 Privacy and utility trade-off for Q2 on countries.rdf where ε is in the interval [0.001,1]. . . 91
B.26 Privacy and utility trade-off for Q2 on countries.rdf over specific intervals. . . 92
B.27 Privacy and utility trade-off for Q3 and Q5 on countries.rdf where ε is in the interval [0.001,1]. . . 92
B.28 Privacy and utility trade-off for Q3 and Q5 on countries.rdf over specific intervals. . . 93
B.29 Privacy and utility trade-off for Q4 and Q6 on countries.rdf where ε is in the interval [0.001,1].
B.30 Privacy and utility trade-off for Q4 and Q6 on countries.rdf over specific intervals. . . 94
B.31 Privacy and utility trade-off for Q2 on pathways.rdf where ε is in the interval [0.001,1]. . . 94
B.32 Privacy and utility trade-off for Q2 on pathways.rdf over specific intervals. . . 95
B.33 Privacy and utility trade-off for Q3 and Q5 on pathways.rdf where ε is in the interval [0.001,1]. . . 95
B.34 Privacy and utility trade-off for Q3 and Q5 on pathways.rdf over specific intervals. . . 96
B.35 Privacy and utility trade-off for Q4 and Q6 on pathways.rdf where ε is in the interval [0.001,1]. . . 96
B.36 Privacy and utility trade-off for Q4 and Q6 on pathways.rdf over specific intervals. . . 97
B.37 Privacy and utility trade-off for Q2 on diseases.rdf where ε is in the interval [0.001,1]. . . 97
B.38 Privacy and utility trade-off for Q2 on diseases.rdf over specific intervals. . . 98
B.39 Privacy and utility trade-off for Q3 and Q5 on diseases.rdf where ε is in the interval [0.001,1]. . . 98
B.40 Privacy and utility trade-off for Q3 and Q5 on diseases.rdf over specific intervals. . . 99
B.41 Privacy and utility trade-off for Q4 and Q6 on diseases.rdf where ε is in the interval [0.001,1]. . . 99
B.42 Privacy and utility trade-off for Q4 and Q6 on diseases.rdf over specific intervals. . . 100
B.43 Privacy and utility trade-off for Q2 on enzyme.rdf where ε is in the interval [0.001,1]. . . 100
B.44 Privacy and utility trade-off for Q2 on enzyme.rdf over specific intervals. . . 101
B.45 Privacy and utility trade-off for Q3 and Q5 on enzyme.rdf where ε is in the interval [0.001,1]. . . 101
B.46 Privacy and utility trade-off for Q3 and Q5 on enzyme.rdf over specific intervals. . . 102
B.47 Privacy and utility trade-off for Q4 and Q6 on enzyme.rdf where ε is in the interval [0.001,1]. . . 102
B.48 Privacy and utility trade-off for Q4 and Q6 on enzyme.rdf over specific intervals. . . 103
B.49 Privacy and utility trade-off for Q2 on stw.rdf where ε is in the interval [0.001,1]. . . 103
B.50 Privacy and utility trade-off for Q2 on stw.rdf over specific intervals. . . 104
B.51 Privacy and utility trade-off for Q3 and Q5 on stw.rdf where ε is in the interval [0.001,1]. . . 104
B.52 Privacy and utility trade-off for Q3 and Q5 on stw.rdf over specific intervals. . . 105
B.53 Privacy and utility trade-off for Q4 and Q6 on stw.rdf where ε is in the interval [0.001,1]. . . 105
B.54 Privacy and utility trade-off for Q4 and Q6 on stw.rdf over specific intervals. . . 106
B.55 Privacy and utility trade-off for Q2 on lexvo.rdf where ε is in the interval [0.001,1]. . . 106
B.56 Privacy and utility trade-off for Q2 on lexvo.rdf over specific intervals. . . 107
B.57 Privacy and utility trade-off for Q3 and Q5 on lexvo.rdf where ε is in the interval [0.001,1]. . . 107
B.58 Privacy and utility trade-off for Q3 and Q5 on lexvo.rdf over specific intervals. . . 108
B.59 Privacy and utility trade-off for Q4 and Q6 on lexvo.rdf where ε is in the interval [0.001,1]. . . 108
B.60 Privacy and utility trade-off for Q4 and Q6 on lexvo.rdf over specific intervals. . . 109
B.61 Privacy and utility trade-off for Q2 on geospecies.rdf where ε is in the interval [0.001,1]. . . 109
B.62 Privacy and utility trade-off for Q2 on geospecies.rdf over specific intervals. . . 110
B.63 Privacy and utility trade-off for Q3 and Q5 on geospecies.rdf where ε is in the interval [0.001,1]. . . 110
B.64 Privacy and utility trade-off for Q3 and Q5 on geospecies.rdf over specific intervals. . . 111
B.65 Privacy and utility trade-off for Q4 and Q6 on geospecies.rdf where ε is in the interval [0.001,1]. . . 111
B.66 Privacy and utility trade-off for Q4 and Q6 on geospecies.rdf over specific intervals. . . 112
B.67 Privacy and utility trade-off for Q2 on synthA where ε is in the interval [0.001,1]. . . 112
B.68 Privacy and utility trade-off for Q2 on synthA over specific intervals. . . 113


B.69 Privacy and utility trade-off for Q3 and Q5 on synthA where ε is in the interval

[0.001,1]. . . 113

B.70 Privacy and utility trade-off for Q3 and Q5 on synthA over specific intervals. . . . 114

B.71 Privacy and utility trade-off for Q4 and Q6 on synthA where ε is in the interval [0.001,1]. . . 114

B.72 Privacy and utility trade-off for Q4 and Q6 on synthA over specific intervals. . . . 115

B.73 Privacy and utility trade-off for Q2 on synthB where ε is in the interval [0.001,1]. . 115

B.74 Privacy and utility trade-off for Q2 on synthB over specific intervals. . . 116

B.75 Privacy and utility trade-off for Q3 and Q5 on synthB where ε is in the interval [0.001,1]. . . 116

B.76 Privacy and utility trade-off for Q3 and Q5 on synthB over specific intervals. . . . 117

B.77 Privacy and utility trade-off for Q4 and Q6 on synthB where ε is in the interval [0.001,1]. . . 117

B.78 Privacy and utility trade-off for Q4 and Q6 on synthB over specific intervals. . . . 118

B.79 Privacy and utility trade-off for Q2 on synthC where ε is in the interval [0.001,1]. . 118

B.80 Privacy and utility trade-off for Q2 on synthC over specific intervals. . . 119

B.81 Privacy and utility trade-off for Q3 and Q5 on synthC where ε is in the interval [0.001,1]. . . 119

B.82 Privacy and utility trade-off for Q3 and Q5 on synthC over specific intervals. . . . 120

B.83 Privacy and utility trade-off for Q4 and Q6 on synthC where ε is in the interval [0.001,1]. . . 120

B.84 Privacy and utility trade-off for Q4 and Q6 on synthC over specific intervals. . . . 121

D.1 Results from evaluation case NEC2. . . 126

D.2 Results from evaluation case NEC3. . . 127

D.3 Results from evaluation case NEC6. . . 127

D.4 Results from evaluation case NEC7. . . 127

D.5 Results from evaluation case NEC10. . . 128


List of Tables

1.1 Types of queries included in the study . . . 3

2.1 Typical statistical queries . . . 6

2.2 Used vehicles for sale . . . 7

2.3 Histogram array X . . . 7

3.1 Framework requirements . . . 19

3.2 Types of queries created from use cases. . . 22

3.3 Data sets involved in the study . . . 23

3.4 Evaluation cases for graph synthesizer robustness . . . 24

3.5 Evaluation cases for noise generator robustness . . . 25

4.1 Framework input parameters . . . 28

4.2 Graph synthesizer requirements . . . 30

4.3 Greatest possible difference in query result for each query-notion combination . . 35

4.4 True and noisy results of Q2 on capitals data set . . . . 45

4.5 True and noisy results of Q2 on synthC data set . . . . 48

A.1 True result of Q2 . . . 60

A.2 Noisy result of Q2 (ε=1) . . . 61

A.3 Noisy result of Q2 (ε=0.1) . . . 61

A.4 Noisy result of Q2 (ε=0.01) . . . 62

A.5 Noisy result of Q2 (ε=0.001) . . . 62

A.6 True result of Q3 . . . 63

A.7 Noisy result of Q3 (ε=1) . . . 63

A.8 Noisy result of Q3 (ε=0.1) . . . 64

A.9 Noisy result of Q3 (ε=0.01) . . . 64

A.10 Noisy result of Q3 (ε=0.001) . . . 65

A.11 True result of Q4 . . . 65

A.12 Noisy result of Q4 (ε=1) . . . 66

A.13 Noisy result of Q4 (ε=0.1) . . . 66

A.14 Noisy result of Q4 (ε=0.01) . . . 67

A.15 Noisy result of Q4 (ε=0.001) . . . 67

A.16 True result of Q5(10) . . . 68

A.17 Noisy result of Q5(10) (ε=1) . . . 68

A.18 Noisy result of Q5(10) (ε=0.1) . . . 69

A.19 Noisy result of Q5(10) (ε=0.01) . . . 69

A.20 Noisy result of Q5(10) (ε=0.001) . . . 70

A.21 True result of Q5(50) . . . 70

A.22 Noisy result of Q5(50) (ε=1) . . . 70

A.23 Noisy result of Q5(50) (ε=0.1) . . . 72


A.25 Noisy result of Q5(50) (ε=0.001) . . . 73

A.26 True result of Q6(10) . . . 73

A.27 Noisy result of Q6(10) (ε=1) . . . 74

A.28 Noisy result of Q6(10) (ε=0.1) . . . 74

A.29 Noisy result of Q6(10) (ε=0.01) . . . 75

A.30 Noisy result of Q6(10) (ε=0.001) . . . 75

A.31 True result of Q6(50) . . . 75

A.32 Noisy result of Q6(50) (ε=1) . . . 77

A.33 Noisy result of Q6(50) (ε=0.1) . . . 77

A.34 Noisy result of Q6(50) (ε=0.01) . . . 78

A.35 Noisy result of Q6(50) (ε=0.001) . . . 78

C.1 Average true result for evaluation cases related to the number of nodes (SEC1-3) . . . 122
C.2 Standard deviation for evaluation cases related to the number of nodes (SEC1-3) . . . 122
C.3 Average true result for evaluation cases related to the number of edges (SEC1,4-5) . . . 122
C.4 Standard deviation for evaluation cases related to the number of edges (SEC1,4-5) . . . 123
C.5 Average true result for evaluation cases related to minimal community size (SEC1,6-7) . . . 123

C.6 Standard deviation for evaluation cases related to minimal community size (SEC1,6-7) . . . 123

C.7 Average true result for evaluation cases related to graph size (SEC1,8-9) . . . 123

C.8 Standard deviation for evaluation cases related to graph size (SEC1,8-9) . . . 123

C.9 Average true result for evaluation cases related to number of different labels (SEC10-12) . . . 124

C.10 Standard deviation for evaluation cases related to number of different labels (SEC10-12) . . . 124

C.11 Average true result for evaluation cases related to number of triangles (SEC13-15) . . . 124
C.12 Standard deviation for evaluation cases related to number of triangles (SEC13-15) . . . 124
C.13 Average true result for evaluation cases where number of buckets are six with a varying bucket width (SEC1,16-17) . . . 124

C.14 Standard deviation for evaluation cases where number of buckets are six with a varying bucket width (SEC1,16-17) . . . 125

C.15 Average true result for evaluation cases where number of buckets are two with a varying bucket width (SEC18-20) . . . 125

C.16 Standard deviation for evaluation cases where number of buckets are two with a varying bucket width (SEC18-20) . . . 125

C.17 Average true result for evaluation cases where number of buckets are ten with a varying bucket width (SEC21-23) . . . 125

C.18 Standard deviation for evaluation cases where number of buckets are ten with a varying bucket width (SEC21-23) . . . 125

D.1 Average noise produced after a number of iterations when running NEC1 . . . 128

D.2 Average noise produced after a number of iterations when running NEC2 . . . 129

D.3 Average noise produced after a number of iterations when running NEC3 . . . 129

D.4 Average noise produced after a number of iterations when running NEC4 . . . 129

D.5 Average noise produced after a number of iterations when running NEC5 . . . 129

D.6 Average noise produced after a number of iterations when running NEC6 . . . 130

D.7 Average noise produced after a number of iterations when running NEC7 . . . 130

D.8 Average noise produced after a number of iterations when running NEC8 . . . 130

D.9 Average noise produced after a number of iterations when running NEC9 . . . 130

D.10 Average noise produced after a number of iterations when running NEC10 . . . . 131

D.11 Average noise produced after a number of iterations when running NEC11 . . . . 131

1 Introduction

Asking someone about their personal information is a very sensitive matter, and such information is not always something a careful person would want publicly available. Considering how the World Wide Web and social media platforms have thrived in the last decade, however, the question one should ask oneself is not whether the information is hidden, but rather "who has it?". There were around 2.5 billion social media users in 2018 [1], and what the information about individuals and their activities is used for is a hot topic. One of the most widely known recent incidents involving social media is the 2018 Facebook scandal surrounding the vast collection of user data in the possession of Cambridge Analytica, a company that engages in political consulting and strategic communication [2]. The user data was obtained through an application run on the Facebook platform, and allowed the collection of profile information, including what the users had liked. This was possible because Facebook made no effort to stop companies from carrying out these activities, and because users gave consent by accepting terms without care [2].

What is particularly worrying in the case of Cambridge Analytica is that Kosinski et al. [3] have shown that knowing what people on Facebook like is worse than initially thought. He and his colleagues came up with an accurate model that uses such information to predict very sensitive data like sexual orientation, political views, use of substances and personality [3]. Given these disturbing findings, and the fact that Cambridge Analytica backed the United States president's campaign for the presidency, claims arose that Cambridge Analytica used this data to target individuals with advertisements related to his campaign [4]. Another example involves Facebook claiming not to share personal information with third parties, which proved to be wrong when it was shown that they in fact do so through targeted advertising [5].

An attempt at protecting the individual is anonymization of data, e.g. by replacing names with opaque identifiers. However, releasing anonymized data has proven to not always be a reliable solution. There are plenty of examples where anonymized data has successfully been de-identified, some of which are told in the stories in the work of Heffetz and Ligett [6]. The stories they present demonstrate how anonymized information can be de-anonymized using public records, where the information is not anonymized. Comparing those records to the anonymized ones has proven to be a red flag for anonymized information [6]. A method for countering these scenarios is k-anonymity. However, k-anonymity is far from perfect. On one hand, when the anonymization is too strong, data can become worthless for research in the sense that it may be misleading. On the other hand, k-anonymity is vulnerable to both the homogeneity attack and the background knowledge attack [7]. These attacks are based on weaknesses in the data set and knowledge about the individuals involved in it. In both cases, k-anonymity potentially fails at preserving confidentiality.

Dealing with the problems of user data privacy is paramount, as social media platforms are gold mines of information. Such information, in both quality and quantity, has grown to be very beneficial for social science research thanks to the rise of social media. The vast network of people and their relations to one another is perfect for studying information flow. However, as its use has risen, so have the ethical issues surrounding it [8]. In the end the discussion comes down to how users' rights to privacy can be upheld while at the same time benefiting social science and commercial purposes with statistical data about people. A solution that satisfies both sides of the coin is differential privacy [9][10][6].

1.1 Motivation

The idea of differential privacy is well motivated by Dwork [9][11] as a means of providing privacy by hiding users' participation in a data set. In short, differential privacy can be achieved by adding noise to the query result, where the amount of noise depends on the sensitivity of the query [9]. The query sensitivity is determined by how much the true result can vary when a row is removed from the database. In the case of Cambridge Analytica, everyone who used the specific application was involved in the data collection. When differential privacy is applied, each individual can claim that they are not part of a data set and no one will know the truth [9][11].

Differential privacy is a field of study with a long history. However, it has not been studied enough within edge-labeled graphs, as pointed out by Reuben [12]. Task et al. [10] provide an adaptation of the theory presented by Dwork that can be applied to graph structures, and Reuben presents a theoretical approach concerning edge-labeled graphs [12]. Edge-labeled graphs are particularly interesting, not only due to the lack of study, but also because they are a common structure for social graphs on social media platforms. Given the popularity of social media, this is a good reason to study differential privacy in edge-labeled graphs; more specifically, how to maximize privacy while providing accurate query results. Such an analysis can prove useful when defining the privacy-preserving mechanisms of databases storing social graphs, as well as for future work in the area.

1.2 Aim

The first part of the aim of this thesis was to conduct separate studies of both the impact that the choice of notion of differential privacy in edge-labeled graphs has on the amount of noise added to true query results, and the impact that the type of query has on that amount of noise. The notions included in this work are listed below and are defined in Section 2.4.

• Edge neighborhood
• Node neighborhood
• QL-edge neighborhood
• QL-outedges neighborhood

Histogram queries are particularly interesting, not only because they provide more information compared to single-value queries, but also because they have low sensitivity in general [9][10]. This is essential for providing accurate results, as a higher query sensitivity means more noise being added. The types of queries included in this study are found in Table 1.1 and defined in Table 3.2.


Table 1.1: Types of queries included in the study

Query Return type

Maximum node degree An integer

Degree distribution Histogram

In-degree distribution Histogram

Out-degree distribution Histogram
Label-specific in-degree distribution Histogram
Label-specific out-degree distribution Histogram

Triangle count Histogram

To advance the study of notions of differential privacy for graph databases with labeled edges, the second part of the aim is to study the privacy and utility trade-off of the different notions in edge-labeled graphs. For the conducted study, an empirical analysis framework was developed. The framework was coded in Java and uses a graph computing framework called Apache TinkerPop [13] that allows for integration with graph databases stored with ArangoDB [14]. The data sets used in this study consist of real-world examples as well as synthetic graphs. The real-world examples were fetched in the form of RDF-file dumps [15], and the synthetic graphs were created by the developed framework.
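The thesis framework itself is not reproduced here, but the following minimal sketch shows the kind of graph query it issues. It uses TinkerPop's in-memory TinkerGraph reference implementation instead of the ArangoDB-backed storage described above, and all vertex labels, property names and the class name are purely illustrative.

    import java.util.Map;
    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
    import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

    public class DegreeDistributionExample {
        public static void main(String[] args) {
            // In-memory TinkerPop graph used as a stand-in for the ArangoDB-backed graph.
            TinkerGraph graph = TinkerGraph.open();
            GraphTraversalSource g = graph.traversal();

            // A tiny edge-labeled social graph: alice follows bob, bob likes carol.
            g.addV("person").property("name", "alice").as("a")
             .addV("person").property("name", "bob").as("b")
             .addV("person").property("name", "carol").as("c")
             .addE("follows").from("a").to("b")
             .addE("likes").from("b").to("c")
             .iterate();

            // True (noise-free) degree distribution: node degree -> number of nodes with that degree.
            Map<Object, Long> degreeDistribution =
                    g.V().groupCount().by(__.bothE().count()).next();
            System.out.println(degreeDistribution); // prints {1=2, 2=1} for this toy graph
        }
    }

A noise-adding query mechanism would post-process such a true result before returning it, as described in Section 2.2.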

1.3 Research questions

Using the framework, the aim of this thesis is to answer the following questions:

1. How does the choice of notion affect the noise that is added to the true query result?
2. How does the type of query affect the noise that is added to the true query result?
3. Which notion(s) has the best performance in terms of privacy and utility trade-off?

1.4 Delimitations

The empirical study in this work was limited to the edge neighborhood, node neighborhood, QL-edge neighborhood and QL-outedges neighborhood notions as well as the queries in Table 1.1. The developed framework only supports the mentioned notions and queries, but is expandable given that the global sensitivity of the query is determined and provided in the framework for each notion. The real-world data sets were only acceptable up to a certain size, as the parsing time of larger files would be very extensive. Larger data sets could instead be provided through synthesis.

2 Theory

This chapter captures the underlying theory used in this thesis, which is necessary in order to understand how the methods were used to achieve the results. The chapter starts off with an introduction to statistical databases and the basics regarding graph structures that were used in this thesis when synthesizing graphs and processing graph data. This is followed by statistical database queries and histograms, which leads on to differential privacy and noise determination. Finally, the notions of differential privacy in edge-labeled graphs that were used in this thesis are described and relevant works are presented.

2.1 Statistical graph databases

Statistical databases are very useful for multiple purposes. Compared to regular databases, statistical databases are characterized by storing statistical data, which can be official data like census data [9][16]. It can also be data recorded through e.g. purchases or website visits, or provided directly by people. The statistical data in these types of databases can be utilized by scientists, companies or governments to benefit social science, provide efficient commercial plans and optimize the use of resources in different areas [16].

Graph data

Compared to tabular databases, which are built up of rows (tuples) and columns (attributes) of data, the most basic components in a graph database are nodes and edges. Together, they build up entire graphs where edges connect nodes to each other. Graphs can be useful for describing systems or platforms and capturing the relationships involved, and their structures usually vary depending on the type of application. Networking data, web link analysis and chemical data analysis are examples of domains that would yield different graph structures [17].

Social graphs are often used to describe social platforms like Facebook or Twitter, to visualize people and their relations to one another [18]. Additional graph components like various edge and node properties, e.g. labels, can be used to describe the graph in more detail. Edge labels are useful for describing the type of relationship that exists between nodes or individuals. For a graph describing Twitter relations, example labels are "likes" and "follows". A simple example of the graph components considered in this thesis is shown in Figure 2.1, which consists of two nodes and a directed labeled edge.

Figure 2.1: Graph representation of a relation between two nodes.

If the relation in Figure 2.1 represents the connection between two individuals in a social graph, a group of individuals in such a graph can form a community. A community in a graph is essentially a smaller graph within the graph, which on a social platform could correspond to a group of people being members of a certain association or fan club [19]. See Figure 2.2 for an example of two communities that form a graph.

Figure 2.2: Two communities in a graph.

The graph in Figure 2.2 satisfies the criterion for weak connectivity, which is that there exists an undirected path from each node to all other nodes [20]. Directed graphs are either disconnected, or they fulfill the requirement for weak or strong connectivity. Compared to weak connectivity, the criterion for strong connectivity is satisfied if there exists a directed path from each node to all other nodes. If there exists a pair of nodes that have no undirected path between them, the graph is disconnected. See Figure 2.3 for examples of each condition.

Figure 2.3: Graph A is disconnected, graph B is weakly connected and graph C is strongly connected.

In Figure 2.3, the cause for disconnection in graph A is simply that node Z has no connection to the rest of the graph. Graph B is weakly connected, as the edges connect all nodes to each other regardless of their direction. Graph B does not satisfy strong connectivity due to the lack of a directed path from W to Z using the existing edges. The strong connectivity in graph C is caused by the fact that all nodes are on the path that connects all nodes together, which in this case forms a cycle. Besides the basics of graph structure, communities and graph connectivity, other concepts that are useful for synthesizing graphs are node degrees and clustering coefficients [19]. Before those concepts are brought up in detail, statistical database queries and histograms require attention.

Statistical database queries

Queries for statistical databases return information about groups of things (e.g. people), and the information is usually of a numerical nature. For a set of people, asking about the names of people having a specific disease is not statistically useful. Asking for the average salary among a group of people, however, is a statistical query, as it does not ask for specific information regarding a specific individual [9][10]. Typical statistical queries, with an example of their usage, follow in Table 2.1 below.

Table 2.1: Typical statistical queries

Query Usage

COUNT How many users on a website are from Sweden?

MAX What is the highest score on the leaderboard?

MIN How much does the cheapest car in the shop cost?

AVG What is the average wage in a working department?

SUM What is the total wage paid out to each working department?

The queries in Table 2.1 all have one thing in common: they all return a single number. This is not very useful on its own, due to the limited information value of a single number. To provide useful data for research, such queries would have to be run in combination and an unknown number of times, depending on the size of the database. This is not very efficient. A more efficient way to provide research data is through histogram queries [10][9].

Histogram

Histogram queries are, from a research perspective, more feasible than queries that only return a single number. The reason is that much more information can be obtained through a histogram query, as they return an array of values [10][9]. Consider the example data set of used vehicles for sale in Table 2.2 below.


Table 2.2: Used vehicles for sale

Id  Color   Year  Price
1   Yellow  2008  115 000
4   Red     2002   24 000
9   Blue    2005   84 000
10  Red     2006   98 000
12  Orange  1968  265 000
18  Green   2013  183 000
21  Black   2008   81 000
22  White   2008  120 000
26  White   2011  178 000

A possible histogram query for this data set is to ask about the price of the cars; more specifically, how many cars there are in specific price ranges. Assuming positive values and considering prices ranging up to 200 000 and beyond, as well as a length of 25 000 for each interval, such a query gives 9 intervals. The first 8 are simply the ones in the range between 0 and 200 000, and the last one ranges from 200 000 to infinity. These intervals are represented by "buckets" in a histogram, which is seen in Figure 2.4 below.

Figure 2.4: Simple histogram example.

Table 2.3: Histogram array X

X1 X2 X3 X4 X5 X6 X7 X8 X9

1 0 0 3 2 0 0 2 1

Table 2.3 shows the resulting array based on the histogram in Figure 2.4. Looking at the data set in Table 2.2, there are 3 cars in the price range between 75 000 and 100 000. That price range is the fourth interval, which can be seen in the histogram. For the resulting array, the fourth interval having 3 cars corresponds to the fourth element having the value 3. In general, if we denote the array as X with size n, this corresponds to a histogram with n buckets. For array X, X_i is bucket i, where i ∈ {1, ..., n}. For the histogram example in Figure 2.4, bucket number 8 (X_8) holds the value 2.
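To make the bucketing concrete, the following small sketch (the class name is only illustrative) reproduces the array in Table 2.3 from the prices in Table 2.2:

    public class PriceHistogram {
        public static void main(String[] args) {
            // Prices from Table 2.2.
            int[] prices = {115_000, 24_000, 84_000, 98_000, 265_000,
                            183_000, 81_000, 120_000, 178_000};
            int bucketWidth = 25_000;
            int numBuckets = 9;              // buckets 1-8 cover [0, 200 000), bucket 9 covers 200 000 and above
            long[] histogram = new long[numBuckets];

            for (int price : prices) {
                // Integer division maps a price to its interval; the last bucket is open-ended.
                int bucket = Math.min(price / bucketWidth, numBuckets - 1);
                histogram[bucket]++;         // X_i in the text corresponds to histogram[i - 1]
            }
            System.out.println(java.util.Arrays.toString(histogram));
            // Prints [1, 0, 0, 3, 2, 0, 0, 2, 1], matching Table 2.3.
        }
    }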

Degree distribution

The degree distribution of a graph is a measurement of how many nodes have similar node degrees [10]. The node degree is a measure of how many edges are connected to a node. For graphs with directed edges, in- and out-degrees are subcategories of the measurement where the edge direction is taken into consideration when determining the corresponding node degree. Figure 2.5 shows a disconnected graph of four nodes, where the total node degree as well as the in- and out-degree is shown for each node.

Figure 2.5: Different node degree measurements on the format: [both, in, out].

Degree distribution is a common measure for studying social networks and graph structure [10]. Degree distribution goes very well with histograms, and the histogram in Figure 2.6 is an example of a degree distribution as well. Each bucket in the example holds nodes with three different amounts of connected edges, where the nodes with the highest degrees reside in the far right bucket. This particular histogram has higher peaks on the right half, indicating an uneven distribution of node degrees for the nodes in the related graph.


Triangle count

In social networks, individuals are often connected to each other in ways where triangles are formed in a graph representation [10]. For a pair of connected nodes, each third node they form a triangle with indicates a common relation or friend. Common relations are also, like degree distribution, a common and important measure for studying social networks. Triangle count is also the root of another measurement, namely the clustering coefficient.

Clustering coefficient

The clustering coefficient exists on both a global and a local level, where it is referred to as the clustering coefficient of the graph or the clustering coefficient of a node, respectively. The global clustering coefficient can be described as the probability that two nodes that are connected to a node n are also connected to each other [17]. It can be estimated by dividing the average node degree by the total number of nodes in the graph. The local clustering coefficient is instead determined by dividing the triangle involvement count of the node by the highest possible triangle involvement count for the node, considering its current node degree [10].
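For reference, the local clustering coefficient described above is commonly written as follows (a standard formulation, not quoted from [10]), where T(v) is the number of triangles that node v is involved in and deg(v) is its node degree:

$C(v) = \dfrac{T(v)}{\binom{\deg(v)}{2}} = \dfrac{2\,T(v)}{\deg(v)\,(\deg(v) - 1)}$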

Centrality

Centrality measurements are often used in social networks to determine the importance of specific individuals, i.e. how important their contribution is to the graph [10][21]. There are different kinds of centrality measures, where the simplest one is counting node degrees. A higher node degree would indicate that the node is more important. Alternatives to counting node degrees are the measures of betweenness and closeness. Measuring betweenness means counting how many times a node is involved in shortest paths between two other nodes. A higher count would indicate that the node is important and acts as a crucial bridge for many such paths. When measuring closeness, the average length of the shortest paths from one node to all other nodes is calculated for each node in the graph. A lower average length would indicate a more useful node, due to the relatively shorter distance to all other nodes. Other centrality measures are more complex, and some of them involve an algorithm for determining node importance.

2.2 Differential privacy

The purpose of differential privacy is to achieve privacy for individuals whose data may contribute to the result of statistical database queries, which is done by essentially ensuring that no one can tell whether an individual is part of a data set or not [9]. For example, an individual could have participated in a survey where they have shared their political views or other opinions that might be considered sensitive to the public, or they could be part of a database containing sensitive medical records. By hiding participation, differential privacy works as a protection against privacy attacks like linking attacks and background knowledge attacks [7]. These types of attacks rely on an adversary having knowledge about the participation of some individual in a data set, which in turn can be exploited to find out sensitive information. The information regarding the inclusion of an individual can be protected with various methods, where a common approach is drawing noise from a noise distribution that is applied to the query result [9]. By implementing a privacy-preserving query mechanism in the database, noise can be applied to query results according to differential privacy requirements (see Figure 2.7). When there is no telling whether an individual is present in such a data set when looking at the query result, one can almost conclude that the query mechanism is working as intended. The definitive conclusion can be drawn once the query mechanism has been proven to fulfill a mathematical requirement, which is the foundation of achieving differential privacy.


Figure 2.7: Model of how the query mechanism M_Q works in practice. Query Q is sent to the database, then M_Q adds noise to the true result R to become R′.

Neighboring databases

The amount of noise that needs to be added depends on how much the true query result can change when a record is deleted from or added to a database, which is also referred to as the query sensitivity [9]. Database neighbors can be studied to find out how much the true query result can differ, for which a notion of neighboring database needs to be defined. For differential privacy in tabular databases, a neighboring database D′ of a database D is defined as differing by at most one record, which means that the neighboring database either has one additional record or is lacking one record from the original database. The definition of neighboring database can then be used in the definition of ε-differential privacy.

ε-differential privacy

If M_Q is the privacy-preserving query mechanism, then M_Q(D) is the noisy result produced by the mechanism when querying database D. Consider S being the set of possible results for a query when using the query mechanism; then the query mechanism is said to provide ε-differential privacy if Condition 2.1 below is true for all pairs of database neighbors D and D′ [9].

$\Pr[M_Q(D) \in S] \leq e^{\varepsilon} \cdot \Pr[M_Q(D') \in S]$   (2.1)

Condition 2.1 should be read as: "The probability that the result from query mechanism M_Q over database D is in the set of possible results S is less than or equal to e raised to ε (epsilon), multiplied by the probability that the result from query mechanism M_Q over neighboring database D′ is in the set of possible results S." The constant ε is used as a means of setting the level of privacy, where a lower-valued ε indicates stronger privacy for the query mechanism [9]. This becomes clearer in the part about the Laplace distribution in Section 2.3, which describes the noise distribution considered in this thesis for adding noise to the query result.

2.3 Noise generation

This section describes the theory behind the process of generating noise for true query results. It begins with the different approaches to determining query sensitivity, and concludes with how a Laplace distribution is used for noise generation. The concept of smooth sensitivity is included in this section for the purpose of discussing it along with the thesis results in Chapter 5.

Query sensitivity

Before adding noise through the Laplace distribution, the query sensitivity needs to be determined. Consider the queries in Table 2.1. The greatest possible difference for a COUNT query is 1 after removing or adding a record, as the counter would at most either increase or decrease by one [9]. The sensitivities for the queries MAX, MIN, AVG and SUM have potentially much greater differences, as they all depend on what value is being removed or added. Assuming that MIN can range towards negative infinity, the greatest difference in query result for these queries could go towards infinity.

There are different equations for determining the query sensitivity for queries that return single numbers and for histogram queries. For database neighbors D and D′, querying these databases yields Q(D) and Q(D′), which are the true query results. The query sensitivity is a measure of the greatest possible difference in query result when comparing the results from any possible pair of neighboring databases, i.e. the difference between Q(D) and Q(D′) [9][10]. The query sensitivity is denoted by ∆q, which for queries that produce single numbers can be determined by using Equation 2.2 below.

$\Delta q = \max_{D,D'} |Q(D) - Q(D')|$   (2.2)

Histogram sensitivity

Equation 2.2 is not applicable for histogram queries, however, although the determination of query sensitivity for histogram queries is very similar. The only difference is that the query sensitivity for histogram queries is calculated as a sum over all buckets to retrieve a total value, which is described in Equation 2.3. If we consider the queries in Table 2.1 once again, removing one record would still yield infinite sensitivity for MAX, MIN, AVG and SUM. The COUNT query is slightly different for histograms, as a difference in more than one histogram bucket is a possibility [9][10]. An example of such a possibility is explained under the theory of graph sensitivities in Section 2.4. Equation 2.3 shows how the query sensitivity for histograms is determined, where k signifies the number of buckets in the histogram and Q(D)_i denotes the result of histogram bucket i.

$\Delta q = \max_{D,D'} \sum_{i=1}^{k} |Q(D)_i - Q(D')_i|$   (2.3)

An example of what the equation entails for the query sensitivity of histogram queries is when the result of a query differs in two or more histogram buckets, e.g. the first bucket has one less and the second bucket has one more [9][10]. This is a scenario where Equation 2.3 would yield a total sensitivity of 2. For query types like degree distribution, triangle count and centrality, the greatest possible differences occur when multiple buckets differ.
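The sum in Equation 2.3 is simply the L1 distance between the two bucket arrays, which is also the kind of distance this thesis later uses as its utility measurement. A minimal helper (illustrative class and method names) could look as follows:

    public final class HistogramDistance {
        /** L1 distance between two histograms with the same number of buckets, as summed in Equation 2.3. */
        static long l1Distance(long[] a, long[] b) {
            if (a.length != b.length) {
                throw new IllegalArgumentException("Histograms must have the same number of buckets");
            }
            long sum = 0;
            for (int i = 0; i < a.length; i++) {
                sum += Math.abs(a[i] - b[i]);
            }
            return sum;
        }
    }

Applied to the example above, a histogram whose first bucket has one less element and whose second bucket has one more element than its neighbor gives an L1 distance of 2.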

Smooth sensitivity

The sensitivity determined by Equation 2.3 is also called the global sensitivity, where the databases D and D′ can be any possible pair of neighbors [22][23]. The sensitivity can potentially be very high in this case, compared to scenarios where D is a fixed database. The global sensitivity for a query is essentially a constant representing the worst case scenario, where the two database neighbors cause the greatest possible difference in query result. If database D is instead fixed to a specific database, then the neighboring database D′ is chosen so that the greatest possible difference in query result depends on the current database D. This concept is known as the local sensitivity of query q and is determined by Equation 2.4.

$LS_q(D) = \max_{D'} \sum_{i=1}^{k} |Q(D)_i - Q(D')_i|$   (2.4)

However, local sensitivity is not sufficient for fulfilling the differential privacy requirements [22][23]. Considering the dependency on a specific database, the resulting noise based on the local sensitivity may leak information about that database. The solution to this problem is to utilize a smooth bound on the local sensitivity, which is the core of determining the smooth sensitivity. The smooth bound on the local sensitivity is defined as a function S, which works as an upper bound on the local sensitivity. The function S is characterized by a parameter β > 0, which sets the smoothness of the function. Consider D^n as a set of undirected labeled graphs with at most n nodes; then the function S is said to be a β-smooth upper bound on the local sensitivity of q if the following requirements are satisfied:

for all D ∈ D^n : S(D) ≥ LS_q(D)
for all neighbors D, D′ ∈ D^n : S(D) ≤ e^β S(D′)

In order to obtain the smooth sensitivity, it is necessary to calculate the local sensitivity at different distances [22][23]. Consider distance t, and node distance d_node(D, D′), which corresponds to how many nodes need to be added or removed from D′ to obtain D (referred to as the distance); then the local sensitivity of query q at distance t can be determined with Equation 2.5.

$LS^{(t)}(D) = \max_{D' \in D^n : \, d_{node}(D, D') \leq t} LS_q(D')$   (2.5)

With the definitions of the smooth bound and the local sensitivity of query q at distance t, the smooth sensitivity can be determined according to Equation 2.6.

$S^{*}_{q,\beta}(D) = \max_{t = 0, \dots, n} e^{-t\beta} LS^{(t)}(D)$   (2.6)

Laplace distribution

Once the sensitivity of the query has been determined, it can be used as a parameter when drawing noise from a Laplace distribution [9]. The Laplace distribution is used as a tool to provide the noise that is added to the true query result. Noise is randomly drawn from the distribution, where higher values on the y-axis indicate a higher chance that the corresponding value on the x-axis is drawn as noise. The distribution is symmetric on the x-axis, which means that for providing accurate results it is necessary that the center of the distribution is set at x = 0 (see Figure 2.8). The Laplace distribution is denoted by Lap(λ) and is determined by the value of λ. The value of λ is calculated by dividing the query sensitivity by ε (see Equation 2.7). The value of ε is usually set to a low value, below 1.

$\lambda = \Delta q / \varepsilon$   (2.7)
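As a worked example, a degree distribution query under edge difference has a global sensitivity of 4 (see Section 2.4), so ε = 1 gives λ = 4/1 = 4, while ε = 0.001, the lowest value used in this study, gives λ = 4/0.001 = 4000, i.e. a far wider distribution and therefore far noisier results.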


Figure 2.8 shows that a higher λ is necessary for a more even Laplace distribution, which means that higher values of λ are better for privacy [9]. In other words, a low value of ε is more favourable for stronger privacy. The direct opposite is favourable for providing more accurate (i.e., less noisy) query results, i.e. a distribution that is pointy at 0, where the chance that little or no noise is added is very high. Aiming for stronger privacy has a negative effect on the accuracy of query results, and vice versa.

For histogram queries, the noise is drawn from the Laplace distribution more than once [9]. If R is the result of querying the database D with query Q, denoted by Q(D), then R is a vector with b elements for histogram queries (R_b). The number of elements corresponds to the number of buckets in the histogram. A query mechanism which provides ε-differential privacy gives the noisy histogram R′ according to Equation 2.8. Each element in the noise vector η_b is independently drawn from the Laplace distribution and applied to its corresponding element in the result vector R.

$R' = R + \eta_b$   (2.8)
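A minimal sketch of this mechanism is shown below, assuming the noise is drawn with standard inverse-CDF sampling of the Laplace distribution; the class and method names are illustrative, and a production implementation would rather rely on a vetted library.

    import java.util.Random;

    public class LaplaceMechanism {
        private final Random random = new Random();

        /** Draw one sample from Lap(lambda) centered at 0 using inverse-CDF sampling. */
        double sampleLaplace(double lambda) {
            double u = random.nextDouble() - 0.5;                      // uniform in [-0.5, 0.5)
            return -lambda * Math.signum(u) * Math.log(1 - 2 * Math.abs(u));
        }

        /** Noisy histogram R' = R + eta_b (Equation 2.8), with lambda = sensitivity / epsilon (Equation 2.7). */
        double[] addNoise(long[] trueHistogram, double sensitivity, double epsilon) {
            double lambda = sensitivity / epsilon;
            double[] noisy = new double[trueHistogram.length];
            for (int i = 0; i < trueHistogram.length; i++) {
                noisy[i] = trueHistogram[i] + sampleLaplace(lambda);   // independent noise per bucket
            }
            return noisy;
        }
    }

Note that the noisy bucket values can become negative or non-integer; how such values are presented is a separate design choice.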

2.4 Notions of differential privacy in edge-labeled graphs

This section describes the different notions of differential privacy in edge-labeled graphs considered in this thesis, as well as how query sensitivities in graph databases are determined based on these notions. The descriptions and definitions of edge neighborhood, node neighborhood, QL-edge neighborhood and QL-outedges neighborhood are covered below.

Graph definition

The definitions of graph neighborhoods are based on a common approach for describing the graph itself. If the collections of all nodes and edges in a graph are denoted V and E respectively, then G = (V, E) denotes the graph with the corresponding nodes and edges [12]. For edge labels, the infinite set of all possible labels is denoted L. A labeled edge e ∈ E can therefore be described as e = (v1, l, v2), where v1, v2 ∈ V and l ∈ L.
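As a concrete illustration in the implementation language of the thesis, such a labeled edge and graph could be modeled with the following hypothetical types (Java 16+ records; none of these names come from the thesis framework):

    import java.util.Set;

    // e = (v1, l, v2) with v1, v2 in V and l in L, using String identifiers for nodes and labels.
    record LabeledEdge(String v1, String label, String v2) {}

    // G = (V, E)
    record EdgeLabeledGraph(Set<String> nodes, Set<LabeledEdge> edges) {}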

Edge neighborhood

The edge neighborhood notion entails that the neighboring graph is defined by differing by at most one edge, i.e. by adding or removing an edge regardless of the label (see Figure 2.9 and 2.10a) [12].

Definition 2.4.1. Graphs G = (V, E) and G′ = (V′, E′) are edge-neighbors if V′ = V and E′ = E − {e} for some edge e ∈ E.

Node neighborhood

In node neighborhood, neighborhood of graphs is defined by adding/removing at most one node and all the edges connected to it (see Figure 2.9 and 2.10b) [12]. Compared to edge neighborhood, node neighborhood has a larger impact on the graph as a whole due to the potentially high number of edges that can be removed. Removing a node potentially leaves loose ends, hence the removal of the connected edges. The worst case is problematic, as in theory all edges could be removed [22]. This is a known problem for differential privacy in graphs.

Definition 2.4.2. Graphs G = (V, E) and G′ = (V′, E′) are node-neighbors if V′ = V − {x} and E′ = E − {(v1, l, v2) ∈ E | v1 = x or v2 = x} for some x ∈ V.

QL-edge neighborhood

The idea of QL-edge neighborhood is similar to edge neighborhood, only now with focus on edge labels [12]. The definition tells us that neighboring graphs have at most one additional/fewer labeled edge where the label is part of QL, where QL is a subset of the set of possible labels L (see Figure 2.9 and 2.10c).

Definition 2.4.3. Let QL be a subset of L. Graphs G = (V, E) and G′ = (V′, E′) are QL-edge-neighbors if V′ = V and E′ = E − {e} for an edge e ∈ E such that e = (v1, l, v2) and l ∈ QL.

QL-outedges neighborhood

Neighborhood of graphs in the QL-outedges neighborhood notion is defined by adding/removing a set of outgoing labeled edges from a node [12]. The labels of these edges are part of QL, similar to QL-edge neighborhood. The difference from QL-edge neighborhood is an extra constraint that puts the focus only on outgoing edges, hence the name, as well as allowing a difference of multiple edges (see Figure 2.9 and 2.10d). The worst case scenario for this notion is the same as for node neighborhood.

Definition 2.4.4. Let QL be a subset of L. Graphs G = (V, E) and G′ = (V′, E′) are QL-outedge-neighbors if V′ = V and E′ = E − {(v1, l, v2) | v1 = x and v2 ∈ V and l ∈ QL} for some x ∈ V.

Figure 2.9: Original graph used for comparison when demonstrating neighboring graph with some notion.

(a) Graph that is an edge neighbor of the graph in Figure 2.9.

(b) Graph that is a node neighbor of the graph in Figure 2.9.

(c) Graph that is a QL-edge neighbor of the graph in Figure 2.9 for a QL such that label_y ∈ QL.

(d) Graph that is a QL-outedge neighbor of the graph in Figure 2.9 for a QL such that label_y ∈ QL.

Figure 2.10: Neighboring graphs defined by the four different notions described in this section.


Graph sensitivities

Equation 2.2 and Equation 2.3 are also used when determining the query sensitivity for queries in graph databases. The only difference is how the neighboring database D′ is defined, which for graphs has more variations than for tabular databases.

For degree distribution queries, the greatest possible difference in query result depends on how the neighboring database is defined. Given that the query result is presented as a histogram, and that the neighboring database differs by one edge, the sensitivity of the query is 4 [10]. Removing one edge lowers the degree of its two endpoints by one, so in the worst case four buckets change: two buckets gain a node each and two buckets lose a node each, yielding a sensitivity of 4. If the neighboring database instead differs by one node and all its connected edges, the sensitivity is infinite. Removing such a node can potentially make all node degrees become zero, since a graph could grow towards an infinite number of nodes with one node connected to all others.
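
A small self-contained check of the edge-difference argument above: removing one edge lowers the degree of its two endpoints by one, so at most four histogram buckets change (two lose one count and two gain one). The node degrees in the sketch are made up purely for illustration.

import java.util.Map;

public class DegreeSensitivityDemo {

    // Degree histogram: index d holds the number of nodes with degree d.
    static int[] degreeHistogram(Map<String, Integer> degrees, int maxDegree) {
        int[] hist = new int[maxDegree + 1];
        for (int d : degrees.values()) {
            hist[d]++;
        }
        return hist;
    }

    public static void main(String[] args) {
        // Hypothetical node degrees before and after removing one edge (a, b):
        // a goes from degree 3 to 2, b goes from degree 5 to 4.
        Map<String, Integer> before = Map.of("a", 3, "b", 5, "c", 1, "d", 2);
        Map<String, Integer> after  = Map.of("a", 2, "b", 4, "c", 1, "d", 2);

        int[] h1 = degreeHistogram(before, 5);
        int[] h2 = degreeHistogram(after, 5);

        int difference = 0;  // sum of absolute bucket differences (L1 distance)
        for (int i = 0; i < h1.length; i++) {
            difference += Math.abs(h1[i] - h2[i]);
        }
        System.out.println("Buckets changed (L1 distance): " + difference);  // prints 4
    }
}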

For triangle queries, the sensitivity is infinite regardless of how the neighboring database is defined [10]. This is because, in theory, the greatest possible difference for such queries can involve graphs where all triangles depend on one edge or one node. Removing either the node or the edge could then reduce the number of triangles from N down to zero; see Figure 2.11 and Figure 2.12 for an example. As N could grow towards infinity, this yields an infinite sensitivity.

Figure 2.11: Graph consisting of six triangles, where the middle edge is involved in all of them.

(a) Graph with edge difference. (b) Graph with node difference. Figure 2.12: Neighboring graphs of the graph in Figure 2.11 defined by both edge and node difference.

Betweenness queries have, just as triangle queries, sensitivities that can be infinite. For node difference, the reason is the same as for degree distribution. For edge difference, the reason is different and is demonstrated in Figure 2.13.


Figure 2.13: Worst case of betweenness queries for edge difference.

In the worst case, two nodes in the middle of a graph form a connection where those two nodes are the only bridges on shortest paths in the graph. Removing the edge between them results in zero nodes acting as bridges on shortest paths. With the number of nodes N growing towards infinity, this yields an infinite sensitivity.

2.5

Related work

Task et al. [10] provided a guide, based on their study, on how differential privacy can be adapted for graph structures used in social network analysis. They recognize the issue of user privacy in social media platforms, and propose out-link privacy as an alternative way to guarantee privacy besides node privacy and edge privacy. They employ these privacy guarantees on common graph analysis techniques used on social graphs, namely "triangle counting", "degree distribution" and "centrality" [10]. The same analysis techniques are relevant for this thesis and were inspired by their work.

Studying the privacy and utility trade-off of notions of differential privacy is suggested as future work by Reuben in her work on the theory of differential privacy in directed edge-labeled graphs [12]. The differential privacy notions provided by her paper are the foundation for this work. She mentions edge privacy and node privacy from the work of Task et al. [10] as an inspiration behind these notions, where QL-outedges neighborhood seems to be inspired by out-link privacy. For this thesis, all three notions from her paper were selected for analysis, with QL-edge neighborhood being a fourth notion adapted from existing ones.

There is, however, a known issue with two of these notions: for node neighborhood and QL-outedges neighborhood, the query sensitivity can be infinite. This is a known problem for differential privacy in graphs. Kasiviswanathan et al. [22] developed algorithms to combat this problem in undirected graphs. Node neighborhood might be much more secure than edge neighborhood, but that comes at a very high cost in accuracy. Global sensitivity (Equation 2.3) is the cause of infinite query sensitivity when employing node neighborhood, since it is measured over all pairs of neighboring databases. This includes the scenario where one database has up towards an infinite number of edges and the neighboring database has none, i.e. all edges are connected to the same node. The authors propose the use of smooth sensitivity as an approach to solve this problem.


As an alternative or complement to using real-world graphs for social science, synthetic graphs can prove useful. Providing synthetic graphs helps social scientists without involving the privacy issues of real user data. Edunov et al. [19] developed a synthetic graph generator called Darwini, which works in three steps. First it creates nodes and assigns node degrees as well as local clustering coefficients. Next, Darwini places these nodes into communities and connects the nodes within the communities. Finally, the communities are connected to each other, forming the final graph [19]. The method of creating graphs in steps of building communities that are later connected to each other gave inspiration and was relevant to the framework development in this project. However, the project framework considers slightly different parameters, due to the focus on edge-labeled directed graphs; Darwini works only for unlabeled and undirected graphs. Another point is that Darwini uses the degree distributions and clustering coefficient distributions of real-world social graphs as input for generating entirely new graphs [19]. Thus, the size is harder to predict for each created synthetic graph, which gives less control over size when synthesizing the graph.


3

Method

This chapter provides a walk-through of the methodology used for this project. The first section gives an overview of the framework that was developed and used to carry out this study. Necessary features are listed as requirements that needed to be implemented in order to produce privacy and utility data which could be analyzed. The following sections describe suitable framework components, as well as the evaluation of the framework and how the produced data was used to provide answers to the research questions of this thesis.

3.1

Analysis framework

The framework can be divided into two parts. First, the framework had to be able to both create synthetic data sets and load real-world data sets into a graph database. It also had to be able to query the database and receive the result. The second part can be described as the "query mechanism", which is responsible for post-query processing such as applying generated noise to the query result and creating an output file. The amount of noise is determined by ε and by the query sensitivity, which is a query-specific constant within the query mechanism. The noise is added with the help of the Laplace distribution.

The work process when using the framework began with finding existing files of real-world data sets in the RDF format (Resource Description Framework), and with synthesizing data sets based on a set of parameters. Loading a real-world data set was done by parsing an RDF file into a graph that can be stored in a database management system. With the desired data set in the graph database, it is possible to send queries and receive true and noisy results through the framework according to some settings. The query interaction mode is non-interactive, meaning that additional queries cannot be sent during runtime; each execution considers only one query. The settings consist of the name of the data set to query, which query to send, the choice of notion and the value of ε, as well as specifics regarding the width and number of histogram buckets. The noise measurements for different values of ε were then used to create a trade-off analysis chart in external software.
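
As an illustration of what such a per-execution configuration might look like, the record below bundles the settings listed above; the field names are hypothetical and not the framework's actual API.

// Illustrative run configuration matching the settings described above.
public record QueryRunSettings(
        String dataSetName,     // which stored graph to query
        String queryName,       // which of the pre-defined queries to send
        String notion,          // e.g. "edge", "node", "QL-edge" or "QL-outedges"
        double epsilon,         // the privacy parameter ε
        double bucketWidth,     // width of each histogram bucket
        int numberOfBuckets) {  // number of histogram buckets
}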


Framework requirements

Determining the requirements for the framework described in the work process above was done by identifying necessary features that would be useful when answering the thesis research questions. The most fundamental framework requirement is to be able to connect to an existing database, which is necessary in order to perform any further actions. Once the framework is connected to a database, three additional requirements need to be fulfilled.

The framework has to be able to create graphs out of existing data sets, which means that such data sets need to be parsed from their current format before a graph can be stored in the database. To provide complementary graphs, the framework also has to be able to create synthetic graphs, and the graph synthesizer Darwini illustrates a method of doing so [19]. Compared to Darwini, the graph synthesizer mechanism in the framework has to be different in a few ways. The first difference is the different graph properties, i.e. unlabeled and undirected graphs compared to edge-labeled and directed graphs. A synthesized edge-labeled graph requires a method for distributing labels. The second difference is that the graph synthesizer mechanism of the framework will not mirror real-life social graphs, which makes the parameters of degree distributions and clustering coefficients less convenient. Assigning those properties to each node of a synthesized graph in a random manner would make the graph size inconsistent if the synthesizing process is run multiple times, unless these parameters are manually provided for the thousands of nodes involved. Using a fixed number of edges results in a consistent graph size, but instead introduces a random element in node degrees and clustering coefficients.
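
A minimal sketch of the kind of generator described here, assuming a fixed number of edges with uniformly random endpoints and labels; all names are illustrative and the framework's actual synthesizer may differ.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SyntheticGraphSketch {

    record LabeledEdge(int source, String label, int target) {}

    // Generates numEdges directed labeled edges over numNodes nodes. Fixing the
    // number of edges keeps the graph size consistent between runs, while node
    // degrees and clustering coefficients become random instead.
    static List<LabeledEdge> generate(int numNodes, int numEdges, List<String> labels, long seed) {
        Random random = new Random(seed);
        List<LabeledEdge> edges = new ArrayList<>(numEdges);
        for (int i = 0; i < numEdges; i++) {
            int source = random.nextInt(numNodes);
            int target = random.nextInt(numNodes);
            String label = labels.get(random.nextInt(labels.size()));
            edges.add(new LabeledEdge(source, label, target));
        }
        return edges;
    }
}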

With graphs stored in the database, the framework must be able to query the database by choosing a query from a set of pre-defined queries. These queries must be implemented as algorithms in the framework. The post-query processing is addressed by another three requirements. First, the query result must be received in a histogram format in preparation for the noise generation process. Secondly, the framework must include a mechanism that draws a value from a noise distribution that can be applied to the query result.

Finally, with true and noisy query results, an output file needs to be created with the necessary data to analyze the trade-off between privacy and utility (utility is defined in the part about output data in Section 3.3). The output file needs to be in a common format that is suitable for being imported into external software for the purpose of producing charts. The requirements are summarized in Table 3.1.

Table 3.1: Framework requirements

ID  Requirement             Notes
1   Connect database        Establish link to a database.
2   Create graph            By parsing files of real-world data.
3   Create synthetic graph  By providing values on parameters in the framework.
4   Send query              Query a graph in the database. See queries in Table 3.2.
5   Receive result          In a histogram format.
6   Distort result          Using the Laplace distribution depending on query sensitivity
                            and ε, where the query sensitivity depends on query and notion
                            of differential privacy.
7   Output file             Produce a CSV file with true and noisy results, along with
                            utility measurements and ε-values to be used as input when
                            creating charts.
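
As an illustration of requirement 7, a noisy histogram could be written to a CSV file roughly as in the sketch below. The column names and the per-bucket absolute error used as a utility measure are assumptions for the sketch, not the framework's exact output format.

import java.io.FileNotFoundException;
import java.io.PrintWriter;

public class CsvOutputSketch {

    // Writes one row per histogram bucket: ε, bucket index, true count, noisy count, absolute error.
    static void writeResults(String path, double epsilon, long[] trueResult, double[] noisyResult)
            throws FileNotFoundException {
        try (PrintWriter out = new PrintWriter(path)) {
            out.println("epsilon,bucket,true,noisy,absolute_error");
            for (int i = 0; i < trueResult.length; i++) {
                double error = Math.abs(noisyResult[i] - trueResult[i]);
                out.printf("%f,%d,%d,%f,%f%n", epsilon, i, trueResult[i], noisyResult[i], error);
            }
        }
    }
}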

The most efficient way of implementing these requirements in the framework was to identify and make use of existing tools that specialize in certain features. These tools and the necessary framework components are presented in the next section.

3.2

Framework components

The major parts of the framework rely on the existence of a database management system, an efficient method of operating on graphs, as well as the processing of query results. This section describes the approach concerning these parts, along with the framework queries and the creation of graphs.

Database Management System

Considering the necessity of studying multiple graphs, a database management system is useful to avoid the scenario of creating the same graph multiple times. A database management system is convenient for the first four requirements listed in Table 3.1, which concern the storage of graphs and the framework connection. With a working storage for graphs, a method for creating and querying graphs in the database is required.

Graph processing

An efficient tool for processing graphs is necessary to accomplish the requirements regarding graph creation, both in terms of synthesizing graphs and creating graphs out of existing data sets, as well as sending queries and receiving results. It will work as an interface between the framework and the used database management system, which means that the tool has to be an API of the database management system that supports Java.

Resource Description Framework

Real-world data sets exist in many formats, and the RDF format is one of them. It is a common format for data exchange, and the purpose of the format is to link data resources through triples [25]. One triple consists of two nodes and one labeled edge linking the nodes together, where the nodes represent different data resources. A set of these triples can be used to form graphs for visualization and analysis of those data resources. Creating graphs out of real-world data sets to be stored in a database management system requires the data set file to be parsed. Parsing an RDF file would work according to Figure 3.1.
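
The thesis does not name a specific RDF parser at this point; as one possible approach, Apache Jena can read an RDF file into triples that map directly to labeled edges (subject node, predicate as edge label, object node). A rough sketch:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.riot.RDFDataMgr;

public class RdfParsingSketch {

    // Reads an RDF file (format inferred from the file extension) and prints each
    // triple as a labeled edge (v1, l, v2). Apache Jena is used here only as an
    // example library, not because the thesis prescribes it.
    static void printTriples(String rdfFilePath) {
        Model model = RDFDataMgr.loadModel(rdfFilePath);
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement stmt = it.next();
            String subject = stmt.getSubject().toString();      // node v1
            String predicate = stmt.getPredicate().toString();  // edge label l
            String object = stmt.getObject().toString();        // node v2
            System.out.println(subject + " --" + predicate + "--> " + object);
        }
    }
}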
