Deep learning for differential privacy and density estimation

JAUME ANGUERA PERIS

KTH ROYAL INSTITUTE OF TECHNOLOGY
Second cycle, 30 credits
Stockholm, Sweden 2019


Master Thesis

Deep learning for differential privacy

and density estimation

Author

Jaume Anguera Peris

Supervisor

Dr. Pol del Aguila Pla

A thesis submitted in fulfillment of the requirements

for the degree in Information and Network Engineering

in the

Division of Information Science and Engineering


Abstract

One of the most promising opportunities for scientists in recent years is the ease with which data can be collected and accessed. Moreover, due to the proliferation of interdisciplinary fields, the opportunity for data collaboration has emerged. For some years, though, organizations, companies, and institutions did not examine secure ways to manage, process, and share data. As a result, we can see the repercussions of these actions today, as some of them are suffering the social and legal consequences caused by the exposure of their data records. In fact, the concern towards offering secure ways of protecting data has escalated so rapidly over the last years that it took up a great part of many of the discussions held at the World Economic Forum in January 2019.

Motivated by this dire need for protecting data records, the first part of this project investigates successful ways of limiting the disclosure of private information held in a database when statistical information from it is released to the public. Examples of statistical information that is sold to third parties or released to the public include means, variances, or higher moments, as well as proportions of the population that satisfy certain conditions. Here, however, we instead assume that we want to share the whole probability distribution, in the form of a density, without compromising the users' privacy. This leads us to the problem of density estimation, which is the second pillar of this project. Indeed, this project aims to combine these two lines of research to obtain a differentially private density estimator that is easily implemented using a neural network.


Sammanfattning

One of the most promising opportunities for researchers in recent years is the improved collection of data and its accessibility. In addition, the large number of interdisciplinary fields has created opportunities for data collaboration. For some years, however, organizations, companies, and institutions did not investigate secure ways to manage, process, and share data. Today we feel the consequences, with the social and legal risks caused by exposed data records. The work towards offering secure ways of protecting data has escalated so drastically over the last years that it formed a large part of the discussions held at the World Economic Forum in January 2019.

Driven by this pressing need to protect data records, the first part of this project investigates ways of limiting the disclosure of private information in a database when statistical information from the data is released to the public. Examples of statistical information that is sold to third parties or released to the public are means, variances, higher moments, and proportions of the population that satisfy certain conditions. In our case, we instead assume that we want to share the whole probability distribution, as a density function, without compromising the users' privacy. This leads us to density estimation, which is the second pillar of this project. Indeed, this project aims to combine these two lines of research to generate a differentially private density estimator that can easily be implemented with a neural network.


Contents

1 Motivation
  1.1 Motivation for differential privacy
  1.2 Motivation for density estimation
  1.3 Thesis outline

2 Introduction to privacy preservation
  2.1 Databases
    2.1.1 First steps to ensure data privacy
    2.1.2 Distance between databases
  2.2 Models of computation
  2.3 Queries
    2.3.1 Types of queries
    2.3.2 Sensitivity
  2.4 Towards defining differential privacy

3 A primer on differential privacy and its history
  3.1 Standard definition of Differential Privacy
    3.1.1 Advantages of ε-DP
    3.1.2 Disadvantages of ε-DP
  3.2 Relaxation of Differential Privacy
    3.2.1 Advantages of (ε, δ)-DP
    3.2.2 Disadvantages of (ε, δ)-DP
  3.3 Rényi differential privacy
    3.3.1 Advantages of (α, ε)-RDP
    3.3.2 Disadvantages of (α, ε)-RDP
    3.3.3 RDP for different randomized mechanisms

4 Density Estimation
  4.1 Classical techniques
    4.1.1 Parametric approaches
    4.1.2 Non-parametric approaches
  4.2 Neural network techniques
  4.3 Overview of the density estimator for DP

5 Design of a differentially private density estimator
  5.1 Framework
  5.2 Privacy accounting

6 Experimental results
  6.1 Density Estimation
  6.2 Density Estimation with data protection

7 Conclusions and future work
  7.1 Future work


List of Figures

2.1 Example of two subsets x, y ⊆ 2^D that originate from a collection of records D with N = 4 individuals. The database in the upper right has |x| = 2 records and the database in the lower right has |y| = 3 records.

4.1 Comparison between a random draw of L = 80 targets of the stochastic learning of the cumulative algorithm (orange) and the smooth interpolation of the cumulative algorithm (green) used to train a non-parametric neural network for estimating the cumulative density function of a Normal distribution N(0, 1) (blue).

6.1 Setup of our experimental studies that defines the relationship between the curator who secures the database and the analyst who wants to obtain information about the individuals in the database.

6.2 L1-norm of the difference between the estimated and the true probability density function for different numbers of training samples N and different batch sizes |B|.

6.3 Comparison between the true cumulative distribution function, the target data for the neural network, the output of the neural network without differential privacy, and the outputs of two neural networks with differential privacy for a Gaussian mixture model.

6.4 Comparison between the true probability density function and its estimates after training a density estimator using the stochastic gradient of the cumulative approach with and without differential privacy.

6.5 Estimated cumulative distribution function of two density estimators, one trained with differential privacy and another trained without differential privacy, for a scenario in which the data of the right-hand density is sensitive and cannot be released to the public.

6.6 Estimated probability density function of two density estimators, one trained with differential privacy and another trained without differential privacy, for a scenario in which the data of the right-hand density is sensitive and cannot be released to the public.

6.7 Estimates of the probability density function for different values of C. For all cases, the density estimator is trained using a non-parametric neural network approach with differential privacy, σ² = 5·10⁻⁴, and Ω = 240.

6.8 L1-norm of the difference between the estimated and the true probability density function for different values of the total number of epochs Ω = {240, 480, 960}, as a function of the scale of the noise σ.

6.9 Theoretical and practical values of the privacy loss for the Rényi differential privacy as a function of the scale of the noise.


List of Tables

2.1 Examples of adjacent databases in the context of machine learning, where databases are considered only as collections of pairs of inputs and outputs.

2.2 Mean and standard errors over 100 experiments for the estimation of τ = 0.3 using the randomized response mechanism.

3.1 Summary of the upper bound of the privacy loss for the (α, ε)-Rényi differential privacy for mechanisms with Laplace and Gaussian noise distributions.


Chapter 1

Motivation

One of the most promising opportunities for scientists over the last years is the ease in the collection and accessibility of data that describes all kinds of facets of society. Now it is easier than ever for ecologists to collect satellite images to understand the impact of the climate on the vegetation. Climatologists can instantaneously access detailed reports of weather conditions to improve their predictions of natural disasters. Biologists can rapidly evaluate medical records for millions of patients to track the spread of a disease or prevent epidemics.

Moreover, due to the proliferation of interdisciplinary fields, there has also emerged the opportunity for data collaboration. For some years, though, organizations, companies, and institutions did not examine secure ways to manage, process, and share data. As a result, we can see the repercussions of these actions today, as some of them are suffering the social and legal consequences caused by the exposure of their data records. In fact, the concern towards offering secure ways of protecting data has escalated so rapidly over the last years that it took up a great part of many of the discussions held at the World Economic Forum in January 2019.

Motivated by this dire need for protecting data records, the first part of this project investigates successful ways of limiting the disclosure of private information held in a database when statistical information from it is released to the public. Examples of statistical information that is sold to third parties or released to the public include means, variances, or higher moments, as well as proportions of the population that satisfy certain conditions. However, we instead assume that we want to share the whole probability distribution, in the form of a density, without compromising the users’ privacy. This leads us to the problem of density estimation, which is the second pillar of this project.


Indeed, this project aims to combine these two lines of research to obtain a differentially private density estimator that is easily implemented using a neural network. For that, the reader will be theoretically instructed in the areas of differential privacy and density estimation. The reader, nonetheless, is expected to have solid knowledge of statistical signal processing and mathematics to fully understand all the concepts and proofs presented in this project.

1.1 Motivation for differential privacy

The context of this project involves two actors. On one hand, we have a database consisting of sensitive data from a group of individuals, and we want to release statistical information from that database without jeopardizing any individual in it. On the other hand, we have an adversary whose intent is to gain some information from the database.

Ideally, we want the adversaries to only gain information about the database as a whole. Hence, if the adversaries analyzed the statistical information released to the public, they should know no more about any individual in the database after the analysis is completed than they knew before the analysis began.

Some of the existing approaches for data protection in this context revolve around the idea of anonymizing the database by removing any attributes that could be used to identify the individuals, e.g., removing the name or the social security number of each individual. This process is also known as sanitizing the database. Other approaches go one step further and, besides removing sensitive information, they also group the attributes into ranges. For example, instead of saying that an individual is 25 years old, the attribute is changed to 20 < Age < 30. By doing so, a larger number of individuals share the same basic information, thus making it more difficult for the adversary to retrieve information from the database. In both cases, these actions are assumed to be reliable solutions for offering data protection, such that the database could be published entirely without compromising the integrity of any of its individuals.

These approaches, however, become highly vulnerable to side information. When the adversary has extra knowledge about any of the individuals in the database, these approaches fail to protect the integrity of the database. One famous example that is still influencing the debate over data privacy dates back to 1997. Latanya Sweeney, then an MIT graduate student and now a computer scientist at Harvard, identified the medical records of Massachusetts Governor William Weld from information publicly available in a state insurance database. Sweeney knew that Governor Weld had been hospitalized shortly after collapsing at a public event, so she used the date and place of hospitalization together with Weld's date of birth and zip code to locate Weld's medical records in the Massachusetts Group Insurance Commission (GIC) database. To do so, Sweeney first looked at all the records that matched that information, and then narrowed down the search by cross-referencing them with Cambridge voter-registration records, eventually leading her to Weld's medical records. This incident proved that it is possible to combine two sanitized databases with side information to disclose information about an individual who appears in both of them.

Another remarkable incident happened in 2006. In that year, Netflix held the first edition of the Netflix Prize, an open competition for finding the best algorithm to predict user ratings for films. The contestants had access to 100 million anonymized movie ratings, each consisting of four attributes: the title of a movie, the ID of the subscriber who rated that movie, the subscriber's rating, and the date it was rated. Shortly after this database was released to the public, Arvind Narayanan and Vitaly Shmatikov, two researchers at the University of Texas, announced that they had identified some of the Netflix users in the database. One of the techniques they used was to match the anonymized ratings with data from other sources such as IMDb [1]. More interestingly, they also discovered that it was possible to retrieve the viewing history of a subscriber by knowing the movies they had rented in a given time period. This too is an example in which sensitive information was disclosed using side information.

The challenge is therefore to offer protection to all the individuals in the database irrespective of any side information. For solving that problem, we have focused on the work by Cynthia Dwork, computer scientist at Harvard University, who proposed the most prominent and novel standard for data privacy, known as Differential Privacy [2].

Differential Privacy promises to protect individuals from any harm that they might face for being part of a database and that they would otherwise not have faced had they decided not to be part of it. That is to say that although individuals may indeed become vulnerable for being part of a database, differential privacy promises that the probability of harm is not significantly increased by their choice of being part of the database. Let's consider the following example to better understand this idea. Imagine there is a health survey intended to discover early indicators of breast cancer, which leads to conclusive results that hold for a given individual. Differential privacy does not consider this to be a privacy violation because it ensures that these conclusive results could have been obtained with very similar probability whether or not that specific individual had participated in the survey.

Throughout this project, we will frequently discuss three of the most interesting properties of differential privacy: (i) the protection against re-identification, (ii) the protection against auxiliary information, and (iii) the quantification of the privacy loss. This last property is of utmost importance because it will allow us to compare different techniques as well as analyze and control the cumulative privacy loss over multiple computations. Moreover, it will give us the tools to quantify the impact of having different databases containing information from the same individual.

1.2 Motivation for density estimation

Data protection has become such a prime concern nowadays that some major companies have decided to devote a considerable amount of resources to finding solutions that incorporate differential privacy. As a result, Google made public earlier this year an open-source library on GitHub to make it easier not only for developers to train machine-learning models with privacy, but also for researchers to advance the state of the art in machine learning with strong privacy guarantees [3]. This GitHub repository, however, is modified on a weekly basis, and keeping track of all the weekly modifications demanded too much of our time. Thus, we decided to take some of the ideas from that repository and construct our own differentially private algorithm.

The decision to design our own algorithm also gave us the freedom to make two important choices. On one hand, we were not limited to using the MNIST or ImageNet databases, so we instead decided to model our databases as collections of random samples. Thanks to this, the experimental studies give us a general overview of the underlying algorithm and are not restricted to the data, thus allowing us to extrapolate the results to other scientific fields. On the other hand, we could define the purpose of the algorithm, giving us the opportunity to combine differential privacy with other lines of research. In view of this, we decided to explore the topic of density estimation because its approaches are experiencing a shift towards using neural networks.

The idea behind using neural networks for density estimation emerges from the need to offer easy-to-implement solutions for applications such as finding the optimal detection threshold for the signal detection problem [4], predicting a future data point from a given time series [5], or clustering for unsupervised classifier design [6], to name a few.


1.3 Thesis outline

Chapter 2 provides all the background knowledge required to understand the concept of differential privacy. The first section describes the properties of the database. The second section presents the existing modes of interaction between the database and the adversary. The third section focuses on the statistical information released to the public, and the last section motivates the use of differential privacy for ensuring data protection.

Chapter 3 formally defines differential privacy and presents some of its key properties. The first section presents the standard definition of differential privacy, and the subsequent sections focus on two modifications that offer stronger privacy guarantees. Moreover, throughout Chapter 3, we include the advantages and disadvantages of both the standard definition and its corresponding modifications to explain why we decided to use one method instead of the others.

Chapter 4 focuses completely on density estimation: it examines some of the existing density estimation techniques and discusses the shift they are experiencing towards using neural networks. Chapter 4 concludes with a description of the density estimation technique that serves as a basis for constructing a differentially private density estimator in the following chapters.

Chapter 5 takes all the ideas presented in Chapters 2, 3, and 4 to construct a density estimator using a neural network with differential privacy and to measure the level of privacy of such an estimator.

Chapter 6 presents two experiments for measuring the performance of our proposed algorithm. The first experiment focuses on density estimation, and the second experiment focuses on differential privacy. Together, these two experiments aim to cover all the concepts presented throughout this project and draw specific conclusions about the performance of the different algorithms in Chapter 5.

Chapter 7 summarizes the main ideas of this project and gives a general conclusion on all the key aspects presented in it. The end of this chapter also motivates future lines of research that could be built upon this project, and encourages the reader to do so.


Chapter 2

Introduction to privacy preservation

In the previous chapter, we discussed the importance of having a privacy-preserving mechanism to protect the data of any individual in a database. Intuitively, the underlying idea of differential privacy is to protect the privacy of each individual while permitting statistical analysis of the database as a whole. With that in mind, this chapter aims to instruct the reader in all the basic terms and mathematical tools that serve as the foundation of differential privacy.

2.1 Databases

Let’s begin by considering a setD consisting of a collection of records with a specific number of variables from N individuals, as shown in Figure 2.1. We assume that each individual’s data is represented with a row vector, and a database x✓ 2D is a subset of D composed of a specific number |x| of rows,

such that|x|  N.

2.1.1 First steps to ensure data privacy

There are essential precautions that one must consider when dealing with data sets. The first step, before considering any protection or privacy techniques, is to understand the architecture of the database. For that, we assume the existence of a trusted and trustworthy curator who holds the data of all the individuals in the database. The main responsibility of the curator is to sanitize the database.


Figure 2.1: Example of two subsets x, y ⊆ 2^D that originate from a collection of records D with N = 4 individuals. The database in the upper right has |x| = 2 records and the database in the lower right has |y| = 3 records.

This process consists of identifying sensitive data and removing it permanently and irreversibly so that it cannot be recovered. Examples of sensitive information are names, addresses, phone numbers, or social security numbers. After this process is completed, the curator can destroy the original database with no further consequences.

2.1.2 Distance between databases

Throughout this project, we will often be interested in investigating how similar or how different two databases are. For that purpose, we define a mapping χ: 2^D → {0, 1}^|D| from a database x ⊆ 2^D into a binary vector of size N such that

$$\chi(x) = \big[\chi_x(v_1), \ldots, \chi_x(v_N)\big]^\top, \quad \text{where} \quad \chi_x(v_i) = \begin{cases} 1 & \text{if } v_i \in x, \\ 0 & \text{if } v_i \notin x, \end{cases}$$

where v_i, ∀i = 1, ..., N, represents each and every element of the set D. With this mapping, we can now define a function d: 2^D × 2^D → ℕ representing the distance between two databases, such that, for any x, y ⊆ 2^D,

$$d(x, y) = \big\lVert \chi(x) - \chi(y) \big\rVert_1 = \sum_{i=1}^{|D|} \big|\chi_x(v_i) - \chi_y(v_i)\big|. \qquad (2.1)$$

Note that equation (2.1) measures the Hamming distance between two binary vectors of size |D| = N, indicating the number of elements that differ between x and y. Notice also that, for all x, y, z ⊆ 2^D,

1. d(x, y) ≥ 0, (non-negativity)
2. d(x, y) = 0 ⟺ x = y, (identity of indiscernibles)
3. d(x, y) = d(y, x), (symmetry)
4. d(x, y) ≤ d(x, z) + d(z, y), (triangle inequality)

and therefore d is a metric on the set 2^D.

Definition 2.1 (Adjacent databases). Two databases x, y ⊆ 2^D are adjacent if they differ in at most one element, i.e., d(x, y) ≤ 1.
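As a small illustration of the definitions above, the following Python sketch builds the characteristic vectors χ(x) and χ(y) for the two databases of Figure 2.1, computes the Hamming distance d(x, y), and checks adjacency. The function names and toy records are our own and only serve as an example.

```python
# Illustration of the characteristic mapping and the Hamming distance d(x, y).
# The record names and helper names are illustrative, not from the thesis.

def characteristic_vector(database, collection):
    """Binary vector with a 1 for every element of the collection present in the database."""
    return [1 if record in database else 0 for record in collection]

def distance(x, y, collection):
    """Hamming distance between two databases, i.e., the number of differing elements."""
    chi_x = characteristic_vector(x, collection)
    chi_y = characteristic_vector(y, collection)
    return sum(abs(a - b) for a, b in zip(chi_x, chi_y))

def are_adjacent(x, y, collection):
    """Two databases are adjacent if they differ in at most one element."""
    return distance(x, y, collection) <= 1

# Collection D with N = 4 individuals (cf. Figure 2.1).
D = ["v1", "v2", "v3", "v4"]
x = {"v1", "v2"}          # |x| = 2 records
y = {"v1", "v2", "v3"}    # |y| = 3 records

print(distance(x, y, D))      # 1
print(are_adjacent(x, y, D))  # True
```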

In the context of machine learning, databases are composed of pairs of inputs and outputs (e.g., Table 2.1a). Hence, when we talk about adjacent databases, we consider three possible cases: the adjacent database has all the inputs but one of the outputs is missing (Table 2.1b), the adjacent database has all the outputs but one input row is missing (Table 2.1c), or the adjacent database has all the outputs but one entry from one input row is missing (Table 2.1d).

(a) Original database
Name      Age  Married?
Henrik    27   No
Georgina  36   Yes

(b) Label missing
Name      Age  Married?
Henrik    27   -
Georgina  36   Yes

(c) Input row missing
Name      Age  Married?
-         -    No
Georgina  36   Yes

(d) Entry missing
Name      Age  Married?
Henrik    27   No
Georgina  -    Yes

Table 2.1: Examples of adjacent databases in the context of machine learning, where databases are considered only as collections of pairs of inputs and outputs.

2.2 Models of computation

Over the years, several solutions have been proposed to solve the problem of privacy preservation in sanitized databases. As a result, the literature has classified all those solutions into two groups, the so-called models of computation. The first group encompasses the non-interactive, or offline, models, and the second group encompasses the interactive, or online, models.


In the non-interactive models, the curator releases once and for all the data that is thought to be of interest to the analysts. This data might be the sanitized databases themselves or a collection of statistics. In the former case, the curator makes use of additional techniques such as k-anonymity [7], l-diversity [8], or t-closeness [9] to maintain stronger privacy guarantees. Nevertheless, all the examples of privacy failures presented in Chapter 1 are examples of databases protected with non-interactive models. Thus, as we have seen already, these models are not reliable when the analyst has auxiliary information.

In the interactive models, the curator controls the access to the database in such a way that the analyst can only gain knowledge from the database by requesting statistical information from the curator. These models, contrary to the non-interactive models, provide a more mathematically rigorous guarantee of privacy even when the analyst has access to auxiliary information [2]. For this reason, this project focuses on interactive models.

2.3 Queries

As we have seen in the previous section, the analysts will never have direct access to the records held by the curator in the interactive model. In this case, the only way for the analysts to gain some knowledge about the databases is by means of queries.

Definition 2.2 (Query). A query f: 2^D → R^k is a function that allows any data analyst to obtain statistical information from a given database.

We assume an interactive model where the analyst may send queries either simultaneously or adaptively. From a privacy preservation point of view, the latter case is more challenging, as the analyst might decide which query to ask next based on the outputs of previous queries. In either case, we shall always aim for a balance between knowledge and privacy. That is, we want the analyst to gain some knowledge from our database without exposing any sensitive data from any individual. However, as we will see, the privacy protection will inevitably deteriorate with the number of queries asked.

2.3.1 Types of queries

There exist many different types of query functions in the literature, and they can all be grouped into two broad categories: the structurally joint query functions and the structurally disjoint query functions. We will specifically present three types of structurally joint query functions (counting queries, selection queries, and linear queries) and one type of structurally disjoint query function (histogram queries).

Structurally joint queries

One of the most used types of structurally joint queries is the counting query, which returns the number of individuals satisfying a certain property P. The output of these queries can be returned in its pure form, e.g., "How many individuals in the database ...?", or in fractional form, e.g., "What percentage of individuals in the database ...?". These queries are the most used ones because they are very powerful: they easily capture the basic statistical properties of the database and are very useful for many standard data-mining tasks.

Another type of structurally joint queries is the selection query, which returns a summary statistic of the database, such as the maximum, the minimum, the mean, or the median, to name a few. These queries have been shown to easily compromise the privacy of the database because they are very sensitive to the data [10].

Finally, the last type of structurally joint queries we want to discuss is the linear query. These queries return a weighted sum of the entries of each row in the database. For example, let's say we want to predict how much time a runner will take to run a marathon [11], and our database has data from different runners consisting of the maximum VO2¹, the year of birth, and the longest distance run over the last four weeks. With that information, we could not give a direct answer as with the counting queries or the selection queries. Instead, we can use linear queries and output a linear combination of those three values.

¹The maximum VO2 is the maximum volume of oxygen processable in milliliters per kilogram of body weight per minute, and it is a measure commonly used to reflect cardiorespiratory fitness and endurance capacity in exercise performance.

Structurally disjoint queries

The most well-known query functions in this group are histogram queries. This type of query function is a special case of structurally joint queries: the database x ⊆ 2^D is partitioned into disjoint cells, and the query asks how many database elements lie in each of the cells.

For example, an analyst could send k queries of the type "How many people's first name is ...?" and create a histogram with that information. Since each individual's name is a distinctive identifier, each element in the database will be included in only one of the k bins of the histogram. We see, therefore, that the main difference between structurally joint and structurally disjoint queries lies in whether the k queries sent by the analyst split the database into k disjoint groups or not.

2.3.2 Sensitivity

As we discussed earlier in this chapter, we must bear in mind that some analysts may have auxiliary information about the databases that could help them retrieve information about specific individuals in the database. With that in mind, we would like to measure how sensitive the output of a query can be when any of the elements of the database is missing. For that purpose, we define the ℓp-sensitivity.

Definition 2.3 (ℓp-sensitivity). The ℓp-sensitivity of a query is the maximum p-norm of the difference between two outputs of the same query for any two adjacent databases, i.e.,

$$\Delta_p f = \max_{\substack{x, y \subseteq 2^D \\ d(x, y) = 1}} \big\lVert f(x) - f(y) \big\rVert_p = \max_{\substack{x, y \subseteq 2^D \\ d(x, y) = 1}} \left( \sum_{i=1}^{k} \big|f_i(x) - f_i(y)\big|^p \right)^{1/p}, \qquad (2.2)$$

where f_i(·) represents the i-th component of the query sent by the analyst.

We will now provide two examples to illustrate the concept of ℓp-sensitivity. For the first example, let's say we have a database with data from different patients in a hospital, and there is an analyst interested in knowing the number of smokers. The analyst, therefore, decides to use a counting query with k = 1. In this case, adding or removing the information of a single individual in the database will change the count by at most 1, so Δp f = 1, ∀p. For the second example, let's say we have a database with data from different people in Stockholm, and the analyst is interested in generating a graph with the percentage of people living in Stockholm from each of the different continents. The analyst, therefore, decides to use a histogram query with k = 7 (Europe, Asia, Africa, North America, South America, Australia, and Antarctica). In this second case, the queries divide the database into seven disjoint cells, and the addition or removal of a single database individual affects the count in exactly one cell. In fact, we can see that the value of k plays no special role because each individual will belong to only one of the cells. Consequently, the sensitivity is Δp f = 1, ∀p, irrespective of the value of k.


             Number of queries
              5             20            80
p = 0.9   0.38 ± 0.25   0.29 ± 0.09   0.28 ± 0.03
p = 0.8   0.43 ± 0.30   0.36 ± 0.14   0.31 ± 0.07
p = 0.7   0.60 ± 0.80   0.18 ± 0.47   0.29 ± 0.22

Table 2.2: Mean and standard errors over 100 experiments for the estimation of τ = 0.3 using the randomized response mechanism.

However, not all query functions are so easy to interpret. In general, calculating the sensitivity of an arbitrary query function is not an easy task. A considerable amount of research effort is focused on estimating sensitivities [12] or finding alternative ways of calculating them [13].
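For very small collections, the ℓp-sensitivity can still be computed by brute force directly from Definition 2.3, by enumerating all pairs of adjacent databases. The sketch below is our own illustration of this for the counting query of the first example above; it is feasible only for tiny databases, which is precisely why the estimation techniques of [12, 13] exist.

```python
from itertools import chain, combinations

# Brute-force l_p-sensitivity of a query over all pairs of adjacent databases
# drawn from a small collection D. Illustrative only: exponential in |D|.

def all_databases(collection):
    """Enumerate every subset of the collection (every possible database)."""
    return chain.from_iterable(combinations(collection, r) for r in range(len(collection) + 1))

def hamming_distance(x, y):
    """Number of records in which the two databases differ."""
    return len(set(x) ^ set(y))

def lp_sensitivity(query, collection, p=1):
    """Maximum p-norm of the difference of query outputs over adjacent databases."""
    best = 0.0
    databases = list(all_databases(collection))
    for x, y in combinations(databases, 2):
        if hamming_distance(x, y) == 1:  # adjacent databases
            diff = [abs(a - b) ** p for a, b in zip(query(x), query(y))]
            best = max(best, sum(diff) ** (1.0 / p))
    return best

# Toy collection of patients (name, smoker flag) and a counting query with k = 1.
D = [("Alice", 1), ("Bob", 0), ("Carol", 1), ("Dave", 0)]

def counting_query(db):
    """Counting query: how many smokers are in the database?"""
    return [sum(smoker for _, smoker in db)]

print(lp_sensitivity(counting_query, D, p=1))  # 1.0, as argued above
```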

2.4 Towards defining differential privacy

Let’s consider an analyst who wants to know what percentage of people living in Barcelona have pets. To do this, the analyst could perform the experiment using a technique called randomized response [14]. For this technique, the analyst would ask every person living in Barcelona from a database x✓ 2Dwhether they

have pets or not, and the people would answer to that query f : 2D ! {0, 1} truthfully with probability p and lie with probability 1 p,

RRp(f ), 8 < : f (x) with probability p 1 f (x) with probability 1 p, (2.3) where the value of p is assumed to be publicly known. In this same regard, if we consider that the fraction of participants having pets is ⌧ , we could compute the expected number of outputs equal to 1 as

EhRRp(f ) = 1

i

= p⌧ + (1 p)(1 ⌧ ). (2.4) With this in mind, any analyst could then estimate the value of ⌧ from the outputs of the queries and equation (2.4). To give an example, we have performed an experiment in which the analyst estimates the value of ⌧ after receiving 5, 20, and 80 randomized responses from the participants. Table2.2

shows an example of the mean and standard deviation of the experiment for three di↵erent values of p.
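To make the estimator concrete: inverting equation (2.4) gives τ̂ = (ȳ − (1 − p))/(2p − 1), where ȳ is the fraction of responses equal to 1. The following sketch is our own reproduction of this kind of experiment; the exact setup behind Table 2.2 is not specified beyond the text, so the numbers will not match it exactly.

```python
import numpy as np

# Monte-Carlo illustration of the randomized response mechanism (eq. 2.3) and the
# estimator of tau implied by eq. (2.4). Parameters mirror Table 2.2; this is
# only a sketch of the experiment, not the thesis's original code.
rng = np.random.default_rng(0)
tau = 0.3  # true fraction of participants having pets

def estimate_tau(p, n_responses, n_experiments=100):
    estimates = []
    for _ in range(n_experiments):
        truth = rng.random(n_responses) < tau        # true answers f(x)
        honest = rng.random(n_responses) < p         # answer truthfully w.p. p
        responses = np.where(honest, truth, ~truth)  # RR_p(f)
        y_bar = responses.mean()
        estimates.append((y_bar - (1 - p)) / (2 * p - 1))  # invert eq. (2.4)
    return np.mean(estimates), np.std(estimates)

for p in (0.9, 0.8, 0.7):
    for n in (5, 20, 80):
        mean, std = estimate_tau(p, n)
        print(f"p = {p}, {n:2d} queries: {mean:.2f} +/- {std:.2f}")
```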

From the results in Table 2.2, we can observe three things. First, the estimate of τ becomes more accurate as the analyst obtains more responses from the database. Second, increasing the randomness of the responses makes it more difficult for the analyst to get an accurate estimate. Third, and most importantly, the outputs of the query have no unique identifiers, yet the aggregation of answers allows the analyst to draw some statistical information about the people in Barcelona.

This last observation shares the same premise as the concept of differential privacy. That is, randomization protects the individuals in the database while still letting the analyst get useful information about the participants as a whole. Randomization is actually essential to guarantee privacy regardless of all present or even future sources of auxiliary information, e.g., other databases, studies, websites, online communities, newspapers, or government statistics [15].

Formally, we will randomize the responses by adding noise to the output of the queries we want to conceal. Most importantly, we want to control the variance of the added noise, as it determines the balance between the amount of exposed information and the knowledge gained by the analyst.

Definition 2.4 (Randomized mechanism). A randomized mechanism M with domain 2^D and range R^k provides privacy to a database x ⊆ 2^D by introducing randomness to the output of an observed query f(x), i.e.,

$$M(x) = f(x) + W,$$

where W ∈ R^k is a random variable we call additive noise. Throughout this project, we will use m(x) to denote a sample of the random variable M(x), and M_x to denote the distribution of the random variable M(x), such that m(x) ∼ M_x. Similarly, we will use w to denote a sample of the random variable W, and Lap(µ, λ) or N(µ, σ²) to denote the Laplace or Gaussian distributions, respectively, such that

$$w_L \sim \mathrm{Lap}(\mu, \lambda), \qquad f_{W_L}(w) = \frac{1}{2\lambda} \exp\!\left(-\frac{|w - \mu|}{\lambda}\right),$$

$$w_G \sim \mathcal{N}(\mu, \sigma^2), \qquad f_{W_G}(w) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\lVert w - \mu \rVert^2}{2\sigma^2}\right).$$
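Definition 2.4 translates directly into code. The sketch below is our own illustration of a randomized mechanism that perturbs the output of a query with either Laplace or Gaussian additive noise; how to choose the scale λ or the standard deviation σ is the subject of Chapter 3.

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_mechanism(query_output, scale):
    """m(x) = f(x) + w_L with w_L ~ Lap(0, scale), applied component-wise."""
    query_output = np.asarray(query_output, dtype=float)
    return query_output + rng.laplace(loc=0.0, scale=scale, size=query_output.shape)

def gaussian_mechanism(query_output, sigma):
    """m(x) = f(x) + w_G with w_G ~ N(0, sigma^2), applied component-wise."""
    query_output = np.asarray(query_output, dtype=float)
    return query_output + rng.normal(loc=0.0, scale=sigma, size=query_output.shape)

# Example: a counting query f(x) = [number of smokers in the database].
f_x = [12.0]
print(laplace_mechanism(f_x, scale=1.0))
print(gaussian_mechanism(f_x, sigma=1.0))
```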

In fact, the Laplace and Gaussian distributions are the most explored noise distributions for W in the privacy preservation literature. They were first introduced in [16] and [17], respectively, and have been widely used ever since. Other lesser-known examples in the literature that contribute to the development of differentially private mechanisms include the exponential mechanism [18, 19], the geometric mechanism [20], the median mechanism [21], and the multiplicative weights mechanism [22], among others.


Chapter 3

A primer on differential privacy and its history

After having introduced the fundamentals of differential privacy (DP) in the previous chapter, we can now proceed to formally define it and present some of its key properties. We will specifically divide this chapter into three sections. The first section presents the standard definition of DP and discusses some of its limitations. The subsequent sections explore different relaxations of the definition of DP capable of (i) measuring the privacy guarantee for any of the randomized mechanisms presented in Section 2.4, and (ii) quantifying the effect of having an analyst who uses side information from different databases.

3.1 Standard definition of Differential Privacy

Intuitively, a differentially private mechanism guarantees that the outputs of the randomized mechanism behave similarly for similar input databases.

Definition 3.1 (ε-Differential Privacy). A randomized mechanism M with domain 2^D and range R^k satisfies ε-differential privacy if for any two adjacent inputs x, y ⊆ 2^D and for any subset of outputs S ⊆ R^k it satisfies

$$\Pr[m(x) \in S] \leq e^{\epsilon} \Pr[m(y) \in S], \qquad (3.1)$$

which ensures that, for every run of the randomized mechanism M, the ratio between the output distributions for neighboring databases is always upper bounded by a factor that depends on ε ∈ [0, +∞). Since the databases x, y ⊆ 2^D in equation (3.1) are interchangeable, it follows by symmetry that

$$\Pr[m(x) \in S] \geq e^{-\epsilon} \Pr[m(y) \in S].$$

For mathematical convenience, we will introduce a new parameter that measures the ratio between the two distributions in equation (3.1).

Definition 3.2 (Privacy loss). Given an output ξ ⊆ S of a randomized mechanism and two databases x, y ⊆ 2^D, the privacy loss gives us a measure of how much more likely it is that the output ξ is generated from the database x than from another database y,

$$L(\xi, M, x, y) = \ln\!\left(\frac{\Pr_{M_x}[m(x) \in \xi]}{\Pr_{M_y}[m(y) \in \xi]}\right) = \ln\!\left(\frac{\Pr_{M_x}[\xi]}{\Pr_{M_y}[\xi]}\right), \qquad (3.2)$$

which might be positive if an event is more likely under x than under y, or negative if an event is more likely under y than under x.

If we now compare equations (3.1) and (3.2) and consider that the range of ε is [0, +∞), we come to the conclusion that, in the worst-case scenario, the absolute value of the privacy loss is upper bounded by ε, i.e., |L(ξ, M, x, y)| ≤ ε. We can therefore interpret ε as the parameter that ensures that an observed output ξ is (almost) equally likely to be observed from a database x as from any other adjacent database y.

What is left is to find the relationship between the parameters of the randomized mechanism and ε. For that, let's consider that the curator decides to use the Laplace mechanism. As its name suggests, the Laplace mechanism simply returns the output of a query f: 2^D → R^k perturbed with noise drawn from the Laplace distribution Lap(µ, λ), i.e.,

$$m(x) = f(x) + w_L = \big[f_1(x), \ldots, f_k(x)\big]^\top + \big[w_1, \ldots, w_k\big]^\top,$$

where w_L ∈ R^k is a vector of i.i.d. random samples from W. If we set µ = 0, the ratio between the probability density functions at some arbitrary point ξ̄ is

$$\lim_{\xi \to \bar{\xi}} \left(\frac{\Pr_{M_x}[\xi]}{\Pr_{M_y}[\xi]}\right) = \prod_{i=1}^{k} \frac{\exp\!\big(-|f_i(x) - \bar{\xi}_i|/\lambda\big)}{\exp\!\big(-|f_i(y) - \bar{\xi}_i|/\lambda\big)} = \prod_{i=1}^{k} \exp\!\left(\frac{|f_i(y) - \bar{\xi}_i| - |f_i(x) - \bar{\xi}_i|}{\lambda}\right)$$
$$\leq \prod_{i=1}^{k} \exp\!\left(\frac{|f_i(x) - f_i(y)|}{\lambda}\right) = \exp\!\left(\frac{\lVert f(x) - f(y) \rVert_1}{\lambda}\right), \qquad (3.3)$$

where the first equality follows from the fact that the noise samples are i.i.d., and the inequality follows from the triangle inequality on the absolute value. Notice that equation (3.3) computes the ratio between probabilities as the set ξ becomes so small that it tends to an arbitrary point ξ̄.

We can now ensure that the Laplace mechanism preserves ε-DP by finding an upper bound for the privacy loss. From the definition of ℓp-sensitivity and the fact that the databases in equation (3.3) satisfy d(x, y) ≤ 1, it follows that

$$L(\xi, M, x, y) \leq \frac{\lVert f(x) - f(y) \rVert_1}{\lambda} \leq \frac{\Delta_1 f}{\lambda} = \epsilon. \qquad (3.4)$$

Therefore, the Laplace mechanism preserves ε-DP if the scale parameter is calibrated to the ℓ1-sensitivity of f divided by ε, i.e., w_L ∼ Lap(0, Δ1f/ε).
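This calibration can be checked empirically. The sketch below (our own illustration) releases a counting query with sensitivity Δ1f = 1 through a Laplace mechanism with scale λ = Δ1f/ε and verifies numerically that the privacy loss of equation (3.2) never exceeds ε for two adjacent databases.

```python
import numpy as np
from scipy.stats import laplace

# Empirical check that the Laplace mechanism with scale = Delta_1 f / epsilon
# keeps the privacy loss below epsilon (our illustration of eqs. (3.3)-(3.4)).
epsilon = 0.5
delta_1f = 1.0                      # sensitivity of a counting query
scale = delta_1f / epsilon          # lambda = Delta_1 f / epsilon

f_x, f_y = 12.0, 13.0               # query outputs on two adjacent databases

# Privacy loss at a grid of possible outputs xi (log-ratio of the two densities).
xi = np.linspace(-20, 40, 2001)
loss = laplace.logpdf(xi, loc=f_x, scale=scale) - laplace.logpdf(xi, loc=f_y, scale=scale)
print(f"max |privacy loss| = {np.abs(loss).max():.3f} <= epsilon = {epsilon}")

# A noisy release of the query output.
rng = np.random.default_rng(1)
print("released value:", f_x + rng.laplace(0.0, scale))
```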

3.1.1 Advantages of ε-DP

One of the most interesting aspects of differential privacy is understanding the behavior of different randomized mechanisms under composition, that is, understanding how the privacy of the users in the databases degrades when we combine different differentially private mechanisms.

Definition 3.3 (Composition theorem of ε-DP mechanisms). The combination of an ε₁-DP Laplace mechanism M₁: 2^D → R^k and an ε₂-DP Laplace mechanism M₂: 2^D → R^k, denoted by M₁,₂: 2^D → R^k × R^k, satisfies, for any two adjacent inputs x, y ⊆ 2^D and for any given pair of outputs (ξ₁, ξ₂) ⊆ R^k × R^k,

$$\frac{\Pr_{M_x}[m_{1,2}(x) \in (\xi_1, \xi_2)]}{\Pr_{M_y}[m_{1,2}(y) \in (\xi_1, \xi_2)]} = \frac{\Pr_{M_x}[m_1(x) \in \xi_1]\,\Pr_{M_x}[m_2(x) \in \xi_2]}{\Pr_{M_y}[m_1(y) \in \xi_1]\,\Pr_{M_y}[m_2(y) \in \xi_2]}$$
$$= \left(\frac{\Pr_{M_x}[m_1(x) \in \xi_1]}{\Pr_{M_y}[m_1(y) \in \xi_1]}\right) \left(\frac{\Pr_{M_x}[m_2(x) \in \xi_2]}{\Pr_{M_y}[m_2(y) \in \xi_2]}\right) \leq \exp(\epsilon_1)\exp(\epsilon_2) = \exp(\epsilon_1 + \epsilon_2),$$

i.e., the combination of an ε₁-DP Laplace mechanism and another ε₂-DP Laplace mechanism satisfies (ε₁ + ε₂)-DP.


As we have seen above, finding an upper bound on the privacy loss of the Laplace mechanism is possible because there is a tight and accurate relationship between ε and the parameters of the Laplace mechanism. This, however, is not true for every mechanism, so finding a bound on the privacy loss for the composition of different randomized mechanisms can be a difficult task.

3.1.2 Disadvantages of ε-DP

If we repeated the same analysis as in equation (3.3) for the Gaussian mechanism, we would find that there is no practical value of ε for which it satisfies ε-DP. This is a problem because, in some applications, using Gaussian additive noise is more convenient than other types of noise distributions. There are, in fact, two main reasons why the Gaussian mechanism is more convenient than other distributions. The first reason is that the errors that may already be present in the databases are usually modelled with a Gaussian distribution. Hence, adding Gaussian noise to the output of the queries facilitates the analysis of DP. The second reason is that the probability that the analyst retrieves information from the database is computed as a tail bound approximation of the noise distribution. Hence, since the tails of the Gaussian distribution decay much faster than those of the Laplace distribution, adding Gaussian noise to the output of the queries reduces the probability of revealing information about the users in the database.

These reasons motivated the literature to explore softer versions of DP that could work well with Gaussian mechanisms. Some of the most important contributions are (ε, δ)-approximate DP [17], (ε, δ)-probabilistic DP [23], (ε, δ)-random DP [24], (ε, τ)-concentrated DP [25], and (α, ε)-Rényi DP [26]. In the next two sections we are going to focus specifically on (ε, δ)-approximate DP and (α, ε)-Rényi DP.

3.2 Relaxation of Differential Privacy

As we discussed at the end of the previous section, the exploration of other definitions of differential privacy was motivated purely by the fact that there is no feasible value of ε for which the Gaussian mechanism satisfies ε-DP. In fact, Fang Liu proved in [27] that the analysis of ε-DP requires unrealistic assumptions about the database in order to find the relationship between the parameters of the Gaussian mechanism and ε, and, even then, the results it yields are very unlikely to be used in practical applications because the probability that the analyst retrieves information from the database is considered too large to ensure sufficient privacy protection.

Definition 3.4 ((ε, δ)-differential privacy). A randomized mechanism M with domain 2^D and range R^k satisfies (ε, δ)-differential privacy if for any two adjacent inputs x, y ⊆ 2^D and for any subset of outputs S ⊆ R^k it satisfies

$$\Pr[m(x) \in S] \leq e^{\epsilon} \Pr[m(y) \in S] + \delta. \qquad (3.5)$$

This ensures that, for every run of the randomized mechanism M, the ratio between the output distributions for neighboring databases is upper bounded by a factor that depends on ε with probability 1 − δ.

In order to better understand the role of δ in equation (3.5), we define B = {ξ ⊆ S : L(ξ, M, x, y) ≤ ε} as the subset of outputs ξ in S ⊆ R^k for which the privacy loss is lower than or equal to ε. Likewise, we define the complement set B^c = {ξ ⊆ S : L(ξ, M, x, y) > ε}. With that in mind, we can express the relationship between the probability density functions of the outputs of the randomized mechanisms as

$$\Pr[m(x) \in S] = \Pr[m(x) \in B] + \Pr[m(x) \in B^c] \qquad (3.6)$$
$$\leq e^{\epsilon} \Pr[m(y) \in B] + \Pr[m(x) \in B^c] \qquad (3.7)$$
$$\leq e^{\epsilon} \Pr[m(y) \in S] + \Pr[m(x) \in B^c] \qquad (3.8)$$
$$= e^{\epsilon} \Pr[m(y) \in S] + \Pr_{M_x}\!\big[L(\xi, M, x, y) > \epsilon\big], \qquad (3.9)$$

where (3.6) follows from the fact that S = B ∪ B^c and B ∩ B^c = ∅, (3.7) follows from the standard definition of ε-DP, and (3.8) follows from the fact that B ⊆ S. If we now take the result obtained in (3.9) and compare it with the definition of (ε, δ)-DP in (3.5), we come to the conclusion that

$$\delta = \max_{\substack{x, y \subseteq 2^D \\ d(x, y) \leq 1}} \Pr_{M_x}\!\big[L(\xi, M, x, y) > \epsilon\big], \qquad (3.10)$$

where the maximum is taken over all possible pairs of adjacent databases. The value of δ, therefore, gives us a measure of the maximum probability with which the randomized mechanism does not satisfy ε-DP. This measure has been adopted by the research community as the probability that the analyst retrieves information from the database, and it is the cornerstone of (ε, δ)-DP, for it allows exploring differential privacy for the Gaussian mechanism and for the family of exponential mechanisms by means of the privacy loss. Moreover, the analysis of (3.10) usually leads to computing a tail approximation of the noise distribution, so any mechanism that measures its privacy preservation using that equation is said to have (ε, δ) calibrated according to a tail bound approximation.

In practical applications, one can find an upper bound for δ by following the same reasoning as in [28], where Abadi et al. make use of the Markov inequality to facilitate the computation of δ as follows,

$$\Pr_{M_x}\!\big[L(\xi, M, x, y) > \epsilon\big] \leq \frac{\mathbb{E}_{M_x}\!\big[L(\xi, M, x, y)\big]}{\epsilon}, \qquad \forall \epsilon > 0.$$

Notice that the numerator on the right-hand side of the equation corresponds to the Kullback–Leibler divergence. Hence, instead of finding the value of δ from equation (3.10), one can compute the Kullback–Leibler divergence for all pairs of adjacent databases and use the resulting bound, divided by ε, as an upper bound for δ.
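For a Gaussian mechanism, this bound is easy to evaluate: the outputs for two adjacent databases are N(f(x), σ²) and N(f(y), σ²), whose Kullback–Leibler divergence is (f(x) − f(y))²/(2σ²) ≤ (Δ2f)²/(2σ²), so the Markov inequality gives δ ≤ (Δ2f)²/(2σ²ε). The short sketch below is our own illustration of this bound, not code from [28].

```python
# Markov-inequality bound on delta for a Gaussian mechanism (our illustration).
# KL( N(mu1, s^2) || N(mu2, s^2) ) = (mu1 - mu2)^2 / (2 s^2), and the Markov
# inequality from the text then gives  delta <= KL / epsilon.

def delta_upper_bound(sensitivity, sigma, epsilon):
    kl = sensitivity ** 2 / (2.0 * sigma ** 2)   # E_{M_x}[L], the KL divergence
    return kl / epsilon

print(delta_upper_bound(sensitivity=1.0, sigma=4.0, epsilon=1.0))  # 0.03125
```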

3.2.1 Advantages of (ε, δ)-DP

As we discussed at the beginning of this section, the definition of (ε, δ)-DP allows us to measure the privacy preservation of Gaussian mechanisms, i.e., thanks to the definition of (ε, δ)-DP, we can find the relationship between the parameters of the Gaussian mechanism and the corresponding privacy level (ε, δ). The mathematical proof of how to find that relationship is thoroughly presented in [15], and its analysis is left to the reader. The conclusion of the analysis is that a Gaussian mechanism with distribution N(0, σ²) satisfies (ε, δ)-DP for ε ∈ (0, 1) if the standard deviation fulfills σ ≥ Δ₂f √(2 log(1.25/δ))/ε.
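In practice, one usually inverts this condition to pick the noise level: given ε ∈ (0, 1), δ, and the ℓ2-sensitivity Δ2f, the smallest standard deviation allowed by the bound is σ = Δ₂f √(2 log(1.25/δ))/ε. The helper below is our own sketch of this calibration, not code from [15].

```python
import math

def gaussian_sigma(sensitivity_l2, epsilon, delta):
    """Smallest noise std allowed by the classical (epsilon, delta)-DP bound
    sigma >= Delta_2 f * sqrt(2 ln(1.25/delta)) / epsilon, stated for epsilon in (0, 1)."""
    if not 0 < epsilon < 1:
        raise ValueError("the classical bound is stated for epsilon in (0, 1)")
    return sensitivity_l2 * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

# Example: a query with l2-sensitivity 1, epsilon = 0.5 and delta = 1e-5.
print(gaussian_sigma(1.0, epsilon=0.5, delta=1e-5))  # about 9.7
```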

If one wants to work in a different privacy regime than ε ∈ (0, 1), Balle et al. [29] proved that it is possible to overcome this limitation by calibrating the variance of the Gaussian mechanism according to the Gaussian cumulative density function instead of using the tail bound approximation from equation (3.10). One of the conclusions of their analysis is that a Gaussian mechanism with distribution N(0, σ²) satisfies (0, δ)-DP if the standard deviation σ = Δ₂f/(2δ). The other conclusion of their analysis is that a Gaussian mechanism with distribution N(0, σ²) satisfies (ε, δ)-DP for ε > 1 if the standard deviation σ ≥ Δ₂f/√(2ε).

These results, which prove that it is possible to bound the privacy loss (3.2) for Gaussian mechanisms, have motivated the study of (ε, δ)-DP for applications that combine different outputs from multiple mechanisms.

The two most important examples of theorems for combining different outputs from multiple mechanisms are the basic composition [16] and the advanced composition [30]. They analyze how the privacy loss deteriorates in two specific cases: (i) the repeated use of different differentially private algorithms on the same database, and (ii) the repeated use of differentially private algorithms on different databases that may contain information related to the same individual.

Definition 3.5 (Basic composition). Let M_ℓ: 2^D → R^{k_ℓ} be an (ε_ℓ, δ_ℓ)-DP randomized mechanism for ℓ ∈ {1, ..., L}. If M_L: 2^D → ∏_{ℓ=1}^{L} R^{k_ℓ} is defined to be M_L(x) = [M₁(x), ..., M_L(x)], then M_L satisfies (∑_{ℓ=1}^{L} ε_ℓ, ∑_{ℓ=1}^{L} δ_ℓ)-DP.

The basic composition theorem is easy to interpret, but it assumes that the privacy loss worsens every time an analyst sends a query to the database. On average, though, this is not necessarily true, because there will be cases where the analyst does not gain any extra information about a specific individual after receiving the output of the randomized mechanism. Two examples of such cases are when the individual does not belong to either of the two adjacent databases, or when the information that the analyst retrieves from two adjacent databases is the same. If we take this into account and assume that all the randomized mechanisms satisfy (ε, δ)-DP, we can get a tighter bound on the privacy loss at the expense of worsening the probability of exposing the users' data (3.10).

Definition 3.6 (Advanced composition). For any ε > 0, δ ∈ [0, 1], and δ̃ ∈ (0, 1], the class of (ε, δ)-DP mechanisms satisfies (ε̃, kδ + δ̃)-DP under k-fold adaptive composition (any k consecutive queries over any adjacent databases), with

$$\tilde{\epsilon} = k\epsilon(e^{\epsilon} - 1) + \epsilon\sqrt{2k \log(1/\tilde{\delta})}.$$
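The two composition rules are easy to compare numerically. The sketch below (our own illustration) computes the overall privacy guarantee of the k-fold composition of an (ε, δ)-DP mechanism under both the basic and the advanced composition theorems; the parameter values are arbitrary.

```python
import math

def basic_composition(epsilon, delta, k):
    """Basic composition (Definition 3.5 with identical mechanisms): (k*eps, k*delta)-DP."""
    return k * epsilon, k * delta

def advanced_composition(epsilon, delta, k, delta_tilde):
    """Advanced composition (Definition 3.6): (eps_tilde, k*delta + delta_tilde)-DP with
    eps_tilde = k*eps*(e^eps - 1) + eps*sqrt(2k ln(1/delta_tilde))."""
    eps_tilde = k * epsilon * (math.exp(epsilon) - 1.0) \
        + epsilon * math.sqrt(2.0 * k * math.log(1.0 / delta_tilde))
    return eps_tilde, k * delta + delta_tilde

eps, delta, k = 0.1, 1e-6, 100
print(basic_composition(eps, delta, k))           # (10.0, 1e-4)
print(advanced_composition(eps, delta, k, 1e-6))  # roughly (6.3, 1.01e-4)
```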

There is recent work that explores tighter bounds on the composition of (ε, δ)-DP mechanisms, but we are mainly interested in the basic and advanced composition theorems because they are the ones that have been widely explored and embraced by multiple research communities. For further investigation of tighter bounds, see [25, 31, 32].

3.2.2 Disadvantages of (ε, δ)-DP

The introduction of δ in the standard definition of DP was a breakthrough because it allowed analyzing mechanisms that are mostly differentially private but can occasionally result in larger privacy losses. However, since the definition of (ε, δ)-DP was introduced in the literature, there has been no consensus on how to interpret a privacy loss greater than ε. It is not clear whether the records of the individual that differs between adjacent databases would become completely or only partially exposed.

Another problem with the definition of (ε, δ)-DP is that we cannot express ε as the ratio between the probability density distributions in equation (3.5) because of the additive parameter δ. As a consequence, it not only becomes much harder to find tight bounds for the privacy loss, but it also hinders the analysis of advanced composition theorems. In fact, it has been proved that the composition of non-homogeneous mechanisms, i.e., the composition of different (ε_i, δ_i)-DP mechanisms for different values of ε_i and δ_i, is an NP-hard problem [33].

Motivated by these problems, Dwork and Rothblum [25], followed by Bun and Steinke [31], formulated the privacy loss in terms of the Rényi divergence for two reasons. The first reason was to find a more convenient and accurate way of tracking the privacy loss of differentially private mechanisms, and the second reason was to find a definition of DP that could be generalized to many randomized mechanisms instead of being limited to Gaussian mechanisms. Taken together, these two articles and the work on deep learning with differential privacy in [28] point toward adopting the Rényi divergence in the analysis of differential privacy as an effective and flexible method for capturing privacy guarantees.

3.3 Rényi differential privacy

When the concept of differential privacy was formally presented in [30], Dwork et al. suggested a sophisticated way to evaluate differential privacy in terms of distance measures between distributions. The distance measure they used is the Kullback–Leibler (KL) divergence, defined as

$$D_{\mathrm{KL}}\big(M(x) \,\|\, M(y)\big) = \mathbb{E}_{M_x}\!\left[\ln\!\left(\frac{\Pr_{M_x}[m(x) \in S]}{\Pr_{M_y}[m(y) \in S]}\right)\right],$$

where the ratio is between the probability distributions of a randomized mechanism M with domain 2^D and range R^k for any two adjacent inputs x, y ⊆ 2^D and for any subset of outputs S ⊆ R^k. This distance measure, however, is not symmetric and can be infinite, specifically when the support of M(x) is not contained in the support of M(y). As a consequence, Dwork et al. proposed the maximum divergence as a better approach to find the relationship between the parameters of the randomized mechanisms and the privacy loss for ε-DP,

$$D_{\infty}\big(M(x) \,\|\, M(y)\big) = \max_{S \subseteq \mathrm{Supp}(M(y))} \left[\ln\!\left(\frac{\Pr_{M_x}[m(x) \in S]}{\Pr_{M_y}[m(y) \in S]}\right)\right], \qquad (3.11)$$

as well as the δ-approximate maximum divergence to find the relationship between the parameters of the randomized mechanisms and the privacy loss for (ε, δ)-DP,

$$D_{\infty}^{\delta}\big(M(x) \,\|\, M(y)\big) = \max_{S \subseteq \mathrm{Supp}(M(y))} \left[\ln\!\left(\frac{\Pr_{M_x}[m(x) \in S] - \delta}{\Pr_{M_y}[m(y) \in S]}\right)\right], \qquad (3.12)$$

such that a randomized mechanism satisfies ε-DP for any two adjacent databases x, y ⊆ 2^D if D_∞(M(x)‖M(y)) ≤ ε and D_∞(M(y)‖M(x)) ≤ ε, and a randomized mechanism satisfies (ε, δ)-differential privacy if D_∞^δ(M(x)‖M(y)) ≤ ε and D_∞^δ(M(y)‖M(x)) ≤ ε. Notice that the maximum in equations (3.11) and (3.12) is taken over S ⊆ Supp(M(y)), i.e., over subsets of the support of M(y), thus preventing the denominator of the ratio between probabilities from being zero.

These notions of maximum divergence (3.11) and δ-approximate maximum divergence (3.12) have been so rapidly accepted and embraced within the privacy preservation research community that we can find numerous applications incorporating them in their analyses of differential privacy. Nevertheless, as we discussed in Sections 3.1.2 and 3.2.2, the standard definition of DP and the definition of (ε, δ)-DP present some disadvantages. With that in mind, and in an attempt to find a more natural relaxation of the standard definition of DP, the authors in [25, 31] decided to explore another distance measure, namely the Rényi divergence, defined as

$$D_{\alpha}\big(M(x) \,\|\, M(y)\big) = \frac{1}{\alpha - 1} \log \mathbb{E}_{M_y}\!\left[\left(\frac{\Pr_{M_x}[m(x) \in S]}{\Pr_{M_y}[m(y) \in S]}\right)^{\alpha}\right], \qquad (3.13)$$

where α is the order of the Rényi divergence. The Rényi divergence is generally defined for α ≥ 0 and α ≠ 1, but the privacy preservation literature has only studied the cases where α > 1, so we will also restrict the analysis of differential privacy to those values. In particular, in the limit α → 1, the Rényi divergence corresponds to the KL divergence [34], and, as α approaches infinity, the Rényi divergence is increasingly determined by the outputs S ⊆ R^k of highest probability.

In terms of distance measures, a randomized mechanism satisfies (α, ε)-RDP if and only if D_α(M(x)‖M(y)) ≤ ε and D_α(M(y)‖M(x)) ≤ ε. It is important to remark that the expectation in equation (3.13) is taken over M_y. Hence, contrary to (3.11) and (3.12), we are not limited to studying the case α → ∞, but can rather study different orders α of the privacy loss.

From equation (3.13), Mironov [26] proposed a formal definition of Rényi differential privacy (RDP) that defines the relationship between the probability density functions and ε.


Definition 3.7 (Rényi differential privacy). A randomized mechanism M with domain 2^D and range R^k satisfies ε-Rényi differential privacy of order α, denoted by (α, ε)-RDP, if for any two adjacent inputs x, y ⊆ 2^D and for any subset of outputs S ⊆ R^k it holds that

$$\Pr[m(x) \in S] \leq \Big(e^{\epsilon} \Pr[m(y) \in S]\Big)^{(\alpha - 1)/\alpha}. \qquad (3.14)$$

3.3.1 Advantages of (α, ε)-RDP

The definition of (α, ε)-RDP comes with numerous advantages over ε-DP and (ε, δ)-DP. First, and most notably, RDP allows us to find an upper bound for the privacy loss of many different randomized mechanisms, including the Gaussian and the Laplace mechanisms. Second, (α, ε)-RDP is a natural generalization of ε-DP, and thus it is a very useful tool for the analysis of composition theorems, especially for the composition of non-homogeneous mechanisms. Third, the definition of RDP includes the α parameter, which allows studying higher-order moments of the privacy loss. This is particularly convenient for bounding the tails of the randomized mechanisms and controlling their effect on the privacy loss. Fourth, RDP shares, with some adaptations, many of the properties that make differential privacy a useful and versatile tool [26]. Fifth, and finally, although (α, ε)-RDP is a relaxation of ε-DP, it is a strictly stronger privacy definition than (ε, δ)-DP. In fact, it is possible to draw a direct connection between the privacy loss parameters of (α, ε)-RDP and (ε, δ)-DP [26].

Definition 3.8 (From $(\alpha, \epsilon)$-RDP to $(\epsilon, \delta)$-DP). A randomized mechanism $M$ with domain $2^{\mathcal{D}}$ and range $\mathbb{R}^k$ that satisfies $(\alpha, \epsilon)$-RDP according to (3.14) for any two adjacent inputs $x, y \in 2^{\mathcal{D}}$ and for any subset of outputs $\mathcal{S} \subseteq \mathbb{R}^k$ also satisfies $(\hat{\epsilon}, \delta)$-DP for

$$\hat{\epsilon} = \epsilon + \frac{\log(1/\delta)}{\alpha - 1}, \qquad \forall \delta \in (0, 1). \qquad (3.15)$$
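As a minimal illustration of Definition 3.8, the following sketch implements the conversion (3.15); the function name and the numerical values are our own arbitrary choices.

```python
import math

def rdp_to_dp(alpha, epsilon, delta):
    """Convert an (alpha, epsilon)-RDP guarantee into an (epsilon_hat, delta)-DP
    guarantee using equation (3.15)."""
    assert alpha > 1 and 0.0 < delta < 1.0
    return epsilon + math.log(1.0 / delta) / (alpha - 1.0)

# Example: a mechanism satisfying (alpha=10, epsilon=0.5)-RDP also satisfies
# (epsilon_hat, 1e-5)-DP, where
print(rdp_to_dp(alpha=10.0, epsilon=0.5, delta=1e-5))   # approximately 1.78
```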

3.3.2 Disadvantages of $(\alpha, \epsilon)$-RDP

The concept of $(\alpha, \epsilon)$-RDP is relatively new in the literature compared to the other definitions of DP, and thus software packages implementing it are still scarce. Moreover, most of the ongoing research in RDP is conducted at Google, so much of the information publicly available on their website or their GitHub repository is opaque.


Another problem with the definition of $(\alpha, \epsilon)$-RDP is that the $\alpha$ parameter can take any real value $\alpha > 1$, so finding the optimal value of $\alpha$ that yields the minimum $\hat{\epsilon}$ in (3.15) is not an easy task. In fact, there is no consensus on the range of values of $\alpha$ that one should try in practical experiments.
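In practice, a simple workaround is to evaluate (3.15) over a grid of candidate orders and keep the one that minimizes $\hat{\epsilon}$. The sketch below is our own illustration; it uses the Gaussian-mechanism bound $\alpha(\Delta_2 f)^2/(2\sigma^2)$ derived later in Section 3.3.3, and the grid and parameter values are arbitrary choices.

```python
import math

def best_dp_epsilon(eps_rdp, delta, alphas):
    """Given a function eps_rdp(alpha) returning the RDP epsilon of a mechanism
    at order alpha, return the smallest converted DP epsilon and its order."""
    candidates = [(eps_rdp(a) + math.log(1.0 / delta) / (a - 1.0), a) for a in alphas]
    return min(candidates)

# Gaussian mechanism with sensitivity 1 and noise sigma = 4 (illustrative values).
sensitivity, sigma, delta = 1.0, 4.0, 1e-5
gauss_rdp = lambda a: a * sensitivity**2 / (2.0 * sigma**2)

# A small grid of candidate orders (an arbitrary choice for illustration).
alphas = [1.5, 2, 3, 5, 8, 16, 32, 64, 128]
eps_hat, alpha_star = best_dp_epsilon(gauss_rdp, delta, alphas)
print(f"best alpha = {alpha_star}, epsilon_hat = {eps_hat:.3f}")
```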

3.3.3 RDP for different randomized mechanisms

Now that we have formally presented the definition of Rényi differential privacy, we will proceed to find the upper bound on the privacy loss for the Laplace and the Gaussian mechanisms. The analysis of (3.13) for $\alpha = 1$ corresponds to the KL divergence, and its results can be found in [26], so we will only summarize them in Table 3.1. As for the analysis of (3.13) for $\alpha > 1$, we decided to use the same reasoning as Dwork et al. [15] and find the upper bound on the privacy loss by considering the worst-case scenario. For the Laplace and Gaussian mechanisms, the worst-case scenario is defined as the probability of observing an output that occurs with a very different probability under $x \in 2^{\mathcal{D}}$ than under an adjacent database $y \in 2^{\mathcal{D}}$, where the probability space is the noise distribution of the randomized algorithm.

For our analysis, the Laplace and Gaussian distributions are continuous functions, so (3.13) can be simplified as

$$D_{\alpha}(P\,\|\,Q) = \frac{1}{\alpha-1}\log\left(\int_{-\infty}^{+\infty} p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx\right). \qquad (3.16)$$

Laplace mechanism

Let's consider a Laplace mechanism that introduces Laplace noise $w_L \sim \mathrm{Lap}(0, \lambda)$ to the output of a query $f : 2^{\mathcal{D}} \to \mathbb{R}^k$ with $\ell_1$-sensitivity $\Delta_1 f$, and an analyst who sends $k$ queries to two adjacent databases $x, y \in 2^{\mathcal{D}}$. In terms of privacy guarantees, the worst-case scenario occurs when the difference between the outputs of the queries $f(x)$ and $f(y)$ is maximal, i.e., when the difference between the outputs of the queries equals $\Delta_1 f$. We can therefore find the upper bound on the privacy loss by computing the Rényi divergence (3.16) between a Laplace distribution with mean $\mu = 0$ and a Laplace distribution with mean $\mu = \Delta_1 f$, i.e., $p(x) = \frac{1}{2\lambda}\exp(-|x - \Delta_1 f|/\lambda)$ and $q(x) = \frac{1}{2\lambda}\exp(-|x|/\lambda)$.

The absolute value in the exponents of both $p(x)$ and $q(x)$ forces us to evaluate the Rényi divergence over three separate intervals, namely, $(-\infty, 0]$, $[0, \Delta_1 f]$, and $[\Delta_1 f, +\infty)$, such that

$$\begin{aligned}
\int_{-\infty}^{+\infty} p(x)^{\alpha} q(x)^{1-\alpha}\, dx
&= \frac{1}{2\lambda}\int_{-\infty}^{0} \exp\!\big(\alpha(x-\Delta_1 f)/\lambda + (1-\alpha)x/\lambda\big)\, dx \\
&\quad + \frac{1}{2\lambda}\int_{0}^{\Delta_1 f} \exp\!\big(\alpha(x-\Delta_1 f)/\lambda - (1-\alpha)x/\lambda\big)\, dx \\
&\quad + \frac{1}{2\lambda}\int_{\Delta_1 f}^{+\infty} \exp\!\big(-\alpha(x-\Delta_1 f)/\lambda - (1-\alpha)x/\lambda\big)\, dx \\
&= \frac{1}{2}\exp(-\alpha\Delta_1 f/\lambda)
+ \frac{1}{2(2\alpha-1)}\Big[\exp\!\big((\alpha-1)\Delta_1 f/\lambda\big) - \exp(-\alpha\Delta_1 f/\lambda)\Big] \\
&\quad + \frac{1}{2}\exp\!\big((\alpha-1)\Delta_1 f/\lambda\big). \qquad (3.17)
\end{aligned}$$

Therefore, after putting all the similar terms together and substituting the result of the integral into (3.16), we conclude that, if a real-valued query function $f : 2^{\mathcal{D}} \to \mathbb{R}^k$ has sensitivity $\Delta_1 f$, then the Laplace mechanism satisfies

$$\left(\alpha,\ \frac{1}{\alpha-1}\log\!\left[\frac{\alpha}{2\alpha-1}\exp\!\left(\frac{(\alpha-1)\Delta_1 f}{\lambda}\right) + \frac{\alpha-1}{2\alpha-1}\exp\!\left(-\frac{\alpha\Delta_1 f}{\lambda}\right)\right]\right)\text{-RDP}.$$
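As a sanity check of this closed form, the snippet below, which is our own sketch and not part of the derivation, compares it against a direct numerical evaluation of (3.16) for arbitrary values of $\alpha$, $\Delta_1 f$, and $\lambda$.

```python
import numpy as np
from scipy.integrate import quad

def laplace_rdp_closed_form(alpha, delta_f, lam):
    """Closed-form upper bound on the RDP privacy loss of the Laplace mechanism."""
    a = alpha / (2 * alpha - 1) * np.exp((alpha - 1) * delta_f / lam)
    b = (alpha - 1) / (2 * alpha - 1) * np.exp(-alpha * delta_f / lam)
    return np.log(a + b) / (alpha - 1)

def laplace_rdp_numeric(alpha, delta_f, lam):
    """Rényi divergence (3.16) between Lap(delta_f, lam) and Lap(0, lam)."""
    p = lambda x: np.exp(-abs(x - delta_f) / lam) / (2 * lam)
    q = lambda x: np.exp(-abs(x) / lam) / (2 * lam)
    integral, _ = quad(lambda x: p(x)**alpha * q(x)**(1 - alpha),
                       -50, 50, points=[0.0, delta_f])
    return np.log(integral) / (alpha - 1)

alpha, delta_f, lam = 2.0, 1.0, 2.0   # illustrative values
print(laplace_rdp_closed_form(alpha, delta_f, lam))
print(laplace_rdp_numeric(alpha, delta_f, lam))   # should agree closely
```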

Gaussian mechanism

Similar to the previous case, let's consider a Gaussian mechanism that introduces Gaussian noise $w_G \sim \mathcal{N}(0, \sigma^2)$ to the output of a query $f : 2^{\mathcal{D}} \to \mathbb{R}^k$ with $\ell_2$-sensitivity $\Delta_2 f$, and an analyst who sends $k$ queries to two adjacent databases $x, y \in 2^{\mathcal{D}}$. Following the same reasoning as before, we can find the upper bound on the privacy loss by computing the Rényi divergence (3.16) between a Gaussian distribution with mean $\mu = 0$ and a Gaussian distribution with mean $\mu = \Delta_2 f$, such that

$$\begin{aligned}
D_{\alpha}(P\,\|\,Q) &= \frac{1}{\alpha-1}\log \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\alpha(x-\Delta_2 f)^2}{2\sigma^2}\right)\exp\!\left(-\frac{(1-\alpha)x^2}{2\sigma^2}\right) dx \\
&= \frac{1}{\alpha-1}\log \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1}{2\sigma^2}\Big[x^2 - 2\alpha\Delta_2 f\, x + \alpha(\Delta_2 f)^2\Big]\right) dx \\
&= \frac{1}{\alpha-1}\log \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\alpha\Delta_2 f)^2}{2\sigma^2}\right)\exp\!\left(\frac{(\alpha^2-\alpha)(\Delta_2 f)^2}{2\sigma^2}\right) dx \\
&= \frac{1}{\alpha-1}\log\left\{\exp\!\left(\frac{(\alpha^2-\alpha)(\Delta_2 f)^2}{2\sigma^2}\right)\right\} \\
&= \alpha(\Delta_2 f)^2/(2\sigma^2). \qquad (3.18)
\end{aligned}$$

Therefore, if a real-valued query function $f : 2^{\mathcal{D}} \to \mathbb{R}^k$ has sensitivity $\Delta_2 f$, then the Gaussian mechanism satisfies $\big(\alpha,\ \alpha(\Delta_2 f)^2/(2\sigma^2)\big)$-RDP.
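To illustrate how this bound is used in practice, the following sketch, which is our own and uses arbitrary parameter values, adds Gaussian noise to a toy query output and reports the corresponding $(\alpha, \epsilon)$-RDP guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mechanism(query_output, sigma):
    """Release a query output perturbed with Gaussian noise of standard deviation sigma."""
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(query_output))
    return query_output + noise

def gaussian_rdp_epsilon(alpha, l2_sensitivity, sigma):
    """RDP privacy loss of the Gaussian mechanism at order alpha."""
    return alpha * l2_sensitivity**2 / (2.0 * sigma**2)

# Illustrative query: the mean of a small toy database, with an assumed l2-sensitivity.
database = np.array([0.2, 0.7, 0.4, 0.9])
sensitivity, sigma = 0.25, 1.0
released = gaussian_mechanism(database.mean(), sigma)

for alpha in [2, 8, 32]:
    print(alpha, gaussian_rdp_epsilon(alpha, sensitivity, sigma))
```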

In some cases, the $\ell_1$-sensitivity is the only metric that gives us a logical and meaningful characterization of the query. In those cases, it is common to assume, without loss of generality, that $(\Delta_2 f)^2 = (\Delta_1 f)^2$, because the $\ell_p$-sensitivity (2.2) satisfies the following property

$$\begin{aligned}
(\Delta_1 f)^2 &= \max_{\substack{x,y \in 2^{\mathcal{D}} \\ d(x,y)=1}} \left(\sum_{i=1}^{k} |f_i(x) - f_i(y)|\right)^{\!2} \\
&= \max_{\substack{x,y \in 2^{\mathcal{D}} \\ d(x,y)=1}} \left(\sum_{i=1}^{k} |f_i(x) - f_i(y)|^{2} + \sum_{i=1}^{k}\sum_{\substack{j=1 \\ j\neq i}}^{k} |f_i(x) - f_i(y)|\cdot|f_j(x) - f_j(y)|\right) \\
&\geq \max_{\substack{x,y \in 2^{\mathcal{D}} \\ d(x,y)=1}} \left(\sum_{i=1}^{k} |f_i(x) - f_i(y)|^{2}\right) = (\Delta_2 f)^2.
\end{aligned}$$

Hence, in those cases where the $\ell_2$-sensitivity is taken to be $(\Delta_2 f)^2 = (\Delta_1 f)^2$, the Gaussian mechanism satisfies $\big(\alpha,\ \alpha(\Delta_1 f)^2/(2\sigma^2)\big)$-RDP.
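The inequality above is simply the statement that the squared $\ell_1$-norm dominates the squared $\ell_2$-norm of the vector of per-coordinate query differences. A quick numerical check, our own illustration on a random vector, is given below.

```python
import numpy as np

rng = np.random.default_rng(1)

# A vector of per-coordinate query differences |f_i(x) - f_i(y)| for adjacent x, y.
diffs = np.abs(rng.normal(size=10))

l1_squared = np.sum(diffs) ** 2     # (Delta_1 f)^2 for this pair of databases
l2_squared = np.sum(diffs ** 2)     # (Delta_2 f)^2 for this pair of databases
print(l1_squared >= l2_squared)     # always True
```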

Mechanism           | Rényi differential privacy of order $\alpha$, for $\mu = \Delta_1 f$
--------------------|---------------------------------------------------------------------
Laplace mechanism   | $\alpha > 1$: $\frac{1}{\alpha-1}\log\!\left[\frac{\alpha}{2\alpha-1}\exp\!\left(\frac{(\alpha-1)\mu}{\lambda}\right) + \frac{\alpha-1}{2\alpha-1}\exp\!\left(-\frac{\alpha\mu}{\lambda}\right)\right]$
                    | $\alpha = 1$: $\mu/\lambda + \exp(-\mu/\lambda) - 1 = \tfrac{1}{2}\mu^2/\lambda^2 + \mathcal{O}(\mu^3/\lambda^3)$
Gaussian mechanism  | $\alpha > 1$: $\alpha\mu^2/(2\sigma^2)$
                    | $\alpha = 1$: (not available)

Table 3.1: Summary of the upper bound on the privacy loss for $(\alpha, \epsilon)$-Rényi differential privacy for mechanisms with Laplace and Gaussian noise distributions.


Chapter 4

Density Estimation

Besides offering protection to a database, we also discussed in the first chapter that there is another interesting line of research in the literature focused on estimating random variables through their probability density functions. With that in mind, this chapter aims to examine some of the existing density estimation techniques and to discuss the shift they are experiencing towards using neural networks. We specifically divide this chapter into three sections. The first section presents some of the classical techniques for density estimation. The second section discusses some neural-network-based solutions for density estimation. Finally, the last section overviews the density estimation technique that serves as a basis for constructing a differentially private density estimator in the next chapters.

4.1 Classical techniques

The probability density function (PDF), $f_A(a)$, of a random variable is a fundamental concept in statistics, for it allows us to investigate, obtain valuable information from, and fully characterize a random variable. This is especially advantageous when drawing conclusions from a collection of observed samples $a_\ell \sim A$, $\forall \ell = 1, \ldots, L$.

The literature offers a variety of approaches to estimate the PDF, and they can all be grouped into two broad categories [35], namely, the parametric approaches and the non-parametric approaches. In the parametric approaches, the PDF is assumed to have a specific functional form that can be characterized by a limited set of parameters. In this case, the observed samples are used to estimate the parameters that define the functional form. Contrarily, in the non-parametric approaches, the PDF is not assumed to have a specific form, and the number of parameters that characterizes the density function grows with the number of samples.

In either case, it should be emphasized that the problem of density estimation is fundamentally ill-posed, because a finite set of observed samples could have come from infinitely many probability distributions. In fact, any distribution $f_A(a)$ that is nonzero at each of the data points $a_1, \ldots, a_L$ is a potential candidate. With that in mind, we shall see in the next sections how the parametric and non-parametric approaches cope with this difficulty.

4.1.1 Parametric approaches

As we discussed earlier, the parametric approach assumes that the PDF has a specific functional form. Some examples are the binomial and multinomial distributions for discrete random variables, and the Gaussian distribution for continuous random variables. These distributions are parametric because they can be characterized by a small number of parameters. Therefore, in this case, the task of estimating the PDF comes down to determining suitable values for those parameters given a set of observed samples.

There are two different methods for estimating the values of the parameters. The first method consists in optimizing some criterion, such as the likelihood function or a minimum-risk decision rule. For this method, we first construct a function that defines how likely it is to observe a particular value given the statistical parameters that define the PDF, and then we maximize that function, or equivalently the logarithm of that function, with respect to the parameters we are trying to estimate. The second method is based on Bayesian techniques. For this second method, we first introduce prior distributions over the parameters, and then we use Bayes' theorem to compute the corresponding posterior distribution given the observed data.
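As a toy example of the first method, our own sketch below fits a Gaussian by maximum likelihood; the closed-form estimates of the mean and variance are simply the sample mean and the (biased) sample variance, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed samples, here drawn from a Gaussian with known parameters for illustration.
samples = rng.normal(loc=3.0, scale=1.5, size=1000)

# Maximizing the Gaussian log-likelihood yields these closed-form estimates.
mu_ml = samples.mean()
var_ml = np.mean((samples - mu_ml) ** 2)   # biased maximum-likelihood estimate of the variance
print(mu_ml, var_ml)
```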

For the Bayesian method, we always try to choose a form for the prior distribution that has a simple interpretation as well as some useful analytical properties. These kinds of distributions are said to be conjugate to the likelihood function, such that the posterior distribution, which is proportional to the product of the prior and the likelihood function, has the same functional form as the prior. For example, the conjugate prior for the parameters of the multinomial distribution is the Dirichlet distribution [36], the conjugate prior for the probability in the Bernoulli distribution is the Beta distribution, and the conjugate prior for the mean of a Gaussian is another Gaussian.
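For instance, a minimal sketch of the Beta–Bernoulli case, our own illustration with arbitrary prior parameters and synthetic data: the posterior is again a Beta distribution whose parameters are the prior parameters incremented by the counts of ones and zeros.

```python
import numpy as np

rng = np.random.default_rng(3)

# Prior Beta(a0, b0) over the Bernoulli parameter, and some observed 0/1 samples.
a0, b0 = 2.0, 2.0
samples = rng.binomial(n=1, p=0.3, size=50)

# Conjugacy: the posterior is Beta(a0 + #ones, b0 + #zeros).
a_post = a0 + samples.sum()
b_post = b0 + len(samples) - samples.sum()
print("posterior mean of p:", a_post / (a_post + b_post))
```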

The main limitation of the parametric approach, however, is that the functional form of the PDF is rarely known beforehand, so we might make incorrect assumptions about the distribution that generates the observed samples, thus obtaining a result with poor predictive performance.

4.1.2 Non-parametric approaches

The oldest and most widely used non-parametric method is the histogram. It simply partitions the observed samples into distinct bins of a certain width, and then counts the number of observations falling into each of the bins. The histogram can also be generalized by allowing the widths of the bins to vary. In practice, the histogram technique can be useful for obtaining a quick visualization of the data in one or two dimensions, but it is unsuited to most density estimation applications. One of its drawbacks is that the estimated density is not continuous, which causes problems if we need to compute derivatives of the PDF. Another problem with histograms is that they make inefficient use of the data in procedures like cluster analysis and non-parametric discriminant analysis [35]. However, the greatest advantage of the histogram with respect to the methods we are going to discuss later on is that it does not require the entire set of observed samples to be stored.

The second most common non-parametric method is the kernel density estimator, also known as the Parzen estimator [37]. This method estimates the PDF as the sum of $L$ kernel functions centered on the observations $a_\ell$. A common choice for the kernel function is the Gaussian, as it prevents the estimated PDF from having discontinuities, and thus avoids the problems of the histogram. The kernel function, however, depends on one hyper-parameter called the kernel width, often denoted by $h$. The kernel width plays a key role because it acts as a regularization parameter, but the kernel density estimator is extremely sensitive to the choice of $h$. A small value of $h$ leads to under-smoothing, whereas a large value of $h$ leads to over-smoothing. Methods to estimate the kernel width are either impractical, because they are given in terms of the unknown density, or prone to statistical error, because they are based on cross-validation [38]. Another drawback of the kernel method is that it tends to exhibit spurious noise at the tails of the estimates [35]. If the estimates are smoothed out to eliminate this spurious noise, then the essential features of the main part of the density are masked.
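The sensitivity to $h$ can be seen directly with a minimal implementation of the Parzen estimator with Gaussian kernels; the sketch below is our own, with arbitrary toy data, evaluation grid, and bandwidths.

```python
import numpy as np

rng = np.random.default_rng(4)

def parzen_gaussian_kde(samples, grid, h):
    """Parzen estimate of the PDF on `grid` using Gaussian kernels of width h."""
    diffs = (grid[:, None] - samples[None, :]) / h
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / h

# Bimodal toy data and a grid of evaluation points.
samples = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
grid = np.linspace(-5, 5, 11)

for h in [0.05, 0.5, 5.0]:   # under-smoothing, moderate, over-smoothing
    estimate = parzen_gaussian_kde(samples, grid, h)
    print(f"h={h}: density at 0 = {estimate[5]:.3f}, at 2 = {estimate[7]:.3f}")
```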

Another common non-parametric method is the k-nearest neighbor technique [39]. One of the difficulties of the kernel methods is that the hyper-parameter $h$ is fixed for all the kernels, and the number of kernels $L$ depends on the number of observed samples.
