
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Privacy-Preserving Sharing of Health Data using Hybrid Anonymisation Techniques - A Comparison

JOHANNA BROMARK

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Privacy-Preserving Sharing of Health Data using Hybrid Anonymisation Techniques - A Comparison

JOHANNA BROMARK

Master in Computer Science
Date: October 10, 2019
Supervisor: Sonja Buchegger
Examiner: Olof Bälter

School of Electrical Engineering and Computer Science
Host company: KRY (Supervisor: Pedro Farias Machado)

Swedish title: Integritetsbevarande publicering av hälsodata genom hybrida anonymiseringstekniker - En jämförelse


Abstract

Data anonymisation is not a trivial task due to the challenge of balancing the trade-off between anonymity and data utility. A fairly new attempt to address this challenge is the development of hybrid anonymisation algorithms - a combination of syntactic privacy models, often k-anonymity, and differential privacy. However, the complexity of evaluating the performance of anonymisation algorithms makes it difficult to draw conclusions about how they perform relative to one another. To be able to use the algorithms in practice it is important to understand the differences between algorithms and their strengths and weaknesses in different settings.

This project addressed this by comparing two recently proposed hybrid anonymisation algorithms, MDP and SafePub, to study their applicability to medical datasets. The algorithms were applied to different datasets, among them a medical dataset from the wild. The resulting performance was based on the information loss and disclosure risk for the anonymised datasets. While MDP had less information loss for stronger privacy guarantees, it is less suitable for medical datasets since the datasets are anonymised under the assumption that all attributes in the dataset are independent. SafePub, on the other hand, while keeping the attribute dependencies intact, had a substantial information loss for stronger privacy levels. Therefore, which algorithm is best suited depends on the dataset characteristics, the required privacy level and the acceptable information loss. It is of course possible that neither of the models is suitable for a specific use case. Also, to draw conclusions about the general performance of the algorithms on medical datasets, more tests are needed.


Sammanfattning

Anonymising data is complicated because of the challenge of balancing the utility of the anonymised data against the level of privacy. To try to improve both the utility and the privacy of anonymised datasets, hybrid anonymisation algorithms have been developed - a combination of syntactic models, often k-anonymisation, and differential privacy. However, the complexity of evaluating the algorithms' results makes it difficult to draw any conclusions about how they compare to one another.

This degree project compared two relatively recently published hybrid algorithms, MDP and SafePub, to investigate how useful they are for medical datasets. This was done by anonymising different types of datasets, among them a real medical dataset. The performance of the algorithms was based on the information lost and the risk of disclosing data. While MDP lost less information for stronger privacy levels, the algorithm anonymises the attributes in the dataset as if they were independent, which makes it less suitable for use on medical datasets. SafePub, on the other hand, loses a lot of information for stronger privacy levels. Whether the algorithms are suitable depends on the characteristics of the dataset and on which levels of privacy and truthfulness are needed. It may of course be the case that neither of the models is suitable for the specific purpose.

However, in order to draw conclusions about the general performance of the algorithms on medical data, more tests need to be carried out.


Contents

1 Introduction
   1.1 Research Question
   1.2 Delimitations
   1.3 Report Outline
   1.4 Ethics and Sustainability
2 Background
   2.1 Data Anonymisation
       2.1.1 Value Generalisation and Suppression
       2.1.2 Perturbation Techniques
   2.2 Privacy Models
       2.2.1 k-anonymity
       2.2.2 Differential Privacy
   2.3 Hybrid Data Anonymisation
   2.4 Dataset Characteristics
       2.4.1 Dimensionality
       2.4.2 Attribute Types
       2.4.3 Distance Between Nominal Values
       2.4.4 Medical Datasets
   2.5 Related Work
       2.5.1 Evaluation Metrics
3 Comparative Analysis Setting
   3.1 Anonymisation Algorithms
       3.1.1 MDP
       3.1.2 SafePub
       3.1.3 k-anonymisation
   3.2 Evaluation Models
       3.2.1 Utility Metrics
       3.2.2 Measure Disclosure Risk
4 Method
   4.1 Datasets
       4.1.1 Adult
       4.1.2 SympForms
       4.1.3 Numerical Datasets
   4.2 Experiments
       4.2.1 Experiment 1: Impact of ε on Information Loss
       4.2.2 Experiment 2: Disclosure Risk
       4.2.3 Experiment 3: Effect of Dimensionality
       4.2.4 Experiment 4: Impact of k and δ
5 Results
   5.1 Experiment 1
       5.1.1 Additional Observations
   5.2 Experiment 2
   5.3 Experiment 3
   5.4 Experiment 4
6 Discussion
   6.1 Future Work
7 Conclusion
Bibliography
A Result
   A.1 Experiment 1
       A.1.1 Dataset Excerpts
   A.2 Experiment 3

Chapter 1 Introduction

With increasing digitalisation, an increasing amount of data is collected, which sometimes includes personal and/or other sensitive information. The gathered data can then be used by both internal and external actors in, for example, statistical analysis, training of a machine learning model or other research purposes. There is, however, a strong public interest in the preservation of the privacy and integrity of individuals, especially online. A recent sign of this is the enactment of the new, stricter General Data Protection Regulation (GDPR) in the European Union [1] to protect personal information.

The health care sector is one field that is in the digitalisation process, with devices that can track our health on a daily basis and services for meeting a doctor via video through a smartphone or tablet. This evidently produces data, some of which could be very sensitive, yet useful. However, in order to use the gathered data it is often necessary to transform the dataset so that it does not include any personal information. One way to do this is to anonymise (de-identify) the data, i.e. hide the identity of record owners so that an individual cannot be linked to a specific record or sensitive value. This is the purpose of privacy-preserving data publishing (PPDP), which is the process of sharing useful data without disclosing any personal information about the individuals in the dataset. The data is anonymised under two conditions: non-interactive and general-purpose [2]. This means that the dataset is anonymised before publication and that no assumptions are made about how the dataset will be used, i.e. what type of analysis and queries will be executed on the published dataset [3].

There are several anonymisation methods used in PPDP that typically involve some form of generalisation or suppression of values until some syntactic criterion is met, e.g. until there is a specific number of identical records in the dataset. Another approach is to use perturbing techniques where, for example, noise is added to the record values or the values are changed by other means.

Data anonymisation is not a trivial task and the greatest challenge is to balance the utility and privacy of the anonymised dataset. Data utility and privacy are somewhat opposites, because increasing one factor often results in a decrease of the other. Consider the two extremes: the best utility would be achieved if the original dataset was used, while the best privacy preservation approach would be to completely distort the data or simply not save it. It is easy to see that in the one case there is no privacy and in the other there is no utility. This trade-off between data privacy and data utility, discussed in more detail in [4], is a central topic for the research and development of new data anonymisation techniques. Recent studies have tried to balance the trade-off by using hybrid anonymisation approaches, i.e. a combination of syntactic models (models that anonymise the data to fulfil syntactic criteria) and perturbing techniques. The reported performance of the hybrid techniques is promising in terms of the privacy-utility trade-off, and the concept of hybrid techniques is presented in more detail in Section 2.3. While various hybrid anonymisation algorithms have been presented, there are as yet very few comparisons between different hybrid techniques. Data anonymisation is complex and affected by many different factors. It could therefore be useful, both for researchers and practitioners, to have comparative studies of the hybrid algorithms that document their applicability in certain scenarios. For practitioners, this could serve as a guide to choosing an appropriate algorithm for their anonymisation purposes and for researchers this could, for example, serve as a verification of the current state of the field.

The study was a collaboration with KRY, one of the actors in Sweden that provide a service for video meetings with doctors. To get a better understanding of the performance of hybrid algorithms for PPDP, this project compared two fairly recently proposed hybrid algorithms for the purpose of anonymising medical datasets. The performance was tested for different levels of privacy using different datasets, among them a medical dataset from the wild.

1.1 Research Question

The research question for this study is:

What hybrid technique is best suited for the anonymisation of medical data in a non-interactive, general-purpose setting, from a combined privacy and utility perspective?

The study cannot answer the question in its entirety since only two hybrid techniques were compared and only one type of medical dataset was used. However, the results serve as a first step towards answering this question by addressing these sub-questions for the two algorithms studied:

• How is the data utility affected at different levels of privacy?

• With the same level of privacy, which algorithm preserves the most utility?

• How does increasing dimensionality affect the performance of the algorithms?

• How is the performance affected by different dataset attribute types?

Here the performance of an algorithm refers to how much utility is preserved in the resulting anonymised datasets. The more useful a dataset is, the better the performance of the algorithm.

1.2 Delimitations

The study is limited to two hybrid algorithms and the number of datasets used was restricted to five. Due to the confidentiality of patient data, the medical dataset was synthetic, but the structure of the dataset was similar to that of the original dataset. Finally, the methods were evaluated for PPDP, and since the setting in PPDP is non-interactive and general-purpose, the algorithms were evaluated for such a setting, which limited the possible evaluation methods.

1.3 Report Outline

This chapter ends with a section on some of the ethical and sustainability aspects of the area of data anonymisation. The outline of the rest of the report is as follows: Chapter 2 describes the background for data anonymisation along with a more detailed description of hybrid anonymisation methods, as well as a section on other studies that have compared anonymisation methods. In Chapter 3 the algorithms that were used in this study are presented, together with the evaluation metrics that were used to compare them. In Chapter 4 the method for evaluating the algorithms is presented, where all the datasets used are also described in more detail. In Chapter 5 the result is presented for each experiment, and finally an analysis of the results and the conclusion are given in Chapters 6 and 7 respectively.


1.4 Ethics and Sustainability

It is likely that the future of data management and research will be driven by personal and social data, which brings ethical dilemmas with it [5]. Data anonymisation could be a tool for a more responsible collection and analysis of data.

Even if the gathering and usage of data is regulated by law, there might be cases where the use of data is in accordance with the regulations, but it still might not be ethically motivated to use the data in its raw form. When access to a specific individual's data is not required, it is not necessary to store the data in that explicit form, and then anonymisation is useful. Anonymisation with different levels of privacy would also enable the principle of least privilege (having access to only the information necessary): when it is not necessary to view someone's exact data, that would then not be an option.

From the definition of PPDP (sharing useful data) it follows that the disclosure risk is not entirely removed, since, for the data to be useful, some information must remain. If there is still information in the dataset, it is inevitable that some disclosure risk remains. For a data publisher that is important to remember, as is being aware of the implications that can follow when making a dataset publicly available.

As part of the UN Sustainable Development Goals, the ninth goal includes "Enhance scientific research, upgrade the technological capabilities of industrial sectors in all countries" [6], and one important component in research is often data. Data is either collected as part of the research project, or already gathered data is used, e.g. from a published dataset. PPDP could therefore be an important part in enabling more research, if, for example, more data could be acquired from industry. As an example, digital health care providers continuously get data from their users. With the help of PPDP, that data could be made available to researchers in a relatively secure fashion, which could also have an economic benefit for research projects. PPDP could therefore make research accessible to more actors in various fields which, in a broader perspective, could have a positive impact towards a more sustainable and equal world.


Chapter 2 Background

This chapter will introduce the concept of data anonymisation, starting with some background and techniques to transform datasets in Section 2.1. In Section 2.2 the most common privacy models will be described and in Section 2.3 hybrid anonymisation techniques are presented. The chapter ends with dataset characteristics in Section 2.4 and previous studies that have compared anonymisation models in Section 2.5.

2.1 Data Anonymisation

The attributes in a dataset are labeled identifiers, quasi-identifiers and confidential attributes [7][8] depending on what the attribute represents. Identifiers are attributes that can directly identify an individual (e.g. name and personal registration number). Quasi-identifiers (QIDs) do not directly identify an individual by themselves but can in combination be used to single out an individual, e.g. age and sex. Confidential attributes are attributes that contain information that is considered sensitive, for example disease or salary.

The trivial approach for dataset anonymisation (or de-identification) is to remove the identifiers. However, it has been shown that it can still be possible to uniquely identify people in a dataset when using the combination of sex, zip code and date of birth [9]. Background knowledge can be used by an adversary to single out a record or infer the sensitive value for an individual.

Such background knowledge can be easily acquired from other public datasets, social media or by being acquainted with the person. Therefore, the dataset must be modified further to keep the integrity of the individuals in it.


2.1.1 Value Generalisation and Suppression

One approach to anonymise a dataset is to generalise the record values, i.e. the values are changed to more general concepts that cover several other concepts as well. The generalisation is based on a generalisation hierarchy that defines the different generalisation levels, see Figure 2.1.

Figure 2.1: Hierarchy for attribute sex (left) and age (right).

The generalisation of values can be done in either a local or a global fashion [10]. Local generalisation can apply different levels of generalisation to values from the same attribute. Global generalisation, or full-domain generalisation, on the other hand applies the same level to all values of an attribute. If there are no levels that ensure the privacy criteria, the attributes can be suppressed, which is commonly denoted as "*". Some anonymisation methods combine generalisation and record suppression, by generalising to a certain degree and then suppressing the records that breach the privacy criteria.
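As an illustration of full-domain generalisation, the short sketch below applies one fixed generalisation level per attribute to a small table; the hierarchies, attribute names and records are invented for the example and only loosely modelled on Figure 2.1.

```python
# Hypothetical generalisation hierarchies, loosely modelled on Figure 2.1.
# For each attribute, a raw value maps to its generalisations; level 0 is the original value.
HIERARCHIES = {
    "sex": {"Male": ["Male", "*"], "Female": ["Female", "*"]},
    "age": {age: [str(age), f"{(age // 20) * 20}-{(age // 20) * 20 + 19}", "*"]
            for age in range(0, 100)},
}

def full_domain_generalise(records, levels):
    """Full-domain generalisation: the same level is applied to every value of an attribute."""
    return [{attr: HIERARCHIES[attr][value][levels[attr]] for attr, value in record.items()}
            for record in records]

records = [{"sex": "Female", "age": 34}, {"sex": "Male", "age": 52}]
print(full_domain_generalise(records, levels={"sex": 1, "age": 1}))
# [{'sex': '*', 'age': '20-39'}, {'sex': '*', 'age': '40-59'}]
```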

2.1.2 Perturbation Techniques

Perturbation, another approach to anonymise data, distorts the values in the dataset, and there are methods with different levels of distortion. A common approach is to add noise, where the level of noise can easily be controlled; more added noise of course leads to stronger privacy. Another perturbative technique is clustering, where the dataset is divided into clusters and each record is then changed to the value of the centroid of the cluster it belongs to.

2.2 Privacy Models

Privacy models have been defined to set a privacy guarantee for an anonymised dataset. The models describe characteristics of a dataset or privacy mechanism that should preserve a certain level of integrity for the record owners.


2.2.1 k-anonymity

One of the first privacy models to be proposed was k-anonymity [11][12], where the idea is to preserve anonymity by “hiding in the crowd”.

Definition 1. A dataset D is k-anonymous if every combination of QIDs in D can be indistinctly matched to at least k individuals, for k > 1.

The group of records that have the same combination of QIDs is called an equivalence class. Common methods for achieving k-anonymity are value generalisation and dataset clustering.
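As a minimal, hypothetical illustration of equivalence classes, the sketch below groups records by their QID combination and returns the size of the smallest class, i.e. the k that the table satisfies; the records and column names are made up.

```python
from collections import Counter

def dataset_k(records, qids):
    """Group records by their QID combination and return the size of the smallest equivalence class."""
    class_sizes = Counter(tuple(record[q] for q in qids) for record in records)
    return min(class_sizes.values())

records = [
    {"age": "20-39", "sex": "F", "disease": "flu"},
    {"age": "20-39", "sex": "F", "disease": "asthma"},
    {"age": "40-59", "sex": "M", "disease": "flu"},
    {"age": "40-59", "sex": "M", "disease": "flu"},
]
print(dataset_k(records, qids=["age", "sex"]))  # 2, so this small table is 2-anonymous
```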

A k-anonymous dataset protects against record linkage (linking an individual to a specific record), since there are multiple records with the same set of QIDs. From this it follows that the probability of linking an individual to a specific record is at most 1/k. It is easy to see that for small k, the probability of record linkage is still quite high. Also, even if k is large, k-anonymity does not protect against attribute linkage (linking an attribute to an individual). Take the following example:

Example 1. A dataset D, containing health records, is k-anonymous. Let's say there is one equivalence class in which all records share the same disease. If an adversary can link an individual to that equivalence class (by knowing the QIDs), the disease of that individual is disclosed.

Even if a specific record cannot be singled out, the sensitive information is still leaked in Example 1 and the privacy is thus breached. To address these shortcomings, k-anonymity has been extended to e.g. l-diversity [13] and t-closeness [14]. To conform to l-diversity, each equivalence class needs to contain at least l "well represented" values for the confidential attribute, which would prevent attribute linkage. The most straightforward interpretation of "well represented" is that there need to be at least l distinct values for the confidential attribute in each equivalence class [15]. t-closeness requires the distribution of the values for the confidential attribute in each equivalence class to be similar to the distribution in the whole dataset. This would prevent attribute inference in equivalence classes where the distribution of values is skewed, i.e. differs from the overall distribution of values.

All models mentioned above are examples of syntactic models, i.e. models that set criteria that the dataset needs to fulfil. There is still criticism that syntactic models either fail to completely protect the privacy (the case for k-anonymity and l-diversity), or fail to preserve enough utility (the case for t-closeness) [16].


2.2.2 Differential Privacy

Differential privacy [17] defines privacy differently from the syntactic models. Instead of defining a privacy property for the dataset itself, the requirement is on the actual data processing method. Differential privacy was originally developed for an interactive setting [18], with the purpose of masking the output of a query. (In an interactive setting the dataset is kept intact while the anonymity of the records is preserved by sanitising query replies [19].) A dataset can, however, be seen as a collection of responses to queries for each record, and differential privacy can therefore be used in a non-interactive setting as well [20].

Differential privacy aims to reduce the probability of disclosing whether a record exists in the dataset or not, i.e. outputs from a query should be probabilistically indistinguishable regardless of whether an individual's record is in the dataset or not. Therefore, the difference in the output of a query f on neighboring datasets (datasets differing in one record) needs to be masked. This is done by applying a mechanism M that perturbs the output. ε-differential privacy is formally defined as follows:

Definition 2. Let D and D' be two neighboring datasets. Let M be a randomised mechanism. M satisfies ε-differential privacy if for every set of outputs S:

$$\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S]$$

The parameter ε is defined as the privacy budget [21], which is directly related to the privacy guarantee of the mechanism. A smaller ε results in stronger privacy guarantees; ε is usually set to 0.01, 0.1, ln 2 or ln 3 [22], but depending on the level of privacy that is required it can be set to a greater or smaller value.

A variation of the privacy definition is (ε, δ)-differential privacy [23], which makes it possible to relax the condition for events that are less likely.

Definition 3. Let D and D' be two neighboring datasets. Let M be a randomised mechanism. M satisfies (ε, δ)-differential privacy if for every set of outputs S:

$$\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S] + \delta$$

This means that the bound $e^{\epsilon}$ may be exceeded with a probability of at most δ. The larger the value of δ, the more relaxed the constraint. ε-differential privacy is (ε, δ)-differential privacy with δ = 0.


To decide how much noise is required to be added by the mechanism, the sensitivity of the query function needs to be considered. The sensitivity (denoted ∆) is the maximum possible difference between the answers to a query on neighboring datasets [22].

Definition 4. For a query $f: D \to \mathbb{R}$, the sensitivity is defined as:

$$\Delta f = \max_{D_1, D_2} \| f(D_1) - f(D_2) \|$$

for all neighboring datasets $D_1$, $D_2$.

The higher the sensitivity, the more noise needs to be added.
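To make the interplay between sensitivity, ε and noise concrete, here is a generic sketch of the standard Laplace mechanism applied to a counting query with sensitivity 1; it is not code from either of the algorithms compared in this thesis.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return an epsilon-differentially private answer by adding Laplace(sensitivity/epsilon) noise."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

ages = [34, 52, 41, 29, 60]
# Counting query: how many individuals are over 40? Adding or removing one record
# changes the answer by at most 1, so the sensitivity is 1.
true_count = sum(age > 40 for age in ages)
print(laplace_mechanism(true_count, sensitivity=1, epsilon=0.1))
```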

Differential privacy is argued to provide better privacy guarantees than the syntactic models, since it makes "almost" no assumptions about the background knowledge of the attacker [24]. However, differential privacy is criticised for making the anonymised dataset non-truthful, since implementations rely on the addition of noise. Also, since it was developed for an interactive setting, it is more difficult to use in a non-interactive setting with sufficient data quality [25].

2.3 Hybrid Data Anonymisation

A hybrid anonymisation algorithm is a combination of a syntactic model, often k-anonymity, and differential privacy. In [26] the concept of hybrid approaches was mentioned as an interesting approach, and there are several studies that attempt to utilise the advantages of the two models.

In one of the first studies to test this idea it was shown that a k-anonymisation algorithm could satisfy (ε, δ)-differential privacy if it was preceded by a random sampling step [27]. After this finding, others have continued on this notion of a hybrid anonymisation algorithm by combining k-anonymity and differential privacy in different ways. One hybrid approach was suggested to better prevent composition attacks [28]. With a combination of random sampling, perturbation (adding noise to data values) and generalisation, they managed to reduce the risk of data disclosure while getting better data utility compared to only using differential privacy.

To deal with the increasing information loss when the number of QIDs increases, k-anonymity and ε-differential privacy were combined in [29] by splitting the QIDs into two sets, whereby k-anonymisation was applied to one set and differential privacy to the other. Compared to k-anonymisation, this approach seemed to decrease both the information loss and the disclosure risk. However, from the study it is not clear how the data utility was affected when the disclosure risk decreased, because they were analysed separately.

The combined utility and privacy was improved in [20] by first applying microaggregation (a clustering method) to the data so that it conformed to k-anonymity, and then applying a differential privacy mechanism. The microaggregation step reduced the amount of noise needed to achieve differential privacy, which in turn increased the utility. Another approach that used microaggregation was presented in [30], where the microaggregated dataset was protected instead of the original dataset. That enabled the use of a less perturbative microaggregation algorithm, hence the utility was increased.

A recently presented algorithm is SafePub [25]. SafePub first randomly samples the data and then generalises it. SafePub also includes a search strategy to find the best generalisation scheme to use. However, by sampling the data, the algorithm results in a relaxed (ε, δ)-differential privacy.

What makes hybrid methods interesting is that several studies have shown results indicating that it is possible to combine the different models and obtain the advantages of both: the utility preservation from the syntactic models and the privacy preservation from differential privacy. Hybrid algorithms are also often developed for a non-interactive, general-purpose setting.

2.4 Dataset Characteristics

2.4.1 Dimensionality

When working with data, the curse of dimensionality is a common term for different phenomena that arise with high dimensional data but do not happen in a low dimensional setting. The dimensionality of a dataset is the number of attributes (columns) in it. Data anonymisation also suffers from the curse of dimensionality, because high dimensional and sparse data make it more difficult for standard anonymisation methods to reach the same level of privacy and utility. In [31] and [32] the authors studied the effect an increasing number of attributes had on the performance of k-anonymisation and perturbative models. It was shown that the privacy for k-anonymisation starts to decrease when the dimensionality exceeds 10, and the perturbative techniques showed similar trends.


2.4.2 Attribute Types

It is probable that a dataset has different types of attributes in it, such as numerical, ordinal and nominal attributes. Numerical attributes are numeric values and naturally have an order. Ordinal attributes are not numeric values but still have a logical ordering, such as days of the week. Nominal attributes, on the other hand, are sets of non-numeric values that do not have a natural, intrinsic ordering, e.g. sex in Figure 2.1. Due to this, there is no straightforward way to compute the difference and distance between nominal concepts; however, there are techniques to solve this, see the next section.

2.4.3 Distance Between Nominal Values

To address the problem of comparing and ordering nominal attributes, semantic distance was introduced as the distance between two values in a nominal attribute hierarchy. The semantic distance takes into consideration the semantic meaning of the possible values for an attribute. It is defined as a function of the non-common ancestors for two values divided by the total number of ancestors [33][34]. The semantic distance is thus formally defined as:

$$d(c_1, c_2) := \log_2\left(1 + \frac{|T(c_1) \cup T(c_2)| - |T(c_1) \cap T(c_2)|}{|T(c_1) \cup T(c_2)|}\right) \qquad (2.1)$$

where $T(c)$ is the set of hierarchic ancestors of c, including c itself.

It would of course be possible to use a binary distance between two values, i.e. 0 if they are the same and 1 if they are different. However, some concepts might be more closely related than others, which a binary distance leaves no room to express. This is where semantic distance is superior, since it uses the defined hierarchy of the attributes. The semantic distance can be useful both when anonymising the dataset itself and when evaluating the output from an anonymisation algorithm.
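A small sketch of Equation 2.1, assuming the generalisation hierarchy is available as a child-to-parent mapping; the example hierarchy for a hypothetical "country" attribute is invented for illustration.

```python
import math

# Hypothetical hierarchy for a nominal attribute, stored as child -> parent.
PARENT = {"Sweden": "Europe", "France": "Europe", "Japan": "Asia",
          "Europe": "*", "Asia": "*"}

def ancestors(value):
    """T(c): the hierarchic ancestors of a value, including the value itself."""
    chain = {value}
    while value in PARENT:
        value = PARENT[value]
        chain.add(value)
    return chain

def semantic_distance(c1, c2):
    """Equation 2.1: a function of the non-common ancestors over all ancestors of the two values."""
    t1, t2 = ancestors(c1), ancestors(c2)
    union, common = t1 | t2, t1 & t2
    return math.log2(1 + (len(union) - len(common)) / len(union))

print(semantic_distance("Sweden", "France"))  # smaller: the values share 'Europe' and '*'
print(semantic_distance("Sweden", "Japan"))   # larger: the values only share '*'
```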

For numerical attributes a generalisation hierarchy can be defined with ranges of values (see Figure 2.1). However, using the semantic distance is not necessary in this case. The distance between leaves is trivial, and the distance between a child node and its parent node is the size of the range of the parent node, i.e. the distance between 20 and 20-39 is 20. The distance to a suppressed numerical attribute is the size of the attribute domain.

The distance dist between two records $r_1$ and $r_2$ that contain both numerical and nominal attributes can be defined as

$$dist(r_1, r_2) = \sqrt{\frac{d(a^1_1, a^1_2)^2}{d(a^1_b, a^1_t)^2} + \dots + \frac{d(a^m_1, a^m_2)^2}{d(a^m_b, a^m_t)^2}} \qquad (2.2)$$

where $a^i_1$ and $a^i_2$ are the $i$th attribute values of $r_1$ and $r_2$ respectively, and $a^i_b$ and $a^i_t$ are the attribute boundaries, i.e. the smallest and largest attribute values.

For a nominal attribute the boundaries can be computed using marginality, which is a measure of the centrality of a value [33]. The marginality m of a value $a_j$ is defined as

$$m(a_j) = \sum_{a_l \in T_A \setminus \{a_j\}} d(a_j, a_l) \qquad (2.3)$$

where $T_A$ is the taxonomy, or domain, of the attribute. The value with the smallest marginality is the most central value. One boundary is then the value with the largest distance from the central value, and the other boundary is the most distant value from the first boundary [20].

2.4.4 Medical Datasets

Medical datasets can be very large and complex with high dimensionality, i.e. many attributes. The attributes can also be of different types (numeric, nominal, free text), and some attributes can have many missing values, which complicates the anonymisation process. k-anonymisation is the most commonly used method for anonymising health data, and there is a limited number of real-world use cases of differential privacy on medical datasets [24]. Furthermore, the authors in [35] point out that solutions proposed by the computer science community may not be suitable in healthcare settings; as an example, in medical datasets there can be strong dependencies between QIDs and confidential attributes (e.g. female patients are more likely to be diagnosed with breast cancer). Thus, it is clear that working with health data can be difficult from several aspects.

2.5 Related Work

Due to the complexity of the privacy-utility trade-off and the strengths and weaknesses of the syntactic models and differential privacy, there have been multiple studies that compare them and also try to find good metrics to measure the performance of anonymisation algorithms.

In [2] and [26] the authors discussed the syntactic models and differential privacy from a more theoretical point of view and identified some of their shortcomings, such as disclosure risk for the syntactic models and poor utility for differential privacy. However, they also concluded that there is no need to exclude one over the other; they both have a place alongside each other.

In the studies [36] and [37] the authors proposed two different frameworks for comparing privacy algorithms in terms of both the privacy guarantee and the utility preservation. Both studies also compared syntactic models (k-anonymity, l-diversity and t-closeness) and differential privacy using the proposed frameworks. The result of the comparison in [36] showed that neither of the models had a clear advantage when taking both privacy preservation and data utility into consideration, while the result in [37] showed that differential privacy outperformed the syntactic models in both aspects.

Another approach for model comparison was presented in [38]. The idea was to first decide on a certain level of either privacy or utility and then rank the models according to the other factor, e.g. decide on a privacy level x and then rank the models according to the utility preservation. Their experimental results showed different performance for the models depending on the dataset used.

The three studies mentioned above all got different results. From this, it is clear that evaluating anonymisation models is not a trivial task and the result from an experiment can differ depending on the metrics used as well as on the dataset. It is also pointed out in [10] that the lack of standardised metrics makes it difficult to compare algorithms to each other.

All the above mentioned studies compared k-anonymity (and its variations) and differential privacy. While it is clear that hybrid approaches are gaining interest, there are currently very few hybrid algorithms that are compared to other hybrid algorithms. The results that the studies presenting new hybrid algorithms have shown so far indicate that the approach has potential. It could therefore be useful to do more analytical and comparative studies between them, as has been done in several studies for the syntactic models and differential privacy [37][2][36][38][26].

2.5.1 Evaluation Metrics

Information loss is typically used as a measure of data quality for general-purpose algorithms [20], e.g. by measuring similarities between the original and anonymised dataset, which is the purpose of the sum of squared errors (SSE). The idea of general-purpose quality models is discussed further in [25], where the authors mention that information loss can be measured at the cell level, attribute level or record level. When comparing models in a general-purpose setting, it could therefore be beneficial to use metrics that cover all the levels.

There are several approaches for measuring the disclosure risk, e.g. record linkage as the percentage of the records that can be correctly matched between the original and the anonymised dataset [20][38] and record uniqueness [37].

Record linkage is best suited when the values in the anonymised and original dataset have the same level of generalisation; it is more difficult to draw the same conclusions for an anonymised dataset with generalised values. From [37] it seems as if the methods for estimating record uniqueness depend on the conditions they are used in, which could make it difficult to compare results across studies. There is another approach for measuring disclosure risk in terms of record linkage that is based on the maximum-knowledge adversarial model for anonymisation (the assumption that the adversary has knowledge of both the anonymised and the original dataset), an approach that was introduced in [39]. This method makes use of the value ranks, rather than the values themselves, which makes it more suitable than standard record linkage for evaluating techniques that anonymise by generalisation. The approach is described in more detail in Section 3.2.2.


Chapter 3

Comparative Analysis Setting

In Section 3.1 the hybrid anonymisation algorithms compared in this study are presented in more detail. In Section 3.2 the models used for comparing the algorithms are presented.

3.1 Anonymisation Algorithms

As previously mentioned, two hybrid anonymisation algorithms were compared in this study, namely SafePub [25] and the algorithm presented by Soria-Comas et al. [20] (hereafter called Microaggregated Differential Privacy, MDP for short), both of which were briefly described in Section 2.3. The choice of algorithms was based on the following criteria:

• The models combined k-anonymity and differential privacy in two slightly different ways.

• The models can be used in a non-interactive, general purpose setting.

• The performance presented in the studies was interesting in terms of the privacy-utility trade-off.

• The level of complexity seemed to be suitable for incorporating them in a real world application.

As a baseline for the utility a k-anonymisation algorithm was used.

The algorithms are not described in fine detail in the sections below, so the reader is referred to the original papers for more details and proofs of correctness.


3.1.1 MDP

Soria-Comas et al. presented MDP to increase the utility in differential privacy by first applying microaggregation, a clustering-based k-anonymisation algorithm, to the dataset [20]. The idea with this first microaggregation step is to decrease the sensitivity of the dataset, in order to reduce the noise needed to reach differential privacy. The second step of the algorithm is to add the correct amount of noise to the record values to fulfil differential privacy. The noise is added independently to each attribute. The first step of the algorithm is thus k-anonymisation, and the second step belongs to the differential privacy mechanism.

The algorithm was implemented by following the details described in the original paper and summarised in Algorithm 1.

Algorithm 1: MDP: Generate ε-differentially private dataset [20]

Data: dataset X, k, ε
Result: Anonymised dataset X_ε
X̄ ← Microaggregation(X, k)
for r ← 1 to n do
    x ← Sanitiser(Identity(X̄, r), ε)
    insert x into X_ε
end
return X_ε

Here Microaggregation() performs the clustering of the dataset and Sanitiser() adds Laplacian noise to numerical values. If a numerical value ends up outside of the domain after the noise addition, it is set to the largest or smallest value of the domain. Noise is added to categorical values during the clustering step when the cluster centroids are computed, where values that are more frequent in the cluster are more likely to be chosen. Identity() returns the rth record in the dataset, i.e. it queries each record in the dataset and thus simulates the interactive setting.

The performance of the model depends on the parameters k and ε. In the paper it is stated that, in order for the microaggregation to have an effect, it is necessary that k ≥ √n. Apart from that, there were no other constraints on the parameters.
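A heavily simplified sketch of the two MDP stages for a single numerical attribute: the microaggregation here is just "sort and average groups of k" and the reduced sensitivity is assumed rather than derived, so this only illustrates the shape of the pipeline, not the implementation used in this thesis.

```python
import numpy as np

def microaggregate(values, k):
    """k-anonymise one numerical attribute: sort, form groups of at least k, replace by group means."""
    values = np.sort(np.asarray(values, dtype=float))
    out = values.copy()
    for start in range(0, len(values), k):
        group = slice(start, min(start + k, len(values)))
        out[group] = values[group].mean()
    return out

def mdp_numeric(values, k, epsilon, lower, upper, rng=None):
    """Step 1: microaggregation. Step 2: Laplace noise per record.
    The sensitivity (upper - lower) / k is an assumption standing in for the bound that the
    insensitive microaggregation in the original paper provides."""
    rng = rng or np.random.default_rng()
    anonymised = microaggregate(values, k)
    sensitivity = (upper - lower) / k
    noisy = anonymised + rng.laplace(scale=sensitivity / epsilon, size=len(anonymised))
    return np.clip(noisy, lower, upper)   # values outside the domain are set to its edges, as in MDP

ages = [23, 25, 31, 40, 44, 47, 52, 60, 61]
print(mdp_numeric(ages, k=3, epsilon=1.0, lower=0, upper=100))
```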

3.1.2 SafePub

SafePub [25] fulfills (ε, δ)-differential privacy, i.e. the privacy criterion is slightly more relaxed than what MDP guarantees. It builds on the idea that (ε, δ)-differential privacy can be satisfied by randomly sampling records with probability β = 1 − e^(−ε), followed by attribute generalisation and then suppression of every record that appears fewer than k times, where k is derived from ε_anon and δ. The algorithm consists of four steps: (1) random sampling of the dataset, (2) a search strategy for a good generalisation scheme (to what level each attribute should be generalised), (3) generalisation of values and (4) suppression of records that appear fewer than k times.

The implementation of the algorithm was done through the open source ARX anonymisation tool [40]. At the time of writing, the implementation of SafePub with the search strategy was only available in the project repository on GitHub [41].

The overall algorithm is described in Algorithm 2.

Algorithm 2: SafePub: Generate (ε, δ)-differentially private dataset [25]

Data: dataset X, k, ε_anon, ε_search, δ, steps, metric m
Result: Anonymised dataset X_(ε,δ)
X_s ← RandomSampling(X, ε_anon)
Initialise a set of transformations G
for i ← 1 to steps do
    Update G
    for g ∈ G do
        X_a ← Anonymise(X_s, g, ε_anon, δ)
        score ← Evaluate(X_a, m)
    end
    g_m ← Probabilistically select g ∈ G based on score and ε_search
end
X_(ε,δ) ← Anonymise(X_s, best g_m found, ε_anon, δ)
return X_(ε,δ)

Here the total privacy budget ε = ε_anon + ε_search. RandomSampling() performs the sampling of the dataset with probability β = 1 − e^(−ε_anon). The records that are not sampled are suppressed. The transformations in G are full-domain generalisation schemes. Anonymise() performs the anonymisation of the dataset based on the generalisation scheme g. The records that do not conform to k-anonymity are suppressed.

Each result from Anonymise() is evaluated based on a given metric that gives a score for the generalisation scheme. The metric used in this study was granularity [25], which penalises values that are generalised to a higher level in the generalisation hierarchy. The score is then used when a generalisation scheme is randomly selected, where a better score increases the chance of that scheme being chosen. The final dataset is then anonymised based on the best generalisation scheme found.


From the parameters ε and δ in (ε, δ)-differential privacy, the k for the k-anonymisation is computed; the resulting k thus depends on the parameters set for the differential privacy. In the paper it is stated that δ should be chosen so that δ < 1/n, where n is the size of the dataset, and at least δ ≤ 10^(−4) should hold. For ε the authors set ε_search = 0.1, since ε_anon had a greater impact on the performance. This value was also used in this study, except that for ε ≤ 0.1, ε_search = ε/10.
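A small sketch of the parameter handling just described: the total budget ε is split into ε_anon and ε_search following the rule used in this study, and the sampling probability β = 1 − e^(−ε_anon) is computed; the derivation of k from ε_anon and δ is left out since it is considerably more involved (see [25]).

```python
import math

def split_budget(epsilon):
    """Split the total budget into (eps_anon, eps_search): eps_search = 0.1,
    except eps_search = epsilon / 10 when epsilon <= 0.1."""
    eps_search = epsilon / 10 if epsilon <= 0.1 else 0.1
    return epsilon - eps_search, eps_search

def sampling_probability(eps_anon):
    """Probability with which SafePub keeps each record: beta = 1 - e^(-eps_anon)."""
    return 1.0 - math.exp(-eps_anon)

for epsilon in (0.01, 0.1, 1.0, 2.0):
    eps_anon, eps_search = split_budget(epsilon)
    print(epsilon, round(eps_anon, 3), eps_search, round(sampling_probability(eps_anon), 3))
```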

3.1.3 k-anonymisation

k-anonymisation does not provide as strong a privacy guarantee as MDP and SafePub and only served as a reference point for the utility. It was therefore not directly compared to the two main algorithms, but gave an indication of what utility could be achieved by an algorithm that is commonly used in practice. The k-anonymisation algorithm used was from the ARX anonymisation tool [40] and was a full-domain generalisation method. Some record suppression was allowed since that improved the performance, meaning that some records were suppressed rather than some attributes being suppressed for all records. Trivially, the algorithm only depends on k.

3.2 Evaluation Models

3.2.1 Utility Metrics

The general-purpose metrics used to evaluate the models are described in more detail below. Each metric gives a measure of the information loss. For each metric score m that is associated with a sensitivity ∆m (∆ denotes sensitivity, see Definition 4), m was scaled by ∆m so that s = m/∆m, where s is the final score. Sum of squared errors was used in [20], while discernibility, non-uniform entropy and group size were used in [25].

Sum of Squared Errors

Sum of Squared Errors (SSE) measures the practical change of values in the output [20]. SSE is the sum of squared distances between the original records in X and the anonymised records in $\bar{X}$. Let $r_i$ be the $i$th record in X and $\bar{r}_i$ be the corresponding record in $\bar{X}$; then SSE is defined as:

$$SSE(X, \bar{X}) := \sum_{r_i \in X,\ \bar{r}_i \in \bar{X}} dist(r_i, \bar{r}_i)^2 \qquad (3.1)$$

where dist is the distance function described in Section 2.4.3.
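A minimal sketch of Equation 3.1, assuming a dist() function such as the one in Equation 2.2 is available and that the original and anonymised records are aligned by index; the simple Euclidean distance and the records are only placeholders.

```python
def sum_of_squared_errors(original, anonymised, dist):
    """Equation 3.1: sum the squared record distances between the original and anonymised dataset."""
    return sum(dist(r, r_anon) ** 2 for r, r_anon in zip(original, anonymised))

def euclidean(r1, r2):
    """Placeholder distance for purely numerical records."""
    return sum((a - b) ** 2 for a, b in zip(r1, r2)) ** 0.5

original = [(23, 170), (40, 180)]
anonymised = [(25, 172), (38, 179)]
print(sum_of_squared_errors(original, anonymised, dist=euclidean))  # 8 + 5 = 13
```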


Discernibility

Discernibility is a record-level metric, which penalises records based on the size of the equivalence class they belong to [42]. Let $EQ_X$ be the set of equivalence classes in the dataset X and $\{* \in X\}$ denote the suppressed records in X.

$$\phi(X) := \sum_{E \in EQ_X} \frac{|E|^2}{|X|} + |X| \cdot |\{* \in X\}| \qquad (3.2)$$

The discernibility is then:

$$disc_k(X) := \phi(X) \qquad (3.3)$$

The sensitivity of $disc_k$ is:

$$\forall k \in \mathbb{N}: \quad \Delta disc_k = \begin{cases} 5, & \text{if } k = 1 \\ \frac{k^2}{k-1} + 1, & \text{if } k > 1 \end{cases} \qquad (3.4)$$

Non-uniform Entropy

Non-uniform entropy is an attribute-level metric that quantifies the amount of information that can be collected about the original dataset using the anonymised dataset [25]. The idea of non-uniform entropy is to compare the frequencies of attribute values in the anonymised dataset with the corresponding frequencies in the original dataset [43].

Let m be the number of attributes in the dataset and $p_i(X)$ be the projection of X to its $i$th attribute. The score function for non-uniform entropy is defined as:

$$ent_k(X) := \sum_{i=1}^{m} \phi(p_i(X)) \qquad (3.5)$$

The sensitivity of $ent_k$ is:

$$\Delta ent_k = \begin{cases} 5m, & \text{if } k = 1 \\ \left(\frac{k^2}{k-1} + 1\right) m, & \text{if } k > 1 \end{cases} \qquad (3.6)$$
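The scaling s = m/∆m from Section 3.2.1 could be sketched as follows, using the sensitivities for discernibility and non-uniform entropy given in Equations 3.4 and 3.6; the raw score values in the example are arbitrary.

```python
def sensitivity_discernibility(k):
    """Equation 3.4: sensitivity of the discernibility score."""
    return 5.0 if k == 1 else k * k / (k - 1) + 1

def sensitivity_entropy(k, num_attributes):
    """Equation 3.6: sensitivity of the non-uniform entropy score."""
    return 5.0 * num_attributes if k == 1 else (k * k / (k - 1) + 1) * num_attributes

def scaled_score(raw_score, sensitivity):
    """s = m / delta_m: the raw metric value scaled by its sensitivity."""
    return raw_score / sensitivity

print(scaled_score(1234.5, sensitivity_discernibility(k=174)))
print(scaled_score(1234.5, sensitivity_entropy(k=174, num_attributes=8)))
```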

3.2.2 Measure Disclosure Risk

Both MDP and SafePub have a privacy guarantee in terms of the ε used when anonymising the dataset. However, it could also be interesting to have another metric for the risk of disclosing a record. This was computed using the maximum-knowledge adversary model, where a potential attacker has access to both the anonymised dataset and the original dataset in plain text [39]. The idea behind this model is that the adversary wants to find the linkages between the original dataset, X, and the anonymised dataset, Y. An important note is that since the intruder already has X, nothing more can be gained from the linkages, but the intruder is purely malicious and might want to damage the data publisher.

Knowledge of X and Y allows for performing reverse mapping - a derivation of another dataset, Z, based on the ranks of the attribute values (the order of the values) in X and Y. The values in Y are replaced with the values of corresponding rank from X; Z is therefore a permutation of X. It is then possible to estimate the record linkage based on the minimal permutation distance between records in X and Z, and the correct linkages form the metric.

A more detailed description of the approach, with good examples, is found in [39]. The method is only described for numerical and ordinal values, which have a natural order that forms the rank. In this study, the rank for nominal attributes is computed based on the marginality described in Section 2.4.3.
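A rough sketch of the rank-based reverse mapping from [39] for a single numerical attribute: each anonymised value is replaced by the original value with the same rank, yielding a permutation of the original column (the dataset Z above). The full record-linkage estimate over several attributes is more involved; this only illustrates the rank replacement step, and the example values are invented.

```python
import numpy as np

def reverse_map_column(original_col, anonymised_col):
    """Replace each anonymised value with the original value of the same rank,
    producing a permutation of the original column."""
    original_sorted = np.sort(original_col)
    anonymised_ranks = np.argsort(np.argsort(anonymised_col))  # rank of each anonymised value
    return original_sorted[anonymised_ranks]

original = np.array([23, 40, 31, 52])
anonymised = np.array([25.1, 44.7, 28.9, 49.3])   # e.g. after noise addition
print(reverse_map_column(original, anonymised))    # [23 40 31 52]: the noise preserved the ranks here
```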


Chapter 4 Method

For the evaluation of the algorithms, the original and the anonymised datasets were compared in order to measure the information loss and the disclosure risk, where a low information loss combined with a low disclosure risk was preferred. As mentioned in Sections 3.1.1 and 3.1.2, MDP and SafePub differ slightly in their parameter dependencies. Both models depend on ε, but while MDP also depends on k, SafePub depends on δ, which in turn decides the value for k. For the purpose of this study, the most important parameter was ε, since that controlled the privacy guarantee, and the value for k was rather seen as a tool for enabling better utility. By keeping k and δ constant, the effect of different privacy levels could be studied. For each parameter value in each experiment, the anonymisation of each dataset was repeated several times since both algorithms involve randomisation. The total evaluation was performed by conducting four experiments with different datasets, of which four were derived from publicly available data and three were based on data acquired from KRY.

The code used for the project is available on GitHub [44].

4.1 Datasets

The algorithms were applied to several datasets with somewhat different characteristics. A summary of all the datasets is displayed in Table 4.1. Each attribute in the datasets was considered to be a QID, so every attribute was used in the anonymisation process.


Name            # Records   # Attributes   Numerical   Nominal   δ      k     Used in Experiment
Adult           30 162      8              1           7         1e−5   174   1, 3, 4
SympForms       30 000      11             3           8         1e−5   174   1, 3, 4
SympFormsLarge  100 000     11             3           8         1e−5   317   1, 3, 4
SympFormsSmall  1000        11             3           8         1e−4   32    2
Housing         20 433      9              9           0         1e−5   143   1, 3, 4
HousingSmall    1000        9              9           0         1e−4   32    2
Musk            6598        20             20          0         1e−4   82    3

Table 4.1: Summary of the dataset properties and the experiments they are used in.

4.1.1 Adult

The Adult dataset, from the UCI Machine Learning Repository [45], is an excerpt of the U.S. census database that has been used to test anonymisation algorithms in several other studies [20][25][36][37]. All records with missing entries were removed, so Adult had 30 162 records, and eight of the attributes were considered (age, workclass, education, marital status, occupation, race, sex, native country). There was one numerical attribute (age) and the other seven were nominal attributes. The generalisation hierarchies for the attributes were based on definitions available in [41].

4.1.2 SympForms

The datasets acquired from KRY were based on data from symptom forms that patients fill out prior to each doctor meeting. The datasets were synthetic and generated by randomly drawing samples from a distribution derived from an existing dataset. Three datasets were generated: SympForms with 30 000 records, SympFormsLarge with 100 000 records and SympFormsSmall with 1000 records. All the datasets had 11 attributes, of which three were numerical and the other eight were nominal. Most of the attributes were responses to questions regarding health status related to cold and flu symptoms. Three questions were optional, which resulted in some records not having a value for the attributes corresponding to those questions. The missing values were replaced by a range covering the domain for numerical values and a negating answer for nominal attributes. Seven out of the eight categorical attributes had only two generalisation levels, i.e. they were directly generalised to * (suppressed).

4.1.3 Numerical Datasets

Both Adult and the versions of SympForms had mixed types of attributes, where the majority of the attributes were categorical, and while the SympForms datasets were actual medical datasets, Adult had a similar structure. It was therefore interesting to compare the performance of the models when applied to a dataset with only numerical values, to see if there was a difference in performance when the attribute types were different. For this the dataset Housing was used [46], with a total of 20 433 records (when all records with missing values were removed) and nine numerical attributes. The number of generalisation levels varied from three to five. Another version of Housing, HousingSmall, was generated by randomly sampling 1000 records from the original Housing dataset.

To better study the effect of dimensionality, another numerical dataset was used, the Musk dataset from the UCI Machine Learning Repository [45], also used previously in the literature [47]. Musk had 166 attributes (after non-numeric attributes were removed) and 6598 records. However, during some initial testing it turned out that SafePub was only able to run with 20 attributes, since the generalisation schemes otherwise became too large due to the large range of values for each attribute. Therefore only the first 20 attributes were used in this study.

4.2 Experiments

Three main experiments were conducted to test how the performance of each algorithm was affected by different variables, and a fourth test was performed to study the models' respective dependency on k and δ. Where the baseline k-anonymisation was used, k was fixed to 5, a value that has been used in the literature [25] and appears to be a legal recommendation for internal use of health data in Sweden.

4.2.1 Experiment 1: Impact of ε on Information Loss

In the first experiment the impact the privacy parameter ε had on the utility was tested. The information loss was computed for varying ε ∈ {0.01, 0.1, 0.5, ln(2), 0.75, 1.0, ln(3), 1.25, 1.5, 2.0}. For MDP, k was fixed to the smallest value (√n), and for SafePub, δ was set to 1.0e−5. The anonymisation for both models was repeated 50 times for each dataset and value of ε. The k-anonymisation algorithm was not repeated since it did not include randomness. The datasets used were Adult, SympForm, SympFormLarge and Housing.
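The overall structure of Experiment 1 can be summarised by a loop of roughly the following shape, where anonymise_mdp, anonymise_safepub and sse are placeholders for the actual implementations used in the project; the parameter values are the ones stated above.

```python
import math

EPSILONS = [0.01, 0.1, 0.5, math.log(2), 0.75, 1.0, math.log(3), 1.25, 1.5, 2.0]
REPETITIONS = 50
DELTA = 1e-5

def run_experiment_1(dataset, anonymise_mdp, anonymise_safepub, sse):
    """For each privacy level, repeat the randomised anonymisations and average the SSE."""
    k = math.ceil(math.sqrt(len(dataset)))   # smallest k for which the microaggregation has an effect
    results = {}
    for eps in EPSILONS:
        mdp_scores = [sse(dataset, anonymise_mdp(dataset, k=k, epsilon=eps))
                      for _ in range(REPETITIONS)]
        safepub_scores = [sse(dataset, anonymise_safepub(dataset, epsilon=eps, delta=DELTA))
                          for _ in range(REPETITIONS)]
        results[eps] = (sum(mdp_scores) / REPETITIONS, sum(safepub_scores) / REPETITIONS)
    return results
```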

4.2.2 Experiment 2: Disclosure Risk

To measure the disclosure risk, the approach described in Section 3.2.2 was used. Only the two small datasets with 1000 records were used, due to the long computation time. The configuration was δ = 1e−4 for SafePub and k = 32 for MDP. For each dataset and ε ∈ {1.0, 2.0} the anonymisation was repeated 25 times. The utility was also computed for each model and dataset. The datasets used were SympFormsSmall and HousingSmall.

4.2.3 Experiment 3: Effect of Dimensionality

To study the effect the dataset dimensionality had on the performance of the algorithms, all the parameters were fixed (ε = 1.0, k = √n, δ ∈ {1.0e−5, 1.0e−4} depending on the size of the dataset) and the number of attributes in the dataset was varied. The number of attributes ranged from two up to the total number of attributes for the dataset. If the number of possible attribute combinations exceeded 100, 100 attribute combinations were randomly selected. For each combination of attributes, the anonymisation was repeated 25 times. The datasets used were Adult, SympForm, SympFormLarge, Housing and Musk.

4.2.4 Experiment 4: Impact of k and δ

To test the effect different values for k had on MDP, and the effect different values for δ had on SafePub, ε was fixed to 1.0 while k was varied for MDP and δ for SafePub. This was an independent test made with the purpose of observing the algorithms' behavior for different values of the parameters that were fixed during the other experiments.

The value for k is dependent on the size of the dataset, therefore the different ranges were k ∈ [200, 4700] for Adult, SympForms and Housing and k ∈ [300, 9300] for SympFormsLarge, with increments of 300 and 600 respectively. For SafePub the range was δ ∈ [10e−20, 10e−5], increasing by a factor of ten, for all datasets. Here, the anonymisation was repeated 50 times as well.

The datasets used were Adult, SympForm, SympFormsLarge and Housing.


Chapter 5 Results

In this chapter the results from the four experiments are presented.

The only useful metric for comparing the utility of the models turned out to be SSE; therefore only SSE was used when comparing the utility between the models. In the resulting graphs below, smaller values for both the information loss and ε are better: less information loss is a sign of the data being more useful, and a smaller ε means a stronger privacy guarantee.

Each result was normalised so that 0 represented no information loss and 1 represented a dataset where all information had been removed, i.e. each record in the dataset was suppressed.

5.1 Experiment 1

The result from Experiment 1 is displayed in Figure 5.1, which shows how the information loss (SSE) depends on the value of ε, i.e. how the utility depends on the privacy level. The result is the average score over 50 iterations. In addition, excerpts from anonymised datasets are included in Appendix A in Tables A.1 - A.5 for each model when ε = 2.0 and ε = 0.01, which show a clear difference between the anonymised datasets. While both datasets from MDP look similar, the datasets from SafePub are quite different, where the majority of the records were suppressed for ε = 0.1. This was also reflected by the resulting information loss.

Both MDP and SafePub had a decreasing information loss for increasing ε.

The information loss for MDP did not decrease as much as it did for SafePub, but was lower for smaller values of ε, which led to a lower information loss overall for most datasets. For the SympForms datasets MDP had the lowest information loss over all ε, and for the Housing dataset MDP performed much better than SafePub, even better than the baseline k-anonymisation. For the other three datasets, the information loss was, as expected, lower for k-anonymisation. For the Adult dataset SafePub surpassed MDP for ε > 1.5, and compared to SympForms, MDP performed worse on the Adult dataset while SafePub performed slightly better. Except for MDP applied on Housing, neither of the models got an information loss < 50% for the privacy levels considered in this study. Also, the result from each iteration varied slightly more for SafePub across all datasets.

Figure 5.1 (panels: (a) SympForms, (b) Adult, (c) SympFormsLarge, (d) Housing): Information loss (SSE) when the algorithms were applied on four different datasets, with different values for ε, i.e. different levels of privacy. A small value is better for both axes: less information loss means more utility and smaller ε means stronger privacy. A confidence interval (95%) is plotted as a lighter, blurred line around the lines, but is too small to see for MDP. k-anonymisation is only used as a reference for the information loss and does not have the same privacy guarantees as MDP and SafePub.

The results from the other utility metrics mentioned in Section 3.2 turned out to be unsuitable for comparing the algorithms. With the discernibility metric, MDP got zero information loss, and the non-uniform entropy metric resulted in an information gain, which is impossible: a dataset where noise has been added should not be more informative than the original dataset. Those metrics were therefore not included for MDP; this is discussed further in Section 6.1.
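For context, the discernibility metric charges each record with the size of the equivalence class (records sharing the same quasi-identifier values) it belongs to, and suppressed records with the size of the whole dataset; a perturbed dataset whose records are all unique therefore receives the minimum possible score, which after normalisation appears as zero information loss. A sketch of the standard formulation follows; the "*" marker for suppressed values is an assumption, and this is not necessarily the implementation used by the evaluation framework.

    import pandas as pd

    def discernibility(anonymised: pd.DataFrame, quasi_identifiers, n_original: int) -> int:
        # Each record is charged the size of its equivalence class; records
        # whose quasi-identifiers are all suppressed ("*") are charged the
        # size of the original dataset instead.
        cost = 0
        groups = anonymised.groupby(quasi_identifiers, dropna=False).size()
        for key, size in groups.items():
            key = key if isinstance(key, tuple) else (key,)
            if all(value == "*" for value in key):
                cost += size * n_original
            else:
                cost += size * size
        return cost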

For SafePub the results from those metrics were more as expected, with a decrease in information loss as ε increased. A figure of the results is included in Appendix A in Figure A.1.

5.1.1 Additional Observations

Since the two methods use different anonymisation approaches, the loss of information had different causes. The information loss from MDP followed from the distortion of values; especially for categorical attributes, the distribution of values in the anonymised dataset could differ quite a lot from the original distribution. For SafePub, the information loss was caused by more general values, and by records and values missing due to suppression in the anonymised datasets.

MDP Distortion

For all datasets the trend was similar regarding the distortion of values following MDP. For numerical attributes, the distribution did not change much. The peak was slightly shifted to the right, i.e. towards larger values, and there was less variance. When ε decreased, the distribution started to resemble a normal distribution, although for ε = 0.01 there was a spike at both the minimal and maximal value. This can be explained by values that ended up outside of the attribute domain after noise was added being set to the largest or smallest value in the domain. For categorical attributes, the values were more evenly distributed for smaller ε. In the Adult dataset, the value "United States" dominated the attribute Native country, but after anonymisation the distribution became more even with decreasing ε, as seen in Figure 5.2.

Figure 5.2: The distribution of "Native country" in the Adult dataset for (a) the original data, (b) ε = 2.0 and (c) ε = 0.1. Plotted with a logarithmic scale and a 95% confidence interval. "United States" is marked in red.
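The spikes at the domain boundaries can be reproduced with a simple sketch of the clamping step described above. The sketch assumes Laplace noise with scale sensitivity/ε for illustration; it is not a restatement of the exact mechanism used by MDP.

    import numpy as np

    def perturb_and_clamp(values, epsilon, sensitivity=1.0, rng=None):
        # Add noise to a numerical attribute and clamp the result back into
        # the original attribute domain. Values pushed outside the domain end
        # up exactly on the boundary, which creates the spikes at the minimum
        # and maximum observed for small epsilon.
        rng = rng or np.random.default_rng()
        lo, hi = values.min(), values.max()
        noisy = values + rng.laplace(scale=sensitivity / epsilon, size=values.shape)
        return np.clip(noisy, lo, hi)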

SafePub Suppression

In Figure 5.3 the fraction of suppressed records and attributes in datasets anonymised with SafePub is shown. The number of suppressed records decreased with increasing ε. The percentage of suppressed records in Adult ranged from almost 100% for ε = 0.01 to 35% for ε = 2.0. For SympForms the range was similar, with the record suppression rate ranging from almost 100% down to 30% for ε = 0.01 to ε = 2.0. Slightly fewer records seemed to be suppressed for the SympForms dataset, but the result was similar for all datasets. The size of the dataset did not seem to have a noticeable impact, since SympFormsLarge gave very similar results to SympForms (and is therefore not included in the figure).

Figure 5.3: The ratio of suppressed records (a, left) and suppressed attributes (b, right) from SafePub for different values of ε. A confidence interval (95%) is plotted as a lighter/blurry line around the lines.

The percentage of suppressed attributes for Adult was 40% for ε = 0.01, and for ε ≥ ln(2), 10% of the attributes were suppressed, i.e. on average one attribute was suppressed. For SympForms the trend was similar, but the range was instead from 50% for ε = 0.01 down to 30% for ε ≥ 1.0. It would seem that more attributes need to be suppressed for SympForms compared to Adult and Housing.
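Suppression rates of the kind plotted in Figure 5.3 can be computed directly from the anonymised tables, for example as in the sketch below; the "*" marker for suppressed cells is an assumption about the output format.

    import pandas as pd

    SUPPRESSED = "*"  # assumed marker for suppressed cells in the output

    def suppression_rates(anonymised: pd.DataFrame):
        # A record counts as suppressed when every cell in the row is the
        # marker; an attribute counts as suppressed when every cell in the
        # column is. Returns (record_rate, attribute_rate) as fractions.
        mask = anonymised.eq(SUPPRESSED)
        record_rate = mask.all(axis=1).mean()
        attribute_rate = mask.all(axis=0).mean()
        return float(record_rate), float(attribute_rate)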

5.2 Experiment 2

The disclosure risk for all models (MDP, SafePub and k-anonymisation) in terms of record linkages from minimal perturbation distance was very small.

In Figure 5.4 the result is shown for each dataset. SafePub had similar results for both datasets and both values of ε, while the results differed slightly more for MDP. For SympFormsSmall both models had fewer correct record linkages than k-anonymisation.

However, for the Housing dataset MDP had similar record linkage to k-anonymisation for ε = 1.0, and the record linkage exceeded the baseline for ε = 2.0. SafePub, on the other hand, had more information loss, similarly to Experiment 1, with around 0.8 and 0.7 for ε = 1.0 and ε = 2.0 respectively, compared to around 0.5 for MDP on SympFormsSmall for both values of ε and 0.02 on HousingSmall for both ε. Detailed results for each dataset and ε are found in Appendix A in Tables A.6 and A.7.

Figure 5.4: Number of correct record linkages for each model on (a) SympFormsSmall and (b) HousingSmall, with ε = 1.0 and ε = 2.0. Plotted with a 95% confidence interval.
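A minimal sketch of this kind of linkage test is given below: every anonymised record is matched to its nearest original record, and the linkage counts as correct when the nearest match is the record it was derived from. The Euclidean distance over numerical attributes is an assumption for the sketch; the exact distance used in the experiments is not restated here.

    import numpy as np

    def correct_linkage_rate(original, anonymised):
        # original, anonymised: arrays of shape (n_records, n_attributes),
        # where row i of `anonymised` was derived from row i of `original`.
        correct = 0
        for i, record in enumerate(anonymised):
            distances = np.linalg.norm(original - record, axis=1)
            if np.argmin(distances) == i:  # nearest original record is the true source
                correct += 1
        return correct / len(anonymised)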

5.3 Experiment 3

When studying the effect of dataset dimensionality, SafePub seemed to be more sensitive to increasing dimensionality, while MDP was less affected, with an information loss that was more or less constant, as seen in Figure 5.5. For SafePub the increase in information loss flattened out for higher dimensions, but by then the information loss was already around 80-90%, which is a substantial loss of information. When studying the resulting datasets from SafePub, the suppression rate of attributes and records increased with increasing dimensionality.

5.4 Experiment 4

In Figure 5.6 the result from varying the parameter k for MDP and δ for SafePub is shown. For SafePub the information loss increased slightly with decreasing δ, which was expected, but the difference was within 10%, and for SympForms δ seemed to have less of an impact. For MDP the trend was different: the information loss decreased as k increased, until k ≈ 800 and k ≈ 1500 depending on the size of the dataset, and then started to increase.

However, k did not affect the performance much when the dataset only had numerical attributes.


Figure 5.5: Information loss (SSE) for different dataset dimensions (the number of attributes) on five datasets: (a) SympForms, (b) Adult, (c) Housing, (d) SympFormsLarge and (e) Musk. Plotted with a 95% confidence interval.

Figure 5.6: Information loss (SSE) when varying k for MDP and δ for SafePub on (a) Adult, (b) SympForms, (c) Housing and (d) SympFormsLarge. Visualises how the performance of the two models depends on the parameters. A confidence interval (95%) is plotted as a lighter/blurry line around the lines, but is too small to see for MDP.
