Differential Privacy

Protecting Individual Contributions to Aggregated Results
Toni Mattis
Cloud Security Mechanisms (Prof. Polze, Christian Neuhaus, SS 2013)
16 May 2013

Structure
1 Privacy Goals and Examples
2 Early Means of Privacy Control (Query Restriction, Query Set Approximation, Summary)
3 Modern Data Anonymization (Anonymity Concepts, Resampling and Permutation, Summary)
4 Differential Privacy (Epsilon-Differential Privacy, Summary)
5 Challenges and Current Research Topics

Privacy Examples

Goal: avoid disclosure of an individual's contribution to an aggregated result.

What does "aggregated result" mean?
- Average income of a neighbourhood
- Frequency of certain diseases among a population
- Correlation between coffee and tobacco consumption

Sensitive individual attributes:
- Income (census)
- Diseases (clinical reports)
- Habits (surveys)
- Location and movement (mobile phones)

May 12, 2013: EE data sold to track customers?

Official response from Ipsos: "We do not have access to any names, personal address information, nor postcodes or phone numbers. [...] We only ever report on aggregated groups of 50 or more customers. [...] We will never release any data that in any way allows an individual to be identified."

The Inference Problem

A tax officer requests the total revenue of the City at the Census Bureau. May the Census Bureau release the total revenue of the Region afterwards? [Han71]

(Diagram: two hardware stores, A and B, within the Urban Region, which comprises the City plus its Suburbs.)

The Inference Problem

Privacy breach: Store A can provably obtain B's contribution to the sum:

B = Region − City − A    (1)

The Inference Problem: background knowledge of stakeholders (here, Store A's knowledge of its own revenue) is used to infer more information than was actually published.

2 Early Means of Privacy Control

Query Restriction

(Illustration: a Census Officer mediating requests against the US Census files.)

Query Restriction

A restricted interface in front of the database enforces a disclosure policy. Disclosure policies according to [Den80]:
- Reject too-small query sets (minimum query set control)
- Reject too-similar queries (minimum overlap control)
- Deny results that would lead to a solvable system of equations (auditing; only theoretical due to its complexity)

Query Restriction Example

Name     | Age | Income p.a.
Person A | 30  | $40,000
Person B | 35  | $60,000
Person C | 40  | $30,000
Person D | 45  | $80,000

Minimum query set control: |Q| >= 3.

Attack vector:
Q1 = SUM(Income | Age < 42) = $130,000    (2)
Q2 = SUM(Income | Age < 50) = $210,000    (3)
Income_D = Q2 − Q1 = $80,000              (4)

Minimum overlap control: the attack is similar to the one against minimum query set control, just with more equations to solve. A sketch of the differencing attack follows below.
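A minimal sketch of a minimum-query-set-control interface and the differencing attack that defeats it; the data and helper names are illustrative, not part of the original example beyond the four persons above:

```python
# Restricted interface: refuse queries whose query set is too small.
RECORDS = [
    {"name": "A", "age": 30, "income": 40_000},
    {"name": "B", "age": 35, "income": 60_000},
    {"name": "C", "age": 40, "income": 30_000},
    {"name": "D", "age": 45, "income": 80_000},
]

MIN_QUERY_SET = 3  # reject queries matching fewer than 3 records

def sum_income(predicate):
    """Answer SUM(income) only if the query set is large enough."""
    matches = [r for r in RECORDS if predicate(r)]
    if len(matches) < MIN_QUERY_SET:
        raise PermissionError("query set too small")
    return sum(r["income"] for r in matches)

# Both queries individually satisfy the policy...
q1 = sum_income(lambda r: r["age"] < 42)  # 3 records -> 130000
q2 = sum_income(lambda r: r["age"] < 50)  # 4 records -> 210000

# ...but their difference isolates Person D's income.
print("Income of D:", q2 - q1)  # 80000
```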

Partitioning (a.k.a. Microaggregation)

Raw data:
Name     | Age | Income p.a.
Person A | 30  | $40,000
Person B | 35  | $60,000
Person C | 40  | $30,000
Person D | 45  | $80,000

Partitioned data:
Name      | Age   | Income p.a.
Person AB | 30-39 | $50,000 (±20,000)
Person CD | 40-49 | $55,000 (±25,000)

The 1960 US Census data was made available with partitions of size 1000 [Han71].

Partitioning (a.k.a. Microaggregation)

Consider adding Person E (age 47, $40,000) to the partitioned data:

Name       | Age   | Income p.a.
Person AB  | 30-39 | $50,000 (±20,000)
Person CD  | 40-49 | $55,000 (±25,000)  (before)
Person CDE | 40-49 | $50,000 (±25,000)  (after)

Attack vector: given knowledge of E's age and the previous release,

Income_E = NewSize · NewIncome − OldSize · OldIncome    (2)
         = 3 · 50,000 − 2 · 55,000                      (3)
         = 40,000                                       (4)

[AW89]

Random Sampling

Idea: pseudo-randomly select candidate records for a given query; the same query must always operate on the same subset.

Simplified algorithm [Den80]:
1. Define an elimination probability P_e = 2^(−k).
2. Preprocessing: for each record r_i, compute a hash h_i ∈ {0,1}^k.
3. Query processing: compute a hash H_q ∈ {0,1}^k from the query.
4. Eliminate every r_i with h_i = H_q from the query set.

Announce k, but keep the hash function secret. (A sketch of this scheme appears below.)
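A minimal sketch of this sampling scheme, assuming a keyed hash (HMAC) plays the role of the secret hash function; the record layout and query canonicalization are illustrative choices, not part of [Den80]:

```python
import hashlib
import hmac

K = 4                       # announced publicly: P_e = 2**-K
SECRET = b"keep-me-secret"  # the hash function (here: its key) stays secret

def _hash_bits(data: bytes) -> int:
    """Map arbitrary bytes to a k-bit value via a keyed hash."""
    digest = hmac.new(SECRET, data, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (2 ** K)

def query_set(records, query: str):
    """Deterministically drop each record with probability 2**-K.

    The same query string always yields the same subset, so repeating
    a query gives an attacker no fresh information about the sample.
    """
    h_q = _hash_bits(query.encode())
    return [r for r in records if _hash_bits(repr(r).encode()) != h_q]

records = [("A", 40_000), ("B", 60_000), ("C", 30_000), ("D", 80_000)]
sample = query_set(records, "SUM(income WHERE age < 50)")
print(sum(income for _, income in sample))
```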

Summary 1960-1990

Query restriction approach (inspired by the manual work performed by census officers in the pre-DBMS age):
- Minimum query set control
- Minimum overlap control
- Auditing (for small databases)

Query set approximation approach:
- Partitioning/microaggregation (for static databases)
- Random sampling

3 Modern Data Anonymization

Anonymizing Data

Removing names and addresses from a clinical report may not be sufficient...

(Diagram: a medical report (diagnosis, medication, visit date) and a voters list (name, address, occupation, party affiliation) overlap in date of birth and ZIP code, enabling linkage.)

Anonymizing Data

Medical databases publish data for researchers. Example data:

Job      | Gender | Age | Diagnosis
Engineer | Male   | 35  | Hepatitis
Engineer | Male   | 38  | Hepatitis
Lawyer   | Male   | 38  | HIV
Writer   | Female | 30  | Flu
Writer   | Female | 30  | HIV
Dancer   | Female | 30  | HIV
Dancer   | Female | 30  | HIV

The attribute combination (Job, Gender, Age) can identify a record's owner given suitable background knowledge: it is a quasi-identifier (QID).

k-anonymity

k-anonymity [Agg05]: each QID is associated with at least k records. Example data generalized to k = 3:

Job          | Gender | Age   | Diagnosis
Professional | Male   | 35-40 | Hepatitis
Professional | Male   | 35-40 | Hepatitis
Professional | Male   | 35-40 | HIV
Artist       | Female | 30-34 | Flu
Artist       | Female | 30-34 | HIV
Artist       | Female | 30-34 | HIV
Artist       | Female | 30-34 | HIV

Partial disclosure: background knowledge about a female writer aged 30 still yields the diagnosis HIV with 75% confidence. (A small k-anonymity checker is sketched below.)
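A minimal sketch of a k-anonymity check, assuming tables are represented as lists of dicts; this validates a published table rather than producing one:

```python
from collections import Counter

def is_k_anonymous(rows, qid, k):
    """True if every combination of quasi-identifier values occurs
    in at least k rows of the table."""
    groups = Counter(tuple(row[a] for a in qid) for row in rows)
    return all(count >= k for count in groups.values())

table = [
    {"job": "Professional", "gender": "M", "age": "35-40", "dx": "Hepatitis"},
    {"job": "Professional", "gender": "M", "age": "35-40", "dx": "Hepatitis"},
    {"job": "Professional", "gender": "M", "age": "35-40", "dx": "HIV"},
    {"job": "Artist", "gender": "F", "age": "30-34", "dx": "Flu"},
    {"job": "Artist", "gender": "F", "age": "30-34", "dx": "HIV"},
    {"job": "Artist", "gender": "F", "age": "30-34", "dx": "HIV"},
    {"job": "Artist", "gender": "F", "age": "30-34", "dx": "HIV"},
]
print(is_k_anonymous(table, ("job", "gender", "age"), k=3))  # True
```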

Anonymization Algorithms

Input: the parameter k and the QID. Choose the generalizations with the least information loss:
- Numbers to ranges: for age 34, [30-34] is preferred to [30-39].
- Categories to hypernyms: for "Dancer", "Performing Artist" is better than "Artist".

Implementations (the optimal solution is NP-hard!):
- Start with the original dataset and iteratively generalize one attribute by a small amount until k-anonymity is reached (or start with the most general dataset and specialize). [SS98]
- Apply genetic algorithms: start with random generalizations, iteratively combine the best-performing configurations while adding some mutations, and continue until convergence. [Iye02]

(A greedy-generalization sketch follows below.)
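A minimal sketch of the first strategy, assuming a fixed generalization ladder per attribute (a tiny job hierarchy and age ranges of growing width) and that generalizing job and age suffices; real implementations search far more carefully over which attribute to generalize next:

```python
from collections import Counter

# Hypothetical generalization ladder: index 0 is the raw value.
JOB_LADDER = {"Engineer": ["Engineer", "Professional"],
              "Lawyer":   ["Lawyer", "Professional"],
              "Writer":   ["Writer", "Artist"],
              "Dancer":   ["Dancer", "Artist"]}

def generalize(rows, levels):
    """Apply generalization levels (job_level, age_width) to all rows."""
    job_lvl, age_width = levels
    out = []
    for r in rows:
        job = JOB_LADDER[r["job"]][min(job_lvl, 1)]
        lo = (r["age"] // age_width) * age_width
        out.append((job, r["gender"], f"{lo}-{lo + age_width - 1}"))
    return out

def smallest_group(qids):
    return min(Counter(qids).values())

def greedy_k_anonymize(rows, k):
    levels = [0, 1]  # raw jobs, age width 1
    while smallest_group(generalize(rows, levels)) < k:
        # Coarse heuristic: first climb the job ladder, then keep
        # doubling the age ranges until every group has >= k rows.
        if levels[0] == 0:
            levels[0] = 1
        else:
            levels[1] *= 2
    return levels, generalize(rows, levels)

rows = [{"job": "Engineer", "gender": "M", "age": 35},
        {"job": "Engineer", "gender": "M", "age": 38},
        {"job": "Lawyer",   "gender": "M", "age": 38},
        {"job": "Writer",   "gender": "F", "age": 30},
        {"job": "Writer",   "gender": "F", "age": 30},
        {"job": "Dancer",   "gender": "F", "age": 30},
        {"job": "Dancer",   "gender": "F", "age": 30}]
print(greedy_k_anonymize(rows, k=3))
```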

k-anonymity on Geospatial Data [Qar13]

(Map screenshots: individual location trails annotated with timestamped requests, e.g. google.com/search?q=geospatial+data at 15:57 and m.facebook.com/...?user=1460620434 at 11:20; a second map aggregates the same area into grid cells carrying only per-cell counts such as 3, 4, 5, 6, 8, 10.)

The data sold by EE last week contained cells of at least 50 individuals [Kob13].

l-diversity

l-diversity [2006]: each QID is associated with at least l different values of the sensitive attribute. Our example data is only 2-diverse! Increasing the group sizes achieves 3-diversity:

Job          | Gender | Age   | Diagnosis
Professional | Male   | 35-40 | Hepatitis
Professional | Male   | 35-40 | Hepatitis
Professional | Male   | 35-40 | HIV
Professional | Male   | 35-40 | Flu
Artist       | Female | 30-34 | Hepatitis
Artist       | Female | 30-34 | Flu
Artist       | Female | 30-34 | HIV
Artist       | Female | 30-34 | HIV
Artist       | Female | 30-34 | HIV

(The checker sketched above extends directly to l-diversity; see below.)
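A minimal companion to the k-anonymity checker above, again assuming list-of-dict tables:

```python
from collections import defaultdict

def is_l_diverse(rows, qid, sensitive, l):
    """True if every QID group contains at least l distinct values
    of the sensitive attribute."""
    values = defaultdict(set)
    for row in rows:
        values[tuple(row[a] for a in qid)].add(row[sensitive])
    return all(len(v) >= l for v in values.values())
```

For the table above, is_l_diverse(table, ("job", "gender", "age"), "dx", 3) holds, while the original k = 3 table is only 2-diverse.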

t-closeness

t-closeness [2007]: the distribution of sensitive values inside each group closely resembles the overall distribution; t is an upper bound on the distance between the distributions.

Example for categorical values:

Element   | Overall | 10%-Closeness | 5%-Closeness
Flu       | 90%     | 80-100%       | 85-95%
Hepatitis | 5%      | 0-15%         | 0-10%
HIV       | 5%      | 0-15%         | 0-10%

(Numeric values use the Earth Mover's Distance.) A sketch of the categorical check follows below.
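A minimal sketch of the categorical case, using the maximum per-value deviation as the distance, which matches the per-element bounds in the table above; attribute names are illustrative:

```python
from collections import Counter

def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def satisfies_t_closeness(rows, qid, sensitive, t):
    """Categorical t-closeness: in every QID group, each sensitive
    value's frequency must stay within t of its overall frequency."""
    overall = distribution(r[sensitive] for r in rows)
    groups = {}
    for r in rows:
        groups.setdefault(tuple(r[a] for a in qid), []).append(r[sensitive])
    for members in groups.values():
        local = distribution(members)
        for value, p in overall.items():
            if abs(local.get(value, 0.0) - p) > t:
                return False
    return True
```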

Other Approaches [FWCY10]

- Resampling: construct the distribution of the original data, then replace n% of the values by random values drawn from this distribution.
- Permutation: select groups of some size and shuffle the sensitive values inside each group to de-associate QID and value.

Both preserve the mean, variance, and distribution of each attribute in isolation, but heavily impact correlation and covariance between attributes. (A permutation sketch follows below.)
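A minimal sketch of within-group permutation, assuming the groups are already formed and "income" is the sensitive attribute (an illustrative choice):

```python
import random

def permute_sensitive(groups, sensitive="income"):
    """Shuffle the sensitive value within each group, breaking the
    link between quasi-identifiers and values while preserving the
    attribute's marginal distribution."""
    for group in groups:
        values = [row[sensitive] for row in group]
        random.shuffle(values)
        for row, value in zip(group, values):
            row[sensitive] = value
    return groups
```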

Census Revisited

Knowing the previous techniques: what about the census example from the beginning?

(Diagram: hardware stores A and B within the Urban Region = City + Suburbs, as before.)

Attack vector:
B = Region − City − A    (5)

Census Revisited

Attack vector?
B = Region − City − A    (6)

Rendering the sums ineffective for precise disclosure:
- Query set overlap control: rejects the query.
- Random sampling: additional stores are missing from the sums.
- Microaggregation with size 3 / 3-anonymity: the aggregates differ in either 0 or 3 stores.
- Resampling: A computes either the real value or a random number.
- Permutation: A computes a random competitor's revenue.

4 Differential Privacy

Modern Applications

(Diagram: contributions pass through input perturbation into the query processor; the query set's aggregated result passes through output perturbation before release.)

ɛ-differential Privacy

Person        | Attribute X | Attribute Y
Participant A | No          | No
Participant B | Yes         | Yes
Participant C | Yes         | Yes
Participant D | Yes         | Yes

Attack vector: query COUNT(X = Y) = COUNT(ALL) = 4, so you can infer that X = Y holds for every participant.

Solution: modify COUNT to randomly add or subtract 1. Each participant can now plausibly claim that he had no influence on the result, because the same answer can also be generated by a database not containing his contribution.

ɛ-differential Privacy

Definition: an aggregated result y = f(D) over a database D is differentially private if each database D' differing in a single element from D can plausibly generate the same result y.

Can we measure plausibility?

Plausibility: a result y = f(a) can be plausibly generated by a different value b if the outcomes are (almost) equally probable:

Pr(y = f(a)) ≈ Pr(y = f(b))

ɛ-differential Privacy

Making "plausibly" precise, step by step:

Pr(y = f(a)) ≈ Pr(y = f(b))              (7)
Pr(y = f(D)) ≈ Pr(y = f(D'))             (8)
Pr(y = f(D)) / Pr(y = f(D')) ≤ e^ɛ       (9)

ɛ-differential Privacy

Definition [DMNS06]: the result y of an aggregating function f satisfies ɛ-indistinguishability if for each two databases D and D' differing in a single element:

Pr(y = f(D)) / Pr(y = f(D')) ≤ e^ɛ    (10)

ɛ-differential Privacy

Example: COUNT answers...
- the truth with P = 0.5
- one less with P = 0.25
- one more with P = 0.25

Given a database D with COUNT(D) = 4 and a database D' with COUNT(D') = 3, differing in one element:

Pr(COUNT(D) = 4) / Pr(COUNT(D') = 4)    (11)
  = 0.5 / 0.25 = 2 ≈ e^0.69             (12)

so this mechanism provides ɛ ≈ 0.69. (A simulation sketch follows below.)
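A minimal simulation of this randomized COUNT, empirically estimating the ratio in (11); the function names are illustrative:

```python
import random
from collections import Counter

def noisy_count(true_count):
    """Answer the truth with P=0.5, one less or one more with P=0.25 each."""
    return true_count + random.choice([0, 0, -1, 1])

def answer_distribution(true_count, trials=100_000):
    """Empirical probability of each possible answer."""
    freq = Counter(noisy_count(true_count) for _ in range(trials))
    return {answer: n / trials for answer, n in freq.items()}

d, d_neighbor = 4, 3  # databases differing in one element
p_d = answer_distribution(d)[4]                    # ~0.50
p_d_neighbor = answer_distribution(d_neighbor)[4]  # ~0.25
print(p_d / p_d_neighbor)  # ~2, i.e. e^0.69 -> epsilon ~ 0.69
```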

ɛ-differential Privacy

What about SUM and other queries? Add noise proportional to the influence of a single individual:
- COUNT: 1 (+1 or −1 is fine)
- SUM: range(D) = max(D) − min(D)
- MEAN: range(D) / count(D)

Sensitivity: the sensitivity S(f) of a function f is defined as the maximum change a single contribution to a database D can cause to f(D).

ɛ-differential Privacy

Laplace distribution with scale parameter λ:

Lap(λ): Pr[X = x] ∝ e^(−|x|/λ)    (13)

(Plot: Laplace densities for b = 1, 2, 4 with μ = 0 and b = 4 with μ = −5; larger scales give flatter, heavier-tailed curves.)

Theorem [DMNS06]: answering a query f(D) with f(D) + x, where x ~ Lap(S(f)/ɛ), always satisfies ɛ-indistinguishability. (A sketch of this Laplace mechanism follows below.)
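A minimal sketch of the Laplace mechanism for SUM, assuming values are clamped to a known range [lo, hi] so that the sensitivity hi − lo from the slide applies; function names and data are illustrative:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Lap(scale) via inverse transform sampling."""
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_sum(values, lo, hi, epsilon):
    """SUM with Laplace noise calibrated to its sensitivity.

    Each value is clamped to [lo, hi], so changing one individual's
    value changes the sum by at most S(f) = hi - lo."""
    sensitivity = hi - lo
    true_sum = sum(min(max(v, lo), hi) for v in values)
    return true_sum + laplace_noise(sensitivity / epsilon)

incomes = [40_000, 60_000, 30_000, 80_000]
print(private_sum(incomes, lo=0, hi=100_000, epsilon=0.5))
```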

Summary of Modern Developments

Data anonymization approach:
- k-anonymity (groups of k indistinguishable individuals)
- l-diversity (groups with l different sensitive values)
- t-closeness (groups reflecting the overall distribution)
- Resampling/compression (random data reflecting the real-world distribution)
- Permutation (de-associating individuals and sensitive data)

Output perturbation approach:
- ɛ-differential privacy

5 Challenges and Current Research Topics

Challenges and Current Research Topics

How to ensure privacy concerning...
- Distributed sources? Secure multiparty computation.
- Tracking data (RFID, cell phones, credit card usage, ...)? Quite stable against perturbation due to its high dimensionality; the consequences of combining multiple sources are unforeseeable.
- Genetic sequences? The privacy risk is underestimated; individuals remain potentially identifiable numerous generations later.
- Social networks and interactions? Extremely stable against perturbation.

Thanks. Questions?

Sources and References

[Agg05] Aggarwal, Charu C.: On k-Anonymity and the Curse of Dimensionality. In: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05), VLDB Endowment, 2005, pp. 901-909. ISBN 1-59593-154-6.

[AW89] Adam, Nabil R.; Worthmann, John C.: Security-Control Methods for Statistical Databases: A Comparative Study. In: ACM Computing Surveys 21(4), December 1989, pp. 515-556. DOI 10.1145/76894.76895.

[Den80] Denning, Dorothy E.: Secure Statistical Databases with Random Sample Queries. In: ACM Transactions on Database Systems 5(3), September 1980, pp. 291-315. DOI 10.1145/320613.320616.

[DMNS06] Dwork, Cynthia; McSherry, Frank; Nissim, Kobbi; Smith, Adam: Calibrating Noise to Sensitivity in Private Data Analysis. In: Proceedings of the Third Conference on Theory of Cryptography (TCC '06), Springer, Berlin/Heidelberg, 2006, pp. 265-284. ISBN 3-540-32731-2.

[FWCY10] Fung, Benjamin C. M.; Wang, Ke; Chen, Rui; Yu, Philip S.: Privacy-Preserving Data Publishing: A Survey of Recent Developments. In: ACM Computing Surveys 42(4), June 2010, 14:1-14:53. DOI 10.1145/1749603.1749605.

[Han71] Hansen, Morris H.: Insuring Confidentiality of Individual Records in Data Storage and Retrieval for Statistical Purposes. In: Proceedings of the November 16-18, 1971, Fall Joint Computer Conference (AFIPS '71), ACM, New York, 1971, pp. 579-585.

[Iye02] Iyengar, Vijay S.: Transforming Data to Satisfy Privacy Constraints. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), ACM, New York, 2002, pp. 279-288. ISBN 1-58113-567-X.

[Kob13] Kobie, Nicole: EE Data Being Sold to Track Customers. http://www.pcpro.co.uk/news/381766/ee-data-being-sold-to-track-customers, 2013.

[Qar13] Qardaji, Wahbeh: Differentially Private Publishing of Geospatial Data. http://www.cerias.purdue.edu/news_and_events/events/security_seminar/details/index/9nr7je9aqbqneqem9hrg2salic, 2013.

[SS98] Samarati, Pierangela; Sweeney, Latanya: Generalizing Data to Provide Anonymity when Disclosing Information (Abstract). In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '98), ACM, New York, 1998, p. 188.