Anonymization of DMP Data: The State of the Art in Privacy Preserving Data Publishing




Sharing and Secrecy of Health Information, Nancy, 15/10/10. Anonymization of DMP Data: The State of the Art in Privacy Preserving Data Publishing. T. Allard & B. Nguyen, Institut National de Recherche en Informatique et en Automatique (INRIA), Secure and Mobile Information Systems Team & Université de Versailles St-Quentin. DEMOTIS Programme.

Summary: Introduction; the Data Partitioning family; the continuous release problem; the Data Perturbation family.

A Brief History... Protecting data about individuals goes back to the 19th century: Carroll Wright (Bureau of Labor Statistics, 1885). Stanley Warner: randomized interview techniques for surveys (1965). Tore Dalenius (1977) defined disclosure: "If the release of the statistic S makes it possible to determine the microdata value more accurately than without access to S, a disclosure has taken place." Rakesh Agrawal (1999) invented Privacy Preserving Data Mining... 2010: differential privacy. However, the definition of anonymous information remains vague!

Disclaimer: the context here is the statistical study, which is probably too restrictive for many practitioners. We do not cover studies that are carried out directly by the hospital that collects the data. Data can only be exported if it is anonymized.

Privacy Preserving Data Publishing (PPDP) principle: individuals (e.g., patients) provide raw data to a trusted publisher (e.g., a hospital), which releases anonymized (or "sanitized") data to recipients for whom no assumption of trust is made (e.g., pharmaceutical companies).

Pseudonymization: a naïve privacy definition. The trusted publisher replaces each identifier with a pseudonym before releasing the data to the recipients (no assumption of trust).

Raw data:
SSN  Activity    Age  Diag
123  "M2 Stud."  22   Flu
456  "M1 Stud."  25   HIV
789  "L3 Stud."  20   Flu

Pseudonymized data:
Pseudo  Activity    Age  Diag
ABC     "M2 Stud."  22   Flu
MNO     "M1 Stud."  25   HIV
XYZ     "L3 Stud."  20   Flu

Pseudonymization is not safe! Sweeney [1] showed the existence of quasi-identifiers: medical data were "anonymized" and released; a voter list was publicly available; the medical records of Governor Weld were identified by joining the two datasets on the quasi-identifiers. In the 1990 US census, "87% of the population in the US had characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}" [1].

Data Partitioning Family 8

Data classification: Identifiers (ID), Quasi-Identifiers (QID), Sensitive Data (SD); e.g., SSN | Activity, Age | Diag. For each tuple: identifiers must be removed; the link between a quasi-identifier and its corresponding sensitive data must be obfuscated, but must remain true.

k-anonymity: form groups of (at least) k tuples that are indistinguishable with respect to their quasi-identifier.

Raw data:
Name  Activity    Age  Diag
Sue   "M2 Stud."  22   Flu
Bob   "M1 Stud."  21   HIV
Bill  "L3 Stud."  20   Flu

3-anonymous data:
Activity   Age       Diag
"Student"  [20, 22]  Flu
"Student"  [20, 22]  HIV
"Student"  [20, 22]  Flu

Record linkage probability: 1/k.
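A minimal sketch of generalization-based k-anonymization in the spirit of the slide above. The generalization hierarchy, the attribute layout, and the helper name are illustrative assumptions, not part of the talk; real anonymizers search over many candidate generalizations rather than applying one fixed hierarchy.

```python
from collections import defaultdict

def generalize(records, hierarchy, k):
    """Generalize the quasi-identifier (activity, age) of each record,
    then check that every QID group contains at least k records.
    `records` holds (name, activity, age, diag) tuples; `hierarchy`
    maps a precise activity to its generalized form."""
    groups = defaultdict(list)
    for name, activity, age, diag in records:
        gen_activity = hierarchy[activity]          # e.g. "M2 Stud." -> "Student"
        # Age is generalized to the min-max range of its group
        # (recomputed per record for clarity; a sketch, not optimized).
        ages = [a for (_, act, a, _) in records if hierarchy[act] == gen_activity]
        gen_age = (min(ages), max(ages))
        groups[(gen_activity, gen_age)].append((gen_activity, gen_age, diag))
    assert all(len(g) >= k for g in groups.values()), "not k-anonymous"
    # Identifiers (names) are dropped from the released tuples.
    return [row for g in groups.values() for row in g]
```

On the slide's example this maps the three students to one group with Activity "Student" and Age [20, 22], so any released record links back to an individual with probability at most 1/3.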

Questions: k-anonymity. A textbook case:
"Student" [20, 22] {Flu, HIV, Flu}
"Teacher" [24, 27] {Cancer, Cold, Cancer}
The QID attributes can be seen as the analysis axes of the sensitive data. Can they be determined before anonymization? Typically, what do the sensitive data contain? What is their cardinality? What if the groups overlap:
"Student" [20, 25] { }
"Teacher" [24, 27] { }
In your view, which properties must the data satisfy to remain usable? Do you usually cross-reference multiple sources? How?

L-diversity: ensure that each group has "enough diversity" with respect to its sensitive data.

Raw data:
Name  Activity  Age  Diag
Pat   "MC"      27   Cancer
Dan   "PhD"     26   Cancer
San   "PhD"     24   Cancer

3-anonymous data (no diversity):
Activity   Age       Diag
"Teacher"  [24, 27]  Cancer
"Teacher"  [24, 27]  Cancer
"Teacher"  [24, 27]  Cancer

Raw data:
Name  Activity    Age  Diag
Sue   "M2 Stud."  22   Flu
Pat   "MC"        27   Cancer
Dan   "PhD"       26   Cancer
San   "PhD"       24   Cancer
John  "M2 Stud."  22   Cold

3-diverse data:
Activity      Age       Diag
"University"  [22, 27]  Flu
"University"  [22, 27]  Cold
"University"  [22, 27]  Cancer
"University"  [22, 27]  Cancer
"University"  [22, 27]  Cancer

Attribute linkage probability: 1/L.

Questions: l-diversity. l-diversity typically prevents:
"Teacher" [24, 27] {Cancer, Cancer, Cancer}
What about the utility of l-diverse classes?

t-closeness: the distribution of sensitive values within each group must stay close to the global distribution (within a factor t). Consequence: an attacker who singles out a group gains only limited knowledge.
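A toy check of this closeness condition, using total variation distance between the group distribution and the global one. This distance choice is an illustrative simplification: [4] defines t-closeness with the Earth Mover's Distance, which also accounts for semantic proximity between values.

```python
from collections import Counter

def t_close(group_values, table_values, t):
    """Return True if the sensitive-value distribution of one group is
    within total variation distance t of the whole table's distribution."""
    def dist(values):
        counts = Counter(values)
        return {v: counts[v] / len(values) for v in counts}
    p, q = dist(group_values), dist(table_values)
    tv = 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))
    return tv <= t
```

On a table where half the diagnoses are "Cancer", a group containing only "Cancer" is far from the global distribution and fails the check for small t.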

The continuous release problem Limits of Data Partitioning 15

m-invariance [5]

Time t1, raw data:
Name  Activity    Age  Diag
Bob   "M1 Stud."  21   HIV
Bill  "L3 Stud."  20   Flu
Jim   "M2 Stud."  23   Cancer

Released at t1:
Activity   Age       Diag
"Student"  [20, 23]  Flu
"Student"  [20, 23]  HIV
"Student"  [20, 23]  Cancer

Time t2, raw data:
Name   Activity    Age  Diag
Bob    "M1 Stud."  21   HIV
Helen  "L1 Stud."  18   Cold
Jules  "L1 Stud."  19   Dysp.

Released at t2:
Activity   Age       Diag
"Student"  [19, 21]  HIV
"Student"  [19, 21]  Cold
"Student"  [19, 21]  Dysp.

Current models require "each group to be (relatively) invariant" across releases; this may require introducing fake data (there are two fake tuples in this example) and makes "case histories" hard to follow. Our direction: sampling, which needs no fake data, although "case histories" remain hard as well.

Questions: m-invariance. What are the application settings for continuous release? Individual follow-up of each record? Follow-up of a population? Is the dichotomy between transient and permanent data relevant?

Data Perturbation Family Local Perturbation 18

Local Perturbation: individuals send already anonymized (or "sanitized") data directly to the recipients (no assumption of trust); no trusted publisher is involved.

Mechanism: define a count query and send it to the individuals; each individual perturbs (flips) his answer with a known probability p; receive the perturbed answers, whose fraction is r_pert, and compute an estimate r_est of the true fraction. There are formal proofs of correctness: r_est = (r_pert - p) / (1 - 2p).
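A minimal simulation of this mechanism, assuming binary yes/no answers; the flip probability, sample size, and function names are illustrative:

```python
import random

def perturb(answers, p, rng):
    """Each individual flips his true 0/1 answer with probability p."""
    return [(1 - a) if rng.random() < p else a for a in answers]

def estimate(perturbed, p):
    """Unbiased estimate of the true 'yes' fraction from the perturbed
    fraction r_pert, via r_est = (r_pert - p) / (1 - 2p)."""
    r_pert = sum(perturbed) / len(perturbed)
    return (r_pert - p) / (1 - 2 * p)
```

Because E[r_pert] = r(1 - p) + (1 - r)p = r(1 - 2p) + p for a true fraction r, inverting this linear relation yields the estimator above; the estimate concentrates around r as the population grows.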

Data Perturbation Family Input Perturbation 21

Input Perturbation: individuals send raw data to the trusted publisher, which releases anonymized (or "sanitized") data to the recipients (no assumption of trust).

Mechanism (not detailed): similar to Local Perturbation, except that the data are not perturbed independently; we can therefore expect smaller errors.

Data Perturbation Family Statistics Perturbation 24

Statistics Perturbation (interactive setting): individuals send raw data to the trusted publisher; recipients (no assumption of trust) send statistical queries (e.g., counts) to the publisher and receive perturbed answers.

Statistics perturbation: the recipient defines a statistical query (e.g., a count) Q_i; the server answers with a count perturbed according to the query sensitivity: Q_i + η_i. The error magnitude is low, but the total number of queries is bounded.
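In differential privacy [11], the noise η_i is typically drawn from a Laplace distribution whose scale is the query sensitivity divided by the privacy parameter ε. A sketch under those standard assumptions (the dataset, predicate, and parameter values are illustrative); a count query has sensitivity 1, since adding or removing one individual changes it by at most 1:

```python
import math
import random

def laplace_count(data, predicate, epsilon, rng):
    """Answer a count query perturbed with Laplace noise of scale
    sensitivity/epsilon, where a count has sensitivity 1."""
    true_count = sum(1 for row in data if predicate(row))
    scale = 1.0 / epsilon
    # Sample Laplace(0, scale) by inverse transform from a uniform draw.
    u = rng.random() - 0.5
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    return true_count + noise
```

Smaller ε means more noise and stronger privacy; answering many queries consumes a privacy budget, which is one way to understand why the total number of queries is bounded.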

Questions: pseudo-random mechanisms. What is the typical order of magnitude of: the number of attributes in a query? the number of possible values per attribute? the number of queries in an epidemiological study? Do the data you have access to already contain unintentional errors? Which statistical estimators do you need? Is it realistic to "guess" the statistics of interest without "seeing" the data?

Conclusion: the difficulty lies in bridging the gap between the medical need for very precise data and the legal constraints on privacy protection and anonymization. The participation of users in many studies depends on the security guarantees they are given. Using secure hardware could convince patients to participate more in widespread studies (our current work).

Thank you!

References
[1] L. Sweeney. k-Anonymity: A Model for Protecting Privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5), 2002.
[2] X. Xiao, Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea, September 2006.
[3] A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam. l-Diversity: Privacy Beyond k-Anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), March 2007.
[4] N. Li, T. Li, S. Venkatasubramanian. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE), pp. 106-115, April 2007.
[5] X. Xiao, Y. Tao. m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets. Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 2007.
[6] S. L. Warner. Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias. Journal of the American Statistical Association, 1965.

[7] N. Mishra, M. Sandler. Privacy via Pseudorandom Sketches. Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), Chicago, IL, USA, June 2006.
[8] A. Evfimievski, J. Gehrke, R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 211-222, San Diego, California, June 2003.
[9] G. T. Duncan, T. B. Jabine, V. A. de Wolf (eds.). Private Lives and Public Policies. Report of the Committee on National Statistics Panel on Confidentiality and Data Access. National Academy Press, Washington, DC, USA, 1993.
[10] J. Gouweleeuw, P. Kooiman, L. C. R. J. Willenborg, P.-P. de Wolf. The Post Randomisation Method for Protecting Microdata. QUESTIIO, 22(1), 1998.
[11] C. Dwork. A Firm Foundation for Private Data Analysis. Communications of the ACM, 2010 (to appear at the time of the talk).
[12] S. E. Fienberg, J. McIntyre. Data Swapping: Variations on a Theme by Dalenius and Reiss. Privacy in Statistical Databases, pp. 14-29, 2004.
[13] B.-C. Chen, D. Kifer, K. LeFevre, A. Machanavajjhala. Privacy-Preserving Data Publishing. Foundations and Trends in Databases, 2(1-2), January 2009.

Basic mechanism [6]: survey context (Warner, 1965). Sensitive question: "Have you ever driven intoxicated?" Response: truthful with probability p, lie with probability (1 - p). Estimator: let π be the fraction of the population for which the true response is "Yes". Expected proportion of "Yes" answers: P(Yes) = π p + (1 - π)(1 - p), hence π = [P(Yes) - (1 - p)] / (2p - 1). If m/n individuals answered "Yes", π_est estimates π: π_est = [m/n - (1 - p)] / (2p - 1).

Extended mechanism [7] (Mishra, 2006). Server: defines queries as a conjunction of values (e.g., people with "HIV+ = true" and "aids = false") and wants to know the fraction of individuals that agree with the conjunction.

Extended mechanism [7], continued. Individuals: each one receives the attributes of the conjunction, e.g. "HIV+" and "aids". Each individual generates the vector of all possible answer combinations ("HIV+ = true, aids = true", "HIV+ = true, aids = false", ...) with the entry matching his own answer set to 1, e.g. (0, 0, 1, 0), then flips each element of the vector independently with probability p, yielding e.g. (1, 0, 1, 0); each bit is flipped with probability p and kept with probability (1 - p).

Extended mechanism [7], continued. Server: receives the perturbed vectors and estimates the count result: r_pert = fraction of perturbed vectors that agree with the conjunction; r_est = (r_pert - p) / (1 - 2p). The true result r is proven to be "not too far away" from r_est.
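A toy simulation of this flipped-vector estimator; the vector size, flip probability, and function names are illustrative assumptions rather than the construction of [7] in full:

```python
import random

def make_vector(true_index, size):
    """One-hot vector over all possible answer combinations, with a 1 at
    the individual's true combination."""
    return [1 if i == true_index else 0 for i in range(size)]

def flip(vector, p, rng):
    """Flip each bit of the vector independently with probability p."""
    return [(1 - b) if rng.random() < p else b for b in vector]

def estimate_fraction(vectors, query_index, p):
    """Estimate the true fraction of individuals whose answer matches the
    queried combination, via r_est = (r_pert - p) / (1 - 2p)."""
    r_pert = sum(v[query_index] for v in vectors) / len(vectors)
    return (r_pert - p) / (1 - 2 * p)
```

Each perturbed bit at the queried position is 1 with probability r(1 - p) + (1 - r)p, so the same linear inversion as in the local-perturbation count recovers r.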