Privacy Techniques for Big Data



Privacy Techniques for Big Data: The Pros and Cons of Syntactic and Differential Privacy Approaches
Dr Roksana Boreli, SMU, Singapore, May 2015

Introductions
NICTA: Australia's National Centre of Excellence in Information and Communication Technology; 700 staff, ~300 PhD students.
Presenter: Research Leader, Mobile Systems research group. Interests: privacy-enhancing technologies, wireless communications and network protocols.

Outline
Privacy
De-identification techniques
Experiences from a recently completed project
Evaluating the privacy-utility trade-offs
Question time

Personal data is collected by services and apps, and used for targeted advertising.

The growing importance of privacy
Regulatory environment: legislation protecting the collection, storage and use of personal data (PII).
Consumer attitudes to privacy: increasing awareness.
Media attention: high potential for negative publicity.
Maximise the opportunity: how to minimise the risks while preserving data utility for analytics.

The meaning of PII
Privacy regulations are based on PII: personal vs non-personal information.
There are numerous examples of re-identification of anonymised data.
The fallacy of PII: any information could be PII, and should be protected.
(PII: personally identifiable information)

Regulatory guidelines for de-identification
Australian guidelines: frequency and dominance rules for data aggregates.
US HIPAA regulation (health data):
1. Redact identifiers.
2. Generalise (mask) locations and dates.
3. Residual information should not lead to re-identification.
(The Health Insurance Portability and Accountability Act of 1996 (HIPAA), Safe Harbor method.)

Netflix privacy breach
Dataset for the Netflix Prize contest: 17,770 movie titles; 480,189 users with random customer IDs; ratings 1-5. For each movie the ratings are given as (MovieID, CustomerID, Rating, Date).
Given auxiliary information (random chats, IMDb), a subscriber's identity can be uncovered with high probability.
(Robust De-anonymization of Large Sparse Datasets, Narayanan and Shmatikov, 2008)
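The HIPAA Safe Harbor steps above can be sketched in Python. The ZIP truncation to three digits and the grouping of ages over 89 follow the Safe Harbor rules, but the record layout, field names and reference date below are illustrative assumptions, not from the slides:

```python
from datetime import date

def safe_harbor_record(record, today=date(2015, 5, 1)):
    """Simplified Safe Harbor pass over one record (a sketch, not a
    complete implementation of all 18 identifier categories).

    1. Redact direct identifiers (name, phone).
    2. Generalise location (first 3 ZIP digits) and dates (year only);
       ages over 89 are grouped into a single 90+ category.
    """
    out = dict(record)
    for field in ("name", "phone"):
        out.pop(field, None)                 # step 1: redact identifiers
    out["zip"] = record["zip"][:3] + "**"    # step 2: mask location
    out["birth_year"] = record["dob"].year   # step 2: keep year only
    del out["dob"]
    age = today.year - record["dob"].year    # year-difference approximation
    out["age"] = min(age, 90)                # 90 stands for "90 or older"
    return out

rec = {"name": "Oliver Brown", "phone": "555-0101",
       "zip": "30670", "dob": date(1972, 3, 9)}
print(safe_harbor_record(rec))
```

Step 3 (checking that the residual information does not re-identify anyone) cannot be done record by record; it requires analysing the released data set as a whole.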

De-identifying mobility data
Based on analysis of anonymised call information of ~1.5 million users in a western country (1-hour precision): 4 spatio-temporal points are enough to uniquely identify 95% of the individuals.
(Yves-Alexandre de Montjoye, César A. Hidalgo, Michel Verleysen and Vincent D. Blondel, Unique in the Crowd: the privacy bounds of human mobility. Scientific Reports 3:1376, March 2013. Available at http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html)

Technologies from the research domain
1. Anonymisation
2. Obfuscation, differential privacy
3. Cryptographic solutions
plus more ongoing research topics.
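The uniqueness measure behind the "Unique in the Crowd" result can be illustrated with a small sketch: a user is considered identifiable if some set of p spatio-temporal points from their trace appears in no other user's trace. The traces below are invented toy data, not the study's data:

```python
import itertools

def unique_fraction(traces, p):
    """Fraction of users pinned down by some p spatio-temporal points.

    A user is unique if at least one p-point subset of their trace is
    contained in no other user's trace.
    """
    point_sets = {u: set(t) for u, t in traces.items()}
    unique = 0
    for user, trace in traces.items():
        others = [s for u, s in point_sets.items() if u != user]
        if any(all(not sub <= s for s in others)
               for sub in map(set, itertools.combinations(trace, p))):
            unique += 1
    return unique / len(traces)

# Each point is (cell tower, hour of day), i.e. 1-hour precision.
traces = {
    "user1": [("cellA", 9), ("cellB", 12), ("cellC", 18)],
    "user2": [("cellA", 9), ("cellB", 12), ("cellD", 18)],
    "user3": [("cellA", 9), ("cellB", 12), ("cellC", 18)],
}
print(unique_fraction(traces, 2))  # only user2 is identifiable: 1/3
```

Here user1 and user3 share identical traces, so no point set singles either of them out, while user2's visit to cellD makes them unique.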

Anonymisation
Scenario: storage or release of private information.
Simple approach: removal of PII and sensitive information.
Easy to reverse using side information, e.g. by linking the released customer data to public census data:

Customer data:
Name            Gender  Age  Post code  Monthly bill
Oliver Brown    Male    43   3067       $198
Emily Taylor    Female  37   3040       $45
William Walker  Male    19   3825       $146
Jack Harris     Male    26   3028       $35
Emma Anderson   Female  42   3195       $30
Lily White      Female  55   3067       $72
Lucas Johnson   Male    59   3818       $79

Census data:
Name          Gender  Age  Post code
Jane Eyre     Female  43   2066
Emily Taylor  Female  37   3040
Rob Reed      Male    59   2100
Jack Johnson  Male    48   3860

Anonymisation: syntactic approaches (data-set-based rules)
Frequency rule: each aggregation result is derived from at least k records.
http://www.oaic.gov.au/privacy/privacy-resources/privacy-business-resources/privacy-business-resource-4-de-identification-of-data-and-information and http://www.nss.gov.au/nss/home.nsf/pages/confidentiality+-+how+to+confidentialise+data:+the+basic+principles
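The linkage attack described above can be sketched directly: join the released table (names stripped) against the census data on the quasi-identifiers gender, age and post code. Values are taken from the slide's tables:

```python
# Released customer data: names removed, quasi-identifiers kept.
released = [
    ("Female", 37, 3040, 45),
    ("Male", 43, 3067, 198),
    ("Male", 19, 3825, 146),
]
# Publicly available census data, with names.
census = [
    ("Jane Eyre", "Female", 43, 2066),
    ("Emily Taylor", "Female", 37, 3040),
    ("Rob Reed", "Male", 59, 2100),
    ("Jack Johnson", "Male", 48, 3860),
]

def reidentify(released, census):
    """Join on (gender, age, post code); a unique match re-attaches a
    name to a supposedly anonymous record."""
    hits = []
    for gender, age, postcode, bill in released:
        matches = [name for name, g, a, p in census
                   if (g, a, p) == (gender, age, postcode)]
        if len(matches) == 1:
            hits.append((matches[0], bill))
    return hits

print(reidentify(released, census))  # [('Emily Taylor', 45)]: her bill leaks
```

Only Emily Taylor appears in both tables with matching quasi-identifiers, so her "sensitive" monthly bill is re-identified; this is exactly the reversal that side information enables.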

Anonymisation: syntactic approaches (k-anonymity)
Any unique combination of selected attributes/features (quasi-identifiers) must belong to a group of at least k users. For example, post codes 47677, 47602 and 47678 are generalised to 476**, ages 29, 22 and 27 to 2*, and gender (Male/Female) to *.
(L. Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(7), 2002.)

k-anonymity (k=2)
Gender  Age  Post code  Monthly bill
Male    -    30**       $198
Female  -    30**       $45
Male    -    38**       $146
Male    -    30**       $35
Female  -    31**       $30
Female  -    30**       $72
Male    -    38**       $79
Female  -    31**       $121
Male    -    38**       $82
Female  -    31**       $155

Extensions: l-diversity (within each group of k, ensure a mix of the specific values of the sensitive attribute), t-closeness, etc.
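The generalisation and the k-anonymity property can be checked mechanically. The sketch below suppresses age and truncates post codes as in the k=2 table; the first seven records come from the slide's customer table, and the last row is an illustrative addition so that every group reaches size 2:

```python
from collections import Counter

def generalise(record):
    """Generalise quasi-identifiers: suppress age, keep gender, and
    keep only the first two digits of the post code."""
    gender, age, postcode, bill = record
    return ((gender, str(postcode)[:2] + "**"), bill)

def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination occurs >= k times."""
    counts = Counter(qid for qid, _bill in rows)
    return all(c >= k for c in counts.values())

# Records: (gender, age, post code, monthly bill).
raw = [("Male", 43, 3067, 198), ("Female", 37, 3040, 45),
       ("Male", 19, 3825, 146), ("Male", 26, 3028, 35),
       ("Female", 42, 3195, 30), ("Female", 55, 3067, 72),
       ("Male", 59, 3818, 79), ("Female", 48, 3121, 121)]

rows = [generalise(r) for r in raw]
print(is_k_anonymous(rows, 2))  # True: every (gender, prefix) group has 2 members
```

Note that k-anonymity alone does not protect the sensitive attribute: if both members of a group had the same monthly bill, an attacker would learn it anyway, which is what l-diversity addresses.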

Data obfuscation
Scenario: releasing aggregate data/information.
Differential privacy requires that computations be insensitive to changes in any particular individual's record: for any two databases A and B differing in a single record (|A Δ B| = 1), the output distributions M(A) and M(B) of the mechanism M should be close. Consequently, being opted in or out of the database should make little difference to a person's privacy.

Differential privacy
Add calibrated noise to the sensitive result, e.g. noise drawn from a Laplace distribution; the ε parameter corresponds to the privacy strength.
Example, on the customer data (Oliver Brown, Male, 43, 3067, $198; Emily Taylor, Female, 37, 3040, $45; ...): average bill $96.3 + noise => reported average bill $91.4.
(C. Dwork. Differential privacy. In ICALP (2), pages 1-12, 2006.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. of the 3rd TCC, pages 265-284, 2006.)
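The Laplace mechanism from Dwork et al. can be sketched for the average-bill example. This is a minimal sketch assuming the bills are known to lie in a bounded range [lo, hi]; a production mechanism would also protect the record count and handle edge cases:

```python
import math
import random

def dp_average(values, lo, hi, eps, rng=random):
    """eps-differentially private average of values clipped to [lo, hi].

    The sensitivity of the mean of n bounded values is (hi - lo) / n,
    so Laplace noise with scale sensitivity / eps satisfies eps-DP;
    smaller eps means stronger privacy and larger noise.
    """
    n = len(values)
    clipped = [min(max(v, lo), hi) for v in values]
    scale = (hi - lo) / (n * eps)
    # Inverse-CDF sampling of Laplace(0, scale) noise.
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return sum(clipped) / n + noise

random.seed(42)
bills = [198, 45, 146, 35, 30, 72, 79]  # monthly bills from the slide's table
print(dp_average(bills, 0, 250, eps=0.5))  # noisy estimate of the true mean (~86.4)
```

Repeated queries each spend privacy budget: asking the same question twice with the same eps yields 2*eps total leakage, which is why deployments track a cumulative budget.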

Evaluating the utility of privacy techniques
Industry collaboration project: proof-of-concept implementation of selected privacy techniques (redaction/masking, anonymisation and obfuscation). Evaluate the solution for a set of analytics scenarios, quantifying both utility and privacy.
Approach: run the use cases (analytics) on the original data sets and on the data sets after applying the privacy-enhancing technologies (PETs), then compare the two sets of results using the metrics below.
Metrics
Analytics: RMSE, MRE, lift, compared to the results obtained on the original data.
Privacy: uniqueness, entropy, differential-privacy ε.
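The analytics-side metrics compare results computed on the original and on the protected data. A sketch of RMSE and MRE on illustrative values (not the project's actual results):

```python
import math

def rmse(original, privatised):
    """Root-mean-square error between paired analytics results."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(original, privatised))
                     / len(original))

def mre(original, privatised):
    """Mean relative error; useful when results span different scales."""
    return sum(abs(o - p) / abs(o)
               for o, p in zip(original, privatised)) / len(original)

# Illustrative averages from three analytics queries, before/after a PET.
orig = [96.3, 120.0, 80.0]
priv = [91.4, 125.0, 78.0]
print(round(rmse(orig, priv), 2), round(mre(orig, priv), 3))
```

Low RMSE/MRE means the privacy mechanism preserved utility for that query; the trade-off is made explicit by plotting these against the privacy metrics (uniqueness, entropy, ε) as the mechanism's parameters vary.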