Privacy Techniques for Big Data: The Pros and Cons of Syntactic and Differential Privacy Approaches
Dr Roksana Boreli, SMU, Singapore, May 2015

Introductions
NICTA: Australia's National Centre of Excellence in Information and Communication Technology; 700 staff, ~300 PhD students.
Presenter: Research Leader, Mobile Systems research group. Interests: privacy enhancing technologies, wireless communications and network protocols.
Outline
- Privacy
- De-identification techniques
- Experiences from a recently completed project
- Evaluating the privacy-utility trade-offs
- Question time

Personal data is collected by services and apps and used for targeted advertising.
The growing importance of privacy
- Regulatory environment: legislation protecting personal data (PII) collection, storage and use
- Consumer attitudes to privacy: increasing awareness
- Media attention: high potential for negative publicity
- Maximise the opportunity: how to minimise the risks while preserving data utility for analytics

The meaning of PII
- Privacy regulations are based on PII: personal vs non-personal information
- Numerous examples of re-identification of anonymised data
- The fallacy of PII: any information could be PII, and should be protected
PII: personally identifiable information
Regulatory guidelines for de-identification
- Australian guidelines: frequency and dominance rules for data aggregates
- US HIPAA regulation (health data):
  1. Redact identifiers
  2. Generalise (mask) location and dates
  3. Residual information should not lead to re-identification
The Health Insurance Portability and Accountability Act of 1996 (HIPAA), Safe Harbor method

Netflix privacy breach
Dataset for the Netflix Prize contest: 17,770 movie titles; 480,189 users with random customer IDs; ratings 1-5. For each movie we have the ratings: (MovieID, CustomerID, Rating, Date). Given auxiliary information (random chats, IMDB), a subscriber's identity can be uncovered with high probability.
Robust De-anonymization of Large Sparse Datasets, Narayanan and Shmatikov, 2008
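The HIPAA Safe Harbor steps listed above (redact identifiers, generalise location and dates) can be sketched as a simple record transform. This is a minimal illustration, not a full Safe Harbor implementation; the field names and the "first 3 digits of the ZIP" masking choice are assumptions for the example.

```python
# Sketch of the Safe Harbor steps: redact direct identifiers, then
# generalise (mask) location and dates. Field names are hypothetical.

def safe_harbor(record):
    """Return a de-identified copy of a patient record."""
    out = dict(record)
    # Step 1: redact direct identifiers.
    for field in ("name", "phone", "email"):
        out.pop(field, None)
    # Step 2: generalise location (keep first 3 ZIP digits) and dates (keep year only).
    out["zip"] = record["zip"][:3] + "**"
    out["admit_date"] = record["admit_date"][:4]  # "YYYY-MM-DD" -> "YYYY"
    return out

record = {"name": "Oliver Brown", "zip": "30670",
          "admit_date": "2015-05-12", "phone": "555-0100"}
print(safe_harbor(record))  # {'zip': '306**', 'admit_date': '2015'}
```

Step 3 (checking that residual information cannot be re-identified) is the hard part and cannot be captured by a per-record transform like this one.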
De-identifying mobility data
Based on analysis of anonymised call information of ~1.5 million users in a western country (1-hour precision): four spatio-temporal points are enough to uniquely identify 95% of the individuals.
Yves-Alexandre de Montjoye, César A. Hidalgo, Michel Verleysen and Vincent D. Blondel, Unique in the Crowd: the privacy bounds of human mobility. Scientific Reports 3:1376, March 2013. Available at http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html

Technologies from the research domain
1. Anonymisation
2. Obfuscation, differential privacy
3. Cryptographic solutions
More on-going research topics
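The uniqueness measure behind the mobility result above can be sketched as follows: a user is "unique" given p known spatio-temporal points if no other user's trace contains those same points. The traces and sampled points below are toy data for illustration, not the study's dataset.

```python
# Sketch of trace uniqueness: a user is unique given some points if
# their trace is the only one containing all of them.

def fraction_unique(traces, points_per_user):
    """traces: dict user -> set of (location, hour) points.
    points_per_user: dict user -> the points known for that user."""
    unique = 0
    for user, pts in points_per_user.items():
        matches = [u for u, t in traces.items() if pts <= t]
        if matches == [user]:
            unique += 1
    return unique / len(points_per_user)

traces = {
    "u1": {("A", 9), ("B", 12), ("C", 18)},
    "u2": {("A", 9), ("B", 12), ("D", 20)},
    "u3": {("E", 8), ("F", 13), ("C", 18)},
}
# Two points single out each user here, but ("A", 9) alone would not
# distinguish u1 from u2.
sample = {"u1": {("B", 12), ("C", 18)}, "u2": {("A", 9), ("D", 20)}, "u3": {("E", 8)}}
print(fraction_unique(traces, sample))  # 1.0
```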
Anonymisation
Scenario: storage or release of private information. Simple: removal of PII and sensitive information. Easy to reverse using side information.

Customer data:
Name            Gender  Age  Post code  Monthly bill
Oliver Brown    Male    43   3067       $198
Emily Taylor    Female  37   3040       $45
William Walker  Male    19   3825       $146
Jack Harris     Male    26   3028       $35
Emma Anderson   Female  42   3195       $30
Lily White      Female  55   3067       $72
Lucas Johnson   Male    59   3818       $79

Census data:
Name          Gender  Age  Post code
Jane Eyre     Female  43   2066
Emily Taylor  Female  37   3040
Rob Reed      Male    59   2100
Jack Johnson  Male    48   3860

Anonymisation: syntactic approaches
Data-set-based rules. Frequency rule: each aggregation result is derived from a minimum of k records.
http://www.oaic.gov.au/privacy/privacy-resources/privacy-businessresources/privacy-business-resource-4-de-identification-of-data-andinformation and http://www.nss.gov.au/nss/home.nsf/pages/confidentiality+-+how+to+confidentialise+data:+the+basic+principles
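The frequency rule above can be sketched as a release gate on aggregates: a cell is published only when at least k records contribute to it. The data and the k=5 threshold are illustrative assumptions.

```python
# Minimal sketch of the frequency rule: release an aggregate only when it
# is derived from at least k underlying records; otherwise suppress it.

def safe_aggregate(values, k=5, agg=sum):
    """Return agg(values) if at least k records contribute, else None (suppressed)."""
    if len(values) < k:
        return None
    return agg(values)

bills_by_postcode = {"3067": [198, 72, 55, 81, 120], "3825": [146]}
for postcode, bills in bills_by_postcode.items():
    print(postcode, safe_aggregate(bills))
# "3067" has 5 contributing records and is released; "3825" has 1 and is suppressed
```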
Anonymisation: syntactic approaches
k-anonymity: any unique combination of selected attributes/features must belong to a minimum group of k users. Example generalisation of quasi-identifiers:

Gender  Age  Post code          Gender  Age  Post code
Male    29   47677              *       2*   476**
Female  22   47602      =>      *       2*   476**
Male    27   47678              *       2*   476**

L. Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (7), 2002.

k-anonymity (k=2)
Gender  Age  Post code  Monthly bill
Male    -    30**       $198
Female  -    30**       $45
Male    -    38**       $146
Male    -    30**       $35
Female  -    31**       $30
Female  -    30**       $72
Male    -    38**       $79
Female  -    31**       $121
Male    -    38**       $82
Female  -    31**       $155

l-diversity: within the group of k, ensure that there is a mix of specific values of the sensitive attribute; t-closeness, etc.
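A k-anonymity check on generalised records can be sketched as grouping by the quasi-identifier values and verifying that every group has at least k members. The records and field names below are illustrative.

```python
# Sketch of a k-anonymity check: every combination of quasi-identifier
# values must be shared by at least k records.
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"gender": "*", "age": "2*", "postcode": "476**"},
    {"gender": "*", "age": "2*", "postcode": "476**"},
    {"gender": "*", "age": "2*", "postcode": "476**"},
    {"gender": "*", "age": "3*", "postcode": "477**"},
]
print(is_k_anonymous(records, ["gender", "age", "postcode"], k=2))  # False: last group has 1 record
```

Note that this only checks the quasi-identifiers; ensuring diversity of the sensitive attribute within each group is what l-diversity and t-closeness add.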
Data obfuscation
Scenario: releasing aggregate data/information. Differential privacy requires that computations be insensitive to changes in any particular individual's record; consequently, being opted in or out of the database should make little difference to a person's privacy. For neighbouring databases A and B differing in a single record (|A Δ B| = 1), the outputs of the mechanism M should be close: M(A) ≈ M(B).

Differential privacy
Add calibrated noise to sensitive data, e.g. generated from the Laplace distribution; the ε parameter controls the privacy level (smaller ε gives stronger privacy).

Name          Gender  Age  Post code  Monthly bill
Oliver Brown  Male    43   3067       $198
Emily Taylor  Female  37   3040       $45

Average bill: $96.3 + noise => average bill: $91.4

C. Dwork. Differential privacy. In ICALP (2), pages 1-12, 2006.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. of the 3rd TCC, pages 265-284, 2006.
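The Laplace mechanism above can be sketched for the average-bill query: clamp values to an assumed range, compute the mean, and add Laplace noise with scale sensitivity/ε. The [0, 250] bill range, the function names, and the bill values are assumptions for the example; the sensitivity of a bounded mean over n records is (upper - lower)/n.

```python
# Sketch of the Laplace mechanism for a private mean. Laplace noise is
# drawn via the inverse CDF so only the standard library is needed.
import math
import random

def laplace_noise(scale, rng=random):
    """Sample from Laplace(0, scale) using inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_mean(values, epsilon, lower, upper, rng=random):
    """epsilon-differentially private mean of values clamped to [lower, upper].
    Sensitivity of the mean over n records is (upper - lower) / n."""
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    return sum(clamped) / n + laplace_noise((upper - lower) / (n * epsilon), rng)

bills = [198, 45, 146, 35, 30, 72, 79]  # illustrative monthly bills
print(private_mean(bills, epsilon=0.5, lower=0, upper=250))
```

Each run gives a different noisy answer; averaged over many runs the estimate is unbiased, and smaller ε widens the noise.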
Evaluating the utility of privacy techniques
Industry collaboration project: proof-of-concept implementation of selected privacy techniques (redaction/masking, anonymisation and obfuscation). Evaluate the solution for a set of analytics scenarios; quantify utility and privacy.

Approach
- Use cases (analytics)
- Data sets: original and after the PET (privacy enhancing technology)
- Privacy mechanisms
- Metrics:
  - Analytics: RMSE, MRE, lift, compared to the results using the original data
  - Privacy: uniqueness, entropy, differential privacy ε
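The analytics-utility metrics named above can be sketched directly: RMSE and mean relative error (MRE) between query results on the original data and on the data after the privacy mechanism. The numbers below are illustrative, not project results.

```python
# Sketch of the utility metrics: compare analytics outputs on original
# vs privacy-protected data.
import math

def rmse(original, private):
    """Root mean squared error between paired results."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(original, private)) / len(original))

def mre(original, private):
    """Mean relative error between paired results."""
    return sum(abs(o - p) / abs(o) for o, p in zip(original, private)) / len(original)

orig = [96.3, 120.0, 80.0]  # e.g. per-segment average bills on raw data
priv = [91.4, 123.0, 78.0]  # same query after the privacy mechanism
print(round(rmse(orig, priv), 2), round(mre(orig, priv), 3))
```

Lower values mean the privacy mechanism preserved more analytic utility; these are reported alongside the privacy metrics (uniqueness, entropy, ε) to quantify the trade-off.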