Privacy through Accountability: A Computer Science Perspective
Anupam Datta
Associate Professor, Computer Science, ECE, CyLab
Carnegie Mellon University
February 2014
Personal Information is Everywhere

Research Challenge: Programs and People
Ensure organizations respect privacy expectations in the collection, use, and disclosure of personal information

Web Privacy
Example privacy policies:
Not use detailed location (full IP address) for advertising
Not use race for advertising

Healthcare Privacy
[Diagram: patient information flows among Patient, Physician, Nurse, Hospital, Drug Company, and an Auditor]
Example privacy policies:
Use patient health info only for treatment, payment
Share patient health info with police if suspect crime
A Research Area
Formalize Privacy Policies
Precise semantics of privacy concepts (restrictions on personal information flow)
Enforce Privacy Policies
Audit and Accountability: detect violations, blame-assignment, adaptive audit resource allocation
Related ideas: Barth et al, Oakland 2006; May et al, CSFW 2006; Weitzner et al, CACM 2008; Lampson 2004

Today: Focus on Detection
Healthcare Privacy: a play in two acts
Web Privacy: a play in two (brief) acts
Example from HIPAA Privacy Rule
A covered entity may disclose an individual's protected health information (phi) to law-enforcement officials for the purpose of identifying an individual if the individual made a statement admitting participating in a violent crime that the covered entity believes may have caused serious physical harm to the victim
Concepts in privacy policies:
Black-and-white concepts:
Actions: send(p1, p2, m)
Roles: inrole(p2, law-enforcement)
Data attributes: attr_in(prescription, phi)
Temporal constraints: in-the-past(state(q, m))
Grey concepts:
Purposes: purp_in(u, id-criminal)
Beliefs: believes-crime-caused-serious-harm(p, q, m)
Detecting Privacy Violations
[Slide art: the Oracle, a character from The Matrix, "a program designed to investigate the human psyche"]
Computer-readable privacy policy: complete formalization of HIPAA Privacy Rule, GLBA
Automated audit of the organizational audit log detects policy violations for black-and-white policy concepts
Audit oracles to audit for grey policy concepts
Policy Auditing over Incomplete Logs
With D. Garg (CMU, now MPI-SWS) and L. Jia (CMU)
2011 ACM Conference on Computer and Communications Security
Key Challenge for Auditing: Audit Logs are Incomplete
Future: store only past and current events
Example: timely data breach notification refers to a future event
Subjective: no grey information
Example: may not record evidence for purposes and beliefs
Spatial: remote logs may be inaccessible
Example: logs distributed across different departments of a hospital

Abstract Model of Incomplete Logs
Model all incomplete logs uniformly as 3-valued structures
Define semantics (meanings of formulas) over 3-valued structures
reduce: The Iterative Algorithm
reduce(L, φ) = φ'
[Diagram: as logs grow over time, reduce is applied iteratively: φ0 → φ1 → φ2]
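The iterative idea can be sketched in a few lines of Python. This is an illustrative toy, not the paper's algorithm: a clause is a conjunction of atoms, a log maps atoms to True, False, or None (unknown, e.g. a future event or a grey predicate), and reduce returns a verdict or the residual atoms to re-check once the log has grown. The atom strings and log entries are hypothetical:

```python
# A minimal sketch of the iterative reduce idea (names hypothetical):
# a policy clause is a conjunction of atoms; an incomplete log maps
# atoms to True, False, or None (unknown).

def reduce(log, atoms):
    """Partially evaluate a conjunction of atoms over an incomplete log.

    Returns False if any atom is known false (this clause is violated),
    True if all atoms are known true, or the residual list of
    still-unknown atoms to be re-checked when the log is extended.
    """
    residual = []
    for atom in atoms:
        value = log.get(atom)          # three-valued lookup
        if value is False:
            return False
        if value is None:
            residual.append(atom)
    return True if not residual else residual

# Jan 1: the disclosure is recorded, but the purpose is not yet known.
log1 = {"send(upmc, police, M2)": True, "purp_in(u, id-criminal)": None}
clause = ["send(upmc, police, M2)", "purp_in(u, id-criminal)"]
step1 = reduce(log1, clause)           # residual: the purpose atom

# Later the log is extended and the residual formula is re-run.
log2 = dict(log1, **{"purp_in(u, id-criminal)": True})
step2 = reduce(log2, step1)            # now fully resolved
```

This mirrors the φ0 → φ1 → φ2 picture: each pass consumes what the current log knows and leaves a smaller residual policy for the next audit round.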
Syntax of Policy Logic
First-order logic with restricted quantification over infinite domains (challenge for reduce)
Can express timed temporal properties, grey predicates
Example from HIPAA Privacy Rule
A covered entity may disclose an individual's protected health information (phi) to law-enforcement officials for the purpose of identifying an individual if the individual made a statement admitting participating in a violent crime that the covered entity believes may have caused serious physical harm to the victim
∀ p1, p2, m, u, q, t.
(send(p1, p2, m) ∧ inrole(p2, law-enforcement) ∧ tagged(m, q, t, u) ∧ attr_in(t, phi))
⊃ purp_in(u, id-criminal)
∧ ∃ m'. (state(q, m') ∧ is-admission-of-crime(m') ∧ believes-crime-caused-serious-harm(p1, q, m'))
reduce: Formal Definition
General Theorem: if the initial policy passes a syntactic mode check, then finite substitutions can be computed
In a restricted quantifier, the guard c is a formula for which finite satisfying substitutions of x can be computed
Applications: the entire HIPAA and GLBA Privacy Rules pass this check
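To see why the mode check matters, here is a hedged toy illustration, not the paper's formal definition: a guard atom such as send(upmc, P2, M) has only finitely many satisfying substitutions, all of which can be read off the log, so the quantifier over the infinite domain of messages never needs to be enumerated. The predicate names and log facts below are invented for the example:

```python
# A toy sketch of computing finite satisfying substitutions for a
# guard atom from the log (illustrative only). Variables are written
# as uppercase strings; everything else is a constant.

def satisfying_substitutions(log, guard):
    """Return every substitution of the guard's variables that makes
    the guard atom true in the log."""
    pred, args = guard
    subs = []
    for fact_pred, fact_args in log:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        sub = {}
        ok = True
        for pattern, value in zip(args, fact_args):
            if pattern.isupper():                    # a variable
                if sub.setdefault(pattern, value) != value:
                    ok = False                       # repeated var mismatch
                    break
            elif pattern != value:                   # constant must match
                ok = False
                break
        if ok:
            subs.append(sub)
    return subs

log = [("send", ("upmc", "police", "m2")),
       ("send", ("upmc", "drugco", "m3")),
       ("state", ("bob", "m1"))]
# For ∀ P2, M. send(upmc, P2, M) ⊃ ..., the guard's instances all
# come from the finite log, never from the infinite message domain.
subs = satisfying_substitutions(log, ("send", ("upmc", "P2", "M")))
```

Only guards with this property pass the mode check; the body of the quantifier is then checked under each of the finitely many substitutions.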
Example
φ = ∀ p1, p2, m, u, q, t.
(send(p1, p2, m) ∧ tagged(m, q, t, u) ∧ attr_in(t, phi) ∧ inrole(p2, law-enforcement))
⊃ purp_in(u, id-criminal)
∧ ∃ m'. (state(q, m') ∧ is-admission-of-crime(m') ∧ believes-crime-caused-serious-harm(p1, m'))
Substitutions: {p1 ↦ UPMC, p2 ↦ allegheny-police, m ↦ M2, q ↦ Bob, u ↦ id-bank-robber, t ↦ date-of-treatment}; {m' ↦ M1}
Log:
Jan 1, 2011: state(Bob, M1)
Jan 5, 2011: send(UPMC, allegheny-police, M2); tagged(M2, Bob, date-of-treatment, id-bank-robber)
φ' = ⊤ ∧ purp_in(id-bank-robber, id-criminal) ∧ is-admission-of-crime(M1) ∧ believes-crime-caused-serious-harm(UPMC, M1)
Implementation and Case Study
Implementation and evaluation over simulated audit logs for compliance with all 84 disclosure-related clauses of the HIPAA Privacy Rule
Performance: average time for checking compliance of each disclosure of protected health information is 0.12 s for a 15 MB log
Mechanical enforcement: reduce can automatically check 80% of all the atomic predicates

Ongoing Transition Efforts
Integration of the reduce algorithm into an Illinois Health Information Exchange prototype (joint work with UIUC and Illinois HLN): auditing logs for policy compliance
Ongoing conversations with Symantec Research
Related Work
Distinguishing characteristics:
1. General treatment of incompleteness in audit logs
2. Quantification over infinite domains (e.g., messages)
3. First complete formalization of HIPAA Privacy Rule and GLBA
Nearest neighbors:
Basin et al 2010 (missing 1, weaker 2, cannot handle 3)
Lam et al 2010 (missing 1, weaker 2, cannot handle entire 3)
Weitzner et al (missing 1, cannot handle 3)
Barth et al 2006 (missing 1, weaker 2, did not do 3)
Formalizing and Enforcing Purpose Restrictions
With M. C. Tschantz (CMU, now UC Berkeley) and J. M. Wing (CMU, now MSR)
2012 IEEE Symposium on Security & Privacy
Goal
Give a semantics for "Not for" and "Only for" purpose restrictions that is parametric in the purpose
Provide an audit algorithm for detecting violations under that semantics
Med records used only for diagnosis
[Diagram: from "X-ray taken", the record can be sent immediately (no diagnosis if it goes to a drug company), or the x-ray can be added to the medical record ("X-ray added") and the record then sent, yielding a diagnosis by a specialist]
[Same diagram, annotated: sending before adding the x-ray does not achieve the purpose; adding the x-ray and then sending achieves it]
[Same diagram, with probabilities: at the choice point, sending to the specialist yields a diagnosis with probability 3/4 and no diagnosis with probability 1/4; even the best choice can fail (no diagnosis by drug co. or specialist)]
Planning
Thesis: an action is for a purpose iff that action is part of a plan for furthering the purpose, i.e., always makes the best choice for furthering the purpose

Auditing
Inputs: purpose restriction, auditee's behavior, decision-making model
Outcomes: Obeyed, Inconclusive, Violated
Record only for treatment
[Workflow: an MDP solver computes the optimal actions for each state; the audit asks whether the observed actions (e.g., [, send record]) were optimal; if not, the policy implication is Violated]
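This check can be sketched with a toy MDP in Python. The model, state names, and reward numbers are hypothetical and only loosely follow the x-ray example: value iteration computes state values, the optimal actions per state are extracted, and an observed action that is not optimal flags a violation:

```python
# A toy sketch of the purpose-restriction audit (model hypothetical):
# solve an MDP for the purpose "diagnosis" by value iteration, then
# check whether each observed action was optimal in its state.

def q_value(s, a, trans, reward, V, gamma):
    return sum(p * (reward(s, a, s2) + gamma * V[s2])
               for s2, p in trans[(s, a)].items())

def value_iteration(states, actions, trans, reward, gamma=0.9, iters=200):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(q_value(s, a, trans, reward, V, gamma)
                    for a in actions[s]) for s in states}
    return V

def optimal_actions(s, actions, trans, reward, V, gamma=0.9):
    qs = {a: q_value(s, a, trans, reward, V, gamma) for a in actions[s]}
    best = max(qs.values())
    return {a for a, q in qs.items() if abs(q - best) < 1e-6}

states = ["no_xray", "xray_added", "done"]
actions = {"no_xray": ["add_xray", "send_record"],
           "xray_added": ["send_record"],
           "done": ["stop"]}
trans = {("no_xray", "add_xray"): {"xray_added": 1.0},
         ("no_xray", "send_record"): {"done": 1.0},
         ("xray_added", "send_record"): {"done": 1.0},
         ("done", "stop"): {"done": 1.0}}

def reward(s, a, s2):
    # A diagnosis is only likely if the x-ray was added first.
    return 0.75 if (a == "send_record" and s == "xray_added") else 0.0

V = value_iteration(states, actions, trans, reward)
# Observed behavior: the record was sent without adding the x-ray,
# which is not part of any optimal plan for the diagnosis purpose.
violated = "send_record" not in optimal_actions("no_xray", actions, trans, reward, V)
```

In this toy model the only optimal action from "no_xray" is "add_xray", so the observed "send_record" is flagged as a violation, matching the slide's verdict.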
Summary: A Sense of Purpose
Thesis: an action is for a purpose iff that action is part of a plan for furthering the purpose, i.e., always makes the best choice for furthering the purpose
Audit algorithm detects policy violations by checking whether the observed behavior could have been produced by an optimal plan

Today: Focus on Detection
Healthcare Privacy: a play in two acts
Web Privacy: a play in two (brief) acts
Bootstrapping Privacy Compliance in a Big Data System
With S. Sen (CMU) and S. Guha, S. Rajamani, J. Tsai, J. M. Wing (MSR)
2014 IEEE Symposium on Security & Privacy

Privacy Compliance for Bing
Setting: auditor has access to source code
Two Central Challenges
[Workflow: Legal Team crafts policy → Privacy Champion interprets policy → Developer writes code → Audit Team verifies compliance, with meetings at each hand-off]
1. Ambiguous privacy policy: meaning unclear
2. Huge undocumented codebases & datasets: connection to policy unclear
1. Legalease
Example:
DENY Datatype IPAddress
USE FOR PURPOSE Advertising
EXCEPT
ALLOW Datatype IPAddress:Truncated
Clean syntax: layered allow-deny information flow rules with exceptions
Precise semantics: no ambiguity
Focus on usability: user study of Legalease with Microsoft privacy champions is promising
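A minimal sketch of how layered DENY/ALLOW rules with exceptions might be evaluated, under a simplified reading of the example clause (the encoding and attribute names are ours, not the Legalease implementation): the decision of the most deeply nested matching clause wins:

```python
# A toy evaluator for layered allow-deny rules with exceptions
# (a simplified reading of Legalease; encoding is ours).
# A policy is a list of clauses (decision, attrs, exceptions); a
# clause matches a flow if all its attributes appear in the flow.

def decide(policy, flow, default="ALLOW"):
    """Return the decision of the most deeply nested matching clause."""
    for decision, attrs, exceptions in policy:
        if attrs <= flow:
            # The clause matches; an exception deeper in may flip it.
            return decide(exceptions, flow, default=decision)
    return default

# DENY Datatype IPAddress USE FOR PURPOSE Advertising,
# EXCEPT ALLOW Datatype IPAddress:Truncated
policy = [
    ("DENY", {"IPAddress", "Advertising"}, [
        ("ALLOW", {"IPAddress:Truncated"}, []),
    ]),
]

full = decide(policy, {"IPAddress", "Advertising"})
truncated = decide(policy, {"IPAddress", "IPAddress:Truncated", "Advertising"})
```

Using the full IP address for advertising is denied, while the truncated form falls into the exception and is allowed, matching the example clause.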
2. Grok: Data Inventory
Annotate code + data with policy data types
Source labels propagated via data flow graph
Different noisy sources: variable name analysis, developer annotations
[Data-flow graph: datasets (Name A, Age B, Hash C, IPAddress D, Timestamp E, Hash F, Country H, IDX G/I/J) flow through processes (NewAcct 1, Login 2, CheckHijack 3, GeoIP 4, CheckFraud 5, Reporting 6)]
2. Grok: Example Policy Violation
[Graph fragment: IPAddress in Dataset D flows through GeoIP (4) and CheckFraud (5) into the Reporting process (6)]
IPAddress is used for reporting (advertising)

2. Grok: Example Fix
IPAddress is truncated before it is passed to the reporting (advertising) job
[Graph fragment: a Truncate step is inserted on the flow into the Reporting process (6)]
Bootstrapping Works
Pick the x% most frequently appearing column names and label them
Then propagate labels using the Grok flow graph
Pick the nodes that will label the largest share of the graph
A small number of annotations is enough to get off the ground: ~200 annotations label 60% of nodes
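The bootstrapping step can be sketched as a fixed-point label propagation over the data-flow graph. The graph below is a hypothetical fragment modeled on the earlier Grok figure, not the production system:

```python
# A sketch of label bootstrapping: annotate a few high-frequency
# source datasets, then propagate labels along data-flow edges until
# a fixed point is reached (graph and labels hypothetical).
from collections import defaultdict, deque

def propagate(edges, seed_labels):
    """Flow every seed label forward through the data-flow graph."""
    labels = defaultdict(set)
    for node, ls in seed_labels.items():
        labels[node] |= set(ls)
    queue = deque(seed_labels)
    while queue:
        node = queue.popleft()
        for succ in edges.get(node, []):
            if not labels[node] <= labels[succ]:
                labels[succ] |= labels[node]   # new labels reached succ
                queue.append(succ)
    return dict(labels)

# Dataset D (IPAddress) flows through GeoIP into dataset G, and from
# there into the Reporting job, as in the violation example above.
edges = {"D": ["GeoIP"], "GeoIP": ["G"], "G": ["Reporting"]}
labels = propagate(edges, {"D": {"IPAddress"}})
```

Checking the propagated labels against the Legalease policy then flags the IPAddress label reaching the Reporting (advertising) node as a potential violation.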
Scale
77,000 jobs run each day, by 7,000 entities, in 300 functional groups
1.1 million unique lines of code; 21% changes on average, daily
46 million table schemas; 32 million files
Manual audit infeasible; information flow analysis takes ~30 mins
A Streamlined Audit Workflow
[Workflow: the Legal Team crafts the policy and encodes it in Legalease, a formal policy specification language; the Privacy Champion interprets and refines it; Developers write and annotate code; Grok builds a data inventory with policy datatypes from code analysis and developer annotations; a checker combines the Legalease policy with the Grok inventory to report potential violations; the Audit Team verifies compliance and developers fix code]
Information Flow Experiments
With Michael Carl Tschantz (CMU, now UC Berkeley), Amit Datta (CMU), and Jeannette M. Wing (CMU, now Microsoft Research)
Web Tracking
[Diagram: a user interacts with Google, advertisers, and websites and receives ads; confounding inputs include search terms and other users]

Experimental Design
[Diagram: a scientist gives the experimental group a drug and the control group a placebo]
Information Flow Experiment
[Diagram: Group 1 browsers supply a "Black" input and see "Arrested?" ads; Group 2 supply a "White" input and see "Looking for?" ads]

[Diagram: many browser instances per group interact with Google, each showing the same contrast between "Arrested?" and "Looking for?" ads]
Information Flow Experiments as Science
Experimental science ↔ information flow:
Natural process ↔ system in question
Population of units ↔ subset of interactions
Theorem: causation = information flow
Browser Instances are Not Independent
[Bar chart of ad counts per browser instance: 17, 13, 13, 13, 12, 11, 10, 10, 8, 7]
Our Idea
Use a non-parametric test: does not require a model of Google
Specifically, a permutation test: does not require independence among browser instances
Visiting Car Websites Impacts Ads
[Bar chart of car-ad counts: experimental group 30, 30, 31, 19, 22; control group 0, 0, 2, 5, 6]
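Using the counts from this slide, a permutation test can be sketched in a few lines (an illustrative implementation, not the authors' tool): under the null hypothesis that visiting car websites has no effect, group labels are exchangeable, so the observed difference in mean ad counts is compared against random label shuffles:

```python
# A sketch of a two-sample permutation test on the difference of
# means (illustrative; the authors' analysis may differ in detail).
import random

def permutation_test(treated, control, trials=10000, seed=0):
    """One-sided p-value: how often does a random relabeling produce
    a mean difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = treated + control
    n = len(treated)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)            # relabel under the null
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n)
        if diff >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)   # add-one smoothing

# Car-ad counts from the slide: experimental vs. control instances.
treated = [30, 30, 31, 19, 22]
control = [0, 0, 2, 5, 6]
p = permutation_test(treated, control)
```

With these counts the observed difference is extreme among relabelings (roughly 1 in 252 exact splits reproduce it), so the test rejects the null at conventional significance levels without assuming any model of Google or independence-based parametric distribution.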
Conclusion
A rigorous methodology for information flow experiments
Connection to causality in natural sciences
Experimental design for causal determination
Significance testing with non-parametric statistics
Future work:
Replicate and analyze previous experiments systematically (Guha et al; Wills and Tatar; Sweeney)
Conduct new large-scale experiments systematically
Tool support for automating information flow experiments

A Research Area
Formalize Privacy Policies
Precise semantics of privacy concepts (restrictions on personal information flow)
Enforce Privacy Policies
Audit and Accountability: detect violations, blame-assignment, adaptive audit resource allocation
Application Domains: healthcare, web privacy
Information Flow Analysis
Access to program? Yes: white box; No: black box
Control over inputs? Total: testing; Partial: experimenting; None: monitoring
Google Exhibits Complex Behavior
[Scatter plot: ad id (0-45) vs. reload number (0-200)]
Privacy as Contextual Integrity
Context-relative information flow norms
Example contexts: healthcare, friendship
Example norms: confidentiality, purpose, reciprocity
[Nissenbaum 2004; Barth-D-Mitchell-Nissenbaum 2006]

Norms to Policies
Privacy Norms → Privacy Policies
Example norm: confidentiality expectations in healthcare
Associated policy: clauses in the HIPAA Privacy Rule
Does policy reflect norm? Is policy respected? (Our focus)