Big Data Analytics for Mitigating Insider Risks in Electronic Medical Records
Bradley Malin, Ph.D.
Associate Prof. & Vice Chair of Biomedical Informatics, School of Medicine
Associate Prof. of Computer Science, School of Engineering
Vanderbilt University
21/8/2015
January 1, 2015: Logged over 2,000,000 user interactions
[Figure: Alice's Electronic Medical Record]
January 2, 2015: Logged over 2,000,000 user interactions
January 3, 2015: Logged over 2,000,000 user interactions
Auditing Requirements: Federal (US)
1. Access control
2. Track & audit employee accesses
3. Store logs for 6 years
How (Not) to Use Access Control
Central Norway Health Region enabled "break the glass"
1/2 of 99,000 patients had the glass broken on their records
1/2 of 12,000 users broke glass

Role               Users   Broke Glass
Nurse              5633    36%
Doctor             2927    52%
Health Secretary   1876    52%
Physiotherapist    382     56%
Psychologist       194     58%

~300K events in 1 month (Røstad & Øystein 2007)
Oct 2007: Palisades Medical Center, dozens of employees
July 8, 2011: UCLA, HHS investigation, $1 million fine
The Model is Wrong
Learning Suspicious EMR Access Behavior (Boxwala et al, JAMIA, 2011)
Manually select 505 potential cases / controls based on previous breaches at Partners Healthcare
Models: Support Vector Machines, Logistic Regression
Iterative cycle:
SELECT: New unlabeled events from DB
LABEL: Human experts label cases as + / -
BUILD: Build classifier from labeled events
PREDICT: Calculate the prediction probabilities using classifier on all events
Learning Suspicious EMR Access Behavior (Boxwala et al, JAMIA, 2011)

Feature                        Coefficient   Odds Ratio
Works in the same department      3.16         23.5
Same street address               2.60         13.45
Same family name                  2.34         10.38
Over 200 accesses in a day        1.30          3.70
VIP Patient                       1.18          3.23
Same Zip Code                    -1.46          0.23
Is Provider                      -2.33          0.10
Care unit visit match            -3.40          0.03
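In a logistic regression, each odds ratio is the exponential of its coefficient, so an odds ratio below 1 corresponds to a negative coefficient. The sketch below recomputes the odds ratios from the table's coefficients. The feature names and numbers come from the table; the negative signs on the last three coefficients are an assumption inferred from their reported odds ratios, and the names `ROWS` and `odds_ratio` are illustrative.

```python
import math

# (feature, coefficient, reported odds ratio) taken from the table above.
# Assumption: coefficients paired with an odds ratio below 1 are negative,
# because odds ratio = exp(coefficient).
ROWS = [
    ("Works in the same department", 3.16, 23.5),
    ("Same street address", 2.60, 13.45),
    ("Same family name", 2.34, 10.38),
    ("Over 200 accesses in a day", 1.30, 3.70),
    ("VIP Patient", 1.18, 3.23),
    ("Same Zip Code", -1.46, 0.23),
    ("Is Provider", -2.33, 0.10),
    ("Care unit visit match", -3.40, 0.03),
]


def odds_ratio(coefficient: float) -> float:
    """Convert a logistic regression coefficient into an odds ratio."""
    return math.exp(coefficient)


for name, coef, reported in ROWS:
    # Print the recomputed value next to the reported one for comparison.
    print(f"{name:30s} coef={coef:+.2f} exp(coef)={odds_ratio(coef):.2f} reported={reported}")
```

Read this way, features with odds ratios above 1 raise the odds that an access is flagged as suspicious, while features such as a care unit visit match lower them.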
Role Refinement
Northwestern Memorial Hospital
3 months of access logs to inpatient records
8K users, 16K patients, 1.1M accesses
Logged fields: User ID, Patient ID, User position, Date / Time, Number of Orders Entered, Patient Location in hospital, Service patient is on
(Zhang, Gunter, Liebovitz, Tian, & Malin AMIA 2011)
Predictability is Job Dependent

MOST PREDICTABLE
Rank      Role                                  Accuracy   Users
1 (tie)   ED Assistant                          100%       26
1 (tie)   ED Physician CPOE                     100%       43
1 (tie)   NMH Resident/Fellow ID Clinic-CPOE    100%       10

LEAST PREDICTABLE
Rank      Role                                  Accuracy   Users
140       Patient Care Staff Nurse              7.6%       1554
139       Rehab Occupational Therapist (OT)     14.3%      28
136       Patient Care Staff Nurse (Pilot)      22.1%      217
Where are We Going Wrong?

Actual Role                    Predicted Role                 Probability
Rehab Occupational Therapist   Rehab Physical Therapist       85.7%
Rehab Physical Therapist       Rehab Occupational Therapist   60.0%
Suspicious?
Suspicious or Anomalous?
Defining Access Control; Detecting Suspicious Behavior
January 1: EMR users are linked if they accessed at least 1 patient in common (Malin, Nyemba, Paulett 2011)
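The linking rule above can be sketched as a small graph construction: group users by the patient they accessed, then connect every pair of users who share a patient. The function name `build_user_network`, the `(user, patient)` tuple layout, and the sample log are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from itertools import combinations


def build_user_network(accesses):
    """Return {user: set of neighbors}, linking users who share >= 1 patient."""
    # Index the log by patient so co-access is easy to find.
    users_by_patient = defaultdict(set)
    for user, patient in accesses:
        users_by_patient[patient].add(user)

    # Connect every pair of users that appears under the same patient.
    network = defaultdict(set)
    for users in users_by_patient.values():
        for u, v in combinations(sorted(users), 2):
            network[u].add(v)
            network[v].add(u)
    return dict(network)


# Hypothetical one-day access log of (user, patient) pairs.
log = [
    ("nurse_a", "p1"),
    ("dr_b", "p1"),
    ("dr_b", "p2"),
    ("clerk_c", "p2"),
    ("tech_d", "p3"),
]
network = build_user_network(log)
print(network)
```

In this sketch a user who shares no patient with anyone (here `tech_d`) does not appear in the network at all, which is one simple way to surface isolated users.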
Mining to Model the System (Malin, Nyemba, Paulett 2011)
[Figure: axes are 1st Principal Component and 2nd Principal Component; labeled groups are University Hospital and Children's Hospital]
Hypothesis!
Collaborative systems are about social phenomena
People should form communities
We should be able to measure deviation from community structure
Note: other social phenomena could be studied (temporal workflow*, function invoked* if any, etc.)
(*Chen et al, IEEE TDSC 2012; Zhang et al. ACM SACMAT 2013; Zhang et al. ACM TMIS 2013)
Community Based Anomaly Detection (CADS)
[Figure: CADS pipeline. Stages: Pattern Extraction, Anomaly Detection. Components: Access Logs, Social Relation Construction, User Communities, Distance Measurement, Deviation Measurement, Community Deviation, User Specific Deviation Scores]
(Chen & Malin ACM CODASPY 2011)
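To make the idea of a user specific deviation score concrete, here is a deliberately simplified stand-in, not the published CADS algorithm: each user is compared with the k users whose accessed-patient sets are most similar, and the mean Jaccard distance to those neighbors serves as the score. The function names, the choice of Jaccard distance, and the sample data are all assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_deviation_scores(patients_by_user, k=2):
    """Score each user by the mean distance to its k most similar users."""
    scores = {}
    for user, patients in patients_by_user.items():
        distances = sorted(
            jaccard_distance(patients, other_patients)
            for other, other_patients in patients_by_user.items()
            if other != user
        )
        nearest = distances[:k]
        scores[user] = sum(nearest) / len(nearest) if nearest else 0.0
    return scores


# Hypothetical mapping from user to the set of patients they accessed.
access = {
    "nurse_a": {"p1", "p2", "p3"},
    "nurse_b": {"p1", "p2", "p3"},
    "dr_c": {"p2", "p3", "p4"},
    "outsider": {"p9"},
}
scores = knn_deviation_scores(access, k=2)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

A user with no overlap with anyone else gets the maximum score of 1.0, while users embedded in a group of colleagues score close to 0.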
Example 6 Nearest Neighbor Network (1 day of accesses)
The average clustering coefficient for this network is 0.48, which is significantly larger than the 0.001 expected for random networks
Users exhibit collaborative behavior in the health information system
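For reference, the average clustering coefficient can be computed as follows: for each node, take the fraction of pairs of its neighbors that are themselves linked, then average over all nodes. This is a minimal sketch on a hypothetical four-node network; the adjacency-dictionary layout and function names are assumptions, and nodes with fewer than two neighbors are given a coefficient of 0.

```python
def local_clustering(network, node):
    """Fraction of a node's neighbor pairs that are directly linked."""
    neighbors = network[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # Convention: no neighbor pairs means coefficient 0.
    links = sum(
        1
        for u in neighbors
        for v in neighbors
        if u < v and v in network[u]
    )
    return 2.0 * links / (k * (k - 1))


def average_clustering(network):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(network, n) for n in network) / len(network)


# Hypothetical undirected network as {node: set of neighbors}.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(round(average_clustering(toy), 3))
```

A value near 0.5 on a real access network, against roughly 0.001 for a random network of the same size, is what supports the claim that users form communities.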
Auditing Strategies of the Past
Principal Components Analysis (PCA): Graph based anomaly detection (Shyu et al 2003). How similar am I to spectral clusters of users?
K Nearest Neighbor (KNN): Nearest neighbor based anomaly detection (Liao et al 2002). How similar am I to my friends?
High Volume Model (Gallagher et al 1998): Do I access way more people than my relations?
Social Structure Wins the Day!
[Figure: True Positive Rate vs. False Positive Rate]
Gripes & Future Musings
Different providers within the same ward have different behavior!
Different wards within the same healthcare institution have different behavior!
Different healthcare organizations use different languages!
Logic (i.e., access control) and AI (i.e., data mining) need to play nicely together
Questions?
b.malin@vanderbilt.edu
Health Information Privacy Laboratory
http://www.hiplab.org/
High Confidence Rules

Rule                                                               Support    Confidence   Weeks
Center for Patient & Professional Advocacy -> Hearing & Speech     0.000581   0.860        18
Practice City A -> Clinic City A                                   0.000193   0.673        21
Infectious Disease Clinic -> Infectious Disease                    0.000206   0.637        21
NICU -> Neonatology                                                0.000613   0.629        17
VMG Family Practice -> Clinic City A                               0.00132    0.628        21
Vanderbilt Hearing School -> Hearing & Speech                      0.00142    0.619        22
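Support and confidence for a rule between two organizational areas can be computed as in the sketch below. It uses the standard association rule definitions and assumes that each patient record is one transaction holding the set of areas that accessed it; that framing, the function name `rule_metrics`, and the sample week are assumptions, not details confirmed by the slides.

```python
def rule_metrics(areas_by_patient, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of patients accessed by both areas
    confidence = share of the antecedent's patients also accessed by the consequent
    """
    total = len(areas_by_patient)
    with_antecedent = sum(
        1 for areas in areas_by_patient.values() if antecedent in areas
    )
    with_both = sum(
        1
        for areas in areas_by_patient.values()
        if antecedent in areas and consequent in areas
    )
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical week of accesses: patient -> areas that opened the record.
week = {
    "p1": {"NICU", "Neonatology"},
    "p2": {"NICU", "Neonatology"},
    "p3": {"NICU"},
    "p4": {"Anesthesiology"},
    "p5": {"Neonatology"},
}
support, confidence = rule_metrics(week, "NICU", "Neonatology")
print(f"support={support:.3f} confidence={confidence:.3f}")
```

Support stays small whenever an area touches only a sliver of all patients, so confidence is the more useful signal for judging how expected a co-access is.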
Low Confidence Rules (but occur in at least 3 weeks)

Rule                                                     Support     Confidence   Weeks
Anesthesiology -> Vanderbilt Hearing School              0.0000522   0.000581     6
Anesthesiology -> 4N Labor & Delivery                    0.0000526   0.000577     6
Anesthesiology -> Physician Liaison Program              0.0000565   0.000574     4
Emergency Medicine -> Nutrition Clinic                   0.0000454   0.000572     4
Anesthesiology -> Cardiac Cath Lab                       0.0000590   0.000565     3
Emergency Medicine -> Diabetes Ctr                       0.0000458   0.000558     4
Anesthesiology -> Center for Clinical/Research Ethics    0.0000459   0.000528     7
Anesthesiology -> Infectious Disease Clinic              0.0000454   0.000527     4
Anesthesiology -> Pediatric Immunology                   0.0000458   0.000514     4
Anesthesiology -> Mental Health Center                   0.0000453   0.000514     4
Big Data Audits Must Be Understandable to be Actionable
What Makes Sense?
Dr. Smith's access of Peggy Johnson's medical record was strange
Dr. Smith's access was 10 standard deviations away from normal behavior in his hospital
Dr. Smith's access was strange because he is a neonatologist and he accessed the record of a 100-year-old woman who, for the past year, has only been treated by gerontologists
So Do You Believe Inferred Patterns?
Hypothesis: Locally Knowledgeable of Class
Employee groups: Anesthesiologists, Psychiatrists, Coding & Charge Entry, Medical Information Services

Rule set       High   Medium   Low
Ane. Rules     10     10       10
Psych. Rules   10     10       10
Code Rules     10     10       10
MIS Rules      10     10       10
Survey
Employees were presented with questions and asked to report the likelihood of rules on a 5 point Likert scale
All employees were asked the same set of 120 questions (four sets of 30)
Sample question: Someone from Anesthesiology accessed the record of patient John Doe. How likely is it that someone from the following organizational area accessed the same patient's record?
Anesthesiology: Not at all / Slightly / Moderately / Very / Completely
Psychiatry: Not at all / Slightly / Moderately / Very / Completely
Hypothesis: Locally Knowledgeable of Class
[Figure: Anesthesiologists evaluating Ane. Rules (High, Medium, Low)]
Employees can distinguish between high, medium, and low confidence rules for their own area
Anesthesiologists evaluated with anesthesiology rules
Tested hypothesis with a linear mixed effects (LME) model
Hypothesis: Locally Knowledgeable of Class
Confirmed for every organizational area at 95% confidence level!

Area   Strength   p value
ANE    0.75       0.007
CODE   0.44       0.011
MIS    0.32       0.037
PSY    0.82       0.020
Predictability is Job Dependent
[Figure: Prediction Accuracy (0% to 100%) vs. Number of Users in Role (0 to 2000); labeled roles: Med Student CPOE, NMH Resident / Fellow CPOE, Patient Care Staff Nurse, Rehab OT]
Another Healthcare Environment
Vanderbilt EMR Logs, 6 months
Arbitrary week: 2,500 users, 35,000 patients, 66,000 distinct <user, patient> accesses