Big Data Analytics for Healthcare



Similar documents
Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Exploration and Visualization of Post-Market Data

Predictive Care Models to Improve Outcomes Brendan Fowkes Sr. Healthcare Solution Executive May 14, 2013

Big Data Analytics and Healthcare

Electronic Medical Record Mining. Prafulla Dawadi School of Electrical Engineering and Computer Science

Smarter Research. Joseph M. Jasinski, Ph.D. Distinguished Engineer IBM Research

Big Data Analytics Opportunities and Challenges

Introduction to Data Mining

Social Media Mining. Data Mining Essentials

Investigating Clinical Care Pathways Correlated with Outcomes

An EVIDENCE-ENHANCED HEALTHCARE ECOSYSTEM for Cancer: I/T perspectives

Putting IBM Watson to Work In Healthcare

The Data Mining Process

Data Mining - Evaluation of Classifiers

Data Mining. Nonlinear Classification

Uncovering Value in Healthcare Data with Cognitive Analytics. Christine Livingston, Perficient Ken Dugan, IBM

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Azure Machine Learning, SQL Data Mining and R

DICON: Visual Cluster Analysis in Support of Clinical Decision Intelligence

Healthcare Big Data Exploration in Real-Time

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Knowledge Discovery and Data Mining

Big Data Text Mining and Visualization. Anton Heijs

Supervised Learning (Big Data Analytics)

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Data Mining in CRM & Direct Marketing. Jun Du The University of Western Ontario jdu43@uwo.ca

Advanced In-Database Analytics

Chapter 6. The stacking ensemble approach

The Scientific Data Mining Process

Find the signal in the noise

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining Part 5. Prediction

Big Data Mining Services and Knowledge Discovery Applications on Clouds

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Data Mining Analytics for Business Intelligence and Decision Support

Role of Social Networking in Marketing using Data Mining

Knowledge Discovery from patents using KMX Text Analytics

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

CloudAtlas: Cloud-Based Predictive Modeling Platform for Healthcare Research

Divide-n-Discover Discretization based Data Exploration Framework for Healthcare Analytics

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Using EHRs for Heart Failure Therapy Recommendation Using Multidimensional Patient Similarity Analytics

MACHINE LEARNING BASICS WITH R

2015 Workshops for Professors

Machine Learning with MATLAB David Willingham Application Engineer

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

How To Perform An Ensemble Analysis

DATA MINING TECHNIQUES AND APPLICATIONS

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

IJCSES Vol.7 No.4 October 2013 pp Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS

Big Data Analytics for Mitigating Insider Risks in Electronic Medical Records

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

Maximizing Return and Minimizing Cost with the Decision Management Systems

Statistical Machine Learning

Using multiple models: Bagging, Boosting, Ensembles, Forests

Tax Fraud in Increasing

Travis Goodwin & Sanda Harabagiu

Big Data 101: Harvest Real Value & Avoid Hollow Hype

Healthcare data analytics. Da-Wei Wang Institute of Information Science

Final Project Report

An Overview of Knowledge Discovery Database and Data mining Techniques

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Protein Protein Interaction Networks

WHITE PAPER. Caradigm Healthcare Analytics. Healthcare Analytics

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

A.I. in health informatics lecture 1 introduction & stuff kevin small & byron wallace

Copyright This report and/or appended material may not be partly or completely published or

Performance Metrics for Graph Mining Tasks

Data Mining Algorithms Part 1. Dejan Sarka

Introduction. A. Bellaachia Page: 1

Using Data Mining for Mobile Communication Clustering and Characterization

Sanjeev Kumar. contribute

Learning from Diversity

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Maschinelles Lernen mit MATLAB

Introduction to Data Mining

Hexaware E-book on Predictive Analytics

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

Statistics for BIG data

KnowledgeSEEKER Marketing Edition

Knowledge Discovery and Data Mining

Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis

WHITE PAPER. QualityAnalytics. Bridging Clinical Documentation and Quality of Care

Transcription:

Big Data Analytics for Healthcare Jimeng Sun Chandan K. Reddy Healthcare Analytics Department IBM TJ Watson Research Center Department of Computer Science Wayne State University 1

Healthcare Analytics using Electronic Health Records (EHR) Old way: Data are expensive and small Input data are from clinical trials, which is small and costly Modeling effort is small since the data is limited A single model can still take months EHR era: Data are cheap and large Broader patient population Noisy data Heterogeneous data Diverse scale Complex use cases 2

Heterogeneous Medical Data Diagnosis Genetic data Medication Images Lab Clinical notes 3

Challenges of Healthcare Analytics Analytics Scalability Challenges inchallenges Healthcare Collaboration across domains Analytic platform Intuitive results Scalable computation 4

PARALLEL MODEL BUILDING 5

Motivation Predictive modeling using EHR is growing Explosion in interest Need for scalable predictive modeling platforms/systems due to increased computational requirements from: Processing EHR data (due to volume, variability, and heterogeneity) Building accurate models Building clinically meaningful models Validating models for accuracy and generalizability 6

What does it take to develop a predictive model using EHR? Marina: IBM Analytics Consultant 5 4 Within 3 months, we need to 1. understand business case 2. obtain the data 3. prepare the data 4. develop predictive models 5. deliver the final model 1 2 3 David Gotz, Harry Starvropoulos, Jimeng Sun, Fei Wang. ICDA: A Platform for Intelligent Care Delivery Analytics, AMIA 2012 7

A Generalized Predictive Modeling Pipeline Model specification Cohort Construction: Find an appropriate set of patients with the specified target condition and a corresponding set of control patients without the condition. Feature Construction: Compute a feature vector representation for each patient based on the patient s EHR data. Cross Validation: Partition the data into complementary subsets for use in model training and validation testing. Feature Selection: Rank the input features and select a subset of relevant features for use in the model. Classification: The training and evaluation of a model for a specific classifier. Output: Clean up intermediate files and to put results into their final locations. 8

Cohort Construction D1 All patients Cases D2 D3 Controls D1 D2 D3 Disease Target Hypertension control Heart failure onset Hypertension diagnosis samples 5000 33K 300K 9

Feature Construction Observation Window Prediction Window Index date Diagnosis date We define Diagnosis date and index date Prediction and observation windows Features are constructed from the observation window and predict HF onset after the prediction window 10

11

13

14

15

PARAMO: A Parallel Predictive Modeling Platform Pipeline Specifications Dependency Graph Generator Remove Redundancies Identify Dependencies Dependency Graph Dependency Graph Execution Engine Prioritization Scheduling Parallel Execution Results Parallelization Infrastructure 16

An Example Set of Pipeline Specifications Cohort construction: One patient data set Feature construction: Feature Type Diagnoses Medications Symptoms Labs Aggregation Count Count Count Mean Cross-validation: 2-fold cross-validation Feature selection: Information Gain, Fisher Score Classification: Naïve Bayes, Logistic Regression, Random Forest 17

Dependency Graph Generator Pipeline Specifications Dependency Graph Generator Remove Redundancies Identify Dependencies Dependency Graph Input: Pipeline specifications Output: Dependency graph Function: Remove Redundancies Identify Dependencies Encode parallel jobs 18

Dependency Graph 19

Dependency Graph Execution Engine Dependency Graph Hadoop Cluster Nimble Execution Engine Prioritization Scheduling Parallel Execution Parallelization Infrastructure Dependency Graph Multi-Process Single Server Results Input: Dependency graph Output: Results (models, scores, etc.) Function: Schedules tasks in a topological ordering of the graph Prioritizes pending tasks using information from already completed tasks Executes tasks in parallel via the parallelization infrastructure 20

Dependency Graph Runtime Analysis (Hadoop, 20 Concurrent Tasks) 21

Experimental Data Sets Data Set Years Number of Data of Patients Number of Number of Features Records Number of Cases Number of Target Controls Condition Small 3 4,758 25932 3,312,558 615 949 Hypertension Control Medium 10 32,675 46117 24,719,809 4644 28031 Heart Failure Onset Large 4 319,650 49269 33,531,311 16385 164743 Hypertension Onset 22

Experimental Pipeline Specifications Cohort construction: One patient data set: Small, Medium, or Large Feature construction: Feature Type Diagnoses Medications Procedures Symptoms Labs Aggregation Count Count Count Count Mean Cross-validation: 10 x 10-fold cross-validation Feature selection: Information Gain, Fisher Score Classification: k-nn, Naïve Bayes, Logistic Regression, Random Forest 23

Running Time vs. Parallelism level 1000000 9 days Large Medium Small 72X speed up Runtime (s) 100000 3 hours 10000 1000 Serial 10 20 40 80 120 Number of Concurrent Tasks 160 Small: 5,000 patients, Medium: 33K, Large: 319K 10 times 10-fold cross validation Dependency graph: 1808 nodes and 3610 edges 24

Summary Predictive models in healthcare research is becoming more prevalent Electronic health records (EHR) adoption continues to accelerate Need for scalable predictive modeling platforms/systems PARAMO is a parallel predictive modeling platform for EHR data PARAMO can facilitate large-scale modeling endeavors and speed-up the research workflow Tests on real EHR data show significant performance gains 27

PATIENT SIMILARITY PLATFORM 28

Patient Similarity Problem Similarity search Patient Doctor 29

Patient Similarity Problem Patient Doctor 30

Challenges of Patient Similarity Similar patients Traditional setting Query patient Feature vector EHR database No exact match >100k dimensional feature vector Modern setting As the size of a feature vector increases, it is hard to find exact match on all features How to find relevant patients to a query for a specific clinical context? 31

Our Approach Query patient >100k dimensional feature vector 1. Feature Selection & Generalization 2. Patient Important features similarity learning For a clinical context, 1. What are important features? 2. What is the right similarity measure? 32

Healthcare Analytic Platform Large-scale Analytics Platform 33

Healthcare Analytic Platform Information Healthcare Data Mining Analytics Visualization Extraction 34

Healthcare Analytic Platform Information Extraction Patient representation Feature extraction Structured EHR Data Mining Visualization Context FeatureAnalytics Healthcare Unstructured EHR selection Visualization Patient similarity 35

Feature Extraction from Unstructured EHR data Information Extraction Patient representation Feature extraction Structured EHR Data Mining Visualization Context FeatureAnalytics Healthcare Unstructured EHR selection Visualization Patient similarity 36

Motivations for Early Detection of Heart Failure Heart failure (HF) is a complex disease Huge Societal Burden Diagnoses are usually made late, despite there are symptoms documented in clinical notes prior Our method exacting HF symptoms achieves precision 0.925, recall 0.896, and F-score 0.910 Roy J. Byrd, Steven R. Steinhubl, Jimeng Sun, Shahram Ebadollahi. Automatic identification of heart failure diagnostic criteria, using text analysis of clinical notes from electronic health records. International Journal of Medical Informatics 2013 37

Potential Impact on Evidence-based Therapies No symptoms Framingham symptoms Clinical diagnosis Opportunity for early intervention 3,168 patients eventually all diagnosed with HF 70.00% 60.00% Preceding Framingham diagnosis 50.00% 40.00% 30.00% After Framingham diagnosis 20.00% 10.00% After clinical diagnosis 0.00% Applying text mining to extract Framingham symptoms can help trigger early intervention Vhavakrishnan R, Steinhubl SR, Sun J, et al. Potential impact of predictive models for early detection of heart failure on the initiation of evidence-based therapies. J Am Coll Cardiol. 2012;59(13s1):E949-E949. 38

Knowledge plus Data Feature Selection Information Extraction Patient representation Feature extraction Structured EHR Data Mining Visualization Context FeatureAnalytics Healthcare Unstructured EHR selection Visualization Patient similarity 39

Combining Knowledge- and Data-driven Risk Factors Knowledge Knowledge base Risk factor gathering Knowledge risk factors Combination Risk factor augmentation Clinical data Data processing Data Potential risk factors Combined risk factors Target condition Jimeng Sun, Jianying Hu, Dijun Luo, Marianthi Markatou, Fei Wang, Shahram Ebadollahi, Steven E. Steinhubl, Zahra Daar, Walter F. Stewart. Combining Knowledge and Data Driven Insights for Identifying Risk Factors using Electronic Health Records. AMIA2012 Dijun Luo, Fei Wang, Jimeng Sun, Marianthi Markatou, Jianying Hu,Shahram Ebadollahi, SOR: Scalable Orthogonal Regression for Low-Redundancy Feature Selection and its Healthcare Applications. SDM 12 41

Risk Factor Augmentation Sparse learning objective formulation: Model error Correlation among data-driven features Correlation between data- and knowledgedriven features Sparse Penalty Jimeng Sun, Jianying Hu, Dijun Luo, Marianthi Markatou, Fei Wang, Shahram Ebadollahi, Steven E. Steinhubl, Zahra Daar, Walter F. Stewart. Combining Knowledge and Data Driven Insights for Identifying Risk Factors using Electronic Health Records. AMIA2012 Dijun Luo, Fei Wang, Jimeng Sun, Marianthi Markatou, Jianying Hu,Shahram Ebadollahi, SOR: Scalable Orthogonal Regression for Low-Redundancy Feature Selection and its Healthcare Applications. SDM 12 42

Prediction Results using Selected Features 0.8 +150 +200 0.75 +50 AUC 0.7 Data driven features improve AUC significantly +100 0.65 +Hypertension 0.6 +diabetes Knowledge driven features improve AUC slightly CAD 0.55 0.5 all knowledge features 0 100 200 300 400 Number of features 500 600 44

Top-10 Selected Data-driven Features Feature Relevancy to HF Dyslipidemia Thiazides-like Diuretics Antihypertensive Combinations Aminopenicillins Bone density regulators Naturietic Peptide Rales Diuretic Combinations S3Gallop NSAIDS 9 out of 10 are considered relevant to HF 45

Top-10 Selected Data-driven Features Feature Relevancy to HF Dyslipidemia Category Thiazides-like Diuretics Diagnosis Antihypertensive Combinations Medication Aminopenicillins Lab Bone density regulators Symptom Naturietic Peptide Rales Diuretic Combinations S3Gallop NSAIDS 9 out of 10 are considered relevant to HF The data driven features are complementary to the existing knowledge-driven features 46

Patient Similarity Information Extraction Patient representation Feature extraction Structured EHR Data Mining Visualization Context FeatureAnalytics Healthcare Unstructured EHR selection Visualization Patient similarity 47

Patient Similarity through Locally Supervised Metric Learning Under a specific clinical context Query patient Jimeng Sun, Fei Wang, Jianying Hu, Shahram Edabollahi: Supervised patient similarity measure of patient records. SIGKDD Explorations 14(1): 16-24 (2012) Jimeng Sun, Large-scaleheterogeneous Healthcare Analytics 48

Patient Similarity through Locally Supervised Metric Learning Under a specific clinical context Homogeneous neighborhood Heterogeneous neighborhood Query patient Homogeneous neighbors: true positives Heterogeneous neighbors: false positives Jimeng Sun, Fei Wang, Jianying Hu, Shahram Edabollahi: Supervised patient similarity measure of patient records. SIGKDD Explorations 14(1): 16-24 (2012) Jimeng Sun, Large-scaleheterogeneous Healthcare Analytics 49

Patient Similarity through Locally Supervised Metric Learning Under a specific clinical context Query patient Shrink homogeneous neighborhood Grow heterogeneous neighborhood Jimeng Sun, Fei Wang, Jianying Hu, Shahram Edabollahi: Supervised patient similarity measure of patient records. SIGKDD Explorations 14(1): 16-24 (2012) Jimeng Sun, Large-scaleheterogeneous Healthcare Analytics 50

Locally Supervised Metric Learning (LSML) Goal: Learn a generalized Mahalanobis distance for a specific clinical context (target label) Margin for xi Total distance to heterogeneous neighbors Total distance to homogeneous neighbors Homogeneous neighborhood for xi Maximize the total margin Heterogeneous neighborhood for xi Jimeng Sun, Fei Wang, Jianying Hu, Shahram Edabollahi: Supervised patient similarity measure of Jimeng Sun, Large-scaleheterogeneous Healthcare Analytics patient records. SIGKDD Explorations 14(1): 16-24 (2012) 51

Patient Similarity Experiment Design 200K samples 195 features D1 Cases D6 Controls D1 D2 D3 D4 D5 D6 Disease Target DM with Acute Complications DM without Complications Depression Heart Failure Asthma Lung Cancers samples 4,392 10,734 6,794 5,262 6,606 1,172 56

Prediction Results on Patient Similarity Baselines: EUC: Euclidean distance PCA: Principal component analysis LDA: Linear discriminant analysis Observations: LDA does not perform well, because of the resulting dimensionality is too low LSML algorithm performs the best among all DM with Acute DM without Complications Complications Depression Congestive Heart Failure Asthma Lung Cancers Euclidean 0.539 0.638 0.609 0.688 0.602 0.645 LDA 0.541 0.604 0.589 0.564 0.584 0.595 PCA 0.57 0.639 0.625 0.697 0.617 0.664 LSML 0.576 0.669 0.632 0.723 0.625 0.677 57

Clinical Relevancy of Patient Similarity Results yes 100% 90% no no 60% no no no no 80% 70% possible possible no possible possible possible possible possible 50% 40% 30% 20% 10% 0% DM with Acute DM without Complica ons Complica ons Depression Heart Failure Asthma Lung Cancers Retrieve the top ranked comorbidities among similar patients over 80% of those are considered as relevant to target disease 58

Summary on Patient Similarity LSML learns a customized distance metric Extension 1: Composite distance integration (Comdi) [1] How to combine multiple patient similarity measures? Extension 2: Interactive metric update (imet) [2] How to update an existing distance measure? 1. Fei Wang, Jimeng Sun, Shahram Ebadollahi: Integrating Distance Metrics Learned from Multiple Experts and its Application in Inter-Patient Similarity Assessment. SDM 2011: 59-70 56 2. Fei Wang, Jimeng Sun, Jianying Hu, Shahram Ebadollahi: imet: Interactive Metric Learning in Healthcare Applications. SDM 2011: 944-955 59

Visualization Information Extraction Patient representation Feature extraction Structured EHR Data Mining Visualization Context FeatureAnalytics Healthcare Unstructured EHR selection Visualization Patient similarity 60

Visualization FacetAtlas (InfoVis 10) DICON (InfoVis 11) SolarMap(ICDM 11) MatrixFlow (AMIA 12) 61

Conclusions Information Extraction Data Mining Patient representation Feature extraction Structured EHR Unstructured EHR Visualization Context Feature selection Visualization Patient similarity Scalable healthcare analytic research platform Enables efficient collaboration across domains Extensible analytic platform Intuitive analytic results Scalable computation engine 62

Future of Healthcare Analytics Health Analytic Apps Clinical data Social data Data miner Visualization User Behavior data Genomic data Privacy engine Heart disease predictor for $5.99 Analytic cloud Research Challenges Data analytic techniques for each data modality Privacy preserving data sharing Visual analytic techniques 63