High-Throughput Phenotyping from Electronic Health Records for Research Jyotishman Pathak, PhD Director, Clinical Informatics Program and Services Kern Center for the Science of Health Care Delivery Associate Professor of Biomedical Informatics Department of Health Sciences Research December 9 th, 2014 Quality Registers Symposium Stockholm, Sweden
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-2 DISCLOSURES Relevant Financial Relationship(s) None Off Label Usage None
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-3 Electronic health records (EHRs) driven phenotyping EHRs are becoming more and more prevalent within the U.S. healthcare system Meaningful Use is one of the major drivers Overarching goal To develop high-throughput semi-automated techniques and algorithms that operate on normalized EHR data to identify cohorts of potentially eligible subjects on the basis of disease, symptoms, or related findings both retrospectively and prospectively
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-4 http://gwas.org [emerge Network]
EHR-driven Phenotyping Algorithms - I Typical components Billing and diagnoses codes Procedure codes Labs Medications Phenotype-specific co-variates (e.g., Demographics, Vitals, Smoking Status, CASI scores) Pathology Radiology Organized into inclusion and exclusion criteria [emerge Network] 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-5
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-6 EHR-driven Phenotyping Algorithms - II Rules Phenotype Algorithm Evaluation Visualization Transform Transform Data Mappings NLP, SQL [emerge Network]
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-7 Example: Hypothyroidism Algorithm No thyroid-altering medications (e.g., Phenytoin, Lithium) 2+ non-acute visits in 3 yrs ICD- 9s for Hypothyroidism Abnormal TSH/FT4 Thyroid replace. meds An&bodies for TTG or TPO (anti-thyroglobulin, anti-thyroperidase) No ICD-9s for Hypothyroidism No Abnormal TSH/FT4 No thyroid replace. meds No secondary causes (e.g., pregnancy, ablation) No An&bodies for TTG/TPO No Hx of myasthenia gravis Case 1 Case 2 Control [Denny et al., ASHG, 2012; 89:529-542]
Example: Hypothyroidism Algorithm [Conway et al. AMIA 2011: 274-83] 2014 Quality Registers Symposium, Stockholm 2012 MFMER slide-8
Example: Hypothyroidism Algorithm Drugs Demographics Diagnosis Labs Procedures NLP Vitals [Conway et al. AMIA 2011: 274-83] 2014 Quality Registers Symposium, Stockholm 2012 MFMER slide-9
Phenotyping Algorithms Alzheimer s Dementia Cataracts Peripheral Arterial Disease Type 2 Diabetes Data Categories used to define EHR-driven Phenotyping Algorithms Clinical gold standard EHR-derived phenotype Validation (PPV/NPV) Sensitivity (Case/Cntrl) Demographics, clinical examination of mental status, histopathologic examination Clinical exam finding (Ophthalmologic examination) Clinical exam finding (ankle-brachial index or arteriography) Laboratory Tests Diagnoses, medications Diagnoses, procedure codes Diagnoses, procedure codes, medications, radiology test results Diagnoses, laboratory tests, medications 73%/55% 37.1%/99% 98%/98% 99.1%/93.6% 94%/99% 85.5%/81.6% 98%/100% 100%/100% Cardiac Conduction ECG measurements ECG report results 97% (case only algorithm) 96.9% (case only algorithm) [emerge Network]
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-11 Genotype-Phenotype Association Results gene / disease marker region Atrial fibrillation Crohn's disease Multiple sclerosis Rheumatoid arthritis Type 2 diabetes rs2200733 rs10033464 rs11805303 Chr. 4q25 Chr. 4q25 IL23R rs17234657 Chr. 5 rs1000113 Chr. 5 rs17221417 rs2542151 rs3135388 rs2104286 rs6897932 NOD2 PTPN22 DRB1*1501 IL2RA IL7RA rs6457617 Chr. 6 rs6679677 rs2476601 rs4506565 rs12255372 rs12243326 rs10811661 rs8050136 rs5219 rs5215 rs4402960 RSBN1 PTPN22 TCF7L2 TCF7L2 TCF7L2 CDKN2B FTO KCNJ11 KCNJ11 IGF2BP2 published observed 0.5 1.0 2.0 5.0 Odds Ratio 0.5 [Ritchie et al. AJHG 2010; 5 86(4):560-72]
Key lessons learned from emerge Algorithm design and transportability Non-trivial; requires significant expert involvement Highly iterative process Time-consuming manual chart reviews Representation of phenotype logic is critical Standardized data access and representation Importance of unified vocabularies, data elements, and value sets Questionable reliability of ICD & CPT codes (e.g., billing the wrong code since it is easier to find) Natural Language Processing (NLP) is critical [Kho et al. Sc. Trans. Med 2011; 3(79): 1-7] 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-12
SHARPn: Secondary Use of EHR Data 2014 Quality Registers Symposium, Stockholm 2012 MFMER slide-13
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-14 Algorithm Development Process - Modified Phenotype Algorithm Standardized and structured representation of phenotype definition criteria Use the NQF Quality Data Model (QDM) Rules Conversion of structured Semi-Automatic Execution phenotype criteria into executable queries Evaluation Use JBoss Drools (DRLs) Visualization Standardized representation of clinical data Create new and re-use existing clinical element models (CEMs) Transform Transform Data Mappings NLP, SQL [Welch et al., JBI 2012; 45(4):763-71] [Pathak et al., JAMIA 2013; 20(e2): e341-8]
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-15 Algorithm Development Process - Modified Semi-Automatic Execution Rules Phenotype Algorithm Transform Transform Visualization Standardized representation of clinical data Create new and re-use existing clinical element models (CEMs) Data Evaluation Mappings NLP, SQL [Welch et al., JBI 2012; 45(4):763-71] [Pathak et al., JAMIA 2013; 20(e2): e341-8]
Clinical Element Models Higher-Order Structured Representations 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-16 [Stan Huff, IHC]
CEMs available for patient demographics, medications, lab measurements, procedures etc. 2014 Quality Registers Symposium, Stockholm [Stan Huff, IHC]
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-18 SHARPn data normalization pipeline
[Welch et al., JBI 2012; 45(4):763-71] [Pathak et al., JAMIA 2013; 20(e2): e341-8] 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-19
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-20 Algorithm Development Process - Modified Semi-Automatic Execution Standardized and structured representation of phenotype definition criteria Use the NQF Quality Data Model (QDM) Rules Phenotype Algorithm Transform Transform Visualization Standardized representation of clinical data Create new and re-use existing clinical element models (CEMs) Data Evaluation Mappings NLP, SQL [Welch et al., JBI 2012; 45(4):763-71] [Pathak et al., JAMIA 2013; 20(e2): e341-8]
Example algorithm: Hypothyroidism 2014 Quality Registers Symposium, Stockholm 2012 MFMER slide-21
NQF Quality Data Model (QDM) Standard of the National Quality Forum (NQF) A structure and grammar to represent quality measures and phenotype definitions in a standardized format Groups of codes in a code set (ICD-9, etc.) "Diagnosis, Active: steroid induced diabetes" using "steroid induced diabetes Value Set GROUPING (2.16.840.1.113883.3.464.0001.113) Supports temporality & sequences AND: "Procedure, Performed: eye exam" > 1 year(s) starts before or during "Measurement end date" Implemented as a set of XML schemas Links to standardized terminologies (ICD-9, ICD-10, SNOMED-CT, CPT-4, LOINC, RxNorm etc.) 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-22
2014 Quality Registers Symposium, Stockholm 2012 MFMER slide-23 Example: Diabetes & Lipid Mgmt. - I
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-24 Example: Diabetes & Lipid Mgmt. - II Human readable HTML
2014 Quality Registers Symposium, Stockholm 2012 MFMER slide-25 Example: Diabetes & Lipid Mgmt. - III
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-26 Example: Diabetes & Lipid Mgmt. - IV Computer readable HQMF XML (based on HL7 v3 RIM)
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-27 Algorithm Development Process - Modified Phenotype Algorithm Standardized and structured representation of phenotype definition criteria Use the NQF Quality Data Model (QDM) Rules Conversion of structured Semi-Automatic Execution phenotype criteria into executable queries Evaluation Use JBoss Drools (DRLs) Visualization Standardized representation of clinical data Create new and re-use existing clinical element models (CEMs) Transform Transform Data Mappings NLP, SQL [Welch et al., JBI 2012; 45(4):763-71] [Pathak et al., JAMIA 2013; 20(e2): e341-8]
A Modular Architecture for Electronic Health Record-Driven Phenotyping Luke V. Rasmussen 1 ; Richard C. Kiefer 2, Huan Mo, MD, MS 3, Peter Speltz 3, William K. Thompson, PhD 4, Guoqian Jiang, MD, PhD 2, Jennifer A. Pacheco 1, Jie Xu, MS 1, Qian Zhu, PhD 5, Joshua C. Denny, MD, MS 3, Enid Montague, PhD 1, Jyotishman Pathak, PhD 2 Figure 1 1. Phenotype Execution and Modeling Northwestern University, Chicago, IL; 2 Architecture (PhEMA) Mayo Clinic, Rochester, MN; 3 Vanderbilt University, Nashville, TN; 4 NorthShore University HealthSystem, Evanston, IL; 5 University of Maryland Baltimore County, Baltimore, MD Abstract Increasing interest in and experience with electronic health record (EHR)-driven phenotyping has yielded multiple challenges that are at present only partially addressed. Many solutions require the adoption of a single software platform, often with an additional cost of mapping existing patient and phenotypic data to multiple representations. We propose a set of guiding design principles and a modular software architecture to bridge the gap to a standardized phenotype representation, dissemination and execution. Ongoing development leveraging this proposed architecture has shown its ability to address existing limitations. Introduction Electronic health record (EHR)-based phenotyping is a rich source of data for clinical and genomic research 1. Interest in this area, however, has uncovered many challenges including the quality and semantics of the data collected, as well as the disparate modalities in which the information is stored across institutions and EHR systems 2, 3. Significant effort has gone into addressing these challenges, and researchers have an increased understanding of the nuances of EHR data and the importance of multiple information sources such as diagnostic codes coupled with natural language processing (NLP) 4, 5. The process of developing a phenotype algorithm is complex requiring a diverse team engaged in multiple iterations of hypothesis generation, data exploration, algorithm generation and validation 6. The technical tools used during these phases vary across teams and institutions, driven in part by availability, licensing costs, technical Discussion infrastructure supported at an institution, and personal preference. For example, initial feasibility queries may be We conducted propose PhEMA using a system to support like i2b2 the 7 multiple followed facets by in-depth of EHR-based analysis using phenotyping. statistical There software are (e.g. notable SAS) similarities or structured to [Rasmussen et al. AMIA 2015] 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-28
Scalable and High-Throughput Execution of Clinical Quality Measures from Electronic Health Records using MapReduce and the JBoss R Drools Engine Kevin J. Peterson, MS 1,2 and Jyotishman Pathak, PhD 1 1 Dept. of Health Sciences Research, Mayo Clinic, Rochester, MN 2 Dept. of Computer Science & Engineering, University of Minnesota, Twin Cities, MN ABSTRACT Automated execution of electronic Clinical Quality Measures (ecqms) from electronic health records (EHRs) on large patient populations remains a significant challenge, and the testability, interoperability, and scalability of measure execution are critical. The High Throughput Phenotyping (HTP; http://phenotypeportal.org) project aligns with these goals by using the standards-based HL7 Health Quality Measures Format (HQMF) and Quality Data Model (QDM) for measure specification, as well as Common Terminology Services 2 (CTS2) for semantic interpretation. The HQMF/QDM representation is automatically transformed into a JBoss R Drools workflow, enabling horizontal scalability via clustering and MapReduce algorithms. Using Project Cypress, automated verification metrics can then be produced. Our results show linear scalability for nine executed 2014 Center for Medicare and Medicaid Services (CMS) ecqms for eligible professionals and hospitals for >1,000,000 patients, and verified execution correctness of 96.4% based on Project Cypress test data of 58 ecqms. 1 INTRODUCTION Secondary use of electronic health record (EHR) data is a broad domain that includes clinical quality measures, observational cohorts, outcomes research and comparative effectiveness research. A common thread across all these use cases is the design, implementation, and execution of EHR-driven phenotyping algorithms for identifying patients that meet certain criteria for diseases, conditions and events of interest (e.g., Type 2 Diabetes), and the subsequent analysis of the query results. The core principle for implementing and executing the phenotyping algorithms comprises extracting and evaluating data from EHRs, including but not limited to, diagnosis, procedures, [Peterson vitals, laboratory AMIA 2014] values, medication use, and NLP-derived observations. While we have successfully demonstrated the applicability of such algorithms for clinical and translational research within the Electronic Medical Records and Genomics (emerge)[1] and Strategic Health IT Advance Research Project (SHARPn)[2, 3], an important aspect that has re- 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-29
Example: Initial Patient Population criteria for CMS emeasure (CMS163V1) 2014 Quality Registers Symposium, Stockholm 2012 MFMER slide-30
Scalable and High-Throughput Execution of Clinical Quality Measures from Electronic Health Records using MapReduce and the JBoss R Drools Engine Kevin J. Peterson, MS 1,2 and Jyotishman Pathak, PhD 1 1 Dept. of Health Sciences Research, Mayo Clinic, Rochester, MN 2 Dept. of Computer Science & Engineering, University of Minnesota, Twin Cities, MN ABSTRACT Automated execution of electronic Clinical Quality Measures (ecqms) from electronic health records (EHRs) on large patient populations remains a significant challenge, and the testability, interoperability, and scalability of measure execution are critical. The High Throughput Phenotyping (HTP; http://phenotypeportal.org) project aligns with these goals by using the standards-based HL7 Health Quality Measures Format (HQMF) and Quality Data Model (QDM) for measure specification, as well as Common Terminology Services 2 (CTS2) for semantic interpretation. The HQMF/QDM representation is automatically transformed into a JBoss R Drools workflow, enabling horizontal scalability via clustering and MapReduce algorithms. Using Project Cypress, automated verification metrics can then be produced. Our results show linear scalability for nine executed 2014 Center for Medicare and Medicaid Services (CMS) ecqms for eligible professionals and hospitals for >1,000,000 patients, and verified execution correctness of 96.4% based on Project Cypress test data of 58 ecqms. 1 INTRODUCTION Secondary use of electronic health record (EHR) data is a broad domain that includes clinical quality measures, observational cohorts, outcomes research and comparative effectiveness research. A common thread across all these use cases is the design, implementation, and execution of EHR-driven phenotyping algorithms for identifying patients that meet certain criteria for diseases, conditions and events of interest (e.g., Type 2 Diabetes), and the subsequent analysis of the query results. The core principle for implementing and executing the phenotyping algorithms comprises extracting and evaluating data from EHRs, including but not limited to, diagnosis, procedures, vitals, [Peterson laboratory AMIA 2014] values, medication use, and NLP-derived observations. While we have successfully demonstrated the applicability of such algorithms for clinical and translational research within the Electronic Medical Records and Genomics (emerge)[1] and Strategic Health IT Advance Research Project (SHARPn)[2, 3], an important aspect that has re- 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-31
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-32 Validation using Project Cypress [Peterson AMIA 2014]
http://phenotypeportal.org [Endle et al., AMIA 2012]
http://api.phenotypeportal.org 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-34
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-35 How is this research applied at Mayo? Transfusion related acute lung injury (Kor) Pharmacogenomics of depression (Weinshilboum) Pharmacogenomics of breast cancer (Wang) Pharmacogenomics of heart failure (Pereira) Multi-morbidity in depression and heart failure (Pathak) Lumbar image reporting with epidemiology (Kallmes) Active surveillance for celiac disease (Murray) Phenome-wide association studies (Pathak)
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-36 How is this research applied at Mayo? Transfusion related acute lung injury (Kor) Pharmacogenomics of depression (Weinshilboum) Pharmacogenomics of breast cancer (Wang) Pharmacogenomics of heart failure (Pereira) Multi-morbidity in depression and heart failure (Pathak) Lumbar image reporting with epidemiology (Kallmes) Active surveillance for celiac disease (Murray) Phenome-wide association studies (Pathak)
Blood transfusion management 2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-37 Transfusion-Related Acute Lung Injury (TRALI) Transfusion-Associated Circulatory Overload (TACO)
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-38 [Clifford et al. Transfusion 2013]
Missing reported TRALI/TACO cases
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-40 Missing reported TRALI/TACO cases Of the 88 TRALI cases correctly identified by the CART algorithm, only 11 (12.5%) of these were reported to the blood bank by the clinical service. Of the 45 TACO cases correctly identified by the CART algorithm, only 5 (11.1%) were reported to the blood bank by the clinical service. [Clifford et al. Transfusion 2013]
Active surveillance for TRALI/TACO TRALI/TACO sniffers 2014 Quality Registers Symposium, Stockholm [Herasavich et al. Healthcare 2012 MFMER Info. 2011;1:42-45] slide-41
Active surveillance for TRALI/TACO Example of a data-driven learning TRALI/TACO sniffers health care system 2014 Quality Registers Symposium, Stockholm [Herasavich et al. Healthcare 2012 MFMER Info. 2011;1:42-45] slide-42
Concluding remarks EHRs contain a wealth of phenotypes for clinical and translational research EHRs represent real-world data, and hence has challenges with interpretation, wrong diagnoses, and compliance with medications Handling referral patients even more so Standardization and normalization of clinical data and phenotype definitions is critical Phenotyping algorithms are often transportable between multiple EHR settings Validation is an important component 2014 Quality Registers Symposium, Stockholm 2012 MFMER slide-43
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-44 It takes a village
Mayo Clinic emerge Phenotyping team Christopher Chute, MD, DrPH Suzette Bielinski, PhD Mariza de Andrade, PhD John Heit, MD Hayan Jouni, MD Adnan Khan, MBBS Sunghwan Sohn, PhD Kevin Bruce Sean Murphy Mayo Clinic SHARPn Phenotyping team Christopher Chute, MD, DrPH Dingcheng Li, PhD Cui Tao, PhD (now @ UTexas) Gyorgy Simon, PhD (now @ UMN) Craig Stancl Cory Endle Sahana Murthy Dale Suesse Kevin Peterson Acknowledgment PhEMA and DEPTH collaborators Joshua Denny, MD William Thompson, PhD Guoqian Jiang, MD, PhD Enid Montague, PhD Luke Rasmussen, MS Richard Kiefer Peter Speltz Jennifer Pachecho Huan Mo CSHCD Clinical Informatics Program Daryl Kor, MD Maryam Panahiazhar, PhD Che Ngufor, PhD Dennis Murphree, PhD Sudhi Upadhyaya Funding NIH R01 GM105688 (PI Pathak): PheMA AHRQ R01 HS23077 (PI Pathak): DEPTH NIH R01 GM103859 (PI Pathak): PGx NIH R25 EB020381 (PI Pathak): BDC4CM NIH U01 HG006379 (PI Chute, Kullo): emerge ONC 90TR002 (PI Chute): SHARPn Mayo Clinic Center for the Science of Healthcare Delivery (CSHCD) 2012 MFMER slide-45
Essentially, we re going to be moving from an electronic medical record which initially was just an electronic version of a paper record... John Noseworthy, M.D. Mayo Clinic President and CEO Spring 2010 [http://mayoweb.mayo.edu/mayo-today/qajf10moving.html] 2014 Quality Registers Symposium, Stockholm 2012 MFMER slide-46
2014 Quality Registers Symposium, Stockholm 2014 MFMER slide-47 Thank You! Pathak.Jyotishman@mayo.edu http://jyotishman.info