Biomedical Informatics discovery and impact Next-generation Phenotyping Using Interoperable Big Data George Hripcsak, Chunhua Weng Columbia University Medical Center Collab with Mount Sinai Medical Center
Introducing OHDSI Observational Health Data Sciences and Informatics International network of researchers and observational health databases with a central coordinating center housed at Columbia University Mission: Large-scale analysis of observational health databases for population-level estimation and patient-level predictions Vision: Patients and clinicians use OHDSI tools every day to access evidence based on 1 billion patients http://ohdsi.org Clinical researcher, provider, patient Tools and algorithms Data nodes Infrastructure, models, ontologies
OHDSI's global research community >120 collaborators from 11 different countries Experts in informatics, statistics, epidemiology, clinical sciences Active participation from academia, government, industry, providers http://ohdsi.org/who-we-are/collaborators/
Global reach of ohdsi.org >4600 distinct users from 96 countries in 2015
Why large-scale analysis is needed in healthcare All health outcomes of interest All drugs
What is large-scale? Millions of observations Need for performance in handling relational structure with millions of patients and billions of clinical observations, focus on optimization to analytical use cases. Millions of covariates No analytics software in the world can fit a regression with >1m observations and >1m covariates on typical hardware but CYCLOPS can! Millions of questions Systematic solutions with massive parallelization should be designed to run efficiently for one-at-a-time AND all-by-all
Analysis use cases: Drug safety surveillance, Device safety surveillance, Vaccine safety surveillance, Comparative effectiveness, Health economics, Quality of care
Standardized clinical data: Person, Observation_period, Specimen, Death, Visit_occurrence, Procedure_occurrence, Drug_exposure, Device_exposure, Condition_occurrence, Measurement, Note, Observation, Fact_relationship
Standardized health system data: Location, Care_site, Provider
Standardized meta-data: CDM_source
Standardized health economics: Payer_plan_period, Procedure_cost, Drug_cost, Visit_cost, Device_cost
Standardized derived elements: Cohort, Cohort_attribute, Drug_era, Condition_era, Dose_era
Standardized vocabularies: Concept, Vocabulary, Domain, Concept_class, Concept_relationship, Relationship, Concept_synonym, Concept_ancestor, Source_to_concept_map, Drug_strength, Cohort_definition, Attribute_definition
Preparing your data for analysis Patient-level data in source system/schema ETL design ETL implement ETL test Patient-level data in OMOP CDM OHDSI tools built to help WhiteRabbit: profile your source data RabbitInAHat: map your source structure to CDM tables and fields ATHENA: standardized vocabularies for all CDM domains Usagi: map your source codes to CDM vocabulary CDM: DDL, index, constraints for Oracle, SQL Server, PostgreSQL; Vocabulary tables with loading scripts ACHILLES: profile your CDM data; review data quality assessment; explore population-level summaries OHDSI Forums: Public discussions for OMOP CDM Implementers/developers http://github.com/ohdsi
Data Evidence sharing paradigms Single study Write Protocol Develop code Execute analysis Compile result Patient-level data in OMOP CDM Develop app Real-time query Design query Submit job Review result evidence Large-scale analytics Develop app Execute script Explore results One-time Repeated
Patient-level data in OMOP CDM Standardized large-scale analytics tools under development within OHDSI ACHILLES: Database profiling CIRCE: Cohort definition HERMES: Vocabulary exploration HERACLES: Cohort characterization CALYPSO: Feasibility assessment OHDSI Methods Library: CYCLOPS CohortMethod SelfControlledCaseSeries SelfControlledCohort TemporalPatternDiscovery Empirical Calibration LAERTES: Drug-AE evidence base PLATO: Patient-level predictive modeling HOMER: Population-level causality assessment http://github.com/ohdsi
CIRCE for cohort definition CIRCE (Cohort Inclusion and Restriction Criteria Expression) User interface to define and review cohort definitions: a COHORT is a set of persons satisfying one or more criteria for a duration of time Disease phenotype is a typical use case for cohort definition Interface translates a human-readable form into a standardized JSON representation for network-based analysis interoperability, and compiles the JSON into a platform-specific SQL dialect for direct execution against any OMOP CDM-compliant dataset Open-source, freely available source code: https://github.com/ohdsi/circe
One interface allows definition of criteria across all tables and all fields of the OMOP Common Data Model. The user interface translates this human-readable form into JSON, which is compiled into SQL dialects for 5 platforms.
Each expression can be defined by one or more standard concept sets, using OHDSI's standardized vocabularies
HERMES for vocabulary exploration OHDSI standardized vocabularies allow consistent definitions to be applied across disparate source vocabularies: selecting the descendants of the SNOMED concept Attention deficit hyperactivity disorder maps all ICD9, ICD10, READ codes to execute analysis across OHDSI's international data network
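The descendant expansion described on this slide can be sketched with the OMOP vocabulary tables concept, concept_ancestor, and concept_relationship. This is a minimal illustration, not the HERMES implementation: it uses an in-memory SQLite database, and every concept ID, concept code, and row below is made up for the example.

```python
import sqlite3

# Toy vocabulary in an in-memory database. All IDs and codes are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER, concept_name TEXT,
                      vocabulary_id TEXT, concept_code TEXT);
CREATE TABLE concept_ancestor (ancestor_concept_id INTEGER,
                               descendant_concept_id INTEGER);
CREATE TABLE concept_relationship (concept_id_1 INTEGER, concept_id_2 INTEGER,
                                   relationship_id TEXT);
""")
conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?)", [
    (1, "Attention deficit hyperactivity disorder", "SNOMED", "S-1"),
    (2, "ADHD, predominantly inattentive type", "SNOMED", "S-2"),
    (3, "Unrelated disorder", "SNOMED", "S-3"),
    (101, "Source code A", "ICD9CM", "A"),
    (102, "Source code B", "ICD10", "B"),
    (103, "Source code C", "Read", "C"),
    (104, "Source code D", "ICD9CM", "D"),
])
# concept_ancestor also lists each concept as its own ancestor.
conn.executemany("INSERT INTO concept_ancestor VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2), (3, 3)])
# 'Maps to' links a source code (concept_id_1) to a standard concept (concept_id_2).
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", [
    (101, 1, "Maps to"), (102, 2, "Maps to"),
    (103, 2, "Maps to"), (104, 3, "Maps to"),
])


def source_codes_for(conn, ancestor_concept_id):
    """Return (vocabulary_id, concept_code) for every source code that maps
    to the given standard concept or to any of its descendants."""
    rows = conn.execute("""
        SELECT src.vocabulary_id, src.concept_code
        FROM concept_ancestor ca
        JOIN concept_relationship cr
          ON cr.concept_id_2 = ca.descendant_concept_id
         AND cr.relationship_id = 'Maps to'
        JOIN concept src ON src.concept_id = cr.concept_id_1
        WHERE ca.ancestor_concept_id = ?
        ORDER BY src.concept_code
    """, (ancestor_concept_id,))
    return rows.fetchall()


adhd_codes = source_codes_for(conn, 1)
print(adhd_codes)
```

Choosing the single ancestor concept is enough to pull in the source codes from all three toy vocabularies, while the code mapped to the unrelated concept is left out.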
Concept sets can define one or more entities. Here, the PheKB list of ADHD inclusionary medications has been represented by 21 RxNorm ingredient concepts; all brands, doses, and forms are subsumed
The human-readable Expression form is translated into JSON in real time. This JSON object can be shared across partners to materialize the definition consistently and reproducibly without any programming required
Each expression is compiled into SQL. OHDSI supports rendering SQL into platform-specific dialects for SQL Server, Oracle, Postgres, RedShift, MS APS. This code can be copied and executed in your favorite SQL UI tool, or.
Patient-level observational databases that are converted to the OMOP Common Data Model and exposed to the OHDSI WebAPI (either a local install or any public network version) can have the cohort definition executed directly within the database to produce a COHORT. The COHORT is then available for all subsequent research within the OHDSI environment
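The flow from a shareable JSON expression to SQL to a materialized COHORT can be illustrated with a small sketch. The JSON layout and the compile_to_sql function below are hypothetical simplifications invented for this example; they are not the CIRCE JSON schema or its compiler, and no dialect translation is attempted. The toy data live in an in-memory SQLite table shaped like the CDM condition_occurrence table.

```python
import json
import sqlite3

# Hypothetical, simplified cohort expression (NOT the real CIRCE JSON schema).
expression_json = json.dumps({
    "name": "Toy condition cohort",
    "domain_table": "condition_occurrence",
    "concept_column": "condition_concept_id",
    "date_column": "condition_start_date",
    "concept_ids": [1, 2],
})


def compile_to_sql(expression_json):
    """Render the toy expression as a plain SQL query producing
    one row per person: subject_id, cohort_start_date, cohort_end_date."""
    expr = json.loads(expression_json)
    ids = ", ".join(str(int(c)) for c in expr["concept_ids"])
    return (
        f"SELECT person_id AS subject_id, "
        f"MIN({expr['date_column']}) AS cohort_start_date, "
        f"MAX({expr['date_column']}) AS cohort_end_date "
        f"FROM {expr['domain_table']} "
        f"WHERE {expr['concept_column']} IN ({ids}) "
        f"GROUP BY person_id ORDER BY person_id"
    )


# Toy patient-level data; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT)""")
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?)", [
    (10, 1, "2014-01-05"),
    (10, 2, "2014-03-01"),
    (11, 3, "2014-02-01"),   # concept 3 is not in the concept set
    (12, 2, "2015-06-30"),
])

sql = compile_to_sql(expression_json)
cohort = conn.execute(sql).fetchall()
print(sql)
print(cohort)
```

Because the expression is plain JSON, the same string can be handed to another site and compiled there, which is the property the slides rely on for network studies.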
Try it yourself http://www.ohdsi.org/web/circe/#/146
Proof of concept Treatment pathways around the world Diabetes, hypertension, depression (Submitted to PNAS)
Cohort
Databases (255M) and definitions
Diabetes
Opportunities for collaboration Implement the PheKB library in CIRCE, so that all organizations with patient-level data (translated to OMOP common data model) can take the work from eMERGE and directly apply the logic to their own data and participate in eMERGE's research
Phenotyping: hard challenges. Quality of the data; ambiguous or unknown meaning; accuracy: 50-100% [Hogan JAMIA 1997]; completeness: mostly missing; complexity: disease ontologies; bias
Truth observe & interpret Concept author Record read Concept Health status of the patient Error Clinician or patient's conception Error EHR/PHR Implicit 2nd clinician's conception of the patient (or self, lawyer, compliance,...) Error process Model Computable representation
Biased Environment Patient state Therapy Care team Objective tests Electronic health record
Inpatient mortality for community acquired pneumonia [Figure: mortality (%), 0 to 35, by Fine class, 1 to 5; series: 18715 cohort, 1935 cohort, Fine] 18715 cohort: +CXR +fdg -recent pneu -recent visit; 1935 cohort: above plus +DSUM exist +ICD9 (pneu not sepsis) Hripcsak... Comput Biol Med 2007;37:296-304
EHR-derived phenotype Clinically relevant feature derived from EHR Patient has (a diagnosis of) type II diabetes Recent rash and fever Drug-induced liver injury Then use the phenotype in correlation studies, etc. Query Raw data Phenotype Experiment
Physics of the medical record 1. Study EHR as if it were a natural object Use EHR to learn about EHR Not studying patient, but recording of patient 2. Aggregate across units and model 3. Borrow methods from non-linear time series
Glucose by Δt and tau [Figure: MI for glucose, binned in steps of 0.05 up to 0.45, plotted against delta-t (days) and tau] Albers... Translational Bioinformatics 2009
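The quantity plotted here, mutual information (MI) between measurements separated in time, can be sketched with a simple histogram estimator. This is an illustrative simplification, not the method of the cited paper: it assumes a regularly sampled series and a single integer lag, whereas clinical glucose values are irregularly sampled, and the bin count and the synthetic autoregressive series are arbitrary choices made for the example.

```python
import numpy as np


def mutual_information(x, y, bins=10):
    """Histogram (plug-in) estimate of mutual information, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nonzero = pxy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))


def delayed_mutual_information(series, lag, bins=10):
    """MI between series[t] and series[t + lag] for an integer lag >= 1."""
    series = np.asarray(series, dtype=float)
    return mutual_information(series[:-lag], series[lag:], bins=bins)


# Synthetic stand-in for a lab series: an AR(1) process with strong short-term memory.
rng = np.random.default_rng(0)
n = 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + rng.normal()

mi_short = delayed_mutual_information(x, lag=1)
mi_long = delayed_mutual_information(x, lag=200)
print(mi_short, mi_long)
```

In this toy series, values one step apart share much more information than values 200 steps apart, which is the kind of decay with separation that a plot of MI against delta-t makes visible.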
Correlate lab tests and concepts 22 years of data on 3 million patients 21 laboratory tests, including sodium, potassium, bicarbonate, creatinine, urea nitrogen, glucose, and hemoglobin 60 concepts derived from signout notes, which are written by residents caring for inpatients to facilitate the transfer of care for overnight coverage; concepts likely to have an association + controls
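One way to picture correlating a lab test with a note concept over time is a lagged correlation: shift one series against the other and compute the correlation at each offset. The sketch below is an assumption-laden illustration rather than the study's procedure: it uses a single, regularly sampled daily series, plain Pearson correlation, and synthetic data in which a hypothetical concept mention raises the lab value three days later.

```python
import numpy as np


def lagged_correlation(lab, concept, lags):
    """Pearson correlation between concept[t] and lab[t + lag] for each lag.
    A positive lag means the lab value is taken after the concept mention."""
    lab = np.asarray(lab, dtype=float)
    concept = np.asarray(concept, dtype=float)
    n = len(lab)
    result = {}
    for lag in lags:
        if lag >= 0:
            c, l = concept[: n - lag], lab[lag:]
        else:
            c, l = concept[-lag:], lab[: n + lag]
        result[lag] = float(np.corrcoef(c, l)[0, 1])
    return result


# Synthetic data: a concept is mentioned on about 10% of days, and the lab
# value rises by 0.5 units three days after each mention (made-up effect).
rng = np.random.default_rng(1)
n = 2000
concept = (rng.random(n) < 0.1).astype(float)
lab = 4.0 + rng.normal(0.0, 0.2, n)
lab[3:] += 0.5 * concept[:-3]

corr = lagged_correlation(lab, concept, range(-10, 11))
best_lag = max(corr, key=corr.get)
print(best_lag, round(corr[best_lag], 2))
```

Plotting corr against lag gives a curve of the same general shape as the per-concept curves on the following slides, with the position of the peak indicating whether the lab change precedes or follows the mention.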
Intentional and physiologic associations [Figure: potassium; x-axis from -60 to 60, y-axis from -0.15 to 0.15; series: aldactone, dialysis, hyperkalemia, hypokalemia, hypomagnesemia]
Timing of cause in disease vs. treatment [Figure: glucose; x-axis from -60 to 40, y-axis from -0.04 to 0.1; series: hyperglycemia, hypernatremia, hypoglycemia, insulin, metformin, pancreatitis]
Specificity of the concept [Figure: creatinine; x-axis from -60 to 60, y-axis from -0.04 to 0.14; series: aldactone, dialysis, diarrhea, diuretic, hctz, hyperglycemia, hypernatremia, vomiting]
Hripcsak... JAMIA 2013 Health care process model
Hripcsak... JAMIA 2013
inpatient admit ambulatory surgery
Hripcsak JAMIA 2009 Interpreting time
Deviation by stated unit [Figure: number of occurrences, 0 to 50, by proportional deviation, -1 to 1; series by stated unit: day, week, month, year; annotations: Stated time, Now]
Interpreting time
Variable | Definition | Coefficient | Significance
value | stated numeric value in the temporal assertion (1 to 30 in this sample) | 0.0414 | <0.001
round number | true if value is a multiple of 5 (any unit) or 6 (with months) | 0.0218 | 0.002
ln(duration) | logarithm of stated duration in days, which equals the product of unit and value | 0.150 | 0.023
gt 18 years | true if duration > 18 years, so the event should not be in the database | 0.816 | <0.001
intercept | | 0.406 | 0.416
Patient variability and sampling
Parameterizing Time
Parameterizing Time (Non-stationarity) [Figure: y-axis: rate of change, coefficient of variation, 0 to 2.5; x-axis: creatinine, glucose, sodium, potassium; series: clock, warped, sequence] Hripcsak JAMIA 2015
Parameterizing Time
Vector autoregression to decipher associations
Noisy training sets with Nigam Shah; David Sontag
Summary OHDSI international collaboration could dovetail with eMERGE Next-generation phenotyping requires understanding the EHR