Environmental Health Science Data Streams
Health Data
Brian S. Schwartz, MD, MS
January 10, 2013
When is a data stream not a data stream? When it is health data.
  EHR data = PHI of the health system
  Data stream: IRB approval, data pull (by IT), data transfer (to researchers), data cleaning, variable creation (phenotyping of patients), data merging, environmental metrics, data analysis (computationally intensive; person, place, time)
Using EHR Data: An Example
Using longitudinal EHR data, how do we know if a patient has diabetes?
  When does observation of the patient begin?
    With EHR data alone, enrollment cannot be determined (that requires health plan data)
  When did the patient's diabetes begin?
  How do we distinguish type 1 from type 2?
  What does it mean if an HbA1c level exists, or diabetes treatment began, before any ICD-9 code for diabetes?
  How do we define diabetes severity?
  How do we avoid confounding by indication?
    In observational studies of drug effects, drugs are not assigned randomly; the indication for treatment may be related to the risk of future health outcomes
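The question of when a patient's diabetes began can be made concrete with a small sketch. The code below is an illustration only, not the method used in the work described: it looks for the earliest evidence of diabetes among three kinds of events and reports how far that evidence precedes the first ICD-9 code. The event layout, the HbA1c cutoff of 6.5%, and the short medication list are all assumptions chosen for the example.

```python
from datetime import date

# Assumption: an HbA1c of 6.5% or higher counts as evidence of diabetes.
HBA1C_THRESHOLD = 6.5
# Assumption: a deliberately tiny, illustrative medication list.
DIABETES_DRUGS = {"metformin", "insulin", "glipizide"}


def is_diabetes_evidence(kind, value):
    """Return True if a single event counts as evidence of diabetes."""
    if kind == "icd9":
        return str(value).startswith("250")  # ICD-9 250.xx = diabetes mellitus
    if kind == "hba1c":
        return float(value) >= HBA1C_THRESHOLD
    if kind == "rx":
        return str(value).lower() in DIABETES_DRUGS
    return False


def summarize_onset(events):
    """Find the earliest qualifying event of each kind and overall.

    events: iterable of (event_date, kind, value) tuples (hypothetical layout).
    Returns None when no event qualifies.
    """
    first = {}
    for event_date, kind, value in events:
        if is_diabetes_evidence(kind, value):
            if kind not in first or event_date < first[kind]:
                first[kind] = event_date
    if not first:
        return None
    earliest_kind = min(first, key=first.get)
    first_code = first.get("icd9")
    lag = (first_code - first[earliest_kind]).days if first_code else None
    return {
        "first_by_kind": first,
        "earliest_kind": earliest_kind,
        "earliest_date": first[earliest_kind],
        "days_before_first_icd9": lag,
    }


events = [
    (date(2007, 1, 10), "hba1c", 5.4),      # below the cutoff, ignored
    (date(2008, 3, 1), "hba1c", 6.9),       # lab evidence comes first
    (date(2008, 5, 15), "rx", "Metformin"),
    (date(2008, 9, 1), "icd9", "250.00"),   # the code arrives last
]
summary = summarize_onset(events)
print(summary["earliest_kind"], summary["days_before_first_icd9"])
```

In this made-up record the laboratory value precedes the diagnosis code by about six months, which is the situation the slide asks about. A real phenotype definition would also need to handle type 1 versus type 2, medications prescribed for other indications, and the start of observation.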
The Natural History of Diabetes
[Figure: timeline of stages: Healthy, Pre-diabetes, Diabetes, Complications. FBS cut points marked at 100 mg/dL and 125 mg/dL. Events along the timeline: HbA1c (screening), ICD-9 code, HbA1c (monitoring), Rx.]
[Figure: sequence of HbA1c (pre-therapeutic), 1st ICD-9 diabetes code, HbA1c (post-ICD-9), HbA1c (last-ever). Labels in slide order: mean duration = 159 d; mean duration = 117 d; mean duration = 1534 d; n = 7337, mean = 7.51%; n = 17,959, mean = 7.64%; mean duration = 1732 days.]
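The two fasting blood sugar (FBS) cut points on this slide, 100 and 125 mg/dL, can be written as a simple staging rule. This is a minimal sketch under the assumption that values from 100 through 125 mg/dL fall in the pre-diabetes range and values above 125 mg/dL in the diabetes range; the function name is hypothetical, and a single measurement is not a diagnosis.

```python
def classify_fasting_glucose(fbs_mg_dl):
    """Map one fasting blood sugar value (mg/dL) to a stage label.

    Assumed boundaries: below 100 is healthy, 100 through 125 is
    pre-diabetes, above 125 is diabetes.
    """
    if fbs_mg_dl < 100:
        return "healthy"
    if fbs_mg_dl <= 125:
        return "pre-diabetes"
    return "diabetes"


for value in (92, 100, 125, 126):
    print(value, classify_fasting_glucose(value))
```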
NIH Research Collaboratory
Council of Councils Meeting, July 1, 2010. NIH-HMORN Collaboratory: Common Fund Proposal
  Purpose: The NIH-HMORN Collaboratory will enhance and strengthen a research platform to accelerate large epidemiology studies, pragmatic clinical trials, and EHR-enabled health care delivery research by leveraging the HMORN's scientific, data, and operational infrastructure.
  Limited-competition U54 RFP released 2-17-2011, then changed.
  Duke Clinical Research Institute awarded $9M from NIH to serve as the Coordinating Center for NIH's Health Care Systems Research Collaboratory (9/25/12 press release)
  The goal of the Collaboratory is to involve clinicians and patients in the design and interpretation of trials, provide the education needed to enhance the value of their participation, and use the data collected during healthcare delivery as the core data source for the full spectrum of clinical research, from registries to observational studies and pragmatic randomized controlled trials.
Virtual Data Warehouse
NIH-HMORN Collaboratory included a goal to develop a VDW
  The objectives of this initiative are to improve data quality; enable cross-site and cross-project synergies; balance site-, project-, and network-level priorities; and reduce the preparatory work needed to assemble cohorts, count events, and capture exposure and co-morbidity data, all in support of an array of different types of studies.
HMORN members have been working on a VDW
  An internal website provides metadata (years, variable descriptions, labels, formats, definitions, specification, coding)
  HMORN has developed guidelines and policies to facilitate research, but control resides at each member site
  Efforts to: write programs to extract and convert variables stored in legacy information systems to common standards; test standardized data for consistency and accuracy; standardize methods by providing macros and programs that are used across sites; provide instructions on how to use the VDW to create analytic files for research
The VDW
Not a centralized data warehouse; it consists of parallel, identical databases at each HMORN site, to facilitate merging across sites
It is not an analytic dataset, but does facilitate creation of such
As of March 2011, VDW data domains include:
  Demographics: date of birth, gender, race and ethnicity
  Enrollment: health plan membership enrollment, with insurance types, benefits, effective dates of coverage
  Encounters: OPT, IPT, with associated diagnosis and procedure codes, type of encounter, provider seen, facility and discharge disposition
  Procedures: performed procedures (e.g., surgery, lab, radiology, immunization); various coding systems (CPT, HCPCS, ICD-9, insurance claims Revenue Codes)
  Diagnoses: dates, diagnosis codes, provider
  Providers: specialty, age, gender, race and year graduated
  Cancer/Tumor Registry: Surveillance, Epidemiology and End Results (SEER) program standards; most complex domain of the VDW
  Pharmacy Dispensing: date, National Drug or GPI code, therapeutic class, days supply, and amount dispensed
  Vital Signs: height, weight, blood pressure, tobacco use and type
  Laboratory Values: originally HbA1c, S-Cr, INR, FBG, serum K; values are being added through a timed priority list of 57 types of lab tests
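The idea of parallel, identical databases at each site can be sketched as follows: because every site shares one schema, the same query text can run locally at each site and the results can be stacked with a site label. This is an illustration under stated assumptions, not the actual VDW specification. The table names, column names, in-memory SQLite databases, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical two-table schema, identical at every site.
SCHEMA = """
CREATE TABLE demographics (patient_id TEXT PRIMARY KEY, birth_date TEXT, gender TEXT);
CREATE TABLE diagnoses (patient_id TEXT, dx_date TEXT, dx_code TEXT);
"""

# One query text, written once and run unchanged at each site.
COHORT_QUERY = """
SELECT d.patient_id, MIN(x.dx_date) AS first_dx_date
FROM demographics AS d
JOIN diagnoses AS x ON x.patient_id = d.patient_id
WHERE x.dx_code LIKE '250%'
GROUP BY d.patient_id
ORDER BY d.patient_id
"""


def make_site_db(demographics, diagnoses):
    """Build one site's local database (in memory, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO demographics VALUES (?, ?, ?)", demographics)
    conn.executemany("INSERT INTO diagnoses VALUES (?, ?, ?)", diagnoses)
    return conn


def run_at_all_sites(sites, query):
    """Run the same query at each site and tag every row with its site."""
    rows = []
    for site_name, conn in sites.items():
        for patient_id, first_dx_date in conn.execute(query):
            rows.append((site_name, patient_id, first_dx_date))
    return rows


sites = {
    "site_a": make_site_db(
        [("a1", "1950-02-01", "F"), ("a2", "1961-07-12", "M")],
        [("a1", "2009-04-03", "250.00"),
         ("a1", "2008-11-20", "250.02"),
         ("a2", "2010-01-05", "401.9")],
    ),
    "site_b": make_site_db(
        [("b1", "1948-09-30", "M")],
        [("b1", "2007-06-14", "250.00")],
    ),
}
combined = run_at_all_sites(sites, COHORT_QUERY)
print(combined)
```

Keeping the query separate from the data mirrors the point that control stays at each member site: only the program travels, and each site decides what result rows leave.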
In multisite studies, site-level differences in disease incidence, predictive variables, and health outcomes can represent:
  True small-area variation in practice patterns and outcomes
  Variability in data collection methods across sites
Data quality assessments across sites are a critical first step in multisite studies
Kahn et al., Medical Care, 2012
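One simple form a cross-site data quality check could take is sketched below: compute the same proportion at every site and flag sites that sit far from the median. This is an assumption-laden illustration, not the procedure from the cited paper. The metric (share of patients with a recorded value), the tolerance of 0.15, and the counts are invented for the example.

```python
from statistics import median


def site_proportions(counts):
    """counts: {site: (numerator, denominator)} -> {site: proportion}."""
    return {site: num / den for site, (num, den) in counts.items() if den > 0}


def flag_outlier_sites(counts, tolerance=0.15):
    """Return sites whose proportion differs from the cross-site median
    by more than `tolerance` (an arbitrary, illustrative cutoff)."""
    props = site_proportions(counts)
    center = median(props.values())
    return sorted(site for site, p in props.items() if abs(p - center) > tolerance)


# Hypothetical counts: patients with a recorded lab value / eligible patients.
counts = {
    "site_a": (820, 1000),
    "site_b": (790, 950),
    "site_c": (310, 900),
    "site_d": (1275, 1500),
}
flagged = flag_outlier_sites(counts)
print(flagged)
```

A flag from a check like this only says that a site looks different. Deciding whether the difference reflects true small-area variation or a data collection artifact still requires looking at how that site captures the data.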
1) Type of data
  HEALTH data from electronic health records
  At Geisinger, 400K+ primary care patients, hundreds of millions of records; many kinds of health information
2) What is the current status of data collection/archiving?
  Most patient health information will be collected electronically in the coming years
  There is no single repository for US health data
  Health systems most often use programs such as Epic; they then export data from Epic to a data warehouse for easier access, and export from the warehouse for analysis
  There is no centralized warehouse; there are mechanisms for gaining access; there is no centralized catalog
3) Nontechnical aspects of sharing
  These data are not public; they can be accessed after agreements are in place, most often in collaborative research relationships; there is no routine sharing
  Creating a single national repository of EHR data would be a daunting task
4) Standardization in description of the data
  Many types of data: dates, encounters, diagnoses, ICD-9 and CPT codes, laboratory test codes with results, procedure test codes sometimes with results, physician orders, medications, imaging
  Variation across providers, clinics, health systems
  I do not believe there are as yet many ontology or metadata standards; text searching is necessary, and natural language processing is in early development
5) Movement and ability to combine with other data
  The health data are for INDIVIDUAL patients; individual patients cannot be directly linked to family members
  Health data can be linked to other data by location (generally residential address) and date (space and time)
  In general, approaches to analysis of EHR data are on a study-by-study basis
  Data have to be accessed, exported, used to create analytic variables, merged with other data, and analyzed
  Epic has some analysis tools; in general we export data and use biostatistical software programs
6) Specific example: scientific question limited by integration challenges
  As long as other data have meaning in space and time, there should not be obstacles to integration
  We have to acknowledge that we may not be able to get what we actually want; we use surrogates for exposure
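Linking health data to other data by location and date can be sketched as a lookup on a (place, day) key. The example below is a minimal illustration under several assumptions: the residential address has already been geocoded to an area identifier, the environmental metric is available per area per day, and the field names and values are hypothetical.

```python
from datetime import date


def link_by_space_and_time(encounters, metrics):
    """Attach an environmental metric to each encounter.

    encounters: list of dicts with hypothetical keys "patient_id", "area", "date".
    metrics: dict mapping (area, date) -> metric value.
    Encounters with no matching metric get exposure None rather than
    being dropped, so missing linkage stays visible.
    """
    linked = []
    for enc in encounters:
        key = (enc["area"], enc["date"])
        linked.append({**enc, "exposure": metrics.get(key)})
    return linked


# Hypothetical daily metric by area (for example, an air quality surrogate).
metrics = {
    ("area_17", date(2010, 7, 4)): 41.5,
    ("area_17", date(2010, 7, 5)): 38.0,
    ("area_22", date(2010, 7, 4)): 12.3,
}
encounters = [
    {"patient_id": "p1", "area": "area_17", "date": date(2010, 7, 5)},
    {"patient_id": "p2", "area": "area_22", "date": date(2010, 7, 4)},
    {"patient_id": "p3", "area": "area_30", "date": date(2010, 7, 4)},
]
linked = link_by_space_and_time(encounters, metrics)
print([row["exposure"] for row in linked])
```

The value attached here is an area-level surrogate for exposure, not a measurement on the individual patient, which matches the caveat in item 6.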
Thank you for listening Second Presentation ENDS HERE