Exploring the Challenges and Opportunities of Leveraging EMRs for Data-Driven Clinical Research

White Paper

Electronic medical records (EMRs) can facilitate faster and cheaper clinical research investigations. By collecting diagnostic, intervention, and outcomes data at all levels of care and across time, EMRs capture a richer picture of clinical effects, relationships, efficiency, and more. As clinical reliance on EMR systems increases, the question remains of how best to leverage EMR data for research purposes.

Data-driven research relying on EMR data must address two general areas of concern: data quality and data accessibility. These concerns are the ones generally associated with secondary and exploratory analyses, and they take particular forms because EMRs are fundamentally designed as tools for patient care, not research. Managing them is becoming increasingly important as focus moves away from well-defined clinical variables (e.g., pulmonary function test score, mortality) and toward more complex care concepts. However, doing so can shift a researcher's attention away from scientific pursuits and onto data transformation tasks. Ultimately, this distraction must be mitigated with informatics and analyst tools that lie beyond off-the-shelf, commercial database software.

Data Quality and Validity

EMR-based research faces a number of data quality issues, largely those associated with secondary and observational analyses. Structured and unstructured data (e.g., discrete diagnostic codes and practitioner notes, respectively) are largely entered into EMR systems by care providers. The reliability of this data therefore depends on the precision, accuracy, and overall rigor of that data collection effort, and EMR data is not immune to relatively straightforward issues like human error and missingness.
More complex quality issues can stem from non-uniformity in the use of EMRs across healthcare networks, care settings, and individual practitioners. For example, younger physicians tend to use EMR platforms more thoroughly, as do healthcare networks serving higher-income populations [1]. Consequently, EMR data quality can be confounded by geography, socioeconomic status, and more. These factors pose major threats to the generalizability (i.e., external validity) of a research study's results. A researcher must therefore be cognizant of both inherent patient-to-patient variability and potentially significant practitioner-to-practitioner variability in data quality.

[1] J. Lin, T. Jiao, J.E. Biskupiak, & C. McAdam-Marx. Application of electronic medical record data for health outcomes research: A review of recent literature. Expert Rev. Pharmacoecon. Outcomes Res. 13(2), 191-200 (2013).

EMR-based data-driven research must also consider construct validity, i.e., whether a set of values in an EMR database actually represents the phenomena of interest to a particular research project. Because symptoms, diagnoses, and treatment components can overlap between health issues (as do their codes), researchers must find a way to distinguish which data is relevant to their individual needs. Similarly, EMR data reflects what comes up in an exchange between provider and patient, meaning information relevant to a research question may not be fully represented in EMRs. Furthermore, clinically meaningful data points are often not of the resolution preferred by researchers. For example, a patient reporting that she is experiencing pain is actionable information clinically, whereas a pain-related research study would typically have asked the patient to report pain level on a standardized 1-10 scale [2].

Accessing Meaningful Information

The task of actually extracting research-grade data from potentially fractured EMR databases is itself nontrivial. Many recent publications have relied on well-defined clinical outcomes (e.g., occurrence of a cardiac event) and covariates (e.g., vitals). These sorts of analyses take advantage of structured data, which either follows a set format (e.g., numeric values for weight) or takes one of a discrete set of values (e.g., a choice from a drop-down list). However, up to 70% of clinically useful information is recorded in unstructured fields, such as physician notes entered into text boxes [1]. Such unstructured clinical data has become available to researchers faster than optimal methods to leverage it have been developed. This is largely due to the unrestricted nature of free text [3]: particularly in the presence of human error and syntax issues (e.g., acronym use, tense changes), reliable and comprehensive querying can be a major undertaking.
Straightforward methods to manage such information accessibility challenges, such as having a subject matter expert annotate the free text, are expensive in terms of person-hours and do not scale. Furthermore, even when focusing on structured data, EMR databases are designed to optimize queries that are patient-centric, not attribute-centric. This means that queries are optimized to return, for example, a list of patients seen by a practitioner on a given day, rather than data on patients who experienced a specific set of symptoms and received a certain treatment. Consequently, queries to obtain focused research datasets can become computationally more demanding. This is particularly true if the queries involve logic or rule-based searches, such as returning data on patients whose baseline blood pressure fell within a certain range.

[2] S. Muller. Electronic medical records: The way forward for primary care research? Family Practice. 31(2), 127-129.
[3] P.M. Nadkarni, L. Ohno-Machado, & W.W. Chapman. Natural language processing: An introduction. J. Am. Med. Inform. Assoc. 18, 544-551 (2011).
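To make the distinction between patient-centric and attribute-centric queries concrete, the following is a minimal sketch in Python using an in-memory SQLite database. The `visits` table, its columns, the sample rows, and the function names are all hypothetical and greatly simplified; they do not reflect the schema of any real EMR system. "Baseline" is assumed here to mean the reading at a patient's earliest recorded visit.

```python
import sqlite3

# Hypothetical sample data: (patient_id, visit_date, provider, systolic_bp, treatment)
ROWS = [
    (1, "2014-01-05", "Dr. A", 150, None),
    (1, "2014-02-10", "Dr. A", 138, "drug_x"),
    (2, "2014-01-05", "Dr. A", 118, None),
    (2, "2014-03-01", "Dr. B", 152, "drug_x"),
    (3, "2014-01-20", "Dr. B", 145, None),
    (3, "2014-02-15", "Dr. B", 140, "drug_y"),
]


def build_db():
    """Create an in-memory database with a simplified, assumed visits table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits (patient_id INTEGER, visit_date TEXT, "
        "provider TEXT, systolic_bp INTEGER, treatment TEXT)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def patients_seen(conn, provider, visit_date):
    """Patient-centric query: who did this provider see on this day?"""
    sql = (
        "SELECT DISTINCT patient_id FROM visits "
        "WHERE provider = ? AND visit_date = ? ORDER BY patient_id"
    )
    return [row[0] for row in conn.execute(sql, (provider, visit_date))]


def cohort_by_baseline_bp(conn, low, high, treatment):
    """Attribute-centric query: patients whose baseline (earliest-visit)
    systolic pressure lies in [low, high] and who ever received `treatment`."""
    sql = """
        WITH baseline AS (
            SELECT v.patient_id, v.systolic_bp
            FROM visits v
            JOIN (SELECT patient_id, MIN(visit_date) AS first_date
                  FROM visits GROUP BY patient_id) f
              ON v.patient_id = f.patient_id AND v.visit_date = f.first_date
        )
        SELECT b.patient_id
        FROM baseline b
        WHERE b.systolic_bp BETWEEN ? AND ?
          AND EXISTS (SELECT 1 FROM visits t
                      WHERE t.patient_id = b.patient_id AND t.treatment = ?)
        ORDER BY b.patient_id
    """
    return [row[0] for row in conn.execute(sql, (low, high, treatment))]


if __name__ == "__main__":
    conn = build_db()
    print(patients_seen(conn, "Dr. A", "2014-01-05"))
    print(cohort_by_baseline_bp(conn, 140, 160, "drug_x"))
```

The first query is a simple filter on two columns. The second must first derive each patient's baseline visit and then combine a range rule with a treatment condition across that patient's other visits, which illustrates why rule-based research queries tend to cost more on a store organized around individual encounters.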

Unlocking the Power of EMR Data for Clinical Research

Faced with significant data quality and accessibility issues, how can the promise of EMRs for faster and cheaper clinical research be realized? The answer lies in emerging methods and customized tools that mitigate the data transformation demands placed on researchers, which could otherwise disrupt the actual pursuit of research.

In terms of managing unstructured data, structure is not necessarily the answer. As unstructured text fields tend to capture some of the most clinically relevant information, and as the archive of such EMR data grows, forcing structure risks losing valuable information. One computational approach, natural language processing (NLP), is increasingly being relied upon for processing unstructured EMR data. NLP brings together concepts from statistics, computer science, engineering, and clinical research to develop algorithms that automatically learn which information within unstructured text is important. This requires the ability to detect where words begin and end, group phrases into concepts, aggregate the most meaningful information into usable quantities, and much more, all despite human error (e.g., misspellings) and complex syntax issues (e.g., abbreviations) within free text. While promising, the computational demands of NLP algorithms can approach the level of IBM's Watson computer, and efforts to introduce them cost-effectively into clinical research settings are therefore ongoing.

One approach for addressing data quality issues in both structured and unstructured data is to merge EMR datasets with those from other sources. Merging EMR datasets with medical claims data or pharmacy records, for example, can validate that a patient was prescribed a certain treatment. Similarly, identifying where an EMR database overlaps with medical registry information can yield a subset of patients for whom some EMR data can be validated.
Such merging efforts could even result in a richer dataset than either source provides individually. However, coherently merging datasets often entails intensive legwork, as use of EMR software remains disparate across healthcare networks, clinical settings, and providers.

Given the legwork required to merge and curate databases, independent but overlapping efforts to do so are inefficient in a larger research context and highlight an opportunity to increase research productivity. With this in mind, an Architecture for Research Computing in Health (ARCH) strategy centralizes EMR, biobank, claims, electronic data capture, and other available data from diverse sources at an institutional level. By aggregating, organizing, and curating merged datasets at this level and producing local, customized datasets for individual research efforts, the data transformation burden is lifted from researchers and overall efficiency improves.
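The cross-source validation described earlier in this section, in which an EMR prescription is checked against claims or pharmacy records, can be sketched roughly as follows. This is only an illustration under strong assumptions: the record layout, the field names (`patient_id`, `drug`), and the function name are hypothetical, and both sources are assumed to share a patient identifier. Real record linkage is usually much harder.

```python
def validate_prescriptions(emr_rows, claims_rows):
    """Label each EMR prescription as confirmed or unconfirmed by claims data.

    Both inputs are lists of dicts with hypothetical keys 'patient_id' and
    'drug'. Drug names are compared after trimming whitespace and lowercasing,
    a deliberately crude stand-in for proper terminology mapping.
    """
    claimed = {
        (row["patient_id"], row["drug"].strip().lower()) for row in claims_rows
    }
    labeled = []
    for row in emr_rows:
        key = (row["patient_id"], row["drug"].strip().lower())
        status = "confirmed" if key in claimed else "unconfirmed"
        # Copy the EMR row and attach the validation result.
        labeled.append({**row, "claims_status": status})
    return labeled


if __name__ == "__main__":
    emr = [
        {"patient_id": 1, "drug": "Drug_X"},
        {"patient_id": 2, "drug": "drug_y"},
    ]
    claims = [
        {"patient_id": 1, "drug": " drug_x "},
        {"patient_id": 3, "drug": "drug_y"},
    ]
    for record in validate_prescriptions(emr, claims):
        print(record)
```

Rows labeled as confirmed form the validated subset mentioned above. An unconfirmed row is not necessarily wrong: it may simply lack a matching claim, which is why a merge of this kind supports validation rather than replacing judgment about data quality.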

An ARCH strategy requires an informatics infrastructure not offered by off-the-shelf database platforms. RexDB by Prometheus Research is a customizable data repository specifically designed with clinical research in mind. The platform accepts clinical data from diverse sources as inputs, transforms it into usable forms, and provides localized, investigation-specific tools, datasets, and reports. These capabilities have been tested in a clinical setting: RexDB is the basis of a shared database infrastructure in a partnership with Weill Cornell Medical College and New York Presbyterian Hospital (NYPH). Data from Epic EMR systems, Profiler Biobank, and Allscripts systems at the Center for Advanced Digestive Care, along with data from CompuRecord, Epic, and Allscripts systems from NYPH's anesthesiology department, are loaded into a RexDB pipeline that aggregates and transforms the raw data to provide customized datasets for researchers as needed [4].

Subtleties and complexities associated with both clinical phenomena and raw EMR data itself make off-the-shelf data management platforms suboptimal tools for increasingly complex research. RexDB, however, is fundamentally built around the goal of facilitating data-driven clinical analyses. Its flexibility offers a scalable, customizable informatics infrastructure for diverse clinical research projects at both laboratory and institutional levels. The database configuration, straightforward querying, and automated reporting capabilities of the RexDB suite are also backed by a team of analysts who support a researcher's data processing needs from beginning to end. Altogether, this technological and analyst toolbox takes on the legwork of turning raw EMR data into usable forms for clinical studies. Despite the challenges of working with EMR data, RexDB lets researchers focus on research.

Ultimately, designing a good EMR-based data-driven study is not enough.
Care must be taken when implementing efforts to obtain research-quality datasets from EMRs. As interest in using EMRs for research purposes grows, so do demands for the tools to facilitate such efforts.

[4] S.B. Johnson, T.R. Campion, N.E. Pegoraro, L. Rozenblit, C. Tirrell, & C.L. Cole. An institutional strategy to support clinical research with centrally managed custom data repositories. American Medical Informatics Association 2014 Annual Symposium. Poster presentation (2014).

Additional Resources

For this and other white papers, academic presentations, and publications by Prometheus Research, please visit: www.prometheusresearch.com

RexDB is a registered trademark of Prometheus Research, LLC. Copyright 2015. All rights reserved.