White Paper
Big Data Integration and Governance Considerations for Healthcare
by Sunil Soares, Founder & Managing Partner, Information Asset, LLC
There is a lot of discussion in the press about Big Data. Big Data is traditionally defined in terms of the three Vs: Volume, Velocity, and Variety. In other words, Big Data is often characterized as high-volume, streaming, and including semi-structured and unstructured formats. Healthcare organizations have produced enormous volumes of unstructured data, such as the notes by physicians and nurses in electronic medical records (EMRs). In addition, healthcare organizations produce streaming data, such as readings from patient monitoring devices. Now, thanks to emerging technologies such as Hadoop and streams, healthcare organizations are in a position to harness this Big Data to reduce costs and improve patient outcomes. However, this Big Data has profound implications from an Information Governance perspective. In this white paper, we discuss Big Data Governance from the standpoint of three case studies.

Governance of Electronic Medical Records for Predictive Analytics

A large hospital system offered a broad range of services, including emergency care. A significant portion of the patient population was indigent. The hospital implemented a pilot program to leverage big data analytics aimed at reducing the readmission rate of patients with congestive heart failure. The objectives of the study were two-fold:

1. Reduce costs that will not be reimbursed by insurance. The United States Medicare and Medicaid programs are moving to approaches that reduce or eliminate payments for care to patients who are readmitted for the same disease.
2. Increase the quality of patient care by proactively implementing early intervention to prevent the progression of disease. Because the hospital system had limited funds for programs such as smoking cessation and home health care, it wanted these programs to be targeted at patients who were more likely to be readmitted within 30 days.
For example, if smoking was a key predictor of readmission within 30 days, then the hospital system wanted to target those patients with smoking cessation programs.

The analytics department built a predictive model in IBM SPSS based on 150 variables and 20,000 patient encounters over five years. This data was sourced from a variety of applications, including electronic medical records, the admissions system, and the cost accounting database. From a Big Data Governance perspective, the hospital had to establish a number of policies.
Data Quality

The analytics team determined that a number of variables were significant predictors of a patient's readmission rate. The team used text analytics to improve the quality of sparsely populated structured data. In this situation, IBM InfoSphere BigInsights provides strong text analytics capabilities. We discuss four of these variables below:

1. Smoking status. Smoking status is a significant factor associated with heart disease. Surprisingly, the hospital did not have a complete history of patient smoking status, including years of smoking and frequency. At the outset, only 25 percent of the structured data around smoking status was populated with binary yes/no answers. However, by using content analytics, the analytics team was able to raise the population rate for smoking status to 85 percent of patient encounters. The content analytics team was also able to unlock additional information, such as smoking duration and frequency. There were a number of reasons for this discrepancy. For example, some patients indicated that they were non-smokers, but the text analytics revealed the following from the doctor's notes:
   Patient is restless and asked for a smoking break
   Patient quit smoking yesterday
   Quit
2. Drug and alcohol abuse. The clinical team knew from experience that drug abuse and alcohol abuse were significant predictors of hospital readmission rates. Only 20 percent of the patients checked off the box at admission to indicate whether they were addicted to drugs and alcohol. However, the analytics team used unstructured data sources to identify a total of 76 percent of the encounters where patients were abusing drugs and alcohol.
3. Assisted living facility. The clinical team knew from experience that patients in assisted living facilities were more likely to take their medications as compared with patients who lived alone. However, the hospital system was not capturing this patient status information in a formalized manner.
The business intelligence team analyzed the text within discharge summaries, echocardiograms, patient histories, doctors' notes, and physicals to find that 25 percent of the patients resided in an assisted living facility. The analysis confirmed that residence in an assisted living facility did indeed reduce the likelihood that a patient would be readmitted within 30 days.
4. Pharmacology compliance indicator. Information about pharmacology compliance was critical to clinicians and case managers because it indicated the degree to which patients were taking their medications as part of a treatment plan. The business intelligence team analyzed doctors' notes and electronic medical records to populate this data.
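The smoking status example can be illustrated with a small rule-based sketch in Python. This is not the hospital's method or the IBM InfoSphere BigInsights API; the patterns, function names, and status labels below are all hypothetical, and a real rule set would need to be far richer and clinically validated. The sketch shows only the governance idea: note text can fill in a missing structured flag, and a contradiction between the flag and the notes is routed for review rather than silently overwritten.

```python
import re

# Hypothetical patterns for illustration only.
SMOKING_PATTERNS = [
    re.compile(r"\bsmok(?:e|es|ed|er|ing)\b", re.IGNORECASE),
    re.compile(r"\bcigarettes?\b", re.IGNORECASE),
    re.compile(r"\btobacco\b", re.IGNORECASE),
]
# Crude negation cues; real clinical negation detection is much harder.
NEGATION = re.compile(r"\b(?:denies|never|non-?smoker|no history of)\b", re.IGNORECASE)


def smoking_evidence(note):
    """Return 'mentioned', 'negated', or None for one free-text note."""
    if not any(p.search(note) for p in SMOKING_PATTERNS):
        return None
    return "negated" if NEGATION.search(note) else "mentioned"


def reconcile(structured_flag, notes):
    """Combine the structured yes/no flag (True, False, or None if blank)
    with evidence found in the notes."""
    evidence = [smoking_evidence(n) for n in notes]
    if "mentioned" in evidence:
        if structured_flag is False:
            # Patient said "non-smoker" but the notes suggest otherwise.
            return "conflict: review"
        return "smoking evidence in notes"
    if structured_flag is None:
        if "negated" in evidence:
            return "non-smoker (from notes)"
        return "unknown"
    return "smoker" if structured_flag else "non-smoker"


if __name__ == "__main__":
    print(reconcile(False, ["Patient is restless and asked for a smoking break"]))
    print(reconcile(None, ["Patient quit smoking yesterday"]))
```

Returning a "conflict" status instead of overwriting the structured field keeps a data steward in the loop, which fits the data quality policies described in this section.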
Metadata

The analytics team had to derive consistent definitions for key business terms. In this situation, IBM InfoSphere Business Glossary can provide the foundation for sound governance and stewardship of business terms. In our case study, the term "readmission" had at least three different definitions:

1. Clinical perspective: 30 days, all causes. The patient was readmitted to the hospital whether or not the condition was related to congestive heart failure.
2. Clinical perspective: 30 days, same diagnosis. The patient was readmitted to the hospital with a dominant ICD-9 diagnosis code related to heart failure.
3. Finance perspective: quarterly and annually. Finance had definitions of readmissions that were based on longer periods, including six to nine months.

Master Data Management

The analytics team also struggled with the lack of consistent patient data within the hospital system, caused by the proliferation of identification numbers for each patient. As a result, hospital personnel were not able to track medical events for the same patient across different facilities. As a workaround, the hospital system instituted a lengthy manual process to reconcile the medical events that related to the same patient. Consequently, the team lost significant time in retrieving a patient's medical history when he or she was readmitted to another hospital in the same system within a very short period. This could adversely affect decisions about treatment plans and clinical outcomes. In this scenario, an Enterprise Master Patient Index based on IBM InfoSphere Master Data Management can provide significant value.

Reference Data Management

ICD-9 reference data is a well-defined database with great granularity. For example, ICD-9 assigns code 428 for heart failure. Different ICD-9 codes describe details of heart failure conditions, such as 428.1 for left heart failure and 428.2 for systolic heart failure.
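The heart failure codes above also make the two clinical definitions of readmission from the Metadata discussion concrete. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the 30-day window is counted in calendar days from the discharge date and is inclusive, which the case study does not specify. It shows how the same pair of encounters can count as a readmission under one definition and not under the other.

```python
from datetime import date


def is_heart_failure(icd9_code):
    """True for ICD-9 code 428 and its subcodes (for example 428.1 or 428.2)."""
    return icd9_code.split(".")[0] == "428"


def readmitted_30d_all_causes(index_discharge, next_admit):
    """Definition 1: any admission within 30 days of the index discharge.
    Assumption: the window is inclusive and measured in calendar days."""
    days = (next_admit - index_discharge).days
    return 0 <= days <= 30


def readmitted_30d_same_diagnosis(index_discharge, next_admit, next_dx):
    """Definition 2: as above, but the dominant diagnosis of the new
    admission must be a heart failure code."""
    return (readmitted_30d_all_causes(index_discharge, next_admit)
            and is_heart_failure(next_dx))


if __name__ == "__main__":
    discharge = date(2012, 3, 1)
    admit = date(2012, 3, 20)
    # ICD-9 486 is pneumonia, so only the all-causes definition applies.
    print(readmitted_30d_all_causes(discharge, admit))
    print(readmitted_30d_same_diagnosis(discharge, admit, "486"))
```

Recording each definition under its own name, rather than under a single "readmission" flag, is one way to keep the business glossary and the analytics code aligned.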
Medical researchers at the hospital tried to analyze comorbidity (the presence of one or more diseases or disorders associated with congestive heart failure) using ICD-9 data. Being able to categorize ICD-9 codes into similar diseases helped to manage the results more effectively. The analytics team collaborated with clinicians to categorize over 21,000 ICD-9 codes into 20 disease groups. Based on this exercise, the analytics team was able to minimize the noise in their analysis and yield better clinical insights. IBM InfoSphere Master Data Management Reference Data Management Hub can support complex mappings of reference data sets in healthcare, such as ICD-9 and CPT codes.
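The grouping exercise can be sketched as a simple reference data mapping in Python. The hospital's actual 20 disease groups are not given in this paper, so the five groups and all names below are hypothetical stand-ins based on standard ICD-9 three-digit category ranges. The sketch shows how rolling detailed codes up to groups turns a long tail of individual codes into a short list of comorbidities that co-occur with heart failure.

```python
from collections import Counter

# Hypothetical, partial mapping: (first category, last category, group name).
GROUP_RANGES = [
    (250, 250, "Diabetes mellitus"),
    (401, 405, "Hypertensive disease"),
    (410, 414, "Ischemic heart disease"),
    (428, 428, "Heart failure"),
    (490, 496, "Chronic obstructive pulmonary disease"),
]


def disease_group(icd9_code):
    """Map an ICD-9 diagnosis code such as '428.1' to a disease group."""
    category = icd9_code.split(".")[0]
    if not category.isdigit():
        return "Other"  # V and E codes are not handled in this sketch
    number = int(category)
    for first, last, name in GROUP_RANGES:
        if first <= number <= last:
            return name
    return "Other"


def comorbidity_counts(encounters):
    """Count the disease groups that appear alongside heart failure.
    `encounters` is a list of lists of ICD-9 codes, one list per encounter."""
    counts = Counter()
    for codes in encounters:
        groups = {disease_group(code) for code in codes}
        if "Heart failure" in groups:
            counts.update(groups - {"Heart failure"})
    return counts


if __name__ == "__main__":
    sample = [["428.0", "250.00", "401.9"], ["428.1", "250.02"], ["486"]]
    for group, n in comorbidity_counts(sample).most_common():
        print(group, n)
```

In practice the mapping table itself would be governed reference data, maintained with clinicians and versioned, rather than a constant embedded in analytics code.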
Governance of Time Series Data in a Neonatal Intensive Care Unit

A hospital leveraged IBM InfoSphere Streams to monitor the health of newborn babies in its neonatal intensive care unit. Using IBM InfoSphere Streams, the hospital was able to predict the onset of nosocomial (hospital-acquired) infection a full 24 hours earlier by identifying the onset of very slight symptoms. From a Big Data Governance perspective, the hospital had to establish multiple policies.

Data Quality

The application depended on large volumes of time series data. However, readings were sometimes missing because a patient's movement caused a lead (a monitor attached to the baby's skin) to disengage and discontinue readings. In these situations, IBM InfoSphere Streams applied linear and polynomial regressions to historical readings to fill in the gaps in the time series data.

Information Lifecycle Management

The hospital also tagged all time series data that had been modified by IBM InfoSphere Streams. In the event of a lawsuit or medical inquiry, the hospital would be able to produce both the original and the modified readings. In this situation, IBM InfoSphere Optim Data Growth Solution could potentially reduce data storage costs through archival and compression techniques.

Privacy

The hospital also established policies around safeguarding protected health information (PHI) pertaining to the time series data. In this scenario, IBM InfoSphere Optim Data Masking can mask PHI within non-production environments such as development and test. In addition, IBM InfoSphere Guardium can potentially monitor access to PHI by privileged users, such as database administrators.

Improving Confidence in Predictive Pathways for Disease

The United States Centers for Disease Control and Prevention (CDC) estimates that nearly 8 percent of Americans have diabetes and another 60 million have prediabetes. As a result, medical intervention has become an imperative rather than an option for health insurers.
However, nearly one-third of those who meet the criteria for diabetes do not know they have the disease.
Health plans are now able to uncover such insights with rapid and accurate patient scoring for diabetes based on multivariate analysis of very large datasets. Provider, facility, pharmacy, and enrollment data can be analyzed simultaneously for multiple variants, yielding a highly accurate and detail-rich statistical array with billions of rows of data. New member records can be matched against the disease profiles generated by the comprehensive analytics to help wellness companies reach out to patients as early as possible in the disease progression. This solution relies on a highly scalable platform based on IBM PureData System for Analytics. It also depends on consistent master data for members, providers, and pharmacies based on IBM InfoSphere Master Data Management. Finally, IBM InfoSphere Business Glossary offers a strong foundation of consistent business terms.

IBM InfoSphere provides a robust platform for Big Data Integration and Governance. For more information, visit www.ibm.com/software/data/infosphere.

About the Author

Sunil Soares is the founder and managing partner of Information Asset, LLC, a consulting firm that specializes in helping organizations build out their Data Governance programs. Prior to this role, Sunil was the Director of Information Governance at IBM, where he worked with clients across six continents and multiple industries. Sunil has written four books about Information Governance, including The IBM Data Governance Unified Process, Selling Information Governance to the Business, Big Data Governance, and IBM InfoSphere: A Platform for Big Data Governance and Process Data Governance. The second and third books have dedicated chapters on healthcare.

The following terms are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both: IBM, BigInsights, Guardium, InfoSphere, Optim, PureData, and SPSS.
A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml.