Improvement of the quality of medical databases: data-mining-based prediction of diagnostic codes from previous patient codes



Similar documents
Grégoire FICHEUR a,1, Lionel FERREIRA CAREIRA a, Régis BEUSCART a and Emmanuel CHAZARD a. Introduction

Improvement of Diagnosis Coding by Analysing EHR and Using Rule Engine: Application to the Chronic Kidney Disease

Approach to Extract Billing Data from Medical Documentation in Russia - Lessons Learned

Practical Implementation of a Bridge between Legacy EHR System and a Clinical Research Environment

Using EHRs for Heart Failure Therapy Recommendation Using Multidimensional Patient Similarity Analytics

Change is Coming in 2014! ICD-10 will replace ICD-9 for Diagnosis Coding

Clinical Decision Support using a Terminology Server to improve Patient Safety

Victorian Additions to the Australian Coding Standards

Research Opportunities using the PaTH Network

Big Data Analytics and Healthcare

From Lawrence Weed to Apex and ICD-10. Paul Brakeman MD PhD Physician Lead for Provider Practice Change and Education

Investigating Clinical Care Pathways Correlated with Outcomes

ICD-10-CM Training Module for Dental Practitioners. Presented by Workgroup for Electronic Data Interchange

ICD-10 Testing Empowering & Enhancing the ICD 10 Transition

PharmaSUG2011 Paper HS03

Practical Development and Implementation of EHR Phenotypes. NIH Collaboratory Grand Rounds Friday, November 15, 2013

Electronic Medical Record Mining. Prafulla Dawadi School of Electrical Engineering and Computer Science

Detection of Heart Diseases by Mathematical Artificial Intelligence Algorithm Using Phonocardiogram Signals

USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES

Assessing Data Quality for Healthcare Systems Data Used in Clinical Research (Version 1.0)

Workflow framework for mining Diagnostic rules from ehrs Roxana Danger

ICD-10 Frequently Asked Questions

What is your level of coding experience?

Design of a Multi Dimensional Database for the Archimed DataWarehouse

Cardiovascular Endpoints

Malmö Preventive Project. Cardiovascular Endpoints

A Service Oriented Approach for Guidelines-based Clinical Decision Support using BPMN

Data and Analysis. Informatics 1 School of Informatics, University of Edinburgh. Part III Unstructured Data. Ian Stark. Staff-Student Liaison Meeting

LEADING-EDGE Cardiovascular Care

GENETIC DATA ANALYSIS

Graphical representation of the comprehensive patient flow through the Hospital

Tertiary Use of Electronic Health Record Data. Maggie Lohnes, RN, CPHIMS, FHIMSS VP Provider Relations Anolinx, LLC October 26, 2015

Kaiser Permanente Implant Registries

Designing ETL Tools to Feed a Data Warehouse Based on Electronic Healthcare Record Infrastructure

Electronic Health Record (EHR) Data Analysis Capabilities

Integrating Medical and Research Information: a Big Data Approach

Electronic Health Records Next Chapter: Best Practices, Checklists, and Guidelines ICD-10 and Small Practices April 30, 2014.

Data Warehousing and Data Mining in Business Applications

MISSING DATA ANALYSIS AMONG PATIENTS IN THE PINNACLE REGISTRY

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

Chapter 12 Analyzing Spaghetti Processes

Reasoning / Medical Context / Physicians Area of Competence

An Application of Inverse Reinforcement Learning to Medical Records of Diabetes Treatment

Appendix A WORK PROCESS SCHEDULE HIM (HEALTH INFORMATION MANAGEMENT) HOSPITAL CODER O*NET-SOC CODE: RAPIDS CODE: TBD

Coding with. Snayhil Rana

A NURSING CARE PLAN RECOMMENDER SYSTEM USING A DATA MINING APPROACH

SNOMED GP/FP RefSet and ICPC mapping project Seminar. Marc Jamoulle

SECTION 4 COSTS FOR INPATIENT HOSPITAL STAYS HIGHLIGHTS

A.I. in health informatics lecture 1 introduction & stuff kevin small & byron wallace

Specialty Drug Care: Case management services in Quebec

Home Health Care ICD-10-CM Coding Tip Sheet Overview of Key Chapter Updates for Home Health Care and Top 20 codes

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

General Wellness: Policy for Low Risk Devices. Draft Guidance for Industry and Food and Drug Administration Staff

Ch Medical Coding and Claims. A bill sent to the insurance carrier for payment related to patient care

ICD-9 Basics Study Guide

Recognition and Privacy Preservation of Paper-based Health Records

Course Requirements for the Ph.D., M.S. and Certificate Programs

Patient Similarity-guided Decision Support

Discover more, discover faster. High performance, flexible NLP-based text mining for life sciences

Clintegrity 360 QualityAnalytics

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS

DATA MINING AND REPORTING IN HEALTHCARE

FY2015 Final Hospital Inpatient Rule Summary

All Patient Refined DRGs (APR-DRGs) An Overview. Presented by Treo Solutions

The ICD-9-CM uses an indented format for ease in reference I10 I10 I10 I10. All information subject to change

ICD-10: Prepare, Implement, and Train

medexter clinical decision support

Environmental Health Science. Brian S. Schwartz, MD, MS

Getting Ready for ICD-10. Part 2: ICD-10 Coding

510(k) Summary May 7, 2012

EHR System MojTermin: Implementation and Initial Data Analysis

Cardiovascular Endpoints

ICD-10: An Area We Call. Presented by Harry Goldsmith, DPM

WHITE PAPER. QualityAnalytics. Bridging Clinical Documentation and Quality of Care

Information Governance includes the Core Record Set for Coding Compliance Bonnie S. Cassidy, MPA, RHIA, FHIMSS

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Healthcare Measurement Analysis Using Data mining Techniques

Transcription:

Digital Healthcare Empowering Europeans R. Cornet et al. (Eds.) 2015 European Federation for Medical Informatics (EFMI). This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License. doi:10.3233/978-1-61499-512-8-419 Improvement of the quality of medical databases: data-mining-based prediction of diagnostic codes from previous patient codes Mehdi DJENNAOUI a,1, Grégoire FICHEUR a, Régis BEUSCART a and Emmanuel CHAZARD a a Department of Public Health, EA 2694, University of Lille, F-59000 Lille, France. 419 Abstract. Introduction. Diagnoses and medical procedures collected under the French system of information are recorded in a nationwide database, the PMSI national database, which is accessible for exploitation. Quality of the data in this database is directly related to the quality of coding, which can be of poor quality. Among the proposed methods for the exploitation of health databases, data mining techniques are particularly interesting. Our objective is to build sequential rules for missing diagnoses prediction by data mining of the PMSI national database. Method. Our working sample was constructed from the national database for years 2007 to 2010. The information retained for rules construction were medical diagnoses and medical procedures. The rules were selected using a statistical filter, and selected rules were validated by case review based on medical letters, which enabled to estimate the improvement of diagnoses recoding. Results. The work sample was made of 59,170 inpatient stays. The predicted ICD codes were E11 (), I48 (atrial fibrillation and flutter) and I50 (heart failure).we validated three sequential rules with a substantial improvement of positive predictive value: {E11,I10,DZQM006}=>{E11} {E11,I10,I48}=>{E11} {I48,I69}=>{I48} Discussion. We were able to extract by data mining three simple, reliable and effective sequential rules, with a substantial improvement in diagnoses recoding. The results of our study indicate the opportunity to improve the data quality of the national database by data mining methods. Keywords. Electronic Health Records, Decision Support Techniques, Data mining, Nationwide Database. Introduction The main purpose of electronic health records (EHRs) is to support health care of the patient. However, they are increasingly used by medical research as part of data reuse [1-3]. Parts of EHR data are generated automatically or in the course of medical management. Other data, such as medical diagnoses, are collected and not used for 1 Corresponding Author.

420 M. Djennaoui et al. / Improvement of the Quality of Medical Databases patient care, but for payment or epidemiological purposes. The collection of such data often depends on the knowledge and interpretation of coding rules by coders, and on the time they accept to dedicate to this task. Health data can be of poor quality [4,5], especially in case the encoded data are not useful for patient care. To our knowledge, this problem is more important regarding medical diagnoses. Diagnoses and medical procedures collected under the French hospital payment system are recorded in a nationwide database called the PMSI national database [6]. This database is accessible for research exploitation, with the possibility to link information from a single patient, through a process of anonymous chaining. The quality of the data in this database is directly related to the quality of coding. Many control procedures of this coding process exist; most of them rely heavily on identifying diagnoses in medical letters. With millions of past inpatient stays, the national database falls within the definition of big data, which requires appropriate operating tools [7]. Among the proposed methods, data mining techniques are particularly interesting; they are increasingly used for the exploitation of health databases [8,9]. The objective of this work is to build diagnoses prediction rules from the national database by data mining. These rules would use information from the patient s previous stays to automatically predict diagnoses of the studied stay. 1. Methods Our working sample was constructed from the PMSI national database for years 2007 to 2010 (n=94,862,015 inpatient stays). The information retained for rules construction were medical diagnoses encoded according to the 10 th revision of the International Classification of Diseases (ICD10) [10], and medical procedures encoded according to the Common Classification of Medical Procedures (the French CCAM). The working sample was then filtered so that all the inpatient stays could be related to patients who had at least one stay in a community hospital from the North of France. In this hospital, we could access the complete inpatient stay and free-text anonymous medical letters in the frame of the permissions granted to the PSIP (Patient Safety through Intelligent Procedures in medication) project [11]. This was necessary for the validation step as described below. The sample was split in two samples: a learning sample (70%) for the creation of rules and a test sample (30%) for validation of the rules. The French version of the ICD10 is made of a set of codes and labels (n=33,816). The first 3 characters of the codes stand for code categories (n=2,049) and are usually used for code prediction. From this point, diagnostic codes will denote the 3-digits truncated ICD10 codes. Diagnostic codes to predict should be frequent, characterize chronic diseases and have a strong impact on hospital payment, assuming that these codes are most often coded. To select these diagnostic codes, we based on the results of the Valodiag study [12] that ranked diagnostic codes according to their average impact on payment and their frequency. As a data mining method, we built sequential rules. Sequential rules are expressed using the syntax {A}=>{B}, {A} constituting the predictive pattern, which stands for a set of one or several items, and {B} the predicted item, the predictive pattern preceding in time the predicted item. The rules generation is performed by algorithms based on the notions of support and confidence; support in this context is the proportion of stays with the rule on the set of stays, while confidence is the proportion of stays containing the predicted item on the set of stays expressing the predictive pattern. The user

M. Djennaoui et al. / Improvement of the Quality of Medical Databases 421 determines thresholds in advance for these two parameters. For our study, we set the threshold of support to 0.00075 and the threshold of confidence to 0.5. Sequential rules were constructed using R software [13], with a function implementing the reference algorithm SPADE [14,15]. The selection of rules was made by mean of a statistical filter, based on the value of the product (support*confidence) to obtain the best tradeoff between these two values. Evaluation of the rules was made by a review of cases using medical letters. So that this assessment is the closest real working conditions, it was bound to be based on medical mails. For this, we extracted for each rule from the test sample all the stays that met the following condition: for a patient stays sequence, the predictive pattern had to be present at a given stay n and the predicted code had to be missing at the next stay n+1, i.e. {A}>{B}. The careful reading of medical letters enabled to estimate the proportion of stays among which the diagnostic code was really missing, defining the positive predictive value (PPV) of the predictive pattern. The same method was applied for evaluating the sole predicted code as a predictive pattern of its own presence in the next stay, i.e. {B}>{B}. We then calculated the lift of each rule, according to the following formula (1). lift = PPV of the predictive pattern / PPV of the sole diagnostic code (1) 2. Results The work sample was made of 59,170 inpatient stays corresponding to 12,125 different patients. The predicted codes were E11 (), I48 () and I50 (heart failure). Six sequential rules were extracted from the learning sample (40,876 stays) (Table I). Table I. Sequential rules in the form Predictive pattern Predicted code, with Support and Confidence. Predicted codes E11 I48 I50 Predictive patterns Wordings Support Confidence E11 0.0717 0.55 I10, DZQM006, E11 transthoracic heart Doppler ultrasound 0.0069 0.71 I10, I48, E11 0.0052 0.72 I48 0.0571 0.51 I10, I48, E78 0.0035 0.60 disorders of lipoprotein metabolism I10, I48, Z95 0.0035 0.60 presence of cardiac and vascular implants I48, I69 sequelae of cerebrovascular disease 0.0037 0.62 I50 heart failure 0.0321 0.37 I10, I48, I50 0.0037 0.50 heart failure Rules validation from the test sample (18,294 stays) by estimating the lift enabled to validate three prediction rules (Table II).

422 M. Djennaoui et al. / Improvement of the Quality of Medical Databases Table 2. Lift of the complete rule pattern: PPV of the predictive pattern divided by PPV of the sole past code. Predictive Nb stays Wordings patterns reviewed PPV Lift E11 117 0.53 (reference) I10, DZQM006, E11 transthoracic heart Doppler ultrasound 32 0.69 1.30 I10, I48, E11 20 0.75 1.42 I48 92 0.30 (reference) I10, I48, E78 16 0.25 0.83 disorders of lipoprotein metabolism I10, I48, Z95 25 0.24 0.80 presence of cardiac and vascular implants I48, I69 sequelae of cerebrovascular disease 23 0.39 1.30 I50 heart failure 70 0.21 (reference) I10, I48, I50 heart failure 37 0.21 1.00 3. Discussion Consistent with the main purpose of our study, we validated three sequential rules for diagnosis prediction. These rules have a higher positive predictive value than a simple renewal of previously encoded diagnoses, with a lift greater than or equal to 1.30. The three rules are: {E11,I10,DZQM006}=>{E11} {E11,I10,I48}=>{E11} {I48,I69}=>{I48} The recoding improvement with an increase in the PPV is necessarily associated with a decrease in recall, which does not reflect the evaluation by the lift. However, this pitfall is acceptable if we consider the context for the recoding, which is marked by a lack of time; then the benefit provided by the prediction rules used for automatic recoding offsets loss of recall. Other pitfall, we could not retain rules for code I50 (heart failure). This can be explained by the fact that this code was less often coded, hence producing less trustworthy rules. Despite these limitations, we were able to extract three simple, reliable and effective sequential rules, enabling for a substantial improvement in recoding diagnoses. The originality of our work lays in the use of data mining methods for the development of control rules, which contrasted with conventional approaches of control, mainly based on the identification of diagnoses by reading medical letters. The evaluation of the rules by returning to medical letters allowed an accurate and objective assessment, with testing these rules as a tool for recoding in real situations of monitoring and improvement data quality. The assessment by the lift was well aware of the improvement of recoding.

M. Djennaoui et al. / Improvement of the Quality of Medical Databases 423 The results of our study indicate the opportunity to improve the data quality of the national database by data mining methods. Extending prediction to other diagnostic codes would be an interesting perspective. Acknowledgment The research leading to these results has received funding from the European Community s Seventh Framework Program (FP7/2007-2013) under Grant Agreement n 216130 the PSIP project. References [1] Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. juin 2012;13(6):395405. [2] Geissbuhler A, Safran C, Buchan I, Bellazzi R, Labkoff S, Eilenberg K, et al. Trustworthy reuse of health data: A transnational perspective. Int J Med Inf. 2013;82(1):19. [3] Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):14451. [4] Kahn MG, Raebel MA, Glanz JM, Riedlinger K, Steiner JF. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care. juill 2012;50 Suppl:S219. [5] Richesson RL, Horvath MM, Rusincovitch SA. Clinical research informatics and electronic health record data. Yearb Med Inform. 2014;9(1):21523. [6] Agence Technique de l Information Hospitalière ATIH. http://www.atih.sante.fr/l-atih/presentation [7] Bacardit J, Widera P, Lazzarini N, Krasnogor N. Hard Data Analytics Problems Make for Better Data Analysis Algorithms: Bioinformatics as an Example. Big Data. 1 sept 2014;2(3):16476. [8] Koh HC, Tan G. Data mining applications in healthcare. J Healthc Inf Manag JHIM. 2005;19(2):6472. [9] Bailey S, Singh A, Azadian R, Huber P, Blum M. Prospective data mining of six products in the US FDA Adverse Event Reporting System: disposition of events identified and impact on product safety profiles. Drug Saf Int J Med Toxicol Drug Exp. 1 févr 2010;33(2):13946. [10] WHO International Classification of Diseases (ICD). http://www.who.int/classifications/icd/en/ [11] Beuscart R, McNair P, Brender J, PSIP consortium. Patient safety through intelligent procedures in medication: the PSIP project. Stud Health Technol Inform. 2009;148:613. [12] Ficheur G, Genty M, Chazard E, Flament C, Beuscart R. Proposition d une méthode automatisée calculant la valeur moyenne d un diagnostic. Rev DÉpidémiologie Santé Publique. 2013;61:S189. [13] R Core Team. R: A language and environment for statistical computing.2013.http://www.r-project.org/. [14] Zaki MJ. SPADE: An efficient algorithm for mining frequent sequences. Mach Learn. 2001;42:3160. [15] Hahsler CB and M, Diaz with contributions from D. arulessequences: Mining frequent sequences 2014. http://cran.r-project.org/web/packages/arulessequences/index.html