EVALUATING CLASSIFICATION POWER OF LINKED ADMISSION DATA SOURCES WITH TEXT MINING


Kocbek et al. Big Data 2015, Sydney

Simon Kocbek, Lawrence Cavedon, David Martinez, Christopher Bain, Chris Mac Manus, Gholamreza Haffari, Ingrid Zukerman, Karin Verspoor

Background and motivation
- Electronic Health Record (EHR) data is growing, and much of it is in free-text format.
- This text can be used in text mining (TM) applications.
- Most previous TM applications use a single textual data source.
- Increased data linkage in hospitals allows multiple sources to be leveraged for complex analytical tasks.
- We describe a text mining system that detects positive cases of lung cancer for each admission:
  - Use of multiple data sources.
  - Evaluate performance (does performance improve?).

Alfred REASON platform
- 15+ years of data.
- 171,000+ updates each day.
- 62.4 million updates per annum.

Data: task and examples

Each admission is linked to a radiology question, a radiology report, a pathology report, additional data, and an ICD-10 code.

Radiology question: 50yo complaining of left shoulder pain. Tender generally. Difficulty abducting the shoulder past 45 degrees. Home on HITH tomorrow - either inpatient or outpatient please
Radiology report: Mobile Chest performed on 02-JUN-2012 at 08:27 AM: The nasogastric tube has its tip in the stomach. The tracheostomy is seen at T2 level..
Pathology report: Urine Culture Acc No: 12-183-0731 Source: Urine. URINE MICROSCOPY (PHASE CONTRAST): Leucocytes x10^6/l (Ref <10)... <10; Erythrocytes x10^6/l (Ref <10).. <10...
Additional data: Age: 50; Date of admission: Jun/12; Gender: F; Country:

Data characteristics
- Extracted data for 2 financial years, from 2012 to 2014:
  - 150,521 admissions
  - 40,800 radiology reports with an associated question
  - 20,872 pathology reports
  - 121,700 additional data entries (demographics, hospital admission information)
- Admissions are associated with ICD-10 codes:
  - Used as ground truth.
  - ICD-10 code C34.* identifies positive cases of lung cancer.
  - 496 positive admissions.
- Final dataset (after subsampling): 992 admissions.
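
The labelling and subsampling step could look roughly like the sketch below. It is not the authors' code. The record layout (`id`, `icd10`), the function names and the toy admissions are hypothetical, and drawing as many negatives as positives is an assumption: the slide only says "subsampling", although 992 admissions against 496 positives would be consistent with a balanced sample.

```python
import random


def is_lung_cancer(icd10_codes):
    """Return True if any ICD-10 code falls under C34 (the C34.* pattern)."""
    return any(code.strip().upper().startswith("C34") for code in icd10_codes)


def balanced_subsample(admissions, seed=0):
    """Keep every positive admission plus an equal-sized random sample of negatives.

    Assumption: a 1:1 ratio. The slides do not state the ratio used.
    """
    positives = [a for a in admissions if is_lung_cancer(a["icd10"])]
    negatives = [a for a in admissions if not is_lung_cancer(a["icd10"])]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + sampled


# Toy admissions, invented for illustration only.
admissions = [
    {"id": 1, "icd10": ["C34.1", "J18.9"]},
    {"id": 2, "icd10": ["I10"]},
    {"id": 3, "icd10": ["C34.90"]},
    {"id": 4, "icd10": ["J44.1"]},
    {"id": 5, "icd10": ["C50.9"]},
    {"id": 6, "icd10": ["E11.9"]},
]

subset = balanced_subsample(admissions)
print(len(subset))  # 2 positives + 2 sampled negatives
```

Fixing the random seed keeps the negative sample stable between runs, which matters when several models are compared on the same admissions.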

Methods (I)
[Pipeline diagram. REASON sources (radiology reports, radiology questions, pathology reports, additional data) feed language processing supported by biomedical knowledge sources; the resulting textual and other features go to a machine learning algorithm, which produces the classification model.]
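
One way to picture the step from linked sources to a single feature set is the sketch below. It is illustrative only: plain word n-grams stand in for the biomedical phrases the system derives from knowledge sources, and the source names, the `source::phrase` naming scheme and both function names are assumptions rather than anything shown on the slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def source_features(source, text, max_n=2):
    """Count word n-grams (up to max_n words), prefixed with the source name."""
    tokens = tokenize(text)
    feats = Counter()
    for size in range(1, max_n + 1):
        for i in range(len(tokens) - size + 1):
            phrase = " ".join(tokens[i:i + size])
            feats[f"{source}::{phrase}"] += 1
    return feats


def admission_features(documents, sources):
    """Merge the features of the selected sources for one admission."""
    merged = Counter()
    for source in sources:
        text = documents.get(source)
        if text:  # an admission may lack some sources, e.g. a pathology report
            merged.update(source_features(source, text))
    return merged


# Example texts taken from the data slide; the dictionary keys are hypothetical.
documents = {
    "radiology_report": "The nasogastric tube has its tip in the stomach.",
    "radiology_question": "50yo complaining of left shoulder pain.",
}
feats = admission_features(
    documents, ["radiology_report", "radiology_question", "pathology_report"]
)
print(feats["radiology_report::nasogastric tube"])
```

Prefixing each feature with its source keeps the same phrase from different documents apart, so a classifier can weight a phrase in a radiology question differently from the same phrase in a report.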

Methods (II)
- Features:
  - Biomedical phrases.
  - Identified negative context ("no lung cancer" vs "lung cancer").
  - Ambiguous words ("common cold" vs "cold temperature").
- Machine learning algorithms:
  - Support Vector Machines.
  - Parameter tuning.
- Evaluation:
  - Precision, Recall, F-Score.
  - Statistical significance.
- Steps:
  - 8 different classification models (different combinations of data sources).
  - Baseline: phrases from radiology reports only.
  - Adding phrases from other sources.
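
Two of these points can be sketched in a few lines. The first function marks tokens that follow a negation cue, a deliberately naive stand-in for the negative-context detection mentioned above; the cue list and the two-token window are assumptions. The second enumerates model configurations on the assumption that the 8 models are the radiology-report baseline plus every subset of the three other sources (2^3 = 8), which matches the count on the slide but is not spelled out there. All names are hypothetical.

```python
from itertools import combinations

NEGATION_CUES = {"no", "not", "without", "denies"}  # hypothetical cue list


def mark_negation(tokens, window=2):
    """Prefix the `window` tokens that follow a negation cue with NEG_."""
    marked = []
    remaining = 0  # how many upcoming tokens are still in negated scope
    for tok in tokens:
        if tok in NEGATION_CUES:
            marked.append(tok)
            remaining = window
        elif remaining > 0:
            marked.append("NEG_" + tok)
            remaining -= 1
        else:
            marked.append(tok)
    return marked


BASELINE = "radiology_report"
EXTRA_SOURCES = ["radiology_question", "pathology_report", "additional_data"]


def model_configurations():
    """Return the baseline combined with every subset of the extra sources."""
    configs = []
    for k in range(len(EXTRA_SOURCES) + 1):
        for extra in combinations(EXTRA_SOURCES, k):
            configs.append((BASELINE,) + extra)
    return configs


print(mark_negation("no lung cancer seen".split()))
configs = model_configurations()
print(len(configs))
```

With this marking, "no lung cancer" yields `NEG_lung` and `NEG_cancer` rather than the same features as a positive mention, which is the distinction the negative-context feature is meant to capture.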

Results

F-Score using 3 data sources, then 4 data sources (successive builds of the same chart).
[Chart: F-score on the y-axis (ticks 0.870 to 0.930), x-axis ticks 1 to 4; legend: radiology question, pathology reports, additional data. Data labels on the 3-source slides: 0.873, 0.901, 0.917. Data label added on the 4-source slide: 0.930.]
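
For readers unfamiliar with the metric, the sketch below shows how an F-score is derived from precision and recall. It assumes the balanced F1 variant (the slides say only "F-Score"), and the labels and predictions are invented for illustration; they are not results from the study.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = lung cancer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Invented example: 4 positive and 4 negative admissions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Because F1 is the harmonic mean of precision and recall, a model cannot score well by maximising one at the expense of the other.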

Discussion
- More data sources lead to better performance.
- The classifier with the highest performance was built using features from all four data sources.
- Not all improvements were significant: radiology question and metadata vs pathology reports.
- Not all admissions had a pathology report associated with them.

Conclusion
- We built a text mining system for detecting lung cancer admissions.
- Our methods show that more informed systems can be built by including multiple linked data sources.
- Future work: other diseases, skewed datasets, feature selection.
[Chart: Breast cancer; y-axis ticks 0.820 to 0.920, x-axis ticks 1 to 4; data label 0.893.]

Thank you
Questions? Comments?