Integrating Public and Private Medical Texts for Patient De-Identification with Apache cTAKES



Presented by: Andrew McMurry & Britt Fitch (Apache cTAKES committers)
Co-authors: Guergana Savova, Ben Reis, Zak Kohane

Clinical variables in patient notes are often NOT available in the coded EHR
✓ BMI
✓ Smoking status
✓ Family history of disease
✓ Pathology lab results
✓ Medication adherence
✓ Lifestyle factors
✓ High-quality phenotypes
✗ Confidential patient identifiers

De-identifying Private Medical Text
- Human experts
  - Laborious
  - $$
  - Fatigue errors
- Automation (machine learning)
  - Training set from human annotators = small
  - Training on local features limits general utility

Reversing the De-ID task
What are the chances that a word or phrase would occur in a medical journal or medical dictionary?
P(~PHI | PublicText) vs. P(PHI | PrivateText)
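One way to make the comparison above concrete is a smoothed log ratio of how often a token appears in a public corpus versus a private one. This is only a minimal sketch under stated assumptions: the function names, the add-one smoothing, and the toy corpora are illustrative and are not taken from the talk or from the Scrubber pipeline.

```python
import math
from collections import Counter

def train_counts(tokens):
    """Count lowercase token occurrences in a corpus."""
    return Counter(t.lower() for t in tokens)

def smoothed_prob(token, counts, vocab_size, alpha=1.0):
    """Add-alpha smoothed relative frequency of a token in one corpus."""
    total = sum(counts.values())
    return (counts[token.lower()] + alpha) / (total + alpha * vocab_size)

def public_vs_private_score(token, public_counts, private_counts):
    """Log ratio; above 0 means the token looks more like public text."""
    vocab = set(public_counts) | set(private_counts)
    vocab_size = len(vocab) + 1  # +1 reserves probability mass for unseen tokens
    p_public = smoothed_prob(token, public_counts, vocab_size)
    p_private = smoothed_prob(token, private_counts, vocab_size)
    return math.log(p_public / p_private)

# Toy corpora (hypothetical): a journal-like sentence and a note-like sentence.
public = train_counts(
    "the patient has chronic pulmonary disease and the disease is severe".split()
)
private = train_counts("mr smith has chronic disease and lives in boston".split())

print(public_vs_private_score("disease", public, private))  # positive score
print(public_vs_private_score("smith", public, private))    # negative score
```

A token with a strongly negative score appears mostly in private text and is a candidate for closer PHI review; in practice the score would be one feature among several, not a decision rule by itself.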

Public Medical Text
- Open Access on the rise!
- Publications: natural language examples of medical topics
- Ontologies: concept codes for medical topics
- Automation: millions of training examples available

[Venn diagram: Physician Notes, Medical Publications, Medical Ontologies; regions numbered 1-4]
1. Nouns and numbers that occur only in physician notes are probably PHI.
2. Words that occur frequently in medical publications are probably NOT PHI.
3. Words and phrases in medical ontologies are probably not PHI.
4. Words shared by all three medical text sources are very unlikely to contain PHI.
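The four regions can be read as simple vocabulary-membership checks. The sketch below is an assumption-laden simplification: it treats "occurs frequently" as plain set membership, and the function name, region codes, and toy vocabularies are hypothetical.

```python
def venn_region(word, notes_vocab, journal_vocab, ontology_vocab):
    """Map a word to one of the numbered regions (0 = seen in no source)."""
    w = word.lower()
    in_notes = w in notes_vocab
    in_journals = w in journal_vocab
    in_ontology = w in ontology_vocab
    if in_notes and in_journals and in_ontology:
        return 4  # shared by all three sources: very unlikely to be PHI
    if in_ontology:
        return 3  # known medical concept: probably not PHI
    if in_journals:
        return 2  # used in publications: probably not PHI
    if in_notes:
        return 1  # only in physician notes: probably PHI
    return 0

# Toy vocabularies (hypothetical).
notes = {"huntington", "disease", "admitted"}
journals = {"disease", "admitted"}
ontology = {"disease", "chorea"}

for word in ["Huntington", "admitted", "chorea", "disease"]:
    print(word, venn_region(word, notes, journals, ontology))
```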

Data
- Public Medical Text
  - 10,000 journal publications
  - 10 UMLS ontologies
- Private Medical Text
  - i2b2 De-Identification Challenge

APPROACH
- Annotate
  - Sentence boundaries
  - Tokenization
  - Part-of-speech tagging
  - Named entity recognition
- Learn background distribution from PUBLIC text
- Learn properties of PHI from fewer human annotations
- Classify new data: more like public text or private text?

Annotation Pipeline

Phase  Name       Feature             Source
1      Lexical    Part of speech      cTAKES
2      Frequency  Term frequency      Journals
3      Ontology   Medical concepts    UMLS
4      Known PHI  Lists and patterns  US Census, Beckwith

1. Lexical phase: split the document into sentences and tag the part of speech of each token.
2. Frequency phase: calculate term frequency with and without the part-of-speech tag.
3. Ontology phase: search for each word or phrase in ten UMLS ontologies.
4. Known PHI phase: match US Census names and textual patterns for each PHI type.
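The four phases can be pictured as building one feature row per token. The sketch below assumes the tokens arrive already tagged (in the talk that step comes from cTAKES, which is not called here); the function name, feature names, date pattern, and toy lookup tables are all hypothetical.

```python
import re
from collections import Counter

# Illustrative PHI pattern only; a real system would need many more.
DATE_PATTERN = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")

def token_features(tagged_tokens, journal_tf, ontology_terms, census_names):
    """Build one feature dict per (token, pos) pair across the four phases."""
    rows = []
    for token, pos in tagged_tokens:
        word = token.lower()
        rows.append({
            "token": token,
            "pos": pos,                                    # phase 1: lexical
            "tf": journal_tf[word],                        # phase 2: TF without POS
            "tf_pos": journal_tf[(word, pos)],             # phase 2: TF with POS
            "in_ontology": word in ontology_terms,         # phase 3: ontology
            "known_name": word in census_names,            # phase 4: name list
            "date_like": bool(DATE_PATTERN.match(token)),  # phase 4: pattern
        })
    return rows

# Toy inputs (hypothetical counts and lists).
journal_tf = Counter({
    "disease": 120, ("disease", "NN"): 118,
    "admitted": 40, ("admitted", "VBN"): 40,
})
ontology_terms = {"disease"}
census_names = {"huntington"}
tagged = [("Huntington", "NNP"), ("admitted", "VBN"),
          ("disease", "NN"), ("06/16/2013", "CD")]

rows = token_features(tagged, journal_tf, ontology_terms, census_names)
for row in rows:
    print(row)
```

Each row could then be passed to any classifier; the talk reports boosted decision trees and kNN filtering, which are not reproduced here.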

Results (by pipeline phase: 1 Lexical, 2 Frequency, 3 Ontology, 4 Known PHI)

Part of speech is highly informative for PHI classification
[Figure: matrix in which each entry (i, j) = mutual information (PHI class, part of speech); axes: cTAKES part of speech, PHI class]
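A matrix entry of this kind can be computed from labeled tokens. The sketch below assumes each entry is the mutual information, in bits, between two binary indicators (token has class i, token has part of speech j); the slide does not say which MI variant was used, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def indicator_mi(pairs, target_class, target_pos):
    """Mutual information in bits between [class == target_class]
    and [pos == target_pos] over a list of (class, pos) pairs."""
    n = len(pairs)
    joint = Counter((c == target_class, p == target_pos) for c, p in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = sum(v for (x, _), v in joint.items() if x == a) / n
        p_b = sum(v for (_, y), v in joint.items() if y == b) / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Toy data: class and POS perfectly aligned, then fully independent.
aligned = [("NAME", "NNP")] * 4 + [("NOT_PHI", "VB")] * 4
independent = [("NAME", "NNP"), ("NAME", "VB"),
               ("NOT_PHI", "NNP"), ("NOT_PHI", "VB")]

print(indicator_mi(aligned, "NAME", "NNP"))      # 1.0
print(indicator_mi(independent, "NAME", "NNP"))  # 0.0
```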

PHI classes cluster by Part Of Speech (Biclustering normalized MI distance)

People and places: NOUNS

Patient numbers: NUMBERS

★ NOUNS: people and places
★ NUMBERS: patient IDs and dates
✓ ADJECTIVES: clinical description, e.g. "severe chronic obstructive pulmonary disease"
✓ VERBS: clinical action, e.g. "decreased white blood cell count"
- Other parts of speech: common words, dependencies, etc.

Term Frequency in 10,000 Journal Articles

Classifier Performance

Part of Speech + TF not enough
"Mr. Huntington suffers from Huntington's disease. He was admitted to a hospital on Huntington Street."
-- Professor Szolovits example*
1. Noun: person
2. Noun: disease
3. Noun: location
All three have the same term frequency.

Match Longest UMLS Phrase

# Concepts   Dictionary
3,461        COSTAR
5,020        HL7V2.5
8,062        HL7V3.0
102,048      ICD10CM
253,708      ICD10PCS
40,491       ICD9CM
327,181      LOINC
739,161      MESH
437,307      RXNORM
1,170,855    SNOMEDCT

Example matches:
HUNTINGTON DIS
HUNTINGTONS DIS
Huntington Chorea
Huntington's Chorea
Huntington's dementia
Huntington's disease*
Huntington's disease in last 7D
Huntington's disease Multiple sclerosis in last 7 days

Length of UMLS Matches
Example: "post percutaneous transluminal coronary angioplasty"
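Matching the longest ontology phrase is what separates "Huntington's disease" from the surname "Huntington". The sketch below uses a greedy left-to-right longest match over whitespace tokens against a small phrase set; the function name, the `max_len` limit, and the phrase list are hypothetical stand-ins, and no UMLS API is called.

```python
def longest_matches(tokens, phrases, max_len=8):
    """Greedy left-to-right longest match of token n-grams against a phrase set.
    Returns (start_index, length_in_tokens, matched_text) tuples."""
    lookup = {tuple(p.lower().split()) for p in phrases}
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            gram = tuple(t.lower() for t in tokens[i:i + n])
            if gram in lookup:
                found.append((i, n, " ".join(tokens[i:i + n])))
                i += n
                break
        else:
            i += 1  # no phrase starts here
    return found

# Toy phrase set (hypothetical subset of ontology strings).
phrases = {"Huntington's disease", "disease"}
tokens = "Mr. Huntington suffers from Huntington's disease".split()

matches = longest_matches(tokens, phrases)
print(matches)  # [(4, 2, "Huntington's disease")]
```

The match length (here 2 tokens) can serve as a feature: the bare surname at index 1 gets no match, while the disease mention does.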

Known PHI lists

Classifier Performance

Physician Notes vs. Medical Publications
How similar are tokens in medical publications to tokens in physician notes? Are the distances the same for PHI and non-PHI tokens?

Distance measure for public and private text tokens
[Figure: groups shown are not PHI and the PHI types]

Variation for each Part Of Speech

False Positive Filtering with kNN
83% of the test set was mapped to 100 nearest neighbors that all share the same class label.
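The filtering idea can be sketched as follows: a PHI prediction is overturned only when all k nearest training examples agree that the token is not PHI. This is a minimal illustration under stated assumptions: Euclidean distance, k=3 instead of 100, and the function names, labels, and toy feature vectors are all hypothetical.

```python
import math

def unanimous_knn_label(query, train, k):
    """Return the label shared by the k nearest training points, else None."""
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    labels = {label for _, label in nearest}
    return labels.pop() if len(labels) == 1 else None

def filter_false_positives(candidates, train, k=3):
    """Overturn a PHI prediction when all k neighbors are non-PHI."""
    filtered = []
    for vector, predicted in candidates:
        if predicted == "PHI" and unanimous_knn_label(vector, train, k) == "NOT_PHI":
            predicted = "NOT_PHI"
        filtered.append(predicted)
    return filtered

# Toy training set: two well-separated clusters in a 2-D feature space.
train = [
    ((0.0, 0.0), "NOT_PHI"), ((0.1, 0.0), "NOT_PHI"), ((0.0, 0.1), "NOT_PHI"),
    ((1.0, 1.0), "PHI"), ((0.9, 1.0), "PHI"), ((1.0, 0.9), "PHI"),
]
# All three candidates were predicted PHI by an upstream recall-favoring classifier.
candidates = [((0.05, 0.05), "PHI"), ((0.95, 0.95), "PHI"), ((0.5, 0.5), "PHI")]

result = filter_false_positives(candidates, train)
print(result)  # ['NOT_PHI', 'PHI', 'PHI']
```

The third candidate sits between the clusters, so its neighbors disagree and the original prediction is left alone, which keeps the filter from lowering recall on ambiguous tokens.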

Favor Recall (Boosted Decision Tree)

Boosting + False Positive Filtering

Summary
- Human annotations = $$, rare, hard to get
- Automated De-ID tends to overfit local training instances
- Learn background distribution from PUBLIC text
- Learn properties of PHI from fewer human annotations
- Apache cTAKES lexical annotations are very informative for De-ID
- Scrubber pipeline is open source with an example training set
- Classify new data: more like public text or private text?
