Insight Driven Health
Natural Language Processing in the EHR Lifecycle
Cecil O. Lynch, MD, MS
cecil.o.lynch@accenture.com
Health & Public Service
Outline
  Medical Data Landscape
  Value Proposition of NLP
  Strategies for voice and text processing
  Tooling options
  Integration with the EMR lifecycle
Medical Data Landscape
Copyright 2010 Accenture. All Rights Reserved. Accenture, its logo, and High Performance Delivered are trademarks of Accenture.
Medical Data: Where is it?
Two Types of Content

1. Structured Content (20%) - Typically found in a database
   A. Fits a pre-defined data model
   B. Fits well into relational tables
   Examples:
     Databases
     XML Data
     Data warehouses
     Enterprise systems (CRM, ERP, etc.)
     UMLS
     RxNorm

2. Unstructured Content (80%) - Can be found throughout an organization
   A. Does not fit a pre-defined data model
   B. Does not fit well into relational tables
   Examples - Text-based:
     Email messages
     Office documents
     Web documents
     BLOB (Binary Large Object) field type (e.g. Transcribed Doctor's Notes)
   Examples - Non-Text-based:
     Voice/Audio files (e.g. Dictated Doctor's Notes)
     Images
     Video files
     Medical Charts

Slide from DataSkill
NLP Value Proposition
NLP Value Proposition
Data from IBM study at Seton Healthcare
Case Study: BJC HealthCare
Making healthcare smarter

BJC Healthcare NLP Results
Results: Follow-up Appointments and Diagnoses

Element                   Precision   Recall
Alcohol Use               91.8%       96.2%
Alcohol Substance         95%         74%
Alcohol Volume            96.3%       100.0%
Alcohol Duration          86.7%       93.3%
Alcohol Quit Duration     100.0%      96.1%
Alcohol Family History    95.8%       83.3%
Tobacco Use               90.0%       93.0%
Medications               90.0%       92.0%
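For readers less familiar with the two metrics in the results table: precision is the share of extracted items that are correct, and recall is the share of true items that were found. The sketch below shows the arithmetic in Python. The counts in the usage lines are made up for illustration and are not from the BJC study, and the F1 helper is an added convenience that the slide does not report.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of extracted items that were correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true items that were extracted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (not reported on the slide)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 45, 5, 15
print(precision(tp, fp))  # 0.9
print(recall(tp, fn))     # 0.75

# Combining the reported Alcohol Use figures (91.8% precision, 96.2% recall).
print(round(f1(0.918, 0.962), 3))  # 0.939
```

A single combined score can make it easier to compare elements such as Alcohol Substance, where precision is high but recall is noticeably lower.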
Strategies for Voice and Text Analytics
Strategic Approach
  Voice recognition to standard EMR UI
  Voice recognition to a standard model
  Voice recognition to unstructured text document
  Content analytics on unstructured documents written to EMR fields
  Content analytics on unstructured documents written to a data warehouse
  Content analytics used at runtime and for predictive analytics and decision support
Is there a limit to Structured Data?
Tooling Options
NLP Pipelines - UIMA
Unstructured Information Management Architecture

4 Major Software Divisions
  It specifies component interfaces in an analytics pipeline
  It describes a set of design patterns
  It suggests two data representations: an in-memory representation of annotations for high-performance analytics, and an XML representation of annotations for integration with remote web services
  It suggests development roles, allowing tools to be used by users with diverse skills

Is an OASIS Standard
Reference implementation donated by IBM (SourceForge)
Maintained by the Apache Software Foundation
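The pipeline idea and the two data representations can be sketched in a few lines of Python. This is an analogy only: UIMA itself is a Java framework with its own interfaces, and every name below (Annotation, Document, token_annotator, number_annotator, run_pipeline, to_xml) is hypothetical and not part of the UIMA API. Annotators share one in-memory document and add stand-off annotations to it; the result is then serialized to XML for exchange with another service.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document text (stand-off annotation)."""
    type: str
    begin: int
    end: int


@dataclass
class Document:
    """In-memory representation: the text plus all annotations so far."""
    text: str
    annotations: list = field(default_factory=list)


def token_annotator(doc: Document) -> None:
    """Mark every whitespace-delimited token."""
    for m in re.finditer(r"\S+", doc.text):
        doc.annotations.append(Annotation("Token", m.start(), m.end()))


def number_annotator(doc: Document) -> None:
    """Mark every run of digits."""
    for m in re.finditer(r"\d+", doc.text):
        doc.annotations.append(Annotation("Number", m.start(), m.end()))


def run_pipeline(text: str, annotators) -> Document:
    """Run each annotator in order over the same shared document."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


def to_xml(doc: Document) -> str:
    """XML representation of the annotations, for remote integration."""
    root = ET.Element("document")
    ET.SubElement(root, "text").text = doc.text
    for a in doc.annotations:
        ET.SubElement(root, "annotation", type=a.type,
                      begin=str(a.begin), end=str(a.end))
    return ET.tostring(root, encoding="unicode")


doc = run_pipeline("BP 120 today", [token_annotator, number_annotator])
print(to_xml(doc))
```

Because each annotator only reads and writes the shared document, components can be developed by different people and reordered or swapped without changing one another.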
Tooling
Tooling - Continued
Tooling - Continued
cTAKES
Clinical Text Analysis and Knowledge Extraction System (Mayo Clinic, Children's Hospital Boston)
http://sourceforge.net/projects/ohnlp/files/ctakes/

Components
  Sentence boundary detector (OpenNLP)
  Rule-based tokenizer to separate punctuation from words
  Normalizer (NLM's NORM)
  Part-of-speech tagger (OpenNLP)
  Phrasal chunker (OpenNLP)
  Dictionary lookup annotator
  Context annotator
  Negation detector (NegEx)
  Dependency parser
  Module for the identification of patient smoking status
  Drug mention annotator
  Context dependent tokenizer
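To make three of the components listed above concrete (rule-based tokenizer, dictionary lookup, negation detection), here is a minimal Python sketch. It is not cTAKES code and is far simpler than NegEx itself: the dictionary, the trigger words, the look-back window, and all function names are assumptions chosen for illustration.

```python
import re

# Hypothetical mini-dictionary and trigger list, for illustration only.
DICTIONARY = {"chest pain", "fever", "cough"}
NEGATION_TRIGGERS = {"no", "denies", "without"}
WINDOW = 5  # how many tokens to look back for a trigger


def tokenize(text: str) -> list:
    """Rule-based tokenizer: lowercases and splits punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def find_mentions(tokens: list) -> list:
    """Dictionary lookup over token n-grams; returns (start, end, term)."""
    mentions = []
    for term in DICTIONARY:
        parts = term.split()
        n = len(parts)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == parts:
                mentions.append((i, i + n, term))
    return sorted(mentions)


def is_negated(tokens: list, start: int) -> bool:
    """True if a trigger precedes the mention within WINDOW tokens,
    without crossing a sentence boundary."""
    for i in range(start - 1, max(start - 1 - WINDOW, -1), -1):
        if tokens[i] in {".", ";"}:
            return False
        if tokens[i] in NEGATION_TRIGGERS:
            return True
    return False


def extract(text: str) -> list:
    """Return (term, negated) pairs in order of appearance."""
    tokens = tokenize(text)
    return [(term, is_negated(tokens, start))
            for start, _, term in find_mentions(tokens)]


print(extract("Patient denies chest pain. Reports fever and cough."))
# [('chest pain', True), ('fever', False), ('cough', False)]
```

Stopping the look-back at a sentence boundary is what keeps "denies" in the first sentence from negating "fever" in the second.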
cTAKES Derivation
Refined Lucene OWL Code Annotation
ClearTK
ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA.
http://code.google.com/p/cleartk/ (UCB)

  A common interface and wrappers for popular machine learning libraries such as SVMlight, LIBSVM, OpenNLP MaxEnt, and Mallet.
  A rich feature extraction library that can be used with any of the machine learning classifiers. Under the covers, ClearTK understands each of the native machine learning libraries and translates your features into a format appropriate to whatever model you're using.
  Infrastructure for creating NLP components for specific tasks such as part-of-speech tagging, BIO-style chunking, named entity recognition, semantic role labeling, temporal relation tagging, etc.
  Wrappers for common NLP tools such as the Snowball stemmer, the OpenNLP tools, the MaltParser dependency parser, and the Stanford CoreNLP tools.
  Corpus readers for collections like the Penn Treebank, ACE 2005, CoNLL 2003, Genia, TimeBank and TempEval.
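The idea of translating extracted features into a classifier's native format can be illustrated with the sparse text format used by SVMlight and LIBSVM (a label followed by ascending index:value pairs). The Python sketch below does not use ClearTK, which is a Java library; the feature names, the labelling task, and all function names are hypothetical.

```python
def token_features(tokens: list, i: int) -> dict:
    """Simple named features for the token at position i."""
    word = tokens[i].lower()
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    return {
        f"word={word}": 1.0,
        f"suffix={word[-2:]}": 1.0,
        f"prev={prev}": 1.0,
    }


def build_index(instances: list) -> dict:
    """Assign each feature name a 1-based integer index."""
    index = {}
    for feats, _ in instances:
        for name in feats:
            index.setdefault(name, len(index) + 1)
    return index


def to_svmlight(feats: dict, label: int, index: dict) -> str:
    """Render one instance as '<label> <index>:<value> ...'."""
    pairs = sorted((index[name], value) for name, value in feats.items())
    return f"{label:+d} " + " ".join(f"{i}:{v:g}" for i, v in pairs)


# Hypothetical task: +1 if the token is a verb, -1 otherwise.
tokens = ["Patient", "smokes", "daily"]
labels = [-1, +1, -1]
instances = [(token_features(tokens, i), labels[i]) for i in range(len(tokens))]
index = build_index(instances)
lines = [to_svmlight(feats, label, index) for feats, label in instances]
print("\n".join(lines))
# -1 1:1 2:1 3:1
# +1 4:1 5:1 6:1
# -1 7:1 8:1 9:1
```

Keeping features as named values until the last step is what lets the same extraction code feed different learners; only the final rendering function would change.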
EMR Integration Options
Optimal Goal
Goal is:
  Convert unstructured to structured data
  Code this data into standard Meaningful Use terminologies
  Write the data to standard information models for health care data elements in standard ISO Healthcare datatypes
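The three steps of the goal can be sketched end to end in Python. Everything here is illustrative: the CodedConcept class is only loosely modelled on the idea of a coded-value datatype (code, code system, display name), and the codes and code system name are placeholders, not real terminology codes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodedConcept:
    """A coded value: code, the system it comes from, and a display name."""
    code: str
    code_system: str
    display_name: str


# Hypothetical lookup table. Placeholder codes only, NOT real terminology codes.
TERMINOLOGY = {
    "current smoker": CodedConcept("EX-0001", "EXAMPLE-TERMINOLOGY", "Current smoker"),
    "former smoker": CodedConcept("EX-0002", "EXAMPLE-TERMINOLOGY", "Former smoker"),
}


def code_mention(mention: str) -> Optional[CodedConcept]:
    """Step 2: map an extracted text mention to a coded concept, if known."""
    return TERMINOLOGY.get(mention.strip().lower())


def to_record(patient_id: str, mention: str) -> dict:
    """Step 3: write the coded value into a structured record,
    keeping the source text for traceability."""
    return {
        "patient_id": patient_id,
        "source_text": mention,
        "concept": code_mention(mention),
    }


# Step 1 (extraction from free text) is assumed to have produced this mention.
record = to_record("patient-1", "Current Smoker")
print(record["concept"].code)          # EX-0001
print(code_mention("unknown phrase"))  # None
```

Returning None for unmapped mentions, rather than guessing a code, leaves room for a manual review queue.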
City of Hope: A Proposed Architecture

[Architecture diagram. Labels recovered from the figure:]

Components
  Allscripts Database (EMR OLTP Connection)
  Content Analytics / Natural Language Processing
  Staging - Relational
  Staging - Triplestore
  ETL
  Physical Layer
  Logical Layer (HL7 RIM V3)
  EDW and Datamarts
  RDF Triplestore Datamart
  Allscripts Healthcare Accelerator
  OLAP Analytics
  Reporting and Business Intelligence
  Predictive Analytics, Statistics, Datamining

Datamining Tool Examples: SPARQL, OWL, IBM SLRP, IBM IODT, OntoBroker, Sesame, Jena

High Performance Analytics
  Risk stratification
  Treatment/Protocol evaluations
  Research cohort comparisons
  Real-time clinical decision support
  Disease management
  Population health management
  Personalized medicine / genomics
  Performance assessment
  Patient profiling
  Treatment cost calculations

Glossary
  RDF - Resource Description Framework
  OWL - Web Ontology Language
  SPARQL - SPARQL Protocol and RDF Query Language
  IBM SLRP - IBM Semantic Layer Research Platform
  IBM IODT - IBM's toolkit for ontology-driven development
  OntoBroker - Semantic web middleware
  Sesame - Framework for querying and analyzing RDF data
  Jena - Semantic Web Framework for Java

WATSON for Healthcare
  WEA Advisor Framework: Tools, APIs, Methods
  Data Platform
  Massively Parallel Infrastructure
  Utilization Management Advisor
  Diagnosis and Treatment Advisor
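The triplestore staging idea can be illustrated with a toy example: NLP output stored as subject-predicate-object triples and queried by pattern. This is plain Python, not RDF or SPARQL; in the proposed architecture the equivalent query would presumably be written in SPARQL against a store such as Jena or Sesame. All identifiers and predicates below are hypothetical.

```python
# Hypothetical triples, as NLP output might be staged.
TRIPLES = {
    ("patient:1", "hasObservation", "obs:1"),
    ("obs:1", "hasConcept", "TobaccoUse"),
    ("patient:2", "hasObservation", "obs:2"),
    ("obs:2", "hasConcept", "AlcoholUse"),
    ("patient:3", "hasObservation", "obs:3"),
    ("obs:3", "hasConcept", "TobaccoUse"),
}


def match(pattern: tuple, triples: set):
    """Yield triples matching an (s, p, o) pattern; None is a wildcard."""
    for triple in triples:
        if all(want is None or want == got
               for want, got in zip(pattern, triple)):
            yield triple


def patients_with(concept: str, triples: set) -> list:
    """Join two patterns: observations of the concept, then their patients."""
    observations = {s for s, _, _ in match((None, "hasConcept", concept), triples)}
    return sorted(
        s for s, _, o in match((None, "hasObservation", None), triples)
        if o in observations
    )


print(patients_with("TobaccoUse", TRIPLES))  # ['patient:1', 'patient:3']
```

A query of this shape is the building block for use cases such as research cohort comparisons, where patients are selected by the concepts found in their notes.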
Wrap Up
Questions?
cecil.o.lynch@accenture.com
Thank You - Credits
IBM jstart Team
  Randall Wilcox, Kevin Conroy
Dataskill
  Victor Bagwell - CIO
City of Hope
  Naveen Raja, D.O. CMIO
  Ying Liu, Ph.D. Bioinformatics Group
Accenture
  German Acuna
  Suniti Ponkshe
  Jim Traficant