Automated assignment of ICD-9-CM codes to radiology reports
  Richárd Farkas, University of Szeged
  Filip Ginter, University of Turku
Overview
  - Why clinical coding?
    - Importance, use of automated coding
  - Challenge description
    - Data used
    - Evaluation methodology
  - Our solutions
    - Szeged system
    - Turku system
  - Results and comparison
  - Possibilities and challenges of a real-world system
NLP in the clinical domain
  - Narrative texts
    - A huge amount of information is hidden
  - Manual processing requires expertise
    - Time
    - Costs
  - Special features of medical texts
    - Unique characteristics of the language used
    - "Smokes 2-3 cig / day, occ etoh, and no drugs except marijuana"
  Exam
Clinical coding
  - Automatic assignment of disease/symptom codes to clinical records
  - International Classification of Diseases (ICD-X-CM)
    - X: revision (current: 10, used: 9)
  - Used for
    - statistics on diseases or on effects of treatment
    - billing: the task has commercial relevance
      - Overcoding is penalised by 3x the sum
      - Undercoding means loss of revenue
  - Codes are added to the text after the treatment (US)
  - Coding costs $25 billion annually in the USA
International Challenge on Classifying Clinical Free Text Using Natural Language Processing
  - Shared task challenge to evaluate NLP systems on clinical data
    - http://www.computationalmedicine.org/challenge/
    - ICD-9-CM coding
    - Radiology reports
  - Organization
    - Computational Medicine Center, Cincinnati, Ohio, USA
    - February/March, 2007
  - Motivation
    - Practical importance for hospital administration and health insurance
  - 120+ registered participants
  - 44 systems submitted
Data Used
  - Radiology records annotated with ICD codes
    - 978 documents used for training ICD-9 systems
    - 976 unseen documents used for evaluation
  - Annotation provided by 3 health institutes
    - majority labeling used as gold standard
  - 45 different ICD codes used
    - codes appear in various combinations (94 different sets of codes)
    - frequency of labels varies
  - The data is made available free of charge for research purposes by the challenge organizers
Example
<doc id="97664713" type="radiology_report">
  <codes>
    <code origin="cmc_majority" type="icd-9-cm">786.2</code>
    <code origin="company3" type="icd-9-cm">518.0</code>
    <code origin="company1" type="icd-9-cm">786.2</code>
    <code origin="company2" type="icd-9-cm">786.2</code>
  </codes>
  <texts>
    <text origin="cchmc_radiology" type="clinical_history">
      Cough. History of pneumonia on 1/2/01. Increased work of breathing.
    </text>
    <text origin="cchmc_radiology" type="impression">
      No significant change to overall appearance of perihilar lung opacities and peribronchial thickening most consistent with viral illness vs reactive airways disease. Increased densities superimposed over the right middle lobe and lingular region on the lateral view may represent superimposition of shadows. However atelectasis or a small amount of parenchymal consolidation cannot be fully excluded. This patient's lung markings have appeared prominent on the four existing chest x-rays in our file. It is recommended that the child receive a well-child chest x-ray in order to evaluate lung markings when the child is not sick.
    </text>
  </texts>
</doc>
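As an illustration of how such a record might be consumed, the sketch below parses an abridged copy of the example document and recomputes a majority label from the three annotator votes. The function names are hypothetical, and the rule "keep a code when more than half of the annotators assigned it" is an assumption about what "majority labeling" means here; the organizers' exact procedure is not given on the slide.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abridged copy of the example record (texts shortened for brevity).
SAMPLE = """<doc id="97664713" type="radiology_report">
  <codes>
    <code origin="cmc_majority" type="icd-9-cm">786.2</code>
    <code origin="company3" type="icd-9-cm">518.0</code>
    <code origin="company1" type="icd-9-cm">786.2</code>
    <code origin="company2" type="icd-9-cm">786.2</code>
  </codes>
  <texts>
    <text origin="cchmc_radiology" type="clinical_history">Cough. History of pneumonia on 1/2/01.</text>
    <text origin="cchmc_radiology" type="impression">Increased densities may represent superimposition of shadows.</text>
  </texts>
</doc>"""


def read_report(xml_string):
    """Return (gold codes, per-code annotator vote counts, texts by type)."""
    doc = ET.fromstring(xml_string)
    gold, votes = set(), Counter()
    for code in doc.iter("code"):
        value = code.text.strip()
        if code.get("origin") == "cmc_majority":
            gold.add(value)          # label distributed as gold standard
        else:
            votes[value] += 1        # one vote per annotating institute
    texts = {t.get("type"): (t.text or "").strip() for t in doc.iter("text")}
    return gold, votes, texts


def majority(votes, n_annotators=3):
    """Assumed rule: keep codes chosen by more than half of the annotators."""
    return {code for code, n in votes.items() if 2 * n > n_annotators}


gold, votes, texts = read_report(SAMPLE)
print(gold, majority(votes), texts["clinical_history"])
```

In this record two of the three annotators chose 786.2, so the recomputed majority agrees with the distributed `cmc_majority` label, while 518.0 is dropped.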
Distribution of labels
Results
Szeged, Hungary
  - Richárd Farkas: Research Group on Artificial Intelligence of the Hungarian Academy of Sciences
  - György Szarvas: University of Szeged, Department of Informatics, Human Language Technology Group
  - without physicians
Szeged ICD coding solutions
  - Language processing
    - negation/speculation
  - Exploiting the ICD and utilising labeled data
    - Inter-label dependencies
    - Synonyms and abbreviations
  - Challenge system: hand-crafted
    - reconstructed automatically (machine learning)
Language processing
  - Coding guidelines prescribe that an uncertain diagnosis should not be coded
    - speculation: "Peribronchial thickening most consistent with viral illness vs reactive airways disease"
    - negation: "Normal slightly hypoventilatory chest x-ray, no pneumonia."
  - Issues in the past without direct effect on current treatment should not be coded
    - temporal resolution is neglected due to noisy annotation of historical findings
Detection of speculation/negation
  - Simple approach, motivated by the fairly simple grammar of the texts
    - physicians aim to briefly enumerate findings and their opinion
    - they rarely use very complex noun phrases or syntax
  - Dictionaries of keywords collected from training data
  - Scope identified by a naive heuristic
    - right scope: end of sentence
    - left scope: previous punctuation (or nothing, depending on the keyword)
  - Example: "Normal slightly hypoventilatory chest x-ray, no pneumonia."
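The scope heuristic can be sketched roughly as follows. This is a minimal illustration and not the Szeged implementation: the trigger dictionary, its direction flags, and the function names are hypothetical stand-ins for the keyword lists the authors collected from the training data.

```python
import re

# Hypothetical trigger dictionary: keyword -> direction of its scope.
TRIGGERS = {"no": "right", "without": "right", "possible": "right", "unlikely": "left"}
PUNCT = {".", ",", ";", ":"}


def scoped_tokens(sentence):
    """Tokenize one sentence and return (tokens, indices inside a trigger scope)."""
    tokens = re.findall(r"[A-Za-z0-9-]+|[.,;:]", sentence)
    in_scope = set()
    for i, tok in enumerate(tokens):
        direction = TRIGGERS.get(tok.lower())
        if direction == "right":
            # Right scope: everything up to the end of the sentence.
            j = i + 1
            while j < len(tokens) and tokens[j] != ".":
                in_scope.add(j)
                j += 1
        elif direction == "left":
            # Left scope: everything back to the previous punctuation mark.
            j = i - 1
            while j >= 0 and tokens[j] not in PUNCT:
                in_scope.add(j)
                j -= 1
    return tokens, in_scope


def asserted_tokens(sentence):
    """Tokens that are neither triggers, punctuation, nor inside a scope."""
    tokens, in_scope = scoped_tokens(sentence)
    return [t for i, t in enumerate(tokens)
            if i not in in_scope and t.lower() not in TRIGGERS and t not in PUNCT]


print(asserted_tokens("Normal slightly hypoventilatory chest x-ray, no pneumonia."))
```

For the slide's example sentence, "pneumonia" falls inside the right scope of "no" and is withheld from code matching, while the tokens before the comma remain available.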
Using the ICD
Exploration of inter-label dependencies
  - Overcoding, e.g. symptoms and diseases
  - C4.5 classifiers trained for false positive labels
    - Features: base-system labels
  - 5 dependencies extracted, each expressing: delete the symptom if the disease has textual evidence
    - e.g. delete Cough and Fever if Pneumonia is coded
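Applying such a dependency amounts to a small post-processing step over the base-system labels. The sketch below hard-codes the single example rule from the slide; the rule table and function name are hypothetical, and in the Szeged system the rules were derived from C4.5 models rather than written by hand. The ICD-9-CM codes used are 486 (pneumonia), 786.2 (cough) and 780.6 (fever).

```python
# Hypothetical rule table: disease code -> symptom codes to delete when it is present.
DELETE_IF_PRESENT = {
    "486": {"786.2", "780.6"},  # pneumonia suppresses cough and fever
}


def apply_dependencies(codes):
    """Remove symptom codes that are implied by a coded disease."""
    result = set(codes)
    for disease, symptoms in DELETE_IF_PRESENT.items():
        if disease in result:
            result -= symptoms
    return result


print(apply_dependencies({"486", "786.2", "780.6"}))
print(apply_dependencies({"786.2", "780.6"}))
```

Symptom codes survive only when no disease explaining them has been coded.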
Data-driven model
  - Vector Space Model
    - token 1-, 2- and 3-grams as features
  - C4.5 classifier on 45 binary classification tasks
  - Expanding the dictionaries: gathering missing synonyms, abbreviations
    - C4.5 classifiers trained for false negative labels
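The token n-gram features of the vector space model could be extracted along these lines. Whitespace tokenization and lowercasing are assumptions made for the sketch, and the function name is hypothetical.

```python
def ngram_features(text, max_n=3):
    """Return the set of token 1- to max_n-grams occurring in the text."""
    tokens = text.lower().split()
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features


print(sorted(ngram_features("reactive airways disease")))
```

A multi-word term such as "reactive airways disease" then becomes a single feature that a per-code classifier can pick up as a missing synonym.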
Example of terms found Urinary Tract Infection uti Asthma reactive airways disease Laurence-Moon-Biedl syndrome Williams syndrome Beckwith-Wiedemann syndrome hemihypertrophy
External knowledge (ICD) vs. data-driven models
  - ICD
    - data independent
    - robust (information source is reliable)
    - can cover rare codes
  - Data-driven
    - can explore individual coding style (synonyms, abbreviations)
    - requires labeled documents
    - cannot handle rare codes
Added values of the subphases
  System                                    Train     Eval
  45-class statistical system               88.20%    86.69%
  ICD                                       84.07%    83.21%
  + inter-label dependencies                85.57%    84.85%
  + statistical enriching (synonyms)        90.26%    88.93%
  Union of statistical and coding guide     90.53%    89.33%
  Hand-crafted system                       90.02%    89.41%
  - language processing                     71.46%    70.48%
The Turku Group in the Challenge
  - Language processing group at the Department of IT, University of Turku, and Turku Centre for Computer Science (TUCS)
    - Antti Airola, Filip Ginter, Tapio Pahikkala, Sampo Pyysalo, Tapio Salakoski, Hanna Suominen
  - Department of Nursing Science, University of Turku
    - Sanna Salanterä
The Turku ICD coding system
  - Feature engineering
    - Mapping text to UMLS concepts (MetaMap)
    - Recognition of negation and speculation
    - Generalization via hypernymy
  - Machine learning
    - Primary classifier (RLS)
    - Secondary classifier (Ripper): corrections of known errors made by the primary classifier
    - Additional training instances from ICD definitions
MetaMap
  - MetaMap identifies instances of UMLS concepts in running text
  - NLM's MetaMap program
    - Divides running text into phrases
    - Each phrase is mapped into a set of UMLS concepts from specified vocabularies
  - A way to abstract from text
MetaMap output example
  Eleven year old
    Eleven, Quantitative Concept, C0205457
    Year, Temporal Concept, C0439234
    Old, Temporal Concept, C0580836
  with acute leukemia
    Acute leukaemia, Neoplastic Process, C0085669
  bone marrow transplant
    Bone marrow transplant, Therapeutic or Preventive Procedure, C0005961
  on Jan. 2 now with three day history
    Three, Quantitative Concept, C0205449
    Day, Temporal Concept, C0439228
    History, Occupation or Discipline, C0019664
  of cough
    Cough, Sign or Symptom, C0010200
Hypernym expansion
  - Hypernyms as additional features
  - Generalize the identified concepts along the hierarchy
    - Cough -> Respiratory symptoms -> Signs and Symptoms
    - Fever -> Body temperature altered -> Signs and Symptoms
    - Atelectasis -> Diseases of the lung -> Diseases of the respiratory system
    - Pneumonia -> Diseases of the lung -> Diseases of the respiratory system
Hypernym expansion motivation
  - More accurate similarity information
    - Lexically, cough and fever are different
    - Hypernym expansion adds the information that both are symptoms
  - The connection can also be learned given large quantities of data
    - But rare cases can benefit here
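The effect on similarity can be shown with a toy example. The hierarchy below is copied from the hypernym expansion slide; the lookup table, function names, and the use of plain feature overlap as a similarity measure are simplifying assumptions (the actual system reads hypernyms from UMLS and feeds the features to a classifier).

```python
# Toy hypernym chains taken from the slide; a real system would query UMLS.
HYPERNYMS = {
    "Cough": ["Respiratory symptoms", "Signs and Symptoms"],
    "Fever": ["Body temperature altered", "Signs and Symptoms"],
    "Atelectasis": ["Diseases of the lung", "Diseases of the respiratory system"],
    "Pneumonia": ["Diseases of the lung", "Diseases of the respiratory system"],
}


def expand(concepts):
    """Add all hypernyms of the given concepts as extra features."""
    features = set(concepts)
    for concept in concepts:
        features.update(HYPERNYMS.get(concept, []))
    return features


def shared_features(a, b):
    """Number of features two concept sets have in common after expansion."""
    return len(expand(a) & expand(b))


print(shared_features({"Cough"}, {"Fever"}))            # 1
print(shared_features({"Atelectasis"}, {"Pneumonia"}))  # 2
```

Without expansion the two documents share no features; with expansion, "Cough" and "Fever" meet at "Signs and Symptoms".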
Negation and speculation
  - Negation, speculation, temporal information
  - Recognize trigger words
    - could, history of, likely, may, mild, minimal, no, past, possible, possibly, probable, probably, questionable, suggestive, unsure, without
  - Scope: everything from a trigger word up to the end of the current sentence
  - All features extracted from a negated text span are marked
  - ICD coding guide: a speculated / unsure code is not assigned
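A rough sketch of this marking step is given below. The trigger list is the one on the slide; the regular-expression sentence splitter, the tokenizer, the decision to mark the trigger word itself, and the function name are assumptions made for illustration. The `neg-` prefix follows the convention named on the feature engineering slide.

```python
import re

TRIGGERS = ["could", "history of", "likely", "may", "mild", "minimal", "no", "past",
            "possible", "possibly", "probable", "probably", "questionable",
            "suggestive", "unsure", "without"]

# Longest triggers first so that multi-word triggers are matched as a whole.
_TRIGGER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in sorted(TRIGGERS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def mark_negated(text):
    """Return tokens, prefixing 'neg-' from the first trigger to the sentence end."""
    features = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        match = _TRIGGER_RE.search(sentence)
        cut = match.start() if match else len(sentence)
        plain = re.findall(r"[a-z0-9-]+", sentence[:cut].lower())
        marked = re.findall(r"[a-z0-9-]+", sentence[cut:].lower())
        features += plain + ["neg-" + tok for tok in marked]
    return features


print(mark_negated("Normal chest x-ray, no pneumonia."))
```

Because negation, speculation and temporal triggers share one list, "history of pneumonia" is treated the same way as "no pneumonia".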
Hypernym expansion & negation
  - Hypernym expansion and negation
    - VALID: pneumonia -> lung disease
    - INVALID: not pneumonia -> not lung disease
  - Negated concepts are not expanded with hypernyms
  - Room for improvement
    - VALID: possible pneumonia -> possible lung disease
Feature engineering
  - Final set of features entering the classifier
  - Text tokens
    - No particular order: Bag-of-Words (BoW) model
    - Marked with neg- whenever negated
  - Set of UMLS concepts (their c-codes) extracted with MetaMap
    - Marked with neg- whenever negated
  - Set of hypernyms of the extracted UMLS concepts
    - Included only for non-negated concepts
Classification
  - RLS (regularized least-squares) classifier
    - Maximal-margin, kernel-based classifier
    - Close relative of Support Vector Machines (SVMs)
    - Linear kernel (fast & worked well)
  - One classifier for each code
    - 1 versus all classification
    - May lead to no codes assigned or an impossible combination of codes
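To make the one-versus-all setup concrete, here is a minimal linear RLS sketch in plain Python. It solves the regularized normal equations (XᵀX + λI)w = Xᵀy for each code with targets +1/-1 and assigns a code when the score is positive. The toy features, the value of λ, the decision threshold, and all function names are assumptions; the Turku implementation is not described at this level of detail on the slides.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def train_rls(X, y, lam):
    """Linear RLS: w = (X^T X + lam * I)^-1 X^T y."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * target for row, target in zip(X, y)) for i in range(d)]
    return solve(A, b)


def train_one_vs_all(X, label_sets, codes, lam=0.1):
    """One binary RLS model per code; targets are +1 (code present) or -1."""
    return {code: train_rls(X, [1.0 if code in labels else -1.0 for labels in label_sets], lam)
            for code in codes}


def predict(models, x):
    """Assign every code whose model gives a positive score."""
    return {code for code, w in models.items()
            if sum(wi * xi for wi, xi in zip(w, x)) > 0}


# Toy data: features are [mentions cough, mentions fever, bias].
X = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 1]]
LABELS = [{"786.2"}, {"780.6"}, {"786.2", "780.6"}, set()]
models = train_one_vs_all(X, LABELS, ["786.2", "780.6"])
print(predict(models, [1, 0, 1]), predict(models, [0, 0, 1]))
```

Since each code is decided independently, nothing prevents an empty result (as for the last toy input) or a combination of codes that never co-occur, which is the failure mode the slide points out.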
Correcting known errors
  - Cascaded classifier attempts to correct known errors
    - Empty or impossible combinations
  - RIPPER
    - Decision rules
    - A very different paradigm from RLS
    - Trained and applied exactly as the first classifier: 1 vs. all
  - Known errors made by the second classifier are left uncorrected
    - Experiments show no additional improvement
Using ICD-9 in training
  - ICD-9 definitions as training instances
  - Concatenate the textual definitions of each of the 45 codes and its parents in the ICD hierarchy
    - Same generalization idea as previously
  - Extract features in the standard way
  - Pool the resulting 45 training instances with the challenge training data
    - Provides additional positive examples
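Building one such pseudo-document might look like the sketch below. The three-entry hierarchy fragment is illustrative only (a real run would load all 45 codes and their ancestors from the ICD-9-CM tables), and the data structures and function name are hypothetical.

```python
# Illustrative fragment of the ICD-9-CM hierarchy (not a complete table).
DEFINITIONS = {
    "786.2": "Cough",
    "786": "Symptoms involving respiratory system and other chest symptoms",
    "780-789": "Symptoms",
}
PARENT = {"786.2": "786", "786": "780-789"}


def definition_instance(code):
    """Concatenate the definitions of a code and its ancestors into one
    training text labelled with that single code."""
    parts = []
    node = code
    while node is not None:
        parts.append(DEFINITIONS[node])
        node = PARENT.get(node)
    return " ".join(parts), {code}


text, labels = definition_instance("786.2")
print(text, labels)
```

The resulting text would then go through the same feature extraction as a radiology report and be pooled with the challenge training data as one extra positive example for its code.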
Turku system: Summary
  FEATURE EXTRACTION
    Source text -> Tokenization -> Negation and speculation detection -> MetaMap -> Set of UMLS concepts
    Set of UMLS concepts + UMLS hierarchy -> UMLS hypernym expansion -> Extended set of UMLS concepts + Source text tokens
  CLASSIFICATION
    RLS classifier (1 vs. All) -> Set of ICD codes
      possible combination -> Final set of ICD codes
      impossible combination -> RIPPER classifier (1 vs. All) -> Final set of ICD codes
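The control flow of the classification stage can be sketched as a small cascade. Treating a combination as "possible" when it occurs among the label sets seen in the training data is an assumption made here for illustration, as are the function and parameter names; the two classifiers are passed in as plain callables.

```python
def cascade(features, primary, secondary, known_combinations):
    """Keep the primary prediction if it is a non-empty, known code
    combination; otherwise fall back to the secondary classifier."""
    codes = frozenset(primary(features))
    if codes and codes in known_combinations:
        return set(codes)
    return set(secondary(features))


# Toy setup: combinations assumed to have been observed in training data.
KNOWN = {frozenset({"786.2"}), frozenset({"486"})}

print(cascade(None, lambda f: {"786.2"}, lambda f: {"486"}, KNOWN))
print(cascade(None, lambda f: {"486", "786.2"}, lambda f: {"486"}, KNOWN))
```

Note that the whole combination is rejected and replaced, in contrast to the Szeged approach of deleting individual codes.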
Turku system: Component contribution
  Component               F micro   Error   Relative gain
  RLS (initial)           79.3      20.7
  Tokenization            80.7      19.3    7%
  UMLS mapping            82.5      17.5    9%
  UMLS hypernyms          83.4      16.6    5%
  Negation/speculation    84.7      15.3    8%
  Cascaded Ripper         86.5      13.5    12%
  ICD-9 training data     86.6      13.4    1%
  Cross-validated performance on training data
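The "Relative gain" figures are consistent with a relative error reduction, that is, the drop in error (100 minus F) divided by the error of the previous step. The short check below recomputes the column from the F-scores on the slide; the function name is hypothetical.

```python
# (component, micro-averaged F) pairs as listed on the slide.
STEPS = [
    ("RLS (initial)", 79.3),
    ("Tokenization", 80.7),
    ("UMLS mapping", 82.5),
    ("UMLS hypernyms", 83.4),
    ("Negation/speculation", 84.7),
    ("Cascaded Ripper", 86.5),
    ("ICD-9 training data", 86.6),
]


def relative_gains(steps):
    """Relative error reduction (in whole percent) of each step over the previous one."""
    gains = []
    for (_, prev_f), (name, f) in zip(steps, steps[1:]):
        prev_error = 100.0 - prev_f
        error = 100.0 - f
        gains.append((name, round(100.0 * (prev_error - error) / prev_error)))
    return gains


for name, gain in relative_gains(STEPS):
    print(f"{name}: {gain}%")
```

For example, tokenization lowers the error from 20.7 to 19.3, and 1.4 / 20.7 is about 7%.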
Turku vs. Szeged: Crucial differences
  - Szeged system
    - No external resources beyond ICD-9
    - ICD-9 definitions and coding guidelines are the core of the system
    - Challenge system: rule-based
    - Replicated via machine learning
  - Turku system
    - Heavy reliance on UMLS
      - MetaMap
      - Hypernyms
    - ICD-9 definitions used as training examples, with a 0.1 percentage point improvement
    - No explicit use of ICD-9 coding guidelines
    - Pure machine learning
Turku vs. Szeged: Crucial differences
  - Szeged system allows individual ICD code deletion
    - if code X is given, delete code Y
  - Turku system rejects the whole code combination and applies a different classifier
  - Paradoxically, no gain from using the finer Szeged ICD code handling on top of the Turku results (0.3 percentage point F-score decrease)
    - E.g. a false positive disease code causes a true positive symptom code to be removed
  - Use of hypernym expansion
  - More detailed negation/speculation/temporal detection in the Szeged system
Language specifics
  - CMC challenge was on English text
  - How about other languages?
  - Szeged system
    - Needs translated ICD
    - Language-adapted negation/speculation detection
  - Turku system
    - Needs translated UMLS resources and MetaMap
    - Many of the features are language-independent
      - UMLS c-codes
    - Language-adapted negation/speculation detection
  - Both systems rely on string search in one way or another
    - A problem in inflective languages
Crucial differences (cont.)
  - Different approach to design
  - Turku system
    - Classifier-centric
    - Extract all conceivable features and feed them into a state-of-the-art classifier
  - Szeged system
    - Data-centric
    - Build from the available resources (ICD and training data) and use classifiers with interpretable models
    - Study the mistakes and the model, correct errors
CMC challenge results: The big picture
  - Best F-score 89.1 (Szeged system)
  - Mean F-score 76.7 (σ = 13.4)
  - Turku and Szeged baselines
    - Szeged: 83.2% F, bare system with just NLP and ICD but no other direct use of the training data
    - Turku: 80.7% F, bare machine learning system with no data preprocessing of any kind (only whitespace tokenization)
  - About half of the challenge submissions stayed below these baseline systems!
CMC challenge: Lessons learned
  - General observations across all submissions
    - Presented by Pestian et al., ACL'07 BioNLP workshop, 2007
    - Based on short system descriptions (not publicly available)
  1. Best systems explicitly took into account negation and speculation
  2. Better systems frequently worked with hypernym and synonym detection
  3. Significant amount of symbolic processing
  4. Two of the top three systems were ML-based
CMC challenge: Lessons learned
  5. Careful, medically-informed feature engineering common
  6. SVM and related state-of-the-art classification algorithms were strongly represented, but not reliably predictive of high ranking
  - Turku development observation: a number of traditional classifiers matched RLS performance when used correctly
Beyond the ICD coding
  - Similar NLP tasks
  - The same architecture can be used
    - Find the relevant parts of the documents
    - Find relevant phrases (synonyms, abbreviations)
    - simple string-matching with a particular dictionary
  - Prototype tasks
    - The i2b2 obesity challenge
    - Smoking status detection
The i2b2 obesity challenge
  - Who's obese and what co-morbidities do they (definitely/likely) have?
  - Informatics for Integrating Biology and the Bedside (i2b2)
  - February to June 2008
  - 730 training and 507 evaluation documents
  - multi-label problem, 16 morbidities
Comparison
  - Focusing on several morbidities (matchable with a set of ICD codes)
  - Longer documents (average length: 130 rows)
  - More noise
    - "The patient has a positive family history of coronary disease"
  - Negation/speculation detection is highlighted (Y/N/Q/U F-macro)
Smoking status detection
  - i2b2 challenge 2006
  - The patient in question is SMOKER, NON-SMOKER, PAST-SMOKER or smoker status UNKNOWN
  - inter-annotator agreement ~85%
  - 398 training and 104 evaluation documents
  - Small dictionaries: smoke, tobacco etc.
  - best systems: 88%; with external data: 94%
Final thoughts on ICD coding
  - Some clear advantages
    - lower costs
    - less error-prone processing of simpler cases
  - A fully automatic system is impossible (nowadays)
    - Far away from human intelligence
    - will not solve rare, harder cases
    - "Right middle and probable right lower lobe pneumonia."
The place of an automatic system
  - Pre-labeling/highlighting to speed up manual coding
    - prediction along with confidence measure
  - Validation
    - suggesting erroneous / missed codes
    - monitoring for health insurance companies
  - Automated coding of large datasets
    - mainly for statistical purposes
Tasks to be solved
  - Extending systems to thousands of codes
    - If a corpus of appropriate size is available
  - Incorporating more expert knowledge into the statistical methods
    - user-friendly interfaces
    - interactive systems
  - Better language processing
    - Corpus for developing sophisticated scope detectors: BioScope (released June 2008)
    - www.inf.u-szeged.hu/rgai/bioscope
Open questions
  - Every coder or institute has its own individual coding style
  - How to transfer among languages?
  - Is there any drop in accuracy
    - on other languages (free word order in Hungarian)?
    - on other domains (nursing notes)?
  - What is the real speed-up of an automatic pre-coding/suggestion system?
Open questions (cont.)
  - More training data needed to scale the systems up
  - Hospitals have the data, but privacy concerns prevent its dissemination to companies / NLP researchers who build the system
  - Training data generally cannot be reconstructed from trained machine-learning systems
    - Distribute an empty system?
    - Legal issues?
    - Technical issues?
Multilingual ICD tagging: summary
  - Basic NLP tools
    - Tokenizer
    - Lemmatizer
    - Tagger, phrase parser (in some approaches)
    - Need domain adaptation
  - Controlled domain vocabulary resources
    - Term variants (e.g. synonyms and abbreviations)
    - Generally scarce
    - Ideally within a large framework such as UMLS
    - Allowing tool re-use
Basic NLP resources
  - Tokenizer
    - Preferably domain-adapted
    - Very poor language standards in some clinical documents
  - Lemmatizer
    - Case in point: FinTWOL and nursing narratives
      - Basic FinTWOL extended by Lingsoft with ~3500 domain words
      - Recognition rate grew from 83.1% to 90.7%
      - That corresponds to 42% decrease in unrecognized running words
    - Hungarian: lemmatizers exist but are not domain-adapted due to data privacy concerns
      - Researchers who are able to adapt the lemmatizers do not have appropriate data access permissions
References
  - 1st place: Farkas, R., & Szarvas, G. (2008). Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinformatics, 9(Suppl 3), S10.
  - 2nd place: Crammer, K., Dredze, M., Ganchev, K., & Talukdar, P. P. (2007). Automatic code assignment to medical text. Proceedings of ACL'07 BioNLP workshop.
  - 3rd place: Suominen, H., Ginter, F., Pyysalo, S., Airola, A., Pahikkala, T., Salanterä, S., & Salakoski, T. (2008). Machine Learning to Automate the Assignment of Diagnosis Codes to Free-text Radiology Reports: a Method Description. Proceedings of the ICML/UAI/COLT Workshop on Machine Learning for Health-Care Applications.
  - Challenge description: Pestian, J. P., Brew, C., Matykiewicz, P., Hovermale, D., Johnson, N., Cohen, K. B., & Duch, W. (2007). A shared task involving multi-label classification of clinical free text. Proceedings of ACL'07 BioNLP workshop.