Sanda Harabagiu. The University of Texas at Dallas Human Language Technology Research Institute

Transcription

1 Linking Information Extracted from Electronic Medical Records to Structured Knowledge Sanda Harabagiu The University of Texas at Dallas

2 Outline of the talk 1. The Problem 2. Extracting medical concepts 3. Identifying assertions in clinical texts 4. Relation Extraction 5. Lessons Learned

3 Ontological Resources: UMLS
The Unified Medical Language System (UMLS) consists of:
1. a semantic network of biomedical semantic concepts and the semantic relations that span them;
2. a Metathesaurus, which encodes terms and codes from many vocabularies, including CPT, ICD-10-CM, LOINC, MeSH, RxNorm, and SNOMED CT;
3. the SPECIALIST Lexicon and Lexical Tools.
We used UMLS to expand topic keywords into the phrases encoded in the UMLS Metathesaurus that share the same CONCEPT ID. This primarily provides high-confidence keyword synonyms.
Figure labels (example synonym set): hindlimb, hind limb, leg, lower leg, leg region, lower extremity.
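The keyword expansion described on this slide can be sketched as a lookup over (phrase, concept ID) rows. This is a minimal sketch, not the system from the talk: the table `TOY_METATHESAURUS`, the IDs `C_LEG` and `C_IBU`, and the function names are all hypothetical, and a real Metathesaurus would be loaded from the UMLS distribution files.

```python
from collections import defaultdict

# Hypothetical (phrase, concept ID) rows; the IDs are placeholders, not real UMLS identifiers.
TOY_METATHESAURUS = [
    ("leg", "C_LEG"),
    ("lower extremity", "C_LEG"),
    ("hind limb", "C_LEG"),
    ("hindlimb", "C_LEG"),
    ("ibuprofen", "C_IBU"),
    ("Advil", "C_IBU"),
]


def build_indexes(rows):
    """Index the rows both ways: phrase -> concept IDs and concept ID -> phrases."""
    phrase_to_ids = defaultdict(set)
    id_to_phrases = defaultdict(set)
    for phrase, concept_id in rows:
        phrase_to_ids[phrase.lower()].add(concept_id)
        id_to_phrases[concept_id].add(phrase)
    return phrase_to_ids, id_to_phrases


def expand_keyword(keyword, rows):
    """Return every phrase that shares a concept ID with the keyword."""
    phrase_to_ids, id_to_phrases = build_indexes(rows)
    expanded = set()
    for concept_id in phrase_to_ids.get(keyword.lower(), ()):
        expanded |= id_to_phrases[concept_id]
    return sorted(expanded)


print(expand_keyword("leg", TOY_METATHESAURUS))
```

A keyword that is not in the table expands to an empty list, so callers can fall back to the original keyword.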

4 Resources: Clinical Ontologies
The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is the most comprehensive, multilingual clinical healthcare terminology in the world. SNOMED CT is owned, maintained and distributed by the International Health Terminology Standards Development Organization (IHTSDO).
SNOMED CT consists of four primary core components:
1. Concept Codes - numerical codes that identify clinical terms, primitive or defined, organized in hierarchies
2. Descriptions - textual descriptions of Concept Codes
3. Relationships - relationships between Concept Codes that have a related meaning
4. Reference Sets - used to group Concepts or Descriptions into sets, including reference sets and cross-maps to other classifications and standards.
We utilize this relationship knowledge to expand a keyword so that it captures any phrase that partakes in the child-side of an IS-A, PART-OF or COMPONENT relationship. This allows us to expand hypernyms and meronyms.
Figure labels (example): atypical antipsychotic, clozapine, clozaril, aripiprazole, abilify, asenapine.
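The child-side expansion above can be sketched as a graph traversal. This is a minimal sketch under stated assumptions: the triples in `TOY_RELATIONS` are hand-written for illustration (they are not SNOMED CT records), the traversal is assumed to be transitive, and all names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical (child, relation, parent) triples for illustration only.
TOY_RELATIONS = [
    ("clozapine", "IS-A", "atypical antipsychotic"),
    ("aripiprazole", "IS-A", "atypical antipsychotic"),
    ("asenapine", "IS-A", "atypical antipsychotic"),
    ("atypical antipsychotic", "IS-A", "antipsychotic"),
]
EXPANDABLE = {"IS-A", "PART-OF", "COMPONENT"}


def expand_descendants(keyword, triples):
    """Collect every term reachable on the child side of an expandable relation."""
    children = defaultdict(set)
    for child, relation, parent in triples:
        if relation in EXPANDABLE:
            children[parent].add(child)
    seen, queue = set(), deque([keyword])
    while queue:  # breadth-first walk down the hierarchy
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(expand_descendants("atypical antipsychotic", TOY_RELATIONS))
```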

5 The Problem Ontologies provide machine-readable descriptions of biomedical concepts and their relations. Linking domain-specific terms expressed in clinical texts to their ontological encodings provides a platform for semantic interpretation of the clinical narratives. Knowledge extracted from clinical documents can be curated and used to update the content of biomedical ontologies.

6 The difficulties The principal link between clinical or biomedical texts and an ontology is a terminology, which aims to map concepts to terms. A term is a textual realization of a concept, e.g. a disease, gene, or protein. The problems: term variation and term ambiguity. Terms occur in a context, and they may have assertions associated with them. Relations between terms differ from relations between concepts.

7 Term Variation Term variation originates from the ability of a natural language to express a single concept in a number of ways. For example, in biomedicine there are many synonyms for proteins, enzymes, genes, etc. Having six or seven synonyms for a single concept is not unusual in this domain. The probability of two experts using the same term to refer to the same concept is less than 20 per cent. In addition, biomedicine includes pharmacology, where numerous trademark names refer to the same compound (e.g. Advil, Brufen, Motrin, Nuprin and Nurofen all refer to ibuprofen).

8 Term ambiguity
Bad News!!! Term ambiguity occurs when the same term is used to refer to multiple concepts. Ambiguity is an inherent feature of natural language. Words typically have multiple dictionary entries and the meaning of a word can be altered by its context.
Some Good News: Sublanguages, as the languages confined to specialized domains, provide a context which generally reduces the level of ambiguity.
More Bad News!!! However, biomedicine encompasses a plethora of subdomains, which is an additional cause of the high level of ambiguity in biomedical terminology. For example, in biology the term promoter refers to a binding site in a DNA chain at which RNA polymerase binds to initiate transcription of messenger RNA by one or more nearby structural genes, while in chemistry it denotes a substance that in very small amounts is able to increase the activity of a catalyst. In addition, acronyms are extensively used in biomedicine (a new acronym is introduced in every five to ten abstracts in Medline) and they are known to be highly ambiguous (80 per cent of acronyms are ambiguous, the average number of possible interpretations being 15).

9 More on ambiguity
Acronym expansion: For example, AR could be expanded to any of the following terms: 1. Androgen Receptor, 2. AmphiRegulin, 3. Acyclic Retinoid, 4. Agonist Receptor, 5. Adrenergic Receptor.
Origins of ambiguity: Text is not the only origin of ambiguity in biomedicine. Ambiguity is inherent to the field, because the evolution of species gave rise to many homologues and analogues. For instance, NFKB2 denotes a family of two individual proteins with separate identifiers in Swiss-Prot. These proteins are homologues belonging to different species, human and chicken.

10 Pipeline of annotations Each natural language processing layer enhances the knowledge representation with machine-readable information. Different forms of ambiguity are resolved in the process: lexical, syntactic, semantic, pragmatic. Additional benefits: joint learning and extraction of concepts and the relations among them. Learning how to represent context!!!!

12 Details on Concept Extraction
Based on our experiments with the 2010 i2b2 Challenge data.
Extracting concepts involved two decisions:
Boundary classification: Identify the first and last words of each concept.
Type classification: Is the concept a problem, test, or treatment?
Discharge summaries contain numerous fields (zones), some of which are semi-structured (dates, dosages, etc.), while others are unstructured (prose).
Finding: Both have concepts.
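Boundary classification is often implemented as per-token labeling, with spans recovered afterwards. The sketch below assumes a B/I/O label scheme (begin, inside, outside), which the slide does not specify; the function name `bio_to_spans` is hypothetical.

```python
def bio_to_spans(labels):
    """Turn per-token B/I/O labels into (first, last) token index pairs, last inclusive."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:  # a new concept begins right after another one
                spans.append((start, i - 1))
            start = i
        elif label == "O":
            if start is not None:
                spans.append((start, i - 1))
                start = None
        # "I" continues an open span; a stray "I" with no open span is ignored
    if start is not None:  # concept running to the end of the sentence
        spans.append((start, len(labels) - 1))
    return spans


tokens = ["She", "was", "given", "Zofran", "for", "some", "nausea"]
labels = ["O", "O", "O", "B", "O", "B", "I"]
for first, last in bio_to_spans(labels):
    print(tokens[first:last + 1])
```

Each recovered span would then be passed to the type classifier (problem, test, or treatment).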

13 The Data The 2010 i2b2 challenge data consists of 826 discharge summaries and progress notes, split into 349 training and 477 testing documents. The documents are annotated by medical professionals familiar with their use. The data contains 72,846 medical concepts (27k train, 45k test). Each concept is classified as: 1. a problem (e.g., disease, injury), 2. a test (e.g., diagnostic procedure, lab test), or 3. a treatment (e.g., drug, preventative procedure, medical device). Medical problems are assigned an assertion type (belief status) among: present, absent, possible, hypothetical, conditional, or associated with someone else. The distribution of assertion types is far from uniform: 69% of all problems are considered present, 20% absent, less than 5% each possible and hypothetical, and less than 1% each conditional and associated with someone else. Additionally, the data contains a third set of annotations: relations between concepts.

14 Concept Extraction Architecture
Figure labels: New Resources: Wikipedia and WordNet. Advanced Semantic Processing. Lexical, syntactic and semantic disambiguation.
Terms exhibit a high degree of variation, which is not always explicitly reflected in biomedical ontologies. For this reason, the UMLS ontology is distributed together with computational support for neutralisation of variation in the biomedical domain. MetaMap is a highly configurable program developed by Dr. Alan Aronson at the National Library of Medicine (NLM) to map biomedical text to the UMLS Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text.

15 Example Quantitative Extractions
Type: Example
Age: patient is 79 years old.
Date: diagnosed on april with
DiseaseID: CHRONIC RENAL FAILURE ( ICD-9-CM 585 )
Dosage: 5. Colace 100 milligrams po bid.
List Element: 5. Colace 100 milligrams po bid.
Measurement: Weight is 82 kilograms.
Name: Electronically Signed by **NAME[YYY ZZZ]
Percent: Birth weight was 3.29 kilograms in the 75th percentile
Time: FRI :04 PM

16 Concept Extraction
Preprocessing: Rule-based detection of measurements, dosages, and other entities.
Boundary Extraction: A heuristic separates prose from non-prose text. Then two Conditional Random Field (CRF) classifiers are used to extract concepts (one for prose, one for non-prose).
Concept Type: problem, test, or treatment. A Support Vector Machine (SVM) classifier performs 3-way classification.
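Rule-based preprocessing of this kind is commonly written as a small set of regular expressions. The sketch below is illustrative only: the patterns in `PATTERNS` are hypothetical and cover just three of the quantitative types, since the actual rules are not shown on the slides.

```python
import re

# Hypothetical patterns for illustration; the rules used in the talk are not shown on the slides.
PATTERNS = {
    "Age": re.compile(r"\b\d{1,3} years? old\b", re.IGNORECASE),
    "Dosage": re.compile(
        r"\b\d+(?:\.\d+)? (?:milligrams?|mg)(?: po)?(?: (?:bid|tid|qid))?\b",
        re.IGNORECASE,
    ),
    "Measurement": re.compile(r"\b\d+(?:\.\d+)? (?:kilograms?|kg)\b", re.IGNORECASE),
}


def detect_quantities(text):
    """Return (type, matched text) pairs for every pattern that fires."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(0)))
    return found


for line in ["patient is 79 years old.",
             "5. Colace 100 milligrams po bid.",
             "Weight is 82 kilograms."]:
    print(line, "->", detect_quantities(line))
```

The detected types can then be fed to the CRF and SVM classifiers as features rather than treated as final decisions.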

17 Concept Extraction
Resources used: MetaMap/UMLS; GENIA (chunk, POS); WordNet lemmas; Quantitative types; Semantic parsing; Wikipedia; Various word features; Affix features.
Results (columns: P, R, F1): Exact Boundary; Exact Boundary + Type; Inexact Boundary; Inexact Boundary + Type.
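The difference between exact and inexact boundary scoring can be sketched as follows. This is a minimal sketch, assuming that an inexact match means any token overlap between a predicted and a gold span; the function name `prf` and the span encoding are hypothetical, and the official challenge scorer may differ in detail.

```python
def prf(gold, predicted, exact=True):
    """Precision, recall and F1 over (first, last) token spans, last inclusive."""
    def match(a, b):
        if exact:
            return a == b
        return a[0] <= b[1] and b[0] <= a[1]  # the two spans share at least one token

    hits_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    hits_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = hits_pred / len(predicted) if predicted else 0.0
    recall = hits_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(3, 3), (5, 6)]        # "Zofran", "some nausea"
predicted = [(3, 3), (6, 6)]   # "Zofran", "nausea"
print(prf(gold, predicted, exact=True))
print(prf(gold, predicted, exact=False))
```

Scoring boundary plus type would compare a (first, last, type) triple instead of the span alone.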

18 Feature Set 2 (1/2)
CONTENT WORD (cw): lexicalized feature that selects an informative word from the constituent, other than the head. Selection heuristics are available in the paper. E.g. June for the phrase in last June.
PART OF SPEECH OF CONTENT WORD (cpos): part of speech tag of the content word. E.g. NNP for the phrase in last June.
PART OF SPEECH OF HEAD WORD (hpos): part of speech tag of the head word. E.g. NN for the phrase the futures halt.
NAMED ENTITY CLASS OF CONTENT WORD (cne): the class of the named entity that includes the content word. 7 named entity classes (from the MUC-7 specification) are covered. E.g. DATE for in last June's treatment.

19 Feature Set 2 (2/2)
BOOLEAN NAMED ENTITY FLAGS: set of features that indicate if a named entity is included at any position in the phrase:
nediseaseid: set to true if a disease name is recognized in the phrase.
nedosage: set to true if a dosage is recognized in the phrase.
neperson: set to true if a person name is recognized in the phrase.
nelist: set to true if a list expression is recognized in the phrase.
nepercent: set to true if a percentage expression is recognized in the phrase.
neage: set to true if an age expression is recognized in the phrase.
nedate: set to true if a date temporal expression is recognized in the phrase.
PHRASAL VERB COLLOCATIONS: set of two features that capture information about phrasal verbs:
pvcsum: the frequency with which a verb is immediately followed by any preposition or particle.
pvcmax: the frequency with which a verb is followed by its predominant preposition or particle.
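The two phrasal verb collocation features can be estimated from a POS-tagged corpus. The sketch below makes several assumptions that the slide does not confirm: frequencies are relative to the verb's total count, prepositions and particles are identified by the Penn Treebank tags IN and RP, and the names `pvc_features` and `PARTICLE_TAGS` are hypothetical.

```python
from collections import Counter, defaultdict

PARTICLE_TAGS = {"IN", "RP"}  # assumed: Penn Treebank preposition and particle tags


def pvc_features(tagged_sentences):
    """Map each verb to (pvcsum, pvcmax) as relative frequencies."""
    verb_total = Counter()
    verb_followers = defaultdict(Counter)
    for sentence in tagged_sentences:
        for i, (word, tag) in enumerate(sentence):
            if not tag.startswith("VB"):
                continue
            verb = word.lower()
            verb_total[verb] += 1
            if i + 1 < len(sentence) and sentence[i + 1][1] in PARTICLE_TAGS:
                verb_followers[verb][sentence[i + 1][0].lower()] += 1
    features = {}
    for verb, total in verb_total.items():
        followers = verb_followers[verb]
        pvcsum = sum(followers.values()) / total          # any preposition or particle
        pvcmax = max(followers.values(), default=0) / total  # the predominant one only
        features[verb] = (pvcsum, pvcmax)
    return features


corpus = [
    [("ruled", "VBD"), ("out", "RP")],
    [("ruled", "VBD"), ("out", "RP")],
    [("ruled", "VBD"), ("in", "IN")],
    [("ruled", "VBD"), ("pneumonia", "NN")],
]
print(pvc_features(corpus))
```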

21 Assertion Classification
Determining the belief status of a medical problem is a combination of:
Prior probability for the problem
Detection of context clues (words, predicates, section names)
An SVM classifier performed 6-way classification: Present, Absent, Hypothetical, Conditional, Possible, Associated with someone else.

22 Architecture of assertion classification system We use a NegEx feature to indicate the negation word associated with the medical problem. This allows the classifier to decide whether or not the negation word is useful and what assertion type it reflects. Additional medical features indicate if the problem was found in UMLS or MetaMap, as the distribution of assertion types for problems found within these resources differs from that of the documents. We use the General Inquirer's categorical information to better understand the context of a medical problem. We only use the If category, which indicates uncertainty words such as unexpected, hesitant, or suspicious.
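A negation-word feature of the kind described here can be sketched as a backward scan from the medical problem. This is a heavily simplified stand-in for NegEx, not its actual algorithm or trigger list: the tuple `NEGATION_TRIGGERS`, the window size, and the function name are assumptions made for illustration.

```python
NEGATION_TRIGGERS = ("no", "not", "denies", "without")  # tiny illustrative subset


def negation_feature(tokens, problem_start, window=5):
    """Return the closest negation word before the problem as a feature string."""
    lowest = max(0, problem_start - window)
    for i in range(problem_start - 1, lowest - 1, -1):  # scan right to left
        word = tokens[i].lower()
        if word in NEGATION_TRIGGERS:
            return "neg=" + word
    return "neg=NONE"


print(negation_feature("Patient denies any chest pain".split(), 3))
print(negation_feature("Patient has chest pain".split(), 2))
```

Emitting the word itself, rather than a yes/no flag, is what lets the classifier learn which negation words matter and which assertion type each one signals.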

23 Assertion Classification
Resources used: Semantic Parsing; NegEx; General Inquirer; Stemmed previous words; Section Name.
Results (columns: #, P, R, F1): Present; Absent; Possible; Hypothetical; Conditional; Assoc. w. someone else; Overall 92.7

25 Relation Identification
Relations can be present between any two concepts in a sentence.
We disallow relations between concepts with more than 9 intervening concepts.
Our Approach: Form pairs of concepts from the sentence. Classify each pair as having one of the relation types, or no relation.

26 Relation Types
1. TrIP: A certain treatment has improved or cured a medical problem (e.g., "infection resolved with antibiotic course");
2. TrWP: A patient's medical problem has deteriorated or worsened because of or in spite of a treatment being administered (e.g., "the tumor was growing despite the drain");
3. TrCP: A treatment caused a medical problem (e.g., "penicillin causes a rash");
4. TrAP: A treatment administered for a medical problem (e.g., "Dexamphetamine for narcolepsy");
5. TrNAP: The administration of a treatment was avoided because of a medical problem (e.g., "Relafen which is contra-indicated because of ulcers");
6. TeRP: A test has revealed some medical problem (e.g., "an echocardiogram revealed a pericardial effusion");
7. TeCP: A test was performed to investigate a medical problem (e.g., "chest x-ray done to rule out pneumonia"); and
8. PIP: Two problems are related to each other (e.g., "Azotemia presumed secondary to sepsis").

27 Strategy for extracting relations from electronic medical records The problem of relation discovery was cast as a multiclass classification problem. The classifier not only decides whether there is a relation between a pair of medical concepts, but also decides the relation's type. To be able to make such decisions, the classification system is trained on 349 documents comprising 5,264 relations.

28 Extraction of Medical Relations The multi-class classifier was implemented using a Support Vector Machine (SVM) implementation called LibLINEAR [5]. This software is an extension of LibSVM [6] restricted to a linear kernel to achieve significant speed gains. LibLINEAR allows users to specify the importance of each class through a weighting mechanism. In this way, we could specify that the "no relation" class should be given less weight. A frequent class tends to bias SVM decisions toward that class, improving accuracy but possibly hurting the F1 measure. Several weight values for the "no relation" class were tested by cross validation on the training set. The value which led to the best score was 0.025, a heavy discounting factor compared to the default of 1.0. Similarly, cross validation on the training set achieved the best results when the regularization parameter, C, was set to 0.5 and the termination parameter, epsilon, was set to 0.5.
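The effect of the class weight can be illustrated with a little arithmetic. LIBLINEAR's documentation describes per-class weights as multipliers on the parameter C, so a weight of 0.025 with C = 0.5 gives the "no relation" class an effective cost of 0.0125. The sketch below is not LIBLINEAR code: the function names are hypothetical, and a plain hinge loss is used for illustration because the slide does not say which solver was chosen.

```python
C = 0.5                                  # regularization parameter from the slide
CLASS_WEIGHTS = {"NoRelation": 0.025}    # every other class keeps the default weight of 1.0


def effective_cost(label, c=C, weights=CLASS_WEIGHTS):
    """Per-class misclassification cost: the class weight multiplies C."""
    return c * weights.get(label, 1.0)


def weighted_hinge_loss(label, margin):
    """Hinge loss of one training example, scaled by its class cost."""
    return effective_cost(label) * max(0.0, 1.0 - margin)


for label in ("TrIP", "NoRelation"):
    print(label, effective_cost(label), weighted_hinge_loss(label, margin=-1.0))
```

With these values, a misclassified "no relation" pair contributes 40 times less to the objective than a misclassified pair from any relation class, which is what counteracts the bias toward the frequent class.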

29 Example
[Bradycardia]_prob is resolved after [beta blockers]_treat and [calcium channel blockers]_treat were stopped and [Norvasc]_treat was started.
The following pairs are formed from this sentence:
(Bradycardia, beta blockers) [TrCP]
(Bradycardia, calcium channel blockers) [TrCP]
(Bradycardia, Norvasc) [TrIP]
(beta blockers, calcium channel blockers)
(beta blockers, Norvasc)
(calcium channel blockers, Norvasc)
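Pair formation for the sentence above can be sketched with `itertools.combinations`. This is a minimal sketch: the function name `candidate_pairs` is hypothetical, and the limit of 9 intervening concepts is taken from the Relation Identification slide.

```python
from itertools import combinations


def candidate_pairs(concepts, max_intervening=9):
    """All ordered-by-position concept pairs with at most max_intervening concepts between them."""
    pairs = []
    for (i, first), (j, second) in combinations(enumerate(concepts), 2):
        if j - i - 1 <= max_intervening:  # number of concepts strictly between the two
            pairs.append((first, second))
    return pairs


concepts = ["Bradycardia", "beta blockers", "calcium channel blockers", "Norvasc"]
for pair in candidate_pairs(concepts):
    print(pair)
```

Four concepts yield six candidate pairs, in the same order as listed on the slide; each pair is then classified as one of the relation types or as no relation.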

30 Classification
Classification of concept pairs was performed using a single SVM classifier.
Even pairs that could not form a valid relation were used for training.
Figure: (Bradycardia, Norvasc) -> Features -> SVM -> one of TrIP, TrWP, TrCP, TrAP, TrNAP, PIP, TeRP, TeCP, No Relation (0.025 weight).

31 Features
Five categories of information used for features:
1. Features using words between the concepts
2. Single-concept features
3. Concept types of nearby concepts
4. Wikipedia-based features
5. Contextual similarity to training concept pairs

32 Contextual Features
From the words between two concepts:
String of the word
Part of speech
Concept type (if applicable)
Phrase of all the words
Does the phrase represent a conjunction?
Sequence of phrase chunk types (GENIA)
If there are intervening concepts: the relations that exist between those concepts

33 Example
Patient developed [intermittent low - grade temperatures]_prob with no [obvious etiology]_prob ; [Tm]_test 1/20 of 38 [Tm]_test 1/
Features for (intermittent low-, Tm):
Words: with, no, obvious, etiology, ;
Concepts between: problem
POS: IN, DT, JJ, NN, ;
POS sequence: IN_DT_problem_:
Relations between: PIP
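The between-concept features in this example can be sketched as below. This is a minimal sketch under stated assumptions: tokens arrive already POS-tagged (the tags here are supplied by hand, with the semicolon tagged ":" as in the Penn Treebank), concepts are (first, last, type) token spans, and the function name `between_features` is hypothetical.

```python
def between_features(tokens, concepts, first, second):
    """Features from the tokens strictly between two concepts of one sentence."""
    start = concepts[first][1] + 1   # token after the first concept
    end = concepts[second][0]        # first token of the second concept (exclusive)
    words = [word for word, _ in tokens[start:end]]
    pos = [tag for _, tag in tokens[start:end]]
    # Concepts that lie entirely between the pair, keyed by their first token.
    inner = {c[0]: c for c in concepts if start <= c[0] and c[1] < end}
    sequence, i = [], start
    while i < end:
        if i in inner:               # collapse an intervening concept to its type
            sequence.append(inner[i][2])
            i = inner[i][1] + 1
        else:
            sequence.append(tokens[i][1])
            i += 1
    return {
        "words": words,
        "pos": pos,
        "pos_sequence": "_".join(sequence),
        "concepts_between": [c[2] for c in inner.values()],
    }


tokens = [("Patient", "NN"), ("developed", "VBD"), ("intermittent", "JJ"), ("low", "JJ"),
          ("-", ":"), ("grade", "NN"), ("temperatures", "NNS"), ("with", "IN"),
          ("no", "DT"), ("obvious", "JJ"), ("etiology", "NN"), (";", ":"), ("Tm", "NN")]
concepts = [(2, 6, "problem"), (9, 10, "problem"), (12, 12, "test")]
print(between_features(tokens, concepts, 0, 2))
```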

34 Single-Concept Features
String of the concept
WordNet lemma
General Inquirer positive/negative polarity
Token before concept
3 tokens after concept
Associated predicates extracted through a PropBank parse
Alternative concept type pairs for both arguments

35 Example
She was given [Zofran]_treat for [some nausea]_prob as well as [metoclopramide]_treat p.r.n.
Lemma1: zofran
Lemma2: some nausea
before1: given
after1: for, some, nausea
before2: for
after2: as, well, as
AssociatedPredicates1: given
AlternativePair1: NONE_treatment
AlternativePair2: problem_treatment
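The before and after token features in this example can be sketched with simple slicing. This is a minimal sketch: the function name `context_features` is hypothetical, concepts are given as inclusive token index ranges, and whitespace tokenization is assumed.

```python
def context_features(tokens, first, last, n_after=3):
    """Token before a concept and up to n_after tokens after it."""
    before = tokens[first - 1] if first > 0 else None
    after = tokens[last + 1:last + 1 + n_after]
    return {"before": before, "after": after}


tokens = "She was given Zofran for some nausea as well as metoclopramide p.r.n.".split()
print(context_features(tokens, 3, 3))   # [Zofran]
print(context_features(tokens, 5, 6))   # [some nausea]
```

The two calls reproduce before1/after1 and before2/after2 from the slide.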

36 Wikipedia Features
The idea: Map concepts to Wikipedia through exact page name match.
Features:
Determine if the pages link to each other
Determine the depth of LCS for the two pages within the category hierarchy
Top-level categories:
Medical tests -> Test
Diseases and disorders -> Problem
Medical treatments -> Treatment

37 Example
He was started on [heparin]_treat and he subsequently had [significant thrombocytopenia]_prob with [platelets]_test of 70,000.
Wikipedia excerpts: "Thrombocytopenia (or -paenia, or thrombopenia in short) is the presence of relatively few [platelets] in [blood]..." and "... platelets is called [thrombocytopathy], which could be either a low number of platelets ([thrombocytopenia]), ..."
Category labels: Platelet: Transfusion medicine, Clinical pathology (Test). Thrombocytopenia: Clinical pathology.
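The category-depth feature can be sketched on a toy hierarchy. Several things here are assumptions: LCS is read as the lowest common subsumer (the deepest category shared by both pages), depth is counted as the shortest number of edges from the root, and the `PARENTS` table is hand-written for illustration rather than taken from Wikipedia. All function names are hypothetical.

```python
# Hypothetical child -> parent categories; not an extract of the real Wikipedia hierarchy.
PARENTS = {
    "Platelet": ["Transfusion medicine", "Clinical pathology"],
    "Thrombocytopenia": ["Clinical pathology"],
    "Clinical pathology": ["Medical tests"],
    "Transfusion medicine": ["Medical treatments"],
    "Medical tests": ["Medicine"],
    "Medical treatments": ["Medicine"],
}


def ancestors(node, parents):
    """Every category reachable by following parent links upward."""
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(node, parents, root):
    """Shortest number of edges from node up to root, or None if unreachable."""
    frontier, seen, steps = {node}, {node}, 0
    while frontier:
        if root in frontier:
            return steps
        frontier = {p for n in frontier for p in parents.get(n, ()) if p not in seen}
        seen |= frontier
        steps += 1
    return None


def lcs_depth(a, b, parents, root):
    """Depth of the deepest category that subsumes both pages."""
    common = ancestors(a, parents) & ancestors(b, parents)
    depths = [d for d in (depth(c, parents, root) for c in common) if d is not None]
    return max(depths, default=None)


print(lcs_depth("Platelet", "Thrombocytopenia", PARENTS, "Medicine"))
```

A deeper shared category suggests the two concepts are more closely related, which is the intuition behind using this depth as a feature.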

38 Inexact Matching Features
Based on edit distance (Levenshtein), used as the distance measure for k-nearest neighbors (KNN).
During training, a KNN classifier is trained on all but one document and applied to that held-out document.
During testing, a KNN classifier trained on all training documents is used.
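The scheme on this slide can be sketched with a standard dynamic-programming Levenshtein distance and a majority vote. This is a minimal sketch under stated assumptions: the slide does not say what the edit distance is computed over, so token sequences of the context between two concepts are assumed, and the names `knn_label` and `leave_one_document_out` are hypothetical.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (strings or token tuples)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution or match
        previous = current
    return previous[-1]


def knn_label(query, training, k=3):
    """Majority label among the k training contexts closest to the query."""
    nearest = sorted(training, key=lambda item: levenshtein(query, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_document_out(documents, k=3):
    """For each document, predict its examples using all other documents only."""
    predictions = {}
    for doc_id, examples in documents.items():
        others = [ex for other_id, exs in documents.items() if other_id != doc_id for ex in exs]
        predictions[doc_id] = [knn_label(context, others, k) for context, _ in examples]
    return predictions


documents = {
    "doc1": [(("resolved", "with"), "TrIP")],
    "doc2": [(("resolved", "after"), "TrIP"), (("caused", "by"), "TrCP")],
}
print(leave_one_document_out(documents, k=1))
```

Holding out each training document keeps the KNN output honest as a feature: an example never sees itself among its neighbors, so the downstream classifier meets the same kind of signal at training time as at test time.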

39 Results
Columns: #, P, R, F1.
Rows: Exact Span; Span and relation type; TrIP; TrWP; TrCP; TrAP; TrNAP; PIP; TeRP; TeCP.

40 Conclusions
NLP techniques worked well on this data.
Could perform better if trained on medical text.
Large training data set may have reduced the contribution of medical ontologies.
Future work shall take into account more knowledge mining.
Crowd-sourced resources such as Wikipedia still provide some valuable information.