Automatic Generation of a Qualified Medical Knowledge Graph and its Usage for Retrieving Patient Cohorts from Electronic Medical Records
Travis Goodwin & Sanda Harabagiu
Human Language Technology Research Institute, The University of Texas at Dallas
Outline
The Problem
The Qualified Medical Knowledge Graph
Identifying Medical Concepts
Recognizing Assertions
Constructing the QMKG
Evaluation & Discussion
Conclusions
The Problem
More and more clinical data is available through Electronic Medical Records (EMRs).
Notes within EMRs include a variety of knowledge: medical history, physical exam findings, lab reports, radiology reports, operative reports, discharge summaries, etc.
EMRs do not document the rationale for medical decisions.
Patient cohort studies evaluate the progression of disease as well as the factors that influence clinical outcomes.
Patient Cohort Identification
TRECMed: a retrieval task offered by NIST in 2011 and 2012.
85 topics: queries targeting patient cohorts, combining medical concepts (e.g. acute coronary syndrome) and patient constraints (e.g. children).
95,703 de-identified EMRs from multiple hospitals in 2007.
The EMRs were grouped into hospital visits, each consisting of one or more medical reports from a patient's hospital stay. Thus, the EMRs were organized into 17,199 different patient hospital visits.
Each visit included the patient's admission diagnoses, discharge diagnoses, and related ICD-9 codes.
Sample TRECMed Topics
No.   Topic
156   Patients with depression on anti-depressant medication.
160   Patients with low back pain who had imaging studies.
172   Patients with peripheral neuropathy and edema.
184   Patients with colon cancer who had chemotherapy.
The 35 topics evaluated in 2011 and the 50 topics evaluated in 2012 were characterized by (a) usage of medical concepts (e.g. acute coronary syndrome or plavix) and (b) constraints imposed on the patient population (e.g. children, female patients).
The Barrier
Medical science involves forming hypotheses, experimenting with treatments, and reasoning from medical evidence. Consequently, clinical writing reflects this modus operandi with a rich set of speculative statements.
Barriers:
  Physicians use hedging, a linguistic means of expressing an opinion rather than a fact.
  Speculative statements are abundant.
Our solution:
  Automatically detect medical concepts.
  Automatically identify the medical assertions (belief values) associated with each medical concept.
  Use these qualified concepts to build a graph of medical knowledge.
Cohort Retrieval System
Retrieval system designed for TRECMed 2011/2012. A brief overview:
1. A topic is analyzed for keywords and other constraints.
2. Keywords are expanded using our Qualified Medical Knowledge Graph.
3. An initial retrieval is performed with BM25.
4. Results are re-ranked to ensure that assertion values agree between document and query.
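The initial retrieval step can be illustrated with a minimal BM25 sketch. The slides do not give the BM25 variant or its parameters, so the smoothed IDF form and the values `k1=1.2` and `b=0.75` are assumptions (common defaults), and the three toy "visits" are invented for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 and b are assumed defaults, not values taken from the slides.
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed IDF that stays positive even for frequent terms.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

# Toy visits (hypothetical text, one token list per hospital visit).
visits = [
    "patient with depression started on sertraline".split(),
    "low back pain imaging mri lumbar spine".split(),
    "no evidence of depression patient denies pain".split(),
]
scores = bm25_scores(["depression", "sertraline"], visits)
ranking = sorted(range(len(visits)), key=lambda i: scores[i], reverse=True)
```

Note that the third toy visit matches "depression" even though the condition is negated there, which is the kind of mismatch the assertion-aware re-ranking in step 4 is meant to address.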
The Qualified Medical Knowledge Graph
Medical concepts are automatically identified in EMRs and classified as a medical problem, a treatment, or a test.
Assertions are automatically identified and assigned to each medical concept.
The result is a graph whose nodes are qualified medical concepts, represented as triples: (concept text, concept type, assertion).
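A node of this kind can be sketched as an immutable triple. The class name, field names, and label strings below are illustrative assumptions; PRESENT and ABSENT are used because they are among the 2010 i2b2 assertion values.

```python
from typing import NamedTuple

class QualifiedConcept(NamedTuple):
    """One graph node: (concept text, concept type, assertion)."""
    text: str
    concept_type: str  # assumed labels: "PROBLEM", "TREATMENT", "TEST"
    assertion: str     # belief value attached to the concept

# The same concept text with different assertions yields distinct nodes.
afib_present = QualifiedConcept("atrial fibrillation", "PROBLEM", "PRESENT")
afib_absent = QualifiedConcept("atrial fibrillation", "PROBLEM", "ABSENT")

# Tuples are hashable, so repeated mentions collapse into a single node.
nodes = {
    afib_present,
    afib_absent,
    QualifiedConcept("atrial fibrillation", "PROBLEM", "PRESENT"),
}
```

Keeping the assertion inside the node identity is what lets the graph distinguish a confirmed problem from one that was ruled out.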
Example of assertions
The Qualified Medical Knowledge Graph
The Qualified Medical Knowledge Graph
An edge between two graph nodes exists if the corresponding medical concepts co-occur within a fixed window of tokens (20 tokens in our experiments) within the same EMR.
This idea of generating edges between medical concepts recognized in EMRs was inspired by the SympGraph methodology reported in Sondhi et al. (KDD 2012), which models symptom relationships in clinical notes.
Automatic Medical Concept Recognition
Medical Concept Identification in EMRs
Medical concepts take the form of:
1. medical problems, such as ATRIAL FIBRILLATION (irregular heart beat);
2. treatments, such as ABLATION (removal of undesired tissue); and
3. tests, such as ECG (electrocardiogram).
They were recognized using the method reported in (Roberts and Harabagiu, JAMIA 2011), which recognizes medical concepts in two steps:
Step 1: Identification of the boundaries of the text that refers to a medical concept;
Step 2: Classification of the medical concept as (a) a medical problem, (b) a medical treatment, or (c) a medical test.
Medical Concept Identification
Preprocessing: rule-based detection of measurements, dosages, and other entities.
Boundary: a heuristic separates prose from non-prose text; two Conditional Random Field (CRF) classifiers then extract concepts (one for prose, one for non-prose).
Type: a Support Vector Machine (SVM) classifier performs 3-way classification.
Training the Medical Concept Identification System
The data: 349 discharge summaries and progress notes from the 2010 i2b2/VA challenge, providing a total of 25K training instances of medical concepts.
Testing data: the TRECMed clinical documents.
A very large set of features was extracted.
Three distinct automatic feature selection methods were used:
1. Greedy forward: also known as additive feature selection, this method takes a greedy approach by always selecting the best feature to add to the feature set.
2. Greedy forward/backward: also known as floating forward feature selection, this is an extension of greedy forward selection that greedily attempts to remove features from the current feature set after a new feature is added.
3. Feature selection using a genetic algorithm.
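Greedy forward selection, as described above, can be sketched in a few lines. The stopping rule (stop when no candidate improves the score) and the toy scoring function are assumptions for illustration; in practice `evaluate` would train and validate a classifier on the candidate feature set, and the feature names here are hypothetical.

```python
def greedy_forward_selection(features, evaluate):
    """Repeatedly add the single feature that most improves the score."""
    selected = []
    best_score = evaluate(selected)
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current feature set.
        candidates = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(candidates, key=lambda c: c[0])
        if score <= best_score:
            break  # assumed stopping rule: no candidate helps any more
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Hypothetical scoring function: useful features add a fixed gain,
# every other feature costs a small penalty.
USEFUL = {"word_unigram": 0.30, "section_header": 0.20, "umls_type": 0.10}

def toy_evaluate(feature_set):
    gain = sum(USEFUL.get(f, 0.0) for f in feature_set)
    penalty = 0.05 * sum(1 for f in feature_set if f not in USEFUL)
    return gain - penalty

selected, score = greedy_forward_selection(
    ["token_length", "word_unigram", "umls_type", "section_header"],
    toy_evaluate,
)
```

The forward/backward variant would add a second inner loop after each addition that tries removing each selected feature and keeps the removal if the score improves.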
Results for Medical Concept Identification
Official i2b2/VA results:
                          P      R      F1
Exact Boundary            83.7   80.8   82.2
Exact Boundary + Type     81.0   78.2   79.6
Inexact Boundary          92.7   89.5   91.1
Inexact Boundary + Type   89.3   89.2   89.2

System                    Score
Best i2b2 submission      85.23
Our i2b2 submission       79.59
Median i2b2 submission    77.78
Mean i2b2 submission      73.56
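As a quick consistency check, the F1 column can be recomputed from the reported precision and recall, assuming F1 is the usual harmonic mean of the two. The dictionary below only restates the four reported rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in the results above.
reported = {
    "Exact Boundary": (83.7, 80.8),
    "Exact Boundary + Type": (81.0, 78.2),
    "Inexact Boundary": (92.7, 89.5),
    "Inexact Boundary + Type": (89.3, 89.2),
}
recomputed = {name: round(f1_score(p, r), 1) for name, (p, r) in reported.items()}
```

Under that assumption the recomputed values agree with the reported F1 column to one decimal place.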
Medical Assertion Recognition
Assertion Classification
Determining the belief status of a medical problem is also known as medical assertion. To recognize assertions automatically, we cast the task as a classification problem, implemented as an SVM classifier informed by (a) the medical concept about which the assertion is made, (b) the metadata available in the header of the section where the assertion appears, and (c) features from UMLS (extracted by MetaMap), as well as features reflecting negated statements, obtained with the NegEx negation detection package. A special class of features that provide belief values comes from the General Inquirer's category information.
The SVM classifier performed 12-way classification: 6 assertion types from the 2010 i2b2 challenge and 6 new assertion types, based on 2,349 new annotations.
Assertion Types (figure; a marker in the legend denotes the new assertion types)
Results for Medical Assertion Classification
System                      Score
GFB+GA+GFB                  93.94
GFB+GA                      93.93
GFB                         93.84
Best i2b2 submission        93.62
Our i2b2 submission (GF)    92.75
Median i2b2 submission      91.96
Mean i2b2 submission        86.18
GF: greedy forward; GFB: greedy forward/backward; GA: genetic algorithm.
Source: "A flexible framework for deriving assertions from electronic medical records", Kirk Roberts and Sanda Harabagiu, JAMIA 2011.
Constructing the QMKG
A weighted undirected graph encoding similarity between qualified medical concepts: G = (V, E).
Vertices: triples representing qualified medical concepts (lexical concept, concept type, assertion).
Edges: two vertices are connected if and only if they co-occur within the same context (we used a window of 20 tokens).
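Edge construction by windowed co-occurrence can be sketched as follows. The slides do not say how the window is measured, so this sketch assumes that two mentions co-occur when their starting token offsets within the same EMR differ by at most `window`; the mention list and its offsets are invented for illustration.

```python
from collections import Counter

def cooccurrence_edges(mentions, window=20):
    """Count unordered pairs of distinct nodes mentioned within `window` tokens.

    mentions: (token_offset, node) pairs from a single EMR.
    """
    mentions = sorted(mentions, key=lambda m: m[0])
    edges = Counter()
    for i, (pos_i, node_i) in enumerate(mentions):
        for pos_j, node_j in mentions[i + 1:]:
            if pos_j - pos_i > window:
                break  # later mentions are even further away
            if node_i != node_j:
                # frozenset makes the pair unordered (undirected graph).
                edges[frozenset((node_i, node_j))] += 1
    return edges

# Hypothetical mentions: (starting token offset, qualified concept triple).
chest_pain = ("chest pain", "PROBLEM", "PRESENT")
pneumonia = ("pneumonia", "PROBLEM", "ABSENT")
fever = ("fever", "PROBLEM", "POSSIBLE")
edges = cooccurrence_edges([(3, chest_pain), (10, pneumonia), (45, fever)])
```

Summing the counters returned for every EMR would give corpus-level co-occurrence counts that a similarity function can then turn into edge weights.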
Vertex Extraction
Constructing the QMKG
The QMKG is represented as an adjacency matrix, A.
An associated weight matrix, W, encodes the similarity between all pairs of qualified concepts according to some similarity function S.
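The relationship between A, W, and S can be sketched with small dense matrices. Pointwise mutual information (PMI) is used here only as one example of S, and the count-based inputs (`pair_counts`, `node_counts`, `total`) are assumed data structures, not the authors' implementation; node names are shortened to strings for brevity.

```python
import math

def pmi(pair_count, count_u, count_v, total):
    """Pointwise mutual information from raw counts over `total` windows."""
    joint = pair_count / total
    return math.log(joint / ((count_u / total) * (count_v / total)))

def build_matrices(nodes, pair_counts, node_counts, total, similarity=pmi):
    """Return (A, W): symmetric adjacency and weight matrices as nested lists."""
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for (u, v), count in pair_counts.items():
        i, j = index[u], index[v]
        A[i][j] = A[j][i] = 1  # undirected edge
        W[i][j] = W[j][i] = similarity(count, node_counts[u], node_counts[v], total)
    return A, W

# Hypothetical counts for three qualified concepts.
nodes = ["fever/PRESENT", "pneumonia/POSSIBLE", "fracture/ABSENT"]
pair_counts = {("fever/PRESENT", "pneumonia/POSSIBLE"): 8}
node_counts = {"fever/PRESENT": 20, "pneumonia/POSSIBLE": 10, "fracture/ABSENT": 5}
A, W = build_matrices(nodes, pair_counts, node_counts, total=100)
```

Dense lists are only practical for a toy example; a graph with hundreds of thousands of nodes would need a sparse representation such as a dictionary of neighbor weights.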
First-Order Similarity Functions
Second-Order Similarity Function
Qualified medical concepts are extremely sparse within EMRs.
Many qualified medical concepts never share a window, yet still share some degree of semantic similarity that could be of value.
We generalized the notion of second-order PMI to compute the second-order similarity between two nodes using any first-order similarity measure.
The second-order similarity of two nodes is calculated as an aggregation of the first-order similarities between them and the β highest-weighted intermediate nodes.
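The slides do not give the exact aggregation, so the following is only one plausible reading, loosely modeled on second-order PMI: for each of the two nodes, take its β highest-weighted neighbors and sum the other node's first-order similarity to them, then average the two directions. The function names, the averaging, and the toy weights are all assumptions.

```python
def top_neighbors(W, node, beta):
    """Return the beta highest-weighted neighbors of `node`."""
    ranked = sorted(W.get(node, {}).items(), key=lambda kv: kv[1], reverse=True)
    return [neighbor for neighbor, _ in ranked[:beta]]

def second_order_similarity(W, u, v, beta=2):
    """Aggregate first-order similarities through shared intermediate nodes.

    W maps node -> {neighbor: first-order similarity} and is symmetric.
    """
    def one_sided(a, b):
        # How similar is b to the strongest neighbors of a?
        return sum(
            W.get(b, {}).get(x, 0.0)
            for x in top_neighbors(W, a, beta)
            if x != b
        )
    return (one_sided(u, v) + one_sided(v, u)) / (2 * beta)

# Hypothetical first-order weights; "dyspnea" and "orthopnea" never co-occur.
W = {
    "dyspnea": {"chf": 2.0, "edema": 1.0},
    "orthopnea": {"chf": 1.5, "edema": 0.5},
    "chf": {"dyspnea": 2.0, "orthopnea": 1.5},
    "edema": {"dyspnea": 1.0, "orthopnea": 0.5},
}
similarity = second_order_similarity(W, "dyspnea", "orthopnea", beta=2)
```

In this toy graph the two symptoms receive a non-zero similarity purely through their shared neighbors, which is the sparsity problem the second-order measure is meant to address.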
Evaluation & Discussion
Precision and recall for our assertion values, evaluated against the 2010 i2b2 data and our own annotations on EMRs.
Evaluation of the QMKG
Generated QMKG statistics:
634 thousand nodes with 13.9 billion edges (3.45% connectivity)
53.0% of nodes are medical problems
23.6% of nodes are medical tests
23.3% of nodes are medical treatments
Assertion types are distributed as follows:
Evaluation of the QMKG
We evaluated the QMKG on the TRECMed cohort retrieval task, using it as a means of query expansion:
Keywords are mapped to their qualified medical concepts in the QMKG.
The 20 highest-weighted neighbors of each keyword are selected as new keywords.
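The expansion step can be sketched as a top-k neighbor lookup. This sketch assumes the keywords have already been mapped to graph nodes and that the graph is stored as a dictionary of neighbor weights; the example weights are invented, and `top_k` is set to 2 only to keep the example small (the slides use 20).

```python
def expand_query(keywords, graph, top_k=20):
    """Append each keyword's top_k highest-weighted neighbors to the query."""
    expanded = list(keywords)
    seen = set(keywords)
    for keyword in keywords:
        neighbors = sorted(
            graph.get(keyword, {}).items(), key=lambda kv: kv[1], reverse=True
        )
        for node, _weight in neighbors[:top_k]:
            if node not in seen:  # avoid duplicate query terms
                seen.add(node)
                expanded.append(node)
    return expanded

# Hypothetical neighbor weights for one query concept.
graph = {
    "colon cancer": {"chemotherapy": 3.1, "colonoscopy": 2.4, "anemia": 0.9},
}
expanded = expand_query(["colon cancer"], graph, top_k=2)
```

Keywords with no node in the graph are passed through unchanged, so the expansion never shrinks the original query.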
Query Expansion using the QMKG
TRECMed 2012 Scores
iap: inferred Average Precision
indcg: inferred Normalized Discounted Cumulative Gain
P@10: precision within the first 10 results
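Of the three measures, P@10 is simple enough to sketch directly; the inferred measures depend on sampled relevance judgments and are not reproduced here. The visit identifiers and relevance set below are invented for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the first k retrieved items that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for item in top if item in relevant_ids) / k

# Hypothetical ranking of 15 visits, 5 of which are judged relevant.
ranked = ["v%d" % i for i in range(1, 16)]
relevant = {"v1", "v3", "v4", "v8", "v12"}
p_at_10 = precision_at_k(ranked, relevant, k=10)
```

Here four of the five relevant visits fall within the first ten results, so the value is 0.4; the fifth relevant visit at rank 12 does not count.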
Conclusion
We created a medical knowledge graph relating pairs of medical concepts qualified by the physician's belief status.
Using this kind of information, we make progress towards bridging the inherent knowledge gap tied to understanding EMRs.
The graph provides very promising results for patient cohort identification.