Travis Goodwin & Sanda Harabagiu


1 Automatic Generation of a Qualified Medical Knowledge Graph and its Usage for Retrieving Patient Cohorts from Electronic Medical Records Travis Goodwin & Sanda Harabagiu Human Language Technology Research Institute The University of Texas at Dallas

2 Outline: The Problem; The Qualified Medical Knowledge Graph; Identifying Medical Concepts; Recognizing Assertions; Constructing the QMKG; Evaluation & Discussion; Conclusions

3 The Problem More and more clinical data is available through Electronic Medical Records (EMRs). Notes within EMRs include a variety of knowledge: medical history, physical exam findings, lab reports, radiology reports, operative reports, discharge summaries, etc. EMRs do not document the rationale for medical decisions. Patient cohort studies evaluate progression of disease as well as the factors that influence clinical outcomes.

4 Patient Cohort Identification TRECMed: a retrieval task from NIST offered in 2011 & 2012. Topics: queries targeting patient cohorts, combining medical concepts (e.g. acute coronary syndrome) and patient constraints (e.g. children). 95,703 de-identified EMRs from multiple hospitals. The EMRs were grouped into hospital visits, each consisting of one or more medical reports from a patient's hospital stay. Thus, the EMRs were organized into 17,199 different patient hospital visits. Each visit had the patient's admission diagnoses, discharge diagnoses, and related ICD-9 codes.

5 Sample TRECMed Topics (No.: Topic) 156: Patients with depression on anti-depressant medication. 160: Patients with low back pain who had imaging studies. 172: Patients with peripheral neuropathy and edema. 184: Patients with colon cancer who had chemotherapy. The 35 topics evaluated in 2011 and the 50 topics evaluated in 2012 were characterized by (a) usage of medical concepts (e.g. acute coronary syndrome or plavix) and (b) constraints imposed on the patient population (e.g. children, female patients).

6 The Barrier Medical science involves posing hypotheses, experimenting with treatments, and reasoning from medical evidence. Consequently, clinical writing reflects this modus operandi with a rich set of speculative statements. Barriers: physicians use hedging, or linguistic means of expressing an opinion rather than a fact; speculative statements are abundant. Our Solution: automatically detect medical concepts; automatically identify the medical assertions (belief values) associated with each medical concept; use these qualified concepts to build a graph of medical knowledge.

7 Cohort Retrieval System Retrieval system designed for TRECMed 2011/2012. A brief overview: 1. A topic is analyzed for keywords and other constraints. 2. Keywords are expanded using our qualified medical knowledge graph. 3. Initial BM25 retrieval. 4. Re-ranking to ensure that assertion values agree between document and query. Qualified Medical Knowledge Graph
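Step 3 of the overview relies on BM25. As a rough illustration only, the sketch below scores tokenized documents with the textbook Okapi BM25 formula. The parameter values k1 = 1.2 and b = 0.75, the smoothed IDF variant, and the function name bm25_scores are assumptions; the slides do not say which settings or implementation were used.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # a term absent from the document contributes nothing
            # Smoothed IDF (assumed variant); always positive.
            idf = math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1.0)
            # Length normalization penalizes longer-than-average documents.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy "visits" reduced to keyword lists (hypothetical data).
docs = [["colon", "cancer", "chemotherapy"], ["low", "back", "pain"], ["colon", "polyp"]]
scores = bm25_scores(["colon", "cancer"], docs)
```

In the system described here, the query terms would include the expanded keywords from step 2, and the resulting ranking would then be re-ranked in step 4.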

9 The Qualified Medical Knowledge Graph Medical concepts are automatically identified in EMRs and classified as Medical Problem, Treatment, or Test. Assertions are automatically identified and assigned to each medical concept. The result is a graph in which nodes are qualified medical concepts represented as triples: (concept text, concept type, assertion).

10 Example of assertions

12 The Qualified Medical Knowledge Graph

13 The Qualified Medical Knowledge Graph An edge between two graph nodes exists if the corresponding medical concepts co-occur within a fixed-size window of tokens (for our experiments, we set the window size to 20) within the same EMR. This idea of generating edges between medical concepts recognized in EMRs was inspired by the SympGraph methodology reported in Sondhi et al. (KDD 2012), which models symptom relationships in clinical notes.
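The edge rule above can be sketched in a few lines. This is a minimal illustration, assuming each EMR has already been reduced to a list of (token offset, qualified concept) pairs and that "within the window" means the start offsets differ by at most 20 tokens; the function name, the data layout, and the example assertion labels are hypothetical.

```python
from collections import Counter


def cooccurrence_edges(mentions, window=20):
    """Count undirected edges between distinct qualified concepts in ONE EMR.

    `mentions` is a list of (token_offset, concept) pairs, where `concept`
    is a (text, type, assertion) triple.
    """
    edges = Counter()
    ordered = sorted(mentions, key=lambda m: m[0])
    for i, (pos_a, a) in enumerate(ordered):
        for pos_b, b in ordered[i + 1:]:
            if pos_b - pos_a > window:
                break  # mentions are sorted, so later ones are even farther away
            if a != b:
                edges[frozenset((a, b))] += 1  # frozenset makes the edge undirected
    return edges


af = ("atrial fibrillation", "PROBLEM", "PRESENT")
ablation = ("ablation", "TREATMENT", "PRESENT")
ecg = ("ecg", "TEST", "PRESENT")
edges = cooccurrence_edges([(0, af), (5, ablation), (40, ecg)])
```

Here only the first two concepts are linked, because the third mention starts more than 20 tokens after either of them. Counts from all EMRs would be summed before weighting the edges.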

14 Automatic Medical Concept Recognition

15 Medical Concept Identification in EMRs Medical concepts in the form of (1) medical problems, such as ATRIAL FIBRILLATION (irregular heartbeat); (2) treatments, such as ABLATION (removal of undesired tissue); and (3) tests, such as ECG (electrocardiogram), were recognized using the methods reported in (Roberts and Harabagiu, JAMIA 2011). This method recognizes medical concepts in two steps. Step 1: identification of the boundaries of text that refers to a medical concept. Step 2: classification of the medical concept into (a) medical problems, (b) medical treatments, or (c) medical tests.

16 Medical Concept Identification Preprocessing: rule-based detection of measurements, dosages, & other entities. Boundary: a heuristic separates prose from non-prose text; then two Conditional Random Field (CRF) classifiers extract concepts (one for prose, one for non-prose). Type: a Support Vector Machine (SVM) classifier performs 3-way classification.

17 Training the Medical Concept Identification System The data: 349 discharge summaries and progress notes available from the 2010 i2b2/VA challenge, with a total of 25K training instances of medical concepts. Testing was performed on the TRECMed clinical documents. A very large set of features was extracted. Three distinct automatic feature selection methods were used: 1. Greedy forward: also known as additive feature selection, this method takes a greedy approach by always selecting the best feature to add to the feature set. 2. Greedy forward/backward: also known as floating forward feature selection, this is an extension of greedy forward selection that greedily attempts to remove features from the current feature set after a new feature is added. 3. Feature selection using a genetic algorithm.
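To make the first method concrete, here is a minimal sketch of greedy forward selection. The scoring callback stands in for whatever evaluation was actually used (for example cross-validated F1, which is an assumption); the function name, the toy feature names, and the toy gains are hypothetical.

```python
def greedy_forward_selection(features, score):
    """Add the single best feature each round; stop when nothing improves `score`."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate every one-feature extension of the current set.
        candidate_score, candidate = max(
            (score(selected + [f]), f) for f in remaining
        )
        if candidate_score <= best:
            break  # no remaining feature helps
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best


def toy_score(subset):
    # Hypothetical per-feature gains; a real score would train and evaluate a model.
    gains = {"word": 0.5, "pos": 0.2, "section": 0.1, "noise": -0.05}
    return sum(gains[f] for f in subset)


selected, best = greedy_forward_selection(["noise", "section", "pos", "word"], toy_score)
```

The forward/backward variant would add one step after each addition: try removing each already-selected feature and keep the removal if the score improves.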

18 Results for Medical Concept Identification Official i2b2/VA results (P, R, F1) for: Exact Boundary; Exact Boundary + Type; Inexact Boundary; Inexact Boundary + Type. System Score: Best i2b2 submission; Our i2b2 submission; Median i2b2 submission; Mean i2b2 submission 73.56

20 Medical Assertion Recognition

21 Assertion Classification Determining the belief status of a medical problem is also known as medical assertion. To recognize assertions automatically, we cast the task as a classification problem, implemented as an SVM classifier which is influenced by (a) the medical concepts on which the assertion is produced, (b) the metadata available in the section header where the assertion is implied, and (c) features available from UMLS (extracted by MetaMap) as well as features reflective of negated statements, disclosed through the NegEx negation detection package. A special class of features that provide belief values is available from the General Inquirer's category information. The SVM classifier performed 12-way classification: 6 assertion types from 2010 i2b2 and 6 new assertion types, based on 2,349 new annotations.

22 Assertion Types = new assertion type

23 Results for Medical Assertion Classification System Score: GFB+GA+GFB; GFB+GA; GFB; Best i2b2 submission; Our i2b2 submission (GF); Median i2b2 submission; Mean i2b2 submission. Reference: Kirk Roberts and Sanda Harabagiu, A flexible framework for deriving assertions from electronic medical records, JAMIA.

25 Constructing the QMKG Weighted undirected graph encoding similarity between qualified medical concepts: G = (V, E). Vertices: triples representing qualified medical concepts (lexical concept, concept type, assertion). Edges: an edge joins two vertices if and only if they co-occur within the same context (we used a window of 20 tokens).

26 Vertex Extraction

27 Constructing the QMKG The QMKG is represented as an adjacency matrix, A, whose entry for a pair of vertices is 1 if an edge joins them and 0 otherwise. An associated weight matrix, W, encodes the similarity between all pairs of qualified concepts according to some similarity function S.
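As an illustration of the two matrices, the sketch below fills A and W from co-occurrence counts. The slides do not preserve which first-order similarity functions were used, so pointwise mutual information (PMI) is used here purely as an assumed example of S; the function name, argument layout, and counts are hypothetical, and a real graph of this size would need a sparse representation rather than nested lists.

```python
import math


def build_matrices(nodes, pair_counts, node_counts, total_windows):
    """Return (A, W) with A[i][j] = 1 for co-occurring nodes and W[i][j] = S(i, j)."""
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    A = [[0] * size for _ in range(size)]
    W = [[0.0] * size for _ in range(size)]
    for (a, b), joint in pair_counts.items():
        i, j = index[a], index[b]
        # Assumed S: PMI = log( p(a, b) / (p(a) * p(b)) ), estimated from counts.
        pmi = math.log(joint * total_windows / (node_counts[a] * node_counts[b]))
        A[i][j] = A[j][i] = 1      # undirected graph, so both matrices are symmetric
        W[i][j] = W[j][i] = pmi
    return A, W


af = ("atrial fibrillation", "PROBLEM", "PRESENT")
ablation = ("ablation", "TREATMENT", "PRESENT")
ecg = ("ecg", "TEST", "PRESENT")
A, W = build_matrices(
    [af, ablation, ecg],
    {(af, ablation): 4, (af, ecg): 2},
    {af: 8, ablation: 4, ecg: 8},
    total_windows=16,
)
```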

28 First-Order Similarity Functions

29 Second-Order Similarity Function Qualified medical concepts are extremely sparse within EMRs. Many qualified medical concepts never share the same window, but still share some degree of semantic similarity that could be of value. We generalized the notion of second-order PMI to compute the second-order similarity between two nodes using any first-order similarity measure. It calculates the similarity of two nodes as an aggregation of the first-order similarities between them and the β highest-weighted intermediate nodes.
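The exact aggregation formula is not preserved in this transcript, so the following is only one plausible reading, loosely modeled on second-order PMI: for each of the two nodes, take its β highest-weighted neighbors, sum how similar those intermediates are to the other node, and average. The function name, the dict-of-dicts layout, the choice to ignore negative similarities, and the toy weights are all assumptions.

```python
def second_order_similarity(W, a, b, beta=2):
    """Aggregate first-order similarities routed through top-`beta` intermediates.

    `W` maps each node to a dict {neighbor: first-order similarity}.
    """
    def one_side(src, dst):
        # The beta highest-weighted neighbors of `src`, excluding `dst` itself.
        candidates = [(n, w) for n, w in W.get(src, {}).items() if n != dst]
        top = sorted(candidates, key=lambda nw: nw[1], reverse=True)[:beta]
        # Sum how strongly each intermediate is tied to `dst`; ignore negative ties.
        return sum(max(W.get(n, {}).get(dst, 0.0), 0.0) for n, _ in top) / beta

    return one_side(a, b) + one_side(b, a)


# "chest pain" and "angina" never co-occur but share two neighbors (toy weights).
W = {
    "chest pain": {"ecg": 2.0, "aspirin": 1.0},
    "angina": {"ecg": 1.5, "aspirin": 0.5},
    "ecg": {"chest pain": 2.0, "angina": 1.5},
    "aspirin": {"chest pain": 1.0, "angina": 0.5},
}
sim = second_order_similarity(W, "chest pain", "angina", beta=2)
```

The point of the example is that two nodes with no direct edge still receive a non-zero similarity through their shared intermediates, which is what counteracts the sparsity described above.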


32 Evaluations & Discussion Precision and Recall for our assertion values evaluated against the 2010 i2b2 data, and our own annotations on EMRs.

33 Evaluation of the QMKG Generated QMKG stats: 634 thousand nodes with 13.9 billion edges (3.45% connectivity); 53.0% of nodes are medical problems; 23.6% of nodes are medical tests; 23.3% of nodes are medical treatments. Assertion types distributed as follows:

34 Evaluation of the QMKG We evaluated the QMKG by testing on the TRECMed cohort retrieval task, using it as a means of query expansion: keywords are mapped to their qualified medical concepts in the QMKG, and the 20 highest-weighted neighbors of each keyword are selected as new keywords.
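The expansion step can be sketched as a top-k neighbor lookup. This is a minimal illustration, assuming the graph is available as a dict that maps each qualified concept to its weighted neighbors; the function name, the layout, and the toy weights are hypothetical.

```python
import heapq


def expand_keyword(graph, concept, k=20):
    """Return the `k` highest-weighted neighbors of `concept`, best first."""
    neighbors = graph.get(concept, {})
    top = heapq.nlargest(k, neighbors.items(), key=lambda item: item[1])
    return [node for node, _weight in top]


query = ("colon cancer", "PROBLEM", "PRESENT")
graph = {
    query: {
        ("chemotherapy", "TREATMENT", "PRESENT"): 3.1,
        ("colonoscopy", "TEST", "PRESENT"): 2.4,
        ("nausea", "PROBLEM", "PRESENT"): 0.7,
    }
}
expanded = expand_keyword(graph, query, k=2)
```

With k = 2 the toy call keeps the chemotherapy and colonoscopy nodes and drops the weakly connected one; the slides use k = 20.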

35 Query Expansion using the QMKG

36 TRECMed 2012 Scores iAP: inferred Average Precision. iNDCG: inferred Normalized Discounted Cumulative Gain. P@10: precision within the first 10 results.
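Of the three measures, precision within the first 10 results is simple enough to show directly. The sketch below is illustrative only: the visit identifiers are made up, and the inferred measures (which estimate scores from sampled relevance judgments) are not covered.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-`k` ranked results that are judged relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


ranked = [f"visit{i}" for i in range(1, 13)]   # hypothetical ranking of 12 visits
relevant = {"visit1", "visit3", "visit11"}     # hypothetical relevance judgments
p10 = precision_at_k(ranked, relevant)         # visit11 is ranked below the cutoff
```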

38 Conclusion We created a medical knowledge graph relating pairs of medical concepts qualified by the physician's belief status. By using this kind of information, we are able to make progress towards bridging the inherent knowledge gap tied to understanding EMRs. The graph provides very promising results for patient cohort identification.
