Travis Goodwin & Sanda Harabagiu




Automatic Generation of a Qualified Medical Knowledge Graph and its Usage for Retrieving Patient Cohorts from Electronic Medical Records
Travis Goodwin & Sanda Harabagiu
Human Language Technology Research Institute, The University of Texas at Dallas

Outline
- The Problem
- The Qualified Medical Knowledge Graph
- Identifying Medical Concepts
- Recognizing Assertions
- Constructing the QMKG
- Evaluation & Discussion
- Conclusions

The Problem
- More and more clinical data is available through Electronic Medical Records (EMRs)
- Notes within EMRs include a variety of knowledge:
  - Medical history
  - Physical exam findings
  - Lab reports
  - Radiology reports
  - Operative reports
  - Discharge summaries
  - Etc.
- EMRs do not document the rationale for medical decisions
- Patient cohort studies evaluate progression of disease as well as the factors that influence clinical outcomes

Patient Cohort Identification
- TRECMed: a retrieval task from NIST offered in 2011 & 2012
- 85 topics: queries targeting patient cohorts
  - Medical concepts, e.g. acute coronary syndrome
  - Patient constraints, e.g. children
- 95,703 de-identified EMRs from multiple hospitals in 2007
- The EMRs were grouped into hospital visits, each consisting of one or more medical reports from a patient's hospital stay. Thus, the EMRs were organized into 17,199 different patient hospital visits.
- Each visit had the patient's admission diagnoses, discharge diagnoses, and related ICD-9 codes

Sample TRECMed Topics

No.   Topic
156   Patients with depression on anti-depressant medication.
160   Patients with low back pain who had imaging studies.
172   Patients with peripheral neuropathy and edema.
184   Patients with colon cancer who had chemotherapy.

The 35 topics evaluated in 2011 and the 50 topics evaluated in 2012 were characterized by (a) usage of medical concepts (e.g. "acute coronary syndrome" or "plavix") and (b) constraints imposed on the patient population (e.g. children, female patients).

The Barrier
- Medical science involves forming hypotheses, experimenting with treatments, and reasoning from medical evidence. Consequently, clinical writing reflects this modus operandi with a rich set of speculative statements.
- Barriers:
  - Physicians use hedging, i.e. linguistic means of expressing an opinion rather than a fact
  - Abundance of speculative statements
- Our Solution:
  - Automatically detect medical concepts
  - Automatically identify the medical assertions (belief values) associated with each medical concept
  - Use these qualified concepts to build a graph of medical knowledge

Cohort Retrieval System
Retrieval system designed for TRECMed 2011/2012. A brief overview:
1. A topic is analyzed for keywords and other constraints.
2. Keywords are expanded using our Qualified Medical Knowledge Graph.
3. Initial BM25 retrieval.
4. Re-ranking to ensure that assertion values agree between document and query.
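Step 3 of the overview can be illustrated with a minimal BM25 scorer. This is only a sketch, not the retrieval system used for TRECMed: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25.

    k1 and b are common default values, assumed here; the slides do not
    state which values were used.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that never goes negative (an assumption).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy, hypothetical "visits" reduced to token lists.
docs = [
    ["chest", "pain", "ecg"],
    ["low", "back", "pain"],
    ["colon", "cancer", "chemotherapy"],
]
scores = bm25_scores(["colon", "cancer"], docs)
```

In the full system the query terms would also include the expansion keywords from step 2, and the ranking produced here would then be re-ordered in step 4.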


The Qualified Medical Knowledge Graph
- Medical concepts are automatically identified in EMRs and classified as:
  - Medical Problem
  - Treatment
  - Test
- Assertions are automatically identified and assigned to each medical concept
- The result is a graph in which nodes are qualified medical concepts, represented as triplets: (concept text, concept type, assertion)
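The node triplet can be sketched as a small immutable record. The class name `QualifiedConcept`, the field names, and the assertion labels "present" and "absent" are illustrative assumptions; the slides only fix the three components of the triplet.

```python
from typing import NamedTuple


class QualifiedConcept(NamedTuple):
    """One QMKG node: (concept text, concept type, assertion)."""
    text: str          # concept text, e.g. "atrial fibrillation"
    concept_type: str  # "problem", "treatment", or "test"
    assertion: str     # belief value attached to the concept (labels assumed)


present = QualifiedConcept("atrial fibrillation", "problem", "present")
absent = QualifiedConcept("atrial fibrillation", "problem", "absent")

# The same concept text with different assertions yields distinct nodes,
# while repeated mentions with the same assertion collapse into one node.
nodes = {present, absent, QualifiedConcept("atrial fibrillation", "problem", "present")}
```

Because the assertion is part of the node identity, "atrial fibrillation is present" and "atrial fibrillation is absent" end up with separate neighborhoods in the graph.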

Example of assertions


The Qualified Medical Knowledge Graph

The Qualified Medical Knowledge Graph
- An edge between two graph nodes exists if the corresponding medical concepts co-occur within a fixed-size window of tokens within the same EMR (for our experiments, we set the window size to 20).
- This idea of generating edges between medical concepts recognized in EMRs was inspired by the SympGraph methodology reported in Sondhi et al. (KDD 2012), which models symptom relationships in clinical notes.
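The window-based edge rule can be sketched as follows. The function name `cooccurrence_edges`, the representation of a mention as a (token position, node) pair, and the choice to treat "within the window" as a position difference strictly below the window size are assumptions; the slides only state the window size of 20.

```python
from collections import Counter


def cooccurrence_edges(mentions, window=20):
    """Count co-occurrences of distinct nodes within `window` tokens.

    `mentions` holds (token_position, node) pairs from a single EMR.
    Returns a Counter keyed by frozenset({node_a, node_b}).
    """
    ordered = sorted(mentions, key=lambda m: m[0])
    counts = Counter()
    for i, (pos_i, node_i) in enumerate(ordered):
        for pos_j, node_j in ordered[i + 1:]:
            if pos_j - pos_i >= window:
                break  # later mentions are even farther away
            if node_i != node_j:
                counts[frozenset((node_i, node_j))] += 1
    return counts


# Hypothetical mentions; the assertion labels are illustrative only.
afib = ("atrial fibrillation", "problem", "present")
ecg = ("ecg", "test", "present")
ablation = ("ablation", "treatment", "hypothetical")
emr_mentions = [(3, afib), (9, ecg), (40, ablation)]
edges = cooccurrence_edges(emr_mentions)
```

Here only the first two mentions fall inside one window, so a single edge is produced. Summing such counters over all EMRs gives the co-occurrence statistics that a similarity function could later turn into edge weights.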

Automatic Medical Concept Recognition

Medical Concept Identification in EMRs
Medical concepts in the form of:
1. medical problems, such as ATRIAL FIBRILLATION (irregular heart beat);
2. treatments, such as ABLATION (removal of undesired tissue); and
3. tests, such as ECG (electrocardiogram)
were recognized using the methods reported in (Roberts and Harabagiu, JAMIA 2011). This method recognizes medical concepts in two steps:
- Step 1: Identification of the boundaries within the text that refer to a medical concept;
- Step 2: Classification of the medical concept into (a) medical problems, (b) medical treatments, or (c) medical tests.

Medical Concept Identification
- Preprocessing: rule-based detection of measurements, dosages, & other entities
- Boundary: a heuristic separates prose from non-prose text. Then two Conditional Random Field (CRF) classifiers are used to extract concepts (one for prose, one for non-prose)
- Type: a Support Vector Machine (SVM) classifier performs 3-way classification
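As an illustration of what a rule-based preprocessing step might look like, the sketch below tags simple dosage expressions with a regular expression. The pattern, the unit list, and the function name `find_dosages` are hypothetical; the actual rules used in the system are not given in the slides.

```python
import re

# Hypothetical rule: a number, an optional space, then a dosage unit.
DOSAGE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|ml|units?)\b", re.IGNORECASE)


def find_dosages(text):
    """Return every dosage-like span found in `text`, in order."""
    return [match.group(0) for match in DOSAGE.finditer(text)]


found = find_dosages("Started metoprolol 25 mg and heparin 5000 units")
```

Spans detected this way could be passed to the CRF and SVM stages as features, so that the learned models do not have to rediscover these regular surface patterns.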

Training the Medical Concept Identification System
- The data: 349 discharge summaries and progress notes available from the 2010 i2b2/VA challenge, with a total of 25K training instances of medical concepts.
- Testing data: the TRECMed clinical documents.
- A very large set of features was extracted.
- Three distinct automatic feature selection methods were used:
  1. Greedy forward: also known as additive feature selection, this method takes a greedy approach by always selecting the best feature to add to the feature set.
  2. Greedy forward/backward: also known as floating forward feature selection, this is an extension of greedy forward selection that greedily attempts to remove features from the current feature set after a new feature is added.
  3. Feature selection using a genetic algorithm.
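The first two selection strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions: `select_features` and `toy_score` are hypothetical names, the stopping rule (stop when no candidate improves the score) is assumed, and the hard-coded score table stands in for what would in practice be a cross-validated classifier score.

```python
def select_features(features, score, floating=False):
    """Greedy forward selection; with floating=True, also try removals.

    `score` maps a list of features to a number (higher is better).
    """
    selected, remaining = [], list(features)
    best = score(selected)
    while remaining:
        # Forward step: pick the single feature that helps most.
        cand_score, cand = max(
            ((score(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if cand_score <= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
        # Backward step: drop earlier features while that improves the score.
        improved = floating
        while improved and len(selected) > 1:
            improved = False
            for f in selected:
                if f == cand:
                    continue
                trial = [g for g in selected if g != f]
                trial_score = score(trial)
                if trial_score > best:
                    selected, best = trial, trial_score
                    remaining.append(f)
                    improved = True
                    break
    return selected, best


# Toy score table over subsets of three hypothetical features.
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset("a"): 0.5, frozenset("b"): 0.4, frozenset("c"): 0.1,
    frozenset("ab"): 0.6, frozenset("ac"): 0.55, frozenset("bc"): 0.7,
    frozenset("abc"): 0.65,
}


def toy_score(subset):
    return TOY_SCORES[frozenset(subset)]


forward = select_features("abc", toy_score)
forward_backward = select_features("abc", toy_score, floating=True)
```

In this toy table, plain forward selection ends with all three features, while the floating variant discovers that dropping the first feature after adding the third gives a better subset. That kind of correction is what the backward step is meant to provide.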

Results for Medical Concept Identification

Official i2b2/VA results     P      R      F1
Exact Boundary               83.7   80.8   82.2
Exact Boundary + Type        81.0   78.2   79.6
Inexact Boundary             92.7   89.5   91.1
Inexact Boundary + Type      89.3   89.2   89.2

System                       Score
Best i2b2 submission         85.23
Our i2b2 submission          79.59
Median i2b2 submission       77.78
Mean i2b2 submission         73.56
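The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the reported P and R values; the function name `f1` and the dictionary layout are illustrative choices.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs as reported in the results table, in percent.
reported = {
    "Exact Boundary": (83.7, 80.8),
    "Exact Boundary + Type": (81.0, 78.2),
    "Inexact Boundary": (92.7, 89.5),
    "Inexact Boundary + Type": (89.3, 89.2),
}
recomputed = {name: round(f1(p, r), 1) for name, (p, r) in reported.items()}
```

The recomputed values match the table's F1 column to one decimal place.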


Medical Assertion Recognition

Assertion Classification
- Determining the belief status of a medical problem is known as medical assertion classification.
- To recognize assertions automatically, we cast the task as a classification problem, implemented as an SVM classifier that draws on:
  a) the medical concepts about which the assertion is made;
  b) the metadata available in the section header under which the assertion appears; and
  c) features from UMLS (extracted by MetaMap), as well as features reflecting negated statements, obtained through the NegEx negation detection package.
- A special class of features that provide belief values comes from the General Inquirer's category information.
- The SVM classifier performed 12-way classification: 6 assertion types from the 2010 i2b2 challenge plus 6 new assertion types, based on 2,349 new annotations.

Assertion Types [figure: list of assertion types, with a marker denoting each new assertion type]

Results for Medical Assertion Classification

System                         Score
GFB+GA+GFB                     93.94
GFB+GA                         93.93
GFB                            93.84
Best i2b2 submission           93.62
Our i2b2 submission (GF)       92.75
Median i2b2 submission         91.96
Mean i2b2 submission           86.18

Source: "A flexible framework for deriving assertions from electronic medical records", Kirk Roberts and Sanda Harabagiu, JAMIA 2011.


Constructing the QMKG
- Weighted undirected graph encoding similarity between qualified medical concepts: G = (E, V)
- Vertices: triples representing qualified medical concepts (lexical concept, concept type, assertion)
- An edge exists between two vertices if and only if they co-occur within the same context (we used a window of 20 tokens)

Vertex Extraction

Constructing the QMKG
- The QMKG is represented as an adjacency matrix, A.
- An associated weight matrix, W, encodes the similarity between all pairs of qualified concepts according to some similarity function S.
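A minimal sketch of the two matrices follows, using dense nested lists for readability. The function name `build_matrices`, the node labels, and the placeholder similarity function are assumptions; with hundreds of thousands of nodes a sparse representation would be needed in practice.

```python
def build_matrices(nodes, edges, similarity):
    """Build an adjacency matrix A and a weight matrix W over `nodes`.

    `edges` is an iterable of (u, v) node pairs; `similarity` plays the
    role of the function S and is applied to every pair of distinct nodes.
    """
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        A[i][j] = A[j][i] = 1  # undirected graph: keep A symmetric
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            if i < j:
                W[i][j] = W[j][i] = similarity(u, v)
    return A, W


# Hypothetical nodes and a placeholder similarity (not from the slides).
nodes = ["afib", "ecg", "ablation"]
edges = [("afib", "ecg")]
A, W = build_matrices(nodes, edges, lambda u, v: 1.0 if {u, v} == {"afib", "ecg"} else 0.25)
```

Keeping A and W separate makes it possible to swap the similarity function without recomputing which concept pairs actually co-occurred.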

First-Order Similarity Functions

Second-Order Similarity Function
- Qualified medical concepts are extremely sparse within EMRs
- Many qualified medical concepts do not share the same window, but still share some degree of semantic similarity that could be of value
- We generalized the notion of second-order PMI to compute the second-order similarity between two nodes using any first-order similarity measure
- It calculates the similarity of two nodes as an aggregation of the first-order similarities between them and the β highest-weighted intermediate nodes
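One possible reading of the second-order similarity is sketched below. The slides do not give the exact aggregation, so the choice of a mean over the top-β neighbors of each node, the symmetric sum of both directions, and the names `top_neighbors` and `second_order_similarity` are all assumptions.

```python
def top_neighbors(W, node, beta, exclude=()):
    """Return up to `beta` neighbors of `node` with the highest weights."""
    candidates = [(w, x) for x, w in W.get(node, {}).items() if x not in exclude]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in candidates[:beta]]


def second_order_similarity(W, u, v, beta):
    """Aggregate first-order weights through each node's top-beta neighbors.

    `W` is a dict of dicts holding first-order similarities.
    """
    def one_side(a, b):
        neighbors = top_neighbors(W, a, beta, exclude={a, b})
        if not neighbors:
            return 0.0
        # How strongly is b tied to the strongest neighbors of a?
        return sum(W.get(b, {}).get(x, 0.0) for x in neighbors) / len(neighbors)

    return one_side(u, v) + one_side(v, u)


# Toy first-order weights: u and v never co-occur, but both co-occur with x.
W = {
    "u": {"x": 2.0, "y": 1.0},
    "v": {"x": 4.0},
    "x": {"u": 2.0, "v": 4.0},
    "y": {"u": 1.0},
}
score = second_order_similarity(W, "u", "v", beta=1)
```

The point of the example is that two nodes with no direct edge still receive a nonzero score through a shared intermediate node, which addresses the sparsity problem described in the slide.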


Evaluations & Discussion
Precision and recall for our assertion values were evaluated against the 2010 i2b2 data and our own annotations on EMRs.

Evaluation of the QMKG
Generated QMKG stats:
- 634 thousand nodes with 13.9 billion edges (3.45% connectivity)
- 53.0% of nodes are medical problems
- 23.6% of nodes are medical tests
- 23.3% of nodes are medical treatments
- Assertion types distributed as follows:

Evaluation of the QMKG
- Evaluated the QMKG by testing on the TRECMed cohort retrieval task
- Used it as a means of query expansion:
  - Keywords are mapped to their qualified medical concepts in the QMKG
  - The top 20 highest-weighted neighbors of each keyword are selected as new keywords
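The expansion step can be sketched as a top-k neighbor lookup. The function name `expand_query`, the dict-of-dicts weight layout, and the toy weights are assumptions; only the cutoff of 20 neighbors per keyword comes from the slides.

```python
import heapq


def expand_query(W, keyword_nodes, k=20):
    """Append each keyword's k highest-weighted neighbors as new keywords.

    `W` maps a node to a dict of {neighbor: weight}.
    """
    expanded = list(keyword_nodes)
    seen = set(keyword_nodes)
    for node in keyword_nodes:
        top = heapq.nlargest(k, W.get(node, {}).items(), key=lambda item: item[1])
        for neighbor, _weight in top:
            if neighbor not in seen:
                seen.add(neighbor)
                expanded.append(neighbor)
    return expanded


# Hypothetical weights; k=2 keeps the example small.
W = {"afib": {"ecg": 3.0, "ablation": 2.0, "warfarin": 1.0}}
expanded = expand_query(W, ["afib"], k=2)
```

In the actual graph the nodes would be qualified triplets rather than plain strings, so the expansion would respect the assertion attached to each query keyword.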

Query Expansion using the QMKG

TRECMed 2012 Scores
- iap: inferred Average Precision
- indcg: inferred Normalized Discounted Cumulative Gain
- P@10: precision within the first 10 results
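Of the three measures, P@10 is simple enough to show directly. The sketch below uses hypothetical visit identifiers; the inferred measures depend on sampled relevance judgments and are not reproduced here.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the first k ranked results that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


# Hypothetical ranking of 20 visits, of which four are relevant.
ranked = list(range(20))
relevant = {0, 2, 4, 11}
p_at_10 = precision_at_k(ranked, relevant)
```

Three of the four relevant visits appear in the first ten positions, so P@10 is 0.3 for this toy ranking.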


Conclusion
- We created a medical knowledge graph relating pairs of medical concepts qualified by the physician's belief status.
- By using this kind of information, we are able to make progress towards bridging the inherent knowledge gap tied to understanding EMRs.
- The graph provides very promising results for patient cohort identification.