Automated assignment of ICD-9-CM codes to radiology reports

1 Automated assignment of ICD-9-CM codes to radiology reports Richárd Farkas University of Szeged Filip Ginter University of Turku

2 Overview Why clinical coding? Importance, use of automated coding Challenge description Data used Evaluation methodology Our solutions Szeged system Turku system Results and comparison Possibilities and challenges of a real-world system

3 NLP in the clinical domain Narrative texts A huge amount of information is hidden Manual processing requires expertise Time Costs Special features of medical texts Unique characteristics of the language used Smokes 2-3 cig / day, occ etoh, and no drugs except marijuana Exam

4 Clinical coding Automatic assignment of disease/symptom codes to clinical records. International Classification of Diseases (ICD-X-CM), where X is the revision (current: 10, used: 9). Used for statistics on diseases or effects of treatment, and for billing, so the task has commercial relevance: overcoding is penalised by 3x the sum, undercoding means loss of revenue. Codes are added to the text after the treatment (US). Coding costs $25 billion annually in the USA.

5 International Challenge on Classifying Clinical Free Text Using Natural Language Processing Shared task challenge to evaluate NLP systems on clinical data ICD-9-CM coding Radiology reports Organization Computational Medicine Center Cincinnati, Ohio, USA February/March, 2007 Motivation Practical importance for hospital administration and health insurance

6 120+ registered participants 44 systems submitted

7 Data Used Radiology records annotated with ICD codes 978 documents used for training ICD-9 systems 976 unseen documents used for evaluation Annotation provided by 3 health institutes; majority labeling used as gold standard 45 different ICD codes used; codes appear in various combinations (94 different sets of codes); frequency of labels varies The data is made available free of charge for research purposes by the challenge organizers

8 Example
<doc id=" " type="radiology_report">
  <codes>
    <code origin="cmc_majority" type="icd-9-cm">786.2</code>
    <code origin="company3" type="icd-9-cm">518.0</code>
    <code origin="company1" type="icd-9-cm">786.2</code>
    <code origin="company2" type="icd-9-cm">786.2</code>
  </codes>
  <texts>
    <text origin="cchmc_radiology" type="clinical_history"> Cough. History of pneumonia on 1/2/01. Increased work of breathing. </text>
    <text origin="cchmc_radiology" type="impression"> No significant change to overall appearance of perihilar lung opacities and peribronchial thickening most consistent with viral illness vs reactive airways disease. Increased densities superimposed over the right middle lobe and lingular region on the lateral view may represent superimposition of shadows. However atelectasis or a small amount of parenchymal consolidation cannot be fully excluded. This patient's lung markings have appeared prominent on the four existing chest x-rays in our file. It is recommended that the child receive a well - child chest x-ray in order to evaluate lung markings when the child is not sick. </text>
  </texts>
</doc>
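The example record carries one code list per annotating institute plus a cmc_majority entry. A minimal sketch of how such a majority label could be derived is given here; the function name and the strict-majority threshold are assumptions for illustration, not the organizers' documented procedure.

```python
from collections import Counter


def majority_codes(annotations):
    """Return the codes assigned by a strict majority of annotators.

    `annotations` maps an annotator name to the list of codes it assigned.
    """
    votes = Counter(code for codes in annotations.values() for code in set(codes))
    needed = len(annotations) // 2 + 1  # strict majority (assumed rule)
    return sorted(code for code, count in votes.items() if count >= needed)


# Annotator codes copied from the example record above.
example = {
    "company1": ["786.2"],
    "company2": ["786.2"],
    "company3": ["518.0"],
}
```

With the three annotations from the record, only 786.2 reaches two votes, which agrees with the cmc_majority code shown in the example.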

9 Distribution of labels

10 Results

11 Szeged, Hungary Richárd Farkas Research Group on Artificial Intelligence of the Hungarian Academy of Sciences, György Szarvas University of Szeged, Department of Informatics, Human Language Technology Group without physicians

12 Szeged ICD coding solutions Language processing: negation/speculation Exploiting ICD and utilising labeled data Inter-label dependencies Synonyms and abbreviations Challenge system: hand-crafted, then reconstructed automatically (machine learning)

13 Language processing Coding guidelines require that uncertain diagnoses not be coded. Speculation: Peribronchial thickening most consistent with viral illness vs reactive airways disease. Negation: Normal slightly hypoventilatory chest x-ray, no pneumonia. Issues in the past without direct effect on current treatment should not be coded; temporal resolution is neglected due to noisy annotation of historical findings.

14 Detection of speculation/negation Simple approach, motivated by the fairly simple grammar of the texts: physicians aim to briefly enumerate findings and their opinion, and rarely use very complex noun phrases or syntax. Dictionaries of keywords collected from training data. Scope identified by a naive heuristic: right scope extends to the end of the sentence; left scope extends back to the previous punctuation (or is empty, depending on the keyword). Example: Normal slightly hypoventilatory chest x-ray, no pneumonia.
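The scope heuristic on this slide can be sketched in a few lines. The keyword dictionary below is hypothetical (the slide does not list the Szeged keywords), as is the flag saying whether a keyword also scopes to the left; the input is assumed to be a single sentence.

```python
import re

PUNCTUATION = {".", ",", ";", ":"}

# Hypothetical dictionary: keyword -> True if it also has a left scope.
KEYWORDS = {"no": False, "without": False, "unlikely": True}


def negated_tokens(sentence):
    """Return the tokens of one sentence that fall inside a keyword scope."""
    tokens = re.findall(r"[A-Za-z0-9-]+|[.,;:]", sentence.lower())
    flagged = set()
    for i, token in enumerate(tokens):
        if token not in KEYWORDS:
            continue
        # Right scope: everything up to the end of the sentence.
        flagged.update(range(i + 1, len(tokens)))
        if KEYWORDS[token]:
            # Left scope: back to the previous punctuation mark.
            j = i - 1
            while j >= 0 and tokens[j] not in PUNCTUATION:
                flagged.add(j)
                j -= 1
    return [tokens[k] for k in sorted(flagged) if tokens[k] not in PUNCTUATION]
```

On the slide's sentence, "no" has only a right scope, so "pneumonia" is flagged while the finding before the comma is left untouched.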

15 Using the ICD

16 Exploration of inter-label dependencies Overcoding, e.g. symptoms and diseases C4.5 classifiers trained for false positive labels Features: base-system labels Extracted 5 dependencies, each expressing: delete symptom if disease has textual evidence, e.g. delete Cough and Fever if Pneumonia is coded
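A deletion rule of this kind can be written as a small post-processing step over the base system's output. The sketch below encodes only the Pneumonia example from the slide; the rule table layout and function name are assumptions, and the ICD-9-CM codes used (486 pneumonia, 786.2 cough, 780.6 fever) are chosen for illustration.

```python
# disease code -> symptom codes to delete when the disease is coded
RULES = {
    "486": {"786.2", "780.6"},  # pneumonia -> cough, fever
}


def apply_deletion_rules(codes, rules=RULES):
    """Drop symptom codes that are explained by a coded disease."""
    codes = set(codes)
    to_delete = set()
    for disease, symptoms in rules.items():
        if disease in codes:
            to_delete |= symptoms
    return sorted(codes - to_delete)
```

Symptom codes survive whenever the disease code is absent, so a report coded only with cough and fever is returned unchanged.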

17 Data-driven model Vector Space Model token grams as features C4.5 classifier on 45 binary classification tasks Expanding the dictionaries: Gathering missing synonyms, abbreviations C4.5 classifiers trained for false negative labels
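As a rough illustration of the vector space representation, the following sketch turns a report into token n-gram features and a binary presence vector. The maximum n-gram length, whitespace tokenization and binary weighting are assumptions; the slide does not specify them. Each of the 45 codes would then get its own binary classifier over such vectors.

```python
def token_ngrams(text, max_n=2):
    """Return all token n-grams of length 1..max_n, in order of n."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def binary_vector(text, vocabulary, max_n=2):
    """Mark which vocabulary n-grams occur in the text (1) or not (0)."""
    present = set(token_ngrams(text, max_n))
    return [1 if gram in present else 0 for gram in vocabulary]
```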

18 Example of terms found Urinary Tract Infection uti Asthma reactive airways disease Laurence-Moon-Biedl syndrome Williams syndrome Beckwith-Wiedemann syndrome hemihypertrophy

19 External knowledge (ICD) vs. Data-driven models ICD data independent robust (information source is reliable) can cover rare codes Data-driven can explore individual coding style (synonyms, abbreviations) requires labeled documents cannot handle rare codes

20 Added values of the subphases
System                                   Train    Eval
45-class statistical system              88.20%   86.69%
ICD                                      84.07%   83.21%
+ inter-label dependencies               85.57%   84.85%
+ statistical enriching (synonyms)       90.26%   88.93%
Union of statistical and coding guide    90.53%   89.33%
Hand-crafted system                      90.02%   89.41%
- language processing                    71.46%   70.48%

21 The Turku Group in the Challenge Language processing group at the Department of IT, University of Turku and Turku Centre for Computer Science (TUCS) Antti Airola Filip Ginter Tapio Pahikkala Sampo Pyysalo Tapio Salakoski Hanna Suominen Department of nursing science, University of Turku Sanna Salanterä

22 The Turku ICD coding system Feature engineering Mapping text to UMLS concepts (MetaMap) Recognition of negation and speculation Generalization via hypernymy Machine learning Primary classifier (RLS) Secondary classifier (Ripper) corrections of known errors made by the primary classifier Additional training instances from ICD definitions

23 MetaMap MetaMap identifies instances of UMLS concepts in running text NLM's MetaMap program Divides running text into phrases Each phrase is mapped into a set of UMLS concepts from specified vocabularies A way to abstract from text

24 MetaMap output example
Eleven year old
  Eleven, Quantitative Concept, C
  Year, Temporal Concept, C
  Old, Temporal Concept, C
with acute leukemia
  Acute leukaemia, Neoplastic Process, C
bone marrow transplant
  Bone marrow transplant, Therapeutic or Preventive Procedure, C
on Jan. 2
now with three day history
  Three, Quantitative concept, C
  day, Temporal concept, C
  History, Occupation or Discipline, C
of cough
  Cough, Sign or Symptom, C

25 Hypernym expansion Hypernyms as additional features Generalize the identified concepts along the hierarchy
Cough -> Respiratory symptoms -> Signs and Symptoms
Fever -> Body temperature altered -> Signs and Symptoms
Atelectasis -> Diseases of the lung -> Diseases of the respiratory system
Pneumonia -> Diseases of the lung -> Diseases of the respiratory system
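The expansion amounts to walking up a parent map. The sketch below uses only the four chains listed on this slide; the dictionary representation and function name are assumptions, and a real system would read the hierarchy from UMLS rather than hard-code it.

```python
# Child -> parent links taken from the four chains on the slide.
PARENT = {
    "Cough": "Respiratory symptoms",
    "Respiratory symptoms": "Signs and Symptoms",
    "Fever": "Body temperature altered",
    "Body temperature altered": "Signs and Symptoms",
    "Atelectasis": "Diseases of the lung",
    "Pneumonia": "Diseases of the lung",
    "Diseases of the lung": "Diseases of the respiratory system",
}


def hypernyms(concept):
    """Return the ancestors of a concept, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def expand(concepts):
    """Return the concepts together with all of their hypernyms."""
    expanded = set(concepts)
    for concept in concepts:
        expanded.update(hypernyms(concept))
    return expanded
```

After expansion, Cough and Fever share the feature "Signs and Symptoms" even though they have no tokens in common.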

26 Hypernym expansion motivation More accurate similarity information Lexically, cough and fever are different Hypernym expansion adds the information that both are symptoms The connection can also be learned given large quantities of data But rare cases can benefit here

27 Negation and speculation Negation, speculation, temporal information Recognize trigger words could, history of, likely, may, mild, minimal, no, past, possible, possibly, probable, probably, questionable, suggestive, unsure, without Scope: Everything from a trigger word up to the end of the current sentence All features extracted from a negated text span are marked ICD coding guide: speculated / unsure code is not assigned

28 Hypernym expansion & negation Hypernym expansion and negation VALID: pneumonia -> lung disease INVALID: not pneumonia -> not lung disease Negated concepts are not expanded with hypernyms Room for improvement VALID: possible pneumonia -> possible lung disease

29 Feature engineering Final set of features entering the classifier Text tokens No particular order: Bag-of-Words (BoW) model Marked with neg- whenever negated Set of UMLS concepts (their c-codes) extracted with MetaMap Marked with neg- whenever negated Set of hypernyms of the extracted UMLS concepts Included only for non-negated concepts
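The token part of this feature set can be sketched as follows, using the trigger words from the negation and speculation slide. This is a simplified illustration, not the Turku implementation: the sentence splitter is naive, the trigger word itself is left unmarked (an assumption), and the UMLS concept and hypernym features are omitted because they require MetaMap.

```python
import re

TRIGGERS = {
    "could", "likely", "may", "mild", "minimal", "no", "past", "possible",
    "possibly", "probable", "probably", "questionable", "suggestive",
    "unsure", "without",
}


def bow_features(text):
    """Bag-of-words features; tokens after a trigger get a neg- prefix."""
    features = set()
    # Naive sentence split: whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
        negated = False
        for i, token in enumerate(tokens):
            features.add("neg-" + token if negated else token)
            # "history of" is the only two-word trigger in the list.
            if token in TRIGGERS or (token == "history" and tokens[i + 1:i + 2] == ["of"]):
                negated = True  # scope runs to the end of the sentence
    return features
```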

30 Classification RLS (regularized least-squares) classifier Maximal-margin, kernel-based classifier Close relative of Support Vector Machines (SVMs) Linear kernel (fast & worked well) One classifier for each code 1 versus all classification May lead to no codes assigned or an impossible combination of codes
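For reference, the textbook form of the regularized least-squares objective with a linear kernel is shown below. The notation is an assumption and does not come from the slides: $X$ is the document-by-feature matrix, $y \in \{-1, +1\}^m$ holds the labels for one code, and $\lambda > 0$ is the regularization parameter.

```latex
\hat{w}
  = \arg\min_{w} \sum_{i=1}^{m} \left( y_i - w^{\top} x_i \right)^2 + \lambda \lVert w \rVert^2
  = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} y
```

One such weight vector is fitted per code, and a code is assigned when $\hat{w}^{\top} x$ exceeds a threshold, which is why the independent decisions can add up to an empty or impossible combination.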

31 Correcting known errors Cascaded classifier attempts to correct known errors: empty or impossible combinations RIPPER Decision rules A very different paradigm from RLS Trained and applied exactly as the first classifier 1 vs. All Known errors made by the second classifier left uncorrected Experiments show no additional improvement

32 Using ICD-9 in training ICD-9 definitions as training instances Concatenate the textual definitions of each of the 45 codes and its parents in the ICD hierarchy Same generalization idea as previously Extract features in the standard way Pool the resulting 45 training instances with the challenge training data Provides additional positive examples
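Building one such training instance could look like the sketch below. The `definitions` and `parent` tables are tiny hand-written stand-ins for the ICD-9 hierarchy (an assumption for illustration); the resulting text would then go through the same feature extraction as a report.

```python
def definition_instance(code, definitions, parent):
    """Concatenate the definition of a code with those of its ancestors."""
    parts = []
    while code is not None:
        parts.append(definitions[code])
        code = parent.get(code)  # None once the top of the hierarchy is reached
    return " ".join(parts)


# Illustrative fragment of the hierarchy, not a complete ICD-9 table.
definitions = {
    "786.2": "Cough",
    "786": "Symptoms involving respiratory system and other chest symptoms",
}
parent = {"786.2": "786"}
```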

33 Turku system: Summary FEATURE EXTRACTION Source text UMLS hierarchy CLASSIFICATION RLS classifier 1 vs. All Tokenization Negation and speculation detection MetaMap Set of UMLS concepts UMLS hypernym expansion Extended set of UMLS concepts + Source text tokens Set of ICD codes impossible combination RIPPER classifier 1 vs. All possible combination Final set of ICD codes
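The final decision step of the pipeline can be sketched as a simple fallback. The sketch assumes that an "impossible" combination is one that never occurs in the training data; the slides do not define the term, and the function and variable names are hypothetical.

```python
def cascade(primary_codes, secondary_codes, known_combinations):
    """Keep the primary prediction unless it is empty or an unseen combination."""
    combination = frozenset(primary_codes)
    if combination and combination in known_combinations:
        return sorted(combination)
    # Fall back to the secondary classifier; its own errors stay uncorrected.
    return sorted(secondary_codes)


# Hypothetical set of code combinations observed in the training data.
known = {frozenset({"786.2"}), frozenset({"486", "780.6"})}
```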

34 Turku system: Component contribution [Table: F micro, error and relative gain (%) for each component: RLS (initial), tokenization, UMLS mapping, UMLS hypernyms, negation/speculation, cascaded Ripper, ICD-9 training data. Cross-validated performance on training data. Numeric values not preserved.]
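The component table is reported in terms of F micro. As a reminder of how a micro-averaged F-score is computed for multi-label output, here is a minimal sketch; the function name and input layout (one set of codes per document) are assumptions.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of per-document code sets."""
    tp = fp = fn = 0
    for gold_codes, predicted_codes in zip(gold, predicted):
        gold_codes, predicted_codes = set(gold_codes), set(predicted_codes)
        tp += len(gold_codes & predicted_codes)
        fp += len(predicted_codes - gold_codes)
        fn += len(gold_codes - predicted_codes)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because counts are pooled over all documents and codes before precision and recall are taken, frequent codes dominate the score and rare codes contribute little.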

35 Turku vs. Szeged: Crucial differences Szeged system No external resources beyond ICD-9 ICD-9 definitions and coding guidelines are the core of the system Challenge system: rule-based Replicated via machine learning Turku system Heavy reliance on UMLS MetaMap Hypernyms ICD-9 definitions used as training examples with 0.1 percentage point improvement No explicit use of ICD-9 coding guidelines Pure machine learning

36 Turku vs. Szeged: Crucial differences Szeged system allows individual ICD code deletion if code X is given, delete code Y Turku system rejects the whole code combination and applies a different classifier Paradoxically, no gain from using the Szeged finer ICD code handling on top of Turku results (0.3 percentage point F-score decrease) E.g. false positive disease code causes a true positive symptom code to be removed Use of hypernym expansion More detailed negation/speculation/temporal detection in Szeged system

37 Language specifics CMC challenge was on English text How about other languages? Szeged system Needs translated ICD Language-adapted negation/speculation detection Turku system Needs translated UMLS resources and MetaMap Many of the features are language-independent (UMLS c-codes) Language-adapted negation/speculation detection Both systems rely on string search in one way or another Problem in inflective languages

38 Crucial differences (cont.) Different approach to design Turku system Classifier-centric Extract all conceivable features and feed them into a state-of-the-art classifier Szeged system Data-centric Build from the available resources (ICD and training data) and use classifiers with interpretable models Study the mistakes and the model, correct errors

39 CMC challenge results: The big picture Best F-score 89.1 (Szeged system) Mean F-score 76.7 (standard deviation 13.4) Turku and Szeged baselines Szeged: 83.2% F bare system with just NLP and ICD but no other direct use of the training data Turku: 80.7% F bare machine learning system with no data preprocessing of any kind (only whitespace tokenization) About half of the challenge submissions stayed below these baseline systems!

40 CMC challenge: Lessons learned General observations across all submissions Presented by Pestian et al., ACL 07 BioNLP workshop, 2007 Based on short system descriptions (not publicly available) 1. Best systems explicitly took into account negation and speculation 2. Better systems frequently worked with hypernym and synonym detection 3. Significant amount of symbolic processing 4. Two of the top three systems were ML-based

41 CMC challenge: Lessons learned 5. Careful, medically-informed feature engineering common 6. SVM and related state-of-the-art classification algorithms were strongly represented, but not reliably predictive of high ranking Turku development observation: a number of traditional classifiers matched RLS performance when used correctly

42 Beyond the ICD coding Similar NLP tasks The same architecture can be used Find the relevant parts of the documents Find relevant phrases (synonyms, abbreviations) simple string-matching with a particular dictionary Prototype tasks: The i2b2 obesity challenge Smoking status detection

43 The i2b2 obesity challenge Who's obese and what co-morbidities do they (definitely/likely) have? Informatics for Integrating Biology and the Bedside (i2b2) Febr. to June 730 training and 507 evaluation documents multi-label problem, 16 morbidities

44 Comparison Focusing on several morbidities (matchable with set of ICD) Longer documents (avg. of the lengths: 130 rows) More noise The patient has a positive family history of coronary disease Negation/speculation detection is highlighted (Y/N/Q/U F-macro)

45 Smoking status detection i2b2 challenge 2006 The patient in question is SMOKER, NON-SMOKER, PAST-SMOKER or smoker status UNKNOWN inter-annotator agreement ~85% 398 train and 104 eval documents Small dictionaries: smoke, tobacco etc. best systems 88% with external data 94%

46 Final thoughts on ICD coding Some clear advantages lower costs less error-prone processing of simpler cases Fully automatic system is impossible (nowadays) Far away from human intelligence will not solve rare, harder cases Right middle and probable right lower lobe pneumonia.

47 The place of an automatic system Pre-labeling/highlighting to speed up manual coding prediction along with confidence measure Validation suggesting erroneous / missed codes monitoring for health insurance companies Automated coding of large datasets mainly for statistical purposes

48 Tasks to be solved Extending systems to thousands of codes If a corpus of appropriate size is available Incorporating more expert knowledge into the statistical methods user-friendly interfaces interactive systems Better language processing Corpus for developing sophisticated scope detectors: BioScope (released 2008 June)

49 Open questions Each coder or institute has its own individual coding style How to transfer among languages? Is there any drop in accuracy on other languages (free word order in Hungarian) or on other domains (nursing notes)? What is the real speed-up of an automatic pre-coding/suggestion system?

50 Open questions (cont.) More training data needed to scale the systems up Hospitals have the data but privacy concerns prevent its dissemination to companies / NLP researchers who build the system Training data generally cannot be reconstructed from trained machine-learning systems Distribute an empty system? Legal issues? Technical issues?

51 Multilingual ICD tagging: summary Basic NLP tools Tokenizer Lemmatizer Tagger, phrase parser (in some approaches) Need domain adaptation Controlled domain vocabulary resources Term variants (e.g. synonyms and abbreviations) Generally scarce Ideally within a large framework such as UMLS Allowing tool re-use

52 Basic NLP resources Tokenizer Preferably domain-adapted Very poor language standards in some clinical documents Lemmatizer Case in point: FinTWOL and nursing narratives Basic FinTWOL extended by Lingsoft with ~3500 domain words Recognition rate grew from 83.1% to 90.7% That corresponds to 42% decrease in unrecognized running words Hungarian: lemmatizers exist but are not domain adapted due to data privacy concerns Researchers who are able to adapt the lemmatizers do not have appropriate data access permissions

53 References
1st place: Farkas, R., & Szarvas, G. (2008). Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinformatics, 9(Suppl 3), S10.
2nd place: Crammer, K., Dredze, M., Ganchev, K., & Talukdar, P. P. (2007). Automatic code assignment to medical text. Proceedings of ACL 07 BioNLP workshop.
3rd place: Suominen, H., Ginter, F., Pyysalo, S., Airola, A., Pahikkala, T., Salanterä, S., & Salakoski, T. (2008). Machine Learning to Automate the Assignment of Diagnosis Codes to Free-text Radiology Reports: a Method Description. Proceedings of the ICML/UAI/COLT Workshop on Machine Learning for Health-Care Applications.
Challenge description: Pestian, J. P., Brew, C., Matykiewicz, P., Hovermale, D., Johnson, N., Cohen, K. B., & Duch, W. (2007). A shared task involving multi-label classification of clinical free text. Proceedings of ACL 07 BioNLP workshop.


More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

More information

Brill s rule-based PoS tagger

Brill s rule-based PoS tagger Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based

More information

ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking

ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking Anne-Laure Ligozat LIMSI-CNRS/ENSIIE rue John von Neumann 91400 Orsay, France [email protected] Cyril Grouin LIMSI-CNRS rue John von Neumann 91400

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

SVM Based Learning System For Information Extraction

SVM Based Learning System For Information Extraction SVM Based Learning System For Information Extraction Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham Department of Computer Science, The University of Sheffield, Sheffield, S1 4DP, UK {yaoyong,kalina,hamish}@dcs.shef.ac.uk

More information

Exploration and Visualization of Post-Market Data

Exploration and Visualization of Post-Market Data Exploration and Visualization of Post-Market Data Jianying Hu, PhD Joint work with David Gotz, Shahram Ebadollahi, Jimeng Sun, Fei Wang, Marianthi Markatou Healthcare Analytics Research IBM T.J. Watson

More information

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Jan Paralic, Peter Smatana Technical University of Kosice, Slovakia Center for

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

Analyzing survey text: a brief overview

Analyzing survey text: a brief overview IBM SPSS Text Analytics for Surveys Analyzing survey text: a brief overview Learn how gives you greater insight Contents 1 Introduction 2 The role of text in survey research 2 Approaches to text mining

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

Semantic annotation of requirements for automatic UML class diagram generation

Semantic annotation of requirements for automatic UML class diagram generation www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute

More information

Software Architecture Document

Software Architecture Document Software Architecture Document Natural Language Processing Cell Version 1.0 Natural Language Processing Cell Software Architecture Document Version 1.0 1 1. Table of Contents 1. Table of Contents... 2

More information

3M Health Information Systems

3M Health Information Systems 3M Health Information Systems 1 Data Governance Disparate Systems Interoperability Information Exchange Reporting Public Health Quality Metrics Research Data Warehousing Data Standards What is the 3M Healthcare

More information

New Developments in the Automatic Classification of Email Records. Inge Alberts, André Vellino, Craig Eby, Yves Marleau

New Developments in the Automatic Classification of Email Records. Inge Alberts, André Vellino, Craig Eby, Yves Marleau New Developments in the Automatic Classification of Email Records Inge Alberts, André Vellino, Craig Eby, Yves Marleau ARMA Canada 2014 INTRODUCTION 2014 2 OUTLINE 1. Research team 2. Research context

More information

Predicting Chief Complaints at Triage Time in the Emergency Department

Predicting Chief Complaints at Triage Time in the Emergency Department Predicting Chief Complaints at Triage Time in the Emergency Department Yacine Jernite, Yoni Halpern New York University New York, NY {jernite,halpern}@cs.nyu.edu Steven Horng Beth Israel Deaconess Medical

More information

A Medical Decision Support System (DSS) for Ubiquitous Healthcare Diagnosis System

A Medical Decision Support System (DSS) for Ubiquitous Healthcare Diagnosis System , pp. 237-244 http://dx.doi.org/10.14257/ijseia.2014.8.10.22 A Medical Decision Support System (DSS) for Ubiquitous Healthcare Diagnosis System Regin Joy Conejar 1 and Haeng-Kon Kim 1* 1 School of Information

More information

Health Science Career Field Allied Health and Nursing Pathway (JM)

Health Science Career Field Allied Health and Nursing Pathway (JM) Health Science Career Field Allied Health and Nursing Pathway (JM) ODE Courses Possible Sinclair Courses CTAG Courses for approved programs Health Science and Technology 1 st course in the Career Field

More information

Parsing Software Requirements with an Ontology-based Semantic Role Labeler

Parsing Software Requirements with an Ontology-based Semantic Role Labeler Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software

More information

Big Data Integration and Governance Considerations for Healthcare

Big Data Integration and Governance Considerations for Healthcare White Paper Big Data Integration and Governance Considerations for Healthcare by Sunil Soares, Founder & Managing Partner, Information Asset, LLC Big Data Integration and Governance Considerations for

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

A.4.2. Challenges in the Deployment of Healthcare Information Systems and Technology

A.4.2. Challenges in the Deployment of Healthcare Information Systems and Technology A.4.2. Challenges in the Deployment of Healthcare Information Systems and Technology In order to support its constituent enterprise in Latin America and the Caribbean and deliver appropriate solutions,

More information

11-792 Software Engineering EMR Project Report

11-792 Software Engineering EMR Project Report 11-792 Software Engineering EMR Project Report Team Members Phani Gadde Anika Gupta Ting-Hao (Kenneth) Huang Chetan Thayur Suyoun Kim Vision Our aim is to build an intelligent system which is capable of

More information

X-ray (Radiography) - Chest

X-ray (Radiography) - Chest Scan for mobile link. X-ray (Radiography) - Chest What is a Chest X-ray (Chest Radiography)? The chest x-ray is the most commonly performed diagnostic x-ray examination. A chest x-ray produces images of

More information

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Atika Mustafa, Ali Akbar, and Ahmer Sultan National University of Computer and Emerging

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,

More information

Extracting Clinical entities and their assertions from Chinese Electronic Medical Records Based on Machine Learning

Extracting Clinical entities and their assertions from Chinese Electronic Medical Records Based on Machine Learning 3rd International Conference on Materials Engineering, Manufacturing Technology and Control (ICMEMTC 2016) Extracting Clinical entities and their assertions from Chinese Electronic Medical Records Based

More information

PoS-tagging Italian texts with CORISTagger

PoS-tagging Italian texts with CORISTagger PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement

Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement Ray Chen, Marius Lazer Abstract In this paper, we investigate the relationship between Twitter feed content and stock market

More information

Intelligent Tools For A Productive Radiologist Workflow: How Machine Learning Enriches Hanging Protocols

Intelligent Tools For A Productive Radiologist Workflow: How Machine Learning Enriches Hanging Protocols GE Healthcare Intelligent Tools For A Productive Radiologist Workflow: How Machine Learning Enriches Hanging Protocols Authors: Tianyi Wang Information Scientist Machine Learning Lab Software Science &

More information

Clinical Database Information System for Gbagada General Hospital

Clinical Database Information System for Gbagada General Hospital International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 2, Issue 9, September 2015, PP 29-37 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org

More information

Protect Your Family. and Friends from. The TB Contact Investigation TUBERCULOSIS

Protect Your Family. and Friends from. The TB Contact Investigation TUBERCULOSIS Protect Your Family TB and Friends from TUBERCULOSIS The TB Contact Investigation What s Inside: Read this brochure today to learn how to protect your family and friends from TB. Then share it with people

More information

ASTHMA IN INFANTS AND YOUNG CHILDREN

ASTHMA IN INFANTS AND YOUNG CHILDREN ASTHMA IN INFANTS AND YOUNG CHILDREN What is Asthma? Asthma is a chronic inflammatory disease of the airways. Symptoms of asthma are variable. That means that they can be mild to severe, intermittent to

More information

Application of Data Mining Methods in Health Care Databases

Application of Data Mining Methods in Health Care Databases 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Application of Data Mining Methods in Health Care Databases Ágnes Vathy-Fogarassy Department of Mathematics and

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

Ear Infections Asthma in childhood asthma in childhood

Ear Infections Asthma in childhood asthma in childhood Asthma Ear Infections in childhood asthma in childhood Asthma in childhood is common and it can be serious. About one in six children (aged less than 15 years) in Western Australia are affected by asthma.

More information