A Supervised Abbreviation Resolution System for Medical Text
Pierre Zweigenbaum 1, Louise Deléger 1, Thomas Lavergne 1, Aurélie Névéol 1, and Andreea Bodnari 1,2

1 LIMSI-CNRS, rue John von Neumann, Orsay, France
2 MIT, CSAIL, Cambridge, Massachusetts, USA

Abstract. We present our participation in Task 2 of the 2013 CLEF-eHEALTH Challenge, whose goal was to determine the UMLS concept unique identifier (CUI), if available, of an abbreviation or acronym. We hypothesize that considering only the abbreviations of the training corpus could be sufficient to provide a strong baseline for this task. We therefore test how a fully supervised approach, which predicts the CUI of an abbreviation based only on the abbreviations and CUIs seen in the training corpus, fares on this task. We adapt to this task the processing pipeline we developed for CLEF-eHEALTH Task 1, entity detection: a supervised MaxEnt model based on a set of features including UMLS Concept Unique Identifiers, complemented here with a rule-based component for document headers. This system confirmed our hypothesis and was evaluated at accuracy (strict) and accuracy (relaxed), ranking second out of five teams.

Keywords: Abbreviation normalization, Natural Language Processing, Medical records, Machine learning

1 Introduction

The 2013 CLEF-eHEALTH challenge [9] aims to develop methods and resources that make electronic medical records (EMRs) more understandable to both patients and health professionals. The challenge spans three tasks. The second task focuses on the resolution of abbreviations into UMLS concept unique identifiers (CUIs). We present here our participation in this second task and develop a system that determines the CUI of an abbreviation. We propose a solution that combines a simple feature set with external knowledge gathered from the UMLS Metathesaurus. Clinical notes often use abbreviations and acronyms (henceforth collectively called abbreviations in this paper).
While the number and variety of abbreviations are virtually unlimited, thanks to the plasticity of language and the inventiveness of the human mind, the distribution of abbreviations can be expected to follow a Zipfian law, with a small number of distinct abbreviations accounting for a large part of the total number of occurrences. We therefore
hypothesize that the sample of abbreviations present in the training corpus provided by CLEF-eHEALTH for the task should contain the most frequently occurring abbreviations and should hence account for a large part of the occurrences of abbreviations in the test corpus. We present a supervised abbreviation resolution system specifically adapted to the abbreviations of the CLEF-eHEALTH corpus. Our system learns a MaxEnt model from the training data, based on a simple feature set that combines UMLS knowledge with information gathered from the EMR text. We present the system design and its results on the 2013 CLEF-eHEALTH [9] test data, and discuss its merits and limitations.

2 Related work

Abbreviations and acronyms are pervasive in technical and scientific text, including in health and life sciences. They create obstacles to various information extraction tasks, for instance coreference resolution in electronic medical records [10]. This has motivated work on abbreviation detection and expansion on various types of texts, the most prevalent of which for health are the scientific literature [6, 8, 1] and patient records [11, 3]. Initial work aimed to collect acronyms and their expansions from large corpora [6, 8], in expressions such as "the numbers of underrepresented minorities (URMs)",1 where the expansion of an abbreviation is directly provided in the input text. This is however not always the case, and explicit expansions are in fact fairly rare in patient records. When abbreviation expansions are not explicit in source texts, abbreviation processing can be divided into two steps. The first is the detection of abbreviations, which determines which expressions in a text are abbreviations (or acronyms) [11]; this can be compared to an entity detection task. The second step is the resolution of abbreviations: given an expression, which may map to multiple expansions, select the appropriate expansion in context.
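Framed this way, resolution is a supervised classification problem over context features: the words surrounding an abbreviation occurrence vote for one of its candidate expansions. A minimal sketch of such a context representation, with an assumed window of five words on each side (the exact feature design varies across systems):

```python
def context_features(tokens, i, window=5):
    """Bag-of-words features from up to `window` tokens on each side
    of the abbreviation at position i, usable by any classifier."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    feats = {}
    for w in left + right:
        key = "ctx=" + w.lower()
        feats[key] = feats.get(key, 0) + 1
    return feats

toks = "patient was transferred to the ICU for close monitoring overnight".split()
print(context_features(toks, 5))  # features around "ICU"
```

A classifier trained on such vectors, one class per candidate CUI, implements the disambiguation step described above.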
If all possible expansions are available, this is similar to a disambiguation task [3]. Kim et al. [3] disambiguate abbreviations with a multi-class SVM classifier trained on feature vectors including the five preceding and following words (those occurring at least five times in the corpus). They train on a corpus of 9,963 clinical notes in which full forms have been automatically replaced with abbreviations, following Pakhomov [5], using the inventory of abbreviations in the UMLS Specialist Lexicon file LRABR. They test their method on a corpus of 37 hand-annotated clinical notes and achieve an F-measure of (Precision = 0.683, Recall = 0.637) in exact match and in partial match. Task 2 of the ShARe/CLEF eHealth 2013 Challenge addresses the abbreviation disambiguation step. Gold standard annotations of abbreviation boundaries are given for the abbreviations present in clinical notes, which removes the need to address the abbreviation detection step: the task consists of disambiguating and normalizing each abbreviation into a UMLS Concept Unique Identifier

1 Example from the MEDSTRACT corpus: gold-standards.html.
(CUI), or the CUI-less label if no UMLS CUI is available. It is close to the task defined in [3], but to our knowledge this is the first time a dataset of this size has been provided for disambiguating abbreviations in clinical notes into a reference terminology. Besides, a rapid examination of the abbreviations found in the training dataset shows that a number of them are not present in the Specialist Lexicon LRABR file: for instance, oxygen saturation abbreviated as O2 sat or 02 sat (the latter uses a zero instead of the letter O) is not listed among the 4 abbreviations of oxygen saturation in LRABR (O2 saturation, O2sat, SO2, SpO2), so the semi-supervised method of [5, 3] might not apply here.

3 Materials and methods

3.1 Data

The corpus used for the 2013 CLEF-eHEALTH challenge consists of de-identified plain-text EMRs from the MIMIC II database, version 2.5 [7]. The EMR documents were extracted from the intensive-care unit setting and include discharge summaries, electrocardiography reports, echography reports, and radiology reports. The training set contained 200 documents and a total of 94,243 words, while the test set contained 100 documents and a total of 87,799 words (see Table 1). Abbreviations and acronyms are annotated in terms of span and UMLS Concept Unique Identifier (CUI) when available. If no CUI is available for the abbreviation, the CUI-less code is used. The training set contained 3,805 annotations, while the test set contained 3,774 annotations (see Table 1). CUI-less abbreviations accounted for about 5% of annotations in the training set and 6% in the test set.

Table 1. Description of training and test data sets. Distinct CUIs do not include the CUI-less code.

                         Training   Test
Documents                200        100
Words                    94,243     87,799
Abbreviations            3,805      3,774
Distinct CUIs            696        706
CUI-less abbreviations   ~5%        ~6%

3.2 System design

Document headers (see details below) contain pre-formatted fields, some of which contain abbreviations.
Since these are very regular fields, their CUIs are not ambiguous, and simple rules can handle them. We therefore divided the problem into two parts: rule-based resolution of abbreviations in document headers (Section 3.4), and supervised resolution of abbreviations in the document body (Section 3.3).
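The rule-based header component amounts to a small lookup keyed on document type and a surface pattern on the header text. A minimal sketch, with placeholder patterns and CUI codes (the actual rules used by the system are listed in Table 2):

```python
import re

# (document type, regex on the header line) -> (abbreviation, CUI).
# Placeholder CUIs: the real rules map e.g. "Sex : M" to the UMLS
# concept for Male; see Table 2 for the actual rule set.
HEADER_RULES = [
    ("DISCHARGE", re.compile(r"Sex\s*:\s*M\b"), ("M", "CUI_MALE")),
    ("DISCHARGE", re.compile(r"Sex\s*:\s*F\b"), ("F", "CUI_FEMALE")),
    ("RADIO", re.compile(r"PORTABLE AP\b"), ("AP", "CUI_AP")),
]

def resolve_header(doc_type, header_text):
    """Return (abbreviation, CUI) pairs found in a document header."""
    return [payload for dt, pat, payload in HEADER_RULES
            if dt == doc_type and pat.search(header_text)]

print(resolve_header("DISCHARGE", "Sex : M  Service: MEDICINE"))
# [('M', 'CUI_MALE')]
```

Because the header fields are highly regular, a deterministic lookup of this kind needs no training data and cannot be confused by ambiguous contexts.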
3.3 Supervised resolution of abbreviations in document body

We used a supervised linear-chain Conditional Random Fields (CRF) system which we restricted to Maximum Entropy mode (see below). We trained a model using 10-fold cross-validation on the training set, keeping 1/11th of the data for final tuning of the model. We first present how we formulate the problem. We then describe the pre-processing steps we performed on the datasets and the model feature set, together with the CRF feature patterns.

Problem formulation. Abbreviation resolution is defined here as the determination of the CUI of an abbreviation, or else the attribution of the CUI-less label. An abbreviation may be ambiguous, i.e., be found with different CUIs in different contexts. For instance, in the training set, CATH is used both for Catheterization [C ] and for Drainage Catheters [C ]. When ignoring case, 526 distinct abbreviations are found in the training set, 94 of which are ambiguous given the entries in the training set. When all candidate CUIs for an abbreviation are known, abbreviation expansion can be addressed as a supervised classification task; this is what we do here. We label each abbreviation of the training set with its gold standard CUI, using a B-I-O scheme where each CUI introduces a B- label and possibly some I- labels. For instance, the two tokens of the abbreviation O2 sat are labelled B-C and I-C . This results in 738 distinct labels. An obvious limitation of this design is that only those CUIs seen in the training set can be assigned in the test set, but we found it worth trying. We used the Wapiti [4] implementation of CRFs because it is fast and offers convenient patterns (e.g., patterns using regular expressions on field values).

Data pre-processing. Our data pre-processing and feature production architecture is schematized in Figure 1.
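The B-I-O labeling described in the problem formulation above can be sketched in a few lines: each annotated span contributes one B- label and I- labels for its remaining tokens (the token offsets and the CUI value below are illustrative, not taken from the corpus):

```python
def bio_labels(tokens, annotations):
    """annotations: list of (start_tok, end_tok_exclusive, cui).
    Returns one label per token: B-<cui>, I-<cui>, or O."""
    labels = ["O"] * len(tokens)
    for start, end, cui in annotations:
        labels[start] = "B-" + cui
        for j in range(start + 1, end):
            labels[j] = "I-" + cui
    return labels

tokens = ["O2", "sat", "is", "98%"]
# Hypothetical CUI for a two-token abbreviation span.
print(bio_labels(tokens, [(0, 2, "C0523807")]))
# ['B-C0523807', 'I-C0523807', 'O', 'O']
```

With one B-/I- label pair per CUI, the label set grows with the number of distinct CUIs, which is exactly what yields the 738 labels mentioned above.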
Before using the challenge corpora for training and testing, we performed the following pre-processing steps. The training and test corpora provided by the challenge organizers were de-identified and thus contained special de-identification marks; to turn these de-identification codes into more normal phrases, we performed re-identification with pseudonyms on the input text. EMR documents generally present a well-structured form, with a header, a document body, and a footer. The header and footer contain information relevant to clinical administration, but the annotated entities are only encountered inside the document body. We thus removed the header and footer from the EMR documents and performed analysis on the document body only. Since headers may contain abbreviations, we handled them specifically (see below).
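The re-identification step can be sketched as a pattern substitution over the de-identification marks. The mark syntax shown below (MIMIC-style [** ... **] placeholders) and the pseudonym table are assumptions for illustration, not the authors' exact implementation:

```python
import re

# Assumed MIMIC-style de-identification marks, e.g. "[**Last Name**]".
DEID_MARK = re.compile(r"\[\*\*(.*?)\*\*\]")

# Hypothetical pseudonym table keyed on the mark's category.
PSEUDONYMS = {
    "First Name": "John",
    "Last Name": "Smith",
    "Hospital": "General Hospital",
}

def reidentify(text):
    """Replace each de-identification mark with a pseudonym so the
    text reads as normal prose for downstream NLP tools."""
    return DEID_MARK.sub(
        lambda m: PSEUDONYMS.get(m.group(1).strip(), "X"), text)

print(reidentify("Pt [**First Name**] [**Last Name**] admitted to [**Hospital**]."))
# Pt John Smith admitted to General Hospital.
```

Replacing marks with plausible surface forms matters because taggers and parsers trained on normal text behave poorly on bracketed placeholder codes.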
Fig. 1. Diagram of feature production.

System features. Given a sentence s = ... w_{-2} w_{-1} w_0 w_1 w_2 ... and a token of interest w_k, we define features over w_k and n-grams centered at w_k.

1. Lexical and morphological features: we include information on the token in the form of unigrams over w_{k-1}, w_k, w_{k+1} and bigrams over w_{k-1} w_k. Additionally, we include as unigram features over w_k token prefixes ranging from 1 to 4 characters. We also add a 5-gram feature which detects patterns containing two slashes, such as m/r/g, which may help disambiguate between disorders and abbreviations, and apply it over w_{k-4}, w_{k-2}, w_k (the non-slash positions of the pattern).

2. Document structure features: the feature set contains the document type (e.g., radiology report, discharge summary) and the section type (e.g., Findings, Laboratory Data, Social History). We extract the section type using a rule-based section extraction tool that identifies the occurrence of section names within the EMR. The section extraction tool uses a list of 58 manually defined section names. Both document type and section type are unigram features over w_k.

3. UMLS features: we include UMLS information from two sources. We first run cTAKES over the texts and keep concept unique identifiers (CUIs, defined over unigrams w_k). We use an additional UMLS mapping where we directly search for UMLS strings within the EMR text through exact match and include the concept unique identifier (CUI) of the identified phrase. The direct UMLS mapping features are unigram features over w_k.

4. Abbreviation feature: gold standard boundaries for abbreviations are provided; these are used as unigram features over w_k.

All features pertaining to multi-token expressions rather than single tokens (for instance, being a UMLS term with a given CUI, or being an abbreviation) are encoded with the Begin-Inside-Outside (B-I-O) scheme: given a label L, the first token is labeled B-L, the following tokens are labeled I-L, and tokens not having this feature are labeled O. An advantage of linear-chain CRFs is that they can include label bigrams in the model they learn. However, in our experiments on
the training set, we were not able to use label bigrams because of the large number of labels (738 in the model we finally used): with label bigrams, the number of generated features for a CRF pattern which involves x observed values and y labels is x × y², and the feature space generated when we tested with these label bigrams was too vast for the CRF to find a solution when training. Having no label bigrams means that our CRF is basically used as a Maximum Entropy classifier, with no dependency on the previous label. The total number of features generated for evaluation by the CRF for this model is about 50 million, among which it selects about 400,

3.4 Rule-based resolution of abbreviations in document headers

We examined abbreviations in the training set headers and wrote rules to label them. These rules are shown in Table 2. They handle the 6 abbreviations found in the document headers of the training set. The resulting annotations are merged with those obtained by the CRF.

Table 2. Rules to address abbreviations in document headers.

Document type   Pattern       Abbrev = CUI
DISCHARGE       Sex : M       M = C
DISCHARGE       Sex : F       F = C
ECG             ECG           ECG = C
ECHO            ECHO          ECHO = C
RADIO           PORTABLE AP   AP = C
RADIO           CT HEAD       CT = C

4 Results and discussion

4.1 Evaluation metrics

We evaluate the system's ability to correctly generate the codes (CUIs or CUI-less) of abbreviations. The evaluation measure is the accuracy of codes, defined as

    Accuracy = Correct / Total    (1)

where Correct = number of pre-annotated acronyms/abbreviations with a correctly generated code, and Total = number of pre-annotated acronyms/abbreviations. In some cases the human annotators generated a ranked list of codes for a given abbreviation. This makes it possible to define two variants of the accuracy score. In the strict variant, only the top code selected by the annotators (one best) is considered.
In the relaxed variant, a code is considered correct if it is contained in the full list generated by the annotators (n-best).
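The two accuracy variants can be computed as follows, assuming each gold entry holds the annotators' ranked code list with the one-best code first:

```python
def accuracy(predictions, gold_lists, relaxed=False):
    """Strict: prediction must equal the annotators' top code.
    Relaxed: prediction may match any code in the ranked list."""
    correct = 0
    for pred, golds in zip(predictions, gold_lists):
        if (pred in golds) if relaxed else (pred == golds[0]):
            correct += 1
    return correct / len(gold_lists)

# Toy example with hypothetical codes.
preds = ["C1", "C2", "CUI-less"]
gold = [["C1"], ["C3", "C2"], ["CUI-less"]]
print(accuracy(preds, gold))                # strict: 2/3
print(accuracy(preds, gold, relaxed=True))  # relaxed: 3/3
```

By construction, relaxed accuracy can never be lower than strict accuracy on the same predictions.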
4.2 Rule-based resolution of abbreviations in headers

Rule-based resolution of abbreviations in headers obtained the results displayed in Table 3. A number of abbreviations were located in the headers; the rules identified 100, out of which 97 matched the gold standard.

Table 3. Headers: Rule-based resolution of abbreviations. ECG is null because there was no ECG report in the test corpus. The two substitutions and the insertion actually seem to be errors in the gold standard.

Abbreviation   CUI   GS   Correct   Substitution   Deletion   Insertion   Accuracy
M              C
F              C
ECG            C
ECHO           C
AP             C
CT             C
Others
Total
GS = Gold Standard. Accuracy = Correct / (Correct + Substitution + Deletion).

The two substitutions (cases where the gold standard CUI differed from that proposed by the system) are both cases where human annotators marked as CUI-less an M (Male) or F (Female) abbreviation in the header, which seems to be an error in the gold standard. Since we did not constrain rules to apply to gold standard abbreviation boundaries, the system also identified an M abbreviation which the human annotators forgot to mark in a document header (insertion). If this is correct, rule-based resolution of abbreviations had perfect accuracy. However, the headers were processed only through these rules, and other abbreviations in the headers were thus not examined. 10 such abbreviations were missed (deletions), so that overall, our resolution of abbreviations in headers achieved an accuracy of 0.89 against the gold standard. All in all, the hundred or so abbreviations present in headers are only a small part of the 3,774 abbreviations of the test corpus (about 3%), so this part of the method contributed only a small fraction of the performance of the system.

4.3 Resolution of abbreviations in full documents

This section presents the global results that we obtained on the full documents, merging codes obtained through rule-based and supervised methods.
We submitted one run, which achieved an accuracy of under the strict evaluation and under the relaxed setting.
The training set contained 696 distinct CUIs (plus the CUI-less label), whereas the test set contained 706 distinct CUIs (plus CUI-less) (see Table 1). When training our final CRF model we kept 1/11th of the corpus for development, resulting in a model which has knowledge of 654 CUIs (plus CUI-less). Only 346 of these were present in the test set, which means that our system could only identify about one half of the 706 distinct CUIs of the test set. However, considering the number of occurrences of these CUIs paints a quite different picture. Table 4 shows that among the 3,774 occurrences of abbreviations contained in the test set, 2,906 had CUIs seen in the training corpus (including 221 occurrences of CUI-less, see Table 5).

Table 4. Full documents: Resolution of abbreviations (strict evaluation) is much better for seen CUIs than for unseen CUIs. A CUI is seen if it was fed to the CRF model during training.

CUI      GS   Correct   Substitution   Deletion   Insertion   Accuracy
Seen
Unseen
All
GS = Gold Standard. Accuracy = Correct / (Correct + Substitution + Deletion).

Correctly identifying these known CUIs would have led to an accuracy of 2906/3774 = 0.770, which would outperform the best system of the challenge (0.719). Our hypothesis, according to which considering only the abbreviations of the training corpus could be sufficient to provide a strong baseline, is therefore confirmed. Besides, our system correctly identified the CUIs of 2,354 occurrences of abbreviations, i.e., 81% of the best it could have done given its inherent limitations. In the case of unseen abbreviations, our system can only be right if the abbreviation is CUI-less and the system does assign this code to it; this occurred for 151 occurrences. Whether or not an abbreviation had a CUI did not affect the performance of the system much, as can be seen in Table 5.

Table 5. Full documents: Resolution of abbreviations (strict evaluation) fares about as well on abbreviations with and without CUIs.

CUI        GS   Correct   Substitution   Deletion   Insertion   Accuracy
CUI-less
with CUI
All
GS = Gold Standard. Accuracy = Correct / (Correct + Substitution + Deletion).

We conclude this discussion with a word on insertions and deletions. These were not expected given the task definition: given gold standard abbreviation boundaries, determine the CUI of the abbreviation. Using a CRF with a B-I-O
scheme allowed us to keep the very same framework as the one we prepared for Task 1 of the challenge, Entity Recognition [2]. Its drawback is that without additional constraints, it decides where to set the boundaries of the CUI or CUI-less codes that it assigns to its input tokens. When this created boundary changes, they caused a mismatch between the gold standard and system abbreviations, which is counted as a deletion (missing gold standard abbreviation) combined with an insertion (spurious system abbreviation). For instance, the gold standard considered the string NCAT as two tokens, NC [CUI-less] AT [CUI-less], while the system tagged it as one token, NCAT [CUI-less]. Conversely, the system tagged 02 sat [C ] as two tokens, 02 [C ] sat [C ], and tagged LE s [C ] as three tokens, LE [C ] [C ] s [C ]. A split into multiple tokens occurred more frequently than the reverse situation, which explains why insertions (285) are more numerous than deletions (163). Since the gold standard boundaries are given, a post-processing step could be added in the future to predict a code for the actual abbreviation span based on the CRF output. The cleanest way to cope with this issue, though, would be to rewrite the system to quit the linear-chain framework of our Task 1 system and switch to a standard Maximum Entropy (or other) classifier.

5 Conclusion and perspectives

We present an abbreviation resolution system prepared for participation in Task 2 of the 2013 CLEF-eHEALTH challenge. We design our system as a derivation of our Task 1 framework into a supervised Maximum Entropy classifier with lexical, document structure, and UMLS features, complemented by rule-based processing of abbreviations in document headers. Our system obtains an accuracy of (strict) and (relaxed), which ranks it second among five systems. We have tested the bold hypothesis that relying only on the CUIs seen in the training set is enough to provide a strong baseline.
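The ceiling implied by this hypothesis can be measured directly from the annotations: the fraction of test occurrences whose gold CUI appears in the training annotations bounds the accuracy of any train-only classifier. A toy sketch with hypothetical CUI codes (in the paper's data, 2906 of 3774 test occurrences had seen CUIs, a ceiling of 0.770):

```python
def seen_cui_ceiling(train_cuis, test_cuis):
    """Upper bound on accuracy for a classifier that can only emit
    CUIs observed in training: fraction of test occurrences whose
    gold CUI was seen at least once in the training annotations."""
    seen = set(train_cuis)
    covered = sum(1 for c in test_cuis if c in seen)
    return covered / len(test_cuis)

# Toy data with hypothetical CUI codes; "C9" is unseen in training.
train = ["C1", "C1", "C2", "CUI-less"]
test = ["C1", "C2", "C9", "CUI-less", "C1"]
print(seen_cui_ceiling(train, test))  # 4/5 = 0.8
```

Comparing a system's actual accuracy to this ceiling separates errors due to label coverage from errors due to the classifier itself.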
The data in the test set and the obtained results show that this hypothesis holds. We have seen that in principle this framework might even fare better than the current best challenge participant. Further work can go in two directions. One is to keep the current framework and strengthen it, for instance by removing deletions and insertions, and by studying the current causes of substitutions. The other is to move to the dynamic generation of candidate expansions. This may require abandoning the neat supervised classification framework which was possible here, but is needed to extend processing to abbreviations unseen in the training set or in existing databases such as the UMLS Specialist Lexicon LRABR.

6 Acknowledgments

We acknowledge the Shared Annotated Resources (ShARe) project funded by the United States National Institutes of Health with grant number R01GM . The last author was funded by the Châteaubriand Science Fellowship.
This work was partly funded through project Accordys,5 funded by ANR under grant number ANR-12-CORD .

References

1. Paolo Atzeni, Fabio Polticelli, and Daniele Toti. An automatic identification and resolution system for protein-related abbreviations in scientific papers. In C. Pizzuti, M.D. Ritchie, and M. Giacobini, editors, EvoBIO 2011, number 6623 in LNCS. Springer-Verlag, Berlin Heidelberg.
2. Andreea Bodnari, Louise Deléger, Thomas Lavergne, Aurélie Névéol, and Pierre Zweigenbaum. A supervised named-entity extraction system for medical text. In Proceedings of CLEF 2013. To appear.
3. Youngjun Kim, John Hurdle, and Stéphane M. Meystre. Using UMLS lexical resources to disambiguate abbreviations in clinical text. In AMIA Annu Symp Proc.
4. Thomas Lavergne, Olivier Cappé, and François Yvon. Practical very large scale CRFs. In Proc ACL.
5. Serguei Pakhomov. Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical texts. In Proc 40th ACL, Philadelphia, PA. ACL.
6. James Pustejovsky, José Castaño, Brent Cochran, Maciej Kotecki, and Michael Morrell. Automatic extraction of acronym-meaning pairs from MEDLINE databases. In MEDINFO 2001, Studies in Health Technology and Informatics. IOS Press.
7. M. Saeed, M. Villarroel, A.T. Reisner, G. Clifford, L. Lehman, G.B. Moody, T. Heldt, T.H. Kyaw, B.E. Moody, and R.G. Mark. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): A public-access ICU database. Critical Care Medicine, 39.
8. Ariel S. Schwartz and Marti A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In Pacific Symposium on Biocomputing, Kauai, Hawaii. World Scientific.
9. Hanna Suominen, Sanna Salanterä, Sumitra Velupillai, Wendy W. Chapman, Guergana Savova, Noemie Elhadad, Sameer Pradhan, Brett R. South, Danielle L. Mowery, Gareth J.F. Jones, Johannes Leveling, Liadh Kelly, Lorraine Goeuriot, David Martinez, and Guido Zuccon. Overview of the ShARe/CLEF eHealth evaluation lab. In Proceedings of CLEF 2013, Lecture Notes in Computer Science. Springer, Berlin Heidelberg. To appear.
10. Özlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett South. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association, 17.
11. Yonghui Wu, S. Trent Rosenbloom, Joshua C. Denny, Randolph A. Miller, Subramani Mani, Dario A. Giuse, and Hua Xu. Detecting abbreviations in discharge summaries using machine learning methods. In AMIA Annu Symp Proc.

5 Accordys: Agrégation de Contenus et de COnnaissances pour Raisonner à partir de cas de DYSmorphologie fœtale, Content and Knowledge Aggregation for Case-based Reasoning in the field of Fetal Dysmorphology (ANR ).
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software
THUTR: A Translation Retrieval System
THUTR: A Translation Retrieval System Chunyang Liu, Qi Liu, Yang Liu, and Maosong Sun Department of Computer Science and Technology State Key Lab on Intelligent Technology and Systems National Lab for
Free Text Phrase Encoding and Information Extraction from Medical Notes. Jennifer Shu
Free Text Phrase Encoding and Information Extraction from Medical Notes by Jennifer Shu Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
Semantic annotation of requirements for automatic UML class diagram generation
www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute
Identifying Focus, Techniques and Domain of Scientific Papers
Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Christopher D. Manning Department of
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
Search Query and Matching Approach of Information Retrieval in Cloud Computing
International Journal of Advances in Electrical and Electronics Engineering 99 Available online at www.ijaeee.com & www.sestindia.org ISSN: 2319-1112 Search Query and Matching Approach of Information Retrieval
Special Topics in Computer Science
Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS
Term extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
Machine translation techniques for presentation of summaries
Grant Agreement Number: 257528 KHRESMOI www.khresmoi.eu Machine translation techniques for presentation of summaries Deliverable number D4.6 Dissemination level Public Delivery date April 2014 Status Author(s)
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search
Project for Michael Pitts Course TCSS 702A University of Washington Tacoma Institute of Technology ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Under supervision of : Dr. Senjuti
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
Extracting Clinical entities and their assertions from Chinese Electronic Medical Records Based on Machine Learning
3rd International Conference on Materials Engineering, Manufacturing Technology and Control (ICMEMTC 2016) Extracting Clinical entities and their assertions from Chinese Electronic Medical Records Based
Adaption of Statistical Email Filtering Techniques
Adaption of Statistical Email Filtering Techniques David Kohlbrenner IT.com Thomas Jefferson High School for Science and Technology January 25, 2007 Abstract With the rise of the levels of spam, new techniques
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
CENG 734 Advanced Topics in Bioinformatics
CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the
Terminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet, Mathieu Roche To cite this version: Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet,
Author Gender Identification of English Novels
Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in
Workshop. Neil Barrett PhD, Jens Weber PhD, Vincent Thai MD. Engineering & Health Informa2on Science
Engineering & Health Informa2on Science Engineering NLP Solu/ons for Structured Informa/on from Clinical Text: Extrac'ng Sen'nel Events from Pallia've Care Consult Le8ers Canada-China Clean Energy Initiative
Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research [email protected]
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research [email protected] Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
Natural Language Processing
Natural Language Processing 2 Open NLP (http://opennlp.apache.org/) Java library for processing natural language text Based on Machine Learning tools maximum entropy, perceptron Includes pre-built models
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,
D2.4: Two trained semantic decoders for the Appointment Scheduling task
D2.4: Two trained semantic decoders for the Appointment Scheduling task James Henderson, François Mairesse, Lonneke van der Plas, Paola Merlo Distribution: Public CLASSiC Computational Learning in Adaptive
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
Taxonomies in Practice Welcome to the second decade of online taxonomy construction
Building a Taxonomy for Auto-classification by Wendi Pohs EDITOR S SUMMARY Taxonomies have expanded from browsing aids to the foundation for automatic classification. Early auto-classification methods
ScreenMatch: Providing Context to Software Translators by Displaying Screenshots
ScreenMatch: Providing Context to Software Translators by Displaying Screenshots Geza Kovacs MIT CSAIL 32 Vassar St, Cambridge MA 02139 USA [email protected] Abstract Translators often encounter ambiguous
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery
Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Jan Paralic, Peter Smatana Technical University of Kosice, Slovakia Center for
How To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
Automated Content Analysis of Discussion Transcripts
Automated Content Analysis of Discussion Transcripts Vitomir Kovanović [email protected] Dragan Gašević [email protected] School of Informatics, University of Edinburgh Edinburgh, United Kingdom [email protected]
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Word Completion and Prediction in Hebrew
Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology
Stanford s Distantly-Supervised Slot-Filling System
Stanford s Distantly-Supervised Slot-Filling System Mihai Surdeanu, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang, Valentin I. Spitkovsky, Christopher D. Manning Computer Science Department,
Bayesian Spam Filtering
Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary [email protected] http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating
Sentiment analysis: towards a tool for analysing real-time students feedback
Sentiment analysis: towards a tool for analysing real-time students feedback Nabeela Altrabsheh Email: [email protected] Mihaela Cocea Email: [email protected] Sanaz Fallahkhair Email:
An Ontology Based Method to Solve Query Identifier Heterogeneity in Post- Genomic Clinical Trials
ehealth Beyond the Horizon Get IT There S.K. Andersen et al. (Eds.) IOS Press, 2008 2008 Organizing Committee of MIE 2008. All rights reserved. 3 An Ontology Based Method to Solve Query Identifier Heterogeneity
