Syntactic Parsing for Bio-molecular Event Detection from Scientific Literature
Sérgio Matos 1, Anabela Barreiro 2, and José Luis Oliveira 1

1 IEETA, Universidade de Aveiro, Campus Universitário de Santiago, Aveiro, Portugal
2 Faculdade de Letras, Universidade do Porto, Via Panorâmica, Porto, Portugal
{aleixomatos,jlo}@ua.pt, barreiro_anabela@hotmail.com

Abstract. Rapid advances in science and in laboratory and computing methods are generating vast amounts of data and scientific literature. To keep up to date with the expanding knowledge in their field of study, researchers increasingly need tools that help manage this information. In the genomics field, various databases have been created to store information in a formalized and easily accessible form. However, human curators cannot update these databases at the rate at which new studies are published. Advanced and robust text mining tools that automatically extract newly published information from scientific articles are therefore required. This paper presents a methodology, based on syntactic parsing, for identifying gene events in the scientific literature. Evaluation of the proposed approach, based on the BioNLP shared task on event extraction, produced an average F-score of 47.1 for six event types.

Keywords: Biomedical literature, information extraction, bio-molecular events, syntactic parsing, semantic properties.

1 Introduction

Recent advances in biotechnology, namely the widespread use of high-throughput methods for gene analysis, have generated vast amounts of published scientific literature. While much of the data and results described in these studies are being annotated in the various existing biomedical databases, these databases are not easily kept up to date. As a result, many relevant research outcomes remain locked in the free text of the scientific literature, which is still the major source of information for researchers [1].
It is therefore increasingly difficult for researchers to keep track of the quickly expanding biomedical knowledge to support their experiment planning and analysis of results [2][3]. Researchers currently face issues such as (i) how to identify the most relevant articles for their specific study, (ii) how to identify the mentioned concepts (genes, proteins, diseases and so on) and the relations between them, and (iii) how to integrate the extracted information with existing knowledge in a simple, efficient, and user-friendly manner [2][4]. This integrated view of information extracted from the literature, in the framework of more systematized and formalized knowledge annotated in databases and ontologies, is an important requisite for biological data analysis [3].

L. Seabra Lopes et al. (Eds.): EPIA 2009, LNAI 5816, pp , Springer-Verlag Berlin Heidelberg 2009
To address these issues, several tools have been developed in recent years that combine Information Extraction (IE), Text Mining (TM) and Natural Language Processing (NLP) techniques with the domain knowledge available in resources such as Entrez Gene, UniProt, GO or UMLS [1][2][4][5]. Such tools process titles and abstracts from the MEDLINE/PubMed [6] literature database and present the extracted information in different forms. The iHOP tool [7] identifies genes and proteins in PubMed abstracts and uses them as links, allowing navigation through sentences and abstracts. The AliBaba system [8] uses pattern matching and co-occurrence statistics to find associations between biological entities such as genes, proteins or diseases, and presents the search results in the form of a graph. EBIMed [9] also finds associations between protein/gene names, GO annotations, drugs and species in PubMed abstracts resulting from a user query. The results are displayed in a table with links to the sentences and abstracts that support the corresponding associations. A similar tool, FACTA [10], retrieves abstracts from PubMed and identifies biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) co-occurring with the user query term. The concepts are presented to the user in a tabular format and ranked based on co-occurrence statistics or on pointwise mutual information.

More recently, there has been some focus on applying more detailed linguistic processing in order to improve information retrieval and extraction. Chilibot [11] retrieves sentences from PubMed abstracts related to a pair or a list of proteins, genes, or keywords, and applies shallow parsing to classify these sentences as interactive, non-interactive or simple abstract co-occurrence. The identified relationships between entities or keywords are then displayed as a graph.
MEDIE [12] uses a deep parser and a term recognizer to index abstracts based on pre-computed semantic annotations, allowing for real-time retrieval of sentences containing biological concepts associated with the terms specified in the user query.

Interest in the application of more advanced methods of linguistic processing is also evident in the recent information extraction evaluation challenges, namely the BioNLP shared task on event extraction [13] and the BioCreAtIvE II.5 challenge [14], which investigate the extraction of gene events from the literature. In this paper, we describe a methodology based on syntactic parsing to detect and annotate bio-molecular events, such as protein production and breakdown, localization or binding events. We present results from our participation in the BioNLP shared task and discuss the main difficulties and further developments required in this area.

2 Methods

The method described in this paper to identify bio-molecular events is based on syntactic grammars that process texts and detect the occurrence of linguistic patterns that describe such events. Syntactic parsing was implemented using NooJ [15], a freely available development environment and linguistic processing engine that includes tools for inflectional and derivational morphology, syntactic grammars and semantics. NooJ uses dictionaries and grammars to produce formalized descriptions of natural language and contains a system of inflectional and derivational paradigms, which interacts with the dictionary. Inflectional rules apply to a dictionary entry (lemma) to recognize and generate inflected forms, including gender, number and tense. Derivational rules apply to a dictionary entry to recognize and generate derived forms, such as nominalizations (predicate nouns morphosyntactically related to a verb) as adopted in [16]. Lemmas can also have semantic information included. Semantic properties allow, for example, adding the characteristic of a particular named entity, such as ORGANISM, PROTEIN or DISEASE. These properties are illustrated in Table 1.

Table 1. Dictionary entries in NooJ

Lemma                                        PoS  FLX    Semantic properties  ID  TAXID
human                                        N    TABLE  ORGANISM                 9606
Homo sapiens                                 N           ORGANISM                 9606
Breast cancer type 1 susceptibility protein  N           PROTEIN              P
BRCA1                                        N           PROTEIN              P
BRCA1                                        N           PROTEIN              P
BRCA1                                        N           GENE
RNF53                                        N           GENE

To create the dictionaries used in this method, we adapted the verb dictionary from the biomedical resource BioLexicon [17][18]. BioLexicon includes verbs that occur frequently in the biomedical literature and that usually describe a specific event, such as "express", "bind" and "transcribe". We enhanced the BioLexicon dictionary with inflectional ("FLX") and derivational ("DRV") attributes and with semantic properties, as shown in Table 2. For example, ION:TABLE represents the derivational and inflectional paradigms for the nominalization "expression" (which inflects like the word TABLE), and ABOLISH represents the inflectional paradigm for the verb "express". The semantic properties in NooJ dictionaries were used to assign specific event types to the verbs in the literature that describe those events. In Table 2, the verb "stimulate", for example, is assigned a semantic property EventType with the value Positive_regulation. This semantic property is then used in the syntactic grammars, which add an annotation for that type of event whenever it is detected in texts.

Table 2. Definition of verbs in the dictionary

Lemma      PoS  DRV         FLX      EventType
express    V    ION:TABLE   ABOLISH  Gene_expression
ligate     V    TION:TABLE  SMILE    Binding
stimulate  V    TION:TABLE  SMILE    Positive_regulation

The inflectional and derivational paradigms are described in terms of re-write rules. For example, the noun inflectional paradigm TABLE defines that the plural of the dictionary word associated with this rule is formed by adding an "s" to the lemma. Hence, the plural of any word associated with the attribute +FLX=TABLE (e.g. "human") is obtained in the same way. In the case of verbs, inflectional rules describe the conjugation of the verb. For example, the inflectional paradigm SMILE defines re-write rules in terms of person, number and tense for verbs that conjugate like the verb "to smile". Similarly, the derivational system allows the derivation of a word, as defined by the derivational rule. This allows, for example, obtaining nouns and adjectives from verb entries. The derived word maintains the semantic properties of the word from which it is derived (the lemma). Thus, the predicate noun "stimulation" is produced and linked to a positive regulation event through the semantic properties it inherits from the verb "stimulate".

In order to define the type of events linked to each verb, we used the training data in the BioNLP shared task. Based on the manual linguistic annotations, we extracted the sentences corresponding to each event, and assigned the event type to the verbs found in those sentences. We then manually checked this list and selected only those verbs showing a specific link to a type of event. When a verb was linked to more than one event type, only the most frequent event type was kept and the remaining ones were removed.

In NooJ, syntactic grammars can be used to process sequences of tokens to recognize and annotate multiword expressions. In the approach used, our aim was to detect linguistic patterns based on named entities (genes and proteins) and on biologically relevant verbs and verb nominalizations referencing some type of bio-molecular event. These entities, verbs and nouns are automatically annotated by NooJ when the dictionaries and grammars are applied to texts. In order to create the relevant grammars, we first used NooJ to extract general concordances from the texts that included an annotated gene or protein and a verb or nominalization. We then identified, in the examples provided by the concordances, specific grammatical constructions describing different types of events. For example, we were able to identify a simple pattern composed of a nominalization, the particle "of" and a named gene or protein, as in "expression of p53" or "stimulation of CD4".
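The dictionary mechanics described above can be illustrated with a small Python sketch. It is not NooJ rule syntax: the rule tables `DERIVATION_RULES` and `INFLECTION_RULES`, the `VerbEntry` class and the sample observations are hypothetical stand-ins, assumed here only to show how a nominalization could inherit the event type of its verb and how the most frequent event type could be kept for a verb.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical, heavily simplified stand-ins for NooJ paradigms.
# Real NooJ paradigms are re-write rules; these lambdas only mimic
# the examples given in the text.
DERIVATION_RULES = {
    "ION": lambda lemma: lemma + "ion",         # express -> expression
    "TION": lambda lemma: lemma[:-2] + "tion",  # stimulate -> stimulation
}
INFLECTION_RULES = {
    "TABLE": lambda noun: [noun, noun + "s"],   # singular and plural forms
}


@dataclass(frozen=True)
class VerbEntry:
    lemma: str
    drv: str          # "<derivation>:<inflection of the derived noun>"
    event_type: str   # semantic property attached to the lemma


def most_frequent_event_type(observed_types):
    """Keep a single event type per verb: the one seen most often."""
    return Counter(observed_types).most_common(1)[0][0]


def nominal_forms(entry):
    """Derive the nominalization and inflect it; every derived form
    inherits the event type of the verb it comes from."""
    drv_rule, flx_rule = entry.drv.split(":")
    noun = DERIVATION_RULES[drv_rule](entry.lemma)
    return {form: entry.event_type for form in INFLECTION_RULES[flx_rule](noun)}


# Invented observations for one verb, for illustration only.
observed = ["Positive_regulation", "Positive_regulation", "Gene_expression"]
stimulate = VerbEntry("stimulate", "TION:TABLE", most_frequent_event_type(observed))
forms = nominal_forms(stimulate)
print(forms)
```

In this sketch the event type is stored once, on the verb, and propagated to derived forms, which mirrors the inheritance of semantic properties described for "stimulate" and "stimulation".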
These patterns were described in terms of syntactic grammars, as illustrated in Fig. 1. The output of the grammar (shown below the connecting lines) identifies the protein ("CD4"), the expression referencing the event ("stimulation") and the type of event.

Construction and refinement of the syntactic grammars is an iterative process. After creating a baseline grammar to describe a particular construction, we try to incorporate syntactic-semantic variants (paraphrases) in order to achieve better recall without compromising precision. For example, the grammar used to identify the construction "expression of p53" should also be able to identify "expression of gene p53" or "expression of the human gene p53". The training and development data sets of the shared task were used during this iterative process.

The semantic properties included in the dictionary are used in the syntactic grammars to specify the event type in the annotation. Example 1 shows the output of the grammar in Fig. 1: "CD4" is the named entity and "stimulation" is the expression identifying the bio-molecular event. The event type, positive regulation, is obtained directly from the expression's semantic properties.

Example 1. Grammar output used to annotate the expression in texts

Stimulation of human CD4
<EVENT+PROTEIN=CD4+EXP=Stimulation+TYPE=Positive_regulation>
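As a rough illustration of the construction handled by such a grammar, the following Python sketch matches a nominalization, "of", optional paraphrase material (determiner, organism, entity type) and a known entity, and emits an annotation in the style of Example 1. It uses a regular expression rather than a NooJ grammar, and the word lists and the function `annotate` are hypothetical; entity names are assumed to be given, as they were in the shared task.

```python
import re

# Hypothetical mini-resources for illustration only.
NOMINALIZATIONS = {
    "stimulation": "Positive_regulation",
    "expression": "Gene_expression",
}
ENTITIES = {"CD4", "p53"}
ORGANISMS = {"human"}
ENTITY_TYPES = {"gene", "protein"}


def _alternation(words):
    """Build a regex alternation from a set of literal words."""
    return "|".join(re.escape(w) for w in sorted(words))


# <nominalization> of [the] [<organism>] [<entity_type>] <entity>
PATTERN = re.compile(
    rf"\b(?P<exp>{_alternation(NOMINALIZATIONS)})\s+of\s+"
    rf"(?:the\s+)?"
    rf"(?:(?:{_alternation(ORGANISMS)})\s+)?"
    rf"(?:(?:{_alternation(ENTITY_TYPES)})\s+)?"
    rf"(?P<entity>{_alternation(ENTITIES)})\b",
    re.IGNORECASE,
)


def annotate(text):
    """Return one annotation string per matched construction."""
    annotations = []
    for match in PATTERN.finditer(text):
        exp = match.group("exp")
        # The event type comes from the nominalization's semantic property.
        event_type = NOMINALIZATIONS[exp.lower()]
        annotations.append(
            f"<EVENT+PROTEIN={match.group('entity')}+EXP={exp}+TYPE={event_type}>"
        )
    return annotations


print(annotate("Stimulation of human CD4 was observed."))
print(annotate("We measured expression of the human gene p53."))
```

The optional groups play the role of the paraphrase variants: the same pattern covers "expression of p53", "expression of gene p53" and "expression of the human gene p53" without a separate rule for each.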
Fig. 1. Grammar to detect phrases such as "stimulation of CD4"

3 Results

The application of the grammars described in the previous section allowed the extraction of phrases that reference gene-related events. Table 3 shows some examples of the patterns described by these grammars and the corresponding concordances found in texts. Although these are relatively simple patterns, they can model a large portion of the language used to present such events.

Table 3. Patterns detected by the grammars

Pattern                                                  Concordance in text
<entity> [<entity_type>] <nominalization>                HSP gene expression
<nominalization> of [<entity_type>] <entity>             upregulation of Fas
<entity> [<entity_type>] <be> [not] [<adverb>] <verb>    IL-2R stimulation was totally inhibited
<verb> <preposition> <entity>                            binding of TRAF2
<verb> <nominalization> of <entity>                      suppressing activation of STAT6

This section presents the evaluation results of the proposed method, obtained using the test data from the BioNLP shared task on event extraction. This data set was not used for defining the semantic properties to include in the dictionary or for creating the syntactic grammars. The aim of the shared task was to detect gene events in PubMed abstracts and create the corresponding annotations, including the protein(s) involved, the referencing expression or trigger and the type of event. The data for the BioNLP task was derived from the GENIA event corpus and comprised 800 abstracts in the training set, 150 in the development set, and 260 in the test set. Details on the annotation procedure and evaluation metrics are described in [13].

The BioNLP shared task divided events into nine types. The regulatory events were not included in this study due to time constraints and to the more complex structure of those events. Results for the remaining six event types are displayed in Table 4.
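For readers unfamiliar with the reported measures, the sketch below shows how recall, precision and F-score relate to each other. It assumes exact matching of hypothetical (protein, trigger, event type) tuples, which is a simplification and not the official shared task evaluation described in [13]; the function name and the sample events are invented for illustration.

```python
def recall_precision_f(gold, predicted):
    """Compute recall, precision and F-score (harmonic mean) for two
    sets of events, counting an event as correct only on exact match."""
    true_positives = len(gold & predicted)
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score


# Invented toy data: events as (protein, trigger, event type) tuples.
gold_events = {
    ("CD4", "stimulation", "Positive_regulation"),
    ("p53", "expression", "Gene_expression"),
    ("TRAF2", "binding", "Binding"),
    ("STAT6", "activation", "Positive_regulation"),
}
predicted_events = {
    ("CD4", "stimulation", "Positive_regulation"),
    ("p53", "expression", "Gene_expression"),
    ("Fas", "upregulation", "Gene_expression"),
}

recall, precision, f_score = recall_precision_f(gold_events, predicted_events)
print(f"recall={recall:.3f} precision={precision:.3f} F={f_score:.3f}")
```

Because the F-score is a harmonic mean, it stays low whenever either recall or precision is low, which is why a single event type with poor recall can pull an average down noticeably.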
These results were achieved using six grammars similar to the one exemplified in Fig. 1. An average F-score of 47.1 was obtained. Except for binding events, the results are promising and show that good performance can be obtained using this simple approach. In the case of binding events, the participation of two proteins creates extra difficulty in describing such events, and the results are still poor.

Table 4. Performance of the event detection method (test data), reporting recall, precision and F-score for each event type (Localization, Binding, Gene Expression, Transcription, Protein Catabolism, Phosphorylation) and their average

4 Discussion

We have described an approach which uses syntactic grammars to detect and annotate gene events from the scientific literature. The proposed method takes advantage of the inflectional and derivational morphology and the semantic properties established in dictionaries and grammars developed with NooJ, which allow terminological verbs and their derivations to be associated with specific event types. This approach provides a general and flexible solution for information extraction from biomedical texts. The results illustrated in Table 4 indicate that this approach can be used to process the literature and extract networks of events and interactions. These networks are valuable for literature search and navigation, as proposed in the MEDIE or Chilibot tools, but require much less processing.

However, some shortcomings need to be considered and addressed. The first limitation is related to named entity recognition. In the BioNLP shared task, participants were supplied with the names and positions in text of the mentioned genes and proteins. In such a setup, recognizing linguistic patterns where these entities occur is significantly simplified. In a more realistic task, the processing pipeline would not have the list of mentioned entities as an input, and a named entity recognizer with very good performance would need to be included in the processing steps. Another limitation concerns the identification of patterns and creation of grammars.
Although a manual procedure such as the one taken can identify the most salient linguistic patterns, it would be interesting to investigate the possibility of generating and assessing new patterns automatically. In this study, we have not included the gene regulatory events because these are frequently referenced by more complex constructions which are not yet covered by our grammars. Describing and extracting these events is of great importance and will be a future direction of our work. Finally, it is important to assess the advantages and disadvantages of the proposed approach for identifying relations and events, when compared to other methods based on shallow or deep parsing.

Methods such as the one proposed in this paper can be used to help database curators identify the most relevant facts in the literature and speed up the annotation process. Tools based on these methods can also provide alternative querying and browsing of facts cited in the literature and be useful for researchers. However, before these methods can be truly useful, they must be included in user-oriented tools that offer robust and reliable performance while hiding the complexity of the linguistic processing. It is also of major importance that these tools keep links to the reference databases so that users can navigate from the literature to these resources and back, in a simple and fluid way.

References

1. Rebholz-Schuhmann, D., Kirsch, H., Couto, F.: Facts from text: is text mining ready to deliver? PLoS Biol. 3, e65 (2005)
2. Altman, R.B., Bergman, C.M., Blake, J., Blaschke, C., Cohen, A., Gannon, F., Grivell, L., Hahn, U., Hersh, W., Hirschman, L., Jensen, L.J., Krallinger, M., Mons, B., O'Donoghue, S.I., Peitsch, M.C., Rebholz-Schuhmann, D., Shatkay, H., Valencia, A.: Text mining for biology - the way forward: opinions from leading scientists. Genome Biology 9(suppl. 2), S7 (2008)
3. Jensen, L.J., Saric, J., Bork, P.: Literature mining for the biologist: from information retrieval to biological discovery. Nat. Rev. Genet. 7, (2006)
4. Shatkay, H.: Hairpins in bookstacks: Information retrieval from biomedical text. Briefings in Bioinformatics 6(3), (2005)
5. Weeber, M., Kors, J.A., Mons, B.: Online tools to support literature-based discovery in the life sciences. Briefings in Bioinformatics 6(3), (2005)
6. PubMed
7. Hoffmann, R., Valencia, A.: iHOP - A Gene Network for Navigating the Literature. Nature Genetics 36, 664 (2004)
8. Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J., Leser, U.: Ali Baba: PubMed as a graph. Bioinformatics 22(19), (2006)
9. Rebholz-Schuhmann, D., Kirsch, H., Arregui, M., Gaudan, S., Riethoven, M., Stoehr, P.: EBIMed - text crunching to gather facts for proteins from Medline. Bioinformatics 23(2), (2007)
10. Tsuruoka, Y., Tsujii, J., Ananiadou, S.: FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics 24(21), (2008)
11. Chen, H., Sharp, B.M.: Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics 5, 147 (2008)
12. Miyao, Y., Ohta, T., Masuda, K., Tsuruoka, Y., Yoshida, K., Ninomiya, T., Tsujii, J.: Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases. In: Proceedings of COLING-ACL 2006, Sydney, pp (2006)
13. Kim, J.-D., Ohta, T., Pyysalo, S., Kano, Y., Tsujii, J.: Overview of BioNLP 2009 Shared Task on Event Extraction. In: Proceedings of Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop (2009)
14. BioCreAtIvE - Critical Assessment of Information Extraction Systems in Biology
15. NooJ
16. Barreiro, A.M.: Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. PhD dissertation. Faculdade de Letras da Universidade do Porto, Porto (2008)
17. Sasaki, Y., Montemagni, S., Pezik, P., Rebholz-Schuhmann, D., McNaught, J., Ananiadou, S.: BioLexicon: A Lexical Resource for the Biology Domain. In: Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (2008)
18. BOOTStrep Bio-Lexicon
Paradigm Changes Affecting the Practice of Scientific Communication in the Life Sciences Prof. Dr. Martin Hofmann-Apitius Head of the Department of Bioinformatics Fraunhofer Institute for Algorithms and
More informationSpecial Topics in Computer Science
Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS
More informationDiscover more, discover faster. High performance, flexible NLP-based text mining for life sciences
Discover more, discover faster. High performance, flexible NLP-based text mining for life sciences It s not information overload, it s filter failure. Clay Shirky Life Sciences organizations face the challenge
More informationLABERINTO at ImageCLEF 2011 Medical Image Retrieval Task
LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task Jacinto Mata, Mariano Crespo, Manuel J. Maña Dpto. de Tecnologías de la Información. Universidad de Huelva Ctra. Huelva - Palos de la Frontera s/n.
More informationTerminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier
More informationCuration of NLP Pipeline - A Review
ASSISTED CURATION: DOES TEXT MINING REALLY HELP? BEATRICE ALEX, CLAIRE GROVER, BARRY HADDOW, MIJAIL KABADJOV, EWAN KLEIN, MICHAEL MATTHEWS, STUART ROEBUCK, RICHARD TOBIN, AND XINGLONG WANG School of Informatics
More informationApproaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval
Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information
More informationBIOMEDICAL LITERATURE MINING FOR PHARMACOKINETICS NUMERICAL PARAMETER COLLECTION. Zhiping Wang
BIOMEDICAL LITERATURE MINING FOR PHARMACOKINETICS NUMERICAL PARAMETER COLLECTION Zhiping Wang Submitted to the faculty of the University Graduate School in partial fulfillment of the requirements for the
More informationIMPLEMENTING BIG DATA IN TODAY S HEALTH CARE PRAXIS: A CONUNDRUM TO PATIENTS, CAREGIVERS AND OTHER STAKEHOLDERS - WHAT IS THE VALUE AND WHO PAYS
IMPLEMENTING BIG DATA IN TODAY S HEALTH CARE PRAXIS: A CONUNDRUM TO PATIENTS, CAREGIVERS AND OTHER STAKEHOLDERS - WHAT IS THE VALUE AND WHO PAYS 29 OCTOBER 2015 DR. DIRK J. EVERS BACKGROUND TreatmentMAP
More informationPontifícia Universidade Católica do Rio Grande do Sul Faculdade de Informática. Building Domain Specific Corpora in Portuguese Language
Pontifícia Universidade Católica do Rio Grande do Sul Faculdade de Informática Programa de Pós-Graduação em Ciência da Computação Building Domain Specific Corpora in Portuguese Language Lucelene Lopes,
More informationOverview of MT techniques. Malek Boualem (FT)
Overview of MT techniques Malek Boualem (FT) This section presents an standard overview of general aspects related to machine translation with a description of different techniques: bilingual, transfer,
More informationHow To Understand And Understand A Negative In Bbg
Some Aspects of Negation Processing in Electronic Health Records Svetla Boytcheva 1, Albena Strupchanska 2, Elena Paskaleva 2 and Dimitar Tcharaktchiev 3 1 Department of Information Technologies, Faculty
More informationCreating Metabolic Network Models using Text Mining and Expert Knowledge
Creating Metabolic Network Models using Text Mining and Expert Knowledge J.A. Dickerson 1, D. Berleant 1, Z. Cox 1, W. Qi 1, and E. Wurtele 2 Iowa State University, Ames, IA, 50011 Abstract: This paper
More informationGenerating SQL Queries Using Natural Language Syntactic Dependencies and Metadata
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive
More informationTMUNSW: Identification of disorders and normalization to SNOMED-CT terminology in unstructured clinical notes
TMUNSW: Identification of disorders and normalization to SNOMED-CT terminology in unstructured clinical notes Jitendra Jonnagaddala a,b,c Siaw-Teng Liaw *,a Pradeep Ray b Manish Kumar c School of Public
More informationRETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
More informationThe SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge White Paper October 2002 I. Translation and Localization New Challenges Businesses are beginning to encounter
More informationNatural Language Processing and Systems Biology
Natural Language Processing and Systems Biology K. Bretonnel Cohen and Lawrence Hunter Center for Computational Pharmacology, University of Colorado School of Medicine, Denver, USA. E-mail: {kevin.cohen,
More informationBBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS
BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS 1. The Technology Strategy sets out six areas where technological developments are required to push the frontiers of knowledge
More informationExposé for Diploma Thesis. Joint Extraction of Proteins and Bio-Molecular Events using Imperatively Defined Factor Graphs
Exposé for Diploma Thesis Joint Extraction of Proteins and Bio-Molecular Events using Imperatively Defined Factor Graphs Tim Rocktäschel Humboldt-Universität zu Berlin
More informationEnglish Grammar Checker
International l Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 English Grammar Checker Pratik Ghosalkar 1*, Sarvesh Malagi 2, Vatsal Nagda 3,
More informationIdentifying and extracting malignancy types in cancer literature
Identifying and extracting malignancy types in cancer literature Yang Jin 1, Ryan T. McDonald 2, Kevin Lerman 2, Mark A. Mandel 4, Mark Y. Liberman 2, 4, Fernando Pereira 2, R. Scott Winters 3 1, 3,, Peter
More informationChapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
More informationThe INFUSIS Project Data and Text Mining for In Silico Modeling
The INFUSIS Project Data and Text Mining for In Silico Modeling Henrik Boström 1,2, Ulf Norinder 3, Ulf Johansson 4, Cecilia Sönströd 4, Tuve Löfström 4, Elzbieta Dura 5, Ola Engkvist 6, Sorel Muresan
More informationCINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test
CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed
More informationLINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*
LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* Jonathan Yamron, James Baker, Paul Bamberg, Haakon Chevalier, Taiko Dietzel, John Elder, Frank Kampmann, Mark Mandel, Linda Manganaro, Todd Margolis,
More informationPerCuro-A Semantic Approach to Drug Discovery. Final Project Report submitted by Meenakshi Nagarajan Karthik Gomadam Hongyu Yang
PerCuro-A Semantic Approach to Drug Discovery Final Project Report submitted by Meenakshi Nagarajan Karthik Gomadam Hongyu Yang Towards the fulfillment of the course Semantic Web CSCI 8350 Fall 2003 Under
More informationNatural Language Database Interface for the Community Based Monitoring System *
Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University
More informationTOOL OF THE INTELLIGENCE ECONOMIC: RECOGNITION FUNCTION OF REVIEWS CRITICS. Extraction and linguistic analysis of sentiments
TOOL OF THE INTELLIGENCE ECONOMIC: RECOGNITION FUNCTION OF REVIEWS CRITICS. Extraction and linguistic analysis of sentiments Grzegorz Dziczkowski, Katarzyna Wegrzyn-Wolska Ecole Superieur d Ingenieurs
More informationUsing the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova
Using the Grid for the interactive workflow management in biomedicine Andrea Schenone BIOLAB DIST University of Genova overview background requirements solution case study results background A multilevel
More informationWord Completion and Prediction in Hebrew
Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology
More informationElection of Diagnosis Codes: Words as Responsible Citizens
Election of Diagnosis Codes: Words as Responsible Citizens Aron Henriksson and Martin Hassel Department of Computer & System Sciences (DSV), Stockholm University Forum 100, 164 40 Kista, Sweden {aronhen,xmartin}@dsv.su.se
More informationBy Jonathan Clark, Loosdrecht, The Netherlands, (c) Publishing Research Consortium 2012
By Jonathan Clark, Loosdrecht, The Netherlands, (c) Publishing Research Consortium 2012 The Publishing Research Consortium (PRC) is a group representing publishers and societies supporting global research
More informationExtraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate
More informationWikipedia and Web document based Query Translation and Expansion for Cross-language IR
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University
More informationData Integration and Decision-Making For Biomarkers Discovery, Validation and Evaluation. D. POLVERARI, CTO October 06-07 2008
Data Integration and Decision-Making For Biomarkers Discovery, Validation and Evaluation D. POLVERARI, CTO October 06-07 2008 Data integration definition and aims Definition : Data integration consists
More informationInternational Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,
More informationCustomizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
More informationModule Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg
Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg March 1, 2007 The catalogue is organized into sections of (1) obligatory modules ( Basismodule ) that
More informationHybrid Strategies. for better products and shorter time-to-market
Hybrid Strategies for better products and shorter time-to-market Background Manufacturer of language technology software & services Spin-off of the research center of Germany/Heidelberg Founded in 1999,
More informationProtein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
More informationSemantic MEDLINE: An advanced information management application for biomedicine
Information Services & Use 31 (2011) 15 21 15 DOI 10.3233/ISU-2011-0627 IOS Press Semantic MEDLINE: An advanced information management application for biomedicine Thomas C. Rindflesch, Halil Kilicoglu,
More informationNatural Language Processing in the EHR Lifecycle
Insight Driven Health Natural Language Processing in the EHR Lifecycle Cecil O. Lynch, MD, MS cecil.o.lynch@accenture.com Health & Public Service Outline Medical Data Landscape Value Proposition of NLP
More informationUsing Ontologies in Proteus for Modeling Data Mining Analysis of Proteomics Experiments
Using Ontologies in Proteus for Modeling Data Mining Analysis of Proteomics Experiments Mario Cannataro, Pietro Hiram Guzzi, Tommaso Mazza, and Pierangelo Veltri University Magna Græcia of Catanzaro, 88100
More informationorg.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.
org.rn.eg.db December 16, 2015 org.rn.egaccnum Map Entrez Gene identifiers to GenBank Accession Numbers org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank
More informationBeyond Health 2.0: the semantic web and intelligent systems
Beyond Health 2.0: the semantic web and intelligent systems Erik van Mulligen PhD Marc Weeber PhD Ravi Kalaputapu PhD Erasmus University Medical Center, Rotterdam, The Netherlands Knewco Inc, New York,
More informationKnowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization
Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Atika Mustafa, Ali Akbar, and Ahmer Sultan National University of Computer and Emerging
More informationDoctor of Philosophy in Computer Science
Doctor of Philosophy in Computer Science Background/Rationale The program aims to develop computer scientists who are armed with methods, tools and techniques from both theoretical and systems aspects
More informationParsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh mroth@inf.ed.ac.uk Ewan Klein University of Edinburgh ewan@inf.ed.ac.uk Abstract Software
More informationA Survey on Product Aspect Ranking
A Survey on Product Aspect Ranking Charushila Patil 1, Prof. P. M. Chawan 2, Priyamvada Chauhan 3, Sonali Wankhede 4 M. Tech Student, Department of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra,
More informationInformatics and Knowledge Management at the Novartis Institutes for BioMedical Research (NIBR)
Informatics and Knowledge Management at the Novartis Institutes for BioMedical Research (NIBR) Enable Science in silico & Provide the Right Knowledge to the Right People at the Right Time to enable the
More informationFrom Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files
Journal of Universal Computer Science, vol. 21, no. 4 (2015), 604-635 submitted: 22/11/12, accepted: 26/3/15, appeared: 1/4/15 J.UCS From Terminology Extraction to Terminology Validation: An Approach Adapted
More informationComputational Drug Repositioning by Ranking and Integrating Multiple Data Sources
Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources Ping Zhang IBM T. J. Watson Research Center Pankaj Agarwal GlaxoSmithKline Zoran Obradovic Temple University Terms and
More informationResolving Common Analytical Tasks in Text Databases
Resolving Common Analytical Tasks in Text Databases The work is funded by the Federal Ministry of Economic Affairs and Energy (BMWi) under grant agreement 01MD15010B. Database Systems and Text-based Information
More information11-792 Software Engineering EMR Project Report
11-792 Software Engineering EMR Project Report Team Members Phani Gadde Anika Gupta Ting-Hao (Kenneth) Huang Chetan Thayur Suyoun Kim Vision Our aim is to build an intelligent system which is capable of
More informationAn Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them
An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them Vangelis Karkaletsis and Constantine D. Spyropoulos NCSR Demokritos, Institute of Informatics & Telecommunications,
More information