Syntactic Parsing for Bio-molecular Event Detection from Scientific Literature




Sérgio Matos 1, Anabela Barreiro 2, and José Luis Oliveira 1

1 IEETA, Universidade de Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
2 Faculdade de Letras, Universidade do Porto, Via Panorâmica, 4150-564 Porto, Portugal
{aleixomatos,jlo}@ua.pt, barreiro_anabela@hotmail.com

Abstract. Rapid advances in science and in laboratory and computing methods are generating vast amounts of data and scientific literature. To keep up to date with the expanding knowledge in their field of study, researchers face an increasing need for tools that help manage this information. In the genomics field, various databases have been created to store information in a formalized and easily accessible form. However, human curators cannot update these databases at the rate at which new studies are published. Advanced and robust text mining tools that automatically extract newly published information from scientific articles are required. This paper presents a methodology, based on syntactic parsing, for the identification of gene events in the scientific literature. Evaluation of the proposed approach, based on the BioNLP shared task on event extraction, produced an average F-score of 47.1 for six event types.

Keywords: Biomedical literature, information extraction, bio-molecular events, syntactic parsing, semantic properties.

1 Introduction

Recent advances in biotechnology, namely the widespread use of high-throughput methods for gene analysis, have generated vast amounts of published scientific literature. While much of the data and results described in these studies are being annotated in the various existing biomedical databases, these databases are not easily kept up to date. As a result, many relevant research outcomes are still enclosed as free text in the scientific literature, which remains the major source of information for researchers [1].
It is therefore increasingly difficult for researchers to keep track of the quickly expanding biomedical knowledge to support their experiment planning and analysis of results [2][3]. Researchers are currently faced with issues such as (i) how to identify the most relevant articles for their specific study, (ii) how to identify the mentioned concepts (genes, proteins, diseases and so on) and the relations between them, and (iii) how to integrate the extracted information with the existing knowledge in a simple, efficient, and user-friendly manner [2][4]. This integrated view of information extracted from the literature, in the framework of more systematized and formalized knowledge annotated in databases and ontologies, is an important requisite for biological data analysis [3].

L. Seabra Lopes et al. (Eds.): EPIA 2009, LNAI 5816, pp. 79-85, 2009. © Springer-Verlag Berlin Heidelberg 2009

To address these issues, several tools have been developed in recent years that combine Information Extraction (IE), Text Mining (TM) and Natural Language Processing (NLP) techniques with the domain knowledge available in resources such as Entrez Gene, UniProt, GO or UMLS [1][2][4][5]. Such tools process titles and abstracts from the MEDLINE/PubMed [6] literature database and present the extracted information in different forms.

The iHOP tool [7] identifies genes and proteins in PubMed abstracts and uses them as links, allowing navigation through sentences and abstracts. The AliBaba system [8] uses pattern matching and co-occurrence statistics to find associations between biological entities such as genes, proteins or diseases, and presents the search results in the form of a graph. EBIMed [9] also finds associations between protein/gene names, GO annotations, drugs and species in PubMed abstracts resulting from a user query. The results are displayed in a table with links to the sentences and abstracts that support the corresponding associations. A similar tool, FACTA [10], retrieves abstracts from PubMed and identifies biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) co-occurring with the user query term. The concepts are presented to the user in a tabular format and ranked based on co-occurrence statistics or on pointwise mutual information.

More recently, there has been some focus on applying more detailed linguistic processing in order to improve information retrieval and extraction. Chilibot [11] retrieves sentences from PubMed abstracts related to a pair or a list of proteins, genes, or keywords, and applies shallow parsing to classify these sentences as interactive, non-interactive or simple abstract co-occurrence. The identified relationships between entities or keywords are then displayed as a graph.
MEDIE [12] uses a deep parser and a term recognizer to index abstracts based on pre-computed semantic annotations, allowing real-time retrieval of sentences containing biological concepts associated with the terms specified in the user query. Interest in the application of more advanced methods of linguistic processing is also evident in recent information extraction evaluation challenges, namely the BioNLP shared task on event extraction [13] and the BioCreAtIvE II.5 challenge [14], which investigate the extraction of gene events from the literature.

In this paper, we describe a methodology based on syntactic parsing to detect and annotate bio-molecular events, such as protein production and breakdown, localization or binding events. We present results from our participation in the BioNLP shared task and discuss the main difficulties and the further developments required in this area.

2 Methods

The method described in this paper to identify bio-molecular events is based on syntactic grammars that process texts and detect the occurrence of linguistic patterns that describe such events. Syntactic parsing was implemented using NooJ [15], a freely available development environment and linguistic processing engine that includes tools for inflectional and derivational morphology, syntactic grammars and semantics. NooJ uses dictionaries and grammars to produce formalized descriptions of natural language and contains a system of inflectional and derivational paradigms, which interacts with the dictionary. Inflectional rules apply to a dictionary entry (lemma) to recognize and generate inflected forms, including gender, number and tense. Derivational rules apply to a dictionary entry to recognize and generate derived forms, such as nominalizations (predicate nouns morphosyntactically related to a verb), as adopted in [16]. Lemmas can also carry semantic information. Semantic properties make it possible, for example, to mark an entry as a particular type of named entity, such as ORGANISM, PROTEIN or DISEASE. These properties are illustrated in Table 1.

Table 1. Dictionary entries in NooJ

Lemma                                        PoS  FLX    Semantic properties  ID      TAXID
human                                        N    TABLE  ORGANISM                     9606
Homo sapiens                                 N           ORGANISM                     9606
Breast cancer type 1 susceptibility protein  N           PROTEIN              P38398  9606
BRCA1                                        N           PROTEIN              P38398  9606
BRCA1                                        N           PROTEIN              P48754  10090
BRCA1                                        N           GENE                 672     9606
RNF53                                        N           GENE                 672     9606

To create the dictionaries used in this method, we adapted the verb dictionary from the biomedical resource BioLexicon [17][18]. BioLexicon includes verbs that occur frequently in the biomedical literature and that usually describe a specific event, such as "express", "bind" and "transcribe". We enhanced the BioLexicon dictionary with inflectional ("FLX") and derivational ("DRV") attributes and with semantic properties, as shown in Table 2. For example, ION:TABLE represents the derivational and inflectional paradigms for the nominalization "expression" (which inflects like the word "table"), and ABOLISH represents the inflectional paradigm for the verb "express". The semantic properties in NooJ dictionaries were used to assign specific event types to the verbs in the literature that describe those events. In Table 2, the verb "stimulate", for example, is assigned a semantic property EventType with the value Positive_regulation. This semantic property is then used in the syntactic grammars, which add an annotation for that type of event whenever it is detected in texts.

Table 2. Definition of verbs in the dictionary

Lemma      PoS  DRV         FLX      EventType
express    V    ION:TABLE   ABOLISH  Gene_expression
ligate     V    TION:TABLE  SMILE    Binding
stimulate  V    TION:TABLE  SMILE    Positive_regulation

The inflectional and derivational paradigms are described in terms of re-write rules. For example, the noun inflectional paradigm TABLE defines that the plural of a dictionary word associated with this rule is formed by adding an "s" to the lemma. Hence, the plural of any word associated with the attribute +FLX=TABLE (e.g. "human") is obtained in the same way. In the case of verbs, inflectional rules describe the conjugation of the verb. For example, the inflectional paradigm SMILE defines re-write rules in terms of person, number and tense for verbs that conjugate like the verb "to smile". Similarly, the derivational system allows the derivation of a word, as defined by the derivational rule. This makes it possible, for example, to obtain nouns and adjectives from verb entries. The derived word maintains the semantic properties of the word from which it is derived (the lemma). Thus, the predicate noun "stimulation" is produced and linked to a positive regulation event through the semantic properties it inherits from the verb "stimulate".

To define the type of events linked to each verb, we used the training data of the BioNLP shared task. Based on the manual linguistic annotations, we extracted the sentences corresponding to each event and assigned the event type to the verbs found in those sentences. We then manually checked this list and selected only those verbs showing a specific link to a type of event. Where a verb was linked to more than one event type, only the most frequent event type was kept and the remaining ones were removed.

In NooJ, syntactic grammars can be used to process sequences of tokens in order to recognize and annotate multiword expressions. In the approach used, our aim was to detect linguistic patterns based on named entities (genes and proteins) and on biologically relevant verbs and verb nominalizations referencing some type of bio-molecular event. These entities, verbs and nouns are automatically annotated by NooJ when the dictionaries and grammars are applied to texts. To create the relevant grammars, we first used NooJ to extract general concordances from the texts that included an annotated gene or protein and a verb or nominalization. We then identified, in the examples provided by the concordances, specific grammatical constructions describing different types of events. For example, we were able to identify a simple pattern composed of a nominalization, the preposition "of" and a named gene or protein, as in "expression of p53" or "stimulation of CD4".
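The rule of keeping only the most frequent event type per verb can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_event_types` and the toy observations are hypothetical, and the manual checking step described above is not modelled.

```python
from collections import Counter, defaultdict


def select_event_types(observations):
    """Map each verb lemma to its most frequent event type.

    `observations` is an iterable of (verb_lemma, event_type) pairs, one per
    annotated training sentence in which the verb was found.
    """
    counts = defaultdict(Counter)
    for lemma, event_type in observations:
        counts[lemma][event_type] += 1
    # Keep only the most frequent type per verb; the other types are dropped.
    # On a tie, Counter.most_common returns the type that was seen first.
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


# Toy observations, invented for illustration (not taken from the corpus).
observations = [
    ("express", "Gene_expression"),
    ("express", "Gene_expression"),
    ("express", "Transcription"),
    ("stimulate", "Positive_regulation"),
    ("ligate", "Binding"),
]
verb_event_type = select_event_types(observations)
print(verb_event_type)
```

In this toy input, "express" is seen with two event types and only the more frequent one is retained, mirroring the selection rule in the text.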
These patterns were described in terms of syntactic grammars, as illustrated in Fig. 1. The output of the grammar (shown below the connecting lines) identifies the protein ("CD4"), the expression referencing the event ("stimulation") and the type of event.

Construction and refinement of the syntactic grammars is an iterative process. After creating a baseline grammar to describe a particular construction, we try to incorporate syntactic-semantic variants (paraphrases) in order to achieve better recall without compromising precision. For example, the grammar used to identify the construction "expression of p53" should also be able to identify "expression of gene p53" or "expression of the human gene p53". The training and development data sets of the shared task were used during this iterative process.

The semantic properties included in the dictionary are used in the syntactic grammars to specify the event type in the annotation. Example 1 shows the output of the grammar in Fig. 1: "CD4" is the named entity and "stimulation" is the expression identifying the bio-molecular event. The event type, positive regulation, is obtained directly from the expression's semantic properties.

Example 1. Grammar output used to annotate the expression in texts

Stimulation of human CD4
<EVENT+PROTEIN=CD4+EXP=Stimulation+TYPE=Positive_regulation>
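The nominalization pattern and its annotation output can be approximated outside NooJ with a regular expression. The sketch below is only a stand-in for the graph-based grammar of Fig. 1 and does not use any NooJ API. The mini-lexicon, the simplified derivation rules and the function names `nominalize` and `annotate` are assumptions made for illustration, and the list of entity names is assumed to be supplied, as in the shared task.

```python
import re

# Hypothetical mini-lexicon: verb lemma -> (derivational rule, event type).
VERBS = {
    "express": ("ION", "Gene_expression"),
    "ligate": ("TION", "Binding"),
    "stimulate": ("TION", "Positive_regulation"),
}


def nominalize(lemma, rule):
    """Toy derivation rules, much simpler than real NooJ paradigms."""
    if rule == "ION":    # express -> expression
        return lemma + "ion"
    if rule == "TION":   # stimulate -> stimulation (drop the final "e")
        return lemma[:-1] + "ion"
    raise ValueError(f"unknown derivational rule: {rule}")


# Derived nouns inherit the event type of the verb they come from.
NOMINALIZATIONS = {nominalize(v, rule): etype for v, (rule, etype) in VERBS.items()}


def annotate(text, entities):
    """Annotate '<nominalization> of [the] [human] [gene|protein] <entity>'."""
    if not entities:
        return []
    nouns = "|".join(map(re.escape, NOMINALIZATIONS))
    # Longest names first so that a longer entity wins over its prefix.
    names = "|".join(map(re.escape, sorted(entities, key=len, reverse=True)))
    pattern = re.compile(
        rf"\b(?P<exp>(?i:{nouns}))\s+of\s+(?:the\s+)?(?:human\s+)?"
        rf"(?:(?:gene|protein)\s+)?(?P<ent>{names})\b"
    )
    return [
        f"<EVENT+PROTEIN={m.group('ent')}+EXP={m.group('exp')}"
        f"+TYPE={NOMINALIZATIONS[m.group('exp').lower()]}>"
        for m in pattern.finditer(text)
    ]


print(annotate("Stimulation of human CD4 was observed.", ["CD4"]))
print(annotate("We measured expression of the human gene p53.", ["p53"]))
```

The optional groups in the pattern play the role of the paraphrase variants discussed above: one expression covers "expression of p53", "expression of gene p53" and "expression of the human gene p53".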

Fig. 1. Grammar to detect phrases such as "stimulation of CD4"

3 Results

The application of the grammars described in the previous section allowed the extraction of phrases that reference gene-related events. Table 3 shows some examples of the patterns described by these grammars and the corresponding concordances found in texts. Although these are relatively simple patterns, they can model a large portion of the language used to present such events.

Table 3. Patterns detected by the grammars

Pattern                                                  Concordance in text
<entity> [<entity_type>] <nominalization>                HSP gene expression
<nominalization> of [<entity_type>] <entity>             upregulation of Fas
<entity> [<entity_type>] <be> [not] [<adverb>] <verb>    IL-2R stimulation was totally inhibited
<verb> <preposition> <entity>                            binding of TRAF2
<verb> <nominalization> of <entity>                      suppressing activation of STAT6

This section presents the evaluation results of the proposed method, obtained using the test data from the BioNLP shared task on event extraction. This data set was not used for defining the semantic properties to include in the dictionary or for creating the syntactic grammars. The aim of the shared task was to detect gene events in PubMed abstracts and create the corresponding annotations, including the protein(s) involved, the referencing expression or trigger, and the type of event. The data for the BioNLP task was derived from the GENIA event corpus and comprised 800 abstracts in the training set, 150 in the development set, and 260 in the test set. Details on the annotation procedure and evaluation metrics are described in [13]. The BioNLP shared task divided events into nine types. The three regulatory event types were not included in this study due to time constraints and to the more complex structure of those events. Results for the remaining six event types are displayed in Table 4.
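The F-scores in Table 4 are consistent with the usual harmonic mean of precision and recall. As a quick check, the sketch below recomputes them from the recall and precision values reported in that table; the function name `f_score` and the dictionary layout are illustrative choices, not part of the shared task tooling.

```python
def f_score(recall, precision):
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-score) in percent, copied from Table 4.
TABLE_4 = {
    "Localization": (35.63, 70.45, 47.33),
    "Binding": (13.54, 34.06, 19.38),
    "Gene Expression": (46.40, 78.45, 58.31),
    "Transcription": (33.58, 41.07, 36.95),
    "Protein Catabolism": (35.71, 62.50, 45.45),
    "Phosphorylation": (49.63, 79.76, 61.19),
    "Average": (36.76, 65.58, 47.11),
}

for event_type, (recall, precision, reported) in TABLE_4.items():
    computed = f_score(recall, precision)
    print(f"{event_type:20s} computed={computed:.2f} reported={reported:.2f}")
```

Each recomputed value agrees with the reported one to two decimal places. The F-score in the "Average" row also matches the harmonic mean of that row's own recall and precision, rather than the arithmetic mean of the six per-type F-scores.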
These results were achieved using six grammars similar to the one exemplified in Fig. 1. An average F-score of 47.11 was obtained. Except for binding events, the results are promising and show that good performance can be obtained using this simple approach. In the case of binding events, the participation of two proteins creates extra difficulty in describing such events, and the results are still poor.

Table 4. Performance of the event detection method (test data)

Event type          Recall  Precision  F-score
Localization        35.63   70.45      47.33
Binding             13.54   34.06      19.38
Gene Expression     46.40   78.45      58.31
Transcription       33.58   41.07      36.95
Protein Catabolism  35.71   62.50      45.45
Phosphorylation     49.63   79.76      61.19
Average             36.76   65.58      47.11

4 Discussion

We have described an approach that uses syntactic grammars to detect and annotate gene events in the scientific literature. The proposed method takes advantage of the inflectional and derivational morphology and the semantic properties established in dictionaries and grammars developed with NooJ, which make it possible to associate terminological verbs and their derivations with specific event types. This approach provides a general and flexible solution for information extraction from biomedical texts.

The results shown in Table 4 indicate that this approach can be used to process the literature and extract networks of events and interactions. These networks are valuable for literature search and navigation, as proposed in tools such as MEDIE or Chilibot, but require much less processing.

However, some shortcomings need to be considered and addressed. The first limitation is related to named entity recognition. In the BioNLP shared task, participants were supplied with the names and positions in the text of the mentioned genes and proteins. In such a setup, recognizing linguistic patterns where these entities occur is significantly simplified. In a more realistic task, the processing pipeline would not have the list of mentioned entities as an input, and a named entity recognizer with very good performance would need to be included in the processing steps. Another limitation concerns the identification of patterns and creation of grammars.
Although a manual procedure such as the one taken can identify the most salient linguistic patterns, it would be interesting to investigate the possibility of generating and assessing new patterns automatically. In this study, we have not included the gene regulatory events because these are frequently referenced by more complex constructions which are not yet covered by our grammars. Describing and extracting these events is of great importance and will be a future direction of our work. Finally, it is important to assess the advantages and disadvantages of the proposed approach for identifying relations and events, when compared to other methods based on shallow or deep parsing.

Methods such as the one proposed in this paper can be used to help database curators identify the most relevant facts in the literature and speed up the annotation process. Tools based on these methods can also provide alternative querying and browsing of facts cited in the literature and be useful for researchers. However, before these methods can be truly useful, they must be included in user-oriented tools that offer robust and reliable performance while hiding the complexity of the linguistic processing. It is also of major importance that these tools keep links to the reference databases so that users can navigate from the literature to these resources and back, in a simple and fluid way.

References

1. Rebholz-Schuhmann, D., Kirsch, H., Couto, F.: Facts from text: is text mining ready to deliver? PLoS Biol. 3, e65 (2005)
2. Altman, R.B., Bergman, C.M., Blake, J., Blaschke, C., Cohen, A., Gannon, F., Grivell, L., Hahn, U., Hersh, W., Hirschman, L., Jensen, L.J., Krallinger, M., Mons, B., O'Donoghue, S.I., Peitsch, M.C., Rebholz-Schuhmann, D., Shatkay, H., Valencia, A.: Text mining for biology - the way forward: opinions from leading scientists. Genome Biology 9(suppl. 2), S7 (2008)
3. Jensen, L.J., Saric, J., Bork, P.: Literature mining for the biologist: from information retrieval to biological discovery. Nat. Rev. Genet. 7, 119-129 (2006)
4. Shatkay, H.: Hairpins in bookstacks: Information retrieval from biomedical text. Briefings in Bioinformatics 6(3), 222-238 (2005)
5. Weeber, M., Kors, J.A., Mons, B.: Online tools to support literature-based discovery in the life sciences. Briefings in Bioinformatics 6(3), 277-286 (2005)
6. PubMed, http://www.ncbi.nlm.nih.gov/pubmed/
7. Hoffmann, R., Valencia, A.: iHOP - A Gene Network for Navigating the Literature. Nature Genetics 36, 664 (2004)
8. Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J., Leser, U.: Ali Baba: PubMed as a graph. Bioinformatics 22(19), 2444-2445 (2006)
9. Rebholz-Schuhmann, D., Kirsch, H., Arregui, M., Gaudan, S., Riethoven, M., Stoehr, P.: EBIMed - text crunching to gather facts for proteins from Medline. Bioinformatics 23(2), 237-244 (2007)
10. Tsuruoka, Y., Tsujii, J., Ananiadou, S.: FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics 24(21), 2559-2560 (2008)
11. Chen, H., Sharp, B.M.: Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics 5, 147 (2008)
12. Miyao, Y., Ohta, T., Masuda, K., Tsuruoka, Y., Yoshida, K., Ninomiya, T., Tsujii, J.: Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases. In: Proceedings of COLING-ACL 2006, Sydney, pp. 1017-1102 (2006)
13. Kim, J.-D., Ohta, T., Pyysalo, S., Kano, Y., Tsujii, J.: Overview of BioNLP 2009 Shared Task on Event Extraction. In: Proceedings of Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop (2009)
14. BioCreAtIvE - Critical Assessment of Information Extraction Systems in Biology, http://www.biocreative.org/
15. NooJ, http://www.nooj4nlp.net/
16. Barreiro, A.M.: Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. PhD dissertation. Faculdade de Letras da Universidade do Porto, Porto (2008)
17. Sasaki, Y., Montemagni, S., Pezik, P., Rebholz-Schuhmann, D., McNaught, J., Ananiadou, S.: BioLexicon: A Lexical Resource for the Biology Domain. In: Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (2008)
18. BOOTStrep Bio-Lexicon, http://www.nactem.ac.uk/biolexicon/