Extraction and Visualization of Protein-Protein Interactions from PubMed

1 Extraction and Visualization of Protein-Protein Interactions from PubMed
Ulf Leser
Knowledge Management in Bioinformatics
Humboldt-Universität Berlin

2 Finding Relevant Knowledge
Find information about
- Much knowledge is in text (and only in text)
Find articles with information about
- PubMed/Medline
- Which diseases is RAB5 associated with?
Find information about inside each article
- Reading many abstracts is tedious
- What about a "summarize results" button?
Ulf Leser: Visualizing PPI from text, SCAI Text Mining Symposium, 10/2006

3 Question
What is the risk of treating malaria patients who have a G6PD (glucose-6-phosphate dehydrogenase) deficiency with primaquine?

4 Use PubMed

5 Use AliBaba

6 Use AliBaba

7 Question
Which proteins are associated with RAB5?

10 Finding Relevant Knowledge
Find information about
Find articles with information about
- PubMed/Medline
- Which diseases is RAB5 associated with?
Find information about inside each article
- Reading many abstracts is tedious
- What about a "summarize results" button?

11 Overview
Why text mining for biomedical research
Extraction of protein-protein interactions from text
- Learning language patterns
- Pattern generalization
- Evaluation
AliBaba: Summarizing PubMed results
Vision

12 Possible Approaches to PPI
Co-occurrence
- Two proteins in one sentence -> PPI
- Tendency: Low precision, very good recall
Full sentence parsing
- Recognizes syntactic relationships between entities
- Extraction uses rules navigating the syntax tree
- Only ~30% of all sentences can be parsed unambiguously
  But recent developments (e.g. INFO-PUBMED, Rinaldi et al.)
- Tendency: Good precision, low recall
Pattern matching
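
The co-occurrence approach can be sketched in a few lines. This is only an illustration, not the system described in the talk: the protein dictionary and the whitespace tokenization are assumptions, and the function name `cooccurrence_pairs` is hypothetical.

```python
from itertools import combinations


def cooccurrence_pairs(tokens, protein_names):
    """Return every unordered pair of distinct known protein names in one sentence."""
    found = sorted({tok for tok in tokens if tok in protein_names})
    return list(combinations(found, 2))


# Hypothetical dictionary and a whitespace-tokenized example sentence.
proteins = {"CUL-1", "SKR-1", "SKR-2"}
sentence = "CUL-1 interacts with SKR-1 and SKR-2".split()
pairs = cooccurrence_pairs(sentence, proteins)
print(pairs)
```

The pair (SKR-1, SKR-2) is reported although the sentence does not state that these two interact, which illustrates why co-occurrence tends towards high recall and low precision.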

13 Relationship Mining
Language pattern
- Sentence
  GENE regulates expression of GENE
  GENE is strongly suppressed by GENE
- Adding part-of-speech
  GENE VRB NOM PRP GENE
  GENE is ADJ VRB PRP GENE
Different levels of generality
- GENE.* VRB.* GENE
  Simple rules, high recall, low precision
- GENE [is] ADJ? {regulat suppres} NOM? PRP GENE
  Complex rules, lower recall, higher precision
Balanced precision/recall requires many rules
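
The two levels of generality can be tried out with ordinary regular expressions. The sketch below assumes each token is written as `TAG:stem`; that encoding, the `AUX` tag for "is", and the pattern names `GENERAL` and `SPECIFIC` are assumptions made for illustration, not the notation of the original system.

```python
import re

# General rule: GENE .* VRB .* GENE (any tokens may sit in between).
GENERAL = re.compile(r"GENE:\S+ (?:\S+ )*?VRB:\S+ (?:\S+ )*?GENE:\S+")

# Specific rule: GENE [is] ADJ? {regulat|suppres} NOM? PRP GENE
SPECIFIC = re.compile(
    r"GENE:\S+ (?:\S+:is )?(?:ADJ:\S+ )?"
    r"VRB:(?:regulat|suppres)\S* (?:NOM:\S+ )?PRP:\S+ GENE:\S+"
)

# Tagged versions of the two example sentences, plus one the specific rule misses.
s1 = "GENE:g1 VRB:regulat NOM:expression PRP:of GENE:g2"
s2 = "GENE:g1 AUX:is ADJ:strong VRB:suppres PRP:by GENE:g2"
s3 = "GENE:g1 VRB:bind PRP:to GENE:g2"

for s in (s1, s2, s3):
    print(bool(GENERAL.search(s)), bool(SPECIFIC.search(s)), s)
```

The general rule accepts all three sentences, while the specific rule rejects the third because "bind" is not in its verb list. This is the recall/precision trade-off named on the slide.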

14 State-of-the-Art
Most systems work on hand-crafted sets of patterns
- Hundreds of patterns
- Enormous effort
- Need to be created for every type of relationship
  Protein-protein, gene-disease, disease-drug, ...
Our idea
- Learn patterns automatically

15 Recall Bioinformatics
Protein families are often defined by patterns
How to find protein families?
- [Very simple method]
- Compute distances between protein sequences
  Alignment
- Find clusters of similar sequences
  E.g. using hierarchical clustering
- Build a multiple sequence alignment for each cluster
  E.g. using ClustalW, DAlign, ...
- Compute a profile for each MSA
From sequences (of AA) to sentences (of tokens)

16 AliBaba Workflow
[Workflow diagram: PubMed, IntAct, Protein pairs, Search sentences, Linguistic annotation, Initial patterns, Clustering, Alignment, Consensus pattern, Extracted PPI]

17 Initial Pattern
Extract all pairs of proteins from IntAct
- Only the names, not the evidence / links
- Gold standard: These interactions are assumed to be real
Find all sentences in PubMed
- Pair of proteins and interaction word
- FADD immediately activates procaspase-8
Extract core phrases
- Width: Parameter
- show that FADD immediately activates procaspase-8 during
Annotate with linguistic information
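
A core phrase with a width parameter could be cut out as follows. This is a minimal sketch: the function name `core_phrase`, the symmetric window, and the words added around the slide's example sentence are assumptions for illustration.

```python
def core_phrase(tokens, first, second, width=1):
    """Return the tokens between two protein positions plus `width` tokens of context."""
    lo, hi = sorted((first, second))
    start = max(0, lo - width)
    end = min(len(tokens), hi + width + 1)
    return tokens[start:end]


# Example sentence; "We" and "apoptosis" are invented context words.
tokens = "We show that FADD immediately activates procaspase-8 during apoptosis".split()
phrase = core_phrase(tokens, tokens.index("FADD"), tokens.index("procaspase-8"), width=1)
print(" ".join(phrase))
```

A larger width keeps more context around the protein pair, which makes the resulting initial pattern more specific.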

18 Linguistic Annotation
Multi-layered pattern
  Original     FADD  immediately  activates  procaspase-8
  Class / POS  PTN   ADV          VRB        PTN
  Stem         PTN   immediat     activat    PTN
  Token        PTN   immediately  activates  PTN
Initial pattern set
- Highly specific
- Can be used immediately, but results in very low recall
Needs to be generalized
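
The multi-layered pattern can be represented as one list per layer. The sketch below reproduces the table for the example sentence; the hand-made POS lookup and the two-rule suffix stripper are toy stand-ins (assumptions), where a real pipeline would use a POS tagger and a proper stemmer.

```python
# Toy resources, assumptions for illustration only.
POS = {"immediately": "ADV", "activates": "VRB"}
SUFFIXES = ("ely", "es")


def toy_stem(word):
    """Strip the first matching suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def annotate(tokens, proteins):
    """Build the POS, stem, and token layers; protein names become PTN on every layer."""
    layers = {"pos": [], "stem": [], "token": []}
    for tok in tokens:
        if tok in proteins:
            for name in layers:
                layers[name].append("PTN")
        else:
            layers["pos"].append(POS.get(tok, "UNK"))
            layers["stem"].append(toy_stem(tok))
            layers["token"].append(tok)
    return layers


pattern = annotate("FADD immediately activates procaspase-8".split(), {"FADD", "procaspase-8"})
for name, values in pattern.items():
    print(f"{name:6}", values)
```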

19 Workflow
[Workflow diagram: PubMed, IntAct, Protein pairs, Search sentences, Linguistic annotation, Initial patterns, Clustering, Alignment, Consensus pattern, Extracted PPI]

20 Pattern Generalization
Initial patterns
- Too many (performance is an issue)
- Too specific
- Miss many small linguistic variations
Find clusters of similar patterns
- Requires a distance measure for language patterns
For each cluster, generate a consensus pattern
- Compute commonality of each set
- Generate a new, generalized pattern

21 Distances of Initial Patterns
Sentence alignment
One layer: Standard dynamic programming
End-free alignment of patterns (core phrases) against sentences
Cost for insertion, deletion, match, replacement
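
For a single layer, an end-free alignment can be sketched with standard dynamic programming. The variant below lets the pattern start and end anywhere in the sentence at no cost; the score values (match +1, replacement -1, insertion/deletion -1) and the function name `end_free_align` are assumptions, since the slides do not give the actual costs.

```python
def end_free_align(pattern, sentence, match=1, mismatch=-1, gap=-1):
    """Best score for aligning the whole pattern to any stretch of the sentence."""
    n, m = len(pattern), len(sentence)
    # score[i][j]: best score of pattern[:i] against a stretch ending at sentence[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # pattern tokens with nothing to align to
    # Row 0 stays 0: sentence tokens before the pattern are skipped for free.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if pattern[i - 1] == sentence[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # match or replacement
                score[i - 1][j] + gap,      # pattern token without partner
                score[i][j - 1] + gap,      # extra sentence token inside the pattern
            )
    # Sentence tokens after the pattern are free as well.
    return max(score[n])


pattern = ["PTN", "ADV", "VRB", "PTN"]
sentence = ["VRB", "PRP", "PTN", "ADV", "VRB", "PTN", "PRP"]
print(end_free_align(pattern, sentence))
```

Because leading and trailing sentence tokens cost nothing, a short core phrase is not penalized for sitting inside a long sentence.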

22 Substitution Matrices
One substitution matrix per layer
Layers can be weighted
Score is aggregated over all layers
$$c(i, j) = \sum_{l \in \text{layers}} w_l \cdot \text{score}_l(i[l], j[l])$$
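
The weighted aggregation over layers can be written down directly. In this sketch the layer names, the integer weights, the default scores (+1 for identical symbols, -1 otherwise), and the single matrix entry are all assumptions chosen for illustration.

```python
def layer_score(a, b, matrix):
    """Score two symbols on one layer: +1 if equal, else a matrix entry, else -1."""
    if a == b:
        return 1
    return matrix.get((a, b), matrix.get((b, a), -1))


def combined_score(tok_a, tok_b, weights, matrices):
    """Weighted sum of the per-layer scores, one substitution matrix per layer."""
    return sum(
        w * layer_score(tok_a[layer], tok_b[layer], matrices.get(layer, {}))
        for layer, w in weights.items()
    )


weights = {"pos": 2, "stem": 1, "token": 1}   # hypothetical layer weights
matrices = {"pos": {("ADV", "ADJ"): 0}}       # hypothetical: ADV/ADJ is a mild replacement

a = {"pos": "VRB", "stem": "activat", "token": "activates"}
b = {"pos": "VRB", "stem": "activat", "token": "activated"}
print(combined_score(a, b, weights, matrices))
```

Two tokens that differ only on the token layer still score well, because the POS and stem layers agree.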

23 Clustering and Generalization
Distance matrix for all pairs of initial patterns
Hierarchical clustering
Consensus pattern using multiple sentence alignment
- Generates a profile per layer
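
The two steps can be sketched as follows. The slides do not say which linkage is used, so this sketch assumes single linkage, where cutting the dendrogram at a threshold is the same as joining every pair of patterns closer than that threshold. The profile is computed for one layer from patterns that are assumed to be already aligned to equal length, with `-` marking a gap.

```python
from collections import Counter


def single_linkage_clusters(dist, threshold):
    """Group items whose pairwise distance chain stays at or below the threshold."""
    parent = list(range(len(dist)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(dist)):
        for j in range(i + 1, len(dist)):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(dist)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def column_profile(aligned):
    """Relative frequency of each symbol per alignment column (one layer)."""
    n = len(aligned)
    return [{sym: cnt / n for sym, cnt in Counter(col).items()} for col in zip(*aligned)]


dist = [[0, 1, 5], [1, 0, 4], [5, 4, 0]]   # hypothetical pattern distances
print(single_linkage_clusters(dist, threshold=2))

aligned = [
    ["PTN", "ADV", "VRB", "PTN"],
    ["PTN", "-", "VRB", "PTN"],
    ["PTN", "ADV", "VRB", "PTN"],
    ["PTN", "ADJ", "VRB", "PTN"],
]
print(column_profile(aligned))
```

A lower threshold gives more, smaller clusters and therefore more specific consensus patterns.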

24 Workflow
[Workflow diagram: PubMed, IntAct, Protein pairs, Search sentences, NER and POS tagging, Initial patterns, Clustering, Alignment, Consensus pattern, Extracted PPI]

25 Search Phase
Given a text: All sentences
- are searched for at least two protein names
- are matched against all consensus patterns
- Complication: Matching a sentence (i.e. a multi-layered pattern) against a pattern profile
$$c(i, j) = \sum_{l \in \text{layers}} w_l \cdot \text{score}_l(i[l], j[l]) \cdot (1 - \text{freq}(i[l]))$$
Highest scoring pattern wins

26 Evaluation
~ IntAct pairs
~ sentences containing an IntAct pair and an interaction word
~ unique initial patterns
- Difference between abstracts and full text
Evaluation using SPIES corpus
- Hao et al. 2004, ~900 sentences, ~1500 annotated PPI
- Not the best corpus one can think of
  Only sentences with 2 proteins, taken from very few papers
  But strongest competitor

27 Results
Using initial patterns directly
- As expected: Precision ~85%, recall ~15%
Generalization: ~9,500 consensus patterns
- Some very large, most very small
- Can be tuned towards precision or recall (cluster threshold)
Result: 79% precision at 52% recall
- F-measure: 63
- Most important type of error: Enumerations
  CUL-1 interacts with SKR-1, SKR-2, SKR-3, and SKR-10
- Tweaking towards higher recall yields 74 / 57
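
The reported F-measure follows from the precision and recall figures, assuming it is the usual balanced F1 (harmonic mean). A quick check:

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f_measure(0.79, 0.52)       # the 79 / 52 operating point
recall_tuned = f_measure(0.74, 0.57)   # the 74 / 57 operating point
print(round(100 * balanced), round(100 * recall_tuned))
```

The first value rounds to the 63 given on the slide. The second value, roughly 64, is computed here and is not a figure reported in the talk.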

28 Comparison
Hao et al. report F-measure of 68
- Semi-automatic system
- Patterns are learned from annotated corpus
- Self-made corpus
- [AliBaba on home-made corpus: F-measure 66]
AliBaba
- Needs no learning corpus at all
- Semi-supervised methods: examples are almost correct
- Highly adaptable to different tasks
  Examples readily available in many databases

29 Overview
Why text mining for biomedical research
Extraction of protein-protein interactions from text
AliBaba: Summarizing PubMed results
Vision

30 Workflow
[Architecture diagram: Client, 1. PubMed Query, Server, 2. Query, PMIDs, Internet, Annotated Texts (XML), Local Document Index, Annotation Pipeline]

31 AliBaba
Analyses results of a PubMed query
- Full PubMed query syntax
- Scope of analysis is defined by user
Extracting and visualizing information
- Entities: dictionary matches [Kirsch et al. 05]
  Genes, proteins, diseases, cells, tissues, species, drugs
- Detects PPI using extraction pipeline
- Detects further relationships using co-occurrence
- Confidence scores

32 [Screenshot with callouts: Query; Extracted information; Visualization of extracted relationships; Links to databases; Links to textual evidence]

33 Walk-through
Which proteins are associated with the TNFalpha associated death domain (TRADD)?

34 Many!

35 Filter by Object Type and Confidence

36 Show only Connected Objects

37 Show Type of Interaction

38 Location of Interaction

39 View Annotated Abstracts

40 Overview
Why text mining for biomedical research
Extraction of protein-protein interactions from text
AliBaba: Summarizing PubMed results
Vision

41 Annotated Relationships
Relationships have many parameters
Example: Modeling in Systems Biology
"The apparent K(m) value was calculated for adenosine and found to be 3.63 x 10(-3) M, which indicates high affinity of adenosine deaminase for its substrate adenosine."
Constant: K(m)
Value: 3.63 x 10(-3)
Unit: M
Enzyme: Adenosine deaminase
Compound: adenosine
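
Part of this annotation can be pulled out of the example sentence with a single regular expression. This is only a sketch for this one phrasing: the pattern `KM_PATTERN` and the function `extract_constant` are hypothetical, and the enzyme and compound would need entity recognition that is not shown here.

```python
import re

# Matches "K(m) ... <mantissa> x 10(<exponent>) <unit>" as written in the example.
KM_PATTERN = re.compile(r"(K\(m\)).*?(\d+(?:\.\d+)?)\s*x\s*10\((-?\d+)\)\s*(\w+)")


def extract_constant(sentence):
    """Return (constant, mantissa, exponent, unit) or None if the phrasing differs."""
    m = KM_PATTERN.search(sentence)
    if m is None:
        return None
    return m.group(1), float(m.group(2)), int(m.group(3)), m.group(4)


text = (
    "The apparent K(m) value was calculated for adenosine and found to be "
    "3.63 x 10(-3) M, which indicates high affinity of adenosine deaminase "
    "for its substrate adenosine."
)
print(extract_constant(text))
```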

42 KMedDB

43 More
Overlaying extracted networks with established pathways (KEGG)
Application to other types of relationships
- Protein-disease, disease-target-drug
- Annotated corpora for evaluation welcome
Improving text mining performance
  Disambiguation
  Advanced NER methods (links are lost)
  Larger learning sample (Reactome, BIND, DIP, ...)
Scalability

44 Conclusion
Learning patterns is possible
- Quickly adaptable to different tasks
Corpus creation is a bottleneck
- Even if available, might not be suitable for task at hand
- Use semi-supervised methods
- The more data, the more promising (full text, web)
What is an interaction?
- Probably the hardest problem for higher perceived precision
- Solve more specific problems
- [AliBaba: task-specific lists of interaction words]

45 Acknowledgements
Humboldt-Universität, Informatics
- Jörg Hakenberg, Torsten Schiemann
- Conrad Plake, Markus Pankalla
- Lukas Faulstich, Long Nguyen
Max-Planck-Institute for Molecular Genetics
- Edda Klipp, Sebastian Schmeier, Axel Kowald
European Bioinformatics Institute
- Harald Kirsch, Dietrich Rebholz-Schuhmann

Text Mining and Knowledge Management

Text Mining and Knowledge Management Text Mining and Knowledge Management Ulf Leser Knowledge Management in Bioinformatics Humboldt-Universität Berlin Berlin Center for Genome Based Bioinformatics University of Applied Sciences Berlin Center

More information

CENG 734 Advanced Topics in Bioinformatics

CENG 734 Advanced Topics in Bioinformatics CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the

More information

PPInterFinder A Web Server for Mining Human Protein Protein Interaction

PPInterFinder A Web Server for Mining Human Protein Protein Interaction PPInterFinder A Web Server for Mining Human Protein Protein Interaction Kalpana Raja, Suresh Subramani, Jeyakumar Natarajan Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar

More information

Protein-protein Interaction Passage Extraction Using the Interaction Pattern Kernel Approach for the BioCreative 2015 BioC Track

Protein-protein Interaction Passage Extraction Using the Interaction Pattern Kernel Approach for the BioCreative 2015 BioC Track Protein-protein Interaction Passage Extraction Using the Interaction Pattern Kernel Approach for the BioCreative 2015 BioC Track Yung-Chun Chang 1,2, Yu-Chen Su 3, Chun-Han Chu 1, Chien Chin Chen 2 and

More information

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics

More information

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database Dina Vishnyakova 1,2, 4, *, Julien Gobeill 1,3,4, Emilie Pasche 1,2,3,4 and Patrick Ruch

More information

ProteinQuest user guide

ProteinQuest user guide ProteinQuest user guide 1. Introduction... 3 1.1 With ProteinQuest you can... 3 1.2 ProteinQuest basic version 4 1.3 ProteinQuest extended version... 5 2. ProteinQuest dictionaries... 6 3. Directions for

More information

Discover more, discover faster. High performance, flexible NLP-based text mining for life sciences

Discover more, discover faster. High performance, flexible NLP-based text mining for life sciences Discover more, discover faster. High performance, flexible NLP-based text mining for life sciences It s not information overload, it s filter failure. Clay Shirky Life Sciences organizations face the challenge

More information

Syntactic Parsing for Bio-molecular Event Detection from Scientific Literature

Syntactic Parsing for Bio-molecular Event Detection from Scientific Literature Syntactic Parsing for Bio-molecular Event Detection from Scientific Literature Sérgio Matos 1, Anabela Barreiro 2, and José Luis Oliveira 1 1 IEETA, Universidade de Aveiro, Campus Universitário de Santiago,

More information

Final Program Auction - Diagnos and Competitors

Final Program Auction - Diagnos and Competitors Final Program Second BioCreAtIvE Challenge Workshop: Critical Assessment of Information Extraction in Molecular Biology Venue: Auditorium Madrid, April, 23-25, 2007 Main Organizer Prof. Alfonso Valencia,

More information

Text Mining for Health Care and Medicine. Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk

Text Mining for Health Care and Medicine. Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk Text Mining for Health Care and Medicine Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk The Need for Text Mining MEDLINE 2005: ~14M 2009: ~18M Overwhelming information in textual,

More information

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the

More information

Interactive Dynamic Information Extraction

Interactive Dynamic Information Extraction Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken

More information

Extracting value from scientific literature: the power of mining full-text articles for pathway analysis

Extracting value from scientific literature: the power of mining full-text articles for pathway analysis FOR PHARMA & LIFE SCIENCES WHITE PAPER Harnessing the Power of Content Extracting value from scientific literature: the power of mining full-text articles for pathway analysis Executive Summary Biological

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Protein Protein Interactions (PPI) APID (Agile Protein Interaction DataAnalyzer)

Protein Protein Interactions (PPI) APID (Agile Protein Interaction DataAnalyzer) APID (Agile Protein Interaction DataAnalyzer) 23 APID (Agile Protein Interaction DataAnalyzer) Integrates and unifies 7 DBs: BIND, DIP, HPRD, IntAct, MINT, BioGRID. Includes 51,873 proteins 241,204 interactions

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

Understanding Biology in the Era of Big Data:

Understanding Biology in the Era of Big Data: FOR PHARMA & LIFE SCIENCES WHITE PAPER Understanding Biology in the Era of Big Data: Depth of Coverage Matters Executive Summary Biological research today can be summarized in one word data. With more

More information

Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives

Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives Dirk.Repsilber@oru.se 2015-05-21 Functional Bioinformatics, Örebro University Vad är bioinformatik och varför

More information

Doctor of Philosophy in Computer Science

Doctor of Philosophy in Computer Science Doctor of Philosophy in Computer Science Background/Rationale The program aims to develop computer scientists who are armed with methods, tools and techniques from both theoretical and systems aspects

More information

SAP HANA Enabling Genome Analysis

SAP HANA Enabling Genome Analysis SAP HANA Enabling Genome Analysis Joanna L. Kelley, PhD Postdoctoral Scholar, Stanford University Enakshi Singh, MSc HANA Product Management, SAP Labs LLC Outline Use cases Genomics review Challenges in

More information

A leader in the development and application of information technology to prevent and treat disease.

A leader in the development and application of information technology to prevent and treat disease. A leader in the development and application of information technology to prevent and treat disease. About MOLECULAR HEALTH Molecular Health was founded in 2004 with the vision of changing healthcare. Today

More information

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words , pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan

More information

PerCuro-A Semantic Approach to Drug Discovery. Final Project Report submitted by Meenakshi Nagarajan Karthik Gomadam Hongyu Yang

PerCuro-A Semantic Approach to Drug Discovery. Final Project Report submitted by Meenakshi Nagarajan Karthik Gomadam Hongyu Yang PerCuro-A Semantic Approach to Drug Discovery Final Project Report submitted by Meenakshi Nagarajan Karthik Gomadam Hongyu Yang Towards the fulfillment of the course Semantic Web CSCI 8350 Fall 2003 Under

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

Open Flow Biological Network Initiative: Pathway map building, standards, simulation, and knowledge sharing

Open Flow Biological Network Initiative: Pathway map building, standards, simulation, and knowledge sharing Open Flow Biological Network Initiative: Pathway map building, standards, simulation, and knowledge sharing Hiroaki Kitano (1,2), Yukiko Matsuoka (1) (1) The Systems Biology Institute, (2) OIST 2009/04/07

More information

Molecular event extraction from Link Grammar parse trees in the BioNLP 09 Shared Task

Molecular event extraction from Link Grammar parse trees in the BioNLP 09 Shared Task Computational Intelligence, Volume xx, Number 000, 2009 Molecular event extraction from Link Grammar parse trees in the BioNLP 09 Shared Task Võ HáNguyên, Jörg Hakenberg, Luis Tari, Chitta Baral, Arizona

More information

Identifying and extracting malignancy types in cancer literature

Identifying and extracting malignancy types in cancer literature Identifying and extracting malignancy types in cancer literature Yang Jin 1, Ryan T. McDonald 2, Kevin Lerman 2, Mark A. Mandel 4, Mark Y. Liberman 2, 4, Fernando Pereira 2, R. Scott Winters 3 1, 3,, Peter

More information

Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

Delivering the power of the world s most successful genomics platform

Delivering the power of the world s most successful genomics platform Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE

More information

HPI in-memory-based database system in Task 2b of BioASQ

HPI in-memory-based database system in Task 2b of BioASQ CLEF 2014 Conference and Labs of the Evaluation Forum BioASQ workshop HPI in-memory-based database system in Task 2b of BioASQ Mariana Neves September 16th, 2014 Outline 2 Overview of participation Architecture

More information

Kinexus has an in-house inventory of lysates prepared from 16 human cancer cell lines that have been selected to represent a diversity of tissues,

Kinexus has an in-house inventory of lysates prepared from 16 human cancer cell lines that have been selected to represent a diversity of tissues, Kinexus Bioinformatics Corporation is seeking to map and monitor the molecular communications networks of living cells for biomedical research into the diagnosis, prognosis and treatment of human diseases.

More information

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik Leading Genomics Diagnostic harma Discove Collab Shanghai Cambridge, MA Reykjavik Global leadership for using the genome to create better medicine WuXi NextCODE provides a uniquely proven and integrated

More information

Build Vs. Buy For Text Mining

Build Vs. Buy For Text Mining Build Vs. Buy For Text Mining Why use hand tools when you can get some rockin power tools? Whitepaper April 2015 INTRODUCTION We, at Lexalytics, see a significant number of people who have the same question

More information

Processing Genome Data using Scalable Database Technology. My Background

Processing Genome Data using Scalable Database Technology. My Background Johann Christoph Freytag, Ph.D. freytag@dbis.informatik.hu-berlin.de http://www.dbis.informatik.hu-berlin.de Stanford University, February 2004 PhD @ Harvard Univ. Visiting Scientist, Microsoft Res. (2002)

More information

METHODS IN MEDICAL INFORMATICS

METHODS IN MEDICAL INFORMATICS Chapman & Hall/CRC Mathematical and Computational Biology Series METHODS IN MEDICAL INFORMATICS Fundamentals of Healthcare Programming in Perln Pythoni and Ruby Jules J- Berman TECHNISCHE INFORMATION SBIBLIOTHEK

More information

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper CAST-2015 provides an opportunity for researchers, academicians, scientists and

More information

Presenting data: how to convey information most effectively Centre of Research Excellence in Patient Safety 20 Feb 2015

Presenting data: how to convey information most effectively Centre of Research Excellence in Patient Safety 20 Feb 2015 Presenting data: how to convey information most effectively Centre of Research Excellence in Patient Safety 20 Feb 2015 Biomedical Informatics: helping visualization from molecules to population Dr. Guillermo

More information

Bioinformatics Grid - Enabled Tools For Biologists.

Bioinformatics Grid - Enabled Tools For Biologists. Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis

More information

An Interactive De-Identification-System

An Interactive De-Identification-System An Interactive De-Identification-System Katrin Tomanek 1, Philipp Daumke 1, Frank Enders 1, Jens Huber 1, Katharina Theres 2 and Marcel Müller 2 1 Averbis GmbH, Freiburg/Germany http://www.averbis.com

More information

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems cation systems. For example, NLP could be used in Question Answering (QA) systems to understand users natural

More information

Introduction to IE with GATE

Introduction to IE with GATE Introduction to IE with GATE based on Material from Hamish Cunningham, Kalina Bontcheva (University of Sheffield) Melikka Khosh Niat 8. Dezember 2010 1 What is IE? 2 GATE 3 ANNIE 4 Annotation and Evaluation

More information

Dutch Parallel Corpus

Dutch Parallel Corpus Dutch Parallel Corpus Lieve Macken lieve.macken@hogent.be LT 3, Language and Translation Technology Team Faculty of Applied Language Studies University College Ghent November 29th 2011 Lieve Macken (LT

More information

A Statistical Text Mining Method for Patent Analysis

A Statistical Text Mining Method for Patent Analysis A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, shjun@cju.ac.kr Abstract Most text data from diverse document databases are unsuitable for analytical

More information

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources 1 of 8 11/7/2004 11:00 AM National Center for Biotechnology Information About NCBI NCBI at a Glance A Science Primer Human Genome Resources Model Organisms Guide Outreach and Education Databases and Tools

More information

From Data to Foresight:

From Data to Foresight: Laura Haas, IBM Fellow IBM Research - Almaden From Data to Foresight: Leveraging Data and Analytics for Materials Research 1 2011 IBM Corporation The road from data to foresight is long? Consumer Reports

More information
