CENG 734 Advanced Topics in Bioinformatics




Week 9
Text Mining for Bioinformatics: BioCreative II.5
Fall 2010-2011

Quiz #7
1. Draw the decompressed graph for the following graph summary.
2. Describe the two criteria that are used in the SPICi algorithm for stopping cluster expansion.

Overview of BioCreative II.5
Tasks:
- Rank articles by how likely they are to contain PPIs
- Identify interacting proteins in articles that contain PPIs
- Identify interacting protein pairs
Evaluation test set: 595 full-text articles (61 PPI-related)
Performance measures:
- Area under the precision-recall curve (favors recall)
- Balanced F-measure (favors precision)

Overview of BioCreative II.5
- 15 teams participated.
- The best AUC for article classification was 0.70.
- For interacting proteins, the best system achieved good macro-averaged recall (0.73) and AUC (0.58) after filtering incorrect species and mapping homonymous orthologs.
- For interacting protein pairs, the top (filtered, mapped) recall was 0.42 and the AUC was 0.29.
- Ensemble systems improved performance for the interacting protein task.

Curation of PPIs from Literature
- Databases: MINT, IntAct, BioGRID
- Manual process

FEBS Letters Experiment
- In 2007 the journal FEBS Letters asked authors to submit structured (i.e., machine-readable) information along with their manuscripts.
- They found that this slowed down the publication process.
- MINT curators now extract the structured information and ask the authors for approval.
- This provides a benchmark dataset for automated tools.

Tasks: in detail
- Article categorization task (ACT): binary classification and ranking of articles (document classification) as relevant for curation, i.e., for extracting PPI annotations
- Interactor normalization task (INT): ranked lists of identifiers of the proteins for which the article reports an interaction with experimental evidence
- Interaction pair task (IPT): lists of binary interaction pairs with experimental evidence, given as protein identifier pairs per article

The BioCreative Meta Server
- 10 min time limit

Performance measures: detailed
- Balanced F-measure
- Interpolated AUC
- For the ACT task
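The formulas on this slide did not survive transcription. As a minimal sketch, the two measures can be computed as below, assuming the standard definitions: the balanced F-measure is the harmonic mean of precision and recall, and the interpolated AUC takes, at each recall level, the highest precision reached at that recall or any higher recall. The function names and inputs are illustrative assumptions, not the official BioCreative evaluation script.

```python
def balanced_f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def auc_ipr(ranked_labels, total_positives):
    """Area under the interpolated precision/recall curve of a ranked list.

    ranked_labels: booleans, best-ranked item first (True = relevant).
    total_positives: number of relevant items in the gold standard.
    """
    if total_positives == 0:
        return 0.0
    # (recall, precision) measured at every rank where a relevant item appears.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_positives, hits / rank))
    # Interpolation: walk from high recall to low recall, carrying the
    # best precision seen so far.
    interpolated = []
    best_precision = 0.0
    for recall, precision in reversed(points):
        best_precision = max(best_precision, precision)
        interpolated.append((recall, best_precision))
    interpolated.reverse()
    # Sum the rectangles between consecutive recall levels.
    area = 0.0
    previous_recall = 0.0
    for recall, precision in interpolated:
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area
```

Because the AUC is computed over the whole ranked list, a system can raise it by returning long lists, whereas the F-measure penalizes every false positive in the submitted set.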

Difficulty of the challenge
Even human curators do not always agree:
- 81% agreement between curators of different databases
- 93% agreement between curators of the same database

Availability
- All data and results are available at http://www.biocreative.org/events/biocreativeii5/results-and-data/
- BioCreative III was held this year. The workshop was in September 2010.

ACT results

Different approaches
For the most successful teams in the IPT (teams 18, 37, and 42), the approaches are technically quite varied:
- Team 42 used only pattern matching.
- Team 18 employed conjunction handling, including shallow parsing.
- Team 37 employed a very wide range of NLP techniques, including deep parsing of the grammatical structure of the text.

Team 42
"Efficient Extraction of Protein-Protein Interactions from Full-Text Articles"
Jörg Hakenberg, Robert Leaman, Nguyen Ha Vo, Siddhartha Jonnalagadda, Ryan Sullivan, Christopher Miller, Luis Tari, Chitta Baral, and Graciela Gonzalez
Most of the authors are from ASU.

Team 42
- Submitted results for 21 of the 61 relevant articles in the IPT task
- Two goals:
  - Protein named entity recognition, including normalization
  - Extraction of protein-protein interactions from full text
- Processes a full-text article in 10 seconds to 2 minutes, depending on the configuration

Summary
- Propose strategies to transfer document-level annotations to the sentence level, which allows for the creation of a more fine-grained training corpus.
- Use this corpus to automatically derive around 5,000 patterns.
- Rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus.

A remarkable early work
Fundel et al. found around 150,000 distinct PPIs of human genes/proteins within one million PubMed abstracts from 1990 to 2007.
K. Fundel, R. Küffner, and R. Zimmer, "RelEx - Relation Extraction Using Dependency Parse Trees," Bioinformatics, vol. 23, no. 3, pp. 365-371, 2007.

Text mining is needed to:
- identify abstracts/full-text articles that contain relevant data,
- spot relevant passages in a given article (as opposed to established background information),
- recognize mentions of relevant biomedical entities (proteins and organisms),
- map each entity mention to a database identifier (such as a UniProt ID for each protein),
- extract relationships between entities (such as mapping a protein to an organism or finding PPIs),
- extract additional information on the experiments described (interaction-detection method, clone library, antibodies, etc.).

Named Entity Recognition (NER)
Steps:
- machine-learning-based NER of protein names: the BANNER NER system, which is based on Conditional Random Fields and was trained on BioCreative II data,
- dictionary-based recognition of species names,
- generating candidate IDs for each recognized mention based on dictionary matching,
- filtering of IDs by species, and
- ranking of candidate IDs.
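The last three steps (candidate generation, species filtering, ranking) can be sketched as follows. This is a toy illustration under stated assumptions: protein mentions are taken as already recognized (the CRF-based tagging step is not reproduced), the dictionaries are tiny hypothetical stand-ins, the identifiers are invented placeholders rather than real UniProt accessions, and ranking by species frequency is an assumed heuristic, not necessarily the team's.

```python
from collections import defaultdict

# Hypothetical toy dictionary: lowercase protein name -> [(identifier, species)].
PROTEIN_DICTIONARY = {
    "p53": [("ID_HUMAN_1", "human"), ("ID_MOUSE_1", "mouse")],
    "mdm2": [("ID_HUMAN_2", "human"), ("ID_MOUSE_2", "mouse")],
}

# Hypothetical species dictionary: surface form -> canonical species label.
SPECIES_DICTIONARY = {
    "human": "human",
    "homo sapiens": "human",
    "mouse": "mouse",
    "mus musculus": "mouse",
}


def find_species(text):
    """Dictionary-based species recognition: count mentions per species."""
    lowered = text.lower()
    counts = defaultdict(int)
    for surface_form, species in SPECIES_DICTIONARY.items():
        counts[species] += lowered.count(surface_form)
    return {species: n for species, n in counts.items() if n > 0}


def normalize_mention(mention, species_counts):
    """Generate candidate IDs, filter them by species, and rank them."""
    candidates = PROTEIN_DICTIONARY.get(mention.lower(), [])
    # Keep only candidates whose species is mentioned in the article.
    kept = [(pid, sp) for pid, sp in candidates if sp in species_counts]
    # Rank: candidates from more frequently mentioned species come first.
    kept.sort(key=lambda candidate: species_counts[candidate[1]], reverse=True)
    return [pid for pid, _ in kept]
```

For example, calling `normalize_mention("p53", find_species(article_text))` returns the candidate identifiers for "p53", restricted to the species that `article_text` mentions.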

Relationship Extraction
- Use a pattern-based approach.
- Common patterns of sentences that describe PPIs
- Generate task-specific patterns by clustering annotated training data.
- Conceptualizations of tokens: lists of words referring to protein interactions
- Use the OpenDMAP framework for maintaining and matching patterns.

Generation of Sentence-Level Training Data
- Given: the full text and the pairs of interacting proteins in that full text
- Use NER to predict the locations of the identifiers in the full text.
- Remove sentences that do not contain interaction-related terms.
- At the end: 784 positive sentences out of 14,844 in total are identified and used for training.
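A minimal sketch of this transfer from document-level to sentence-level annotations is given below. It assumes that a sentence counts as positive when it mentions both members of an annotated pair; the slide does not spell out the exact labeling rule. The term list, the sentence splitter, and the `tag_proteins` callback (a stand-in for NER plus normalization) are all hypothetical.

```python
import re

# Hypothetical list of interaction-related terms.
INTERACTION_TERMS = {"interacts", "interact", "binds", "bind", "associates"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentences(text, gold_pairs, tag_proteins):
    """Turn document-level pair annotations into labeled sentences.

    gold_pairs: iterable of (id_a, id_b) annotated for the whole article.
    tag_proteins: callable mapping a sentence to the set of protein IDs
        found in it (stand-in for the NER and normalization steps).
    Returns a list of (sentence, is_positive) tuples.
    """
    examples = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        # Drop sentences without any interaction-related term.
        if not tokens & INTERACTION_TERMS:
            continue
        ids = tag_proteins(sentence)
        # Assumed rule: positive if both proteins of a gold pair co-occur.
        positive = any(a in ids and b in ids for a, b in gold_pairs)
        examples.append((sentence, positive))
    return examples


def toy_tagger(sentence):
    """Toy stand-in for NER: exact name lookup with placeholder IDs."""
    names = {"RAD51": "P1", "BRCA2": "P2", "TP53": "P3"}
    return {pid for name, pid in names.items() if name in sentence}
```

With `toy_tagger` and the gold pair `("P1", "P2")`, a sentence that mentions both RAD51 and BRCA2 next to an interaction term becomes a positive example, and a sentence that has an interaction term but no complete gold pair becomes a negative one.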

Pattern Generation
Convert each sentence to a pair of protein IDs and a keyword that describes the interaction:
- A interacts with B
- A binds B
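The reduction of a sentence to "A keyword B" can be sketched with a regular expression. This is an illustrative simplification under stated assumptions: the keyword list is hypothetical, protein names are matched as plain strings rather than normalized IDs, and only the first protein-keyword-protein sequence in the sentence is returned.

```python
import re

# Hypothetical keyword list; longer phrases come first so that
# "binds to" is preferred over "binds" at the same position.
INTERACTION_KEYWORDS = ("interacts with", "binds to", "binds")


def to_triple(sentence, protein_names):
    """Reduce a sentence to (protein_a, keyword, protein_b), or None."""
    if not protein_names:
        return None
    names = "|".join(
        re.escape(n) for n in sorted(protein_names, key=len, reverse=True)
    )
    keywords = "|".join(re.escape(k) for k in INTERACTION_KEYWORDS)
    # protein ... keyword ... protein, with as little text in between as possible.
    pattern = re.compile(
        rf"\b({names})\b.*?\b({keywords})\b.*?\b({names})\b"
    )
    match = pattern.search(sentence)
    if match is None:
        return None
    protein_a, keyword, protein_b = match.groups()
    return (protein_a, keyword, protein_b)


def to_pattern(triple):
    """Abstract a triple into a reusable surface pattern."""
    _, keyword, _ = triple
    return f"PROTEIN {keyword} PROTEIN"
```

Abstracting the protein slots away is what lets many training sentences collapse into a small set of shared patterns.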

Concept Groups

Example Patterns

OpenDMAP
The generated patterns are input to OpenDMAP, which is an ontology-driven, rule-based concept analysis and information extraction system.

Relevance Ranking of Sentences
- Train an SVM model on 9,750 sentences using libsvm.
- Features extracted for each sentence:

Ranking of Extracted Interactions
Based on:
- the confidence scores obtained for each protein mention,
- the disambiguation to find a UniProt ID per protein,
- the relevance score of the ranked sentence,
- the number of pieces of evidence found for each pair in the overall article.
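The slide lists the four signals but not how they are combined. One possible combination, shown purely as an assumption, multiplies the three per-sentence scores into an evidence strength and merges repeated evidence for the same pair with a noisy-OR, so that more supporting sentences raise the final score. The dictionary keys and the formula are hypothetical.

```python
from collections import defaultdict


def rank_pairs(evidences):
    """Rank protein pairs by combined evidence (assumed noisy-OR scheme).

    evidences: list of dicts with keys
        "pair"               -> (id_a, id_b)
        "mention_conf"       -> NER confidence for the mentions, in [0, 1]
        "id_conf"            -> confidence of the ID disambiguation, in [0, 1]
        "sentence_relevance" -> relevance score of the sentence, in [0, 1]
    Returns [(pair, score), ...] sorted by decreasing score.
    """
    # Probability that every piece of evidence for a pair is wrong.
    miss_probability = defaultdict(lambda: 1.0)
    for evidence in evidences:
        # Interactions are undirected here, so (A, B) and (B, A) are merged.
        pair = tuple(sorted(evidence["pair"]))
        strength = (
            evidence["mention_conf"]
            * evidence["id_conf"]
            * evidence["sentence_relevance"]
        )
        miss_probability[pair] *= 1.0 - strength
    scores = {pair: 1.0 - miss for pair, miss in miss_probability.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under this scheme a pair supported by two moderately confident sentences can outrank a pair supported by a single stronger one, which reflects the "number of pieces of evidence" signal.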

Results

Results

Annotation Server Configurations

Patterns with Highest Support

Positions of Interactions

Next Week
One paper to read, on the reconstruction of signaling networks:
"Reconstructing signaling pathways from RNAi data using probabilistic Boolean threshold networks" by Kaderali et al., Bioinformatics, 2009.