Protein Protein Interaction Networks



Similar documents
How To Cluster Of Complex Systems

Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro

Graph theoretic approach to analyze amino acid network

Part 2: Community Detection

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Graph Mining and Social Network Analysis

On Network Tools for Network Motif Finding: A Survey Study

9.1. Graph Mining, Social Network Analysis, and Multirelational Data Mining. Graph Mining

Bioinformatics: Network Analysis

Network Analysis. BCH 5101: Analysis of -Omics Data 1/34

Practical Graph Mining with R. 5. Link Analysis

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Mining Social-Network Graphs

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Cluster Analysis: Advanced Concepts

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

1. Introduction Gene regulation Genomics and genome analyses Hidden markov model (HMM)

ProteinQuest user guide

An Empirical Study of Two MIS Algorithms

Application of Graph-based Data Mining to Metabolic Pathways

Graph Theory and Networks in Biology

TUTORIAL From Protein-Protein Interactions (PPIs) to networks and pathways ???

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

CENG 734 Advanced Topics in Bioinformatics

Mining Social Network Graphs

Tutorial for proteome data analysis using the Perseus software platform

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Unsupervised learning: Clustering

SCAN: A Structural Clustering Algorithm for Networks

Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?

Using Ontologies in Proteus for Modeling Data Mining Analysis of Proteomics Experiments

How To Find Influence Between Two Concepts In A Network

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Social Media Mining. Data Mining Essentials

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

Understanding the dynamics and function of cellular networks

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Social Media Mining. Graph Essentials

Dmitri Krioukov CAIDA/UCSD

Big Data Text Mining and Visualization. Anton Heijs

Discrete Mathematics & Mathematical Reasoning Chapter 10: Graphs

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

Network (Tree) Topology Inference Based on Prüfer Sequence

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Data Mining for Knowledge Management. Classification

Chapter 6: Episode discovery process

Healthcare Analytics. Aryya Gangopadhyay UMBC

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Random graphs with a given degree sequence

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

The Basics of Graphical Models

Visualizing Networks: Cytoscape. Prat Thiru

Statistics Graduate Courses

Final Project Report

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc.,

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

IBM SPSS Modeler Social Network Analysis 15 User Guide

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Data, Measurements, Features

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Data Mining - Evaluation of Classifiers

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

NETZCOPE - a tool to analyze and display complex R&D collaboration networks

Data Preprocessing. Week 2

Supervised Learning (Big Data Analytics)

Graph Mining and Social Network Analysis. Data Mining; EECS 4412 Darren Rolfe + Vince Chu

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

How To Cluster

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

A Primer of Genome Science THIRD

Network Algorithms for Homeland Security

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits

Introduction to Graph Mining

Lecture 7: NP-Complete Problems

MODEL SELECTION FOR SOCIAL NETWORKS USING GRAPHLETS

Greedy Routing on Hidden Metric Spaces as a Foundation of Scalable Routing Architectures

A COGNITIVE APPROACH IN PATTERN ANALYSIS TOOLS AND TECHNIQUES USING WEB USAGE MINING

Data Mining Practical Machine Learning Tools and Techniques

Unsupervised Data Mining (Clustering)

Graph/Network Visualization

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Genetomic Promototypes

Data Mining Algorithms Part 1. Dejan Sarka

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

WORKSHOP ON TOPOLOGY AND ABSTRACT ALGEBRA FOR BIOMEDICINE

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Introduction to Data Mining

Doctor of Philosophy in Computer Science

Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Transcription:

Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it

My Definition of Bioinformatics Hidden Knowledge (Pattern) Discovery Computational Techniques Machine Learning Techniques Data Mining Algorithms Mathematical Models Biological Data Knowledge Biomedical Applications Genome Proteome Networks Functional Characterization Disease Diagnosis Drug Development

Bioinformatics Milestone Computational Biology Functional Genomics Systems Biology Stage 1. Stage 2. Stage 3. Stage 4. Sequence Analysis Structure Analysis Genome Analysis System Analysis DNA sequencing Homolog search Motif finding Protein folding Homolog search Binding site prediction Function prediction Gene clustering Network modeling Module finding Pathway prediction

Protein Protein Interactions Protein Protein Interactions (PPIs) Permanent interactions vs. Transient interactions Determined by high throughput experimental methods: mass spectrometry and yeast two hybrid system Interactome The entire set of PPIs in a species Problem? Large scale & Unreliability Demand? Computational, integrative approaches

Protein Protein Interaction Networks Protein Protein Interaction Networks A set of nodes V represents proteins A set of edges E represents interactions Undirected unweighed graph, G(V,E) Problem? Complex connectivity Demand? Computational, systematic approaches

Overview Protein Function Prediction Protein Complex Prediction Functional Module Detection Functional Hub Identification Functional Pathway Identification

Protein Function Prediction from PPI Networks Previous Intuitions Utilize connectivity between proteins Local neighborhood based methods Limited accuracy Some interactions irrelevant to functional linkage My Direction Utilize network motifs Interconnection patterns occurring in PPI networks more frequently than in randomized networks Core components for functional activities Evolutionary conserved Wuchty, et al., 2003

Function Association Patterns Subgraph Patterns Functional Association Patterns Labeled subgraphs Assigned a set of functions of a protein into the corresponding node as a label Functional Association Pattern Mining Identification of functional association patterns occurring frequently Function Prediction Predicting protein function by functional association pattern mining

Pattern Mining Process Assumption in Frequency If a sub pattern of a pattern p is not frequent, then p is not frequent Downward closure Applied the frequent item set mining algorithm in the market basket problem Selective Joining Merged two frequent (k 1) node patterns to generate a candidate k node pattern Considered the (k 1) node patterns which share a frequent (k 2) node sub pattern Apriori Pruning Filtered out infrequent patterns using a threshold of minimum frequency Counted all isomorphic patterns using canonical forms

Pattern Mining Example candidate 2-node patterns frequency {f 1 } - {f 1 } 1 { f 1 } { f 1 } { f 1, f 2 } {f 1, f 2 } - {f 1, f 2 } 1 {f 1, f 3 } - {f 1, f 3 } 0 {f 1 } - {f 1, f 2 } 3 frequent 2-node patterns frequency {f 1 } - {f 1, f 2 } 3 {f 1 } - {f 1, f 3 } 2 {f 1 } - {f 1, f 3 } 2 { f 1, f 3 } { f 1, f 2 } {f 1, f 2 } - {f 1, f 3 } 1 { f 1 } frequent 3-node patterns frequency {f 1, f 2 } - {f 1 } - {f 1, f 3 } 3 candidate 3-node patterns frequency {f 1 } - {f 1, f 2 } - {f 1 } 1 {f 1, f 2 } - {f 1 } - {f 1, f 2 } 1 { f 1 } { f 1, f 3 } { f 1, f 2 } { f 1, f 2 } {f 1 } - {f 1, f 3 } - {f 1 } 1 {f 1, f 3 } - {f 1 } - {f 1, f 3 } 0 {f 1, f 2 } - {f 1 } - {f 1, f 3 } 3

Function Prediction Algorithm Pattern Analogue Replaced only one node label in a pattern p with a different label Function Prediction Given k, predicting function of an unknown protein by projecting each k node frequent pattern Algorithm Prediction ( G(V,E), F k-1, F k, v u V ) F n k-1 selecting frequent patterns p n k-1 including v n N(v u ) P uk generating patterns p uk by extending p k-1 n F k-1 n to v u if p ik P uk is a pattern analogue of p k j F k f predicting function of v u by the most frequent p k j end if return f

Function Prediction Accuracy Evaluation Design Leave one out cross validation Exact match vs. Inclusive match Exact match: { predicted func ons } { real func ons } Inclusive match: { predicted functions } { real functions } Results 1.0 0.8 1.0 85% 86% 0.8 0.6 0.6 accuracy 0.4 accuracy 0.4 0.2 min. freq. = 20 min. freq. = 40 0.2 min. freq. = 20 min. freq. = 40 min. freq. = 60 min. freq. = 60 0.0 2 3 4 5 6 0.0 2 3 4 5 6 k k Cho and Zhang, IEEE Transactions on Information Technology in Biomedicine (TITB), 2010

Overview Protein Function Prediction Protein Complex Prediction Functional Module Detection Functional Hub Identification Functional Pathway Identification

Protein Complex Prediction from PPI Networks Previous Intuitions Search densely connected subgraphs Graph clustering algorithms Limited accuracy Exclusion of proteins with low connectivity My Direction Apply a seed growth style approach An initial seed cluster grows based a connectivity function Searches core (dense region) and periphery (sparse region)

Graph Entropy Graph Entropy An information theoretic function of connectivity General Notations p i (v): probability of v having inner links (edges from v to the vertices in V of G (V,E ) ) p o (v): probability of v having outer links (edges from v to the vertices not in V ) Definition Vertex Entropy: e(v) = p i (v) log 2 p i (v) p o (v) log 2 p o (v) Graph Entropy: e(g(v,e)) = Σ v V e(v) Graph Entropy Example

Protein Complex Prediction Algorithm Algorithm Greedy algorithm Randomness, No parameters Creates an initial cluster including a vertex (seed) and its neighbors Removes vertices on cluster inner boundary iteratively to decrease graph entropy until it is minimal Adds vertices on cluster outer boundary iteratively to decrease graph entropy until it is minimal Outputs the cluster, and repeats all until there is no more vertex left

Protein Complex Prediction Accuracy Evaluation Design F measure Mean of recall and precision P value Hyper geometric distribution Results Kenley and Cho, Proteomics, 2011 Kenley and Cho, ICDM 2012

Overview Protein Function Prediction Protein Complex Prediction Functional Module Detection Functional Hub Identification Functional Pathway Identification

Functional Module Detection from PPI Networks Previous Intuitions Partition PPI networks Graph clustering algorithms Limited accuracy Unreliable, complex connections Non overlapping clusters My Direction Utilize an integrative approach Semantic analytics using the genome wide Gene Ontology (GO) data Simulate information propagation Information propagation from a selected source node through the network

Information Propagation Model Path Strength Measurement Formula : Factors Normalized edge weights Inverse of path length Inverse of node degree Functional Impact Scoring Functional impact F s (v i ) of a source v s on v i : Sum of strength of all possible paths from v s to v i

Dynamic Propagation Algorithm Algorithm Computation of cumulative functional impact of a source v s on all the other nodes s Repeated random walk simulation starting from a specific node v s Uses a minimum functional impact threshold Stops when there are no more updates on cumulative functional impact on any nodes Example 0.15 0.45 0.79 0.65 0.28 1.0 1.74 S 1.26 0.83 0.27 0.89 1.38 0.41 0.92 0.31 0.11

Propagation Pattern Mining Process

Functional Module Detection Efficiency Evaluation in Synthetic Networks Potential Factors Number of nodes Network density Average node degree Results 60 50 constant density constant average degree 120 100 constant number of nodes constant average degree run time (msec) 40 30 20 ru un time (msec) 80 60 40 10 20 0 0 1000 2000 3000 4000 5000 6000 7000 0 0 0.002 0.004 0.006 0.008 0.01 number of nodes density

Functional Module Detection Accuracy Hypothesis If two proteins have coherent propagation patterns, then they are likely to perform the same functions average correlatio on of functional flow patterns 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 random pairs from different functions random pairs from the same functions Results 0 Cho, Hwang, Ramanathan and Zhang, BMC Bioinformatics, 2007 Cho, Shi and Zhang, ICDM 2009

Overview Protein Function Prediction Protein Complex Prediction Functional Module Detection Functional Hub Identification Functional Pathway Identification

Functional Hub Identification from PPI Networks Previous Intuitions Search high degree nodes Search centrally located nodes Closeness / Betweenness Limited accuracy Unreliable, complex connections Unable to find a hierarchy My Direction Utilize an integrative approach Semantic analytics using the genome wide Gene Ontology (GO) data Reconstruct the hierarchical structure of proteins

Functional Similarity Model Path Strength Formula : Functional Similarity k length path strength: Functional similarity: where thresholdh

Network Conversion Algorithm Algorithm Conversion of a complex PPI network into a hierarchical tree structure of proteins Process (1) Computes centrality for each node a (2) Obtains a set of ancestor nodes T(a) of a (3) Selects a parent node p(a) of a

Hub Confidence Measurement Algorithm Quantification of functional hubness of proteins Process (1) Obtains a set of child nodes D(a) of a (2) Obtains a set of descendent nodes L a of a (3) Computes the hub confidence H(a) of a

Topological Assessment of Functional Hubs Network Vulnerability Test Random attack: Repeatedly disrupt a randomly selected node Degree based hub attack: Repeatedly disrupt the highest degree node Functional hub attack: Repeatedly disrupt the node with the highest hub confidence For each iteration, observe the largest component Results 1.00 t fraction of largest component 0.95 0.90 0.85 0.80 0.75 0.70 0.65 random attack degree-based hub attack structural hub attack 0.60 0 20 40 60 80 100 120 140 160 number of nodes

Biological Assessment of Functional Hubs Protein Lethality Test To determine lethal proteins by knock out experiment ( Lethality represents functional essentiality. ) Order proteins by degree and hub confidence For every 10 proteins, observe the cumulative proportion of lethal proteins Results average lethality 1.0 09 0.9 0.8 0.7 0.6 structural hubs degree-based d hubs 0.5 0.4 0 20 40 60 80 100 120 140 number of hubs Cho and Zhang, BMC Bioinformatics, 2010

Overview Protein Function Prediction Protein Complex Prediction Functional Module Detection Functional Hub Identification Functional Pathway Identification

Functional Pathway Identification from PPI Networks Previous Intuitions Search the strongest path between a source and a target Computational inefficiency Limited accuracy Difference between signal transduction and interaction My Direction Utilize an integrative approach Semantic analytics using the genome wide Gene Ontology (GO) data Measurement of path frequency towards the target node

Frequent Path Mining Notations l x : a list of length x, P k : a set of all paths of length k between a source and a target L x : a set of path prefixes of P k of length x A selection function f : returning a subset of P k, having a specific prefix l x Frequent Path Mining Given a path prefix l x, computes the support of the association rule, l x v i

Functional Pathway Prediction Algorithm Algorithm Mining frequent paths Selecting multiple successors by an expansion parameter Was the target selected? no yes Merging the pathways

Efficiency Improvement of Path Mining Preprocessing Quasi clustering Information propagation from the source node Frequent path mining in a subgraph of the PPI network Approximation Approximate support pre computation Estimation of cycles Use of a pruning parameter

Functional Pathway Identification Performance Accuracy Test for MAP Kinase signaling pathways Efficiency

Overview Protein Function Prediction Protein Complex Prediction Functional Module Detection Functional Hub Identification Functional Pathway Identification

Conclusion Bioinformatics Research Trend Gene level Genome level The local The global The par cular The universal Functional Characterization Process Pre genomic era: sequence structure function Post genomic era: interaction network function

Bioinformatics Program at Baylor Bioinformatics in Computer Science

References My personal webpage: http:// web.ecs.baylor.edu/faculty/cho/ My lab webpage: http://bionet.ecs.baylor.edu/ Baylor, Bioinformatics program webpage: http://www.ecs.baylor.edu/bioinformatics/ Baylor, Institute of Biomedical Studies webpage: http://www.baylor.edu/biomedical_studies/