Probabilistic methods for post-genomic data integration




Dirk Husmeier
Biomathematics & Statistics Scotland (BioSS)
JCMB, The King's Buildings, Edinburgh EH9 3JZ, United Kingdom
http://www.bioss.ac.uk/~dirk

Integrated analysis of regulatory networks
Expression data alone are not sufficient.
Combining multiple sources of information yields complementary constraints.

Combining promoter sequences and gene expression data
Conventional approach:
Find clusters of co-expressed genes.
Identify regulatory elements by searching for common over-represented motifs in the promoter regions of these genes.

Shortcomings of the conventional algorithm

[Figure: microarray data, model, promoter sequences]

Segal's unifying probabilistic model

[Figure: microarray data, model, promoter sequences]

Segal, Yelensky, Koller (2003) Bioinformatics 19

Revision: Motif finding

Motif: TAT A G T G A A T T T A T A G A


Position Specific Scoring Matrix (PSSM)
Search for a motif of length $W$ in binding sequences.
$W \times 4$ matrix $\psi_k(l)$: probability that the nucleotide in the $k$th position, $k \in [1, \ldots, W]$, is an $l \in \{A, C, G, T\}$.
Background model for non-binding sequences: 4-dim vector $\theta_0(l)$, the probability of nucleotide $l$; this distribution is position-independent.

Sequence $S_1, S_2, \ldots, S_N$

Non-binding sequence: $R = 0$
$$P(S_1, S_2, \ldots, S_N \mid R = 0) = \prod_{t=1}^{N} \theta_0(S_t)$$

Binding sequence: $R = 1$, motif starting at position $m+1$
$$P(S_1, S_2, \ldots, S_N \mid R = 1, \text{start} = m+1) = \prod_{t=1}^{m} \theta_0(S_t) \prod_{k=1}^{W} \psi_k(S_{m+k}) \prod_{t=m+W+1}^{N} \theta_0(S_t) = \prod_{t=1}^{N} \theta_0(S_t) \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$

Binding sequence: $R = 1$, motif starting anywhere
$$P(S_1, S_2, \ldots, S_N \mid R = 1) = \sum_{m=0}^{N-W} P(\text{start} = m+1)\, P(S_1, S_2, \ldots, S_N \mid R = 1, \text{start} = m+1) = \prod_{t=1}^{N} \theta_0(S_t) \, \frac{1}{N-W+1} \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$

Objective: Prediction of binding activity from sequence: $P(R = 1 \mid S_1, S_2, \ldots, S_N)$
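The two ways of writing the binding likelihood can be checked numerically. The following is a minimal Python sketch, not code from the talk: the function names, the uniform background, the toy "TATA" PSSM and the example sequence are all illustrative assumptions.

```python
import math

def background_likelihood(seq, theta0):
    """P(S | R=0): every position is drawn from the background model."""
    return math.prod(theta0[s] for s in seq)

def binding_likelihood_explicit(seq, psi, theta0):
    """P(S | R=1) written out as background * motif * background, averaged over starts."""
    n, w = len(seq), len(psi)
    if n < w:
        raise ValueError("sequence is shorter than the motif")
    total = 0.0
    for m in range(n - w + 1):  # motif occupies positions m+1 .. m+W
        left = math.prod(theta0[s] for s in seq[:m])
        motif = math.prod(psi[k][seq[m + k]] for k in range(w))
        right = math.prod(theta0[s] for s in seq[m + w:])
        total += left * motif * right
    return total / (n - w + 1)  # uniform prior over the N-W+1 start positions

def binding_likelihood_ratio_form(seq, psi, theta0):
    """Same quantity, factored as background likelihood times an average likelihood ratio."""
    n, w = len(seq), len(psi)
    if n < w:
        raise ValueError("sequence is shorter than the motif")
    ratio_sum = sum(
        math.prod(psi[k][seq[m + k]] / theta0[seq[m + k]] for k in range(w))
        for m in range(n - w + 1)
    )
    return background_likelihood(seq, theta0) * ratio_sum / (n - w + 1)

# Hypothetical example values (not taken from the slides).
theta0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
psi = [{l: (0.85 if l == c else 0.05) for l in "ACGT"} for c in "TATA"]
seq = "GCTATAGC"

print(binding_likelihood_explicit(seq, psi, theta0))
print(binding_likelihood_ratio_form(seq, psi, theta0))
```

Both functions should print the same value up to rounding, which is the factorisation used in the derivation.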

Apply Bayes rule:
$$P(R = 1 \mid S_1, S_2, \ldots, S_N) = \frac{P(S_1, S_2, \ldots, S_N \mid R = 1)\, P(R = 1)}{P(S_1, S_2, \ldots, S_N \mid R = 0)\, P(R = 0) + P(S_1, S_2, \ldots, S_N \mid R = 1)\, P(R = 1)}$$
$$= \left( 1 + \frac{P(R = 0)\, P(S_1, S_2, \ldots, S_N \mid R = 0)}{P(R = 1)\, P(S_1, S_2, \ldots, S_N \mid R = 1)} \right)^{-1}$$
$$= \left( 1 + \left[ \frac{P(R = 1)}{P(R = 0)} \, \frac{1}{N-W+1} \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})} \right]^{-1} \right)^{-1}$$

Define:
$$w_k(l) = \log \frac{\psi_k(l)}{\theta_0(l)}, \qquad w_0 = \log \frac{P(R = 1)}{P(R = 0)}, \qquad \mathrm{logit}(z) = \frac{1}{1 + \exp(-z)}$$

Then
$$P(R = 1 \mid S_1, S_2, \ldots, S_N) = \mathrm{logit}\left( w_0 + \log \left[ \frac{1}{N-W+1} \sum_{m=0}^{N-W} \exp\left( \sum_{k=1}^{W} w_k(S_{m+k}) \right) \right] \right)$$

$4W + 1$ parameters: $w_k(l)$, $w_0$
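As a rough illustration of the log-odds parameterisation, the sketch below computes P(R = 1 | sequence) from the weights w_k(l) and w_0. It assumes that w_0 = log P(R=1)/P(R=0) enters additively inside the logistic function, which is what Bayes' rule gives. The function names and the log-sum-exp stabilisation are my own choices, not part of the slides.

```python
import math

def log_odds_weights(psi, theta0):
    """w_k(l) = log(psi_k(l) / theta0(l)) for every motif position k and nucleotide l."""
    return [{l: math.log(col[l] / theta0[l]) for l in theta0} for col in psi]

def posterior_binding(seq, w, w0):
    """P(R=1 | S) = logistic(w0 + log mean_m exp(sum_k w_k(S_{m+k})))."""
    n, width = len(seq), len(w)
    if n < width:
        raise ValueError("sequence is shorter than the motif")
    # One score per candidate start position m = 0 .. N-W.
    scores = [sum(w[k][seq[m + k]] for k in range(width)) for m in range(n - width + 1)]
    # log of the mean of exp(score), computed with the log-sum-exp trick.
    top = max(scores)
    log_mean = top + math.log(sum(math.exp(s - top) for s in scores)) - math.log(len(scores))
    z = w0 + log_mean
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical example values (not taken from the slides).
theta0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
psi = [{l: (0.85 if l == c else 0.05) for l in "ACGT"} for c in "TATA"]
prior = 0.2                                   # assumed P(R = 1)
w = log_odds_weights(psi, theta0)
w0 = math.log(prior / (1.0 - prior))

print(posterior_binding("GCTATAGC", w, w0))   # sequence containing the motif
print(posterior_binding("GCGCGCGC", w, w0))   # sequence without it
```

The per-position scores play the role of "Score 1 ... Score N" in the sliding-window picture, and the logistic function is the nonlinear transfer function applied to their combination.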

[Figure: the motif is scored against every window of the sequence (Score 1, Score 2, ..., Score t, ..., Score N); the scores are summed (+) and passed through a nonlinear transfer function to give P(R=1 | sequence)]


Ab initio prediction of protein interaction
Wolfgang Lehrach, Biomathematics & Statistics Scotland

SH3 yeast two-hybrid interaction network
Tong et al. (2002), Science 295, 321-324
285 interactions between 28 SH3 proteins and 143 binding peptides
9 binding partners per SH3 domain on average

Final Test Set Performance
[Figure: ROC curves, true positive rate (sensitivity) against false positive rate (1 - specificity), both axes from 0 to 1. Legend: 0.61 Reiss; 0.62 None; 0.64 Naive; 0.69 Gaussian; 0.71 Laplacian with pruning; 0.73 Laplacian]

The model of Segal, Yelensky and Koller Bioinformatics 19, 2003

[Figure: graphical model for a gene g. The promoter sequence g.S_1, g.S_2, ..., g.S_N determines the motif variables g.R_1, g.R_2 through P(g.R_2 | g.S); the motif variables determine the module assignment g.M (modules 1, 2, 3); the module determines the expression values g.E_1, g.E_2, g.E_3 through P(g.E_3 | g.M)]

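$$P(g.R_i = 1 \mid g.S_1, g.S_2, \ldots, g.S_N) = \mathrm{logit}\left( w_0^i + \log \left[ \frac{1}{N-W+1} \sum_{m=0}^{N-W} \exp\left( \sum_{k=1}^{W} w_k^i(g.S_{m+k}) \right) \right] \right)$$

$4W + 1$ parameters per binding motif $R_i$: $w_k^i(l)$, $w_0^i$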


Softmax function
$$P(g.M = m \mid g.R_1 = r_1, g.R_2 = r_2, \ldots, g.R_L = r_L) = \frac{\exp\left( \sum_{i=1}^{L} u_{mi} r_i \right)}{\sum_{m'} \exp\left( \sum_{i=1}^{L} u_{m'i} r_i \right)}$$
Parameter matrix $u$: number of motifs/regulators $\times$ number of modules
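A small sketch of the softmax step in Python, assuming the parameter matrix is stored as a list of rows `u[m][i]` (module m, motif i) and the motif indicators `r[i]` are 0 or 1. The function name, the storage layout and the example weights are illustrative assumptions.

```python
import math

def module_probabilities(u, r):
    """Softmax over modules: P(M = m | r) is proportional to exp(sum_i u[m][i] * r[i])."""
    logits = [sum(u_mi * r_i for u_mi, r_i in zip(row, r)) for row in u]
    top = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights: 3 modules, 2 motifs.
u = [
    [2.0, 0.0],   # module 1 favours genes carrying motif 1
    [0.0, 2.0],   # module 2 favours genes carrying motif 2
    [0.0, 0.0],   # module 3 is indifferent
]
print(module_probabilities(u, [1, 0]))  # gene with motif 1 only
print(module_probabilities(u, [0, 0]))  # gene with no motif: uniform over modules
```

Subtracting the largest logit before exponentiating does not change the result and avoids overflow for large weights.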


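Independent Gaussian distributions
$$P(g.E_1, g.E_2, \ldots, g.E_L \mid g.M = m) = \prod_{j} P(g.E_j \mid g.M = m)$$
$$P(g.E_j \mid g.M = m) = N(\mu_{j,m}, \sigma_{j,m})$$
(here $L$ counts the expression conditions)
For each module $m$ and each condition $j$:
Mean: $\mu_{j,m}$
Standard deviation: $\sigma_{j,m}$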
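The expression model can be evaluated in log space as a sum of independent Gaussian log-densities. A minimal sketch follows; the function names and the two hypothetical modules are assumptions made for illustration.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * sigma * sigma) - (x - mu) ** 2 / (2.0 * sigma * sigma)

def expression_log_likelihood(expr, means, sigmas):
    """log P(E_1, ..., E_L | M = m): sum over conditions j of independent Gaussian terms."""
    return sum(gaussian_log_pdf(x, mu, s) for x, mu, s in zip(expr, means, sigmas))

# Hypothetical module parameters over three conditions.
module_means = {"up": [1.0, 1.5, 2.0], "down": [-1.0, -1.5, -2.0]}
module_sigmas = {"up": [0.5, 0.5, 0.5], "down": [0.5, 0.5, 0.5]}

profile = [0.9, 1.4, 2.2]  # made-up expression profile of one gene
for name in module_means:
    print(name, expression_log_likelihood(profile, module_means[name], module_sigmas[name]))
```

Working with log-likelihoods keeps the product over many conditions from underflowing.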

Parameter estimation


Bayesian approach
$$P(\text{parameters} \mid \text{data}) = \sum_{\text{latent variables}} P(\text{parameters}, \text{latent variables} \mid \text{data})$$
Intractable!
Gibbs sampling:
parameters $\sim P(\text{parameters} \mid \text{latent variables}, \text{data})$
latent variables $\sim P(\text{latent variables} \mid \text{parameters}, \text{data})$

[Figure: Gibbs sampling from a joint distribution P(x,y) in the (x, y) plane by alternately sampling from the conditionals P(x | y) and P(y | x)]
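The alternating scheme can be illustrated on a toy target. The sketch below is not the Segal model: it samples from a standard bivariate normal with correlation rho, chosen only because both conditionals are known in closed form, with x | y distributed as N(rho * y, sqrt(1 - rho^2)) and symmetrically for y | x. The function name, burn-in length and seed are assumptions.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)    # standard deviation of each conditional
    x, y = 0.0, 0.0                          # arbitrary starting point
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)      # draw x from P(x | y)
        y = rng.gauss(rho * x, cond_sd)      # draw y from P(y | x)
        if step >= burn_in:                  # discard the initial transient
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000, seed=1)
mean_x = sum(x for x, _ in samples) / len(samples)
mean_y = sum(y for _, y in samples) / len(samples)
print(mean_x, mean_y)
```

In the model from the talk, the roles of x and y are played by the parameters and the latent variables, whose conditionals are far more expensive to sample.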

Still too expensive
Find one good set of parameters rather than a whole sample from the posterior distribution.
Hard-assignment EM algorithm
Various heuristic simplifications
See Bioinformatics 19, 2003 for details
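To show what hard-assignment EM means in general, here is a toy sketch on a one-dimensional Gaussian mixture. It is not the procedure of Segal et al., which involves the sequence and module variables and further heuristics; the function names, data and starting values are assumptions. The E-step assigns each point to its single most likely component, and the M-step re-estimates each component from the points assigned to it.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * sigma * sigma) - (x - mu) ** 2 / (2.0 * sigma * sigma)

def hard_em(data, means, sigmas, n_iter=20, min_sigma=1e-3):
    """Hard-assignment EM for a 1-D Gaussian mixture with equal mixing weights."""
    means, sigmas = list(means), list(sigmas)
    assignments = None
    for _ in range(n_iter):
        # E-step (hard): pick the single most likely component for every point.
        new_assignments = [
            max(range(len(means)), key=lambda m: gaussian_log_pdf(x, means[m], sigmas[m]))
            for x in data
        ]
        if new_assignments == assignments:   # converged: assignments stopped changing
            break
        assignments = new_assignments
        # M-step: maximum-likelihood mean and standard deviation per component.
        for m in range(len(means)):
            members = [x for x, a in zip(data, assignments) if a == m]
            if not members:                   # leave empty components unchanged
                continue
            means[m] = sum(members) / len(members)
            variance = sum((x - means[m]) ** 2 for x in members) / len(members)
            sigmas[m] = max(math.sqrt(variance), min_sigma)
    return means, sigmas, assignments

data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]     # made-up observations
means, sigmas, assignments = hard_em(data, means=[-1.0, 1.0], sigmas=[1.0, 1.0])
print(means, assignments)
```

The result is a single point estimate of the parameters plus one assignment per data point, rather than a sample from the posterior distribution.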

[Figure: the graphical model (g.S_1, ..., g.S_N; g.R_1, g.R_2; g.M; g.E_1, g.E_2, g.E_3) shown during the E-step and the M-step]

Application to Saccharomyces cerevisiae
Segal, Yelensky, Koller (2003) Bioinformatics 19

From Segal et al, Bioinformatics 2003

Experiment 1
173 microarrays, measuring responses to various stress conditions (Gasch et al. 2000)
Conventional algorithms: 20% of the predicted motifs are known
Unified probabilistic model: 45% of the predicted motifs are known

Experiment 2
77 microarrays, expression during the cell cycle (Spellman et al. 1998)
Conventional algorithms: 30% of the predicted motifs are known
Unified probabilistic model: 56% of the predicted motifs are known

From Segal et al, Bioinformatics 2003