Motif finding with Gibbs sampling. CS 466 Saurabh Sinha



Regulatory networks Genes are switches; transcription factors are (one type of) input signals; proteins are outputs. Proteins (outputs) may themselves be transcription factors and hence become signals for other genes (switches). This may be why humans have so few genes: the circuit, not the number of switches, carries the complexity.

Decoding the regulatory network Find patterns ("motifs") in DNA sequence that occur more often than expected by chance. These are likely to be binding sites for transcription factors. Knowing these can tell us whether a gene is regulated by a transcription factor (i.e., the "switch").

Transcriptional regulation [figure: a TRANSCRIPTION FACTOR protein binds the site ACAGTGA upstream of a GENE]


A motif model To define a motif, let's say we know where the motif starts in each sequence. The motif start positions in their sequences can be represented as s = (s1, s2, s3, …, st), one start per gene regulated by the same transcription factor.

Motifs: Matrices and Consensus
Line up the patterns by their start indexes s = (s1, s2, …, st), construct a position weight matrix with the frequency of each nucleotide in each column, and read off the consensus: the nucleotide with the highest frequency in each column.

Alignment:
  a G g t a c T t
  C c A t a c g t
  a c g t T A g t
  a c g t C c A t
  C c g t a c g G

Matrix:
  A: 3 0 1 0 3 1 1 0
  C: 2 4 0 0 1 4 0 0
  G: 0 1 4 0 0 0 3 1
  T: 0 0 0 5 1 0 1 4

Consensus: A C G T A C G T
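As a concrete check, the count matrix and consensus above can be recomputed with a short script (a sketch; the site strings and helper names are ours, not from the lecture):

```python
# Hypothetical sketch: build the per-column count matrix and consensus
# from the five aligned sites shown above (case is ignored).
from collections import Counter

sites = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]

def count_matrix(sites):
    """Per-column nucleotide counts for equal-length aligned sites."""
    width = len(sites[0])
    counts = []
    for k in range(width):
        col = Counter(s[k].upper() for s in sites)
        counts.append({b: col.get(b, 0) for b in "ACGT"})
    return counts

def consensus(counts):
    """Most frequent base in each column."""
    return "".join(max(col, key=col.get) for col in counts)

counts = count_matrix(sites)
print(consensus(counts))  # ACGTACGT
```

Dividing each count by the number of sites (here 5) turns this count matrix into the frequency form of the PWM used below.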

Position weight matrices Suppose there were t sequences to begin with, and consider a column of the position weight matrix. The column may be (t, 0, 0, 0): a perfectly conserved column. Or it may be (t/4, t/4, t/4, t/4): a completely uniform column. Good profile matrices should have more conserved columns.

Information Content
In a PWM, convert frequencies to probabilities. For a PWM W, let W_βk be the frequency of base β at position k, and q_β the frequency of base β by chance. The information content of W is

  Σ_k Σ_{β ∈ {A,C,G,T}} W_βk log (W_βk / q_β)

Information Content If W_βk always equals q_β, i.e., if W looks like random sequence, the information content of W is 0. If W is different from q, the information content is high.
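A minimal sketch of this computation, assuming a uniform background q_β = 0.25 and natural logarithms (the example columns are ours, not from the slides):

```python
# Sketch: information content of a PWM given as a list of per-column
# probability dicts, against a uniform background q = 0.25.
import math

def information_content(W, q=0.25):
    """Sum over positions k and bases b of W[k][b] * log(W[k][b] / q)."""
    total = 0.0
    for col in W:
        for prob in col.values():
            if prob > 0:          # 0 * log 0 is taken as 0
                total += prob * math.log(prob / q)
    return total

uniform   = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
conserved = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}]
print(information_content(uniform))    # 0.0: indistinguishable from background
print(information_content(conserved))  # log(4), about 1.386: perfectly conserved
```

The two extreme columns mirror the previous slide: a uniform column contributes nothing, a perfectly conserved column contributes the maximum log(4) per position.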

"Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment" (Lawrence et al., 1993)

Motif Finding Problem Given a set of sequences, find the motif shared by all or most sequences, where its starting position in each sequence is unknown. Assumptions: each motif appears exactly once per sequence, and the motif has a fixed length.

Generative Model Suppose the sequences are aligned; the aligned regions are generated from a motif model. The motif model is a PWM: a position-specific multinomial distribution. For each position i there is a multinomial distribution on (A, C, G, T): q_iA, q_iC, q_iG, q_iT. The unaligned regions are generated from a background model: p_A, p_C, p_G, p_T.

Notation
Set of symbols: {A, C, G, T}
Sequences: S = {S1, S2, …, SN}
Starting positions of motifs: A = {a1, a2, …, aN}
Motif model (θ): q_ij = P(symbol at the i-th position = j)
Background model: p_j = P(symbol = j)
Counts: c_ij = count of symbol j in the i-th column of the aligned region

Probability of data given model

  P(S | A, θ) = ∏_{i=1}^{W} ∏_j (q_ij)^(c_ij)

  P(S | A, θ0) = ∏_{i=1}^{W} ∏_j (p_j)^(c_ij)

Scoring Function
Maximize the log-odds ratio:

  P(S | A, θ) / P(S | A, θ0) = [∏_{i=1}^{W} ∏_j (q_ij)^(c_ij)] / [∏_{i=1}^{W} ∏_j (p_j)^(c_ij)]

  F = log [ P(S | A, θ) / P(S | A, θ0) ] = Σ_{i=1}^{W} Σ_j c_ij log (q_ij / p_j)

F is greater than zero if the data is a better match to the motif model than to the background model.
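The log-odds score F can be computed directly from this formula; here is a sketch with toy counts and probabilities of our own (not from the slides):

```python
# Sketch: F = sum over columns i and bases j of c_ij * log(q_ij / p_j).
import math

def score_F(counts, q, p):
    """Log-odds of the motif model against the background model."""
    F = 0.0
    for c_i, q_i in zip(counts, q):      # one (counts, probs) pair per column
        for j in "ACGT":
            if c_i[j] > 0:               # columns with zero count contribute 0
                F += c_i[j] * math.log(q_i[j] / p[j])
    return F

counts = [{"A": 5, "C": 0, "G": 0, "T": 0}]         # one perfectly conserved column
q      = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] # motif model for that column
p      = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(score_F(counts, q, p))  # 5 * log(4), about 6.93 > 0: better than background
```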

Optimization and Sampling To maximize a function f(x): Brute-force method: try all possible x. Sampling method: sample x from a probability distribution p(x) ∝ f(x). Idea: if x_max is the argmax of f(x), then it is also the argmax of p(x), so we have a high probability of selecting x_max.

Markov Chain sampling
To sample from a probability distribution p(x), we set up a Markov chain such that each state represents a value of x, and for any two states x and y the transition probabilities satisfy detailed balance:

  p(x) Pr(x → y) = p(y) Pr(y → x)

This implies that if the Markov chain is run for long enough, the probability thereafter of being in state x will be p(x):

  lim_{N→∞} C(x)/N = p(x)

where C(x) is the number of times the chain visits state x out of N steps.
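A tiny sanity check of this claim, using a hand-built two-state chain that satisfies detailed balance for a target distribution p = (0.75, 0.25) (all numbers here are ours, chosen for illustration):

```python
# Sketch: a two-state Markov chain obeying detailed balance for
# p = (0.75, 0.25); long-run visit frequencies approach p.
import random

p = [0.75, 0.25]
# Transition matrix chosen so that p[0]*Pr(0->1) == p[1]*Pr(1->0):
Pr = [[0.9, 0.1],   # from state 0
      [0.3, 0.7]]   # from state 1
assert abs(p[0] * Pr[0][1] - p[1] * Pr[1][0]) < 1e-12  # detailed balance

random.seed(0)
state, visits, N = 0, [0, 0], 200_000
for _ in range(N):
    state = 0 if random.random() < Pr[state][0] else 1
    visits[state] += 1
print(visits[0] / N)  # close to 0.75
```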

Gibbs sampling to maximize F Gibbs sampling is a special type of Markov chain sampling algorithm. Our goal is to find the optimal A = (a1, …, aN). The Markov chain we construct will only have transitions from A to alignments A′ that differ from A in only one of the ai. In round-robin order, pick one of the ai to replace; consider all A′ formed by replacing ai with some other starting position ai′ in sequence Si; move to one of these A′ probabilistically; iterate the last three steps.

Algorithm
Randomly initialize A0. Repeat:
(1) Randomly choose a sequence z from S; set A* = At \ {az}; compute θt from A*, with q_ij = c_ij / Σ_k c_ik.
(2) Sample az according to P(az = x), which is proportional to Qx / Px; update At+1 = A* ∪ {x}.
Finally, select the At that maximizes F.
Qx: the probability of generating x according to θt. Px: the probability of generating x according to the background model.
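The loop above can be sketched end to end in Python. Assumptions of ours, not from the lecture: helper names are invented, q_ij gets a +1 pseudocount so it is never zero, and the background p_j is uniform by default.

```python
# Sketch of the Gibbs motif sampler: resample one start position a_z
# per iteration, weighted by Q_x / P_x.
import math
import random

def profile_from(seqs, starts, W, exclude, pseudo=1.0):
    """Estimate q_ij from all aligned sites except sequence `exclude` (A*)."""
    counts = [{b: pseudo for b in "ACGT"} for _ in range(W)]
    for n, s in enumerate(seqs):
        if n == exclude:
            continue
        for i in range(W):
            counts[i][s[starts[n] + i]] += 1
    return [{b: col[b] / sum(col.values()) for b in "ACGT"} for col in counts]

def gibbs_motif(seqs, W, iters=500, p_bg=None, seed=0):
    rng = random.Random(seed)
    p_bg = p_bg or {b: 0.25 for b in "ACGT"}
    starts = [rng.randrange(len(s) - W + 1) for s in seqs]   # random A_0
    for _ in range(iters):
        z = rng.randrange(len(seqs))                         # sequence to resample
        q = profile_from(seqs, starts, W, exclude=z)         # theta_t from A*
        weights = []
        for x in range(len(seqs[z]) - W + 1):
            site = seqs[z][x:x + W]
            Q_x = math.prod(q[i][site[i]] for i in range(W))  # motif model prob.
            P_x = math.prod(p_bg[b] for b in site)            # background prob.
            weights.append(Q_x / P_x)
        starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

A production sampler would also track the best-scoring At seen so far and return it, as the slide's final step requires; this sketch returns only the last state.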

Algorithm (illustrated)
Start from the current solution At.
Choose one az to replace.
For each candidate site x in sequence z, calculate Qx and Px: the probabilities of sampling x from the motif model and the background model respectively.
Among all possible candidates, choose one (say x) with probability proportional to Qx/Px.
Set At+1 = A* ∪ {x}.
Repeat.

Local optima The algorithm may not find the global (true) maximum of the scoring function. Once At contains many similar substrings, others matching these will be chosen with higher probability. The algorithm can get locked into a local optimum: all neighbors have poorer scores, hence there is little chance of moving out of this solution.