I529: Machine Learning in Bioinformatics (Spring 2013) Markov Models


Yuzhen Ye, School of Informatics and Computing, Indiana University, Bloomington, Spring 2013

Outline
- Simple model (frequency & profile) review
- Markov chain
  - CpG island question 1
  - Model comparison by log likelihood ratio test
- Markov chain variants
  - kth order
  - Inhomogeneous Markov chains
  - Interpolated Markov models (IMM)
- Applications
  - Gene finding (Genemark & Glimmer)
  - Taxonomic assignment in metagenomics (Phymm)

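A DNA profile (matrix)

Aligned training sequences: TATAAA, TATAAT, TATAAA, TATAAA, TATAAA, TATTAA, TTAAAA, TAGAAA

Counts per position:

     1  2  3  4  5  6
T    8  1  6  1  0  1
C    0  0  0  0  0  0
A    0  7  1  7  8  7
G    0  0  1  0  0  0

Sparse data: add pseudo-counts (here, 1 is added to every cell):

     1  2  3  4  5  6
T    9  2  7  2  1  2
C    1  1  1  1  1  1
A    1  8  2  8  9  8
G    1  1  2  1  1  1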
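The two count tables can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `profile_counts` and `profile_probabilities` are made up for illustration, and the pseudo-count of 1 per cell is taken from the second table.

```python
ALPHABET = "TCAG"  # row order used in the profile tables

def profile_counts(seqs, pseudocount=0):
    """Return {base: [count at position 1, 2, ...]} for aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    counts = {base: [pseudocount] * length for base in ALPHABET}
    for seq in seqs:
        for pos, base in enumerate(seq):
            counts[base][pos] += 1
    return counts

def profile_probabilities(counts):
    """Normalize every column of a count profile so that it sums to 1."""
    length = len(next(iter(counts.values())))
    totals = [sum(counts[b][pos] for b in counts) for pos in range(length)]
    return {b: [counts[b][pos] / totals[pos] for pos in range(length)]
            for b in counts}

sites = ["TATAAA", "TATAAT", "TATAAA", "TATAAA",
         "TATAAA", "TATTAA", "TTAAAA", "TAGAAA"]
raw = profile_counts(sites)                      # first table
smoothed = profile_counts(sites, pseudocount=1)  # second table
probs = profile_probabilities(smoothed)          # column-wise probabilities
```

With pseudo-counts, no base has probability zero at any position, so a sequence with a base unseen in training no longer gets probability zero under the profile.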

Frequency & profile model

Frequency model: the order of nucleotides in the training sequences is ignored.
Profile model: the training sequences are aligned, so the order of nucleotides in the training sequences is fully preserved.
Markov chain model: the order is partially incorporated.

Markov chain model

Sometimes we need to model dependencies between adjacent positions in the sequence. Some patterns are tied to particular regions of the genome, like TATA within the regulatory area upstream of a gene. The pattern CG is less common than expected under random sampling. Such dependencies can be modeled by Markov chains.

Markov chains

A Markov chain is a sequence of random variables with the Markov property, i.e., given the present state, the future and the past are independent. A famous example of a Markov chain is the drunkard's walk: at each step, the position may change by +1 or −1 with equal probability. For example, Pr(5→4) = Pr(5→6) = 0.5, and all other transition probabilities from 5 are 0. These probabilities are independent of whether the system was previously in position 4 or 6.

1st order Markov chain

An integer-time stochastic process, consisting of a set of m > 1 states $\{s_1, \ldots, s_m\}$ and
1. An m-dimensional initial distribution vector $(p(s_1), \ldots, p(s_m))$
2. An m × m transition probabilities matrix $M = (a_{s_i s_j})$

For example, for a DNA sequence the states are {A, C, T, G} (m = 4); p(A) is the probability of A being the 1st letter, and $a_{AG}$ is the probability that G follows A in a sequence.

1st order Markov chain

[Figure: chain $X_1 \to X_2 \to \cdots \to X_{n-1} \to X_n$.]

For each integer n, a Markov chain assigns probability to sequences $(x_1 \ldots x_n)$ as follows:

$$p((x_1, x_2, \ldots, x_n)) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}) = p(x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}$$

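Matrix representation

The transition probabilities matrix $M = (a_{st})$, with the current state s in rows and the next state t in columns:

        A      B      C      D
A     0.95    0     0.05    0
B     0.2     0.5    0      0.3
C     0       0.2    0      0.8
D     0       0      1      0

M is a stochastic matrix: $\sum_t a_{st} = 1$ for every state s.
The initial distribution vector $(u_1, \ldots, u_m)$ defines the distribution of $X_1$: $p(X_1 = s_i) = u_i$.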

Digraph (directed graph) representation

[Figure: digraph over the states A, B, C, D, with each edge labeled by its transition probability.]

Each directed edge A → B is associated with the positive transition probability from A to B.

Classification of Markov chain states

States of Markov chains are classified by the digraph representation (omitting the actual probability values).

[Figure: digraph over the states A, B, C, D.]

A, C and D are recurrent states: they are in strongly connected components which are sinks in the graph. B is not recurrent; it is a transient state.

Alternative definition: a state s is recurrent if it can be reached from any state reachable from s; otherwise it is transient.
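The alternative definition translates directly into a reachability check. The sketch below is illustrative only: the edge list in `graph` is a hypothetical digraph chosen so that A, C and D come out recurrent and B transient; it is not claimed to be the graph drawn on the slide.

```python
def reachable(graph, start):
    """Return the set of states reachable from start in one or more steps."""
    seen, stack = set(), list(graph[start])
    while stack:
        state = stack.pop()
        if state not in seen:
            seen.add(state)
            stack.extend(graph[state])
    return seen

def classify_states(graph):
    """A state s is recurrent if every state reachable from s can reach s."""
    reach = {s: reachable(graph, s) for s in graph}
    return {
        s: "recurrent" if all(s in reach[t] for t in reach[s]) else "transient"
        for s in graph
    }

# Hypothetical digraph (edges only, probabilities omitted).
graph = {
    "A": ["A"],            # A only loops to itself
    "B": ["A", "B", "C"],  # B can leave to A or C and never return
    "C": ["D"],
    "D": ["C"],
}
labels = classify_states(graph)
```

Only the edge structure matters here, which is why the classification can omit the actual probability values.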

Another example of recurrent and transient states

[Figure: digraph over the states A, B, C, D.]

A and B are transient states; C and D are recurrent states. Once the process moves from B to D, it will never come back.

A 3-state Markov model of the weather

Assume the weather can be: rain or snow (state 1), cloudy (state 2), or sunny (state 3).
Assume the weather of any day t is characterized by one of the three states.
The transition probabilities between the three states:

$$A = \{a_{ij}\} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$$

Questions:
- Given that the first day is sunny, what is the probability that the weather for the following 7 days will be sun-sun-rain-rain-sun-cloudy-sun?
- What is the probability of the weather staying in a state for d days?

Rabiner (1989)
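Both questions reduce to products of entries of A. The following sketch uses 0-based state indices, and the function names are illustrative. The duration formula $a_{ii}^{d-1}(1 - a_{ii})$ (exactly d consecutive days in state i, then a move to another state) is the standard geometric result and is an addition here, since the slide only poses the question.

```python
A = [
    [0.4, 0.3, 0.3],  # from rain/snow
    [0.2, 0.6, 0.2],  # from cloudy
    [0.1, 0.1, 0.8],  # from sunny
]
RAIN, CLOUDY, SUNNY = 0, 1, 2  # 0-based versions of states 1, 2, 3

def path_probability(A, states):
    """Probability of the transitions along `states`, given states[0] as the known start."""
    prob = 1.0
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob

def stay_probability(A, state, d):
    """Probability of exactly d consecutive days in `state`, then leaving it."""
    return A[state][state] ** (d - 1) * (1.0 - A[state][state])

# Day 1 is sunny (given), followed by sun-sun-rain-rain-sun-cloudy-sun.
path = [SUNNY, SUNNY, SUNNY, RAIN, RAIN, SUNNY, CLOUDY, SUNNY]
p_path = path_probability(A, path)    # a33*a33*a31*a11*a13*a32*a23
p_stay = stay_probability(A, SUNNY, 3)
```

Because the first day is given, its probability is 1 and only the seven transitions contribute to `p_path`.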

CpG island modeling

In mammalian genomes, the dinucleotide CG often transforms to (methyl-C)G, which often subsequently mutates to TG. Hence CG appears less often than expected from the independent frequencies of C and G alone. For biological reasons, this process is sometimes suppressed in short stretches of genomes, such as the upstream regions of many genes. These areas are called CpG islands.

Questions about CpG islands

We consider two questions (and some variants):
Question 1: Given a short stretch of genomic data, does it come from a CpG island?
Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and how long?

We solve the first question by modeling sequences with and without CpG islands as Markov chains over the same states {A, C, G, T} but with different transition probabilities.

Markov models for (non) CpG islands

The "+" model: use transition matrix $A^+ = (a^+_{st})$, where $a^+_{st}$ is the probability that t follows s in a CpG island (estimated from positive samples).
The "−" model: use transition matrix $A^- = (a^-_{st})$, where $a^-_{st}$ is the probability that t follows s in a non-CpG island sequence (estimated from negative samples).

With these two models, to solve Question 1 we need to decide whether a given short sequence is more likely to come from the "+" model or from the "−" model. This is done by using the definition of a Markov chain, in which the parameters are determined from training data.

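Matrices of the transition probabilities

Rows are $X_{i-1}$, columns are $X_i$; each entry is $p(x_i \mid x_{i-1})$, and rows sum to 1.

$A^+$ (CpG islands):

        A      C      G      T
A     0.180  0.274  0.426  0.120
C     0.171  0.368  0.274  0.188
G     0.161  0.339  0.375  0.125
T     0.079  0.355  0.384  0.182

$A^-$ (non-CpG islands):

        A      C      G      T
A     0.300  0.205  0.285  0.210
C     0.322  0.298  0.078  0.302
G     0.248  0.246  0.298  0.208
T     0.177  0.239  0.292  0.292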

Model comparison

Given a sequence $x = (x_1 \ldots x_L)$, now compute the likelihood ratio

$$\mathrm{RATIO} = \frac{p(x_1 \ldots x_L \mid + \text{ model})}{p(x_1 \ldots x_L \mid - \text{ model})} = \frac{\prod_{i=1}^{L} p^+(x_i \mid x_{i-1})}{\prod_{i=1}^{L} p^-(x_i \mid x_{i-1})}$$

If RATIO > 1, a CpG island is more likely. In practice the log of this ratio is computed.

Note: $p^+(x_1 \mid x_0)$ is defined for convenience as $p^+(x_1)$, and $p^-(x_1 \mid x_0)$ is defined for convenience as $p^-(x_1)$.

Log likelihood ratio test

Taking the logarithm yields

$$\log Q = \log \frac{p(x_1 \ldots x_L \mid +)}{p(x_1 \ldots x_L \mid -)} = \sum_{i=1}^{L} \log \frac{p^+(x_i \mid x_{i-1})}{p^-(x_i \mid x_{i-1})}$$

If log Q > 0, then + is more likely (CpG island).
If log Q < 0, then − is more likely (non-CpG island).

A toy example

Sequence: CGACTGAACCG
P(CGACTGAACCG | +) = ?
P(CGACTGAACCG | −) = ?
Log likelihood ratio?
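One way to work the toy example is to sum the per-transition log ratios. The sketch below copies the values from the two transition tables. It assumes that both models assign the same probability to the first base, so the $x_1$ term cancels; the slides do not give the initial distributions, so this is an assumption, and the function name is illustrative.

```python
import math

COLS = "ACGT"  # column order of the transition tables (X_i)

# Rows are X_{i-1}; values copied from the two transition tables.
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}

def log_likelihood_ratio(seq, plus=PLUS, minus=MINUS):
    """Sum of log(a+_st / a-_st) over adjacent pairs (s, t) of seq.

    Assumption: both models give the first base the same probability,
    so the x_1 term cancels and is left out.
    """
    total = 0.0
    for s, t in zip(seq, seq[1:]):
        j = COLS.index(t)
        total += math.log(plus[s][j] / minus[s][j])
    return total

log_q = log_likelihood_ratio("CGACTGAACCG")
verdict = "CpG island (+)" if log_q > 0 else "non-CpG island (-)"
print(f"log Q = {log_q:.3f} nats -> {verdict}")
```

Working in log space turns the product of ten ratios into a sum, which also avoids numerical underflow on longer sequences.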

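Where do the parameters (transition probabilities) come from?

Learning from training data.
Source: a collection of sequences from CpG islands, and a collection of sequences from non-CpG islands.
Input: tuples of the form $(x_1, \ldots, x_L, h)$, where h is + or −.
Output: maximum likelihood estimates (MLE) of the parameters.

Count all pairs $(X_i = a, X_{i-1} = b)$ with label +, and with label −; say the numbers are $N_{ba,+}$ and $N_{ba,-}$. The maximum likelihood estimate of each transition probability is the corresponding count divided by its row total:

$$a^+_{ba} = \frac{N_{ba,+}}{\sum_{a'} N_{ba',+}}, \qquad a^-_{ba} = \frac{N_{ba,-}}{\sum_{a'} N_{ba',-}}$$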
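The counting step can be sketched as follows. The training tuples in `labeled` are hypothetical toy data invented for this example, and the optional `pseudocount` argument is an addition (not from this slide) that keeps unseen pairs from getting probability zero.

```python
def estimate_transitions(seqs, alphabet="ACGT", pseudocount=0.0):
    """Estimate a_st = N_st / sum_t' N_st' from adjacent pairs in seqs."""
    counts = {s: {t: pseudocount for t in alphabet} for s in alphabet}
    for seq in seqs:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    probs = {}
    for s in alphabet:
        total = sum(counts[s].values())
        # A row with no observations is left at zero rather than guessed.
        probs[s] = {t: (counts[s][t] / total if total else 0.0) for t in alphabet}
    return probs

# Hypothetical toy training tuples (sequence, label).
labeled = [("CGCGCGGC", "+"), ("ATTATAAT", "-"), ("GCGCCG", "+")]
plus_model = estimate_transitions(
    [x for x, h in labeled if h == "+"], pseudocount=1.0)
minus_model = estimate_transitions(
    [x for x, h in labeled if h == "-"], pseudocount=1.0)
```

Each model is trained only on the sequences carrying its own label, so the two matrices differ exactly where the two classes of sequence differ.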

CpG island: question 2

Question 2: Given a long piece of genomic data, does it contain CpG islands, and where?

For this, we need to decide which parts of a given long sequence of letters are more likely to come from the "+" model, and which parts are more likely to come from the "−" model.

We will define a Markov chain over 8 states: A+, C+, G+, T+, A−, C−, G−, T−.

The problem is that we don't know the sequence of states (hidden) which are traversed, but just the sequence of letters (observation). Hidden Markov Model!

Markov model variations
- kth order Markov chains (Markov chains with memory)
- Inhomogeneous Markov chains (vs homogeneous Markov chains)
- Interpolated Markov chains

kth order Markov chain (a Markov chain with memory k)

A kth order Markov chain assigns probability to sequences $(x_1 \ldots x_n)$ as follows:

$$p(x_1 \ldots x_n) = p(X_1 = x_1, \ldots, X_k = x_k) \prod_{i=k+1}^{n} p(X_i = x_i \mid X_{i-1} = x_{i-1}, \ldots, X_{i-k} = x_{i-k})$$

The first factor is the initial distribution; the factors in the product are the transition probabilities.

Inhomogeneous Markov chain for gene finding

[Figure: chain $X_1 \to X_2 \to \cdots \to X_7$ whose successive transitions use the matrices a, b, c, a, b, c.]

Again, the parameters (the transition probabilities a, b, and c) need to be learned from training samples.

Inhomogeneous Markov chain: prediction

Transition matrices applied to the successive transitions of $X_1 \ldots X_7$ in each reading frame:

Reading frame 1:  a  b  c  a  b  c
Reading frame 2:  c  a  b  c  a  b
Reading frame 3:  b  c  a  b  c  a

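Gene finding using inhomogeneous Markov chain

Consider the sequence $x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 \ldots$, where $x_i$ is a nucleotide. Let

$$p_1 = a_{x_1 x_2}\, b_{x_2 x_3}\, c_{x_3 x_4}\, a_{x_4 x_5}\, b_{x_5 x_6}\, c_{x_6 x_7} \cdots$$
$$p_2 = c_{x_1 x_2}\, a_{x_2 x_3}\, b_{x_3 x_4}\, c_{x_4 x_5}\, a_{x_5 x_6}\, b_{x_6 x_7} \cdots$$
$$p_3 = b_{x_1 x_2}\, c_{x_2 x_3}\, a_{x_3 x_4}\, b_{x_4 x_5}\, c_{x_5 x_6}\, a_{x_6 x_7} \cdots$$

Then the probability that the ith reading frame is the coding frame is:

$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$

Genemark (gene finder for bacterial genomes)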
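The frame probabilities $P_i$ can be computed by cycling through the three matrices with a frame-dependent offset. This is a sketch of the formula above only, not of Genemark itself. The matrices built by the helper `biased` are hypothetical placeholders (real ones would be learned from training data), and the computation is done in log space to avoid underflow on long sequences.

```python
import math

def frame_posteriors(seq, a, b, c):
    """Return [P_1, P_2, P_3] with P_i = p_i / (p_1 + p_2 + p_3)."""
    matrices = (a, b, c)
    log_p = []
    for frame in range(3):  # frame 0 uses a,b,c,...; 1 uses c,a,b,...; 2 uses b,c,a,...
        total = 0.0
        for j, (s, t) in enumerate(zip(seq, seq[1:])):
            total += math.log(matrices[(j - frame) % 3][s][t])
        log_p.append(total)
    shift = max(log_p)  # subtract the maximum before exponentiating
    weights = [math.exp(v - shift) for v in log_p]
    norm = sum(weights)
    return [w / norm for w in weights]

def biased(favored, p=0.4):
    """Hypothetical matrix: every row puts probability p on `favored`."""
    rest = (1.0 - p) / 3
    return {s: {t: (p if t == favored else rest) for t in "ACGT"}
            for s in "ACGT"}

# Placeholder parameters, chosen only so that frame 1 fits the toy sequence.
a, b, c = biased("A"), biased("C"), biased("G")
posteriors = frame_posteriors("TACGACGACG", a, b, c)
```

The index `(j - frame) % 3` reproduces the a-b-c, c-a-b and b-c-a orderings of the three reading frames.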

Selecting the order of a Markov chain

For Markov models, what order should we choose? A higher order means more memory (higher predictive value), but also more parameters to learn. The higher the order, the less reliable the parameter estimates.

E.g., for a DNA sequence of 100 kbp:
- 2nd order Markov chain: $4^3 = 64$ parameters, each history-plus-base combination seen about 1562 times on average
- 5th order: $4^6 = 4096$ parameters, about 24 times on average
- 8th order: $4^9 = 262144$ parameters, about 0.4 times on average
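The arithmetic behind these figures is short enough to check directly. The sketch assumes a 100 kbp training sequence and approximates the number of observed (k+1)-mers by the sequence length; the function name is illustrative.

```python
def order_statistics(order, seq_length, alphabet_size=4):
    """Return (number of parameters, average observations per parameter).

    A kth order chain has alphabet_size**(k+1) transition parameters.
    The number of observed (k+1)-mers is approximated by seq_length.
    """
    n_params = alphabet_size ** (order + 1)
    return n_params, seq_length / n_params

for k in (2, 5, 8):
    n_params, avg = order_statistics(k, 100_000)
    print(f"order {k}: {n_params} parameters, {avg:.1f} observations each on average")
```

The average drops by a factor of 4 with every extra order, which is why high-order estimates become unreliable so quickly.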

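Interpolated Markov models (IMMs)

IMMs are also called variable-order Markov models. An IMM uses a variable number of preceding states to compute the probability of the next state.

Simple linear interpolation:

$$P_{\mathrm{IMM}}(x_i \mid x_{i-n}, \ldots, x_{i-1}) = \lambda_0 P(x_i) + \lambda_1 P(x_i \mid x_{i-1}) + \cdots + \lambda_n P(x_i \mid x_{i-n}, \ldots, x_{i-1})$$

General linear interpolation (the weights depend on the context):

$$P_{\mathrm{IMM}}(x_i \mid x_{i-n}, \ldots, x_{i-1}) = \lambda_0 P(x_i) + \lambda_1(x_{i-1}) P(x_i \mid x_{i-1}) + \cdots + \lambda_n(x_{i-n}, \ldots, x_{i-1}) P(x_i \mid x_{i-n}, \ldots, x_{i-1})$$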
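Simple linear interpolation can be sketched with plain context counts. Everything specific below is an assumption made for illustration: the training string, the fixed weights in `lambdas` (which must sum to 1), and the function names. An unseen context contributes zero here, which a real implementation would handle more carefully.

```python
from collections import Counter

def context_counts(seq, max_order):
    """Count every (context, next base) pair for context lengths 0..max_order."""
    counts = Counter()
    for i in range(len(seq)):
        for k in range(max_order + 1):
            if i - k >= 0:
                counts[(seq[i - k:i], seq[i])] += 1
    return counts

def conditional(counts, context, base, alphabet="ACGT"):
    """Relative frequency of base after context (0.0 if the context is unseen)."""
    total = sum(counts[(context, b)] for b in alphabet)
    return counts[(context, base)] / total if total else 0.0

def interpolated(counts, context, base, lambdas):
    """Sum over k of lambdas[k] * P(base | last k bases of context)."""
    prob = 0.0
    for k, lam in enumerate(lambdas):
        prob += lam * conditional(counts, context[len(context) - k:], base)
    return prob

counts = context_counts("ACGTACGTTAGCACGT", max_order=2)  # toy training string
lambdas = (0.2, 0.3, 0.5)  # hypothetical fixed weights for orders 0, 1, 2
p = interpolated(counts, "CG", "T", lambdas)
```

In general linear interpolation the weights would instead be looked up per context, so that well-sampled long contexts get more weight than rare ones.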

GLIMMER

Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses (eukaryotic version of Glimmer: GlimmerHMM).
Glimmer (Gene Locator and Interpolated Markov ModelER) uses IMMs to identify the coding regions.
Glimmer version 3.02 is the current version of the system (http://www.cbcb.umd.edu/software/glimmer/).
Glimmer3 makes several algorithmic changes to reduce the number of false positive predictions and to improve the accuracy of start-site predictions.

IMM in GLIMMER

A linear combination of 8 different Markov chains, from 1st through 8th order, weighting each model according to its predictive power.
Glimmer uses 3-periodic nonhomogeneous Markov models in its IMMs.
The score of a sequence is the product of the interpolated probabilities of the bases in the sequence.

IMM training:
- A longer context is always better; the only reason not to use it is undersampling in the training data.
- If a sequence occurs frequently enough in the training data, use it, i.e., λ = 1.
- Otherwise, use frequency and χ² significance to set λ.

Clustering metagenomic sequences with IMMs

IMMs are used to classify metagenomic sequences based on patterns of DNA distinct to a clade (a species, genus, or higher-level phylogenetic group). During training, the IMM algorithm constructs probability distributions representing observed patterns of nucleotides that characterize each species.

Nat Methods 2009, 6(9):673-676