HMM : Viterbi algorithm - a toy example




HMM : Viterbi algorithm - a toy example

Let's consider the following simple HMM. This model is composed of 2 states, H (high C content) and L (low C content). We can for example consider that state H characterizes coding DNA while L characterizes non-coding DNA. The model can then be used to predict regions of coding DNA in a given sequence.

Model parameters (from the model diagram): initial probabilities p(H) = p(L) = 0.5; transition probabilities p(H→H) = p(H→L) = 0.5, p(L→L) = 0.6, p(L→H) = 0.4; emission probabilities in state H: A 0.2, C 0.3, G 0.3, T 0.2, and in state L: A 0.3, C 0.2, G 0.2, T 0.3.

Sources: for the theory, see Durbin et al. (1998); for the example, see Borodovsky & Ekisheva (2006), pp 80-81.

HMM : Viterbi algorithm - a toy example

Consider the sequence S = GGCACTGAA. There are several paths through the hidden states (H and L) that lead to the given sequence S. Example: P = LLH... The probability that the HMM produces sequence S through the path P is:

p = p(L) × p_L(G) × p_LL × p_L(G) × p_LH × p_H(C) × ... = 0.5 × 0.2 × 0.6 × 0.2 × 0.4 × 0.3 × ...
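The product above can be checked mechanically. Below is a minimal Python sketch of the joint probability of one state path and one emitted sequence, using the model's parameters; the function name is ours, and since the slide's full example path is elided, the path "LLHHHHLLL" is an assumed completion used only for illustration:

```python
# Toy HMM parameters (from the slides): H = high C content, L = low C content
init = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def path_probability(path, seq):
    """Joint probability that the HMM follows `path` and emits `seq`."""
    p = init[path[0]] * emit[path[0]][seq[0]]       # initial state + first emission
    for i in range(1, len(seq)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return p

# "LLHHHHLLL" is an assumed example path, not taken from the slide
print(path_probability("LLHHHHLLL", "GGCACTGAA"))
```

Each path contributes one such product; the Viterbi algorithm finds the path that maximizes it without enumerating all 2^9 possibilities.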

HMM : Viterbi algorithm - a toy example

GGCACTGAA: there are several paths through the hidden states (H and L) that lead to this sequence, but they do not have the same probability. The Viterbi algorithm is a dynamic programming algorithm that computes the most probable path. Its principle is similar to the DP programs used to align 2 sequences (e.g. Needleman-Wunsch).

Source: Borodovsky & Ekisheva, 2006

HMM : Viterbi algorithm - a toy example

Viterbi algorithm: principle. The probability of the most probable path ending at position x in state l with observation i is

p_l(i, x) = e_l(i) × max_k [ p_k(j, x−1) × p_kl ]

where e_l(i) is the probability to observe element i in state l, p_k(j, x−1) is the probability of the most probable path ending at position x−1 in state k with element j, and p_kl is the probability of the transition from state k to state l.

HMM : Viterbi algorithm - a toy example

In our example (sequence GGCACTGAA), the probability of the most probable path ending in state H with observation "A" at the 4th position is:

p_H(A, 4) = e_H(A) × max[ p_H(C, 3) × p_HH, p_L(C, 3) × p_LH ]

We can thus compute recursively (from the first to the last element of our sequence) the probability of the most probable path.

HMM : Viterbi algorithm - a toy example

Model parameters in log2 units: transitions: begin→H = −1, begin→L = −1; H→H = −1, H→L = −1; L→H = −1.322, L→L = −0.737. Emissions: state H: A = −2.322, C = −1.737, G = −1.737, T = −2.322; state L: A = −1.737, C = −2.322, G = −2.322, T = −1.737.

Remark: for the calculations, it is convenient to use the log of the probabilities (rather than the probabilities themselves). Indeed, this allows us to compute sums instead of products, which is more efficient and accurate. We used here log2(p).

HMM : Viterbi algorithm - a toy example

Sequence: GGCACTGAA.
Probability (in log2) that G at the first position was emitted by state H: p_H(G, 1) = −1 − 1.737 = −2.737
Probability (in log2) that G at the first position was emitted by state L: p_L(G, 1) = −1 − 2.322 = −3.322

HMM : Viterbi algorithm - a toy example

Probability (in log2) that G at the 2nd position was emitted by state H:
p_H(G, 2) = −1.737 + max(p_H(G, 1) + p_HH, p_L(G, 1) + p_LH) = −1.737 + max(−2.737 − 1, −3.322 − 1.322) = −5.474 (obtained from p_H(G, 1))
Probability (in log2) that G at the 2nd position was emitted by state L:
p_L(G, 2) = −2.322 + max(p_H(G, 1) + p_HL, p_L(G, 1) + p_LL) = −2.322 + max(−2.737 − 1, −3.322 − 0.737) = −6.059 (obtained from p_H(G, 1))

HMM : Viterbi algorithm - a toy example

       G      G      C       A       C    ...      A
H  −2.73  −5.47  −8.21  −11.53  −14.01    ...  −25.65
L  −3.32  −6.06  −8.79  −10.94  −14.01    ...  −24.49

We then compute iteratively the probabilities p_H(i, x) and p_L(i, x) that nucleotide i at position x was emitted by state H or L, respectively. The highest probability obtained for the nucleotide at the last position is the probability of the most probable path. This path can be retrieved by back-tracking.

HMM : Viterbi algorithm - a toy example

Back-tracking (= finding the path which corresponds to the highest probability, −24.49):

       G      G      C       A       C    ...      A
H  −2.73  −5.47  −8.21  −11.53  −14.01    ...  −25.65
L  −3.32  −6.06  −8.79  −10.94  −14.01    ...  −24.49

The most probable path is: HHHLLLLLL. Its probability is 2^−24.49 = 4.25E-8 (remember that we used log2(p)).
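The whole dynamic programming table and the back-tracking step can be reproduced with a short script. Here is a minimal Python sketch (function and variable names are ours, not from the slides) that works in log2 space, as the slides do:

```python
import math

# Toy HMM parameters (from the slides): H = high C content, L = low C content
init = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(seq):
    """Most probable state path and its log2-probability."""
    lg = math.log2
    v = {s: lg(init[s]) + lg(emit[s][seq[0]]) for s in init}    # position 1
    back = []                                                    # back-pointers
    for x in seq[1:]:
        ptr, nxt = {}, {}
        for l in init:                                           # state we move into
            best = max(init, key=lambda k: v[k] + lg(trans[k][l]))
            ptr[l] = best
            nxt[l] = lg(emit[l][x]) + v[best] + lg(trans[best][l])
        back.append(ptr)
        v = nxt
    last = max(v, key=v.get)                                     # best final state
    path = [last]
    for ptr in reversed(back):                                   # back-tracking
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]

path, logp = viterbi("GGCACTGAA")
print(path, round(logp, 2))   # HHHLLLLLL -24.49
```

Each column of the slide's table corresponds to one value of `v`; the back-pointers stored in `back` are what make the back-tracking step possible without recomputation.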

HMM : Forward algorithm - a toy example

Consider now the sequence S = GGCA. What is the probability P(S) that this sequence S was generated by the HMM? This probability P(S) is given by the sum of the probabilities p_i(S) of each possible path that produces this sequence. The probability P(S) can be computed by dynamic programming using either the so-called Forward or the Backward algorithm.

HMM : Forward algorithm - a toy example

Forward algorithm for the sequence S = GGCA (f_H and f_L denote the summed probabilities of all paths ending in state H or L, respectively):

Position 1 (G): f_H = 0.5 × 0.3 = 0.15; f_L = 0.5 × 0.2 = 0.1
Position 2 (G): f_H = 0.15 × 0.5 × 0.3 + 0.1 × 0.4 × 0.3 = 0.0345; f_L = 0.1 × 0.6 × 0.2 + 0.15 × 0.5 × 0.2 = 0.027
Position 3 (C): f_H = ... + ...; f_L = ... + ...
Position 4 (A): f_H = 0.0013767; f_L = 0.0024665

Σ = 0.0038432

=> The probability that the sequence S was generated by the HMM is thus P(S) = 0.0038432.
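The same recursion as Viterbi, with sums in place of maxima, yields P(S). A minimal Python sketch of the Forward algorithm for this toy model (names are ours):

```python
# Toy HMM parameters (from the slides): H = high C content, L = low C content
init = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward(seq):
    """P(S): total probability that the HMM emits seq, summed over all paths."""
    f = {s: init[s] * emit[s][seq[0]] for s in init}             # position 1
    for x in seq[1:]:
        # sum over the previous state instead of taking the max (Viterbi)
        f = {l: emit[l][x] * sum(f[k] * trans[k][l] for k in init)
             for l in init}
    return sum(f.values())

print(forward("GGCA"))   # ≈ 0.0038432, matching the slide's Σ
```

Note that for long sequences the probabilities underflow, which is why practical implementations work in log space or rescale each column.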

HMM : Forward algorithm - a toy example

The probability that sequence S = "GGCA" was generated by the HMM is P_HMM(S) = 0.0038432. To assess the significance of this value, we have to compare it to the probability that sequence S was generated by the background model (i.e. by chance). Ex: if all nucleotides have the same probability, p_bg = 0.25, the probability to observe S by chance is P_bg(S) = p_bg^4 = 0.25^4 = 0.0039. Thus, for this particular example, it is likely that the sequence S does not match the HMM model (P_bg > P_HMM).

NB: this toy model is very simple and does not reflect any biological motif. In fact both states H and L are characterized by probabilities close to the background probabilities, which makes the model unrealistic and unsuitable for detecting specific motifs.
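This comparison is commonly condensed into a log-odds score (positive when the sequence fits the HMM better than the background); the log-odds framing is our addition, not from the slide:

```python
import math

p_hmm = 0.0038432        # forward probability P_HMM(S) computed above
p_bg = 0.25 ** 4         # background model: each nucleotide with probability 0.25

# log-odds score: > 0 favours the HMM, < 0 favours the background model
score = math.log2(p_hmm / p_bg)
print(p_bg, round(score, 3))   # the score is slightly negative here
```

The slightly negative score restates the slide's conclusion: S is marginally better explained by chance than by this (unrealistic) toy model.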

HMM : Summary

The Viterbi algorithm is used to compute the most probable path (as well as its probability). It requires knowledge of the parameters of the HMM and a particular output sequence, and it finds the state sequence that is most likely to have generated that output sequence. It works by finding a maximum over all possible state sequences. In sequence analysis, this method can be used for example to predict coding vs non-coding sequences.

In fact there are often many state sequences that can produce the same particular output sequence, but with different probabilities. It is possible to calculate the probability for the HMM to generate that output sequence by doing the summation over all possible state sequences. This can also be done efficiently using the Forward algorithm (or the Backward algorithm), which is also a dynamic programming algorithm. In sequence analysis, this method can be used for example to predict the probability that a particular DNA region matches the HMM motif (i.e. was emitted by the HMM). An HMM motif can represent, for example, a transcription factor binding site.

HMM : Remarks

To create an HMM (i.e. to find the most likely set of state transition and output probabilities of each state), we need a set of (training) sequences, which do not need to be aligned. No tractable algorithm is known for solving this problem exactly, but a local maximum likelihood can be derived efficiently using the Baum-Welch algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is an example of a forward-backward algorithm, and is a special case of the Expectation-Maximization algorithm. For more details, see Durbin et al. (1998).

HMMER: the HMMER3 package contains a set of programs (developed by S. Eddy) to build HMM models (from a set of aligned sequences) and to use HMM models (to align sequences or to find sequences in databases). These programs are available at the Mobyle platform (http://mobyle.pasteur.fr/cgi-bin/mobyleportal/portal.py).