Hidden Markov Models




18.417 Introduction to Computational Molecular Biology
Lecture 7: November 4, 2004
Scribe: Han-Pang Chiu    Lecturer: Ross Lippert    Editor: Russ Cox

Hidden Markov Models

The CG island phenomenon

The nucleotide frequencies in the human genome are

    A: 29.5    C: 20.4    G: 20.5    T: 29.6

Frequencies of particular nucleotide pairs tend to be steady across the human genome. One exception to this is CG, which appears with the expected frequency in and around genes but hardly appears in junk DNA. This occurs because CG in junk DNA is typically methylated, and methylated CG tends to mutate into TG. CG islands are runs of DNA where CG has a much higher-than-normal frequency, indicating the likely presence of genetic data. Heuristics like sliding-window searches can be used to locate CG islands, but instead we would like to analyze the sequence to find the most likely CG islands. We can model the CG island problem as the Casino with Two Coins problem.

Casino with Two Coins

In the Casino with Two Coins problem (the book refers to it as the Fair Bet Casino problem), the casino has two coins, one fair and one biased. The input to the problem is a binary sequence, the results of the coin tosses. The output is the most likely tagging sequence indicating which coin was used for each toss. Given a particular tagging sequence S_1, S_2, ..., S_n and a corresponding data sequence X_1, X_2, ..., X_n, we define the probability of the data given the tagging model as

    Pr{data | tagging} = ∏_{i=1}^{n} Pr{X_i | S_i}.
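As a quick sketch, the product above can be evaluated directly. The Python snippet below is our own illustration (the function and variable names are not from the lecture); it uses the fair/biased coin emission probabilities that appear later in these notes:

```python
# Probability of the observed tosses given a fixed tagging:
#   Pr{data | tagging} = prod_i Pr{X_i | S_i}
# Emission probabilities for the two-coin casino (F = fair, B = biased).
emit = {
    ("F", "H"): 0.5,  ("F", "T"): 0.5,
    ("B", "H"): 0.75, ("B", "T"): 0.25,
}

def data_given_tagging(data, tagging):
    """Multiply the per-position emission probabilities."""
    p = 1.0
    for x, s in zip(data, tagging):
        p *= emit[(s, x)]
    return p

# Three heads, all attributed to the biased coin: (3/4)^3
print(data_given_tagging("HHH", "BBB"))  # 0.421875
```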

We will write Pr{X_i | S_i} as Pr{S_i → X_i}, read "S_i emits X_i." Not all tagging sequences are created equal: some might be more likely than others. For example, perhaps we know that the dealer in the casino changes coins infrequently. To model this, we also take as input a matrix specifying the probability of moving from state S to state S', written Pr{S → S'}. The probability of a tagging sequence is given by

    Pr{tagging | transition probabilities} = Pr{start in S_1} · ∏_{i=1}^{n-1} Pr{S_i → S_{i+1}}.

The individual probabilities used so far make up the hidden Markov model, or HMM, so named because it is a Markov model (a state machine) in which the states (the tagging) cannot be directly observed. For example, if the dealer has a 1/10 chance of changing coins at each step, and if the biased coin comes up heads 3/4 of the time, then the Markov model is given by:

    Pr{F → F} = Pr{B → B} = 9/10
    Pr{F → B} = Pr{B → F} = 1/10
    Pr{F → H} = Pr{F → T} = 1/2
    Pr{B → H} = 3/4
    Pr{B → T} = 1/4

We might depict such a model graphically as a two-state diagram, with transitions between F and B and emissions of H and T (the original figure is not reproduced here).
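Combining the transition and emission probabilities gives the joint probability of one particular tagging together with the data. A minimal Python sketch for the two-coin model (our illustration; the uniform start distribution is our assumption, since the notes leave the start probabilities implicit here):

```python
# Two-coin model: start, transition, and emission probabilities.
start = {"F": 0.5, "B": 0.5}  # uniform start: an assumption, not from the notes
trans = {("F", "F"): 0.9, ("F", "B"): 0.1,
         ("B", "B"): 0.9, ("B", "F"): 0.1}
emit  = {("F", "H"): 0.5,  ("F", "T"): 0.5,
         ("B", "H"): 0.75, ("B", "T"): 0.25}

def joint_probability(data, tagging):
    """Pr{tagging} * Pr{data | tagging} for one specific tagging."""
    p = start[tagging[0]] * emit[(tagging[0], data[0])]
    for i in range(1, len(data)):
        p *= trans[(tagging[i - 1], tagging[i])] * emit[(tagging[i], data[i])]
    return p

print(joint_probability("HH", "FF"))  # 0.5 * 0.5 * 0.9 * 0.5 = 0.1125
```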

Notation of Hidden Markov Models

Typically we introduce some shorthands for the probabilities that make up the HMM. Since they are complete probability distributions, the probabilities for each set of outcomes must sum to one:

    T = [t_ij] = [Pr{i → j}];                     ∑_j t_ij = 1
    E = [e_ix] = [Pr{i → x}];                     ∑_x e_ix = 1
    π = [P_i] = [Pr{model starts in state i}];    ∑_i P_i = 1

Sometimes π is omitted, and it is assumed to be the left eigenvector of T with eigenvalue 1. We know T has such an eigenvector because it has a trivial right eigenvector (the all-ones vector 1) with eigenvalue 1. It is also convenient to introduce the diagonal matrix D_x whose diagonal has e_ix (for some fixed x) as its entries.

Under the new notation, our first problem is to compute the likelihood of a particular observation X given HMM parameters T, E, and π. Combining the equations we gave earlier, we have

    Pr{X | T, E, π} = π D_{x_1} T D_{x_2} T ⋯ T D_{x_n} 1.

It is conventional to split this computation into two halves, f(k, i) and b(k, i). The forward half f(k, i) is the probability of emitting x_1, ..., x_k and ending up in state i after k moves, and the backward half b(k, i) is the probability of emitting x_{k+1}, ..., x_n given that the model is in state i after k moves. That is,

    [f(k, i)] = [Pr{x_1, ..., x_k, s_k = i | T, E, π}] = π D_{x_1} T ⋯ T D_{x_k}
    [b(k, i)] = [Pr{x_{k+1}, ..., x_n | s_k = i, T, E, π}] = T D_{x_{k+1}} T ⋯ T D_{x_n} 1

Combining these we get the original probability:

    Pr{X | T, E, π} = ∑_i f(k, i) b(k, i) = ∑_i Pr{paths going through i}.
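In code it is more common to compute f and b by recurrences than by explicit matrix products. A sketch for the two-coin model (our illustration; the uniform start distribution is an assumption):

```python
# Forward/backward for the two-coin HMM. f[k][i] is the probability of
# emitting x_1..x_k and being in state i after k steps; b[k][i] is the
# probability of emitting x_{k+1}..x_n given state i at step k.
states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}  # assumed uniform start distribution
trans = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "B"): 0.9, ("B", "F"): 0.1}
emit  = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}

def forward_backward(x):
    n = len(x)
    f = [{} for _ in range(n)]
    b = [{} for _ in range(n)]
    for i in states:                       # base cases
        f[0][i] = start[i] * emit[(i, x[0])]
        b[n - 1][i] = 1.0
    for k in range(1, n):                  # forward recurrence
        for j in states:
            f[k][j] = emit[(j, x[k])] * sum(f[k - 1][i] * trans[(i, j)]
                                            for i in states)
    for k in range(n - 2, -1, -1):         # backward recurrence
        for i in states:
            b[k][i] = sum(trans[(i, j)] * emit[(j, x[k + 1])] * b[k + 1][j]
                          for j in states)
    return f, b

f, b = forward_backward("HTH")
# Pr{X} = sum_i f(k,i) b(k,i), the same value for every position k
print([sum(f[k][i] * b[k][i] for i in states) for k in range(3)])
```

Multiplying the forward value into position k by the backward value out of it recovers the same total likelihood at every k, which is a useful sanity check on an implementation.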

This tells us that

    Pr{s_k = i | T, E, π, X} = f(k, i) b(k, i) / ∑_i f(k, i) b(k, i),

which suggests an obvious method for computing the most likely tagging: simply take the most likely state for each position. Unfortunately, this obvious method does not work. Really the most likely sequence is

    argmax_S Pr{S | T, E, π, X}.

To find the maximum sequence, we can use a dynamic programming algorithm due to Viterbi. The necessary state is an analog of f(k, i), except that instead of summing over all incoming arrows we select only the best one. The Viterbi dynamic programming algorithm can be viewed as a tropicalization of the original f and b functions. This is examined in detail in the next lecture.

Training an HMM

Much of the time we do not know the parameters of the HMM, so instead we train it on inputs, trying to derive parameters that maximize the likelihood of seeing the observed outputs. That is, given one or more sequences X, find T, E, π that maximize Pr{X | T, E, π}. Unfortunately, this problem is NP-hard, so we will have to content ourselves with heuristic approaches.

Expectation Maximization (EM)

One very effective heuristic approach for training an HMM is called expectation maximization, due to Baum and Welch. It is a local gradient-search heuristic that follows from techniques for gradient search on constrained polynomials. It may not find a global maximum, so typically it is run a handful of times, starting from different initial conditions each time. We will assume that π is the eigenvector of T discussed above, so we need only worry about E and T. We start with some initial E and T, usually chosen randomly. Then we compute new estimates Ê and T̂ using the following rules:
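A compact Viterbi sketch for the two-coin model (our illustration, with an assumed uniform start distribution): it mirrors the forward recurrence but replaces the sum over predecessors with a max, keeping back-pointers to recover the best tagging.

```python
# Viterbi for the two-coin HMM: like the forward recurrence, but with
# max in place of sum, plus back-pointers to recover the best tagging.
states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}  # assumed uniform start distribution
trans = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "B"): 0.9, ("B", "F"): 0.1}
emit  = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}

def viterbi(x):
    v = [{i: start[i] * emit[(i, x[0])] for i in states}]
    back = []
    for k in range(1, len(x)):
        v.append({})
        back.append({})
        for j in states:
            # best single predecessor instead of a sum over predecessors
            i_best = max(states, key=lambda i: v[k - 1][i] * trans[(i, j)])
            back[-1][j] = i_best
            v[k][j] = v[k - 1][i_best] * trans[(i_best, j)] * emit[(j, x[k])]
    # trace back from the best final state
    last = max(states, key=lambda i: v[-1][i])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return "".join(reversed(path))

print(viterbi("HHHHTT"))
```

A production implementation would work in log space to avoid underflow on long sequences; this sketch keeps raw probabilities for clarity.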

    new Pr{i → x} ∝ ∑_{k : x_k = x} Pr{path has s_k = i}
                  = ∑_{k : x_k = x} f(k, i) b(k, i)

    new Pr{i → j} ∝ ∑_k Pr{path has s_k = i, s_{k+1} = j, and j emits x_{k+1}}
                  = ∑_k f(k, i) Pr{i → j} Pr{j → x_{k+1}} b(k + 1, j)

This recomputation is iterated until Ê and T̂ converge. In general, expectation maximization sets the new estimate of Pr{a} in proportion to the expected number of times event a happens compared to other events happening. The ∝ stands in for a normalization term that makes the probabilities sum to 1 when necessary.
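One full re-estimation step of these rules can be sketched as follows (our illustration for the two-coin model; the initial parameters and uniform start distribution are assumptions, and a real implementation would iterate until convergence):

```python
# One Baum-Welch re-estimation step for the two-coin HMM.
states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}  # assumed uniform start distribution
trans = {("F", "F"): 0.9, ("F", "B"): 0.1, ("B", "B"): 0.9, ("B", "F"): 0.1}
emit  = {("F", "H"): 0.5, ("F", "T"): 0.5, ("B", "H"): 0.75, ("B", "T"): 0.25}

def forward_backward(x):
    n = len(x)
    f = [{} for _ in range(n)]
    b = [{} for _ in range(n)]
    for i in states:
        f[0][i] = start[i] * emit[(i, x[0])]
        b[n - 1][i] = 1.0
    for k in range(1, n):
        for j in states:
            f[k][j] = emit[(j, x[k])] * sum(f[k - 1][i] * trans[(i, j)]
                                            for i in states)
    for k in range(n - 2, -1, -1):
        for i in states:
            b[k][i] = sum(trans[(i, j)] * emit[(j, x[k + 1])] * b[k + 1][j]
                          for j in states)
    return f, b

def baum_welch_step(x):
    f, b = forward_backward(x)
    n = len(x)
    # Expected emission counts: sum f(k,i) b(k,i) over positions where x_k = sym.
    e_new = {}
    for i in states:
        counts = {sym: sum(f[k][i] * b[k][i] for k in range(n) if x[k] == sym)
                  for sym in "HT"}
        total = sum(counts.values())
        for sym in "HT":
            e_new[(i, sym)] = counts[sym] / total
    # Expected transition counts: f(k,i) Pr{i->j} Pr{j->x_{k+1}} b(k+1,j).
    t_new = {}
    for i in states:
        counts = {j: sum(f[k][i] * trans[(i, j)] * emit[(j, x[k + 1])] * b[k + 1][j]
                         for k in range(n - 1))
                  for j in states}
        total = sum(counts.values())
        for j in states:
            t_new[(i, j)] = counts[j] / total
    return t_new, e_new

t_new, e_new = baum_welch_step("HHHHTHHHTT")
```

Each expected count is normalized over the competing outcomes so that every row of T̂ and Ê sums to 1, playing the role of the ∝ in the update rules.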