Hidden Markov Models Chapter 15
Hidden Markov Models Chapter 15 Mausam (Slides based on Dan Klein, Luke Zettlemoyer, Alex Simma, Erik Sudderth, David Fernandez-Baca, Drena Dobbs, Serafim Batzoglou, William Cohen, Andrew McCallum, Dan Weld) 1

Temporal Models Graphical models with a temporal component S_t/X_t = set of unobservable variables at time t W_t/Y_t = set of evidence variables at time t Notation: X_a:b = X_a, X_a+1, ..., X_b 2

Target Tracking Radar-based tracking of multiple targets Visual tracking of articulated objects (L. Sigal et al., 2006) Estimate motion of targets in 3D world from indirect, potentially noisy measurements 3

Financial Forecasting http://www.steadfastinvestor.com/ Predict future market behavior from historical data, news reports, expert opinions, ... 4

Biological Sequence Analysis (E. Birney, 2001) Temporal models can be adapted to exploit more general forms of sequential structure, like those arising in DNA sequences 5

Speech Recognition Given an audio waveform, would like to robustly extract & recognize any spoken words Statistical models can be used to: provide greater robustness to noise, adapt to the accent of different speakers, learn from training data 6 (S. Roweis, 2004)

Markov Chain A Markov chain is defined by: a set of states, initial probabilities, and transition probabilities. A Markov chain models system dynamics 7

Markov Chains: Graphical Models [diagram: state-transition graph with edge probabilities 0.5, 0.3, 0.1, 0.0, 0.0, 0.4, 0.2, 0.9, 0.6] Difference from a Markov Decision Process? A Markov chain is a system that transitions by itself (no agent choosing actions) 8

Hidden Markov Model An HMM is defined by: a set of states, initial probabilities, transition probabilities, a set of potential observations o_1, o_2, o_3, o_4, o_5, ..., and emission/observation probabilities. The HMM generates an observation sequence 9

Hidden Markov Models (HMMs) Finite state machine view: a hidden state sequence generates the observation sequence o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8 Graphical model view: Hidden states ... X_t-2, X_t-1, X_t, ... where random variable X_t takes values from {s_1, s_2, s_3, s_4} Observations ... y_t-2, y_t-1, y_t, ... where random variable y_t takes values from {o_1, o_2, o_3, o_4, o_5, ...} 10

HMM Graphical Model Hidden states ... X_t-2, X_t-1, X_t, ... Random variable X_t takes values from {s_1, s_2, s_3, s_4} Observations ... y_t-2, y_t-1, y_t, ... Random variable y_t takes values from {o_1, o_2, o_3, o_4, o_5, ...} Need parameters: Start state probabilities: P(x_1 = s_k) Transition probabilities: P(x_t = s_i | x_t-1 = s_k) Observation probabilities: P(y_t = o_j | x_t = s_k) 12

Hidden Markov Models Just another graphical model: a transition distribution links successive hidden states, and an observation distribution links each hidden state to its observed variable. Conditioned on the present, the past & future are independent 13

Hidden states Given the hidden state sequence, earlier observations provide no additional information about the future 14

HMM Generative Process We can easily sample sequence pairs (s_0:n, w_0:n): Sample initial state s_0 from P(x_0) For i = 1 ... n: Sample s_i from the distribution P(s_i | s_i-1) Sample w_i from the distribution P(w_i | s_i) 15
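The sampling steps above can be sketched directly in Python. The two-state weather model at the bottom is a hypothetical stand-in for any concrete HMM (its states, observations, and probabilities are not from these slides):

```python
import random

def sample_hmm(start, trans, emit, n, rng=random):
    """Sample a length-n (state, observation) sequence pair from an HMM."""
    def draw(dist):
        r, acc = rng.random(), 0.0
        for item, p in dist.items():
            acc += p
            if r < acc:
                return item
        return item  # guard against floating-point rounding

    states, obs = [], []
    s = draw(start)                  # sample initial state from P(x_0)
    for _ in range(n):
        states.append(s)
        obs.append(draw(emit[s]))    # sample w_i from P(w_i | s_i)
        s = draw(trans[s])           # sample s_i+1 from P(s_i+1 | s_i)
    return states, obs

# hypothetical two-state weather model, purely for illustration
start = {"Rain": 0.5, "Sun": 0.5}
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3}, "Sun": {"Rain": 0.2, "Sun": 0.8}}
emit = {"Rain": {"umbrella": 0.9, "none": 0.1}, "Sun": {"umbrella": 0.1, "none": 0.9}}
states, obs = sample_hmm(start, trans, emit, 10, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps runs reproducible.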

Example: POS Tagging Useful as a pre-processing step DT NNP NN VBD VBN RP NN NNS The Georgia branch had taken on loan commitments DT NN IN NN VBD NNS VBD The average of interbank offered rates plummeted Setup: States S = {DT, NNP, NN, ...} are the POS tags Observations W = V are words Transition distribution P(s_i | s_i-1) models the tag sequences Observation distribution P(w_i | s_i) models words given their POS 16

Example: Chunking Find spans of text with certain properties For example: named entities with types (PER, ORG, or LOC) Germany's representative to the European Union's veterinary committee Werner Zwingman said on Wednesday consumers should... [Germany]_LOC 's representative to the [European Union]_ORG 's veterinary committee [Werner Zwingman]_PER said on Wednesday consumers should... 17

Example: Chunking [Germany]_LOC 's representative to the [European Union]_ORG 's veterinary committee [Werner Zwingman]_PER said on Wednesday consumers should... Germany/BL 's/NA representative/NA to/NA the/NA European/BO Union/CO 's/NA veterinary/NA committee/NA Werner/BP Zwingman/CP said/NA on/NA Wednesday/NA consumers/NA should/NA... HMM Model: States S = {NA, BL, CL, BO, CO, BP, CP} represent beginnings (BL, BO, BP) and continuations (CL, CO, CP) of chunks, and other (NA) Observations W = V are words Transition distribution P(s_i | s_i-1) models the tag sequences Observation distribution P(w_i | s_i) models words given their type 18

Example: The Occasionally Dishonest Casino A casino has two dice: Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10; P(6) = 1/2 Dealer switches between dice: P(Fair → Loaded) = 0.01 P(Loaded → Fair) = 0.2 Transitions between dice obey a Markov process Game: 1. You bet $1 2. You roll (always with a fair die) 3. Casino player rolls (maybe with fair die, maybe with loaded die) 4. Highest number wins $2 19

An HMM for the occasionally dishonest casino P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6 P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10 P(6|L) = 1/2 20
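As a sketch, the casino's dice can be written down as HMM parameters in Python; exact fractions avoid rounding. The uniform 0.5/0.5 start distribution is the one used in the worked example a few slides later:

```python
from fractions import Fraction as F

# The occasionally dishonest casino as HMM parameters (exact fractions).
START = {"F": F(1, 2), "L": F(1, 2)}
TRANS = {"F": {"F": F(99, 100), "L": F(1, 100)},   # P(Fair -> Loaded) = 0.01
         "L": {"F": F(1, 5),    "L": F(4, 5)}}     # P(Loaded -> Fair) = 0.2
EMIT = {"F": {d: F(1, 6) for d in range(1, 7)},
        "L": {**{d: F(1, 10) for d in range(1, 6)}, 6: F(1, 2)}}
```

Each row of the transition and emission tables sums to exactly 1, which is worth asserting whenever parameters are entered by hand.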

Question # 1 Evaluation GIVEN A sequence of rolls by the casino player 124552646214614613613666166466163661636616361 QUESTION How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs 21

Question # 2 Decoding GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163 QUESTION What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs 22

Question # 3 Learning GIVEN A sequence of rolls by the casino player 124552646214614613613666166466163661636616361651 QUESTION How loaded is the loaded die? How fair is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs 23

HMM Inference Evaluation: prob. of observing an obs. sequence Forward Algorithm (very similar to Viterbi) Decoding: most likely sequence of hidden states Viterbi algorithm Marginal distribution: prob. of a particular state Forward-Backward 24

Decoding Problem Given w = w_1 ... w_n and HMM θ, what is the best parse s_1 ... s_n? Several possible meanings of solution: 1. States which are individually most likely 2. Single best state sequence We want the sequence s_1 ... s_n such that P(s|w) is maximized: s* = argmax_s P(s|w) [trellis diagram: K states per column over observations o_1 o_2 o_3 ... o_T] 25

Most Likely Sequence Problem: find the most likely (Viterbi) sequence under the model Given model parameters, we can score any sequence pair: NNP VBZ NN NNS CD NN . Fed raises interest rates 0.5 percent . P(s_0:n, w_0:n) = P(NNP) P(Fed|NNP) P(VBZ|NNP) P(raises|VBZ) P(NN|VBZ) ... In principle, we're done: list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence) NNP VBZ NN NNS CD NN log P = -23 NNP NNS NN NNS CD NN log P = -29 NNP VBZ VB NNS CD NN log P = -27 2n multiplications per sequence, but |S|^n state sequences! 26

The occasionally dishonest casino Known: The structure of the model The transition probabilities Hidden: What the casino did FFFFFLLLLLLLFFFF... Observable: The series of die tosses 3415256664666153... What we must infer: When was a fair die used? When was a loaded one used? The answer is a sequence FFFFFFFLLLLLLFFF... 27

The occasionally dishonest casino w = w_1 w_2 w_3 = 6, 2, 6 s(1) = FFF: P(w, s(1)) = p(F|0) p(6|F) p(F|F) p(2|F) p(F|F) p(6|F) = 0.5 × (1/6) × 0.99 × (1/6) × 0.99 × (1/6) ≈ 0.00227 s(2) = LLL: P(w, s(2)) = p(L|0) p(6|L) p(L|L) p(2|L) p(L|L) p(6|L) = 0.5 × 0.5 × 0.8 × 0.1 × 0.8 × 0.5 = 0.008 s(3) = LFL: P(w, s(3)) = p(L|0) p(6|L) p(F|L) p(2|F) p(L|F) p(6|L) = 0.5 × 0.5 × 0.2 × (1/6) × 0.01 × 0.5 ≈ 0.0000417 28
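These hand computations can be checked by brute force: enumerate all 2^3 = 8 state sequences for the rolls 6, 2, 6 and multiply out the probabilities (a sketch in exact fractions, using the same 0.5/0.5 start distribution as the products above):

```python
from fractions import Fraction as F
from itertools import product

start = {"F": F(1, 2), "L": F(1, 2)}
trans = {"F": {"F": F(99, 100), "L": F(1, 100)},
         "L": {"F": F(1, 5), "L": F(4, 5)}}
emit = {"F": {d: F(1, 6) for d in range(1, 7)},
        "L": {**{d: F(1, 10) for d in range(1, 6)}, 6: F(1, 2)}}

def joint(states, rolls):
    """P(w, s): start * emission, then transition * emission for each later step."""
    p = start[states[0]] * emit[states[0]][rolls[0]]
    for prev, cur, w in zip(states, states[1:], rolls[1:]):
        p *= trans[prev][cur] * emit[cur][w]
    return p

rolls = (6, 2, 6)
scores = {seq: joint(seq, rolls) for seq in product("FL", repeat=3)}
```

The three values above come out as FFF ≈ 0.00227, LLL = 1/125 = 0.008, and LFL = 1/24000 ≈ 0.0000417, and LLL is the maximum over all eight sequences.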

Finding the Best Trajectory Too many trajectories (state sequences) to list Option 1: Beam Search [lattice: Fed:N / Fed:V / Fed:J, each branching to raises:N / raises:V, ...] A beam is a set of partial hypotheses Start with just the single empty trajectory At each derivation step: consider all continuations of previous hypotheses, discard most, keep top k Beam search works OK in practice, but sometimes you want the optimal answer, and there's usually a better option than naïve beams 29
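A minimal beam search for HMM decoding might look like the sketch below, run here on the casino model from the earlier slides with beam width k = 2:

```python
# Casino HMM from the earlier slides.
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
emit = {"F": {d: 1 / 6 for d in range(1, 7)},
        "L": {**{d: 0.1 for d in range(1, 6)}, 6: 0.5}}

def beam_search(obs, start, trans, emit, k=2):
    """Keep only the k best-scoring partial state sequences at each step."""
    beam = [((s,), start[s] * emit[s][obs[0]]) for s in start]
    beam = sorted(beam, key=lambda h: h[1], reverse=True)[:k]
    for w in obs[1:]:
        # consider all one-state continuations of the surviving hypotheses
        expanded = [(seq + (s,), p * trans[seq[-1]][s] * emit[s][w])
                    for seq, p in beam for s in trans]
        beam = sorted(expanded, key=lambda h: h[1], reverse=True)[:k]
    return beam[0]  # best surviving hypothesis (not guaranteed optimal)

best_seq, best_score = beam_search((6, 2, 6), start, trans, emit, k=2)
```

With only two states, a width-2 beam happens to be exhaustive here; with a large tagset the pruning can discard the true optimum, which is exactly the slide's caveat about naïve beams.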

The State Lattice / Trellis [trellis diagram: states ^, N, V, J, D, $ at each position, from START over the words Fed raises interest rates to END] 30

Dynamic Programming δ_i(s): probability of the most likely state sequence ending with state s, given observations w_1, ..., w_i 32

The Viterbi Algorithm 34

The Viterbi Algorithm [trellis: states 1 ... K over words w_1 w_2 ... w_i-1 w_i ... w_N] δ_i(s) = max_s' δ_i-1(s') × P_trans × P_obs Remember: δ_i(s) = probability of most likely state sequence ending with s at time i 35

Terminating Viterbi [trellis: states 1 ... K over w_1 w_2 ... w_N; compute δ for every state in the final column, then choose the max] 36

Terminating Viterbi How did we compute the best final value? max_s' δ_N-1(s') × P_trans × P_obs Now backchain to find the final sequence Time: O(|S|^2 N) Space: O(|S| N) Linear in length of sequence 37

Viterbi: Example w = 6, 2, 6 Recursion: δ_i(s) = p(w_i|s) × max_s' p(s|s') δ_i-1(s') F: δ_1(F) = (1/6)(1/2) = 1/12 δ_2(F) = (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 δ_3(F) = (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875 L: δ_1(L) = (1/2)(1/2) = 1/4 δ_2(L) = (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02 δ_3(L) = (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008 38
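The δ recursion and the backchaining step can be sketched in a few lines of Python; run on the casino model with rolls 6, 2, 6 it recomputes the δ table and recovers the most likely state sequence LLL:

```python
# Casino HMM from the earlier slides.
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
emit = {"F": {d: 1 / 6 for d in range(1, 7)},
        "L": {**{d: 0.1 for d in range(1, 6)}, 6: 0.5}}

def viterbi(obs, start, trans, emit):
    """Return the most likely state sequence and the table of delta values."""
    delta = [{s: start[s] * emit[s][obs[0]] for s in start}]
    back = []
    for w in obs[1:]:
        prev = delta[-1]
        d, b = {}, {}
        for s in start:
            # best predecessor state for s at this step
            best = max(prev, key=lambda sp: prev[sp] * trans[sp][s])
            d[s] = emit[s][w] * prev[best] * trans[best][s]
            b[s] = best
        delta.append(d)
        back.append(b)
    # backchain from the best final state
    s = max(delta[-1], key=delta[-1].get)
    path = [s]
    for b in reversed(back):
        s = b[s]
        path.append(s)
    return path[::-1], delta

path, delta = viterbi((6, 2, 6), start, trans, emit)
```

The final δ_3(L) = 0.008 matches the joint probability of LLL computed by hand on the earlier slide, as it should: Viterbi's best final score is exactly the probability of the best full sequence.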

Viterbi gets it right more often than not 39

Computing Marginals Problem: find the marginal distribution of each state Given model parameters, we can score any tag sequence: NNP VBZ NN NNS CD NN . Fed raises interest rates 0.5 percent . P(NNP) P(Fed|NNP) P(VBZ|NNP) P(raises|VBZ) P(NN|VBZ) ... In principle, we're done: list all possible tag sequences, score each one, sum up the values 40

The State Lattice / Trellis [trellis diagram: states ^, N, V, J, D, $ at each position, from START over the words Fed raises interest rates to END] 41

The Forward Backward Algorithm P(s_i, w_0:n) = P(s_i, w_0:i, w_i+1:n) = P(s_i, w_0:i) P(w_i+1:n | s_i, w_0:i) = P(s_i, w_0:i) P(w_i+1:n | s_i) 42

The Forward Backward Algorithm Sum over all paths, on both sides: 43

The Forward Backward Algorithm 44
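A compact forward-backward sketch for the casino model: the forward pass accumulates P(w_0..w_i, x_i = s), the backward pass P(w_i+1..w_n | x_i = s), and their product (normalized by the total observation probability z) gives the marginal of each state at each step. Note z itself is the answer to the evaluation problem:

```python
# Casino HMM from the earlier slides.
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
emit = {"F": {d: 1 / 6 for d in range(1, 7)},
        "L": {**{d: 0.1 for d in range(1, 6)}, 6: 0.5}}

def forward_backward(obs, start, trans, emit):
    """Posterior marginals P(x_i = s | w_0:n) and the total probability P(w_0:n)."""
    # forward: alpha[i][s] = P(w_0..w_i, x_i = s)
    alpha = [{s: start[s] * emit[s][obs[0]] for s in start}]
    for w in obs[1:]:
        a = {s: emit[s][w] * sum(alpha[-1][sp] * trans[sp][s] for sp in start)
             for s in start}
        alpha.append(a)
    # backward: beta[i][s] = P(w_i+1..w_n | x_i = s), built right to left
    beta = [{s: 1.0 for s in start}]
    for w in reversed(obs[1:]):
        b = {s: sum(trans[s][sn] * emit[sn][w] * beta[-1][sn] for sn in start)
             for s in start}
        beta.append(b)
    beta.reverse()
    z = sum(alpha[-1].values())  # P(w_0:n): the evaluation problem
    gammas = [{s: alpha[i][s] * beta[i][s] / z for s in start}
              for i in range(len(obs))]
    return gammas, z

gammas, z = forward_backward((6, 2, 6), start, trans, emit)
```

For the rolls 6, 2, 6 the marginals favor the loaded die at every position, and each position's marginals sum to 1, a useful sanity check on any implementation.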

HMM Learning Learning from data D Supervised: D = {(s_0:n, w_0:n)_i : i = 1 ... m} Unsupervised: D = {(w_0:n)_i : i = 1 ... m} we won't do this case! (hidden vars) EM, also called the Baum-Welch algorithm 45

Supervised Learning Given data D = {X_i : i = 1 ... m} where X_i = (s_0:n, w_0:n) is a state, observation sequence pair Define the parameters Θ to include: for every pair of states, P(s'|s); for every state, observation pair, P(w|s) Then the data likelihood is the product of the sequence probabilities, and the maximum likelihood solution is obtained from counts 46

Final ML Estimates (as in BNs) c(s, s') and c(s, w) are the empirical counts of transitions and observations in the data D The final, intuitive estimates: P(s'|s) = c(s, s') / Σ_s'' c(s, s'') P(w|s) = c(s, w) / Σ_w' c(s, w') Just as with BNs, the counts can be zero: use smoothing techniques! 47
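The count-based estimates can be sketched with the smoothing folded in as a simple add-λ (Laplace) scheme, one choice among the many smoothing techniques the slide alludes to. The tiny one-sequence corpus at the bottom is purely illustrative:

```python
from collections import Counter

def hmm_mle(data, states, vocab, smooth=1.0):
    """Add-`smooth` (Laplace) estimates of P(s'|s) and P(w|s) from tagged data.

    data: list of (state_sequence, word_sequence) pairs.
    """
    c_trans, c_emit = Counter(), Counter()
    for seq, words in data:
        c_trans.update(zip(seq, seq[1:]))   # c(s, s')
        c_emit.update(zip(seq, words))      # c(s, w)
    trans = {s: {s2: (c_trans[s, s2] + smooth) /
                     (sum(c_trans[s, x] for x in states) + smooth * len(states))
                 for s2 in states}
             for s in states}
    obs = {s: {w: (c_emit[s, w] + smooth) /
                  (sum(c_emit[s, x] for x in vocab) + smooth * len(vocab))
               for w in vocab}
           for s in states}
    return trans, obs

# tiny illustrative corpus (hypothetical)
trans, obs = hmm_mle([(("F", "F", "L"), (1, 2, 6))],
                     states=("F", "L"), vocab=(1, 2, 6))
```

With smooth = 0 this reduces to the raw relative-frequency estimates above; a positive smooth keeps every distribution strictly positive, so unseen transitions and emissions never get probability zero.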

The Problem with HMMs We want more than an atomic view of words We want many arbitrary, overlapping features of words: identity of the word; ends in -ly, -ed, -ing; is capitalized; appears in a name database/WordNet Use discriminative models instead of generative ones (e.g., Conditional Random Fields) [diagram: chain of states x_t-1, x_t, x_t+1 over words w_t-1, w_t, w_t+1] 48

Generative directed models: Naïve Bayes → (sequence) HMMs → (general graphs) generative directed models Conditional counterparts: Logistic Regression → (sequence) linear-chain CRFs → (general graphs) general CRFs 49

Temporal Models Full Bayesian Networks have dynamic versions too Dynamic Bayesian Networks (Chapter 15.5) HMM is a special case HMMs with continuous variables often useful for filtering (estimating current state) Kalman filters (Chapter 15.4) 50